A Machine Learning Framework for Personalized Lifestyle Recommendations
in Colorectal Cancer Prevention
Christos Androutsos1, Traianos Tsiok is1, Zheshen Jiang2, Nicolas Gillain2, Ioannis S. Papanikolaou3,
Eleni Koukoulioti3, Constantina Cloconi4, Antria Savva4, Sisse H. Njor5, Susanne F. Jørgensen5, Maja Ravnik6,
Sergej Če nčič6, María González O e 7,8, Raquel Alcaraz Ortega8, Vasilis Giannakopoulos9, Dimitrios Kypreos9,
Dimitrios Dimitroulopoulos9, George K. Matsopoulos10, Dimitrios I. Fotiadis1,11
1Unit of Medical Technology and Intelligent Information Systems, Dept. of Materials Science and Engineering,
University of Ioannina, Ioannina, Greece
2Department of Information System Management, Hospital Center of University of Liège, Liège, Belgium
3Hepatogastroenterology Unit, Second Department of Internal Medicine, National and Kapodistrian University,
Attikon University General Hospital, Athens, Greece
4Radiation Oncology Center, German Oncology Center
5Lillebaelt Hospital, Research Unit of Screening and Epidemiology, Department of Biochemistry and Immunology, Denmark
University of Southern Denmark, Department of Regional Health Research, Denmark.
6Department of Oncology, University Medical Center Maribor, Maribor, Slovenia
7Cancer Genetics Group, Unit of Excellence Institute of Biomedicine and Molecular Genetics,
University of Valladolid Spanish National Research Council (IBGM; UVa-CSIC), 47003 Valladolid, Spain
8Unidad de Investigación, Hospital Universitario de Burgos, Burgos, España
9Agios Savvas, Cancer Hospital, Athens, Greece
10Biomedical Engineering Laboratory, School of Electrical and Computer Engineering,
National Technological University of Athens, Athens, Greece
11Biomedical Research Institute, FORTH, Ioannina, Greece
Abstract—Colorectal cancer (CRC) is a largely preventable disease influenced by
modifiable behavioral risk factors such as diet, smoking, alcohol consumption,
physical inactivity, and chronic stress. This study proposes a machine
learning–based framework that generates personalized lifestyle recommendations
aimed at CRC prevention. The system consists of two components: the Behavioral
Recommendation Mapping Engine, which maps behavioral questionnaire responses to
expert-validated recommendations, and the Risk Assessment Module, which
classifies participants into specific recommendations using supervised learning
models. Eight domain-specific classifiers were developed, each targeting a key
behavioral risk factor. Random Forests consistently outperformed Decision Trees,
achieving high macro-averaged F1 scores even in imbalanced categories such as
smoking (F1 = 0.88) and stress (F1 = 0.80). The system also identifies the most
influential behavioral variable per domain to highlight actionable risk factors.
This framework will be integrated into the DIOPTRA mobile application to support
real-time, personalized prevention.
Funded by the European Union (DIOPTRA, 101096649). Views and opinions expressed
are however those of the author(s) only and do not necessarily reflect those of
the European Union or the Health and Digital Executive Agency. Neither the
European Union nor the granting authority can be held responsible for them. This
work has received funding from the Swiss State Secretariat for Education,
Research and Innovation (SERI). Funded by UK Research and Innovation (UKRI)
under the UK government’s Horizon Europe funding guarantee [grant number
10056682].
Index Terms—Colorectal cancer prevention, personalized recommendations,
machine learning, behavioral profiling, risk assessment
I. INTRODUCTION
Colorectal cancer (CRC) remains a leading cause of cancer-related morbidity and
mortality worldwide [1]. Although largely preventable through early screening
and behavioral changes, CRC incidence continues to rise, particularly among
individuals under the age of 50 [2]. Modifiable behavioral factors such as
smoking, physical activity, alcohol consumption, diet, and chronic stress
significantly contribute to CRC behavioral risk [3]. Addressing these factors
with personalized, evidence-based interventions is essential in prevention.
Recent advancements in Artificial Intelligence (AI) have enabled the development
of systems that analyze behavioral data to generate personalized health
recommendations. These systems offer scalable solutions to support preventive
care by tailoring recommendations to a person’s unique risk profile. In the
context of CRC, while machine learning (ML) has been applied extensively to
cancer risk prediction and clinical decision support, few studies have addressed
the generation of explicit, domain-specific behavioral recommendations. Most
existing approaches either focus on estimating disease risk or use simulations
to model potential outcomes of lifestyle changes, without directly classifying
or validating recommendation outputs. The present study aims to bridge this gap
by developing and evaluating a modular ML framework that generates personalized
CRC-related lifestyle recommendations based on behavioral profiles.
Herrera et al. [4] developed and internally validated an interpretable ML model
for CRC risk prediction based on easily obtainable lifestyle and clinical
factors. Utilizing data from 154,887 older adults who participated in the
Prostate, Lung, Colorectal, and Ovarian (PLCO) cancer screening trial (including
behavioral variables like smoking and weight, and medical history), a Light
Gradient Boosting Machine (LightGBM) classifier was developed to estimate an
individual’s probability of developing CRC. The final model incorporated 12
predictors; age, body weight, and smoking history emerged as the strongest risk
factors, while use of heart medication showed a slight protective effect. The
LightGBM model achieved a moderate discriminative performance with an area under
the receiver operating characteristic curve of approximately 0.73 in internal
validation. Beyond risk scoring, the model’s output was structured to stratify
individuals into average, increased, or high-risk categories, accompanied by
measures that highlighted modifiable lifestyle contributors to each individual’s
risk profile. This feature enables the model to support targeted
clinician–patient discussions and facilitates personalized behavioral
recommendations.
Dogan et al. [5] proposed a novel AI-driven framework for generating
personalized lifestyle recommendations aimed at cardiovascular disease (CVD)
prevention. The system was designed to optimize behavioral interventions such as
diet or exercise changes by simulating their projected impact on individual risk
profiles. Using a publicly available dataset of clinical and lifestyle risk
factors, the authors implemented a three-component pipeline. First, a supervised
classification model predicted an individual’s baseline CVD risk. Second, a
generative adversarial network (GAN) was trained to simulate how hypothetical
modifications to one or more risk factors, such as reducing sodium intake or
increasing physical activity, would affect the individual’s overall risk
profile. Third, a personalized utility function evaluated the trade-off between
the expected risk reduction and the perceived effort or cost associated with
each lifestyle change. This enabled the system to identify and recommend the
optimal intervention for each person. Validation results demonstrated that the
proposed system could successfully generate individualized lifestyle
modifications. Although no external labels or ground-truth recommendations were
available for direct comparison, the effectiveness of the generated lifestyle
changes was validated through simulation, demonstrating meaningful reductions in
predicted CVD risk across individuals.
This paper presents a framework for generating and validating personalized CRC
prevention recommendations through two interconnected components: a Behavioral
Recommendation Mapping Engine (BRME) and a Risk Assessment Module (RAM). The
BRME maps behavioral questionnaire responses to evidence-based recommendations
sourced from authoritative guidelines, which are utilized as annotations for the
ML models. The RAM then predicts these recommendations using classifiers trained
on the annotated dataset. The objective of this study is to compare the
performance of various classifiers across behavioral categories and to
demonstrate the feasibility of a modular, ML-driven system for delivering
CRC-specific lifestyle recommendations. The outcome of this work will inform the
integration of these modules into the DIOPTRA mobile application, contributing
to the broader goal of personalized CRC prevention.
II. MATERIALS AND METHODS
A. Dataset
The dataset utilized in this study comprises exclusively behavioral data
collected through a structured questionnaire developed within the framework of
the DIOPTRA prospective study. Participants completed the questionnaire during
their colonoscopy screening across multiple clinical sites. The questionnaire
was designed to capture a wide range of lifestyle factors associated with CRC
prevention, with a particular focus on modifiable behaviors. Each participant
was categorized into one of four predefined health status groups: Healthy,
Non-Advanced Adenomas (NAA), Advanced Adenomas (AA), and CRC patients. These
groupings were not employed as input features or target variables in the
development of the ML models. The emphasis of this study remains solely on
behavioral patterns and the generation of personalized lifestyle
recommendations.
The behavioral questionnaire covered multiple thematic categories, such as
smoking status, alcohol consumption, physical activity, diet, supplement usage,
stress levels, and sociodemographic background. It gathered information on
smoking habits and exposure to secondhand smoke, alcohol intake frequency and
volume, physical activity levels and sedentary behavior, as well as dietary
habits, including the consumption of fruits, vegetables, whole grains, processed
meats, sugary products, and fast food. Additionally, it recorded the frequency
of dietary supplement use (such as multivitamins, probiotics, omega-3, calcium,
and vitamin D) and included stress-related questions adapted from the Perceived
Stress Scale. Socioeconomic indicators, including income perception, education
level, and employment status, were also captured.
After excluding incomplete records, the final dataset comprised 756 fully
completed records, each containing 45 structured features. The annotations
corresponding to each participant’s profile are described in detail in Section
II.B.
B. Behavioral Recommendation Mapping Engine
The BRME is the intermediate layer between raw behavioral data and ML model
training. Its primary function is to map questionnaire responses into
structured, personalized lifestyle recommendations based on internationally
accepted CRC prevention guidelines. These recommendations are the annotation
labels that guide the supervised learning process of the RAM.
The mapping process was developed by integrating public health guidelines from
organizations such as the World Cancer Research Fund, the National Comprehensive
Cancer Network, the National Cancer Institute, and the American Institute for
Cancer Research. Each of these institutions provides evidence-based
recommendations related to diet, physical activity, alcohol consumption, smoking
cessation, stress management, and supplement use, categories strongly associated
with CRC behavioral risk modulation.
To implement the annotation process, a rule-based logic was developed that
translated specific patterns of responses from the behavioral questionnaire into
discrete, guideline-aligned recommendations. Each rule corresponded to a
specific combination of responses and was defined in close collaboration with
clinical experts to ensure medical validity. Recommendations were generated
independently for each behavioral category captured in the questionnaire. For
each participant, one recommendation per category was assigned, resulting in a
structured, multi-label annotation format. For example, in the physical activity
category, participants who reported exercising daily or several times a week
were mapped to a recommendation such as “Congratulations on staying physically
active! Your commitment to maintaining an active lifestyle is commendable.
However, to further enhance your health, consider reducing your sedentary time.
Physical Activity convincingly protects against CRC and balancing movement with
less sitting time can maximize its benefits.” This approach allowed each
behavioral category to be treated as an individual classification task.
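The rule-based mapping described above can be illustrated with a minimal sketch. It is not the study's rule set: the field name `exercise_frequency`, the answer options, and the label identifiers are hypothetical, and only the first rule mirrors the physical activity example given in the text. Each category holds an ordered list of (condition, label) pairs, and the first matching rule supplies that category's single label.

```python
from typing import Callable, Dict, List, Tuple

Responses = Dict[str, str]
Rule = Tuple[Callable[[Responses], bool], str]

# Hypothetical rule set for the physical activity category. Field names,
# answer options, and label identifiers are illustrative, not the study's.
PHYSICAL_ACTIVITY_RULES: List[Rule] = [
    (lambda r: r["exercise_frequency"] in {"daily", "several_times_a_week"},
     "PA_ACTIVE_REDUCE_SEDENTARY_TIME"),
    (lambda r: r["exercise_frequency"] in {"rarely", "never"},
     "PA_START_REGULAR_ACTIVITY"),
]

# One ordered rule list per behavioral category (only one shown here).
RULES_BY_CATEGORY: Dict[str, List[Rule]] = {
    "physical_activity": PHYSICAL_ACTIVITY_RULES,
}


def map_category(responses: Responses, rules: List[Rule]) -> str:
    """Return the label of the first rule whose condition matches."""
    for condition, label in rules:
        if condition(responses):
            return label
    raise ValueError("no rule covers this response pattern")


def annotate(responses: Responses) -> Dict[str, str]:
    """Assign exactly one recommendation label per behavioral category."""
    return {category: map_category(responses, rules)
            for category, rules in RULES_BY_CATEGORY.items()}


participant = {"exercise_frequency": "several_times_a_week"}
labels = annotate(participant)
print(labels)
```

Raising an error for uncovered response patterns, rather than falling back silently, is one possible design choice; it makes gaps in the rule set visible during annotation.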
C. Risk Assessment Module
The RAM constitutes the ML component of the system and is designed to predict
personalized behavioral recommendations based on participants’ questionnaire
responses. Its primary objective is to enable automated recommendation delivery
by learning the mapping between individual behavioral profiles and the
expert-validated annotations generated by the BRME. By modeling this
relationship, the RAM is the core of the AI-based decision support system, which
will be integrated into the DIOPTRA mobile application.
Unlike conventional classification systems that aim to detect disease or
estimate risk scores, the RAM is focused on recommendation generation. It
assigns a specific behavioral recommendation for each lifestyle category, based
on participants’ responses. To achieve this, the RAM comprises eight independent
classifiers, each dedicated to one category of the behavioral questionnaire.
Although a multi-label or multi-output classification strategy could
theoretically address all eight behavioral categories simultaneously, this
approach would result in extreme label sparsity and reduced performance due to
the high number of possible label combinations. To avoid this, the
classification task was decomposed into eight independent problems. This modular
design allows each model’s prediction to be directly attributed to
category-specific features.
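The label-sparsity argument can be made concrete with a short calculation. The sketch below assumes the per-category class counts as read from Table I and the 756 records reported in Section II.A; the dictionary keys are illustrative names. It compares the upper bound on distinct joint label combinations with the number of available records.

```python
from math import prod

# Number of recommendation classes per category, as read from Table I.
# The key names are illustrative.
classes_per_category = {
    "sugar_intake": 6,
    "fruits_vegetables": 9,
    "meat_consumption": 6,
    "physical_activity": 7,
    "smoking_status": 6,
    "alcohol_consumption": 9,
    "supplement_use": 6,
    "stress_level": 3,
}
n_records = 756  # fully completed records in the final dataset

# Upper bound on distinct label vectors if all categories were predicted jointly.
joint_combinations = prod(classes_per_category.values())

# Largest output space any single per-category classifier has to cover.
largest_single_task = max(classes_per_category.values())

print(joint_combinations, joint_combinations / n_records, largest_single_task)
```

Under these assumptions the joint label space has 2,204,496 possible combinations, roughly 2,916 per available record, whereas each independent classifier faces at most nine classes.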
Tree-based models such as Decision Trees (DTs) and Random Forests (RF) were
employed for classification due to their robustness and effectiveness on
structured questionnaire data. The model development process followed a
consistent pipeline across all categories. Categorical variables were encoded
utilizing either ordinal or one-hot encoding and continuous variables were
normalized when necessary. The dataset was partitioned utilizing a stratified
70%-30% train-test split to preserve class distribution. To evaluate performance
robustness, five-fold cross-validation was conducted. Hyperparameter tuning was
performed separately for each classifier. For tree-based models, optimization
included parameters such as maximum tree depth, number of estimators, and
minimum sample thresholds for splitting. Final model selection was based on a
combined assessment of classification accuracy and F1-score.
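The stratified 70%-30% partition mentioned above can be sketched with the standard library alone. This is an illustration of the technique under assumed names and toy labels, not the study's implementation: indices are grouped by class, shuffled with a fixed seed, and about 30% of each class is sent to the test set so that class proportions are preserved.

```python
import random
from collections import Counter, defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_split(labels: Sequence[Hashable], test_fraction: float = 0.3,
                     seed: int = 42) -> Tuple[List[int], List[int]]:
    """Split record indices so each class keeps roughly the same share
    in the train and test sets."""
    rng = random.Random(seed)
    indices_by_class = defaultdict(list)
    for index, label in enumerate(labels):
        indices_by_class[label].append(index)

    train_idx: List[int] = []
    test_idx: List[int] = []
    for indices in indices_by_class.values():
        rng.shuffle(indices)
        # Per-class test size; rounding can shift the total by a record or two.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy, imbalanced label vector with hypothetical class names.
labels = ["never_smoked"] * 70 + ["former_smoker"] * 20 + ["current_smoker"] * 10
train_idx, test_idx = stratified_split(labels)
print(Counter(labels[i] for i in test_idx))
```

In practice a library routine would normally be used for this step; the hand-written version only makes the per-class sampling explicit.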
Fig. 1. Overview of the proposed ML–based framework.
Beyond classification, the RAM also supports the identification of key
behavioral risk factors within each behavioral category. For every trained
model, feature importance scores were utilized to determine which variables had
the greatest influence on the predicted recommendation. The most influential
feature for each category was identified to highlight the primary modifiable
behavior contributing to the recommended action.
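Selecting the most influential feature per category reduces to an arg-max over each model's importance scores. The sketch below assumes the scores are already available as one dictionary per category; the feature names and values are hypothetical and do not come from the study.

```python
from typing import Dict, Tuple

# Hypothetical feature importance scores per category (illustrative values only).
importances_by_category: Dict[str, Dict[str, float]] = {
    "smoking_status": {
        "cigarettes_per_day": 0.52,
        "secondhand_smoke_exposure": 0.31,
        "years_since_quitting": 0.17,
    },
    "stress_level": {
        "felt_unable_to_cope": 0.44,
        "felt_nervous": 0.36,
        "felt_in_control": 0.20,
    },
}


def most_influential(importances: Dict[str, float]) -> Tuple[str, float]:
    """Return the (feature, score) pair with the highest importance."""
    return max(importances.items(), key=lambda item: item[1])


top_features = {category: most_influential(scores)[0]
                for category, scores in importances_by_category.items()}
print(top_features)
```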
III. RESULTS
The performance of the RF and DT classifiers was evaluated across eight
behavioral categories using five-fold cross-validation. RF was selected as the
primary model due to its overall strong predictive accuracy and its capacity to
handle multi-class classification tasks effectively. Although DT models also
demonstrated acceptable performance, RF generally outperformed DTs in
macro-averaged F1 and produced more stable predictions, particularly under
conditions of class imbalance. Class imbalance was expected in several
categories due to the limited dataset size and the inherent variability in
real-world behavioral patterns. To account for this, the macro-averaged F1 score
was reported alongside classification accuracy. Unlike accuracy, the macro F1
score assigns equal weight to each class and provides a more balanced measure,
which is particularly important when the accurate prediction of minority class
recommendations carries significant health implications. The testing accuracy
and macro-averaged F1 scores for both classifiers across each category are
presented in Table I, along with the number of classes per category. Within the
diet-related categories, RF performed best in sugar intake (Accuracy = 0.87,
F1 = 0.82), followed by meat consumption (Accuracy = 0.78, F1 = 0.66); both
categories exhibited relatively balanced label distributions. Performance in the
fruits and vegetables category was comparable in F1 (0.67) but lower in accuracy
(0.75), likely reflecting class imbalance.
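The difference between accuracy and macro-averaged F1 under class imbalance can be shown with a small worked example. The sketch below uses toy labels (not study data) and computes the per-class F1 as 2TP / (2TP + FP + FN), averaged with equal weight over all classes. A classifier that always predicts the majority class scores well on accuracy but poorly on macro F1.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of records whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 scores over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denominator = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes an F1 of 0.
        f1_scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy example: nine majority-class records, one minority-class record,
# and a model that always predicts the majority class.
y_true = ["majority"] * 9 + ["minority"]
y_pred = ["majority"] * 10

print(accuracy(y_true, y_pred), macro_f1(y_true, y_pred))
```

Here accuracy is 0.9, while macro F1 is (18/19 + 0) / 2, about 0.47, because the missed minority class counts as much as the majority class.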
TABLE I
COMPARISON OF RF AND DTS CLASSIFIERS BY BEHAVIORAL CATEGORY.
Category                  # Classes   RF (Acc. / F1)   DTs (Acc. / F1)
Diet
  Sugar Intake                6        0.87 / 0.82      0.87 / 0.84
  Fruits & Vegetables         9        0.75 / 0.67      0.73 / 0.66
  Meat Consumption            6        0.78 / 0.66      0.78 / 0.66
Lifestyle
  Physical Activity           7        0.69 / 0.74      0.68 / 0.72
  Smoking Status              6        0.97 / 0.88      0.97 / 0.77
  Alcohol Consumption         9        0.73 / 0.68      0.73 / 0.66
Dietary Supplements
  Supplements Use             6        0.83 / 0.64      0.86 / 0.63
Stress Management
  Stress Level                3        0.76 / 0.80      0.74 / 0.78
In the lifestyle-related categories, smoking status achieved particularly strong
results (F1 = 0.88), demonstrating the model’s ability to generalize well even
in imbalanced conditions. Physical activity and alcohol consumption achieved
moderate performance levels (F1 = 0.74 and 0.68, respectively), with variability
driven by differences in label balance. Similarly, RF performed well in the
dietary supplements (F1 = 0.64) and stress management categories (F1 = 0.80),
both of which were affected by skewed class distributions.
IV. DISCUSSION
The development of data-driven systems capable of delivering tailored health
recommendations based on individual behavior profiles represents an important
step toward personalized prevention in CRC. In this study, this challenge was
addressed by implementing a classification framework that predicts
domain-specific lifestyle recommendations utilizing structured behavioral data.
The combination of the BRME and the RAM enabled the transformation of
questionnaire responses into targeted, guideline-aligned advice. The RF
classifier achieved robust performance across eight behavioral categories,
maintaining predictive strength even in the presence of class imbalance.
Compared to existing literature, this work advances beyond traditional CRC risk
estimation models, such as those by Herrera et al. [4], which focus on
predicting disease probability. While such models are valuable for screening
prioritization, they typically do not provide concrete, individualized
behavioral recommendations. Similarly, simulation-based frameworks like Dogan et
al. [5] demonstrate promising methods for modeling behavioral change impact, but
do not evaluate recommendation outputs directly as predictive targets. In
contrast, the proposed RAM explicitly learns to predict domain-specific
recommendations, allowing for direct evaluation through classification metrics.
Despite these promising results, the study has several limitations. First, the
dataset used, while diverse in behavioral domains, remains limited both in terms
of class distribution and in number of participants. Several categories
exhibited class imbalance, which likely impacted F1 scores for underrepresented
recommendations. Although techniques such as stratified sampling and
macro-averaged F1 scoring were applied to address this, performance in rare
classes remains an area for improvement. Second, while recommendations were
derived from expert-reviewed guidelines, the current annotation logic does not
cover the full spectrum of possible behavioral variations, meaning that some
recommendation classes are underrepresented or absent. Future work will include
the collection of more prospective data to represent more behavioral patterns
and the inclusion of longitudinal data to assess behavioral change over time.
Finally, to scale the framework to larger and more imbalanced datasets,
imbalance handling methods will be explored.
V. CONCLUSIONS
This study presented an ML–based system for generating personalized lifestyle
recommendations to support CRC prevention. By combining expert-guided annotation
with domain-specific classifiers, the proposed framework offers a targeted
approach to behavioral risk assessment. The modules developed here will be
integrated into the DIOPTRA mobile application, enabling real-time, personalized
guidance.
REFERENCES
[1] J. Li, J. Pan, L. Wang, G. Ji, and Y. Dang, “Colorectal Cancer:
Pathogenesis and Targeted Therapy,” MedComm, vol. 6, pp. e70127,
March 2025.
[2] F. Karam, Y. E. Deghel, R. Iratni, A. H. Dakroub, and A. H. Eid, “The
Gut Microbiome and Colorectal Cancer: An Integrative Review of the
Underlying Mechanisms,” Cell Biochemistry and Biophysics, vol. 5, pp.
1–14, February 2025.
[3] V. V. Tsukanov, A. V. Vasyutin, and J. L. Tonkikh, “Risk factors,
prevention and screening of colorectal cancer: A rising problem,” World
Journal of Gastroenterology, vol. 31, pp. 98629, February 2025.
[4] D. J. Herrera, D. M. Seibert, K. Feyen, M. van Loo, G. van Hal,
and W. van de Veerdonk, “Development and Internal Validation of a
Machine Learning-Based Colorectal Cancer Risk Prediction Model,”
Gastrointestinal Disorders, vol. 7, pp. 26, January 2025.
[5] A. Dogan, Y. Li, C. P. Odo, K. Sonawane, Y. Lin, and C. Liu, “A
utility-based machine learning-driven personalized lifestyle recommendation
for cardiovascular disease prevention,” Journal of Biomedical Informatics,
vol. 141, pp. 104342, May 2023.