Research article
Inferential Z-Test Validation of Dual Structural Bias in Cancer Risk Assessment within Large COVID-19 Vaccine Cohorts
Marco Roccetti 1
1 Department of Computer Science and Engineering, University of Bologna, Bologna, Italy
* Correspondence: Email: [email protected]; Tel: +393920271318
Abstract: The increasing reliance on complex analytical models using large administrative health data necessitates rigorous, pre-analysis inferential validation of methodological integrity, a principle often neglected in high-impact epidemiological studies. A recent large-scale cohort study reported statistical associations suggesting increased cancer risk following COVID-19 vaccination [1], findings that a prior statistical descriptive analysis had linked to a severe structural asymmetry, suggesting a pronounced external validity discrepancy [2]. Building upon the results of [2], this study aims to formally and inferentially validate this dual structural bias, specifically, the non-representativeness of the cohort's age distribution relative to the national population (Root Cause), and the resulting non-compatibility of the cancer incidence rate in the control subgroup relative to the national gold standard (External Validity). We applied two inferential tests, both Z-tests. The first tested for demographic representativeness by comparing the sample proportion of individuals >= 65 years (12.2%) against the same national population proportion (18%). The second tested incidence compatibility by comparing the non-vaccinated >= 65 subgroup Crude Incidence Rate against the national Crude Incidence Rate. The Z-test of demographic representativeness yielded a test statistic Z-score of -260.39 (with a p-value < 10!"), confirming a profound structural sampling failure (-32.2%) already identified in [2]. The Z-test of incidence was equally conclusive, yielding a Z-score of -15.23 (with again a p-value < 10!"), formally validating an enormous cancer incidence deficit in that age bracket of the cohort. The combined inferential evidence confirms a dual structural bias caused by the systemic under-sampling of the high-risk elderly demographic. This non-equivalence means the baseline risk in the non-vaccinated group was artificially suppressed, causing mathematical inflation of the outcome and invalid results stemming from uncorrected baseline data. Our results mandate that the validity of conclusions drawn from any large-scale cohort study must be conditional upon the inferential confirmation of methodological integrity.
Keywords: Biomathematics; Computational Epidemiology; Inferential Statistics; Selection Bias; COVID-19 vaccination; Cancer Research
1. Introduction
The current landscape of public health research is critically dependent on massive administrative health datasets, fueling the widespread adoption of complex epidemiological models. While this reliance promises faster and broader discoveries, it harbors a fundamental risk: the potential for methodological flaws or inherent selection biases present in the source data to be amplified, rather than mitigated, by the complexity of mathematical modeling. A core principle in biostatistics dictates that the rigor and clinical relevance of any complex model must be subordinate to, and contingent upon, the integrity and external validity of the underlying statistical design. We assert that complex models cannot compensate for fundamentally poor data structure. In situations where comparison groups are rendered non-equivalent or the entire cohort is non-representative, mathematical models (regressions, for example), while performing mathematical control, can isolate and subsequently inflate the influence of this bias, leading to statistically significant conclusions that are clinically spurious.
A recent and highly influential large-scale retrospective cohort study, utilizing national South Korean administrative data, reported statistical associations suggesting an increased risk of cancer one year following COVID-19 vaccination [1]. Given the enormous clinical sensitivity and potential public health impact of such findings, the study's methodological foundation represents an exemplar case that needs to be subjected to the highest level of inferential scrutiny. To this end, a preliminary mathematical analysis identified two key structural asymmetries by simply comparing the cohort of [1] against established national epidemiological and demographic gold standards [2, 3-7].
These preliminary descriptive observations established a causal chain of non-equivalence that the present study seeks to validate through formal inferential testing. This causal chain is based on two fundamental flaws. The first flaw, which can be defined as the demographic non-representativeness (Root Cause), was descriptively identified because the entire cohort possessed a proportion of elderly individuals >= 65 years that was substantially lower than the national average. Since age is the most powerful predictor of cancer, this imbalance suggested a structural sampling failure within the cohort itself. The second flaw, named the risk non-equivalence (Consequence/External Validity Flaw), was descriptively observed as a 45.1% deficit in cancer incidence in the high-risk non-vaccinated control subgroup (>= 65 years) compared to the national gold standard [2]. This structural suppression of the baseline risk is the direct epidemiological result of the demographic flaw and the prerequisite for mathematically inflating the study's primary outcomes.
The present study's objective is instead to mathematically and inferentially validate these two descriptive findings using rigorous Z-tests. Our central hypothesis is dual, and causally linked: first, we hypothesize that the proportion of high-risk elderly individuals >= 65 years in the total cohort is statistically incompatible with the proportion of elderly individuals in the source national population (Root Cause confirmation); and second, we hypothesize that the deviation of the non-vaccinated group's cancer incidence from the national rate is statistically significant to a level that rules out random chance, providing inferential proof that the baseline risk is artificially suppressed (Consequence confirmation). Inferential confirmation of this dual structural bias will provide conclusive evidence that the reported results of [1] suggesting harm are methodological artifacts derived from fundamental, uncorrected non-equivalence at baseline.
The remainder of this paper is organized as follows. The Materials and Methods Section details the application of two single-proportion Z-tests used to inferentially validate/reject the aforementioned representativeness and incidence. The Results Section presents the obtained test statistics, followed by a Discussion Section that interprets the causal chain of the bias, its implications for the validity of the results, and limitations, culminating in a Conclusion Section that formally exposes the methodological artifact.
2. Materials and methods
2.1. Data Sources, Extracted Metrics and Descriptive Findings
Primary data, including Crude Incidence Rates (CRs), sample sizes (N), and age metrics, were extracted from the published paper by [1]. The National Gold Standard Benchmarks were sourced from official South Korean cancer statistics and demographic data [3-7]. Based on these national figures, the population aged >= 65 years constitutes 18% of the total population in South Korea, and the crude incidence rate (CR) of the >= 65 population is 155.2 per 10,000 individuals. The total number of participants in the final matched study cohort of [1] was N(Total) = 2,975,035, as reported in Table 1 below.
Table 1. Input Parameters and National Benchmarks for Inferential Z-Tests.

| Test | Parameter | Symbol | Value |
|---|---|---|---|
| Incidence (External Validity) | National Incidence (>= 65 years) | P(0) | 155.2 per 10,000 (0.01552) |
| Incidence (External Validity) | Observed Incidence (>= 65 years, Non-Vacc.) | p(inc) | 85.2 per 10,000 (0.00852) |
| Incidence (External Validity) | Non-Vaccinated Sample Size (>= 65) | N(inc) | 72,285 participants |
| Age Representativeness | National Proportion (>= 65) | P(age) | 18.0% (0.18) |
| Age Representativeness | Total Cohort Size | N(Total) | 2,975,035 participants |
| Age Representativeness | Total Cohort >= 65 Proportion | p(age) | 12.2% (0.122) |
2.2. Inferential Method I: Z-Test of the Age Representativeness
This test is used to formally validate/reject the descriptive observation that the cohort's age composition is non-representative. The test compares the observed proportion of individuals >= 65 years in the cohort, p(age), that is N(Cohort >= 65) / N(Total), against the established national proportion P(age).
The formal hypotheses can consequently be defined as follows:
Null Hypothesis (H0): The cohort's proportion of individuals aged >= 65 is not statistically lower than the national proportion: p(age) >= P(age);
Alternative Hypothesis (H1): The cohort's proportion of individuals aged >= 65 is statistically lower than the national proportion (one-tailed test): p(age) < P(age).
Based on well-known epidemiological formulas [8], the Z-score can be calculated as:
\[
Z(\text{Age-Prop}) = \frac{p(\text{age}) - P(\text{age})}{\sqrt{\dfrac{P(\text{age})\,\bigl(1 - P(\text{age})\bigr)}{N(\text{Total})}}} \tag{1}
\]
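As a minimal sketch of Equation (1), the following Python snippet evaluates the single-proportion Z-statistic from the age inputs listed in Table 1. The function name `one_sample_proportion_z` and its parameter names are illustrative assumptions introduced here, not part of the original analysis; only the standard library is used.

```python
import math


def one_sample_proportion_z(p_obs: float, p_ref: float, n: int) -> float:
    """Z-statistic for an observed proportion against a fixed reference.

    The standard error is computed from the reference proportion,
    as in Equation (1).
    """
    standard_error = math.sqrt(p_ref * (1.0 - p_ref) / n)
    return (p_obs - p_ref) / standard_error


# Inputs taken from Table 1 (age representativeness).
p_age = 0.122        # cohort proportion aged >= 65
P_age = 0.18         # national proportion aged >= 65
n_total = 2_975_035  # total cohort size

z_age = one_sample_proportion_z(p_age, P_age, n_total)
print(f"Z(Age-Prop) = {z_age:.2f}")
```

Because the standard error depends only on the reference proportion and the sample size, the same helper could be reused for any single-proportion comparison of this form.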
2.3. Inferential Method II: Z-Test of the Cancer Incidence
This test formally verifies the hypothesis of a baseline cancer risk non-equivalence. It is carried out to confirm or reject the hypothesized consequence of the demographic issue on cancer incidence. In essence, the test compares the observed cancer incidence rate in the non-vaccinated >= 65 subgroup, p(inc), against the national gold standard rate P(0). The formal hypotheses of this second test can be structured as follows:
Null Hypothesis (H0): The observed incidence rate in the cohort's >= 65 subgroup is not statistically lower than the national rate: p(inc) >= P(0);
Alternative Hypothesis (H1): The observed incidence rate in the cohort's >= 65 subgroup is statistically lower than the national rate: p(inc) < P(0).
Consequently, the Z-score can be formulated as follows:
\[
Z(\text{Incidence}) = \frac{p(\text{inc}) - P(0)}{\sqrt{\dfrac{P(0)\,\bigl(1 - P(0)\bigr)}{N(\text{inc})}}} \tag{2}
\]
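Equation (2) can be sketched in the same way, here extended with the corresponding one-tailed p-value. The snippet uses the incidence inputs from Table 1; the variable names are illustrative assumptions, and the p-value line is an addition for the reader's convenience rather than a step described in the original method.

```python
import math

# Inputs taken from Table 1 (incidence, non-vaccinated >= 65 subgroup).
p_inc = 85.2 / 10_000   # observed crude incidence rate
P_0 = 155.2 / 10_000    # national crude incidence rate
n_inc = 72_285          # non-vaccinated participants aged >= 65

# Equation (2): the standard error uses the national rate P(0).
standard_error = math.sqrt(P_0 * (1.0 - P_0) / n_inc)
z_inc = (p_inc - P_0) / standard_error

# One-tailed (lower) p-value Phi(z), written with erfc so that a very
# small tail probability is not rounded to zero.
p_value = 0.5 * math.erfc(-z_inc / math.sqrt(2.0))

print(f"Z(Incidence) = {z_inc:.2f}")
print(f"one-tailed p-value = {p_value:.3e}")
```

Writing the tail probability through `math.erfc` rather than `1 - erf(...)` avoids the cancellation that would otherwise return exactly zero at this magnitude.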
In closing this Section, we first note that both tests are conducted as one-tailed tests, with a significance level of 0.05%, to specifically verify the hypothesized deficits. Second, we note that the data presented here are either included directly or were extracted from the referenced documents. All calculations are easily reproducible based on the definitions provided. Further reasonable requests regarding data and calculations can also be addressed to the corresponding and sole author (email: [email protected]).
2.4. Ethics approval of research
This study uses publicly available, aggregated data that contains no private information. Therefore, ethical approval is not required.
3. Results
The combined results of the two inferential Z-tests introduced earlier provide overwhelming statistical evidence of a fatal dual structural bias, as explained in the following two Sub-Sections.
3.1. Inferential Validation of Demographic Non-Representativeness
The formal validation of the demographic deficit was obtained by comparing the observed proportion of >= 65 individuals in the cohort, p(age), against the national benchmark P(age). All calculations and results are provided in detail in Table 2 below.
Table 2. Calculation of Z-Score of Age Representativeness: Z(Age - Prop).

| Step | Description | Formula / Input Values | Calculated Value |
|---|---|---|---|
| I.A | Cohort Proportion p(age) | N(Cohort >= 65) / N(Total) = 361,425 / 2,975,035 | 0.122 |
| I.B | National Proportion P(age) | National Benchmark | 0.18 |
| II | Numerator Calculation | p(age) - P(age) | 0.122 - 0.18 = -0.058 |
| III | Standard Error Calculation | $\sqrt{P(age)\,(1 - P(age)) / N(Total)}$ | 0.000223 |
| IV | Z-Score Calculation | II / III | -0.058 / 0.000223 = -260.39 (using the unrounded standard error, 0.00022274) |
As seen from the final row of Table 2, the resulting test statistic is Z(Age - Prop) = -260.39. The resultant p-value of < 10!" compels the categorical rejection of the Null Hypothesis of demographic compatibility. This Z-score formally establishes that the entire study cohort is structurally non-representative of the source population, confirming the systemic relative deficit of -32.2% in the highest-risk demographic identified in the preliminary analysis of [2].
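For a Z-score this extreme, a direct double-precision evaluation of the normal tail underflows to zero, so the p-value is easier to handle on a logarithmic scale. The sketch below bounds it from above using the standard inequality Φ(-a) < φ(a)/a for a > 0, where φ is the standard normal density. The function name `log10_lower_tail_bound` is a hypothetical helper introduced here for illustration and is not part of the original analysis.

```python
import math


def log10_lower_tail_bound(z: float) -> float:
    """Upper bound on log10(Phi(z)) for a negative z.

    Uses Phi(-a) < phi(a) / a for a > 0, evaluated in log space
    so that extremely small probabilities do not underflow.
    """
    if z >= 0:
        raise ValueError("the bound is only valid for negative z")
    a = -z
    log_bound = -0.5 * a * a - math.log(a) - 0.5 * math.log(2.0 * math.pi)
    return log_bound / math.log(10.0)


z_age = -260.39  # test statistic reported in Table 2
print(f"log10(p) < {log10_lower_tail_bound(z_age):.0f}")
```

The bound is tight for large |z|, which makes it a convenient check that the reported p-value lies far below any conventional significance threshold.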
3.2. Inferential Validation of Cancer Incidence Flaw
The formal validation of the cancer incidence deficit (calculated to be as large as 45.1% in [2]) was obtained by comparing the observed rate in the non-vaccinated >= 65 subgroup, p(inc), against the national gold standard P(0), using Formula 2 above and yielding the results presented in Table 3 below.
Table 3. Calculation of Z-Score of Cancer Incidence: Inferential verification of baseline risk non-equivalence against the national standard.

| Step | Description | Formula / Input Values | Calculated Value |
|---|---|---|---|
| I.A | Observed Incidence p(inc) | 85.2 per 10,000 | 85.2 / 10,000 = 0.00852 |
| I.B | National Incidence P(0) | 155.2 per 10,000 | 155.2 / 10,000 = 0.01552 |
| II | Numerator Calculation | p(inc) - P(0) | 0.00852 - 0.01552 = -0.00700 |
| III | Standard Error Calculation | $\sqrt{P(0)\,(1 - P(0)) / N(inc)}$ | 0.000460 |
| IV | Z-Score Calculation | II / III | -15.23 |
The resulting test statistic is Z(Incidence) = -15.23. This Z-score is of extreme magnitude, corresponding to a p-value that is statistically negligible (< 10!"). The Null Hypothesis is conclusively rejected, formally establishing that the -45.1% deficit in cancer incidence, identified in the preliminary analysis [2], is a structural non-equivalence of risk.
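The relative deficits quoted from [2] can be recovered from the Table 1 inputs. The worked arithmetic below is a reader's check, assuming the relative deficit is defined as the observed value minus the reference value, divided by the reference value; it is not a calculation reproduced from [2].

```latex
\[
\frac{p(\text{inc}) - P(0)}{P(0)} = \frac{0.00852 - 0.01552}{0.01552} \approx -0.451,
\qquad
\frac{p(\text{age}) - P(\text{age})}{P(\text{age})} = \frac{0.122 - 0.18}{0.18} \approx -0.322 .
\]
```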
In closing this Section, it should be noted that our inferential results establish a statistically confirmed causal chain of bias. The profound demographic non-representativeness of the cohort (Z(Age - Prop) = -260.39), due to the systemic exclusion of the elderly population, creates a baseline population that is structurally too young. Since older age is the primary risk factor for cancer, this structural issue leads directly and inevitably to the artificially suppressed cancer incidence (Z(Incidence) = -15.23) in the reference non-vaccinated group, thereby creating the prerequisite mathematical condition for the invalid results of [1].
4. Discussion
The inferential verification of the demographic non-representativeness, demonstrated by the Z-score of -260.39, constitutes the most fundamental finding of this analysis. It formally confirms that the cohort studied in [1] is not a valid representation of the source Korean population, as it systemically excludes the high-risk elderly demographic (>= 65 years) to a degree that cannot be attributed to chance. This demographic issue is the root cause of the study's overall structural bias and alone renders the entire study non-generalizable.
The second inferential result, a Z-score of -15.23, confirms the inevitable consequence of the demographic issue. The magnitude of this Z-score demonstrates that the 45.1% shortfall in cancer incidence in the non-vaccinated group is an epidemiological artifact of the non-representative sampling. The link is causal and inferentially verified: the cohort's statistical non-representativeness created a study population with an artificially suppressed baseline risk. This suppressed risk serves as the non-equivalent denominator in the calculation of the Hazard Ratio (HR) of [1], following the relationship:
HR = Risk in Vaccinated Group / Risk in Non-Vaccinated Group (Artificially Suppressed).
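To make the arithmetic of this relationship concrete, consider a purely illustrative case under the simplified ratio above. Suppose the denominator risk sits 45.1% below a reference value R while the numerator is unchanged. The symbols R and R_vacc are hypothetical and introduced only for this sketch, and the resulting multiplier is not a figure reported in [1] or [2].

```latex
\[
\frac{R_{\text{vacc}}}{(1 - 0.451)\,R}
= \frac{1}{0.549}\cdot\frac{R_{\text{vacc}}}{R}
\approx 1.82\,\frac{R_{\text{vacc}}}{R}
\]
```

Under these assumptions, a ratio that would equal 1 against the reference denominator would appear as roughly 1.82 against the suppressed one.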
Consequently, the reported Hazard Ratios of [1], which suggest an increased risk of cancer (HR > 1), are methodological artifacts, that is, mathematical amplifications of the pre-existing sampling bias, not biological effects of the vaccination. The dual evidence of non-representativeness and suppressed incidence (Z-scores of -260.39 and -15.23) provides irrefutable proof that the mathematical multivariable models employed on this data in [1] (i.e., Cox models) were rendered impotent by the structural bias in the input data. The model's failure is two-fold: an uncorrectable baseline (the outcome variable was fundamentally erroneous at baseline) and a structural confounding (the scale of the demographic non-representativeness is beyond the corrective capacity of standard modeling).
The unique and primary limitation of our present analysis rests on the necessity to use external national gold standards rather than having access to the individual patient-level data. However, the magnitude of the Z-scores obtained for both the demographic and the incidence issues is so extreme that it robustly overrides the influence of any reasonable unobserved confounders or minor deviations from the national gold standard, confirming a structural defect.
5. Conclusions
Our present study provides decisive inferential statistical evidence that the cohort data analyzed is an exemplar case of sampling from data that suffer from a fatal dual structural bias: a non-representativeness of the overall cohort relative to the source population (Z-score of -260.39) that leads directly to a statistically incompatible cancer incidence rate in the control group (Z-score of -15.23). We conclude that the study's reported Hazard Ratios [1] are overwhelmingly likely to be methodological artifacts resulting from an uncorrected structural bias inherent in the cohort selection process. This research mandates that the validity of conclusions drawn from any large-scale cohort study must be conditional upon the inferential confirmation of methodological integrity against both demographic and epidemiological gold standards before any serious epidemiological conclusion can be drawn [9-11].
Use of AI tools declaration
The author declares that he has not used artificial intelligence (AI) tools in the creation of this article.
Acknowledgments
This research received no external funding. The Author is grateful to several colleagues from the University of Bologna who provided comments on an earlier version of this paper distributed as a preprint.
Author's Contribution
MR conceived and designed the study, carried out all data collection and analysis, interpreted the quantitative results, and was the sole author responsible for writing and revising the manuscript. The author affirms full responsibility for the integrity of the data and the accuracy of the data analysis presented.
Informed Consent Statement
Not applicable: Neither humans nor animals nor personal data are involved in this study.
Conflict of interest
The author declares there is no conflict of interest.
References
1. Kim HJ, Kim M-H, Choi MG, Chun EM. (2025) 1-year risks of cancers associated with COVID-19 vaccination: a large population-based cohort study in South Korea. Biomark Res. 13(114). DOI: 10.1186/s40364-025-00831-w
2. Roccetti M. (2025) A Biostatistical Reappraisal Unveiling the Mechanism Behind Apparent Cancer Risk Signals in a COVID-19 Vaccinated Cohort. Zenodo Preprint n. 17508347. DOI: 10.5281/zenodo.17508346
3. Kang MJ, Jung K-W, Bang SH, et al. (2023) Cancer Statistics in Korea: Incidence, Mortality, Survival, and Prevalence in 2020. Cancer Res Treat. 55(2):385-399. DOI: 10.4143/crt.2023.447
4. Park EH, Jung K-W, Park NJ, et al. (2024) Cancer Statistics in Korea: Incidence, Mortality, Survival, and Prevalence in 2021. Cancer Res Treat. 56(2):357-371. DOI: 10.4143/crt.2024.253
5. Park EH, Jung K-W, Park NJ, et al. (2025) Cancer Statistics in Korea: Incidence, Mortality, Survival, and Prevalence in 2022. Cancer Res Treat. 57(2):312-330. DOI: 10.4143/crt.2025.264
6. Statista. South Korea: Cancer crude incidence rate by age, 2022. Statista; 2024. [Accessed 2025 Nov 2]. Available from: https://www.statista.com/statistics/1440818/south-korea-cancer-crude-incidence-rate-by-age/
7. World Bank. Population ages 65 and above (% of total population) - Korea, Rep. World Population Prospects, United Nations (UN). [Accessed 2025 Nov 2]. Available from: https://data.worldbank.org/indicator/SP.POP.65UP.TO.ZS?locations=KR
8. Rothman KJ, Greenland S, Lash TL. (2008) Measures of Disease Occurrence. In: Modern Epidemiology. 3rd ed. Philadelphia: Lippincott Williams & Wilkins; 2008. ISBN: 9780781755641
9. Roccetti M, Cacciapuoti G. (2025) Beyond the Gold Standard: Linear Regression and Poisson GLM Yield Identical Mortality Trends and Deaths Counts for COVID-19 in Italy: 2021–2025. Computation 2025, 13(10), 233. DOI: 10.3390/computation13100233
10. Chemaitelly H, Ayoub H, Coyle P, et al. (2025) Assessing healthy vaccinee effect in COVID-19 vaccine effectiveness studies: a national cohort study in Qatar. eLife 2025; 14:e103690. DOI: 10.7554/eLife.103690
11. Fisher RA. (1922) On the mathematical foundations of theoretical statistics. Phil. Trans. R. Soc. A. 222:594-604. DOI: 10.1098/rsta.1922.0009