The benchmarking of clustering strategies in scRNA-seq and ADT data demonstrated that Leiden was the most robust, scalable, and
consistent method for identifying cellular subpopulations. This approach clearly outperformed Louvain, K-means, GMM/EM, and HDBSCAN in
accuracy and stability. In contrast, dimensionality reduction combinations such as UMAP+PCA or PCA+UMAP did not improve performance
and often degraded clustering quality, highlighting their inadequacy to capture stable biological structures. Therefore, our findings provide
actionable guidance for constructing scalable, accurate scRNA-seq analysis pipelines, emphasizing the importance of aligning algorithmic
choices with biological context and computational constraints in cancer research and beyond.
Motivation
Performance-Guided Evaluation of Clustering Strategies for
Single-Cell RNA Sequencing in Cancer Research within HPC
Environments
Welinton Barrera Mondaca1*, Joaquín Araya Bustos1, Renato Álvarez Ramos1, Claudia Cancino Quiroz1,
Raúl Caulier-Cisterna2, Ana Moya-Beltrán2
1Escuela de Informática, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile.
2Departamento de Informática y Computación, Facultad de Ingeniería, Universidad Tecnológica Metropolitana, Santiago, Chile.
Single-cell RNA sequencing (scRNA-seq) has transformed the study of cellular heterogeneity, enabling detailed characterizations of
biological systems. Yet, these data are high-dimensional, sparse, noisy, and rapidly growing in scale, creating persistent computational
challenges. Despite this, clustering is often applied through default pipelines, with limited scrutiny of how methodological choices shape
biological conclusions or computational feasibility. Such practices risk obscuring true biological signals or inflating technical artifacts,
particularly in High-Performance Computing (HPC) settings where efficiency is critical. This raises a main question: which combinations
of dimensionality reduction (PCA, UMAP, PCA+UMAP) and clustering strategies best capture biologically meaningful groups in
complex cancer scRNA-seq data while remaining computationally feasible at scale? Motivated by this challenge, our work
systematically examines these methodological interactions across both RNA and ADT modalities, where panel integration and
normalization add further layers of complexity, seeking to establish transparent, consistent, and reproducible practices that enhance
biological fidelity, computational scalability, and interpretability in large-scale single-cell studies.
The goal is to identify the best clustering algorithm for recovering immune sub-lineages in scRNA-seq by benchmarking five methods (Leiden,
Louvain, K-means, EM/GMM, and HDBSCAN) on the expression matrix of a large dataset (91 samples, using sub_lineage as
reference) and confirming its transferability on a multimodal ADT subset (40/91 samples). The core aim is to conclude which
algorithm performs best and provide a clear recommendation that other teams can plug into their pipelines, reducing
methodological ambiguity and speeding subsequent studies.
The curated single-cell NSCLC dataset is
the reference resource used in this study
(published in Cancer Cell, 2021; publicly
available via GEO: GSE15482).
The final dataset comprises
361,929 cells × 33,660 genes
across 91 samples.
In addition, a 40/91-sample subset
includes ADT (CITE-seq), which we
used for multimodal validation.
Leiden
Acknowledgments:
Leiden is a graph community-detection algorithm: build a k-NN
graph, optimize a quality function (e.g., modularity/CPM),
refine to ensure well-connected groups, then aggregate
communities; repeat until stable.
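As an illustration of the quality function mentioned above, the following is a minimal sketch of Newman modularity for a fixed partition of an unweighted, undirected graph. It is not the poster's implementation: the function name `modularity`, the `resolution` parameter, and the toy graph are assumptions made for this example, and CPM is not shown.

```python
from collections import defaultdict

def modularity(edges, community, resolution=1.0):
    """Newman modularity Q of a partition (illustrative helper, not from the poster).

    edges: list of (u, v) pairs, each undirected edge listed once.
    community: dict mapping node -> community label.
    """
    m = len(edges)
    internal = defaultdict(int)    # edges with both endpoints in community c
    degree_sum = defaultdict(int)  # summed degree of the nodes in community c
    for u, v in edges:
        degree_sum[community[u]] += 1
        degree_sum[community[v]] += 1
        if community[u] == community[v]:
            internal[community[u]] += 1
    return sum(
        internal[c] / m - resolution * (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Toy graph: two triangles joined by a single bridge edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: "a", 1: "a", 2: "a", 3: "b", 4: "b", 5: "b"}
q_split = modularity(edges, split)                         # one community per triangle
q_single = modularity(edges, {n: "a" for n in range(6)})   # everything in one community
```

Splitting the toy graph into its two triangles scores higher than lumping all nodes together, which is the kind of comparison a community-detection optimizer makes when it moves nodes.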
Louvain
Louvain maximizes modularity in two phases: it locally moves
nodes to improve it, then aggregates communities into super-
nodes; it rebuilds the graph and repeats until convergence.
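The aggregation phase can be sketched as collapsing each community into one super-node and summing edge weights. The function name `aggregate` and the dictionary layout are assumptions for illustration; the local-moving phase is not shown.

```python
from collections import defaultdict

def aggregate(edges, community):
    """Collapse communities into super-nodes (illustrative helper).

    edges: dict mapping (u, v) -> weight, each undirected edge listed once.
    community: dict mapping node -> community label.
    Returns a dict mapping (a, b) with a <= b -> summed weight. Entries of the
    form (a, a) are self-loops holding the weight internal to community a.
    """
    super_edges = defaultdict(float)
    for (u, v), w in edges.items():
        a, b = sorted((community[u], community[v]))
        super_edges[(a, b)] += w
    return dict(super_edges)

# Toy graph: two triangles joined by one bridge, one community per triangle.
toy_edges = {(0, 1): 1.0, (1, 2): 1.0, (0, 2): 1.0,
             (3, 4): 1.0, (4, 5): 1.0, (3, 5): 1.0, (2, 3): 1.0}
toy_community = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
super_graph = aggregate(toy_edges, toy_community)
```

The self-loop weights keep the internal edge mass of each community, so modularity computed on the aggregated graph stays consistent with the original one.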
K-means
HDBSCAN
GMM/EM
K-means is an iterative partitioning method (Lloyd) that
initializes K centroids with k-means++, alternates nearest-
centroid assignment and mean updates until inertia barely
improves; assumes compact clusters and benefits from
feature scaling.
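A compact pure-Python sketch of Lloyd's iterations with k-means++ seeding follows. The function names, the tolerance `tol`, and the toy points are assumptions for this example; it also assumes the input points are not all identical, since the seeding weights would then sum to zero.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: each further centroid is drawn with probability
    proportional to the squared distance to the nearest chosen centroid."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=d2, k=1)[0])
    return centroids

def lloyd(points, k, max_iter=100, tol=1e-9, seed=0):
    rng = random.Random(seed)
    centroids = kmeans_pp_init(points, k, rng)
    prev_inertia = float("inf")
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                  for p in points]
        inertia = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
        if prev_inertia - inertia <= tol:  # inertia barely improves: stop
            break
        prev_inertia = inertia
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids, inertia

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids, inertia = lloyd(points, k=2)
```

Because distances are unscaled Euclidean, a feature with a much larger range would dominate the assignment step, which is why feature scaling matters for this method.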
HDBSCAN is hierarchical, density-based, and noise-aware;
builds mutual-reachability distances and an MST, condenses
the cluster tree using min_cluster_size/min_samples, and
selects the most stable clusters by persistence while labeling
the rest as noise, handling variable densities and shapes.
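The first step, the mutual-reachability distance, can be sketched as below. This is an illustration only: the function names are hypothetical, and the core distance is taken here as the distance to the `min_samples`-th nearest neighbour excluding the point itself, a convention that varies between implementations. The MST and tree condensation are not shown.

```python
import math

def core_distances(points, min_samples):
    """Distance from each point to its min_samples-th nearest neighbour
    (the point itself is excluded; this convention is an assumption)."""
    cores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        cores.append(dists[min_samples - 1])
    return cores

def mutual_reachability(points, min_samples):
    """d_mreach(a, b) = max(core(a), core(b), d(a, b)); zero on the diagonal."""
    core = core_distances(points, min_samples)
    n = len(points)
    return [
        [0.0 if i == j else max(core[i], core[j], math.dist(points[i], points[j]))
         for j in range(n)]
        for i in range(n)
    ]

# Three close points and one isolated point on a line.
pts = [(0.0,), (1.0,), (2.0,), (10.0,)]
mr = mutual_reachability(pts, min_samples=2)
```

Points in sparse regions have large core distances, so they are pushed away from everything else in this metric, which is what later lets them be labeled as noise.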
GMM/EM fits a Gaussian mixture with soft memberships;
starts with μ, Σ, π, computes responsibilities in the E-step,
updates parameters in the M-step, and stops when the log-
likelihood converges; supports full or diagonal covariances to
model elliptical, unequal-size clusters.
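The E-step and M-step can be sketched for a one-dimensional mixture, where each Σ reduces to a scalar variance. This is a simplification for illustration, not the full- or diagonal-covariance model described above. The function names, the variance floor of 1e-6, the tolerance, and the toy data are assumptions.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(xs, mus, variances, weights):
    """One EM iteration for a 1-D Gaussian mixture.
    Returns updated parameters and the log-likelihood under the input ones."""
    k = len(mus)
    resp, log_lik = [], 0.0
    for x in xs:  # E-step: responsibilities of each component for x
        dens = [weights[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
        total = sum(dens)
        log_lik += math.log(total)
        resp.append([d / total for d in dens])
    # M-step: re-estimate means, variances and mixing weights.
    nk = [sum(r[j] for r in resp) for j in range(k)]
    new_mus = [sum(r[j] * x for r, x in zip(resp, xs)) / nk[j] for j in range(k)]
    new_vars = [
        max(sum(r[j] * (x - new_mus[j]) ** 2 for r, x in zip(resp, xs)) / nk[j], 1e-6)
        for j in range(k)
    ]
    new_weights = [n / len(xs) for n in nk]
    return new_mus, new_vars, new_weights, log_lik

def fit_gmm(xs, mus, variances, weights, tol=1e-8, max_iter=200):
    prev = -math.inf
    for _ in range(max_iter):
        mus, variances, weights, log_lik = em_step(xs, mus, variances, weights)
        if log_lik - prev < tol:  # log-likelihood has converged
            break
        prev = log_lik
    return mus, variances, weights

xs = [0.0, 0.2, -0.1, 0.1, 5.0, 5.2, 4.9, 5.1]
mus, variances, weights = fit_gmm(xs, [1.0, 4.0], [1.0, 1.0], [0.5, 0.5])
```

The responsibilities are the soft memberships: a hard cluster label, when needed, is the component with the largest responsibility for each point.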
Macro-F1 (Hungarian): Average F1 score across classes after
matching clusters and true labels with the Hungarian algorithm.
Hungarian Accuracy: Overall accuracy after aligning
predicted clusters with ground-truth labels using the Hungarian
algorithm.
AMI: Mutual information between clusters and true labels,
adjusted for chance.
ARI: Pairwise similarity between partitions, corrected for
chance.
V-measure: Harmonic mean of homogeneity and
completeness.
Homogeneity: Each cluster contains only members of a single
class.
Completeness: All members of a given class are assigned to
the same cluster.
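Several of the metrics above can be sketched in pure Python. Note the assumptions: the optimal cluster-to-class matching is found here by brute force over permutations, a stand-in for the Hungarian algorithm that is only practical for a handful of clusters (for realistic sizes an assignment solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice). All function names and the toy labels are illustrative, and AMI is omitted.

```python
import math
from collections import Counter
from itertools import permutations

def best_matching(true, pred):
    """One-to-one map cluster -> class maximising matched cells (brute force)."""
    classes, clusters = sorted(set(true)), sorted(set(pred))
    overlap = Counter(zip(pred, true))
    # Pad with None so surplus clusters can stay unmatched.
    slots = classes + [None] * max(0, len(clusters) - len(classes))
    best, best_map = -1, {}
    for perm in permutations(slots, len(clusters)):
        score = sum(overlap[(c, t)] for c, t in zip(clusters, perm))
        if score > best:
            best, best_map = score, dict(zip(clusters, perm))
    return best_map

def hungarian_accuracy(true, pred):
    mapping = best_matching(true, pred)
    return sum(mapping[p] == t for t, p in zip(true, pred)) / len(true)

def macro_f1(true, pred):
    mapping = best_matching(true, pred)
    relabeled = [mapping[p] for p in pred]
    scores = []
    for c in sorted(set(true)):
        tp = sum(t == c and r == c for t, r in zip(true, relabeled))
        fp = sum(t != c and r == c for t, r in zip(true, relabeled))
        fn = sum(t == c and r != c for t, r in zip(true, relabeled))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)  # unweighted mean: every class counts equally

def adjusted_rand_index(true, pred):
    pairs = lambda labels: sum(math.comb(c, 2) for c in Counter(labels).values())
    together = pairs(list(zip(true, pred)))
    rows, cols = pairs(true), pairs(pred)
    expected = rows * cols / math.comb(len(true), 2)
    max_index = (rows + cols) / 2
    return 1.0 if max_index == expected else (together - expected) / (max_index - expected)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def conditional_entropy(a, b):
    """H(a | b) from the joint label counts."""
    n, b_counts = len(a), Counter(b)
    return -sum(c / n * math.log(c / b_counts[y])
                for (_, y), c in Counter(zip(a, b)).items())

def v_measure(true, pred):
    h_true, h_pred = entropy(true), entropy(pred)
    hom = 1.0 if h_true == 0 else 1 - conditional_entropy(true, pred) / h_true
    com = 1.0 if h_pred == 0 else 1 - conditional_entropy(pred, true) / h_pred
    v = 0.0 if hom + com == 0 else 2 * hom * com / (hom + com)
    return hom, com, v

true = ["B", "B", "B", "T", "T", "T", "NK", "NK"]
pred = [1, 1, 0, 0, 0, 0, 2, 2]
acc = hungarian_accuracy(true, pred)
f1 = macro_f1(true, pred)
ari = adjusted_rand_index(true, pred)
hom, com, v = v_measure(true, pred)
```

Macro-F1 averages per-class scores without weighting by class size, so rare sub-lineages influence the result as much as abundant ones.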
Goals
Ground Truth
Results
Data
Clustering Strategies Methodology
Clustering Evaluation Metrics
Best-performing method: Leiden clustering
Protein-space clustering (ADT): how it lines up with RNA subtypes
Sub-lineage landscape (30 groups) in the scRNA-seq dataset
The radar charts (absolute 0–1 scores) show Leiden with the
most balanced, high-area profile, slightly ahead of Louvain,
indicating the best overall agreement with the sub_lineage
reference. K-Means and GMM/EM were competitive but
consistently lower, while HDBSCAN was the most variable and
strongly embedding-dependent (UMAP > PCA+UMAP > PCA),
with higher Completeness often driven by many small clusters.
Complementing accuracy, the runtime benchmark showed
K-Means as fastest (milliseconds to seconds), followed by Leiden (ten
seconds), Louvain (2.5 minutes), and HDBSCAN/GMM the
slowest. Overall, Leiden offers the best accuracy–runtime
trade-off across embeddings.
Leiden treats data as a k-NN graph and optimizes communities
on that graph, so it doesn't require fixing the number of clusters
and handles non-spherical, imbalanced groups well. Compared with Louvain, it
yields well-connected clusters and scales to hundreds of
thousands of cells. On our dataset it matched the biological
reference best, achieving Hungarian Accuracy = 55.2% (with
strong AMI/V) while keeping runtime in the tens of seconds, the
best overall accuracy–runtime trade-off.
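The k-NN graph that graph-based methods operate on can be sketched with a brute-force neighbour search. This is illustrative only: the function name `knn_graph` and the toy points are assumptions, and at the scale reported here an approximate nearest-neighbour search on a reduced embedding would be the realistic choice.

```python
import math

def knn_graph(points, k):
    """Undirected k-NN graph as a set of (i, j) edges with i < j.
    An edge is kept if either endpoint lists the other among its k nearest."""
    edges = set()
    for i, p in enumerate(points):
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        for j in neighbours[:k]:
            edges.add((min(i, j), max(i, j)))
    return edges

# Two groups on a line: {0, 1, 2} and {10, 11}.
cells = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
graph = knn_graph(cells, k=1)
```

With `k=1` the toy graph already falls apart into the two groups, with no cluster count specified anywhere; community detection then works on this connectivity rather than on distances to centroids.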
To provide an orthogonal validation of the RNA-
based benchmark, we ran Leiden on the
multimodal subset comprising 40 samples with
ADT (CITE-seq), standardized to the intersection
of 82 surface proteins. Preprocessing mirrored
the RNA workflow (normalization, dimensionality
reduction, kNN graph), and evaluation used the
curated sub_lineage labels from the repository.
The ADT analysis captures proteomic structure
consistent with known cell identities and shows
qualitative concordance with the RNA landscape, while
also revealing expected transcriptome-to-proteome
differences (e.g., post-transcriptional regulation and
membrane kinetics). Overall, these results confirm the
robustness of the Leiden-based pipeline and provide
independent, protein-level support for the subpopulation
separation.
The sub_lineage is the curated ground
truth (published in Cancer Cell, 2021;
publicly available via GEO: GSE15482),
used independently of our clustering to
avoid circularity and capture fine-grained
biology (~30 subtypes). Because the data
are strongly imbalanced, we emphasized
Macro-F1 (Hungarian) and also reported
ARI/AMI/V. For context, we included a
UMAP of the 361,929 cells, computed from PCA
and colored by sub_lineage; this 2D view is
provided only for visualization and
summarizes neighborhood structure.
Throughout the study, sub_lineage was
consistently used as the reference to
evaluate all clustering algorithms.
Leiden consistently outperformed Louvain, K-means,
GMM/EM, and HDBSCAN, achieving the best balance
among accuracy, runtime, and scalability. Its robustness
was confirmed on ADT data, where it reliably recovered
biologically relevant subpopulations. In contrast,
combining PCA with UMAP did not improve results and
often reduced resolution, underscoring the need for
careful pipeline design in large-scale scRNA-seq
clustering.
Discussion
Dot plot of marker genes (columns) across immune sub-lineages (rows).
Size = % cells expressing; color (viridis_r) = mean expression.
Columns are grouped by lineage; the dendrogram orders rows by similarity.
Bright blocks show expected patterns (B/plasma in B, mono/mac in MNP,
cytotoxic in T/NK).
Contact: wbarrera@utem.cl
amoya@utem.cl
[Figure: schematic of the input matrices, cells (Cell 1 … Cell n) × genes and cells (Cell 1 … Cell n) × ADT]
Conclusions
1.- Maier B, Leader AM, Chen ST. A conserved dendritic-cell
regulatory program limits antitumour immunity. Nature.
2020;580:257–262.
2.- Qi R, Ma A, Ma Q, Zou Q. Clustering and classification
methods for single-cell RNA-sequencing data. Brief Bioinform.
2020;21(4):1196–1208.
References
Departamento de Informática y Computación, UTEM; Escuela de Informática, UTEM; Laboratorio de Investigación
Aplicada, Departamento de Informática y Computación, UTEM. This work was supported in part by “Competition
for Research Assistant Funding UTEM”, year 2024, code AI23-11, and in part by the “Scientific and Technological
Equipment Projects Competition, year 2024, code LE24-03”.