
The Chemical Space Spanned by Manually Curated Datasets of Natural and Synthetic Compounds with Activities against SARS‐CoV‐2

Author: Betow, Jude Y.; Turón, Gemma; Metuge, Clovis S.; Akame, Siméon; Shu, Vanessa Asoh; Ebob, Oyere T.; Duran‐Frigola, Miquel; Ntie‐Kang, Fidele
Publisher: Zenodo
DOI: 10.5281/zenodo.14372499
Source: https://zenodo.org/records/14372499/files/Manuscript_ChemicalSpace_SARS_CoV2_clean.pdf
J. Y. Betow et al.
DOI: 10.1002/minf.200((full DOI will be filled in by the editorial staff))
The chemical space spanned by manually curated datasets of natural and synthetic compounds with activities against SARS-CoV-2
Jude Y. Betow,‡[a,b] Gemma Turon,‡[c] Clovis S. Metuge,[a,b] Simeon Akame,[a,d] Vanessa A. Shu,[a,b] Oyere T. Ebob,[e] Miquel Duran-Frigola,*[c] and Fidele Ntie-Kang*[a,b,f]
[a] J.Y. Betow, C.S. Metuge, V.A. Shu, S. Akame, F. Ntie-Kang
Centre for Drug Discovery, Faculty of Science, University of Buea, P. O. Box 63, Buea, Cameroon,
[b] J.Y. Betow, C.S. Metuge, V.A. Shu, F. Ntie-Kang
Department of Chemistry, Faculty of Science, University of Buea, P. O. Box 63, Buea, Cameroon
*e-mail: fidele.nt[email protected], phone: +237 673872475
[c] G. Turon, M. Duran-Frigola
Ersilia Open Source Initiative, Barcelona, Spain
*e-mail: miquel@ersilia.io,
[d] S. Akame
Department of Clinical Microbiology, Faculty of Health Sciences, University of Buea, Cameroon
[f] O. T. Ebob
Department of Chemistry and Forensics, School of Science and Technology (SST), Nottingham Trent University, Clifton Lane, Nottingham, NG11 8NS, UK
[g] F. Ntie-Kang
Institute of Pharmacy, Martin-Luther University Halle-Wittenberg, Kurt-Mothes Strasse 3, 06120 Halle (Saale), Germany
‡ These authors contributed equally
* Co-corresponding authors
Abstract: Diseases caused by viruses are challenging to contain, as their outbreak and spread could be very sudden, compounded by rapid mutations, making the development of drugs and vaccines a continued endeavour that requires fast discovery and preparedness. Targeting viral infections with small molecules remains one of the treatment options to reduce transmission and the disease burden. A lesson learned from the recent coronavirus disease (COVID-19) is to collect ready-to-screen small molecule libraries in preparation for the next viral outbreak, and potentially find a clinical candidate before it becomes a pandemic. Public availability of diverse compound libraries, well annotated in terms of chemical structures and scaffolds, modes of action, and bioactivities, is, therefore, crucial to ensure the participation of academic laboratories in these screening efforts, especially in resource-limited settings where synthesis, testing and computing capacity are scarce. Here, we demonstrate a low-resource approach to populate the chemical space of naturally occurring and synthetic small molecules that have shown in vitro and/or in vivo activities against the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) and its target proteins. We have manually curated two datasets of small molecules (naturally occurring and synthetically derived) by reading and collecting (hand-curating) the published literature. Information from the literature reveals that a majority of the reported SARS-CoV-2 compounds act by inhibiting the main protease, while 25% of the compounds currently have no known target. Scaffold analysis and principal component analysis revealed that the most common scaffolds in the datasets are quite distinct. We then expanded the initially manually curated dataset of over 1200 compounds via an ultra-large scale 2D and 3D similarity search, obtaining an expanded collection of over 150k purchasable compounds. The spanned chemical space significantly extends beyond that of a commercially available coronavirus library of more than 20k small molecules and constitutes a good starting collection for virtual screening campaigns given its manageable size and proximity to hand-curated compounds.
Keywords: chemical space, data curation, SARS-CoV-2.
Chemical space exploration of SARS-CoV-2 compounds
1 Introduction
Chemical space is a well-known concept in cheminformatics, often defined simply as the set of molecules in a company’s chemical inventory or vendor catalogue, and other times defined as the totality of molecules that can potentially be constructed using known reactions and building blocks within a certain range of properties.[1] When all possible molecules that abide by a given set of construction principles are characterised, their chemical space refers to the properties spanned by all these compounds.[1,2] With the rise of the popularity of ultra-large-scale chemical libraries, it becomes crucial to develop efficient ways to encircle small, manageable subsets of molecules with desired properties.[2] In their simplest form, chemical spaces are often limited by certain functional groups, chemotypes, or properties that are easy to calculate from a chemical structure alone.[1b] “Drug-like chemical space” is used in the context of drug discovery to reflect the vast number of molecules with physical properties similar to those of existing small-molecule therapeutics. These properties are often encapsulated in “rules of thumb” like Lipinski’s “rule of five”[2a,2e] and other well-known rules and metrics adhered to by most approved drugs, e.g. “Ghose rule”,[2f] “Veber’s rule”,[2g] “Egan’s rule”[2h] and the quantitative estimate of drug-likeness (QED) metric, among others.[2i,2j] Even this relatively straightforward set of rules unveils significant complexity. For example, it has been shown that all currently known drugs only occupy a very minute portion of the available and/or explorable “synthetically accessible chemical space”.[3] This implies that current molecular libraries cover only a small fraction of the total possible drug-like chemical space if one were to enumerate the compounds resulting from an exhaustive combination of feasible chemical reactions and rules.[4] There have been several attempts to estimate the size of the realistic drug-like chemical space,[3b] including compounds that are directly available to be purchased and screened in biological assays.[3e] A widely-cited estimate is that the number of possible Lipinski-compliant (i.e. with MW < 500 Da) molecules surpasses 10^60, which is far beyond the chemical space of bioactive compounds reported in literature-curated datasets like ChEMBL (2·10^6).[3b,4b]
Given the intractable size of the drug-like chemical space, it is necessary to further constrain it with target- or disease-specific properties that will make subsequent (virtual) screening campaigns feasible, especially when resources are limited. Different ways of evaluating the chemical space include using molecular assembly trees, scaffold hopping, similarity search techniques, pharmacophore matching, quantum-based machine learning, and chemography.[3d,5] To efficiently narrow down the drug-like space, datasets of compounds with desirable properties or bioactivities are frequently used as starting points. With the advent of artificial intelligence (AI), (deep) generative models are getting a lot of attention, since they have the potential to rapidly suggest new chemical matter in a property-constrained manner, often taking a known molecule as a seed. At its core, the approach is based on the idea that similar compounds tend to bind to similar targets, which is a guiding principle in chemoinformatics.[6] First, we get a set of starting molecules, and then we explore the surroundings of this set[7] to identify a much larger collection of molecules that are still within the space and could capture the relevant chemical features required for exhibiting the desired biological activity.
Today, a wide array of computational approaches for exploring chemical spaces exists, with significant improvements to ensure the synthetic feasibility of the compounds[8], leveraged by the incorporation of AI techniques to characterise and learn the plausible structures associated with target properties.[8b] The huge improvement of computational power available to researchers, including cloud computing[8b,9], has made it possible to generate large virtual collections of potentially interesting compounds, the challenge being now how to choose which of them to synthesise and test within a design-make-test cycle.[9f] For small laboratories and drug discovery centres operating under strong resource limitations, as is the case of many computational drug discovery groups, and more so in the context of Africa where our facilities are located, rapid synthesis of the compounds is a true limiting factor. In this scenario, a more practical approach is to limit the search to the purchasable chemical space, which nowadays is far beyond the billion-scale. This is particularly relevant in the search for antiviral drugs, since diseases caused by viruses spread very quickly and are quite challenging to contain whenever there is a viral outbreak.[10] In many resource-limited countries, vaccine accessibility and acceptability have remained challenging, implying that the quick discovery of small molecules that target viral infections remains one of the ways forward, with computational methods playing a crucial role in the pipeline.[11] Thus, it becomes important to develop diverse and focused compound libraries that could be readily screened to keep pace with possible expected viral outbreaks or mutations.
This work aims to explore the chemical space of potential antiviral agents, beginning with a manually curated dataset of synthetically derived and naturally occurring compounds with activities against known severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) targets or with cell growth inhibitory properties against the virus. More specifically, we have analysed the properties of small molecules that have shown in vitro or in vivo activities against SARS-CoV-2 and its target proteins, as reported in the literature. We have then explored the chemical space associated with the starting library using a rapid search within the purchasable space of the Enamine REAL[12] and ZINC[13] libraries. The expanded dataset was compared with the Coronavirus Library available from ChemDiv[14] and drug molecules from DrugBank,[15] to verify if the expanded dataset could be used as a reasonably sized, easy-to-test starting library for virtual screening efforts to target the disease, i.e. a library that could be further easily screened in silico via docking and molecular dynamics, followed by in vitro screening. With limited conventional computing capacity, the goal would be to look for a manageable dataset that does not require high performance computing, e.g. 100,000 to 500,000 compounds.
2 Materials and Methods
2.1 Data collection
The hand-curated (natural and synthetic) compound libraries were obtained as follows. The electronic databases employed for the assortment of relevant information include Scopus, NISCAIR, SciFinder, PubMed, Springer Link, Science Direct, Google Scholar, Web of Science, and an exhaustive library search for keywords and combinations of keywords related to “COVID-19”, “SARS-CoV-2”, “compounds”, “small molecules”, etc., and a combination of these terms as previously described.[16] Each individual term and a sum of them, e.g. “COVID-19 + compound”, “SARS-CoV-2 + compound”, “small molecule + COVID-19”, etc., were used in the search. This was carried out during the period from January to July 2024. The retrieved articles were checked and compounds showing activities against the virus and/or viral targets were selected. The authors then double-checked the published papers to confirm that bioactive compounds against SARS-CoV-2 were reported in the retrieved literature sources. The compounds were classified into natural products (NPs) and synthetic derivatives (SDs) according to the information available from the literature sources. The chemical structures were downloaded from the PubChem database, when available.[17] Compounds not available in PubChem were drawn using the ChemDraw Ultra software (version 19.1). Additionally, PubChem and ChemSpider databases were used to check the IUPAC names of the compounds, as previously described.[16] Figure 1A provides a workflow of the manual curation procedure. In summary, in checking through the compounds available in the literature found from the various search engines, if a compound had been tested in clinical trials or had been repurposed for the treatment of COVID-19, it was automatically retained. Compounds that had shown activity in viral assays with >50% growth inhibition or had shown activity in a target-based assay were also kept. The retrieved articles were checked and compounds showing activities against the virus or viral targets (e.g. Mpro, PLpro, Spike/ACE2, RdRp, etc., see Table 1) were selected. The selection criteria were based on the phenotypic and/or target assays (IC50, EC50) reported in the literature, with compounds with IC50 or EC50 < 50 μM retained, while those not falling in this cutoff and those not repurposed for COVID-19 treatment were discarded. The mode of action was either determined from the available experimental in vitro assay results against specific viral enzyme targets or through molecular simulations, e.g. by docking and molecular dynamics/binding affinity calculations. Additional information on the modes of action of the compounds was found by searching the COVID-19 HELP[18] and MedChemExpress[19] databases. ChemDiv’s Coronavirus Library (containing 21145 small molecule compounds) has been directly retrieved from the ChemDiv website,[14] while the DrugBank dataset used was version 5.1.10.[15] Both were downloaded in July 2024.
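The retention rules described above can be sketched as a small filter. This is only an illustrative reading of the stated criteria, not the authors' curation code: the record fields (`ic50_um`, `ec50_um`, `inhibition_pct`, `clinical_or_repurposed`) and the example compounds are hypothetical, and treating "activity in a target-based assay" as IC50 < 50 μM is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

POTENCY_CUTOFF_UM = 50.0      # IC50/EC50 cutoff stated in the text (μM)
INHIBITION_CUTOFF_PCT = 50.0  # growth-inhibition cutoff stated in the text (%)


@dataclass
class Record:
    """One literature entry; all field names are hypothetical."""
    name: str
    ic50_um: Optional[float] = None         # target-based assay, μM
    ec50_um: Optional[float] = None         # phenotypic assay, μM
    inhibition_pct: Optional[float] = None  # viral growth inhibition, %
    clinical_or_repurposed: bool = False


def is_retained(rec: Record) -> bool:
    """Return True if the record passes the retention rules."""
    if rec.clinical_or_repurposed:
        return True  # clinical-trial or repurposed compounds are kept automatically
    if rec.inhibition_pct is not None and rec.inhibition_pct > INHIBITION_CUTOFF_PCT:
        return True
    potencies = [v for v in (rec.ic50_um, rec.ec50_um) if v is not None]
    return any(v < POTENCY_CUTOFF_UM for v in potencies)


# Made-up records for illustration only.
records = [
    Record("compound_A", ic50_um=12.5),
    Record("compound_B", ec50_um=80.0),
    Record("compound_C", clinical_or_repurposed=True),
    Record("compound_D", inhibition_pct=65.0),
    Record("compound_E"),
]
retained = [r.name for r in records if is_retained(r)]
print(retained)
```

In this toy set, compound_B (EC50 above the cutoff) and compound_E (no assay data) are dropped.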
2.2 Principal components and scaffold diversity analysis of manually curated synthetic and natural product libraries
The molecular descriptors of the compounds in the two datasets were calculated using the Molecular Operating Environment (MOE) software (version 2016.08, 2016).[20] The computed descriptors included 40 well-known physicochemical parameters like molecular weight (MW), the logarithm of the n-octanol/water partition coefficient (log P), the number of Lipinski violations (Lip viol), number of atoms (#atom), synthetic accessibility (SA), the energies of the lowest unoccupied molecular orbitals (LUMO) and of the highest occupied molecular orbitals (HOMO), the number of rotatable bonds, the water solubility, the formal charge (Charge), Oprea lead-likeness score (Oprea Lead), the number of chiral centres (#chiral), the number of basic (#basic) and acidic atoms (#acid), the molar refractivity (mr), the total polar surface area (TPSA), the molecular volume (vol), the dipole moments, the polarizabilities, the number of H-bond donors and acceptors, etc. The dimensionality reduction of the computed descriptors was conducted by principal component analysis (PCA) using MOE.[20] Scaffold analysis was preceded by the Retrosynthetic Combinatorial Analysis Procedure (RECAP)[21] implemented in MOE.[20] This consists in fragmenting each molecule by breaking the bonds that are estimated to be those that can be formed when synthesising each molecule from its constituent building blocks by common synthetic reactions. Thus, a unique extended SMILES string and the fragment’s name, which retains the chemical context of the broken bond, were assigned to each resulting fragment, as described by Weininger.[22] This was applied to both the NPs and SDs datasets to determine the most frequent chemical scaffolds, and the statistics on the frequency of the individual fragments were generated, while retaining only scaffolds with at least 10 atoms.
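The frequency statistics described above can be illustrated with a short counting sketch. It assumes the fragmentation has already been done (the paper uses RECAP in MOE) and that each fragment arrives with an atom count. The fragment labels, the counts, and the choice to count each fragment once per molecule are hypothetical.

```python
from collections import Counter

MIN_ATOMS = 10  # minimum scaffold size stated in the text


def scaffold_frequencies(fragments_per_molecule, min_atoms=MIN_ATOMS):
    """Count in how many molecules each sufficiently large fragment occurs.

    fragments_per_molecule: one list per molecule of
    (fragment_label, atom_count) tuples.
    """
    counts = Counter()
    for fragments in fragments_per_molecule:
        # A set ensures a fragment is counted at most once per molecule.
        kept = {label for label, n_atoms in fragments if n_atoms >= min_atoms}
        counts.update(kept)
    return counts


# Placeholder fragment labels and atom counts, for illustration only.
toy_library = [
    [("scaffold_X", 12), ("scaffold_Y", 6)],
    [("scaffold_X", 12), ("scaffold_Z", 15)],
    [("scaffold_Z", 15), ("scaffold_Z", 15)],
]
counts = scaffold_frequencies(toy_library)
print(counts.most_common())
```

Here scaffold_Y is discarded because it has fewer than 10 atoms, and the repeated scaffold_Z in the third molecule is counted once.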
2.3 Chemical properties calculation and ADMET prediction
To visualize chemical spaces, t-SNE plots were generated using Uni-Mol[23] and WHALES[24] descriptors. Both were calculated using their implementation in the Ersilia Model Hub (https://ersilia.io/model-hub)[25], references eos39co and eos24ur, respectively. The natural product-like score[26] was calculated using the RDKit package via its implementation in the Ersilia Model Hub (reference eos8ioa). The synthetic accessibility has been calculated using the SYBA package[27] (reference eos7pw8). The SARS-CoV-2 predicted activities have been calculated using the Ersilia implementation of REDIAL-2020[28] (reference eos8fth), and ADMET properties have been calculated using the ADMET-AI package[29] (reference eos7d58). We used the openTSNE implementation (Python) with Euclidean distance, perplexity 30 and 500 iterations.
2.4 Exploration of the chemical space of the manually curated dataset by ultra-large library screening
We used the freely available CHEESE API (https://cheese.deepmedchem.com) to search against the ZINC15 and Enamine REAL databases. For each query compound, we used four similarity search modes, namely “2D fingerprint”, “3D shape”, “3D electrostatic” and “consensus”; 100 nearest-neighbours using Euclidean distance were retrieved for each search mode with the “high accuracy” option. All molecules were indexed with their InChIKeys and optionally flattened (i.e. stereochemistry removed) to obtain a de-duplicated list. The data aggregation pipeline resulting from this ultra-large-scale search is fully reproducible from the code repository specified in the Code availability section.
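The de-duplication step can be sketched as follows. This is a minimal illustration, not the pipeline in the authors' repository. It assumes that "flattening" can be approximated by comparing only the first 14-character block of the InChIKey, which hashes the connectivity. That shortcut also merges isotopic variants, so a structure-level stereo removal may be preferable in practice. The keys below are placeholders, not real compounds.

```python
def flatten_inchikey(inchikey: str) -> str:
    """Return the first (14-character, connectivity) block of an InChIKey."""
    return inchikey.split("-")[0]


def deduplicate(inchikeys, flatten=False):
    """Keep the first occurrence of each key, preserving input order.

    With flatten=True, keys sharing the same connectivity block are
    treated as duplicates (assumed proxy for removing stereochemistry).
    """
    seen = set()
    unique = []
    for key in inchikeys:
        ident = flatten_inchikey(key) if flatten else key
        if ident not in seen:
            seen.add(ident)
            unique.append(key)
    return unique


# Placeholder keys in the 14-10-1 InChIKey layout.
hits = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",  # same skeleton, different second block
    "DDDDDDDDDDDDDD-UHFFFAOYSA-N",
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",  # exact repeat
]
print(len(deduplicate(hits)), len(deduplicate(hits, flatten=True)))
```

The exact-key pass removes only the verbatim repeat, while the flattened pass also merges the two entries that share a skeleton.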
2.5 Compound prioritisation
To prioritise the compounds obtained from the ultra-large scale similarity search, we developed two criteria. On one hand, we summed the number of occurrences of each retrieved compound across the four search methods (namely Morgan, 3D shape, 3D electrostatics, and consensus), multiplied by the Tanimoto coefficient (Tc) with respect to the query compounds. On the other hand, we developed an ensemble of binary classifiers aimed at scoring the probability of a given compound belonging to the anti-SARS-CoV-2 chemical space. As a reference chemical space, we used our manually annotated compounds, and as “negative” (null) sets we used DrugBank compounds and three subsamples of the ChEMBL database (v33) with a maximum positive-negative imbalance of 1:10. We also trained a classifier using the ChemDiv Coronavirus Library as positive, and a 100k-scale diversity library from the same vendor as negatives. All classifiers were trained using Ersilia’s LazyQSAR[11b] framework based on Morgan counts fingerprints (radius 3, 2048 dimensions) and the autoML framework FLAML (random forests and LGBM)[30] with a time budget of 60 seconds. Based on five 80:20 stratified train-test splits, all classifiers satisfactorily performed within the range of 0.75-0.85 AUROC. Finally, since the similarity and the classifier ranks are two genuinely different ranking approaches, we merged them into a consensus rank using the rank averages.
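The two criteria and their merger can be sketched in a few lines. This is an illustrative interpretation rather than the authors' code. It assumes the similarity score is the occurrence count multiplied by a single Tc value per hit (for example the maximum Tc to any query), uses a set-based Tanimoto on fingerprint bit indices, and breaks rank ties by input order. All identifiers and numbers are made up.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient between two collections of 'on' bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_score(n_occurrences, tc):
    """Occurrences across the four search modes, weighted by Tc."""
    return n_occurrences * tc


def rank_descending(scores):
    """Map each id to its rank (1 = highest score); ties keep input order."""
    ordered = sorted(scores, key=lambda cid: scores[cid], reverse=True)
    return {cid: pos for pos, cid in enumerate(ordered, start=1)}


def consensus_rank(sim_scores, clf_scores):
    """Order ids by the average of their similarity and classifier ranks."""
    r_sim = rank_descending(sim_scores)
    r_clf = rank_descending(clf_scores)
    mean_rank = {cid: (r_sim[cid] + r_clf[cid]) / 2 for cid in sim_scores}
    return sorted(mean_rank, key=lambda cid: mean_rank[cid])


# Hypothetical hits: occurrence counts, Tc values and classifier scores.
occurrences = {"hit_1": 4, "hit_2": 1, "hit_3": 2}
max_tc = {"hit_1": 0.40, "hit_2": 0.90, "hit_3": 0.75}
clf_scores = {"hit_1": 0.30, "hit_2": 0.60, "hit_3": 0.95}

sim_scores = {c: similarity_score(occurrences[c], max_tc[c]) for c in occurrences}
consensus = consensus_rank(sim_scores, clf_scores)
print(consensus)
```

Averaging ranks rather than raw scores avoids having to put the similarity score and the classifier probability on a common scale.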
3 Results and Discussion
3.1 Literature review provides a comprehensive curated anti-SARS-CoV-2 library
The procedure for gathering literature evidence of the biological activities of the SARS-CoV-2 compounds has been summarised in Figure 1A. It was found that the synthetic compounds belong to quite diverse classes like indoles and peptidomimetics, well-known for their antiviral activities[31], as well as antimalarials like chloroquine and its analogues. The naturally occurring compound library was rich in terpenoids, flavonoids, and alkaloids, including the recently discovered hits like salvinorin A and deacetylgedunin which block SARS-CoV-2 viral cell entry by inhibiting the transmembrane protease, serine 2, an enzyme that in humans is encoded by the TMPRSS2 gene[32]. Our analysis rendered a final dataset of 618 unique NPs and 620 unique SDs (Figure 1B). After data collection, we sought to understand the characteristics of our dataset. As expected, NPs present a higher natural product-likeness score and, conversely, a lower synthetic accessibility when compared to SDs (Figure 1C and D). Interestingly, a retrosynthetic analysis of the NP library provided 421 scaffolds, revealing that oxygen-containing rings like sugars and polyphenol moieties are the most abundant chemical building blocks in their biosynthesis (Figure 1E). On the other hand, 793 chemical scaffolds resulted from the retrosynthetic analysis of the SD library, revealing a higher diversity in terms of ring types and constituent atoms, with many halogen-, O-, N- and S-containing chains and rings. A comparison of the top-ten most abundant scaffolds in each dataset that are not abundant (freq <3) among DrugBank scaffolds revealed that the NP fragments contain sugar moieties, polyphenolic rings, and non-oxygenated aliphatics. In contrast, the SD fragments contain heterocyclic rings, aromatic rings, and aliphatic systems, with multiple N-atoms, fewer O-atoms than in the NP fragments, and some S-atoms and halogens (supplementary Figure S1).
To visualise the chemical space of our dataset, we chose two different molecular descriptor techniques. On one hand, Uni-Mol[33] (a deep-learning embedding technique pre-trained on over 209 million molecular conformations) has chemical information in 3D space. On the other hand, WHALES descriptors are a small set of physicochemical parameters that capture both molecular 3D shapes and partial charges, making them suited for scaffold hopping exercises. Figure 1F shows how NPs and SDs cluster together much better when represented with WHALES descriptors, indicating that, despite having dissimilar 3D structures, they may retain similar charge patterns, an essential characteristic to bind to the pockets of their targets in SARS-CoV-2. To further inspect the chemical space of our manually-curated compounds, we used MOE descriptors to build a 2D PCA and analysed the top contributors for defining components 1 and 2 (Figure 1G), which highlighted the descriptors corresponding to Tudor Oprea’s test of lead-likeness (Oprea Lead)[34] and synthetic accessibility (SA),[35] along with the number of basic atoms and the number of H-bond donors, which are all empirical rules that generally characterise drugs and lead compounds. The cumulative variances recovered with the two principal components were 46.72% and 58.33%, respectively. The weights of the descriptors used in PCA analysis have been included in the updated Supplementary Data (Data S1). This means that, within our literature-curated collection, there is wide variability in terms of drug- and lead-likeness. According to the second principal component, the number of acidic and basic atoms, as well as the LUMO features (often associated with chemical reactivities) contribute to the diversity of the dataset.
Finally, we aimed to compare our hand-curated dataset with the chemical space of approved drugs available from DrugBank. To that end, we leveraged a recently published AI/ML model, ADMET-AI[36], which has been trained on reference datasets from the Therapeutics Data Commons[37]. In Figure 1H, we show results from the ADMET-AI predictions as percentiles with respect to approved drugs. Thus, a percentile of 50 means that a given value corresponds to the median value of those observed in the drug space, while extremely high (~100) and low (~0) percentiles indicate deviations from the properties observed in approved drugs. When comparing SD and NP compounds, there was no apparent clear distinction between the two datasets for MW, log P, solubility, inhibition of the cytochrome CYP2C9, NR-PPAR-γ, and SR-ARE. However, for the descriptors BBB, NR-AR-LBD, and skin toxicity, the NP dataset seems to have a higher proportion of compounds above the 50th percentile, whereas this was the contrary for computed descriptors related to drug absorption (e.g. intestinal absorption and bioavailability), distribution, e.g. ability to cross the blood-brain barrier (BBB), metabolism, e.g. the ability to interact with CYP3A4 enzymes, and toxicity, e.g. drug-induced liver injury (DILI), carcinogenesis, and inhibition of the human ether-a-go-go-related gene (hERG). It must be mentioned that the dysfunction of hERG often causes cardiac arrhythmia and sudden death, implying that compounds that block hERG channels are considered toxic. Collectively, and as expected, this indicates that natural product compounds tend to present more liabilities, which is why they are often considered as starting points that require further optimization from an ADMET perspective. Both NPs and SDs are skewed towards relatively high MW and low solubility with respect to approved drugs, and, as expected in compounds not yet progressed to the clinics, there is an enrichment of potential CYP liabilities and toxicity pathways, reinforcing the notion that this set of compounds should be used as a starting collection to identify a larger set of optimised compounds.
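The percentile representation used above can be made concrete with a short sketch. ADMET-AI reports such percentiles itself, and this is not its implementation. It is a generic percentile-rank calculation against a reference distribution, using a mid-rank convention for ties, which is an assumption. The reference values are invented.

```python
from bisect import bisect_left, bisect_right


def percentile_of(value, reference_sorted):
    """Percentile (0-100) of `value` within a sorted reference list.

    Uses the mid-rank convention: values tied with `value` count half.
    """
    n = len(reference_sorted)
    below = bisect_left(reference_sorted, value)         # strictly smaller
    at_or_below = bisect_right(reference_sorted, value)  # smaller or equal
    return 100.0 * (below + at_or_below) / (2 * n)


# Invented reference distribution standing in for a property of approved drugs.
approved_drug_values = sorted([100, 200, 300, 400, 500])

print(percentile_of(300, approved_drug_values))   # the reference median
print(percentile_of(1000, approved_drug_values))  # above every reference value
print(percentile_of(50, approved_drug_values))    # below every reference value
```

Under this convention the reference median maps to 50, and values outside the reference range map to 0 or 100, matching the interpretation given in the text.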
Insert Figure 1 here
3.2 Distribution of compounds by drug target based on literature information
In addition, we carefully annotated our curated collection with target information, when possible. A summary of the various targets identified in the literature from in vitro assays and putative targets predicted by molecular simulations is given in Table 1. It was observed that the main protease (Mpro) is the most represented target in the two datasets (36% and 48% of NPs and SDs, respectively). Besides, several compounds have more than one target, including dual protease inhibitors like those that inhibit both Mpro and the papain-like protease (PLpro), as well as those that inhibit both Mpro and the RNA-dependent-RNA polymerase (RdRp), and those that inhibit both Mpro and the viral spike in complex with the human angiotensin-converting enzyme 2 (spike/ACE2) and other protein targets. In both the NP and SD datasets, a small number of the compounds inhibit more than two targets and are classified as multi-target compounds, while a significant number have no known target. This last category corresponds to 25% of both NPs and SDs (supplementary Figure S2).
Insert Table 1 here
3.3 Ultra-large library screening around the anti-SARS-CoV-2 chemical space
Having defined and characterised the chemical space of manually-curated compounds, we carried out a systematic similarity search against two of the most widely used compound libraries for virtual screening, namely ZINC15[13] and Enamine REAL[12]. ZINC is a compendium of commercially available molecules, and Enamine REAL offers an enumerated billion-scale library of make-on-demand molecules based on a large collection of building blocks. Even the most basic chemoinformatics operations such as similarity search can become prohibitive at such scales, more so in resource-limited settings where computing capacity is low. Thus, we used the online server CHEESE which leverages an embedding-based method to index compounds and speed up the similarity search. The approach capitalises on recent advances in AI embedding techniques initially developed for image and text data, which require fast queries over extremely large databases. In particular, it uses “semantic similarity” search techniques over small molecule embedding vectors, returning the k-neighbors of the seed compound. An advantage of the CHEESE methodology is that it allows performing 3D-based searches, which can be advantageous when the query molecule is IP-protected or difficult to synthesise, as is the case of NP compounds.
We successfully carried out a search for 1231 compounds and obtained, in total, a set of 225,774 unique hits, of which 152,901 remained after flattening out stereochemistry information to remove redundancy. The results of the search correspond to four queries (namely, Morgan (2D) similarity, 3D-shape and 3D-electrostatics, and a consensus measure) against both ZINC15 and Enamine REAL. We retrieved 100 nearest neighbours per search request, obtaining a relatively balanced set of structurally similar compounds, with Tanimoto similarity (Tc>0.7) and more distant ones (Figure 2A). The rankings from the classifier and the similarity search were significantly different and, therefore, we argue that they can be combined in a blended measure that captures both magnitudes. Generally, amongst the top-100 list, ZINC compounds were more abundant than those from Enamine REAL, albeit with more redundancy when the stereochemistry was removed. Enamine REAL is a make-on-demand library based on a predefined set of building blocks and, by definition, it enumerates easily synthesizable compounds. Thus, as expected, natural products were generally less similar to Enamine REAL compounds than ZINC compounds from a 2D structure perspective (Figure 2B), and hits from 3D-electrostatics and 3D-shape searches tended to give more distal compounds, which may be helpful for scaffold hopping. Generally, the consensus CHEESE score captures structural similarity while providing a slightly better balance between Enamine REAL and ZINC hits than a mere Morgan fingerprints search (Figure 2B).
We then scored the list of compounds based on (a) their similarity to query compounds and (b) their probability of being associated with the SARS-CoV-2 chemical space. These are two simple and indicative measures that can be used to navigate the relatively large collection (>150k) when screening capacity is limited. To assign a score to the latter, we built an ensemble of binary classifiers capable of discriminating between compounds in our manually curated dataset and randomly sampled compounds in the medicinal chemistry space, as well as between compounds from ChemDiv’s Coronavirus Library and a diverse, agnostic collection from the same vendor. As expected, ZINC compounds were ranked higher in the similarity score (Mann-Whitney statistic 2·10^9, P-value ~ 0), while we could still find 6,066 compounds from the Enamine REAL database that were dissimilar (Tc < 0.5) to any compound in the query list but still ranked in the top 20% of the classifier list. In Figure 2C, a few examples are shown where, starting from a natural product compound with high natural product-likeness (>2), it was possible to find make-on-demand hits from Enamine REAL (some of them only retrievable via a 3D search in the CHEESE embedding space) that appear to have a high probability of being interpolated in the chemical space associated with SARS-CoV-2. All the scores are annotated in an easy-to-navigate table as specified in the Data Availability section. When we inspected the ADMET properties of the expanded collection (Figure 2E), we observed that, especially for Enamine REAL compounds, properties like MW, logP, solubility, BBB penetration and bioavailability were quite centred or well distributed with respect to approved drugs, and certainly much better than those of NPs (Figure 1H). While, generally, some ADMET liabilities remained (e.g. CYPs), in some cases such as the toxicity pathway NR-AR-LBD the profile was much improved with respect to hand-curated compounds.
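The selection of structurally novel but highly scored hits mentioned above (Tc < 0.5 to any query compound, yet in the top 20% of the classifier list) can be sketched as a simple filter. The dictionary keys (`id`, `max_tc`, `clf_score`) and the ten example hits are hypothetical. The paper applies this kind of selection to Enamine REAL hits, which would correspond to pre-filtering the input list by source.

```python
def novel_high_scoring(hits, tc_cutoff=0.5, top_fraction=0.2):
    """Return ids of hits that are dissimilar to all queries yet rank high.

    hits: list of dicts with hypothetical keys
      'id'        - compound identifier
      'max_tc'    - highest Tanimoto coefficient to any query compound
      'clf_score' - classifier probability (higher is better)
    """
    ordered = sorted(hits, key=lambda h: h["clf_score"], reverse=True)
    n_top = max(1, int(len(ordered) * top_fraction))
    top_ids = {h["id"] for h in ordered[:n_top]}
    return [h["id"] for h in hits if h["id"] in top_ids and h["max_tc"] < tc_cutoff]


# Ten made-up hits as (max_tc, clf_score) pairs.
pairs = [
    (0.30, 0.95), (0.80, 0.90), (0.45, 0.40), (0.20, 0.85), (0.60, 0.10),
    (0.90, 0.20), (0.35, 0.30), (0.55, 0.50), (0.75, 0.60), (0.40, 0.70),
]
hits = [
    {"id": f"cpd_{i}", "max_tc": tc, "clf_score": score}
    for i, (tc, score) in enumerate(pairs)
]
selected = novel_high_scoring(hits)
print(selected)
```

With ten hits, the top 20% comprises the two best-scored compounds (cpd_0 and cpd_1), and only cpd_0 also falls below the Tc cutoff.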
Interestingly, when we mapped the abovementioned ChemDiv Coronavirus Library along with DrugBank compounds and our set of >150k molecules, we observed that the ChemDiv set was focused on a relatively well-defined region (Figure 2E) with respect to our set of compounds and the DrugBank collection. This suggests that our expanded library can be a good starting point for screening purposes against SARS-CoV-2 generally. Since this set has been generated with a ligand-centred approach using a diverse set of mechanisms of action, and including both natural and synthetic compounds, the library is expected to have broad applicability within this field of research.
Insert Figure 2 here
3.4 Proposed libraries retain SARS-CoV-2 predicted activity
The goal of this study is not to provide a shortlist of anti-SARS-CoV-2 molecules with strong confidence. Rather, we wanted to offer a virtual screening library that can be used as a go-to option in this disease area, ensuring that compounds are purchasable and inspired by compounds with reported evidence in the literature. As an exploratory assessment of the potential of our collection across a broad range of virtual screening tasks related to COVID-19, we chose to use REDIAL-2020, a compendium of open-source machine learning (ML) models containing QSAR predictors of in vitro endpoints of viral load reduction. In Figure 3 we can see that, compared to DrugBank compounds (mimicking an unbiased drug repurposing exercise), both our manual collection and the ChemDiv Coronavirus Library tend to perform better in several tasks, most notably in the AlphaLISA screen testing the spike/ACE2 interaction. Differences were observed between NPs and SDs, with NPs having, for example, higher scores in the 3CL and AlphaLISA predictions, and lower in the TruHit counter screen. Our expanded library was also enriched in high AlphaLISA scores, although, as in the case of the ChemDiv Coronavirus Library and the SD hand-curated compounds, it would be advisable to control for TruHit counter screen hits. Other enriched predictions are ACE2 blocking and pseudotyped particle entry (PPE) both for SARS-CoV and MERS, suggesting a broad applicability of our collection. We did not obtain particularly high scores in the 3CL predictions, which means that the library is probably not particularly enriched in this class of compounds. However, note that at a classification score above 0.6 (approximately the median of the manually annotated molecules), we still have 26,440 candidates for this activity.
Inse Figu e 3 he e
4 Conclusions
In an attempt to understand the chemical space of potential lead compounds for drug discovery against
COVID-19, we have characterised the chemical space of naturally occurring and synthetically derived
small molecules that inhibit the growth of the SARS-CoV-2 virus. We have compared the two datasets of
compounds hand-curated from the literature by descriptor calculation, principal component analysis and
scaffold analysis. It was observed that most of the compounds act by inhibiting the main protease, while
several compounds could also be dual and multiple inhibitors. We then derived an expanded chemical
space of over 150k purchasable compounds with either 2D or 3D relatedness to the manually-curated
collection. It is planned that, in follow-up studies, these compounds will be virtually screened through
pharmacophore modelling and protein-ligand interactions with the view of identifying a small subset of
ligands that could putatively bind to the targets reported in the literature. These will then be screened in
vitro to identify novel antivirals which were not originally reported in the literature.
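As a rough illustration of how a seed collection can be expanded by 2D relatedness, the sketch below ranks catalogue entries by Tanimoto similarity to any seed compound. It is not the search method used in this work: fingerprints are represented as hypothetical sets of on-bit indices, and the compound identifiers and the 0.5 threshold are assumptions chosen for the example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of fingerprint on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def expand(seeds, catalogue, threshold=0.5):
    """Return catalogue ids whose best similarity to any seed
    meets the threshold (hypothetical value)."""
    hits = []
    for cid, bits in catalogue.items():
        best = max(tanimoto(bits, seed) for seed in seeds.values())
        if best >= threshold:
            hits.append(cid)
    return hits


# Hypothetical on-bit sets standing in for real 2D fingerprints.
seeds = {"seed-A": {1, 4, 7, 9}, "seed-B": {2, 3, 8}}
catalogue = {
    "cat-1": {1, 4, 7, 10},  # shares 3 bits with seed-A
    "cat-2": {5, 6, 11},     # shares no bits with any seed
    "cat-3": {2, 3, 8, 12},  # shares 3 bits with seed-B
}

hits = expand(seeds, catalogue)
print(hits)
```

In practice the on-bit sets would come from a cheminformatics toolkit, and a billion-scale catalogue would need an indexed or approximate search rather than this exhaustive loop.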
With make-on-demand libraries growing at an exponential rate, it is important to devise ways to efficiently
exploit these libraries and use them to develop custom and smaller virtual collections[38] like our African
Natural Products Database (ANPDB)[39]. Searching across billion-scale libraries is still computationally
intensive and becomes prohibitive in resource-limited settings such as laboratories in Africa, as is our
case. Here, we have demonstrated how a well-defined methodology of literature curation, coupled with
a simple and fast methodology to search ultra-large chemical spaces, can yield a manageable number
of molecules to be used in subsequent virtual screening tasks. We have proved the concept for SARS-
CoV-2, a pathogen for which we have invested efforts in our group, but the approach is disease-agnostic
and could be applied to any other area for which some compounds are annotated in the literature. Given
the infrastructural limitations of chemistry laboratories in our setting, we chose to use purchasable
compounds from either ZINC or Enamine REAL databases. The size of the current library (150k) is
amenable to low-resource computing and we expect it to be useful to other researchers pursuing COVID-
19 treatments based on small molecules and in a cost-effective manner. Indeed, our overarching plan is
to apply this pipeline to other disease areas and targets of interest to our team, including neglected
tropical diseases that disproportionately affect people living in the global South.
Acknowledgments
We acknowledge financial support from the Bill & Melinda Gates Foundation through the Calestous Juma
Science Leadership Fellowship awarded to FNK (grant award number: INV-036848 through the
University of Buea). FNK also acknowledges joint funding from the Bill & Melinda Gates Foundation
(award number: INV-055897) and LifeArc (Grant ID: 10646) under the African Drug Discovery Accelerator
program. FNK acknowledges further funding from the Alexander von Humboldt Foundation for a
Research Group Linkage project (Ref 3.4-1156361-CMR-IP). We acknowledge the technical support of
Dr. Conrad V. Simoben.
Author contribution statement
Conceptualization: Fidele Ntie-Kang and Miquel Duran-Frigola; Data curation: Jude Y. Betow, Clovis S. Metuge,
Simeon Akame, Vanessa A. Shu, and Oyere T. Ebob; Formal analysis: Gemma Turon, Fidele Ntie-Kang and
Miquel Duran-Frigola; Funding acquisition: Fidele Ntie-Kang; Investigation: Jude Y. Betow, Gemma Turon,
Miquel Duran-Frigola and Fidele Ntie-Kang; Methodology: Jude Y. Betow, Gemma Turon, Miquel Duran-Frigola
and Fidele Ntie-Kang; Project administration: Jude Y. Betow, Gemma Turon and Fidele Ntie-Kang; Software:
Gemma Turon, Miquel Duran-Frigola and Fidele Ntie-Kang; Resources: Gemma Turon, Miquel Duran-Frigola and
Fidele Ntie-Kang; Supervision: Fidele Ntie-Kang and Miquel Duran-Frigola; Validation: Fidele Ntie-Kang, Gemma
Turon, and Miquel Duran-Frigola; Writing – original draft: Jude Y. Betow, Gemma Turon, Miquel Duran-Frigola
and Fidele Ntie-Kang; Writing – review & editing: everyone.
Conflict of interest
None declared.
Data availability
All data used in the study is available for download at https://github.com/ersilia-os/sars-cov-2-chemspace. The
library of compounds is referenced in the README file of this repository.
Code availability
All code used in the study is available for download at https://github.com/ersilia-os/sars-cov-2-chemspace.
References
[1] a) C. W. Coley, Trends Chem. 2020, 3, 133–145; b) J. Wang, J. Mao, M. Wang, X. Le, Y. Wang, Methods
2023, 210, 52–59.
[2] a) C. A. Lipinski, Drug Discov. Today 2004, 1, 337–341; b) J. L. Medina-Franco, A. L. Chávez-Hernández, E.
López-López, F. I. Saldívar-González, Mol. Inform. 2022, 41, e2200116; c) P. S. Gromski, A. B. Henson, J. M.
Granda, L. Cronin, Nat. Rev. Chem. 2019, 3, 119–128; d) C. Dobson, Nature 2004, 432, 824–828; C. Lipinski,
A. Hopkins, Nature 2004, 432, 855–861; e) C. A. Lipinski, F. Lombardo, B. W. Dominy, P. J. Feeney, Adv.
Drug Delivery Rev. 2001, 46, 3–26; f) A. K. Ghose, V. N. Viswanadhan, J. J. A. Wendoloski, J. Comb. Chem.
1999, 1, 55–68; g) D. F. Veber, S. R. Johnson, H.-Y. Cheng, B. R. Smith, K. W. Ward, F. D. Kopple, J. Med.
Chem. 2002, 45, 2615–2623; h) W. J. Egan, K. M. Merz, J. J. Baldwin, J. Med. Chem. 2000, 43, 3867–3877; i)
G. R. Bickerton, G. V. Paolini, J. Besnard, S. Muresan, A. L. Hopkins, Nat. Chem. 2012, 4, 90–98; j) B. Li, Z.
Wang, Z. Liu, Y. Tao, C. Sha, M. He, X. Li, Brief. Bioinform. 2024, 25, bbae321