
intbio/SimChrom: v1 - SimChrom Initial Release

Author: Anna Gribkova; Alexey Shaytan; Grigoriy Armeev
Publisher: Zenodo
DOI: 10.5281/zenodo.17314850
Source: https://zenodo.org/records/17314850/files/SimChrom_main_text_and_supplement.pdf
(Re)defining the human chromatome:
an integrated meta-analysis of localization, function,
abundance, physical properties and domain composition of
chromatin proteins
Anna K. Gribkova1,2, Grigoriy A. Armeev1,2, Mikhail P. Kirpichnikov1,3, Alexey K. Shaytan1,2,4*
1 Department of Biology, Lomonosov Moscow State University, Moscow, Russia
2 Vavilov Institute of General Genetics, Moscow, Russia
3 Shemyakin–Ovchinnikov Institute of Bioorganic Chemistry, Russian Academy of Sciences, Moscow, Russia
4 International Laboratory of Bioinformatics, AI and Digital Sciences Institute, Faculty of Computer Science, HSE University, Moscow, Russia
* To whom correspondence should be addressed. E-mail: shaytan[email protected]io.msu.ru
GRAPHICAL ABSTRACT
ABSTRACT
The full complement of chromatin-associated proteins, collectively referred to as the chromatome, enables genome functioning in eukaryotes by participating in a wide range of physico-chemical processes. These include mediating diverse specific and non-specific intermolecular interactions, catalyzing in situ synthesis and modification of macromolecules, facilitating ATP-dependent chromatin remodeling, etc. Despite considerable progress in epigenomics and the structural characterization of many nuclear proteins and their complexes, our understanding of chromatin organization at the proteome scale remains incomplete. This gap hinders the development of a holistic view of genome regulation. In this study, we present a state-of-the-art characterization of the human chromatome based on an integrative meta-analysis of diverse data sources describing the composition, abundance, and sub-nuclear localization of chromatin proteins. This effort is complemented by original analyses of their physico-chemical properties, domain architectures, and interaction patterns. To support and streamline these analyses, we developed a reference dataset of chromatin proteins, integrated with an empirical, function-based classification ontology and an associated interactive web resource, SimChrom, accessible at https://simchrom.intbio.org/. The reference dataset was carefully curated by reconciling data among protein databases, localization, and mass spectrometry-based experimental studies. Sequence-based and AI-assisted structural analyses revealed previously unannotated domains within chromatin proteins that warrant experimental validation, as well as the widespread use of multivalent interaction strategies that underpin chromatin organization. Together, our findings establish a robust framework for future studies aimed at elucidating genome function through detailed analysis of protein–protein and protein–nucleic acid interactions within chromatin.
KEY POINTS
● The first comprehensive meta-analysis of human chromatin proteins that bridges diverse data types
● Established an interactive SimChrom framework for chromatome research available to the community
● Identified functionally relevant hallmarks of chromatin protein organization
Keywords: chromatin, chromatome, epigenomics, proteomics, meta-analysis, genome functioning, protein domains, AI-based protein structure prediction, multivalent protein-protein interactions, intrinsically disordered proteins
1. INTRODUCTION
Chromatin, according to the generally accepted definition, is the complex of DNA, proteins, and associated RNA molecules found in the nuclei of eukaryotic cells [1,2] (Fig. 1A, Interactive Fig. 1 at https://simchrom.intbio.org/#nucleus). However, in the crowded nuclear environment, it is challenging to establish stringent criteria that clearly distinguish between macromolecules that form a complex and those that do not, leaving room for interpretation of this definition. Chromatin proteins, collectively called the chromatome [3,4], enable genome functioning in space and time through active ATP-dependent processes and passive protein-DNA/RNA interactions. This functioning employs non-trivial physical phenomena such as liquid-liquid phase separation [5,6], topological constraints on the DNA, DNA looping and loop extrusion [7,8], diffusion in the crowded macromolecular environment [9], multivalent cooperative interactions [10,11], etc., all of which are regulated by the chromatome composition at specific locations and the properties of individual proteins including their post-translational modifications (PTM), domain architecture and intrinsically disordered regions (IDR).
After the discovery of nucleosomes in the 1970s, chromatin research focused on elucidating the molecular underpinnings of genome organization and function at the nucleosome and supranucleosome levels [1]. During recent decades, through the contributions of cryo-EM, epigenomics and 3D genomics, many more details on the organization of large macromolecular assemblies [12], protein-DNA interactions and DNA topology [13,14] within chromatin have become available. We are at a point when holistic quantitative, or at least qualitative, models of genome functioning based on integrating our knowledge about numerous molecular interactions and processes may seem to be within reach [15,16]. The scope of the data required for such models would lie beyond the one provided within the typical frameworks of genomics and epigenomics, and should also rely on what is sometimes referred to as “chromatomics” [3]: the systematic study of the entire content of the eukaryotic nucleus, including chromatin proteins, their spatio-temporal distribution and interactions. However, our understanding of the protein content of chromatin and its functioning at the “omics” level is currently lagging behind our ability to probe DNA sequence, its epigenetic markup and 3D contacts. It faces certain challenges, which we detail below, with the human chromatome in mind.
The first set of challenges lies in precisely defining the set of proteins that make up the chromatome. Historically, the definition was operational in nature, relying on experimental chromatin extraction followed by the analysis of the protein content via physico-chemical methods (see a historical account by K.E. van Holde [1]) and later by various flavours of mass spectrometry analysis combined with different chromatin extraction and treatment techniques (review by van Mierlo and Vermeulen 2021 [17]). Unsurprisingly, the results of such studies depend on several factors: the details of the chromatin extraction techniques (e.g., weakly associating proteins may not be extracted, while cytoplasmic proteins may contaminate the sample) [18,19], the sensitivity and the resolution of the analysis method (e.g., low-abundance proteins may not be detected; variations in post-translational modifications and alternative splice isoforms may be missed) [20–22], and the transient nature of the expression of some nuclear proteins. An additional complication is the heterogeneous and dynamic composition of chromatin (sometimes called a fuzzy organelle): it depends on the cell type, cell cycle phase, as well as on the conditions experienced by the cell [23]. One also has to keep in mind that many proteins shuttle between nucleus and cytoplasm. Recent proteomics studies have estimated the number of chromatin proteins to be around 200 - 3800 [4,23–33]. Despite the above-mentioned challenges, for many chromatin analysis tasks having a list of chromatin-associated proteins is the starting point.
The second set of challenges concerns classification. Since the human chromatome contains at least several thousand entries, any attempt at a rational understanding and description of its functioning requires some dimensionality reduction approaches. Hence, a certain grouping or classification of chromatin proteins that considers their functional properties is desirable. Yet, obtaining such a classification is currently challenging. There are certain historically established classes of chromatin proteins that can be clearly defined (e.g., histones, high mobility group proteins, etc.) [1]; however, others have become obsolete (e.g., nuclear matrix proteins [34]) or cannot be easily defined (e.g., nucleosol proteins). GeneOntology (GO) currently provides the most comprehensive set of annotations related to different aspects of gene products and is routinely used to interpret large-scale biological data, such as transcriptomics and proteomics results [35,36]. However, it cannot per se provide a straightforward and easy-to-comprehend classification of chromatin proteins due to the presence of a large number of chromatin-related GO terms connected into a complex, cumbersome hierarchy, which may be incomplete in some cases (e.g., it lacks a term for histones) or include obsolete terms in certain cases (e.g., nuclear matrix). While an ultimate functional classification of chromatin proteins may not be possible (due to the complexity of genome functioning, different proteins contributing to many different functional processes, etc.), some approximation is at least needed to establish a framework for a rational, reductionist understanding of chromatin by us humans.
The third set of challenges, in our mind, relates to the need to develop systems biology approaches to describe and study chromatin in a holistic way as a complex functioning system [37,38]. Considerable advances in methodology are currently needed to move from studying the structure of individual macromolecular complexes and analyzing sequence-level (albeit genome-wide) epigenomic data towards quantitative models of chromatin operation that can grasp the emergence of complex organismal functions. Chromatin functioning relies on complex dynamic networks of multivalent interactions between macromolecules. These interactions depend on the abundance of chromatin proteins in a given compartment, their physico-chemical properties, and domain architectures that mediate specific or non-specific interactions. Understanding these issues at the chromatome-wide scale is a prerequisite for building holistic functional chromatin models.
Motivated by the above-mentioned challenges and the overall need to build complex models of chromatin functioning, in this work we attempted to provide a state-of-the-art meta-analysis of what is known about chromatin proteins, their localization, abundance, and properties. The uniqueness of this study is in the cross-comparison of different data sources including database information, mass spectrometry data, and protein localization data. Our analysis was challenged by a common problem: the limited congruence between different data sources. To address it we developed several reference datasets of chromatin and nuclear proteins based on cross-comparison of different datasets and manual curation. Next we developed a relatively simple empirical function-based hierarchical classification of chromatin proteins (SimChrom classification), which was instrumental for all downstream analyses because it allowed us to compare properties between different groups of chromatin proteins. Using this framework we: 1) systematically analyzed the abundance of chromatin proteins, identified potential pitfalls in MS-based datasets, and, using whole cell proteomics data, quantified the presence of different chromatin proteins and chromatin protein groups in the cell; 2) characterized the interplay between amino acid composition of chromatin proteins, the prevalence of intrinsically disordered regions and specific distribution of charged amino acids in their sequences; 3) analyzed the current state of structural characterization and domain annotation of chromatin proteins and, based on novel AI-enabled protein structure prediction tools, identified more than 200 domains in chromatin proteins that belong to currently unknown structural superfamilies and await experimental characterization; 4) characterized typical patterns of multivalent interactions employed by chromatin regulator proteins, mainly engaging combinations of histone methylation, acetylation and DNA binding modes.
Finally, we supplement our analyses with an interactive web resource, SimChrom, accessible at https://simchrom.intbio.org/. SimChrom harmonizes data from different sources and may be used for exploration of different chromatin protein groups and properties of individual proteins.
The overall scheme of our work is outlined in Fig. 1B. Our analysis in fact required many iterations to achieve self-consistency (e.g., development of the SimChrom classification was performed concomitantly with the development of reference chromatin protein sets, and the classification was employed in the analysis of the quality and content of different data sources). Nevertheless, below we present the logic of our analysis as a series of consecutive steps to the extent possible, offloading detailed consideration of certain aspects that may require familiarity with the whole manuscript to the Supplementary Results and Discussion (Suppl. R&D) sections.
Figure 1. (A) The structure of the nucleus with details of chromatin organisation at the levels of chromatin domains and chromatin regulatory proteins. (B) The overview of this study shows sources of information about chromatin proteins (e.g., their functions, subcellular localization, identification by MS-based methods or specific protein functional domains) and their use in the current study.

2. MATERIAL AND METHODS
For the purpose of all analyses in this study we used the set of human protein identifiers and corresponding amino acid sequences representing the canonical protein isoforms (usually corresponding to the major splice isoforms) as provided by the UniProtKB/Swiss-Prot database (also known as the reviewed section of the UniProt Knowledgebase) (UniProt proteome ID UP000005640, release 2022_2) [39]. The set contained 20,272 gene entries corresponding to 20,225 unique protein IDs (some genes code for identical protein sequences). Wherever needed the original datasets were mapped to the above-described set of UniProt protein IDs.
2.1. Collection and processing of data about chromatin and nuclear protein repertoires, as well as other protein groups, from databases and MS-based studies
2.1.1. Protein localization data sources (UniProt, HPA, OpenCell)
Protein subcellular localization annotations were obtained from UniProtKB [39] (release 2022_2, subcellular location section) and Human Protein Atlas (HPA) [40] (version 22, https://www.proteinatlas.org/download/subcellular_location.tsv.zip, accessed on 09.06.2022); the OpenCell dataset [41] was retrieved from the website (https://opencell.czbiohub.org/, accessed on 10.06.2022). To ensure high-confidence subcellular localization annotations, localization terms were filtered according to database-specific reliability criteria. In UniProt, only annotations supported by at least one evidence tag were retained. In HPA, only annotations with a reliability score exceeding the 'uncertain' threshold were included. In OpenCell, only annotations scoring above the lowest quality grade were retained.
For the analysis of protein multiple localization, localization annotations from HPA and UniProt were grouped into the following generalized categories: Nucleus, Cytoplasm, Endomembrane system, Other (including chromosome, secretory, and extracellular proteins in UniProt). Detailed grouping information is provided in Suppl. Table ST3.
To estimate overrepresentation or underrepresentation of a particular localization for a list of proteins, enrichment analysis based on the hypergeometric test (also known as Fisher’s exact test) was used with a significance threshold of p-value < 0.05. To estimate the congruence of subnuclear localization annotations between UniProt (protein set denoted by A) and HPA (protein set denoted by B) the following metrics were used: TP = |A ∩ B|; FP = |B ∩ ¬A|; FN = |A ∩ ¬B|; Union = |A ∪ B|; Jaccard similarity coefficient = TP / Union. Performance measures: Precision = TP / (TP + FP); Recall = TP / (TP + FN); and F1-score = 2 × Precision × Recall / (Precision + Recall).
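The metrics defined above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function names and the toy protein identifiers are hypothetical, and the enrichment function covers only the one-sided over-representation tail of the hypergeometric test.

```python
from math import comb


def congruence_metrics(a, b):
    """Compare annotation set B against reference set A (sets of protein IDs)."""
    tp = len(a & b)      # TP = |A ∩ B|
    fp = len(b - a)      # FP = |B ∩ ¬A|
    fn = len(a - b)      # FN = |A ∩ ¬B|
    union = len(a | b)   # Union = |A ∪ B|
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return {
        "jaccard": tp / union if union else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / denom if denom else 0.0,
    }


def enrichment_p_value(k, n, K, N):
    """One-sided hypergeometric p-value P(X >= k).

    N: proteins in the background, K: background proteins with the localization,
    n: proteins in the query list, k: query proteins with the localization.
    """
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


# Toy example with hypothetical identifiers
metrics = congruence_metrics({"P1", "P2", "P3"}, {"P2", "P3", "P4", "P5"})
```

Under-representation would use the opposite tail, P(X <= k), with the same arguments.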
2.1.2. GO and function-oriented databases
In Section 3.1, for a selected set of Gene Ontology (GO) terms, lists of associated human proteins were retrieved from the QuickGO database [42] (GO annotation set created on 2025-03-06) using its REST API. To capture all relevant proteins, annotations were also obtained for all their descendant terms with the relationships "is_a", "part_of", and "occurs_in". Information on human histone and non-histone epigenetic regulators was obtained from the EpiFactors database [43] (version 2.1, released September 10, 2024). DNA-binding transcription factors (dbTFs) were obtained from QuickGO (https://www.ebi.ac.uk/QuickGO/targetset/dbTF, accessed on 01.09.2024) [44].
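The descendant-term expansion described above amounts to a graph traversal restricted to the three listed relationship types. The sketch below illustrates the idea on a toy edge list; it does not call the QuickGO API, and the function name, the edge format and the term labels are hypothetical.

```python
from collections import defaultdict, deque

# Relationship types followed when collecting descendant terms
ALLOWED_RELATIONS = {"is_a", "part_of", "occurs_in"}


def descendant_terms(edges, root, allowed=ALLOWED_RELATIONS):
    """Return all terms reachable downwards from `root`.

    edges: iterable of (child, relation, parent) triples.
    Only edges whose relation is in `allowed` are followed.
    """
    children = defaultdict(set)
    for child, relation, parent in edges:
        if relation in allowed:
            children[parent].add(child)

    seen = set()
    queue = deque([root])
    while queue:  # breadth-first traversal from the root term
        term = queue.popleft()
        for child in children[term]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    seen.discard(root)  # the root itself is not its own descendant
    return seen


# Toy ontology with placeholder term labels
toy_edges = [
    ("term_B", "is_a", "term_A"),
    ("term_C", "part_of", "term_B"),
    ("term_D", "regulates", "term_A"),  # ignored: relation not in the allowed set
    ("term_E", "occurs_in", "term_C"),
]
selected = descendant_terms(toy_edges, "term_A")
```

Proteins annotated with the root term or with any term in `selected` would then be pooled into one list.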
2.1.3. MS-based studies
We conducted a comprehensive search on PubMed using keywords "chromatome", "chromatin proteins", "experimentally obtained chromatin proteins", "nuclear proteins", and "nuclear proteome" to gather relevant informational resources containing data on nuclear and chromatin proteins.
All protein entries identified in MS-based studies were mapped to the UniProt human reference proteome (release 2022_02); protein isoforms were collapsed to canonical entries. Gene names were used as secondary identifiers to facilitate mapping for records without a direct match. Outdated UniProt entry identifiers were updated to their current counterparts in the used proteome. Below the specific details of obtaining the protein lists from the respective MS studies are provided (see Table 1).
From the Kustatscher et al. (2014) study [23], which provides interphase chromatin probability scores (ICP) for 7,635 human proteins, proteins with ICP > 0.5 were selected. From the Alabert et al. (2014) study [26] entries with missing UniProt ID, gene name, or intensity were excluded, and unmapped entries were matched by gene name. No filtering by nascent enrichment or chromatin probability was applied. From the Ginno et al. (2018) study [25] proteins quantified with at least two unique peptides and consistent signal across all three replicates of at least one cell cycle stage (G1, S, or M) were selected. Although mitotic chromatin was included, the protein composition across stages was nearly identical, and excluding M-phase would not significantly alter the dataset. From the Shi et al. (2021) study [27], protein entries were taken from the experimental run that used conditions involving HaeIII digestion and no 1,6-hexanediol treatment ("condition 2" in the study), representing chromatin-associated proteins extracted under native conditions using the Hi-MS protocol. Only proteins with non-zero iBAQ values in all three replicates were retained. From the Torrente et al. (2011) study [4] chromatin-associated protein lists obtained by three extraction methods were combined (total chromatin extraction, high-salt extraction, and micrococcal nuclease digestion). Protein GI accession numbers were converted to UniProt reviewed entries. From the Itzhak et al. (2016) study [45] all entries annotated as "mostly nuclear" by the "Global classifier 2" were taken. From the Alvarez et al. (2023) study [28] a combined list of proteins obtained by NCC (Nascent Chromatin Capture) in HeLa S3 cells and iPOND (isolation of Proteins On Nascent DNA) in TIG-3 fibroblasts was taken. From the Ugur et al. (2023) study [24] proteins with non-missing raw Log2 intensity values in all three chromatin replicates of human embryonic stem cells were taken.
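The identifier mapping applied to every MS dataset (isoform collapse, update of outdated accessions, gene-name fallback) could look roughly like the sketch below. It is an illustration under stated assumptions: the function name, the lookup tables and all accessions are hypothetical placeholders, and the order of the steps is our reading of the description rather than the authors' pipeline.

```python
def map_to_reference(accession, gene, reference_ids, gene_to_id, replaced_by):
    """Map one MS-study record to a reference proteome accession, or None.

    accession:     accession reported by the study (may carry an isoform suffix)
    gene:          gene name reported by the study (secondary identifier)
    reference_ids: set of canonical accessions in the reference proteome
    gene_to_id:    gene name -> canonical accession
    replaced_by:   outdated accession -> current accession
    """
    if accession:
        # Collapse isoforms: UniProt isoform accessions look like "<accession>-<n>"
        canonical = accession.split("-")[0]
        # Replace an outdated accession with its current counterpart, if known
        canonical = replaced_by.get(canonical, canonical)
        if canonical in reference_ids:
            return canonical
    # No direct match: fall back to the gene name
    return gene_to_id.get(gene) if gene else None


# Hypothetical lookup tables
reference_ids = {"A00001", "A00002"}
gene_to_id = {"GENE2": "A00002"}
replaced_by = {"Z99999": "A00001"}

mapped = map_to_reference("A00001-2", "GENE1", reference_ids, gene_to_id, replaced_by)
```

Records that return `None` would remain unmapped and could be reviewed manually.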
2.1.4. Other sources
Housekeeping proteins were defined as proteins detected in all analyzed tissues by RNA-seq and downloaded from HPA (version 23) [46] (in total 8899 proteins). Manually curated protein complexes were downloaded from Complex Portal (accessed on 7 January 2025) [47]. Only complexes exclusively containing chromatin proteins were selected for domain co-occurrence analysis.
2.2. Construction of reference chromatin and nuclear protein datasets
2.2.1. The SimChrom ontology, SimChrom protein dataset, and SimChrom/SimChrom-SL classification
The SimChrom chromatin protein classification ontology was developed simultaneously with the corresponding set of human chromatin proteins that were collected according to the developed classification. To this end, for almost every SimChrom classification term we attributed specific terms from the GO classification that were manually selected to represent molecular functions and biological processes that occur exclusively inside the cell nucleus and are related to the respective SimChrom term. In certain cases cellular component GO terms were also used when they were directly related to the respective SimChrom term (e.g., complexes of chromatin remodelers). The list of the GO terms attributed to every SimChrom term is given in Suppl. Table ST4. The dataset was then supplemented by proteins defined as chromatin proteins by several databases, review papers, and original studies (e.g., HistoneDB 2.0 for histone proteins, Histome2 for PTM writers; see details and corresponding data sources in Suppl. Table ST4). At the last step, for certain SimChrom categories the contents of the dataset were additionally filtered to remove either the potentially non-nuclear proteins (in the case of RNA-binding proteins) or histones from the categories belonging to the “non-histone proteins” group (see details in Suppl. Table ST4).
To ensure unambiguous categorization, we constructed a SimChrom-SL (single labeled) classification, where each protein was assigned to exactly one category of the same SimChrom ontology. Assignment followed a predefined priority order: "Molecular function" and "Physico-chemical properties" categories were prioritized, followed by others ("Biological processes" and "Genomic location"). Within these groups, categories containing fewer proteins were ordered first (see priority hierarchy in Suppl. Fig. SF3_2). Each protein was labeled with its first eligible category in this sequence.
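The single-label assignment can be read as a sort followed by a first-match pass. The sketch below is a simplified illustration: the category names and protein identifiers are placeholders, the alphabetical tie-break is our assumption, and the actual priority hierarchy is the one given in Suppl. Fig. SF3_2.

```python
# Aspects whose categories are considered first
PRIORITY_ASPECTS = {"Molecular function", "Physico-chemical properties"}


def assign_single_labels(categories):
    """Assign each protein to exactly one category.

    categories: category name -> (aspect, set of protein IDs)
    Returns: protein ID -> category name
    """
    order = sorted(
        categories,
        key=lambda name: (
            categories[name][0] not in PRIORITY_ASPECTS,  # prioritized aspects first
            len(categories[name][1]),                      # smaller categories first
            name,                                          # tie-break (assumption)
        ),
    )
    labels = {}
    for name in order:
        for protein in categories[name][1]:
            labels.setdefault(protein, name)  # keep the first eligible category
    return labels


# Placeholder categories and proteins
toy_categories = {
    "cat_small": ("Molecular function", {"p1", "p2"}),
    "cat_large": ("Molecular function", {"p1", "p3", "p4"}),
    "cat_process": ("Biological processes", {"p3", "p5"}),
}
single_labels = assign_single_labels(toy_categories)
```

Ordering smaller categories first keeps narrowly defined groups from being absorbed by broad ones when a protein belongs to both.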
To assess the quality of the SimChrom dataset, Gene Ontology enrichment analysis was performed. Two groups of proteins were analyzed separately: (1) proteins present in SimChrom but absent from NULOC_CS (see below), and (2) proteins present in NULOC_CS but absent from SimChrom. GO enrichment was assessed using g:Profiler [48], with multiple testing corrections applied (Bonferroni correction, significance threshold 0.05). Only driver GO terms (defined by g:Profiler) were selected for further interpretation.
2.2.2. Construction of reference localization datasets
Reference datasets of nuclear (abbreviated as NULOC), non-nuclear (NON_NULOC) and cytoplasmic (CYTLOC) protein entries at different levels of confidence and uniqueness of localization were constructed (see set of localization terms in Suppl. Table ST3). The datasets were created using the combination of localization information from UniProt and HPA (full details are provided in Suppl. Table ST6). Datasets of proteins having only one specific localization (i.e., no other localization reported in the source databases) are denoted by the UL suffix. Datasets where localization is simultaneously supported by both UniProt and HPA are denoted by the CS (consensus) suffix; alternatively, if localization is supported by at least one database, the JT (joint) suffix is given. The NECF (“no evidence code filtration”) suffix denotes datasets where no preliminary filtration of the localization information provided by the databases based on evidence codes or confidence levels was applied. We noticed that many histone protein entries in UniProt lack evidence codes for their localization (in release 2022_2). This is apparently due to the fact that these entries appeared in UniProt before the manual curation of evidence attribution was introduced, and they are still awaiting manual review and retrofitting. We manually added histone proteins to all constructed nuclear reference datasets.
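In set terms, the CS, JT and UL suffixes correspond to an intersection, a union and a uniqueness filter. The sketch below shows one possible reading; the exact construction rules are those of Suppl. Table ST6. Only NULOC_CS is a dataset name taken from the text, while the other dictionary keys, the function name and the toy identifiers are illustrative assumptions.

```python
def build_nuclear_sets(uniprot, hpa, histones=frozenset()):
    """Build illustrative nuclear reference sets.

    uniprot, hpa: protein ID -> set of generalized localization categories
                  (already filtered by confidence)
    histones:     protein IDs added manually to every nuclear set
    """
    nuc_uniprot = {p for p, locs in uniprot.items() if "Nucleus" in locs}
    nuc_hpa = {p for p, locs in hpa.items() if "Nucleus" in locs}

    consensus = nuc_uniprot & nuc_hpa  # CS: supported by both databases
    joint = nuc_uniprot | nuc_hpa      # JT: supported by at least one database
    # UL: no localization other than "Nucleus" reported in either database
    unique = {
        p for p in joint
        if (uniprot.get(p, set()) | hpa.get(p, set())) == {"Nucleus"}
    }

    extra = set(histones)
    return {
        "NULOC_CS": consensus | extra,
        "NULOC_JT": joint | extra,
        "NULOC_UL": unique | extra,
    }


# Toy annotations with placeholder identifiers
toy_uniprot = {"p1": {"Nucleus"}, "p2": {"Nucleus", "Cytoplasm"}, "p3": {"Cytoplasm"}}
toy_hpa = {"p1": {"Nucleus"}, "p3": {"Nucleus"}, "p4": {"Nucleus"}}
nuclear_sets = build_nuclear_sets(toy_uniprot, toy_hpa, histones={"h1"})
```

An NECF variant would be obtained by running the same function on annotations that were not filtered by evidence codes or confidence levels.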
2.3. Protein abundance data processing
2.3.1. PaxDB data
Two datasets from PaxDB [49] version 4.2 with protein abundance data were used: the dataset with the highest proteome coverage (“H. sapiens - Whole organism (Integrated)”, which covers 99% of the human proteome according to PaxDb, referred to as “PaxDb_INT” in this paper) and the dataset with the highest interaction consistency score (“Whole organism, SC (Peptideatlas,aug,2014)”, which covers 84% of the human proteome according to PaxDb, referred to as “PaxDb_PA” in this paper); see also Suppl. R&D Sec. 3.1. The abundance unit is parts per million (ppm), which describes protein abundance relative to all expressed molecules in the proteome. Protein abundance was obtained by aggregating the abundance of individual genes with similar protein sequences (e.g., in the case of canonical histones). Cumulative abundance was defined as the sum of protein abundances for a group of protein entries; cumulative weight was calculated as abundance multiplied by the protein molecular weight.
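The abundance quantities defined above reduce to simple sums. The following sketch assumes that aggregation means summing the ppm values of genes mapped to the same protein entry and that the cumulative weight of a group is the sum of abundance × molecular weight over its members; both are our reading of the text, and all names and numbers are hypothetical.

```python
def aggregate_by_protein(gene_abundance, gene_to_protein):
    """Sum gene-level abundances (ppm) into protein-level abundances."""
    totals = {}
    for gene, ppm in gene_abundance.items():
        protein = gene_to_protein[gene]
        totals[protein] = totals.get(protein, 0.0) + ppm
    return totals


def cumulative_abundance(group, abundance):
    """Sum of abundances (ppm) over a group of protein entries."""
    return sum(abundance.get(p, 0.0) for p in group)


def cumulative_weight(group, abundance, mol_weight):
    """Sum of abundance x molecular weight over a group of protein entries."""
    return sum(abundance.get(p, 0.0) * mol_weight[p] for p in group)


# Hypothetical values: two genes encode the same protein entry "pA"
gene_abundance = {"g1": 100.0, "g2": 50.0, "g3": 25.0}
gene_to_protein = {"g1": "pA", "g2": "pA", "g3": "pB"}
mol_weight = {"pA": 10.0, "pB": 20.0}  # arbitrary units

abundance = aggregate_by_protein(gene_abundance, gene_to_protein)
total_ppm = cumulative_abundance({"pA", "pB"}, abundance)
total_weight = cumulative_weight({"pA", "pB"}, abundance, mol_weight)
```

Proteins absent from the abundance table contribute zero here; whether missing values should instead be excluded or imputed is a separate design decision.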
Ginno et al., 2018 | Total chromatin: time-resolved (G1, S, M) | Density-based enrichment for mass spectrometry analysis of chromatin (DEMAC): formaldehyde-fixed cells were sonicated, subjected to cesium chloride (CsCl) gradient ultracentrifugation to isolate DNA–protein complexes by buoyant density (1.39 g/cm³). Chromatin fractions were collected, dialyzed, decrosslinked, digested with DNase I. | Human T98G (glioblastoma) | 3065 (chromatome); 6242 (total proteome) | 3051
Shi et al., 2021 | Promoter-proximal chromatin | Hi-MS (Hi-C-based proteomics, adapted from BL-Hi-C): cells crosslinked with 1% formaldehyde; genomic DNA digested with HaeIII (GGCC sites); ligated with biotinylated bridge linkers; nuclei lysed in 0.2% SDS; chromatin sonicated; chromatin-DNA complexes captured on streptavidin beads. Quantified sensitivity to 1,6-hexanediol evaluated via AICAP index (Anti-1,6-Hexanediol Index of Chromatin-Associated Proteins). | K562 | 3228 | 2848
Ugur et al., 2023 | Total chromatin | Chromatin Aggregation Capture (ChAC): nuclei fixed with 1% formaldehyde, lysed with SDS and urea, sonicated, and purified by protein aggregation capture (PAC) on magnetic beads. DIA-MS with DIA-NN used for quantification. | Human ESCs (H9) | 2487 | 1730
Alvarez et al., 2023 | Time-resolved (nascent, G2/M, early and late G1) chromatin | Nascent Chromatin Capture (NCC) method, which relies on pulse-labeling newly replicated DNA with biotin-dUTP, followed by formaldehyde crosslinking and sonication-based chromatin fragmentation. Biotinylated DNA-protein complexes were affinity-purified using streptavidin magnetic beads. HeLa S3 cells were synchronized and harvested at five post-replication time points (Nasc, Late S, G2/M, early G1, late G1) across six biological replicates. | HeLa S3 | 1454 (present at all time points in all 6 replicates; from total of 5770) | 1478 (2894 total)
Alvarez et al., 2023 (continued) | Time-resolved (nascent, G2/M, early and late G1) chromatin | Isolation of Proteins On Nascent DNA (iPOND): formaldehyde crosslinking (1%), EdU labeling for 15 minutes, click chemistry with biotin-azide, chromatin fragmentation, streptavidin bead enrichment. | TIG-3 fibroblasts | 2351 (detected in ≥4 of 5 replicates) | 2397 (2894 total)

Among the protein/gene annotation databases the GeneOntology (GO) database stands out as a comprehensive attempt at describing the functions of gene products in an ever-growing number of organisms [35,36]. Within the GO framework genes are annotated according to their involvement in certain molecular functions, biological processes, and cellular components. The ontology itself forms a complex interlinked hierarchy with more than 40,000 GO terms and offers annotations for nearly the entire human reference proteome. However, despite its apparent comprehensiveness the GO database could not per se provide answers to the questions that were instrumental for this study, namely, to provide a set of chromatin genes/proteins and a relatively simple functional classification of these proteins that could be used for further analysis.
The GO cellular component term "chromatin" is defined broadly as "the ordered and organized complex of DNA, protein, and sometimes RNA, that forms the chromosome" and encompasses around 2000 proteins. Comparisons with other databases suggest that this number is a rather conservative estimate. For instance, up to 528 proteins listed in specialized databases of epigenetic factors (EpiFactors) and transcription factors (GO catalogue of TFs [44]) are missing from this set (see Suppl. Fig. SF2_1A); concomitantly the HPA protein localization database suggests that there are around 6000 proteins located in nucleoplasm (see Suppl. Fig. SF2_2A). Furthermore, many proteins annotated by GO terms that are bona fide related to chromatin (e.g., chromatin binding) are missing from those annotated by the GO term “chromatin” (see Suppl. Fig. SF2_1C). This stems in part from the complexity of the relations between the GO terms belonging to different annotation aspects: in this example the terms “chromatin organization” and “chromatin remodeling” are not connected to the term “chromatin” within the ontology tree. The manual search and identification of the chromatin-related GO terms and relevant proteins is challenging because of (1) the sheer number of GO terms (e.g., the word “chromatin” is found in the names of more than 60 terms, see Suppl. Fig. SF2_1C), (2) the fact that apparently relevant terms may also include non-chromatin-associated entries (e.g., transcription may also include mitochondrial transcription), (3) the fact that terms describing certain historically established chromatin protein groups may be missing (e.g., histones, HMG proteins), (4) the fact that the GO database is not chromatin-specific and may not be up-to-date in certain aspects (e.g., it contains obsolete terms such as “nuclear matrix” or lacks annotations of proteins that are available in recent literature reviews). Suppl. R&D Sec. 1.1 provides further details and examples from our analysis.
A number of epigenetic/chromatin regulator/factor databases (e.g., EpiFactors [43], CRdb [57]) provide carefully curated information about chromatin proteins that are involved in what is historically assumed to be molecular mechanisms of epigenetic regulation (Fig. 2A, Suppl. Table ST1). However, they annotate only around 400-800 chromatin proteins, which is much less than is expected to be in chromatin (see Fig. 2D). Unlike GO, these databases introduce a rather simple classification of proteins (the respective categories are highlighted in Fig. 3), but lack many essential chromatin categories (e.g., histones, histone chaperones, etc.). Protein class-specific databases and datasets available in published papers provide even more trustworthy but narrow sets of information about certain classes of chromatin proteins. Of particular impact here, by the number of provided entries, are the databases of transcription factors. Recent databases (e.g., The Human Transcription Factors [55], GO catalogue of TFs [44]) comprise around 1500 transcription factors. Additionally, several chromatin-related protein classes have been reviewed in the literature but lack dedicated database resources [58–64].
The other alternative and powerful source of information about nuclear/chromatin proteins is localization databases and proteome-wide studies: the UniProt, HPA and OpenCell projects are currently regarded as the most comprehensive and trusted resources on protein intracellular localization (see Fig. 2B). From each resource we extracted the sets of proteins whose localization was annotated (only annotations with sufficient confidence levels were considered for our analysis, see Methods Section 2.1.1) as belonging to the nucleus or sub-nuclear compartments according to the localization ontologies specific to each resource. The detailed cross-comparison of the datasets is presented in Suppl. Fig. SF2_2 and its interactive version (Interactive Fig. 2 available at https://simchrom.intbio.org/#localization), and is discussed at length in Suppl. R&D Sec. 1.2 using Suppl. Fig. SF2_2, SF2_3, SF2_4. In summary, the resources vary considerably in their sets of nuclear proteins, in the sub-nuclear localization annotations of these proteins, and in the annotation ontologies themselves.
It has to be kept in mind in the first place that localization annotation coverage is not complete: collectively the three resources cover 86% of the human reference proteome (with sufficient confidence - see above), while only 44% of proteins are simultaneously annotated by UniProt and HPA. The resources also differ by the number of localization annotations they provide on average for each protein (the median number is two and one for HPA and UniProt, respectively), suggesting HPA is more complete with respect to annotating multilocalization of proteins. Hence, although the protein space coverage by UniProt is larger compared to HPA (70% vs 60%), the nucleome provided by the former is considerably smaller (~4700 vs 6500 proteins). Together the three resources annotate ~8000 proteins as having nuclear localization (see Venn diagrams in Fig. 2B), which amounts to 47% of proteins that have localization information according to at least one resource. Hence, as a rough estimate it is tempting to conclude that current protein localization databases suggest that around half of human proteins have some evidence of nuclear localization. However, it has to be kept in mind that the congruency between the resources remains mediocre. Among ~9000 proteins whose localization is simultaneously available in UniProt and HPA, ~3300 are annotated as nuclear by both HPA and UniProt, while another ~2000 are annotated as such only by one of the two resources (~600 by UniProt and ~1400 by HPA) (see Fig. 2B, lower left Venn diagram). The discrepancies are in part due to (1) incomplete annotation of multiple localization possibilities by the databases (among the ~2000 proteins, ~60% have a matching localization annotation between the databases other than nuclear), (2) potential biases in localization annotations (HPA tends to label nuclear proteins as vesicular proteins and UniProt tends to label nuclear proteins as secreted proteins and proteins of the extracellular matrix) (see Suppl. R&D Sec. 1.2, Suppl. Fig. SF2_3). Extrapolating the above estimates to the whole proteome (with all caveats in mind about the non-uniform annotation coverage of different protein groups) one can suggest that between 35% and 60% of gene protein products may be ascribed nuclear localization depending on the chosen degree of certainty. Hence, a combination of the datasets provided by the resources may be used to construct reference nucleome datasets of varying confidence (see Results Sec. 3.2).
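
The UniProt/HPA comparison above is, in essence, a comparison of two sets restricted to the proteins annotated by both resources. The following minimal Python sketch (an illustration only, not the authors' pipeline) shows how such Venn counts and an agreement fraction could be computed. The protein identifiers, the function name `compare_annotations` and the choice of a Jaccard-type agreement measure are assumptions made for this example. Applying the same measure to the rounded counts quoted above (~3300 shared, ~600 UniProt-only, ~1400 HPA-only) gives an agreement of roughly 0.62.

```python
def compare_annotations(nuclear_a, nuclear_b, shared):
    """Split proteins annotated by both resources by who calls them nuclear.

    nuclear_a, nuclear_b: sets of proteins annotated as nuclear by each resource.
    shared: proteins with any localization annotation in both resources.
    """
    a = set(nuclear_a) & set(shared)
    b = set(nuclear_b) & set(shared)
    both = a & b
    union = a | b
    return {
        "both": len(both),
        "only_a": len(a - b),
        "only_b": len(b - a),
        # Jaccard-type agreement: share of "nuclear by either" supported by both
        "agreement": len(both) / len(union) if union else 0.0,
    }


# Toy identifiers (hypothetical, for illustration only)
shared = {"P1", "P2", "P3", "P4", "P5", "P6"}
uniprot_nuclear = {"P1", "P2", "P3", "P9"}  # P9 is not in the shared set
hpa_nuclear = {"P2", "P3", "P4", "P5"}
result = compare_annotations(uniprot_nuclear, hpa_nuclear, shared)

# Same measure applied to the rounded counts quoted in the text
reported_agreement = 3300 / (3300 + 600 + 1400)
```

Restricting both sets to the shared proteins first matters: without it, differences in coverage would be counted as disagreement.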
Many nuclear proteins were found to have multiple localization annotations belonging to different cellular compartments (see Suppl. Fig. SF2_2B, Methods Section 2.1.1, and Suppl. R&D Sec. 1.2). On one hand this reflects the functionally important property of nuclear proteins to shuttle between compartments. For example, many transcription factors and coactivators (e.g., NF-κB, STAT, p53, TAF7, YAP/TAZ) regulate their action through cytoplasm/nucleus shuttling [65–67]; even some histones, such as H2B, may relocalize to the cytoplasm under stress and perform unconventional functions [68,69]. On the other hand, functionally irrelevant multiple localization annotations may arise due to experimental artifacts or suboptimal signal-to-noise thresholds, keeping in mind that all nuclear proteins are in fact synthesized in the cytoplasm and imported to the nucleus. According to our analysis UniProt and HPA estimate separately that ~50% of proteins with nuclear localization may also be localized in other compartments (~40% in cytoplasm, 12%–22% in the endomembrane system), annotating around 48%–50% as localized solely in the nucleus. However, once the annotations of UniProt and HPA are compared within the shared common set of proteins (nuclear localization annotation is available in both databases) it turns out that only for ~40% of proteins the two databases reach consensus on their unique nuclear localization (see Fig. 2B, lower right Venn diagram). In other words (see Suppl. R&D Sec. 1.2), for approximately every five proteins identified as uniquely localized in the nucleus by one database, it is likely that two of them will have a non-nuclear localization annotation in the other database (per se or in addition to the nuclear localization). The same tendency was observed for the annotations of uniquely localized cytoplasmic proteins. The obtained estimates likely reflect the suboptimal specificity of the localization information provided by the databases (additional localizations of nuclear proteins may not always be captured) and the potential presence of spurious localization annotations (artifacts of incorrect localization assignments). It is, however, non-trivial to deconvolute these two types of errors.
Our analysis of sub-nuclear localization ontologies showed that the one of UniProt is more diverse, comprising 20 terms (this number includes chromosome localization, which is distinct from the nuclear localization according to UniProt, although the majority, 84%, of chromosome proteins are also annotated as nuclear), while HPA and OpenCell comprise 9 and 6 terms, respectively. However, in terms of annotation specificity only 19% of nuclear proteins in UniProt are annotated with sub-nuclear localization terms, while for HPA all of the nuclear proteins have some sub-nuclear localization (although 92% are considered a part of nucleoplasm, 33% bear localization annotations other than nucleoplasm). There are certain parts of the ontologies that do not match between the resources or with the state-of-the-art knowledge. For instance, HPA considers mitotic chromosomes as a part of nucleoplasm, while UniProt uses the outdated “nucleus matrix” term. OpenCell is the only resource that explicitly considers “chromatin” as a possible localization of nuclear proteins, while UniProt explicitly lists “Chromosome” as a possible localization, which was in turn inherited from the GO cellular compartment ontology where “chromatin” has a child-parent relation with the term “chromosome”. The above-mentioned discrepancies reflect the dynamic complexity of cellular organization, our constantly evolving understanding of nuclear organization, and the resulting difficulty in describing subcellular localization in the form of a simple hierarchical tree-like ontology. While the exact names may differ, all resources converge on the presence of the following localization terms: Nucleoplasm; Nuclear bodies; Nucleolus; Nuclear envelope. Among these terms the nucleoplasm localization is the one most related to chromatin proteins (according to HPA nucleoplasm is what is found within the nuclear membrane, but excludes nucleoli according to the respective localization ontology). If one interprets the definition of chromatin broadly (treats proteins that localize with the interphase chromosomes as part of the chromatin “complex”) the set of proteins with nucleoplasm localization is a direct source of information about chromatin proteins. HPA lists around six thousand nucleoplasm proteins (Fig. 2B). The analysis of subnuclear multilocalization is available in Suppl. R&D Sec. 1.2 and Suppl. Fig. SF2_4.
MS-based studies of chromatin extracts are another key source of information about the protein content of chromatin. Despite being the ultimate direct source of data about the composition of chromatin, this approach unfortunately has certain limitations (see Introduction, and relevant reviews [17,70]). To gain a quantitative understanding of the utility of MS-based studies for our goals we have selected data from several studies in human cell lines (see Table 1 and Methods Section 2.1.3) for analysis. The selected datasets included five studies that aimed at total interphase chromatin characterization using different methods of chromatin purification and post-MS data analysis, two studies characterizing nascent chromatin, and one study characterizing the total nuclear proteome. The more than twofold variation (from 1.5 to 3.5 thousand entries) in the number of detected chromatin proteins in various MS-based studies highlights the varying sensitivities of different chromatin purification/MS-detection setups (Table 1). The pairwise comparison of different chromatin datasets of comparable size (having around 3000 proteins) suggests that for any given set its fraction overlapping with any other set does not exceed 68% (see Fig. 2C). The number of chromatin proteins present simultaneously in all total chromatin datasets is 179 (Suppl. Fig. SF2_5B). These facts highlight considerable variation of MS-based data due to different sample sources and chromatin extraction techniques.
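
The two summary statistics used above (the largest pairwise overlap fraction and the set of proteins common to all datasets) can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the authors' code: the dataset names and protein identifiers are hypothetical toy values, and the overlap is defined here as the fraction of the first set that is also present in the second.

```python
from itertools import permutations


def overlap_fractions(datasets):
    """For each ordered pair (a, b): fraction of a's proteins also found in b."""
    return {
        (a, b): len(datasets[a] & datasets[b]) / len(datasets[a])
        for a, b in permutations(datasets, 2)
    }


def core_set(datasets):
    """Proteins present simultaneously in every dataset."""
    return set.intersection(*datasets.values())


# Hypothetical toy datasets standing in for MS-based chromatin studies
datasets = {
    "study_1": {"H2B", "CTCF", "SMARCA4", "ACTB"},
    "study_2": {"H2B", "CTCF", "TOP2A", "GAPDH"},
    "study_3": {"H2B", "SMARCA4", "TOP2A", "TUBB"},
}
fractions = overlap_fractions(datasets)
max_overlap = max(fractions.values())
core = core_set(datasets)
```

Because the fraction is normalized by the size of the first set, it is asymmetric; this is why the comparison in the text is restricted to datasets of comparable size.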
We next thoroughly analyzed these protein datasets through cross-comparison between themselves, comparison with protein localization data, and tests of enrichment of different chromatin protein categories (according to the SimChrom classification described in Results Section 3.2). The detailed analysis is provided as Suppl. R&D Sec. 1.3, and we only succinctly summarize our conclusions below. From 10% to 38% of proteins identified in MS-based chromatin datasets currently do not have any support from localization databases through their annotated nuclear localization (see Suppl. Fig. SF2_5A,C), suggesting that even for chromatin purification protocols based on protein-DNA cross-linking there still might be a certain degree of contamination with non-nuclear proteins, mainly cytoplasmic ones (see Suppl. Fig. SF2_6). Yet, MS-based techniques may have predictive power to identify new chromatin proteins that are not annotated in the localization databases. For example, among 195 proteins reported simultaneously by at least five out of seven chromatin MS-based studies we estimated that ~30% of proteins may have indications in the literature supporting their nuclear localization. MS-based studies are biased towards identifying housekeeping proteins: more than 80% of nuclear/chromatin proteins reported by the MS-based studies were from the housekeeping pool, while the average expected fraction of nuclear housekeeping proteins is around 62% (see Suppl. Fig. SF2_7A,B). This is expected since many non-housekeeping proteins are conditionally expressed. However, MS-based studies tend to miss housekeeping transcription factors too (and, to an even greater extent, non-housekeeping TF), apparently due to their low abundance and the dynamic nature of their interactions (Suppl. Fig. SF2_7C, SF2_8A, see also Results Section 3.3.1 for discussion of chromatin protein abundance). MS-based studies also struggle to recover as separate gene products proteins with very similar sequences, e.g. canonical histone isoforms (see Suppl. Fig. SF2_7C, SF2_8B).
To finalize our analysis we compared the datasets from the three types of data sources about chromatin proteins examined above (Fig. 2D, Suppl. Fig. SF2_9). One can see that localization databases are currently leading by the number of proteins that may be considered as chromatin proteins in the broad sense (e.g., the proteins of the nucleoplasm). However, there is still limited congruence with the other data sources. For instance, 25% of GO "Chromatin" proteins are not localized in the nucleus according to HPA; moreover, of these 500 proteins, only 115 have any localization information in HPA. Notably, 42% (2699) of the proteins identified in MS-based chromatome and nucleome studies lack nuclear localization annotations in both UniProt and the HPA, whereas only 254 proteins remain entirely unannotated for subcellular localization in these databases.
Taken together our analysis of different chromatome data sources revealed considerable heterogeneity of information and limited congruence between the available datasets. The available functional databases, while providing functionally supported data, are either limited in scope or suffer from historically-contingent complexity and sometimes discrepancies in their classification ontologies, which are not tailored to provide comprehensive straightforward information about interphase chromatin proteins. The localization databases are a powerful alternative source of information that can give an upper bound for the set of chromatin proteins (since they should have nuclear/nucleoplasm localization) and provide a relatively reliable estimate of the lower bound of the number of nuclear proteins. However, they suffer from an incomplete coverage of the proteome-localization space and hence from difficulties in estimating false-positive and false-negative annotation rates (keeping in mind the multi-localization of proteins), as well as from limited congruence of subnuclear localization ontologies. The MS-based studies of chromatin extracts are the most direct source of information about the chromatome and may identify new chromatin proteins not currently annotated in other databases. However, they are limited in scope (many proteins are conditionally expressed or have low expression levels) and suffer from contamination with non-nuclear proteins.

3.2. The SimChrom chromatin protein classification, the SimChrom dataset and other reference datasets
Figure 3. The SimChrom empirical chromatin classification ontology and the SimChrom chromatin proteins dataset. The hierarchical tree-like classification organizes 39 SimChrom categories. For each category the respective number of proteins from the SimChrom dataset is given in parentheses (proteins can simultaneously belong to more than one category with the exception of histones). The pictograms on the left of each category name provide the information about the presence of similar categories in the ontologies of other databases (see legend). The colored bars show whether the specific category was derived from grouping the proteins according to certain aspects: the similarity of their molecular functions, physico-chemical properties, involvement in similar biological processes or localization in similar genomic locations (see legend). Note: the latter annotations may deviate from GO annotation aspects.
Taking into account the advantages and disadvantages of different sources of information about chromatin proteins presented above, we aimed at constructing a reference set of chromatin proteins together with a classification ontology and several supplementary nuclear localization protein datasets that can be later used in analyzing the repertoire, abundance, functional, structural, and physico-chemical properties of chromatin proteins. Our aim was to create a relatively simple classification ontology that, while potentially sacrificing the details, will enable a holistic human-understandable overview of the chromatome (see Suppl. R&D Section 1.1 for the discussion of GO complexity and ensuing challenges). The current version of the SimChrom classification focuses on classification of chromatin/nucleoplasm proteins, leaving aside the classification of the nuclear envelope proteins, which are historically not considered to be a part of chromatin. The SimChrom classification ontology was created by manually analyzing, critically evaluating, selecting and combining into a tree-like classification scheme information from (1) the historically established consensus on chromatin proteins classification (e.g., histone, non-histone proteins, HMG-proteins [1]), (2) classification used in major
databases of chromatin and epigenetic regulators (e.g., EpiFactors, FACER), (3) classification used in the three aspects of Gene Ontology (Suppl. Fig. SF3_1). Our hierarchical SimChrom classification is presented in Fig. 3. The majority of classification terms used in SimChrom were inspired by GO-based classification, yet only a small subset of terms was used. The main focus of the classification was to classify chromatin proteins according to their functions and the biological processes that they are involved in, but genomic location (which is also indirectly related to function) and physical properties (e.g., high-mobility group proteins of A, B and N families) were also considered (Fig. 3 highlights which classification aspects are most relevant to each term using a color bar). The SimChrom ontology was developed simultaneously with the SimChrom dataset in an iterative manner by obtaining sets of proteins annotated by various GO terms, extracting them from literature and domain specific databases, manually curating, validating and filtering (see Methods Section 2.2). Only major splice isoforms of genes are included in SimChrom. The resulting SimChrom dataset contains 3045 proteins, is available as Suppl. Table ST5 and is viewable in the Interactive Fig. 3 at the SimChrom web-site (https://simchrom.intbio.org#classification). The descriptive details about the SimChrom dataset are available in Suppl. R&D Sec. 2.
In the default SimChrom classification (depicted in Fig. 3) every protein from the SimChrom dataset may belong to more than one SimChrom ontology category. This provides the needed degree of flexibility since many proteins indeed may bona fide belong to several categories due to their complex functional, physico-chemical or structural properties. However, in certain cases for holistic analysis an even simpler classification may be useful, which ascribes every protein to only one category. Such single label classification (SimChrom-SL) based on the same SimChrom ontology was also developed (see Methods Section 2.2, Suppl. Fig. SF3_2). Briefly, if the protein belonged to several categories by default it was ascribed to the category with the least number of other proteins (i.e. the most specific category for this protein) with functional categories taking priority (see Suppl. Fig. SF3_2 for category priority order).
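
The single-label rule described above can be sketched in a few lines of Python. This is a simplified illustration and not the SimChrom-SL implementation: the actual category priority order is given in Suppl. Fig. SF3_2 and is not reproduced here, so the sketch assumes only a two-level priority (functional categories before all others), breaks ties alphabetically (an arbitrary choice), and uses hypothetical protein and category names.

```python
def assign_single_label(protein_categories, functional_categories):
    """Pick one category per protein: the smallest one, functional ones first.

    protein_categories: dict mapping protein -> list of multi-label categories.
    functional_categories: set of category names treated as functional.
    """
    # Category size = number of proteins carrying that label in the multi-label data
    sizes = {}
    for cats in protein_categories.values():
        for cat in cats:
            sizes[cat] = sizes.get(cat, 0) + 1

    single = {}
    for protein, cats in protein_categories.items():
        functional = [c for c in cats if c in functional_categories]
        # Functional categories take priority; fall back to all categories otherwise
        candidates = functional or list(cats)
        # Smallest category wins; alphabetical tie-break is an assumption of this sketch
        single[protein] = min(candidates, key=lambda c: (sizes[c], c))
    return single


# Hypothetical toy input
protein_categories = {
    "A": ["Transcription factors", "Histone PTM writers"],
    "B": ["Transcription factors"],
    "C": ["Transcription factors", "Centromere associated"],
    "D": ["Centromere associated"],
}
functional = {"Transcription factors", "Histone PTM writers"}
labels = assign_single_label(protein_categories, functional)
```

In this toy example protein C is assigned to the larger functional category rather than the smaller non-functional one, which is the effect of the priority rule.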
As auxiliary datasets based on the results of Section 3.1 we have compiled several reference datasets of nuclear and non-nuclear proteins at different levels of support (depending on whether nuclear localization is supported by one or several localization databases), confidence (depending on the evidence codes and reliability scores provided by the databases), and also whether proteins are uniquely localized in the nucleus or have multiple localization in the nucleus and other cellular compartments (see Methods Section 2.2.2). The list of the datasets and their definition is presented in Suppl. Table ST6; the datasets are available for download in the Interactive Table 2 at https://simchrom.intbio.org/#download. Instrumental for our further analysis will be the “nuclear localization consensus” (NULOC_CS) dataset, i.e. the set of nuclear proteins whose nuclear localization is supported (with sufficiently good confidence levels) both by UniProt and HPA and does not contradict the data from OpenCell, and the “nuclear localization joint dataset with no evidence code filtering” (NULOC_JT_NECF) dataset, i.e. the maximally broad set of nuclear proteins, which includes proteins whose nuclear localization is supported by any of the localization databases at any level of confidence. The NULOC_CS dataset contains 3296 entries, while NULOC_JT_NECF contains 8912 entries.
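
The logic of the two reference sets can be expressed with plain set operations, as in the following Python sketch. It is an illustration under explicit assumptions rather than the procedure of Methods Section 2.2.2: in particular, "does not contradict OpenCell" is interpreted here as "the protein is not annotated by OpenCell exclusively with non-nuclear localizations", and all argument names and identifiers are hypothetical.

```python
def build_reference_sets(uniprot_hc, hpa_hc, uniprot_all, hpa_all,
                         opencell_nuclear, opencell_annotated):
    """Return (consensus, joint) nuclear protein sets.

    *_hc: nuclear proteins passing confidence filters in a resource.
    *_all: nuclear proteins at any confidence level.
    opencell_annotated: every protein with any OpenCell localization.
    """
    consensus = uniprot_hc & hpa_hc
    # Assumed meaning of "contradiction": annotated in OpenCell, but never as nuclear
    contradicted = opencell_annotated - opencell_nuclear
    nuloc_cs = consensus - contradicted
    # Joint set: nuclear according to any resource, no evidence filtering
    nuloc_jt_necf = uniprot_all | hpa_all | opencell_nuclear
    return nuloc_cs, nuloc_jt_necf


# Hypothetical toy input
nuloc_cs, nuloc_jt_necf = build_reference_sets(
    uniprot_hc={"P1", "P2", "P3"},
    hpa_hc={"P2", "P3", "P4"},
    uniprot_all={"P1", "P2", "P3", "P5"},
    hpa_all={"P2", "P3", "P4", "P6"},
    opencell_nuclear={"P2", "P7"},
    opencell_annotated={"P2", "P3", "P7"},
)
```

Proteins absent from OpenCell are kept in the consensus set under this interpretation, since missing data is not treated as a contradiction.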
To evaluate the contents of our SimChrom dataset we performed its cross-comparison to the localization based datasets described above (NULOC_CS and NULOC_JT_NECF) (see Suppl. Fig. SF3_3). Detailed discussion of the results is provided in Suppl. R&D Sec. 2. Briefly, almost all SimChrom proteins had some evidence of nuclear localization (95% were present in the NULOC_JT_NECF dataset, 60% in the NULOC_CS dataset, see Suppl. Fig. SF3_3). For the SimChrom proteins that did not have high confidence support of nuclear localization (not present in NULOC_CS), GO enrichment analysis of SimChrom-exclusive proteins revealed minimal association with non-nuclear functions, with only a minor subset (~10 centromere-associated proteins) linked to such categories (Suppl. Fig. SF3_4, Suppl. Table ST8). Moreover, no additional chromatin-related GO categories were found to be underrepresented in SimChrom, indicating its broad coverage of chromatin-associated functions (Suppl. Table ST9). 60% of SimChrom proteins were successfully identified in MS-based chromatomes and nucleomes (see Suppl. R&D Sec. 1.3, Suppl. Fig. SF2_7). The remaining 40%, predominantly low-abundant transcription factors, were likely undetected due to their transient nature and dynamic interaction properties, which pose challenges for MS-based detection. Furthermore, GO enrichment analysis of MS-derived proteins absent in SimChrom or nuclear reference sets did not reveal a bona fide chromatin-associated category (see Suppl. Table ST7). Together, these results support the quality of the SimChrom dataset, suggesting that SimChrom is sufficiently comprehensive in its coverage of chromatin-related proteins and its categories.
3.3. Analysis of the human chromatome
Equipped with the datasets described above we aimed at a comprehensive characterization of the chromatome, including characterization of its composition (numbers of proteins belonging to different chromatin categories), abundance (the number of individual protein molecules present in the cells), physico-chemical properties of the amino acid sequences of the proteins, their domain architectures and interaction patterns (including engagement in multivalent interactions). The full discussion of the results is presented in Suppl. R&D Sec. 3 and Suppl. Fig. SF4_1 - SF4_4, SF5_1 - SF5_5, SF6_1 - SF6_3, SF8_1, SF8_2. The sections below summarize our analysis.
3.3.1. The chromatome composition and abundance of chromatin proteins
resulting 2D projections onto the main UMAP components revealed that (1) chromatin and cytoplasmic proteins occupied overlapping domains on the 2D map, but with a visible shift between their centers, suggesting there is an overall difference in the average amino acid composition, (2) certain chromatin protein groups formed dedicated clusters on the map, suggesting significant distinctness in their composition (see cluster 1, 2 and outliers shown by the arrows in Fig. 5K,L). Further analysis revealed that in the 2D UMAP map transcription factors containing zinc finger domains and homeodomains formed distinct clusters (see Suppl. Fig. SF5_3A,B). The most distinct group (cluster 1) was almost exclusively (415 out of 422) composed of zinc-finger containing DNA-binding transcription factors (240 housekeeping and 175 non-housekeeping) with a median number of zinc-finger domains (ZFD) of around 10 (Suppl. Fig. SF5_3C). Zinc-finger containing DNA-binding transcription factors were also present in cluster 2, but the median number of ZFD in that cluster was only three, hence these proteins contain a lower proportion of amino acids specific to ZFD (Suppl. Fig. SF5_3D). ZFD are enriched in histidine and cysteine (see Fig. 5J and discussion below). Other protein groups that occupied distinct positions on the UMAP map included (1) histones, (2) serine/arginine-rich splicing factors (enriched in serine and arginine), and (3) reverse transcriptases of endogenous retroviruses (enriched in isoleucine and threonine) (see Fig. 5K, Suppl. Fig. SF5_3D).
The detailed analysis of the amino acid composition of different chromatin protein groups is presented in Suppl. R&D Sec. 3.2. Briefly, the top four enriched amino acids in chromatin proteins are serine, cysteine, proline, and histidine (Fig. 5J). The enrichment of cysteine and histidine is solely contributed by the ZFD of transcription factors (Suppl. Table ST13, Suppl. Fig. SF5_4B, Suppl. Fig. SF5_5A). The total enrichment of serine and proline in chromatin proteins is attributed to their enrichment in the non-IDR regions (relative to IDR and non-IDR regions of cytoplasmic proteins), and more importantly to the higher proportion of IDR regions in chromatin proteins (46% vs 23%), which in turn have a considerably higher proportion of these amino acids than non-IDRs (Suppl. Table ST13). Serine was also enriched in non-IDR regions globally, while the enrichment of proline in non-IDRs was observed only in a few categories (e.g., HMG-proteins) (Suppl. Fig. SF5_4H).
The enrichment of positively charged amino acids is only statistically significant for lysine, but not for arginine, and the enrichment is relatively moderate (1.03 in chromatin) (Suppl. Fig. SF5_4A). Arginine is highly enriched in non-IDRs, but it is the most depleted amino acid in IDRs of chromatin proteins versus the respective regions of the cytoplasmic ones. The depletion of negatively charged amino acids in chromatin/nuclear proteins is statistically significant for aspartate (fold enrichment is around 0.9), while the depletion of glutamate is statistically non-significant. Interestingly, aspartate is enriched in IDRs and significantly depleted in non-IDRs. This suggests that the increased positive charge of chromatin/nuclear proteins has its main contributions in the depletion of aspartate and enrichment of arginine in non-IDRs, and moderate global enrichment of lysine.
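
The fold enrichment values quoted above (e.g., around 1.03 for lysine and 0.9 for aspartate) are ratios of amino acid frequencies in chromatin proteins to those in a cytoplasmic background. A minimal Python sketch of this ratio is given below. It is only an illustration: the toy sequences and function names are hypothetical, frequencies are pooled over all residues of a group (an assumption), and the statistical testing and IDR/non-IDR partitioning used in the paper are not shown.

```python
from collections import Counter


def aa_frequencies(sequences):
    """Pooled amino acid frequencies over all residues of a group of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def fold_enrichment(target_seqs, background_seqs):
    """Ratio of target to background frequency for each background amino acid."""
    target = aa_frequencies(target_seqs)
    background = aa_frequencies(background_seqs)
    return {aa: target.get(aa, 0.0) / freq for aa, freq in background.items()}


# Hypothetical toy sequences (not real proteins)
chromatin = ["KKSD", "KSSA"]
cytoplasmic = ["KSDD", "ASDA"]
enrichment = fold_enrichment(chromatin, cytoplasmic)
```

Values above 1 indicate enrichment and values below 1 indicate depletion relative to the background; in the toy example lysine is enriched and aspartate is depleted.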
Among the most depleted amino acids in chromatin/nuclear proteins are hydrophobic aliphatic amino acids: they are relatively rare in IDRs and hence the large proportion of IDRs in chromatin proteins accounts for their lower total fraction (Suppl. Fig. SF5_4F,I). Tryptophan, which is the rarest amino acid (~1% in proteins), is the most depleted amino acid on average in chromatin/nuclear proteins and in almost all chromatin categories, except for a few.
3.3.3. Domain composition of chromatin proteins and identification of new structural domains
Figure 6. Domain composition of chromatin proteins and identification of new structural domains. (A) A schematic overview of chromatin proteins’ domain annotation analysis and identification of uncharacterized new domains. Sources of annotation and typical annotation patterns for an abstract protein are schematically outlined. A structure with a novel domain identified using the AI-based annotation pipeline implemented in the TED resource is shown on the right. (B) Cumulative annotation coverage of all chromatin protein sequences combined at the amino acid level via different resources. Annotation coverage with experimental structures in the PDB database, the AlphaFold database, and three domain annotation databases (Pfam, CATH, TED) is presented. For AlphaFold and PDB additional information about the fraction of annotated amino acids belonging to IDRs and non-IDRs is depicted (see Methods Section 2.4). For all annotations the fraction of amino acids belonging to annotated Pfam domain models in the Pfam database is additionally depicted; for Pfam and TED the fraction of amino acids resolved in PDB is additionally depicted. (C) Analysis of the structural domains in chromatin proteins identified by the TED resource via an AlphaFold-based algorithm. The number and fractions of structural domains that have matching structures in the PDB database at various levels of sequence identity are depicted. The structural matches were identified via FoldSeek (see Methods Section 2.5). Of those domains that were not matched to PDB structures directly, a few were annotated by CATH (depicted in orange); the remaining fraction (depicted in magenta) represents novel structural domains present only in TED. (D) Analysis of functional domain diversity in chromatin proteins as identified by the Pfam database. 11147 domains belonging to 1753 Pfam domain models were identified. The plot characterizes domain models with respect to the availability of a matching structure in PDB (the median sequence identities of the matches between the chromatin proteins’ domains belonging to the respective Pfam model and their best structural match in the PDB database as identified by FoldSeek are shown), an annotated TED domain, or otherwise the absence of structural characterization (see Methods Section 2.5). (E) Analysis of functional domain diversity in chromatin proteins as identified by the Pfam database for proteins belonging to different chromatin categories according to the SimChrom-SL classification. Subpanels 1-5 represent various characteristics.
Figure 7. The most representative protein domains/families (according to Pfam) in proteins belonging to functional SimChrom categories. The dashed rectangles highlight the presence of particular groups of domains in certain categories of chromatin proteins (left - transcription factors and similar DNA-binding proteins, right - various histone interacting and modifying proteins). The plot is based on the SimChrom-SL chromatin protein classification; only Pfam domain models present in more than five proteins were considered. Only datapoints with the size of more than 5% are displayed. The full size plots based on both SimChrom and SimChrom-SL classifications are available as Interactive Fig. 4 (https://simchrom.intbio.org/#domain_composition).
Next we set out to systematically analyze the available data on structural characterization, domain annotation and domain composition of chromatin proteins. We specifically explored the structurally uncharacterized portion of the chromatome (the “dark” proteome) and identified potential new structural domains that are predicted by AI-based protein structure prediction tools (see Fig. 6A).
Historically, protein domains are loosely defined as evolutionarily conserved units with similarities at functional, structural and/or sequence levels [72]. Related individual protein domains may be grouped and aligned to produce domain models, catalogued and annotated by a number of resources/databases such as PFAM [73], CDD [74], CATH [75], InterPro [76, p20], etc. (see Suppl. R&D Sec. 3.3 for a thorough discussion). The ultimate experimental structural characterization of chromatin proteins is available in the PDB database; however, recent progress in protein structure prediction spurred by AlphaFold resulted in new approaches to the structural characterization and discovery of new structural domains (e.g., as implemented in the TED database used below [52]) (Fig. 6A).
Fig. 6B shows the fractions of the aggregate number of amino acids in all human chromatin proteins (referred to below as “aggregate chromatome sequence”, or ACS) which are structurally characterized or have domain annotations in different databases. Detailed discussion is available in Suppl. R&D Sec. 3.3. Briefly, despite recent tremendous progress in structural biology many human chromatin proteins still lack direct structural characterization. On one hand only 25% of ACS can be mapped directly to PDB structures and 25% can be mapped to known structural protein superfamilies (through CATH). On the other hand, AlphaFold 2 identifies 53% of ACS as belonging to non-IDRs, and TED predicts that 35% of ACS belong to domains having well-defined 3D structures. The latter is a conservative estimate of structurally characterizable ACS, since both partially ordered and disordered regions can become ordered in protein-protein complexes. For example, 9% of ACS is available in PDB (where protein complexes are present) while not being annotated with TED domains (which rely on single protein chain structure predictions). This is in line with the fact that among 6246 TED domains found in chromatin proteins almost half of them (42%) are directly covered by the PDB database. However, the majority of other domains (56%) can be matched to a PDB structure of a homologous protein at various levels of sequence identity (from 99% to 5%, see Fig. 6C). The majority of these homologous domains are in fact different paralogous sequences found within human genes (even for domains with sequence identity of 35-50% the fraction of human sequences among the matches was 51%); for matches with sequence identity above 35% the second largest contribution came from structures of mammalian homologues; for matches with sequence identity below 35% significant contributions were from structures derived from proteins of fungi, protostomia and bacteria (see Suppl. Fig. SF6_1A for details). Additionally, 6% of TED domains that lacked direct hits among the PDB structures were mapped to protein structural superfamilies in the CATH database. The remaining 4% (241 domains) could not be matched to any known protein structure or protein structure superfamily and potentially represent new types of structural superfamilies/folds. These domains are presented in Suppl. Table ST14 (see also Interactive Table 3 at https://simchrom.intbio.org/#novel_structural_domains), ranked via their structural complexity by the number of their secondary structure elements. Among these domains, 123 domains have annotations in Pfam or other domain annotation databases present in InterPro, leaving 118 domains that are completely without annotations. The latter domains belong to 106 chromatin proteins, which may be considered as prospective new targets for experimental studies of their function and structure. Among such proteins are, for example, (1) a protein encoded by the GTF3C1 gene (it has a previously unannotated and uncharacterized structural domain with a length of 233 amino acids, see detailed characterization in Suppl. Fig. SF6_2A), (2) the globular domain of the testis specific linker histone H1.7, which has a quite different sequence from other H1 proteins, resulting in a predicted structure that has a different topology (the “wing” of the globular domain consists of three beta-sheets rather than two [77], see Suppl. Fig. SF6_2B and Suppl. R&D Sec. 3.3).
We used the sequence-based Pfam domain annotation to characterize the diversity of different types of evolutionarily related protein domains (hereafter referred to as Pfam domain models or Pfam domain types) found in chromatin proteins and the typical domain composition thereof. In total, 1753 different Pfam domain models matched various parts of chromatin proteins (Fig. 6D). 42% of these were considered fully structurally characterized, i.e., every individual domain in chromatin proteins belonging to these models can be found in PDB. 34% of domain models are partially characterized: their domains could be matched to a PDB structure of a homolog (using FoldSeek, see Methods Section 2.5). 14% of these Pfam domain models were not matched by FoldSeek to PDB structures with our strict criteria (see Methods Section 2.5), but could still be identified in PDB via sequence search methods; these represented more flexible domains with IDR regions, repeats and coiled-coils (34 Pfam models), DNA-binding motifs, etc. 3% (55 domain models) could be matched to structural domains predicted by AlphaFold and found in the TED database. These represent prospective targets for validation with structural biology methods and further investigation of their interactions. For instance, among these domain models are domains potentially associated with chromatin remodeling (SANTA, zf-C3Hc3H), histone PTM writing (DUF7030, COMPASS-Shg1), zinc fingers (zf_CCCH_4, zf-LITAF-like, zf-WIZ, SWIM), etc. 7% of Pfam domain models currently have no structural information that can be assigned either through the PDB or TED databases.
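The tiered assignment described above can be read as a priority cascade in which each Pfam domain model receives the strongest structural evidence available for it. The following is a minimal sketch under that assumption; the tier labels, the evidence flags and the toy model names are hypothetical and are not taken from the SimChrom source code.

```python
from collections import Counter

# Evidence tiers in priority order (labels are illustrative assumptions).
TIERS = [
    "pdb_all",               # every domain instance found in PDB
    "pdb_homolog_foldseek",  # matched to a PDB structure of a homolog
    "pdb_sequence_search",   # found in PDB only via sequence search
    "ted_alphafold",         # matched to an AlphaFold-predicted TED domain
]

def classify_model(evidence):
    """Return the first (strongest) tier whose flag is set, else 'no_structure'."""
    for tier in TIERS:
        if evidence.get(tier, False):
            return tier
    return "no_structure"

def tier_shares(models):
    """Percentage of domain models per tier, rounded to whole percent."""
    counts = Counter(classify_model(ev) for ev in models.values())
    total = len(models)
    return {tier: round(100 * n / total) for tier, n in counts.items()}

# Toy input: hypothetical model identifiers with hypothetical evidence flags.
toy = {
    "PF_A": {"pdb_all": True},
    "PF_B": {"pdb_homolog_foldseek": True, "ted_alphafold": True},
    "PF_C": {"ted_alphafold": True},
    "PF_D": {},
}
shares = tier_shares(toy)
```

Because the tiers are mutually exclusive in this reading, the shares sum to 100%, as do the percentages reported in the text.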
We next analyzed the diversity of Pfam domain models in various SimChrom-SL protein categories (Fig. 6E, subpanels 1,2) and the domain content of individual proteins belonging to these categories (Fig. 6E, subpanels 3-5). Detailed discussion is available in Suppl. R&D Sec. 3.3. Briefly, the number of distinct Pfam domain models found in chromatin proteins (~1700) is comparable to the number of chromatin proteins (~3000); at the same time, an average chromatin protein usually contains two Pfam domains representing two different domain models. The majority of Pfam domain models are present only in a single chromatin protein, but there are also those that are present in dozens or even hundreds of proteins (Suppl. Fig. SF6_1B). Certain chromatin groups stand out in terms of their domain composition in some aspects: the number of individual domains is high in housekeeping TF (due to ZFDs); transcription factors, histones and HMG proteins are relatively poor in their domain diversity (i.e., the proteins in these categories harbor a limited number of distinct Pfam domain models); histone PTM writers on average have domains belonging to three different domain models (while this number is one or two for all others). Still, a considerable number of chromatin proteins may harbor domains belonging to several domain models. DNA-acting enzymes, histone PTM writers, chaperones, remodelers and transcription factors may have as many as 8-9 Pfam domain models present in their sequence (see Suppl. Table ST15). There are 118 chromatin proteins harboring at least five different domain types (see Suppl. Fig. SF6_1C, left panel). This highlights the multivalency of protein interactions in chromatin, keeping in mind that many proteins further form protein-protein complexes, increasing their interaction potential (see next section). The average individual domain length in chromatin proteins is around 65 amino acids (the median is 28 aa); however, this number is biased by the presence of many zinc-finger domains (around 22 aa in length). Subpanel 5 in Fig. 6E gives a more balanced view for each SimChrom category. For the majority of protein groups the median domain length in protein is around 100 amino acids (mean is 137, median is 134). Only 70 chromatin proteins had no domain annotation at all.
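The bias introduced by short, highly repeated zinc-finger domains can be illustrated by comparing pooled statistics over all domain instances with statistics computed per protein first. The sketch below uses made-up domain lengths and hypothetical protein names; treating the "more balanced view" as a median of per-protein medians is an assumption made here for illustration only.

```python
from statistics import mean, median

# Hypothetical toy data: protein -> lengths (aa) of its annotated domains.
domains = {
    "ZNF_like": [22, 22, 22, 22, 22, 22, 40],  # many short zinc fingers
    "remodeler_like": [110, 250],
    "histone_like": [95],
}

# Pooled view: every domain instance counts once, so the protein with
# many zinc fingers dominates the distribution.
all_lengths = [n for lengths in domains.values() for n in lengths]
pooled_mean = mean(all_lengths)
pooled_median = median(all_lengths)

# Per-protein view: summarize each protein first, then aggregate,
# so each protein contributes a single value.
per_protein_median = {p: median(lengths) for p, lengths in domains.items()}
median_of_medians = median(per_protein_median.values())
```

In this toy set the pooled median is pulled down to the zinc-finger length, while the per-protein aggregate is not, which mirrors the direction of the effect described in the text.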
The bird's-eye view of the most frequently matched Pfam domain models in proteins of various functional SimChrom-SL categories is presented in Fig. 7. The data are presented for domain models that occur in at least five chromatin proteins and in at least 10% of proteins in a category (the threshold for data point depiction is 5%). The comprehensive interactive analysis figure with the ability to alter these thresholds and switch between the SimChrom and SimChrom-SL classification systems is available at Interactive Fig. 4 (https://simchrom.intbio.org/#domain_composition). In Fig. 7 the following categories and their respective domains can be grouped, revealing their partially shared domain composition: 1) the categories containing transcription factors, whose zinc finger, homeodomain and KRAB domains form the most frequently occurring entities, 2) some chromatin regulators, such as PTM writers, readers, erasers and chromatin remodelers, together with their Chromo-, Bromo-, and PHD domains.
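The two selection thresholds used for Fig. 7 (a minimum number of chromatin proteins overall and a minimum share of proteins within a category) can be sketched as follows. The function name, argument names and toy data are illustrative assumptions; the denominator for the share is assumed here to be all proteins assigned to the category.

```python
from collections import Counter, defaultdict

def frequent_models(protein_models, protein_category, min_proteins=5, min_share=0.10):
    """Return {(category, model): share} for models passing both thresholds."""
    total = Counter()               # model -> number of proteins containing it
    per_cat = defaultdict(Counter)  # category -> model -> number of proteins
    cat_size = Counter(protein_category.values())
    for protein, models in protein_models.items():
        for model in set(models):   # count each model once per protein
            total[model] += 1
            per_cat[protein_category[protein]][model] += 1
    selected = {}
    for cat, counts in per_cat.items():
        for model, n in counts.items():
            share = n / cat_size[cat]
            if total[model] >= min_proteins and share >= min_share:
                selected[(cat, model)] = share
    return selected

# Toy example with hypothetical proteins.
proteins = {f"TF{i}": ["zf-C2H2", "zf-C2H2"] for i in range(1, 7)}
proteins["TF1"].append("KRAB")
proteins.update({"REM1": ["SNF2_N"], "REM2": [], "REM3": [], "REM4": []})
categories = {p: ("TF" if p.startswith("TF") else "remodeler") for p in proteins}
selected = frequent_models(proteins, categories)
```

Lowering `min_proteins` or `min_share` corresponds to relaxing the thresholds, as can be done in the interactive figure.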
3.3.4. Multivalent interactions in chromatin proteins
Figure 8. Analysis of multivalent interactions in chromatin proteins. (A) Schematic illustration of multivalent interactions. (B) Distribution of chromatin proteins from the chromatin/epigenetic regulators group with respect to the total number of Pfam domains (right) and distinct Pfam domain models (left); red lines indicate median values. The distributions for all chromatin proteins are shown in Suppl. Fig. SF6_2C. (C) Co-occurrence of domains corresponding to different Pfam domain models in chromatin proteins. Only domains found in proteins belonging to chromatin/epigenetic regulator groups are depicted (see Methods Section 2.5 and Fig. 3). Domains are grouped into several functional classes (see description at the top of the plot). The values indicate the conditional probabilities of a domain in column (A) occurring alongside a domain in row (B) in chromatin proteins. Along the diagonal, data belonging to individual domain groups are highlighted with shading; dashed lines highlight groups associated with histone methylation or acetylation. The following abbreviations are used in the domain subgroup names for the latter: W - writers, R - readers, and E - erasers. (D) Co-occurrence of domains from different functional classes in chromatin proteins and protein complexes. Only combinations that are present in more than one protein or complex are shown, see full version in Suppl. Fig. SF8_2. (E) Examples of domain architectures in chromatin proteins containing the largest number of chromatin/epigenetic regulator domains. The top shows domains from 3D structures colored by their main function; the links between domains are not shown. The bottom shows the order of domains at the sequence level.
The presence of multiple domains (belonging to the same or different domain models) in chromatin proteins is a known feature contributing to their ability to engage in multivalent interactions (Fig. 8A) [10]. Below we present the analysis of such domains engaged in multivalent interactions (referred to as EMVI-domains hereafter) that are found in chromatin/epigenetic regulator proteins (see Fig. 3 for the definition of this group). These proteins often contain many domains (Fig. 8B). The median total number of domains found in chromatin proteins and in chromatin/epigenetic regulators is two. Nevertheless, many chromatin proteins contain more (16% have three domains, 10% have four, 7% have five, and 10% have six to fourteen). There are 409 Pfam domain models that are found in combination with other models or in multiple copies in at least one chromatin regulator protein. To limit our analysis to a manageable set of EMVI-domains, we selected those that were found in multiple copies or in combination with another Pfam domain in at least three chromatin regulator proteins (94 Pfam domain models in total), and from those we selected 59 domain models that we were able to manually classify according to their functional binding modes, based on the information currently available in the literature. The following functional groups of domains were used: histone methylation/acetylation/phosphorylation, chromatin remodeling, histone binding, DNA binding, DNA methylation, protein dimerization/oligomerization, PPI, RNA binding. Domains associated with histone post-translational modifications were further subdivided into reader, writer and eraser functional subgroups (see Fig. 8C and Suppl. Table ST16 for the list of domains and their detailed classification). Domains involved in histone methylation are the most represented in chromatin regulators, followed by DNA binding, histone acetylation, histone phosphorylation and chromatin remodeling associated domains (Suppl. Fig. SF8_1A).
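The first selection step for candidate EMVI-domains (a model must occur in multiple copies, or together with another Pfam model, in at least three chromatin regulator proteins) can be expressed as a simple count. This is a sketch under that reading; the function name and the toy proteins are hypothetical, and the subsequent manual literature-based classification is not modeled.

```python
from collections import Counter

def candidate_emvi_models(protein_domains, min_proteins=3):
    """protein_domains: protein -> list of Pfam model names, one per domain instance.

    A protein supports a model if the model is present in several copies
    or alongside at least one other model in that protein.
    """
    support = Counter()
    for domains in protein_domains.values():
        copies = Counter(domains)
        has_other_models = len(copies) > 1
        for model, n in copies.items():
            if n > 1 or has_other_models:
                support[model] += 1
    return {model for model, n in support.items() if n >= min_proteins}

# Toy example with hypothetical regulator proteins.
regulators = {
    "P1": ["PHD", "PHD"],     # multiple copies
    "P2": ["PHD", "Bromo"],   # combination of two models
    "P3": ["PHD", "Chromo"],
    "P4": ["Bromo"],          # single domain: no support
    "P5": ["Bromo", "SET"],
}
candidates = candidate_emvi_models(regulators)
```

With `min_proteins=1` the same function corresponds to the broader "at least one chromatin regulator protein" criterion mentioned in the text.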
We analyzed the co-occurrence of selected EMVI-domains in all chromatin proteins. There were in total 851 chromatin proteins (589 of these are transcription factors) that had more than one EMVI-domain. The conditional probability of finding a corresponding domain A in a chromatin protein given that another domain B is already present was estimated and is presented in Fig. 8C (columns and rows correspond to domains A and B, respectively). The Interactive Fig. 5 is available at https://simchrom.intbio.org/#domain_co-occurrence (it also extends the analysis to unclassified potential EMVI-domains found in at least two chromatin regulator proteins). The matrix in Fig. 8C allows one to trace the interplay between different domains employed in the architectures of chromatin proteins. The largest groups of domains in Fig. 8C are those involved in histone methylation and DNA binding, suggesting that these mechanisms are the most represented and employed in the regulation of chromatin functioning. See Suppl. R&D Sec. 3.4 for a detailed discussion of the results. Briefly, in certain cases one can see 100% association between the presence of various domains in chromatin proteins. This may be due to direct structural interactions between the domains or, likely, due to functional reasons. Among the Pfam domains that co-occur with the largest number of other different Pfam domains is the PHD domain
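The conditional probability shown in Fig. 8C can be estimated from protein counts as P(A | B) = N(A and B) / N(B), where N(B) is the number of proteins containing domain B and N(A and B) the number containing both. The sketch below follows that count-based estimate; the function name and the toy proteins are assumptions for illustration.

```python
from itertools import product

def cooccurrence(protein_models, models):
    """Return {(A, B): P(A | B)} estimated from per-protein presence/absence."""
    sets = [set(m) for m in protein_models.values()]
    prob = {}
    for a, b in product(models, repeat=2):
        n_b = sum(1 for s in sets if b in s)
        n_ab = sum(1 for s in sets if a in s and b in s)
        # P(A | B) is undefined when no protein contains B.
        prob[(a, b)] = n_ab / n_b if n_b else float("nan")
    return prob

# Toy example with hypothetical proteins.
toy = {
    "P1": ["PHD", "Bromo"],
    "P2": ["PHD"],
    "P3": ["PHD", "Bromo"],
    "P4": ["PHD", "SET"],
}
prob = cooccurrence(toy, ["PHD", "Bromo", "SET"])
```

The matrix is not symmetric: here every Bromo-containing protein also has a PHD domain, so P(PHD | Bromo) is 1.0, while only half of the PHD-containing proteins have a Bromo domain. A value of 1.0 corresponds to the "100% association" cases mentioned in the text.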
Taken together, we hope that our work establishes a holistic framework for further advances in the field of chromatin research, which will help to understand genome functioning through a deeper appreciation of the complex role played by the chromatome.
ACKNOWLEDGEMENTS
We thank A.L. Sivkina, D.K. Malinina, N.S. Gerasimova, A.V. Lyubitelev, S.V. Ulianov, and A.A. Gavrilov for valuable discussions that helped to improve this work.
AUTHOR CONTRIBUTIONS
AKG: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Software, Visualization. GAA: Resources, Software. MPK: Conceptualization, Writing – review & editing. AKS: Conceptualization, Formal analysis, Funding acquisition, Methodology, Supervision, Validation, Writing – original draft, Writing – review & editing.
SUPPLEMENTARY DATA
Supplementary material is available online, including supplementary figures, tables, supplementary results and discussion.
CONFLICT OF INTEREST
None declared.
FUNDING
This work was funded by the Russian Science Foundation grant #25-14-00046 (https://rscf.ru/en/project/25-14-00046/) (construction of chromatin protein classification, analysis of chromatin proteins domain composition, AI-based prediction of new structural domains), the Russian Science Foundation grant #23-74-10012 (https://rscf.ru/en/project/23-74-10012/) (analysis of physicochemical properties of chromatin proteins), and within the framework of the Ministry of Science and Higher Education of the Russian Federation project “Whole-Genome Epigenetic Analysis as the Basis for the Development of Genetic Technologies for the Prevention and Treatment of COVID” (FFRW-2023-0007), no. 123120500032-9 (analysis of multivalent interactions of chromatin protein domains). A.K.S. was supported by the HSE University Basic Research Program (structural characterization of chromatin proteins) and A.K.G. was supported by the Gennady Komissarov Foundation (construction of reference datasets about protein localization).
DATA AVAILABILITY
The SimChrom database, including interactive supplementary materials about the classification, localization, functions and domain composition of chromatin proteins, is freely available at a GitHub-hosted web site, https://simchrom.intbio.org/. The SimChrom source code is available on GitHub at https://github.com/intbio/SimChrom and archived via Zenodo.
REFERENCES
1. Van Holde KE. Chromatin. Springer; 1989. doi:10.1007/978-1-4612-3490-6
2. Bernstein E, Allis CD. RNA meets chromatin. Genes Dev. 2005;19(14):1635-1655. doi:10.1101/gad.1324305
3. Imhof A, Bonaldi T. “Chromatomics” the analysis of the chromatome. Mol BioSyst. 2005;1(2):112-116. doi:10.1039/B502845K
4. Torrente MP, Zee BM, Young NL, et al. Proteomic Interrogation of Human Chromatin. PLOS ONE. 2011;6(9):e24747. doi:10.1371/journal.pone.0024747
5. Ulianov SV, Velichko AK, Magnitov MD, et al. Suppression of liquid-liquid phase separation by 1,6-hexanediol partially compromises the 3D genome organization in living cells. Nucleic Acids Res. 2021;49(18):10524-10541. doi:10.1093/nar/gkab249
6. Rippe K. Liquid–Liquid Phase Separation in Chromatin. Cold Spring Harb Perspect Biol. 2022;14(2):a040683. doi:10.1101/cshperspect.a040683
7. Ulianov SV, Khrameeva EE, Gavrilov AA, et al. Active chromatin and transcription play a key role in chromosome partitioning into topologically associating domains. Genome Res. 2016;26(1):70-84. doi:10.1101/gr.196006.115
8. Davidson IF, Peters JM. Genome folding through loop extrusion by SMC complexes. Nat Rev Mol Cell Biol. 2021;22(7):445-464. doi:10.1038/s41580-021-00349-7
9. Kanada R, Terakawa T, Kenzaki H, Takada S. Nucleosome Crowding in Chromatin Slows the Diffusion but Can Promote Target Search of Proteins. Biophys J. 2019;116(12):2285-2295. doi:10.1016/j.bpj.2019.05.007
10. Ruthenburg AJ, Li H, Patel DJ, Allis CD. Multivalent engagement of chromatin modifications by linked binding modules. Nat Rev Mol Cell Biol. 2007;8(12):983-994. doi:10.1038/nrm2298
11. Armeev GA, Gribkova AK, Shaytan AK. NucleosomeDB - a database of 3D nucleosome structures and their complexes with comparative analysis toolkit. bioRxiv. Preprint posted online April 18, 2023:2023.04.17.537230. doi:10.1101/2023.04.17.537230
12. Armeev GA, Gribkova AK, Shaytan AK. Nucleosomes and their complexes in the cryoEM era: Trends and limitations. Front Mol Biosci. 2022;9:1070489. doi:10.3389/fmolb.2022.1070489
13. Doronin SA, Ilyin AA, Kononkova AD, et al. Nucleoporin Elys attaches peripheral chromatin to the nuclear pores in interphase nuclei. Commun Biol. 2024;7(1):1-18. doi:10.1038/s42003-024-06495-w
14. Pletenev IA, Bazarevich M, Zagirova DR, et al. Extensive long-range polycomb interactions and weak compartmentalization are hallmarks of human neuronal 3D genome. Nucleic Acids Research. 2024;52(11):6234-6252. doi:10.1093/nar/gkae271
15. Consens ME, Dufault C, Wainberg M, et al. Transformers and genome language models. Nat Mach Intell. 2025;7(3):346-362. doi:10.1038/s42256-025-01007-9
16. Hwang Y, Cornman AL, Kellogg EH, Ovchinnikov S, Girguis PR. Genomic language model predicts protein co-regulation and function. Nat Commun. 2024;15(1):2880. doi:10.1038/s41467-024-46947-9
17. van Mierlo G, Vermeulen M. Chromatin Proteomics to Study Epigenetics - Challenges and Opportunities. Mol Cell Proteomics. 2021;20:100056. doi:10.1074/mcp.R120.002208
18. Kustatscher G, Grabowski P, Rappsilber J. Multiclassifier combinatorial proteomics of organelle shadows at the example of mitochondria in chromatin data. Proteomics. 2016;16(3):393-401. doi:10.1002/pmic.201500267
19. Ohta S, Bukowski-Wills JC, Sanchez-Pulido L, et al. The Protein Composition of Mitotic Chromosomes Determined Using Multiclassifier Combinatorial Proteomics. Cell. 2010;142(5):810-821. doi:10.1016/j.cell.2010.07.047
20. Sinitcyn P, Richards AL, Weatheritt RJ, et al. Global detection of human variants and isoforms by deep proteome sequencing. Nat Biotechnol. 2023;41(12):1776-1786. doi:10.1038/s41587-023-01714-x
21. Guo T, Steen JA, Mann M. Mass-spectrometry-based proteomics: from single cells to clinical applications. Nature. 2025;638(8052):901-911. doi:10.1038/s41586-025-08584-0
22. Wierer M, Mann M. Proteomics to study DNA-bound and chromatin-associated gene regulatory complexes. Hum Mol Genet. 2016;25(R2):R106-R114. doi:10.1093/hmg/ddw208
23. Kustatscher G, Hégarat N, Wills KLH, et al. Proteomics of a fuzzy organelle: interphase chromatin. EMBO J. 2014;33(6):648-664. doi:10.1002/embj.201387614
24. Ugur E, de la Porte A, Qin W, et al. Comprehensive chromatin proteomics resolves functional phases of pluripotency and identifies changes in regulatory components. Nucleic Acids Research. 2023;51(6):2671-2690. doi:10.1093/nar/gkad058
25. Ginno PA, Burger L, Seebacher J, Iesmantavicius V, Schübeler D. Cell cycle-resolved chromatin proteomics reveals the extent of mitotic preservation of the genomic regulatory landscape. Nat Commun. 2018;9(1):4048. doi:10.1038/s41467-018-06007-5
26. Alabert C, Bukowski-Wills JC, Lee SB, et al. Nascent chromatin capture proteomics determines chromatin dynamics during DNA replication and identifies unknown fork components. Nat Cell Biol. 2014;16(3):281-291. doi:10.1038/ncb2918
27. Shi M, You K, Chen T, et al. Quantifying the phase separation property of chromatin-associated proteins under physiological conditions using an anti-1,6-hexanediol index. Genome Biology. 2021;22(1):229. doi:10.1186/s13059-021-02456-2
28. Alvarez V, Bandau S, Jiang H, et al. Proteomic profiling reveals distinct phases to the restoration of chromatin following DNA replication. Cell Reports. 2023;42(1). doi:10.1016/j.celrep.2023.111996
29. Chou DM, Adamson B, Dephoure NE, et al. A chromatin localization screen reveals poly (ADP ribose)-regulated recruitment of the repressive polycomb and NuRD complexes to sites of DNA damage. Proceedings of the National Academy of Sciences. 2010;107(43):18475-18480. doi:10.1073/pnas.1012946107
30. Federation AJ, Nandakumar V, Searle BC, et al. Highly Parallel Quantification and Compartment Localization of Transcription Factors and Nuclear Proteins. Cell Reports. 2020;30(8):2463-2471.e5. doi:10.1016/j.celrep.2020.01.096
31. Dutta B, Ren Y, Hao P, et al. Profiling of the Chromatin-associated Proteome Identifies HP1BP3 as a Novel Regulator of Cell Cycle Progression. Mol Cell Proteomics. 2014;13(9):2183-2197. doi:10.1074/mcp.M113.034975
32. Geladaki A, Kočevar Britovšek N, Breckels LM, et al. Combining LOPIT with differential ultracentrifugation for high-resolution spatial proteomics. Nat Commun. 2019;10(1):331. doi:10.1038/s41467-018-08191-w
33. Wang H, Syed AA, Krijgsveld J, Sigismondo G. Isolation of Proteins on Chromatin Reveals Signaling Pathway–Dependent Alterations in the DNA-Bound Proteome. Molecular & Cellular Proteomics. 2025;24(3). doi:10.1016/j.mcpro.2025.100908
34. Razin SV, Iarovaia OV, Vassetzky YS. A requiem to the nuclear matrix: from a controversial concept to 3D organization of the nucleus. Chromosoma. 2014;123(3):217-224. doi:10.1007/s00412-014-0459-8
35. Ashburner M, Ball CA, Blake JA, et al. Gene Ontology: tool for the unification of biology. Nat Genet. 2000;25(1):25-29. doi:10.1038/75556
36. The Gene Ontology Consortium, Aleksander SA, Balhoff J, et al. The Gene Ontology knowledgebase in 2023. Genetics. 2023;224(1):iyad031. doi:10.1093/genetics/iyad031
37. Gorski S, Misteli T. Systems biology in the cell nucleus. Journal of Cell Science. 2005;118(18):4083-4092. doi:10.1242/jcs.02596
38. Johnstone CP, Wang NB, Sevier SA, Galloway KE. Understanding and Engineering Chromatin as a Dynamical System across Length and Timescales. Cell Systems. 2020;11(5):424-448. doi:10.1016/j.cels.2020.09.011
39. The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2025. Nucleic Acids Research. 2025;53(D1):D609-D617. doi:10.1093/nar/gkae1010
40. Thul PJ, Åkesson L, Wiking M, et al. A subcellular map of the human proteome. Science. 2017;356(6340):eaal3321. doi:10.1126/science.aal3321
41. Cho NH, Cheveralls KC, Brunner AD, et al. OpenCell: Endogenous tagging for the cartography of human cellular organization. Science. 2022;375(6585):eabi6983. doi:10.1126/science.abi6983
42. Binns D, Dimmer E, Huntley R, Barrell D, O’Donovan C, Apweiler R. QuickGO: a web-based tool for Gene Ontology searching. Bioinformatics. 2009;25(22):3045-3046. doi:10.1093/bioinformatics/btp536
43. Marakulina D, Vorontsov IE, Kulakovskiy IV, Lennartsson A, Drabløs F, Medvedeva YA. EpiFactors 2022: expansion and enhancement of a curated database of human epigenetic factors and complexes. Nucleic Acids Research. 2023;51(D1):D564-D570. doi:10.1093/nar/gkac989
44. Lovering RC, Gaudet P, Acencio ML, et al. A GO catalogue of human DNA-binding transcription factors. Biochimica et Biophysica Acta (BBA) - Gene Regulatory Mechanisms. 2021;1864(11):194765. doi:10.1016/j.bbagrm.2021.194765
45. Itzhak DN, Tyanova S, Cox J, Borner GH. Global, quantitative and dynamic mapping of protein subcellular localization. Hegde RS, ed. eLife. 2016;5:e16950. doi:10.7554/eLife.16950
46. Uhlén M, Fagerberg L, Hallström BM, et al. Tissue-based map of the human proteome. Science. 2015;347(6220):1260419. doi:10.1126/science.1260419
47. Balu S, Huget S, Medina Reyes JJ, et al. Complex portal 2025: predicted human complexes and enhanced visualisation tools for the comparison of orthologous and paralogous complexes. Nucleic Acids Res. 2025;53(D1):D644-D650. doi:10.1093/nar/gkae1085
48. Raudvere U, Kolberg L, Kuzmin I, et al. g:Profiler: a web server for functional enrichment analysis and conversions of gene lists (2019 update). Nucleic Acids Res. 2019;47(W1):W191-W198. doi:10.1093/nar/gkz369
49. Wang M, Herrmann CJ, Simonovic M, Szklarczyk D, von Mering C. Version 4.0 of PaxDb: Protein abundance data, integrated across model organisms, tissues, and cell‐lines. Proteomics. 2015;15(18):3163-3168. doi:10.1002/pmic.201400441
50. Akdel M, Pires DEV, Pardo EP, et al. A structural biology community assessment of AlphaFold2 applications. Nat Struct Mol Biol. 2022;29(11):1056-1067. doi:10.1038/s41594-022-00849-w
51. Bordin N, Sillitoe I, Nallapareddy V, et al. AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms. Commun Biol. 2023;6(1):160. doi:10.1038/s42003-023-04488-9
52. Lau AM, Bordin N, Kandathil SM, et al. Exploring structural diversity across the protein universe with The Encyclopedia of Domains. Science. Published online November 1, 2024. doi:10.1126/science.adq4946
53. van Kempen M, Kim SS, Tumescheit C, et al. Fast and accurate protein structure search with Foldseek. Nat Biotechnol. Published online May 8, 2023:1-4. doi:10.1038/s41587-023-01773-0
54. Gligorijević V, Renfrew PD, Kosciolek T, et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun. 2021;12(1):3168. doi:10.1038/s41467-021-23303-9
55. Lambert SA, Jolma A, Campitelli LF, et al. The Human Transcription Factors. Cell. 2018;172(4):650-665. doi:10.1016/j.cell.2018.01.029
56. Draizen EJ, Shaytan AK, Mariño-Ramírez L, Talbert PB, Landsman D, Panchenko AR. HistoneDB 2.0: a histone database with variants—an integrated resource to explore histones and their variants. Database. 2016;2016:baw014. doi:10.1093/database/baw014
57. Zhang Y, Zhang Y, Song C, et al. CRdb: a comprehensive resource for deciphering chromatin regulators in human. Nucleic Acids Research. 2023;51(D1):D88-D100. doi:10.1093/nar/gkac960
58. Hammond CM, Strømme CB, Huang H, Patel DJ, Groth A. Histone chaperone networks shaping chromatin function. Nat Rev Mol Cell Biol. 2017;18(3):141-158. doi:10.1038/nrm.2016.159
59. Reeves R. High mobility group (HMG) proteins: Modulators of chromatin structure and DNA repair in mammalian cells. DNA Repair. 2015;36:122-136. doi:10.1016/j.dnarep.2015.09.015
60. Mayran A, Drouin J. Pioneer transcription factors shape the epigenetic landscape. J Biol Chem. 2018;293(36):13795-13804. doi:10.1074/jbc.R117.001232
61. Sun H, Fu B, Qian X, Xu P, Qin W. Nuclear and cytoplasmic specific RNA binding proteome enrichment and its changes upon ferroptosis induction. Nat Commun. 2024;15(1):852. doi:10.1038/s41467-024-44987-9
62. Van Nostrand EL, Freese P, Pratt GA, et al. A large-scale binding and functional map of human RNA-binding proteins. Nature. 2020;583(7818):711-719. doi:10.1038/s41586-020-2077-3
63. Azad GK, Swagatika S, Kumawat M, Kumawat R, Tomar RS. Modifying Chromatin by Histone Tail Clipping. Journal of Molecular Biology. 2018;430(18, Part B):3051-3067. doi:10.1016/j.jmb.2018.07.013
64. Lee H, Noh H, Ryu JK. Structure-function relationships of SMC protein complexes for DNA loop extrusion. BioDesign. 2021;9(1):1-13. doi:10.34184/kssb.2021.9.1.1
65. Cartwright P, Helin K. Nucleocytoplasmic shuttling of transcription factors. Cell Mol Life Sci. 2000;57(8-9):1193-1206. doi:10.1007/pl00000759
66. Cheng D, Semmens K, McManus E, et al. The nuclear transcription factor, TAF7, is a cytoplasmic regulator of protein synthesis. Science Advances. Published online December 2021. doi:10.1126/sciadv.abi5751
67. Shreberk-Shaked M, Oren M. New insights into YAP/TAZ nucleo-cytoplasmic shuttling: new cancer therapeutic opportunities? Mol Oncol. 2019;13(6):1335-1341. doi:10.1002/1878-0261.12498
68. Kobiyama K, Kawashima A, Jounai N, et al. Role of Extrachromosomal Histone H2B on Recognition of DNA Viruses and Cell Damage. Front Genet. 2013;4:91. doi:10.3389/fgene.2013.00091
69. Zeng Z, Chen L, Luo H, Xiao H, Gao S, Zeng Y. Progress on H2B as a multifunctional protein related to pathogens. Life Sciences. 2024;347:122654. doi:10.1016/j.lfs.2024.122654
70. Sigismondo G, Papageorgiou DN, Krijgsveld J. Cracking chromatin with proteomics: From chromatome to histone modifications. PROTEOMICS. 2022;22(15-16):2100206. doi:10.1002/pmic.202100206
71. Huang Q, Szklarczyk D, Wang M, Simonovic M, von Mering C. PaxDb 5.0: curated protein quantification data suggests adaptive proteome changes in yeasts. Molecular & Cellular Proteomics. Published online August 31, 2023:100640. doi:10.1016/j.mcpro.2023.100640
72. Vogel C, Bashton M, Kerrison ND, Chothia C, Teichmann SA. Structure, function and evolution of multidomain proteins. Current Opinion in Structural Biology. 2004;14(2):208-216. doi:10.1016/j.sbi.2004.03.011
73. Paysan-Lafosse T, Andreeva A, Blum M, et al. The Pfam protein families database: embracing AI/ML. Nucleic Acids Res. 2025;53(D1):D523-D534. doi:10.1093/nar/gkae997
74. Wang J, Chitsaz F, Derbyshire MK, et al. The conserved domain database in 2023. Nucleic Acids Res. 2023;51(D1):D384-D388. doi:10.1093/nar/gkac1096
75. Waman VP, Bordin N, Alcraft R, et al. CATH 2024: CATH-AlphaFlow Doubles the Number of Structures in CATH and Reveals Nearly 200 New Folds. Journal of Molecular Biology. 2024;436(17):168551. doi:10.1016/j.jmb.2024.168551
76. Blum M, Andreeva A, Florentino LC, et al. InterPro: the protein sequence classification resource in 2025. Nucleic Acids Res. 2025;53(D1):D444-D456. doi:10.1093/nar/gkae1082
77. Lyubitelev AV, Nikitin DV, Shaytan AK, Studitsky VM, Kirpichnikov MP. Structure and functions of linker histones. Biochemistry (Moscow). 2016;81(3):213-223. doi:10.1134/S0006297916030032
78. Wiener N. Cybernetics. Scientific American. 1948;179(5):14-19.
79. Shah SG, Mandloi T, Kunte P, et al. HISTome2: a database of histone proteins, modifiers for multiple organisms and epidrugs. Epigenetics & Chromatin. 2020;13(1):31. doi:10.1186/s13072-020-00354-8
80. Wiśniewski JR, Hein MY, Cox J, Mann M. A “Proteomic Ruler” for Protein Copy Number and Concentration Estimation without Spike-in Standards*. Molecular & Cellular Proteomics. 2014;13(12):3497-3506. doi:10.1074/mcp.M113.037309
81. Palii CG, Cheng Q, Gillespie MA, et al. Single-Cell Proteomics Reveal that Quantitative Changes in Co-expressed Lineage-Specific Transcription Factors Determine Cell Fate. Cell Stem Cell. 2019;24(5):812-820.e5. doi:10.1016/j.stem.2019.02.006
82. Baskar R, Chen AF, Favaro P, et al. Integrating transcription-factor abundance with chromatin accessibility in human erythroid lineage commitment. Cell Rep Methods. 2022;2(3):100188. doi:10.1016/j.crmeth.2022.100188
83. Shinohara K, Toné S, Ejima T, Ohigashi T, Ito A. Quantitative Distribution of DNA, RNA, Histone and Proteins Other than Histone in Mammalian Cells, Nuclei and a Chromosome at High Resolution Observed by Scanning Transmission Soft X-Ray Microscopy (STXM). Cells. 2019;8(2):164. doi:10.3390/cells8020164
84. Hock R, Furusawa T, Ueda T, Bustin M. HMG chromosomal proteins in development and disease. Trends in Cell Biology. 2007;17(2):72-79. doi:10.1016/j.tcb.2006.12.001
85. Holehouse AS, Alberti S. Molecular determinants of condensate composition. Molecular Cell. 2025;85(2):290-308. doi:10.1016/j.molcel.2024.12.021
86. Miao J, Chong S. Roles of intrinsically disordered protein regions in transcriptional regulation and genome organization. Current Opinion in Genetics & Development. 2025;90:102285. doi:10.1016/j.gde.2024.102285
87. Zanzoni A, Ribeiro DM, Brun C. Understanding protein multifunctionality: from short linear motifs to cellular functions. Cell Mol Life Sci. 2019;76(22):4407-4412. doi:10.1007/s00018-019-03273-4
88. Kumar M, Michael S, Alvarado-Valverde J, et al. ELM—the Eukaryotic Linear Motif resource—2024 update. Nucleic Acids Res. 2024;52(D1):D442-D455. doi:10.1093/nar/gkad1058
89. Ghitti M, Colley LS, Mantonico MV, Musco G, Bianchi ME. Intrinsic disorder and fuzzy interactions drive multiple functions of HMGB1. Trends in Biochemical Sciences. Published online September 1, 2025. doi:10.1016/j.tibs.2025.08.001
90. Hatos A, Monzon AM, Tosatto SCE, Piovesan D, Fuxreiter M. FuzDB: a new phase in understanding fuzzy interactions. Nucleic Acids Res. 2022;50(D1):D509-D517. doi:10.1093/nar/gkab1060
91. Jonas F, Navon Y, Barkai N. Intrinsically disordered regions as facilitators of the transcription factor target search. Nat Rev Genet. 2025;26(6):424-435. doi:10.1038/s41576-025-00816-3
92. Már M, Nitsenko K, Heidarsson PO. Multifunctional Intrinsically Disordered Regions in Transcription Factors. Chemistry. 2023;29(21):e202203369. doi:10.1002/chem.202203369
93. Sabari BR, Dall’Agnese A, Young RA. Biomolecular Condensates in the Nucleus. Trends in Biochemical Sciences. 2020;45(11):961-977. doi:10.1016/j.tibs.2020.06.007
94. Requião RD, Fernandes L, Souza HJA de, Rossetto S, Domitrovic T, Palhano FL. Protein charge distribution in proteomes and its impact on translation. PLOS Computational Biology. 2017;13(5):e1005549. doi:10.1371/journal.pcbi.1005549
95. Fisher RS, Elbaum-Garfinkle S. Tunable multiphase dynamics of arginine and lysine liquid condensates. Nat Commun. 2020;11(1):4628. doi:10.1038/s41467-020-18224-y
96. Hong Y, Najafi S, Casey T, Shea JE, Han SI, Hwang DS. Hydrophobicity of arginine leads to reentrant liquid-liquid phase separation behaviors of arginine-rich proteins. Nat Commun. 2022;13(1):7326. doi:10.1038/s41467-022-35001-1
97. Dang M, Li T, Zhou S, Song J. Arg/Lys-containing IDRs are cryptic binding domains for ATP and nucleic acids that interplay to modulate LLPS. Commun Biol. 2022;5(1):1315. doi:10.1038/s42003-022-04293-w
98. Amaro RE, Åqvist J, Bahar I, et al. The need to implement FAIR principles in biomolecular simulations. Nat Methods. 2025;22(4):641-645. doi:10.1038/s41592-025-02635-0
99. Armeev GA, Kniazeva AS, Komarova GA, Kirpichnikov MP, Shaytan AK. Histone dynamics mediate DNA unwrapping and sliding in nucleosomes. Nat Commun. 2021;12. doi:10.1038/s41467-021-22636-9
100. Fedulova AS, Armeev GA, Romanova TA, et al. Molecular dynamics simulations of nucleosomes are coming of age. WIREs Computational Molecular Science. 2024;14(4):e1728. doi:10.1002/wcms.1728
101. Kilgore HR, Chinn I, Mikhael PG, et al. Protein codes promote selective subcellular compartmentalization. Science. Published online March 7, 2025. doi:10.1126/science.adq2634
102. Yang X, Zhu H, Shi L, et al. AlphaFold-guided structural analyses of nucleosome binding proteins. Nucleic Acids Res. 2025;53(14):gkaf735. doi:10.1093/nar/gkaf735
103. Lim Y, Tamayo-Orrego L, Schmid E, et al. In silico protein interaction screening uncovers DONSON’s role in replication initiation. Science. 2023;381(6664):eadi3448. doi:10.1126/science.adi3448
104. Ruthenburg AJ, Li H, Patel DJ, David Allis C. Multivalent engagement of chromatin modifications by linked binding modules. Nat Rev Mol Cell Biol. 2007;8(12):983-994. doi:10.1038/nrm2298
Supplementary materials
(Re)defining the human chromatome:
an integrated meta-analysis of localization, function,
abundance, physical properties and domain composition of
chromatin proteins
Anna K. Gribkova1,2, Grigoriy A. Armeev1,2, Mikhail P. Kirpichnikov1,3, Alexey K. Shaytan1,2,4*
1 Department of Biology, Lomonosov Moscow State University, Moscow, Russia
2 Vavilov Institute of General Genetics, Moscow, Russia
3 Shemyakin–Ovchinnikov Institute of Bioorganic Chemistry,
Russian Academy of Sciences, Moscow, Russia
4 International Laboratory of Bioinformatics, AI and Digital Sciences Institute,
Faculty of Computer Science, HSE University, Moscow, Russia
* To whom correspondence should be addressed. Email: shay [email protected]. u
Table of contents
Supplementary Tables
Supplementary Figures
1. Sources of information about chromatin and nuclear proteins and their critical evaluation
2. The SimChrom chromatin protein classification, the SimChrom dataset and other reference datasets
3. Analysis of the human chromatome
3.1. The chromatome composition and abundance of chromatin proteins
3.2. Physico-chemical properties and amino acid composition
3.3. Domain composition of chromatin proteins and identification of novel structural domains
3.4. Multivalent interactions in chromatin proteins
Supplementary Results and Discussion
1. Sources of information about chromatin and nuclear proteins and their critical evaluation
1.1. Analysis of chromatin proteins’ representation in the GO database and other protein-function oriented databases
1.2. Detailed comparative analysis of nuclear proteins subcellular localization between UniProt, HPA, and OpenCell
1.3. Detailed comparative analysis of sets of chromatin proteins identified in MS-based studies
2. The SimChrom chromatin protein classification, the interactive SimChrom database and other reference datasets
3. Analysis of the human chromatome
3.1. The chromatome composition and abundance of chromatin proteins
3.2. Detailed analysis of the physico-chemical properties and amino acid composition
3.3. Detailed analysis of the domain composition of chromatin proteins and identification of new structural domains
3.4. Detailed analysis of the multivalent interactions in chromatin proteins
References
Supplementary Figure SF2_5. The MS-based human chromatome (and nucleome) datasets
examined through the lens of annotations provided by the localization databases and the
SimChrom chromatin protein classification. (A) The list of MS-based datasets of chromatin/nuclear
proteins from the respective studies analyzed in this work together with a short description of the
experimental/analysis workflow (left) and the plots (right) showing the size of the datasets and the
fractions of the datasets that overlap with the SimChrom dataset or nuclear localization datasets
(NULOC_CS and NULOC_JT_NECF). The median values are shown by dotted lines in the plots. (B)
Cumulative fraction of proteins identified by at least N MS-based chromatin studies relative to the total
number of chromatin proteins identified in at least one MS-based study. (C) Overlap of protein entries
identified in MS-based studies that lack nuclear localization according to the database annotations
(NULOC_JT_NECF dataset). (D) Number of protein entries identified in MS-based studies that have
no annotations in the databases. Left: proteins lacking localization data in both UniProt and HPA. Right:
proteins lacking both localization data and GO annotation.
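
The quantity plotted in panel (B) can be illustrated with a minimal sketch. It is not the authors' code: the function name, the study labels and the protein identifiers below are hypothetical, and each dataset is assumed to be a plain set of protein identifiers.

```python
from collections import Counter


def cumulative_fraction_by_study_count(datasets):
    """For N = 1..number of studies, return the fraction of proteins found in
    at least N studies, relative to proteins found in at least one study."""
    # Count in how many studies each protein appears (each study counted once).
    counts = Counter(p for proteins in datasets.values() for p in set(proteins))
    total = len(counts)
    if total == 0:
        return {}
    return {
        n: sum(1 for c in counts.values() if c >= n) / total
        for n in range(1, len(datasets) + 1)
    }


# Hypothetical toy input: three studies, six distinct proteins.
toy_datasets = {
    "study_A": {"P1", "P2", "P3", "P4"},
    "study_B": {"P1", "P2", "P5"},
    "study_C": {"P1", "P6"},
}
fractions = cumulative_fraction_by_study_count(toy_datasets)
```

In this toy input only P1 is reported by all three studies, so the fraction drops quickly as N grows, which is the same qualitative behaviour the caption describes for the real datasets.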

Supplementary Figure SF2_6. GO enrichment analysis of protein entries from MS-based
chromatome datasets that are absent from both the NULOC_JT_NECF dataset and the SimChrom
classification (n = 2232).
Supplementary Figure SF2_7. (A) The Venn diagram showing overlaps between the set of SimChrom
proteins, the housekeeping proteome and the combined protein set of MS-based chromatome datasets
(union of protein content from eight MS-based studies analyzed in this work). (B) The percentage of
housekeeping (HK) proteins in MS-based chromatomes and nucleome (range: 76% - 90%). (C) The
percentage of SimChrom proteins by SimChrom category in MS-based chromatomes and nucleome.
Supplementary Figure SF2_8. (A) Fold enrichment of chromatin-associated proteins identified in
MS-based studies for every SimChrom category (fold enrichment is calculated with respect to the
distribution of proteins among the categories in SimChrom). Only statistically significant values
(p-value of Fisher exact test with Benjamini correction < 0.05) are shown. (B) Number of histone proteins
detected in MS-based studies compared to the reference counts from MS_HistoneDB and HistoneDB
2.0.
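
The calculation behind panel (A) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 2x2 contingency layout, the two-sided form of the Fisher exact test, and the reading of "Benjamini correction" as the Benjamini-Hochberg procedure are all assumptions, and the category names and counts are invented.

```python
from math import comb


def fold_enrichment(k, n, K, N):
    """Share of a category in the MS dataset (k of n proteins) divided by
    its share in the reference classification (K of N proteins)."""
    return (k / n) / (K / N)


def fisher_exact_two_sided(k, n, K, N):
    """Two-sided Fisher exact p-value for observing k category members among
    n sampled proteins, when the reference holds K members out of N."""
    denom = comb(N, n)

    def pmf(x):
        return comb(K, x) * comb(N - K, n - x) / denom

    lo, hi = max(0, n - (N - K)), min(n, K)
    p_obs = pmf(k)
    # Sum the probabilities of all tables that are no more likely than the observed one.
    p = sum(pmf(x) for x in range(lo, hi + 1) if pmf(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical counts: a reference of N proteins split into three categories,
# and the number of proteins of each category recovered by one MS study.
N = 100
category_sizes = {"TF": 50, "RNA binding": 20, "other": 30}
recovered = {"TF": 5, "RNA binding": 18, "other": 17}
n = sum(recovered.values())

names = list(category_sizes)
fe = {c: fold_enrichment(recovered[c], n, category_sizes[c], N) for c in names}
raw_p = [fisher_exact_two_sided(recovered[c], n, category_sizes[c], N) for c in names]
adj_p = dict(zip(names, benjamini_hochberg(raw_p)))
# Keep only categories passing the significance threshold, as in the caption.
significant = {c: fe[c] for c in names if adj_p[c] < 0.05}
```

A fold enrichment below 1 (the toy "TF" category) corresponds to depletion in the MS dataset, and a value above 1 (the toy "RNA binding" category) to enrichment.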
Supplementary Figure SF2_9. The overlap of chromatin/nuclear protein entries from different types
of sources: protein function databases (GO "Chromatin"), protein localization DBs (UniProt Nucleus,
HPA Nucleus or Nucleoplasm), MS-based studies of chromatome and nucleome proteins (two protein
sets are used - see legends: 1. union of protein entries from five total chromatin studies, two nascent
chromatin and one nucleome; 2. protein entries that are present in three out of seven MS-based
chromatin datasets). The “background” set of proteins for each panel is shown in italic.
2. The SimChrom chromatin protein classification, the SimChrom dataset and other reference
datasets
Supplementary Figure SF3_1. The scheme of creation of the SimChrom classification ontology and
SimChrom protein dataset is shown in panels (A) and (B), respectively.

Supplementary Figure SF3_2. The order of SimChrom categories (from top to bottom) used to create
the single-label SimChrom-SL classification of chromatin proteins. The categories were ordered as
follows: molecular function and physicochemical properties were placed first, followed by the others.
Among them, categories containing fewer proteins were ordered earlier. The number of proteins
belonging to the respective SimChrom and SimChrom-SL categories is also shown.
Supplementary Figure SF3_3. SimChrom protein analysis using protein localization information and
MS-based chromatomes and nucleome. (A-B) A Venn diagram showing the overlap between the protein
sets: SimChrom, proteins identified in MS-based studies and the reference nuclear protein sets
NULOC_CS (A) or NULOC_JT_NECF (B). (C) The percentage of proteins from SimChrom categories
that are found in NULOC_CS and NULOC_JT_NECF datasets.
Supplementary Figure SF3_4. SimChrom protein dataset analysis using protein localization
information. (A) The number of SimChrom-SL classified proteins without nuclear localization
according to NULOC_JT_NECF (the broadest dataset that combined all nuclear protein entries from
all protein localization databases at any level of confidence). (B) The unexpected enriched GO terms
identified for the SimChrom proteins that are absent from NULOC_CS.
3. Analysis of the human chromatome
3.1. The chromatome composition and abundance of chromatin proteins
Supplementary Figure SF4_1. Chromatin proteins abundance analysis. (A, B) Distribution of proteins
from PaxDb_INT and PaxDb_PA datasets according to their relative abundance values. The distributions
of housekeeping (HK) and non-housekeeping (non-HK) proteins are also shown (see legend). The distribution
was constructed by taking the logarithm of the abundance values in ppm, making a histogram (bin size
of 0.15) and smoothing it with a Gaussian kernel for visual clarity. (C) Fraction distribution of low-
abundant (LA) and high-abundant (HA) housekeeping (HK) and non-housekeeping (non-HK) proteins
in the whole proteome (PaxDb_INT), protein localization datasets (NULOC_CS and NULOC_JT), and
SimChrom and MS-based chromatomes. (D) The distribution of low-abundant (LA) housekeeping
(HK) proteins among SimChrom-SL categories.
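
The construction of the distributions in panels (A, B) can be sketched as below. This is only an illustration of the described steps (logarithm, histogram with a bin size of 0.15, Gaussian smoothing). The caption does not state the logarithm base or the kernel width, so base 10 and a width of two bins are assumptions, and the abundance values are invented.

```python
import math


def smoothed_log_histogram(abundances_ppm, bin_size=0.15, sigma_bins=2.0):
    """Histogram of log10(abundance) with fixed-width bins, followed by
    smoothing with a truncated, normalized Gaussian kernel.
    Returns (bin_centers, raw_counts, smoothed_counts)."""
    # Non-positive abundances have no logarithm and are skipped.
    logs = [math.log10(a) for a in abundances_ppm if a > 0]
    lo = math.floor(min(logs) / bin_size) * bin_size
    n_bins = int(math.floor((max(logs) - lo) / bin_size)) + 1
    counts = [0] * n_bins
    for x in logs:
        idx = min(int((x - lo) / bin_size), n_bins - 1)
        counts[idx] += 1

    # Gaussian kernel truncated at three standard deviations.
    radius = int(math.ceil(3 * sigma_bins))
    kernel = [math.exp(-0.5 * (d / sigma_bins) ** 2) for d in range(-radius, radius + 1)]
    norm = sum(kernel)
    kernel = [w / norm for w in kernel]

    # Convolution with zero padding, so some mass is lost near the edges.
    smoothed = []
    for i in range(n_bins):
        acc = 0.0
        for j, w in enumerate(kernel):
            src = i + j - radius
            if 0 <= src < n_bins:
                acc += w * counts[src]
        smoothed.append(acc)

    centers = [lo + (i + 0.5) * bin_size for i in range(n_bins)]
    return centers, counts, smoothed


# Hypothetical abundance values in ppm (the zero is ignored).
centers, counts, smoothed = smoothed_log_histogram([0.01, 0.1, 1, 1, 10, 100, 1000, 0])
```

Running the same function separately on housekeeping and non-housekeeping subsets would give the per-group curves mentioned in the caption.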
Supplementary Figure SF5_4. Comparison of amino acid composition between chromatin and
uniquely localized nuclear proteins relative to cytoplasmic proteins (SimChrom, NULOC_CS_UL,
CYLOC_CS_UL datasets). The comparison is done separately for the total protein sequence (A,B,C),
IDRs (D,E,F), and non-IDRs (G,H,I). Subplots (A, D, G) present the median fractions of amino acids
of chromatin and nuclear proteins (subpanel 1 on each plot) and the fold enrichment (FE) of these fractions
relative to the cytoplasmic proteins (subpanel 2 on each plot); the black line indicates FE = 1. The
adjusted p-value is shown for the statistical tests (Mann-Whitney test) comparing the median values of
amino acid fractions of chromatin and nuclear proteins with the cytoplasmic ones (subpanel 3 on each
plot). Gray highlights indicate a lack of statistical significance (adj. p-value > 0.05). Detailed analysis
of the distribution of the selected amino acids in the total sequence of proteins belonging to respective
SimChrom-SL protein categories is presented in panels (B-I): enriched amino acids in chromatin
proteins are shown in panels (B,E,H), depleted ones in panels (C,F,I). At the top of each plot (the first three
rows) the following datapoints of the fold enrichment are given: “Total” - for all proteins from
SimChrom or NULOC_CS_UL datasets (the latter also depicted by a dashed line), “Common” - for
common proteins among SimChrom and NULOC_CS_UL datasets, “Not common” - for proteins not
present in the partner dataset (e.g., for SimChrom those present in SimChrom but absent in
NULOC_CS_UL will be depicted, and vice versa for Nuclear_UL dataset).

Supplementary Figure SF5_5. Additional comparisons of amino acid composition of different
protein groups. (A) Comparison of amino acid composition in chromatin proteins and chromatin
proteins without zf-C2H2 containing proteins relative to cytoplasmic proteins. The median fraction of
amino acids of protein subsets (subpanel 1), the fold enrichment (FE) relative to the cytoplasmic
proteins (subpanel 2), where the black line indicates FE = 1. The adjusted p-value is shown for the
statistical tests comparing the median values of chromatin and chromatin proteins that lack zinc-finger
domains with the cytoplasmic ones (subpanel 3). Gray shading indicates values that lack statistical
significance (adj. p-value > 0.05). (B) Fold enrichment of amino acids’ median fractions in chromatin
proteins vs uniquely localized cytoplasmic ones; total sequences, IDRs and non-IDRs were analyzed
separately. (C) Median value of amino acids’ fractions in chromatin proteins for total protein sequences,
IDRs and non-IDRs.
3.3. Domain composition of chromatin proteins and identification of novel structural domains
Supplementary Figure SF6_1. (A) Taxonomic distribution of source organisms of PDB structures
with domains homologous to chromatin proteins (taxon of the best match of the structural domains
identified by the TED resource, see Figure 6, Methods). (B) The histogram showing how many Pfam
domain models (Y-axis) are found in exactly N (X-axis) chromatin proteins. One can see that the
majority of Pfam domain models are represented only by domains found in one chromatin protein. The
red line indicates median values. (C) The distribution of chromatin proteins according to the total
number of Pfam domains identified in proteins (see also Supplementary Table ST15). (D) The
distribution of proteins according to the number of zf-C2H2 domains in Housekeeping and
Non-housekeeping DNA-binding transcription factors (HK TFs and Non-HK TFs, respectively). The lines
indicate median values (7 and 9). (E) Analysis of functional domain diversity in chromatin proteins as
identified by the Pfam database for proteins belonging to different chromatin categories according to
SimChrom classification. Subpanels 1-5 represent various characteristics. This is the same as Figure 6E
but SimChrom classification instead of SimChrom-SL classification is used.
Supplementary Figure SF6_2. The examples of novel structural domains identified in chromatin
proteins: structures, colored by TED domain annotation and AlphaFold2 pLDDT score, and their
annotation in InterPro (screenshot). (A) General transcription factor 3C polypeptide 1 (gene GTF3C1,
protein Q12789). (B) Testis-specific H1 histone (gene H1-7, protein Q75WM6).
Supplementary Figure SF6_3. Predictions of GO molecular function (MF) (panel A) and biological
processes (BP) terms (panel B) for novel structural domains without information in other DBs
according to InterPro.
3.4. Multivalent interactions in chromatin proteins
Supplementary Figure SF8_1. (A) The number of chromatin regulator proteins that contain
EMVI-domains of certain groups. (B) The number of chromatin proteins with different numbers of
EMVI-domains belonging to different groups ('DNA binding' domain functional group is not shown). (C)
Co-occurrence of EMVI-domains belonging to different functional groups in chromatin proteins. The
values indicate the estimated conditional probabilities to find in a chromatin protein a domain specified
in the column name given that a domain specified in the row name is already present.
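
The conditional probabilities in panel (C) can be estimated as in the following minimal sketch. It is not the authors' code: proteins are assumed to be described simply by the set of functional groups of their EMVI-domains, the diagonal is trivially 1 under this simplification, and the protein identifiers and group names are hypothetical.

```python
def cooccurrence_conditional_probabilities(protein_groups):
    """Return matrix[row][col] = P(protein has a domain of group `col`
    | protein has a domain of group `row`), estimated as a simple frequency."""
    groups = sorted({g for gs in protein_groups.values() for g in gs})
    matrix = {}
    for row in groups:
        # Proteins that already carry a domain of the row group.
        with_row = [gs for gs in protein_groups.values() if row in gs]
        matrix[row] = {
            col: sum(1 for gs in with_row if col in gs) / len(with_row)
            for col in groups
        }
    return matrix


# Hypothetical toy annotation: protein -> functional groups of its domains.
toy_proteins = {
    "P1": {"histone reader", "DNA binding"},
    "P2": {"histone reader"},
    "P3": {"DNA binding", "RNA binding"},
    "P4": {"histone reader", "DNA binding"},
}
cond_prob = cooccurrence_conditional_probabilities(toy_proteins)
```

The matrix is generally asymmetric: in the toy data every protein with an "RNA binding" domain also has a "DNA binding" domain, while the reverse holds for only one protein in three.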

Supplementary Figure SF8_2. The UpSet plot shows combinations of EMVI domains classified by
their functional groups/subgroups in chromatin proteins (panel A) and protein complexes that
exclusively contain chromatin proteins (panel B).
Supplementary Results and Discussion
1. Sources of information about chromatin and nuclear proteins and their critical evaluation
This section includes supplementary results and discussion for section 3.1. Sources of
information about chromatin and nuclear proteins and their critical evaluation in the main text.
A note on the distinction between and definition of nuclear proteome and chromatome
Nuclear proteome and chromatome are two terms that are historically used to describe the
protein content of the nucleus and the proteins associated with genome packaging, maintenance and
functioning (see Figure 1A for a modern view of nucleus structure). The exact distinction between these
two terms may be fuzzy and is often based on different consensus (protein localization or functional
classification ontologies) or operational (experiment-based extraction techniques) definitions. During
interphase, when the nuclear envelope is intact, the chromatin proteins obviously reside inside the
nucleus and are a part of the nuclear proteome. Hence, terminologically the nucleome seems to be more
straightforwardly defined just by the protein contents of the nucleus. However, during mitosis and
meiosis, once the nucleus disintegrates as a distinct organelle, the situation becomes more complex.
During these stages of the cell cycle there are no nuclear proteins per se, while chromatin proteins can
still be defined as those associated with the DNA in chromosomes.
Another debatable question is whether all of the proteins inside the nucleus can be considered
chromatin proteins (even if nuclear envelope proteins are set aside). According to one modern view,
apart from the chromatin compartment the nucleus also contains interchromatin compartments [2] and
nuclear bodies enriched in RNA and protein complexes (e.g., nucleolus, nuclear speckles). Historically
the soluble fraction of nuclear proteins was attributed to nucleosol or nuclear sap. However, to say that
proteins localized in these compartments do not interact with genomic DNA at least transiently would
be an oversimplification. Nucleoplasm proteins mostly also interact with genomes, and some parts of the
genome interact with the nucleolus (the so-called nucleolar associated domains or NADs). Even
proteins of the nuclear envelope, the lamins, do interact with the genomic DNA (e.g., forming
lamina-associated domains, LADs).
1.1. Analysis of chromatin proteins’ representation in the GO database and other
protein-function oriented databases
The most comprehensive gene and gene products classification resource to date is
GeneOntology (GO), which classifies proteins according to the three interrelated ontologies describing
molecular function, biological processes, and cellular components (called aspects). GO annotates 97%
of all proteins of the human reference proteome (as provided by UniProt), but this classification also
has several drawbacks. GO combines many categories (currently around 42 thousand terms) of various
scope describing various aspects of gene functioning connected via different types of relationships (such
as “A is B”, “A is part of B”, “A regulates B”, “A occurs in B”, etc.) in a non-treelike structure (directed
acyclic graph). This complex intertwined hierarchy of GO makes it difficult to get a holistic picture of
various chromatin protein groups and to apply a reductionist way of thinking while interpreting the results
of bioinformatics analysis of protein sets made using GO classification. Another drawback is that GO
omits categories that are historically well established in the community of chromatin researchers (e.g.,
such categories as “histone proteins”, “high-mobility group proteins”, etc.), again hampering
interpretation of GO-based data analysis using the established knowledge (see discussion below).
The GO cellular component term "chromatin" is defined broadly as "the ordered and organized
complex of DNA, protein, and sometimes RNA, that forms the chromosome." Consequently,
functionally relevant chromatin terms (such as "nucleosomal DNA binding", "DNA-binding
transcription factor activity", "Histone H3K27 DNA-binding transcription factor activity", "Nucleus",
"Histone H3K27 monomethyltransferase activity") may not be linked hierarchically to the chromatin
GO node, resulting in incomplete overlaps between protein lists. Based on our analysis, over 500
functionally defined chromatin proteins (inferred from literature and other sources) are absent from GO
annotations (see Supplementary Figure SF2_1A).
While the GO provides nearly comprehensive coverage of the proteome, its classification
structure presents challenges for extracting or comparing specific protein sets. The GO hierarchy is
complex, with heterogeneous relationships between nodes, overlapping protein annotations across
terms, and varying levels of detail and completeness. For instance, manually inspecting GO terms
associated with chromatin-related keywords (e.g., "DNA", "transcription", "histone", "RNA
polymerase") is impractical due to their sheer volume (>100 terms; see Supplementary Figure
SF2_1B). Furthermore, GO terms inherently include proteins from all child terms, which can lead to
unintended inclusions. For example: the term "DNA-templated transcription" incorporates
"mitochondrial transcription"; "gene expression" encompasses functionally distinct processes like
"protein maturation" and "translation".
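
The inheritance of annotations from child terms described above can be illustrated with a minimal sketch. The term names follow the examples in the text, but the graph is a toy fragment rather than the real GO structure, and the protein identifiers and the function name are hypothetical.

```python
def proteins_with_descendants(term, children, direct_annotations):
    """Collect proteins annotated to `term` or to any of its descendants
    in a directed acyclic graph given as a parent -> children mapping."""
    seen, stack, proteins = set(), [term], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # a term can be reached via several parents in a DAG
        seen.add(current)
        proteins |= direct_annotations.get(current, set())
        stack.extend(children.get(current, ()))
    return proteins


# Toy fragment of a GO-like hierarchy (parent -> child terms).
children = {
    "gene expression": ["DNA-templated transcription", "translation", "protein maturation"],
    "DNA-templated transcription": ["mitochondrial transcription"],
}
# Hypothetical direct annotations (term -> proteins).
direct_annotations = {
    "DNA-templated transcription": {"nuclear_TF"},
    "mitochondrial transcription": {"mito_polymerase"},
    "translation": {"ribosomal_protein"},
}
transcription_set = proteins_with_descendants(
    "DNA-templated transcription", children, direct_annotations
)
expression_set = proteins_with_descendants("gene expression", children, direct_annotations)
```

Selecting proteins by the broader term pulls in the mitochondrial and ribosomal entries as well, which is the kind of unintended inclusion discussed in the text.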
When compared with the EpiFactors database [3] (containing epigenetic regulator protein
entries obtained by text mining), 46% of entries in EpiFactors are missing from the list of GO "chromatin"
proteins (see Figure 2D). Moreover, while a specialized review by Hammond et al., 2017 [1] lists 35
proteins of the histone chaperone category, the GO term "histone chaperone activity" includes only 14, with
just 6 overlapping entries (see Supplementary Figure SF2_1D).
Several functionally important but small chromatin protein categories are entirely missing from
GO, including HMG proteins, histone tail cleavage proteins, and general TFs. Even for well-annotated
classes like histones, inconsistencies persist. Most are classified under "structural constituent of
chromatin", but this term also includes two non-histone proteins (HMGA1 and LMNTD2).
The Gene Ontology offers the most comprehensive coverage, annotating nearly the entire
human reference proteome. In contrast, chromatin/epigenetic regulator databases typically include 400–
800 proteins, while protein class-specific databases vary significantly in scope, ranging from as few as
30 proteins (e.g., chromatin remodelers of the SWI/SNF family) to over 1500 (e.g., transcription
factors). Despite their utility, none of these resources fully captures the complexity of
chromatin-associated proteins, either in terms of protein coverage or functional classification. While
chromatin/epigenetic regulator databases encompass key cofactors of certain protein complexes, they
exclude critical categories such as transcription factors, RNA polymerase subunits, DNA-modifying
enzymes, DNA repair machinery, and HMG proteins. Conversely, protein class-specific databases are
limited to at most six functional categories in total, including histones, chromatin remodelers (from
select families), histone post-translational modification (PTM) writers/readers/erasers, and transcription
factors. Additionally, several chromatin-related protein classes have been reviewed as gene groups
in HGNC (e.g., "High mobility group", "DNA polymerases") or in the literature but lack dedicated
database resources. Examples include histone chaperones [1], SMC complexes [4], HMG proteins [5],
pioneer transcription factors [6], nuclear RNA-binding proteins [7,8], and histone tail cleavage enzymes
[9]. The full list of chromatin-associated protein class-specific sources is available in Supplementary
Table ST1. The absence of centralized repositories for these protein classes highlights a critical gap in
current bioinformatics resources.
The revealed problems of using GO directly for extracting a set of chromatin proteins arise
from several factors (see also [10] for a broader discussion of GO applicability): 1) protein
multifunctionality: many proteins participate in diverse processes or localize to multiple compartments,
2) ambiguous term definitions: brief GO descriptions may lead to inconsistent interpretations, 3)
annotation delays and errors: lag times in updates and propagation of errors in curated datasets, 4)
curation bias: well-studied proteins are annotated more thoroughly than niche categories.
The MS-based datasets showed high variability: only 179 proteins were common to all of them when the
chromatin datasets alone were considered (see Supplementary Figure SF2_5B). The largest MS-based datasets of total
chromatin (Shi et al., 2021; Ginno et al., 2018) and nascent chromatin (Alabert et al., 2014; Alvarez et
al., 2023) contained around three thousand proteins, approximately the same amount as the number of
proteins in SimChrom. However, surprisingly, the number of proteins of these datasets that were present
in SimChrom was small (25-35%). This low consistency with SimChrom was also observed in the
Torrente et al., 2011 dataset. These three datasets (Shi et al., 2021; Ginno et al., 2018; Torrente et al.,
2011) were based purely on experimental chromatin extraction techniques (Supplementary Figure
SF2_5A). In contrast, the consistency with SimChrom was twice as high (50-65%) for the
Kustatscher et al., 2014 and Itzhak et al., 2016 datasets. These datasets stand out from the other datasets.
Kustatscher et al., 2014 used a machine learning classification approach based on MS-data signals with
a manually provided chromatin proteins training dataset, which was based on literature and database
mining. In Itzhak et al., 2016 the proteins were considered nuclear if their MS-measured intensity in the
crude nuclear extract exceeded 85% of the global intensity. We hypothesize that these two latter datasets
have better consistency with the SimChrom dataset in part because by their design they are initially biased
either by the information already available in the literature and various databases (in the case of
Kustatscher et al., 2014) or by the selection of proteins that are preferentially localized in the nucleus
and hence have higher chances to be described in the literature and databases (in the case of Itzhak
et al., 2016).
When MS-based datasets were compared to the nuclear localization datasets, we observed the
same tendency. Around 95% of proteins in Kustatscher et al., 2014 and Itzhak et al., 2016 datasets may
be found in our broad nuclear localization dataset (NULOC_JT_NECF), while for other MS-based
datasets the proportion was around 60-70% (Torrente et al., 2011; Alabert et al., 2014; Ginno et al.,
2018; Shi et al., 2021) and 75-80% (Ugur et al., 2023; Alvarez et al., 2023). The same tendency was
observed for the consensus NULOC_CS dataset (the portion of the MS-based datasets present in
NULOC_CS varied between 30% and 73%).
To further understand the origins of these discrepancies we analyzed proteins that were found
in MS-based experimental chromatin datasets but not present in nuclear localization datasets or
SimChrom (Supplementary Figure SF2_5C). The Itzhak et al., 2016 dataset had a 98% overlap with
NULOC_JT_NECF and its inclusion would not affect the results presented below. In total there were
2232 such proteins, and only a minor fraction of these (94 proteins) did not have localization
annotation in the databases (see Supplementary Figure SF2_5D). Five proteins were present in four
total chromatin MS-based datasets (see Supplementary Figure SF2_5C); they included proteins
encoded by the FLNB (Filamin B, actin-binding protein), GMPS (Guanine Monophosphate Synthase),
CHERP (Calcium Homeostasis Endoplasmic Reticulum Protein), ILKAP (ILK Associated
Serine/Threonine Phosphatase), PLEC (Plectin) genes. However, the Ugur et al., 2023 dataset includes
only PLEC and CHERP. While according to UniProt and HPA all of these are non-nuclear proteins
predominantly localized in cytoplasm (with additional localizations in cytoskeleton, intermediate
filaments, endoplasmic reticulum, Golgi apparatus), manual literature mining confirmed experimental
evidence supporting the presence of these proteins in the nucleus (e.g. [16]). We additionally randomly
selected 20 proteins from a subset of proteins that were reported by at least five out of seven chromatin
MS-based studies (there were 195 such protein coding genes) and manually performed literature
searches. From those 15, for 5 genes literature evidence was found suggesting their nuclear localization
(CALR [17], PDIA4 [18], ABCF2 [19], SEC23B [20], EIF3D [21]). Hence, it may be stated that
MS-based studies currently have predictive power to identify new chromatin proteins that are not annotated
as such by the localization and functional databases.
It is not straightforward to estimate the potential contamination of MS-based datasets with
non-nuclear proteins since one cannot come up with an ultimate reference set of non-nuclear proteins. Even
for well studied proteins there are still chances that they may have auxiliary functionality in the nucleus
that has not yet been experimentally characterized. Still, to address this problem we relied on analyzing
the above mentioned set of 2232 proteins using information available in GO. GO enrichment analysis
revealed that these proteins were mainly associated with a diverse set of GO terms related to
non-nuclear organelles/compartments, cellular metabolism, protein translation and maturation, suggesting
that no important chromatin associated categories were missed during the construction of the SimChrom
dataset that could account for this discrepancy (Supplementary Figure SF2_6). Previous studies
suggest that the results of MS-based studies may be contaminated by cytoplasmic and mitochondrial
proteins [12,22]. In our analysis 2025 proteins were related to cytoplasm and 115 to Golgi vesicle transport
according to GO. No enriched terms include mitochondrial proteins. It is still possible that among the 2232
proteins there are nuclear proteins whose annotation by GO does not account for their additional,
moonlighting functions in the nucleus. For instance, 30 proteins were associated with translation
("translation", "translation initiation factor activity") according to GO, and many proteins of the
translational apparatus are known to be moonlighting proteins with functional roles in the nucleus [23].
Similarly, the Golgi apparatus cooperates with the nucleus and ER in vesicular transport.
In a different type of analysis we looked at chromatin proteins that were not identified by the
MS-studies but were included in our SimChrom dataset. There were 1246 such proteins (or 41% of
SimChrom). According to the HPA classification of housekeeping proteins, 67% (839 proteins) of these
were not housekeeping, consistent with the idea that they were missed by MS-based studies because
they were not expressed in the cell lines. However, it was also found that MS-based studies are biased
towards identifying the housekeeping proteins. More than 75% of nuclear/chromatin proteins reported
by the MS-based studies were from the housekeeping pool, while the average expected fraction of
nuclear housekeeping proteins is around 62% (Supplementary Figure SF2_7B). Among the set of
1246 proteins not found in the MS-based studies the dominant SimChrom category was related to
DNA-binding transcription factors (see Supplementary Figure SF2_7C). 1148 DNA-binding TFs were missed
by MS-based studies, including 394 housekeeping TFs. This highlights another potential source of
discrepancy: housekeeping TFs may be present in small amounts or be washed away during
chromatin/nucleome extractions and thus be missed by MS analysis. To further understand the
discrepancies between the MS-based datasets and SimChrom we performed enrichment analysis of the
SimChrom categories in the experimental datasets (Supplementary Figure SF2_8A). It can be seen
that categories related to DNA-binding transcription factors were mainly depleted in the MS-based
datasets, consistent with the analysis described above. The separate analysis of housekeeping TFs
suggests that different experimental methods show high variability in their ability to recover TFs in
chromatin extracts. The Torrente et al., 2011 dataset includes only 8 housekeeping TFs, while the
Kustatscher et al., 2014 and Itzhak et al., 2016 datasets each contain 158. The dynamic nature of
transcription factors’ interactions with chromatin likely explains these facts. The most enriched
categories were related to proteins involved in RNA binding and metabolism, histone chaperones,
remodelers and heterochromatin associated factors. This is likely related to the higher chances of these
proteins to be detected due to their universal presence in the cells and high expression levels. Detailed
analysis of the SimChrom categories representation in MS-based datasets further revealed some details
of the differences between the datasets (Supplementary Figure SF2_7C). For instance, the ML-based
Kustatscher et al., 2014 dataset was able to recover twice as many DNA transcription factors as the Ginno
et al., 2018 and Shi et al., 2021 datasets. The ratio of housekeeping and non-housekeeping TFs for the
Ginno et al., 2018 and Shi et al., 2021 datasets is the same, and is higher for the Kustatscher et al., 2014;
Itzhak et al., 2016; Alabert et al., 2014 datasets. Interestingly, the nucleome Itzhak et al., 2016 dataset
was also able to recover the same amount of TFs as the Kustatscher et al., 2014 dataset, although the
size of the dataset was three times smaller than the Shi et al., 2021 and Ginno et al., 2018 datasets. This
again points to the fact that TFs may be lost during chromatin extraction. The representation of some
other SimChrom categories had significant differences between the MS-based datasets, likely attributed
to the differential extraction probability. For instance, only 29% and 25% of RNA polymerases and
histones, respectively, were present in the Ginno et al., 2018 dataset, while 50-60% were present in the
Shi et al., 2021 dataset. A close look at the histone proteins (for which currently information about all
62 expressed proteins is known [15,24,25]) revealed that while certain tissue specific histone variants
were missed as expected, none of the MS-based studies was able to recover all canonical histone variants,
especially for the H2B histone (Supplementary Figure SF2_8B). For instance, from 14 canonical H2B
protein isoforms the Shi et al., 2021 dataset was able to recover six proteins (corresponding to the
gene products of H2BC1, H2BC3, H2BC4, H2BC13, H2BC18, H2BC26), while the Itzhak et al., 2016
dataset recovered four proteins (corresponding to gene products of H2BC4, H2BC11, H2BC12, H2BC13). While
the canonical histones are likely to be expressed simultaneously in the cell, these discrepancies may be
due to the sensitivity of MS-based analysis to expression variation in cell lines.
Taken together, our analysis indicates that MS-based chromatin datasets exhibit inconsistencies
with subcellular localization or functional annotations available in UniProt, HPA, or GO. Up to 35% of
chromatin proteins identified by MS-based techniques lack known nuclear localization according to the
databases. Among these, approximately 30% of proteins identified simultaneously by at least three
experimental studies may possess an additional nuclear localization that is not currently captured by the
databases, but may have been reported in research papers. For other proteins, the most parsimonious
explanation of their presence is the considerable contamination of chromatin extracts by mainly
cytoplasmic proteins. From another point of view, many chromatin proteins annotated in the databases
are not present in the MS-based datasets. While this is partially due to the limited number of genes
expressed in the analyzed cell lines, our analysis also suggests that many proteins (such as housekeeping
transcription factors) are lost during chromatin extraction, likely due to the dynamic nature of their
interactions and their low expression. Finally, we showed that MS-based studies that filter their results
by selecting the proteins that are highly enriched in the nucleus with respect to other cellular
compartments, or use ML-assisted classification based on database/literature data, show better
consistency with the information in the databases. However, this comes at the expense of decreasing
the size of their datasets and likely limiting their ability to identify new proteins associated with the
nucleus/chromatin.
2. The SimChrom chromatin protein classification, the interactive SimChrom database
and other reference datasets
This section supplements section 3.2. The SimChrom chromatin protein classification, the
SimChrom dataset and other reference datasets in the main text. Note: Interactive Figure 3
(https://simchrom.intbio.org/#classification) is the interactive version of Figure 3, which is the key
source of information for the analysis presented below.
The largest subgroup of the SimChrom “non-histone proteins” category is the “DNA-templated
transcription” group (1547 proteins), which consists of the proteins belonging to the “Regulation of
transcription” subgroup (1514 proteins) and the “RNA polymerases” subgroup (34 proteins). The “DNA
metabolic processes” form the second largest protein subgroup (495 proteins) and include DNA
replication, repair, and recombination machinery. “Nuclear RNA binding proteins” is another major
group of proteins (309 proteins) present in SimChrom. There are many RNA binding proteins in the
cell (more than 1500 [26]); in the “Nuclear RNA binding proteins” category we aimed at including only
those that are found inside the nucleus (see Methods Section 2.2 and Supplementary Table ST4),
including those involved in preribosome formation, RNA processing and modification inside the
nucleus. Other major subgroups of our classification included “Histone modification” (257 proteins),
“DNA-acting enzymes” (258 proteins including DNA methylation and demethylation enzymes), and
“Centromere-associated” (241 proteins) (see Figure 3). Several specific subgroups of various scope
containing proteins important for chromatin functioning were also included in our classification, such
as “Histone chaperones” (32 proteins), ATP-dependent “Chromatin remodelers” complexes (114
proteins), and “SMC complexes” (28 proteins, including cohesins implicated in chromatin loop extrusion).
To evaluate the contents of our SimChrom dataset we performed its cross-comparison to the
localization-based datasets described above (NULOC_JT_NECF and NULOC_CS) (see
Supplementary Figure SF3_3). The vast majority of SimChrom entries were also present in the
NULOC_JT_NECF dataset (the broadest dataset that combined all nuclear protein entries from all
protein localization databases at any level of confidence), which is consistent with chromatin proteins
being only a subset of nuclear proteins. However, a minor subset of SimChrom (156 proteins) was not
classified as nuclear by the localization databases (Supplementary Figure SF3_3A). To further
understand the nature of this minor discrepancy we analyzed the SimChrom categories which
contributed the most entries to this subset (Supplementary Figure SF3_3C) or where a significant
proportion of entries in the respective category was absent in the localization databases
(Supplementary Figure SF3_4A). In this subset 32 entries out of 156 were not annotated by the
localization databases at all, 27 were considered by UniProt as only chromosomal (this is consistent
with the “Centromere-associated” category of SimChrom having the most entries not present in
NULOC_JT_NECF), and 124 had only non-nuclear localization according to the localization databases.
A manual review of the latter entries suggested that they included both bona fide nuclear proteins (such
as histone acetyl- and methyltransferases) and other proteins such as ribosomal proteins, various kinases
and proteins involved in mitochondrial DNA processing. Further research and data would be needed to
clarify the localization status of the latter entries. The SimChrom category having the largest proportion
of proteins absent from the localization databases was “Histone tail cleavage” (Supplementary Figure
SF3_3C). This is consistent with the fact that for many of the histone tail cleavage enzymes (e.g.,
metalloproteinases, cathepsins, neutrophil elastase) the histone tail cleavage activity in the nucleus is
not their primary function and manifests only in specific conditions and cell development stages [9].
The comparison of SimChrom with NULOC_CS (our stringent high-confidence consensus dataset
of nuclear proteins) showed a substantially higher number of SimChrom proteins that were not included
in the NULOC_CS dataset (1208) (Supplementary Figure SF3_3A). This is not unexpected since
only 44% of the human proteome has simultaneous localization annotations at sufficient levels of
confidence from UniProt and HPA (see Results Section 3.1). The intersection of the SimChrom and
NULOC_CS datasets included 1837 proteins (Supplementary Figure SF3_3A), and may be
considered as a set of chromatin proteins with a high level of confidence. To further validate that the
subset of SimChrom that was not present in NULOC_CS (1208 proteins) represented nuclear proteins
we performed GO enrichment analysis of this subset against a list of all GO terms and then selected
non-nuclear associated terms for further analysis (see Methods Section 2.2, Supplementary Figure
SF3_4B, Supplementary Table ST8). The analysis confirmed the low number of SimChrom proteins
that were associated with bona fide non-nuclear GO categories (the “Centromere associated proteins”
category had the highest number, around a dozen out of 156, of proteins with “non-nuclear” GO
annotation terms; these belong to charged multivesicular body proteins and dynactin complex subunits).
Finally, as a byproduct of the NULOC_CS and SimChrom comparison we find that 1459 proteins were
present in NULOC_CS but not in SimChrom. This set may be considered both as a high-confidence set
of nuclear non-chromatin proteins and as a set of proteins that should be added to SimChrom. The GO
analysis of this dataset did not reveal any clear GO categories that should have been included in
SimChrom as categories related to chromatin functioning (see Supplementary Table ST9).
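The cross-comparison above reduces to three set operations on protein identifiers. A minimal Python sketch, assuming each dataset is available as a collection of protein accessions; the function name and the toy accessions are illustrative and are not taken from the SimChrom code:

```python
def cross_compare(simchrom_ids, nuloc_ids):
    """Split two protein ID collections into shared and dataset-specific parts."""
    simchrom, nuloc = set(simchrom_ids), set(nuloc_ids)
    return {
        "both": simchrom & nuloc,            # high-confidence chromatin proteins
        "simchrom_only": simchrom - nuloc,   # subset checked by GO enrichment
        "nuloc_only": nuloc - simchrom,      # nuclear, non-chromatin candidates
    }

# Toy accessions used purely as placeholders.
toy = cross_compare({"P1", "P2", "P3"}, {"P2", "P3", "P4"})
sizes = {name: len(ids) for name, ids in toy.items()}
```

With the counts reported above, the same bookkeeping implies 1837 + 1208 = 3045 SimChrom entries and 1837 + 1459 = 3296 NULOC_CS entries in this comparison.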
Limitations of the proposed chromatin classification SimChrom and its contents include the
following: the absence of cell cycle control proteins and checkpoint signaling proteins, and the lack of
a detailed classification of proteins involved in reading, writing and erasing DNA and RNA
modifications. The 'Genomic location' categories require additional curation supported by experimental
evidence to enhance the accuracy and reliability of their protein content. In addition, we did not consider
the protein components of nonmembrane nuclear organelles whose proteins may also functionally
interact with nucleic acids (directly or through phase separation). The classification does not include
protein isoforms. Also, SimChrom is currently limited to human proteins only. These limitations will
be addressed in the future versions of SimChrom.
3. Analysis of the human chromatome
3.1. The chromatome composition and abundance of chromatin proteins
This section supplements and expands section 3.3.1. The chromatome composition and
abundance of chromatin proteins in the main text.
To understand chromatin functioning it is important to know the chromatome content not only
in terms of the set of proteins associated with chromatin, but also in terms of their abundance (i.e., the
(relative) number of proteins per cell or organelle). Hence, we aimed at analyzing the available mass
spectrometry data to address this question. The analysis of MS protein intensities from the experimental
chromatome/nucleome studies discussed above revealed a high degree of variability (see Figure 4A,
Supplementary Figure SF4_1). For instance, the estimated relative mass of histone proteins varied
from 0.1 to 58% depending on the study, suggesting a high degree of bias due to different experimental
techniques and analysis pipelines used to process raw mass spectrometry data (see Figure 4A). Hence,
for further analysis we relied on the “whole-organism” protein abundance information available in
PaxDb for H. sapiens [27]. PaxDb provides high quality information on protein abundance combined
from many experiments with high coverage, dynamic range and interaction consistency (estimated
consistency of abundance data with data on protein functional interactions) integrated over many cell
types and conditions. Among the datasets available in PaxDb we have chosen two whole-organism
datasets: the dataset with the highest proteome coverage (“H.sapiens - Whole organism (Integrated)”,
which covers 99% of the human proteome according to PaxDb, referred to as “PaxDb_INT” in this paper)
and the dataset with the highest interaction consistency score (“Whole organism, SC (Peptideatlas,aug,2014)”,
which covers 84% of the human proteome according to PaxDb, referred to as “PaxDb_PA” in this paper), see
Supplementary Figure SF4_1A. Our analysis showed that, with respect to PaxDb_PA, the PaxDb_INT
dataset has additional abundance information for around 2700 human proteins that almost exclusively
have low levels of expression (less than 1 ppm, see Supplementary Figure SF4_1B). Among these
proteins there are up to around 700 nuclear/chromatin proteins, hence we opted to use PaxDb_INT for
general characterization of the abundance distribution of chromatin/nuclear proteins (presented in
Figure 4B). The PaxDb_PA dataset showed a higher consistency with respect to the relative abundance of
functionally interacting chromatin proteins. The total abundance of different types of histone proteins
(H3, H4, H2A, H2B) matched their expected equimolar ratio (see Supplementary Figure SF4_2A,B).
Hence, PaxDb_PA was used for a detailed analysis of chromatin protein abundance distribution
between chromatin protein groups and individual proteins (Figure 4C,D).
Around half of the whole-organism human proteome consists of low-abundance proteins with
expression levels of less than 1 ppm (~50%, see Figure 4B, Supplementary Figure SF4_1C,
Supplementary Table ST10). The whole-proteome abundance distributions are positively skewed
towards the low-abundance proteins. In the PaxDb_INT dataset this skewness is additionally supplemented
by a second peak at the low abundance values (Figure 4B, Supplementary Figure SF4_1A). Among
the low-abundance proteins only 25% correspond to housekeeping proteins, while the
proportion of housekeeping proteins among high-abundance proteins (abundance of more than 1 ppm)
is 68% (see distribution in Figure 4B, Supplementary Figure SF4_1C, and Supplementary Table
ST10). We used PaxDb_INT data to analyze abundance distributions of the database-derived
chromatome/nucleome protein sets discussed in the previous sections of the paper.
The NULOC_CS, NULOC_JT, and SimChrom datasets reported above all manifested
distributions mirroring that of the whole proteome in the PaxDb_INT data (see Figure 4B), with a
significant proportion of proteins in these datasets still represented by low-abundance proteins (40%,
44%, and 48%, for NULOC_CS, NULOC_JT, and SimChrom, respectively, see Supplementary
Figure SF4_1C, Supplementary Table ST10). We next aimed at understanding the types of proteins
contributing to the low- and high-abundance portions of the nucleome/chromatome. The proportions of
housekeeping/non-housekeeping proteins in low- and high-abundance fractions of different database-
derived datasets are given in Supplementary Figure SF4_1C and show that around 60% and 40% of
chromatin/nuclear proteins are housekeeping ones for the high- and low-abundance fraction,
respectively. As discussed above, nuclear and chromatin database-derived protein sets are on average
enriched in housekeeping proteins with respect to the whole proteome (~58.5% vs ~47%, see
Supplementary Results and Discussion Section 1.3, Supplementary Figure SF4_1C,
Supplementary Figure SF2_7A,B). This increase in the proportion of housekeeping proteins stems
from the increase in the number of both low-abundance and high-abundance housekeeping proteins
relative to the respective total numbers of low-abundance and high-abundance proteins in
nucleome/chromatome datasets. A detailed analysis showed that the increase of the fraction of
housekeeping proteins among the low-abundance ones was more than expected, while for the high-
abundance ones it was less than expected (under the assumption that high- and low-abundance fractions
should contribute to the increase proportionally to the number of housekeeping proteins belonging to
these fractions, see Supplementary Figure SF4_1C). For instance, under parsimonious considerations
the overall increase in the fraction of housekeeping proteins for SimChrom with respect to the whole
proteome (60% vs 47%) should imply an increase of the housekeeping proteins' fraction among the low-
abundance proteins from 25% to 32% (25*60/47=32), yet an increase to 38% was observed. This
highlights the important role that low-abundance housekeeping proteins play in chromatin
functioning. A more detailed analysis revealed that 64% of these housekeeping low-abundance
chromatin proteins belong to the housekeeping DNA-binding transcription factors group
(Supplementary Figure SF4_1D).
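The proportional-scaling expectation used above (25*60/47=32) can be reproduced in a few lines of Python. This is only a sketch of the arithmetic; the function name is ours, and the inputs are the rounded percentages quoted in the text:

```python
def expected_low_abundance_hk(base_low_hk, set_hk, proteome_hk):
    """Scale the low-abundance housekeeping share by the overall enrichment.

    All arguments are percentages; the result is the share expected if the
    low- and high-abundance fractions grew proportionally.
    """
    return base_low_hk * set_hk / proteome_hk

expected = expected_low_abundance_hk(25, 60, 47)   # whole proteome -> SimChrom
observed = 38                                      # value reported in the text
excess = observed - expected                       # percentage points above expectation
print(f"expected {expected:.1f}%, observed {observed}%, excess {excess:.1f} points")
```

The expected value rounds to 32%, so the observed 38% lies roughly six percentage points above the proportional expectation.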
We next applied a similar analysis to the sets of chromatome/nucleome proteins identified in
MS-based studies. The resulting distributions differed considerably from the distributions of database-
derived protein sets discussed above, having a single maximum centered at higher values of abundance
(see Figure 4B). This fact again suggests that MS-based studies of chromatin are able mainly to
recover highly expressed proteins and miss proteins with low expression (low-abundance proteins are in
the range of 1%-27% of the identified protein sets, Supplementary Table ST10, Supplementary
Figure SF4_1C). The protein sets recovered by MS-based studies were significantly enriched in
housekeeping proteins when compared to database-derived datasets (Supplementary Table ST10,
Supplementary Figure SF4_1C). The abundance distributions varied between different MS-derived
chromatin/nucleome protein sets. The chromatome studies based on extraction and/or cross-linking
techniques (Alabert et al., 2023; Shi et al., 2021; Ginno et al., 2018; Torrente et al., 2011) had
distributions shifted towards higher values of abundance than the studies by Kustatscher et al., 2014
and Ugur et al., 2023, suggesting that the latter studies were also able to capture more chromatin proteins
with low abundance. The nucleome study by Itzhak et al., 2016 was also able to capture more
lower-abundance proteins.
We next aimed at understanding the abundance of different chromatin protein groups and
individual chromatin proteins in the cell relying on our SimChrom-SL classification using the PaxDb_PA
abundance dataset. The resulting diagrams depicting abundance variations of chromatin proteins
belonging to different SimChrom-SL categories, the number of proteins belonging to the respective
categories, and the cumulative abundances (calculated both as the total number of protein molecules
and the total molecular weight of protein molecules belonging to each SimChrom-SL category) are
presented in Figure 4C. To gain additional insights into the functioning of chromatin, in Figure 4D we
plotted the abundance values of highly expressed chromatin proteins (abundance of more than 1% of
the H4 histone abundance) belonging to SimChrom-SL categories of the “Molecular function” or
“Physico-chemical properties” type. Abundance data for all histone proteins and non-histone chromatin
proteins with abundance of more than 0.01% of histone H4 are presented in Supplementary Table
ST12. It is important to note that many chromatin proteins have additional localization in other cellular
compartments, hence the presented data reflect the overall abundance of the chromatin proteins in the
cell rather than their abundance in the nucleus. To shed more light on the protein abundance in the
nucleus we have also built diagrams analogous to Figure 4C only for the 802 SimChrom proteins that are
uniquely localized in the nucleus (according to our NULOC_CS_UL dataset) (see Supplementary
Figure SF4_3A). These proteins are also highlighted in Figure 4D. As seen in panel 1 of Figure 4C,
chromatin categories vary substantially by their median abundance, from 0.09 ppm to 570 ppm, and there
is still considerable variation in the abundance values within the categories. The most abundant
chromatin protein is histone H4 (~11000 ppm), which is expressed by a family of genes almost
exclusively coding the same protein sequence (except for H4C7, which has a negligible abundance). It
is convenient to measure the abundance of all other proteins in fractions of H4 abundance (see Figure
4D). Each nucleosome contains two copies of histone H4, therefore the numbers are also easily
converted to the relative abundance of chromatin proteins per nucleosome. A detailed analysis of histone
protein abundance is given in Supplementary Table ST12 and shown in Supplementary Figure SF4_2A,C.
The total abundances of the core nucleosomal histone types H3, H4, H2A, H2B expressed by various genes
sum up to similar numbers (~10400-10900 ppm), consistent with their equimolar association within
nucleosome core particles. The cumulative abundance of H1 histones (~4500 ppm) suggests that
slightly less than one H1 histone is associated with each nucleosome. The most abundant core histone
variants are H3.3 (23% of H4, 2530 ppm), H2A.X (6.5%, 714 ppm), H2A.Z (10%, 1140 ppm), H2A.W
(3.8%, 423 ppm). The least abundant histone variants are H2A.B and H1.7 (less than 1 ppm). Despite
the relatively small number of protein-coding human histone genes (108), many of which code for
identical sequences, the cumulative abundance of histone proteins exceeds that of all other chromatin
protein categories even if proteins with multiple localization are taken into account (see panels 2,3 in
Figure 4C). However, when the total molecular weight of proteins belonging to different categories is
compared, the relatively small size of histone proteins (median ~15 kDa) results in them yielding the
first place to RNA processing proteins (see panel 4, Figure 4C). Collectively, the cumulative weight of
proteins belonging to the “Nuclear RNA binding proteins” category (that combines Preribosome-
associated, RNA modification, and RNA processing categories) amounts to 30.4% of the weight of all
SimChrom proteins (4.8% of the whole-organism proteome weight). However, many proteins from these
categories are also localized in the cytoplasm, and the major contribution to their cumulative molecular
weight likely comes from the cytoplasmic fraction. If the same analysis is performed only for the
SimChrom proteins that are uniquely localized in the nucleus (Supplementary Figure SF4_3A), the
mass fraction of histones goes up to 38% of all the chromatin proteins that are uniquely localized in the
nucleus.
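Because each nucleosome carries two H4 molecules, an abundance in ppm can be converted to copies per nucleosome by dividing it by half of the H4 abundance. A minimal sketch using the rounded ppm values quoted above; the function and constant names are illustrative:

```python
H4_PPM = 11000          # approximate H4 abundance quoted in the text
H4_PER_NUCLEOSOME = 2   # two H4 copies per nucleosome core particle

def fraction_of_h4(ppm, h4_ppm=H4_PPM):
    """Abundance expressed as a fraction of the H4 abundance."""
    return ppm / h4_ppm

def copies_per_nucleosome(ppm, h4_ppm=H4_PPM):
    """Average number of protein copies per nucleosome."""
    return ppm / (h4_ppm / H4_PER_NUCLEOSOME)

h1_per_nucleosome = copies_per_nucleosome(4500)   # cumulative H1 abundance
h33_share = fraction_of_h4(2530)                  # H3.3 relative to H4
```

With these inputs `h1_per_nucleosome` is about 0.8, in line with slightly less than one H1 per nucleosome, and `h33_share` is 0.23. Since the ppm values are whole-cell abundances, the result is an upper-level estimate for proteins that are not exclusively nuclear.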
Other functional chromatin protein groups (or groups with specific properties) with high values
of median abundance and high levels of individual protein abundance include HMG A/B/N, histone tail
cleavage, histone chaperones, chromatin remodelers and other categories (see Figure 4D). The high
mobility group proteins (HMG A/B/N) are the second group after histones ranked by their median
abundance. Although grouped together for historical reasons, they include three separate
superfamilies: HMGA (contains AT-hook domains), HMGB (contains the DNA binding HMG-box
domain), and HMGN (contains the nucleosome binding domain). Our analysis suggests that the ratio of
HMG proteins to nucleosomes is 1:8, 1:2, 1:3 for HMGA, HMGB, and HMGN proteins, respectively.
However, the majority of HMG proteins are not exclusively localized in the nucleus (Figure 4D). The
histone tail cleavage proteins are another small group of proteins in our classification with high median
abundance in the whole-organism proteome. These enzymes, however, are not exclusively specific to
histone cleavage, and likely perform their main functions outside the nucleus by cleaving other proteins.
Among histone chaperones the H3-H4 histone chaperone NPM1 and the H2A-H2B histone chaperone NCL
have the highest abundance, 32% and 13% of H4 abundance, respectively. The most abundant histone
variant specific chaperone is ANP32E (specific to H2A.Z-H2B, with an abundance of 2%). The whole
nucleosome chaperone FACT complex, consisting of the SSRP1 and SUPT16H gene products, has an
abundance of around 1%, amounting to one FACT complex per around 50 nucleosomes. Among RNA
polymerase subunits, POLR2E, the common subunit E of RNA polymerases I, II, and III, is the most
abundant protein (0.58% of histone H4 abundance, or around 1 per 90 nucleosomes). The exclusive
components of polymerase II (POLR2B, POLR2C, POLR2D, and others) have their abundances in the
range of 0.05-0.3%. With a median human gene length of 24 kb and a nucleosomal repeat length of around
200 bp this gives a lower estimate of one polymerase II per approximately 10 genes. Among genes
involved in chromatin remodeling, actin-encoding genes (ACTB, ACTA1 and actin-like ACTL6A) are
leading by the abundance of their protein products. While actin is a component of some chromatin
remodeling complexes (e.g., SWI/SNF), the major contribution to its abundance clearly comes from its
SF5_5B). Their enrichment values are in the range 1.13-1.18. By classes of amino acids,
chromatin/nuclear proteins are mostly enriched in polar (N, Q, T, C, G, P), small (P, G, A, S), and
positive (K, R) amino acids (Supplementary Figure SF5_4A, Supplementary Figure SF5_5B). It is
important to note that such an analysis should be taken with a grain of salt, because the enrichment of
certain amino acids may vary across different categories of chromatin proteins, and the categories with
a high number of protein entries have a higher contribution to the overall average. To elucidate this
variability we have also performed enrichment analysis for proteins in major functional SimChrom
categories (see Supplementary Figure SF5_4).
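Fold enrichment of an amino acid can be computed as its frequency in one protein set divided by its frequency in a reference set. The sketch below pools residues over each set, which is one possible convention and an assumption on our part (the Methods define the actual procedure and the statistical testing); the sequences are toy placeholders, not real proteins:

```python
from collections import Counter

def aa_frequencies(sequences):
    """Pooled amino acid frequencies over a collection of sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}

def fold_enrichment(target_seqs, reference_seqs):
    """Frequency ratio target/reference for residues present in both sets."""
    target = aa_frequencies(target_seqs)
    reference = aa_frequencies(reference_seqs)
    return {aa: target[aa] / reference[aa] for aa in target if aa in reference}

# Toy sequences standing in for a chromatin set and a cytoplasmic reference set.
fe = fold_enrichment(["KKSA", "KSPA"], ["KSAA", "LSAA"])
```

Pooling lets categories with many proteins dominate the average, which is the caveat raised above; computing the same ratio per category avoids that.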
Serine and proline are among the amino acids that are relatively abundant in proteins (abundance
of around 7-8% and 5-6% in chromatin and cytoplasmic proteins, respectively, Supplementary Table
ST13). The total enrichment of serine in chromatin proteins is attributable both to its
enrichment in IDR and non-IDR regions (relative to IDR and non-IDR regions of cytoplasmic proteins)
and, more importantly, to the higher proportion of IDR regions in chromatin proteins (46% vs 23%), which
in turn have a considerably higher proportion of serine than non-IDRs (Supplementary Table ST13).
The slight enrichment of serine in IDRs was observed across almost all SimChrom categories
(Supplementary Figure SF5_4E). In non-IDR regions serine showed both enrichment and depletion
in certain categories, and the overall enrichment was driven mainly by transcription factors due to the
large number of proteins in these categories. The total enrichment of proline in chromatin proteins is
attributable to its enrichment in IDRs (relative to IDRs of cytoplasmic proteins) and, more importantly,
to the higher proportion of IDR regions in chromatin proteins (proline is the most enriched amino acid
in IDRs of both chromatin and cytoplasmic proteins versus the non-IDRs, fold enrichment 1.8-2.1,
Supplementary Table ST13). The overall enrichment of proline in non-IDRs was close to one and
statistically not significant. Certain small groups, such as histones and HMG proteins, showed
considerable deviations in proline content in their non-IDRs (Supplementary Figure SF5_4H). The
enrichment of proline in IDRs was observed in many SimChrom categories (Supplementary Figure
SF5_4E). In certain categories, such as non-housekeeping transcription factors and pioneer TFs, the
enrichment was high (1.37 and 1.55 fold, respectively). Surprisingly, proline was depleted in IDRs
of housekeeping TFs (FE of 0.92), suggesting that while there is still a considerable
fraction of prolines in IDRs of housekeeping TFs, this fraction is significantly lower than in IDRs of
non-housekeeping TFs (7% and 10.4% median fractions, respectively).
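The contributions to the overall serine (or proline) enrichment discussed above can be made explicit with a simple mixture formula. The notation is ours and purely illustrative: $p$ is the fraction of residues located in IDRs, and $f^{\mathrm{IDR}}$, $f^{\mathrm{non}}$ are the fractions of the amino acid within IDRs and non-IDRs; subscripts chr and cyt denote chromatin and cytoplasmic proteins.

```latex
f = p\, f^{\mathrm{IDR}} + (1 - p)\, f^{\mathrm{non}},
\qquad
\mathrm{FE} =
\frac{p_{\mathrm{chr}}\, f^{\mathrm{IDR}}_{\mathrm{chr}} + (1 - p_{\mathrm{chr}})\, f^{\mathrm{non}}_{\mathrm{chr}}}
     {p_{\mathrm{cyt}}\, f^{\mathrm{IDR}}_{\mathrm{cyt}} + (1 - p_{\mathrm{cyt}})\, f^{\mathrm{non}}_{\mathrm{cyt}}}
```

Treating the quoted 46% and 23% as residue fractions ($p_{\mathrm{chr}} = 0.46$, $p_{\mathrm{cyt}} = 0.23$), and given $f^{\mathrm{IDR}} > f^{\mathrm{non}}$, FE exceeds one even if the per-region fractions were identical in chromatin and cytoplasmic proteins, which is why the IDR proportion is the dominant term.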
Cysteine and histidine are among the amino acids that have a relatively low abundance in proteins
(abundance of around 1-2.5%, Supplementary Figure SF5_5B). The total enrichment of cysteine and
histidine is mainly driven by the prevalence of zinc finger-containing transcription factors
(Supplementary Figure SF5_4B). The exclusion of these proteins from the analysis resulted in the
disappearance of any statistically significant enrichment (Supplementary Figure SF5_5A).
Interestingly, the enrichment of positive amino acids is only statistically significant for lysine,
but not for arginine, and the enrichment is relatively moderate (1.03 in chromatin) (Supplementary
Figure SF5_4A). Lysines are enriched in IDRs and non-IDRs of chromatin proteins, while arginines
are depleted in IDRs and enriched in non-IDRs (when compared with IDRs and non-IDRs of
cytoplasmic proteins) (Supplementary Figure SF5_5B). The overall higher positive charge of
chromatin proteins stems also from the depletion of negatively charged amino acids in their sequence.
The depletion of aspartate in chromatin/nuclear proteins is statistically significant (fold enrichment is
around 0.9), while the depletion of glutamate is statistically non-significant (Supplementary Figure
SF5_4A). This suggests that the increased positive charge of chromatin/nuclear proteins has its main
contributions in the depletion of aspartate and the moderate enrichment of lysine.
Within the IDR regions of chromatin proteins tyrosine and asparagine were also significantly
enriched (FE of 1.23 and 1.17 versus the IDRs of cytoplasmic proteins, respectively), mainly due to the
contribution of transcription factors (Supplementary Figure SF5_4F).
Among the most depleted amino acids in chromatin/nuclear proteins are tryptophan and
hydrophobic/aliphatic amino acids like valine, isoleucine, leucine, and methionine (Supplementary
Figure SF5_4A). Tryptophan is the rarest amino acid in proteins (around 1%). Certain categories like
histone and HMG proteins lack it completely (Supplementary Figure SF5_4C). It is depleted in almost
all chromatin categories, except for a few small categories such as DNA (de)methylation, histone tail
cleavage, and histone modification, where the enrichment comes from non-IDRs (Supplementary
Figure SF5_4I). Hydrophobic/aliphatic amino acids are depleted in IDRs vs non-IDRs of proteins and
hence the large proportion of IDRs in chromatin proteins accounts for a lower fraction of these amino
acids in chromatin proteins (Supplementary Figure SF5_4F,I).
The most enriched amino acids in the uniquely localized nuclear proteins were the same except
for cysteine (this difference may be traced to the diminished number of transcription factors enriched
in cysteines in the NULOC_CS_UL dataset, apparently because of their multiple localization, see
Supplementary Figure SF5_4A, Supplementary Figure SF4_3B).
3.3. Detailed analysis of the domain composition of chromatin proteins and identification of new
structural domains
This section supplements and expands section 3.3.3. Domain composition of chromatin
proteins and identification of new structural domains in the main text.
Next we set out to systematically analyze the available data on structural characterization,
domain annotation and domain composition of chromatin proteins. We specifically explored the
structurally uncharacterized portion of the chromatome (“dark” proteome) and identified potential new
structural domains that are predicted by AI-based protein structure prediction tools (see Figure 6A).
Historically, protein domains are loosely defined as evolutionarily conserved units with
similarities at functional, structural and/or sequence levels [33]. Domains may represent single proteins
or exist in a variety of sequence contexts. Sequences of related individual protein domains may
be grouped and aligned to produce domain models. Domain models are catalogued and annotated by a
number of resources/databases such as Pfam [34], CDD [35], CATH [36], InterPro [37], and may be
further grouped into superfamilies, clans, folds, etc. [38,39]. Domain models are usually defined through
multiple sequence alignments (MSA) and corresponding hidden Markov models (HMM). In structure-
based approaches (e.g., the CATH/Gene3D database) domain superfamilies are assigned through grouping
and alignment of available experimental 3D structures. The ultimate experimental structural
characterization of chromatin proteins is available in the PDB database; however, recent progress in
protein structure prediction spurred by AlphaFold resulted in new approaches to the structural
characterization and discovery of new structural domains (e.g., as implemented in the TED database)
[40] (Figure 6A). Structure prediction algorithms combined with structure similarity search algorithms,
such as FoldSeek [41], now make it possible to find remote homologs and assign individual domains to
their respective superfamilies.
Figure 6B shows the fraction of the aggregate number of amino acids in all human chromatin
proteins (referred to below as the “aggregate chromatome sequence”, or ACS) which are structurally
characterized or have domain annotations in different databases. According to AlphaFold,
approximately one half of the ACS (47%) is predicted to be intrinsically disordered, or to become
ordered within protein-protein complexes (the structurally uncharacterizable “dark” chromatome) (see
Methods), and the rest as having a distinct 3D structure. Direct experimental structural data in PDB are
available for only one fourth of the ACS (20% of the ACS is simultaneously considered ordered by AFDB
and available in PDB). Hence, we envision that at least one third (34%) of the aggregate human
chromatome sequence is amenable to characterization with structural biology methods but has not yet
been characterized (constitutes the potentially structurally characterizable “dark” chromatome). The
Pfam database (the largest sequence-based database of protein domains and protein families to date)
has annotations for around 39% of the aggregate human chromatome sequence. The CATH database,
which focuses on identifying and annotating structural domains, annotates 25% of the ACS, while the
automated AlphaFold-driven TED resource finds structural domains in 35% of the ACS. The difference
between the fraction of the ACS annotated by TED and that considered ordered by AFDB was traced to at
least several facts: 1) AlphaFold is known to be biased to predict long solitary alpha-helices which are
not considered domains by algorithms that identify structural domains, 2) the TED algorithm frequently
fails to annotate repetitive regions that contain visually identifiable secondary structure elements within
large multidomain proteins, 3) we identified non-IDRs as regions of no less than 4 amino acids whereas
the median length of a TED domain in human proteins was 108 aa. A caveat that has to be kept in mind is
that the current automated analysis using AlphaFold is based only on predictions of single-chain proteins,
while in reality chromatin proteins engage in many intermolecular interactions. To some extent Pfam
and CATH/TED are complementary (see Figure 6B). In addition to the 39% of the ACS annotated by Pfam,
TED annotates an additional 13% of the ACS, and CATH adds annotations to 3% of the ACS on top of it
(yielding a combined annotation coverage of 55% by these three resources).
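The incremental coverage figures (39% + 13% + 3% = 55%) correspond to the union of residue-level annotations added resource by resource. A minimal sketch of that bookkeeping on a toy 100-residue sequence; the interval coordinates are invented so that the percentages mirror those quoted above, and the function name is illustrative:

```python
def incremental_coverage(length, ordered_annotations):
    """Return the share of residues newly covered by each resource, in order.

    ordered_annotations: list of (name, [(start, end), ...]) with 1-based,
    inclusive residue intervals.
    """
    covered = set()
    added = {}
    for name, intervals in ordered_annotations:
        residues = {i for start, end in intervals for i in range(start, end + 1)}
        added[name] = len(residues - covered) / length
        covered |= residues
    added["combined"] = len(covered) / length
    return added

toy = incremental_coverage(100, [
    ("Pfam", [(1, 39)]),
    ("TED", [(30, 52)]),    # overlaps Pfam, adds residues 40-52
    ("CATH", [(50, 55)]),   # overlaps TED, adds residues 53-55
])
```

Counting only newly covered residues is what keeps overlapping annotations from being double counted, so the per-resource increments sum to the combined coverage.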
Next we analyzed the structural characterization of the aggregate human chromatome sequence
from the point of view of structural domains present in chromatin proteins (as identified by the most
comprehensive TED database, which automatically detects structural domains) (see Figure 6C).
Chromatin proteins contain in total 6246 individual TED domains. Using FoldSeek and a combination of
FoldSeek and CATH resources (see Methods) we matched these domains to the structurally related
domains in PDB or CATH superfamilies. The remaining domains were analyzed for the presence of
previously uncharacterized structural folds/superfamilies and potential functional roles of these
domains. Among the 6246 predicted structural domains constituting human chromatin proteins, 34%
had exact matches in PDB structures (100% sequence identity, see Methods), and 56% matched PDB
structures of homologues with different levels of sequence identity (from 99% to 5%, for details see
Figure 6C). The majority of these homologous domains were in fact different paralogous sequences
found within human genes (even for domains with sequence identity of 35-50% the fraction of human
sequences among the matches was 51%). For matches with sequence identity above 35% the second
largest contribution came from structures of mammalian homologues; for matches with sequence
identity below 35% significant contributions came from structures derived from proteins of fungi,
protostomia and bacteria (see Supplementary Figure SF6_1A for details). Additionally, 6% of TED
domains that lacked direct hits among the PDB structures were mapped to protein structural
superfamilies in the CATH database (the information about potential sequence variation in each
homologous superfamily collected in the CATH database, combined with AlphaFold structural predictions,
made it possible to identify more distant structurally characterized homologues). The remaining 4% (241 TED
domains) represented domains that could not be matched to any known protein structure or protein
structure superfamily, potentially representing new types of structural superfamilies or even protein
folds. These domains are presented in Supplementary Table ST14 (see also Interactive Table 3 at
https://simchrom.intbio.org/#novel_structural_domains) and are ranked by their structural complexity,
measured as the number of their secondary structure elements. Among these domains, 123 domains have annotations
in Pfam or other domain annotation databases present in InterPro, leaving 118 domains that are
completely without annotations. The latter domains belong to 106 chromatin proteins, which may be
considered as prospective new targets for experimental studies of their function and structure. Among
such proteins is, for example, the protein encoded by the GTF3C1 gene, General transcription factor
3C polypeptide 1 (it has a previously unannotated and uncharacterized structural domain with a length of
233 amino acids) (see detailed characterization in Supplementary Figure SF6_2A). Another
instructive example is the globular domain of the testis-specific linker histone H1.7 (product of the H1-7
gene, see Supplementary Figure SF6_2B). Despite the considerable number of studies dedicated to
the elucidation of the structure of H1 linker histones [42], the H1.7 histone variant (previously named
HANP1/H1T2) has a quite different sequence, resulting in a predicted structure that has a different
topology than that of the known H1 histones (the “wing” of the globular domain consists of three beta-sheets
rather than two). The relation of this domain to the H1 histone family cannot be identified with
conventional sequence analysis methods (such as those implemented in the Pfam database); however,
it should be noted that new deep-learning-based annotation approaches (such as Pfam-N) are able to
annotate it (see Supplementary Figure SF6_2B).
Next we predicted GO molecular functions and biological processes for the above-mentioned 118
non-annotated chromatin protein domains using DeepFRI [43], a Graph Convolutional Network for
predicting protein functions by leveraging sequence features extracted from a protein language model
and protein structures (see Methods Section 2.5). The top-7 most common GO MF terms were: ion binding, organic
cyclic compound binding, heterocyclic compound binding, protein binding, cation binding, metal ion
binding, nucleic acid binding. The top-10 GO BP terms were: organic substance metabolic process, primary
metabolic process, cellular metabolic process, nitrogen compound metabolic process, macromolecule
metabolic process, organonitrogen compound metabolic process, regulation of cellular process, cellular
macromolecule metabolic process, cellular nitrogen compound metabolic process, cellular response to
stimulus. Nine out of the 118 TED domains lacked a predicted GO molecular function from DeepFRI; two of
them were in members of the regulatory factor X (RFX) family of transcription factors (encoded by the
genes RFX1 and RFX5).
Many chromatin proteins contain similar, evolutionarily related individual protein domains
whose kinship may be identified by matching them to the same Pfam domain sequence models. Hence,
we used the Pfam domain annotation to characterize the diversity of protein domains found in chromatin
proteins and the typical domain composition thereof. In total 1753 different domain types (sequence
models) were identified in chromatin proteins (Figure 6D). Next we analyzed the structural information
available for these models. 76% of domain models had at least one individual domain among chromatin
proteins that could be matched to a PDB structure using FoldSeek (bona fide structural domain in
Figure 6D). To characterize the comprehensiveness of the structural characterization of each domain
model we estimated the median sequence identity between all individual domains in chromatin proteins
belonging to the said domain model and their best matches in PDB found via FoldSeek (see Methods
Section 2.5, Figure 6D). 42% of domain models were considered fully characterized, i.e. every
individual domain in chromatin proteins belonging to these models can be found in PDB; 34% of
domain models are partially characterized. 14% of Pfam domains were not matched by FoldSeek to
PDB structures, but could still be identified in PDB via sequence search methods; these represented
IDR regions, repeats, etc. 3% (55 domain models) did not match any PDB structure but could be
matched to structural domains predicted by AlphaFold and found in the TED database. These represent
prospective targets for validation with structural biology methods and further investigation of their
interactions. For instance, among these are domains potentially associated with chromatin
remodeling (SANTA, zf-C3Hc3H), histone PTM writing (DUF7030, COMPASS-Shg1), zinc fingers
(zf_CCCH_4, zf-LITAF-like, zf-WIZ, SWIM), etc. 7% of Pfam domain models currently have no
structural information that can be assigned either through the PDB or TED databases.
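The per-model summary described in this paragraph can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the input layout (a dictionary mapping each Pfam model to the best-match sequence identities of its individual domains, with None for a domain lacking a FoldSeek hit in PDB), the function name summarize_models, the model names, and the rule that "fully characterized" means every individual domain has a 100% identity match are all assumptions made for illustration.

```python
from statistics import median


def summarize_models(best_hit_identity):
    """Summarize structural characterization per domain model.

    best_hit_identity: dict mapping a model name to a list with one entry
    per individual domain: the sequence identity (percent) of its best
    PDB match, or None if no match was found (assumed input layout).
    Returns a dict: model -> (status, median identity of matched domains).
    """
    summary = {}
    for model, identities in best_hit_identity.items():
        matched = [x for x in identities if x is not None]
        if not matched:
            summary[model] = ("no FoldSeek match in PDB", None)
            continue
        med = median(matched)
        # Assumed rule: fully characterized only if every domain of the
        # model has an exact (100% identity) counterpart in PDB.
        if len(matched) == len(identities) and min(matched) == 100:
            status = "fully characterized"
        else:
            status = "partially characterized"
        summary[model] = (status, med)
    return summary


# Hypothetical toy input; the model names are placeholders.
toy = {
    "model_X": [100, 100],
    "model_Y": [100, 42.0, None],
    "model_Z": [None],
}
summary = summarize_models(toy)
for model, (status, med) in summary.items():
    print(model, status, med)
```

In this toy input, model_X falls into the fully characterized class, model_Y into the partially characterized class, and model_Z would need a sequence-based search or a TED match to be placed in one of the remaining classes.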
We next analyzed the diversity of Pfam domains in various SimChrom-SL protein categories
(Figure 6E, subpanels 1,2) and the domain content of individual proteins belonging to these categories
(Figure 6E, subpanels 3-5). Pfam identified 11147 individual domains in chromatin proteins belonging
to 1753 domain types (Pfam models); only 70 chromatin proteins had no domain annotation at all. For
the distribution of the total number of Pfam models and distinct Pfam models identified in chromatin
proteins, see Supplementary Figure SF6_1B and Supplementary Figure SF6_1C. As expected, large
SimChrom-SL categories consisting of more than one hundred proteins harbored the largest number of
different domain types (e.g., DNA-acting enzymes, histone PTM writers, transcription factors, etc.),
while the smaller categories had fewer (see Figure 6E, subpanel 1). Although Pfam may not be
comprehensive in its annotation, we estimated the number of distinct domain types per protein in each
category (relative domain diversity, Figure 6E, subpanel 2). The average domain diversity was around
one for all categories. The categories with considerably lower domain diversity were the transcription
factor categories (their variability relies on different combinations of zinc-finger domains that are
described through only a few Pfam domain models), histones (their functional variability is often
conferred by only small changes in the sequence), and HMG-containing proteins (this is a very small
group of proteins with only nine proteins and three corresponding Pfam models). The median number
of Pfam domains in human chromatin proteins was two (which corresponds to the structure-based
domain analysis presented above). Certain chromatin protein categories had a higher median number
of domains, including Housekeeping TF, histone PTM writers and readers (Figure 6E, subpanel 3).
Interestingly, the median number of domains for Non-housekeeping TF was two (they more often rely
on single homeodomains than on cassettes of zinc-finger domains), although the group is diverse and
proteins with as many as 32 domains were present. This, however, is again explained by the large
number of zinc-finger domains that may be present in such proteins (see Supplementary Figure
SF6_1D). The presence of additional domains in PTM writers and readers may be hypothesized to have
evolved due to the functional necessity of multivalent binding to different chromatin structures (see
below for a detailed analysis). Some categories mostly consist of single-domain proteins, such as
Centromere-associated, DNA repair, Regulation of transcription, RNA polymerases (but this group also
includes proteins with a maximum of 42 domains), DNA recombination, RNA modification, Histones,
HMG_A/B/N, etc. The analysis of the number of distinct domain types present in chromatin
protein categories corroborates the above-mentioned analysis (Figure 6E, subpanel 4). Proteins from the
PTM writers group have a median of three distinct domain types, while all other categories
have fewer. Still, many chromatin proteins harbor many distinct domain types: DNA-acting enzymes,
histone PTM writers, chaperones, remodelers and transcription factors with as many as 8-9 distinct domains
are present (see Supplementary Table ST15). There are 118 chromatin proteins harboring at least five
different domain types (see Supplementary Figure SF6_1C). This highlights the multivalency of
protein interactions in chromatin, keeping in mind that many proteins further form protein-protein
complexes, increasing their interaction potential. The average individual domain length in chromatin
proteins is around 65 amino acids (the median is 28 aa); however, this number is biased by the presence
of many zinc-finger domains (around 22 aa in length). Subpanel 5 in Figure 6E gives a more balanced
view for each SimChrom category. For the majority of protein groups the median domain length is
around 100 amino acids (mean is 137, median is 134).
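The per-category quantities discussed in this paragraph (relative domain diversity, median number of domains per protein, median number of distinct domain types per protein) can be sketched in a few lines of Python. The sketch assumes that relative domain diversity is the number of distinct Pfam models in a category divided by the number of proteins in it; under that reading the HMG-containing group described above (nine proteins, three Pfam models) would score 3/9, or about 0.33. The function name, the input layout and the protein and model names are hypothetical.

```python
from statistics import median


def category_domain_stats(proteins):
    """Compute domain statistics for one protein category.

    proteins: dict mapping a protein id to the list of Pfam domain
    names annotated on it (repeated copies appear repeatedly).
    The category must contain at least one protein.
    """
    distinct_models = set()
    for domains in proteins.values():
        distinct_models.update(domains)
    return {
        # Assumed definition: distinct domain types per protein in the category.
        "relative_domain_diversity": len(distinct_models) / len(proteins),
        "median_domains_per_protein": median(len(d) for d in proteins.values()),
        "median_distinct_types_per_protein": median(
            len(set(d)) for d in proteins.values()
        ),
    }


# Hypothetical category with nine proteins and three domain models.
toy_category = {f"protein_{i}": ["model_A", "model_A"] for i in range(3)}
toy_category.update({f"protein_{i}": ["model_B"] for i in range(3, 6)})
toy_category.update({f"protein_{i}": ["model_C"] for i in range(6, 9)})

stats = category_domain_stats(toy_category)
print(stats)
```

Counting repeated copies separately from distinct types mirrors the distinction between subpanels 3 and 4 of Figure 6E: a protein with many zinc-finger repeats has a high domain count but may contribute only one domain type.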
The bird's-eye view of the most frequently occurring Pfam domains in various functional
SimChrom-SL categories is presented in Figure 7. The data are presented for domains that occur in at
least five chromatin proteins and in at least 10% of proteins in a category (the threshold for data point
depiction is 5%). The comprehensive interactive analysis figure with the ability to alter these thresholds
and switch between the SimChrom and SimChrom-SL classification systems is available in Interactive
Figure 4 (https://simchrom.intbio.org/#domain_composition). In Figure 7 the following categories and
their respective domains can be grouped, revealing their partially shared domain composition: 1) the
categories containing transcription factors and their zinc finger, homeodomain and KRAB domains
form the most frequently occurring entities, 2) some chromatin regulators, such as PTM writers, readers,
erasers and chromatin remodelers, together with their Chromo, Bromodomain and PHD domains.
3.4. Detailed analysis of the multivalent interactions in chromatin proteins
This section supplements and expands section 3.3.4. Multivalent interactions in chromatin
protein in the main text.
The presence of multiple domains (belonging to the same or different domain models) in
chromatin proteins is a known feature contributing to their ability to engage in multivalent interactions
(Figure 8A) [44]. Below we present the analysis of such domains engaged in multivalent interactions
(referred to as EMVI-domains hereafter) that are found in chromatin/epigenetics regulator proteins (see
Figure 3 for the definition of this group). To limit our analysis to a manageable set of EMVI-domains, we
selected those that were found in multiple copies or in combination with another Pfam domain in at
least three chromatin regulator proteins (94 Pfam domains in total), and from those we selected 59
domains that we were able to manually classify according to their functional binding modes, based on
the information currently available in the literature. The following functional groups of domains
were used: histone methylation/acetylation/phosphorylation, chromatin remodeling, histone binding,
DNA binding, DNA methylation, protein dimerization/oligomerization, PPI, RNA binding. Histone
post-translational modifications were further subdivided into reader, writer and eraser functional
subgroups (see Figure 8C, Interactive Figure 5 (https://simchrom.intbio.org/#domain_co-occurrence),
and Supplementary Table ST16 for the list of domains and their detailed classification).
We consider this subset of chromatin protein domains representative enough to illustrate the concept of
multivalency in the interactions of chromatin regulators, since the selected domains are extensively
characterized and their functions are known. A comprehensive analysis would require characterization
of all 409 Pfam domain models that are found in combination with other models or in multiple copies
in at least one chromatin regulator protein. As a compromise, our online database includes the analysis
of 163 Pfam domain models that are present in at least two chromatin regulator proteins (see
https://simchrom.intbio.org/#domain_co-occurrence).
First, we analyzed the co-occurrence of selected EMVI-domains in all chromatin proteins.
There were in total 922 chromatin proteins (306 with the exclusion of transcription factors) that had
more than one selected EMVI-domain (including multiple copies). The conditional probability of finding
a corresponding domain A in a chromatin protein given that another domain B is already present was
estimated and is presented in Figure 8C (columns and rows correspond to domains A and B,
respectively). Interactive Figure 5 is available at https://simchrom.intbio.org/#domain_co-occurrence
(it also includes unclassified potential EMVI-domains found in at least two chromatin
regulator proteins). The matrix in Figure 8C makes it possible to trace the interplay between different domains
employed in the architectures of chromatin proteins. The largest groups of domains in Figure 8C are those
involved in histone methylation and DNA binding, suggesting that these mechanisms are the most
represented and employed in the regulation of chromatin functioning.
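The conditional co-occurrence estimate behind such a matrix can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the code used to produce Figure 8C: P(A|B) is estimated as the fraction of proteins containing domain B that also contain domain A, the diagonal entry for a domain is taken to be the fraction of its proteins carrying more than one copy, and the function names and toy proteins are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_cooccurrence(protein_domains):
    """Estimate P(A | B) for all domain pairs observed together.

    protein_domains: dict mapping a protein id to a list of domain
    names, with repeated copies listed repeatedly.
    Returns a dict keyed by (A, B). For A == B the value is the
    fraction of proteins with the domain that carry several copies
    (an assumed reading of the diagonal). Pairs never observed
    together are absent from the result.
    """
    n_with = Counter()          # proteins containing domain B
    n_joint = defaultdict(int)  # proteins containing B that also contain A
    for domains in protein_domains.values():
        copies = Counter(domains)
        for b in copies:
            n_with[b] += 1
            for a, k in copies.items():
                if a != b or k > 1:
                    n_joint[(a, b)] += 1
    return {(a, b): n / n_with[b] for (a, b), n in n_joint.items()}


def reciprocal_full_associations(probs):
    """Return domain pairs for which P(A|B) and P(B|A) are both 1."""
    return sorted(
        {
            tuple(sorted((a, b)))
            for (a, b), p in probs.items()
            if a != b and p == 1.0 and probs.get((b, a)) == 1.0
        }
    )


# Hypothetical toy proteins; the domain names are real Pfam names,
# but the architectures are invented for illustration.
toy = {
    "protein_1": ["PHD", "PHD", "Bromodomain"],
    "protein_2": ["PHD", "SET"],
    "protein_3": ["Bromodomain"],
    "protein_4": ["MOZ_SAS", "zf-MYST"],
}
probs = conditional_cooccurrence(toy)
pairs = reciprocal_full_associations(probs)
print(probs[("Bromodomain", "PHD")], pairs)
```

Note that the estimate is asymmetric: in the toy data every protein with SET also has PHD, while only half of the PHD-containing proteins have SET, which is why rows and columns of such a matrix must be read in a fixed order.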
There were 49 cases where the association between the presence of various domains in chromatin
proteins was 100% (red squares in Figure 8C); among them, for 18 cases (9 domain pairs) the
association was reciprocal (i.e. P(A|B) = P(B|A)). In certain cases this exclusive association between
domains may be traced to the fact that they form a large structural complex with direct structural
interactions between the domains, as judged by visual inspection of AlphaFold-based predictions
(MOZ_SAS and zf-MYST, ADD_DNMT3 and DNMT3_ADD_GATA1-like). In other cases the
association is likely due to functional reasons; in our analysis, in the majority of cases such domains
were confined to the histone methylation readers subgroup (KDM3B_Tudor and PWWP_KDM3B;
C5HCH, NSD_PHD, PHD-1st_NSD, and PHDvar_NSD found in Histone-lysine
N-methyltransferases).
The EMVI Pfam domains that co-occur with the largest number of other Pfam
domains in chromatin regulator proteins are the histone methylation/acetylation domains PHD
(45 other domains), Bromodomain (38), SET (40) and PWWP (28), and the chromatin remodeling domains
Helicase_C (33) and SNF2-rel_dom (31).
The diagonal elements in Figure 8C show that certain domains tend to be present in multiple
copies, particularly MBT, WD40 and zinc finger (zf-C2H2) domains. However, all these are
special cases of short repeat domains, where multiple copies are needed to form one functional unit.
Less often, but still in a considerable number of proteins, PHD and Bromodomain may be found in multiple
copies.
Another, more general view of multivalent interactions in chromatin proteins may be obtained
if we trace the relationships within or between different functional groups of domains. One can see that
domains from the same functional group (e.g., histone methylation) tend to co-occur (Figure 8C) in
chromatin proteins and may also be present in multiple instances in proteins (Supplementary Figure
SF8_1B). For example, there may be up to nine histone methylation associated domains in chromatin
proteins (Supplementary Figure SF8_1B). If a chromatin protein has a domain involved in histone
methylation (either writing, reading or erasing) there is an estimated 38% chance that there will be
another, different functional domain from this group of domains (Supplementary Table ST17,
Supplementary Figure SF8_1C). For acetylation this estimated probability is 31%, for
phosphorylation 18%. Note that domain categorization is not trivial. For example, WD40 is a repeat
that folds into a higher-order structure (the median number in chromatin proteins is four). WD40 domains can
recognize both unmodified histone tails and methylated regions [45], and their classification may affect
this part of the study.
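The within-group estimate quoted in this paragraph can be reproduced in principle with a short calculation. The Python sketch below assumes the probability is the fraction of proteins carrying at least one domain of a functional group that also carry a second, different domain model from the same group; this reading, the function name, the toy proteins and the simplified domain-to-group mapping are assumptions (the classification actually used is given in Supplementary Table ST16).

```python
def same_group_partner_probability(protein_domains, domain_group, group):
    """Estimate the chance of a second, different domain from one group.

    protein_domains: dict protein id -> list of domain names.
    domain_group: dict domain name -> functional group label.
    group: the functional group of interest.
    Returns the fraction of proteins with at least one domain of the
    group that contain two or more distinct domain models from it.
    """
    with_group = 0
    with_partner = 0
    for domains in protein_domains.values():
        models = {d for d in domains if domain_group.get(d) == group}
        if models:
            with_group += 1
            if len(models) > 1:  # repeated copies of one model do not count
                with_partner += 1
    return with_partner / with_group if with_group else float("nan")


# Simplified, illustrative mapping and invented architectures.
domain_group = {
    "SET": "histone methylation",
    "PHD": "histone methylation",
    "PWWP": "histone methylation",
    "Bromodomain": "histone acetylation",
}
toy = {
    "protein_a": ["SET", "PHD"],
    "protein_b": ["PHD", "PHD"],
    "protein_c": ["PWWP", "Bromodomain"],
    "protein_d": ["Bromodomain"],
}
p_methylation = same_group_partner_probability(toy, domain_group, "histone methylation")
p_acetylation = same_group_partner_probability(toy, domain_group, "histone acetylation")
print(p_methylation, p_acetylation)
```

Because the result depends directly on the domain-to-group mapping, reassigning an ambiguous domain such as WD40 to another group changes both the numerator and the denominator of the estimate.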
Associations between the occurrence of domains from different functional groups can also
be observed. This can be seen in Figure 8C and in the upset plot in Figure 8D (see also Supplementary
Figure SF8_2A). One can see that domains involved in histone methylation (one of the most abundant
groups by the number of Pfam models and the number of chromatin proteins) may be combined, in a considerable
number of chromatin proteins, with other EMVI-domains (particularly DNA binding domains
and histone acetylation domains); association with histone binding domains, chromatin
remodeling, DNA methylation, (di/oligo)merization and RNA binding domains was also observed, see
Supplementary Figure SF8_2C. The same can be said about domains involved in histone acetylation,
although in a somewhat smaller number of cases, and with the exclusion of their combination with
dimerization domains. Notably, domains involved in histone phosphorylation were not found in
combination with domains from other functional groups in our analysis. This may reflect an
evolutionary strategy whereby combinations of histone methylation and acetylation evolved to
delicately regulate gene expression at the epigenetic level, while phosphorylation remained a more
general mechanism affecting a broad number of proteins and pathways in the cell. DNA binding domains
are found to be associated with methylation, acetylation, dimerization, and chromatin remodeling
domains, although the relative number of proteins harboring combinations of such domains is small
compared to the number of proteins (mainly transcription factors) that have DNA binding domains in
their sequence. Domains associated with catalytic subunits of chromatin remodeling complexes in certain
cases are found together with histone acetylation, methylation or DNA binding domains. A more
detailed analysis reveals that one “reader” domains of acetylation and methylation are found in these
proteins.
For a more comprehensive view of multivalent interactions it is reasonable to (1) analyze not
only pairwise co-occurrence of different functional domains, but also simultaneous co-occurrence of domains
from several functional groups in one protein, and (2) extend the analysis to complexes of chromatin
proteins. The results of such analysis are presented in Figure 8D (see Methods Section 2.1.4 for our
selection of 513 protein complexes from Complex Portal in which all proteins are chromatin proteins). In
our analysis at the level of individual proteins, proteins harbored domains from at most three
functional groups. In particular, DNA binding domains may be combined with (di/oligo)merization,
chromatin remodeling, histone methylation, methylation and acetylation or histone acetylation and
chromatin remodeling domains. The formation of chromatin protein complexes considerably affects the
available combinations of functional domains. Among the 513 analyzed protein complexes, 181
complexes contained EMVI-domains from the analyzed functional domain groups, 101
complexes harbored more than one domain, and 80 complexes harbored domains from different functional
groups. Of these 80 complexes the majority were various chromatin remodeling complexes (53
complexes); other representatives included acetyltransferase complexes (13), deacetylase complexes (4), and
DNA-methylation complexes (2).
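The counting behind an upset-style summary of functional group combinations, at the level of single proteins and of complexes, can be sketched as below. This is a hedged illustration only: it assumes a complex is described by pooling the domains of all its member proteins, and the function names, the membership dictionary and the toy data are hypothetical rather than taken from Complex Portal.

```python
from collections import Counter


def group_combinations(entity_domains, domain_group):
    """Count how often each combination of functional groups occurs.

    entity_domains: dict entity id (protein or complex) -> list of domains.
    domain_group: dict domain name -> functional group label.
    Returns a Counter keyed by frozenset of group labels; entities
    without any classified domain are skipped.
    """
    combos = Counter()
    for domains in entity_domains.values():
        groups = frozenset(domain_group[d] for d in domains if d in domain_group)
        if groups:
            combos[groups] += 1
    return combos


def pool_complex_domains(complex_members, protein_domains):
    """Pool the domains of all member proteins of each complex."""
    return {
        name: [d for p in members for d in protein_domains.get(p, [])]
        for name, members in complex_members.items()
    }


# Invented toy data for illustration.
domain_group = {
    "Helicase_C": "chromatin remodeling",
    "SNF2-rel_dom": "chromatin remodeling",
    "Bromodomain": "histone acetylation",
    "PHD": "histone methylation",
}
protein_domains = {
    "P1": ["Helicase_C", "SNF2-rel_dom"],
    "P2": ["Bromodomain"],
    "P3": ["PHD"],
}
complex_members = {"complex_1": ["P1", "P2"], "complex_2": ["P3"]}

protein_combos = group_combinations(protein_domains, domain_group)
complex_combos = group_combinations(
    pool_complex_domains(complex_members, protein_domains), domain_group
)
print(protein_combos)
print(complex_combos)
```

In the toy data no single protein spans more than one functional group, whereas complex_1 combines chromatin remodeling with histone acetylation, which illustrates how complex formation widens the set of available combinations.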
One can see that the largest number of analyzed complexes (22) simultaneously contained
domains from four functional groups (DNA binding, histone methylation, histone acetylation, and
chromatin remodeling); in a select number of complexes, domains belonging to up to six functional
groups were observed (all the above-mentioned together with domains involved in DNA methylation
and histone binding). These were all complexes involved in chromatin remodeling, for example
'MBD2 or MBD3/NuRD nucleosome remodeling and deacetylase complex'. Notably, in chromatin
complexes histone acetylation domains are found more often than histone methylation domains (unlike
in the case when individual proteins are analyzed); this might, however, be biased by the current list of