Genomic studies of piscine streptococci and genome assembly of Clariid catfishes (Siluriformes) for aquaculture enhancement

Author: Andres, Quentin Ludovic Stephane
Publisher: Zenodo
DOI: 10.5281/zenodo.17551161
Source: https://zenodo.org/records/17551161/files/thesis_latex_style_final-6.pdf
THESIS PROPOSAL
GENOMIC STUDIES OF PISCINE STREPTOCOCCI AND GENOME
ASSEMBLY OF CLARIIDAE CATFISHES (SILURIFORMES) FOR
AQUACULTURE ENHANCEMENT
QUENTIN LUDOVIC STEPHANE ANDRES
GRADUATE SCHOOL KASETSART UNIVERSITY
2025
THESIS PROPOSAL APPROVAL
GRADUATE SCHOOL KASETSART UNIVERSITY
DEGREE
Doctor of Philosophy (Fishery Science and Technology)
MAJOR FIELD
Fishery Science and Technology
FACULTY
Fisheries
TITLE
Genomic studies of piscine streptococci and genome assembly of Clariidae catfishes (Siluriformes) for aquaculture enhancement
NAME
MR. QUENTIN LUDOVIC STEPHANE ANDRES
THIS THESIS HAS BEEN ACCEPTED BY
(Associate Professor Prapansak Srisapoome, Ph.D.)
THESIS ADVISOR
(Professor Kornsorn Srikulnath, Ph.D.)
THESIS CO-ADVISOR
(Mr. Worapong Singchat, Ph.D.)
THESIS CO-ADVISOR
(Assistant Professor Methee Kaewnern, D.Tech.Sc.)
GRADUATE COMMITTEE CHAIRMAN
(Associate Professor Weeraphart Khunrattanasiri, Dr.rer.nat.)
DEAN
THESIS PROPOSAL
Genomic studies of piscine streptococci and genome assembly of Clariidae catfishes (Siluriformes) for aquaculture enhancement
QUENTIN LUDOVIC STEPHANE ANDRES
A Thesis Proposal Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy (Fishery Science and Technology)
Graduate School, Kasetsart University
Academic Year 2025
Contents
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND GLOSSARY
INTRODUCTION
OBJECTIVES AND RATIONALE
LITERATURE REVIEW
Genomic Technologies for Aquaculture
Clarias Catfish Aquaculture in Southeast Asia
Case Study: Genome Assembly Applications in Tilapia
GENOME ASSEMBLY – BIGHEAD CATFISH
Overview of Genome Assembly Strategy
Haplotype Resolution, Hi-C Scaffolding, and Phasing
GENOME ASSEMBLY – F1 HYBRID CATFISH
Overview of Genome Assembly Strategy
VALIDATION OF CATFISH ASSEMBLIES
Benchmarking Methodology
Results – Bighead Catfish
Results – F1 Hybrid Catfish
GENOME ASSEMBLY, REVERSE VACCINOLOGY, AND QUALITY BY DESIGN – Streptococcus iniae
Introduction
Methods
Results
DATA AVAILABILITY AND NCBI SUBMISSIONS
Bighead Catfish
F1 Hybrid Catfish
Streptococcus iniae
Associated Publications
DISCUSSION AND CONCLUSIONS
Genomic Insights and Technical Achievements
Streptococcus iniae Vaccine Development
Overall Conclusions
Recommendations
Appendix
Personal Information
List of Tables
Table
1 Bighead Catfish – Assembly Stats.
2 Bighead Catfish – Scaffold Metrics – Haplotype 1.
3 Bighead Catfish – Scaffold Metrics – Haplotype 2.
4 Bighead Catfish – TE Content.
5 F1 Hybrid Catfish – North African Subgenome – Scaffold Metrics.
6 F1 Hybrid Catfish – Bighead Subgenome – Scaffold Metrics.
7 F1 Hybrid Catfish – North African Subgenome – Structural Metrics.
8 F1 Hybrid Catfish – Bighead Subgenome – Structural Metrics.
9 S. iniae RV – Predicted antigenic epitopes from strain SIKU01.
10 S. iniae QbD – QTPP and CQAs definition workflow.
11 S. iniae – Compiled vaccine studies.
12 S. iniae – Supplementary Code for analysis.
13 S. iniae – Supplementary Data 1: Metadata and Proteome.
14 S. iniae – Supplementary Data 2: Pangenomics and MSAs.
15 S. iniae – Supplementary Data 3: QbD Manufacturability.
LIST OF FIGURES
Figure
1 Long-reads vs Short-reads
2 ONT sequencing
3 HiFi Sequencing
4 Illumina PE Sequencing
5 Hi-C Sequencing
6 Clarias catfish species
7 Bighead Catfish – Specimen Photograph.
8 Bighead Catfish – Genome Assembly Workflow.
9 Bighead Catfish – Genome Assembly Graph – Haplotype 1 and 2.
10 Bighead Catfish – Genome Assembly 1-Year Progress.
11 Bighead Catfish – mtDNA Alignment Against 208 Catfish Species.
12 F1 Hybrid Catfish – Genome Assembly Workflow.
13 F1 Hybrid Catfish – Assembly validation using IGV.
14 F1 Hybrid Catfish – Genome Assembly 1-Year Progress.
15 Bighead Catfish – DNA Sequencing.
16 Bighead Catfish – Hi-C Heatmap – Haplotype 1.
17 Bighead Catfish – Hi-C Heatmap – Haplotype 2.
18 Bighead Catfish – Hi-C Heatmaps – Haplotype 1 and 2.
19 Bighead Catfish – Assembly Completeness Assessment.
20 Bighead Catfish – TE Divergence Profile.
21 Four Species Macrosynteny – Catfishes Chromosomes.
22 F1 Hybrid Catfish – Sequencing and GenomeScope2.0 survey.
23 F1 Hybrid Catfish – North African Subgenome – Hi-C Heatmap.
24 F1 Hybrid Catfish – Bighead Subgenome – Hi-C Heatmap.
25 F1 Hybrid Catfish – Genome Assembly Validation.
26 Streptococcus iniae – Phase contrast micrograph
27 S. iniae – Genome assembly and annotation.
28 S. iniae QbD – Workflow QTPP and manufacturability.
28 S. iniae QbD – Manufacturability design spaces.
29 S. iniae RV – 3D Structure of Enolase and GAPDH Immunogens.
30 Four Catfish Genome Survey – GenomeScope2.0 Profiles.
31 C. gariepinus – Synteny vs F1 hybrid genome.
32 C. macrocephalus – Synteny vs F1 hybrid genome.
33 S. iniae – Circos synteny & QC (SIKU01–05).
34 S. iniae – Synteny vs dolphin isolate.
35 S. iniae – Antigenic variation (17 epitopes).
36 S. iniae – Proteome landscape (M0).
37 S. iniae – Proteome annotation.
38 S. iniae – QbD CQAs correlation matrix.
39 S. iniae – QbD filtering workflow (M0–PreM1–M1-general).
40 S. iniae – Expression and pDNA platform filters.
41 S. iniae – Vaccine type and RPS distribution in literature.
42 S. iniae – RPS distribution by fish host species.
43 S. iniae – Comparison of pDNA and protein vaccine RPS.
GENOMIC STUDIES OF PISCINE STREPTOCOCCI AND
GENOME ASSEMBLY OF CLARIID CATFISHES
(SILURIFORMES) FOR AQUACULTURE ENHANCEMENT
INTRODUCTION
Aquaculture is the farming of aquatic organisms, both marine and freshwater, for human consumption and other uses, such as producing feed, pharmaceuticals, and related products. It represents the fastest-growing food production sector, providing over half of the seafood consumed globally, with Asia dominating production. In Thailand, inland aquaculture is critical to global food security, where freshwater species such as tilapia (Oreochromis spp.), catfish (Clarias spp., Silurus spp., and Pangasius spp.), and Asian seabass (Lates calcarifer) contribute significantly to domestic consumption and seafood exports (FAO, 2020; Naylor et al., 2021).
In 2022, aquaculture surpassed capture fisheries as the primary source of aquatic animals, producing 94.4 million tonnes compared to 92.3 million tonnes from wild-capture fisheries, according to the Global Seafood Alliance. Globally, aquaculture production (including seaweeds) reached 130.9 million tonnes in 2022, valued at USD 313 billion, according to the Food and Agriculture Organization. Major gains in aquaculture feed efficiency and fish nutrition have lowered the fish-in–fish-out ratio for many fed species, although dependence on marine ingredients persists and reliance on terrestrial ingredients has increased. To ensure sustainable growth, genetic improvement through selective breeding has become a key strategy, where optimizing traits such as growth, feed efficiency, sex determination, robustness, and disease resistance requires identifying and deploying relevant genetic variants.
Modern aquaculture breeding programs do not rely solely on conserving “pure” strains or extreme inbreeding. Instead, they are typically structured around families or lines that serve as a defined starting point and are improved cumulatively over gener-
2. Hi-C for Genome Scaffolding
Hi-C captures the three-dimensional organization of chromosomes through proximity ligation, where physically close DNA fragments become joined. Paired-end sequencing of these junctions reveals long-range linkage: contigs from the same chromosome show many Hi-C contacts while those from different chromosomes show few (Putnam et al., 2016). This enables clustering, ordering, and orienting contigs into chromosome-scale scaffolds for de novo assembly (construction without a reference genome) rather than reference-guided assembly, which requires an existing genome.
Hi-C has become essential for achieving chromosome-level assemblies in aquaculture species, enabling precise localization of genes and QTLs for breeding traits. In polyploid fish genomes, Hi-C correctly separates duplicated loci and provides insights into 3D genome structure, chromatin conformation, and regulatory interactions (Dudchenko et al., 2017).
Figure 5 Ultra-long-range sequencing for genome scaffolding
Credits: Ultra-long-range Sequencing for Genome Scaffolding
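The clustering idea described above (many Hi-C contacts within a chromosome, few between chromosomes) can be illustrated with a minimal Python sketch. The contact table, contig names, and threshold below are hypothetical toy values, and the single-linkage grouping is an assumption chosen for brevity; production scaffolders typically work on normalized contact frequencies and use more elaborate clustering.

```python
from collections import defaultdict


def cluster_contigs(contacts, min_contacts):
    """Group contigs whose pairwise Hi-C contact count reaches min_contacts.

    contacts: dict mapping (contig_a, contig_b) -> number of Hi-C read pairs
    linking the two contigs (toy input, assumed symmetric and unnormalized).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in contacts.items():
        root_a, root_b = find(a), find(b)  # also registers both contigs
        if n >= min_contacts and root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = defaultdict(set)
    for contig in parent:
        groups[find(contig)].add(contig)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical contact counts: two strongly linked groups, weak cross-links.
toy_contacts = {
    ("ctg1", "ctg2"): 950,
    ("ctg2", "ctg3"): 870,
    ("ctg4", "ctg5"): 910,
    ("ctg1", "ctg4"): 12,
    ("ctg3", "ctg5"): 7,
}
clusters = cluster_contigs(toy_contacts, min_contacts=100)
print(clusters)
```

Each resulting group is a candidate chromosome; ordering and orienting the contigs within a group is a separate step not shown here.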
3. Haplogenome Assembly of Catfish Genomes
Haplogenome assembly reconstructs each parental haplotype independently instead of producing a collapsed consensus sequence. This captures the full spectrum of allelic and structural variation, valuable for outbred species and highly heterozygous organisms such as interspecific catfish hybrids. By resolving phase-specific sequences, haplogenome assemblies reveal allele-specific expression patterns critical for understanding gene function and optimizing selective breeding. The approach typically uses long-read sequencing (PacBio HiFi or ONT) with bioinformatic phasing to separate maternal and paternal haplotypes. A well-known example includes the assembly of haplogenomes in interspecific Eucalyptus hybrids, analogous to the genome-wide structural differences observed between Silurus asotus and S. meridionalis that are linked to hybrid traits (Chen et al., 2021a). Similarly, in catfishes, haplogenome assemblies of Clarias macrocephalus, C. gariepinus, and their F1 hybrids enable precise characterization of chromosomal rearrangements, gene duplications, and introgression events, laying the foundation for genomic selection and improved aquaculture productivity (Duong and Scribner, 2018).
4. In Summary
These sequencing platforms capture genome structure and identify variants (SNPs¹, indels², and structural variants (SVs)³) associated with economically important traits (Chaivichoo et al., 2020). Short reads excel at accurate SNP/indel detection, long reads resolve structural variants and complex regions, while Hi-C provides chromosome-level scaffolding. As of May 2025, NCBI hosts over 3.01 million genomes including 51,820 eukaryotic genomes, with fewer than 500 being haplotype-resolved, complete vertebrate genomes (https://www.ncbi.nlm.nih.gov/datasets/genome/).
¹ Single-base changes at specific genomic positions, used as genetic markers.
² Short insertions or deletions of 1–50 bp.
³ Large genomic alterations >50 bp including deletions, insertions, duplications, inversions, and translocations.
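The size thresholds in these definitions can be expressed as a small classifier. This is a minimal sketch assuming variants are given as REF/ALT allele strings (as in a VCF record); the function name and the "other" category are illustrative choices, and length alone cannot identify inversions or translocations.

```python
def classify_variant(ref, alt):
    """Classify a variant by allele lengths, using the thresholds above.

    SNP: single-base change; indel: 1-50 bp length difference;
    SV: length difference greater than 50 bp.
    """
    if len(ref) == 1 and len(alt) == 1 and ref != alt:
        return "SNP"
    size = abs(len(alt) - len(ref))
    if 1 <= size <= 50:
        return "indel"
    if size > 50:
        return "SV"
    # Same-length multi-base substitutions or identical alleles.
    return "other"


print(classify_variant("A", "G"))              # single-base change
print(classify_variant("A", "ATTG"))           # 3 bp insertion
print(classify_variant("A" + "C" * 51, "A"))   # 51 bp deletion
```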
Clarias Catfish Aquaculture in Southeast Asia
1. Evolutionary History and Distribution
Catfish (Siluriformes) are widely cultivated freshwater species crucial for food production across Africa, the Americas, and Asia (Lisachov et al., 2023). This diverse order diverged during the early Cretaceous (~145 Ma) and comprises over 36 families and 3,000 species (Ferraris, 2007). While distributed globally in freshwater environments with some marine extensions, catfish are most diverse in tropical South America, Asia, and Africa. Key representatives like Clarias and Ictalurus serve as both research models and economically valuable aquaculture species.
2. Classification and Biology
Bighead catfish (Clarias macrocephalus) and North African catfish (C. gariepinus) belong to family Clariidae, order Siluriformes. These air-breathing catfish are economically important in Southeast Asia and Africa, respectively. C. macrocephalus, native to Southeast Asia and widely farmed in Thailand, has a broad flattened head, short barbels, and reaches 40–50 cm. Its body is dark brown to black with pale blotches. C. gariepinus, introduced globally for its fast growth and environmental tolerance, grows to 1.7 m with an elongated body and long dorsal fins. It survives oxygen-depleted waters using its suprabranchial organ for atmospheric breathing.
Figure 6 Left: Female C. macrocephalus (Credits: Martillano, J.D.). Right: C. gariepinus (Credits: Larsen, J.H.)
F1 hybrids from C. gariepinus males × C. macrocephalus females combine parental advantages: body form from C. macrocephalus with rapid growth from C. gariepinus. Both species thrive at 26–32°C.
3. Role in Food Production and Economy
Clarias species are extensively farmed across Asia and Africa. C. macrocephalus is primarily used for hybrid production with C. gariepinus (Lisachov et al., 2023; Na-Nakorn et al., 2004), yielding sterile F1 hybrids with hybrid vigor (heterosis) that combine rapid growth and desirable body characteristics. However, hybrid sterility necessitates separate parental production systems. The lack of high-quality reference genomes limits implementation of modern breeding tools like ddRAD sequencing, low-coverage genome sequencing, and CRISPR/Cas9. Studies indicate low introgression risk in C. macrocephalus, supporting potential genetic improvement efforts.
Case Study: Genome Assembly Applications in Tilapia
Tilapia (Oreochromis spp.) exemplifies successful genomic applications in aquaculture (Yu et al., 2022). Researchers identified adaptive responses⁴ to salinity stress through genome-wide SNP analysis linked to osmoregulation and survival (Gu et al., 2018; Jiang et al., 2019; Rengmark et al., 2007). Key tolerance genes include prolactin (PRL) (Breves et al., 2013), growth hormone (GH1) (Deane and Woo, 2008), insulin-like growth factor 1 (IGF1) (Yan et al., 2020), and plasma-membrane Ca2+-ATPases (PMCAs) (Rengmark et al., 2007). Similar studies in rainbow trout (Le Bras et al., 2011), Atlantic salmon (Norman et al., 2012), and stickleback (Kusakabe et al., 2016) collectively established foundations for marker-assisted selection (MAS) in aquaculture. Non-synonymous mutations in salt regulation genes enhance tolerance and alter phenotypes. These achievements required high-quality reference genome assemblies.
⁴ Biological adjustments to environmental stress through gene regulation, metabolic changes, or immune modulation.
GENOME ASSEMBLY – BIGHEAD CATFISH
Overview of Genome Assembly Strategy
The haplotype-resolved genome of Clarias macrocephalus was assembled using a hybrid sequencing strategy comprising four DNA sequencing platforms that generate various types of DNA reads (in DNA sequencing, a read is an inferred sequence of base pairs, or base pair probabilities, corresponding to all or part of a single DNA fragment). The technologies used to read the gDNA of bighead catfish into short and long reads were: (i) PacBio HiFi CCS (Wenger et al., 2019) (circular consensus sequencing⁵, long reads with a base accuracy of 99.9%), used for maximizing genome assembly quality; (ii) Oxford Nanopore (ONT) noisy 1D long reads (about 20% error)⁶ (Jain et al., 2018b), used for increasing assembly continuity and spanning complex repeat regions of the genome; (iii) proximity-ligation (Hi-C) data (Dudchenko et al., 2017; Putnam et al., 2016), providing long-range linking information to phase (i.e., determine the specific arrangement of alleles on the same chromosome to distinguish maternal and paternal haplotypes) and scaffold contigs (i.e., separating haplotypes); and (iv) standard Illumina paired-end short reads, used for error correction as described in step 1 below. These complementary data types facilitated haplotype resolution and scaffold anchoring and allowed the generation of dual assemblies⁷ from a single diploid tissue. Among the tested assemblers (Hifiasm (Cheng et al., 2021), Flye (Kolmogorov et al., 2020), wtdbg2 (Ruan and Li, 2019)), I selected Hifiasm for the haplotype-resolved assembly because of its ability to separate parental haplotypes (phasing) on low genome coverage (< 13X) HiFi data. In contrast, I found that Flye and wtdbg2 produced collapsed consensus assemblies. Below, I outline the main steps for the complete genome assembly of bighead catfish Haplotype 1 and Haplotype 2.
⁵ Circular consensus sequencing (CCS) is a high-accuracy long-read sequencing method, commonly used with PacBio HiFi reads, where multiple passes of the same DNA molecule are combined to generate a consensus sequence.
⁶ Oxford Nanopore 1D reads are single-strand long reads that historically exhibit high error rates, often around 10–20%, due to insertions, deletions, and base-calling inaccuracies. These reads are valuable for spanning long genomic regions but require polishing for accuracy.
⁷ Dual assemblies refer to separate genome assemblies generated for each haplotype in a diploid organism, allowing resolution of heterozygous regions and structural differences between homologous chromosomes.
1. Genomic DNA Sequencing, Genome Survey, and k-mer Meryl Databases: Genomic DNA was extracted from the liver tissue of a single male bighead catfish individual using a custom protocol (Supikamolseni et al., 2015). Sex was determined based on histology and external gonadal examination (Kitano et al., 2007; Wyneken et al., 2007). Sequencing was performed using four complementary platforms generating four synergistic raw sequencing data types: (i) Pacific Biosciences (PacBio) High-Fidelity (HiFi) sequencing (Wenger et al., 2019) for generating highly accurate contigs, (ii) Oxford Nanopore Technology (ONT) long-read sequencing (Jain et al., 2018b) for scaffolding and resolving repetitive regions, and (iii and iv) Illumina short-read 150-base paired-end sequencing (Hi-C, and standard non-Hi-C) (Bentley et al., 2008) for chromosomal phasing and error correction, respectively. The genome survey assesses genome characteristics (genome size, heterozygosity, repeat content) from raw sequencing data (e.g., Illumina short reads) using k-mer analysis in Jellyfish (Marçais and Kingsford, 2011) and GenomeScope2.0 (Ranallo-Benavidez et al., 2020). Two hybrid k-mer databases (21-mer and 31-mer; bighead.hifi.illuminaPCRfree.gt1.db.meryl) were built from combined Illumina and HiFi reads using Meryl (Rhie et al., 2020), as explained in the T2T-Polish GitHub repository (https://github.com/arangrhie/T2T-Polish) (Rhie et al., 2022). These Meryl databases use 21-mer and 31-mer DNA strings to support quality evaluation and error correction in genome assemblies. The shorter 21-mer spectra provide high sensitivity for detecting potential errors, while the longer 31-mer spectra offer higher specificity, validating true variants and minimizing false positives. During genome polishing, these k-mer databases enhance effective coverage in low-depth HiFi regions by leveraging k-mer spectra to guide targeted corrections and improve assembly accuracy.
2. Initial Assembly Generation (Contigging and Phasing): Contigging is the process of assembling overlapping DNA reads into continuous sequences without gaps (i.e., contigs). These contigs are then separated and assigned to a phase group using proximity-ligation data (Hi-C), based on their parental origin and spatial physical distance in the cell nucleus. A contig that comes from a single haplotype is referred to as a haplotig⁸, and in a diploid phased assembly there are two groups of homologous haplotigs (i.e., unitigs, one group per haploid "homotype" haplotype). Here, Hifiasm was used in "Hi-C UL" mode (Cheng et al., 2021) to generate a haplotype-resolved assembly using Hi-C + ONT + HiFi reads (Figure 8D, left), followed by GreenHill (Ouchi et al., 2023) using all data types (Hi-C + Illumina + ONT + HiFi reads) for further scaffolding and phasing of the haplotigs (Figure 8E). Finally, wtdbg2 (Hu et al., 2024; Ruan and Li, 2019) was employed with HiFi and Illumina short reads to correct small errors at the nucleotide level (i.e., SNPs) while preventing switch errors (i.e., allele switching between homologous haplotypes). This produced two phased contig sets representing homologous haplotypes (Figure 9).
3. Hi-C Scaffolding and Validation: Scaffolds are computationally ordered and oriented arrays of contigs that contain sequence gaps along their length. Scaffolding arranges and orients contigs into larger structures, sometimes with gaps, using additional data such as proximity ligation (Hi-C) in order to reconstruct an in-silico equivalent of the karyotype (referred to as a "Scaffotype"), which consists of Hi-C scaffolds or "pseudochromosomes" (i.e., chromosome pseudomolecules) visualized as Hi-C maps. Prior to Hi-C scaffolding, contigs can be broken at erroneous sites; these misassembly corrections were made using CRAQ (Li et al., 2023). Hi-C scaffolds (i.e., contigs and scaffolds) are then flipped, reoriented, and refined to improve structural accuracy. This is known as the "manual review" step⁹ and is performed using JBAT (Juicebox Assembly Tools) (Dudchenko et al., 2017). Next, a "post-review" step¹⁰ seals contigs (i.e., by creating gaps between newly adjacent contigs and inserting 500 'N' characters to represent unknown sequence) to form Hi-C scaffolds and ultimately reconstruct pseudochromosomes, followed by 3D-DNA "post-review" validation (Dudchenko et al., 2017) (Figure 8F). Finally, more gaps were closed with TGS-GapCloser (Xu et al., 2020) (Figure 8G). These steps (CRAQ break detection, Hi-C read alignment, manual review in JBAT, and post-review in the 3D-DNA pipeline) were performed iteratively three times. As illustrated in Figure 8F, this cycle can be repeated if needed after contig polishing (Figure 8H), and typically results in well-defined Hi-C scaffolded pseudochromosomes (Figure 8C).
⁸ In an unphased assembly, a contig may join alleles from different parental haplotypes in a diploid or polyploid genome (see Lewin et al., 2019, and https://lh3.github.io/2021/04/17/concepts-in-phased-assemblies). The process of separating sequences based on their parental haplotype in diploid genomes is referred to as "haplotype phasing".
⁹ In Hi-C scaffolding, manual review involves visually inspecting contact maps (e.g., using Juicebox or HiGlass) to detect misassemblies, orientation errors, or misplaced contigs that automated tools may miss.
¹⁰ Post-review refers to the stage after Hi-C contact map inspection, where validated corrections (e.g., flips, cuts, or joins) are applied to the genome assembly.
4. Assembly Polishing of Contigs and Benchmarking: Polishing a genome assembly consists of correcting sequencing errors using additional data (Meryl k-mer databases, long reads, and short reads). To further improve assembly accuracy, reads were aligned to the polished assembly. The three primary long-read mappers employed were Winnowmap (Jain et al., 2022), Minimap2 (Li, 2018), and VerityMap (Mikheenko et al., 2020) (previously known as TandemTools); these were used for assembly validation only (i.e., not for variant calls or genome polishing). Secondary alignments (i.e., reads flagged with bit 0x100¹¹), low-quality regions (i.e., MAPQ¹² = 0), and hard-clipped reads (i.e., alignments with hard-clipping operations indicated by 'H' in the CIGAR string¹³) were excluded from analysis using samtools (Li and Durbin, 2009). This read realignment step was crucial for validating assembly completeness and reducing errors. Non-haplotype-aware tools¹⁴ (e.g., Racon (Vaser et al., 2017), Clair3 (Zheng et al., 2022)) were applied cautiously to minimize errors from parental haplotype switches, while Merfin (Formenti et al., 2022) filtered edits for polishing and BCFtools consensus produced the final polished sequence. For large variants and gap closing based on read alignments, I used Sniffles2 (Smolka et al., 2024) for large structural variant (SV) calling, specifically for insertions (INS) and deletions (DEL). I evaluated genome quality in terms of errors found in the assembly that are not found in the raw reads. For that, I calculated the Quality Value (QV¹⁵) of the genome assembly, where QV refers to the Phred quality score (or quality (Q) score), an integer representing the estimated probability of an error (i.e., that a base is incorrect). The polishing workflow implemented here for correcting the bighead catfish assembly raised nucleotide accuracy from an initial QV33 to around QV46, benchmarked against the assembly metrics recommended in the VGP paper (Rhie et al., 2021).
¹¹ In SAM/BAM flags, bit 0x100 indicates a secondary alignment. Tools like Picard use this flag to mark reads that are not the primary alignment for a read with multiple mappings.
¹² MAPQ (Mapping Quality) is a Phred-scaled score that reflects the confidence in the read's alignment position. Higher values indicate more reliable mappings.
¹³ The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string encodes how a read aligns to a reference, using operations such as matches (M), insertions (I), deletions (D), soft clips (S), and others. For example, 100M indicates 100 matching bases.
¹⁴ Non-haplotype-aware tools do not distinguish between maternal and paternal sequences in diploid or polyploid genomes, often collapsing allelic variants into a single consensus sequence and potentially masking structural differences between haplotypes.
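The three exclusion criteria described in step 4 (secondary alignments, MAPQ = 0, and hard clipping) can be sketched as a predicate over SAM records. This is an illustrative pure-Python stand-in, not the samtools commands used in the thesis; the function names and the toy records are assumptions, and only the standard SAM column order (FLAG in column 2, MAPQ in column 5, CIGAR in column 6) is relied upon.

```python
SECONDARY = 0x100  # SAM flag bit for a secondary alignment


def keep_alignment(flag, mapq, cigar):
    """Return True if an alignment passes the three filters described above."""
    if flag & SECONDARY:
        return False  # secondary alignment
    if mapq == 0:
        return False  # ambiguous placement
    if "H" in cigar:
        return False  # hard-clipped alignment
    return True


def keep_sam_line(line):
    """Apply keep_alignment to one tab-separated SAM record."""
    fields = line.rstrip("\n").split("\t")
    return keep_alignment(int(fields[1]), int(fields[4]), fields[5])


# Hypothetical records (only the first six SAM columns are shown).
toy_records = [
    "read1\t0\tchr1\t100\t60\t100M",
    "read2\t256\tchr1\t500\t60\t100M",    # secondary
    "read3\t16\tchr2\t900\t0\t100M",      # MAPQ 0
    "read4\t0\tchr2\t300\t60\t20H80M",    # hard-clipped
]
kept = [r.split("\t")[0] for r in toy_records if keep_sam_line(r)]
print(kept)
```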
Figure 7 Representative specimens of the bighead catfish (Clarias macrocephalus), collected for high-quality haplotype-resolved genome assembly. The individuals were reared under controlled aquaculture conditions prior to tissue sampling. C. macrocephalus is an economically important freshwater species native to Southeast Asia, widely cultured in Thailand for selective breeding and genomic improvement programs.
Methodological details are presented in the following sections and in Figure 8.
¹⁵ Quality Value (QV) is a Phred-scaled score used to represent base-level or assembly-level accuracy. If $P$ is the probability of error, then $Q = -10\log_{10}(P)$, or equivalently, $P = 10^{-Q/10}$. Higher QV indicates greater accuracy.
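As a worked illustration of the Phred relationship in this footnote, the short sketch below converts between QV and per-base error probability. The function names are illustrative; the QV33 and QV46 inputs are the values quoted in the text.

```python
import math


def error_to_qv(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def qv_to_error(q):
    """Per-base error probability for a Phred-scaled quality q."""
    return 10 ** (-q / 10)


for qv in (33, 46):
    p = qv_to_error(qv)
    print(f"QV{qv}: error probability {p:.2e}, about 1 error per {1 / p:,.0f} bases")
```

Under this definition, moving from QV33 to QV46 corresponds to going from roughly one error per 2,000 bases to roughly one per 40,000 bases, a reduction of about 20-fold.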
Figure 8 Comprehensive haplotype-resolved genome assembly and scaffolding workflow for bighead catfish (Clarias macrocephalus). (A) Hi-C contact map (Juicebox 2.16.0). (B) Genome assembly using PacBio HiFi, ONT, and Hi-C reads with Hifiasm (Hi-C UL mode) and Flye for consensus. (C) Hi-C scaffolding via GreenHill and 3D-DNA. (D) Manual post-review with JBAT. (E) Gap filling using TGS-GapCloser and quarTeT GapFiller. (F) Genome polishing with NextPolish2 and CRAQ. (G) Assembly QV validation with Merqury, Pilon, BCFtools, and VerityMap. (H) Mitochondrial genome assembly with MitoHiFi and Minimap2. The pipeline yields two high-quality phased haplotype assemblies representing the complete bighead catfish genome.
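The sealing operation mentioned in the scaffolding overview (joining newly adjacent contigs with a run of 500 'N' characters) can be sketched as follows. This is a simplified illustration, not the 3D-DNA implementation; the contig names, sequences, and the (name, strand) layout format are hypothetical.

```python
GAP = "N" * 500  # placeholder for unknown sequence between adjacent contigs
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string (used for flipped contigs)."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(contigs, layout):
    """Join contigs in the reviewed order and orientation, separated by gaps.

    contigs: dict mapping contig name -> sequence
    layout:  list of (contig_name, strand) tuples, strand being '+' or '-'
    """
    parts = []
    for name, strand in layout:
        seq = contigs[name]
        parts.append(seq if strand == "+" else reverse_complement(seq))
    return GAP.join(parts)


toy_contigs = {"ctg1": "ACGT", "ctg2": "AAGG"}
scaffold = build_scaffold(toy_contigs, [("ctg1", "+"), ("ctg2", "-")])
print(len(scaffold), scaffold[:4], scaffold[-4:])
```

Gap-closing tools then attempt to replace these fixed-length placeholders with real sequence supported by long reads.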
3. Running Pilon and calling the consensus: Pilon was employed on the assemblies to correct SNPs, INDELs (i.e., 1-2 base insertions and deletions), and other base-level errors. In individual haplotypes, each scaffold was processed by Pilon with parameters ('--genome, --frags, --bam, --targets, --fix all, --vcf, --diploid, --minmq 30, --minqual 30') to use the alignments from all data types (HiFi, ONT, and short reads). The output Variant Call Format (VCF) files containing the detected variants were sorted by position using ('bcftools sort'), compressed using bgzip, and indexed using ('bcftools index')²⁸, and final consensus sequences were generated for each scaffold using ('bcftools consensus -f $genome.fa -H 1').
The overall median quality value metric increased from 41 to approximately 45-47 after haplotype-aware targeted assembly polishing.
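The post-Pilon sequence of commands (sort, compress, index, consensus) can be laid out as a small command builder. This is a sketch that only assembles and prints the command lines without running them; all file names are hypothetical placeholders, and the exact arguments used in the thesis beyond those quoted in the text are assumptions.

```python
import shlex


def consensus_commands(vcf, genome_fasta, out_fasta):
    """Build the argv lists for turning a Pilon VCF into a consensus FASTA.

    File names are placeholders; nothing is executed here.
    """
    sorted_vcf = vcf.replace(".vcf", ".sorted.vcf")
    compressed = sorted_vcf + ".gz"
    return [
        # 1. Sort variant records by position.
        ["bcftools", "sort", "-o", sorted_vcf, vcf],
        # 2. Block-compress the sorted VCF (writes <sorted_vcf>.gz).
        ["bgzip", "-f", sorted_vcf],
        # 3. Index the compressed VCF for random access.
        ["bcftools", "index", compressed],
        # 4. Apply the first allele of each genotype (-H 1) to the reference.
        ["bcftools", "consensus", "-f", genome_fasta, "-H", "1",
         "-o", out_fasta, compressed],
    ]


commands = consensus_commands("scaffold_1.vcf", "genome.fa", "scaffold_1.consensus.fa")
for argv in commands:
    print(shlex.join(argv))
```

Each list could be passed to `subprocess.run(argv, check=True)` once the tools are installed and the input files exist.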
5. Hi-C Scaffolding with HapHiC
To reintegrate unplaced scaffolds of each haplotype into the pseudochromosomes, I used quarTeT and HapHiC. First, I aligned the unplaced contigs to the C. fuscus reference genome (GCA_030347435.1) and the C. gariepinus reference genome (GCA_024256425.2) using quarTeT AssemblyMapper 1.2.1 (Lin et al., 2023) with the following parameters ('-r $reference -q $contigs -c 50000 -l 2000 -i 90 -a minimap2'). I identified 53 Mb and 33 Mb of unplaced scaffold sequences in bighead catfish Haplotype 1 and Haplotype 2, respectively, with strong homology to C. fuscus pseudochromosomes. Next, I filtered bighead catfish Haplotype 1 and Haplotype 2 separately with SeqKit 0.8.1 (Shen et al., 2016) to retain pseudochromosomes and concatenated them with the unplaced scaffolds. In preparation for Hi-C scaffolding, Hi-C reads were mapped to the separate haplotypes using BWA-MEM ('-5SP') after building a BWT index²⁹ of Haplotype 1 and Haplotype 2 ('bwa index $genome.fasta'), and the Hi-C read alignments were filtered for PCR duplicates³⁰ and secondary alignments ('samblaster $BAM | samtools view - -@ $threads -S -h -b -F 3340') using Samblaster 0.1.26 (Faust and Hall, 2014). Hi-C scaffolding was performed on the bighead catfish haplotypes using HapHiC 1.0 (Zeng et al., 2024). The resulting Hi-C maps were visualized in JBAT and with ('haphic plot'), both for the separate scaffolds of each haplotype (Figure 16 and Figure 17) and with all scaffolds represented in a wide map (Figure 18), and a post-review of the Hi-C scaffolds was performed as described in the previous section. Finally, three rounds of TGS-GapCloser followed by one round of targeted Pilon polishing specifying the new targets resulted in a genome of higher overall quality (i.e., both in terms of QV and of CRAQ's CRE/CSE structural accuracy metrics). I addressed the duplication errors³¹ visible in the heterozygous peak of the k-mer coverage (i.e., from the Merqury spectra-CN plot), with partial success, by using haplotype-specific k-mer databases. This involved applying the ('meryl difference') command to identify erroneous k-mers and the ('meryl-lookup') command to extract reads associated with these k-mers. Finally, I used BCFtools to call a consensus³² using the opposite haplotype ('-H 1') within the bcftools consensus command, effectively reversing most haplotype switch errors.
²⁸ bcftools index creates a compressed index (.csi or .tbi) for VCF or BCF files, allowing rapid random access to variants by genomic position during downstream processing or visualization.
²⁹ The BWT index refers to a data structure derived from the Burrows-Wheeler Transform (BWT), which allows fast and memory-efficient alignment of sequencing reads to a reference genome. It compresses the genome while retaining the ability to search for exact or approximate matches, enabling tools like Bowtie2 and BWA to align short reads with high performance.
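The set-difference idea behind identifying suspicious k-mers can be illustrated in plain Python. This is a conceptual sketch only: it is not Meryl, it holds every k-mer in memory, and the function names and toy sequences are assumptions. It reports canonical k-mers that occur in an assembly but in none of the reads, which is the kind of evidence used to flag candidate assembly errors.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Yield the canonical form (lexicographic min of k-mer and its reverse complement)."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip k-mers containing N or other symbols
            yield min(kmer, kmer.translate(_COMPLEMENT)[::-1])


def assembly_only_kmers(assembly, reads, k):
    """Return canonical k-mers found in the assembly but absent from all reads."""
    read_kmers = set()
    for read in reads:
        read_kmers.update(canonical_kmers(read, k))
    return set(canonical_kmers(assembly, k)) - read_kmers


# Toy example: the assembly carries one substitution relative to the read.
suspicious = assembly_only_kmers("AAAGCC", ["AAACCC"], k=3)
print(sorted(suspicious))
```

Because k-mers are canonicalized, a read sequenced from the opposite strand supports the same k-mers as the assembly strand.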
6. SV and SNP Consensus Polishing
Consensus polishing³³ was automated by following instructions from the T2T-Polish GitHub repository (https://github.com/arangrhie/T2T-Polish). A list of repetitive k-mers (k=15) was generated with Meryl and used in Winnowmap for HiFi read alignment ('--MD -W .repetitive_k15.txt -ax map-pb'), followed by alignment filtering with samtools ('-Sb'). pb-falconc 1.15.0 (https://github.com/bio-nim/pb-falconc) was then used to filter hard-clipped reads and alignments carrying flag bits 0x104³⁴ ('bam-filter-clipped -t -F 0x104 --output-count-fn'). Genome polishing was performed using the liftover branch of the Racon GitHub repository, running Racon 1.5.0 (Vaser et al., 2017) with the ('-L -S') options. After polishing, the k-mers present in the genome (seqmers) were counted using Meryl count (k=21). Merfin was then applied using ('-readmers Illumina.HiFi.gt1.PCRfree.hybrid.meryl -seqmers') and the hybrid k-mer database of reads to evaluate the results by comparing the k-mer distributions in the reads and in the polished genome (Formenti et al., 2022; Rhie et al., 2022). For consensus generation (i.e., to apply polishing edits to the genome assembly), I used BCFtools ('-H 1') for two rounds of genome correction. Assembly quality was measured by QV, completeness, and BUSCO scores. I found a large increase in QV in most chromosomes (minimum increase of +1 to +5 QV points) after ONT Racon and Merfin, with a median QV of 50. The progress over several months is presented in Figure 10.
³⁰ PCR duplicates refer to read pairs that originate from the same DNA fragment and are amplified multiple times during library preparation, potentially leading to biased coverage if not removed.
³¹ Duplication errors refer to the erroneous presence of redundant sequences in genome assemblies, often caused by uncollapsed haplotypes, repetitive regions, or misassemblies. These can inflate genome size and complicate downstream analyses.
³² Calling a consensus in bcftools refers to generating a modified reference sequence by applying variant calls (e.g., from a .vcf file) to a reference genome, resulting in a consensus sequence that reflects the sample's specific genotype.
³³ Consensus polishing is the process of correcting errors in a draft genome assembly using aligned reads, typically by averaging observed base calls at each position in a read pileup. Most polishing tools are not haplotype-aware and must be run on both haplotypes together, meaning reads are distributed across haplotigs without distinguishing parental origin. This can lead to miscorrections in heterozygous regions.
³⁴ The SAM flag value 0x104 is the combination of 0x004 (read unmapped) and 0x100 (not primary alignment). Used as an exclusion mask (e.g., samtools view -F 0x104), it discards any alignment with either bit set, that is, unmapped reads and secondary alignments.
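To make flag masks such as 0x104 (and the decimal mask 3340 used with samtools in the Hi-C filtering step) easier to read, the following sketch decodes a mask into the standard SAM flag bits. The bit values follow the SAM specification; the short labels and function names are illustrative.

```python
# Standard SAM FLAG bits (values from the SAM specification; labels are shorthand).
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(value):
    """List the names of all flag bits set in value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if value & bit]


def is_excluded(flag, mask):
    """Mimic an exclusion filter (-F mask): drop a read if any masked bit is set."""
    return bool(flag & mask)


print(decode_flag(0x104))
print(decode_flag(3340))
```

Decoding 3340 shows that the Hi-C filter removes unmapped reads, reads with an unmapped mate, secondary and supplementary alignments, and marked duplicates.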
Figure 10 Bighead Catfish Assembly Status January 2024 - November 2025. (A) Haplotype 1 (blue) and Haplotype 2 (purple) ideograms showing gaps on pseudochromosomes (orange rectangles) and telomeres (blue triangles) after manual review in Juicebox (JBAT). (B) Same assemblies a few months later.
7. Mitochondrial Genome Assembly
The mitochondrial genome (mtDNA) was assembled by mapping to the reference (NC_046749.1) bighead catfish mtDNA, available at NCBI Nucleotide35. Minimap2 (’-ax map-ont --secondary=no’) mapped nanopore reads to the reference and minimap2 (’-ax sr’) mapped Illumina reads. PCR duplicates in short reads were removed from the alignments using samtools markdup and the results were visualized in the Integrative Genomics Viewer (IGV) (Thorvaldsdottir et al., 2012). Pilon (’--fix all --diploid --changes --vcf --tracks --minmq 10’) was used to correct the mtDNA (NC_046749.1), call SNPs, SVs, gaps and local variants, and obtain the consensus mtDNA sequence. Reads were re-aligned to the consensus mtDNA sequence and no more SNPs were visible in IGV. Next, mapped reads were filtered with samtools view (’-F4 -q 20’) and re-assembled de novo with Unicycler (Wick et al., 2017) for comparison. The results were visualized in Bandage-ng (Wick et al., 2015) 2022.09. The assembly was polished with Pilon and the two homologous mitochondria were compared using minimap2 (’--eqx -x asm5’). Gene annotations were generated using MitoFinder (Allio et al., 2020) 1.4.2. To ensure that the mitochondrial genome was correct and not dissimilar to other mtDNAs in Siluriformes, I downloaded all reference mtDNA sequences of (N = 209) species of Siluriformes catfish from NCBI Nucleotide, last accessed September 2024. All sequences were taken from NCBI RefSeq36 and not from NCBI GenBank37. All mtDNA nucleotide sequences (N = 210) were renamed to the Pan-SN naming scheme (https://github.com/pangenome/PanSN-spec), concatenated in a single multi-FASTA including the bighead catfish mtDNA generated here after Pilon, and all-versus-all alignments were performed with wfmash, then ODGI was used to filter the alignment graph, and visualizations were
35 NCBI Nucleotide is a public database maintained by the National Center for Biotechnology Information that provides access to nucleotide sequences from GenBank, RefSeq, and other sources for a wide range of organisms.
36 NCBI RefSeq (Reference Sequence Database) is a curated, non-redundant collection of genomic, transcript, and protein sequences provided by the National Center for Biotechnology Information, serving as a standard reference for genome annotation and comparative analysis.
37 NCBI GenBank is a comprehensive public database of annotated nucleotide sequences submitted by researchers worldwide. It includes raw and curated data and serves as a primary archive for sequence data in molecular biology.
made with Bandage and multiQC (Ewels et al., 2016). All tools of the pipeline are part of the larger PGGB (Pan Genome Graph Builder) (Guarracino et al., 2022). The results of the alignments are shown in Figure 11: the mtDNA aligns well with the other sequences in phylogenetic order and is not an outlier.

Figure 11 All-versus-all 2D representations of mitochondrial DNA sequence alignments in Siluriformes catfishes. The mtDNA assembled for bighead catfish is on the last row (blue color) of the linear 2D representation.
8. Transposable Element Annotation
Two complementary approaches were used to identify transposable elements (TEs)38 in the bighead catfish genome. (1) A species-specific TE library was generated de novo for the bighead catfish. (2) A curated fish TE library was retrieved from the FishTEDB database (https://www.fishtedb.com/) and used as a reference for cross-species TE annotation.
8.1 De novo transposable element library of bighead catfish
De novo TE annotation was performed using EDTA (Extensive de novo TE Annotator) (Ou et al., 2019) 2.2.0. The pipeline integrates several tools to detect different element types. LTRharvest (Ellinghaus et al., 2008) 1.5.10 identifies LTR retrotransposons by efficiently locating structural features such as border positions, LTR lengths, and motifs in large genomic datasets. LTR_FINDER (Xu and Wang, 2007) and the parallel version of LTR_FINDER (Ou and Jiang, 2019) 1.1.0 accelerate LTR detection using parallel computing for large genomes. LTR_retriever (Ou and Jiang, 2017) 2.9.0 refines LTR annotations to improve accuracy and remove false positives. TEsorter (Zhang et al., 2022) 1.4.7 classifies LTR retrotransposons quickly and accurately, and GRF (Generic Repeat Finder) (Shi and Liang, 2019) 1.1 performs genome-wide de novo repeat detection. TIR-Learner (Su et al., 2019) 1.1.2 uses machine learning to detect terminal inverted repeat (TIR) transposons, including miniature inverted TEs (MITEs). HelitronScanner (Xiong et al., 2014) 1.0 identifies Helitrons by recognizing their characteristic structural motifs. RepeatModeler (Flynn et al., 2020) 2.0.3 performs de novo repeat discovery and builds a comprehensive repeat library. MAFFT (Katoh and Standley, 2013) 7.526 is used for multiple sequence alignments, and HMMER (Finn et al., 2011) 3.4 is used for domain-based searching and TE classification.
38 Transposable elements (TEs), also known as “jumping genes,” are mobile DNA sequences that can move or replicate within the genome. They play key roles in genome evolution, structural variation, and gene regulation.
8.2 FishTEDB general fish-specific TE library collection
Next, libraries of TE consensus sequences39 of transposons were retrieved from the FishTEDB database (Shao et al., 2018) version 1, which contains different species of fish and their associated transposable element libraries in .fasta file format, and these sequences were merged with EDTA’s de novo bighead catfish initial TE annotations. To reduce the library complexity and to create a non-redundant TE library (i.e., to lower the number of sequences and redundancy of the combined TE library dataset), I used CD-HIT-EST (’-d 20 -aS 0.95 -c 0.95 -G 0 -g 1 -b 500’) (Li and Godzik, 2006) for an 80% identity threshold and 80% alignment coverage of TE sequences in all-against-all sequence clustering, following the commonly used 80%-80% rule (Goubert et al., 2022). The reduced TE library was used to mask (i.e., to annotate) the bighead catfish genome for TEs with RepeatMasker (Smit, AFA, Hubley, R. & Green, P. RepeatMasker at http://www.repeatmasker.org).
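The clustering criterion described above can be illustrated with a minimal Python sketch. It is only an assumption-laden illustration, not the CD-HIT-EST algorithm: the function name and arguments are hypothetical, coverage is measured on the shorter sequence (in the spirit of ’-aS’), and the thresholds default to the 80%/80% values quoted in the text; passing 0.95 for both would mirror the ’-c’ and ’-aS’ values shown in the command.

```python
def same_te_family(identity: float, aligned_len: int, len_a: int, len_b: int,
                   min_identity: float = 0.80, min_coverage: float = 0.80) -> bool:
    """Decide whether two TE sequences would be merged into one cluster.

    identity    : fraction of identical bases over the aligned region (0..1)
    aligned_len : length of the aligned region in bp
    len_a/len_b : full lengths of the two sequences in bp
    Assumption: coverage is computed relative to the shorter sequence.
    """
    shorter = min(len_a, len_b)
    coverage = aligned_len / shorter
    return identity >= min_identity and coverage >= min_coverage


# Two hypothetical consensus sequences: 85% identical over 900 bp,
# with lengths of 1,000 bp and 1,400 bp.
print(same_te_family(0.85, 900, 1000, 1400))
```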
39 TE (transposable element) consensus sequences are artificial, representative sequences constructed by aligning and averaging multiple genomic copies of a TE family. They do not correspond to any specific real insertion, but serve as idealized references for annotation and classification.
GENOME ASSEMBLY – F1 HYBRID CATFISH
Overview of Genome Assembly Strategy
Starting from an F1 hybrid catfish (i.e., a first filial generation catfish made from the cross of a pure bighead catfish and a pure North African catfish), I reconstructed two complete haploid, non-homotypic parental genomes, comprising 27 pseudochromosomes for C. macrocephalus and 28 for C. gariepinus (Fig. 25). In most F1 hybrids derived from divergent parental species, the observed heterozygosity rate, typically around 10%, reflects the inter-specific genomic divergence between the two haploid sub-genomes, encompassing both single nucleotide polymorphisms (SNPs) and structural variants (SVs). The complete workflow for generating chromosome-scale assemblies of the Thai F1 hybrid catfish is presented in Figure 12. The workflow consists of six sequential steps (from Step 1A to Step 6) that integrate multiple sequencing technologies, analysis tools, and pipelines in a specific order to achieve optimal phasing and Hi-C scaffolding, yielding genome assemblies with high contiguity, low structural error rates, high solve rates, and high base accuracy in highly heterozygous F1 genomes. Below, I outline each step in detail.
•Step 1: Genomic DNA Sequencing, Genome Survey, and Meryl Databases
Genomic DNA was extracted from the liver tissue of a single F1 hybrid individual using a custom protocol (Supikamolseni et al., 2015). Sex was determined to be female based on histology and external gonadal examination (Kitano et al., 2007; Wyneken et al., 2007). Sequencing was performed using four complementary platforms generating four synergistic raw sequencing data types: (i) Pacific Biosciences (PacBio) High-Fidelity (HiFi) sequencing (Wenger et al., 2019) for generating highly accurate contigs, (ii) Oxford Nanopore Technology (ONT) long-read sequencing (Jain et al., 2018b) for scaffolding and resolving repetitive regions, and (iii and iv) Illumina short-read 150 base paired-end sequencing (Hi-C, and standard non Hi-C) (Bentley et al., 2008) for chromoso-
that C. macrocephalus exhibits lower heterozygosity (0.62%) compared to C. gariepinus (1.0%) (Figure ??). The final assembly achieved high quality values (QV) at both k=21 and k=31, with robust completeness metrics including CRAQ structural accuracy. The combination of manual curation, iterative polishing, and uniform read coverage proved critical to overcoming inherent limitations in both software algorithms and sequencing depth. The final assembly comprises 55 pseudochromosomes, accurately representing both parental haplotypes from a single F1 individual.
During initial assembly, optimal results were achieved by generating multiple independent assemblies from the same dataset followed by post-hoc merging and scaffolding. Specifically, combining outputs from Flye and GreenHill with Hifiasm improved resolution of complex genomic regions. Under conditions of limited read depth, Hifiasm occasionally exhibited problematic behavior, extending reads and switching to longer overlapping reads despite significant nucleotide differences (>20 SNPs), a level of variation inconsistent with sequencing error alone. Additionally, using related reference genomes to fill assembly gaps requires careful validation, as incorrect placements frequently outnumber accurate ones without manual verification. Hi-C scaffolding underwent three rounds of manual review using Juicebox, involving contig splitting at off-diagonal signals and identification of scaffold gaps, ultimately reducing contig count from >2,000 to <500.
1.3 Telomeric Refinement and Structural Improvements
Telomeric sequence refinement represented a significant technical achievement. Manual extension of clipped sequences at chromosome termini successfully incorporated five additional telomeric regions containing canonical [TTAGGG]n motifs. While telomeres may not be essential for most genomic analyses, their accurate representation enables precise chromosome boundary delineation. Notable improvements included:
•C. gariepinus chromosome 16: left telomeric repeat count increased from 138 to 1,602
•C. gariepinus chromosome 18: left telomeric repeat count increased from 268 to 1,411
•C. macrocephalus chromosome 7: left telomeric repeat count increased from 199 to 924
These refinements increased the number of pseudochromosomes with bilateral telomeres from 43 to 48, while reducing those with unilateral telomeres from 8 to 7. Telomeric additions contributed approximately 400 kb to the total assembly length, representing substantial progress toward telomere-to-telomere (T2T) completeness for multiple chromosomes.
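As an illustration of how telomeric repeat counts such as those above can be obtained, the following minimal Python sketch counts occurrences of the canonical motif (on either strand) near each end of a sequence. It is a simplified stand-in, not the TIDK or quarTeT implementation; the function name, the default window size, and the toy sequence are assumptions made for this example.

```python
import re


def count_terminal_repeats(seq: str, motif: str = "TTAGGG", window: int = 20000):
    """Count motif occurrences (either strand) within `window` bp of each end.

    Returns a (left_count, right_count) tuple. Non-overlapping matches of the
    motif or its reverse complement are counted.
    """
    complement = str.maketrans("ACGT", "TGCA")
    rev_comp = motif.translate(complement)[::-1]   # e.g. TTAGGG -> CCCTAA
    pattern = re.compile(f"(?:{motif}|{rev_comp})")
    seq = seq.upper()
    left = len(pattern.findall(seq[:window]))
    right = len(pattern.findall(seq[-window:]))
    return left, right


# Toy pseudochromosome: 5 repeats at the left end, 3 at the right end.
toy = "CCCTAA" * 5 + "ACGTACGTAC" * 3 + "TTAGGG" * 3
print(count_terminal_repeats(toy, window=30))
```

In practice the window and the minimum repeat count used to call a telomere (for example, the > 100 repeats threshold used for the scaffold tables) are tool-specific parameters.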
Subgenomic analysis revealed additional differences between parental haplotypes, including heterozygosity rate variations and assembly complexity differences attributable to limited nanopore coverage. While most genomic gaps have been resolved, centromeric and satellite regions remain challenging, particularly where HiFi read coverage is absent or where only MAPQ0 Illumina reads provide support, necessitating reliance on lower-accuracy ONT data.
1.4 Quality Value Enhancement Through Manual and Automated Polishing
The polishing strategy combined automated QV correction with manual curation. Initial variant calling with Clair3 identified over 60,000 variants, subsequently resolved using Merfin. Despite achieving QV60 at k=21, a final manual curation round introduced 8,410 edits, including 551 large structural variants, resulting in a QV improvement of 1–5 points. The complete manual curation process comprised four distinct rounds:
Round 1: Sniffles2 identified 755 target regions across all alignment files for manual review.
Round 2: Cross-assembly curation using Unimap examined scaffolds from Flye, GreenHill, and Hifiasm P-contigs, incorporating seqmer error files. This resulted in 2,221 manual edits and application of 5,951 variants.
Round 3: Combined Winnowmap and Unimap alignments enabled identification of 8,710 variants from 5,740 alignments, with 1,932 applied to the assembly. Additionally, 300 gaps were resolved following Sniffles2 and BCFtools consensus-based corrections.
Round 4: Correction of 97 large structural variants, including 10 visible only through VerityMap alignments. To ensure genome reliability, 65 gaps were introduced to mask sequences lacking proper read support or exhibiting high error profiles.
Post-masking, 247 structural variants (196 insertions, 73 deletions) identified using Sniffles2 were applied to the genome. These variants, derived from ONT pb-falconc-filtered alignments (https://github.com/bio-nim/pb-falconc) realigned at Hifiasm contig boundaries with 50 nt margins, were approximately 90% homozygous and 10% heterozygous. Genomic error k-mers showed a non-uniform chromosomal distribution, clustering in regions with ONT-only or single-depth coverage (511 loci total). Further correction using Clair3 (Zheng et al., 2022) addressed 22,625 ONT-based consensus errors and 7,558 HiFi Winnowmap variants, achieving median QV31.
Assembly validation using IGV demonstrated quality improvements and structural resolution. Chromosome 03 showed a clear reduction in k-mer error rates and increased QV post-polishing. Chromosome 02 analysis revealed a false duplication present in Hifiasm but absent in Flye, confirmed by doubled homozygous coverage and lack of supporting HiFi alignments. A single spanning ONT read and a clipped Illumina read validated boundary precision. Consistent single-nucleotide markers across all data types (HiFi, ONT, Flye) suggest potential for further refinement using high-confidence variants (AF > 0.9) with Clair3 (Zheng et al., 2022) and WhatsHap (Martin et al., 2016).
Figure 13 Quality value and coverage validation using Integrative Genomics Viewer. (A) Chromosome 03 demonstrating QV improvement following polishing procedures. (B) Chromosome 02 showing false duplication and gap analysis through comparative assembly assessment.
Following eight months of genome curation, the estimated QV improved from Q40 to Q60, demonstrating that F1 hybrid genomes can achieve superior nucleotide-level accuracy compared to single-species assemblies. For comparison, the pure-breed bighead catfish assembly achieved median QV40–50 (k=21, k=31, pileup), while the current F1 hybrid genome demonstrates at least ten-fold higher accuracy (each 10-point QV gain corresponds to a ten-fold lower error rate) using comparable sequencing coverage, analytical tools, and curation effort.
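The relationship between QV and error rate used in this comparison follows the Phred scale, QV = -10 log10(p), where p is the per-base error probability. The short Python sketch below converts QV values to error rates and computes the fold difference between two QVs; the function names are hypothetical and chosen only for this illustration.

```python
def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def fold_improvement(qv_before: float, qv_after: float) -> float:
    """Ratio of error rates: how many times fewer errors after vs. before."""
    return qv_to_error_rate(qv_before) / qv_to_error_rate(qv_after)


for qv in (40, 50, 60):
    print(f"Q{qv}: about one error per {round(1 / qv_to_error_rate(qv)):,} bases")
```

On this scale, moving from Q50 to Q60 is a ten-fold reduction in errors, and moving from Q40 to Q60 is a hundred-fold reduction.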
Figure 14 F1 Hybrid Catfish Assembly status January 2024 - November 2024.
VALIDATION OF CATFISH ASSEMBLIES
Benchmarking Methodology
This chapter presents the results and technical validations performed on the genome assemblies of the two catfish species. First, I present the most common metrics used to measure the quality and completeness of the assemblies; next, I present each individual assembly with its results; finally, comparative genomic analyses are provided to illustrate how these two genomes can be used for comparative studies across species. Metrics of continuity, structural accuracy, base accuracy, and functional completeness were used for benchmarking as described in the Vertebrate Genome Project (VGP) paper (Rhie et al., 2021).
1. Metrics Assessed
1.1 Continuity and summary statistics
To assess continuity and summary statistics of the assembly, I computed the following measures for scaffolds/contigs (N50, N90, NG50, LG50, and LG90)50 with RagTag (’ragtag.py asmstats -g’) (Alonge et al., 2022) 2.1.0.
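To make these continuity measures concrete, the following minimal Python sketch computes Nx/Lx from a list of contig or scaffold lengths, and the genome-size-relative NGx/LGx variants when an expected genome size is supplied. It is an independent illustration rather than RagTag's implementation; the function name and the toy lengths are assumptions.

```python
def nx_lx(lengths, fraction=0.5, genome_size=None):
    """Return (Nx, Lx) for a list of sequence lengths.

    fraction    : 0.5 for N50/L50, 0.9 for N90/L90, and so on.
    genome_size : if given, the target is a fraction of this expected genome
                  size instead of the assembly size (NGx/LGx).
    Returns (None, None) if the assembly never reaches the target length,
    which can happen for NGx when the assembly is much smaller than expected.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None, None


toy_lengths = [10, 8, 6, 4, 2]                     # arbitrary units
print("N50/L50:", nx_lx(toy_lengths, 0.5))
print("N90/L90:", nx_lx(toy_lengths, 0.9))
print("NG50/LG50:", nx_lx(toy_lengths, 0.5, genome_size=40))
```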
1.2 Repeat completeness and continuity of repeats
To measure repeat completeness and continuity of repeats for assessing assembly quality, I estimated the percentage of fully assembled LTR retroelements (LTR-RTs)51 and computed the long terminal repeat (LTR) Assembly Index (LAI) using two
50 N50 and N90 represent the contig or scaffold length such that 50% or 90% of the total assembly length is contained in contigs/scaffolds of that length or longer. NG50 is similar to N50 but calculated relative to an expected reference genome size. LG50 and LG90 indicate the minimum number of contigs or scaffolds whose combined length makes up 50% or 90% of the assembly, respectively.
51 LTR-RTs (Long Terminal Repeat Retrotransposons) are a class of transposable elements characterized by direct long terminal repeats at both ends. They replicate via an RNA intermediate and reverse transcription, and are major contributors to genome size and structure in many eukaryotic organisms.

reference-free programs, LTR Assembly Index (LAI) (Ou et al., 2018) and LTR_retriever (Ou and Jiang, 2017) 2.9.0. To assess the assembly quality of complex repeats (centromeres), I used TandemTools and TandemQUAST (Mikheenko et al., 2020) 1. Prediction of the presence/absence and orientation of telomeres was done with TIDK, a Telomere Identification Toolkit (Brown et al., 2023), implemented in TeloExplorer (’-m 50 -c animal’), a module of QuarTeT (Lin et al., 2023). To estimate gaps, I used the script detgaps, available on GitHub (https://github.com/dfguan/asset).
1.3 Structural accuracy (regional and structural errors and reliable blocks)
To assess structural accuracy I used the CRAQ software (’sms_coverage=5 ngs_coverage=20 -B T --minimap2-sensitive’) (Li et al., 2023) 1.0.9, more specifically (’-q 20 -m 2 - 0.4 -h 0.6 - 0.75 -a 20’). CRAQ is a method for assembly structural validation relying on the study of mapped reads, clipped reads and coverage support by two or more simultaneous sequencing platforms to find supporting regions or reliable blocks (Rhie et al., 2021). Results were visualized using the Integrative Genomics Viewer (IGV) (Thorvaldsdottir et al., 2012). CRAQ evaluates genome assembly quality based on clipped-read evidence from both short and long reads. It reports two main types of errors:
1. Clip-based Regional Errors (CREs), which represent small-scale local misassemblies identified by short-read clipping without structural disruption.
2. Clip-based Structural Errors (CSEs), which represent large-scale misassemblies supported by long-read clipping near breakpoint regions.
These errors are quantified into two sub-scores: R-AQI (based on CREs) and S-AQI (based on CSEs). Both scores contribute to the global Assembly Quality Index (AQI), ranging from 0–100, where higher values indicate better assembly integrity. AQI is computed based on error density relative to genome size.
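The dependence of AQI on error density can be sketched as follows. This Python example assumes the exponential form AQI = 100 · exp(−0.1 · N / L), with N the normalized error count and L the assembly length in megabases, as I understand it from the CRAQ publication; the exact constant and normalization should be treated as assumptions and checked against the CRAQ documentation before use. The function name is hypothetical.

```python
import math


def assembly_quality_index(normalized_error_count: float,
                           assembly_length_bp: float) -> float:
    """Sketch of an AQI-style score on a 0-100 scale.

    Assumption: AQI = 100 * exp(-0.1 * N / L), where N is the normalized
    number of clip-based errors (CREs for R-AQI, CSEs for S-AQI) and L is
    the assembly (or scaffold) length in megabases.
    """
    length_mb = assembly_length_bp / 1e6
    return 100.0 * math.exp(-0.1 * normalized_error_count / length_mb)


# A hypothetical 40 Mb scaffold with 0, 20, or 80 normalized structural errors.
for errors in (0, 20, 80):
    print(errors, round(assembly_quality_index(errors, 40e6), 2))
```

Under this form, an error-free scaffold scores 100 and the score decays smoothly as errors per megabase accumulate, which is why per-scaffold S-AQI values can be compared across chromosomes of different sizes.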
1.4 Cross-species structural correctness
To assess cross-species structural correctness, I used the same references from related catfish species for one-to-one nucleotide-level alignments of orthologous segments with MashMap252 (’-s 2000000 --pi 90 -c 100000’) (Jain et al., 2018a) 3.1.3.
1.5 Base accuracy and assembly completeness
To assess base accuracy and completeness, specifically Merqury’s quality values (QV) and 21-mer genome completeness (%), I used Merqury (Rhie et al., 2020) 1.3. Merqury was run three times using different k-mer databases: Illumina, HiFi, and a hybrid 21-mer database combining Illumina and HiFi reads, as explained above in the ”Consensus polishing” section. To assess functional completeness and evaluate the completeness of the gene set, I used the lineage of ray-finned fishes and BUSCO (’-l actinopterygii_odb10’) (Simão et al., 2015) 5.6.1. Finally, nucleotide accuracy was verified by read-to-assembly mapping, performed using minimap2 (’-ax map-hifi --secondary=no’) and Winnowmap (’-W repetitive.15.txt’) at MAPQ >10 for HiFi reads, while for ONT reads I used the '-ax map-ont' preset of minimap2 and the '-ax map-pb' preset in Winnowmap at MAPQ >10. For visualizations, I used IGV (’igv -g $genome $BAM(s) $merqury_only_bed_wig_kmer_files’).
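For readers unfamiliar with how k-mer-based QV and completeness are derived, the following minimal Python sketch reproduces the calculation as described in the Merqury publication (Rhie et al., 2020): the probability that a base is correct is estimated from the fraction of assembly k-mers also found in the reads, and completeness is the fraction of reliable read k-mers recovered in the assembly. The function names and example counts are hypothetical, and the sketch does not replace Merqury itself.

```python
import math


def kmer_qv(asm_only_kmers: int, total_asm_kmers: int, k: int = 21) -> float:
    """Consensus QV from k-mer counts.

    asm_only_kmers  : k-mers found in the assembly but absent from the reads
    total_asm_kmers : all k-mers found in the assembly
    """
    p_base_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_base_correct
    if error_rate == 0:
        return float("inf")          # no erroneous k-mers observed
    return -10 * math.log10(error_rate)


def kmer_completeness(read_kmers_found_in_asm: int, reliable_read_kmers: int) -> float:
    """Percentage of reliable read k-mers that are present in the assembly."""
    return 100.0 * read_kmers_found_in_asm / reliable_read_kmers


# Hypothetical counts: 1,000 assembly-only k-mers out of one billion.
print(round(kmer_qv(1_000, 1_000_000_000), 2))
print(round(kmer_completeness(890, 1_000), 2))
```

Because the estimate depends on which read set defines the trusted k-mers, running it against Illumina, HiFi, and hybrid databases, as done here, can give different QV values for the same assembly.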
52 MashMap2 is a fast and approximate algorithm for computing whole-genome homology maps. It uses minimizer-based locality-sensitive hashing to rapidly identify high-confidence regions of sequence similarity, making it suitable for genome-to-genome alignments, structural variation detection, and reference-guided scaffolding (Jain et al., 2018a).
Results – Bighead Catfish
The haplotype-resolved de novo assembly resulted in 27 Hi-C scaffolds, and the chromosome number was 2n = 2x = 54 pseudochromosomes (Maneechot et al., 2016).
1. Raw Read Quality
Figure 15 summarizes raw data quality for the bighead catfish genome.
Figure 15 Sequencing summary and genome survey of bighead catfish. (A) Sequencing protocols and coverages. (B) GenomeScope k-mer profile showing genome size, repeat content, and 17.6% heterozygosity. (C) HiFi reads: high quality and 15 kbp peak length. (D) ONT reads: broader size range with N50 30 kbp and moderate quality.
2. Global Assembly Metrics
Sequence variation analysis53 was performed with the PlotSR suite (Goel and Schneeberger, 2022), revealing 1,968,666 heterozygous SNPs across both haplotypes after minimap2 ('-ax asm5') cross-haplotype mapping, resulting in a heterozygosity rate54 of 0.594%. Structural variant analysis identified 392,973 insertions totaling 7.67 Mb in Haplotype 2, while Haplotype 1 had 393,127 deletions spanning 7.75 Mb. Copy number variations55 included 114 copy gains in Haplotype 2 (184 Kb) and 123 copy losses in Haplotype 1 (416 Kb). Highly divergent regions56 were identified, spanning 57.9 Mb in Haplotype 1 and 56.0 Mb in Haplotype 2. Additionally, 47 tandem repeat clusters57 were detected, covering 10 Kb in Haplotype 1 and 7 Kb in Haplotype 2.
Table 1 Summary statistics of the haplotype-resolved Clarias macrocephalus genome assembly.
Features | Haplotype 1 | Haplotype 2 | All
Overall quality (x.y.P.Q.C) | 6.30.P7.53.Q48.C94.25 | 6.30.P7.53.Q48.C94.25 |
Genome size (Mb) | 875 | 880 |
Pseudochromosomes | 27 | 27 |
Mitochondrion length (bp) | 16,510 | N/A |
NG50 of contigs (Mb) | 3 | 3 |
N50 of scaffolds (Mb) | 34 | 34 |
LG50/LG90 of scaffolds | 11 / 24 | 11 / 24 |
Number of gaps | 752 | 653 |
GC content (%) | 39.32 | 39.32 |
Complete BUSCOs (%) | 93.0 | 93.5 |
Median QV (Merqury k21) | 42.99 | 43.33 | 48.22
Completeness (Merqury k21) | 89.11 | 88.88 | 95.25
53 Comparison of Haplotype 1 and 2 chromosomes to detect heterozygous variants such as SNPs, indels, and structural changes.
54 Proportion of sites where the two haplotypes differ, reflecting allelic diversity.
55 Duplications or deletions causing variable copy numbers of genomic regions.
56 Genomic segments showing strong sequence divergence between haplotypes or species.
57 Short motifs repeated head-to-tail, often in telomeres or centromeres.
5. Scaffold Metrics
Per-scaffold analyses confirmed uniform base and structural accuracy across chromosomes. Key quality metrics included base accuracy (median QV ≈ 50), gap content, and telomere orientation. Structural integrity was assessed using the Structural Assembly Quality Index (S-AQI %), base-level accuracy with Merqury (k=21), structural precision with CRAQ, telomere polarity with QuarTeT (+ inward, – outward, > 100 repeats), and gap statistics with Detgaps.

Table 2 Summary of scaffold metrics in the haplotype-resolved genome assembly of Clarias macrocephalus, Haplotype 1.
Scaffold Name (Haplotype_Linkage) | Size (Mb) | Errors (21-mer) | Quality Values (hyb.k21.meryl) | Gaps (N) | Left Telomere (5’ AACCCT) | Right Telomere (AGGGTT 3’) | S-AQI (%)
ClaMac_1_LG_01 | 48.4 | 3242 | 54.8396 | 23 | Left 507 (-) | - | 93.24
ClaMac_1_LG_02 | 46.18 | 6400 | 51.574 | 24 | Left 607 (+) | Right 1016 (-) | 73.94
ClaMac_1_LG_03 | 40.54 | 6784 | 50.6951 | 20 | - | Right 215 (-) | 91.18
ClaMac_1_LG_04 | 37.64 | 4160 | 52.7678 | 23 | Left 147 (+) | Right 122 (-) | 100.00
ClaMac_1_LG_05 | 37.19 | 9007 | 49.2436 | 23 | Left 579 (+) | Right 190 (-) | 93.23
ClaMac_1_LG_06 | 37.01 | 5548 | 51.4013 | 25 | - | - | 84.09
ClaMac_1_LG_07 | 34.64 | 7089 | 49.8928 | 25 | Left 214 (+) | - | 93.24
ClaMac_1_LG_08 | 47.03 | 12668 | 48.7208 | 32 | Left 170 (+) | Right 190 (-) | 89.37
ClaMac_1_LG_09 | 28.28 | 7205 | 49.2196 | 13 | Left 153 (+) | Right 1630 (-) | 81.31
ClaMac_1_LG_10 | 29.87 | 12042 | 47.0424 | 24 | Left 282 (+) | - | 95.23
ClaMac_1_LG_11 | 25.51 | 6087 | 49.3463 | 19 | Left 1008 (+) | - | 80.92
ClaMac_1_LG_12 | 28.04 | 8367 | 48.2834 | 21 | Left 124 (+) | - | 95.73
ClaMac_1_LG_13 | 24.08 | 3715 | 51.2825 | 18 | Left 377 (+) | - | 86.62
ClaMac_1_LG_14 | 21.22 | 4707 | 49.7286 | 13 | - | Right 136 (-) | 94.05
ClaMac_1_LG_15 | 25.46 | 3974 | 51.1515 | 20 | Left 127 (+) | Right 115 (-) | 95.00
ClaMac_1_LG_16 | 24.8 | 4054 | 50.9451 | 15 | - | Right 109 (-) | 73.65
ClaMac_1_LG_17 | 22.98 | 67523 | 38.5013 | 15 | Left 120 (+) | - | 77.54
ClaMac_1_LG_18 | 24.8 | 2060 | 53.9205 | 13 | - | - | 90.26
ClaMac_1_LG_19 | 31.01 | 3329 | 52.8263 | 22 | - | Right 363 (-) | 92.45
ClaMac_1_LG_20 | 30.49 | 4074 | 51.8251 | 15 | - | Right 406 (-) | 83.79
ClaMac_1_LG_21 | 28.95 | 4609 | 51.2 | 28 | - | - | 100.00
ClaMac_1_LG_22 | 33.48 | 2784 | 54.0062 | 23 | - | Right 904 (-) | 92.37
ClaMac_1_LG_23 | 31.45 | 8435 | 48.8646 | 16 | - | Right 790 (-) | 86.84
ClaMac_1_LG_24 | 28.07 | 4984 | 50.6636 | 12 | Left 423 (+) | - | 88.33
ClaMac_1_LG_25 | 38.33 | 6711 | 50.6434 | 24 | - | - | 96.55
ClaMac_1_LG_26 | 40.06 | 4967 | 52.1555 | 21 | Left 556 (+) | Right 604 (-) | 87.85
ClaMac_1_LG_27 | 29.17 | 2146 | 54.5417 | 13 | - | - | 74.96
Table 3 Summary of scaffold metrics in the haplotype-resolved genome assembly of Clarias macrocephalus, Haplotype 2.
Scaffold Name (Haplotype_Linkage) | Size (Mb) | Errors (21-mer) | Quality Values (hyb.k21.meryl) | Gaps (N) | Left Telomere (5’ AACCCT) | Right Telomere (AGGGTT 3’) | S-AQI (%)
ClaMac_2_LG_01 | 51.51 | 2649 | 55.5298 | 17 | - | - | 89.28
ClaMac_2_LG_02 | 44.94 | 6893 | 51.161 | 25 | Left 255 (+) | - | 71.62
ClaMac_2_LG_03 | 39.44 | 8894 | 49.5912 | 16 | - | - | 83.57
ClaMac_2_LG_04 | 38.01 | 6912 | 50.5588 | 18 | Left 293 (+) | Right 526 (-) | 88.17
ClaMac_2_LG_05 | 38.16 | 3424 | 53.4601 | 23 | - | Right 125 (-) | 78.26
ClaMac_2_LG_06 | 38.66 | 2675 | 54.3325 | 17 | - | - | 80.86
ClaMac_2_LG_07 | 33.93 | 5651 | 50.8217 | 17 | Left 169 (+) | - | 89.91
ClaMac_2_LG_08 | 47.19 | 10458 | 49.5856 | 29 | Left 174 (+) | Right 190 (-) | 89.44
ClaMac_2_LG_09 | 28.8 | 10239 | 47.5736 | 14 | Left 641 (+) | - | 91.91
ClaMac_2_LG_10 | 29.07 | 5001 | 50.8593 | 14 | Left 268 (+) | - | 95.24
ClaMac_2_LG_11 | 25.76 | 6627 | 48.9338 | 21 | - | - | 89.75
ClaMac_2_LG_12 | 30.93 | 6915 | 48.7951 | 19 | - | - | 82.58
ClaMac_2_LG_13 | 23.66 | 3213 | 51.9117 | 15 | - | Right 100 (+) | 86.65
ClaMac_2_LG_14 | 21.06 | 3487 | 50.9324 | 7 | - | - | 100.00
ClaMac_2_LG_15 | 24.93 | 7673 | 48.2988 | 17 | Left 134 (+) | Right 137 (-) | 85.58
ClaMac_2_LG_16 | 24.83 | 3011 | 52.2777 | 10 | - | - | 81.74
ClaMac_2_LG_17 | 22.34 | 2012 | 53.6414 | 15 | Left 175 (+) | - | 85.99
ClaMac_2_LG_18 | 24.44 | 1986 | 54.0521 | 11 | - | - | 100.00
ClaMac_2_LG_19 | 30.86 | 3566 | 52.521 | 16 | - | Right 362 (-) | 90.62
ClaMac_2_LG_20 | 34.51 | 2384 | 53.5357 | 13 | - | - | 100.00
ClaMac_2_LG_21 | 27.99 | 2550 | 53.5741 | 13 | - | - | 100.00
ClaMac_2_LG_22 | 33.35 | 4588 | 51.8225 | 16 | Left 128 (+) | Right 908 (-) | 88.64
ClaMac_2_LG_23 | 31.45 | 9701 | 48.2459 | 6 | Left 214 (+) | Right 793 (-) | 82.68
ClaMac_2_LG_24 | 27.39 | 2840 | 53.0349 | 6 | Left 114 (+) | - | 93.87
ClaMac_2_LG_25 | 37.88 | 7891 | 49.9914 | 15 | - | - | 87.01
ClaMac_2_LG_26 | 40.17 | 11271 | 48.6822 | 13 | - | Right 607 (-) | 88.08
ClaMac_2_LG_27 | 29.56 | 1236 | 56.9751 | 12 | Left 155 (+) | - | 90.78
MT_from_NC_046749 | 0.02 | - | - | - | - | - | -
6. TE Metrics
6.1 Dating of TE Replicative Bursts
To estimate the expansion history of major transposable element (TE) superfamilies in Clarias macrocephalus, I calculated the nucleotide divergence of each TE copy relative to its consensus sequence and converted this divergence into approximate insertion ages using a neutral mutation rate58 ($\mu = 6 \times 10^{-9}$ per site per generation) previously determined in electric catfish (Liu et al., 2023b). Assuming a one-year generation time, the estimated expansion peaks indicate that LTR/Gypsy and TIR/Mutator families expanded around 30–33 Mya, LTR/Copia and TIR/hAT around 28–30 Mya, and TIR/CACTA, PIF/Harbinger, and Tc1/Mariner families between 16 and 25 Mya. These results (Figure 20) reveal multiple waves of TE proliferation, likely reflecting episodes of genome instability or evolutionary transition in the species’ history.
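The conversion from divergence to insertion age can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the age is taken as T = d / µ generations, where d is the fractional divergence of a copy from its consensus, with no correction for multiple substitutions (for example, Jukes-Cantor); the text does not specify the exact formula used, and the function name is hypothetical.

```python
def insertion_age_mya(divergence: float, mu: float = 6e-9,
                      generation_years: float = 1.0) -> float:
    """Approximate insertion age in million years.

    divergence       : fraction of differing sites vs. the consensus (0..1)
    mu               : neutral mutation rate per site per generation
    generation_years : generation time in years
    Assumption: T = divergence / mu, without multiple-hit correction.
    """
    generations = divergence / mu
    return generations * generation_years / 1e6


# Under these assumptions, 18% divergence corresponds to about 30 Mya.
for d in (0.10, 0.15, 0.18):
    print(f"{d:.0%} divergence -> {insertion_age_mya(d):.1f} Mya")
```

With the rate and generation time used here, the reported 16–33 Mya expansion peaks correspond to divergences of roughly 10–20% from the consensus.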
Figure 20 TE divergence profiles in bighead catfish genome.
58 The neutral mutation rate is the rate at which mutations accumulate in non-functional genomic regions, providing a baseline for molecular-clock dating.
6.2 Genome composition in TE
Transposable element (TE) annotation indicated that 35.25% of the bighead catfish genome consisted of repetitive sequences. Among these, TIR DNA transposons comprised 19.12%, Helitrons 4.47%, LTR retrotransposons 8.30%, LINE elements 0.46%, and the remaining 2.82% consisted of uncharacterized repeats.
Table 4 Transposable element content in Bighead catfish Haplotype 1.
Class | Superfamily | # of elements (Haplotype 1)
LINE | I | 128
LINE | L1 | 547
LINE | L2 | 1,129
LINE | Rex | 556
LINE | LINE/Unknown | 7,412
LTR | Bel_Pao | 151
LTR | Copia | 53
LTR | Gypsy | 68,215
LTR | LTR/Unknown | 13,466
TIR | CACTA | 357,388
TIR | Mutator | 172,578
TIR | PIF_Harbinger | 36,538
TIR | Tc1_Mariner | 41,500
TIR | hAT | 80,189
TIR | PiggyBac | 758
TIR | Polinton | 281
NonLTR | DIRS_YR | 678
NonLTR | Penelope | 417
NonTIR | Helitron | 166,251
Other | Repeat Fragment | 78,900
Total | Interspersed | 1,148,329
LINEs (Long Interspersed Nuclear Elements) are non-LTR retrotransposons that transpose via a copy-and-paste mechanism using reverse transcriptase; although less abundant in fish genomes than in mammals, they contribute to structural variation. LTR retrotransposons replicate through an RNA intermediate using reverse transcription and are flanked by long terminal repeats, playing key roles in genome expansion and gene regulation. TIR DNA transposons (Terminal Inverted Repeat transposons) are cut-and-paste elements flanked by inverted repeats recognized by a transposase enzyme, contributing to genome rearrangement and regulatory diversification. Non-LTR retrotransposons such as DIRS and Penelope elements add further diversity to the repetitive landscape through distinct mobilization mechanisms. Helitrons, belonging to the NonTIR class, are rolling-circle DNA transposons that lack terminal inverted repeats and transpose through a mechanism similar to rolling-circle replication, often capturing and mobilizing gene fragments that promote genomic innovation. Together, these TE classes reflect a dynamic evolutionary history of C. macrocephalus, consistent with the high DNA transposon activity typical of teleost genomes.
7. Macrosynteny Analysis
To examine orthologous relationships and identify syntenic regions between catfish and related teleost species, I obtained proteomes (.pep.fa files) from Ensembl 2024 (Harrison et al., 2023), ensuring the latest assembly versions for consistency. Species are Oncorhynchus mykiss (USDA_OmykA_1.1), Oryzias latipes (ASM223467v1), Lates calcarifer (ASB_HGAPassembly_v1), Cyprinus carpio (Cypcar_WagV4.0), Danio rerio (GRCz11), Oreochromis niloticus (O_niloticus_UMD_NMBU), and Lepisosteus oculatus (LepOcu1). Orthologous gene clusters (Orthogroups) were first identified with OrthoFinder (’orthofinder -f /proteomes -M msa -a 126 -t 126’) (Emms and Kelly, 2019) 2.5.5. Subsequently, I utilized rbhXpress (https://github.com/SamiLhll/rbhXpress) and macrosyntR (El Hilali and Copley, 2023) 0.2.19 to analyze the resulting orthogroups and visualize their syntenic relationships across species. This combined approach allowed for an in-depth view of conserved and divergent genomic regions within the catfish lineage and related teleosts (Figure 21).

Figure 21 Synteny analysis of various catfish samples. (A) Whole-genome 4-way synteny analysis from 1413 shared single-copy orthogroups across Clarias gariepinus (GCA_024256425.2), Clarias macrocephalus, Danio rerio (GRCz11) and Oreochromis niloticus (O_niloticus_UMD_NMBU) reveals conserved chromosome structures and rearrangements. (B) Simple phylogeny and geological timescales (TimeTree5, https://timetree.org/).
Results – F1 Hybrid Catfish
The haplotype-resolved de novo assembly of the F1 hybrid catfish (C. macrocephalus × C. gariepinus) yielded a total of 55 Hi-C scaffolds, corresponding to the combined 27 + 28 pseudochromosomes inherited from the parental species (Lewin et al., 2019; Maneechot et al., 2016). The resulting assembly spans both subgenomes, representing the first complete haplotype-resolved hybrid genome for aquaculture catfishes.
1. Raw Read Quality
Figure 22 summarizes the sequencing data quality and genome survey statistics for the hybrid genome.
Figure 22 Sequencing summary and genome survey of the F1 hybrid catfish (C. macrocephalus × C. gariepinus). (A) GenomeScope2.0 k-mer profile (k=21) showing estimated haploid genome size (903 Mb), 10.1 % heterozygosity, and low sequencing error rate (0.28 %). (B) Sequencing protocols and mapped coverages across data types, including HiFi (30×), ONT (36×), Hi-C (37×), and Illumina WGS (55×). (C) HiFi reads: high quality with a modal length around 15 kbp. (D) ONT reads: broader length distribution with N50 ≈ 30 kbp and moderate quality.
1.1 Illumina paired-end short-reads
Standard Illumina paired-end short-read sequencing (Bentley et al., 2008) was performed on the Illumina NextSeq2000 platform (by NovoGene), resulting in more than 50 Gb of raw read data (read-length=151 nucleotides), which is equivalent to a genome coverage of 36X. Raw reads had an average quality of Q25, and 92% of the dataset had a median quality of Q30.
1.2 Proximity-ligation (Hi-C) paired-end short-reads
Proximity-ligation (Hi-C) data were obtained by following a standard in
situ Hi-C protocol from 2009 (Dudchenko et al., 2017). Briefly, a nuclear ligation was
performed by cross-linking chromosomes, then a restriction digestion was carried out
with DpnII endonuclease. The chromatin conformation capture library was prepared
using Phase Genomics (https://info.phasegenomics.com/protocols), and the genomic
DNA was sequenced on the Illumina NextSeq2000 platform, resulting in paired-end
short-reads (151 nucleotides in length) for 37.19 Gb of raw data (approximately 100M
pairs of reads per billion bases in genome length), corresponding to a genome coverage
of 39.77X. Of the dataset, 97.18% had a quality score of Q30 or higher (Supplementary
Table 3). For Haplotype 1, 123.96M read pairs were sequenced, yielding 74.77M unique
Hi-C contacts, with 59.40M valid contacts (22.71M inter-chromosomal, 36.69M
intra-chromosomal), including 17.47M short-range (<20Kb) and 19.72M long-range (>20Kb)
contacts. Haplotype 2 showed similar metrics.
1.3 PacBio HiFi long-reads
PacBio's HiFi long-read sequencing was performed using a SMRT cell
on the Sequel II system, resulting in 10.62 Gb of raw data corresponding to a genome
coverage of approximately 12.9X, or 6-7X per haplotype. The read length N50 value
was 10,379 bases, the raw reads' mean quality score was Q28.7, and the median quality
score was Q36. Over 72% were above Q30, and 86% above Q25. There were 191,790
5. Genome Completeness Evaluation
To further validate gene content and completeness, BUSCO analysis (Figure
25) was performed, yielding high completeness scores. For the North African catfish
sub-genome, 97.1% of core genes were complete [95.7% single-copy, 1.4% duplicated],
with 2.2% missing. The bighead catfish sub-genome showed similar completeness, with
96.6% of core genes [95.3% single-copy, 1.3% duplicated] and 2.7% missing. This
analysis underscores the high integrity and completeness of gene content within the
assembly.
Notably, the F1 genome exceeded 97% BUSCO completeness (Fig. 25B) and
achieved a median QV of 55 (Merqury; Figs. 25D, 25E), with no multiplicity peaks
over 2×, indicating near-reference-grade assemblies. Together with k-mer analysis
and BUSCO metrics, these profiles confirm the F1 hybrid assembly's completeness,
chromosome number consistency (27 + 28), and dual subgenome contribution, while
BUSCO-derived gene synteny in Clarias confirms the F1 hybrid nature of the genome,
which derives from two species (Fig. 25F). One mitochondrial sequence, confirmed through
IGV polymorphism analysis, belongs to C. macrocephalus. Other results were a final assembly
k-mer completeness estimated at 99.4% (k=21, Merqury) or 99.38% (k=31, Merqury),

Figure 25 Overview of F1 Hybrid Catfish (Clarias gariepinus × Clarias macrocephalus)
genome assembly validation. Hi-C phasing confirms 55 pseudochromosomes, with BUSCO
scores indicating 97% gene completeness. Synteny analysis and k-mer distributions support the
F1 hybrid nature, demonstrating chromosome separation into sub-genomes and their
evolutionary affiliations.
6. Scaffold Metrics
6.1 Assembly Statistics and Quality Assessment
Assembly metrics are summarized in Table 5 and Table 6, including scaffold
count, length in Mb, QV, error rates, and scaffold length in bases, with a gradient
indicating QV quality: red for QV < 50, yellow for QV between 50 and 60, and green for
QV > 60. The leftmost column displays the count of error seqmers (k-mers present in
the assembly but absent in reads), indicating potential assembly errors. Alongside are
given the relative telomere orientation (5' + inward -> / <- inward - 3') and the gap
count in chromosomes.
Structural validation, performed using 3D-DNA and JBAT, confirmed scaffold
accuracy. Additional polishing steps used NextPolish2, TGS-GapCloser, and
Pilon, followed by automated consensus polishing using ONT 1D reads, Racon, Merfin, and
BCFtools, resulting in a final quality score of QV50 (99.999% base precision), a 99%
mapping rate, and 99.4% k-mer completeness. Despite the presence of complex
repeats, the assembly achieved CRAQ structural accuracy (S-AQI) above 97%, indicating
reference-grade quality (Li et al., 2023).
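The correspondence between QV50 and 99.999% base precision follows from the Phred scaling of consensus quality values, QV = -10 · log10(E), where E is the per-base error probability. The short sketch below only illustrates that arithmetic; the function names are illustrative and are not taken from the thesis pipeline.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Per-base error probability implied by a Phred-scaled quality value."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Phred-scaled quality value implied by a per-base error probability."""
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # QV thresholds used for the colour gradient in Tables 5 and 6.
    for qv in (50, 60):
        err = qv_to_error_rate(qv)
        print(f"QV{qv}: error rate {err:.0e}, base accuracy {100 * (1 - err):.4f}%")
```

Under this scaling, QV50 corresponds to one expected error per 100,000 bases and QV60 to one per million.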
6.2 Comparative Analysis of Subgenomes
North African catfish scaffolds exhibit consistently higher QVs (up to ~65.79)
than bighead catfish scaffolds (ranging ~50–54), indicating fewer sequence errors and
higher assembly quality overall. Several chromosomes exhibit Telomere-to-Telomere
(T2T) continuity, validating successful resolution of full-length chromosomes. North
African catfish scaffolds contain fewer unclosed gaps than bighead catfish, reflecting
superior completeness and assembly contiguity. Note that this may result from local
genome complexity or differences in transposable element content and diversity, aside
from variations in raw sequencing data quality.
Table 5 North African subgenome scaffold metrics with Quality Values from Merqury (k=21
and k=31), telomere orientation, and gap counts.
Scaffold_name Length QV_k21 Errors_k21 QV_k31 Errors_k31 Left_telomere Gaps Right_telomere
Sub-Genome_Chromosome Mb 21-mer 21-mer 31-mer 31-mer (5' AACCCT) (N) (AGGGTT 3')
ClaHyb_african_1_Chr_01 51.1 64.5442 377 59.4881 1783 119 (+) T2T 0
ClaHyb_african_1_Chr_02 51.7 56.0219 2713 56.0911 3942 174 (+) 10
ClaHyb_african_1_Chr_03 47.6 59.3028 1174 54.5941 5125 1701 (+) 31603 (-)
ClaHyb_african_1_Chr_04 44.3 49.4714 10490 56.9825 2716 185 (+) 8 160 (-)
ClaHyb_african_1_Chr_05 41.5 62.5139 488 61.5864 892 1472 (+) T2T 1883 (-)
ClaHyb_african_1_Chr_06 41.3 54.5824 3021 56.2081 3068 278 (+) 11447 (-)
ClaHyb_african_1_Chr_07 42.6 58.0864 1389 54.3111 4889 384 (+) 3382 (-)
ClaHyb_african_1_Chr_08 40.2 54.3182 3124 59.8602 1287 0 T2T 130 (-)
ClaHyb_african_1_Chr_09 37.3 55.9044 2013 53.8378 4784 1449 (+) 1315 (-)
ClaHyb_african_1_Chr_10 34.9 67.6478 126 60.3736 993 0 2127 (-)
ClaHyb_african_1_Chr_11 33.5 51.4996 4986 58.2486 1555 1614 (+) T2T 1997 (-)
ClaHyb_african_1_Chr_12 35.3 54.4649 2653 58.1993 1657 0 6 0
ClaHyb_african_1_Chr_13 33.6 66.4695 159 63.3235 483 1525 (+) 21557 (-)
ClaHyb_african_1_Chr_14 32.4 47.2905 12717 61.6029 690 105 (+) T2T 226 (-)
ClaHyb_african_1_Chr_15 35 64.2407 275 58.2795 1602 328 (+) 2118 (-)
ClaHyb_african_1_Chr_16 33.8 60.5363 627 57.9679 1675 1602 (+) 4137 (-)
ClaHyb_african_1_Chr_17 30.3 55.2666 1892 56.604 2052 335 (-) T2T 276 (-)
ClaHyb_african_1_Chr_18 30.5 56.641 1382 57.9088 1522 1411 (+) 32289 (-)
ClaHyb_african_1_Chr_19 30.7 64.5469 226 58.6026 1311 784 (+) 2178 (+)
ClaHyb_african_1_Chr_20 27.1 57.0276 1128 60.0252 835 171 (+) 2406 (-)
ClaHyb_african_1_Chr_21 30.6 60.6832 549 60.1377 919 1936 (+) 20
ClaHyb_african_1_Chr_22 30 62.328 369 60.2163 886 0 2129 (-)
ClaHyb_african_1_Chr_23 26.2 52.9697 2773 58.0814 1260 279 (+) 11739 (-)
ClaHyb_african_1_Chr_24 25.9 60.3587 501 60.6659 689 185 (+) 2114 (-)
ClaHyb_african_1_Chr_25 25.6 58.5637 748 59.0404 989 232 (+) T2T 129 (-)
ClaHyb_african_1_Chr_26 26.3 56.7889 1163 57.358 1506 1343 (+) 30
ClaHyb_african_1_Chr_27 24 68.2243 76 59.5233 832 0 T2T 264 (-)
ClaHyb_african_1_Chr_28 20.3 59.8335 443 56.9713 1261 0 20
Table 6 Bighead subgenome scaffold metrics with Quality Values from Merqury (k=21 and
k=31), telomere orientation, and gap counts.
Scaffold_name Length QV_k21 Errors_k21 QV_k31 Errors_k31 Left_telomere Gaps Right_telomere
Sub-Genome_Chromosome Mb 21-mer 21-mer 31-mer 31-mer (5' AACCCT) (N) (AGGGTT 3')
ClaHyb_bighead_1_Chr_01 50.1 55.7992 2767 50.7048 13207 998 (+) 14 740 (-)
ClaHyb_bighead_1_Chr_02 44.3 51.8276 6102 54.4421 4934 815 (+) 50
ClaHyb_bighead_1_Chr_03 40.4 44.33 31251 55.4715 3531 327 (+) 12 279 (-)
ClaHyb_bighead_1_Chr_04 38.4 52.1302 4933 52.1322 7281 222 (+) 13 819 (-)
ClaHyb_bighead_1_Chr_05 38.1 47.0334 15835 53.7219 4994 321 (+) 14 301 (-)
ClaHyb_bighead_1_Chr_06 37.4 52.5469 4370 50.6785 9921 282 (+) 14 0
ClaHyb_bighead_1_Chr_07 33.8 50.998 5644 48.4795 14888 924 (+) 15 629 (-)
ClaHyb_bighead_1_Chr_08 45.6 56.2767 2252 51.0889 10984 161 (+) 11 1237 (-)
ClaHyb_bighead_1_Chr_09 30 48.1331 9677 48.5674 12936 184 (+) 13 751 (-)
ClaHyb_bighead_1_Chr_10 29.8 54.8059 2065 49.5164 10314 316 (+) 12 664 (-)
ClaHyb_bighead_1_Chr_11 25 51.0866 4085 53.4888 3469 0 21 267 (-)
ClaHyb_bighead_1_Chr_12 31.6 48.5768 9241 47.4853 17546 559 (+) 13 0
ClaHyb_bighead_1_Chr_13 24.9 44.8933 16909 48.6904 10418 740 (+) 8220 (-)
ClaHyb_bighead_1_Chr_14 23.3 44.5272 17249 51.8503 4715 0 8190 (-)
ClaHyb_bighead_1_Chr_15 27.6 39.2488 68816 52.3266 4848 200 (+) 12 107 (-)
ClaHyb_bighead_1_Chr_16 25.2 51.936 3389 53.2655 3686 656 (+) 9818 (-)
ClaHyb_bighead_1_Chr_17 24 40.0722 49568 46.7745 15616 326 (+) 15 222 (-)
ClaHyb_bighead_1_Chr_18 25.2 52.6726 2856 52.3131 4582 632 (+) 11 973 (-)
ClaHyb_bighead_1_Chr_19 31.1 53.8536 2692 52.9809 4841 674 (+) 8751 (-)
ClaHyb_bighead_1_Chr_20 32.7 51.6697 4681 51.0227 7929 410 (+) 18 422 (-)
ClaHyb_bighead_1_Chr_21 29.2 43.0312 30479 51.9822 5662 805 (+) 16 253 (-)
ClaHyb_bighead_1_Chr_22 34.9 51.0828 5705 50.9355 8715 568 (+) 13 121 (-)
ClaHyb_bighead_1_Chr_23 31.7 42.7592 35295 48.8582 12797 931 (+) 81022 (-)
ClaHyb_bighead_1_Chr_24 29.6 50.1103 6080 52.8055 4829 851 (+) 10 159 (-)
ClaHyb_bighead_1_Chr_25 40.3 52.391 4875 52.1653 7582 0 9798 (-)
ClaHyb_bighead_1_Chr_26 40.1 53.1914 4035 51.6319 8529 1038 (+) 11 0
ClaHyb_bighead_1_Chr_27 30.5 48.7036 8641 48.5145 13324 1858 (+) 10 219 (-)
6.3 Structural Accuracy Assessment
Table 7 and Table 8 present structural accuracy metrics from CRAQ for
both sub-genomes, including Structural Accuracy Index (S-AQI), Regional Accuracy
(R-AQI), and heterozygous loci for structural and regional metrics (Avg.CSH/Avg.CRH).
Note that CRH and CSH values are expected to be close to 0 due to the high genomic
diversity between the two parental species (divergence > 10%, mash-based (Ondov et al.,
2019), not shown here).
North African catfish scaffolds displayed near-perfect S-AQI and R-AQI
values, both ranging approximately from 97% to 100%, indicating minimal misassemblies
at both large-scale and regional-scale levels. In contrast, bighead catfish scaffolds
exhibited slightly lower accuracy, with S-AQI values around 90–97% and R-AQI
between 90–96%. This reduction in quality may be attributed to factors such as lower
species heterozygosity, increased local assembly complexity due to transposable
element content, or potential sequencing quality issues, possibly arising from degraded
genomic DNA used in ONT sequencing.

Table 7 North African subgenome structural accuracy metrics from CRAQ: S-AQI, R-AQI, and
heterozygous regions (Avg.CSH/Avg.CRH).
Scaffold_name Mapping.rate Avg.CRH Avg.CRE Avg.CSE Regional-AQI Avg.CSH Structural-AQI
Sub-Genome_Chromosome ONT/PE (%) Count Count Count Accuracy (%) Count Accuracy (%)
Genome (55 Chromosomes) >97% 0.046 0.270 0.021 97.33 0.001 97.92
ClaHyb_african_1_Chr_01 0.992 0.020 0.078 0.000 99.22 0.000 100.00
ClaHyb_african_1_Chr_02 0.981 0.000 0.098 0.000 99.02 0.000 100.00
ClaHyb_african_1_Chr_03 0.955 0.033 0.154 0.000 98.47 0.000 100.00
ClaHyb_african_1_Chr_04 0.980 0.046 0.207 0.046 97.95 0.000 95.50
ClaHyb_african_1_Chr_05 0.986 0.048 0.097 0.000 99.04 0.000 100.00
ClaHyb_african_1_Chr_06 0.998 0.024 0.085 0.000 99.15 0.024 100.00
ClaHyb_african_1_Chr_07 0.978 0.000 0.120 0.000 98.81 0.000 100.00
ClaHyb_african_1_Chr_08 0.992 0.025 0.100 0.000 99.00 0.000 100.00
ClaHyb_african_1_Chr_09 0.992 0.076 0.083 0.027 99.17 0.000 97.34
ClaHyb_african_1_Chr_10 0.998 0.066 0.172 0.000 98.29 0.000 100.00
ClaHyb_african_1_Chr_11 0.905 0.030 0.188 0.000 98.13 0.000 100.00
ClaHyb_african_1_Chr_12 0.873 0.086 0.230 0.000 97.72 0.000 100.00
ClaHyb_african_1_Chr_13 0.917 0.075 0.119 0.000 98.81 0.000 100.00
ClaHyb_african_1_Chr_14 0.990 0.031 0.062 0.062 99.38 0.000 93.96
ClaHyb_african_1_Chr_15 0.922 0.076 0.152 0.000 98.49 0.000 100.00
ClaHyb_african_1_Chr_16 0.952 0.000 0.248 0.000 97.55 0.000 100.00
ClaHyb_african_1_Chr_17 0.997 0.033 0.083 0.000 99.18 0.000 100.00
ClaHyb_african_1_Chr_18 0.954 0.175 0.315 0.000 96.90 0.000 100.00
ClaHyb_african_1_Chr_19 0.998 0.000 0.294 0.033 97.10 0.000 96.79
ClaHyb_african_1_Chr_20 0.995 0.056 0.148 0.000 98.53 0.000 100.00
ClaHyb_african_1_Chr_21 0.986 0.033 0.131 0.098 98.70 0.000 90.65
ClaHyb_african_1_Chr_22 0.992 0.067 0.134 0.034 98.67 0.000 96.70
ClaHyb_african_1_Chr_23 0.998 0.058 0.186 0.038 98.16 0.000 96.23
ClaHyb_african_1_Chr_24 0.999 0.077 0.232 0.000 97.71 0.000 100.00
ClaHyb_african_1_Chr_25 0.989 0.039 0.118 0.000 98.83 0.000 100.00
ClaHyb_african_1_Chr_26 0.994 0.057 0.324 0.000 96.81 0.000 100.00
ClaHyb_african_1_Chr_27 0.980 0.042 0.166 0.000 98.35 0.000 100.00
ClaHyb_african_1_Chr_28 0.995 0.000 0.297 0.000 97.07 0.000 100.00
Table 8 Bighead subgenome structural accuracy metrics from CRAQ: S-AQI, R-AQI, and
heterozygous regions (Avg.CSH/Avg.CRH).
Scaffold_name Mapping.rate Avg.CRH Avg.CRE Avg.CSE Regional-AQI Avg.CSH Structural-AQI
Sub-Genome_Chromosome ONT/PE (%) Count Count Count Accuracy (%) Count Accuracy (%)
Genome (55 Chromosomes) >97% 0.046 0.270 0.021 97.33 0.001 97.92
ClaHyb_bighead_1_Chr_01 0.981 0.000 0.329 0.000 96.76 0.000 100.00
ClaHyb_bighead_1_Chr_02 0.992 0.046 0.148 0.023 98.53 0.000 97.75
ClaHyb_bighead_1_Chr_03 0.980 0.050 0.276 0.000 97.28 0.000 100.00
ClaHyb_bighead_1_Chr_04 0.990 0.000 0.370 0.026 96.36 0.000 97.40
ClaHyb_bighead_1_Chr_05 0.972 0.053 0.360 0.160 96.46 0.000 85.22
ClaHyb_bighead_1_Chr_06 0.934 0.000 0.497 0.027 95.15 0.000 97.33
ClaHyb_bighead_1_Chr_07 0.990 0.030 0.403 0.030 96.05 0.000 97.06
ClaHyb_bighead_1_Chr_08 0.987 0.000 0.429 0.044 95.80 0.000 95.66
ClaHyb_bighead_1_Chr_09 0.986 0.051 0.422 0.067 95.87 0.000 93.47
ClaHyb_bighead_1_Chr_10 0.982 0.034 0.460 0.034 95.50 0.000 96.65
ClaHyb_bighead_1_Chr_11 0.961 0.042 0.382 0.042 96.26 0.000 95.92
ClaHyb_bighead_1_Chr_12 0.946 0.194 0.606 0.069 94.12 0.000 93.38
ClaHyb_bighead_1_Chr_13 0.989 0.041 0.509 0.081 95.04 0.000 92.19
ClaHyb_bighead_1_Chr_14 0.982 0.080 0.544 0.000 94.71 0.044 100.00
ClaHyb_bighead_1_Chr_15 0.922 0.136 0.506 0.039 95.07 0.000 96.19
ClaHyb_bighead_1_Chr_16 0.993 0.040 0.240 0.000 97.63 0.000 100.00
ClaHyb_bighead_1_Chr_17 0.985 0.042 0.793 0.042 92.38 0.000 95.87
ClaHyb_bighead_1_Chr_18 0.977 0.000 0.474 0.000 95.37 0.000 100.00
ClaHyb_bighead_1_Chr_19 0.992 0.000 0.341 0.000 96.65 0.000 100.00
ClaHyb_bighead_1_Chr_20 0.971 0.000 0.236 0.062 97.66 0.000 93.96
ClaHyb_bighead_1_Chr_21 0.977 0.064 0.591 0.000 94.26 0.000 100.00
ClaHyb_bighead_1_Chr_22 0.935 0.102 0.406 0.044 96.02 0.000 95.74
ClaHyb_bighead_1_Chr_23 0.990 0.000 0.303 0.032 97.02 0.000 96.87
ClaHyb_bighead_1_Chr_24 0.992 0.303 0.608 0.000 94.10 0.000 100.00
ClaHyb_bighead_1_Chr_25 0.990 0.062 0.350 0.025 96.56 0.000 97.53
ClaHyb_bighead_1_Chr_26 0.991 0.050 0.378 0.025 96.29 0.000 97.51
ClaHyb_bighead_1_Chr_27 0.979 0.033 0.447 0.000 95.63 0.000 100.00
GENOME ASSEMBLY, REVERSE VACCINOLOGY, AND
QUALITY BY DESIGN – Streptococcus iniae
Introduction
1. Streptococcus iniae as a Major Aquaculture Pathogen
Streptococcus iniae, first identified in an Amazon River dolphin (Inia
geoffrensis) in the 1970s (Pier and Madin, 1976), has emerged as a major bacterial pathogen
in global aquaculture. The pathogen causes annual losses exceeding USD 100 million
worldwide (Shoemaker et al., 2001), with mortality rates of 30–80% during outbreaks
(Chen et al., 2012) and cumulative mortality up to 70% after three months in some
species (Mmanda et al., 2014). Detected across all continents (Baiano and Barnes,
2009; Mishra et al., 2018), S. iniae affects over 27 fish species (Agnew and Barnes,
2007), particularly in intensive aquaculture systems where high stocking densities and
environmental stressors facilitate rapid disease transmission (Chen et al., 2012).
Economically important species including Asian seabass (Lates calcarifer), tilapia
(Oreochromis spp.), and catfish (Siluriformes spp.) are particularly susceptible (Azmai and
Saad, 2011; Nawawi et al., 2008). While rare, zoonotic transmission to humans
handling infected fish has been documented (Facklam et al., 2005).
Figure 26 Phase contrast micrograph of S. iniae strain QMA0076, showing characteristic
spherical morphology. Credit: Baiano et al. (2008).
2. Taxonomic Classification and Molecular Characteristics
S. iniae belongs to the phylum Bacillota (formerly Firmicutes), comprising
Gram-positive bacteria with thick peptidoglycan cell walls. Within the genus Streptococcus,
S. iniae shares molecular mechanisms with the human pathogens S. pyogenes and
S. pneumoniae, particularly the sortase A pathway for cell wall protein anchoring.
Taxonomic hierarchy:
Domain: Bacteria
Phylum: Bacillota
Class: Bacilli
Order: Lactobacillales
Family: Streptococcaceae
Genus: Streptococcus
Species: S. iniae
Key virulence factors identified include M-like protein (SimA), hyaluronidase
(Hyl), enolase (eno), and glyceraldehyde-3-phosphate dehydrogenase (GAPDH).
Notably, GAPDH localizes to the cell surface despite its primary glycolytic
function, enhancing host cell adherence and immune evasion. Sortase A (SrtA) anchors
these surface proteins to the cell wall through a conserved mechanism shared with
other pathogenic streptococci. Despite these molecular insights, knowledge gaps
remain regarding genomic variability, horizontal gene transfer, and antimicrobial
resistance mechanisms.
3. Current Disease Management Challenges
Disease control in aquaculture relies primarily on broad-spectrum antibiotics
(Schar et al., 2020, 2021), with vaccines serving as complementary prophylactic
measures. However, vaccine adoption remains limited, particularly for low-value fresh-
Comments, Annotation, and Features. These were mapped to SIKU01 via Entry_name
derived from DIAMOND subjects after one-to-one matching between isolate SF1 from the
UniProtKB reference proteome and isolate SIKU01 (Supplementary Data 13).
Pathway and hierarchy assignments utilized KEGG Mapper/BRITE (Kanehisa,
2000), and GO terms followed the Gene Ontology framework (The Gene Ontology
Consortium, 2021) (Supplementary Data 13). Transmembrane helices and bacterial
Gram-positive N-terminal signal peptides were predicted with TMHMM 2.0 (Krogh
et al., 2001) and SignalP 5.0 (Almagro Armenteros et al., 2019), respectively
(Supplementary Data 13). Additional family/domain calls from InterPro's member
databases (Blum et al., 2021) were considered where present.
An epitope evidence layer (Supplementary Data 13) was merged on the variable
Locus_tag, carrying evalue_epitope_prediction, pct_identity_epitope_prediction,
and bit_score_epitope_prediction for matches to curated Streptococcus spp. epitopes
retrieved from the Immune Epitope Database (IEDB) (Vita et al., 2019) and aligned to
the S. iniae proteome with DIAMOND BlastP (Buchfink et al., 2014) (Supplementary
Code 12). Epitope hits were retained at E ≤ 1e-3 (observed range 1e-3 to 1e-16)
(Supplementary Data 13).
7. Protein-Level Biophysical Descriptors
Protein-level biophysical descriptors were computed from the translated protein
sequences in Supplementary Data 13 using the Peptides R package (Osorio et al., 2015),
the SeqinR package in R (Charif and Lobry, 2007), and a custom script that annotates
the SIKU01 proteins of the proteome (Supplementary Code 12).
For each protein, we recorded theoretical isoelectric point (pI), molecular weight (MW),
net charge at pH 6.8, 7.0, and 7.4, GRAVY-like hydrophobicity, aliphatic index,
instability index, Boman binding potential, predicted membrane propensity class, and sequence
length (aa). These descriptors fed Matrix-1 filters (solubility, stability,
manufacturability biases) and Matrix-2 platform compatibility rules. Physicochemical descriptors
were appended as Charge_pH_6_8, Charge_pH_7, Charge_pH_7_4, aliphatic_Index,
hydrophobicity, instability_index, binding_potential, length_AA, mw, and pI to a subset
of UniProtKB TSV data, summarizing the physicochemical properties of individual SIKU01
proteins (Supplementary Data 13).
8. PubMed Literature Search
PubMed literature and PMID metadata were retrieved using a custom R script
(Supplementary Code 12). Proteins with prior experimental mentions, particularly as
vaccine candidates or virulence factors, received graduated positive scoring based on
evidence strength (Supplementary Data 13).
9. Annotation Merging and Biophysical Analysis
All previous annotation layers were merged using a purpose-built custom script
(Supplementary Code 12). Manual curation then cross-checked UniProtKB
keywords and feature texts against InterPro domain calls, KEGG/GO assignments, and
primary PGAP/Prokka annotations to produce the combined initial table used in
downstream QbD scoring (Supplementary Data 15). Plots and distributions were generated
using a custom R script (Supplementary Code 12).
10. Pangenomics and Sequence Conservation Analysis
Ninety representative S. iniae genomes obtained from NCBI Genomes (listed in
Supplementary Data 14) were annotated using Prokka (Seemann, 2014) to homogenize
coding sequence (CDS) annotations prior to graph-based pan-genome clustering with
Panaroo (Tonkin-Hill et al., 2020b), which implements an improved orthology graph
algorithm originally inspired by Roary (Page et al., 2015b). Clustering was performed
at a high amino acid identity threshold (≥85%), with a core gene inclusion cutoff of 95%,
and paralogs excluded. This graph-based ortholog clustering produced Supplementary
Data 14, including a gene presence/absence (P/A) table and summary links to each gene
sequence and annotation used in parsing the final pangenome graph.
To annotate each genome with consistent pan-genome metadata, we used a custom
Python script (Supplementary Code 12), adapted from post_run_gff_output.py,
a Panaroo-based utility. This script integrated Panaroo's graph and table outputs to
reconstruct isolate-specific GFF3 annotation files containing standardized pangenome
attributes. The Rtab file served as quantitative input to determine gene carriage frequency
across isolates and to assign each orthogroup to a pangenome class (core, soft-core,
shell, or cloud) based on prevalence thresholds: ≥99% for core, 95–98% for soft-core,
15–94% for shell, and <15% for cloud. These classifications generated augmented
presence/absence tables linking each Panaroo cluster ID to its class, representative gene, and
genome distribution.
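The prevalence thresholds above can be sketched as a small classifier. This is a minimal illustration, not the thesis script (Supplementary Code 12): the function name is hypothetical, and because the stated ranges leave gaps for fractional prevalences (for example 98.9%), the sketch assumes half-open intervals with lower bounds at 99, 95, and 15%.

```python
def pangenome_class(n_present: int, n_genomes: int) -> str:
    """Assign a pangenome class from gene carriage across isolates.

    Assumption: half-open intervals [99, 100], [95, 99), [15, 95), [0, 15),
    so that fractional prevalences between the stated ranges are still classified.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    prevalence = 100.0 * n_present / n_genomes
    if prevalence >= 99:
        return "core"
    if prevalence >= 95:
        return "soft-core"
    if prevalence >= 15:
        return "shell"
    return "cloud"


if __name__ == "__main__":
    # One presence/absence row per orthogroup (1 = present), as in an Rtab-style table.
    row = [1] * 88 + [0] * 2  # gene found in 88 of 90 genomes
    print(pangenome_class(sum(row), len(row)))
```

With 90 genomes, a gene missing from a single isolate (89/90, about 98.9%) falls below the 99% core threshold and is classified as soft-core under these assumptions.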
To derive isolate-specific orthogroups, we extracted S. iniae SIKU01 clusters
from Panaroo tables using a custom script (Supplementary Code 12), which filtered the
pangenome Rtab file to produce Supplementary Data 14. This dataset was enriched with
GFF-derived attributes using another custom script (Supplementary Code 12), which
cross-referenced pangenome_id entries against the Panaroo-generated GFF file to
create a unified annotation table. Although biased toward the SIKU01 reference isolate,
this approach provided one-to-one mapping between pangenome clusters and
genome-specific locus tags, enabling precise tracking of orthogroups during downstream
conservation analysis.
For each genome, CDS and translated protein sequences were extracted from
annotated GFF3 files using gffread (Pertea and Pertea, 2020), and all orthologous
sequences were concatenated into two tagged FASTA files with SeqKit (Shen et al., 2016).
These combined FASTA files were processed by a custom Python pipeline (Supplementary
Code 12), which automatically retrieved sequences by cluster ID and performed
per-cluster multiple sequence alignments (MSAs) with MAFFT (Katoh et al., 2002) or
Clustal Omega (Sievers and Higgins, 2018) depending on cluster size. The resulting
nucleotide- and amino acid-level MSAs formed the foundation for subsequent analyses
of sequence variability, conservation, and structure-based mapping.
All MSAs were analyzed in R (Supplementary Code 12) using a custom script,
which calculated normalized Shannon entropy (H.norm) at both codon and amino acid
levels via the Bio3D R package (Grant et al., 2006). For each orthogroup, H.norm
quantified sequence conservation on a continuous scale from 0 (fully conserved) to 1
(maximally variable). Antigens with median H.norm ≤0.05 were classified as highly
conserved, while those above 0.90 were excluded as excessively variable. These
conservation metrics, together with gene-carriage data from Panaroo, defined the gene-carriage
and sequence-variability components of the Quality-by-Design (QbD) framework
(Pre-M1 and M1 matrices) (Supplementary Data 14).
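As an illustration of the entropy measure described above, the sketch below computes a per-column normalized Shannon entropy on the scale used in the text (0 = fully conserved, 1 = maximally variable) and the median across an alignment. It is written in Python rather than R and is not the thesis script: ignoring gap characters and normalizing by a 20-letter amino acid alphabet are assumptions, and the normalization conventions of Bio3D may differ.

```python
import math
import statistics
from collections import Counter


def normalized_entropy(column: str, alphabet_size: int = 20) -> float:
    """Shannon entropy of one alignment column, scaled to the range 0..1.

    Assumptions: gap characters ('-') are ignored and the entropy is divided
    by log2(alphabet_size); 0 means fully conserved, 1 maximally variable.
    """
    residues = [c for c in column.upper() if c != "-"]
    counts = Counter(residues)
    if len(counts) <= 1:
        return 0.0  # empty or fully conserved column
    n = len(residues)
    h = -sum((k / n) * math.log2(k / n) for k in counts.values())
    return h / math.log2(alphabet_size)


def median_alignment_entropy(aligned_seqs: list[str]) -> float:
    """Median normalized entropy over all columns of an equal-length alignment."""
    columns = ("".join(col) for col in zip(*aligned_seqs))
    return statistics.median(normalized_entropy(col) for col in columns)


if __name__ == "__main__":
    toy_alignment = ["ACD", "ACE", "ACD", "ACE"]  # only the last column varies
    h_median = median_alignment_entropy(toy_alignment)
    print(h_median, "highly conserved" if h_median <= 0.05 else "not highly conserved")
```

Using the median rather than the mean keeps a few highly variable columns from dominating the gene-level score, which matches the median-based thresholds quoted in the text.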
11. Protein Structure Prediction and Visualization
Three-dimensional structures of prioritized S. iniae antigens (GAPDH, enolase,
GroEL, Sortase A) were predicted using AlphaFold2 (Jumper et al., 2021) and
cross-validated against available homologous templates in the Protein Data Bank (PDB)
(Berman, 2000). Predicted PDB files were examined in UCSF ChimeraX 1.6 (Meng et al., 2023)
for fold integrity, residue geometry, and solvent accessibility.
To quantify site-specific structural conservation, we implemented a per-residue
identity analysis (Supplementary Code 12) that scanned each aligned FASTA previously
produced by MAFFT. For every alignment, the script compared all sequences to the
reference (first entry) and recorded, for each residue position, the number and percentage of
identical residues across all genomes. These per-site identity profiles were mapped onto
AlphaFold2 models in ChimeraX using custom color scripts to generate gradient-based
visualizations where blue indicated highly conserved residues and red denoted variable
positions. This approach provided a direct link between MSA entropy (H.norm) values
and 3D spatial conservation, highlighting structurally stable and immunologically
relevant regions.
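The per-residue identity scan described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis script: the alignment is passed as a list of equal-length strings with the reference first, and columns where the reference has a gap are skipped so that positions follow reference residue numbering.

```python
def per_site_identity(aligned_seqs: list[str]) -> list[tuple[int, str, int, float]]:
    """Compare every aligned sequence to the reference (first entry).

    Returns one tuple per reference residue:
    (reference position, reference residue, number identical, percent identical).
    Assumption: columns where the reference has a gap are skipped.
    """
    reference, others = aligned_seqs[0], aligned_seqs[1:]
    profile = []
    for col, ref_res in enumerate(reference):
        if ref_res == "-":
            continue
        n_same = sum(1 for seq in others if seq[col] == ref_res)
        pct = 100.0 * n_same / len(others) if others else 100.0
        profile.append((len(profile) + 1, ref_res, n_same, pct))
    return profile


if __name__ == "__main__":
    toy = ["MKT-A", "MKTQA", "MRT-A"]  # reference first
    for position, residue, n_same, pct in per_site_identity(toy):
        print(position, residue, n_same, f"{pct:.1f}%")
```

The percent-identity column is the kind of per-site value that can then be written out as a residue attribute for coloring a structure model.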
Epitope localization was performed using a custom Python script (Supplementary
Code 12), which scanned PDB chains for exact matches to epitope peptide sequences
identified by IEDB mapping. For each hit, the script reported the model ID, chain ID,
and PDB residue range corresponding to the matched peptide, enabling precise overlay
of B- and T-cell epitopes onto predicted antigen structures. Surface visualization and
electrostatic potential mapping were carried out in ChimeraX using "surface color" and
"cartoon" representations to assess accessibility and topological context.
12. Codon Adaptation Index (CAI) and Expression Compatibility
Codon usage frequencies for target organisms were obtained from the Kazusa
Codon Usage Database (Holcomb et al., 2019), available at https://www.kazusa.or.jp/codon/.
Species-specific codon frequency tables were downloaded (Supplementary Data 15),
providing frequencies per thousand codons for each organism's coding sequences. The
Codon Adaptation Index (CAI) quantifies the degree of preference for synonymous
codons in a given organism, where values range from 0 (least preferred) to 1 (most
preferred).
For each amino acid family, the relative adaptiveness of codon i was calculated as
w_i = f_i / f_max, where f_i represents the frequency per thousand of codon i for its amino acid
and f_max represents the frequency per thousand of the most frequently used codon for that
amino acid. For example, proline codons in Oreochromis niloticus were calculated as
follows: CCG had w_i = 7.39/16.53 = 0.447, CCA had w_i = 14.59/16.53 = 0.883, CCT
had w_i = 16.53/16.53 = 1.000 (optimal), and CCC had w_i = 14.91/16.53 = 0.902. The
value 16.53 represents the highest frequency among all proline codons, making CCT the
optimal reference codon.
For complete gene sequences, CAI was computed as the geometric mean of
relative adaptiveness values using the formula:
\[ \mathrm{CAI} = \left( \prod_{k=1}^{L} w_k \right)^{1/L} \]
where the product encompasses all L sense codons in the gene sequence. Stop codons
(UAA, UAG, UGA) and the single methionine codon (AUG) were excluded from CAI
calculations as they lack synonymous alternatives for optimization.
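The relative adaptiveness and geometric-mean steps can be checked with a short sketch. The proline frequencies are the Oreochromis niloticus values quoted above; the function names and the input layout are illustrative assumptions, not the thesis script, and codons are written in the DNA alphabet (T rather than U) to match the CCG/CCA/CCT/CCC notation.

```python
import math

# Proline codon frequencies per thousand for O. niloticus, as quoted in the text.
PROLINE_FREQ = {"CCG": 7.39, "CCA": 14.59, "CCT": 16.53, "CCC": 14.91}


def relative_adaptiveness(freq_by_family: dict[str, dict[str, float]]) -> dict[str, float]:
    """w_i = f_i / f_max within each synonymous codon family.

    Families with a single codon (for example Met) are skipped, because they
    have no synonymous alternatives and are excluded from CAI.
    """
    weights = {}
    for codons in freq_by_family.values():
        if len(codons) < 2:
            continue
        f_max = max(codons.values())
        for codon, freq in codons.items():
            weights[codon] = freq / f_max
    return weights


def cai(cds: str, weights: dict[str, float]) -> float:
    """Geometric mean of w over the codons of `cds` that have a weight.

    Codons without a weight (stop codons, single-codon families) are ignored.
    A codon with zero frequency would need special handling (log of zero).
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


if __name__ == "__main__":
    w = relative_adaptiveness({"P": PROLINE_FREQ})
    for codon in ("CCG", "CCA", "CCT", "CCC"):
        print(codon, round(w[codon], 3))
    print("CAI of CCGCCT:", cai("CCGCCT", w))
```

Averaging in log space is equivalent to the product formula but avoids numerical underflow for long genes.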
13. Quality by Design (QbD) Framework
To implement QbD principles effectively within the reverse vaccinology
framework, specific criteria were established to guide systematic antigen selection. These
criteria served as the foundation for developing a quantitative scoring matrix computed
in a custom script (Supplementary Code 12) that ensures both immunological efficacy
and manufacturing feasibility.
13.1 QbD Design Space: Gene Carriage (Matrix Pre-M1)
The gene carriage design space defines the level of genomic conservation
of an antigen across the S. iniae pangenome. This Critical Quality Attribute (CQA)
prioritizes stable and widely distributed antigens to prevent loss of vaccine efficacy in
heterologous strains. Based on the pan-genome presence/absence matrix, the core genome
(present in ≥99% of isolates) was retained whereas soft-core genome (95–98%), shell
(15–94%), and cloud (<15%) genes were excluded from further evaluation. This
filtering yielded 1,538 core and 110 soft-core genes (Supplementary Data 15), forming
the foundation for downstream antigen variability and QbD scoring due to their broad
distribution and genetic stability.

13.2 QbD Design Space: Amino Acid and Nucleotide Variability (Matrix 1)
The normalized Shannon entropy (H.norm) defined the antigen variability
design space, quantified from per-cluster MSAs. Each residue received two
complementary measures: (1) H.norm from sequence alignments (0 = fully conserved, 1
= maximally variable) and (2) percent identity (%ID) relative to the amino acid
reference sequence (SIKU01). These values were averaged per gene to yield gene-wise
conservation indices subsequently integrated into the QbD variability matrix (M1)
(Supplementary Data 15). Genes with median H.norm ≤0.05 and mean %ID ≥95% were
prioritized as structurally and functionally stable antigens. Residues with low entropy
and high surface exposure (as confirmed by ChimeraX mapping) defined the structural
conservation-driven QbD design space, used to constrain antigen selection and ensure
manufacturing consistency across S. iniae lineages.
13.3 QbD Design Space: General Physicochemical and Expression
Characteristics (Matrix 1)
The protein length design space favors candidates with 100 or more amino
acids to ensure sufficient epitope diversity while avoiding overly short sequences that
lack functional relevance. Proteins containing 50–99 amino acids are classified as too
short and excluded due to instability concerns, while those with fewer than 50 amino
acids face exclusion for insufficient immunogenic potential.
Molecular weight constraints target the 20–80 kDa range to balance
immune system interaction with proper folding capabilities, with the optimal 20–50 kDa
range receiving highest priority for E. coli expression systems.
The isoelectric point (pI) design space recognizes multiple optimal ranges
depending on purification strategy. Proteins with pI values between 4.0 and 7.5 receive
priority due to superior solubility characteristics and broad purification platform
compatibility. The pI range of 7–9 gives a near-neutral to net positive charge under
physiological conditions (pH 7.4), facilitating purification via histidine tag
chromatography or silicon dioxide-based filters as specified in the foundational
selection criteria. Additionally, proteins with pI values between 2 and 4 offer
specialized advantages for silica-based purification systems through
enhanced electrostatic interactions.
Hydrophobicity design space assessment relies on the GRAVY index,
which prioritizes proteins within the -0.5 to +0.5 range to minimize aggregation
tendencies and support aqueous solubility throughout the manufacturing process.
Protein stability design space relies on the instability index (II)
calculated from primary sequence data, with values of 40 or lower indicating favorable in
vitro stability characteristics.
Literature support provides empirical validation through PubMed
literature and PMID, with proteins having prior experimental mentions, particularly as
vaccine candidates or virulence factors, receiving graduated positive scoring based on
evidence strength.
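The numeric ranges in this subsection can be expressed as simple pass/fail checks, as in the sketch below. It is an illustration only: the thesis applies graduated scoring in a custom script (Supplementary Code 12) that is not reproduced here. The field names follow the descriptor columns listed earlier (length_AA, mw, pI, hydrophobicity, instability_index), while the class name, the function name, and the assumption that mw is stored in daltons are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Descriptors:
    length_AA: int            # sequence length in amino acids
    mw: float                 # molecular weight, assumed to be in daltons
    pI: float                 # theoretical isoelectric point
    hydrophobicity: float     # GRAVY index
    instability_index: float  # instability index (II)


def matrix1_flags(d: Descriptors) -> dict[str, bool]:
    """Boolean checks against the ranges stated in section 13.3."""
    mw_kda = d.mw / 1000.0
    return {
        "length_ok": d.length_AA >= 100,              # 100 or more amino acids
        "mw_ok": 20 <= mw_kda <= 80,                  # 20-80 kDa target range
        "mw_optimal": 20 <= mw_kda <= 50,             # 20-50 kDa optimal range
        "pI_priority": 4.0 <= d.pI <= 7.5,            # priority pI window
        "gravy_ok": -0.5 <= d.hydrophobicity <= 0.5,  # GRAVY within -0.5..+0.5
        "stable": d.instability_index <= 40,          # II of 40 or lower
    }


if __name__ == "__main__":
    # Hypothetical candidate values, for illustration only.
    candidate = Descriptors(length_AA=336, mw=35800.0, pI=5.3,
                            hydrophobicity=-0.2, instability_index=31.0)
    print(matrix1_flags(candidate))
```

Keeping each criterion as a separate flag makes it straightforward to replace any check with a graduated score later.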
13.4 QbD Design Space: Immunogenicity (Matrix 1)
Epitope prediction analysis identifies the presence of known sequences
that stimulate T and B lymphocytes, considering the 30% similarity between fish and
human immunoglobulins in the scoring framework. Proteins predicted to contain both
T- and B-cell epitopes (inferred from IEDB) receive maximum immunological relevance
scoring, while those lacking predicted epitopes face penalty assessment.
13.5 QbD Design Space: Host-specific Expression (Matrix 1 – CAI)
The expression efficiency design space uses the Codon Adaptation Index (CAI)
to assess translational compatibility with E. coli systems, following
the principle of optimal codon usage for maximizing protein expression.
Top-quartile CAI values receive positive weighting, while bottom-quartile scores indicate
potential expression difficulties (Supplementary Data 15).
13.6 QbD Design Space: Matrix 2 – Secondary Selection
In Matrix 2, candidate antigens from Matrix 1 were screened multiple times
for compatibility with specific downstream purification modalities by aligning their
intrinsic molecular descriptors with platform-specific Critical Process Parameters (CPPs).
Five distinct sets of design spaces were evaluated: silica affinity, cellulose affinity, ion
exchange chromatography (anion exchange (AEX) and cation exchange (CEX)), and
plasmid DNA (pDNA) expression. This pre-downstream design space cross-references
in silico Critical Quality Attributes (CQAs) with platform-specific CPPs to ensure
process–antigen compatibility before physical development, for maximum cost-efficiency
post-biomanufacturing (Supplementary Data 15).
Silica Affinity Purification: Silica-based purification technologies offer ubiquity, low cost, and adaptability across laboratory and industrial bioprocessing applications. Critical material attributes (CMAs) include silanol density, surface topology, and functionalization chemistry, while key CPPs encompass pH, ionic strength, and buffer composition (Supplementary Data 15). Affinity tags with high silica specificity, including Si-tag, SB7, Car9, R5, and the synthetic octapeptide (RH)4, composed of four repeating Arginine-Histidine units, interact through electrostatic and hydrogen-bonding mechanisms with silanol-rich matrices. Proteins presenting net positive charge at pH 7.4 demonstrate enhanced adsorption due to electrostatic attraction to partially deprotonated silanol groups (≡Si–OH → ≡Si–O⁻). QbD selection criteria required pI > 4 and a surface-exposed cationic patch at loading pH (offered by the fusion of a silica-binding peptide short tag). Elution is achieved by increasing pH from 7.4 to 8.5, expanding silanol deprotonation and reducing hydrogen-bond donor capacity. Short tag lengths (7–20 aa) were favored to avoid steric hindrance and folding interference (Supplementary Data 15).
Cellulose Affinity Purification: Cellulose-based purification uses cellulose-binding modules (CBMs) such as CBM3, CBM9, or CelD, which recognize repeating β-(1→4)-D-glucopyranose units through hydrogen bonding with hydroxyl groups and hydrophobic stacking with planar glucan rings. Key CPPs include buffer pH (6.0–8.5), salt concentration, cellulose type, and contact time. Since binding is mediated by the CBM domain rather than antigen charge properties, no pI requirement was applied. However, CBMs add substantial polypeptide segments (30–200 aa), necessitating length constraints. Antigens were preferentially kept ≤400 aa to ensure total fusion constructs remained ≤430–630 aa, limiting metabolic burden, misfolding risk, and aggregation while preserving tag accessibility (Supplementary Data 15).
Ion Exchange Chromatography (IEX): Ion exchange utilizes the net charge of a protein to separate it from other proteins. Based on the type of resins and the charge of the protein, anion exchange chromatography or cation exchange chromatography techniques may be used. Anion exchange chromatography was modeled on quaternary amine functional groups (e.g., –N+(CH3)3) binding negatively charged proteins through electrostatic interaction with deprotonated carboxyl groups. The QbD filter retained proteins with pI ≤ 6.5 and net charge ≤ −1 at pH 7.4 (Supplementary Data 15). Cation exchange chromatography employed sulfonate ligands (–SO3−) that bind protonated amines from Lys, Arg, and His side chains. Selection criteria favored proteins with pI ≥ 8.0 and net charge ≥ +1 at pH 7.4, ensuring positive charge under working conditions for effective resin binding (Supplementary Data 15).
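The AEX and CEX gates above depend on two sequence-derived quantities, net charge at the working pH and pI. A minimal sketch of both follows, using the Henderson-Hasselbalch relation for each ionizable group. The pKa values are an assumed textbook-style set (published pKa tables differ, and so will the resulting pI), and the function names are hypothetical rather than the thesis's own implementation.

```python
# ASSUMED pKa values; other published sets give slightly different results.
PKA_POSITIVE = {"K": 10.5, "R": 12.5, "H": 6.0}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1}
PKA_N_TERMINUS = 8.0
PKA_C_TERMINUS = 3.1


def net_charge(sequence, ph):
    """Approximate net charge of an unfolded chain at a given pH."""
    positive = 1.0 / (1.0 + 10 ** (ph - PKA_N_TERMINUS))
    negative = 1.0 / (1.0 + 10 ** (PKA_C_TERMINUS - ph))
    for aa in sequence.upper():
        if aa in PKA_POSITIVE:
            positive += 1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            negative += 1.0 / (1.0 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return positive - negative


def isoelectric_point(sequence):
    """pH of zero net charge, found by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    for _ in range(60):
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: pI lies at higher pH
        else:
            high = mid
    return (low + high) / 2.0


def iex_route(sequence, ph=7.4):
    """Apply the gates stated above: AEX pI <= 6.5 and z <= -1; CEX pI >= 8.0 and z >= +1."""
    z = net_charge(sequence, ph)
    pi = isoelectric_point(sequence)
    if pi <= 6.5 and z <= -1:
        return "AEX"
    if pi >= 8.0 and z >= 1:
        return "CEX"
    return None
```

Proteins that satisfy neither gate return `None`, matching the idea that near-neutral antigens are poor candidates for either resin type.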
Plasmid DNA Expression: Plasmid DNA expression was treated as a distinct design space, optimizing antigens for high-yield, high-fidelity eukaryotic expression in DNA vaccine formats. Sequence-level CQAs included GC3 enrichment to improve mRNA stability and translation efficiency, ORF length threshold (set at the median coding sequence length of core genes ≈2.2 kb), and complete absence of internal Type IIS restriction sites (BsaI, BsmBI, Eco31I) to ensure seamless Golden Gate Assembly cloning compatibility (Supplementary Data 15–15).
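Two of these sequence-level CQAs are simple to compute directly from an ORF. The sketch below counts internal Type IIS recognition sites on both strands and computes the GC fraction at third codon positions (GC3). It uses the well-known recognition sequences GGTCTC (BsaI and its isoschizomer Eco31I) and CGTCTC (BsmBI); the function names are hypothetical, and the ORF is assumed to be in frame starting at its first base.

```python
# Recognition sequences for the enzymes named in the text.
TYPE_IIS_SITES = {"BsaI/Eco31I": "GGTCTC", "BsmBI": "CGTCTC"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def count_type_iis_sites(orf):
    """Count recognition sites on either strand of the ORF."""
    orf = orf.upper()
    total = 0
    for site in TYPE_IIS_SITES.values():
        # These sites are non-palindromic, so both orientations are scanned.
        for motif in (site, reverse_complement(site)):
            total += sum(orf.startswith(motif, i) for i in range(len(orf)))
    return total


def gc3_fraction(orf):
    """Fraction of G or C at the third position of each codon."""
    third_positions = orf.upper()[2::3]
    if not third_positions:
        raise ValueError("ORF is shorter than one codon")
    return sum(base in "GC" for base in third_positions) / len(third_positions)
```

A candidate is Golden Gate compatible in this sense when `count_type_iis_sites` returns 0; sites for additional enzymes can be added to the dictionary.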
• M1 (protein route). Host-specific CAI in E. coli, CAIec (+2 top quartile / −2 bottom quartile / 0 otherwise); hydrophobicity (GRAVY −0.5 to +0.5) +2; instability index (II) ≤ 40 +1 (Guruprasad et al.,1990).
• M1 (pDNA routes). Host-specific CAI in zebrafish (CAIdr) or tilapia (CAIon) scored identically (+2 top quartile / −2 bottom quartile / 0 otherwise).
• M2 (downstream manufacturability, platform specific). In the final stage, manufacturability descriptors were applied. For silica affinity, net charge (z) at pH 7 was used: z ≤ −15 led to exclusion, −15 < z ≤ −5 scored −1, −5 < z ≤ +5 scored +2, and z > +5 scored +3 (capped at +3). A soft preference of +1 was also added to proteins with isoelectric points (pI) between 7 and 9. Silica-binding peptides were constrained to 20 aa to minimize metabolic cost. For cellulose affinity, candidate antigens were required to be <400 aa to offset the 30–200 aa carbohydrate-binding module fusion; longer proteins were excluded. For ion-exchange chromatography, anion exchange (AEX) required pI ≤ 6.0; within this constraint, proteins with z at pH 7 were scored as 0 to −5 (+1), −5 to −15 (+2), and ≤ −15 (+3). Cation exchange (CEX) required pI ≥ 8.0; within this constraint, proteins with z at pH 7 were scored as 0 to +5 (+1), +5 to +15 (+2), and ≥ +15 (+3). For plasmid DNA (pDNA) manufacturability, GC3 fraction [0–1] in the top quartile scored +2, the bottom quartile scored −2, and the middle quartiles scored 0. Gene length (nt) shorter than the cohort median scored +1, while longer genes scored 0. Absence of internal Type IIS sites (count = 0) scored +2, while presence scored 0.
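The M2 scoring rules listed above can be written as small functions. This is a sketch under stated assumptions, not the thesis code: bin edges that the text leaves open (for example whether z = −5 falls in the +1 or +2 AEX bin) are resolved toward the higher score, the silica pI bonus is added on top of the capped charge term, and a return value of `None` stands for exclusion. All function names are hypothetical.

```python
def silica_score(z, pi):
    """Silica affinity: charge bins at pH 7 plus a soft pI 7-9 bonus."""
    if z <= -15:
        return None                      # excluded
    if z <= -5:
        score = -1
    elif z <= 5:
        score = 2
    else:
        score = 3                        # charge term capped at +3
    if 7 <= pi <= 9:
        score += 1                       # ASSUMPTION: bonus is not capped
    return score


def aex_score(z, pi, pi_max=6.0):
    """Anion exchange: requires pI <= 6.0, scores by negative charge."""
    if pi > pi_max:
        return None
    if z <= -15:
        return 3
    if z <= -5:
        return 2
    if z <= 0:
        return 1
    return 0


def cex_score(z, pi, pi_min=8.0):
    """Cation exchange: requires pI >= 8.0, scores by positive charge."""
    if pi < pi_min:
        return None
    if z >= 15:
        return 3
    if z >= 5:
        return 2
    if z >= 0:
        return 1
    return 0


def pdna_score(gc3, gc3_q1, gc3_q3, length_nt, median_nt, type_iis_count):
    """pDNA manufacturability: GC3 quartile, length, Type IIS absence."""
    score = 2 if gc3 >= gc3_q3 else (-2 if gc3 <= gc3_q1 else 0)
    score += 1 if length_nt < median_nt else 0
    score += 2 if type_iis_count == 0 else 0
    return score
```

Keeping the thresholds as parameters makes it easy to compare the pI ≤ 6.0 gate used in this list with the pI ≤ 6.5 gate stated in the design-space description.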
Figure 28 summarizes the Quality-by-Design (QbD) lifecycle applied to the S. iniae proteome, illustrating how successive filtering stages (M0–M2) progressively refine the antigen candidate space. The workflow integrates antigen design and manufacturability routes through iterative filtering of Critical Quality Attributes (CQAs) into Critical Process Parameters (CPPs) across multiple production systems. This staged CQA→CPP mapping and definition of design spaces ensure that retained candidates are both immunologically relevant and manufacturable at scale, accommodating multiple expression systems (plasmid DNA in vivo and recombinant protein vaccines in E. coli) as well as diverse downstream purification routes (silica and cellulose affinity, ion-exchange chromatography, and pDNA expression in zebrafish and Nile tilapia). Full thresholds and scoring rules are provided in Table 10. Complete proteome-level distributions of CQA attributes and their pairwise correlations (Spearman’s ρ) are shown in Appendix Figures 36–38.
Figure 28 Quality-by-Design (QbD) workflow integrating antigen design and manufacturability routes. (a) The conceptual QbD lifecycle links the Target Product Profile (TPP), in silico antigen scoring, and affinity-tag or plasmid purification strategy through iterative design feedback. (b) The M2 manufacturability stage defines platform-specific routes (silica and cellulose affinity, ion-exchange chromatography (AEX/CEX), and plasmid DNA optimization), each translating Critical Quality Attributes (CQAs) into measurable Critical Process Parameters (CPPs) for scalable vaccine production. (c) The workflow illustrates the progressive narrowing of the vaccine-antigen design space through successive QbD stages (M0 → Pre-M1 → M1 → M2).
3.3 Initial QbD Proteome Reduction (M0 → Pre-M1 → M1-general) Results
At the unfiltered M0 stage (Supplementary Data 15), the proteome (N = 1,855 proteins) spanned a wide biophysical space, with hydrophobicity and isoelectric point distributions showing clear bimodality (Appendix Figure 39a). Applying the Pre-M1 inclusion gate (core genome carriage) reduced the set to 1,374 proteins (Appendix Figure 39b; Supplementary Data 15). M1-general filters then retained 1,101 proteins by enforcing protein length 100–699 aa (else excluded), conservation (H.norm > 0.98, else excluded), molecular weight (MW) 20–60 kDa (+2 points), literature support (+1 point), and IEDB epitope presence (+3 points) (Appendix Figure 39c; Supplementary Data 15).
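The M1-general stage combines two hard gates with additive points, which can be sketched as below. The record fields and function name are hypothetical, `None` stands for exclusion, and treating the length and MW ranges as inclusive is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Protein:
    length_aa: int
    h_norm: float            # conservation measure reported as H.norm
    mw_kda: float
    has_literature: bool     # prior PubMed support
    has_iedb_epitope: bool   # predicted T/B-cell epitope from IEDB


def m1_general(p):
    """Return the M1-general score, or None if a hard gate fails."""
    if not (100 <= p.length_aa <= 699):
        return None          # length gate
    if not (p.h_norm > 0.98):
        return None          # conservation gate
    score = 0
    if 20 <= p.mw_kda <= 60:
        score += 2
    if p.has_literature:
        score += 1
    if p.has_iedb_epitope:
        score += 3
    return score
```

Under this sketch, a conserved 336-aa protein of about 36 kDa with literature support and an IEDB epitope reaches the maximum of 6 points.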
3.4 Platform-specific Gates (M1) Results
Filters were adapted to expression platforms in a host-dependent manner (Appendix Figure 40a). For the M1-protein route in E. coli (141 proteins), feasible antigens required CAIec ≥ Q3, moderate hydrophobicity (GRAVY −0.5 to +0.5), and stability (instability index (II) ≤ 40) (Supplementary Fig. S08a), detailed in (Supplementary Data 15). For the M1-pDNA routes, the only gate was host-specific CAI ≥ Q3, yielding tilapia (207 proteins) and zebrafish (212 proteins) (Appendix Figure 40b–c), detailed in (Supplementary Data 15-15). GC3 and nucleotide length constraints were not applied until M2. In both hosts, density contours showed clustering of survivors within narrow codon usage and sequence length ranges, reflecting platform-specific optimization for translation efficiency and manufacturability.
3.5 Downstream Manufacturability Design Spaces (M2) Results
To evaluate manufacturability, each downstream platform was subjected to stepwise Quality by Design (QbD) filtering (M2 criteria) (Figure 3.5). Mapping antigens into ion-exchange charge space (Figure 3.5) showed that most proteins were acidic at pH 7 and thus compatible with anion exchange (AEX), whereas only a minor fraction was sufficiently basic to survive cation exchange (CEX). This was reflected in the survivor pools (Fig. 3.5), with 98 proteins retained by AEX and only 20 by CEX. Buffer-specific ranges (Figs. 3.5–3.5) confirmed that AEX survivors were stable across nearly all chemistries (n = 82–98 proteins), while CEX buffers consistently yielded the same restricted set (n = 20 proteins). The full antigen lists for AEX and CEX are provided in (Supplementary Data 15-15).
Affinity purification routes imposed distinct orthogonal constraints: cellulose excluded proteins longer than 400 aa (Fig. 3.5), because the CBM fusion tag is itself large and places a metabolic burden on recombinant expression; minimizing the size of the fused antigen reduces energy demand and steric hindrance, thereby improving folding and yield. In contrast, silica affinity (Fig. 3.5) was governed by electrostatic interactions with surface silanol groups: proteins with pI between 7–9 acquire a net positive charge at physiological pH, enabling stable adsorption to negatively charged silica. These filters reduced the pools to 100 cellulose-compatible and 49 silica-compatible proteins, detailed in (Supplementary Data 15-15).
For plasmid DNA (pDNA) platforms, Critical Quality Attribute (CQA) filters were applied separately on a per-host basis. In O. niloticus (Figs. 3.5–3.5), genes and their products were first assessed for CAI ≥ Q3 (0.60), ensuring codon usage was well adapted to the host translation machinery; this retained 188 sequences. Applying GC3 ≥ Q3 (0.32) halved the pool to 95, as high G/C content at the third codon position improves mRNA stability and reduces metabolic stress. A further antigen length cutoff (≤ 2,200 nt) yielded 57 candidates, favoring shorter constructs with reduced transcriptional burden. Finally, genes containing internal Type IIS restriction sites (e.g., BsaI, SapI) were excluded because these enzymes cut outside their recognition sites, disrupting modular DNA assembly workflows by preventing one-step cloning in a Golden Gate Assembly. The final Nile tilapia pDNA pool remained at 57 survivors.
In D. rerio (Figs. 3.5–3.5), the thresholds differed slightly (CAI ≥ Q3 0.70, GC3 ≥ Q3 0.31). Here, 186 proteins passed the CAI filter, 82 remained after the GC3 filter, 66 after the length filter, and 65 after the Type IIS removal filter. Comprehensive candidate sets for both pDNA platforms (Nile tilapia and zebrafish) are available in (Supplementary Data 15-15).
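The per-host pDNA funnel described above is a sequence of gates applied in a fixed order, with a survivor count recorded after each one. A minimal sketch follows; the dictionary keys and function name are hypothetical, and the thresholds are passed in so the tilapia values (CAI 0.60, GC3 0.32) or the zebrafish values (CAI 0.70, GC3 0.31) can be supplied by the caller.

```python
def pdna_funnel(genes, cai_min, gc3_min, max_len_nt=2200):
    """Apply the CAI, GC3, length and Type IIS gates in order.

    Returns the surviving gene records and the pool size after each gate.
    """
    gates = [
        ("CAI", lambda g: g["cai"] >= cai_min),
        ("GC3", lambda g: g["gc3"] >= gc3_min),
        ("length", lambda g: g["len_nt"] <= max_len_nt),
        ("Type IIS", lambda g: g["type_iis_sites"] == 0),
    ]
    pool = list(genes)
    counts = {}
    for name, keep in gates:
        pool = [g for g in pool if keep(g)]
        counts[name] = len(pool)
    return pool, counts


# Toy records (invented values, one failing each gate in turn).
TOY_GENES = [
    {"id": "a", "cai": 0.65, "gc3": 0.40, "len_nt": 900, "type_iis_sites": 0},
    {"id": "b", "cai": 0.55, "gc3": 0.40, "len_nt": 900, "type_iis_sites": 0},
    {"id": "c", "cai": 0.70, "gc3": 0.20, "len_nt": 900, "type_iis_sites": 0},
    {"id": "d", "cai": 0.70, "gc3": 0.35, "len_nt": 3000, "type_iis_sites": 0},
    {"id": "e", "cai": 0.70, "gc3": 0.35, "len_nt": 1200, "type_iis_sites": 2},
]
```

Recording the count after every gate reproduces the kind of stepwise survivor series reported for each host (for example 186, 82, 66, 65 in zebrafish).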
Overall, the reverse vaccinology–QbD funnel outputs platform-ready shortlists: 98 AEX, 20 CEX, 100 cellulose, 49 silica, and 57–65 pDNA antigens for immediate downstream development.
Figure 28 Manufacturability design spaces across purification platforms. (a) Ion-exchange charge space for protein candidates: pI vs. net charge (z) at pH 7. Shaded bands mark the AEX and CEX inclusion regions; vaccine-referenced antigens are labeled. (b) Survivors after M2 filtering by platform; bars show unique genes per route with counts and percentages. Only candidates fitting the biophysiochemical criteria of their respective purification pathway are displayed in color bars. This step reflects manufacturability constraints and downstream process optimization in the QbD framework. (c–d) Buffer operating pH ranges for AEX (c) and CEX (d). Short tick marks indicate the working pH (range midpoint). (e–f) Protein-route survivors per buffer evaluated at the midpoint: AEX rule = base gate passed, net charge (z) at pH 7 ≤ 0, and pI ≤ pHmid; CEX rule = base gate passed, net charge (z) at pH 7 ≥ 0, and pI ≥ pHmid. (g) Cellulose: antigen length distribution of M2 survivors; dashed line at 400 aa. (j) Silica-binding peptides favor proteins with moderate pI (7–9) and positive net charge at pH > 7. (h, k) pDNA (M2) scatter plots by host: tilapia (h) and zebrafish (k), showing CAI vs. GC3; dashed lines mark dataset thresholds. (i, l) pDNA-route survivors passing each manufacturability gate for tilapia (i) and zebrafish (l): CAI ≥ threshold, GC3 ≥ threshold, length ≤ median, and no Type IIS sites (count).
4. Cross-Validation with Literature
We validated our shortlisted antigens against previously tested S. iniae vaccine targets reported in multiple hosts, including Channel catfish (Ictalurus punctatus) (Wang et al.,2016b), Nile tilapia (Oreochromis niloticus) (Kayansamruaj et al.,2017), Olive flounder (Paralichthys olivaceus) (Sheng et al.,2018a,2023), Zebrafish (Danio rerio) (Membrebe et al.,2016), Mouse (Mus musculus) (Wang et al.,2015a), and Turbot fish (Scophthalmus maximus) (Zhang et al.,2014a) (Table 11). Among the top-ranked candidates consistently found across purification routes, antigens with demonstrated in vivo protection (enolase, GAPDH, and GroEL) were recovered, supporting the predictive accuracy of the QbD framework.
To further assess their suitability, we modeled GAPDH and enolase, two representative antigens, both well-established in the literature. Structural predictions generated using AlphaFold2 (Yang et al.,2023) for S. iniae SIKU01 confirmed highly conserved folds and surface-exposed loops (Wang et al.,2017). Several epitope-containing regions identified through reverse vaccinology were surface-accessible (Figures 29a–c, highlighted in yellow), consistent with prior reports (Gent et al.,2024). These epitopes (Table 9) are suitable for direct incorporation into subunit vaccines or for the design of chimeric multi-epitope constructs (Pumchan et al.,2020), further validating their suitability as vaccine targets. Both enolase (eno) and GAPDH (gap) displayed epitope-rich regions spatially separated from catalytic or hypervariable sites, reinforcing their accessibility and stability as broad-spectrum candidates.
Figure 29 Structural mapping of variability, epitopes, and active sites in two model Streptococcus iniae vaccine candidates. (a) Shannon entropy profiles show that two model and well-known antigens of the scientific literature, GAPDH (gap) and enolase (eno), are largely conserved with limited variable regions. (b) GAPDH (336 aa, UniProt Q7BB80) carries a predicted N-terminal epitope (MVVKVGINGFGRIGRLAFRRIQ) positioned near, but not overlapping with, the active site and adjacent to a region of high nucleotide-level (codon-derived) variability (1–89 % variation). (c) Enolase (435 aa, UniProt T1TFA0) contains a predicted epitope (RAAADYLEVPLYNYLG) located opposite the active site and spatially separated from five nucleotide-driven hypervariable surface regions (HR1–HR5), indicating stability and accessibility as a vaccine target. Variability values are based on codon-level polymorphisms within the core-genome alignment and projected onto corresponding amino acid positions in the 3-D models for visualization.
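For readers unfamiliar with the entropy profiles in panel (a), the per-position Shannon entropy of an alignment is H = −Σ p·log2(p) over the symbols observed in a column, so a fully conserved column scores 0. The sketch below works on aligned amino acid strings and ignores gaps; the figure itself is described as derived from codon-level polymorphisms, so this is an illustration of the measure rather than the exact procedure used, and the function names are hypothetical.

```python
import math
from collections import Counter


def column_entropy(column):
    """Shannon entropy (bits) of one alignment column, gaps ignored."""
    symbols = [s for s in column if s != "-"]
    if not symbols:
        return 0.0
    n = len(symbols)
    counts = Counter(symbols)
    return sum(-(c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(aligned_sequences):
    """Entropy at every column of equal-length aligned sequences."""
    return [column_entropy(col) for col in zip(*aligned_sequences)]
```

A column split evenly between two residues scores exactly 1 bit; dividing by the logarithm of the alphabet size gives a normalized value between 0 and 1.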
DATA AVAILABILITY AND NCBI SUBMISSIONS
Bighead Catfish
The final diploid genome assembly was screened for contamination using the NCBI Foreign Contamination Screen (FCS) (Astashyn et al.,2024) and submitted following the Vertebrate Genome Project (VGP) naming conventions (https://github.com/VGP/vgp-assembly) (Rhie et al.,2021). The project is registered under NCBI BioProject PRJNA1132508, with BioSample accession SAMN41769988 corresponding to a Thai (male, adult) bighead catfish (isolate: CMAM; TaxID: 35657). Raw sequencing reads are deposited in the NCBI Sequence Read Archive (SRA): Nanopore (20% error): SRR29723575; HiFi: SRR29723576; Hi-C (150PE): SRR29723577; and Illumina (150PE): SRR29723578. The final diploid assembly is available in GenBank under accession numbers JBLWMO000000000 (Haplotype 1) and JBLWMP000000000 (Haplotype 2). A complete dataset, including genome assemblies and supporting files, is permanently archived at Zenodo (10.5281/zenodo.14826875).
F1 Hybrid Catfish
The hybrid genome assembly (Clarias macrocephalus × C. gariepinus) was processed and submitted using the same quality control and standardization workflow as described above. GenBank accessions are JBLWFY000000000.1 and JBLWFZ000000000.1, corresponding to the C. macrocephalus and C. gariepinus sub-genomes, respectively. Associated records are hosted under NCBI BioProject PRJNA1153495 and BioSample SAMN43395848 (TaxID: 1334085). Raw sequencing reads are Nanopore (SRR30599638), HiFi (SRR30599641), Illumina (SRR30599640), and Hi-C (SRR30599639). Additional Illumina datasets correspond to female C. macrocephalus (SAMN42503781) and male C. gariepinus (SAMN43548335). The complete F1 hybrid dataset, including assemblies and metadata, is archived at Zenodo (10.5281/zenodo.15269601).
A. A. Brown, D. H. Buermann, A. A. Bundu, J. C. Burrows, N. P. Carter, N. Castillo, M. Chiara E. Catenazzi, S. Chang, R. Neil Cooley, N. R. Crake, O. O. Dada, K. D. Diakoumakos, B. Dominguez-Fernandez, D. J. Earnshaw, U. C. Egbujor, D. W. Elmore, S. S. Etchin, M. R. Ewan, M. Fedurco, L. J. Fraser, K. V. Fuentes Fajardo, W. Scott Furey, D. George, K. J. Gietzen, C. P. Goddard, G. S. Golda, P. A. Granieri, D. E. Green, D. L. Gustafson, N. F. Hansen, K. Harnish, C. D. Haudenschild, N. I. Heyer, M. M. Hims, J. T. Ho, A. M. Horgan, K. Hoschler, S. Hurwitz, D. V. Ivanov, M. Q. Johnson, T. James, T. A. Huw Jones, G.-D. Kang, T. H. Kerelska, A. D. Kersey, I. Khrebtukova, A. P. Kindwall, Z. Kingsbury, P. I. Kokko-Gonzales, A. Kumar, M. A. Laurent, C. T. Lawley, S. E. Lee, X. Lee, A. K. Liao, J. A. Loch, M. Lok, S. Luo, R. M. Mammen, J. W. Martin, P. G. McCauley, P. McNitt, P. Mehta, K. W. Moon, J. W. Mullens, T. Newington, Z. Ning, B. Ling Ng, S. M. Novo, M. J. O’Neill, M. A. Osborne, A. Osnowski, O. Ostadan, L. L. Paraschos, L. Pickering, A. C. Pike, A. C. Pike, D. Chris Pinkard, D. P. Pliskin, J. Podhasky, V. J. Quijano, C. Raczy, V. H. Rae, S. R. Rawlings, A. Chiva Rodriguez, P. M. Roe, J. Rogers, M. C. Rogert Bacigalupo, N. Romanov, A. Romieu, R. K. Roth, N. J. Rourke, S. T. Ruediger, E. Rusman, R. M. Sanches-Kuiper, M. R. Schenker, J. M. Seoane, R. J. Shaw, M. K. Shiver, S. W. Short, N. L. Sizto, J. P. Sluis, M. A. Smith, J. Ernest Sohna Sohna, E. J. Spence, K. Stevens, N. Sutton, L. Szajkowski, C. L. Tregidgo, G. Turcatti, S. vandeVondele, Y. Verhovsky, S. M. Virk, S. Wakelin, G. C. Walcott, J. Wang, G. J. Worsley, J. Yan, L. Yau, M. Zuerlein, J. Rogers, J. C. Mullikin, M. E. Hurles, N. J. McCooke, J. S. West, F. L. Oaks, P. L. Lundberg, D. Klenerman, R. Durbin and A. J. Smith. 2008. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 456 (7218): 53–59.
Berman, H. M. 2000. The Protein Data Bank. Nucleic Acids Research. 28 (1): 235–242.
Bird, J. E., J. Marles-Wright and A. Giachino. 2022a. A user’s guide to golden gate cloning methods and standards. ACS Synthetic Biology. 11 (12): 3551–3563.
Bird, J. E., J. Marles-Wright and A. Giachino. 2022b. A User’s Guide to Golden Gate Cloning Methods and Standards. ACS Synthetic Biology. 11 (11): 3551–3563.
Blum, M. and others. 2021. The InterPro protein families and domains database: 20 years on. Nucleic Acids Research. 49 (D1): D344–D354.
Breves, J. P., S. B. Serizier, V. Goffin, S. D. McCormick and R. O. Karlstrom. 2013. Prolactin regulates transcription of the ion uptake Na+/Cl− cotransporter (ncc) gene in zebrafish gill. Molecular and Cellular Endocrinology. 369 (1–2): 98–106.
Brown, M., P. M. González De la Rosa and B. Mark. A Telomere Identification Toolkit, 2023. URL https://doi.org/10.5281/zenodo.10091385.
Buchanan, J. T., J. A. Stannard, X. Lauth and others. 2005. Streptococcus iniae phosphoglucomutase is a virulence factor and a target for vaccine development. Infection and Immunity. 73 (10): 6935–6944.
Buchfink, B., K. Reuter and H. Drost. 2021. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nature Methods. 18 (4): 366–368.
Buchfink, B., C. Xie and D. H. Huson. 2014. Fast and sensitive protein alignment using DIAMOND. Nature Methods. 12 (1): 59–60.
Cao, Y., J. Liu, G. Liu, H. Du, T. Liu, G. Wang, Q. Wang, Y. Zhou and E. Wang. 2023. Exploring the Immunoprotective Potential of a Nanocarrier Immersion Vaccine Encoding Sip against Streptococcus Infection in Tilapia (Oreochromis niloticus). Vaccines. 11 (7): 1262.
Carrard, G., A. Koivula, H. Söderlund and P. Béguin. 2000. Cellulose-binding domains promote hydrolysis of different sites on crystalline cellulose. Proceedings of the National Academy of Sciences of the United States of America. 97 (19): 10342–10347.
Chaivichoo, P., S. Koonawootrittriron, S. Chatchaiphan, W. Srimai and U. Na-Nakorn. 2020. Genetic components of growth traits of the hybrid between ♂ North African catfish (Clarias gariepinus Burchell, 1822) and ♀ bighead catfish (C. macrocephalus Gunther, 1864). Aquaculture. 521: 735082.
Chaivichoo, P., S. Sukhavachana, R. Khumthong, P. Srisapoome, S. Chatchaiphan and U. Na-Nakorn. 2023. Genome–wide association study and genomic prediction of growth traits in bighead catfish (Clarias macrocephalus Günther, 1864). Aquaculture. 562: 738748.
Charif, D. and J. R. Lobry. SeqinR 1.0-2: A Contributed Package to the R Project for Statistical Computing Devoted to Biological Sequences Retrieval and Analysis. In Structural Approaches to Sequence Evolution, pp. 207–232. Springer Berlin Heidelberg, 2007. doi: 10.1007/978-3-540-35306-5_10. URL https://doi.org/10.1007/978-3-540-35306-5_10.
Chen, M. and others. 2012. PCR detection and PFGE genotype analyses of streptococcal clinical isolates from tilapia in China. Veterinary Microbiology. 159 (3-4): 526–530.
Chen, S., Y. Zhou, Y. Chen and J. Gu. 2018. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 34 (17): i884–i890.
Chen, W., M. Zou, Y. Li, S. Zhu, X. Li and J. Li. 2021a. Sequencing an F1 hybrid of Silurus asotus and S. meridionalis enabled the assembly of high-quality parental genomes. Scientific Reports. 11 (1).
Chen, Y., Y. Zhang, A. Y. Wang, M. Gao and Z. Chong. 2021b. Accurate long-read de novo assembly evaluation with Inspector. Genome Biology. 22 (1).
Cheng, H., G. T. Concepcion, X. Feng, H. Zhang and H. Li. 2021. Haplotype-resolved de novo assembly using phased assembly graphs with hifiasm. Nature Methods. 18 (2): 170–175.
Cheng, H., E. D. Jarvis, O. Fedrigo, K.-P. Koepfli, L. Urban, N. J. Gemmell and H. Li. 2022. Haplotype-resolved assembly of diploid genomes without parental data. Nature Biotechnology. 40 (9): 1332–1335.
Dale, J., P. Smeesters, H. Courtney, T. Penfound, C. Hohn, J. Smith and J. Baudry. 2017. Structure-based design of broadly protective group A streptococcal M protein-based vaccines. Vaccine. 35 (1): 19–26.
Danecek, P., J. K. Bonfield, J. Liddle, J. Marshall, V. Ohan, M. O. Pollard, A. Whitwham, T. Keane, S. A. McCarthy, R. M. Davies and H. Li. 2021. Twelve years of SAMtools and BCFtools. GigaScience. 10 (2).
Darling, A. E., B. Mau and N. T. Perna. 2010. progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement. PLoS ONE. 5 (6): e11147.
De Coster, W. and R. Rademakers. 2023. NanoPack2: population-scale evaluation of long-read sequencing data. Bioinformatics. 39 (5).
De Coster, W., S. D’Hert, D. T. Schultz, M. Cruts and C. Van Broeckhoven. 2018. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics. 34 (15): 2666–2669.
Deane, E. E. and N. Y. S. Woo. 2008. Modulation of fish growth hormone levels by salinity, temperature, pollutants and aquaculture related stress: a review. Reviews in Fish Biology and Fisheries. 19 (1): 97–120.
Defoirdt, T., P. Sorgeloos and P. Bossier. 2011. Alternatives to antibiotics for the control of bacterial disease in aquaculture. Current Opinion in Microbiology. 14 (3): 251–258.
Dobrut, A., E. Brzozowska, S. Gorska, M. Pyclik, A. Gamian, M. Bulanda, E. Majewska and M. Brzychczy-Wloch. 2018. Epitopes of Immunoreactive Proteins of Streptococcus Agalactiae: Enolase, Inosine 5’-Monophosphate Dehydrogenase and Molecular Chaperone GroEL. Frontiers in Cellular and Infection Microbiology. 8: 349.
Dudchenko, O., S. S. Batra, A. D. Omer, S. K. Nyquist, M. Hoeger, N. C. Durand, M. S. Shamim, I. Machol, E. S. Lander, A. P. Aiden and E. L. Aiden. 2017. De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 356 (6333): 92–95.
Duong, T.-Y. and K. T. Scribner. 2018. Regional variation in genetic diversity between wild and cultured populations of bighead catfish (Clarias macrocephalus) in the Mekong Delta. Fisheries Research. 207: 118–125.
Duong-Ly, K. C. and S. B. Gabelli. 2014. Using ion exchange chromatography to purify a recombinantly expressed protein. Methods in Enzymology. 541: 95–103.
Durand, N. C., M. S. Shamim, I. Machol, S. S. Rao, M. H. Huntley, E. S. Lander and E. L. Aiden. 2016. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C Experiments. Cell Systems. 3 (1): 95–98.
Eisenstein, M. 2017. An ace in the hole for DNA sequencing. Nature. 550 (7675): 285–288.
El Hilali, S. and R. R. Copley. 2023. macrosyntR: Drawing automatically ordered Oxford Grids from standard genomic files in R. Archive Ouverte HAL.
Eldar, A., A. Horovitcz and H. Bercovier. 1997. Development and efficacy of a vaccine against Streptococcus iniae infection in farmed rainbow trout. Veterinary Immunology and Immunopathology. 56 (1-2): 175–183.
Elfaitouri, A., B. Herrmann, A. Bolin-Wiener, Y. Wang, C. Gottfries, O. Zachrisson, R. Pipkorn, L. Ronnblom and J. Blomberg. 2013. Epitopes of microbial and human heat shock protein 60 and their recognition in myalgic encephalomyelitis. PLoS ONE. 8 (11): e81155.
Ellinghaus, D., S. Kurtz and U. Willhoeft. 2008. LTRharvest, an efficient and flexible software for de novo detection of LTR retrotransposons. BMC Bioinformatics. 9 (1).
Emms, D. M. and S. Kelly. 2019. OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biology. 20 (1).
Ewels, P., M. Magnusson, S. Lundin and M. Käller. 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 32 (19): 3047.
Eyngor, M. and others. 2008. Emergence of novel Streptococcus iniae exopolysaccharide-producing strains following vaccination with nonproducing strains. Applied and Environmental Microbiology. 74 (22): 6892–6897.
Facklam, R., J. Elliott, L. Shewmaker and A. Reingold. 2005. Identification and characterization of sporadic isolates of Streptococcus iniae isolated from humans. Journal of Clinical Microbiology. 43 (2): 933–937.
FAO. 2020. The State of World Fisheries and Aquaculture 2020. FAO.
Faust, G. G. and I. M. Hall. 2014. SAMBLASTER: fast duplicate marking and structural variant read extraction. Bioinformatics. 30 (17): 2503–2505.
Ferraris, C. J. 2007. Checklist of catfishes, recent and fossil (Osteichthyes: Siluriformes), and catalogue of siluriform primary types. Zootaxa. 1418 (1): 1–628.
Finn, R. D., J. Clements and S. R. Eddy. 2011. HMMER web server: interactive sequence similarity searching. Nucleic Acids Research. 39 (suppl): W29–W37.
Flynn, J. M., R. Hubley, C. Goubert, J. Rosen, A. G. Clark, C. Feschotte and A. F. Smit. 2020. RepeatModeler2 for automated genomic discovery of transposable element families. Proceedings of the National Academy of Sciences. 117 (17): 9451–9457.
Formenti, G., A. Rhie, B. P. Walenz, F. Thibaud-Nissen, K. Shafin, S. Koren, E. W. Myers, E. D. Jarvis and A. M. Phillippy. 2022. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nature Methods. 19 (6): 696–704.
Fourie, K. and H. Wilson. 2020. Understanding GroEL and DnaK Stress Response Proteins as Antigens for Bacterial Diseases. Vaccines. 8 (4): 773.
Francis, D. M. and R. Page. 2010. Strategies to optimize protein expression in E. coli. Current Protocols in Protein Science. (1): 5.24.1–5.24.29.
Freitas, A. I., L. Domingues and T. Q. Aguiar. 2022. Bare silica as an alternative matrix for affinity purification/immobilization of his-tagged proteins. Separation and Purification Technology. 286: 120448.
Gent, V., Y.-J. Lu, S. Lukhele and others. 2024. Surface protein distribution in Group B Streptococcus isolates from South Africa and identifying vaccine targets through in silico analysis. Scientific Reports. 14: 22665.
Giovanni, A., Y.-Z. Shi, P.-C. Wang, M.-A. Tsai and S.-C. Chen. 2025. Recombinant C5a Peptidase and Formalin-Killed Cell: A Synergistic Vaccine Against Streptococcus iniae in Four-Finger Threadfin Fish (Eleutheronema tetradactylum). Journal of Fish Diseases. 48: e14154.
Glazunova, O. O., D. Raoult and V. Roux. 2009. Partial sequence comparison of the rpoB, sodA, groEL and gyrB genes within the genus Streptococcus. International Journal of Systematic and Evolutionary Microbiology. 59 (9): 2317–2322.
Goel, M. and K. Schneeberger. 2022. plotsr: visualizing structural similarities and rearrangements between multiple genomes. Bioinformatics. 38 (10): 2922–2926.
Gong, H. and others. 2017. Complete Genome Sequence of Streptococcus iniae 89353, a Virulent Strain Isolated from Diseased Tilapia in Taiwan. Genome Announcements. 5 (8): e01524–16.
Goubert, C., R. J. Craig, A. F. Bilat, V. Peona, A. A. Vogan and A. V. Protasio. 2022. A beginner’s guide to manual curation of transposable elements. Mobile DNA. 13 (1).
Grant, B. J., L. Skjaerven and X. Q. Yao. 2021. The Bio3D packages for structural bioinformatics. Protein Science. 30 (1): 20–30.
Grant, B. J. and others. 2006. Bio3D: an R package for the comparative analysis of protein structures. Bioinformatics. 22 (21): 2695–2696.
Gu, X. H., D. L. Jiang, Y. Huang, B. J. Li, C. H. Chen, H. R. Lin and J. H. Xia. 2018. Identifying a Major QTL Associated with Salinity Tolerance in Nile Tilapia Using QTL-Seq. Marine Biotechnology. 20 (1): 98–107.
Guan, D., S. A. McCarthy, J. Wood, K. Howe, Y. Wang and R. Durbin. 2020. Identifying and removing haplotypic duplication in primary genome assemblies. Bioinformatics. 36 (9): 2896–2898.
Guarracino, A., S. Heumos, S. Nahnsen, P. Prins and E. Garrison. 2022. ODGI: understanding pangenome graphs. Bioinformatics. 38 (13): 3319–3326.
Guruprasad, K., B. V. Reddy and M. W. Pandit. 1990. Correlation between stability of a protein and its dipeptide composition: a novel approach for predicting in vivo stability of a protein from its primary sequence. Protein Engineering. 4 (2): 155–161.
Harrison, P. W., M. R. Amode, O. Austine-Orimoloye, A. G. Azov, M. Barba, I. Barnes, A. Becker, R. Bennett, A. Berry, J. Bhai, S. K. Bhurji, S. Boddu, P. R. Branco Lins, L. Brooks, S. B. Ramaraju, L. I. Campbell, M. C. Martinez, M. Charkhchi, K. Chougule, A. Cockburn, C. Davidson, N. H. De Silva, K. Dodiya, S. Donaldson, B. El Houdaigui, T. E. Naboulsi, R. Fatima, C. G. Giron, T. Genez, D. Grigoriadis, G. S. Ghattaoraya, J. G. Martinez, T. A. Gurbich, M. Hardy, Z. Hollis, T. Hourlier, T. Hunt, M. Kay, V. Kaykala, T. Le, D. Lemos, D. Lodha, D. Marques-Coelho, G. Maslen, G. A. Merino, L. P. Mirabueno, A. Mushtaq, S. N. Hossain, D. N. Ogeh, M. P. Sakthivel, A. Parker, M. Perry, I. Piližota, D. Poppleton, I. Prosovetskaia, S. Raj, J. G. Pérez-Silva, A. I. A. Salam, S. Saraf, N. Saraiva-Agostinho, D. Sheppard, S. Sinha, B. Sipos, V. Sitnik, W. Stark, E. Steed, M.-M. Suner, L. Surapaneni, K. Sutinen, F. F. Tricomi, D. Urbina-Gómez, A. Veidenberg, T. A. Walsh, D. Ware, E. Wass, N. L. Willhoft, J. Allen, J. Alvarez-Jarreta, M. Chakiachvili, B. Flint, S. Giorgetti, L. Haggerty, G. R. Ilsley, J. Keatley, J. E. Loveland, B. Moore, J. M. Mudge, G. Naamati, J. Tate, S. J. Trevanion, A. Winterbottom, A. Frankish, S. E. Hunt, F. Cunningham, S. Dyer, R. D. Finn, F. J. Martin and A. D. Yates. 2023. Ensembl 2024. Nucleic Acids Research. 52 (D1): D891–D899.
Heckman, T. I., K. Shahin, E. E. Henderson, M. J. Griffin and E. Soto. 2022. Development and
efficacy of Streptococcus iniae live-attenuated vaccines in Nile tilapia (Oreochromis
niloticus). Fish & Shellfish Immunology. 121: 152–162.
Hoelzer, K., L. Bielke, D. P. Blake, E. Cox, S. M. Cutting, B. Devriendt, E. Erlacher-Vindel,
E. Goossens, K. Karaca, S. Lemiere, M. Metzner, M. Raicek, M. C. Suriñach, N. M.
Wong, C. Gay and F. V. Immerseel. 2018. Vaccines as alternatives to antibiotics for
food producing animals. Part 1: challenges and needs. Veterinary Research. 49 (1).
Holcomb, D. D., A. Alexaki, U. Katneni and C. Kimchi-Sarfaty. 2019. The Kazusa codon usage
database, CoCoPUTs, and the value of up-to-date codon usage statistics. Infection,
Genetics and Evolution. 73: 266–268.
Hon, T., K. Mars, G. Young, Y.-C. Tsai, J. W. Karalius, J. M. Landolin, N. Maurer, D. Kudrna,
M. A. Hardigan, C. C. Steiner, S. J. Knapp, D. Ware, B. Shapiro, P. Peluso and D. R.
Rank. 2020. Highly accurate long-read HiFi sequencing data for five complex genomes.
Scientific Data. 7 (1).
Hu, J., Z. Wang, F. Liang, S.-L. Liu, K. Ye and D.-P. Wang. 2024. NextPolish2: A Repeat-aware
Polishing Tool for Genomes Assembled Using HiFi Long Reads. Genomics,
Proteomics and Bioinformatics. 22 (1).
Jain, C., S. Koren, A. Dilthey, A. M. Phillippy and S. Aluru. 2018a. A fast adaptive algorithm
for computing whole-genome homology maps. Bioinformatics. 34 (17): i748–i756.
Jain, C., A. Rhie, H. Zhang, C. Chu, B. P. Walenz, S. Koren and A. M. Phillippy. 2020. Weighted
minimizer sampling improves long read mapping. Bioinformatics. 36 (Supplement 1):
111–118.
Jain, C., A. Rhie, N. F. Hansen, S. Koren and A. M. Phillippy. 2022. Long-read mapping to
repetitive reference sequences using Winnowmap2. Nature Methods. 19 (6): 705–710.
Jain, M., S. Koren, K. H. Miga, J. Quick, A. C. Rand, T. A. Sasani, J. R. Tyson, A. D. Beggs,
A. T. Dilthey, I. T. Fiddes, S. Malla, H. Marriott, T. Nieto, J. O’Grady, H. E. Olsen, B. S.
Pedersen, A. Rhie, H. Richardson, A. R. Quinlan, T. P. Snutch, L. Tee, B. Paten, A. M.
Phillippy, J. T. Simpson, N. J. Loman and M. Loose. 2018b. Nanopore sequencing and
assembly of a human genome with ultra-long reads. Nature Biotechnology. 36 (4):
338–345.
Jiang, D. L., X. H. Gu, B. J. Li, Z. X. Zhu, H. Qin, Z. N. Meng, H. R. Lin and J. H. Xia. 2019.
Identifying a Long QTL Cluster Across chrLG18 Associated with Salt Tolerance in
Tilapia Using GWAS and QTL-seq. Marine Biotechnology. 21 (2): 250–261.
Jumper, J. and others. 2021. Highly accurate protein structure prediction with AlphaFold.
Nature. 596 (7873): 583–589.
Kanehisa, M. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids
Research. 28 (1): 27–30.
Kanehisa, M. and Y. Sato. 2020. KEGG Mapper for inferring cellular functions from protein
sequences. Protein Science. 29 (1): 28–35.
Katoh, K. and D. M. Standley. 2013. MAFFT Multiple Sequence Alignment Software Version
7: Improvements in Performance and Usability. Molecular Biology and Evolution.
30 (4): 772–780.
Katoh, K., K. Misawa, K. Kuma and T. Miyata. 2002. MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Research. 30 (14):
3059–3066.
Kayansamruaj, P., H. T. Dong, N. Pirarat, D. Nilubol and C. Rodkhum. 2017. Efficacy of
α-enolase-based DNA vaccine against pathogenic Streptococcus iniae in Nile tilapia
(Oreochromis niloticus). Aquaculture. 468: 102–106.
Kayansamruaj, P., N. Areechon and S. Unajak. 2020. Development of fish vaccine in Southeast
Asia: A challenge for the sustainability of SE Asia aquaculture. Fish and Shellfish
Immunology. 103: 73–87.
Kitano, J., S. Mori and C. L. Peichel. 2007. Sexual Dimorphism in the External Morphology of
the Threespine Stickleback (Gasterosteus aculeatus). Copeia. 2007 (2): 336–349.
Kolberg, J., A. Aase, S. Bergmann, T. Herstad, G. Rodal, R. Frank, M. Rohde and S. Hammerschmidt.
2006. Streptococcus pneumoniae enolase is important for plasminogen binding
despite low abundance of enolase protein on the bacterial cell surface. Microbiology
(Reading, England). 152 (Pt 5): 1307–1317.
Kolmogorov, M., D. M. Bickhart, B. Behsaz, A. Gurevich, M. Rayko, S. B. Shin, K. Kuhn,
J. Yuan, E. Polevikov, T. P. L. Smith and P. A. Pevzner. 2020. metaFlye: scalable long-read
metagenome assembly using repeat graphs. Nature Methods. 17 (11): 1103–1110.
Krogh, A., B. Larsson, G. von Heijne and E. L. Sonnhammer. 2001. Predicting transmembrane
protein topology with a hidden Markov model: application to complete
genomes. Journal of Molecular Biology. 305 (3): 567–580.
Kusakabe, M., A. Ishikawa, M. Ravinet, K. Yoshida, T. Makino, A. Toyoda, A. Fujiyama and
J. Kitano. 2016. Genetic basis for variation in salinity tolerance between stickleback
ecotypes. Molecular Ecology. 26 (1): 304–319.
Kutzler, M. and D. Weiner. 2008. DNA vaccines: ready for prime time? Nature Reviews
Genetics. 9: 776–788.
Kyte, J. and R. F. Doolittle. 1982. A simple method for displaying the hydropathic character of
a protein. Journal of Molecular Biology. 157 (1): 105–132.
Langmead, B. and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nature
Methods. 9 (4): 357–359.
Le Bras, Y., N. Dechamp, F. Krieg, O. Filangi, R. Guyomard, M. Boussaha, H. Bovenhuis, T. G.
Pottinger, P. Prunet, P. Le Roy and E. Quillet. 2011. Detection of QTL with effects on
osmoregulation capacities in the rainbow trout (Oncorhynchus mykiss). BMC Genetics.
12 (1).
Letunic, I., S. Khedkar and P. Bork. 2021. SMART: recent updates, new developments and
status in 2020. Nucleic Acids Research. 49 (D1): D458–D460.
Lewin, H. A., J. A. M. Graves, O. A. Ryder, A. S. Graphodatsky and S. J. O’Brien. 2019.
Precision nomenclature for the new genomics. GigaScience. 8 (8).
Li, H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 34
(18): 3094–3100.
Li, H. and R. Durbin. 2009. Fast and accurate short read alignment with Burrows–Wheeler
transform. Bioinformatics. 25 (14): 1754–1760.
Li, H. and R. Durbin. 2010. Fast and accurate long-read alignment with Burrows–Wheeler
transform. Bioinformatics. 26 (5): 589–595.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and
R. Durbin. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics.
25 (16): 2078–2079.
Li, K., P. Xu, J. Wang, X. Yi and Y. Jiao. 2023. Identification of errors in draft genome assemblies
at single-nucleotide resolution for quality assessment and improvement. Nature
Communications. 14 (1).
Li, W., K. R. O’Neill, D. H. Haft and others. 2021. RefSeq: expanding the Prokaryotic Genome
Annotation Pipeline reach with protein family model curation. Nucleic Acids Research.
49 (D1): D1020–D1028.
Li, W. and A. Godzik. 2006. Cd-hit: a fast program for clustering and comparing large sets of
protein or nucleotide sequences. Bioinformatics. 22 (13): 1658–1659.
Lin, X., J. Tan, Y. Shen, B. Yang, Y. Zhang, Y. Liao, P. Wang, D. Zhou, G. Li and C. Tian.
2022. A high-density genetic linkage map and QTL mapping for sex in Clarias fuscus.
Aquaculture. 561: 738723.
Lin, Y., C. Ye, X. Li, Q. Chen, Y. Wu, F. Zhang, R. Pan, S. Zhang, S. Chen, X. Wang, S. Cao,
Y. Wang, Y. Yue, Y. Liu and J. Yue. 2023. quarTeT: a telomere-to-telomere toolkit
for gap-free genome assembly and centromeric repeat identification. Horticulture
Research. 10 (8).
Lisachov, A., D. H. M. Nguyen, T. Panthum, S. F. Ahmad, W. Singchat, J. Ponjarat, K. Jaisamut,
P. Srisapoome, P. Duengkae, S. Hatachote, K. Sriphairoj, N. Muangmai, S. Unajak,
K. Han, U. Na-Nakorn and K. Srikulnath. 2023. Emerging importance of bighead catfish
(Clarias macrocephalus) and north African catfish (C. gariepinus) as a bioresource and
their genomic perspective. Aquaculture. 573: 739585.
Lischer, H. E. L. and K. K. Shimizu. 2017. Reference-guided de novo assembly approach
improves genome reconstruction for related species. BMC Bioinformatics. 18 (1).
Liu, C., X. Hu, Z. Cao, Y. Sun, X. Chen and Z. Zhang. 2019. Construction and characterization
of a DNA vaccine encoding the SagH against Streptococcus iniae. Fish & Shellfish
Immunology. 89: 71–75.
Liu, J., Y. Cao, H. Ma, H. Du, T. Liu, G. Wang, M. Liu, Q. Wang, P. Li and E. Wang. 2023a.
Enolase-based nanovaccine immersion immunization induces robust immunity and protection
against Streptococcus infection in tilapia. Aquaculture. 576.
Liu, L. and others. 2009. Identification and experimental verification of protective antigens
against Streptococcus suis serotype 2 based on genome sequence analysis. Current
Microbiology. 58: 11–17.
Liu, M., Y. Song, S. Zhang, L. Yu, Z. Yuan, H. Yang, M. Zhang, Z. Zhou, I. Seim, S. Liu, G. Fan
and H. Yang. 2023b. A chromosome-level genome of electric catfish (Malapterurus
electricus) provided new insights into order Siluriformes evolution. Marine Life Science
and Technology. 6 (1): 1–14.
Liu, Y., L. Li, F. Yu and others. 2020. Genome-wide analysis revealed the virulence attenuation
mechanism of the fish-derived oral attenuated Streptococcus iniae vaccine strain
YM011. Fish & Shellfish Immunology. 106: 546–554.
Mahmoud, M., Y. Huang, K. Garimella, P. A. Audano, W. Wan, N. Prasad, R. E. Handsaker,
Simão, F. A., R. M. Waterhouse, P. Ioannidis, E. V. Kriventseva and E. M. Zdobnov. 2015.
BUSCO: assessing genome assembly and annotation completeness with single-copy
orthologs. Bioinformatics. 31 (19): 3210–3212.
Smolka, M., L. F. Paulin, C. M. Grochowski, D. W. Horner, M. Mahmoud, S. Behera, E. Kalef-Ezra,
M. Gandhi, K. Hong, D. Pehlivan, S. W. Scholz, C. M. B. Carvalho, C. Proukakis
and F. J. Sedlazeck. 2024. Detection of mosaic and population-level structural variants
with Sniffles2. Nature Biotechnology. 42 (10): 1571–1580.
Spriestersbach, A., J. Kubicek, F. Schäfer, H. Block and B. Maertens. 2015. Purification of
his-tagged proteins. Laboratory Methods in Enzymology: Protein Part D. pp. 1–15.
Strait, B. J. and T. G. Dewey. 1996. The Shannon information entropy of protein sequences.
Biophysical Journal. 71 (1): 148–155.
Sturm, M., C. Schroeder and P. Bauer. 2016. SeqPurge: highly-sensitive adapter trimming for
paired-end NGS data. BMC Bioinformatics. 17 (1).
Su, W., X. Gu and T. Peterson. 2019. TIR-Learner, a New Ensemble Method for TIR Transposable
Element Annotation, Provides Evidence for Abundant New Transposable Elements
in the Maize Genome. Molecular Plant. 12 (3): 447–460.
Sun, Y., Y. H. Hu, C. S. Liu and L. Sun. 2010. Construction and analysis of an experimental
Streptococcus iniae DNA vaccine. Vaccine. 28 (23): 3905–3912.
Sun, Y., Y. H. Hu, C. S. Liu and L. Sun. 2012. Streptococcus iniae DNA vaccine delivered by a
live attenuated Edwardsiella tarda via natural infection induces cross-genus protection.
Letters in Applied Microbiology. 55 (6): 420–426.
Sun, Y., L. Sun, M. Q. Xing, C. S. Liu and Y. H. Hu. 2013. SagE induces highly effective protective
immunity against Streptococcus iniae mainly through an immunogenic domain
in the extracellular region. Acta Veterinaria Scandinavica. 55 (1): 78.
Supikamolseni, A., N. Ngaoburanawit, M. Sumontha, L. Chanhome, S. Suntrarachun, S. Peyachoknagul
and K. Srikulnath. 2015. Molecular barcoding of venomous snakes and
species-specific multiplex PCR assay to identify snake groups for which antivenom is
available in Thailand. Genetics and Molecular Research. 14 (4): 13981–13997.
Sémon, M., D. Mouchiroud and L. Duret. 2005. Relationship between gene expression and GC-content
in mammals: statistical significance and biological relevance. Human Molecular
Genetics. 14 (3): 421–427.
Tanpichai, P. and others. 2023. Immune Activation Following Vaccination of Streptococcus
iniae Bacterin in Asian Seabass (Lates calcarifer, Bloch 1790). Vaccines. 11 (2): 351.
Tatusova, T., M. DiCuccio, A. Badretdin, V. Chetvernin, E. P. Nawrocki, L. Zaslavsky, A. Lomsadze,
K. D. Pruitt, M. Borodovsky and J. Ostell. 2016. NCBI prokaryotic genome
annotation pipeline. Nucleic Acids Research. 44 (14): 6614–6624.
Tettelin, H., D. Riley, C. Cattuto and D. Medini. 2008. Comparative genomics: the bacterial
pan-genome. Current Opinion in Microbiology. 11 (5): 472–477.
The Gene Ontology Consortium. 2021. The Gene Ontology resource: enriching a GOld mine.
Nucleic Acids Research. 49 (D1): D325–D334.
Thorvaldsdottir, H., J. T. Robinson and J. P. Mesirov. 2012. Integrative Genomics Viewer
(IGV): high-performance genomics data visualization and exploration. Briefings in
Bioinformatics. 14 (2): 178–192.
Tonkin-Hill, G., N. MacAlasdair, C. Ruis, A. Weimann, G. Horesh, J. A. Lees, R. A. Gladstone,
S. Lo, C. Beaudoin, R. A. Floto and others. 2020a. Producing polished prokaryotic
pangenomes with the Panaroo pipeline. Genome Biology. 21: 180.
Tonkin-Hill, G., N. MacAlasdair, C. Ruis, A. Weimann, G. Horesh, J. A. Lees, R. A. Gladstone,
S. Lo, C. Beaudoin, R. A. Floto, S. D. Frost, J. Corander, S. D. Bentley and J. Parkhill.
2020b. Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome
Biology. 21 (1).
Treepong, P., C. Guyeux, A. Meunier, C. Couchoud, D. Hocquet and B. Valot. 2018. panISa:
ab initio detection of insertion sequences in bacterial genomes from short read sequence
data. Bioinformatics. 34 (22): 3795–3800.
Vaser, R., I. Sović, N. Nagarajan and M. Šikić. 2017. Fast and accurate de novo genome assembly
from long uncorrected reads. Genome Research. 27 (5): 737–746.
Vinogradov, A. E. 2005. Dualism of gene GC content and CpG pattern in regard to expression in
the human genome: magnitude versus breadth. Trends in Genetics. 21 (12): 639–643.
Vita, R., N. Blazeska, D. Marrama, I. C. T. Members, S. Duesing, J. Bennett, J. Greenbaum,
M. Mendes, J. Mahita, D. Wheeler, J. Cantrell, J. Overton, D. Natale, A. Sette and
B. Peters. 2025. The Immune Epitope Database (IEDB): 2024 update. Nucleic Acids
Research. 53 (D1): D436–D443.
Vita, R. and others. 2019. The Immune Epitope Database (IEDB): 2019 update. Nucleic Acids
Research. 47 (D1): D339–D343.
Walker, B. J., T. Abeel, T. Shea, M. Priest, A. Abouelliel, S. Sakthikumar, C. A. Cuomo, Q. Zeng,
J. Wortman, S. K. Young and A. M. Earl. 2014. Pilon: An Integrated Tool for Comprehensive
Microbial Variant Detection and Genome Assembly Improvement. PLoS
ONE. 9 (11): e112963.
Wang, E., B. Long, K. Wang and others. 2016a. Interleukin-8 holds promise to serve as a
molecular adjuvant in DNA vaccination model against Streptococcus iniae infection in
fish. Oncotarget. 7 (51): 83938–83950.
Wang, E., J. Wang, B. Long and others. 2016b. Molecular cloning, expression and the adjuvant
effects of interleukin-8 of channel catfish (Ictalurus punctatus) against Streptococcus
iniae. Scientific Reports. 6: 29310.
Wang, J., L. L. Zou and A. X. Li. 2014. Construction of a Streptococcus iniae sortase A mutant
and evaluation of its potential as an attenuated modified live vaccine in Nile tilapia
(Oreochromis niloticus). Fish & Shellfish Immunology. 40 (2): 392–398.
Wang, J., K. Wang, D. Chen, Y. Geng, X. Huang, Y. He, L. Ji, T. Liu, E. Wang, Q. Yang
and W. Lai. 2015a. Cloning and characterization of surface-localized α-enolase of
Streptococcus iniae, an effective protective antigen in mice. International Journal of
Molecular Sciences. 16 (7): 14490–14510.
Wang, L., D. Xing, A. Le Van, A. E. Jerse and S. Wang. 2017. Structure-based design of ferritin
nanoparticle immunogens displaying antigenic loops of Neisseria gonorrhoeae. FEBS
Open Bio. 7 (2): 262–272.
Wang, L., Z. Y. Wan, B. Bai, S. Q. Huang, E. Chua, M. Lee, H. Y. Pang, Y. F. Wen, P. Liu, F. Liu,
F. Sun, G. Lin, B. Q. Ye and G. H. Yue. 2015b. Construction of a high-density linkage
map and fine mapping of QTL for growth in Asian seabass. Scientific Reports. 5 (1).
Wenger, A. M., P. Peluso, W. J. Rowell, P.-C. Chang, R. J. Hall, G. T. Concepcion, J. Ebler,
A. Fungtammasan, A. Kolesnikov, N. D. Olson, A. Töpfer, M. Alonge, M. Mahmoud,
Y. Qian, C.-S. Chin, A. M. Phillippy, M. C. Schatz, G. Myers, M. A. DePristo, J. Ruan,
T. Marschall, F. J. Sedlazeck, J. M. Zook, H. Li, S. Koren, A. Carroll, D. R. Rank and
M. W. Hunkapiller. 2019. Accurate circular consensus long-read sequencing improves
variant detection and assembly of a human genome. Nature Biotechnology. 37 (10):
1155–1162.
Wick, R. R., M. B. Schultz, J. Zobel and K. E. Holt. 2015. Bandage: interactive visualization
of de novo genome assemblies. Bioinformatics. 31 (20): 3350–3352.
Wick, R. R., L. M. Judd, C. L. Gorrie and K. E. Holt. 2017. Unicycler: Resolving bacterial
genome assemblies from short and long sequencing reads. PLOS Computational
Biology. 13 (6): e1005595.
Widmann, M., P. Trodler and J. Pleiss. 2010. The isoelectric region of proteins: A systematic
analysis. PLoS ONE. 5 (1): e10546.
Woestenenk, E. A., M. Hammarström, S. van den Berg, T. Härd and H. Berglund. 2004. His
tag effect on solubility of human proteins produced in Escherichia coli: a comparison
between four expression vectors. Journal of Structural and Functional Genomics. 5
(3): 217–229.
Wyneken, J., S. P. Epperly, L. B. Crowder, J. Vaughan and K. Blair Esper. 2007. Determining
Sex in Posthatchling Loggerhead Sea Turtles Using Multiple Gonadal and Accessory
Duct Characteristics. Herpetologica. 63 (1): 19–30.
Xiong, W., L. He, J. Lai, H. K. Dooner and C. Du. 2014. HelitronScanner uncovers a large
overlooked cache of Helitron transposons in many plant genomes. Proceedings of the
National Academy of Sciences. 111 (28): 10263–10268.
Xiong, X., Y. Peng, R. Chen, X. Liu and F. Jiang. 2023. Efficacy and transcriptome analysis of
golden pompano (Trachinotus ovatus) immunized with a formalin-inactivated vaccine
against Streptococcus iniae. Fish & Shellfish Immunology. 134: 108489.
Xu, M., L. Guo, S. Gu, O. Wang, R. Zhang, B. A. Peters, G. Fan, X. Liu, X. Xu, L. Deng and
Y. Zhang. 2020. TGS-GapCloser: A fast and accurate gap closer for large genomes
with low coverage of error-prone long reads. GigaScience. 9 (9).
Xu, Z. and H. Wang. 2007. LTR_FINDER: an efficient tool for the prediction of full-length
LTR retrotransposons. Nucleic Acids Research. 35 (Web Server): W265–W268.
Yan, J.-J., Y.-C. Lee, Y.-L. Tsou, Y.-C. Tseng and P.-P. Hwang. 2020. Insulin-like growth
factor 1 triggers salt secretion machinery in fish under acute salinity stress. Journal of
Endocrinology. 246 (3): 277–288.
Yang, Z., X. Zeng, Y. Zhao and others. 2023. AlphaFold2 and its applications in the fields of
biology and medicine. Signal Transduction and Targeted Therapy. 8: 115.
Yu, L. X. and others. 2014. Understanding pharmaceutical quality by design. AAPS Journal.
16 (4): 771–783.
Yu, X., P. Setyawan, J. W. Bastiaansen, L. Liu, I. Imron, M. A. Groenen, H. Komen and H.-J.
Megens. 2022. Genomic analysis of a Nile tilapia strain selected for salinity tolerance
shows signatures of selection and hybridization with blue tilapia (Oreochromis aureus).
Aquaculture. 560: 738527.
Yue, G. H. 2013. Recent advances of genome mapping and marker-assisted selection in aquaculture.
Fish and Fisheries. 15 (3): 376–396.
Zayas, J. F. 1997. Solubility of Proteins. Functionality of Proteins in Food. pp. 6–75.
Zeng, X., Z. Yi, X. Zhang, Y. Du, Y. Li, Z. Zhou, S. Chen, H. Zhao, S. Yang, Y. Wang and
G. Chen. 2024. Chromosome-level scaffolding of haplotype-resolved assemblies using
Hi-C data without reference genomes. Nature Plants. 10 (8): 1184–1200.
Zhang, B.-C., J. Zhang and L. Sun. 2014a. Streptococcus iniae SF1: Complete genome sequence,
proteomic profile, and immunoprotective antigens. PLoS ONE. 9 (3): e91324.
Zhang, B.-C., J. Zhang and L. Sun. 2014b. Streptococcus iniae SF1: Complete Genome Sequence,
Proteomic Profile, and Immunoprotective Antigens. PLoS ONE. 9 (3): e91324.
Zhang, R.-G., G.-Y. Li, X.-L. Wang, J. Dainat, Z.-X. Wang, S. Ou and Y. Ma. 2022. TEsorter:
An accurate and fast method to classify LTR-retrotransposons in plant genomes. Horticulture
Research. 9.
Zheng, Z., S. Li, J. Su, A. W.-S. Leung, T.-W. Lam and R. Luo. 2022. Symphonizing pileup
and full-alignment for deep learning-based long-read variant calling. Nature Computational
Science. 2 (12): 797–803.
Zhou, Z., Y. Dang, M. Zhou, L. Li, C. Yu, J. Fu, S. Chen and Y. Liu. 2016. Codon usage is an
important determinant of gene expression levels largely through its effects on transcription.
Proceedings of the National Academy of Sciences. 113 (41): E6117–E6125.
Appendix

Figure 30 GenomeScope2.0 profiles of (a) male C. gariepinus, (b) female C. macrocephalus,
and the F1 hybrid catfish at (c) k=21 and (d) k=31. The hybrid genome shows an intermediate
heterozygosity level (~1%) and genome size of 1.8 Gb, consistent with contributions from
both parental subgenomes. Parental species exhibit 0.056% (C. macrocephalus) and 1.56%
(C. gariepinus) heterozygosity, respectively. BUSCO analyses confirmed assembly completeness
and the presence of two subgenomes. Low Illumina coverage (<20×) in parental datasets
resulted in broad peaks in panels (a–b).
Figure 31 Comparative synteny and structural variation between the North African catfish (C. gariepinus) reference genome (GCA_024256425.2) and
the hybrid catfish genome (ClaHyb_Gar, this study). (A) Genome-wide macrosynteny across 28 pseudochromosomes showing conserved collinearity
and localized inversions, translocations, and duplications. (B) Sequence variation lengths and percentages of genome size, highlighting highly divergent
regions. (C) Structural variation composition including duplications, translocations, and inversions. (D) Variant feature counts (SNPs, indels, CNVs,
tandem repeats) showing genome-wide heterogeneity.
Figure 32 Comparative synteny and structural variation between the Bighead catfish (C. macrocephalus) Haplotype 1 (GCA_048544425.1) and the
hybrid genome (ClaHyb_Mac, this study). (A) Genome-wide macrosynteny across 27 pseudochromosomes showing extensive collinearity with limited
rearrangements. (B) Sequence variation profiles showing divergence and insertion–deletion patterns. (C) Structural variation composition between parental
and hybrid genomes. (D) Variant feature counts across all categories, emphasizing SNP dominance and minor structural variants.
Figure 33 Circular genome synteny and quality control overview of the five sequenced Streptococcus
iniae strains (SIKU01–SIKU05). The Circos plot shows inter-strain genomic alignments,
GC content variation, GC skew, and contig connectivity metrics. Outer rings represent genome
coordinates, while inner links highlight conserved syntenic regions among isolates, confirming
overall structural stability across strains.