THESIS PROPOSAL
GENOMIC STUDIES OF PISCINE STREPTOCOCCI AND GENOME
ASSEMBLY OF CLARIIDAE CATFISHES (SILURIFORMES) FOR
AQUACULTURE ENHANCEMENT
QUENTIN LUDOVIC STEPHANE ANDRES
GRADUATE SCHOOL KASETSART UNIVERSITY
2025
THESIS PROPOSAL APPROVAL
GRADUATE SCHOOL KASETSART UNIVERSITY
DEGREE
Doctor of Philosophy (Fishery Science and
Technology)
MAJOR FIELD
Fishery Science and Technology
FACULTY
Fisheries
TITLE
Genomic studies of piscine streptococci and genome assembly
of Clariidae catfishes (Siluriformes) for aquaculture
enhancement
NAME
MR. QUENTIN LUDOVIC STEPHANE ANDRES
THIS THESIS HAS BEEN ACCEPTED BY
(Associate Professor Prapansak Srisapoome, Ph.D.)
THESIS ADVISOR
(Professor Kornsorn Srikulnath, Ph.D.)
THESIS CO-ADVISOR
(Mr. Worapong Singchat, Ph.D.)
THESIS CO-ADVISOR
(Assistant Professor Methee Kaewnern, D.Tech.Sc.)
GRADUATE
COMMITTEE
CHAIRMAN
(Associate Professor Weeraphart Khunrattanasiri, Dr.rer.nat.)
DEAN
THESIS PROPOSAL
Genomic studies of piscine streptococci and genome assembly of
Clariidae catfishes (Siluriformes) for aquaculture enhancement
QUENTIN LUDOVIC STEPHANE ANDRES
A Thesis Proposal Submitted in Partial Fulfillment of
the Requirements for the Degree of
Doctor of Philosophy (Fishery Science and Technology)
Graduate School, Kasetsart University
Academic Year 2025
Contents
LIST OF TABLES
LIST OF FIGURES
LIST OF ACRONYMS AND GLOSSARY
INTRODUCTION
OBJECTIVES AND RATIONALE
LITERATURE REVIEW
  Genomic Technologies for Aquaculture
  Clarias Catfish Aquaculture in Southeast Asia
  Case Study: Genome Assembly Applications in Tilapia
GENOME ASSEMBLY – BIGHEAD CATFISH
  Overview of Genome Assembly Strategy
  Haplotype Resolution, Hi-C Scaffolding, and Phasing
GENOME ASSEMBLY – F1 HYBRID CATFISH
  Overview of Genome Assembly Strategy
VALIDATION OF CATFISH ASSEMBLIES
  Benchmarking Methodology
  Results – Bighead Catfish
  Results – F1 Hybrid Catfish
GENOME ASSEMBLY, REVERSE VACCINOLOGY, AND QUALITY BY DESIGN – Streptococcus iniae
  Introduction
  Methods
  Results
DATA AVAILABILITY AND NCBI SUBMISSIONS
  Bighead Catfish
  F1 Hybrid Catfish
  Streptococcus iniae
  Associated Publications
DISCUSSION AND CONCLUSIONS
  Genomic Insights and Technical Achievements
  Streptococcus iniae Vaccine Development
  Overall Conclusions
  Recommendations
Appendix
Personal Information
List of Tables
Table
1 Bighead Catfish – Assembly Stats.
2 Bighead Catfish – Scaffold Metrics – Haplotype 1.
3 Bighead Catfish – Scaffold Metrics – Haplotype 2.
4 Bighead Catfish – TE Content.
5 F1 Hybrid Catfish – North African Subgenome – Scaffold Metrics.
6 F1 Hybrid Catfish – Bighead Subgenome – Scaffold Metrics.
7 F1 Hybrid Catfish – North African Subgenome – Structural Metrics.
8 F1 Hybrid Catfish – Bighead Subgenome – Structural Metrics.
9 S. iniae RV – Predicted antigenic epitopes from strain SIKU01.
10 S. iniae QbD – QTPP and CQAs definition workflow.
11 S. iniae – Compiled vaccine studies.
12 S. iniae – Supplementary Code for analysis.
13 S. iniae – Supplementary Data 1: Metadata and Proteome.
14 S. iniae – Supplementary Data 2: Pangenomics and MSAs.
15 S. iniae – Supplementary Data 3: QbD Manufacturability.
LIST OF FIGURES
Figure
1 Long-reads vs Short-reads
2 ONT sequencing
3 HiFi Sequencing
4 Illumina PE Sequencing
5 Hi-C Sequencing
6 Clarias catfish species
7 Bighead Catfish – Specimen Photograph.
8 Bighead Catfish – Genome Assembly Workflow.
9 Bighead Catfish – Genome Assembly Graph – Haplotype 1 and 2.
10 Bighead Catfish – Genome Assembly 1-Year Progress.
11 Bighead Catfish – mtDNA Alignment Against 208 Catfish Species.
12 F1 Hybrid Catfish – Genome Assembly Workflow.
13 F1 Hybrid Catfish – Assembly validation using IGV.
14 F1 Hybrid Catfish – Genome Assembly 1-Year Progress.
15 Bighead Catfish – DNA Sequencing.
16 Bighead Catfish – Hi-C Heatmap – Haplotype 1.
17 Bighead Catfish – Hi-C Heatmap – Haplotype 2.
18 Bighead Catfish – Hi-C Heatmaps – Haplotype 1 and 2.
19 Bighead Catfish – Assembly Completeness Assessment.
20 Bighead Catfish – TE Divergence Profile.
21 Four Species Macrosynteny – Catfishes Chromosomes.
22 F1 Hybrid Catfish – Sequencing and GenomeScope2.0 Survey.
23 F1 Hybrid Catfish – North African Subgenome – Hi-C Heatmap.
24 F1 Hybrid Catfish – Bighead Subgenome – Hi-C Heatmap.
25 F1 Hybrid Catfish – Genome Assembly Validation.
26 Streptococcus iniae – Phase contrast micrograph
27 S. iniae – Genome assembly and annotation.
28 S. iniae QbD – Workflow QTPP and manufacturability.
28 S. iniae QbD – Manufacturability design spaces.
29 S. iniae RV – 3D Structure of Enolase and GAPDH Immunogens.
30 Four Catfish Genome Survey – GenomeScope2.0 Profiles.
31 C. gariepinus – Synteny vs F1 hybrid genome.
32 C. macrocephalus – Synteny vs F1 hybrid genome.
33 S. iniae – Circos synteny & QC (SIKU01–05).
34 S. iniae – Synteny vs dolphin isolate.
35 S. iniae – Antigenic variation (17 epitopes).
36 S. iniae – Proteome landscape (M0).
37 S. iniae – Proteome annotation.
38 S. iniae – QbD CQAs correlation matrix.
39 S. iniae – QbD filtering workflow (M0–PreM1–M1-general).
40 S. iniae – Expression and pDNA platform filters.
41 S. iniae – Vaccine type and RPS distribution in literature.
42 S. iniae – RPS distribution by fish host species.
43 S. iniae – Comparison of pDNA and protein vaccine RPS.
GENOMIC STUDIES OF PISCINE STREPTOCOCCI AND
GENOME ASSEMBLY OF CLARIID CATFISHES
(SILURIFORMES) FOR AQUACULTURE ENHANCEMENT
INTRODUCTION
Aquaculture is the farming of aquatic organisms, both marine and freshwater, for human consumption and other uses, such as producing feed, pharmaceuticals, and related products. It represents the fastest-growing food production sector, providing over half of the seafood consumed globally, with Asia dominating production. In Thailand, inland aquaculture is critical to food security, where freshwater species such as tilapia (Oreochromis spp.), catfish (Clarias spp., Silurus spp., and Pangasius spp.), and Asian seabass (Lates calcarifer) contribute significantly to domestic consumption and seafood exports (FAO, 2020; Naylor et al., 2021).
In 2022, aquaculture surpassed capture fisheries as the primary source of aquatic animals, producing 94.4 million tonnes compared to 92.3 million tonnes from wild-capture fisheries, according to the Global Seafood Alliance. Globally, aquaculture production (including seaweeds) reached 130.9 million tonnes in 2022, valued at USD 313 billion, according to the Food and Agriculture Organization. Major gains in aquaculture feed efficiency and fish nutrition have lowered the fish-in–fish-out ratio for many fed species, although dependence on marine ingredients persists and reliance on terrestrial ingredients has increased. To ensure sustainable growth, genetic improvement through selective breeding has become a key strategy, where optimizing traits such as growth, feed efficiency, sex determination, robustness, and disease resistance requires identifying and deploying relevant genetic variants.
Modern aquaculture breeding programs do not rely solely on conserving "pure" strains or extreme inbreeding. Instead, they are typically structured around families or lines that serve as a defined starting point and are improved cumulatively over gener-
2. Hi-C for Genome Scaffolding
Hi-C captures the three-dimensional organization of chromosomes through proximity ligation, in which physically close DNA fragments become joined. Paired-end sequencing of these junctions reveals long-range linkage: contigs from the same chromosome show many Hi-C contacts, while those from different chromosomes show few (Putnam et al., 2016). This enables clustering, ordering, and orienting contigs into chromosome-scale scaffolds for de novo assembly (construction without a reference genome), rather than reference-guided assembly, which requires an existing genome.
Hi-C has become essential for achieving chromosome-level assemblies in aquaculture species, enabling precise localization of genes and QTLs for breeding traits. In polyploid fish genomes, Hi-C correctly separates duplicated loci and provides insights into 3D genome structure, chromatin conformation, and regulatory interactions (Dudchenko et al., 2017).
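The clustering idea described above (many contacts within a chromosome, few between chromosomes) can be illustrated with a toy sketch. This is not the algorithm used by any Hi-C scaffolder named in this proposal: the `contacts` dictionary, the contig names, and the `min_contacts` threshold are hypothetical, and real tools additionally normalize contact counts (for example by contig length) before ordering and orienting contigs.

```python
from collections import defaultdict


def cluster_contigs(contacts, min_contacts=10):
    """Group contigs whose pairwise Hi-C contact count reaches min_contacts.

    contacts: dict mapping (contig_a, contig_b) -> number of Hi-C read pairs
    linking the two contigs (hypothetical toy input).
    Returns a sorted list of sorted contig groups (putative chromosomes).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in contacts.items():
        root_a, root_b = find(a), find(b)
        # Only strongly linked contigs are merged into the same group.
        if n >= min_contacts and root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(list)
    for contig in parent:
        groups[find(contig)].append(contig)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical contact counts: ctg1-3 are densely linked, as are ctg4-5,
# while the single weak link between ctg1 and ctg4 is ignored.
toy_contacts = {
    ("ctg1", "ctg2"): 120,
    ("ctg2", "ctg3"): 95,
    ("ctg4", "ctg5"): 80,
    ("ctg1", "ctg4"): 3,
}
clusters = cluster_contigs(toy_contacts)
```

In this toy input the weak ctg1–ctg4 link falls below the threshold, so the two groups stay separate, mirroring how inter-chromosomal contacts are treated as background.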
Figure 5 Ultra-long-range sequencing for genome scaffolding
Credits: Ultra-long-range Sequencing for Genome Scaffolding
3. Haplogenome Assembly of Catfish Genomes
Haplogenome assembly reconstructs each parental haplotype independently instead of producing a collapsed consensus sequence. This captures the full spectrum of allelic and structural variation, which is valuable for outbred species and highly heterozygous organisms such as interspecific catfish hybrids. By resolving phase-specific sequences, haplogenome assemblies make it possible to study allele-specific expression patterns, which are critical to understanding gene function and optimizing selective breeding. The approach typically uses long-read sequencing (PacBio HiFi or ONT) with bioinformatic phasing to separate maternal and paternal haplotypes. A well-known example is the assembly of haplogenomes in interspecific Eucalyptus hybrids, analogous to the genome-wide structural differences observed between Silurus asotus and S. meridionalis that are linked to hybrid traits (Chen et al., 2021a). Similarly, in catfishes, haplogenome assemblies of Clarias macrocephalus, C. gariepinus, and their F1 hybrids enable precise characterization of chromosomal rearrangements, gene duplications, and introgression events, laying the foundation for genomic selection and improved aquaculture productivity (Duong and Scribner, 2018).
4. In Summary
These sequencing platforms capture genome structure and identify variants (SNPs¹, indels², and structural variants (SVs)³) associated with economically important traits (Chaivichoo et al., 2020). Short reads excel at accurate SNP/indel detection, long reads resolve structural variants and complex regions, and Hi-C provides chromosome-level scaffolding. As of May 2025, NCBI hosts over 3.01 million genomes, including 51,820 eukaryotic genomes, of which fewer than 500 are haplotype-resolved, complete vertebrate genomes (https://www.ncbi.nlm.nih.gov/datasets/genome/).
¹ Single-base changes at specific genomic positions, used as genetic markers.
² Short insertions or deletions of 1–50 bp.
³ Large genomic alterations >50 bp including deletions, insertions, duplications, inversions, and translocations.
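As a small illustration of the size thresholds given in the footnotes above, the following sketch classifies a variant from its reference and alternate alleles. The function name and the "MNP/other" label are illustrative assumptions; the sketch only considers length differences, so balanced events such as inversions and translocations, which real SV callers detect from alignment signatures, are outside its scope.

```python
def classify_variant(ref, alt):
    """Classify a variant by allele length, using the thresholds in the
    footnotes: indel = 1-50 bp length change, SV = more than 50 bp."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    size = abs(len(ref) - len(alt))
    if size == 0:
        # Same-length multi-base substitution: not covered by the footnotes.
        return "MNP/other"
    if size <= 50:
        return "indel"
    return "SV"


# Hypothetical alleles written VCF-style (REF, ALT).
examples = [
    ("A", "G"),             # single-base change
    ("A", "ATTG"),          # 3 bp insertion
    ("A", "A" + "T" * 60),  # 60 bp insertion
]
labels = [classify_variant(ref, alt) for ref, alt in examples]
```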
Clarias Catfish Aquaculture in Southeast Asia
1. Evolutionary History and Distribution
Catfish (Siluriformes) are widely cultivated freshwater species crucial for food production across Africa, the Americas, and Asia (Lisachov et al., 2023). This diverse order diverged during the early Cretaceous (~145 Ma) and comprises over 36 families and 3,000 species (Ferraris, 2007). While distributed globally in freshwater environments with some marine extensions, catfish are most diverse in tropical South America, Asia, and Africa. Key representatives like Clarias and Ictalurus serve as both research models and economically valuable aquaculture species.
2. Classification and Biology
Bighead catfish (Clarias macrocephalus) and North African catfish (C. gariepinus) belong to family Clariidae, order Siluriformes. These air-breathing catfish are economically important in Southeast Asia and Africa, respectively. C. macrocephalus, native to Southeast Asia and widely farmed in Thailand, has a broad flattened head, short barbels, and reaches 40–50 cm. Its body is dark brown to black with pale blotches. C. gariepinus, introduced globally for its fast growth and environmental tolerance, grows to 1.7 m with an elongated body and long dorsal fins. It survives oxygen-depleted waters using its suprabranchial organ for atmospheric breathing.
Figure 6 Left: Female C. macrocephalus (Credits: Matillano, J.D.). Right: C. gariepinus (Credits: Larsen, J.H.)
F1 hybrids from C. gariepinus males × C. macrocephalus females combine parental advantages: body form from C. macrocephalus with rapid growth from C. gariepinus. Both species thrive at 26–32°C.
3. Role in Food Production and Economy
Clarias species are extensively farmed across Asia and Africa. C. macrocephalus is primarily used for hybrid production with C. gariepinus (Lisachov et al., 2023; Na-Nakorn et al., 2004), yielding sterile F1 hybrids with hybrid vigor (heterosis) that combine rapid growth and desirable body characteristics. However, hybrid sterility necessitates separate parental production systems. The lack of high-quality reference genomes limits implementation of modern breeding tools like ddRAD sequencing, low-coverage genome sequencing, and CRISPR/Cas9. Studies indicate low introgression risk in C. macrocephalus, supporting potential genetic improvement efforts.
Case Study: Genome Assembly Applications in Tilapia
Tilapia (Oreochromis spp.) exemplifies successful genomic applications in aquaculture (Yu et al., 2022). Researchers identified adaptive responses⁴ to salinity stress through genome-wide SNP analysis linked to osmoregulation and survival (Gu et al., 2018; Jiang et al., 2019; Rengmark et al., 2007). Key tolerance genes include prolactin (PRL) (Breves et al., 2013), growth hormone (GH1) (Deane and Woo, 2008), insulin-like growth factor 1 (IGF1) (Yan et al., 2020), and plasma-membrane Ca²⁺-ATPases (PMCAs) (Rengmark et al., 2007). Similar studies in rainbow trout (Le Bras et al., 2011), Atlantic salmon (Norman et al., 2012), and stickleback (Kusakabe et al., 2016) collectively established foundations for marker-assisted selection (MAS) in aquaculture. Non-synonymous mutations in salt regulation genes enhance tolerance and alter phenotypes. These achievements required high-quality reference genome assemblies.
⁴ Biological adjustments to environmental stress through gene regulation, metabolic changes, or immune modulation.
GENOME ASSEMBLY – BIGHEAD CATFISH
Overview of Genome Assembly Strategy
The haplotype-resolved genome of Clarias macrocephalus was assembled using a hybrid sequencing strategy comprising four DNA sequencing platforms that generate different types of DNA reads (in DNA sequencing, a read is an inferred sequence of base pairs, or base pair probabilities, corresponding to all or part of a single DNA fragment). The four technologies used to read the gDNA of bighead catfish into short and long reads were: (i) PacBio HiFi CCS (Wenger et al., 2019) (circular consensus sequencing⁵, long reads with 99.9% base accuracy), used to maximize genome assembly quality; (ii) Oxford Nanopore (ONT) noisy 1D long reads (error rate up to about 20%)⁶ (Jain et al., 2018b), used to increase assembly continuity and to span complex repeat regions of the genome; (iii) proximity-ligation (Hi-C) data (Dudchenko et al., 2017; Putnam et al., 2016), providing long-range linking information to phase contigs (i.e., to determine the specific arrangement of alleles on the same chromosome and so distinguish maternal and paternal haplotypes) and to scaffold them; and (iv) standard Illumina paired-end short reads (Bentley et al., 2008), used for error correction, as detailed in step 1 below. These complementary data types facilitated haplotype resolution and scaffold anchoring and made it possible to generate dual assemblies⁷ from a single diploid tissue. Among the tested assemblers (Hifiasm (Cheng et al., 2021), Flye (Kolmogorov et al., 2020), and wtdbg2 (Ruan and Li, 2019)), I selected Hifiasm for the haplotype-resolved assembly because of its ability to separate parental haplotypes (phasing) on low-coverage (< 13X) HiFi data. In contrast, I found that Flye and wtdbg2 produced collapsed consensus assemblies. Below, I outline the main steps of the complete genome assembly of bighead catfish Haplotype 1 and Haplotype 2.
⁵ Circular consensus sequencing (CCS) is a high-accuracy long-read sequencing method, commonly used with PacBio HiFi reads, where multiple passes of the same DNA molecule are combined to generate a consensus sequence.
⁶ Oxford Nanopore 1D reads are single-strand long reads that historically exhibit high error rates, often around 10–20%, due to insertions, deletions, and base-calling inaccuracies. These reads are valuable for spanning long genomic regions but require polishing for accuracy.
⁷ Dual assemblies refer to separate genome assemblies generated for each haplotype in a diploid organism, allowing resolution of heterozygous regions and structural differences between homologous chromosomes.
1. Genomic DNA Sequencing, Genome Survey, and k-mer Meryl Databases: Genomic DNA was extracted from the liver tissue of a single male bighead catfish individual using a custom protocol (Supikamolseni et al., 2015). Sex was determined based on histology and external gonadal examination (Kitano et al., 2007; Wyneken et al., 2007). Sequencing was performed using four complementary platforms generating four synergistic raw sequencing data types: (i) Pacific Biosciences (PacBio) High-Fidelity (HiFi) sequencing (Wenger et al., 2019) for generating highly accurate contigs, (ii) Oxford Nanopore Technology (ONT) long-read sequencing (Jain et al., 2018b) for scaffolding and resolving repetitive regions, and (iii and iv) Illumina short-read 150-base paired-end sequencing (Hi-C, and standard non-Hi-C) (Bentley et al., 2008) for chromosomal phasing and error correction, respectively. The genome survey assesses genome characteristics (genome size, heterozygosity, repeat content) from raw sequencing data (e.g., Illumina short reads) using k-mer analysis in Jellyfish (Marçais and Kingsford, 2011) and GenomeScope2.0 (Ranallo-Benavidez et al., 2020). Two hybrid k-mer Meryl databases (21-mer and 31-mer; e.g., bighead.hifi.illuminaPCRfree.gt1.db.meryl) were built from combined Illumina and HiFi reads using Meryl (Rhie et al., 2020), as explained in the T2T-polish GitHub repository (https://github.com/arangrhie/T2T-Polish) (Rhie et al., 2022). These k-mer databases use 21-mer and 31-mer DNA strings to support quality evaluation and error correction in genome assemblies. The shorter 21-mer spectra provide high sensitivity for detecting potential errors, while the longer 31-mer spectra offer higher specificity, validating true variants and minimizing false positives. During genome polishing, these k-mer databases enhance effective coverage in low-depth HiFi regions by leveraging k-mer spectra to guide targeted corrections and improve assembly accuracy.
2. Initial Assembly Generation (Contigging and Phasing): Contigging is the process of assembling overlapping DNA reads into continuous sequences without gaps (i.e., contigs). These contigs are then separated and assigned to a phase group using proximity-ligation data (Hi-C), based on their parental origin and spatial physical distance in the cell nucleus. A contig that comes from a single haplotype is therefore referred to as a haplotig⁸, and in a diploid phased assembly there are two groups of homologous haplotigs (i.e., unitigs, one per haploid "homotype" haplotype). Here, Hifiasm was used in "Hi-C UL" mode (Cheng et al., 2021) to generate a haplotype-resolved assembly using Hi-C + ONT + HiFi reads (Figure 8D, left), followed by GreenHill (Ouchi et al., 2023) using all data types (Hi-C + Illumina + ONT + HiFi reads) for further scaffolding and phasing of the haplotigs (Figure 8E). Finally, wtdbg2 (Hu et al., 2024; Ruan and Li, 2019) was employed with HiFi and Illumina short reads to correct small errors at the nucleotide level (i.e., SNPs) while preventing switch errors (i.e., allele switching between homologous haplotypes). This produced two phased contig sets representing homologous haplotypes (Figure 9).
3. Hi-C Scaffolding and Validation: Scaffolds are computationally ordered and oriented arrays of contigs that have sequence gaps along their length. Scaffolding arranges and orients contigs into larger structures, sometimes with gaps, using additional data such as proximity ligation (Hi-C), in order to reconstruct an in-silico equivalent of the karyotype (referred to as the "Scaffotype"). The Scaffotype consists of Hi-C scaffolds or "pseudochromosomes" (i.e., chromosome pseudomolecules), visualized as Hi-C maps. Prior to Hi-C scaffolding, contigs can be broken at erroneous sites; here, misassembly corrections were made using CRAQ (Li et al., 2023). Hi-C scaffolds (i.e., contigs and scaffolds) are then flipped, reoriented, and refined to improve structural accuracy. This is known as the "manual review" step⁹ and is performed using JBAT (Juicebox Assembly Tools) (Dudchenko et al., 2017). Next, a "post-review" step¹⁰ seals the contigs (i.e., it creates gaps between newly adjacent contigs and inserts 500 'N' characters to represent unknown sequence) to form Hi-C scaffolds and ultimately reconstruct pseudochromosomes, followed by 3D-DNA "post-review" validation (Dudchenko et al., 2017) (Figure 8F). Finally, more gaps were closed with TGS-GapClose (Xu et al., 2020) (Figure 8G). These steps (CRAQ break detection, Hi-C read alignment, manual review in JBAT, and post-review in the 3D-DNA pipeline) were performed iteratively three times. As illustrated in Figure 8F, this cycle can be repeated if needed after contig polishing (Figure 8H), and typically results in well-defined Hi-C scaffolded pseudochromosomes (Figure 8C).
⁸ In an unphased assembly, a contig may join alleles from different parental haplotypes in a diploid or polyploid genome (see Lewin et al., 2019, and https://lh3.github.io/2021/04/17/concepts-in-phased-assemblies). The process of separating sequences based on their parental haplotype in diploid genomes is referred to as "haplotype phasing".
⁹ In Hi-C scaffolding, manual review involves visually inspecting contact maps (e.g., using Juicebox or HiGlass) to detect misassemblies, orientation errors, or misplaced contigs that automated tools may miss.
¹⁰ Post-review refers to the stage after Hi-C contact map inspection, where validated corrections (e.g., flips, cuts, or joins) are applied to the genome assembly.
4. Assembly Polishing of Contigs and Benchmarking: Polishing a genome assembly consists in correcting sequencing errors using additional data (Meryl k-mer databases, long reads, and short reads). To further improve assembly accuracy, reads were aligned to the polished assembly. The three primary long-read mappers employed were Winnowmap (Jain et al., 2022), Minimap2 (Li, 2018), and VerityMap (Mikheenko et al., 2020) (previously known as TandemTools); these were used for assembly validation only (i.e., not for variant calling or genome polishing). Secondary alignments (i.e., reads flagged with bit 0x100¹¹), low-quality regions (i.e., MAPQ¹² = 0), and hard-clipped reads (i.e., alignments with hard-clipping operations indicated by 'H' in the CIGAR string¹³) were excluded from analysis using samtools (Li and Durbin, 2009). This read realignment step was crucial for validating assembly completeness and reducing errors. Non-haplotype-aware tools¹⁴ (e.g., Racon (Vaser et al., 2017), Clair3 (Zheng et al., 2022)) were applied cautiously to minimize errors from parental haplotype switches, while Merfin (Formenti et al., 2022) filtered edits for polishing and BCFtools consensus produced the final polished sequence. For large variants and gap closing based on read alignments, I used Sniffles2 (Smolka et al., 2024) for large structural variant (SV) calling, specifically for insertions (INS) and deletions (DEL). I evaluated genome quality in terms of errors found in the assembly that are not found in the raw reads. For that, I calculated the Quality Value (QV)¹⁵ of the genome assembly, where QV refers to the Phred quality score (or quality (Q) score), an integer representing the estimated probability of an error (i.e., that a base is incorrect). The polishing workflow implemented here for bighead catfish raised nucleotide accuracy from an initial QV33 to around QV46, and the result was benchmarked against the assembly metrics recommended in the VGP paper (Rhie et al., 2021).
¹¹ In SAM/BAM flags, bit 0x100 indicates a secondary alignment. Tools like Picard use this flag to mark reads that are not the primary alignment of a read with multiple mappings.
¹² MAPQ (Mapping Quality) is a Phred-scaled score that reflects the confidence in the read's alignment position. Higher values indicate more reliable mappings.
¹³ The CIGAR (Compact Idiosyncratic Gapped Alignment Report) string encodes how a read aligns to a reference, using operations such as matches (M), insertions (I), deletions (D), soft clips (S), and others. For example, 100M indicates 100 matching bases.
¹⁴ Non-haplotype-aware tools do not distinguish between maternal and paternal sequences in diploid or polyploid genomes, often collapsing allelic variants into a single consensus sequence and potentially masking structural differences between haplotypes.
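The scaffold-sealing operation described in step 3 above (orienting contigs and joining them with 500 'N' characters) can be sketched as follows. This is a minimal illustration, not the 3D-DNA or JBAT implementation; the contig names, sequences, and the `layout` structure are hypothetical.

```python
GAP = "N" * 500  # gap length stated in the text for joins between contigs

_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_scaffold(contigs, layout):
    """Join contigs into one scaffold.

    contigs: dict mapping contig name -> sequence.
    layout:  ordered list of (contig name, strand), strand being "+" or "-";
             a "-" contig is flipped (reverse-complemented) before joining.
    """
    parts = []
    for name, strand in layout:
        seq = contigs[name]
        parts.append(seq if strand == "+" else reverse_complement(seq))
    return GAP.join(parts)


# Hypothetical two-contig pseudochromosome: ctgB is flipped during review.
toy_contigs = {"ctgA": "ACGT", "ctgB": "AAAC"}
scaffold = build_scaffold(toy_contigs, [("ctgA", "+"), ("ctgB", "-")])
```

Each run of 500 N bases in the resulting sequence marks a join of unknown true size, which is why subsequent gap closing with long reads is applied to these positions.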
Figure 7 Representative specimens of the bighead catfish (Clarias macrocephalus), collected for high-quality haplotype-resolved genome assembly. The individuals were reared under controlled aquaculture conditions prior to tissue sampling. C. macrocephalus is an economically important freshwater species native to Southeast Asia, widely cultured in Thailand for selective breeding and genomic improvement programs.
Methodological details are presented in the following sections and in Figure 8.
¹⁵ Quality Value (QV) is a Phred-scaled score used to represent base-level or assembly-level accuracy. If $P$ is the probability of error, then $Q = -10\log_{10}(P)$, or equivalently, $P = 10^{-Q/10}$. Higher QV indicates greater accuracy.
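To make the Phred relationship in the footnote above concrete, the short sketch below converts between QV and per-base error probability. The function names are illustrative. Applying the formula to the values reported in this section, QV33 corresponds to roughly one error every 2,000 bases and QV46 to roughly one error every 40,000 bases.

```python
import math


def qv_to_error_prob(qv):
    """P = 10^(-Q/10): per-base error probability for a Phred-scaled QV."""
    return 10 ** (-qv / 10)


def error_prob_to_qv(p):
    """Q = -10 * log10(P): Phred-scaled QV for an error probability."""
    return -10 * math.log10(p)


# QV values reported in the text before and after polishing.
for qv in (33, 46):
    p = qv_to_error_prob(qv)
    print(f"QV{qv}: error probability {p:.2e}, "
          f"about one error every {round(1 / p):,} bases")
```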
Figure 8 Comprehensive haplotype-resolved genome assembly and scaffolding workflow of bighead catfish (Clarias macrocephalus). (A) Hi-C contact map (Juicebox 2.16.0). (B) Genome assembly using PacBio HiFi, ONT, and Hi-C reads with Hifiasm (Hi-C UL mode) and Flye for consensus. (C) Hi-C scaffolding via GreenHill and 3D-DNA. (D) Manual post-review with JBAT. (E) Gap filling using TGS-GapClose and quarTeT GapFiller. (F) Genome polishing with NextPolish2 and CRAQ. (G) Assembly QV validation with Merqury, Pilon, BCFtools, and VerityMap. (H) Mitochondrial genome assembly with MitoHiFi and Minimap2. The pipeline yields two high-quality phased haplotype assemblies representing the complete bighead catfish genome.
3. Running Pilon and calling the consensus: Pilon was employed on the assemblies to correct SNPs, INDELs (i.e., 1-2 base insertions and deletions), and other base-level errors. In each haplotype, every scaffold was processed by Pilon with parameters ('--genome, --frags, --bam, --targets, --fix all, --vcf, --diploid, --minmq 30, --minqual 30') to use the alignments from all data types (HiFi, ONT, and short reads). The output Variant Call Format (VCF) files containing the detected variants were sorted by position using ('bcftools sort'), compressed using bgzip, and indexed using ('bcftools index')²⁸, and final consensus sequences were generated for each scaffold using ('bcftools consensus -f $genome.fa -H 1').
The overall median quality value increased from 41 to approximately 45–47 after haplotype-aware targeted assembly polishing.
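Conceptually, the consensus step described above applies the edits recorded in a VCF to the scaffold sequence. The sketch below shows that idea on a toy sequence; it is an assumption-laden illustration rather than what bcftools does internally, since it ignores genotypes and haplotype selection, takes variants as hypothetical `(pos, ref, alt)` tuples, and requires them to be non-overlapping.

```python
def apply_variants(reference, variants):
    """Apply VCF-style edits to a reference sequence.

    variants: list of (pos, ref, alt) tuples with 1-based positions,
    assumed non-overlapping. Edits are applied from right to left so that
    earlier coordinates stay valid after insertions and deletions.
    """
    seq = reference
    for pos, ref, alt in sorted(variants, reverse=True):
        start = pos - 1
        if seq[start:start + len(ref)] != ref:
            raise ValueError(f"REF allele does not match at position {pos}")
        seq = seq[:start] + alt + seq[start + len(ref):]
    return seq


# Hypothetical edits: one SNP, one 1 bp deletion, one 2 bp insertion.
toy_reference = "ACGTACGT"
toy_variants = [(2, "C", "T"), (4, "TA", "T"), (7, "G", "GGG")]
polished = apply_variants(toy_reference, toy_variants)
```

Checking the REF allele before each edit is a cheap guard against applying a VCF to the wrong scaffold or to an assembly version that has already been modified.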
5. Hi-C Scaffolding with HapHiC
To reintegrate the unplaced scaffolds of each haplotype into the pseudochromosomes, I used quarTeT and HapHiC. First, I aligned the unplaced contigs to the reference genomes of C. fuscus (GCA_030347435.1) and C. gariepinus (GCA_024256425.2) using quarTeT AssemblyMapper (Lin et al., 2023) 1.2.1 with the following parameters ('-r $reference -q $contigs -c 50000 -l 2000 -i 90 -a minimap2'). I identified 53 Mb and 33 Mb of unplaced scaffold sequences in bighead catfish Haplotype 1 and Haplotype 2, respectively, with strong homology to C. fuscus pseudochromosomes. Next, I filtered bighead catfish Haplotype 1 and Haplotype 2 separately with SeqKit (Shen et al., 2016) 0.8.1 to retain pseudochromosomes and concatenated them with the unplaced scaffolds. To prepare the Hi-C scaffolding step, Hi-C reads were mapped to each haplotype separately using BWA-MEM ('-5SP') after making a BWT index²⁹ of Haplotype 1 and Haplotype 2 ('bwa index $genome.fasta'), and the Hi-C read alignments were filtered for PCR duplicates³⁰ and secondary alignments ('samblaster $BAM | samtools view - -@ $threads -S -h -b -F 3340') using Samblaster (Faust and Hall, 2014) 0.1.26. Hi-C scaffolding was performed on the bighead catfish haplotypes using HapHiC (Zeng et al., 2024) 1.0. The resulting Hi-C maps were visualized in JBAT and with ('haphic plot'), both for the separate scaffolds of each haplotype (Figure 16 and Figure 17) and with all scaffolds represented in a wide map (Figure 18), and a post-review of the Hi-C scaffolds was performed as described in the previous section. Finally, three rounds of TGS-GapClose followed by one round of targeted Pilon polishing on the new targets resulted in a genome of higher overall quality (i.e., both in terms of QV and of CRAQ's CRE/CSE structural accuracy metrics). I addressed the duplication errors³¹ visible in the heterozygous peak of the k-mer coverage (i.e., from the Merqury spectra CN), with more or less success, by using haplotype-specific k-mer databases. This involved applying the ('meryl difference') command to identify erroneous k-mers and the ('meryl-lookup') command to extract reads associated with these k-mers. Finally, I used BCFtools to call a consensus³² using the opposite haplotype ('-H 1') within the bcftools consensus command, effectively reversing most haplotype switch errors.
²⁸ bcftools index creates a compressed index (.csi or .tbi) for VCF or BCF files, allowing rapid random access to variants by genomic position during downstream processing or visualization.
²⁹ The BWT index refers to a data structure derived from the Burrows-Wheeler Transform (BWT), which allows fast and memory-efficient alignment of sequencing reads to a reference genome. It compresses the genome while retaining the ability to search for exact or approximate matches, enabling tools like Bowtie2 and BWA to align short reads with high performance.
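The k-mer set difference used above to flag erroneous k-mers can be illustrated with a toy analogue in plain Python. This is not Meryl's implementation: it works on small in-memory strings, uses canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement), and the sequences and the small `k` are hypothetical. It also illustrates why a single base error removes support from up to k consecutive k-mers.

```python
_COMP = str.maketrans("ACGT", "TGCA")


def canonical_kmers(seq, k):
    """Return the set of canonical k-mers of seq, skipping non-ACGT bases."""
    seq = seq.upper()
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            rc = kmer.translate(_COMP)[::-1]
            kmers.add(min(kmer, rc))
    return kmers


def assembly_only_kmers(assembly, reads, k=21):
    """K-mers present in the assembly but absent from every read.

    Such k-mers have no read support and are candidate assembly errors.
    """
    read_kmers = set()
    for read in reads:
        read_kmers |= canonical_kmers(read, k)
    return canonical_kmers(assembly, k) - read_kmers


# Hypothetical example with k=5: the assembly carries one wrong base
# (T -> A at the fifth position) compared with the read.
toy_reads = ["ACGTTGCAAG"]
unsupported = assembly_only_kmers("ACGTAGCAAG", toy_reads, k=5)
```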
6. SV and SNP Consensus Polishing
Consensus polishing³³ was automated by following the instructions in the T2T-polish GitHub repository (https://github.com/arangrhie/T2T-Polish). A file of repetitive k-mers (k=15) was generated with Meryl and used in Winnowmap for HiFi read alignment ('-MD -W .repetitive_k15.txt -ax map-pb'), followed by alignment filtering with samtools ('-Sb'). pb-falconc 1.15.0 (https://github.com/bio-nim/pb-falconc) was then used to filter out hard-clipped reads and reads carrying SAM flag 0x104³⁴, as named in Picard ('bam-filter-clipped -t -F 0x104 --output-count-fn'). Genome polishing was performed using the liftover branch of the Racon GitHub repository, using Racon with the ('-L -S') options (Vaser et al., 2017) 1.5.0. After polishing, the k-mers present in the genome (seqmers) were counted using Meryl count (k=21). Merfin was then applied using ('-readmers Illumina.HiFi.gt1.PCRfree.hybrid.meryl -seqmers') and the hybrid k-mer database of reads to evaluate the results by comparing the k-mer distributions in the reads and in the polished genome (Formenti et al., 2022; Rhie et al., 2022). For consensus generation (i.e., to apply polishing edits to the genome assembly), I used BCFtools ('-H 1') for two rounds of genome correction. Assembly quality was measured in terms of QV, completeness, and BUSCO scores. I found a large increase in QV in most chromosomes (minimum increase of +1 to +5 QV points) after ONT Racon and Merfin, with a median QV of 50. The progress over months is presented in Figure 10.
³⁰ PCR duplicates refer to read pairs that originate from the same DNA fragment and are amplified multiple times during library preparation, potentially leading to biased coverage if not removed.
³¹ Duplication errors refer to the erroneous presence of redundant sequences in genome assemblies, often caused by uncollapsed haplotypes, repetitive regions, or misassemblies. These can inflate genome size and complicate downstream analyses.
³² Calling a consensus in bcftools refers to generating a modified reference sequence by applying variant calls (e.g., from a .vcf file) to a reference genome, resulting in a consensus sequence that reflects the sample's specific genotype.
³³ Consensus polishing is the process of correcting errors in a draft genome assembly using aligned reads, typically by averaging observed base calls at each position in a read pileup. Most polishing tools are not haplotype-aware and must be run on both haplotypes together, meaning reads are distributed across haplotigs without distinguishing parental origin. This can lead to miscorrections in heterozygous regions.
³⁴ The SAM flag 0x104 is a combination of 0x004 (read unmapped) and 0x100 (not primary alignment). It indicates that the read is both unmapped and not the primary alignment, typically seen in secondary alignments of unaligned reads or ambiguous multi-mappings.
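The numeric SAM flag masks used in this chapter (0x100, 0x104, and 3340) are easier to audit once decoded into their component bits. The sketch below decodes a mask using the bit meanings defined in the SAM specification and mimics the exclusion rule of `samtools view -F`, which drops a read if any masked bit is set. The helper names and the short labels are illustrative; whether other tools apply `-F` with identical semantics is an assumption to check in their documentation.

```python
# Bit meanings as defined in the SAM format specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM flag or flag mask."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


def is_excluded(read_flag, exclude_mask):
    """Mimic `samtools view -F`: exclude a read if ANY masked bit is set."""
    return (read_flag & exclude_mask) != 0


def has_hard_clip(cigar):
    """True if the CIGAR string contains a hard-clipping (H) operation."""
    return "H" in cigar


mask_0x104 = decode_flag(0x104)
mask_3340 = decode_flag(3340)
```

Decoded this way, the mask 3340 used for the Hi-C alignments removes unmapped reads, reads with an unmapped mate, secondary alignments, duplicates, and supplementary alignments.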
Figure 10 Bighead Catfish Assembly Status January 2024 - November 2025. (A) Haplotype 1 (blue) and Haplotype 2 (purple) ideograms showing gaps on pseudochromosomes (orange rectangles) and telomeres (blue triangles) after manual review in Juicebox (JBAT). (B) Same assemblies a few months later.
28
7. Mitochondrial Genome Assembly
The mitochondrial genome (mtDNA) was assembled by mapping reads to the reference
bighead catfish mtDNA (NC_046749.1), available at NCBI Nucleotide35. Minimap2
('-ax map-ont --secondary=no') mapped nanopore reads to the reference and minimap2
('-ax sr') mapped Illumina reads. PCR duplicates in short reads were removed
from the alignments using samtools markdup and the results were visualized in the
Integrative Genome Viewer (IGV) (Thorvaldsdottir et al., 2012). Pilon ('--fix all
--diploid --changes --vcf --tracks --minmq 10') was used to correct the mtDNA
(NC_046749.1), call SNPs, SVs, gaps and local variants, and obtain the consensus
mtDNA sequence. Reads were re-aligned to the consensus mtDNA sequence and no
more SNPs were visible in IGV. Next, mapped reads were filtered with samtools
view ('-F4 -q 20') and re-assembled de novo with Unicycler (Wick et al., 2017) for
comparison. The results were visualized in Bandage-ng (Wick et al., 2015) 2022.09.
The assembly was polished with Pilon and the two homologous mitochondria were compared
using minimap2 ('--eqx -x asm5'). Gene annotations were generated using
MitoFinder (Allio et al., 2020) 1.4.2. To ensure that the mitochondrial genome was
correct and not dissimilar to other mtDNAs in Siluriformes, I downloaded all reference
mtDNA sequences of Siluriformes catfish species (N = 209) from NCBI Nucleotide,
last accessed September 2024. All sequences were taken from NCBI RefSeq36 and not
from NCBI GenBank37. All mtDNA nucleotide sequences (N = 210) were renamed
to the Pan-SN naming scheme (https://github.com/pangenome/PanSN-spec) and concatenated
in a single multi-FASTA that included the bighead catfish mtDNA generated here
after Pilon. All-versus-all alignments were performed with wfmash, then ODGI was
used to filter the alignment graph, and visualizations were made with Bandage and
multiQC (Ewels et al., 2016). All tools of the pipeline were part of the larger PGGB (Pan
Genome Graph Builder) (Guarracino et al., 2022). The results of the alignments are
shown in Figure 11: the mtDNA aligns well with other sequences in the phylogenetic
order and is not an outlier.
35NCBI Nucleotide is a public database providing access to sequences from GenBank and RefSeq.
36NCBI RefSeq is a curated, non-redundant database of genomic, transcript, and protein sequences.
37NCBI GenBank is a comprehensive archive of nucleotide sequences submitted by researchers.
Figure 11 All-versus-all 2D representations of mitochondrial DNA sequence alignments in
Siluriformes catfishes. The mtDNA assembled for bighead catfish is on the last row (blue color).
8. Transposable Element Annotation
Two complementary approaches were used to identify transposable elements
(TEs)38 in the bighead catfish genome. (1) A species-specific TE library was generated
de novo for the bighead catfish. (2) A curated fish TE library was retrieved from
the FishTEDB database (https://www.fishtedb.com/) and used as a reference for
cross-species TE annotation.
8.1 De novo transposable element library of bighead catfish
De novo TE annotation was performed using EDTA (Extensive de novo
TE Annotator) (Ou et al., 2019) 2.2.0. The pipeline integrates several tools to detect
different element types. LTRharvest (Ellinghaus et al., 2008) 1.5.10 identifies
LTR retrotransposons by efficiently locating structural features such as border positions,
LTR lengths, and motifs in large genomic datasets. LTR_FINDER (Xu and Wang, 2007)
and the parallel version of LTR_FINDER (Ou and Jiang, 2019) 1.1.0 accelerate LTR
detection using parallel computing for large genomes. LTR_retriever (Ou and Jiang,
2017) 2.9.0 refines LTR annotations to improve accuracy and remove false positives.
TEsorter (Zhang et al., 2022) 1.4.7 provides accurate and fast classification of LTR
retrotransposons, and GRF (Generic Repeat Finder) (Shi and Liang, 2019) 1.1 performs
genome-wide de novo repeat detection. TIR-Learner (Su et al., 2019) 1.1.2 uses machine
learning to detect terminal inverted repeat (TIR) transposons, including miniature inverted
TEs (MITEs). HelitronScanner (Xiong et al., 2014) 1.0 identifies Helitrons by recognizing
their characteristic structural motifs. RepeatModeler (Flynn et al., 2020) 2.0.3
performs de novo repeat discovery and builds a comprehensive repeat library. MAFFT
(Katoh and Standley, 2013) 7.526 is used for multiple sequence alignments, and HMMER
(Finn et al., 2011) 3.4 is used for domain-based searching and TE classification.
38Transposable elements (TEs), also known as "jumping genes," are mobile DNA sequences that can
move or replicate within the genome. They play key roles in genome evolution, structural variation, and
gene regulation.
8.2 FishTEDB general fish-specific TE library collection
Next, libraries of TE consensus sequences39 of transposons were retrieved
from the FishTEDB database (Shao et al., 2018) version 1, which contains different
species of fish and their associated transposable element libraries in .fasta file format,
and these sequences were merged with EDTA's de novo bighead catfish initial TE
annotations. To reduce the library complexity and to create a non-redundant TE library
(i.e., to lower the number of sequences and redundancy of the combined TE library
dataset), I used CD-HIT-EST ('-d 20 -aS 0.95 -c 0.95 -G 0 -g 1 -b 500') (Li
and Godzik, 2006) for an 80% identity threshold and 80% alignment coverage of TE
sequences in all-against-all sequence clustering, following the commonly used 80%-80%
rule (Goubert et al., 2022). The reduced TE library was used to mask (i.e., to annotate)
the bighead catfish genome for TEs with RepeatMasker (Smit, AFA, Hubley, R. &
Green, P. RepeatMasker at http://www.repeatmasker.org).
39TE (transposable element) consensus sequences are artificial, representative sequences constructed
by aligning and averaging multiple genomic copies of a TE family. They do not correspond to any specific
real insertion, but serve as idealized references for annotation and classification.
GENOME ASSEMBLY – F1 HYBRID CATFISH
Overview of Genome Assembly Strategy
Starting from an F1 hybrid catfish (i.e., a first filial generation catfish made from
the cross of a pure bighead catfish and a pure North African catfish), I reconstructed
two complete haploid, non-homotypic parental genomes, comprising 27 pseudochromosomes
for C. macrocephalus and 28 for C. gariepinus (Fig. 25). In most F1 hybrids
derived from divergent parental species, the observed heterozygosity rate (typically
around 10%) reflects the inter-specific genomic divergence between the two haploid
sub-genomes, encompassing both single nucleotide polymorphisms (SNPs) and structural
variants (SVs). The complete workflow for generating chromosome-scale assemblies
of the Thai F1 hybrid catfish is presented in Figure 12. The workflow consists of six
sequential steps (from Step 1A to Step 6) that integrate multiple sequencing technologies,
analysis tools, and pipelines in a specific order. The aim is optimal phasing
and Hi-C scaffolding, yielding genome assemblies with high contiguity, a low structural
error rate, high solve rates, and high base accuracy in highly heterozygous F1 genomes.
Below, I outline each step in detail.
• Step 1: Genomic DNA Sequencing, Genome Survey, and Meryl Databases
Genomic DNA was extracted from the liver tissue of a single F1 hybrid individual
using a custom protocol (Supikamolseni et al., 2015). Sex was determined
to be female based on histology and external gonadal examination (Kitano
et al., 2007; Wyneken et al., 2007). Sequencing was performed using four
complementary platforms generating four synergistic raw sequencing data types:
(i) Pacific Biosciences (PacBio) High-Fidelity (HiFi) sequencing (Wenger et al.,
2019) for generating highly accurate contigs, (ii) Oxford Nanopore Technology
(ONT) long-read sequencing (Jain et al., 2018b) for scaffolding and resolving
repetitive regions, and (iii and iv) Illumina short-read 150 base paired-end
sequencing (Hi-C, and standard non Hi-C) (Bentley et al., 2008) for chromosomal
phasing and error correction, respectively. The genome survey aims at assessing
genome characteristics (genome size, heterozygosity, repeat content) from
raw sequencing data (e.g., Illumina short reads) using k-mer analysis in Jellyfish
(Marçais and Kingsford, 2011) and GenomeScope2.0 (Ranallo-Benavidez
et al., 2020). Raw sequencing read quality was assessed using FastQC
(http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) 0.1.12 for short reads
and NanoPlot 1.42.0 for long reads. Meryl databases (N = 2) consisted of 21-mer
and 31-mer (hybrid.hifi.illuminaPCRfree.gt1.db.meryl) hybrid k-mer
databases, made from combined Illumina and HiFi reads using Meryl (Rhie
et al., 2020), as explained in the T2T-polish GitHub repository
(https://github.com/arangrhie/T2T-Polish) (Rhie et al., 2022). k-mer Meryl databases use
21-mer and 31-mer DNA strings to support quality evaluation and error correction in
genome assemblies. The shorter 21-mer spectra provide high sensitivity for detecting
potential errors, while the longer 31-mer spectra offer higher specificity,
validating true variants and minimizing false positives. During genome polishing,
these k-mer databases enhance effective coverage in low-depth HiFi regions
by leveraging k-mer spectra to guide targeted corrections and improve assembly
accuracy.
• Step 2: Contig Assembly and Their Haplotype Phasing
Contig assembly was performed using Hifiasm (Cheng et al., 2021), a haplotype-resolved
assembler integrating data from HiFi reads, Hi-C reads, and ONT reads. This resulted in
two haplotype-separated sub-genomes40 (hap1 and hap2). However, I observed
that the Hifiasm algorithm artificially de-duplicated the primary phased haplotype
assembly, producing two redundant copies, namely .hic.hap1.p_ctg.fa and
.hic.hap2.p_ctg.fa, which were nearly identical (combined genome size of
3.6 Gb versus expected 1.8 Gb, GenomeScope2.0). The best strategy was therefore
to cluster parental haplotigs (i.e., the haplotype-specific contigs also referred
to as "unitigs", contained in the haplotype-resolved polished unitig graph without
40In the context of an F1 hybrid, a subgenome refers to the haploid genome inherited from one parent
species. Since diploid hybrids carry one genome copy from each parent, they contain two sub-genomes,
each representing a distinct parental lineage.
1,602
• C. gariepinus chromosome 18: left telomeric repeat count increased from 268 to 1,411
• C. macrocephalus chromosome 7: left telomeric repeat count increased from 199 to 924
These refinements increased the number of pseudochromosomes with bilateral
telomeres from 43 to 48, while reducing those with unilateral telomeres from 8 to
7. Telomeric additions contributed approximately 400 kb to the total assembly length,
representing substantial progress toward telomere-to-telomere (T2T) completeness for
multiple chromosomes.
Subgenomic analysis revealed additional differences between parental haplotypes,
including heterozygosity rate variations and assembly complexity differences
attributable to limited nanopore coverage. While most genomic gaps have been resolved,
centromeric and satellite regions remain challenging, particularly where HiFi
read coverage is absent or where only MAPQ0 Illumina reads provide support, necessitating
reliance on lower-accuracy ONT data.
1.4 Quality Value Enhancement Through Manual and Automated Polishing
The polishing strategy combined automated QV correction with manual
curation. Initial variant calling with Clair3 identified over 60,000 variants, subsequently
resolved using Merfin. Despite achieving QV60 at k=21, a final manual curation round
introduced 8,410 edits, including 551 large structural variants, resulting in a QV
improvement of 1–5 points. The complete manual curation process comprised four distinct
rounds:
Round 1: Sniffles2 identified 755 target regions across all alignment files
for manual review.
Round 2: Cross-assembly curation using Unimap examined scaffolds from
Flye, GreenHill, and Hifiasm P-contigs, incorporating seqmer error files. This resulted
in 2,221 manual edits and application of 5,951 variants.
Round 3: Combined Winnowmap and Unimap alignments enabled identification
of 8,710 variants from 5,740 alignments, with 1,932 applied to the assembly.
Additionally, 300 gaps were resolved following Sniffles2 and BCFtools consensus-based
corrections.
Round 4: Correction of 97 large structural variants, including 10 visible
only through VerityMap alignments. To ensure genome reliability, 65 gaps were introduced
to mask sequences lacking proper read support or exhibiting high error profiles.
Post-masking, 247 structural variants (196 insertions, 73 deletions) identified
using Sniffles2 were applied to the genome. These variants, derived from ONT
pb-falconc-filtered alignments (https://github.com/bio-nim/pb-falconc) realigned at Hifiasm
contig boundaries with 50 nt margins, were approximately 90% homozygous and
10% heterozygous. Genomic error k-mers showed non-uniform chromosomal distribution,
clustering in regions with ONT-only or single-depth coverage (511 loci total).
Further correction using Clair3 (Zheng et al., 2022) addressed 22,625 ONT-based consensus
errors and 7,558 HiFi Winnowmap variants, achieving median QV31.
Assembly validation using IGV demonstrated quality improvements and
structural resolution. Chromosome 03 showed clear reduction in k-mer error rates and
increased QV post-polishing. Chromosome 02 analysis revealed a false duplication
present in Hifiasm but absent in Flye, confirmed by doubled homozygous coverage and
lack of supporting HiFi alignments. A single spanning ONT read and clipped Illumina
read validated boundary precision. Consistent single-nucleotide markers across all data
types (HiFi, ONT, Flye) suggest potential for further refinement using high-confidence
variants (AF > 0.9) with Clair3 (Zheng et al., 2022) and WhatsHap (Martin et al., 2016).
Figure 13 Quality value and coverage validation using Integrative Genomics Viewer. (A) Chromosome
03 demonstrating QV improvement following polishing procedures. (B) Chromosome
02 showing false duplication and gap analysis through comparative assembly assessment.
Following eight months of genome curation, the estimated QV improved
from Q40 to Q60, demonstrating that F1 hybrid genomes can achieve superior nucleotide-level
accuracy compared to single-species assemblies. For comparison, the pure-breed
bighead catfish assembly achieved median QV40–50 (k=21, k=31, pileup), while the
current F1 hybrid genome demonstrates ten-fold higher accuracy using comparable sequencing
coverage, analytical tools, and curation effort.
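The fold-change statement above follows from the Phred scale, on which QV = -10 log10(error rate). The sketch below is only a worked check of that arithmetic; the function names are hypothetical and do not come from any tool used in this thesis.

```python
import math


def qv_to_error_rate(qv: float) -> float:
    """Convert a Phred-scaled quality value to a per-base error probability."""
    return 10 ** (-qv / 10)


def error_rate_to_qv(error_rate: float) -> float:
    """Convert a per-base error probability to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


def fold_improvement(qv_before: float, qv_after: float) -> float:
    """Return how many times lower the error rate is after the QV increase."""
    return qv_to_error_rate(qv_before) / qv_to_error_rate(qv_after)


if __name__ == "__main__":
    for before, after in ((50, 60), (40, 60)):
        fold = fold_improvement(before, after)
        print(f"Q{before} -> Q{after}: about {fold:.0f}-fold fewer errors")
```

On this scale every 10 QV points is one order of magnitude, so Q60 is ten-fold more accurate than Q50 and one hundred-fold more accurate than Q40.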
Figure 14 F1 Hybrid Catfish Assembly status January 2024 - November 2024.
VALIDATION OF CATFISH ASSEMBLIES
Benchmarking Methodology
This chapter presents the results and technical validations performed on the genome
assemblies of the two catfish species. First, I present the most common metrics used to
measure the quality and completeness of the assemblies. Next, I present each individual
assembly with its results. Finally, comparative genomic analyses are provided to illustrate
how these two genomes can be used for comparative studies across species. Metrics of
continuity, structural accuracy, base accuracy, and functional completeness were used
for benchmarking as described in the Vertebrate Genome Project (VGP) paper (Rhie
et al., 2021).
1. Metrics Assessed
1.1 Continuity and summary statistics
To assess continuity and summary statistics of the assembly, I compute
the following measures for scaffolds/contigs (N50, N90, NG50, LG50, and LG90)50 with
RagTag ('ragtag.py asmstats -g') (Alonge et al., 2022) 2.1.0.
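The continuity measures listed above can be illustrated with a minimal sketch. It is not RagTag's implementation: the function name `nx_lx` and the toy contig lengths are hypothetical, and the code only follows the standard definitions (Nx/Lx relative to the assembly length, NGx/LGx relative to an expected genome size).

```python
from typing import Optional, Sequence, Tuple


def nx_lx(
    lengths: Sequence[int],
    fraction: float = 0.5,
    genome_size: Optional[int] = None,
) -> Tuple[Optional[int], Optional[int]]:
    """Return (Nx, Lx) for the given fraction.

    With genome_size=None the target is a fraction of the assembly length
    (N50, N90, L50, ...). With genome_size set, the target is a fraction of
    the expected genome size (NG50, LG50, ...).
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    # The assembly is shorter than the target (possible for NG/LG values).
    return None, None


if __name__ == "__main__":
    contigs = [80, 70, 50, 40, 30, 20, 10]  # toy lengths, total = 300
    print("N50, L50:", nx_lx(contigs, 0.5))
    print("N90, L90:", nx_lx(contigs, 0.9))
    print("NG50, LG50:", nx_lx(contigs, 0.5, genome_size=400))
```

Sorting from longest to shortest and accumulating lengths until the target is reached gives both statistics in one pass: the length at which the loop stops is Nx, and the number of sequences consumed is Lx.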
1.2 Repeat completeness and continuity of repeats
To measure repeat completeness and continuity of repeats for assessing assembly
quality, I estimated the percentage of fully assembled LTR retroelements
(LTR-RTs)51 and computed the long terminal repeat (LTR) Assembly Index (LAI) using two
50N50 and N90 represent the contig or scaffold length such that 50% or 90% of the total assembly
length is contained in contigs/scaffolds of that length or longer. NG50 is similar to N50 but calculated
relative to an expected reference genome size. LG50 and LG90 indicate the minimum number of contigs
or scaffolds whose combined length makes up 50% or 90% of the expected genome size, respectively.
51LTR-RTs (Long Terminal Repeat Retrotransposons) are a class of transposable elements characterized
by direct long terminal repeats at both ends. They replicate via an RNA intermediate and reverse
transcription, and are major contributors to genome size and structure in many eukaryotic organisms.
reference-free programs, LTR Assembly Index (LAI) (Ou et al., 2018) and LTR_retriever
(Ou and Jiang, 2017) 2.9.00. To assess the assembly quality of complex repeats
(centromeres), I used TandemTools and TandemQUAST (Mikheenko et al., 2020) 1. Telomere
prediction (presence / absence and orientation of telomeres) was done with
TIDK, a Telomere Identification Toolkit (Brown et al., 2023), implemented in TeloExplorer
('-m 50 -c animal'), a module of quarTeT (Lin et al., 2023). To estimate gaps,
I used the script detgaps, available on GitHub (https://github.com/dfguan/asset).
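The idea behind telomere prediction can be sketched as counting the vertebrate telomeric motif at each end of a pseudochromosome. This is a simplified illustration only, not the algorithm used by TIDK or TeloExplorer; the function names, the window size, and the toy sequence are all hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def count_terminal_repeats(seq: str, motif: str = "AACCCT", window: int = 1000):
    """Count motif copies in the first window and its reverse complement
    in the last window of a sequence.

    Returns (left_count, right_count). Counts are non-overlapping
    occurrences, which is adequate for tandem arrays of the motif.
    """
    seq = seq.upper()
    left = seq[:window].count(motif)
    right = seq[-window:].count(reverse_complement(motif))
    return left, right


if __name__ == "__main__":
    # Toy pseudochromosome: 5 left copies, filler, 3 right copies.
    toy = "AACCCT" * 5 + "ACGT" * 10 + "AGGGTT" * 3
    print(count_terminal_repeats(toy, window=30))
```

The left end is searched for AACCCT and the right end for its reverse complement AGGGTT, which matches the orientation convention used in the telomere columns of the scaffold tables.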
1.3 Structural accuracy (regional and structural errors and reliable blocks)
To assess structural accuracy I used the CRAQ software ('sms_coverage=5
ngs_coverage=20 -B T --minimap2-sensi e') (Li et al., 2023) 1.0.9, more specifically
('-q 20 -m 2 - 0.4 -h 0.6 - 0.75 -a 20'). CRAQ is a method for assembly
structural validation relying on the study of mapped reads, clipped reads and
coverage support by two or more simultaneous sequencing platforms to find supporting
regions or reliable blocks (Rhie et al., 2021). I visualized the results using the Integrative
Genome Viewer (IGV) (Thorvaldsdottir et al., 2012). CRAQ evaluates genome assembly
quality based on clipped-read evidence from both short and long reads. It reports
two main types of errors:
1. Clip-based Regional Errors (CREs), which represent small-scale local misassemblies
identified by short-read clipping without structural disruption.
2. Clip-based Structural Errors (CSEs), which represent large-scale misassemblies
supported by long-read clipping near breakpoint regions.
These errors are quantified into two sub-scores: R-AQI (based on CREs) and S-AQI
(based on CSEs). Both scores contribute to the global Assembly Quality Index (AQI),
ranging from 0–100, where higher values indicate better assembly integrity. AQI is
computed based on error density relative to genome size.
1.4 Cross-species structural correctness
To assess cross-species structural correctness, I used the same references
from related catfish species for one-to-one nucleotide-level alignments of orthologous
segments with MashMap252 ('-s 2000000 --pi 90 -c 100000') (Jain et al., 2018a)
3.1.3.
1.5 Base accuracy and assembly completeness
To assess base accuracy and completeness, specifically Merqury's quality
values (QV) and 21-mer genome completeness (%), I used Merqury (Rhie et al., 2020)
1.3. Merqury was run three times using different k-mer databases: Illumina, HiFi,
and a hybrid 21-mer database combining Illumina and HiFi reads, as explained above
in the "Consensus polishing" section. To assess functional completeness and evaluate
the completeness of the gene set, I used the lineage of ray-finned fishes and BUSCO
('-l actinopterygii_odb10') (Simão et al., 2015) 5.6.1. Finally, nucleotide
accuracy was verified by read-to-assembly mapping, performed using minimap2 ('-ax
map-hifi --secondary=no') and Winnowmap ('-W repetitive.15.txt') at MAPQ
>10 for HiFi reads, while for ONT reads I used the '-ax map-ont' preset of minimap2
and the '-ax map-pb' preset in Winnowmap at MAPQ >10. For visualisations, I used
IGV ('igv -g $genome $BAM(s) $merqury_only_bed_wig_kmer_files').
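The k-mer-based QV reported here can be reproduced in outline with the k-mer survival model described for Merqury (Rhie et al., 2020): if a fraction of the assembly k-mers is absent from the read set, the per-base probability of being correct is taken as the k-th root of the supported fraction. The sketch below is my own restatement of that formula, not Merqury's code; the function name and the example counts are hypothetical.

```python
import math


def kmer_based_qv(asm_only_kmers: int, total_asm_kmers: int, k: int = 21) -> float:
    """Estimate a consensus QV from k-mer counts.

    asm_only_kmers  : k-mers found in the assembly but not in the reads
    total_asm_kmers : all k-mers found in the assembly
    """
    if asm_only_kmers == 0:
        return math.inf  # no unsupported k-mers observed
    # Probability that a base is correct, assuming a k-mer survives
    # only if all k of its bases are correct.
    p_correct = (1 - asm_only_kmers / total_asm_kmers) ** (1 / k)
    error_rate = 1 - p_correct
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # Hypothetical counts chosen for illustration only.
    print(round(kmer_based_qv(2_100, 100_000_000, k=21), 2))
```

Because each erroneous base can invalidate up to k overlapping k-mers, the error k-mer fraction is roughly k times the base error rate, which is why the k-th root appears in the model.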
52MashMap2 is a fast and approximate algorithm for computing whole-genome homology maps. It uses
minimizer-based locality-sensitive hashing to rapidly identify high-confidence regions of sequence
similarity, making it suitable for genome-to-genome alignments, structural variation detection, and
reference-guided scaffolding (Jain et al., 2018a).
Results – Bighead Catfish
The haplotype-resolved de novo assembly resulted in 27 Hi-C scaffolds, and the
chromosome number was 2n = 2x = 54 pseudochromosomes (Maneechot et al., 2016).
1. Raw Read Quality
Figure 15 summarizes raw data quality for the bighead catfish genome.
Figure 15 Sequencing summary and genome survey of bighead catfish. (A) Sequencing protocols
and coverages. (B) GenomeScope k-mer profile showing genome size, repeat content, and
17.6% heterozygosity. (C) HiFi reads: high quality and 15 kbp peak length. (D) ONT reads:
broader size range with N50 30 kbp and moderate quality.
2. Global Assembly Metrics
Sequence variation analysis53 was performed with the PlotSR suite (Goel and Schneeberger,
2022), revealing 1,968,666 heterozygous SNPs across both haplotypes after minimap2
('-ax asm5') cross-haplotype mapping, resulting in a heterozygosity rate54 of
0.594%. Structural variant analysis identified 392,973 insertions totaling 7.67 Mb in
Haplotype 2, while Haplotype 1 had 393,127 deletions spanning 7.75 Mb. Copy number
variations55 included 114 copy gains in Haplotype 2 (184 Kb) and 123 copy losses
in Haplotype 1 (416 Kb). Highly divergent regions56 were identified, spanning 57.9 Mb
in Haplotype 1 and 56.0 Mb in Haplotype 2. Additionally, 47 tandem repeat clusters57
were detected, covering 10 Kb in Haplotype 1 and 7 Kb in Haplotype 2.
Table 1 Summary statistics of the haplotype-resolved Clarias macrocephalus genome assembly.
Features                       Haplotype 1              Haplotype 2              All
Overall quality (x.y.P.Q.C)    6.30.P7.53.Q48.C94.25    6.30.P7.53.Q48.C94.25
Genome size (Mb)               875                      880
Pseudochromosomes              27                       27
Mitochondrion length (bp)      16,510                   N/A
NG50 of contigs (Mb)           3                        3
N50 of scaffolds (Mb)          34                       34
LG50/LG90 of scaffolds         11 / 24                  11 / 24
Number of gaps                 752                      653
GC content (%)                 39.32                    39.32
Complete BUSCOs (%)            93.0                     93.5
Median QV (Merqury k21)        42.99                    43.33                    48.22
Completeness (Merqury k21)     89.11                    88.88                    95.25
53Comparison of Haplotype 1 and 2 chromosomes to detect heterozygous variants such as SNPs, indels,
and structural changes.
54Proportion of sites where the two haplotypes differ, reflecting allelic diversity.
55Duplications or deletions causing variable copy numbers of genomic regions.
56Genomic segments showing strong sequence divergence between haplotypes or species.
57Short motifs repeated head-to-tail, often in telomeres or centromeres.
Table 3 Summary of scaffold metrics in the haplotype-resolved genome assembly of Clarias
macrocephalus, Haplotype 2.
Scaffold name        Size    Errors    Quality value    Gaps  Left telomere   Right telomere  S-AQI
(Haplotype_Linkage)  (Mb)    (21-mer)  (hyb.k21.meryl)  (N)   (5' AACCCT)     (AGGGTT 3')     (%)
ClaMac_2_LG_01       51.51   2649      55.5298          17    -               -               89.28
ClaMac_2_LG_02       44.94   6893      51.161           25    Left 255 (+)    -               71.62
ClaMac_2_LG_03       39.44   8894      49.5912          16    -               -               83.57
ClaMac_2_LG_04       38.01   6912      50.5588          18    Left 293 (+)    Right 526 (-)   88.17
ClaMac_2_LG_05       38.16   3424      53.4601          23    -               Right 125 (-)   78.26
ClaMac_2_LG_06       38.66   2675      54.3325          17    -               -               80.86
ClaMac_2_LG_07       33.93   5651      50.8217          17    Left 169 (+)    -               89.91
ClaMac_2_LG_08       47.19   10458     49.5856          29    Left 174 (+)    Right 190 (-)   89.44
ClaMac_2_LG_09       28.8    10239     47.5736          14    Left 641 (+)    -               91.91
ClaMac_2_LG_10       29.07   5001      50.8593          14    Left 268 (+)    -               95.24
ClaMac_2_LG_11       25.76   6627      48.9338          21    -               -               89.75
ClaMac_2_LG_12       30.93   6915      48.7951          19    -               -               82.58
ClaMac_2_LG_13       23.66   3213      51.9117          15    -               Right 100 (+)   86.65
ClaMac_2_LG_14       21.06   3487      50.9324          7     -               -               100.00
ClaMac_2_LG_15       24.93   7673      48.2988          17    Left 134 (+)    Right 137 (-)   85.58
ClaMac_2_LG_16       24.83   3011      52.2777          10    -               -               81.74
ClaMac_2_LG_17       22.34   2012      53.6414          15    Left 175 (+)    -               85.99
ClaMac_2_LG_18       24.44   1986      54.0521          11    -               -               100.00
ClaMac_2_LG_19       30.86   3566      52.521           16    -               Right 362 (-)   90.62
ClaMac_2_LG_20       34.51   2384      53.5357          13    -               -               100.00
ClaMac_2_LG_21       27.99   2550      53.5741          13    -               -               100.00
ClaMac_2_LG_22       33.35   4588      51.8225          16    Left 128 (+)    Right 908 (-)   88.64
ClaMac_2_LG_23       31.45   9701      48.2459          6     Left 214 (+)    Right 793 (-)   82.68
ClaMac_2_LG_24       27.39   2840      53.0349          6     Left 114 (+)    -               93.87
ClaMac_2_LG_25       37.88   7891      49.9914          15    -               -               87.01
ClaMac_2_LG_26       40.17   11271     48.6822          13    -               Right 607 (-)   88.08
ClaMac_2_LG_27       29.56   1236      56.9751          12    Left 155 (+)    -               90.78
MT_from_NC_046749    0.02    -         -                -     -               -               -
6. TE Metrics
6.1 Dating of TE Replicative Bursts
To estimate the expansion history of major transposable element (TE) superfamilies
in Clarias macrocephalus, I calculated the nucleotide divergence of each TE
copy relative to its consensus sequence and converted this divergence into approximate
insertion ages using a neutral mutation rate58 ($\mu = 6 \times 10^{-9}$ per site per generation)
previously determined in electric catfish (Liu et al., 2023b). Assuming a one-year
generation time, the estimated expansion peaks indicate that LTR/Gypsy and TIR/Mutator
families expanded around 30–33 Mya, LTR/Copia and TIR/hAT around 28–30 Mya, and
TIR/CACTA, PIF/Harbinger, and Tc1/Mariner families between 16 and 25 Mya. These
results (Figure 20) reveal multiple waves of TE proliferation, likely reflecting episodes
of genome instability or evolutionary transition in the species' history.
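The conversion from divergence to age can be sketched as follows. The exact formula is not spelled out above, so this example assumes the simplest molecular-clock form for copy-versus-consensus divergence, T = d / µ generations, with no correction for multiple substitutions; the function name is hypothetical.

```python
def insertion_age_mya(
    divergence: float,
    mu: float = 6e-9,
    generation_time_years: float = 1.0,
) -> float:
    """Approximate insertion age in million years.

    divergence            : substitutions per site between a TE copy and
                            its consensus (e.g. 0.18 for 18%)
    mu                    : neutral mutation rate per site per generation
    generation_time_years : assumed generation time
    Assumption: T = d / mu generations (no multiple-hit correction).
    """
    generations = divergence / mu
    return generations * generation_time_years / 1e6


if __name__ == "__main__":
    for d in (0.10, 0.15, 0.18):
        print(f"divergence {d:.2f} -> about {insertion_age_mya(d):.1f} Mya")
```

Under these assumptions, a copy that is 18% diverged from its consensus corresponds to about 30 Mya, and 10% divergence to about 17 Mya, which is the range in which the expansion peaks above fall.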
Figure 20 TE divergence profiles in bighead catfish genome.
58The neutral mutation rate is the rate at which mutations accumulate in non-functional genomic
regions, providing a baseline for molecular-clock dating.
6.2 Genome composition in TE
Transposable element (TE) annotation indicated that 35.25% of the bighead
catfish genome consisted of repetitive sequences. Among these, TIR DNA transposons
comprised 19.12%, Helitrons 4.47%, LTR retrotransposons 8.30%, LINE elements
0.46%, and the remaining 2.82% consisted of uncharacterized repeats.
Table 4 Transposable element content in Bighead catfish Haplotype 1.
Class     Superfamily        # of elements
LINE      I                  128
          L1                 547
          L2                 1,129
          Rex                556
          LINE/Unknown       7,412
LTR       Bel_Pao            151
          Copia              53
          Gypsy              68,215
          LTR/Unknown        13,466
TIR       CACTA              357,388
          Mutator            172,578
          PIF_Harbinger      36,538
          Tc1_Mariner        41,500
          hAT                80,189
          PiggyBac           758
          Polinton           281
NonLTR    DIRS_YR            678
          Penelope           417
NonTIR    Helitron           166,251
Other     Repeat Fragment    78,900
Total     Interspersed       1,148,329
LINEs (Long Interspersed Nuclear Elements) are non-LTR retrotransposons
that transpose via a copy-and-paste mechanism using reverse transcriptase; although
less abundant in fish genomes than in mammals, they contribute to structural variation.
LTR retrotransposons replicate through an RNA intermediate using reverse transcription
and are flanked by long terminal repeats, playing key roles in genome expansion
and gene regulation. TIR DNA transposons (Terminal Inverted Repeat transposons)
are cut-and-paste elements flanked by inverted repeats recognized by a transposase
enzyme, contributing to genome rearrangement and regulatory diversification. Non-LTR
retrotransposons such as DIRS and Penelope elements add further diversity to the
repetitive landscape through distinct mobilization mechanisms. Helitrons, belonging to the
NonTIR class, are rolling-circle DNA transposons that lack terminal inverted repeats
and transpose through a mechanism similar to rolling-circle replication, often capturing
and mobilizing gene fragments that promote genomic innovation. Together, these TE
classes reflect a dynamic evolutionary history of C. macrocephalus, consistent with
high DNA transposon activity typical of teleost genomes.
7. Macrosynteny Analysis
To examine orthologous relationships and identify syntenic regions between catfish
and related teleost species, I obtained proteomes (.pep.fa files) from Ensembl
2024 (Harrison et al., 2023), ensuring the latest assembly versions for consistency.
Species are Oncorhynchus mykiss (USDA_OmykA_1.1), Oryzias latipes (ASM223467v1),
Lates calcarifer (ASB_HGAPassembly_v1), Cyprinus carpio (Cypcar_WagV4.0), Danio
rerio (GRCz11), Oreochromis niloticus (O_niloticus_UMD_NMBU), and Lepisosteus
oculatus (LepOcu1). Orthologous gene clusters (Orthogroups) were first identified with
OrthoFinder ('orthofinder -f /proteomes -M msa -a 126 -t 126') (Emms and
Kelly, 2019) 2.5.5. Subsequently, I utilized rbhXpress (https://github.com/SamiLhll/
rbhXpress) and macrosyntR (El Hilali and Copley, 2023) 0.2.19 to analyze the resulting
orthogroups and visualize their syntenic relationships across species. This combined
approach allowed for an in-depth view of conserved and divergent genomic regions within
the catfish lineage and related teleosts (Figure 21).
Figure 21 Synteny analysis of various catfish samples. (A) Whole-genome 4-way
synteny analysis from 1413 shared single-copy orthogroups across Clarias gariepinus
(GCA_024256425.2), Clarias macrocephalus, Danio rerio (GRCz11) and Oreochromis
niloticus (O_niloticus_UMD_NMBU) reveals conserved chromosome structures and
rearrangements. (B) Simple phylogeny and geological timescales (TimeTree5, https://timetree.
org/).
Results – F1 Hybrid Catfish
The haplotype-resolved de novo assembly of the F1 hybrid catfish (C. macrocephalus
× C. gariepinus) yielded a total of 55 Hi-C scaffolds, representing 27 + 28
chromosomes from the parental species (Lewin et al., 2019; Maneechot et al., 2016).
1. Raw Read Quality
Figure 22 summarizes the sequencing data quality and genome survey statistics.
Figure 22 Sequencing summary and genome survey of the F1 hybrid catfish (C. macrocephalus
× C. gariepinus). (A) GenomeScope2.0 k-mer profile (k=21) estimating genome size (903 Mb),
heterozygosity (10.1%), and sequencing error rate (0.28%). (B) Sequencing coverage by data
type: HiFi (30×), ONT (36×), Hi-C (37×), and Illumina (55×). (C) HiFi reads show high quality
with 15 kbp modal length. (D) ONT reads display broader length distribution (N50 ≈ 30 kbp).
1.1 Illumina paired-end short-reads
Standard Illumina paired-end short-read sequencing (Bentley et al., 2008)
was performed on the Illumina NextSeq2000 platform (by NovoGene), resulting in more
than 50 Gb of raw read data (read-length=151 nucleotides), which is equivalent to a
genome coverage of 36X. Raw reads had an average quality of Q25, and 92% of the
dataset had a median quality of Q30.
1.2 Proximity-ligation (Hi-C) paired-end short-reads
Proximity-ligation (Hi-C) data was obtained by following a standard in
situ Hi-C protocol from 2009 (Dudchenko et al., 2017). Briefly, a nuclear ligation was
performed by cross-linking chromosomes, then a restriction digestion was carried out
with DpnII endonuclease. The chromatin conformation capture library was prepared
using Phase Genomics (https://info.phasegenomics.com/protocols), and the genomic
DNA was sequenced on the Illumina NextSeq2000 platform, resulting in paired-end
short-reads (151 nucleotides in length) of 37.19 Gb of raw data (approximately 100M
pairs of reads per billion bases in genome length) corresponding to a genome coverage
of 39.77X. 97.18% of the dataset had a quality score of Q30 or higher (Supplementary
Table 3). For Haplotype 1, 123.96M read pairs were sequenced, yielding 74.77M unique
Hi-C contacts, with 59.40M valid contacts (22.71M inter-chromosomal, 36.69M
intra-chromosomal), including 17.47M short-range (<20Kb) and 19.72M long-range (>20Kb)
contacts. Haplotype 2 showed similar metrics.
1.3 PacBio HiFi long-reads
PacBio's HiFi long-read sequencing was performed using a SMRT cell
on the Sequel II system, resulting in 10.62 Gb of raw data corresponding to a genome
coverage of approximately 12.9X or 6-7X per haplotype. The read length N50 value
was 10,379 bases, the raw reads' mean quality score was Q28.7, and the median quality
score was Q36. Over 72% were above Q30, and 86% above Q25. There were 191,790
reads with a cumulative length of 5,066,253,114 bases (5 Gb) in the high-quality reads
subset (> Q10).
1.4 Oxford Nanopore Technologies (ONT) 1D-long noisy reads
ONT long noisy reads were prepared from 1D-fragmented and size-selected
pooled libraries of end-polished, nick-repaired, and A-tailed DNA fragments, produced
by ligating sequencing adapters onto double-stranded DNA fragments purified with
AMPure XP beads. DNA quality was assessed with Qubit before sequencing on the
Nanopore MinION system (NovoGene). The standard ONT libraries comprised 32.62
Gb of raw read data, corresponding to a genome coverage of 36X, or 18X per haplotype.
The read N50 value was 4 kb. I estimate an ONT data median read quality of
Q11, with 97.0% of reads having quality > Q7 (4,023,582 reads), corresponding to 20%
base errors. Unfortunately, the quality of the data was extremely low, resulting in only <
4X of usable genome coverage out of the 36X sequenced.
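The coverage figures in this subsection follow from dividing the sequenced bases by the genome size. The sketch below reproduces the ONT value under the assumption that the GenomeScope2.0 estimate of 903 Mb (Figure 22) is the denominator; the function name is hypothetical.

```python
def fold_coverage(total_bases: float, genome_size: float) -> float:
    """Return sequencing depth as total sequenced bases over genome size."""
    return total_bases / genome_size


if __name__ == "__main__":
    ont_bases = 32.62e9      # raw ONT data reported above
    genome_size = 903e6      # assumed: GenomeScope2.0 estimate (Figure 22)
    diploid = fold_coverage(ont_bases, genome_size)
    print(f"ONT coverage: {diploid:.1f}X total, {diploid / 2:.1f}X per haplotype")
```

Halving the total gives the per-haplotype depth, since each parental sub-genome receives roughly half of the reads in an F1 hybrid.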
2. Global Assembly Metrics
Our F1 hybrid catfish genome released into GenBank consists of 55 scaffolds (all exceeding 10,000 bp), totaling 1,858,695,542 bp, one mtDNA, and 0 unplaced sequences. Notably, every contig in the scaffolds exceeds 1,000,000 bp, with the longest spanning 51,692,840 bp and the second longest, at 51,114,572 bp, corresponding to a Telomere-to-Telomere (T2T) chromosome with 0 gaps. Our assembly reached an N50 of 33,844,116 bp, which is an indicator of high contiguity. Read-to-contig HiFi minimap2 alignment performed in the context of Inspector confirms a 99.96% mapping rate, a 0.52% split-read rate, and an average mapped depth of 10.65×, which remains consistent for contigs > 1 Mbp. Structural validation identified 22 structural errors (12 expansions, 7 collapses, 2 of 5 haplotype switches, and 1 inversion), while small-scale errors averaged 4.79 per Mbp (8,897 total), comprising 6,226 base substitutions, 1,189 expansions, and 1,482 collapses. Further quality control was performed with Merqury (k=21) to evaluate k-mer-based metrics (Figure 25, Table 5 and Table 6). Consensus Quality Values (QVs) of the hybrid genome and each sub-genome were 50.1, 48.3, and 49.5, respectively, with k-mer completeness scores of 96.7%, 95.8%, and 97.4% for the hybrid (2 haps, combined), bighead, and North African catfish sub-genomes. Merqury's spectrum analysis confirmed the high quality and completeness of the haplotype-resolved assembly with minimal assembly errors. Overall, the polished assembly achieved a QV of 50.28 (pileup-based, Inspector 1.3), indicating robust base-level accuracy and structural integrity suitable for various types of downstream analyses.
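The relation between error k-mer counts and consensus QV can be illustrated with a short sketch. It assumes the k-mer survival model described in the Merqury documentation, P = (1 - K_asm_only / K_total)^(1/k) and QV = -10 log10(1 - P). The function name, and the choice of K_total as scaffold length minus k plus one, are assumptions made for illustration and are not taken from the pipeline used here.

```python
import math


def kmer_based_qv(asm_only_kmers, total_kmers, k=21):
    """Consensus QV from k-mer counts.

    asm_only_kmers: k-mers found in the assembly but not in the reads
    total_kmers:    all k-mers in the assembly
    """
    if asm_only_kmers == 0:
        return float("inf")
    # Probability that a single base is correct, given that a k-mer is
    # only error-free when all k of its bases are correct.
    p_base_correct = (1 - asm_only_kmers / total_kmers) ** (1 / k)
    return -10 * math.log10(1 - p_base_correct)


# Assumed inputs: a gap-free scaffold of 51,114,572 bp with 377 error 21-mers.
scaffold_length = 51_114_572
k = 21
qv = kmer_based_qv(377, scaffold_length - k + 1, k)
print(f"QV = {qv:.2f}")
```

With these inputs the sketch gives a QV of about 64.5, in line with the QV_k21 listed for the first North African scaffold in Table 5, assuming that row is the 51.1 Mb T2T scaffold described above.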
3. Comparative synteny with previous C. macrocephalus and C. gariepinus assemblies
Comparative synteny analyses revealed extensive genomic conservation between the hybrid catfish genome and its parental species (Appendix Figure 31; Appendix Figure 32). In the North African catfish (C. gariepinus) comparison, macrosynteny across 28 pseudo-chromosomes indicated strong structural collinearity with the hybrid genome, interrupted only by localized inversions, translocations, and small duplicated regions. The overall sequence variation landscape was dominated by highly divergent regions and SNPs, while structural variants represented a minor fraction of total differences. Similarly, in the Bighead catfish (C. macrocephalus) comparison, 27 pseudo-chromosomes showed high collinearity and preserved chromosomal organization, confirming the structural stability of both parental lineages. Variation analyses identified modest insertions, deletions, and inversion tracts, consistent with evolutionary divergence preceding hybridization. Together, these results indicate that both parental genomes contributed largely conserved chromosomal architectures to the hybrid lineage, supporting a high-quality, structurally stable hybrid genome assembly.
3.1 Expected Genome Sizes, k-mer Profiles and Hybrid Genome Assessment
Appendix Figure 30 presents GenomeScope2.0 k-mer profiles used to evaluate genome size, heterozygosity, and repeat content of the F1 hybrid genome and its parental species. Separate k-mer distributions are shown for individual parental lineages from distinct bloodlines to assess intra-species variation.
• Parental individuals (same species, different bloodlines): The C. macrocephalus individual exhibits low heterozygosity (~0.056%), while C. gariepinus shows moderate heterozygosity (~1.56%), consistent with expected intraspecific levels. Broader peaks at k=21 likely reflect lower Illumina coverage (<20×), but species-level divergence remains evident.
• F1 hybrid genome: At k=21, the estimated genome size is approximately 1.8 Gb, aligning with the combined haploid contributions of both species. The apparent heterozygosity rate of 10% reflects typical divergence in inter-specific F1 hybrids with distinct sub-genomes. This divergence results in a bimodal distribution due to haplotype differences. At k=31, heterozygosity estimates in the F1 hybrid decrease, because the larger k-mer size partially separates the subgenomes and reduces false overlaps.
4. Hi-C Contact Validation
Separate Hi-C contact matrices were generated for each chromosome-scale scaffold in the F1 hybrid genome to confirm subgenome identity and structural integrity. As shown in Figures 23 and 24, the two sub-genomes display distinct, internally consistent chromosomal contact patterns, supporting successful haplotype separation and chromosome-scale phasing.
• African catfish subgenome (28 chromosomes): The majority of scaffolds show strong diagonal contact patterns without major off-diagonal disruptions, indicating high contiguity and few mis-joins. A few chromosomes exhibit minor local noise or contact gaps, which were further curated using Juicebox manual correction and IGV inspection.
• Bighead catfish subgenome (27 chromosomes): Hi-C interaction maps demonstrate similar quality, with well-defined diagonals consistent with chromosome-length assemblies. Chromosomes such as Chr07 and Chr23 displayed additional
Table 5 North African subgenome scaffold metrics with Quality Values from Merqury (k=21 and k=31), telomere orientation, and gap counts.
Scaffold_name Length QV_k21 Errors_k21 QV_k31 Errors_k31 Left_telomere Gaps Right_telomere
Sub-Genome_Chromosome Mb. 21-mer 21-mer 31-mer 31-mer (5’ AACCCT) (N) (AGGGTT 3’)
ClaHyb_african_1_Chr_01 51,1 64.5442 377 59.4881 1783 119 (+) T2T 0
ClaHyb_african_1_Chr_02 51,7 56.0219 2713 56.0911 3942 174 (+) 10
ClaHyb_african_1_Chr_03 47,6 59.3028 1174 54.5941 5125 1701 (+) 31603 (-)
ClaHyb_african_1_Chr_04 44,3 49.4714 10490 56.9825 2716 185 (+) 8 160 (-)
ClaHyb_african_1_Chr_05 41,5 62.5139 488 61.5864 892 1472 (+) T2T 1883 (-)
ClaHyb_african_1_Chr_06 41,3 54.5824 3021 56.2081 3068 278 (+) 11447 (-)
ClaHyb_african_1_Chr_07 42,6 58.0864 1389 54.3111 4889 384 (+) 3382 (-)
ClaHyb_african_1_Chr_08 40,2 54.3182 3124 59.8602 1287 0 T2T 130 (-)
ClaHyb_african_1_Chr_09 37,3 55.9044 2013 53.8378 4784 1449 (+) 1315 (-)
ClaHyb_african_1_Chr_10 34,9 67.6478 126 60.3736 993 0 2127 (-)
ClaHyb_african_1_Chr_11 33,5 51.4996 4986 58.2486 1555 1614 (+) T2T 1997 (-)
ClaHyb_african_1_Chr_12 35,3 54.4649 2653 58.1993 1657 0 6 0
ClaHyb_african_1_Chr_13 33,6 66.4695 159 63.3235 483 1525 (+) 21557 (-)
ClaHyb_african_1_Chr_14 32,4 47.2905 12717 61.6029 690 105 (+) T2T 226 (-)
ClaHyb_african_1_Chr_15 35 64.2407 275 58.2795 1602 328 (+) 2118 (-)
ClaHyb_african_1_Chr_16 33,8 60.5363 627 57.9679 1675 1602 (+) 4137 (-)
ClaHyb_african_1_Chr_17 30,3 55.2666 1892 56.604 2052 335 (-) T2T 276 (-)
ClaHyb_african_1_Chr_18 30,5 56.641 1382 57.9088 1522 1411 (+) 32289 (-)
ClaHyb_african_1_Chr_19 30,7 64.5469 226 58.6026 1311 784 (+) 2178 (+)
ClaHyb_african_1_Chr_20 27,1 57.0276 1128 60.0252 835 171 (+) 2406 (-)
ClaHyb_african_1_Chr_21 30,6 60.6832 549 60.1377 919 1936 (+) 20
ClaHyb_african_1_Chr_22 30 62.328 369 60.2163 886 0 2129 (-)
ClaHyb_african_1_Chr_23 26,2 52.9697 2773 58.0814 1260 279 (+) 11739 (-)
ClaHyb_african_1_Chr_24 25,9 60.3587 501 60.6659 689 185 (+) 2114 (-)
ClaHyb_african_1_Chr_25 25,6 58.5637 748 59.0404 989 232 (+) T2T 129 (-)
ClaHyb_african_1_Chr_26 26,3 56.7889 1163 57.358 1506 1343 (+) 30
ClaHyb_african_1_Chr_27 24 68.2243 76 59.5233 832 0 T2T 264 (-)
ClaHyb_african_1_Chr_28 20,3 59.8335 443 56.9713 1261 0 20
Table 6 Bighead subgenome scaffold metrics with Quality Values from Merqury (k=21 and k=31), telomere orientation, and gap counts.
Scaffold_name Length QV_k21 Errors_k21 QV_k31 Errors_k31 Left_telomere Gaps Right_telomere
Sub-Genome_Chromosome Mb. 21-mer 21-mer 31-mer 31-mer (5’ AACCCT) (N) (AGGGTT 3’)
ClaHyb_bighead_1_Chr_01 50,1 55.7992 2767 50.7048 13207 998 (+) 14 740 (-)
ClaHyb_bighead_1_Chr_02 44,3 51.8276 6102 54.4421 4934 815 (+) 50
ClaHyb_bighead_1_Chr_03 40,4 44.33 31251 55.4715 3531 327 (+) 12 279 (-)
ClaHyb_bighead_1_Chr_04 38,4 52.1302 4933 52.1322 7281 222 (+) 13 819 (-)
ClaHyb_bighead_1_Chr_05 38,1 47.0334 15835 53.7219 4994 321 (+) 14 301 (-)
ClaHyb_bighead_1_Chr_06 37,4 52.5469 4370 50.6785 9921 282 (+) 14 0
ClaHyb_bighead_1_Chr_07 33,8 50.998 5644 48.4795 14888 924 (+) 15 629 (-)
ClaHyb_bighead_1_Chr_08 45,6 56.2767 2252 51.0889 10984 161 (+) 11 1237 (-)
ClaHyb_bighead_1_Chr_09 30 48.1331 9677 48.5674 12936 184 (+) 13 751 (-)
ClaHyb_bighead_1_Chr_10 29,8 54.8059 2065 49.5164 10314 316 (+) 12 664 (-)
ClaHyb_bighead_1_Chr_11 25 51.0866 4085 53.4888 3469 0 21 267 (-)
ClaHyb_bighead_1_Chr_12 31,6 48.5768 9241 47.4853 17546 559 (+) 13 0
ClaHyb_bighead_1_Chr_13 24,9 44.8933 16909 48.6904 10418 740 (+) 8220 (-)
ClaHyb_bighead_1_Chr_14 23,3 44.5272 17249 51.8503 4715 0 8190 (-)
ClaHyb_bighead_1_Chr_15 27,6 39.2488 68816 52.3266 4848 200 (+) 12 107 (-)
ClaHyb_bighead_1_Chr_16 25,2 51.936 3389 53.2655 3686 656 (+) 9818 (-)
ClaHyb_bighead_1_Chr_17 24 40.0722 49568 46.7745 15616 326 (+) 15 222 (-)
ClaHyb_bighead_1_Chr_18 25,2 52.6726 2856 52.3131 4582 632 (+) 11 973 (-)
ClaHyb_bighead_1_Chr_19 31,1 53.8536 2692 52.9809 4841 674 (+) 8751 (-)
ClaHyb_bighead_1_Chr_20 32,7 51.6697 4681 51.0227 7929 410 (+) 18 422 (-)
ClaHyb_bighead_1_Chr_21 29,2 43.0312 30479 51.9822 5662 805 (+) 16 253 (-)
ClaHyb_bighead_1_Chr_22 34,9 51.0828 5705 50.9355 8715 568 (+) 13 121 (-)
ClaHyb_bighead_1_Chr_23 31,7 42.7592 35295 48.8582 12797 931 (+) 81022 (-)
ClaHyb_bighead_1_Chr_24 29,6 50.1103 6080 52.8055 4829 851 (+) 10 159 (-)
ClaHyb_bighead_1_Chr_25 40,3 52.391 4875 52.1653 7582 0 9798 (-)
ClaHyb_bighead_1_Chr_26 40,1 53.1914 4035 51.6319 8529 1038 (+) 11 0
ClaHyb_bighead_1_Chr_27 30,5 48.7036 8641 48.5145 13324 1858 (+) 10 219 (-)
6.3 Structural Accuracy Assessment
Table 7 and Table 8 present structural accuracy metrics from CRAQ for both sub-genomes, including the Structural Accuracy Index (S-AQI), Regional Accuracy (R-AQI), and heterozygous loci for the structural and regional metrics (Avg.CSH/Avg.CRH). Note that CRH and CSH values are expected to be close to 0 due to the high genomic diversity between the two parental species (divergence > 10%, mash-based (Ondov et al., 2019), not shown here).
North African catfish scaffolds displayed high accuracy, with R-AQI values of 96.8–99.4% and an S-AQI of 100% for 21 of 28 scaffolds (minimum 90.65%), indicating few misassemblies at both large and regional scales. In contrast, bighead catfish scaffolds exhibited slightly lower accuracy, with R-AQI values of 92.4–98.5% and S-AQI values of 85.2–100%. This reduction in quality may be attributed to factors such as lower species heterozygosity, increased local assembly complexity due to transposable element content, or potential sequencing quality issues, possibly arising from degraded genomic DNA used in ONT sequencing.
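The AQI columns in Tables 7 and 8 appear to follow an exponential decay in the corresponding average error counts. The sketch below illustrates this relationship; the scaling factors (0.1 for the regional index and 1.0 for the structural index) are inferred from the tabulated values and are assumptions, not constants taken from the CRAQ documentation.

```python
import math

# Assumed scaling factors, inferred by fitting the values in Tables 7 and 8.
R_SCALE = 0.1  # regional index, applied to Avg.CRE
S_SCALE = 1.0  # structural index, applied to Avg.CSE


def aqi(avg_error_count, scale):
    """Accuracy index of the form 100 * exp(-scale * error count)."""
    return 100 * math.exp(-scale * avg_error_count)


# Avg.CRE of the first North African scaffold and Avg.CSE of the fourth.
print(round(aqi(0.078, R_SCALE), 2))
print(round(aqi(0.046, S_SCALE), 2))
```

Under these assumptions, an Avg.CRE of 0.078 maps to an R-AQI near 99.22 and an Avg.CSE of 0.046 to an S-AQI near 95.50, matching the corresponding table rows; small mismatches on other rows are expected because the tabulated counts are rounded.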
Table 7 North African subgenome structural accuracy metrics from CRAQ: S-AQI, R-AQI, and heterozygous regions (Avg.CSH/Avg.CRH).
Scaffold_name Mapping.rate Avg.CRH Avg.CRE Avg.CSE Regional-AQI Avg.CSH Structural-AQI
Sub-Genome_Chromosome ONT/PE (%) Count Count Count Accuracy (%) Count Accuracy (%)
Genome (55 Chromosomes) >97% 0.046 0.270 0.021 97.33 0.001 97.92
ClaHyb_african_1_Chr_01 0.992 0.020 0.078 0.000 99.22 0.000 100.00
ClaHyb_african_1_Chr_02 0.981 0.000 0.098 0.000 99.02 0.000 100.00
ClaHyb_african_1_Chr_03 0.955 0.033 0.154 0.000 98.47 0.000 100.00
ClaHyb_african_1_Chr_04 0.980 0.046 0.207 0.046 97.95 0.000 95.50
ClaHyb_african_1_Chr_05 0.986 0.048 0.097 0.000 99.04 0.000 100.00
ClaHyb_african_1_Chr_06 0.998 0.024 0.085 0.000 99.15 0.024 100.00
ClaHyb_african_1_Chr_07 0.978 0.000 0.120 0.000 98.81 0.000 100.00
ClaHyb_african_1_Chr_08 0.992 0.025 0.100 0.000 99.00 0.000 100.00
ClaHyb_african_1_Chr_09 0.992 0.076 0.083 0.027 99.17 0.000 97.34
ClaHyb_african_1_Chr_10 0.998 0.066 0.172 0.000 98.29 0.000 100.00
ClaHyb_african_1_Chr_11 0.905 0.030 0.188 0.000 98.13 0.000 100.00
ClaHyb_african_1_Chr_12 0.873 0.086 0.230 0.000 97.72 0.000 100.00
ClaHyb_african_1_Chr_13 0.917 0.075 0.119 0.000 98.81 0.000 100.00
ClaHyb_african_1_Chr_14 0.990 0.031 0.062 0.062 99.38 0.000 93.96
ClaHyb_african_1_Chr_15 0.922 0.076 0.152 0.000 98.49 0.000 100.00
ClaHyb_african_1_Chr_16 0.952 0.000 0.248 0.000 97.55 0.000 100.00
ClaHyb_african_1_Chr_17 0.997 0.033 0.083 0.000 99.18 0.000 100.00
ClaHyb_african_1_Chr_18 0.954 0.175 0.315 0.000 96.90 0.000 100.00
ClaHyb_african_1_Chr_19 0.998 0.000 0.294 0.033 97.10 0.000 96.79
ClaHyb_african_1_Chr_20 0.995 0.056 0.148 0.000 98.53 0.000 100.00
ClaHyb_african_1_Chr_21 0.986 0.033 0.131 0.098 98.70 0.000 90.65
ClaHyb_african_1_Chr_22 0.992 0.067 0.134 0.034 98.67 0.000 96.70
ClaHyb_african_1_Chr_23 0.998 0.058 0.186 0.038 98.16 0.000 96.23
ClaHyb_african_1_Chr_24 0.999 0.077 0.232 0.000 97.71 0.000 100.00
ClaHyb_african_1_Chr_25 0.989 0.039 0.118 0.000 98.83 0.000 100.00
ClaHyb_african_1_Chr_26 0.994 0.057 0.324 0.000 96.81 0.000 100.00
ClaHyb_african_1_Chr_27 0.980 0.042 0.166 0.000 98.35 0.000 100.00
ClaHyb_african_1_Chr_28 0.995 0.000 0.297 0.000 97.07 0.000 100.00
Table 8 Bighead subgenome structural accuracy metrics from CRAQ: S-AQI, R-AQI, and heterozygous regions (Avg.CSH/Avg.CRH).
Scaffold_name Mapping.rate Avg.CRH Avg.CRE Avg.CSE Regional-AQI Avg.CSH Structural-AQI
Sub-Genome_Chromosome ONT/PE (%) Count Count Count Accuracy (%) Count Accuracy (%)
Genome (55 Chromosomes) >97% 0.046 0.270 0.021 97.33 0.001 97.92
ClaHyb_bighead_1_Chr_01 0.981 0.000 0.329 0.000 96.76 0.000 100.00
ClaHyb_bighead_1_Chr_02 0.992 0.046 0.148 0.023 98.53 0.000 97.75
ClaHyb_bighead_1_Chr_03 0.980 0.050 0.276 0.000 97.28 0.000 100.00
ClaHyb_bighead_1_Chr_04 0.990 0.000 0.370 0.026 96.36 0.000 97.40
ClaHyb_bighead_1_Chr_05 0.972 0.053 0.360 0.160 96.46 0.000 85.22
ClaHyb_bighead_1_Chr_06 0.934 0.000 0.497 0.027 95.15 0.000 97.33
ClaHyb_bighead_1_Chr_07 0.990 0.030 0.403 0.030 96.05 0.000 97.06
ClaHyb_bighead_1_Chr_08 0.987 0.000 0.429 0.044 95.80 0.000 95.66
ClaHyb_bighead_1_Chr_09 0.986 0.051 0.422 0.067 95.87 0.000 93.47
ClaHyb_bighead_1_Chr_10 0.982 0.034 0.460 0.034 95.50 0.000 96.65
ClaHyb_bighead_1_Chr_11 0.961 0.042 0.382 0.042 96.26 0.000 95.92
ClaHyb_bighead_1_Chr_12 0.946 0.194 0.606 0.069 94.12 0.000 93.38
ClaHyb_bighead_1_Chr_13 0.989 0.041 0.509 0.081 95.04 0.000 92.19
ClaHyb_bighead_1_Chr_14 0.982 0.080 0.544 0.000 94.71 0.044 100.00
ClaHyb_bighead_1_Chr_15 0.922 0.136 0.506 0.039 95.07 0.000 96.19
ClaHyb_bighead_1_Chr_16 0.993 0.040 0.240 0.000 97.63 0.000 100.00
ClaHyb_bighead_1_Chr_17 0.985 0.042 0.793 0.042 92.38 0.000 95.87
ClaHyb_bighead_1_Chr_18 0.977 0.000 0.474 0.000 95.37 0.000 100.00
ClaHyb_bighead_1_Chr_19 0.992 0.000 0.341 0.000 96.65 0.000 100.00
ClaHyb_bighead_1_Chr_20 0.971 0.000 0.236 0.062 97.66 0.000 93.96
ClaHyb_bighead_1_Chr_21 0.977 0.064 0.591 0.000 94.26 0.000 100.00
ClaHyb_bighead_1_Chr_22 0.935 0.102 0.406 0.044 96.02 0.000 95.74
ClaHyb_bighead_1_Chr_23 0.990 0.000 0.303 0.032 97.02 0.000 96.87
ClaHyb_bighead_1_Chr_24 0.992 0.303 0.608 0.000 94.10 0.000 100.00
ClaHyb_bighead_1_Chr_25 0.990 0.062 0.350 0.025 96.56 0.000 97.53
ClaHyb_bighead_1_Chr_26 0.991 0.050 0.378 0.025 96.29 0.000 97.51
ClaHyb_bighead_1_Chr_27 0.979 0.033 0.447 0.000 95.63 0.000 100.00
GENOME ASSEMBLY, REVERSE VACCINOLOGY, AND
QUALITY BY DESIGN — Streptococcus iniae
Introduction
1. Streptococcus iniae as a Major Aquaculture Pathogen
Streptococcus iniae, first identified in an Amazon River dolphin (Inia geoffrensis) in the 1970s (Pier and Madin, 1976), has emerged as a major bacterial pathogen in global aquaculture. The pathogen causes annual losses exceeding USD 100 million worldwide (Shoemaker et al., 2001), with mortality rates of 30–80% during outbreaks (Chen et al., 2012) and cumulative mortality up to 70% after three months in some species (Mmanda et al., 2014). Detected across all continents (Baiano and Barnes, 2009; Mishra et al., 2018), S. iniae affects over 27 fish species (Agnew and Barnes, 2007), particularly in intensive aquaculture systems where high stocking densities and environmental stressors facilitate rapid disease transmission (Chen et al., 2012). Economically important species including Asian seabass (Lates calcarifer), tilapia (Oreochromis spp.), and catfish (Siluriformes spp.) are particularly susceptible (Azmai and Saad, 2011; Nawawi et al., 2008). While rare, zoonotic transmission to humans handling infected fish has been documented (Facklam et al., 2005).
Figure 26 Phase contrast micrograph of S. iniae strain QMA0076, showing characteristic spherical morphology. Credit: Baiano et al. (2008).
2. Taxonomic Classification and Molecular Characteristics
S. iniae belongs to the phylum Bacillota (formerly Firmicutes), comprising Gram-positive bacteria with thick peptidoglycan cell walls. Within the genus Streptococcus, S. iniae shares molecular mechanisms with the human pathogens S. pyogenes and S. pneumoniae, particularly the sortase A pathway for cell wall protein anchoring.
Taxonomic hierarchy:
Domain: Bacteria
Phylum: Bacillota
Class: Bacilli
Order: Lactobacillales
Family: Streptococcaceae
Genus: Streptococcus
Species: S. iniae
Key virulence factors identified include M-like protein (SimA), hyaluronidase (Hyl), enolase (eno), and glyceraldehyde-3-phosphate dehydrogenase (GAPDH). Notably, GAPDH localizes to the cell surface despite its primary glycolytic function, enhancing host cell adherence and immune evasion. Sortase A (SrtA) anchors these surface proteins to the cell wall through a conserved mechanism shared with other pathogenic streptococci. Despite these molecular insights, knowledge gaps remain regarding genomic variability, horizontal gene transfer, and antimicrobial resistance mechanisms.
3. Current Disease Management Challenges
Disease control in aquaculture relies primarily on broad-spectrum antibiotics (Schar et al., 2020, 2021), with vaccines serving as complementary prophylactic measures. However, vaccine adoption remains limited, particularly for low-value freshwater species. In Thailand, no commercial vaccines exist for S. iniae in barramundi (Kayansamruaj et al., 2020). The economic barrier is substantial: vaccines cost approximately 100× more than antibiotics (Hoelzer et al., 2018; Pham et al., 2015), hindering adoption despite their potential to reduce antibiotic dependence and environmental impact.
Existing whole-cell or polysaccharide-based vaccines provide only short-term protection (4–6 months) before losing efficacy due to antigenic variation (Eldar et al., 1997; Eyngor et al., 2008; Tanpichai et al., 2023). Capsular polysaccharides, traditionally considered key protective antigens, prove unstable targets as S. iniae rapidly alters surface structures under immune pressure. Moreover, genes responsible for capsule production belong to the accessory pangenome and are inconsistently present across strains (Millard et al., 2012), limiting broad protection.
Traditional diagnostic methods (bacterial culture, ELISA, histopathology) lack specificity, while molecular approaches (PCR, MLST) can differentiate S. iniae from other streptococci (Glazunova et al., 2009) but require whole-genome sequencing for accurate genotyping.
4. Study Objectives
Developing effective vaccines for S. iniae requires addressing both biological and manufacturing challenges. Candidate antigens must be conserved across strains, immunogenic, and compatible with scalable production processes. This is particularly crucial for low- to medium-value aquaculture species, where vaccines must be cost-competitive with antimicrobial treatments. The Quality by Design (QbD) framework (Yu et al., 2014), established in pharmaceutical manufacturing, provides a systematic methodology to connect antigen properties with downstream manufacturability, yet its application to reverse vaccinology remains largely unexplored.
This study aimed to: (1) sequence and assemble complete genomes of pathogenic S. iniae isolated from diseased Asian seabass in Thai aquaculture systems; (2) apply integrated reverse vaccinology and QbD approaches to identify conserved, immunogenic antigens; and (3) classify candidates by manufacturability criteria aligned with multiple production platforms. Using isolate SIKU01 for complete genome assembly and in silico screening, we developed an "omics-to-manufacturing" pipeline that provides a practical framework for developing cost-effective vaccines against S. iniae and other economically important aquaculture pathogens.
Methods
1. Bacterial Isolation and Cultivation
Five Streptococcus iniae isolates (designated SIKU01–SIKU05) were obtained from diseased Asian seabass (Lates calcarifer) exhibiting clinical signs of streptococcosis. All isolates originated from a single farm located in Chachoengsao Province, Thailand. Pure bacterial cultures were maintained in Tryptic Soy Broth (TSB) containing 25% glycerol at -80°C for long-term storage and subsequent molecular confirmation and genomic DNA extraction.
2. Genomic DNA Extraction and Sequencing
Genomic DNA extraction from S. iniae cells involved cultivating bacteria in TSB to logarithmic phase, followed by cell harvesting and lysis. DNA purification was performed using a column-based method with silica membranes (QIAamp DNA Mini Kit, Qiagen, Germany) according to the manufacturer's protocol with minor modifications for Gram-positive bacteria. The bacterial cell pellets were subjected to enzymatic lysis using hen lysozyme (20 mg/mL) and overnight incubation at 37°C to digest the peptidoglycan layer, which is particularly thick in Gram-positive bacteria. Following enzymatic treatment, a detergent-based lysis buffer containing proteinase K was added to complete cell disruption and protein degradation.
The quality and quantity of extracted DNA were assessed using spectrophotometry (NanoDrop 2000, Thermo Fisher Scientific, USA). High Molecular Weight (HMW) DNA integrity was verified by agarose gel electrophoresis. DNA library preparation and quality control were performed according to standard protocols for Illumina sequencing. Sequencing was performed on the Illumina HiSeq 2000 platform (Illumina, USA) using paired-end 151 bp reads, generating approximately 3 million paired-end reads per sample.
3. Genome Assembly and Quality Control
3.1 De Novo Assembly
Initial quality assessment of raw Illumina paired-end reads was performed using FastQC 0.11.9 (Andrews, 2010) to evaluate read quality, GC content, and adapter contamination. SeqPurge 2019 (Sturm et al., 2016) was used to trim adapters and low-quality bases using a sliding window approach, removing bases with quality scores below Q20. De novo assembly was conducted with Unicycler 0.4.9 (Wick et al., 2017), which employs SPAdes 3.15.2 (Bankevich et al., 2012) as its assembly algorithm. Resulting assembly graphs were visualized using Bandage 0.8.1 (Wick et al., 2015) to assess connectivity and identify potential misassemblies. Assembly quality metrics were evaluated using QUAST 5.0.2 (Quast et al., 2013).
3.2 Reference-Guided Assembly
Multiple reference strains (QMA0248, 89353, SF1) were obtained from NCBI RefSeq (Supplementary Data 13). Reference-guided assembly was performed by mapping trimmed reads to each reference strain genome using Bowtie2 2.4.4 (Langmead and Salzberg, 2012) with the '--fast-local' option. To assess synteny and structural variations across strains, assembly contigs were aligned against all reference genomes using ProgressiveMauve 2.4.0 (Darling et al., 2010).
of sequence variability, conservation, and structure-based mapping.
All MSAs were analyzed in R (Supplementary Code 12) using a custom script, which calculated normalized Shannon entropy (H.norm) at both codon and amino acid levels via the Bio3D R package (Grant et al., 2006). For each orthogroup, H.norm quantified sequence conservation on a continuous scale from 0 (fully conserved) to 1 (maximally variable). Antigens with median H.norm ≤0.05 were classified as highly conserved, while those above 0.90 were excluded as excessively variable. These conservation metrics, together with gene-carriage data from Panaroo, defined the gene-carriage and sequence-variability components of the Quality-by-Design (QbD) framework (Pre-M1 and M1 matrices) (Supplementary Data 14).
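The entropy scale described above (0 = fully conserved, 1 = maximally variable) can be illustrated with a minimal Python sketch. It is not the custom R script or the Bio3D implementation: the function names, the 20-letter alphabet, the gap handling, and the toy alignment are all assumptions, and the orientation of Bio3D's own H.norm output should be checked against its documentation.

```python
import math
from collections import Counter
from statistics import median


def normalized_entropy(column, alphabet_size=20, gap_chars="-."):
    """Shannon entropy of one alignment column divided by log2(alphabet_size).

    Returns 0.0 for a fully conserved column and 1.0 when all alphabet
    letters are equally frequent. Gap characters are ignored (an assumption).
    """
    residues = [c for c in column.upper() if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    freqs = [n / total for n in Counter(residues).values()]
    entropy = sum(-p * math.log2(p) for p in freqs)
    return entropy / math.log2(alphabet_size)


def column_entropies(aligned_seqs, alphabet_size=20):
    """Normalized entropy for every column of an equal-length alignment."""
    return [normalized_entropy("".join(col), alphabet_size)
            for col in zip(*aligned_seqs)]


# Hypothetical three-sequence alignment.
toy_msa = ["MKV", "MKI", "MRV"]
scores = column_entropies(toy_msa)
print([round(s, 3) for s in scores], "median:", round(median(scores), 3))
```

A gene would then be labeled highly conserved when the median of its column scores is at most 0.05, following the threshold stated above.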
11. Protein Structure Prediction and Visualization
Three-dimensional structures of prioritized S. iniae antigens (GAPDH, enolase, GroEL, Sortase A) were predicted using AlphaFold2 (Jumper et al., 2021) and cross-validated against available homologous templates in the Protein Data Bank (PDB) (Berman, 2000). Predicted PDB files were examined in UCSF ChimeraX 1.6 (Meng et al., 2023) for fold integrity, residue geometry, and solvent accessibility.
To quantify site-specific structural conservation, we implemented a per-residue identity analysis (Supplementary Code 12) that scanned each aligned FASTA previously produced by MAFFT. For every alignment, the script compared all sequences to the reference (first entry) and recorded, for each residue position, the number and percentage of identical residues across all genomes. These per-site identity profiles were mapped onto AlphaFold2 models in ChimeraX using custom color scripts to generate gradient-based visualizations where blue indicated highly conserved residues and red denoted variable positions. This approach provided a direct link between MSA entropy (H.norm) values and 3D spatial conservation, highlighting structurally stable and immunologically relevant regions.
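The per-residue identity analysis described above can be sketched as follows. This is a minimal illustration and not Supplementary Code 12: the function name, the treatment of reference gaps, and the toy alignment are assumptions.

```python
def per_site_identity(aligned_seqs):
    """Compare every aligned sequence to the reference (first entry).

    Returns one tuple per reference residue:
    (reference position, reference residue, identical count, percent identical).
    Columns where the reference has a gap are skipped (an assumption), so
    positions follow the ungapped reference numbering.
    """
    reference = aligned_seqs[0]
    n_seqs = len(aligned_seqs)
    profile = []
    ref_pos = 0
    for col, ref_res in enumerate(reference):
        if ref_res == "-":
            continue
        ref_pos += 1
        identical = sum(1 for seq in aligned_seqs if seq[col] == ref_res)
        profile.append((ref_pos, ref_res, identical, 100.0 * identical / n_seqs))
    return profile


# Hypothetical alignment; the first sequence plays the role of the reference.
toy_alignment = ["MK-V", "MR-V", "MKAV"]
for pos, res, count, pct in per_site_identity(toy_alignment):
    print(pos, res, count, f"{pct:.1f}%")
```

The percent-identity column is the kind of per-site value that could be written to a structure attribute file and used as a color gradient in a viewer.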
Epitope localization was performed using a custom Python script (Supplementary Code 12), which scanned PDB chains for exact matches to epitope peptide sequences identified by IEDB mapping. For each hit, the script reported the model ID, chain ID, and PDB residue range corresponding to the matched peptide, enabling precise overlay of B- and T-cell epitopes onto predicted antigen structures. Surface visualization and electrostatic potential mapping were carried out in ChimeraX using "surface color" and "cartoon" representations to assess accessibility and topological context.
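The exact-match epitope scan described above can be sketched without any structure-parsing library by representing a chain as a list of (residue number, one-letter code) pairs. The function name and the toy chain are hypothetical; the actual script additionally reads model and chain identifiers from PDB files.

```python
def find_epitope_ranges(chain_residues, peptide):
    """Locate exact occurrences of `peptide` in one chain.

    chain_residues: list of (residue_number, one_letter_code) in chain order.
    Returns a list of (first_residue_number, last_residue_number) per hit.
    """
    if not peptide:
        raise ValueError("peptide must not be empty")
    sequence = "".join(aa for _, aa in chain_residues)
    hits = []
    start = sequence.find(peptide)
    while start != -1:
        end = start + len(peptide) - 1
        hits.append((chain_residues[start][0], chain_residues[end][0]))
        start = sequence.find(peptide, start + 1)  # allow overlapping hits
    return hits


# Hypothetical chain whose residue numbering starts at 5.
toy_chain = list(enumerate("MKTAYIAKQR", start=5))
print(find_epitope_ranges(toy_chain, "AYIA"))
```

Reporting residue numbers rather than string offsets keeps the result valid for chains whose numbering does not start at 1.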
12. Codon Adaptation Index (CAI) and Expression Compatibility
Codon usage frequencies of target organisms were obtained from the Kazusa Codon Usage Database (Holcomb et al., 2019), available at https://www.kazusa.or.jp/codon/. Species-specific codon frequency tables were downloaded (Supplementary Data 15), providing frequencies per thousand codons for each organism's coding sequences. The Codon Adaptation Index (CAI) quantifies the degree of preference for synonymous codons in a given organism, where values range from 0 (least preferred) to 1 (most preferred).
For each amino acid family, the relative adaptiveness of codon $i$ was calculated as $w_i = f_i / f_{\max}$, where $f_i$ represents the frequency per thousand of codon $i$ for its amino acid and $f_{\max}$ represents the frequency per thousand of the most frequently used codon for that amino acid. For example, proline codons in Oreochromis niloticus were calculated as follows: CCG had $w_i = 7.39/16.53 = 0.447$, CCA had $w_i = 14.59/16.53 = 0.883$, CCT had $w_i = 16.53/16.53 = 1.000$ (optimal), and CCC had $w_i = 14.91/16.53 = 0.902$. The value 16.53 represents the highest frequency among all proline codons, making CCT the optimal reference codon.
For complete gene sequences, CAI was computed as the geometric mean of relative adaptiveness values using the formula:
$$\mathrm{CAI} = \left( \prod_{k=1}^{L} w_k \right)^{1/L}$$
where the product encompasses all $L$ sense codons in the gene sequence. Stop codons (UAA, UAG, UGA) and the single methionine codon (AUG) were excluded from CAI calculations as they lack synonymous alternatives for optimization.
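The relative adaptiveness and CAI definitions above can be checked with a short sketch. The proline frequencies are those quoted in the text; the function names and the toy codon list are hypothetical, and only the proline family is loaded, so all other codons are skipped here.

```python
import math

# Frequencies per thousand codons for proline in O. niloticus, as quoted above.
PROLINE_FREQ = {"CCG": 7.39, "CCA": 14.59, "CCT": 16.53, "CCC": 14.91}


def relative_adaptiveness(family_freqs):
    """w_i = f_i / f_max within one synonymous codon family."""
    f_max = max(family_freqs.values())
    return {codon: freq / f_max for codon, freq in family_freqs.items()}


def cai(codons, weights):
    """Geometric mean of w over the codons that have a weight.

    Codons without an entry in `weights` (here ATG, the stop codon, and any
    family that was not loaded) are skipped.
    """
    logs = [math.log(weights[c]) for c in codons if c in weights]
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


weights = relative_adaptiveness(PROLINE_FREQ)
toy_gene = ["ATG", "CCT", "CCG", "CCA", "TAA"]  # hypothetical sequence
for codon, w in sorted(weights.items()):
    print(codon, round(w, 3))
print("CAI:", round(cai(toy_gene, weights), 3))
```

Summing logarithms rather than multiplying the weights directly avoids numerical underflow for genes with hundreds of codons.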
13. Quality by Design (QbD) Framework
To implement QbD principles effectively within the reverse vaccinology framework, specific criteria were established to guide systematic antigen selection. These criteria served as the foundation for developing a quantitative scoring matrix computed in a custom script (Supplementary Code 12) that ensures both immunological efficacy and manufacturing feasibility.
13.1 QbD Design Space: Gene Carriage (Matrix Pre-M1)
The gene carriage design space defines the level of genomic conservation of an antigen across the S. iniae pangenome. This Critical Quality Attribute (CQA) prioritizes stable and widely distributed antigens to prevent loss of vaccine efficacy in heterologous strains. Based on the pan-genome presence/absence matrix, the core genome (present in ≥99% of isolates) was retained whereas soft-core genome (95–98%), shell (15–94%), and cloud (<15%) genes were excluded from further evaluation. This filtering yielded 1,538 core and 110 soft-core genes (Supplementary Data 15), forming the foundation for downstream antigen variability and QbD scoring due to their broad distribution and genetic stability.
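The carriage classes above can be assigned from a presence/absence matrix with a simple threshold function. The sketch below is illustrative: the function name is hypothetical, and the half-open boundaries (for example, soft-core as 95% up to but not including 99%) are an assumption made to close the gap between the ranges quoted in the text.

```python
def carriage_class(n_present, n_isolates):
    """Classify a gene by the percentage of isolates that carry it."""
    if n_isolates <= 0:
        raise ValueError("n_isolates must be positive")
    pct = 100.0 * n_present / n_isolates
    if pct >= 99:
        return "core"
    if pct >= 95:
        return "soft-core"
    if pct >= 15:
        return "shell"
    return "cloud"


# Hypothetical gene counts in a collection of 100 isolates.
for present in (100, 96, 50, 5):
    print(present, carriage_class(present, 100))
```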
13.2 QbD Design Space: Amino Acid and Nucleotide Variability (Matrix 1)
The normalized Shannon entropy (H.norm) defined the antigen variability design space, quantified from per-cluster MSAs. Each residue received two complementary measures: (1) H.norm from sequence alignments (0 = fully conserved, 1 = maximally variable) and (2) percent identity (%ID) relative to the amino acid reference sequence (SIKU01). These values were averaged per gene to yield gene-wise conservation indices subsequently integrated into the QbD variability matrix (M1) (Supplementary Data 15). Genes with median H.norm ≤0.05 and mean %ID ≥95% were prioritized as structurally and functionally stable antigens. Residues with low entropy and high surface exposure (as confirmed by ChimeraX mapping) defined the structural conservation-driven QbD design space, used to constrain antigen selection and ensure manufacturing consistency across S. iniae lineages.
13.3 QbD Design Space: General Physicochemical and Expression Characteristics (Matrix 1)
The protein length design space favors candidates with 100 or more amino acids to ensure sufficient epitope diversity while avoiding overly short sequences that lack functional relevance. Proteins containing 50–99 amino acids are classified as too short and excluded due to instability concerns, while those with fewer than 50 amino acids face exclusion for insufficient immunogenic potential.
Molecular weight constraints target the 20–80 kDa range to balance immune system interaction with proper folding capabilities, with the optimal 20–50 kDa range receiving highest priority for E. coli expression systems.
The isoelectric point (pI) design space recognizes multiple optimal ranges depending on purification strategy. Proteins with pI values between 4.0–7.5 receive priority due to superior solubility characteristics and broad purification platform compatibility. The pI range of 7–9 provides a near-neutral to net positive charge under physiological conditions (pH 7.4), facilitating purification via histidine tag chromatography or silicon dioxide-based filters as specified in the foundational selection criteria. Additionally, proteins with pI values between 2–4 offer specialized advantages for silica-based purification systems through enhanced electrostatic interactions.
Hydrophobicity design space assessment relies on the GRAVY index, which prioritizes proteins within the -0.5 to +0.5 range to minimize aggregation tendencies and support aqueous solubility throughout the manufacturing process.
Protein stability design space relies on the instability index (II) calculated from primary sequence data, with values of 40 or lower indicating favorable in vitro stability characteristics.
Literature support provides empirical validation through PubMed literature and PMID, with proteins having prior experimental mentions, particularly as vaccine candidates or virulence factors, receiving graduated positive scoring based on evidence strength.
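Two of the sequence-derived filters above, protein length and GRAVY, can be sketched directly from the primary sequence. The sketch uses the Kyte-Doolittle hydropathy scale, on which the GRAVY index is defined; the function names and the toy sequences are hypothetical, and the remaining descriptors (molecular weight, pI, instability index) are not implemented here.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Grand average of hydropathy: mean Kyte-Doolittle value per residue."""
    sequence = sequence.upper()
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def length_class(n_residues):
    """Length categories as described in the protein length design space."""
    if n_residues >= 100:
        return "retained"
    if n_residues >= 50:
        return "too short"
    return "insufficient immunogenic potential"


def within_gravy_window(sequence, low=-0.5, high=0.5):
    """True when the GRAVY index falls inside the prioritized window."""
    return low <= gravy(sequence) <= high


# Hypothetical peptide, far shorter than any real candidate antigen.
toy = "ASGASGASG"
print(round(gravy(toy), 3), within_gravy_window(toy), length_class(len(toy)))
```

Non-standard residue codes (such as X or B) raise a KeyError in this sketch, which makes unexpected input visible rather than silently skewing the average.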
13.4 QbD Design Space: Immunogenicity (Matrix 1)
Epitope prediction analysis identifies the presence of known sequences that stimulate T and B lymphocytes, considering the 30% similarity between fish and human immunoglobulins in the scoring framework. Proteins predicted to contain both T- and B-cell epitopes (inferred from IEDB) receive maximum immunological relevance scoring, while those lacking predicted epitopes face penalty assessment.
13.5 QbD Design Space: Host-specific Expression (Matrix 1 – CAI)
In vitro expression efficiency design space prediction utilizes the Codon Adaptation Index (CAI) to assess translational compatibility with E. coli systems, following the principle of optimal codon usage for maximizing protein expression. Top-quartile CAI values receive positive weighting, while bottom-quartile scores indicate potential expression difficulties (Supplementary Data 15–15).
13.6 QbD Design Space: Matrix 2 – Secondary Selection
In Matrix 2, candidate antigens from Matrix 1 were screened multiple times for compatibility with specific downstream purification modalities by aligning their intrinsic molecular descriptors with platform-specific Critical Process Parameters (CPPs). Five distinct sets of design spaces were evaluated: silica affinity, cellulose affinity, ion exchange chromatography (anion exchange (AEX) and cation exchange (CEX)), and plasmid DNA (pDNA) expression. This pre-downstream design space cross-references in silico Critical Quality Attributes (CQAs) with platform-specific CPPs to ensure process–antigen compatibility before physical development for maximum cost-efficiency post-biomanufacturing (Supplementary Data 15–15).
Silica Affinity Purification: Silica-based purification technologies offer
ubiquity, low cost, and adaptability across laboratory and industrial bioprocessing
applications. Critical material attributes (CMAs) include silanol density, surface topology,
and functionalization chemistry, while key CPPs encompass pH, ionic strength, and
buffer composition (Supplementary Data 15). Affinity tags with high silica specificity,
including Si-tag, SB7, Car9, R5, and the synthetic octapeptide (RH)4, composed of four
repeating Arginine-Histidine units, interact through electrostatic and hydrogen-bonding
mechanisms with silanol-rich matrices. Proteins presenting a net positive charge at pH
7.4 demonstrate enhanced adsorption due to electrostatic attraction to partially
deprotonated silanol groups (≡Si–OH → ≡Si–O⁻). QbD selection criteria required pI > 4 and
a surface-exposed cationic patch at loading pH (offered by the fusion of a short
silica-binding peptide tag). Elution is achieved by increasing pH from 7.4 to 8.5, expanding
silanol deprotonation and reducing hydrogen-bond donor capacity. Short tag lengths
(~7–20 aa) were favored to avoid steric hindrance and folding interference
(Supplementary Data 15).
Cellulose Affinity Purification: Cellulose-based purification uses
cellulose-binding modules (CBMs) such as CBM3, CBM9, or CelD, which recognize repeating
β-(1→4)-D-glucopyranose units through hydrogen bonding with hydroxyl groups and
hydrophobic stacking with planar glucan rings. Key CPPs include buffer pH (6.0–8.5),
salt concentration, cellulose type, and contact time. Since binding is mediated by the
CBM domain rather than antigen charge properties, no pI requirement was applied.
However, CBMs add substantial polypeptide segments (30–200 aa), necessitating length
constraints. Antigens were preferentially kept ≤400 aa to ensure total fusion constructs
remained ≤430–630 aa, limiting metabolic burden, misfolding risk, and aggregation
while preserving tag accessibility (Supplementary Data 15).
Ion Exchange Chromatography (IEX): Ion exchange utilizes the net
charge of a protein to separate it from other proteins. Depending on the type of resin and
the charge of the protein, either anion exchange or cation exchange chromatography
may be used. Anion exchange chromatography was modeled
on quaternary amine functional groups (e.g., –N⁺(CH₃)₃) binding negatively charged
proteins through electrostatic interaction with deprotonated carboxyl groups. The QbD
filter retained proteins with pI ≤ 6.5 and net charge ≤ −1 at pH 7.4 (Supplementary
Data 15). Cation exchange chromatography employed sulfonate ligands (–SO₃⁻) that
bind protonated amines from Lys, Arg, and His side chains. Selection criteria favored
proteins with pI ≥ 8.0 and net charge ≥ +1 at pH 7.4, ensuring positive charge under
working conditions for effective resin binding (Supplementary Data 15).
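The two ion-exchange gates described above reduce to a pair of threshold checks. The following Python sketch is illustrative only: the function name, constant names, and example proteins are hypothetical, and only the numeric cutoffs (pI and net charge at pH 7.4) are taken from the text.

```python
from typing import Optional

# Cutoffs quoted in the text; the names are illustrative assumptions.
AEX_MAX_PI = 6.5        # anion exchange: pI ceiling
AEX_MAX_CHARGE = -1.0   # anion exchange: net charge ceiling at pH 7.4
CEX_MIN_PI = 8.0        # cation exchange: pI floor
CEX_MIN_CHARGE = 1.0    # cation exchange: net charge floor at pH 7.4


def iex_route(pi: float, net_charge_ph74: float) -> Optional[str]:
    """Return "AEX", "CEX", or None for a protein's pI and net charge at pH 7.4."""
    if pi <= AEX_MAX_PI and net_charge_ph74 <= AEX_MAX_CHARGE:
        return "AEX"
    if pi >= CEX_MIN_PI and net_charge_ph74 >= CEX_MIN_CHARGE:
        return "CEX"
    return None  # neither gate passed


# Hypothetical (pI, net charge) pairs, for illustration only.
candidates = {"acidic": (5.1, -7.8), "basic": (9.3, 5.2), "neutral": (7.0, -0.3)}
routes = {name: iex_route(pi, z) for name, (pi, z) in candidates.items()}
```

Because the two gates cannot both be satisfied by one protein, the order of the checks does not affect the result.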
Plasmid DNA Expression: Plasmid DNA expression was treated as a
distinct design space, optimizing antigens for high-yield, high-fidelity eukaryotic
expression in DNA vaccine formats. Sequence-level CQAs included GC3 enrichment to
improve mRNA stability and translation efficiency, an ORF length threshold (set at the
median coding sequence length of core genes, ≈2.2 kb), and complete absence of
internal Type IIS restriction sites (BsaI, BsmBI, Eco31I) to ensure seamless Golden Gate
Assembly cloning compatibility (Supplementary Data 15–15).
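The sequence-level gates above (GC3, ORF length, and absence of Type IIS sites) can be sketched as follows. This is a minimal illustration, not the scripts used in this work: all function names are hypothetical, the GC3 cutoff is passed in as a parameter rather than assumed, and the recognition sequences are the standard ones for BsaI/Eco31I (GGTCTC) and BsmBI (CGTCTC), searched on both strands.

```python
from typing import List

# Recognition sequences (5'->3'); Eco31I is an isoschizomer of BsaI.
TYPE_IIS_SITES = {"BsaI/Eco31I": "GGTCTC", "BsmBI": "CGTCTC"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def internal_type_iis_sites(orf: str) -> List[str]:
    """Names of enzymes whose site occurs on either strand of the ORF."""
    orf = orf.upper()
    return [name for name, site in TYPE_IIS_SITES.items()
            if site in orf or reverse_complement(site) in orf]


def gc3(orf: str) -> float:
    """Fraction of G or C at third codon positions (complete codons only)."""
    orf = orf.upper()
    usable = len(orf) - len(orf) % 3
    third = orf[2:usable:3]
    if not third:
        raise ValueError("ORF must contain at least one complete codon")
    return sum(base in "GC" for base in third) / len(third)


def passes_pdna_gates(orf: str, gc3_min: float, max_len_nt: int = 2200) -> bool:
    """Apply the GC3, length, and Type IIS gates in sequence."""
    return (gc3(orf) >= gc3_min
            and len(orf) <= max_len_nt
            and not internal_type_iis_sites(orf))
```

Searching the reverse complement matters because Type IIS recognition sites are non-palindromic, so a site on the antisense strand would otherwise be missed.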
14. Statistical Analysis
All statistical analyses were performed in R version 4.2.0. Spearman's rank
correlation coefficients (ρ) were calculated using the cor() function with the option
method = "spearman". Statistical significance of correlations was assessed using the
rcorr() function from the Hmisc package, with significance set at p < 0.05. For the
correlation between hydrophobicity and aliphatic index (ρ = 0.82, p < 10⁻¹⁵),
significance was calculated using cor.test(). While no formal corrections for multiple
testing were applied in this exploratory analysis, we note that key correlations (e.g.,
hydrophobicity-aliphatic index, ρ = 0.82) would remain significant even after
Bonferroni correction.
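For readers who want to see what the rank correlation and the Bonferroni argument amount to, the following Python sketch computes Spearman's ρ as the Pearson correlation of tie-averaged ranks and checks a p-value against a Bonferroni-adjusted threshold. It is an illustration under stated assumptions, not the R code used here: the function names are hypothetical and the number of tests is a free parameter.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined for constant input")
    return cov / (sx * sy)


def survives_bonferroni(p_value: float, n_tests: int, alpha: float = 0.05) -> bool:
    """True if p_value is at or below the Bonferroni-adjusted threshold."""
    return p_value <= alpha / n_tests
```

With p < 10⁻¹⁵ and α = 0.05, the adjusted threshold α/m stays above the p-value for any number of tests m up to 5 × 10¹³, which is why the correlation remains significant under correction.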
Kernel density estimation was performed using the density() function with
512 interpolation points. Density contours in scatterplots were generated using
ggplot2's stat_density_2d() with default bandwidth selection. All histograms used 50
bins for consistency across distributions.
Normalized Shannon entropy for sequence conservation was calculated for
proteins present in at least 95% of genomes as $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the
frequency of amino acid $i$ at each position. Normalized entropy (H.norm) was calculated
as $H / \log_2(20)$ to scale values between 0 (fully conserved) and 1 (maximum variation),
using the Bio3D R package (Grant et al., 2006).
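The per-position calculation can be sketched in a few lines of Python. This follows the formula as stated above rather than the Bio3D implementation; the function name is hypothetical, and skipping gap ("-") and unknown ("X") characters is an assumption made for this illustration.

```python
import math
from collections import Counter


def normalized_entropy(column: str, alphabet_size: int = 20) -> float:
    """Shannon entropy of one alignment column, scaled by log2(alphabet_size).

    Gap ('-') and unknown ('X') characters are ignored; this handling is an
    assumption of the sketch, not a documented choice of the analysis.
    """
    residues = [aa for aa in column.upper() if aa not in "-X"]
    if not residues:
        raise ValueError("column contains no scorable residues")
    n = len(residues)
    h = sum(-(c / n) * math.log2(c / n) for c in Counter(residues).values())
    return h / math.log2(alphabet_size)
```

A column with a single residue type scores 0, and a column in which all 20 amino acids are equally frequent scores 1.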
Quartile thresholds for codon adaptation index (CAI) and GC3 content were
calculated using quantile() at probabilities 0.25 and 0.75. All thresholds were calculated
on the Pre-M1 filtered dataset (n = 1,374 proteins) to ensure consistency across platforms.
For multimodal distribution detection, hierarchical density-based clustering (HDBSCAN)
was applied with minPts = 40 using the dbscan package. Statistical summaries
(mean, median, standard deviation) were calculated for all biophysical properties. The
dataset M0, comprising the full SIKU01 proteome (N = 1,855 proteins) (Supplementary
Data 15), and the Pre-M1 filtered subset (n = 1,374 proteins) (Supplementary Data 15)
provided adequate statistical power for correlation analyses.
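The quartile thresholds described above can be reproduced outside R. In the Python sketch below, statistics.quantiles with method="inclusive" uses the same linear interpolation as R's default quantile() (type 7); the function names and example scores are hypothetical and serve only to illustrate the top-quartile and bottom-quartile labelling.

```python
import statistics
from typing import Sequence, Tuple


def quartile_thresholds(values: Sequence[float]) -> Tuple[float, float]:
    """Return (Q1, Q3) using inclusive linear interpolation (R type 7)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q1, q3


def quartile_label(value: float, q1: float, q3: float) -> str:
    """Label a score as top-quartile, bottom-quartile, or in between."""
    if value >= q3:
        return "top"
    if value <= q1:
        return "bottom"
    return "middle"


# Hypothetical CAI scores, for illustration only.
cai_scores = [0.2, 0.4, 0.6, 0.8, 1.0]
q1, q3 = quartile_thresholds(cai_scores)
labels = [quartile_label(v, q1, q3) for v in cai_scores]
```

Computing the thresholds once on a fixed reference set, as the text does with the Pre-M1 dataset, keeps the labels comparable across platforms.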
Results
1. Genome Assembly and Annotation Results
1.1 Genome Assembly of Streptococcus iniae strain SIKU01
We assembled a complete, circular genome of S. iniae strain SIKU01 (2.09
Mb, 37% GC) with high completeness and no plasmids detected (Figs. 27a–b;
Appendix Figure 33). Whole-genome comparisons with reference S. iniae strains QMA0248
(Alsheikh-Hussain et al., 2022), 89353 (Gong et al., 2017), SF1 (Zhang et al., 2014b),
and LSSM211007Si confirmed strong macrosyntenic conservation (Fig. 27c–d). We
also analyzed synteny relative to the original Amazon dolphin isolate QMA0141 from
1976 (Pier and Madin, 1976) (Appendix Figure 34).
1.2 Functional Annotation of Streptococcus iniae strain SIKU01
Annotation of strain SIKU01 identified 2,004 genes, including 1,855
protein-coding sequences and 77 RNAs (Supplementary Data 13). Functional classification with
InterProScan (Quevillon et al., 2005), KEGG Mapper (Kanehisa and Sato, 2020), and
Gene Ontology (GO) (Ashburner et al., 2000) (Supplementary Data 13–13) revealed
a broad repertoire of metabolic enzymes, transporters, DNA-repair proteins, and
cell-surface factors, along with subsets of stress-response proteins, toxins, and
antimicrobial resistance determinants (Figure 27e). This comprehensive genome annotation
provided the foundation for proteome-wide reverse vaccinology and subsequent
Quality-by-Design (QbD) manufacturability screening of candidate vaccines (Supplementary
Data 15–15).
Figure 27 Complete genome assembly and functional annotation of Streptococcus iniae strain SIKU01. (a) Genome assembly and reference-guided
workflow from raw reads to validated circular chromosome. (b) Assembly and annotation statistics, including genome size, GC content, gene counts, and
RNA features. (c) Comparative macrosynteny of SIKU01 against public S. iniae reference strains, showing conserved genomic architecture. (d) Metadata
of strain SIKU01 and related isolates, including collection date, country, and host species, used in hybrid reference-guided assembly. (e) Functional
annotation of the SIKU01 proteome, including KEGG Mapper categories, Gene Ontology subcellular localization, and InterProScan domain assignments.
ion were sufficiently basic to survive cation exchange (CEX). This was reflected in the
survivor pools (Fig. 3.5), with 98 proteins retained by AEX and only 20 by CEX.
Buffer-specific ranges (Figs. 3.5–3.5) confirmed that AEX survivors were stable across nearly
all chemistries (n = 82–98 proteins), while CEX buffers consistently yielded the same
restricted set (n = 20 proteins). The full antigen lists for AEX and CEX are provided in
(Supplementary Data 15-15).
Affinity purification routes imposed distinct orthogonal constraints:
cellulose excluded proteins longer than 400 aa (Fig. 3.5), because the CBM fusion tag is
itself large and places a metabolic burden on recombinant expression; minimizing the
size of the fused antigen reduces energy demand and steric hindrance, thereby improving
folding and yield. In contrast, silica affinity (Fig. 3.5) was governed by electrostatic
interactions with surface silanol groups: proteins with pI between 7 and 9 acquire a net
positive charge at physiological pH, enabling stable adsorption to negatively charged silica.
These filters reduced the pools to 100 cellulose-compatible and 49 silica-compatible
proteins, detailed in (Supplementary Data 15-15).
For plasmid DNA (pDNA) platforms, Critical Quality Attribute (CQA)
filters were applied separately on a per-host basis. In O. niloticus (Figs. 3.5–3.5), genes
and their products were first assessed for CAI ≥ Q3 (~0.60), ensuring codon usage was
well adapted to the host translation machinery; this retained 188 sequences. Applying
GC3 ≥ Q3 (~0.32) halved the pool to 95, as high G/C content at the third codon position
improves mRNA stability and reduces metabolic stress. A further antigen length cutoff
(≤ 2,200 nt) yielded 57 candidates, favoring shorter constructs with reduced
transcriptional burden. Finally, genes containing internal Type IIS restriction sites (e.g., BsaI,
SapI) were excluded because these enzymes cut outside their recognition sites, so an
internal site disrupts modular DNA assembly workflows and prevents one-step cloning
by Golden Gate Assembly. No tilapia candidate was removed at this step, so the final
Nile tilapia pDNA pool remained at 57 survivors.
In D. rerio (Figs. 3.5–3.5), the CAI threshold was higher and the GC3
threshold marginally lower (CAI ≥ Q3 ~0.70, GC3 ≥ Q3 ~0.31). Here, 186 proteins passed
the CAI filter, 82 remained after the GC3 filter, 66 after the length filter, and 65 after
the Type IIS removal filter. Comprehensive candidate sets for both pDNA platforms
(Nile tilapia and zebrafish) are available in
(Supplementary Data 15-15).
Overall, the reverse vaccinology–QbD funnel outputs platform-ready
shortlists: 98 AEX, 20 CEX, 100 cellulose, 49 silica, and 57–65 pDNA antigens for
immediate downstream development.
Figure 28 Manufacturability design spaces across purification platforms. (a) Ion-exchange
charge space for protein candidates: pI vs. net charge (z) at pH 7. Shaded bands mark the
AEX and CEX inclusion regions; vaccine-referenced antigens are labeled. (b) Survivors after
M2 filtering by platform; bars show unique genes per route with counts and percentages. Only
candidates fitting the biophysiochemical criteria of their respective purification pathway are
displayed in color bars. This step reflects manufacturability constraints and downstream process
optimization in the QbD framework. (c–d) Buffer operating pH ranges for AEX (c) and CEX
(d). Short tick marks indicate the working pH (range midpoint). (e–f) Protein-route survivors
per buffer evaluated at the midpoint: AEX rule = base gate passed, net charge (z) at pH 7 ≤ 0,
and pI ≤ pHmid; CEX rule = base gate passed, net charge (z) at pH 7 ≥ 0, and pI ≥ pHmid. (g)
Cellulose: antigen length distribution of M2 survivors; dashed line at 400 aa. (j) Silica-binding
peptides favor proteins with moderate pI (7–9) and positive net charge at pH > 7. (h, k) pDNA
(M2) scatterplots by host: tilapia (h) and zebrafish (k), showing CAI vs. GC3; dashed lines mark
dataset thresholds. (i, l) pDNA-route survivors passing each manufacturability gate for tilapia
(i) and zebrafish (l): CAI ≥ threshold, GC3 ≥ threshold, length ≤ median, and no Type IIS sites
(count).
4. Cross-Validation with Literature
We validated our shortlisted antigens against previously tested S. iniae vaccine
targets reported in multiple hosts, including Channel catfish (Ictalurus punctatus) (Wang
et al., 2016b), Nile tilapia (Oreochromis niloticus) (Kayansamruaj et al., 2017), Olive
flounder (Paralichthys olivaceus) (Sheng et al., 2018a, 2023), Zebrafish (Danio rerio)
(Membrebe et al., 2016), Mouse (Mus musculus) (Wang et al., 2015a), and Turbot fish
(Scophthalmus maximus) (Zhang et al., 2014a) (Table 11). Among the top-ranked
candidates consistently found across purification routes, antigens with demonstrated in vivo
protection (enolase, GAPDH, and GroEL) were recovered, supporting the predictive
accuracy of the QbD framework.
To further assess their suitability, we modeled GAPDH and enolase, two
representative antigens, both well-established in the literature. Structural predictions
generated using AlphaFold2 (Yang et al., 2023) for S. iniae SIKU01 confirmed highly
conserved folds and surface-exposed loops (Wang et al., 2017). Several epitope-containing
regions identified through reverse vaccinology were surface-accessible (Figures 29a–c,
highlighted in yellow), consistent with prior reports (Gent et al., 2024). These
epitopes (Table 9) are suitable for direct incorporation into subunit vaccines or for the
design of chimeric multi-epitope constructs (Pumchan et al., 2020), further validating their
suitability as vaccine targets. Both enolase (eno) and GAPDH (gap) displayed
epitope-rich regions spatially separated from catalytic or hypervariable sites, reinforcing their
accessibility and stability as broad-spectrum candidates.
Figure 29 Structural mapping of variability, epitopes, and active sites in two model
Streptococcus iniae vaccine candidates. (a) Shannon entropy profiles show that two model and well-known
antigens of the scientific literature, GAPDH (gap) and enolase (eno), are largely conserved with
limited variable regions. (b) GAPDH (336 aa, UniProt Q7BB80) carries a predicted N-terminal
epitope (MVVKVGINGFGRIGRLAFRRIQ) positioned near, but not overlapping with, the
active site and adjacent to a region of high nucleotide-level (codon-derived) variability (1–89 %
variation). (c) Enolase (435 aa, UniProt T1TFA0) contains a predicted epitope
(RAAADYLEVPLYNYLG) located opposite the active site and spatially separated from five nucleotide-driven
hypervariable surface regions (HR1–HR5), indicating stability and accessibility as a vaccine
target. Variability values are based on codon-level polymorphisms within the core-genome
alignment and projected onto corresponding amino acid positions in the 3-D models for visualization.
DATA AVAILABILITY AND NCBI SUBMISSIONS
Bighead Catfish
The final diploid genome assembly was screened for contamination using the NCBI
Foreign Contamination Screen (FCS) (Astashyn et al., 2024) and submitted following
the Vertebrate Genomes Project (VGP) naming conventions
(https://github.com/VGP/vgp-assembly) (Rhie et al., 2021). The project is registered under NCBI BioProject
PRJNA1132508, with BioSample accession SAMN41769988 corresponding to a Thai
(male, adult) bighead catfish (isolate: CMAM; TaxID: 35657). Raw sequencing reads
are deposited in the NCBI Sequence Read Archive (SRA): Nanopore (20% error),
SRR29723575; HiFi, SRR29723576; Hi-C (150PE), SRR29723577; and Illumina
(150PE), SRR29723578. The final diploid assembly is available in GenBank under
accession numbers JBLWMO000000000 (Haplotype 1) and JBLWMP000000000
(Haplotype 2). A complete dataset, including genome assemblies and supporting files,
is permanently archived at Zenodo (10.5281/zenodo.14826875).
F1 Hybrid Catfish
The hybrid genome assembly (Clarias macrocephalus × C. gariepinus) was
processed and submitted using the same quality control and standardization workflow as
described above. GenBank accessions are JBLWFY000000000.1 and JBLWFZ000000000.1,
corresponding to the C. macrocephalus and C. gariepinus sub-genomes, respectively.
Associated records are hosted under NCBI BioProject PRJNA1153495 and BioSample
SAMN43395848 (TaxID: 1334085). Raw sequencing reads are Nanopore (SRR30599638),
HiFi (SRR30599641), Illumina (SRR30599640), and Hi-C (SRR30599639). Additional
Illumina datasets correspond to female C. macrocephalus (SAMN42503781) and male
C. gariepinus (SAMN43548335). The complete F1 hybrid dataset, including
assemblies and metadata, is archived at Zenodo (10.5281/zenodo.15269601).
Streptococcus iniae
The Streptococcus iniae project is registered under NCBI BioProject PRJNA933632,
with BioSample accession SAMN33244440. Raw sequencing reads are available in the
NCBI Sequence Read Archive (SRA) as Illumina (150PE) data, 10 FASTQ files across five
datasets: SRR23406918 (SIKU01), and SRR23406921, SRR23406920, SRR23406919,
and SRR23406922 for SIKU02–SIKU05. The annotated genome of SIKU01 is
deposited in NCBI GenBank under accession CP121692.1. Supplementary annotation
files, intermediate datasets, and related analysis materials are permanently archived
at Zenodo (https://zenodo.org/records/17476104). All computational tools used in this
study are publicly available, and all command-line parameters are specified in the
Methods section. Custom scripts used for data processing, proteome annotation, and
Quality-by-Design (QbD) manufacturability analysis are provided as Supplementary Code,
available at Zenodo (see above) and in the Appendix chapter of this document.
Associated Publications
1. Andres, Q. L. S., Singchat, W., & Srikulnath, K. (2025). Haplotype-Resolved
Chromosome-scale Assembly of the Bighead Catfish (Clarias macrocephalus)
Genome. https://doi.org/10.5281/zenodo.14826876
2. Andres, Q. L. S., Singchat, W., & Srikulnath, K. (2025). Dual Reference Genomes
from F1 Hybrids: Phased Assembly of North African Catfish and Bighead
Catfish with Hi-C Data. https://doi.org/10.5281/zenodo.15269601
3. Andres, Q. L. S., Uchuwittayakul, A., & Srisapoome, P. (2025). Complete
genome and QbD-guided reverse vaccinology of Streptococcus iniae strain
SIKU01. https://doi.org/10.5281/zenodo.15264953
DISCUSSION AND CONCLUSIONS
Genomic Insights and Technical Achievements
1. Catfish Genome Assembly
This study presents the first haplotype-resolved genome assemblies of both
bighead catfish (Clarias macrocephalus) and North African catfish (C. gariepinus), achieved
through sequencing a single F1 hybrid offspring. Despite using identical pipelines and
data sources, the bighead catfish subgenome exhibited lower quality metrics (higher
base error rates, lower QV) compared to the African subgenome, likely reflecting
differences in transposable element content and the inherent assembly challenges of
repeat-rich teleost genomes.
Technical validation followed VGP benchmarks using k-mer sizes of 21 (QV
analysis), 28 (alignment applications), and 31 (GenomeScope 2.0). While contamination
was detected in raw reads by Mash, it was successfully excluded from final
assemblies (confirmed by FCS-NCBI and Mash screening). Primary limitations included low
sequencing depth (PacBio HiFi <13X, ONT <36X), suboptimal ONT library quality
(longest read <134 kb), and software-related challenges including tool maintenance and
quality control gaps.
The F1 hybrid genome assembly recovered complete genomic sets from both
parental species at exceptional quality (QV50–QV60): 27 pseudochromosomes from
bighead catfish and 28 from North African catfish, totaling 1.8 Gb with a median QV
of 55. The assembly achieved 99.35% completeness (21-mer analysis) with over 40
pseudochromosomes assembled to near telomere-to-telomere continuity. All data are
available under NCBI BioProject PRJNA1153495.
Streptococcus iniae Vaccine Development
This study presents an integrated reverse vaccinology–Quality by Design (QbD)
framework for S. iniae vaccines, addressing critical needs in aquaculture where annual
streptococcal losses exceed USD 1 billion (Amillano-Cisneros et al., 2025; Defoirdt
et al., 2011). By evaluating all 1,855 proteins from strain SIKU01 through
manufacturability filters, we reduced the candidate pool by 74–95% before wet-lab validation
while retaining previously validated protective antigens. The pipeline yielded
platform-specific candidates: 98 for anion-exchange (Duong-Ly and Gabelli, 2014), 100 for
cellulose-affinity (Carrard et al., 2000), 49 for silica-affinity purification (Freitas et al.,
2022), and 57–65 for plasmid DNA vaccines (Marillonnet and Grützner, 2020),
including validated antigens enolase and GAPDH (62–80% RPS in previous trials).
The bimodal pI distribution of the S. iniae proteome (peaks at 5.5 and 9.0) explains
the five-fold higher capture rate of anion-exchange chromatography, providing
practical guidance for purification strategy selection. DNA vaccines showed highest efficacy
(80–95% RPS), while our E. coli expression filters addressed inclusion-body
formation through hydrophobicity and stability constraints (Francis and Page, 2010; Rosano
and Ceccarelli, 2014). To address S. iniae's rapid antigenic variation, we restricted
candidates to the conserved core genome (n = 1,374) with high conservation thresholds
(H.norm > 0.98). The framework explicitly links antigen discovery to production
feasibility, ensuring candidates remain cost-competitive with current antibiotic treatments
while providing manufacturing flexibility through alternative affinity tags (Bachmann
and Jennings, 2010b; Woestenenk et al., 2004) and Golden Gate Assembly compatibility
(Bird et al., 2022b; Marillonnet and Grützner, 2020). Key limitations include incomplete
capture of teleost immune complexity through in silico prediction, uneven geographic
representation in our 90-isolate pangenome, and the need for experimental validation of
manufacturing yields.
I. Machol, E. S. Lander, A. P. Aiden and E. L. Aiden. 2017. De novo assembly of the
Aedes aegypti genome using Hi-C yields chromosome-length scaffolds. Science. 356
(6333): 92–95.
Duong, T.-Y. and K. T. Scribner. 2018. Regional variation in genetic diversity between wild and
cultured populations of bighead catfish (Clarias macrocephalus) in the Mekong Delta.
Fisheries Research. 207: 118–125.
Duong-Ly, K. C. and S. B. Gabelli. 2014. Using ion exchange chromatography to purify a
recombinantly expressed protein. Methods in Enzymology. 541: 95–103.
Durand, N. C., M. S. Shamim, I. Machol, S. S. Rao, M. H. Huntley, E. S. Lander and E. L.
Aiden. 2016. Juicer Provides a One-Click System for Analyzing Loop-Resolution Hi-C
Experiments. Cell Systems. 3 (1): 95–98.
Eisenstein, M. 2017. An ace in the hole for DNA sequencing. Nature. 550 (7675): 285–288.
El Hilali, S. and R. R. Copley. 2023. macrosyntR: Drawing automatically ordered Oxford Grids
from standard genomic files in R. Archive Ouverte HAL.
Eldar, A., A. Horovitcz and H. Bercovier. 1997. Development and efficacy of a vaccine against
Streptococcus iniae infection in farmed rainbow trout. Veterinary Immunology and
Immunopathology. 56 (1-2): 175–183.
Elfaitouri, A., B. Herrmann, A. Bolin-Wiener, Y. Wang, C. Gottfries, O. Zachrisson, R. Pipkorn,
L. Ronnblom and J. Blomberg. 2013. Epitopes of microbial and human heat shock
protein 60 and their recognition in myalgic encephalomyelitis. PLoS ONE. 8 (11):
e81155.
Ellinghaus, D., S. Kurtz and U. Willhoeft. 2008. LTRharvest, an efficient and flexible software
for de novo detection of LTR retrotransposons. BMC Bioinformatics. 9 (1).
Emms, D. M. and S. Kelly. 2019. OrthoFinder: phylogenetic orthology inference for
comparative genomics. Genome Biology. 20 (1).
Ewels, P., M. Magnusson, S. Lundin and M. Käller. 2016. MultiQC: summarize analysis results
for multiple tools and samples in a single report. Bioinformatics. 32 (19): 3047.
Eyngor, M. and others. 2008. Emergence of novel Streptococcus iniae
exopolysaccharide-producing strains following vaccination with nonproducing strains. Applied and
Environmental Microbiology. 74 (22): 6892–6897.
Facklam, R., J. Elliott, L. Shewmaker and A. Reingold. 2005. Identification and characterization
of sporadic isolates of Streptococcus iniae isolated from humans. Journal of Clinical
Microbiology. 43 (2): 933–937.
FAO. 2020. The State of World Fisheries and Aquaculture 2020. FAO.
Faust, G. G. and I. M. Hall. 2014. SAMBLASTER: fast duplicate marking and structural variant
read extraction. Bioinformatics. 30 (17): 2503–2505.
Ferraris, C. J. 2007. Checklist of catfishes, recent and fossil (Osteichthyes: Siluriformes), and
catalogue of siluriform primary types. Zootaxa. 1418 (1): 1–628.
Finn, R. D., J. Clements and S. R. Eddy. 2011. HMMER web server: interactive sequence
similarity searching. Nucleic Acids Research. 39 (suppl): W29–W37.
Flynn, J. M., R. Hubley, C. Goubert, J. Rosen, A. G. Clark, C. Feschotte and A. F. Smit. 2020.
RepeatModeler2 for automated genomic discovery of transposable element families.
Proceedings of the National Academy of Sciences. 117 (17): 9451–9457.
Formenti, G., A. Rhie, B. P. Walenz, F. Thibaud-Nissen, K. Shafin, S. Koren, E. W. Myers,
E. D. Jarvis and A. M. Phillippy. 2022. Merfin: improved variant filtering, assembly
evaluation and polishing via k-mer validation. Nature Methods. 19 (6): 696–704.
Fourie, K. and H. Wilson. 2020. Understanding GroEL and DnaK Stress Response Proteins as
Antigens for Bacterial Diseases. Vaccines. 8 (4): 773.
Francis, D. M. and R. Page. 2010. Strategies to optimize protein expression in E. coli. Current
Protocols in Protein Science. (1): 5.24.1–5.24.29.
Freitas, A. I., L. Domingues and T. Q. Aguiar. 2022. Bare silica as an alternative matrix for
affinity purification/immobilization of his-tagged proteins. Separation and Purification
Technology. 286: 120448.
Gent, V., Y.-J. Lu, S. Lukhele and others. 2024. Surface protein distribution in Group B
Streptococcus isolates from South Africa and identifying vaccine targets through in silico
analysis. Scientific Reports. 14: 22665.
Giovanni, A., Y.-Z. Shi, P.-C. Wang, M.-A. Tsai and S.-C. Chen. 2025. Recombinant C5a
Peptidase and Formalin-Killed Cell: A Synergistic Vaccine Against Streptococcus iniae in
Four-Finger Threadfin Fish (Eleutheronema tetradactylum). Journal of Fish Diseases.
48: e14154.
Glazunova, O. O., D. Raoult and V. Roux. 2009. Partial sequence comparison of the rpoB,
sodA, groEL and gyrB genes within the genus Streptococcus. International
Journal of Systematic and Evolutionary Microbiology. 59
(9): 2317–2322.
Goel, M. and K. Schneeberger. 2022. plotsr: visualizing structural similarities and
rearrangements between multiple genomes. Bioinformatics. 38 (10): 2922–2926.
Gong, H. and others. 2017. Complete Genome Sequence of Streptococcus iniae 89353, a
Virulent Strain Isolated from Diseased Tilapia in Taiwan. Genome Announcements. 5
(8): e01524–16.
Goubert, C., R. J. Craig, A. F. Bilat, V. Peona, A. A. Vogan and A. V. Protasio. 2022. A beginner's
guide to manual curation of transposable elements. Mobile DNA. 13 (1).
Grant, B. J., L. Skjaerven and X. Q. Yao. 2021. The Bio3D packages for structural
bioinformatics. Protein Science. 30 (1): 20–30.
Grant, B. J. and others. 2006. Bio3D: an R package for the comparative analysis of protein
structures. Bioinformatics. 22 (21): 2695–2696.
Gu, X. H., D. L. Jiang, Y. Huang, B. J. Li, C. H. Chen, H. R. Lin and J. H. Xia. 2018. Identifying a
Major QTL Associated with Salinity Tolerance in Nile Tilapia Using QTL-Seq. Marine
Biotechnology. 20 (1): 98–107.
Guan, D., S. A. McCarthy, J. Wood, K. Howe, Y. Wang and R. Durbin. 2020. Identifying and
removing haplotypic duplication in primary genome assemblies. Bioinformatics. 36
(9): 2896–2898.
Guarracino, A., S. Heumos, S. Nahnsen, P. Prins and E. Garrison. 2022. ODGI: understanding
pangenome graphs. Bioinformatics. 38 (13): 3319–3326.
Guruprasad, K., B. V. Reddy and M. W. Pandit. 1990. Correlation between stability of a protein
and its dipeptide composition: a novel approach for predicting in vivo stability of a
protein from its primary sequence. Protein Engineering. 4 (2): 155–161.
Harrison, P. W., M. R. Amode, O. Austine-Orimoloye, A. G. Azov, M. Barba, I. Barnes,
A. Becker, R. Bennett, A. Berry, J. Bhai, S. K. Bhurji, S. Boddu, P. R. Branco Lins,
L. Brooks, S. B. Ramaraju, L. I. Campbell, M. C. Martinez, M. Charkhchi, K. Chougule,
A. Cockburn, C. Davidson, N. H. De Silva, K. Dodiya, S. Donaldson, B. El Houdaigui,
T. E. Naboulsi, R. Fatima, C. G. Giron, T. Genez, D. Grigoriadis, G. S. Ghattaoraya,
J. G. Martinez, T. A. Gurbich, M. Hardy, Z. Hollis, T. Hourlier, T. Hunt, M. Kay,
V. Kaykala, T. Le, D. Lemos, D. Lodha, D. Marques-Coelho, G. Maslen, G. A. Merino,
L. P. Mirabueno, A. Mushtaq, S. N. Hossain, D. N. Ogeh, M. P. Sakthivel, A. Parker,
M. Perry, I. Piližota, D. Poppleton, I. Prosovetskaia, S. Raj, J. G. Pérez-Silva, A. I. A.
Salam, S. Saraf, N. Saraiva-Agostinho, D. Sheppard, S. Sinha, B. Sipos, V. Sitnik,
W. Stark, E. Steed, M.-M. Suner, L. Surapaneni, K. Sutinen, F. F. Tricomi, D.
Urbina-Gómez, A. Veidenberg, T. A. Walsh, D. Ware, E. Wass, N. L. Willhoft, J. Allen,
J. Alvarez-Jarreta, M. Chakiachvili, B. Flint, S. Giorgetti, L. Haggerty, G. R. Ilsley,
J. Keatley, J. E. Loveland, B. Moore, J. M. Mudge, G. Naamati, J. Tate, S. J. Trevanion,
A. Winterbottom, A. Frankish, S. E. Hunt, F. Cunningham, S. Dyer, R. D. Finn, F. J.
Martin and A. D. Yates. 2023. Ensembl 2024. Nucleic Acids Research. 52 (D1):
D891–D899.
Heckman, T. I., K. Shahin, E. E. Henderson, M. J. Griffin and E. Soto. 2022. Development and
efficacy of Streptococcus iniae live-attenuated vaccines in Nile tilapia (Oreochromis
niloticus). Fish & Shellfish Immunology. 121: 152–162.
Hoelzer, K., L. Bielke, D. P. Blake, E. Cox, S. M. Cutting, B. Devriendt, E. Erlacher-Vindel,
E. Goossens, K. Karaca, S. Lemiere, M. Metzner, M. Raicek, M. C. Suriñach, N. M.
Wong, C. Gay and F. V. Immerseel. 2018. Vaccines as alternatives to antibiotics for
food producing animals. Part 1: challenges and needs. Veterinary Research. 49 (1).
Holcomb, D. D., A. Alexaki, U. Katneni and C. Kimchi-Sarfaty. 2019. The Kazusa codon usage
database, CoCoPUTs, and the value of up-to-date codon usage statistics. Infection,
Genetics and Evolution. 73: 266–268.
Hon, T., K. Mars, G. Young, Y.-C. Tsai, J. W. Karalius, J. M. Landolin, N. Maurer, D. Kudrna,
M. A. Hardigan, C. C. Steiner, S. J. Knapp, D. Ware, B. Shapiro, P. Peluso and D. R.
Rank. 2020. Highly accurate long-read HiFi sequencing data for five complex genomes.
Scientific Data. 7 (1).
Hu, J., Z. Wang, F. Liang, S.-L. Liu, K. Ye and D.-P. Wang. 2024. NextPolish2: A
Repeat-aware Polishing Tool for Genomes Assembled Using HiFi Long Reads. Genomics,
Proteomics and Bioinformatics. 22 (1).
Jain, C., S. Koren, A. Dilthey, A. M. Phillippy and S. Aluru. 2018a. A fast adaptive algorithm
for computing whole-genome homology maps. Bioinformatics. 34 (17): i748–i756.
Jain, C., A. Rhie, H. Zhang, C. Chu, B. P. Walenz, S. Koren and A. M. Phillippy. 2020. Weighted
minimizer sampling improves long read mapping. Bioinformatics. 36 (Supplement 1):
111–118.
Jain, C., A. Rhie, N. F. Hansen, S. Koren and A. M. Phillippy. 2022. Long-read mapping to
repetitive reference sequences using Winnowmap2. Nature Methods. 19 (6): 705–710.
Jain, M., S. Koren, K. H. Miga, J. Quick, A. C. Rand, T. A. Sasani, J. R. Tyson, A. D. Beggs,
A. T. Dilthey, I. T. Fiddes, S. Malla, H. Marriott, T. Nieto, J. O'Grady, H. E. Olsen, B. S.
Pedersen, A. Rhie, H. Richardson, A. R. Quinlan, T. P. Snutch, L. Tee, B. Paten, A. M.
Phillippy, J. T. Simpson, N. J. Loman and M. Loose. 2018b. Nanopore sequencing and
assembly of a human genome with ultra-long reads. Nature Biotechnology. 36 (4):
338–345.
Jiang, D. L., X. H. Gu, B. J. Li, Z. X. Zhu, H. Qin, Z. n. Meng, H. R. Lin and J. H. Xia. 2019.
Identifying a Long QTL Cluster Across chrLG18 Associated with Salt Tolerance in
Tilapia Using GWAS and QTL-seq. Marine Biotechnology. 21 (2): 250–261.
Jumper, J. and others. 2021. Highly accurate protein structure prediction with AlphaFold.
Nature. 596 (7873): 583–589.
Kanehisa, M. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids
Research. 28 (1): 27–30.
Kanehisa, M. and Y. Sato. 2020. KEGG Mapper for inferring cellular functions from protein
sequences. Protein Science. 29 (1): 28–35.
Katoh, K. and D. M. Standley. 2013. MAFFT Multiple Sequence Alignment Software Version
7: Improvements in Performance and Usability. Molecular Biology and Evolution.
30 (4): 772–780.
Katoh, K., K. Misawa, K. Kuma and T. Miyata. 2002. MAFFT: a novel method for rapid multiple
sequence alignment based on fast Fourier transform. Nucleic Acids Research. 30 (14):
3059–3066.
Kayansamruaj, P., H. T. Dong, N. Pirarat, D. Nilubol and C. Rodkhum. 2017. Efficacy of
α-enolase-based DNA vaccine against pathogenic Streptococcus iniae in Nile tilapia
(Oreochromis niloticus). Aquaculture. 468: 102–106.
Kayansamruaj, P., N. Areechon and S. Unajak. 2020. Development of fish vaccine in Southeast
Asia: A challenge for the sustainability of SE Asia aquaculture. Fish and Shellfish
Immunology. 103: 73–87.
Kitano, J., S. Mori and C. L. Peichel. 2007. Sexual Dimorphism in the External Morphology of
the Threespine Stickleback (Gasterosteus Aculeatus). Copeia. 2007 (2): 336–349.
Kolberg, J., A. Aase, S. Bergmann, T. Herstad, G. Rodal, R. Frank, M. Rohde and S.
Hammerschmidt. 2006. Streptococcus pneumoniae enolase is important for plasminogen binding
despite low abundance of enolase protein on the bacterial cell surface. Microbiology
(Reading, England). 152 (Pt 5): 1307–1317.
Kolmogorov, M., D. M. Bickhart, B. Behsaz, A. Gurevich, M. Rayko, S. B. Shin, K. Kuhn,
J. Yuan, E. Polevikov, T. P. L. Smith and P. A. Pevzner. 2020. metaFlye: scalable
long-read metagenome assembly using repeat graphs. Nature Methods. 17 (11): 1103–1110.
Krogh, A., B. Larsson, G. von Heijne and E. L. Sonnhammer. 2001. Predicting
transmembrane protein topology with a hidden Markov model: application to complete
genomes. Journal of Molecular Biology. 305 (3): 567–580.
Kusakabe, M., A. Ishikawa, M. Ravinet, K. Yoshida, T. Makino, A. Toyoda, A. Fujiyama and
J. Kitano. 2016. Genetic basis for variation in salinity tolerance between stickleback
ecotypes. Molecular Ecology. 26 (1): 304–319.
Kutzler, M. and D. Weiner. 2008. DNA vaccines: ready for prime time? Nature Reviews
Genetics. 9: 776–788.
Kyte, J. and R. F. Doolittle. 1982. A simple method for displaying the hydropathic character of
a protein. Journal of Molecular Biology. 157 (1): 105–132.
Langmead, B. and S. L. Salzberg. 2012. Fast gapped-read alignment with Bowtie 2. Nature
Methods. 9 (4): 357–359.
Le Bras, Y., N. Dechamp, F. Krieg, O. Filangi, R. Guyomard, M. Boussaha, H. Bovenhuis, T. G.
Pottinger, P. Prunet, P. Le Roy and E. Quillet. 2011. Detection of QTL with effects on
osmoregulation capacities in the rainbow trout (Oncorhynchus mykiss). BMC Genetics.
12 (1).
Letunic, I., S. Khedkar and P. Bork. 2021. SMART: recent updates, new developments and
status in 2020. Nucleic Acids Research. 49 (D1): D458–D460.
Lewin, H. A., J. A. M. Graves, O. A. Ryder, A. S. Graphodatsky and S. J. O’Brien. 2019.
Precision nomenclature for the new genomics. GigaScience. 8 (8).
Li, H. 2018. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics. 34
(18): 3094–3100.
Li, H. and R. Durbin. 2009. Fast and accurate short read alignment with Burrows–Wheeler
transform. Bioinformatics. 25 (14): 1754–1760.
Li, H. and R. Durbin. 2010. Fast and accurate long-read alignment with Burrows–Wheeler
transform. Bioinformatics. 26 (5): 589–595.
Li, H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis and
R. Durbin. 2009. The Sequence Alignment/Map format and SAMtools. Bioinformatics.
25 (16): 2078–2079.
Li, K., P. Xu, J. Wang, X. Yi and Y. Jiao. 2023. Identification of errors in draft genome
assemblies at single-nucleotide resolution for quality assessment and improvement. Nature
Communications. 14 (1).
Li, W., K. R. O’Neill, D. H. Haft and others. 2021. RefSeq: expanding the Prokaryotic Genome
Annotation Pipeline reach with protein family model curation. Nucleic Acids Research.
49 (D1): D1020–D1028.
Li, W. and A. Godzik. 2006. Cd-hit: a fast program for clustering and comparing large sets of
protein or nucleotide sequences. Bioinformatics. 22 (13): 1658–1659.
Lin, X., J. Tan, Y. Shen, B. Yang, Y. Zhang, Y. Liao, P. Wang, D. Zhou, G. Li and C. Tian.
2022. A high-density genetic linkage map and QTL mapping for sex in Clarias fuscus.
Aquaculture. 561: 738723.
Lin, Y., C. Ye, X. Li, Q. Chen, Y. Wu, F. Zhang, R. Pan, S. Zhang, S. Chen, X. Wang, S. Cao,
Y. Wang, Y. Yue, Y. Liu and J. Yue. 2023. quarTeT: a telomere-to-telomere toolkit
for gap-free genome assembly and centromeric repeat identification. Horticulture
Research. 10 (8).
Lisachov, A., D. H. M. Nguyen, T. Panthum, S. F. Ahmad, W. Singchat, J. Ponjarat, K. Jaisamut,
P. Srisapoome, P. Duengkae, S. Hatachote, K. Sriphairoj, N. Muangmai, S. Unajak,
K. Han, U. Na-Nakorn and K. Srikulnath. 2023. Emerging importance of bighead catfish
(Clarias macrocephalus) and north African catfish (C. gariepinus) as a bioresource and
their genomic perspective. Aquaculture. 573: 739585.
Lischer, H. E. L. and K. K. Shimizu. 2017. Reference-guided de novo assembly approach
improves genome reconstruction for related species. BMC Bioinformatics. 18 (1).
Liu, C., X. Hu, Z. Cao, Y. Sun, X. Chen and Z. Zhang. 2019. Construction and characterization
of a DNA vaccine encoding the SagH against Streptococcus iniae. Fish & Shellfish
Immunology. 89: 71–75.
Liu, J., Y. Cao, H. Ma, H. Du, T. Liu, G. Wang, M. Liu, Q. Wang, P. Li and E. Wang. 2023a.
Enolase-based nanovaccine immersion immunization induces robust immunity and
protection against Streptococcus infection in tilapia. Aquaculture. 576.
Liu, L. and others. 2009. Identification and experimental verification of protective antigens
against Streptococcus suis serotype 2 based on genome sequence analysis. Current
Microbiology. 58: 11–17.
Liu, M., Y. Song, S. Zhang, L. Yu, Z. Yuan, H. Yang, M. Zhang, Z. Zhou, I. Seim, S. Liu, G. Fan
and H. Yang. 2023b. A chromosome-level genome of electric catfish (Malapterurus
electricus) provided new insights into order Siluriformes evolution. Marine Life
Science and Technology. 6 (1): 1–14.
Liu, Y., L. Li, F. Yu and others. 2020. Genome-wide analysis revealed the virulence
attenuation mechanism of the fish-derived oral attenuated Streptococcus iniae vaccine strain
YM011. Fish & Shellfish Immunology. 106: 546–554.
Mahmoud, M., Y. Huang, K. Garimella, P. A. Audano, W. Wan, N. Prasad, R. E. Handsaker,
S. Hall, A. Pionzio, M. C. Schatz, M. E. Talkowski, E. E. Eichler, S. E. Levy and F. J.
Sedlazeck. 2023. Utility of long-read sequencing for All of Us. Cold Spring.
Maneechot, N., C. F. Yano, L. A. C. Bertollo, N. Getlekha, W. F. Molina, S. Ditcharoen,
B. Tengjaroenkul, W. Supiwong, A. Tanomtong and M. de Bello Cioffi. 2016. Genomic
organization of repetitive DNAs highlights chromosomal evolution in the genus Clarias
(Clariidae, Siluriformes). Molecular Cytogenetics. 9 (1).
Marçais, G. and C. Kingsford. 2011. A fast, lock-free approach for efficient parallel counting
of occurrences of k-mers. Bioinformatics. 27 (6): 764–770.
Marillonnet, S. and R. Grützner. 2020. Synthetic DNA Assembly Using Golden Gate Cloning
and the Hierarchical Modular Cloning Pipeline. Current Protocols in Molecular
Biology. 130 (1): e115.
Martin, M., M. Patterson, S. Garg, S. O. Fischer, N. Pisanti, G. W. Klau, A. Schöenhuth and
T. Marschall. 2016. WhatsHap: fast and accurate read-based phasing. Oxford
Academics - Bioinformatics.
McCartney, A. M., K. Shafin, M. Alonge, A. V. Bzikadze, G. Formenti, A. Fungtammasan,
K. Howe, C. Jain, S. Koren, G. A. Logsdon, K. H. Miga, A. Mikheenko, B. Paten,
A. Shumate, D. C. Soto, I. Sović, J. M. D. Wood, J. M. Zook, A. M. Phillippy and
A. Rhie. 2022. Chasing perfection: validation and polishing strategies for
telomere-to-telomere genome assemblies. Nature Methods. 19 (6): 687–695.
Md, V., S. Misra, H. Li and S. Aluru. 2019. Efficient architecture-aware acceleration of
bwa-mem for multicore systems. In IEEE International Parallel and Distributed
Processing Symposium (IPDPS).
Membrebe, J. D., N. K. Yoon, M. Hong, J. Lee, H. Lee, K. Park, S. H. Seo, I. Yoon, S. Yoo,
Y. C. Kim and J. Ahn. 2016. Protective efficacy of Streptococcus iniae derived enolase
against streptococcal infection in a zebrafish model. Veterinary Immunology and
Immunopathology. 170: 25–29.
Meng, E. C., T. D. Goddard, E. F. Pettersen and others. 2023. UCSF ChimeraX: Tools for
structure building and analysis. Protein Science. 32 (11): e4792.
Mikheenko, A., A. V. Bzikadze, A. Gurevich, K. H. Miga and P. A. Pevzner. 2020.
TandemTools: mapping long reads and assessing/improving assembly quality in extra-long
tandem repeats. Bioinformatics. 36 (Supplement_1): i75–i83.
Millard, C. and others. 2012. Evolution of the capsular operon of Streptococcus iniae in response
to vaccination. Applied and Environmental Microbiology. 78 (23): 8219–8226.
Mishra, A. and others. 2018. Current Challenges of Streptococcus Infection and Effective
Molecular, Cellular, and Environmental Control Methods in Aquaculture. Molecules
and Cells. 41 (6): 495–505.
Mistry, J. and others. 2021. Pfam: The protein families database in 2021. Nucleic Acids
Research. 49 (D1): D412–D419.
Mmanda, F. and others. 2014. Massive mortality associated with Streptococcus iniae infection
in cage-cultured red drum (Sciaenops ocellatus) in Eastern China. African Journal of
Microbiology Research. 8: 1722–1729.
Moriel, D. and others. 2010. Identification of protective and broadly conserved vaccine antigens
from the genome of extraintestinal pathogenic Escherichia coli. Proceedings of the
National Academy of Sciences. 107 (20): 9072–9077.
Na-Nakorn, U., W. Kamonrat and T. Ngamsiri. 2004. Genetic diversity of walking catfish,
Clarias macrocephalus, in Thailand and evidence of genetic introgression from introduced
farmed C. gariepinus. Aquaculture. 240 (1-4): 145–163.
Nawawi, R., J. Baiano and A. Barnes. 2008. Genetic variability amongst Streptococcus iniae
isolates from Australia. Journal of Fish Diseases. 31 (4): 305–309.
Naylor, R. L., R. W. Hardy, A. H. Buschmann, S. R. Bush, L. Cao, D. H. Klinger, D. C. Little,
J. Lubchenco, S. E. Shumway and M. Troell. 2021. Publisher Correction: A 20-year
retrospective review of global aquaculture. Nature. 595 (7868): E36–E36.
Norman, J. D., M. Robinson, B. Glebe, M. M. Ferguson and R. G. Danzmann. 2012.
Genomic arrangement of salinity tolerance QTLs in salmonids: A comparative analysis of
Atlantic salmon (Salmo salar) with Arctic charr (Salvelinus alpinus) and rainbow trout
(Oncorhynchus mykiss). BMC Genomics. 13 (1): 420.
Ondov, B. D., G. J. Starrett, A. Sappington, A. Kostic, S. Koren, C. B. Buck and A. M. Phillippy.
2019. Mash Screen: high-throughput sequence containment estimation for genome
discovery. Genome Biology. 20 (1).
Osorio, D., P. Rondon-Villarreal and R. Torres. 2015. Peptides: A package for data mining of
antimicrobial peptides. The R Journal. 7 (1): 4–14.
Ou, S. and N. Jiang. 2017. LTR_retriever: A Highly Accurate and Sensitive Program for
Identification of Long Terminal Repeat Retrotransposons. Plant Physiology. 176 (2):
1410–1422.
Ou, S. and N. Jiang. 2019. LTR_FINDER_parallel: parallelization of LTR_FINDER enabling
rapid identification of long terminal repeat retrotransposons. Mobile DNA. 10 (1).
Ou, S., J. Chen and N. Jiang. 2018. Assessing genome assembly quality using the LTR Assembly
Index (LAI). Nucleic Acids Research.
Ou, S., W. Su, Y. Liao, K. Chougule, J. R. A. Agda, A. J. Hellinga, C. S. B. Lugo, T. A. Elliott,
D. Ware, T. Peterson, N. Jiang, C. N. Hirsch and M. B. Hufford. 2019. Benchmarking
transposable element annotation methods for creation of a streamlined, comprehensive
pipeline. Genome Biology. 20 (1).
Ouchi, S., R. Kajitani and T. Itoh. 2023. GreenHill: a de novo chromosome-level scaffolding
and phasing tool using Hi-C. Genome Biology. 24 (1).
Page, A. J., C. A. Cummins, M. Hunt, V. K. Wong, S. Reuter, M. T. G. Holden, M. Fookes,
D. Falush, J. A. Keane and J. Parkhill. 2015a. Roary: rapid large-scale prokaryote pan
genome analysis. Bioinformatics. 31 (22): 3691–3693.
Page, A. J., C. A. Cummins, M. Hunt, V. K. Wong, S. Reuter, M. T. Holden, M. Fookes,
D. Falush, J. A. Keane and J. Parkhill. 2015b. Roary: rapid large-scale prokaryote
pan genome analysis. Bioinformatics. 31 (22): 3691–3693.
Pandurangan, A. P., J. Stahlhacke, M. E. Oates, B. Smithers and J. Gough. 2019. The
SUPERFAMILY 2.0 database: a significant proteome update and a new webserver. Nucleic
Acids Research. 47 (D1): D490–D494.
Pertea, G. and M. Pertea. 2020. GFF Utilities: GffRead and GffCompare. F1000Research. 9:
304.
Pham, D. K., J. Chu, N. T. Do, F. Brose, G. Degand, P. Delahaut, E. D. Pauw, C. Douny, K. V.
Nguyen, T. D. Vu, M.-L. Scippo and H. F. L. Wertheim. 2015. Monitoring Antibiotic
Use and Residue in Freshwater Aquaculture for Domestic Use in Vietnam. EcoHealth.
12 (3): 480–489.
Pier, G. and S. Madin. 1976. Streptococcus iniae sp. nov., a beta-hemolytic streptococcus
isolated from an Amazon freshwater dolphin, Inia geoffrensis. International Journal
of Systematic Bacteriology. 26 (4): 545–553.
Pridgeon, J. W. and P. H. Klesius. 2011. Development and efficacy of a novobiocin-resistant
Streptococcus iniae as a novel vaccine in Nile tilapia (Oreochromis niloticus). Vaccine.
29 (35): 5986–5993.
Pumchan, A., S. Krobthong, S. Roytrakul and others. 2020. Novel chimeric multiepitope
vaccine for streptococcosis disease in Nile tilapia (Oreochromis niloticus Linn.). Scientific
Reports. 10: 603.
Putnam, N. H., B. L. O’Connell, J. C. Stites, B. J. Rice, M. Blanchette, R. Calef, C. J. Troll,
analysis. PLoS ONE. 5 (1): e10546.
Woestenenk, E. A., M. Hammarström, S. van den Berg, T. Härd and H. Berglund. 2004. His
tag effect on solubility of human proteins produced in Escherichia coli: a comparison
between four expression vectors. Journal of Structural and Functional Genomics. 5
(3): 217–229.
Wyneken, J., S. P. Epperly, L. B. Crowder, J. Vaughan and K. Blair Esper. 2007. Determining
sex in posthatchling loggerhead sea turtles using multiple gonadal and accessory
duct characteristics. Herpetologica. 63 (1): 19–30.
Xiong, W., L. He, J. Lai, H. K. Dooner and C. Du. 2014. HelitronScanner uncovers a large
overlooked cache of Helitron transposons in many plant genomes. Proceedings of the
National Academy of Sciences. 111 (28): 10263–10268.
Xiong, X., Y. Peng, R. Chen, X. Liu and F. Jiang. 2023. Efficacy and transcriptome analysis of
golden pompano (Trachinotus ovatus) immunized with a formalin-inactivated vaccine
against Streptococcus iniae. Fish & Shellfish Immunology. 134: 108489.
Xu, M., L. Guo, S. Gu, O. Wang, R. Zhang, B. A. Peters, G. Fan, X. Liu, X. Xu, L. Deng and
Y. Zhang. 2020. TGS-GapCloser: A fast and accurate gap closer for large genomes
with low coverage of error-prone long reads. GigaScience. 9 (9).
Xu, Z. and H. Wang. 2007. LTR_FINDER: an efficient tool for the prediction of full-length
LTR retrotransposons. Nucleic Acids Research. 35 (Web Server): W265–W268.
Yan, J.-J., Y.-C. Lee, Y.-L. Tsou, Y.-C. Tseng and P.-P. Hwang. 2020. Insulin-like growth
factor 1 triggers salt secretion machinery in fish under acute salinity stress. Journal of
Endocrinology. 246 (3): 277–288.
Yang, Z., X. Zeng, Y. Zhao and others. 2023. AlphaFold2 and its applications in the fields of
biology and medicine. Signal Transduction and Targeted Therapy. 8: 115.
Yu, L. X. and others. 2014. Understanding pharmaceutical quality by design. AAPS Journal.
16 (4): 771–783.
Yu, X., P. Setyawan, J. W. Bastiaansen, L. Liu, I. Imron, M. A. Groenen, H. Komen and H.-J.
Megens. 2022. Genomic analysis of a Nile tilapia strain selected for salinity tolerance
shows signatures of selection and hybridization with blue tilapia (Oreochromis aureus).
Aquaculture. 560: 738527.
Yue, G. H. 2013. Recent advances of genome mapping and marker-assisted selection in
aquaculture. Fish and Fisheries. 15 (3): 376–396.
Zayas, J. F. 1997. Solubility of Proteins. Functionality of Proteins in Food. pp. 6–75.
Zeng, X., Z. Yi, X. Zhang, Y. Du, Y. Li, Z. Zhou, S. Chen, H. Zhao, S. Yang, Y. Wang and
G. Chen. 2024. Chromosome-level scaffolding of haplotype-resolved assemblies using
Hi-C data without reference genomes. Nature Plants. 10 (8): 1184–1200.
Zhang, B.-C., J. Zhang and L. Sun. 2014a. Streptococcus iniae SF1: Complete genome sequence,
proteomic profile, and immunoprotective antigens. PLoS ONE. 9 (3): e91324.
Zhang, B.-C., J. Zhang and L. Sun. 2014b. Streptococcus iniae SF1: Complete Genome
Sequence, Proteomic Profile, and Immunoprotective Antigens. PLoS ONE. 9 (3): e91324.
Zhang, R.-G., G.-Y. Li, X.-L. Wang, J. Dainat, Z.-X. Wang, S. Ou and Y. Ma. 2022. TEsorter:
An accurate and fast method to classify LTR-retrotransposons in plant genomes.
Horticulture Research. 9.
Zheng, Z., S. Li, J. Su, A. W.-S. Leung, T.-W. Lam and R. Luo. 2022. Symphonizing pileup
and full-alignment for deep learning-based long-read variant calling. Nature
Computational Science. 2 (12): 797–803.
Zhou, Z., Y. Dang, M. Zhou, L. Li, C. Yu, J. Fu, S. Chen and Y. Liu. 2016. Codon usage is an
important determinant of gene expression levels largely through its effects on
transcription. Proceedings of the National Academy of Sciences. 113 (41): E6117–E6125.
Appendix
Figure 30 GenomeScope2.0 profiles of (a) male C. gariepinus, (b) female C. macrocephalus,
and the F1 hybrid catfish at (c) k=21 and (d) k=31. The hybrid genome shows an
intermediate heterozygosity level (~1%) and genome size of 1.8 Gb, consistent with contributions from
both parental subgenomes. Parental species exhibit 0.056% (C. macrocephalus) and 1.56%
(C. gariepinus) heterozygosity, respectively. BUSCO analyses confirmed assembly
completeness and the presence of two subgenomes. Low Illumina coverage (<20×) in parental datasets
resulted in broad peaks in panels (a–b).
Figure 31 Comparative synteny and structural variation between the North African catfish (C. gariepinus) reference genome (GCA_024256425.2) and
the hybrid catfish genome (fClaHyb_Gar, this study). (A) Genome-wide macrosynteny across 28 pseudochromosomes showing conserved collinearity
and localized inversions, translocations, and duplications. (B) Sequence variation lengths and percentages of genome size, highlighting highly divergent
regions. (C) Structural variation composition including duplications, translocations, and inversions. (D) Variant feature counts (SNPs, indels, CNVs,
tandem repeats) showing genome-wide heterogeneity.
Figure 32 Comparative synteny and structural variation between the Bighead catfish (C. macrocephalus) Haplotype 1 (GCA_048544425.1) and the
hybrid genome (fClaHyb_Mac, this study). (A) Genome-wide macrosynteny across 27 pseudochromosomes showing extensive collinearity with limited
rearrangements. (B) Sequence variation profiles showing divergence and insertion–deletion patterns. (C) Structural variation composition between parental
and hybrid genomes. (D) Variant feature counts across all categories, emphasizing SNP dominance and minor structural variants.
Figure 33 Circular genome synteny and quality control overview of the five sequenced
Streptococcus iniae strains (SIKU01–SIKU05). The Circos plot shows inter-strain genomic alignments,
GC content variation, GC skew, and contig connectivity metrics. Outer rings represent genome
coordinates, while inner links highlight conserved syntenic regions among isolates, confirming
overall structural stability across strains.
Figure 34 Comparative genome synteny of Streptococcus iniae strain SIKU01 relative to the original Amazon River dolphin isolate QMA0141 (collected
in 1976) (Pier and Madin, 1976). Homologous regions show strong macrosyntenic conservation with limited structural rearrangements, confirming
long-term genomic stability across lineages.
Figure 35 Antigenic variation and gene carriage across the 17 proteins carrying predicted epitopes in Streptococcus iniae SIKU01. Bar plots show the
distribution of antigenic determinants, sequence variability, and cross-strain presence of corresponding loci, indicating conserved immunogenic targets for
vaccine development.
Figure 36 Biophysical landscape of the unfiltered SIKU01 proteome (M0), showing distributions of molecular weight, hydrophobicity (GRAVY),
isoelectric point (pI), and instability index (II). These parameters establish the baseline design space for subsequent QbD filtering.