
Preventing illegal seafood trade using machine-learning assisted microbiome analysis

Author: Peruzza, Luca; Cicala, Francesco; Milan, Massimo; Rovere, Giulia Dalla; Patarnello, Tomaso; Boffo, Luciano; Smits, Morgan; Iori, Silvia; De Bortoli, Angelo; Schiavon, Federica; Zentilin, Aurelio; Fariselli, Piero; Cardazzo, Barbara; Bargelloni, Luca
Publisher: Zenodo
DOI: 10.1186/s12915-024-02005-w
Source: https://zenodo.org/records/17536220/files/BMC-Biology-2024.pdf
Pe uzzae al. BMC Biology (2024) 22:202
h ps://doi.o g/10.1186/s12915-024-02005-w
RESEARCH ARTICLE Open Access
© The Author(s) 2024. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
BMC Biology
Preventing illegal seafood trade using machine-learning assisted microbiome analysis
Luca Peruzza1†, Francesco Cicala1†, Massimo Milan1*, Giulia Dalla Rovere1, Tomaso Patarnello1, Luciano Boffo2, Morgan Smits3, Silvia Iori1, Angelo De Bortoli4, Federica Schiavon4, Aurelio Zentilin5, Piero Fariselli6, Barbara Cardazzo1 and Luca Bargelloni1
Abstract
Background Seafood is increasingly traded worldwide, but its supply chain is particularly prone to fraud. To increase consumer confidence, prevent illegal trade, and provide independent validation for eco-labelling, accurate tools for seafood traceability are needed. Here we show that the use of microbiome profiling (MP) coupled with machine learning (ML) allows precise tracing of the origin of Manila clams harvested in areas separated by small geographic distances. The study was designed to represent a real-world scenario. Clams were collected in different seasons across the most important production area in Europe (lagoons along the northern Adriatic coast) to cover the known seasonal variation in microbiome composition for the species. DNA was extracted from samples that had undergone the same depuration process as commercial products (i.e. at least 12 h in open flow systems).
Results Machine learning-based analysis of microbiome profiles was carried out using two completely independent sets of data (collected at the same locations but in different years), one for training the algorithm, and the other for testing its accuracy and assessing the temporal stability of the signal. Briefly, gills (GI) and digestive gland (DG) of clams were collected in summer and winter over two different years (i.e. from 2018 to 2020) in one banned area and four farming sites. 16S rDNA metabarcoding was performed on clam tissues and the obtained amplicon sequence variants (ASVs) table was used as input for ML MP. The best-predicting performances were obtained using the combined information of GI and DG (consensus analysis), showing a Cohen K-score > 0.95 when the target was the classification of samples collected from the banned area and those harvested at farming sites. Classification of the four different farming areas showed slightly lower accuracy with a 0.76 score.
Conclusions We show here that MP coupled with ML is an effective tool to trace the origin of shellfish products. The tool is extremely robust against seasonal and inter-annual variability, as well as product depuration, and is ready for implementation in routine assessment to prevent the trade of illegally harvested or mislabeled shellfish.
Keywords Machine learning, Food traceability, Microbiota 16S, Manila clam, North Adriatic sea, Illegal unreported unregulated (IUU) fishing
†Luca Peruzza and Francesco Cicala equally contributed to this work.
*Correspondence:
Massimo Milan
[email protected]
Full list of author information is available at the end of the article
Background
Seafood is crucial for the human diet as it provides an excellent source of high-quality protein, essential omega-3 fatty acids, and various vitamins and minerals. Regular consumption of seafood is linked to numerous health benefits, including heart health, brain function, and overall well-being [1]. Marine bivalves are among the most traded seafood products worldwide [2]. Manila clam (Ruditapes philippinarum) is one of the most important mollusc species, with a global production of over four million metric tonnes. In Europe, Italy is the largest producer with ~ 96% of the European production (24,337 t) [3]. Native from South-east Asia (Indo-Pacific), R. philippinarum was imported in Europe in the second half of the ’70s and it was introduced in Italy in 1983 [4]. Given its high adaptability, R. philippinarum is now commonly found in brackish waters along the northern Adriatic coast especially in the Venice lagoon and nearby areas such as the Po river delta and the lagoon of Marano [3–5].
Despite their high nutritional value, bivalve consumption might pose risks to human health. This is due to their filter-feeding strategy through which these molluscs may accumulate human pathogenic microorganisms as well as metals and/or chemicals present in the water [2, 6, 7]. The European Union (EU) has regulated bivalve harvesting (Regulation (EC) No 853/2004, No 854/2004, No 2073/2005 and No 1021/2008), classifying harvesting areas according to the levels of Escherichia coli in the intravalvular liquid per grams of bivalves (EC, 2004a; 2004b, 2005, 2008). Based on these regulations, all species of bivalve molluscs farmed in lagoons must be subjected to depuration processes in order to remove chemical compounds hazardous to humans. Seafood product labels must include the area of provenience (Regulation (EU) No 1379/2013; EU, 2013). Ecolabels have also recently been proposed to improve consumers’ perception and market value of shellfish products [8]. Finally, as bivalves might accumulate significant amounts of chemical pollutants, highly polluted areas might be officially banned for mollusc harvesting. A well-known example is the area close to Porto Marghera in the Venice Lagoon, in Italy, where decades of discharging industrial wastes have led to high concentrations of several pollutants in the sediment (e.g. dioxins; PCBs; heavy metals). Despite the interdiction to harvest bivalves for human consumption in this area (Veneto regional regulations No 133/2018), however, illegal clam harvesting in Porto Marghera often occurs with major risks for consumers’ health.
For all the above reasons it is clear that developing tools that allow verification of bivalve origin as stated in the product label is increasingly important. A broad array of methodologies like fatty acids (FA) profiling, stable and unstable isotope and trace element fingerprinting, as well as several methods based on DNA analysis (DNA fingerprinting, microbial barcodes or profiles, among others), have been proposed as possible tools to trace the geographic origin and detect mislabelling [9–11]. These methods have variable performance depending on the geographic scale and mobility of the target species, and diverse feasibility in terms of time and cost of analysis. For instance, DNA analyses have been the most used tool for species identification (DNA barcoding) and the second most used to discriminate the geographic origin of bivalves (genetic traceability) [10]. This type of analysis is usually chosen for its relatively rapid and cost-effective results. However, the absence of barriers to the gene flow in species with highly vagile developmental stages may preclude genetic divergence between closely located populations (high connectivity), thus limiting the diagnostic power of genetic traceability [12]. FA profiles are among the most used and highly reliable approaches to discriminate the geographic origin of commercial bivalves as these species may modulate their FA profiles by several intrinsic (e.g. age, sex, reproductive cycle, and phylogeny) and extrinsic factors (e.g. diet, temperature, depth, and salinity) [13]. However, as previously mentioned for DNA analysis, recent studies have reported that natural seasonal and inter-annual FA variability may limit its accuracy when considering samples collected over multiple seasons/years [14].
A second major issue is that nearly all these methods have been validated using samples collected in different areas at a single time point (season, year). Even when a rigorous statistical analysis is implemented, site-of-origin discrimination accuracy is estimated based on a leave-one-out cross-validation. Such an approach is prone to overfitting, inflating the estimated accuracy. It also does not allow for the assessment of whether the discriminant function or the predictive model can generalise, i.e. correctly classifying never-seen-before samples. This is even more problematic when there are time dependencies among the data.
Microbiome profiling (MP) or 16S metabarcoding is an emerging alternative approach for unveiling geographical origin misrepresentations [15–17]. This method consists of the amplification and analysis of a fragment of the 16S rRNA gene from the microbial communities present in a given sample with the final aim to detect area-specific taxonomic composition profiles [18]. Recent studies have proven that although the microbial composition of molluscs is affected by environmental factors (e.g. temperature, salinity, chemical contamination), it is also characterized by a striking resilience to change; thus, making microbiota a suitable candidate for traceability [19, 20]. In fact, previous works employing this technique have reported its successful use to trace the geographic origin of commercially important wild [16, 21] and cultured species [22]. However, as mentioned above, molluscs grown in transition habitats such as lagoons and deltas need to undergo a process of depuration before commercialization. Such a process might significantly change their microbiota [15, 23] and ultimately impair our ability to discriminate between molluscs collected at differing locations.
The present study aimed at implementing MP in seafood traceability and addressing all the above-mentioned issues. First, samples were collected and processed (depurated) as is the case for commercial bivalve products. Second, the predictive model was based on two ML algorithms that are less prone to overfitting (linear bagging and random forests). The number of features used was limited to those taxonomic units (amplicon sequence variants, ASVs) showing significant abundance (> 5%), to reduce noise and limit the “large p small n” problem. Third, samples were collected across seasons to assess the effects of seasonality on prediction accuracy. Fourth, two comparable data sets were obtained from two different years, providing independent training/validation and test sets to correctly assess the generalization ability of the trained model. Briefly, 16S metabarcoding data for two tissues (gills (GI) and digestive glands (DG) of Manila clam) were obtained for samples collected at four different sites, representing the major clam farming areas in the North Adriatic Sea, across different seasons and years. A fifth site, the interdicted area of Porto Marghera, was included to test the ability to classify illegally commercialized products harvested in potentially contaminated areas. Overall, high accuracy of prediction was observed, especially when identifying clams from the interdicted area. The predictive model was also robust to seasonal and inter-annual variation, showing significant generalization ability.
Results
Data summary
A total of 1000 clams were collected during field expeditions and pooled in 560 DNA-pool samples (280 per tissue). Raw DNA libraries were deposited at the NCBI repository under the BioProject access number: PRJNA1013079 [24]. After the quality-filter steps, a total of 15,491,493 reads were retained and rarefied at 5463 reads per sample. Finally, after the exclusion of rare ASVs (less than 5% of abundance), a number between 150 and 300 ASVs (depending on the comparison) were used to train ML algorithms.
P edic ing heo igin o samples ia ML
In he compa ison o Po o Ma ghe a (PM) s Chioggia
(CL) (i.e. he pollu ed and a ming si es wi hin he Ven-
ice lagoon), he a e age p ecision in he p edic ion a -
ied be ween 0.85 and 0.95 (Addi ional File 1: TableS1),
wi h he GI issue p o iding be e esul s han he DG
(Fig.1). The consensus model allowed o each an o e -
all AUC o 0.95 wi h all samples om PM being co ec ly
iden i ied as PM while only 1 sample o CL was mis-
labelled (Fig.1A). Bina y classi ica ion was based on a
la ge se o ea u es (p edic o s), as he analysis o ea u e
impo ance (Addi ional File 1: TableS1) showed ha in
he case o PM s CL he p esence-absence o 40 di e en
ASVs o he GI and 50 ASV o he DG da a explained
75% o classi ica ion impo ance. Simila e idence was
obse ed o o he classi ica ion es s by compa ing PM
s o he a ming si es (Addi ional ile2: Fig. S1).
In the comparison of Porto Marghera (PM) vs Scardovari (SC) (i.e. the polluted site within the Venice lagoon and the farming site outside the Venice lagoon) ML was able to correctly identify all animals with 100% accuracy with no differences in performances among tissues considered (Fig. 1B).
The analysis between farmed areas was more challenging since accuracy ranged from 0.66 for GI to 0.83 for DG (Fig. 1C). Once again, better results were obtained with the consensus model that achieved an average accuracy of 0.84. Using the combined GI + DG data sets for every farming area, at least 8 out of 10 samples were correctly classified (Fig. 1C). Not surprisingly, the mislabelled samples from Goro (GO) were assigned to SC which is the closest site to GO in terms of geographic distance (Fig. 2).
Discussion
Traceability, the possibility to track each step along the supply chain of any food product from its origin to the final consumer, is essential to ensure product quality and consumer safety. Traceability is particularly relevant for shellfish species. These organisms are filter-feeders that can bio-accumulate abiotic and biotic hazardous compounds if grown in areas subjected to chemical and/or biological contaminations, with potential risks for public health [2, 12, 25]. In fact, episodes of molluscs harvested from restricted areas and then sold in fish markets are still frequent and it can be quite difficult and cost- and time-consuming to detect such dangerous frauds. Thus, the creation of diagnostic tools able to accurately and easily discriminate the provenance of seafood, preventing the introduction of unsafe products in the market [12, 25] has long been sought for by health authorities. In addition, such tools could help producers and cooperatives to guarantee quality and environmental certifications, increasing consumer trust in specific brands and eco-labels [12, 26–28].
In a previous study, we showed the potential of ML-powered classification of MPs to discriminate clams from Porto Marghera compared to animals collected from farming sites located in the Venice lagoon [12]. However, in that preliminary study temporal replication was not extensive and sampling was limited to the Venice lagoon, excluding other important clam farming areas along the North Adriatic coast. In addition, a single tissue (digestive gland) was analysed and, most importantly, clams did not undergo the depuration process required by legislation, thus ignoring the potentially altering effects of this treatment on the animal microbiota. It has in fact been demonstrated that depuration may significantly influence the bivalve microbial community [15]. In the present work, we carried out an entirely new sampling campaign, with extensive temporal replication and broad geographic representation. All samples underwent the standard depuration process, which constitutes a key step forward. In fact, we have successfully demonstrated that, despite the depuration process, a distinctive “signature” is still detectable in the clam MP. As already mentioned, existing evidence suggested that the composition of host-associated microbiota was relatively stable and distinct from the microbial communities present in the water and the sediment [20]. This is likely explained by the interactions between host and microbiota, which select and maintain distinctive bacterial taxa associated to specific organs and tissues [29].
The dynamics of host colonization and evolution of host-microbiota associations, however, remain to be better elucidated in bivalves, and such knowledge might be highly relevant to more accurately understand the potential and the limitations of MP analysis in traceability. A first attempt at shedding light on the dynamics of clam biology and its associated microbiota has been made recently by Milan et al. [19] and by Cecchetto et al. [30] by performing long-term monitoring campaigns on clams farmed in four sites of the Venice lagoon. Our works revealed that clams undergo biotic and abiotic stressors in a site-specific way, with locations in close proximity to the inlet of the lagoon experiencing more frequent salinities and/or temperatures beyond the optimum range for the species. Further, we showed how environmental gradients, such as salinity and water residence time, influenced the overall gene expression pattern of clams, the beta diversity of microbial communities in the host, and putatively translated into differences in clam’s growth, condition index (an index of general well-being of the animal) and mortality. Interestingly, we found an over-representation of the genera Arcobacter and Vibrio (both described as opportunistic pathogens) at the end of the summer in sites close to the inlet and thus potentially subject to a higher abiotic stress. Unfortunately, we did not have multi-parametric probes in place when we performed all sampling activities for the current work, thus hampering the possibility to integrate environmental data in our ML-based traceability system.
Fig. 1 Confusion matrices showing the results of the ML predicted provenance (“Predicted label”) versus the real provenance (“True label”) of each of the tested samples by using gills (GI) only (left column), digestive gland (DG) only (middle column) or by combining GI and DG into a consensus prediction (right column). A, B Classification discriminating between the polluted site PM and the clean farming sites of CL (A) and SC (B). C Classification assessing the origin of samples among the four farming areas considered in the study. Colour scale is proportional to the number of samples that are assigned to a specific location. MA, Marano; PM, Porto Marghera; CL, Chioggia; SC, Scardovari; GO, Goro
Fig. 2 Map showing the sampling areas: two from the Venice lagoon, two from the Po river delta and one from the Marano and Grado lagoon. MA, Marano; PM, Porto Marghera; CL, Chioggia; SC, Scardovari; GO, Goro
Another key outcome of the present study is the highly significant accuracy (> 95%) in discriminating clams collected in the prohibited area (PM) from those harvested in the nearby farming sites within the Venice lagoon and even more accurately (100% recovery) when samples originated from a geographically close, but distinct lagoon (SC). Similar results were obtained for pairwise comparisons between PM and the two other farming sites. We believe this is a significant proof-of-concept that ML-empowered analysis of MPs could be used to independently verify the origin of bivalves suspected of being illegally collected in areas that are interdicted for various reasons, as in the case of sites showing high levels of chemical and/or microbiological contamination. The specific environment present at contaminated sites might shift the composition of microbial communities, making MPs more easily identifiable. A second potential factor that might boost classification accuracy is the fact that clams in PM are generally undisturbed after settlement, therefore early on during their life history, making them long-term residents in the same area and allowing them to be associated with a highly distinctive microbiome. This is not the case for clams collected in authorized farming sites as it is a common practice to use already metamorphosed juvenile individuals to seed on-growing sites. As mentioned, little is known about the dynamics of the establishment and maintenance of clam-associated microbiota, although the early life history phases seem to be crucial in determining microbiome composition in adult animals as observed in aquatic model species [31]. The frequent practice of seeding farming sites with juveniles from different areas or with spat produced by captive reproduction in bivalve hatcheries might explain the lower, though significant accuracy (k-score 0.76 compared to the expected 0 by random chance) in classifying the four most important Italian production areas. In fact, juveniles from natural nursery areas from the Po river delta (Scardovari and Goro; SC and GO) have been reported to be used to support farming sites in the Venice and Marano lagoons (CL and MA), where natural recruitment has been extremely limited in recent years. It should also be noted that multi-class discrimination cases are generally more problematic than binary classification, which might add to the problem of translocation of juvenile clams.
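The comparison of a k-score with the value of 0 expected by random chance follows from the definition of Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from the row and column totals. The Python sketch below illustrates that definition only; it assumes a confusion matrix with true labels in rows and predicted labels in columns, and the four-site matrix is a made-up example, not data from this study.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of samples on the diagonal.
    observed = sum(confusion[k][k] for k in range(n_classes)) / total
    # Chance agreement: product of the marginal totals, summed over classes.
    row_tot = [sum(row) for row in confusion]
    col_tot = [sum(confusion[i][j] for i in range(n_classes)) for j in range(n_classes)]
    expected = sum(row_tot[k] * col_tot[k] for k in range(n_classes)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical four-site confusion matrix, 10 samples per site (illustration only).
example = [
    [9, 1, 0, 0],
    [0, 8, 2, 0],
    [0, 1, 9, 0],
    [1, 0, 0, 9],
]
kappa = cohen_kappa(example)
print(round(kappa, 3))
```

A classifier that assigns labels at random yields a kappa near 0 regardless of the number of classes, which is why kappa is a convenient yardstick for comparing a binary task with a four-class task.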
In the present study, ML-based classification provided excellent to very good accuracy. We believe that several factors contributed to such a positive outcome. First, presence-absence data were used, enforcing a conservative threshold (> 5%) for considering a specific ASV as present. This approach reduces the potential noise linked to fluctuating abundance caused by technical and/or biological factors. Second, ASVs were used as features for classification, which means that discrimination is based on the presence of “unique” bacterial strains/species in a specific site. Third, reducing the number of features likely limited the risk of overfitting, which is generally high for ML-based classification methods. Fourth, a bootstrap aggregating (bagging) approach was implemented, which is expected to further reduce the problem of overfitting. Fifth, two different classification models were tested (linear bagging and random forests), and the one providing the best classification performance was used in the validation step. Last, and most important, as already mentioned, the set of samples used to train the classification algorithm was completely independent from the validation set. To the best of our knowledge, even for studies in which the classification tool (either using traditional statistical approaches or ML methods) was validated, the validation was performed as a leave-one-out cross-validation approach. This is known to greatly inflate the estimated accuracy and the generalization ability of the model. Our work shows that MP analysis using properly selected and trained ML models is robust to seasonal variation and truly able to predict the origin in never-seen-before samples (collected in a different year). Indeed, reproducibility across temporal replicates is certainly one of the most relevant features of any traceability tool, but unfortunately is greatly overlooked.
Overall, the present study provides compelling evidence that traceability tools based on ML-enabled analysis of host-associated microbiota might be a powerful weapon in the war against illegal seafood trading. At the moment we have tested our ML method focusing on relatively “short” geographic distances (even if we have included all major Manila clam farming sites of Europe in terms of production [32]) because it is more challenging to discriminate the provenance of samples on small geographic distances rather than on broader distances, as demonstrated by Mamede et al. [33]. However, the decreasing costs of DNA sequencing will also make it possible to create much larger training data sets, with broader geographic coverage and temporal replicates. This, in turn, should greatly improve classification accuracy and limit the “large p small n” problem, where p is the number of features/predictors used and n is the number of cases. The rapid and continuous decline of cost for DNA sequencing will make these tools increasingly affordable and the cost is already on par or even lower than for other methods such as FA profiling and trace elements analysis. Furthermore, 16S rRNA metabarcoding data easily comply with the FAIR principles (Findability, Accessibility, Interoperability, and Reuse), as protocols and public data repositories are well established for this type of data, although further efforts are needed, especially toward the standardization of DNA extraction methods and the harmonization of the 16S rRNA gene region to be used. Based on our results and other available evidence, we believe that the relative stability of host-associated microbiota likely allows high traceability accuracy for sessile species (e.g. bivalves, macroalgae). For species that are highly mobile, such stability might trace the area where the animal/plant was located during the early life stages, but be less precise in tracking more recent movements.
Another potential limitation is the time required for data production and analysis. Currently, the limiting step is DNA sequencing, which is based on Illumina technology and requires a couple of weeks to obtain DNA sequence data. However, the advent of third-generation sequencing technologies is making it possible to greatly speed up the process and in the case of Nanopore 16S barcoding kit, also to make it easily portable if implemented on a MinION instrument. In addition, long-read sequencing provides full length sequence information of the 16S rRNA gene, overcoming the previously mentioned problem of finding a consensus for the gene fragment to be sequenced and, at the same time, allowing for increased taxonomic resolution, with a higher likelihood to identify site-specific ASVs. We expect that in the near future, it will be possible to carry out rapid, in-the-field, 16S rRNA metabarcoding analyses. In fact, rapid DNA extraction protocols suitable for long-read sequencing are already becoming available [34], the array of technological solutions for portable, miniaturized PCR thermal cyclers is rapidly expanding [35], and the use of pre-trained ML-based algorithms for 16S data analysis using cloud computing has already been reported [36]. Furthermore, ML performance usually increases when the data size grows. Thus, we may expect significant improvement when more samples are collected and used to train the models. However, the successful implementation and improvement of the model will strongly depend on a close collaboration with regulatory bodies and industry stakeholders which will also favour its adoption and further extend its usage.
Conclusions
In conclusion, the present study demonstrates that ML-enabled analysis of host-associated microbiota is already a key instrument complementing the toolbox of seafood traceability, showing the importance of addressing the most relevant steps to ensure classification accuracy. Looking ahead, we expect that highly portable and interoperable diagnostic tools based on the approach proposed here will become available, making prevention of illegal seafood trade rapid and affordable.
Methods
Samples collection, DNA extraction and sequencing
Commercially harvestable adults of Manila clam Ruditapes philippinarum were collected during four sampling campaigns conducted in July 2018 (S18), January 2019 (W19), July 2019 (S19) and January 2020 (W20) along the Venice lagoon and neighbouring areas (Fig. 2). During each sampling expedition, around 100 clams were collected from four farming areas (i.e. Marano Lagunare (MA); Chioggia (CL); Scardovari (SC); Goro (GO)) and one polluted site (i.e. Porto Marghera (PM)) by a mechanical rake and following official regulations for bivalve commercialisation. Landed animals were kept in a depuration centre for 16 h where they were kept in open flow-through systems that continuously receive natural sea water that is mechanically, biologically and chemically (i.e. UV) filtered. After this step, animals were brought to the laboratory where the entire gills (GI) and digestive gland (DG) were dissected from each clam using sterilized scalpels. Tissue samples were immediately transferred to 1.5-ml microcentrifuge tubes containing molecular grade ethanol (90%). Samples were then refrigerated at 4 ℃ until further analysis. DNA was extracted from pooled tissues obtained by pooling 10 GI or DG tissue pieces of similar weight from clams collected from the same expedition and area. DNA was extracted and purified using a DNA PowerSoil kit (QIAGEN, Hilden, Germany) following the manufacturer’s instructions with an additional step of Proteinase K to improve cell lysis. DNA integrity was verified using agarose gel electrophoresis (1%), while its quantity was estimated by NanoDrop ND1000. DNA aliquots were sent to BMR Genomics (Padua, Italy) where a fragment of the 16S rDNA spanning the V3-V4 regions was PCR-amplified and sequenced using MiSeq 2 × 300 pair-end (PE) sequencing technology.
16S rRNA libraries preparation
Raw reads were processed and analysed using Quantitative Insights into Microbial Ecology 2 v. 2019.1 (QIIME 2) [37]. Primer sequences were removed using cutadapt [38]. DADA2 [39] was used to filter low-quality sequences and to merge forward and reverse libraries to obtain high-quality representative sequences. The same program was also used to remove chimeric reads from the final sequence dataset. Representative sequences were aligned using MAFFT software [40] and classified using the Python library Scikit-Learn [41]. Taxa assignment was carried out using the v.142 SILVA database trained for V3-V4 regions. Finally, in order to standardize reads libraries to a common sampling depth, reads were rarefied by randomly subsampling by the minimum number of reads. These QIIME 2 outputs, including the abundance feature table and taxonomy, were used in the training and testing of machine learning (ML) based classification.
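The rarefaction step described above (random subsampling of every library down to the minimum number of reads) can be illustrated with a short, standard-library Python sketch. This is not the QIIME 2 implementation used in the study; the function name `rarefy`, the sample names and the read counts are hypothetical and serve only to show the idea.

```python
import random
from collections import Counter


def rarefy(counts, depth, seed=0):
    """Randomly subsample one sample's reads, without replacement, down to `depth`."""
    total = sum(counts.values())
    if total < depth:
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)
    # Expand the count table into one entry per read, then draw `depth` reads.
    reads = [asv for asv, n in counts.items() for _ in range(n)]
    return dict(Counter(rng.sample(reads, depth)))


# Hypothetical read counts per ASV for two samples (not data from the study).
table = {
    "sample_A": {"ASV1": 60, "ASV2": 30, "ASV3": 10},
    "sample_B": {"ASV1": 20, "ASV2": 25, "ASV4": 5},
}

# Common sampling depth: the smallest library size across samples.
depth = min(sum(counts.values()) for counts in table.values())
rarefied = {name: rarefy(counts, depth) for name, counts in table.items()}
print(depth, rarefied)
```

After this step every sample carries the same number of reads, so differences in presence or absence of an ASV are less likely to reflect sequencing depth alone.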
Machine learning procedures
Using ML, two classification problems were tested: (i) the possibility of discriminating between clams harvested in authorized farming areas and those illegally collected in the interdicted polluted area; (ii) the possibility of tracing the origin of clams grown in the four most important farming sites of the north Adriatic Sea, as each of these areas carries its own commercial brand (e.g. “Vongola verace di Scardovari”). In the first case, we trained and tested the ML-based tool in a two-class classification problem, i.e. to discriminate between samples from the interdicted area (PM) and the geographically closest farming site, Chioggia (CL) (Fig. 2). Likewise, we tested the comparison between PM and the most important farming area, Scardovari (SC). The second scenario consisted in a multiple class classification problem, where we tested the ability of the ML algorithm to discriminate the provenance of clams by including all farming sites together (i.e. GO; CL; SC; MA).
P epa a ion o inpu da a
Fo all case s udies, inpu da a we e il e ed o emo e
ASVs wi h low coun s, o educe noise and o selec a
smalle numbe o ea u es (p edic o s). In o al o each
case s udy, 8 inpu iles we e gene a ed, one o each
combina ion o issue (i.e. DG and GI) and sampling ime
(i.e. S18, S19, W19, W20). Mo e in de ail, he ASVs able
was ini ially impo ed in R/ 4.2.1 (R Co e Team, 2014)
and il e ed in o de o keep only loca ions ha we e
in ol ed in he ele an compa ison. Then, o each issue
and season (i.e. Summe o Win e ) he ASVs able was
spli be ween he i s and second yea o sampling and
was il e ed o keep only ASVs wi h an abundance highe
han 5% in a leas one o he samples, in bo h yea s.
Then he il e ed ASV ables we e con e ed o bina y
ma ices (i.e. p esence/absence), and he axonomy ile
was upda ed o keep only ASVs ha we e p esen in he
bina y ma ices. The R code used o gene a e he inpu
iles can be ound on Gi Hub a he ollowing link [42]:
h ps :// g i hub. com/ GEMMA- BCA/ Machi ne- Le a n ing-
assis ed- Mic o biome- analy sis.
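The filtering and binarisation steps can be sketched as follows. The study performed them in R; this Python version is only an illustration under stated assumptions: "abundance" is read as relative abundance within a sample, "in both years" is read as passing the 5% threshold in at least one sample of each year, and an ASV is scored as present when at least one of its reads is observed. All function names, sample names and counts are hypothetical.

```python
def relative_abundance(sample):
    """Convert one sample's read counts into fractions of that sample's total."""
    total = sum(sample.values())
    return {asv: n / total for asv, n in sample.items()} if total else {}


def asvs_above(samples, threshold):
    """ASVs whose relative abundance exceeds `threshold` in at least one sample."""
    keep = set()
    for sample in samples.values():
        keep |= {a for a, frac in relative_abundance(sample).items() if frac > threshold}
    return keep


def binary_matrix(samples, features):
    """Presence/absence rows: 1 if any read of the ASV was observed (assumed rule)."""
    return {name: [int(s.get(a, 0) > 0) for a in features] for name, s in samples.items()}


# Hypothetical count tables for one tissue and one season, split by sampling year.
year1 = {"PM_1": {"ASV1": 80, "ASV2": 15, "ASV3": 5}, "CL_1": {"ASV2": 50, "ASV4": 50}}
year2 = {"PM_2": {"ASV1": 70, "ASV3": 30}, "CL_2": {"ASV2": 90, "ASV4": 4, "ASV5": 6}}

# Keep only ASVs that pass the 5% threshold in both years (assumed interpretation).
features = sorted(asvs_above(year1, 0.05) & asvs_above(year2, 0.05))
X_year1 = binary_matrix(year1, features)
X_year2 = binary_matrix(year2, features)
print(features, X_year1, X_year2)
```

Restricting both years to a shared feature set matters here because the model trained on one year must receive the same columns, in the same order, when it predicts samples from the other year.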
Identifying clams from the interdicted area
Binary matrices from ASV tables were imported in Jupyter notebook 6.4.12 (https://jupyter.org). The library scikit-learn [41] was loaded in python/3.9.13 and used to train a random forest (RF) classifier with the function “RandomForestClassifier”. Initially, the RF model was cross-validated (i.e. where the training set was split into k-folds, k-1 folds were used for training and the resulting model was validated on the remaining part of the data) to define the best parameters (i.e. the number of estimators “n_estimators”, ranging from 12 to 18, and the max depth of trees “max_depth”, ranging from 1500 to 2000) and the model with the highest Cohen kappa score was chosen. Cross-validation was performed on second-year data only (i.e. S19 + W20 together) and a separate RF model for each tissue was built and cross-validated independently. After cross-validation, the optimized RF model was fitted on data from the second year, and then this was used to predict novel, never-seen-before data from the first year (i.e. S18 + W19) on each tissue separately.
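A minimal sketch of this train-on-year-2, predict-on-year-1 workflow with scikit-learn is given below. It is not the authors' notebook: the synthetic presence/absence data, the number of folds (5), the step between `max_depth` values and the `random_state` settings are all assumptions made for illustration. Only the parameter ranges and the Cohen kappa selection criterion come from the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)

def toy_presence_absence(n_per_class, n_asvs=30):
    """Synthetic binary matrix: the first five ASVs are more frequent in class 1."""
    y = np.repeat([0, 1], n_per_class)
    p = np.full((y.size, n_asvs), 0.3)
    p[y == 1, :5] = 0.9
    X = (rng.random(p.shape) < p).astype(int)
    return X, y

X_year2, y_year2 = toy_presence_absence(30)  # stands in for S19 + W20
X_year1, y_year1 = toy_presence_absence(20)  # stands in for S18 + W19

param_grid = {
    "n_estimators": list(range(12, 19)),  # 12 to 18, as stated in the text
    "max_depth": [1500, 1750, 2000],      # range from the text; the step is assumed
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring=make_scorer(cohen_kappa_score),  # select by Cohen kappa
    cv=5,                                    # number of folds is an assumption
)
search.fit(X_year2, y_year2)  # cross-validates, then refits the best model on all year-2 data
year1_pred = search.predict(X_year1)
year1_kappa = cohen_kappa_score(y_year1, year1_pred)

# Indices of the five most important ASVs in the refitted forest.
top_asvs = np.argsort(search.best_estimator_.feature_importances_)[::-1][:5]
```

Running one such search per tissue, on that tissue's own binary matrix, would mirror the independent per-tissue models described above.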
The performance of the model in predicting the provenance of clams sampled in the first year was assessed separately for each tissue by using the Area Under the Curve of a Receiver Operating Characteristic (AUC-ROC) with the function “roc_auc_score” and by means of confusion matrices. For each model the most important features were obtained using the function “feature_importances”. In addition, we assessed if a consensus model, derived by combining the separate predictions from the two tissues, would improve the performances over the single-tissue models. For this purpose, we obtained the predicted class probabilities for each input sample from both tissues with the function “predict_proba”; then for each sample these probabilities were summed, and the class having the highest sum was deemed as the predicted class for this consensus model. The performance of the consensus model was evaluated by using AUC-ROC score and confusion matrices, as previously described.
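The consensus rule described above (sum the per-class probabilities from the two tissue models, then take the class with the highest sum) can be illustrated with a few lines of NumPy. The function name and the probability values are hypothetical. The sketch assumes both models report their classes in the same column order and that each row refers to the same clam in both tissues.

```python
import numpy as np

def consensus_predict(proba_gi, proba_dg, classes):
    """Sum per-class probabilities from the two tissue models and return
    the class with the highest sum for each sample."""
    summed = np.asarray(proba_gi) + np.asarray(proba_dg)
    # argmax returns the first maximum, so exact ties go to the first class.
    return np.asarray(classes)[summed.argmax(axis=1)], summed

# Hypothetical predict_proba outputs for three samples, columns ordered as `classes`.
classes = ["CL", "PM"]
proba_gi = [[0.80, 0.20], [0.45, 0.55], [0.30, 0.70]]
proba_dg = [[0.60, 0.40], [0.70, 0.30], [0.40, 0.60]]
labels, summed = consensus_predict(proba_gi, proba_dg, classes)
```

For the second sample the two tissues disagree (gills favour PM, digestive gland favours CL); the consensus follows the more confident model and assigns CL.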
Tracing farming sites
For the multiple class problem, instead of a RF classifier, a multinomial logistic regression was set up with the function “LogisticRegression” of the python sklearn library and the following options: “multi_class = 'multinomial', solver = 'lbfgs', penalty = 'l2', class_weight = 'balanced', fit_intercept = False, C = 1”. To achieve a better final prediction, this “base” model was coupled with bootstrap aggregation by using the ensemble meta-estimator bagging classifier (with the function “BaggingClassifier”). Cross-validation was used to define the best parameters of the bagging classifier (i.e. the number of estimators “n_estimators”, ranging from 500 to 1500, and the maximum number of features to train each base estimator “max_features”, ranging from 100 to 300) and the model with the highest Cohen kappa score was chosen. Cross-validation was performed on second-year data only (i.e. S19 + W20 together) and a separate model for each tissue was built and cross-validated independently. After cross-validation, the optimised bagging classifier model was fitted on data from the second year, and then this was used to predict novel, never-seen-before data from the first year (i.e. S18 + W19) on each tissue separately.
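The bagged logistic regression described above might be set up along the following lines. This is a sketch, not the authors' code: the synthetic four-site data, the 3-fold cross-validation, `max_iter=1000` and the `random_state` values are assumptions, and the grid is deliberately much smaller than the 500 to 1500 estimators and 100 to 300 features reported in the text so that the toy example runs quickly. The `multi_class` and `penalty` options are left out because `'l2'` is the library default and `multi_class` is deprecated in recent scikit-learn releases, where the `lbfgs` solver fits a multinomial model for multi-class targets by default.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, make_scorer
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
sites = np.array(["GO", "CL", "SC", "MA"])

def toy_presence_absence(n_per_site, n_asvs=40):
    """Synthetic binary matrix: each site has its own block of five frequent ASVs."""
    y = np.repeat(sites, n_per_site)
    p = np.full((y.size, n_asvs), 0.2)
    for i, site in enumerate(sites):
        p[y == site, 5 * i:5 * (i + 1)] = 0.9
    return (rng.random(p.shape) < p).astype(int), y

X_year2, y_year2 = toy_presence_absence(20)  # stands in for S19 + W20
X_year1, y_year1 = toy_presence_absence(10)  # stands in for S18 + W19

# Base model with the options from the text (max_iter is an added assumption).
base = LogisticRegression(solver="lbfgs", class_weight="balanced",
                          fit_intercept=False, C=1.0, max_iter=1000)
search = GridSearchCV(
    BaggingClassifier(base, random_state=0),
    {"n_estimators": [10, 20], "max_features": [10, 20, 30]},  # toy-sized grid
    scoring=make_scorer(cohen_kappa_score),
    cv=3,  # number of folds is an assumption
)
search.fit(X_year2, y_year2)
year1_proba = search.predict_proba(X_year1)  # one column per site, ordered as search.classes_
year1_pred = search.predict(X_year1)
```

The per-site probabilities in `year1_proba` are what a two-tissue consensus would sum before picking the most probable site.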
The performance of the model in predicting the provenance of clams was assessed as described in the previous section. Similarly, the assessment of the performances of a consensus model derived by combining the separate predictions from the two tissues was performed as described above. Finally, for each model, the most relevant features were extracted using a custom python function and, for each farming site, the top 15 features were plotted using the R package ComplexHeatmap/2.14.0 [43].
Abbreviations
MP Microbiome profiling
ML Machine learning
GI Gills
DG Digestive gland
ASVs Amplicon sequence variants
PCBs Polychlorobiphenyls
FA Fatty acids
MA Marano
CL Chioggia
SC Scardovari
GO Goro
PM Porto Marghera
RF Random forest
Supplementary Information
The online version contains supplementary material available at https://doi.org/10.1186/s12915-024-02005-w.
Additional file 1: Table S1. Precision, recall, f1-score and support obtained in gills (GI) and digestive gland (DG) when comparing Porto Marghera versus Chioggia (PM vs CL) and Porto Marghera versus Scardovari (PM vs SC) farming sites. Feature importance in gills and digestive gland is also reported.
Additional file 2: Figure S1. Confusion matrices showing the results of the ML predicted provenance (“Predicted label”) versus the real provenance (“True label”) for each of the tested samples by using gills (GI) only (left column), digestive gland (DG) only (middle column) or by combining GI and DG into a consensus prediction (right column). A) Classification discriminating between the polluted site PM and the clean farming sites of MA. B) Classification discriminating between the polluted site PM and the clean farming sites of GO.
Acknowledgements
Authors acknowledge the funding of the Italian Ministry of Education, University and Research (MIUR) for the project “Centro di Eccellenza per la Salute degli Animali Acquatici ECCE AQUA”. Authors are grateful to the following cooperatives/companies for their assistance in obtaining samples and/or assistance during the depuration process: Cam - Conservificio Allevatori Molluschi Srl, Blupesca Srl, cooperativa raccolta allevamento molluschi eduli C.R.A.M.E., Naturedulis, Finpesca SPA, Cooperativa Agricola ALMAR.
Authors’ contributions
LB2, MM, TP and PF conceived the project; LB1, AZ, ADB and FS coordinated sampling activities and clams depuration. GDR, BC, MS and SI performed animal sampling, DNA extraction and Qiime2 analyses; LP and PF performed machine learning analyses; FC and LB2 wrote the first draft of the manuscript with input from LP and MM; all authors revised and critically commented on the manuscript. All authors read and approved the final manuscript.
Funding
Open access funding provided by Università degli Studi di Padova. POR FESR 2014–2020 – Bando DGR n. 1139 del 19/07/2017 – Domanda di sostegno ID 10062983.
European project FishEUTrust (grant agreement 101060712) awarded to TP.
Availability of data and materials
Raw DNA libraries were deposited at the NCBI repository under the BioProject access number: PRJNA1013079 [24]. The R code used to generate the input files can be found on GitHub at the following link [42]: https://github.com/GEMMA-BCA/Machine-Learning-assisted-Microbiome-analysis.
Declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Author details
1 Department of Comparative Biomedicine and Food Science, University of Padova, Viale Dell’Università 16, Legnaro 35020, Italy. 2 La Vongola Verace Di Chioggia, Chioggia, Italy. 3 LEMAR, UMR 6539 CNRS/UBO/IRD/IFREMER, Institut Universitaire Européen de La Mer, Place Nicolas Copernic, Plouzané 29280, France. 4 Experteam S.R.L, Via Della Libertà, 12, Marghera 30175, Italy. 5 Almar Soc. Coop. Agricola Arl, Via G. Raddi, 2, Marano Lagunare 33050, Italy. 6 Department of Medical Sciences, University of Torino, Via Santena 19, Turin 10126, Italy.
Received: 17 January 2024 Accepted: 3 September 2024
References
1. Thomsen ST, Assunção R, Afonso C, Boué G, Cardoso C, Cubadda F, et al. Human health risk–benefit assessment of fish and other seafood: a scoping review. Crit Rev Food Sci Nutr. 2022;62:7479–502.
2. Mamede R, Ricardo F, Santos A, Díaz S, Santos SAO, Bispo R, et al. Revealing the illegal harvesting of Manila clams (Ruditapes philippinarum) using fatty acid profiles of the adductor muscle. Food Control. 2020;118:107368.
3. Martini A, Aguiari L, Capoccioni F, Martinoli M, Napolitano R, Pirlo G, et al. Is Manila clam farming environmentally sustainable? A Life Cycle Assessment (LCA) approach applied to an Italian Ruditapes philippinarum hatchery. Sustainability. 2023;15(4):3237.
4. Humphreys J, Harris MRC, Herbert RJH, Farrell P, Jensen A, Cragg SM. Introduction, dispersal and naturalization of the Manila clam Ruditapes philippinarum in British estuaries, 1980–2010. J Mar Biol Assoc UK. 2015;95:1163–72.
5. Turolla E, Castaldelli G, Fano EA, Tamburini E. Life cycle assessment (LCA) proves that Manila clam farming (Ruditapes philippinarum) is a fully sustainable aquaculture practice and a carbon sink. Sustainability. 2020;12(13):5252.
6. Bernardini I, Matozzo V, Valsecchi S, Peruzza L, Rovere GD, Polesello S, et al. The new PFAS C6O4 and its effects on marine invertebrates: First evidence of transcriptional and microbiota changes in the Manila clam Ruditapes philippinarum. Environ Int. 2020;2021(152):106484.
7. Masanja F, Yang K, Xu Y, He G, Liu X, Xu X, et al. Bivalves and microbes: a mini-review of their relationship and potential implications for human health in a rapidly warming ocean. Front Mar Sci. 2023;10:1–9.
8. Gray M, Barbour N, Campbell B, Robillard AJ, Todd-Rodriguez A, Xiao H, et al. Ecolabels can improve public perception and farm profits for shellfish aquaculture. Aquac Environ Interact. 2021;13:13–20.
9. Varrà MO, Zanardi E, Serra M, Conter M, Ianieri A, Ghidini S. Isotope fingerprinting as a backup for modern safety and traceability systems in the animal-derived food chain. Molecules. 2023;28(11):4300.
10. Santos A, Ricardo F, Domingues MRM, Patinha C, Calado R. Current trends in the traceability of geographic origin and detection of species-mislabeling in marine bivalves. Food Control. 2023;152:109840.
11. Araújo DF, Ponzevera E, Knoery J, Briant N, Bruzac S, Sireau T, et al. Can copper isotope composition in oysters improve marine biomonitoring and seafood traceability? J Sea Res. 2023;191:102334.
12. Milan M, Maroso F, DallaRovere G, Carraro L, Ferraresso S, Patarnello T, et al. Tracing seafood at high spatial resolution using NGS-generated data and machine learning: comparing microbiome versus SNPs. Food Chem. 2018;2019(286):413–20.
13. Zhukova NV. Fatty acids of marine mollusks: Impact of diet, bacterial symbiosis and biosynthetic potential. Biomolecules. 2019;9:1–25.
14. Grahl-Nielsen O, Jacobsen A, Christophersen G, Magnesen T. Fatty acid composition in adductor muscle of juvenile scallops (Pecten maximus) from five Norwegian populations reared in the same environment. Biochem Syst Ecol. 2010;38:478–88.
15. Liu X, Teixeira JS, Ner S, Ma KV, Petronella N, Banerjee S, et al. Exploring the potential of the microbiome as a marker of the geographic origin of fresh seafood. Front Microbiol. 2020;11:1–9.
16. Singh P, Williams D, Velez FJ, Nagpal R. Comparison of the gill microbiome of retail oysters from two geographical locations exhibited distinct microbial signatures: a pilot study for potential future applications for monitoring authenticity of their origins. Appl Microbiol. 2022;3:1–10.
17. Cohen FPA, Pimentel T, Valenti WC, Calado R. First insights on the bacterial fingerprints of live seahorse skin mucus and its relevance for traceability. Aquaculture. 2018;492:259–64.
18. Tatsadjieu NL, Maïworé J, Hadjia MB, Loiseau G, Montet D, Mbofung CMF. Study of the microbial diversity of Oreochromis niloticus of three lakes