International Journal of Data Science and Analytics (2025) 20:3965–3979
https://doi.org/10.1007/s41060-024-00704-9
REGULAR PAPER
Automated cetacean detection in UAV imagery using AI models: a case study on Delphinid species
João Canelas1,3 · Luana Clementino2 · André Cid3 · Joana Castro3,4 · Inês Machado2,5 · Susana Vieira1
Received: 30 June 2024 / Accepted: 13 December 2024 / Published online: 10 January 2025
© The Author(s) 2024
Abstract
The identification and quantification of marine mammals is crucial to understanding their abundance, ecology and supporting
their conservation efforts. Traditional methods for detecting cetaceans, however, are often labor-intensive and limited in
their accuracy. To overcome these challenges, this work explores the use of convolutional neural networks (CNNs) as a
tool for automating the detection of cetaceans through aerial images from unmanned aerial vehicles (UAVs). Additionally,
the study proposes the use of Long-Short-Term-Memory (LSTM)-based models for video detection using a CNN-LSTM
architecture. Models were trained on a selected dataset of dolphin examples acquired from 138 online videos with the aim
of testing methods that hold potential for practical field monitoring. The approach was effectively validated on field data,
suggesting that the method shows potential for further applications for operational settings. The results show that
image-based detection methods are effective in the detection of dolphins from aerial UAV images, with the best-performing model,
based on a ConvNext architecture, achieving high accuracy and f1-score values of 83.9% and 82.0%, respectively, within
field observations conducted. However, video-based methods showed more difficulties in the detection task, as LSTM-based
models struggled with generalization beyond their training environments, achieving a top accuracy of 68%. By reducing the
labor required for cetacean detection, thus improving monitoring efficiency, this research provides a scalable approach that
can support ongoing conservation efforts by enabling more robust data collection on cetacean populations.
Keywords Unmanned aerial vehicles · Convolutional neural networks · Long-short-term-memory · Machine learning ·
Marine mammals detection · Photo identification
Luana Clementino, André Cid, Joana Castro, Inês Machado and Susana Vieira contributed equally to this work.
João Canelas
[email protected]
Luana Clementino
luana.clementino@wavec.org
André Cid
[email protected]
Joana Castro
[email protected]
Inês Machado
ines.machado@wavec.org
Susana Vieira
[email protected]
1 IDMEC, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisbon, Portugal
2 WavEC Offshore Renewables, Edifício Diogo Cão, Doca de Alcântara Norte, 1350-352 Lisbon, Portugal
3 AIMM - Associação para a Investigação do Meio Marinho, Rua Maestro Fred. Freitas N15-1, 1500-399 Lisbon, Portugal
4 MARE - Marine and Environmental Sciences Centre/ARNET - Aquatic Research Network, Laboratório Marítimo da Guia,
Faculdade de Ciências da Universidade de Lisboa, Av. Nossa Senhora do Cabo, 939, 2750-374 Cascais, Portugal
5 MARE - Marine and Environmental Sciences Centre/ARNET - Aquatic Research Network, Faculdade de Ciências da
Universidade de Lisboa, Campo Grande, 1749-016 Lisbon, Portugal
1 Introduction
Cetaceans play a key role in maintaining ecosystem stability, acting as sentinel or indicator species
that reflect the overall state of the ocean’s health [1,2]. Monitoring and safeguarding the diversity
and abundance of cetaceans is imperative to support conservation efforts (e.g., through conventions
and agreements) and achieve Good Environmental Status (GES) in European waters [3]. Achieving Good
Environmental Status (GES) in European waters is a key objective under the Marine Strategy Framework
Directive (MSFD), which was adopted by the European Union to evaluate and maintain the health of the
marine environment. GES is defined by eleven descriptors that assess various aspects of marine
ecosystems, enabling a comprehensive evaluation of marine conditions and the pressures from human
activities. This approach
aligns with similar international conventions, such as the United Nations Sustainable Development
Goal 14, which targets the conservation and sustainable use of oceans, seas, and marine resources,
as well as the OSPAR Convention, which focuses on the protection of the North-East Atlantic marine
environment. These frameworks collectively contribute to a more resilient and sustainably managed
global marine ecosystem. Monitoring and assessing the achievement of GES is particularly challenging
given that cetaceans are highly mobile species, distributed over large areas, and moving across
various marine habitats subject to diverse anthropogenic pressures. These pressures include
incidental by-catch in fishing gear, bioaccumulation of pathogens and toxins, harmful algal blooms,
collisions with ships, underwater noise and climate change [4–7]. More recently, the advancement of
offshore renewable energy has further intensified these challenges. Such projects often target large
marine areas, commonly overlapping with cetacean habitats, thereby escalating the pressure between
conservation needs and energy exploitation [8]. The vulnerability of these species and the
exploitation of their habitats underscore the importance of their conservation; thus, it is
imperative to improve our current understanding of cetacean distribution patterns.
However, such studies are excessively costly, posing a significant barrier to advancing conservation
efforts. Traditional methods to study and monitor marine mammal populations involve visual surveys
from a defined platform (e.g., aerial, ship-based, or land-based), acoustic surveys [10,11],
observation of Very High Resolution (VHR) satellite images [12,13], and observation methods that
allow a more thorough understanding (e.g., capture-recapture) [14]. Furthermore, emerging
methodologies, such as remote sensing through photo detection and identification, present a
promising tool for complementing such methods while reducing associated costs and risks [15–18].
Unmanned Aerial Vehicles (UAVs) are equipped with imaging sensors that can collect extremely
high-resolution data, thus becoming an increasingly used tool for researchers to observe marine
wildlife and study cetaceans. UAVs are a non-invasive method [19] that allows the detectability of
animals in subsurface waters, thus increasing the time available for detection [20]. UAVs have an
increasing number of applications, such as monitoring abundance and distribution [21], photo
identification [22], behavioral studies [23], among others [20].
Nonetheless, this new technology still presents limitations and challenges, particularly associated
with data management. The high volume of data generated requires efficient processing solutions, as
manual inspection is often impractical and prone to human error [24,25]. As machine learning and
computer vision advance rapidly, automated computer vision models present a promising solution for
automating the inspection process. Since the scientific developments by Krizhevsky et al. [26]
proving the efficacy of deep learning algorithms in image recognition, Convolutional Neural Networks
(CNNs) have become the model of choice for image detection and identification, achieving results on
par with human performance in detection and identification tasks [27]. These models have been
successfully applied to individual identification of whales [28–30] and dolphins [31], with methods
that can be adapted to other cetacean species [29]. There have also been applications for whale
counting through satellite VHR images [32], where combining the usual counting procedure with an
initial detection of whale presence has improved model accuracy and computational efficiency [32].
However, these are limited to species of large size due to the spatial resolution of images and are
difficult to develop due to the lack of open VHR image datasets.
Models developed for image detection, however, are limited to the events occurring in one single
frame, potentially missing important information and context from the video sequences obtained with
UAVs. That said, algorithms capable of handling video frames, such as Recurrent Neural Networks
(RNNs), leverage the temporal continuity and contextual information provided by video sequences
while reducing information missed, thus improving detection capability. Still, the development of
such models represents a higher degree of complexity, and studies exploring their efficacy on marine
mammal detection are relatively limited [33–35].
The main objective of this study is to develop machine learning models capable of automating the
detection of cetaceans, specifically delphinids, using UAV data. The present work explores the
implementation of well-documented CNN architectures directed to image detection, while also
proposing the use of a Deep Fake detection algorithm based on Guera and Delp’s “Deepfake video
detection using recurrent neural networks” [36], applied to the detection of marine mammals in video
sequences. This approach builds upon current methodologies, but also explores new avenues through
the use of a Long-Short-Term-Memory (LSTM) network, a specific RNN model, seeking to harness the
additional information provided through video analysis.
This study explores the synergies between deep learning and marine science, focusing on the
potential to enhance environmental monitoring and impact assessment strategies in offshore
environments, particularly for the conservation of dolphin populations. These findings provide a
methodological basis for improving data quality, which can support future efforts aimed at advancing
sustainable management and conservation efforts in marine ecosystems.
2 Data acquisition
In-situ collection of a sufficient volume of remote sensing data suitable for the development of an
efficient identification model is challenging due to the high cost of equipment and logistical
constraints associated with ocean surveys. Furthermore, datasets on species with broad distribution
ranges, such as cetaceans, are scarce, and publicly available datasets tailored to aerial detection
are largely non-existent. As a result, the data used to build the models were obtained by scraping
video files from various online sources (e.g., YouTube, Pexels, Dailymotion, etc.). Given the
challenging nature of developing such datasets, data gathered for this study were limited to species
of the family Delphinidae, as they are among the most accessible cetaceans in publicly available
footage. With the intent of gathering as much data as possible, no pre-selection criteria such as
location or time period were applied during collection.
2.1 Training dataset
The videos obtained were recorded in diverse locations and under varying conditions, leading to
significant variability in water characteristics such as hue, brightness, and foam, as well as
differences among delphinid species. This diversity enhances the model’s ability to generalize
across different environmental settings. Furthermore, to achieve representative samples where no
dolphins are present, the videos also include objects or subjects such as boats, boards, and
swimmers, which served as potential confounding elements for the model.
The resulting data consisted of 138 aerial videos of varying durations and settings. Some videos
exclusively contained cetacean footage, others solely featured water scenes, and some combined both
elements. These 138 videos were then processed to create two distinct collections of data: one
tailored to image classification and the other to video classification. For image classification,
individual frames were extracted from the videos, providing static samples. For video
classification, the original video segments were retained to capture dynamic features. While data
for both methodologies were derived from the same set of videos, these were processed to suit the
specific requirements of each classification type.
2.1.1 Image data
Image data were generated by deconstructing the original 138 raw videos into images by extracting
frames at a specified rate of one frame every three seconds using the open software FFmpeg. This
rate can be adjusted based on user needs and the source of extraction, as some videos may include
more or less irrelevant data. In this study, images were categorized into two distinct classes,
based on the presence or absence of cetaceans: “Cetacean” and “No Cetacean”. Images were initially
filtered to exclude the frames that were poor representatives of their class, such as cases where
subjects were obstructed or not in the frame. Additional manual selection
was also conducted on frames that were good representatives. The resulting set of data consisted of
2496 images, divided into its respective classes. The “No Cetacean” class
included images where no cetaceans were present, as well as images with other surface or subsurface
artifacts that could lead the model to incorrectly label them as containing a cetacean. Including
these artifacts within the “No Cetacean” class helps to correct for potential false positives by
exposing the model to non-cetacean images that may resemble cetaceans. The “Cetacean” class included
images where at least one cetacean was present. The classification process resulted in 776 images
(approximately 31.1%) representing the “No Cetacean” class, and 1720 images (about 68.9%)
representing the “Cetacean” class.
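
As a rough illustration of the frame extraction step described above, the sketch below builds an FFmpeg command that samples one frame every three seconds through FFmpeg's `fps` filter. The helper names, file names, and output naming pattern are hypothetical; the text only states that FFmpeg was used at this rate, so this is one plausible way to do it rather than the study's actual command.

```python
import subprocess
from pathlib import Path


def build_extract_command(video: Path, out_dir: Path, seconds_per_frame: int = 3) -> list[str]:
    """Return an FFmpeg argument list that keeps one frame every `seconds_per_frame` seconds."""
    if seconds_per_frame <= 0:
        raise ValueError("seconds_per_frame must be positive")
    # Hypothetical naming scheme: <video stem>_0001.jpg, <video stem>_0002.jpg, ...
    pattern = out_dir / f"{video.stem}_%04d.jpg"
    return [
        "ffmpeg",
        "-i", str(video),
        # The fps filter accepts a rational rate; 1/3 means one frame per three seconds.
        "-vf", f"fps=1/{seconds_per_frame}",
        str(pattern),
    ]


def extract_frames(video: Path, out_dir: Path, seconds_per_frame: int = 3) -> None:
    """Run FFmpeg (it must be installed and on PATH) to write the sampled frames."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_extract_command(video, out_dir, seconds_per_frame), check=True)
```

Keeping the command construction separate from its execution makes the sampling rate easy to adjust per video, which matches the remark that the rate can be tuned to the source.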
The notable imbalance in the number of images per class is due to the limited variation in water
surface patterns over time. Frames extracted within a few seconds of each other are often nearly
identical, providing minimal additional value. On the other hand, a dataset heavily composed of
cetacean images could bias model predictions, increasing the rate of false positives. To mitigate
this, the number of images in the “No Cetacean” class was increased by artificially generating new
sea images from existing ones. This was achieved by introducing random variations in brightness,
hue, and saturation to all newly generated samples. Additionally, further transformations were
applied with varying probabilities: sharpness enhancement (25%), random mirroring (25%), blurring
(25%), random rotations (15%), and random cropping (30%).
The described set of transformations was applied a total of 944 times on randomly selected samples
from the “No Cetacean” class, generating an additional 944 images. This
augmentation was performed to equalize the number of samples with that of the “Cetacean” class. The
resulting balanced data were composed of a total of 3440 images, with an equal distribution of 1720
(50%) images per class.
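
The augmentation scheme above can be sketched as a sampling procedure: colour jitter is always applied, and each further transformation is switched on with its stated probability. The sketch below only decides which transformations to apply to each synthetic image; it assumes the optional transformations are drawn independently, which the text does not state, and the names `OPTIONAL_TRANSFORMS`, `sample_augmentation_plan`, and `images_needed` are illustrative.

```python
import random

# Probabilities reported in the text for the optional transformations.
OPTIONAL_TRANSFORMS = {
    "sharpen": 0.25,
    "mirror": 0.25,
    "blur": 0.25,
    "rotate": 0.15,
    "crop": 0.30,
}


def sample_augmentation_plan(rng: random.Random) -> list[str]:
    """Pick the transformations for one synthetic image.

    Colour jitter (brightness, hue, saturation) is always applied; each optional
    transformation is switched on independently (an assumption) with its own probability.
    """
    plan = ["color_jitter"]
    for name, probability in OPTIONAL_TRANSFORMS.items():
        if rng.random() < probability:
            plan.append(name)
    return plan


def images_needed(majority: int, minority: int) -> int:
    """Number of synthetic minority-class images required to balance two classes."""
    return max(majority - minority, 0)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
n_new = images_needed(majority=1720, minority=776)
plans = [sample_augmentation_plan(rng) for _ in range(n_new)]
```

With 1720 “Cetacean” and 776 “No Cetacean” images, `images_needed` gives the 944 synthetic images mentioned in the text.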
2.1.2 Video data
Video data were generated by deconstructing the same 138 raw videos into several smaller videos
(clips) of eight seconds each and subsequently extracting a total of 64 frames from each of these
smaller videos. The initial fragmentation process of the original videos was performed using the
software Adobe Premiere Pro 2020, version 14.0 (Adobe Inc., San Jose, California). Firstly,
intervals of eight seconds were manually selected to accurately represent each class. Simple
transformations, such as mirroring, cropping, varying brightness, and hue, were applied to some of
the samples to introduce variation. Each segment was then exported to create new video samples.
This video length was selected based on careful analysis of initial data acquired, balancing the
goal of capturing comprehensive information on dolphin behavior and movement within a concise time
frame. This interval proved effective for segmenting original videos with frequent transitions and
various added content such as overlays, logos, or artifacts that could otherwise cause unwanted
model responses. A longer interval would have significantly reduced the number of usable samples,
while a shorter window risked losing contextual details, as many segments showed minimal movement
over brief durations. The eight-second length, therefore, provided an optimal compromise, enabling
ample sample quantity while retaining sufficient information for model training.
After this segmentation, each clip is processed using Python’s OpenCV library to extract frames at a
rate of eight frames per second, resulting in a batch of images containing a total of 64 frames per
clip. The choice of frame extraction rate allowed for capturing as much information as possible on
dolphin behavior variations over time, while minimizing the number of images.
After post-processing, the resulting data consisted of 1216 videos, of which 622 belong to the
“Cetacean” class (approximately 51.2%), while the remaining 594 videos belong to the “No Cetacean”
class (approximately 48.8%). This equates to 1216 batches of 64 images each, totaling 77824 images
spanning both classes.
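
The frame budget above follows from simple arithmetic: eight seconds at eight frames per second gives 64 frames per clip. The sketch below computes which source frames would be kept for a clip recorded at a higher native frame rate; the 30 fps source rate and the function name are assumptions for illustration, and a reader such as OpenCV's `cv2.VideoCapture` could then keep only the frames at these indices.

```python
def sample_frame_indices(source_fps: float, clip_seconds: float = 8.0,
                         target_fps: float = 8.0) -> list[int]:
    """Indices of the source frames to keep so a clip yields clip_seconds * target_fps frames."""
    n_target = int(round(clip_seconds * target_fps))
    step = source_fps / target_fps  # number of source frames between two kept frames
    return [int(i * step) for i in range(n_target)]


indices = sample_frame_indices(source_fps=30.0)  # 30 fps is an assumed source rate
frames_per_clip = len(indices)
total_frames = 1216 * frames_per_clip  # 1216 clips, as reported in the text
```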
2.2 Test dataset
To monitor and understand model performance over the course of training, models are tested on data
not involved in their learning process. This practice provides insights into expected performance
and generalization by evaluating samples the model’s parameters were not directly adjusted to,
providing a general understanding of model progress and anticipated behavior within similar data
samples.
In this study, the test data were derived from the original dataset outlined in Sect. 2.1, from
which a small portion was held out. This division creates two distinct subsets: training data,
comprising 80% of the original dataset, used to teach the model to recognize class patterns, and
test data, making up the remaining 20%, to verify the state of models. Empirical studies suggest
optimal results when reserving 20–30% of data for testing while using 70–80% for training [37]. This
separation was done randomly from all available samples while keeping a proportional number of
samples from each class, resulting in 688 (20%) test and 2752 (80%) training samples for image
classification, and 244 (20%) test and 972 (80%) training samples for video classification.
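
A class-proportional random split of this kind can be sketched with the standard library alone. The function below is an illustrative stand-in, not the study's code: the name `stratified_split`, the seed, and the rounding rule are assumptions, and only the balanced image dataset (1720 samples per class) is used as the example.

```python
import random
from collections import defaultdict


def stratified_split(labels: list[str], test_fraction: float = 0.2, seed: int = 0):
    """Return (train_indices, test_indices) with per-class proportions preserved."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # random selection within each class
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Balanced image dataset from Sect. 2.1.1: 1720 samples per class.
labels = ["Cetacean"] * 1720 + ["No Cetacean"] * 1720
train_idx, test_idx = stratified_split(labels)
```

Splitting each class separately is what keeps the class ratio identical in both subsets; a single shuffle over all samples would only preserve it approximately.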
2.3 Validation dataset
The validation data for this study were provided by Associação para a Investigação do Meio Marinho
(AIMM), which supported the research by supplying UAV data from previous expeditions. Data were
acquired on the coastal region in south Portugal within the Faro district. Specifically, the study
area is located approximately 12 km offshore from the coastline of Albufeira, extending into the
Atlantic Ocean. This region is a significant habitat for various cetacean species, especially
delphinids such as common dolphins (Delphinus delphis) and bottlenose dolphins (Tursiops truncatus)
[38–40].
A total of seven campaigns conducted between 2022 and 2023 were analyzed. One campaign was included
in the training data to better adapt to local environmental conditions and UAV settings, while the
remaining six campaigns were used for evaluation. These surveys were conducted in the morning,
between 10:30 and 12:00, under favorable sea conditions defined by a sea state of ⩽3 according to
the Beaufort scale, swells <1.5 m, good visibility (>5 km), and no precipitation. Figure 1 offers a
comprehensive view of the region under study, providing information on various expeditions,
including dates, times, and the precise locations of dolphin sightings.
The UAV-based remote sensing data used in this study were collected using a Mavic 2 Pro multi-rotor
UAV (DJI, Shenzhen, China). The UAV captured videos at a resolution of 3840 × 2160 pixels using a
1-inch CMOS RGB imaging sensor with a maximum resolution of 20 megapixels, coupled with a 3-axis
gimbal and a 28 mm equivalent, f/2.8-f/11 lens, providing a field of view of approximately 77°.
The drone missions were conducted at different flight altitudes, depending on several factors. These
factors included whether there were any dolphin sightings at the time and the size of the group of
dolphins, with higher flights preferably used for greater sea coverage when no sightings were
present, and lower flights for a more detailed view when a group was located. Figure 2 presents a
box plot of the flight altitudes recorded by the UAV during the different expeditions. Of the six
flights, three were conducted at a maximum altitude of nearly 80 m, while the remaining three were
flown below 50 m. In general, the UAV was observed to operate at an altitude of around 20 m, except
for the second campaign, where flights predominantly occurred at higher altitudes.
Fig. 1 Overview of the study area for acquiring data
Fig. 2 Box plot: flight altitude distribution for each expedition
The UAV-based imagery collected resulted in 35 min and 40 s of footage. Similar to previously
acquired data, this footage was processed to create two datasets from the same source, this time for
validation. The first, tailored to the validation of image-based models, was obtained by extracting
and labeling frames from the original raw video data at a rate of one frame every five seconds,
effectively processing the entire footage. Each sample was labeled as belonging to either the
“Cetacean” or “No Cetacean” class based on inference from the original video imagery captured,
making it possible to discern the presence of dolphins on samples that would otherwise be
challenging to identify correctly. The resulting data comprise a total of 428 image samples, with
247 (approximately 57.7%) classified as “Cetacean” and 181 (approximately 42.3%) as “No Cetacean”.
These samples capture a variety of dolphin positions, camera angles, and distances, providing
sufficient variation to support a robust and comprehensive assessment. The second dataset, designed
for the validation of video-based models, was obtained by manually dividing the original video data
into smaller eight-second clips, extracting them and subsequently converting them into 64 images.
The resulting video data consist of 232 videos, 120 of which were classified as belonging to the
“Cetacean” class (approximately 51.7%), and 112 classified as belonging to the “No Cetacean” class
(approximately 48.3%).
Table 1 provides a summary of the sample distribution across the training, testing, and evaluation
datasets for both image and video data. Each dataset is divided into “Dolphin” and “Ocean” samples,
corresponding to the “Cetacean” and “No Cetacean” classes, with the training and testing sets
holding 80% and 20% of the original data, respectively. The evaluation dataset includes an
additional set of samples that covers 100% of its allocated data, ensuring comprehensive assessment
of the models. This division maintains a balanced class representation within each subset, with a
near-equal distribution between “Dolphin” and “Ocean” samples across the datasets, facilitating
robust training and performance evaluation.
While the evaluation dataset was enriched with cetacean images to ensure sufficient data for testing
the model, it is acknowledged that in real-world applications, ocean-only images are likely to be
far more prevalent than cetacean sightings. Consequently, this enrichment may slightly underestimate
the rate of false positives under operational conditions. However, by maintaining a balanced dataset
for evaluation, the accuracy metric becomes more representative of the model’s true performance.
3 Implementation
The models and pipelines employed in this study were developed within the research platform Google
Colab Pro, taking advantage of its cloud computing capabilities. The primary programming language
used was Python version 3.9, with the PyTorch library as the foundation for this project’s machine
learning framework, which allowed for an easy implementation of state-of-the-art deep learning
techniques.
3.1 Image-based models
In order to analyze distinct image identification models, the following CNN architectures were
individually employed, as outlined in Table 2. The selected models have a track record in the field
and known performance in image classification [41–45]. It is worth noting that while certain models
may display superior performance on average, real-world outcomes can vary significantly based on the
specific nature of the problems being addressed. Alongside the model names, specifications such as
the number of parameters and the top accuracies achieved when these architectures were trained on
ImageNet are also provided.
Initially, these models were set up according to their predefined architectures and initialized with
random parameters, making them essentially empty frameworks incapable of making meaningful
predictions. However, through transfer learning, parameters from models with identical architectures
that have been trained on extensive datasets, such as ImageNet, can be transferred to these models.
ImageNet, for example, comprises over a million samples and covers a wide range of classes,
including animals like gray whales, dugongs, orcas, and sea lions. While it does not encompass the
specific “dolphin” class, the features that distinguish these related classes can be invaluable for
the identification of dolphins.
The models presented are structured in two sections, the first being the feature extractor, mainly
consisting of the input layer and a series of hidden layers. It constitutes the majority of the
network and is designed to identify specific features within the input data through its
convolutional layers. These layers have been previously trained on the ImageNet dataset; therefore,
they already possess the ability to effectively recognize a wide range of characteristics from a
thousand different classes. Consequently, their parameters are “frozen” during further training to
ensure that the extracted features remain consistent.
Fig. 3 Combined CNN model pipeline
The second section of the model comprises the classifier, typically a fully connected linear layer
with a softmax activation function at the end of the network. This classifier interprets the
features obtained from the hidden layers and assigns a class label to each data sample. The
classifier, originally designed to classify 1000 classes, was adapted to identify only two classes.
This was achieved by replacing the final linear layer, which initially had 1000 output nodes, with a
new linear layer containing only two output nodes, corresponding to the two target classes.
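
The two steps described above, freezing the feature extractor and swapping the 1000-output layer for a two-output one, can be sketched in PyTorch as follows. `TinyBackbone` is a deliberately small, hypothetical stand-in for a pre-trained network such as ResNet50 or ConvNext, used so that the example runs without downloading weights; the attribute names `features` and `classifier` are likewise assumptions about how the model is organised.

```python
import torch
from torch import nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for a pre-trained CNN with a 1000-class head."""

    def __init__(self, num_features: int = 32, num_classes: int = 1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, num_features, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.classifier = nn.Linear(num_features, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))


def adapt_for_two_classes(model: TinyBackbone) -> TinyBackbone:
    # Freeze the feature extractor so its weights stay fixed during training.
    for param in model.features.parameters():
        param.requires_grad = False
    # Swap the 1000-output layer for a new, trainable 2-output layer.
    model.classifier = nn.Linear(model.classifier.in_features, 2)
    return model


model = adapt_for_two_classes(TinyBackbone())
logits = model(torch.randn(4, 3, 224, 224))    # batch of 4 RGB images
probabilities = torch.softmax(logits, dim=1)   # class probabilities per image
```

Only the new linear layer has trainable parameters after this adaptation, which is what keeps the extracted features consistent across training.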
In addition to the individual architectures, a combined model was developed to leverage multiple
feature extractors concurrently. In this approach, the feature extraction layers from VGG16,
ConvNext, and a straightforward set of convolutional layers were merged into a unified feature
extractor. The simple convolutional model implemented along VGG16 and ConvNext consists of three
convolutional layers and three max-pooling layers, complemented by certain Rectified Linear Unit
(ReLU) layers in between as activation functions. The intent behind this design was to tailor the
model to identify dolphin-specific features from the training dataset, instead of those that
represent other subjects as in the case of transfer learning.
Ultimately, a shared classifier with two layers and 1024 nodes in its middle layer was employed to
receive and effectively process the features extracted from all architectures, resulting in a
collective prediction. Notably, this implementation was carried out in two instances: one that
omitted the simple convolutional layers, CombinedModel (1), and another that used all three
architectures as explained, CombinedModel (2). Figure 3 provides a general overview of this combined
model implementation and how data were shaped through it.
3.2 Video-based models
A video-based identification approach incorporating modern deepfake detection techniques was also
adopted to leverage the unique temporal continuity feature of videos. Unlike images, videos are
composed of a sequence containing numerous frames, where adjacent frames display a substantial
correlation and temporal continuity. The method implemented, based on the work by Guera and Delp
[36], involves using a CNN without a classifier to extract features from individual video frames and
feed the resulting sequence of features into an LSTM to analyze patterns in their temporal
evolution.
Table 1 Dataset train-test split
Image dataset   Train samples   Test samples   Eval samples
Total           2752 (80%)      688 (20%)      428 (100%)
Dolphin         1376 (50%)      194 (50%)      247 (57.7%)
Ocean           1367 (50%)      194 (50%)      181 (42.3%)
Video dataset   Train samples   Test samples   Eval samples
Total           972 (80%)       244 (20%)      232 (100%)
Dolphin         497 (51.1%)     125 (51.2%)    120 (51.7%)
Ocean           475 (48.9%)     119 (48.8%)    112 (48.3%)
Table 2 Common CNN architectures and respective performances on ImageNet dataset
Model             Top-1 Acc (%)   Top-5 Acc (%)   Parameters (M)
ResNet50          80.858          95.434          25.6
InceptionV3       77.294          93.450          27.2
VGG16_bn          73.360          91.516          138.4
ConvNext_Base     84.062          96.870          88.6
EfficientNet_B0   77.692          93.532          5.3
Two CNN architectures were used for this purpose: InceptionV3 and ConvNext. The InceptionV3
architecture was replicated from the original work, while the ConvNext model architecture was
selected based on results from the image-based classification methodology counterpart. In line with
the previous approach, models were established based on their respective architectures, initialized
with random parameters, and subsequently refined by transferring parameters from models pre-trained
on ImageNet. Since the CNNs within this CNN-LSTM pipeline are used exclusively for feature
extraction, their parameters were “frozen” to maintain the consistency of the extracted features.
Simultaneously, the classifiers were removed, enabling direct passage of the features identified
from the hidden layers to the LSTM. Different CNN architectures have specific input size
requirements: InceptionV3 and ConvNext have input sizes of 299 × 299 and 224 × 224 pixels,
respectively, and extract 2048 and 1024 features, respectively.
To accommodate the distinct feature structures obtained from each CNN architecture, two distinct
LSTM architectures were developed. Each was designed to handle a specific input size of the
transferred features, aligned with its corresponding CNN. Both models were created with two
recurrent layers of 256 nodes each. This means that for each model, two LSTMs were stacked together
to form a stacked LSTM, with the second taking in the outputs from the first to compute a new output
at each time step. This setup enabled the LSTMs to iteratively produce 256 values, representing
their hidden states, for every frame in the video sequence.
To conclude the CNN-LSTM pipeline, a classifier was introduced to process the output produced by the
LSTM cell and make predictions. The classifier implemented features a linear layer with 16384 nodes
on its input side. At each time step, the LSTM processes a frame from the input video, thus
generating 256 values, which correspond to the 256 nodes in its hidden layers, representing the
hidden state at that specific time step.
To maximize the amount of information used within the classifier, all the hidden states produced by
the LSTM cell were aggregated. This aggregation results in a total of 16384 nodes (64 frames × 256
hidden values) on the classifier’s input side, since all input videos are pre-processed to consist
of 64 frames. Furthermore, its output layer consists of two nodes representing the two available
classes and utilizes a softmax activation to convert the raw output values into probabilities.
Figure 4 provides a general overview of the pipeline created, its inner workings, and how the data
were shaped through this system.
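
A minimal PyTorch sketch of the recurrent part of this pipeline is given below, under the assumption that per-frame CNN features have already been extracted. It stacks two LSTM layers of 256 units, concatenates the 64 hidden states into 16384 values, and maps them to two class scores. The class name `LSTMClassifier` and the random input tensor are illustrative stand-ins, not the study's code.

```python
import torch
from torch import nn


class LSTMClassifier(nn.Module):
    """Stacked LSTM over per-frame CNN features, followed by a linear classifier."""

    def __init__(self, feature_size: int, num_frames: int = 64, hidden_size: int = 256):
        super().__init__()
        # Two recurrent layers of 256 units; the second consumes the first's outputs.
        self.lstm = nn.LSTM(feature_size, hidden_size, num_layers=2, batch_first=True)
        # All 64 hidden states are concatenated: 64 * 256 = 16384 classifier inputs.
        self.classifier = nn.Linear(num_frames * hidden_size, 2)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_frames, feature_size), as produced by a frozen CNN
        hidden_states, _ = self.lstm(features)       # (batch, num_frames, hidden_size)
        flat = hidden_states.flatten(start_dim=1)    # (batch, num_frames * hidden_size)
        return self.classifier(flat)                 # raw scores for the two classes


model = LSTMClassifier(feature_size=2048)            # 2048 matches the InceptionV3 features
clip_features = torch.randn(2, 64, 2048)             # random stand-in for CNN outputs
probabilities = torch.softmax(model(clip_features), dim=1)
```

For the ConvNext variant, the same class would be instantiated with `feature_size=1024`, which is why two separate LSTM models are needed.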
3.3 Pre-processing data
Effective pre-processing is essential for preparing the training and testing datasets. Key steps
include organizing data into manageable batches and applying transformations compatible with
pre-trained models, which optimize learning and enhance model performance.
To maximize computational efficiency and improve learning precision, all data samples within the
training and testing datasets were grouped into batches of 64 samples each. This aggregation allows
the simultaneous processing of multiple samples, providing a more stable gradient during
backpropagation by combining losses from diverse samples within each batch. A batch size of 64, in
particular, is a commonly employed choice that often works well for various deep learning tasks.
Fig. 4 CNN-LSTM pipeline structure
Further transformations were applied to leverage the patterns learned from pre-trained models. Given
that weights from these were trained on data with a specific distribution, it is essential to
normalize the input data accordingly. This normalization involves subtracting the mean and dividing
by the standard deviation values of the dataset used to pre-train the models. For ImageNet-trained
models, these values are (0.485, 0.456, 0.406) for the mean and (0.229, 0.224, 0.225) for the
standard deviation across the three color channels. Following normalization, the input data are
resized and cropped to match the dimensions required by each CNN architecture, with most models
accepting (224 × 224) input, except InceptionV3, which requires (299 × 299).
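
The normalization described above is a per-channel operation, (x - mean) / std, applied to pixel values scaled to the range [0, 1]. The sketch below spells it out for a single RGB pixel in plain Python; in a PyTorch pipeline the same operation is typically handled by `torchvision.transforms.Normalize`. The helper names are illustrative.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def normalize_pixel(rgb: tuple[float, float, float]) -> tuple[float, ...]:
    """Normalize one RGB pixel whose channels are already scaled to [0, 1]."""
    return tuple(
        (value - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


def input_size(architecture: str) -> int:
    """Square input side in pixels; InceptionV3 is the only exception named in the text."""
    return 299 if architecture == "InceptionV3" else 224


normalized = normalize_pixel((0.485, 0.456, 0.406))  # a pixel equal to the channel means
```

A pixel equal to the channel means maps to zero in every channel, which is a quick sanity check that the statistics are applied in the right order.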
3.4 Model training
The process of adapting a neural network to fit a specific
problem involves iteratively assessing its performance on the
training dataset and adjusting its parameters each time to
achieve predictions as close as possible to the desired values.
To accomplish this, a Cross-Entropy Loss function, commonly
utilized in multiclass problems such as this one, was
defined. This function serves as a guide to determine how
the model's parameters should be updated. Subsequently, the
loss function is utilized to compute a loss value, produced
for each batch, which in turn is used to optimize the parameters
of the model. This optimization is conducted through
an Adam optimizer with an initial learning rate of 0.001,
chosen for its adaptive learning rate and ease of use with fewer
hyperparameters.
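To make the loss and update rule concrete, the sketch below computes a cross-entropy loss for one sample and applies a single Adam step to one scalar parameter. Only the learning rate of 0.001 comes from the text; the values `beta1=0.9`, `beta2=0.999` and `eps=1e-8` are the commonly used Adam defaults and are assumed here, as are the function names and the example numbers.

```python
import math

def cross_entropy(logits, target):
    """Cross-entropy loss for one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

def adam_step(param, grad, state, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update to a scalar parameter and return its new value.

    state holds the step count "t" and the running moment estimates "m", "v".
    """
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)

# Hypothetical values: an uninformative two-class prediction and one update.
loss = cross_entropy([0.0, 0.0], target=0)
state = {"t": 0, "m": 0.0, "v": 0.0}
new_param = adam_step(1.0, grad=0.5, state=state)
```

On the first step the bias-corrected moments reduce to the gradient and its square, so the parameter moves by approximately the learning rate regardless of the gradient's magnitude. This is the adaptive behaviour the text refers to.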
Additionally, a dropout layer with a dropout probability
of 20% was added to the classifier at the end of each model
during training, immediately before the final linear layer.
This layer randomly zeroes activations from the nodes
carrying features before they enter the classifier, with a probability
of 20% for each feature. This step proved to help the
learning process of all nodes in the classifier and to reduce
overfitting significantly by allowing nodes of undervalued
importance to undergo larger adjustments.
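The effect of the dropout layer can be sketched in a few lines of plain Python. The rescaling of surviving activations by 1 / (1 - p), often called inverted dropout, is an assumption borrowed from common deep learning frameworks; the text itself only states the 20% probability. The function name and the seeded generator are likewise illustrative.

```python
import random

def dropout(features, p=0.2, training=True, rng=random):
    """Zero each feature with probability p during training.

    Surviving features are scaled by 1 / (1 - p) so that the expected
    activation is unchanged (assumed "inverted dropout" convention).
    Outside training the features pass through untouched.
    """
    if not training:
        return list(features)
    keep = 1.0 - p
    return [x / keep if rng.random() >= p else 0.0 for x in features]

# Hypothetical feature vector of ones, with a seeded generator for repeatability.
rng = random.Random(0)
dropped = dropout([1.0] * 10000, p=0.2, rng=rng)
zero_fraction = dropped.count(0.0) / len(dropped)
```

With enough features, roughly one in five is zeroed on each pass, so no single feature can be relied upon exclusively by the final linear layer.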
Finally, models underwent training by iteratively processing
the training dataset, one batch at a time, over
several iterations, continuously assessing predictions made
and adjusting their parameters accordingly. During this process,
the dataset is completely processed multiple times, and
models are stored for future use with their most up-to-date
parameters and key performance metrics after each iteration.
4 Results and discussion
The predictive performance of the trained models was initially
evaluated using the test dataset, offering an early
indication of their effectiveness before validation on the
evaluation dataset. This assessment includes metrics such
as accuracy (Acc), precision (Prec), recall (Rec), F1-score
(F1), and loss, providing a preliminary baseline for each
model's generalization capacity. Table 3 summarizes the
best-performing models, selected based on their F1-score,
reflecting the balance between precision and recall.
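For reference, the four classification metrics listed above can be computed from the counts of a binary confusion matrix as in the following sketch. The function name and the example counts are hypothetical and do not correspond to any result reported in the paper.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1-score from confusion counts."""
    total = tp + fp + fn + tn
    acc = (tp + tn) / total
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return {"Acc": acc, "Prec": prec, "Rec": rec, "F1": f1}

# Hypothetical counts: 80 true positives, 20 false positives,
# 10 false negatives and 90 true negatives.
metrics = classification_metrics(tp=80, fp=20, fn=10, tn=90)
```

Because F1 is a harmonic mean, it stays low whenever either precision or recall is low, which is why it is a reasonable single criterion for ranking models.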
Figure 5 shows the training curves for the top two models
from each classification approach, highlighting the trends in
loss and accuracy over epochs. These visualizations help clarify
model stability and learning dynamics, setting the stage
for a more detailed validation using the evaluation dataset in
the subsequent section.
4.1 Model validation
To validate the effectiveness of the models studied and confirm
the quality of their applicability, field observations were
simulated using data collected during fieldwork conducted by
AIMM. Subsequently, the performance of the pre-established
models in training was assessed within the evaluation dataset
detailed in Sect. 2.3. The quantitative results from this assessment
are presented in Table 4.
The results show a clear fall in the overall performance
of all models. This is to be expected, since the training and
testing data share the same origin source and therefore bear
more similarities to each other than to the field data. From the
presented values it is possible to infer that the image-based
identification achieved better performance than its video-based
counterpart, establishing it as the superior methodology for
this task within the applied models.
Table 3 Performance on test dataset

Model                  Acc    Rec    Prec   F1     Loss
EfficientNet           0.834  0.875  0.809  0.841  0.006
ConvNeXt               0.969  0.965  0.974  0.969  0.001
ResNet50               0.898  0.916  0.885  0.900  0.004
InceptionV3            0.923  0.930  0.917  0.924  0.003
VGG16_bn               0.924  0.945  0.908  0.926  0.005
ConvolutionalLayers    0.850  0.863  0.841  0.852  0.018
CombinedModel (1)      0.956  0.965  0.949  0.957  0.004
CombinedModel (2)      0.955  0.945  0.964  0.954  0.006
CNN-LSTM (Incep.V3)    0.898  0.950  0.856  0.900  0.453
CNN-LSTM (ConvNeXt)    0.939  0.924  0.948  0.936  0.628

Fig. 5 Training curves from the most relevant models implemented

Models based on the ConvNeXt architecture experienced
a smaller drop in performance. In particular, the ConvNeXt
model stands out as the model of choice, achieving the highest
accuracy and F1-score with values of 83.9% and 82.0%,
respectively. This model is expected to produce overall good
predictions, having achieved a better balance between recall
and precision. Even though a decrease in recall was observed,
the value of 86.7% still provides the model with a reduced
number of false negatives, while the precision value of 77.7%
leads to a reduced number of false positives compared to
the remaining image-based models, meaning that predictions
where the model finds the presence of dolphins are more
trustworthy.
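As a quick consistency check, the reported F1-score can be recovered from the precision and recall quoted above, assuming the standard definition of F1 as the harmonic mean of the two:

```latex
F_1 = \frac{2 \cdot \mathrm{Prec} \cdot \mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}
    = \frac{2 \times 0.777 \times 0.867}{0.777 + 0.867}
    = \frac{1.347}{1.644}
    \approx 0.820
```

This agrees with the reported value of 82.0% to within rounding.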
The notable performance of this model can be traced back
to its training and testing curves, represented in Fig. 5, which,
when compared to the curves of other models, display a
higher degree of similarity, to the extent that they overlap.
This suggests a great generalization ability by showing the
model's capacity to obtain high accuracy values without overfitting
training data.
Remaining ConvNeXt-based architectures have also demonstrated
good performance. Notably, the performance of CombinedModel
(2) improved compared to CombinedModel
(1), achieving the highest recall value of 90.6%, which