Corresponding author: Ch Vaishnavi
Copyright © 2025 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.
Multimodal AI framework for image captioning, story generation and natural speech narration
Ashwani Attri, Priyanka Gudeboyena, Vaishnavi Chigurla *, Soumika Moluguri and Nithin Kasoju
Department of Computer Science and Engineering (Data Science), Ashwani Attri, ACE Engineering College, Telangana,
India.
World Journal of Advanced Research and Reviews, 2025, 26(02), 1037-1044
Publication history: Received on 27 March 2025; revised on 03 May 2025; accepted on 06 May 2025
Article DOI: https://doi.org/10.30574/wjarr.2025.26.2.1685
Abstract
With the increasing ubiquity of digital imagery, there is a growing need for intelligent systems capable of understanding
visual content and expressing that understanding in human-like language. This paper presents a comprehensive AI-based
pipeline that not only generates captions from images but also constructs vivid stories based on those captions
and finally delivers them in a human voice. The proposed system integrates multiple components: a Convolutional
Neural Network (VGG16) for extracting visual features, an LSTM-based sequence model for caption generation, GPT-2
for creative story generation, and Google Text-to-Speech (gTTS) for voice synthesis. The result is a multi-modal AI
framework capable of transforming static images into rich, spoken narratives. This approach has applications in
assistive technologies, interactive storytelling, content automation, and education. The proposed model is trained and
evaluated on the Flickr8k dataset, demonstrating a viable path to automated visual storytelling.
Keywords: Image Captioning; CNN-LSTM; VGG16; GPT-2; Text-to-Speech (gTTS); Image-to-Story Generation; Natural
Language Processing (NLP)
1. Introduction
The synergy between computer vision and natural language processing has led to groundbreaking innovations in
artificial intelligence. Deep learning techniques have allowed machines to interpret complex patterns in both images
and text, giving rise to applications like image captioning, automated storytelling, and human-computer interaction
systems. However, while each of these areas has matured in isolation, combining them into a seamless, human-centric
experience remains a frontier in AI research.
Humans possess a remarkable ability to perceive a visual scene, articulate its content in descriptive language, and even
extrapolate imaginative stories from it. For instance, when looking at a photograph of a child playing with a puppy in a
park, a human observer might not only say, “A child is playing with a dog,” but also create an engaging narrative such
as, “On a sunny afternoon, Emma found a new best friend in the park.” Emulating this level of interpretation requires
more than just object detection or sentence generation; it calls for contextual understanding, narrative imagination,
and voice delivery.
This paper proposes a system that emulates this human storytelling process. It starts by analyzing an image using a
pre-trained VGG16 CNN model to extract high-level features. These features are then fed into an LSTM-based decoder,
trained on the Flickr8k dataset, to generate a concise caption. Next, a transformer-based language model, GPT-2, takes
the caption and expands it into a creative short story. Finally, the story is converted into natural-sounding speech using
gTTS, creating a fully immersive and interactive storytelling experience.
The integration of image understanding, natural language generation, and voice synthesis provides a powerful tool for
accessibility, especially for the visually impaired, where the ability to “see through words” becomes transformative.
The system also opens up new avenues in education, interactive media, and digital content creation.
2. Literature Review
The domain of image captioning has evolved through several phases. Early methods relied on template-based
approaches and rule-based systems that had limited generalization capability. With the advent of deep learning,
encoder-decoder architectures became the norm. Vinyals et al. (2015) introduced the "Show and Tell" model, a
breakthrough in combining CNNs and RNNs for end-to-end caption generation. Later models such as "Show, Attend and
Tell" incorporated attention mechanisms, improving focus on relevant parts of the image during word generation.
Story generation, on the other hand, has seen progress through large-scale transformer-based models. Radford et al.'s
GPT-2 (2019) demonstrated the ability to generate fluent, contextually coherent text, opening the door to creative
applications such as dialogue agents, automatic writers, and storytelling bots. However, these models operate purely in
the text domain, and without visual grounding, their stories can be contextually generic or misaligned with visual
prompts.
Text-to-speech synthesis has also significantly improved. Libraries such as gTTS, which wraps Google Translate's
text-to-speech service, and neural TTS models like Tacotron and WaveNet have made it possible to generate
natural-sounding speech with minimal latency. These tools provide the auditory interface to AI systems, especially in
accessibility and human-computer interaction contexts.
Despite these individual advancements, very few systems have attempted to integrate visual understanding, language
generation, and speech synthesis into a single pipeline. This research builds upon these foundational works to propose
an end-to-end model for automated image-based storytelling.
3. Existing System
Several standalone systems exist that perform well in isolation:
3.1. Image Captioning Systems
Deep learning models like "Show and Tell" and "NeuralTalk2" generate concise captions for images, focusing on object
recognition and sentence fluency. These systems, however, are limited to short phrases and lack narrative capability.
3.2. Story Generation Tools
GPT-2 and its successors (e.g., GPT-3, GPT-4) have revolutionized text generation, producing creative and engaging
content. Yet, text-only models such as GPT-2 require carefully crafted prompts and do not accept visual inputs directly.
3.3. Text-to-Speech Engines
Tools such as Google TTS and Amazon Polly provide high-quality speech output. They are widely used in accessibility
applications, virtual assistants, and audio content generation.
Each of these systems is useful independently, but the lack of integration creates friction when building an application
that seeks to mimic human storytelling from visual stimuli. The need for a unified, automated, and contextually coherent
system remains largely unmet.
4. Proposed Model
To address the fragmentation in existing solutions, we propose a unified multi-modal architecture that emulates a
human-like storytelling process from visual input. The proposed model consists of four tightly integrated modules:
4.1. Visual Feature Extractor
• Uses the pre-trained VGG16 model to extract high-level features from input images.
• Outputs a 4096-dimensional feature vector representing visual semantics.
4.2. Caption Generator
• Employs a Tokenizer and LSTM-based decoder.
• Trained on the Flickr8k dataset to convert visual features into grammatically correct and contextually relevant
captions.
4.3. Story Generator
• Utilizes GPT-2 for text generation.
• Takes the generated caption as a prompt and produces a short, creative story based on it.
4.4. Text-to-Speech Synthesizer
• Converts the story into an English voice using gTTS.
• Provides auditory feedback for accessibility and engagement.
Together, these components form an end-to-end pipeline capable of transforming static visual inputs into rich, spoken
narratives.
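The four modules can be read as a chain of functions, each consuming the previous module's output. The following is a minimal sketch under that reading, not the authors' implementation: the `StorytellingPipeline` class, its field names, and the stand-in stages in the usage example are all hypothetical and serve only to show how the stages compose.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StorytellingPipeline:
    """Chains the four modules; every stage is injected as a plain callable.

    All names here are illustrative assumptions, not identifiers from the paper.
    """
    extract_features: Callable[[str], Any]   # image path -> feature vector
    generate_caption: Callable[[Any], str]   # feature vector -> caption
    generate_story: Callable[[str], str]     # caption -> short story
    synthesize_speech: Callable[[str], str]  # story -> path of the audio file

    def run(self, image_path: str) -> dict:
        features = self.extract_features(image_path)
        caption = self.generate_caption(features)
        story = self.generate_story(caption)
        audio_path = self.synthesize_speech(story)
        return {"caption": caption, "story": story, "audio": audio_path}


# Usage with trivial stand-in stages. Real stages would wrap VGG16, the LSTM
# decoder, GPT-2 and gTTS respectively.
demo = StorytellingPipeline(
    extract_features=lambda path: [0.0] * 4096,
    generate_caption=lambda feats: "a dog running in the park",
    generate_story=lambda caption: f"Once upon a time, {caption}.",
    synthesize_speech=lambda story: "story.mp3",
)
result = demo.run("example.jpg")
```

Passing the stages in as callables keeps each module replaceable, for example swapping the captioner for an attention-based one, without touching the rest of the chain.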
5. Methodology
The proposed system follows a modular deep learning-based methodology to transform static images into rich, spoken
narratives. First, input images are preprocessed and passed through a pre-trained VGG16 model to extract deep visual
features. These features are then fed into a CNN-LSTM network that generates a meaningful caption word by word. The
caption is used as a prompt for GPT-2, which generates a vivid, coherent story capturing the scene's context. Finally, the
story is converted into speech using Google Text-to-Speech (gTTS), completing the visual-to-audio transformation
pipeline.
Figure 1 Methodology (Source: Authors)
5.1. System Architecture
The architecture of the proposed system integrates Computer Vision, Natural Language Processing, and Text-to-Speech
(TTS) in a multi-stage pipeline that automates storytelling from visual data. The entire pipeline is modular and follows
a clear sequence: image preprocessing, feature extraction, caption generation, story generation, and speech synthesis.
Below is an overview and breakdown of each major component:
Figure 2 System Architecture (Source: Authors)
5.2. Image Input Module
The first step in the system is to handle the image input. Users either provide an image manually or select one from a
dataset, like the Flickr8k dataset. Input images come in various formats and sizes, so this module standardizes
each image by resizing it to a fixed size (224x224 pixels). Additionally, the image is converted into an RGB format,
ensuring it is compatible with the VGG16 model for feature extraction. The input module is crucial for maintaining
consistency across all image data used in the subsequent stages.
Once the image is processed, it is transformed into a NumPy array for easy manipulation in TensorFlow or Keras. This
module is designed to support various image formats and can be extended for real-time image upload, making it more
user-friendly for different applications.
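The standardization steps described here (RGB conversion, resizing to 224x224, conversion to a NumPy array) can be sketched as below. This is an illustrative version that assumes the Pillow and NumPy libraries; the paper does not state which image loader it uses, and the names `TARGET_SIZE` and `load_image` are hypothetical.

```python
TARGET_SIZE = (224, 224)  # input resolution expected by VGG16


def load_image(path, target_size=TARGET_SIZE):
    """Load an image file as an RGB float32 array of shape (224, 224, 3)."""
    # Imported here so the module can be loaded without the imaging stack.
    import numpy as np
    from PIL import Image

    with Image.open(path) as img:
        rgb = img.convert("RGB")           # drops alpha, expands grayscale to 3 channels
        resized = rgb.resize(target_size)  # Pillow takes (width, height)
    return np.asarray(resized, dtype="float32")
```

A call such as `load_image("photo.png")` returns pixel values in the 0-255 range; any model-specific normalization is left to the feature extraction stage.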
5.3. Feature Extraction Module (CNN - VGG16)
After the image is preprocessed, the Feature Extraction Module utilizes a pre-trained VGG16 model to extract high-level
visual features. VGG16 is a convolutional neural network (CNN) model that was originally trained for image
classification tasks on large datasets such as ImageNet. In this system, instead of using the model for classification, we
utilize the penultimate layer of the network to capture the rich features of the image.
The model processes the input image and generates a 4096-dimensional feature vector. This vector represents the most
salient features of the image, such as objects, textures, and spatial relationships, which are crucial for generating an
accurate description. These features serve as a compressed yet informative representation of the image, capturing the
essential visual information that is passed to the caption generation module.
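One common way to obtain such a 4096-dimensional vector with Keras is to cut VGG16 just before its classification layer. The sketch below assumes TensorFlow's bundled Keras API and ImageNet weights; the function names and `FEATURE_DIM` are hypothetical, and the authors' exact code may differ.

```python
FEATURE_DIM = 4096  # width of VGG16's second fully connected layer ("fc2")


def build_feature_extractor():
    """Return VGG16 truncated at its penultimate layer."""
    # Lazy import: TensorFlow is heavy and the ImageNet weights are
    # downloaded the first time the model is built.
    from tensorflow.keras.applications.vgg16 import VGG16
    from tensorflow.keras.models import Model

    base = VGG16(weights="imagenet")
    # layers[-1] is the 1000-way softmax; layers[-2] is the 4096-unit "fc2" layer.
    return Model(inputs=base.inputs, outputs=base.layers[-2].output)


def extract_features(model, image_array):
    """Map one (224, 224, 3) RGB array with values in 0-255 to a 4096-d vector."""
    import numpy as np
    from tensorflow.keras.applications.vgg16 import preprocess_input

    # np.array copies, so preprocessing never mutates the caller's image.
    batch = np.expand_dims(np.array(image_array, dtype="float32"), axis=0)
    batch = preprocess_input(batch)  # RGB -> BGR plus ImageNet mean subtraction
    return model.predict(batch, verbose=0)[0]  # shape: (4096,)
```

Building the extractor once and reusing it for every image avoids reloading the weights on each call.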
5.4. Caption Generation Module (CNN-LSTM)
The Caption Generation Module is responsible for generating a textual description of the image. This module combines
the visual features extracted by the CNN (VGG16) with a sequence of words to generate coherent and descriptive
captions. The architecture employed in this system is a combination of CNN and LSTM (Long Short-Term Memory),
which has proven effective for sequence generation tasks.
The CNN extracts the visual features, and these features are fed into the LSTM, which is trained to predict a sequence of
words. The LSTM is an advanced type of recurrent neural network (RNN) that excels at capturing temporal
dependencies in data, making it ideal for sequence prediction tasks like captioning. The Embedding Layer transforms
words into dense vector representations, allowing the LSTM to process these word sequences effectively.
The system generates the caption word by word, starting with a token like startseq and iteratively predicting the next
word based on the previous ones. The process continues until an endseq token is predicted, signaling the end of the
caption.
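The word-by-word loop can be written independently of the underlying network. The sketch below is a greedy decoder under stated assumptions: `predict_next` stands in for the trained CNN-LSTM model, the `max_length` default of 35 is an arbitrary illustrative cap, and `toy_predictor` is a hypothetical stub that replays a fixed sentence.

```python
START, END = "startseq", "endseq"


def greedy_caption(predict_next, features, max_length=35):
    """Generate a caption word by word until END or max_length words.

    predict_next(features, words) must return the most likely next word given
    the image features and the words generated so far.
    """
    words = [START]
    for _ in range(max_length):
        next_word = predict_next(features, words)
        if next_word is None or next_word == END:
            break
        words.append(next_word)
    return " ".join(words[1:])  # drop the start token


# Stand-in predictor that replays a fixed sentence (for illustration only).
def toy_predictor(features, words):
    script = ["a", "dog", "runs", "in", "the", "park", END]
    return script[len(words) - 1]


caption = greedy_caption(toy_predictor, features=None)
```

With a real model, `predict_next` would encode the current words with the tokenizer, pad the sequence, run the network, and map the highest-scoring index back to a word.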
5.5. Text Enhancement Module (Story Generator using GPT-2)
Once the caption is generated, the Text Enhancement Module uses a pre-trained GPT-2 model to transform the short
caption into a full story. This is where the system's creativity is injected. GPT-2, a powerful transformer-based language
model, is designed for text generation and can create highly coherent and contextually relevant text based on a given
prompt. In this case, the caption serves as the prompt.
The module sends the caption to GPT-2, which then generates a detailed, vivid story. The story goes beyond simple
captioning, weaving a narrative that adds depth and context to the image. For example, a caption like “a dog running in
the park” might evolve into a story describing the dog’s adventure in the park, its interaction with other animals, and
its surroundings.
The story generation is parameterized by settings such as temperature (which controls the randomness of the output),
top-p (which restricts sampling to the most probable words whose cumulative probability reaches p, regulating
diversity), and max_length (which caps the length of the generated text). These settings are tuned to produce stories
that are both creative and coherent.
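To make the effect of temperature and top-p concrete, the sketch below implements both on a plain list of scores. It is a generic illustration of the sampling technique, not GPT-2's or the authors' code, and the default values 0.8 and 0.9 are hypothetical. In the Hugging Face transformers library the same controls are exposed as the `temperature`, `top_p`, and `max_length` arguments of `generate` (with `do_sample=True`), although the paper does not say which library it uses.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities; lower temperature sharpens the peak."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized over the kept tokens


def sample_next(logits, temperature=0.8, top_p=0.9, rng=random):
    """Sample one token index using temperature scaling, then nucleus filtering."""
    candidates = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]
```

Lowering either value makes the output more predictable: a very small `top_p` keeps only the single most likely token, which reduces sampling to greedy decoding.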
5.6. Text-to-Speech Module (gTTS)
The generated story is passed to the Text-to-Speech (TTS) Module, where it is converted into spoken words. This
module uses gTTS, a lightweight and easy-to-use Python library that interfaces with Google Translate's text-to-speech
API. gTTS supports multiple languages, but in this case, the system generates English audio.
Once the story is converted into speech, the gTTS module saves the audio as an MP3 file. The MP3 file can be played back
to the user, providing an immersive audio experience. This module is particularly useful for creating accessible content,
enabling the visually impaired to enjoy the generated stories through audio.
The speech synthesis is not only a critical component for accessibility but also adds a dynamic, lifelike layer to the
system, making the narrative experience more engaging.
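A minimal sketch of this step with the gTTS library follows. The `gTTS(text=..., lang=...)` constructor and its `save` method are part of gTTS; the wrapper name `story_to_mp3`, the whitespace clean-up, and the optional `tts_factory` parameter (which lets the function be exercised without network access) are assumptions added for illustration.

```python
def story_to_mp3(story, out_path="story.mp3", lang="en", tts_factory=None):
    """Synthesize `story` and save it as an MP3 file; returns the output path."""
    text = " ".join(story.split())  # collapse newlines and repeated spaces
    if not text:
        raise ValueError("story is empty; nothing to synthesize")
    if tts_factory is None:
        # Imported lazily: gTTS needs network access to reach Google's service.
        from gtts import gTTS
        tts_factory = gTTS
    tts_factory(text=text, lang=lang).save(out_path)
    return out_path
```

Calling `story_to_mp3(story)` with the default factory writes `story.mp3` to the working directory, which the output stage can then play back.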
5.7. Output Module
The Output Module is responsible for playing the generated audio file and displaying relevant visual information. In this
module, the audio file generated by the TTS system is played back using the playsound library, allowing the user to
listen to the story.
6. Results and Discussion
Figure 3 User Interface (Source: Authors)
Figure 4 Input Image (Source: Authors)
Figure 5 Caption and Story generation (Source: Authors)
Figure 6 Audio file stored as .mp3 (Source: Authors)
7. Conclusion
This paper presents an integrated AI system that performs image captioning, story generation, and text-to-speech
conversion in a unified pipeline. By leveraging deep learning techniques in computer vision and natural language
processing, the model offers a novel approach to creating engaging, human-like narratives from static images.
Beyond its technical implementation, the system humanizes machine perception, transforming passive image
recognition into an active storytelling experience. Applications of this work include educational tools, assistive
technologies for the visually impaired, interactive content platforms, and creative writing aids.
Future work will explore enhancements such as attention mechanisms, multimodal transformers, multilingual support,
and real-time web or mobile deployment.
Compliance with ethical standards
Disclosure of conflict of interest
No conflict of interest to be disclosed.
References
[1] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, Attend and Tell:
Neural Image Caption Generation with Visual Attention. In Proceedings of the International Conference on Machine
Learning (ICML) (pp. 2048–2057). http://proceedings.mlr.press/v37/xuc15.html
[2] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised
Multitask Learners. OpenAI. models/language_models_are_unsupervised_multitask_learners.pdf
[3] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: A Neural Image Caption Generator. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3156–3164).
https://doi.org/10.1109/CVPR.2015.7298935
[4] Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., et al. (2017). Tacotron: Towards
End-to-End Speech Synthesis. In Proceedings of Interspeech 2017. https://doi.org/10.21437/Interspeech.2017-1452
[5] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., &
Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint
arXiv:1609.03499. https://arxiv.org/abs/1609.03499
Authors’ short biography
Mr. Ashwani Attri:
Mr. Ashwani Attri completed his B.Tech and M.Tech in CSE from IIT Kharagpur. He worked
in the IT sector as a Software Engineer and is currently working as an Assistant Professor in the
Department of CSE (Data Science), ACE Engineering College. He aims to inspire students and
contribute to advancements in technology through his work.
Priyanka Gundeboyena:
Priyanka is a final-year B.Tech student at ACE Engineering College, specializing in Computer
Science and Engineering (Data Science). Her academic journey has been shaped by a strong
commitment to exploring data-driven solutions to real-world challenges. Over the course
of her studies, she has developed a profound interest in areas such as cybersecurity,
machine learning and networking.
Vaishnavi Chigurla:
I am Vaishnavi, a final-year B.Tech student at ACE Engineering College, specializing in
Computer Science and Engineering (Data Science). My academic journey has led me to
explore the dynamic fields of cybersecurity, machine learning, artificial intelligence, and
networking. With a keen interest in innovation, I strive to bridge the gap between data
science and real-world applications, continuously expanding my knowledge and expertise.
Soumika Moluguri:
Soumika is a final-year B.Tech student at ACE Engineering College, specializing in Computer
Science and Engineering (Data Science). Her academic journey has led her to explore diverse
domains, with a particular focus on web development, machine learning, and artificial
intelligence. Skilled in building dynamic and efficient web solutions, she is committed to
continuous learning and innovation to bridge the gap between data science and modern web
technologies.
Nithin Kasoju:
Nithin is a final-year B.Tech student at ACE Engineering College, specializing in Computer
Science and Engineering (Data Science). His academic journey has been shaped by a strong
commitment to exploring data-driven solutions to real-world challenges. Over the course
of his studies, he has developed a profound interest in areas such as machine learning,
artificial intelligence, and networking.