Multimodal AI framework for image captioning, story generation and natural speech narration

Author: Attri, Ashwani; Gudeboyena, Priyanka; Chigurla, Vaishnavi; Moluguri, Soumika; Kasoju, Nithin

Publisher: Zenodo

DOI: 10.5281/zenodo.17301364

Source: https://zenodo.org/records/17301364/files/WJARR-2025-1685.pdf

* Corresponding author: Ch Vaishnavi
Copyright © 2025 Author(s) retain the copyright of this article. This article is published under the terms of the Creative Commons Attribution License 4.0.
Multimodal AI framework for image captioning, story generation and natural speech
narration
Ashwani Attri, Priyanka Gudeboyena, Vaishnavi Chigurla *, Soumika Moluguri and Nithin Kasoju
Department of Computer Science and Engineering (Data Science), Ashwani Attri, ACE Engineering College, Telangana,
India.
World Journal of Advanced Research and Reviews, 2025, 26(02), 1037-1044
Publication history: Received on 27 March 2025; revised on 03 May 2025; accepted on 06 May 2025
Article DOI: https://doi.org/10.30574/wjarr.2025.26.2.1685
Abstract
With the increasing ubiquity of digital imagery, there is a growing need for intelligent systems capable of understanding
visual content and expressing that understanding in human-like language. This paper presents a comprehensive AI-based
pipeline that not only generates captions from images but also constructs vivid stories based on those captions
and finally delivers them in a human voice. The proposed system integrates multiple components: a Convolutional
Neural Network (VGG16) for extracting visual features, an LSTM-based sequence model for caption generation, GPT-2
for creative story generation, and Google Text-to-Speech (gTTS) for voice synthesis. The result is a multi-modal AI
framework capable of transforming static images into rich, spoken narratives. This approach has applications in
assistive technologies, interactive storytelling, content automation, and education. The proposed model is trained and
evaluated on the Flickr8k dataset, demonstrating a viable path to automated visual storytelling.
Keywords: Image Captioning; CNN-LSTM; VGG16; GPT-2; Text-to-Speech (gTTS); Image-to-Story Generation; Natural
Language Processing (NLP)
1. Introduction
The synergy between computer vision and natural language processing has led to groundbreaking innovations in
artificial intelligence. Deep learning techniques have allowed machines to interpret complex patterns in both images
and text, giving rise to applications like image captioning, automated storytelling, and human-computer interaction
systems. However, while each of these areas has matured in isolation, combining them into a seamless, human-centric
experience remains a frontier in AI research.
Humans possess a remarkable ability to perceive a visual scene, articulate its content in descriptive language, and even
extrapolate imaginative stories from it. For instance, when looking at a photograph of a child playing with a puppy in a
park, a human observer might not only say, “A child is playing with a dog,” but also create an engaging narrative such
as, “On a sunny afternoon, Emma found a new best friend in the park.” Emulating this level of interpretation requires
more than just object detection or sentence generation; it calls for contextual understanding, narrative imagination,
and voice delivery.
This paper proposes a system that emulates this human storytelling process. It starts by analyzing an image using a
pre-trained VGG16 CNN model to extract high-level features. These features are then fed into an LSTM-based decoder,
trained on the Flickr8k dataset, to generate a concise caption. Next, a transformer-based language model, GPT-2, takes
the caption and expands it into a creative short story. Finally, the story is converted into natural-sounding speech using
gTTS, creating a fully immersive and interactive storytelling experience.
The integration of image understanding, natural language generation, and voice synthesis provides a powerful tool for
accessibility, especially for the visually impaired, where the ability to “see through words” becomes transformative.
The system also opens up new avenues in education, interactive media, and digital content creation.
2. Literature Review
The domain of image captioning has evolved through several phases. Early methods relied on template-based
approaches and rule-based systems that had limited generalization capability. With the advent of deep learning,
encoder-decoder architectures became the norm. Vinyals et al. (2015) introduced the "Show and Tell" model, a
breakthrough in combining CNNs and RNNs for end-to-end caption generation. Later models such as "Show, Attend and
Tell" incorporated attention mechanisms, improving focus on relevant parts of the image during word generation.
Story generation, on the other hand, has seen progress through large-scale transformer-based models. Radford et al.'s
GPT-2 (2019) demonstrated the ability to generate fluent, contextually coherent text, opening the door to creative
applications such as dialogue agents, automatic writers, and storytelling bots. However, these models operate purely in
the text domain, and without visual grounding, their stories can be contextually generic or misaligned with visual
prompts.
Text-to-speech synthesis has also significantly improved. Google's gTTS and other neural TTS engines like Tacotron and
WaveNet have made it possible to generate natural-sounding speech with minimal latency. These tools provide the
auditory interface to AI systems, especially in accessibility and human-computer interaction contexts.
Despite these individual advancements, very few systems have attempted to integrate visual understanding, language
generation, and speech synthesis into a single pipeline. This research builds upon these foundational works to propose
an end-to-end model for automated image-based storytelling.
3. Existing System
Several standalone systems exist that perform well in isolation:
3.1. Image Captioning Systems
Deep learning models like "Show and Tell" and "NeuralTalk2" generate concise captions for images, focusing on object
recognition and sentence fluency. These systems, however, are limited to short phrases and lack narrative capability.
3.2. Story Generation Tools
GPT-2 and its successors (e.g., GPT-3, GPT-4) have revolutionized text generation, producing creative and engaging
content. Yet, these tools require carefully crafted prompts and do not accept visual inputs directly.
3.3. Text-to-Speech Engines
Tools such as Google TTS and Amazon Polly provide high-quality speech output. They are widely used in accessibility
applications, virtual assistants, and audio content generation.
Each of these systems is useful independently, but the lack of integration creates friction when building an application
that seeks to mimic human storytelling from visual stimuli. The need for a unified, automated, and contextually coherent
system remains largely unmet.
4. Proposed Model
To address the fragmentation in existing solutions, we propose a unified multi-modal architecture that emulates a
human-like storytelling process from visual input. The proposed model consists of four tightly integrated modules:
4.1. Visual Feature Extractor
• Uses the pre-trained VGG16 model to extract high-level features from input images.
• Outputs a 4096-dimensional feature vector representing visual semantics.
4.2. Caption Generator
• Employs a Tokenizer and LSTM-based decoder.
• Trained on the Flickr8k dataset to convert visual features into grammatically correct and contextually relevant
captions.
4.3. Story Generator
• Utilizes GPT-2 for text generation.
• Takes the generated caption as a prompt and produces a short, creative story based on it.
4.4. Text-to-Speech Synthesizer
• Converts the story into an English voice using gTTS.
• Provides auditory feedback for accessibility and engagement.
Together, these components form an end-to-end pipeline capable of transforming static visual inputs into rich, spoken
narratives.
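The four-module flow can be sketched as a simple function composition. This is a minimal illustration under stated assumptions, not the authors' code: `narrate_image`, `NarrationResult`, and the four stage callables are hypothetical names, and each stage is left as a plain callable so that a VGG16 extractor, an LSTM decoder, GPT-2, and gTTS could be plugged in behind it.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical stage signatures (the paper does not publish its interfaces).
FeatureExtractor = Callable[[bytes], Sequence[float]]  # image -> feature vector
CaptionGenerator = Callable[[Sequence[float]], str]    # features -> caption
StoryGenerator = Callable[[str], str]                  # caption -> story
SpeechSynthesizer = Callable[[str], bytes]             # story -> audio bytes


@dataclass
class NarrationResult:
    caption: str
    story: str
    audio: bytes


def narrate_image(
    image: bytes,
    extract: FeatureExtractor,
    caption: CaptionGenerator,
    tell: StoryGenerator,
    speak: SpeechSynthesizer,
) -> NarrationResult:
    """Run the four stages in order, feeding each output to the next stage."""
    features = extract(image)   # 4.1 visual feature extractor
    text = caption(features)    # 4.2 caption generator
    story = tell(text)          # 4.3 story generator
    audio = speak(story)        # 4.4 text-to-speech synthesizer
    return NarrationResult(caption=text, story=story, audio=audio)
```

Keeping each stage behind a plain callable means any one module can be swapped without touching the others, which is consistent with the modular design described here.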
5. Methodology
The proposed system follows a modular deep learning-based methodology to transform static images into rich, spoken
narratives. First, input images are preprocessed and passed through a pre-trained VGG16 model to extract deep visual
features. These features are then fed into a CNN-LSTM network that generates a meaningful caption word by word. The
caption is used as a prompt for GPT-2, which generates a vivid, coherent story capturing the scene's context. Finally, the
story is converted into speech using Google Text-to-Speech (gTTS), completing the visual-to-audio transformation
pipeline.
Figure 1 Methodology (Source: Authors)
5.1. System Architecture
The architecture of the proposed system integrates Computer Vision, Natural Language Processing, and Text-to-Speech
(TTS) in a multi-stage pipeline that automates storytelling from visual data. The entire pipeline is modular and follows
a clear sequence: image preprocessing, feature extraction, caption generation, story generation, and speech synthesis.
Below is an overview and breakdown of each major component:
Figure 2 System Architecture (Source: Authors)
5.2. Image Input Module
The first step in the system is to handle the image input. Users either provide an image manually or select one from a
dataset, such as the Flickr8k dataset. Input images typically arrive in various formats and sizes, so this module standardizes
each image by resizing it to a fixed size (224x224 pixels). Additionally, the image is converted into RGB format,
ensuring it is compatible with the VGG16 model for feature extraction. The input module is crucial for maintaining
consistency across all image data used in the subsequent stages.
Once the image is processed, it is transformed into a NumPy array for easy manipulation in TensorFlow or Keras. This
module is designed to support various image formats and can be extended for real-time image upload, making it more
user-friendly for different applications.
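To make the standardization step concrete, the sketch below resizes an image to 224x224 and forces every pixel to RGB using only the Python standard library. It is a dependency-free stand-in for illustration: in practice an imaging library would do this work, and the names `standardize` and `to_rgb`, the nested-list image representation, and the nearest-neighbour resampling are all assumptions rather than details from the paper.

```python
from typing import List, Sequence, Tuple, Union

Pixel = Union[int, Sequence[int]]  # grayscale intensity, or (R, G, B[, A])
TARGET_SIZE = 224                  # side length stated in the text


def to_rgb(pixel: Pixel) -> Tuple[int, int, int]:
    """Expand a grayscale intensity to an RGB triple; drop alpha if present."""
    if isinstance(pixel, int):
        return (pixel, pixel, pixel)
    r, g, b = pixel[:3]
    return (r, g, b)


def standardize(
    image: Sequence[Sequence[Pixel]], size: int = TARGET_SIZE
) -> List[List[Tuple[int, int, int]]]:
    """Nearest-neighbour resize to size x size, converting each pixel to RGB."""
    src_h, src_w = len(image), len(image[0])
    return [
        [to_rgb(image[row * src_h // size][col * src_w // size])
         for col in range(size)]
        for row in range(size)
    ]
```

The result is a 224x224 grid of RGB triples, the same shape (224, 224, 3) that the array handed to VGG16 would have.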
5.3. Feature Extraction Module (CNN - VGG16)
After the image is preprocessed, the Feature Extraction Module utilizes a pre-trained VGG16 model to extract high-level
visual features. VGG16 is a convolutional neural network (CNN) model that was originally trained for image
classification tasks on large datasets such as ImageNet. In this system, instead of using the model for classification, we
utilize the penultimate layer of the network to capture the rich features of the image.
The model processes the input image and generates a 4096-dimensional feature vector. This vector represents the most
salient features of the image, such as objects, textures, and spatial relationships, which are crucial for generating an
accurate description. These features serve as a compressed yet informative representation of the image, capturing the
essential visual information that is passed to the caption generation module.
5.4. Caption Generation Module (CNN-LSTM)
The Caption Generation Module is responsible for generating a textual description of the image. This module combines
the visual features extracted by the CNN (VGG16) with a sequence of words to generate coherent and descriptive
captions. The architecture employed in this system is a combination of CNN and LSTM (Long Short-Term Memory),
which has proven effective for sequence generation tasks.
The CNN extracts the visual features, and these features are fed into the LSTM, which is trained to predict a sequence of
words. The LSTM is an advanced type of recurrent neural network (RNN) that excels at capturing temporal
dependencies in data, making it ideal for sequence prediction tasks like captioning. The Embedding Layer transforms
words into dense vector representations, allowing the LSTM to process these word sequences effectively.
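One common way to train such a decoder is to split each caption into (prefix, next word) pairs, so that the network learns to predict one word at a time. The sketch below shows that data preparation under stated assumptions: the paper does not describe its training pairs, `build_vocab` and `training_pairs` are hypothetical helpers, and left-padding with 0 is a conventional choice rather than a confirmed detail.

```python
from typing import Dict, List, Tuple


def build_vocab(captions: List[str]) -> Dict[str, int]:
    """Map each distinct word to an integer id; 0 is reserved for padding."""
    vocab: Dict[str, int] = {}
    for caption in captions:
        for word in caption.split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab


def training_pairs(
    caption: str, vocab: Dict[str, int], max_len: int
) -> List[Tuple[List[int], int]]:
    """Turn one caption into (left-padded prefix ids, next-word id) pairs."""
    ids = [vocab[word] for word in caption.split()]
    pairs = []
    for i in range(1, len(ids)):
        prefix = ids[:i][-max_len:]                      # keep the last max_len ids
        padded = [0] * (max_len - len(prefix)) + prefix  # pad on the left with 0
        pairs.append((padded, ids[i]))
    return pairs
```

Each integer id in a prefix is what an embedding layer would map to a dense vector before the sequence reaches the LSTM.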
The system generates the caption word by word, starting with a token like startseq and iteratively predicting the next
word based on the previous ones. The process continues until an endseq token is predicted, signaling the end of the
caption.
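The word-by-word loop can be sketched as follows. This is a minimal illustration, assuming greedy (highest-probability) selection, which the paper does not confirm; `predict_next` is a hypothetical stand-in for the trained CNN-LSTM, and the `max_words` cap of 35 is an arbitrary safety limit, not a value from the paper.

```python
from typing import Callable, Dict, List, Sequence

START, END = "startseq", "endseq"

# Hypothetical model interface: given the image features and the words so far,
# return a probability for each candidate next word.
NextWordModel = Callable[[Sequence[float], List[str]], Dict[str, float]]


def greedy_caption(
    features: Sequence[float], predict_next: NextWordModel, max_words: int = 35
) -> str:
    """Build a caption one word at a time until endseq or the length cap."""
    words = [START]
    for _ in range(max_words):
        probs = predict_next(features, words)
        next_word = max(probs, key=probs.get)  # greedy: take the most likely word
        if next_word == END:
            break
        words.append(next_word)
    return " ".join(words[1:])  # drop the startseq marker
```

The length cap matters because nothing guarantees the model will ever emit endseq; without it a poorly trained decoder could loop indefinitely.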
5.5. Text Enhancement Module (Story Generator using GPT-2)
Once the caption is generated, the Text Enhancement Module uses a pre-trained GPT-2 model to transform the short
caption into a full story. This is where the system's creativity is injected. GPT-2, a powerful transformer-based language
model, is designed for text generation and can create highly coherent and contextually relevant text based on a given
prompt. In this case, the caption serves as the prompt.
The module sends the caption to GPT-2, which then generates a detailed, vivid story. The story goes beyond simple
captioning, weaving a narrative that adds depth and context to the image. For example, a caption like “a dog running in
the park” might evolve into a story describing the dog’s adventure in the park, its interaction with other animals, and
its surroundings.
The story generation is parameterized by settings such as temperature (which controls the randomness of the output),
top-p (which regulates the diversity of the predictions), and max_length (which caps the length of the generated text,
measured in tokens rather than words). These settings are tuned to produce stories that are both creative and coherent.
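The effect of temperature and top-p can be shown on a toy next-token distribution. The sketch below is a plain-Python illustration of the two techniques, not GPT-2 itself or any library's implementation; the function names are hypothetical, and the default values of 0.9 are placeholders because the paper does not report the settings it used.

```python
import math
import random
from typing import Dict, Optional


def apply_temperature(logits: Dict[str, float], temperature: float) -> Dict[str, float]:
    """Softmax over logits divided by temperature (lower values sharpen)."""
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def top_p_filter(probs: Dict[str, float], top_p: float) -> Dict[str, float]:
    """Keep the most likely tokens until their mass reaches top_p, then renormalize."""
    kept: Dict[str, float] = {}
    mass = 0.0
    for tok, p in sorted(probs.items(), key=lambda item: item[1], reverse=True):
        kept[tok] = p
        mass += p
        if mass >= top_p:
            break
    return {tok: p / mass for tok, p in kept.items()}


def sample_token(
    logits: Dict[str, float],
    temperature: float = 0.9,  # placeholder value, not from the paper
    top_p: float = 0.9,        # placeholder value, not from the paper
    rng: Optional[random.Random] = None,
) -> str:
    """Draw one token from the temperature-scaled, top-p-filtered distribution."""
    rng = rng or random.Random()
    nucleus = top_p_filter(apply_temperature(logits, temperature), top_p)
    tokens = list(nucleus)
    return rng.choices(tokens, weights=[nucleus[t] for t in tokens], k=1)[0]
```

A lower temperature concentrates probability on the likeliest token, while a lower top-p removes the unlikely tail entirely, so the two settings limit randomness in different ways. A max_length setting would simply bound how many times such a sampling step is repeated.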
5.6. Text-to-Speech Module (gTTS)
The generated story is passed to the Text-to-Speech (TTS) Module, where it is converted into spoken words. This
module uses gTTS (Google Text-to-Speech), a lightweight and easy-to-use Python library that interfaces with Google
Translate's text-to-speech API. gTTS supports multiple languages, but in this case, the system generates English audio.
Once the story is converted into an audio file, the gTTS module saves it as an MP3 file. The MP3 file can be played back
to the user, providing an immersive audio experience. This module is particularly useful for creating accessible content,
enabling the visually impaired to enjoy the generated stories through audio.
The speech synthesis is not only a critical component for accessibility but also adds a dynamic, lifelike layer to the
system, making the narrative experience more engaging.
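A minimal sketch of this step is shown below. The `gTTS(text=..., lang="en").save(path)` call is the library's standard usage; everything else is an assumption for illustration, including the `narrate` wrapper, the default file name `story.mp3`, the whitespace cleanup, and the injectable `synthesize` hook that lets the function run without network access.

```python
from pathlib import Path
from typing import Callable, Optional


def _gtts_synthesize(text: str, out_path: Path) -> None:
    """Default backend: gTTS, imported lazily (third-party, needs network access)."""
    from gtts import gTTS

    gTTS(text=text, lang="en").save(str(out_path))


def narrate(
    story: str,
    out_path: str = "story.mp3",  # hypothetical default file name
    synthesize: Optional[Callable[[str, Path], None]] = None,
) -> Path:
    """Convert a story to an MP3 file and return the path it was written to."""
    text = " ".join(story.split())  # collapse newlines and repeated spaces
    if not text:
        raise ValueError("story is empty; nothing to synthesize")
    target = Path(out_path)
    (synthesize or _gtts_synthesize)(text, target)
    return target
```

Calling `narrate(story)` with the default backend writes `story.mp3` to the working directory, which a playback component can then open.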
5.7. Output Module
The Output Module is responsible for playing the generated audio file and displaying relevant visual information. In this
module, the audio file generated by the TTS system is played back using the playsound library, allowing the user to
listen to the story.
6. Results and Discussion
Figure 3 User Interface (Source: Authors)

Figure 4 Input Image (Source: Authors)
Figure 5 Caption and Story generation (Source: Authors)
Figure 6 Audio file stored as .mp3 (Source: Authors)
7. Conclusion
This paper presents an integrated AI system that performs image captioning, story generation, and text-to-speech
conversion in a unified pipeline. By leveraging deep learning techniques in computer vision and natural language
processing, the model offers a novel approach to creating engaging, human-like narratives from static images.
Beyond its technical implementation, the system humanizes machine perception, transforming passive image
recognition into an active storytelling experience. Applications of this work include educational tools, assistive
technologies for the visually impaired, interactive content platforms, and creative writing aids.
Future work will explore enhancements such as attention mechanisms, multimodal transformers, multilingual support,
and real-time web or mobile deployment.
Compliance with ethical standards
Disclosure of conflict of interest
No conflict of interest to be disclosed.
References
[1] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y. (2015). Show, Attend and Tell:
Neural Image Caption Generation with Visual Attention. In Proceedings of the International Conference on Machine
Learning (ICML) (pp. 2048–2057). http://proceedings.mlr.press/v37/xuc15.html
[2] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language Models are Unsupervised
Multitask Learners. OpenAI. models/language_models_are_unsupervised_multitask_learners.pdf
[3] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and Tell: A Neural Image Caption Generator. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (pp. 3156–3164).
https://doi.org/10.1109/CVPR.2015.7298935
[4] Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., et al. (2017). Tacotron: Towards
End-to-End Speech Synthesis. In Proceedings of Interspeech 2017. https://doi.org/10.21437/Interspeech.2017-1452
[5] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., &
Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint
arXiv:1609.03499. https://arxiv.org/abs/1609.03499
Authors’ short biographies
Mr. Ashwani Attri:
Mr. Ashwani Attri completed his B.Tech and M.Tech in CSE at IIT Kharagpur. He worked
in the IT sector as a Software Engineer and is currently working as an Assistant Professor,
Department of CSE (Data Science), ACE Engineering College. He aims to inspire students and
contribute to advancements in technology through his work.
Priyanka Gudeboyena:
Priyanka is a final-year B.Tech student at ACE Engineering College, specializing in Computer
Science and Engineering (Data Science). Her academic journey has been shaped by a strong
commitment to exploring data-driven solutions to real-world challenges. Over the course
of her studies, she has developed a profound interest in areas such as cybersecurity,
machine learning and networking.
Vaishnavi Chigurla:
Vaishnavi is a final-year B.Tech student at ACE Engineering College, specializing in
Computer Science and Engineering (Data Science). Her academic journey has led her to
explore the dynamic fields of cybersecurity, machine learning, artificial intelligence, and
networking. With a keen interest in innovation, she strives to bridge the gap between data
science and real-world applications, continuously expanding her knowledge and expertise.
Soumika Moluguri:
Soumika is a final-year B.Tech student at ACE Engineering College, specializing in Computer
Science and Engineering (Data Science). Her academic journey has led her to explore diverse
domains, with a particular focus on web development, machine learning, and artificial
intelligence. Skilled in building dynamic and efficient web solutions, she is committed to
continuous learning and innovation to bridge the gap between data science and modern web
technologies.
Nithin Kasoju:
Nithin is a final-year B.Tech student at ACE Engineering College, specializing in Computer
Science and Engineering (Data Science). His academic journey has been shaped by a strong
commitment to exploring data-driven solutions to real-world challenges. Over the course
of his studies, he has developed a profound interest in areas such as machine learning,
artificial intelligence, and networking.
