
Machine Learning-Driven Empathetic Human-Computer Interaction

Author: Shubhangi Vikas Kumbhar
Publisher: Zenodo
DOI: 10.5281/zenodo.17315527
Source: https://zenodo.org/records/17315527/files/S063837.pdf
International Journal of Advance and Applied Research
www.ijaar.co.in
ISSN – 2347-7075
Impact Factor – 8.141
Peer Reviewed
Bi-Monthly
Vol. 6 No. 38
September - October - 2025
Machine Learning-Driven Empathetic Human-Computer Interaction
Shubhangi Vikas Kumbhar
Assistant Professor,
Department of Computer Science
Dr. D.Y. Patil Science and Computer Science College, Akurdi, Pune
Corresponding Author – Shubhangi Vikas Kumbhar
DOI - 10.5281/zenodo.17315527
Abstract:
Empathetic Human-Computer Interaction (HCI) aims to bridge emotional gaps between humans and intelligent systems. This paper proposes an enhanced framework for Empathic Conversational Systems (ECS) by leveraging machine learning algorithms, multimodal data, and real-time biosensor integration. The architecture integrates gaze tracking, sentiment analysis, cross-modal fusion, and reinforcement-learning-driven response selection. Drawing on state-of-the-art systems such as ECMF and MEDUSA, our approach demonstrates robust emotion recognition under naturalistic conditions and improved empathetic response alignment. Evaluations on benchmark datasets (IEMOCAP, SEMAINE) and a custom biosensor corpus show that the proposed system achieves 85.6% accuracy and an F1-score of 0.82, outperforming CNN-LSTM and SVM baselines.
Applications span healthcare, education, and assistive technologies. Contributions include: (i) a scalable multimodal fusion pipeline, (ii) an RL-based empathetic policy for adaptive responses, and (iii) on-device-friendly biosensor integration strategies.
Keywords: Human-Computer Interaction, Empathic Conversational Systems, Multimodal Emotion Recognition, Reinforcement Learning, Physiological Sensing
Introduction:
Empathy in Artificial Intelligence (AI) is widely regarded as a critical frontier in Human-Computer Interaction (HCI). While advances in large-scale language and vision models have enabled more fluent conversations and natural interactions, achieving accurate affect recognition and context-sensitive empathetic responses remains an unsolved challenge (Wa a, 2025; L. Wu & Lin, 2025). These limitations are particularly significant in real-world applications such as digital mental health support, eldercare companions, and personalized tutoring systems, where trust, emotional sensitivity, and ethical considerations are paramount (Sa a yazdi & Yu, 2025).
Recent research highlights several promising directions, including multimodal fusion techniques that integrate audio, visual, and textual streams (zhang2022multimodal), graph-based encoders for modeling social and contextual cues (Hu et al., 2025), and staged training strategies for robustness across datasets (Chatzichristodoulou et al., 2025).
Despite this progress, many systems fail in conditions involving noisy signals, subtle affective states, or cultural variability in emotional expression. This paper proposes a reinforcement learning (RL)-driven, multimodal architecture that integrates physiological sensing and adaptive response generation to address these limitations.
Related Work:
Early approaches to emotion recognition relied on unimodal signals, particularly speech and prosody, often modeled using support vector machines (SVMs) (soleymani2012emotion). However, such systems lacked robustness across speakers and environments. Later work demonstrated that multimodal fusion, combining speech, text, and facial features, significantly improves recognition accuracy (paiva2017empathic; zhang2022multimodal).
Recent innovations include ECMF, which introduced cross-modal self-attention and label refinement, achieving strong performance on multimodal emotion benchmarks (Hu et al., 2025). Similarly, MEDUSA leveraged a four-stage training pipeline and ensemble learning, winning the Interspeech 2025 Speech Emotion Recognition challenge (Chatzichristodoulou et al., 2025).
Physiological signals from wearable sensors, including heart rate (HR), galvanic skin response (GSR), and EEG activity, are now recognized as critical complements to audiovisual cues, especially for detecting subtle or ambiguous affective states (Nandini et al., 2025). Recent transformer-based fusion models such as HyFusER improve emotion recognition via dual cross-modal attention (Yi et al., 2025), while TMNet integrates EEG and speech through transformer fusion (Alam et al., 2025). Physiological ensemble learning methods have also shown robust performance (Liao et al., 2025; Nandini et al., 2025). Self-supervised GNN methods like SS-EMERGE leverage EEG representations effectively (Ahuja & Sethia, 2025), and comprehensive reviews by Wu et al. and by Pillalamarri and Shanmugam offer valuable overviews of multimodal and EEG-based fusion strategies (Pillalamarri & Shanmugam, 2025; Y. Wu et al., 2025). Hierarchical MoE approaches address real-world modality variability (Zhu et al., 2025), and AffectGPT-R1 introduces reinforcement learning aligned with emotion-wheel metrics for open-vocabulary emotion decoding (Lian, 2025). Finally, adaptive graph convolution in conversational settings demonstrates powerful contextual fusion (Feng & Fan, 2025).
Nevertheless, adaptive empathetic dialogue remains underdeveloped. Few studies combine emotion recognition with reinforcement learning-based response generation, even though empathy requires not only recognition but also appropriately aligned reactions (Wa a, 2025).
Proposed Methodology:
Our system processes multimodal inputs through a five-stage pipeline (Figures 1 and ??).
Stage 1: Multimodal Input
Inputs include:
• Speech & Text: Conversational utterances captured via microphone and transcribed for semantic and prosodic analysis.
• Visual: Facial expressions, micro-expressions, and contextual scene features from camera input.
• Gaze & Pose: Head orientation and gestures modeled as graph-based attention cues.
• Physiological: HR, GSR, and EEG bands collected through wearable devices.
Stage 2: Feature Extraction
Each modality is encoded through tailored networks:
• Audio/Text: Transformer encoders (BERT for text, wav2vec 2.0 for speech).
• Visual: Dual-path CNN encoders extract both global scene and localized facial features.
• Gaze/Pose: Graph neural networks model interpersonal attention and non-verbal dynamics.
• Physiology: CNN-LSTM stacks encode biosensor sequences efficiently for real-time inference.
Stage 3: Cross-Modal Fusion
A cross-modal transformer aligns features across time and modality. Reliability gating down-weights noisy or missing channels, while residual fusion ensures stability.
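The paper does not spell out how the reliability gate or the residual path is computed. As a rough illustration only, the sketch below assumes each modality comes with a scalar reliability score in [0, 1], skips channels that are missing, renormalises the remaining scores into weights, and adds the weighted sum to an anchor vector as a skip connection. The function names `gated_fusion` and `residual_fusion`, the scalar-score formulation, and the example values are assumptions made for illustration; a learned gate inside the cross-modal transformer would replace them in practice.

```python
from typing import Dict, List, Optional


def gated_fusion(
    features: Dict[str, Optional[List[float]]],
    reliability: Dict[str, float],
) -> List[float]:
    """Weighted sum of modality vectors; weights come from reliability scores.

    Hypothetical formulation: missing modalities (None) are dropped and the
    remaining reliability scores are renormalised to sum to 1.
    """
    # Keep only channels that delivered features and have positive reliability.
    available = {
        name: vec
        for name, vec in features.items()
        if vec is not None and reliability.get(name, 0.0) > 0.0
    }
    if not available:
        raise ValueError("no usable modality")

    total = sum(reliability[name] for name in available)
    dim = len(next(iter(available.values())))
    fused = [0.0] * dim
    for name, vec in available.items():
        weight = reliability[name] / total  # noisy channels get small weights
        for i, value in enumerate(vec):
            fused[i] += weight * value
    return fused


def residual_fusion(anchor: List[float], fused: List[float]) -> List[float]:
    """Add the fused vector to an anchor representation (skip connection)."""
    return [a + f for a, f in zip(anchor, fused)]


# Hypothetical two-dimensional features; the physiological channel is missing.
features = {"speech": [1.0, 0.0], "visual": [0.0, 1.0], "physio": None}
reliability = {"speech": 0.75, "visual": 0.25, "physio": 0.9}

fused = gated_fusion(features, reliability)
stabilised = residual_fusion([1.0, 1.0], fused)
print(fused, stabilised)
```

Because the physiological channel is absent, its reliability score is ignored and the speech and visual weights alone determine the fused vector, which is the behaviour the gating step is meant to provide when a wearable drops out.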
Stage 4: Reinforcement Learning Policy
An RL agent selects empathetic responses based on a composite reward:
R(s, a) = w1 · Acc + w2 · EmpScore − w3 · Latency.
Training uses Proximal Policy Optimization (PPO); the weights w1, w2, and w3 balance recognition accuracy (Acc), empathy ratings (EmpScore), and response latency.
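To make the reward concrete, the following sketch evaluates R(s, a) for a single response and shows the standard per-sample PPO clipped surrogate term, min(r · A, clip(r, 1 − ε, 1 + ε) · A). The weight values, the latency unit (seconds), the clipping range ε = 0.2, and the function names are assumptions for illustration; the paper does not report the values it used.

```python
def composite_reward(
    acc: float,
    emp_score: float,
    latency_s: float,
    w1: float = 1.0,  # hypothetical weight on recognition accuracy
    w2: float = 1.0,  # hypothetical weight on the empathy rating
    w3: float = 0.5,  # hypothetical penalty weight on latency
) -> float:
    """R(s, a) = w1 * Acc + w2 * EmpScore - w3 * Latency."""
    return w1 * acc + w2 * emp_score - w3 * latency_s


def ppo_clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# A correct, moderately empathetic response delivered in half a second.
reward = composite_reward(acc=1.0, emp_score=0.5, latency_s=0.5)
print(reward)

# A large policy ratio with positive advantage is capped by the clip.
print(ppo_clipped_term(ratio=1.5, advantage=2.0))
```

Subtracting the latency term means a slower response must earn a higher empathy or accuracy score to be preferred, which is one way to read the trade-off the reward describes.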
Stage 5: Empathetic Response Generation
The system outputs empathetic responses (verbal, prosodic, or behavioral) aligned with user affect and conversation history.
Experimental Setup
Datasets
Evaluation used:
• IEMOCAP: Multimodal conversations with scripted and improvised affect.
• SEMAINE: Dyadic interactions with fine-grained emotional annotations.
• Custom Corpus: Biosensor data (HR, EDA, EEG) collected under controlled affective tasks.
Baselines and Metrics
Baselines: (i) SVM (audio-only), (ii) CNN-LSTM multimodal fusion. Metrics: accuracy, macro-F1, per-class recall, and latency. Significance was tested with paired t-tests.
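The metrics named above can be computed as in the sketch below. It uses the standard definitions: accuracy as the fraction of correct predictions, macro-F1 as the unweighted mean of per-class F1 = 2TP / (2TP + FP + FN), and the paired t statistic as the mean of paired differences divided by its standard error. The labels and per-fold scores are made-up values for illustration, and the paper does not state which unit (fold, speaker, or session) was paired.

```python
import math
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """t = mean(d) / (sd(d) / sqrt(n)) for paired differences d = a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    if var == 0.0:
        raise ValueError("paired differences have zero variance")
    return mean / math.sqrt(var / n)


# Hypothetical predictions on four utterances.
y_true = ["anger", "anger", "joy", "joy"]
y_pred = ["anger", "joy", "joy", "joy"]
acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)

# Hypothetical paired scores for two systems on the same three folds.
t_stat = paired_t_statistic([3.0, 4.0, 5.0], [1.0, 1.0, 1.0])
print(acc, f1, t_stat)
```

The t statistic would then be compared against a t distribution with n − 1 degrees of freedom to obtain a p-value.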
Table 1 compares the performance of baseline and proposed models. The audio-only SVM baseline achieves 68.2% accuracy, highlighting the limitations of unimodal approaches. Incorporating multimodal fusion through CNN-LSTM improves performance to 76.9% accuracy and 0.74 macro-F1. The proposed system, which integrates multimodal fusion, reinforcement learning, and physiological signals, significantly outperforms both baselines, achieving 85.6% accuracy and an F1-score of 0.82. These results indicate that physiological sensing and adaptive response generation provide complementary cues that enhance recognition robustness.
Discussion:
The system significantly outperforms baseline models (p < .01). Improvements are most pronounced for subtle emotions such as sadness and neutrality, consistent with evidence that physiology provides complementary cues beyond audiovisual signals (Nandini et al., 2025).
Ablation studies show that removing physiology decreases macro-F1 by 4%. Disabling RL reduces user-rated empathy alignment even though recognition accuracy remains similar, echoing claims that empathetic response quality cannot be measured solely through classification accuracy (Wa a, 2025). These findings align with prior multimodal emotion recognition studies emphasizing robustness and complementarity across modalities (zhang2022multimodal; paiva2017empathic).
Limitations:
Although the proposed system demonstrates clear improvements, several limitations should be acknowledged. First, evaluation relied on relatively controlled datasets (IEMOCAP, SEMAINE, and a lab-collected biosensor corpus), which may not fully represent the complexity of in-the-wild conversations. Future work should test the framework in uncontrolled, real-world conditions.
Second, while physiological signals (HR, EDA, EEG) enhance recognition, they require wearable devices that may not always be practical or comfortable for users in daily interactions. Lightweight sensing alternatives and calibration-free approaches could increase adoption.
Third, cultural and linguistic variability remains underexplored. Emotional expressions differ significantly across populations, and models trained on Western-centric corpora may not generalize globally. Addressing cross-cultural fairness and inclusivity will be critical for deployment in healthcare, education, and assistive contexts.
Finally, reinforcement learning introduces computational overhead, which may limit real-time performance on resource-constrained devices. Optimizing policies for efficiency and portability is therefore an important direction for future development.
Conclusion and Future Work:
We proposed a multimodal, RL-enhanced framework for empathetic HCI that combines speech, vision, gaze, and physiology with reinforcement learning-driven response generation. The system achieves state-of-the-art performance and demonstrates improved empathy alignment. Future work will explore:
1. Scaling to diverse, real-world cultural contexts.
2. Developing lightweight, edge-deployable versions for wearables and mobile devices.
3. Incorporating fairness and privacy safeguards into empathetic AI design.
Practical Implications:
Beyond research contributions, this work has direct implications for applied domains. In healthcare, empathetic systems could provide emotionally sensitive support to patients in therapy or rehabilitation. In education, adaptive tutors could foster greater engagement by responding empathetically to learners' frustration or motivation levels. In eldercare and assistive technologies, empathetic AI could enhance companionship, reduce social isolation, and support independence. By integrating multimodal sensing with reinforcement learning, our framework brings empathetic HCI closer to practical, ethically responsible deployment in real-world settings.
References:
1. Ahuja, C., & Sethia, D. (2025). SS-EMERGE: Self-supervised enhancement for multidimension emotion recognition using GNNs for EEG. Scientific Reports, 15, 14254.
2. Alam, M. M., Dini, M. A., Kim, D.-S., & Jun, T. (2025). TMNet: Transformer-fused multimodal framework for emotion recognition via EEG and speech [in press]. ICT Express.
3. Chatzichristodoulou, E., Wang, P., & Chen, H. (2025). MEDUSA: Multi-stage training pipelines for robust empathic systems. Neural Networks, 180, 200–215.
4. Feng, J., & Fan, X. (2025). Cross-modal context fusion and adaptive graph convolutional network for multimodal conversational emotion recognition. arXiv preprint arXiv:2501.15063.
5. Hu, Y., Zhang, X., & Li, J. (2025). ECMF: Cross-modal self-attention for multimodal emotion recognition. IEEE Transactions on Affective Computing, 16 (1), 45–57.
6. Lian, Z. (2025). AffectGPT-R1: Leveraging reinforcement learning for open-vocabulary emotion recognition. arXiv preprint arXiv:2508.01318.
7. Liao, Y., Gao, Y., Wang, F., Zhang, L., Xu, Z., & Wu, Y. (2025). Emotion recognition with multiple physiological parameters based on ensemble learning. Scientific Reports, 15, 19869.
8. Nandini, D., Yadav, J., Singh, V., Mohan, V., & Agarwal, S. (2025). An ensemble deep learning framework for emotion recognition through wearable devices' multi-modal physiological signals. Scientific Reports, 15, 17263.
9. Pillalamarri, R., & Shanmugam, U. (2025). A review on EEG-based multimodal learning for emotion recognition. Artificial Intelligence Review.
10. Sa a yazdi, N., & Yu, K. (2025). Ethics and design of empathetic AI in healthcare and education. AI and Society, 40 (3), 455–470.
11. Wa a, M. (2025). Empathetic dialogue systems: Challenges and opportunities in 2025. Proceedings of the International Conference on Human Factors in Computing, 101–115.
12. Wu, L., & Lin, Y. (2025). Towards empathetic AI: A 2025 survey on emotion-aware human-computer interaction. ACM Transactions on Interactive Intelligent Systems, 15 (2), 1–30.
13. Wu, Y., Mi, Q., & Gao, T. (2025). A comprehensive review of multimodal emotion recognition: Techniques, challenges, and future directions. Biomimetics, 10 (7), 418.
14. Yi, M.-H., Kwak, K.-C., & Shin, J.-H. (2025). HyFusER: Hybrid multimodal transformer for emotion recognition using dual cross-modal attention. Applied Sciences, 15 (1053).
15. Zhu, Y., Han, L., Jiang, G., Zhou, P., & Wang, Y. (2025). Hierarchical MoE: Continuous multimodal emotion recognition with incomplete and asynchronous inputs. arXiv preprint arXiv:2508.02133.

Table 1: Performance comparison of baseline and proposed models on multimodal emotion recognition. Accuracy represents the overall classification correctness, while F1-score reflects the balance between precision and recall across all emotion classes. The proposed system shows the highest performance by integrating multimodal fusion, reinforcement learning, and physiological sensing.

| Model | Accuracy (%) | F1-Score |
|---|---|---|
| SVM (Audio-only) | 68.2 | 0.65 |
| CNN-LSTM (Multimodal) | 76.9 | 0.74 |
| Proposed System (Fusion+RL+Physio) | 85.6 | 0.82 |

Figure 1: High-level methodology flow of the proposed empathetic HCI system.
Figure 1 presents the high-level flow of the proposed system. The process begins with multimodal data collection, which includes speech, visual signals, gaze, and physiological inputs. Each of these modalities is individually encoded through specialized feature extractors. The extracted features are then passed into a cross-modal transformer that aligns temporal and contextual representations across channels. Following fusion, an RL-based empathy policy evaluates the user's affective state and determines an appropriate empathetic response. This modular flow ensures that raw inputs are progressively refined into meaningful affective representations before decision-making, thereby increasing both robustness and interpretability.
Figure 2: Per-emotion classification accuracy of the proposed system.
Figure 2 reports per-emotion recognition accuracy for the proposed system. Emotions such as anger and sadness show the greatest improvements, largely due to the inclusion of physiological signals that capture subtle arousal and stress indicators. Joy demonstrates moderate accuracy, reflecting the variability in how individuals outwardly express positive affect. Neutral states remain the most challenging, with relatively lower performance, consistent with the ambiguity and subtlety of neutral expressions. This visualization underscores the value of multimodal fusion, as no single modality performs consistently across all emotion categories.
Figure 3: Confusion matrix of emotion classification results on the test set.
Figure 3 provides a detailed confusion matrix of the classification results. The model shows strong precision in detecting anger, while sadness is sometimes misclassified as anger due to overlapping prosodic and visual cues. Joy occasionally overlaps with sadness, highlighting the difficulty of distinguishing between subdued positive affect and mild negative states.
Neutral emotions show the greatest confusion, often being mistaken for joy or sadness depending on context. These misclassifications reinforce the importance of physiological sensing and contextual modeling, which help reduce overlap and improve recognition consistency.
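
The kind of reading applied to Figure 3 can be reproduced from any confusion matrix, as sketched below. The counts are invented for illustration and are not the paper's results; they are only shaped to echo the qualitative pattern described above. Rows are assumed to be true classes and columns predicted classes, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple

LABELS = ["anger", "joy", "sadness", "neutral"]

# Hypothetical counts, NOT the paper's data. Rows = true, columns = predicted.
CONFUSION = [
    [45, 1, 3, 1],   # true anger
    [1, 40, 5, 4],   # true joy
    [6, 3, 38, 3],   # true sadness
    [2, 9, 7, 32],   # true neutral
]


def per_class_precision_recall(
    matrix: List[List[int]], labels: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {label: (precision, recall)} from a square confusion matrix."""
    stats: Dict[str, Tuple[float, float]] = {}
    for i, label in enumerate(labels):
        tp = matrix[i][i]
        predicted = sum(row[i] for row in matrix)  # column sum
        actual = sum(matrix[i])                    # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        stats[label] = (precision, recall)
    return stats


def most_confused_pair(
    matrix: List[List[int]], labels: List[str]
) -> Tuple[str, str, int]:
    """Return (true label, predicted label, count) of the largest off-diagonal cell."""
    best: Tuple[str, str, int] = ("", "", -1)
    for i, row in enumerate(matrix):
        for j, count in enumerate(row):
            if i != j and count > best[2]:
                best = (labels[i], labels[j], count)
    return best


stats = per_class_precision_recall(CONFUSION, LABELS)
worst = most_confused_pair(CONFUSION, LABELS)
for label, (precision, recall) in stats.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(worst)
```

With these illustrative counts, neutral has the lowest recall and its largest error is being predicted as joy, which is the type of confusion the discussion attributes to ambiguous neutral expressions.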