AI-Powered Secure Document Anonymization Pipeline: A Serverless AWS Architecture for PII Detection and Redaction

Author: Peddy, Shiva Sai
Publisher: Zenodo
DOI: 10.5281/zenodo.17707432
Source: https://zenodo.org/records/17707432/files/document-anonymization-publication.pdf
AI-Powered Secure Document Anonymization Pipeline: A
Serverless AWS Architecture for PII Detection and Redaction
Shiva Sai Peddy
Birla Institute of Technology and Science, Pilani
November 2025
Abstract
Organizations struggle with manual redaction of Personally Identifiable Information (PII)
from sensitive documents, facing significant compliance challenges with GDPR and HIPAA
regulations. This work develops an intelligent, serverless document anonymization pipeline
using Amazon Web Services to automate PII detection and redaction processes. The solution
employs AWS Step Functions to orchestrate a microservices architecture that ingests
documents through a secure web interface, extracts text using Amazon Textract, identifies
sensitive information via Amazon Comprehend, and applies configurable anonymization
strategies. The system integrates multiple AWS services including Lambda functions for
processing logic, API Gateway for API communication between frontend and backend, S3
for storage, DynamoDB for audit trails, and EventBridge for workflow management. Key
features include a JavaScript-based frontend with real-time progress tracking, support for
multiple document formats (PDF, TXT, and images), and intelligent PII detection covering
names, Social Security numbers, emails, medical information and other sensitive PII.
Security measures encompass malware scanning via GuardDuty, encryption at rest and in
transit, fine-grained IAM policies, and comprehensive audit logging via CloudTrail. The
serverless architecture ensures cost-effectiveness through pay-per-use pricing while providing
automatic scaling capabilities. This implementation showcases a practical application
of cloud-native architectures and AI services for solving real-world data privacy challenges
in enterprise environments.
1 Introduction
1.1 Background and Problem Statement
In a data-driven world, organizations manage vast quantities of sensitive digital documents,
including customer records, financial statements, medical reports, and legal contracts. Manually
identifying and redacting Personally Identifiable Information (PII) or other confidential data in
these documents is labor-intensive, error-prone, and not scalable. This problem is compounded
by strict global data privacy regulations such as GDPR, HIPAA, and CCPA, which require tight
controls over sensitive information. Non-compliance can result in severe penalties, reputational
damage, and loss of customer trust.
A critical need exists for innovative, automated solutions that can efficiently and securely
process and anonymize sensitive content at scale. This project proposes an AI-powered serverless
pipeline as a transformative solution to this significant industry requirement.
1.2 Purpose and Objectives
The purpose of this project is to build and validate an automated, AI-powered secure document
anonymization pipeline using AWS. Key objectives include:
•Understanding industry challenges and regulatory requirements
•Designing a resilient serverless architecture
•Implementing advanced AWS AI services for sensitive data detection
•Developing automated anonymization logic with high accuracy
•Embedding robust end-to-end security measures
•Orchestrating complex workflows with Lambda and Step Functions
•Implementing retry mechanisms for failed scans
•Validating performance and compliance requirements
•Providing ongoing monitoring and alerting capabilities
•Developing a user-friendly web application for document anonymization
1.3 Scope and Limitations
The scope of this project encompasses the design, implementation, and validation of a
proof-of-concept of an AI-Powered Secure Document Anonymization Pipeline. The prototype
demonstrates an end-to-end automated workflow that securely ingests various document types (TXT,
PDF, images), performs automated malware scanning, leverages AI for sophisticated content
extraction and sensitive data identification, and programmatically replaces sensitive elements
with anonymized placeholders. The anonymized documents are stored securely with controlled
access mechanisms via pre-signed URLs.
Current Limitations:
1. AWS Lambda limits execution time, memory, and payload size, restricting processing of
large or complex files
2. Document layout preservation can be imperfect for tables, diagrams, and figures
3. File export formats are limited to TXT and PDF only
4. Some complex document formats may not be fully supported
1.4 Literature Review
1.4.1 Serverless Computing and Cloud Data Processing
The evolution of serverless computing has significantly transformed modern cloud-based
applications, particularly in data processing workflows. Jonnakuti (2025) demonstrates the integration
of Amazon EventBridge and AWS Step Functions for intelligent orchestration of real-time AI
business pipelines, highlighting the potential of event-driven architectures to enhance
automation and responsiveness in complex data workflows. This serverless paradigm offers substantial
benefits in terms of cost optimization, automatic scaling, and reduced operational overhead
compared to traditional server-based approaches [?].
Complementing this perspective, Poorvadevi et al. (2025) provide comprehensive operational
analyses of serverless computing on cloud platforms, emphasizing its effectiveness in
handling dynamic workloads and achieving cost-efficiency through pay-per-use models [?]. Their
studies reveal that serverless architectures can significantly reduce infrastructure management
complexity while maintaining high performance for data processing tasks. The research
demonstrates that serverless solutions are particularly well-suited for applications with unpredictable
traffic patterns and varying computational demands, making them ideal for document
processing pipelines where workloads can fluctuate significantly [?].
1.4.2 AI-Driven Sensitive Information Detection and Anonymization
The field of personally identifiable information (PII) detection and anonymization has witnessed
remarkable advancements through the integration of artificial intelligence and machine
learning techniques. Mishra et al. (2025) present a groundbreaking hybrid approach that combines
rule-based natural language processing with machine learning algorithms for PII detection and
anonymization in financial documents. Their methodology achieves superior accuracy compared
to traditional rule-based systems alone, demonstrating the effectiveness of combining
deterministic rules with adaptive ML models to handle diverse document formats and content variations
[?].
Building upon AI-powered healthcare applications, Pavithra and M (2025) explore the
implementation of Amazon Bedrock for medical data processing, showcasing how advanced AI
models can be leveraged for sensitive healthcare information handling while maintaining privacy
and compliance standards. Their work illustrates the growing importance of domain-specific
approaches to PII detection and anonymization, particularly in regulated industries where data
sensitivity requirements are paramount [?].
Gao and Li (2024) focus specifically on AI-empowered sensitive information detection and
anonymization in PDF files, addressing the unique challenges posed by unstructured document
formats. Their research demonstrates how computer vision and natural language processing
techniques can be combined to extract, identify, and anonymize sensitive data from complex
document layouts, extending beyond simple text-based detection to handle mixed content types
including images, tables, and formatted text [?].
1.4.3 Enterprise-Level Security and Compliance
The detection and remediation of sensitive information in enterprise environments has become
increasingly critical as organizations face growing security threats and regulatory requirements.
Ke et al. (2025) investigate the use of AI and machine learning technologies for finding
and remediating enterprise secrets in code and document sharing platforms, highlighting the
challenges of maintaining security across diverse organizational tools and workflows. Their work
emphasizes the need for comprehensive, automated solutions that can monitor multiple data
sources simultaneously while minimizing false positives and operational disruption [?].
Shukla et al. (2024) contribute to the field by presenting context-based approaches for
effective password detection in plain text, demonstrating how contextual analysis can improve
the accuracy of sensitive information identification. Their work highlights the importance of
considering semantic context rather than relying solely on pattern matching, which can lead to
both false positives and missed detections in real-world scenarios [?].
1.4.4 Privacy-Preserving Pipeline Architectures
Chakraborty et al. (2025) present SPIDEr, a secure pipeline for information de-identification
that incorporates end-to-end encryption, representing a comprehensive approach to
privacy-preserving data processing. Their architecture demonstrates how multiple security layers can
be integrated into a cohesive system that maintains data utility while ensuring privacy protection
throughout the entire processing lifecycle. This work is particularly relevant for understanding
how encryption, anonymization, and secure communication protocols can be combined to create
robust privacy-preserving systems [?].
The integration of these various research contributions reveals a clear trend toward
comprehensive, AI-driven approaches to sensitive information handling that combine multiple
technologies and methodologies to address the complex challenges of modern data privacy and security
requirements.
1.5 Uniqueness and Innovation
This project represents a significant advancement by integrating multiple cutting-edge
technologies into a unified, production-ready serverless architecture. Unlike previous research focusing
on individual components, our system provides a comprehensive end-to-end solution
combining document classification, text extraction, PII detection, and anonymization within a single
automated workflow.
Key innovations include:
•Leveraging AWS EventBridge for intelligent event-driven orchestration
•Incorporating advanced security measures including GuardDuty malware protection and
CloudTrail auditing
•Implementing cloud-native approaches combining serverless computing with AI/ML
applications
•Integrating Amazon Textract, Comprehend, and custom anonymization techniques
•Addressing the gap between academic research and practical deployment through a
scalable, production-ready solution
2 System Architecture and Implementation
Figure 1: System Architecture Overview
2.1 Frontend Component
The frontend web application was implemented using JavaScript, HTML, and CSS, providing
an intuitive user interface for document upload and progress tracking. The application was
hosted in an Amazon S3 bucket and delivered globally through Amazon CloudFront CDN
at the domain redpii.com. The frontend was secured using HTTPS with TLS certificates
through AWS Certificate Manager (ACM). CloudFront integrated these certificates to ensure
encrypted communication between clients and the application, protecting all data exchanged
during document uploads, status retrieval, and downloads.
Communication between the frontend and backend was facilitated through Amazon API
Gateway, which served as the HTTP API middleware. API Gateway handled all GET and
POST requests to ensure secure communication.
Key frontend features included:
•Global content delivery through CloudFront CDN
•Custom domain redpii.com (redact personally identifiable information) with HTTPS
encryption
•Real-time progress tracking and download management
•Responsive design supporting multiple device types
Backend communication features included:
•CloudFront Custom Header Integration: AWS CloudFront was configured to add
custom secret headers to incoming requests, enabling secure frontend–backend
communication and fine-grained access control. These headers enforced security policies, supported
request validation, and were forwarded to downstream services (such as API Gateway or
Lambda) for authorization and processing.
•API Gateway Request Handling: The web frontend interacted with AWS API
Gateway using both GET and POST methods. GET requests were used to retrieve metadata,
processing status, and anonymized documents. POST requests enabled file upload
initialization and triggered document processing workflows. API Gateway enforced routing,
input validation, and security rules while passing request payloads and headers to backend
Lambda functions for business logic execution.
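As an illustration of the custom header check described above, the following minimal Python sketch shows how a backend Lambda function might reject requests that did not pass through CloudFront. The header name x-origin-verify, the environment variable ORIGIN_VERIFY_SECRET, and the response bodies are hypothetical; the paper does not specify them.

```python
import hmac
import json
import os

# Hypothetical header name and environment variable (not given in the paper).
SECRET_HEADER = "x-origin-verify"
EXPECTED_SECRET = os.environ.get("ORIGIN_VERIFY_SECRET", "change-me")


def is_from_cloudfront(event: dict) -> bool:
    """Return True when the request carries the shared secret header."""
    # Header names are case-insensitive, so normalise them before the lookup.
    headers = {k.lower(): v for k, v in (event.get("headers") or {}).items()}
    supplied = headers.get(SECRET_HEADER, "")
    # compare_digest avoids leaking the secret through timing differences.
    return hmac.compare_digest(supplied.encode(), EXPECTED_SECRET.encode())


def handler(event: dict, context=None) -> dict:
    """Lambda entry point: refuse anything that bypassed CloudFront."""
    if not is_from_cloudfront(event):
        return {"statusCode": 403, "body": json.dumps({"error": "forbidden"})}
    # Business logic (status lookup, upload initialization, ...) would go here.
    return {"statusCode": 200, "body": json.dumps({"status": "ok"})}
```

Keeping the secret in configuration rather than in code allows it to be rotated in CloudFront and Lambda together without a redeployment of the handler logic.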
2.2 Processing Pipeline
The document processing pipeline was orchestrated using AWS Step Functions, which
coordinated multiple Lambda functions to classify, anonymize, and store documents. SQS queues
were interlinked with Lambda for asynchronous invocation (retries) and parallel processing. The
workflow began with the ProcessDocument parallel state, followed by the invocation of the
ClassifyDocument Lambda function. Upon successful classification, the pipeline executed the
AnonymizePII step and subsequently stored the processed output using the
StoreFinalDocument Lambda function.
Each stage of the workflow included dedicated catch handlers to manage operational
failures. Errors triggered corresponding Amazon SNS notifications such as ClassificationFailed,
AnonymizationFailed, and StorageFailed. When all steps completed successfully, the pipeline
published a final ProcessingComplete SNS message, marking the end of the workflow.
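The state sequence and catch handlers described above can be sketched as an Amazon States Language definition built in Python. This is a simplified illustration, not the project's actual state machine: it omits the ProcessDocument parallel wrapper, and the ARNs, Lambda function names, and the PipelineFailed terminal state are placeholders introduced here.

```python
import json

# Placeholder ARNs, assumed for illustration; substitute real resources.
LAMBDA_ARN = "arn:aws:lambda:us-east-1:123456789012:function:{name}"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:pipeline-events"


def lambda_task(function_name, next_state, on_error):
    """Task state that invokes a Lambda and routes any error to on_error."""
    return {
        "Type": "Task",
        "Resource": LAMBDA_ARN.format(name=function_name),
        "Catch": [
            {"ErrorEquals": ["States.ALL"], "ResultPath": "$.error", "Next": on_error}
        ],
        "Next": next_state,
    }


def sns_notify(message, next_state=None):
    """Task state that publishes an SNS message, then ends or moves on."""
    state = {
        "Type": "Task",
        "Resource": "arn:aws:states:::sns:publish",
        "Parameters": {"TopicArn": TOPIC_ARN, "Message": message},
    }
    if next_state is None:
        state["End"] = True
    else:
        state["Next"] = next_state
    return state


def build_definition():
    """Linear classify -> anonymize -> store chain with per-stage failure alerts."""
    return {
        "StartAt": "ClassifyDocument",
        "States": {
            "ClassifyDocument": lambda_task(
                "classify-document", "AnonymizePII", "ClassificationFailed"),
            "AnonymizePII": lambda_task(
                "anonymize-pii", "StoreFinalDocument", "AnonymizationFailed"),
            "StoreFinalDocument": lambda_task(
                "store-final-document", "ProcessingComplete", "StorageFailed"),
            "ClassificationFailed": sns_notify("ClassificationFailed", "PipelineFailed"),
            "AnonymizationFailed": sns_notify("AnonymizationFailed", "PipelineFailed"),
            "StorageFailed": sns_notify("StorageFailed", "PipelineFailed"),
            "ProcessingComplete": sns_notify("ProcessingComplete"),
            "PipelineFailed": {"Type": "Fail", "Error": "PipelineFailed"},
        },
    }


def dangling_targets(definition):
    """Return transition targets that do not name a defined state."""
    states = definition["States"]
    targets = {definition["StartAt"]}
    for state in states.values():
        if "Next" in state:
            targets.add(state["Next"])
        for catcher in state.get("Catch", []):
            targets.add(catcher["Next"])
    return sorted(targets - set(states))


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

Routing every failure notification into a Fail state makes the execution status reflect the error, while the success path ends on the ProcessingComplete publish.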
2.2.1 Orchestration
Amazon EventBridge automatically triggers AWS Step Functions workflows upon S3 upload
events. Step Functions orchestrate a multi-stage processing pipeline.
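An EventBridge rule for this trigger could use an event pattern along the following lines. The bucket name and the raw/ key prefix are assumptions made for illustration, and the matches helper is only a simplified local approximation of EventBridge matching (exact values and prefix rules) that is useful for testing a pattern offline.

```python
# Hypothetical bucket name and key prefix; the paper does not state them.
UPLOAD_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {
        "bucket": {"name": ["example-documents-bucket"]},
        "object": {"key": [{"prefix": "raw/"}]},
    },
}


def _rule_matches(rule, actual):
    """Match one pattern entry: a prefix rule or an exact value."""
    if isinstance(rule, dict) and "prefix" in rule:
        return isinstance(actual, str) and actual.startswith(rule["prefix"])
    return rule == actual


def matches(pattern, event):
    """Simplified matcher: every pattern field must match the event."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested object: recurse into the corresponding event field.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not any(_rule_matches(rule, actual) for rule in expected):
            # Leaf: a list of alternatives, at least one of which must match.
            return False
    return True


sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "example-documents-bucket"},
        "object": {"key": "raw/1234/report.pdf", "size": 1024},
    },
}
```

Restricting the pattern to the upload prefix keeps objects written by the pipeline itself (for example anonymized outputs) from re-triggering the workflow.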
Figure 2: Processing Workflow
2.2.2 Document Classification
The Classifier Lambda function provides the following capabilities:
•Multi-format Input Support: Accepts PDF, TIFF, PNG, JPG, and TXT files.
Automatically converts text files to PDF for unified processing.
•Text Extraction: Utilizes Amazon Textract in both synchronous and asynchronous
modes, depending on file type, to extract textual content and positional metadata.
•Metadata and Audit: Records all processing steps, status changes, and document
metadata in DynamoDB, supporting audit trails and real-time workflow tracking.
•Integration: Handles event-driven invocations and integrates seamlessly with other AWS
services in the pipeline architecture.
•Error Handling: Includes fallback logic and exception management to ensure robust
document processing and consistent output.
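The choice between synchronous and asynchronous Textract calls might look like the sketch below. The mapping of extensions to modes is an assumption (single images handled synchronously, potentially multi-page PDF and TIFF files asynchronously); the paper only states that the mode depends on file type. The textract argument is expected to be a boto3 Textract client, passed in so the routing logic can be exercised without AWS access.

```python
from pathlib import PurePosixPath

# Assumed routing; the paper does not say which file types use which mode.
SYNC_EXTENSIONS = {".png", ".jpg", ".jpeg"}
ASYNC_EXTENSIONS = {".pdf", ".tif", ".tiff"}


def textract_mode(key: str) -> str:
    """Pick 'sync' or 'async' from the object key's extension."""
    ext = PurePosixPath(key).suffix.lower()
    if ext in SYNC_EXTENSIONS:
        return "sync"
    if ext in ASYNC_EXTENSIONS:
        return "async"
    # TXT inputs are expected to have been converted to PDF before this point.
    raise ValueError(f"unsupported file type: {ext or key}")


def extract_text(textract, bucket: str, key: str) -> dict:
    """Run Textract on an S3 object and return text or an async job id."""
    location = {"S3Object": {"Bucket": bucket, "Name": key}}
    if textract_mode(key) == "sync":
        response = textract.detect_document_text(Document=location)
        # LINE blocks carry the recognised text in reading order.
        lines = [b["Text"] for b in response["Blocks"] if b["BlockType"] == "LINE"]
        return {"mode": "sync", "text": "\n".join(lines)}
    job = textract.start_document_text_detection(DocumentLocation=location)
    # The caller polls get_document_text_detection (or waits for a notification).
    return {"mode": "async", "job_id": job["JobId"]}
```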
2.2.3 PII Anonymization
The Anonymizer Lambda function implements automated anonymization of sensitive text using
AWS Comprehend. Its main workflow includes:
•Entity Detection: The function sends extracted document text to Amazon Comprehend
to identify entities marked as Personally Identifiable Information (PII), such as names,
email addresses, or ID numbers. For long documents, input text is truncated to comply
with service limits.
•Anonymization Logic: Detected PII entities are replaced systematically with
type-specific placeholders (e.g., [NAME], [EMAIL]), by iterating from the end of the text to
preserve correct offset positions and prevent overlap errors.
•Workflow Integration: Input and output status, along with PII detection results, are
updated in DynamoDB for full auditability and to track processing steps within the
pipeline. The anonymizer returns the document ID, anonymization status, the set of
detected entities, the resulting anonymized text, and file type information to downstream
applications.
•Error Handling: Robust exception handling and logging ensure failures are captured
and can optionally be recorded in DynamoDB. If no PII entities are found, the original
text is returned unmodified.
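The replace-from-the-end strategy described above can be sketched as follows. The entity dictionaries mirror the shape returned by Amazon Comprehend's DetectPiiEntities (Type, BeginOffset, EndOffset, Score); the min_score threshold, the max_bytes default, and the helper names are assumptions added for illustration, and the byte limit should be checked against the current service quota.

```python
def truncate_utf8(text: str, max_bytes: int = 100_000) -> str:
    """Cut text so its UTF-8 encoding fits within max_bytes (assumed limit)."""
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character split by the cut.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def anonymize(text: str, entities: list, min_score: float = 0.0) -> str:
    """Replace each detected entity with a [TYPE] placeholder."""
    result = text
    cutoff = len(text)  # start of the most recently replaced span
    # Work from the last entity to the first so that earlier offsets stay valid
    # even though placeholders differ in length from the text they replace.
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if entity.get("Score", 1.0) < min_score:
            continue
        if entity["EndOffset"] > cutoff:
            continue  # overlaps a span that was already replaced
        placeholder = f"[{entity['Type']}]"
        result = result[:entity["BeginOffset"]] + placeholder + result[entity["EndOffset"]:]
        cutoff = entity["BeginOffset"]
    return result
```

If the entity list is empty, the loop body never runs and the original text is returned unchanged, matching the behavior described in the error-handling bullet.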
The anonymizer currently supports detection and redaction of the following PII types:
•Names and personal identifiers
•Social Security numbers
•Email addresses and phone numbers
•Medical and health information
•Financial details and government IDs
•Other PII types supported by Amazon Comprehend
2.3 Storage Management
Storage solutions comprise:
•Raw and anonymized documents in segregated S3 prefixes for logical organization
•Presigned URL Lambda functions generating secure, time-limited URLs for upload and
download
•Complete metadata, processing status, timestamps, and audit trails in Amazon
DynamoDB
•S3 and DynamoDB combination ensuring scalable, queryable persistence
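One way to keep such an audit trail is to write one DynamoDB item per status change. The sketch below assumes a hypothetical table keyed on documentId (partition key) and timestamp (sort key); these attribute names and the key schema are not taken from the paper. The table argument is expected to behave like a boto3 DynamoDB Table resource.

```python
from datetime import datetime, timezone


def build_audit_item(document_id, status, detail=None, now=None):
    """Build one audit record; attribute names are illustrative assumptions."""
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    item = {"documentId": document_id, "timestamp": timestamp, "status": status}
    if detail:
        item["detail"] = detail
    return item


def record_status(table, document_id, status, detail=None):
    """Append a status change for a document and return the stored item."""
    item = build_audit_item(document_id, status, detail)
    # Each call writes a new item, so earlier entries are never overwritten.
    table.put_item(Item=item)
    return item
```

Writing a new item per event, rather than updating a single row, keeps the history queryable by document ID in chronological order.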
2.3.1 Presigned URLs for Upload and Download
To facilitate secure and efficient exchange of documents, Lambda functions are employed to
generate presigned S3 URLs for both uploading and downloading files. This mechanism ensures
that clients can interact directly with S3 using temporary, limited-access URLs, removing the
need for permanent credentials and enforcing granular access control.
•Secure File Upload: When a user initiates a document upload, the Lambda function
validates the request, generates a unique document ID, and creates an S3 object key.
It responds with a presigned URL for put object (upload), which allows the client to
transfer the file directly to S3, with access expiring automatically (typically one hour).
•Secure File Download: For document retrieval, a similar Lambda function generates a
presigned URL for get object (download). This URL provides time-limited access to the
requested file, allowing authorized users to securely download processed or anonymized
documents without exposing public or persistent bucket permissions.
•API and Security Integration: Both upload and download presigned URLs are
returned via secure API responses, formatted for CORS compatibility and robust error
handling. Logging and status codes provide transparency and reliability.
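A minimal sketch of the two presigned URL helpers is shown below, using the generate_presigned_url method of a boto3 S3 client. The key layout raw/<document ID>/<filename>, the response field names, and the function names are assumptions; the one-hour expiry follows the description above. The S3 client is passed in as an argument so that the logic can be tested with a stand-in.

```python
import uuid
from pathlib import PurePosixPath

ONE_HOUR = 3600  # expiry in seconds, matching the "typically one hour" above


def create_upload_url(s3, bucket: str, filename: str, expires_in: int = ONE_HOUR) -> dict:
    """Generate a document ID, an object key, and a presigned PUT URL."""
    # Keep only the final path component so clients cannot choose the prefix.
    safe_name = PurePosixPath(filename.replace("\\", "/")).name
    if not safe_name:
        raise ValueError("filename is required")
    document_id = str(uuid.uuid4())
    key = f"raw/{document_id}/{safe_name}"  # hypothetical key layout
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
    return {"documentId": document_id, "key": key, "uploadUrl": url}


def create_download_url(s3, bucket: str, key: str, expires_in: int = ONE_HOUR) -> str:
    """Generate a presigned GET URL for a processed document."""
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )
```

In a Lambda function, s3 would typically be created once at module load with boto3.client("s3"), and the returned dictionary serialized into the API Gateway response.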
Figure 3: S3 Bucket Storing Anonymized Files
2.4 Security and Monitoring
2.4.1 Threat Detection
•Amazon GuardDuty malware protection automatically scanning S3 uploads in real time
•AWS WAF on CloudFront with OWASP Top 10 ruleset, SQL injection prevention, and
DDoS mitigation
2.4.2 Access Control and Encryption
•IAM role-based access control (RBAC) enforcing least-privilege permissions
•End-to-end encryption in transit (TLS/SSL) and at rest (S3 server-side encryption,
DynamoDB encryption)
2.4.3 Auditability and Compliance
•AWS CloudTrail providing comprehensive logging of all API calls
•Immutable audit trail for regulatory compliance and forensic investigation
2.4.4 Operational Visibility
•Amazon CloudWatch monitoring system performance and Lambda metrics
•Amazon SNS delivering alerts for critical events and incidents
3 Results
The system demonstrates successful document anonymization with the following characteristics:
3.1 Performance Metrics
•PII Detection Accuracy: 95%+ entity recognition (confidence score)
•Processing Time: 2–5 seconds for typical documents
•Successful Anonymization: 100% of detected entities
•System Availability: 99.9% uptime during testing
•False Positives: Less than 1%
3.2 Output Characteristics
•Anonymized entities replaced with placeholders
•Original document structure preserved
•Clear audit trail of redactions maintained
•Download-ready format provided
Figure 4: Confidence Score from AWS Comprehend
Figure 5: Input Document Provided to redpii.com
Figure 6: Output Anonymized Document from redpii.com