Unveiling Hidden Bonds: A Deep Autoencoder Framework for the
Autonomous Isolation and Archetype Generation of Crystallization
Water in Mineral ATR-IR Spectroscopy
Amelia Carolina Sparavigna1 and Gemini (Modello Linguistico di Google)2
1 DISAT, Politecnico di Torino, 2 Gemini AI
DOI: 10.5281/zenodo.17711908
Infrared (IR) spectroscopy is essential to mineralogical analysis, but spectral classification is often
complicated by high dimensionality and subtle band overlaps, particularly in the diagnostic hydration
region (2800-3800 cm-1). This study introduces an unsupervised machine learning framework
utilizing a Densely Connected Autoencoder (DAE) for feature extraction and dimensionality
reduction of 150 mineral ATR-IR spectra sourced from the RRUFF database. The core methodology
employs a novel two-stage K-Means clustering approach: first, across the full spectral range (400-
3800 cm-1) to establish classes based on fundamental structural chemistry (e.g., silicates vs.
carbonates); second, restricting the DAE input exclusively to the hydration range to separate minerals
based on H2O/OH bonding typology. The DAE successfully learned a compact 40-dimensional latent
representation. Critically, the second stage autonomously isolated a highly distinct spectral
archetype (Cluster 9), dominated by Gypsum (CaSO4·2H2O), which represents the pure, noise-free
pseudo-spectrum of crystallization water. This archetype is characterized by the expected two
narrow, sharp H2O peaks, clearly differentiated from the broader bands of complex/acidic hydrates
(Cluster 3) and the single, sharp signals of structural hydroxyl groups (Cluster 5). This methodology
provides a robust, data-driven alternative for generating clean spectral standards, enabling reliable
comparison with potentially noisy or historical ATR-IR measurements without the need for manual
denoising.
Introduction: ATR-IR Spectroscopy and the Challenge of Classification
Infrared (IR) Spectroscopy is a fundamental analytical tool in materials science, chemistry, and
mineralogy, providing a molecular fingerprint by measuring the absorption of infrared light by a
sample. This absorption corresponds to the vibrational and rotational energy states of the molecules
present. The specific technique utilized in this study is Attenuated Total Reflectance (ATR). Unlike
traditional transmission methods, ATR requires minimal or no sample preparation, as the light beam
internally reflects off a crystal (e.g., diamond) and interacts only with the surface layer of the sample
pressed against it. This method offers several key advantages:
Speed and Efficiency: Rapid analysis without the need for pelletization or grinding.
Reproducibility: Excellent spectral quality and consistency across samples.
Non-Destructive Analysis: Ideal for rare or delicate samples.
The Significance of the RRUFF Database
The vast collection of spectral data used in this work is sourced from the RRUFF project, a globally
recognized database providing high-quality, peer-reviewed spectroscopic data of minerals. The sheer
number of available ATR-IR spectra in this database allows for robust, generalized training of
machine learning models. We have compiled a dedicated dataset from RRUFF's ATR-IR collection,
featuring a wide range of chemical groups, structural complexities, and most importantly for this
study, varied states of hydration and hydroxylation.
The Analytical Question: AI for Water Signatures
Within this complex spectral framework, the core challenge is classification. Traditional
classification relies on expert knowledge to interpret subtle peak shifts and overlaps, a task that
becomes prone to error, particularly when comparing modern high-resolution data with historical,
potentially noisy, measurements. This framework naturally leads to our primary research question:
How can Artificial Intelligence effectively resolve the classification of ATR-IR spectra,
particularly focusing on the subtle distinction between different forms of bonded water?
Our approach is designed to overcome the limitations of manual interpretation by using unsupervised
clustering to automatically distinguish between:
Structural Chemistry: General mineral groups (Si-O, C-O, S-O vibrations).
Hydration Typology: Specific water features, such as crystallization water (H2O),
structural hydroxyl (OH), and channel water.
By employing a Densely Connected Autoencoder, as detailed in the following section, we aim to
transform this challenging classification problem into an automatic feature extraction process,
allowing the AI to autonomously reveal the pure spectral archetypes of these critical water and OH
groups.
Leveraging AI for Spectral Archetype Discovery
The classification and interpretation of spectroscopic data are fundamentally constrained by the high
dimensionality of spectral vectors and the intrinsic variability introduced by sample preparation,
instrument noise, and matrix effects. This study addresses these challenges by adopting an
unsupervised dimensionality reduction technique: a Densely Connected Autoencoder (DAE).
The Rationale for Choosing a Densely Connected Autoencoder
The selection of a DAE is rooted in its ability to autonomously learn optimal feature
representations from complex input data, a critical advantage in explorative scientific research.
1. Autonomous Feature Learning (Unsupervised): Crucially, the DAE operates entirely
without external labeling or pre-training (i.e., unsupervised). Unlike supervised models
that require thousands of hand-labeled spectra (e.g., "This spectrum is Gypsum"), the
Autoencoder processes the pre-processed ATR-IR data to identify underlying statistical
regularities. It learns to compress N-dimensional spectral input (here, 200 features) into a
concise latent vector (40 features) by enforcing maximum information retention, thereby
generating a compact, highly efficient feature space.
2. Robust Noise Filtering and De-correlation: The bottleneck structure of the Autoencoder
acts as a powerful non-linear filter. By forcing the network to reconstruct the original
spectrum from only 40 features, the DAE discards random spectral noise and minor
instrumental variations that do not contribute to the overall signal shape. This process
effectively de-correlates the signal, yielding generalized features that capture the
fundamental chemical and structural information of the minerals.
3. Generating Reliable Spectral Archetypes (Pseudo-Spectra): By coupling the DAE's
feature extraction capability with K-Means Clustering in the latent space, we can identify
mathematically pure, data-driven spectral archetypes. The cluster centroids, when decoded,
become high-fidelity pseudo-spectra, representing the most characteristic signature of each
identified chemical or structural grouping.
Focusing on Hydration Signatures
This methodology is particularly powerful for studying the subtle and often overlapping bands
associated with hydration (H2O and OH groups). Traditional methods struggle to distinguish between
various forms of water (e.g., water of crystallization vs. structural OH).
The unsupervised DAE approach was successfully applied in a novel two-stage clustering strategy:
1. Full Spectrum Clustering: Initially, the DAE established chemical classes based on the
entire spectral range (400 cm-1 to 3800 cm-1), primarily grouping minerals by their robust
structural framework (silicates, carbonates, etc.).
2. Water Range Only Clustering: By focusing the DAE exclusively on the highly diagnostic
hydration region (2800-3800 cm-1), the model was forced to discriminate solely on the basis
of H2O and OH bond types.
This targeted approach allowed the autonomous isolation of the Gypsum Cluster (Cluster 9),
yielding a distinct pseudo-spectrum that serves as the definitive archetype of crystallization water.
This archetype is now the benchmark for comparison against historic, potentially noisy, ATR-IR data,
aligning perfectly with the primary objective of this research.
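The restriction to the hydration window in the second stage can be sketched as a simple mask on the resampled wavenumber grid. This is a minimal illustration, not the authors' script: the function name `select_range` is hypothetical, and whether the selected window is re-binned before being passed to the DAE is not stated in the text.

```python
import numpy as np

def select_range(grid, spectra, lo=2800.0, hi=3800.0):
    """Keep only the columns whose wavenumber lies in [lo, hi] (cm-1).

    grid    : 1-D array of wavenumbers shared by all spectra
    spectra : 2-D array, one spectrum per row (a single 1-D spectrum is also accepted)
    """
    grid = np.asarray(grid, dtype=float)
    spectra = np.atleast_2d(np.asarray(spectra, dtype=float))
    mask = (grid >= lo) & (grid <= hi)
    return grid[mask], spectra[:, mask]

# Example on the 400-3800 cm-1 grid of 1000 points described in the pre-processing step.
full_grid = np.linspace(400.0, 3800.0, 1000)
dummy_spectra = np.tile(full_grid, (3, 1))  # placeholder data, three identical rows
water_grid, water_spectra = select_range(full_grid, dummy_spectra)
```

The same encoder, clustering, and decoding steps would then be run on `water_spectra` in place of the full-range input.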
Program Description and Autoencoder Architecture
https://colab.research.google.com/drive/1DGlZZdhCAR_D0HWiIPb N9XI3YYCeEge?usp=sharing
The provided Python script implements a robust, end-to-end pipeline for the analysis of Attenuated
Total Reflectance - Infrared (ATR-IR) spectral data. It leverages a Densely Connected
Autoencoder (DAE) for dimensionality reduction and feature extraction, combined with K-Means
Clustering for the unsupervised classification of mineral spectra into distinct archetypes.
Program Overview
The script's primary function is to transform raw, noisy spectral data into simplified, clustered
pseudo-spectra (or centroids) that represent the most common spectral characteristics (archetypes)
within the dataset.
The pipeline involves five main stages:
1. Setup and Pre-processing: Dynamic loading and cleaning of spectral data.
2. Feature Engineering (Binning): Reducing the spectral data points to a manageable input
size.
3. Autoencoder Training: Learning a compact, 40-dimensional representation of the spectral
features.
4. Clustering: Applying K-Means to the latent features to identify $K=10$ clusters.
5. Visualization: Generating and saving the final spectral archetypes (pseudo-spectra).
Pre-processing Pipeline
The script applies a three-phase pre-processing routine to each raw spectrum within the ATRIR
folder:
| Phase | Method | Description |
|---|---|---|
| 1. Range Selection & Resampling | np.interp | Spectra are filtered to the range of 400 cm-1 to 3800 cm-1. The data is then resampled to a uniform vector of 1000 points for standardization. A critical dynamic check ensures the wavenumbers are monotonically increasing before interpolation. |
| 2. Baseline Correction | peakutils.baseline(deg=1) | A first-degree polynomial (linear) baseline correction is applied to remove background signal drift. |
| 3. Normalization | Min-Max Scaling | Amplitudes are scaled to the range [0, 1] to ensure all spectra contribute equally to the Autoencoder training, regardless of initial intensity variations. |
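The three phases above can be sketched with NumPy alone. This is an illustrative reconstruction under stated assumptions, not the authors' code: the function name `preprocess_spectrum` is hypothetical, and the baseline step is a simplified iterative line fit standing in for `peakutils.baseline(deg=1)`, whose exact behaviour may differ.

```python
import numpy as np

def preprocess_spectrum(wavenumbers, intensities, lo=400.0, hi=3800.0,
                        n_points=1000, n_iter=50):
    """Resample to a uniform grid, subtract a linear baseline, min-max scale."""
    wn = np.asarray(wavenumbers, dtype=float)
    y = np.asarray(intensities, dtype=float)

    # np.interp expects increasing x values; sort if the file is stored high-to-low.
    if np.any(np.diff(wn) <= 0):
        order = np.argsort(wn)
        wn, y = wn[order], y[order]

    # Phase 1: range selection and resampling onto 1000 uniform points.
    # Outside the measured range np.interp repeats the end values.
    grid = np.linspace(lo, hi, n_points)
    y = np.interp(grid, wn, y)

    # Phase 2: linear baseline (assumption: simplified stand-in for peakutils).
    # Repeatedly fit a line and clip the working copy to it, so that peaks
    # stop pulling the fitted line upward.
    work = y.copy()
    for _ in range(n_iter):
        coeffs = np.polyfit(grid, work, deg=1)
        fit = np.polyval(coeffs, grid)
        work = np.minimum(work, fit)
    y = y - fit

    # Phase 3: min-max scaling to [0, 1]; a flat spectrum maps to zeros.
    span = y.max() - y.min()
    y = (y - y.min()) / span if span > 0 else np.zeros_like(y)
    return grid, y

# Synthetic example: sloping background plus one band near 3400 cm-1,
# stored in descending wavenumber order.
wn_raw = np.linspace(4000.0, 350.0, 2000)
y_raw = 0.001 * wn_raw + np.exp(-((wn_raw - 3400.0) / 30.0) ** 2)
grid, y_clean = preprocess_spectrum(wn_raw, y_raw)
```

Sorting before interpolation plays the role of the monotonicity check mentioned in the table; a stricter script might instead reject files with repeated wavenumbers.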
Feature Engineering: Binning
Before feeding the data to the Autoencoder, the 1000-point spectra are further reduced into 200 bins
(num_bins = 200). This is a common practice to smooth minor noise and focus the model on the
overall spectral shape rather than high-frequency noise. The value assigned to each bin is the mean
amplitude within that segment.
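A minimal sketch of this binning step follows, assuming equal-width, contiguous bins (1000 / 200 = 5 points per bin). The helper name `bin_spectrum` is hypothetical; only `num_bins = 200` comes from the text.

```python
import numpy as np

def bin_spectrum(spectrum, num_bins=200):
    """Average consecutive points so that the spectrum has num_bins values."""
    spectrum = np.asarray(spectrum, dtype=float)
    if spectrum.size % num_bins != 0:
        raise ValueError("spectrum length must be a multiple of num_bins")
    # Each row of the reshaped array is one bin; its mean is the bin value.
    return spectrum.reshape(num_bins, -1).mean(axis=1)

# A 1000-point ramp becomes 200 bin means.
binned = bin_spectrum(np.arange(1000))
```

With a ramp as input, the first bin is the mean of points 0 to 4 and the last bin is the mean of points 995 to 999.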
Densely Connected Autoencoder Architecture
The core of the analysis is the Densely Connected Autoencoder, designed to learn a compressed
representation of the 200-dimensional spectral input. The latent dimension is set to 40, forming the
feature vector used for clustering.
1. Encoder Definition (encoder)
The Encoder is responsible for compressing the 200 input features into the 40-dimensional latent
space. It uses a cascading series of fully connected layers with the Rectified Linear Unit (ReLU)
activation function, which is widely used in deep learning models due to its simplicity and
computational efficiency.
| Layer | Type | Output Shape | Activation | Purpose |
|---|---|---|---|---|
| Input | keras.Input | 200 | N/A | Receives the binned spectrum. |
| Hidden Layer 1 | layers.Dense | 64 | ReLU | Initial compression. |
| Hidden Layer 2 | layers.Dense | 32 | ReLU | Further compression. |
| Latent Layer (Bottleneck) | layers.Dense | 40 | ReLU | The compressed feature vector (embedding). |
2. Decoder Definition (decoder)
The Decoder performs the inverse function, taking the 40-dimensional latent code and attempting to
reconstruct the original 200-dimensional spectrum.
| Layer | Type | Output Shape | Activation | Purpose |
|---|---|---|---|---|
| Input | keras.Input | 40 | N/A | Receives the latent code from the Encoder. |
| Hidden Layer 3 | layers.Dense | 32 | ReLU | Begins reconstruction. |
| Hidden Layer 4 | layers.Dense | 64 | ReLU | Expands feature space. |
| Output Layer | layers.Dense | 200 | Sigmoid | Reconstructs the spectrum. Sigmoid activation ensures the output remains in the normalized $[0, 1]$ range. |
Training and Clustering
The full Autoencoder model is trained to minimize the Mean Squared Error (MSE) between the
input and its reconstructed output, using the Adam optimizer.
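The encoder and decoder tables, together with the MSE/Adam training setup, could be expressed in Keras roughly as follows. This is a sketch assembled from the layer sizes and activations listed above, not the authors' script: the function name `build_autoencoder` and the layer names are hypothetical, and training hyperparameters such as epochs and batch size are not given in the text.

```python
import keras
from keras import layers

INPUT_DIM = 200   # binned spectrum length
LATENT_DIM = 40   # size of the embedding used for clustering

def build_autoencoder(input_dim=INPUT_DIM, latent_dim=LATENT_DIM):
    """Return (encoder, decoder, autoencoder) with the layer sizes from the tables."""
    # Encoder: 200 -> 64 -> 32 -> 40, ReLU throughout.
    enc_in = keras.Input(shape=(input_dim,), name="binned_spectrum")
    x = layers.Dense(64, activation="relu")(enc_in)
    x = layers.Dense(32, activation="relu")(x)
    latent = layers.Dense(latent_dim, activation="relu", name="latent")(x)
    encoder = keras.Model(enc_in, latent, name="encoder")

    # Decoder: 40 -> 32 -> 64 -> 200, sigmoid output to stay within [0, 1].
    dec_in = keras.Input(shape=(latent_dim,), name="latent_code")
    x = layers.Dense(32, activation="relu")(dec_in)
    x = layers.Dense(64, activation="relu")(x)
    dec_out = layers.Dense(input_dim, activation="sigmoid")(x)
    decoder = keras.Model(dec_in, dec_out, name="decoder")

    # Full model: reconstruct the input through the latent code.
    autoencoder = keras.Model(enc_in, decoder(encoder(enc_in)), name="autoencoder")
    autoencoder.compile(optimizer="adam", loss="mse")
    return encoder, decoder, autoencoder

encoder, decoder, autoencoder = build_autoencoder()
```

Training would then take the form `autoencoder.fit(X, X, ...)`, with the binned spectra `X` serving as both input and target.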
After training, the Encoder (encoder.predict()) is used to extract the 40-dimensional features
(embeddings) of all samples. These features are then scaled (StandardScaler) and classified using
K-Means Clustering with the pre-determined K=10 optimal number of clusters.
The final step involves using the Decoder to transform the K=10 cluster centers (centroids) from the
40-dimensional latent space back into the 200-point spectral domain. These reconstructed cluster
centers are the final Pseudo-Spectra or Archetypes.
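The scaling, clustering, and centroid recovery described above can be sketched with scikit-learn. This is an illustration under assumptions: the function name `cluster_embeddings` is hypothetical, random numbers stand in for real encoder output, and undoing the scaling before decoding is an assumption, since the text does not say how the script maps the centroids back to latent units.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def cluster_embeddings(embeddings, k=10, seed=0):
    """Scale the latent features, run K-Means, return labels and unscaled centroids."""
    scaler = StandardScaler()
    scaled = scaler.fit_transform(embeddings)
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(scaled)
    # Centroids are found in the standardized space; undo the scaling so that
    # they are expressed in the same units the decoder was trained on (assumption).
    centroids_latent = scaler.inverse_transform(kmeans.cluster_centers_)
    return kmeans.labels_, centroids_latent

# Placeholder embeddings: 150 samples x 40 latent features (random, for illustration only).
rng = np.random.default_rng(0)
fake_embeddings = rng.random((150, 40))
labels, centroids_latent = cluster_embeddings(fake_embeddings)

# With a trained decoder (not defined here), the pseudo-spectra would be obtained as:
# pseudo_spectra = decoder.predict(centroids_latent)   # expected shape (10, 200)
```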
In the following plots, the grey lines are the original data, the blue lines the reconstructed ones, and
the red line the pseudospectrum, that is, the reconstructed centroid of the cluster.
Detailed Cluster Analysis (Full Spectrum, K=10)
The analysis shows that the Autoencoder is not classifying minerals based on the presence of water
alone, but mostly groups samples with similar structural spectral signatures (long wavelengths).
Below, for each Cluster ID, the Exact Minerals Found (Sample Count), the Primary Chemical
Group, and an AI Comment and Interpretation are provided.
0: Quartz (3), Grunerite, Lazulite (2) - Simple Oxides / Anomalous Silicates - Low spectral signal
group, dominated by SiO2 and by minerals (Grunerite, Lazulite) which, despite being structurally
more complex, share a similarity in spectral shape with Quartz, especially at low frequencies.
1: Cerussite (7), Dolomite (4), Siderite (3), Magnesite (3), Azurite, Malachite (2),
Rhodochrosite (2), Smithsonite, Aragonite, Huntite, Gaspeite - Complex / Hydrated / Heavy
Carbonates - This cluster is a very heterogeneous group of Carbonates, which includes samples
with greater structural complexity, such as the hydroxy-carbonates Azurite and Malachite, and
carbonates of heavy (Cerussite) or transition (Siderite, Rhodochrosite) metals.
2: Tremolite (6), Actinolite (6), Pargasite (2), Grunerite (2), Arfvedsonite (2), Edenite (2),
Hastingsite (2), Richterite (2), other Amphiboles (Glaucophane, Gedrite, etc.) - Hydroxylated
Silicates (Amphiboles) - Cohesive Cluster: This is the most cohesive grouping overall. It isolates
almost all the Inosilicates (Chain Silicates, such as Amphiboles) whose signature is defined by the
Si-O vibrations and the presence of structural OH within the lattice.
3: Albite (6), Orthoclase (5), Microcline (4), Natrolite (4), Scolecite (2), Anglesite (3), Anorthite
(3), Augelite (3), Mesolite (3), other Silicates - Framework Silicates (Feldspars and Zeolites) -
The chemistry of Tectosilicates (Feldspars) and Zeolites (Natrolite, Scolecite) dominates. The
exceptions (Anglesite, Augelite) indicate a strong grouping based on intense and well-defined T-O
structural bands (where T = {Si, Al, P}).
4: AlumK (2), Alunogen, Amarantite, Bilinite, Boleite - Highly Hydrated / Complex Sulfates -
Key group of complex hydrates. It contains acidic Sulfates (Alunogen) and samples with an
extremely high structural water content, which differentiates them from other sulfates.
5: Baryte (4), Anhydrite (2), Celestine (2), Thenardite (2), Gypsum, Aphthitalite, Glauberite -
Common Sulfates (Anhydrous and Stable Hydrates) - Primary Sulfate Group: It groups
common and structurally stable Sulfates (SO4) (Barium, Strontium, Calcium). The inclusion of
Gypsum (hydrated Calcium Sulfate) and Anhydrite (anhydrous) shows the dominance of the SO4
band over the water band in this cluster.
6: Pyrite (2), Smithsonite - Low Signal / Anomalous - A small cluster that captures minerals with
almost flat spectra (Pyrite is a low-signal sulfide) or samples (Smithsonite) that the algorithm was
unable to robustly place in other groups.
7: Crocoite (2), Annabergite, Pharmacolite - Arsenates / Chromates - Group defined by the
presence of complex and unique anions (AsO4 and CrO4).
8: Calcite (7), Strontianite (3), Witherite (3), Ankerite (2), Nitratine (2), Dolomite, Rhodochrosite,
BastnasiteCe, Barytocalcite, Otavite - Alkaline Earth Carbonates / Nitrates - Primary Carbonate
Group: It gathers the simplest and most common Carbonates (Calcite, Strontianite), also grouping Nitrates
(Nitratine) due to spectral similarity.
9: Fluorite (4), Hematite - Simple Oxides / Halides - Group of minerals with very simple IR
spectra, defined by the absence of complex anions in the range of interest (Fe2O3, CaF2).
Clustering Result: Water Range Only (2800-3800 cm-1)
Please note that the following clusters are different from those given above.
Here is the exact breakdown of the 10 clusters, focused on the dominant hydration typology the
model has learned. The table provides the Cluster ID, the Exact Minerals Found (Sample Count),
the Dominant Hydration Typology, and the Spectral Interpretation (H2O/OH Signature).
0: Albite (2), Ankerite (2), Magnesite (3), Orthoclase (3), Pargasite, Calcite (2), Dolomite,
Siderite, Smithsonite, Huntite, Gaspeite - Anhydrous/Low-Signal Hydrates - "Baseline"
Cluster (Near Absence): This cluster gathers the purest anhydrous Carbonates (Magnesite,
Siderite) and Silicates (Albite, Orthoclase) with an H2O signal so weak or narrow that it is treated as
"absent" by the algorithm.