
KAI-95M: Benchmarking An Efficient Transformer Alternative

Author: Koerber, Kai Stone; Mao, Xuanlin
Publisher: Zenodo
DOI: 10.5281/zenodo.17677118
Source: https://zenodo.org/records/17677118/files/main.pdf
Abstract
We introduce KAI-95M, a highly capable 95M parameter language model that leverages a hybrid closed-form
continuous time (CfC) and mLSTM model architecture. On several benchmarks it surpasses the performance of
an LLM ∼16x its size (GPT-2 1.5B) at greatly reduced computational cost. KAI-95M has been designed to resolve
the memory limitations of previous CfC architectures via the introduction of mLSTM memory units
and a gating mechanism for memory unit regulation. This allows KAI-95M to meet the challenge of
on-prem and on-device language model deployment without the need for GPUs to run inference-based
tasks. It saves money and compute, and provides users with a model that has demonstrated its ability
to accomplish specific language modeling tasks with a parameter efficiency that is, on average, ∼22x
greater than the models it is benchmarked against herein.
Designed to ensure seamless fine-tuning across a plethora of tasks, KAI-95M represents an exciting new
chapter in the era of specialized small language models (SLMs). Please contact us for research validation
and other technical inquiries.
1 Introduction
Today, the world of A.I. language models meant for corporate use consists primarily of transformer-based
large language models (LLMs) that take a generalist approach to providing thorough answers to a broad
array of queries. While this has led to many advances in natural language understanding and search, it has also
facilitated the popularization of architectures that incur immense computational expense during training
and inference. Globally, these transformer-based large language models have been shown to demand vast
amounts of power and water to satisfy their energy and cooling requirements, and are at present projected to
reach an annual water withdrawal equivalent to that of Denmark or half that of the United Kingdom
by 2027 [11, 14]. GPT-4o in particular requires energy equivalent to 35,000 U.S. homes, and is responsible
for freshwater evaporation equivalent to the annual drinking needs of 1.2 million people per year just to meet
its needs for daily operations [11]. The majority of AI-enabled companies today are reliant on LLMs that
are too large to deploy on-prem or on-device without the end user or company having to acquire massive
GPU infrastructure. This GPU infrastructure investment, in addition to the ongoing energy, cooling,
and maintenance costs of an on-prem LLM, has been shown to be so large that it often takes several
years to break even after on-prem deployment is achieved [16]. This highlights the economic advantage of
smaller parameter-efficient models like KAI-95M, which achieves superior benchmark performance to the
larger models discussed herein on several tasks while dramatically reducing hardware, energy, cooling, and
maintenance requirements.
In recent years, the need for compact domain expert language models that can run on corporations' existing
legacy hardware has become increasingly apparent. A driving force behind this is that many organizations
lack the infrastructure to deploy and maintain large language models on-prem, and that retraining these
large models from scratch as they become outdated is prohibitively expensive [13]. Nevertheless, the fact
remains that many enterprises want to deploy their own custom language modeling tools that streamline
operations and enhance productivity and sales without having to store sensitive internal data on third party
cloud services. They, quite simply, want to secure their strategic edge: their data. A resource so precious, it
has singlehandedly become the thing that stands to define the commerce of the 21st century.
To address this growing demand, we introduce KAI-95M, a compact 95M parameter language model designed
for sustainable, on-prem deployment. Comprised of a hybrid closed-form continuous time [8] and mLSTM [2]
architecture, KAI-95M has demonstrated superior computational efficiency, significantly accelerated
training and inference, strong sequence modeling performance, and a throughput that is immensely greater
than a transformer's in its parameter range. In addition to this, its high parameter efficiency contributes to a
greatly reduced memory footprint, which is highly relevant for real-time or on-device applications.
Despite its size, KAI-95M achieves superior performance to GPT-2 (1.5B) [17] on benchmarks such as
OpenBookQA [15], ARC-Challenge [6], CommonsenseQA [19], and BoolQ [5], and achieves near parity with
GPT-2 (1.5B) on MMLU [9] while operating at a dramatically reduced computational and financial cost. Our
model represents a shift towards highly efficient, data-sovereign A.I. that prioritizes adaptability, accessibility
and sustainability over brute force scale.
2 Model Architecture
Our model is primarily built upon Closed-form Continuous-time Neural Networks (CfC) [8] and incorporates
architectural refinements through the integration of mLSTM units from the xLSTM [2] model, ultimately
yielding a novel lightweight language model architecture tailored for language generation.
The CfC introduces an approximate closed-form solution to continuous neural networks with explicit
time modeling, formalized into a new class of neural architectures that significantly accelerate training
and inference. By replacing traditional recurrent updates with closed-form solutions to continuous-time
dynamics, the CfC retains the rich modeling capabilities of ODE-based approaches while achieving superior
computational efficiency, stable training, and strong sequence modeling performance, making it well-suited
for long-context reasoning tasks.
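To make the closed-form idea concrete, the following is a minimal sketch of the CfC state expression as published in the cited CfC paper [8], x(t) = σ(−f·t) ⊙ g + (1 − σ(−f·t)) ⊙ h. It is not KAI-95M's implementation: the function name `cfc_state` and the use of plain Python lists for the outputs of the three network heads f, g, and h are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def cfc_state(f, g, h, t):
    """Closed-form CfC state: sigma(-f*t) * g + (1 - sigma(-f*t)) * h, elementwise.

    f, g, h are the per-unit outputs of three (hypothetical) network heads
    evaluated on the current input and state; t is the elapsed time.
    """
    out = []
    for f_i, g_i, h_i in zip(f, g, h):
        gate = sigmoid(-f_i * t)  # time-dependent gate between g and h
        out.append(gate * g_i + (1.0 - gate) * h_i)
    return out

# Two hidden units evaluated directly at t = 1.5, with no ODE solver steps.
state = cfc_state(f=[0.5, 2.0], g=[1.0, -1.0], h=[0.0, 3.0], t=1.5)
```

The state at any time t is evaluated directly rather than by stepping a numerical solver, which is the source of the speedup described above. At t = 0 the gate equals 0.5, so the state is the average of g and h; for positive f and large t it approaches h.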
The xLSTM, on the other hand, extends the classical LSTM architecture by introducing units such as the
mLSTM and sLSTM, and organizes these units into deeper network structures. These extensions substantially
enhance the expressive power of the model, enabling more flexible representation learning and improved
performance in language modeling tasks.
The original CfC model integrates LSTM units. However, we observed that its representational capacity
remains limited and cannot be directly applied to large language models. To address this limitation, we
enhanced its internal memory mechanism by introducing larger memory units and more sophisticated
architectural designs, thereby improving CfC's expressiveness and long-range memory capabilities. The
mLSTM units from the xLSTM framework fulfill this requirement: inspired by the attention layers in the
classical Transformer architecture, mLSTM extends the vector-based memory of LSTM into a matrix-based
memory and incorporates optimizations specifically designed for large-scale training. Consequently, we
replace the LSTM units in CfC with mLSTM units.
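The matrix-based memory can be illustrated with a short sketch that follows the mLSTM update equations of the cited xLSTM paper [2]: the memory C accumulates outer products of values and keys, and a query reads it back with a normalizer. This is a simplified illustration under stated assumptions, not the KAI-95M code: the gates are passed in as plain scalars, and key scaling, exponential-gate stabilization, and the output gate are omitted. The function name `mlstm_step` is hypothetical.

```python
def mlstm_step(C, n, q, k, v, i_gate, f_gate):
    """One mLSTM memory update with scalar input and forget gates.

    C: d x d matrix memory (list of rows); n: length-d normalizer vector.
    q, k, v: query, key, and value vectors of length d.
    Returns the updated (C, n) and the retrieved vector before output gating.
    """
    d = len(k)
    # Matrix memory: C_t = f * C_{t-1} + i * v k^T (outer product of value and key)
    C_new = [[f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d)]
             for r in range(d)]
    # Normalizer: n_t = f * n_{t-1} + i * k
    n_new = [f_gate * n[c] + i_gate * k[c] for c in range(d)]
    # Retrieval: C_t q / max(|n_t . q|, 1)
    denom = max(abs(sum(n_new[c] * q[c] for c in range(d))), 1.0)
    h_tilde = [sum(C_new[r][c] * q[c] for c in range(d)) / denom
               for r in range(d)]
    return C_new, n_new, h_tilde

d = 2
C0 = [[0.0] * d for _ in range(d)]
n0 = [0.0] * d
# Store the value [3, 4] under the key [1, 0], then query with the same key.
C1, n1, h1 = mlstm_step(C0, n0, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0],
                        i_gate=1.0, f_gate=1.0)
```

Querying with the stored key returns the stored value, while an orthogonal query returns zeros. A classical LSTM cell keeps one scalar per unit, whereas this memory holds d × d entries, which is the capacity increase the paragraph above refers to.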
In addition, to accommodate the computation of mLSTM units and to mitigate the gradient vanishing and
exploding issues commonly associated with recurrent neural network architectures, we modified the overall
model architecture and computational flow, and introduced a gating mechanism to regulate the memory units.
Inspired by the Transformer encoder [20] architecture, we encapsulate our model's design into modular
blocks by incorporating residual connections and layer normalization, which enables the construction of
deeper networks.
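The modular block structure can be sketched as follows. The exact block layout of KAI-95M is not given here, so this sketch assumes the post-norm arrangement of the original Transformer encoder, LayerNorm(x + sublayer(x)), without learnable gain or bias. The names `residual_block`, `stack`, and `toy_mixer` are hypothetical, and `toy_mixer` is only a stand-in for the CfC-mLSTM sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def residual_block(x, sublayer):
    """Post-norm residual block: LayerNorm(x + sublayer(x))."""
    y = sublayer(x)
    return layer_norm([xi + yi for xi, yi in zip(x, y)])

def stack(x, sublayers):
    """Apply a sequence of residual blocks, one per sublayer."""
    for sublayer in sublayers:
        x = residual_block(x, sublayer)
    return x

def toy_mixer(x):
    """Placeholder sublayer; a real block would run the CfC-mLSTM unit here."""
    return [0.1 * xi for xi in x]

out = stack([1.0, 2.0, 3.0, 4.0], [toy_mixer] * 3)
```

The skip path lets the input of each block pass through unchanged alongside the sublayer output, and the normalization keeps activations on a fixed scale from block to block, which is what makes stacking many such blocks practical.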
3 Results
We compare KAI-95M's performance to GPT-2 (117M/800M/1.5B); all benchmarks were calculated via our
own evaluation pipeline. We measured performance across the following tasks:
• Commonsense Reasoning (0-shot): HellaSwag [21], WinoGrande [18], PIQA [3], OpenBookQA [15],
  ARC-Easy, ARC-Challenge [6], CommonsenseQA [19]
• Reading Comprehension (0-shot): BoolQ [5]
• Cross-Domain Language Understanding (5-shot): MMLU [9]
Our model was pretrained exclusively on a single encyclopedic text corpus (∼4.66B tokens) and has not
yet undergone fine-tuning for complex QA tasks. Accordingly, we chose not to benchmark on specialized
datasets such as GSM8K [7], MATH [10], HumanEval [4], or MBPP [1], as our results on these tasks would
not fairly represent the model's potential at this stage. Nevertheless, the fact that KAI-95M, with only 95M
parameters, outperforms GPT-2 (117M/800M/1.5B) on four of the evaluated benchmarks (ARC-Challenge,
BoolQ, CommonsenseQA, and OpenBookQA) and achieves near parity with GPT-2-XL (1.5B) on MMLU
highlights its computational and parameter efficiency. This suggests that, with broader training and
fine-tuning, KAI-95M could deliver performance competitive with larger models at significantly lower cost.
Model               Modality    MMLU   ARC-C  ARC-E  BoolQ  CSQA   HellaSwag  OBQA   PIQA   WinoGrande
GPT-2 (117M)        Pretrained  22.9%  19.0%  43.8%  48.7%  19.6%  28.9%      16.4%  62.9%  51.6%
GPT-2-Large (800M)  Pretrained  23.2%  21.7%  53.2%  60.5%  19.9%  36.4%      19.4%  70.3%  55.3%
GPT-2-XL (1.5B)     Pretrained  25.2%  25.0%  58.3%  61.8%  19.6%  40.0%      22.4%  70.8%  58.3%
KAI-95M             Pretrained  25.1%  25.8%  26.6%  62.0%  21.0%  24.0%      27.2%  47.9%  48.3%
Table 1: Comparison of KAI-95M with GPT-2 (117M/800M/1.5B). MMLU results are added as a general
knowledge benchmark. KAI-95M nearly matches GPT-2-XL (1.5B) performance on MMLU despite being
∼16× smaller, and also surpasses GPT-2 (117M/800M/1.5B) across multiple commonsense and factual
tasks.
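The comparison in Table 1 can be checked mechanically. The sketch below copies the accuracies from the table and lists the benchmarks on which KAI-95M scores strictly higher than every GPT-2 variant; the helper name `wins_over_all` is hypothetical and is not part of the authors' evaluation pipeline.

```python
BENCHMARKS = ["MMLU", "ARC-C", "ARC-E", "BoolQ", "CSQA",
              "HellaSwag", "OBQA", "PIQA", "WinoGrande"]

# Accuracies (%) copied from Table 1, in the column order above.
SCORES = {
    "GPT-2 (117M)":       [22.9, 19.0, 43.8, 48.7, 19.6, 28.9, 16.4, 62.9, 51.6],
    "GPT-2-Large (800M)": [23.2, 21.7, 53.2, 60.5, 19.9, 36.4, 19.4, 70.3, 55.3],
    "GPT-2-XL (1.5B)":    [25.2, 25.0, 58.3, 61.8, 19.6, 40.0, 22.4, 70.8, 58.3],
    "KAI-95M":            [25.1, 25.8, 26.6, 62.0, 21.0, 24.0, 27.2, 47.9, 48.3],
}

def wins_over_all(model, scores, names):
    """Benchmarks where `model` strictly beats every other model in `scores`."""
    others = [m for m in scores if m != model]
    return [name for j, name in enumerate(names)
            if all(scores[model][j] > scores[o][j] for o in others)]

kai_wins = wins_over_all("KAI-95M", SCORES, BENCHMARKS)
# MMLU gap to GPT-2-XL in percentage points
mmlu_gap = round(SCORES["GPT-2-XL (1.5B)"][0] - SCORES["KAI-95M"][0], 1)
```

With the table's numbers, `kai_wins` contains ARC-C, BoolQ, CSQA, and OBQA, and the MMLU gap to GPT-2-XL is 0.1 percentage points. On ARC-E, HellaSwag, PIQA, and WinoGrande the GPT-2 variants score higher.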
Size and Efficiency: The "equivalent model sizes" of the GPT-2 and/or Mistral [12] models were computed
in order to gain greater understanding of just how significant the efficiency gains of KAI-95M's hybrid
CfC-mLSTM architecture are (see Figure 2). When evaluated across the tasks OpenBookQA, ARC-Challenge,
CommonsenseQA, BoolQ, and MMLU, we found an overall average parameter efficiency gain of 21.8x
across these benchmark tasks. That is, the GPT-2 or Mistral model size estimated to match KAI-95M's
accuracy on these tasks is, on average, 21.8x larger than KAI-95M's 95M parameters.
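The 21.8x average can be reproduced from the per-task figures reported in the panels of Figure 2. The sketch below averages the reported gains and also recomputes each gain as the ratio of the reported equivalent model size to 0.095B parameters. The variable names are illustrative, and how the equivalent sizes themselves were estimated is not specified here, so they are taken as given.

```python
KAI_PARAMS_B = 0.095  # KAI-95M size in billions of parameters

# Per-task values as reported in the Figure 2 panels:
# (equivalent GPT-2 or Mistral model size in billions, reported efficiency gain)
REPORTED = {
    "MMLU":       (1.45, 15.2),
    "OpenBookQA": (3.99, 42.0),
    "ARC-C":      (1.68, 17.6),
    "BoolQ":      (1.56, 16.4),
    "CSQA":       (1.71, 18.0),
}

# Average of the reported per-task gains
mean_gain = sum(gain for _, gain in REPORTED.values()) / len(REPORTED)

# Gain recomputed as equivalent size / KAI-95M size, for comparison
recomputed = {task: size / KAI_PARAMS_B for task, (size, _) in REPORTED.items()}
max_diff = max(abs(recomputed[t] - REPORTED[t][1]) for t in REPORTED)
```

The mean of the five reported gains is 21.84, which rounds to the 21.8x quoted above. The recomputed ratios agree with the reported gains to within 0.1, the small differences being consistent with rounding of the published equivalent sizes. The OpenBookQA gain (42.0x) is much larger than the other four and pulls the average up.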
[Figure 1 plot: accuracy (%) by task (CSQA, ARC-C, MMLU, OpenBookQA, BoolQ) for KAI-95M, GPT-2 (117M), GPT-2-Large (~800M), and GPT-2-XL (~1.5B)]
Figure 1: Benchmark task accuracy comparison on CommonsenseQA (CSQA), ARC-Challenge (ARC-C),
MMLU, OpenBookQA, and BoolQ for KAI-95M and GPT-2 (117M/800M/1.5B). Our 95M-parameter
model outperforms all GPT-2 variants (up to 1.5B parameters) on multiple reasoning and comprehension
tasks, and achieves near parity on MMLU, achieving state-of-the-art efficiency and performance at a
fraction of the scale.
[Figure 2 plots: Accuracy vs Model Size; x-axis: Model Size (Billion Parameters), y-axis: task accuracy (%); series: GPT-2 models, KAI-95M, Mistral-7B]
(a) MMLU: equivalent GPT-2 or Mistral model size 1.45B (KAI-95M is 15.2× more parameter-efficient)
(b) OpenBookQA: equivalent GPT-2 or Mistral model size 3.99B (KAI-95M is 42.0× more parameter-efficient)
(c) ARC-C: equivalent GPT-2 or Mistral model size 1.68B (KAI-95M is 17.6× more parameter-efficient)
(d) BoolQ: equivalent GPT-2 or Mistral model size 1.56B (KAI-95M is 16.4× more parameter-efficient)
(e) CSQA: equivalent GPT-2 or Mistral model size 1.71B (KAI-95M is 18.0× more parameter-efficient)
Figure 2: Performance comparison across MMLU, OpenBookQA, ARC-Challenge (ARC-C), BoolQ,
and CommonsenseQA (CSQA). KAI-95M demonstrates near parity with GPT-2-XL on MMLU despite
being ∼16× smaller, while maintaining consistent gains across reasoning and comprehension tasks. These
results highlight KAI-95M's parameter efficiency and strong generalization.
4 Discussion
KAI-95M was not instruction tuned, so all of the metrics demonstrated herein are strictly from pretraining
our model on a single general-purpose text corpus of ∼4.66B tokens. This corpus was also not an amalgam
of a vast multitude of different sources; the absence of heterogeneous sources limits structural diversity in
the training data. With fine-tuning and a larger, more diverse corpus of training data, the model's ability to
attribute semantic value to words and to answer complex questions is expected to improve further. At present,
KAI-95M is a foundation model that is meant to be fine-tuned to meet enterprise or consumer needs.
Notably, the substantial performance our model has attained in the areas of commonsense reasoning (0-shot),
reading comprehension (0-shot), and cross-domain language understanding (5-shot) stands to be significantly
improved after fine-tuning. This indicates our model's substantial potential to accomplish a wide range of
computational tasks with superior efficiency.
In a broader context, the performance of our model across these areas has direct relevance to users in the
enterprise and consumer domains. Strong zero-shot reasoning and comprehension capabilities suggest that
KAI-95M can deliver value in enterprise applications such as document analysis, customer service, and
knowledge retrieval without extensive retraining. The model's cross-domain adaptability further indicates
its viability for deployment across diverse industries, reducing time-to-value, enhancing efficiency, and
accelerating returns on A.I. infrastructure investments.
5 Conclusion
KAI-95M demonstrates that a new model architecture can achieve performance that is competitive with, or
superior to, transformer architectures while incurring a fraction of the computational expense. This suggests
that parameter counts need not scale proportionally with task complexity, as the parameter-efficient design of
KAI-95M shows that superior performance can be achieved on a range of tasks without large training corpora
or gargantuan parameter counts. While there is still much to discover about this architecture's capabilities, the
benchmarks presented herein highlight its promise as a more efficient alternative to transformer architectures.
Acknowledgements
We thank the following members of our research team for their exceptional efforts over the years: Jackie Chen,
Shuyao He, Syrus Aslam, Sophie Cui, Quincy Thai, Melania Ohanian, Danji Liu, Mohammed Za ee -Mustafa,
Henry Cen, Xiaole Guo, Nathaniel Haynam, Naz Col, Mizuho Li, Sergio William Peterson, Danielle Wong,
Melad Sabagh, Mengzhu Sun, Hanqi Xiong, Elizabeth Lau, Zhihao Du, Danial Nasi Awan, Alexander Luke
De Sa am, Gyuyeon Jung, Sama h Goel.

References
[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan,
Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.
arXiv preprint arXiv:2108.07732, 2021.
[2] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael
Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term
memory. arXiv preprint arXiv:2405.04517, 2024.
[3] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical
commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence,
2020.
[4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan,
Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models
trained on code. arXiv preprint arXiv:2107.03374, 2021.
[5] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina
Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint
arXiv:1905.10044, 2019.
[6] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and
Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv
preprint arXiv:1803.05457, 2018.
[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias
Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word
problems. arXiv preprint arXiv:2110.14168, 2021.
[8] Ramin Hasani, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski,
Gerald Teschl, and Daniela Rus. Closed-form continuous-time neural models. arXiv preprint
arXiv:2106.13898, 2021.
[9] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob
Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
2020.
[10] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song,
and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint
arXiv:2103.03874, 2021.
[11] A. Jegham, D. Bensaada, and N. Be ached. How hungry is GPT-4? Measuring the electricity, water, and
carbon requirements of artificial intelligence models. arXiv preprint arXiv:2505.09598, 2025.
[12] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot,
Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lelio Renard
Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothee
Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
[13] J. Kim, S. Lee, and H. Park. ixi-GEN: Efficient industrial sLLMs through domain adaptive continual
pretraining. arXiv preprint arXiv:2507.06795, 2025.
[14] P. Li, J. Yang, M. A. Islam, and S. Ren. Making AI less "thirsty": Uncovering and addressing the secret
water footprint of AI models. arXiv preprint arXiv:2304.03271, 2023.
[15] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct
electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
[16] G. Pan and H. Wang. A cost-benefit analysis of on-premise large language model deployment: Breaking
even with commercial LLM services. arXiv preprint arXiv:2509.18101, 2025.
[17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language
models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[18] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial
Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
[19] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question
answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz
Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[21] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine
really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.