KAI-95M: Benchmarking An Efficient Transformer Alternative

Author: Koerber, Kai Stone; Mao, Xuanlin

Publisher: Zenodo

DOI: 10.5281/zenodo.17677118

Source: https://zenodo.org/records/17677118/files/main.pdf

KAI-95M: Benchmarking An Efficient Transformer Alternative
Kai Stone Koerber, Xuanlin Mao
Abstract
We introduce KAI-95M, a highly capable 95M parameter language model that leverages a hybrid closed-form continuous time (CfC) and mLSTM model architecture and surpasses the performance of LLMs ∼16x its size (GPT-2 1.5B) at greatly reduced computational cost. KAI-95M has been designed to resolve the memory limitations of previous CfC architectures via the introduction of mLSTM memory units and a gating mechanism for memory unit regulation. This allows KAI-95M to meet the challenge of on-prem and on-device language model deployment without the need for GPUs to run inference-based tasks, saving money and compute while providing users with a model that has demonstrated its ability to accomplish specific language modeling tasks with a parameter efficiency that is, on average, ∼22x greater than that of the models it is benchmarked against herein.
Designed to ensure seamless fine-tuning across a plethora of tasks, KAI-95M represents an exciting new chapter in the era of specialized small language models (SLMs). Please contact us for research validation and other technical inquiries.
1 Introduction
Today, the world of A.I. language models meant for corporate use primarily comprises transformer-based large language models (LLMs) that take a generalist approach to providing thorough answers to a broad array of queries. While this has led to many advances in natural language understanding and search, it has also facilitated the popularization of architectures that incur immense computational expense during training and inference. Globally, these transformer-based large language models have been shown to demand vast amounts of power and water to satisfy their energy and cooling requirements, and are at present projected to reach an annual water withdrawal amount equivalent to that of Denmark or half of the United Kingdom by 2027 [11, 14]. GPT-4o in particular requires energy equivalent to that of 35,000 U.S. homes, and is responsible for freshwater evaporation equivalent to the annual drinking needs of 1.2 million people per year just to meet the needs of its daily operations [11]. The majority of AI-enabled companies today are reliant on LLMs that are too large to deploy on-prem or on-device without the end user or company having to acquire massive GPU infrastructure. This GPU infrastructure investment, in addition to the ongoing energy, cooling, and maintenance costs of an on-prem LLM, has been shown to be so large that it often takes several years to break even after on-prem deployment is achieved [16]. This highlights the economic advantage of smaller parameter-efficient models like KAI-95M, which achieves superior benchmark performance to the larger models discussed herein on several tasks while dramatically reducing hardware, energy, cooling, and maintenance requirements.
In recent years, the need for compact domain expert language models that can run on corporations’ existing legacy hardware has become increasingly apparent. A driving force behind this is that many organizations lack the infrastructure to deploy and maintain large language models on-prem, and that retraining these large models from scratch as they become outdated is prohibitively expensive [13]. Nevertheless, the fact remains that many enterprises want to deploy their own custom language modeling tools that streamline operations and enhance productivity and sales without having to store sensitive internal data on third party cloud services. They, quite simply, want to secure their strategic edge: their data. Data is a resource so precious that it has singlehandedly become the thing that stands to define the commerce of the 21st century.
To address this growing demand, we introduce KAI-95M, a compact 95M parameter language model designed for sustainable, on-prem deployment. Built on a hybrid closed-form continuous time [8] and mLSTM [2] architecture, KAI-95M has demonstrated superior computational efficiency, significantly accelerated training and inference, strong sequence modeling performance, and a throughput that is immensely greater than a transformer’s in its parameter range. In addition, its high parameter efficiency contributes to a greatly reduced memory footprint, which is highly relevant to real-time or on-device applications.
Despite its size, KAI-95M achieves superior performance to GPT-2 (1.5B) [17] on benchmarks such as OpenbookQA [15], ARC-Challenge [6], CommonSenseQA [19], and BoolQ [5], and achieves near parity with GPT-2 (1.5B) on MMLU [9] while operating at a dramatically reduced computational and financial cost. Our model represents a shift towards highly efficient, data-sovereign A.I. that prioritizes adaptability, accessibility, and sustainability over brute force scale.
2 Model Architecture
Our model is primarily built upon Closed-form Continuous-time Neural Networks (CfC) [8] and incorporates architectural refinements through the integration of mLSTM units from the xLSTM [2] model, ultimately yielding a novel lightweight language model architecture tailored for language generation.
The CfC introduces an approximate closed-form solution for continuous neural networks with explicit time modeling, formalized into a new class of neural architectures that significantly accelerate training and inference. By replacing traditional recurrent updates with closed-form solutions of continuous-time dynamics, the CfC retains the rich modeling capabilities of ODE-based approaches while achieving superior computational efficiency, stable training, and strong sequence modeling performance, making it well-suited for long-context reasoning tasks.
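As a rough illustration of the closed-form idea, the sketch below follows the gated form published for CfC networks [8], in which a time-dependent sigmoid gate interpolates between two heads: x(t) = sigmoid(-f t) * g + (1 - sigmoid(-f t)) * h. The heads f, g, and h are learned networks in the published formulation; here they are replaced by fixed illustrative numbers, and nothing in this sketch is taken from KAI-95M's own implementation.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def cfc_state(f_out, g_out, h_out, t):
    """Closed-form hidden state at elapsed time t, elementwise over units.

    Implements x(t) = sigmoid(-f * t) * g + (1 - sigmoid(-f * t)) * h,
    where f_out, g_out, h_out stand in for the outputs of three heads.
    """
    state = []
    for f, g, h in zip(f_out, g_out, h_out):
        gate = sigmoid(-f * t)  # time-dependent interpolation weight
        state.append(gate * g + (1.0 - gate) * h)
    return state


# Illustrative head outputs (assumed values, not learned parameters).
f_out = [0.5, 2.0]
g_out = [1.0, -1.0]
h_out = [0.0, 3.0]

for t in (0.0, 1.0, 10.0):
    print(t, cfc_state(f_out, g_out, h_out, t))
```

At t = 0 the gate is 0.5, so the state is the midpoint of g and h; for positive f and large t the gate approaches 0 and the state approaches h. No ODE solver is called at any point, which is the source of the efficiency gain the paragraph describes.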
The xLSTM, on the other hand, extends the classical LSTM architecture by introducing units such as the mLSTM and sLSTM, and organizes these units into deeper network structures. These extensions substantially enhance the expressive power of the model, enabling more flexible representation learning and improved performance in language modeling tasks.
The original CfC model integrates LSTM units. However, we observed that its representational capacity remains limited and cannot be directly applied to large language models. To address this limitation, we enhanced its internal memory mechanism by introducing larger memory units and more sophisticated architectural designs, thereby improving CfC’s expressiveness and long-range memory capabilities. The mLSTM units from the xLSTM framework fulfill this requirement: inspired by the attention layers in the classical Transformer architecture, mLSTM extends the vector-based memory of LSTM into a matrix-based memory and incorporates optimizations specifically designed for large-scale training. Consequently, we replace the LSTM units in CfC with mLSTM units.
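To make the matrix-based memory concrete, here is a minimal sketch of a single mLSTM-style step, following the update rule given in the xLSTM paper [2]: the memory is decayed by a forget gate, an outer product of value and key is written in, and a query reads it back with a normalizer. It is simplified on purpose (scalar gates are passed in directly; the exponential gating, stabilization, and output gate of the published cell are omitted) and it does not describe how KAI-95M wires these units into the CfC.

```python
def mlstm_step(C, n, q, k, v, i_gate, f_gate):
    """One recurrent step of a simplified matrix-memory cell.

    C: d x d memory matrix (list of rows); n: length-d normalizer vector;
    q, k, v: length-d query, key, value; i_gate, f_gate: scalar gates.
    Returns the updated (C, n) and the retrieved vector h.
    """
    d = len(k)
    # Write: C_t = f * C_{t-1} + i * (v k^T)
    C_new = [[f_gate * C[r][c] + i_gate * v[r] * k[c] for c in range(d)]
             for r in range(d)]
    # Normalizer: n_t = f * n_{t-1} + i * k
    n_new = [f_gate * n[c] + i_gate * k[c] for c in range(d)]
    # Read: h = C_t q / max(|n_t . q|, 1)
    denom = max(abs(sum(n_new[c] * q[c] for c in range(d))), 1.0)
    h = [sum(C_new[r][c] * q[c] for c in range(d)) / denom for r in range(d)]
    return C_new, n_new, h


# Store one key/value pair in an empty memory, then query with the same key.
d = 2
C0 = [[0.0] * d for _ in range(d)]
n0 = [0.0] * d
key, value = [1.0, 0.0], [2.0, 3.0]
C1, n1, h1 = mlstm_step(C0, n0, q=key, k=key, v=value, i_gate=1.0, f_gate=1.0)
print(h1)
```

Querying with the stored key returns the stored value, which is the associative-recall behavior that a d x d memory provides and that a length-d LSTM cell state does not.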
In addition, to accommodate the computation of mLSTM units and to mitigate the gradient vanishing and exploding issues commonly associated with recurrent neural network architectures, we modified the overall model architecture and computational flow, and introduced a gating mechanism to regulate the memory units. Inspired by the Transformer encoder [20] architecture, we encapsulate our model’s design into modular blocks by incorporating residual connections and layer normalization, which enables the construction of deeper networks.
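The block structure can be sketched generically as follows. The paper does not state whether normalization is applied before or after the sublayer, so the pre-norm ordering used here is an assumption, as are the function names; the learnable scale and bias of layer normalization are left out, and the sublayer is a placeholder for whatever recurrent unit the block wraps.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]


def residual_block(x, sublayer):
    """Return x + sublayer(layer_norm(x)); pre-norm ordering is an assumption."""
    y = sublayer(layer_norm(x))
    return [xi + yi for xi, yi in zip(x, y)]


def stack(x, sublayers):
    """Apply a sequence of residual blocks, one per sublayer."""
    for sublayer in sublayers:
        x = residual_block(x, sublayer)
    return x


# Placeholder sublayer: scales its input by 0.1 (stands in for a recurrent unit).
def small_update(z):
    return [0.1 * zi for zi in z]


print(stack([1.0, 2.0, 3.0], [small_update] * 4))
```

Because each block adds its output to an unchanged copy of its input, the identity path survives any number of stacked blocks, which is what makes deeper stacks trainable.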
3 Results
We compare KAI-95M’s performance to GPT-2 (117M/800M/1.5B); all benchmarks were calculated via our own evaluation pipeline. We measured performance across the following tasks:
• Commonsense Reasoning (0-shot): HellaSwag [21], WinoGrande [18], PIQA [3], OpenbookQA [15], ARC-Easy, ARC-Challenge [6], CommonsenseQA [19]
• Reading Comprehension (0-shot): BoolQ [5]
• Cross-Domain Language Understanding (5-shot): MMLU [9]
Our model was pretrained exclusively on a single encyclopedic text corpus (∼4.66B tokens) and has not yet undergone fine-tuning for complex QA tasks. Accordingly, we chose not to benchmark on specialized datasets such as GSM8K [7], MATH [10], HumanEval [4], or MBPP [1], as our results on these tasks would not fairly represent the model’s potential at this stage. Nevertheless, the fact that KAI-95M, with only 95M parameters, outperforms GPT-2 (117M/800M/1.5B) on several benchmarks across our evaluation categories (ARC-Challenge, BoolQ, CommonsenseQA, and OpenbookQA; see Table 1) and achieves near parity with GPT-2-XL (1.5B) on MMLU highlights its computational and parameter efficiency. This suggests that, with broader training and fine-tuning, KAI-95M could deliver performance competitive with larger models at significantly lower cost.
Model               Modality    MMLU   ARC-C  ARC-E  BoolQ  CSQA   HellaSwag  OBQA   PIQA   WinoGrande
GPT-2 (117M)        Pretrained  22.9%  19.0%  43.8%  48.7%  19.6%  28.9%      16.4%  62.9%  51.6%
GPT-2-Large (800M)  Pretrained  23.2%  21.7%  53.2%  60.5%  19.9%  36.4%      19.4%  70.3%  55.3%
GPT-2-XL (1.5B)     Pretrained  25.2%  25.0%  58.3%  61.8%  19.6%  40.0%      22.4%  70.8%  58.3%
KAI-95M             Pretrained  25.1%  25.8%  26.6%  62.0%  21.0%  24.0%      27.2%  47.9%  48.3%
Table 1: Comparison of KAI-95M with GPT-2 (117M/800M/1.5B). MMLU results are added as a general knowledge benchmark. KAI-95M nearly matches GPT-2-XL (1.5B) performance on MMLU despite being ∼16× smaller, and also surpasses GPT-2 (117M/800M/1.5B) across multiple commonsense and factual tasks.
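For readers who want to check which columns of Table 1 each model leads, the short script below transcribes the table and lists the benchmarks on which a given model scores strictly above all others. The data structure and function name are illustrative choices, not part of the authors' evaluation pipeline.

```python
# Accuracy (%) transcribed from Table 1; column order matches the table.
BENCHMARKS = ["MMLU", "ARC-C", "ARC-E", "BoolQ", "CSQA",
              "HellaSwag", "OBQA", "PIQA", "WinoGrande"]
SCORES = {
    "GPT-2 (117M)":       [22.9, 19.0, 43.8, 48.7, 19.6, 28.9, 16.4, 62.9, 51.6],
    "GPT-2-Large (800M)": [23.2, 21.7, 53.2, 60.5, 19.9, 36.4, 19.4, 70.3, 55.3],
    "GPT-2-XL (1.5B)":    [25.2, 25.0, 58.3, 61.8, 19.6, 40.0, 22.4, 70.8, 58.3],
    "KAI-95M":            [25.1, 25.8, 26.6, 62.0, 21.0, 24.0, 27.2, 47.9, 48.3],
}


def benchmarks_led_by(model, scores=SCORES):
    """Return the benchmarks where `model` scores strictly above every other model."""
    leads = []
    for idx, name in enumerate(BENCHMARKS):
        others = [row[idx] for m, row in scores.items() if m != model]
        if scores[model][idx] > max(others):
            leads.append(name)
    return leads


kai_leads = benchmarks_led_by("KAI-95M")
xl_leads = benchmarks_led_by("GPT-2-XL (1.5B)")
print("KAI-95M leads on:", kai_leads)
print("GPT-2-XL leads on:", xl_leads)
```

On the Table 1 values, KAI-95M leads on ARC-C, BoolQ, CSQA, and OBQA, while GPT-2-XL leads on MMLU (by 0.1 points), ARC-E, HellaSwag, PIQA, and WinoGrande.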
Size and Efficiency: The “equivalent model sizes” of the GPT-2 and/or Mistral [12] models were computed in order to gain a greater understanding of just how significant the efficiency gains of KAI-95M’s hybrid CfC-mLSTM architecture are (see Figure 2). When evaluated across the tasks OpenBookQA, ARC-Challenge, CommonSenseQA, BoolQ, and MMLU, we found an overall average parameter efficiency gain of 21.8x. That is to say, KAI-95M has 21.8x greater parameter efficiency than the models it was benchmarked against because it has, on average, 21.8x fewer parameters than the other models benchmarked on these tasks while retaining a greater level of accuracy.
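The 21.8x figure can be reproduced from the per-task numbers reported in Figure 2. The sketch below averages the reported ratios and, as a cross-check, recomputes each ratio as the equivalent model size divided by KAI-95M's size. The exact parameter count is not given in the text, so 0.095B is an assumption; with it, the recomputed ratios land within 0.1 of the reported ones.

```python
# Per-task values as reported in the panels of Figure 2.
EQUIVALENT_SIZE_B = {   # equivalent GPT-2 or Mistral model size, billions of parameters
    "MMLU": 1.45, "OpenBookQA": 3.99, "ARC-C": 1.68, "BoolQ": 1.56, "CSQA": 1.71,
}
REPORTED_RATIO = {      # "KAI-95M is N x more parameter-efficient"
    "MMLU": 15.2, "OpenBookQA": 42.0, "ARC-C": 17.6, "BoolQ": 16.4, "CSQA": 18.0,
}

KAI_SIZE_B = 0.095  # assumption: exactly 95M parameters

# Average of the reported per-task ratios.
mean_reported = sum(REPORTED_RATIO.values()) / len(REPORTED_RATIO)

# Cross-check: ratio = equivalent size / KAI-95M size.
recomputed = {task: size / KAI_SIZE_B for task, size in EQUIVALENT_SIZE_B.items()}

print(f"mean reported ratio: {mean_reported:.1f}x")
for task in REPORTED_RATIO:
    print(f"{task}: reported {REPORTED_RATIO[task]:.1f}x, recomputed {recomputed[task]:.2f}x")
```

The mean is dominated by OpenBookQA (42.0x); the other four tasks fall between 15.2x and 18.0x.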
[Figure 1: grouped bar chart of accuracy (%) by task (CSQA, ARC-C, MMLU, OpenBookQA, BoolQ) for KAI-95M, GPT-2 (117M), GPT-2-Large (~800M), and GPT-2-XL (~1.5B).]
Figure 1: Benchmark task accuracy comparison on CommonSenseQA (CSQA), ARC-Challenge (ARC-C), MMLU, OpenBookQA, and BoolQ of KAI-95M and GPT-2 (117M/800M/1.5B). Our 95M-parameter model outperforms all GPT-2 variants (up to 1.5B parameters) on multiple reasoning and comprehension tasks and achieves near parity on MMLU, achieving state-of-the-art efficiency and performance at a fraction of the scale.
[Figure 2: five panels plotting accuracy (%) against model size (billion parameters) for the GPT-2 models, KAI-95M, and Mistral-7B. Each panel is annotated with the equivalent GPT-2 or Mistral model size and the resulting parameter-efficiency ratio for KAI-95M.]
(a) MMLU: equivalent model size 1.45B; KAI-95M is 15.2× more parameter-efficient
(b) OpenBookQA: equivalent model size 3.99B; KAI-95M is 42.0× more parameter-efficient
(c) ARC-C: equivalent model size 1.68B; KAI-95M is 17.6× more parameter-efficient
(d) BoolQ: equivalent model size 1.56B; KAI-95M is 16.4× more parameter-efficient
(e) CSQA: equivalent model size 1.71B; KAI-95M is 18.0× more parameter-efficient
Figure 2: Performance comparison across MMLU, OpenBookQA, ARC-Challenge (ARC-C), BoolQ, and CommonSenseQA (CSQA). KAI-95M demonstrates near parity with GPT-2-XL on MMLU despite being ∼16× smaller, while maintaining consistent gains across reasoning and comprehension tasks. These results highlight KAI-95M’s parameter efficiency and strong generalization.
4 Discussion
KAI-95M was not instruction tuned, so all of the metrics demonstrated herein come strictly from pretraining our model on a single general-purpose text corpus of ∼4.66B tokens. This corpus was also not an amalgam of a vast multitude of different sources; the absence of heterogeneous sources limits structural diversity in the training data. With fine-tuning and a larger, more diverse corpus of training data, the model’s ability to attribute semantic value to words and to answer complex questions is expected to improve further. At present, KAI-95M is a foundation model that is meant to be fine-tuned to meet enterprise or consumer needs.
Notably, the substantial performance our model has attained in the areas of common sense reasoning (0-shot), reading comprehension (0-shot), and cross-domain language understanding (5-shot) stands to be significantly improved after fine-tuning. This indicates our model’s substantial potential to accomplish a wide range of computational tasks with superior efficiency.
In a broader context, the performance of our model across these areas has direct relevance to users in the enterprise and consumer domains. Strong zero-shot reasoning and comprehension capabilities suggest that KAI-95M can deliver value in enterprise applications such as document analysis, customer service, and knowledge retrieval without extensive retraining. The model’s cross-domain adaptability further indicates its viability for deployment across diverse industries, reducing time-to-value, enhancing efficiency, and accelerating returns on A.I. infrastructure investments.
5 Conclusion
KAI-95M demonstrates that a new model architecture can achieve performance that is competitive with, or superior to, transformer architectures while incurring a fraction of the computational expense. This suggests that parameter counts need not scale proportionally with task complexity, as the parameter-efficient design of KAI-95M shows that superior performance can be achieved on a range of tasks without large training corpora or gargantuan parameter counts. While there is still much to discover about this architecture’s capabilities, the benchmarks presented herein highlight its promise as a more efficient alternative to transformer architectures.
Acknowledgements
We thank the following members of our research team for their exceptional efforts over the years: Jackie Chen, Shuyao He, Syrus Aslam, Sophie Cui, Quincy Thai, Melania Ohanian, Danji Liu, Mohammed Za ee -Mustafa, Henry Cen, Xiaole Guo, Nathaniel Haynam, Naz Col, Mizuho Li, Sergio William Peterson, Danielle Wong, Melad Sabagh, Mengzhu Sun, Hanqi Xiong, Elizabeth Lau, Zhihao Du, Danial Nasi Awan, Alexander Luke De Sa am, Gyuyeon Jung, Samarth Goel.

References
[1] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
[2] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024.
[3] Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.
[4] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
[5] Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. arXiv preprint arXiv:1905.10044, 2019.
[6] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
[7] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[8] Ramin Hasani, Mathias Lechner, Alexander Amini, Lucas Liebenwein, Aaron Ray, Max Tschaikowski, Gerald Teschl, and Daniela Rus. Closed-form continuous-time neural models. arXiv preprint arXiv:2106.13898, 2021.
[9] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
[10] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
[11] A. Jegham, D. Bensaada, and N. Be ached. How hungry is GPT-4? Measuring the electricity, water, and carbon requirements of artificial intelligence models. arXiv preprint arXiv:2505.09598, 2025.
[12] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lelio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothee Lacroix, and William El Sayed. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
[13] J. Kim, S. Lee, and H. Park. ixi-GEN: Efficient industrial sLLMs through domain adaptive continual pretraining. arXiv preprint arXiv:2507.06795, 2025.
[14] P. Li, J. Yang, M. A. Islam, and S. Ren. Making AI less “thirsty”: Uncovering and addressing the secret water footprint of AI models. arXiv preprint arXiv:2304.03271, 2023.
[15] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. arXiv preprint arXiv:1809.02789, 2018.
[16] G. Pan and H. Wang. A cost-benefit analysis of on-premise large language model deployment: Breaking even with commercial LLM services. arXiv preprint arXiv:2509.18101, 2025.
[17] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. Technical report, OpenAI, 2019.
[18] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
[19] Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937, 2018.
[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[21] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019.
