Appendix to Early Product-Line Validation: Assessing LLMs for Analysis of Semi-Formal Blueprints
ACM Reference Format:
. 2026. Appendix to Early Product-Line Validation: Assessing LLMs for Analysis of Semi-Formal Blueprints. In
Proceedings of The 41st ACM/SIGAPP Symposium on Applied Computing (SAC ’26). ACM, New York, NY, USA,
8 pages. https://doi.org/XXXXXXX.XXXXXXX
A Task Instructions
This section details the reasoning procedures encoded in the user prompts for each analysis
operation (AO).
Common preprocessing. For every AO, the model first normalizes the blueprint into a canonical
representation: (i) enumerate all unique features and the root; (ii) parse relationship semantics
(mandatory, optional, or, alternative); (iii) extract cross-tree constraints and rewrite them as logic
(requires: A → B, excludes: ¬(A ∧ B)). This shared decomposition is referenced by all tasks below.
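The canonical representation described above can be pictured as a small data structure. The sketch below is only an illustration under stated assumptions: the class name `Blueprint`, its field names, and the toy phone example are hypothetical and do not come from the prompts themselves.

```python
from dataclasses import dataclass, field


@dataclass
class Blueprint:
    """Hypothetical canonical form of a semi-formal blueprint."""
    root: str
    # parent -> list of (kind, children); kind is one of
    # "mandatory", "optional", "or", "alternative"
    relations: dict = field(default_factory=dict)
    requires: list = field(default_factory=list)  # (A, B) encodes A -> B
    excludes: list = field(default_factory=list)  # (A, B) encodes not (A and B)

    def features(self):
        """Step (i): enumerate all unique features, including the root."""
        found = {self.root}
        for parent, groups in self.relations.items():
            found.add(parent)
            for _kind, children in groups:
                found.update(children)
        return found

    def logic(self):
        """Step (iii): rewrite cross-tree constraints as logic strings."""
        rules = [f"{a} -> {b}" for a, b in self.requires]
        rules += [f"not ({a} and {b})" for a, b in self.excludes]
        return rules


# Toy blueprint invented for illustration.
phone = Blueprint(
    root="Phone",
    relations={
        "Phone": [("mandatory", ["Calls", "Screen"]), ("optional", ["GPS", "Media"])],
        "Screen": [("alternative", ["Basic", "Color"])],
        "Media": [("or", ["Camera", "MP3"])],
    },
    requires=[("Camera", "Color")],
    excludes=[("GPS", "Basic")],
)

print(sorted(phone.features()))
print(phone.logic())
```

Keeping the relationship kind next to each child group means step (ii) is resolved once, so later operations do not have to re-read the original wording.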
Solver-free structural metrics (AO1–AO9). Using only the canonical hierarchy and constraint
list, the model: (1) counts unique features and leaf nodes; (2) computes maximum depth from the
root; (3) tallies mandatory/optional features and numbers of or/alternative relationships; (4) counts
requires/excludes constraints. All quantities are derived directly from the parsed structure, i.e.,
no external solver is invoked. When needed (e.g., for local checks), the model applies lightweight
sanity rules that disambiguate relationship semantics before any counting.
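The structural counts listed above need nothing more than a traversal of the parsed hierarchy. The following is a minimal sketch, assuming a hypothetical dictionary layout (parent mapped to a list of `(kind, children)` groups) and assuming that depth means the number of edges on the longest root-to-leaf path; the mapping of individual counts to AO numbers is not specified here.

```python
def structural_metrics(root, relations, requires, excludes):
    """Count structural properties directly from the parsed hierarchy."""
    children_of = {
        parent: [child for _kind, group in groups for child in group]
        for parent, groups in relations.items()
    }
    features = {root} | set(children_of)
    for kids in children_of.values():
        features.update(kids)

    def depth(node):
        # Assumption: depth = number of edges on the longest path to a leaf.
        return max((1 + depth(c) for c in children_of.get(node, [])), default=0)

    def count_children(kind):
        return sum(len(g) for groups in relations.values() for k, g in groups if k == kind)

    def count_groups(kind):
        return sum(1 for groups in relations.values() for k, _g in groups if k == kind)

    return {
        "features": len(features),
        "leaves": sum(1 for f in features if not children_of.get(f)),
        "max_depth": depth(root),
        "mandatory": count_children("mandatory"),
        "optional": count_children("optional"),
        "or_groups": count_groups("or"),
        "alternative_groups": count_groups("alternative"),
        "requires": len(requires),
        "excludes": len(excludes),
    }


# Toy blueprint invented for illustration.
metrics = structural_metrics(
    root="Phone",
    relations={
        "Phone": [("mandatory", ["Calls", "Screen"]), ("optional", ["GPS", "Media"])],
        "Screen": [("alternative", ["Basic", "Color"])],
        "Media": [("or", ["Camera", "MP3"])],
    },
    requires=[("Camera", "Color")],
    excludes=[("GPS", "Basic")],
)
print(metrics)
```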
Model satisfiability (AO10). Combine the hierarchical rules with all cross-tree constraints. Search
for contradictions (e.g., mutually exclusive mandatory selections, cycles of implications that force
an exclusion, illegal group assignments). If at least one configuration consistent with all rules
can be constructed, output true; otherwise false. Provide a minimal conflicting subset when
unsatisfiable.
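For very small models, the satisfiability check and the minimal conflicting subset can be illustrated by exhaustive search. This is a sketch under stated assumptions, not the procedure a solver or the evaluated models use: rules are written as hypothetical named Python predicates over a set of selected features, and the four-feature model is invented.

```python
from itertools import combinations


def all_configs(features):
    """Yield every subset of the feature list as a frozenset."""
    for size in range(len(features) + 1):
        for combo in combinations(features, size):
            yield frozenset(combo)


def satisfiable(features, rules):
    """Return True if some configuration satisfies every rule."""
    return any(
        all(check(cfg) for check in rules.values())
        for cfg in all_configs(features)
    )


def minimal_conflict(features, rules):
    """Return the names of a smallest unsatisfiable rule subset, or None."""
    names = list(rules)
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            if not satisfiable(features, {n: rules[n] for n in subset}):
                return set(subset)
    return None


# Invented example: two mandatory children that exclude each other.
FEATURES = ["R", "A", "B", "C"]
RULES = {
    "root": lambda s: "R" in s,
    "A mandatory": lambda s: ("A" in s) == ("R" in s),
    "B mandatory": lambda s: ("B" in s) == ("R" in s),
    "C optional": lambda s: "C" not in s or "R" in s,
    "A excludes B": lambda s: not ("A" in s and "B" in s),
}

print(satisfiable(FEATURES, RULES))       # False
print(minimal_conflict(FEATURES, RULES))  # every rule except "C optional"
```

The search is exponential in both features and rules, so it only serves to make the definition concrete.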
SAC ’26, Thessaloniki, Greece
© 2026 Copyright held by the owner/author(s). Publication rights licensed to ACM.
ACM ISBN 979-X-XXXX-XXXX-X/26/03
, Vol. 1, No. 1, Article . Publication date: October 2026.
Configuration satisfiability (AO11). This operation checks whether a given configuration of
selected features is valid with respect to the rules of the feature model. The check proceeds in four
steps: (1) verify that all mandatory features of selected parents are included, and that feature groups
(or, alternative) are chosen with valid multiplicities (e.g., exactly one for alternative, at least one for
or); (2) ensure that all requires constraints are satisfied, including indirect ones (if A requires B
and B requires C, selecting A must also include C); (3) verify that no pair of excludes features is
selected together; and (4) confirm that no additional features implied by these constraints cause
group violations (for example, if two required features belong to the same alternative group). If all
four checks hold, the configuration is considered satisfiable; otherwise, the smallest conflicting
subset of features or constraints is reported.
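The checks for AO11 can be sketched as a single pass over a complete configuration. The function name, the data layout (parent mapped to `(kind, children)` groups), and the toy phone model are assumptions made for illustration. For a complete configuration, step (4) needs no separate code: a feature implied by a requires constraint is either missing (caught in step 2) or present and then subject to the group check in step 1.

```python
def check_configuration(selected, root, relations, requires, excludes):
    """Return a list of violated rules; an empty list means the configuration is valid."""
    selected = set(selected)
    problems = []
    if root not in selected:
        problems.append(f"root {root} is not selected")

    # Step 1: hierarchy and group multiplicities.
    for parent, groups in relations.items():
        for kind, children in groups:
            chosen = [c for c in children if c in selected]
            if parent not in selected:
                problems += [f"{c} selected without parent {parent}" for c in chosen]
            elif kind == "mandatory" and len(chosen) < len(children):
                problems.append(f"mandatory child of {parent} is missing")
            elif kind == "alternative" and len(chosen) != 1:
                problems.append(f"alternative group under {parent} needs exactly one child")
            elif kind == "or" and not chosen:
                problems.append(f"or group under {parent} needs at least one child")

    # Step 2: requires. Checking every direct edge also covers chains,
    # because A -> B and B -> C together force C whenever A is selected.
    for a, b in requires:
        if a in selected and b not in selected:
            problems.append(f"{a} requires {b}")

    # Step 3: excludes.
    for a, b in excludes:
        if a in selected and b in selected:
            problems.append(f"{a} excludes {b}")
    return problems


# Toy blueprint invented for illustration.
RELATIONS = {
    "Phone": [("mandatory", ["Calls", "Screen"]), ("optional", ["GPS", "Media"])],
    "Screen": [("alternative", ["Basic", "Color"])],
    "Media": [("or", ["Camera", "MP3"])],
}
REQUIRES = [("Camera", "Color")]
EXCLUDES = [("GPS", "Basic")]

ok = check_configuration(
    {"Phone", "Calls", "Screen", "Color", "Media", "Camera"},
    "Phone", RELATIONS, REQUIRES, EXCLUDES,
)
bad = check_configuration(
    {"Phone", "Calls", "Screen", "Basic", "GPS"},
    "Phone", RELATIONS, REQUIRES, EXCLUDES,
)
print(ok)   # []
print(bad)  # ['GPS excludes Basic']
```

Returning the list of violated rules, rather than a bare boolean, mirrors the requirement to report what conflicts; the sketch does not attempt to minimize that list.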
Number of valid configurations (AO12). Perform a constraint-aware branch-and-count over the
feature tree: (1) factorize by relationships (mandatory, optional, or, alternative) to form local choice
sets; (2) prune branches early using requires/excludes constraints; (3) count only branches that
satisfy all constraints. Where applicable, apply small-scale inclusion–exclusion on independent
subtrees as a cross-check.
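A simplified version of the branch-and-count idea is sketched below. It forms the local choice sets per relationship and combines them recursively, but, unlike the procedure described above, it applies the requires/excludes constraints only at the end instead of pruning early. Function names, the data layout, and the toy phone model are assumptions made for illustration.

```python
from itertools import product


def group_choices(kind, children, relations):
    """Local choice sets for one relationship group below a selected parent."""
    per_child = [subtree_configs(child, relations) for child in children]
    if kind == "alternative":
        # Exactly one child: pool the configurations of each child subtree.
        return [cfg for options in per_child for cfg in options]
    if kind != "mandatory":
        # optional / or: every child may also be left out.
        per_child = [[frozenset()] + options for options in per_child]
    combos = [frozenset().union(*combo) for combo in product(*per_child)]
    if kind == "or":
        combos = [cfg for cfg in combos if cfg]  # at least one child selected
    return combos


def subtree_configs(node, relations):
    """All tree-valid selections for the subtree rooted at a selected node."""
    groups = [group_choices(k, c, relations) for k, c in relations.get(node, [])]
    return [frozenset([node]).union(*combo) for combo in product(*groups)]


def count_valid(root, relations, requires, excludes):
    """Count tree-valid selections that also satisfy the cross-tree constraints."""
    total = 0
    for cfg in subtree_configs(root, relations):
        requires_ok = all(b in cfg for a, b in requires if a in cfg)
        excludes_ok = not any(a in cfg and b in cfg for a, b in excludes)
        if requires_ok and excludes_ok:
            total += 1
    return total


# Toy blueprint invented for illustration.
RELATIONS = {
    "Phone": [("mandatory", ["Calls", "Screen"]), ("optional", ["GPS", "Media"])],
    "Screen": [("alternative", ["Basic", "Color"])],
    "Media": [("or", ["Camera", "MP3"])],
}

print(len(subtree_configs("Phone", RELATIONS)))  # 16 without cross-tree constraints
print(count_valid("Phone", RELATIONS, [("Camera", "Color")], [("GPS", "Basic")]))  # 10
```

The unconstrained count of 16 agrees with the product of local choices (1 x 2 for the mandatory group, 2 x 4 for the optional group), which is the kind of cross-check on independent subtrees mentioned above.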
Core features (AO13). For each feature f, attempt a witness without f while satisfying all
constraints. If no valid configuration omitting f exists, classify f as core. Justify with the blocking
dependency (mandatory lineage, chained requires, or group logic).
Dead features (AO14). For each feature f, attempt a witness with f selected that respects all
relationships and constraints. If every attempt yields a contradiction (e.g., triggers an excludes or
violates group multiplicity), mark f as dead and state the tightest conflicting set.
False optional features (AO15). Start from features that appear optional (explicitly optional or
group members that could be omitted). For each candidate f, try to build a valid configuration
where its parent is selected but f is not. If no such configuration exists, f is false optional. Explain
which dependency or constraint forces f.
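The three witness-based checks (AO13 to AO15) reduce to simple questions about the set of valid configurations. The sketch below enumerates that set by brute force for an invented five-feature model; the feature names, the `is_valid` predicate, and the `OPTIONAL_PARENT` table are hypothetical, and the model is assumed to be satisfiable (otherwise every feature would trivially count as both core and dead).

```python
from itertools import combinations

FEATURES = ["R", "A", "B", "C", "D"]
# Features declared optional in the invented blueprint, mapped to their parent.
OPTIONAL_PARENT = {"B": "R", "C": "R", "D": "R"}


def is_valid(cfg):
    """Rules of the invented model: root R, mandatory A, A requires B, C excludes A."""
    return (
        "R" in cfg
        and "A" in cfg
        and ("A" not in cfg or "B" in cfg)
        and not ("C" in cfg and "A" in cfg)
    )


VALID = [
    frozenset(combo)
    for size in range(len(FEATURES) + 1)
    for combo in combinations(FEATURES, size)
    if is_valid(frozenset(combo))
]


def core_features():
    """Core: no valid configuration omits the feature."""
    return {f for f in FEATURES if all(f in cfg for cfg in VALID)}


def dead_features():
    """Dead: no valid configuration selects the feature."""
    return {f for f in FEATURES if not any(f in cfg for cfg in VALID)}


def false_optional_features():
    """False optional: declared optional, yet present whenever its parent is."""
    dead = dead_features()
    return {
        f for f, parent in OPTIONAL_PARENT.items()
        if f not in dead and all(f in cfg for cfg in VALID if parent in cfg)
    }


print(core_features())            # R, A and B
print(dead_features())            # C
print(false_optional_features())  # B
```

Here B is forced by the chained dependency (A is mandatory and requires B), while C is dead because it excludes a mandatory feature; these are the justifications each operation asks for.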
Generalization (AO16). Given an original blueprint and its edited variant, judge whether the
variant preserves all configurations of the original. Attempt to find a counterexample: a
configuration valid in the original but invalid in the variant. If none is found under the parsed
semantics, conclude that the variant generalizes the original; otherwise report the counterexample
and the violated rule in the variant.
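The counterexample search for AO16 amounts to a set difference between the valid configurations of the two blueprints. The sketch below assumes, for simplicity, that both versions share the same feature list and that rules are hypothetical Python predicates; it uses brute-force enumeration and is only meant for tiny examples.

```python
from itertools import combinations


def valid_configs(features, rules):
    """Set of all configurations that satisfy every rule."""
    return {
        frozenset(combo)
        for size in range(len(features) + 1)
        for combo in combinations(features, size)
        if all(rule(frozenset(combo)) for rule in rules)
    }


def find_counterexample(features, original_rules, edited_rules):
    """Return a configuration valid in the original but not in the edit, or None."""
    lost = valid_configs(features, original_rules) - valid_configs(features, edited_rules)
    # Prefer the smallest lost configuration so the report stays readable.
    return min(lost, key=lambda cfg: (len(cfg), sorted(cfg)), default=None)


# Invented example: root R with optional children A and B.
FEATURES = ["R", "A", "B"]
root = lambda s: "R" in s
a_excludes_b = lambda s: not ("A" in s and "B" in s)
a_requires_b = lambda s: "A" not in s or "B" in s

original = [root, a_excludes_b]
relaxed = [root]                                # drops the exclusion
tightened = [root, a_excludes_b, a_requires_b]  # adds a new requirement

print(find_counterexample(FEATURES, original, relaxed))    # None: the edit generalizes
print(find_counterexample(FEATURES, original, tightened))  # the configuration {R, A}
```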
B Further Results
B.1 Accuracy of LLM-based AOs
Table 1. Accuracy (%) of general-purpose LLMs across 9 solver-free AOs.
Model ID               AO1   AO2   AO3   AO4   AO5   AO6   AO7   AO8   AO9   Average
grok-4-non-reasoning   40    40    30    20    50    90    80    60    80    54.4
gpt-4.1                50    40    50    30    50    100   70    70    80    60.0
llama-4-scout          30    20    30    20    30    40    50    40    70    36.7
claude-sonnet-4        100   50    80    40    70    100   70    80    90    75.6
deepseek-chat          30    30    60    10    40    100   70    60    80    53.3
Average                50.0  36.0  50.0  24.0  48.0  86.0  68.0  62.0  80.0  56.0
Table 2. Accuracy (%) of reasoning-optimized LLMs across 9 solver-free AOs.
Model ID               AO1   AO2   AO3   AO4   AO5   AO6   AO7   AO8   AO9   Average
grok-4-reasoning       100   80    100   80    90    100   80    90    100   91.1
gemini-2.5-flash       60    60    80    50    80    80    80    70    80    71.1
gemini-2.5-pro         100   70    90    80    80    100   80    90    100   87.8
llama-4-maverick       60    30    30    30    60    80    70    60    80    55.6
gpt-5-mini             100   80    70    80    80    100   80    100   100   87.8
claude-sonnet-4-think  100   70    80    60    80    100   70    80    90    81.1
deepseek-reasoner      90    80    100   30    50    100   80    90    100   80.0
Average                87.1  67.1  78.6  58.6  74.3  94.3  77.1  82.9  92.9  79.2
Table 3. Accuracy (%) of general-purpose LLMs on 9 solver-free AOs across blueprints.
Model ID               SW    SMW   IDE   SMG   COM   SEA   CVE   BDB   CNNl  CNN
grok-4-non-reasoning   100   89    78    78    44    56    56    33    11    0
gpt-4.1                100   100   67    67    33    56    56    78    22    22
llama-4-scout          89    78    56    44    22    11    22    33    11    0
claude-sonnet-4        100   100   89    89    100   56    78    56    33    56
deepseek-chat          89    78    56    60    22    44    78    44    11    44
Average                95.6  89.0  69.2  67.6  44.2  44.6  58.0  48.8  17.6  24.4
Table 4. Accuracy (%) of reasoning-optimized LLMs on 9 solver-free AOs across blueprints.
Model ID               SW    SMW   IDE   SMG   COM   SEA   CVE   BDB   CNNl  CNN
grok-4-reasoning       100   100   100   100   89    100   100   100   67    56
gemini-2.5-flash       100   100   100   89    100   67    67    89    0     0
gemini-2.5-pro         100   100   100   100   100   89    100   100   44    44
llama-4-maverick       100   100   78    78    44    67    44    44    0     0
gpt-5-mini             100   100   100   100   100   100   100   89    44    44
claude-sonnet-4-think  100   100   100   100   100   67    89    89    22    44
deepseek-reasoner      100   89    78    78    100   89    78    78    56    56
Average                100.0 98.4  93.7  92.1  90.4  82.7  82.6  84.1  33.3  34.9
Table 5. Accuracy (%) of general-purpose LLMs across solver-based AOs.
Model ID               AO10  AO11  AO12  AO13  AO14  AO15  AO16  Average
grok-4-non-reasoning   90    80    50    50    70    60    100   71.4
gpt-4.1                100   100   50    80    70    60    100   80.0
llama-4-scout          90    90    13    40    40    20    40    47.6
claude-sonnet-4        100   90    50    90    80    70    90    81.4
deepseek-chat          80    80    25    60    60    50    90    63.6
Average                92.0  88.0  37.6  64.0  64.0  52.0  84.0  68.8
Table 6. Accuracy (%) of reasoning-optimized LLMs across solver-based AOs.
Model ID               AO10  AO11  AO12  AO13  AO14  AO15  AO16  Average
grok-4-reasoning       100   100   100   60    90    70    100   88.6
gemini-2.5-flash       100   90    63    70    80    60    80    77.6
gemini-2.5-pro         100   90    100   90    90    80    100   92.9
llama-4-maverick       70    100   13    60    60    40    90    61.9
gpt-5-mini             100   100   75    90    90    80    100   90.7
claude-sonnet-4-think  100   100   75    90    90    80    80    87.9
deepseek-reasoner      100   90    75    90    80    70    90    85.0
Average                95.7  95.7  71.6  78.6  82.9  68.6  91.4  83.5
Table 7. Accuracy (%) of general-purpose LLMs on 7 solver-based AOs across blueprints.
Model ID               SW    SMW   IDE   SMG   COM   SEA   CVE   BDB   CNNl  CNN
grok-4-non-reasoning   100   100   86    86    71    43    43    57    67    67
gpt-4.1                100   100   100   100   86    71    43    57    50    100
llama-4-scout          57    100   43    57    43    43    71    43    17    33
claude-sonnet-4        100   100   100   86    86    71    100   57    50    83
deepseek-chat          100   86    86    71    71    57    57    43    33    83
Average                91.4  97.2  83.0  80.0  71.4  57.0  62.8  51.4  43.4  73.2
Table 8. Accuracy (%) of reasoning-optimized LLMs on 7 solver-based AOs across blueprints.
Model ID               SW    SMW   IDE   SMG   COM   SEA   CVE   BDB   CNNl  CNN
grok-4-reasoning       100   100   100   100   100   86    71    71    50    100
gemini-2.5-flash       100   100   100   86    86    57    57    57    67    67
gemini-2.5-pro         100   100   100   100   100   71    100   57    67    100
llama-4-maverick       100   86    86    86    57    43    43    29    33    67
gpt-5-mini             100   100   100   100   100   100   86    57    67    100
claude-sonnet-4-think  100   100   100   86    100   71    100   71    67    100
deepseek-reasoner      100   100   86    100   86    86    86    71    67    83
Average                100.0 98.0  96.0  94.0  90.0  73.4  77.6  59.0  59.7  88.1
Table 9. Accuracy (%) of general-purpose LLMs on 16 AOs across blueprints.
Model ID               SW    SMW   IDE   SMG   COM   SEA   CVE   BDB   CNNl  CNN   Average
grok-4-non-reasoning   100   94    81    81    56    50    50    44    31    25    61.2
gpt-4.1                100   100   81    81    56    62    50    69    31    50    68.0
llama-4-scout          75    88    50    50    31    25    44    38    13    13    42.7
claude-sonnet-4        100   100   94    88    94    62    88    56    38    63    78.3
deepseek-chat          94    81    69    62    44    50    69    44    19    56    59.9
Average                93.8  92.6  75.0  72.4  56.2  49.8  60.2  50.2  26.4  41.4  61.0
Note. Bold numbers indicate models achieving the best overall performance; underlined values denote
challenging feature models where accuracy is low.
Table 10. Accuracy (%) of reasoning-optimized LLMs on 16 AOs across blueprints.
Model ID               SW    SMW   IDE   SMG   COM   SEA   CVE   BDB   CNNl  CNN   Average
grok-4-reasoning       100   100   100   100   94    94    88    88    60    73    89.7
gemini-2.5-flash       100   100   100   88    94    62    62    75    27    27    73.5
gemini-2.5-pro         100   100   100   100   100   81    100   81    53    67    88.2
llama-4-maverick       100   94    81    81    50    56    44    38    13    27    62.4
gpt-5-mini             100   100   100   100   100   100   94    75    53    67    88.9
claude-sonnet-4-think  100   100   100   94    100   69    94    81    40    67    84.5
deepseek-reasoner      100   94    81    88    94    88    81    75    60    67    82.8
Average                100.0 98.3  94.6  93.0  90.3  78.6  80.4  74.7  43.7  56.3  81.1
Note. Bold numbers indicate models achieving the best overall performance; underlined values denote
challenging feature models where accuracy is low.
B.2 Cost of LLM-based AOs
Table 11. Average runtimes (in seconds) of each LLM for 16 AOs on 10 blueprints, alongside FLAMA’s
performance.
Model ID               SW       SMW      IDE      SMG      COM      SEA      CVE      BDB      CNNl     CNN      Average
grok-4-non-reasoning   101      249      403      610      881      1217     1334     1636     1987     2275     1169.3
gpt-4.1                133      323      542      881      1374     2000     2285     2783     4128     4580     1892.9
llama-4-scout          88       196      294      382      492      683      751      881      1287     1748     680.2
claude-sonnet-4        124      269      443      686      895      1199     1375     1660     1871     2063     1168.5
deepseek-chat          197      453      752      1082     1572     2063     2330     2811     3344     3891     1849.5
grok-4-reasoning       221      535      912      1214     1701     2753     3050     4152     5294     5835     3046.7
gemini-2.5-flash       124      286      471      753      1342     2382     2654     3750     5764     7699     3222.5
gemini-2.5-pro         168      366      586      850      1256     1861     2155     2963     3952     4951     2390.8
llama-4-maverick       67       156      279      449      653      839      902      976      1472     2068     886.1
gpt-5-mini             479      1122     1982     3245     4827     6730     7769     10567    12017    13045    6118.3
claude-sonnet-4-think  270      592      971      1403     2043     3172     3573     4715     5489     5983     3011.1
deepseek-reasoner      815      1948     3428     5263     7617     10603    12029    15665    18181    20235    9508.4
FLAMA (Solver)†        0.0005   0.0011   0.0015   0.1037   0.3522   60.0982  0.1215   0.0372   60.3814  61.4404  18.45
† Results on formal inputs; the solver cannot process semi-formal blueprints.
Table 12. Token consumption of each LLM for 16 AOs on 10 blueprints.
Model ID               SW        SMW       IDE       SMG       COM       SEA       CVE       BDB       CNNl      CNN       Average
grok-4-non-reasoning   24169     51050     80823     115685    160722    217493    267548    344634    701598    1412267   397599.0
gpt-4.1                23733     49499     76478     106991    147857    198643    247339    316405    662554    1365923   379042.2
llama-4-scout          24168     49765     75602     104333    137007    180042    225816    285524    647489    1368337   374608.3
claude-sonnet-4        28149     58272     91645     126395    168487    221851    274387    344894    707425    1287725   439523.0
deepseek-chat          23629     49720     77330     107658    145780    190342    237267    301916    649363    1206488   377549.3
grok-4-reasoning       29041     63488     101618    148517    213478    301704    364204    473289    902899    1623506   505874.4
gemini-2.5-flash       38375     84453     136414    204045    305885    504374    595520    828668    1639393   2882763   765789.0
gemini-2.5-pro         40145     84391     132640    188385    268049    383351    470076    635049    1210436   2260989   602251.1
llama-4-maverick       24363     52119     81078     113823    150870    193929    242472    302917    678755    1412195   389052.1
gpt-5-mini             49531     111724    182992    282924    414198    578517    693933    916616    1345818   2107258   843251.1
claude-sonnet-4-think  37170     81463     129360    180836    249601    354763    422937    538218    940241    1539850   551644.0
deepseek-reasoner      39384     88434     147229    216841    304018    414756    492170    635532    1036354   1637181   561290.0
B.3 Error Analysis
Taxonomy and Overall Frequencies. We categorize failures into five mutually exclusive types: (i)
Format-compliant but wrong (syntactically valid, semantically incorrect), (ii) Incomplete / missing
answers (typically due to context/output limits), (iii) Nonsense text (unparsable natural language
despite a valid contract elsewhere), (iv) Contract violations (malformed or missing required tags), and
(v) Refusals (model explicitly declines to compute, e.g., “a solver is required”). Table 14 summarizes
counts aggregated over all runs.
Where and why models fail. (1) Semantic misinterpretation (structural AOs). The dominant
source of error in solver-free AOs (AO4–AO9) is misunderstanding of group semantics, especially
reading “A must have B or C” as two mandatory children instead of an alternative/or group. This yields
inflated #mandatory and deflated #alternative counts and cascades into wrong #requires/#excludes.
The effect is most visible on medium/large blueprints (SMG, SEA, BDB, CNNl, CNN).
(2) Propagation/enumeration limits (reasoning AOs). AO12 (#valid configurations) and AO15
(#false optional) require either partial enumeration or precise constraint propagation. Many models
under-approximate constraints (missed implications) or over-approximate (treating exclusive groups
as independent), producing wrong-but-formatted outputs. AO13/AO14 (core/dead features) show
similar patterns when cross-tree constraints are dense (BDB, CVE).
Table 13. Average runtimes (in seconds) and token consumption of each LLM across 16 AOs on 10 blueprints.
Model ID               Avg. Runtime  Avg. Tokens
grok-4-non-reasoning   1,169.3       397,599
gpt-4.1                1,892.9       379,042
llama-4-scout          680.2         374,608
claude-sonnet-4        1,168.5       439,523
deepseek-chat          1,849.5       377,549
grok-4-reasoning       3,046.7       505,874
gemini-2.5-flash       3,222.5       765,789
gemini-2.5-pro         2,390.8       602,251
llama-4-maverick       886.1         389,052
gpt-5-mini             6,118.3       843,251
claude-sonnet-4-think  3,011.1       551,644
deepseek-reasoner      9,508.4       561,290
FLAMA (Solver)†        18.45         -
† Results on formal inputs; the solver cannot process semi-formal blueprints.
Table 14. Failure counts per model (all blueprints & AOs). The four rightmost error types appear almost
exclusively on CNNl and CNN.
Model                  Wrong  Incomplete  Nonsense  Contract  Refusal
grok-4-non-reasoning   48     1           0         0         0
gpt-4.1                49     0           0         0         0
llama-4-scout          75     6           5         2         0
claude-sonnet-4        33     1           0         0         0
deepseek-chat          62     3           0         0         0
grok-4-reasoning       17     0           0         0         0
gemini-2.5-flash       21     27          0         0         0
gemini-2.5-pro         18     0           0         0         0
llama-4-maverick       51     6           5         1         0
gpt-5-mini             7      0           0         0         9
claude-sonnet-4-think  23     1           0         0         0
deepseek-reasoner      27     1           0         0         0
(3) Context/output constraints (very large blueprints). On CNNl and CNN, four failure modes
spike: Incomplete, Nonsense, Contract, and Refusal. Llama 4 Maverick (16K max output) frequently
truncates and resorts to summaries (e.g., reporting only “329 features” for CNNl), leading to
Incomplete/Nonsense and occasional Contract violations. Gemini 2.5 Flash exhibits many Incomplete runs
(early stopping near context limits). GPT-5 mini issues Refusals (9×) stating a solver is required
for exact counts on the largest instances. By contrast, Grok 4 Fast Reasoning avoids these failure
modes but still accumulates wrong-but-formatted errors on the most complex AOs.
Model-specific patterns. Llama 4 Scout and DeepSeek Chat accumulate many wrong-but-formatted
outputs on structural AOs; Scout also shows Nonsense and Contract errors on CNNl/CNN. Claude Sonnet
4 achieves strong results on small blueprints but consistently over-counts #mandatory when parsing
alternative/or groups (IDE/SMG/SEA/CVE), and under-performs on #valid configurations for COM, SEA,
and BDB. Grok 4 Fast Reasoning, GPT-5 mini, and Gemini 2.5 Pro are markedly more stable;
residual errors concentrate on AO12/AO15 and structural counts on CNNl/CNN. DeepSeek Reasoner
spends substantial runtime but still accumulates wrong-but-formatted errors on structural AOs
(mandatory/optional) and AO15. Gemini 2.5 Flash fails open under long contexts (Incomplete). Llama
4 Maverick is dominated by Incomplete/Nonsense due to output truncation.
Blueprint- and AO-level hot spots. Failure rates rise with blueprint complexity: SEA (depth 10) and
BDB (dense cross-tree constraints) trigger propagation mistakes; CNNl/CNN trigger context/output
issues and semantic slips in group interpretation. Across AOs, the hardest are AO12 and AO15
(propagation/enumeration); next are AO4–AO9 (semantic parsing); the easiest are verification-style
AO10/AO11/AO16.
Illustrative failure modes. Semantic slip: “A must have B or C” → counted as two mandatory
(inflated #mandatory, deflated #alternative). Partial propagation: treating an exclude as local (ignoring
transitive implications) → under-count of dead/core features. Output truncation: reasoning stops
mid-list (missing features/relationships) → Incomplete with plausible but wrong totals. Refusal:
explicit claim that exact counting requires a solver (observed in GPT-5-Mini on CNNl/CNN).
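The semantic slip can be made concrete with a toy enumeration. The sketch below is an invented illustration, not data from the study: it lists which child selections each reading of “A must have B or C” permits, showing how the misreading changes both the structural counts and the set of valid configurations.

```python
from itertools import combinations

CHILDREN = ["B", "C"]


def selections(reading):
    """Child selections allowed for 'A must have B or C' under a given reading."""
    subsets = [
        set(combo)
        for size in range(len(CHILDREN) + 1)
        for combo in combinations(CHILDREN, size)
    ]
    allowed = {
        "mandatory": lambda s: len(s) == len(CHILDREN),  # the slip: both children forced
        "or": lambda s: len(s) >= 1,                     # at least one child
        "alternative": lambda s: len(s) == 1,            # exactly one child
    }[reading]
    return [s for s in subsets if allowed(s)]


for reading in ("mandatory", "or", "alternative"):
    print(reading, selections(reading))
```

Under the mistaken reading the model reports two mandatory features and a single configuration, whereas the group readings report no mandatory children and two or three configurations.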
Summary. Most errors arise from (i) semantic misinterpretation in structural AOs and (ii) incomplete
propagation/enumeration in reasoning AOs; very large blueprints additionally expose (iii)
context/output limits. Reasoning-optimized models reduce but do not eliminate these failures.