Instructional Code Editing Using Transformer Models

Author: Mercado, Yadiel; Torres, Gabriel; Alvarez, Michael
Publisher: Zenodo
DOI: 10.5281/zenodo.17307177
Source: https://zenodo.org/records/17307177/files/abstract.pdf
Yadiel Mercado1†, Gabriel Torres1, Michael Alvarez1*†
1*Computer Science Department, University of Puerto Rico at Rio
Piedras, 17 Ave. Universidad STE 1701, San Juan, 00925, Puerto Rico,
USA.
*Corresponding author(s). E-mail(s): michael.al [email protected];
Contributing authors: y[email protected];
[email protected];
†These authors contributed equally to this work.
Abstract
This project explores instruction-guided code editing through the fine-tuning of
transformer-based language models within compute-constrained environments.
We focus on CodeT5-base, a pre-trained encoder-decoder model designed for
software engineering tasks such as code understanding and generation. The model
was fine-tuned on a curated subset (25%) of the InstructCoder dataset, which
consists of natural language instruction–input–output triplets tailored for code
transformation. Data preparation included tokenization with Hugging Face’s
AutoTokenizer, capped at 1024 tokens to accommodate long code sequences.
The training was executed using the Seq2SeqTrainer module on Google Colab,
leveraging mixed-precision (fp16) training, gradient accumulation, and frequent
checkpointing to maximize efficiency under limited GPU resources. To evaluate
model performance, we developed a custom Python script that computes
both character-level and word-level similarity metrics between model predictions
and target outputs. These scores were further analyzed using a binning
strategy and visualized with confusion matrix-style summaries. Our results
show that over 12% of model outputs achieve more than 95% word-level
similarity, indicating promising precision despite minimal training. Furthermore,
BLEU score comparisons across CodeT5, FlanT5, CodeLlama-13B, and GPT-4o
models revealed that smaller fine-tuned models can outperform larger,
general-purpose ones in task-specific settings. These findings suggest that lightweight
instruction-tuned models, when trained with focused data and efficient pipelines,
can offer a cost-effective and scalable alternative for automated code editing
tasks. The work reinforces the utility of domain-specific fine-tuning strategies and
lays the groundwork for future exploration in low-resource software engineering
environments.
Keywords: Transformer models, CodeT5, Fine-tuning, Low-resource training,
Software engineering, Large Language Models (LLMs)
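
The evaluation described in the abstract (character-level and word-level similarity, followed by binning and a confusion matrix-style summary) can be illustrated with a minimal sketch. The abstract does not specify the similarity formula or the bin boundaries, so everything below is an assumption: `difflib.SequenceMatcher` stands in for the similarity metric, and `BIN_EDGES`, `BIN_LABELS`, and all function names are hypothetical choices made only for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical bin boundaries; the abstract does not state the ones actually used.
BIN_EDGES = [0.0, 0.5, 0.8, 0.95]
BIN_LABELS = ["<50%", "50-80%", "80-95%", ">=95%"]


def char_similarity(prediction: str, target: str) -> float:
    """Similarity in [0, 1] computed over individual characters (assumed metric)."""
    return SequenceMatcher(None, prediction, target, autojunk=False).ratio()


def word_similarity(prediction: str, target: str) -> float:
    """Similarity in [0, 1] computed over whitespace-separated tokens (assumed metric)."""
    return SequenceMatcher(
        None, prediction.split(), target.split(), autojunk=False
    ).ratio()


def bin_label(score: float) -> str:
    """Return the label of the highest bin whose lower edge the score reaches."""
    label = BIN_LABELS[0]
    for edge, name in zip(BIN_EDGES, BIN_LABELS):
        if score >= edge:
            label = name
    return label


def cross_tabulate(pairs):
    """Count (character-level bin, word-level bin) combinations over all pairs.

    `pairs` is an iterable of (prediction, target) strings. The resulting
    Counter can be laid out as a grid, similar to a confusion matrix.
    """
    table = Counter()
    for prediction, target in pairs:
        char_bin = bin_label(char_similarity(prediction, target))
        word_bin = bin_label(word_similarity(prediction, target))
        table[(char_bin, word_bin)] += 1
    return table
```

Under these assumptions, the share of outputs in the top word-level bin would be the sum of the counts whose second key is `">=95%"`, divided by the number of pairs. That quantity is comparable in spirit to the 12% figure reported above, although the exact value depends on the metric and bin edges actually used.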