LOOPGEN: TRAINING-FREE LOOPABLE MUSIC GENERATION
Davide Marincione⋆ Giorgio Strano⋆ Donato Crisostomi
Roberto Ribuoli Emanuele Rodolà
Sapienza University of Rome
{marincione, strano}@di.uniroma1.it
ABSTRACT
Loops (short audio segments designed for seamless repetition) are central to many music genres, particularly those rooted in dance and electronic styles. However, current generative music models struggle to produce truly loopable audio, as generating a short waveform alone does not guarantee a smooth transition from its endpoint back to its start, often resulting in audible discontinuities. We address this gap by modifying a non-autoregressive model (MAGNeT) to generate tokens in a circular pattern, letting the model attend to the beginning of the audio when creating its ending. This inference-only approach results in generations that are aware of future context and loop naturally, without the need for any additional training or data. We evaluate the consistency of loop transitions by computing token perplexity around the seam of the loop, observing a 55% improvement. Blind listening tests further confirm significant perceptual gains over baseline methods, improving mean ratings by 70%. Taken together, these results highlight the effectiveness of inference-only approaches in improving generative models and underscore the advantages of non-autoregressive methods for context-aware music generation.
github.com/gladia-research-group/loopgen
gladia-research-group.github.io/loopgen-demo
1. INTRODUCTION
Loops play a critical role in music production across a broad range of genres, from hip-hop to electronic dance music. By definition, a loop is a segment of audio that can be repeated indefinitely without noticeably jarring transitions between consecutive repetitions. These short segments function as building blocks in many compositions, providing rhythmic and harmonic foundations that can be layered, remixed, and manipulated. Indeed, entire online platforms (e.g., Splice¹) revolve around sharing and curating loops, underscoring their commercial and creative significance in contemporary music-making.
However, despite their ubiquity in practice, loops remain an underexplored challenge for generative music models. The primary issue lies in the disconnect between generating a short audio sample and ensuring that it loops correctly. Many existing generative approaches focus on producing samples that sound coherent when played from start to finish [1,2,3,4,5,6], but they do not explicitly consider the transition point from the end of the sample back to its beginning. As a result, naive repetition of these segments often yields abrupt discontinuities, limiting their practical utility for musicians and producers who rely on seamless repetition.

Figure 1. Our proposed circular padding framework for loopable sample generation. (Diagram labels: MAGNeT window, main tile, left padding, right padding.)

⋆ denotes equal contribution.
¹ https://splice.com/
In this paper, we introduce a loop-aware generation framework that modifies the iterative inference of a non-autoregressive (NAR) model to produce seamless loops. Concretely, we adopt a circular padding strategy, replicating partial portions of the loop at both ends of the generation window, so that the model attends to the loop’s beginning while generating its ending (Figure 1). This ensures a smooth endpoint-to-onset transition, effectively creating “bridging tokens” that align the tail of the sample with its onset. Our method can be used in two ways: (1) to generate entire loopable segments from scratch, or (2) to refine the end of an existing audio sample so that it loops seamlessly. Additionally, we implement a beat-aware technique that constrains the total length of the loop to align with musical bars, further promoting coherent repetition.
To evaluate this approach, we propose a new perplexity-based metric that quantifies the harshness of the cut at the seam of the loop. Intuitively, if the loop boundary is truly coherent, then it should not be perceived as irregular or dissonant, neither to a human listener, nor to an audio model, as a well-trained network should roughly match human perception.
Our contributions are:
• Loop-Aware Generation via Audio Tiling: We propose a new inference procedure that can be applied to a NAR music transformer, such as MAGNeT, to create seamlessly loopable audio samples. We call this method, and the resulting model, LoopGen.
• Perplexity-Based Seamlessness Metric: We introduce a metric to quantify the quality of loop boundaries, retrieving the entropy in the “seam” region of a track.
• Empirical Validation and Code Release: We show that our system yields superior results according to both quantitative metrics and human listening tests, and we release our code to foster future research on the generation of musical loops.
2. RELATED WORK
Recent advances in music generation leverage large-scale transformer-based architectures, which have displaced traditional recurrent neural networks for long-range sequence modeling. Pioneering systems like MuseNet [7] and Music Transformer [8] showed that attention-based models [9] could capture rich compositional structure in symbolic formats. More recently, state-of-the-art audio transformer models such as MusicGen [1] and [3,10,11,12] have demonstrated high-quality generation of waveforms, capable of handling minutes-long clips conditioned on text or user-provided melodies.
Another wave of expressive and accurate models has come with the advent of diffusion models [13] such as AudioLDM [4] and [5,6,14,15,16]. Diffusion models have since been applied to audio and music, where they deliver high-quality results on par with transformer models.
Parallel decoding has emerged as a promising alternative to speed up generation. MAGNeT [2] employs a single non-autoregressive transformer, such as those used in NLP tasks [17,18], to predict masked audio tokens iteratively, showing that a well-designed masking and rescoring strategy can close the quality gap with autoregressive baselines at a fraction of the inference cost. VampNet [19], another non-autoregressive approach, introduces inpainting capabilities and partial rewriting to refine music segments, including short repeated “vamps,” demonstrating promise for loop-centric workflows. Likewise, SoundStorm [20] applies a bidirectional transformer on semantic tokens for efficient speech and music synthesis, further illustrating the viability of non-autoregressive methods for audio.
Loopable music remains comparatively underexplored. LoopNet [21] specifically targets the generation of seamless music loops, but it is tied to a limited dataset of loops and falls short of general-purpose or free-form approaches. Other work has focused on symbolic loops in MIDI [22,23], proposing architectures that ensure segments are musically consistent when repeated. However, these methods are intrinsically different from raw audio tokens; MIDI loops require explicit pitch and instrument representations, which do not transfer to audio-generation tasks. Recently, DITTO [24] introduced an inference-time optimization that allows fine-grained control, including looping, over text-to-music diffusion models. While DITTO is notable for its high output quality, it requires memory comparable to a full fine-tuning, and it slows down inference by a factor of ∼100×.
Finally, loopable media generation is being tackled in computer vision with tiling techniques. Models like TileGAN [25] and [26,27] synthesize textures or images that repeat edge-to-edge without seams. While these visual approaches share the overarching idea of boundary alignment, they do not directly address audio continuity or musical structure.
In this paper, we build on MAGNeT’s non-autoregressive design to propose an inference-time approach to loopable music generation, avoiding additional training or data requirements. By treating time in a “circular” manner, our method enforces continuity at the loop boundary, substantially improving perceptual seamlessness in raw audio.
3. BACKGROUND
3.1 MAGNeT’s inference
Unlike typical NAR models, MAGNeT’s inference does not emit all output tokens in a single inference pass. Instead, it develops the audio clip iteratively. In particular, at each iteration, MAGNeT:
1. Generates logits for each empty token in the sequence.
2. Samples a value for each token.
3. Selects the highest scoring tokens, and marks them as fixed.
4. Re-empties the remaining non-fixed tokens and, if no empty tokens are left, terminates; otherwise, it starts the next iteration.
Following [2], we use MAGNeT’s own logits to select the tokens for the first iteration. MAGNeT’s inference can be viewed as a generalization of autoregressive inference: rather than receiving a continuous sequence of tokens and outputting the next token, MAGNeT operates on a set of empty and non-empty tokens, filling multiple empty positions in each iteration. Thanks to its non-causal self-attention, MAGNeT can condition its outputs on both past and future tokens, ensuring coherent generation across boundaries. This property makes MAGNeT (and similar non-autoregressive models) well-suited to loop creation, because the model can naturally attend to the loop’s start while generating its end, thereby facilitating a smoother, more seamless transition.
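The four-step iteration described in this section can be sketched in a few lines. This is a toy illustration, not MAGNeT's actual implementation: `logits_fn` is a hypothetical stand-in for the transformer, `MASK` is an assumed sentinel for an empty token, a single token stream is used instead of EnCodec's multiple codebooks, and the number of tokens fixed per step follows a simple linear schedule chosen for brevity.

```python
import numpy as np

MASK = -1  # assumed sentinel marking an "empty" token


def iterative_decode(logits_fn, length, vocab, steps, rng):
    """Toy masked decoding: sample, keep the most confident tokens, re-empty the rest."""
    tokens = np.full(length, MASK)
    for step in range(steps):
        empty = tokens == MASK
        if not empty.any():
            break
        logits = logits_fn(tokens)  # hypothetical model call, shape (length, vocab)
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        # Step 2: sample a value for every position.
        sampled = np.array([rng.choice(vocab, p=p) for p in probs])
        # Step 3: score each sample by its probability; fixed tokens are never re-selected.
        score = probs[np.arange(length), sampled]
        score[~empty] = -np.inf
        # Linear schedule (an assumption): the last step fixes everything that is left.
        n_fix = int(np.ceil(empty.sum() / (steps - step)))
        keep = np.argsort(-score)[:n_fix]
        tokens[keep] = sampled[keep]
        # Step 4: every other position simply stays MASK for the next iteration.
    return tokens
```

Because `logits_fn` sees the whole sequence at once, a real bidirectional model can use tokens on either side of an empty position, which is the property the circular context relies on.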
3.2 Rescoring
MAGNeT [2] also proposes a variant that linearly interpolates the probabilities given by its own logits with those of another audio model, such as MusicGen [1], to calculate the scores for the selection procedure. This results in a trade-off between higher quality and increased computational cost, as calculating another model’s probabilities requires running it alongside MAGNeT.
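As a minimal sketch of the interpolation, assume both models have already produced a probability for each sampled token. The function name and arguments are hypothetical; the weight follows the convention used later in Section 6.2, where ω = 0 disables rescoring and ω = 1 uses only the external model.

```python
import numpy as np


def rescored_probs(p_magnet, p_external, omega):
    """Linearly interpolate two models' token probabilities (omega in [0, 1])."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    p_magnet = np.asarray(p_magnet, dtype=float)
    p_external = np.asarray(p_external, dtype=float)
    # omega = 0 -> MAGNeT only; omega = 1 -> external model only.
    return (1.0 - omega) * p_magnet + omega * p_external
```

The interpolated values would then replace MAGNeT's own probabilities when ranking which sampled tokens to fix.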
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.3 Hybrid MAGNeT
As noted in [2], when presented with short audio tracks, MAGNeT produces continuations which, on average, sound better than samples created from scratch. Knowing this, we test both generating samples from scratch, and continuations of clips produced with MusicGen.
4. METHOD
While our framework is, in principle, applicable to any NAR model that generates audio iteratively, we choose MAGNeT [2] as our base system because it is currently the state-of-the-art in NAR music generation.
We adapt MAGNeT’s iterative inference to create a “circular” context around the central segment of tokens that will form our final loop. By replicating partial portions of this loop segment at the beginning and end of the generation window, MAGNeT can attend to the loop’s start when predicting its end, and vice versa. We refer to the central segment as the main loop tile.
4.1 Iterative overview
MAGNeT generates audio tokens in several iterations. Each iteration partially fills an overall generation window of length L. We isolate a specific subrange of length c near the center of this window to become our main loop tile. The remaining space on the left and right is filled with copies of the tile’s end or beginning, respectively, thus forming a circular context.
4.2 Inference algorithm
At initialization, we start with an empty (or partially filled) window of length L. In the middle of this window, we mark out c consecutive positions as the main loop tile.
1. Filling the Context. Before calling MAGNeT, and before each inference step, we copy:
• The ending of the main tile into the left side of the window, so that the first tokens of the tile can “see” what happens at the end of it.
• The beginning of the main tile into the right side of the window, so that the last tokens of the tile can “see” its start.
This ensures a fully circular arrangement: the model effectively observes how the loop’s end meets its beginning.
2. MAGNeT Inference. We run MAGNeT on the entire window of length L. Because MAGNeT uses non-causal (bidirectional) attention, tokens in the main tile can be conditioned on both the left-side copy (its own end) and the right-side copy (its own start).
3. Token Selection. At the end of each iteration, only tokens within the main tile are considered for finalizing. We keep those that MAGNeT assigns the highest probability (e.g., top-k or threshold-based), marking them as fixed (i.e., no longer empty in subsequent iterations). The rest are reset to empty.
Figure 2. Diagram of our approach. The central main tile represents the final audio segment to be looped. At each inference step, only the top-k samples are maintained and reflected in the tiles. This circular padding lets MAGNeT attend to both the start and end of the tile simultaneously, ensuring a smooth transition at the loop boundary. (Diagram labels: main tile, left padding, right padding, MAGNeT, top K.)
4. Repeat until completion. We move on to the next inference iteration, going back to step 2 (after refreshing the context copies as in step 1), until the entirety of the main tile is filled.
5. Extract the final loop. Once the iteration limit is reached or all main-tile tokens are fixed, the algorithm stops. The central c tokens (our main loop tile) are extracted as the final result. Repeating this tile end-to-start yields a seamless loop.
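The context-filling step (step 1) amounts to indexing the main tile modulo its length. The sketch below illustrates this on a single 1-D token stream; `fill_context`, `start`, and the use of `-1` for empty tokens are illustrative assumptions, and the real system operates on several EnCodec codebooks at once.

```python
import numpy as np


def fill_context(window, start, c):
    """Return a copy of `window` whose paddings are wrapped copies of the main tile.

    The main tile occupies window[start:start + c]; every other position i
    receives tile[(i - start) mod c], so the left padding ends with the tile's
    last tokens and the right padding begins with the tile's first tokens.
    """
    tile = window[start:start + c].copy()
    idx = (np.arange(len(window)) - start) % c
    return tile[idx]


window = np.array([-1, -1, 10, 11, 12, 13, -1, -1])  # tile = [10, 11, 12, 13]
print(fill_context(window, start=2, c=4))
```

Empty positions inside the tile are copied as empty, so the paddings never contain more information than the tile itself; only the tokens already fixed in the tile become visible on the other side of the seam.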
4.3 Hybrid variant
MAGNeT often produces higher-quality audio when continuing from a given prompt rather than generating entirely from scratch [2]. To take advantage of this, we first generate an audio segment C with MusicGen, empirically set to half the desired final clip length. For instance, if the final clip is intended to last 10s, we let MusicGen produce the first 5s and then provide these tokens as a partially filled main tile. This approach forces the model to generate a coherent continuation of the high-quality prompt, ensuring the ending transitions seamlessly to the beginning. Empirically, we observe that this hybrid version surpasses samples generated without an audio prompt in terms of semantic variety, objective audio quality, and musical coherence.
4.4 Signature-aware length control
A well-formed loop often sounds most musical when it aligns with full bars (e.g., 2 or 4 bars of consistent tempo). Generating loops of arbitrary length may create awkward breaks if, for instance, the tempo does not fit integer bar divisions.
To mitigate this, we use the current state-of-the-art beat-extraction system, beat_this [28], on the initial audio prompt C, to identify:
• The average duration between beats, δ ≈ 60/BPM.
• The median number of beats per bar, µ.
We use these to compute the duration of a bar that the prompt C implies. We want the duration of the entire loop l to be a) an exact multiple (or submultiple) of the duration of a bar; b) constrained in a time interval [α, β]. To achieve this, we repeatedly double or halve the initial candidate length l until it fits these constraints.
Algorithm 1 Beat Alignment algorithm
Require: Audio clip C, min/max duration α and β, preferred number of bars n
  B, D ← detected beats/downbeats in beat_this(C)
  δ ← median time elapsed between Bs      ▷ akin to 60/BPM
  µ ← median #beats between Ds            ▷ bar of the clip
  l ← nµδ                                 ▷ duration of n bars
  while l < α ∨ l > β do
    if l < α then
      l ← 2l
    else
      l ← l/2
    end if
  end while
  if l ∈ [nµδ/4, 4nµδ] then               ▷ Too far from n bars?
    return l                              ▷ Return if sufficiently close
  else
    return ∅                              ▷ Otherwise abort (try another C)
  end if
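A possible Python rendering of Algorithm 1 is sketched below. It assumes the beat and downbeat times (in seconds) have already been extracted, so no particular beat-tracker API is implied; the function name and the `max_steps` guard are additions for illustration. The guard matters because the doubling/halving loop as written can oscillate forever when β < 2α.

```python
import numpy as np


def aligned_loop_length(beats, downbeats, alpha, beta, n_bars, max_steps=64):
    """Return a loop duration in [alpha, beta] aligned to bars, or None to abort."""
    beats = np.asarray(beats, dtype=float)
    downbeats = np.asarray(downbeats, dtype=float)
    delta = np.median(np.diff(beats))  # seconds per beat, akin to 60 / BPM
    # Beats per bar: number of beats between consecutive downbeats.
    mu = np.median(np.diff(np.searchsorted(beats, downbeats)))
    target = n_bars * mu * delta       # duration of n bars
    length = target
    steps = 0
    while not (alpha <= length <= beta):
        if steps == max_steps:         # added guard: no power of two fits [alpha, beta]
            return None
        length = 2 * length if length < alpha else length / 2
        steps += 1
    # Accept only if the result stayed within a factor of 4 of the preferred length.
    if target / 4 <= length <= 4 * target:
        return float(length)
    return None


beats = np.arange(0, 16, 0.5)      # 120 BPM
downbeats = np.arange(0, 16, 2.0)  # 4 beats per bar
print(aligned_loop_length(beats, downbeats, alpha=5, beta=12, n_bars=4))
```

With these synthetic inputs a bar lasts 2 s, so four bars already fall inside [5, 12] and no doubling or halving is needed.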
When used in combination, tiled generation and the beat alignment Algorithm 1 produce loops that not only have smooth seam transitions but also respect musical structure. This results in clips that are more naturally loopable for applications like music production, live performance, or any setting in which tightly aligned repeating segments are required.
5. EVALUATION METRICS
When assessing looped music, a clip that sounds fine in a single pass may still have an abrupt transition when it repeats. Standard metrics such as FAD [29], which measure the overall distributional similarity between generated and real audio using neural embeddings, may miss such artifacts. Because FAD operates over full-length clips, it can overlook localized issues such as discontinuities at the loop boundary. To address this, we propose a metric that focuses directly on the transition seam.

Figure 3. Average cross entropy of MusicGen around the seam (highlighted with a dashed line) for different models/variants. (Axes: Time (s), from −1.00 to 1.00; Cross-Entropy, from 0 to 10. Legend: MusicGen Naive, MAGNeT Naive, MusicGen Beat Aligned Naive, MAGNeT Beat Aligned Naive, MAGNeT Tiled, MAGNeT Hybrid Tiled, LoopGen.)
5.1 Seam perplexity
To automatically assess the continuity of the loop around the seam, we adapt the idea of perplexity from language modeling. Let us assume that we have a well-trained music generation model (such as MusicGen) that can estimate probabilities for each token (or audio frame) in a music clip. While traditional perplexity sums over all tokens in a clip, we focus exclusively on the seam, that is, the transition point where the end meets the beginning, where loop artifacts are most likely to occur.
5.1.1 Cross-entropy and perplexity
First recall that, for a sequence $X = (x_1, x_2, \ldots, x_T)$, the average cross-entropy $H(X)$ is:
\[ H(X) = -\frac{1}{T} \sum_{i=1}^{T} \ln M(x_i). \tag{1} \]
Intuitively, if $M$ assigns higher probability to each token, the cross-entropy will be smaller, indicating better alignment between model and data. From cross-entropy, we derive the perplexity $P(X)$, a standard measure of how well a model predicts a sequence:
\[ P(X) = \exp\big(H(X)\big) = \exp\left( -\frac{1}{T} \sum_{i=1}^{T} \ln M(x_i) \right). \tag{2} \]
A lower perplexity value indicates that $M$ finds the sequence more predictable (or more likely).
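As a small worked example of Equations (1) and (2), the sketch below computes both quantities from a list of per-token probabilities, i.e. the values M(x_i) that the reference model assigns to the observed tokens. The function names are illustrative. A model that gives every token probability 1/4 has cross-entropy ln 4 and therefore perplexity 4, which is the usual reading of perplexity as an "effective number of equally likely choices".

```python
import math


def cross_entropy(token_probs):
    """Average negative log-probability of the observed tokens, Eq. (1)."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Exponential of the average cross-entropy, Eq. (2)."""
    return math.exp(cross_entropy(token_probs))


uniform = [0.25] * 8
print(cross_entropy(uniform), perplexity(uniform))
```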
5.1.2 Seam perplexity
While global perplexity focuses on the entire clip, loop artifacts, as in Figure 3, occur specifically at the boundary where the clip wraps around. To isolate how the model perceives that transition, we compute seam perplexity on a short window around the boundary.
Let us be given $N$ generated clips $\{X^{(k)}\}$. Each $X^{(k)}$ has length $T$, and we identify a seam boundary at index $b^{(k)}$. We then define a window of size $W$ immediately following $b^{(k)}$, i.e., the tokens
\[ X^{(k)}_{\mathrm{seam}} = \left\{ x^{(k)}_i : i \in [\, b^{(k)},\ b^{(k)} + W - 1 \,] \right\}. \tag{3} \]
The average cross-entropy for the seam tokens in $X^{(k)}$ is:
\[ H_{\mathrm{seam}}\big(X^{(k)}\big) = -\frac{1}{W} \sum_{i=b^{(k)}}^{b^{(k)}+W-1} \ln M\big(x^{(k)}_i\big). \tag{4} \]
Finally, the seam perplexity is the exponential of the mean seam cross-entropy across all $N$ clips:
\[ \text{Seam Perplexity} = \exp\left( \frac{1}{N} \sum_{k=1}^{N} H_{\mathrm{seam}}\big(X^{(k)}\big) \right). \tag{5} \]
A low seam perplexity indicates that the seam is “easy” for a strong reference model to predict, suggesting a smooth transition. Conversely, a high value suggests abrupt discontinuities or other artifacts at the loop boundary.
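Equations (3) to (5) translate directly into code once per-token log-probabilities are available. The sketch below assumes each clip is given as a pair `(log_probs, b)`, where `log_probs[i]` is ln M(x_i) under the reference model and `b` is the seam index; how the looped clip is tokenised and scored to obtain those values is not covered here, and the function names are illustrative.

```python
import numpy as np


def seam_cross_entropy(log_probs, b, W):
    """Eq. (4): mean negative log-probability over the W tokens starting at b."""
    window = np.asarray(log_probs, dtype=float)[b:b + W]
    if window.size != W:
        raise ValueError("seam window extends past the end of the clip")
    return float(-window.mean())


def seam_perplexity(clips, W):
    """Eq. (5): exponential of the mean seam cross-entropy over all clips."""
    entropies = [seam_cross_entropy(lp, b, W) for lp, b in clips]
    return float(np.exp(np.mean(entropies)))
```

Note that the average is taken over cross-entropies before exponentiating, so a single clip with a very harsh seam influences the result less than it would if per-clip perplexities were averaged.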
6. EXPERIMENTS
In the following, all generated tracks are conditioned with the same set of 100 textual prompts, and MAGNeT’s iterations are set to ⟨100, 50, 10, 10⟩ for each of the 4 codebooks from EnCodec [30] respectively. Textual prompts are generated automatically via an LLM; two examples are:
(1) “A high-energy EDM track with a powerful drop and sidechain compression”
(2) “An Irish folk dance tune with energetic fiddle and bodhrán drum”
6.1 Evaluating models
6.1.1 Baselines
For both MAGNeT and MusicGen, two baseline solutions are formulated: (i) Naive, a sample is generated and repeated, without further processing, and (ii) Beat-Aligned (BA) Naive, a sample is generated, run through Algorithm 1 to cut it at a musically valid length, and repeated.
6.1.2 Our techniques
From the contributions of this paper, three models are evaluated: (i) Tiled, samples are generated via the tiled generation technique described in Section 4.2, (ii) Hybrid Tiled, samples are generated with the same tiled technique, but starting from an audio prompt generated by MusicGen (ref. Section 4.3), and (iii) Beat Aligned Tiled, which uses the same technique as Hybrid Tiled, but with the additional application of the beat-alignment algorithm described in Section 4.4. This latter variant is our best performing model, and what, going forward, we call LoopGen. For each variant, both Seam Perplexity and fadtk’s [31] FAD (with embeddings from both VGG-ish [32] and CLAP [33] over the FMA-Pop [34] dataset) are computed.
6.2 Hyperparameter search
The most important hyperparameters for the samples’ quality we identify are classifier-free guidance (λ) and the rescoring coefficient (ω). The former controls how much a model should adhere to the conditioning information given (in our case, the textual prompt), instead of following the emerging sample. In MAGNeT’s original paper, the authors find that the best FAD is reached with λ = 10.0 (linearly decreasing to λ = 1.0 as the iterations pass) but, as the tiling constraint might increase the contextual information that the model can gain from the input, we verified that a lower coefficient translates into more organic generations.
The rescoring coefficient ω, instead, controls the interpolation coefficient introduced in Section 3.2. When ω = 0, rescoring is not applied; when ω = 1, only MusicGen’s probabilities are used. We test therefore our algorithm with multiple coefficients ranging from 0 to 1. A reasonable value for the cfg was chosen to be λ = 5.0; on the other hand, the rescoring was chosen through a thorough search conducted on both MAGNeT Tiled and MAGNeT Hybrid Tiled, generating 100 10-second samples for each model. Our results, presented in Table 1, empirically show the best rescoring to be ω = 0.5.
It is worth noting that the Hybrid version of the model consistently achieves better FAD scores, but worse perplexity. The better FAD score can be clearly attributed to the initial prompt generated by MusicGen, which consistently surpasses MAGNeT’s audio quality. This hybrid combination of different models is also the reason for the increase in perplexity, since the final generation consists of a concatenation of tokens sampled from different distributions.
Model Variant          ω      FAD_VGGish (↓)   Seam Perplexity (↓)
MAGNeT Tiled           0.0    3.05             23.88 ± 5.40
MAGNeT Tiled           0.25   3.22             25.17 ± 5.35
MAGNeT Tiled           0.50   3.51             18.15 ± 3.53
MAGNeT Tiled           0.75   3.97             24.55 ± 5.43
MAGNeT Tiled           1.0    4.35             25.42 ± 4.86
MAGNeT Hybrid Tiled    0.0    2.97             39.30 ± 7.21
MAGNeT Hybrid Tiled    0.25   2.99             47.72 ± 10.11
MAGNeT Hybrid Tiled    0.50   2.98             44.42 ± 9.33
MAGNeT Hybrid Tiled    0.75   3.00             43.93 ± 8.39
MAGNeT Hybrid Tiled    1.0    2.93             41.74 ± 9.05
Table 1. Rescoring experiments (λ = 5.0)
6.3 Final results
Below, we present our final results across six baselines and our three novel models, generated with the same previous 100 textual prompts, but 30 seconds long. Tiling models, using the technique described in Section 4.2, exhibit significantly lower Seam Perplexity compared to their non-tiled counterparts, though at the cost of a weaker FAD score. However, LoopGen, leveraging both the Hybrid approach (Section 4.3) and Algorithm 1, achieves the best FAD score among all models. This improvement comes with a slight increase in Seam Perplexity, as previously discussed.
Despite this minor trade-off in perplexity, LoopGen substantially outperforms baseline solutions, offering a more musically pleasing output due to its alignment with rhythmically meaningful cut points (Algorithm 1). This results in tracks that maintain better musical coherence compared to the standard Tiled model.
Table 2 presents the evaluation metrics, and the distribution of Seam Perplexity values is visualized in Figure 4.
Model Variant                 FAD_VGGish (↓)   FAD_CLAP (↓)   Seam Perplexity (↓)
MAGNeT Vanilla                3.36             0.33           n/a
MAGNeT Naive                  3.36             0.35           1549.06 ± 556.03
MAGNeT Beat Aligned Naive     3.34             0.34           153.22 ± 47.69
MusicGen Vanilla              2.81             0.32           n/a
MusicGen Naive                2.81             0.33           2512.39 ± 903.16
MusicGen Beat Aligned Naive   2.86             0.33           507.07 ± 163.67
MAGNeT Tiled                  4.30             0.51           56.17 ± 11.78
MAGNeT Hybrid Tiled           2.98             0.33           94.41 ± 25.77
MAGNeT LoopGen                2.80             0.31           84.29 ± 22.66
Table 2. Main experiments’ evaluation metrics. For each model, we compute the FAD with VGG-ish and CLAP embeddings using FMA-pop as a reference dataset. (For reference, we also compute FAD scores for both MAGNeT and MusicGen’s standard, non-looping, generations.)
Figure 4. Seam Perplexity distribution for the considered models (lower is better). (Horizontal axis: Seam Perplexity, logarithmic, from 1e0 to 1e5. Annotated means: MusicGen Naive 2512.39, MAGNeT Naive 1549.06, MusicGen Beat Aligned Naive 507.07, MAGNeT Beat Aligned Naive 153.22, MAGNeT Tiled 56.17, MAGNeT Hybrid Tiled 94.41, LoopGen 84.29.)
6.4 Human evaluation
Using the previous setups, we prepare two sets of samples: 100 10-second clips from LoopGen (ours), and 100 10-second clips from MAGNeT Hybrid Naive (without tiled generation, the baseline). We select the latter model because it is the most similar to ours, without any of the modifications introduced in this paper. This ensures a fair comparison, with the primary expected difference being seamlessness. The clips are chosen to be 10 seconds long for ease of listening.
With this set of samples, we conduct a blind listening experiment with a group of users. Each volunteer listens to up to 30 randomly selected clips (15 from our model, 15 from the baseline) and rates the perceptibility of the seam on a Likert scale (1 = Evident cut, 5 = Imperceptible cut). In total, we collect 506 data points from 18 listening sessions. Computing each user’s average rating for each model, we run a paired t-test such that
\[ H_0 \equiv \mu_{\mathrm{LoopGen}} = \mu_{\mathrm{baseline}} \tag{6} \]
This yields $t(17) = 12.21$, $p < 10^{-9}$, providing overwhelming evidence against $H_0$. Furthermore, the effect size is large ($d = 2.88$), confirming a very strong evidence that our technique substantially reduces the perceptibility of the seam, as can also be seen in Figure 5.

Figure 5. Distribution of perceptibility ratings, comparing LoopGen with the baseline. Lines mark each model’s mean. (Axes: Human-assigned score, 1 to 5; Density, 0.0 to 0.4.)
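The reported statistics can be cross-checked with a short sketch. It assumes the effect size is the paired-samples variant of Cohen's d (mean of the per-user differences divided by their standard deviation), which the text does not state explicitly; under that assumption t = d·√n, so with n = 18 sessions the reported t = 12.21 implies d ≈ 2.88, matching the reported value. The function name and the sample ratings are illustrative, not the study's data.

```python
import math

import numpy as np


def paired_t(ours, baseline):
    """Paired t statistic, degrees of freedom, and paired-samples Cohen's d."""
    diff = np.asarray(ours, dtype=float) - np.asarray(baseline, dtype=float)
    n = diff.size
    d = diff.mean() / diff.std(ddof=1)  # effect size on the paired differences
    t = d * math.sqrt(n)                # t statistic with n - 1 degrees of freedom
    return float(t), n - 1, float(d)


# Consistency check on the reported numbers (t = 12.21, 18 listening sessions).
print(round(12.21 / math.sqrt(18), 2))

# Made-up per-user mean ratings, for illustration only.
print(paired_t([4, 5, 4, 5], [2, 2, 3, 2]))
```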
7. CONCLUSIONS
With this research, we have introduced a novel inference-only approach to generating loopable music, leveraging a simple “circular” padding scheme within MAGNeT’s non-autoregressive framework to ensure seamless boundaries. Our experiments demonstrated clear gains in loop continuity, validated both by a new perplexity-based seam metric and by human listening tests. The whole procedure does not require additional training or specialized loop datasets. By aligning loop length to musical beats, the generated audio segments more naturally fit common compositional structures, further improving their usability in practice. Overall, this work underscores the potential of lightweight, inference-time solutions for enhancing generative music models.
8. ACKNOWLEDGMENTS
This work is supported by Sapienza University of Rome via the Seed of ERC grant “MINT.AI”, cup B83C25001040001. Furthermore, we thank all of the participants in the human evaluation test.
9. REFERENCES
[1] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[2] A. Ziv, I. Gat, G. Le Lan, T. Remez, F. Kreuk, J. Copet, A. Défossez, G. Synnaeve, and Y. Adi, “Masked audio generation using a single non-autoregressive transformer,” in The Twelfth International Conference on Learning Representations, 2024.
[3] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.
[4] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” Proceedings of the International Conference on Machine Learning, pp. 21 450–21 474, 2023.
[5] F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf, “Moûsai: Text-to-music generation with long-context latent diffusion,” 2023. [Online]. Available: https://arxiv.org/abs/2301.11757
[6] S. Forsgren and H. Martiros, “Riffusion - stable diffusion for real-time music generation,” 2022. [Online]. Available: https://riffusion.com/about
[7] O. Goren, E. Nachmani, and L. Wolf, “A-muze-net: Music generation by composing the harmony based on the generated melody,” 2021. [Online]. Available: https://arxiv.org/abs/2111.12986
[8] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. Dai, M. Hoffman, M. Dinculescu, and D. Eck, “Music transformer: Generating music with long-term structure,” 2019. [Online]. Available: https://arxiv.org/abs/1809.04281
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[10] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. Frank, “Musiclm: Generating music from text,” 2023. [Online]. Available: https://arxiv.org/abs/2301.11325
[11] S. Vasquez and M. Lewis, “Melnet: A generative model for audio in the frequency domain,” arXiv: Audio and Speech Processing, 2019.
[12] J. Gardner, S. Durand, D. Stoller, and R. Bittner, “Llark: A multimodal instruction-following language model for music,” Proc. of the International Conference on Machine Learning (ICML), 2024.
[13] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851.
[14] J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and S. Lattner, “Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.08384
[15] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast timing-conditioned latent audio diffusion,” 2024. [Online]. Available: https://arxiv.org/abs/2402.04825
[16] G. Mariani, I. Tallini, E. Postolache, M. Mancusi, L. Cosmo, and E. Rodolà, “Multi-source diffusion models for simultaneous music generation and separation,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=h922Qhkmx1
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019, pp. 4171–4186.
[18] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Elmer: A non-autoregressive pre-trained language model for efficient and effective text generation,” 2022.
[19] H. F. F. Garcia, P. Seetharaman, R. Kumar, and B. Pardo, “Vampnet: Music generation via masked acoustic token modeling,” in Ismir 2023 Hybrid Conference, 2023.
[20] Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov, N. Zeghidour, and M. Tagliasacchi, “Soundstorm: Efficient parallel audio generation,” 2023.
[21] P. Chandna, A. Ramires, X. Serra, and E. Gómez, “Loopnet: Musical loop synthesis conditioned on intuitive musical parameters,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3395–3399.
[22] G.-Y. Chen and V.-W. Soo, “Controllable music loops generation with midi and text via multi-stage cross attention and instrument-aware reinforcement learning,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, p. 6851–6859.
[23] S. Han, H. R. Ihm, M. Lee, and W. Lim, “Symbolic music loop generation with neural discrete representations,” in International Society for Music Information Retrieval Conference, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:251493133
[24] Z. Novack, J. McAuley, T. Berg-Kirkpatrick, and N. J. Bryan, “Ditto: Diffusion inference-time t-optimization for music generation,” 2024.
[25] A. Frühstück, I. Alhashim, and P. Wonka, “Tilegan: synthesis of large-scale non-homogeneous textures,” ACM Trans. Graph., vol. 38, pp. 58:1–58:11, 2019.
[26] C. Rodríguez-Pardo and E. Garces, “Seamlessgan: Self-supervised synthesis of tileable texture maps,” IEEE Trans. Vis. Comput. Graph., vol. 29, pp. 2914–2925, 2023.
[27] O. Madar and O. Fried, “Tiled diffusion,” 2024. [Online]. Available: https://arxiv.org/abs/2412.15185
[28] F. Foscarin, J. Schlüter, and G. Widmer, “Beat this! accurate beat tracking without dbn postprocessing,” in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), San Francisco, CA, United States, Nov. 2024.
[29] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A metric for evaluating music enhancement algorithms,” 2019. [Online]. Available: https://arxiv.org/abs/1812.08466
[30] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
[31] A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou, “Adapting frechet audio distance for generative music evaluation,” in Proc. IEEE ICASSP 2024, 2024. [Online]. Available: https://arxiv.org/abs/2311.01616
[32] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “Cnn architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 131–135.
[33] Y. Wu*, K. Chen*, T. Zhang*, Y. Hui*, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023.
[34] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” in 18th International Society for Music Information Retrieval Conference (ISMIR), 2017. [Online]. Available: https://arxiv.org/abs/1612.01840
A. SEAM PERPLEXITY’S ERROR MARGINS
In various tables, we present the values of our Seam Perplexity as center ± standard error. Because perplexity is the exponentiation of average cross-entropy, it is impossible to actually compute error margins directly. To obtain these values, we start from Equation (4) and compute for each dataset of samples $X = \{X_1, \ldots, X_N\}$ the mean cross-entropy:
\[ \mu_X = \frac{1}{N} \sum_{k=1}^{N} H_{\mathrm{seam}}\big(X^{(k)}\big) \tag{7} \]
and standard deviation
\[ \sigma_X = \sqrt{ \frac{1}{N-1} \sum_{k=1}^{N} \Big( \mu_X - H_{\mathrm{seam}}\big(X^{(k)}\big) \Big)^2 }. \tag{8} \]
We then compute the 95% confidence intervals of the cross-entropy
\[ \left[ \mu_X - 1.96 \frac{\sigma_X}{\sqrt{N}},\ \mu_X + 1.96 \frac{\sigma_X}{\sqrt{N}} \right], \tag{9} \]
and transform them into exponential space
\[ \left[ l = \exp\left( \mu_X - 1.96 \frac{\sigma_X}{\sqrt{N}} \right),\ r = \exp\left( \mu_X + 1.96 \frac{\sigma_X}{\sqrt{N}} \right) \right]. \tag{10} \]
Finally, we calculate the provided values as
\[ \text{center} = \frac{1}{2}(l + r), \qquad \text{standard error} = \frac{1}{2}(r - l). \tag{11} \]
This approach differs from the common method of showing a value with error margins, where the error is modeled as Gaussian, and the center value is assumed to be the empirical mean of the measured quantity. In this case, however, since the perplexity operation itself is computed as the exponentiation of its mean, it would be impossible to calculate a symmetric Gaussian error margin directly (not without running calculations on multiple folds of the data).
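The procedure of Equations (7) to (11) is short enough to sketch directly. The function below takes the per-clip seam cross-entropies of Equation (4) and returns the reported centre and margin; the function name and the `z` parameter are illustrative assumptions.

```python
import numpy as np


def seam_perplexity_with_margin(seam_entropies, z=1.96):
    """Return (center, margin) following Eqs. (7)-(11)."""
    h = np.asarray(seam_entropies, dtype=float)
    n = h.size
    mu = h.mean()                      # Eq. (7)
    sigma = h.std(ddof=1)              # Eq. (8), N - 1 in the denominator
    half = z * sigma / np.sqrt(n)      # half-width of the interval in Eq. (9)
    lower, upper = np.exp(mu - half), np.exp(mu + half)  # Eq. (10)
    return float((lower + upper) / 2), float((upper - lower) / 2)  # Eq. (11)
```

One consequence worth keeping in mind when reading the tables: the centre is the midpoint of the exponentiated interval, which equals exp(µ)·cosh(half-width) and therefore lies slightly above the seam perplexity exp(µ) of Equation (5) whenever the spread is non-zero.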
B. 10 SECONDS EXPERIMENTS
During development, we also explored the same final experiments seen in the main article (Table 2) with the 10 seconds variant of MAGNeT. The results of these experiments are detailed in Table 3 and visualized in Figure 6. Notably, the Seam Perplexity exhibits a significant change with this modification. While it is unclear whether this change is solely attributable to the different models, the shorter track length, or a combination thereof, we empirically observed no discernible perceptual difference in the seamlessness of the 10-second and 30-second samples.
Model Variant                 FAD_VGGish (↓)   FAD_CLAP (↓)   Seam Perplexity (↓)
MAGNeT Vanilla                3.05             0.39           n/a
MAGNeT Naive                  3.02             0.31           310.21 ± 98.33
MAGNeT Beat Aligned Naive     3.03             0.35           202.43 ± 67.23
MusicGen Vanilla              3.28             0.41           n/a
MusicGen Naive                3.21             0.34           529.79 ± 167.87
MusicGen Beat Aligned Naive   3.24             0.31           302.88 ± 79.91
MAGNeT Tiled                  3.51             0.40           18.15 ± 3.53
MAGNeT Hybrid Tiled           2.98             0.33           44.42 ± 9.33
MAGNeT LoopGen                2.95             0.33           60.85 ± 15.24
Table 3. 10 seconds versions of main experiments’ evaluation