LoopGen: Training-Free Loopable Music Generation

Author: Davide Marincione; Giorgio Strano; Donato Crisostomi; Roberto Ribuoli; Emanuele Rodolà

Publisher: Zenodo

DOI: 10.5281/zenodo.17706509

Source: https://zenodo.org/records/17706509/files/000062.pdf

LOOPGEN: TRAINING-FREE LOOPABLE MUSIC GENERATION
Davide Marincione⋆ Giorgio Strano⋆ Donato Crisostomi
Roberto Ribuoli Emanuele Rodolà
Sapienza University of Rome
{marincione, strano}@di.uniroma1.it
ABSTRACT
Loops, short audio segments designed for seamless repetition, are central to many music genres, particularly those rooted in dance and electronic styles. However, current generative music models struggle to produce truly loopable audio, as generating a short waveform alone does not guarantee a smooth transition from its endpoint back to its start, often resulting in audible discontinuities. We address this gap by modifying a non-autoregressive model (MAGNeT) to generate tokens in a circular pattern, letting the model attend to the beginning of the audio when creating its ending. This inference-only approach results in generations that are aware of future context and loop naturally, without the need for any additional training or data. We evaluate the consistency of loop transitions by computing token perplexity around the seam of the loop, observing a 55% improvement. Blind listening tests further confirm significant perceptual gains over baseline methods, improving mean ratings by 70%. Taken together, these results highlight the effectiveness of inference-only approaches in improving generative models and underscore the advantages of non-autoregressive methods for context-aware music generation.
github.com/gladia-research-group/loopgen
gladia-research-group.github.io/loopgen-demo
1. INTRODUCTION
Loops play a critical role in music production across a broad range of genres, from hip-hop to electronic dance music. By definition, a loop is a segment of audio that can be repeated indefinitely without noticeably jarring transitions between consecutive repetitions. These short segments function as building blocks in many compositions, providing rhythmic and harmonic foundations that can be layered, remixed, and manipulated. Indeed, entire online platforms (e.g., Splice 1) revolve around sharing and curating loops, underscoring their commercial and creative significance in contemporary music-making.
However, despite their ubiquity in practice, loops remain an underexplored challenge for generative music models. The primary issue lies in the disconnect between generating a short audio sample and ensuring that it loops correctly. Many existing generative approaches focus on producing samples that sound coherent when played from start to finish [1,2,3,4,5,6], but they do not explicitly consider the transition point from the end of the sample back to its beginning. As a result, naive repetition of these segments often yields abrupt discontinuities, limiting their practical utility for musicians and producers who rely on seamless repetition.

Figure 1. Our proposed circular padding framework for loopable sample generation. (Diagram labels: MAGNeT window, main tile, left padding, right padding.)

⋆ denotes equal contribution.
1 https://splice.com/
In this paper, we introduce a loop-aware generation framework that modifies the iterative inference of a non-autoregressive (NAR) model to produce seamless loops. Concretely, we adopt a circular padding strategy, replicating partial portions of the loop at both ends of the generation window, so that the model attends to the loop’s beginning while generating its ending (Figure 1). This ensures a smooth endpoint-to-onset transition, effectively creating “bridging tokens” that align the tail of the sample with its onset. Our method can be used in two ways: (1) to generate entire loopable segments from scratch, or (2) to refine the end of an existing audio sample so that it loops seamlessly. Additionally, we implement a beat-aware technique that constrains the total length of the loop to align with musical bars, further promoting coherent repetition.
To evaluate this approach, we propose a new perplexity-based metric that quantifies the harshness of the cut at the seam of the loop. Intuitively, if the loop boundary is truly coherent, then it should not be perceived as irregular or dissonant, neither to a human listener, nor to an audio model, as a well-trained network should roughly match human perception.
Our contributions are:
• Loop-Aware Generation via Audio Tiling: We propose a new inference procedure that can be applied to a NAR music transformer, such as MAGNeT, to create seamlessly loopable audio samples. We call this method, and the resulting model, LoopGen.
• Perplexity-Based Seamlessness Metric: We introduce a metric to quantify the quality of loop boundaries, retrieving the entropy in the “seam” region of a track.
• Empirical Validation and Code Release: We show that our system yields superior results according to both quantitative metrics and human listening tests, and we release our code to foster future research on the generation of musical loops.
2. RELATED WORK
Recent advances in music generation leverage large-scale transformer-based architectures, which have displaced traditional recurrent neural networks for long-range sequence modeling. Pioneering systems like MuseNet [7] and Music Transformer [8] showed that attention-based models [9] could capture rich compositional structure in symbolic formats. More recently, state-of-the-art audio transformer models such as MusicGen [1] and [3,10,11,12] have demonstrated high-quality generation of waveforms, capable of handling minutes-long clips conditioned on text or user-provided melodies.
Another wave of expressive and accurate models has come with the advent of diffusion models [13] such as AudioLDM [4] and [5,6,14,15,16]. Their application has also reached audio and music, where they give high-quality results on par with the transformer models.
Parallel decoding has emerged as a promising alternative to speed up generation. MAGNeT [2] employs a single non-autoregressive transformer, such as those used in NLP tasks [17,18], to predict masked audio tokens iteratively, showing that a well-designed masking and rescoring strategy can close the quality gap with autoregressive baselines at a fraction of the inference cost. VampNet [19], another non-autoregressive approach, introduces inpainting capabilities and partial rewriting to refine music segments, including short repeated “vamps,” demonstrating promise for loop-centric workflows. Likewise, SoundStorm [20] applies a bidirectional transformer on semantic tokens for efficient speech and music synthesis, further illustrating the viability of non-autoregressive methods for audio.
Loopable music remains comparatively underexplored. LoopNet [21] specifically targets the generation of seamless music loops, but it is tied to a limited dataset of loops, which falls short of the general-purpose or free-form approaches. Other work has focused on symbolic loops in MIDI [22,23], proposing architectures that ensure segments are musically consistent when repeated. However, these methods are intrinsically different from raw audio tokens; MIDI loops require explicit pitch and instrument representations, which do not transfer to audio-generation tasks. Recently, DITTO [24] introduced an inference-time optimization that allows fine-grained control, including looping, over text-to-music diffusion models. While DITTO is notable for its high output quality, it requires memory comparable to a full fine-tuning, and it slows down inference by a factor of ∼100×.
Finally, loopable media generation is being tackled in computer vision with tiling techniques. Models like TileGAN [25] and [26,27] synthesize textures or images that repeat edge-to-edge without seams. While these visual approaches share the overarching idea of boundary alignment, they do not directly address audio continuity or musical structure.
In this paper, we build on MAGNeT’s non-autoregressive design to propose an inference-time approach to loopable music generation, avoiding additional training or data requirements. By treating time in a “circular” manner, our method enforces continuity at the loop boundary, substantially improving perceptual seamlessness in raw audio.
3. BACKGROUND
3.1 MAGNeT’s inference
Unlike typical NAR models, MAGNeT’s inference does not emit all output tokens in a single inference pass. Instead, it develops the audio clip iteratively. In particular, at each iteration, MAGNeT:
1. Generates logits for each empty token in the sequence.
2. Samples a value for each token.
3. Selects the highest scoring tokens, and marks them as fixed.
4. Re-empties the remaining non-fixed tokens and, if no empty tokens are left, terminates; otherwise, it starts the next iteration.
Following [2], we use MAGNeT’s own logits to select the tokens for the first iteration. MAGNeT’s inference can be viewed as a generalization of autoregressive inference: rather than receiving a continuous sequence of tokens and outputting the next token, MAGNeT operates on a set of empty and non-empty tokens, filling multiple empty positions in each iteration. Thanks to its non-causal self-attention, MAGNeT can condition its outputs on both past and future tokens, ensuring coherent generation across boundaries. This property makes MAGNeT (and similar non-autoregressive models) well-suited for loop creation, because the model can naturally attend to the loop’s start while generating its end, thereby facilitating a smoother, more seamless transition.
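The four steps above can be illustrated with a toy decoder. This is only a sketch under stated assumptions, not MAGNeT's implementation: `toy_propose` is a hypothetical stand-in for the model that returns random tokens and scores, and the linear schedule for how many tokens get fixed per iteration is chosen for simplicity (the schedule actually used by MAGNeT is not described in this text).

```python
import math
import random

EMPTY = None  # marker for a position that has not been fixed yet


def toy_propose(tokens, rng, vocab_size=4):
    """Hypothetical stand-in for the model: a (token, score) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_masked_decode(length, n_iters, propose=toy_propose, seed=0):
    """Fill `length` positions over at most `n_iters` iterations."""
    rng = random.Random(seed)
    tokens = [EMPTY] * length
    for it in range(1, n_iters + 1):
        empty = [i for i, t in enumerate(tokens) if t is EMPTY]
        if not empty:
            break  # step 4: nothing left to fill, terminate
        # Steps 1-2: score and sample a candidate for every position.
        candidates = propose(tokens, rng)
        # Assumed linear schedule: after iteration `it`, a fraction
        # it / n_iters of all positions should be fixed.
        target_fixed = math.ceil(length * it / n_iters)
        n_new = target_fixed - (length - len(empty))
        # Step 3: keep the highest-scoring candidates among empty positions.
        ranked = sorted(empty, key=lambda i: candidates[i][1], reverse=True)
        for i in ranked[:n_new]:
            tokens[i] = candidates[i][0]
        # Step 4: every other position simply stays EMPTY for the next pass.
    return tokens


print(iterative_masked_decode(length=10, n_iters=3))
```

Because candidates are re-sampled at every pass, positions that were not fixed can change their proposed value once more context has been committed, which is the property the tiling method relies on.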
3.2 Rescoring
MAGNeT [2] also proposes a variant that linearly interpolates the probabilities given by its own logits with those of another audio model, such as MusicGen [1], to calculate the scores for the selection procedure. This results in a trade-off between higher quality and increased computational cost, as calculating another model’s probabilities requires running it alongside MAGNeT.
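A minimal sketch of the interpolation, assuming per-token probabilities are already available from both models. The function name `rescore` is hypothetical, and the weighting convention (a coefficient of 0 keeps only MAGNeT's probabilities, 1 keeps only the external model's) follows the description of the rescoring coefficient ω given later in Section 6.2.

```python
def rescore(p_magnet, p_external, omega):
    """Linearly interpolate two lists of per-token probabilities.

    omega = 0.0 -> MAGNeT's own probabilities only (no rescoring)
    omega = 1.0 -> the external model's probabilities only
    """
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    if len(p_magnet) != len(p_external):
        raise ValueError("both models must score the same tokens")
    return [(1.0 - omega) * pm + omega * pe
            for pm, pe in zip(p_magnet, p_external)]


# Two candidate tokens: MAGNeT prefers the first, the external model the second.
scores = rescore([0.9, 0.4], [0.2, 0.8], omega=0.5)
```

With ω = 0.5 the two models contribute equally, so a token must be plausible under both to rank highly in the selection step.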
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3.3 Hybrid MAGNeT
As noted in [2], when presented with short audio tracks, MAGNeT produces continuations which, on average, sound better than samples created from scratch. Knowing this, we test both generating samples from scratch, and continuations of clips produced with MusicGen.
4. METHOD
While our framework is, in principle, applicable to any NAR model that generates audio iteratively, we choose MAGNeT [2] as our base system because it is currently the state-of-the-art in NAR music generation.
We adapt MAGNeT’s iterative inference to create a “circular” context around the central segment of tokens that will form our final loop. By replicating partial portions of this loop segment at the beginning and end of the generation window, MAGNeT can attend to the loop’s start when predicting its end, and vice versa. We refer to the central segment as the main loop tile.
4.1 Iterative overview
MAGNeT generates audio tokens in several iterations. Each iteration partially fills an overall generation window of length L. We isolate a specific subrange of length c near the center of this window to become our main loop tile. The remaining space on the left and right is filled with copies of the tile’s end or beginning, respectively, thus forming a circular context.
4.2 Inference algorithm
At initialization, we start with an empty (or partially filled) window of length L. In the middle of this window, we mark out c consecutive positions as the main loop tile.
1. Filling the Context. Before calling MAGNeT, and before each inference step, we copy:
• The ending of the main tile into the left side of the window, so that the first tokens of the tile can “see” what happens at the end of it.
• The beginning of the main tile into the right side of the window, so that the last tokens of the tile can “see” its start.
This ensures a fully circular arrangement: the model effectively observes how the loop’s end meets its beginning.
2. MAGNeT Inference. We run MAGNeT on the entire window of length L. Because MAGNeT uses non-causal (bidirectional) attention, tokens in the main tile can be conditioned on both the left-side copy (its own end) and the right-side copy (its own start).
3. Token Selection. At the end of each iteration, only tokens within the main tile are considered for finalizing. We keep those that MAGNeT assigns the highest probability (e.g., top-k or threshold-based), marking them as fixed (i.e., no longer empty in subsequent iterations). The rest are reset to empty.
4. Repeat until completion. We move on to the next inference iteration, going back to step 2, until the entirety of the main tile is filled.
5. Extract the final loop. Once the iteration limit is reached or all main-tile tokens are fixed, the algorithm stops. The central c tokens (our main loop tile) are extracted as the final result. Repeating this tile end-to-start yields a seamless loop.

Figure 2. Diagram of our approach. The central main tile represents the final audio segment to be looped. At each inference step, only the top-k samples are maintained and reflected in the tiles. This circular padding lets MAGNeT attend to both the start and end of the tile simultaneously, ensuring a smooth transition at the loop boundary. (Diagram labels: main tile, left padding, right padding, MAGNeT, top K.)
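The context-filling step (step 1) can be sketched as follows. This is a minimal illustration rather than the released code: the function name `fill_context`, the flat list-of-tokens representation, and the modular indexing (which also covers padding longer than the tile) are assumptions made for this example.

```python
def fill_context(window, tile_start, tile_len):
    """Return a copy of `window` whose padding mirrors the main tile circularly.

    Hypothetical helper: positions left of the tile receive the tile's ending,
    positions right of it receive the tile's beginning. Empty tokens (None)
    inside the tile are copied as empty, so the padding always reflects the
    tile's current state.
    """
    if tile_len <= 0 or tile_start < 0 or tile_start + tile_len > len(window):
        raise ValueError("the tile must lie inside the window")
    tile = window[tile_start:tile_start + tile_len]
    out = list(window)
    for j in range(len(window)):
        # Offset of position j from the tile start, wrapped onto the tile.
        out[j] = tile[(j - tile_start) % tile_len]
    return out


# Window of length L = 8 with a main tile of c = 4 tokens in the middle.
window = [None, None, "a", "b", "c", "d", None, None]
padded = fill_context(window, tile_start=2, tile_len=4)
# padded == ["c", "d", "a", "b", "c", "d", "a", "b"]
```

Calling such a helper before every inference pass keeps the padding in sync with whatever tokens were fixed in the previous iteration, so the left copy always shows the current ending and the right copy the current beginning.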
4.3 Hybrid variant
MAGNeT often produces higher-quality audio when continuing from a given prompt rather than generating entirely from scratch [2]. To take advantage of this, we first generate an audio segment C with MusicGen, empirically set to half the desired final clip length. For instance, if the final clip is intended to last 10s, we let MusicGen produce the first 5s and then provide these tokens as a partially filled main tile. This approach forces the model to generate a coherent continuation of the high-quality prompt, ensuring the ending transitions seamlessly to the beginning. Empirically, we observe that this hybrid version surpasses samples generated without an audio prompt in terms of semantic variety, objective audio quality, and musical coherence.
4.4 Signature-aware length control
A well-formed loop often sounds most musical when it aligns with full bars (e.g., 2 or 4 bars of consistent tempo). Generating loops of arbitrary length may create awkward breaks if, for instance, the tempo does not fit integer bar divisions.
To mitigate this, we use the current state-of-the-art beat-extraction system, beat_this [28], on the initial audio prompt C, to identify:
• The average duration between beats, δ ≈ 60/BPM.
• The median number of beats per bar, µ.
We use these to compute the duration of a bar that the prompt C implies. We want the duration of the entire loop l to be a) an exact multiple (or submultiple) of the duration of a bar; b) constrained in a time interval [α, β]. To achieve this, we repeatedly double or halve the initial candidate length l until it fits these constraints.
Algorithm 1 Beat Alignment algorithm
Require: Audio clip C, min/max duration α and β, preferred number of bars n
    B, D ← detected beats/downbeats in beat_this(C)
    δ ← median time elapsed between Bs          ▷ akin to 60/BPM
    µ ← median #beats between Ds                ▷ bar of the clip
    l ← nµδ                                     ▷ duration of n bars
    while l < α ∨ l > β do
        if l < α then
            l ← 2l
        else
            l ← l/2
        end if
    end while
    if l ∈ [nµδ/4, 4nµδ] then                   ▷ Too far from n bars?
        return l                                ▷ Return if sufficiently close
    else
        return ∅                                ▷ Otherwise abort (try another C)
    end if
When used in combination, tiled generation and the beat alignment Algorithm 1 produce loops that not only have smooth seam transitions but also respect musical structure. This results in clips that are more naturally loopable for applications like music production, live performance, or any setting in which tightly aligned repeating segments are required.
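The length-selection part of Algorithm 1 can be sketched in Python as below. The beat tracker is deliberately left out: the beat period δ and beats per bar µ are passed in as plain numbers, so nothing here depends on the beat_this API. The function name `align_loop_length` and the `max_steps` guard are assumptions added for this example; the guard exists because, when β < 2α, doubling and halving can jump over the interval [α, β] forever.

```python
def align_loop_length(beat_period, beats_per_bar, alpha, beta,
                      n_bars=4, max_steps=64):
    """Return a loop duration in seconds, or None if no valid length exists.

    beat_period   -- delta, seconds between consecutive beats (about 60 / BPM)
    beats_per_bar -- mu, median number of beats per bar
    alpha, beta   -- minimum and maximum allowed loop duration in seconds
    n_bars        -- preferred number of bars n
    """
    if not 0 < alpha <= beta:
        raise ValueError("expected 0 < alpha <= beta")
    target = n_bars * beats_per_bar * beat_period  # duration of n bars
    length = target
    steps = 0
    while not (alpha <= length <= beta):
        if steps >= max_steps:
            return None  # interval narrower than a factor of two: give up
        length = length * 2 if length < alpha else length / 2
        steps += 1
    # Accept only lengths within a factor of four of the preferred n bars.
    if target / 4 <= length <= 4 * target:
        return length
    return None


# 120 BPM in 4/4: one beat lasts 0.5 s, so four bars last 8 s.
print(align_loop_length(0.5, 4, alpha=10.0, beta=20.0))  # 16.0
```

Returning `None` corresponds to the abort branch of Algorithm 1, where the caller is expected to try another prompt C.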
5. EVALUATION METRICS
When assessing looped music, a clip that sounds fine in a single pass may still have an abrupt transition when it repeats. Standard metrics such as FAD [29], which measure the overall distributional similarity between generated and real audio using neural embeddings, may miss such artifacts. Because FAD operates over full-length clips, it can overlook localized issues such as discontinuities at the loop boundary. To address this, we propose a metric that focuses directly on the transition seam.

Figure 3. Average cross entropy of MusicGen around the seam (highlighted with a dashed line) for different models/variants. (Axes: Time (s) from −1.00 to 1.00; Cross-Entropy. Series: MusicGen Naive, MAGNeT Naive, MusicGen Beat Aligned Naive, MAGNeT Beat Aligned Naive, MAGNeT Tiled, MAGNeT Hybrid Tiled, LoopGen.)
5.1 Seam perplexity
To automatically assess the continuity of the loop around the seam, we adapt the idea of perplexity from language modeling. Let us assume that we have a well-trained music generation model (such as MusicGen) that can estimate probabilities of each token (or audio frame) in a music clip. While traditional perplexity sums over all tokens in a clip, we focus exclusively on the seam, that is, the transition point where the end meets the beginning, where loop artifacts are most likely to occur.
5.1.1 Cross-entropy and perplexity
First recall that, for a sequence $X = (x_1, x_2, \ldots, x_T)$, the average cross-entropy $H(X)$ is:
$$H(X) = -\frac{1}{T} \sum_{i=1}^{T} \ln M(x_i), \tag{1}$$
where $M(x_i)$ denotes the probability that the model $M$ assigns to token $x_i$. Intuitively, if $M$ assigns higher probability to each token, the cross-entropy will be smaller, indicating better alignment between model and data. From cross-entropy, we derive the perplexity $P(X)$, a standard measure of how well a model predicts a sequence:
$$P(X) = \exp\big(H(X)\big) = \exp\left(-\frac{1}{T} \sum_{i=1}^{T} \ln M(x_i)\right). \tag{2}$$
A lower perplexity value indicates that $M$ finds the sequence more predictable (or more likely).
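As a small worked example of Equations (1) and (2), the sketch below computes both quantities from a list of per-token probabilities. It assumes those probabilities have already been obtained from a reference model; the function names are illustrative only.

```python
import math


def cross_entropy(token_probs):
    """Average negative log-probability of a sequence, Equation (1)."""
    if not token_probs:
        raise ValueError("the sequence must contain at least one token")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def perplexity(token_probs):
    """Exponential of the average cross-entropy, Equation (2)."""
    return math.exp(cross_entropy(token_probs))


# A model that gives every token probability 1/4 is as uncertain as a
# uniform choice among four options, so its perplexity is about 4.
print(perplexity([0.25] * 8))
```

Perplexity can thus be read as an effective number of equally likely choices per token: confident, correct predictions push it towards 1.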
5.1.2 Seam perplexity
While global perplexity focuses on the entire clip, loop artifacts, as in Figure 3, occur specifically at the boundary where the clip wraps around. To isolate how the model perceives that transition, we compute seam perplexity on a short window around the boundary.
Let us be given $N$ generated clips $\{X^{(k)}\}$. Each $X^{(k)}$ has length $T$, and we identify a seam boundary at index $b^{(k)}$. We then define a window of size $W$ immediately following $b^{(k)}$, i.e., the tokens
$$X^{(k)}_{\text{seam}} = \left\{ x^{(k)}_i : i \in [b^{(k)},\, b^{(k)} + W - 1] \right\}. \tag{3}$$
The average cross-entropy for the seam tokens in $X^{(k)}$ is:
$$H_{\text{seam}}\big(X^{(k)}\big) = -\frac{1}{W} \sum_{i=b^{(k)}}^{b^{(k)}+W-1} \ln M\big(x^{(k)}_i\big). \tag{4}$$
Finally, the seam perplexity is the exponential of the mean seam cross-entropy across all $N$ clips:
$$\text{Seam Perplexity} = \exp\left(\frac{1}{N} \sum_{k=1}^{N} H_{\text{seam}}\big(X^{(k)}\big)\right). \tag{5}$$
A low seam perplexity indicates that the seam is “easy” for a strong reference model to predict, suggesting a smooth transition. Conversely, a high value suggests abrupt discontinuities or other artifacts at the loop boundary.
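Equations (3) to (5) translate into a few lines of code. The sketch assumes each clip is given as a pair of per-token probabilities (from the reference model, evaluated on the repeated clip) and the seam index b; this data layout and the function names are assumptions for illustration, not the paper's evaluation code.

```python
import math


def seam_cross_entropy(token_probs, seam_index, window):
    """Average negative log-probability over the W tokens after the seam, Eq. (4)."""
    seam = token_probs[seam_index:seam_index + window]
    if window <= 0 or len(seam) != window:
        raise ValueError("the seam window must fit inside the sequence")
    return -sum(math.log(p) for p in seam) / window


def seam_perplexity(clips, window):
    """Exponential of the mean seam cross-entropy over all clips, Eq. (5).

    clips -- list of (token_probs, seam_index) pairs, one per generated clip
    """
    entropies = [seam_cross_entropy(p, b, window) for p, b in clips]
    return math.exp(sum(entropies) / len(entropies))


# Two toy clips that are well predicted everywhere except right after the seam.
smooth = ([0.9] * 10, 4)
abrupt = ([0.9] * 4 + [0.1] * 2 + [0.9] * 4, 4)
print(seam_perplexity([smooth], window=2), seam_perplexity([abrupt], window=2))
```

Note that the average is taken in log space before exponentiating, so a single clip with a very harsh seam influences the score less than it would under an arithmetic mean of per-clip perplexities.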
6. EXPERIMENTS
In the following, all generated tracks are conditioned with the same set of 100 textual prompts, and MAGNeT’s iterations are set to ⟨100, 50, 10, 10⟩ for each of the 4 codebooks from EnCodec [30], respectively. Textual prompts are generated automatically via an LLM; two examples are:
(1) “A high-energy EDM track with a powerful drop and sidechain compression”
(2) “An Irish folk dance tune with energetic fiddle and bodhrán drum”
6.1 Evaluating models
6.1.1 Baselines
For both MAGNeT and MusicGen, two baseline solutions are formulated: (i) Naive, a sample is generated and repeated, without further processing, and (ii) Beat-Aligned (BA) Naive, a sample is generated, run through Algorithm 1 to cut it at a musically valid length, and repeated.
6.1.2 Our techniques
From the contributions of this paper, three models are evaluated: (i) Tiled, samples are generated via the tiled generation technique described in Section 4.2, (ii) Hybrid Tiled, samples are generated with the same tiled technique, but starting from an audio prompt generated by MusicGen (ref. Section 4.3), and (iii) Beat Aligned Tiled, which uses the same technique as Hybrid Tiled, but with the additional application of the beat-alignment algorithm described in Section 4.4. This latter variant is our best performing model, and what, going forward, we call LoopGen. For each variant, both Seam Perplexity and fadtk’s [31] FAD (with embeddings from both VGGish [32] and CLAP [33] over the FMA-Pop [34] dataset) are computed.
6.2 Hyperparameter search
The most important hyperparameters for the samples’ quality we identify are classifier-free guidance (λ) and the rescoring coefficient (ω). The former controls how much a model should adhere to the conditioning information given (in our case, the textual prompt), instead of following the emerging sample. In MAGNeT’s original paper, the authors find that the best FAD is reached with λ = 10.0 (linearly decreasing to λ = 1.0 as the iterations pass) but, as the tiling constraint might increase the contextual information that the model can gain from the input, we verified that a lower coefficient translates into more organic generations.
The rescoring coefficient ω, instead, controls the interpolation coefficient introduced in Section 3.2. When ω = 0, rescoring is not applied; when ω = 1, only MusicGen’s probabilities are used. We therefore test our algorithm with multiple coefficients ranging from 0 to 1. A reasonable value for the classifier-free guidance was chosen to be λ = 5.0; the rescoring coefficient, on the other hand, was chosen through a thorough search conducted on both MAGNeT Tiled and MAGNeT Hybrid Tiled, generating 100 10-second samples for each model. Our results, presented in Table 1, empirically show the best rescoring to be ω = 0.5.
It is worth noting that the Hybrid version of the model consistently achieves better FAD scores, but worse perplexity. The better FAD score can be clearly attributed to the initial prompt generated by MusicGen, which consistently surpasses MAGNeT’s audio quality. This hybrid combination of different models is also the reason for the increase in perplexity, since the final generation consists of a concatenation of tokens sampled from different distributions.
Model    Variant        ω      FAD_vggish (↓)   Seam Perplexity (↓)
MAGNeT   Tiled          0.0    3.05             23.88 ± 5.40
MAGNeT   Tiled          0.25   3.22             25.17 ± 5.35
MAGNeT   Tiled          0.50   3.51             18.15 ± 3.53
MAGNeT   Tiled          0.75   3.97             24.55 ± 5.43
MAGNeT   Tiled          1.0    4.35             25.42 ± 4.86
MAGNeT   Hybrid Tiled   0.0    2.97             39.30 ± 7.21
MAGNeT   Hybrid Tiled   0.25   2.99             47.72 ± 10.11
MAGNeT   Hybrid Tiled   0.50   2.98             44.42 ± 9.33
MAGNeT   Hybrid Tiled   0.75   3.00             43.93 ± 8.39
MAGNeT   Hybrid Tiled   1.0    2.93             41.74 ± 9.05
Table 1. Rescoring experiments (λ = 5.0).
6.3 Final results
Below, we present our final results across six baselines and our three novel models, generated with the same previous 100 textual prompts, but 30 seconds long. Tiling models, using the technique described in Section 4.2, exhibit significantly lower Seam Perplexity compared to their non-tiled counterparts, though at the cost of a weaker FAD score. However, LoopGen, leveraging both the Hybrid approach (Section 4.3) and Algorithm 1, achieves the best FAD score among all models. This improvement comes with a slight increase in Seam Perplexity, as previously discussed.
Despite this minor trade-off in perplexity, LoopGen substantially outperforms baseline solutions, offering a more musically pleasing output due to its alignment with rhythmically meaningful cut points (Algorithm 1). This results in tracks that maintain better musical coherence compared to the standard Tiled model.
Table 2 presents the evaluation metrics, and the distribution of Seam Perplexity values is visualized in Figure 4.
Model      Variant              FAD_vggish (↓)   FAD_CLAP (↓)   Seam Perplexity (↓)
MAGNeT     Vanilla              3.36             0.33           n/a
MAGNeT     Naive                3.36             0.35           1549.06 ± 556.03
MAGNeT     Beat Aligned Naive   3.34             0.34           153.22 ± 47.69
MusicGen   Vanilla              2.81             0.32           n/a
MusicGen   Naive                2.81             0.33           2512.39 ± 903.16
MusicGen   Beat Aligned Naive   2.86             0.33           507.07 ± 163.67
MAGNeT     Tiled                4.30             0.51           56.17 ± 11.78
MAGNeT     Hybrid Tiled         2.98             0.33           94.41 ± 25.77
MAGNeT     LoopGen              2.80             0.31           84.29 ± 22.66
Table 2. Main experiments’ evaluation metrics. For each model, we compute the FAD with VGG-ish and CLAP embeddings using FMA-pop as a reference dataset. For reference, we also compute FAD scores for both MAGNeT and MusicGen’s standard, non-looping generations (the Vanilla rows).
Figure 4. Seam Perplexity distribution of the considered models (lower is better). (Log-scale Seam Perplexity axis from 1e0 to 1e5; annotated means: MusicGen Naive 2512.39, MAGNeT Naive 1549.06, MusicGen Beat Aligned Naive 507.07, MAGNeT Beat Aligned Naive 153.22, MAGNeT Tiled 56.17, MAGNeT Hybrid Tiled 94.41, LoopGen 84.29.)
6.4 Human evaluation
Using the previous setups, we prepare a set of 100 10-second clips from LoopGen (ours) and 100 10-second clips from MAGNeT Hybrid Naive (without Tiling-generation, baseline). We select the latter model because it is the most similar to ours, without any of the modifications introduced in this paper. This ensures a fair comparison, with the primary expected difference being seamlessness. The clips are chosen to be 10 seconds long for ease of listening.
With this set of samples, we conduct a blind listening experiment with a group of users. Each volunteer listens to up to 30 randomly selected clips (15 from our model, 15 from the baseline) and rates the perceptibility of the seam on a Likert scale (1 = Evident cut, 5 = Imperceptible cut). In total, we collect 506 data points from 18 listening sessions. After computing each user’s average rating for each model, we run a paired t-test with the null hypothesis
$$H_0 \equiv \mu_{\text{LoopGen}} = \mu_{\text{baseline}}. \tag{6}$$
This yields $t(17) = 12.21$, $p < 10^{-9}$, providing overwhelming evidence against $H_0$. Furthermore, the effect size is large ($d = 2.88$), confirming very strong evidence that our technique substantially reduces the perceptibility of the seam, as can also be seen in Figure 5.

Figure 5. Distribution of perceptibility ratings, comparing LoopGen with the baseline. Lines mark each model’s mean. (Axes: Human-assigned score from 1 to 5; Density.)
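For readers who want to reproduce this kind of analysis, the sketch below computes a paired t statistic and a paired-samples effect size from per-user mean ratings. The ratings in the usage example are invented for illustration, and the choice of effect size (mean difference divided by the standard deviation of the differences, for which d = t/√n) is an assumption; it is, however, consistent with the reported numbers, since 12.21/√18 ≈ 2.88.

```python
import math
from statistics import mean, stdev


def paired_t_and_effect_size(ratings_a, ratings_b):
    """Return (t statistic, effect size d, degrees of freedom) for paired data."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) < 2:
        raise ValueError("need at least two paired observations")
    diffs = [a - b for a, b in zip(ratings_a, ratings_b)]
    n = len(diffs)
    # Mean difference in units of the differences' sample standard deviation.
    d = mean(diffs) / stdev(diffs)
    t = d * math.sqrt(n)  # paired t statistic with n - 1 degrees of freedom
    return t, d, n - 1


# Hypothetical per-user mean ratings (not the study's data).
ours = [4.0, 5.0, 3.0, 4.0]
baseline = [2.0, 2.0, 2.0, 1.0]
t, d, dof = paired_t_and_effect_size(ours, baseline)

# Consistency check on the reported values: d = t / sqrt(n) with n = 18 users.
print(round(12.21 / math.sqrt(18), 2))  # 2.88
```

The p-value is omitted here because the standard library has no Student t distribution; a statistics package would be needed for that step.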
7. CONCLUSIONS
With this research, we have introduced a novel inference-only approach to generating loopable music, leveraging a simple “circular” padding scheme within MAGNeT’s non-autoregressive framework to ensure seamless boundaries. Our experiments demonstrated clear gains in loop continuity, validated both by a new perplexity-based seam metric and by human listening tests. The whole procedure does not require additional training or specialized loop datasets. By aligning loop length to musical beats, the generated audio segments more naturally fit common compositional structures, further improving their usability in practice. Overall, this work underscores the potential of lightweight, inference-time solutions for enhancing generative music models.
8. ACKNOWLEDGMENTS
This work is supported by Sapienza University of Rome via the Seed of ERC grant “MINT.AI”, cup B83C25001040001. Furthermore, we thank all of the participants in the human evaluation test.
9. REFERENCES
[1] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[2] A. Ziv, I. Gat, G. Le Lan, T. Remez, F. Kreuk, J. Copet, A. Défossez, G. Synnaeve, and Y. Adi, “Masked audio generation using a single non-autoregressive transformer,” in The Twelfth International Conference on Learning Representations, 2024.
[3] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.
[4] H. Liu, Z. Chen, Y. Yuan, X. Mei, X. Liu, D. Mandic, W. Wang, and M. D. Plumbley, “AudioLDM: Text-to-audio generation with latent diffusion models,” Proceedings of the International Conference on Machine Learning, pp. 21450–21474, 2023.
[5] F. Schneider, O. Kamal, Z. Jin, and B. Schölkopf, “Moûsai: Text-to-music generation with long-context latent diffusion,” 2023. [Online]. Available: https://arxiv.org/abs/2301.11757
[6] S. Forsgren and H. Martiros, “Riffusion - stable diffusion for real-time music generation,” 2022. [Online]. Available: https://riffusion.com/about
[7] O. Goren, E. Nachmani, and L. Wolf, “A-muze-net: Music generation by composing the harmony based on the generated melody,” 2021. [Online]. Available: https://arxiv.org/abs/2111.12986
[8] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. Dai, M. Hoffman, M. Dinculescu, and D. Eck, “Music transformer: Generating music with long-term structure,” 2019. [Online]. Available: https://arxiv.org/abs/1809.04281
[9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
[10] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. Frank, “Musiclm: Generating music from text,” 2023. [Online]. Available: https://arxiv.org/abs/2301.11325
[11] S. Vasquez and M. Lewis, “Melnet: A generative model for audio in the frequency domain,” arXiv: Audio and Speech Processing, 2019.
[12] J. Gardner, S. Durand, D. Stoller, and R. Bittner, “Llark: A multimodal instruction-following language model for music,” Proc. of the International Conference on Machine Learning (ICML), 2024.
[13] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” in Advances in Neural Information Processing Systems, vol. 33, 2020, pp. 6840–6851.
[14] J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and S. Lattner, “Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.08384
[15] Z. Evans, C. Carr, J. Taylor, S. H. Hawley, and J. Pons, “Fast timing-conditioned latent audio diffusion,” 2024. [Online]. Available: https://arxiv.org/abs/2402.04825
[16] G. Mariani, I. Tallini, E. Postolache, M. Mancusi, L. Cosmo, and E. Rodolà, “Multi-source diffusion models for simultaneous music generation and separation,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=h922Qhkmx1
[17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, 2019, pp. 4171–4186.
[18] J. Li, T. Tang, W. X. Zhao, J.-Y. Nie, and J.-R. Wen, “Elmer: A non-autoregressive pre-trained language model for efficient and effective text generation,” 2022.
[19] H. F. F. Garcia, P. Seetharaman, R. Kumar, and B. Pardo, “Vampnet: Music generation via masked acoustic token modeling,” in Ismir 2023 Hybrid Conference, 2023.
[20] Z. Borsos, M. Sharifi, D. Vincent, E. Kharitonov, N. Zeghidour, and M. Tagliasacchi, “Soundstorm: Efficient parallel audio generation,” 2023.
[21] P. Chandna, A. Ramires, X. Serra, and E. Gómez, “Loopnet: Musical loop synthesis conditioned on intuitive musical parameters,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 3395–3399.
[22] G.-Y. Chen and V.-W. Soo, “Controllable music loops generation with midi and text via multi-stage cross attention and instrument-aware reinforcement learning,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6851–6859.
[23] S. Han, H. R. Ihm, M. Lee, and W. Lim,
“Symbolic music loop gene a ion wi h neu al disc e e
ep esen a ions,” in In e na ional Socie y o Music
In o ma ion Re ie al Con e ence, 2022. [Online].
A ailable: h ps://api.seman icschola .o g/Co pusID:
251493133
[24] Z. No ack, J. McAuley, T. Be g-Ki kpa ick, and N. J.
B yan, “Di o: Di usion in e ence- ime -op imiza ion
o music gene a ion,” 2024.
[25] A. F ühs ück, I. Alhashim, and P. Wonka, “Tilegan:
syn hesis o la ge-scale non-homogeneous ex u es,”
ACM T ans. G aph., ol. 38, pp. 58:1–58:11, 2019.
[26] C. Rod íguez-Pa do and E. Ga ces, “Seamlessgan:
Sel -supe ised syn hesis o ileable ex u e maps,”
IEEE T ans. Vis. Compu . G aph., ol. 29, pp. 2914–
2925, 2023.
[27] O. Mada and O. F ied, “Tiled di usion,” 2024.
[Online]. A ailable: h ps://a xi .o g/abs/2412.15185
[28] F. Foscarin, J. Schlüter, and G. Widmer, “Beat this! accurate beat tracking without dbn postprocessing,” in Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR), San Francisco, CA, United States, Nov. 2024.
[29] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A metric for evaluating music enhancement algorithms,” 2019. [Online]. Available: https://arxiv.org/abs/1812.08466
[30] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
[31] A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou, “Adapting frechet audio distance for generative music evaluation,” in Proc. IEEE ICASSP 2024, 2024. [Online]. Available: https://arxiv.org/abs/2311.01616
[32] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “Cnn architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 131–135.
[33] Y. Wu*, K. Chen*, T. Zhang*, Y. Hui*, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023.
[34] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” in 18th International Society for Music Information Retrieval Conference (ISMIR), 2017. [Online]. Available: https://arxiv.org/abs/1612.01840
A. SEAM PERPLEXITY’S ERROR MARGINS
In various tables, we present the values of our Seam Perplexity as center ± standard error. Because perplexity is the exponentiation of average cross-entropy, it is impossible to actually compute error margins directly. To obtain these values, we start from Equation (4) and compute for each dataset of samples $\mathcal{X} = \{X^{(1)}, \dots, X^{(N)}\}$ the mean cross-entropy:
$$\mu_{\mathcal{X}} = \frac{1}{N} \sum_{k=1}^{N} H_{\text{seam}}\left(X^{(k)}\right) \tag{7}$$
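and standard deviation
$$\sigma_{\mathcal{X}} = \sqrt{\frac{1}{N-1} \sum_{k=1}^{N} \left(\mu_{\mathcal{X}} - H_{\text{seam}}\left(X^{(k)}\right)\right)^{2}}. \tag{8}$$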
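We then compute the 95% confidence intervals of the cross-entropy
$$\left[\mu_{\mathcal{X}} - 1.96\,\frac{\sigma_{\mathcal{X}}}{\sqrt{N}},\; \mu_{\mathcal{X}} + 1.96\,\frac{\sigma_{\mathcal{X}}}{\sqrt{N}}\right], \tag{9}$$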
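and transform them into exponential space
$$\left[\, l = \exp\left(\mu_{\mathcal{X}} - 1.96\,\frac{\sigma_{\mathcal{X}}}{\sqrt{N}}\right),\; r = \exp\left(\mu_{\mathcal{X}} + 1.96\,\frac{\sigma_{\mathcal{X}}}{\sqrt{N}}\right) \right]. \tag{10}$$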
Finally, we calculate the provided values as
$$\text{center} = \frac{1}{2}(l + r), \qquad \text{standard error} = \frac{1}{2}(r - l). \tag{11}$$
This approach differs from the common method of showing a value with error margins, where the error is modeled as Gaussian, and the center value is assumed to be the empirical mean of the measured quantity. In this case, however, since the perplexity operation itself is computed as the exponentiation of its mean, it would be impossible to calculate a symmetric Gaussian error margin directly (not without running calculations on multiple folds of the data).
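The procedure of Equations (7) to (11) can be sketched in a few lines of Python. This is a minimal illustration and not the authors' code: the function name `seam_perplexity_margin`, the parameter `z`, and the example cross-entropy values are hypothetical, and the input is assumed to be one seam cross-entropy (in nats) per generated sample.

```python
import math
from statistics import mean, stdev


def seam_perplexity_margin(cross_entropies, z=1.96):
    """Return (center, standard_error) as defined in Eq. (11).

    cross_entropies: per-sample seam cross-entropies, assumed to be in nats.
    z: normal quantile for the confidence level (1.96 for 95%).
    """
    n = len(cross_entropies)
    if n < 2:
        raise ValueError("at least two samples are needed for Eq. (8)")

    mu = mean(cross_entropies)        # Eq. (7): mean cross-entropy
    sigma = stdev(cross_entropies)    # Eq. (8): sample std (N - 1 denominator)
    delta = z * sigma / math.sqrt(n)  # Eq. (9): half-width in cross-entropy space

    lower = math.exp(mu - delta)      # Eq. (10): lower bound in perplexity space
    upper = math.exp(mu + delta)      # Eq. (10): upper bound in perplexity space

    center = 0.5 * (lower + upper)          # Eq. (11)
    standard_error = 0.5 * (upper - lower)  # Eq. (11)
    return center, standard_error


if __name__ == "__main__":
    # Hypothetical cross-entropy values, for illustration only.
    example = [3.9, 4.2, 4.0, 4.4, 3.8]
    center, err = seam_perplexity_margin(example)
    print(f"Seam Perplexity: {center:.2f} ± {err:.2f}")
```

The interval is symmetric around the mean only in cross-entropy space; after exponentiation it is skewed, which is why the center is taken as the midpoint of the transformed bounds rather than as the exponential of the mean.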
B. 10 SECONDS EXPERIMENTS
During development, we also ran the final experiments of the main article (Table 2) with the 10-second variant of MAGNeT. The results of these experiments are detailed in Table 3 and visualized in Figure 6. Notably, the Seam Perplexity exhibits a significant change with this modification. While it is unclear whether this change is solely attributable to the different models, the shorter track length, or a combination thereof, we empirically observed no discernible perceptual difference in the seamlessness of the 10-second and 30-second samples.
| Model | Variant | FAD_vggish (↓) | FAD_CLAP (↓) | Seam Perplexity (↓) |
|---|---|---|---|---|
| MAGNeT | Vanilla | 3.05 | 0.39 | — |
| MAGNeT | Naive | 3.02 | 0.31 | 310.21 ± 98.33 |
| MAGNeT | Beat Aligned Naive | 3.03 | 0.35 | 202.43 ± 67.23 |
| MusicGen | Vanilla | 3.28 | 0.41 | — |
| MusicGen | Naive | 3.21 | 0.34 | 529.79 ± 167.87 |
| MusicGen | Beat Aligned Naive | 3.24 | 0.31 | 302.88 ± 79.91 |
| MAGNeT | Tiled | 3.51 | 0.40 | 18.15 ± 3.53 |
| MAGNeT | Hybrid Tiled | 2.98 | 0.33 | 44.42 ± 9.33 |
| MAGNeT | LoopGen | 2.95 | 0.33 | 60.85 ± 15.24 |

Table 3. Evaluation of the 10-second versions of the main experiments.
