STAGE: STEMMED ACCOMPANIMENT GENERATION THROUGH
PREFIX-BASED CONDITIONING
Giorgio Strano1,⋆ Chiara Ballanti1,⋆ Donato Crisostomi1
Michele Mancusi1 Luca Cosmo2 Emanuele Rodolà1
1Sapienza University of Rome, 2Ca' Foscari University of Venice
[email protected]
ABSTRACT
Recent advances in generative models have made it possible to create high-quality, coherent music, with some systems delivering production-level output. Yet, most existing models focus solely on generating music from scratch, limiting their usefulness for musicians who want to integrate such models into a human, iterative composition workflow. In this paper we introduce STAGE, our STemmed Accompaniment GEneration model, fine-tuned from the text-to-music MusicGen model to generate single-stem instrumental accompaniments conditioned on a given mixture. Inspired by instruction-tuning methods for language models, we extend the transformer's embedding matrix with a context token, enabling the model to attend to a musical context through prefix-based conditioning. Compared to the baselines, STAGE yields accompaniments that exhibit stronger coherence with the input mixture, higher audio quality, and closer alignment with textual prompts. Moreover, by conditioning on a metronome-like track, our framework naturally supports tempo-constrained generation, achieving state-of-the-art alignment with the target rhythmic structure, all without requiring any additional tempo-specific module. As a result, STAGE offers a practical, versatile tool for interactive music creation that can be readily adopted by musicians in real-world workflows.
github.com/giorgioskij/stage
giorgioskij.github.io/stage-demo
1. INTRODUCTION
Generative AI has recently transformed music composition, with large-scale models now able to produce long-form, high-quality, and stylistically consistent music. Models such as MusicGen [1], MusicLM [2], and Jukebox [3] have demonstrated that transformers trained on tokenized audio representations can generate music that rivals human compositions in coherence and production quality.
However, most of these models focus on generating music from scratch, even when they allow for conditional generation using melodies [1], chords [4–8], or text prompts. This limits their applicability in a natural music composition workflow, which is often structured in an iterative, layered fashion, gradually building compositions by adding or refining parts over time. To support this workflow, we focus on a human-centered, intuitive generation task: adding a single new stem to an existing multi-stem mixture, while also allowing for precise control over the tempo of the generated output.

⋆ denotes equal contribution.

Figure 1. Outline of our proposed model. (top) STAGE takes a musical context as input and generates a single-stem accompaniment. (bottom) STAGE takes a metronome-like track and generates a stem that follows the desired rhythmic structure.
In this paper, we introduce STAGE, a single-instrument accompaniment generation model that can be conditioned on any audio context, be it a mixture or a simple click track, to generate a coherent and rhythmically aligned accompaniment (see Figure 1). We use a simple yet effective approach to fine-tune MusicGen [1] for stemmed accompaniment generation. Our method does not require retraining any additional context encoders and relies on minimal data for fine-tuning. We leverage prefix-based conditioning, where the context is prepended to the model's input sequence, effectively serving as a prompt for the generation of the target stem. This approach draws inspiration from instruction tuning [9,10] in language models, where prepending a task-specific instruction enables a pretrained model to specialize in new tasks with minimal modifications. In our case, we treat musical contexts as the "question" and the desired accompaniment as the "answer". This enables the model to learn a token-to-token correspondence between context and continuation, specializing it for accompaniment tasks.
We evaluate our model on musical coherence using the COCOLA score [11], showing clear improvements over existing baselines, while maintaining high audio quality as measured by FAD [12] and KAD [13]. Furthermore, we show that our model supports tempo-constrained generation by simply conditioning on a metronome-like click track, without the need for tempo-specific modules or architectures.
Our contributions are three-fold:
• A prefix-based fine-tuning method for stemmed accompaniment generation.
• A lightweight, flexible approach to tempo conditioning through audio-based inputs.
• Extensive evaluation across musical coherence, rhythmic alignment, and audio fidelity, along with open-source code and model checkpoints.
2. RELATED WORK
Recent advances in generative modeling treat music as a language of discrete tokens, enabling long-context audio synthesis via transformer architectures. Jukebox [3] was an early example: it converted raw audio into a hierarchy of residual VQ-VAE tokens and used progressively deeper transformers to generate coherent extended sequences of music. Subsequent improvements in neural audio codecs, such as SoundStream [14] and EnCodec [15], inspired new designs. MusicLM [2] introduced a hierarchical two-stage approach that models separate streams of "semantic" and "acoustic" tokens, while MusicGen [1] showed that a single-stage transformer over EnCodec tokens can achieve excellent text-to-music quality. The simple architecture of MusicGen, combined with the robust audio fidelity of its generations, makes it a natural foundation for specialized tasks such as single-stem accompaniment, which we explore in this paper.
2.1 Conditional generation
Although most music LMs focus on text prompting, some approaches provide more direct musical guidance or editing capabilities. Melody-conditioned models, such as the melody variant in [1], align the generation to a guiding pitch contour but typically produce an entire mix rather than an isolated stem. MusiConGen [4] further extends MusicGen by adding conditioning over chords and beat information, allowing explicit control of harmonic and rhythmic structures. Multiple other systems have been presented, especially using diffusion models, to condition music generation on a series of chords [5–8], on stylistic references [16], or even on a combination of text, style, and a reference drums track [17].
2.2 Music editing
Recent approaches to audio-domain editing include autoregressive models like Instruct-MusicGen [18], as well as diffusion-based systems such as MSDM [19] and GMSDI [20]. While Instruct-MusicGen struggles to achieve high-quality outputs, diffusion-based models, despite their flexibility, require significantly more computational resources and do not consistently support clean, single-stem generation.
2.3 Accompaniment generation
Similarly to MSDM and GMSDI, other diffusion-based systems aim to generate or edit partial arrangements but are closed-source or limited in scope. For instance, Diff-A-Riff [21] uses a multi-step diffusion process to refine an existing track with new musical elements; however, it is not openly released. SA-ControlNet¹ uses a fine-tuning of Stable Audio Open [22] with an added ControlNet module [23], also providing a form of stemmed accompaniment generation. SingSong [24] tackles vocal-to-instrumental accompaniment, taking a vocal track as input and generating a band-like backing. This approach is highly effective for vocals but remains highly specialized. More directly aligned with our objectives, StemGen [25] enables single-stem accompaniment generation via a non-autoregressive transformer. In parallel to our work, Meta AI introduced MusicGen-Stem [26], which supports a range of editing tasks, including mixture-conditioned accompaniment generation. However, neither model has released public code or checkpoints at the time of writing, and they only provide a handful of generated samples, making it impossible to perform a rigorous comparison. Additionally, both approaches involve training dedicated transformer models from scratch, in contrast to our lightweight fine-tuning strategy.
3. BACKGROUND
MusicGen [1] is a single-stage music generation model that operates over discrete audio tokens produced by an encoder–decoder neural codec. Specifically, the authors use EnCodec [15], which converts raw audio into several parallel streams of quantized tokens (known as codebooks). Whereas some prior works (e.g., Jukebox [3], MusicLM [2]) rely on multi-stage or hierarchical architectures that process one set of tokens to then upsample another, MusicGen proposes a simple yet effective single-stage transformer language model that directly learns to generate all of these quantized tokens at once.
3.1 Architecture overview
The core of MusicGen is a GPT-like transformer decoder that is trained autoregressively over sequences of discrete audio tokens. The tokens come from a residual vector quantization scheme, where the raw waveform is first encoded into a low-frame-rate continuous representation, and then each frame is quantized by multiple "stacked" codebooks. Codebooks are organized hierarchically, and each codebook $k_i$ contains incremental residual information w.r.t. the previous codebooks $k_j$, $i > j$. The number of codebooks (set to 4 in MusicGen) determines how many parallel tokens must be modeled at each time step.
¹ github.com/EmilianPostolache/stable-audio-controlnet
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
MusicGen is released in four versions: Small (∼400M params), Medium (∼1.5B params), Large (∼3B params), and Melody (∼1.5B params). The Melody version uses both text and a short audio clip (from which the melody is extracted) as conditioning, with both sources prepended to the input. The other three rely only on text, provided via cross-attention. In this work, we use only the pre-trained MusicGen Small checkpoint.
3.2 Codebook Interleaving Patterns
One straightforward approach to convert the four parallel streams of tokens generated by EnCodec into a single stream is to flatten all codebooks, but this substantially lengthens the sequence. Conversely, predicting them in parallel underestimates cross-codebook dependencies. A practical compromise, used in MusicGen and shown in Figure 2, is the "delay" strategy, which shifts tokens from the $i$-th codebook by $i-1$ steps. This preserves inter-codebook context while requiring only one autoregressive step per frame. As a result, the delay pattern yields more efficient modeling than parallel prediction with minimal computational overhead.
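The delay strategy described above can be illustrated with a minimal sketch. It assumes tokens are stored as one Python list per codebook; the `PAD` value and both function names are hypothetical and do not come from the MusicGen code base.

```python
PAD = -1  # hypothetical padding id, not MusicGen's actual special token


def apply_delay_pattern(codes, pad=PAD):
    """Shift codebook i (1-indexed) right by i-1 steps.

    codes: list of K lists, each holding T token ids (one list per codebook).
    Returns K lists of length T + K - 1, padded where no token is available.
    """
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    total = num_steps + num_codebooks - 1
    delayed = []
    for k, stream in enumerate(codes):  # k = 0 for the first codebook
        row = [pad] * k + list(stream) + [pad] * (total - num_steps - k)
        delayed.append(row)
    return delayed


def undo_delay_pattern(delayed):
    """Inverse of apply_delay_pattern: realign all codebooks frame by frame."""
    num_codebooks = len(delayed)
    num_steps = len(delayed[0]) - (num_codebooks - 1)
    return [row[k:k + num_steps] for k, row in enumerate(delayed)]
```

With K codebooks and T frames, the delayed sequence has T + K - 1 steps, which is why the overhead relative to fully parallel prediction stays small compared with flattening (K × T steps).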
4. METHOD
In this section, we present the key components of STAGE. In particular, we discuss how it extends MusicGen's architecture and training procedure to generate single stems from an input mixture or a metronome-like beat track. We refer to both these audio inputs as the "context". During inference, the user can pass either of those audio cues or even combine them into a single waveform, which allows accompaniment generation with even tighter control over the target's rhythmic structure.
4.1 Overview
Having chosen a target instrument I, STAGE is trained to produce a stem S of that instrument, taking as input:
• An audio context, which can contain a generic audio mixture M (without our target instrument I), or a beat track B (a metronome-like pulse sequence);
• an optional text prompt T, describing the desired style and mood of the generation.
The model aims to generate audio that matches the mixture's key, style, harmony, and rhythm, enabling it to serve as a musical accompaniment. If no initial mixture is available, STAGE can instead take a metronome track B (a simple beat marking the tempo) as input, allowing it to generate the first stem of a composition. In practice, we find that overlaying the mixture M and beat track B into a single audio file at inference time allows STAGE to condition on both, preserving coherence with the mixture while achieving tighter rhythmic alignment with the beats (see Section 5.4 for results).
We train separate STAGE models for each target instrument, specifically drums and bass. For the remaining, less-represented instruments, the inherent intra-instrument variability would require significantly more data than what is available in MoisesDB.
4.2 Context token for prefix-based conditioning
Our starting point is MusicGen-Small, a lightweight variant of MusicGen. To enable conditioning, we add a single context token to the transformer's embedding matrix, allowing extra audio tokens to be prepended to the input sequence (Figure 2). These tokens come from either the mixture M or the beat track B, encoded via EnCodec [15]. This forms a prefix-based conditioning setup: once the model "sees" the context tokens, it autoregressively generates the new stem.
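As an illustration of the prefix layout, the sketch below builds the per-codebook sequence "context frames, context token, target frames" and splits it again. It is a simplification made under stated assumptions: the delay pattern is ignored, and `CODEBOOK_SIZE`, `CONTEXT_TOKEN`, and the function names are hypothetical rather than taken from the STAGE code.

```python
CODEBOOK_SIZE = 2048                 # assumed number of entries per codebook
CONTEXT_TOKEN = CODEBOOK_SIZE + 1    # hypothetical id of the added context token


def build_prefix_sequence(context_codes, target_codes, ctx_token=CONTEXT_TOKEN):
    """Return [context frames | context token | target frames] per codebook."""
    if len(context_codes) != len(target_codes):
        raise ValueError("context and target need the same number of codebooks")
    return [list(ctx) + [ctx_token] + list(tgt)
            for ctx, tgt in zip(context_codes, target_codes)]


def split_generated(sequence, ctx_token=CONTEXT_TOKEN):
    """Recover (context, stem) from each codebook row by cutting at the token."""
    contexts, stems = [], []
    for row in sequence:
        cut = row.index(ctx_token)   # position of the separator in this row
        contexts.append(row[:cut])
        stems.append(row[cut + 1:])
    return contexts, stems
```

At inference time the target part would start empty and be filled token by token; the separator tells the model where the context ends and the stem begins.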
4.3 Fine-tuning procedure
We train on the open-source, multi-stem dataset MoisesDB [27], which contains 240 stem-separated songs. Given a target instrument I to be generated, we create input data for STAGE using the following strategies:
• Form the context. For each track we either (a) mix a random subset of stems (excluding instrument I) to create M, or (b) replace the mixture with a metronome track B at the known tempo of M. We use each of the two strategies with equal probability.
• Context length. We randomize the context length in the range of 5 to 10 seconds. Hence, the model can learn to generate samples longer than the actual context window.
• Data augmentation. We apply speed transposition (in the range [0.8, 1.2]) and pitch transposition (in the range [-4, +4] semitones) with probability 0.5 to both context and target.
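The three strategies above can be summarized as a small sampling routine. This is only a sketch: the function name, the dictionary layout, the choice of a non-empty stem subset, integer semitone steps, and applying speed and pitch changes together under a single 0.5 coin flip are all assumptions, since the text does not specify these details.

```python
import random


def sample_pair_config(rng, stems, target, tempo_bpm):
    """Draw the settings for one <context, target stem> training pair."""
    others = [s for s in stems if s != target]
    if not others:
        raise ValueError("need at least one stem besides the target")
    # (a) mixture of a random stem subset, or (b) metronome, equal probability
    if rng.random() < 0.5:
        k = rng.randint(1, len(others))
        context = {"kind": "mixture", "stems": sorted(rng.sample(others, k))}
    else:
        context = {"kind": "metronome", "bpm": tempo_bpm}
    augment = rng.random() < 0.5  # same augmentation for context and target
    return {
        "context": context,
        "context_seconds": rng.uniform(5.0, 10.0),
        "speed": rng.uniform(0.8, 1.2) if augment else 1.0,
        "pitch_semitones": rng.randint(-4, 4) if augment else 0,
    }
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when comparing runs such as the ablation in Section 5.3.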
The above procedure produces <context, target stem> pairs, which we use for fine-tuning. To allow the newly introduced context token to adapt to the pretrained model, we first train only its embedding for 200 steps at a learning rate of 1e-4, keeping all other weights frozen. This warm-up phase helps the token learn a meaningful interface with the rest of the model. We then unfreeze the remaining weights and gradually ramp up their learning rate from 0 to 1e-5, while annealing the context token's rate from 1e-4 to 1e-5. Training converges in about 1,000 steps using batches of eight 10-second samples, finishing in under a day on a single NVIDIA RTX 3090 GPU.
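The two-phase schedule can be written down as a small function. The start and end values come from the text; the linear shape of the ramp and its length (assumed here to span the remaining 800 steps) are not stated and are assumptions of this sketch, as are all names.

```python
WARMUP_STEPS = 200      # context-token-only phase (from the text)
TOTAL_STEPS = 1000      # approximate convergence point (from the text)
TOKEN_LR_START = 1e-4   # context token learning rate during warm-up
FINAL_LR = 1e-5         # rate both parameter groups end at


def learning_rates(step, ramp_steps=TOTAL_STEPS - WARMUP_STEPS):
    """Return (context_token_lr, other_weights_lr) for a 0-indexed step.

    Assumes a linear ramp over `ramp_steps` steps after the warm-up.
    """
    if step < WARMUP_STEPS:
        return TOKEN_LR_START, 0.0  # all other weights are frozen
    progress = min(1.0, (step - WARMUP_STEPS) / ramp_steps)
    token_lr = TOKEN_LR_START + progress * (FINAL_LR - TOKEN_LR_START)
    other_lr = progress * FINAL_LR
    return token_lr, other_lr
```

In a typical training loop these two values would be assigned to two optimizer parameter groups, one holding only the context token embedding and one holding the remaining weights.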
4.4 Inference
At inference time, we perform the following steps:
• Tokenize the context via EnCodec.
• Pass the encoded tokens to STAGE, followed by the context token.
• Autoregressively decode the new stem's tokens.
• Reconstruct the generated stem via EnCodec's decoder.
This lightweight prefix-based framework is easy to integrate into a musician's workflow by simply providing an audio snippet of an existing track (or a metronome pulse in its place) and letting STAGE generate a single, coherent accompaniment stem.

[Figure 2: grid of residual codebooks k1 to k4 (rows) against sequence steps s1 to s14 (columns), marking the context, the context token, the padded positions, and the sequence.]
Figure 2. Illustration of the delay pattern used by MusicGen, and how the context token is placed to separate the audio context from the input sequence of the transformer.
5. EXPERIMENTS AND RESULTS
We now present the experimental setup and evaluation metrics used to assess the performance of STAGE. We focus our evaluation on two separate tasks:
• Beat Alignment, given a beat as conditioning;
• Accompaniment Generation, given a generic mixture of stems as conditioning.
For each of these tasks, we measure against state-of-the-art open-source models on comparable tasks.
5.1 Beat alignment
To measure how precisely STAGE follows the given beat, we provide the model with a raw pulse track spanning the full duration of the sample to be generated. We then evaluate the F1 score using the mir_eval² library, matching the detected beats in the output audio to the reference beat track supplied as input. For beat detection, we employ Beat-This [28], a state-of-the-art algorithm for beat tracking.
We benchmark STAGE against MusiConGen [4], a fine-tuned MusicGen variant that can be conditioned on tempo, and optionally on chord sequences.
Following the setup of MusiConGen, we evaluate our generations by conditioning on beat tracks extracted from MusDB [29] mixtures using the Beat-This algorithm. However, we find that this process introduces noise, as the beat tracker is imperfect and often detects irregular or inconsistent beat patterns across samples. This is not in line with a realistic scenario, in which a musician supplies the model with a perfectly regular beat grid to follow. For a more realistic setting, we also test our model on a uniform distribution of BPMs in the interval [100, 180].

² github.com/mir-evaluation/mir_eval
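To make the alignment metric concrete, the following sketch computes a beat F1 score between reference and detected beat times. It is a simplified greedy matcher written for illustration, not the mir_eval implementation used in the paper; the function name and the 70 ms tolerance window are assumptions of this sketch.

```python
def beat_f_measure(reference, estimated, tolerance=0.07):
    """F1 between two lists of beat times in seconds, one-to-one matching."""
    reference = sorted(reference)
    estimated = sorted(estimated)
    if not reference or not estimated:
        return 0.0
    matched = 0
    i = j = 0
    while i < len(reference) and j < len(estimated):
        diff = estimated[j] - reference[i]
        if abs(diff) <= tolerance:
            matched += 1  # pair these two beats and move past both
            i += 1
            j += 1
        elif diff < 0:
            j += 1  # estimate is too early for this and all later references
        else:
            i += 1  # no remaining estimate is close enough to this reference
    if matched == 0:
        return 0.0
    precision = matched / len(estimated)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For a regular grid at 120 BPM (one beat every 0.5 s), an output whose detected beats all fall within the tolerance window scores 1.0, while every missed or spurious beat lowers recall or precision respectively.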
Table 1 shows that the model trained on drums significantly outperforms both MusiConGen and our bass-trained variant. This supports the intuitive notion that training on drums gives the model a stronger sense of rhythm, leading to better alignment. Additionally, beat extraction is less accurate on bass-only tracks (like those generated by STAGE-bass), which can contribute to higher measured alignment error.

Model             Dataset      F1 ↑   FAD-VGGish ↓  FAD-Clap ↓
STAGE-drums       MusDB        66.88  1.40          0.23
STAGE-drums       Uniform BPM  71.57  2.05          0.24
STAGE-bass        MusDB        40.93  5.59          0.39
STAGE-bass        Uniform BPM  45.17  4.26          0.52
MusiConGen-Tempo  MusDB        61.37  1.95          –

Table 1. Comparison of audio quality (FAD) and rhythmic alignment (F1) of STAGE-drums vs. MusiConGen (with rhythm-only conditioning). The rhythm conditioning was extracted with Beat-This from the MusDB dataset. For our model, we also test on 160 samples from a uniform distribution of BPMs in the range [100, 180]. Following [4], FAD is computed with MusDB as reference.
5.2 Accompaniment coherence
To assess how well our model adds a new instrument stem to an existing mixture, we evaluate on the test set from the MoisesDB dataset [27]. For each track, we remove the target stem (drums or bass) and then ask the model to regenerate that instrument while keeping the remaining parts unchanged. We measure several metrics that target both objective audio quality and semantic coherence with the context:
• COCOLA score [11] captures harmonic and percussive coherence between the context and the newly generated stem;
• FAD-VGGish and FAD-Clap [12] assess perceptual quality by comparing the distribution of embeddings (extracted using VGGish [30] or CLAP [31]) between generated and reference audio. Since FAD measures distributional distance, we use stems from MoisesDB matching the target instrument to define the reference.
• KAD-VGGish and KAD-Clap [13]. A newly released metric that computes the distance between distributions of embeddings in a higher-order abstract space, using the kernel trick. It has similar properties to the commonly used FAD metric.
• Rhythmic Alignment (F1). We also compute the F1-score between beats extracted from the context and beats extracted from the generated stem, to assess the rhythmic alignment and coherence between context and accompaniment.

Target stem  Model              COCOLA ↑  FAD-VGGish ↓  FAD-Clap ↓  KAD-VGGish ↓  KAD-Clap ↓  Rhythmic Alignment (F1) ↑
Drums        STAGE              61.02     1.05          0.17        3.00          9.86        52.63
Drums        Instruct-MusicGen  48.24     16.44         1.34        55.25         80.79       0.20
Drums        GMSDI              45.25     15.57         1.22        57.39         54.75       25.98
Drums        SA-ControlNet      57.46     2.75          0.39        9.28          11.75       38.70
Bass         STAGE              60.20     2.97          0.37        13.45         15.40       40.94
Bass         Instruct-MusicGen  53.69     12.85         1.28        47.41         73.92       0.19
Bass         GMSDI              43.29     14.36         1.28        44.31         49.55       24.34
Bass         SA-ControlNet      59.63     2.22          0.31        6.69          15.98       47.17

Table 2. Performance comparison on the accompaniment generation task using the MoisesDB test set, with bass and drums as target stems.
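For reference, the Fréchet distance underlying FAD is commonly written as follows, under the usual assumption that the reference and generated embedding sets are each summarized by a Gaussian. The symbols $\mu_r, \Sigma_r$ (mean and covariance of the reference embeddings) and $\mu_g, \Sigma_g$ (those of the generated embeddings) are introduced here for illustration and do not appear elsewhere in the paper.

```latex
\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

Lower values indicate that the two embedding distributions are closer, which is why FAD is marked with ↓ in the tables.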
The scores of GMSDI and SA-ControlNet are computed on samples provided by the authors, while Instruct-MusicGen was run locally with the public inference code³. We were not able to compare with StemGen [25], Diff-A-Riff [21], and MusicGen-Stem [26] since their code is not publicly available.
As shown in Table 2, our proposed model outperforms all baselines in semantic coherence, audio quality, and rhythmic alignment when generating drums, and performs on par with SA-ControlNet for bass. Overall, we observe that STAGE performs slightly worse on bass across all metrics. We attribute this to two main factors:
a) Greater variability in the distribution of bass tracks within the MoisesDB dataset, which includes both electric and synth bass with distinct timbral characteristics.
b) The inherently more complex nature of bass generation, which, unlike drums, requires modeling both rhythmic and harmonic information from the context, resulting in a more challenging prediction task.

³ github.com/ldzhangyx/instruct-MusicGen
5.3 Ablation study on rhythm conditioning
To evaluate the impact of training on <metronome, target> pairs alongside <mixture, target> pairs, we conduct an ablation study (Table 3). We fine-tune a version of STAGE-drums without metronome tracks, exposing it only to <mixture, target> pairs. As expected, the model fails to generate tempo-aligned outputs when conditioned on a metronome at inference. More notably, its performance also degrades on accompaniment generation (the very task it was trained for), highlighting the broader benefit of including metronome conditioning at training.
The clear gap in both COCOLA scores (which capture rhythmic and harmonic coherence) and F1 scores for rhythmic alignment (Table 3) indicates that including <metronome, target> pairs during fine-tuning not only enables tempo-constrained generation, but also enhances the model's overall ability to perceive and reproduce rhythmic structure in standard accompaniment generation.
5.4 Combining mixture and metronome for improved alignment
As mentioned in Section 4.1, we verify that, at inference time, we can condition the model on a combination of mixture and metronome, by simply summing the waveforms. Even though STAGE has never seen such contexts during training, it is able to generalize to these conditionings and provide accompaniment generation with even greater control over the rhythmic alignment. We measure this effect on the same accompaniment generation task by comparing beat alignment (F1 score) when conditioning either on M alone or on M+B. Results, shown in Figure 3, confirm that an explicit beat track helps tighten alignment while preserving coherence with the mixture.
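Summing a mixture with a metronome can be sketched in a few lines. The click shape (a short sine burst), the 0.5 gain, the peak rescaling, and the function names are assumptions made for illustration; the text only states that the two waveforms are summed. The 32 kHz default matches MusicGen's audio sample rate.

```python
import math


def click_track(bpm, seconds, sample_rate=32000, click_ms=10.0, freq=1000.0):
    """Mono click track: a short sine burst at every beat, silence elsewhere."""
    n = int(seconds * sample_rate)
    samples = [0.0] * n
    samples_per_beat = 60.0 / bpm * sample_rate
    click_len = int(click_ms / 1000.0 * sample_rate)
    beat = 0
    while True:
        start = int(round(beat * samples_per_beat))
        if start >= n:
            break
        for t in range(min(click_len, n - start)):
            samples[start + t] = math.sin(2 * math.pi * freq * t / sample_rate)
        beat += 1
    return samples


def overlay(mixture, beat_track, beat_gain=0.5):
    """Sum two mono waveforms sample by sample, rescaling if the sum clips."""
    n = min(len(mixture), len(beat_track))
    mixed = [mixture[i] + beat_gain * beat_track[i] for i in range(n)]
    peak = max((abs(x) for x in mixed), default=0.0)
    if peak > 1.0:
        mixed = [x / peak for x in mixed]
    return mixed
```

A call such as `overlay(mixture, click_track(125, 10.0))` would yield a single waveform carrying both the musical context M and the beat track B, which can then be tokenized as one context.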
6. DISCUSSION
Below we discuss a few more general takeaways from this research, and share our opinions about research on large generative transformer models.
Target stem  Model       COCOLA ↑  FAD-VGGish ↓  FAD-Clap ↓  KAD-VGGish ↓  KAD-Clap ↓  Rhythmic Alignment (F1) ↑
Drums        STAGE       61.02     1.05          0.17        3.00          9.86        52.63
Drums        STAGE-abl.  59.28     1.16          0.17        4.37          9.59        48.05

Table 3. Ablation results: the two models are exactly the same, but STAGE-abl. never sees <metronome, target> pairs during fine-tuning. Models are tested on accompaniment generation, generating drums given a mixture of stems extracted from the test set of MoisesDB. We observe that the model which observed conditioning with metronome tracks is able to better align to the rhythmic structure of the mixture, even on the normal accompaniment generation task.
[Figure 3: bar chart of rhythmic alignment F1-score (axis from 0 to 60) for Drums and Bass, comparing "Mixture Only" with "Mixture + Metronome".]
Figure 3. Comparison of rhythmic alignment when passing only a mixture as conditioning vs. the combination of the same mixture with a metronome track. For STAGE-drums, the F1 alignment improves from 52.6 to 64.0, and for STAGE-bass from 40.9 to 46.8.
6.1 Parameter-efficiency of single-stem fine-tuning
As shown in Table 4, STAGE uses about one-tenth the trainable parameters of comparably performing systems, yet outperforms them across multiple tasks. By training on a single stem with simple prefix-based conditioning, it aligns closely with the audio context without extra encoders or modules. Fine-tuning a general model like MusicGen on a single stem focuses predictive power on a simpler waveform distribution, making it a highly parameter-efficient technique.
These experiments suggest that stacking multiple STAGE instances, one per instrument, can match or exceed general models' performance at a fraction of the parameters. This aligns with a broader trend: recent research on large language models shows that assigning subsets of parameters to specific subtasks is often more efficient than scaling monolithic models. DeepSeek V3 [32] exemplifies this with its use of Mixture of Experts, although in a different context. We believe that, given the exponentially increasing cost of very large models, attention to more parameter-efficient architectures should not be spared.

Model              # Params
STAGE              ∼0.4B
Instruct-MusicGen  ∼4.7B
GMSDI              ∼0.8B
SA-ControlNet      ∼3.8B

Table 4. Considered models' trainable parameter count.
6.2 Cross-attention in local vs. global conditioning
Before pivoting towards prefix-based conditioning, we extensively tested cross-attention for injecting musical context into the model. While fine-tuned models captured style, mood, and harmony, they struggled with precise rhythmic alignment, even with positional embeddings designed for local dependencies. Moving the conditioning source into the input stream, processed via self-attention, resolved this issue entirely, enabling accurate alignment between conditioning and output. This behavior has rarely been documented, with [33] noting that, in diffusion models, cross-attention outputs converge to a fixed point in the first few steps, splitting the process into semantic planning (via cross-attention) and subsequent image generation.
In [34], there are hints of a similar intuition. We conclude that the precise behavior of cross-attention in traditional transformer decoders, and its difference in effectiveness when handling global conditioning (e.g., the global description of an image, the meaning of a text to translate) vs. highly localized conditioning (e.g., the exact positions of the beats to follow in a musical piece), remains underexplored and still deserves attention in future research.
7. CONCLUSION
We introduced STAGE, a parameter-efficient single-instrument approach to accompaniment generation that extends MusicGen with a simple, prefix-based conditioning mechanism. By prepending an audio context to the model's input, STAGE effectively learns the relationship between context and accompaniment, delivering competitive or superior performance in audio fidelity, contextual coherence, and beat alignment compared to larger baseline systems, despite its compact size and minimal fine-tuning effort. We explored how enabling the model to be conditioned on either a full mixture or a simple metronome track not only enables it to perform strictly temporally-controlled generation, but also improves its rhythmic alignment in the more general accompaniment generation task. Future work may explore extending STAGE to additional instruments and refining its capacity for an even more granular degree of control, further expanding its applicability in real-world music production workflows.
8. ACKNOWLEDGMENTS
We acknowledge support from Sapienza University of Rome through the Seed of ERC grant "MINT.AI", CUP B83C25001040001. L.C. is supported by the PRIN 2022 project n. 2022AL45R2 (EYE-FI.AI, CUP H53D23003500001), funded by the European Union – NextGenerationEU – PNRR – M4C2, Investment 1.1. We thank Emilian Postolache for useful discussions during the early stages of this work.
9. REFERENCES
[1] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, "Simple and controllable music generation," Advances in Neural Information Processing Systems, vol. 36, 2024.
[2] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., "Musiclm: Generating music from text," arXiv preprint arXiv:2301.11325, 2023.
[3] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, "Jukebox: A generative model for music," 2020. [Online]. Available: https://arxiv.org/abs/2005.00341
[4] Y.-H. Lan, W.-Y. Hsiao, H.-C. Cheng, and Y.-H. Yang, "Musicongen: Rhythm and chord control for transformer-based text-to-music generation."
[5] S. Gao, S. Lei, F. Zhuo, H. Liu, F. Liu, B. Tang, Q. Huang, S. Kang, and Z. Wu, "An end-to-end approach for chord-conditioned song generation," in Proc. Interspeech 2024, 2024, pp. 1890–1894.
[6] K. Choi, J. Park, W. Heo, S. Jeon, and J. Park, "Chord conditioned melody generation with transformer based decoders," IEEE Access, vol. 9, pp. 42 071–42 080, 2021.
[7] S. Li and Y. Sung, "Melodydiffusion: Chord-conditioned melody generation using a transformer-based diffusion model," Mathematics, vol. 11, no. 8, p. 1915, 2023.
[8] J. Jung, A. Jansson, and D. Jeong, "Musicgen-chord: Advancing music generation through chord progressions and interactive web-ui," arXiv preprint arXiv:2412.00325, 2024.
[9] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, "Finetuned language models are zero-shot learners," in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=gEZrGCozdqR
[10] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, F. Wu, and G. Wang, "Instruction tuning for large language models: A survey," 2024. [Online]. Available: https://arxiv.org/abs/2308.10792
[11] R. Ciranni, G. Mariani, M. Mancusi, E. Postolache, G. Fabbro, E. Rodolà, and L. Cosmo, "Cocola: Coherence-oriented contrastive learning of musical audio representations," in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[12] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, "Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms," 2019.
[13] Y. Chung, P. Eu, J. Lee, K. Choi, J. Nam, and B. S. Chon, "Kad: No more fad! an effective and efficient evaluation metric for audio generation," arXiv preprint arXiv:2502.15602, 2025.
[14] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, "Soundstream: An end-to-end neural audio codec," IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 30, pp. 495–507, Nov. 2021. [Online]. Available: https://doi.org/10.1109/TASLP.2021.3129994
[15] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, "High fidelity neural audio compression," 2022. [Online]. Available: https://arxiv.org/abs/2210.13438
[16] S. Rouard, Y. Adi, J. Copet, A. Roebel, and A. Défossez, "Audio conditioning for music generation via discrete bottleneck features," in ISMIR 2024, 2024.
[17] O. Tal, A. Ziv, I. Gat, F. Kreuk, and Y. Adi, "Joint audio and symbolic conditioning for temporally controlled text-to-music generation," 2024.
[18] Y. Zhang, Y. Ikemiya, W. Choi, N. Murata, M. A. M. Ramírez, L. Lin, G. Xia, W.-H. Liao, Y. Mitsufuji, and S. Dixon, "Instruct-musicgen: Unlocking text-to-music editing for music language models via instruction tuning," CoRR, 2024.
[19] G. Mariani, I. Tallini, E. Postolache, M. Mancusi, L. Cosmo, E. Rodolà et al., "Multi-source diffusion models for simultaneous music generation and separation," in 12th International Conference on Learning Representations, ICLR 2024. International Conference on Learning Representations, ICLR, 2024.
[20] E. Postolache, G. Mariani, L. Cosmo, E. Benetos, and E. Rodolà, "Generalized multi-source inference for text conditioned music diffusion models," in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 6980–6984.
[21] J. Nistal, M. Pasini, C. Aouameur, S. Lattner, and M. Grachten, "Diff-a-riff: Musical accompaniment co-creation via latent diffusion models."
[22] Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, "Stable audio open," in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[23] L. Zhang, A. Rao, and M. Agrawala, "Adding conditional control to text-to-image diffusion models," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 3836–3847.
[24] C. Donahue, A. Caillon, A. Roberts, E. Manilow, P. Esling, A. Agostinelli, M. Verzetti, I. Simon, O. Pietquin, N. Zeghidour et al., "Singsong: Generating musical accompaniments from singing," in International Conference on Machine Learning ICML 2023, 2023.
[25] J. D. Parker, J. Spijkervet, K. Kosta, F. Yesiler, B. Kuznetsov, J.-C. Wang, M. Avent, J. Chen, and D. Le, "Stemgen: A music generation model that listens," in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1116–1120.
[26] S. Rouard, R. San Roman, Y. Adi, and A. Roebel, "Musicgen-stem: Multi-stem music generation and edition through autoregressive modeling," in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[27] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl, "Moisesdb: A dataset for source separation beyond 4-stems," arXiv preprint arXiv:2307.15913, 2023.
[28] F. Foscarin, J. Schlüter, and G. Widmer, "Beat this! accurate beat tracking without dbn postprocessing."
[29] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, "The MUSDB18 corpus for music separation," Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1117372
[30] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., "Cnn architectures for large-scale audio classification," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.
[31] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, "Clap learning audio concepts from natural language supervision," in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[32] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., "Deepseek-v3 technical report," arXiv preprint arXiv:2412.19437, 2024.
[33] H. Liu, W. Zhang, J. Xie, F. Faccio, M. Xu, T. Xiang, M. Z. Shou, J.-M. Perez-Rua, and J. Schmidhuber, "Faster diffusion via temporal attention decomposition," 2025. [Online]. Available: https://arxiv.org/abs/2404.02747
[34] D. Liang, W. Longyue, W. Di, T. Dacheng, and T. Zhaopeng, "Context-aware cross-attention for non-autoregressive translation," Proceedings of the 28th International Conference on Computational Linguistics, pp. 4396–4402, 2020. [Online]. Available: https://cir.nii.ac.jp/crid/1360299149686305408