STAGE: Stemmed Accompaniment Generation Through Prefix-Based Conditioning

Author: Giorgio Strano; Chiara Ballanti; Donato Crisostomi; Michele Mancusi; Luca Cosmo; Emanuele Rodolà

Publisher: Zenodo

DOI: 10.5281/zenodo.17706555

Source: https://zenodo.org/records/17706555/files/000077.pdf

STAGE: STEMMED ACCOMPANIMENT GENERATION THROUGH
PREFIX-BASED CONDITIONING
Giorgio Strano1,⋆ Chiara Ballanti1,⋆ Donato Crisostomi1
Michele Mancusi1 Luca Cosmo2 Emanuele Rodolà1
1Sapienza University of Rome, 2Ca’ Foscari University of Venice
ABSTRACT
Recent advances in generative models have made it possible to create high-quality, coherent music, with some systems delivering production-level output. Yet, most existing models focus solely on generating music from scratch, limiting their usefulness for musicians who want to integrate such models into a human, iterative composition workflow. In this paper we introduce STAGE, our STemmed Accompaniment GEneration model, fine-tuned from the text-to-music MusicGen model to generate single-stem instrumental accompaniments conditioned on a given mixture. Inspired by instruction-tuning methods for language models, we extend the transformer’s embedding matrix with a context token, enabling the model to attend to a musical context through prefix-based conditioning. Compared to the baselines, STAGE yields accompaniments that exhibit stronger coherence with the input mixture, higher audio quality, and closer alignment with textual prompts. Moreover, by conditioning on a metronome-like track, our framework naturally supports tempo-constrained generation, achieving state-of-the-art alignment with the target rhythmic structure, all without requiring any additional tempo-specific module. As a result, STAGE offers a practical, versatile tool for interactive music creation that can be readily adopted by musicians in real-world workflows.
github.com/giorgioskij/stage
giorgioskij.github.io/stage-demo
1. INTRODUCTION
Generative AI has recently transformed music composition, with large-scale models now able to produce long-form, high-quality, and stylistically consistent music. Models such as MusicGen [1], MusicLM [2], and JukeBox [3] have demonstrated that transformers trained on tokenized audio representations can generate music that rivals human compositions in coherence and production quality.
However, most of these models focus on generating music from scratch, even when they allow for conditional generation using melodies [1], chords [4–8], or text prompts. This limits their applicability in a natural music composition workflow, which is often structured in an iterative, layered fashion, gradually building compositions by adding or refining parts over time. To support this workflow, we focus on a human-centered, intuitive generation task: adding a single new stem to an existing multi-stem mixture, while also allowing for precise control over the tempo of the generated output.
⋆ denotes equal contribution.
Figure 1. Outline of our proposed model. (top) STAGE takes a musical context as input and generates a single-stem accompaniment. (bottom) STAGE takes a metronome-like track and generates a stem that follows the desired rhythmic structure.
In this paper, we introduce STAGE, a single-instrument accompaniment generation model that can be conditioned on any audio context, be it a mixture or a simple click track, to generate a coherent and rhythmically aligned accompaniment (see Figure 1). We use a simple yet effective approach to fine-tune MusicGen [1] for stemmed accompaniment generation. Our method does not require retraining any additional context-encoders and relies on minimal data for fine-tuning. We leverage prefix-based conditioning, where the context is prepended to the model’s input sequence, effectively serving as a prompt for the generation of the target stem. This approach draws inspiration from instruction tuning [9,10] in language models, where prepending a task-specific instruction enables a pretrained model to specialize in new tasks with minimal modifications. In our case, we treat musical contexts as the “question” and the desired accompaniment as the “answer”. This enables the model to learn a token-to-token correspondence between context and continuation, specializing it to accompaniment tasks.
We evaluate our model on musical coherence using the COCOLA score [11], showing clear improvements over existing baselines, while maintaining high audio quality as measured by FAD [12] and KAD [13]. Furthermore, we show that our model supports tempo-constrained generation by simply conditioning on a metronome-like click track, without the need for tempo-specific modules or architectures.
Our contributions are three-fold:
• A prefix-based fine-tuning method for stemmed accompaniment generation.
• A lightweight, flexible approach to tempo conditioning through audio-based inputs.
• Extensive evaluation across musical coherence, rhythmic alignment, and audio fidelity, along with open-source code and model checkpoints.
2. RELATED WORK
Recent advances in generative modeling treat music as a language of discrete tokens, enabling long-context audio synthesis via transformer architectures. Jukebox [3] was an early example: it converted raw audio into a hierarchy of residual VQ-VAE tokens and used progressively deeper transformers to generate coherent extended sequences of music. Subsequent improvements in neural audio codecs, such as SoundStream [14] and EnCodec [15], inspired new designs. MusicLM [2] introduced a hierarchical two-stage approach that models separate streams of “semantic” and “acoustic” tokens, while MusicGen [1] showed that a single-stage transformer over EnCodec tokens can achieve excellent text-to-music quality. The simple architecture of MusicGen, combined with the robust audio fidelity of its generations, makes it a natural foundation for specialized tasks such as single-stem accompaniment, which we explore in this paper.
2.1 Conditional generation
Although most music LMs focus on text prompting, some approaches provide more direct musical guidance or editing capabilities. Melody-conditioned models, such as the melody variant in [1], align the generation to a guiding pitch contour but typically produce an entire mix rather than an isolated stem. MusiConGen [4] further extends MusicGen by adding conditioning over chords and beat information, allowing explicit control of harmonic and rhythmic structures. Multiple other systems have been presented, especially using diffusion models, to condition music generation on a series of chords [5–8], on stylistic references [16], or even on a combination of text, style, and a reference drums track [17].
2.2 Music editing
Recent approaches to audio-domain editing include autoregressive models like Instruct-MusicGen [18], as well as diffusion-based systems such as MSDM [19] and GMSDI [20]. While Instruct-MusicGen struggles to achieve high-quality outputs, diffusion-based models, despite their flexibility, require significantly more computational resources and do not consistently support clean, single-stem generation.
2.3 Accompaniment generation
Similarly to MSDM and GMSDI, other diffusion-based systems aim to generate or edit partial arrangements but are closed-source or limited in scope. For instance, Diff-A-Riff [21] uses a multi-step diffusion process to refine an existing track with new musical elements; however, it is not openly released. SA-ControlNet¹ uses a fine-tuning of Stable Audio Open [22] with an added ControlNet module [23], also providing a form of stemmed accompaniment generation. SingSong [24] tackles vocal-to-instrumental accompaniment, taking a vocal track as input and generating a band-like backing. This approach is highly effective for vocals but remains highly specialized. More directly aligned with our objectives, StemGen [25] enables single-stem accompaniment generation via a non-autoregressive transformer. In parallel to our work, Meta AI introduced MusicGen-Stem [26], which supports a range of editing tasks, including mixture-conditioned accompaniment generation. However, neither model has released public code or checkpoints at the time of writing, and they only provide a handful of generated samples, making it impossible to perform a rigorous comparison. Additionally, both approaches involve training dedicated transformer models from scratch, in contrast to our lightweight fine-tuning strategy.
3. BACKGROUND
MusicGen [1] is a single-stage music generation model that operates over discrete audio tokens produced by an encoder–decoder neural codec. Specifically, the authors use EnCodec [15], which converts raw audio into several parallel streams of quantized tokens (known as codebooks). Whereas some prior works (e.g., Jukebox [3], MusicLM [2]) rely on multi-stage or hierarchical architectures that process one set of tokens to then upsample another, MusicGen proposes a simple yet effective single-stage transformer language model that directly learns to generate all of these quantized tokens at once.
3.1 Architecture overview
The core of MusicGen is a GPT-like transformer decoder that is trained autoregressively over sequences of discrete audio tokens. The tokens come from a residual vector quantization scheme, where the raw waveform is first encoded into a low-frame-rate continuous representation, and then each frame is quantized by multiple “stacked” codebooks. Codebooks are organized hierarchically, and each codebook k_i contains incremental residual information w.r.t. the previous codebooks k_j, i > j. The number of codebooks (set to 4 in MusicGen) determines how many parallel tokens must be modeled at each time step.
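As an illustration of the residual quantization idea described above, the following is a minimal sketch, not EnCodec's actual implementation. The function names `rvq_encode` and `rvq_decode`, the toy two-dimensional codebooks, and the nearest-neighbour search are all assumptions made for the example: each codebook quantizes whatever residual the previous codebooks left behind, so one frame yields one token per codebook.

```python
import numpy as np


def rvq_encode(frame, codebooks):
    """Quantize one latent frame with stacked codebooks (toy sketch).

    Each codebook has shape (codebook_size, dim). Codebook i encodes the
    residual left over after codebooks 0..i-1 have been subtracted.
    """
    residual = np.asarray(frame, dtype=np.float64).copy()
    indices = []
    for codebook in codebooks:
        # Pick the entry closest (squared Euclidean distance) to the residual.
        distances = np.sum((codebook - residual) ** 2, axis=1)
        index = int(np.argmin(distances))
        indices.append(index)
        residual = residual - codebook[index]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the frame as the sum of the selected codebook entries."""
    return np.sum([cb[i] for cb, i in zip(codebooks, indices)], axis=0)


# Toy usage with two hypothetical 2-entry codebooks.
toy_codebooks = [
    np.array([[0.0, 0.0], [1.0, 1.0]]),
    np.array([[0.0, 0.0], [0.25, -0.25]]),
]
toy_tokens = rvq_encode(np.array([1.2, 0.8]), toy_codebooks)
toy_reconstruction = rvq_decode(toy_tokens, toy_codebooks)
```

With four codebooks, as stated for MusicGen, the same loop would emit four tokens per frame, which is what the interleaving patterns of Section 3.2 have to arrange into a sequence.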
¹ github.com/EmilianPostolache/stable-audio-controlnet
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
MusicGen is released in four versions, Small (∼400M params), Medium (∼1.5B params), Large (∼3B) and Melody (∼1.5B). The latter uses both text and a short audio clip (from which melody is extracted) as conditioning, with both sources prepended to the input. The other three rely only on text, provided via cross-attention. In this work, we use only the pre-trained MusicGen Small checkpoint.
3.2 Codebook Interleaving Patterns
One straightforward approach to convert the four parallel streams of tokens generated by EnCodec into a single stream is to flatten all codebooks, but this substantially lengthens the sequence. Conversely, predicting them in parallel underestimates cross-codebook dependencies. A practical compromise, used in MusicGen and shown in Figure 2, is the “delay” strategy which shifts tokens from the i-th codebook by i−1 steps. This preserves inter-codebook context while requiring only one autoregressive step per frame. As a result, the delay pattern yields more efficient modeling than parallel prediction with minimal computational overhead.
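The delay strategy can be sketched in a few lines. This is an illustrative reimplementation and not MusicGen's code: the helper names and the `PAD` placeholder id are hypothetical, and codebooks are 0-indexed here, so row i is shifted by i steps (equivalent to the i−1 shift for 1-indexed codebooks in the text).

```python
from typing import List

PAD = -1  # hypothetical placeholder id for slots that hold no token


def apply_delay_pattern(codes: List[List[int]], pad: int = PAD) -> List[List[int]]:
    """Shift codebook row i (0-indexed) right by i steps.

    `codes` has K rows (one per codebook) of T tokens each; the result has
    K rows of T + K - 1 steps, so every original token keeps a slot.
    """
    num_codebooks = len(codes)
    num_steps = len(codes[0])
    total_steps = num_steps + num_codebooks - 1
    delayed = []
    for i, row in enumerate(codes):
        trailing = total_steps - num_steps - i
        delayed.append([pad] * i + list(row) + [pad] * trailing)
    return delayed


def undo_delay_pattern(delayed: List[List[int]]) -> List[List[int]]:
    """Invert apply_delay_pattern and recover the K x T token grid."""
    num_codebooks = len(delayed)
    num_steps = len(delayed[0]) - (num_codebooks - 1)
    return [row[i:i + num_steps] for i, row in enumerate(delayed)]


# Toy usage: 4 codebooks, 3 frames.
grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
delayed_grid = apply_delay_pattern(grid)
```

Reading `delayed_grid` column by column gives the order in which one autoregressive step predicts one token per codebook, each from a different frame.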
4. METHOD
In this section, we present the key components of STAGE. In particular, we discuss how it extends MusicGen’s architecture and training procedure to generate single stems from an input mixture or a metronome-like beat track. We refer to both these audio inputs as the “context”. During inference, the user can pass either of those audio cues or even combine them into a single waveform, which makes it possible to perform accompaniment generation with even tighter control over the target’s rhythmic structure.
4.1 Overview
Having chosen a target instrument I, STAGE is trained to produce a stem S of that instrument, taking as input:
• An audio context, which can contain a generic audio mixture M (without our target instrument I), or a beat track B (a metronome-like pulse sequence);
• an optional text prompt T, describing the desired style and mood of the generation.
The model aims to generate audio that matches the mixture’s key, style, harmony, and rhythm, enabling it to serve as a musical accompaniment. If no initial mixture is available, STAGE can instead take a metronome track (B) (a simple beat marking the tempo) as input, allowing it to generate the first stem of a composition. In practice, we find that overlaying the mixture M and beat track B into a single audio file at inference time allows STAGE to condition on both, preserving coherence with the mixture while achieving tighter rhythmic alignment with the beats (see Section 5.4 for results).
We train separate STAGE models for each target instrument, specifically drums and bass. For the remaining, less-represented instruments, the inherent intra-instrument variability would require significantly more data than what is available in MoisesDB.
4.2 Context token for prefix-based conditioning
Our starting point is MusicGen-Small, a lightweight variant of MusicGen. To enable conditioning, we add a single context token to the transformer’s embedding matrix, allowing extra audio tokens to be prepended to the input sequence (Figure 2). These tokens come from either the mixture M or the beat track B, encoded via EnCodec [15]. This forms a prefix-based conditioning setup: once the model “sees” the context tokens, it autoregressively generates the new stem.
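The layout of the conditioned input can be sketched as follows. This is a simplified illustration under stated assumptions, not the released STAGE code: `CODEBOOK_SIZE`, the id chosen for `CONTEXT_TOKEN`, and both helper functions are hypothetical, and the sketch ignores the delay pattern of Section 3.2 for readability.

```python
from typing import List

CODEBOOK_SIZE = 2048               # assumed codebook cardinality
CONTEXT_TOKEN = CODEBOOK_SIZE + 1  # hypothetical id of the added context token


def build_prefix_input(context: List[List[int]],
                       target: List[List[int]],
                       context_token: int = CONTEXT_TOKEN) -> List[List[int]]:
    """Lay out, for every codebook, [context tokens | context token | target tokens]."""
    if len(context) != len(target):
        raise ValueError("context and target must have the same number of codebooks")
    return [list(c) + [context_token] + list(t) for c, t in zip(context, target)]


def split_after_context(sequence: List[List[int]],
                        context_token: int = CONTEXT_TOKEN) -> List[List[int]]:
    """Return only the tokens that follow the context token in each codebook row."""
    return [row[row.index(context_token) + 1:] for row in sequence]


# Toy usage: 2 codebooks, 2 context frames, 1 target frame.
toy_sequence = build_prefix_input([[1, 2], [3, 4]], [[5], [6]])
toy_stem_tokens = split_after_context(toy_sequence)
```

Because the separator id lies outside the range of ordinary audio tokens, the embedding matrix needs one extra row for it, which is the extension the text describes.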
4.3 Fine-tuning procedure
We train on the open-source, multi-stem dataset MoisesDB [27], which contains 240 stem-separated songs. Given a target instrument I to be generated, we create input data for STAGE using the following strategies:
• Form the context. For each track we either (a) mix a random subset of stems (excluding instrument I) to create M, or (b) replace the mixture with a metronome track B at the known tempo of M. We use each of the two strategies with equal probability.
• Context length. We randomize the context length in the range of 5 to 10 seconds. Hence, the model can learn to generate samples longer than the actual context window.
• Data augmentation. We apply speed transposition (in the range [0.8, 1.2]) and pitch transposition (in the range [-4, +4 semitones]) with probability 0.5 to both context and target.
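The context-forming step above could look roughly like the sketch below. It is an assumption-laden illustration rather than the authors' pipeline: the click sound (a decaying 1 kHz sine burst), the 32 kHz default sample rate, the choice of a non-empty subset, and the names `click_track` and `form_context` are all hypothetical.

```python
import numpy as np


def click_track(bpm: float, duration_s: float, sample_rate: int = 32000,
                click_ms: float = 20.0, click_hz: float = 1000.0) -> np.ndarray:
    """Return a mono waveform with a short sine burst on every beat."""
    num_samples = int(round(duration_s * sample_rate))
    track = np.zeros(num_samples, dtype=np.float32)
    click_len = int(sample_rate * click_ms / 1000.0)
    t = np.arange(click_len) / sample_rate
    # Decaying sine burst used as the click sound (an arbitrary choice).
    burst = (np.sin(2 * np.pi * click_hz * t) * np.exp(-200.0 * t)).astype(np.float32)
    num_beats = int(np.ceil(duration_s * bpm / 60.0))
    for k in range(num_beats):
        start = int(round(k * 60.0 / bpm * sample_rate))
        if start >= num_samples:
            break
        end = min(start + click_len, num_samples)
        track[start:end] += burst[: end - start]
    return track


def form_context(stems: dict, target: str, bpm: float, sample_rate: int,
                 rng: np.random.Generator) -> np.ndarray:
    """Choose strategy (a) or (b) from the text with equal probability."""
    others = [name for name in stems if name != target]
    num_samples = len(stems[target])
    if rng.random() < 0.5:
        # (a) mix a random, non-empty subset of the non-target stems.
        k = int(rng.integers(1, len(others) + 1))
        order = rng.permutation(len(others))[:k]
        return np.sum([stems[others[i]] for i in order], axis=0)
    # (b) replace the mixture with a metronome at the known tempo.
    return click_track(bpm, num_samples / sample_rate, sample_rate)
```

Cropping the context to 5 to 10 seconds and the speed and pitch augmentation would be applied on top of this step; they are omitted here to keep the sketch short.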
The above procedure produces <context, target stem> pairs, which we use for fine-tuning. To allow the newly introduced context token to adapt to the pretrained model, we first train only its embedding for 200 steps at a learning rate of 1e-4, keeping all other weights frozen. This warm-up phase helps the token learn a meaningful interface with the rest of the model. We then unfreeze the remaining weights and gradually ramp up their learning rate from 0 to 1e-5, while annealing the context token’s rate from 1e-4 to 1e-5. Training converges in about 1,000 steps using batches of eight 10-second samples, finishing in under a day on a single NVIDIA RTX 3090 GPU.
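The two-phase schedule can be written down as a small function. The 200 warm-up steps and the 1e-4 and 1e-5 rates come from the text; the linear shape of the ramp and its length (`RAMP_STEPS`) are not stated in the paper and are assumptions of this sketch, as is the function name.

```python
from typing import Tuple

WARMUP_STEPS = 200      # from the text: only the context-token embedding is trained
TOKEN_LR_START = 1e-4   # from the text
FINAL_LR = 1e-5         # from the text
RAMP_STEPS = 300        # hypothetical: the ramp length is not given in the paper


def learning_rates(step: int, ramp_steps: int = RAMP_STEPS) -> Tuple[float, float]:
    """Return (context_token_lr, other_weights_lr) for a 0-indexed training step."""
    if step < WARMUP_STEPS:
        # Phase 1: all other weights are frozen, i.e. their rate is zero.
        return TOKEN_LR_START, 0.0
    # Phase 2: linear interpolation (an assumed shape) towards the final rates.
    progress = min(1.0, (step - WARMUP_STEPS) / ramp_steps)
    token_lr = TOKEN_LR_START + progress * (FINAL_LR - TOKEN_LR_START)
    other_lr = progress * FINAL_LR
    return token_lr, other_lr
```

In a typical training loop these two values would be assigned to two separate optimizer parameter groups at every step, one holding the context-token embedding and one holding the remaining weights.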
4.4 Inference
At inference time, we perform the following steps:
• Tokenize the context via EnCodec.
• Pass the encoded tokens to STAGE, followed by the context token.
• Autoregressively decode the new stem’s tokens.
• Reconstruct the generated stem via EnCodec’s decoder.
This lightweight prefix-based framework is easy to integrate into a musician’s workflow by simply providing an audio snippet of an existing track (or a metronome pulse in its place) and letting STAGE generate a single, coherent accompaniment stem.
[Figure 2 diagram: residual codebooks k1–k4 (vertical axis) against sequence steps s1–s14 (horizontal axis); regions labelled context, context token, sequence, and pad.]
Figure 2. Illustration of the delay pattern used by MusicGen, and how the context token is placed to separate the audio context from the input sequence of the transformer.
5. EXPERIMENTS AND RESULTS
We now present the experimental setup and evaluation metrics used to assess the performance of STAGE. We focus our evaluation on two separate tasks:
• Beat Alignment, given a beat as conditioning;
• Accompaniment Generation, given a generic mixture of stems as conditioning.
For each of these tasks, we measure against state-of-the-art open-source models on comparable tasks.
5.1 Beat alignment
To measure how precisely STAGE follows the given beat, we provide the model with a raw pulse track spanning the full duration of the sample to be generated. We then evaluate the F1 score using the mir_eval² library, matching the detected beats in the output audio to the reference beat track supplied as input. For beat detection, we employ Beat-This [28], a state-of-the-art algorithm for beat tracking.
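To make the metric concrete, here is a simplified, dependency-free stand-in for a beat F-measure. It is not the mir_eval implementation used in the paper: the function name, the greedy one-to-one matching, and the ±70 ms tolerance window (which, to our understanding, corresponds to mir_eval's default threshold) are assumptions of this sketch.

```python
from typing import Sequence


def beat_f1(reference: Sequence[float], estimated: Sequence[float],
            tolerance_s: float = 0.07) -> float:
    """F1 between two lists of beat times in seconds.

    A reference beat is a hit if an unused estimated beat lies within
    +/- tolerance_s of it; each estimated beat can be used only once.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    matched = 0
    j = 0
    for r in ref:
        # Skip estimates that are too early to match this or any later beat.
        while j < len(est) and est[j] < r - tolerance_s:
            j += 1
        if j < len(est) and abs(est[j] - r) <= tolerance_s:
            matched += 1
            j += 1
    if matched == 0:
        return 0.0
    precision = matched / len(est)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy usage: two of four reference beats are hit by three detections.
toy_score = beat_f1([0.5, 1.0, 1.5, 2.0], [0.52, 1.2, 1.49])
```

The function returns a value in [0, 1]; the tables in this paper report F1 as a percentage.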
We benchmark STAGE against MusiConGen [4], a fine-tuned MusicGen variant that can be conditioned on tempo, and optionally on chord sequences.
Following the setup of MusiConGen, we evaluate our generations by conditioning on beat tracks extracted from MusDB [29] mixtures using the Beat-This algorithm. However, we find that this process introduces noise, as the beat tracker is imperfect and often detects irregular or inconsistent beat patterns across samples. This is not in line with a realistic scenario, in which a musician supplies the model with a perfectly regular beat grid to follow. For a more realistic setting, we also test our model on a uniform distribution of BPMs in the interval [100, 180].
² github.com/mir-evaluation/mir_eval
Table 1 shows that the model trained on drums significantly outperforms both MusiConGen and our bass-trained variant. This supports the intuitive notion that training on drums gives the model a stronger sense of rhythm, leading to better alignment. Additionally, beat extraction is less accurate on bass-only tracks (like those generated by STAGE-bass), which can contribute to higher measured alignment error.
Model             Dataset      F1 ↑   FAD-VGGish ↓  FAD-Clap ↓
STAGE-drums       MusDB        66.88  1.40          0.23
STAGE-drums       Uniform BPM  71.57  2.05          0.24
STAGE-bass        MusDB        40.93  5.59          0.39
STAGE-bass        Uniform BPM  45.17  4.26          0.52
MusiConGen-Tempo  MusDB        61.37  1.95          –
Table 1. Comparison of audio quality (FAD) and rhythmic alignment (F1) of STAGE-drums vs. MusiConGen (with Rhythm-only conditioning). The rhythm conditioning was extracted with Beat-This from the MusDB dataset. For our model, we also test on 160 samples from a uniform distribution of BPMs in the range [100, 180]. Following [4], FAD is computed with MusDB as reference.
5.2 Accompaniment coherence
To assess how well our model adds a new instrument stem to an existing mixture, we evaluate on the test set from the MoisesDB dataset [27]. For each track, we remove the target stem (drums or bass) and then ask the model to regenerate that instrument while keeping the remaining parts unchanged. We measure several metrics that target both objective audio quality and semantic coherence with the context:
Target stem  Model              COCOLA ↑  FAD-VGGish ↓  FAD-Clap ↓  KAD-VGGish ↓  KAD-Clap ↓  Rhythmic Alignment (F1) ↑
Drums        STAGE              61.02     1.05          0.17        3.00          9.86        52.63
Drums        Instruct-MusicGen  48.24     16.44         1.34        55.25         80.79       0.20
Drums        GMSDI              45.25     15.57         1.22        57.39         54.75       25.98
Drums        SA-ControlNet      57.46     2.75          0.39        9.28          11.75       38.70
Bass         STAGE              60.20     2.97          0.37        13.45         15.40       40.94
Bass         Instruct-MusicGen  53.69     12.85         1.28        47.41         73.92       0.19
Bass         GMSDI              43.29     14.36         1.28        44.31         49.55       24.34
Bass         SA-ControlNet      59.63     2.22          0.31        6.69          15.98       47.17
Table 2. Performance comparison on the accompaniment generation task using the MoisesDB test set, with bass and drums as target stems.
• COCOLA score [11] captures harmonic and percussive coherence between the context and the newly generated stem;
• FAD-VGGish and FAD-Clap [12] assess perceptual quality by comparing the distribution of embeddings (extracted using VGGish [30] or CLAP [31]) between generated and reference audio. Since FAD measures distributional distance, we use stems from MoisesDB matching the target instrument to define the reference.
• KAD-VGGish and KAD-Clap [13]. A newly released metric that computes the distance between distributions of embeddings in a higher-order abstract space, using the kernel trick. It has similar properties to the commonly used FAD metric.
• Rhythmic Alignment (F1). We also compute the F1-score between beats extracted from the context, and beats extracted from the generated stem, to assess the rhythmic alignment and coherence between context and accompaniment.
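For reference, the distributional distance behind FAD can be written out explicitly. The formula below is the standard Fréchet distance between two Gaussians fitted to the reference and generated embedding sets; the symbols (μ_r, Σ_r for the reference statistics, μ_g, Σ_g for the generated ones) are notation introduced here and do not appear elsewhere in the paper.

```latex
\mathrm{FAD} \;=\; \lVert \mu_r - \mu_g \rVert_2^2
  \;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
```

The first term penalizes a shift between the mean embeddings and the second a mismatch between their covariances, so the distance is zero when both fitted Gaussians coincide and lower values are better, consistent with the ↓ markers in the tables.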
The scores for GMSDI and SA-ControlNet are computed on samples provided by the authors, while Instruct-MusicGen was run locally with the public inference code³. We were not able to compare with StemGen [25], Diff-A-Riff [21], and MusicGen-Stem [26] since their code is not publicly available.
As shown in Table 2, our proposed model outperforms all baselines in semantic coherence, audio quality, and rhythmic alignment when generating drums, and performs on par with SA-ControlNet for bass. Overall, we observe that STAGE performs slightly worse on bass across all metrics. We attribute this to two main factors:
a) Greater variability in the distribution of bass tracks within the MoisesDB dataset, which includes both electric and synth bass with distinct timbral characteristics.
b) The inherently more complex nature of bass generation, which, unlike drums, requires modeling both rhythmic and harmonic information from the context, resulting in a more challenging prediction task.
³ github.com/ldzhangyx/instruct-MusicGen
5.3 Ablation study on rhythm conditioning
To evaluate the impact of training on <metronome, target> pairs alongside <mixture, target> pairs, we conduct an ablation study (Table 3). We fine-tune a version of STAGE-drums without metronome tracks, exposing it only to <mixture, target> pairs. As expected, the model fails to generate tempo-aligned outputs when conditioned on a metronome at inference. More notably, its performance also degrades on accompaniment generation (the very task it was trained for), highlighting the broader benefit of including metronome conditioning at training.
The clear gap in both COCOLA scores (which capture rhythmic and harmonic coherence) and F1 scores of rhythmic alignment (Table 3) indicates that including <metronome, target> pairs during fine-tuning not only enables tempo-constrained generation, but also enhances the model’s overall ability to perceive and reproduce rhythmic structure in standard accompaniment generation.
5.4 Combining mixture and metronome for improved alignment
As mentioned in Section 4.1, we verify that, at inference time, we can condition the model on a combination of mixture and metronome, by simply summing the waveforms. Even though STAGE has never seen such contexts during training, it is able to generalize to these conditionings and provide accompaniment generation with even greater control over the rhythmic alignment. We measure this effect on the same accompaniment generation task by comparing beat alignment (F1 score) when either conditioning on M alone vs. on M+B. Results, shown in Figure 3, confirm that an explicit beat track helps tighten alignment while preserving coherence with the mixture.
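Summing the two waveforms can be sketched as below. The text only states that the signals are summed; the zero-padding of the shorter signal, the optional `click_gain`, and the rescaling applied when the sum would exceed `peak` are assumptions of this illustration, as is the function name `overlay`.

```python
import numpy as np


def overlay(mixture: np.ndarray, click: np.ndarray,
            click_gain: float = 1.0, peak: float = 0.99) -> np.ndarray:
    """Sum two mono waveforms sample by sample (M + B).

    The shorter signal is zero-padded; the result is rescaled only if its
    absolute peak would exceed `peak`, to avoid clipping.
    """
    length = max(len(mixture), len(click))
    out = np.zeros(length, dtype=np.float64)
    out[: len(mixture)] += mixture
    out[: len(click)] += click_gain * click
    max_abs = float(np.max(np.abs(out))) if length else 0.0
    if max_abs > peak:
        out *= peak / max_abs
    return out


# Toy usage with two short signals of different lengths.
toy_context = overlay(np.array([0.2, 0.4, -0.2]), np.array([0.1, 0.1]))
```

The resulting array would then be tokenized like any other context, which is why no architectural change is needed to condition on both signals.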
6. DISCUSSION
Below we discuss a few more general takeaways from this research, and share our opinions about research on large generative transformer models.

Target stem  Model       COCOLA ↑  FAD-VGGish ↓  FAD-Clap ↓  KAD-VGGish ↓  KAD-Clap ↓  Rhythmic Alignment (F1) ↑
Drums        STAGE       61.02     1.05          0.17        3.00          9.86        52.63
Drums        STAGE-abl.  59.28     1.16          0.17        4.37          9.59        48.05
Table 3. Ablation results: the two models are exactly the same, but STAGE-abl never sees <metronome, target> pairs during fine-tuning. Models are tested on accompaniment generation, generating drums given a mixture of stems extracted from the test set of MoisesDB. We observe that the model which observed conditioning with metronome tracks is able to better align to the rhythmic structure of the mixture, even on the normal accompaniment generation task.
[Figure 3 bar chart: rhythmic alignment F1-score (axis from 0 to 60) for Drums and Bass; legend: Mixture Only, Mixture + Metronome.]
Figure 3. Comparison of rhythmic alignment when passing only a mixture as conditioning vs. the combination of the same mixture with a metronome track. For STAGE-drums, the F1 alignment improves from 52.6 to 64.0, and for STAGE-bass from 40.9 to 46.8.
6.1 Parameter-efficiency of single-stem fine-tuning
As shown in Table 4, STAGE uses about one-tenth the trainable parameters of comparably performing systems, yet outperforms them across multiple tasks. By training on a single stem with simple prefix-based conditioning, it aligns closely with the audio context without extra encoders or modules. Fine-tuning a general model like MusicGen on a single stem focuses predictive power on a simpler waveform distribution, making it a highly parameter-efficient technique.
These experiments suggest that stacking multiple STAGE instances, one per instrument, can match or exceed general models’ performance at a fraction of the parameters. This aligns with a broader trend: recent research on large language models shows that assigning subsets of parameters to specific subtasks is often more efficient than scaling monolithic models. DeepSeek V3 [32] exemplifies this with its use of Mixture of Experts, although in a different context. We believe that, given the exponentially increasing cost of very large models, attention to more parameter-efficient architectures should not be spared.
Model              # Params
STAGE              ∼0.4B
Instruct-MusicGen  ∼4.7B
GMSDI              ∼0.8B
SA-ControlNet      ∼3.8B
Table 4. Considered models’ trainable parameters count.
6.2 Cross-attention in local vs. global conditioning
Before pivoting towards prefix-based conditioning, we extensively tested cross-attention for injecting musical context into the model. While fine-tuned models captured style, mood, and harmony, they struggled with precise rhythmic alignment, even with positional embeddings designed for local dependencies. Moving the conditioning source into the input stream, processed via self-attention, resolved this issue entirely, enabling accurate alignment between conditioning and output. This behavior has rarely been documented, with [33] noting that cross-attention outputs converge to a fixed point in the first few steps, splitting the process into semantic planning (via cross-attention), and subsequent image generation.
In [34], there are hints of a similar intuition. We conclude that the precise behavior of cross-attention in traditional transformer decoders, and its difference in effectiveness when handling global conditioning (e.g., the global description of an image, the meaning of a text to translate) vs. highly localized conditioning (e.g., the exact positions of the beats to follow in a musical piece), is largely underexplored and still deserves attention in future research.
7. CONCLUSION
We introduced STAGE, a parameter-efficient single-instrument approach to accompaniment generation that extends MusicGen with a simple, prefix-based conditioning mechanism. By prepending an audio context to the model’s input, STAGE effectively learns the relationship between context and accompaniment, delivering competitive or superior performance in audio fidelity, contextual coherence, and beat alignment compared to larger baseline systems, despite its compact size and minimal fine-tuning effort. We explored how enabling the model to be conditioned on either a full mixture or a simple metronome track not only enables it to perform strictly temporally-controlled generation, but also improves its rhythmic alignment in the more general accompaniment generation task. Future work may explore extending STAGE to additional instruments and refining its capacity for an even more granular degree of control, further expanding its applicability in real-world music production workflows.
8. ACKNOWLEDGMENTS
We acknowledge support from Sapienza University of Rome through the Seed of ERC grant "MINT.AI", cup B83C25001040001. L.C. is supported by the PRIN 2022 project n. 2022AL45R2 (EYE-FI.AI, CUP H53D2300350-0001), funded by the European Union – NextGenerationEU – PNRR – M4C2, Investment 1.1. We thank Emilian Postolache for useful discussions during the early stages of this work.
9. REFERENCES
[1] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[2] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi et al., “Musiclm: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023.
[3] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” 2020. [Online]. Available: https://arxiv.org/abs/2005.00341
[4] Y.-H. Lan, W.-Y. Hsiao, H.-C. Cheng, and Y.-H. Yang, “Musicongen: Rhythm and chord control for transformer-based text-to-music generation.”
[5] S. Gao, S. Lei, F. Zhuo, H. Liu, F. Liu, B. Tang, Q. Huang, S. Kang, and Z. Wu, “An end-to-end approach for chord-conditioned song generation,” in Proc. Interspeech 2024, 2024, pp. 1890–1894.
[6] K. Choi, J. Park, W. Heo, S. Jeon, and J. Park, “Chord conditioned melody generation with transformer based decoders,” IEEE Access, vol. 9, pp. 42 071–42 080, 2021.
[7] S. Li and Y. Sung, “Melodydiffusion: Chord-conditioned melody generation using a transformer-based diffusion model,” Mathematics, vol. 11, no. 8, p. 1915, 2023.
[8] J. Jung, A. Jansson, and D. Jeong, “Musicgen-chord: Advancing music generation through chord progressions and interactive web-ui,” arXiv preprint arXiv:2412.00325, 2024.
[9] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V. Le, “Finetuned language models are zero-shot learners,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=gEZrGCozdqR
[10] S. Zhang, L. Dong, X. Li, S. Zhang, X. Sun, S. Wang, J. Li, R. Hu, T. Zhang, F. Wu, and G. Wang, “Instruction tuning for large language models: A survey,” 2024. [Online]. Available: https://arxiv.org/abs/2308.10792
[11] R. Ciranni, G. Mariani, M. Mancusi, E. Postolache, G. Fabbro, E. Rodolà, and L. Cosmo, “Cocola: Coherence-oriented contrastive learning of musical audio representations,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[12] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” 2019.
[13] Y. Chung, P. Eu, J. Lee, K. Choi, J. Nam, and B. S. Chon, “Kad: No more fad! an effective and efficient evaluation metric for audio generation,” arXiv preprint arXiv:2502.15602, 2025.
[14] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Trans. Audio, Speech and Lang. Proc., vol. 30, p. 495–507, Nov. 2021. [Online]. Available: https://doi.org/10.1109/TASLP.2021.3129994
[15] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” 2022. [Online]. Available: https://arxiv.org/abs/2210.13438
[16] S. Rouard, Y. Adi, J. Copet, A. Roebel, and A. Défossez, “Audio conditioning for music generation via discrete bottleneck features,” in ISMIR 2024, 2024.
[17] O. Tal, A. Ziv, I. Gat, F. Kreuk, and Y. Adi, “Joint audio and symbolic conditioning for temporally controlled text-to-music generation,” 2024.
[18] Y. Zhang, Y. Ikemiya, W. Choi, N. Murata, M. A. M. Ramírez, L. Lin, G. Xia, W.-H. Liao, Y. Mitsufuji, and S. Dixon, “Instruct-musicgen: Unlocking text-to-music editing for music language models via instruction tuning,” CoRR, 2024.
[19] G. Mariani, I. Tallini, E. Postolache, M. Mancusi, L. Cosmo, E. Rodola et al., “Multi-source diffusion models for simultaneous music generation and separation,” in 12th International Conference on Learning Representations, ICLR 2024. International Conference on Learning Representations, ICLR, 2024.
[20] E. Postolache, G. Mariani, L. Cosmo, E. Benetos, and E. Rodolà, “Generalized multi-source inference for text conditioned music diffusion models,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 6980–6984.
[21] J. Nistal, M. Pasini, C. Aouameur, S. Lattner, and M. Grachten, “Diff-a-riff: Musical accompaniment co-creation via latent diffusion models.”
[22] Z. Evans, J. D. Parker, C. Carr, Z. Zukowski, J. Taylor, and J. Pons, “Stable audio open,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[23] L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 3836–3847.
[24] C. Donahue, A. Caillon, A. Roberts, E. Manilow, P. Esling, A. Agostinelli, M. Verzetti, I. Simon, O. Pietquin, N. Zeghidour et al., “Singsong: Generating musical accompaniments from singing,” in International Conference on Machine Learning ICML 2023, 2023.
[25] J. D. Parker, J. Spijkervet, K. Kosta, F. Yesiler, B. Kuznetsov, J.-C. Wang, M. Avent, J. Chen, and D. Le, “Stemgen: A music generation model that listens,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 1116–1120.
[26] S. Rouard, R. San Roman, Y. Adi, and A. Roebel, “Musicgen-stem: Multi-stem music generation and edition through autoregressive modeling,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[27] I. Pereira, F. Araújo, F. Korzeniowski, and R. Vogl, “Moisesdb: A dataset for source separation beyond 4-stems,” arXiv preprint arXiv:2307.15913, 2023.
[28] F. Foscarin, J. Schlüter, and G. Widmer, “Beat this! accurate beat tracking without dbn postprocessing.”
[29] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, “The MUSDB18 corpus for music separation,” Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1117372
[30] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold et al., “Cnn architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 131–135.
[31] B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[32] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024.
[33] H. Liu, W. Zhang, J. Xie, F. Faccio, M. Xu, T. Xiang, M. Z. Shou, J.-M. Perez-Rua, and J. Schmidhuber, “Faster diffusion via temporal attention decomposition,” 2025. [Online]. Available: https://arxiv.org/abs/2404.02747
[34] D. Liang, W. Longyue, W. Di, T. Dacheng, and T. Zhaopeng, “Context-aware cross-attention for non-autoregressive translation,” Proceedings of the 28th International Conference on Computational Linguistics, pp. 4396–4402, 2020. [Online]. Available: https://cir.nii.ac.jp/crid/1360299149686305408