TOMI: TRANSFORMING AND ORGANIZING MUSIC IDEAS FOR
MULTI-TRACK COMPOSITIONS WITH FULL-SONG STRUCTURE
Qi He¹, Gus Xia¹, Ziyu Wang¹,²
¹Music X Lab, MBZUAI   ²New York University
[email protected], [email protected], [email protected]
ABSTRACT
Hierarchical planning is a powerful approach to model long sequences structurally. Aside from considering hierarchies in the temporal structure of music, this paper explores an even more important aspect: concept hierarchy, which involves generating music ideas, transforming them, and ultimately organizing them, across musical time and space, into a complete composition. To this end, we introduce TOMI (Transforming and Organizing Music Ideas) as a novel approach in deep music generation and develop a TOMI-based model via instruction-tuned foundation LLM. Formally, we represent a multi-track composition process via a sparse, four-dimensional space characterized by clips (short audio or MIDI segments), sections (temporal positions), tracks (instrument layers), and transformations (elaboration methods). Our model is capable of generating multi-track electronic music with full-song structure, and we further integrate the TOMI-based model with the REAPER digital audio workstation, enabling interactive human-AI co-creation. Experimental results demonstrate that our approach produces higher-quality electronic music with stronger structural coherence compared to baselines.¹
1. INTRODUCTION
Automatic music generation has advanced from producing short clips to composing entire pieces, yet long-term structure remains a major challenge. Unlike short-term generation, which focuses on capturing local patterns [1–6], long-term generation requires handling structure across multiple levels, from sectional repetition and cadence to the overall theme and whole narrative flow. The most common approach is to scale up models and data [7, 8], yet even with vast training corpora, achieving truly structured compositions remains difficult. An alternative is to model the temporal hierarchy of music, where it evolves at multiple time scales, with different model components capturing context dependencies at each level [7, 9–11].

¹ Source code: https://github.com/heqi201255/TOMI.
Demo page: https://tomi-2025.github.io/.

© Q. He, G. Xia, and Z. Wang. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Q. He, G. Xia, and Z. Wang, “TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

Figure 1: Concept hierarchy in TOMI: music ideas developed from features to clips are transformed and integrated into the composition, organized by sections and tracks.
But have we fully captured the hierarchical nature of music? While temporal hierarchy is essential, so is concept hierarchy, which often manifests in the development of motifs and music materials. As noted in [12], most pop songs are composed with a sparse and small set of core ideas, which are then evolved and repeated in an organized way. We illustrate this process in Figure 1, where music ideas, initially appearing as abstract features, are concretized as music clips, then transformed, and finally organized into a full composition. We refer to this concept hierarchy in music as TOMI (Transforming and Organizing Music Ideas). TOMI shares a spirit with David Cope’s recombinant music composition in EMI [13], an idea that deep learning has yet to truly embrace.

In this paper, we propose a TOMI-based music generation system for full-song and multi-track electronic music composition, inspired by the prevalent use of MIDI and audio sample packs among electronic music producers. The system is built around four key elements: clips, transformations, sections, and tracks. Sections and tracks are first created to serve as the canvas of composition, similar to a digital audio workstation (DAW) interface. The system then selects music clips from a library of audio and MIDI samples. For each selected clip, a transformation function is defined and applied, then the transformed clip is placed in its designated section and track. On the backend, a structured data representation parameterizes clips, transformations, sections, and tracks, and links them dynamically to construct a full composition. We leverage a pre-trained text-based large language model (LLM) to operate on this data structure, using in-context learning (ICL) to fill in the parameters and create dynamic links.
In sum, the contributions of this paper are as follows:
1. We introduce TOMI to model music concept hierarchy and develop a deep learning-based system for structured electronic music generation. The proposed data structure integrates symbolic and audio representations and can be manipulated by text-based LLMs via ICL.
2. We apply our system to generate high-quality electronic music with full-song structure. Objective and subjective evaluations show that songs generated by our model have clearer phrase boundaries, better phrase development, and higher music quality than the baselines.
3. We integrate TOMI with the REAPER digital audio workstation, providing a seamless connection with a professional music software interface and enabling human-AI co-creation with high-resolution audio rendering.
2. RELATED WORK
In automatic music generation, many studies focus on generating coherent music segments [1–6], while fewer focus on modeling long-term structure under the temporal hierarchy of music. Jukebox [7] uses hierarchical VQ-VAE with time conditioning to enhance long-term coherence; Wang et al. [9] apply cascaded diffusion models to structured symbolic music generation. Some methods introduce explicit structure encoding in neural networks [14, 15] or use efficient sampling techniques to generate structured variations [16]. However, the application of TOMI concepts in music generation remains largely unexplored. Related works include a rule-based recombinant music method [13], which reorganizes existing music elements based on stylistic constraints, and MELONS [17], which uses a structure graph to enforce long-term dependencies in melody generation but offers limited flexibility in musical transformation.

In the context of co-creative music systems, existing generative models produce data in either symbolic (e.g., [1, 2, 4, 9, 17]) or waveform (e.g., [3, 5, 6]) format. Furthermore, these models lack intuitive interfaces for user co-creation and fine-grained control over musical elements, as found in modern DAWs. Composer’s Assistant [18] and its successor [19] integrate with REAPER to support co-creation but focus on generating symbolic phrases rather than full pieces. Our approach generates complete compositions containing both MIDI and audio phrases while also enabling user co-creation directly within REAPER.

With recent advances in AI, text-based LLMs have emerged as a promising alternative for music creation. Previous studies like ChatMusician [20] and MuPT [21] employ ABC Notation [22] to represent music in text format and fine-tune LLaMA models [23]. While these approaches outperform generic LLMs in music tasks, their performance remains limited compared to LLMs trained exclusively on music data [24–26]. Other works have leveraged text-based LLMs for music analysis [27, 28], showing their capability to interpret music concepts through natural language. This motivates us to develop a textual hierarchical representation of music concepts to better integrate with text-based LLMs.
3. METHODOLOGY
In this section, we discuss TOMI in multi-track electronic music generation with full-song structure. The implementation consists of two main components: (1) a graph data structure named composition link that connects raw music ideas with the composition (Section 3.1), and (2) in-context learning to compose music by following this data structure (Section 3.2). We also demonstrate the integration with the REAPER DAW (Section 3.3).
3.1 TOMI Data Structure
The proposed data structure consists of four node types: clips, sections, tracks, and transformations. A composition link is a quadruple of these nodes, specifying a music clip (what) to be placed in a particular section (when) and on a specific track (where), undergoing certain transformations (how). Nodes are reusable across links, forming a structured representation of the full composition.
3.1.1 Clip Node
Clips are audio or MIDI samples sourced from databases. Each clip is a short music segment, such as a chord progression or a drum loop, described by a set of features. A MIDI clip can represent elements like chords, basslines, melodies, or arpeggios, with attributes such as tonality, duration, and root progression. An audio clip can be either a sample loop or a one-shot sound, with keywords describing its content. The blue-edged box in Figure 2a shows an example feature set of a chord clip. Once the features are specified by the LLM (discussed in Section 3.2), we can query the databases for the corresponding music material.
3.1.2 Section Node
The temporal divisions of a music composition are modeled by section nodes. A section node involves a duration and a phrase label, such as verse and chorus. A section node can appear multiple times within a composition, meaning its music content remains identical across instances. For example, as shown in Figure 2, the 16-bar verse section s2 is reused after the 16-bar chorus section s3, resulting in two identical verses with the same content.
3.1.3 Track Node
Track nodes represent the vertical layering of a composition and organize clips into MIDI or audio tracks (see the track axis in Figure 2b). MIDI tracks accept only MIDI clips and require instruments to generate sound, while audio tracks accept only audio clips and play them directly.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

(a) Example of an LLM output following our ICL prompt structure. In clips, the LLM only generates features of clips, which are then used to query the sample databases for actual MIDI or audio clips. We also show three composition links, each as a tuple of (section, track, clip, and transformation).
(b) Music generation process with composition links, illustrating how clips (top) are transformed and then organized into the arrangement (bottom). Links (A) and (C) each have two branches because section s2 appears twice in the section sequence, which means they are identical and share the same composition links.
Figure 2: We show the structured LLM output in (a) with the corresponding music generation process in (b). We distinguish node types by color and shape, with a detailed attribute example of each type shown on the left of (a). We depict three composition link examples as colored arrows in (b), highlighting their results with rectangles in corresponding colors.

3.1.4 Transformation Node
A transformation node transforms clips before placing them on specific tracks and sections. Unlike pitch transposition and tempo adjustment, which are handled in the final stage (discussed in Section 3.3), transformation nodes perform more semantically meaningful manipulations and mainly serve three roles. First (and most importantly), it can control rhythmic patterns of the audio or MIDI clips with an action sequence of onsets, sustains, and rests (see t1 in Figure 2a). For example, it can convert MIDI chord clips with whole notes into rich rhythmic patterns or place a one-shot drum on each beat to create a four-on-the-floor rhythm. Second, it can handle the riser or faller sound effects commonly used in electronic music composition, which are placed either left- or right-aligned within a section, creating smooth transitions (see c3 in Figure 2b). Lastly, it dynamically handles the looping of a clip, determining whether it plays once at a specific time, loops throughout a section, or is trimmed to fit a shorter section. Our implementation defines three subclasses of transformations: (1) Drum transform (for one-shot drums), (2) Fx transform (for risers and fallers), and (3) General transform (for others), each following different transformation rules. We refer the reader to our demo page for details.
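As a rough illustration of the first role, the sketch below turns an action sequence into note events. The one-character encoding (O for onset, S for sustain, R for rest), the fixed step grid, and the function name are assumptions made for this example only; they are not taken from the TOMI implementation, whose actual rules are documented on the demo page.

```python
def actions_to_events(actions):
    """Convert a per-step action string into (start_step, length) events.

    Hypothetical encoding: "O" starts a note, "S" holds the current note,
    "R" is silence. One character corresponds to one grid step.
    """
    events = []
    start = None  # start step of the note currently sounding, if any
    for step, action in enumerate(actions):
        if action == "O":
            if start is not None:
                events.append((start, step - start))  # close previous note
            start = step
        elif action == "R":
            if start is not None:
                events.append((start, step - start))
                start = None
        elif action == "S":
            if start is None:
                raise ValueError(f"sustain without onset at step {step}")
        else:
            raise ValueError(f"unknown action {action!r} at step {step}")
    if start is not None:
        events.append((start, len(actions) - start))
    return events


# A 16-step bar with a hit on every beat (four-on-the-floor).
kick_pattern = actions_to_events("ORRR" * 4)
# Two half-bar chords held for their full length.
chord_pattern = actions_to_events("OSSSSSSS" * 2)
```

Under this encoding, the same clip can be given different rhythms simply by attaching it to transformations with different action strings.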
3.1.5 Composition Link
A composition link comprises one node each of section, clip, transformation, and track, showing the entire process of transforming and organizing a music idea in the composition. Since nodes are independent of the composition link, they can be reused across multiple links. Figure 2b shows the process of three composition links. Note that links (A) and (C) branch in the final composition because section s2 is used twice. This is an efficient way to represent complex arrangements, as a single clip can be referenced by multiple links, spanning different sections and tracks while adapting to various transformations.
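The quadruple structure described above can be sketched with a few small classes. This is a minimal illustration under our own assumptions: the class names, fields, and the `expand` helper are hypothetical and much simpler than the attribute sets shown in Figure 2a. It shows how reusing a section in the section sequence makes its links branch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    label: str  # phrase label, e.g. "verse" or "chorus"
    bars: int


@dataclass(frozen=True)
class Track:
    name: str
    kind: str  # "midi" or "audio"


@dataclass(frozen=True)
class Clip:
    name: str
    kind: str  # "midi" or "audio"


@dataclass(frozen=True)
class Transformation:
    name: str
    kind: str  # "general", "drum", or "fx"


@dataclass(frozen=True)
class CompositionLink:
    section: Section                # when
    track: Track                    # where
    clip: Clip                      # what
    transformation: Transformation  # how

    def __post_init__(self):
        # MIDI tracks accept only MIDI clips; audio tracks only audio clips.
        if self.track.kind != self.clip.kind:
            raise ValueError(f"{self.clip.name} does not fit track {self.track.name}")


def expand(links, section_sequence):
    """List (position, track, clip, transformation) placements in song order.

    A link is emitted once per occurrence of its section, so a section that
    appears twice in the sequence makes each of its links branch twice.
    """
    placements = []
    for position, section in enumerate(section_sequence):
        for link in links:
            if link.section == section:
                placements.append(
                    (position, link.track.name, link.clip.name, link.transformation.name)
                )
    return placements
```

Because nodes are plain values, the same clip or transformation object can appear in any number of links without being copied.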
3.2 Music Generation with In-Context Learning
The TOMI data structure, as defined above, can be fully represented in text format. A complete composition can be decomposed into a set of composition links, each represented as a quadruple of nodes, where each node corresponds to a set of textual attributes. By leveraging a text-based LLM with in-context learning, we can generate compositions directly as TOMI instances. Our ICL prompt systematically breaks down the data structure in steps, following the order: sections, tracks, clips, transformations, and composition links. It guides the LLM from planning the overall song structure and music ideas to organizing them into a full song. The LLM output follows these steps accordingly, as shown in Figure 2a.

In our implementation, we elaborate our data structure with detailed examples and define a structured response schema in the prompt. We require the LLM to generate a unique name attribute for each node for reference in composition links. Moreover, we can specify additional contexts, such as tempo, mood, and custom song structures. To get robust results, we implement a rule-based validation to check the LLM output for syntax errors and invalid values. If issues are detected, an error report is generated, prompting the LLM to iteratively refine its output until no errors remain. At this point, we obtain an abstract representation of the composition in TOMI structure. To realize this, we initiate the sample retrieval process to get the actual clip materials. Then, we set a global tempo and key to unify the keys and tempos of clips within the DAW.
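The validate-and-refine loop described above might look roughly like the following sketch. The JSON layout, the key names, the `ask_llm` callable, and the retry cap are all assumptions introduced for illustration; the paper does not specify the response schema at this level of detail, and its loop runs until no errors remain rather than for a fixed number of rounds.

```python
import json

# Hypothetical top-level keys and the link field that refers to each of them.
NODE_KEYS = ("sections", "tracks", "clips", "transformations")
LINK_FIELDS = ("section", "track", "clip", "transformation")


def validate(text):
    """Return a list of error messages; an empty list means the output passed."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"syntax error: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    names = {}
    for key in NODE_KEYS:
        seen = set()
        for node in data.get(key, []):
            name = node.get("name")
            if not name or name in seen:
                errors.append(f"{key}: missing or duplicate name {name!r}")
            seen.add(name)
        names[key] = seen
    for index, link in enumerate(data.get("links", [])):
        for key, field in zip(NODE_KEYS, LINK_FIELDS):
            if link.get(field) not in names[key]:
                errors.append(f"link {index}: unknown {field} {link.get(field)!r}")
    return errors


def generate_with_validation(ask_llm, prompt, max_rounds=5):
    """Query the model, feeding the error report back until validation passes."""
    request = prompt
    for _ in range(max_rounds):
        text = ask_llm(request)
        errors = validate(text)
        if not errors:
            return json.loads(text)
        request = prompt + "\n\nErrors found in the previous output:\n" + "\n".join(errors)
    raise RuntimeError("output still invalid after the allowed number of rounds")
```

Here `ask_llm` is any function that maps a prompt string to a response string, which keeps the loop independent of a particular model API.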
3.3 Digital Audio Workstation Integration
To support audio rendering and interactivity, we integrate the TOMI framework with the REAPER DAW. This allows the generated composition to be visualized in a professional DAW while benefiting users from REAPER’s editing and rendering capabilities. The composition links and nodes of a generated piece are converted to REAPER elements, including tracks, section markers, and clips with applied transformations. We use REAPER to automatically time-stretch loopable clips to fit the tempo setting and transpose melodic clips to align with the key setting.
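The arithmetic behind tempo and key unification can be illustrated as follows. This is only a sketch of the underlying calculation, not REAPER's API or the authors' code: the function names are hypothetical, keys are reduced to their tonic pitch class, and choosing the smallest shift in either direction is our assumption.

```python
PITCH_CLASSES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def stretch_ratio(clip_bpm, target_bpm):
    """Playback-rate factor that makes a loop recorded at clip_bpm fit target_bpm."""
    return target_bpm / clip_bpm


def transpose_semitones(clip_tonic, target_tonic):
    """Smallest signed semitone shift that moves clip_tonic onto target_tonic."""
    shift = (PITCH_CLASSES.index(target_tonic) - PITCH_CLASSES.index(clip_tonic)) % 12
    # Prefer moving down when the upward shift would exceed a tritone.
    return shift - 12 if shift > 6 else shift


# A 100 BPM loop in a 120 BPM project plays 1.2 times faster,
# and a clip in C moves down three semitones to reach A.
ratio = stretch_ratio(100, 120)
shift = transpose_semitones("C", "A")
```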
4. EXPERIMENT
To implement the generation system, we prepare a MIDI database and an audio database for clip sample retrieval and use GPT-4o [29] to generate compositions in TOMI schema. We evaluate our approach with baseline methods and use both objective and subjective measurements to compare the music quality and structural consistency.
4.1 System Preparation
We collect multiple licensed MIDI and audio sample packs in the electronic music genre through online purchases.² We process the raw datasets into separate MIDI and audio databases using different feature extraction methods. For MIDI clips, we developed a script to analyze and extract musical features from them. Moreover, it can also extract music stems, such as bass, chord, and melody, from the source MIDI to augment the data. Then, we store the labeled data in a SQLite3 [30] database as our MIDI database. For audio samples, we use ADSR Sample Manager [31] to analyze and generate labels for them. The results are also exported as a SQLite3 database. The statistics of our MIDI database and audio database are shown in Table 1. Sample retrieval involves constructing a search query based on the clip’s attributes to fetch matching samples from the database. The clip node randomly selects one sample from the results. If no matches are found, the clip and its associated composition links are discarded.
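The retrieval step can be sketched with Python's built-in `sqlite3` module. The table name, column names, and sample rows below are invented for the example and do not reflect the actual database schema; only the overall flow (build a query from clip attributes, pick one match at random, return nothing when there is no match) follows the description above.

```python
import random
import sqlite3


def retrieve_sample(conn, table, attributes, rng=random):
    """Return the path of one randomly chosen matching sample, or None."""
    # Table and column names are assumed to come from trusted code;
    # only the attribute values are bound as SQL parameters.
    conditions = " AND ".join(f"{column} = ?" for column in attributes)
    sql = f"SELECT path FROM {table}"
    if conditions:
        sql += f" WHERE {conditions}"
    rows = conn.execute(sql, tuple(attributes.values())).fetchall()
    if not rows:
        return None  # the caller would discard the clip and its links
    return rng.choice(rows)[0]


def demo_database():
    """Build a tiny in-memory MIDI database with a made-up schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE midi_clips (path TEXT, content_type TEXT, bars INTEGER, tonality TEXT)"
    )
    conn.executemany(
        "INSERT INTO midi_clips VALUES (?, ?, ?, ?)",
        [
            ("chords_a.mid", "chord", 4, "minor"),
            ("chords_b.mid", "chord", 4, "minor"),
            ("bass_a.mid", "bass", 8, "minor"),
        ],
    )
    return conn
```

Passing a seeded `random.Random` instance as `rng` makes the random selection reproducible, which is convenient when debugging a generated arrangement.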

We limit all generated sections to the 4/4 time signature to simplify implementation. When exporting audio via REAPER, we randomly assign each MIDI track to one of eight virtual instrument presets (5 for chords, 2 for melody, and 1 for bass). We keep all REAPER settings at their defaults and apply no mixing plug-ins except for a limiter on the master track to prevent audio clipping.
² We obtain sample packs from two music asset platforms: https://splice.com/ and https://www.loopmasters.com/

(a) MIDI content types.
Content Type    Count
Chord            2604
Bass              209
Melody           1392
Arpeggio          227
Total            4432

(b) MIDI durations.
Duration        Count
4-bar            2947
8-bar            1417
16-bar             68
Total            4432

(c) Audio sample types.
Sample Type     Count
Loop           104493
One-Shot       170187
Total          274680

(d) Audio loop durations.
Loop Duration   Count
2-bar           24922
4-bar           27214
8-bar           24638
16-bar          27719
Total          104493

Table 1: Statistics of sample databases.

4.2 Baseline Method and Ablations
We compare our approach with MusicGen [6] and two ablations in electronic music generation. Since our system integrates both audio and MIDI data, there is no directly comparable baseline. However, as our final output is rendered as audio, MusicGen is a suitable comparison, which is capable of generating long-term and high-quality electronic music. This allows us to evaluate our method against state-of-the-art music generation systems. The ablations help us assess the individual contributions of the composition link representation and the LLM integration.
MusicGen. We use the MusicGen-Large-3.3B model as the baseline with prompts specifying key, tempo, and section sequence. To generate longer audio beyond its duration limit, we apply a sliding window approach, where a 30-second window slides in 10-second chunks, using the previous 20 seconds as context. We modify its inference process to include the current generation’s position within the full composition and its corresponding phrase notations in the prompt at each step, guiding the model to align its output with the given structure.
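The window arithmetic of this sliding scheme can be made concrete with a short sketch. The function below is not MusicGen code; it only lists which time spans serve as context and which are newly generated, assuming (our assumption) that the first step fills a whole window from scratch and the last chunk is shortened to fit the target length.

```python
def window_schedule(total_seconds, window=30, hop=10):
    """Return (context_start, context_end, new_end) tuples, in seconds.

    Each step conditions on audio in [context_start, context_end) and
    generates new audio in [context_end, new_end).
    """
    if not 0 < hop <= window:
        raise ValueError("hop must be positive and no larger than the window")
    end = min(window, total_seconds)
    steps = [(0, 0, end)]  # first step has no context
    while end < total_seconds:
        new_end = min(end + hop, total_seconds)
        context_start = max(0, end - (window - hop))  # previous 20 s by default
        steps.append((context_start, end, new_end))
        end = new_end
    return steps


# A one-minute piece: one 30 s step followed by three 10 s continuations.
one_minute = window_schedule(60)
```

At 120 BPM in 4/4 a bar lasts 2 seconds, so each 10-second chunk covers 5 bars; this is what makes it possible to look up the phrase label for the chunk's position in the section sequence at every step.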
Standalone LLM (TOMI w/o Composition Links). We remove the composition links representation from our system. We redesign the prompt to let the LLM generate a sequence of tracks and clip descriptions with position information (time point and track location) conditioned on a section sequence. The sample retrieval mechanism is also applied to clips.
Random (TOMI w/o LLM). We replace the LLM operations in our system with a rule-based method that uses randomized operations to generate music within the composition-links structure. The system creates 15–25 track nodes, populates sections with clips stochastically, and determines for each track whether to place, reuse, or generate a new clip. MIDI clips are assigned a random type (chord, bass, or melody) with bass derived from chords, while audio clips are selected from tonal, percussion, and sound effect feature labels. Each clip is then linked to one of four predefined transformations: general, drum, riser Fx, or faller Fx.

We define 4 keys and 4 distinct section sequences, each consisting of 8 to 10 sections. Each section has a name, a phrase label, and a duration ranging from 4 to 16 bars. Then, using each method, we generate 4 sets of electronic music pieces at 120 BPM, with each set containing 8 pieces (2 per key), all conditioned on the same section sequence. In total, we generate 32 compositions for each method.
Method           FAD_VGGish ↓   FAD_CLAP ↓   ILS_MERT ↑    ILS_MS ↑      ILS_WF ↑
TOMI             3.51           0.38         0.28 ± 0.12   0.36 ± 0.33   1.14 ± 0.73
MusicGen         5.31           0.62         0.06 ± 0.04   0.12 ± 0.07   0.28 ± 0.09
Standalone LLM   5.84           0.46         0.16 ± 0.11   0.10 ± 0.12   0.09 ± 0.16
Random           6.92           0.47         0.16 ± 0.09   0.22 ± 0.16   0.48 ± 0.28
Table 2: Objective evaluation results of FAD with two models and ILS with three latent representations.

Figure 3: ILS similarity matrices example, where all four compositions are generated under the section sequence: i (intro), v1 (verse 1), p1 (pre-chorus 1), c1 (chorus 1), v2 (verse 2), p2 (pre-chorus 2), c2 (chorus 2), b (bridge), c3 (chorus 3), and o (outro). Darker colors indicate higher segment similarity. The blocks marked as yellow-edge boxes are same-label similarities, with diagonal values marked as red lines being excluded, and the remaining parts are different-label similarities.
4.3 Objective Evaluation
We evaluate the music quality and structural consistency of generated compositions. For music quality, we use the Fréchet Audio Distance metric (FAD) [32, 33] with a VGGish model [34] and a CLAP model [35] to compare human-composed electronic music and the music generated by each method. We randomly collected 329 songs as references from the Spotify Mint playlist [36], one of the most popular curated playlists dedicated to the electronic music genre. A lower FAD score indicates that the generated music is closer to human-composed music in quality.

For structural consistency, we use the Inter-Phrase Latent Similarity metric (ILS) refined from [9]. ILS aims to compute a self-similarity matrix of musical features and evaluates if the average similarity between segments sharing the same phrase label is higher than those with different labels. Instead of measuring the ratio of same-label to overall similarities in the original metric, we use Cohen’s d [37] to compute the effect size of the difference between same-label and different-label similarities, excluding diagonal elements to avoid biases. This offers a scale-independent measure that robustly captures the separation between groups. We extract latent representations from audio while preserving temporal structure, then compute a self-similarity matrix of them using cosine similarity. We evaluate ILS with MERT embeddings [38], Mel spectrograms, and raw waveforms, denoted as $\mathrm{ILS}_{\mathrm{MERT}}$, $\mathrm{ILS}_{\mathrm{MS}}$, and $\mathrm{ILS}_{\mathrm{WF}}$, respectively. To compute the ILS score, we denote the cosine similarity between time-related latent elements $i$ and $j$ as $S_{ij}$, the phrase label of element $i$ as $l_i$, and the quantities of same-label and different-label elements as $N_{\mathrm{same}}$ and $N_{\mathrm{diff}}$, respectively, excluding diagonal elements. A higher ILS score indicates better and more effective structural consistency. The formula is defined as:
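$$\mathrm{ILS} = \frac{\bar{X}_{\mathrm{same}} - \bar{X}_{\mathrm{diff}}}{s}, \tag{1}$$

where $\bar{X}_{\mathrm{same}}$ and $\bar{X}_{\mathrm{diff}}$ are the mean similarities of same-label and different-label groups, respectively, defined as:

$$\bar{X}_{\mathrm{same}} = \frac{\sum_{i \neq j} S_{ij} \cdot \delta(l_i = l_j)}{N_{\mathrm{same}}}, \tag{2}$$

$$\bar{X}_{\mathrm{diff}} = \frac{\sum_{i, j} S_{ij} \cdot \delta(l_i \neq l_j)}{N_{\mathrm{diff}}}, \tag{3}$$

and $s$ is the pooled standard deviation derived from the variances $s^2_{\mathrm{same}}$ and $s^2_{\mathrm{diff}}$ of the two groups, defined as:

$$s = \sqrt{\frac{(N_{\mathrm{same}} - 1)\, s^2_{\mathrm{same}} + (N_{\mathrm{diff}} - 1)\, s^2_{\mathrm{diff}}}{N_{\mathrm{same}} + N_{\mathrm{diff}} - 2}}. \tag{4}$$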
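For readers who want to check the metric numerically, the following sketch computes the ILS score from a precomputed self-similarity matrix and a list of phrase labels. It uses only the standard library; the function name and the choice of the sample variance (dividing by one less than the group size) for each group are our assumptions, and the feature extraction step is left out.

```python
import math
from statistics import mean, variance


def ils_score(similarity, labels):
    """Cohen's d between same-label and different-label similarities.

    similarity: square matrix (list of lists), similarity[i][j] = S_ij.
    labels: phrase label of each element; diagonal entries are ignored.
    """
    same, diff = [], []
    n = len(labels)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # exclude the diagonal
            group = same if labels[i] == labels[j] else diff
            group.append(similarity[i][j])
    n_same, n_diff = len(same), len(diff)
    if n_same < 2 or n_diff < 2:
        raise ValueError("each group needs at least two entries")
    # Pooled standard deviation of the two groups (sample variances).
    pooled = math.sqrt(
        ((n_same - 1) * variance(same) + (n_diff - 1) * variance(diff))
        / (n_same + n_diff - 2)
    )
    # Raises ZeroDivisionError if every entry in both groups is identical.
    return (mean(same) - mean(diff)) / pooled


# Toy example: two "a" segments and two "b" segments.
toy_labels = ["a", "a", "b", "b"]
toy_similarity = [
    [1.0, 0.9, 0.1, 0.3],
    [0.9, 1.0, 0.1, 0.3],
    [0.1, 0.1, 1.0, 0.7],
    [0.3, 0.3, 0.7, 1.0],
]
toy_score = ils_score(toy_similarity, toy_labels)
```

In the toy example the same-label mean is 0.8, the different-label mean is 0.2, and the pooled variance is 0.012, so the score equals 0.6 divided by the square root of 0.012.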
We show the FAD scores and ILS with means and standard deviations in Table 2. The results show that our method achieves the lowest FAD scores and the highest ILS scores across all measurements, outperforming the baseline methods in both music quality and structural consistency. Notably, the ablation results show that even with high-quality music samples, lacking sufficient arrangement logic can still lead to poor FAD scores. This proves our approach’s ability to generate music with improved flow naturalness and arrangement coherence while preserving the expected long-term structures. Figure 3 provides a comparative visualization of ILS matrices of four compositions, with phrase labels on both axes, yellow-edged boxes highlighting regions of same-label similarities, the red line marking the main diagonal, and the remaining areas representing different-label similarities.
Figure 4: Subjective evaluation results of mean score with within-subject confidence intervals, where p1-p4 correspond to four parts of our survey: p1. Local Music Quality, p2. Consistency Among Same-Label Phrases, p3. Contrast Between Different-Label Phrases, and p4. Overall Full-Song Evaluation.

4.4 Subjective Evaluation
We conduct a double-blind online survey to compare the music quality and structural consistency of compositions generated by the four methods. Since each composition is a full 3- to 4-minute song, which is too long to be evaluated in a single question, we assess four compositions (one per method) throughout the survey. To ensure a comprehensive evaluation, we create three distinct question sets with the same survey structure but different compositions, resulting in a total of 12 compositions being evaluated. The survey is divided into four parts, each containing 4 to 12 questions, aiming to evaluate the piece from section-level music quality to overall structural alignment. All metrics are measured on a 5-point rating scale. The details are as follows:
Part 1. Local Music Quality: This part consists of 3 subparts, each selecting the same section (e.g., intro) from each composition. Participants rate Naturalness to evaluate the similarity to human-composed music and the conformity to the typical electronic music style.
Part 2. Consistency Among Same-Label Phrases: This part consists of 2 subparts, each selecting two sections with the same phrase label (e.g., verse 1 and verse 2) from each composition. Participants rate Similarity between the two sections.
Part 3. Contrast Between Different-Label Phrases: This part consists of 2 subparts, each selecting two consecutive sections from the same position in each composition (e.g., intro and verse 1). Participants rate Transition Naturalness based on boundary clearness, transition smoothness, and phrase alignment.
Part 4. Overall Full-Song Evaluation: This part shows the complete composition audios. Participants rate Structure Clarity for how well each section aligns with the given structure, Creativity for how creative the music is, Naturalness for how humanlike the music sounds, and Musicality for the overall music quality.
We insert an intermediate page between evaluation parts to inform participants of their progress and help reduce listening fatigue. Additionally, we show the input conditions for music generation on each question page to remind participants of the expected song structure and tonality. We distributed the survey on multiple social media platforms and received a total of 73 responses. The demographic statistics of participants are as follows:
Age (<18: 0%, 18-29: 69.86%, 30-44: 19.18%, 45-59: 5.48%, ≥60: 5.48%);
Gender (Female: 30.14%, Male: 65.75%, Non-binary: 2.74%, Prefer not to say: 1.37%);
Music background (Amateur: 34.25%, Intermediate: 41.1%, Professional: 24.66%);
Years spent on studying music (None: 26.03%, 1 year: 6.85%, 2 years: 10.96%, 3-5 years: 13.7%, 6-10 years: 16.44%, >10 years: 26.03%).
The subjective evaluation results are shown in Figure 4, where the bar height represents the mean score, and the error bar represents the confidence intervals computed by within-subject ANOVA. The results show that our method significantly outperforms the baseline in most subjective metrics, proving the effectiveness of our system in generating high-quality electronic music with solid long-term structural consistency.
5. CONCLUSION AND FUTURE WORK
We contribute TOMI, a concept hierarchy paradigm for music representation, and combine it with an ICL approach to achieve the first system for generating long-term, multi-track electronic music with both MIDI and audio clips. Experimental results show that our approach achieves high-quality generation with robust structural consistency. In addition, we integrate it with REAPER to support audio rendering and co-creation. However, harmonic coherence in our results can occasionally be disrupted by randomness and limited features during sample retrieval. Our system currently selects samples from local collections using a small feature set, which can lead to empty results or highly divergent samples. To improve this, we plan to integrate generative models for clips or use ML-based embedding models for MIDI and audio. Then, we aim to extend our model with a more sophisticated structural hierarchy to support advanced sound design and mixing. Lastly, we plan to train a TOMI-based neural network on music project files to enhance generation quality and scalability.
6. REFERENCES
[1] Y. Huang, A. Ghatare, Y. Liu, Z. Hu, Q. Zhang, C. S. Sastry, S. Gururani, S. Oore, and Y. Yue, “Symbolic music generation with non-differentiable rule guided diffusion,” in Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. [Online]. Available: https://openreview.net/forum?id=g8AigOTNXL
[2] L. Min, J. Jiang, G. Xia, and J. Zhao, “Polyffusion: A diffusion model for polyphonic score generation with internal and external controls,” in Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR, 2023.
[3] S. Forsgren and H. Martiros, “Riffusion-stable diffusion for real-time music generation,” URL https://riffusion.com/about, 2022.
[4] G. Hadjeres, F. Pachet, and F. Nielsen, “DeepBach: a steerable model for Bach chorales generation,” in Proceedings of the 34th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, D. Precup and Y. W. Teh, Eds., vol. 70. International Convention Centre, Sydney, Australia: PMLR, 06–11 Aug 2017, pp. 1362–1371. [Online]. Available: http://proceedings.mlr.press/v70/hadjeres17a.html
[5] A. Agostinelli, T. I. Denk, Z. Borsos, J. H. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. H. Frank, “Musiclm: Generating music from text,” CoRR, vol. abs/2301.11325, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2301.11325
[6] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., 2023.
[7] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” CoRR, vol. abs/2005.00341, 2020. [Online]. Available: https://arxiv.org/abs/2005.00341
[8] R. Yuan, H. Lin, S. Guo, G. Zhang, J. Pan, Y. Zang,
H. Liu, Y. Liang, W. Ma, X. Du, X. Du, Z. Ye,
T. Zheng, Y. Ma, M. Liu, Z. Tian, Z. Zhou, L. Xue,
X. Qu, Y. Li, S. Wu, T. Shen, Z. Ma, J. Zhan,
C. Wang, Y. Wang, X. Chi, X. Zhang, Z. Yang,
X. Wang, S. Liu, L. Mei, P. Li, J. Wang, J. Yu,
G. Pang, X. Li, Z. Wang, X. Zhou, L. Yu, E. Benetos,
Y. Chen, C. Lin, X. Chen, G. Xia, Z. Zhang, C. Zhang,
W. Chen, X. Zhou, X. Qiu, R. Dannenberg, J. Liu,
J. Yang, W. Huang, W. Xue, X. Tan, and Y. Guo,
“Yue: Scaling open foundation models for long-form music generation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.08638
[9] Z. Wang, L. Min, and G. Xia, “Whole-song
hie a chical gene a ion o symbolic music using
cascaded di usion models,” in The Twel h In-
e na ional Con e ence on Lea ning Rep esen a-
ions, ICLR 2024, Vienna, Aus ia, May 7-11,
2024. OpenRe iew.ne , 2024. [Online]. A ailable:
h ps://open e iew.ne / o um?id=sn7CYWya h
[10] S. Dai, Z. Jin, C. Gomes, and R. B. Dannenbe g,
“Con ollable deep melody gene a ion ia hie a chical
music s uc u e ep esen a ion,” in P oceedings o
he 22nd In e na ional Socie y o Music In o ma ion
Re ie al Con e ence, ISMIR 2021, Online, No embe
7-12, 2021, J. H. Lee, A. Le ch, Z. Duan, J. Nam,
P. Rao, P. an K anenbu g, and A. S ini asamu hy,
Eds., 2021, pp. 143–150. [Online]. A ailable: h ps:
//a chi es.ismi .ne /ismi 2021/pape /000017.pd
[11] A. Robe s, J. H. Engel, C. Ra el, C. Haw ho ne,
and D. Eck, “A hie a chical la en ec o model o
lea ning long- e m s uc u e in music,” in P oceedings
o he 35 h In e na ional Con e ence on Machine
Lea ning, ICML 2018, S ockholmsmässan, S ockholm,
Sweden, July 10-15, 2018, se . P oceedings o
Machine Lea ning Resea ch, J. G. Dy and A. K ause,
Eds., ol. 80. PMLR, 2018, pp. 4361–4370.
[Online]. A ailable: h p://p oceedings.ml .p ess/ 80/
obe s18a.h ml
[12] S. Dai, H. Yu, and R. B. Dannenbe g, “Wha
is missing in deep music gene a ion? A s udy
o epe i ion and s uc u e in popula music,” in
P oceedings o he 23 d In e na ional Socie y o
Music In o ma ion Re ie al Con e ence, ISMIR 2022,
Bengalu u, India, Decembe 4-8, 2022, P. Rao,
H. A. Mu hy, A. S ini asamu hy, R. M. Bi ne ,
R. C. Repe o, M. Go o, X. Se a, and M. Mi on,
Eds., 2022, pp. 659–666. [Online]. A ailable: h ps:
//a chi es.ismi .ne /ismi 2022/pape /000079.pd
[13] D. Cope, “Recombinant music: Using the computer to
explore musical style,” Computer, vol. 24, no. 7, pp.
22–28, 1991. [Online]. Available:
https://doi.org/10.1109/2.84830
[14] K. Chen, W. Zhang, S. Dubnov, G. Xia, and W. Li,
“The effect of explicit structure encoding of deep
neural networks for symbolic music generation,” in
2019 International Workshop on Multilayer Music
Representation and Processing (MMRP). IEEE,
Jan. 2019, pp. 77–84. [Online]. Available:
http://dx.doi.org/10.1109/MMRP.2019.00022
[15] H. Chen, J. B. L. Smith, J. Spijkervet, J. Wang,
P. Zou, B. Li, Q. Kong, and X. Du, “Sympac:
Scalable symbolic music generation with prompts and
constraints,” in Proceedings of the 25th International
Society for Music Information Retrieval Conference,
ISMIR 2024, San Francisco, California, USA and
Online, November 10-14, 2024, B. Kaneshiro, G. J.
Mysore, O. Nieto, C. Donahue, C. A. Huang,
J. H. Lee, B. McFee, and M. C. McCallum,
Eds., 2024, pp. 1029–1036. [Online]. Available:
https://doi.org/10.5281/zenodo.14877507
[16] F. Pachet, A. Papadopoulos, and P. Roy, “Sampling
variations of sequences for structured music
generation,” in Proceedings of the 18th International
Society for Music Information Retrieval Conference,
ISMIR 2017, Suzhou, China, October 23-27, 2017,
S. J. Cunningham, Z. Duan, X. Hu, and D. Turnbull,
Eds., 2017, pp. 167–173. [Online]. Available:
https://archives.ismir.net/ismir2017/paper/000050.pdf
[17] Y. Zou, P. Zou, Y. Zhao, K. Zhang, R. Zhang, and
X. Wang, “Melons: Generating melody with long-term
structure using transformers and structure graph,” in
IEEE International Conference on Acoustics, Speech
and Signal Processing, ICASSP 2022, Virtual and
Singapore, 23-27 May 2022. IEEE, 2022, pp.
191–195. [Online]. Available:
https://doi.org/10.1109/ICASSP43922.2022.9747802
[18] M. Malandro, “Composer’s Assistant: An Interactive
Transformer for Multi-Track MIDI Infilling,” in Proc.
24th Int. Society for Music Information Retrieval Conf.,
Milan, Italy, 2023, pp. 327–334.
[19] M. Malandro, “Composer’s Assistant 2: Interactive
Multi-Track MIDI Infilling with Fine-Grained User Control,”
in Proc. 25th Int. Society for Music Information
Retrieval Conf., San Francisco, CA, USA, 2024,
pp. 438–445.
[20] R. Yuan, H. Lin, Y. Wang, Z. Tian, S. Wu,
T. Shen, G. Zhang, Y. Wu, C. Liu, Z. Zhou,
L. Xue, Z. Ma, Q. Liu, T. Zheng, Y. Li, Y. Ma,
Y. Liang, X. Chi, R. Liu, Z. Wang, C. Lin, Q. Liu,
T. Jiang, W. Huang, W. Chen, J. Fu, E. Benetos,
G. Xia, R. B. Dannenberg, W. Xue, S. Kang,
and Y. Guo, “Chatmusician: Understanding and
generating music intrinsically with LLM,” in Findings
of the Association for Computational Linguistics,
ACL 2024, Bangkok, Thailand and virtual meeting,
August 11-16, 2024, L. Ku, A. Martins, and
V. Srikumar, Eds. Association for Computational
Linguistics, 2024, pp. 6252–6271. [Online]. Available:
https://doi.org/10.18653/v1/2024.findings-acl.373
[21] X. Qu, Y. Bai, Y. Ma, Z. Zhou, K. M. Lo,
J. Liu, R. Yuan, L. Min, X. Liu, T. Zhang, X. Du,
S. Guo, Y. Liang, Y. Li, S. Wu, J. Zhou, T. Zheng,
Z. Ma, F. Han, W. Xue, et al., “Mupt: A
generative symbolic music pretrained transformer,” in
The Thirteenth International Conference on Learning
Representations, ICLR 2025, Singapore, April 24-28,
2025. OpenReview.net, 2025. [Online]. Available:
https://openreview.net/forum?id=iAK9oHp4Zz
[22] ABC Wiki, “The abc notation system,” 2021. [Online].
Available: https://abcwiki.org/abc:syntax
[23] H. Touvron, T. Lavril, G. Izacard, X. Martinet,
M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
E. Hambro, F. Azhar, A. Rodriguez, A. Joulin,
E. Grave, and G. Lample, “Llama: Open and
efficient foundation language models,” 2023. [Online].
Available: https://arxiv.org/abs/2302.13971
[24] P. Lu, X. Xu, C. Kang, B. Yu, C. Xing, X. Tan,
and J. Bian, “Musecoco: Generating symbolic
music from text,” 2023. [Online]. Available:
https://arxiv.org/abs/2306.00110
[25] H.-W. Dong, K. Chen, S. Dubnov, J. McAuley,
and T. Berg-Kirkpatrick, “Multitrack music
transformer,” in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2023.
[26] J. Thickstun, D. L. W. Hall, C. Donahue, and P. Liang,
“Anticipatory music transformer,” Trans. Mach. Learn.
Res., vol. 2024, 2024. [Online]. Available:
https://openreview.net/forum?id=EBNJ33Fcrl
[27] S. Liu, A. S. Hussain, C. Sun, and Y. Shan,
“Music understanding llama: Advancing text-to-music
generation with question answering and captioning,”
in IEEE International Conference on Acoustics,
Speech and Signal Processing, ICASSP 2024, Seoul,
Republic of Korea, April 14-19, 2024. IEEE,
2024, pp. 286–290. [Online]. Available:
https://doi.org/10.1109/ICASSP48485.2024.10447027
[28] Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle,
and B. Catanzaro, “Audio flamingo: A novel audio
language model with few-shot learning and dialogue
abilities,” in Forty-first International Conference on
Machine Learning, ICML 2024, Vienna, Austria, July
21-27, 2024. OpenReview.net, 2024. [Online]. Available:
https://openreview.net/forum?id=WYi3WKZjYe
[29] OpenAI (2024), “Gpt-4o system card,” CoRR, vol.
abs/2410.21276, 2024. [Online]. Available:
https://doi.org/10.48550/arXiv.2410.21276
[30] D. R. Hipp, “Sqlite,” 2004. [Online]. Available:
https://www.sqlite.org
[31] ADSR, “Adsr sample manager,” n.d. [Online]. Available:
https://www.adsrsounds.com/product/software/adsr-sample-manager/
[32] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi,
“Fréchet audio distance: A reference-free metric for
evaluating music enhancement algorithms,” in 20th
Annual Conference of the International Speech
Communication Association, Interspeech 2019, Graz,
Austria, September 15-19, 2019, G. Kubin and Z. Kacic,
Eds. ISCA, 2019, pp. 2350–2354. [Online]. Available:
https://doi.org/10.21437/Interspeech.2019-2219
[33] H. H. Tan, “Frechet audio distance in pytorch,”
https://github.com/gudgud96/frechet-audio-distance, 2022.
[34] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F.
Gemmeke, A. Jansen, R. C. Moore, M. Plakal,
D. Platt, R. A. Saurous, B. Seybold, M. Slaney,
R. J. Weiss, and K. W. Wilson, “CNN architectures
for large-scale audio classification,” in 2017 IEEE
International Conference on Acoustics, Speech and
Signal Processing, ICASSP 2017, New Orleans, LA,
USA, March 5-9, 2017. IEEE, 2017, pp. 131–135.
[Online]. Available:
https://doi.org/10.1109/ICASSP.2017.7952132
[35] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick,
and S. Dubnov, “Large-scale contrastive
language-audio pretraining with feature fusion and
keyword-to-caption augmentation,” in IEEE International
Conference on Acoustics, Speech and Signal
Processing ICASSP 2023, Rhodes Island, Greece, June
4-10, 2023. IEEE, 2023, pp. 1–5. [Online]. Available:
https://doi.org/10.1109/ICASSP49357.2023.10095969
[36] J. Joven, “Mint fresh: Spotify makes
a new home for edm,” 2018. [Online]. Available:
https://hmc.chartmetric.com/mint-fresh-spotify-makes-a-new-home-for-edm/
[37] J. Cohen, Statistical Power Analysis for the Behavioral
Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum
Associates, 1988.
[38] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen,
H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos,
N. Gyenge, R. B. Dannenberg, R. Liu, W. Chen,
G. Xia, Y. Shi, W. Huang, Z. Wang, Y. Guo,
and J. Fu, “MERT: acoustic music understanding
model with large-scale self-supervised training,” in
The Twelfth International Conference on Learning
Representations, ICLR 2024, Vienna, Austria, May 7-11,
2024. OpenReview.net, 2024. [Online]. Available:
https://openreview.net/forum?id=w3YZ9MSlBu
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025