
MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling

Author: Jingjing Tang; Xin Wang; Zhe Zhang; Junichi Yamagishi; Geraint Wiggins; George Fazekas
Publisher: Zenodo
DOI: 10.5281/zenodo.17706539
Source: https://zenodo.org/records/17706539/files/000072.pdf
Jingjing Tang¹, Xin Wang², Zhe Zhang², Junichi Yamagishi², Geraint Wiggins¹,³, György Fazekas¹
¹ Centre for Digital Music, Queen Mary University of London, UK
² National Institute of Informatics, Japan
³ Vrije Universiteit Brussel, Belgium
ABSTRACT
Generating expressive audio performances from music scores requires models to capture both instrument acoustics and human interpretation. Traditional music performance synthesis pipelines follow a two-stage approach, first generating expressive performance MIDI from a score, then synthesising the MIDI into audio. However, the synthesis models often struggle to generalise across diverse MIDI sources, musical styles, and recording environments. To address these challenges, we propose MIDI-VALLE, a neural codec language model adapted from the VALLE framework, which was originally designed for zero-shot personalised text-to-speech (TTS) synthesis. For performance MIDI-to-audio synthesis, we improve the architecture to condition on a reference audio performance and its corresponding MIDI. Unlike previous TTS-based systems that rely on piano rolls, MIDI-VALLE encodes both MIDI and audio as discrete tokens, facilitating a more consistent and robust modelling of piano performances. Furthermore, the model’s generalisation ability is enhanced by training on an extensive and diverse piano performance dataset. Evaluation results show that MIDI-VALLE significantly outperforms a state-of-the-art baseline, achieving over 75% lower Fréchet Audio Distance on the ATEPP and Maestro datasets. In the listening test, MIDI-VALLE received 202 votes compared to 58 for the baseline, demonstrating improved synthesis quality and generalisation across diverse performance MIDI inputs.
1. INTRODUCTION
Music performance synthesis (MPS) refers to the process of generating expressive audio performances from music scores. This task requires models to capture acoustic characteristics of musical instruments and infuse human-like expressiveness into music scores. While considerable progress has been made in modelling these aspects separately, an effective MPS system is expected to integrate both dimensions to achieve high-quality synthesis.

© J. Tang, X. Wang, Z. Zhang, J. Yamagishi, G. Wiggins and G. Fazekas. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. Tang, X. Wang, Z. Zhang, J. Yamagishi, G. Wiggins and G. Fazekas, “MIDI-VALLE: Improving Expressive Piano Performance Synthesis Through Neural Codec Language Modelling”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

A common approach to MPS involves a two-stage pipeline consisting of an expressive performance rendering (EPR) model, which generates expressive performance MIDI from a score, and an expressive performance synthesis (EPS) model, which converts performance MIDI into audio [1–3]. In recent works on developing EPS models [2–5], the task has been recognised as analogous to speech synthesis, as both generate audio from symbolic representations. This parallel motivated researchers to apply advanced techniques from the text-to-speech (TTS) domain to address the challenges in EPS. Previous studies have demonstrated the effectiveness of TTS techniques, such as WaveNet [6] and acoustical models with vocoders [2,4,5], in synthesising performance MIDI to audio. However, due to limited training data diversity and constrained architecture design, these models struggle to generalise across acoustic environments and timbre variations, limiting the expressiveness and realism of their outputs. Moreover, when integrating these EPS systems with EPR models, discrepancies in the way EPR and EPS models process and represent MIDI data introduce inconsistencies. These discrepancies often result in the loss of fine-grained temporal details, leading to reduced synthesis quality.
To address the limitations, we introduce a novel EPS model, MIDI-VALLE¹, adapted from VALLE, a state-of-the-art TTS framework for zero-shot personalised speech synthesis [7]. The VALLE model conditions synthesis on speaker-specific audio prompts, enabling zero-shot adaptation to unseen speakers. We optimise this architecture for performance MIDI-to-audio synthesis by conditioning on a reference audio performance and its corresponding MIDI representation. Instead of using the Maestro dataset [6], which contains recorded performance MIDI and audio pairs, we train MIDI-VALLE on ATEPP [8], a large and more diverse dataset comprising transcribed performance MIDI and audio pairs. This allows the model to learn from a broader range of musical expressions, improving generalisation across unseen MIDI sources, composition styles, and recording environments. As demonstrated by both objective and subjective evaluation results, MIDI-VALLE shows enhanced adaptability and robustness in handling diverse performance inputs compared to previous state-of-the-art EPS models.

¹ Demo and code are available at https://tangjjbetsy.github.io/MIDI-VALLE/

Moreover, to fully leverage neural codec language modelling, we tokenise performance MIDI and audio using the Octuple MIDI tokenisation method [9] and a high-fidelity audio codec model [10], which ensures accurate reconstruction from audio tokens. Compared to traditional piano-roll and spectrogram representations, this discrete token-based approach ensures a more consistent alignment between MIDI and audio. The results from the listening test demonstrate that MIDI-VALLE, when integrated with different EPR models in a two-stage MPS pipeline, provides a more robust and adaptable synthesis framework.
2. RELATED WORKS
2.1 Expressive Performance Synthesis
In the EPS domain, several studies have explored various approaches to MPS, including DDSP-based modelling [1, 11] and TTS-inspired models [2–5]. These TTS-inspired models typically process piano performance MIDIs as piano rolls for audio synthesis. Hawthorne et al. [6] employed WaveNet to map piano rolls directly to waveforms. More recent works [2–5] adapted transformer-based TTS models [12, 13] to first convert piano rolls into intermediate acoustic representations, such as spectrograms. These representations were subsequently transformed into waveforms using vocoders like HiFi-GAN [14]. These EPS models were mainly trained on the Maestro dataset [6], which consists of recorded MIDI and audio pairs from piano competitions. Although the dataset includes performances of diverse compositions, they were recorded in a relatively homogeneous acoustic environment. This lack of acoustic variety limits the ability of models trained on the dataset to generalise to more varied acoustic conditions. Tang et al. [3] attempted to fine-tune a state-of-the-art model [5] using the ATEPP [8] dataset, which features recordings captured in a broader range of acoustic settings. However, the fine-tuned model still struggled to produce consistent ambient sounds, applying mismatched or inconsistent room reverberation and background noise.
A key challenge in creating a two-stage pipeline for MPS is the difference in MIDI representations used by EPR and EPS models, particularly in temporal information. EPS models typically use piano-roll representations, while EPR models either tokenise MIDI [3, 15, 16] into discrete events or encode continuous features [17–19] like timing and velocity. These differences complicate MIDI conversion between stages, especially with note timing and pedal treatment. Consequently, performance MIDI generated from EPR models differs significantly from the performance MIDI used in EPS models, making direct integration impractical without additional fine-tuning. More details are discussed in Section 5.2 and Section 6.
2.2 Neural Codec Language Modelling for Audio Generation
Recent advances in audio and music generation have leveraged neural codec language models to address the challenges of generalising across diverse acoustics and music styles. The state-of-the-art text-to-audio [20] and text-to-music [21, 22] models use codec models like Encodec [10] and SoundStream [23] to compress audio into discrete tokens, enabling more efficient training on large-scale datasets by reducing computational costs. In the TTS domain, the VALLE model [7], inspired by AudioLM [20], uses Encodec to synthesise high-quality speech while preserving speaker-specific features. By replacing mel-spectrograms with compressed audio codec tokens, VALLE formulates TTS as conditional codec language modelling. This enables effective zero-shot timbre adaptation and preservation of speaker emotion and the acoustic environment encoded in the reference prompt. Building on this approach, we extend codec language modelling to piano performance synthesis by tokenising both performance MIDI and audio, demonstrating the effectiveness of codec language modelling in synthesising expressive piano performances.
3. MIDI-VALLE FOR PIANO SYNTHESIS
Our MIDI-VALLE model focuses on performance MIDI-to-audio synthesis, drawing parallels to text-to-speech synthesis by VALLE. The following sections discuss the tokenisation strategies and key architectural differences between MIDI-VALLE and VALLE, highlighting the similarities and distinctions between speech and music synthesis.
3.1 Tokenisation
3.1.1 Audio Tokenisation
Instead of the original Encodec model [10], we follow the audio tokenisation approach applied in MusicGen [21]. Specifically, we fine-tune a four-level residual vector quantisation (RVQ) [24] to generate four codebooks that represent the audio samples. In RVQ, each quantiser encodes the residual error from the previous one, creating interdependencies among the codebooks. As observed in [10,25], the first codebook encodes the primary acoustic information, while the subsequent codebooks refine the output by modelling fine details. The fine-tuned codec, Piano-Encodec, converts audio performances into discrete tokens while preserving high-fidelity acoustics and timbral characteristics. The decoder then reconstructs the audio from these tokens.
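
The residual structure of RVQ can be illustrated with a toy example. The following is a minimal sketch in plain Python, not the Piano-Encodec implementation: the two-dimensional input, the small hand-written codebooks, and the helper names (`nearest`, `rvq_encode`, `rvq_decode`) are assumptions made purely for illustration. A real codec learns its codebooks and applies them to encoder embeddings rather than raw vectors.

```python
from typing import List, Sequence

Vector = Sequence[float]
Codebook = List[Vector]


def nearest(codebook: Codebook, v: Vector) -> int:
    """Return the index of the codebook entry closest to v (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)),
    )


def rvq_encode(v: Vector, codebooks: List[Codebook]) -> List[int]:
    """Quantise v level by level; each level encodes the previous level's residual."""
    residual = list(v)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes: List[int], codebooks: List[Codebook]) -> List[float]:
    """Reconstruct the vector by summing the selected entry of every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy two-level setup (hypothetical values): a coarse codebook and a finer one.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]

codes = rvq_encode([1.2, 0.8], [coarse, fine])
approx_one_level = rvq_decode(codes[:1], [coarse])
approx_two_levels = rvq_decode(codes, [coarse, fine])
```

In this toy case the first level captures the bulk of the vector and the second level reduces the remaining error, which mirrors the observation that the first codebook carries the primary acoustic information.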
3.1.2 MIDI Tokenisation
Classical piano music and speech differ significantly in complexity and structure. Classical music features intricate frequency patterns and precise timing, making segmentation challenging due to issues with note separation, timing accuracy, and managing prolonged note durations caused by pedalling. In contrast, speech is simpler, with clear segmentation based on phoneme boundaries and greater tolerance to timing variations.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 1: Overview of the MIDI-VALLE architecture, adapted from VALLE [7]. Audio prompt is a 3-second segment selected from a reference performance. The text prompt in VALLE is replaced with the corresponding MIDI prompt concatenated with the target MIDI for synthesis.

Music                                 Speech
Pitch  Vel  Dur   IOI  Pos  Bar
92     68   1156  772  388  20        512
Table 1: Vocabulary sizes for musical features and speech.

We employ the Octuple MIDI tokenisation method [9], as utilised in [3], to achieve a consistent and discrete representation of piano performances within the EPR and EPS systems. Unlike methods such as Compound Word [26] or REMI [27], the Octuple approach uses distinct vocabularies for each musical feature, enabling note-wise encoding and resulting in a K×N array (number of features × number of notes). This method reduces vocabulary size and structural complexity, resulting in shorter token sequences without needing to group note features [9]. We extend the Octuple method by tokenising the inter-onset interval (IOI) to capture onset timing differences between consecutive notes. The MIDI tokenisation offers advantages over the piano-roll representation used in prior studies [3–5, 11]. Piano-roll encodes only note onsets and durations on a fixed temporal grid, lacking the resolution and flexibility to capture subtle timing variations that significantly influence articulation.
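
To make the note-wise K×N layout concrete, the sketch below builds such an array for a handful of notes. It is an illustration under stated assumptions rather than the tokeniser used in the paper: the `Note` type, the 10 ms time resolution (`TICKS_PER_SECOND`), and the use of raw quantised values instead of vocabulary indices are all hypothetical, and only four of the six features are shown because position and bar require beat information that is not specified here.

```python
from typing import List, NamedTuple


class Note(NamedTuple):
    onset: float     # seconds
    duration: float  # seconds
    pitch: int       # MIDI pitch number
    velocity: int    # MIDI velocity


TICKS_PER_SECOND = 100  # hypothetical 10 ms time resolution


def tokenise(notes: List[Note]) -> List[List[int]]:
    """Return a K x N array: one row per feature, one column per note."""
    ordered = sorted(notes, key=lambda n: (n.onset, n.pitch))
    pitch, velocity, duration, ioi = [], [], [], []
    previous_onset = ordered[0].onset if ordered else 0.0
    for note in ordered:
        pitch.append(note.pitch)
        velocity.append(note.velocity)
        duration.append(round(note.duration * TICKS_PER_SECOND))
        # IOI: time elapsed since the onset of the previous note (0 within a chord).
        ioi.append(round((note.onset - previous_onset) * TICKS_PER_SECOND))
        previous_onset = note.onset
    return [pitch, velocity, duration, ioi]


notes = [
    Note(onset=0.00, duration=0.50, pitch=60, velocity=80),
    Note(onset=0.00, duration=0.50, pitch=64, velocity=70),
    Note(onset=0.25, duration=1.00, pitch=67, velocity=90),
]
tokens = tokenise(notes)
```

Each column of `tokens` describes one note, so the sequence length equals the number of notes rather than the number of time frames, which is the property that distinguishes this layout from a piano roll.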
Table 1 illustrates the structural and representational differences between MIDI and text tokens. MIDI tokens comprise multiple sequences that encode musical features such as pitch, velocity (Vel), duration (Dur), inter-onset interval (IOI), position (Pos), and bar, explicitly representing timing information. These sequences are processed through different embedding layers and concatenated for embedding pooling [26]. In contrast, speech text is tokenised by a single sequence of integers representing phonemes, with timing implicitly conveyed through token order. These differences in token representation are critical to the successful training of MIDI-VALLE.
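
The per-feature embedding and pooling step can be sketched as follows. This is a minimal illustration, assuming that pooling means concatenating the six feature embeddings and applying a single linear projection; the embedding size, the small model dimension, the random initialisation, and the names `tables`, `projection`, and `embed_note` are hypothetical. Only the vocabulary sizes come from Table 1.

```python
import random
from typing import Dict, List

# Vocabulary sizes taken from Table 1.
VOCAB = {"pitch": 92, "vel": 68, "dur": 1156, "ioi": 772, "pos": 388, "bar": 20}
EMBED_DIM = 4   # hypothetical per-feature embedding size
MODEL_DIM = 8   # hypothetical decoder input size, kept small for readability

rng = random.Random(0)


def random_matrix(rows: int, cols: int) -> List[List[float]]:
    """Stand-in for learned parameters: small random values."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


# One embedding table per feature, plus a linear projection for the pooled vector.
tables: Dict[str, List[List[float]]] = {
    name: random_matrix(size, EMBED_DIM) for name, size in VOCAB.items()
}
projection = random_matrix(len(VOCAB) * EMBED_DIM, MODEL_DIM)


def embed_note(note: Dict[str, int]) -> List[float]:
    """Look up each feature, concatenate, then project to the model dimension."""
    concatenated: List[float] = []
    for name in VOCAB:
        concatenated.extend(tables[name][note[name]])
    return [
        sum(concatenated[i] * projection[i][j] for i in range(len(concatenated)))
        for j in range(MODEL_DIM)
    ]


note = {"pitch": 60, "vel": 40, "dur": 100, "ioi": 25, "pos": 12, "bar": 3}
vector = embed_note(note)
```

Every note therefore contributes exactly one input vector to the decoder, regardless of how many features describe it.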
3.2 Model Design
Unlike VALLE, MIDI-VALLE is designed to preserve the timbral and acoustic characteristics of the reference piano performance. Given a piano performance dataset $D_p = \{x_i, y_i\}$, where $x = \{x_0, x_1, \ldots, x_L\}$ is a MIDI token sequence and $y$ is the corresponding audio segment, the audio is encoded into discrete acoustic codes using the pre-trained Piano-Encodec model: $\mathrm{encodec}(y) = C^{T \times 4}$, where $C$ is a two-dimensional codec matrix representation and $T$ is the codec sequence length. During training, the model learns to predict a codec matrix $\hat{C}$ from the input MIDI $x$ and an acoustic prompt matrix $\tilde{C}$, which is derived from the first three seconds of the corresponding audio. The synthesised audio is reconstructed by $\hat{y} = \mathrm{decodec}(\hat{C})$, aiming to approximate the original audio $y$. The model is trained to maximise the conditional likelihood $\max p(C \mid x, \tilde{C})$.
As shown in Figure 1, MIDI-VALLE follows a similar design to VALLE, comprising an autoregressive (AR) transformer decoder that predicts discrete tokens from the first quantiser, $c_{t,1}$, and a non-autoregressive (NAR) transformer decoder that generates codes for the remaining three quantisers, $c_{t,2:4}$. To process MIDI token sequences, both models employ the embedding pooling [26] technique to map concatenated embeddings of various musical features to match the required input size. The AR decoder takes the MIDI token sequence as input and autoregressively predicts audio codec tokens in a causal manner, without using any acoustic prompts. In contrast, the NAR decoder is conditioned on an acoustic prompt $\tilde{C}$, extracted from the first three seconds of the performance. This replaces the neighbouring-context strategy used in VALLE and better preserves musical coherence, as acoustic characteristics can change rapidly and vary significantly between segments. During NAR training, each token in the self-attention layer can attend to all input tokens. Despite their different decoding approaches, both the AR and NAR models share the same architecture: 12 attention layers, 16 attention heads, and hidden dimensions of size 1024.
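
The split between the two decoders can be written as a factorisation of the conditional likelihood. The equations below are a sketch that follows the usual VALLE-style formulation and the training setup described above (no acoustic prompt for the AR decoder, prompt $\tilde{C}$ for the NAR decoder). The per-level conditioning of the NAR term and the parameter symbols $\theta_{AR}$ and $\theta_{NAR}$ are assumptions rather than notation taken from the paper.

```latex
\begin{align}
p(c_{:,1} \mid x; \theta_{AR})
  &= \prod_{t=1}^{T} p\bigl(c_{t,1} \mid c_{1:t-1,1},\, x; \theta_{AR}\bigr), \\
p(c_{:,2:4} \mid x, \tilde{C}; \theta_{NAR})
  &= \prod_{j=2}^{4} p\bigl(c_{:,j} \mid c_{:,1:j-1},\, x,\, \tilde{C}; \theta_{NAR}\bigr).
\end{align}
```

The first product runs over time steps for the first quantiser only, while the second runs over quantiser levels, each of which is predicted for all time steps at once.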
During inference, the model inputs a target MIDI for synthesis and optionally accepts an audio prompt with its corresponding MIDI segment as the MIDI prompt. The audio prompt could be any 3-second excerpt from any recorded performance, and the associated MIDI prompt is concatenated to the beginning of the target MIDI before tokenisation. The impact of selecting different prompts is discussed in Section 6.2. The encoded audio prompt, if provided, is appended after the MIDI tokens in the input of the AR decoder for controlling the acoustic environment and timbral characteristics. The model then estimates the audio codec tokens for the target MIDI and reconstructs the corresponding audio performance using the Piano-Encodec.
4. EXPERIMENTS
4.1 Datasets
We used the ATEPP [8] dataset, excluding low-quality performances and their transcriptions. A total of 8,825 performance recordings were selected and split into training, validation, and test sets in an 8:1:1 ratio. The repertoire has around 700 hours of audio recordings from 1,099 albums, featuring 1,523 compositions by 25 composers, performed by 46 pianists. All the performances were segmented randomly into clips of 15-20 seconds. To ensure precise alignment between the audio segments and the corresponding MIDIs, notes were truncated at segmenting points, with the remainder continuing in the next segment if a note was interrupted. Due to limited pedalling transcription accuracy in the ATEPP dataset, pedal information was excluded during MIDI tokenisation, and note durations represent raw durations only, without sustain extension.
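
The note truncation rule at segment boundaries can be sketched as below. This is an assumed implementation for illustration only: the `(onset, offset, pitch)` tuple layout, the function name `split_notes`, and the fixed boundaries in the example are hypothetical, whereas the paper draws segment lengths randomly between 15 and 20 seconds.

```python
from typing import List, Tuple

Note = Tuple[float, float, int]  # (onset_sec, offset_sec, pitch)


def split_notes(notes: List[Note], boundaries: List[float]) -> List[List[Note]]:
    """Split notes into segments; times in each segment are relative to its start."""
    segments: List[List[Note]] = [[] for _ in range(len(boundaries) - 1)]
    for onset, offset, pitch in notes:
        for k in range(len(boundaries) - 1):
            start, end = boundaries[k], boundaries[k + 1]
            # Keep only the part of the note that overlaps this segment.
            clipped_onset, clipped_offset = max(onset, start), min(offset, end)
            if clipped_offset > clipped_onset:
                segments[k].append(
                    (clipped_onset - start, clipped_offset - start, pitch)
                )
    return segments


# The second note crosses the 15-second cut and is divided between both segments.
notes = [(1.0, 3.0, 60), (14.0, 17.0, 64)]
segments = split_notes(notes, boundaries=[0.0, 15.0, 30.0])
```

A note that straddles a cut point ends at the boundary in the earlier segment and restarts at time zero in the next one, so every segment's MIDI covers exactly the same time span as its audio.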
4.2 Implementation Details
The Encodec model [21] was fine-tuned using audio from the ATEPP dataset. All performances were converted into 32 kHz monophonic audio and encoded with a frame rate of 50 Hz. The extracted audio embeddings were quantised using RVQ with four quantisers, each having a codebook size of 2048. One-second audio segments were randomly sampled from the entire ATEPP dataset at each epoch, following the strategy proposed in [10]. Fine-tuning was carried out over 40 epochs on a Tesla A100 GPU for one day, with performance improvements discussed in Section 6.1.
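
These settings fix the size of the codec matrix for any clip. The short calculation below works through the arithmetic, assuming that the number of frames is simply duration times frame rate with no padding or edge frames; the constant and function names are illustrative.

```python
import math

SAMPLE_RATE_HZ = 32_000
FRAME_RATE_HZ = 50
NUM_QUANTISERS = 4
CODEBOOK_SIZE = 2048

# Audio samples summarised by one codec frame.
samples_per_frame = SAMPLE_RATE_HZ // FRAME_RATE_HZ
# Bits needed to index one codebook entry.
bits_per_code = int(math.log2(CODEBOOK_SIZE))
# Resulting token bitrate in bits per second.
bitrate_bps = FRAME_RATE_HZ * NUM_QUANTISERS * bits_per_code


def codec_shape(duration_sec: float) -> tuple:
    """Shape (T, 4) of the codec matrix for a clip of the given duration."""
    return (round(duration_sec * FRAME_RATE_HZ), NUM_QUANTISERS)


prompt_shape = codec_shape(3.0)    # 3-second acoustic prompt
shortest_clip = codec_shape(15.0)  # lower bound of the segment length
longest_clip = codec_shape(20.0)   # upper bound of the segment length
```

Under these assumptions a 3-second prompt corresponds to 150 frames and a training segment to between 750 and 1000 frames, each with four codes.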
Our MIDI-VALLE was implemented based on an unofficial version of VALLE [25], with training optimised using the ScaledAdam [28] optimiser and a base learning rate of 0.05. The learning rate was adjusted using the Eden scheduler, as described in [28]. The AR and NAR decoders were trained jointly, with gradients updated in the same step, converging after approximately 300k steps (2.5 days) on two Tesla A100 GPUs.
5. EVALUATION
5.1 Objective Metrics
To evaluate the performance of the proposed MIDI-VALLE system, we employ three objective metrics: Fréchet Audio Distance (FAD) [29, 30], spectrogram distortion, and chroma distortion. FAD measures the perceptual quality and realism of generated audio by comparing it to reference performances using embeddings extracted from Piano-Encodec. Adapted from [3], spectrogram distortion evaluates the fidelity of reconstructed acoustics and timbre, while chroma distortion evaluates harmonic consistency at the pitch class level. They are computed using the normalised root mean square error and mean absolute error, respectively.
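
The two distortion measures can be sketched on small matrices. The code below is an illustration, not the evaluation code used in the paper: the text does not state how the RMSE is normalised, so dividing by the value range of the reference is an assumption, and the function names and toy inputs are hypothetical. Feature extraction (spectrogram and chroma computation) is omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def _flatten(m: Matrix) -> List[float]:
    return [value for row in m for value in row]


def normalised_rmse(reference: Matrix, estimate: Matrix) -> float:
    """RMSE divided by the value range of the reference (assumed normalisation)."""
    ref, est = _flatten(reference), _flatten(estimate)
    mse = sum((r - e) ** 2 for r, e in zip(ref, est)) / len(ref)
    value_range = max(ref) - min(ref)  # must be non-zero for a meaningful result
    return math.sqrt(mse) / value_range


def mean_absolute_error(reference: Matrix, estimate: Matrix) -> float:
    """Average absolute difference over all matrix entries."""
    ref, est = _flatten(reference), _flatten(estimate)
    return sum(abs(r - e) for r, e in zip(ref, est)) / len(ref)


# Toy feature matrices (hypothetical values) that differ in a single entry.
reference = [[0.0, 2.0], [4.0, 2.0]]
estimate = [[0.0, 2.0], [4.0, 4.0]]
spec_distortion = normalised_rmse(reference, estimate)
chroma_distortion = mean_absolute_error(reference, estimate)
```

Both functions return zero for identical inputs and grow with the mismatch, so lower values indicate a closer match to the reference.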
Dataset      Genre      MIDI Type    RE⋆
ATEPP [8]    classical  Transcribed  Live & Studio
Maestro [6]  classical  Recorded     Competition
Pijama [31]  jazz       Transcribed  Live & Studio
Table 2: Comparison of the three piano solo datasets used for evaluation. ⋆RE stands for recording environment.
We evaluate MIDI-VALLE against the state-of-the-art TTS-based EPS system, M2A [3], using three datasets: ATEPP [8], Maestro [6], and Pijama [31]. The M2A system [3] was originally trained on the Maestro dataset and subsequently fine-tuned using a curated subset of 371 performances from ATEPP. All three datasets provide performance MIDIs paired with corresponding audio recordings. As presented in Table 2, ATEPP and Maestro are classical piano performance corpora, comprising transcribed and recorded performance MIDIs, respectively. The Pijama dataset contains transcribed jazz piano solos recorded in live and studio settings. For evaluation, we use only the test set of ATEPP and randomly select 100 performances from each dataset to ensure diverse compositional styles and recording conditions.
Besides human performance recordings, reconstructed audio from Piano-Encodec is used as an additional reference for calculating metrics. This helps evaluate how well MIDI-VALLE aligns with its training target: the codec representations extracted by Piano-Encodec. All performances are divided into 15-20 second segments, and the metrics are calculated by comparing the model-generated outputs to both the ground truth audio and the Piano-Encodec reconstructions.
5.2 Listening Tests
The listening test evaluates the synthesis quality of generations by MIDI-VALLE and M2A, and their compatibility with different EPR systems in a two-stage MPS pipeline, using two types of preference-based evaluations.
For the synthesis quality evaluation, participants are presented with a reference audio recording of human performances alongside two synthesised versions of the same musical excerpt, one generated by MIDI-VALLE and the other by M2A, both conditioned on the same performance MIDI. Participants are asked to select the version that more closely resembles the reference in terms of timbre, phrasing, and expressiveness. The stimuli are drawn from the three datasets used in the objective evaluation, consisting of 6 excerpts from ATEPP, 4 from Maestro, and 4 from Pijama. Each excerpt represents a distinct composition and performance, lasting approximately 15–20 seconds. In total, 14 pairwise comparisons are created for this evaluation.
For the system compatibility evaluation, participants are presented with two synthesised outputs generated by MIDI-VALLE and M2A, based on performance MIDIs produced by different EPR systems. Three EPR systems are considered: M2M [3], a Transformer-based model introduced alongside M2A as part of an MPS system; VirtuosoNet [18], which employs a hierarchical recurrent neural network (RNN) architecture; and DExter [19], a diffusion-based generative model. These systems differ in their MIDI processing and representations, particularly in how sustain pedalling is handled and how MIDI files are encoded. For example, both DExter and VirtuosoNet struggle to accurately model sustain pedal effects, which can result in unnatural note offset predictions. In contrast, M2M tokenises MIDI files but disregards pedalling effects, relying on M2A to synthesise these effects. In our listening test, participants are expected to indicate their preference based on the naturalness, clarity, and expressiveness of the synthesised audio of the same performance MIDI. For each EPR system, four 15–20 second excerpts of distinct compositions are selected, leading to 12 pairwise comparisons.
A total of 20 participants, almost all with over 2 years of music training, were recruited, with each evaluating half of the stimuli. This ensured that each stimulus was assessed by 9 to 11 participants, resulting in a total of 260 votes.
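
The vote total follows directly from the test design. The short calculation below reproduces it from the numbers given in this section; the variable names are illustrative, and the even split of stimuli between participants is taken at face value.

```python
quality_pairs = 6 + 4 + 4      # excerpts from ATEPP, Maestro, and Pijama
compatibility_pairs = 3 * 4    # three EPR systems, four excerpts each
total_pairs = quality_pairs + compatibility_pairs

participants = 20
pairs_per_participant = total_pairs // 2   # each participant rates half the stimuli
total_votes = participants * pairs_per_participant
mean_votes_per_stimulus = total_votes / total_pairs

# Vote split reported in the abstract: MIDI-VALLE versus the baseline.
reported_votes = 202 + 58
```

The 26 comparisons rated by 20 participants at 13 comparisons each give 260 votes, an average of 10 per stimulus, which matches the reported split of 202 and 58.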
6. RESULTS & DISCUSSION
6.1 Objective Evaluation
As shown in Table 3, fine-tuning with the ATEPP dataset significantly enhanced Piano-Encodec compared to the original Encodec [21], reducing spectrogram distortion from 0.304 to 0.123 and chroma distortion from 0.478 to 0.140. In addition, Piano-Encodec achieves high-fidelity reconstruction of human performances, with much lower FAD, spectrogram, and chroma distortions than generative models. Although fine-tuned using the ATEPP dataset, the Piano-Encodec model achieves impressive reconstruction quality on both the Maestro and Pijama datasets. These results validate the reliability of the Piano-Encodec model as an embedding extraction tool for assessing acoustic and musical similarity between synthesised outputs and reference audio.
Compared to M2A, as presented in Table 4, our MIDI-VALLE model achieves over 75% lower FAD on the ATEPP and Maestro datasets, showing that MIDI-VALLE effectively maps MIDI tokens to realistic audio. However, the high FAD scores for Pijama indicate that MIDI-VALLE struggles with jazz performances, likely due to its training on classical music, which limits its ability to capture the complex harmonic structures and rhythms of jazz. Furthermore, the FAD between MIDI-VALLE outputs and reconstructions is lower than with ground truth, suggesting that MIDI-VALLE aligns more with the quantised representations used in training than with the original audio.
Model         Dataset  FAD ↓  Spec. ↓       Chroma ↓
Encodec [21]  ATEPP    –      0.304 ± .005  0.478 ± .011
Piano-Enc.    ATEPP    0.685  0.123 ± .002  0.140 ± .002
Piano-Enc.    Maestro  0.984  0.135 ± .002  0.139 ± .001
Piano-Enc.    Pijama   1.133  0.143 ± .003  0.137 ± .001
Table 3: Reconstruction quality of Encodec [10] and Piano-Encodec (Piano-Enc.) evaluated on three datasets. Metrics are calculated by comparing the reconstructed performances with the ground truth recordings. FAD, spectrogram distance (Spec.), and chroma distance with 95% confidence intervals are presented.
Dataset  Model    Ref.  FAD ↓    Spec. ↓       Chroma ↓
ATEPP    M2A [3]  GT¹   11.014   0.218 ± .005  0.421 ± .017
ATEPP    M2A [3]  RC²   11.463   0.214 ± .004  0.464 ± .017
ATEPP    MV       GT    3.329    0.219 ± .005  0.436 ± .012
ATEPP    MV       RC    2.659    0.199 ± .005  0.442 ± .012
Maestro  M2A [3]  GT    34.479   0.230 ± .003  0.387 ± .007
Maestro  M2A [3]  RC    33.753   0.224 ± .003  0.427 ± .007
Maestro  MV       GT    11.281   0.231 ± .004  0.428 ± .009
Maestro  MV       RC    9.168    0.206 ± .003  0.420 ± .009
Pijama   M2A [3]  GT    274.153  0.312 ± .010  0.471 ± .009
Pijama   M2A [3]  RC    267.969  0.293 ± .008  0.509 ± .010
Pijama   MV       GT    102.022  0.322 ± .010  0.558 ± .014
Pijama   MV       RC    97.634   0.298 ± .009  0.584 ± .015
Table 4: Spectrogram (Spec.) and chroma distortions are presented along with 95% confidence intervals for comparing M2A [3] and MIDI-VALLE (MV) on the ATEPP, Maestro, and Pijama datasets. Metrics are calculated by comparing the model generations with the reference (Ref.). ¹GT refers to the ground truth performance recording and ²RC indicates audio reconstructed via Piano-Encodec.
In terms of chroma distortion, MIDI-VALLE shows similar harmonic consistency to M2A on ATEPP, whereas M2A slightly outperforms MIDI-VALLE on Maestro. This performance difference aligns with the fact that M2A was originally trained on the Maestro dataset. Regarding spectrogram distortion, which reflects the model’s ability to reconstruct acoustics and timbre, MIDI-VALLE exhibits a smaller distance to the reconstruction, compared to M2A on both ATEPP and Maestro. Additionally, as shown in Figure 2, MIDI-VALLE provides a more accurate reconstruction across the full frequency spectrum compared to M2A, improving both the timbre realism and the perceptual weight of the sound. These results suggest that MIDI-VALLE effectively adapts to various recording environments and reconstructs acoustics and ambient sound that closely match the provided audio prompts.
Furthermore, MIDI-VALLE, trained solely with transcribed performance MIDIs, generalises well to recorded MIDIs without fine-tuning, as shown by its lower FADenc score and similar spectrogram and chroma distortions to M2A. This makes it beneficial for real-world applications with limited recorded data but rich transcribed data. In contrast, M2A, which is trained on recorded performance MIDIs, struggles to adapt to transcribed datasets without fine-tuning [3], primarily due to the inherent limitations of the piano-roll representation, as discussed in Section 3.1.2.

Figure 2: Rainbow-grams [5] of the performances synthesised by MIDI-VALLE and M2A, along with the ground truth and the Piano-Encodec reconstruction, are shown. The rainbow-gram, based on the Constant-Q Transform, uses colour to represent instantaneous frequency and lightness to indicate spectral amplitude. As the spectral amplitude of a frequency bin increases, the corresponding image pixel becomes lighter.

On the Pijama dataset, MIDI-VALLE exhibits increased spectrogram and chroma distortions, highlighting its difficulty in capturing the complex harmonic structures, syncopated rhythms, and nuanced articulations characteristic of jazz music. These stylistic differences from classical music might lead to unseen token patterns in the MIDI representation, making adaptation challenging. Nevertheless, MIDI-VALLE still outperforms M2A in terms of FAD and achieves comparable spectrogram distortion, suggesting that it better preserves timbral and ambient features that contribute to perceptual similarity. However, the considerable FAD gap between MIDI-VALLE and the ground truth indicates that the overall audio quality remains limited.
6.2 Subjective Evaluation
The listening test results further validate the findings from the objective metrics. As shown in Figure 3, MIDI-VALLE receives significantly more votes than M2A in the synthesis quality evaluation on the ATEPP and Maestro datasets. However, M2A is favoured for segments from the Pijama dataset, indicating that while MIDI-VALLE generalises well to classical piano, it requires further refinement to adapt effectively to stylistically distinct genres such as jazz. In the system compatibility evaluation, MIDI-VALLE is consistently preferred over M2A across all EPR systems, demonstrating better adaptability to subtle timing and articulation differences in performance MIDI. While M2A’s piano-roll representation is prone to artefacts under such variations, MIDI-VALLE remains robust, producing more natural and expressive outputs.
Additionally, the output quality of MIDI-VALLE is strongly influenced by the audio prompt due to its inherited zero-shot design. We observed that, beyond capturing ambient characteristics, the prompt could also determine the loudness and timbre of the generated audio, highlighting the model’s ability to adapt to diverse acoustic environments. Moreover, while using the first three seconds of a target segment often produces high-quality results, MIDI-VALLE can generate coherent and natural outputs from any prompt that is stylistically consistent and acoustically clear, enabling it to handle improvised inputs within the classical style.

Figure 3: Win counts are presented for MIDI-VALLE and M2A models across multiple datasets and combined EPR systems in the listening tests.

However, the current design requires precise alignment between MIDI and audio prompts. Subtle timing variations or extra notes can lead to unexpected notes or omissions at the start of the generation. Despite the accurate truncation of the MIDI prompt to three seconds, misalignments still occur. As shown in Figure 2, the plot of the MIDI-VALLE output appears slightly shifted due to the truncation occurring in the middle of the first note, impacting both its timing and articulation. Manually selecting cutting points that coincide with the ends of MIDI notes and with moments when the sound has completely faded in the audio could resolve these misalignments, but this method is not practical for synthesising multiple segments into a complete performance. When generating long performances by concatenating multiple synthesised segments, discontinuities in acoustic characteristics can still be observed.
7. CONCLUSION
We present MIDI-VALLE, a novel EPS model adapted from the VALLE framework, for performance MIDI-to-audio synthesis. Our results demonstrate that MIDI-VALLE outperforms the existing EPS baseline in both adaptability and synthesis quality, producing more natural and expressive audio across a wide range of performance inputs and recording conditions. This improvement is primarily attributed to the discrete tokenisation approach and its inherited zero-shot design, which enhances the model’s ability to capture performance nuances and adapt to diverse inputs. Future work will focus on improving generalisation across musical genres, investigating the impact of model size and the role of codebooks in expressive audio generation, and broadening comparisons with physical synthesis methods and alternative audio codec models.
8. ACKNOWLEDGEMENT
This work was supported by the UKRI Centre for Doctoral Training in Artificial Intelligence and Music [grant number EP/S022694/1] and the National Institute of Informatics, Japan. J. Tang is a research student jointly funded by the China Scholarship Council [grant number 202008440382] and Queen Mary University of London. G. Wiggins received funding from the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen". We thank the reviewers for their valuable feedback, which helped improve the quality of this work.
9. ETHICS STATEMENT
No personal or sensitive user data is involved in this research. The datasets used in this study, ATEPP [8], Maestro [6], and Pijama [31], contain audio recordings and corresponding MIDI annotations of piano performances. The MIDI files from all three datasets are publicly available. However, the audio recordings in ATEPP and Pijama are accessible exclusively for research purposes under academic use agreements, and have been used accordingly. All model training and evaluation were conducted in compliance with these terms, with no commercial usage or redistribution of the restricted audio data. The listening tests involved voluntary participation by musically trained individuals, who were informed of the purpose and anonymised participation. No personal data was collected. The study was reviewed and approved by the Electronic Engineering and Computer Science Devolved School Research Ethics Committee at Queen Mary University of London under reference number QMERC20.565.DSEECS25.019.
Code and generated audio examples are made available to promote transparency and reproducibility. We acknowledge the potential for misuse of generative audio models, including the synthesis of deceptive or misleading content. We strongly discourage such applications and advocate for the responsible use of this technology, including clear attribution and disclosure when synthetic audio is employed.
10. REFERENCES
[1] Y. Wu, E. Manilow, Y. Deng, R. Swavely, K. Kastner,
T. Cooijmans, A. Courville, C.-Z. A. Huang, and J. Engel,
“MIDI-DDSP: Detailed control of musical performance
via hierarchical modeling,” in International
Conference on Learning Representations, 2022.
[2] H.-W. Dong, C. Zhou, T. Berg-Kirkpatrick, and
J. McAuley, “Deep performer: Score-to-audio music
performance synthesis,” in ICASSP 2022-2022 IEEE
International Conference on Acoustics, Speech and
Signal Processing (ICASSP). IEEE, 2022, pp. 951–955.
[3] J. Tang, E. Cooper, X. Wang, J. Yamagishi, and
G. Fazekas, “Towards an integrated approach for
expressive piano performance synthesis from music
scores,” in ICASSP 2025 - 2025 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP), 2025, pp. 1–5.
[4] E. Cooper, X. Wang, and J. Yamagishi, “Text-to-speech
synthesis techniques for midi-to-audio synthesis,”
Proc. 11th ISCA Speech Synthesis Workshop (SSW
11), 2021.
[5] X. Shi, E. Cooper, X. Wang, J. Yamagishi, and
S. Narayanan, “Can knowledge of end-to-end text-to-speech
models improve neural midi-to-audio synthesis
systems?” in ICASSP 2023-2023 IEEE International
Conference on Acoustics, Speech and Signal Processing
(ICASSP). IEEE, 2023, pp. 1–5.
[6] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon,
C.-Z. A. Huang, S. Dieleman, E. Elsen,
J. Engel, and D. Eck, “Enabling factorized piano
music modeling and generation with the MAESTRO
dataset,” in International Conference on
Learning Representations, 2019. [Online]. Available:
https://openreview.net/forum?id=r1lYRjC9F7
[7] S. Chen, C. Wang, Y. Wu, Z. Zhang, L. Zhou, S. Liu,
Z. Chen, Y. Liu, H. Wang, J. Li, L. He, S. Zhao, and
F. Wei, “Neural codec language models are zero-shot
text to speech synthesizers,” IEEE Transactions on
Audio, Speech and Language Processing, vol. 33, pp.
705–718, 2025.
[8] H. Zhang, J. Tang, S. R. Rafee, S. Dixon, G. A.
Wiggins, and G. Fazekas, “ATEPP: A Dataset of Automatically
Transcribed Expressive Piano Performance,”
in International Society for Music Information
Retrieval Conference, Dec. 2022, pp. 446–453. [Online].
Available: https://doi.org/10.5281/zenodo.7342764
[9] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu,
“MusicBERT: Symbolic music understanding with
large-scale pre-training,” in Findings of the Association
for Computational Linguistics: ACL-IJCNLP 2021,
Online, Aug. 2021, pp. 791–800. [Online]. Available:
https://aclanthology.org/2021.findings-acl.70
[10] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High
fidelity neural audio compression,” Transactions on
Machine Learning Research, 2023, featured Certification,
Reproducibility Certification. [Online]. Available:
https://openreview.net/forum?id=ivCd8z8zR2
[11] L. Renault, R. Mignot, and A. Roebel, “DDSP-Piano:
a Neural Sound Synthesizer Informed by Instrument
Knowledge,” AES - Journal of the Audio Engineering
Society Audio-Accoustics-Application, vol. 71, no. 9,
pp. 552–565, Sep. 2023. [Online]. Available:
https://hal.science/hal-04073770
[12] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu, “Neural
speech synthesis with transformer network,” in
Proceedings of the AAAI conference on artificial
intelligence, vol. 33, no. 01, 2019, pp. 6706–6713.
[13] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao,
and T.-Y. Liu, “Fastspeech: fast, robust and controllable
text to speech,” Proceedings of the 33rd International
Conference on Neural Information Processing
Systems, 2019.
[14] J. Kong, J. Kim, and J. Bae, “Hifi-gan: Generative
adversarial networks for efficient and high fidelity speech
synthesis,” Advances in neural information processing
systems, vol. 33, pp. 17 022–17 033, 2020.
[15] J. Tang, G. Wiggins, and G. Fazekas, “Reconstructing
human expressiveness in piano performances with a
transformer network,” The 16th International Symposium
on Computer Music Multidisciplinary Research,
2023.
[16] I. Borovik and V. Viro, “Scoreperformer: Expressive
piano performance rendering with fine-grained
control,” in Proceedings of the 23rd International Society
for Music Information Retrieval Conference, 2023, pp.
588–596.
[17] L. Renault, R. Mignot, and A. Roebel, “Expressive
Piano Performance Rendering from Unpaired Data,”
in International Conference on Digital Audio Effects
(DAFx23), Copenhague, Denmark, Sep. 2023, pp.
355–358. [Online]. Available:
https://hal.science/hal-04221612
[18] D. Jeong, T. Kwon, Y. Kim, K. Lee, and J. Nam,
“Virtuosonet: A hierarchical rnn-based system for modeling
expressive piano performance,” in Proceedings of
the 20th International Society for Music Information
Retrieval Conference, 2019.
[19] H. Zhang, S. Chowdhury, C. E. Cancino-Chacón,
J. Liang, S. Dixon, and G. Widmer, “Dexter: Learning
and controlling performance expression with diffusion
models,” Applied Sciences, no. 15, 2024. [Online].
Available: https://www.mdpi.com/2076-3417/14/15/6543
[20] Z. Borsos, R. Marinier, D. Vincent, E. Kharitonov,
O. Pietquin, M. Sharifi, D. Roblek, O. Teboul,
D. Grangier, M. Tagliasacchi et al., “Audiolm: a
language modeling approach to audio generation,”
IEEE/ACM transactions on audio, speech, and language
processing, vol. 31, pp. 2523–2533, 2023.
[21] J. Copet, F. Kreuk, I. Gat, T. Remez,
D. Kant, G. Synnaeve, Y. Adi, and A. Défossez,
“Simple and controllable music generation,”
in Thirty-seventh Conference on Neural Information
Processing Systems, 2023. [Online]. Available:
https://openreview.net/forum?id=jtiQ26sCJi
[22] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel,
M. Verzetti, A. Caillon, Q. Huang, A. Jansen,
A. Roberts, M. Tagliasacchi et al., “Musiclm:
Generating music from text,” arXiv preprint
arXiv:2301.11325, 2023.
[23] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund,
and M. Tagliasacchi, “Soundstream: An end-to-end
neural audio codec,” IEEE/ACM Trans. Audio, Speech
and Lang. Proc., vol. 30, pp. 495–507, Nov. 2021.
[Online]. Available:
https://doi.org/10.1109/TASLP.2021.3129994
[24] R. Gray, “Vector quantization,” IEEE ASSP Magazine,
vol. 1, no. 2, pp. 4–29, 1984.
[25] F. Li, “Vall-e: A neural codec language model,” 2023.
[Online]. Available: http://github.com/lifeiteng/vall-e
[26] W.-Y. Hsiao, J.-Y. Liu, Y.-C. Yeh, and Y.-H. Yang,
“Compound word transformer: Learning to compose
full-song music over dynamic directed hypergraphs,”
in Proceedings of the AAAI Conference on Artificial
Intelligence, vol. 35, no. 1, 2021, pp. 178–186.
[27] Y.-S. Huang and Y.-H. Yang, “Pop music transformer:
Beat-based modeling and generation of expressive pop
piano compositions,” in Proceedings of the 28th ACM
International Conference on Multimedia, ser. MM ’20.
New York, NY, USA: Association for Computing
Machinery, 2020, pp. 1180–1188. [Online]. Available:
https://doi.org/10.1145/3394171.3413671
[28] Z. Yao, L. Guo, X. Yang, W. Kang, F. Kuang, Y. Yang,
Z. Jin, L. Lin, and D. Povey, “Zipformer: A faster
and better encoder for automatic speech recognition,”
in The Twelfth International Conference on Learning
Representations, 2024. [Online]. Available:
https://openreview.net/forum?id=9WD9KwssyT
[29] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi,
“Fréchet audio distance: A reference-free metric
for evaluating music enhancement algorithms,” in
Interspeech, 2019. [Online]. Available:
https://api.semanticscholar.org/CorpusID:202725406
[30] A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou,
“Adapting frechet audio distance for generative music
evaluation,” in Proc. IEEE ICASSP 2024, 2024.
[Online]. Available: https://arxiv.org/abs/2311.01616
[31] D. Edwards, S. Dixon, and E. Benetos, “Pijama: Piano
jazz with automatic midi annotations,” Transactions
of the International Society for Music Information
Retrieval, 2023.