AUDIO SYNTHESIZER INVERSION IN SYMMETRIC PARAMETER
SPACES WITH APPROXIMATELY EQUIVARIANT FLOW MATCHING
Ben Hayes, Charalampos Saitis, György Fazekas
Centre for Digital Music, Queen Mary University of London, United Kingdom
[email protected], {c.saitis, george.fazekas}@qmul.ac.uk
ABSTRACT
Many audio synthesizers can produce the same signal given different parameter configurations, meaning the inversion from sound to parameters is an inherently ill-posed problem. We show that this is largely due to intrinsic symmetries of the synthesizer, and focus in particular on permutation invariance. First, we demonstrate on a synthetic task that regressing point estimates under permutation symmetry degrades performance, even when using a permutation-invariant loss function or symmetry-breaking heuristics. Then, viewing equivalent solutions as modes of a probability distribution, we show that a conditional generative model substantially improves performance. Further, acknowledging the invariance of the implicit parameter distribution, we find that performance is further improved by using a permutation equivariant continuous normalizing flow. To accommodate intricate symmetries in real synthesizers, we also propose a relaxed equivariance strategy that adaptively discovers relevant symmetries from data. Applying our method to Surge XT, a full-featured open source synthesizer used in real world audio production, we find our method outperforms regression and generative baselines across audio reconstruction metrics.
1. INTRODUCTION
Modern audio synthesizers are intricate systems, combining numerous methods for sound production and manipulation with rich user-facing control schemes. Where, once, many digital synthesis algorithms were accompanied by a corresponding analysis procedure [1–3], selecting parameters for a modern synthesizer to approximate a given audio signal is a challenging open problem [4, 5] which is increasingly approached using powerful optimization and machine learning algorithms. In particular, recent works approach the task with deep neural networks trained on datasets sampled directly from the synthesizer [6–8].
Many synthesizers can produce the same signal given multiple different parameter configurations. This means that inverting the synthesizer is necessarily ill-posed: it lacks a unique solution. We argue that this harms the performance of models trained to produce point estimates of the parameters. Maximum likelihood regression objectives that do not account for this ill-posedness are minimized by a suboptimal averaging across equivalent solutions, while invariant loss functions can lead to pathologies such as the responsibility problem [9, 10]. This, we suggest, explains the superior performance of generative methods when sound matching complex synthesizers [11, 12]: for a given input, they can assign predictive weight to many possible parameter configurations, as opposed to selecting just one, as illustrated in Fig. 1.
[Figure 1 panels. Forward map: audio synthesis. Inverse map: regression-based (predict), generative (sample), equivariant generative (sample).]
Figure 1: Top: Audio synthesis is the forward map we wish to invert. Bottom-left: Synthesizer inversion by parameter regression. The neural network produces a point estimate, and does not account for symmetries in the synthesizer. Bottom-middle: A generative model approximates the conditional distribution of parameters $x$ given audio $y$, but can only learn the appropriate invariances if present in the data. Bottom-right: Using an equivariant flow, the learned distribution is inherently invariant to the symmetries of the synthesizer.
© B. Hayes, C. Saitis, G. Fazekas. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: B. Hayes, C. Saitis, G. Fazekas, “Audio synthesizer inversion in symmetric parameter spaces with approximately equivariant flow matching”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
The relationship between equivalent parameters is commonly governed by an underlying symmetry, which arises naturally from the design of the synthesizer. For example, in an additive synthesizer consisting of $k$ independent oscillators, simple permutations yield $k!$ distinct yet equivalent parameter configurations. In this work, we focus on the effects of such permutation symmetries, which frequently occur in synthesizers due to the use of repeated functional units (filters, oscillators, modulation sources, etc.) in their design. In such cases, we show that by constructing a permutation invariant generative model from equivariant continuous normalizing flows [13], we can improve over the performance of symmetry-naïve generative models. Further, using a toy task in which we can selectively break the invariance of the synthesizer, we show that permutation symmetry degrades the performance of regression-based models.
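To make the permutation symmetry concrete, the following minimal sketch sums $k = 3$ sinusoidal oscillators and checks that every reordering of the oscillators renders the same signal. The function `additive_synth`, the (frequency, amplitude) parameterization, and the example values are assumptions chosen for illustration; they are not the synthesizers studied in this paper.

```python
import itertools
import math

def additive_synth(oscillators, num_samples=64):
    """Sum of sinusoids; `oscillators` is a list of (angular_frequency, amplitude)."""
    return [
        sum(amp * math.sin(freq * n) for freq, amp in oscillators)
        for n in range(num_samples)
    ]

# Three distinct oscillators (hypothetical values).
params = [(0.10, 1.0), (0.25, 0.5), (0.40, 0.25)]
reference = additive_synth(params)

# Each of the k! orderings is a different parameter vector ...
orderings = list(itertools.permutations(params))

# ... yet each renders the same signal, up to floating-point summation order.
max_deviation = max(
    abs(a - b)
    for ordering in orderings
    for a, b in zip(additive_synth(list(ordering)), reference)
)
```

A regression model asked to recover `params` from `reference` therefore faces $3! = 6$ equally valid targets.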
In real synthesizers, multiple symmetries may act concurrently on different parameters, while some parameters remain unaffected. Hand-designing a model to achieve the appropriate invariance thus scales poorly with synthesizer complexity and requires a priori knowledge of the underlying symmetries. Further, some synthesizers exhibit conditional and approximate symmetries, for which full invariance would be overly restrictive. To address this, we introduce a learnable mapping from synthesizer parameters to model tokens, which is capable of discovering symmetries present in the data, but can break the symmetry where necessary. Applying this technique to a dataset sampled from the Surge XT synthesizer with multiple symmetries, continuous and discrete parameters, and audio effects, we find consistently improved audio reconstruction performance. We provide full source code and audio examples at the following URL: https://benhayes.net/synth-perm/
2. BACKGROUND
2.1 Synthesizer inversion & sound matching
Given an audio signal, the sound matching task aims to find a synthesizer parameter configuration that best approximates it [4, 5]. We focus in this paper on synthesizer inversion, a sub-task of sound matching in which the audio signal we seek to approximate is known a priori to have come from the synthesizer. We do so to eliminate confounding factors due to the non-trivial implicit projection from general audio signals to the set of signals producible by the synthesizer.
For an overview of historical sound matching approaches, we refer the reader to Shier's [5] comprehensive review. The state of the art among regression-based approaches was recently set by Bruford et al. [8], who proposed adopting the audio spectrogram transformer [14] architecture. Given its superior performance over MLP and CNN models, we adopt this model as our regression baseline.
Esling et al. [11] presented the first generative approach, which was subsequently extended by Le Vaillant et al. [12]. These methods train variational autoencoders on audio spectrograms, enriching the posterior distribution with normalizing flows. A second flow is jointly trained with a regression loss to predict synthesizer parameters from this learned audio representation.
Differentiable digital signal processing (DDSP) [15, 16] has also been applied to sound matching [7, 17–20]. Such approaches are effectively regression-based, as the composition of a differentiable synthesizer and an audio-domain loss function is a parameter-domain loss function. If the synthesizer exhibits an invariance, so will this composed loss function, meaning DDSP-based methods are also subject to the responsibility problem. Thus, while we do not conduct specific DDSP experimentation, we expect our findings for permutation invariant loss functions to be applicable also to DDSP with permutation invariant synthesizers.
2.2 Permutation symmetry & set generation
Predicting set-structured data (such as the parameters of a permutation invariant synthesizer) with vector-valued neural networks leads to a pathology known as the responsibility problem [9, 10], in which the continuous model must learn a highly discontinuous function. Intuitively, it is always possible to find two similar inputs that induce a change in “responsibility”, and hence an approximation of a discontinuity in the network's outputs. Despite these issues, in synthesizer and audio processor inference tasks it is common to ignore the symmetry at the architectural level and simply adopt permutation invariant loss functions [21] or symmetry breaking heuristics [7, 22]. However, such approaches are still subject to the responsibility problem, and thus do not solve the underlying issue.
A variety of methods have been proposed for set prediction, of which the most successful view the task generatively [23–26] by transforming an exchangeable sample to the target set. Effectively, the task is framed as conditional density estimation over the space of sets [23]. Based on this insight, more general generative models such as continuous normalizing flows [27] and diffusion models [28] have been adapted to permutation invariant densities.
2.3 Continuous normalizing flows
Continuous normalizing flows (CNFs) [29, 30] are a family of powerful generative models which define invertible, continuous transformations between probability distributions. The conditional flow matching [31] framework allows us to train CNFs without expensive numerical integration by sampling a conditional probability path and regressing a closed form vector field which, in expectation, recovers the exact same gradients as regressing over the marginal field [31, 32]. In this work, we adopt the rectified flow [33] probability path which we pair with a minibatch approximation of the optimal transport coupling [34]. We build on prior work on equivariant flows [35–37] which are known to produce samples from invariant distributions [13].
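As a rough illustration of how a training example might be formed under these choices, the sketch below pairs source noise with data by minimizing total squared distance within a small batch, then builds the straight-line (rectified flow) interpolant and its constant target velocity. All names are hypothetical, the brute-force search over permutations is a stand-in for a proper assignment solver, and the sketch is not taken from the paper's implementation.

```python
import itertools
import random

def squared_distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def minibatch_ot_pairing(noise, data):
    """Return the permutation of `data` indices minimizing total squared distance.

    Brute force over permutations: only viable for tiny batches. It stands in
    for an assignment solver such as the Hungarian algorithm.
    """
    return min(
        itertools.permutations(range(len(data))),
        key=lambda perm: sum(
            squared_distance(noise[i], data[j]) for i, j in enumerate(perm)
        ),
    )

def rectified_flow_example(x0, x1, t):
    """Point on the straight path from x0 to x1, and the constant target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]
    return x_t, target

rng = random.Random(0)
noise = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(4)]    # source samples
data = [[rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(4)]  # parameter vectors

pairing = minibatch_ot_pairing(noise, data)
x_t, target = rectified_flow_example(noise[0], data[pairing[0]], t=0.5)
```

A vector field model would then be regressed onto `target` at the point `(x_t, t)`, conditioned on the audio.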
3. METHOD
Let $\mathcal{P} \subset \mathbb{R}^k$ be the space of synthesizer parameters¹ and $\mathcal{S} \subset \mathbb{R}^n$ be the space of audio signals. A synthesizer is a map between these spaces, $f : \mathcal{P} \to \mathcal{S}$. It is common that $f$ is not injective. That is, there exist multiple sets of parameters, e.g. $x^{(1)}, x^{(2)} \in \mathcal{P}$, that produce the same signal, i.e. $f(x^{(1)}) = f(x^{(2)})$. A trivial example is given when the synthesizer has a global gain control: all sets of parameters with zero global gain will produce an equivalent, silent signal. Clearly, in such cases, $f$ lacks a well-defined inverse. Therefore, we model our uncertainty over $x$ as $p(x \mid y)$, the distribution of parameters $x \in \mathcal{P}$ given a signal $y \in \mathcal{S}$.
¹ We include MIDI pitch and note on/off times in our definition of synthesizer parameters. Effectively, we are dealing in single notes.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
When there is some transformation (let us denote this $g$) that acts on parameters such that $f(g \cdot x) = f(x)$ for all $x \in \mathcal{P}$, we say that $g$ is a symmetry of $f$. For example, $g$ might permute the oscillators. If a set $G$ of such transformations contains an identity transformation and an inverse for each $g \in G$, then under composition it is a group acting on $\mathcal{P}$. For any $x \in \mathcal{P}$, its $G$-orbit (the set of points reachable via actions $g \in G$) is defined as $\mathcal{O}_x = \{g \cdot x : g \in G\}$. The set of all $G$-orbits is a disjoint partition of $\mathcal{P}$:
$$\mathcal{P} = \bigsqcup_{\mathcal{O} \in \mathcal{P}/G} \mathcal{O}. \tag{1}$$
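This allows any parameter in $\mathcal{P}$ to be expressed as $x = g \cdot \hat{\mathcal{O}}$ for some orbit representative $\hat{\mathcal{O}} \in \mathcal{O}$ and group element $g \in G$.² Hence, we can factor our conditional parameter density as:³
$$p(x \mid y) = \underbrace{p(\mathcal{O} \mid y)}_{\text{Orbit}} \cdot \underbrace{p(g \mid \mathcal{O}, y)}_{\text{Symmetry}} \cdot \underbrace{\eta(\mathcal{O})}_{\text{Stabilizer}}, \tag{2}$$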
where $p(\mathcal{O} \mid y)$ is invariant under transformations in $G$, $p(g \mid \mathcal{O}, y)$ describes the remaining uncertainty due to symmetry, and $\eta(\mathcal{O})$ is a scaling factor determined by the stabilizer of $\mathcal{O}$ in $G$. Depending on the nature of $G$, the posterior over orbits $p(\mathcal{O} \mid y)$ may be considerably simpler than that over parameters $p(x \mid y)$. For an additive synthesizer of 16 permutable oscillators, $|G| = 16! \approx 20.9 \times 10^{12}$, meaning that any mode of $p(x \mid y)$ is accompanied by over 20 trillion symmetries, while it is represented only once in $p(\mathcal{O} \mid y)$. We therefore want to factor out the effect of symmetry.
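The orbit structure and the size of the permutation group can be checked directly on a small example. The sketch below enumerates the orbit of a parameter configuration under oscillator permutations; the oscillator values are hypothetical. It also shows how a non-trivial stabilizer (two identical oscillators) shrinks the orbit, which is why a stabilizer-dependent factor appears in Eq. (2).

```python
import itertools
import math

def orbit(oscillators):
    """All distinct parameter vectors reachable by permuting the oscillators."""
    return set(itertools.permutations(oscillators))

# Distinct oscillators: trivial stabilizer, so the orbit has k! points.
distinct = ((0.1, 1.0), (0.2, 0.5), (0.3, 0.25), (0.4, 0.1))

# Two identical oscillators: swapping them leaves x unchanged, so the
# stabilizer has 2 elements and the orbit shrinks to k! / 2 points.
repeated = ((0.1, 1.0), (0.1, 1.0), (0.3, 0.25), (0.4, 0.1))

orbit_sizes = (len(orbit(distinct)), len(orbit(repeated)))  # (24, 12)

# Order of the permutation group for 16 oscillators, as quoted in the text.
group_order_16 = math.factorial(16)  # 20922789888000, about 20.9e12
```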
Under two reasonable assumptions, it can be shown that $p(g \mid \mathcal{O}, y)$ is necessarily uniform over $G$. First, we assume $G$-invariance of the likelihood $p(y \mid x)$ which arises naturally from the symmetry of our synthesizer. Secondly, we assume $G$-invariance of the prior $p(x)$. This is a stronger assumption, which we satisfy by randomly sampling our training data from $G$-invariant parameter distributions, in contrast to some previous work which produces training data from handmade synthesizer presets.
Under a uniform symmetry distribution, we can say that $p(x \mid y) \propto p(\mathcal{O} \mid y)\,\eta(\mathcal{O})$ and reduce our task to density estimation over orbits. Of course, the orbit of a point in $\mathcal{P}$ is an abstract equivalence class and cannot be represented directly. However, by enforcing $G$-invariance we ensure that our model is unable to discriminate between points on the same orbit, and thus implicitly learns the orbital posterior.
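One way to see why the two invariance assumptions matter is the short Bayes' rule argument below. It is a sketch, not the full derivation given in the supplementary material: it shows only that the posterior itself is $G$-invariant, so that every point on an orbit receives the same posterior density and no group element is preferred.

```latex
\begin{aligned}
p(g \cdot x \mid y)
  &= \frac{p(y \mid g \cdot x)\, p(g \cdot x)}{p(y)} \\
  &= \frac{p(y \mid x)\, p(x)}{p(y)}
  && \text{(invariance of likelihood and prior)} \\
  &= p(x \mid y)
  && \text{for all } g \in G .
\end{aligned}
```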
3.1 Permutation equivariant continuous normalizing flows
² This factorization is unique if and only if the stabilizer of $x$ (i.e. the subgroup of $G$ that leaves $x$ unchanged) is trivial.
³ A full derivation is given in the supplementary material.
[Figure 2 diagram labels. Param → token path: synthesizer parameters, per-parameter embeddings, FFN, parameter tokens, sparse assignment, internal tokens, to model. Token → param path: from model, internal tokens, transposed sparse assignment, parameter tokens, per-parameter embeddings, synthesizer parameters.]
Figure 2: The PARAM2TOK projection for learning to assign parameters to tokens with relaxed equivariance.
Our task, then, is to learn a $G$-invariant distribution, focusing on the case where $G$ is the product of permutation (symmetric) subgroups $S_k$, i.e. $G = \times_i S_{k_i}$ of degrees $k_i$. Köhler et al. [13] showed that the pushforward of an isotropic Gaussian under an equivariant continuous normalizing flow is a density with the corresponding invariance. We thus seek a permutation equivariant architecture, making a Transformer [38] encoder without positional encoding [25] a natural choice. We adopt the Diffusion Transformer (DiT) [39] architecture with Adaptive Layer Norm (Ada-LN) conditioning.
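The permutation equivariance of attention without positional encoding can be checked numerically. The sketch below implements a single-head self-attention layer in plain Python and confirms that permuting the input tokens permutes the output rows in the same way. It is a toy illustration with random weights and hypothetical names, not the DiT architecture used in this work.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention with no positional encoding."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * sum(a * b for a, b in zip(qi, kj)) for kj in k] for qi in q]
    return matmul([softmax(row) for row in scores], v)

rng = random.Random(0)
dim, num_tokens = 4, 5

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

w_q, w_k, w_v = (rand_matrix(dim, dim) for _ in range(3))
tokens = rand_matrix(num_tokens, dim)

perm = [3, 0, 4, 1, 2]
out = self_attention(tokens, w_q, w_k, w_v)
out_permuted_input = self_attention([tokens[i] for i in perm], w_q, w_k, w_v)

# Equivariance: permuting the input tokens permutes the output rows identically.
max_gap = max(
    abs(a - b)
    for row_a, row_b in zip(out_permuted_input, [out[i] for i in perm])
    for a, b in zip(row_a, row_b)
)
```

Adding a positional encoding to `tokens` before the layer would break this property, which is why it is omitted when equivariance is wanted.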
Next, we must select an appropriate map from vectors in $\mathcal{P}$ to permutable Transformer tokens. For a simple synthesizer consisting of $k$ permutable oscillators we could simply define $k$ tokens, assigning the parameters of each oscillator to a distinct token. However, for a synthesizer with multiple permutation symmetries, each acting on a distinct subset of parameters, and some further non-permutable parameters, this is more challenging. As well as assigning parameters to tokens, we must indicate which tokens may be permuted with which other tokens and which may not be permuted at all.
Further, we may encounter quasi-symmetric structure in real synthesizers.⁴ Specifically, we define conditional symmetry to mean a symmetry that acts only on some subset $\mathcal{P}' \subset \mathcal{P}$. For example, in Surge XT (see Section 4.2), only certain values of the “routing” parameter allow permutation symmetry between filters. Further, if the actions of a group leave the signal almost unchanged, up to some error bound, we call this an approximate symmetry. For example, swapping two filters placed before and after a soft waveshaper may yield perceptually similar, but not identical signals. Whilst it may be possible to hand design tokenization strategies to accommodate these behaviours on a case-by-case basis, ideally we would learn such a mapping directly from data.
3.2 Equivariance discovery with PARAM2TOK
To address this, we propose PARAM2TOK, illustrated in Fig. 2, a mapping from parameters to tokens that can exploit the Transformer's permutation equivariance when it is beneficial, but is not constrained to doing so.
⁴ While formal treatment is beyond the scope of this paper, we briefly illustrate them here as they nonetheless represent a source of uncertainty in inverting real world synthesizers.
Given a synthesizer with $k$ parameters, we learn a matrix $Z \in \mathbb{R}^{k \times d}$, and represent each parameter by scalar multiplication with the corresponding row of $Z$, followed by a feed-forward neural network $h_\theta$ applied row-wise. To map parameter tokens to Transformer tokens, we learn an assignment matrix $A \in \mathbb{R}^{n \times k}$, from $k$ parameters to $n$ tokens, giving the input to the first layer of our Transformer:
$$X_0 = A\, h_\theta(\operatorname{diag}(x)\, Z). \tag{3}$$
To return from tokens to parameters, we use weight-tying [40] of the assignment matrix and another set of learned vectors $Z' \in \mathbb{R}^{k \times d}$ as follows:
$$\tilde{x} = \left(Z' \odot A^\top X_l\right) \mathbf{1}_d, \tag{4}$$
where $\mathbf{1}_d$ is simply a $d$-dimensional vector of ones and $X_l$ is the output of the $l$-th Transformer layer.
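The shapes involved in Eqs. (3) and (4) can be traced with a small plain-Python sketch. The one-hidden-layer ReLU network standing in for $h_\theta$, the random initialization, the sizes, and the use of the identity in place of the Transformer are all assumptions made purely for illustration.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def feed_forward(rows, w1, w2):
    """Stand-in for h_theta: one hidden ReLU layer applied to each row independently."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(rows, w1)]
    return matmul(hidden, w2)

def params_to_tokens(x, z, assign, w1, w2):
    """Eq. (3): X_0 = A h_theta(diag(x) Z), mapping k parameters to n tokens."""
    scaled = [[x_i * z_ij for z_ij in z_row] for x_i, z_row in zip(x, z)]  # k x d
    return matmul(assign, feed_forward(scaled, w1, w2))                      # n x d

def tokens_to_params(x_l, z_out, assign):
    """Eq. (4): x~ = (Z' * (A^T X_l)) 1_d, reusing A (weight tying)."""
    assign_t = [list(col) for col in zip(*assign)]  # k x n
    per_param = matmul(assign_t, x_l)               # k x d
    # Elementwise product with Z', then sum over the d embedding dimensions.
    return [sum(zp * v for zp, v in zip(z_row, row)) for z_row, row in zip(z_out, per_param)]

rng = random.Random(0)
k, n, d = 6, 3, 4  # parameters, tokens, embedding width (illustrative sizes)

def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

z, z_out = rand_matrix(k, d), rand_matrix(k, d)
assign = rand_matrix(n, k)
w1, w2 = rand_matrix(d, d), rand_matrix(d, d)

x = [rng.uniform(-1.0, 1.0) for _ in range(k)]
tokens = params_to_tokens(x, z, assign, w1, w2)    # input to the first Transformer layer
x_tilde = tokens_to_params(tokens, z_out, assign)  # identity used in place of the Transformer
```

If every row of $Z$ is identical and every column of $A$ is identical, Eq. (3) reduces to a sum over parameters and is exactly permutation invariant. This is one simple way to see that a fully symmetric starting point exists; it is not claimed to be the initialization scheme of the supplementary material.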
We initialize $Z$, $Z'$, and $A$ such that PARAM2TOK is approximately invariant to any permutation of the parameter vector, following the scheme described in the supplementary material. Intuitively, the model starts with the maximum possible symmetry which is then gradually broken through training. Finally, we encourage a sparse assignment with an $L_1$ penalty on the entries of $A$. Full initialization and regularization details can be found in the supplementary material, as well as proofs that PARAM2TOK can represent quasi-symmetries.
4. EXPERIMENTS
We train CNF models with the rectified flow probability path [33]. A minibatch approximation of the optimal transport coupling [34] is implemented using the Hungarian algorithm. Conditioning dropout is applied with a probability of 10% and inference is performed with classifier-free guidance [41, 42] with a scale of 2.0. Sampling is performed for 100 steps with the 4th order Runge-Kutta method. For brevity, we omit some details about our models, datasets, and training configurations, but these can be found in the supplementary material as well as plots of learned PARAM2TOK parameters.
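The sampling procedure described above can be sketched as follows. The guidance formula shown is one common convention for classifier-free guidance and is an assumption here, since the exact form is not stated in the text; the toy vector field and all names are hypothetical stand-ins for a trained model.

```python
def guided_velocity(v_cond, v_uncond, scale=2.0):
    """Classifier-free guidance: extrapolate from the unconditional prediction
    towards the conditional one (assumed convention)."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]

def rk4_sample(velocity, x0, steps=100):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with classical RK4."""
    x, h = list(x0), 1.0 / steps
    for step in range(steps):
        t = step * h
        k1 = velocity(x, t)
        k2 = velocity([xi + 0.5 * h * ki for xi, ki in zip(x, k1)], t + 0.5 * h)
        k3 = velocity([xi + 0.5 * h * ki for xi, ki in zip(x, k2)], t + 0.5 * h)
        k4 = velocity([xi + h * ki for xi, ki in zip(x, k3)], t + h)
        x = [
            xi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)
        ]
    return x

# Toy stand-in for a trained model: the "conditional" prediction is the constant
# velocity of a straight path from `start` to `goal`; the "unconditional" one is zero.
start, goal = [0.0, 0.0], [1.0, -0.5]

def toy_field(x, t):
    v_cond = [g - s for g, s in zip(goal, start)]
    v_uncond = [0.0, 0.0]
    return guided_velocity(v_cond, v_uncond, scale=1.0)  # scale 1 recovers v_cond

sample = rk4_sample(toy_field, start)  # lands on `goal`
```

With a real model, `velocity` would call the network twice per evaluation (with and without the audio conditioning) and combine the outputs with `scale=2.0`.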
4.1 Isolating permutation symmetry with k-osc
To isolate the effect of permutation symmetry, we propose a simple synthetic task called $k$-osc, in which data is generated with a simple synthesizer which sums the output of $k$ identical oscillators. Each oscillator is parameterized by an angular frequency $\omega$, an amplitude $\alpha$, and a waveform shape $\gamma$. The waveform shape parameter linearly interpolates between sine, square, and sawtooth waves, of which the latter two are antialiased with PolyBLEP [43].
This synthesizer is permutation invariant, but we can break the symmetry by constraining each oscillator to a disjoint frequency range. This allows us to compare the performance of models under both symmetric and asymmetric conditions. Similarly, we can control the degree of symmetry by varying $k$, which we test at $k \in \{4, 8, 16, 32\}$.
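Data points are created by sampling a parameter vector $x = (\omega, \alpha, \gamma) \in [-1, 1]^{3k}$, the concatenation of the frequency, amplitude, and shape parameters. Frequency parameters are rescaled internally to $[0, \pi]$ for the symmetric variant or to $\left[\frac{(i-1)\pi}{k}, \frac{i\pi}{k}\right]$ for $i = 1, \dots, k$ for the asymmetric variant.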
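The two frequency rescalings can be written as a short helper. The sketch assumes an affine map from $[-1, 1]$ onto each target interval, which the text does not state explicitly; the function name is hypothetical.

```python
import math

def rescale_frequencies(raw, symmetric=True):
    """Map raw frequency parameters in [-1, 1] to angular frequencies.

    Symmetric: every oscillator spans [0, pi].
    Asymmetric: oscillator i (1-indexed) spans [(i - 1) * pi / k, i * pi / k].
    An affine rescaling is assumed.
    """
    k = len(raw)
    unit = [(r + 1.0) / 2.0 for r in raw]  # [-1, 1] -> [0, 1]
    if symmetric:
        return [u * math.pi for u in unit]
    # `i` is 0-indexed here, so oscillator i + 1 covers [i * pi / k, (i + 1) * pi / k].
    return [(i + u) * math.pi / k for i, u in enumerate(unit)]
```

In the asymmetric variant the oscillators can no longer be exchanged without changing the signal, because each one is confined to its own frequency band.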
4.1.1 Models
We compare several variants of both generative and regression-based models. To help better interpret results, we also compute metrics for randomly sampled parameter vectors. All models extract a representation of the signal using a 1D frequency domain CNN.
CNF models: We train three CNF models. The first, CNF (Equivariant), parameterizes the flow's vector field with a DiT and Ada-LN conditioning, with the parameter-to-token mapping described in Section 3.1. Specifically, the $i$-th input token is $(\omega_i, \alpha_i, \gamma_i)$, and $n = k$. Conversely, CNF (PARAM2TOK) learns the mapping via the relaxed equivariance of PARAM2TOK, again with $k$ model tokens. Finally, to isolate the effect of model equivariance, we train CNF (MLP) using a residual MLP vector field with no intrinsic equivariance.
Regression models: All regression models use a residual MLP prediction head acting on the CNN representation. The first, FFN (MSE), simply regresses parameters with MSE loss. The remaining models mirror approaches to permutation symmetry found in the literature [7, 21, 22]. FFN (Sort) sorts oscillators by their frequency before computing the MSE loss, while FFN (Chamfer) computes a permutation invariant loss using a differentiable Chamfer distance [44]. While both approaches have improved performance in prior work, neither is able to resolve the responsibility problem [9].
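The two symmetry-aware regression losses can be sketched in a few lines. The version below operates on plain lists of per-oscillator tuples, sorts only the target by its first entry (taken to be frequency), and uses a symmetric squared-distance Chamfer form; these details are assumptions for illustration and may differ from the implementations in the cited works.

```python
def mse(pred, target):
    """Mean squared error between two equally shaped lists of oscillator tuples."""
    squared = [(p - t) ** 2 for po, to in zip(pred, target) for p, t in zip(po, to)]
    return sum(squared) / len(squared)

def sort_mse(pred, target):
    """Symmetry-breaking heuristic: order the target by frequency, then apply MSE."""
    return mse(pred, sorted(target, key=lambda osc: osc[0]))

def chamfer(pred, target):
    """Symmetric Chamfer distance between two sets of oscillator tuples."""
    def sq(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    forward = sum(min(sq(p, t) for t in target) for p in pred) / len(pred)
    backward = sum(min(sq(t, p) for p in pred) for t in target) / len(target)
    return forward + backward

# The same two oscillators (frequency, amplitude, shape), listed in different orders.
target = [(0.3, 1.0, 0.0), (0.1, 0.5, 0.2)]
pred = [(0.1, 0.5, 0.2), (0.3, 1.0, 0.0)]

plain_loss = mse(pred, target)         # non-zero although the signal would match
sorted_loss = sort_mse(pred, target)   # zero
chamfer_loss = chamfer(pred, target)   # zero
```

Both alternatives remove the spurious penalty in this example, but the network must still output oscillators in some fixed order, which is where the responsibility problem arises.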
4.1.2 Metrics
For every inferred parameter vector, we reconstruct the signal by synthesizing with these parameters. We measure dissimilarity from the ground truth signal using the $L_2$ log spectral distance (LSD). To compare inferred parameters with the ground truth, we simply use the mean squared error (MSE) for the asymmetric condition. For the symmetric condition, we compute the optimal linear assignment cost (LAC), that is, the minimum MSE across permutations.
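For small $k$, the linear assignment cost can be computed by exhaustive search, as in the sketch below. The function names are hypothetical, and brute force is used only for clarity: at $k = 32$ the number of permutations is far too large, and an assignment solver such as the Hungarian algorithm would be needed.

```python
import itertools

def mean_squared_error(pred, target):
    squared = [(p - t) ** 2 for po, to in zip(pred, target) for p, t in zip(po, to)]
    return sum(squared) / len(squared)

def linear_assignment_cost(pred, target):
    """Minimum MSE over all matchings of predicted to ground-truth oscillators."""
    return min(
        mean_squared_error(pred, list(perm))
        for perm in itertools.permutations(target)
    )

# Same three oscillators (frequency, amplitude) in a different order.
target = [(0.3, 1.0), (0.1, 0.5), (0.2, 0.7)]
pred = [(0.1, 0.5), (0.2, 0.7), (0.3, 1.0)]

lac = linear_assignment_cost(pred, target)   # 0.0: a perfect match up to ordering
plain = mean_squared_error(pred, target)     # positive: penalizes the ordering
```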
4.1.3 Results
Results are shown in Fig. 3. Comparing CNF (Equivariant) and FFN (MSE) clearly illustrates the effect of permutation symmetry. Under symmetric conditions, CNF (Equivariant) excels, while FFN (MSE) performs poorly across metrics. Without symmetry, the roles are reversed: FFN (MSE) achieves excellent results, while CNF (Equivariant) imposes an overly restrictive equivariance and fails to discriminate between oscillators.
FFN (Chamfer), FFN (Sort), and CNF (MLP) offer “partial” solutions, being less affected when symmetry is present, but still underperforming CNF (Equivariant). Under asymmetric conditions, FFN (Chamfer) also imposes too stringent a restriction and performs similarly to CNF (Equivariant). CNF (MLP), however, lacks explicit equivariance and performs on par with FFN (MSE). Across conditions, CNF (PARAM2TOK) is competitive with the best models, demonstrating that PARAM2TOK can take advantage of model equivariance when helpful, but still learn distinct oscillator representations where necessary. Broadly, both parameter- and audio-domain metrics suggest that failing to account for permutation symmetry will harm performance and that CNFs with relaxed equivariance offer good performance across both symmetric and asymmetric scenarios.
[Figure 3 panels, each plotted against $k \in \{4, 8, 16, 32\}$: Log Spectral Distance (symmetric), Log Spectral Distance (asymmetric), Linear Assignment Cost (symmetric), Mean Squared Error (asymmetric). Methods: CNF (Equivariant), CNF (MLP), CNF (PARAM2TOK), FFN (Chamfer), FFN (MSE), FFN (Sort), Random.]
Figure 3: Evaluation results for the symmetric and asymmetric variants of the k-osc task.
4.2 Real-world synthesizer inversion
Our goal, of course, is to invert real-world software synthesizers. To this end, we test our method on Surge XT, an open source, feature rich “hybrid” synthesizer which incorporates multiple methods of producing and shaping sound. It exhibits multiple permutation symmetries, e.g. between oscillators, LFOs, and filters, as well as further conditional and approximate symmetries.
We construct two datasets of 2 million samples each by randomly sampling from the synthesizer's parameter space and rendering the corresponding audio using the pedalboard library [45].⁵
The first dataset, referred to as Surge Simple, has 92 parameters including controls for three oscillators, two filters, three envelope generators, five low frequency oscillators (LFOs), and some global parameters. Omissions include discrete parameters, and those which affect global signal routing. The second, Surge Full, has 165 parameters, of which many are discrete, some alter the internal routing, and some control audio effects. Surge Full thus covers a broader sonic range, and introduces uncertainty beyond permutation symmetry.
⁵ Surge XT actually provides comprehensive Python bindings, allowing for direct programmatic interaction with the synthesizer. We opt to use pedalboard in order to ensure our system is applicable to any software synthesizer, which limits us only to parameters exposed to the plugin host.
In both cases, audio was rendered in stereo at 44.1 kHz for a duration of 4 seconds and converted to a Mel-spectrogram with 128 Mel bands, using an analysis window of 25 ms and a hop length of 10 ms. Spectrograms were standardized using train dataset statistics. Like Le Vaillant et al. [12], we adopt a one-hot representation of discrete and categorical parameters, concatenating all scalar and one-hot parameter representations to a single vector. All synthesis parameters are scaled to the interval $[-1, 1]$.
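A parameter encoding of this kind might look like the sketch below. The function name, the input format, and in particular the choice to map one-hot entries to $-1$ (off) and $+1$ (on) are assumptions; the text states only that one-hot representations are used and that all parameters lie in $[-1, 1]$.

```python
def encode_parameters(continuous, categorical):
    """Concatenate scaled continuous values and one-hot categorical values.

    continuous:  list of (value, low, high) triples
    categorical: list of (selected_index, num_choices) pairs
    Every entry of the returned vector lies in [-1, 1].
    """
    vector = [2.0 * (v - lo) / (hi - lo) - 1.0 for v, lo, hi in continuous]
    for index, num_choices in categorical:
        # One-hot block with -1 for "off" and +1 for "on" (assumed scaling).
        vector.extend(1.0 if i == index else -1.0 for i in range(num_choices))
    return vector

# One continuous control at the midpoint of its range, one 4-way switch set to option 2.
encoded = encode_parameters([(440.0, 0.0, 880.0)], [(2, 4)])
# encoded == [0.0, -1.0, -1.0, 1.0, -1.0]
```

Decoding a sampled vector would take the argmax over each one-hot block and invert the affine scaling for each continuous entry.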
4.2.1 Models
CNF models: All CNF models receive audio conditioning from an Audio Spectrogram Transformer (AST) [14] with pre-normalization [46]. Instead of a single [CLS] token, we increase conditioning expressivity by learning an individual query token for each layer of the CNF's vector field. All models are trained end-to-end. Following prior findings on mixed continuous-discrete flow-based models [47], we do not constrain flows over discrete parameters to the simplex and simply adopt the same Gaussian source distribution for all dimensions. Our CNF (PARAM2TOK) model again uses the PARAM2TOK module with a DiT vector field. Again, we train a CNF (MLP) model with a residual MLP vector field to help isolate the effect of model equivariance from that of the generative approach.
Baselines: We adopt the AST [14] approach proposed by Bruford et al. [8] as our regression baseline, and Le Vaillant et al.'s [12] VAE + RealNVP method as a further generative baseline.
4.2.2 Metrics
While previous sound matching work has typically included parameter domain distances as an evaluation metric, here we argue that if the synthesizer exhibits symmetry, such metrics are uninformative and possibly misleading.⁶ While under simple symmetries we may select an invariant metric, as in Section 4.1.2, this approach does not scale to more complex synthesizers. We thus rely on audio reconstruction metrics which both share the exact invariances of the synthesizer and quantify performance on the task we are actually concerned with, i.e. best reconstructing the signal.
⁶ For more detail on this point, see the supplementary material.

Surge XT (Simple)
Method             Params    MSS ↓           wMFCC ↓         SOT ↓            RMS ↑
CNF (PARAM2TOK)    40.2M     3.18 ± 0.06     5.57 ± 0.06     0.042 ± 0.001    0.948 ± 0.002
CNF (MLP)          41.6M     3.53 ± 0.06     6.02 ± 0.07     0.045 ± 0.001    0.932 ± 0.002
AST                45.3M     6.51 ± 0.10     9.11 ± 0.09     0.076 ± 0.001    0.738 ± 0.005
VAE + RealNVP      39.5M*    26.23 ± 0.14    15.80 ± 0.10    0.184 ± 0.002    0.595 ± 0.006

Surge XT (Full)
Method             Params    MSS ↓           wMFCC ↓         SOT ↓            RMS ↑
CNF (PARAM2TOK)    40.2M     6.13 ± 0.09     9.63 ± 0.08     0.086 ± 0.001    0.939 ± 0.002
CNF (MLP)          41.6M     7.35 ± 0.11     11.01 ± 0.09    0.095 ± 0.001    0.924 ± 0.002
AST                45.3M     14.73 ± 0.14    14.42 ± 0.11    0.136 ± 0.002    0.804 ± 0.004
VAE + RealNVP      44.3M*    28.86 ± 0.19    18.61 ± 0.12    0.178 ± 0.002    0.703 ± 0.004

Surge Full → NSynth
Method             Params    MSS ↓           wMFCC ↓         SOT ↓            RMS ↑
CNF (PARAM2TOK)    40.2M     11.04 ± 0.28    19.71 ± 0.24    0.158 ± 0.004    0.834 ± 0.006
CNF (MLP)          41.6M     13.51 ± 0.32    21.09 ± 0.23    0.175 ± 0.004    0.840 ± 0.006
AST                45.3M     24.04 ± 0.30    28.62 ± 0.17    0.224 ± 0.003    0.755 ± 0.007
VAE + RealNVP      44.3M     35.29 ± 0.26    23.50 ± 0.14    0.266 ± 0.004    0.689 ± 0.007

Surge Full → FSD50K
Method             Params    MSS ↓           wMFCC ↓         SOT ↓            RMS ↑
CNF (PARAM2TOK)    40.2M     15.40 ± 0.17    17.25 ± 0.12    0.168 ± 0.003    0.680 ± 0.005
CNF (MLP)          41.6M     16.96 ± 0.18    17.83 ± 0.12    0.183 ± 0.003    0.710 ± 0.004
AST                45.3M     19.09 ± 0.16    21.35 ± 0.14    0.187 ± 0.003    0.682 ± 0.004
VAE + RealNVP      44.3M     25.06 ± 0.20    21.24 ± 0.12    0.247 ± 0.003    0.681 ± 0.004

* VAE + RealNVP parameter counts are 39.5M and 44.3M for the Simple and Full datasets, respectively. The difference arises because the latent space dimension is set to the length of the parameter vector.
Table 1: Top (Surge XT Simple and Full): Audio reconstruction results on Simple and Full variants of the Surge XT synthesizer inversion task. Bottom (Surge Full → NSynth and FSD50K): Out-of-domain audio reconstruction results for models trained on the Surge Full dataset. All: Results reported with 95% confidence interval computed across test dataset.

We measure spectrotemporal dissimilarity with a multi-scale spectral (MSS) distance computed over log-scaled Mel spectrograms at multiple resolutions. However, a slight error in one parameter may lead an otherwise good match to be unfairly penalized due to shifts in time or frequency. We thus include a warped Mel-frequency cepstral coefficient (wMFCC) metric as a more “malleable” alternative, given by the optimal $L_1$ dynamic time warping (DTW) cost between two MFCC series. This allows robustness to timing and pitch deviations.
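The warping cost underlying the wMFCC metric can be sketched with a textbook dynamic program. The version below uses the $L_1$ distance between frames as the local cost and returns the unnormalized optimal path cost; whether the metric in this paper applies any path-length normalization or step constraints is not stated, so treat those details as assumptions.

```python
def dtw_cost(series_a, series_b):
    """Optimal dynamic time warping cost between two sequences of feature
    vectors, using the L1 distance between frames as the local cost."""
    inf = float("inf")
    rows, cols = len(series_a), len(series_b)
    acc = [[inf] * (cols + 1) for _ in range(rows + 1)]
    acc[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            local = sum(abs(u - v) for u, v in zip(series_a[i - 1], series_b[j - 1]))
            # Extend the cheapest of the three admissible predecessor alignments.
            acc[i][j] = local + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[rows][cols]

# Two one-dimensional "MFCC" series with the same contour, shifted by one frame.
early = [[0.0], [1.0], [2.0], [2.0]]
late = [[0.0], [0.0], [1.0], [2.0]]

framewise_l1 = sum(abs(a[0] - b[0]) for a, b in zip(early, late))  # 2.0
warped = dtw_cost(early, late)                                     # 0.0
```

The frame-by-frame distance penalizes the timing shift, while the warped cost does not, which is the robustness the metric is intended to provide.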
Pointwise spectral distances fail to capture distances “along” the frequency axis, such as pitch dissimilarity [9, 48]. We thus adopt the spectral optimal transport (SOT) distance [49] as a pitch-sensitive measure of spectral similarity. Finally, we include a simple cosine similarity between framewise RMS energy to capture amplitude envelope similarity.
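The intuition behind these two metrics can be illustrated in plain Python. The one-dimensional Wasserstein-1 cost below is a simplified stand-in for the SOT distance of [49], whose exact definition may differ, and the frame and hop lengths are arbitrary illustrative values.

```python
import math

def wasserstein_1d(spectrum_a, spectrum_b):
    """1-D optimal transport cost between two magnitude spectra on a shared,
    unit-spaced bin grid. Each spectrum is normalized to unit mass and the
    cost is the summed absolute difference of the cumulative distributions."""
    total_a, total_b = sum(spectrum_a), sum(spectrum_b)
    cdf_gap, cost = 0.0, 0.0
    for a, b in zip(spectrum_a, spectrum_b):
        cdf_gap += a / total_a - b / total_b
        cost += abs(cdf_gap)
    return cost

def framewise_rms(signal, frame_length, hop_length):
    starts = range(0, len(signal) - frame_length + 1, hop_length)
    return [
        math.sqrt(sum(s * s for s in signal[i:i + frame_length]) / frame_length)
        for i in starts
    ]

def cosine_similarity(a, b):
    dot = sum(u * v for u, v in zip(a, b))
    norm = math.sqrt(sum(u * u for u in a)) * math.sqrt(sum(v * v for v in b))
    return dot / norm if norm > 0.0 else 0.0

# A single spectral peak moved by one bin or by two bins.
peak = [0.0, 1.0, 0.0, 0.0, 0.0]
near = [0.0, 0.0, 1.0, 0.0, 0.0]
far = [0.0, 0.0, 0.0, 1.0, 0.0]
transport_costs = (wasserstein_1d(peak, near), wasserstein_1d(peak, far))  # (1.0, 2.0)
```

A pointwise squared difference is the same for `near` and `far`, whereas the transport cost grows with the size of the pitch error. The RMS cosine similarity, by contrast, compares only the shape of the amplitude envelope and is unaffected by overall gain.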
4.2.3 Results
Metrics are computed on a held-out test dataset of 10,000 sounds synthesized in the same manner as the training dataset. Results are presented in Table 1, and audio examples can be heard on the companion website.⁷
The CNF models perform consistently with our theoretical expectations and the results of our toy task. In particular, CNF (PARAM2TOK) outperforms all models across all metrics on both datasets. The learned PARAM2TOK parameters suggest that the model has discovered a token mapping that corresponds to the intrinsic symmetries of the synthesizer, and thus benefits from the Transformer's permutation equivariance.
The non-equivariant CNF (MLP) also achieves reasonable performance, suggesting that simply adopting a probabilistic framework is already very helpful in dealing with the sources of ill-posedness in real synthesizers. The difference between the MLP and PARAM2TOK variants is more pronounced for Surge Full, suggesting that the effect of introducing greater complexity is magnified in the presence of unresolved symmetry. This aligns with our decomposition of the conditional parameter density in Section 3: any change in $p(\mathcal{O} \mid y)$ is “repeated” in $p(x \mid y)$ for each element of $G$.
Despite extensive tuning, VAE + RealNVP consistently collapsed to predicting average values for many parameters. Of course, this conflicts with the results of the original papers [11, 12]. We suggest this may be due to the use of smaller datasets of hand-designed synthesizer presets, which could be sufficiently biased to break the invariance of $p(x)$.
⁷ Link to website: https://benhayes.net/synth-perm/
4.2.4 Out-of-distribution results
While this is not our focus, we report out-of-distribution results on audio from the NSynth [50] and FSD50K [51] test datasets in Table 1. Across all models, performance suffers compared to in-domain data, but the relationships between models are largely preserved. Thus, even under such challenging conditions, unhandled symmetry likely detrimentally influences performance. To adapt our method to general sound matching, we intend to explore shared representations for synthesized and non-synthesized audio, domain adaptation techniques, and information bottlenecks, such as quantization, for conditioning.
5. CONCLUSION
The implications of our findings are clear: if the synthesizer has a symmetry, it is better to (i) approach the problem generatively and (ii) learn the corresponding invariant density. This extends beyond synthesizers, as audio effects also commonly exhibit permutation symmetries, as noted by Nercessian [22]. Beyond audio, these results are of relevance anywhere neural networks parameterize an external system with structural symmetries.
A key limitation of our work is the lack of specific experimentation on the role of quasi-symmetries. A $k$-osc style task which isolates their effect would be very illuminating about the extent to which PARAM2TOK improves results in their presence, and we intend to include this in a subsequent publication. We also note that we currently lack theoretical guarantees that PARAM2TOK will discover symmetries. In future work, we thus intend to both study this module theoretically and gather further empirical data on the role of initialization and regularization in its behaviour. We also plan more comprehensive evaluation of the extent to which PARAM2TOK-based models learn the appropriate invariance.
Our proposed method should generalize effectively to arbitrary software synthesizers. In future work, we will therefore explore multi-task training by simultaneously modelling over many synthesizers. We will also seek to extend our work to the more general sound matching task by exploring the previously discussed strategies for robustifying our system to out-of-distribution inputs.
6. ETHICS STATEMENT
Like any AI model, our work inherently encodes the biases and values of the authors. In particular, our choice of VST synthesizer reflects a bias towards western popular music. However, the more abstract nature of our synthetic experimentation does suggest that our results may reasonably be expected to generalize to tools that better represent the needs of other musical cultures.
The source of training data is of particular importance in assessing the ethical impact of an AI model, and in this instance we have worked entirely with synthetically generated data. Even in our experimentation on a real synthesizer, we opted to generate our dataset by randomly sampling parameters, rather than by scraping the internet for presets or exploiting factory preset libraries. We further note that Surge XT is an open source synthesizer released under the GNU GPL.
Amid growing concern over AI tools displacing workers, particularly in the creative industries, we wish to highlight that our goal in this work is to develop technology to integrate with and enhance existing workflows. We believe the choice of task reflects this by seeking to enhance interaction with an existing family of creative tools, as opposed to outright replacing them.
7. ACKNOWLEDGEMENTS
B.H. would like to thank Christopher Mitcheltree, Marco Pasini,
Chin-Yun Yu, Jack Loth, and Julien Guinot for their invaluable
feedback on this manuscript in varying stages of completion, and
Jordie Shier for the many inspiring and illuminating conversations on
this topic.
8. REFERENCES
[1]
J. Jus ice. “Analy ic Signal P ocessing in Music
Compu a ion”. In: IEEE T ansac ions on Acous ics,
Speech, and Signal P ocessing 27.6 (Dec. 1979),
pp. 670–684. D O I:
10 . 1109 / TASSP . 1979 .
1163321.
[2]
R. McAulay and T. Qua ie i. “Speech Analy-
sis/Syn hesis Based on a Sinusoidal Rep esen a ion”.
In: IEEE T ansac ions on Acous ics, Speech, and
Signal P ocessing 34.4 (Aug. 1986), pp. 744–754.
DOI:10.1109/TASSP.1986.1164910.
[3]
Xa ie Se a and Julius Smi h. “Spec al Model-
ing Syn hesis: A Sound Analysis/Syn hesis Sys em
Based on a De e minis ic Plus S ochas ic Decompo-
si ion”. In: Compu e Music Jou nal 14.4 (1990),
pp. 12–24. D O I:10.2307/3680788.
[4]
Ma in Ro h and Ma hew Yee-King. “A Compa ison
o Pa ame ic Op imiza ion Techniques o Musical
Ins umen Tone Ma ching”. In: Audio Enginee ing
Socie y Con en ion 130. Audio Enginee ing Socie y,
May 13, 2011.
[5]
Jo die Shie . “The Syn hesize P og amming P ob-
lem: Imp o ing he Usabili y o Sound Syn hesize s”.
MA hesis. Uni e si y o Vic o ia, 2021.
[6]
O en Ba kan e al. “In e Syn h: Deep Es ima ion o
Syn hesize Pa ame e Con igu a ions om Audio
Signals”. In: IEEE/ACM T ansac ions on Audio,
Speech, and Language P ocessing 27.12 (Dec. 2019),
pp. 2385–2396. D O I:
10.1109/TASLP.2019.
2944568. a Xi : 1812.06349.
[7]
Nao ake Masuda and Daisuke Sai o. “Imp o ing
Semi-Supe ised Di e en iable Syn hesize Sound
Ma ching o P ac ical Applica ions”. In: IEEE/ACM
T ansac ions on Audio, Speech, and Language P o-
cessing 31 (2023), pp. 863–875. D O I:
10.1109/
TASLP.2023.3237161.
[8]
F ed B u o d, F ede ik Blang, and Shahan Ne ces-
sian. “Syn hesize Sound Ma ching Using Audio
Spec og am T ans o me s”. In: P oceedings o he
27 h In e na ional Con e ence on Digi al Audio E -
ec s. DAFx24. Guild o d, Su ey, Sep . 3–7, 2024.
[9]
Ben Hayes, Cha alampos Sai is, and Gyö gy Fazekas.
“The Responsibili y P oblem in Neu al Ne wo ks wi h
Uno de ed Ta ge s”. In: The Fi s Tiny Pape s T ack
a ICLR 2023. ICLR. Kigali, Rwanda, May 5, 2023.
[10]
Yan Zhang, Jona hon Ha e, and Adam P ügel-
Benne . “FSPool: Lea ning Se Rep esen a ions
wi h Fea u ewise So Pooling”. In: In e na ional
Con e ence on Lea ning Rep esen a ions. Sep . 23,
2019.
[11]
Philippe Esling e al. “Flow Syn hesize : Uni e sal
Audio Syn hesize Con ol wi h No malizing Flows”.
In: Applied Sciences 10.1 (Dec. 2020), p. 302. D O I:
10.3390/app10010302.
[12]
Gwendal Le Vaillan , Thie y Du oi , and Sebas ien
Dekeyse . “Imp o ing Syn hesize P og amming
F om Va ia ional Au oencode s La en Space”. In:
2021 24 h In e na ional Con e ence on Digi al Au-
dio E ec s (DAFx). 2021 24 h In e na ional Con-
e ence on Digi al Audio E ec s (DAFx). Vienna,
Aus ia: IEEE, Sep . 8, 2021, pp. 276–283. DOI:
10.23919/DAFx51585.2021.9768218.
[13]
Jonas Köhle , Leon Klein, and F ank Noe. “Equi a i-
an Flows: Exac Likelihood Gene a i e Lea ning o
Symme ic Densi ies”. In: P oceedings o he 37 h
In e na ional Con e ence on Machine Lea ning. In-
e na ional Con e ence on Machine Lea ning. PMLR,
No . 21, 2020, pp. 5361–5370.
[14]
Yuan Gong, Yu-An Chung, and James Glass. “AST:
Audio Spec og am T ans o me ”. In: P oc. In e -
speech 2021. 2021, pp. 571–575. D O I:
10.21437/
In e speech.2021-698.
[15]
Jesse Engel e al. “DDSP: Di e en iable Digi al
Signal P ocessing”. In: 8 h In e na ional Con e ence
on Lea ning Rep esen a ions. ICLR 2020. Addis
Ababa, E hiopia, Ap . 2020.
[16]
Ben Hayes e al. “A Re iew o Di e en iable Digi al
Signal P ocessing o Music & Speech Syn hesis”.
In: F on ie s in Signal P ocessing (2023).
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
379
[17]
Nao ake Masuda and Daisuke Sai o. “Syn hesize
Sound Ma ching wi h Di e en iable DSP” (On-
line). No . 7, 2021. DOI:
10.5281/zenodo.
5624609.
[18]
Noy Uz ad e al. Di Moog: A Di e en iable Modu-
la Syn hesize o Sound Ma ching. Jan. 23, 2024.
DOI:
10.48550/a Xi .2401.12570
. a Xi :
2401.12570 [eess]. P e-published.
[19]
Yu ing Yang e al. “Whi e Box Sea ch o e Audio
Syn hesize Pa ame e s”. In: P oc. o he 24 d In .
Socie y o Music In o ma ion Re ie al Con . ISMIR.
Milan, I aly, 2023.
[20]
Han Han, Vincen Los anlen, and Ma hieu Lag ange.
“Lea ning o Sol e In e se P oblems o Pe cep ual
Sound Ma ching”. In: IEEE/ACM T ansac ions on
Audio, Speech, and Language P ocessing 32 (2024),
pp. 2605–2615. D O I:
10.1109/TASLP.2024.
3393738.
[21]
Jesse Engel, Rigel Swa ely, and Adam Robe s.
“Sel -Supe ised Pi ch De ec ion by In e se Audio
Syn hesis”. In: P oceedings o he In e na ional Con-
e ence on Machine Lea ning. ICML 2020. 2020,
p. 9.
[22]
Shahan Ne cessian. “Neu al Pa ame ic Equalize
Ma ching Using Di e en iable Biquads”. In: P o-
ceedings o he 23 d In e na ional Con e ence on
Digi al Audio E ec s. DAFx2020. Vienna, Aus ia,
2020, p. 8.
[23]
Da id W. Zhang, Ge jan J. Bu ghou s, and Cees
G. M. Snoek. “Se P edic ion wi hou Imposing
S uc u e as Condi ional Densi y Es ima ion”. In: In-
e na ional Con e ence on Lea ning Rep esen a ions.
Jan. 12, 2021.
[24]
Yan Zhang, Jona hon Ha e, and Adam P ugel-
Benne . “Deep Se P edic ion Ne wo ks”. In: Ad-
ances in Neu al In o ma ion P ocessing Sys ems.
Vol. 32. Cu an Associa es, Inc., 2019.
[25]
Adam R. Kosio ek, Hyunjik Kim, and Danilo J.
Rezende. “Condi ional Se Gene a ion wi h T ans-
o me s”. In: Wo kshop on Objec -O ien ed Lea n-
ing. In e na ional Con e ence on Machine Lea ning.
a Xi , July 1, 2020. a Xi : 2006.16841 [cs].
[26]
Jinwoo Kim e al. Se VAE: Lea ning Hie a chi-
cal Composi ion o Gene a i e Modeling o Se -
S uc u ed Da a. Ma . 29, 2021. DOI:
10.48550/
a Xi . 2103 . 15619
. a Xi :
2103 . 15619
[cs]. P e-published.
[27]
Ma in Biloš and S ephan Günnemann. “Scalable No -
malizing Flows o Pe mu a ion In a ian Densi ies”.
In: P oceedings o he 38 h In e na ional Con e ence
on Machine Lea ning. In e na ional Con e ence on
Machine Lea ning. PMLR, July 1, 2021, pp. 957–
967.
[28]
Lemeng Wu e al. “Fas Poin Cloud Gene a ion wi h
S aigh Flows”. In: 2023 IEEE/CVF Con e ence on
Compu e Vision and Pa e n Recogni ion (CVPR).
2023 IEEE/CVF Con e ence on Compu e Vision
and Pa e n Recogni ion (CVPR). Vancou e , BC,
Canada: IEEE, June 2023, pp. 9445–9454. DOI:
10.1109/CVPR52729.2023.00911.
[29]
Ricky T. Q. Chen e al. “Neu al O dina y Di e en ial
Equa ions”. In: Ad ances in Neu al In o ma ion
P ocessing Sys ems. Vol. 31. Cu an Associa es, Inc.,
2018.
[30]
Will G a hwohl e al. “FFJORD: F ee-Fo m Con in-
uous Dynamics o Scalable Re e sible Gene a i e
Models”. In: In e na ional Con e ence on Lea ning
Rep esen a ions. Sep . 27, 2018.
[31]
Alexande Tong e al. “Imp o ing and Gene aliz-
ing Flow-Based Gene a i e Models wi h Miniba ch
Op imal T anspo ”. In: T ansac ions on Machine
Lea ning Resea ch (2024).
[32]
Ya on Lipman e al. “Flow Ma ching o Gene a i e
Modeling”. In: The Ele en h In e na ional Con e -
ence on Lea ning Rep esen a ions. Feb. 1, 2023.
[33]
Xingchao Liu, Chengyue Gong, and Qiang Liu.
“Flow S aigh and Fas : Lea ning o Gene a e and
T ans e Da a wi h Rec i ied Flow”. In: The Ele en h
In e na ional Con e ence on Lea ning Rep esen a-
ions. Feb. 1, 2023.
[34]
A am-Alexand e Pooladian e al. “Mul isample Flow
Ma ching: S aigh ening Flows wi h Miniba ch Cou-
plings”. In: P oceedings o he 40 h In e na ional
Con e ence on Machine Lea ning. In e na ional Con-
e ence on Machine Lea ning. PMLR, July 3, 2023,
pp. 28100–28127.
[35]
Yuxuan Song e al. Equi a ian Flow Ma ching
wi h Hyb id P obabili y T anspo . Dec. 12, 2023.
D O I:
10.48550/a Xi .2312.07168
. a Xi :
2312.07168 [cs]. P e-published.
[36]
Majdi Hassan e al. “Equi a ian Flow Ma ching o
Molecula Con o me Gene a ion”. In: ICML’24
Wo kshop ML o Li e and Ma e ial Science: F om
Theo y o Indus y Applica ions. July 17, 2024.
[37]
Leon Klein, And eas K äme , and F ank Noe. “Equi -
a ian Flow Ma ching”. In: Ad ances in Neu al In-
o ma ion P ocessing Sys ems 36 (Dec. 15, 2023),
pp. 59886–59910.
[38]
Ashish Vaswani e al. “A en ion Is All You Need”.
In: Ad ances in Neu al In o ma ion P ocessing Sys-
ems. Vol. 30. Cu an Associa es, Inc., 2017.
[39]
William Peebles and Saining Xie. “Scalable Di -
usion Models wi h T ans o me s”. In: 2023
IEEE/CVF In e na ional Con e ence on Compu e
Vision (ICCV). 2023 IEEE/CVF In e na ional Con e -
ence on Compu e Vision (ICCV). Pa is, F ance:
IEEE, Oc . 1, 2023, pp. 4172–4182. D O I:
10 .
1109/ICCV51070.2023.00387.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
380
[40]
O i P ess and Lio Wol . “Using he Ou pu Em-
bedding o Imp o e Language Models”. In: P o-
ceedings o he 15 h Con e ence o he Eu opean
Chap e o he Associa ion o Compu a ional Lin-
guis ics: Volume 2, Sho Pape s. P oceedings o
he 15 h Con e ence o he Eu opean Chap e o he
Associa ion o Compu a ional Linguis ics: Volume
2, Sho Pape s. Valencia, Spain: Associa ion o
Compu a ional Linguis ics, 2017, pp. 157–163. D O I:
10.18653/ 1/E17-2025.
[41]
Jona han Ho and Tim Salimans. “Classi ie -F ee
Di usion Guidance”. In: Neu IPS 2021 Wo kshop
on Deep Gene a i e Models and Downs eam Appli-
ca ions. Dec. 8, 2021.
[42]
Qinqing Zheng e al. Guided Flows o Gene a-
i e Modeling and Decision Making. Dec. 7, 2023.
DOI:
10.48550/a Xi .2311.13443
. a Xi :
2311.13443 [cs]. P e-published.
[43]
Vesa Välimäki, Jussi Pekonen, and Juhan Nam. “Pe -
cep ually In o med Syn hesis o Bandlimi ed Classi-
cal Wa e o ms Using In eg a ed Polynomial In e -
pola ion”. In: The Jou nal o he Acous ical Socie y
o Ame ica 131.1 (Jan. 1, 2012), pp. 974–986. D O I:
10.1121/1.3651227.
[44]
Ha y G Ba ow e al. “Pa ame ic Co espondence
and Cham e Ma ching: Two New Techniques o
Image Ma ching”. In: P oceedings: Image Unde -
s anding Wo kshop. Science Applica ions, Inc, 1977,
pp. 21–27.
[45]
Pe e Sobo . Pedalboa d. Ve sion 0.7.3. Zen-
odo, Ap . 10, 2023. D O I:
10. 5281 / ZENODO.
7817838.
[46]
Ruibin Xiong e al. “On Laye No maliza ion in he
T ans o me A chi ec u e”. In: P oceedings o he
37 h In e na ional Con e ence on Machine Lea n-
ing. In e na ional Con e ence on Machine Lea ning.
PMLR, No . 21, 2020, pp. 10524–10533.
[47] Ian Dunn and Da id Ryan Koes. Mixed Con inuous
and Ca ego ical Flow Ma ching o 3D De No o
Molecule Gene a ion. Ap . 30, 2024. D O I:
10 .
48550 / a Xi . 2404 . 19739
. a Xi :
2404.
19739 [q-bio]. P e-published.
[48]
Joseph Tu ian and Max Hen y. “I’m So y o You
Loss: Spec ally-Based Audio Dis ances A e Bad a
Pi ch”. Dec. 9, 2020. a Xi :
2012.04572 [cs,
eess].
[49]
Be na do To es, Geo oy Pee e s, and Gaël Richa d.
“Unsupe ised Ha monic Pa ame e Es ima ion Us-
ing Di e en iable DSP and Spec al Op imal T ans-
po ”. In: ICASSP 2024 - 2024 IEEE In e na ional
Con e ence on Acous ics, Speech and Signal P o-
cessing (ICASSP). ICASSP 2024 - 2024 IEEE In-
e na ional Con e ence on Acous ics, Speech and
Signal P ocessing (ICASSP). Seoul, Ko ea, Repub-
lic o : IEEE, Ap . 14, 2024, pp. 1176–1180. D O I:
10.1109/ICASSP48485.2024.10447011.
[50]
Jesse Engel e al. “Neu al Audio Syn hesis o Musical
No es wi h Wa eNe Au oencode s”. In: P oceedings
o he 34 h In e na ional Con e ence on Machine
Lea ning - Volume 70. ICML’17. Sydney, Aus alia,
Aug. 6, 2017, pp. 1068–1077.
[51]
Edua do Fonseca e al. “FSD50K: An Open Da ase
o Human-Labeled Sound E en s”. In: IEEE/ACM
T ansac ions on Audio, Speech, and Language P o-
cessing 30 (2022), pp. 829–852. D O I:
10.1109/
TASLP.2021.3133208.
P oceedings o he 26 h ISMIR Con e ence, Daejeon, Ko ea, Sep embe 21-25, 2025
381