Audio Synthesizer Inversion in Symmetric Parameter Spaces With Approximately Equivariant Flow Matching

Author: Ben Hayes; Charalampos Saitis; György Fazekas

Publisher: Zenodo

DOI: 10.5281/zenodo.17706418

Source: https://zenodo.org/records/17706418/files/000043.pdf

AUDIO SYNTHESIZER INVERSION IN SYMMETRIC PARAMETER
SPACES WITH APPROXIMATELY EQUIVARIANT FLOW MATCHING
Ben Hayes, Charalampos Saitis, György Fazekas
Centre for Digital Music, Queen Mary University of London, United Kingdom
[email protected], {c.saitis, george.fazekas}@qmul.ac.uk
ABSTRACT
Many audio synthesizers can produce the same signal given
different parameter configurations, meaning the inversion
from sound to parameters is an inherently ill-posed problem.
We show that this is largely due to intrinsic symmetries
of the synthesizer, and focus in particular on permutation
invariance. First, we demonstrate on a synthetic task that
regressing point estimates under permutation symmetry
degrades performance, even when using a permutation-invariant
loss function or symmetry-breaking heuristics.
Then, viewing equivalent solutions as modes of a probability
distribution, we show that a conditional generative
model substantially improves performance. Further, acknowledging
the invariance of the implicit parameter distribution,
we find that performance is further improved by
using a permutation equivariant continuous normalizing
flow. To accommodate intricate symmetries in real synthesizers,
we also propose a relaxed equivariance strategy that
adaptively discovers relevant symmetries from data. Applying
our method to Surge XT, a full-featured open source
synthesizer used in real world audio production, we find
our method outperforms regression and generative baselines
across audio reconstruction metrics.
1. INTRODUCTION
Modern audio synthesizers are intricate systems, combining
numerous methods of sound production and manipulation
with rich user-facing control schemes. Where, once, many
digital synthesis algorithms were accompanied by a corresponding
analysis procedure [1–3], selecting parameters for
a modern synthesizer to approximate a given audio signal
is a challenging open problem [4, 5] which is increasingly
approached using powerful optimization and machine learning
algorithms. In particular, recent works approach the
task with deep neural networks trained on datasets sampled
directly from the synthesizer [6–8].
Many synthesizers can produce the same signal given
multiple different parameter configurations. This means
that inverting the synthesizer is necessarily ill-posed: it
lacks a unique solution. We argue that this harms the performance
of models trained to produce point estimates of
the parameters. Maximum likelihood regression objectives
that do not account for this ill-posedness are minimized by
a suboptimal averaging across equivalent solutions, while
invariant loss functions can lead to pathologies such as
the responsibility problem [9, 10]. This, we suggest, explains
the superior performance of generative methods when
sound matching complex synthesizers [11, 12]: for a
given input, they can assign predictive weight to many
possible parameter configurations, as opposed to selecting
just one, as illustrated in Fig. 1.

[Figure 1 diagram. Top: forward map (audio synthesis). Bottom: inverse map, shown for three approaches: regression-based (predict), generative (sample), and equivariant generative (sample).]
Figure 1: Top: Audio synthesis is the forward map we wish
to invert. Bottom-left: Synthesizer inversion by parameter
regression. The neural network produces a point estimate,
and does not account for symmetries in the synthesizer.
Bottom-middle: A generative model approximates the
conditional distribution of parameters $x$ given audio $y$, but
can only learn the appropriate invariances if present in the
data. Bottom-right: Using an equivariant flow, the learned
distribution is inherently invariant to the symmetries of the
synthesizer.

© B. Hayes, C. Saitis, G. Fazekas. Licensed under a Creative
Commons Attribution 4.0 International License (CC BY 4.0). Attribution:
B. Hayes, C. Saitis, G. Fazekas, “Audio synthesizer inversion in symmetric
parameter spaces with approximately equivariant flow matching”, in Proc.
of the 26th Int. Society for Music Information Retrieval Conf., Daejeon,
South Korea, 2025.
The relationship between equivalent parameters is commonly
governed by an underlying symmetry, which arises
naturally from the design of the synthesizer. For example,
in an additive synthesizer consisting of $k$ independent oscillators,
simple permutations yield $k!$ distinct yet equivalent
parameter configurations. In this work, we focus on the
effects of such permutation symmetries, which frequently
occur in synthesizers due to the use of repeated functional
units (filters, oscillators, modulation sources, etc.) in
their design. In such cases, we show that by constructing a
permutation invariant generative model from equivariant
continuous normalizing flows [13], we can improve over the
performance of symmetry-naïve generative models. Further,
using a toy task in which we can selectively break the invariance
of the synthesizer, we show that permutation symmetry
degrades the performance of regression-based models.
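The $k!$ claim can be checked numerically. The following is a minimal sketch, assuming a hypothetical sine-only additive synthesizer (`toy_additive_synth` is an illustrative name, not the synthesizer used later in the paper): every reordering of the oscillator list should render the same signal up to floating-point rounding.

```python
import itertools
import math


def toy_additive_synth(oscillators, num_samples=64):
    """Sum of sinusoids; `oscillators` is a list of (angular_frequency, amplitude)."""
    return [
        sum(amp * math.sin(freq * n) for freq, amp in oscillators)
        for n in range(num_samples)
    ]


def count_equivalent_orderings(oscillators, tol=1e-9):
    """Count orderings of the oscillator list that render the same signal."""
    reference = toy_additive_synth(oscillators)
    count = 0
    for perm in itertools.permutations(oscillators):
        signal = toy_additive_synth(list(perm))
        if all(abs(a - b) <= tol for a, b in zip(signal, reference)):
            count += 1
    return count


# Four oscillators with distinct settings (values chosen arbitrarily).
oscillators = [(0.10, 0.9), (0.25, 0.5), (0.40, 0.3), (0.70, 0.1)]
n_equivalent = count_equivalent_orderings(oscillators)
print(n_equivalent, math.factorial(len(oscillators)))
```

With four distinct oscillators, all 4! = 24 orderings are distinct parameter vectors yet produce one signal, which is exactly the ambiguity a point-estimate regressor has to resolve.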
In real synthesizers, multiple symmetries may act concurrently
on different parameters, while some parameters
remain unaffected. Hand-designing a model to achieve
the appropriate invariance thus scales poorly with synthesizer
complexity and requires a priori knowledge of the
underlying symmetries. Further, some synthesizers exhibit
conditional and approximate symmetries, for which full
invariance would be overly restrictive. To address this, we
introduce a learnable mapping from synthesizer parameters
to model tokens, which is capable of discovering symmetries
present in the data, but can break the symmetry where necessary.
Applying this technique to a dataset sampled from the
Surge XT synthesizer with multiple symmetries, continuous
and discrete parameters, and audio effects, we find consistently
improved audio reconstruction performance. We
provide full source code and audio examples at the following
URL:
https://benhayes.net/synth-perm/
2. BACKGROUND
2.1 Synthesizer inversion & sound matching
Given an audio signal, the sound matching task aims to find
a synthesizer parameter configuration that best approximates
it [4, 5]. We focus in this paper on synthesizer inversion, a
sub-task of sound matching in which the audio signal we
seek to approximate is known a priori to have come from
the synthesizer. We do so to eliminate confounding factors
due to the non-trivial implicit projection from general audio
signals to the set of signals producible by the synthesizer.
For an overview of historical sound matching approaches,
we refer the reader to Shier's [5] comprehensive review. The
state of the art in regression-based approaches was recently
set by Bruford et al. [8], who proposed adopting the
audio spectrogram transformer [14] architecture. Given
its superior performance over MLP and CNN models, we
adopt this model as our regression baseline.
Esling et al. [11] presented the first generative approach,
which was subsequently extended by Le Vaillant et al. [12].
These methods train variational autoencoders on audio spectrograms,
enriching the posterior distribution with normalizing
flows. A second flow is jointly trained with a regression
loss to predict synthesizer parameters from this
learned audio representation.
Differentiable digital signal processing (DDSP) [15, 16]
has also been applied to sound matching [7, 17–20]. Such
approaches are effectively regression-based, as the composition
of a differentiable synthesizer and an audio-domain
loss function is a parameter-domain loss function. If the synthesizer
exhibits an invariance, so will this composed loss
function, meaning DDSP-based methods are also subject to
the responsibility problem. Thus, while we do not conduct
specific DDSP experimentation, we expect our findings on
permutation invariant loss functions to be applicable also
to DDSP with permutation invariant synthesizers.
2.2 Permutation symmetry & set generation
Predicting set-structured data (such as the parameters of a
permutation invariant synthesizer) with vector-valued neural
networks leads to a pathology known as the responsibility
problem [9, 10], in which the continuous model must learn
a highly discontinuous function. Intuitively, it is always
possible to find two similar inputs that induce a change in
“responsibility”, and hence an approximation to a discontinuity
in the network's outputs. Despite these issues, in
synthesizer and audio processor inference tasks it is common
to ignore the symmetry at the architectural level and
simply adopt permutation invariant loss functions [21] or
symmetry breaking heuristics [7, 22]. However, such approaches
are still subject to the responsibility problem, and
thus do not solve the underlying issue.
A variety of methods have been proposed for set prediction,
of which the most successful view the task generatively
[23–26] by transforming an exchangeable sample to
the target set. Effectively, the task is framed as conditional
density estimation over the space of sets [23]. Based on this
insight, more general generative models such as continuous
normalizing flows [27] and diffusion models [28] have been
adapted to permutation invariant densities.
2.3 Continuous normalizing flows
Continuous normalizing flows (CNFs) [29, 30] are a family
of powerful generative models which define invertible, continuous
transformations between probability distributions.
The conditional flow matching [31] framework allows us
to train CNFs without expensive numerical integration by
sampling a conditional probability path and regressing a
closed form vector field which, in expectation, recovers
the exact same gradients as regressing over the marginal
field [31, 32]. In this work, we adopt the rectified flow [33]
probability path which we pair with a minibatch approximation
to the optimal transport coupling [34]. We build on
prior work on equivariant flows [35–37] which are known
to produce samples from invariant distributions [13].
3. METHOD
Let $\mathcal{P} \subset \mathbb{R}^k$ be the space of synthesizer parameters¹ and
$\mathcal{S} \subset \mathbb{R}^n$ be the space of audio signals. A synthesizer is a
map between these spaces, $f : \mathcal{P} \to \mathcal{S}$. It is common that
$f$ is not injective. That is, there exist multiple sets of parameters,
e.g. $x^{(1)}, x^{(2)} \in \mathcal{P}$, that produce the same signal,
i.e. $f(x^{(1)}) = f(x^{(2)})$. A trivial example is given when the
synthesizer has a global gain control: all sets of parameters
with zero global gain will produce an equivalent, silent
signal. Clearly, in such cases, $f$ lacks a well-defined inverse.

¹ We include MIDI pitch and note on/off times in our definition of
synthesizer parameters. Effectively, we are dealing in single notes.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Therefore, we model our uncertainty over $x$ as $p(x \mid y)$, the
distribution of parameters $x \in \mathcal{P}$ given a signal $y \in \mathcal{S}$.
When there is some transformation (let us denote this $g$)
that acts on parameters such that $f(g \cdot x) = f(x)$ for all
$x \in \mathcal{P}$, we say that $g$ is a symmetry of $f$. For
example, $g$ might permute the oscillators. If a set $G$ of such
transformations contains an identity transformation and an
inverse for each $g \in G$, then under composition it is a group
acting on $\mathcal{P}$. For any $x \in \mathcal{P}$, its $G$-orbit (the set of points
reachable via actions $g \in G$) is defined as
$\mathcal{O}_x = \{ g \cdot x : g \in G \}$. The set of all $G$-orbits is a disjoint partition of $\mathcal{P}$:

$$\mathcal{P} = \bigsqcup_{\mathcal{O} \in \mathcal{P}/G} \mathcal{O}. \tag{1}$$

This allows any parameter in $\mathcal{P}$ to be expressed as $x = g \cdot o$
for some orbit representative $o \in \mathcal{O}$ and group element
$g \in G$.² Hence, we can factor our conditional parameter
density as:³

$$p(x \mid y) = \underbrace{p(\mathcal{O} \mid y)}_{\text{Orbit}} \cdot \underbrace{p(g \mid \mathcal{O}, y)}_{\text{Symmetry}} \cdot \underbrace{\eta(\mathcal{O})}_{\text{Stabilizer}}, \tag{2}$$

where $p(\mathcal{O} \mid y)$ is invariant under transformations in $G$,
$p(g \mid \mathcal{O}, y)$ describes the remaining uncertainty due to symmetry,
and $\eta(\mathcal{O})$ is a scaling factor determined by the stabilizer
of $\mathcal{O}$ in $G$. Depending on the nature of $G$, the posterior
over orbits $p(\mathcal{O} \mid y)$ may be considerably simpler than that
over parameters $p(x \mid y)$. For an additive synthesizer of 16
permutable oscillators, $|G| = 16! \approx 20.9 \times 10^{12}$, meaning
that any mode of $p(x \mid y)$ is accompanied by over 20 trillion
symmetries, while it is represented only once in $p(\mathcal{O} \mid y)$.
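The orbit and stabilizer terms in Eq. (2) can be illustrated by brute force on a small permutation group. The sketch below is illustrative only; the helper names `orbit` and `stabilizer_size` are hypothetical. It enumerates the orbit of a three-oscillator configuration and counts the permutations that leave it unchanged; by the orbit-stabilizer theorem, the product of the two counts equals $|G| = k!$.

```python
import itertools
import math


def orbit(oscillators):
    """All distinct configurations reachable by permuting the oscillators."""
    return set(itertools.permutations(oscillators))


def stabilizer_size(oscillators):
    """Number of index permutations that leave the configuration unchanged."""
    k = len(oscillators)
    return sum(
        1
        for perm in itertools.permutations(range(k))
        if tuple(oscillators[i] for i in perm) == tuple(oscillators)
    )


distinct = ((0.1, 0.9), (0.2, 0.5), (0.3, 0.3))
repeated = ((0.1, 0.9), (0.1, 0.9), (0.3, 0.3))  # two identical oscillators

for x in (distinct, repeated):
    # Orbit-stabilizer theorem: |orbit| * |stabilizer| == |G| == k!
    print(len(orbit(x)), stabilizer_size(x), math.factorial(len(x)))

# Group order for 16 permutable oscillators, as quoted in the text.
print(math.factorial(16))  # 20922789888000
```

A configuration with repeated oscillators has a non-trivial stabilizer and therefore a smaller orbit, which is why the factorization needs a correction term such as $\eta(\mathcal{O})$.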
We therefore want to factor out the effect of symmetry.
Under two reasonable assumptions, it can be shown that
$p(g \mid \mathcal{O}, y)$ is necessarily uniform over $G$. First, we assume
$G$-invariance of the likelihood $p(y \mid x)$ which arises naturally
from the symmetry of our synthesizer. Secondly, we
assume $G$-invariance of the prior $p(x)$. This is a stronger
assumption, which we satisfy by randomly sampling our
training data from $G$-invariant parameter distributions, in
contrast to some previous work which produces training
data from handmade synthesizer presets.
Under a uniform symmetry distribution, we can say that
$p(x \mid y) \propto p(\mathcal{O} \mid y)\,\eta(\mathcal{O})$ and reduce our task to density
estimation over orbits. Of course, the orbit of a point in $\mathcal{P}$
is an abstract equivalence class and cannot directly be
represented. However, by enforcing $G$-invariance we ensure
that our model is unable to discriminate between points on
the same orbit, and thus implicitly learn the orbital posterior.
² This factorization is unique if and only if the stabilizer of $x$ (i.e. the
subgroup of $G$ that leaves $x$ unchanged) is trivial.
³ A full derivation is given in the supplementary material.

3.1 Permutation equivariant continuous normalizing flows

[Figure 2 diagram. Param → token (to model): synthesizer parameters, per-parameter embeddings, FFN, parameter tokens, sparse assignment, internal tokens. Token → param (from model): internal tokens, transposed sparse assignment, parameter tokens, per-parameter embeddings, synthesizer parameters.]
Figure 2: The Param2Tok projection for learning to
assign parameters to tokens with relaxed equivariance.

Our task, then, is to learn a $G$-invariant distribution, focusing
on the case where $G$ is the product of permutation
(symmetric) subgroups $S_k$, i.e. $G = \times_i S_{k_i}$ of orders
$k_i$. Köhler et al. [13] showed that the pushforward of an
isotropic Gaussian under an equivariant continuous normalizing
flow is a density with the corresponding invariance.
We thus seek a permutation equivariant architecture,
making a Transformer [38] encoder without positional
encoding [25] a natural choice. We adopt the Diffusion
Transformer (DiT) [39] architecture with Adaptive Layer
Norm (Ada-LN) conditioning.
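The equivariance property relied on here can be verified directly for a single attention layer. The following sketch is a simplification, not the DiT architecture: it uses single-head dot-product attention with identity projections and no positional encoding, and checks that permuting the input tokens permutes the outputs in the same way. Learned projections applied identically to every token would preserve this property.

```python
import math


def self_attention(tokens):
    """Single-head dot-product self-attention with identity projections and
    no positional encoding. `tokens` is a list of equal-length float lists."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        peak = max(scores)  # subtract the maximum for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[j] for w, value in zip(weights, tokens)) / total
            for j in range(dim)
        ])
    return outputs


def permute(rows, perm):
    """Reorder a list of rows according to the index list `perm`."""
    return [rows[i] for i in perm]


tokens = [[0.2, -1.0], [1.5, 0.3], [-0.7, 0.8], [0.0, 0.4]]
perm = [2, 0, 3, 1]

lhs = self_attention(permute(tokens, perm))  # permute, then attend
rhs = permute(self_attention(tokens), perm)  # attend, then permute
max_gap = max(abs(a - b) for r1, r2 in zip(lhs, rhs) for a, b in zip(r1, r2))
print(max_gap)
```

The two orders of operation agree up to floating-point rounding, so a vector field built from such layers treats tokens as an unordered set.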
Next, we must select an appropriate map from vectors
in $\mathcal{P}$ to permutable Transformer tokens. For a simple synthesizer
consisting of $k$ permutable oscillators we could
simply define $k$ tokens, assigning the parameters of each
oscillator to a distinct token. However, for a synthesizer
with multiple permutation symmetries, each acting on a distinct
subset of parameters, and some further non-permutable
parameters, this is more challenging. As well as assigning
parameters to tokens, we must indicate which tokens
may be permuted with which other tokens and which may
not be permuted at all.
Further, we may encounter quasi-symmetric structure
in real synthesizers.⁴ Specifically, we define conditional
symmetry to mean a symmetry that acts only on some subset
$\mathcal{P}' \subset \mathcal{P}$. For example, in Surge XT (see section 4.2), only
certain values of the “routing” parameter allow permutation
symmetry between filters. Further, if the actions of
a group leave the signal almost unchanged, up to some
error bound, we call this an approximate symmetry. For
example, swapping two filters placed before and after a
soft waveshaper may yield perceptually similar, but not
identical signals. Whilst it may be possible to hand design
tokenization strategies to accommodate these behaviours
on a case-by-case basis, ideally we would learn such a
mapping directly from data.

⁴ While formal treatment is beyond the scope of this paper, we briefly
illustrate them here as they nonetheless represent a source of uncertainty in
inverting real world synthesizers.

3.2 Equivariance discovery with Param2Tok
To address this, we propose Param2Tok, illustrated
in Fig. 2, a mapping from parameters to tokens that can
exploit the Transformer's permutation equivariance when
it is beneficial, but is not constrained to doing so.
Given a synthesizer with $k$ parameters, we learn a matrix
$Z \in \mathbb{R}^{k \times d}$, and represent each parameter by scalar multiplication
with the corresponding row of $Z$, followed by
a feed-forward neural network $h_\theta$ applied row-wise. To
map parameter tokens to Transformer tokens, we learn an
assignment matrix $A \in \mathbb{R}^{n \times k}$, from $k$ parameters to $n$
tokens, giving the input to the first layer of our Transformer:

$$X_0 = A\, h_\theta(\operatorname{diag}(x) Z). \tag{3}$$

To return from tokens to parameters, we use weight-tying
[40] of the assignment matrix and another set of learned
vectors $Z' \in \mathbb{R}^{k \times d}$ as follows:

$$\tilde{x} = \left( Z' \odot A^\top X_l \right) \mathbf{1}_d, \tag{4}$$

where $\mathbf{1}_d$ is simply a $d$-dimensional vector of ones and
$X_l$ is the output of the $l$th Transformer layer.
We initialize $Z$, $Z'$, and $A$ such that Param2Tok is
approximately invariant to any permutation of the parameter
vector, following the scheme described in the supplementary
material. Intuitively, the model starts with the maximum
possible symmetry which is then gradually broken
through training. Finally, we encourage a sparse assignment
with an $L_1$ penalty on the entries of $A$. Full initialization
and regularization details can be found in the supplementary
material, as well as proofs that Param2Tok can
represent quasi-symmetries.
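The shapes in Eqs. (3) and (4) can be traced with a small pure-Python sketch. Several parts are assumptions made for illustration: an elementwise `tanh` stands in for the feed-forward network $h_\theta$, the Transformer between the two maps is replaced by the identity, and the shared-row $Z$ with uniform $A$ is just one simple setting that happens to be permutation invariant. It is not the initialization scheme from the supplementary material.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def h_theta(rows):
    # Placeholder for the row-wise feed-forward network (assumption: tanh).
    return [[math.tanh(v) for v in row] for row in rows]


def param_to_tokens(x, Z, A):
    """Eq. (3): X0 = A h(diag(x) Z). x has length k, Z is k x d, A is n x k."""
    scaled = [[x_i * z for z in z_row] for x_i, z_row in zip(x, Z)]  # diag(x) Z
    return matmul(A, h_theta(scaled))  # n x d


def tokens_to_param(X, Z_out, A):
    """Eq. (4): x~ = (Z' * A^T X) 1_d, reusing A (weight tying). Returns length k."""
    back = matmul(transpose(A), X)  # k x d
    return [sum(z * b for z, b in zip(z_row, b_row)) for z_row, b_row in zip(Z_out, back)]


k, d, n = 4, 3, 2
Z = [[0.5, -0.2, 0.1]] * k             # identical rows (illustrative assumption)
Z_out = [[0.3, 0.3, 0.3]] * k
A = [[1.0 / k] * k for _ in range(n)]  # uniform assignment (illustrative assumption)

x = [0.9, -0.4, 0.1, 0.6]
x_perm = [x[i] for i in (2, 0, 3, 1)]

X0 = param_to_tokens(x, Z, A)
X0_perm = param_to_tokens(x_perm, Z, A)
x_tilde = tokens_to_param(X0, Z_out, A)  # identity stands in for the Transformer
```

In this setting `X0` and `X0_perm` agree up to rounding, so the tokens cannot distinguish parameter orderings. Training is then free to move $Z$ and $A$ away from this point wherever the data require the symmetry to be broken.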
4. EXPERIMENTS
We train CNF models with the rectified flow probability
path [33]. A minibatch approximation of the optimal transport
coupling [34] is implemented using the Hungarian
algorithm. Conditioning dropout is applied with a probability
of 10% and inference is performed with classifier-free
guidance [41, 42] with a scale of 2.0. Sampling is
performed for 100 steps with the 4th order Runge-Kutta
method. For brevity, we omit some details about our models,
datasets, and training configurations, but these can be
found in the supplementary material as well as plots of
learned Param2Tok parameters.
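The training and sampling ingredients named above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the authors' code. The pairing is found by brute force over permutations, which is only feasible for tiny batches; the Hungarian algorithm solves the same assignment problem efficiently. The function names are hypothetical, and classifier-free guidance and conditioning dropout are omitted.

```python
import itertools
import math


def minibatch_ot_pairing(sources, targets):
    """Minimum-cost one-to-one pairing under squared Euclidean cost (brute force)."""
    def cost(perm):
        return sum(
            sum((s - t) ** 2 for s, t in zip(sources[i], targets[j]))
            for i, j in enumerate(perm)
        )
    best = min(itertools.permutations(range(len(targets))), key=cost)
    return [(sources[i], targets[j]) for i, j in enumerate(best)]


def rectified_flow_example(x0, x1, t):
    """Point on the straight path x_t = (1 - t) x0 + t x1 and its velocity x1 - x0."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]  # regression target for the field
    return x_t, velocity


def rk4_sample(vector_field, x, num_steps=100):
    """Integrate dx/dt = vector_field(x, t) from t = 0 to t = 1 with classic RK4."""
    h = 1.0 / num_steps
    for step in range(num_steps):
        t = step * h
        k1 = vector_field(x, t)
        k2 = vector_field([xi + 0.5 * h * ki for xi, ki in zip(x, k1)], t + 0.5 * h)
        k3 = vector_field([xi + 0.5 * h * ki for xi, ki in zip(x, k2)], t + 0.5 * h)
        k4 = vector_field([xi + h * ki for xi, ki in zip(x, k3)], t + h)
        x = [xi + h / 6.0 * (a + 2 * b + 2 * c + d)
             for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]
    return x


# Noise samples are paired with the nearest data points before regression.
pairs = minibatch_ot_pairing([[0.0, 0.0], [1.0, 1.0]], [[1.1, 0.9], [0.1, -0.1]])
x_t, velocity = rectified_flow_example([0.0, 0.0], [2.0, 4.0], t=0.25)
# Sanity check of the integrator on dx/dt = -x, whose solution at t = 1 is exp(-1).
decayed = rk4_sample(lambda x, t: [-xi for xi in x], [1.0])
```

In training, a network would regress `velocity` at `(x_t, t)` given the audio conditioning; at inference, `rk4_sample` would be called with that network as `vector_field`, starting from Gaussian noise.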
4.1 Isolating permutation symmetry with k-osc
To isolate the effect of permutation symmetry, we propose a
simple synthetic task called $k$-osc, in which data is generated
with a simple synthesizer which sums the output of
$k$ identical oscillators. Each oscillator is parameterized by:
angular frequencies $\omega$, amplitudes $\alpha$, and waveform shapes
$\gamma$. The waveform shape parameter linearly interpolates
between sine, square, and sawtooth waves, of which the
latter two are antialiased with PolyBLEP [43].
This synthesizer is permutation invariant, but we can
break the symmetry by constraining each oscillator to a
disjoint frequency range. This allows us to compare the
performance of models under both symmetric and asymmetric
conditions. Similarly, we can control the degree of
symmetry by varying $k$, which we test at $k \in \{4, 8, 16, 32\}$.
Data points are created by sampling a parameter vector
$x = (\omega, \alpha, \gamma) \in [-1, 1]^{3k}$. Frequency parameters are
rescaled internally to $[0, \pi]$ for the symmetric variant or
$\left[ \frac{(i-1)\pi}{k}, \frac{i\pi}{k} \right]$ for $i = 1, \ldots, k$ for the asymmetric variant.
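The two frequency rescalings can be sketched as follows. The affine form of the map is an assumption, since the text only states the target ranges, and `rescale_frequencies` is a hypothetical helper name. Indices in the code are 0-based, so oscillator `i` occupies the band written as $[(i-1)\pi/k, i\pi/k]$ for 1-based $i$.

```python
import math


def rescale_frequencies(omega_raw, symmetric):
    """Map raw frequency parameters in [-1, 1] to angular frequencies in [0, pi]."""
    k = len(omega_raw)
    unit = [(w + 1.0) / 2.0 for w in omega_raw]  # [-1, 1] -> [0, 1]
    if symmetric:
        # Every oscillator shares the full range, so oscillators are exchangeable.
        return [u * math.pi for u in unit]
    band = math.pi / k
    # Oscillator i is confined to its own band, which breaks the symmetry.
    return [(i + u) * band for i, u in enumerate(unit)]


raw = [-0.5, 0.5]
swapped = [0.5, -0.5]

sym_a = rescale_frequencies(raw, symmetric=True)
sym_b = rescale_frequencies(swapped, symmetric=True)
asym_a = rescale_frequencies(raw, symmetric=False)
asym_b = rescale_frequencies(swapped, symmetric=False)

# Symmetric: swapping the raw parameters gives the same set of frequencies.
print(sorted(sym_a) == sorted(sym_b))
# Asymmetric: the same swap now produces a different set of frequencies.
print(sorted(asym_a) == sorted(asym_b))
```

Under the asymmetric variant a permuted parameter vector therefore renders a different signal, so each signal has a single preimage with respect to oscillator ordering.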
4.1.1 Models
We compare several variants of both generative and
regression-based models. To help better interpret results,
we also compute metrics for randomly sampled parameter
vectors. All models extract a representation of the signal
using a 1D frequency domain CNN.
CNF models: We train three CNF models. The first,
CNF (Equivariant), parameterizes the flow's vector field
with a DiT and Ada-LN conditioning, with the parameter-to-token
mapping described in Section 3.1. Specifically, the
$i$th input token is $(\omega_i, \alpha_i, \gamma_i)$, and $n = k$. Conversely,
CNF (Param2Tok) learns the mapping via the relaxed
equivariance of Param2Tok, again with $k$ model tokens.
Finally, to isolate the effect of model equivariance,
we train CNF (MLP) using a residual MLP vector field
with no intrinsic equivariance.
Regression models: All regression models use a residual
MLP prediction head acting on the CNN representation.
The first, FFN (MSE), simply regresses parameters with
MSE loss. The remaining models mirror approaches to
permutation symmetry found in the literature [7, 21, 22].
FFN (Sort) sorts oscillators by their frequency before computing
the MSE loss, while FFN (Chamfer) computes a
permutation invariant loss using a differentiable Chamfer
distance [44]. While both approaches have improved
performance in prior work, neither is able to resolve the
responsibility problem [9].
4.1.2 Metrics
For every inferred parameter vector, we reconstruct the
signal by synthesizing with these parameters. We measure
dissimilarity from the ground truth signal using the $L_2$ log
spectral distance (LSD). To compare inferred parameters
with the ground truth, we simply use the mean squared error
(MSE) for the asymmetric condition. For the symmetric
condition, we compute the optimal linear assignment cost
(LAC), that is, the minimum MSE across permutations.
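The LAC can be written down directly from its definition as a minimum over permutations. The sketch below does so by brute force, which is only practical for small $k$; for the larger settings tested here, an assignment solver such as SciPy's `scipy.optimize.linear_sum_assignment` would be the usual choice, since the MSE decomposes into per-pair costs. The data values are made up for illustration.

```python
import itertools


def mse(a, b):
    """Mean squared error between two lists of per-oscillator parameter tuples."""
    flat_a = [v for osc in a for v in osc]
    flat_b = [v for osc in b for v in osc]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def linear_assignment_cost(predicted, target):
    """Minimum MSE over all orderings of the predicted oscillators (brute force)."""
    return min(mse(list(perm), target) for perm in itertools.permutations(predicted))


# Each oscillator is (frequency, amplitude, shape).
target = [(0.1, 0.9, -0.5), (0.6, 0.2, 0.3), (-0.4, 0.5, 0.0)]
predicted = [target[2], target[0], target[1]]  # same oscillators, different order

plain_mse = mse(predicted, target)
lac = linear_assignment_cost(predicted, target)
print(plain_mse, lac)
```

A prediction that recovers every oscillator but in a different order is penalized by plain MSE and scores zero under LAC, which is why MSE is only meaningful in the asymmetric condition.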
4.1.3 Results
Results are shown in Fig. 3. Comparing CNF (Equivariant)
and FFN (MSE) clearly illustrates the effect of permutation
symmetry. Under symmetric conditions, CNF (Equivariant)
excels, while FFN (MSE) performs poorly across metrics.
Without symmetry, the roles are reversed: FFN (MSE)
achieves excellent results, while CNF (Equivariant) imposes
an overly restrictive equivariance and fails to discriminate
between oscillators.
FFN (Chamfer), FFN (Sort), and CNF (MLP) offer “partial”
solutions, being less affected when symmetry is present,
but still underperforming CNF (Equivariant). Under asymmetric
conditions, FFN (Chamfer) also imposes too stringent
a restriction and performs similarly to CNF (Equivariant).
CNF (MLP), however, lacks explicit equivariance
and performs on par with FFN (MSE). Across conditions,
CNF (Param2Tok) is competitive with the best models,
demonstrating that Param2Tok can take advantage of
model equivariance when helpful, but still learn distinct
oscillator representations where necessary. Broadly, both
parameter- and audio-domain metrics suggest that failing to
account for permutation symmetry will harm performance
and that CNFs with relaxed equivariance offer good performance
across both symmetric and asymmetric scenarios.

[Figure 3 plots. Four panels, each with k ∈ {4, 8, 16, 32} on the horizontal axis: Log Spectral Distance (symmetric), Log Spectral Distance (asymmetric), Linear Assignment Cost (symmetric), Mean Squared Error (asymmetric). Methods: CNF (Equivariant), CNF (MLP), CNF (Param2Tok), FFN (Chamfer), FFN (MSE), FFN (Sort), Random.]
Figure 3: Evaluation results for the symmetric and asymmetric variants of the k-osc task.
4.2 Real-world synthesizer inversion
Our goal, of course, is to invert real-world software synthesizers.
To this end, we test our method on Surge XT,
an open source, feature rich “hybrid” synthesizer which
incorporates multiple methods of producing and shaping
sound. It exhibits multiple permutation symmetries, e.g.
between oscillators, LFOs, and filters, as well as further
conditional and approximate symmetries.
We construct two datasets of 2 million samples each
by randomly sampling from the synthesizer's parameter
space and rendering the corresponding audio using the
pedalboard library [45].⁵
The first dataset, referred to as Surge Simple, has 92
parameters including controls for three oscillators, two
filters, three envelope generators, 5 low frequency oscillators
(LFOs), and some global parameters. Omissions
include discrete parameters, and those which affect global
signal routing. The second, Surge Full, has 165 parameters,
of which many are discrete, some alter the internal
routing, and some control audio effects. Surge Full thus
covers a broader sonic range, and introduces uncertainty
beyond permutation symmetry.
In both cases, audio was rendered in stereo at 44.1 kHz for
a duration of 4 seconds and converted to a Mel-spectrogram
with 128 Mel bands, using an analysis window of 25 ms
and a hop length of 10 ms. Spectrograms were standardized
using train dataset statistics. Like Le Vaillant et al. [12], we
adopt a one-hot representation of discrete and categorical
parameters, concatenating all scalar and one-hot parameter
representations to a single vector. All synthesis parameters
are scaled to the interval $[-1, 1]$.

⁵ Surge XT actually provides comprehensive Python bindings, allowing
for direct programmatic interaction with the synthesizer. We opt to use
pedalboard in order to ensure our system is applicable to any software
synthesizer, which limits us only to parameters exposed to the plugin host.
4.2.1 Models
CNF models: All CNF models receive audio conditioning
from an Audio Spectrogram Transformer (AST) [14] with
pre-normalization [46]. Instead of a single [CLS] token, we
increase conditioning expressivity by learning an individual
query token for each layer of the CNF's vector field. All
models are trained end-to-end. Following prior findings on
mixed continuous-discrete flow-based models [47], we do
not constrain flows over discrete parameters to the simplex
and simply adopt the same Gaussian source distribution for
all dimensions. Our CNF (Param2Tok) model again
uses the Param2Tok module with a DiT vector field.
Again, we train a CNF (MLP) model with a residual MLP
vector field to help isolate the effect of model equivariance
from that of the generative approach.
Baselines: We adopt the AST [14] approach proposed
by Bruford et al. [8] as our regression baseline, and Le
Vaillant et al.'s [12] VAE + RealNVP method as a further
generative baseline.
Surge XT (Simple)
Method            Params    MSS ↓           wMFCC ↓         SOT ↓            RMS ↑
CNF (Param2Tok)   40.2M     3.18 ± 0.06     5.57 ± 0.06     0.042 ± 0.001    0.948 ± 0.002
CNF (MLP)         41.6M     3.53 ± 0.06     6.02 ± 0.07     0.045 ± 0.001    0.932 ± 0.002
AST               45.3M     6.51 ± 0.10     9.11 ± 0.09     0.076 ± 0.001    0.738 ± 0.005
VAE + RealNVP     39.5M*    26.23 ± 0.14    15.80 ± 0.10    0.184 ± 0.002    0.595 ± 0.006

Surge XT (Full)
Method            Params    MSS ↓           wMFCC ↓         SOT ↓            RMS ↑
CNF (Param2Tok)   40.2M     6.13 ± 0.09     9.63 ± 0.08     0.086 ± 0.001    0.939 ± 0.002
CNF (MLP)         41.6M     7.35 ± 0.11     11.01 ± 0.09    0.095 ± 0.001    0.924 ± 0.002
AST               45.3M     14.73 ± 0.14    14.42 ± 0.11    0.136 ± 0.002    0.804 ± 0.004
VAE + RealNVP     44.3M*    28.86 ± 0.19    18.61 ± 0.12    0.178 ± 0.002    0.703 ± 0.004

Surge Full → NSynth
Method            Params    MSS ↓           wMFCC ↓         SOT ↓            RMS ↑
CNF (Param2Tok)   40.2M     11.04 ± 0.28    19.71 ± 0.24    0.158 ± 0.004    0.834 ± 0.006
CNF (MLP)         41.6M     13.51 ± 0.32    21.09 ± 0.23    0.175 ± 0.004    0.840 ± 0.006
AST               45.3M     24.04 ± 0.30    28.62 ± 0.17    0.224 ± 0.003    0.755 ± 0.007
VAE + RealNVP     44.3M     35.29 ± 0.26    23.50 ± 0.14    0.266 ± 0.004    0.689 ± 0.007

Surge Full → FSD50K
Method            Params    MSS ↓           wMFCC ↓         SOT ↓            RMS ↑
CNF (Param2Tok)   40.2M     15.40 ± 0.17    17.25 ± 0.12    0.168 ± 0.003    0.680 ± 0.005
CNF (MLP)         41.6M     16.96 ± 0.18    17.83 ± 0.12    0.183 ± 0.003    0.710 ± 0.004
AST               45.3M     19.09 ± 0.16    21.35 ± 0.14    0.187 ± 0.003    0.682 ± 0.004
VAE + RealNVP     44.3M     25.06 ± 0.20    21.24 ± 0.12    0.247 ± 0.003    0.681 ± 0.004

* VAE + RealNVP parameter counts for the Simple and Full datasets are 39.5M and 44.3M, respectively. The difference arises because the latent space dimension is set to the length of the parameter vector.
Table 1: Top (Surge XT Simple and Full): Audio reconstruction results on Simple and Full variants of the Surge XT synthesizer inversion task. Bottom (NSynth and FSD50K):
Out-of-domain audio reconstruction results for models trained on the Surge Full dataset. All: Results reported with 95%
confidence interval computed across test dataset.

4.2.2 Metrics
While previous sound matching work has typically included
parameter domain distances as an evaluation metric, here we
argue that if the synthesizer exhibits symmetry, such metrics
are uninformative and possibly misleading.⁶ While under
simple symmetries we may select an invariant metric, as in
Section 4.1.2, this approach does not scale to more complex
synthesizers. We thus rely on audio reconstruction metrics
which both share the exact invariances of the synthesizer and
quantify performance on the task we are actually concerned
with, i.e. best reconstructing the signal.

⁶ For more detail on this point, see the supplementary material.

We measure spectrotemporal dissimilarity with a multi-scale
spectral (MSS) distance computed over log-scaled
Mel spectrograms at multiple resolutions. However, a slight
error in one parameter may lead an otherwise good match
to be unfairly penalized due to shifts in time or frequency.
We thus include a warped Mel-frequency cepstral coefficient
(wMFCC) metric as a more “malleable” alternative,
given by the optimal $L_1$ dynamic time warping (DTW)
cost between two MFCC series. This allows robustness
to timing and pitch deviations.
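The DTW cost behind the wMFCC metric can be sketched with the textbook dynamic program. The step pattern (match, insertion, deletion) and the absence of any normalization are assumptions, as the text does not specify them. The inputs below are toy one-dimensional feature series rather than real MFCCs.

```python
def dtw_l1_cost(series_a, series_b):
    """Minimum cumulative L1 frame distance over monotonic alignments (classic DTW)."""
    inf = float("inf")
    n, m = len(series_a), len(series_b)
    # acc[i][j] holds the best cost of aligning the first i and first j frames.
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_cost = sum(
                abs(a - b) for a, b in zip(series_a[i - 1], series_b[j - 1])
            )
            acc[i][j] = frame_cost + min(
                acc[i - 1][j],      # frame of series_a repeated
                acc[i][j - 1],      # frame of series_b repeated
                acc[i - 1][j - 1],  # both series advance
            )
    return acc[n][m]


envelope = [[0.0], [1.0], [2.0], [1.0], [0.0]]
delayed = [[0.0], [0.0], [1.0], [2.0], [1.0], [0.0]]  # same shape, one frame late
louder = [[0.0], [1.0], [3.0], [1.0], [0.0]]          # same timing, different peak

pointwise = sum(abs(a[0] - b[0]) for a, b in zip(envelope, delayed))
print(pointwise, dtw_l1_cost(envelope, delayed), dtw_l1_cost(envelope, louder))
```

The delayed copy incurs a pointwise penalty but zero DTW cost, while a genuine difference in content still registers. This is the tolerance to timing deviations that the metric is meant to provide.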
Pointwise spectral distances fail to capture distances
“along” the frequency axis, such as pitch dissimilarity [9,
48]. We thus adopt the spectral optimal transport (SOT) distance
[49] as a pitch-sensitive measure of spectral similarity.
Finally, we include a simple cosine similarity between framewise
RMS energy to capture amplitude envelope similarity.
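The RMS metric is simple enough to state in code. The sketch below assumes non-overlapping edge handling and arbitrary frame and hop sizes, none of which are given in the text, and uses a synthetic decaying tone in place of rendered audio.

```python
import math


def framewise_rms(signal, frame_length, hop_length):
    """Root-mean-square energy of each full frame of the signal."""
    starts = range(0, len(signal) - frame_length + 1, hop_length)
    return [
        math.sqrt(sum(s * s for s in signal[i:i + frame_length]) / frame_length)
        for i in starts
    ]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


# A decaying tone, a quieter copy of it, and the same tone played backwards.
tone = [math.sin(0.3 * n) * math.exp(-n / 200.0) for n in range(800)]
quieter = [0.5 * s for s in tone]
backwards = tone[::-1]

rms_tone = framewise_rms(tone, frame_length=100, hop_length=50)
sim_quieter = cosine_similarity(rms_tone, framewise_rms(quieter, 100, 50))
sim_backwards = cosine_similarity(rms_tone, framewise_rms(backwards, 100, 50))
print(sim_quieter, sim_backwards)
```

Because cosine similarity ignores overall scale, the quieter copy scores close to 1, whereas reversing the envelope lowers the score. The metric therefore measures envelope shape rather than level.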
4.2.3 Results
Metrics are computed on a held-out test dataset of 10,000
sounds synthesized in the same manner as the training
dataset. Results are presented in Table 1, and audio examples
can be heard on the companion website.⁷
The CNF models perform consistently with our theoretical
expectations and the results of our toy task. In particular,
CNF (Param2Tok) outperforms all models across all
metrics on both datasets. The learned Param2Tok parameters
suggest that the model has discovered a token
mapping that corresponds to the intrinsic symmetries of
the synthesizer, and thus benefits from the Transformer's
permutation equivariance.
The non-equivariant CNF (MLP) also achieves reasonable
performance, suggesting that simply adopting a probabilistic
framework is already very helpful in dealing with the
sources of ill-posedness in real synthesizers. The difference
between the MLP and Param2Tok variants is more
pronounced for Surge Full, suggesting that the effect of introducing
greater complexity is magnified in the presence of
unresolved symmetry. This aligns with our decomposition of
the conditional parameter density in Section 3: any change
in $p(\mathcal{O} \mid y)$ is “repeated” in $p(x \mid y)$ for each element of $G$.
Despite extensive tuning, VAE + RealNVP consistently
collapsed to predicting average values for many parameters.
Of course, this conflicts with the results of the original
papers [11, 12]. We suggest this may be due to the use of
smaller datasets of hand-designed synthesizer presets, which
could be sufficiently biased to break the invariance of $p(x)$.

⁷ Link to website: https://benhayes.net/synth-perm/
4.2.4 Out-of-distribution results
While this is not our focus, we report out-of-distribution
results on audio from the NSynth [50] and FSD50K [51]
test datasets in Table 1. Across all models, performance
suffers compared to in-domain data, but the relationships
between models are largely preserved. Thus, even under such
challenging conditions, unhandled symmetry likely detrimentally
influences performance. To adapt our method
to general sound matching, we intend to explore shared
representations for synthesized and non-synthesized audio,
domain adaptation techniques, and information bottlenecks,
such as quantization, for conditioning.
5. CONCLUSION
The implications of our findings are clear: if the synthesizer
has a symmetry, it is better to (i) approach the problem
generatively and (ii) learn the corresponding invariant density.
This extends beyond synthesizers, as audio effects
also commonly exhibit permutation symmetries, as noted
by Nercessian [22]. Beyond audio, these results are of relevance
anywhere neural networks parameterize an external
system with structural symmetries.
A key limitation of our work is the lack of specific experimentation
on the role of quasi-symmetries. A $k$-osc
style task which isolates their effect would be very illuminating
about the extent to which Param2Tok improves
results in their presence, and we intend to include this in
a subsequent publication. We also note that we currently
lack theoretical guarantees that Param2Tok will discover
symmetries. In future work, we thus intend to both
study this module theoretically and gather further empirical
data on the role of initialization and regularization in
its behaviour. We also plan more comprehensive evaluation
of the extent to which Param2Tok-based models
learn the appropriate invariance.
Our proposed method should generalize effectively to
arbitrary software synthesizers. In future work, we will
therefore explore multi-task training by simultaneously
modelling over many synthesizers. We will also seek to
extend our work to the more general sound matching task by
exploring the previously discussed strategies for robustifying
our system to out-of-distribution inputs.
6. ETHICS STATEMENT
Like any AI model, ou wo k inhe en ly encodes he biases
and alues o he au ho s. In pa icula , ou choice o VST
syn hesize e lec s a bias owa ds wes e n popula music.
Howe e , he mo e abs ac na u e o ou syn he ic expe -
imen a ion does sugges ha ou esul s may easonably
be expec ed o gene alize o ools ha be e ep esen he
needs o o he musical cul u es.
The source of training data is of particular importance in
assessing the ethical impact of an AI model, and in this
instance we have worked entirely with synthetically generated
data. Even in our experimentation on a real synthesizer, we
opted to generate our dataset by randomly sampling parameters,
rather than by scraping the internet for presets or exploiting
factory preset libraries. We further note that Surge XT
is an open source synthesizer released under the GNU GPL.
Amid growing concern over AI tools displacing workers,
particularly in the creative industries, we wish to highlight
that our goal in this work is to develop technology
to integrate with and enhance existing workflows. We believe
the choice of task reflects this by seeking to enhance
interaction with an existing family of creative tools, as
opposed to outright replacing them.
7. ACKNOWLEDGEMENTS
B.H. would like to thank Christopher Mitcheltree, Marco
Pasini, Chin-Yun Yu, Jack Loth, and Julien Guinot for their
invaluable feedback on this manuscript in varying stages
of completion, and Jordie Shier for the many inspiring and
illuminating conversations on this topic.
8. REFERENCES
[1] J. Justice. “Analytic Signal Processing in Music
Computation”. In: IEEE Transactions on Acoustics,
Speech, and Signal Processing 27.6 (Dec. 1979),
pp. 670–684. DOI: 10.1109/TASSP.1979.1163321.
[2] R. McAulay and T. Quatieri. “Speech Analysis/Synthesis
Based on a Sinusoidal Representation”. In: IEEE
Transactions on Acoustics, Speech, and Signal Processing
34.4 (Aug. 1986), pp. 744–754.
DOI: 10.1109/TASSP.1986.1164910.
[3] Xavier Serra and Julius Smith. “Spectral Modeling
Synthesis: A Sound Analysis/Synthesis System Based on a
Deterministic Plus Stochastic Decomposition”. In: Computer
Music Journal 14.4 (1990), pp. 12–24.
DOI: 10.2307/3680788.
[4] Martin Roth and Matthew Yee-King. “A Comparison of
Parametric Optimization Techniques for Musical Instrument
Tone Matching”. In: Audio Engineering Society Convention
130. Audio Engineering Society, May 13, 2011.
[5] Jordie Shier. “The Synthesizer Programming Problem:
Improving the Usability of Sound Synthesizers”. MA thesis.
University of Victoria, 2021.
[6] Oren Barkan et al. “InverSynth: Deep Estimation of
Synthesizer Parameter Configurations from Audio Signals”.
In: IEEE/ACM Transactions on Audio, Speech, and Language
Processing 27.12 (Dec. 2019), pp. 2385–2396.
DOI: 10.1109/TASLP.2019.2944568. arXiv: 1812.06349.
[7] Naotake Masuda and Daisuke Saito. “Improving
Semi-Supervised Differentiable Synthesizer Sound Matching
for Practical Applications”. In: IEEE/ACM Transactions on
Audio, Speech, and Language Processing 31 (2023),
pp. 863–875. DOI: 10.1109/TASLP.2023.3237161.
[8] Fred Bruford, Frederik Blang, and Shahan Nercessian.
“Synthesizer Sound Matching Using Audio Spectrogram
Transformers”. In: Proceedings of the 27th International
Conference on Digital Audio Effects. DAFx24. Guildford,
Surrey, Sept. 3–7, 2024.
[9] Ben Hayes, Charalampos Saitis, and György Fazekas.
“The Responsibility Problem in Neural Networks with
Unordered Targets”. In: The First Tiny Papers Track
at ICLR 2023. ICLR. Kigali, Rwanda, May 5, 2023.
[10] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett.
“FSPool: Learning Set Representations with Featurewise
Sort Pooling”. In: International Conference on Learning
Representations. Sept. 23, 2019.
[11] Philippe Esling et al. “Flow Synthesizer: Universal
Audio Synthesizer Control with Normalizing Flows”.
In: Applied Sciences 10.1 (Dec. 2020), p. 302.
DOI: 10.3390/app10010302.
[12] Gwendal Le Vaillant, Thierry Dutoit, and Sebastien
Dekeyser. “Improving Synthesizer Programming From
Variational Autoencoders Latent Space”. In: 2021 24th
International Conference on Digital Audio Effects (DAFx).
Vienna, Austria: IEEE, Sept. 8, 2021, pp. 276–283.
DOI: 10.23919/DAFx51585.2021.9768218.
[13] Jonas Köhler, Leon Klein, and Frank Noe. “Equivariant
Flows: Exact Likelihood Generative Learning for Symmetric
Densities”. In: Proceedings of the 37th International
Conference on Machine Learning. PMLR, Nov. 21, 2020,
pp. 5361–5370.
[14] Yuan Gong, Yu-An Chung, and James Glass. “AST: Audio
Spectrogram Transformer”. In: Proc. Interspeech 2021.
2021, pp. 571–575. DOI: 10.21437/Interspeech.2021-698.
[15] Jesse Engel et al. “DDSP: Differentiable Digital
Signal Processing”. In: 8th International Conference
on Learning Representations. ICLR 2020. Addis
Ababa, Ethiopia, Apr. 2020.
[16] Ben Hayes et al. “A Review of Differentiable Digital
Signal Processing for Music & Speech Synthesis”.
In: Frontiers in Signal Processing (2023).
[17] Naotake Masuda and Daisuke Saito. “Synthesizer
Sound Matching with Differentiable DSP” (Online).
Nov. 7, 2021. DOI: 10.5281/zenodo.5624609.
[18] Noy Uzrad et al. DiffMoog: A Differentiable Modular
Synthesizer for Sound Matching. Jan. 23, 2024.
DOI: 10.48550/arXiv.2401.12570. arXiv: 2401.12570 [eess].
Pre-published.
[19] Yuting Yang et al. “White Box Search over Audio
Synthesizer Parameters”. In: Proc. of the 24th Int.
Society for Music Information Retrieval Conf. ISMIR.
Milan, Italy, 2023.
[20] Han Han, Vincent Lostanlen, and Mathieu Lagrange.
“Learning to Solve Inverse Problems for Perceptual
Sound Matching”. In: IEEE/ACM Transactions on
Audio, Speech, and Language Processing 32 (2024),
pp. 2605–2615. DOI: 10.1109/TASLP.2024.3393738.
[21] Jesse Engel, Rigel Swavely, and Adam Roberts.
“Self-Supervised Pitch Detection by Inverse Audio
Synthesis”. In: Proceedings of the International
Conference on Machine Learning. ICML 2020. 2020, p. 9.
[22] Shahan Nercessian. “Neural Parametric Equalizer
Matching Using Differentiable Biquads”. In: Proceedings
of the 23rd International Conference on Digital Audio
Effects. DAFx2020. Vienna, Austria, 2020, p. 8.
[23] David W. Zhang, Gertjan J. Burghouts, and Cees
G. M. Snoek. “Set Prediction without Imposing
Structure as Conditional Density Estimation”. In:
International Conference on Learning Representations.
Jan. 12, 2021.
[24] Yan Zhang, Jonathon Hare, and Adam Prügel-Bennett.
“Deep Set Prediction Networks”. In: Advances in Neural
Information Processing Systems. Vol. 32. Curran
Associates, Inc., 2019.
[25] Adam R. Kosiorek, Hyunjik Kim, and Danilo J.
Rezende. “Conditional Set Generation with Transformers”.
In: Workshop on Object-Oriented Learning. International
Conference on Machine Learning. arXiv, July 1, 2020.
arXiv: 2006.16841 [cs].
[26] Jinwoo Kim et al. SetVAE: Learning Hierarchical
Composition for Generative Modeling of Set-Structured
Data. Mar. 29, 2021. DOI: 10.48550/arXiv.2103.15619.
arXiv: 2103.15619 [cs]. Pre-published.
[27] Marin Biloš and Stephan Günnemann. “Scalable
Normalizing Flows for Permutation Invariant Densities”.
In: Proceedings of the 38th International Conference
on Machine Learning. PMLR, July 1, 2021, pp. 957–967.
[28] Lemeng Wu et al. “Fast Point Cloud Generation with
Straight Flows”. In: 2023 IEEE/CVF Conference on
Computer Vision and Pattern Recognition (CVPR).
Vancouver, BC, Canada: IEEE, June 2023, pp. 9445–9454.
DOI: 10.1109/CVPR52729.2023.00911.
[29] Ricky T. Q. Chen et al. “Neural Ordinary Differential
Equations”. In: Advances in Neural Information
Processing Systems. Vol. 31. Curran Associates, Inc.,
2018.
[30] Will Grathwohl et al. “FFJORD: Free-Form Continuous
Dynamics for Scalable Reversible Generative
Models”. In: International Conference on Learning
Representations. Sept. 27, 2018.
[31] Alexander Tong et al. “Improving and Generalizing
Flow-Based Generative Models with Minibatch
Optimal Transport”. In: Transactions on Machine
Learning Research (2024).
[32] Yaron Lipman et al. “Flow Matching for Generative
Modeling”. In: The Eleventh International Conference
on Learning Representations. Feb. 1, 2023.
[33] Xingchao Liu, Chengyue Gong, and Qiang Liu.
“Flow Straight and Fast: Learning to Generate and
Transfer Data with Rectified Flow”. In: The Eleventh
International Conference on Learning Representations.
Feb. 1, 2023.
[34] Aram-Alexandre Pooladian et al. “Multisample Flow
Matching: Straightening Flows with Minibatch Couplings”.
In: Proceedings of the 40th International Conference
on Machine Learning. PMLR, July 3, 2023,
pp. 28100–28127.
[35] Yuxuan Song et al. Equivariant Flow Matching
with Hybrid Probability Transport. Dec. 12, 2023.
DOI: 10.48550/arXiv.2312.07168. arXiv: 2312.07168 [cs].
Pre-published.
[36] Majdi Hassan et al. “Equivariant Flow Matching for
Molecular Conformer Generation”. In: ICML’24
Workshop ML for Life and Material Science: From
Theory to Industry Applications. July 17, 2024.
[37] Leon Klein, Andreas Krämer, and Frank Noe.
“Equivariant Flow Matching”. In: Advances in Neural
Information Processing Systems 36 (Dec. 15, 2023),
pp. 59886–59910.
[38] Ashish Vaswani et al. “Attention Is All You Need”.
In: Advances in Neural Information Processing Systems.
Vol. 30. Curran Associates, Inc., 2017.
[39] William Peebles and Saining Xie. “Scalable Diffusion
Models with Transformers”. In: 2023 IEEE/CVF
International Conference on Computer Vision (ICCV).
Paris, France: IEEE, Oct. 1, 2023, pp. 4172–4182.
DOI: 10.1109/ICCV51070.2023.00387.
[40] Ofir Press and Lior Wolf. “Using the Output
Embedding to Improve Language Models”. In: Proceedings
of the 15th Conference of the European Chapter of the
Association for Computational Linguistics: Volume 2,
Short Papers. Valencia, Spain: Association for
Computational Linguistics, 2017, pp. 157–163.
DOI: 10.18653/v1/E17-2025.
[41] Jonathan Ho and Tim Salimans. “Classifier-Free
Diffusion Guidance”. In: NeurIPS 2021 Workshop
on Deep Generative Models and Downstream
Applications. Dec. 8, 2021.
[42] Qinqing Zheng et al. Guided Flows for Generative
Modeling and Decision Making. Dec. 7, 2023.
DOI: 10.48550/arXiv.2311.13443. arXiv: 2311.13443 [cs].
Pre-published.
[43] Vesa Välimäki, Jussi Pekonen, and Juhan Nam.
“Perceptually Informed Synthesis of Bandlimited
Classical Waveforms Using Integrated Polynomial
Interpolation”. In: The Journal of the Acoustical Society
of America 131.1 (Jan. 1, 2012), pp. 974–986.
DOI: 10.1121/1.3651227.
[44] Harry G Barrow et al. “Parametric Correspondence
and Chamfer Matching: Two New Techniques for
Image Matching”. In: Proceedings: Image Understanding
Workshop. Science Applications, Inc, 1977, pp. 21–27.
[45] Peter Sobot. Pedalboard. Version 0.7.3. Zenodo,
Apr. 10, 2023. DOI: 10.5281/ZENODO.7817838.
[46] Ruibin Xiong et al. “On Layer Normalization in the
Transformer Architecture”. In: Proceedings of the
37th International Conference on Machine Learning.
PMLR, Nov. 21, 2020, pp. 10524–10533.
[47] Ian Dunn and David Ryan Koes. Mixed Continuous
and Categorical Flow Matching for 3D De Novo
Molecule Generation. Apr. 30, 2024.
DOI: 10.48550/arXiv.2404.19739. arXiv: 2404.19739 [q-bio].
Pre-published.
[48] Joseph Turian and Max Henry. “I’m Sorry for Your
Loss: Spectrally-Based Audio Distances Are Bad at
Pitch”. Dec. 9, 2020. arXiv: 2012.04572 [cs, eess].
[49] Bernardo Torres, Geoffroy Peeters, and Gaël Richard.
“Unsupervised Harmonic Parameter Estimation Using
Differentiable DSP and Spectral Optimal Transport”.
In: ICASSP 2024 - 2024 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP).
Seoul, Korea, Republic of: IEEE, Apr. 14, 2024,
pp. 1176–1180. DOI: 10.1109/ICASSP48485.2024.10447011.
[50] Jesse Engel et al. “Neural Audio Synthesis of Musical
Notes with WaveNet Autoencoders”. In: Proceedings
of the 34th International Conference on Machine
Learning - Volume 70. ICML’17. Sydney, Australia,
Aug. 6, 2017, pp. 1068–1077.
[51] Eduardo Fonseca et al. “FSD50K: An Open Dataset
of Human-Labeled Sound Events”. In: IEEE/ACM
Transactions on Audio, Speech, and Language Processing
30 (2022), pp. 829–852. DOI: 10.1109/TASLP.2021.3133208.