
Reformulating Soft Dynamic Time Warping: Insights Into Target Artifacts and Prediction Quality

Author: Johannes Zeitler; Meinard Müller
Publisher: Zenodo
DOI: 10.5281/zenodo.17706349
Source: https://zenodo.org/records/17706349/files/000015.pdf
Johannes Zeitler and Meinard Müller
International Audio Laboratories Erlangen, Germany
{johannes.zeitler, meinard.mueller}@audiolabs-erlangen.de
ABSTRACT
Training deep neural networks for music information retrieval (MIR) often relies on strongly aligned data, where each frame has a precisely annotated target label. To reduce this dependency, soft dynamic time warping (SDTW) enables training with weakly aligned data by replacing hard decisions with weighted sums, allowing for gradient-based learning while aligning feature sequences to shorter, often binary, target sequences. However, SDTW introduces gradient artifacts that can cause blurring and degrade predictions, impacting the learning process. In this work, we analyze the sources and effects of these artifacts and propose a reformulation of SDTW that expresses its gradient in terms of an equivalent strongly aligned target representation. This reformulation provides an intuitive interpretation of learned representations and insights into the impact of SDTW hyperparameters on the prediction quality. Using multi-pitch estimation as a case study, we systematically investigate these modified targets and demonstrate their potential for improving training stability, interpretability, and alignment quality in MIR tasks.
1. INTRODUCTION AND RELATED WORK
Many state-of-the-art methods for classification and regression rely on training deep neural networks (DNNs) using large amounts of labeled data. While accurately labeled training data is widely available in fields such as image recognition, real-world time series data is rarely strongly aligned, meaning there is no precise frame-by-frame correspondence between the input signal and its label. This lack of alignment is mainly due to the high cost of manual annotation. In music information retrieval (MIR), examples of such strong targets primarily include Disklavier recordings [1, 2] and synthetic data [3]. In contrast, weakly aligned labels, which provide a global correspondence to the input but lack precise frame-level alignment, are easier to obtain. For example, in musical instrument transcription, the start and end times of segments can be annotated in a music recording and its corresponding symbolic score. This results in a “weak target” representation, where all notes appear in the correct order but their precise onset times and durations remain uncertain.

© J. Zeitler and M. Müller. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: J. Zeitler and M. Müller, “Reformulating Soft Dynamic Time Warping: Insights into Target Artifacts and Prediction Quality”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.

To train DNNs with weak targets, an alignment step between network predictions and the weak targets is essential. One common approach, as applied in [4, 5], is to perform an offline alignment between the predicted features and weak targets using classical dynamic time warping (DTW) [6]. The aligned labels are then treated as strong targets for training, and this alignment can be iteratively refined after each training step in an expectation-maximization-like process. A second approach incorporates the alignment step directly into the computation of the training loss, ensuring that predictions and weak targets are aligned implicitly. A well-known example is the connectionist temporal classification (CTC) loss function, widely used in automatic speech recognition [7] and also adopted in multi-pitch estimation (MPE) [8]. Another method extends DTW by introducing a differentiable minimum function [9–11], leading to the soft dynamic time warping (SDTW) algorithm. SDTW has been applied to tasks such as multi-pitch and pitch class estimation [12, 13]. Unlike CTC, SDTW is not limited to a finite target alphabet and avoids the combinatorial explosion that arises when targets are represented by high-dimensional multi-hot vectors, as in MPE.
Despite its advantages, SDTW-based training exhibits instabilities, with performance variations influenced by data representation [12], training strategies [13], and hyperparameter choices such as step weights [14]. Previous experiments have observed issues like alignment collapse and diagonalization [13], but to date, there has been no straightforward way to analyze the SDTW training process. One key challenge is that the network parameter updates resulting from the SDTW loss are difficult to interpret due to the algorithm’s mathematical complexity. In contrast, when training with strongly aligned targets using element-wise loss functions such as mean squared error (MSE) or binary cross-entropy (BCE), the optimization process is more transparent, as the network simply minimizes the distance between predictions and strong targets.
In this work, we address the following fundamental questions: How does SDTW training differ from standard training with strongly aligned targets? What does a network actually learn when trained with an SDTW loss? To provide answers, as a key contribution of this paper we reformulate the SDTW gradient into an equivalent representation derived from element-wise MSE and BCE losses. This reformulation introduces interpretable modified targets that, when used as strong targets in an element-wise loss function, produce identical network updates to those obtained under SDTW with weak targets. In other words, the DNN learns to minimize the distance (e.g., MSE or BCE) to these modified targets. By inspecting the modified targets, we gain insights into what the DNN actually learns, allowing us to analyze the impact of SDTW hyperparameters and training strategies on network performance. Furthermore, these modified targets can be visualized or sonified at early training stages to provide qualitative assessments of the learning process. Distance measures to strongly aligned reference targets can also be computed, facilitating quantitative evaluations.

Figure 1: Overview of different training and alignment strategies. (a) Strong targets $Y^{\mathrm{strong}}$ with direct frame-wise correspondence to the predictions $X$. (b) Weak targets $Y$ that require alignment to the predictions. (c) Reformulation of weakly aligned targets into modified targets $Y^{\mathrm{mod}}$, ensuring frame-wise correspondence with the predictions.

The remainder of this paper is structured as follows. Section 2 introduces the problem and methodology. Section 3 provides background on the SDTW algorithm. In Section 4.1, we reformulate the gradient of the SDTW loss into the canonical form of element-wise MSE and BCE losses, yielding the so-called modified targets. Section 4.2 discusses the properties of these modified targets. Sections 5.1 and 5.2 outline our experimental framework and demonstrate the impact of different training configurations using SDTW. Finally, we discuss our findings and their implications in Section 5.3 and conclude in Section 6.
2. PROBLEM FORMULATION
We consider the task of training a DNN that takes an input sequence and predicts features $X = (x_1, x_2, \ldots, x_N)$ with $x_n \in \mathbb{R}^D$ and frame index $n \in \{1, 2, \ldots, N\}$. For example, in the case of MPE, the input sequence can be a spectral representation of an audio recording and $X$ is a sequence of estimated pitch vectors. Ideally, to train such a network, we have access to a strongly aligned target sequence $Y^{\mathrm{strong}} = (y^{\mathrm{strong}}_1, \ldots, y^{\mathrm{strong}}_N)$ with $y^{\mathrm{strong}}_n \in \mathbb{R}^D$, which temporally corresponds to $X$ on the frame level, as visualized in Figure 1a. In the example of MPE, the target features are typically encoded as binary multi-hot vectors indicating the presence of certain pitches. In the case of strong targets, we can use an element-wise loss function to train the network, such as MSE

$$c_{\mathrm{MSE}}(x, y) = \tfrac{1}{2} \lVert x - y \rVert_2^2 \tag{1}$$

or BCE

$$c_{\mathrm{BCE}}(x, y) = -y^\top \log x - (1 - y)^\top \log(1 - x), \tag{2}$$

where the logarithm of a vector is defined element-wise.
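
As a concrete illustration of Eqs. (1) and (2), the two element-wise costs can be sketched in NumPy. The function names `c_mse` and `c_bce`, the example vectors, and the clipping constant `eps` are assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np

def c_mse(x, y):
    """Eq. (1): half the squared Euclidean distance between x and y."""
    return 0.5 * float(np.sum((x - y) ** 2))

def c_bce(x, y, eps=1e-7):
    """Eq. (2): binary cross-entropy with element-wise logarithms."""
    x = np.clip(x, eps, 1.0 - eps)  # assumption: clip to avoid log(0)
    return float(-(y @ np.log(x)) - ((1.0 - y) @ np.log(1.0 - x)))

# Hypothetical single frame: prediction x and binary multi-hot target y.
x = np.array([0.9, 0.2, 0.1])
y = np.array([1.0, 0.0, 0.0])
print("MSE cost:", c_mse(x, y))
print("BCE cost:", c_bce(x, y))
```

Both costs vanish only when the prediction matches the target, which is what makes training with strong targets easy to interpret.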
In practice, strongly aligned targets are rarely available in MIR. Instead, weakly labeled targets, which share only a global correspondence with the input data, are more readily obtainable. For instance, in MPE, weak targets $Y = (y_1, y_2, \ldots, y_M)$ with $y_m \in \mathbb{R}^D$ and $m \in \{1, 2, \ldots, M\}$ can be derived from note events in a musical score. While $Y$ and $Y^{\mathrm{strong}}$ contain the same set of feature vectors, they differ in the number of repetitions of each vector. To train DNNs on weakly aligned data, the SDTW loss function is used to align predictions $X$ with weak targets $Y$ during loss computation (see Figure 1b and Section 3). Although SDTW-based training has shown promising results, it is highly sensitive to training strategies and hyperparameter choices [12–14]. Understanding these stability issues requires deeper insights into what the network actually learns. However, unlike training with strong targets and element-wise loss functions, interpreting SDTW-based training remains a challenge due to its complex alignment process.
In this work, we provide deeper insights into SDTW-based training by reformulating the SDTW gradient into an equivalent representation using strongly aligned modified targets $Y^{\mathrm{mod}} = (y^{\mathrm{mod}}_1, \ldots, y^{\mathrm{mod}}_N)$. Training with these modified targets and a standard element-wise loss function results in the same network updates as training with SDTW directly. Importantly, this reformulation does not alter the SDTW training process but instead serves as a tool for better interpreting how the model learns. By analyzing these modified targets (Figure 1c), we gain a clearer understanding of the features the DNN actually learns, making SDTW training as interpretable as training with strongly aligned targets.
3. SOFT DYNAMIC TIME WARPING
We aim to compute and minimize the soft alignment cost between the sequences $X$ and $Y$. Without loss of generality, let $X$ represent a sequence of DNN predictions and $Y$ the corresponding weak targets. To quantify the alignment cost between elements of $X$ and $Y$, we employ a local cost function $c: \mathbb{R}^D \times \mathbb{R}^D \to \mathbb{R}$ such as MSE or BCE. The resulting element-wise costs are stored in the local cost matrix $C \in \mathbb{R}^{N \times M}$ defined as:

$$C(n, m) = c(x_n, y_m). \tag{3}$$

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

To ensure differentiability, we use a smooth approximation of the minimum function, defined as:

$$\operatorname{min}^{\gamma}(S) = -\gamma \log \sum_{s \in S} \exp\{-s/\gamma\}, \tag{4}$$

where $\gamma > 0$ is a temperature hyperparameter, and $S$ is a list of real numbers [9]. The parameter $\gamma$ controls the degree of smoothness, with the function converging to the hard minimum as $\gamma \to 0$. Using this formulation, we define the SDTW forward recursion to compute the accumulated cost matrix $D \in \mathbb{R}^{N \times M}$ as:

$$D(n, m) = \operatorname{min}^{\gamma}\bigl(\{\, w_h C(n, m) + D(n-1, m),\; w_v C(n, m) + D(n, m-1),\; w_d C(n, m) + D(n-1, m-1) \,\}\bigr), \tag{5}$$

where $w_h$, $w_v$, and $w_d$ are step weights controlling the cost contribution of horizontal, vertical, and diagonal steps in the alignment process [9, 14]. The original SDTW formulation from [9] is recovered for $w_h = w_v = w_d = 1$. The overall alignment cost is given by the final element of the accumulated cost matrix:

$$\mathrm{SDTW}(C) = D(N, M). \tag{6}$$
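
The soft minimum of Eq. (4) and the forward recursion of Eqs. (5) and (6) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names are hypothetical, the log-sum-exp shift is added for numerical stability, and the boundary conditions ($D(0,0) = 0$, infinite cost elsewhere on the border) are a common convention assumed here because the text does not state them.

```python
import numpy as np

def softmin(values, gamma):
    """Eq. (4): smooth minimum, computed with a stabilising shift."""
    z = -np.asarray(values, dtype=float) / gamma
    zmax = np.max(z)
    if not np.isfinite(zmax):          # all candidates are +inf
        return np.inf
    return -gamma * (zmax + np.log(np.sum(np.exp(z - zmax))))

def sdtw_forward(C, gamma=1.0, wh=1.0, wv=1.0, wd=1.0):
    """Eq. (5): accumulated cost matrix D for a local cost matrix C."""
    N, M = C.shape
    D = np.full((N + 1, M + 1), np.inf)   # padded with an assumed border
    D[0, 0] = 0.0
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            c = C[n - 1, m - 1]
            D[n, m] = softmin([wh * c + D[n - 1, m],       # horizontal step
                               wv * c + D[n, m - 1],       # vertical step
                               wd * c + D[n - 1, m - 1]],  # diagonal step
                              gamma)
    return D[1:, 1:]                      # Eq. (6): SDTW(C) = D[-1, -1]

# Hypothetical 2x2 cost matrix whose diagonal path has zero cost.
C = np.array([[0.0, 1.0], [1.0, 0.0]])
print("SDTW cost, gamma=1.0 :", sdtw_forward(C, gamma=1.0)[-1, -1])
print("SDTW cost, gamma=0.01:", sdtw_forward(C, gamma=0.01)[-1, -1])
```

For small $\gamma$ the result approaches the classical DTW cost, while larger $\gamma$ yields a value below the hard minimum because all alignment paths contribute.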
The gradient $H \in \mathbb{R}^{N \times M}$ of the SDTW cost w.r.t. the cost matrix:

$$H(n, m) := \frac{\partial\, \mathrm{SDTW}(C)}{\partial C(n, m)} \tag{7}$$

can be computed efficiently by a second recursion in reverse order. We refer to [14] for the technical details of the gradient computation. When using uniform step weights, i.e., $w_h = w_v = w_d = 1$, the elements $H(n, m) \in [0, 1]$ can be interpreted as a form of pseudo-probability, indicating the degree to which the sequence elements $x_n$ and $y_m$ are aligned. $H$ is therefore also called the “soft alignment matrix” (see [14] for a discussion of SDTW alignments). For $\gamma \to 0$, SDTW converges to the classical “hard” DTW algorithm, yielding a binary alignment matrix with $H(n, m) \in \{0, 1\}$. The gradient of the SDTW cost w.r.t. the network outputs $x_n$ is obtained by applying the chain rule:

$$\frac{\partial\, \mathrm{SDTW}(C)}{\partial x_n} = \sum_{m=1}^{M} H(n, m) \cdot \frac{\partial\, c(x_n, y_m)}{\partial x_n}. \tag{8}$$

This gradient is typically computed using the automatic differentiation modules available in modern deep learning frameworks.
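
For readers who want to inspect $H$ directly, the reverse-order recursion can be sketched in NumPy. The backward pass below is derived here by applying the chain rule to Eq. (5) and is not copied from [14]; the function name `sdtw_value_and_grad`, the padded-border convention, and the random test matrix are assumptions of this sketch.

```python
import numpy as np

def sdtw_value_and_grad(C, gamma=1.0, w=(1.0, 1.0, 1.0)):
    """Return SDTW(C) and H = dSDTW/dC, with w = (w_h, w_v, w_d)."""
    N, M = C.shape
    w = np.asarray(w, dtype=float)
    D = np.full((N + 2, M + 2), np.inf)   # assumed border: D(0,0)=0, else inf
    D[0, 0] = 0.0
    P = np.zeros((N + 2, M + 2, 3))       # softmin weights of the three steps
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            a = w * C[n - 1, m - 1] + np.array(
                [D[n - 1, m], D[n, m - 1], D[n - 1, m - 1]])
            e = np.exp(-(a - a.min()) / gamma)
            D[n, m] = a.min() - gamma * np.log(e.sum())
            P[n, m] = e / e.sum()
    E = np.zeros((N + 2, M + 2))          # E(n,m) = dD(N,M)/dD(n,m)
    E[N, M] = 1.0
    H = np.zeros((N, M))
    for n in range(N, 0, -1):             # second recursion, reverse order
        for m in range(M, 0, -1):
            if (n, m) != (N, M):
                E[n, m] = (E[n + 1, m] * P[n + 1, m, 0]
                           + E[n, m + 1] * P[n, m + 1, 1]
                           + E[n + 1, m + 1] * P[n + 1, m + 1, 2])
            H[n - 1, m - 1] = E[n, m] * (w @ P[n, m])
    return D[N, M], H

# Hypothetical random cost matrix, used only to exercise the function.
rng = np.random.default_rng(0)
C = rng.random((4, 3))
cost, H = sdtw_value_and_grad(C, gamma=1.0)
print("SDTW cost:", cost)
print("soft alignment matrix H:\n", np.round(H, 3))
```

With uniform step weights the three softmin weights sum to one, so $H$ coincides with the path marginals and its first and last entries equal 1, since every alignment path starts at $(1, 1)$ and ends at $(N, M)$.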
4. GRADIENT REFORMULATION INTO
MODIFIED TARGETS
While the gradient is well-defined and efficient computation methods exist, it remains unclear which features the DNN actually learns when trained with an SDTW loss. In contrast, when training with strongly aligned targets using standard element-wise loss functions such as MSE or BCE, the optimization process is more transparent. For instance, given the gradients:

$$\frac{\partial\, c_{\mathrm{MSE}}(x, y)}{\partial x} = x - y \tag{9}$$

$$\frac{\partial\, c_{\mathrm{BCE}}(x, y)}{\partial x} = -\frac{y}{x} + \frac{1 - y}{1 - x}, \tag{10}$$

it is evident that the network parameters are adjusted to bring the predictions $x$ close to the targets $y$. However, in the case of SDTW, the alignment process introduces additional complexity, making it less intuitive to interpret what the network is optimizing towards.
4.1 Derivation for MSE and BCE Cost
Next, we reformulate the SDTW gradient from (8) into the standard element-wise loss functions defined in (9) and (10). We then demonstrate the equivalence of the modified target representations for MSE and BCE, offering a unified perspective on SDTW-based training.
4.1.1 MSE as local cost
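For $c = c_{\mathrm{MSE}}$, the gradient of the SDTW cost w.r.t. the DNN predictions is given by:

$$\frac{\partial\, \mathrm{SDTW}(C)}{\partial x_n} = \sum_{m=1}^{M} H(n, m) \cdot (x_n - y_m), \tag{11}$$

which follows from (8) and (9). Our goal is to reformulate this expression into the standard form of (9). We introduce the row sum of the gradient matrix:

$$h(n) := \sum_{m=1}^{M} H(n, m) \in \mathbb{R} \tag{12}$$

and define the row-normalized gradient matrix $\tilde{H} \in \mathbb{R}^{N \times M}$ as:

$$\tilde{H}(n, m) := H(n, m) / h(n). \tag{13}$$

Using these definitions, we can rewrite the gradient from (11) as:

$$\begin{aligned}
\frac{\partial\, \mathrm{SDTW}(C)}{\partial x_n}
&= h(n) \cdot x_n - \sum_{m=1}^{M} H(n, m)\, y_m \\
&= h(n) \cdot \Bigl( x_n - \sum_{m=1}^{M} \tilde{H}(n, m)\, y_m \Bigr) \\
&= h(n) \cdot \bigl( x_n - y^{\mathrm{mod}}_n \bigr),
\end{aligned} \tag{14}$$

where we define the modified targets:

$$y^{\mathrm{mod}}_n := \sum_{m=1}^{M} \tilde{H}(n, m) \cdot y_m \in \mathbb{R}^D. \tag{15}$$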
Thus, the network parameters are updated such that the predictions $x_n$ move closer to the modified targets $y^{\mathrm{mod}}_n$. The update magnitude is determined by $h(n)$.
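
The identity between Eqs. (11) and (14) is purely algebraic, so it can be checked numerically for any non-negative matrix $H$ with positive row sums. The sketch below uses a random stand-in for $H$ rather than an actual SDTW gradient; all array names and sizes are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, D_feat = 6, 3, 4                      # hypothetical sequence sizes
X = rng.random((N, D_feat))                 # predictions x_n
Y = rng.integers(0, 2, size=(M, D_feat)).astype(float)  # binary weak targets y_m
H = rng.random((N, M)) + 0.01               # stand-in for the soft alignment matrix

h = H.sum(axis=1, keepdims=True)            # Eq. (12): row sums h(n)
H_tilde = H / h                             # Eq. (13): row-normalised matrix
Y_mod = H_tilde @ Y                         # Eq. (15): modified targets

# Eq. (11): sum over m of H(n, m) * (x_n - y_m)
grad_sdtw = np.einsum("nm,nmd->nd", H, X[:, None, :] - Y[None, :, :])
# Eq. (14): h(n) * (x_n - y_mod_n)
grad_mod = h * (X - Y_mod)

print("max abs difference:", np.abs(grad_sdtw - grad_mod).max())
```

Because each row of $\tilde{H}$ sums to one, every modified target is a convex combination of the weak targets and therefore stays within $[0, 1]$ for binary $Y$.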
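4.1.2 BCE as local cost
A similar reformulation applies to the SDTW gradient when using BCE as the local cost function. Starting with the gradient:

$$\frac{\partial\, \mathrm{SDTW}(C)}{\partial x_n} = \sum_{m=1}^{M} H(n, m) \cdot \Bigl( -\frac{y_m}{x_n} + \frac{1 - y_m}{1 - x_n} \Bigr), \tag{16}$$

we rewrite this expression in the form of (10) as follows:

$$\begin{aligned}
\frac{\partial\, \mathrm{SDTW}(C)}{\partial x_n}
&= -\frac{\sum_m H(n, m)\, y_m}{x_n} + \frac{\sum_m H(n, m) - \sum_m H(n, m)\, y_m}{1 - x_n} \\
&= h(n) \cdot \Bigl( -\frac{\sum_m \tilde{H}(n, m)\, y_m}{x_n} + \frac{1 - \sum_m \tilde{H}(n, m)\, y_m}{1 - x_n} \Bigr) \\
&= h(n) \cdot \Bigl( -\frac{y^{\mathrm{mod}}_n}{x_n} + \frac{1 - y^{\mathrm{mod}}_n}{1 - x_n} \Bigr).
\end{aligned} \tag{17}$$

Notably, this reformulation uses the same modified targets $y^{\mathrm{mod}}_n$ and weighting factor $h(n)$ as in the MSE case. These modified targets can be visualized or sonified alongside the original signal, providing an intuitive way to analyze the SDTW training process. Importantly, this approach does not require knowledge of the strong reference targets, making it highly suitable for analyzing training behavior in real-world scenarios where only weakly aligned data is available.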
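
The BCE case of Eqs. (16) and (17) can be checked in the same way. As before, this is a sketch under stated assumptions: $H$ is a random non-negative stand-in for the soft alignment matrix, and the helper names `modified_targets` and `bce_grad` are hypothetical.

```python
import numpy as np

def modified_targets(H, Y):
    """Return the row sums h(n) (Eq. 12) and modified targets (Eq. 15)."""
    h = H.sum(axis=1, keepdims=True)
    return h, (H / h) @ Y

def bce_grad(x, y):
    """Eq. (10): element-wise gradient of the BCE cost w.r.t. x."""
    return -y / x + (1.0 - y) / (1.0 - x)

rng = np.random.default_rng(1)
N, M, D_feat = 5, 3, 4                                   # hypothetical sizes
X = rng.uniform(0.05, 0.95, size=(N, D_feat))            # predictions in (0, 1)
Y = rng.integers(0, 2, size=(M, D_feat)).astype(float)   # binary weak targets
H = rng.random((N, M)) + 0.01                            # stand-in for H

h, Y_mod = modified_targets(H, Y)
# Eq. (16): alignment-weighted sum of BCE gradients towards each weak target
grad_sdtw = sum(H[:, [m]] * bce_grad(X, Y[m]) for m in range(M))
# Eq. (17): a single BCE gradient towards the modified targets, scaled by h(n)
grad_mod = h * bce_grad(X, Y_mod)

print("max abs difference:", np.abs(grad_sdtw - grad_mod).max())
```

The same `Y_mod` serves both the MSE and the BCE case, which is why a single visualization of the modified targets is informative for either local cost.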
4.2 Properties of the Modified Targets and Magnitude Decay
In this section, we analyze the theoretical properties of the modified targets and illustrate them using a synthetic example (Figure 2). Specifically, we investigate how the SDTW loss influences blurring and magnitude variations when predicting short and long features. To demonstrate these effects, we construct a sequence $Y$ consisting of six binary feature vectors $y \in \{0, 1\}^{12}$, three of which are all-zero (see Figure 2a). Next, we construct a sequence $X$ (as seen in Figure 2b), by repeating the non-zero elements of $Y$ two, one, and five times, respectively, and repeating the all-zero elements three times each. Note that in the case of classical DTW with hard alignments, in our example $X$ and $Y$ could be aligned with zero cost. We are interested in the alignment of these sequences under SDTW loss and thus compute the forward and backward passes with uniform step weights $w_h = w_v = w_d = 1$ and $\gamma = 1$ as described in [14]. We then derive the alignment matrix $\tilde{H}$ and compute the modified targets $Y^{\mathrm{mod}}$, displaying the results in Figure 2c and Figure 2d, respectively. Ideally, as we chose $X$ to be an unfolded version of $Y$ without additional noise, $Y^{\mathrm{mod}}$ should closely resemble $X$.

Figure 2: Illustration of magnitude decay in modified targets. (a) Weakly aligned target sequence $Y$. (b) Predicted sequence $X$, which corresponds to a perfectly aligned and unfolded version of $Y$. (c) Alignment matrix $\tilde{H}^\top$. (d) Modified targets $Y^{\mathrm{mod}}$, showing significant variations in magnitude. A short feature instance experiencing magnitude decay is highlighted in red, while a long feature instance with near-complete magnitude preservation is highlighted in blue. (Axes: feature (d), weak target index (m), and frame index (n); color scale from 0 to 1.)

Due to the relatively high softmin temperature $\gamma = 1$, we observe significant temporal blurring in both the modified targets and the alignment matrix. By definition, the row sums of the alignment matrix $\tilde{H}$ are normalized, i.e., $\sum_{m=1}^{M} \tilde{H}(n, m) = 1$ for all $n \in \{1, \ldots, N\}$. This normalization ensures that each predicted frame is assigned a convex combination of weak target frames. However, when targets are aligned to only a short duration (e.g., staccato notes), the temporal blurring from neighboring frames overlaps with the actual target, leading to a significant reduction in its magnitude after normalization by $h(n)$. This effect is illustrated in Figure 2, where the frame marked in red shows how short events are particularly affected. Conversely, for targets aligned over a longer duration (e.g., sustained notes), the influence of temporal blurring is primarily limited to the onset and offset regions, leaving the central frames mostly unaffected. This phenomenon is visible in the section marked in blue in Figure 2, where the corresponding frame in $Y^{\mathrm{mod}}$ retains a relatively high magnitude. Consequently, with a higher softmin temperature $\gamma$, SDTW introduces a pronounced magnitude imbalance: short events tend to have reduced magnitudes, while the central frames of long events maintain higher magnitudes.
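
A synthetic experiment in the spirit of Figure 2 can be sketched as follows. Several details are assumptions, since the text does not specify them: the notes are one-hot vectors at arbitrary pitches, notes and silence alternate starting with a note, the local cost is MSE, and the border convention of the recursion is $D(0,0) = 0$ with infinite cost elsewhere. The function `soft_alignment` is a hypothetical helper for uniform step weights.

```python
import numpy as np

def soft_alignment(C, gamma=1.0):
    """Soft alignment matrix H = dSDTW/dC for uniform step weights."""
    N, M = C.shape
    D = np.full((N + 2, M + 2), np.inf)
    D[0, 0] = 0.0
    P = np.zeros((N + 2, M + 2, 3))        # softmin weights per step type
    for n in range(1, N + 1):
        for m in range(1, M + 1):
            a = C[n - 1, m - 1] + np.array(
                [D[n - 1, m], D[n, m - 1], D[n - 1, m - 1]])
            e = np.exp(-(a - a.min()) / gamma)
            D[n, m] = a.min() - gamma * np.log(e.sum())
            P[n, m] = e / e.sum()
    E = np.zeros((N + 2, M + 2))
    E[N, M] = 1.0
    for n in range(N, 0, -1):              # backward pass in reverse order
        for m in range(M, 0, -1):
            if (n, m) != (N, M):
                E[n, m] = (E[n + 1, m] * P[n + 1, m, 0]
                           + E[n, m + 1] * P[n, m + 1, 1]
                           + E[n + 1, m + 1] * P[n + 1, m + 1, 2])
    return E[1:N + 1, 1:M + 1]

# Assumed layout: notes at pitches 2, 5, 9, each followed by an all-zero vector.
Y = np.zeros((6, 12))
Y[0, 2] = Y[2, 5] = Y[4, 9] = 1.0
repeats = [2, 3, 1, 3, 5, 3]               # notes: 2, 1, 5 frames; silence: 3 each
X = np.repeat(Y, repeats, axis=0)          # unfolded sequence with 17 frames

C = 0.5 * ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # MSE local cost
H = soft_alignment(C, gamma=1.0)
H_tilde = H / H.sum(axis=1, keepdims=True)
Y_mod = H_tilde @ Y

print("peak of short note (frame 5): ", Y_mod[5, 5])
print("centre of long note (frame 11):", Y_mod[11, 9])
```

Comparing the two printed magnitudes for different values of `gamma` gives a quick way to explore the magnitude imbalance described above; for very small `gamma` the modified targets approach `X` itself.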
5. EVALUATION
In this section, we examine modified targets in a DNN training scenario. To ensure clarity and intuitive accessibility, we select MPE as a case study, a straightforward yet relevant task well-suited for providing insights into the model behavior. MPE is particularly appropriate due to its diverse weakly labeled datasets, consisting of score–audio pairs, and its broad range of target durations, from short staccato notes to sustained ones. Additionally, it provides an intuitive framework for visualization and analysis. Rather than aiming to advance the state of the art in MPE or report performance benchmarks, our objective is to demonstrate how early-stage inspection of modified targets can yield valuable insights into the learning process and serve as a reliable predictor of final model performance. Concluding this section, we integrate theoretical findings from the previous section with empirical insights from our MPE experiment to formulate practical recommendations for training state-of-the-art MIR models with SDTW. Sonifications of all presented examples and further links to PyTorch implementations are available at our website. 1
5.1 Experimental Setup
As an example architecture for MPE, we adopt a single convolutional stack from the Python implementation of the Onsets and Frames model [15]. The stack processes a Mel spectrogram as input and consists of three convolutional layers with batch normalization, max pooling, and dropout, followed by two fully connected layers with sigmoid activation. In total, the model comprises approximately 4.3 million trainable parameters. We choose the Onsets and Frames model as our basis due to its widespread use in the literature and its proven effectiveness for transcription tasks. However, we simplify the architecture by using only a single stack, reducing interdependencies between multiple stacks present in the original model. This modification not only decreases the model size but also enhances interpretability by removing recurrent neural networks from the pipeline.
We pre-train the model on strongly aligned data from the MAESTRO [1] dataset for 100000 steps with audio of 20 s length and a batch size of 8, BCE loss, Adam optimizer [16] with an initial learning rate of $6 \cdot 10^{-4}$ and a reduction of the learning rate by a factor of 0.98 every 10000 steps, and gradient clipping.
We fine-tune the model using weakly aligned data from the Beethoven Piano Sonata Dataset (BPSD) [17]. For this, we automatically generate training samples by pairing multi-pitch labels with corresponding audio segments, each spanning 8 measures. If a segment exceeds 20 seconds in duration, it is excluded from training due to hardware memory constraints. The segments are grouped into batches of size 8. For optimization, we use the weighted SDTW loss [14, 18] with BCE as the local cost function. The model is trained for 5000 steps using the Adam optimizer [16] with a learning rate of $10^{-3}$. We evaluate performance on a test set from the BPSD with versions that were not included in the training. The pretrained model achieves an F-measure of 0.60 on the test set.

1 https://audiolabs-erlangen.de/resources/MIR/2025_ZeitlerM_softDTW_modTargets_ISMIR

Figure 3: Musical score of running example.

Figure 4: Piano roll representation of the running example. (a) Strongly aligned reference targets. (b) Predictions of pretrained model. (Axes: MIDI pitch over frame index (n); color scale from 0 to 1.)

The weak targets $Y$ are derived by removing all repetitions from the strongly aligned target sequence $Y^{\mathrm{strong}}$, which is provided in the BPSD. For additional details on the network architecture, we refer to [15], and for a comprehensive explanation of the weighted SDTW loss along with a Python implementation, we refer to [14].
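
Deriving weak targets from strong ones can be sketched in a few lines. The sketch assumes that “removing all repetitions” means collapsing runs of consecutive identical frames into a single vector; the function name `remove_repetitions` and the toy sequence are hypothetical.

```python
import numpy as np

def remove_repetitions(Y_strong):
    """Collapse runs of consecutive identical frames into one vector each."""
    Y_strong = np.asarray(Y_strong)
    keep = np.ones(len(Y_strong), dtype=bool)
    # Keep a frame only if it differs from its predecessor in any feature.
    keep[1:] = np.any(Y_strong[1:] != Y_strong[:-1], axis=1)
    return Y_strong[keep]

# Toy strong targets with two features: note A (2 frames), silence, note B (3 frames).
Y_strong = np.array([[1, 0], [1, 0], [0, 0], [0, 1], [0, 1], [0, 1]])
Y_weak = remove_repetitions(Y_strong)
print(Y_weak)
```

The weak sequence keeps the order of events but discards their durations, which is the information SDTW has to recover during training.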
5.2 Analyzing Modified Target Representations
We now analyze the modified targets for both the pretrained model and the final predictions after fine-tuning with the SDTW loss over 5000 training steps. Our goal is to examine how the predictions of the fine-tuned model align with the modified targets obtained from the pretrained model. For this analysis, we use an excerpt from the first movement of Beethoven’s second piano sonata (Op. 2 No. 2), as shown in Figure 3. The corresponding reference targets (strongly aligned) and the pretrained model’s predictions for a performance by Alfred Brendel (1996) are illustrated in Figure 4a and Figure 4b, respectively. This excerpt features a transition from a staccato passage to a legato section with sustained notes, making it a representative test case for illustrating how SDTW handles variations in note duration.
For our experiments, we use step weights of $w_h = 0.1$, $w_v = 1$, and $w_d = 1$, reducing the weight for horizontal steps (target repetition) to enhance robustness against prediction outliers [14]. We vary the softmin temperature $\gamma \in \{0.1, 1.0, 10.0\}$ and present the results for the modified training targets of the pretrained model and the predictions of the fine-tuned model in Figure 5.

Figure 5: Visualization of modified targets (left, $Y^{\mathrm{mod}}$ for iteration 0) and model predictions (right, final prediction) for different SDTW configurations. (a) $\gamma = 0.1$. (b) $\gamma = 1$. (c) $\gamma = 10$. (Axes: MIDI pitch over frame index (n); color scale from 0 to 1.)

For $\gamma = 0.1$ (Figure 5a), the modified targets align closely with the reference targets from Figure 4, showing only slight blurring. This leads to final predictions that capture all notes with relatively high and consistent magnitude across detected events, resulting in a test F-measure of 0.67. For $\gamma = 1.0$ (Figure 5b), the SDTW loss causes slight blurring at the note onsets and offsets of the modified targets, accompanied by a reduced magnitude of notes in the staccato movement. As the predictions can never get better than what is given by the training targets, the predicted notes of the fine-tuned model also show stronger blurring and slightly lower magnitudes than for $\gamma = 0.1$. The test F-measure drops slightly to 0.64. With $\gamma = 10.0$ (Figure 5c), note events in the modified targets become even more blurred. For all but the longest notes, magnitudes drop considerably, falling below 0.5 in the staccato movement. The fine-tuned model’s predictions closely follow this pattern, showing a pronounced magnitude reduction for most notes, with a test F-measure of only 0.36. Notably, the detected staccato notes fall below 0.5, which is problematic for post-processing tasks that often discard events under this threshold.
5.3 Practical Implications of SDTW Reformulation
In this section, we outline some practical implications of the proposed reformulation of SDTW when training DNNs from scratch. Previous studies [13, 19] have observed that training a DNN from a poor initialization requires a relatively high softmin temperature parameter $\gamma$. On the one hand, when $\gamma \to 0$, SDTW alignments often degenerate, leading to a collapse in model training. On the other hand, for sufficiently high $\gamma$, the network is exposed to a weighted combination of multiple alignments, which facilitates successful training initialization.
Given that a high $\gamma$ is necessary to initialize DNN training with weak targets, we now examine its implications in the context of a common transcription scenario such as Onsets and Frames [15]. Since transcription models aim to predict discrete events (e.g., symbolic note information), a common post-processing step involves applying a detection threshold to the raw network output, treating events above the threshold as active. One direct consequence of the temporal blurring induced by high $\gamma$ is that predictions fade in and out before and after the actual event. If the magnitude within these fading regions lies above the detection threshold, the thresholded predictions extend before and after the actual event, effectively widening the detected temporal span. In time-sensitive tasks such as onset estimation, this can lead to an undesirable temporal shift in post-processed predictions.
A second observed effect, both theoretically derived and empirically demonstrated, is a decay in magnitude for short events. In onset estimation, where events are inherently short, this decay can be particularly problematic. If the magnitude of short events falls below the detection threshold, these events may be lost entirely after post-processing. This issue is especially critical in models like Onsets and Frames, where note activation is conditioned on a preceding onset [15].
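
Both effects of thresholding can be reproduced with a toy example. The activation curves below are hand-made illustrations, not model outputs, and the threshold of 0.5 is an assumed setting matching the value discussed in Section 5.2; `active_frames` is a hypothetical helper.

```python
import numpy as np

def active_frames(activation, threshold=0.5):
    """Indices of frames whose activation exceeds the detection threshold."""
    return np.flatnonzero(np.asarray(activation) > threshold)

# Long event: truly active in frames 3..6, blurred version fades in and out.
long_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 0, 0], dtype=float)
long_blur = np.array([0.0, 0.3, 0.6, 0.9, 1.0, 1.0, 0.9, 0.6, 0.3, 0.0])

# Short event: truly active in frame 2 only, blurred version has decayed.
short_true = np.array([0, 0, 1, 0, 0], dtype=float)
short_blur = np.array([0.1, 0.3, 0.4, 0.3, 0.1])

print("long event :", active_frames(long_true), "->", active_frames(long_blur))
print("short event:", active_frames(short_true), "->", active_frames(short_blur))
```

In this toy case the long event is detected one frame early and ends one frame late, while the short event disappears entirely after thresholding.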
Based on these conceptual findings and in line with previous work [13], we offer the following recommendations for using SDTW in DNN training scenarios with poor initialization:
1. High $\gamma$ for initialization: Begin training with a relatively high softmin temperature $\gamma$ to ensure stable convergence. During this phase, post-processed performance metrics (e.g., note accuracy in transcription) may be unreliable due to temporal shifting and magnitude decay.
2. Target inspection: Despite the shortcomings of metrics after post-processing, the modified targets can be inspected at any point to verify whether the DNN is learning meaningful patterns.
3. Gradual $\gamma$ reduction: After the initialization phase, progressively lower $\gamma$ until the modified targets exhibit less temporal blurring and a balanced magnitude for both short and long events.
4. Resume training: With refined SDTW parameters, continue training to allow predictions to converge toward the improved modified targets.
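
One possible way to realise recommendations 1 and 3 is a temperature schedule that holds $\gamma$ high during an initialization phase and then lowers it gradually. The paper does not prescribe a specific schedule; the geometric decay, the function name `gamma_schedule`, and all default values below are assumptions chosen purely for illustration. In practice, the decision when to stop lowering $\gamma$ would be guided by inspecting the modified targets (recommendation 2).

```python
def gamma_schedule(step, gamma_init=10.0, gamma_final=0.1,
                   warmup_steps=1000, decay_steps=4000):
    """Hypothetical softmin temperature schedule.

    Holds gamma_init during the initialization phase, then decays
    geometrically to gamma_final over decay_steps training steps.
    """
    if step < warmup_steps:
        return gamma_init
    t = min((step - warmup_steps) / decay_steps, 1.0)
    return gamma_init * (gamma_final / gamma_init) ** t

for step in (0, 1000, 3000, 5000, 6000):
    print(step, gamma_schedule(step))
```

A geometric rather than linear decay is used here so that the temperature spends comparable time in each order of magnitude, but other schedules would serve the same purpose.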
6. CONCLUSION
In this paper, we introduced a reformulation of the SDTW gradient into interpretable modified targets, which yield identical network parameter updates when used with standard element-wise loss functions. Through theoretical analysis and a controlled experiment, we demonstrated that temporal blurring and magnitude decay are inherently part of training with SDTW, even though it is not visible in the underlying weak targets. By making the training process more transparent, our approach provides researchers and practitioners with deeper insights into SDTW-based learning and offers an intuitive, practical method for analyzing weakly supervised training strategies.
7. ACKNOWLEDGEMENTS
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No. 500643750 (MU 2686/15-1) and Grant No. 521420645 (MU 2686/17-1). The International Audio Laboratories Erlangen are a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS.
8. REFERENCES
[1] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C. A. Huang, S. Dieleman, E. Elsen, J. H. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the MAESTRO dataset,” in Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, Louisiana, USA, 2019.
[2] M. Müller, V. Konz, W. Bogler, and V. Arifi-Müller, “Saarland music data (SMD),” in Demos and Late Breaking News of the International Society for Music Information Retrieval Conference (ISMIR), Miami, Florida, USA, 2011.
[3] V. Emiya, R. Badeau, and B. David, “Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 6, pp. 1643–1654, 2010.
[4] B. Maman and A. H. Bermano, “Unaligned supervision for automatic music transcription in the wild,” in Proceedings of the International Conference on Machine Learning (ICML), Baltimore, Maryland, USA, 2022, pp. 14 918–14 934.
[5] X. Riley, D. Edwards, and S. Dixon, “High resolution guitar transcription via domain adaptation,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seoul, South Korea, 2024, pp. 1051–1055.
[6] M. Müller, Fundamentals of Music Processing – Using Python and Jupyter Notebooks, 2nd ed. Springer Verlag, 2021.
[7] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, “Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the International Conference on Machine Learning (ICML), Pittsburgh, Pennsylvania, USA, 2006, pp. 369–376.
[8] C. Weiß and G. Peeters, “Learning multi-pitch estimation from weakly aligned score-audio pairs using a multi-label CTC loss,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, USA, 2021, pp. 121–125.
[9] M. Cuturi and M. Blondel, “Soft-DTW: a differentiable loss function for time-series,” in Proceedings of the International Conference on Machine Learning (ICML), Sydney, NSW, Australia, 2017, pp. 894–903.
[10] A. Mensch and M. Blondel, “Differentiable dynamic programming for structured prediction and attention,” in Proceedings of the International Conference on Machine Learning (ICML), Stockholmsmässan, Stockholm, Sweden, 2018, pp. 3459–3468.
[11] D. Dwibedi, Y. Aytar, J. Tompson, P. Sermanet, and A. Zisserman, “Temporal cycle-consistency learning,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 1801–1810.
[12] M. Krause, C. Weiß, and M. Müller, “Soft dynamic time warping for multi-pitch estimation and beyond,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Rhodes Island, Greece, 2023, pp. 1–5.
[13] J. Zeitler, S. Deniffel, M. Krause, and M. Müller, “Stabilizing training with soft dynamic time warping: A case study for pitch class estimation with weakly aligned targets,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Milano, Italy, 2023, pp. 433–439.
[14] J. Zeitler, M. Krause, and M. Müller, “Soft dynamic time warping with variable step weights,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Seoul, South Korea, 2024, pp. 356–360.
[15] C. Hawthorne, E. Elsen, J. Song, A. Roberts, I. Simon, C. Raffel, J. H. Engel, S. Oore, and D. Eck, “Onsets and frames: Dual-objective piano transcription,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018, pp. 50–57.
[16] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proceedings of the International Conference for Learning Representations (ICLR), San Diego, California, USA, 2015.
[17] J. Zeitler, C. Weiß, V. Arifi-Müller, and M. Müller, “BPSD: A coherent multi-version dataset for analyzing the first movements of Beethoven’s piano sonatas,” Transactions of the International Society for Music Information Retrieval (TISMIR), 2024.
[18] M. Maghoumi, E. M. Taranta, and J. LaViola, “DeepNAG: Deep non-adversarial gesture generation,” in Proceedings of the International Conference on Intelligent User Interfaces (IUI), College Station, Texas, USA, 2021, pp. 213–223.
[19] M. Krause, S. Strahl, and M. Müller, “Weakly supervised multi-pitch estimation using cross-version alignment,” in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Milano, Italy, 2023.