MAIA: AN INPAINTING-BASED APPROACH FOR MUSIC ADVERSARIAL
ATTACKS
Yuxuan Liu, Peihong Zhang, Rui Sang, Zhixin Li, Shengchen Li
Xi’an Jiaotong-Liverpool University
{yuxuan.liu2204, peihong.zhang20, rui.sang22, zhixin.li22}@student.xjtlu.edu.cn
[email protected]
ABSTRACT
Music adversarial attacks have garnered significant interest in the field of Music Information Retrieval (MIR). In this paper, we present Music Adversarial Inpainting Attack (MAIA), a novel adversarial attack framework that supports both white-box and black-box attack scenarios. MAIA begins with an importance analysis to identify critical audio segments, which are then targeted for modification. Utilizing generative inpainting models, these segments are reconstructed with guidance from the output of the attacked model, ensuring subtle and effective adversarial perturbations. We evaluate MAIA on multiple MIR tasks, demonstrating high attack success rates in both white-box and black-box settings while maintaining minimal perceptual distortion. Additionally, subjective listening tests confirm the high audio fidelity of the adversarial samples. Our findings highlight vulnerabilities in current MIR systems and emphasize the need for more robust and secure models.
1. INTRODUCTION
Music Information Retrieval (MIR) has evolved into a multifaceted research domain, underpinning applications that range from genre classification [1] and instrument recognition [2] to cover song identification [3–5] and recommendation systems [6, 7]. As MIR algorithms become increasingly prevalent in both commercial products and academic research, their reliability and robustness have come under scrutiny [8, 9]. Although adversarial vulnerabilities have been extensively studied in speech recognition [10, 11] and image classification [12, 13], the music domain remains comparatively underexplored.
Adversarial attacks in the MIR context can be broadly categorized into noise-based and semantic-based approaches. Noise-based attacks, such as the Carlini–Wagner (C&W) attack [12], introduce subtle audio distortions to mislead the model into incorrect outputs. Prinz et al. [8] extended this line of work by introducing end-to-end white-box adversarial attacks that operate directly on raw waveforms, demonstrating their effectiveness in degrading instrument classification accuracy and manipulating music recommendation systems while maintaining imperceptible perturbations. Saadatpanah et al. [9] highlighted the vulnerability of copyright detection systems to adversarial attacks, showing that small perturbations can evade robust fingerprinting systems like YouTube’s Content ID and AudioTag, raising concerns about the security of these widely used industrial tools. Additionally, Chen et al. [11] proposed the Devil’s Whisper method, focusing on leveraging psychoacoustic principles to create highly stealthy adversarial audio examples.

Yuxuan Liu and Peihong Zhang contributed equally to this work.
© Y. Liu, P. Zhang, R. Sang, Z. Li, and S. Li. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Liu, P. Zhang, R. Sang, Z. Li, and S. Li, “MAIA: An Inpainting-Based Approach for Music Adversarial Attacks”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Noise-based adversarial attacks rely on adding imperceptible perturbations to audio but often lack interpretability and fail to leverage the structure and semantics of music [14], limiting their use in scenarios requiring semantic alignment or context-sensitive manipulation. Duan et al. [15] introduced a perception-aware attack framework that reverse-engineers human perception using regression analysis, optimizing perturbations to minimize perceived deviations while maintaining attack effectiveness. This innovative integration of human perception provides a unique perspective, although its dependence on subjective evaluations could limit generalizability. Similarly, Yu et al. [16] developed SMACK, a method that perturbs prosodic features like pitch and rhythm to create semantically meaningful adversarial audio while preserving naturalness. Despite its effectiveness, the computational complexity of prosody optimization remains a challenge. Luo et al. [17] proposed a frequency-driven approach that confines perturbations to high-frequency components, ensuring imperceptibility and semantic coherence. However, its focus on high-frequency regions may limit applicability in scenarios where low-frequency components are critical.
Despite these advancements, existing approaches still face challenges in balancing attack effectiveness, musical coherence, and practical feasibility across different MIR tasks. In this paper, we propose a novel music adversarial inpainting attack (MAIA) framework that addresses these gaps. Our approach identifies crucial music segments through importance analysis and selectively reconstructs them via a generative inpainting model, ensuring subtle yet highly targeted adversarial perturbations. Unlike purely noise-based methods, MAIA’s local edits retain musical coherence while influencing classification in a white-box or black-box setting. Through comprehensive evaluations on MIR tasks such as music genre classification and cover song identification, we demonstrate that MAIA achieves state-of-the-art attack success with minimal perceptual artifacts.
The contributions of this work are threefold:
1. We propose a novel adversarial attack framework, MAIA, based on importance-driven inpainting. This framework reconstructs critical audio segments with adversarial perturbations, ensuring musical coherence while effectively misleading target models.
2. We design a black-box importance analysis method that identifies influential music segments through a coarse-to-fine query-based approach, enabling effective adversarial attacks without requiring gradient access.
3. We perform extensive objective and subjective evaluations to comprehensively benchmark MAIA's attack success rate and perceptual quality across MIR tasks.
2. MUSIC ADVERSARIAL INPAINTING ATTACK
FRAMEWORK
2.1 Importance Analysis
A key objective of adversarial attacks in Music Information Retrieval (MIR) is to introduce minimal yet effective perturbations that are hard for both detection algorithms and human listeners to notice. In practical terms, modifying only the most influential time-frequency regions can reduce the extent of injected noise, thereby decreasing perceptual artifacts. Accordingly, we focus on segments that contribute most significantly to the prediction of the model, ensuring a high attack success rate while minimizing any audible changes [18].
2.1.1 White-Box Importance: Grad-CAM
When full access to the target model parameters and architecture is available, we adopt a class activation map (CAM) [19]-based strategy to locate time-frequency regions that most heavily influence the classifier’s decision. Traditional CAM methods [19] often require replacing fully-connected layers with global pooling layers [20], thereby constraining the model architecture. However, Grad-CAM [21] generalizes CAM and does not require modifications to the classifier, making it more flexible for existing convolutional neural networks.
Intuition and Setup. Unlike purely saliency-based approaches [22], which are typically optimized to reflect human visual attention, Grad-CAM specifically captures classifier-relevant regions by propagating class-specific gradient signals back through the network [21]. Originally proposed in the image domain, Grad-CAM can be adapted to our music adversarial attack tasks by:
• Converting the raw waveform to a suitable time-frequency representation (e.g., Mel-spectrogram).
• Selecting an appropriate convolutional layer (often the final or penultimate convolutional layer) where feature maps retain meaningful spatial (or time-frequency) structure. In our experiments, for an attacked MIR model $M$, we select the layer model.layers[-1].blocks[-1].norm1 as the target for analysis. The output of this layer represents the complete, stabilized feature representation from the model’s final block just before the classification head [23].
Grad-CAM Computation. Let $\hat{y}^c$ denote the model’s predicted score (logit) for class $c$. We denote by $F^l$ the feature map activations at layer $l$, with $F^l_k$ indicating the $k$-th channel. We compute Grad-CAM as follows:
1. Gradient Extraction: Obtain the gradient of $\hat{y}^c$ with respect to $F^l_k$ and average it over all positions:
$$\alpha^c_k = \frac{1}{Z} \sum_{x,y} \frac{\partial \hat{y}^c}{\partial F^l_k(x, y)}, \tag{1}$$
where $(x, y)$ indexes the spatial/time-frequency positions and $Z$ is a normalization factor (e.g., number of spatial locations).
2. Weighted Aggregation: Multiply each feature map $F^l_k$ by its corresponding weight $\alpha^c_k$, then sum over $k$ and rectify the result:
$$M^c(x, y) = \mathrm{ReLU}\Big(\sum_k \alpha^c_k F^l_k(x, y)\Big). \tag{2}$$
3. Spatial Masking: The ReLU in Eq. (2) keeps only positive contributions, yielding $M^c$ as the final Grad-CAM heatmap. Higher intensities in $M^c(x, y)$ indicate greater relevance for predicting class $c$.
4. Mapping to Time-Frequency Regions: Once $M^c$ is computed, we map it back to the original spectrogram coordinates. We then normalize the heatmap to lie in $[0, 1]$ and select the top $p\%$ of time-frequency bins to isolate the most critical regions. We mark these high-intensity areas as the candidate adversarial zone, which we subsequently modify in our inpainting-based adversarial attack framework.
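The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `grad_cam` and the `top_percent` parameter are hypothetical, the activations and gradients are assumed to have already been captured from the chosen layer by a forward and backward pass, and the resampling of the heatmap back to spectrogram resolution is omitted.

```python
import numpy as np

def grad_cam(activations, gradients, top_percent=20.0):
    """Grad-CAM heatmap and top-p% mask for one layer (illustrative sketch).

    activations: array (K, H, W) holding F^l_k(x, y).
    gradients:   array (K, H, W) holding d y^c / d F^l_k(x, y).
    Both are assumed to come from the attacked model (not shown here).
    """
    # Eq. (1): channel weights are the gradients averaged over all positions.
    alpha = gradients.mean(axis=(1, 2))                      # shape (K,)
    # Eq. (2): weighted sum over channels followed by a ReLU.
    cam = np.maximum(np.tensordot(alpha, activations, axes=1), 0.0)
    # Normalize to [0, 1]; an all-zero map is left unchanged.
    peak = cam.max()
    if peak > 0:
        cam = cam / peak
    # Keep the top-p% bins (and only positive ones) as the candidate zone.
    threshold = np.percentile(cam, 100.0 - top_percent)
    mask = (cam >= threshold) & (cam > 0)
    return cam, mask
```

In practice the returned mask would still have to be upsampled to the Mel-spectrogram grid before it can be used as the inpainting mask.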
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
2.1.2 Black-Box Importance: Coarse-to-Fine Analysis
In scenarios where internal parameters of the target model $M$ remain unknown, we cannot rely on gradient information to locate critical segments. Instead, we propose a coarse-to-fine black-box procedure that systematically queries the model to identify the most influential portions of the audio. Let $x$ be the full music track, and let $M(x)$ denote the model’s prediction (e.g., classification probability or logit score). We assume access to a loss function $\mathcal{L}(M(x), y)$, where $y$ is the true original label.
Initial Partition. We first segment $x$ into $N$ coarse chunks,
$$S^{(0)} = \{C^{(0)}_1, C^{(0)}_2, \ldots, C^{(0)}_N\}, \tag{3}$$
where each $C^{(0)}_i$ is a non-overlapping time interval (e.g., 0.5 second). For each chunk $C^{(0)}_i$, we create a modified input $\tilde{x}_{-C^{(0)}_i}$ using a Zero-Masking procedure. To prevent spectral artifacts arising from abrupt signal changes at the chunk boundaries, we apply a Tukey window to the target segment. The window’s shape parameter is set to 0.1 to create a short, smooth taper at the segment’s edges, ensuring a continuous waveform after masking. We then compute the importance score:
$$I\big(C^{(0)}_i\big) = \frac{\mathcal{L}\big(M(\tilde{x}_{-C^{(0)}_i}), y\big) - \mathcal{L}\big(M(x), y\big)}{\mathrm{duration}\big(C^{(0)}_i\big)}. \tag{4}$$
A higher value of $I(C^{(0)}_i)$ indicates that removing $C^{(0)}_i$ leads to a larger drop in model confidence for $y$, suggesting that $C^{(0)}_i$ is more critical for the classification.
Ranking and Refinement. Next, we rank the chunks in $S^{(0)}$ by their importance measure $I(C^{(0)}_i)$ in descending order. Let $C^{(0)}_{\max}$ be the chunk with the highest score. We then refine this chunk by subdividing it into $M$ finer sub-chunks (here $M$ denotes a count, not the model):
$$S^{(1)}_{\max} = \big\{C^{(1)}_{\max,1}, C^{(1)}_{\max,2}, \ldots, C^{(1)}_{\max,M}\big\}. \tag{5}$$
For each sub-chunk $C^{(1)}_{\max,j}$, we compute an updated importance measure:
$$I\big(C^{(1)}_{\max,j}\big) = \frac{\mathcal{L}\big(M(\tilde{x}_{-C^{(1)}_{\max,j}}), y\big) - \mathcal{L}\big(M(x), y\big)}{\mathrm{duration}\big(C^{(1)}_{\max,j}\big)}, \tag{6}$$
where $\tilde{x}_{-C^{(1)}_{\max,j}}$ is the audio track with only that sub-chunk silenced.
We then replace $C^{(0)}_{\max}$ in our segmentation with its sub-chunks $C^{(1)}_{\max,j}$, thus creating a refined set of segments:
$$S^{(1)} = \big(S^{(0)} \setminus \{C^{(0)}_{\max}\}\big) \cup \big\{C^{(1)}_{\max,1}, \ldots, C^{(1)}_{\max,M}\big\}. \tag{7}$$
We can iterate this procedure by again choosing the segment with the largest updated importance and subdividing further, yielding $S^{(2)}$, $S^{(3)}$, and so on, until a desired level of granularity is reached or a query budget is exhausted.
Final Selection. Upon completing $T$ refinement rounds, we obtain a final set of segments
$$S^{(T)} = \{C^{(T)}_1, C^{(T)}_2, \ldots, C^{(T)}_K\}, \tag{8}$$
where each $C^{(T)}_i$ has a corresponding importance measure $I(C^{(T)}_i)$. We then select the top $r$ segments,
$$\{C^{(T)}_1, \ldots, C^{(T)}_r\} = \mathrm{Top}\big(I(C^{(T)}_i),\, r\big), \tag{9}$$
as our candidate adversarial zones, concentrating future perturbations on these critical regions.
Overall, our black-box importance analysis balances effectiveness and practicality, allowing us to identify precisely which audio segments have the greatest impact on the output of the attacked model without requiring knowledge of its internal parameters or gradients.
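The coarse-to-fine refinement described in this subsection can be sketched as a short loop over a dictionary of scored segments. This is an illustrative sketch under stated assumptions: `score_fn(start, end)` is a hypothetical callable returning the importance of a segment given as sample indices (one model query per call), and the parameter names `n_coarse`, `n_sub`, `rounds` and `top` are chosen here for readability.

```python
def coarse_to_fine(score_fn, total_len, n_coarse=8, n_sub=2, rounds=3, top=2):
    """Refine the most important segment for a few rounds, then rank all."""
    # Initial partition into n_coarse non-overlapping chunks.
    edges = [i * total_len // n_coarse for i in range(n_coarse + 1)]
    segments = {(a, b): score_fn(a, b) for a, b in zip(edges[:-1], edges[1:])}

    for _ in range(rounds):
        # Pick the currently most important segment.
        (a, b), _score = max(segments.items(), key=lambda kv: kv[1])
        if b - a < n_sub:
            break                      # too short to subdivide further
        # Replace it by its sub-chunks, each scored with a fresh query.
        del segments[(a, b)]
        cuts = [a + j * (b - a) // n_sub for j in range(n_sub + 1)]
        for c, d in zip(cuts[:-1], cuts[1:]):
            segments[(c, d)] = score_fn(c, d)

    # Final selection: the highest-scoring segments become candidate zones.
    ranked = sorted(segments.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:top]
```

The number of model queries is `n_coarse + rounds * n_sub` (plus one for the unmodified track), which makes the query budget easy to control.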
3. ADVERSARIAL INPAINTING
After identifying the most influential segments for the target attacked model $M$, we proceed to adversarially inpaint the top-ranked segments. Our goal is to reconstruct these critical regions in such a way that the resulting track both degrades the prediction confidence of $M$ and remains perceptually coherent to the human ear. In this section, we first describe GACELA [24], the state-of-the-art music inpainting model we leverage, and then present our adversarial inpainting strategy.
3.1 GACELA
GACELA (Generative Adversarial Context Encoder for Long Audio Inpainting) [24] is a conditional generative adversarial network (cGAN) designed specifically for reconstructing long gaps in audio signals, such as music. The architecture comprises a generator and five discriminators operating at multiple time and frequency scales. The generator, conditioned on the log-magnitude mel spectrogram of the surrounding audio context, employs convolutional encoder-decoder layers and integrates latent variables to model the multimodal nature of audio inpainting. The discriminators evaluate the plausibility of the generated gaps by considering the context and spectral coherence.
3.2 Adversarial Inpainting with Model Guidance
After identifying critical segments (Section 2.1), we employ a music inpainting model (e.g., GACELA) to reconstruct these areas while embedding adversarial perturbations guided by the target model $M$. We propose two variants of this adversarial inpainting strategy, tailored respectively to white-box and black-box settings.
3.2.1 White-Box Scenario: Loss Design and Parameter Tuning
In the white-box setting, we have access to the parameters and gradients of the target model $M$, allowing for adversarial optimization in conjunction with the inpainting model $G_\theta$. Let $x^{(k)}_{\mathrm{inp}}$ denote the inpainted audio at iteration $k$, focusing only on the masked region $m$ obtained from the importance analysis. The objective function for adversarial inpainting is defined as:
$$\mathcal{L} = \lambda_{\mathrm{rec}}\, \mathcal{L}_{\mathrm{rec}}\big(x^{(k)}_{\mathrm{inp}}, x\big) + \lambda_{\mathrm{att}}\, \mathcal{L}_{\mathrm{attack}}\big(M(x^{(k)}_{\mathrm{inp}}), y\big), \tag{10}$$
where the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ ensures that the inpainted audio maintains perceptual and contextual coherence with the original audio in the masked region. Specifically, $\mathcal{L}_{\mathrm{rec}}$ leverages the loss functions inherent to the inpainting model $G_\theta$. The adversarial loss $\mathcal{L}_{\mathrm{attack}}$ introduces adversarial perturbations to deceive the classifier $M$. For untargeted attacks, we aim to reduce the confidence of the correct label $y$:
$$\mathcal{L}_{\mathrm{attack}} = \ell\big(M(x^{(k)}_{\mathrm{inp}}), y\big), \tag{11}$$
where $\ell(\cdot)$ can be a cross-entropy-based loss, signed so that a lower value corresponds to lower confidence in $y$. Minimizing it drives the model prediction away from the correct label $y$, making the attack untargeted.
The hyperparameters $\lambda_{\mathrm{rec}}$ and $\lambda_{\mathrm{att}}$ control the trade-off between preserving audio quality and achieving high attack success rates. We perform a grid search over $\lambda_{\mathrm{rec}} \in \{0.5, 1.0, 2.0\}$ and $\lambda_{\mathrm{att}} \in \{0.5, 1.0, 2.0\}$. The optimal values are determined based on attack success rate and perceptual metrics.
The optimization is an iterative process. Each step consists of three main operations: a forward pass, a gradient-based update, and a re-inpainting stage.
Forward Pass. First, we compute the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ with the inpainting model and the attack loss $\mathcal{L}_{\mathrm{attack}}$ with the target classifier $M$.
Gradient Update. Next, the masked region of $x^{(k)}_{\mathrm{inp}}$ is updated by taking a step to minimize the total loss $\mathcal{L}$. This adversarial update is performed using the sign of the gradient:
$$x^{(k+1)}_{\mathrm{inp}} \leftarrow x^{(k)}_{\mathrm{inp}} - \alpha\, \mathrm{sign}\big(\nabla_{x_{\mathrm{inp}}} \mathcal{L}\big) \odot m, \tag{12}$$
where $\alpha$ is the step size, and the element-wise product with the mask $m$ confines the update to the target region.
Re-Inpaint. Finally, to ensure the adversarial perturbation remains locally consistent and artifact-free, we reapply the inpainting generator $G_\theta$ to the modified region. This step effectively projects the perturbed content back towards a realistic data manifold:
$$x^{(k+1)}_{\mathrm{inp}} \leftarrow x \odot (1 - m) + G_\theta\big(x^{(k+1)}_{\mathrm{inp}} \odot m,\; x \odot (1 - m)\big) \odot m. \tag{13}$$
This iterative process continues until either the maximum iteration count $N$ is reached or the attack success rate satisfies a predefined threshold. The detailed process is shown in Algorithm 1.

Algorithm 1 White-Box Adversarial Inpainting
Require: Original audio $x$, mask $m$, inpainting model $G_\theta$, classifier $M$ (white-box), iterations $N$, step size $\alpha$, weights $\lambda_{\mathrm{rec}}$, $\lambda_{\mathrm{att}}$.
Initialization:
    $x_{\mathrm{inp}} \leftarrow x \odot (1 - m) + G_\theta(x \odot (1 - m)) \odot m$
for $k = 1$ to $N$ do
    Compute Loss:
        $\mathcal{L} = \lambda_{\mathrm{rec}}\, d(x_{\mathrm{inp}}, x) + \lambda_{\mathrm{att}}\, \ell(M(x_{\mathrm{inp}}), y)$
    Gradient Update on Mask:
        $g \leftarrow \nabla_{x_{\mathrm{inp}}} \mathcal{L}$; $x_{\mathrm{inp}} \leftarrow x_{\mathrm{inp}} - \alpha\, \mathrm{sign}(g \odot m)$
    Re-Inpaint:
        $x_{\mathrm{inp}} \leftarrow x \odot (1 - m) + G_\theta(x_{\mathrm{inp}} \odot m,\; x \odot (1 - m)) \odot m$
end for
return $x_{\mathrm{inp}}$
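To make the update and re-inpainting steps concrete, here is a toy NumPy sketch of the white-box loop. Everything model-specific is a stand-in and should be read as an assumption: the classifier is a linear softmax model with weight matrix `W`, `toy_inpaint` is a simple smoothing filter used in place of GACELA, the reconstruction term is a mean squared error, and the attack term is the negative cross-entropy so that minimizing the total loss lowers the confidence in `y`.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def toy_inpaint(masked_content, context):
    """Placeholder for G_theta: lightly smooths the current estimate."""
    kernel = np.array([0.25, 0.5, 0.25])
    return np.convolve(masked_content + context, kernel, mode="same")

def white_box_attack(x, m, W, y, n_iter=10, step=0.05, lam_rec=1.0, lam_att=1.0):
    """Sign-gradient update on the masked region followed by re-inpainting."""
    onehot = np.eye(W.shape[0])[y]
    context = x * (1 - m)
    # Initialization: plain inpainting of the masked region.
    x_inp = context + toy_inpaint(np.zeros_like(x), context) * m
    for _ in range(n_iter):
        p = softmax(W @ x_inp)
        grad_rec = 2.0 * (x_inp - x) / x.size      # gradient of the MSE term
        grad_att = -(W.T @ (p - onehot))           # gradient of -cross-entropy
        g = lam_rec * grad_rec + lam_att * grad_att
        x_inp = x_inp - step * np.sign(g) * m      # masked sign-gradient step
        x_inp = context + toy_inpaint(x_inp * m, context) * m   # re-inpaint
        if np.argmax(W @ x_inp) != y:              # stop once M is fooled
            break
    return x_inp
```

With a real inpainting network and classifier, the two analytic gradients would be replaced by automatic differentiation through both models; the structure of the loop stays the same.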
3.2.2 Black-Box Scenario: Importance-Guided Adversarial Inpainting
In black-box settings, where the internal parameters and gradients of the target classifier $M$ are inaccessible, we adopt a query-based adversarial inpainting approach guided by the importance analysis (Section 2.1). This method iteratively inpaints critical music segments from highest to lowest importance until the attack succeeds. The detailed process is as follows:
1) Importance-Guided Segment Processing. Based on the importance scores obtained from prior analysis, we sort the music segments in descending order of their significance to the target attacked model prediction. We then process each segment sequentially, prioritizing those with the highest impact.
2) Adversarial Inpainting for Each Segment. For each selected segment, we perform the following steps:
1. Initialization. Utilize the pretrained music inpainting model $G_\theta$ to perform standard inpainting on the masked important region $m$, generating the initial inpainted audio:
$$x^{(0)}_{\mathrm{inp}} = x \odot (1 - m) + G_\theta\big(x \odot (1 - m)\big) \odot m. \tag{14}$$
2. Iterative Query-Based Optimization. Initialize a latent variable $z^{(0)}$ associated with the inpainting model. We employ the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [25] for gradient-free optimization to refine $z$ and enhance attack efficacy:
$$z^{(k+1)} = \mathrm{CMA\text{-}ES}\big(z^{(k)},\, \mathcal{F}(M, x^{(k)}_{\mathrm{inp}})\big), \tag{15}$$
where $\mathcal{F}(M, x_{\mathrm{inp}})$ represents the classification feedback obtained by querying $M$ with the current inpainted audio $x^{(k)}_{\mathrm{inp}}$. CMA-ES optimizes $z$ by iteratively sampling candidate latent codes, evaluating their performance based on the feedback, and updating the distribution parameters to favor more effective perturbations.
3. Candidate Generation and Evaluation. For each iteration, generate a set of candidate latent variables $\{\hat{z}\}$ by sampling from the current CMA-ES distribution. Use the inpainting model to produce corresponding audio samples $\{\hat{x}_{\mathrm{inp}}\}$:
$$\hat{x}_{\mathrm{inp}} = G_\theta\big(\hat{z},\, x \odot (1 - m)\big). \tag{16}$$
Query the target classifier $M$ with each $\hat{x}_{\mathrm{inp}}$ to obtain classification feedback (e.g., predicted label or confidence score). Evaluate the attack success based on whether $M(\hat{x}_{\mathrm{inp}}) \neq y$.
Table 1. Overall attack results on CoverHunter (CSI/SHS100K) and IDS-NMR (MGC/GTZAN) using GACELA. Higher ASR is better (untargeted); lower mAP/Accuracy is worse for the model. FAD and LSD measure perceptual distortion (lower is better). The listening test (MOS) is scored on a 5-point scale (higher is better). The first five metric columns refer to CSI (CoverHunter / SHS100K), the last five to MGC (IDS-NMR / GTZAN).

| Attack | CSI ASR ↑ | CSI mAP ↓ | CSI FAD ↓ | CSI LSD ↓ | CSI MOS ↑ | MGC ASR ↑ | MGC Acc ↓ | MGC FAD ↓ | MGC LSD ↓ | MGC MOS ↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| White-Box Attacks | | | | | | | | | | |
| PGD | 82.1% | 0.619 | 12.64 | 2.10 | 3.1 | 84.6% | 0.551 | 15.32 | 2.20 | 3.2 |
| C&W | 88.5% | 0.560 | 12.11 | 1.94 | 3.4 | 89.1% | 0.512 | 14.90 | 2.21 | 3.3 |
| MAIA-White Box | 92.8% | 0.488 | 11.25 | 1.58 | 4.0 | 93.5% | 0.466 | 13.85 | 1.94 | 3.8 |
| Black-Box Attacks | | | | | | | | | | |
| NES | 70.2% | 0.682 | 13.93 | 2.27 | 2.8 | 65.7% | 0.704 | 16.26 | 2.15 | 2.5 |
| ZOO | 74.9% | 0.639 | 13.51 | 2.12 | 3.0 | 72.4% | 0.654 | 15.90 | 2.05 | 3.0 |
| MAIA-Black Box | 80.1% | 0.594 | 12.56 | 1.90 | 3.6 | 77.9% | 0.601 | 14.68 | 1.85 | 3.3 |
4. Selection and Evolution. Based on the classification feedback, select the most promising candidates (those that most reduce the classifier's confidence in $y$, as measured by the adversarial loss $\mathcal{L}_{\mathrm{attack}}$) and then update the latent variable distribution parameters to guide future perturbations towards more effective adversarial examples.
5. Re-Inpainting for Continuity. After updating $z$, reapply the inpainting model to ensure the modified audio remains musically coherent:
$$x^{(k+1)}_{\mathrm{inp}} = x \odot (1 - m) + G_\theta\big(z^{(k+1)},\, x \odot (1 - m)\big) \odot m. \tag{17}$$
6. Termination. Continue the iterative process until the classifier $M$ is fooled (i.e., $M(x^{(k)}_{\mathrm{inp}}) \neq y$) or a maximum number of iterations is reached.
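The query loop of steps 1 to 6 can be illustrated with a deliberately simplified evolution strategy. Note the substitution: instead of full CMA-ES, the sketch samples from an isotropic Gaussian and only moves its mean towards the elite candidates, with no covariance or step-size adaptation. The callables `generate(z, context)` (a stand-in for $G_\theta$) and `query(audio)` (returning the predicted label and the confidence in $y$) are hypothetical interfaces chosen for this example.

```python
import numpy as np

def black_box_attack(x, m, generate, query, y, dim=4, pop=8, elite=2,
                     sigma=0.5, max_queries=200, seed=0):
    """Query-only search over the inpainting latent z (simplified ES)."""
    rng = np.random.default_rng(seed)
    context = x * (1 - m)
    mean = np.zeros(dim)                      # z^(0)
    queries = 0
    while queries + pop <= max_queries:
        # Sample candidate latents around the current mean.
        candidates = mean + sigma * rng.standard_normal((pop, dim))
        scored = []
        for z in candidates:
            x_hat = context + generate(z, context) * m   # inpaint the gap
            label, conf_y = query(x_hat)                 # one model query
            queries += 1
            if label != y:                               # classifier fooled
                return x_hat, True, queries
            scored.append((conf_y, z))
        # Selection: keep candidates with the lowest confidence in y,
        # then move the sampling mean towards them.
        scored.sort(key=lambda t: t[0])
        mean = np.mean([z for _, z in scored[:elite]], axis=0)
    return context + generate(mean, context) * m, False, queries
```

A full CMA-ES implementation would additionally adapt the covariance matrix and step size from the same ranking, which typically reduces the number of queries needed.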
4. EXPERIMENTS
In this section, we evaluate our proposed Music Adversarial Inpainting Attack (MAIA) across two representative MIR tasks: Cover Song Identification (CSI) and Music Genre Classification (MGC). Our experiments assess both the white-box and black-box variants of MAIA, comparing them against common baselines by evaluating their performance using both subjective and objective metrics.
4.1 Target Model and Datasets
4.1.1 Cover Song Identification (CSI)
We adopt the pre-trained CoverHunter model as our target for cover song identification, following the procedure in [26]. Experiments are conducted on the SHS100K dataset [27] test set.
4.1.2 Music Genre Classification (MGC)
We use the IDS-NMR network [1] on the GTZAN dataset [28] for genre classification.
4.2 Evaluation Metrics
We report the following classes of metrics:
Attack Success Rate (ASR): The fraction of test samples successfully misclassified by the target model in an untargeted setting.
System Performance Degradation: For CSI, we report the post-attack mAP of CoverHunter; for MGC, we report the post-attack accuracy of IDS-NMR.
FAD (Fréchet Audio Distance based on MERT): We further incorporate pre-trained MERT-V0 [29] as a feature extractor to compute the Fréchet Audio Distance (FAD) [30] on adversarially perturbed tracks. By comparing the extracted feature distributions of original and attacked audio, we gain an additional objective measure of perceptual distance.
LSD (Log-Spectral Distance) [31]: Evaluates the frame-wise spectral difference between original and perturbed signals.
Perceptual Similarity (Subjective): A listening test with 100 participants to judge how easily adversarial perturbations can be detected. Each participant is asked to rate on a 5-point scale: 1 (highly noticeable) to 5 (no perceivable difference).
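Two of these metrics are simple enough to sketch directly. The code below shows one common formulation of the log-spectral distance (root-mean-square difference of log10 power spectra per frame, averaged over frames) and the attack success rate as a plain misclassification fraction. The exact LSD variant, the frame parameters `n_fft` and `hop`, and the function names are assumptions made for this illustration; both signals are assumed to have the same length, at least `n_fft` samples.

```python
import numpy as np

def log_spectral_distance(ref, est, n_fft=512, hop=256, eps=1e-10):
    """Frame-wise RMS difference of log10 power spectra, averaged over frames."""
    window = np.hanning(n_fft)

    def log_power(sig):
        frames = [sig[i:i + n_fft] * window
                  for i in range(0, len(sig) - n_fft + 1, hop)]
        spec = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
        return np.log10(spec + eps)      # eps avoids log(0) on silent bins

    diff = log_power(ref) - log_power(est)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=1))))

def attack_success_rate(true_labels, adv_predictions):
    """Fraction of samples whose adversarial prediction differs from the label."""
    true_labels = np.asarray(true_labels)
    adv_predictions = np.asarray(adv_predictions)
    return float(np.mean(adv_predictions != true_labels))
```

Identical signals give an LSD of exactly zero, and any change to the magnitude spectrum yields a positive value.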
4.3 Attack Baselines
We compare MAIA against typical white-box and black-box adversarial methods tailored to audio:
PGD (Projected Gradient Descent) [32] [White-Box]
C&W (Carlini & Wagner) [12] [White-Box]
NES (Natural Evolution Strategies) [33] [Black-Box]
ZOO (Zeroth Order Optimization Attack) [34] [Black-Box]
4.4 Implementation Details
In all experiments, we employed GACELA as the inpainting model to ensure consistent re-generation of targeted music segments; we set the maximum iteration to 10 for white-box methods and capped the query budget at 1000 for black-box methods. We tuned $\lambda_{\mathrm{rec}}$ and $\lambda_{\mathrm{att}}$ by grid search, choosing values that balanced attack success rate (ASR) and perceptual fidelity. Table 1 presents the combined results for both CSI (CoverHunter on SHS100K) and MGC (IDS-NMR on GTZAN) under white-box and black-box attacks.
4.5 Results
Table 1 demonstrates that our proposed MAIA-White Box consistently outperforms standard white-box attack baselines (PGD and C&W) across both MIR tasks. Specifically, MAIA-WB achieves the highest Attack Success Rate (92.8% for CSI and 93.5% for MGC), significantly reducing the mean Average Precision (mAP) from 0.845 to 0.488 in CoverHunter and classification accuracy from 0.828 to 0.466 in IDS-NMR. Additionally, MAIA-WB maintains superior perceptual quality with lower Fréchet Audio Distance (FAD) and Log-Spectral Distance (LSD) scores, and higher Listening Test ratings (4.0 for CSI and 3.8 for MGC), indicating that the adversarial perturbations remain largely imperceptible to human listeners. In the black-box scenario, MAIA-Black Box similarly outperforms NES and ZOO, achieving ASRs of 80.1% for CSI and 77.9% for MGC, with corresponding reductions in mAP and accuracy to 0.594 and 0.601, respectively. MAIA-BB also exhibits lower FAD and LSD scores compared to black-box baselines, and higher Listening Test ratings (3.6 for CSI and 3.3 for MGC), suggesting that our importance-guided adversarial inpainting approach effectively balances attack potency with audio fidelity. Overall, MAIA variants consistently deliver higher attack success rates and greater performance degradation while preserving perceptual quality better than existing attack methods.
5. CONCLUSIONS
We have presented MAIA, a Music Adversarial Inpainting Attack framework that employs importance-driven segment selection and inpainting-based perturbations in both white-box and black-box settings. By focusing on the most influential regions, MAIA achieves higher attack success rates against CoverHunter (for cover song identification) and IDS-NMR (for genre classification), while preserving audio fidelity as measured by objective (FAD, LSD) and subjective (listening scores) metrics. We believe that our findings highlight both the potential severity and the subtlety of adversarial threats in MIR. By demonstrating a novel inpainting-based approach, we emphasize the need for comprehensive, perception-aware defenses to ensure robust and trustworthy music-related services.
6. ACKNOWLEDGEMENTS
This work was supported by the Jiangsu Science and Technology Programme (Major Special Programme, Grant No. BG2024027), the Suzhou Science and Technology Development Planning Programme (Gusu Innovation and Entrepreneurship Leading Talents Program, Grant No. ZXL2022472), and the XJTLU Research Development Fund (Grant No. RDF-22-02-046).
7. REFERENCES
[1] Y.-N. Hung, C.-H. H. Yang, P.-Y. Chen, and A. Lerch, “Low-resource music genre classification with cross-modal neural model reprogramming,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[2] A. Solanki and S. Pandey, “Music instrument recognition using deep convolutional neural networks,” International Journal of Information Technology, vol. 14, no. 3, pp. 1659–1668, 2022.
[3] X. Du, Z. Yu, B. Zhu, X. Chen, and Z. Ma, “Bytecover: Cover song identification via multi-loss training,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 551–555.
[4] X. Du, K. Chen, Z. Wang, B. Zhu, and Z. Ma, “Bytecover2: Towards dimensionality reduction of latent embedding for efficient cover song identification,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 616–620.
[5] X. Du, Z. Wang, X. Liang, H. Liang, B. Zhu, and Z. Ma, “Bytecover3: Accurate cover song identification on short queries,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[6] V. Moscato, A. Picariello, and G. Sperli, “An emotional recommender system for music,” IEEE Intelligent Systems, vol. 36, no. 5, pp. 57–68, 2020.
[7] D. Afchar, A. Melchiorre, M. Schedl, R. Hennequin, E. Epure, and M. Moussallam, “Explainability in music recommender systems,” AI Magazine, vol. 43, no. 2, pp. 190–208, 2022.
[8] K. Prinz, A. Flexer, and G. Widmer, “On end-to-end white-box adversarial attacks in music information retrieval.” Transactions of the International Society for Music Information Retrieval, vol. 4, no. 1, pp. 93–105, 2021.
[9] P. Saadatpanah, A. Shafahi, and T. Goldstein, “Adversarial attacks on copyright detection systems,” in International Conference on Machine Learning. PMLR, 2020, pp. 8307–8315.
[10] S. Wang, Z. Zhang, G. Zhu, X. Zhang, Y. Zhou, and J. Huang, “Query-efficient adversarial attack with low perturbation against end-to-end speech recognition systems,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 351–364, 2022.
[11] Y. Chen, X. Yuan, J. Zhang, Y. Zhao, S. Zhang, K. Chen, and X. Wang, “Devil’s whisper: A general approach for physical adversarial attacks against commercial black-box speech recognition devices,” in 29th USENIX Security Symposium (USENIX Security 20), 2020, pp. 2667–2684.
[12] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
[13] F. Croce and M. Hein, “Mind the box: l_1-apgd for sparse adversarial attacks on image classifiers,” in International Conference on Machine Learning. PMLR, 2021, pp. 2201–2211.
[14] C. Kereliuk, B. L. Sturm, and J. Larsen, “Deep learning and music adversaries,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2059–2071, 2015.
[15] R. Duan, Z. Qu, S. Zhao, L. Ding, Y. Liu, and Z. Lu, “Perception-aware attack: Creating adversarial music via reverse-engineering human perception,” in Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, 2022, pp. 905–919.
[16] Z. Rakamarić and M. Emmi, “SMACK: Decoupling source language details from verifier implementations,” in Computer Aided Verification: 26th International Conference, CAV 2014, Held as Part of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18-22, 2014. Proceedings 26. Springer, 2014, pp. 106–113.
[17] C. Luo, Q. Lin, W. Xie, B. Wu, J. Xie, and L. Shen, “Frequency-driven imperceptible adversarial attack on semantic similarity,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2022, pp. 15294–15303.
[18] S. Ali, T. Abuhmed, S. El-Sappagh, K. Muhammad, J. M. Alonso-Moral, R. Confalonieri, R. Guidotti, J. Del Ser, N. Díaz-Rodríguez, and F. Herrera, “Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence,” Information Fusion, vol. 99, p. 101805, 2023.
[19] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[20] M. Lin, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[21] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.
[22] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free?-weakly-supervised learning with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 685–694.
[23] J. Gildenblat and contributors, “Pytorch library for cam methods,” https://github.com/jacobgil/pytorch-grad-cam, 2021.
[24] A. Marafioti, P. Majdak, N. Holighaus, and N. Perraudin, “GACELA: A generative adversarial context encoder for long audio inpainting of music,” IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 1, pp. 120–131, 2020.
[25] K. Varelas, A. Auger, D. Brockhoff, N. Hansen, O. A. ElHara, Y. Semet, R. Kassab, and F. Barbaresco, “A comparative study of large-scale variants of cma-es,” in Parallel Problem Solving from Nature–PPSN XV: 15th International Conference, Coimbra, Portugal, September 8–12, 2018, Proceedings, Part I 15. Springer, 2018, pp. 3–15.
[26] F. Liu, D. Tuo, Y. Xu, and X. Han, “Coverhunter: Cover song identification with refined attention and alignments,” in 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2023, pp. 1080–1085.
[27] X. Xu, X. Chen, and D. Yang, “Key-invariant convolutional neural network toward efficient cover song identification,” in 2018 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2018, pp. 1–6.
[28] B. L. Sturm, “The GTZAN Dataset: Its contents, its faults, their effects on evaluation, and its future use,” arXiv preprint arXiv:1306.1461, 2013.
[29] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg, R. Liu, W. Chen, G. Xia, Y. Shi, W. Huang, Z. Wang, Y. Guo, and J. Fu, “MERT: Acoustic music understanding model with large-scale self-supervised training,” in International Conference on Learning Representations (ICLR), 2024.
[30] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet Audio Distance: A reference-free metric for evaluating music enhancement algorithms,” in Proceedings of Interspeech, 2019.
[31] A. Gray and J. Markel, “Distance measures for speech processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 380–391, 1976.
[32] Y. Deng and L. J. Karam, “Universal adversarial attack via enhanced projected gradient descent,” in 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020, pp. 1241–1245.
[33] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber, “Natural evolution strategies,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 949–980, 2014.
[34] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017, pp. 15–26.