MAIA: An Inpainting-Based Approach for Music Adversarial Attacks

Author: Yuxuan Liu; Peihong Zhang; Rui Sang; Zhixin Li; Shengchen Li

Publisher: Zenodo

DOI: 10.5281/zenodo.17706598

Source: https://zenodo.org/records/17706598/files/000094.pdf

MAIA: AN INPAINTING-BASED APPROACH FOR MUSIC ADVERSARIAL
ATTACKS
Yuxuan Liu, Peihong Zhang, Rui Sang, Zhixin Li, Shengchen Li
Xi’an Jiaotong-Liverpool University
{yuxuan.liu2204, peihong.zhang20, rui.sang22, zhixin.li22}@student.xjtlu.edu.cn
[email protected]
ABSTRACT
Music adversarial attacks have garnered significant interest in the field of Music Information Retrieval (MIR). In this paper, we present Music Adversarial Inpainting Attack (MAIA), a novel adversarial attack framework that supports both white-box and black-box attack scenarios. MAIA begins with an importance analysis to identify critical audio segments, which are then targeted for modification. Utilizing generative inpainting models, these segments are reconstructed with guidance from the output of the attacked model, ensuring subtle and effective adversarial perturbations. We evaluate MAIA on multiple MIR tasks, demonstrating high attack success rates in both white-box and black-box settings while maintaining minimal perceptual distortion. Additionally, subjective listening tests confirm the high audio fidelity of the adversarial samples. Our findings highlight vulnerabilities in current MIR systems and emphasize the need for more robust and secure models.
1. INTRODUCTION
Music Information Retrieval (MIR) has evolved into a multifaceted research domain, underpinning applications that range from genre classification [1] and instrument recognition [2] to cover song identification [3–5] and recommendation systems [6, 7]. As MIR algorithms become increasingly prevalent in both commercial products and academic research, their reliability and robustness have come under scrutiny [8, 9]. Although adversarial vulnerabilities have been extensively studied in speech recognition [10, 11] and image classification [12, 13], the music domain remains comparatively underexplored.
Adversarial attacks in the Music Information Retrieval (MIR) context can be broadly categorized into noise-based and semantic-based approaches. Noise-based attacks, such as the Carlini–Wagner (C&W) attack [12], introduce subtle audio distortions to mislead the model into incorrect outputs. Prinz et al. [8] extended this line of work by introducing end-to-end white-box adversarial attacks that operate directly on raw waveforms, demonstrating their effectiveness in degrading instrument classification accuracy and manipulating music recommendation systems while maintaining imperceptible perturbations. Saadatpanah et al. [9] highlighted the vulnerability of copyright detection systems to adversarial attacks, showing that small perturbations can evade robust fingerprinting systems like YouTube’s Content ID and AudioTag, raising concerns about the security of these widely used industrial tools. Additionally, Chen et al. [11] proposed the Devil’s Whisper method, focusing on leveraging psychoacoustic principles to create highly stealthy adversarial audio examples.
Yuxuan Liu and Peihong Zhang contributed equally to this work.
© Y. Liu, P. Zhang, R. Sang, Z. Li, and S. Li. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Y. Liu, P. Zhang, R. Sang, Z. Li, and S. Li, “MAIA: An Inpainting-Based Approach for Music Adversarial Attacks”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Noise-based adversarial attacks rely on adding imperceptible perturbations to audio but often lack interpretability and fail to leverage the structure and semantics of music [14], limiting their use in scenarios requiring semantic alignment or context-sensitive manipulation. Semantic-based approaches aim to close this gap. Duan et al. [15] introduced a perception-aware attack framework that reverse-engineers human perception using regression analysis, optimizing perturbations to minimize perceived deviations while maintaining attack effectiveness. This innovative integration of human perception provides a unique perspective, although its dependence on subjective evaluations could limit generalizability. Similarly, Yu et al. [16] developed SMACK, a method that perturbs prosodic features like pitch and rhythm to create semantically meaningful adversarial audio while preserving naturalness. Despite its effectiveness, the computational complexity of prosody optimization remains a challenge. Luo et al. [17] proposed a frequency-driven approach that confines perturbations to high-frequency components, ensuring imperceptibility and semantic coherence. However, its focus on high-frequency regions may limit applicability in scenarios where low-frequency components are critical.
Despite these advancements, existing approaches still face challenges in balancing attack effectiveness, musical coherence, and practical feasibility across different MIR tasks. In this paper, we propose a novel music adversarial inpainting attack (MAIA) framework that addresses these gaps. Our approach identifies crucial music segments through importance analysis and selectively reconstructs them via a generative inpainting model, ensuring subtle yet highly targeted adversarial perturbations. Unlike purely noise-based methods, MAIA’s local edits retain musical coherence while influencing classification in a white-box or black-box setting. Through comprehensive evaluations on MIR tasks such as music genre classification and cover song identification, we demonstrate that MAIA achieves state-of-the-art attack success with minimal perceptual artifacts.
The contributions of this work are threefold:
1. We propose a novel adversarial attack framework, MAIA, based on importance-driven inpainting. This framework reconstructs critical audio segments with adversarial perturbations, ensuring musical coherence while effectively misleading target models.
2. We design a black-box importance analysis method that identifies influential music segments through a coarse-to-fine query-based approach, enabling effective adversarial attacks without requiring gradient access.
3. We perform extensive objective and subjective evaluations to comprehensively benchmark MAIA’s attack success rate and perceptual quality across MIR tasks.
2. MUSIC ADVERSARIAL INPAINTING ATTACK
FRAMEWORK
2.1 Importance Analysis
A key objective of adversarial attacks in Music Information Retrieval (MIR) is to introduce minimal yet effective perturbations that are hard for both detection algorithms and human listeners to notice. In practical terms, modifying only the most influential time-frequency regions can reduce the extent of injected noise, thereby decreasing perceptual artifacts. Accordingly, we focus on segments that contribute most significantly to the prediction of the model, ensuring a high attack success rate while minimizing any audible changes [18].
2.1.1 White-Box Importance: Grad-CAM
When full access to the target model parameters and architecture is available, we adopt a class activation map (CAM) [19]-based strategy to locate time-frequency regions that most heavily influence the classifier’s decision. Traditional CAM methods [19] often require replacing fully-connected layers with global pooling layers [20], thereby constraining the model architecture. However, Grad-CAM [21] generalizes CAM and does not require modifications to the classifier, making it more flexible for existing convolutional neural networks.
Intuition and Setup. Unlike purely saliency-based approaches [22], which are typically optimized to reflect human visual attention, Grad-CAM specifically captures classifier-relevant regions by propagating class-specific gradient signals back through the network [21]. Originally proposed in the image domain, Grad-CAM can be adapted to our music adversarial attack tasks by:
• Converting the raw waveform to a suitable time-frequency representation (e.g., Mel-spectrogram).
• Selecting an appropriate convolutional layer (often the final or penultimate convolutional layer) where feature maps retain meaningful spatial (or time-frequency) structure. In our experiments, for an attacked MIR model $\mathcal{M}$, we select the layer model.layers[-1].blocks[-1].norm1 as the target of analysis. The output of this layer represents the complete, stabilized feature representation from the model’s final block just before the classification head [23].
Grad-CAM Computation. Let $\hat{y}^c$ denote the model’s predicted score (logit) for class $c$. We denote by $F^l$ the feature map activations at layer $l$, with $F^l_k$ indicating the $k$-th channel. We compute Grad-CAM as follows:
1. Gradient Extraction: Obtain the gradient of $\hat{y}^c$ with respect to $F^l_k$ and average it over all positions:
$$\alpha^c_k = \frac{1}{Z} \sum_{x,y} \frac{\partial \hat{y}^c}{\partial F^l_k(x, y)}, \tag{1}$$
where $(x, y)$ indexes the spatial/time-frequency positions and $Z$ is a normalization factor (e.g., number of spatial locations).
2. Weighted Aggregation: Multiply each feature map $F^l_k$ by its corresponding weight $\alpha^c_k$, then sum over $k$ to obtain the raw map:
$$M^c(x, y) = \mathrm{ReLU}\Big(\sum_k \alpha^c_k F^l_k(x, y)\Big). \tag{2}$$
3. Spatial Masking: Apply a ReLU to keep only positive contributions, generating $M^c$ as the final Grad-CAM heatmap. Higher intensities in $M^c(x, y)$ indicate greater relevance to predicting class $c$.
4. Mapping to Time-Frequency Regions: Once $M^c$ is computed, we map it back to the original spectrogram coordinates. We then normalize the heatmap to lie in $[0, 1]$ and select the top $p\%$ of time-frequency bins to isolate the most critical regions. We mark these high-intensity areas as the candidate adversarial zone, which we will subsequently modify in our inpainting-based adversarial attack framework.
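The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes the activations $F^l_k$ and the gradients $\partial \hat{y}^c / \partial F^l_k$ have already been extracted from the network (for example with a CAM library such as [23]) and are passed in as nested lists. The function names `grad_cam` and `top_percent_mask`, the toy values, and the choice $p = 34$ are all hypothetical.

```python
from typing import List

Map2D = List[List[float]]


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Eq. (1)-(2): average the gradients per channel, then ReLU the weighted sum."""
    rows, cols = len(activations[0]), len(activations[0][0])
    z = rows * cols  # normalization factor Z: number of time-frequency positions
    # Eq. (1): alpha_k is the mean gradient of channel k
    alphas = [sum(sum(row) for row in g) / z for g in gradients]
    heat = [[0.0] * cols for _ in range(rows)]
    for alpha, fmap in zip(alphas, activations):
        for i in range(rows):
            for j in range(cols):
                heat[i][j] += alpha * fmap[i][j]
    # Eq. (2): keep only positive contributions
    return [[max(0.0, v) for v in row] for row in heat]


def top_percent_mask(heat: Map2D, p: float) -> List[List[int]]:
    """Step 4: normalize to [0, 1] and mark (at least) the top p percent of bins."""
    flat = [v for row in heat for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant map
    norm = [[(v - lo) / span for v in row] for row in heat]
    n_keep = max(1, round(len(flat) * p / 100.0))
    threshold = sorted((v for row in norm for v in row), reverse=True)[n_keep - 1]
    return [[1 if v >= threshold else 0 for v in row] for row in norm]


# Toy example: 2 channels on a 2 x 3 (frequency x time) grid.
activations = [
    [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]],
    [[2.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]
gradients = [
    [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]],          # channel 0: alpha = 0.5
    [[-0.25, -0.25, -0.25], [-0.25, -0.25, -0.25]],  # channel 1: alpha = -0.25
]
heatmap = grad_cam(activations, gradients)
mask = top_percent_mask(heatmap, p=34)
print(heatmap)  # [[0.0, 1.0, 0.0], [0.0, 0.25, 1.5]]
print(mask)     # [[0, 1, 0], [0, 0, 1]]
```

Bins tied with the threshold value are all kept, so the mask can cover slightly more than $p\%$ of the bins; a stricter selection would break ties explicitly.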
2.1.2 Black-Box Importance: Coarse-to-Fine Analysis
In scenarios where internal parameters of the target model $\mathcal{M}$ remain unknown, we cannot rely on gradient information to locate critical segments. Instead, we propose a coarse-to-fine black-box procedure that systematically queries the model to identify the most influential portions of the audio. Let $x$ be the full music track, and let $\mathcal{M}(x)$ denote the model’s prediction (e.g., classification probability or logit score). We assume access to a loss function $\mathcal{L}(\mathcal{M}(x), y)$, where $y$ is the true original label.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Initial Partition. We first segment $x$ into $N$ coarse chunks,
$$S^{(0)} = \{C^{(0)}_1, C^{(0)}_2, \dots, C^{(0)}_N\}, \tag{3}$$
where each $C^{(0)}_i$ is a non-overlapping time interval (e.g., 0.5 seconds). For each chunk $C^{(0)}_i$, we create a modified input $\tilde{x}_{-C^{(0)}_i}$ using a Zero-Masking procedure. To prevent spectral artifacts arising from abrupt signal changes at the chunk boundaries, we apply a Tukey window to the target segment. The window’s shape parameter was set to 0.1 to create a short, smooth taper at the segment’s edges, ensuring a continuous waveform after masking. We then compute the importance score:
$$I\big(C^{(0)}_i\big) = \frac{\mathcal{L}\big(\mathcal{M}(\tilde{x}_{-C^{(0)}_i}), y\big) - \mathcal{L}\big(\mathcal{M}(x), y\big)}{\mathrm{duration}\big(C^{(0)}_i\big)}. \tag{4}$$
A higher value of $I(C^{(0)}_i)$ indicates that removing $C^{(0)}_i$ leads to a larger drop in model confidence for $y$, suggesting that $C^{(0)}_i$ is more critical to the classification.
Ranking and Refinement. Next, we rank the chunks in $S^{(0)}$ by their importance measure $I(C^{(0)}_i)$ in descending order. Let $C^{(0)}_{\max}$ be the chunk with the highest score. We then refine this chunk by subdividing it into $M$ finer sub-chunks:
$$S^{(1)}_{\max} = \{C^{(1)}_{\max,1}, C^{(1)}_{\max,2}, \dots, C^{(1)}_{\max,M}\}. \tag{5}$$
For each sub-chunk $C^{(1)}_{\max,j}$, we compute an updated importance measure:
$$I\big(C^{(1)}_{\max,j}\big) = \frac{\mathcal{L}\big(\mathcal{M}(\tilde{x}_{-C^{(1)}_{\max,j}}), y\big) - \mathcal{L}\big(\mathcal{M}(x), y\big)}{\mathrm{duration}\big(C^{(1)}_{\max,j}\big)}, \tag{6}$$
where $\tilde{x}_{-C^{(1)}_{\max,j}}$ is the audio track with only that sub-chunk silenced.
We then replace $C^{(0)}_{\max}$ in our segmentation with its sub-chunks $C^{(1)}_{\max,j}$, thus creating a refined set of segments:
$$S^{(1)} = \big(S^{(0)} \setminus \{C^{(0)}_{\max}\}\big) \cup \{C^{(1)}_{\max,1}, \dots, C^{(1)}_{\max,M}\}. \tag{7}$$
We can iterate this procedure by again choosing the segment with the largest updated importance and subdividing further, yielding $S^{(2)}$, $S^{(3)}$, and so on, until a desired level of granularity is reached or a query budget is exhausted.
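The refinement loop of Eqs. (5)-(7) can be sketched as follows. The sketch makes simplifying assumptions that are not taken from the paper: masking is a plain zero fill without a taper, one model query is counted per scored segment, and `toy_loss` is a hypothetical loss that depends only on samples 24 to 27. The names `importance` and `coarse_to_fine` are illustrative.

```python
from typing import Callable, Dict, List, Tuple

Segment = Tuple[int, int]  # [start, end) in samples


def importance(x: List[float], seg: Segment, loss_fn: Callable[[List[float]], float],
               base_loss: float, sample_rate: int) -> float:
    """Loss increase per second when seg is silenced (plain zero fill for brevity)."""
    start, end = seg
    masked = x[:start] + [0.0] * (end - start) + x[end:]
    return (loss_fn(masked) - base_loss) / ((end - start) / sample_rate)


def coarse_to_fine(x: List[float], loss_fn: Callable[[List[float]], float],
                   n_coarse: int, n_sub: int, rounds: int,
                   sample_rate: int, query_budget: int) -> List[Tuple[Segment, float]]:
    base_loss = loss_fn(x)
    queries = 1
    step = len(x) // n_coarse
    scores: Dict[Segment, float] = {}
    for i in range(n_coarse):  # initial partition S^(0)
        seg = (i * step, len(x) if i == n_coarse - 1 else (i + 1) * step)
        scores[seg] = importance(x, seg, loss_fn, base_loss, sample_rate)
        queries += 1
    for _ in range(rounds):
        start, end = max(scores, key=scores.get)  # chunk with the highest score
        sub_len = (end - start) // n_sub
        if sub_len == 0 or queries + n_sub > query_budget:
            break  # granularity limit or query budget reached
        del scores[(start, end)]  # Eq. (7): remove C_max ...
        for j in range(n_sub):    # ... and add its sub-chunks
            s = start + j * sub_len
            e = end if j == n_sub - 1 else s + sub_len
            scores[(s, e)] = importance(x, (s, e), loss_fn, base_loss, sample_rate)
            queries += 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


def toy_loss(x: List[float]) -> float:
    # Hypothetical loss that rises when energy in samples 24..27 is removed.
    return -sum(v * v for v in x[24:28])


track = [1.0] * 32
ranked = coarse_to_fine(track, toy_loss, n_coarse=4, n_sub=2, rounds=2,
                        sample_rate=8, query_budget=20)
print(ranked[:2])  # [((24, 26), 8.0), ((26, 28), 8.0)]
```

Two refinement rounds narrow the initial one-second chunk (24, 32) down to the two quarter-second segments that actually carry the decisive energy, using 9 queries in total.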
Final Selection. Upon completing $T$ refinement rounds, we obtain a final set of segments
$$S^{(T)} = \{C^{(T)}_1, C^{(T)}_2, \dots, C^{(T)}_K\}, \tag{8}$$
where each $C^{(T)}_i$ has a corresponding importance measure $I(C^{(T)}_i)$. We then select the top $r$ segments,
$$\{C^{(T)}_1, \dots, C^{(T)}_r\} = \mathrm{Top}\big(I(C^{(T)}_i), r\big), \tag{9}$$
as our candidate adversarial zones, concentrating future perturbations on these critical regions.
Overall, our black-box importance analysis balances effectiveness and practicality, allowing us to identify precisely which audio segments have the greatest impact on the output of the attacked model without requiring knowledge of its internal parameters or gradients.
3. ADVERSARIAL INPAINTING
After identifying the most influential segments for the target attacked model $\mathcal{M}$, we proceed to adversarially inpaint the top-ranked segments. Our goal is to reconstruct these critical regions in such a way that the resulting track both degrades the prediction confidence of $\mathcal{M}$ and remains perceptually coherent to the human ear. In this section, we first describe GACELA [24], a state-of-the-art music inpainting model, and then explain how we leverage it for adversarial inpainting.
3.1 GACELA
GACELA (Generative Adversarial Context Encoder for Long Audio Inpainting) [24] is a conditional generative adversarial network (cGAN) designed specifically for reconstructing long gaps in audio signals, such as music. The architecture comprises a generator and five discriminators operating at multiple time and frequency scales. The generator, conditioned on the log-magnitude mel spectrogram of the surrounding audio context, employs convolutional encoder-decoder layers and integrates latent variables to model the multimodal nature of audio inpainting. The discriminators evaluate the plausibility of the generated gaps by considering the context and spectral coherence.
3.2 Adversarial Inpainting with Model Guidance
After identifying critical segments (Section 2.1), we employ a music inpainting model (e.g., GACELA) to reconstruct these areas while embedding adversarial perturbations guided by the target model $\mathcal{M}$. We propose two variants of this adversarial inpainting strategy, tailored respectively to white-box and black-box settings.
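3.2.1 White-Box Scenario: Loss Design and Parameter Tuning
In the white-box setting, we have access to the parameters and gradients of the target model $\mathcal{M}$, allowing for adversarial optimization in conjunction with the inpainting model $G_\theta$. Let $x^{(k)}_{\mathrm{inp}}$ denote the inpainted audio at iteration $k$, where modifications are restricted to the masked region $m$ obtained from the importance analysis. The objective function for adversarial inpainting is defined as:
$$\mathcal{L} = \lambda_{\mathrm{rec}}\, \mathcal{L}_{\mathrm{rec}}\big(x^{(k)}_{\mathrm{inp}}, x\big) + \lambda_{\mathrm{att}}\, \mathcal{L}_{\mathrm{attack}}\big(\mathcal{M}(x^{(k)}_{\mathrm{inp}}), y\big), \tag{10}$$
where the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ ensures that the inpainted audio maintains perceptual and contextual coherence with the original audio in the masked region. Specifically, $\mathcal{L}_{\mathrm{rec}}$ leverages the loss functions inherent to the inpainting model $G_\theta$. The adversarial loss $\mathcal{L}_{\mathrm{attack}}$ introduces adversarial perturbations to deceive the classifier $\mathcal{M}$. For untargeted attacks, we aim to reduce the confidence of the correct label $y$:
$$\mathcal{L}_{\mathrm{attack}} = \ell\big(\mathcal{M}(x^{(k)}_{\mathrm{inp}}), y\big), \tag{11}$$
where $\ell(\cdot)$ can be based on the cross-entropy loss. Because the update in Algorithm 1 descends on $\mathcal{L}$, the sign of $\ell$ must be chosen so that lower values correspond to lower confidence in $y$ (for example, the negative cross-entropy). The objective then drives the model prediction away from the correct label $y$, making the attack untargeted.
The hyperparameters $\lambda_{\mathrm{rec}}$ and $\lambda_{\mathrm{att}}$ control the trade-off between preserving audio quality and achieving high attack success rates. We perform a grid search over $\lambda_{\mathrm{rec}} \in \{0.5, 1.0, 2.0\}$ and $\lambda_{\mathrm{att}} \in \{0.5, 1.0, 2.0\}$. The optimal values are determined based on attack success rate and perceptual metrics.
Algorithm 1 White-Box Adversarial Inpainting
Require: Original audio $x$, mask $m$, inpainting model $G_\theta$, classifier $\mathcal{M}$ (white-box), iterations $N$, step size $\alpha$, weights $\lambda_{\mathrm{rec}}$, $\lambda_{\mathrm{att}}$.
1: Initialization:
2: $x_{\mathrm{inp}} \leftarrow x \odot (1 - m) + G_\theta(x \odot (1 - m)) \odot m$
3: for $k = 1$ to $N$ do
4: Compute Loss:
5: $\mathcal{L} = \lambda_{\mathrm{rec}}\, d(x_{\mathrm{inp}}, x) + \lambda_{\mathrm{att}}\, \ell(\mathcal{M}(x_{\mathrm{inp}}), y)$
6: Gradient Update on Mask:
7: $g \leftarrow \nabla_{x_{\mathrm{inp}}} \mathcal{L}$; $x_{\mathrm{inp}} \leftarrow x_{\mathrm{inp}} - \alpha\, \mathrm{sign}(g \odot m)$
8: Re-Inpaint:
9: $x_{\mathrm{inp}} \leftarrow x \odot (1 - m) + G_\theta(x_{\mathrm{inp}} \odot m,\; x \odot (1 - m)) \odot m$
10: end for
11: return $x_{\mathrm{inp}}$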
The optimization is an iterative process. Each step consists of three main operations: a forward pass, a gradient-based update, and a re-inpainting stage.
Forward Pass. First, we compute the reconstruction loss $\mathcal{L}_{\mathrm{rec}}$ with the inpainting model and the attack loss $\mathcal{L}_{\mathrm{attack}}$ with the target classifier $\mathcal{M}$.
Gradient Update. Next, the masked region of $x^{(k)}_{\mathrm{inp}}$ is updated by taking a step to minimize the total loss $\mathcal{L}$. This adversarial update is performed using the sign of the gradient:
$$x^{(k+1)}_{\mathrm{inp}} \leftarrow x^{(k)}_{\mathrm{inp}} - \alpha\, \mathrm{sign}\big(\nabla_{x_{\mathrm{inp}}} \mathcal{L}\big) \odot m, \tag{12}$$
where $\alpha$ is the step size, and the element-wise product with the mask $m$ confines the update to the target region.
Re-Inpaint. Finally, to ensure the adversarial perturbation remains locally consistent and artifact-free, we reapply the inpainting generator $G_\theta$ to the modified region. This step effectively projects the perturbed content back towards a realistic data manifold:
$$x^{(k+1)}_{\mathrm{inp}} \leftarrow x \odot (1 - m) + G_\theta\big(x^{(k+1)}_{\mathrm{inp}} \odot m,\; x \odot (1 - m)\big) \odot m. \tag{13}$$
This iterative process continues until either the maximum iteration count $N$ is reached or the attack success rate satisfies a predefined threshold. The detailed process is shown in Algorithm 1.
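To make the forward pass, sign-gradient update, and re-inpainting loop concrete, the following toy sketch runs it on a 12-sample signal. Everything model-specific is a hypothetical stand-in and not the setup used in the paper: `toy_inpaint` replaces $G_\theta$ with linear interpolation across the gap, the classifier is a logistic model with an analytic gradient, the reconstruction term is a mean squared error over the masked samples, and the attack term is taken as the negative cross-entropy so that descending on the total loss lowers the confidence in $y$.

```python
import math
from typing import List, Optional


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def toy_inpaint(context: List[float], mask: List[int],
                current: Optional[List[float]] = None, pull: float = 0.2) -> List[float]:
    """Hypothetical stand-in for G_theta: interpolate across each interior gap and,
    if perturbed content is given, pull it part of the way back towards that fill."""
    out, n, i = list(context), len(context), 0
    while i < n:
        if not mask[i]:
            i += 1
            continue
        j = i
        while j < n and mask[j]:
            j += 1
        left = context[i - 1] if i > 0 else context[j]
        right = context[j] if j < n else left
        for k in range(i, j):
            t = (k - i + 1) / (j - i + 1)
            fill = (1.0 - t) * left + t * right
            out[k] = fill if current is None else (1.0 - pull) * current[k] + pull * fill
        i = j
    return out


def prob_true(x: List[float], w: List[float], b: float) -> float:
    """Toy white-box classifier: probability assigned to the true label y."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))


def white_box_attack(x, mask, w, b, n_iter=10, alpha=0.2, lam_rec=1.0, lam_att=1.0):
    context = [xi * (1 - mi) for xi, mi in zip(x, mask)]  # x ⊙ (1 - m)

    def compose(generated):  # x ⊙ (1 - m) + G ⊙ m
        return [c + g * mi for c, g, mi in zip(context, generated, mask)]

    x_inp = compose(toy_inpaint(context, mask))  # initialization
    n_masked = sum(mask)
    for _ in range(n_iter):
        p = prob_true(x_inp, w, b)  # forward pass
        if p < 0.5:                 # true label no longer predicted: stop early
            break
        # Gradient of L = lam_rec * MSE_masked(x_inp, x) - lam_att * CE(M(x_inp), y)
        grad = [lam_rec * 2.0 * (xa - xo) / n_masked + lam_att * (1.0 - p) * wi
                for xa, xo, wi in zip(x_inp, x, w)]
        x_inp = [xa - alpha * sign(g * mi) for xa, g, mi in zip(x_inp, grad, mask)]
        x_inp = compose(toy_inpaint(context, mask, current=x_inp))  # re-inpaint
    return x_inp


x = [0.5] * 12
mask = [0] * 4 + [1] * 4 + [0] * 4
w = [float(mi) for mi in mask]  # the toy classifier only looks at the masked region
adv = white_box_attack(x, mask, w, b=-1.0)
print(round(prob_true(x, w, -1.0), 3), round(prob_true(adv, w, -1.0), 3))
```

In this toy setting the confidence in the true label drops below 0.5 after two updates, while all samples outside the mask stay identical to the original signal.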
3.2.2 Black-Box Scenario: Importance-Guided Adversarial Inpainting
In black-box settings, where the internal parameters and gradients of the target classifier $\mathcal{M}$ are inaccessible, we adopt a query-based adversarial inpainting approach guided by the importance analysis (Section 2.1). This method iteratively inpaints critical music segments from highest to lowest importance until the attack succeeds. The detailed process is as follows:
1) Importance-Guided Segment Processing. Based on the importance scores obtained from prior analysis, we sort the music segments in descending order of their significance to the target attacked model prediction. We then process each segment sequentially, prioritizing those with the highest impact.
2) Adversarial Inpainting for Each Segment. For each selected segment, we perform the following steps:
1. Initialization. Utilize the pretrained music inpainting model $G_\theta$ to perform standard inpainting on the masked important region $m$, generating the initial inpainted audio:
$$x^{(0)}_{\mathrm{inp}} = x \odot (1 - m) + G_\theta(x \odot (1 - m)) \odot m. \tag{14}$$
2. Iterative Query-Based Optimization. Initialize a latent variable $z^{(0)}$ associated with the inpainting model. We employ the Covariance Matrix Adaptation Evolution Strategy (CMA-ES) [25] for gradient-free optimization to refine $z$ and enhance attack efficacy:
$$z^{(k+1)} = \text{CMA-ES}\big(z^{(k)}, F(\mathcal{M}, x^{(k)}_{\mathrm{inp}})\big), \tag{15}$$
where $F(\mathcal{M}, x_{\mathrm{inp}})$ represents the classification feedback obtained by querying $\mathcal{M}$ with the current inpainted audio $x^{(k)}_{\mathrm{inp}}$. CMA-ES optimizes $z$ by iteratively sampling candidate latent codes, evaluating their performance based on the feedback, and updating the distribution parameters to favor more effective perturbations.
3. Candidate Generation and Evaluation. For each iteration, generate a set of candidate latent variables $\{\hat{z}\}$ by sampling from the current CMA-ES distribution. Use the inpainting model to produce corresponding audio samples $\{\hat{x}_{\mathrm{inp}}\}$:
$$\hat{x}_{\mathrm{inp}} = G_\theta(\hat{z},\; x \odot (1 - m)). \tag{16}$$
Query the target classifier $\mathcal{M}$ with each $\hat{x}_{\mathrm{inp}}$ to obtain classification feedback (e.g., predicted label or confidence score). Evaluate the attack success based on whether $\mathcal{M}(\hat{x}_{\mathrm{inp}}) \neq y$.
Table 1. Overall attack results on CoverHunter (CSI/SHS100K) and IDS-NMR (MGC/GTZAN) using GACELA. Higher ASR is better (untargeted); lower mAP/Accuracy means the model is degraded more. FAD and LSD measure perceptual distortion (lower is better). The listening test (MOS) is scored on a 5-point scale (higher is better).
Attack | CSI ASR ↑ | CSI mAP ↓ | CSI FAD ↓ | CSI LSD ↓ | CSI MOS ↑ | MGC ASR ↑ | MGC Acc ↓ | MGC FAD ↓ | MGC LSD ↓ | MGC MOS ↑
White-Box Attacks
PGD | 82.1% | 0.619 | 12.64 | 2.10 | 3.1 | 84.6% | 0.551 | 15.32 | 2.20 | 3.2
C&W | 88.5% | 0.560 | 12.11 | 1.94 | 3.4 | 89.1% | 0.512 | 14.90 | 2.21 | 3.3
MAIA-White Box | 92.8% | 0.488 | 11.25 | 1.58 | 4.0 | 93.5% | 0.466 | 13.85 | 1.94 | 3.8
Black-Box Attacks
NES | 70.2% | 0.682 | 13.93 | 2.27 | 2.8 | 65.7% | 0.704 | 16.26 | 2.15 | 2.5
ZOO | 74.9% | 0.639 | 13.51 | 2.12 | 3.0 | 72.4% | 0.654 | 15.90 | 2.05 | 3.0
MAIA-Black Box | 80.1% | 0.594 | 12.56 | 1.90 | 3.6 | 77.9% | 0.601 | 14.68 | 1.85 | 3.3
4. Selection and Evolution. Based on the classification feedback, select the most promising candidates, i.e., those that are most adversarial under $\mathcal{L}_{\mathrm{attack}}$ because they most reduce the confidence in $y$, and then update the latent variable distribution parameters to guide future perturbations towards more effective adversarial examples.
5. Re-Inpainting for Continuity. After updating $z$, reapply the inpainting model to ensure the modified audio remains musically coherent:
$$x^{(k+1)}_{\mathrm{inp}} = x \odot (1 - m) + G_\theta\big(z^{(k+1)},\; x \odot (1 - m)\big) \odot m. \tag{17}$$
6. Termination. Continue the iterative process until the classifier $\mathcal{M}$ is fooled (i.e., $\mathcal{M}(x^{(k)}_{\mathrm{inp}}) \neq y$) or a maximum number of iterations is reached.
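The query loop in steps 2 to 6 can be sketched with a much simpler evolution strategy than CMA-ES. The code below is explicitly not CMA-ES: it samples candidates from an isotropic Gaussian around a mean latent vector and moves the mean to the average of the best candidates, without any covariance or step-size adaptation. The generator `toy_generator`, the binary toy classifier `toy_confidence`, and all parameter values are hypothetical placeholders; only the query budget of 1000 mirrors the experimental setting reported in Section 4.4.

```python
import math
import random
from typing import Callable, List, Optional, Tuple


def toy_generator(z: List[float], x: List[float], mask: List[int]) -> List[float]:
    """Hypothetical stand-in for G_theta(z, x ⊙ (1 - m)): context mean plus latent offsets."""
    context = [xi for xi, mi in zip(x, mask) if not mi]
    base = sum(context) / len(context)
    offsets = iter(z)
    return [base + next(offsets) if mi else xi for xi, mi in zip(x, mask)]


def toy_confidence(audio: List[float]) -> float:
    """Black-box feedback: confidence in the true label (binary toy classifier)."""
    return 1.0 / (1.0 + math.exp(-(sum(audio[4:8]) - 1.0)))


def black_box_attack(x: List[float], mask: List[int],
                     query_confidence: Callable[[List[float]], float],
                     pop: int = 8, elite: int = 2, sigma: float = 0.3,
                     max_queries: int = 1000, seed: int = 0
                     ) -> Tuple[Optional[List[float]], int]:
    rng = random.Random(seed)
    mean = [0.0] * sum(mask)  # latent mean, one offset per masked sample
    queries = 0
    while queries + pop <= max_queries:
        scored = []
        for _ in range(pop):  # candidate generation and evaluation
            z = [mu + sigma * rng.gauss(0.0, 1.0) for mu in mean]
            audio = toy_generator(z, x, mask)
            conf = query_confidence(audio)
            queries += 1
            if conf < 0.5:  # termination: the true label is no longer predicted
                return audio, queries
            scored.append((conf, z))
        # Selection and evolution: move the mean towards the least-confident candidates.
        scored.sort(key=lambda item: item[0])
        elites = [z for _, z in scored[:elite]]
        mean = [sum(col) / elite for col in zip(*elites)]
    return None, queries  # budget exhausted without success


x = [0.5] * 12
mask = [0] * 4 + [1] * 4 + [0] * 4
adv, used = black_box_attack(x, mask, toy_confidence)
print(adv is not None, used)
```

Only the model's output confidence is used; no gradient of `toy_confidence` is ever evaluated. A full CMA-ES implementation would additionally adapt the covariance matrix and step size of the sampling distribution from the ranked candidates.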
4. EXPERIMENTS
In this section, we evaluate our proposed Music Adversarial Inpainting Attack (MAIA) across two representative MIR tasks: Cover Song Identification (CSI) and Music Genre Classification (MGC). Our experiments assess both the white-box and black-box variants of MAIA, comparing them against common baselines by evaluating their performance using both subjective and objective metrics.
4.1 Target Model and Datasets
4.1.1 Cover Song Identification (CSI)
We adopt the pre-trained CoverHunter model as our target for cover song identification, following the procedure in [26]. Experiments are conducted on the SHS100K dataset [27] test set.
4.1.2 Music Genre Classification (MGC)
We use the IDS-NMR network [1] on the GTZAN dataset [28] for genre classification.
4.2 Evaluation Metrics
We report the following classes of metrics:
Attack Success Rate (ASR): The fraction of test samples successfully misclassified by the target model in an untargeted setting.
System Performance Degradation: For CSI, we report the post-attack mAP of CoverHunter; for MGC, we report the post-attack accuracy of IDS-NMR.
FAD (Fréchet Audio Distance based on MERT): We further incorporate pre-trained MERT-V0 [29] as a feature extractor to compute the Fréchet Audio Distance (FAD) [30] on adversarially perturbed tracks. By comparing the extracted feature distributions of original and attacked audio, we gain an additional objective measure of perceptual distance.
LSD (Log-Spectral Distance) [31]: Evaluates the frame-wise spectral difference between original and perturbed signals.
Perceptual Similarity (Subjective): A listening test with 100 participants judges how easily adversarial perturbations can be detected. Each participant is asked to rate on a 5-point scale from 1 (highly noticeable) to 5 (no perceivable difference).
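Two of these metrics are simple enough to sketch directly. The snippet below computes ASR as the fraction of adversarial predictions that differ from the true label, and a frame-wise LSD as the root-mean-square difference of log10 power spectra averaged over non-overlapping frames. The LSD convention (log base, frame length, no windowing, a naive DFT) is an assumption chosen for a dependency-free illustration; the paper does not state the exact variant it uses, and practical implementations typically rely on an FFT-based STFT.

```python
import cmath
import math
from typing import List, Sequence


def attack_success_rate(true_labels: Sequence[int], adv_predictions: Sequence[int]) -> float:
    """Fraction of samples whose adversarial prediction differs from the true label."""
    fooled = sum(1 for y, p in zip(true_labels, adv_predictions) if p != y)
    return fooled / len(true_labels)


def power_spectrum(frame: Sequence[float]) -> List[float]:
    """Naive one-sided DFT power spectrum (fine for short illustrative frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def log_spectral_distance(ref: Sequence[float], est: Sequence[float],
                          frame_len: int = 8, eps: float = 1e-10) -> float:
    """Frame-wise LSD: RMS difference of log10 power spectra, averaged over frames."""
    distances = []
    for start in range(0, len(ref) - frame_len + 1, frame_len):
        p_ref = power_spectrum(ref[start:start + frame_len])
        p_est = power_spectrum(est[start:start + frame_len])
        squared = [(math.log10(a + eps) - math.log10(b + eps)) ** 2
                   for a, b in zip(p_ref, p_est)]
        distances.append(math.sqrt(sum(squared) / len(squared)))
    return sum(distances) / len(distances)


reference = [math.sin(0.7 * t) for t in range(32)]
perturbed = [v + 0.05 * (-1) ** t for t, v in enumerate(reference)]
asr = attack_success_rate([0, 1, 2, 1], [0, 2, 2, 0])
lsd_same = log_spectral_distance(reference, reference)
lsd_perturbed = log_spectral_distance(reference, perturbed)
print(asr, lsd_same, lsd_perturbed > 0.0)  # 0.5 0.0 True
```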
4.3 Attack Baselines
We compare MAIA against typical white-box and black-box adversarial methods tailored to audio:
PGD (Projected Gradient Descent) [32] [White-Box]
C&W (Carlini & Wagner) [12] [White-Box]
NES (Natural Evolution Strategies) [33] [Black-Box]
ZOO (Zeroth Order Optimization Attack) [34] [Black-Box]
4.4 Implementation Details
In all experiments, we employed GACELA as the inpainting model to ensure consistent re-generation of targeted music segments; we set the maximum number of iterations to 10 for white-box methods and capped the query budget at 1000 for black-box methods. We tuned $\lambda_{\mathrm{rec}}$ and $\lambda_{\mathrm{att}}$ by grid search, choosing values that balanced attack success rate (ASR) and perceptual fidelity. Table 1 presents the combined results for both CSI (CoverHunter on SHS100K) and MGC (IDS-NMR on GTZAN) under white-box and black-box attacks.
4.5 Results
Table 1 demonstrates that our proposed MAIA-White Box consistently outperforms standard white-box attack baselines (PGD and C&W) across both MIR tasks. Specifically, MAIA-WB achieves the highest Attack Success Rate (92.8% for CSI and 93.5% for MGC), significantly reducing the mean Average Precision (mAP) from 0.845 to 0.488 in CoverHunter and classification accuracy from 0.828 to 0.466 in IDS-NMR. Additionally, MAIA-WB maintains superior perceptual quality with lower Fréchet Audio Distance (FAD) and Log-Spectral Distance (LSD) scores, and higher Listening Test ratings (4.0 for CSI and 3.8 for MGC), indicating that the adversarial perturbations remain largely imperceptible to human listeners. In the black-box scenario, MAIA-Black Box similarly outperforms NES and ZOO, achieving ASRs of 80.1% for CSI and 77.9% for MGC, with corresponding reductions in mAP and accuracy to 0.594 and 0.601, respectively. MAIA-BB also exhibits lower FAD and LSD scores compared to black-box baselines, and higher Listening Test ratings (3.6 for CSI and 3.3 for MGC), suggesting that our importance-guided adversarial inpainting approach effectively balances attack potency with audio fidelity. Overall, MAIA variants consistently deliver higher attack success rates and greater performance degradation while preserving perceptual quality better than existing attack methods.
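As a quick arithmetic check on the degradation figures, the clean and post-attack scores quoted above can be turned into relative drops. The helper `relative_drop` is an illustrative name; the numbers are the clean baselines (0.845 mAP, 0.828 accuracy) and the post-attack values reported for MAIA.

```python
def relative_drop(before: float, after: float) -> float:
    """Relative performance loss of the attacked model, as a fraction of the clean score."""
    return (before - after) / before


results = {
    "CSI mAP, MAIA white-box": (0.845, 0.488),
    "MGC accuracy, MAIA white-box": (0.828, 0.466),
    "CSI mAP, MAIA black-box": (0.845, 0.594),
    "MGC accuracy, MAIA black-box": (0.828, 0.601),
}
for name, (clean, attacked) in results.items():
    print(f"{name}: {100 * relative_drop(clean, attacked):.1f}% relative drop")
# CSI mAP, MAIA white-box: 42.2% relative drop
# MGC accuracy, MAIA white-box: 43.7% relative drop
# CSI mAP, MAIA black-box: 29.7% relative drop
# MGC accuracy, MAIA black-box: 27.4% relative drop
```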
5. CONCLUSIONS
We have presented MAIA, a Music Adversarial Inpainting Attack framework that employs importance-driven segment selection and inpainting-based perturbations in both white-box and black-box settings. By focusing on the most influential regions, MAIA achieves higher attack success rates against CoverHunter (for cover song identification) and IDS-NMR (for genre classification), while preserving audio fidelity as measured by objective (FAD, LSD) and subjective (listening scores) metrics. We believe that our findings highlight both the potential severity and the subtlety of adversarial threats in MIR. By demonstrating a novel inpainting-based approach, we emphasize the need for comprehensive, perception-aware defenses to ensure robust and trustworthy music-related services.
6. ACKNOWLEDGEMENTS
This work was supported by the Jiangsu Science and Technology Programme (Major Special Programme, Grant No. BG2024027), the Suzhou Science and Technology Development Planning Programme (Gusu Innovation and Entrepreneurship Leading Talents Program, Grant No. ZXL2022472), and the XJTLU Research Development Fund (Grant No. RDF-22-02-046).
7. REFERENCES
[1] Y.-N. Hung, C.-H. H. Yang, P.-Y. Chen, and A. Lerch, “Low-resource music genre classification with cross-modal neural model reprogramming,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[2] A. Solanki and S. Pandey, “Music instrument recognition using deep convolutional neural networks,” International Journal of Information Technology, vol. 14, no. 3, pp. 1659–1668, 2022.
[3] X. Du, Z. Yu, B. Zhu, X. Chen, and Z. Ma, “Bytecover: Cover song identification via multi-loss training,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 551–555.
[4] X. Du, K. Chen, Z. Wang, B. Zhu, and Z. Ma, “Bytecover2: Towards dimensionality reduction of latent embedding for efficient cover song identification,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 616–620.
[5] X. Du, Z. Wang, X. Liang, H. Liang, B. Zhu, and Z. Ma, “Bytecover3: Accurate cover song identification on short queries,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
[6] V. Moscato, A. Picariello, and G. Sperli, “An emotional recommender system for music,” IEEE Intelligent Systems, vol. 36, no. 5, pp. 57–68, 2020.
[7] D. Afchar, A. Melchiorre, M. Schedl, R. Hennequin, E. Epure, and M. Moussallam, “Explainability in music recommender systems,” AI Magazine, vol. 43, no. 2, pp. 190–208, 2022.
[8] K. Prinz, A. Flexer, and G. Widmer, “On end-to-end white-box adversarial attacks in music information retrieval,” Transactions of the International Society for Music Information Retrieval, vol. 4, no. 1, pp. 93–105, 2021.
[9] P. Saadatpanah, A. Shafahi, and T. Goldstein, “Adversarial attacks on copyright detection systems,” in International Conference on Machine Learning. PMLR, 2020, pp. 8307–8315.
[10] S. Wang, Z. Zhang, G. Zhu, X. Zhang, Y. Zhou, and J. Huang, “Query-efficient adversarial attack with low perturbation against end-to-end speech recognition systems,” IEEE Transactions on Information Forensics and Security, vol. 18, pp. 351–364, 2022.
[11] Y. Chen, X. Yuan, J. Zhang, Y. Zhao, S. Zhang, K. Chen, and X. Wang, “Devil’s whisper: A general approach for physical adversarial attacks against commercial black-box speech recognition devices,” in 29th USENIX Security Symposium (USENIX Security 20), 2020, pp. 2667–2684.
[12] N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
[13] F. Croce and M. Hein, “Mind the box: l_1-apgd for sparse adversarial attacks on image classifiers,” in International Conference on Machine Learning. PMLR, 2021, pp. 2201–2211.
[14] C. Kereliuk, B. L. Sturm, and J. Larsen, “Deep learning and music adversaries,” IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 2059–2071, 2015.
[15] R. Duan, Z. Qu, S. Zhao, L. Ding, Y. Liu, and Z. Lu, “Perception-aware attack: Creating adversarial music via reverse-engineering human perception,” in Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, 2022, pp. 905–919.
[16] Z. Rakamarić and M. Emmi, “SMACK: Decoupling source language details from verifier implementations,” in Computer Aided Verification: 26th International Conference, CAV 2014, Held as Part of the Vienna Summer of Logic, VSL 2014, Vienna, Austria, July 18-22, 2014. Proceedings 26. Springer, 2014, pp. 106–113.
[17] C. Luo, Q. Lin, W. Xie, B. Wu, J. Xie, and L. Shen, “Frequency-driven imperceptible adversarial attack on semantic similarity,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Computer Society, 2022, pp. 15294–15303.
[18] S. Ali, T. Abuhmed, S. El-Sappagh, K. Muhammad, J. M. Alonso-Moral, R. Confalonieri, R. Guidotti, J. Del Ser, N. Díaz-Rodríguez, and F. Herrera, “Explainable artificial intelligence (xai): What we know and what is left to attain trustworthy artificial intelligence,” Information Fusion, vol. 99, p. 101805, 2023.
[19] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
[20] M. Lin, “Network in network,” arXiv preprint arXiv:1312.4400, 2013.
[21] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-cam: Visual explanations from deep networks via gradient-based localization,” in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 618–626.
[22] M. Oquab, L. Bottou, I. Laptev, and J. Sivic, “Is object localization for free?-weakly-supervised learning with convolutional neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 685–694.
[23] J. Gildenblat and contributors, “Pytorch library for cam methods,” https://github.com/jacobgil/pytorch-grad-cam, 2021.
[24] A. Marafioti, P. Majdak, N. Holighaus, and N. Perraudin, “GACELA: A generative adversarial context encoder for long audio inpainting of music,” IEEE Journal of Selected Topics in Signal Processing, vol. 15, no. 1, pp. 120–131, 2020.
[25] K. Varelas, A. Auger, D. Brockhoff, N. Hansen, O. A. ElHara, Y. Semet, R. Kassab, and F. Barbaresco, “A comparative study of large-scale variants of cma-es,” in Parallel Problem Solving from Nature–PPSN XV: 15th International Conference, Coimbra, Portugal, September 8–12, 2018, Proceedings, Part I 15. Springer, 2018, pp. 3–15.
[26] F. Liu, D. Tuo, Y. Xu, and X. Han, “Coverhunter: Cover song identification with refined attention and alignments,” in 2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2023, pp. 1080–1085.
[27] X. Xu, X. Chen, and D. Yang, “Key-invariant convolutional neural network toward efficient cover song identification,” in 2018 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2018, pp. 1–6.
[28] B. L. Sturm, “The GTZAN Dataset: Its contents, its faults, their effects on evaluation, and its future use,” arXiv preprint arXiv:1306.1461, 2013.
[29] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg, R. Liu, W. Chen, G. Xia, Y. Shi, W. Huang, Z. Wang, Y. Guo, and J. Fu, “MERT: Acoustic music understanding model with large-scale self-supervised training,” in International Conference on Learning Representations (ICLR), 2024.
[30] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fréchet Audio Distance: A reference-free metric for evaluating music enhancement algorithms,” in Proceedings of Interspeech, 2019.
[31] A. Gray and J. Markel, “Distance measures for speech processing,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 5, pp. 380–391, 1976.
[32] Y. Deng and L. J. Karam, “Universal adversarial attack via enhanced projected gradient descent,” in 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020, pp. 1241–1245.
[33] D. Wierstra, T. Schaul, T. Glasmachers, Y. Sun, J. Peters, and J. Schmidhuber, “Natural evolution strategies,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 949–980, 2014.
[34] P.-Y. Chen, H. Zhang, Y. Sharma, J. Yi, and C.-J. Hsieh, “Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017, pp. 15–26.