A FOURIER EXPLANATION OF AI-MUSIC ARTIFACTS
Darius Afchar, Gabriel Meseguer-Brocal, Kamil Akesbi, Romain Hennequin
Deezer Research, Paris, France
[email protected]
ABSTRACT
The rapid rise of generative AI has transformed music creation, with millions of users engaging in AI-generated music. Despite its popularity, concerns regarding copyright infringement, job displacement, and ethical implications have led to growing scrutiny and legal challenges. In parallel, AI-detection services have emerged, yet these systems remain largely opaque and privately controlled, mirroring the very issues they aim to address. This paper explores the fundamental properties of synthetic content and how it can be detected. Specifically, we analyze deconvolution modules commonly used in generative models and mathematically prove that their outputs exhibit systematic frequency artifacts – manifesting as small yet distinctive spectral peaks. This phenomenon, related to the well-known checkerboard artifact, is shown to be inherent to a chosen model architecture rather than a consequence of training data or model weights. We validate our theoretical findings through extensive experiments on open-source models, as well as commercial AI-music generators such as Suno and Udio. We use these insights to propose a simple and interpretable detection criterion for AI-generated music. Despite its simplicity, our method achieves detection accuracy on par with deep learning-based approaches, surpassing 99% accuracy on several scenarios.
1. INTRODUCTION
“It’s not really enjoyable to make music now [...] I think the majority of people don’t enjoy the majority of the time they spend making music.” – M. Shulman, CEO at Suno.
Meanwhile, millions of users seem to enjoy creating AI-generated music. As a result, it was recently reported that at least a fifth of music delivered to streaming platforms is now synthetic [1]. There is undoubtedly a hype around generative AI (GenAI) as well as lots of investments made [2]. But this view on musical creation and enthusiasm for GenAI is far from being a majority opinion [3,4]. Lawsuits have been filed against several AI companies [5]. Beyond the largely debated ethical implications and the many social risks [6–11], AI-music has specifically raised lots of questions on copyright infringements [12]. It is estimated that 24% of musicians’ revenues are at risk in the next few years [13].

© D. Afchar et al. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: D. Afchar et al., “A Fourier Explanation of AI-music Artifacts”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Most of these models are black boxes owned by private companies. There is a lack of regulation of their development and use. In parallel to this rise of GenAI, AI-detection services have started to emerge [14]. Nevertheless, these services are not faring better than GenAI in that they are also mostly black boxes owned by private parties. This raises the question: What is specific about GenAI? Why do AI-detectors work? What cues may they rely on?
This paper may be seen as an interpretability study. We take a step back from the arms race between GenAI and AI-detectors and try to ask what is specific about synthetic content and how we might detect it. We analyze deconvolution modules that are common in generative models and mathematically prove that their output will produce small frequency artifacts – i.e., peaks (see Figure 3). This effect is related to the well-known checkerboard artifact [15]. Unlike previous empirical observations, we provide an explanation of its origin using Fourier’s framework (Section 3). Interestingly, we prove these artifacts depend not on the training data nor learned weights of generative models but on their chosen architecture. In our experiments, we confirm the presence of this effect on multiple open-source models and that their structure aligns with what our theory predicts (Section 4). Further, we also show this effect happening with commercial AI-music generators (Suno and Udio). In turn, we propose a simple and interpretable criterion to detect AI-music by exploiting this phenomenon. Thanks to its underlying theory, its failure cases may be anticipated and explained. Although providing a novel detector is not our main goal but a by-product of our analysis of synthetic artifacts, our detection scores are on par with deep learning-based solutions, with over 99% accuracy.
2. RELATED WORK
2.1 AI-music detection
Mirroring the recent boom of commercial AI-music generation services (e.g., Suno, Udio [16]), the task of AI-music detection is relatively novel. Only a few works have been published so far [14, 16, 17]. These early works propose CNN-like models to learn to classify real and synthetic signals. They discuss several challenges, such as the difficulty of making these detectors robust to audio manipulations.
Nevertheless, a wealth of adjacent research topics may be related to this task. Voice spoofing and synthetic singing voice detection models have been proposed [18–21]. The literature on the detection of other modalities of synthetic media is also vast: e.g., regarding video deepfakes [22], or generated text [23]. In all these works, the dominating approach is to learn to classify synthetic and real signals with black box models and then focus on their robustness. Instead, our work rather tries to explain why synthetic samples may be detected and what cues models may rely on.
2.2 Checkerboard artifact
This phenomenon was first analyzed in [15]. It was argued to be linked to overlaps of deconvolution kernels. It was also largely analyzed in computer vision for image synthesis and its detection (e.g., [24–27]). In audio, this effect was first noticed by [28], which noted that it led to audible "pitched noise". In MelGAN [29], convolution hyperparameters were argued to be chosen properly to avoid this overlap and artifact. [30] further recommends using interpolating-upsample to reduce the effect. Our main analysis is similar to that found in these latter works. However, while most existing work regards videos and speech, our focus on AI-music is novel and timely (e.g., Suno and Udio). Then, our analysis and modeling of deconvolution artifacts in the frequency domain, as well as their usage for detection with a conveniently defined, simple linear model instead of a black box model, are not frequently found in the literature. Finally, to our knowledge, our analysis of the architecture-dependence and training- and data-independence had not been discussed before.
3. FOURIER ANALYSIS OF ARTIFACTS
In this work, we propose to reinterpret Convolutional Neural Networks (CNN) under the lenses of Fourier transforms. Instead of viewing inputs and outputs and hidden layers as time-based signals, we analyze their frequency-based decompositions (i.e., spectra).
In the field of MIR, it is frequent to use spectrograms to process audio signals, as they better align with the human ear’s perception. It is less common to do the same operations for everything happening within generation models. Nevertheless, this framing has many natural advantages for interpreting CNNs. For instance, a convolution in the time domain is dual to a multiplication in the frequency domain. This means a CNN may be reinterpreted as performing a series of multiplications. This property, as well as others, will be informative to better explain the emergence of generation artifacts.
Our overall proof sketch is to show that the deconvolution operation periodizes the spectra of hidden layers, hence creating peaks by tiling the constant component of the signal. Then, we explain that this property persists through the following model’s operations and layers. Interestingly, our theory suggests that this phenomenon should only depend on a chosen model architecture but not on the training weight or data. We confirm this property in our next experiment section.
3.1 Background
We first report some essential concepts and properties to better understand our following arguments and proof.
Fourier transform. We denote this transform $\mathcal{F}$. For a given integrable signal $s(t)$, $\mathcal{F}[s](\xi) = \int_{-\infty}^{\infty} s(t)\, e^{-i 2\pi \xi t}\, dt$. The transform is linear in $s$. Any multiplication (denoted $\cdot$) is transformed into a convolution (denoted $*$), and vice-versa: $\mathcal{F}[s * r] = \mathcal{F}[s] \cdot \mathcal{F}[r]$. As a reminder, a convolution is computed as $(s * r)(t) = \int_{-\infty}^{\infty} s(\tau)\, r(t - \tau)\, d\tau$. We recommend [31] for an extensive presentation of the Fourier transform.
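As a quick numerical sanity check of this duality, the following sketch compares a directly computed circular convolution with the inverse DFT of a product of DFTs. It is only an illustration under stated assumptions: the discrete, circular setting stands in for the continuous transform above, and the signal length and variable names are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64                       # hypothetical signal length
s = rng.standard_normal(n)   # first signal
r = rng.standard_normal(n)   # second signal


def circular_convolve(a, b):
    """Direct O(n^2) circular convolution: out[t] = sum_m a[m] * b[(t - m) mod n]."""
    size = len(a)
    # idx[t, m] = (t - m) mod size
    idx = (np.arange(size)[:, None] - np.arange(size)[None, :]) % size
    return (a[None, :] * b[idx]).sum(axis=1)


# Convolution computed in the time domain...
time_domain = circular_convolve(s, r)
# ...and as a pointwise multiplication in the frequency domain.
freq_domain = np.fft.ifft(np.fft.fft(s) * np.fft.fft(r)).real

max_error = np.max(np.abs(time_domain - freq_domain))
print(f"max abs difference: {max_error:.2e}")
```

The two results agree up to floating-point error, which is the discrete counterpart of $\mathcal{F}[s * r] = \mathcal{F}[s] \cdot \mathcal{F}[r]$.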
Dirac. A Dirac distribution is an impulse function denoted as $\delta_x$, with a measure $\|\delta_x\|_1 = 1$, and equal to 0 everywhere except in its parameter $x$. A property we will use is that a convolution with a Dirac is equivalent to applying an offset to a function: $(s * \delta_x)(t) = s(t - x)$.
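Dirac comb. A Dirac comb, denoted as $\mathrm{III}_T$, is defined as a sum of Dirac impulses evenly spaced with a period $T$: $\mathrm{III}_T = \sum_{n=-\infty}^{\infty} \delta_{nT}$. The Fourier transform of a Dirac comb is also a Dirac comb, but with spacing $1/T$: $\mathcal{F}[\mathrm{III}_T] = \frac{1}{T}\, \mathrm{III}_{1/T}$.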
Dirac combs are useful to formalize the concept of sampling. Multiplying a continuous signal with a Dirac comb leads to a discretized version of that signal. Interpreted in the frequency domain, the Fourier transform of that discretized signal is equal to the convolution of the continuous spectrum with a Dirac comb of period $1/T$. By distributing the sum of the comb, this equals a sum of Dirac convolutions, which, as we have just mentioned, results in a sum of several offsetted versions of the spectrum:

$$\mathcal{F}[s \cdot \mathrm{III}_T](\xi) = \frac{1}{T} \sum_{n=-\infty}^{\infty} \mathcal{F}[s](\xi - n/T)$$

This phenomenon is called a periodic summation.
In Nyquist-Shannon sampling theory, this concept is used to explain the phenomenon of aliasing, i.e., when copies of the spectrum overlap, and why the sampling frequency $1/T$ should be chosen higher than twice the spectral bandwidth of a signal $s$ to enable a perfect reconstruction of the sampled signal. In the next section, we use this phenomenon slightly differently to explain that this periodization is happening within deconvolution layers.
Deconvolution. Many generation models rely on “deconvolution” modules in their architectures. A deconvolution mirrors the way a strided convolution layer operates. Instead of shrinking an input into a latent representation, it expands it. It tiles a parametric vector (the kernel), which is multiplied by each coordinate of the input, this with a striding factor [32]. Deconvolutions are also sometimes called “transposed convolutions”. Indeed, a strided convolution can be rewritten as a multiplication with a big matrix where the learned kernel is tiled horizontally along a diagonal with a displacement equal to the stride (Fig. 1). With this view in mind, it may be verified that the deconvolution – as defined in CNNs – corresponds to multiplying an input with a matrix with a similar but transposed structure: i.e., the kernel tiled vertically with a displacement equal to the sought upsampling factor (Fig. 1).
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
Figure 1. A CNN’s convolution and deconvolution of an input $X$ may be expressed as matrix multiplications. To build these matrices, the kernel is tiled along the diagonal with steps $k$ (the stride). Then, it may be verified that a deconvolution is equivalent to a 1-strided convolution of the input with corresponding $k-1$ zeros inserted between each value.
Then, an important property is to realize that a strided deconvolution can be rewritten as the composition of an upsampling operation with zeros filling in-between values, followed by a 1-strided convolution. An illustration of the proof of this property is provided in Figure 1. For the complete formal proof, we refer interested readers to [32].
As reported in [15], there exist alternative definitions of deconvolution layers: an interpolated upsampling operation (e.g., linear, bicubic) followed by a convolution. We will discuss this formulation later and keep the first definition of deconvolution modules for now.
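The equivalence between a strided deconvolution and "zero-insertion followed by a 1-strided convolution" can be checked numerically on a toy 1-D example. The sketch below is a minimal illustration, not the implementation of any deep learning framework: `transposed_conv1d` and `zero_upsample` are hypothetical helper names, there is a single channel, and no padding or bias is modeled.

```python
import numpy as np


def transposed_conv1d(x, kernel, stride):
    """Toy 1-D deconvolution: tile the kernel every `stride` samples, scaled by each input value."""
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        out[i * stride : i * stride + len(kernel)] += value * kernel
    return out


def zero_upsample(x, stride):
    """Insert stride-1 zeros between consecutive input values."""
    up = np.zeros((len(x) - 1) * stride + 1)
    up[::stride] = x
    return up


rng = np.random.default_rng(0)
x = rng.standard_normal(10)       # hypothetical hidden-layer input
kernel = rng.standard_normal(7)   # hypothetical learned kernel
stride = 4

direct = transposed_conv1d(x, kernel, stride)
# Zero-insertion followed by an ordinary (1-strided) convolution.
two_step = np.convolve(zero_upsample(x, stride), kernel)

print(np.allclose(direct, two_step))
```

Both routes produce an output of length $(N-1)k + K$ for an input of length $N$, a kernel of size $K$ and a stride $k$, and they match sample by sample.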
3.2 Deconvolutions induce periodization
We are now ready to formulate our main analysis. As we have reported, deconvolution can be seen as performing an upsampling operation with zero-insertions between values, followed by a convolution. Let us show that this zero-insertion leads to a periodization of the signal.
Zero-upsampling is equivalent to over-sampling a discretized signal with a multiple of its sampling frequency. Let us consider a signal $s$, discretized with a sampling frequency $f_s$, from a continuous signal $s^{*}$:

$$s = s^{*} \cdot \mathrm{III}_{1/f_s}$$

Considering a deconvolution with stride $k$, the zero-upsampled version $r$ of $s$ may be interpreted as over-sampling $s$ with a multiple $k$ of the frequency $f_s$:

$$r = s \cdot \mathrm{III}_{1/(k f_s)} = s^{*} \cdot \mathrm{III}_{1/f_s} \cdot \mathrm{III}_{1/(k f_s)} = s^{*} \cdot \mathrm{III}_{1/f_s}$$

As a consequence, from a continuous transform perspective, $s$ and $r$ have exactly the same spectrum: $\mathcal{F}[s] = \mathcal{F}[r]$. However, moving to the discrete transform, while $\mathcal{F}[s]$ is read up to its proper sampling frequency $f_s$, $\mathcal{F}[r]$ is read up to a frequency $k f_s$, as shown in Figure 2. Thus, the zero-inserted signal contains multiple clones of the spectrum of $s$ because of the periodization effect shown above.
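This periodization is easy to observe with a discrete Fourier transform: the DFT of a zero-inserted signal is the DFT of the original signal tiled as many times as the stride. The following sketch illustrates this on a random signal with a constant offset; the length, the stride and the added bias are hypothetical values chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 32, 4                          # hypothetical input length and stride
s = rng.standard_normal(n) + 1.0      # the +1.0 bias gives a strong 0 Hz component

# Zero-upsampling: keep every sample and insert k-1 zeros after each of them.
upsampled = np.zeros(n * k)
upsampled[::k] = s

spec_s = np.fft.fft(s)
spec_up = np.fft.fft(upsampled)

# The spectrum of the zero-inserted signal is k copies of the original spectrum.
is_tiled = np.allclose(spec_up, np.tile(spec_s, k))

# In particular, the 0 Hz value is cloned at every multiple of the original sampling rate.
clone_bins = np.arange(k) * n
print(is_tiled, np.abs(spec_up[clone_bins]))
```

The printed magnitudes at `clone_bins` are all equal to the magnitude of the original constant component, which is the cloning of the 0 Hz peak discussed in the text.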
This is interesting since, empirically, common spectra of music signals often have most of their energy around the 0 frequency (i.e., their mean value) and an exponential decrease of energy in higher frequencies. Said otherwise, music spectra look like skewed triangles¹. Furthermore, in hidden layers, it is common to add biases in the output of learnable layers, or use ReLU activations. These operations further create a bias in the time domain that leads to a "peak" at $\xi = 0$ Hz in the frequency domain. Therefore, when a signal (or hidden layer vector) is periodized, this triangle pattern and its peak in 0 are cloned throughout the spectrum. We claim that this periodization of the high-energy bias is what may lead to checkerboard artifacts [15]. In music, this effect translates as a hissing noise [28,29].

¹ Note: for real signals, due to the conjugate symmetry of the Fourier transform, the magnitude spectrum is mirrored for negative values of $\xi$.

Figure 2. Periodization due to the zero-upsampling. Dotted red lines indicate the discrete signals’ sampling cutoffs.
To formalize the replication, peaks in the spectrum may be found at each $f_s$ period, up to half the sampling frequency of the zero-upsampled signal, i.e., $k f_s / 2$. Thus, letting $p_{max} = \lfloor k/2 \rfloor$, we have peaks for all integers $n \in [0 \,..\, p_{max}]$ at the frequencies $n f_s$. For instance, a deconvolution with stride 8 leads to 5 peaks.
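The peak positions predicted for a single deconvolution can be enumerated directly from this formula. The helper below is a small sketch; its name and the 150 Hz input rate used in the example are hypothetical.

```python
def single_deconv_peak_frequencies(stride, fs):
    """Frequencies n * fs for n in [0 .. floor(stride / 2)], for an input sampled at fs."""
    p_max = stride // 2
    return [n * fs for n in range(p_max + 1)]


# Hypothetical example: a stride-8 deconvolution applied to a layer sampled at 150 Hz.
peaks = single_deconv_peak_frequencies(stride=8, fs=150.0)
print(len(peaks), peaks)
```

With a stride of 8, the list contains the 5 peaks mentioned above, the last one sitting at half the output sampling frequency.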
We have seen that the zero-upsampling of a deconvolution changes the expected distribution of the audio spectrum, from a general triangle shape, to a concatenation of triangles and peaks. How is this shape impacted by the following convolution of the layer?
As recalled above, a convolution in the time domain is equivalent to a multiplication in the frequency domain. Coming back to our peaked triangles and reading this from an amplitude spectrum in log-scale: a convolution is equivalent to a vertical offset in the log-spectrum. Importantly, there is a big difference between the typical size of commonly processed music signals (e.g., 1s of audio may contain 48000 samples), and the size of CNNs’ kernels (e.g., 3, 5 or 7). In the frequency domain, the space of possible spectra the kernel’s transform lives in is parametrized by only a few parameters². To multiply these two spectra, the kernel is zero-padded, which is equivalent to a spectral interpolation [31]. Said otherwise, the kernel spectrum is typically a slowly evolving function – relative to the variations of the spectrum to be multiplied with. While some very particular choices of kernels could mitigate the peaks (e.g., a constant kernel whose size equals the stride would cancel all peaks but the one at 0 Hz), the limited expressivity due to the limited number of parameters of the kernel is unlikely to remove all those artifacts entirely. As we will see in the next sections, this hypothesis is confirmed in practice: the output of a deconvolution actually exhibits peaks that are preserved after the convolution layer. Examples of such outputs may be found in Figure 3.
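The particular case mentioned above, a constant kernel whose size equals the stride, can be verified numerically: its zero-padded spectrum vanishes exactly at the frequencies where the zero-insertion clones the 0 Hz peak, whereas an arbitrary small kernel generally does not. The sketch below uses hypothetical sizes and a random kernel of size 7 as a stand-in for a learned one.

```python
import numpy as np

n, k = 256, 8                       # hypothetical input length and stride
clone_bins = np.arange(1, k) * n    # non-zero DFT bins where the 0 Hz peak gets cloned


def response_at_clones(kernel):
    """Magnitude of the kernel's frequency response at the cloned-peak bins.

    Zero-padding the kernel to the signal length interpolates its spectrum.
    """
    spectrum = np.fft.fft(kernel, n=n * k)
    return np.abs(spectrum[clone_bins])


box_kernel = np.ones(k)                                       # constant kernel, size = stride
random_kernel = np.random.default_rng(0).standard_normal(7)   # stand-in for a learned kernel

box_response = response_at_clones(box_kernel)        # numerically zero: clones are cancelled
random_response = response_at_clones(random_kernel)  # generally non-zero: clones survive

print(np.round(box_response, 12))
print(np.round(random_response, 3))
```

Only a kernel whose spectral zeros happen to align with every clone position removes the peaks, which illustrates why a few-parameter kernel is unlikely to do so in general.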
Finally, we have alluded to the fact that some variants of deconvolution may include interpolations instead of the zero-upsampling we discussed. The extension of our result to the general case of any possible interpolation scheme is not straightforward. Still, there is one case that can be easily extended: linear interpolation. For this, we remark that a linear interpolation may be computed as a convolution of the zero-upsampled input by a triangular filter $\Lambda(t)$ of width $k$, the stride of the deconvolution³. This convolution may be absorbed into the following convolution of the layer, and our results remain unchanged.
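The remark on linear interpolation can be checked on a toy signal: convolving the zero-upsampled input with a triangular filter reproduces the output of `np.interp`. This is a minimal sketch under stated assumptions; the sizes are hypothetical, the triangle is sampled at integer lags as $1 - |j|/k$, and only positions between the first and last input samples are compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 16, 4                     # hypothetical input length and stride
x = rng.standard_normal(n)

# Zero-upsampling: k-1 zeros between consecutive samples.
up = np.zeros((n - 1) * k + 1)
up[::k] = x

# Triangular filter sampled at integer lags -(k-1) .. (k-1).
lags = np.arange(-(k - 1), k)
triangle = 1.0 - np.abs(lags) / k

# Full convolution, then keep the part aligned with the upsampled grid.
full = np.convolve(up, triangle)
via_convolution = full[k - 1 : k - 1 + len(up)]

# Reference: direct linear interpolation at the same fractional positions.
positions = np.arange(len(up)) / k
via_interp = np.interp(positions, np.arange(n), x)

print(np.allclose(via_convolution, via_interp))
```

Since the triangular filter is itself a fixed convolution, it composes with the learned kernel that follows, which is the absorption argument made above.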
3.3 Effect of multiple layers
We have seen that deconvolutions, by nature of the zero-upsampling they induce, create artifacts due to the replication of the peak created by the input bias. Is this property preserved through multiple layers? It is hard to give a general answer due to the variety of architectures that exist. We discuss the case of simple, sequential, CNNs.
In this case, by induction, the periodization induced by deconvolution may lead to replicating not only the continuous component peak but also all previous peaks. This means that artifacts will be cloned in a fractal-like manner. To illustrate this property, we have computed the average spectrum found at several stages of the generator part of the Encodec model [33] in Figure 3. We can see peaks appearing at the predicted places (see the formula from the previous section), as well as preservation of previous layers’ peaks in a fractal-like manner.
² which is the size of the kernel: e.g., the discrete transform of a kernel $K$ of size three is $\mathcal{F}[K](\xi) = k_0 + k_1 e^{-2i\pi\xi/N} + k_2 e^{-4i\pi\xi/N}, \ \forall \xi \in [0, N-1]$.

³ With the linear interpolation, $\mathcal{F}[\Lambda](\xi) = k\, \mathrm{sinc}^2(k\xi)$. The interpolation acts as a low-pass filter. This is why this variant is less frequently used as it may lead to worse reconstructions of high frequencies.

We formalize this recursive pattern. Instead of counting replications of the spectrum, it is convenient to count clones of the half-spectrum above 0 (the negative spectrum being symmetric for real signals). Indeed, we have seen that a spectrum from a layer $i$ will be cloned $k^{(i+1)}/2$ times by a deconvolution $i+1$, thus creating $k^{(i+1)}$ half-spectra. By induction, a single peak at 0 Hz in the input will thus lead to $P_{max} = \prod_{i=0}^{L} k^{(i)}$ cloned half-spectra after $L$ deconvolutions. In total, this amounts to $\lfloor P_{max}/2 \rfloor + 1$ peaks when assembling each half-spectrum. For instance, Encodec has strides $\{8, 5, 4, 2\}$ and thus creates 161 peaks.
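The peak count follows directly from the product of the strides. The sketch below reproduces the Encodec example; the function names are hypothetical. The second helper additionally assumes that the peaks fall on integer multiples of the rate of the first deconvolution's input (the output sample rate divided by the product of strides), which is one reading of the single-layer formula applied recursively and is not stated explicitly in the text.

```python
from math import prod


def count_peaks(strides):
    """Number of predicted peaks: floor(P_max / 2) + 1, with P_max the product of the strides."""
    p_max = prod(strides)
    return p_max // 2 + 1


def predicted_peak_frequencies(strides, sample_rate):
    """Assumed peak positions: multiples of sample_rate / P_max, from 0 Hz up to Nyquist."""
    p_max = prod(strides)
    spacing = sample_rate / p_max   # rate of the input to the first deconvolution (assumption)
    return [n * spacing for n in range(p_max // 2 + 1)]


encodec_strides = [8, 5, 4, 2]      # strides reported in the text
n_peaks = count_peaks(encodec_strides)
freqs = predicted_peak_frequencies(encodec_strides, sample_rate=48_000)

print(n_peaks, freqs[:3], freqs[-1])
```

Under that assumption, a 48kHz model with these strides would show peaks every 150 Hz up to 24 kHz, which gives the same total of 161.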
Interestingly, this recursive property suggests that if artifacts from previous layers are well preserved, then they may be used as a fingerprint of the whole architecture used by a generation model (specifically, the stride hyperparameter of each deconvolution).
3.4 Discussion
By combining several results from signal processing and properties of deconvolution, we have shown that generative CNNs may be subject to the generation of peaking artifacts in the spectra of their outputs. Our discussion has only regarded simple cases and does not cover all possible model architectures. For instance, we have not discussed whether non-linear activations allow these artifacts to persist, nor whether skip connections or batch normalization may help reduce the bias peak and, hence, their replication in the spectrum. In the next section, we will address these further questions empirically by analyzing the artifacts of several music generation models.
Interestingly, we did not need to consider training data or model weights to show the emergence of artifacts. This suggests that this issue is neither related to, nor solvable with, better training and data, but is inherent to the use of deconvolution layers in a model.
4. EXPERIMENT
In this section, we validate our theoretical findings and close the gaps left out by theory with empirical results. Our research questions are the following:
RQ1 Are the artifacts solely architecture dependent?
RQ2 Can they be used to detect synthetic music?
We show that the models we study (both open-source and closed-source) exhibit peaks in their generated outputs. In turn, as an application example, this can be used to build a simple but efficient synthetic music detector that is both fast and interpretable.
4.1 Setup
We briefly detail the models and datasets we leverage in our experiments. More implementation details may be found on our code repository⁴.
Open-source models. We consider several music generation models. There are not many studies on AI-music detection. To compare our results to [17], we consider the same models: DAC, Encodec and Musika!, as well as the medium split of the FMA dataset [33–36]. The open-source nature of these models allows inspection. As argued in [17], many AI-music models rely on neural codecs (e.g., [37, 38]). This means that instead of generating a song end-to-end, it is more efficient to learn to generate a sequence of audio tokens, that are then dequantized and converted to 48kHz audio through the neural codec decoder. It was shown that learning to detect neural codec then transfers to the detection of music generators using that codec [17]. We will compare our results to the black box detection model proposed in that latter study. To address RQ1, we also consider the MTAT and MTG-Jamendo datasets [39,40].

⁴ github.com/deezer/ismir25-ai-music-detector

Figure 3. Inside Encodec’s decoder. We plot the average spectra of a real music input, its latent representation (i.e., after the dequantization of the audio tokens), as well as all its successive hidden outputs of all the deconvolution layers of Encodec (48kHz version). We indicate the layer number and chosen stride hyper-parameter of each deconvolution. We plot crosses below the signal where the periodized continuous component might create an artifact due to the periodization of each above spectrum (as deduced from the stride). The spectra are averages from the many channels of each hidden layer, as well as temporally averaged from successive $2^{13}$ samples.
Closed-source models. We also conduct experiments on Suno and Udio. We leverage the recent SONICS dataset [16], containing 50k of such synthetic music tracks. With these models, we cannot verify whether the presence of peaks corresponds to the underlying architecture. We only showcase that these models exhibit the same issues.
Artifact fingerprint. To study the presence of artifacts, we need to extract the potential replicated peaks in generated spectrograms. With the analysis provided in Section 3, it seems relevant to start from average spectra, as proposed in Figure 3, and related to what was proposed in computer vision in [27]. Indeed, while a local music patch will exhibit many frequencies due to its melodic content, it is reasonable to think that when averaging many patches’ spectrum over several minutes of audio, the melodic content will be smoothed and result in the triangle-like shape we were discussing. From there, we simply propose to subtract the local minima of the spectrum over sliding windows to highlight the local variations we try to detect (i.e., peaks). Finally, we consider a reduced bandwidth (e.g., [5kHz, 16kHz]) to further discard melodic information and uninformative noise above the cutoff of the mp3 codec. We refer to this small processing of the spectrum as computing an artifact fingerprint. It represents the local variations in the amplitude of the average spectrum (e.g., see Figure 4).
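The three steps described above (average spectrum, subtraction of a sliding local minimum, band restriction) can be sketched as follows. This is an illustrative reading of the description, not the code from the authors' repository: the function names, the Hann window, the dB scale and the sliding-window size of 65 bins are assumptions, while the frame length of $2^{13}$ samples and the [5kHz, 16kHz] band come from the text.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def average_spectrum_db(audio, n_fft=8192):
    """Average magnitude spectrum (in dB) over successive non-overlapping frames."""
    n_frames = len(audio) // n_fft
    frames = audio[: n_frames * n_fft].reshape(n_frames, n_fft)
    magnitudes = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1))
    return 20.0 * np.log10(magnitudes.mean(axis=0) + 1e-12)


def artifact_fingerprint(audio, sample_rate, n_fft=8192, band=(5_000.0, 16_000.0), window=65):
    """Average spectrum minus its sliding local minimum, restricted to a frequency band."""
    spectrum = average_spectrum_db(audio, n_fft)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    # Sliding minimum over `window` bins (odd size, edges padded by repetition).
    padded = np.pad(spectrum, window // 2, mode="edge")
    local_min = sliding_window_view(padded, window).min(axis=1)
    fingerprint = spectrum - local_min
    keep = (freqs >= band[0]) & (freqs <= band[1])
    return freqs[keep], fingerprint[keep]


# Toy usage: white noise with a faint tone at 9 kHz standing in for a spectral peak.
sample_rate = 48_000
t = np.arange(4 * sample_rate) / sample_rate
rng = np.random.default_rng(0)
toy_audio = rng.standard_normal(len(t)) + 0.5 * np.sin(2 * np.pi * 9_000 * t)

fp_freqs, fp_values = artifact_fingerprint(toy_audio, sample_rate)
peak_frequency = fp_freqs[np.argmax(fp_values)]
print(f"strongest local variation near {peak_frequency:.0f} Hz")
```

Subtracting the local minimum removes the slowly varying envelope of the spectrum, so that narrow peaks stand out regardless of their absolute level.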
4.2 Architecture dependence
To address RQ1, i.e., whether the studied artifacts are independent of training data and learned weights, we train several models of DAC [34], using the same model configuration as VampNet [37]. We train DAC on the FMA twice, with a different random seed (impacting the weight initialization and optimization), and once each on MTAT and MTG-Jamendo. We compute artifact fingerprints of the auto-encoded tracks (following the same methodology as [17]). The averages of the found artifact fingerprints (on the test set) are displayed in Figure 4. Strikingly, we find that the placement of peaks is the same for all four models. Their amplitudes are slightly different between the four versions but still clearly exhibit the same overall pattern. This means that artifacts are indeed: 1) Weights-independent, since different training seeds do not change the peaks on FMA; 2) Data-independent, since we find the same peaks for models trained on MTAT and Jamendo. Note that tracks from MTAT have a sampling rate of 16kHz, hence the cutoff that we observe at 8kHz. Overall, this suggests that AI-music artifacts are indeed solely architecture dependent.
This also suggests that a detector trained on one given model and relying on these cues may transfer its performance to that model trained on a different dataset, or even to other models that share a similar parametrization of their deconvolution layers. This may explain the intra-family generalization property found in [17].

Figure 4. Artifacts of various trained versions of DAC.
4.3 AI-music detection
To address RQ2, we build a simple detector. From our previous section, it seems that synthetic content may be simply detected by checking if the artifact fingerprint contains peaks or not. By the look of what was found in Figure 4, it seems that a straightforward linear regressor could be enough for this task: e.g., learning positive coefficients for peaking frequencies and negative ones for baseline values.
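To make the idea of a linear detector on fingerprints concrete, the following toy sketch trains a logistic regression with plain gradient descent on simulated fingerprints. Everything here is hypothetical: the data are random vectors in which the "synthetic" class receives extra energy at four arbitrary bins, and the optimizer, learning rate and sizes are illustrative choices rather than the configuration used in the experiments.

```python
import numpy as np


def train_logistic_regression(X, y, lr=0.1, steps=500):
    """Plain gradient descent on the mean logistic loss; returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        error = p - y
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
    return w, b


rng = np.random.default_rng(0)
n_bins = 64
peak_bins = np.array([8, 24, 40, 56])   # hypothetical artifact positions


def toy_fingerprints(n_tracks, synthetic):
    """Simulated fingerprints: non-negative noise, plus peaks for the synthetic class."""
    fingerprints = np.abs(rng.standard_normal((n_tracks, n_bins)))
    if synthetic:
        fingerprints[:, peak_bins] += 3.0
    return fingerprints


X_train = np.vstack([toy_fingerprints(200, False), toy_fingerprints(200, True)])
y_train = np.concatenate([np.zeros(200), np.ones(200)])
X_test = np.vstack([toy_fingerprints(100, False), toy_fingerprints(100, True)])
y_test = np.concatenate([np.zeros(100), np.ones(100)])

# Center features with training statistics before fitting.
mean = X_train.mean(axis=0)
w, b = train_logistic_regression(X_train - mean, y_train)

predictions = ((X_test - mean) @ w + b) > 0
accuracy = (predictions == y_test.astype(bool)).mean()
print(f"toy accuracy: {accuracy:.3f}, mean weight on peak bins: {w[peak_bins].mean():.2f}")
```

On such toy data, the learned coefficients are positive on the peak bins, so inspecting the weight vector directly reveals which frequencies the detector relies on.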
We first follow the setup of [17]. We train to detect each real music track in FMA against its synthetic reconstructions. The results are provided in Table 1. We use the neural codecs at their maximum bandwidth, which is the hardest to detect. Overall, the performances are comparable between our linear model and the reported black box CNN model, and the classification is almost perfect.
Class                     Ours     Reported from [17]
Real                      99.87    99.7
Synthetic
  ↪ DAC (14kbps)          99.68    99.3
  ↪ Encodec (24kbps)      99.81    99.7
  ↪ Musika!               99.97    100.0

Table 1. Test detection scores (%) on open-source models. We report the synthetic class breakdown.
Next, we do the same experiment with closed-source generators. We follow the setup of SONICS [16]. However, the real audio tracks were not provided due to copyrights. Therefore, we fall back to using tracks from FMA instead. We resampled them to 16kHz as [16]. This means our transforms have a cutoff at 8kHz. We accordingly adjust the bandwidth of our fingerprints to [1kHz, 8kHz]. This change, unfortunately, means that we cannot compare our results in a fair manner. We nevertheless report SONICS’ best model (SpecTTTra-α) in Table 2 for information. As it may be seen, the scores of our 10K-parameter logistic regression are comparable to the ones of a 20M-parameter transformer model. On the synthetic version seen during training (Suno 3.5 and Udio 130), the classification is perfect. The performance drops on Udio 32, which was unseen during training (following the splits from [16]). We suspect that Udio might have changed their model architecture between the 32 and 130 versions, which would explain the failure at the zero-shot performance transfer.
Class                     Ours      Reported from [16]
Real                      99.97     99
Synthetic
  ↪ Suno 3.5              100.00    100
  ↪ Suno 3 †              100.00    96
  ↪ Suno 2 †              99.90     78
  ↪ Udio 130              100.00    100
  ↪ Udio 32 †             39.83     96

Table 2. Test detection scores (%) on closed-source models. We include the breakdown of Suno and Udio. † indicates versions unseen during training.
We may also train our model to detect each generator individually and display the learned weights as a way to highlight the found peaks (see Figure 5). We can see clear patterns and various peak placements from DAC, Encodec, and Suno. This is more fuzzy for other models that are yet still well detected with our method. We leave these more challenging interpretations for future work.

Figure 5. Learned logistic regression coefficients.
5. CONCLUSION
We propose a theoretical analysis of the emergence of peak artifacts in AI-music, formalize their recursive shape, and predict their architecture dependence. Our experiments confirmed this latter observation, and we have used our new-found knowledge to craft a simple detector with performances on par with previous million-parameter models. While this highlights what cues detectors may rely on, other types of artifacts might still exist: e.g., this may explain how SpecTTTra generalizes to the unseen Udio 32.
We also have not touched on the topic of audio manipulation robustness. We can already anticipate that those impacting frequency positions will affect our model (e.g., resampling, pitch shift). We leave this as future work.
6. REFERENCES
[1] Deezer, “Deezer deploys cutting-edge AI detection tool for music streaming,” https://newsroom-deezer.com/2025/04/deezer-reveals-18-of-all-new-music-uploaded-to-streaming-is-fully-ai-generated/, 2025, [Online; accessed 22-March-2025].
[2] D. G. Widder and M. Hicks, “Watching the generative AI hype bubble deflate,” arXiv preprint arXiv:2408.08778, 2024.
[3] T. Sippy, F. Enock, J. Bright, and H. Z. Margetts, “Behind the Deepfake: 8% Create; 90% Concerned. Surveying public exposure to and perceptions of deepfakes in the UK,” arXiv preprint arXiv:2407.05529, 2024.
[4] 404 Media, “CEO of AI Music Company Says People Don’t Like Making Music,” https://www.404media.co/ceo-of-ai-music-company-says-people-dont-like-making-music/, 2025, [Online; accessed 13-June-2025].
[5] The Guardian, “Music labels sue AI song generators Suno and Udio for copyright infringement,” https://www.theguardian.com/music/article/2024/jun/25/record-labels-sue-ai-song-generator-apps-copyright-infringement-lawsuit, 2025, [Online; accessed 13-June-2025].
[6] L. Pelly, Mood Machine: The Rise of Spotify and the Costs of the Perfect Playlist. Hodder & Stoughton, 2025.
[7] Y. Wei, Y. Zhu, P. Hui, and G. Tyson, “Exploring the Use of Abusive Generative AI Models on Civitai,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6949–6958.
[8] H. H. Jiang, L. Brown, J. Cheng, M. Khan, A. Gupta, D. Workman, A. Hanna, J. Flowers, and T. Gebru, “AI Art and its Impact on Artists,” in AIES. ACM, 2023.
[9] S. Gautam, P. N. Venkit, and S. Ghosh, “From Melting Pots to Misrepresentations: Exploring Harms in Generative AI,” in GenAICHI, 2024.
[10] M. Klincewicz, M. Alfano, and A. E. Fard, “Slopaganda: The interaction between propaganda and generative AI,” Filosofiska Notiser, vol. 12, no. 1, pp. 135–162, 2025.
[11] L. Klein, M. Martin, A. Brock, M. Antoniak, M. Walsh, J. M. Johnson, L. Tilton, and D. Mimno, “Provocations from the Humanities for Generative AI Research,” arXiv preprint arXiv:2502.19190, 2025.
[12] T. S. Goetze, “AI Art is Theft: Labour, Extraction, and Exploitation: Or, On the Dangers of Stochastic Pollocks,” in ACM FAccT, 2024.
[13] CISAC, “Study on the economic impact of Generative AI in the Music and Audiovisual industries,” https://www.cisac.org/Newsroom/news-releases/global-economic-study-shows-human-creators-future-risk-generative-ai, 2024, [Online; accessed 22-March-2025].
[14] Y. Li, M. Milling, L. Specia, and B. W. Schuller, “From Audio Deepfake Detection to AI-Generated Music Detection–A Pathway and Overview,” arXiv preprint arXiv:2412.00571, 2024.
[15] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and Checkerboard Artifacts,” Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[16] M. A. Rahman, Z. I. A. Hakim, N. H. Sarker, B. Paul, and S. A. Fattah, “SONICS: Synthetic Or Not - Identifying Counterfeit Songs,” in International Conference on Learning Representations (ICLR), 2025.
[17] D. Afchar, G. Meseguer-Brocal, and R. Hennequin, “AI-Generated Music Detection and its Challenges,” in ICASSP. IEEE, 2025.
[18] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, “ASVspoof: the automatic speaker verification spoofing and countermeasures challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, 2017.
[19] Z. Almutairi and H. Elgibreen, “A review of modern audio deepfake detection methods: challenges and future directions,” Algorithms, vol. 15, no. 5, p. 155, 2022.
[20] C. Sun, S. Jia, S. Hou, and S. Lyu, “AI-synthesized voice detection using neural vocoder artifacts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 904–912.
[21] Y. Zang, Y. Zhang, M. Heydari, and Z. Duan, “Singfake: Singing voice deepfake detection,” in ICASSP. IEEE, 2024.
[22] Y. Mirsky and W. Lee, “The creation and detection of deepfakes: A survey,” ACM computing surveys (CSUR), vol. 54, no. 1, 2021.
[23] L. Lin, N. Gupta, Y. Zhang, H. Ren, C.-H. Liu, F. Ding, X. Wang, X. Li, L. Verdoliva, and S. Hu, “Detecting Multimedia Generated by Large AI Models: A Survey,” arXiv:2402.00045, 2024.
[24] X. Zhang, S. Karaman, and S.-F. Chang, “Detecting and simulating artifacts in gan fake images,” in 2019 IEEE international workshop on information forensics and security (WIFS). IEEE, 2019, pp. 1–6.
[25] T. Osakabe, M. Tanaka, Y. Kinoshita, and H. Kiya, “CycleGAN without checkerboard artifacts for counter-forensics of fake-image detection,” in International Workshop on Advanced Imaging Technology (IWAIT) 2021, vol. 11766. SPIE, 2021, pp. 51–55.
[26] B. Liu, F. Yang, X. Bi, B. Xiao, W. Li, and X. Gao, “Detecting generated images by real images,” in European Conference on Computer Vision. Springer, 2022, pp. 95–110.
[27] R. Corvi, D. Cozzolino, G. Poggi, K. Nagano, and L. Verdoliva, “Intriguing properties of synthetic images: from generative adversarial networks to diffusion models,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 973–982.
[28] C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” in International Conference on Learning Representations, 2018.
[29] K. Kumar, R. Kumar, T. De Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. De Brebisson, Y. Bengio, and A. C. Courville, “Melgan: Generative adversarial networks for conditional waveform synthesis,” Advances in neural information processing systems, vol. 32, 2019.
[30] J. Pons, S. Pascual, G. Cengarle, and J. Serrà, “Upsampling artifacts in neural audio synthesis,” in ICASSP. IEEE, 2021.
[31] S. Mallat, A Wavelet Tour of Signal Processing, Third Edition: The Sparse Way, 3rd ed. USA: Academic Press, Inc., 2008.
[32] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” arXiv preprint arXiv:1603.07285, 2016.
[33] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High Fidelity Neural Audio Compression,” ArXiv, vol. abs/2210.13438, 2022. [Online]. Available: https://api.semanticscholar.org/CorpusID:253097788
[34] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-Fidelity Audio Compression with Improved RVQGAN,” NeurIPS, vol. 36, 2024.
[35] M. Pasini and J. Schlüter, “Musika! Fast Infinite Waveform Music Generation,” in ISMIR, 2022.
[36] M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “FMA: A dataset for music analysis,” in ISMIR, 2017. [Online]. Available: https://arxiv.org/abs/1612.01840
[37] H. F. Garcia, P. Seetharaman, R. Kumar, and B. Pardo, “VampNet: Music Generation via Masked Acoustic Token Modeling,” ISMIR, 2023.
[38] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and Controllable Music Generation,” NeurIPS, vol. 36, 2024.
[39] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of Algorithms Using Games: The Case of Music Tagging,” in ISMIR. Citeseer, 2009, pp. 387–392.
[40] D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The MTG-Jamendo Dataset for Automatic Music Tagging,” in Machine Learning for Music Discovery Workshop, International Conference on Machine Learning (ICML 2019), Long Beach, CA, United States, 2019. [Online]. Available: http://hdl.handle.net/10230/42015