A Fourier Explanation of AI-Music Artifacts

Author: Darius Afchar; Gabriel Meseguer Brocal; Kamil Akesbi; Romain Hennequin

Publisher: Zenodo

DOI: 10.5281/zenodo.17706577

Source: https://zenodo.org/records/17706577/files/000086.pdf

A FOURIER EXPLANATION OF AI-MUSIC ARTIFACTS
Darius Afchar Gabriel Meseguer-Brocal Kamil Akesbi Romain Hennequin
Deezer Research, Paris, France
[email protected]
ABSTRACT
The rapid rise of generative AI has transformed music creation, with millions of users engaging in AI-generated music. Despite its popularity, concerns regarding copyright infringement, job displacement, and ethical implications have led to growing scrutiny and legal challenges. In parallel, AI-detection services have emerged, yet these systems remain largely opaque and privately controlled, mirroring the very issues they aim to address. This paper explores the fundamental properties of synthetic content and how it can be detected. Specifically, we analyze deconvolution modules commonly used in generative models and mathematically prove that their outputs exhibit systematic frequency artifacts, manifesting as small yet distinctive spectral peaks. This phenomenon, related to the well-known checkerboard artifact, is shown to be inherent to a chosen model architecture rather than a consequence of training data or model weights. We validate our theoretical findings through extensive experiments on open-source models, as well as commercial AI-music generators such as Suno and Udio. We use these insights to propose a simple and interpretable detection criterion for AI-generated music. Despite its simplicity, our method achieves detection accuracy on par with deep learning-based approaches, surpassing 99% accuracy on several scenarios.
1. INTRODUCTION
“It’s not really enjoyable to make music now [...] I think the majority of people don’t enjoy the majority of the time they spend making music.” (M. Shulman, CEO at Suno)
Meanwhile, millions of users seem to enjoy creating AI-generated music. As a result, it was recently reported that at least a fifth of music delivered to streaming platforms is now synthetic [1]. There is undoubtedly a hype around generative AI (GenAI) as well as lots of investments made [2]. But this view on musical creation and enthusiasm for GenAI is far from being a majority opinion [3,4]. Lawsuits have been filed against several AI companies [5]. Beyond the largely debated ethical implications and the many social risks [6–11], AI-music has specifically raised many questions on copyright infringements [12]. It is estimated that 24% of musicians’ revenues are at risk in the next few years [13].

© D. Afchar et al. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: D. Afchar et al., “A Fourier Explanation of AI-music Artifacts”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Most of these models are black box and owned by private companies. There is a lack of regulation of their development and use. In parallel to this rise of GenAI, AI-detection services have started to emerge [14]. Nevertheless, these services are not faring better than GenAI in that they are also mostly black box and owned by private parties. This begs the question: What is specific about GenAI? Why do AI-detectors work? What cues may they rely on?
This paper may be seen as an interpretability study. We take a step back from the arms race between GenAI and AI-detectors and try to ask what is specific about synthetic content and how we might detect it. We analyze deconvolution modules that are common in generative models and mathematically prove that their output will produce small frequency artifacts, i.e., peaks (see Figure 3). This effect is related to the well-known checkerboard artifact [15]. Unlike previous empirical observations, we provide an explanation of its origin using Fourier’s framework (Section 3). Interestingly, we prove these artifacts depend not on the training data nor learned weights of generative models but on their chosen architecture. In our experiments, we confirm the presence of this effect on multiple open-source models and that their structure aligns with what our theory predicts (Section 4). Further, we also show this effect happening with commercial AI-music generators (Suno and Udio). In turn, we propose a simple and interpretable criterion to detect AI-music by exploiting this phenomenon. Thanks to its underlying theory, its failure cases may be anticipated and explained. Although providing a novel detector is not our main goal but a by-product of our analysis of synthetic artifacts, our detection scores are on par with deep learning-based solutions, with over 99% accuracy.
2. RELATED WORK
2.1 AI-music detection
Mirroring the recent boom of commercial AI-music generation services (e.g., Suno, Udio [16]), the task of AI-music detection is relatively novel. Only a few works have been published so far [14, 16, 17]. These early works propose CNN-like models to learn to classify real and synthetic signals. They discuss several challenges, such as the difficulty of making these detectors robust to audio manipulations. Nevertheless, a wealth of adjacent research topics may be related to this task. Voice spoofing and synthetic singing voice detection models have been proposed [18–21]. The literature on the detection of other modalities of synthetic media is also vast: e.g., regarding video deepfakes [22], or generated text [23]. In all these works, the dominating approach is to learn to classify synthetic and real signals with black box models and then focus on their robustness. Instead, our work rather tries to explain why synthetic samples may be detected and what cues models may rely on.
2.2 Checkerboard artifact
This phenomenon was first analyzed in [15]. It was argued to be linked to overlaps of deconvolution kernels. It was also largely analyzed in computer vision for image synthesis and its detection (e.g., [24–27]). In audio, this effect was first noticed by [28], which noted that it led to audible "pitched noise". In MelGAN [29], convolution hyperparameters were argued to be chosen properly to avoid this overlap and artifact. [30] further recommends using interpolating-upsample to reduce the effect. Our main analysis is similar to that found in these latter works. However, while most existing work regards videos and speech, our focus on AI-music is novel and timely (e.g., Suno and Udio). Then, our analysis and modeling of deconvolution artifacts in the frequency domain, as well as their usage for detection with a conveniently defined, simple linear model instead of a black box model, are not frequently found in the literature. Finally, to our knowledge, our analysis of the architecture-dependence and training- and data-independence had not been discussed before.
3. FOURIER ANALYSIS OF ARTIFACTS
In this work, we propose to reinterpret Convolutional Neural Networks (CNN) under the lenses of Fourier transforms. Instead of viewing inputs and outputs and hidden layers as time-based signals, we analyze their frequency-based decompositions (i.e., spectra).
In the field of MIR, it is frequent to use spectrograms to process audio signals, as they better align with the human ear’s perception. It is less common to do the same operations for everything happening within generation models. Nevertheless, this framing has many natural advantages for interpreting CNNs. For instance, a convolution in the time domain is dual to a multiplication in the frequency domain. This means a CNN may be reinterpreted as performing a series of multiplications. This property, as well as others, will be informative to better explain the emergence of generation artifacts.
Our overall proof sketch is to show that the deconvolution operation periodizes the spectra of hidden layers, hence creating peaks by tiling the constant component of the signal. Then, we explain that this property persists through the following model’s operations and layers. Interestingly, our theory suggests that this phenomenon should only depend on a chosen model architecture but not on the training weight or data. We confirm this property in our next experiment section.
3.1 Background
We first report some essential concepts and properties to better understand our following arguments and proof.
Fourier transform. We denote this transform $\mathcal{F}$. For a given integrable signal $s(t)$, $\mathcal{F}[s](\xi) = \int_{-\infty}^{\infty} s(t)\, e^{-i 2\pi \xi t}\, dt$. The transform is linear in $s$. Any multiplication (denoted $\cdot$) is transformed into a convolution (denoted $*$), and vice-versa: $\mathcal{F}[s * r] = \mathcal{F}[s] \cdot \mathcal{F}[r]$. As a reminder, a convolution is computed as $(s * r)(t) = \int_{-\infty}^{\infty} s(\tau)\, r(t - \tau)\, d\tau$. We recommend [31] for an extensive presentation of the Fourier transform.
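The convolution property can be checked numerically on a small example. The sketch below is only a discrete, finite-length analogue (a naive DFT and a circular convolution, written with the standard library), not the continuous transform used in the text; the signals `s` and `r` are arbitrary illustrative values.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def circular_convolve(s, r):
    """Circular convolution: (s * r)[t] = sum over tau of s[tau] * r[(t - tau) mod N]."""
    n = len(s)
    return [sum(s[tau] * r[(t - tau) % n] for tau in range(n)) for t in range(n)]


s = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0]   # arbitrary example signal
r = [0.5, -0.25, 0.0, 0.0, 0.0, 1.0]  # arbitrary second signal

lhs = dft(circular_convolve(s, r))             # F[s * r]
rhs = [a * b for a, b in zip(dft(s), dft(r))]  # F[s] . F[r]
max_error = max(abs(a - b) for a, b in zip(lhs, rhs))
print(f"max |F[s*r] - F[s].F[r]| = {max_error:.2e}")
```

The two sides agree up to floating-point rounding, which is the discrete counterpart of the duality used throughout Section 3.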
Dirac. A Dirac distribution is an impulse function denoted as $\delta_x$, with a measure $\|\delta_x\|_1 = 1$, and equal to 0 everywhere except in its parameter $x$. A property we will use is that a convolution with a Dirac is equivalent to applying an offset to a function: $(s * \delta_x)(t) = s(t - x)$.
Dirac comb. A Dirac comb, denoted as $\mathrm{III}_T$, is defined as a sum of Dirac impulses evenly spaced with a period $T$: $\mathrm{III}_T = \sum_{n=-\infty}^{\infty} \delta_{nT}$. The Fourier transform of a Dirac comb is also a Dirac comb, but with spacing $1/T$: $\mathcal{F}[\mathrm{III}_T] = \frac{1}{T}\, \mathrm{III}_{1/T}$.
Dirac combs are useful to formalize the concept of sampling. Multiplying a continuous signal with a Dirac comb leads to a discretized version of that signal. Interpreted in the frequency domain, the Fourier transform of that discretized signal is equal to the convolution of the continuous spectrum with a Dirac comb of period $1/T$. By distributing the sum of the comb, this equals a sum of Dirac convolutions, which as we have just mentioned, results in a sum of several offsetted versions of the spectrum:
$$\mathcal{F}[s \cdot \mathrm{III}_T](\xi) = \frac{1}{T} \sum_{n=-\infty}^{\infty} \mathcal{F}[s](\xi - n/T)$$
This phenomenon is called a periodic summation.
In Nyquist-Shannon sampling theory, this concept is used to explain the phenomenon of aliasing, i.e., when copies of the spectrum overlap, and why the sampling frequency $1/T$ should be chosen higher than twice the spectral bandwidth of a signal $s$ to enable a perfect reconstruction of the sampled signal. In the next section, we use this phenomenon slightly differently to explain that this periodization is happening within deconvolution layers.
Deconvolution. Many generation models rely on “deconvolution” modules in their architectures. A deconvolution mirrors the way a strided convolution layer operates. Instead of shrinking an input into a latent representation, it expands it. It tiles a parametric vector (the kernel), which is multiplied by each coordinate of the input, this with a striding factor [32]. Deconvolutions are also sometimes called “transposed convolutions”. Indeed, a strided convolution can be rewritten as a multiplication with a big matrix where the learned kernel is tiled horizontally along a diagonal with a displacement equal to the stride (Fig. 1). With this view in mind, it may be verified that the deconvolution, as defined in CNNs, corresponds to multiplying an input with a matrix with a similar but transposed structure: i.e., the kernel tiled vertically with a displacement equal to the sought upsampling factor (Fig. 1).
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025

Figure 1. A CNN’s convolution and deconvolution of an input $X$ may be expressed as matrix multiplications. To build these matrices, the kernel is tiled along the diagonal with steps $k$ (the stride). Then, it may be verified that a deconvolution is equivalent to a 1-strided convolution of the input with corresponding $k-1$ zeros inserted between each value.
Then, an important property is to realize that a strided deconvolution can be rewritten as the composition of an upsampling operation with zeros filling in-between values, followed by a 1-strided convolution. An illustration of the proof of this property is provided in Figure 1. For the complete formal proof, we refer interested readers to [32].
As reported in [15], there exist alternative definitions of deconvolution layers: an interpolated upsampling operation (e.g., linear, bicubic) followed by a convolution. We will discuss this formulation later and keep the first definition of deconvolution modules for now.
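The equivalence between a strided deconvolution and a zero-upsampling followed by a 1-strided convolution can be verified on a toy 1-D case. The sketch below is a minimal single-channel illustration with hypothetical helper names; it ignores the padding and cropping conventions that deep learning frameworks add on top, and the input, kernel and stride values are arbitrary.

```python
def transposed_conv1d(x, w, k):
    """Strided 'deconvolution': every input value scatters a scaled copy of the
    kernel w into the output, with a displacement of k samples between copies."""
    y = [0.0] * ((len(x) - 1) * k + len(w))
    for i, xi in enumerate(x):
        for j, wj in enumerate(w):
            y[i * k + j] += xi * wj
    return y


def zero_upsample(x, k):
    """Insert k - 1 zeros between consecutive values of x."""
    u = [0.0] * ((len(x) - 1) * k + 1)
    for i, xi in enumerate(x):
        u[i * k] = xi
    return u


def full_conv1d(u, w):
    """Plain stride-1 ('full') convolution of u with the kernel w."""
    y = [0.0] * (len(u) + len(w) - 1)
    for t, ut in enumerate(u):
        for j, wj in enumerate(w):
            y[t + j] += ut * wj
    return y


x = [1.0, -2.0, 3.0, 0.5]          # arbitrary input
w = [0.2, -1.0, 0.7, 0.1, 0.3]     # arbitrary kernel of size 5
k = 3                              # stride / upsampling factor

direct = transposed_conv1d(x, w, k)
composed = full_conv1d(zero_upsample(x, k), w)
max_diff = max(abs(a - b) for a, b in zip(direct, composed))
print(len(direct), len(composed), max_diff)
```

Both routes produce the same output, sample for sample, which is why the rest of the analysis can focus on the zero-insertion step alone.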
3.2 Deconvolutions induce periodization
We are now ready to formulate our main analysis. As we have reported, deconvolution can be seen as performing an upsampling operation with zero-insertions between values, followed by a convolution. Let us show that this zero-insertion leads to a periodization of the signal.
Zero-upsampling is equivalent to over-sampling a discretized signal with a multiple of its sampling frequency. Let us consider a signal $s$, discretized with a sampling frequency $f_s$, from a continuous signal $s^*$:
$$s = s^* \cdot \mathrm{III}_{1/f_s}$$
Considering a deconvolution with stride $k$, the zero-upsampled version $r$ of $s$ may be interpreted as over-sampling $s$ with a multiple $k$ of the frequency $f_s$:
$$r = s \cdot \mathrm{III}_{1/(k f_s)} = s^* \cdot \mathrm{III}_{1/f_s} \cdot \mathrm{III}_{1/(k f_s)} = s^* \cdot \mathrm{III}_{1/f_s}$$
As a consequence, from a continuous transform perspective, $s$ and $r$ have exactly the same spectrum: $\mathcal{F}[s] = \mathcal{F}[r]$. However, moving to the discrete transform, while $\mathcal{F}[s]$ is read up to its proper sampling frequency $f_s$, $\mathcal{F}[r]$ is read up to a frequency $k f_s$, as shown in Figure 2. Thus, the zero-inserted signal contains multiple clones of the spectrum of $s$ because of the periodization effect shown above.
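This cloning can be observed directly on a discrete example. The sketch below is a minimal illustration with a naive DFT and hypothetical helper names: it zero-inserts a short, positively biased signal and compares the spectrum of the result with the tiled spectrum of the original. The signal values and the factor `k = 4` are arbitrary.

```python
import cmath


def dft(x):
    """Naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def zero_insert(x, k):
    """Follow every sample of x with k - 1 zeros (output length is k * len(x))."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0.0] * (k - 1))
    return out


k = 4
s = [1.0 + 0.3 * ((7 * n) % 5) for n in range(16)]  # arbitrary signal with a positive bias
S = dft(s)
R = dft(zero_insert(s, k))

# The spectrum of the zero-inserted signal is the original spectrum tiled k times.
tiling_error = max(abs(R[f] - S[f % len(s)]) for f in range(len(R)))

# The bias (bin 0 of S) therefore reappears at every multiple of len(s).
dc_clone_bins = [f for f in range(len(R)) if abs(abs(R[f]) - abs(S[0])) < 1e-9]
print(tiling_error, dc_clone_bins)
```

In this discrete setting the tiling is exact, and the constant component of `s` shows up at bins 0, 16, 32 and 48, i.e., at every multiple of the original sampling frequency.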
This is interesting since, empirically, common spectra of music signals often have most of their energy around the 0 frequency (i.e., their mean value) and an exponential decrease of energy in higher frequencies. Said otherwise, music spectra look like skewed triangles¹. Furthermore, in hidden layers, it is common to add biases in the output of learnable layers, or use ReLU activations. These operations further create a bias in the time domain that leads to a "peak" at $\xi = 0$ Hz in the frequency domain. Therefore, when a signal (or hidden layer vector) is periodized, this triangle pattern and its peak in 0 are cloned throughout the spectrum. We claim that this periodization of the high-energy bias is what may lead to checkerboard artifacts [15]. In music, this effect translates as a hissing noise [28,29].

¹ Note: for real signals, due to the conjugate symmetry of the Fourier transform, the magnitude spectrum is mirrored for negative values of $\xi$.

Figure 2. Periodization due to the zero-upsampling. Dotted red lines indicate the discrete signals’ sampling cutoffs.
To formalize the replication, peaks in the spectrum may be found at each $f_s$ period up to the Nyquist frequency $k f_s / 2$ of the zero-upsampled signal. Thus, letting $p_{\max} = \lfloor k/2 \rfloor$, we have peaks for all integers $n \in [0 \,..\, p_{\max}]$ at the frequencies $n f_s$. For instance, a deconvolution with stride 8 leads to 5 peaks.
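The peak positions for a single deconvolution follow directly from this formula. The small helper below is only an illustration (the function name and the 6 kHz input rate are hypothetical choices, not values from the paper).

```python
def peak_frequencies(f_s, k):
    """Predicted peak positions n * f_s for n in [0 .. floor(k / 2)], where f_s is
    the sampling frequency before a deconvolution of stride k."""
    p_max = k // 2
    return [n * f_s for n in range(p_max + 1)]


# Hypothetical layer: input sampled at 6 kHz, stride 8, hence output at 48 kHz.
peaks = peak_frequencies(6000, 8)
print(len(peaks), peaks)
```

With these example values, the five predicted peaks span from 0 Hz to 24 kHz, the last one sitting at half the output sampling frequency.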
We have seen that the zero-upsampling of a deconvolution changes the expected distribution of the audio spectrum, from a general triangle shape, to a concatenation of triangles and peaks. How is this shape impacted by the following convolution of the layer?
As reminded, a convolution in the time domain is equivalent to a multiplication in the frequency domain. Coming back to our peaked triangles and reading this from an amplitude spectrum in log-scale: a convolution is equivalent to a vertical offset in the log-spectrum. Importantly, there is a big difference between the typical size of commonly processed music signals (e.g., 1s of audio may contain 48000 samples), and the size of CNNs’ kernels (e.g., 3, 5 or 7). In the frequency domain, the space of possible spectra the kernel’s transform lives in is parametrized by only a few parameters². To multiply these two spectra, the kernel is zero-padded, which is equivalent to a spectral interpolation [31]. Said otherwise, the kernel spectrum is typically a slowly evolving function, relative to the variations of the spectrum it is multiplied with. While some very particular choices of kernels could mitigate the peaks (e.g., a constant kernel of size the stride would cancel all peaks but the one at 0Hz), the limited expressivity due to the limited number of parameters of the kernel is unlikely to remove all those artifacts entirely. As we will see in the next sections, this hypothesis is confirmed in practice: the output of a deconvolution actually results in creating peaks that are preserved after the convolution layer. Examples of such outputs may be found in Figure 3.
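The special case of the constant kernel, and the fact that a generic short kernel does not cancel the peaks, can be illustrated on a pure bias signal. The sketch below is a toy setup under stated assumptions: a constant input, a circular convolution to avoid boundary effects, and an arbitrary 3-tap kernel standing in for a learned one.

```python
import cmath


def dft_magnitudes(x):
    """Magnitudes of a naive O(N^2) discrete Fourier transform."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)))
            for f in range(n)]


def zero_insert(x, k):
    """Follow every sample of x with k - 1 zeros."""
    out = []
    for v in x:
        out.append(v)
        out.extend([0.0] * (k - 1))
    return out


def circular_convolve(u, h):
    """Circular stride-1 convolution of u with a short kernel h."""
    n = len(u)
    return [sum(hj * u[(t - j) % n] for j, hj in enumerate(h)) for t in range(n)]


def peak_bins(x, tol=1e-9):
    """Indices of the DFT bins holding non-negligible energy."""
    return [f for f, m in enumerate(dft_magnitudes(x)) if m > tol]


k = 4
u = zero_insert([1.0] * 8, k)  # a pure bias, zero-upsampled by k

flat_peaks = peak_bins(circular_convolve(u, [1.0] * k))            # constant kernel of size k
generic_peaks = peak_bins(circular_convolve(u, [0.5, 1.0, 0.25]))  # arbitrary small kernel
print(flat_peaks, generic_peaks)
```

The constant kernel of size `k` leaves only the bin at 0, whereas the arbitrary kernel keeps a peak at every multiple of the original sampling frequency (bins 0, 8, 16 and 24 here), merely rescaling each of them.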
Finally, we have alluded to the fact that some variants of deconvolution may include interpolations instead of the zero-upsampling we discussed. The extension of our result to the general case of any possible interpolation scheme is not straightforward. Still, there is one case that can be easily extended: linear interpolation. For this, we remark that a linear interpolation may be computed as a convolution of the zero-upsampled input by a triangular filter $\Lambda(t)$ of width $k$, the stride of the deconvolution³. This convolution may be absorbed into the following convolution of the layer, and our results remain unchanged.
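The remark on linear interpolation can be checked numerically. In the sketch below, the triangular filter is taken with taps $1 - |j|/k$ for $|j| < k$ (i.e., a half-width of $k$ samples, which is our reading of "width $k$"); the helper names and test values are illustrative assumptions.

```python
def zero_upsample(x, k):
    """Insert k - 1 zeros between consecutive values of x."""
    u = [0.0] * ((len(x) - 1) * k + 1)
    for i, xi in enumerate(x):
        u[i * k] = xi
    return u


def triangular_filter(k):
    """Taps 1 - |j| / k for j in [-(k - 1) .. k - 1]."""
    return [1.0 - abs(j) / k for j in range(-(k - 1), k)]


def centered_conv(u, h):
    """Convolve u with an odd-length, centred filter h, keeping len(u) samples."""
    half = len(h) // 2
    out = []
    for t in range(len(u)):
        acc = 0.0
        for j, hj in enumerate(h):
            idx = t - (j - half)
            if 0 <= idx < len(u):
                acc += u[idx] * hj
        out.append(acc)
    return out


def linear_interpolate(x, k):
    """Reference implementation: k - 1 linearly interpolated points between samples."""
    out = []
    for i in range(len(x) - 1):
        for m in range(k):
            out.append(x[i] + (x[i + 1] - x[i]) * m / k)
    out.append(x[-1])
    return out


x = [0.0, 2.0, -1.0, 4.0, 4.0]  # arbitrary input
k = 4
via_filter = centered_conv(zero_upsample(x, k), triangular_filter(k))
reference = linear_interpolate(x, k)
max_diff = max(abs(a - b) for a, b in zip(via_filter, reference))
print(max_diff)
```

Since the interpolation is itself a stride-1 convolution applied after zero-insertion, it composes with the layer's learned kernel into a single, slightly longer kernel.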
3.3 Effect of multiple layers
We have seen that deconvolutions, by nature of the zero-upsampling they induce, create artifacts due to the replication of the peak created by the input bias. Is this property preserved through multiple layers? It is hard to give a general answer due to the variety of architectures that exist. We discuss the case of simple, sequential, CNNs.
In this case, by induction, the periodization induced by deconvolution may lead to replicating not only the continuous component peak but also all previous peaks. This means that artifacts will be cloned in a fractal-like manner. To illustrate this property, we have computed the average spectrum found at several stages of the generator part of the Encodec model [33] in Figure 3. We can see peaks appearing at the predicted places (see the formula from the previous section), as well as preservation of previous layers’ peaks in a fractal-like manner.
We formalize this recursive pattern. Instead of counting replications of the spectrum, it is convenient to count clones of the half-spectrum above 0 (the negative spectrum being symmetric for real signals). Indeed, we have seen that a spectrum from a layer $i$ will be cloned $k^{(i+1)}/2$ times by a deconvolution $i+1$, thus creating $k^{(i+1)}$ half-spectra. By induction, a single peak at 0Hz in the input will thus lead to $P_{\max} = \prod_{i=0}^{L} k^{(i)}$ cloned half-spectra after $L$ deconvolutions. In total, this amounts to $\lfloor P_{\max}/2 \rfloor + 1$ peaks when assembling each half-spectrum. For instance, Encodec has strides $\{8, 5, 4, 2\}$ and thus creates 161 peaks.

² which is the size of the kernel: e.g., the discrete transform of a kernel $K$ of size three is $\mathcal{F}[K](\xi) = k_0 + k_1 e^{-2i\pi\xi/N} + k_2 e^{-4i\pi\xi/N}, \forall \xi \in [0, N-1]$

³ With the linear interpolation, $\mathcal{F}[\Lambda](\xi) = k\, \mathrm{sinc}^2(k\xi)$. The interpolation acts as a low-pass filter. This is why this variant is less frequently used as it may lead to worse reconstructions of high frequencies.
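The peak count for a stack of deconvolutions is a one-line computation. The sketch below reproduces the Encodec example and, as an additional assumption of ours, places the peaks at multiples of the latent rate (output rate divided by the product of strides), taking a 48 kHz output as in Figure 3; the function names are hypothetical.

```python
from math import prod


def count_peaks(strides):
    """floor(P_max / 2) + 1 peaks, with P_max the product of the deconvolution strides."""
    return prod(strides) // 2 + 1


def predicted_peak_frequencies(strides, output_rate):
    """Assumed peak positions: multiples of the latent rate, up to output_rate / 2."""
    p_max = prod(strides)
    latent_rate = output_rate / p_max
    return [n * latent_rate for n in range(p_max // 2 + 1)]


encodec_strides = [8, 5, 4, 2]
n_peaks = count_peaks(encodec_strides)
freqs = predicted_peak_frequencies(encodec_strides, 48000)
print(n_peaks, freqs[:3], freqs[-1])
```

Under these assumptions the 161 candidate peaks are spaced every 150 Hz, from 0 Hz up to 24 kHz; whether each one is actually visible depends on how well the intermediate convolutions preserve it.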
Interestingly, this recursive property suggests that if artifacts from previous layers are well preserved, then they may be used as a fingerprint of the whole architecture used by a generation model (specifically, the stride hyperparameter of each deconvolution).
3.4 Discussion
By combining several results from signal processing and properties of deconvolution, we have shown that generative CNNs may be subject to the generation of peaking artifacts in the spectra of their outputs. Our discussion has only regarded simple cases and does not cover all possible model architectures. For instance, we have not discussed whether non-linear activations allow these artifacts to persist, nor whether skip connections or batch normalization may help reduce the bias peak and, hence, their replication in the spectrum. In the next section, we will address these further questions empirically by analyzing the artifacts of several music generation models.
Interestingly, we did not need to consider training data or model weights to show the emergence of artifacts. This suggests that this issue seems not related to nor solvable with better training and data, but is inherent to the use of deconvolution layers in a model.
4. EXPERIMENT
In this section, we validate our theoretical findings and close the gaps left out by theory with empirical results. Our research questions are the following:
RQ1 Are the artifacts solely architecture dependent?
RQ2 Can they be used to detect synthetic music?
We show that the models we study (both open-source and closed-source) exhibit peaks in their generated outputs. In turn, as an application example, this can be used to build a simple but efficient synthetic music detector that is both fast and interpretable.
4.1 Setup
We briefly detail the models and datasets we leverage in our experiments. More implementation details may be found on our code repository⁴.
Open-source models. We consider several music generation models. There are not many studies on AI-music detection. To compare our results to [17], we consider the same models: DAC, Encodec and Musika!, as well as the medium split of the FMA dataset [33–36]. The open-source nature of these models allows inspection. As argued in [17], many AI-music models rely on neural codecs (e.g., [37, 38]). This means that instead of generating a song end-to-end, it is more efficient to learn to generate a sequence of audio tokens, that are then dequantized and converted to 48kHz audio through the neural codec decoder. It was shown that learning to detect neural codec then transfers to the detection of music generators using that codec [17]. We will compare our results to the black box detection model proposed in that latter study. To address RQ1, we also consider the MTAT and MTG-Jamendo datasets [39,40].

⁴ github.com/deezer/ismir25-ai-music-detector

Figure 3. Inside Encodec’s decoder. We plot the average spectra of a real music input, its latent representation (i.e., after the dequantization of the audio tokens), as well as all its successive hidden outputs of all the deconvolution layers of Encodec (48kHz version). We indicate the layer number and chosen stride hyper-parameter of each deconvolution. We plot crosses below the signal where the periodized continuous component might create an artifact due to the periodization of each above spectrum (as deduced from the stride). The spectra are averages from the many channels of each hidden layer, as well as temporally averaged from successive $2^{13}$ samples.
Closed-source models. We also conduct experiments on Suno and Udio. We leverage the recent SONICS dataset [16], containing 50k of such synthetic music tracks. With these models, we cannot verify whether the presence of peaks corresponds to the underlying architecture. We only showcase that these models exhibit the same issues.
Artifact fingerprint. To study the presence of artifacts, we need to extract the potential replicated peaks in generated spectrograms. With the analysis provided in Section 3, it seems relevant to start from average spectra, as proposed in Figure 3, and related to what was proposed in computer vision in [27]. Indeed, while a local music patch will exhibit many frequencies due to its melodic content, it is reasonable to think that when averaging many patches’ spectrum over several minutes of audio, the melodic content will be smoothed and result in the triangle-like shape we were discussing. From here, we simply propose to subtract the local minima of the spectrum over sliding windows to highlight the local variations we try to detect (i.e., peaks). Finally, we consider a reduced bandwidth (e.g., [5kHz, 16kHz]) to further discard melodic information and uninformative noise above the cutoff of the mp3 codec. We refer to this small processing of the spectrum as computing an artifact fingerprint. It represents the local variations in the amplitude of the average spectrum (e.g., see Figure 4).
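The fingerprint computation described above can be sketched as follows. This is a minimal standard-library illustration, not the implementation from the authors' repository: the frame size, the use of a log-amplitude scale, the window length, and the band expressed in DFT bins (rather than Hz) are all assumptions, and the test signal is a toy sinusoid buried in noise.

```python
import cmath
import math
import random


def magnitude_spectrum(frame):
    """Magnitudes of the non-negative frequency bins of a naive DFT."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)))
            for f in range(n // 2 + 1)]


def average_log_spectrum(signal, frame_size):
    """Average the magnitude spectra of consecutive frames, then take log10."""
    starts = range(0, len(signal) - frame_size + 1, frame_size)
    acc = [0.0] * (frame_size // 2 + 1)
    for start in starts:
        for f, m in enumerate(magnitude_spectrum(signal[start:start + frame_size])):
            acc[f] += m
    return [math.log10(a / len(starts) + 1e-12) for a in acc]


def artifact_fingerprint(log_spectrum, window, band):
    """Subtract the sliding-window minimum and keep only the bins in [lo, hi)."""
    lo, hi = band
    half = window // 2
    return [log_spectrum[f] - min(log_spectrum[max(0, f - half):f + half + 1])
            for f in range(lo, hi)]


# Toy signal: a sinusoid exactly on bin 20 (the "artifact") plus uniform noise.
rng = random.Random(0)
frame_size = 64
signal = [math.sin(2 * math.pi * 20 * t / frame_size) + 0.1 * rng.uniform(-1, 1)
          for t in range(8 * frame_size)]

fingerprint = artifact_fingerprint(average_log_spectrum(signal, frame_size),
                                   window=9, band=(10, 30))
strongest_bin = 10 + fingerprint.index(max(fingerprint))
print(strongest_bin)
```

Subtracting the local minimum flattens the slowly varying envelope of the average spectrum, so that only narrow local excursions such as the injected tone stand out in the fingerprint.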
4.2 Architecture dependence
To address RQ1 that the studied artifacts are independent of training data and learned weights, we train several models of DAC [34], using the same model configuration as VampNet [37]. We train DAC on the FMA twice, with a different random seed (impacting the weight initialization and optimization), on MTAT and MTG-Jamendo. We compute artifact fingerprints of the auto-encoded tracks (following the same methodology as [17]). The averages of the found artifact fingerprints (on the test set) are displayed in Figure 4. Strikingly, we find that the placement of peaks is the same for all four models. Their amplitudes are slightly different between the four versions but still clearly exhibit the same overall pattern. This means that artifacts are indeed: 1) Weights-independent, since different training seeds do not change the peaks on FMA; 2) Data-independent, since we find the same peaks for models trained on MTAT and Jamendo. Note that tracks from MTAT have a sampling rate of 16kHz, hence the cutoff that we observe at 8kHz. Overall, this suggests that AI-music artifacts are indeed solely architecture dependent.
This also suggests that a detector trained on one given model and relying on these cues may transfer its performance to that model trained on a different dataset, or even to other models that share a similar parametrization of their deconvolution layers. This may explain the intra-family generalization property found in [17].

Figure 4. Artifacts of various trained versions of DAC.
4.3 AI-music detection
To address RQ2, we build a simple detector. From our previous section, it seems that synthetic content may be simply detected by checking if the artifact fingerprint contains peaks or not. By the look of what was found in Figure 4, it seems that a straightforward linear regressor could be enough for this task: e.g., learning positive coefficients for peaking frequencies and negative ones for baseline values.
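A linear detector of this kind can be sketched with a small logistic regression trained by gradient descent. Everything in the example below is a toy assumption (synthetic 8-bin fingerprints, peaks injected at two hypothetical bins, learning rate and epoch count); it only illustrates how the learned coefficients end up pointing at the peaking bins, and it does not reproduce the authors' model or data.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, epochs=300, lr=0.5):
    """Full-batch gradient descent on the logistic loss; returns (weights, bias)."""
    n, d = len(features), len(features[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            grad_b += err
            for j in range(d):
                grad_w[j] += err * x[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """1 for 'synthetic', 0 for 'real'."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


rng = random.Random(0)
peak_bins = [2, 5]  # hypothetical bins where the generator leaves peaks


def toy_fingerprint(is_synthetic):
    """Low baseline everywhere, plus strong peaks for synthetic samples."""
    fp = [rng.uniform(0.0, 0.1) for _ in range(8)]
    if is_synthetic:
        for j in peak_bins:
            fp[j] += 1.0
    return fp


labels = [i % 2 for i in range(200)]
features = [toy_fingerprint(y == 1) for y in labels]
w, b = train_logistic_regression(features, labels)

accuracy = sum(predict(w, b, x) == y for x, y in zip(features, labels)) / len(labels)
top_two = sorted(sorted(range(len(w)), key=lambda j: w[j], reverse=True)[:2])
print(accuracy, top_two)
```

Because the model is linear in the fingerprint, its coefficients can be read directly as a map of which frequency bins drive the decision, which is what makes this family of detectors interpretable.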
We first follow the setup of [17]. We train to detect each real music track in FMA against its synthetic reconstructions. The results are provided in Table 1. We use the neural codecs at their maximum bandwidth, which is the hardest to detect. Overall, the performances are comparable between our linear model and the reported black box CNN model, and the classification is almost perfect.

Class                    Ours     Reported from [17]
Real                     99.87    99.7
Synthetic
  ↪ DAC (14kbps)         99.68    99.3
  ↪ Encodec (24kbps)     99.81    99.7
  ↪ Musika!              99.97    100.0

Table 1. Test detection scores (%) on open-source models. We report the synthetic class breakdown.
Next, we do the same experiment with closed-source generators. We follow the setup of SONICS [16]. However, the real audio tracks were not provided due to copyrights. Therefore, we fall back to using tracks from FMA instead. We resampled them to 16kHz as [16]. This means our transforms have a cutoff at 8kHz. We accordingly adjust the bandwidth of our fingerprints to [1kHz, 8kHz]. This change, unfortunately, means that we cannot compare our results in a fair manner. We nevertheless report SONICS’ best model (SpecTTTra-α) in Table 2 for information. As it may be seen, the scores of our 10K-parameter logistic regression are comparable to the ones of a 20M-parameter transformer model. On the synthetic versions seen during training (Suno 3.5 and Udio 130), the classification is perfect. The performance drops on Udio 32, which was unseen during training (following the splits from [16]). We suspect that Udio might have changed their model architecture between the 32 and 130 versions, which would explain the failure at the zero-shot performance transfer.

Class                    Ours      Reported from [16]
Real                     99.97     99
Synthetic
  ↪ Suno 3.5             100.00    100
  ↪ Suno 3†              100.00    96
  ↪ Suno 2†              99.90     78
  ↪ Udio 130             100.00    100
  ↪ Udio 32†             39.83     96

Table 2. Test detection scores (%) on closed-source models. We include the breakdown for Suno and Udio. † indicates versions unseen during training.
We may also train our model to detect each generator individually and display the learned weights as a way to highlight the found peaks (see Figure 5). We can see clear patterns and various peak placements from DAC, Encodec, and Suno. This is more fuzzy for other models that are yet still well detected with our method. We leave these more challenging interpretations for future work.

Figure 5. Learned logistic regression coefficients.
5. CONCLUSION
We propose a theoretical analysis of the emergence of peak artifacts in AI-music, formalize their recursive shape, and predict their architecture dependence. Our experiments confirmed this latter observation, and we have used our new-found knowledge to craft a simple detector with performances on par with previous million-parameter models. While this highlights what cues detectors may rely on, other types of artifacts might still exist: e.g., this may explain how SpecTTTra generalizes to the unseen Udio 32.
We also have not touched on the topic of audio manipulation robustness. We can already anticipate that those impacting frequency positions will affect our model (e.g., resampling, pitch shift). We leave this as future work.
6. REFERENCES
[1] Deezer, “Deezer deploys cutting-edge AI detection tool for music streaming,” https://newsroom-deezer.com/2025/04/deezer-reveals-18-of-all-new-music-uploaded-to-streaming-is-fully-ai-generated/, 2025, [Online; accessed 22-March-2025].
[2] D. G. Widder and M. Hicks, “Watching the generative AI hype bubble deflate,” arXiv preprint arXiv:2408.08778, 2024.
[3] T. Sippy, F. Enock, J. Bright, and H. Z. Margetts, “Behind the Deepfake: 8% Create; 90% Concerned. Surveying public exposure to and perceptions of deepfakes in the UK,” arXiv preprint arXiv:2407.05529, 2024.
[4] 404 Media, “CEO of AI Music Company Says People Don’t Like Making Music,” https://www.404media.co/ceo-of-ai-music-company-says-people-dont-like-making-music/, 2025, [Online; accessed 13-June-2025].
[5] The Guardian, “Music labels sue AI song generators Suno and Udio for copyright infringement,” https://www.theguardian.com/music/article/2024/jun/25/record-labels-sue-ai-song-generator-apps-copyright-infringement-lawsuit, 2025, [Online; accessed 13-June-2025].
[6] L. Pelly, Mood Machine: The Rise of Spotify and the Costs of the Perfect Playlist. Hodder & Stoughton, 2025.
[7] Y. Wei, Y. Zhu, P. Hui, and G. Tyson, “Exploring the Use of Abusive Generative AI Models on Civitai,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6949–6958.
[8] H. H. Jiang, L. Brown, J. Cheng, M. Khan, A. Gupta, D. Workman, A. Hanna, J. Flowers, and T. Gebru, “AI Art and its Impact on Artists,” in AIES. ACM, 2023.
[9] S. Gautam, P. N. Venkit, and S. Ghosh, “From Melting Pots to Misrepresentations: Exploring Harms in Generative AI,” in GenAICHI, 2024.
[10] M. Klincewicz, M. Alfano, and A. E. Fard, “Slopaganda: The interaction between propaganda and generative AI,” Filosofiska Notiser, vol. 12, no. 1, pp. 135–162, 2025.
[11] L. Klein, M. Martin, A. Brock, M. Antoniak, M. Walsh, J. M. Johnson, L. Tilton, and D. Mimno, “Provocations from the Humanities for Generative AI Research,” arXiv preprint arXiv:2502.19190, 2025.
[12] T. S. Goetze, “AI Art is Theft: Labour, Extraction, and Exploitation: Or, On the Dangers of Stochastic Pollocks,” in ACM FAccT, 2024.
[13] CISAC, “Study on the economic impact of Generative AI in the Music and Audiovisual industries,” https://www.cisac.org/Newsroom/news-releases/global-economic-study-shows-human-creators-future-risk-generative-ai, 2024, [Online; accessed 22-March-2025].
[14] Y. Li, M. Milling, L. Specia, and B. W. Schuller, “From Audio Deepfake Detection to AI-Generated Music Detection–A Pathway and Overview,” arXiv preprint arXiv:2412.00571, 2024.
[15] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and Checkerboard Artifacts,” Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[16] M. A. Rahman, Z. I. A. Hakim, N. H. Sarker, B. Paul, and S. A. Fattah, “SONICS: Synthetic Or Not - Identifying Counterfeit Songs,” in International Conference on Learning Representations (ICLR), 2025.
[17] D. Afchar, G. Meseguer-Brocal, and R. Hennequin, “AI-Generated Music Detection and its Challenges,” in ICASSP. IEEE, 2025.
[18] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilçi, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, “ASVspoof: the automatic speaker verification spoofing and countermeasures challenge,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, 2017.
[19] Z. Almutairi and H. Elgibreen, “A review of modern audio deepfake detection methods: challenges and future directions,” Algorithms, vol. 15, no. 5, p. 155, 2022.
[20] C. Sun, S. Jia, S. Hou, and S. Lyu, “AI-synthesized voice detection using neural vocoder artifacts,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 904–912.
[21] Y. Zang, Y. Zhang, M. Heydari, and Z. Duan, “Singfake: Singing voice deepfake detection,” in ICASSP. IEEE, 2024.
[22] Y. Mirsky and W. Lee, “The creation and detection of deepfakes: A survey,” ACM computing surveys (CSUR), vol. 54, no. 1, 2021.
[23] L. Lin, N. Gupta, Y. Zhang, H. Ren, C.-H. Liu, F. Ding, X. Wang, X. Li, L. Verdoliva, and S. Hu, “Detecting Multimedia Generated by Large AI Models: A Survey,” arXiv:2402.00045, 2024.
[24] X. Zhang, S. Karaman, and S.-F. Chang, “Detecting and simulating artifacts in gan fake images,” in 2019 IEEE international workshop on information forensics and security (WIFS). IEEE, 2019, pp. 1–6.
[25] T. Osakabe, M. Tanaka, Y. Kinoshita, and H. Kiya, “CycleGAN without checkerboard artifacts for counter-forensics of fake-image detection,” in International Workshop on Advanced Imaging Technology (IWAIT) 2021, vol. 11766. SPIE, 2021, pp. 51–55.
Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
[26] B. Liu, F. Yang, X. Bi, B. Xiao, W. Li, and X. Gao,
“Detecting generated images by real images,” in European
Conference on Computer Vision. Springer, 2022,
pp. 95–110.
[27] R. Corvi, D. Cozzolino, G. Poggi, K. Nagano, and
L. Verdoliva, “Intriguing properties of synthetic images:
from generative adversarial networks to diffusion
models,” in Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, 2023, pp.
973–982.
[28] C. Donahue, J. McAuley, and M. Puckette, “Adversarial
audio synthesis,” in International Conference on
Learning Representations, 2018.
[29] K. Kumar, R. Kumar, T. De Boissiere, L. Gestin, W. Z.
Teoh, J. Sotelo, A. De Brebisson, Y. Bengio, and A. C.
Courville, “Melgan: Generative adversarial networks
for conditional waveform synthesis,” Advances in neural
information processing systems, vol. 32, 2019.
[30] J. Pons, S. Pascual, G. Cengarle, and J. Serrà, “Upsampling
artifacts in neural audio synthesis,” in ICASSP.
IEEE, 2021.
[31] S. Mallat, A Wavelet Tour of Signal Processing, Third
Edition: The Sparse Way, 3rd ed. USA: Academic
Press, Inc., 2008.
[32] V. Dumoulin and F. Visin, “A guide to convolution
arithmetic for deep learning,” arXiv preprint
arXiv:1603.07285, 2016.
[33] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi,
“High Fidelity Neural Audio Compression,” ArXiv,
vol. abs/2210.13438, 2022. [Online]. Available:
https://api.semanticscholar.org/CorpusID:253097788
[34] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and
K. Kumar, “High-Fidelity Audio Compression with
Improved RVQGAN,” NeurIPS, vol. 36, 2024.
[35] M. Pasini and J. Schlüter, “Musika! Fast Infinite Waveform
Music Generation,” in ISMIR, 2022.
[36] M. Defferrard, K. Benzi, P. Vandergheynst, and
X. Bresson, “FMA: A dataset for music analysis,”
in ISMIR, 2017. [Online]. Available:
https://arxiv.org/abs/1612.01840
[37] H. F. Garcia, P. Seetharaman, R. Kumar, and B. Pardo,
“VampNet: Music Generation via Masked Acoustic
Token Modeling,” ISMIR, 2023.
[38] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve,
Y. Adi, and A. Défossez, “Simple and Controllable
Music Generation,” NeurIPS, vol. 36, 2024.
[39] E. Law, K. West, M. I. Mandel, M. Bay, and J. S.
Downie, “Evaluation of Algorithms Using Games: The
Case of Music Tagging,” in ISMIR. Citeseer, 2009, pp.
387–392.
[40] D. Bogdanov, M. Won, P. Tovstogan, A. Porter,
and X. Serra, “The MTG-Jamendo Dataset for
Automatic Music Tagging,” in Machine Learning for
Music Discovery Workshop, International Conference
on Machine Learning (ICML 2019), Long Beach,
CA, United States, 2019. [Online]. Available:
http://hdl.handle.net/10230/42015
