AI-GENERATED SONG DETECTION VIA LYRICS TRANSCRIPTS
Markus Frohmann¹˒²  Elena V. Epure¹  Gabriel Meseguer-Brocal¹
Markus Schedl²˒³  Romain Hennequin¹
¹ Deezer Research, Paris, France  ² Johannes Kepler University Linz, Austria
³ Linz Institute of Technology, AI Lab, Austria
[email protected], {markus.frohmann, markus.schedl}@jku.at
ABSTRACT
The recent rise in capabilities of AI-based music generation tools has created an upheaval in the music industry, necessitating the creation of accurate methods to detect such AI-generated content. This can be done using audio-based detectors; however, it has been shown that they struggle to generalize to unseen generators or when the audio is perturbed. Furthermore, recent work used accurate and cleanly formatted lyrics sourced from a lyrics provider database to detect AI-generated music. However, in practice, such perfect lyrics are not available (only the audio is); this leaves a substantial gap in applicability in real-life use cases. In this work, we instead propose closing this gap by transcribing songs using general automatic speech recognition (ASR) models. Once transcribed, lyrics are again available in a text representation, and established AI-generated text detection methods can be applied. We do this using several detectors. The results on diverse, multi-genre, and multi-lingual lyrics show generally strong detection performance across languages and genres, particularly for our best-performing model using Whisper large-v2 and LLM2Vec embeddings. In addition, we show that our method is more robust than state-of-the-art audio-based ones when the audio is perturbed in different ways and when evaluated on different music generators.¹
1. INTRODUCTION
In recent years, the full generation of musical audio with artificial intelligence systems has matured [1–3] and is now widely deployed in commercial systems such as Suno, Udio, Stable Audio, and Riffusion.
The generation of this content presents new challenges to the music industry: revenue dilution for real artists due to AI-generated tracks earning profits [4], copyright infringement issues [5], lack of transparency for the end user, or catalog flooding for music streaming services. For instance, Deezer reported that 10% of the tracks delivered to them are generated by Udio or Suno, which is more than 10,000 tracks daily [6]. Therefore, there is a pressing and growing need to identify this AI-generated content. While signed metadata [7] and watermarking [8] standards have been proposed to monitor and certify the source and history of content, they have not yet been widely adopted by the music industry. The only largely deployable solution currently remains automatic detection of this content.

¹ Our code is available at https://github.com/deezer/robust-AI-lyrics-detection.

© M. Frohmann, E. Epure, G. Meseguer-Brocal, M. Schedl, R. Hennequin. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: M. Frohmann, E. Epure, G. Meseguer-Brocal, M. Schedl, R. Hennequin, “AI-Generated Song Detection via Lyrics Transcripts”, in Proc. of the 26th Int. Society for Music Information Retrieval Conf., Daejeon, South Korea, 2025.
Some recent methods [9,10] based on small CNN models have reported a very high detection accuracy (more than 99%) for detecting AI-generated music audio files. However, these methods, primarily based on leveraging low-level artifacts of audio neural decoders, do not generalize to unseen generation models. In addition, they are very sensitive to audio manipulations such as playback speed modifications and can thus be easily attacked [9,10].
Conversely, in [11], the authors propose a method for detecting AI-generated lyrics, which shows promising results. Thus, lyrics derived from the audio signal could be leveraged to detect AI-generated content. As lyrics are mainly independent of the audio generation models, lyrics information should be mostly invariant under audio manipulation and audio generation model changes. Leveraging lyrics should lead to detection models that are more robust and generalize better. However, the solution proposed in [11] relies on the existence of properly formatted lyrics, which, in practice, are not available; only the audio is.² This leaves a substantial gap in its applicability in real-life use cases.
In this paper, we confirm our hypothesis (that leveraging lyrics could lead to more robust and generalizable detection models) by proposing a new AI-generated music detection method that leverages lyrics directly from audio via transcribing, as depicted in Figure 1. We show that this method maintains the state-of-the-art performance of [11] across a diverse multi-genre corpus. Crucially, it does so despite the extra difficulty of transcription. Also, we show that it is robust to audio manipulation and exhibits promising signs of generalization to unseen models.
Finally, it should be noted that AI-generated audio and AI-generated lyrics are not perfectly correlated. Many commercial models support inputting human-written lyrics; thus, it is possible to get AI-generated music with sung human-written lyrics. Moreover, it is possible to generate lyrics with AI and have them performed by real singers and orchestras. Thus, the tasks of AI-generated lyrics detection and AI-generated music audio detection differ. However, the second case, real music with AI-generated lyrics, is less probable because investing significant resources in a professional recording with lyrics of limited perceived value is unlikely. Overall, we believe detecting AI-generated lyrics can help with monitoring AI-generated content, as tracks with entirely AI-generated lyrics are likely to be fully AI-generated. The task of AI-generated lyrics detection from audio would also have straightforward applications for publishers and copyright collecting societies to avoid providing royalties for AI-generated lyrics, as, for instance in the US, this content cannot be protected by copyright [12].

² Lyrics are normally not provided as metadata when ingesting music in an industrial setting, which particularly affects new songs.

[Figure 1: pipeline diagram. Song → Transcriber → Transcript → Feature Extraction → Features → MLP → Fake / Real]
Figure 1. Overview of our pipeline to detect AI-generated songs using lyrics. Using only its waveform, we get a song’s lyrics using a transcriber (e.g., Whisper). We then extract a feature vector from the transcript (e.g., with LLM2Vec), which is subsequently fed into an MLP-based detector to classify the song as real or fake. Only the MLP classifier is trained while the other components are used as-is without further training.
The paper is organized as follows: In Section 2, we provide an overview of relevant domains: AI-generated music, and the detection of AI-generated music and text. In Section 3, we describe our method. The experimental setup in Section 4 focuses on the presentation of the data, baselines, and evaluation metric used in this work. Then, we show results in Section 5 and conclude in Section 6.
2. RELATED WORK
AI music generation. Current AI music generation models typically rely on two main components working in sequence. The first is an autoencoder (AE) trained to compress raw audio into a more manageable representation, which can then be reconstructed into an audio signal. Today, advanced neural audio codecs such as Encodec [13], DAC [14], SoundStream [15], Music2Latent [16], and MusicLM [1] are commonly used, enabling higher-quality generation. The second key component involves training a model to predict and generate the compressed sequence over time, often based on text prompts. Two principal approaches dominate this stage: large language models (LLMs), as explored in works such as [1, 2, 15], and latent diffusion models, as seen in systems like Stable Audio [17, 18] and MusicLDM [19]. In simple terms, the AE is responsible for waveform synthesis, while the LLM or diffusion model ensures the generation of a coherent musical sequence over time. The commercial AI music-generation platforms, such as Suno, Udio, and Riffusion, generate entire songs, including lyrics conditioning. However, little is publicly known about the architectures of their underlying models.
AI-generated music detection. The first attempts to detect AI-generated content focused on detecting voice cloning [20, 21]. As these technologies continue to advance, it is increasingly difficult to distinguish cloned voices from real human performances. Consequently, researchers are focusing on identifying their presence in songs. A broader effort to detect AI-generated music has primarily focused on the AE component of music generators. Research such as [9, 10] aims to determine if a music sample has been synthesized by an artificial decoder, independent of its musical content, by identifying artifacts introduced during the encoding-decoding process. More recently, the work of [22] has sought to identify whether the music and/or lyrics have been AI-generated and to determine which specific component was generated. However, audio-based methods have been shown to be prone to perturbations in audio, such as time stretching or adding noise [9], making their practical usage difficult.
AI-generated text detection. Detecting whether a text is generated by AI has been widely researched. It is often framed as a supervised, binary classification challenge that consists of separating human-written content from machine-produced text [23, 24]. Most classifiers employ textual encoders such as RoBERTa or Longformer [23, 25–28] or LLMs [29–32]. However, this approach depends on having a sufficiently large training dataset, which is not always available, and it risks overfitting when faced with unfamiliar authorial styles or newly created generative models [33, 34]. A different strand of research aims to differentiate AI-generated text from human-written content by analyzing variations in generative model metrics or stylistic characteristics [35–38]. Although these methods have shown effectiveness, their performance may vary compared to supervised approaches, influenced by the generative model and dataset employed [28]. Moreover, some researchers have investigated watermark-based detection techniques [25–27]. While these approaches yield promising results, most existing watermarking insertion schemes still require direct access to model logits, which external users of API-restricted models such as GPT-4 typically lack [39]. Furthermore, [40] benchmark multiple AI-generated text detectors, showing that most of them exhibit high false positive rates, fail to generalize out-of-domain, and are vulnerable to different types of adversarial attacks, with each of the detectors showing different behavior.

While a plethora of research has explored AI-generated text detection across various domains, only [11] have explored detecting AI-generated lyrics, introducing a corpus of synthetic lyrics generated using several LLMs with human lyrics seeds spanning nine languages from musically diverse genres. In contrast, our aim is to detect AI-generated music via lyrics in a realistic scenario where only the audio is accessible.

Text encoder      Base model      Dimension
Retrieval-optimized conventional text encoders
MINILMV2          XLM-ROBERTA     1024
BGE-M3            XLM-ROBERTA     1024
Stylistic text encoders
UAR-CRUD          DISTILROBERTA   768
UAR-MUD           DISTILROBERTA   768
LLM-based text encoders
BGE-ML-GEMMA      GEMMA-2-9B      3584
LLM2VEC-LLAMA     LLAMA-3-8B      4096
Table 1. Overview of the text encoders used.

Proceedings of the 26th ISMIR Conference, Daejeon, Korea, September 21-25, 2025
3. METHOD
The proposed pipeline to identify directly from audio if lyrics are AI-generated or human-written is illustrated in Figure 1. First, the audio is processed by a transcription model to generate a lyrics transcript. Building upon previous research showing their effectiveness with lyrics [41, 42] and their robustness to audio modifications [43], we use pre-trained Whisper models, specifically Whisper-large-v2 [44] using the faster-whisper library [45]. The transcriptions are used as-is without any correction. We also experimented with various types of post-processing, such as text normalization, removing special characters, and stripping punctuation, but none of these improved detection performance. Moreover, as shown in Section 5, the performance gap between the best detector on ground-truth clean lyrics and transcribed noisy lyrics is small (less than 4 percentage points). This suggests that even raw transcripts are promising for realistic scenarios where only audio is available.
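The transcription step can be sketched as follows. This is a minimal illustration rather than the authors' exact code: it assumes the `faster_whisper` package is installed, leaves device and precision at the library defaults, and joins segments with newlines after trimming whitespace (the joining convention is our assumption; the text only states that transcripts are used without correction). The helper names `join_segments` and `transcribe_lyrics` are hypothetical.

```python
from typing import Iterable, Protocol


class Segment(Protocol):
    """Anything exposing a `text` attribute, e.g. a faster-whisper segment."""
    text: str


def join_segments(segments: Iterable[Segment]) -> str:
    """Concatenate segment texts into one transcript, skipping empty segments.

    No normalization or punctuation stripping is applied, only whitespace trimming.
    """
    lines = (seg.text.strip() for seg in segments)
    return "\n".join(line for line in lines if line)


def transcribe_lyrics(audio_path: str, model_size: str = "large-v2") -> str:
    """Transcribe a song with a pre-trained Whisper model (hypothetical wrapper)."""
    # Imported lazily so join_segments stays usable without the dependency.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size)
    segments, _info = model.transcribe(audio_path)
    return join_segments(segments)
```

Calling `transcribe_lyrics("song.mp3")` would return the raw transcript that is then passed to the text encoder; the file name is a placeholder.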
We then input the complete lyrics transcript into a pre-trained multilingual text embedding model to capture semantic, syntactic, and stylistic properties. For a fair comparison, the context window is set to 512 tokens for all models. Most lyrics fit within this limit, but in rare cases where the token count exceeds 512, we truncate the input. This step yields a single, contextualized vector-based representation of the lyrics. Following past research on synthetic lyrics detection [11], we test multiple types of text embedding models: (1) conventional text encoders optimized for retrieval [46–48]; (2) LLM-based encoders [48, 49]; and (3) text encoders designed to capture stylistic characteristics of text [50]. While these models could be further specialized for lyrics, we leave this as future work; although domain adaptation appears to help with the detection task [11], the overall performance gains remain modest, making it a lower priority for our work. We provide a summary of the different models in Table 1.
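To make the truncation and "single vector" steps concrete, here is a small sketch. The 512-token limit comes from the text; mean pooling over token vectors is an assumption made only for illustration, since the pooling strategy of each encoder is not stated here. Function names are hypothetical.

```python
from typing import List, Sequence

MAX_TOKENS = 512  # context window applied to every encoder, per the text


def truncate(token_ids: Sequence[int], max_tokens: int = MAX_TOKENS) -> List[int]:
    """Keep only the first `max_tokens` tokens of a transcript."""
    return list(token_ids[:max_tokens])


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-token vectors into one fixed-size vector (assumed pooling)."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / n for i in range(dim)]
```

With a real encoder, `truncate` would be handled by the tokenizer's own truncation option and the pooled vector would have the dimension listed in Table 1.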
Retrieval-optimized conventional text encoders. The first category, retrieval-optimized conventional text encoders, builds upon foundation models such as BERT [51] and MPNet [52] and addresses one of their key limitations: the high computational cost of tasks that require capturing semantic similarity between pairs of textual inputs [46]. In practice, the retrieval-optimized text encoder initializes a Siamese network with the weights of a foundation model such as BERT, which is then fine-tuned using a contrastive learning approach on pairs of similar texts. The output is a fixed-size vector.
Although we tested multiple models from the SENTENCE-TRANSFORMERS [46] library, we report only the best-performing one for the detection task: a model based on MINILM [53], a distilled version of XLM-ROBERTA [54], fine-tuned on over one billion similar sentence pairs.³ The second encoder, BGE-M3 [48], is built on the same foundation model, XLM-ROBERTA. BGE-M3 introduces several innovations with regard to data curation (for both the self-supervised pre-training and supervised fine-tuning for sentence similarity) and the training strategy, which relies on a self-knowledge distillation framework where multiple retrieval functions (embedding-based, aka dense retrieval; keyword-based, aka sparse retrieval) are learned together.
LLM-based text encoders. The second category entails LLM-based text encoders. LLMs are primarily designed for text generation, not text encoding, as they are decoder-based and trained autoregressively. However, recent work, such as LLM2Vec [49], has proposed using LLMs for text encoding, specifically via a three-step method to convert these models into text encoders. First, the causal attention mask is modified to allow bidirectional attention. Then, the model is pre-trained to adapt to this new attention mechanism with a Masked Next-Token Prediction (MNTP) objective. Optionally, the model is further trained with contrastive learning in an unsupervised way using SimCSE [47] to enhance sequence representation for downstream tasks. Specifically, the same input is subjected to multiple dropout masks to generate a pair of textual inputs used for sentence similarity fine-tuning. This approach has shown strong results, making autoregressive LLMs very effective for text encoding but with a higher computational cost and inference time.
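The first of the three steps, replacing the causal attention mask with a bidirectional one, can be illustrated with plain boolean matrices. This is a toy sketch of the masking idea only, not LLM2Vec's implementation; entry `[i][j]` being `True` means that position `i` may attend to position `j`.

```python
from typing import List

Mask = List[List[bool]]


def causal_mask(n: int) -> Mask:
    """Decoder-style mask: each position sees itself and earlier positions only."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n: int) -> Mask:
    """Encoder-style mask: every position sees every other position."""
    return [[True] * n for _ in range(n)]
```

For a three-token input, the causal mask is lower-triangular, while the bidirectional mask is all `True`, which is what lets the adapted model build a representation from the whole lyric at once.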
In our experiments, we used LLM2Vec based on LLAMA-3-8B and tested both the MNTP and SIMCSE versions. Since their performance was similar in our task (consistent with the findings in [11] on clean lyrics), we report results only for the MNTP model. Additionally, we included another LLM-based model in our experiments, similar to BGE-M3, but built on Gemma [55] and optimized for multilingual text similarity and retrieval tasks: BGE-ML-GEMMA [48].⁴

³ For more details, see https://www.sbert.net.

                        en    de    tr    fr    pt    es    it    ar    ja
Human-written   Train   30    30    30    30    30    30    30    30    30
                Test    450   392   339   450   450   450   211   354   338
AI-generated    Train   28    30    22    30    29    30    27    30    29
                Test    450   450   241   450   450   450   150   407   255
Table 2. Distribution of human-written and AI-generated lyrics by language used as the generation seed (ISO 639 codes).

fr   alternative, french, hip-hop, pop, r&b, rock
it   alternative, electronic, hip-hop, jazz, pop, rock
es   alternative, electronic, hip-hop, latin-american, pop, rock
tr   alternative, electronic, folk, hip-hop, pop, rock
en   alternative, electronic, hip-hop, pop, r&b, rock
de   alternative, edm, electronic, hip-hop, pop, rock
pt   christian, hip-hop, música popular brasileira, pop, samba-pagode, sertanejo
ja   alternative, asian, electronic, pop, rock, soundtrack
ar   alternative, arabic, electronic, hip-hop, pop, rock
Table 3. Genres used for each language (ISO 639 codes). Selected genres are the most often streamed ones per language.
Stylistic text encoders. The third category consists of stylistic text encoders, which have recently proven highly effective in identifying whether a text is human-written or generated by both open-source and closed-source LLMs [56]. Universal Authorship Attribution (UAR) models are trained to capture the author’s writing style, complementing the usual syntactic and semantic cues in embeddings [50]. In practice, a contrastive learning strategy is used to train a retrieval-optimized conventional text encoder to separate style from topic. A positive example consists of an input from the same author on a different topic, while a negative example is from another author but on the same topic. UAR exists in multiple variants, UAR-MUD and UAR-CRUD, trained on input from 1 or 5 million Reddit users, respectively.
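The positive and negative example rule described above can be sketched as a triplet-building routine. This is an illustrative toy, not the UAR training code: the `(author, topic, text)` record layout and the function name `make_triplets` are assumptions.

```python
from typing import List, Tuple

Post = Tuple[str, str, str]      # (author, topic, text), a hypothetical layout
Triplet = Tuple[str, str, str]   # (anchor, positive, negative)


def make_triplets(posts: List[Post]) -> List[Triplet]:
    """Pair each anchor with same-author/other-topic positives and
    other-author/same-topic negatives, so that style is separated from topic."""
    triplets = []
    for a_author, a_topic, a_text in posts:
        positives = [t for au, to, t in posts if au == a_author and to != a_topic]
        negatives = [t for au, to, t in posts if au != a_author and to == a_topic]
        for pos in positives:
            for neg in negatives:
                triplets.append((a_text, pos, neg))
    return triplets
```

A contrastive loss would then pull the anchor and positive embeddings together and push the negative away; only the sampling rule is shown here.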
Finally, on top of the already extracted features with frozen pre-trained models, we train a multi-layer perceptron (MLP) with two hidden layers of sizes 256 and 128, using the ReLU activation function. We optimize the model with AdamW [57], setting the learning rate to 1e−3 and reducing it by a factor of 0.1 if the training loss does not improve for 5 consecutive epochs. For all implementations, we use pytorch-lightning [58].
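The learning-rate rule can be written out as a few lines of framework-free Python. This is a hand-rolled stand-in meant to show the logic (start at 1e−3, multiply by 0.1 after 5 consecutive epochs without improvement in training loss); in practice one would rely on the scheduler shipped with the training framework, whose exact patience semantics may differ slightly. The class name `PlateauLR` is hypothetical.

```python
class PlateauLR:
    """Reduce the learning rate when the training loss stops improving."""

    def __init__(self, lr: float = 1e-3, factor: float = 0.1, patience: int = 5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, train_loss: float) -> float:
        """Record one epoch's training loss and return the learning rate to use next."""
        if train_loss < self.best:
            self.best = train_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the epoch's training loss keeps the rate at 1e−3 while the loss improves and drops it to 1e−4 after the fifth stagnant epoch in a row.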
4. EXPERIMENTAL SETUP
Below, we describe how we created the dataset, starting from the lyrics-only corpus proposed by [11]. Then, we briefly describe the baselines and the evaluation metrics used.

⁴ As of the time of writing, the detailed adaptation of Gemma in BGE-ML-GEMMA for semantic text similarity has not been fully disclosed.
4.1 Data
While several datasets with AI-generated audio exist, only the one by [11] contains AI-generated lyrics. It provides 3,704 real and 3,558 AI-generated lyrics using three LLM generators (Mistral, TinyLlama, and WizardLM2) and human lyrics spanning nine languages and the six most popular genres per language as seeds. Table 2 presents the distribution of lyrics by language used as the generation seed, as well as by source (AI-generated or human-written) and train/test split. The genres in Table 3 correspond to the six most present music genres in each language, covering the majority of streams per language according to the statistics provided by the Deezer music streaming service.
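As a quick consistency check, the per-language counts in Table 2 can be summed and compared with the corpus sizes quoted above. The numbers below are copied from Table 2; the variable names are ours.

```python
# Per-language counts from Table 2, in the table's column order.
HUMAN = {
    "train": [30, 30, 30, 30, 30, 30, 30, 30, 30],
    "test": [450, 392, 339, 450, 450, 450, 211, 354, 338],
}
AI = {
    "train": [28, 30, 22, 30, 29, 30, 27, 30, 29],
    "test": [450, 450, 241, 450, 450, 450, 150, 407, 255],
}


def total(counts: dict) -> int:
    """Sum the train and test counts over all languages."""
    return sum(sum(values) for values in counts.values())


human_total = total(HUMAN)  # 270 train + 3434 test
ai_total = total(AI)        # 255 train + 3303 test
```

Both totals agree with the 3,704 real and 3,558 AI-generated lyrics reported for the corpus, and they show that almost all of the data sits in the test split.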
However, the dataset proposed by [11] only provides lyrics; no audio is available. To enable realistic experiments representative of fully AI-generated music, we generate accompanying audio using Suno v3.5, conditioned on both lyrics and genre. It is capable of generating realistic songs of up to 4 minutes and represents the most recent stable Suno model. Crucially, we utilize the provided lyrics from [11] to ensure control over the content of the lyrics in the audio and to ensure diversity in terms of the generative model (LLM) used for the text modality.
For songs with human-written lyrics, we use the original audio. In experiments, we follow the same train/test split as introduced by [11]. Thus, our dataset contains diverse songs with (i) fake lyrics (LLM-generated) / fake audio (Suno-generated with lyrics conditioned from external LLMs) and (ii) real audio / real lyrics.
Moreover, to assess generalization abilities to unseen audio and its artifacts, we want to test our already trained model on a new, previously unseen audio generation model. To this end, we turn toward Udio, another popular music generation tool, and generate 260 songs with AI-generated lyrics from the test set of the same lyrics dataset. Here, we use the most recent udio-130 v1.5 model, setting lyrics strength to 100% to ensure that the provided lyrics are used as-is without major changes and the seed to 42. We leave the other settings at their default value. We then sample 260 real songs to maintain balance across languages and genres. We then evaluate the models trained on the Suno dataset on this out-of-domain scenario with both real and AI-generated songs.

Model                   en    de    tr    fr    pt    es    it    ar    ja    Macro Avg.
BASELINES
GT LYRICS (LLM2Vec) †   91.3  97.4  95.3  99.4  97.5  95.7  94.3  91.5  85.9  94.3
CNN (Spectrogram) ‡     97.5  96.3  97.5  98.7  98.8  97.0  98.0  94.4  98.4  97.4
TEXT-BASED DETECTORS VIA LYRICS TRANSCRIPTIONS
Retrieval-optimized conventional text encoders
MINILMV2                80.8  92.6  93.4  94.4  91.5  90.1  92.1  83.0  67.7  87.3
BGE-M3                  84.7  91.3  92.1  93.5  91.4  89.0  91.0  86.6  70.1  87.7
Stylistic text encoders
UAR-CRUD                81.9  92.1  92.4  94.3  92.2  87.4  90.8  85.9  76.4  88.2
UAR-MUD                 85.2  92.9  93.0  95.0  92.8  91.3  91.5  87.2  77.8  89.6
LLM-based text encoders
BGE-ML-GEMMA            84.4  94.4  92.9  96.6  93.0  91.8  92.9  88.3  78.0  90.2
LLM2VEC-LLAMA           90.6  94.6  93.5  96.5  92.6  91.2  92.9  87.8  77.0  90.7
Table 4. Recall scores for each language used as generation seed. We report the average over languages and the best lyrics-based model in bold per language. † denotes the best-performing baseline by [11], using non-transcribed ground truth (GT) lyrics with LLM2VEC-LLAMA. ‡ uses the amplitude spectrogram to train a CNN on the task, following [9].
4.2 Baselines
Our method considers a realistic scenario where only audio is available by leveraging transcribed lyrics to detect AI-generated music. We also include two strong baselines considering different scenarios. The first, GT LYRICS (LLM2Vec), uses ground truth (i.e., non-transcribed; as found in a lyrics booklet) lyrics with the text encoding method LLM2VEC and Llama-3-8B as its underlying feature extractor since this combination performed best in [11]. However, this assumes the availability of perfectly formatted lyrics. As with our models, an MLP is trained on the frozen output embeddings. Second, we follow [10] and train a lightweight CNN on the audio’s amplitude spectrogram directly on the task, aiming to detect audio artifacts.⁵
4.3 Evaluation
We evaluate our model’s performance using macro-recall, following [11, 28, 59], as it provides a realistic evaluation across the two classes and is suitable for balanced datasets like ours. Our focus is thus on minimizing false negatives (misclassifying real lyrics) and maximizing true positives (correctly identifying AI-generated lyrics).
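Macro-recall is the unweighted mean of the per-class recalls, so the human-written and AI-generated classes count equally regardless of how many songs each has. A minimal sketch, with a hypothetical function name and label encoding (1 for AI-generated, 0 for human-written):

```python
from typing import Hashable, Sequence


def macro_recall(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Average the recall of every class present in y_true."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    recalls = []
    for cls in sorted(set(y_true)):
        indices = [i for i, label in enumerate(y_true) if label == cls]
        hits = sum(1 for i in indices if y_pred[i] == cls)
        recalls.append(hits / len(indices))
    return sum(recalls) / len(recalls)
```

For example, with four AI-generated songs of which three are caught (recall 0.75) and two human-written songs of which one is kept (recall 0.5), the macro-recall is 0.625.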
5. RESULTS
We present the main experimental results in Table 4. As a reminder, the GT LYRICS (LLM2Vec) baseline refers to the method proposed by [11], which takes ground-truth lyrics as input and can be considered an upper baseline. The CNN method is a reimplementation of [10], serving as a baseline that does not rely on lyrics information.

⁵ Such models could also be trained on other input representations, but the findings of [10] are consistent across them. Hence, we resort to the best-performing one.
We observe only a slight performance decrease between the GT LYRICS (LLM2Vec) baseline and our proposed method using lyrics transcription when the text detection model is an LLM-based text encoder (BGE-ML-GEMMA and LLM2VEC-LLAMA). This indicates that the lyrics transcription block effectively captures lyrical information, which is further exploited by our proposed method in Figure 1 to classify whether a song is AI-generated or human-written when only audio is available.
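The size of this decrease can be recomputed from the two LLM2Vec rows of Table 4. The values below are copied from the table in column order; the variable names are ours.

```python
# Per-language recall from Table 4, in the table's column order.
GT_LYRICS = [91.3, 97.4, 95.3, 99.4, 97.5, 95.7, 94.3, 91.5, 85.9]    # ground-truth lyrics
TRANSCRIBED = [90.6, 94.6, 93.5, 96.5, 92.6, 91.2, 92.9, 87.8, 77.0]  # Whisper transcripts


def macro_avg(scores):
    """Unweighted mean over languages."""
    return sum(scores) / len(scores)


gap = macro_avg(GT_LYRICS) - macro_avg(TRANSCRIBED)
largest_gap = max(g - t for g, t in zip(GT_LYRICS, TRANSCRIBED))
```

The macro averages round to 94.3 and 90.7, so transcription costs roughly 3.5 points on average, while the largest single-language gap (the last column) is close to 9 points.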
Performance is largely consistent across the languages used as generation seeds, though we observe notable drops for Japanese and, to some extent, for Arabic and English as well. We analyze these errors and their sources in more detail in Section 5.3. The other types of text encoders show more modest performance, confirming that the LLM2Vec encoder is a better feature extractor for AI-generated lyrics detection on both ground-truth and transcribed lyrics. The CNN baseline outperforms the lyrics-based model on in-domain data. However, its performance declines on out-of-domain data, as discussed in the next section (§5.1).
5.1 Out-of-distribution Audio Generalization
We report the result of the out-of-distribution experiment in Table 5: Under basic audio manipulations, the CNN method shows a considerable performance drop for all transformations except time-stretching, similar to findings in [10]. In contrast, performance remains stable for lyrics-based detectors. Moreover, lyrics-based models generalize well to unseen audio generators. Specifically, when trained only on Suno-generated audio, they still detect Udio-generated audio with only a small performance drop. In contrast, the artifact-based CNN model experiences a significant decline, performing only slightly better than chance on Udio-generated content. This confirms the hypothesis that lyrics are largely unaffected by audio manipulation and can serve as a crucial extra cue for detecting fully generated content.

             AUDIO ATTACKS
Model        Stretch  Pitch  EQ    Noise  Reverb  UDIO
CNN          98.1     59.0   79.4  77.4   80.7    56.9
UAR-MUD      86.7     88.8   88.8  88.6   88.5    85.6
BGE-GEMMA    91.0     89.8   89.9  89.7   90.0    86.1
L2V-LLAMA    90.0     89.7   89.6  89.3   89.6    85.9
Table 5. Recall scores on out-of-distribution data (Udio) and when fake songs are perturbed (attacked) in different ways. We report average scores over languages.
5.2 Out-of-distribution Text Generalization
Next, we assess generalization abilities of each model w.r.t. text generators. For this, we keep music with lyrics generated by one of the models unseen from the training set and use it only at test time, while still using songs with lyrics from the other two models for training. We report the results of this experiment in Table 6, where columns indicate the models that are not used during training but only at test.
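The leave-one-generator-out protocol can be sketched as a simple filtering routine. The record layout (a `split` field and a `generator` field that is `None` for human-written songs) is an assumption for illustration, as is the choice to keep human-written songs in both the training and test portions of every fold.

```python
from typing import Dict, List, Optional, Tuple

GENERATORS = ("Mistral", "TinyLlama", "WizardLM2")
Song = Dict[str, Optional[str]]


def leave_one_generator_out(
    songs: List[Song], generators: Tuple[str, ...] = GENERATORS
) -> Dict[str, Tuple[List[Song], List[Song]]]:
    """Build one (train, test) fold per held-out lyrics generator."""
    folds = {}
    for held_out in generators:
        # Train on human-written songs and on lyrics from the other generators.
        train = [s for s in songs
                 if s["split"] == "train" and s["generator"] != held_out]
        # Test on human-written songs and on lyrics from the held-out generator only.
        test = [s for s in songs
                if s["split"] == "test" and s["generator"] in (None, held_out)]
        folds[held_out] = (train, test)
    return folds
```

Each fold corresponds to one column of Table 6: the key is the generator that the detector never saw during training.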
First, the artifact-based CNN model maintains a strong performance when the lyrics generation model is changed (95.3% on average). This was expected, as the lyrics generation model should not significantly affect the audio artifacts that the detection model relies on. However, performance drops on music with lyrics generated by TinyLlama and WizardLM2, which is somewhat surprising.
Considering lyrics-based models, their performance is more impacted by the set of lyrics generation models used in training. Yet, we still observe good generalization when the detection model is trained on data with lyrics from TinyLlama and WizardLM2 and tested on data with lyrics from Mistral (88.9% for LLM2VEC-LLAMA). A larger performance drop occurs when testing on TinyLlama or WizardLM2 after training with the other two text generators. Still, the performance remains decent and well above chance (unlike the audio-based CNN when tested on data from an unseen generator), indicating some generalizability to unseen generation models. Additionally, this suggests that training on diverse data from multiple text generation models (i.e., LLMs) is essential to maintain good performance on out-of-domain data.
5.3 Qualitative Analysis
In this section, we provide insights into the detectors’ performance in identifying AI-generated lyrics when various pairs of genres and languages are used as seeds in generation. As a reminder, the most listened music genres per language are listed in Table 3.
As shown in Table 4, the best-performing model, LLM2VEC-LLAMA, effectively distinguishes AI-generated from human-written music overall. However, when examining the results in detail, performance varies across genre-language pairs. In some cases, the model fails to detect a large portion (or even all) of the AI-generated content, particularly for Italian Jazz and Turkish Folk (both 0% recall), as well as Japanese Electronic (32%), Japanese Alternative (33%), and Japanese Soundtrack (35%) seeds. In contrast, some English genres, such as Alternative, Electronic, Pop, R&B, and Rock, exhibit higher false positive rates. This suggests that the model is more likely to misclassify music with human-written lyrics as AI-generated in these cases.

             TEXT GENERATION MODELS (LLMS) USED DURING TESTING
Model        Mistral  TinyLlama  WizardLM2  MACRO AVG.
CNN          99.4     92.6       94.0       95.3
UAR-MUD      85.2     71.3       78.1       78.2
BGE-GEMMA    86.5     74.8       79.5       80.3
L2V-LLAMA    88.9     75.7       80.2       81.6
Table 6. Recall scores of detectors (rows) when tested on lyrics from different generator LLMs (columns), following a leave-one-generator-out approach. Each model is trained on lyrics from the generators not shown in the respective column. We macro-average scores over languages.
The low recall for certain genre-language pair seeds, such as Italian Jazz and Turkish Folk, can be largely attributed to severe class imbalance. Specifically, these pairs have very few AI-generated examples in the dataset, suggesting that insufficient training data is the determining factor in the model’s poor performance. In contrast, the high false positive rate of some English genres may stem from stylistic similarities between human-written and AI-generated lyrics. A potential solution could be adjusting the detector’s classification threshold to be less aggressive in these cases, though we leave this for future work.
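The effect of a less aggressive threshold can be illustrated with a toy example. The scores below are invented for illustration only and do not come from the experiments; each score stands for the detector's predicted probability that a song is AI-generated.

```python
from typing import Sequence


def flagged_fraction(scores: Sequence[float], threshold: float) -> float:
    """Fraction of songs whose AI-probability reaches the threshold."""
    return sum(score >= threshold for score in scores) / len(scores)


# Hypothetical detector outputs for one genre-language pair.
human_scores = [0.10, 0.40, 0.55, 0.70]  # human-written songs
ai_scores = [0.60, 0.65, 0.80, 0.90]     # AI-generated songs

fpr_default = flagged_fraction(human_scores, 0.50)    # false positive rate
recall_default = flagged_fraction(ai_scores, 0.50)    # recall on AI songs
fpr_strict = flagged_fraction(human_scores, 0.75)
recall_strict = flagged_fraction(ai_scores, 0.75)
```

In this toy case, raising the threshold from 0.50 to 0.75 removes all false positives but halves the recall on AI-generated songs, which is the trade-off a per-genre threshold would have to balance.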
6. CONCLUSION
In this work, we proposed a robust and practical method to detect AI-generated music focused on lyrics, but using only audio. To achieve this, we first transcribe the songs, overcoming the reliance on perfect ground-truth lyrics. Features are then extracted using various text encoders, and a lightweight MLP classifier is trained on top of these features to identify AI-generated music. Experiments show that the proposed method works effectively, is more resilient to audio distortions than past audio-only detection methods, maintaining performance where others degrade, and generalizes well to unseen AI music generation models, as well as fairly well to unseen LLMs for lyrics generation. Future work should explore robustness to more complex audio manipulations and attacks, broader generalization to other AI music models once available, integrating alternative text encoders, including some fine-tuned on lyrics, and fusion with audio-based features for improved detection. By enabling reliable detection directly from audio, our work offers a practical tool to address copyright concerns and ensure transparency. Its robustness and ease of deployment make it a practical solution for AI music management in the industry.
7. ETHICS STATEMENT
Although designed for beneficial purposes such as safeguarding copyright and promoting transparency, the development and disclosure of AI detection systems pose complex ethical challenges. We acknowledge the dynamic landscape of this field and the need to catch up; as generative models evolve, their outputs may become statistically indistinguishable from human-created content. Detectors may be learning transient artifacts of current models, and their long-term utility is an open question. Furthermore, our method addresses fully AI-generated content, but the growing prevalence of hybrid human-AI collaborative workflows presents a more nuanced scenario that current detectors do not yet address. In general, misapplication of such tools could lead to unwarranted removals of content, disproportionately harming creators. Moreover, biases embedded in training data may also skew detection outcomes across different musical genres or languages. To address these immediate and long-term challenges, we urge ethical development and implementation of AI-generated content detectors, prioritizing transparency about their capabilities and limitations, equitable outcomes, and human-in-the-loop approaches to balance innovation with protections for artists, creators, and the integrity of the music industry.
8. ACKNOWLEDGEMENTS
This research was funded in whole or in part by the Austrian Science Fund (FWF): https://doi.org/10.55776/COE12, https://doi.org/10.55776/DFH23, https://doi.org/10.55776/P36413. The authors would like to thank Aurelien Herault, Manuel Moussallam, Yanis Labrak, and Gaspard Michel for their invaluable feedback on this work.
9. REFERENCES
[1] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel,
M. Verzetti, A. Caillon, Q. Huang, A. Jansen,
A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour,
and C. H. Frank, “Musiclm: Generating music from
text,” ArXiv, vol. abs/2301.11325, 2023. [Online].
Available: https://api.semanticscholar.org/CorpusID:256274504
[2] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant,
G. Synnaeve, Y. Adi, and A. Défossez, “Simple and
controllable music generation,” in Thirty-seventh Conference
on Neural Information Processing Systems, 2023.
[3] N. Ziqian, C. Huakang, J. Yuepeng, H. Chunbo,
M. Guobin, W. Shuai, Y. Jixun, and X. Lei,
“DiffRhythm: Blazingly fast and embarrassingly simple
end-to-end full-length song generation with latent
diffusion,” arXiv preprint arXiv:2503.01183, 2025.
[4] Joe Sparrow, “AI-generated song charts in Germany,
amid controversy,”
https://musically.com/2024/08/13/ai-generated-song-charts-in-germany-amid-controversy/,
2024, [Online; accessed 26-March-2025].
[5] RIAA Press statements, “Record Companies
Bring Landmark Cases for Responsible AI
Against Suno and Udio in Boston and New
York Federal Courts, Respectively,”
https://www.riaa.com/record-companies-bring-landmark-cases-for-responsible-ai-againstsuno-and-udio-in-boston-and-new-york-federal-courts-respectively/,
2024, [Online; accessed 26-March-2025].
[6] Daniel Tencer, “10,000 AI tracks uploaded daily
to Deezer, platform reveals, as it files two
patents for new AI detection tool,”
https://www.musicbusinessworldwide.com/10000-ai-tracks-are-uploaded-daily-to-deezer-platform-reveals-as-it-files-two-patents-for-new-ai-detection-tool/,
2025, [Online; accessed 26-March-2025].
[7] Coalition for Content Provenance and Authenticity (C2PA),
“C2PA specifications,” 2024. [Online]. Available:
https://c2pa.org/specifications/specifications/1.3/index.html
[8] Advanced Television Systems Committee, “ATSC standard:
Audio watermark emission,” 2024.
[9] D. Afchar, G. Meseguer-Brocal, and R. Hennequin,
“Detecting music deepfakes is easy but actually hard,”
ArXiv, vol. abs/2405.04181, 2024. [Online]. Available:
https://api.semanticscholar.org/CorpusID:269614314
[10] D. Afchar, G. Meseguer-Brocal, and R. Hennequin,
“Ai-generated music detection and its challenges,”
ICASSP 2025 - 2025 IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2025.
[11] Y. Labrak, M. Frohmann, G. Meseguer-Brocal,
and E. V. Epure, “Synthetic lyrics detection across
languages and genres,” in Proceedings of the 5th Workshop
on Trustworthy NLP (TrustNLP 2025), T. Cao,
A. Das, T. Kumarage, Y. Wan, S. Krishna, N. Mehrabi,
J. Dhamala, A. Ramakrishna, A. Galystan, A. Kumar,
R. Gupta, and K.-W. Chang, Eds. Albuquerque,
New Mexico: Association for Computational Linguistics,
May 2025, pp. 524–541. [Online]. Available:
https://aclanthology.org/2025.trustnlp-main.34/
[12] “Copyright act of 1976,” 1976.
[13] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi,
“High fidelity neural audio compression,” ArXiv, vol.
abs/2210.13438, 2022. [Online]. Available:
https://api.semanticscholar.org/CorpusID:253097788
[14] R. Kumar, P. Seetharaman, A. Luebs, I. Kumar,
and K. Kumar, “High-fidelity audio compression with
improved rvqgan,” in Advances in Neural Information
Processing Systems, vol. 36. Curran Associates,
Inc., 2023, pp. 27 980–27 993. [Online]. Available:
https://arxiv.org/abs/2306.06546
[15] N. Zeghidour, A. Luebs, A. Omran, J. Skoglund,
and M. Tagliasacchi, “Soundstream: An end-to-end
neural audio codec,” IEEE/ACM Transactions on
Audio, Speech, and Language Processing, vol. 30,
pp. 495–507, 2021. [Online]. Available:
https://api.semanticscholar.org/CorpusID:236149944
[16] M. Pasini, S. Lattner, and G. Fazekas, “Music2latent:
Consistency autoencoders for latent audio compression,”
in Proceedings of the 25th International Society
for Music Information Retrieval Conference. ISMIR,
Nov. 2024, pp. 111–119. [Online]. Available:
https://doi.org/10.5281/zenodo.14877289
[17] Z. Evans, C. Carr, J. Taylor, S. H. Hawley,
and J. Pons, “Fast timing-conditioned latent audio
diffusion,” ArXiv, vol. abs/2402.04825, 2024. [Online].
Available: https://api.semanticscholar.org/CorpusID:267523339
[18] Z. Evans, J. Parker, C. Carr, Z. Zukowski, J. Taylor,
and J. Pons, “Stable audio open,” ArXiv, vol.
abs/2407.14358, 2024. [Online]. Available:
https://api.semanticscholar.org/CorpusID:271310050
[19] K. Chen, Y. Wu, H. Liu, M. Nezhurina,
T. Berg-Kirkpatrick, and S. Dubnov, “Musicldm:
Enhancing novelty in text-to-music generation
using beat-synchronous mixup strategies,”
ICASSP 2024 - 2024 IEEE International Conference
on Acoustics, Speech and Signal Processing
(ICASSP), pp. 1206–1210, 2023. [Online]. Available:
https://api.semanticscholar.org/CorpusID:260438807
[20] Y. Zang, Y. Zhang, M. Heydari, and Z. Duan,
“Singfake: Singing voice deepfake detection,” in Proc.
IEEE International Conference on Acoustics, Speech
and Signal Processing (ICASSP). IEEE, 2024, pp.
1–5.
[21] D. Desblancs, G. Meseguer-Brocal, R. Hennequin, and
M. Moussallam, “From real to cloned singer identification,”
in Proceedings of the 25th International Society
for Music Information Retrieval Conference. ISMIR,
Nov. 2024.
[22] M. A. Rahman, Z. I. A. Hakim, N. H. Sarker, B. Paul,
and S. A. Fattah, “Sonics: Synthetic or not - identifying
counterfeit songs,” in International Conference on
Learning Representations (ICLR), 2025.
[23] X. Liu, Z. Zhang, Y. Wang, H. Pu, Y. Lan, and C. Shen,
“Coco: Coherence-enhanced machine-generated text
detection under low resource with contrastive learning,”
in Proceedings of the 2023 Conference on Empirical
Methods in Natural Language Processing, 2023,
pp. 16 167–16 188.
[24] F. Huang, H. Kwak, and J. An, “Toblend: Token-level
blending with an ensemble of llms to attack
ai-generated text detection,” arXiv preprint
arXiv:2402.11167, 2024.
[25] S. Abdelnabi and M. Fritz, “Adversarial watermarking
transformer: Towards tracing text provenance with
data hiding,” in 2021 IEEE Symposium on Security and
Privacy (SP). IEEE, 2021, pp. 121–140.
[26] M. Chakraborty, S. T. I. Tonmoy, S. M. Zaman, S. Gautam,
T. Kumar, K. Sharma, N. Barman, C. Gupta,
V. Jain, A. Chadha et al., “Counter turing test (ct2):
Ai-generated text detection is not as easy as you may
think-introducing ai detectability index (adi),” in
Proceedings of the 2023 Conference on Empirical Methods
in Natural Language Processing, 2023, pp. 2206–2239.
[27] J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers,
and T. Goldstein, “A watermark for large language
models,” in International Conference on Machine
Learning. PMLR, 2023, pp. 17 061–17 084.
[28] Y. Li, Q. Li, L. Cui, W. Bi, Z. Wang, L. Wang,
L. Yang, S. Shi, and Y. Zhang, “MAGE: Machine-generated
text detection in the wild,” in Proceedings
of the 62nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
L.-W. Ku, A. Martins, and V. Srikumar, Eds.
Bangkok, Thailand: Association for Computational
Linguistics, Aug. 2024, pp. 36–53. [Online]. Available:
https://aclanthology.org/2024.acl-long.3
[29] D. Macko, R. Moro, A. Uchendu, J. S. Lucas, M. Yamashita,
M. Pikuliak, I. Srba, T. Le, D. Lee, J. Simko
et al., “Multitude: Large-scale multilingual
machine-generated text detection benchmark,” in 2023
Conference on Empirical Methods in Natural Language
Processing, EMNLP 2023. Association for Computational
Linguistics (ACL), 2023, pp. 9960–9987.
[30] W. Antoun, D. Seddah, and B. Sagot, “From text to
source: Results in detecting large language
model-generated content,” in The 2024 Joint International
Conference on Computational Linguistics, Language
Resources and Evaluation (LREC-COLING 2024),
2024.
[31] Y. Chen, H. Kang, V. Zhai, L. Li, R. Singh, and B. Raj,
“Token prediction as implicit classification to identify
llm-generated text,” in Proceedings of the 2023
Conference on Empirical Methods in Natural Language
Processing, 2023, pp. 13 112–13 120.
[32] T. S. Kumarage, P. Sheth, R. Moraffah, J. Garland
et al., “How reliable are ai-generated-text detectors? an
assessment framework using evasive soft prompts,” in
The 2023 Conference on Empirical Methods in Natural
Language Processing, 2023.
[33] A. Uchendu, T. Le, K. Shu, and D. Lee, “Authorship
attribution for neural text generation,” in Proceedings
of the 2020 conference on empirical methods in natural
language processing (EMNLP), 2020, pp. 8384–8395.
[34] A. Bakhtin, S. Gross, M. Ott, Y. Deng, M. Ranzato,
and A. Szlam, “Real or fake? learning to discriminate
machine from human generated text,” arXiv preprint
arXiv:1906.03351, 2019.
[35] E. Mitchell, Y. Lee, A. Khazatsky, C. D. Manning, and
C. Finn, “Detectgpt: Zero-shot machine-generated text
detection using probability curvature,” in International
Conference on Machine Learning, 2023. [Online].
Available: https://api.semanticscholar.org/CorpusID:256274849
[36] J. Su, T. Zhuo, D. Wang, and P. Nakov, “Detectllm:
Leveraging log rank information for zero-shot detection
of machine-generated text,” in Findings of the
Association for Computational Linguistics: EMNLP
2023, 2023, pp. 12 395–12 412.
[37] B. Zhu, L. Yuan, G. Cui, Y. Chen, C. Fu, B. He,
Y. Deng, Z. Liu, M. Sun, and M. Gu, “Beat llms at their
own game: Zero-shot llm-generated text detection via
querying chatgpt,” in Proceedings of the 2023
Conference on Empirical Methods in Natural Language
Processing, 2023, pp. 7470–7483.
[38] V. S. Sadasivan, A. Kumar, S. Balasubramanian,
W. Wang, and S. Feizi, “Can ai-generated text be
reliably detected?” arXiv preprint arXiv:2303.11156,
2023.
[39] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya,
F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman,
S. Anadkat et al., “Gpt-4 technical report,” arXiv
preprint arXiv:2303.08774, 2023.
[40] L. Dugan, A. Hwang, F. Trhlík, A. Zhu, J. M.
Ludan, H. Xu, D. Ippolito, and C. Callison-Burch,
“RAID: A shared benchmark for robust evaluation
of machine-generated text detectors,” in Proceedings
of the 62nd Annual Meeting of the Association for
Computational Linguistics (Volume 1: Long Papers),
L.-W. Ku, A. Martins, and V. Srikumar, Eds.
Bangkok, Thailand: Association for Computational
Linguistics, Aug. 2024, pp. 12 463–12 492. [Online].
Available: https://aclanthology.org/2024.acl-long.674/
[41] L. Zhuo, R. Yuan, J. Pan, Y. Ma, Y. Li, G. Zhang,
S. Liu, R. B. Dannenberg, J. Fu, C. Lin, E. Benetos,
W. Chen, W. Xue, and Y.-T. Guo, “Lyricwhiz:
Robust multilingual zero-shot lyrics transcription by
whispering to chatgpt,” ArXiv, vol. abs/2306.17103,
2023. [Online]. Available:
https://api.semanticscholar.org/CorpusID:259287024
[42] O. Cífka, H. Schreiber, L. Miner, and F.-R. Stöter,
“Lyrics transcription for humans: A readability-aware
benchmark,” arXiv preprint arXiv:2408.06370, 2024.
[43] S. Katkov, A. Liotta, and A. Vietti, “Benchmarking
whisper under diverse audio transformations and
real-time constraints,” in International Conference on
Speech and Computer. Springer, 2024, pp. 82–91.
[44] A. Radford, J. W. Kim, T. Xu, G. Brockman,
C. McLeavey, and I. Sutskever, “Robust speech
recognition via large-scale weak supervision,” ArXiv,
vol. abs/2212.04356, 2022. [Online]. Available:
https://api.semanticscholar.org/CorpusID:252923993
[45] G. Klein, J. W. Kim, Y. Kim, and C. Delangue,
“faster-whisper: A reimplementation of openai’s whisper
model using ctranslate2,” Nov. 2023. [Online]. Available:
https://github.com/SYSTRAN/faster-whisper
[46] N. Reimers and I. Gurevych, “Sentence-BERT:
Sentence embeddings using Siamese BERT-networks,”
in Proceedings of the 2019 Conference on Empirical
Methods in Natural Language Processing and the 9th
International Joint Conference on Natural Language
Processing (EMNLP-IJCNLP), K. Inui, J. Jiang,
V. Ng, and X. Wan, Eds. Hong Kong, China:
Association for Computational Linguistics, Nov.
2019, pp. 3982–3992. [Online]. Available:
https://aclanthology.org/D19-1410
[47] T. Gao, X. Yao, and D. Chen, “SimCSE: Simple
contrastive learning of sentence embeddings,” in
Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, M.-F.
Moens, X. Huang, L. Specia, and S. W.-t. Yih,
Eds. Online and Punta Cana, Dominican Republic:
Association for Computational Linguistics, Nov.
2021, pp. 6894–6910. [Online]. Available:
https://aclanthology.org/2021.emnlp-main.552
[48] J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu,
“M3-embedding: Multi-linguality, multi-functionality,
multi-granularity text embeddings through
self-knowledge distillation,” in Findings of the Association
for Computational Linguistics: ACL 2024, L.-W.
Ku, A. Martins, and V. Srikumar, Eds. Bangkok,
Thailand: Association for Computational Linguistics,
Aug. 2024, pp. 2318–2335. [Online]. Available:
https://aclanthology.org/2024.findings-acl.137/
[49] P. BehnamGhader, V. Adlakha, M. Mosbach,
D. Bahdanau, N. Chapados, and S. Reddy,
“Llm2vec: Large language models are secretly
powerful text encoders,” 2024. [Online]. Available:
https://arxiv.org/abs/2404.05961
[50] R. A. Rivera-Soto, O. E. Miano, J. Ordonez, B. Y.
Chen, A. Khan, M. Bishop, and N. Andrews,
“Learning universal authorship representations,” in
Proceedings of the 2021 Conference on Empirical
Methods in Natural Language Processing, M.-F.
Moens, X. Huang, L. Specia, and S. W.-t. Yih,
Eds. Online and Punta Cana, Dominican Republic:
Association for Computational Linguistics, Nov.
2021, pp. 913–919. [Online]. Available:
https://aclanthology.org/2021.emnlp-main.70
[51] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova,
“BERT: Pre-training of deep bidirectional transformers