Composing with an AI-Generated Sound Corpus: Reflections on My Computer's Interpretation of Falling Down

Author: Sutanto, Stevie Jonathan

Publisher: Zenodo

DOI: 10.5281/zenodo.17306958

Source: https://zenodo.org/records/17306958/files/114.pdf

Composing with an AI-Generated Sound Corpus:
Reflections on My Computer’s Interpretation of Falling
Down
Stevie J. Sutanto
Faculty of Music
Universitas Pelita Harapan
Tangerang, Banten
[email protected]
Abstract
This paper reflects on the creation of My Computer’s Interpretation of Falling
Down, a composition developed using an AI-generated sound corpus. Rather than
using AI solely as a source of raw material, the work explores how generative
models can also shape compositional structure and relationships between sounds.
The process involved generating a corpus of motion-related sounds via text prompts
submitted to a text-to-audio model, then organizing and sequencing those sounds
through feature-based clustering. The result is a piece shaped through interaction
between language, system behavior, and listening. Motivated by curiosity
about human–machine collaboration, the work explores how this approach might
not only shape musical form but also reveal how a black-box generative model
interprets a constrained topic through its underlying biases using language that is
meaningful to humans, rather than adjusting abstract model parameters.
1 Introduction
My Computer’s Interpretation of Falling Down explores the sonic qualities of objects in motion
(falling, rolling, spinning, and sliding) through the lens of machine-generated sound. Rather
than sourcing sounds from real-world recordings or synthesis, the material was developed using an
AI-generated sound corpus constructed from a set of descriptive prompts. This approach examines
how generative audio models, particularly text-to-audio models, which are typically employed to
create single audio pieces, can be more fully integrated into the compositional process.
In this work, AI tools were tasked not only with producing sound material but also with contributing
to the shaping of their organization and formal development. This ties into broader discussions in
generative music research, which position AI systems not only as content generators but as potential
co-creative agents (Singh et al., 2024). That sense of discovery, led partly by the system’s own
logic, was central to how this piece took shape.
2 Working with a Text-to-audio Model
Text-to-audio models have quickly evolved, allowing artists to describe sound via language and
receive audio renderings. However, their creative use is often limited to isolated tasks, like generating
one-shot effects or filling sonic gaps. I aimed to push this further, viewing the AI-generated corpus
not just as sounds but as a vital component in the compositional framework.
This curiosity relates to a broader question about the potential of generative models: how might their
internal associations, unpredictable behavior, or imperfections be part of the compositional process?
How can a text-to-audio system, when utilized creatively, contribute not only to a piece’s timbral
surface but also to its structural logic?
Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025),
Brussels, Belgium, September 10th-12th
Recent research, such as that by Cherep et al. (2024), shows that AI-generated sound need not aim
for realism; it can offer abstract renderings that evoke ideas rather than reproduce them. Similarly,
Liu et al. (2025) highlight frameworks that position AI as a compositional collaborator, coordinating
diverse audio elements in structured ways. This project fits within these explorations while focusing
on a single, constrained world: objects in motion.
3 From Corpus to Composition
The generative process began with the use of a text-to-audio model accessed through the ElevenLabs
sound effects API.[1] In this project, I explored how such a commercial model could respond to a
more focused set of physical-motion scenarios, aiming to build a unified and thematically constrained
sound corpus.
To manage the diversity and specificity of the corpus, I created a simple Python script that
automatically generated descriptive prompts by combining variables such as material (e.g., wood, glass,
metal), object size (small, medium, large), type of motion (falling, sliding, bouncing, rolling), and
surface interaction (concrete, gravel, water, etc.). These phrases were then sent to the API, which
returned a range of short audio clips based on the textual input. This automated approach enabled
the model to respond with diverse sounds unified by a shared conceptual domain. Examples of the
generated prompts include:
• “A heavy wooden object sliding across a concrete floor”
• “A small glass ball spinning to a stop on metal”
• “Something rubbery bouncing quickly on gravel”
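The prompt-generation step can be illustrated with a minimal sketch. The paper does not reproduce the original script, so the vocabularies, the sentence template, and the function name `build_prompts` below are assumptions made for illustration only; the actual script evidently produced more varied phrasing than this single template.

```python
import itertools
import random

# Hypothetical vocabularies: the paper names only a few example values per variable.
MATERIALS = ["wooden", "glass", "metal"]
SIZES = ["small", "medium", "large"]
MOTIONS = ["falling", "sliding", "bouncing", "rolling"]
SURFACES = ["concrete", "gravel", "water"]


def build_prompts(materials=MATERIALS, sizes=SIZES, motions=MOTIONS, surfaces=SURFACES):
    """Return one descriptive prompt for every combination of the four variables."""
    prompts = []
    for size, material, motion, surface in itertools.product(
        sizes, materials, motions, surfaces
    ):
        # Assumed template; not the author's wording.
        prompts.append(f"A {size} {material} object {motion} on {surface}")
    return prompts


prompts = build_prompts()
print(len(prompts))  # 3 * 3 * 4 * 3 = 108 combinations

# Print a small reproducible sample of the generated prompts.
for prompt in random.Random(0).sample(prompts, 3):
    print(prompt)
```

Enumerating the full Cartesian product keeps the corpus evenly spread over the constrained domain; each prompt would then be submitted to the text-to-audio service, a step omitted here because the request format is not described in the paper.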
The generated audio was analyzed using perceptual and acoustic features, including spectral centroid,
brightness, duration, and envelope shape. Based on these features, I grouped the sounds into two
broad categories: impact gestures (short, transient events) and motion gestures (sustained sounds like
rolling or sliding).
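As a rough illustration of this analysis step, the sketch below computes a spectral centroid and an envelope-based "effective duration" for a clip, then assigns it to one of the two categories. The paper does not state how the features were computed or where the boundary between categories lay, so the naive DFT, the 10% envelope threshold, the 0.25 s cutoff, and all function names are assumptions; a real workflow would more likely rely on an audio analysis library.

```python
import cmath
import math


def spectral_centroid(samples, sample_rate):
    """Magnitude-weighted mean frequency in Hz, using a naive DFT (short clips only)."""
    n = len(samples)
    weighted = total = 0.0
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
        magnitude = abs(coeff)
        weighted += (k * sample_rate / n) * magnitude
        total += magnitude
    return weighted / total if total else 0.0


def effective_duration(samples, sample_rate, rel_threshold=0.1):
    """Seconds between the first and last sample above a fraction of the peak level."""
    peak = max(abs(x) for x in samples)
    if peak == 0:
        return 0.0
    active = [i for i, x in enumerate(samples) if abs(x) >= rel_threshold * peak]
    return (active[-1] - active[0] + 1) / sample_rate


def classify_gesture(samples, sample_rate, max_impact_s=0.25):
    """Assumed rule: short effective duration means impact, otherwise motion."""
    if effective_duration(samples, sample_rate) <= max_impact_s:
        return "impact"
    return "motion"


# Synthetic stand-ins for generated clips: a fast-decaying burst and a steady tone.
SR = 8000
impact = [math.exp(-i / 40) * math.sin(2 * math.pi * 1000 * i / SR) for i in range(4000)]
motion = [math.sin(2 * math.pi * 200 * i / SR) for i in range(4000)]

for name, clip in (("impact-like", impact), ("motion-like", motion)):
    # The centroid is taken over the first 256 samples to keep the naive DFT cheap.
    centroid = spectral_centroid(clip[:256], SR)
    print(name, classify_gesture(clip, SR), round(effective_duration(clip, SR), 3), round(centroid))
```

Measuring duration from the envelope rather than from file length matters here because a generated clip may contain a brief impact followed by near-silence.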
To connect these categories, I used a KDTree structure to efficiently search for acoustic similarity.
For each impact sound, the system retrieved nearby motion gestures in the feature space, that is, those
with comparable spectral and temporal characteristics. These pairings were not intended to be literal but
were designed to create perceptual continuity: a sense that one sound might logically follow another,
even if they originated from entirely different prompts.
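The retrieval step can be sketched with a small, dependency-free k-d tree. This is a stand-in for whichever library implementation was actually used, which the paper does not name. The feature values (spectral centroid in kHz, duration in seconds) and the clip labels are invented for illustration, and in practice the features would need to be scaled to comparable ranges before measuring distance.

```python
def build_kdtree(points, depth=0):
    """Build a k-d tree from a list of (feature_tuple, label) pairs."""
    if not points:
        return None
    axis = depth % len(points[0][0])  # cycle through the feature dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2
    return {
        "point": points[mid],
        "axis": axis,
        "left": build_kdtree(points[:mid], depth + 1),
        "right": build_kdtree(points[mid + 1:], depth + 1),
    }


def nearest(node, target, best=None):
    """Return (squared_distance, (features, label)) for the closest stored point."""
    if node is None:
        return best
    features, _label = node["point"]
    dist = sum((a - b) ** 2 for a, b in zip(features, target))
    if best is None or dist < best[0]:
        best = (dist, node["point"])
    diff = target[node["axis"]] - features[node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, target, best)
    # Search the far side only if the splitting plane is closer than the best match.
    if diff ** 2 < best[0]:
        best = nearest(far, target, best)
    return best


# Hypothetical motion gestures described by (centroid in kHz, duration in seconds).
motion_gestures = [
    ((0.8, 2.0), "rolling_wood"),
    ((3.5, 1.2), "sliding_glass"),
    ((1.9, 3.0), "rolling_metal"),
    ((2.6, 0.9), "sliding_gravel"),
]
tree = build_kdtree(motion_gestures)

# Hypothetical impact gesture to be followed by its closest motion gesture.
impact_features = (3.2, 1.0)
distance, (features, label) = nearest(tree, impact_features)
print(label, features)
```

A nearest match of this kind is only a proposal: as described in the text, the pairings were treated as suggestions to audition rather than as fixed sequencing decisions.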
The resulting sequences emerged through a balance of algorithmic association and listening-based
curation. I treated the system’s suggestions not as fixed solutions but as proposals: starting points
for exploring connections, questioning assumptions, and shaping the piece through attention to what
the material seemed to offer. In this way, the form of the composition was not imposed in advance,
but gradually discovered through interaction with the corpus and the behavior of the system itself.
This method resonates with corpus-based (Schwarz, 2007) and AI-assisted composition practice, but
with a key difference: the sound materials were not drawn from an archive of real-world recordings
or instrumental samples. This process involved constructing the corpus from the ground up through
interaction with a generative model, serving as both a dataset and a creative environment.
4 Challenges
Working with AI-generated sound corpora presented a number of challenges, some technical, others
more conceptual. One recurring difficulty was the unpredictability of the model’s output. While the
structure of the prompts provided a degree of control, the results were not always consistent or clearly
aligned with the intended motion. Some sounds arrived overly abstract, while others felt distant from
the physical behaviors I had in mind. At first, this unpredictability felt like a limitation, but over
time I began to view it as part of the material’s expressive range. Many of the textures that became
essential to the final piece emerged precisely from these unexpected responses.
There was also a question of authorship. While I designed and curated the system, many of the
specific sonic decisions (the choice of gestures, their order, their timing) were shaped by the model’s
interpretations and clustering. Rather than seeing this as a loss of control, I approached it as a kind of
co-composition, where the system’s proposals acted as a stimulus for creative response.
[1] https://elevenlabs.io/sound-effects
5 Reflection
This composition is a small step in exploring how AI-generated sound corpora might be used not
only as a source of material but also as a method of shaping musical form. By inviting a generative
system to take part in both sound production and structural organization, I hoped to test what kinds of
musical thinking might emerge.
The result is a piece shaped as much by listening as by designing: listening to the outputs of
the model, to the relationships between sounds, and to the way form began to take shape through
these interactions. Rather than aiming for a demonstration of technical novelty, the work leans into
the uncertainties of the process: the occasional mismatches, the unlikely pairings, the surprising
continuities that emerged through the system’s internal logic.
While this work is still at an exploratory stage, I see the approach as a step toward more integrated
uses of AI in sound composition, where generative models contribute not only to what we hear but also
to how we imagine and shape the spaces between sounds. At the same time, this method offers a way to
explore how a black-box generative model interprets specific topics and reveals underlying biases through
language that is meaningful to humans, rather than by tuning abstract model parameters.
References
Cherep, M., Singh, N., and Shand, J. (2024). Creative text-to-audio generation via synthesizer
programming. arXiv preprint arXiv:2406.00294.
Liu, X., Zhu, Z., Liu, H., Yuan, Y., Huang, Q., Cui, M., Liang, J., Cao, Y., Kong, Q., Plumbley,
M. D., et al. (2025). WavJourney: Compositional audio creation with large language models. IEEE
Transactions on Audio, Speech and Language Processing.
Schwarz, D. (2007). Corpus-based concatenative synthesis. IEEE Signal Processing Magazine,
24(2):92–104.
Singh, N., Mishra, M., and Machover, T. (2024). AI for Musical Discovery. An MIT Exploration of
Generative AI. Publisher: MIT.
