
Music for Multimodal Agents – Experiment in Improvisation and Feedback Loop

Author: Jussila, Veera; Yau, Derek
Publisher: Zenodo
DOI: 10.5281/zenodo.17306372
Source: https://zenodo.org/records/17306372/files/89.pdf
Proceedings of the 6th Conference on AI Music Creativity (AIMC 2025),
Brussels, Belgium, September 10th-12th
Veera Jussila
Artist, Machine Learning Engineer
[email protected]
Derek Yau
Multi-instrumentalist
[email protected]
Abstract
Music for AI Audience is a live, collaborative piece between custom deep learning software and a musician. In the 10-15 minute piece, multimodal LLM agents observe an improvised sound performance and let their minds wander through digital archives. This machine listening and machine seeing results in outputs that, in turn, act as prompts for the musician in the loop. The piece explores not only the creative possibilities of improvising in front of a non-human audience, but also the potential and limits of machine sentience in experiencing art.
1 Introduction
Music for AI Audience is a live performance based on a feedback loop between an improvising musician and deep learning software. It investigates a form of performance which is primarily aimed at a non-human audience. Instead of controlling an AI system, the artist lets themselves be inspired by the reactions of the LLM-driven audience. The piece builds on Jussila’s research in subverting AI systems to introduce new forms of creative communication, and extends Yau’s practice in improvising.
2 Behind the scenes of AI audience
2.1 Agents as probe of investigation
The software is a Python application programmed by Jussila. The starting point was to explore the sentient potential of LLM agents within a live performance. In the AI piece The Seeker, Thompson (2019) used machine vision to create a non-human entity that described the world through security cameras. Music for AI Audience investigates the more immediate, intimate potential of improvising for agents.
Agents are AI systems that use text prompts and LLMs to understand and implement tasks. Using Päpper’s (2023) repository as an early reference, the current setup uses two agents to form the audience.
2.2 Audience with different sensing capabilities
We can refer to the agents as Associator and Crawler. HuggingFace’s (2025) smolagents library is used for orchestration, with the engine LLM set to Qwen’s (2024) Qwen2.5-Coder-32B-Instruct.
Associator uses an OpenCV video feed as its input. The tool of this agent is image search. Using Unum Cloud’s (2023) lightweight uform-gen model, it captions the webcam frames, describing the musician, the instrument and the ambience. This caption is embedded using the Sentence Transformers (Reimers and Gurevych, 2019) implementation of OpenAI’s (2021) CLIP model, and the embedding is used to return the best matching image from the database. The agent’s text generations are visible on the terminal, and snapshots of the musician and the resulting image associations pop up on screen, too.
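The retrieval step described for Associator (caption, embed, return the closest image) can be illustrated with a minimal sketch. This is not the piece's actual code: the `embed` function below is a toy bag-of-words stand-in for the CLIP encoder, and the archive entries and file names are hypothetical. Only the nearest-neighbour logic by cosine similarity is meant to carry over.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a CLIP text encoder: a unit-length bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    # Both vectors are unit length, so the dot product equals the cosine similarity.
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


def best_match(caption, database):
    """Return (image_id, score) for the archive entry closest to the caption."""
    query = embed(caption)
    scored = ((image_id, cosine(query, vec)) for image_id, vec in database.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical archive. In a CLIP-based setup these would be image embeddings;
# here each image is represented by the embedding of a short description.
ARCHIVE = {
    "cello_in_concert_hall.jpg": embed("a cello player in a concert hall"),
    "forest_at_dawn.jpg": embed("a misty forest at dawn"),
    "city_traffic.jpg": embed("cars in city traffic at night"),
}

image_id, score = best_match("a musician playing a cello in a dark room", ARCHIVE)
print(image_id, round(score, 3))
```

With a real encoder, text and images share one embedding space, so the same `best_match` comparison would apply to vectors produced from the image files themselves.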
Crawler utilises microphone input. The tool of this agent is web search. The agent records 10-second snapshots of the music and captions these using a transformer-based model by Doh et al. (2023), describing the instrument and ambience of the performance. Using this caption, the agent searches for related content online. The agent’s text generations are visible on the terminal, and the browsing sessions pop up on screen (Figure 1).
Figure 1: Screenshot of AI audience outputs.
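Crawler's listening step (record a 10-second snapshot, caption it, turn the caption into a search) can be sketched as follows. Everything here is an illustrative assumption: `caption_audio` is a hypothetical placeholder for the captioning model, the sample rate is kept artificially low so the example stays small, and no microphone or web request is involved.

```python
def ten_second_snapshots(samples, sample_rate, seconds=10):
    """Split a mono sample sequence into consecutive fixed-length snapshots."""
    window = sample_rate * seconds
    # A trailing partial window is dropped so every snapshot has the same duration.
    return [samples[i:i + window] for i in range(0, len(samples) - window + 1, window)]


def caption_audio(snapshot):
    """Hypothetical placeholder for a music captioning model (peak-level heuristic)."""
    peak = max((abs(s) for s in snapshot), default=0.0)
    if peak < 0.05:
        return "near silence with a faint hum"
    return "a bowed string instrument in a resonant room"


def search_query(caption):
    """Turn a caption into the text an agent could pass to a web search tool."""
    return f"related to: {caption}"


SAMPLE_RATE = 100  # assumed toy rate; real audio would use e.g. 16000 or 44100 Hz
recording = [0.0] * 1000 + [0.5] * 1500  # 10 s of silence, then 15 s of sound

snapshots = ten_second_snapshots(recording, SAMPLE_RATE)
queries = [search_query(caption_audio(s)) for s in snapshots]
for query in queries:
    print(query)
```

Note that the silent snapshot still yields a caption and a query, which matches the observation that moments of silence are analysed as input like any other.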
The current setup forms a minimum viable product (MVP). The makeup of the AI audience is very modular, as its sensing capabilities and actions can be changed easily.
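One way to picture this modularity is to treat each audience member as a pairing of a sensing function and an action, either of which can be swapped independently. The sketch below is an assumption about structure, not the authors' implementation: the `AudienceMember` class and the lambda stand-ins are hypothetical, and it does not use the smolagents API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AudienceMember:
    """Hypothetical audience member: one sensing capability plus one action."""
    name: str
    sense: Callable[[str], str]  # raw input -> caption
    act: Callable[[str], str]    # caption -> output shown to the musician

    def react(self, raw_input):
        caption = self.sense(raw_input)
        return f"{self.name}: {self.act(caption)}"


# Stand-ins for the captioning models and tools described in the text.
associator = AudienceMember(
    name="Associator",
    sense=lambda frame: f"caption of {frame}",
    act=lambda caption: f"image search for '{caption}'",
)
crawler = AudienceMember(
    name="Crawler",
    sense=lambda audio: f"caption of {audio}",
    act=lambda caption: f"web search for '{caption}'",
)

audience = [associator, crawler]
inputs = {"Associator": "webcam frame", "Crawler": "audio snapshot"}
outputs = [member.react(inputs[member.name]) for member in audience]

# Changing an action only requires replacing one callable.
crawler.act = lambda caption: f"generated fiction about '{caption}'"
swapped = crawler.react("audio snapshot")
print(outputs, swapped)
```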
3 Going live
The 10-15 minute performance is a collaboration between Jussila and Yau. The current demo consists of 1) Jussila navigating a laptop with a USB webcam and external microphone, 2) Yau playing a cello, and 3) a laptop projection on the wall.
The musician observes the projection to see the agents’ outputs and uses them as prompts. The projection shows agent-generated reasoning and captions, alongside the image associations and web searches run by the agents, plus occasional visual snapshots of the performance itself.
This feedback loop unlocks a new dimension in live creation: instead of reading the audience’s body language or taking cues from fellow performers, the musician becomes part of the chain where their own output and body language power the visual prompts (Figure 2). The musician is encouraged to use methods which aren’t strictly traditional, but instead involve colouring the sound via extended classical techniques, effects and layering to evoke interest and variety in agent output, such that the AI audience becomes part of the performance itself.
Figure 2: Music for AI Audience, live performance. Picture from video by Thomas Rosser.
There is a meta element to this work, as Music for AI Audience is being watched by a human audience. Sometimes Jussila directs the webcam to the human audience, leading the visual agent to analyse the observers.
The piece had its premiere in London in September 2024.
4 Discussion
4.1 Musician as part of prompting loop
Interesting results can be observed when the musician allows themselves to be prompted like agents, by agents running on prompts. For example, the AI audience won’t tolerate much repetition: otherwise the outputs start to repeat themselves and become less valuable as prompts. Thus, the feedback loop pushes the musician to explore sound space that intrigues the machine listener. Additionally, moments of silence are analysed as input, sometimes resulting in thrilling interpretations: the sound of the hard-working GPU detected as a Tibetan singing bowl.
Secondly, Music for AI Audience investigates machine sentience within live art experience, a realm that is thought to be dominantly human. In industry, part of AI’s attraction is in the (seeming) lack of ambiguity (McQuillan, 2019, p. 164). Creative computing, on the contrary, thrives in this grey area. In Museum of Borderlands, Jussila (2020) demonstrated how a deep learning classifier struggled to separate images of flowers from viruses, inviting a conversation about AI classifying ambiguous things as threats. Small datasets (Jussila, 2022) are another way to explore the expressive capabilities of AI. Broad (2019) has explored generative space by training models without data. Brain (2025) talks about eccentric engineering, where experimental software can be used to probe systems with more focus on non-human actors and environment. Music for AI Audience plays with the tension between agentic autonomy and the (human-induced) limitations of machines’ dreaming capabilities.
4.2 Future work
Getting the agents to react to each other will open a whole new path of research. Jussila has already experimented with some pre-concert conversation by agents. It is also possible to build more developed audience members with personal history and music knowledge.
Next tasks include adding more advanced listening and seeing models to add depth to the agents’ experience, and creating more activities the audience engages in (image generation with diffusion, generating fiction, clicking deeper into web results). The agents could also observe a full improv orchestra. One option is to update the audience to react with sound, changing the dynamic to jamming. Already in its current form, Music for AI Audience integrates acoustic, minimal improvisation into the emerging field of creative computing.
References
Brain, T. (2025) Eccentric Engineering. Available at: https://tegabrain.com/Eccentric-Engineering (Accessed: 29 March 2025).
Broad, T. (2019) un(stable) equilibrium. Available at: https://terencebroad.com/works/unstable-equilibrium (Accessed: 29 March 2025).
Doh, S., Choi, K., Lee, J., and Nam, J. (2023) LP-MusicCaps: LLM-Based Pseudo Music Captioning. Available at: https://arxiv.org/pdf/2307.16372 (Accessed: 29 March 2025).
HuggingFace (2025) smolagents [Software]. Available at: https://github.com/huggingface/smolagents (Accessed: 5 April 2025).
Jussila, V. (2020) Museum of Borderlands. Available at: https://www.veerajussila.com/code-and-art/museum-of-borderlands (Accessed: 29 March 2025).
Jussila, V. (2022) Small datasets. Available at: https://www.veerajussila.com/code-and-art/talk-small-datasets (Accessed: 29 March 2025).
McQuillan, D. (2019) ‘The Political Affinities of AI’, in A. Sudmann (ed.) The Democratization of Artificial Intelligence: Net Politics in the Era of Learning Algorithms. Bielefeld: transcript Verlag, pp. 163–173. Available at: https://library.oapen.org/handle/20.500.12657/43874 (Accessed: 29 March 2025).
OpenAI (2021) CLIP [Software]. Available at: https://github.com/openai/CLIP (Accessed: 29 March 2025).
Päpper, M. (2023) LLM Agents [Software]. Available at: https://github.com/mpaepper/llm_agents (Accessed: 29 March 2025).
Qwen (2024) Qwen2.5-Coder-32B-Instruct [Software]. Available at: https://huggingface.co/Qwen/Qwen2.5-Coder-32B-Instruct (Accessed: 5 April 2025).
Reimers, N. and Gurevych, I. (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Available at: https://arxiv.org/abs/1908.10084 (Accessed: 7 April 2025).
Thompson, N. (2019) The Seeker. Available at: https://nyethompson.net/works/the-seeker.html (Accessed: 29 March 2025).
Unum Cloud (2023) UForm [Software]. Available at: https://github.com/unum-cloud/uform (Accessed: 29 March 2025).