Token Telephone
HUGO FLORES GARCÍA, Northwestern University, USA
STEPHAN MOORE, Northwestern University, USA
Additional Key Words and Phrases: interactive sound installation, neural networks, generative AI, spatial sound
ACM Reference Format:
Fig. 1. Token Telephone is a co-creative AI sound installation where participants interact with a chain of generative AI models,
initiating a generative game of telephone. The installation space is circled by four neural networks, each represented by a loudspeaker.
Participants make sounds into a microphone at the entrance to the installation space. Their sounds are iteratively transformed by
each neural network in a feedback loop, deviating further from the original with every pass. This iterative process reveals patterns
between the input and the training data of the networks, morphing human utterances into new and unexpected sound textures.
Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Copyright remains with the author(s).
Hugo Flores García and Stephan Moore. 2024. Token Telephone. 1, 1 (September 2024), 4 pages.
Music Proceedings of the International Conference on New Interfaces for Musical Expression
NIME’24, 4–6 September, 2024, Utrecht, The Netherlands
1 PROGRAM NOTES
Token Telephone is a co-creative AI sound installation. Participants enter a space equipped with a microphone and a
quartet of generative sound neural networks, each represented by a loudspeaker. Upon vocalizing into the microphone,
the participants’ utterance is transformed into neural acoustic tokens and played back, initiating a game of telephone
between the neural networks. Each network encodes, processes and reconstructs the sound, distorting the original
utterance into new textures guided by the network’s training data. The newly reconfigured sound is then passed to the
next network/loudspeaker in a clockwise direction, and the process repeats. The sound produced by the fourth network
is passed back to the first network in the cycle, creating a feedback loop wherein the original utterance incrementally
loses all of its original characteristics and disintegrates into textures that reflect the inherent biases of the generative
models in play. In time, the resonant properties of the processes are revealed in front of the participant. Inspired by the
popular children’s game of telephone, Token Telephone illuminates the gradual formation of hallucinations through
the iterative processing and re-processing of audio, reflecting the biases introduced by the model’s understanding of
sound objects, as well as the data that was provided to it.
2 MEDIA LINKS
To listen to a stereo demonstration of Token Telephone, visit the following YouTube link:
• video: https://youtu.be/ EaYoEg SUo
3 PROJECT DESCRIPTION
3.1 Motivation
Telephone is a popular children’s game in which children try to communicate a message through a noisy information
channel. Humans are lossy information machines, and they do not store the utterance they hear as a raw audio signal
in their brains but rather a compressed representation that contains a mix of semantic (what was said) and acoustic
(what that sounded like) information.
When we repeat a spoken utterance from memory, we are forced to rebuild an acoustic signal from the lossy
representation stored in our memory. This means that we may hallucinate words that were not there, transforming the
meaning of the original message we mean to pass along.
Unlike the traditional game, this installation employs neural networks instead of humans to encode, distort, and
regenerate sound. It illustrates the fascinating and often unpredictable ways AI interprets and manipulates patterns in
the input data to generate new sounds.
Audiences engaging with Token Telephone will be able to hear, iteratively and in real time, the formation of
hallucinatory audio information. The sounds are themselves compelling as they imitate and amplify the rhythms and
nuances of our vocalizations. But beyond the aesthetic interest of its output, this installation provides a rare opportunity
to hear a generative neural network at work. As AI moves into our daily lives, there may be some value in understanding
how the biases inherent in the data sets we use for training these systems can influence their output.
3.2 Interaction and Underlying Process
Refer to Figure 1 for a diagram of the installation layout. The installation is quadraphonic, with four speakers placed in
a ring around the room, with a microphone at the entrance to the installation space. Each speaker in the room embodies
a co-creative generative sound agent capable of receiving a sound, encoding it, and reconstructing it from its lossy
neural acoustic token representation.
Fig. 2. Illustration of the tokenization -> corruption (masking) -> generation process occurring at each speaker in the installation.
Upon entering the installation space, the participant is greeted with a microphone to utter any sound they like.
Shortly after, the participant’s utterance starts playing in a loop on the speaker nearest to them, which sets off the
game of telephone.
When the player’s utterance enters the telephone chain, it is encoded into a sequence of neural acoustic tokens [1, 2].
These tokens are a compressed encoding of the audio signal, and they are used by a generative sound model (VampNet) to
reconstruct the encoded sounds. These generative acoustic tokens are organized hierarchically, where “coarse” tokens
loosely encode higher-level information about a signal like its rhythmic structure, while “fine” tokens may represent
high frequency details and other characteristics that further define a sound event.
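One common way a coarse-to-fine token hierarchy can arise is residual quantization, which the title of [2] (RVQGAN) hints at. The sketch below is a toy scalar version written only to illustrate the idea: each level quantizes what the previous level left over, so early codes carry the broad shape and later codes add detail. The function names, step sizes, and the example value are assumptions made for illustration and are not taken from the codec used in the installation.

```python
def residual_quantize(value, step_sizes):
    """Quantize `value` level by level; each level encodes the remaining residual."""
    codes = []
    residual = value
    for step in step_sizes:
        code = round(residual / step)   # nearest multiple of this level's step
        codes.append(code)
        residual -= code * step         # pass what is left to the next, finer level
    return codes


def reconstruct(codes, step_sizes):
    """Rebuild an approximation from however many levels of codes are supplied."""
    return sum(code * step for code, step in zip(codes, step_sizes))


# Hypothetical step sizes: coarse first, fine last.
steps = [0.5, 0.1, 0.01]
codes = residual_quantize(0.737, steps)

coarse_only = reconstruct(codes[:1], steps[:1])   # broad approximation
all_levels = reconstruct(codes, steps)            # refined approximation
```

Reconstructing from the first code alone gives a rough approximation, and each additional level reduces the error, which mirrors the way coarse tokens loosely fix structure while fine tokens add detail.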
While the tokens are being passed around the chain, a percentage of these tokens are “corrupted” (i.e. masked),
meaning that the generative model will have to infer and fill in the missing spots, resorting to its learned model of its training
data to reconstruct the missing tokens. As the uttered sound undergoes many passes through the token telephone loop,
the sound retains some of its original characteristics and rhythm, but the sound identities are incrementally transformed
from human speech to the sounds and patterns present in the model’s training data.
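The corruption step can be sketched in a few lines. This is a minimal illustration, assuming tokens are plain integers, the mask is chosen uniformly at random, and the generative model is replaced by a stand-in that draws from a small hypothetical vocabulary. The names `corrupt`, `fill_in`, and `model_vocabulary`, as well as the 50% ratio, are invented for this example and do not describe VampNet's actual masking schedule.

```python
import random

MASK = None  # placeholder marking a corrupted (masked) token position


def corrupt(tokens, mask_ratio, rng):
    """Return a copy of `tokens` with a fraction of positions replaced by MASK."""
    n_masked = round(len(tokens) * mask_ratio)
    masked_positions = set(rng.sample(range(len(tokens)), n_masked))
    return [MASK if i in masked_positions else t for i, t in enumerate(tokens)]


def fill_in(corrupted, vocabulary, rng):
    """Stand-in for the generative model: fill masked slots from its own vocabulary."""
    return [rng.choice(vocabulary) if t is MASK else t for t in corrupted]


rng = random.Random(0)                 # seeded so the sketch is repeatable
utterance = list(range(100, 120))      # 20 hypothetical tokens from the participant
model_vocabulary = [900, 901, 902]     # hypothetical tokens the model "prefers"

corrupted = corrupt(utterance, mask_ratio=0.5, rng=rng)
regenerated = fill_in(corrupted, model_vocabulary, rng)
```

The unmasked positions survive unchanged, which is why rhythm and gesture persist, while every masked slot is filled from the model's side, which is where the drift toward the training data comes from.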
This resulting audio transformation makes the installation feel like a voice-controlled interface for musical expression,
where the sonic gestures made by the participant are preserved by the generative algorithm, but the actual contents
and perceptual identity of the sound are given a special timbre, texture and color, originating from the neural network’s
own distribution of sounds.
The generative model underlying this process is VampNet [1], a generative model capable of generating variations
on an input signal through this tokenize → corrupt → generate process.
Participants can listen to their utterance undergo this process through models trained on different sound libraries,
resulting in different “flavors” of generated sound (e.g. operatic textures, choral textures, natural sounds).
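The clockwise hand-off between the four loudspeakers, with the fourth feeding back into the first, can be sketched as a loop over a ring of models. This is only a schematic: the stand-in models below overwrite a single token with a label, and the labels `"opera"`, `"choir"`, `"nature"`, and `"other"` are hypothetical placeholders for differently trained networks. The names `run_telephone` and `make_model` are likewise invented for this illustration.

```python
from itertools import cycle, islice


def run_telephone(utterance, models, n_steps):
    """Pass a sound around a ring of models, keeping the output of every step."""
    history = [utterance]
    for model in islice(cycle(models), n_steps):   # wraps from the last model to the first
        history.append(model(history[-1]))
    return history


def make_model(flavor, position):
    """Stand-in model: overwrites one token with its own 'flavor'."""
    def model(tokens):
        out = list(tokens)
        out[position % len(out)] = flavor
        return out
    return model


# Four hypothetical models, listed in clockwise order around the room.
ring = [
    make_model("opera", 0),
    make_model("choir", 1),
    make_model("nature", 2),
    make_model("other", 3),
]

history = run_telephone(["voice"] * 4, ring, n_steps=4)
```

After one full lap in this toy setup nothing of the original `"voice"` tokens remains, a compressed caricature of the gradual drift that the installation stretches out over many passes.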
4 PERFORMANCE NOTES
4.1 Space Requirements and Suitable Venues
This piece can be installed in any small room or medium hallway (e.g., minimum 14ft x 14ft), as long as it comfortably
fits a 4-speaker array, a mic + mic stand positioned outside of the ring of speakers, a small table to house the computer
and audio interface (outside the speakers or between two speakers), and a minimum of 4-5 people walking around the
area enclosed by the speakers.
We imagine that this installation would be suitable for exhibition in a number of situations, and therefore in several
of the mentioned venues. It may be most appropriate in the Academy Gallery, but it might receive more interaction if it
were placed in a corner of one of the open spaces in the main performance venue. The HKU building’s corridors could
also be used if they are sufficiently wide. We can imagine the outdoor venue and the train station being possibilities, as
long as concerns about security and shelter from weather could be addressed.
4.2 Network Requirements
Because the installation requires a stable internet connection to a remote server for neural network processing, we
require that the location of the installation have good network coverage and an available internet connection capable of
uploading/downloading 10-second audio files reliably (e.g. at least 50Mbps of download and 30Mbps of upload speed).
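As a rough sanity check on these figures, the transfer time for one clip can be estimated from its size. The audio format is not stated here, so the sketch below assumes an uncompressed 16-bit, 44.1 kHz, mono file; that format, the function name, and the constants are assumptions, and the estimate ignores latency, protocol overhead, and server processing time.

```python
def transfer_seconds(duration_s, sample_rate_hz, bytes_per_sample, channels, link_mbps):
    """Idealized time to move an uncompressed audio clip over a link of `link_mbps`."""
    size_bits = duration_s * sample_rate_hz * bytes_per_sample * channels * 8
    return size_bits / (link_mbps * 1_000_000)


# Assumed clip format: 10 s, 44.1 kHz, 16-bit (2 bytes), mono.
upload_s = transfer_seconds(10, 44_100, 2, 1, link_mbps=30)
download_s = transfer_seconds(10, 44_100, 2, 1, link_mbps=50)
```

Under these assumptions each transfer takes roughly a quarter of a second or less, far shorter than the 10-second clip itself, which suggests the quoted speeds are chosen for reliability and headroom rather than raw throughput.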
4.3 Equipment
The installation requires the following equipment. Equipment to be provided by the conference is highlighted in yellow.
• 1x dynamic vocal microphone (e.g. Shure SM58)
• 1x microphone stand.
• 1x XLR cable (25ft, or long enough to reach from the mic to the equipment table comfortably in the installation space).
• 4x speakers, large enough for the installation space, with appropriate cabling for each speaker back to the equipment table. If speakers are not active, amplifiers for the speakers are required as well.
• 1x small table for housing the computer, electronics.
• a reliable internet connection for processing incoming sound.
• 1x 4+ channel audio interface (e.g. Scarlett 4i4, PreSonus Studio1824c)
• 1x Mac OSX computer capable of running Max/MSP 8.6 or later.
5 ETHICAL STATEMENT
AI models are ubiquitous in different aspects of our modern society. Often, AI models are obscurely present in the
digital services we use daily (e.g. web search rankings, music recommendations), and their biases can influence our
decision processes without us even knowing it. One of the aims of this installation is to illustrate the strengths and
flaws of the pattern recognition and signal synthesis capabilities of generative AI systems, as being familiar with the
properties and processes behind these AI models is becoming more and more important. Additionally, generative models
have gathered much controversy as large for-profit AI companies will train their generative music models on large
collections of copyrighted music without an artist’s consent, and provide techno-solutionist arguments along the lines
of “generative AI will democratize music”. We believe that this is an ill-formed goal, as music is not a homogeneous
blob, but an umbrella term encompassing countless evolving communities of artistic practice, each with a unique set of
styles, techniques, and aesthetic values [3].
REFERENCES
[1] Hugo Flores Garcia, Prem Seetharaman, Rithesh Kumar, and Bryan Pardo. 2023. VampNet: Music Generation via Masked Acoustic Token Modeling. ISMIR (2023).
[2] Rithesh Kumar, Prem Seetharaman, Alejandro Luebs, Ishaan Kumar, and Kundan Kumar. 2023. High-Fidelity Audio Compression with Improved RVQGAN. arXiv:cs.SD/2306.06546
[3] Andrew McPherson, Fabio Morreale, and Jacob Harrison. 2019. Musical instruments for novices: Comparing NIME, HCI and crowdfunding approaches. New directions in music and human-computer interaction (2019), 179–212.