Technische Universität Berlin
Fakultät I - Geistes- und Bildungswissenschaften
Institut für Sprache und Kommunikation
Fachgebiet Audiokommunikation
Master’s Thesis in Audiokommunikation und -technologie
Autogenous Spatialization for
Arbitrary Loudspeaker Setups
Author: Zeyu Yang
1. Supervisor: Dr. Henrik von Coler
2. Supervisor: Prof. Dr. Stefan Weinzierl
Start Date: 03-11-2023
Submission Date: 30-04-2024
Statutory Declaration
I hereby declare in lieu of an oath to Faculty I of Technische Universität Berlin
that the present work, attached to this declaration, was produced independently and
only with the aid of the sources and resources named in the bibliography. All
passages of the work taken from other works, either verbatim or in substance, are
marked as such. I am submitting this work as an examination component for the first
time. I affirm that this work, or substantial parts of it, has not already served
as the basis for earning credit in another course.
With my signature I confirm that I have been informed about the citation rules
customary in the field and have understood them. The citation rules customary in
the field concerned have been observed.
The work may be checked for plagiarism with the aid of electronic tools.
____________ ____________
Place, Date Signature
Acknowledgements
I would like to express my profound gratitude to my supervisor, Henrik von Coler,
for his invaluable support and guidance. His influence was pivotal in my decision to
join the Audiokommunikation und -technologie program, and his insights helped
me to discover and pursue my true interests. Special thanks to Miller Puckette for
his advice on the implementation in Pure Data.
Above all, my deepest appreciation goes to Jiayu Ding for her unwavering mental
and emotional support throughout my master’s thesis journey. Her presence is
indispensable, and I could not have achieved this without her.
Abstract
This thesis introduces Zerr*, a novel approach to spatial music production that
navigates between conventional sound spatialization and spatial sound synthesis.
Following an extensive review of sound spatialization techniques, as well as
historical and contemporary developments in spatial music, a tailored coordinate
system has been developed to categorize Zerr* alongside existing spatial music
authoring tools.
Zerr* employs an innovative algorithmic framework that leverages the intrinsic
properties of the audio signal, coupled with a special mapping system, to
autonomously distribute audio to arbitrary loudspeaker setups. This approach
facilitates dynamic, context-sensitive spatialization and enables unique spatial
sound synthesis effects. It also circumvents the limitations imposed by
traditional spatialization techniques, thus introducing new sonic experiences and
creative paradigms.
The core modules of Zerr* are implemented in C++ and are further extended
as a Pure Data package and JACK clients. The system is designed to integrate
seamlessly into existing creative ecosystems, supporting real-time audio
manipulation and sample-level spatialization. This functionality has proven
particularly effective for live performances and improvisational contexts.
Comprehensive listening tests conducted with participants from diverse
backgrounds have validated the system’s effectiveness. Their feedback has confirmed
the system’s innovative approach and its practical applicability in real-world
settings.
Summary
This thesis presents Zerr*, a novel approach to spatial music production that
moves between conventional sound spatialization and spatial sound synthesis.
Following a detailed examination of sound spatialization techniques and of the
historical and current developments in spatial music, a tailored coordinate system
was developed to position Zerr* alongside the existing tools for creating spatial
music.
Zerr* uses an innovative algorithmic framework that exploits the intrinsic
properties of the audio signal, in combination with a special mapping system, to
distribute audio signals autonomously to arbitrary loudspeaker configurations. This
approach facilitates dynamic, context-dependent spatialization and enables unique
spatial sound synthesis effects. It effectively bypasses the restrictions imposed
by conventional spatial sound techniques and thus leads to new sonic experiences
and creative paradigms.
The core modules of Zerr* are implemented in C++ and are extended by a Pure Data
package and JACK clients. The system is designed to fit seamlessly into existing
creative ecosystems and supports real-time audio manipulation as well as
spatialization at the sample level. This functionality has proven particularly
effective for live performances and improvisational contexts.
Comprehensive listening tests with participants from different backgrounds have
confirmed the effectiveness of the system. Their feedback has confirmed the
system’s innovative approach and its practical applicability in real environments.
Contents
1 Introduction
1.1 Motivation
1.2 Research Paths and Objectives
1.3 Structure of the Work
2 Technical and Theoretical Foundations
2.1 Conventional Sound Spatialization
2.1.1 Channel-based Method
2.1.2 Object-based Method
2.1.2.1 Panning Algorithms
2.1.2.2 Sound Field Reconstruction
2.1.3 Discussion
2.2 Early Spatial Music
2.2.1 Spatialization as Performance
2.2.2 Spatialization as Composition
2.2.3 Discussion
2.3 Loudspeakers as Musical Instruments
2.3.1 Reconsideration of Channel-based Method
2.3.2 Loudspeaker Characteristics
2.3.2.1 Loudspeaker Orchestras
2.3.2.2 Unconventional Loudspeakers
2.3.3 Sonic Trajectories
2.3.3.1 Historical Milestone
2.3.3.2 Current Direction and Limitation
2.4 Spatial Sound Control
2.4.1 Composition Oriented
2.4.1.1 Trajectory Editing
2.4.1.2 Composition Toolchain
2.4.2 Performance Oriented
2.4.2.1 Controllers in History
2.4.2.2 Spatial Instruments
2.4.2.3 Matrix-based Diffusion
2.4.3 Discussion
2.5 Spatial Texture
2.5.1 Relevant Theories
2.5.1.1 Spectromorphology & Spatiomorphology
2.5.1.2 Textural Composition
2.5.2 Creation Techniques
2.5.2.1 Spectral Spatialization
2.5.2.2 Spatial Granulation
2.5.2.3 Panning & Decorrelation
2.5.3 Spatialization as Synthesis
2.5.4 Related Tools
2.6 Discussion
3 Zerr* Approach
3.1 Approach Classification
3.2 Signal Flows in Live Performance
3.3 Zerr* Concept
3.4 System Design
3.4.1 Feature Tracker
3.4.2 Feature Processor
3.4.3 Speaker Manager
3.4.4 Envelope Generator
3.4.4.1 Speaker Selection
3.4.4.2 Distribution Processing
3.4.5 Envelope Combinator
3.4.6 Audio Disperser
3.5 Discussions
3.5.1 Creative Use of Audio Features
3.5.2 Sample-level Processing
3.5.3 Irregular Loudspeaker Setups
3.5.4 “Incorrect” Usage
4 Implementation
4.1 Aims and Priorities
4.2 Modular Design & Profiles
4.3 Core Modules
4.3.1 Feature Tracker
4.3.2 Feature Processor
4.3.3 Speaker Manager
4.3.4 Envelope Generator
4.3.5 Envelope Combinator
4.3.6 Audio Disperser
4.4 Encapsulations
4.4.1 Pure Data Package
4.4.2 JACK Client
4.4.3 Encapsulations in Development
4.5 Discussion
5 Evaluation
5.1 Goals & Expectations
5.2 Study Design
5.2.1 Test Environment
5.2.2 Test System
5.2.3 Experience Assessment
5.2.4 Test Process
5.3 Procedure
5.3.1 Questionnaire
5.3.2 Recruitment
5.3.3 Test Scenario
5.4 Analysis
5.4.1 General Feedbacks
5.4.1.1 Feedback for the Presets
5.4.1.2 Comprehensive Feedback
5.4.2 Experience-related Feedbacks
5.5 Discussion
6 Conclusions & Future Work
Chapter 1
Introduction
1.1 Motivation
The concept of spatial audio is no longer a foreign one to the general public. The
technologies have progressed beyond the realm of research institutes and have
been incorporated into a multitude of commercial applications. Examples include
the spatial audio experience in Apple Music,¹ Dolby Cinema,² the Apple Vision
Pro,³ and the recently constructed sound system in the Sphere in Las Vegas.⁴ It is
becoming increasingly evident that the general public is beginning to recognize
the existence of a spatial dimension in sound perception.
¹ https://www.dolby.com/experience/apple-music/
² https://www.dolby.com/movies-tv/cinema/
³ https://www.apple.com/apple-vision-pro/
⁴ https://holoplot.com/insights/case-studies/msg-sphere-case-study
The impact of spatial audio on the music industry is also significant. In January
2024, Apple Music decided to pay 10% higher royalties for Spatial Audio
tracks (supported by Dolby Atmos) than for tracks not available in this format.
The Dolby Atmos format⁵ continues to gain market share and has the potential to
become the dominant format for music production and distribution. The concept
of sound spatialization, which refers to the distribution of sound in an acoustic
space, is now being considered by traditional music producers and musicians.
⁵ https://www.dolby.com/technologies/dolby-atmos/
Long before this, however, attempts at spatializing sound were already an integral
part of electroacoustic music from its early days. The spatialization of previously
produced tape music on multichannel loudspeaker setups, as exemplified by the
potentiomètre d’espace spatial control system (Teruggi, 2007), was already
practiced in the 1950s by the Groupe de Recherches Musicales (GRM). The current
commercially successful applications cannot be separated from the experiments and
theories put forward by these pioneering musicians in their attempts at sound
spatialization.
While substantial progress has been made in accurately reproducing the spatial
characteristics of sound within the field of sound spatialization, this represents
only a fraction of the broader spectrum of challenges. In many cases, particularly
in musical practices, standard spatial audio techniques may not fully exploit the
spatial attributes of sound, potentially limiting the scope for innovation due to
their well-established methodologies. This thesis acknowledges the achievements
in traditional spatial audio techniques while proposing a considered exploration
into alternative approaches, particularly within the electroacoustic music sector.
There are already a number of pioneering approaches in this field, and although
some may initially appear rudimentary or overly ambitious, they hold the potential
to drive significant innovations in how we perceive and interact with sound.
The motivation for this master’s thesis is rooted in a desire to critically explore
and develop these experimental methodologies. This research aims to incrementally
contribute to the existing body of knowledge by proposing new solutions that
address less explored challenges in sound spatialization. Through this process, the
work seeks to enhance the toolkit available for artists and technologists,
potentially enriching the auditory experiences and expanding the creative
possibilities within the field.
1.2 Research Paths and Objectives
This thesis outlines a structured path to the creation of a new experimental
musical tool, divided into four phases: background research, theoretical innovation,
technical deployment, and empirical testing. It begins with a comprehensive
review of historical and contemporary applications of spatial audio technologies and
spatial attributes in music, including a reevaluation of loudspeakers as active
parts of musical instruments. Tools for creating spatial music⁶ are also categorized,
and innovative aesthetics and techniques that go beyond traditional sound field
reconstruction and panning are introduced. This foundational study establishes
the theoretical framework for the Zerr* method, a novel approach to sound
spatialization and spatial sound synthesis. The Zerr* method is detailed through its
core concepts and specific algorithmic framework, which distinguish it from
existing tools. The design of the system emphasizes ease of use and adaptability
to different environments, making it suitable for diverse applications. Empirical
validation is performed through extensive listening tests designed to assess the
effectiveness of the system. The feedback from these tests helps to refine the system,
confirming its practicality and identifying areas for improvement. The overall goal
of this thesis is to develop a spatialization system that balances theoretical depth
with practical usability. This system will contribute to the advancement of spatial
music production technologies while maintaining an experimental focus.
⁶ Spatial music is a term used extensively in this thesis to represent all musical
genres that take into account spatial properties.
1.3 Structure of the Work
Section2 provides an overview of the technical and theoretical foundations un-
derlying the Zerr* system. Section3 elucidates the fundamental concept of Zerr*
and the design principles that inform its development. Section4 delves into the
implementation details of the Zerr* system. Section5 presents the results of a
listening test conducted on the Zerr* system. Finally, Section6 offers a summary
of the conclusion and outlines future research directions.
Chapter 2
Technical and Theoretical Foundations
2.1 Conventional Sound Spatialization
Sound spatialization is still traditionally associated with the industry in which it
was first commercially successful, namely the film industry. Stereophony was first
used in 1940 in Disney’s animated film Fantasia, which was a significant
commercial success (Ross, 2012). To date, the film industry has been the primary
domain of sound spatialization, and as a result, a plethora of novel technologies
have been devised for the cinematic context. Many of their assumptions and
solutions were chosen to facilitate film production. This, in turn, creates a
subtle tension for musicians who use sound spatialization technologies in their
music (Baalman, 2010). Commercial success has given these technologies an
overwhelming advantage in development. Creating music with technologies that are
not inherently music-specific means inheriting the methodologies that those
technologies adopted for their original scenarios. Consequently, to a certain
extent, this impedes the full potential of using space as a musical element.
Due to its conceptual richness, music offers more expansive developmental
possibilities than other, more straightforward application scenarios. Furthermore,
music theorists and pioneering musicians have successfully developed various
theories, techniques, and technologies for utilizing space musically, particularly
within the realm of electroacoustic music. Before turning to the concepts of
spatialization in music, it is necessary to outline the conventional concepts of
sound spatialization so that the distinctions become clear.
2.1.1 Channel-based Method
In its narrowest definition, channel-based spatialization represents the most
traditional and straightforward method of sound spatialization. It is also the
earliest form of spatialization, in which sound is assigned to particular speakers
in a multi-channel sound system (Wenzel et al., 2017). The multi-channel sound
system is constrained to a limited number of predefined speaker layouts, which
can be as simple as stereo (2 channels) or as complex as 9.1.2 (12 channels).
Speaker layouts are frequently described in this X.Y.Z format: X represents the
number of main speakers positioned around the listener at ear level, Y represents
the number of subwoofers, and Z represents the number of overhead speakers. In
the channel-based method, sound is often spatially rendered through the use of
volume level differences or time delays between channels, which serve as spatial
cues. The movement of the sound is rarely considered; the focus is on its
position.
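As a small illustration of the X.Y.Z notation described above, the following Python sketch splits a layout label into its three groups and sums them to obtain the channel count. The function name `parse_layout` and the keys of the returned dictionary are hypothetical and chosen for this example only.

```python
def parse_layout(label: str) -> dict:
    """Split an 'X.Y' or 'X.Y.Z' layout label into speaker groups.

    Hypothetical helper for illustration: X = ear-level main speakers,
    Y = subwoofers, Z = overhead speakers (0 if omitted).
    """
    parts = [int(p) for p in label.split(".")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected 'X.Y' or 'X.Y.Z', got {label!r}")
    main, subs = parts[0], parts[1]
    overhead = parts[2] if len(parts) == 3 else 0
    return {
        "main": main,
        "subwoofers": subs,
        "overhead": overhead,
        "channels": main + subs + overhead,
    }


print(parse_layout("9.1.2")["channels"])  # 12
print(parse_layout("5.1")["channels"])    # 6
```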
The simplicity of this approach has resulted in its continued use in a variety
of settings, including music studios and home theaters. It has become the
prevailing concept of immersive sound for the majority of users. Novel forms of
sound spatialization systems are designed to retain compatibility with
conventional channel-based configurations. Nevertheless, the channel-based method
has a much more expansive meaning in the context of music, which will be elucidated
in Section 2.3.1.
2.1.2 Object-based Method
Object-based spatialization attempts to emulate the natural behavior of sound
sources by generating the impression of sound coming from specific directions or
moving in certain ways. Typically, such spatialization techniques work with the
point source paradigm, in which a sound source is assigned a virtual position in
the listening space. In theory, the position of the point source is independent of the
actual position of the loudspeaker, and it is thus often regarded as a true source.
Object-based spatialization can be achieved through the use of panning algorithms,
such as Vector Base Amplitude Panning (VBAP) and Distance-based Amplitude
Panning (DBAP), and sound field reconstruction methods, including Ambisonics and
Wave Field Synthesis (WFS).
2.1.2.1 Panning Algorithms
Panning algorithms position a sound source by calculating the relative
gains of the speakers. These calculations are based on the geometric relationship
between the loudspeakers used for reproduction and the target source position. The
origins of the concept can be traced back to the stereophony research that began
in the 1930s (Blumlein, 1933), which examined the perceived effects of
level and time differences between two loudspeakers on the localization
of a virtual sound source. Since then, a number of enhancements have made
contemporary spatial panning algorithms far more versatile tools for sound
spatialization than conventional stereo panning. Two widely used algorithms are
presented here.
Vector Base Amplitude Panning (VBAP) overcomes the limitation of
stereophony, which is restricted to two speakers positioned in front of the
listener, by employing vectors to describe the positions of speakers and virtual
sound sources within a listening space (Pulkki, 1997; 1998; 2001). This approach
extends stereo panning to two-dimensional and three-dimensional multi-speaker
setups. To determine the active speakers for a virtual sound source, the VBAP
algorithm selects the pair (in 2D) or triplet (in 3D) of speakers closest in angle
to the direction of the virtual sound source, that is, the one that encloses it.
Precise and flexible spatialization can thus be achieved through low-cost
calculations, particularly in high-density loudspeaker setups such as conventional
ring or dome shapes.
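The pair selection and gain calculation for the two-dimensional case can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: it assumes speakers on a horizontal ring described only by their azimuth in degrees, and the names `vbap_2d` and `_unit` are hypothetical. The source direction is written as a weighted sum of the two speaker unit vectors, the pair with non-negative weights is selected, and the weights are normalised to constant power.

```python
import math


def _unit(angle_deg):
    """Unit vector pointing at the given azimuth (degrees)."""
    a = math.radians(angle_deg)
    return math.cos(a), math.sin(a)


def vbap_2d(source_deg, speaker_degs):
    """Pairwise 2D amplitude panning on a horizontal ring.

    Returns one gain per speaker; at most two gains are non-zero.
    """
    p = _unit(source_deg)
    n = len(speaker_degs)
    # Adjacent pairs are neighbours after sorting by azimuth.
    order = sorted(range(n), key=lambda i: speaker_degs[i] % 360.0)
    for k in range(n):
        i, j = order[k], order[(k + 1) % n]
        l1, l2 = _unit(speaker_degs[i]), _unit(speaker_degs[j])
        det = l1[0] * l2[1] - l1[1] * l2[0]
        if abs(det) < 1e-9:
            continue  # speakers collinear with the origin: no unique solution
        # Solve p = g1 * l1 + g2 * l2 with Cramer's rule.
        g1 = (p[0] * l2[1] - p[1] * l2[0]) / det
        g2 = (l1[0] * p[1] - l1[1] * p[0]) / det
        if g1 >= -1e-9 and g2 >= -1e-9:  # source lies between this pair
            g1, g2 = max(g1, 0.0), max(g2, 0.0)
            norm = math.hypot(g1, g2)  # constant-power normalisation
            gains = [0.0] * n
            gains[i], gains[j] = g1 / norm, g2 / norm
            return gains
    raise ValueError("no speaker pair encloses the source direction")


# Quadraphonic ring, source halfway between the first two speakers.
print(vbap_2d(90.0, [45.0, 135.0, 225.0, 315.0]))
```

In this sketch a source direction that falls into a gap wider than 180 degrees raises an error, since no pair can reproduce it with non-negative gains.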
Distance-based Amplitude Panning (DBAP) is a lightweight panning method
proposed to address the practical challenges of implementing sound spatialization
systems in real-world settings (Lossius et al., 2009). In such environments, it
cannot be assumed that listeners will be situated in the optimal listening
position, and conventional speaker configurations may not be viable. The algorithm
derives the gain of each speaker from its distance to the virtual sound source and
normalizes the gains to maintain a constant sound intensity, regardless of the
source’s position. A further distinction between DBAP and VBAP is the assumption
that all speakers are active and contribute to the perception of the virtual
source. Additionally, spatial blur is incorporated to circumvent localization
issues that may arise when a sound source aligns with a single speaker. The
authors identify a limitation of DBAP in dealing with virtual sources outside the
loudspeaker array. To address this, they propose a solution whereby the virtual
sound source is first projected onto the edge of the convex hull and the intensity
is then adjusted to mimic the effect of a source outside the loudspeaker array.
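A distance-based gain calculation of this kind can be sketched in Python as follows. The sketch is an illustration under stated assumptions rather than the exact formulation of the cited paper: gains fall off by `rolloff_db` decibels per doubling of distance, the spatial blur is added to the distance as an extra offset term, and the function name `dbap_gains` and its parameters are hypothetical. The convex hull handling for sources outside the array is not included.

```python
import math


def dbap_gains(source, speakers, rolloff_db=6.0, blur=0.1):
    """Distance-based gains for a 2D source position.

    source:   (x, y) position of the virtual source
    speakers: list of (x, y) loudspeaker positions
    blur:     spatial blur (> 0) that keeps distances away from zero
    All speakers receive a non-zero gain; squared gains sum to 1.
    """
    if blur <= 0.0:
        raise ValueError("blur must be positive to avoid division by zero")
    # Exponent chosen so amplitude drops by rolloff_db per distance doubling
    # (assumption of this sketch).
    a = rolloff_db / (20.0 * math.log10(2.0))
    dists = [
        math.sqrt((sx - source[0]) ** 2 + (sy - source[1]) ** 2 + blur ** 2)
        for sx, sy in speakers
    ]
    weights = [1.0 / d ** a for d in dists]
    # Normalise so that the total power is constant for any source position.
    k = 1.0 / math.sqrt(sum(w * w for w in weights))
    return [k * w for w in weights]


# Irregular three-speaker setup with the source close to the first speaker.
gains = dbap_gains((-0.8, 0.0), [(-1.0, 0.0), (1.0, 0.0), (0.0, 3.0)])
print(gains)
```

Unlike the pairwise approach, every loudspeaker contributes here, which is what makes the method usable for irregular layouts.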
DBAP is particularly well-suited to scenarios where the loudspeaker
configuration is determined by considerations related to artistic, architectural, or
acoustical design, providing creators with the flexibility to arrange speakers in a
variety of configurations across different physical spaces. The prevalence of the
DBAP algorithm illustrates the discrepancy between the research scenarios and
the practical applications of sound spatialization algorithms in the real world. A
number of fundamental assumptions, including the regular distribution of speakers
and fixed listening positions, are not applicable in practice. This makes it
challenging to apply algorithms that perform well in laboratory settings to
real-world scenarios.
2.1.2.2 Sound Field Reconstruction
Another type of object-based method is designed to reconstruct the sound field,
rather than manipulate human perception based on psychoacoustic principles as
is the case with panning methods. Recreating the sound field ensures physical
correctness, thereby enabling robust and highly accurate sound localization. The
most common algorithms for reconstruction are Ambisonics and Wave Field Synthesis
(WFS). These algorithms approach reconstruction from different perspectives,
although under certain conditions they are essentially the same.
Ambisonics is a method of encoding the sound field around a point in space,
capturing the directionality and intensity of sound from all directions (Malham &
Myatt, 1995). It can be used with both ring (2D) and sphere (3D) speaker setups.
The technique represents the sound field at this point with spherical harmonics of
different orders, which can then be decoded for playback over a multi-speaker
setup or headphones. As the order of the spherical harmonics increases, the
approximation becomes more precise, allowing for enhanced spatial resolution in
the reproduced sound (Gerzon, 1973). One significant advancement in the
application of Ambisonics is the development of Near-field-corrected Higher-Order
Ambisonics (NFC-HOA), now more commonly referred to as Distance-coded Ambisonics
(DCA) (Daniel, 2003). This refinement addresses one of the traditional limitations
of Ambisonics related to the decoding accuracy for listeners close to the speakers.
NFC-HOA, or DCA, incorporates distance information into the decoding process,
optimizing sound reproduction for near-field listening environments. This
enhancement is particularly beneficial in settings where listeners may be
positioned at varying distances from the speakers, improving the accuracy and
quality of the auditory experience across a wider listening area.
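To make the encoding step concrete, the following Python sketch encodes a mono sample into first-order Ambisonics and reports how the channel count grows with the order. It assumes the traditional B-format convention (channels W, X, Y, Z with W scaled by 1/√2); other channel orderings and normalisations exist and differ in detail. The function names `encode_first_order` and `ambisonic_channels` are hypothetical, and decoding is not shown.

```python
import math


def encode_first_order(sample, azimuth_deg, elevation_deg=0.0):
    """Encode one mono sample into traditional B-format (W, X, Y, Z)."""
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    w = sample / math.sqrt(2.0)              # omnidirectional component
    x = sample * math.cos(az) * math.cos(el)  # front-back
    y = sample * math.sin(az) * math.cos(el)  # left-right
    z = sample * math.sin(el)                 # up-down
    return w, x, y, z


def ambisonic_channels(order, three_d=True):
    """Number of spherical (3D) or circular (2D) harmonic channels."""
    return (order + 1) ** 2 if three_d else 2 * order + 1


print(encode_first_order(1.0, 90.0))           # source to the left
print([ambisonic_channels(n) for n in (1, 2, 3)])  # [4, 9, 16]
```

The growth of the channel count with the order illustrates the cost of the enhanced spatial resolution mentioned above.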
Ambisonics provides a highly adaptable framework for the manipulation
of sound fields in the context of complete music production, encompassing the
processes of recording, manipulation, composition, and reproduction. The
algorithm’s open-source nature renders it a popular choice among artists for
spatializing sound. Nevertheless, despite improvements like DCA, the most apparent
drawback of Ambisonics remains: listeners are confined to a narrow sweet spot for
high-quality reproduction, particularly when traditional Ambisonic methods are
used without near-field correction.
In contrast, Wave Field Synthesis (WFS) is capable of producing an
acoustically accurate synthesized sound field by generating wavefronts that
replicate those emitted by real sound sources (Berkhout et al., 1993; Theile &
Wittek, 2004). This implies that there is no sweet spot: listeners perceive the
sound differently as they move, just as they would when listening to a real sound
source. WFS is based on Huygens’ principle, which states that the subsequent
wavefront can be created by an infinite number of small sources located on the
current wavefront. It employs a multitude of closely spaced loudspeakers to
synthesize the sound field. Theoretically, the high density of loudspeakers allows
for an accurate reproduction of sound over a larger listening area. However, the
physical size of the loudspeakers limits how closely they can be spaced, which
makes spatial aliasing at high frequencies inevitable.
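The relation between loudspeaker spacing and spatial aliasing can be estimated with a commonly quoted rule of thumb, in which the aliasing frequency is the speed of sound divided by twice the spacing. The Python sketch below applies this approximation; the actual limit also depends on the source and listener geometry, and the function name `aliasing_frequency` is hypothetical.

```python
def aliasing_frequency(spacing_m, speed_of_sound=343.0):
    """Approximate spatial aliasing frequency in Hz for a linear array.

    Rule of thumb: f = c / (2 * spacing). Above this frequency the
    synthesized wavefronts are no longer reproduced correctly.
    """
    if spacing_m <= 0.0:
        raise ValueError("spacing must be positive")
    return speed_of_sound / (2.0 * spacing_m)


for spacing in (0.20, 0.10, 0.05):
    print(f"{spacing:.2f} m -> {aliasing_frequency(spacing):.0f} Hz")
```

Halving the spacing doubles the usable bandwidth, which is why the physical size of the drivers sets a practical upper bound.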
A number of practical considerations have prevented WFS from attaining
significant popularity. Firstly, the algorithm requires costly, specialized
equipment with a substantial number of integrated speakers, in contrast to
Ambisonics playback systems, which can be constructed using conventional
monitor speakers. Secondly, WFS requires a considerable amount of computation
to perform large-scale multichannel audio processing, which remains a significant
challenge on personal computers. Nevertheless, these practical considerations can
plausibly be addressed gradually through engineering solutions, at which point the
theoretical advantages of WFS would become apparent. As previously stated in
Section 1.1, there are already successful commercial applications of this
technology (Start, 2024).
2.1.3 Discussion
The object-based method represents an advanced abstraction of spatial sound
placement, moving beyond the straightforward logic of one channel corresponding
to one speaker. This approach requires a shift in mentality, as it abstracts sound
objects and is therefore largely agnostic to the loudspeaker system. That system is
assumed to be as dense, evenly distributed, and homogeneous as possible to create
realistic and immersive listening experiences. Consequently, works utilizing the
object-based method are highly portable. Audio material and source movement
data can be stored separately and rendered to any loudspeaker setup that meets
the necessary specifications. While the individual experience and quality depend
heavily on the loudspeaker systems, they are considered interchangeable compo-
nents of the technical infrastructure.
Enumerating the advantages of the object-based method does not suggest it
is a one-size-fits-all solution for sound spatialization. As discussed in the Dolby
Atmos White Paper (Dolby, 2014), sound objects are beneficial for controlling
instantaneous effects in movies, while ambient effects, reverberations, and back-
ground music are more suitably transmitted directly to an array of loudspeakers.
Therefore, as a commercial standard, Dolby Atmos still supports “beds” in the
channel-based tradition as a complement to objects in the rendering process.
Different scenarios demand different technologies. The sound spatialization
technologies discussed are primarily aimed at environments similar to cinemas,
theaters, or for research purposes in laboratories. In these settings, the core re-
quirement is to reconstruct the sound field as closely as possible to real scenes,
providing an immersive sound experience. In early promotions, terms like immersive
sound or surround sound were more prevalent.
Conventional sound spatialization methods excel at reconstructing sound
fields and accurately localizing sounds. However, when the focus shifts back to
music, the perspective changes. A perfect imitation of physical sound sources does
not necessarily equate to a memorable musical experience. Admittedly, the sound
field simulation paradigm can significantly benefit music. The movement of sound
objects introduces a new dimension of expression to musical elements. Accurate
sound localization not only recreates the ambiance of live music scenes,
particularly in classical music, but also offers a novel mixing approach that frees
music producers from the constraints of limited panning positions and spectral
bandwidth.
Furthermore, there are broader possibilities for utilizing spatial properties in
music. Spatiality, an often overlooked but inherent attribute in music, can be more
intimately linked with different musical perspectives. The next section delves into
the historical explorations of spatial properties in music.
2.2 Early Spatial Music
De même que la musique est une dialectique de la durée et de l’intensité, le
nouveau procédé est une dialectique du son dans l’espace et je pense que le
terme de musique spatiale lui conviendrait mieux que celui de stéréophonie.
— Abraham Moles
The divergent conceptualizations of spatialization in electroacoustic music, ex-
tending beyond its original roots in the film industry, have been thoroughly doc-
umented. In examining the origins of electroacoustic music, discussions typically
center on Musique Concrète, which originated in France, and Elektronische Musik
from Germany. Early developments in both countries already incorporated sound
spatialization as an integral aspect of musical practice. Chronologically, these
movements unfolded in a sequential manner, with significant cross-communication
between them. However, this thesis maintains a distinction between the two be-
cause the pioneers from each movement embodied distinctly different mindsets
regarding spatialization in music. One approach primarily viewed spatialization
as a performance tool, while the other integrated it as a fundamental aspect of
composition. Both perspectives have been pivotal in shaping the direction of sub-
sequent research. Throughout this section, the term spatial music is employed
to collectively refer to any music that involves exploration of spatial properties,
simplifying the complex taxonomy of music genres.
2.2.1 Spatialization as Performance
A particularly notable example from the early explorations is Jacques Poullin’s
work with the potentiomètre d’espace system (Valiquet, 2012). This occurred dur-
ing his time as a member of the Groupe de Recherche de Musique Concrète
(GRMC), which was organized by Pierre Schaeffer. The most renowned iteration
of this system, known as the pupitre d’espace (space desk) or pupitre de relief (re-
lief desk), was initially employed in live performances by Pierre Henry and Pierre
Schaeffer in 1951. This system featured a unique setup with four speakers: two
in front, one at the rear, and another overhead. A single-track recording was
played back as the input signal, while a performer controlled the position of
sound in real time via a handheld transmitter. The transmitter coil was designed
to interact with four receiver coils positioned around the performer in order to
control the amplitude of the four loudspeakers (Teruggi, 2007).
Poullin believed that the innovative aspect of this system was its transfor-
mation of the listening experience, allowing sound to emanate not just from the
traditional frontal plane but from the surrounding space. However, from today’s
perspective, the contributions of this system extend far beyond that initial inno-
vation. Schaeffer attempted to differentiate this novel process from stereophony by
emphasizing that the objective is to generate a corresponding spatial development
for the sound rather than an exact replication. There were also forward-thinking
remarks, such as that of Abraham Moles in the epigraph above: “Just as music
is a dialectic of duration and intensity, the new process is a dialectic of sound in
space, and I think the term Spatial Music would suit it better than stereophony”.
The significant contribution of this system is that it deepens the integration of
sound spatialization into live performance, allowing performers to actively shape
spatial properties as a core element of musical expression. This practice gradually
evolved and then became central to electroacoustic music, known as Sound Diffu-
sion (Austin & Smalley, 2000; Dack, 2001), where performers, often referred to as
diffusers, dynamically sculpt the sound in real time across the performance space.
Sound diffusion not only enhances the expressive potential of tape music but also
reshapes traditional concepts of audience engagement. By integrating spatializa-
tion directly into the performance, it creates a dynamic interaction among the
audio, the space, and the listener. This approach effectively dissolves the conven-
tional roles of composer, performer, and audience, fostering a more immersive and
collaborative experience (Harrison, 1998).
Figure1: Pierre Henry performing with the pupitre d’espace
2.2.2 Spatialization as Composition
From the origins of Elektronische Musik, Karlheinz Stockhausen investigated the
potential of sound spatialization. Owing to his exposure to serialism, his
perspective diverged from the performance-centric approaches. In 1956, he
composed Gesang der Jünglinge as his inaugural effort to employ the spatial
properties of sound in composition (Smalley, 2000). The timbres of a boy’s voice,
generated white noise, and sine tones were blended in this piece in such a way as
to obscure the distinction between them. The precise positioning and movement of
sounds were meticulously crafted using five groups of loudspeakers, thereby
further emphasizing the serialist connection between the two groups of timbres
and becoming integral to the comprehension of the work (Decroupet et al., 1998).
In his subsequent theoretical development, Stockhausen considered spatial
properties to be of equal importance to other musical elements, and thus asserted
that they should be articulated similarly (Morgan, 1975). However, he encountered
difficulties in serializing musical aspects beyond pitch, particularly timbre and
space, due to the lack of a clear serialist relationship among these high-dimensional
elements. To address this issue, he developed a novel approach that emphasized
the overall character of large, proportionally related groups of material, which
could also be applied to spatial properties. This approach was successfully applied
to his composition Gruppen (“groups” in German), with three orchestras located
at the front left, center, and right of the auditorium. Further examples include
Carré (“square” in French), with four orchestras positioned at the four corners
of a square centered on the audience. The impression of spatial movement was
created via overlapping crescendos and decrescendos, which are embedded in the
scores of the compositions (Bates, 2009).
Figure2: Karlheinz Stockhause manipulate the rotating loudspeaker
Furthermore, Stockhausen posited that the direction of sound is of greater
consequence than the distance, as the latter can be derived from musical parameters
such as timbre and loudness. His contemplation of sound as a primary compositional
element is evidenced by the rotating loudspeaker mechanism he created for the
composition of Kontakte. The apparatus comprises a turntable with a loudspeaker
mounted at the center and four microphones arranged in a circle around it, which
record the sound produced by the loudspeaker during playback. The rotation of the
loudspeaker produces a multitude of acoustic effects that extend beyond mere
amplitude changes between the four microphones. These include phenomena such as
Doppler shift and phase shifts, which are challenging to simulate using purely
electronic devices. The spatial variation of the sound was meticulously captured
on the four-track recording, allowing Kontakte to be played back in any venue with
an equivalent setup of speakers in all four directions, without the need for
on-site diffusion control. Stockhausen’s pursuit of precise control of sound
direction is still evident in the current development of spatial music authoring
tools.
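The Doppler component of this effect can be estimated with a small calculation.
The Python sketch below is illustrative only: it assumes that the effective
radiating point of the loudspeaker moves on a circle of hypothetical radius around
the turntable axis, that the microphone is stationary, and that the speed of sound
is 343 m/s. The function name and all numeric values are assumptions and are not
taken from Stockhausen’s apparatus.

```python
import math


def doppler_ratio(angle_rad, radius_m, rev_per_s, mic_distance_m, c=343.0):
    """Observed/emitted frequency ratio for a source moving on a circle.

    The circle is centred on the origin, rotation is counter-clockwise, and
    the stationary microphone sits at (mic_distance_m, 0).
    """
    omega = 2.0 * math.pi * rev_per_s
    # Source position and velocity on the circle.
    sx, sy = radius_m * math.cos(angle_rad), radius_m * math.sin(angle_rad)
    vx, vy = -omega * sy, omega * sx
    # Velocity component pointing from the source towards the microphone.
    dx, dy = mic_distance_m - sx, -sy
    v_toward = (vx * dx + vy * dy) / math.hypot(dx, dy)
    # Moving source, stationary observer.
    return c / (c - v_toward)


if __name__ == "__main__":
    # Hypothetical values: 30 cm radius, 4 revolutions per second, mic at 1.5 m.
    for degrees in range(0, 360, 45):
        ratio = doppler_ratio(math.radians(degrees), 0.3, 4.0, 1.5)
        print(f"{degrees:3d} deg  ratio={ratio:.4f}  1000 Hz heard as {1000 * ratio:.1f} Hz")
```

Because each of the four microphones sees a different radial velocity at any
instant, every track receives its own time-varying pitch deviation, which a pure
amplitude panner cannot reproduce.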
2.2.3 Discussion
The preceding sections have examined the dimensions of sound spatialization in
both performance and composition contexts. Section 2.2.1 delves into how
performers, in roles akin to diffusers, actively manipulate the spatial attributes
of sound in real time within live electroacoustic music, thereby fostering a
dynamic interaction between audio, spatial environment, and audience. This approach
effectively dissolves traditional performer roles, creating a deeply immersive
auditory experience. In contrast, Section 2.2.2 examines how composers integrate
spatial
properties directly into their compositions, emphasizing creative intent through
precise control of spatial attributes. Both methods place loudspeakers in a pivotal
role, transcending their traditional function as mere conduits of sound to become
active, color-imparting components of the musical expression.
It is often asserted that the initial phase of electroacoustic music transformed
the traditional process of music making: instead of the composer creating music
on a score that is then realized by instrument performers, the composer
manipulates sound materials directly. The spatial control of sound represents a
further extension of this shift. In the past, the instrument performer was
regarded as an intermediary between the composer and the listener, which meant
that the composer had no direct control over the performer’s interpretation of
the music. Once the composer begins to manipulate sound materials directly,
performers are no longer needed to translate abstract notation into sound. The
only remaining interpreter is the playback system. The addition of loudspeakers
inevitably
introduces a degree of coloration when reproducing sound. When musicians begin
to consider the distinctive attributes of loudspeakers and loudspeaker setups in
musical performance, and are able to regulate the audio playback during the per-
formance, it becomes evident that the loudspeaker can be considered an integral
component of the instrument. The subsequent section delineates various method-
ologies for employing the loudspeaker as an instrument for musical expression
within the domain of spatial music.
2.3 Loudspeakers as Musical Instruments
2.3.1 Reconsideration of Channel-based Method
Prior to a detailed examination of the manner in which speakers function as
musical instruments, it is necessary to revisit a concept from Section 2.1:
channel-based spatialization.
In contrast to the traditional understanding of a channel-based method, which
involves restricted speaker layouts, the term “channel-based” in spatial music is
employed to describe the one-to-one correspondence between channels and
loudspeakers. The musician must analyze the utility of each speaker in depth and
must possess direct control over each one. In this sense, the channel-based
method shifts in definition from a highly structured format to a more flexible
way of thinking.
The channel-based and object-based methods of spatial music creation are,
in essence, neutral with regard to advantages or disadvantages. The decision to
employ the object-based or channel-based approach is contingent upon the specific
intent. The primary function of abstracting the concept of the sound object is to
facilitate precise control of position and movement. This is in accordance with the
discussion about Stockhausen’s use of serialism in composing spatial music.
However, the channel-based concept is more rudimentary. A useful analogy is the
distinction between a specialized cell and a stem cell: the absence of
preconceived notions regarding its capabilities makes the channel-based approach
more likely to yield applications that diverge from the norm and to realize a
greater array of potentialities.
2.3.2 Loudspeaker Characteristics
2.3.2.1 Loudspeaker Orchestras
As is the case with different instruments in a symphony orchestra, loudspeakers
possess their own characteristics and techniques of use. From the perspective of
audio technology, these qualities can be reflected in the frequency response curve
of the speaker, distortion rate, and other parameters. Furthermore, from an
acoustic perspective, they are reflected in the interaction between the positions
of the speakers and the listening environment. When a multitude of speakers
with disparate qualities are assembled in a specific configuration, they can achieve
a similar effect to that of a symphony orchestra. This is because they complement
each other’s qualities, resulting in a musical experience that is not possible with
a single type of instrument.
Loudspeaker orchestras, which serve as a paradigm for exploring the quali-
ties of loudspeakers and the corresponding spatial experience, have been realized
in a number of versions throughout history and up to the present. The most no-
table example is the Acousmonium (Desantos et al., 1997), which was introduced
by François Bayle in the 1970s. The system combines loudspeakers with distinct
characteristics and employs these differences in the diffusion of tape music.
Another significant development in the field of loudspeaker orchestration was the
Gmebaphone, which was created by the Groupe de Musique Expérimentale de Bourges
during the 1970s (Clozier & Olsson, 2001). In addition to investigating spatial
effects with loudspeakers in a manner analogous to the Acousmonium, the project’s
most significant contribution was the influence of its purpose-built controller
on the advancement of tools for real-time sound spatialization control. This
will be further elucidated in Section 2.4 on control tools. Moreover, this tradition
has been refined and expanded upon by contemporary implementations, such as
the Birmingham ElectroAcoustic Sound Theatre (BEAST) (Wilson & Harrison,
2010).
Table1: Acousmonium (left) and Gmebaphone-1 (right)
2.3.2.2 Unconventional Loudspeakers
In contrast to the formation of loudspeaker orchestras, an alternative approach
involves the use of unconventionally constructed loudspeakers. Spherical speaker
arrays, such as the IKO (Zotter et al., 2017), represent a departure from the tra-
ditional center-oriented array arrangement, with a diffuse speaker arrangement
providing a novel solution for composing directivity of sound. The distribution
of sound in space is achieved through the use of reflections, with the room itself
becoming an integral part of the instrument.
Parametric loudspeakers produce audible sound by emitting modulated ultrasonic
waves whose frequencies lie above the upper limit of human hearing. The modulated
ultrasonic waves interact with the nonlinear properties of the air and demodulate
into audible frequencies (Shi & Gan, 2010). This physical mechanism gives rise to
the highly directional characteristics of parametric
loudspeakers. Although the majority of parametric loudspeaker manufacturers are
focused on the function of avoiding disruptions in scenarios such as home enter-
tainment or online conferences, the unique spatial characteristics of parametric
loudspeakers are ideal for sound installations that aim to create a strong connec-
tion between the sound and an exact listening location (Alunno & Yarce Botero,
2017).
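The modulate-then-demodulate principle can be illustrated numerically. The Python
sketch below is a deliberately crude model, not a description of any product: it
assumes a 40 kHz carrier, a 1 kHz test tone, and a 192 kHz sample rate, and it
stands in for the nonlinearity of air with a simple squaring operation. All names
and values are illustrative assumptions; the actual acoustic process is
considerably more involved.

```python
import math

FS = 192_000          # sample rate in Hz (assumed)
CARRIER_HZ = 40_000   # ultrasonic carrier frequency (assumed)
AUDIO_HZ = 1_000      # audible test tone
DEPTH = 0.8           # modulation depth


def am_ultrasound(num_samples):
    """Amplitude-modulate the ultrasonic carrier with the audible tone."""
    out = []
    for i in range(num_samples):
        t = i / FS
        envelope = 1.0 + DEPTH * math.sin(2 * math.pi * AUDIO_HZ * t)
        out.append(envelope * math.sin(2 * math.pi * CARRIER_HZ * t))
    return out


def tone_level(signal, freq_hz):
    """Amplitude of a single frequency component (one-bin DFT)."""
    n = len(signal)
    re = sum(s * math.cos(2 * math.pi * freq_hz * i / FS) for i, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq_hz * i / FS) for i, s in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n


emitted = am_ultrasound(FS // 10)        # 100 ms of signal
demodulated = [s * s for s in emitted]   # squaring as a stand-in nonlinearity

print(f"1 kHz level in emitted signal:      {tone_level(emitted, AUDIO_HZ):.4f}")
print(f"1 kHz level after the nonlinearity: {tone_level(demodulated, AUDIO_HZ):.4f}")
```

In this model the emitted signal contains only the carrier and its sidebands, so
nothing audible is radiated directly; the 1 kHz component appears only after the
nonlinear step, with an amplitude close to the modulation depth.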
An alternative approach to strong directivity is to minimize the perceptibility
of speakers. The OmniWave virtual speaker generates a stable, vertical phantom
sound source through its OmniDrive 360-degree radiators⁷. The system’s design
ensures acoustical transparency, preserving the original sound quality and
maintaining stability across varying room conditions. Currently, the 4DSOUND system
(Oomen et al., 2016), based on OmniWave, is being employed in the development
of spatial music and has produced a series of works and performances.
⁷https://bloomline.com/
The X1 Matrix array, manufactured by HOLOPLOT, has demonstrated remarkable
capabilities for precisely controlling the sound field based on the Wave Field
Synthesis (WFS) algorithm (Start, 2024). This technology offers musicians the
potential to manipulate sound and space in a seamless manner, a capability that
has not previously been available. Regrettably, the majority of applications of
the X1 Matrix are still limited to creating precise coverage areas and providing
a consistent high-fidelity sound experience for all listeners. There are fewer
examples of its use as a music creation tool.
As evidenced by the preceding analysis, the use of unconventional speakers is a
promising avenue for further investigation. However, research in this area is
still scarce, primarily because these speakers are typically prohibitively
expensive and not widely used. It is challenging for the majority of researchers
to gain
access to these devices and to conduct prolonged experiments with them. The
majority of research into spatialization of sound has been conducted using conven-
tional studio monitors and arrays constructed from such speakers. As previously
stated in the preceding section on early spatial music, the investigation of the
spatial position and movement of sound has been a fundamental aspect of musical
exploration for a considerable period of time.
2.3.3 Sonic Trajectories
2.3.3.1 Historical Milestone
The book written by Pierre Schaeffer, In Search of a Concrete Music, published
in 1952, discussed how the sound travels on a sonic trajectory and creates spatial
depth through the contrast between stationary and manually controlled movements
(Schaeffer, 2012). The initial investigations into sonic trajectories have
already been discussed from both the performance and compositional perspectives
in Section 2.2.
It is also worth noting the application of sonic trajectories in the history
of music. One such example is Edgard Varèse’s masterpiece Poème Électronique,
which was performed in the Philips Pavilion, designed by Le Corbusier and Iannis
Xenakis, at the 1958 Brussels World’s Fair (Lukes, 1996). A highly intricate
spatialization scheme utilizing 350 loudspeakers was devised for this
composition. All of the loudspeakers were integrated into the Philips Pavilion
as a component of its architectural design. They were positioned to create a
series of trajectories that align with the distinctive hyperbolic paraboloid
structure of the pavilion, so that sounds could traverse specific pathways and
create the illusion of sonic trajectories. When the Philips Pavilion was
constructed in 1958, there was no established technology for object-based
spatialization in three-dimensional sound space. Nevertheless, the pioneering
architects and musicians spared no expense to achieve their desired result,
creating an immersive sound experience that transformed the entire pavilion into
a musical instrument.
Following the conclusion of the Expo, the pavilion was dismantled. Since that
time, a great deal of multifaceted research has been conducted on it, as well as
attempts to virtually reconstruct the integral experience of it (Lombardo et al.,
2009). While the Poème Électronique reached an unprecedented level in combining
music and space, this kind of large-scale engineering feat, which could only have
been realized in a specific period of history, remains very difficult, and
arguably pointless, to replicate today.
Figure3: Sonic Trajectory for Poème Électronique
2.3.3.2 Current Direction and Limitation
A significant proportion of subsequent research into the sonic trajectory has been
conducted with a more pragmatic approach. A tendency has emerged to explore
more portable solutions than site-specific sound spatialization. Portable
approaches are distinguished by the use of more precisely regulated speaker array
structures, exemplified by the 8-channel ring. Furthermore, there is a desire for
underlying technology that realizes sonic trajectories in space more precisely.
This topic is related to the discussion in Section 2.1.2 about research interests
in the field of sound field reconstruction. The techniques outlined in Section 2.1
represent the fundamental tools utilized by researchers in this field.
Similarly, a significant proportion of software, hardware, and user interfaces
designed for spatial music are oriented towards the control and characterization
of sonic trajectories. Whether this focus is justified cannot be settled
objectively. From my personal perspective, however, and as already expressed
above, accurate control of sound spatialization does not necessarily equate to a
superior spatial musical experience.
It is crucial to keep in mind that a sonic trajectory remains an abstract
concept. The trajectory of a sound object is not as intuitive as one might
expect. It is challenging for the human ear to accurately recognize the
trajectory of a sound object without the aid of visual information (Schumacher et
al., 2021). Even in scenarios with only a single sound source, people already
experience difficulties in distinguishing sound direction between the front and
back.
The capacity to perceive sound trajectories is significantly diminished when
multiple objects are in motion in the same space. Even the aforementioned Poème
Électronique, which employs real speaker trajectories to convey movements in a
vast space, is not perceived wholly accurately and relies on the input of
other sensory modalities. Despite the emphasis on sound, the work is, in fact, a
Gesamtkunstwerk, a comprehensive artwork encompassing architecture, lighting,
film, and music.
2.4 Spatial Sound Control
In considering the contemporary concept of musical instruments, the use of
speakers and speaker arrays as described in Section 2.3 does not realize the
complete process of interaction between a musical instrument and a musician.
Following the models proposed by Wanderley (2001) and Magnusson (2019), a musical
instrument can be defined as comprising three fundamental elements: a gestural
controller, a mapping engine, and a sound engine. A simplified diagram is
presented in Figure 4. Speakers and speaker arrays primarily serve the function of
the sound engine. The remaining two elements are addressed in this section.
Figure 4: Interaction schematic between instrument and musician
In their early history, spatial sound controllers were conceptualized primarily
as physical hardware, such as the pupitre d’espace mentioned in Section 2.2.1. In
the contemporary context, spatial sound controllers are predominantly conceived
as software or as a combination of hardware and software. The advent of
multi-channel audio transmission protocols, such as MADI (Lidbetter, 1988) and
Dante⁸, has significantly lowered the threshold for developing spatial audio
applications.
⁸https://global.audinate.com/meet-dante/what-is-dante
Software can be categorized based on the environment in which it operates.
One category includes modules or libraries used in audio programming
environments, such as Max/MSP⁹, Pure Data (Puckette & others, 1996), and
SuperCollider (McCartney, 2002). Another category consists of plugins for
commercial Digital Audio Workstations (DAWs) like Reaper¹⁰, Pro Tools, and Ableton
Live. Among these, Reaper is particularly favored for spatial music creation due
to its robust support for multi-channel audio and the availability of a free
evaluation version. In addition, there are standalone applications and web
browser-based control software.
⁹https://cycling74.com/products/max
¹⁰https://www.reaper.fm/
The characteristics of a tool can be influenced by the environment in which
it is used. In essence, the tools utilized in an audio programming environment are
designed with a greater emphasis on real-time control and performance, as well as
the capacity to undertake a greater number of experimental trials. Conversely, the
tool chain employed in a digital audio workstation is more systematic in nature,
with the objective of providing a stable and controlled spatialization during the
production process. Standalone applications are more effective in fully realizing
the design concept of the developer. In addition, relatively recent developments
in web-based technologies have enabled the implementation of collaborative
multi-user operations, which are challenging to achieve in other environments.
The discussion
of this aspect is beyond the scope of this thesis. For further information, please
refer to other related studies (Barbosa, 2003; Coler et al., 2020; Leslie et al., 2010).
This section adheres to the categorization established in the discussion of
early spatial music. Accordingly, tools designed for the purpose of composition
and performance are discussed separately. Many tools can be utilized for both
composition and performance tasks; the following subsections therefore classify
each tool by the principal context in which it is employed.
2.4.1 Composition Oriented
In a more expansive definition of the term, one might posit that all of the scores
and notations utilized in the initial spatial music explorations to facilitate
sound diffusion could be considered composition-oriented tools. It can be argued
that scores and notations are not entirely obsolete, given that there is little
distinction between scripting spatialization schemes and manually drawing
automation lines in a digital audio workstation (DAW) when utilizing plugins of an
object-based spatialization paradigm such as IEM-Ambisonics¹¹, SPARTA¹², and
Dolby Atmos Renderer¹³. Although the majority of researchers are reluctant to
acknowledge it, these remain the most prevalent tools by which most musicians are
introduced to composing spatial music.
¹¹https://plugins.iem.at/
¹²https://leomccormack.github.io/sparta-site/
¹³https://professional.dolby.com/product/dolby-atmos-content-creation/dolby-atmos-renderer/
One notable distinction is that the early practice of sound diffusion required
the control of only a single track of tape music. In that situation, hand-drawn
scores and notations, in conjunction with manual control, remained a viable
approach. Contemporary spatial music creation requires the simultaneous control of
a multitude of sound objects (object-based view) or speakers (channel-based view).
Direct manual control of all of them is no longer feasible, and manually defining
every spatial parameter during the composition process has become an extremely
laborious task. The fundamental objective of authoring tools is to reduce this
workload. As a consequence, more creative methods of composition can be devised.
2.4.1.1 Trajectory Editing
The prevailing trend of object-based spatialization has given rise to a plethora of
assistive authoring tools that endeavor to streamline the process of editing the
spatial trajectory of sound objects. To illustrate, the early graphical spatial
trajectory editing software NeXTStep (Todoroff et al., 1997), implemented on the
NeXT Computer, offered a multitude of commonly utilized presets for 2D and 3D
trajectories, in addition to a graphical user interface for parameter editing. The
software also provided an alternative way to create spatial trajectories by
connecting to other devices via a MIDI interface, most notably the direct
recording of spatial gestures via a wearable controller, the Data Glove (Harada,
1992; Sturman & Zeltzer, 1994). In a similar vein, Holophon, released the
following year, also offers 2D graphical trajectory editing functions, in addition
to a rich set of algorithm-based trajectory generation functions (Pottier, 1998).
Holophon has since undergone further development¹⁴. In addition to the original
HoloEdit graphical editors, HoloPad, an iPad application, has been introduced for
controlling DBAP-based sound spatialization (Bascou, 2013). It duplicates a
single-channel input according to the number of fingers and the pressure of each
finger, then localizes the sound objects at positions defined by the speaker array
setup and finger positions. Numerous analogous studies on the evolution of
graphical user interfaces exist, which cannot be exhaustively enumerated or
analyzed here (Carpentier, 2015; Dilger, 2013; Thiébaut, 2005).
¹⁴https://en.gmem.org/holophon
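To make the DBAP step concrete, the gain computation can be sketched in a few
lines of Python. This is a minimal illustration of distance-based amplitude
panning as it is commonly formulated (gains inversely proportional to a power of
the source-to-speaker distance, normalized to constant total power), not a
reconstruction of HoloPad’s code. The function name, the speaker layout, the 6 dB
rolloff, and the blur value are assumptions, and per-speaker weights as well as
finger pressure are omitted.

```python
import math


def dbap_gains(source, speakers, rolloff_db=6.0, blur=0.1):
    """Distance-based amplitude panning gains for one source.

    source: (x, y) position; speakers: list of (x, y) positions.
    rolloff_db: level drop per doubling of distance; blur: small offset that
    keeps the distance non-zero when the source sits on a loudspeaker.
    Speaker weights are omitted (all assumed equal).
    """
    a = rolloff_db / (20.0 * math.log10(2.0))
    dists = [
        math.sqrt((source[0] - x) ** 2 + (source[1] - y) ** 2 + blur ** 2)
        for x, y in speakers
    ]
    # Normalization so that the squared gains sum to one.
    k = 1.0 / math.sqrt(sum(1.0 / d ** (2.0 * a) for d in dists))
    return [k / d ** a for d in dists]


# Hypothetical irregular four-speaker layout (metres) and two touch points.
speakers = [(-2.0, 1.5), (1.0, 2.0), (2.5, -1.0), (-1.0, -2.0)]
for finger in [(0.0, 0.0), (-1.8, 1.2)]:
    gains = dbap_gains(finger, speakers)
    print(finger, [round(g, 3) for g in gains])
```

Since the computation depends only on distances, it makes no assumption about a
regular layout or a central listening position, which is what makes the approach
attractive for arbitrary loudspeaker setups.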
It is important to note that the aforementioned trajectory editing tools
typically do not incorporate specific sound spatialization algorithms and, as a
result, are not directly related to audio processing. The underlying sound
spatialization technique employed in conjunction with these tools may be any
algorithmic process, including those mentioned in Section 2.1.2, or any algorithm
with a similar functionality. The trajectories generated by these systems are
typically
interpreted as control signals that indicate the position of the sound object. Some
software applications utilize more general transmission protocols, typically MIDI
(Rothstein, 1995) in the early stages of development and OSC (Wright, 2005) in
the contemporary era. Other software programs employ specialized sound descrip-
tion formats, including SDIF (Wright et al., 1999), ASDF (Geier et al., 2010), and
SpatDIF (Peters et al., 2013).
2.4.1.2 Composition Toolchain
The editing of trajectories represents a minor aspect of spatial music compositions.
Many long-term projects are dedicated to the development of a comprehensive
toolchain, encompassing the editing of spatial properties (trajectory, shape, time-
stamp), the generation of multichannel audio, and numerous other functions. A
project that provides only the underlying audio processing hardware or software
is not applicable to the categorization criteria used in this section. In essence,
these systems are technically neutral and do not explicitly incorporate musical
tendencies. For instance, IEM-Ambisonics is typically viewed as a set of DAW
plugins designed for composition. However, its standalone builds and OSC control
features make it an adaptable and engaging software tool. In contrast, the Spat
system developed by IRCAM (Carpentier, 2018; Jot & Warusfel, 1995) serves as
an external library in Max/MSP. It offers real-time control and improvisation
capabilities, without limiting its integration as a foundational support in other
composition tools. It is more accurate to refer to such projects as toolkits rather
than toolchains. A number of similar toolkits are available, including SoundScape
Renderer (Geier et al., 2012), the Ambisonic Toolkit (ATK,
https://www.ambisonictoolkit.net/), and numerous others.
IanniX, named in honor of Iannis Xenakis, is not a software program designed
specifically for spatial music (Coduys & Ferry, 2004). The team that created it defines
IanniX as “a graphical open-source sequencer for digital art”
(https://www.iannix.org). However, IanniX
is well suited for composing complex spatial sound patterns, and is often cited in
papers on spatial authoring tools (Garcia et al., 2017; Jaroszewicz, 2015). IanniX
abstracts four core elements that comprise the sequential patterns: curves, trajec-
tories, triggers, and cursors. This is consistent with the design concepts of the tool
that will be presented in this thesis. The four fundamental elements can interact
with one another in a multitude of ways, thereby conferring upon the system a
high degree of flexibility in the generation of complex control signals. The control
signals, which utilize the OSC protocol, can be mapped to the sound spatialization
algorithms, such as those in the toolkits previously described. The IanniX software is highly
sophisticated, yet its flexibility and programmability also present significant chal-
lenges for users. The necessity of user-defined settings, for instance, makes it more
difficult to use.
Zirkonium was developed by the Center for Art and Media Karlsruhe (ZKM)
with the primary objective of composing for the Sound Dome at ZKM Kubus
(Miyama & Dipper, 2016). Nevertheless, as an extensively developed software, it
can be utilised in a variety of settings with disparate loudspeaker systems. It offers
a comprehensive toolchain for the creation of spatial music, encompassing but not
limited to the following capabilities: the generation of 2D or 3D loudspeaker setup
profiles, the creation and editing of parameter-based trajectories, the visualiza-
tion of real-time sound environments, the implementation of sound spatialization
algorithms (e.g., VBAP, HOA), and the integration of plugins for collaboration
with other software, synchronizing videos (ZirkVideoPlayer), and remote control
(ZirkPad). The initial release of the Zirkonium MK1 in 2006 marked the inception
of a novel concept: the integration of mathematical event-based sound movements,
rotations, and timing. This innovation served as a foundational framework for
subsequent development.
Like Zirkonium, other composition toolchains were designed for specific
venues. These include the SeamLess system for the TU Studio and the Humboldt
Forum Listening Room (Coler et al., 2021) and the complete spatial music solution
provided by 4DSOUND (https://4dsound.net/), which covers all the requisite
functionality in both control software and speaker hardware (Cross, n.d.;
Oomen et al., 2016).
Another significant undertaking is the development of a series of spatial mu-
sic composition tools based on the visual programming language for computer-
assisted music composition, OpenMusic (Bresson et al., 2011). Garcia, Bresson,
and other research colleagues have been engaged in the active exploration of spa-
tial music composition workflows and tools based on OpenMusic for several years
(Agger et al., 2017; Bresson, 2012; Bresson et al., 2017; Garcia, Bresson, &
Carpentier, 2015; Garcia, Bresson, Schumacher, et al., 2015; Garcia et al.,
2016; 2017). They have developed SPAT-SCENE, an auxiliary module in OpenMusic
for interacting with the Spat toolkit for timelines and
spatio-temporal specification, the 3DC module for displaying spatial trajectories,
and Trajectoires, a mobile application for real-time drawing and control of spatial
trajectories by finger touch, among other applications. In addition to the tradi-
tional spatial music composition toolchain that has been the subject of current
discussion, more experimental workflows are proposed. These will be highlighted
in Section2.5.
2.4.2 Performance Oriented
Performance has been central to the development of spatial sound control tools,
from the earliest sound diffusion practices to the present day. However, traditional
sound diffusion, with its roots in tape music, is now considered somewhat anti-
quated due to its reliance on fixed sound materials and inherent limitations. It is
challenging for a diffuser to be fully dynamic, akin to a conventional instrumental
performer, given that it is a time-varying system that they are playing with. The
tools utilized for sound diffusion-type practice have gradually transitioned from
performance-oriented types to those oriented towards composition, as discussed
in Section2.4.1.
Contemporary performance-oriented spatial sound control tools emphasize
real-time interaction, generation, and collaboration, alongside essential control
functions. These tools are heavily influenced by the improvisational music mind-
set, where creative processes unfold in real-time. Unlike traditional sound diffusion
—where spatial properties are manipulated live but the audio is pre-composed—
modern performers must simultaneously manage both sound and space.
This section is structured into three parts: the first provides a historical
overview of recognized spatial music performance devices; the second introduces
recent spatial sound control instruments for real-time performance; the third offers
a concise overview of a particular branch that utilizes automation in performance,
which then leads into the subsequent discussion.
2.4.2.1 Controllers in History
As previously discussed in Section2.3.2.1, it is pertinent to highlight the GME-
Baphone, which was developed with the objective of regulating the loudspeaker
orchestra (Clozier & Olsson, 2001). To be precise, the GMEBaphone is an instru-
mentarium. It is a complete system containing all the necessary equipment, from
speaker arrays to signal processing units and control consoles. The concept of
GMEBaphone here refers specifically to the consoles that have been used in more
than two decades of iterative development, starting with the GMEBaphone 2 and
continuing with the GMEBaphone 6/Cybernéphone.
The device appears to be more akin to a mixing console in the contemporary
sense than an instrument. The method of operation is to regulate the volume of
a specific speaker or group of speakers by interacting with a set of fader boards.
This can be regarded as the core playing style of common sound diffusion. How-
ever, what distinguishes GMEBaphone from a mixing console is its programmable
mapping engine, which is designed specifically for GMEBaphone loudspeaker sys-
tems. This enables performers to control the loudspeaker orchestra in real time
with greater accuracy and efficiency. This style is still evident in the contemporary
BEAST system (Wilson & Harrison, 2010).
To facilitate performances at Expo 1970 in Osaka, a spherical
sound controller was constructed to control the spherical speaker array within the
German Pavilion (Brech, 2015). The controller comprised 50 sensor buttons, each
of which was mapped to a specific loudspeaker group. The sound direction was
altered when a button was pressed. This instrument enabled the specific volume
control to be carried out by the control circuit, thus moving away from the sliding-
type gesture to the push-type gesture, which facilitated faster and denser control
signals.
In 1984, Luigi Nono’s Prometeo premiered, featuring an instrument called the
Halaphon (Brech et al., 2015). This hybrid analog-digital spatialization system
could route input signals to the speakers during the performance. The Halaphon
originated as a digital musical instrument that enabled more complex control logic.
Following this, spatial sound controllers with purely analog circuits were officially
consigned to history.
2.4.2.2 Spatial Instruments
In recent years, there has been a notable surge in the development of instruments
for real-time spatial music performance. A comprehensive review by Pysiewicz
and Weinzierl has already been conducted (Pysiewicz & Weinzierl, 2017), thus
obviating the need for this thesis to reiterate the list of relevant works for analy-
sis. In their review, the instruments are classified according to three dimensions:
the controller type/interface, the controlled spatial parameters, and the scope of
control. The controller type/interface refers to the manner in which the controller
interacts with the player. The controlled spatial parameters were classified into
three categories based on the proximity to spatial properties, ranging from the
basic spatial location to the acoustic properties of the listening environment. The
scope of control refers to whether the controller solely provides functionality for
control signal generation or whether it involves a sound synthesis system. It is
strongly advised that readers consult the original article for a more comprehensive
analysis.
It is important to note that this review explicitly addresses the spatial sound
control tools discussed here in the context of Digital Musical Instruments (DMI) and
Human-Computer Interaction (HCI). Therefore, software for automatic spatial
sound control without an explicit user interface or physical controller, as well as
algorithms, are not included in this review.
2.4.2.3 Matrix-based Diffusion
Most composition tools, as noted in Section2.4.1, incorporate automation features
such as automated trajectory control and sequencer-style automation. While au-
tomation in composition feels intuitive, its role in performance raises concerns
due to the performer’s need for greater control. However, human capability to
manage control is inherently limited. Introducing more automation can alleviate
the burden of repetitive tasks and reduce the complexity of multitasking during
performances. Simultaneously, it can enhance the playability and dynamic inter-
action with the music.
The use of faders for group control, as discussed in Section 2.4.2.1, marks
an early form of performance automation. Many contemporary tools have tran-
sitioned from traditional sound diffusion methods to a point source paradigm,
where automation is predominantly built on object-based sound spatialization
technology. Despite this trend, some software still seeks to offer more adaptable
automation within the traditional sound diffusion framework. These programs of-
ten bypass object-based spatialization techniques in favor of matrix-based control
logic, allowing for greater flexibility in automation without departing from estab-
lished practices.
The DM8 system was first proposed by Barry Truax in his article on the
concepts of “space in sound” and “sound in space” (Truax, 1998). This is a ma-
trix-based system for mapping eight input signals to eight output signals. The
user has the option of manually assigning mapping relationships between inputs
and outputs, either statically or dynamically. During a performance, the user can
also cross-fade between eight different mapping patterns, thus creating complex
variations in sound diffusion. Similarly, the M2 system employs a matrix-based
mapping architecture for the software component and a straightforward 32-fader
architecture for the control hardware (Mooney et al., 2004). The software does not
preset any speaker array structure or fader mapping. Instead, it provides a flexible
mapping editing function that allows users to customize the input/output mapping
mode and fader control mapping according to their own needs. The M2 system
was originally designed to provide the greatest possible freedom of expression in
the context of improvisation, with the objective of facilitating the discovery of new
compositional avenues through improvisation. Resound represents a novel gener-
ation of matrix-based sound diffusion software that has been developed through
the accumulation of experience with the M2 system over an extended period of
time (Mooney & Moore, 2007; 2008; Stefani & Mooney, 2009). It presents a series
of creative mapping strategies, integrated as presets, which collectively constitute
a highly playable semi-automatic control device.
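The matrix-based control logic described above can be illustrated with a minimal sketch. It assumes that a mapping pattern is a plain gain matrix and that the cross-fade between two patterns is a linear interpolation; neither assumption is taken from the DM8, M2, or Resound implementations, and all function names are hypothetical.

```python
# Minimal sketch of matrix-based diffusion (hypothetical names, not DM8/M2/Resound code).
# A mapping pattern is a gain matrix G where G[out][inp] is the gain from input to output.

def crossfade(pattern_a, pattern_b, t):
    """Linearly interpolate two gain matrices; t=0 gives A, t=1 gives B (assumed law)."""
    return [
        [(1.0 - t) * a + t * b for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(pattern_a, pattern_b)
    ]

def apply_matrix(gains, input_frame):
    """Mix one frame of input samples into one frame of output samples."""
    return [sum(g * x for g, x in zip(row, input_frame)) for row in gains]

def identity(n):
    """Pattern that sends input j to output j."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def rotated(n, shift):
    """Pattern that sends input j to output (j + shift) mod n."""
    return [[1.0 if (j + shift) % n == i else 0.0 for j in range(n)] for i in range(n)]

N = 8                                    # eight inputs, eight outputs
frame = [1.0] + [0.0] * (N - 1)          # signal present only on input 0
halfway = crossfade(identity(N), rotated(N, 1), 0.5)
out = apply_matrix(halfway, frame)       # input 0 is shared by outputs 0 and 1
```

Halfway through the cross-fade, the signal on input 0 is split equally between outputs 0 and 1. A performer's fader would drive `t`, so a single gesture changes the whole routing rather than one loudspeaker level.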
Admittedly, this particular offshoot appears somewhat incongruous within
the prevailing mainstream trends. Moreover, the continued reliance on the fader
as the controller is somewhat uninspiring. Nevertheless, it is evident that this type
of system continues to possess intrinsic value, as it facilitates the accomplishment
of tasks that would be challenging to achieve within object-based spatialization,
as discussed Section2.3.1.
2.4.3 Discussion
The role of faders extends beyond their traditional hardware implementations. As
demonstrated in the Resound system, faders do more than merely adjust volume
and sound trajectories; they fundamentally influence the “behaviors” of the spa-
tialization system. By reconceptualizing faders as tools for parameter mapping
—from simple one-to-one mappings to intricate configurations—they enable deep
and nuanced manipulation, offering richer and more complex control over the sys-
tem’s behaviors. Embracing this expanded mindset allows for the exploration of
a broader range of control algorithms for sound spatialization. In this approach,
performers or composers influence the behavior of these algorithms rather than
directly manipulating the spatial properties of individual sound materials. This
method facilitates the realization of unique spatial effects that alter not only
the spatial properties but also the timbre of sound materials. Such algorithmic
control fosters a more abstract form of spatial music aesthetics that provides a
deeply integrated sound-spatial experience, going beyond traditional localization
and movement of sound sources. This evolution blurs the lines between composing
sound timbre and spatiality. The term “spatial texture” is used throughout this
thesis to describe all related topics. The next section will delve deeper into spatial
texture from both theoretical and technical perspectives.
2.5 Spatial Texture
Most tools and algorithms discussed so far adhere closely to the traditional objec-
tives of sound diffusion. In these contexts, spatial properties of sound are lever-
aged to enhance other intrinsic attributes such as dynamics and timbre, which
are largely determined by recording or synthesizing methods. Although spatial
attributes are recognized as critical elements in these works, they often play a
secondary role or are considered in later stages of composition or performance.
This relatively loose integration limits the utilization of spatiality to a macro-scale
musical structure.
In the realm of electroacoustic music, the focus has traditionally been on tim-
bre, with scholars shifting their attention from broad sonic features to the more
intricate details of sound texture. This trend highlights a growing interest in the
nuanced aspects of sound perception and manipulation. Such a concentrated ex-
amination of finer details has fostered a rigorous exploration of spatial texture,
defined as the integration of spatial attributes with sound’s textural qualities. The
distinction between sound texture and spatial texture is becoming increasingly
subtle, reflecting a broader trend towards a more integrated understanding of
sound’s spatial and timbral dimensions.
This section will commence with an introduction to the pertinent theories,
after which it will proceed to a categorization and analysis of the various specific
creation techniques.
2.5.1 Relevant Theories
2.5.1.1 Spectromorphology & Spatiomorphology
Spectromorphology, a term coined by Denis Smalley (Smalley, 1997), offers a de-
tailed lens through which to describe and analyse the listening experience by fo-
cusing on the interaction between sound spectra (spectro-) and the ways in which
these sounds change and are shaped over time (-morphology). This approach al-
lows for a nuanced examination of the temporal and spectral structure of sounds
as they evolve, providing a vocabulary for discussing the otherwise abstract expe-
rience of listening to electroacoustic compositions. Spatiomorphology, introduced
in the same paper, further extends these concepts by incorporating the spatial
dimensions of sound.
The concept of “source bonding” plays a pivotal role in Smalley’s theory,
focusing on the listener’s innate inclination to connect sounds with their
perceived origins or causes. This notion intricately weaves the extrinsic charac-
teristics of sound—its source and the method of its creation—with the listener’s
experience, whereby the origin of a sound, when obscured or abstracted, shifts the
listening focus towards the sound’s inherent qualities. This shift in focus becomes
particularly compelling in the context of spatial attributes of sound. When the
external characteristics, such as the source’s movement or position, become indis-
tinct, the listener’s attention is naturally drawn to the intrinsic spatial qualities
of the sound. This emphasis on the intrinsic spatial properties aligns closely with
Smalley’s exploration of spatial texture, understood as the revelation of spatial
perspective over time. It’s not merely about the movement or position of sound in
space but about how the listener perceives and interprets the spatial dimensions
and qualities of sound as it unfolds. The absence or abstraction of clear external
sources encourages a deeper engagement with these spatial textures, allowing lis-
teners to appreciate the subtleties of spatial expression and the nuanced interplay
between sound and space.
The theoretical frameworks of spectromorphology are of paramount impor-
tance, providing structured approaches to the analysis and understanding of the
complex interplays between sound’s spatial and timbral properties. Subsequent
explorations have been to a greater or lesser extent influenced by this theory, de-
spite their different foci.
2.5.1.2 Textural Composition
Textural composition is a method of creating real-time computer music based
on acousmatic and stochastic concepts, manifested as sound metaobjects (Hagan,
2017). It relies on agile sounds that do not require conventional trajectory-based
spatial techniques. It serves as a bridge linking tape music and real-time computer
music, blurring the lines between sound objects and soundscapes, as well as point-
source and trajectory-based spatialization. In textural composition, the sonic tex-
ture is given precedence over other musical elements, with the aim of creating
distinct spatial and temporal experiences. The intention is to provide listeners
with a broad, immersive auditory experience characterised by slow, environmen-
tal shifts in time. The underlying philosophy of textural composition draws upon
the aesthetic qualities of sound objects found in acousmatic music, as well as the
expansive sound masses influenced by Iannis Xenakis. The aesthetic concepts pre-
sented in the textural composition are in alignment with the potential outcomes
achievable through the system outlined in this thesis.
2.5.2 Creation Techniques
For the purpose of introducing techniques for spatial texture, the classification
method by Lynch (Lynch & Sazdov, 2011) serves as a reference. This method di-
vides spatial texture techniques into three categories according to the underlying
implementation logic: Spectral Spatialization, Spatial Granulation, and Panning &
Decorrelation. This thesis will reorganize the secondary classifications under the
broad categories and add some new examples from recent years.
2.5.2.1 Spectral Spatialization
The field of frequency domain analysis and processing has been a pivotal tool in
the understanding of sound characteristics and the processing of sound details. The
analysis of the spectrum can be employed as a method of intuitively understanding
and manipulating sound. The majority of frequency domain processing in music is
conducted through the use of the Fast Fourier Transform (FFT) algorithm. Some
sophisticated audio analysis or Music Information Retrieval (MIR) tools employ
alternative frequency domain transform algorithms that are not addressed here.
Spectral spatialization may be conceptualized as a process wherein the spectrum
components are regarded as the fundamental basis for spatialization. Regardless
of the specific details of the various spectral spatialization algorithms, the funda-
mental concept is the same: the application of distinct spatialization treatments
to each group of spectrum components.
Normandeau put forth a method he referred to as timbre spatialization (Nor-
mandeau, 2009). The underlying concept is that by directly assigning different
frequency components of the sound to different loudspeakers, the entire spectrum
of sound is virtually reassembled in the listening space, resulting in a sound that
has an integral spatiality within the original timbre. Such effects can be achieved
by assigning different bandwidth filters to each speaker. The endeavors of the
loudspeaker orchestra, which employ a variety of speakers, exhibit a certain degree
of consistency in their approach to timbre spatialization. This is because we can
simply understand the different types of speakers as consisting of perfect playback
speakers and pre-filter setups. Timbre spatialization can also be manipulated in
a multitude of ways, as evidenced by Garcia et al.’s workflow, which combines
bandpass filter banks and sound movement (Garcia, Bresson, Schumacher, et al.,
2015). A patch was constructed in OpenMusic for the purpose
of distributing an audio signal to a fixed eight-channel loudspeaker ring. Each
loudspeaker is assigned a specific band-pass filter. The sound shifts between the
speakers, thereby inducing a coherent spectral and positional change.
The method of analysis/re-synthesis spatialization is not fundamentally dis-
tinct from timbre spatialization. The only difference is that, in this instance, the
frequency component manipulation is no longer conducted via a filter bank, but
rather through direct control of the frequency bins subsequent to the FFT.
There are additional, more elaborate techniques that build upon this
foundation. One technique is the extended spectral delay effect, whereby the
delayed, resynthesized sound from individual FFT bins is sent to individual channels or
sound objects (Kim-Boyle, 2008). Another approach is to generate spatialization
patterns through analysis and mapping of spectral properties to spatialize another
input signal. Further creative techniques can be found in the original papers
(Jaroszewicz, 2015; Torchia & Lippe, 2004). Other, more complex techniques
involve the use of particle systems, as proposed by Kim-Boyle, which are categorized
in Section 2.5.2.2 (Kim-Boyle, 2008).
Wave terrain synthesis represents a relatively self-contained multimodal
sound synthesis method, or alternatively, it can be interpreted as a kind of ex-
tended wavetable synthesis. In contrast to the majority of algorithms, which are
exclusively audio-centric, wave terrain synthesis employs graphical multidimen-
sional surfaces analogous to topographical maps as the foundation for sound gen-
eration (James, 2005). James has been engaged in an extensive investigation of
wave terrain synthesis as a means of spectral spatialization for an extended period
(James, 2012; 2015; 2016). The term wave terrain spatialization has been selected
to encapsulate the sound spatialization approach that employs this concept. The
fundamental concept is to first construct a topographical map as the target for
spectral distribution, then map the height information in this terrain to the
desired spectral components, and finally employ an enhanced spatial panning
algorithm driven by an audio-rate control signal to achieve spatialization. The precise specifics
of the implementation vary from article to article. The explanation provided here
is largely based on the version from 2015 (James, 2015). This version explicitly
states in the title that the algorithm is inspired by the theories related to spec-
tromorphology. The distinguishing feature of this method is its inversion of the
spatialization method with respect to the generation of the spectral distribution.
This results in a representation that is both straightforward and visually appeal-
ing. The image of the terrain surface and the corresponding image of the sound
trajectory permit the user to comprehend the current spatialization effect in an
intuitive manner and to regulate it in a dynamic manner. Furthermore, the study
places significant emphasis on the importance of the audio-rate control signal in
the spatialization process and fully exploits the capabilities of gen~ in Max/MSP.
2.5.2.2 Spatial Granulation
In conjunction with the spatialization techniques that operate within the frequency
domain, temporal processing represents another primary method for the genera-
tion of spatial texture. The principal method for modifying sound texture in the
time domain has its origins in a long history of research in the field of microsound
(Thomson, 2004), which concerns the manipulation of sound fragments that last
for very short periods of time. The most widely recognized practical application of
this research is granular synthesis (Roads, 1978; Truax, 1988). There has always
been a close relationship between granular synthesis and sound spatialization.
Barry Truax, the author of the inaugural real-time granular synthesis algorithm,
posits that granular synthesis can be employed as a means of influencing the per-
ception of spatiality in sound (Truax, 1998). Algorithms that employ real-time
granular synthesis for sound spatialization were implemented at an early stage in
Max/FTS (Todoroff, 1995), the predecessor of the current Max/MSP. A recent
example is the Ambisonics-based GranularEncoder plug-in within the IEM Plug-in
Suite (https://plugins.iem.at/docs/granularencoder/).
The concept of spatial granulation can be elucidated as the temporal disas-
sembly (granulation) of the input signals and their corresponding distribution to
the multi-channel system. The technical challenge of achieving granulation and
distribution has been overcome with the advent of today’s computer performance.
The difficulty of the algorithm lies in the effective control of the behavior of hun-
dreds or thousands of grains and their distribution to the speaker array.
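The two steps named above, temporal disassembly and distribution, can be sketched in a few lines. This is an illustrative sketch only: the Hann window, the fixed hop size, and the random choice of output channel are assumptions made for the example and do not reproduce any cited system.

```python
import math
import random

def granulate(signal, grain_len, hop, num_channels, seed=0):
    """Cut a mono signal into windowed grains and send each grain to one channel."""
    rng = random.Random(seed)
    # Periodic Hann window: with hop = grain_len / 2 the overlapping grains sum to 1.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / grain_len) for i in range(grain_len)]
    outputs = [[0.0] * len(signal) for _ in range(num_channels)]
    for start in range(0, len(signal) - grain_len + 1, hop):
        ch = rng.randrange(num_channels)     # assumed distribution rule: uniform random
        for i in range(grain_len):
            outputs[ch][start + i] += signal[start + i] * window[i]
    return outputs

signal = [1.0] * 64
outputs = granulate(signal, grain_len=16, hop=8, num_channels=4)
```

Away from the edges, the channels sum back to the input, so the signal is redistributed in space rather than altered in level. The control problem discussed in the text corresponds to replacing the random channel choice with a behavioural model.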
With regard to the behaviour of grains, the most well-known approach is to
control the overall behaviour of a group of homogeneous individuals based on the
Boids algorithm (Reynolds, 1987). The algorithm identifies three fundamental
behavioral patterns exhibited by individuals in a group: separation, cohesion, and
alignment. It then determines the optimal next step for each individual in the
group, with the goal of achieving unified control over the collective behavior of the
group. The concept has been previously explored in granular synthesis algorithms
without sound spatialization (Blackwell & Young, 2004).
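The three behavioural rules can be made concrete with a minimal two-dimensional update step. This is a sketch under simplifying assumptions: the neighbourhood radius, the rule weights, and the plain Euler integration are illustrative choices rather than values from Reynolds or from the cited granular systems.

```python
import math

def boids_step(positions, velocities, radius=2.0, dt=0.1,
               w_sep=0.05, w_ali=0.05, w_coh=0.01):
    """Advance all boids by one step; positions and velocities are lists of (x, y)."""
    new_velocities = []
    for i, (p, v) in enumerate(zip(positions, velocities)):
        near = [j for j in range(len(positions))
                if j != i and math.dist(p, positions[j]) < radius]
        ax = ay = 0.0
        if near:
            n = len(near)
            # Cohesion: steer towards the neighbours' centre of mass.
            ax += w_coh * (sum(positions[j][0] for j in near) / n - p[0])
            ay += w_coh * (sum(positions[j][1] for j in near) / n - p[1])
            # Alignment: steer towards the neighbours' mean velocity.
            ax += w_ali * (sum(velocities[j][0] for j in near) / n - v[0])
            ay += w_ali * (sum(velocities[j][1] for j in near) / n - v[1])
            # Separation: steer away from each neighbour.
            ax += w_sep * sum(p[0] - positions[j][0] for j in near)
            ay += w_sep * sum(p[1] - positions[j][1] for j in near)
        new_velocities.append((v[0] + ax, v[1] + ay))
    new_positions = [(p[0] + dt * v[0], p[1] + dt * v[1])
                     for p, v in zip(positions, new_velocities)]
    return new_positions, new_velocities

# Two boids at rest, cohesion only: they start moving towards each other.
pos, vel = boids_step([(0.0, 0.0), (1.0, 0.0)], [(0.0, 0.0), (0.0, 0.0)],
                      w_sep=0.0, w_ali=0.0, w_coh=1.0)
```

In a spatial granulation context, each boid position would typically be read as the position of one grain stream, so three weights steer the behaviour of the whole group.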
Among the most successful practices in spatial texture are the swarm lab
system (Davis & Rebelo, 2005) and spatial swarm granulation (Wilson, 2008).
Another prevalent approach is the utilization of corpus-based or dictionary-based
methodologies (Einbond & Schwarz, 2010; McLeran et al., 2008). Although the
algorithms of these two methods are entirely distinct, the underlying concept is
to reduce the dimensionality of the grains and cluster them to construct a dimen-
sionally controllable representation space, upon which the mapping between the
representation space and the three-dimensional physical space can be realized.
In regard to the strategy of distributing grains towards the speaker array,
the most straightforward approach is to leverage the underlying support of the
object-based spatialization technique. The most prevalent approach is to associate
each grain stream with a sound source in ambisonics, or to spatially localize it
via VBAP. This approach obviates the difficulty of making specific channel map-
pings. Nevertheless, in the context of other specialized speaker arrays, there will
also be evident methodologies for the design of channel-based distribution strate-
gies. To illustrate, the spatial swarm granulation previously mentioned is based
on the BEAST loudspeaker system. The grain flow distribution strategy employs
the kd-tree algorithm to assign each boid to the closest speaker. Moreover, for
the High-Density Loudspeaker Array (HDLA), which is frequently employed in
contemporary research facilities, the attempts to map a grain stream to a specific
speaker are just as efficacious as the previous approaches (Garavaglia, 2016).
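The channel-based strategy of assigning each boid to its closest loudspeaker can be sketched as below. Note the substitution: a brute-force nearest-neighbour search is used here instead of a kd-tree, which gives the same result and is only slower for large arrays. The speaker layout and boid positions are hypothetical.

```python
import math

def nearest_speaker(position, speakers):
    """Return the index of the loudspeaker closest to a grain or boid position."""
    return min(range(len(speakers)), key=lambda i: math.dist(position, speakers[i]))

# Hypothetical four-speaker square layout (x, y, z in metres).
SPEAKERS = [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.0), (-1.0, -1.0, 0.0), (1.0, -1.0, 0.0)]
boids = [(0.8, 0.9, 0.1), (-0.2, -0.7, 0.0)]
assignment = [nearest_speaker(b, SPEAKERS) for b in boids]
```

Each grain stream is then written directly to the channel of its assigned speaker, so no panning law is involved.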
It is possible to utilise both spectral spatialisation and spatial granulation in
conjunction with one another. For instance, Kim-Boyle proposed a method for generating
spatial texture by mapping the particle positions in a particle system controlled
by the Boids algorithm to specific Fast Fourier Transform (FFT) bins (Kim-Boyle,
2008).
2.5.2.3 Panning & Decorrelation
Both spectral spatialization and spatial granulation seek to deconstruct sound
into its constituent elements, thereby transforming the control of a single sound
source by the overall spatial properties into the control of a multitude of sound
elements. The notion that sound must be miniaturised in order to exert control
over its microscopic properties is a concept that is intuitively appealing. However,
there are alternative approaches that can be employed to achieve a subtle and
infectious spatial texture without the necessity of dismantling the sound. Two dis-
tinct algorithms are presented here: decorrelation and panning. The decorrelation
algorithm is designed to produce differences in detail between audio streams, while
the panning algorithm employs rapid movement of the sound source.
Decorrelation modifies an original audio signal into multiple outputs with
distinct waveforms that are perceived similarly to the source (Kendall, 1995).
Strictly, these sounds differ physically but share perceptual qualities, making them
indistinguishable as separate sources. This physical variance enhances spatial per-
ception through psychoacoustic effects. This principle underpins stereophony and
head-related transfer function (HRTF)-based spatial audio. When the phenome-
non of decorrelation is discussed outside the context of binaural hearing, it is ca-
pable of producing not only a psychoacoustic sense of spatiality, but also acoustic
phenomena that actually exist in space. In the most basic instance, a comb filter is
the result of superimposing a source in the sound field with a second source that is
slightly delayed in time. This phenomenon is typically avoided in traditional room
acoustic design, yet it can be utilized in spatial music creation. Other methods of
decorrelation can facilitate a more engaging spatial listening experience.
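As a minimal illustration of the comb filter case mentioned above, the following sketch evaluates the magnitude response that results when a source is summed with an equal-level copy of itself delayed by a short time. The function names and the 1 ms delay are illustrative assumptions and are not taken from the cited literature.

```python
import cmath
import math


def comb_magnitude(freq_hz: float, delay_s: float) -> float:
    """Magnitude of H(f) = 1 + exp(-j*2*pi*f*delay) for two equal-level sources."""
    return abs(1 + cmath.exp(-2j * math.pi * freq_hz * delay_s))


def notch_frequencies(delay_s: float, max_hz: float) -> list:
    """Frequencies (2k + 1) / (2 * delay) at which the two copies cancel."""
    notches = []
    k = 0
    while (2 * k + 1) / (2 * delay_s) <= max_hz:
        notches.append((2 * k + 1) / (2 * delay_s))
        k += 1
    return notches


# Hypothetical example: with a 1 ms delay the first notch lies at 500 Hz,
# and the notches repeat every 1000 Hz above it.
print(notch_frequencies(0.001, 3000.0))
```

The regular spacing of the notches is what gives the comb filter its name; in a real room the two sources would rarely arrive at equal level, so the cancellations would be shallower than in this idealised sketch.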
The generation of multiple decorrelated sound sources from a single sound
source can be achieved either with all-pass filters for phase adjustment or with
an FFT whose phases are reset before the sound sources are resynthesised.
Another approach is to convolve the source with
different signals, which is not fundamentally different from using an all-pass filter.
However, it should be noted that this approach does not adhere to the strict
definition of perceptual consistency. The generation of spatial texture can tolerate certain
audible differences between the sources. When spatialized sound is synthesized by
a synthesizer, it is possible to create a decorrelation-like effect by using sounds in
different channels with a small difference in the synthesizer parameters. This is the
concept of topographic synthesis (Nystrom, 2018). The author of this algorithm
does not propose categorizing it under decorrelation.
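The all-pass variant of decorrelation can be sketched as follows. This is a minimal example under stated assumptions: it uses a Schroeder all-pass section, and the delay lengths, the gain and the function names are hypothetical choices rather than values from Kendall (1995) or any cited tool.

```python
def schroeder_allpass(x, delay, gain):
    """y[n] = -g*x[n] + x[n-D] + g*y[n-D]: flat magnitude, altered phase."""
    y = [0.0] * len(x)
    for n in range(len(x)):
        x_d = x[n - delay] if n >= delay else 0.0
        y_d = y[n - delay] if n >= delay else 0.0
        y[n] = -gain * x[n] + x_d + gain * y_d
    return y


def decorrelate(x, delays, gain=0.5):
    """Return one all-pass filtered copy of x per entry in `delays`."""
    return [schroeder_allpass(x, d, gain) for d in delays]


# Hypothetical usage: one impulse sent through two different all-pass filters.
impulse = [1.0] + [0.0] * 511
left, right = decorrelate(impulse, delays=[7, 13])

# Both outputs keep the energy of the input, but their waveforms differ.
energy_left = sum(v * v for v in left)
energy_right = sum(v * v for v in right)
```

Because each filter leaves the magnitude spectrum untouched, the copies remain close in timbre while their waveforms, and hence their cross-correlation, differ.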
The concept of panning is referenced on several occasions in Section 2.1, where
spatial audio algorithms developed on top of panning are also described. This
operation is fundamentally linked to traditional spatial audio applications and
spatial music aesthetics. As with other basic operations of audio processing, such as
amplitude modulation and frequency modulation, panning has a dominant effect on
the auditory experience when applied at high rates and modulation depths.
Furthermore, panning can be applied with extreme parameter settings, resulting
in the original signal becoming unrecognisable.
The spatial rapid panning technique is usually underpinned by object-based
spatialization algorithms (Schmele & Lopez, 2022). It is assumed that a control
signal at audio rate is applied to a virtual sound source, causing it to move rapidly
between two points in space to produce significant amplitude modulations and
Doppler shifts (Schmele, 2011). Furthermore, when the rate of panning exceeds a
certain threshold, the sound enters the domain of microsound, resulting in a texture
that is similar to that of granular techniques, as discussed in Section 2.5.2.2
(McGee, 2015).
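The amplitude modulation produced by rapid panning can be illustrated with a small sketch. It assumes equal-power panning between only two loudspeakers and a sinusoidal position signal; the function name `rapid_pan` and all parameter values are hypothetical, and the Doppler shift mentioned above is not modelled because no propagation delay is simulated.

```python
import math


def rapid_pan(x, pan_rate_hz, sample_rate=48000):
    """Pan a mono signal between two loudspeakers with a sinusoidal position.

    The position p[n] in [0, 1] drives equal-power gains cos/sin, so each
    channel is amplitude-modulated at the panning rate.
    """
    left, right = [], []
    for n, sample in enumerate(x):
        p = 0.5 + 0.5 * math.sin(2 * math.pi * pan_rate_hz * n / sample_rate)
        left.append(sample * math.cos(p * math.pi / 2))
        right.append(sample * math.sin(p * math.pi / 2))
    return left, right


# Hypothetical example: a 440 Hz tone panned at 200 Hz, i.e. at audio rate.
sr = 48000
tone = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr // 10)]
left, right = rapid_pan(tone, pan_rate_hz=200.0, sample_rate=sr)
```

At panning rates of a few hertz the result is heard as movement; at rates such as the 200 Hz used here, each channel carries audible modulation sidebands instead.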
The concepts of the panning and decorrelation approaches are more readily
comprehensible than the aforementioned time/frequency domain decomposition
approaches to spatialization. However, this does not imply that they are any
less efficacious. Both involve actual acoustic phenomena, whereas the
decomposition methods are still predominantly focused on the audio signal. In
practice, the panning and decorrelation approach may result in a multitude of
unexpected outcomes.
2.5.3 Spatialization as Synthesis
Once the three primary categories of techniques have been established, the concept
of spatial sound synthesis becomes more readily comprehensible. It is erroneous to
view spatial sound synthesis as a distinct branch of synthesis algorithms existing
in isolation. Rather, it should be conceptualised as a more forward-looking way of
thinking (Clarke, 1999). In the techniques above, space and timbre have been
unified, and the distinction between controlling timbre and controlling spatiality has
nearly vanished. Nevertheless, there persists the presupposition
that the sounds exist before being spatialized, even if the spatialized sound
is entirely distinct from the original. At this juncture, the appropriate term for
these methods is the spatial effect. However, if one introduces the concept of spa-
tial sound synthesis, the spatial aspects are considered at an early stage of the
synthesis process and thus become an integral part of the concept. Consequently,
sound is ultimately “synthesized” in real physical space.
Some of the algorithms are defined by the researchers as spatial sound syn-
thesis. A significant proportion of the concepts associated with spatial sound syn-
thesis can be considered spatialised extensions of conventional sound synthesis
algorithms. For instance, spatial granulation techniques in Section 2.5.2.2 are still
regarded as either granular synthesis or concatenative synthesis methods.
Spatio-Operational Spectral (SOS) synthesis (Topper et al., 2003) involves the rotation of
single partial components of basic waveforms in an additive synthesis process on
a circular loudspeaker setup using VBAP. This is consistent with the primary
concept of timbre spatialization, particularly their proposed extension method
utilizing subband decomposition. Some panning-based methods define themselves
as spatial modulation synthesis (McGee, 2015) or rapid panning modulation syn-
thesis (Schmele, 2011). Spectro-Spatial Sound Synthesis (Coler, 2019), which dis-
tributes the sounds of musical instruments as point clouds, represents a hybrid
use of spectral and temporal decomposition, albeit with a subtle reverse-thinking
approach. In topographic synthesis (Nystrom, 2018), each loudspeaker of a mul-
tichannel system is assigned an individual instance of a synthesis process. This
may be any general synthesis technique. When these parallel processes are de-
terministic and driven with the same input parameters, all speakers’ signals are
identical. Parameter distributions can be used to create instantaneous or evolutive
spatial textures. Topographic synthesis is indifferent to the spatial configuration
of loudspeaker systems, treating the loudspeakers as a sorted array. The author
provides a comprehensive examination of the similarities and distinctions between
topographic synthesis and decorrelation.
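The core idea of topographic synthesis, one synthesis instance per loudspeaker with a parameter distribution across the array, can be sketched in a few lines. The sine oscillator, the linear frequency spread and the name `topographic_sines` are assumptions chosen for brevity; they do not reproduce Nystrom's implementation.

```python
import math


def topographic_sines(base_freq, spread_hz, num_speakers, num_samples,
                      sample_rate=48000):
    """One sine oscillator per loudspeaker, treated as a sorted array.

    Speaker i receives the frequency base_freq + i * spread_hz. With
    spread_hz == 0 every speaker receives exactly the same signal.
    """
    channels = []
    for i in range(num_speakers):
        freq = base_freq + i * spread_hz
        channels.append(
            [math.sin(2 * math.pi * freq * n / sample_rate)
             for n in range(num_samples)]
        )
    return channels


# Identical parameters give identical channels; a small spread detunes them.
identical = topographic_sines(220.0, 0.0, 4, 480)
textured = topographic_sines(220.0, 0.5, 4, 480)
```

Note that the speaker positions never enter the computation, which reflects the statement above that the method is indifferent to the spatial configuration of the loudspeaker system.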
2.5.4 Related Tools
The techniques presented in this section are more experimental and less widely
used than those presented in Section2.4. Consequently, it is challenging to provide
a comprehensive overview of the diverse array of tools presented in that section.
The majority of the techniques are implemented in audio programming software
and have not been further developed to the extent that they can be easily used by
others. The following list provides a brief overview of some of the projects that
focus on this area.
BEASTmulchLib is a SuperCollider class library developed for the BEAST
project that offers advanced spatial techniques, including spatial swarm granulation
and other unconventional signal routing techniques (Wilson, 2009). The
OMPrisma and OM-Sox libraries, which are part of the OpenMusic suite of tools,
provide a general framework for controlling spatial sound synthesis and incorporate
sound spatialization (Schumacher & Bresson, 2010). These libraries conceptualize
spatial sound rendering as an essential element of sound synthesis, thereby
elevating spatial parameters to the status of abstract musical materials within a
comprehensive compositional framework. Live 4 Life is a spatial performance tool
designed to facilitate the creation of sound across multiple loudspeakers in Super-
Collider (Lengelé, 2018). By focusing on spatial rhythmic patterns and synthesis
parameter loops, Live 4 Life aims to enhance the interaction between sound ob-
jects and their spatial attributes, catering to a diverse range of performance setups
and experimental sound exploration. The ImmLib software has been developed
for the composition of spatial music on grid-based loudspeaker systems (Negrao,
2014). It enables the creation of multiple decorrelated sound streams at different
locations, aiming to form a broad sound source with unique spatial qualities. Built
in SuperCollider, ImmLib simplifies the generation of these streams from a single
synthesis definition and offers tools for crafting spatial patterns on a virtual sur-
face by modulating synthesis parameters.
2.6 Discussion
This chapter provides a detailed overview of the theoretical and technical aspects
of spatial music applications, aiming to establish a comprehensive coordinate sys-
tem to clearly define the Zerr* approach. This system will clarify what Zerr* is and
what it is not, delineating its suitability for specific outcomes and its limitations
in certain scenarios.
Four key aspects for categorizing related algorithms, software, instrumentation,
and other items have been identified and can be summarized as follows:
Speaker-centric & Source-centric: This
distinction determines whether the application has a defined notion of a virtual
source or manipulates the speakers directly.
Composition-oriented & Performance-oriented: Determines if the appli-
cation is designed to produce fixed works or address performance-related chal-
lenges.
Scope of Realization: Considers whether the application is tailored for spe-
cific speakers or arrays, integrates a defined sound engine, or correlates with
specific musician gestures, akin to instrument design analysis.
Aesthetic Inclination: Evaluates whether the application adheres to tradi-
tional aesthetic principles or aligns with preferences for spatial texture.
While the applications discussed in this chapter can be classified according to the
four key indicators, a detailed taxonomy will not be attempted here for brevity.
The next chapter will define the Zerr* System using these metrics and discuss its
design concept in depth, comparing it to other applications.
Chapter 3
Zerr* Approach
3.1 Approach Classification
Before outlining the methodology that underpins the Zerr* approach and the de-
tails of the concept design, it is first necessary to categorize the Zerr* approach
according to the four sets of indicators previously mentioned. It is hoped that this
will enable a clear delineation of the boundaries of the discussion.
Speaker-centric or Source-centric: The Zerr* approach is explicitly
speaker-centric. It controls the signals delivered to each speaker directly, with-
out utilizing a concept of a virtual source. As a result, Zerr* cannot achieve
accurate spatial positioning and movement. Although it is possible to simulate
the effect of sound source movement under specific parameter settings, compar-
ing the realism and stability with those achieved by object-based spatialization
algorithms would be inappropriate.
Composition-oriented or Performance-oriented: The Zerr* approach is
primarily used in live performances, particularly for real-time spatial music
improvisation. Unlike composition-oriented tools, it lacks non-linear editing
and scoring capabilities. All audio and control signals within the system are
processed and transmitted in real time.
Scope of Realization: The Zerr* system is compatible with any loudspeaker
setup that allows direct control of the input signals for each loudspeaker. In
addition, Zerr* can process any type of audio input, allowing musicians to
interact with Zerr* using any type of gesture. In essence, it functions as an
audio distribution engine that dynamically assigns the input audio signal to specific
loudspeaker setups.
Aesthetic Inclination: Zerr* excels at creating complex spatial textures, and
the characteristics mentioned in the first three points predestine it to be a very
experimental approach. Among the techniques related to spatial texture, Zerr*
employs pure amplitude panning. By adjusting the rate of the panning signal,
Zerr* can move seamlessly from standard spatialization effects to the creation
of distinctive spatial textures.
In essence, Zerr* is a speaker-centric sound spatialization approach designed for
arbitrary audio sources and loudspeaker setups. It excels in generating spatial
textures during live improvisations. This will be rigorously reviewed in subsequent
analyses and presentations to ensure the validity of the discussion.
3.2 Signal Flows in Live Performance
Before an experimental system is developed, its intended use scenarios must first
be analyzed. Most of Zerr*’s conceptual design is related to its
intended scenario, which is live improvisational performance. This section presents
a possible new system design methodology for the problems of a live performance
scenario. The concept of Zerr*, as a concrete proposal for realizing this methodology,
will be introduced in detail in the next section.
The real-time demands of live music performance scenarios have turned many
simple operations in the composition process into complex human-computer
interaction problems. Performances with real-time sound spatialization have
an extra dimension of complexity compared to a normal performance. Marshall
et al. analyzed the problems of live spatial music performance in detail (Marshall
et al., 2009). They divide the roles of those involved in the performance into spatial
performers, instrumental performers, and spatial conductors, and stress the
importance of considering the cognitive load19 of the performers. In this section, this
analysis is used as a reference to examine how signal flow and performer
cognitive load differ across performance scenarios.
19 Cognitive load is defined as “The total amount of mental activity imposed on working memory at an instance
in time” in the original paper.
With the help of Figure4, all setups in a live performance can be understood
as a holistic instrument that is controlled by several performers in different roles
at the same time. This sound engine can be simplified and divided into three core
modules corresponding to the three different types of roles proposed by Marshall
et al., namely Sound Source, Sound Spatializer and Sound System. Assuming that
the performers’ gesture inputs are ignored, there are only pure audio signal flows
between modules in the performance, as shown in (A) of Figure 5. The Sound Source
module in this context refers to anything that can produce audio signals, including but
not limited to acoustic instruments, playback devices, and synthesis algorithms.
The Sound Spatializer module refers generically to all algorithms and devices that
process and distribute the input audio signal. The Sound System module refers
to the equipment that ultimately presents the sound waves, including but not
limited to the loudspeaker arrays and the listening environment/room. In actual
scenarios, the three modules are merged. However, the current division is more
useful for later analysis.
In addition to the basic audio signal flow, performers generate gesture inputs
that vary depending on their role and the performance context. In a traditional
live performance without sound spatialization, the performers’ gesture inputs only
affect the sound sources. Here, the sound spatializer is a fixed system that collects
all input signals and routes them to the sound system, as shown in (B) of
Figure 5. In contrast, in sound diffusion-type performances, the performer directly
manipulates the sound spatializer with a fixed sound source system, as shown in
(C) of Figure 5. In both scenarios, the performers have clearly defined roles
corresponding to the spatial performer and the instrumental performers. The spa-
tial conductor in this model is the performer who inputs gestures into the sound
system. As discussed by Marshall et al., a modern spatial music performance typ-
ically involves a number of instrumental performers, a spatial performer, and a
spatial conductor, as shown in (D) of Figure 5. This division of roles proves
highly effective and efficient for live performances that are either well rehearsed
over time or structured around fixed scores. From personal observation, it’s worth
noting that there can be more than one spatial performer, and the role of spatial
conductor is relatively rare in actual performances. This role is simulated in the
original article by allowing the performer to control the size of a virtual space.
Thus, the spatial conductor is not central to the following discussion and will be
omitted.
Figure5: Signal flow in live performance
However, the model is less applicable in an improvised live performance. First,
from the point of view of the spatial performers, it is difficult for them to respond
effectively in time to the audio signals generated according to the improvised
gestures of the instrumental performers. Because the audio input at this point is
a time-varying signal causally determined by the instrumental performers, spatial
performers cannot predict in advance what will happen next. More commonly,
the roles of instrumental performer and spatial performer in improvisation are
undertaken by the same individual, as shown in (A) of Figure 6. When a performer
is required to process two disparate gesture control signal inputs simultaneously,
the cognitive load can easily exceed the performer’s capacity, potentially leading
to mistakes in performance. At this juncture, it would appear that one can only
consciously control either the sound properties or the spatial properties.
This is not an insurmountable paradox. In addition to the performer exer-
cising their own multitasking abilities, the majority of the performance aid tools
mentioned in Section2.4.2 attempt to address this problem. Some tasks that re-
quire conscious control are converted to non-conscious control through the use
of pre-composed automations. This enables the performer to exert control over
one of the modules with minimal and discrete control gestures, thereby allowing
them to concentrate on the control of the other module. The performer’s focus
during improvisation will be determined by the specific part being performed.
The other part will have a reduced role and serve as a supporting function, as
illustrated in (B) and (C) of Figure 6. For example, the performer may
improvise on a musical instrument and then initiate a spatial effect, such as a
trajectory movement, with a simple action in the
intervals between playing. The process can also be reversed, whereby the performer
controls the spatial behaviour of a grain cloud in real time by finely controlling
various parameters in a spatial granulator. The sound input to the grain cloud is
triggered using simple gestures for playback or cessation. It is also possible to take
this non-conscious control to the extreme, namely to use complete automation
without human intervention on a particular module. In the case of autonomous
control of both parts, the intervention of real-time human subjective awareness is
lost and enters the realm of computer-generated music, which is not the subject
of the present discussion of performance.
In their discussion of conscious and non-conscious control, Marshall et al.
posit that non-conscious control is more akin to a compositional process than a
performance. In order to enhance the usability of an assistive tool, the control
gestures it provides are often more generic in nature, and on occasion, they do not
align with the specific gestures that the user desires to employ. Once the automa-
tion has been determined, there is a very limited range of interpretations that the
performers can make. This may give the performer the impression that they have
no real control over the performance.
It is necessary to identify a signal flow that can provide conscious control ges-
tures to both the sound source and the spatializer, which are dense and consistent
with the performers’ intentions without overloading them. One possible approach
to this problem is to associate two modules and control the behavior of the other
module with a signal from one of them. Thereafter, the performer only needs to
focus on controlling one of the modules after completing a small number of basic
setups, while the other module will change its behaviour in response to the con-
trol signals generated based on the performer’s current conscious control gestures.
This enables high gesture density control of both modules simultaneously. Given
that this control signal lies between conscious and non-conscious control, as it is
indeed related to the performer’s gesture input, it is more appropriately called
semi-conscious control.
As illustrated in (D) of the Figure6, the optimal method for implementing
this type of control system is to collect the audio signal emanating from the sound
source as an input and utilize it to generate the control signals for controlling the
sound spatializer. In this instance, the performer’s sole responsibility is to regulate
the sound source, with the spatialization tasks being executed automatically. The
signal flow diagram represents the primary framework underlying the Zerr* approach.
Under this framework, any specific technical solution can be incorporated.
A controller can be designed in combination with a sound source, a spatializer, or a
sound system. The Zerr* approach is but one of the possible concepts.
Figure6: Signal flow in improvisation performance
3.3 Zerr* Concept
Figure7 shows the general signal flow of Zerr*20 . This signal flow diagram refines
the internal structure from (D) of the Figure6, while simultaneously simplify-
20Derived from German “Zerräumlichung” spatial disintegration.
ing the other components that are of lesser importance. All solid paths represent
audio-rate signals, with the bold paths representing the raw input and audible
output.
The Zerr* system takes a single-channel audio signal $x$ as input and then generates
$N$ distinct signals $x_{1 \ldots N}$ for $N$ loudspeakers. The input signal is initially
processed by the Feature Tracker module. The Feature Tracker module employs
specific algorithms to extract audio features, which are then transmitted to the
Feature Processor. The Feature Processor is responsible for the execution of basic
post-processing operations on the input audio features, with the objective of generating
standard control signals. The standard control signals are fed into the
Envelope Generator module. The Envelope Generator employs a polling mechanism
based on the information contained in the control signals to communicate
with the Speaker Manager. The Speaker Manager is the module that stores all
the information about the loudspeaker setup in advance, and provides the Envelope
Generator with the necessary information for the generation of envelopes. The
Envelope Generator is capable of generating multi-channel envelopes in real time,
utilising both the control signals from the Feature Processor and the speaker
information from the Speaker Manager. It is possible to merge the envelopes with
another set of envelopes in order to produce more complex envelopes. The final
combined envelopes will be multiplied with the original input signal in the Audio
Disperser in order to obtain audible multi-channel audio outputs. Each output
signal is transmitted directly to the corresponding loudspeaker, thereby complet-
ing the spatialization of the input audio. The functionality of each module will be
elucidated in detail in Section3.4.
Figure7: Signal flow of Zerr* approach
The input audio signal is used to define the spatial distribution in accordance with
the aforementioned schema. This addresses the practical need to reduce cognitive
load and maintain a high density of gesture inputs in live improvisation scenarios.
Nevertheless, Zerr* is distinguished by its creative rather than by its functional
purpose.
The performance-oriented character of Zerr* has already been elucidated by
the methodology presented in the preceding section. With the brief description of
the Zerr* signal flow diagram just given, it is also possible to demonstrate two
other characteristics. Zerr*’s input is an audio signal from the sound source, and
its output is a multi-channel audio signal assigned to the loudspeaker array, indi-
cating that Zerr*’s scope of realisation is limited to the spatialization of sound.
Furthermore, it is evident that Zerr* is a loudspeaker-centric system, as the audio
signal is fed directly to each corresponding loudspeaker. Additionally, the speaker
properties must be provided to the Speaker Manager to assist its decision-making.
The only aspect of the system that requires further explanation and
cannot be directly recognized from the signal flow is the system’s aesthetic incli-
nation.
As previously discussed in Section2.5.2.3, the ability of panning to transition
from a fundamental operation to an experimental effect is contingent upon the
utilization of unconventional parameter settings, specifically those involving high
rates of control signals. This aligns with Zerr*’s design philosophy. As illustrated
in Figure 7, all internal modules of Zerr* except the Speaker Manager and the Envelope
Generator utilize uni-directional audio-rate signals for information transfer. This
ensures that the system is inherently capable of generating high-speed control signals.
The methodology of utilising audio-rate signals as control signals has been outlined
in Section 2.5.2.1 and Section 2.5.2.3, which pertain to wave terrain spatialisation
and spatial modulation synthesis, respectively. In contrast to the aforementioned
control signals and the high-rate control signals typically employed in standard
modulation effects, the control signals derived from audio features are not only of
a high rate but also irregular. The characteristics of the control signals extracted
from the features will be analysed in detail in Section 3.4.
3.4 System Design
This section will explicate the particulars of each module’s functionality, concomi-
tantly describing the design concepts. For the specific implementation details of
each module, please refer to Section4. This section encompasses more of the con-
ceptual descriptions.
3.4.1 Feature Tracker
The feature tracker functions as the initial processing stage, responsible for ex-
tracting desired audio features from the input signal. According to Lerch, audio
features are defined as specific types of audio representations, constructed based on
expert knowledge, and tailored to meet the specific requirements of a task (Lerch,
2012). This process allows the audio’s meaningful properties to be emphasized,
informing subsequent control signal flows. The effectiveness of this system hinges
on the assumption that a performer’s gestures can significantly influence the au-
dio, particularly features that can be amplified and rendered distinctly. Only when
this condition is met can the control system, which initiates with audio features,
enhance the real-time controllability for the performer. From a broader perspec-
tive, this method of coupling timbral variations of audio with spatial changes
aligns closely with the foundational goal of integrating spatial elements into mu-
sic. Changes in timbre inherently drive spatial modifications. Leveraging audio
features for spatialization accelerates the transition from manual analysis and re-
production to real-time analysis and decision-making as the audio evolves.
Audio features are divided into two categories: instantaneous features and
learned features. Instantaneous features, also known as audio descriptors, are more
low-level audio features. They typically take a small block of audio samples as
input and return a single value based on a fixed calculation method. These fea-
tures lack an explicit musical or perceptual level meaning, but are simply descrip-
tions of the data characteristics of the sample block. Such features are generally
very simple to compute and are easily implemented in real time due to the small
amount of data required for a single computation. Furthermore, there is a clear
correspondence between inputs and outputs due to the fixed algorithm. However,
the disadvantage of this type of feature is also evident, as using only a very small
amount of data for the computation can lead to unstable feature values in the
output.
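To make the notion of an instantaneous feature concrete, the following sketch computes two common low-level descriptors on small, non-overlapping sample blocks. Root mean square (RMS) and zero-crossing rate are used purely as familiar examples; this passage does not state which descriptors Zerr* actually tracks, and the function names are hypothetical.

```python
import math


def rms(block):
    """Root mean square of one block of samples."""
    return math.sqrt(sum(s * s for s in block) / len(block))


def zero_crossing_rate(block):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(block, block[1:]) if (a >= 0) != (b >= 0))
    return crossings / (len(block) - 1)


def track_feature(x, block_size, feature):
    """Return one feature value per non-overlapping block of `block_size`."""
    return [feature(x[i:i + block_size])
            for i in range(0, len(x) - block_size + 1, block_size)]


# Hypothetical example: silence followed by a full-scale constant segment.
signal = [0.0] * 4 + [1.0] * 4
rms_track = track_feature(signal, 4, rms)
```

Each output value depends only on the few samples of its own block, which explains both the low computational cost and the unstable output values noted above.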
The extraction of learned features necessitates the utilisation of supplemen-
tary data in addition to the input audio signal. The calculation of such features is
contingent upon the acquisition of information from other data sources, which is
then integrated with the current input audio signal to yield the final feature. Both
traditional machine learning algorithms and modern AI models can be classified
as such. Such features can be used to derive highly abstract information,
either as explicit categorical labels or as feature vectors with implicit information.
Such algorithms are more complex, and the majority of them require
longer audio data as input.
In order to guarantee the real-time performance of the Zerr* system, it is
preferable to deploy instantaneous features in the feature tracker. The issue of
unstable output from instantaneous features is not a significant concern under the
Zerr* system. In some cases, it can even be advantageous. The Zerr* system’s use of
audio features, which are not designed to preserve every audio detail but rather
to maximize real-time performance, allows for extreme flexibility in the definition
of audio features. Even the most minuscule units of audio, such as a few samples
or a single sample point, can be employed as a feature.
3.4.2 Feature Processor
The feature signals generated by the Feature Tracker will be fed directly into
the Feature Processor. The task of the Feature Processor is to firstly normalize
different kinds of feature signals, and then merge or map the normalized signals
according to the requirements, and finally process them into a standard high-
speed signal that can be comprehended by the subsequent modules. This does not
specify any exact processing method, but the structure of the resulting flows is
clearly defined.
In their book Generating Sound & Organizing Time, Wakefield and Taylor
engage in a discussion of the concept of signals, which serves as an invaluable
source of inspiration for the design of this module (Wakefield & Taylor, 2022). A
signal can be considered for its characteristics in four ways: rates of change, shapes
of change, ranges of values, and kinds of value. The first point, rates of change, refers
to how fast the signal changes and whether the signal changes have significant
characteristics (periodic, sporadic, complex, or stochastic). Shapes of change refers
to the manner in which the signal is undergoing a transformation, whether it is
a sudden shift or a gradual evolution. Ranges of values is used to describe the
boundaries of a signal, including the presence of a maximum or minimum value
and the type of value (floating point, integer). Kinds of value is used to describe
whether a signal has other meanings or explicit functions. These include whether
the signal implies phase or periodic motion, or whether it has explicit units such
as decibels, Hz, etc.
The aforementioned four attributes permit the categorization of the control
signals generated by the feature processor. The feature processor generates two
types of signals, designated as trajectory and trigger.
Given that all inter-module communication employs the audio rate, the rate
of change has been explicitly stated. In the majority of cases, the trajectory and
trigger will be within the audible range. It is important to note that although it is
audible, it is still a control signal and is not intended to be used as an audio signal.
The signal is extracted from the audio feature, and thus, its pattern of change
is related to the original audio features. In terms of the value represented, both
signals are merely abstract control signals devoid of any specific units. The two
correspond to two distinct control modes. The trajectory represents a continuous
control mode, wherein each data point in the trajectory signal exerts an influence
on the behavior of the controlled module. The trigger is a discrete control mech-
anism that responds to a specific event occurring at a specific time, resulting in
a subsequent system response. Both signals are normalized to values between 0.0
and 1.0, which is consistent with the definition of unipolar signals. The trajectory
of the signal may fluctuate freely between 0.0 and 1.0, whereas the trigger signal
is constrained to either 0.0 or 1.0. The shapes of change on the trajectory are, for
the most part, smooth. However, the trajectory may become step-like if the audio
characteristics change too drastically. The trigger signal jumps between 0.0 and
1.0 and holds the value 0.0 most of the time. The pattern is analogous to that of
the single-sample impulse signals defined in the book (Wakefield & Taylor, 2022),
with the exception that a stable period is absent.
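The two signal formats can be illustrated with a minimal C++ sketch. It is not taken from the Zerr* source: the helper `toTrajectory`, the class `RisingEdgeTrigger`, and the idea of deriving a trigger from a threshold crossing are assumptions chosen only to show a unipolar trajectory and a single-sample trigger side by side.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper: scale a raw feature value from an assumed range
// [lo, hi] into a unipolar trajectory sample in [0.0, 1.0].
inline float toTrajectory(float value, float lo, float hi) {
    assert(hi > lo);
    const float scaled = (value - lo) / (hi - lo);
    return std::min(std::max(scaled, 0.0f), 1.0f);
}

// Hypothetical trigger source: emits 1.0 for exactly one sample when the
// trajectory rises through a threshold, and 0.0 on every other sample.
class RisingEdgeTrigger {
public:
    explicit RisingEdgeTrigger(float threshold) : threshold_(threshold) {}

    float process(float trajectory) {
        const bool above = trajectory >= threshold_;
        const float out = (above && !wasAbove_) ? 1.0f : 0.0f;
        wasAbove_ = above;
        return out;
    }

private:
    float threshold_;
    bool wasAbove_ = false;
};
```

Calling `process` once per audio sample yields a signal that is 0.0 almost everywhere and 1.0 only on the sample where the event occurs, which matches the trigger format described above.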
The primary rationale for the adoption of these two signal types as standard
control signal streams is their versatility and capacity to convey the requisite
amount of information as control signals. They can be derived from any combination
of audio features and can be consumed by subsequent modules in real time at the
sample level. Another distinguishing feature of the control signals is that they lack a
discernible pattern of change. Their patterns align with fluctuations in audio fea-
tures, in contrast to the limited number of common control signal models, such as
sinusoidal and random. Systems under conventional control signals exhibit greater
predictability and produce convergent effects. In this context, the instability of
the instantaneous features observed in Section 3.4.1 becomes advantageous: an
unstable control flow can lead to unexpected outcomes.
3.4.3 Speaker Manager
The Speaker Manager is responsible for managing the properties of speakers within
a given loudspeaker setup configuration. It provides various functions for querying
loudspeaker properties and selecting specific speakers, all of which are essential for the
envelope generator stage. The Speaker Manager holds standard properties as well
as additional specific properties of loudspeaker setups. Standard properties are
inherent to the speakers and loudspeaker setups. Specific properties are task-related
and serve to fulfill specific creative purposes. Depending on the current creative
intent, a defined loudspeaker setup can have multiple sets of specific properties.
The standard properties include a unique identifier for each loudspeaker and
the geometric features of the loudspeaker array. These features include the position
of each loudspeaker in Cartesian and spherical coordinates and the orientation.
In addition, standard properties may include other intrinsic qualities of a loud-
speaker, such as frequency response, distortion, or fixed pre-processing settings
for each loudspeaker. This aspect has not yet been specifically considered in the
current concept, and it represents one of the possible directions for subsequent
development.
Specific properties can be understood as parameters that define the relation-
ship of each loudspeaker to the other loudspeakers. These variables can be defined
manually or calculated algorithmically. The initial investigation has identified
three properties that appear to be more suitable for current use, namely speaker
masks, speaker trajectory and speaker topology.
The speaker masks property determines the visibility of each loudspeaker,
with only those that have been unmasked included in the system. In large-scale
loudspeaker array setups, it can be advantageous to utilize only a subset of loud-
speakers. From a functional standpoint, the selection of only a subset of loud-
speakers can alleviate the computational load of real-time multi-channel audio
processing, and enables parallel computation over multiple devices. In terms of
creative flexibility, the use of only a few speakers with specific distributions from
a massively homogenized speaker array allows for more atypical spatialization ef-
fects.
The speaker trajectory can be understood as a modernized expansion of the
definition of speaker trajectory given in the discussion of Poème Électronique in
Section 2.3.3. The interconnection of loudspeakers in a sequential manner establishes a
trajectory for sound to traverse. The movement of the sound on this speaker path
is controlled in real time by the control signal in trajectory format. The point of
expansion of the speaker trajectory in this context is that it focuses solely on the
logical ordering of speakers, rather than following a trajectory in real spatial
coordinates. This trajectory can be generated algorithmically based on the standard
properties. It is also recommended that the trajectory be defined manually so that
a customized design can be created that provides an unnatural sound movement.
The speaker topology is analogous to a matrix of logical connections between
speakers. It is defined by assigning each loudspeaker a list of the speaker identi-
fiers representing other loudspeakers linked to it. Thus, the speakers connected to
the current speaker can be used as potential destinations, i.e. the sound on the
current speaker can jump to one of them. The jump of the sound is determined
by the control signal in trigger format, and the jump occurs at the moment of
the trigger sample. In accordance with this definition, the connections between
the loudspeakers can be either bi-directional or uni-directional, thereby resulting
in a topology that is exceedingly complex. The definition of speaker topology al-
lows for the implementation of more flexible mapping strategies that transcend
the limitations of geometric constraints, thereby accommodating unconventional
speaker setups.
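As an illustration, the three specific properties could be held in a structure like the following. This is a sketch under stated assumptions, not the data layout used by Zerr*: the names `SpecificProperties` and `candidates` are hypothetical, and filtering masked loudspeakers out of the jump candidates is one possible interpretation of how the mask and the topology interact.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical container for the three specific properties of a setup.
struct SpecificProperties {
    std::map<int, bool> mask;                  // id -> true if unmasked (in use)
    std::vector<int> trajectory;               // logical ordering of speaker ids
    std::map<int, std::vector<int>> topology;  // id -> ids reachable by a jump
};

// Return the jump destinations linked to `from`, skipping masked speakers.
// Links are directed: listing B under A does not imply A is listed under B.
inline std::vector<int> candidates(const SpecificProperties& p, int from) {
    assert(p.mask.count(from) > 0 && "unknown speaker id");
    std::vector<int> out;
    const auto it = p.topology.find(from);
    if (it == p.topology.end()) return out;
    for (int id : it->second) {
        const auto m = p.mask.find(id);
        if (m != p.mask.end() && m->second) out.push_back(id);
    }
    return out;
}
```

Because each loudspeaker carries its own list of destinations, uni-directional and bi-directional links fall out of the same representation without any extra bookkeeping.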
The spatial effects constructed through these three specific properties resemble
sound movement and distribution in the conventional sense when low-speed
control signals are employed. However, once the velocity of the control signal
surpasses the threshold of human perception, the aforementioned properties effectively
delineate the underlying topology of the spatial texture. In contrast to an
impenetrable wall of sound, they collaborate to form a sound mass in accordance
with Hagan’s definition (Hagan, 2017).
3.4.4 Envelope Generator
The Envelope Generator generates $N$ individual modulation signals $m_1, \dots, m_N$, which
are contingent upon the signals from the Feature Processor and the properties
from the Speaker Manager. The outputs of the envelope generator are in the style
of unipolar envelopes (and windows) as described by Wakefield and Taylor (2022).
Two key considerations are required for the generation of multi-channel envelopes:
firstly, the selection of loudspeakers and secondly, the additional distribution pro-
cessing stages.
3.4.4.1 Speaker Selection
Zerr* employs amplitude panning to spatialize the input sound source. When am-
plitude panning is extended from merely relocating sound between two speakers or
two fixed spatial positions to a more complex process involving multiple destina-
tions, a fundamental challenge emerges: how to determine the optimal destination
for the sound to subsequently move to. The channel-based logic enables the deci-
sion space for this problem to be constrained from continuous spatial coordinates
to discrete specific speaker locations. In accordance with the formats of the control
signals, the envelope generator should also operate in trajectory mode or trigger
mode. Two distinct selection approaches are delineated herein as trajectory map-
ping and trigger shifting.
3.4.4.1.1 Trajectory Mapping
In the trajectory mapping approach, the Envelope Generator takes the trajectory
control signal and maps it onto the speaker trajectory property. This mapping
strategy disperses timbre in space, which is essentially the same effect as timbre
spatialization. Due to the variety of available audio features, trajectory
mapping is able to achieve more flexible spatialization effects than timbre
spatialization, which is mainly spectrum-based.
The selection of an appropriate interpolation method can have a profound
impact on the perceived audio quality. It will affect the perceived smoothness of
the sound as it transitions from one speaker to another. At high speeds, longer
interpolation produces a more seamless sound field. Conversely, a small amount
of interpolation or a sharp edge will result in a more granular sound field.
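A minimal sketch of trajectory mapping is given below. It assumes an open (non-looping) speaker path that is divided evenly along the unipolar range, and it returns a linear panning ratio; the names `PanPair` and `mapTrajectory` are hypothetical, and the interpolation curve actually applied to the ratio is left open on purpose.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Result of mapping one trajectory sample onto a speaker path.
struct PanPair {
    int from;     // identifier of the speaker being left
    int to;       // identifier of the speaker being approached
    float ratio;  // 0.0 = entirely on `from`, 1.0 = entirely on `to`
};

// Map a unipolar trajectory sample t onto an ordered list of speaker ids.
// Assumption: the path is open and its segments have equal logical length.
inline PanPair mapTrajectory(float t, const std::vector<int>& path) {
    assert(!path.empty());
    if (path.size() == 1) return {path[0], path[0], 0.0f};
    t = std::min(std::max(t, 0.0f), 1.0f);
    const float pos = t * static_cast<float>(path.size() - 1);
    std::size_t i = static_cast<std::size_t>(pos);
    if (i >= path.size() - 1) i = path.size() - 2;  // keep i + 1 in range
    return {path[i], path[i + 1], pos - static_cast<float>(i)};
}
```

A short or hard-edged interpolation would correspond to quantizing `ratio` before it is turned into gains, whereas a long interpolation would smooth it over many samples.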
3.4.4.1.2 Trigger Shifting
In the trigger shifting approach, whenever a trigger sample from the control sig-
nal is encountered, the envelope target instantaneously shifts to a newly selected
loudspeaker. The Speaker Manager determines the destinations according to the
speaker topology property.
The decision-making method may be either Random or Nearest. In the Ran-
dom method, the next jump destinations are randomly selected from the available
candidate speakers, introducing an element of unpredictability to the speaker se-
lection. In the Nearest method, the next destinations are chosen based on prox-
imity, selecting the candidate speakers that are closest to the currently active
speaker. Similarly, the attack and release of the envelope have a profound effect
on the auditory perception.
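The two decision-making methods can be sketched as follows. This is an illustrative assumption rather than the Zerr* implementation: `Vec3`, `JumpMode`, and `nextSpeaker` are hypothetical names, a single destination is chosen per trigger, proximity is measured as Euclidean distance in Cartesian coordinates, and the sound stays on the current speaker when no candidate is available.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <map>
#include <random>
#include <vector>

struct Vec3 { float x, y, z; };

enum class JumpMode { Random, Nearest };

// Choose the next speaker when a trigger sample arrives.
// `candidates` holds the ids linked to `current` in the speaker topology.
inline int nextSpeaker(int current, const std::vector<int>& candidates,
                       const std::map<int, Vec3>& positions,
                       JumpMode mode, std::mt19937& rng) {
    assert(positions.count(current) > 0 && "unknown speaker id");
    if (candidates.empty()) return current;  // assumption: stay put if isolated
    if (mode == JumpMode::Random) {
        std::uniform_int_distribution<std::size_t> pick(0, candidates.size() - 1);
        return candidates[pick(rng)];
    }
    // Nearest: smallest squared Euclidean distance to the current speaker.
    const Vec3& c = positions.at(current);
    int best = candidates.front();
    float bestDist = std::numeric_limits<float>::max();
    for (int id : candidates) {
        const Vec3& p = positions.at(id);
        const float dx = p.x - c.x, dy = p.y - c.y, dz = p.z - c.z;
        const float d2 = dx * dx + dy * dy + dz * dz;
        if (d2 < bestDist) { bestDist = d2; best = id; }
    }
    return best;
}
```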
The fundamental distinction between the two modes pertains to the manner
in which the underlying structure of the spatial texture is to be understood. In
the case of Trajectory Mapping, a static binding of timbre to spatial coordinates
is employed, whereas in Trigger Shifting, a dynamic binding of timbre changes to
spatial changes is utilized.
3.4.4.2 Distribution Processing
Distribution processing encompasses a range of additional manipulation methods,
in addition to panning, which collectively influence the overall character of the
sound. The current design comprises two fundamental processing stages.
Spread: Spread is similar to the definition of the signal skirt by Hagan (2017).
The signal can be distributed to all other loudspeakers, in addition to
the central one, with a lower gain, thus creating a more immersive sound field.
Overall Gain: The overall gain can also be modulated. When connected to
a slowly varying signal, it introduces additional details. Conversely, when the
control signal varies rapidly, it can result in a significant alteration of the orig-
inal sound. Since the parameter controls all speaker behavior, it changes the
overall spatial listening experience.
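The two processing stages can be sketched together for a single sample. The following is a simplification under stated assumptions: the function name `spreadGains` is hypothetical, the spread is applied uniformly to every non-central loudspeaker (a distance-dependent skirt would be equally plausible), and both parameters are treated as unipolar values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Build one frame of per-speaker gains.
// The central speaker receives the overall gain; every other speaker
// receives the overall gain scaled by `spread` (0.0 = none, 1.0 = equal).
inline std::vector<float> spreadGains(std::size_t numSpeakers, std::size_t center,
                                      float spread, float overallGain) {
    assert(center < numSpeakers);
    assert(spread >= 0.0f && spread <= 1.0f);
    std::vector<float> gains(numSpeakers, spread * overallGain);
    gains[center] = overallGain;
    return gains;
}
```

Since `overallGain` scales every channel, modulating it at audio rate alters all loudspeaker signals at once, which is why it changes the overall listening experience rather than the position of the sound.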
3.4.5 Envelope Combinator
The envelopes created by the envelope generator serve as modulation signals for
the original audio input. The Envelope Combinator provides functions that facilitate
the straightforward combination of sets of envelopes from disparate envelope
generators. This module is useful when more complex envelope generation strategies
are desired. It should be noted that this module is optional, as a single Envelope
Generator already provides sufficient information for dispersing the input
audio. The Envelope Combinator exploits the scalability of unipolar envelope signals:
such signals retain their fundamental characteristics following mathematical
transformations. It is theoretically possible to cascade envelopes multiple times in
order to produce more complex modulation signals.
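A hedged sketch of such a combination is shown below. The set of operations is an assumption, since the text does not enumerate them; multiplication, maximum, and minimum are chosen here because each maps two unipolar inputs to a unipolar output, so the result can be fed into a further combination stage. The names `CombineOp` and `combineEnvelopes` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-channel operations that keep values inside [0.0, 1.0].
enum class CombineOp { Multiply, Maximum, Minimum };

// Combine two envelope frames (one value per loudspeaker) channel by channel.
inline std::vector<float> combineEnvelopes(const std::vector<float>& a,
                                           const std::vector<float>& b,
                                           CombineOp op) {
    assert(a.size() == b.size());
    std::vector<float> out(a.size());
    for (std::size_t n = 0; n < a.size(); ++n) {
        switch (op) {
            case CombineOp::Multiply: out[n] = a[n] * b[n]; break;
            case CombineOp::Maximum:  out[n] = std::max(a[n], b[n]); break;
            case CombineOp::Minimum:  out[n] = std::min(a[n], b[n]); break;
        }
    }
    return out;
}
```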
3.4.6 Audio Disperser
The Audio Disperser represents the final stage in the overall process. It generates
the individual loudspeaker signals $x_n$ by applying the corresponding modulation
signal $m_n$ to the input signal $x$:
$$x_n = x \, m_n$$
Subsequently, the modulated signals are distributed to the loudspeakers. In this
final stage, the timbre characteristics of the input signal are combined with the
spatial properties of the modulation signals. Zerr* thus represents an approach
to achieving complete unification of sound and space. The interplay of the input
signal, algorithm and loudspeaker configuration allows performers to shape the
sound in texture, timbre and spatial behaviour simultaneously.
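The modulation performed by the Audio Disperser amounts to one multiplication per loudspeaker and sample. A minimal sketch, assuming a mono input sample and a caller-provided output buffer (the function name `disperse` is hypothetical):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiply the input sample x by each speaker's modulation value m[n]
// and write the resulting loudspeaker samples into `out`.
inline void disperse(float x, const std::vector<float>& m, std::vector<float>& out) {
    assert(out.size() == m.size());
    for (std::size_t n = 0; n < m.size(); ++n) {
        out[n] = x * m[n];
    }
}
```

Writing into a preallocated buffer avoids memory allocation inside the audio callback, which is a common requirement for real-time processing.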
3.5 Discussions
The introduction of the functionalities of the modules that comprise Zerr* reveals
that it has the following main innovations: real-time spatialization through audio
features, sample-level processing, and support for non-conventional loudspeaker
arrays. These characteristics distinguish this approach from the majority of exist-
ing tools and render it an experimental endeavor. It is easier to create new sonic
experiences with it, but this inevitably leads to limitations in other aspects. The
following discussions will address the inherent advantages and disadvantages.
3.5.1 Creative Use of Audio Features
In the fields of audio content analysis and music information retrieval, the audio
feature serves as the fundamental basis for the construction of a system. Experts in
this field possess a profound understanding of both hand-crafted audio descriptors
and audio embeddings derived from machine learning. A multitude of projects offer
high-quality audio feature extraction algorithms, including Librosa (McFee et al.,
2015), Essentia (Bogdanov et al., 2013) and TorchAudio (Yang et al., 2021). The
high-quality libraries and the state-of-the-art algorithms and models are based on
Python environments. Deploying and using these models requires adequate
software engineering skills.
Additionally, there are libraries that provide audio analysis algorithms in audio
programming environments, both real-time and non-real-time (Collins, 2011;
Schnell et al., 2009). However, only a select few creative coders and music technology
researchers with a strong background in engineering are able to utilize these
algorithm libraries. These tools remain inaccessible to the average musician, in
part due to the complexity of audio feature knowledge and the difficulty of learning
it. This may also be attributed to the lack of a user-friendly workflow
and of comprehensive and clear tutorials for users with non-technical backgrounds.
To address this issue, the FluCoMa project has made a significant contribution.
The Fluid Corpus Manipulation project (FluCoMa) employs novel method-
ologies for the creative exploitation of sound collections in musical contexts
(Tremblay et al., 2019; 2021). It integrates advancements in digital signal process-
ing algorithms and machine learning models into the toolkit for “techno-fluent”
musicians, creative coders, and digital artists. FluCoMa offers a comprehensive
methodology that elucidates the utilization of these modules within the toolkit
for audio analysis and sound decomposition, accompanied by corresponding work-
flows for transforming a corpus of sound into music. Moreover, FluCoMa offers
interactive introductions to each module, thereby facilitating the commencement
of the learning process²¹. In terms of implementation, FluCoMa was identified
at an early stage of development as a tool that can be used across a range of
creative coding environments. This allows users to utilise FluCoMa within their
familiar environments and integrate it with their bespoke workflows. It is evident
that the conceptual design and implementation of Zerr* are profoundly influenced
by FluCoMa. Like the FluCoMa workflow, Zerr* offers a comprehensive processing
flow with a variety of selectable algorithms tailored to user needs, allowing for
expansion with new algorithms as necessary.
²¹https://learn.flucoma.org/
Another issue that requires discussion is the necessity to control the spatial-
ization through audio features. The mapping of synthesizer parameters directly to
a spatialization algorithm is a relatively straightforward process that offers clear
benefits in terms of usability, similar to the topographic synthesis approach. In
this approach, the synthesizer parameters are directly distributed spatially. One
rationale for the utilization of audio features is that it permits a comprehensive
segregation of the sound source from the distribution system. A well-designed
mapping can be employed in conjunction with any source, and the sound source
can be altered in real time. This is a highly beneficial strategy to enhance the
playability of improvisations, along with the flexibility of the creative process.
Furthermore, synthesizer parameters do not directly correlate with audio features;
multiple parameters can affect the same feature, and a single parameter can influence
multiple features. The utilization of audio features amounts to understanding
timbre from a distinct parameter space, and is not interchangeable with
methods that map synthesizer parameters. Moreover, for acoustic instruments,
audio features are the sole means of comprehending their sound and of controlling
them parametrically.
It is important to consider that when building Zerr* using multiple audio
features, there are correlations between different audio features. A change in the
timbre of a source will result in a corresponding change in most of the audio
features. It is challenging to achieve precise one-to-one control unless one method-
ically selects or designs unrelated audio features. Once more, this can be regarded
as a shortcoming as well as a characteristic. Correlations between audio features
may complicate the interpretation of the current mapping relationships, yet they
may also enhance the overall experience of the spatialization effect. In essence,
aside from the creator’s pursuit of interpretability of the system, the listener does
not actively perceive the spatialization effect of the Zerr* system from this perspective.
This will be covered in detail in Section 5 on listening tests. If one’s
objective is to achieve precise one-to-one control, then the use of audio features is
not an appropriate solution.
3.5.2 Sample-level Processing
The concept of sample-level has been repeatedly highlighted in the introductory
sections of the modules. The section on the Feature Processor module states that
both trajectory and trigger control signals are at audio rate, indicating that
control information can be carried on any sample. Because the control signal is
resolved at the sample level, the spatial textures that are generated
also exhibit a corresponding sample-level response. This is evidenced in both
modes of operation, with the envelope edge having a significant impact on the
spatial effect.
To illustrate, if the rate of occurrence of trigger samples is not constrained and
the length of attack and release is reduced to a single sample, the high density of
jumps between loudspeakers creates a unique aural sensation that is only possible
with sample-level control. The crack sound, which is caused by the sudden jump
of the audio signal, is something that is generally avoided by signal processing
algorithms. Even when discussed in the context of electroacoustic music, this falls
into the category of sounds that most musicians do not prefer. However, the sam-
ple-level processing capability offers the possibility of using this extreme sound in
a musical way.
It must be acknowledged that the resulting sample-level spatial textures are
not always aesthetically pleasing. In some instances, the parameters of the map-
ping system require meticulous adjustment in order to achieve a harmonious bal-
ance between experimentation and musicality. This makes designing a concrete,
usable Zerr* system a challenging task. While it is possible to improvise on a
system without scruples, the process of designing the mapping itself is a rather
tedious and time-consuming compositional process. This is, to some extent, a lim-
itation of Zerr*. In order to achieve the desired flexibility in live performance, it is
necessary to invest a significant amount of time in preparation in order to create
the mappings.
3.5.3 Irregular Loudspeaker Setups
The Zerr* approach abstracts loudspeaker setups as logically connected trajecto-
ries or directed graphs. This allows it to be used with any irregular loudspeaker
setups. The use of specially arranged loudspeaker systems is not an uncommon
occurrence. In many real-world scenarios, it is challenging to achieve a speaker
distribution that fully aligns with the design specifications, while also considering
the impact of room acoustics. Consequently, even with the implementation of an
object-based spatialization system, there is no guarantee of a completely consistent
sound experience across the creation and playback processes. Fine-tuning a scene
while relocating it may take as much time as creating it directly in the playback
sound field. In the case of sound installations with special artistic requirements, the
distribution of the loudspeakers will be more heterogeneous and the composition
will be more challenging to relocate. However, there is no such issue as relocation
under Zerr*’s logic. All that is required is a highly customizable creation for the
particular loudspeaker setup at hand. The more specialized the loudspeaker system is,
the more Zerr*’s strengths can be utilized to fully realize the creative intent.
Another aspect that makes it possible to fully utilize the characteristics of
the speaker system is the complete abandonment of portability. This implies that
the music created with Zerr* can only exist in a specific speaker setup. Using
virtual speakers to play the work back on other conventional speaker setups, or
rendering it to stereo, will result in a significant deterioration of the listening experience,
which is highly detrimental to the promotion of the work. The decision of whether
to employ Zerr* for the purpose of authoring is contingent upon a cost-benefit
analysis.
3.5.4 “Incorrect” Usage
The signal flow diagram of the Zerr* approach is merely a suggestion. The ex-
pansion of this basis or the adoption of only some of the modules represents a
departure from the original design of Zerr*. However, this “incorrect” use can also
facilitate creativity and produce effects that cannot be realized by the standard
process. Two possible uses that come to mind will be presented here. The fundamental
prerequisites for the viability of such uses are the highly modular
design of the Zerr* approach and the adaptability of the formats employed for signal
transfer between modules.
The universality of trajectory and trigger allows the control signal to
bypass the audio feature modules and to employ alternative inputs or algorithms.
A significant proportion of traditional automation processes can be implemented
with self-generated control signals. Alternatively, in light of the discussion
in Section 3.5.1, it is possible to utilise signals derived from synthesiser parameters
for the purpose of precise control. The utilization of this approach may obscure
numerous advantages inherent to the Zerr* approach; however, when employed in
an appropriate manner, it can enhance musicality.
An alternative approach would be to extend the scope of the Zerr* method-
ology. It is not always necessary to regulate the spatial properties of a sound using
the audio features derived from the sound itself. A signal generated by one source
can be utilized to regulate another source. This is analogous to a generalized
sidechaining effect. This type of cross-modulation can occur at either the stage
of inputting control signals or at the stage of fusing the source and modulation
signals. In my personal authoring practice, this method has proven to be highly
effective, particularly when the controlling source exhibits stable spatial properties
while the controlled source undergoes significant global spatial property changes.
Chapter 4
Implementation
Zerr* is not just a conceptual framework but also an ongoing software project.
While still under development during the preparation of this thesis, it is expected to
undergo long-term updates and improvements. The functions, parameter names,
and other elements mentioned in this chapter are based on the current version.
All of the code for Zerr* is open source under the MIT license and available on
the GitHub repository of the project²².
²²https://github.com/ringbuffer-org/Zerr
4.1 Aims and Priorities
Zerr*’s implementation was heavily inspired by the FluCoMa project. In their
paper, Tremblay et al. (2021) propose seven aims and priorities²³ that need to be
considered when building a toolkit, and these also align with the implementation
goals of the Zerr* system. Here we introduce the most important ones in the
context of Zerr*.
²³Native Integration, Consistency, Learnability, Configurability, Scalability, Breadth and Completeness
Native Integration denotes the necessity for the tool to adhere to the estab-
lished conventions of the system in which it is embedded, while simultaneously
exhibiting the requisite flexibility to transfer data across disparate frameworks.
The implementation should utilize the host system’s native interface to the
greatest extent possible, as this will facilitate the effective transfer of the user’s
experience within the host system to the process of using Zerr*. Concurrently,
the implementation cannot be wholly contingent on the data structure of a
host. The system parameters must be capable of migration to another environ-
ment without consequence and must maintain the performance of the system
before and after.
Completeness denotes the capacity of the system to provide a complete toolchain.
In each host system, all the core functions of the approach should be
realized with the functions provided by the Zerr* system and a few built-in tools,
without the necessity of relying on any other third-party libraries.
Configurability denotes the capacity of users to modify numerous facets of
the system to align them with their individual requirements, thereby integrating
their creativity into the system. Conversely, an implementation without
configurability offers only a restricted range of fixed, preset processing algorithms.
In order to accommodate such a static workflow, users must
alter the settings of the other tools they utilize.
For further details on the introduction of additional aims and priorities, please
refer to the FluCoMa paper (Tremblay et al., 2021).
4.2 Modular Design & Profiles
The Zerr* system comprises two primary components: a series of core modules
and encapsulations for various host environments. The core modules are
implemented purely in C++ and do not make use of any plug-in development
framework. In addition to the C++ standard library, they depend only on common
underlying libraries to handle standardized processes. The rationale
behind the decision to utilise a low-level and time-consuming development plan is
to guarantee that the fundamental functionality of Zerr* is self-contained, thereby
enabling it to be freely extended to diverse environments. The core modules are
divided in a similar manner to the modules of the signal flow diagram, as illustrated
in Figure 7. Each module, with the exception of the Speaker Manager, is
implemented as an individual signal processing unit whose input/output signals
operate exclusively at audio rate. The Speaker Manager is a submodule within
the Envelope Generator. One advantage of this approach is the freedom to choose
how different modules are combined and wrapped. In accordance with the spe-
cific requirements of the application scenario, modules may be incorporated into
an integrated audio client in accordance with the standard signal flow diagram.
Alternatively, each module may be encapsulated as a separate audio client, with
signals being transmitted between modules via the audio server of the embedded
system. Moreover, the modular design permits use in unconventional
ways, as mentioned in Section 3.5.4.
The configuration file serves as the foundation for Zerr* to facilitate cross-en-
vironment data transfer. Two distinct configuration files are available for use: the
speaker array configuration and the module configuration. The former retains the
standard properties that were previously outlined in the Speaker Manager section.
The latter encompasses configurations for the core modules, including the selected
features and the speaker selection mode, among other options. The configuration
files are in the YAML format. In some host environments, configuration files are
indispensable. In other environments, the system can be parameterized without
relying on configuration files, which are simply a means for cross-environment de-
ployment.
4.3 Core Modules
Each core module and audio feature algorithm is commented in Doxygen style²⁴,
allowing developers to use the generated Doxygen documentation for a deeper
understanding of the core modules. This section introduces the key details of the
implementation of each core module.
²⁴https://www.doxygen.nl/
4.3.1 Feature Tracker
The design logic of the feature tracker module draws inspiration from the Essentia
(Bogdanov et al., 2013) library. Each audio feature extraction algorithm is imple-
mented using a consistent class template and is accessed through a uniform calling
interface within the feature tracker module. This uniformity simplifies the process
of adding new audio feature algorithms to the system. Homogeneous processes,
such as audio buffering and Fast Fourier transforms, are conducted in the main
call module to prevent redundant calculations. The audio features that the Feature
Tracker is expected to output are loaded dynamically via a parameter list when
the Feature Tracker module is initialized. A Feature Tracker is capable of calcu-
lating and outputting an arbitrary number of audio features concurrently. The
number of output channels from the module is identical to the number of audio
features specified in the parameter list, and the order of output is also identical.
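The uniform calling interface can be pictured with the following sketch. It is a hypothetical reduction, not the Zerr* class template: the names `Feature`, `extract`, and the string identifiers `"rms"` and `"crest_factor"` are placeholders, only two time-domain features are included, and the shared buffering and FFT stages mentioned above are omitted.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical uniform interface shared by all feature algorithms.
class Feature {
public:
    virtual ~Feature() = default;
    virtual float extract(const std::vector<float>& frame) = 0;
};

class RootMeanSquare : public Feature {
public:
    float extract(const std::vector<float>& frame) override {
        if (frame.empty()) return 0.0f;
        float sum = 0.0f;
        for (float s : frame) sum += s * s;
        return std::sqrt(sum / static_cast<float>(frame.size()));
    }
};

class CrestFactor : public Feature {  // peak amplitude divided by RMS
public:
    float extract(const std::vector<float>& frame) override {
        float peak = 0.0f;
        for (float s : frame) peak = std::fmax(peak, std::fabs(s));
        const float rms = rms_.extract(frame);
        return rms > 0.0f ? peak / rms : 0.0f;
    }
private:
    RootMeanSquare rms_;
};

class FeatureTracker {
public:
    // One output channel per requested name, in the order given.
    explicit FeatureTracker(const std::vector<std::string>& names) {
        assert(!names.empty());
        for (const std::string& name : names) features_.push_back(create(name));
    }

    std::vector<float> process(const std::vector<float>& frame) {
        std::vector<float> out;
        out.reserve(features_.size());
        for (auto& f : features_) out.push_back(f->extract(frame));
        return out;
    }

private:
    // The name strings are placeholders, not Zerr*'s actual identifiers.
    static std::unique_ptr<Feature> create(const std::string& name) {
        if (name == "rms") return std::unique_ptr<Feature>(new RootMeanSquare());
        if (name == "crest_factor") return std::unique_ptr<Feature>(new CrestFactor());
        throw std::invalid_argument("unknown feature: " + name);
    }

    std::vector<std::unique_ptr<Feature>> features_;
};
```

With this layout, adding a feature means writing one subclass and one line in `create`, while the number and order of output channels follow directly from the parameter list.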
It is important to note that the Feature Tracker only makes use of design ideas
derived from Essentia, without resorting to Essentia’s source code. The instantaneous
audio features were developed independently based on the descriptions
provided by Lerch (2012). Table 2 enumerates the time-domain and spectral-domain
features that have been implemented thus far. The zero cross in this case
represents an unconventional sample-level feature. Its output is a single trigger sample
whenever the audio signal crosses zero, and 0 for the remainder of
the time. In terms of functionality, this should be implemented by the Feature
Processor module. However, due to the extensive range of potential applications
for this sample-level feature, it is more practical to incorporate it into the Feature
Tracker.
Time Domain           Spectral Domain
Root mean square      Spectral flux
Zero crossing rate    Spectral centroid
Crest factor          Spectral rolloff
Zero cross            Spectral flatness
Table 2: Instantaneous features
4.3.2 Feature Processor
The Feature Processor does not provide a standard processing template
analogous to that of the Feature Tracker due to the more varied functionality that
needs to be implemented. Consequently, the fundamental module for the Feature
Processor is designed to provide input/output interfaces without any specific im-
plementation. The manner in which the Feature Processor is integrated into a sys-
tem determines whether it is implemented as a hard-coded component or defined
through a YAML file for numerical operations. Alternatively, it can be constructed
directly from standard blocks within the audio programming environment. All
calculations that ensure that the output control signals conform to the format
requirements are permitted.
4.3.3 Speaker Manager
The implementation of the Speaker Manager comprises two distinct steps. Initially,
the Speaker class, which describes the specific parameters associated with a given
speaker, must be defined. Subsequently, the Speaker Manager class is constructed
using the objects of the Speaker class as its core members.
The Speaker class is employed to store and query all standard loudspeaker
properties in a static and structured manner. All standard properties of the
speakers are stored in a YAML configuration file and loaded upon
initialization. In the current version, the standard properties include the
unique identification index, coordinate information (Cartesian and spherical),
and orientation information relative to a spatial origin point. Specifying the
coordinates in a single format is sufficient; the other is calculated
automatically after the configuration is loaded. The orientation system
encompasses two degrees of freedom, namely yaw and pitch. After initialization,
the standard properties of the speaker array can only be queried, not modified.
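As an illustration of the read-only standard properties and the automatic coordinate conversion, a Speaker record could be sketched in Python as follows. This is a hypothetical sketch, not the Zerr* class: the field names, the use of degrees, and the angle convention (azimuth measured in the x-y plane from the x axis, elevation measured from the horizontal plane) are all assumptions.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Speaker:
    index: int
    x: float
    y: float
    z: float
    yaw: float = 0.0    # orientation in degrees (assumed unit)
    pitch: float = 0.0

    @classmethod
    def from_spherical(cls, index, azimuth, elevation, distance,
                       yaw=0.0, pitch=0.0):
        """Build a speaker from spherical coordinates given in degrees."""
        az, el = math.radians(azimuth), math.radians(elevation)
        return cls(
            index,
            distance * math.cos(el) * math.cos(az),
            distance * math.cos(el) * math.sin(az),
            distance * math.sin(el),
            yaw,
            pitch,
        )

    @property
    def spherical(self):
        """Return (azimuth, elevation, distance) derived from x, y, z."""
        distance = math.sqrt(self.x ** 2 + self.y ** 2 + self.z ** 2)
        if distance == 0.0:
            return 0.0, 0.0, 0.0
        azimuth = math.degrees(math.atan2(self.y, self.x))
        # Clamp to guard against rounding slightly outside [-1, 1].
        ratio = max(-1.0, min(1.0, self.z / distance))
        return azimuth, math.degrees(math.asin(ratio)), distance


left = Speaker.from_spherical(1, azimuth=90.0, elevation=0.0, distance=2.0)
print(left.spherical)
```

Only one coordinate format needs to be supplied; the other is derived on demand, and any attempt to assign to a field after construction raises an error.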
In contrast, all specific properties are located within the Speaker Manager
object and can be reassigned at runtime. Furthermore, the Speaker Manager
implements all the control logic required for selecting loudspeakers and
provides the information required for distribution processing. It processes the
control signals passed to it by the Envelope Generator on a sample-by-sample
basis and instantaneously provides the speaker selection decision. When
processing a trajectory control signal, the Speaker Manager returns the current
pair of speakers engaged in the panning operation and the corresponding panning
ratio. When processing a trigger control signal, it returns the identification
index of the selected speaker.
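The trajectory case can be illustrated with a small Python sketch. It assumes, purely for illustration, that the trajectory value in the range 0 to 1 is mapped linearly onto the speakers in their configured order and that a ring layout wraps from the last speaker back to the first; the function name and the `closed` flag are hypothetical and not part of Zerr*.

```python
def select_pair(position, speaker_ids, closed=False):
    """Map a trajectory value in [0, 1] to (from_id, to_id, ratio).

    `ratio` is the linear panning ratio between the two speakers:
    0.0 means fully on `from_id`, 1.0 means fully on `to_id`.
    Requires at least two speakers.
    """
    position = min(max(position, 0.0), 1.0)
    count = len(speaker_ids)
    # A closed ring has one extra segment that joins last and first.
    segments = count if closed else count - 1
    scaled = position * segments
    lower = min(int(scaled), segments - 1)
    ratio = scaled - lower
    upper = (lower + 1) % count
    return speaker_ids[lower], speaker_ids[upper], ratio


# Linear array of four speakers: halfway lies between speakers 2 and 3.
print(select_pair(0.5, [1, 2, 3, 4]))
# Ring of eight speakers: near the end, panning wraps from 7 back to 0.
print(select_pair(0.9375, list(range(8)), closed=True))
```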
4.3.4 Envelope Generator
The number of multi-channel modulation signals the Envelope Generator outputs
corresponds to the number of speakers, and the order of output is consistent with
the order defined in the speaker array parameter file. The selection of either tra-
jectory or trigger mode is dependent on the parameterization during initialization.
The Envelope Generator calls the corresponding method of the Speaker Manager
in accordance with the different operational modes. The trajectory mode enables
real-time bending of the linear panning ratio, thereby allowing for precise control
of the generated modulation signals. In a similar manner, the length and curvature
of both the attack and release phases of the generated envelopes can be controlled
in real time within the context of trigger mode. Furthermore, the trigger mode
incorporates functions designed to standardize the input trigger signal. These in-
clude functions that enable the capture of only the rising edge of the step signal as
a trigger and functions that allow the user to define the minimum trigger sample
interval. As discussed in Section 3.5.2, the minimum interval can be adjusted
and is set to 0 by default.
The two proposed distribution processing methods are also controlled by
audio-rate control signals and accept trajectory control signals by default.
The spread is calculated after the speaker has been selected. The algorithm
needs to access a precalculated speaker distance table, generated by the
Speaker Manager after loading the speaker coordinates, to perform calculations
based on the distances between speakers. The range of the trajectory control
signal is limited to values between 0 and 1, which precludes the possibility of
representing actual spatial distance information. In order to stabilize the
spread effect on speaker arrays with different distance scales, it is necessary
to introduce a standard distance parameter. The calculation logic of the spread
is thus configured such that at 0 no energy is distributed to the other
speakers, and at 1 the speakers at the standard distance are allocated a
predefined proportion of energy. Subsequently, the signal resulting from the
spread step is adjusted in order to achieve an optimal overall volume.
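The text fixes only the two end points of the spread behavior, so the following Python sketch shows one possible mapping that satisfies them, not the formula used in Zerr*. The exponential falloff, the treatment of the weights as amplitude gains, the constant-power normalization standing in for the final volume adjustment, and all function and parameter names are assumptions.

```python
import math


def distance_table(positions):
    """Pairwise Euclidean distances between speaker positions."""
    return [[math.dist(a, b) for b in positions] for a in positions]


def spread_gains(selected, spread, table,
                 standard_distance=1.0, proportion=0.5):
    """Per-speaker gains for one selected speaker and a spread in [0, 1].

    At spread 0 only the selected speaker receives signal. At spread 1 a
    speaker at `standard_distance` receives `proportion` of the selected
    speaker's gain; closer speakers receive more, farther ones less.
    """
    weights = []
    for j, d in enumerate(table[selected]):
        if j == selected:
            weights.append(1.0)
        else:
            weights.append(spread * proportion ** (d / standard_distance))
    # Assumed volume adjustment: normalize to constant total power.
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]


positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
table = distance_table(positions)
print(spread_gains(0, 0.0, table))
print(spread_gains(0, 1.0, table))
```

Because the distances are divided by the standard distance, the same spread value produces a comparable effect on arrays of different physical size, which is the purpose of the standard distance parameter.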
4.3.5 Envelope Combinator
The number of input and output channels of the Envelope Combinator is also
determined by initialization parameters. It can combine any number of groups of
modulation signals that have the same number of channels, channel by channel,
in the order in which the channels are arranged. The Envelope Combinator offers
a set of fundamental numerical operations for the combination of signals,
including summation, averaging, and maximization. The most practical approach,
however, is the product root calculation. Using the symbols defined in
Figure 7, the calculation formula is as follows:
$$m_n = \sqrt[k]{\left|\prod_{i=1}^{k} m_{(n,i)}\right|}$$
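The combination modes named in this section can be sketched in Python as follows, where each group is a list holding one modulation value per channel for the current sample. The function name and mode strings are hypothetical; the product root is read here as the k-th root of the absolute value of the product of the k values on a channel.

```python
import math


def combine(groups, mode="product_root"):
    """Combine k groups of modulation values channel by channel."""
    k = len(groups)
    combined = []
    for channel in zip(*groups):  # the k values belonging to one channel
        if mode == "sum":
            value = sum(channel)
        elif mode == "average":
            value = sum(channel) / k
        elif mode == "maximum":
            value = max(channel)
        elif mode == "product_root":
            value = abs(math.prod(channel)) ** (1.0 / k)
        else:
            raise ValueError(f"unknown mode: {mode}")
        combined.append(value)
    return combined


group_a = [0.25, 1.0]  # two channels from a first generator
group_b = [1.0, 0.0]   # two channels from a second generator
print(combine([group_a, group_b]))             # product root
print(combine([group_a, group_b], mode="sum"))
```

With the product root, a channel only stays open if every group assigns it a nonzero value, which is one way to see why it behaves differently from summation or maximization.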
4.3.6 Audio Disperser
The Audio Disperser takes the sound source on its first input channel, followed
by the generated modulation signals. The sound source is multiplied with each
modulation signal in turn to obtain the final output signals.
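As a minimal Python illustration of this multiplication (with a hypothetical function name and plain lists standing in for audio blocks):

```python
def disperse(source, modulation):
    """Multiply one source block with each per-speaker modulation block.

    `source` is a list of samples; `modulation` holds one list of
    samples per loudspeaker. The result has one output block per speaker.
    """
    return [[s * m for s, m in zip(source, channel)] for channel in modulation]


source = [1.0, -1.0]
modulation = [[0.5, 0.5], [0.0, 1.0]]  # two loudspeakers
print(disperse(source, modulation))
```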
4.4 Encapsulations
Two encapsulations of the Zerr* system have been developed so far: as JACK
clients25 and as a Pure Data package. Other host environments, such as
SuperCollider and Max/MSP, are still in the research and preliminary
development phase. As for operating systems, binary releases have been
successfully compiled on macOS (Intel), macOS (M1), and Linux. Since no Windows
computers are currently in use and there is no immediate need for Zerr* on
Windows, support for this platform has low priority and is not yet available.
For details on the progress of the development, please refer to the
development schedule in the README of the Zerr* repository. This section
focuses on the details of the Pure Data-based and JACK Audio Connection Kit
(JACK)-based encapsulation implementations, and discusses other host
environments that might be worth experimenting with, as well as their
advantages and disadvantages.
25 https://jackaudio.org/
4.4.1 Pure Data Package
The Pure Data version is the most fully developed encapsulation available and is
the version used in the subsequent evaluations and creations. In the Pure Data
version, each core module is compiled into a separate external. The signal I/O
and configuration interfaces are first wrapped as C data structures, in keeping
with the Pure Data design concepts. All interfaces are then encapsulated
according to the inlet/outlet, argument, and message specifications of Pure
Data objects and compiled as independent externals using pd-lib-builder26. The
externals can then be used directly in the Pure Data patching environment. The
signaling between the modules is managed by Pure Data. The following externals
have been developed:
zerr_features~
zerr_envelopes~
zerr_disperser~
zerr_combinator~
All external names start with zerr to avoid naming conflicts with other
libraries. The function names are shortened from the original module names to
keep the external names short. All externals run under the Pure Data sound
engine, so their names end with the tilde symbol, as is customary in Pure Data.
All audio-rate inputs and outputs are encapsulated as inlets and outlets of the
externals. All parameters required for module initialization are encapsulated
as inline arguments. The parameters that can be modified at runtime are
encapsulated as messages that can be received by the first inlet. A separate
Feature Processor external is not included because its functionality can easily
be achieved using built-in Pure Data objects, eliminating the need for an
additional module.
26 https://github.com/pure-data/pd-lib-builder
Figure8 shows an example patch that contains all the necessary Zerr* ex-
ternals. The audio sources input by adc~ are connected to zerr_features~ and
zerr_disperser~ respectively. The zerr_features~ analyzes the spectral centroid,
the spectral rolloff, and the spectral flux of the incoming signal in real time. The
analysis results are entered to zerr_envelopes~ after passing through the numeri-
cal calculation module of the PD. The zerr_envelopes~ is currently set to trajec-
tory mode and reads the speaker array parameter file called “circulation_8.yaml”
from a relative path. The 8-channel modulation signals are sent directly to the
zerr_disperser~ to be merged with the audio source and sent to the 8 output
channels for direct control of the 8 loudspeakers.
Each external comes with a corresponding help patch that can be opened in Pure
Data. It describes in detail how to use the external, with basic examples and
all supported formats of arguments and messages. Screenshots of the help
patches can be found in Appendix A. pd-lib-builder also provides
system-specific installation features, so the Pure Data package of Zerr* can be
installed with a single command using this command-line tool. The package
includes the aforementioned externals, help patches, speaker array
configuration examples, and continuously updated presets. The specific
compilation and installation methods are described in the GitHub repository.
Pure Data is best suited for exploring different combinations and connections
of the basic building blocks, and it allows additional manipulation with
built-in processing units. This makes it very easy to explore specific
techniques for using Zerr*.
Figure 8: Example PD patch for eight loudspeakers
4.4.2 JACK Client
The earliest stage of Zerr* was based on the JACK Audio Connection Kit (JACK).
Its development currently lags behind the PD version. There are two main
reasons why the development was not started in Pure Data in the first place.
One is that developing against the underlying audio server avoids confining the
design thinking to the PD paradigm too early, which makes later migration to
other environments easier. The other is the ability to use newer compilation
tools that provide more efficient compilation and clearer debugging
information. This also eliminates the hassle of re-entering the PD environment
for testing after each update.
In the JACK implementation, all modules are internally buffered and connected
according to the standard signal flow, forming a single cohesive signal
processing unit. The JACK audio client essentially encapsulates the system’s
overall inputs and outputs. The signal flow between internal modules cannot be
dynamically modified. The processing methods in the Feature Processor must be
written according to the algorithm that is currently in use. The module
parameters are read exclusively from the configuration file during system
initialization. In summary, the JACK client version is not a programmable
system. Instead, it consists of a series of command-line programs compiled with
fixed defaults. Each program, once started, can execute only one fixed
processing algorithm.
The JACK version is much less flexible than using Zerr* in Pure Data. However,
there are specific situations where the JACK version can be advantageous. When
it comes to controlling very large arrays of speakers, graphical programming
can be extremely tedious, and the layers of encapsulation are not as
computationally efficient as communicating directly with the underlying audio
server. In addition, using JACK under Linux makes it easy to route audio
signals between different software, which is no less convenient than working in
Pure Data. The JACK version is still highly recommended when a stable
deployment of the Zerr* system is desired. The recommended workflow is to
validate the algorithm in a programming environment such as Pure Data and then
transfer the proven, reliable Zerr* system to JACK for deployment. Although the
internal signal routing cannot be changed, it is still possible to switch the
Zerr* system between loudspeaker systems via the configuration file.
4.4.3 Encapsulations in Development
Max/MSP and SuperCollider are the environments that will be explored in the
future, and whether to make Zerr* a regular audio plug-in, for example in VST3
format, is still under discussion.
Max/MSP shares its roots with Pure Data and is very similar in use. An
encapsulation for Max/MSP can directly apply all the design ideas of the PD
version. Compared with Pure Data, Max/MSP has clear advantages and
disadvantages. First, it has better multichannel support, which eliminates the
need for tedious patching. Second, Max/MSP has a larger user base and a more
active community, which would help to promote the Zerr* system. However, as
closed-source software, it is unfriendly to users who want to customize their
personal systems in depth. SuperCollider seems closer to ideal, as it offers
both open-source code and stable multichannel support. However, the logic of
SuperCollider’s UGens differs somewhat from that of objects in graphical
programming environments. Maintaining the consistency of the Zerr* system while
adapting to the habits of SuperCollider users will be a challenge for future
development.
An audio plug-in is not a suitable carrier for the Zerr* system, either in
terms of functionality or usage. Whereas a digital audio workstation is
centered on arrangement, Zerr* is essentially a performance system. One of the
more attractive aspects of plug-ins is that development frameworks such as
JUCE27 offer very strong support for developing graphical interfaces. This is
an area in which other audio programming software is deficient or unconcerned.
In certain instances, visual feedback is as crucial as audio feedback for
digital instruments, because in many cases consistent, tangible feedback is not
available. Section 2.4.1.1 mentions the effect of the visualized trajectory on
a person’s perception of the spatial location of a sound source. For more
abstract spatial effects such as those of Zerr*, corresponding graphical
feedback would make the system more user-friendly for non-technical users by
letting them visualize the results that can be produced under the current
system settings.
27 https://juce.com/
4.5 Discussion
This presentation of the Zerr* implementation allows us to conclude that the
current implementation is capable of fulfilling the aims and priorities
outlined in Section 4.1. It constitutes a complete workflow in its own right
and has demonstrated its capacity to integrate natively into a variety of
environments. Furthermore, it offers a plethora of configurable parameters that
permit users to fully exploit their creative potential. As part of a master’s
thesis, Zerr* was developed independently, which made it challenging to keep
the project’s overall quality and pace of development in line with
community-driven open source software. The objective is to keep functionality,
encapsulation, and documentation as consistent as possible. However, the
current development focus is still on applications under Pure Data.
In addition to the development of the Zerr* system, a listening test was
conducted to gather feedback from individuals with diverse backgrounds. The
subsequent chapter provides a comprehensive account of the design, process, and
analysis of the listening test.
Chapter 5
Evaluation
In order to evaluate the concept and implementation of Zerr* in a variety of
contexts, an on-site listening test was conducted with participants from a
diverse range of backgrounds. This chapter describes the listening test in four
parts: the first outlines the test objectives, the second discusses the
experimental design, the third details the arrangement and implementation, and
the last analyzes participant feedback.
5.1 Goals & Expectations
The objective of this listening test was to obtain feedback on Zerr* from
individuals representing diverse backgrounds, which will inform the subsequent
development of the project. The feedback comprised two interrelated aspects: an
assessment of the conceptual framework underlying the Zerr* approach and an
evaluation of the efficacy of the Zerr* implementation.
The conceptual evaluation assessed participants’ understanding of the Zerr*
method, examining whether it meets the design expectations. This included
evaluating the feasibility and usefulness of spatial and timbral coupling for
enhancing creativity, the effectiveness of controlling spatial properties
through audio variations in reducing cognitive load during live performances,
and, importantly, Zerr*’s ability to produce experimental and unique sounds
with recognizable musicality. The assessment of the Zerr* system’s efficacy
included two key elements: first, the participants’ ability to quickly
understand the system’s operational logic, and second, their capacity to use
the system effectively. Zerr* is clearly an experimental instrument, and its
performance evaluation is likely to vary significantly among participants from
diverse backgrounds. Another key objective of this listening test was therefore
to differentiate and categorize the feedback based on participants’ experience.
5.2 Study Design
5.2.1 Test Environment
The listening tests for Zerr* were conducted in the TU-Studio E-N 325, which
was specifically designed for sound field synthesis and multichannel music
research and production28. As illustrated in Figure 9, the studio features
three distinct speaker array systems: a standard eight-channel ring loudspeaker
setup, a 21-speaker Ambisonics dome, and an extensive Sound Field Synthesis
system. Additionally, it includes two subwoofers positioned at the front and
rear. All speaker systems connect to a PC via a MADIface USB audio interface.
The Zerr* listening test used these three speaker systems but excluded the
subwoofers. The E-N 325 system operates in two modes: the SeamLess Mode,
developed by the studio team (Coler et al., 2021), and the Direct Mode, which
controls the channels for each loudspeaker directly.
28 https://tu-studio.github.io/studio-docs/EN325/
Because Zerr* is a channel-based system, the studio must operate in Direct
Mode. Currently, Direct Mode only supports direct control of the octa and
Ambisonics setups, and does not allow independent control of each channel in
the WFS system. The issue was resolved with the studio technician’s help by
creating a new Dante configuration file for the listening test in SeamLess
Mode. This allowed the 64 WFS-system loudspeakers to be mapped directly to 64
channels of the audio interface, bypassing the SeamLess system for direct
control. As a result, switching between the two modes was necessary during the
listening test to fully utilize all three loudspeaker systems.
The 64 loudspeakers chosen from the WFS system are the mid-high frequency units
in the 8 WFS panels on the right of Figure 9. Each panel has 8 loudspeakers
arranged horizontally. The 64 loudspeakers thus form a horizontal linear
loudspeaker array, an unconventional setup compared to the studio’s other two
loudspeaker systems.
Figure 9: TU-Studio E-N 325 © TU Studio Team
5.2.2 Test System
The Zerr* system was tested using the Pure Data version. In order to convey the
characteristics of Zerr* in the shortest possible time, four presets were
created. The presets were designed with simplicity in mind, so that anyone
unfamiliar with them can understand them simply by trying them out. Each preset
employs a fundamental audio synthesis algorithm as the input sound source, with
each algorithm offering only three parameters that can be manipulated. The
Feature Tracker extracts a single feature of the current input and uses it to
control one aspect of the Envelope Generator.
Figure 10: Synthesis algorithm patch for Zerr* listening test
The first preset input is a square wave signal that passes through a low-pass
filter. The first parameter controls the fundamental frequency of the square
wave, the second parameter controls the cutoff frequency of the low-pass
filter, and the third parameter controls the duty cycle of the square wave.
This signal is analyzed to obtain its crest factor, which is normalized and
used to control the speaker selection process. The Envelope Generator is
configured to operate in trajectory mode with the configuration file of the
linear array loaded. A source with a lower crest factor is assigned toward the
left side of the line, and a source with a higher crest factor toward the
right.
The second preset input is a standard frequency modulation (FM) synthesis
algorithm. The first parameter is the fundamental frequency of the carrier
signal, the second parameter is the modulation frequency, and the third is the
modulation depth. The Feature Tracker module extracts the spectral centroid.
The resulting centroid frequency is wrapped with a period of 400 Hz and
rescaled to the range between 0 and 1. Based on the wrapped centroid frequency,
the Envelope Generator assigns the input source to the 8-channel ring system in
a clockwise direction from the lowest to the highest frequency.
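The wrapping and rescaling step of this preset can be written compactly. The following Python sketch assumes a plain modulo operation, which is one straightforward reading of "wrapped with a period of 400 Hz"; the function name is hypothetical, and in the listening test this step is performed by Pure Data objects rather than by code like this.

```python
def wrap_centroid(centroid_hz, period_hz=400.0):
    """Wrap a centroid frequency into one period and rescale it to [0, 1)."""
    return (centroid_hz % period_hz) / period_hz


# Centroids 400 Hz apart map to the same trajectory position.
for frequency in (100.0, 500.0, 900.0, 399.0, 400.0):
    print(frequency, wrap_centroid(frequency))
```

A steadily rising centroid therefore travels around the ring once for every 400 Hz, instead of moving through it only once over the whole frequency range.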
The third preset input is a combination of ring modulation and basic additive
synthesis. The fundamental waveform is a 440 Hz sine wave. Increasing the first
parameter raises the relative level of the harmonics at integer multiples of
the fundamental. The signal is then ring modulated with a sinusoidal signal.
The second parameter controls the modulation frequency, and the third parameter
controls the modulation amplitude. The spectral flatness of the input signal is
analyzed and rescaled. The Envelope Generator is set to trigger mode and
configured with the linear array. The current speaker position is manually set
to the center of the loudspeaker line. The control signal is used solely to
manipulate the dispersion of the source.
The fourth preset input combines a low-frequency oscillator with superimposed
noise. The superimposed noise is white noise passed through a resonant low-pass
filter with a cutoff frequency of 20 kHz. The first parameter controls the gain
of the noise. The second parameter controls the frequency of the low-frequency
oscillator, which ranges from 0.1 Hz to 60 Hz; the upper end is already within
the frequency range audible to the human ear. The third parameter controls the
resonance level of the low-pass filter, which then acts to some extent as a
band-pass filter.
All Zerr* presets are separate Pure Data patches and are not visible to
participants. The participants interact solely with a highly encapsulated patch
of the synthesis algorithms, as illustrated in Figure 10. The four columns of
this patch, from left to right, correspond to the four basic synthesis
algorithms described above. Each of the four views displays the waveform
produced by the corresponding synthesis algorithm, and the three sliders below
it correspond, from top to bottom, to the three tunable parameters of that
algorithm. This patch deliberately omits the name of the synthesis algorithm
and the names of the parameters, so that only the waveform can serve as a hint.
The control panel on the far right provides basic system controls and a
specialized matrix audio routing system. The six vertical labels represent mute
(m), all sources (a), and synthesis algorithms 1 to 4. The five horizontal
labels correspond to stereo playback (s) and Zerr* presets a to d. The routing
system enables the audio source used in the listening test to be sent instantly
to any desired playback system. The complete set of Pure Data patches and
configuration files used in the listening test can be found in the attached
files. For the reader’s convenience, screenshots of the four Zerr* preset
patches are provided in Appendix B.
Figure11: MIDI controller for synthesis algorithm patch
The synthesis algorithm patch is controlled by participants via a MIDI controller
(AKAI MIDIMIX), as illustrated in Figure11. The four columns on the left of this
controller are directly aligned with the four columns of the patch. Three knobs
control the three synthesizer parameters from top to bottom, and the bottom
slider controls the volume of that synthesis algorithm. The three knobs in a col-
umn control the three parameters, while the bottom slider controls the volume of
the synthesis algorithm.
5.2.3 Experience Assessment
In order to obtain a more accurate profile of the participants, the listening test
included a number of assessments of the participants’ relevant experiential back-
grounds. The specific types of experience involved include musical experience,
spatial audio experience, synthesis algorithm experience, and audio analysis ex-
perience. The musical experience of the participants was evaluated through self-
report scale questions. The spatial audio experience was gauged by requesting
that the participants listen to six audio clips and identify spatial sound patterns.
The participants’ knowledge about synthesis algorithms and audio analysis was
assessed through quizzes comprising 12 questions each.
The Goldsmiths Musical Sophistication Index (Gold-MSI) was used to assess the
musical experience of the participants (Müllensiefen et al., 2014). The
comprehensive Gold-MSI self-report inventory comprises 31 scale questions, 8
single-choice questions (mixed with text input), and supplementary personal
data fields. A total of 15 scale questions and four single-choice questions
were selected on the basis of a variety of considerations, including their
appropriateness and the duration of the listening test. The fundamental premise
of Gold-MSI is to evaluate differences in individuals’ literacy with regard to
music in its conventional sense. Consequently, some of the questions in the
inventory are not pertinent to the current context. The majority of the
questions that were removed pertained to self-assessment of the level of
knowledge and training in tonal music. These included the ability to sing
melodies accurately and to recognize rhythmic errors. The remaining questions
pertain to more general aspects, such as active engagement and emotions.
Furthermore, a multiple-choice question was added to identify the music
production tools most frequently used by the participants. The questions
included in the musical experience evaluation form used in the listening test
are shown in Table 3.
Six audio clips were prepared to assess participants’ experience of recognizing
different spatial properties of sound. The six audio clips were implemented in
Pure Data and placed on the right side of the main interface of the synthesis
algorithm patch. Participants could trigger the clips and listen to them on
their own. The six clips are organized into three groups of two, each testing a
different spatial property. The first group tests the ability to recognize the
direction of sounds. The first clip is a sine wave from directly in front, and
the second is a sine wave from directly to the right. The second group tests
the ability to detect the size of the sound source. The third clip is a
sawtooth wave played through only one speaker directly in front, while the
fourth clip is a sawtooth wave played simultaneously through nine speakers in
front, with the amplitude normalized. The third group tests the ability to
recognize the movement of the sound source. The fifth clip is white noise with
a 0.01 Hz sawtooth envelope moving clockwise around the speaker ring when
viewed from above. The sixth clip is identical but moves in a counterclockwise
direction. Each audio clip had a corresponding question to answer.
Scales
  I spend a lot of my free time doing music-related activities.
  I sometimes choose music that can trigger shivers down my spine.
  I can sing or play music from memory.
  I’m intrigued by musical styles I’m not familiar with and want to find out more.
  Pieces of music rarely evoke emotions for me.
  I can compare and discuss differences between two performances or versions of the same piece of music.
  I often read or search the internet for things related to music.
  I often pick certain music to motivate or excite me.
  I am able to identify what is special about a given musical piece.
  I don’t spend much of my disposable income on music.
  I can tell when people sing or play out of tune.
  When I hear a piece of music I can usually identify its genre.
  I would consider myself a musician.
  I keep track of new music that I come across (e.g. new artists or recordings).
  Music can evoke my memories of past people and places.
Single-choice
  I listen attentively to music for 0-15 min / 15-30 min / 30-60 min / 60-90 min / 2 hrs / 2-3 hrs / 4 hrs or more per day.
  I have attended 0 / 1 / 2 / 3 / 4-6 / 7-10 / 11 or more live music events as an audience member in the past twelve months.
  I can play 0 / 1 / 2 / 3 / 4 / 5 / 6 or more musical instruments.
  The instrument I play best (including voice) is ____
Multiple-choice
  What are your most used music production tools?
    I don’t make music
    Song writing with main instruments (Guitar, Piano etc.)
    Music Notation Software (Musescore, Sibelius etc.)
    Digital Audio Workstation (Logic Pro, Ableton, Reaper etc.)
    DAWless (Modular Synthesizer, Sampler, Groove Boxes etc.)
    Audio Programming (Pure Data, Max/MSP, SuperCollider etc.)
    Other
Table 3: Musical experience evaluation form
The questions to assess knowledge of synthesis algorithms and audio analysis
were developed in collaboration with the ChatGPT artificial intelligence
system29. ChatGPT was tasked with generating a substantial number of questions
based on the knowledge points covered in the listening test; the generated
questions were then thoroughly reviewed and refined to meet the specified
requirements. To illustrate, ChatGPT was requested to generate more than 100
questions on a range of topics related to subtractive synthesis, FM synthesis,
ring modulation, and LFO, among others. A total of 12 questions were selected
from the pool to form the final quiz, with each knowledge point represented by
two to three questions. The selection of questions was based on the following
criteria: no factual errors, clear wording, moderate difficulty, and exclusive
relevance to the synthesis algorithms. All questions that pertained to the use
of synthesizers and the specific techniques employed in music production were
excluded. Similarly, the questions in the audio analysis quiz are dominated by
audio features from the Zerr* presets used for the listening test, with a small
number covering other audio features present in the zerr_features~ external.
The questions pertain solely to the algorithmic details and are not specific to
the use of the algorithm. From a personal standpoint, this approach to using
large language models appears to be a reasonable one. The complete set of 2×12
quiz questions is provided in Appendix C, with the correct answers in bold.
29 https://chat.openai.com/
5.2.4 Test Process
Prior to the listening test procedure itself, participants were provided with
an introduction to the test system and the test environment and completed the
experience assessments. The introductions and assessments were interleaved in
order to prevent the participants from becoming overly preoccupied with the
project, which could result in a lack of concentration. Prior to the
assessments, participants were requested to indicate their self-perceived level
of experience. In order to ensure the data were unbiased, participants were
instructed to answer the questions even if they believed they lacked relevant
experience. The interleaved introduction and assessment process was as follows.
It began with the musical experience test, followed by an introduction to TU
Studio and the loudspeaker systems. Participants were then introduced to the
Pure Data patch and the MIDI controller, leading into a quiz on the sound
synthesis algorithms. This part concluded with a brief overview of Zerr* and
the audio analysis quiz.
The listening tests for the four Zerr* presets followed a standardized proce-
dure. Initially, the sound source was routed to a two-channel system, using the
front two speakers from the ring system. Participants had one and a half minutes
to manipulate the sound source and were then asked to provide a textual descrip-
tion of the sound produced by the synthesis algorithm and the role of the three
parameters, focusing on either technical or perceptual aspects. Next, the sound
source was routed to the corresponding Zerr* system for spatialization. Partici-
pants were given two minutes to explore changes in timbre and spatial attributes.
They were then asked to describe the differences between the original and spatial-
ized sound and analyze the connection between timbre and spatial properties. This
was followed by scale questions assessing the perceived strength of the coupling
between timbre and spatial attributes and their effectiveness in controlling spatial
attributes through the synthesis parameters. After responding, participants were
informed about the specific synthesis algorithms used and the functions of the
three knobs, as well as the role of the Zerr* system. Finally, they rated how well
these technical backgrounds matched their auditory experience.
This was followed by a comprehensive feedback session, during which participants
were asked to provide quantitative ratings of various aspects of the Zerr*
system and to offer their subjective opinions. The ten questions in the final session
and their corresponding types are presented in Table 4.
No.  Type    Question
1    Rank    From the tests conducted, which one demonstrated the most significant coupling
             between sound and spatial properties?
             (Test1, Test2, Test3, Test4)
2    Scales  How challenging was it for you to grasp the concept of the Zerr*?
             (very hard → very easy)
3    Text    Were there specific aspects of Zerr* that you found particularly difficult to
             understand?
4    Text    Have you encountered any discomfort of workflow or technical difficulties while
             playing with Zerr*?
5    Scales  In your view, does this sound-spatial coupling with Zerr* enhance or detract from
             the overall musical experience?
             (detract → enhance)
6    Scales  Do you believe that the ability to control spatial properties via sound properties
             simplifies the process of composing or improvising music?
             (totally agree → totally disagree)
7    Scales  Do you think Zerr* offers expanded possibilities for composing or improvising
             music?
             (not at all → yes for sure)
8    Text    In your opinion, what unique features does Zerr* possess that distinguish it from
             existing spatial audio systems?
9    Text    Do you have any envision the application of Zerr* in contexts like sound
             installations or live improvisation music?
10   Text    Are there any additional comments, suggestions, or ideas you would like to share?
Table 4: Comprehensive feedback session
5.3 Procedure
5.3.1 Questionnaire
The complete process of the listening test was presented to the participants via
an online questionnaire on a laptop computer in the studio. Theoretically, partici-
pants could complete the entire listening test independently based on the instruc-
tional information provided on the questionnaire. In order to streamline the test,
I have endeavoured to minimise the verbal narrative component and have only
included explanations where necessary for sections that were not clearly expressed
in the text.
The questionnaire used for the listening test is hosted on Typeform³⁰. Typeform,
a business-oriented questionnaire service, was selected for several
reasons. Firstly, the time required for the faculty to respond to an application
for an account on the student questionnaire system was a significant factor. Secondly,
the Typeform system offers a superior user experience in terms of questionnaire
editing and filling. It can be argued that the use of commercial questionnaires in
academic research does not detract from the overall credibility of such research. A
sample of part of this questionnaire is shown in Table 5. It shows what four common
questionnaire elements look like: description, scale question, choice, and text
input.
³⁰https://de5qfywm15f.typeform.com/to/Nj0OeAzs
Table 5: Questionnaire Screenshots
5.3.2 Recruitment
The participants were recruited through a number of channels, including the fac-
ulty’s mailing list, social media platforms, live coding and spatial audio commu-
nities, etc. The recruitment and testing process commenced on February 27, 2024,
and concluded three weeks later.
A total of 18 individuals participated in the listening test. The participants
were drawn from a diverse range of backgrounds, including those engaged in audio
technology studies, those working as sound masters, musicians in bands, electronic
music producers, and music enthusiasts with no relevant experience. The mean
duration of participation was approximately 60 minutes. The unprocessed results
of the questionnaires can be found in the attached documents. In consideration of
the protection of individual privacy, no personal data is included in the results.
The participants were requested to provide their personal information on a sheet
of paper in the studio. Each participant was assigned a unique index
that links the participant to the questionnaire they completed.
The personal information on the sheet of paper will be destroyed after the
completion of the thesis.
5.3.3 Test Scenario
As illustrated by Figure9, each participant was positioned in the center of the
room. A MIDI controller is positioned in front of the participant. The laptop
- 75 -
screen is employed to present the questionnaire. The principal external monitor is
employed for displaying the synthesis algorithm patch. I observe the progress from
a position behind the participant without disturbing them. In the meantime, I
implement adjustments to the various configurations in accordance with the pro-
gression of the test and respond to any inquiries that the participant may have.
Following the completion of the tests, further communication will be initiated with
those participants who have demonstrated a high level of interest. Further details
regarding the project will be presented, with a particular focus on the concept of
design
5.4 Analysis
The Typeform system offers straightforward data analysis and visualization tools.
Moreover, the unprocessed results can be exported to either the Excel or CSV
file formats. The data is then subjected to further processing and analysis using
the Python libraries Pandas and Matplotlib. The results were analyzed in two
dimensions. The first was the overall feedback by the participants. The second was
the differences between participants with different experience backgrounds. It’s
important to note that this listening test will not undergo quantitative statistical
analysis, such as hypothesis testing. Instead, it will concentrate on qualitatively
discussing feedback from each participant. This approach is due to the small sam-
ple size of 18 participants with diverse backgrounds, which poses challenges for
reliable statistical analysis. Additionally, the test included many subjective ques-
tions requiring textual responses that are best analyzed in the context of each
participant’s unique background.
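The tallying behind such bar charts can be sketched as follows. This is a minimal illustration, not the analysis script used for this thesis: the thesis processing relied on Pandas and Matplotlib, whereas the sketch uses only the Python standard library so that it runs without dependencies. The column names (`participant`, `test`, `coupling_strength`) and the six data rows are hypothetical and do not reproduce the actual Typeform export.

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical stand-in for a CSV export; column names and values are
# assumptions made for illustration, not the real questionnaire data.
EXPORT = """participant,test,coupling_strength
1,Test1,4
1,Test4,5
2,Test1,3
2,Test4,5
3,Test1,4
3,Test4,4
"""


def tally_scale_answers(csv_text, question):
    """Count how often each scale value was chosen, separately per test."""
    counts = defaultdict(Counter)
    for row in csv.DictReader(io.StringIO(csv_text)):
        counts[row["test"]][int(row[question])] += 1
    return counts


counts = tally_scale_answers(EXPORT, "coupling_strength")
for test, counter in sorted(counts.items()):
    # One line per test: scale value -> number of participants choosing it.
    print(test, dict(sorted(counter.items())))
```

Each per-test counter corresponds to one group of colored bars in a grouped bar chart, with the scale value on the x-axis and the count on the y-axis.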
5.4.1 General Feedback
5.4.1.1 Feedback for the Presets
The Figure12 represents the bar charts for the quantization questions in the four
Zerr* Prests tests. The graphs correspond to the three questions in order, from
top to bottom. The quantitative scale of the question is represented on the x-axis,
while the number of results for each value is represented on the y-axis. The four
colored bars represent the results of the four different tests. The results of the text
input questions can be found in Appendix D.
As evidenced by the distribution in the first question’s graph, participants
were generally able to perceive a strong correlation between the timbral
and spatial properties. The coupling was most apparent in the fourth test, followed by
the first and second, and least apparent in the third. This outcome aligns with
the expectations for the four tests at the time of their design. The first and
second spatialization patterns are the spatial movement of the sound source and
the dynamic spatial distribution of the timbre, respectively. The third pattern is
a change in sound shape with the spatial position of the sound source fixed.
Humans recognize sound movement more readily than sound shape. The
fourth source, however, exhibited a pronounced change in auditory perception
before and after spatialization, and the sound effects were highly responsive
to the parameters of the synthesizer.
The responses to the second question were more widely spread
than those to the first question. The prevailing view is that controlling spatial
properties through parameters is relatively straightforward. It can be reasonably
assumed that the underlying logic of controlling spatial properties via sound will not
pose any significant challenges for the performer. The first and second tests
had a greater proportion of participants who felt unable to control the system
effectively. This parallels the first question,
where dynamic features are clearly perceived, particularly when instabilities in the
mapping become apparent. Both the first and second cases employ the trajectory
model, which directly links timbre to space. Linear parameter adjustments result
in non-linear spatial changes, complicating control in these setups.
Figure12: Feedback for the Zerr* presets
Attitudes towards the relationship between technological context and perceived
experience varied widely among participants. The majority felt there was con-
sistency, but a significant number perceived a low match. This could be due to
participants’ limited prior knowledge or the inherent difficulty of discerning tech-
nology through sound or conceptualizing sound based on techniques. Analysis of
the first and third graphs revealed that participant experience remained robust,
even without any interpretability of the system. This insight is valuable for fu-
ture exploration of the application, suggesting that achieving a perceived unity
of sound and space does not necessarily require explicit, understandable mapping
relationships.
A qualitative understanding of Zerr*’s impact across four system types can
be derived by comparing participants’ attitudes towards the original and spatial-
ized sounds, as captured in the textual descriptions from the four tests. However,
not all feedback was informative; some participants misunderstood the textual
description task, and others showed limited engagement in the tests.
In Test 1, participants initially described the original sound using metaphors
like “hair shaver” and “alarm,” or adjectives such as “sharp” and “noisy.” Those
with synthesizer experience could identify the sound as a square wave signal and
accurately understand the purpose of the three parameters. After spatialization,
all participants noted the movement of sound in space, with most describing tim-
bre changes as the sound moved from left to right. However, few participants could
predict which specific timbre changes were associated with spatial movements be-
fore the explanation was given. The closest guess is that spatialization involves
the sharpness of the sound, not a change in frequency.
In Test 2, participants familiar with FM synthesis could identify its unique
qualities and accurately detail the use of the three parameters. While adjusting
these parameters, they noted an increase in the complexity and richness of the
sounds. It is evident from the language employed in the descriptions that the sounds
spatialized in this manner afforded the participants novel and gratifying spatial and
sonic experiences. Given the inability to ascertain the precise spatial location of the sound
source, participants naturally commenced to describe patterns of distribution and
variation of timbre in space. Speculations about the realization method ranged
from pitch and frequency adjustments to filtering and phase shifting, demonstrat-
ing a variety of understandings. The diversity of the descriptions serves to confirm
the conceptual design’s anticipation of the potential of the panning algorithm.
The combination of panning with timbre changes can result in the production of
a wide range of complex effects.
The participants exhibited a more consistent perception of the original sig-
nal of Test 3, and AM/Ring Modulation was discernible by the majority of par-
ticipants. Many mentioned the first parameter’s role in adding more harmonics.
While sometimes interpreted as distortion, their understanding of its function was
generally correct. With regard to the sound after spatialization, the majority of
participants were able to indicate that the sound would be perceived as wider.
This perception was found to be related to the number of sound harmonics and
the degree of harmony. Nevertheless, the overall effects experienced by the participants
were not significant, and none of them found the effect to be appealing. This
suggests that variations of the sound shape alone are not an effective
means of spatialization. The primary function of Zerr* remains the irregular high-speed
spatial panning, with the width of the sound serving an ancillary role.
Test 4 was evidently of considerable interest to the participants, due to its
pronounced effects and striking contrasts between the pre- and post-spatialization
conditions. Because the original signal is simple, all participants appeared able
to comprehend its synthesis principle with the aid of waveforms. With regard to
the signal after spatialization, the majority of participants indicated that it exerts a
pronounced spatial effect. As a brief summary, the original comment from
participant #7 is cited: “It becomes very rich spatial material, with
lots of options, from shivering noise creeches to very distinct static impulses, to
very smooth drones.” The majority of participants demonstrated an understanding
that the observed effect was a consequence of rapid spatial shifts in sound. While
only one noted that the shifts occurred when the signal crossed zero, the correlation
between timbre and spatiality was evidently perceived by the majority of partici-
pants. Additionally, the observed reactions of the participants indicated that they
were enjoying the experience. The outcome of this feedback was, in fact, somewhat
unexpected but nevertheless highly encouraging. This evidence demonstrates that
sample-level spatialization is not solely limited to the domain of technical exper-
imentation. Rather, it can be appreciated by broader audiences. The majority
of participants expressed a positive aesthetic experience. This evidence supports
the further exploration of this type of effect and its potential use in real-world
applications.
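The idea of spatial shifts triggered at zero crossings can be illustrated with a short sketch. It is not the Zerr* implementation: the round-robin choice of the next loudspeaker, the function name `zero_crossing_router` and the sample values are assumptions made purely for illustration. Only the principle of switching the output channel whenever the signal changes sign is taken from the description above.

```python
def zero_crossing_router(signal, num_speakers):
    """Assign each sample to a speaker index, advancing on every zero crossing.

    Assumption: the next speaker is chosen round-robin; the actual mapping
    used in Zerr* may differ.
    """
    speaker = 0
    previous = 0.0  # last non-zero sample seen so far
    assignment = []
    for sample in signal:
        # A sign change relative to the last non-zero sample is a zero crossing.
        if previous * sample < 0:
            speaker = (speaker + 1) % num_speakers
        assignment.append(speaker)
        if sample != 0:
            previous = sample
    return assignment


# Made-up samples with three sign changes, routed to four speakers.
routing = zero_crossing_router([1, 2, -1, -2, 3, 0, -1], num_speakers=4)
print(routing)
```

Because the switch happens where the amplitude passes through zero, each segment between two crossings is sent to a single loudspeaker, which is one plausible reason why the shifts are heard as spatial texture rather than as clicks.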
5.4.1.2 Comprehensive Feedback
The results of the first ranking question in the comprehensive feedback, which asked
which test demonstrated the most significant coupling between sound and spatial
properties, are shown in Table 6. They are consistent with the results read directly
from Figure 12. On average, the fourth test was placed first, followed by the second,
first, and third tests. This order is also consistent with participants’ preferences
in the analysis of their subjective feedback.
Rank Test Average
1 Test4 - Noise + LFO 1.72
2 Test2 - FM 2.61
3 Test1 - Square Wave 2.67
4 Test3 - AM/Ring 3.00
Table6: Rank based on degree of coupling
Figure13 presents the findings derived from the scale questions included in the
comprehensive feedback In accordance with the preceding graph, each subgraph
corresponds to a single scale question. The text descriptions can also be found in
Appendix D.
Responses to the second question, regarding the degree of difficulty in comprehending
the concept of Zerr*, indicate that participants found it relatively straightforward. The
preponderance of participants choosing values in the center suggests that a certain
threshold of conceptual understanding is still required. This implies that there is
still a cost of education if the concept is to gain widespread acceptance. Responses
to question 3 suggest that the main comprehension difficulty among participants
stemmed from a lack of background knowledge. This includes understanding both
synthesis algorithms and audio analysis algorithms. Stripped of these detailed
techniques, the fundamental concept of controlling spatial properties through
timbral variations can be grasped by the majority of participants. In conjunction with
the responses to question 4, it can be observed that even participants who explicitly
stated that their background knowledge was insufficient did not experience any
discomfort in use. A lack of background knowledge can, however, impede
the user’s ability to create their own preset system.
Figure13: Feedback scale bar chart for Zerr* approach
The fifth question posed to participants was whether they recognized the value of
the concept of Zerr* and whether this type of coupling was an enhancement or a
hindrance to musicality. The results were clear, with the vast majority seeing it
as an enhancement to the overall musical experience.
The responses to question six exhibited a notable degree of polarization. The
participants perceived the overall difficulty of creating with Zerr* to be higher than
that of the conventional method. This result can be analyzed in two directions.
First, as stated above, composing with Zerr* requires considerable technical background
knowledge, and many people would be dissuaded by the technical terms. This is
likely the primary reason why a significant proportion of the participants selected
“disagree”. Developers should therefore consider the accessibility of their tools
for users with limited technical expertise. One straightforward
approach is to provide a series of presets that do not require users to be aware
of the underlying technology. Another potential strategy is to educate users as
much as possible. The author therefore particularly values the role
of the FluCoMa project in providing accessible tutorials for those with no prior
experience. In addition to the technical background, another aspect of the Zerr*
method is that it does not reduce the difficulty of creation. It is not an instrument
for enhancing efficiency; rather, it is a device for fostering creativity. As previously
discussed, while this approach facilitates improvisation in a live performance, the
process of writing the mapping system is a challenging compositional work. The
requisite effort will in some cases be comparable to that of composing a complete
fixed track. Generally, in music composition and improvisation, energy expendi-
ture and creative complexity increase when the reuse rate of the mapping system
is low.
One indication of the effectiveness of Zerr* as a creativity tool is the remarkable
consistency observed in the responses to the next question. Participants widely
agreed that this tool opens up new possibilities. This is the
most significant benefit of the tool, and it is the primary reason for its existence.
The subsequent three questions were designed to elicit subjective and open-ended
responses that are highly relevant to the participants’ backgrounds. They
are therefore analyzed in Section 5.4.2.
5.4.2 Experience-related Feedback
The Figure14 contains self-reported relevant background information to the 18
participants. The Instruments lists the number of instruments the participant
plays and the instrument they is best at. The Production Tools refers to the tools
or methods of music production that they most frequently used. The meanings of
the abbreviations are shown in the notes at the bottom of the table, where NM
indicates that the participant does not produce music. The Keywords describes
basically the participant’s profile. Given the considerable diversity of the partici-
pants’ backgrounds, this section was not included in the questionnaire. Instead,
the authors have identified the most relevant backgrounds of the participants from
communication.
Participants’ familiarity with background knowledge is detailed in Figure 15.
Musical sophistication was self-rated using the condensed Gold-MSI scale. The
next three items feature self-assessed familiarity (detailed in the notes below) and
scores of mastery levels from quizzes. The scores were scaled to a maximum of ten.
Data in gray highlight the highest scores for each topic.
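The scaling to a score out of ten can be sketched as a linear conversion. The exact formula is not spelled out in the text, so the function below is an assumption. It is, however, consistent with the reported values: with 12 quiz questions, scores such as 4.17 and 9.17 correspond to 5 and 11 correct answers, and with six audio clips, 3.33 corresponds to 2 correct answers. The function name `scale_to_ten` is hypothetical.

```python
def scale_to_ten(correct, total):
    """Linearly scale a raw result to a score out of ten, rounded to two decimals.

    Assumption: score = 10 * correct / total; the thesis only states that
    scores were standardized to a maximum of ten.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    return round(10 * correct / total, 2)


# All scores a 12-question quiz and a six-clip test can produce under this rule.
quiz_scores = [scale_to_ten(correct, 12) for correct in range(13)]
clip_scores = [scale_to_ten(correct, 6) for correct in range(7)]
print(quiz_scores)
print(clip_scores)
```

Under this reading, every quiz score in the figure is a multiple of 10/12 and every spatial audio score a multiple of 10/6, up to rounding.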
It should be noted that the self-rated question in the Spatial Audio category
asked participants about their familiarity with spatial audio technologies, while
the subsequent test with six audio clips measured participants’ ability to recognize
spatial sound patterns. The results of this category are not highlighted because of an
experimental design error that the author must acknowledge. In these six test
audio clips, the actual listening experience differed from what the author had
hoped for due to stimulus design and room acoustic issues, making the final test
results unreliable. In the first two tests, on direction, the direct front sound
was obstructed by the computer screen. Moreover, the capacity to identify patterns
from a sinusoidal signal was already limited, which resulted in a near-absence
of audible differentiation between the front and right directions. In the third and
fourth tests, on the width of the sound, there was no discernible difference in the
near-uniform spread of the sound through the space. The rotations of the last
two stimuli were simple enough to be correctly distinguished by all participants.
Consequently, although the results are presented here, they are not
used further.
Index  Instruments       Production Tools    Keywords
1      3   Drum          SW, DAW             Band Musician, Drummer
2      4   Guitar        SW, NS              Band Musician, Guitarist
3      6+  Violin        SW                  Band Musician, Bassist
4      0   \             DL, AP              New Media Artist
5      3   Pipa          NM                  Music Enthusiast
6      5   SuperCollider AP                  Computational Artist, Live Coder
7      1   Guitar        AP                  Spatial Music Composer & Developer
8      4   Guitar        SW, NS, DAW, OT³¹   Independent Musician
9      1   Voice         DAW, DL             Electronic Music Producer, DJ
10     1   Drum machine  DAW, DL             Electronic Music Producer, DJ
11     4   Guitar        DAW                 Band Musician, Guitarist
12     1   Guitar        NM                  Audio Engineer
13     1   Voice         SW, DAW             Independent Musician, Singer
14     4   Piano         NS, DAW             Recording Engineer, Mixing Engineer
15     3   Bass          DAW                 Audio Engineer
16     0   \             NM                  Audio Engineer
17     3   Guitar        SW, DAW             Audio Engineer
18     4   Guitar        DAW                 Audio Engineer, Music Producer
NM: Non-Musician (“I don’t make music.”)
SW: Song writing with main instruments (Guitar, Piano etc.)
NS: Music Notation Software (Musescore, Sibelius etc.)
DAW: Digital Audio Workstation (Logic Pro, Ableton Live, Reaper etc.)
DL: DAWless (Modular synthesizer, Sampler, Groove Boxes etc.)
AP: Audio Programming (PureData, Max/MSP, SuperCollider etc.)
OT: Other
Figure 14: Self-reported backgrounds
³¹Field recording
A salient correlation between the backgrounds of the participants and their evaluations
of the Zerr* system was evident. Among the three categories of background knowledge,
knowledge of sound synthesis algorithms had the most pronounced impact,
as participants who had experience in this area were better able to understand the
specific changes in sound before and after spatialization. The influence of knowledge
of audio analysis was less pronounced.
Index  Musical Sophistication  Spatial Audio  Sound Synthesis  Audio Analysis
1      8.57                    C 3.33         C 4.17           C 5.00
2      7.05                    C 5.00         D 4.17           C 5.00
3      7.24                    C 3.33         D 5.83           C 4.17
4      6.00                    D 6.67         D 8.33           C 4.17
5      7.90                    D 8.33         D 6.67           C 6.67
6      9.24                    C 6.67         A 9.17           B 5.83
7      7.90                    A 5.00         B 9.17           B 7.50
8      6.95                    C 5.00         C 5.00           B 5.00
9      8.00                    C 6.67         B 7.50           B 5.83
10     9.33                    B 10.0         C 7.50           B 6.67
11     8.57                    D 6.67         C 7.50           C 6.67
12     5.43                    C 5.00         D 5.00           B 8.33
13     7.62                    D 10.0         C 5.83           C 4.17
14     7.52                    C 5.00         B 7.50           B 5.00
15     8.67                    B 8.33         C 8.33           B 5.00
16     5.90                    C 10.0         D 4.17           C 3.33
17     7.71                    B 6.67         B 7.50           B 5.00
18     7.43                    C 8.33         B 9.17           B 7.50
Spatial Audio:
A: I’m an expert in spatial audio algorithms.
B: I have experience in making spatial audio pieces.
C: I’ve listened to spatial audio pieces.
D: I don’t have any experience about spatial audio.
Sound Synthesis:
A: I’m an expert in sound synthesis algorithms.
B: I skimmed over the algorithm basics.
C: I’m familiar with the synthesizer sounds, but not the algorithms.
D: I don’t have any experience about sound synthesis.
Audio Analysis:
A: I’m an expert in audio analysis algorithms.
B: I skimmed over the algorithm basics.
C: I don’t have any experience about audio analysis.
Figure 15: Results of the experience assessments
The relationship between musical background and the attitudes expressed in the feedback
was not linear. Participants from audio and acoustic engineering who
had essentially no music-related activities showed little interest in
the test, and some of their answers were beside the point. However, enthusiasts who
enjoy music but have no experience in music production showed an innate curiosity
despite the difficulties of understanding. The participants with some experience in
music production were divided between two attitudes. The band musicians
gave an overall negative review of Zerr*, or at least had reservations about
the value of the project; see the feedback from participants #1, #2, and #3.
This may indicate a significant divergence of aesthetic paradigms,
particularly with rock music. It is challenging for these musicians to identify
their preferred musical elements in the test, as they are unable to recognize any
familiar musical element, such as rhythm, melody, or harmony. In contrast, those
engaged in electronic music and music engineering, as well as independent musicians who
favour unconventional approaches, tended to offer more positive feedback. They
thought about musicality more in terms of timbre and specific sound details, which
led to a plethora of intriguing suggestions and associations. First of all,
they could all feel the convenience of controlling the spatial attributes through the
sound, and many of them also reached the conclusion that Zerr* is suitable for live
performance. For instance, participant #10 stated that they thought this system
would be good for live music similar to that of the electronic duo Autechre. Participant
#13 thought that this system would be good for large venues. Participant #18
pointed out that the “combination of spatial and timbre characteristics opens up
possibilities especially for improvising or intuitive composing.” In addition, they
identified a number of issues with Zerr*, including the significant hardware support
required, as well as questions regarding the system’s integration with studio
music production workflows.
Those with prior experience in audio programming and spatial music composition
were able to provide more detailed and instructive suggestions; see the comments
of participants #4, #6, and #7. The responses of these
participants are analyzed in more detail below.
Participant #4 has broad experience in the use of creative tools as a new
media artist and has some familiarity with spatial audio-based artwork. In the
question about the degree of difficulty in understanding, they expressed clearly
that, although they had difficulty in making accurate descriptions using
technical terminology, they could still perceive a significant link between spatial
and sonic attributes. Furthermore, their description of Zerr*’s position is notably
precise: “Zerr* as an additional layers between players and audio systems create
an automatic, changeable and intelligent effect on sound locality, which doesn’t
fully controlled by players themselves, instead indirect gains influences.” In
evaluating the potential of Zerr*, they proposed the possibility of transforming the
system into hardware. After the listening test, they further asked the author
whether Zerr* could be made into an add-on component for an analog synthesizer
or a modular synthesis system. This notion had not previously been
considered, but it does seem feasible from an engineering standpoint.
Participant #7 is a software developer for spatial music with experience similar
to the author’s. However, they have more experience in music production than
the author. They had relatively high scores in all assessments, which is a good
indication of how solid their relevant background is. In communicating
with them, the author sensed a mutual agreement on technical
concerns. In the analysis of the fourth preset, Section 5.4.1.1 provided a quote
in which they expressed particular admiration for the spatial effects that Zerr*
was able to produce. In the comprehensive feedback, they expressed a more
explicit endorsement: “It really shines with more spatial capabilities.” They also
thought Zerr* would be a great tool for live performance. The mapping system is
part of the artist’s personal style, just as important as unique instrumental
techniques or a unique sound.
Participant #6 was the most experienced in the use of audio programming
software among all participants. In the question regarding which instrument they
consider to be their most proficient, they were the only one to mention an audio
programming software, SuperCollider. During the listening tests, they were also
able to describe essentially all the synthesis algorithms, the corresponding
parameters, and the spatial effects brought by the Zerr* system accurately. In the
comprehensive feedback, they gave high praise for the program and expressed their
desire to use Zerr* with different speaker setups in their live performances. Their
original statement about the Zerr* system compared to other spatialization systems
is as follows: “To couple synth (or any parameters) to the spatialization, and to
use audio analysis to decide how the spatialization occurs sounds to me more
interesting than trying to recreate real-world spatial perception (like in systems such
as ambisonics, wfs, etc)”. This notion is entirely consistent with the one expressed
in Section 2 of this thesis: reconstructing a real-world spatial sound experience
is not the same as pursuing an intriguing musical experience. The application of
spatial computing to music should not be that limited. What they suggested in the
additional comment is also a direction the author would like to try subsequently: “I
would love to experiment with this concept in different speaker arrangements that
don’t necessarily follow rings/arrays but more asymmetrical figures, for instance.”
Under the concept of Zerr*, this is a very intuitive direction of exploration. It is
regrettable that, due to the limitations of the test site, this listening test could only
be as asymmetrical as possible within the standard loudspeaker setups. Another
point mentioned in the follow-up communication has also been emphasized so far in
this thesis. They expressed a desire to explore more complex mapping systems based on
audio analysis. They noted that the presets in the test were too simple and would
become a bit boring. Mapping systems don’t necessarily need to be explicit,
and a more chaotic system is more likely to yield unexpected ideas.
5.5 Discussion
In summary, the participants expressed generally positive opinions. This approach
permits spatial and timbral coupling, thereby fostering creativity and facilitating
the exploration of new possibilities for spatial music. Although the Zerr* approach
does not, in terms of functionality, lower the threshold
of spatial music production as a whole, its advantages for live performance
are widely recognized. As an experimental tool, it was not as difficult
to accept as initially expected. It can be observed that there is a certain technical
threshold for the creation of music with this system. However, there is no apparent
aesthetic or technical threshold for the appreciation of the sounds produced using
Zerr*. In short, audiences don’t need to understand the technology behind
Zerr* to appreciate the aesthetic experience it offers. Musicians can realize their
creative visions without fully grasping every detail of the system. Additionally,
this gives the tool’s developer, the author, the freedom to explore and integrate
more experimental features.
Chapter 6
Conclusions & Future Work
This thesis conducts an in-depth review of the development of spatial music cre-
ation tools and introduces a tailored coordinate system for categorizing them,
thereby establishing a robust foundation for the introduction of a novel approach,
Zerr*.
Zerr*, uniquely positioned within the field of spatial music creation tools, ad-
dresses a gap in development and catalyzes new creative possibilities for artists. By
constructing an algorithmic framework that utilizes the intrinsic properties of au-
dio signals in conjunction with an innovative mapping system, Zerr* autonomously
distributes audio across arbitrary loudspeaker setups. This facilitates dynamic and
context-sensitive spatialization and spatial sound synthesis. This approach effec-
tively couples timbre with spatial properties that extend beyond the limitations
of traditional spatialization techniques in the context of music.
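As a purely illustrative sketch of this idea, and not the Zerr* implementation
itself, the following Python fragment shows one way an instantaneous feature could
steer the distribution of a signal block over an irregular loudspeaker layout. The
choice of the zero crossing rate as the driving feature, the mapping of its value
to an angle on a circle, the inverse-distance gain law, and all function names and
loudspeaker coordinates are assumptions made for the example.

```python
import math


def zero_crossing_rate(block):
    """Fraction of adjacent sample pairs whose signs differ (0.0 to 1.0)."""
    if len(block) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(block, block[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(block) - 1)


def feature_to_position(value, radius=1.0):
    """Hypothetical mapping: a feature value in [0, 1] becomes a point on a circle."""
    angle = 2.0 * math.pi * value
    return (radius * math.cos(angle), radius * math.sin(angle))


def speaker_gains(source_xy, speakers_xy, rolloff=1.0, eps=1e-3):
    """Inverse-distance gains, normalized so that the squared gains sum to 1."""
    raw = []
    for sx, sy in speakers_xy:
        distance = math.hypot(source_xy[0] - sx, source_xy[1] - sy)
        raw.append(1.0 / (distance + eps) ** rolloff)
    norm = math.sqrt(sum(g * g for g in raw))
    return [g / norm for g in raw]


def spatialize_block(block, speakers_xy):
    """Return one output channel per loudspeaker for a single input block."""
    position = feature_to_position(zero_crossing_rate(block))
    gains = speaker_gains(position, speakers_xy)
    return [[g * sample for sample in block] for g in gains]


# Assumed, deliberately irregular layout of four loudspeakers (x, y in metres).
speakers = [(1.0, 0.0), (0.2, 0.9), (-0.8, 0.3), (-0.1, -1.1)]
block = [math.sin(2.0 * math.pi * 5 * n / 64) for n in range(64)]
channels = spatialize_block(block, speakers)
```

Because the gains depend only on distances between a feature-derived point and the
loudspeaker positions, the same code runs unchanged for any number and arrangement
of loudspeakers, which is the property the sketch is meant to illustrate.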
The research not only expands theoretical knowledge but also demonstrates
a practical implementation designed with extendability and accessibility in mind.
The practical applicability of this system was showcased through the implementation
of the core modules and their encapsulation as a Pure Data package and as JACK
clients, enabling flexible experimentation and integration into diverse workflows.
The emphasis on real-time manipulation and sample-level processing makes this
implementation particularly advantageous for live performance and improvisation.
A comprehensive listening test was conducted, gathering feedback from par-
ticipants of diverse backgrounds and experience levels on the concept and imple-
mentation of the approach. The feedback affirmed a high level of understanding
and acceptance of the approach, offering valuable insights for further refinement.
Future work will concentrate on three principal areas: engineering, applica-
tion, and conceptual development. Engineering efforts will be directed toward
broadening the system’s integration into other host environments. This will be
accompanied by the provision of improved documentation and tutorials. In the
field of application, the system’s potential will be explored through active music
production, the organization of improvisation performances, and other practical
uses. Finally, the potential for further theoretical research and conceptual expan-
sion is considerable. A promising direction involves integrating learned features
that are well-aligned with contemporary advancements in artificial intelligence.
The integration of AI in spatial music technologies represents an exciting research
frontier (Einbond et al., 2024). Initial studies have already demonstrated the po-
tential of this approach.
List of Figures
Figure1: Pierre Henry performing with the pupitre d’espace .................... - 17 -
Figure2: Karlheinz Stockhause manipulate the rotating loudspeaker ........ - 18 -
Figure3: Sonic Trajectory for Poème Électronique .................................... - 23 -
Figure4: Interaction schematic between instrument and musician ............ - 24 -
Figure5: Signal flow in live performance .................................................... - 43 -
Figure6: Signal flow in improvisation performance .................................... - 45 -
Figure7: Signal flow of Zerr* approach ...................................................... - 46 -
Figure8: Example PD patch for eight loudspeakers ................................... - 64 -
Figure9: TU-Studio E-N 325 © TU Studio Team ...................................... - 68 -
Figure10: Synthesis algorithm patch for Zerr* listening test ..................... - 69 -
Figure11: MIDI controller for synthesis algorithm patch ........................... - 70 -
Figure12: Feedback for the Zerr* presets ................................................... - 77 -
Figure13: Feedback scale bar chart for Zerr* approach ............................. - 80 -
Figure14: Self-reported backgrounds ......................................................... - 82 -
Figure15: Results of the experience assesments ......................................... - 83 -
List of Tables
Table1: Acousmonium (left) and Gmebaphone-1 (right) ........................... - 21 -
Table2: Instantaneous features .................................................................. - 60 -
Table3: Musical experience evaluation form .............................................. - 72 -
Table4: Comprehensive feedback session ................................................... - 74 -
Table5: Questionnaire Screenshots ............................................................ - 75 -
Table6: Rank based on degree of coupling ................................................. - 79 -
Bibliography
Agger, S., Bresson, J., & Carpentier, T. (2017). Landschaften – Visualization,
Control and Processing of Sounds in 3D Spaces. International Computer Music
Conference (ICMC'17).
Alunno, M., & Yarce Botero, A. (2017). Directional landscapes: using parametric
loudspeakers for sound reproduction in art. Journal of New Music Research,
46(2), 201–211.
Austin, L., & Smalley, D. (2000). Sound diffusion in composition and performance:
an interview with Denis Smalley. Computer Music Journal, 24(2), 10–21.
Baalman, M. A. (2010). Spatial composition techniques and sound spatialisation
technologies. Organised Sound, 15(3), 209–218.
Barbosa, Á. (2003). Displaced soundscapes: A survey of network systems for music
and sonic art creation. Leonardo Music Journal, 13, 53–59.
Bascou, C. (2013). HoloPad: an original instrument for multi-touch control of
sound spatialisation based on a two-stage DBAP.
Bates, E. (2009). The composition and performance of spatial music.
Berkhout, A. J., Vries, D. de, & Vogel, P. (1993). Acoustic control by wave field
synthesis. The Journal of the Acoustical Society of America, 93(5), 2764–2778.
Blackwell, T., & Young, M. (2004). Swarm granulator. Workshops on Applications
of Evolutionary Computation, 399–408.
Blumlein, A. (1933). Improvements in and relating to sound-transmission,
sound-recording and sound-reproducing systems. UK Patent, 394325.
Bogdanov, D., Wack, N., Gómez Gutiérrez, E., Gulati, S., Boyer, H., Mayor, O.,
Roma Trepat, G., Salamon, J., Zapata González, J. R., Serra, X., & others.
(2013). Essentia: An audio analysis library for music information retrieval.
In A. Britto, F. Gouyon, & S. Dixon (Eds.), 14th Conference of the International
Society for Music Information Retrieval (ISMIR), Curitiba, Brazil,
November 4–8, 2013, 493–498.
Brech, M. (2015). Der hörbare Raum: Entdeckung, Erforschung und musikalische
Gestaltung mit analoger Technologie (Vol. 13). transcript Verlag.
Brech, M., Coler, H. von, & Paland, R. (2015). Aspects of space in Luigi Nono’s
Prometeo and the use of the Halaphon. Compositions for Audible Space. The
Early Electroacoustic Music and Its Contexts. Music and Sound Culture, 193–204.
Bresson, J. (2012). Spatial structures programming for music. Spatial Computing
Workshop (SCW).
Bresson, J., Agon, C., & Assayag, G. (2011). OpenMusic: visual programming
environment for music composition, analysis and research. Proceedings of the
19th ACM International Conference on Multimedia, 743–746.
Bresson, J., Bouche, D., Carpentier, T., Schwarz, D., & Garcia, J. (2017).
Next-generation Computer-aided Composition Environment: A new implementation
of OpenMusic. International Computer Music Conference (ICMC'17).
Carpentier, T. (2015). ToscA: an OSC communication plugin for object-oriented
spatialization authoring. 41st International Computer Music Conference
(ICMC), 368–371.
Carpentier, T. (2018). A new implementation of Spat in Max. 15th Sound and
Music Computing Conference (SMC2018), 184–191.
Clarke, M. (1999). Composing with multi-channel spatialisation as an aspect of
synthesis. 25th International Computer Music Conference, 17–19.
Clozier, C., & Olsson, J. (2001). The gmebaphone concept and the cybernéphone
instrument. Computer Music Journal, 81–90.
Coduys, T., & Ferry, G. (2004). Iannix aesthetical/symbolic visualisations for hy-
permedia composition. Journées D'informatique Musicale.
Coler, H. von. (2019). A JACK-based application for spectro-spatial additive syn-
thesis. Proceedings of the 17th Linux Audio Conference (LAC-19), Stanford
University, USA.
Coler, H. von, Schuladen, P., & Tonnätt, N. (2021). SeamLess Integration of Spa-
tial Sound Reproduction Methods.
Coler, H. von, Tonnätt, N., Kather, V., & Chafe, C. (2020). Sprawl: A network
system for enhanced interaction in musical ensembles. Proceedings of the 18th
Linux Audio Conference, 33–37.
Collins, N. (2011). SCMIR: A SuperCollider music information retrieval library.
ICMC.
Cross, T. Reframing Sound Shapes in Spectromorphological Composition: Notat-
ing perspectival space through spherical, Euclidean and Cartesian-coordinate
systems. Organised Sound, 1–11.
Dack, J. (2001). Diffusion as Performance. In IIASSRC Conference Proceedings
(pp. 81–88). IIASSRC Conference Proceedings.
Daniel, J. (2003, May). Spatial Sound Encoding Including Near Field Effect:
Introducing Distance Coding Filters and a Viable, New Ambisonic Format.
Audio Engineering Society Conference: 23rd International Conference: Signal
Processing in Audio Recording and Reproduction.
https://www.aes.org/e-lib/browse.cfm?elib=12321
Davis, T., & Rebelo, P. (2005). Hearing emergence: towards sound-based self-or-
ganisation.
Decroupet, P., Ungeheuer, E., & Kohl, J. (1998). Through the sensory looking-
glass: the aesthetic and serial foundations of Gesang der Jünglinge. Perspec-
tives of New Music, 97–142.
Desantos, S., Roads, C., & Bayle, F. (1997). Acousmatic morphology: an interview
with François Bayle. Computer Music Journal, 11–19.
Dilger, T. (2013). Graphical Spatialization Program with Real Time Interactions
(GASPR). Intelligent Technologies for Interactive Entertainment: 5th Inter-
national ICST Conference, INTETAIN 2013, Mons, Belgium, July 3-5, 2013,
Revised Selected Papers 5, 136–145.
Dolby. (2014). Dolby Atmos Next-Generation Audio for Cinema.
Einbond, A., & Schwarz, D. (2010). Spatializing timbre with corpus-based con-
catenative synthesis. ICMC, 72–75.
Einbond, A., Carpentier, T., Schwarz, D., & Bresson, J. (2024). Embodying Spa-
tial Sound Synthesis with AI in Two Compositions for Instruments and 3-D
Electronics. Computer Music Journal, 1–19.
Garavaglia, J. A. (2016). Creating Multiple Spatial Settings with “Granular Spa-
tialisation” in the High-Density Loudspeaker Array of the Cube Concert Hall.
Computer Music Journal, 40(4), 79–90.
Garcia, J., Bresson, J., & Carpentier, T. (2015). Towards interactive authoring
tools for composing spatialization. 2015 IEEE Symposium on 3D User
Interfaces (3DUI), 151–152.
Garcia, J., Bresson, J., Schumacher, M., et al. (2015). Tools and applications
for interactive-algorithmic control of sound spatialization in OpenMusic.
inSONIC2015, Aesthetics of Spatial Audio in Sound, Music and Sound Art.
Garcia, J., Carpentier, T., & Bresson, J. (2017). Interactive-compositional author-
ing of sound spatialization. Journal of New Music Research, 46(1), 74–86.
Garcia, J., Favory, X., & Bresson, J. (2016). Trajectoires: A mobile application
for controlling sound spatialization. Proceedings of the 2016 CHI Conference
Extended Abstracts on Human Factors in Computing Systems, 3671–3674.
Geier, M., Ahrens, J., & Spors, S. (2010). Object-based audio reproduction and
the audio scene description format. Organised Sound, 15(3), 219–227.
Geier, M., Hohn, T., & Spors, S. (2012). An open-source C++ framework for mul-
tithreaded realtime multichannel audio applications. Proc. Linux Audio Conf,
183–188.
Gerzon, M. A. (1973). Periphony: With-height sound reproduction. Journal of the
Audio Engineering Society, 21(1), 2–10.
Hagan, K. L. (2017). Textural composition: Aesthetics, techniques, and spatial-
ization for high-density loudspeaker arrays. Computer Music Journal, 41(1),
34–45.
Harada, T. (1992). Real Time Control of 3 D Sound Space by Gesture. Proc.
ICMC, 85–88.
Harrison, J. (1998). Sound, space, sculpture: some thoughts on the ‘what’, ‘how’
and ‘why’ of sound diffusion. Organised Sound, 3(2), 117–127.
James, S. (2012). From autonomous to performative control of timbral spatiali-
sation.
James, S. (2015). Spectromorphology and Spatiomorphology of Sound Shapes: au-
dio-rate AEP and DBAP panning of spectra.
James, S. (2016). A multi-point 2D interface: Audio-rate signals for controlling
complex multi-parametric sound synthesis.
James, S. G. (2005). Developing a flexible and expressive realtime polyphonic wave
terrain synthesis instrument based on a visual and multidimensional method-
ology.
Jaroszewicz, M. (2015). Compositional strategies in spectral spatialization. Univer-
sity of California, Riverside.
Jot, J.-M., & Warusfel, O. (1995). A real-time spatial sound processor for music
and virtual reality applications. ICMC: International Computer Music Con-
ference, 294–295.
Kendall, G. S. (1995). The decorrelation of audio signals and its impact on spatial
imagery. Computer Music Journal, 19(4), 71–87.
Kim-Boyle, D. (2008). Spectral spatialization-an overview. ICMC.
Lengelé, C. (2018). Live 4 Life-A Spatial Performance Tool Focused On Rhythm
And Parameter Loops. ICMC.
Lerch, A. (2012). An introduction to audio content analysis: Applications in signal
processing and music informatics. Wiley-IEEE Press.
Leslie, G., Zamborlin, B., Jodlowski, P., & Schnell, N. (2010). Grainstick: A
collaborative, interactive sound installation. Proceedings of the International
Computer Music Conference (ICMC), 4.
Lidbetter, P. S. (1988, November). The Concepts and Implementation of the Mul-
tichannel Audio Digital Interface (MADI) Format. Audio Engineering Society
Convention 85. https://www.aes.org/e-lib/browse.cfm?elib=4707
Lombardo, V., Valle, A., Fitch, J., Tazelaar, K., Weinzierl, S., & Borczyk, W.
(2009). A virtual-reality reconstruction of poeme electronique based on philo-
logical research. Computer Music Journal, 33(2), 24–47.
Lossius, T., Baltazar, P., & Hogue, T. de la. (2009). DBAP–distance-based am-
plitude panning. ICMC.
Lukes, R. D. (1996). The "Poeme electronique" of Edgard Varese. Harvard
University.
Lynch, H., & Sazdov, R. (2011). An ecologically valid experiment for the com-
parison of established spatial techniques. International Computer Music Con-
ference.
Magnusson, T. (2019). Sonic Writing. 36–37.
Malham, D. G., & Myatt, A. (1995). 3-D sound spatialization using ambisonic
techniques. Computer Music Journal, 19(4), 58–70.
Marshall, M. T., Malloch, J., & Wanderley, M. M. (2009). Gesture control of sound
spatialization for live musical performance. Gesture-Based Human-Computer
Interaction and Simulation: 7th International Gesture Workshop, GW 2007,
Lisbon, Portugal, May 23-25, 2007, Revised Selected Papers 7, 227–238.
McCartney, J. (2002). Rethinking the computer music language: SuperCollider.
Computer Music Journal, 26(4), 61–68.
McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., &
Nieto, O. (2015). librosa: Audio and music signal analysis in Python. SciPy,
18–24.
McGee, R. (2015). Spatial modulation synthesis. ICMC.
McLeran, A., Roads, C., Sturm, B. L., & Shynk, J. J. (2008). Granular sound
spatialization using dictionary-based methods. Proceedings of the 5th Sound
and Music Computing Conference, Berlin, Germany, 1.
Miyama, C., & Dipper, G. (2016). Zirkonium 3.1-a toolkit for spatial composition
and performance. Proceedings of the International Computer Music Confer-
ence, 313, 312.
Mooney, J., & Moore, D. (2007). A concept-based model for the live diffusion of
sound via multiple loudspeakers. Proc. DMRN, 7.
Mooney, J., & Moore, D. (2008). Resound: open-source live sound spatialisation.
Proceedings of the International Computer Music Conference 2008.
Mooney, J., Moore, A., & Moore, D. (2004). M2 diffusion: The live diffusion of
sound in space. Proceedings of the International Computer Music Conference
2004.
Morgan, R. P. (1975). Stockhausen's Writings on Music. The Musical Quarterly,
61(1), 1–16.
Müllensiefen, D., Gingras, B., Musil, J., & Stewart, L. (2014). The musicality of
non-musicians: An index for assessing musical sophistication in the general
population. Plos One, 9(2), e89642.
Negrao, M. C. (2014). ImmLib-A new library for immersive spatial composition.
ICMC.
Normandeau, R. (2009). Timbre Spatialisation: The medium is the space. Organ-
ised Sound, 14(3), 277–285.
Nystrom, E. (2018). Topographic Synthesis: Parameter distribution in spatial tex-
ture. Proceedings of the 2018 International Computer Music Conference, 117–
122.
Oomen, P., Holleman, P., & de Klerk, L. (2016). 4DSOUND: A New Approach
to Spatial Sound Reproduction and Synthesis. White Papers, 238.
Peters, N., Lossius, T., & Schacher, J. C. (2013). The Spatial Sound Description
Interchange Format: Principles, Specification, and Examples. Computer Music
Journal, 37(1), 11–22. http://www.jstor.org/stable/24265581
Pottier, L. (1998). Dynamical spatialization of sound. HOLOPHON: a graphic
and algorithmic editor for Sigma1. DAFx98 Proceedings.
Puckette, M., & others. (1996). Pure Data: another integrated computer music
environment. Proceedings of the Second Intercollege Computer Music Concerts,
37–41.
Pulkki, V. (1997). Virtual sound source positioning using vector base amplitude
panning. Journal of the Audio Engineering Society, 45(6), 456–466.
Pulkki, V. (1998). Creating generic soundscapes in multichannel panning in
Csound synthesis software. Organised Sound, 3(2), 129–134.
Pulkki, V. (2001). Spatial sound generation and perception by amplitude panning
techniques. Helsinki University of Technology.
Pysiewicz, A., & Weinzierl, S. (2017). Instruments for spatial sound control in real
time music performances. a review. Springer.
Reynolds, C. W. (1987). Flocks, herds and schools: A distributed behavioral
model. Proceedings of the 14th Annual Conference on Computer Graphics and
Interactive Techniques, 25–34.
Roads, C. (1978). Automated Granular Synthesis of Sound. Computer Music Jour-
nal, 2(2), 61–62. http://www.jstor.org/stable/3680222
Ross, V. E. (2012). Too Much Change: How Fantasia's Cinematic Innovations
Overwhelmed the Audience of 1940. Kino: The Western Undergraduate Film
Studies Journal, 3(1).
Rothstein, J. (1995). MIDI: A comprehensive introduction (Vol. 7). AR Editions,
Inc.
Schaeffer, P. (2012). In search of a concrete music (Vol. 15). Univ of California
Press.
Schmele, T. (2011). Exploring 3d audio as a new musical language.
Schmele, T., & Lopez, J. J. (2022). Comparisons between VBAP and WFS using
Spatial Sound Synthesis. Audio Engineering Society Convention 153.
Schnell, N., Röbel, A., Schwarz, D., Peeters, G., Borghesi, R., & others. (2009).
MuBu and friends–assembling tools for content based real-time interactive au-
dio processing in Max/MSP. ICMC.
Schumacher, F., Espinoza, V., Mardones, F., Vergara, R., Aránguiz, A., & Aguil-
era, V. (2021). Perceptual recognition of sound trajectories in space. Computer
Music Journal, 45(1), 39–54.
Schumacher, M., & Bresson, J. (2010). Spatial sound synthesis in computer-aided
composition. Organised Sound, 15(3), 271–289.
Shi, C., & Gan, W.-S. (2010). Development of parametric loudspeaker. IEEE Po-
tentials, 29(6), 20–24.
Smalley, D. (1997). Spectromorphology: explaining sound-shapes. Organised
Sound, 2(2), 107–126.
Smalley, J. (2000). Gesang der Jünglinge: History and Analysis. Available at:
http://sites.music.columbia.edu/masterpieces/notes/stockhausen/gesanghistoryandanalysis.pdf
Start, E. (2024, January). Loudspeaker Matrix Arrays: Challenging the way we
create and control sound. Audio Engineering Society Conference: AES 2024
International Conference on Acoustics & Sound Reinforcement.
https://www.aes.org/e-lib/browse.cfm?elib=22351
Stefani, E., & Mooney, J. (2009). Spatial composition in the multi-channel domain:
aesthetics and techniques. Proceedings of the International Computer Music
Conference 2009.
Sturman, D. J., & Zeltzer, D. (1994). A survey of glove-based input. IEEE Com-
puter Graphics and Applications, 14(1), 30–39.
Teruggi, D. (2007). Technology and musique concrète: the technical developments
of the Groupe de Recherches Musicales and their implication in musical com-
position. Organised Sound, 12(3), 213–231.
Theile, G., & Wittek, H. (2004). Wave field synthesis: A promising spatial audio
rendering concept. Acoustical Science and Technology, 25(6), 393–399.
Thiébaut, J.-B. (2005). A graphical interface for trajectory design and musical
purposes. Journées D'informatique Musicale.
Thomson, P. (2004). Atoms and errors: towards a history and aesthetics of mi-
crosound. Organised Sound, 9(2), 207–218.
Todoroff, T. (1995). Real-Time Granular Morphing and Spatialisation of Sounds
with Gestural Control within MAX/FTS. ICMC.
Todoroff, T., Traube, C., & Ledent, J.-M. (1997). NeXTStep graphical interfaces
to control sound processing and spatialization instruments. ICMC.
Topper, D., Burtner, M., & Serafin, S. (2003). Spatio-operational spectral (sos)
synthesis. ICMC.
Torchia, R. H., & Lippe, C. (2004). Techniques for multi-channel real-time spatial
distribution using frequency-domain processing. Proceedings of the 2004 Con-
ference on New Interfaces for Musical Expression, 116–119.
Tremblay, P. A., Green, O., Roma, G., & Harker, A. (2019). From collections to
corpora: Exploring sounds through fluid decomposition. International Com-
puter Music Conference and New York City Electroacoustic Music Festival,
223–228.
Tremblay, P. A., Roma, G., & Green, O. (2021). Enabling programmatic data
mining as musicking: the fluid corpus manipulation toolkit. Computer Music
Journal, 45(2), 9–23.
Truax, B. (1988). Real-time granular synthesis with a digital signal processor.
Computer Music Journal, 12(2), 14–26.
Truax, B. (1998). Composition and diffusion: space in sound in space. Organised
Sound, 3(2), 141–146.
Valiquet, P. (2012). The spatialisation of stereophony: Taking positions in post-
war electroacoustic music. International Review of the Aesthetics and Sociology
of Music, 403–421.
Wakefield, G., & Taylor, G. (2022). Generating Sound & Organizing Time:
Thinking with Gen~ Book 1 (Issue bk.1, pp. 16–25). Cycling '74.
https://books.google.de/books?id=yvV4zwEACAAJ
Wanderley, M. M. (2001). Gestural control of music. International Workshop Hu-
man Supervision and Control in Engineering and Music, 632–644.
Wenzel, E. M., Begault, D. R., Godfroy-Cooper, M., Roginska, A., & Geluso, P.
(2017). Immersive Sound: The Art and Science of Binaural and Multi-Channel
Audio. Routledge.
Wilson, S. (2008). Spatial swarm granulation. ICMC.
Wilson, S. (2009). BEASTMulchLib: BEASTmulchLib is a SuperCollider class li-
brary designed for use in the creation, processing and presentation of complex
multichannel signal chains. Objects include sources, matrix routers and mixers,
and sound processors and spatialisers. The latter are based on a simple user-
extensible plugin architecture. Many classes have elegant GUI representations.
Wilson, S., & Harrison, J. (2010). Rethinking the BEAST: Recent developments
in multichannel composition at Birmingham ElectroAcoustic Sound Theatre.
Organised Sound, 15(3), 239–250.
Wright, M. (2005). Open Sound Control: an enabling technology for musical net-
working. Organised Sound, 10(3), 193–200.
Wright, M., Chaudhary, A., Freed, A., Khoury, S., & Wessel, D. (1999). Audio
applications of the sound description interchange format standard. Audio En-
gineering Society Convention 107.
Yang, Y.-Y., Hira, M., Ni, Z., Chourdia, A., Astafurov, A., Chen, C., Yeh, C.-F.,
Puhrsch, C., Pollack, D., Genzel, D., Greenberg, D., Yang, E. Z., Lian, J.,
Mahadeokar, J., Hwang, J., Chen, J., Goldsborough, P., Roy, P., Narenthiran,
S., Shi, Y. (2021). TorchAudio: Building Blocks for Audio and Speech
Processing. arXiv preprint arXiv:2110.15018.
Zotter, F., Zaunschirm, M., Frank, M., & Kronlachner, M. (2017). A beamformer
to play with wall reflections: The icosahedral loudspeaker. Computer Music
Journal, 41(3), 50–68.
Appendix A: Zerr* External Help Patches
Please note that this is only a demonstration and that the latest, correct
version can be found in the repository.
zerr_features~
reference
zerr_envelopes~
reference
zerr_combinator~
reference
zerr_disperser~
reference
Appendix B: Zerr* Preset Patches
Zerr* Preset A
Zerr* Preset B
Zerr* Preset C
Zerr* Preset D
Appendix C: Synthesis Algorithm & Audio Analysis Quizzes
Synthesis Algorithm Quiz
1. What is the primary function of an oscillator?
To control the volume of the sound
To generate the basic waveform or sound
To modulate other components like filters
To create rhythmic patterns
2. FM Synthesis is best described as:
Frequency Modulation of one oscillator by another
Amplitude Modulation of one oscillator by another
A method of filtering frequencies in a sound
A technique for creating rhythmic patterns
3. Which waveform is typically used for creating bass sounds in subtractive syn-
thesis due to its harmonic richness?
Sine wave
Sawtooth wave
Square wave
Triangle wave
4. What is the primary function of ring modulation in sound synthesis?
To mix two audio signals in a way that creates new harmonic con-
tent
To modulate the frequency of an oscillator
To synchronize the phase of two waveforms
To split an audio signal into multiple frequency bands
5. In FM synthesis, what term is used to describe the modulating oscillator?
Carrier
Modulator
Operator
Transmitter
6. Which type of filter cuts off frequencies above a certain threshold and allows
lower frequencies to pass?
High-pass filter
Low-pass filter
Band-pass filter
Notch filter
7. Which synthesis technique involves combining multiple simple waveforms to
create complex sounds?
Subtractive Synthesis
Additive Synthesis
Wavetable Synthesis
Phase Distortion Synthesis
8. In subtractive synthesis, what is the result of increasing resonance on a filter?
It decreases the volume of the sound.
It emphasizes frequencies around the filter’s cutoff point.
It broadens the range of frequencies that the filter affects.
It changes the waveform shape passing through the filter.
9. An LFO (Low Frequency Oscillator) is typically used to create audible pitches
in sound synthesis.
True
False
10. In subtractive synthesis, the primary method of shaping the sound is by re-
moving certain frequencies from a rich harmonic sound source using a filter.
True
False
11. Ring modulation is a form of amplitude modulation that can produce inhar-
monic overtones, often resulting in bell-like or metallic sounds.
True
False
12. A noise generator can only produce white noise.
True
False
Audio Analysis Quiz
1. The Crest Factor of an audio signal is the ratio of:
Peak amplitude to RMS amplitude.
RMS amplitude to mean amplitude.
Peak amplitude to mean amplitude.
Mean amplitude to peak amplitude.
2. A high Crest Factor in an audio signal typically indicates:
A high level of distortion.
A smooth, sustained sound.
A dynamic range with significant peaks.
A consistent amplitude level.
3. Which of the following best describes Spectral Flatness?
It measures the ‘peakiness’ of the spectrum.
It indicates how noise-like a sound is, compared to being tone-like.
It is the average frequency of the spectrum.
It represents the highest frequency in the spectrum.
4. If a signal’s Spectral Flatness measure is close to 1, the signal can be described
as:
Very tonal.
Very noisy.
Very dynamic.
Very rhythmic.
5. The Spectral Centroid can be used as an indicator of the brightness of a sound.
True
False
6. What does the Spectral Centroid of an audio signal represent?
The average frequency of the spectrum.
The highest frequency in the spectrum.
The loudness of the audio signal.
The duration of the audio signal.
7. Spectral Flux measures:
The rate of change in the spectral power.
The highest power in the spectral domain.
The flatness of the spectrum.
The balance of even and odd harmonics.
8. A higher Spectral Flux value typically indicates a more rapidly changing
spectrum.
True
False
9. The Zero Crossing Rate of an audio signal is:
The rate at which the signal changes from positive to negative
or back.
The frequency at which the amplitude is highest.
The number of times a signal reaches zero amplitude.
The rate at which the spectral energy rolls off.
10. A high Zero Crossing Rate is often indicative of a high-pitched sound.
True
False
11. Spectral Rolloff is a measure of the bandwidth of the signal.
True
False
12. The Spectral Rolloff point is the frequency below which a certain percentage
of the total spectral energy is contained. What is the typical percentage used
in most applications?
50%
85%
95%
60%
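The features addressed in this quiz can each be computed in a few lines. The
following Python sketch uses common textbook definitions. The function names, the
naive DFT, the small floor added before taking logarithms, and the 85% rolloff
default are assumptions for illustration and are not taken from the Zerr* source
code.

```python
import cmath
import math


def magnitude_spectrum(block):
    """Naive DFT magnitudes for bins 0..N/2 (adequate for short blocks only)."""
    n = len(block)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * i / n)
                for i, x in enumerate(block)))
        for k in range(n // 2 + 1)
    ]


def crest_factor(block):
    """Peak amplitude divided by RMS amplitude (assumes a non-silent block)."""
    peak = max(abs(x) for x in block)
    rms = math.sqrt(sum(x * x for x in block) / len(block))
    return peak / rms


def zero_crossing_rate(block):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(block, block[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(block) - 1)


def spectral_centroid(mags, sample_rate, n):
    """Magnitude-weighted mean frequency in Hz; n is the block length."""
    return sum(k * sample_rate / n * m for k, m in enumerate(mags)) / sum(mags)


def spectral_flatness(mags, floor=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum (0 to 1)."""
    power = [m * m + floor for m in mags]
    log_mean = sum(math.log(p) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power))


def spectral_rolloff(mags, sample_rate, n, fraction=0.85):
    """Lowest bin frequency below which `fraction` of the spectral energy lies."""
    power = [m * m for m in mags]
    target = fraction * sum(power)
    running = 0.0
    for k, p in enumerate(power):
        running += p
        if running >= target:
            return k * sample_rate / n
    return (len(mags) - 1) * sample_rate / n
```

Under these definitions a pure sine wave has a flatness close to 0 and a crest
factor of about 1.41, whereas a single impulse has a flat spectrum and therefore a
flatness of 1.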
Appendix D: Textual Feedback
Test1
Index | For Original Sound | For Spatialized Sound
1 | The sound is like the hair shaver. The parameters effect changed the pitch | I can feel more detail after the spatialization. I can feel the sound sometimes very concentrated and sometimes separated.
2 | first parameter frequency; second parameter changes the of long the signal 1 lasts and at the same time also changes the waveform; the third parameter i forget. The sound changes mainly with second parameter. | Obviously the sound property changing is associated with spatial properties but due to my limited knowledge i cannot describe precisely. Frequency seems to have no relation.
3 | button noise in different tonal | Low frequency going left, high frequency going right
4 | metallic, the saw wave is filtered in to square wave | muddy and clear
5 | like a alarm’s sound | the first button will change the sound randomly, but the second and the third one can hear the direction of sound clearly.
6 | Continuous sound, parameter 1 changes pitch, parameter 3 changes pulse width, parameter 2 makes the sound less sharp | Sounds like the perceived amplitude is related somehow to where the sound is positioned in space. Also related to the sharpness of the sound, or when the pulse width is very small it sounded like the sound moved in space more drastically.
7 | sounds substrative synthesis. like a square wave which is folded. 1st knob controls pitch, 2nd some filter, 3rd brightness overtones. 2nd is interesting! | theres a perceptual adding up to it. i feel some tresholds that evoke fast moving. the sound feels more lively. fast movement in space
8 | 1. frequency low/high 2. waveform triangle/sine/square 3. hpf/lpf | 1. feels more spatial compared to the original 2. it moves around between my left side to right side
9 | Slider for Volume, Third knob for pulse width mod, second filter off higher freq, first knob changes pitch/freq. | spatial auto pan from, sound changes from front right top to front right bottom with different different pitches, pan positions switch between left to right based on different filter positions and pulse width
10 | 1: Frequency Speed of frequency change 2: Filter 3 Pulse waveform change., | Spatial pan
11 | the sound is relatively sharp, with some rhythmic feeling the parameter on the top changes the frequency feeling of the sound the second one changes the range the third one controls the richness of the sound | when some parameter changes, the spatial properties changes at the same time, getting harder to control the tone and spatial properties separately the sound with higher frequency or sharper tone gives feeling they are focusing on one point, left or right, very clearly
12 | Sound itself is tonal Knob1 changes pitch Knob 2 changes tonal qualitz Knob 3 makes it noisz | After spatialization the sounds feels like it has higher oscilation Certain spatial directions respond to certain pitches
13 | Top knob: frequency, changed pitch; middle: filter, tone changed; bottom: waveform. | The 4 speakers have different set-up that allow different type of sound to pass through.
14 | top: frequency; mid:sharpness and warmness; bot:different Anteil of Amplitude | it’s more direct with spatialization, and left side has more diffusion
15 | raw, technical, square wave | it can be narrow, but also really wide. with frequency changes, you can also feel changes of the percepted room in the sound, can sound very big and strong and with no direction, but also very direct.
16 | much bass, noisy, not comfortable | sounds like a bee, annoying
17 | sharp, granulated | feels like there are more sound sources, can mostly locate the sound source
18 | square wave to triangle, frequency adjustable, shape width of the waveform adjustable, constant sustained clear tone | “harsher timbre wise, but at the same time not as harsh, because it was spatially distributed and one could lay down in the sound field sound not so focussed, so less annoying”
Test2
Index For Original Sound For Spatialized Sound
1
Original sound: The sounds are more wobbly after I change the second knob.
Spatialized sound: After I changed the third knob, the frequency and the location of the sound both change.

2
Original sound: first one still frequency, second seems to change the combination of the two waves(?), third still the wavewidth. Second is confusing.
Spatialized sound: when the second is set up to a certain value, if you change the third one the sound is shifting between front left and front right.

3
Original sound: Knob above controls the frequency (high low). Knob middle controls the cut off. Knob below is like an oscillator
Spatialized sound: Turn the knob clockwise and the sound also goes clockwise. At the same time, the parameters also increase the frequency.

4
Original sound: floating, rounded, using square wave to modulate the frequency of sine wave
Spatialized sound: the location of sound changes with resonance

5
Original sound: the both sides are keeping more balances, can't feel the dynamic process anymore
Spatialized sound: like the alien's sound

6
Original sound: FM synthesis, param one is carrier freq, param two is modulation amount, param 3 is modulator freq
Spatialized sound: sounded like the perceived pitch is what is guiding the spatialization. when modulating the frequency and finding slow frequency oscillations I could hear the sound moving with the pitch

7
Original sound: def modulated sound. goes from sine to noisy harsh rich overtone but also more tone sound charac. becomes rhythmic in some values of knob 23
Spatialized sound: filtering invokes the movement. clockwise. but the 1st param also creates fast rhythmic back and forth panning related to the sound peaks. rather fast movements

8
Original sound: 1. fundamental frequency 2. cut-off frequency/waveform 3. intensity of the sound
Spatialized sound: feels more dynamic phase shifting, frequency modulation

9
Original sound: 2nd knob fm depth, 1st knob modulator freq, 3rd knob, carrier freq.
Spatialized sound: first version stereo field, second version adding surround movements to sound elements based on different parameter changes

10
Original sound: Frequency modulator
Spatialized sound: Shaos

11
Original sound: the sound is sinus smooth sound. the parameter on the top controls the frequency, the second one controls the variation or how much the other wave interrupt the sound, the third one also changes the frequency but in a different way, it also controls the interception of the other wave
Spatialized sound: feels like speeding up, increasing the frequency while running around clockwise. with the sound getting noisy, or complicated, the spatial properties also change, more frequency, move faster

12
Original sound: Sine wave, with different frequency modulators. Tonal. Metallic sounds very bright can be achieved
Spatialized sound: Roomy sound. With pitch change, sound moves left to right. high oscillations

13
Original sound: Top knob: frequency; middle: how often does the bottom knob effect happen; bottom: adding a second wave in.
Spatialized sound: Space journey. Different frequency pass and other set-up for receival sound for different loudspeaker.

14
Original sound: top: basic frequency; mid: the frequency of 2nd signal; bot: sampling rate
Spatialized sound: softer, different moving speed

15
Original sound: low Sine wave, change in frequency but also waveform
Spatialized sound: perceived space changes with frequency and shift in parameter 2. it gets more sounding like metal, and feels a bit wobbly

16
Original sound: continuous, wavy, smooth
Spatialized sound: higher, spacy

17
Original sound: sound is width, change is sensitive
Spatialized sound: sound barely changes, but is hard to locate

18
Original sound: from low and simple very fast to very complex sound with lots of harmonics, can be very harsh sounding. my guess: fm synthesis 1. fundamental frequency 2. modulator frequency 3. modulator gain
Spatialized sound: it's possible to generate interesting movements of soundshape and spatial positioning. interesting points where sound switches rapidly, like stable sustained sound (sine like) breaks into modulated harsh wobbling sound
Test3
Index For Original Sound For Spatialized Sound
1
Original sound: The change of sound is very intuitive.
Spatialized sound: When the modulator frequency and depth are bigger, the sound is wider

2
Original sound: third parameter frequency also relates to spatial properties. first one changes the quality of sound dramatically, seems to become very metallic and inharmonic. second one seems to be modulator differences.
Spatialized sound: the first parameter changes, seems to have more other waves integrated, becoming more inharmonic.

3
Original sound: Knob above decides the tonal. Knob middle makes the sound narrow. Knob below makes volume wave
Spatialized sound: the modular amplitude makes the sound stretching in the space

4
Original sound: ring modulation?
Spatialized sound: pure sound is more dispersed

5
Original sound: the third button can create the break of the sound
Spatialized sound: the first button can change the sound from far to near, and another two button can make the sounds jumping

6
Original sound: amplitude modulation; param 1 is frequency of first osc, param 2 is frequency of second osc, param 3 is modulation amount
Spatialized sound: sounds like the more complex the sound (more harmonics) the wider is the spatialization

7
Original sound: am modulation i guess. param 3 controls how fast amplitude is changing. 2 give some distortion, noise maybe related to how much effect. sound feels very organic moving from big to small
Spatialized sound: tough to spatialise. at some in between setting sound was moving to the right but couldnt redo it again. the 2nd param didnt feel so strong effect as it was in stereo.

8
Original sound: resonance noise-like filter frequency modulation
Spatialized sound: feels like a match between frequency shift and spatial movement. diffusion from point to array dot to continuity

9
Original sound: Ring Mod, 1st knob making it more noisy, 2 changes center freq, 3rd amplitude mod.
Spatialized sound: 1st knob center to sides, 2nd no pulsing to pulsing, 3rd more upper harmonics to more dull

10
Original sound: Ambient. Slow changes. Flat
Spatialized sound: Harmonic volumes 2. Modulator Amplitude Modulator Frequency

11
Original sound: sound with high frequency and waving. the first parameter controls proportion of first kind of sound, the second controls the second kind, the third one controls how much the sound waves
Spatialized sound: with higher proportion of first wave, the sound getting more centralized, with more parameter two the tone does not change much but sound getting wider, and the third parameter makes sound more waving, also more spacey

12
Original sound: Tonal base sound, sine wave. First knob changes, how noisy the tone will be getting less tonal. other knobs make the sound oscillate
Spatialized sound: Possible to widen the sound quite a bit, but one frequency stays in the middle

13
Original sound: Top knob: add wave to existing wave? Middle: modulator frequency; Bottom: modulator changing amplitude.
Spatialized sound: I have no idea. But I think when changing knob 3 the change is more obvious.

14
Original sound: top: overlap another high frequency signal; mid: overlap another low frequency signal; bot: amplitude osc
Spatialized sound: central sound very concentrated, width changes with different overtone frequency

15
Original sound: Sine with variable harmonic distortion and amplitude modulation
Spatialized sound: more spacious, a bit darker than the other sounds

16
Original sound: only high frequencies, smooth, consistent
Spatialized sound: consistent, wavy

17
Original sound: sound is smooth, change is clear to hear
Spatialized sound: sound is at right side, properties is clear

18
Original sound: sine wave with fixed frequency. you can add harmonics to the sine with fixed frequency. you can add second sine and adjust it's frequency. smooth sound to a little bit harsh. very stable sustained sound
Spatialized sound: wider, more open. very unstable in comparison. not so smooth anymore, but also not harsh. fragile
Test4
Index For Original Sound For Spatialized Sound
1
Original sound: It sounds like sea and the environment noise.
Spatialized sound: When the parameters of the second and third knobs are bigger, the jump feelings are less.

2
Original sound: related to noise. first parameter changes the wave integrated to the main sin-wave, second parameter changes the main sin-wave, third one more noisy or less.
Spatialized sound: first and second parameters make the jump of spatialization, third don't change much of the jump.

3
Original sound: knob above change the volume. knob middle add a low frequency slowly that gets higher. knob below makes the sound more thin and penetrated
Spatialized sound: Turn the knob above move the sound to the top. turn the knob middle and down raise the pace of the sound jumping from different speaker.

4
Original sound: Low frequency oscillator and noise
Spatialized sound: the cut off of the carrier wave changes the location

5
Original sound: feeling of white noise and can control the coarseness of voice
Spatialized sound: the white voice shows strong space effects

6
Original sound: sounds like a noise source with amplitude being modulated by a sine wave, parameter 3 seems like a high pass filter, param 1 is noise amplitude and param 2 is the sine wave frequency
Spatialized sound: it sounds like when the sine wave oscillator crosses the zero the sound jumps to another speaker. when the sine wave frequency is very slow you can hear it choosing randomly the speakers very quick, so it feels like the is a small threshold near the zero crossing.

7
Original sound: feels like 2 sounds. a deep base sine tone and white noise on top of the waveform. 3rd feels like highpass filter. yet i did not perceive it as one sound rather two entities
Spatialized sound: it becomes very rich spatial material, with lots of options, from shivering noise screeches to very distinct static impulses, to very smooth drones. it really evokes learning and playing!!!

8
Original sound: 1. gain 2. fundamental frequency of a masked tone 3. noise strength
Spatialized sound: sound feels completely different. clockwise-rotation noise strength relative location

9
Original sound: 3rd knob bandpass filtering thru, 2nd knob adjusting LFO freq, 1st knob noise level
Spatialized sound: 1 knob beating of noise oscillator 1, 2nd knob fundamental freq of oscillator two, 3rd knob forgotten

10
Original sound: Noise 1. Noise Amplitude 2. LFO Frequency 3. Resonant Level
Spatialized sound: Subtle

11
Original sound: the sum of some noise and bass sound with very low frequency. the first parameter controls how strong the noises are, the second controls the frequency of bass sound, the third one controls the threshold of filter
Spatialized sound: it just jumps out of somewhere randomly, but with rhythm, but with interception of bass sound the noise get controlled

12
Original sound: White noise, frequency modulation possible
Spatialized sound: The random placement on the speakers is quite intense. Speed of change can be adjusted, from perceivable change to chaos

13
Original sound: Top knob: noise; middle: low freq osc; bottom: filter.
Spatialized sound: If waves in view4 has x y: then I would say each single speaker is responsible for different ranges of x y.

14
Original sound: top: white noise; mid: low frequency; bot: high pass
Spatialized sound: spatial properties come with the amplitude change

15
Original sound: white noise, changes in frequency of the sine and the noise
Spatialized sound: The noise is sent to the loudspeakers and the distribution increases with rising frequency. The sound feels big and the influence on the sound is very good

16
Original sound: rushy, low frequently,
Spatialized sound: more stereo, more 3d effect,

17
Original sound: sound is wide and ambient, change is slight
Spatialized sound: chaotic, cant tell direction of the sound

18
Original sound: noise oscillator modulated by lfo. 1. noise gain, 2. lfo frequency 3. filter frequency of soft bell filter? good sub coloured noise
Spatialized sound: moving fast, noisy, harsh, all over the place. not so sustained anymore but almost percussive. lfo rate and noise gain were associated with rate of movement, maybe some sort of dynamic property
Comprehensive Feedback
Index Question 3 Question 4 Question 8 Question 9 Question 10
1
Question 3: It would be difficult to understand how the parameters affect the spatial sound without basic knowledge of the synthesizer.
Question 4: No. Very intuitive
Question 8: Sorry I can't tell
Question 9: It must be amazing.
Question 10: No

2
Question 3: no knowledge of the field therefore hard to grasp in general.
Question 4: no, in general smooth.
Question 8: enough amount of hardware i guess.
Question 9: yes.
Question 10: not really.

3
Question 3: The synthesizer jargon
Question 4: no
Question 8: The ambisonic sequence
Question 9: Yes
Question 10: no

4
Question 3: Although it's hard for me to use technical terminologies to describe the accurate parameters, which change the spatial properties, I can still strongly feel the connection between the space and sound character. The operation on sound influences indirectly the sound field.
Question 4: nop
Question 8: Zerr* as an additional layers between players and audio systems create an automatic, changeable and intelligent effect on sound locality, which doesn't fully controlled by players themselves, instead indirect gains influences.
Question 9: I see so many potentials.:) looking forward to seeing it be realised in hardware (like daisy?)
Question 10: GREAT

5
Question 3: to find the exact sources where the single sound come from
Question 4: no
Question 8: can self control and have more possibility
Question 9: yah;)
Question 10: can create and try more different sound's resource and maybe can let us to try the combinations of any single sound

6
Question 3: no
Question 4: no
Question 8: To couple synth (or any parameters) to the spatialization, and to use audio analysis to decide how the spatialization occurs sounds to me more interesting than trying to recreate real-world spatial perception (like in systems such as ambisonics, wfs, etc)
Question 9: I would definitely like to play live with different speaker setups and Zerr* as my spatialization tool
Question 10: I would love to experiment with this concept in different speaker arrangements that don't necessarily follow rings/arrays but more asymmetrical figures, for instance.

7
Question 3: how mapping strategy was chosen
Question 4: no
Question 8: i think it really shines with more spatial capabilities. eg more speaker or more dimensions. it feels very intuitive cause you listen to sound and its spatial characters together. different from gui based spat
Question 9: i think it a great tool for playing live - find sound and mappings that match your personal style!!
Question 10: no
8
Question 3: no
Question 4: the connection of the ethernet cable
Question 8: the possibility to freely modify your presets
Question 9: yes.
Question 10: very impressive application made with graphical representation and knobs for adjusting parameters

9
Question 3: no
Question 4: not much thank you
Question 8: more concentrated possibility of sound
Question 9: Yes
Question 10: Good work!

10
Question 3: Good
Question 4: No
Question 8: It's simpler, more practical
Question 9: Live for Autechre
Question 10: Keep going

11
Question 3: the knowledge of synthesizer
Question 4: nope
Question 8: controls the spatial properties while changes the sound properties
Question 9: yes, but need equipment support
Question 10: the control of spatial properties could be more clear

12
Question 3: The translation of parameters from stereo to spatial representation
Question 4: No
Question 8: The ability to freely play with parameters, that automatically will be translated into spatial sound
Question 9: For live music with needed speaker setup, special atmospheres could be created
Question 10: No

13
Question 3: Not for the Zerr but just general concepts.
Question 4: Nope
Question 8: No previous experience with other spatial audio systems.
Question 9: Yes. Makes it possible to pass over energy to engage audience from different area, if it is a big venue.
Question 10: Idk how this would affect studio music production though.

14
Question 3: n/a
Question 4: basic concept knowledgement
Question 8: Freedom and possibility
Question 9: New Music composition
Question 10: Very potential

15
Question 3: not really
Question 4: no
Question 8: great and very easy movement of space and sound
Question 9: yes, totally
Question 10: no comments

16
Question 3: the graphics
Question 4: no
Question 8: wave coupling and 3D distribution
Question 9: yes
Question 10: no idea

17
Question 3: definition of spectral centroid etc.
Question 4: not able to good control the knob
Question 8: can freely assign parameters to the synthesizer and listen to the effect
Question 9: yes
Question 10: tell the people before the test that which loudspeakers are used in which view

18
Question 3: spatial mapping for am, ring modulation
Question 4: could not push square wave synth to the right
Question 8: combination of spatial and timbre characteristics opens up possibilities especially for improvising or intuitive composing
Question 9: see question 8
Question 10: no