Document [original]

Christoph Pörschmann, Johannes M. Arend

Analysis and visualization of dynamic human voice

directivity

Open Access via institutional repository of Technische Universität Berlin

Document type

Conference paper | Published version

(i. e. publisher-created published version, that has been (peer-) reviewed and copyedited; also known as:

Version of Record (VOR), Final Published Version)

This version is available at

https://doi.org/10.14279/depositonce-15551

Citation details

Pörschmann, C. & Arend, J. M. (2022): Analysis and visualization of dynamic human voice directivity. In:

AIA-DAGA 2022 : proceedings of the International Conference on Acoustics. Berlin: Deutsche Gesellschaft für

Akustik e.V. pp. 1444-1447.

This work is protected by copyright and/or related rights. You are free to use this work in any way permitted by

the copyright and related rights legislation that applies to your usage. For other uses, you must obtain

permission from the rights-holder(s).

Analysis and Visualization of Dynamic Human Voice Directivity

Christoph P¨orschmann1, Johannes M. Arend1,2

1TH K¨oln, Institute of Communications Engineering, Cologne, Germany

2TU Berlin, Audio Communication Group, Berlin, Germany

Email: christoph.poerschmann@th-koeln.de

Introduction

In many everyday situations, we experience the inﬂuence

of the human voice directivity. We perceive loudness and

timbre diﬀerently when a speaker faces us or turns away

from us. Often, we use voice directivity intuitively, for ex-

ample when facing a person in a meeting or a casual con-

versation. Such eﬀects of human voice directivity have

long been a topic of research. Early studies were carried

out more than 200 years ago analyzing the directional

radiation of speech in general [1, 2, 3]. In 1929 a ﬁrst ap-

proach determining directivity patterns for several vowels

and fricatives in the horizontal plane has been presented

by Trendelenburg [4]. In 1939 Dunn and Farnsworth

[5] determined spherical directivity patterns for a spoken

sentence at diﬀerent distances in third-octave bands from

63 Hz up to 12 kHz. Since then, voice directivity has been

subject to many studies, either applying human speakers

or dummy heads with integrated mouth simulators.

A speciﬁc characteristic of the human voice that cannot

be analyzed with a dummy head is its dynamic directiv-

ity, i.e., time-variant changes that occur when speaking

or singing. To adequately determine these features, the

sound radiation needs to be captured for an appropri-

ately large number of directions. Katz and D’Alessandro

[6] analyzed the voice directivity in the horizontal plane

for sustained vowels articulated by a professional opera

singer. Even though the study showed no systematic

diﬀerences between the diﬀerent vowels, it gave ﬁrst in-

sights into the vowel-dependencies. Kocon and Mon-

son [7] examined articulation-dependent eﬀects of voice

directivity for diﬀerent vowels in ﬂuent speech. They

observed an eﬀect of the vowel on the directivity pat-

tern and determined the strongest directivity for an [a].

Monson et al. [8] analyzed phoneme-dependencies in the

horizontal plane for measurements of the directivity in

angular steps of 15◦. They showed that the directivity

varies strongly for diﬀerent articulations, e.g., between

voiceless fricatives with a more directive sound radiation

of an [s] than of an [f]. However, this study did not

show signiﬁcant diﬀerences between speech and singing,

and only minor dependencies on the articulation level.

In contrast, Chu and Warnock [9] observed signiﬁcant

diﬀerences depending on the articulation level. Postma

and Katz [10] as well as Postma et al. [11] analyzed the

inﬂuence of dynamic voice directivity for auralizations.

Their results indicate that auralizations involving dy-

namic voice directivity are perceived as more plausible

and exhibit a wider apparent source width than aural-

izations with static voice directivity or omnidirectional

sources. In contrast, a recent study by Ehret et al. [12]

concluded that dynamic directivity is perceptually indis-

tinguishable from a static voice directivity.

In general, voice directivity measurements can either be

performed sequentially for an arbitrary number of direc-

tions or simultaneously using a surrounding microphone

array. In the case of sequential measurements, the spec-

trum of the radiated sound is typically analyzed averaged

over time. Thus, time-variances inﬂuencing the speaker’s

directivity can hardly be resolved and are in most evalua-

tions not considered. For determining the spherical direc-

tivity patterns, it is advantageous to apply surrounding

microphone arrays (SMAs) [13, 14, 15, 16]. Generally,

the setup is restricted to a limited number of sampling

points. Thus, the spatial resolution of the directivity sets

is sparse. Consequently, methods are required for spatial

upsampling of the (sparsely) measured datasets by ap-

propriate interpolation between the measured directions.

Up to now, only a limited number of studies on spherical

voice directivity have been published (e.g., [17, 9, 15]),

but none of them analyzed the time-variant eﬀects of

ﬂuent speech. This pilot study aims to investigate to

what extent high-density directivity sets can be deter-

mined from ﬂuent speech measured on a sparse sam-

pling grid with an SMA. In this context, we presented

the SUpDEq (Spatial Upsampling by Directional Equal-

ization) method [18], which originally was designed for

spatial upsampling of head-related transfer functions. In

P¨orschmann and Arend [19, 20], we evaluated the method

for a dummy head with mouth simulator and showed that

reasonable dense directivity sets can be obtained from

sparse measurements. Our studies revealed that mea-

surements in a surrounding microphone array with 32 mi-

crophones are suﬃcient to generate a decent full-spherical

dense directivity set, with an error averaged over the en-

tire sphere below 4 dB for frequencies up to 8 kHz. In

[21, 22], we applied the SUpDEq method to human voice

directivity and determined full-spherical directivity pat-

terns of ﬁve vowels and three fricatives for 13 persons

using a surrounding spherical microphone array with 32

microphones. The results showed signiﬁcant diﬀerences

between the two groups of phonemes, and between some

of the phonemes of each group. Furthermore, in this con-

text we studied the inﬂuence of hand postures on voice

directivity and showed that cupping the hands around

the mouth or holding a hand in front of the mouth have

stronger eﬀects than diﬀerences between the phonemes

[23]. In this paper, we present a pilot study that investi-

gates dynamic voice directivity of ﬂuent speech and thus

resolves in which way voice directivity changes over time.

DAGA 2022 Stuttgart

1444

Figure 1: Human speaker inside the SMA during the

measurements.

Method and Materials

Measurements

All measurements of this pilot study were performed in

the anechoic chamber of TH K¨oln, sized 4.5 m ×11.7 m

×2.3 m (W×D×H) and with a lower cut-oﬀfrequency

of about 200 Hz. The directivity patterns were measured

with an SMA having a basic shape of a pentakis dodeca-

hedron with 32 Rode NT5 cardioid microphones located

at the vertices of this shape on a constant radius of 1 m.

This sampling scheme allows resolving the directivity up

to a spatial order of N= 4 [13]. In the present study,

an additional Rode NT5 microphone was positioned at

the front serving as a reference and for spectral equal-

ization in postprocessing. Four RME Octamic II devices

served as preampliﬁers and AD / DA converters for the

32 microphones of the SMA. All signals of the SMA were

processed with two RME Fireface UFX audio interfaces.

For a more detailed description of the SMA setup, please

refer to Arend et al. [14, 16]. In the present study, one

of these audio interfaces was also used as a preampliﬁer

and AD / DA converter for the reference microphone.

As test material, we recorded excerpts from Antoine de

Saint-Exupery’s book ”The little prince” [24], spoken by

two German speakers in English and German. To ob-

tain a visual representation of the time-variant changes

of the mouth, we ﬁlmed the speakers using a Sony PXW-

FS7 camera at a frame rate of 180 fps. The video can be

superimposed with the determined directivity patterns

or used to create slow-motion sequences of mouth move-

ments. Fig. 1 shows the SMA with a person inside.

Postprocessing

The postprocessing is only brieﬂy summarized here, as

it was carried out similarly as presented in P¨orschmann

and Arend [20, 22]. To eliminate the inﬂuence of re-

ﬂections and of room modes of the anechoic chamber,

which in our case become prominent below 200 Hz, a low-

frequency extension was applied substituting the origi-

nal low-frequency component in the frequency domain by

an adequately matched one of an analytic low-frequency

model. Furthermore, in the postprocessing, the inaccura-

cies in positioning the subjects in the center of the micro-

01234567

Time [s]

Directivity Index [dB]

Once... old, magni�cent...

I...

[ʌ][əʊ][aɪ] [æ]

1 kHz 2 kHz 4 kHz 8 kHz

Figure 2: Directivity index (DI) over time for one sen-

tence for octave bands of 1 kHz, 2 kHz, 4 kHz, and 8 kHz.

Four selected phonemes are marked on the time axis.

phone array were compensated. As small deviations of

some centimeters already lead to strong impairments in

the spatial upsampling process [25], we applied a method

for distance error compensation that reduces the impair-

ments of distance errors of the measured directivity pat-

terns [20]. Then the compensated multichannel record-

ings were partitioned into frames of 67 ms (3200 samples

at 48 kHz sampling rate) and Hann-windowed (50 % over-

lap). Further processing was done separately for each au-

dio frame. The sparse datasets were spatially upsampled

to a dense grid with 2702 sampling points on a Lebedev

sampling scheme applying the SUpDEq method1, which

we described in detail in P¨orschmann et al. [18] and evalu-

ated for upsampling voice directivity in P¨orschmann and

Arend [20]. However, in contrast to the processing used

in the studies mentioned above, we applied the postpro-

cessing and the upsampling not to transfer functions but

directly to the multichannel audio frames. As output of

the processing chain, a high-density dataset was stored

for each audio frame, which can be used for the further

analysis of dynamic voice directivity.

Results

The results presented here are based on one selected sen-

tence of the recordings, which was the ﬁrst sentence of

[24]: ”Once when I was six years old I saw a magniﬁ-

cent picture in a book, called True Stories from Nature,

about the primeval forest.” In a ﬁrst step, we calculated

the directivity index (DI) for each of the frames:

DI(f, n) = 10 lg 4π|p(φ0,θ0, f, n)|2

2π

�

π/2

�

−π/2

|p(φ,θ, f, n)|2cosθdθdφ

,(1)

with φthe azimuth, θthe elevation, φ0,θ0the frontal

direction, fthe frequency, and nthe index of the frame.

1A Matlab-based implementation of the SUpDEq method can

be accessed on https://github.com/AudioGroupCologne/SUpDEq

DAGA 2022 Stuttgart

1445

0°

30°

60°

90°

120°

150°

180°

210°

240°

270°

300°

330°

-18

-12

-6

0 dB

[

�

]

0°

30°

60°

90°

120°

150°

180°

210°

240°

270°

300°

330°

-18

-12

-6

0 dB

[

��

]

0°

30°

60°

90°

120°

150°

180°

210°

240°

270°

300°

330°

-18

-12

-6

0 dB

�

]

0°

30°

60°

90°

120°

150°

180°

210°

240°

270°

300°

330°

-18

-12

-6

0 dB

[æ]

1 kHz 2 kHz 4 kHz 8 kHz

Figure 3: Directivity in the horizontal plane for an [

�

], an [

��

], an [a

�

], and an [æ] in octave bands of 1 kHz, 2 kHz,

4 kHz, and 8 kHz normalized to a maximal value of 0 dB.

90°

60°

30°

Front

-30°

-60°

-90°

-60°

-30°

Back

30°

60°

-18

-12

-6

0 dB

[

�

]

90°

60°

30°

Front

-30°

-60°

-90°

-60°

-30°

Back

30°

60°

-18

-12

-6

0 dB

[

��

]

90°

60°

30°

Front

-30°

-60°

-90°

-60°

-30°

Back

30°

60°

-18

-12

-6

0 dB

�

]

90°

60°

30°

Front

-30°

-60°

-90°

-60°

-30°

Back

30°

60°

-18

-12

-6

0 dB

[æ]

1 kHz 2 kHz 4 kHz 8 kHz

Figure 4: Directivity in the vertical plane for an [

�

], an [

��

], an [a

�

], and an [æ] in octave bands of 1 kHz, 2 kHz, 4 kHz,

and 8 kHz normalized to a maximal value of 0 dB.

Fig. 2 shows the DI over time in octave bands. Since no

stable values were obtained for speech pauses and sec-

tions with low energy, we refrained from plotting frames

with an energy of 14 dB or more below the average energy

of the entire sentence. Generally, the DI increases with

frequency, which is in line with other studies [15, 22] and

can easily be explained by frequency-dependent diﬀrac-

tion eﬀects of the head. Furthermore, we found varia-

tions in DI over time caused by phoneme-dependencies.

These variations occurred in a similar way as in our pre-

vious studies [21, 22], in which we analyzed directivity

patterns of a variety of separately articulated phonemes.

For our test sentence, we observed the highest DI for an

[

�

] at the beginning of the sentence.

Then we determined directivity patterns for selected

phonemes within the sentence. Fig. 3 and 4 show the

directivity patterns in the horizontal and vertical plane,

for four diﬀerent time frames in which an [

�

], an [

��

], an

�

], and an [æ] were spoken. The respective time frames

are also marked in Fig. 2. It can be observed that sig-

niﬁcant diﬀerences between the phonemes occur in the

horizontal plane. While for an [

�

] and an [a

�

], the direc-

tivity patterns in the 4 kHz and 8 kHz octave band are

very similar, they are stronger directed to the frontal di-

rection for an [

��

] at 8 kHz. Furthermore, the plots show

slight asymmetries, e.g., for the [

��

] or the [æ]. In the

vertical plane, the frequency-dependent diﬀerences seem

to be smaller than in the horizontal plane, at least in

the frontal hemisphere. In the vertical plane, both, the

variations between the phonemes as well as between the

diﬀerent frequency bands, tend to be reduced. However,

in future work, these ﬁndings need to be examined in

more detail based on larger parts of the test material.

Conclusion

In this pilot study, we analyzed dynamic voice directiv-

ity of ﬂuent speech. We presented and demonstrated a

method to determine high-density directivity sets from

sparse measurements carried out with an SMA with 32

microphones. The results reveal how the dynamic direc-

tivity changes in ﬂuent speech over time and how this

aﬀects the DI. Furthermore, superimposing the directiv-

ity plots with a ﬁlmed sequence of the subject allows

studying the dynamics of the lip movements as well as of

the form and the size of the mouth opening. The results

presented in this pilot study are only based on one single

sentence. Next, we plan to systematically segment and

evaluate larger recordings. This will allow for a more de-

tailed analysis of variations and ﬂuctuations of the direc-

tivity. For this, processing tools for speech segmentation

need to be applied and appropriately adapted.

The results of this study are of high relevance for diﬀerent

purposes. First, it is of general interest when studying

voice production to analyze the dynamic voice directivity

DAGA 2022 Stuttgart

1446

in detail for ﬂuent speech. Second, methods and datasets

are required for applications in the ﬁeld of virtual reality,

augmented reality, or room acoustic simulation, to inte-

grate adequate voice radiation patterns into the process

of sound-ﬁeld calculation. The question of whether and

to what extent time-variant aspects have to be consid-

ered has a strong inﬂuence on the system design and the

required methods for obtaining the directivity datasets.

Finally, when reproducing one’s own voice in a virtual

acoustic environment [26, 27, 28, 29], its dynamic direc-

tivity could be perceptible for the speaker himself.

Acknowledgements

We thank Raphael Gillioz and Kai Altwicker for their

work on the measurements and the segmentation of the

raw recordings. The research has been carried out in the

research project NarDasS, funded by the Federal Min-

istry of Education and Research in Germany, support

code: BMBF 03FH014IX5-NarDasS.

References

[1] Saunders, G., Treatise on Theaters, I. and J. Taylor, London,

1790.

[2] Wyatt, B., Observation on the Design for the Theatre Royal,

Drury Lane, J. Taylor, London, 1813.

[3] Henry, J., “Annual Report of the Board of Regents of the

Smithsonian Institution,” Technical report, A. G. F. Nichol-

son, Washington, DC, 1857.

[4] Trendelenburg, F., “Beitrag zur Frage der Stimm-

richtwirkung,” Zeitschrift f¨ur techn. Physik, 11, pp. 558–563,

1929.

[5] Dunn, H. K. and Farnsworth, D. W., “Exploration of pressure

ﬁeld around the human head during speech,” The Journal of

the Acoustical Society of America, 10, pp. 184–199, 1939, doi:

https://doi.org/10.1121/1.1915975.

[6] Katz, B. and D’Alessandro, C., “Directivity measurements of

the singing voice,” in Proceedings of the 19th International

Congress on Acoustics, 2007.

[7] Kocon, P. and Monson, B. B., “Horizontal directivity pat-

terns diﬀer between vowels extracted from running speech,”

The Journal of the Acoustical Society of America, 144(1), pp.

EL7–EL12, 2018, doi:10.1121/1.5044508.

[8] Monson, B. B., Hunter, E. J., and Story, B. H., “Horizontal

directivity of low- and high-frequency energy in speech and

singing,” The Journal of the Acoustical Society of America,

132(1), pp. 433–441, 2012, doi:10.1121/1.4725963.

[9] Chu, W. T. and Warnock, A. C. C., “Detailed Directivity of

Sound Fields Around Human Talkers,” Technical report, 2002,

doi:10.4224/20378930.

[10] Postma, B. N. J. and Katz, B. F., “Dynamic voice directivity

in room acoustic auralizations,” in Proceedings of the 42th

DAGA, pp. 352–355, 2016.

[11] Postma, B. N. J., Demontis, H., and Katz, B. F. G., “Sub-

jective Evaluation of Dynamic Voice Directivity for Auraliza-

tions,” Acta Acustica united with Acustica, 103(2), pp. 181–

184, 2017.

[12] Ehret, J., Stienen, J., Brozdowski, C., B¨onsch, A., Mittelberg,

I., Vorl¨ander, M., and Kuhlen, T. W., “Evaluating the Inﬂu-

ence of Phoneme-Dependent Dynamic Speaker Directivity of

Embodied Conversational Agents’ Speech,” in Proceedings of

the 20th ACM International Conference on Intelligent Vir-

tual Agents, pp. 1–8, ACM, New York, NY, USA, 2020, doi:

10.1145/3383652.3423863.

[13] Pollow, M., Directivity Patterns for Room Acoustical Mea-

surements and Simulations, Logos Verlag Berlin, 2015.

[14] Arend, J. M., Stade, P., and P¨orschmann, C., “Binaural re-

production of self-generated sound in virtual acoustic envi-

ronments,” in Proceedings of the 173rd Meeting of the Acous-

tical Society of America, volume 30, pp. 1–13, 2017, doi:

10.1121/2.0000574.

[15] Brandner, M., Frank, M., and Rudrich, D., “DirPat -

Database and Viewer of 2D/3D Directivity Patterns of Sound

Sources and Receivers,” in Proceedings of the 144th AES Con-

vention, e-Brief 425, 1, pp. 1–5, 2018.

[16] Arend, J. M., L¨ubeck, T., and P¨orschmann, C., “A Reactive

Virtual Acoustic Environment for Interactive Immersive Au-

dio,” in Proceedings of the AES Conference on Immersive and

Interactive Audio, 2019.

[17] Kob, M. and Jers, H., “Directivity measurement of a singer,”

The Journal of the Acoustical Society of America, 105, p.

1003, 1999, doi:10.1121/1.425813.

[18] P¨orschmann, C., Arend, J. M., and Brinkmann, F., “Direc-

tional Equalization of Sparse Head-Related Transfer Function

Sets for Spatial Upsampling,” IEEE/ACM Transactions on

Audio, Speech, and Language Processing, 27(6), pp. 1060 –

1071, 2019, doi:10.1109/TASLP.2019.2908057.

[19] P¨orschmann, C. and Arend, J. M., “A Method for Spatial

Upsampling of Directivity Patterns of Human Speakers by Di-

rectional Equalization,” in Proceedings of the 45th DAGA, pp.

1458 – 1461, 2019.

[20] P¨orschmann, C. and Arend, J. M., “A Method for Spatial

Upsampling of Voice Directivity by Directional Equalization,”

Journal of the Audio Engineering Society, 68(9), pp. 649–663,

2020, doi:10.17743/jaes.2020.0033.

[21] P¨orschmann, C. and Arend, J. M., “Analyzing the Directiv-

ity Patterns of Human Speakers,” in Proceedings of the 46th

DAGA, pp. 1141 – 1144, 2020.

[22] P¨orschmann, C. and Arend, J. M., “Investigating phoneme-

dependencies of spherical voice directivity patterns,” The

Journal of the Acoustical Society of America, 149(6), pp. 4553

– 4564, 2021, doi:10.1121/10.0005401.

[23] P¨orschmann, C. and Arend, J. M., “Eﬀects of hand postures

on voice directivity,” JASA Express Letters, 2(3), p. 035203,

2022, doi:10.1121/10.0009748.

[24] de Saint-Exupery, A., The Little Prince, Reynal & Hitchcock,

1943.

[25] P¨orschmann, C. and Arend, J. M., “How positioning inaccu-

racies inﬂuence the spatial upsampling of sparse head-related

transfer function sets,” in Proceedings of the International

Conference on Spatial Audio - ICSA 2019, pp. 1–8, 2019.

[26] P¨orschmann, C., “One’s own voice in auditory virtual environ-

ments,” Acta Acustica united with Acustica, 87(3), pp. 378–

388, 2001.

[27] P¨orschmann, C. and Pellegrini, R. S., “3-D Audio in Mobile

Communication Devices: Eﬀects of Self-Created and Exter-

nal Sounds on Presence in Auditory Virtual Environments,”

JVRB - Journal of Virtual Reality and Broadcasting, 7(11),

pp. 3–11, 2010.

[28] Neidhardt, A., “Detection of a nearby wall in a virtual echolo-

cation scenario based on measured and simulated OBRIRs,”

in Proceedings of the AES Conference on Spatial Reproduc-

tion, 2018.

[29] Frank, M. and Brandner, M., “Perceptual Evaluation of

Spatial Resolution in Directivity Patterns 2: coincident

source/listener positions,” in Proceedings of the International

Conference on Spatial Audio - ICSA 2019, September, pp.

1–5, 2019.

DAGA 2022 Stuttgart

1447