Document [original]

This version is available at https://doi.org/10.14279/depositonce-8804

right to use is granted. This document is intended solely for

personal, non-commercial use.

Lykartsis, Athanasios; Weinzierl, Stefan (2016): Rhythm Description for Music and Speech Using the Beat

Histogram with Multiple Novelty Functions: First Results. In: Proceedings of the Inter-Noise 2016 : 45th

International Congress and Exposition on Noise Control Engineering : towards a quiter future : August

21-24, 2016, Hamburg. Berlin: Deutsche Gesellschaft für Akustik e.V. pp. 964–967.

Athanasios Lykartsis, Stefan Weinzierl

Rhythm Description for Music and Speech

Using the Beat Histogram with Multiple

Novelty Functions: First Results

Published versionConference paper |

Rhythm Description for Music and Speech Using the Beat Histogram

with Multiple Novelty Functions: First Results

Athanasios Lykartsis1, Stefan Weinzierl1

1Audio Communication Group, TU Berlin, 10587 Berlin, Germany, Email: {athanasios.lykartsis, stefan.weinzierl}@tu-berlin.de

Introduction

In the last few years, methods for rhythmic analysis of

music signals have become widespread in their use due to

their value for diverse tasks of music processing. In the

field of Music Information Retrieval (MIR), features de-

scribing rhythmic content through the properties of the

very low modulation frequencies or periodicities (i.e., be-

tween 0.5 and 10 Hz) in the signal have been developed

for music transcription, beat tracking, rhythmic similar-

ity calculation and music genre classification. For the

latter task in particular, methods have been devised for

capturing either specific temporal patterns in order to

measure similarities between different tracks and com-

pare against predetermined rhythmic patterns [1]; or for

extracting information on the statistical properties of pe-

riodicities in the signal, so as to be able to perform super-

vised classification [2, 3, 4]. In both cases, the basic idea

is the same: A novelty function (extracted through on-

set detection algorithms) of a basic temporal or spectral

property of an audio track (e.g., the signal amplitude) is

extracted, providing information about salient changes

in the signal. This novelty function is then analyzed

through an FFT, an Autocorrelation Function (ACF),

or resonant filters to provide a representation of the peri-

odicities present in the signal and their relative strength.

This form has been dubbed with several names - period-

icity/beat histogram, self-similarity-matrix, inter-onset-

interval histogram - but the basic goal is the same: the

representation provides information concerning the dis-

tribution and temporal evolution of signal qualities and

therefore describes the rhythmic content of the signal.

Up to now, such methods have shown satisfactory results

in the rhythm-based genre classification and rhythmic

similarity tasks either used alone or in combination with

other, non-rhythmic features, inspiring several adapta-

tions and efficient implementations [5, 6]. In related work

for speech signals, similar representations based on peri-

odicities detected in the signal amplitude envelope have

been used only recently and to a limited extent, in order

to analyze their properties and detect differences between

languages and speakers [7].

The above mentioned method has, however, some lim-

itations: first, if a strong beat is lacking or the signal

periodicities are complex and not distinctive (as is the

case, for example, for certain types of jazz music), the

extraction leads to noisy and less informative features.

Furthermore, if the signals are polyphonic, the features

extracted either only express the most prevalent period-

icities (which are the ones caused by the instruments or

voices having the greatest energy or impact on the sig-

nal’s waveform and spectrum) amongst others present,

or the representation loses its ability to provide mean-

ingful features, essentially blurring information since the

rhythms present are interwoven. Finally, the features ex-

tracted are not always easy to interpret, since their calcu-

lation involves multiple steps which do not allow a clear

view of the feature’s significance. To tackle those prob-

lems many strategies have been followed, such as feature

selection (e.g. with mutual information with target data,

to identify the most informative features), dimensionality

reduction (such as PCA, to increase the feature relevance

and independence) and use of more elaborated methods

for the representation [8]. However, we wanted to address

a basic conceptual problem of this class of methods: Al-

though music and other audio signals mostly comprise of

many sources or have properties which change differently

in time (e.g., a musical track’s harmony does not evolve

at the same pace as the drum beat), this information has

not been exploited in the past for rhythmic feature ex-

traction. In that sense, two kinds of approaches would be

suitable: source separation (for example based on Non-

Negative Matrix Factorization - NMF), in order to be

able to apply the rhythm extraction on different instru-

ments or voices; or application of the periodicity repre-

sentation on other signal properties than only amplitude,

providing the possibility to analyze several musical prop-

erties and extract information pertaining to each of them.

This latter approach has the added advantage that it can

be adapted for speech signals. In the following, our ap-

proach and the first results concerning the application of

this method for music and speech are presented.

Method

In order to take account of several signal properties and

their periodicities which do not all necessarily evolve in

the same way, we extract several features [9] and apply

the beat histogram transformation to them [10]. Results

have shown that this method provides good performance

and can be helpful in determining which exact signal

components are responsible for special rhythmic changes

- which in this case were the spectral flux, the RMS am-

plitude and the spectral flatness (concerning the novelty

functions), whereas with regards to the statistics on the

beat histogram, simple statistics such as the mean and

standard deviation but also advanced descriptors such

as tempo have provided the best results. A similar ap-

proach was also used in [11], where we extracted multiple

drum components using NMF for rhythm-based genre

classification. Being motivated by our results, we de-

cided to adapt and apply this method for speech [12, 13],

in order to analyze speech rhythm. So far, only speech

DAGA 2016 Aachen

964

rhythm metrics (analyzing the statistical properties of

duration intervals between salient speech elements) have

been used up to date (see [14] for a review). In our case,

following similar works from [15] and [7] we extracted

spectral (spectral flux, centroid and flatness), temporal

(RMS) and tonal (F0) measures to check for periodicities

and use for automatic language identification (LID). The

novelty functions and features on the beat histograms

for both music and speech can be seen in Tables 1 and 2

respectively.

Table 1: Novelty Functions for Beat Histogram Extraction.

Music Speech

Spectral Flux (SF) Spectral Flux (SF)

Spectral Flatness (SFL) Spectral Flatness (SFL)

Spectral Centroid (SCD) Spectral Centroid (SCD)

RMS Amplitude (RMS) RMS Amplitude (RMS)

Pitch Chroma Coefficients (1-12) Fundamental Frequency F0 (HPS)

MFCCs (1-13)

Tonal Power Ratio (TPR)

Table 2: Subfeatures extracted from Beat Histograms (both

for speech and music).

Distribution Peak

Mean (ME) Salience of Strongest Peak (A1)

Standard Deviation (SD) Salience of 2nd Stronger Peak (A0)

Mean of Derivative (MD) Period of Strongest Peak (P1)

SD of Derivative (SDD) Period of 2nd Stronger Peak (P2)

Skewness (SK) Period of Peak Centroid (P3)

Kurtosis (KU) Ratio of A0 to A1 (RA)

Entropy (EN) Sum (SU)

Geometrical Mean (GM) Sum of Power (SP)

Centroid (CD)

Flatness (FL)

High Frequency Content (HFC)

Table 3: Datasets Used.

Music Speech

GTZAN MULTEXT PD

Ballroom OGI-MLTS

ISMIR2004

Homburg

Unique

Experimental Setup

For the evaluation of the beat histogram features, a base-

line feature set was extracted in every case through calcu-

lation of a series of non-rhythmic features and the respec-

tive novelty functions. The novelty functions for speech

and music are shown in Table 1, whereas the features

on each novelty function can be seen in the Distribution

column of Table 2. The reason for this is the need to be

able to estimate if the use of rhythmic features provides

significantly different results to non-rhythmic ones and

consequently, if they present a genuine improvement or

degradation in the performance of the associated task.

For supervised classification, we use Support Vector Ma-

chines (SVM) [16] in all cases. For the SVM algorithm

the Radial Basis Function (RBF) Kernel is used with the

parameters Cand γdetermined through grid search. All

experiments take place as multiclass one-vs-one classifi-

cation problems with 10-fold cross validation and prior

standardization of the features (z-score, separately for

train and test set). In order to evaluate the classification

we use the average accuracy (Acc.) as a performance

measure.

Concerning the datasets, Table 3 gives an overview of

the resources used in both cases. Almost all datasets

are unbalanced, which has a negative effect on classifi-

cation accuracy, but often this problem is circumvented

by creating a balanced subset of the dataset. Finally,

the quality of the datasets is in both cases, not at an

equal level. For the musical ones, signal quality is good,

but the ground truth can be challenged. For speech, the

MULTEXT dataset has better signal quality, which is

important for the outcome of the experiments and the

conclusions drawn from them.

Results Comparison - Discussion

Results of genre classification and language identification

accuracy can be seen in Fig. 1 and Fig. 2. In Tables 5

and 6, the results of the feature selection for both appli-

cations are shown.

Concerning overall classification accuracy, two tendencies

can be observed: For music, only for one dataset (Ball-

room) the accuracy of the rhythmic features exceeds the

one achieved with the baseline feature set. For speech,

the accuracy of the rhythmic features is higher in one

dataset (MULTEXT PD, almost balanced, read speech,

good quality recordings), but lower on the other (OGI-

MLTS, unbalanced, spontaneous speech, telephone qual-

ity), where results are low at any rate. Comparing music

and speech, we can see that for datasets which are bal-

anced, have good sound quality, are rhythmically distinct

(for music) or containing less variation (for speech), the

performance based on accuracy is good and close to what

other studies achieve. This is probably due to noisy end

features, resulting from a low quality signal at the begin-

ning of the processing chain and an extraction procedure

involving multiple steps.

There are both similarities and differences between the

most efficient features in speech and music: For both

cases, salient novelty functions denoting spectral change

in the signal such as the RMS amplitude, spectral flux

and spectral flatness were amongst the most informative

features. However, in music, tonal components seem to

be as important; their performance, at least for genre

classification, is limited across multiple datasets. In

speech, however, fundamental frequency appears to be

an important feature, particularly in the case where the

dataset quality is low. In Fig. 33, feature groups for

genre classification shows that those tendencies are also

confirmed by the group selection, whereas for speech (Ta-

bles 4 and 6), fundamental frequency is a salient feature

even in adverse conditions (OGI-MLTS). Those results

show that extracting novelty functions which are indica-

tive of salient signal changes provides a good basis for the

extraction of informative features. For the subfeatures,

no candidate came out as a ”winner”, stressing the need

to extract as much information as possible but also to

focus on more meaningful features. On that note, the

tempo information provides a good candidate for such a

follow-up investigation of its properties.

DAGA 2016 Aachen

965

GTZAN Ballroom ISMIR04 Unique Homburg

Datasets

Accuracy (%)

Classification Results

Baseline

Rhythmic

Combined

Prior (Max P)

Prior (Avg.)

Figure 1: Classification results, comparison between

datasets (music). Figure from [10].

MULTEXT PD OGI−MLTS

100

Speech Corpora

Accuracy (%)

Classification Results

Rhythm Feature Set

Baseline Feature Set

Figure 2: Classification results, comparison between

datasets (speech). Figure from [13].

Table 4: Feature group comparison (speech).

Rhythmic feature subset MULTEXT PD OGI-MLTS

All features 70.4 % 31.2 %

RMS Amplitude 67.5 % 25.4 %

Fundamental Pitch 70.4 % 27.4 %

Spectral Flux 67.5 % 25.5 %

Spectral Flatness 66.8 % 24.7 %

Spectral Centroid 64.9 % 24.9 %

Table 5: Best features after feature selection (music). Left:

subfeature, right: novelty function. Table from [10].

Rank GTZAN Ballroom ISMIR04 Unique Homburg

1 MD.RMS P1.SF MD.MFC2 SD.MFC1 SD.RMS

2 FL.RMS A0.SFL CD.MFC1 GM.SFL SD.SPC3

3 GM.SFL SD.SPC3 A0.SF MD.MFC2 FL.SFL

Table 6: Best features after feature selection (speech). Left:

subfeature, right: novelty function. Table from [13].

Rank MULTEXT PD OGI-MLTS

1 FL.SFL SP.HPS

2 GM.SF P3.HPS

3 A2.SF A2.HPS

Conclusions

In this paper we present first results on the use of novel

features for rhythm analysis and rhythm-based LID. The

expansion of the use of periodicity representation meth-

ods from the field of MIR such as the beat histogram

for speech rhythm analysis has provided promising re-

0.1

0.2

0.3

0.4

0.5

0.6

Novelty Functions

Accuracy (%)

Classification Results for Novelty Function and Subfeature Groups

SCD

MFCC1

MFCC2

MFCC3

MFCC4

MFCC5

MFCC6

MFCC7

MFCC8

MFCC9

MFCC10

MFCC11

MFCC12

MFCC13

SFL

SPC1

SPC2

SPC3

SPC4

SPC5

SPC6

SPC7

SPC8

SPC9

SPC10

SPC11

SPC12

STPR

RMS

0.2

0.3

0.4

0.5

0.6

0.7

Subfeatures

Accuracy (%)

SDD

HFC

Figure 3: Feature group comparison (music). Table

from [13].

sults. For the rhythm descriptors, not only the signal

amplitude but also other rhythm-relevant signal quanti-

ties were used as basis for creating the beat histogram

and were found to be relevant. Furthermore, a compre-

hensive array of subfeatures was extracted from the peri-

odicity representation, which provides ample information

about the periodicities in the signal and their patterns.

We could show that classification performance for one

multilingual speech corpus using the SVM algorithm is

comparable to the results of similar studies and close to

those using other basic, non-rhythmic features. Simi-

lar results can be observed for music, where for two out

of five datasets, performance is acceptable and in one

case even better than when using more general features.

In general, concerning the datasets, rhythmic features

provide good or at least acceptable performance for bal-

anced, high-quality sound datasets, both for music and

for speech. Furthermore, the proposed method has the

advantage that it takes into account the rhythmic prop-

erties on the signal (signal properties and features) and

not on the speech element level (syllables), providing a

new perspective for the analysis of speech rhythm and

the related signal properties (such as fundamental fre-

quency for speech). Another important advantage of the

proposed method for speech rhythm analysis is that it is

fully automatic and can be extended to larger datasets.

These conclusions provide several objectives for further

research, such as the application of the method to more

diverse and comprehensive speech corpora (such as the

GLOBALPHONE [17]). At this point, the relation of

the rhythm features to other speech rhythm metrics and

language elements such as syllables and consonant-vowel

clusters is unclear, suggesting another direction for fu-

ture work. Another promising direction is focusing on

specific salient features (such as the tempo, which has

been shown to be easier to extract and understand where

music is concerned, but which has these properties in

speech as well) over different languages and/or genres, in

order to study their behavior and draw conclusions about

whether they can serve as a discriminatory feature. The

DAGA 2016 Aachen

966

use of rhythmic similarity measures as complementary

methods to the beat histogram is also a possible goal, so

as to capture language specific rhythm patters instead

of features describing periodicities. Future goals include

the investigation of optimal parameter settings for fea-

ture extraction, as well as the utilization of unsupervised

classification methods and novel classifiers, such as Deep

Neural Networks (DNNs).

References

[1] Pohle, T.; Schnitzer, D.; Schedl, M.; Knees, P.; Wid-

mer, G. (2009): “On rhythm and general music sim-

ilarity.” In: ISMIR.

[2] Tzanetakis, G.; Cook, P. (2002): “Musical genre

classification of audio signals.” In: Speech and Audio

Processing, IEEE transactions on,10(5):293–302.

[3] Burred, J.J.; Lerch, A. (2003): “A hierarchical ap-

proach to automatic musical genre classification.”

In: DAFx.

[4] Gouyon, F.; Dixon, S.; Pampalk, E.; Widmer, G.

(2004): “Evaluating rhythmic descriptors for musi-

cal genre classification.” In: Proceedings of the AES

25th International Conference, 196–204, Citeseer.

[5] Peeters, G. (2011): “Spectral and temporal periodic-

ity representations of rhythm for the automatic clas-

sification of music audio signal.” In: IEEE Trans-

actions on Audio, Speech and Language Processing,

19(5):1242–1252.

[6] Holzapfel, A.; Flexer, A.; Widmer, G. (2011): “Im-

proving tempo-sensitive and tempo-robust descrip-

tors for rhythmic similarity.” In: Proceedings of the

8th Sound and Music Computing Conference.

[7] Tilsen, S.; Arvaniti, A. (2013): “Speech rhythm

analysis with decomposition of the amplitude enve-

lope: characterizing rhythmic patterns within and

across languages.” In: The Journal of the Acoustical

Society of America,134(1):628–639.

[8] Marchand, U.; Peeters, G. (2014): “The modula-

tion scale spectrum and its application to rhythm-

content description.” In: DAFx.

[9] Lerch, A. (2012): An introduction to audio content

analysis: Applications in signal processing and mu-

sic informatics. John Wiley & Sons.

[10] Lykartsis, A.; Lerch, A. (2015): “Beat histogram

features for rhythm-based musical genre classifica-

tion using multiple novelty functions.” In: DAFx.

[11] Lykartsis, A.; Wu, C.W.; Lerch, A. (2015): “Beat

histogram features from nmf-based novelty functions

for music classification.” In: ISMIR.

[12] Lykartsis, A.; Weinzierl, S. (2015): “Analysis of

speech rhythm for language identification based

on beat histograms.” In: Fortschritte der Akustik:

Tagungsband d. 41. DAGA.

[13] Lykartsis, A.; Weinzierl, S. (2015): “Using the beat

histogram for speech rhythm description and lan-

guage identification.” In: Sixteenth Annual Confer-

ence of the International Speech Communication As-

sociation.

[14] Wagner, P. (2008): “The rhythm of language and

speech: Constraining factors, models, metrics and

applications.” In: Germany: Habilitationsschrift,

University of Bonn.

[15] Rouas, J.L.; Farinas, J.; Pellegrino, F.; Andr´e-

Obrecht, R. (2005): “Rhythmic unit extraction and

modelling for automatic language identification.” In:

Speech Communication,47(4):436–456.

[16] Vapnik, V. (2000): The nature of statistical learning

theory. springer.

[17] Schultz, T. (2002): “Globalphone: a multilingual

speech and text database developed at karlsruhe uni-

versity.” In: INTERSPEECH.

DAGA 2016 Aachen

967