
This version is available at https://doi.org/10.14279/depositonce-9714

right to use is granted. This document is intended solely for

personal, non-commercial use.

Lykartsis, Athanasios; Weinzierl, Stefan (2015): Using the Beat Histogram for Speech Rhythm Description

and Language Identification. In: INTERSPEECH 2015 - 16th Annual Conference of the International

Speech Communication Association, Dresden, Germany, September 6-10, 2015. pp. 1007–1011.
https://www.isca-speech.org/archive/interspeech_2015/i15_1007.html

Athanasios Lykartsis, Stefan Weinzierl

Using the Beat Histogram for Speech

Rhythm Description and Language

Identification

Accepted manuscript (Postprint) | Conference paper


Athanasios Lykartsis¹, Stefan Weinzierl¹

¹Audio Communication Group, Technische Universität Berlin, Germany

[email protected], [email protected]

Abstract

In this paper we present a novel approach for the description of speech rhythm and the extraction of rhythm-related features for automatic language identification (LID). Previous methods have extracted speech rhythm through the calculation of features based on salient elements of speech such as consonants, vowels and syllables. We present how an automatic rhythm extraction method borrowed from music information retrieval, the beat histogram, can be adapted for the analysis of speech rhythm by defining the most relevant novelty functions in the speech signal and extracting features describing their periodicities. We have evaluated those features in a rhythm-based LID task for two multilingual speech corpora using support vector machines, including feature selection methods to identify the most informative descriptors. Results suggest that the method is successful in describing speech rhythm and provides LID classification accuracy comparable to or better than that of other approaches, without the need for a preceding segmentation or annotation of the speech signal. Concerning rhythm typology, the rhythm class hypothesis in its original form seems to be only partly confirmed by our results.

Index Terms: speech rhythm, beat histogram, language identification, novelty functions, rhythm typology

1. Introduction

Speech rhythm and its quantification have been an interesting and controversial matter of research, with implications for language rhythm typology and the possible existence of two or more rhythmic classes of languages (stress- and syllable-timed) [1, 2], dubbed the Rhythm Class Hypothesis [3]. In recent years, the analysis of speech rhythm has focused on obtaining metrics of acoustic correlates of speech rhythm which could provide information about the rhythmic patterns of speech. This is generally done by manually annotating vowels, consonants and stresses in the speech signal and then calculating statistics of the interval durations of those language prominence units, resulting in measures such as ∆C, %V, nPVI and VarcoC [3, 4, 5, 6]. Those metrics have proven useful as first attempts to design descriptors of speech rhythm and were very often used to investigate language rhythm typology, by testing for significant differences between languages and attempting to position the languages on a rhythm continuum between stress- and syllable-timed [4, 5]. For various small speech corpora, they have provided evidence that supports the rhythm class hypothesis and have therefore been seen as adequate measures of speech rhythm [4, 5, 6].

However, the scientific discussion about speech rhythm and its measurement continues up to the present day [7, 8, 9, 10]. In this context, the aforementioned metrics have also been criticized [9, 11, 12, 13] as not being robust with respect to the information they hold about speech rhythm, since differences between languages have not been consistent or significant across all studies, presumably due to the existence of many non-language-specific factors affecting speech rhythm [9]. Other shortcomings are the manual annotation necessary for the procedure (which is tedious and can be subjective or erroneous), their derivation on the basis of abstract language elements (i.e., syllables) as opposed to quantities physically manifest in the speech signal (e.g., its amplitude envelope or other measures) and, finally, their variability with respect to other, non-rhythm-related speech parameters such as speaker or elicitation method [6, 9].

Several promising studies on the description of speech rhythm have taken a different direction, attempting to extract rhythmic quantities directly from the acoustic signal, specifically by extracting salient periodicities and their characteristics from its amplitude envelope [14, 15]. Furthermore, studies from the field of language discrimination [16, 17] have used measures derived from fundamental frequency and amplitude to discriminate between pairs of languages with relative success. Other studies from the field of rhythm- or prosody-based automatic language identification (LID) [18, 19, 20, 21, 22] have modeled rhythm with schemes such as automatic segmentation of the speech signal into pseudosyllables and have extracted statistical features describing energy and fundamental frequency. These produced good results (in the range of 60-80%) in LID tasks, showing that those features can indeed be useful for describing speech rhythm. The crux of those approaches is that the focus is shifted to quantities in the speech signal rather than to the regularities of more linguistically defined speech elements.

This paper follows that rationale, introducing a novel method for speech rhythm description inspired by similar rhythm analysis methods from the field of Audio Content Analysis [23], which have been used with success for tasks such as musical genre classification [24]. We assume that the rhythmic content of a sound can be captured through the signal-inherent periodicities and their properties. This definition does not differentiate between musical and speech signals, providing a unified concept of rhythm which has been called for [25, 26]. In the following sections, the rhythm features are described, after determining the signal properties whose periodicities are relevant for rhythm. The features are evaluated in an automatic LID task for two established multilingual speech corpora in order to draw conclusions about their suitability for rhythm-based LID and about language rhythm typology. Results are encouraging regarding the capacity of the features, but with certain caveats which are discussed. Moreover, findings concerning language rhythm classes are ambivalent. Finally, advantages and disadvantages of the proposed method and the most informative features are discussed and perspectives for further research are given.
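For reference, the interval-based metrics named in the introduction (%V, ∆C, VarcoC, nPVI) can be written down in a few lines. The following is only a sketch using the commonly cited definitions; the function names, the use of the population standard deviation and the example durations are assumptions made for illustration and are not taken from the studies cited above.

```python
from statistics import mean, pstdev

def percent_v(vocalic, consonantal):
    """%V: share of the total duration taken up by vocalic intervals."""
    return 100.0 * sum(vocalic) / (sum(vocalic) + sum(consonantal))

def delta(durations):
    """Delta measure (e.g. Delta-C): standard deviation of interval durations."""
    return pstdev(durations)

def varco(durations):
    """Varco measure (e.g. VarcoC): standard deviation relative to the mean, in percent."""
    return 100.0 * pstdev(durations) / mean(durations)

def npvi(durations):
    """nPVI: mean normalized difference between successive interval durations."""
    pairs = zip(durations, durations[1:])
    return 100.0 * mean(abs(a - b) / ((a + b) / 2.0) for a, b in pairs)

# Hypothetical, hand-made interval durations in seconds (illustration only).
vocalic = [0.10, 0.20, 0.10, 0.20]
consonantal = [0.10, 0.10, 0.10, 0.10]

print(percent_v(vocalic, consonantal))
print(delta(consonantal), varco(consonantal))
print(npvi(vocalic))
```

All four measures need the interval boundaries as input, which is exactly the annotation step that the signal-based approach of this paper avoids.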

2. Method

Various approaches for rhythm description and quantification have been developed in the field of Music Information Retrieval (MIR) [27]. In the context of musical genre classification, the focus lay on the extraction of signal periodicities from a musical excerpt. Beginning with the work of Scheirer [28], a representation for periodicities of the signal amplitude envelope in the lower frequency range was introduced for beat tracking. Tzanetakis and Cook [29] modified and used this representation, called the beat histogram, for extracting rhythmic content features. Similar approaches were also followed by Burred and Lerch [30] and Gouyon et al. [31]. The fundamental assumption is that those features are representative of the regularities in the temporal structure of an acoustic signal, describing multiple aspects of the signal's inherent periodicities. For both music and speech, the beat histogram captures periodicities related to strong, recurring 'beats', in effect salient onsets of the signal's constituent elements.

The beat histogram can be calculated on the basis of the trajectory of various relevant signal quantities over time [32]. The representation then expresses periodicities related to the chosen quantity, which might have different statistical and other properties than those related to amplitude or energy. Those temporal trajectories are called novelty functions [33]. A careful consultation of the most important works in phonetics, MIR and rhythm-based LID (mentioned in Section 1), as well as a study of the important approaches to defining rhythm in music theory and cognition [34, 25, 35], reveals that there are three essential quantities whose temporal evolution must be taken into account for the extraction of speech rhythm.

The amplitude of the signal envelope is an acoustic correlate of perceived loudness. This makes it the basis for the detection of rhythm which results from the changing energy of the signal due to the application of stresses on specific parts of speech in comparison to others. As such, it denotes intonation.

The pitch, i.e. the value of a salient frequency in the signal (for speech, the fundamental frequency F0), and its temporal trajectory are the most important tonal rhythm carrier in the signal and express speech prosody. Features derived from its beat histogram can describe changes in voice melody trajectories, regularities in rising or falling voice pitch or related changes.

Spectral changes are an acoustic correlate of change in sound texture and timbre, which essentially characterize different categories of sounds (such as tonal or noisy) or changes in spectral content (e.g. high- or low-frequency content). Features from a beat histogram based on spectrum novelty can serve as descriptors for the change of speech elements, such as consonants and vowels, or even different formants.

In our study, amplitude novelty is extracted through the calculation of the RMS amplitude of the signal. The fundamental frequency (F0) is extracted through the use of a spectral harmonic product algorithm on a filtered version of the speech signal (using a 4th-order Butterworth lowpass with an 800 Hz cutoff frequency), so as to ensure tracking of the fundamental frequency alone. Three standard features are extracted to track spectral changes [23]: the spectral flux (SF), indicating general spectral change; the spectral flatness (SFL), a measure of signal tonalness or noisiness; and the spectral centroid (SCD), a measure of the spectral centre of weight. The spectral centroid is also computed on a filtered version of the signal (using a 4th-order Butterworth bandpass filter between 300 Hz and 3200 Hz) to ensure that only frequencies in the formant range are considered. More information on those features can be found in [23]. Experiments in musical genre classification using features based on similar amplitude, tonal and spectral shape novelty functions have shown promising results for a wide range of datasets [32], suggesting their suitability for LID, a task analogous to genre classification [26].
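To make the amplitude and spectral novelty measures more concrete, the sketch below computes the RMS amplitude of a time-domain frame and the spectral flux, flatness and centroid of magnitude spectra. It uses common textbook-style definitions that may differ in scaling from the exact formulas in [23]; the function names, the `bin_hz` parameter and the toy spectra are assumptions for illustration. The STFT, the Butterworth filtering and the harmonic product F0 tracker are deliberately left out.

```python
import math

def rms(frame):
    """Root-mean-square amplitude of one time-domain frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def spectral_flux(prev_mag, mag):
    """Euclidean distance between successive magnitude spectra, scaled by bin count."""
    return math.sqrt(sum((m - p) ** 2 for p, m in zip(prev_mag, mag))) / len(mag)

def spectral_flatness(mag, eps=1e-12):
    """Geometric mean over arithmetic mean of the magnitudes (near 1: noisy, near 0: tonal)."""
    log_mean = sum(math.log(m + eps) for m in mag) / len(mag)
    return math.exp(log_mean) / (sum(mag) / len(mag) + eps)

def spectral_centroid(mag, bin_hz):
    """Magnitude-weighted mean frequency in Hz; bin_hz is the spacing between bins."""
    total = sum(mag)
    if total == 0:
        return 0.0
    return sum(k * bin_hz * m for k, m in enumerate(mag)) / total

# Toy magnitude spectra (hypothetical values): one flat, one with a single peak.
flat = [1.0] * 8
peaky = [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0]

print(rms([1.0, -1.0, 1.0, -1.0]))
print(spectral_flux(flat, flat), spectral_flux(flat, peaky))
print(spectral_flatness(flat), spectral_flatness(peaky))
print(spectral_centroid(peaky, bin_hz=100.0))
```

Evaluating each function once per analysis frame yields one temporal trajectory per measure, which is the input to the beat histogram computation.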

[Figure 1 (plot not reproduced). Panels: Temporal Trajectory, Fundamental Frequency (x-axis: seconds); Beat Histogram, Fundamental Frequency (y-axis: beat strength); Temporal Trajectory, Spectral Flux (x-axis: seconds, y-axis: amplitude); Beat Histogram, Spectral Flux (y-axis: beat strength).]

Figure 1: Novelty functions and corresponding beat histograms.

Distribution                      Peak
Mean (ME)                         Salience of Strongest Peak (A1)
Standard Deviation (SD)           Salience of 2nd Stronger Peak (A2)
Mean of Derivative (MD)           Period of Strongest Peak (P1)
SD of Derivative (SDD)            Period of 2nd Stronger Peak (P2)
Skewness (SK)                     Period of Peak Centroid (P3)
Kurtosis (KU)                     Ratio of A0 to A1 (RA)
Entropy (EN)                      Sum (SU)
Geometrical Mean (GM)             Sum of Power (SP)
Centroid (CD)
Flatness (FL)
High Frequency Content (HFC)

Table 1: Subfeatures extracted from beat histograms.
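As an illustration of how a few of the subfeatures in Table 1 could be computed from a beat histogram, consider the sketch below. The exact definitions used in the study are not given in the text, so the choices here (population moments, non-excess kurtosis, entropy in bits, a simple local-maximum peak picker) are assumptions, as are the function names and the example histogram.

```python
import math

def distribution_features(hist):
    """Some 'Distribution' subfeatures of a non-negative beat histogram."""
    n = len(hist)
    total = sum(hist)
    mean = total / n
    sd = math.sqrt(sum((h - mean) ** 2 for h in hist) / n)
    probs = [h / total for h in hist if h > 0]  # histogram normalized to sum 1
    return {
        "ME": mean,
        "SD": sd,
        "SK": sum((h - mean) ** 3 for h in hist) / (n * sd ** 3) if sd > 0 else 0.0,
        "KU": sum((h - mean) ** 4 for h in hist) / (n * sd ** 4) if sd > 0 else 0.0,
        "EN": -sum(p * math.log2(p) for p in probs),
    }

def peak_features(hist, axis):
    """Salience (A1, A2) and axis position (P1, P2) of the two strongest local maxima."""
    peaks = [i for i in range(1, len(hist) - 1)
             if hist[i] > hist[i - 1] and hist[i] >= hist[i + 1]]
    peaks.sort(key=lambda i: hist[i], reverse=True)
    first = peaks[0] if len(peaks) > 0 else None
    second = peaks[1] if len(peaks) > 1 else None
    return {
        "A1": hist[first] if first is not None else 0.0,
        "A2": hist[second] if second is not None else 0.0,
        "P1": axis[first] if first is not None else None,
        "P2": axis[second] if second is not None else None,
    }

# Hypothetical six-bin beat histogram with its axis values (illustration only).
hist = [0.0, 0.2, 0.1, 0.6, 0.1, 0.0]
axis = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

dist = distribution_features(hist)
peak = peak_features(hist, axis)
print(dist)
print(peak)
```

The `axis` argument can hold beat frequencies or periods, depending on how the histogram bins are labelled.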

Fig. 1 gives an overview of two novelty functions (F0 and spectral flux) and their corresponding beat histograms for the utterance "Gestern war ich in einem Selbsterfahrungskurs. Ich bin mir nicht wirklich sicher, ob es mir gefallen hat" (from the German subset of the MULTEXT corpus, signal 11). The two novelty functions show different periodicities and therefore carry valuable information on multiple levels of speech rhythm. More specifically, the fundamental frequency follows the prosody of the given utterance, whereas the spectral flux measure is expected to track general changes of spectrum in the signal, i.e. phoneme or stress changes.

The beat histogram computation follows the computation in [29, 32]. All spectral features are calculated through a Short-Time Fourier Transform (STFT), whereas the fundamental frequency and the RMS measure are calculated on the basis of the time-domain signal, with the same temporal resolution parameters. The complete procedure for the generation of a feature vector representing each utterance includes the following steps. The audio signal is downmixed to mono, resampled to 22.5 kHz, DC-freed and normalized. Afterwards, the signal is separated into texture windows with a length of 3 s and 50% overlap, on which the beat histograms are extracted. The STFT is performed with a frame length of 46.4 ms, a Hann window and an overlap of 75%; the same parameters are used for the time-domain features. The novelty function is computed through the calculation of the temporal trajectory of the features and half-wave rectification. The beat histogram is extracted through an Autocorrelation Function (ACF) for each texture window, retaining the range between 0.5 Hz and 10 Hz as representative of the relevant periodicities in speech [25]. Finally, the beat histograms extracted from all 3 s texture windows of an utterance are averaged. For each beat histogram, two categories of features can be extracted (Table 1). Similar features on beat histograms have been used in [29, 30, 31], providing valuable statistical information on the temporal features of each language, similar to the work on LID in [15]. In total, 5 novelty functions are used for the production of as many beat histograms, from each of which 19 subfeatures are extracted, producing 95 features in total.
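The texture windowing, autocorrelation and averaging steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: half-wave rectification is applied directly to the novelty trajectory, the ACF is normalized by its zero-lag value, and `frame_rate` (the sampling rate of the novelty function) is set to a hypothetical 100 Hz. All function names are made up for the example.

```python
import math

def half_wave_rectify(values):
    """Keep positive values, set negative ones to zero."""
    return [v if v > 0 else 0.0 for v in values]

def texture_windows(novelty, frame_rate, length_s=3.0, overlap=0.5):
    """Split a novelty function into overlapping texture windows."""
    size = int(round(length_s * frame_rate))
    hop = max(1, int(round(size * (1.0 - overlap))))
    return [novelty[i:i + size] for i in range(0, len(novelty) - size + 1, hop)]

def beat_histogram(window, frame_rate, f_min=0.5, f_max=10.0):
    """ACF of one window, restricted to lags that correspond to f_min..f_max.

    Returns (frequencies_hz, strengths), normalized by the zero-lag value.
    """
    x = half_wave_rectify(window)
    energy = sum(v * v for v in x) or 1.0
    lag_lo = max(1, int(math.ceil(frame_rate / f_max)))
    lag_hi = min(len(x) - 1, int(frame_rate / f_min))
    freqs, strengths = [], []
    for lag in range(lag_lo, lag_hi + 1):
        acf = sum(x[i] * x[i + lag] for i in range(len(x) - lag))
        freqs.append(frame_rate / lag)
        strengths.append(acf / energy)
    return freqs, strengths

def average_histograms(histograms):
    """Bin-wise mean over the beat histograms of all texture windows."""
    return [sum(col) / len(col) for col in zip(*histograms)]

# Synthetic novelty function: one pulse every 0.25 s over 6 s (hypothetical input).
frame_rate = 100.0
novelty = [1.0 if i % 25 == 0 else 0.0 for i in range(600)]

hists = [beat_histogram(w, frame_rate) for w in texture_windows(novelty, frame_rate)]
freqs = hists[0][0]
mean_hist = average_histograms([strengths for _, strengths in hists])
strongest = freqs[mean_hist.index(max(mean_hist))]
print(len(hists), strongest)
```

For the synthetic pulse train the strongest bin of the averaged histogram lies at the pulse rate; real novelty functions produce much less clean histograms, which is why the statistical subfeatures of Table 1 are extracted rather than a single peak.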

3. Experimental Setup and Evaluation

For evaluation, a set of non-rhythmic features was also extracted by calculating their values over all texture windows of a speech file (keeping the average value inside an analysis window). Those non-rhythmic features serve as a baseline for the comparison, since acoustic feature-based LID approaches are among those providing very high performance [36, 37]. Acoustic features such as Mel Frequency Cepstral Coefficients (MFCCs) and Shifted Delta Cepstral (SDC) features have been used widely for non-rhythmic LID with good results [38, 39]. However, for the sake of comparability with the novel rhythm features, we used a baseline set which comprises all five novelty functions which were also utilized for the calculation of the beat histograms. From their temporal trajectories, the distribution features listed in Table 1 were extracted. In total, the baseline feature set comprises 5 novelty functions times 11 subfeatures = 55 features. For supervised classification, the Support Vector Machine (SVM) [40] algorithm under MATLAB with a Radial Basis Function (RBF) kernel in a multiclass setting was used. A grid search procedure (i.e. a search for the optimal parameter values) was applied to determine the hyperparameters for this kernel (C, γ). All experiments took place with a 10-fold cross-validation, with results averaged over the folds. Z-score standardization was conducted prior to classification, separately for the train and test set. Classification performance was evaluated through accuracy, defined as the proportion of correctly classified samples among all samples.
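The evaluation protocol described above (10-fold cross-validation, z-score standardization, accuracy) can be sketched in a few lines. Note the substitutions: the classifier below is a simple nearest-centroid stand-in, not the RBF-kernel SVM used in the study, and the grid search over (C, γ) is omitted. Function names and the toy data are assumptions for illustration.

```python
import random
from statistics import mean, pstdev

def zscore(rows):
    """Standardize each feature column to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def accuracy(y_true, y_pred):
    """Proportion of correctly classified samples."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def nearest_centroid(train_x, train_y, test_x):
    """Stand-in classifier (NOT an SVM): assign the class with the closest mean."""
    centroids = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [mean(c) for c in zip(*members)]
    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [min(centroids, key=lambda lab: dist(x, centroids[lab])) for x in test_x]

def cross_validate(x, y, classify=nearest_centroid, k=10):
    """Mean accuracy over k folds; train and test folds are standardized separately."""
    scores = []
    for fold in k_fold_indices(len(x), k):
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        train_x = zscore([x[i] for i in train])
        test_x = zscore([x[i] for i in fold])
        pred = classify(train_x, [y[i] for i in train], test_x)
        scores.append(accuracy([y[i] for i in fold], pred))
    return mean(scores)

# Toy two-class data with two features (hypothetical values).
x = [[0.1 * i, 0.1 * i] for i in range(20)] + [[10 + 0.1 * i, 10 + 0.1 * i] for i in range(20)]
y = ["a"] * 20 + ["b"] * 20
score = cross_validate(x, y)
print(score)
```

Standardizing the test fold with its own statistics follows the wording of the text; a common alternative is to reuse the mean and standard deviation estimated on the training fold.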

As speech material, two established multilingual speech corpora for automatic LID were used: the MULTEXT PD [41] and the OGI-MLTS corpus [42]. The first is a corpus of read, high-quality speech which contains five Indo-European languages which are assumed to belong to the two basic rhythm groups (English and German to the stress-timed, French, Italian and Spanish to the syllable-timed), making it useful for testing the rhythm class hypothesis with the proposed novel features. The corpus comprises 10 speakers per language (5 male and 5 female, with an average of 15 passages per speaker) and an average length of 20 s for each utterance. The OGI-MLTS corpus contains spontaneous, telephone-quality speech from eleven languages (featuring, apart from Indo-European languages, also tonal languages such as Mandarin Chinese, or even others, such as Hindi or Vietnamese), multiple speakers per language (male and female) and an average length of 45 s for each utterance. For the experiments in this paper, we retained only the four languages which it has in common with the MULTEXT PD corpus and which can be used for rhythm typology research. The two selected datasets represent two cases of speech material with very different properties.

In order to identify the best performing descriptors and novelty functions, we conducted feature selection following two approaches. First, we applied a filter method (Mutual Information with Target Data [43], using the maximum relevance CMIM metric [44] from the MI-Toolbox [45]). From the feature ranking, we retained the N best features which gave comparable accuracy to the full rhythmic feature set. Second, we evaluated each of the five novelty functions separately, by retaining only the 19 subfeatures resulting from the corresponding beat histogram.
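The idea behind filter-based ranking can be illustrated with a plain maximum-relevance criterion: each feature is scored by its mutual information with the class labels. This sketch is not the CMIM criterion and does not use the MI-Toolbox; the equal-width discretization, the function names, the toy data and the assumed grouping of the 95 features into consecutive blocks of 19 are all assumptions for illustration.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information (in bits) between a discrete feature and the class labels."""
    n = len(labels)
    p_x, p_y = Counter(feature), Counter(labels)
    p_xy = Counter(zip(feature, labels))
    return sum((c / n) * math.log2(c * n / (p_x[x] * p_y[y]))
               for (x, y), c in p_xy.items())

def discretize(values, bins=4):
    """Equal-width binning of a continuous feature into integer bin indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0
    return [min(int((v - lo) / width), bins - 1) for v in values]

def rank_features(columns, labels, bins=4):
    """Feature indices sorted by decreasing mutual information with the labels."""
    scores = [mutual_information(discretize(col, bins), labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)

def novelty_group(feature_vector, group_index, group_size=19):
    """Subfeatures of one novelty function, assuming consecutive blocks of 19."""
    start = group_index * group_size
    return feature_vector[start:start + group_size]

# Toy example (hypothetical values): one informative and one uninformative feature.
labels = ["EN", "EN", "FR", "FR"]
informative = [0.1, 0.2, 0.9, 1.0]
uninformative = [0.1, 0.9, 0.1, 0.9]

ranking = rank_features([uninformative, informative], labels)
print(ranking)
print(len(novelty_group(list(range(95)), group_index=2)))
```

A criterion such as CMIM additionally conditions on the features already selected, so that redundant features are penalized; the ranking above ignores redundancy.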

[Figure 2 (plot not reproduced). Bar chart "Classification Results": accuracy (%) per speech corpus (MULTEXT PD, OGI-MLTS) for the rhythm feature set and the baseline feature set.]

Figure 2: Classification results, both datasets and feature sets

True/Predicted     EN     FR     GE     IT     SP
EN                110     22      5      5      8
FR                  4     76      5      9      6
GE                 15     22    121     23     19
IT                  3     21      5    114      7
SP                  7     25      5      6    107
Acc. (%)         73.3     76   60.5     76   71.3
Prior (%)          20   13.3   26.7     20     20

Table 2: Confusion matrix for MULTEXT PD corpus languages, rhythmic features, average accuracy: 70.4%.
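The accuracy and prior rows of Table 2 follow directly from the confusion counts when each row is read as one true language. The short check below recomputes them; the variable names are chosen for illustration only.

```python
languages = ["EN", "FR", "GE", "IT", "SP"]

# Confusion counts from Table 2: rows are true languages, columns are predictions.
confusion = [
    [110, 22, 5, 5, 8],
    [4, 76, 5, 9, 6],
    [15, 22, 121, 23, 19],
    [3, 21, 5, 114, 7],
    [7, 25, 5, 6, 107],
]

row_totals = [sum(row) for row in confusion]
n_samples = sum(row_totals)

# Per-language accuracy: correctly classified utterances over all utterances of that language.
accuracy = {lang: 100.0 * confusion[i][i] / row_totals[i]
            for i, lang in enumerate(languages)}
# Prior: share of each language in the dataset.
prior = {lang: 100.0 * row_totals[i] / n_samples
         for i, lang in enumerate(languages)}
# Overall accuracy: sum of the diagonal over all utterances.
overall = 100.0 * sum(confusion[i][i] for i in range(len(languages))) / n_samples
# Utterances of other languages that were classified as French (column FR, off-diagonal).
misclassified_as_fr = sum(row[1] for i, row in enumerate(confusion) if i != 1)

for lang in languages:
    print(lang, round(accuracy[lang], 1), round(prior[lang], 1))
print(round(overall, 1), misclassified_as_fr, n_samples - sum(confusion[i][i] for i in range(5)))
```

The last line shows that 90 of the 222 misclassified utterances were assigned to French, which quantifies the tendency toward French noted in the results.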

True/Predicted     EN     FR     GE     SP
EN                 58     56     33     39
FR                 38     76     27     45
GE                 44     55     39     48
SP                 44     56     27     49
Acc. (%)         31.2   40.9   21.0   26.3
Prior (%)          25     25     25     25

Table 3: Confusion matrix for OGI-MLTS corpus languages, rhythmic features, average accuracy: 31.2%.

Rank    MULTEXT PD    OGI-MLTS
1       FL.SFL        SP.HPS
2       GM.SF         P3.HPS
3       A2.SF         A2.HPS

Table 4: Best features after filter feature selection. The abbreviation left of the dot denotes the subfeature, the one right of it the novelty function.

Rhythmic feature subset    MULTEXT PD    OGI-MLTS
All features                     70.4        31.2
RMS Amplitude                    67.5        25.4
Fundamental Pitch                70.4        27.4
Spectral Flux                    67.5        25.5
Spectral Flatness                66.8        24.7
Spectral Centroid                64.9        24.9

Table 5: Group feature selection, results given in percentages.

4. Results

Results of the classification procedure are presented in Fig. 2. For the MULTEXT PD corpus, the rhythmic feature set (70.4%) performs better than the baseline set (54.3%). For the OGI-MLTS dataset, results show the opposite tendency: the baseline set, with an accuracy of 37.5%, outperforms the rhythmic set (31.2%). With regard to the corpora, a great difference in accuracies can be observed: whereas the MULTEXT PD corpus shows a satisfactory performance which lies well above the average prior (20%), the OGI-MLTS corpus accuracy stays at relatively low levels (which are, however, comparable to those in other rhythm modeling LID studies [22]). In the case of the MULTEXT PD corpus, high accuracy can be achieved for all languages. In the case of the OGI-MLTS corpus, only French shows better performance, whereas the accuracy for the other languages stays close to the prior (25%).

The confusion matrices for both cases are given in Tables 2 and 3. In the case of the MULTEXT PD corpus, the rhythm class hypothesis is confirmed only partly: the hypothesized stress-timed languages English and German are not confused with each other more than with others outside this group. In the syllable-timed group, Italian and Spanish are confused with French, but not with each other. However, a tendency of misclassification toward French can be observed for all languages. For the OGI-MLTS corpus, specific misclassifications between languages in the same hypothesized rhythm class, such as English-German or French-Spanish, cannot be observed either.

Finally, concerning feature selection for the rhythmic feature set, results show that the same accuracy can be achieved with the first 19 (MULTEXT PD) or 21 (OGI-MLTS) features of the CMIM ranking. Table 4 lists the best features for both datasets. Among the novelty functions, those based on spectral flux, spectral flatness and pitch most commonly rank among the best ones. Selection based on novelty feature groups (Table 5) shows that all novelty functions are almost equally important for accuracy, a result which holds for both corpora. In both cases, the F0 feature subgroup seems to perform marginally better than the others.

5. Discussion

The results presented in Section 4 suggest that the application of the beat histogram features for automatic LID is indeed valuable, since it provides performance comparable to that of other rhythm-based LID approaches [18, 19, 21, 22], although the latest i-vector-based methods provide even higher results [46, 47]. The differences observed between the rhythmic and baseline feature sets are telling with respect to the robustness and quality of the proposed features. In the case of the MULTEXT PD corpus, which is a prosodic database, rhythmic features seem better suited to capture differences between languages than more general acoustic features. On the other hand, for the more generic OGI-MLTS corpus, non-rhythmic features perform better, showing that in that case rhythm features are not informative enough. Other reasons which could explain the difference in performance between the two datasets are the signal quality, which in the case of the OGI-MLTS corpus might significantly impair the extraction of rhythm features or of features in general, and the difference in speech elicitation method: spontaneous speech not only makes the extraction of robust features much more difficult, but also does not allow rhythmic features to achieve acceptable performance. Those observations help determine the scope of use of the suggested rhythmic features: they could be more suitable for read speech with good signal quality, and their robustness could be further improved.

With regard to the best features, the fact that novelty functions of pitch and spectral change produce the most salient beat histograms hints at their suitability for speech rhythm analysis. It is interesting that features such as P1 and P2 (showing periodicities of prominent beats in speech) are not among the best ones. This suggests that either speech periodicities cannot help differentiate between languages (as they could be noisy because of variability due to other factors) or that they cannot be reliably extracted from the beat histograms through the subfeatures presented here.

Concerning language typology on the basis of the beat histogram features, the rhythm class hypothesis does not seem to be corroborated in its pure form by our results on the MULTEXT PD corpus. On the one hand, languages supposed to be rhythmically close to each other, such as English and German, are not confused with each other more than with languages from different supposed rhythm classes. On the other hand, Spanish and Italian are confused more with French than with English or German (which would hint at a rhythmic similarity in this group). However, this can be an artifact of the specific dataset, since French seems to act as an attractor for all other languages, hinting that its rhythmic features are somehow representative of other languages as well. In the case of the OGI-MLTS corpus, results also do not confirm the rhythm class hypothesis directly. Those results can indicate that the novel features in their present form are better suited for specific languages. However, they might also be the consequence of our features not capturing speech rhythm in the same form as the rhythm class hypothesis first posited. More experiments are needed in order to determine whether those results depend on the dataset or on the feature extraction and classification methods.

6. Conclusions

The presented beat histogram features are shown to be good descriptors of speech rhythm, since they provide good accuracy in an automatic LID task. Furthermore, the features achieved accuracies comparable to those of other speech rhythm feature approaches [21, 22] for the same datasets, further attesting to the merit of the method. One advantage of the presented rhythm description scheme is that it does not require any preprocessing such as syllable annotation or even automatic segmentation, which is time-consuming or could potentially introduce erroneous assumptions. Furthermore, the method allows the automatic processing of larger datasets and provides a novel perspective on the description of speech rhythm through solely signal-based measures. However, more experiments with larger corpora (such as GLOBALPHONE [48]), different extraction parameters (to test, e.g., for effects of the texture window size) and other classification methods (such as artificial neural networks, as well as unsupervised methods) will be conducted, so as to check for result consistency and improve robustness. Furthermore, the relationship between the features and more abstract speech elements is not entirely clear, prompting future research to establish concrete connections. Future work on feature selection will attempt to find out which novelty functions and features are the most informative across many datasets and experimental setups, in order to compare the results with those from phonetics or human speech rhythm perception research. Concerning language typology, the presented rhythm-based LID does not seem to corroborate the rhythm class hypothesis in its pure form, but gives an incentive to attempt to reformulate the hypothesis so as to account for the empirical evidence.

7. References

[1] K. L. Pike, The Intonation of American English. Ann Arbor:

University of Michigan Press, 1945.

[2] D. Abercrombie, Elements of general phonetics. Edinburgh Uni-

versity Press Edinburgh, 1967, vol. 203.

[3] R. M. Dauer, “Stress-timing and syllable-timing reanalyzed,”

Journal of phonetics, 1983.

[4] F. Ramus, M. Nespor, and J. Mehler, “Correlates of linguistic

rhythm in the speech signal,” Cognition, vol. 73, no. 3, pp. 265–

292, 1999.

[5] E. Grabe and E. L. Low, “Durational variability in speech and the rhythm class hypothesis,” Papers in laboratory phonology, vol. 7, pp. 515–546, 2002.

[6] V. Dellwo, A. Fourcin, and E. Abberton, “Rhythmical classifica-

tion of languages based on voice parameters,” in ICPhS ’07, 2007,

pp. 1129–1132.

[7] P. Wagner, “The rhythm of language and speech: Constraining

factors, models, metrics and applications,” Germany: Habilita-

tionsschrift, University of Bonn, 2008.

[8] L. Wiget, L. White, B. Schuppler, I. Grenon, O. Rauch, and S. L.

Mattys, “How stable are acoustic metrics of contrastive speech

rhythm?” The Journal of the Acoustical Society of America, vol.

127, no. 3, pp. 1559–1569, 2010.

[9] A. Arvaniti, “The usefulness of metrics in the quantification of

speech rhythm,” Journal of Phonetics, vol. 40, no. 3, pp. 351–373,

2012.

[10] A. Turk and S. Shattuck-Hufnagel, “What is speech rhythm? A commentary on Arvaniti and Rodriquez, Krivokapić, and Goswami and Leong,” Laboratory Phonology, vol. 4, no. 1, pp. 93–118, 2013.

[11] P. Roach, “On the distinction between stress-timed and syllable-

timed languages,” Linguistic controversies, pp. 73–79, 1982.

[12] W. J. Barry, B. Andreeva, M. Russo, S. Dimitrova, T. Kostadinova

et al., “Do rhythm measures tell us anything about language type,”

in ICPhS ’03, 2003, pp. 2693–2696.

[13] A. Arvaniti, “Rhythm, timing and the timing of rhythm,” Phonet-

ica, vol. 66, no. 1-2, pp. 46–63, 2009.

[14] S. Tilsen and K. Johnson, “Low-frequency fourier analysis of

speech rhythm,” The Journal of the Acoustical Society of Amer-

ica, vol. 124, no. 2, pp. EL34–EL39, 2008.

[15] S. Tilsen and A. Arvaniti, “Speech rhythm analysis with decom-

position of the amplitude envelope: characterizing rhythmic pat-

terns within and across languages,” The Journal of the Acoustical

Society of America, vol. 134, no. 1, pp. 628–639, 2013.

[16] F. Cummins, F. Gers, J. Schmidhuber, and C. Elvezia, “Automatic

discrimination among languages based on prosody alone,” Speech

Communication, 1999.

[17] A. Thymé-Gobbel and S. E. Hutchins, “Prosodic features in automatic language identification reflect language typology,” in ICPhS ’99, 1999.

[18] J. Farinas, F. Pellegrino, J.-L. Rouas, and R. André-Obrecht, “Merging segmental and rhythmic features for automatic language identification,” in ICASSP ’02, vol. 1, 2002, pp. I–753.

[19] J.-L. Rouas, J. Farinas, F. Pellegrino, and R. André-Obrecht, “Modeling prosody for language identification on read and spontaneous speech,” in ICASSP ’03, vol. 6, 2003, pp. I–40.

[20] J.-L. Rouas, J. Farinas, and F. Pellegrino, “Automatic modelling

of rhythm and intonation for language identification,” in ICPhS

’03, 2003, pp. 567–570.

[21] J.-L. Rouas, J. Farinas, F. Pellegrino, and R. André-Obrecht, “Rhythmic unit extraction and modelling for automatic language identification,” Speech Communication, vol. 47, no. 4, pp. 436–456, 2005.

[22] J.-L. Rouas, “Automatic prosodic variations modeling for lan-

guage and dialect discrimination,” IEEE Transactions on Audio,

Speech, and Language Processing, vol. 15, no. 6, pp. 1904–1911,

2007.

[23] A. Lerch, An introduction to audio content analysis: Applications

in signal processing and music informatics. Wiley & Sons, 2012.

[24] N. Scaringella, G. Zoia, and D. Mlynek, “Automatic genre clas-

sification of music content: a survey,” IEEE Signal Processing

Magazine, vol. 23, no. 2, pp. 133–141, March 2006.

[25] A. D. Patel, Music, language, and the brain. Oxford university

press, 2008.

[26] S. Hübler and R. Hoffmann, “Comparing the rhythmical characteristics of speech and music–theoretical and practical issues,” in Toward Autonomous, Adaptive, and Context-Aware Multimodal Interfaces. Theoretical and Practical Issues. Springer, 2011, pp. 376–386.

[27] F. Gouyon and S. Dixon, “A review of automatic rhythm description systems,” Computer music journal, vol. 29, no. 1, pp. 34–54, 2005.

[28] E. D. Scheirer, “Tempo and beat analysis of acoustic musical sig-

nals,” The Journal of the Acoustical Society of America, vol. 103,

no. 1, pp. 588–601, 1998.

[29] G. Tzanetakis and P. Cook, “Musical genre classification of au-

dio signals,” IEEE transactions on Speech and Audio Processing,

vol. 10, no. 5, pp. 293–302, 2002.

[30] J. J. Burred and A. Lerch, “A hierarchical approach to automatic

musical genre classification,” in DAFX ’03, 2003.

[31] F. Gouyon, S. Dixon, E. Pampalk, and G. Widmer, “Evaluating

rhythmic descriptors for musical genre classification,” in AES ’04,

2004, pp. 196–204.

[32] A. Lykartsis, “Evaluation of accent-based rhythmic descriptors for genre classification of musical signals,” Master’s thesis, Audio Communication Group, Technische Universität Berlin, 2014.

[33] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and

M. B. Sandler, “A tutorial on onset detection in music signals,”

IEEE Transactions on Speech and Audio Processing, vol. 13,

no. 5, pp. 1035–1047, 2005.

[34] F. Lerdahl and R. S. Jackendoff, A generative theory of tonal mu-

sic. MIT press, 1983.

[35] J. London, Hearing in time. Oxford University Press, 2012.

[36] Y. K. Muthusamy, E. Barnard, and R. A. Cole, “Reviewing au-

tomatic language identification,” Signal Processing Magazine,

IEEE, vol. 11, no. 4, pp. 33–41, 1994.

[37] M. A. Zissman and K. M. Berkling, “Automatic language iden-

tification,” Speech Communication, vol. 35, no. 1, pp. 115–124,

2001.

[38] E. Singer, P. A. Torres-Carrasquillo, T. P. Gleason, W. M. Camp-

bell, and D. A. Reynolds, “Acoustic, phonetic, and discrimina-

tive approaches to automatic language identification.” in INTER-

SPEECH, 2003.

[39] W. M. Campbell, E. Singer, P. A. Torres-Carrasquillo, and D. A.

Reynolds, “Language recognition with support vector machines,”

in ODYSSEY04, 2004.

[40] V. Vapnik, The nature of statistical learning theory. Springer,

2000.

[41] E. Campione and J. Véronis, “A multilingual prosodic database.” in ICSLP, vol. 98, 1998, pp. 3163–3166.

[42] Y. K. Muthusamy, R. A. Cole, B. T. Oshika, L. D. Consortium

et al., “The ogi multi-language telephone speech corpus,” in IC-

SLP, vol. 92, 1992, pp. 895–898.

[43] I. Guyon and A. Elisseeff, “An introduction to variable and feature

selection,” The Journal of Machine Learning Research, vol. 3, pp.

1157–1182, 2003.

[44] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual

information criteria of max-dependency, max-relevance, and min-

redundancy,” IEEE Transactions on Pattern Analysis and Ma-

chine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[45] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, “Conditional likelihood maximisation: a unifying framework for information theoretic feature selection,” The Journal of Machine Learning Research, vol. 13, pp. 27–66, 2012.

[46] H. Li, B. Ma, and C.-H. Lee, “A vector space modeling approach

to spoken language identification,” Audio, Speech, and Language

Processing, IEEE Transactions on, vol. 15, no. 1, pp. 271–284,

2007.

[47] M. Li and W. Liu, “Speaker verification and spoken language

identification using a generalized i-vector framework with pho-

netic tokenizations and tandem features,” submitted to INTER-

SPEECH, 2014.

[48] T. Schultz, “Globalphone: a multilingual speech and text database

developed at karlsruhe university.” in INTERSPEECH, 2002.