Document [original]

Stationary Subspace Analysis

Towardsunderstandingnon-stationarydata

Paul von Bünau

Stationary Subspace Analysis

Towardsunderstandingnon-stationarydata

vorgelegt von Paul von Bünau

Von der Fakultät IV — Elektrotechnik und Informatik

der Technischen Universität Berlin

zur Erlangung des akademischen Grades

Doktor der Naturwissenschaen (Dr. rer. nat.)

genehmigte Dissertation

Promotionsausschuss

Vorsitzender: Prof. Dr. Olaf Hellwich (Technische Universität Berlin)

1. Gutachter: Prof. Dr. Klaus-Robert Müller (Technische Universität Berlin)

2. Gutachter: Prof. Dr. Gilles Blanchard (Universität Potsdam)

3. Gutachter: Prof. Dr. Benjamin Blankertz (Technische Universität Berlin)

Tag der wissenschalichen Aussprache: 12. September 2012

Berlin, 2012

D 83

Acknowledgements

Translated from the German, “Ph.D. advisor” becomes “doctor father”. is is quite

literally what Klaus-Robert Müller was to me. For the most part, it was his inspira-

tional enthusiasm that drew me into machine learning in the ﬁrst place, and made

me return aer a two-year excursion. His kind encouragement and indefatigable

optimism helped me not to despair during the dry periods of research, when ev-

erything seems pointless. But above all, the inimitable Müller determination not to

take anything too seriously, and see the human side of everything, is what makes his

research group such an enjoyable place. ank you.

is thesis is the result of a collaboration with Frank C. Meinecke and Franz J.

Kiraly. It has been a privilege, a pleasure, and an incredible learning experience for

which I am deeply grateful.

My colleagues have been an invaluable source of scientiﬁc advice, camaraderie

and amusement, both in and out of the oﬃce, and on our various joint conference

trips. In particular, I would like to mention Mikio Braun, Stefan Haufe and Felix

Bießmann.

I am indebted to Jan Saputra Müller and Duncan A. J. Blythe for the excellent

joint work during the last years. Your support has been important in many ways.

My research would not have been possible without the tireless eﬀorts of Andrea

Gerdes, Dominik Kühne and Imke Weitkamp. Even though I can only fathom what

most of it entails, I am well aware that it matters hugely.

eEEG datasets Iused inthis thesiswererecordedby Claudia Sannelli, orsten

Dickhaus, Sven Dähne and Johannes Höhne. is is hard work; your attention to

detail is very much appreciated.

I am sincerely grateful for the frequent detours from academia with Sebastian

Mika and Aldo Benini; to industry, and Bangladesh, respectively. Both have sig-

niﬁcantly delayed this thesis, but also provided the change in scenery on which I

thrive.

Over the years, I have had the good fortune of meeting inspiring teachers who

generously shared their knowledge with me. While this list is inevitably incomplete,

it certainly includes Lothar Budach, Martyn Quick, Albrecht Klemm, Georg Her-

rmann, and from the ﬁrst, my parents.

But what matters most is that I have found you, Louise.

Financial support

e bulk of my research was supported by a teaching position at TU Berlin, funded

by the federal state of Berlin. In addition, I have received substantial travel grants

from the EU Network of Excellence PASCAL and the German Academic Exchange

Foundation (DAAD). In  and , I was kindly invited to the Research-in-

Pairs(RiP)programmeofthe MathematischesForschungsinstitutOberwolfach(MFO),

core-funded by the Leibniz Gemeinscha and the state of Baden-Württemberg.

I am sincerely grateful to the European tax payer who provided this money, and

I appreciate the trust that society puts in the academic community.

Abstract

is thesis is about statistical methods for understanding change in the joint dis-

tribution of observed multivariate data over time. e setting we consider is com-

pletely explorative or unsupervised: no auxiliary information regarding the distribu-

tion changes is available.

We propose the ﬁrst unsupervised method, stationary subspace analysis (SSA),

for ﬁnding a linear coordinate transformation which factorises the observed data into

stationary and non-stationary components. is is essential because the relevant

changescanoccurin the dependencies between variables, which meansthatthe input

variables can be totally uninformative: in fact, the non-stationary and the stationary

part can remain completely invisible in the observations. In practice, this is oen the

case when one can only measure superpositions of the actual variables of interest.

For example, in EEG analysis, electrodes on the surface of the scalp record activity

from several neural sources located inside the brain. As we show in this thesis, in-

vestigating changing behaviour in brain sources crucially depends on the separation

of the stationary and non-stationary components in the recorded signals.

e second main contribution of this thesis is a novel approach to ﬁnding partic-

ular types of approximative solutions to systems of polynomial equations of arbitrary

degree based on techniques from computational algebraic geometry. Using the con-

cept of generic polynomials, we show how SSA can be formulated in this framework.

is leads to a new algorithm which has a unique solution and is more accurate in

certain cases. From a theoretical perspective, the most interesting feature of this ap-

proach is that it allows us to solve the problem algebraically instead of searching for

the solution guided by an objective function. As the assumptions underpinning the

algorithm are rather general, it may be directly applicable to other problems in ma-

chine learning whose solution can be formulated in terms of polynomial equations.

viii

Zusammenfassung

Das ema dieser Dissertation sind statistische Methoden für das Verständnis zeit-

licher Veränderung der gemeinsamen Verteilung multivariater Daten. Wir betrachten

densogenanntenexplorativen Fall, in demkeinerlei zusätzlichen Informationen, z.B.

über relevante Zeitpunkte oder einen kontrollierten Stimulus, verfügbar sind.

Der zentrale Beitrag dieser Arbeit ist die Entwicklung des erstenunüberwachten

Verfahrens, Stationary Subspace Analysis (SSA), welches eine linear Koordinaten-

transformation ﬁndet das die beobachteten Daten in eine Gruppe von stationären

und nicht-stationären Komponenten faktorisiert. Dies ist unerlässlich zur Analyse

multivariaterDaten, weildie wesentlichenÄnderungen dergemeinsamenVerteilung

die Abhängigkeiten zwischen den Variablen betreﬀen können. Daher erlaubt die

univariate Betrachtung der Eingangsgrössen nicht notwendigerweise Rückschlüsse

auf Änderungen der gemeinsamen Verteilung. Sowohl die nicht-stationären als

auch die stationären Komponenten können nämlich in den beobachten Variablen

vollständig unsichtbar bleiben. Dies ist vor allem dann der Fall, wenn die mess-

baren Grössen Überlagerungen der tatsächlich relevanten Variablen sind, welche

nicht direkt gemessen werden können. Ein gutes Beispiel liefert die EEG Datenanal-

yse: die Elektroden auf der Kopaut messen die Beiträge einer Vielzahl neuronaler

Quellen im Gehirn. Um die zeitliche Veränderung der Verteilung dieser Quellen

zu verstehen ist es daher notwendig, die stationären von den nicht-stationären Sig-

nalanteilen zu trennen. In einer Anwendung auf EEG Daten zeigen wir zum Einen,

dass SSA dies leistet und zum Anderen, dass die populären Koordinatentransforma-

tionen Principal Component Analysis (PCA) und Independent Component Anal-

ysis (ICA) dazu nicht in der Lage sind.

Der zweite wesentliche Beitrag dieser Arbeit ist ein neuartiger Ansatz zur ap-

proximativenLösungpolynomiellerGleichungssystemeinesbestimmtenTyps. Auf-

bauend auf dem Konzept der generischen Polynome zeigen wir, das SSA als ein

solches Problem formuliert werden kann. Dies führt zu einem neuen Algorithmus

dessen Lösung nicht nur eindeutig, sondern in Spezialfällen auch exakter ist. Von

einem theoretischen Standpunkt aus gesehen ist es besonders interessant, dass es

dieser Ansatz erlaubt das SSA Problem direkt algebraisch zu lösen, anstatt wie in

einem Optimierungsverfahren nach der Lösung zu suchen. Da die zugrunde liegen-

den Annahmen eher allgemein sind, lässt sich der Algorithmus möglicherweise di-

rekt auf andere Probleme im Maschinellen Lernen anwenden, die als Lösung poly-

nomieller Gleichungen formuliert werden können.

Contents

1 Introduction 1

. Whythisthesis? ............................ 

. Scope, main contributions and publications . . . . . . . . . . . . . 

. Related areas in machine learning . . . . . . . . . . . . . . . . . . . 

. A roadmap through this document . . . . . . . . . . . . . . . . . . 

2 Interesting directions in data 13

. Motivation ............................... 

. Preliminaries and notation . . . . . . . . . . . . . . . . . . . . . . . 

. Maximum variance and independence . . . . . . . . . . . . . . . . 

. Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . 

3 Stationary subspace analysis 27

. Model and identiﬁability . . . . . . . . . . . . . . . . . . . . . . . . 

. eSSAalgorithm........................... 

. Spurious stationarity . . . . . . . . . . . . . . . . . . . . . . . . . . 

. Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . 

4 An algebraic approach 51

. From moments to polynomials . . . . . . . . . . . . . . . . . . . . 

. e vector space of polynomials . . . . . . . . . . . . . . . . . . . . 

. An approximate algebraic algorithm . . . . . . . . . . . . . . . . . 

. Relationship to algebraic geometry . . . . . . . . . . . . . . . . . . 

. Summary and discussion . . . . . . . . . . . . . . . . . . . . . . . 

5 Simulations and applications 73

. Introduction .............................. 

. Simulations on synthetic data . . . . . . . . . . . . . . . . . . . . . 

. Application to EEG analysis . . . . . . . . . . . . . . . . . . . . . . 

. Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . 

6 Conclusion 99

xii CONTENTS

Chapter 1

Introduction

1.1 Why this thesis?

is thesis is about methods for the analysis of data. e starting point is a simple

question: given a set of multivariate samples observed at diﬀerent points in time, how

does the joint distribution change over time? e setting is completely unsupervised

orexplorative: apart from thedata, no further informationorpriorknowledge which

could be helpful in identifying the changes is available.

Figure .: A change in variance over time.

In such a situation, a common approach is to resort to a visual inspection of

the data. Although unwieldy for large data sets, plotting the time course of a sin-

gle variable should, in principle, provide us with a good indication of its change in

behaviour over time (see e.g. Figure .).

Signals

Signal 1Signal 2

Epoch histograms

Figure .: Does this joint distribution change over time?

However, this approach fails as soon as we are dealing with the joint distribution

of more than one variable. Let us consider the example shown in Figure .. In ad-

2CHAPTER 1. INTRODUCTION

dition to the time courses, this plot shows the distribution of the individual variables

as histograms in four segments. Apart from the ﬂuctuations due to ﬁnite samples,

this does not reveal any signiﬁcant changes in distribution.

Signals

Signal 2

Signal 1

Figure .: e change is conﬁned to the dependencies!

Figure . provides a diﬀerent view of the same data set. From the scatter plots,

we immediately see what has been missed: the dependencies between the two vari-

ables. e correlation between the variables changes completely over time, from

highly correlated at the beginning, to uncorrelated, to highly anticorrelated towards

the end. Looking at variables individually, i.e. the marginals of the joint distribution,

does not give us any information about the dependencies between them.

Figure .: More useful directions

In light of this ﬁnding, a more useful way of looking at this data set is given by

the red and green directions shown in Figure .. By projecting the data points onto

these axes, we obtain two new variables with time courses shown in the panel below.

e ﬁrst one corresponds to the green direction. As we can see in the scatter plots,

along this direction the change in the joint distribution is clearly visible: from large

variance in the ﬁrst epoch to small variance in the last epoch; this is what we call a

non-stationary direction. e second variable (red) is a stationary variable, because

1.1. WHY THIS THESIS? 3

its distribution stays constant¹. e projection onto these two directions achieves a

separation of the stationary and the non-stationary content of the joint distribution.

is example was benign in that it consisted of only two variables. For higher

dimensional data sets, analysing dependencies by looking at scatter plots between

all pairs of variables (over a potentially large number of epochs) is not merely cum-

bersome, but simply infeasible. Moreover, we have no labels or target variables to

guide our search. us the question posed at the outset turns out to be surpris-

ingly diﬃcult to answer. More importantly, this problem is not only non-trivial,

but also relevant in practice. Understanding temporal changes lies at the heart of

many scientiﬁc inquiries. Real data contains signiﬁcant amount of information in

its dependency structure, and is oen high-dimensional.

In fact, in many circumstances it is well known that what we measure is a super-

position of underlying latent variables which cannot be measured directly. is cre-

ates rich dependencies across dimensions. For example, in EEG analysis, electrodes

on the surface of the scalp record contributions from a multitude of sources located

inside the brain. Similarly, in the analysis of seismic activity, one is oen conﬁned

to surface measurements for investigating the activity inside the earth. us each

observed variable contains contributions from both the stationary, and the non-

stationary parts of the underlying data generating system.

Signals

Time

Observed time series

Time

Sources

Underlying sources

Sources

Time

Comparison of source power

Figure .: Non-stationary component masked by stronger stationary compo-

nent.

As the example in Figure . showed, this could mean that the relevant distribu-

tion changes are entirely invisible in the measurements, because they are conﬁned

to the dependencies. Another way in which relevant distribution changes can go

unnoticed is if they are masked by relatively stronger components, see Figure ..

e opposite eﬀect is also possible: a single non-stationary factor aﬀects all mea-

surements, giving the false impression that the data as a whole is non-stationary.

Discerning the stationary from the non-stationary components is, therefore, a key

step towards understanding multiviariate non-stationary data. How to do this is the

topic of this thesis.

¹A precise notion of stationarity will be introduced in Chapter .

4CHAPTER 1. INTRODUCTION

Our main contribution is a method for ﬁnding a linear coordinate transforma-

tion, stationary subspace analysis (SSA), which factorises the observed data into a

set of stationary, and a set of non-stationary, sources. In the example shown in Fig-

ure ., such a transformation would be given by projecting onto the green and red

axes respectively. e type of distribution changes which our method detects are

changes in the mean and the covariance matrix. Importantly, SSA does not require

labels to specify the timing of the distribution changes (or any other form of super-

vision); and it is not limited to time series but can also be applied to sets of data

collected at diﬀerent time points.

A factorisation into stationary and non-stationary components is useful in many

circumstances. As we have seen in the example, SSA is an essential tool for ex-

plorative analysis, because the observed variables can be arbitrarily ill suited for

the analysis of changes in the joint distribution. Moreover, SSA is useful for fea-

ture extraction [Guyon and Elisseeﬀ ()] when only the stationary or the non-

stationary part is informative for the subsequent application. For example, high-

dimensional change-point detection [Blythe et al. ()] can beneﬁt from the prior

removal of the uninformative stationary part. In some cases, SSA has been found to

be useful for domain adaptation, by restricting parameter estimation of a prediction

method to the stationary part [von Bünau et al. (), Hara et al. ()]. Finally,

SSA can extract meaningful components when the observations are (modeled as) lin-

ear superpositions of the relevant latent variables, and non-stationarity is a sensible

criterion for identifying the components of interest [Hara et al. ()]. In particu-

lar, SSA allows us to extract such components even when no prior knowledge about

the timing of the relevant activity is available, e.g. when they are not induced by an

experimental paradigm. is is the case e.g. in the analysis of spontaneous neu-

ral activity [Bießmann et al. (), Biswal et al. ()] or default mode networks

[Raichle et al. ()].

Stationary subspace analysis is not the ﬁrst method to exploit dependencies be-

tween variables to ﬁnd a coordinate system which is more useful in some sense.

However, none of the existing approaches are tailored to the understanding of dis-

tribution changes. As an illustration, let us compare the results of SSA with the

solutions found by principal component analysis (PCA) and independent compo-

nent analysis (ICA) on the data shown in Figure .. In Figure . we see that both

PCA and ICA produce uninformative results, albeit in diﬀerent ways. In the PCA

basis (top panel), the distribution changes remain completely invisible. ICA (middle

panel), on the other hand, gives us two components that are both non-stationary:

suddenly, the stationary components have become invisible. Only SSA achieves a

clear separation into stationary and non-stationary components. In principle, this

is to be expected. ICA maximises the statistical independence of the coordinates,

which is unrelated to discerning the two groups of components. Similarly, the PCA

solution is determined by the shape of the overall covariance matrix which is not

related to non-stationarity.

1.1. WHY THIS THESIS? 5

PCA solution

ICA solution

SSA solution

Figure .: PCA and ICA do not separate the stationary and the non-stationary

components.

e second main contribution of this thesis is a new framework for the ﬁnd-

ing approximate solutions (or a certain type) to systems of polynomial equation.

We demonstrate that SSA can be formulated in terms of computing the basis of an

algebraic variety given in terms of noisy polynomials estimated from data. Or, in

other words, we show how to approximate the set of solutions to noisy polynomial

equations by a linear space. What makes the SSA problem amenable to this type of

formulation is that the projections to the stationary coordinates can be described as

the set of linear projections which make the moments of the distribution equal in

the new coordinates. ese projections are therefore given by the solution to a set

of polynomials with coeﬃcients derived from moments estimated on the observed

data. For example, the equality of the mean translates into linear polynomials and

the equality of the covariance matrix is expressed by a quadratic polynomial. Not

only does the algebraic approach have a better performance (higher accuracy) in

certain scenarios, but this viewpoint also allows us to invoke powerful devices from

algebraic geometry for reasoning about learning problems and algorithms.

From a theoretical perspective, the interesting distinction of the algebraic ap-

proach is that it allows us to directly solve a problem algebraically, which hitherto

required an iterative search procedure using gradient-based optimisation. In some

sense, this is due to the fact that in the algebraic formulation, we have explicitly in-

corporated all available information about the structure of the problem.

6CHAPTER 1. INTRODUCTION

1.2 Scope, main contributions and publications

e main contributions of this thesis can be summarised as follows.

. Motivation and formulation of a new type of unsupervised learning task, sta-

tionary subspace analysis.

. Analysis of the generative model w.r.t. identiﬁability invariances and spurious

solutions.

. Derivation of a computationally eﬃcient SSA algorithm, based on a custom

optimisation procedure over the special orthogonal group.

. Development of an approximate algebraic algorithm for solving the SSA prob-

lem based on an algebraic-geometric viewpoint.

. Empirical evaluation of the SSA algorithm on synthetic data and in applica-

tions to EEG analysis.

1.2.1 Included publications

Parts of this thesis have been presented in the following publications. e results

presented in Chapter  are unpublished.

• von Bünau, P., Meinecke F. C., Király, F. J., and Müller K.R. Finding Stationary

Subspaces in Multivariate Time Series. Phys. Rev. Lett.  ():, .

• Müller, J. S., von Bünau, P., Meinecke F. C., Király, F. J., and Müller K.R. e

Stationary Subspace Analysis Toolbox. Journal of Machine Learning Research

:–, .

• Király, F. J., von Bünau, P., Meinecke F. C., Blythe, D. A. J., and Müller K.R.

Algebraic Geometric Comparison of Probability Distributions. Journal of Ma-

chine Learning Research :–, .

• von Bünau, P., Meinecke F. C., Scholler, S., and Müller K.R. Finding Stationary

Brain Sources in EEG Data. Proceedings of the nd Annual Conference of

the IEEE EMBS, Buenos Aires, .

• Király, F. J., von Bünau, P., Müller, J.S., Blythe, D. A. J., Meinecke F. C., and

Müller K.R. Regression for sets of polynomial equations. JMLR Workshop and

Conference Proc. (AISTATS ), Vol. , .

1.2.2 Related work not included

Since the publication of the SSA algorithm, the following applications and exten-

sions have been presented which are not part of this thesis. e author of this thesis

has been involved in some but not all of them.

1.3. RELATED AREAS IN MACHINE LEARNING 7

Application to domain adaptation e authors show that SSA can be used as a

pre-processing step to allow for the domain adaptation of classiﬁcation algorithms

in the context of WiFi localisation [Hara et al. ()].

Application to geophysical data analysis and an analytic algorithm In the si-

multaenous analysis of measurements relating to the dynamics of the earth’s mag-

netic ﬁeld, the relevant components are known to exhibit strong changes but can-

not be assumed to be independent. e authors show that SSA recovers mean-

ingful components where ICA fails. Moreover, a new SSA algorithm is presented

that has an analytic solution based on certain assumptions on the latent variables

[Hara et al. ()].

Application to computer vision e stationary components found by SSA can be

interpreted as temporally stable relationships between variables. e authors adopt

this viewpoint and demonstrate that it allows the extraction of invariances from im-

age data [Meinecke et al. ()].

Application to change-point detection In high-dimensional change-point de-

tection, only the non-stationary directions are informative while the rest contribute

to the number of dimensions by adding irrelevant information which is potentially

harmful to the performance. erefore, the authors show that SSA can be used as

a feature extraction method in extensive simulations and as an application to fault

detection [Blythe et al. ()].

An information-theoretical view of SSA e authors show that SSA can be un-

derstood from an information-theoretical point of view in terms of a projection in

the space of distributions; this leads to a new algorithm [Kawanabe et al. ()].

Group-Wise SSA and brain-computer-interfacing In the context of a classiﬁca-

tion setting, it is oen desirable to identify only those non-stationarities that are not

class-related. To this end, the authors propose a new algorithm called groupwise-

SSAanddemonstrateits eﬀectivenessinanapplicationtobrain-computer-interfacing

[Samek et al. ()].

1.3 Related areas in machine learning

With this thesis, we introduce a new family of unsupervised learning tasks. Even

though there are no algorithms with exactly the same objective, SSA is related to a

number of approaches by formal similarity, shared technical tools or a similar ap-

plication context, which we brieﬂy summarise below.

8CHAPTER 1. INTRODUCTION

Two-sample testing Do two samples come from the same distribution? A two-

sample test [Wald and Wolfowitz (), Smirnoﬀ ()] is a statistical hypothesis

test for deciding this problem �he result of such tests is a binary decision; no more

information about the type or extent of a diﬀerence in distribution can be gleaned

from the result. Two-sample testing may be applied in similar circumstances than

SSA, but addresses a fundamentally diﬀerent question.

At the heart of a two-sample test lies a parametric or nonparametric distance

measure between two distributions, and such a function also appears in the SSA ob-

jective. Infact, weshow thatthe SSAobjectivefunctionis equivalenttoacertaintype

of two-sample test. Future work may explore the possibility of deriving high-order

SSA algorithms by borrowing from non-parametric two-sample tests, e.g. following

[Gretton et al. ()].

Domain and covariate shift adaptation e goal of domain and covariate shi

adaptation is to improve the performance of prediction methods (classiﬁcation or

regression) in a scenario in which the test data follows a diﬀerent distribution than

the training data. One way of covariate shi adaptation is to weight the training

data [Shimodaira (), Sugiyama et al. (), Sugiyama et al. ()] or regu-

larise w.r.t. distribution changes [Schweikert et al. ()].

In contrast to SSA, the focus is on adapting a supervised learning to distribu-

tion changes between two sets of samples, and not on investigating the distribution

change itself. However, it has been shown [Hara et al. ()] that for some appli-

cations, SSA can be used for domain adaptation by conﬁning the learning to the

directions between training and testing data which are stationary. Note that this

need not be true in general: the stationary directions are not necessarily correlated

with the label; and a certain degree of non-stationarity along strongly informative

directions may not be harmful for prediction performance. How the possibly con-

ﬂicting goals of stationarity and informativeness can be combined in a principled

manner is an open question.

Moreover, SSA can be useful as a feature extraction method for density ratio es-

timation [Sugiyama ()], which is an essential step in the reweighting approach

to covariate shi adaptation: only the non-stationary directions are relevant for den-

sity ratio estimation, and it has been shown [Sugiyama et al. ()] that the prior

removal of the stationary directions can increase the accuracy of density ratio esti-

mation and hence covariate shi adaptation.

PCA, ICA and factor analysis e generative model of PCA, ICA and factor anal-

ysis formally resembles the SSA model: the observed data is a linear mixture of un-

derlying latent variables. is, however, is where the similarities end. ICA assumes

that sources are independent; SSA merely presupposes that there are two groups

of sources (stationary and non-stationary), which may have arbitrary dependence

1.4. A ROADMAP THROUGH THIS DOCUMENT 9

structures. Consequently, ICA and SSA algorithms have entirely diﬀerent goals:

recovering maximally independent sources vs. maximizing resp. minimizing a mul-

tivariate stationarity measure. Similarly, PCA and factor analysis ﬁnd orthogonal

directions minimizing the reconstruction error which is unrelated to stationarity

resp. non-stationarity. Also, whereas PCA, ICA and factor analysis compute a set of

one-dimensional components, SSA ﬁnds two subspaces.

However, SSA is related to the family of blind source separation algorithms by

shared technical tools (a pre-whitening step plus optimisation over the special or-

thogonal group) and similar terminology.

1.4 A roadmap through this document

Chapter 2 In this chapter, we introduce the the task of ﬁnding a coordinate trans-

form of the data space with certain useful properties. Mathematical concepts are

brieﬂy reviewed and the two most prominent methods, PCA and ICA, are discussed

in depth.

Chapter 3 Stationary subspace analysis is presented: aer outlining and analysing

the SSA model, we develop a computationally eﬃcient algorithm, and discuss the

problem of spurious stationarity and the relationship to stationarity testing.

Chapter 4 is chapter contains the second main contribution: we present an

approximate algebraic algorithm for solving the SSA task and analyse its computa-

tional complexity.

Chapter 5 In extensive simulations on synthetic data, we investigate the perfor-

mance of SSA relative to an artiﬁcial ground truth; in applications to EEG analysis

we show that SSA provides useful new insights.

Chapter 6 e thesis concludes with a summary of the main ﬁndings, and a dis-

cussion of open problems and directions for future work.

10 CHAPTER 1. INTRODUCTION

List of all publications

e following is a list of all publications co-authored by the author of this thesis from

 until .

Articles in journals

Hara, S., Kawahara, Y., Washio, T., von Bünau, P., Tokunaga, T. and Yumoto, K. Sep-

aration of stationary and non-stationary sources with a generalized eigenvalue prob-

lem Neural Networks :–, .

Kiraly, F. J., von Bünau, P., Meinecke, F. C., Blythe, D. A. J. and Müller, K. R. Alge-

braic Geometric Comparison of Probability Distributions Journal of Machine Learn-

ing Research :–, .

Blythe, D. A. J., von Bünau, P., Meinecke, F. C. and Müller, K. R. Feature Extraction

for Change-Point Detection using Stationary Subspace Analysis IEEE Transactions

on Neural Networks and Learning Systems  ():–, .

Sugiyama, M., Yamada, M., von Bünau, P., Suzuki, T., Kanamori, T. and Kawanabe,

M. Direct density-ratio estimation with dimensionality reduction via least-squares

hetero-distributional subspace search Neural Networks  ():-, .

Vidaurre, C., Kawanabe, M., von Bünau, P., Blankertz, B. and Müller, K. R. Toward

Unsupervised Adaptation of LDA for Brain-Computer Interfaces IEEE Transactions

on Biomedical Engineering  ():–, .

Müller, J., von Bünau, P., Meinecke, F. C., Király, F. J. and Müller, K. R. e Stationary

Subspace Analysis Toolbox Journal of Machine Learning Research :–,

.

Araujo, J., von Bünau, P., Mitchell, J. and Neunhoeﬀer, M. Computing automor-

phisms of semigroups Journal of Symbolic Computation  ():-, .

von Bünau, P., Meinecke, F. C., Király, F. J. and Müller, K. R. Finding Stationary

Subspaces in Multivariate Time Series Phys. Rev. Lett.  ():, .

Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P. and Kawan-

abe, M. Direct Importance Estimation for Covariate Shi Adaptation Annals of the

Institute of Statistical Mathematics  ():–, .

1.4. A ROADMAP THROUGH THIS DOCUMENT 11

Peer-reviewed contributions to conferences

Kiraly, F. J., von Bünau, P., Müller, J. S., Blythe, D. A. J., Meinecke, F. C. and Müller,

K. R. Regression for sets of polynomial equations JMLR Workshop and Conference

Proc. (AISTATS ) , .

Kawanabe, M., Samek, W., von Bünau, P. and Meinecke, F. C. An Information Ge-

ometrical View of Stationary Subspace Analysis Artiﬁcial Neural Networks and Ma-

chine Learning (ICANN ), LNCS :–, Springer, .

von Bünau, P., Meinecke, F. C., Scholler, S. and Müller, K. R. Finding Stationary

Brain Sources in EEG Data Proceedings of the nd Annual Conference of the IEEE

EMBS, .

Hara, S., Kawahara, Y., Washio, T. and von Bünau, P. Stationary subspace analysis as

a generalized eigenvalue problem Proceedings of the th international conference

on Neural information processing (ICONIP), .

Sugiyama, M., Hara, S., von Bünau, P., Suzuki, T., Kanamori, T. and Kawanabe, M.

Direct Density Ratio Estimation with Dimensionality Reduction Proceedings of 

International Conference on Data Mining (SDM), .

Meinecke, F. C., von Bünau, P., Kawanabe, M. and Müller, K. R. Learning invariances

with Stationary Subspace Analysis ICCV Computer Vision Workshops, .

von Bünau, P., Meinecke, F. C. and Müller, K. R. Stationary Subspace Analysis Pro-

ceedings of ICA Conference  (Paraty), LNCS :–, Springer, .

Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P. and Kawanabe, M. Direct

Importance Estimation with Model Selection and Its Application to Covariate Shi

Adaptation Advances in Neural Information Processing Systems  (NIPS), .

12 CHAPTER 1. INTRODUCTION

Chapter 2

Interesting directions in data

2.1 Motivation

In this chapter we explain why a basis transformation of the data space can be use-

ful, review mathematical fundamentals, and discuss two prominent unsupervised

coordinate transforms: principal component analysis and independent component

analysis.

Data comes from measurements. However, in many circumstances, the actual

variables of interest cannot be measured directly. For example, in the neurosciences,

one is oen limited to measuring non-invasively on the surface of the scalp in order

to study the activity of sources located inside the brain. In EEG recordings, for in-

stance, each electrode measures a superposition of activity from several underlying

sources. e input signals are therefore not directly useful for analysing brain activ-

ity, because we cannot discern the contributions from the diﬀerent cortical regions.

−0.5

0.5

Projection pursuit (F.-T. index) Principal component analysis Independent comp. analysis

Figure .: Directions found by exploratory projection pursuit (Friedman-Tukey

index), principal component analysis and independent component analysis.

However, on multivariate data one can oen combine the information of several

variables to construct more useful new variables by applying a coordinate transfor-

mation. In the previous example, a desirable transformation would separate the

contribution of the underlying brain sources into new variables. Interestingly, for

a diverse range of applications from the neurosciences to the analysis of web data,

14 CHAPTER 2. INTERESTING DIRECTIONS IN DATA

useful coordinate transformations can be found based on statistical criteria alone,

independent of the particular meaning of the observed variables or prior informa-

tion. is is the topic of this chapter.

Let us ﬁrst of all consider intuitive examples for the informativeness of axes in

data space. Figure . shows three datasets. In the le panel, the scatter plot clearly

shows two clusters; however, these are not visible by looking at the individual di-

mensions. Here, a projection onto the orange direction clearly reveals this structure

in the data. e variation in the data set shown in the middle is largely conﬁned

to the orange axis because the two variables are strongly correlated; coordinates on

this axis provide a more compact description of the data. In the right scatter plot,

the data points are distributed along two axes in a seemingly independent way, i.e.

the value of one coordinate has no inﬂuence on the distribution of the other. is

suggests that the data is generated as a linear mixture of two underlying (latent) in-

dependent factors.

Note thatthesepropertieswereapparentonlybecause we could look atthe whole

data in a scatter plot. For higher-dimensional data sets this is not possible. In

fact, each of the three examples correspond to a criterion of a family of algorithms:

projection pursuit using the Friedman-Tukey index [Friedman and Tukey ()]

ﬁnds directions with cluster structure (le panel); principal component analysis

[Pearson ()] ﬁnds directions that maximize the variance (middle panel), and in-

dependent component analysis [Hyvärinen et al. ()] ﬁnds independent sources

from a linear mixture. Incidentally, the latter approach is oen used to address the

problem sketched at the outset: under the assumption that EEG signals are gener-

ated as a linear mixture of independent brain sources, ICA allows us to recover the

neural components. Not only the new coordinates are of interest: both the elements

of the projection and the basis can oen be interpreted in terms of patterns if the sen-

sors are spatially distributed. In EEG analysis, the so-called scalp maps have become

an important tool for the interpretation of sources found by linear methods.

e remainder of this chapter is organised as follows. In the next section we

introduce some fundamental concepts from statistics and information theory. In

Section ., we review two prominent methods of ﬁnding interesting directions. Af-

ter tracing their historical development, we start with principal component analysis

followed by independent component analysis. ese are arguably the most widely

used coordinate transforms, and the ones we will compare against in the applica-

tion to EEG analysis (Chapter ). In the concluding section, we discuss the relative

merits of these algorithms.

2.2 Preliminaries and notation

In this section, we brieﬂy review fundamentals from probability theory, statistics

and information theory. is overview is largely based on [Mood et al. ()],

2.2. PRELIMINARIES AND NOTATION 15

[Cover et al. ()] and [Jaynes and Bretthorst ()].

Random variables, joint densities, independence

e building block of statistical modeling is the the concept of the random variable.

A random variable is a formal variable which represents a sampling or measurement

process, governed by a probability distribution. Random variables are a convenient

tool to formulate probabilistic models in the familiar language of mathematical ex-

pressions.

A random variable Xis associated with a probability space (Ω,A, P). We will

not give a formal deﬁnition; it suﬃces to say that a probability space speciﬁes the

Boolean algebra of random events A ⊂ P(Ω), which consists of subsets of elemen-

tary events in Ω, to which we assign probabilities by the function P:A → [0,1].

e empty set, the “sure event” Ωand all single elements of Ωare events.

In a probability space, there is no such thing as a value or an order of possible

events; it is simply a set of possible outcomes. However, in many cases one wants to

model random processes in which the outcome is a real number: we would like to

be able to speak of the probability that the outcome lies in a certain interval and to

be able to represent this graphically. A real random variable does just that: it links

the probability space and the real numbers.

Let (Ω,A, P)be a probability space. A real random variable is a function X:

Ω→

such that all subsets Ar⊂Ωdeﬁned as,

Ar={ω∈Ω|X(ω)≤r},

are events in the probability space: Ar∈ Afor all r∈

is condition ensures that every interval in

can be assigned a probability.

For a continuous random variable, this allows for the deﬁnition of the cumulative

distribution function (CDF) and its derivative, the probability density function (PDF),

which show the distribution of the probability mass over the real numbers.

e cumulative distribution function of a real random variable Xis the function

FX:

→[0,1] such that,

FX(x) = P(X≤x) = P({ω|X(ω)≤x}),

for every real number x∈

. e probability density function of Xis the function

fxsuch that FX(x) = ∫x

−∞ fX(u)du.

So far, we have considered only single random variables. In many cases, how-

ever, the joint behavior of several random variables is of interest. To that end, we

introduce multivariate real random variables that are a vector of random variables

and have a joint probability density function.

Let X1, . . . , Xkbe real random variables over the same probability space. en

X= (X1, . . . , Xk)is a multivariate real random variable if it has a joint probability

16 CHAPTER 2. INTERESTING DIRECTIONS IN DATA

density function fX:

k→[0,1] such that its cumulative distribution function

FX:

k→[0,1] is given by

FX(x1, . . . , xk) = ∫x1

−∞ ···∫xk

−∞

fX(u1, . . . , uk)du1···duk.

A multivariate random variable consisting of kindividual random variables is

also called k-variate. In the context of a multivariate random variable, we call the

density of an individual constituent marginal density. Given the probability density

function of a multivariate random variable, we can obtain its marginals by integrat-

ing out the other random variables, respectively.

One motivation for considering the joint distribution of random variables is to

study the dependencies between them: how does the outcome of one group of vari-

ables inﬂuence the density over another group of variables. is is represented by

the conditional density. Let Xand Ybe multivariate real random variables that have

a joint distribution. e conditional probability density function of Ygiven X=x

is deﬁned as, fY|X(y|x) = fX,Y (x,y)

fX(x)it is undeﬁned for all x∈Rwith fX(x) = 0.

A particularly important special case is the independence of two random vari-

ables. In this case, the outcome of one variable has no inﬂuence on the distribution

of the other variable. In other words, the conditional distribution is equal to the

marginal: fY|X≡fYif Yand Xare independent. Let Xbe a k-variate real ran-

dom variable. en X1, . . . , Xkare independent, if their joint probability density

function is equal to the product of its marginals, fX(x1, . . . , xk) = ∏k

i=1 fXi(xi).

Expectation and cumulants

e expectation of a function g(X)of a random variable Xis, loosely speaking, the

average over its random argument X. Let Xbe a k-variate random variable and

g(X)be a scalar function of X. e expected value of g(X)is deﬁned as,

E[g(X)] = ∫∞

−∞ ···∫∞

−∞

| {z }

ktimes

g(x1, . . . , xk)fX(x1, . . . , xk)dx1···dxk,

assuming that the integral exists. Since the integral is linear, the expectation is also

linear: E[αf(X) + βg(X)] = αE[f(X)] + βE[g(X)].

Properties of distributions are oen summarised in terms of its moments or cen-

tral moments, which are an inﬁnite series of increasing order. In general, the r-th

order moment of a random variable Xis deﬁned as µ′

r(X) = E[Xr]and the r-th

order central moment is µr= E[(X−E[X])r].

e expectation of a random variable, abbreviated as µ(X)is its ﬁrst moment

and is commonly interpreted as a location parameter. e variance of a random

2.2. PRELIMINARIES AND NOTATION 17

variable, its second central moment, describe its dispersion around the mean. e

variance of X, denoted by var(X)is deﬁned as,

var(X) = E[(X−µ(X))2] = E[X2]−µ(X)2.

e covariance between two random variables describes how they vary together.

e value of the covariance is a measure for a linear dependence between them.

For a multivariate random variable, all pairwise covariances are summarised in a

covariance matrix. e covariance of two real random variables Xand Ywith a

joint probability density is given by,

cov(X, Y ) = E[(X−µ(X))(Y−µ(Y))].

For a k-variate random variable Z, its covariance matrix Σ∈

k×kcontains the

covariance for each pair of variables, Σij = cov(Zi, Zj).

Moments above the mean and covariance are usually called higher order mo-

ments. e third central moment, the skewness, is oen used as a measure of asym-

metry around the mean. e kurtosis is the fourth central moment which measures

the degree of ﬂatness (or peakedness) of a density around its centres. e standard-

ise fourth moment, or Pearson’s kurtosis, is deﬁned as κ(X) = µ4

var(X).

Information theory

e entropy is a measure H[X]for the uncertainty of a random variable Xdeﬁned as

H[X] = −E[log fX(X)]. e entropy is bounded from below by zero, H[X]≥0;

it is zero if and only if the random variable is a constant, i.e. non-random. e

conditional entropy of Xgiven another random variable X, denoted by H[X|Y]is

deﬁned as H[X|Y] = −EX,Y [log fX,Y (X|Y)].e entropy of a joint distribution

can be written as H[X, Y ] = H[X] + H[Y|X].

e Kullback-Leibler (KL) divergence or relative entropy DKL is a way of compar-

ing two distributions p(x)and q(x),

DKL[p||q] = ∫+∞

−∞

p(x) log p(x)

q(x)dx.

e KL-divergence is zero if and only if the two densities are identical, and other-

wise positive (unbounded). Note, however, that the KL-divergence is not a distance

because it is not symmetric and does not satisfy the triangle equation.

A related concept to the KL-divergence is the mutual information between ran-

dom variables. It measures how much information one variable contains about the

other. e mutual information I[X, Y ]can be deﬁned using the KL-divergence as,

I[X, Y ] = DKL[p(X, Y )||p(X)p(Y)],

i.e. it is the distance between the joint distribution and the product of its marginals.

18 CHAPTER 2. INTERESTING DIRECTIONS IN DATA

The Gaussian distribution

Many statistical techniques are based on distributional assumptions involving the

(multivariate) Gaussian distribution. In particular, the noise component in statis-

tical models is usually assumed to be Gaussian distributed. e conventional jus-

tiﬁcation is that by the central limit theorem, the sum of independent identically

distributed random variables converges (in distribution) to the normal distribution.

Let Xbe a k-variate random variable that follows a multivariate Gaussian distri-

bution withmeanµandcovariancematrix Σ, whichweabbreviatebyX∼ N(µ, Σ).

e probability density function is of the multivariate Gaussian is given by,

fX(x) = det(2πΣ)−1

2exp (−1

2(x−µ)⊤Σ−1(x−µ)).

e entropy of a D-variate Gaussian H(N(µ, Σ)) is given by H(N(µ, Σ)) =

log √det(2πΣ) + D/2.Note that the entropy is thus invariant under orthogonal

transformations of variables.

For a given mean and covariance matrix, the Gaussian is the real-valued distri-

bution with the highest entropy [Jaynes ()]. From this point of view, the Gaus-

sian distribution is the least restrictive distributional assumption.

e negentropy J(X)of a random variable Xwith is the diﬀerence between its

entropy and the entropy of a Gaussian random variable with the same mean and

covariance,

J(X) = H [N(µ(X),cov(X))] −H[X].

e negentropy is nonnegative.

Maximum likelihood

One the key tasks in statistics is the estimation of model parameters from samples.

In the standard maximum likelihood (ML) approach, we select those parameters that

maximise the likelihood of observing the data at hand under our generative model

assumption. Let X ⊂

dbe a dataset and let ℓθ:P(

d)→

the likelihood

function of a statistical model, which is the likelihood of observing a data set given

the parameters θ. e maximum likelihood estimator for the model parameters is

given by, ˆ

θML = argmaxθlog ℓθ(X).

e standard estimators for the parameters µand Σof a d-variate Gaussian dis-

tribution are the sample mean ˆµand the sample covariance matrix ˆ

Σ,

ˆµ=1

|X| ∑

x∈X

xand ˆ

Σ = 1

|X|−1∑

x∈X

(x−ˆµ)(x−ˆµ)⊤,

respectively. Note that the latter is not the maximum likelihood estimator, but it has

the favorable property of being unbiased, i.e. its expectation over realisations of the

dataset is equal to the true parameter.

2.3. MAXIMUM VARIANCE AND INDEPENDENCE 19

2.3 Maximum variance and independence

In this section, we brieﬂy introduce the family of statistical models corresponding

to coordinate transformation in data space and discuss the two most prominent ap-

proaches: principal component analysis (PCA) and independent component anal-

ysis (ICA).

e generative model for the data is a linear transformation of latent variables.

e observed D-variate data Xis assumed to be generated as a linear transformation

of Dlatent variables Yby an unknown full rank mixing matrix A∈

D×D,

X=AY.

In a time series context, the data Xas called signals denoted by x(t)and the latent

variables are called sources written as y(t). Moreover, the mixing is instantaneous,

since the observation x(t)at time tonly depends on the sources at time tand not on

contributions from previous time points. In factor analysis, the matrix Ais called

loading matrix.

e elements of the latent variable Yare referred to as components or factors.

Given only samples from X, the goal is to estimate an inverse for Awhich allows

us to recover the latent variables Y, i.e. ﬁnd the transformation A−1to coordinates

corresponding to the desired components.

is is possible only by making further assumptions on the unknown parameter

Aand the statistical properties of the latent variables Y. It is these types of assump-

tions which allow to distinguish the diﬀerent types of coordinate transformations.

e assumption on Aand Yalso determine up to which symmetries the true mixing

matrix Acan be identiﬁed from the observed data.

2.3.1 Principal component analysis

As an academic discipline, the search for interesting directions in data started with

Karl Pearson’s work [Pearson ()] on principal component analysis (PCA) at the

beginning of the th century. To this day, PCA is still the most widely used method

of extracting useful linear components from a set of high-dimensional data points.

e vantage point of Pearson’s work was what is now known as ordinary least squares

regression (OLS) [Gauß (), Legendre ()], a method for ﬁtting the mean of

a univariate dependent variable (or target) yby a linear combination of indepen-

dent variables (or covariates) x1, . . . , xd. is analysis yields a coeﬃcient αifor

each independent variable xi, which is commonly interpreted as the eﬀect of the

independent variable on the target, conditioned on (or controlled for) ﬁxed values

of all other covariates.

e generative model corresponding to linear regression is given by the equa-

20 CHAPTER 2. INTERESTING DIRECTIONS IN DATA

tion,

∑

i=1

αixi+ϵwhere ϵ∼ N(0, σ),(.)

where the only random component is the Gaussian noise term ϵ. e unknown pa-

rameters of interest are the coeﬃcients α1, . . . αdand the variance of the noise σ. In

this model, the values of the covariates are assumed to be known exactly; only the

target is subject to measurement noise. As Pearson points out, it is quite unrealistic

that all variables apart from one can be measured exactly. Moreover, the distinc-

tion between dependent and independent variable is somewhat arbitrary: one could

equally well aim to explain the variation in the covariate x1by a linear function of

x2, . . . , xdand y. ese points are not deﬁciencies of the linear regression model

but reﬂect the aim of the analysis: to explain the variation of a particular variable in

terms of another set of variables in order to investigate the underlying mechanism.

In other words, the ordinary least squares model (.) belongs to the family of con-

ditional models, which model the distribution of a designated variable as a function

of another set of variables.

Independent variable (covariate)

Dependent variable (target)

Principal components

OLS regression

Figure .: Comparison of ordinary least squares regression (OLS) and PCA.

However, as Pearson noted, conditional modeling of a particular variable is not

necessarily optimal for explaining the total variation in the data, i.e. the joint vari-

ation of all variables x1, . . . , xdand y. See Figure . for an example. To that end,

Pearson proposed a method to ﬁnd the “Closest ﬁt to Systems of Points in Space”

[Pearson ()], which now known as principal component analysis. PCA ﬁnds an

orthogonal coordinate system that minimises the reconstruction error of the data.

e reconstruction error of a data vector is the L2-distance to its projection onto a

particular basis. e elements of this PCA basis are called principal components. e

principal components are ordered descendingly by the variance of the data along

2.3. MAXIMUM VARIANCE AND INDEPENDENCE 21

each coordinate. at is, the ﬁrst principal component is the direction in the data

that has the highest variance.

We can ﬁnd the principal components iteratively by minimizing the squared re-

constructionerror. Assumewearegivenad-dimensionalcentereddatasetx1, . . . , xn∈

dand we have found the ﬁrst korthonormal principal components v1, . . . , vk. e

next principal component vk+1 is the solution to the optimisation problem,

vk+1 = argmin

∥v∥=1

∑

i=1 



(v⊤xi)v−xi



2s.t. v⊥span{v1, . . . , vk}

= argmax

∥v∥=1

v⊤(n

∑

i=1

xix⊤

i)vs.t. v⊥span{v1, . . . , vk}

= argmax

∥v∥=1

v⊤ˆ

Σvs.t. v⊥span{v1, . . . , vk},(.)

where ˆ

Σis the sample covariance matrix of the centered dataset. is optimisation

problem has a well known global solution [Golub and Van Loan ()] given by

the eigenvectors of the sample covariance matrix,

Σ = V DV ⊤,

where Dis a diagonal matrix with ordered positive eigenvalues λ1≥ ··· ≥ λd>0

on the diagonal and the columns of Vare the corresponding orthonormal eigenvec-

tors v1, . . . , vd∈

d. In other words, V⊤is an orthogonal transformation which

diagonalises the covariance matrix: V⊤ˆ

ΣV=D, i.e. the data is uncorrelated in

the PCA basis. is means that the axis of the ellipsoid representing the covariance

matrix are aligned with the principal components, as illustrated in Figure .. More-

over, the eigenvalues correspond to the length of each axis of the ellipsoid. In terms

of the optimisation problem (Equation .), each eigenvalue λiis the objective func-

tion value at the maximum, i.e. the largest variance in the orthogonal complement

of span{v1, . . . , vi−1}. us, if we choose to represent our dataset by the ﬁrst k

principal components, then the fraction of the variance that we explain is given by

σPCA

k=∑k

i=1 λi/∑d

i=1 λi. Conversely, the reconstruction error as a fraction of

the total variance is 1−σPCA

k. e normalised eigenvalue spectrum therefore gives

a good indication of the contribution of each principal component to the variance

of the data. In practice, one oen encounters spectra of a characteristic shape, where

relatively few principal components explain almost all of the variance.

Applications and interpretation

Probably the most common application of PCA is dimensionality reduction (or data

compression), guided by the principle of minimizing the reconstruction error: re-

taining a certain fraction of the variance using the minimum number of dimensions.

22 CHAPTER 2. INTERESTING DIRECTIONS IN DATA

Under a Gaussian noise model, this is equivalent to de-noising, if the data lies on a

lower-dimensional subspace. In this scenario, one selects the number of dimen-

sions based on the shape of the eigenvalue spectrum, or a pre-speciﬁed fraction of

the variance that has to be captured. Depending on the application context, one

proceeds either in the PCA coordinates, or projects the data onto the PCA subspace

in the original coordinates. In factor analysis, the focus is oen on interpreting the

coeﬃcients of the loading matrix V.

e PCA solution is meaningful only if the relative scale of the input coordinates

is not arbitrary. Rescaling a single coordinate changes its variance relative to the

other coordinates and therefore the PCA solution. If the relative scaling is arbitrary,

then the PCA basis is somewhat arbitrary.

2.3.2 Independent component analysis

Independent component analysis [Hyvärinen et al. ()] ﬁnds a new coordinate

system for the data which maximises the statistical independence of the new coordi-

nates. One motivation of this approach is to recover the underlying (supposedly in-

dependent) mechanisms which have generated the data. From an explorative point

of view, the advantage of having independent components is that they can be studied

univariately, since there are no interactions (dependencies) between them. Whereas

PCA is oen used as a technical pre-processing step in a semi-automatic fashion, the

ICA basis and coordinates are usually subject to further qualitative analysis.

Figure .: ICA example, from le to right: input data; whitening; rotation to

independent components. e red lines are the true independent components,

the black curve is the contour of the sample covariance matrix.

e generative model of ICA is that the observed data Xis generated as a linear

mixture of independent latent variables (or sources) Y,

X=AY,

where we assume that Ais invertible and that there are as many observed variables

as there are latent variables, both Xand Yare assumed to be d-variate.

2.3. MAXIMUM VARIANCE AND INDEPENDENCE 23

e aim of ICA is to recover the latent variables from the mixing, i.e. estimate an

inverse ˆ

A−1for the mixing matrix such that ˆ

Y=ˆ

A−1X. e true mixing matrix

is not uniquely identiﬁable. First of all, the sign and the scaling of the latent sources

Ycannot be determined from the observations, because we cannot distinguish be-

tween the true sign and scaling and the multiplication of the columns of Aby a

scalar. Moreover, the true order of the latent sources is not identiﬁable, because we

can absorb a permutation matrix either into the mixing matrix, or into the sources.

us, we can only identify Aup to arbitrary permutation and scaling of its columns¹

In other words, ICA ﬁnds an unordered set of one-dimensional subspaces.

Many ICA algorithms proceed in two steps: a whitening followed by a rotation.

In the ﬁrst step, the data is decorrelated by a multiplication with a whitening matrix

W, which diagonalises the sample covariance matrix: Wˆ

ΣW⊤=I. In this basis,

the data is uncorrelated but not necessarily independent², as we can see in the middle

panel of Figure ..

Aer the decorrelation step, we ﬁnd a linear transformation that leaves the unity

covariance matrix intact, RIR⊤=Iand maximises the independence of the la-

tent variables, i.e. we ﬁnd a orthogonal matrix such that the demixing is given by

A−1=RW. A large family of ICA algorithms are based on minimizing the mu-

tual information between the estimated sources, as a measure of independence. We

will see that this is equivalent to maximizing the distance between the distribution

of the estimated latent variables and the Gaussian distribution. More formally, let

us assume that the data has already been decorrelated by the whitening matrix W

such that cov(X) = I. Now we seek the optimal rotation matrix R∗that gives us

maximally independent latent variables ˆ

Y=R∗X. In terms of an optimisation

problem, this rotation is given by,

R∗= argmin

RR⊤=I

I[ˆ

Y1,..., ˆ

Yd]

(1)

= argmin

RR⊤=I

∑

i=1

H[ˆ

Yi]−H[ˆ

(2)

= argmin

RR⊤=I

∑

i=1

H[ˆ

Yi]−H [X]−log det(R).

Equality () results from the deﬁnition of mutual information and equality () fol-

lows from the fact that ˆ

Yis obtained from the observations Xby the linear transfor-

mation R. Since the entropy of Xdoes not depend on R, and since Ris a rotation

¹Under the assumption that there is not more than one Gaussian-distributed source; if that is not

the case, then only the subspace of Gaussian sources can be identiﬁed.

²However, if the data is Gaussian, then uncorrelatedness is equivalent to independence.

24 CHAPTER 2. INTERESTING DIRECTIONS IN DATA

(hence det R= 1) we get,

R∗= argmin

RR⊤=I

∑

i=1

H[ˆ

Yi]

(3)

= argmin

RR⊤=I

∑

i=1 (H[N(µ(ˆ

Yi),cov( ˆ

Yi))]−J( ˆ

Yi))

(4)

= argmin

RR⊤=I

∑

i=1 −J(ˆ

Yi),

where the equality () follows from the deﬁnition of the negentropy and () results

from the fact that the entropy of the Gaussian distribution is invariant under the

rotation R.

us we have seen that minimizing the mutual information is equivalent to max-

imizing the distance to a Gaussian distribution of the individual sources, as mea-

sured by the negentropy. A practical approach to maximizing the non-Gaussianity

which lends itself well to numerical optimisation is based on the higher order mo-

ments, i.e. moments beyond the mean and the covariance matrix. e JADE algo-

rithm [Cardoso ()], for instance, maximises the kurtosiskurt [ˆ

Yi]usingGradient-

based numerical optimisation.

A great number of diﬀerent algorithms have been proposed to ﬁnd these non-

Gaussian projections; some optimise contrast functions like the negentropy by gra-

dient descent [Bell and Sejnowski ()], others solve the problem by approximate

diagonalisation of the fourth-order cumulant [Cardoso and Souloumiac ()] or

by deﬂation-type ﬁxed-point algorithms [Hyvärinen ()].

Another family of ICA algorithms exploits the time structure of the data. Ex-

amples for these algorithms are e.g. TDSEP [Ziehe and Müller (a)] (a generali-

sation of [Molgedey and Schuster ()]) or SOBI [Belouchrani et al. ()]. Be-

sides the fact that these methods have access to information that other ICA algo-

rithms do not have, they are oen numerically more stable since they rely only on

second-order moments.

Applications and interpretation

e archetypal example for an application of independent component analysis is the

so-called cocktail party problem, where several people are talking simultaneously in

a crowded room and one tries to single out one of the speakers from recordings

at several microphones distributed in the room. Assuming that the speakers are

independent from each other, and that the signals at the microphones are an instan-

taneous mixture of the speakers, this task can be solved by independent component

analysis.

2.4. SUMMARY AND DISCUSSION 25

is has lead to the slightly tongue-in-cheek metaphor of the “cerebral cocktail

party problem” that one faces in the analysis of non-invasive neural recordings. For

example, EEG potentials measured on the surface of the scalp pick up contributions

from a variety of sources located inside the brain. In EEG analysis, ICA has been

applied successfuly to extract meaningful components in a variety of contexts; see

[Jung et al. ()] for a review. In particular, the fact that the columns of the es-

timated mixing matrix can be interpreted as scalp patterns made it possible to use

ICA as a visualisation, or imaging, technique.

2.4 Summary and discussion

We have started this chapter by motivating why a linear transformation of variables

can be useful: from the perspective of a generative model, we recover the latent

variables from the linear mixing. Conversely, the explorative data analyst likes to

obtain new variables with desirable properties, such as pair-wise independence. Af-

ter a brief review of mathematical fundamentals, we introduced the basic setup of

linear coordinate transforms, followed by an overview of the two most widely used

linear methods: PCA and ICA.

Nonlinear axes?

e linear model is a simpliﬁcation: the relationship between the latent variables

and the observations may well be non-linear. is means that the observed data lies

on a non-linear manifold, and what we want to ﬁnd is the intrinsic coordinate sys-

tem of that manifold. Over the last decade, a large number of methods have been

proposed for nonlinear dimensionality reduction or manifold learning. e most

popular are kernel PCA [Schölkopf et al. ()], locally linear embedding (LLE)

[Roweis and Saul ()], isomap [Tenenbaum et al. ()], self-organizing-maps

[Kohonen ()], stochastic neighbour embedding [Hinton and Roweis ()],

and principal curves [Hastie and Stuetzle ()].

Despite the fact that the linear model is indeed restrictive, nonlinear methods

do not yet enjoy the same level popularity among practitioners outside the statis-

tics and machine learning communities. ere are several reasons for this. First of

all, most of the nonlinear approaches do not compute an explicit representation for

the relationship between latent variables and observations, but only the new coor-

dinates. In that sense, they do not ﬁnd interesting directions. e solution is there-

fore harder to interpret. Also, it is usually not straightforward to project new data

points in the found coordinates—the algorithms need to be run again on the ex-

tended data set [Bengio et al. ()]. Moreover, as the nonlinear model is more

complex, it needs to be regularised in order to avoid overﬁtting. For kernel PCA

[Schölkopf et al. ()], this means choosing kernel parameters; for LLE we need

26 CHAPTER 2. INTERESTING DIRECTIONS IN DATA

to choose local neighbourhoods and a regularisation parameter, and Isomap de-

pends on the construction of a neighbourhood graph that follows the true manifold

exactly. Choosing these parameters is diﬃcult and the solution greatly depend on

it. Lastly, nonlinear methods are, overall, less noise robust and require signiﬁcantly

more data. Both of which are characteristics that do not bode well for practical ap-

plications.

Useful directions beyond maximum variance and independence

Apart frommaximumvarianceand pair-wise independence, thereare other desider-

ata for linear coordinate transforms. For example, projection pursuit [Huber (),

Friedman and Tukey ()]ﬁndsdirectionthatreveals clusterstructurein thedata;

random projections [Rahimi and Recht ()] have been found useful for large-

scale learning; and methods from EEG source reconstruction [Haufe et al. ()]

aim to recover brain sources from recordings on the scalp using a combination of

prior knowledge, physiological and statistical constraints.

However, none of these criteria aims at understanding changes in the distribution

of data. is is the problem addressed by stationary subspace analysis (SSA), which

we introduce in the next chapter.

Chapter 3

Stationary subspace analysis

How does the distribution of a data set change over time? As we had seen in Chap-

ter , this unsuspicious-seeming question can be diﬃcult to answer in the unsuper-

vised multivariate case. at is because from the perspective of the individual input

variables, we only observe the marginals of the joint distribution, but the relevant

changes might occur in the dependencies. Moreover, we do not have labels, tim-

ing information, or a target variable that could help us identifying the distribution

changes. In a simple example we showed that the input basis can be completely

uninformative for understanding changes in the joint distribution.

is is the problem we address in this chapter. We introduce a novel type of

linear coordinate transform, stationary subspace analysis (SSA), which ﬁnds direc-

tions based on the strength of distribution changes. More speciﬁcally, SSA identiﬁes

two subspaces: the subspace in which the distribution stays constant over time (sta-

tionary subspace) and the directions in which the changes are most pronounced

(non-stationary subspace).

Such a coordinate transform is useful in many circumstances. “Is there a tem-

poral change in my data?” and “what are these changes?” are common questions

in the explorative analysis of data, whenever there is some sort of time structure

present. is need not be an explicit time series, but also, for example, ever-growing

data sets (such as web data, like that extracted from social media sites) or survey

data data that is collected at several points in time. Finding an informative basis for

understanding distribution changes is particularly relevant in the high-dimensional

case.

Even in a less explorative setting, SSA can be the right tool to extract meaningful

components by revealing the most non-stationary components. For example, in the

analysis of spontaneous neural activity [Bießmann et al. (), Allen et al. ()],

where no timing information or target variables is available, the most non-stationary

subspace can contain the information of interest corresponding to the neural activ-

ity.

From a more methodological point of view, SSA can be used as an unsupervised

feature extraction method for certain tasks. For example, in change-point detec-

tion [Blythe et al. ()] (or temporal segmentation), one wants to detect the time

28 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

points at which the data distribution changes. However, estimating and comparing

probability distributions is diﬃcult in high-dimensions. Here, stationary subspace

analysis helps by ﬁrst ﬁnding the relevant non-stationary subspace and then conﬁn-

ing change point dimension to that lower-dimensional basis. Similarly, in density

ratio estimation [Sugiyama ()], one is given two data sets D1and D2and the

goal is to compute the ratio of the estimated densities ˆpD1(·)/ˆpD2(·)at all points in

D1. is is notoriously diﬃcult in high dimensions. Here, ﬁnding a basis in which

the distribution of the two data sets diﬀer maximally is a useful pre-processing step.

Moreover, many prediction methods rely on the assumption that the distribution

of the calibration data is equal to the distribution of the data at application time

[Quionero-Candela et al. ()]. When this assumption is violated, SSA can con-

tribute to domain adaptation [Hara et al. ()] by conﬁning the parameter esti-

mation to the stationary subspace.

e remainder of this chapter is organised as follows. In the next Section .,

we introduce the generative model of SSA along with our deﬁnition of stationarity

and discuss the identiﬁability of this model. e issue of identiﬁability is revisited

in Section ., where we encounter the phenomenon of spurious stationarity and

how it can be overcome. Section . is concerned with the algorithmic approach to

solving the SSA problem and its relationship to statistical testing. Finally, in the last

Section ., we summarise and point to open questions and related work.

3.1 Model and identiability

In the stationary subspace analysis model, the observed data is generated as a linear

mixture of two groups of latent variables (or sources): a set of variables whose joint

distribution remains constant over time (stationary) and another set of variables

whose joint distribution varies (non-stationary). e mixing matrix is assumed to

be time-constant and invertible. ere are no further assumptions on the latent vari-

ables. In particular, both the stationary and the non-stationary variables can have

arbitrary dependencies, also across the two groups. More formally, the observed

D-variate data Xis generated as,

X=ASt=[AsAn][Ss

t],(.)

where the random variables Ss

trepresents the dstationary variables and Sn

tare the

D−dnon-stationary directions and tis the time index t∈

. e ﬁrst dcolumns

of the mixing matrix Asspan the stationary subspace and its last D−dcolumns An

span the non-stationary subspace.

e goal of an SSA algorithm is to invert this mixture, i.e. discern stationary and

non-stationary contributions in the observed data. Before we turn to the issue of

which parts of the model are identiﬁable, we need to establish more precisely what

3.1. MODEL AND IDENTIFIABILITY 29

stationary

non−stationary

non−

stationary

Figure .: Illustration of the SSA model. e non-stationary and the station-

ary source have a time-variable joint distribution, as shown in the epoch scatter

plots. e marginal distribution of the stationary direction (vertical axis) stays

constant, whereas the histograms on the horizontal axis change. Note also that

the correlation between the two sources changes between epochs, and that they

are not distinguishable by their variance (signal power).

is meant by stationarity resp. non-stationarity. In its strongest sense, stationarity

means that all properties of the time series Ss

tremain constant over time. at is,

all joint distributions Ss

Twith T⊂

are exactly equal. is deﬁnition does not

lend itself well to the derivation of a practical SSA algorithm, since it corresponds to

an inﬁnite number of constraints. At the same time, only a few of the quantities to

which these constraints pertain can be estimated reliably from ﬁnite data.

A pragmatic denition of stationarity

is is reﬂected in our deﬁnition of stationarity: we only consider the ﬁrst two mo-

ments (mean and covariance matrix) of the probability distribution and ignore the

time structure. at is, a d-variate time series Ytis stationary if

E[Yt1] = E[Yt2]and E[Yt1Y⊤

t1] = E[Yt2Y⊤

t2],

for all pairs of time points t1, t2∈

. is type of stationarity is also called weak

stationarity [Priestley ()] without time structure.

isclearlyisa limitednotion of stationarity. Forexample, notethat atimeseries

whose frequency content changes while its power stays constant would be deemed

stationary. Similarly, a non-stationary time series where all changes are conﬁned to

higher order moments is weakly stationary; though this is probably rare in practice.

30 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

However, in practicalapplications we have foundthat thistype of stationarityalready

achieves interesting novel results, while not requiring large amounts of samples.

What can we recover from the mixture?

e aim of stationary subspace analysis is to invert the mixture of stationary and

non-stationary sources given only samples from the mixture X. at is, we want to

ﬁnd a demixing matrix,

A−1=[Bs

Bn]with Bs∈

d×Dand Bn∈

D−d×D,

that consists of the matrices Bsand Bnto the stationary and non-stationary direc-

tions respectively. In the following, we will refer to “changing into the source coor-

dinates” as projecting and denote Bsand Bnas the s-projection and the n-projection

respectively. Note that we assume that the true number of stationary directions dis

known. Applying these projections to the mixture yields the estimated latent vari-

ables,

t=BsXand ˆ

t=BnX.

Before we turn to the problem of ﬁnding this demixing, let us consider its identi-

ﬁability. Assuming that the generative model is correct and that we know the the

true number of stationary and non-stationary directions, an important question is:

up to which symmetries can we recover the true model parameters from the mixed

signals? Our model has one parameter, the mixing matrix A=[AsAn], which

consists of a basis for the stationary and the non-stationary subspace respectively.

A criterion for identifying Amust be based on recovering the latent variables with

the prescribed properties from the mixed data.

Figure .: ere is only one stationary direction (blue) but several non-

stationary ones: projecting onto the green, red and turqoise directions yields a

non-stationary time series (see below).

3.2. THE SSA ALGORITHM 31

e only assumption we have made about the latent variables is that there are d

stationary and D−dnon-stationary ones. An SSA algorithm which achieves recov-

ers sets of sources with these characteristics has exhausted all available information.

us, if we apply such an ideal demixing ˆ

A−1to the observed data Xt,

[ˆ

t]=ˆ

A−1Xt=ˆ

A−1A[Ss

t]=[BsAsBsAn

BnAsBnAn][Ss

t],

we obtain a criterion for a valid demixing in terms of its relationship to the true

mixing matrix.

First of all, in order for the stationary project Bsto yield stationary projec-

tions, all non-stationary contributions need to be removed from the mixture, i.e.

BsAn= 0. is means that we can identify a basis for the true non-stationary sub-

space, because it is uniquely given as the dual space of the rows of Bs. However, the

product BsAsis unconstrained since it corresponds to a linear transformation of the

stationary sources, which does not change their stationary nature for an example.

us we can only identify the true stationary sources up to linear transformations.

e same holds for the non-stationary sources: BnAnis arbitrary. Moreover, the

BnAsis also unconstrained, since in general we can add arbitrary stationary contri-

butions to a non-stationary source without making it stationary. Figure . shows an

example, where there is only one stationary direction, but several directions yielding

non-stationary time series. e true stationary subspace is therefore not identiﬁable.

3.2 The SSA algorithm

In the previous section, we have observed that we can only recover the true station-

ary sources (up to linear transformations), and that the non-stationary sources are

not identiﬁable in principle. We will therefore start by developing an optimisation

criterion for ﬁnding the stationary projection Bs, before we consider the second

part of the demixing matrix Bn.

Measuring stationarity

e starting point for ﬁnding the s-projection is to deﬁne a measure of stationarity.

at is, given samples x1, . . . , xmfrom a d-dimensional time series, we need a way

of quantifying its degree of stationarity. According to our deﬁnition, a time series

is stationary if its ﬁrst two moments are constant over time. erefore, a natural

measure of stationarity is based on the distance of the mean and covariance ma-

trix. To jointly compare the ﬁrst two moments, we need to combine the distance in

the mean and the covariance matrix in a principled way. To that end, we use the

32 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

Kullback-Leibler divergence between D-variate Gaussians, which is given by,

DKL [N(µ0,Σ0)||N(µ1,Σ1)] =

2(log (det Σ1

det Σ0)+ trace (Σ−1

1Σ0)+ (µ1−µ0)⊤Σ−1

1(µ1−µ0)−D).

is means that we approximate the probability density by a Gaussian distribution.

Note, however, that in the following derivation we will not assume that the sources

are Gaussian distributed. e Gaussian merely serves as the least restrictive distri-

butional assumption for data with ﬁxed mean and covariance matrix, in the sense

that it has the highest entropy [Jaynes ()].

In order to compute this stationarity measure from data, we need to estimate the

moments at several points in time. is we do either by segmenting the time series

into consecutive epochs, or by using a sliding window¹. Let ˆµ1, . . . , ˆµn∈

dand

Σ1, . . . , ˆ

Σn∈

d×dbe the epoch mean and covariance matrices respectively, where

nis the total number of epochs. Moreover, let mbe the total number of samples and

m1, . . . , mn, the number of samples in each epoch.

Based on the KL-divergence between two epochs, there are now several ways

of deﬁning a measure of stationarity for the whole time series. For example, we

could sum up the KL-divergence between all pairs of epochs. We choose a diﬀerent

approach, which is computationally more eﬃcient: comparing each epoch against

a reference epoch. Clearly, the sum of these KL-divergences is zero, if and only if,

all moments are equal and otherwise positive. As a reference epoch, we choose the

average epoch, with moments ¯µand ¯

Σ,

¯µ=

∑

i=1

µiand ¯

Σ =

∑

i=1

Σi.

erefore, given an epochised time series we measure its stationarity as,

L(µ1, . . . , µn,Σ1, . . . , Σn) =

∑

i=1

DKL [N(µi,Σi)||N(¯µ, ¯

Σ)],

which is zero if and only if the time series is perfectly stationary. Note that this

measure of stationarity can grow arbitrarily large.

From a measure of stationarity to an objective function

In order to ﬁnd the stationary projection, we minimise the non-stationarity of the

estimated stationary directions. us the optimal stationary projection is the solu-

¹e design of the epoch structure is an important issue that we will discuss later.

3.2. THE SSA ALGORITHM 33

tion to the optimisation problem,

Bs∗= argmin

B∈

d×D

L(Bµ1, . . . , Bµn, BΣ1B⊤, . . . , BΣnB⊤)

= argmin

B∈

d×D

∑

i=1

DKL [N(Bµi, BΣiB⊤)||N(B¯µ, B ¯

ΣB⊤)].

In this form, the objective function does not lend itself well to numerical optimisa-

tion: without constraints on B, degenerate solutions cannot be avoided (e.g. a rank

deﬁcient projection) and the fact that the objective contains inverses of a projected

covariance matrix can lead to numerical instability.

In order to make the objective more benign, we exploit the invariances that we

observed earlier while studying the identiﬁability of the model parameters. As we

will see inthe following, wecanﬁx themeanandthecovariancematrix oftheaverage

epoch to 0and Irespectively without loss of generality by choosing a diﬀerent basis

and introducing constraints.

First of all, we can translate all epoch mean vectors without changing the so-

lution, because the distance between the epochs does not depend on their overall

location. us we centre the average epoch, i.e. translate the data such that ¯µ= 0.

Moreover, we can perform the optimisation in an arbitrary basis. us we ﬁrst ap-

ply a whitening such that the average covariance matrix is the identity, ¯

Σ = I. In

order to ensure that the moments of the average epoch stay constant under any pro-

jection, we add the constraint that BB⊤=I. Recall that we can only identify the

true stationary projection up to arbitrary linear transformations, so this is not a re-

striction. is means that we have written the demixing matrix as a whitening of

the original data by a matrix Wfollowed by an orthogonal matrix, i.e. ˆ

A−1=RW

with RR⊤=I. Note, however, that only the ﬁrst drows of the demixing matrix

enter into the objective when optimizing for the stationary projection, because we

only evaluate the estimated stationary sources.

In the following we assume that that the centring and whitening has already been

applied to the data such that ¯µ= 0 and ¯

Σ = I. uswe can rewritethe optimisation

34 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

problem as,

Bs∗= argmin

BB⊤=I

∑

i=1

DKL [N(Bµi, BΣiB⊤)||N(0, I)]

= argmin

BB⊤=I

∑

i=1 (−log det(BΣiB⊤) + trace(BΣiB⊤) + ∥Bµi∥2−d)

= argmin

BB⊤=I

∑

i=1 (−log det(BΣiB⊤) + ∥Bµi∥2)−nd

2trace (B(n

∑

i=1

Σi)B⊤)

(1)

= argmin

BB⊤=I

∑

i=1 (−log det(BΣiB⊤) + ∥Bµi∥2)−nd

2+n

2trace Id×d

= argmin

BB⊤=I

∑

i=1 (−log det(BΣiB⊤) + ∥Bµi∥2)

| {z }

=J(B)

,(.)

where the equation  above follows from the fact that we have set the average covari-

ance matrix to the identity. e gradient of the objective function Jwith respect to

the projection Bis given by

∂J

∂B =

∑

i=1 (BΣiB⊤)−1BΣi+Bµiµ⊤

i,(.)

fromwhichweseethattheobjectivefunctionisnotconvexduetothelog-determinant.

Gradient descent over the special orthogonal group SO(D)

Even thought we have achieved a simpler, and computationally advantageous ob-

jective function (.), the orthogonality constraint BB⊤=Imakes it diﬃcult

to optimise. In the next step, we therefore eliminate this constraint by choosing

a parametrisation that ensures that the constraint is automatically fulﬁlled. To that

end, instead of optimizing a projection matrix, we are looking for an orthogonal

transformation RR⊤=Iof the data such that the ﬁrst dcoordinates are stationary.

In the objective function, this means that we write Bin terms of Ras its truncation

B=IdRto the ﬁrst drows (Idis an identity matrix truncated to the ﬁrst drows).

In order to ﬁnd the orthogonal matrix R, we use an iterative optimisation pro-

cedure using multiplicative updates by an orthogonal matrix. Starting with a ran-

dom orthogonal matrix, in the k-th step we ﬁnd an orthogonal update Usuch that

3.2. THE SSA ALGORITHM 35

Rk+1 =URk. Aer termination in the ℓ-th step, the found solution is Bs∗=

IdRℓW, where Wis the whitening matrix.

But how do we ﬁnd each orthogonal update? Following the standard gradient-

based optimisation strategy, what we would like to do is to identify a direction of

steepest descent in the set of orthogonal transformations and then perform a line

search along this direction to ﬁnd the optimal step.

Let us ﬁrst of all take a closer look at this set of orthogonal matrices. e orthog-

onal matrices form a multiplicative group: the product of two orthogonal matrices

is orthogonal and the inverse of an orthogonal matrix is orthogonal; the identity ma-

trix is orthogonal. is group is called the orthogonal group O(D)and it consists

of elements of two types: rotations with determinant +1 and reﬂections with deter-

minant −1. e rotations are a normal subgroup of the orthogonal group, called

special orthogonal group SO(D)▹O(D). To ﬁnd the true stationary projection, we

need to project orthogonal to the non-stationary subspace, i.e. we want to remove

stationary contributions from the ﬁrst dcoordinates. is can always be achieved

by a rotation of the data; and at each step we can improve on this objective by a

applying a rotation — changing the sign of one coordinate (i.e. apply a reﬂection)

does not change the objective. Hence we can constrain our search for the direction

of steepest descent to the subgroup SO(D), the rotation matrices.

Figure .: Illustration of the tangent space (blue plane) at one point of the group

(sphere). A direction in the tangent space (blue arrow) corresponds to a one-

parameter subgroup (or geodesic) in the group, shown as the curved red arrow.

In order to ﬁnd the update rotation U, we compute the gradient of our objective

function J(U)and use it as a search direction over the special orthogonal group.

is is possible because SO(D)is a Lie group, i.e. it is a continuous diﬀerentiable

manifold. However, since the SO(D)is not a Euclidean space but a manifold, deﬁn-

ing the gradient is a bit more involved. e gradient at a point on a manifold is

36 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

an element of the tangent space (linearisation) at that point. In a Euclidean space,

the tangent space at each element is equal to the Euclidean space itself, because it

is a linear space. On a curved manifold, this is not the case; see Figure . for an

illustration

In a Lie group, the tangent space at the identity has a special meaning: it is the

corresponding Lie algebra. Informally speaking, the Lie algebra is a vector space

spanned by the inﬁnitesimal transformations corresponding to the group action,

which in our case, are the inﬁnitesimal rotations.

e Lie algebra so(D)of the special orthogonal group SO(D)is the set of all

matrices M∈RD×Dthat are anti-symmetric, M=−M⊤. e element Mij can

be interpreted as the angle of an inﬁnitesimal rotation of axis jtowards axis i. Since

we absorb the update rotations into the data, at each step we want to ﬁnd the gradient

of J(U)at the identity I. e gradient is thus an element of the Lie algebra and we

can ﬁnd it by using the connection between the group SO(D)and the algebra so(D)

provided by the exponential map. e exponential map,

exp : so(D)→SO(D),

is a one-to-one correspondence between the skew-symmetric matrices and the ro-

tation matrices, given by the matrix exponential. Geometrically speaking, it maps

straight lines in the tangent space onto geodesics on the manifold. See the blue and

right arrows in Figure . for an example.

Following [Plumbley ()], we parametrise the update Uby an antisymmetric

matrix U= exp(M)and calculate the gradient ∇so(D)J(exp(M)) at M= 0 to

ﬁnd the search direction. Using this gradient, we can perform a line search along

the corresponding geodesic {exp (t∇so(D)J)|t∈

}to ﬁnd the update U.

To calculate the gradient in the Lie algebra, we need a notation of distance be-

tween anti-symmetric matrices. A natural way to deﬁne a norm is to adapt the

Frobenius-norm to reﬂect the fact that every element of M∈so(D)appears twice,

∥M∥2=1

2∑

i,j

ij,

by including the factor 1/2. is corresponds to an inner product for M1, M2∈

so(D)given by,

⟨M1, M2⟩=1

2∑

i,j

(M1)ij(M2)ij =1

2trace(M⊤

1M2).

Using this scalar product, we can implicitly deﬁne the gradient in the Lie algebra in

terms of its projection onto unit vectors. Let H∈so(D)be a unit vector ∥H∥= 1

in the algebra; then the component of the gradient in the direction of His the scalar

3.2. THE SSA ALGORITHM 37

product ⟨∇so(D)J, H⟩. Moreover, we the derivative of the update rotation U=

exp(tH)along the direction H(a so called single-parameter subgroup) is given by,

∂exp(tH)

∂t =Hexp(tH).(.)

In the following, we will obtain an explicit expression for the gradient ∇so(D)Jby

equating the derivative of Jalong Hwith the projection of the gradient ∇so(D)J

onto Hin the Lie algebra. First of all, using the chain rule, we get an expression for

the derivative of Jalong H,

∂J(exp(tH))

∂t = trace [(∂J

∂U )⊤∂U

∂t ]

(1)

= trace [(∂J

∂U )⊤

HU]

(2)

= trace [U(∂J

∂U )⊤

(3)

= trace [1

2(U(∂J

∂U )⊤

−(∂J

∂U )U⊤)H]

(4)

=⟨(∂J

∂U )U⊤−U(∂J

∂U )⊤

, H⟩,

where equality 1follows from Equation .; equality 2is a property of the trace;

equality 3holds because H∈so(D)and is therefore antisymmetric; and equality 4

corresponds to our deﬁnition of the scalar product in the Lie algebra.

By the deﬁnition of the gradient, its projection onto Hmust be equal to the

partial derivative ∂J(exp(tH)/∂t,

⟨∇so(D)J, H⟩=∂J(exp(tH)/∂t

=⟨(∂J

∂U )U⊤−U(∂J

∂U )⊤

, H⟩,

and since this equality holds for all unit vectors H∈so(D)it follows that,

∇so(D)J(exp(M)) = (∂J

∂U )U⊤−U(∂J

∂U )⊤

.(.)

Using this equation, we can compute the gradient in the Lie algebra so(D)at M= 0

for any objective function J(exp(M)) = J(U)for which we know the partial

derivative∂J/∂U. is isaclassical approach, forexampleinICA [Plumbley ()].

38 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

Let us now have a look at the structure of the gradient of J(U). Since only

the ﬁrst drows of the update rotation Uaﬀect the objective, only the ﬁrst drows

of the derivative ∂J/∂U are non-zero. e non-zero part of ∂J/∂U is given by

Equation .. As we evaluate the gradient at the identity M= 0, and since the

upper d×dsubmatrix of ∂J/∂U is symmetric, it follows that the gradient in the

Lie algebra has the form,

∇so(D)J(exp(M))M=0 =∂J

∂U −(∂J

∂U )⊤

=[0Z

−Z⊤0],(.)

where Z∈

d×Dis the non-zero part. is block-structure has an intuitive in-

terpretation: the two square zero parts of dimension dand (D−d)correspond to

rotations within the stationary and non-stationary source space respectively. ese

do not change the objective. e non-zero part Zcontains the angles between the

two subspaces. us we can reduce the number of variables in the optimisation to

d(D−d).

e gradient (Equation .) deﬁnes a search direction in the Lie algebra, which

allows for geodesic line search over the rotations using the exponential map. e

whole procedure is summarised in Algorithm .

Algorithm  Stationary subspace analysis: ﬁnding the projection Bs∗to the most

stationary sources

: function SSA(d, µ1, . . . , µn,Σ1, . . . , Σn)

: Compute the average mean ¯µ←1

n∑i=1 µi.

: Compute the average covariance matrix ¯

Σ←1

n∑i=1 Σi.

: Center the epochs µi←(µi−¯µ)∀i

: Compute the whitening matrix W←¯

Σ−1

: Let B←random rotation matrix in SO(D)

: Initialise the epochs µi←(BW)µiand Σi←(BW)Σi(BW)⊤∀i

: repeat

: Let G← ∇so(D)J(U)be the gradient of the objective at U=I

: Perform a line search for t∗on the geodesic {exp (t G)|t∈

}

: Let U←exp(t∗G)be the update rotation

: Store the update B←UB

: Rotate the epochs µi←Uµiand Σi←UΣiU⊤∀i

: until convergence

: Let Bs∗←IdBW be the stationary projection

: return Bs∗

: end function

3.2. THE SSA ALGORITHM 39

Finding the most non-stationary sources

As we have seen in the previous sections, the true non-stationary sources cannot be

identiﬁed from the mixture because one can add arbitrary stationary components

without changing their non-stationary nature. However, in practice one is oen

interesting in ﬁnding the most non-stationary directions instead of recovering the

true non-stationary directions. is, however, can not necessarily be achieved by

projecting orthogonal (in the whitenend basis) to the stationary components, i.e.

choosing the bottom D−drows of the data rotation as the non-stationary pro-

jection (see Algorithm ). e right panel of Figure . shows a situation where

the projection that is orthogonal to the true s-projection yields an almost stationary

latent variable, as the projected variance in the two epochs is almost equal. As it

turns out, whenever there are changes in the correlation between the stationary and

non-stationary variables, we can ﬁnd more non-stationary directions by explicitly

maximizing the non-stationarity in a second step.

Σ1

Σ2

Σ1

Σ2

Figure .: e le panel shows two epoch covariance matrices where the covari-

ance between the stationary and the non-stationary direction does not change.

Here, projection orthogonal to the stationary subspace yields the most non-

stationary sources. In the right panel, the non-stationarity is mostly contained

in the time-variable covariance; projecting orthogonal to the stationary subspace

yields only mildly non-stationary sources.

Let us analyse the situation more rigorously. We consider the simple case where

we have one stationary and one non-stationary source with corresponding nor-

malised basis vectors ∥As∥= 1 and ∥An∥= 1 respectively, where ϕbe the angle

between the two spaces, i.e. cos ϕ=As⊤An. We will consider an arbitrary pair of

epochs, T1and T2, and show which projection maximises the diﬀerence in mean

∆µand variance ∆σbetween them, in relation to the true stationary projection Bs.

Let X1and X2be bivariate random variables modeling the distribution of the

data in the two epochs. According to the linear mixing model, we can write X1and

40 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

X2in terms of the underlying sources,

X1=AsXs+AnXn1

X2=AsXs+AnXn2

where the univariate random variable Xsrepresents the stationary source and the

two univariate random variables Xn1and Xn2model the non-stationary source,

whose probability distribution changes between the two epochs, i.e. Xn1is not equal

to Xn2in distribution. In the following, we will assume without loss of generality,

that the true s-projection Bs= (An)⊥is normalised, ∥Bs∥= 1. We can write the

estimated n-projection as a linear combination of Bsand An,

Bn=αBs+βAn⊤,(.)

with α, β ∈

such that ∥ˆ

Bn∥= 1.

In the next step, we will observe which n-projection maximises the diﬀerence in

mean and covariance between the two epochs. Let us ﬁrst consider the diﬀerence in

the mean of the estimated n-sources,

∆µ= E[ ˆ

BnX1]−E[ ˆ

BnX2] = ˆ

BnAn(E[Xn1]−E[Xn2])

which is maximal for ˆ

BnAn= 1, i.e. when ˆ

Bnis orthogonal to Bs. us, with

respect to the diﬀerence in the mean, choosing the n-projection ˆ

Bnto be orthogonal

to the s-projection is always optimal. e diﬀerence in the variance of the estimated

n-sources is,

∆σ= var[ ˆ

BnX1]−var[ ˆ

BnX2] = β2(var[Xn1]−var[Xn2])

+ 2 [αcos (ϕ+π

2)+βcos ϕ](cov[Xs, Xn1]−cov[Xs, Xn2])

| {z }

=∆σsn

Clearly, when there is no change in the covariance of the s- and the n-sources be-

tween the two epochs (∆σsn = 0), this diﬀerence is maximised when βis maximal,

which by Equation . implies that ˆ

Bn= (Bs)⊥. However, when the covariance

between s- and n-sources does vary (|∆σsn |>0), the derivative of ∆σwith respect

to αat α= 0 is ,

∂∆σ/∂α|α=0 = 2 cos (ϕ+π

2)∆σsn .

Hence α= 0 is not an extremum when |∆sn|>0, which means that the most

non-stationary projection is not orthogonal to the s-projection in this case.

us, in order to ﬁnd the projection to the most non-stationary sources, we also

need to maximise the non-stationarity of the estimated n-sources. To that end, we

3.2. THE SSA ALGORITHM 41

simply maximise the SSA objective function (Equation .) for the n-projection, i.e.

Bn∗= argmax

BB⊤=I

∑

i=1 (−log det(BΣiB⊤) + ∥Bµi∥2).

e procedure is completely analogous to Algorithm , apart from the fact that we

reverse the sign of the objective.

Stationary subspace analysis in deation mode

SSA factorises the observed data into two groups of sources, both of which can only

be determined up to arbitrary linear transformations. In contrast to a PCA or ICA

solution, the results of SSA are two multidimensional subspaces and not a set of one-

dimensional components. is can make the solution more diﬃcult to interpret,

because there is no unique set of projections resp. basis elements. erefore, in some

scenarios it is desirable to perform SSA univariately, in so-called deﬂation mode.

is means that we ﬁnd a set of uncorrelated one-dimensional components which

are ordered by their degree of non-stationarity.

In each step, we optimise the non-stationarity of a single component; aer con-

vergence, we ﬁndthenextcomponentintheorthogonal complement(in thewhitened

basis). is procedure is called deﬂation mode because in each step, we reduce the

dimensionality of the data space by one. In this way, we ignore changes in the cor-

relation between variables. However, the advantage of this approach is that the so-

lution is unique (up to local minima). What is more, the fact that we have factorised

the input space into a set of one-dimensional subspace allows for a comparison of

their non-stationarity in terms of the value of the objective function.

3.2.1 Relationship to statistical testing

e absolute value of our measure of stationarity is hard to interpret: it is bounded

from below by zero for perfectly stationary sources, but it can grow arbitrarily large.

On ﬁnite samples it will never be exactly zero even for perfectly stationary data, since

ﬂuctuations in the moment estimates introduce diﬀerences across epochs. However,

the value of the objective function is an important ingredient for interpreting an SSA

solution. For instance, in order to determine an adequate number of stationary di-

rections in a data-driven manner, one can increase the number of stationary sources

until the non-stationarity reaches a certain critical threshold. In order for this work,

one needs to normalise the objective function value such that it becomes comparable

across diﬀerent numbers of dimensions, epochs and sample sizes.

In this section, we show that our measure of stationarity is equivalent to a test

statistic: under the null hypothesis of perfectly stationary Gaussian data, its distribu-

tion converges to a χ2-distribution [Blythe et al. ()]. e test statistic compares

two competing Gaussian models for the data: the simple model H0that presuppose

42 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

that each epoch has the same mean and covariance; and the alternative model HA,

under which each epoch has its own set of parameters. is is a test for model com-

parison: it tells us whether we should reject the simple model H0in favour of the

more complex HA. Since the two models are nested (i.e. H0is a special case of HA),

and since we obtain the parameters by maximum likelihood, this hypothesis can

be tested in the likelihood ratio testing framework [Neyman and Pearson ()],

where the test statistic is the log of the ratio of the likelihood of the data under H0

and HA.

More formally, let X1, . . . , Xnbe d-variate random variables modeling the dis-

tribution of the data in each of the nepochs. Aer centering and whitening, the

hypothesis can be written as follows.

H0:X1, . . . , Xn∼ N(0, I)

HA:X1∼ N(µ1,Σ1), . . . , Xn∼ N(µn,Σn)

Let Xbe our data set that is segmented into epochs X=T1∪ ··· ∪ Tnand let

LH0(X)and LA(X)be the likelihood of the data under H0and HArespectively

where the parameters are the maximum likelihood estimates. en the likelihood

ratio test statistic Λ(X)is,

Λ(X) = −2 log LH0(X)

LA(X),(.)

whichunderthe nullhypothesisH0isapproximatelyχ2distributedwithk=1

2nd(d+

3) degrees of freedom. It can be shown that by applying the variance-stabilizing

transformation,

Λ′(X) = √2Λ(X),

we arrive at a test statistic that is approximately Gaussian distribution with mean

µ=√2k−1and unit standard deviation σ= 1.

We will now show the equivalence between the test statistic Equation . and the

SSA objective function. For the sake of simplicity, let us assume that all epochs are of

equal size m=|T1|=··· =|Tn|. e estimates of the epoch means and covariance

matrices are given by ˆµi=1

m∑x∈Tixand ˆ

Σi=1

m−1∑x∈Ti(x−ˆµi)(x−ˆµi)⊤

for all epochs 1≤i≤n. As in the derivation of the SSA algorithm, we assume that

the data is centred and whitened such that 1

n∑n

i=1 ˆµi= 0 and 1

n∑n

i=1 ˆ

Σi=I.

3.2. THE SSA ALGORITHM 43

en the log-likelihood of the data under H0is given by,

log LH0(X) =

∑

i=1 ∑

x∈Ti

log pN(x; 0, I) = −1

∑

i=1

md log(2π) + ∑

x∈Ti

x⊤x

(1)

=−1

∑

i=1

md log(2π) + trace 



∑

x∈Ti

xx⊤−ˆµiˆµ⊤

i

+mˆµiˆµ⊤

i



=−1

∑

i=1

md log(2π) + mˆµ⊤

iˆµi+ trace [(m−1)ˆ

Σi]

(2)

=−1

∑

i=1 (md log(2π) + mˆµ⊤

iˆµi)+(n

∑

i=1

(m−1))trace [I]

=−1

∑

i=1

md log(2π) + mˆµ⊤

iˆµi+ (m−1)d,

where equality  follows from the identity that x⊤x= trace(xx⊤)and the equal-

ity  is a consequence of the fact we have set the average epoch mean and epoch

covariance matrix to 0and Irespectively.

e log-likelihood of the data under the alternative hypothesis HAis given by,

log LHA(X) =

∑

i=1 ∑

x∈Ti

log pN(x; ˆµi,ˆ

Σi)

=−1

∑

i=1

md log(2π) + mlog det ˆ

Σi+ (m−1)d.

us the test statistic is,

Λ(X) = −2(log LH0−log LHA)

∑

i=1 (−log det ˆ

Σi+ ˆµ⊤

iˆµi),

which is equivalent to the SSA objective function in the sense that its value is the

same on source estimates up to the multiplication by the epoch sample size m. In

this derivation, we assumed that each epoch consists of the same number of samples.

If we allow arbitrary sample sizes, then in the test statistic each term in the sum is

weighted by the respective epoch sample size.

As we noted earlier, the distribution of the statistic Λis only approximately χ2-

distributed and, most importantly, this holds only under the assumption that the

44 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

data is in fact Gaussian distributed. How useful this test statistic is in practice there-

fore depends crucially on (a) how well its distribution is approximated by the χ2

distribution and (b) what happens when the data is non-Gaussian.

A non-parametric approach to testing for stationarity is based on resampling.

e basic idea is to generate samples from the objective function on stationary data

sets that have the same properties as the input in terms of the shape of the distribu-

tion (kurtosis etc.). To this end, we put all data together and divide it randomly into

epochs of the same size as speciﬁed in the application of the SSA algorithm. Since

the data points are assigned randomly to epochs, every epoch is sampled from the

same distribution. Evaluating the objective function on one such assignment pro-

vides us with one sample from the objective function on stationary data of this kind.

By doing this repeatedly, we can approximate this distribution using a histogram, of

estimate quantiles for statistical testing. In Chapter , we compare the parametric

and the non-parametric approach to stationarity testing in controlled simulations.

3.3 Spurious stationarity

In Section ., we have seen that we can identify a basis for the true non-stationary

subspace: recovering stationary sources is equivalent to projecting orthogonal to the

true non-stationary subspace. However, this argument relies on the assumption that

every contribution from the non-stationary subspace yields a non-stationary latent

variable. If we observe a limited amount of variation, then this may not be the case

as the following example shows.

Figure .: Example of a spurious stationary direction. Both plots show the joint

distributions of non-stationary latent variables. In the le panel, we have observed

two diﬀerent joint distributions, with the same epoch but diﬀerent covariance ma-

trices (blue and red contours). ere exists a direction on which the projection

agrees (black line). is is a spurious stationary direction, because it is located

in the non-stationary space. In the right panel, we add a third joint distribution:

now there exists no spurious stationary direction.

Imagine, for instance, that we have two non-stationary latent variables and that

3.3. SPURIOUS STATIONARITY 45

we observe two diﬀerent joint distributions Xn

1and Xn

2, which diﬀer only with

respect to their covariance matrix, Σ1= cov(Xn

1)and Σ2= cov(Xn

2)whereas

µ(Xn

1) = µ(Xn

2). e le panel of Figure . shows the contours of the covariance

matrices Σ1and Σ2. Note that there exists a direction w∈

2(black line) on which

the variance of the two random variables is equal: w⊤Σ1w=w⊤Σ2w. Accord-

ing to our deﬁnition of stationarity, this is a stationary direction, even though it lies

in the non-stationary subspace: this is a spurious stationary direction. Only aer

we add a third covariance matrix (right panel) does this spurious stationary direc-

tion disappear. e existence of spurious stationary directions makes it impossible

to identify the true non-stationary subspace. We therefore need a criterion which

guarantees that there are no spurious stationary directions.

In order to derive a theoretical result, we consider the generic case in order to

exclude pathological cases. In other words, we assume that the true covariance and

mean of the epochs do not have any special properties, other than being equal on

the true stationary coordinates. Under this assumption, we can derive the following

bound on the number of necessary epochs.

Proposition  Let nbe the number of epochs in Ddimensions and let d<Dbe

the number of stationary directions. Moreover, let us assume that the mean and

the covariance matrix of the non-stationary sources in each epoch is generic, in the

sense that with probability one no equation between their elements is fulﬁlled. en

for any number of epochs nfor which holds,

n > D−d

2+ 1,

there exists no spurious stationary direction. In the case where the mean vectors are

known to be constant, this bound becomes,

n > D −d+ 1.

e rigorous proof relies on results form generic algebraic geometry; it can be

found in [Kiraly et al. ()]. e geometric intuition is that each available epoch

moment reduces the degrees of freedom for a spurious stationary direction by one.

e numerator in the ﬁrst bound therefore stems from the fact that each epoch pro-

vides us with two moments. us if integrated higher order moments into the SSA

algorithm, the number of required epochs would be smaller—under the assumption

that they carry information, i.e. variation in the non-stationary directions.

is bound is good news for practical applications: the number of necessary

epochs grows only linearly with the size of the non-stationary subspace. is is con-

sistent with the results of the simulations presented in Chapter .

46 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

3.4 Summary and discussion

In this chapter, we have introduced a new type of latent variable model: the ob-

served data is a linear mixture of stationary and non-stationary variables, which are

not assumed to be independent. Based on a pragmatic deﬁnition of stationarity, we

have analysed which parameters are identiﬁable in principle, and up to what symme-

tries. We derived a computationally eﬃcient SSA algorithm, which optimises over

the special orthogonal group SO(D)using Lie group properties. e equivalence

of the objective function with a statistical hypothesis test for stationarity suggests

a normalisation. Finally, we investigated the phenomenon of spurious stationarity:

given a set of generic mean vectors and generic covariance matrices, under which

condition is there no subspace on which they are identical?

Beyond weak stationarity

e most relevant limitation of the current algorithm is its deﬁnition of stationarity,

which ignores temporal structure. As long as the power of a source remains constant

it is deemed stationary, no matter what happens to the frequency content. However,

in many situations it is the variation in the time structure thats is most interesting.

Similarly, changes in higher order moments go unnoticed by the current SSA objec-

tive. Even though the ﬁrst two moments are easier to interpret and estimate, there

may be situations where changes e.g. in the kurtosis are relevant.

ere is a number of strategies for addressing both issues. A straightforward

approach for integrating temporal information would be to consider time-lagged

covariance matrices in addition to the epoch covariance matrices as in temporal

ICA. e advantage of this technique is that most of the optimisation problem’s ap-

pealing features can be retained. Using higher order information is less straightfor-

ward for various reasons. Whereas the Gaussian approximation is a sensible choice

for comparing distributions up to their ﬁrst two moments, it is less clear how to

do this for higher moments in a way which corresponds to a natural weighting of

the distances in the individual moments. Expansions of the Kullback-Leibler diver-

gence for higher moments already become computationally unwieldy in the case of

the third and fourth moment. Non-parametric approaches oen do not lend them-

selves well to optimization, because one can not compute a derivative with respect

to a linear projection of the data.

Finding the number of stationary sources

roughout this chapter, we have tacitly assumed that the true number of stationary

sources is known. While this may be true in certain special cases, for most real data

sources it is not. Manual approaches of selecting this parameter have their limits.

In some scenarios, increasing the number of stationary sources until the estimated

3.4. SUMMARY AND DISCUSSION 47

sources become “too non-stationary” by visual inspection may be adequate. Draw-

ing this line, however, is not straightforward using just a subjective assessment of

the result. Firstly, the change in correlations is diﬃcult to assess by visual inspec-

tion. Secondly, one needs to account for the fact that by increasing the number of

sources, more changes in the distribution are inevitable. We outlined a solution to

this problem using two kinds of stationarity tests, which are evaluated empirically in

Chapter . In the next chapter, we present an alternative approach based on an alge-

braic view. is has the advantage that one can evaluate all possible choices for the

number of stationary sources simultaneously by looking at an eigenvalue spectrum.

Design of the epoch structure

e SSA objective function evaluates distribution changes between epochs of a time

series. Unless the way in which the data is segmented into epochs is somehow de-

termined by the application context (e.g. aligned to experimental markers or cor-

responding to meaningful temporal intervals), the setup of the epoch structure is a

crucial model parameter. It deﬁnes the time scale on which distribution changes are

visible to the objective function, and the time points at which they are detectable. In

the purely unsupervised case, where one wants to make as little assumptions about

the data as possible, a good choice is to use a sliding window instead of a segmenta-

tion. In order to increase the temporal resolution, a short window is clearly desir-

able. Even though this increases the eﬀect of small sample errors, our simulations in

Chapter  show that until the windows get very short (approximately less than 2D

samples), the impact on the accuracy is negligible.

Post-processing and pre-processing, interpretation

e result of inverting the SSA model are two multivariate sets of sources: desti-

mated stationary sources and D−destimated non-stationary sources, where the

latter have been optimised for maximum non-stationarity. As we have seen before,

both groups can only be determined up to arbitrary linear transformations. e in-

dividual sources, therefore, have no particular meaning; the same is true of the basis

of the estimated non-stationary subspace. is makes it diﬃcult to interpret an SSA

solution.

ere are several ways of making an SSA solution unique, and possibly mean-

ingful, in a post-processing step. In order to allow for univariate interpretation and

since independent sources perhaps correspond to underlying latent variables, one

can apply ICA to both sets of sources. A way of post-processing that is in line

with analysing non-stationarities is applying SSA in deﬂation mode to both sets of

sources. is provides us with two unique sets of sources, each of which is ordered

by its degree of non-stationarity.

As in factor analysis [Harman ()], in some applications one may be primar-

48 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

ily interested in the coeﬃcients of the basis of the non-stationary subspace. For ease

of interpretation, it is desirable that each basis vector has only a few large coeﬃcients,

so that one can assign (meaningful) latent variables to directions in data space. To

that end, there exist a variety of methods for post-rotating a basis to achieve this

[Abdi ()]; the most prominent is Varimax [Kaiser ()], which maximises

the variances of the coeﬃcients in each basis vector by application of a rotation.

PCA and ICA are not SSA

PCA, ICA and SSA are similar in that their generative model is a linear mixing of la-

tent variables (see Chapter ). e crucial distinction lies in the speciﬁc assumptions

about the mixing matrix and the latent variables. First of all, in the SSA model there

are two groups of sources with prescribed properties whereas PCA and ICA assume

the existence of Dunivariate components. For this reason, it is not straightforward

to compare PCA and ICA solutions to an SSA solution. But most importantly, it is in

general not possible ﬁnd the most stationary or the most non-stationary projection

by combining PCA or ICA directins in a post-processing step. at is because the

criteria of PCA and ICA algorithms are completely unrelated to stationarity resp.

non-stationarity.

Epoch 1 Epoch 2 Epoch 3 Epoch 4

Whole dataset

(input basis)

Whole dataset

(ICA basis)

PCA directions

ICA directions

Figure .: Exemplary data set from Chapter  with the PCA and ICA solutions.

Figure . shows the PCA and ICA solutions for the illustrative data set ﬁrst

introduced in Figure .. As we had already seen in Chapter , both PCA and ICA

yield uninformative bases in the sense that they do not separate the stationary from

the non-stationary components. In the two right panels, we see why this is the case.

PCA diagonalises the average covariance matrix. Since the vertical axis has higher

variance and is, on average, uncorrelated with the horizontal axis, PCA does not

change the basis (signiﬁcantly) — the distribution changes remain hidden. ICA, on

the other hand, is looking for two independent coordinates. As we can see from the

scatter plot of the whole data set, the input coordinates are uncorrelated on average

but not independent: the data appears to be generated by a Gaussian mixture model

3.4. SUMMARY AND DISCUSSION 49

with (at least) two independent components, and it is these that the ICA algorithm ²

identiﬁes.

²Here we have used the JADE algorithm [Cardoso ()].

50 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS

Chapter 4

An algebraic approach

In the previous chapter, we presented an optimisation algorithm for ﬁnding the sta-

tionary projection. In essence, this algorithm searches the space of all possible pro-

jections guided by an objective function that is zero at the optimal solution (see

Figure .). In this chapter, we explore a diﬀerent route: instead of searching the

space of possible solutions, we directly solve the SSA problem algebraically.

is is possible because the set of stationary projections can be described in

terms of equations: in the optimal stationary coordinates, all epoch mean vectors

are equal to the null vector and all epoch covariance matrices are equal to the iden-

tity matrix. More formally, an optimal stationary projection Bsis deﬁned by the

system of linear and quadratic matrix equations,

Bsµ1= 0

Bsµn= 0

BsΣ1(Bs)⊤=I

BsΣn(Bs)⊤=I,

where nis the number of epochs. From this point of view, the SSA objective function

can be seen as measuring the extent to which this system of polynomial equations

is not fulﬁlled for a particular projection.

e gist of the algebraic algorithm is that we manipulate this system of equations

in a way that reveals the stationary projection. is can be understood in analogy

to solving a system of polynomial equations. However, as the coeﬃcients of these

equations are derived from epoch moments estimated on ﬁnite sets of samples, the

exact set of solutions will always be empty in the generic case. To this end, we de-

velop a notion of ﬁnding an approximate linear space of solutions. Even though this

is an approximate solution, it can be found in closed form.

e presented algorithm was originally derived from an algebraic-geometric

52 CHAPTER 4. AN ALGEBRAIC APPROACH

0 20 40 60 80 100 120 140 160 180

Objective function

Angle of projection (α)

Objective function over all possible projections

Covariance matrices

True solution (α=50)

Local minimum (α=154)

Figure .: Illustration of the optimisation approach to SSA. e le panel shows

the contour plots of three sample covariance matrices. e black line is the true

one-dimensional subspace on which the projected variances are exactly equal

while the magenta line corresponds to a local minimum of the objective func-

tion. e right panel shows the value of the objective function over all possible

one-dimensional subspaces, parameterised by the angle αto the horizontal axis.

perspective, and previous publications [Kiraly et al. (), Kiraly et al. ()] used

this terminology extensively. In this chapter, we attempt a diﬀerent presentation

that is accessible by a wider audience.

e remainder of this chapter is organised as follows. In Section ., we for-

mulate the objective of SSA in terms of a system of polynomial equations. en,

in Section ., we introduce a representation for these polynomials: the coeﬃcient

vector space. In Section ., we show how we can ﬁnd an approximate solution to

our set of equations by linear algebra in the coeﬃcient vector space. e connec-

tion to algebraic geometry is brieﬂy described in Section .. In the last Section,

., we summarise our ﬁndings and discuss the relative merits of the algebraic algo-

rithm and the prerequisites for using the algebraic approach in other circumstances

in machine learning.

4.1 From moments to polynomials

In this section, we start by reformulating the objective of the SSA algorithm in terms

of polynomials. Until Section ., we will assume that these polynomials are exact,

in the sense that they have been derived from the exact (true) epoch moments, and

not estimates thereof. is is of course never true in practice, where moments are

estimated from ﬁnite and noisy samples. However, this assumption allows for an

easy introduction of the main concepts before we turn to an approximate treatment.

According to the SSA mixture model, an optimal s-projection Bscan be deﬁned

as being dual to the non-stationary subspace. Alternatively, we can deﬁne the op-

timal stationary projection in terms of the moments in the stationary coordinates

(under the assumption that there are spurious stationary directions). at is, an

4.1. FROM MOMENTS TO POLYNOMIALS 53

optimal s-projection fulﬁlls the equations,

Bsµi=Bsµj

BsΣi(Bs)⊤=BsΣj(Bs)⊤∀1≤i=j≤n.

is set consists of n(n−1) equations between all pairs of epochs, which are re-

dundant because of transitivity; equivalently, we can compare each projected pair

of moments against one reference epoch. As in the optimisation-based algorithm,

we chose the average over all epoch as the reference, which has mean ¯µand co-

variance matrix ¯

Σ. us we characterise the set of solutions by a number of matrix

equations that is linear in the number of epochs,

Bsµi=Bs¯µ

BsΣi(Bs)⊤=Bs¯

Σ(Bs)⊤∀1≤i≤n.

Note that this set of equations is still not independent, because the equality of n−1

projected moments with the average epoch implies the equality of all. As we have

seen before, if we centre the mean vectors such that ¯µ= 0 (translating all the data by

the same vector does not change solution) and choose a basis in which the average

covariance matrix is the identity, ¯

Σ = I, (whitening), we arrive at a set of 2(n−1)

homogeneous matrix equations which equivalently describe the set of solutions,

Bsµi= 0

Bs[Σi−I] (Bs)⊤= 0 ∀1≤i≤(n−1),(.)

where Bsis a matrix with linearly independent rows.

e optimal stationary projection Bsis of course only unique up to arbitrary

linear transformations of its image, which leaves the stationary nature of the esti-

mated sources unchanged. In the following, we therefore state the goal as ﬁnding a

basis for the space of s-projections.

Deﬁnition  e d-dimensional linear subspace of all s-projections S⊂

Dis the

dual space of the true non-stationary subspace, or in other words, the row space of

an optimal stationary projection Bs.

In the next step, we translate the system of Equations . into a set of polyno-

mials. is establishes the le hand sides of these equations as objects in their own

right, which are amenable to algorithmic algebraic manipulations.

Deﬁnition  Amonomial min Dvariables T1, . . . , TDis a formal product,

m=Tα1

1Tα2

2···TαD

which we abbreviate as m=Tα, where α∈

0is the vector of exponents. A

monomial is therefore uniquely identiﬁed by its exponent vector. e degree of a

54 CHAPTER 4. AN ALGEBRAIC APPROACH

monomial deg(m)is the sum of its exponents,

deg(Tα) =

∑

i=1

αi.

Apolynomial pin Dvariables over

is a sum of ﬁnitely many monomials,

p=∑

α∈

aαTα,

where aα∈

is the coeﬃcient of the monomial Tα, and only ﬁnitely many of the

coeﬃcients are nonzero. A polynomial is homogeneous if all monomials with non-

zero coeﬃcients have the same degree. e evaluation of a polynomial pon a vector

of values w∈

Dis a number in

written as p(w)and is the result of replacing all

formal variables by values,

p(w) = ∑

α∈

aαwα1

1···wαD

D∈

Using this deﬁnition, we write the le hand sides of the Equations . as poly-

nomials, where the variables T1, . . . , TDcorrespond to the coordinates in the input

vector space and the coeﬃcients are given by the elements of the mean vectors and

covariance matrices. us we obtain a set of linear polynomials ℓ1, . . . , ℓn−1and

quadratic polynomials q1, . . . , qn−1, using formal vector notation as,

ℓi=[T1···TD]µi

qi=[T1···TD]Σi









∀1≤i≤(n−1).

Note that these polynomials are homogeneous of degree one and two, respectively.

Let us now turn to its set of solutions. In the language of algebraic geometry, the

set of solutions to a set of polynomials is their vanishing set.

Deﬁnition  e vanishing set Vof a set of polynomials p1, . . . , pmin Dvariables

is their common set of solutions in

V(p1, . . . , pk) = {x∈

D|p1(x) = ···=pm(x) = 0}.

Whereas by deﬁnition Sis a subset of V(ℓ1, . . . , ℓn−1, q1, . . . , qn−1), the van-

ishing set need not be to the space of s-projections S. Not only is the null vec-

tor an element of the vanishing set (but not an s-projection), but by translating

4.2. THE VECTOR SPACE OF POLYNOMIALS 55

the matrix Equations . into polynomials, we have implicitly dropped the con-

straint that the s-projections must be elements of a d-dimensional linear space. e

goal of our algorithm is to ﬁnd the d-dimensional linear subspace contained in

V(ℓ1, . . . , ℓn−1, q1, . . . , qn−1). Whether this subspace is unique depends on the ob-

served input polynomials ℓ1, . . . , ℓn−1, q1, . . . , qn−1. is question is addressed in

the next section.

4.2 The vector space of polynomials

e representation in which we compute with polynomials is the vector space of their

coeﬃcients. For example, a homogeneous quadratic polynomial pin two variables,

p=α11T2

1+α12T1T2+α22T2

is represented by its coeﬃcient vector, p =[α11 α12 α22]⊤∈

3, where the

coordinates correspond to the monomials of degree two in two variables, T2

1,T1T2

and T2

2. Figure . illustrates how the equality of two projected covariance matri-

ces is translated into a polynomial equations that is embedded in coeﬃcient vector

space.

Figure .: e le panel shows the contours of two covariance matrices along

with the direction von which their projections are equal. e middle panel shows

that vis the solution to a quadratic equation, and the right panel illustrates how

this equation becomes a vector in the coeﬃcient space C2of quadratic homoge-

neous polynomials.

By representing polynomials in terms of coeﬃcient vectors in a ﬁnite dimen-

sional vector space we lose structure. e coeﬃcient vector space is not isomorphic

to the polynomial ring: whereas the addition of polynomials translates into the ad-

dition of coeﬃcient vectors, this is not true for the multiplication of polynomials. In

particular, the product of two polynomials of degree at least one is not embeddable

in the same coeﬃcient vector space, because it increases the degree of the monomi-

als. However, the coeﬃcient vector space is a suitable representation for our algo-

rithm, as we will see later. In particular, the coeﬃcient vector space comes with a

56 CHAPTER 4. AN ALGEBRAIC APPROACH

natural distance measure between polynomials, which allows us to treat uncertainty

in the polynomial coeﬃcients using familiar tools from machine learning.

e representation of a polynomial in the coeﬃcient vector space of degree k

depends on an arbitrary ordering of all monomials of degree k. For computational

reasons, we adopt the lexicographical ordering, because it lends itself to fast enu-

meration and selection of certain monomials.

Deﬁnition  e coeﬃcient vector space of homogeneous polynomials in Dvari-

ables of degree kis a

-vector space denoted by Ck. A polynomial pof degree

kis represented in terms of its coeﬃcient vector p ∈Ckwhere the monomials

are ordered lexicographically. In this ordering, monomial M1=Tα1

1···TαD

Dap-

pears aer M2=Tβ1

1···TβD

D, denoted by M1≻M2, if there exists a variable Tl

such that αl> βland for all variables with smaller index T1, . . . , Tl−1it holds that

αi=βi.

For example, the lexicographical ordering of the monomials in three variables

of degree two is given by,

1≻T1T2≻T1T3≻T2

2≻T2T3≻T2

Let us now have a closer look at the coeﬃcient vector space. e dimension of

Ckis equal to the number of monomials in Dvariables of degree k. is is equal to

the number of ways that we can distribute the exponents among the Dvariables.

Proposition  e dimension of the coeﬃcient vector space Ckis given by,

dim Ck= ∆(k, D) = (k+D−1

k).

Proof. A monomial of degree kin Dvariables is product of kvariables chosen with

repetition from the set {T1, . . . , TD}. e number of diﬀerent monomials is there-

fore equivalent to the number of multisets of cardinality kchosen from a set of car-

dinality D.

An important sub-vectorspace

e algebraic algorithm hinges on the fact that our input polynomials are elements

of a certain sub-vectorspace of the coeﬃcient vector space (in the exact, noise-free,

case). For the linear polynomials derived from the mean vectors this is clear: by def-

inition, 

ℓ1,...,

ℓn−1are dual to Sand therefore elements of a (D−d)-dimensional

linear subspace. In the following, we see that there is such a linear subspace in the

corresponding coeﬃcient vector space for polynomials of arbitrary degree.

Deﬁnition  e set of coeﬃcient vectors corresponding to the polynomials of de-

gree kvanishing on Sis denoted by CS

4.2. THE VECTOR SPACE OF POLYNOMIALS 57

For degrees k > 1, the number of dimensions of the coeﬃcient vector space is

not immediately obvious but straightforward to calculate, as the next result shows.

Proposition  e set of coeﬃcient vectors CS

k⊂Ckof homogeneous polynomials

of degree kvanishing on Sis a linear subspace of Ckwith dimension,

dim CS

k= ∆(k, D)−∆(k, d).

Proof. CS

kis a linear subspace of Ckbecause it is closed under linear combination:

any linear combination of polynomials in CS

kis of degree kand also vanishes on S,

hence it is also an element of CS

k. In order to get its dimension we do a linear change

of the polynomials variables as follows. Let b1, . . . , bd∈

Dbe an orthonormal ba-

sis for Sand let bd+1, . . . , bDbe an orthonormal basis for its complement in

D. We

now consider polynomials in this basis, i.e. we write them in terms of new variables

T′

1, . . . , T′

Dwhere each variable T′

icorresponds to the coordinate along bi. en

for any polynomial p ∈CS

k, the coeﬃcients of all monomials that contain only the

variables T′

1, . . . , T′

dare zero, because pvanishes on S, and this is a suﬃcient con-

dition for membership in CS

k. Hence the dimension of CS

kis the total number of

monomials ∆(k, D)minus the ∆(k, d)monomials that are ﬁxed to zero.

4.2.1 Generic polynomials and identiability

In the previous sections, we formulated the SSA problem in terms of polynomials;

introduced the coeﬃcient vector space, and showed that the (exact, noise-free) input

polynomials are elements of certain subvectorspaces. In order to derive theoretical

results, we need to assume a generative model for the input polynomials. In essence,

this will allow us to exclude pathological cases in our arguments.

AccordingtotheSSAmodel, weassumethattheinput polynomialsℓ1, . . . , ℓn−1,

q1, . . . , qn−1do in fact vanish on the space of s-projections S. Apart from that, the

polynomials should be generic, in the sense that they have no special property. For

a rigorous deﬁnition, we refer to the supplemental material of [Kiraly et al. ()].

Intuitively, we can think of a generic polynomial as a polynomial-valued random

variable that has no “algebraic property” with probability one, apart from those

properties that are implied by the assumption that it vanishes on S. Algebraic prop-

erties of a polynomial are those that can be expressed in terms of polynomial equa-

tions in the coeﬃcients of the polynomial. As a concrete example, any random vari-

able that has a positive continuous probability density over the coeﬃcient vector

space CS

kyields a generic polynomial in our context.

Using the concept of generic polynomials, we can investigate under which con-

dition the true space of s-projections Scan be uniquely identiﬁed, given the input

polynomials. is means: under which condition is Sthe only d-dimensional linear

subspace in the vanishing set V(ℓ1, . . . , ℓn−1, q1, . . . , qn−1)?

58 CHAPTER 4. AN ALGEBRAIC APPROACH

Proposition  Let ℓ1, . . . , ℓn−1and q1, . . . , qn−1be generic linear and quadratic

polynomials derived from nepochs which vanish on the d-dimensional linear sub-

space of s-projections S.

en Sis the the unique d-dimensional linear subspace in the vanishing set

V(ℓ1, . . . , ℓn−1, q1, . . . , qn−1)if the number of epochs is at least,

n−1≥D−d+ 1

e proof can be found in [Kiraly et al. ()]. Intuitively, one can think of

each of the 2(n−1) generic polynomials as reducing the degrees of freedom by

one, until Sremains as the only d-dimensional linear subspace.

An important consequence of the genericity assumption is that the coeﬃcient

vectors in CS

1and CS

2corresponding to the observed polynomials are each linearly

independent with probability one.

Proposition  e coeﬃcient vectors corresponding to the generic homogeneous

polynomials p1, . . . , pmof degree kwith m≤dim CS

kare linearly independent

with probability one.

Proof. If p1, . . . , pmare linearly dependent then there exists a vector pisuch that

pi∈span{pj}j=i. However, in order for this to be the case, at least one equation

between the coeﬃcients of piand the other vectors need to be fulﬁlled, which holds

with probability zero.

4.2.2 Generating polynomials of higher degree

In the last section, we have shown that under genericity assumptions, our (exact,

noise-free) linear and quadratic input polynomials are linearly independent ele-

ments of the subspaces CS

1and CS

2in coeﬃcient vector space respectively. However,

they may not provide us with a basis (which is essential for the algorithm to work),

as the number of dimensions might be larger than the number of observed poly-

nomials. In fact, this is likely to be the case in practice. As we had seen before, the

number of dimensions of the coeﬃcient vector space grows rapidly with the number

of dimensions of the data space; recall that

dim CS

2= ∆(2, D)−∆(2, d) = 0.5[D(D+ 1) −d(d+ 1)].

Even for a moderate number of input dimensions, one would thus need a large num-

ber of epochs n≥dim CS

2+1 in order for the derived polynomials to span CS

2(see

Figure .). is is the problem addressed in this section.

For reasons that will become clear later, we focus on the quadratic input poly-

nomials and show how we can generate a larger set of polynomials of higher degree

4.2. THE VECTOR SPACE OF POLYNOMIALS 59

0 5 10 15 20 25 30

100

150

200

250

300

350

400

450

500

Number of stationary directions

Dimension of coefficient vector space

Num. of dimensions D=10

Num. of dimensions D=20

Num. of dimensions D=30

Figure .: Growth of the number of dimensions of the vector space CS

2of

quadratic homogeneous polynomials vanishing on S(vertical axis) for diﬀerent

numbers of input dimensions D(blue, red, green curve) and number of station-

ary directions d(horizontal axis). e number of dimensions of CS

2is quadratic

in D.

k≥2which (a) is guaranteed to span CS

kand (b) has the same vanishing set as the

input.

Recall that a vector x∈

Dis an element of the vanishing set V(q1, . . . , qn−1)

if and only if xis a solution to the system of polynomial equations,

q1(x) = 0

qn−1(x) = 0.

Since these equations are homogeneous quadratic, multiplying each with all non-

zero monomials Mof a ﬁxed degree k′does not change the set of solution, i.e. the

extended system,

Mq1(x) = 0

Mqn−1(x) = 0 ∀M=Tα1

1···TαD

Dwith

∑

i=1

αi=k′,

has the same vanishing set as V(q1, . . . , qn−1). Moreover, every choice of monomial

Myields a distinct set {Mq1, . . . , Mqn−1}of polynomials of degree deg M+ 2.

us if we multiply by all monomials of a certain degree, we obtain a larger set of

polynomials which is equivalent to the input in terms of its vanishing set.

60 CHAPTER 4. AN ALGEBRAIC APPROACH

Proposition  Let q1, . . . , qn−1begenerichomogeneousquadraticpolynomialsvan-

ishing on the set of s-projections S. e set of polynomials obtained by multiplica-

tion with all monomials of ﬁxed degree,

Pk={Mq1, . . . , Mqn−1|Mis a monomial of degree k−2},

contains (n−1)∆(k−2, D) = |Pk|distinct polynomials of degree k.

2 3 4 5 6

100

1000

10000

Degree of polynomials

Rank of polynomials

Dim. of coeff. vector space

Num. of polynomials

dim CS

4>|P4|>rank P4

|P5|>dim CS

5>rank P5

rank P6=dimCS

Figure .: Growth of the rank of the set of polynomials Pk(red curve) in coeﬃ-

cient vector space, the number of polynomials |Pk|(green curve), and the dimen-

sions of the coeﬃcient vector space CS

k(blue curve) with the degree k(horizontal

axis), for 11 generic polynomials in d= 10 dimensions, of which one is station-

ary. For degrees smaller than three, dim CS

kis much larger than the number of

polynomials; P5contains more polynomials than dim CS

5, but the rank is smaller.

Here we need to multiply up to k= 6 to ensure that P6spans CS

As the number of polynomials in Pkgrows with the degree k, so does the num-

ber of dimensions of the vector space CS

k. e number of polynomials |Pk|grows

faster than the number of dimensions of the corresponding coeﬃcient vector space,

lim

k→∞ |Pk|

dim CS

= lim

k→∞

(n−1)∆(k−2, D)

∆(k, D)−∆(k, d)>1.

However, the obtained set of polynomials is in general not linearly independent, i.e.

rank Pk<|Pk|. Figure . shows experimental results on simulated data. While it

can be shown that there always exists a ksuch that the coeﬃcient vectors of Pkspan

k[Kiraly et al. ()], in practice it is very important to know what that number

is, because the computational complexity of the overall algorithm is dominated by

the number of dimensions of the coeﬃcient vector space. us the central question

is: how does the rank of a set of generic polynomials grow with the degree k′of

monomials by which we multiply?

4.2. THE VECTOR SPACE OF POLYNOMIALS 61

is question is related to Fröberg’s conjecture [Froeberg and Hollman (),

Froeberg ()] from algebraic geometry. If it holds, then we have an exact condi-

tion on the minimum necessary degree, given by the following result.

Conjecture  (based on Fröberg’s conjecture) Let P2={q1, . . . , qn−1}be generic

homogeneous quadratic input polynomials vanishing on the linear subspace of s-

projection Swith n≥D+ 1 and let Pkbe derived from P2by multiplication with

all monomials of degree k−2,

Pk={Mq1, . . . , Mqn−1|Mis a monomial of degree k−2}.

en the smallest degree k∗≥2such that the coeﬃcient vectors corresponding to

the polynomials Pk∗span the subspace CS

k∗in coeﬃcient vector space is the index

of the ﬁrst non-positive coeﬃcient ak∗in the power series,

∞

∑

ℓ=1

aℓtℓ=∏n−1

i=1 (1 −t2)

(1 −t)D−1

(1 −t)d

For a proof see the supplemental material of [Kiraly et al. ()]. Since the

coeﬃcient of this power series are straightforward to calculate, using this result we

can compute the minimum necessary degree at a negligible computational cost.

Algorithm  Enlarge set of polynomials by increasing the degree until it spans the

corresponding vector space of polynomials.

: function MU(q1, . . . , qn−1)

: Initialise k←2to the degree of the input polynomials.

: while ak>0do ◃Coeﬃcient of the power series in Conjecture 

: Increment degree k←k+ 1.

: end while

: Multiply input polynomials by all monomials of degree k−2,

Pk← {mq1, . . . , mqn−1|mmonomial with deg m=k−2}.

: return Pk

: end function

is result leads to Algorithm  which computes for any generic input the poly-

nomials Pkspanning the corresponding coeﬃcient vector space CS

k. e computa-

tional cost is negligible, since enumerating all monomials of a certain degree (Line )

is linear in the number of all monomials.

Note that naïve approaches for generating the set Pkare infeasible in almost all

practically relevant cases. First of all, verifying whether a set of vectors are linearly

62 CHAPTER 4. AN ALGEBRAIC APPROACH

independent entails applying a rank-revealing algorithm (e.g. Gaussian elimination)

which is at least quadratic in the number of dimensions; and coeﬃcientvector spaces

are notoriously high-dimensional. Secondly, simply generating a “very large num-

ber” is of course not guaranteed to work in general but also leads to exceedingly

high dimensional coeﬃcient vector spaces, in which subsequent steps of the alge-

braic algorithm need to work. As some of these steps are quadratic in the number

of dimensions, it is important to keep the degree as small as possible.

4.3 An approximate algebraic algorithm

In the previous sections we have laid the groundwork for our algorithm: we refor-

mulated the SSA problem in terms of a set of exact polynomial equations which are

represented as vectors in the coeﬃcient vector space. Under genericity assumptions,

we have seen that the (exact, noise-free) coeﬃcient vectors are linearly independent

elements of subspace of known dimensions (but unknown basis). en we showed

how we can generate a larger set of polynomials Pkof higher degree ksuch that its

coeﬃcient vectors are guaranteed to span the desired subspace in coeﬃcient vector

space.

So far, we have assumed that the input polynomials are known exactly, in the

sense that they were derived from the true epoch mean and covariance. In practice,

where cumulants are estimated from ﬁnite and noisy data, this is not the case. is

means that there exists no exact solution; a true s-projection matrix Bswill not

make the projected moments exactly equal. Hence the vanishing set of the input

polynomials does notcontain the true set of s-projections S, but only the null vector.

We therefore need to ﬁnd an approximate solution.

It is here the vector space view on polynomials comes in handy, because it comes

with a natural geometric interpretation which includes a distance between polyno-

mials. As we have seen before, all quadratic homogeneous polynomials that vanish

exactly on Slie on a linear subspace CS

2of the coeﬃcient vector space. In the ap-

proximate case, we will think of our input as lying approximately on CS

e overall strategy of our algorithm is to algebraically manipulate our set of

polynomials CS

kuntil we obtain linear polynomials that have approximately the

same vanishing set. From linear polynomials, we can readily obtain the space of

s-projections Sas their approximate d-dimensional dual. is procedure can be

seen as solving the SSA problem algebraically. We proceed step-wise: starting at the

minimal necessary degree (determined by Proposition ), with each step we reduce

the degree of our set of polynomials by one.

Approximate division by a single variable

e fundamental idea behind the step-wise reduction of degree can be loosely de-

scribed as “approximate division of a set of polynomials by a single variable”. at is,

4.3. AN APPROXIMATE ALGEBRAIC ALGORITHM 63

we go from a set of polynomial Pkof degree kto set of polynomials Pk−1of degree

k−1that has approximately the same vanishing set as Pk.

Quadratic input polynomials

Quadratic homogeneous

polynomials vanishing on all

stationary projections

(linear subspace)

Quadratic homogeneous

polynomials divisible by

(linear subspace)

Approximate

intersection

Figure .: Illustration of the approximate division of a set of polynomials by a

single variable T1. e planes illustrate the true sub-vectorspaces, not our estimate

from the input polynomials.

e choice of variable by which we divide is arbitrary; by our genericity assump-

tion any variable Tiis not a basis element of S. In the following we ﬁx it to T1. As we

have seen before, the polynomials in Pklie approximately on CS

k⊂Ck. Since we

have applied Algorithm , we can also assume that they are an approximate basis.

Let us now consider another subspace of Ck, namely the polynomials of degree k

that are divisible by T1. ey also form a linear subspace spanned by all monomials

that are divisible by T1. Let us denote this subspace by T1·Ck−1. What this abuse

of notation indicates is that T1·Ck−1is in fact isomorphic to the coeﬃcient vector

space Ck−1of lower degree. Figure . shows a cartoon illustrating the relationship

between these two linear subspaces in the quadratic case.

Approximate division of Pkby T1consists of two steps. In the ﬁrst step, we com-

pute an approximate basis for the intersection of CS

kand T1·Ck−1. More speciﬁcally,

we compute a basis for the approximate intersection of CS

k(for which we only have

an approximate basis in terms of the input polynomials) and T1·Ck−1whose basis

is known exactly by deﬁnition. We ﬁnd the elements of this basis as linear combi-

nations of the polynomials in Pkby minimizing the L2-distance to T1·Ck−1using

the singular value decomposition. e elements of this basis are polynomials that

are both divisible by T1and linear combinations of our input, i.e. vanish approx-

imately on S. Let us denote the set of these polynomials by P′

k−1, and note that

they are still of degree k. In Figure ., P′

k−1is a basis for the one-dimensional ap-

proximate intersection shown as the orange line. In the next step, we remove from

each polynomial in P′

k−1all terms that are not divisible by T1. Finally, by dividing

each remaining monomial by T1, we get a set of polynomials of degree k−1called

64 CHAPTER 4. AN ALGEBRAIC APPROACH

Pk−1, whose elements lie approximately on CS

k−1. is concludes the approximate

division of Pkby T1.

Note that the choice of variable by which we divide is arbitrary. In fact, we can

divide by T1in an arbitrary basis to obtain a set of polynomials of lower degree

that lie approximately on CS

k−1. We exploit this property in the ﬁnal algorithm to

generatealarger set of suchpolynomials, which we thenreduce to abasis byapplying

PCA, in order to increase the accuracy on noisy data.

The complete algorithm, explained

isomorphic

S⊥

divisible

by T1

divisible

by T1

input polynomial

Multiplication of all

input polynomials by

all variables T1,...,T

(Algorithm 1).

Approximate division

by a variable.

Approximate division

by a variable.

isomorphic

Figure .: Illustration of the algorithm. e quadratic input polynomials are

embedded in C3(by multiplication with all variables) and lie approximately on

3(red dots in the top panel). e intersection of CS

3with the subspace of poly-

nomials divisible by T1(orange line) is isomorphic to CS

2in the vector space of

homogeneous quadratic polynomials C2. By dividing out T1again (blue line) we

arrive at a basis for the linear polynomials vanishing on S, which is a basis for the

orthogonal complement S⊥.

Let us now step through the complete Algorithm . roughout the text we refer

to line numbers in that algorithm. Figure . illustrates the steps of the algorithm,

4.3. AN APPROXIMATE ALGEBRAIC ALGORITHM 65

Algorithm  Compute an approximate basis for the d-dimensional space of s-

projections Sgiven generic homogeneous polynomials corresponding to epoch

mean and covariance matrices respectively.

: function ASSA(ℓ1, . . . , ℓn−1, q1, . . . , qn−1, d)

: Compute set of polynomials Pk←MU(q1, . . . , qn−1)

: Let π←(1 ···D)be the cyclic permutation of variables.

: while k > 1do

: Initialise Pk−1← {}

: for i= 1, . . . , D do

: Permute polynomial variables Pπ

k← {πi(p)|p∈ Pk}

: Compute a linearly independent set of polynomials P′

k−1in the span

of the coeﬃcient vectors in Pπ

P′

k−1⊂span Pπ

of size |P′

k−1|= dim CS

k−1that is closest to the subspace of polyno-

mials divisible by T1in terms of the L2-distance of coeﬃcient vectors

(SVD).

: Remove from each polynomial in P′

k−1all monomials not divisible

by T1(i.e. set the corresponding coeﬃcients to zero).

: Divide by T1, invert the permutation of variables, and add to the set

of polynomials of lower degree,

Pk−1← Pk−1∪{π−i(p/T1)p∈ P′

k−1}.

: end for

: Reduce Pk−1to an approximate basis of CS

k−1using PCA on the co-

eﬃcient vectors.

: Let k←k−1.

: end while

: Compute an approximate basis b1, . . . , bD−dfor the orthogonal com-

plement S⊥from the coeﬃcient vectors of P1∪{ℓ1, . . . , ℓn−1}using

PCA.

: return b1, . . . , bD−d

: end function

and the isomorphisms across coeﬃcient vector spaces associated with diﬀerent de-

grees.

e main part of the algorithm deals with the quadratic input polynomials; the

linear polynomials are added at the end. In the ﬁrst step, we compute a new set of

polynomials CS

kof possibly larger degree k≥2from the input, which is guaranteed

to approximately span CS

k(Line ). is corresponds to the arrow going from C2to

66 CHAPTER 4. AN ALGEBRAIC APPROACH

C3in Figure ..

e loop starting at Line  is to descend the degree from kdown to one (linear

polynomials). In the nested loop beginning at Line , we repeated the approximate

division for diﬀerent permutations of variables to increase the accuracy of our esti-

mate for the basis of CS

k−1. e approximative division by T1(under the permuta-

tion of variables) is implemented in Lines  to .

In Line , we compute the approximate intersection with the set of polynomi-

als divisible by T1. is is a quadratic optimisation problem which can be solved

by a singular value decomposition (SVD): let Qbe a matrix deﬁned as the row-

wise concatenation of all coeﬃcient vectors in Pk, where we remove the columns

corresponding to monomials not divisible by T1. Using the SVD, we compute a

basis v1, . . . , vm∈

|Pk|for the approximate le kernel of Qof dimension m=

dim CS

k−1. e elements of this basis provide us with coeﬃcients for mlinear inde-

pendent linear combinations of polynomials in Pksuch that the sum of the squared

coeﬃcients of monomials not divisible by T1are minimised; this is the set P′

k−1. In

Lines  and  this is then simply converted into polynomials of degree k−1.

In each step in Line  we approximate the basis for the polynomials of lower de-

gree CS

k−1using PCA (Line ). e quality of this estimate increases with the num-

ber of available vectors that lie approximately on CS

k−1. In order to obtain more such

coeﬃcient vectors, we repeatedly divide out the variable T1, but each time under a

diﬀerent permutation of the polynomial variables T1, . . . , TD. is corresponds to

a basis change in the input space, of which the permutation of variables is compu-

tationally attractive when rearranging and ordering polynomial coeﬃcients. Each

permutation of variables yields diﬀerent sets of polynomials, which we combine to

anapproximatebasis of CS

k−1by ﬁrstinverting thepermutationofvariablesandthen

performing PCA to reduce the number of basis elements to the correct dimension

dim CS

k−1(Line ). Finally, in the last step from quadratic to linear polynomials

we include the linear polynomials ℓ1, . . . , ℓn−1obtained from the mean vectors.

Computational complexity

e computational complexity of Algorithm  is determined by the singular value

decomposition (Line ) of the coeﬃcient matrix of the highest (initial) degree k,

which consists of the row-wise concatenation of the coeﬃcient vectors of the initial

set of polynomials Pπ

e worst-case complexity of computing the full singular value decomposition

for a general r×cmatrix is O(min{rc2, r2c}). e number of columns cof the co-

eﬃcient matrix is equal to the number of dimensions of the coeﬃcient vector space

Ckof homogeneous polynomials in D−1variables,

c= ∆(k, D −1),

4.3. AN APPROXIMATE ALGEBRAIC ALGORITHM 67

because we include only those monomials that are not divisible by T1. is can be

written as a polynomial function in the degree kandin the number of input variables

c(k) = 1

(D−2)!(k+ 1) ···(k+D−2),

c(D) = 1

k!(D−1) ···(D−2 + k),

respectively. e number of columns cgrows like kD−3in the degree kand like

Dk−1in the number of input variables D. e number of rows rof the coeﬃcient

matrix is equal to the number of polynomials in the set Pπ

r=|Pπ

k|= (n−1)∆(k−2, D).

As for the number of columns, the number of rows rcan be written as a polynomial

in kand Dwith leading terms kD−2and Dk−3respectively.

9 11 13 15 17 19 21 23 25 27 29 31 33 35

0.5

Number of epochs

Runtime (seconds)

Degree 5 to

degree 4

Degree 4 to

degree 3

Degree 3 to

degree 2

1 2 3 4 5 6 7

0.1

Number of stationary directions

Runtime (seconds)

n=9 epochs

n=10 epochs

n=12 epochs

Figure .: e le panel shows the runtime (vertical axis, log scale) of Algo-

rithm  for D= 8 dimensions and d= 4 stationary directions for various num-

bers of epochs (horizontal axis). Overall, the runtime decreases when the num-

ber of epochs grows since the necessary degree becomes smaller, and hence the

number of dimensions of the coeﬃcient vector space. e right panel shows the

runtime (log scale) of the algorithm for three diﬀerent numbers of epochs over

varying numbers of stationary directions.

For ﬁxed number of input dimensions D, the computational complexity de-

pends on the necessary initial degree k, which depends on the number of polyno-

mials n−1and the number of stationary directions d(see Conjecture ). e fewer

polynomials available, the higher the necessary degree. is is illustrated in the le

panel of Figure .. As the number of polynomials grows (horizontal axis), we ob-

serve characteristic drops (phase transitions) in the runtime which correspond to

decreases in the necessary degree kthat translates into fewer number of columns c

of the coeﬃcient matrix. For a constant necessary degree k(i.e. between the jumps),

68 CHAPTER 4. AN ALGEBRAIC APPROACH

we see that the runtime grows with the number of polynomials, because it increases

the number of rows r. Moreover, we observe that for D= 8 input dimensions and

d= 4 stationary directions, n= 26 generic epochs are suﬃcient to span the whole

subspace CS

2of the coeﬃcient vector space.

e right panel of Figure . shows that the runtime also decreases with the

number of stationary directions. at is because the number of dimensions of the

linear subspace of polynomials CS

kvanishing on Sdecreases, which in turn leads to

a lower necessary degree.

4.4 Relationship to algebraic geometry

So far, we have largely eschewed concepts and terminology from algebraic geom-

etry in order to make the presentation accessible by the wider machine learning

community. In this section, we brieﬂy establish this connection. We do not attempt

an exhaustive or rigorous exposition (see e.g. [Cox et al. ()] for a good intro-

duction with a view to computation); the purpose is to point to the main concepts

and rephrase the task in the language of algebraic geometry.

Algorithm  exploits the algebra-geometry connection: to ﬁnd a basis for the

linear subspace of s-projections (a geometrical object), we operate on its represen-

tations in terms of polynomials (elements of an algebra, the ring of polynomials).

Assuming that the quadratic input polynomials q1, . . . , qn−1are known exactly, this

corresponds to a classical problem in computational algebraic geometry: computing

generators for the radical of an ideal.

An ideal is a set of polynomials in the ring

[T1, . . . , TD]that correspond to

a vanishing set in

D. In Section . we have deﬁned the vanishing set of a set of

polynomials. Moreover, we have observed that the space of s-projections Sis a lin-

ear subspace of the vanishing set V(q1, . . . , qn−1). Conversely, we can associate an

algebraic set V⊂

Dwith a set of polynomials, its ideal. e ideal of a vanishing

set is the set of all polynomials that vanish on it, denoted by I(V). A purely alge-

braic way to deﬁne an ideal is in terms of a set of so-called polynomial generators

f1, . . . , fm∈

[T1, . . . , TD]. If we think of the generators as polynomial equations

deﬁning the vanishing set V(f1, . . . , fm), then the generated ideal corresponds to

the set of all polynomial consequences. ese we can derive from the generators

by multiplication with all elements of the polynomial ring and all summations of

such products. From this point of view, we can interpret Algorithm  as generating

further elements of the ideal ⟨q1, . . . , qn−1⟩, up to a certain degree.

Since the space of s-projections Sis a linear space, its ideal is generated by lin-

ear polynomials, I(S) = ⟨b1, . . . , bD−d⟩. is means that every polynomial that

vanishes on Scan be written as a sum of polynomials, each having a linear factor

corresponding to an element of the orthogonal complement S⊥. If we knew the

linear generators b1, . . . , bD−d, then we could obtain a basis for S. However, since

4.5. SUMMARY AND DISCUSSION 69

our input consists of quadratic polynomials q1, . . . , qn−1∈I(S), it is not straight-

forward to extract this basis.

By assumption, the input polynomials are generic and n≥D+1, which implies

that the ideal ⟨q1, . . . , qn−1⟩contains all polynomials of degree at least two which

vanish on S. is is a unique subset of the ideal I(S). us, on the level of ideals,

the correspondence between the input and Sis one-to-one. In fact, I(S)is the so-

called radical ideal of the input ⟨q1, . . . , qn−1⟩, which is denoted by,

√⟨q1, . . . , qn−1⟩= I(S) = ⟨b1, . . . , bD−d⟩.

We can think of the radical of an ideal Ias the largest maximal ideal containing I.

Solving the SSA problem is therefore equivalent to computing a linear generating set

for the radical ideal of the input. is is a classical task in computational algebraic

geometry, for which the existence of algorithms has been known for a long time

[Hermann ()].

However, algorithms for radical computation that are suitable for implementa-

tion have only been developed in recent decades. e best known algorithms are

those of [Gianni et al. ()] (implemented in AXIOM and REDUCE); Macauly 

provides [Eisenbud et al. ()]; [Caboara et al. ()] is implemented in CoCoa

along with [Krick and Logar ()], the modiﬁcation by [Laplagne ()], avail-

able in SINGULAR.

All of these algorithms have two points in common. First of all, their compu-

tational worst case complexities are doubly exponential in the square of the num-

ber of variables [Laplagne ()]. Secondly, they can only be applied when the

coeﬃcients of the polynomials are known exactly, which is never the case in prac-

tical applications of SSA where moments are estimated from data. It is for these

two reasons that we cannot solve the SSA problem using oﬀ-the-shelf general al-

gorithms for radical computation. Our algorithm addresses a new type of prob-

lem in the nascent ﬁeld of approximate computational algebra [Corless et al. (),

Stetter (), Kreuzer et al. (), Heldt et al. ()].

4.5 Summary and discussion

In this chapter we have presented an algebraic algorithm for solving the SSA prob-

lem. Exploiting the algebra-geometry connection, we do not search the space of all

possible projections guided by an objective function, but compute an approximate

solution to a system of polynomial equations.

e algebraic approach has several advantages over optimisation-based tech-

niques. First of all, the algorithm has a unique solution (no local minima) which

can also be shown to be consistent, i.e. it converges to the true solution as the sam-

ple size grows. Moreover, it is straightforward to integrate cumulants of arbitrary

70 CHAPTER 4. AN ALGEBRAIC APPROACH

order, as they corresponds to polynomials of higher degree. Parallel to the com-

bination of mean (degree one) and covariance matrices (degree two), the highest

moment determines the initial coeﬃcient vector space and the moments of lower

order are added during the step-wise reduction of the degree. In Algorithm , the

mean vectors are added in the last step (Line ), aer the quadratic polynomials

have been reduced to an approximate linear basis.

Choosing the number of stationary directions

10 35 120 220

0.01%

0.1%

Sorted eigenvalues (PCA)

Percentage of variance

Num. of s−src. d=3

Num. of s−src. d=5

Num. of s−src. d=8

Figure .: PCA eigenvalue spectrum in coeﬃcient vector space C3for three

synthetic data sets in ten dimensions with varying numbers of stationary sources.

e true number of stationary directions ddetermines the number of dimen-

sions of the subspace CS

kin coeﬃcient vector space, on which the input polyno-

mials (multiplied up to the necessary degree k) lie approximately. By estimating

the eﬀective number of dimensions of that subspace from the input, we can select

an appropriate number of stationary directions. To that end, we inspect the PCA

eigenvalue spectrum of the polynomials Pkin coeﬃcient vector space Ck.

Figure . shows an example computed on synthetic data. For d= 5 stationary

directions in ten dimensions, we see that there exists a -dimensional subspace that

has signiﬁcantly less variance than the rest of the data. For polynomials of degree

three in D= 10 dimensions this corresponds to ﬁve stationary directions, because a

ﬁve-dimensional stationary subspace induces a (-)-dimensional eﬀective sub-

space in coeﬃcient vector space,

dim CS

3= ∆(3,10) −∆(3,5) = 220 −35 = 185.

Similarly, the low-variance subspaces of 10 and 120 dimensions (blue and green

curve) correspond to stationary subspaces of three and eight dimensions respec-

4.5. SUMMARY AND DISCUSSION 71

tively. e advantage of this approach, relative to stationarity testing, is that it al-

lows us to inspect all possible number of dimensions simultaneously by computing a

single eigenvalue spectrum instead of running the optimisation procedure for each

candidate and then performing time-consuming resampling.

Alleviating the computational cost

e biggest drawback of the algebraic algorithm is the scaling of the computational

cost. e singular value decomposition (Line  of Algorithm ) is cubic in the num-

ber of dimensions of the coeﬃcient vector space — which grows rapidly with the

degree and the number of input dimensions. However, there are several directions

which could bring about a dramatic reduction of the computational complexity.

First of all, the matrix of coeﬃcient vectors (corresponding to the set Pπ

kin Al-

gorithm ) is very sparse when it is large: while the number of dimensions of the

coeﬃcient vector space grows quickly with the degree k, the number of nonzero co-

eﬃcients (i.e. nonzero entries in each row of the matrix) stays the same. Moreover,

the matrix has a special structure which can be exploited. e multiplication of an

input polynomial with all monomials of a certain degree (Algorithm ) essentially

sends its coeﬃcients to diﬀerent places (columns), each corresponding to a row in

the matrix. Finally, for the purpose of approximate division by a variable, we do not

need to compute all singular values, but only the smallest ones. ere exist several

algorithms for iterative large-scale sparse structured singular value decomposition

[Kaltofen et al. (), Park et al. ()] which are yet to be explored in this con-

text.

Algebra beyond SSA?

e algebraic approach is not limited to stationary subspace analysis; our algorithm

solves a rather general problem: approximating the set of solutions to a system of

homogeneous polynomial equations of arbitrary degree by a linear space. More-

over, it would be straightforward to generalise this procedure to ﬁnd approximate

“descriptions” of sets of solutions in terms of polynomials of higher degree or with

certain prescribed properties.

e main requirement for a machine learning problem to ﬁt our algebraic ap-

proach is that its set of solutions needs to be describable in terms of a system of

approximate polynomial equations. is is an uncommon formulation, perhaps be-

cause so far there have been no eﬃcient ways of solving such problems known to the

machine learning community. What is attractive about this framework is that poly-

nomials of arbitrary degree can be combined naturally (e.g. corresponding to certain

constraints on the solutions), without suddenly rendering the problem infeasible.

72 CHAPTER 4. AN ALGEBRAIC APPROACH

Chapter 5

Simulations and applications

5.1 Introduction

In this chapter, we evaluate the SSA algorithm in simulations and present applica-

tions to EEG data analysis.

Since SSA is a new unsupervised learning task, there is no established evaluation

strategy. To date, there exist no competing algorithms or benchmark data sets that

allow for a relative assessment. Moreover, since SSA is an unsupervised problem

there exists no natural quantitative performance measure that is inherently mean-

ingful as e.g. for classiﬁcation problems. Whether a particular SSA solution is useful

is ultimately a qualitative decision, which depends on the application context. is

is reﬂected in our evaluation methodology, which consists of four diﬀerent angles.

. Synthetic data. An artiﬁcially constructed ground truth allows us to quanti-

tatively measure the performance in terms of the deviation from the true so-

lution. We investigate the inﬂuence of key parameters, the relative eﬃcacy of

the algebraic approach, and two methods for stationarity testing. e results

serve as a basic validation and provide insights for the domain of applicability

and parameter choice in practice.

. Comparison against PCA and ICA on real data. Even though PCA and ICA

solve a diﬀerent problem than SSA, in certain situations they may yield a sim-

ilar results. For a large EEG data set, we establish that neither PCA nor ICA

yield a basis that is useful for understanding distribution changes. is un-

derlines the original contribution of the SSA approach.

. Real data with ground truth. In two situations from EEG analysis where an

expected results is induced by an experimental paradigm, we show that SSA

successfully identiﬁes the corresponding neural components. e results are

evaluated qualitatively, in terms of the frequency content and the scalp maps,

and quantitively using the correlation with an experimental stimulus.

. Indirect evaluation in a classiﬁcation setting. In the context of brain-computer-

interfacing, we demonstrate that in certain cases, SSA can remove harmful

74 CHAPTER 5. SIMULATIONS AND APPLICATIONS

non-stationary brain sources. e eﬃcacy of this approach is reﬂected in

lower misclassiﬁcation rates on the test set.

e remainder of this chapter is organised along these lines. In the next Sec-

tion ., we present the results on simulated data. e applications to EEG analysis

are contained in Secton .. In the last Section ., we discuss our ﬁndings and point

to future work motivated by the numerical results.

5.2 Simulations on synthetic data

Before we present results, we consider two main ingredients: (a) a method for gen-

erating synthetic data that provides us with a ground truth and allows us to vary

the parameters of interest, and (b) a measure of deviation between the true and the

found solution.

Data generation

e data is generated exactly according to the SSA model. at is, we directly sam-

ple the data for each epoch. In doing so, we assume that we have chosen the seg-

mentation into epochs optimally in the sense that we extract the maximum possible

amount of information from the data without any redundancy (e.g. overlaps intro-

duced by a sliding window). is is unrealistic; however, any other way of dividing

our synthetic data into epochs is equally arbitrary and introduces further parameters

which make the results harder to interpret.

Foreachepoch, werandomlysamplethemeanandthecovarianceof thesources,

µ1, . . . , µnand Σ1, . . . , Σn, such that the moments are identical along the ﬁrst d

coordinates. e elements of the mixing matrix Aare drawn uniformly at random

from the interval [−0.5,0.5]. In all our simulations, we focus on the more diﬃcult

case where there is no information in the mean, i.e. the mean of the non-stationary

sources is constant in each epoch. If we include information in the mean, the contri-

bution of the new SSA algorithm becomes somewhat unclear: the separation of sta-

tionary and non-stationary sources using diﬀerences in their mean can be achieved

analytically by a simple matrix inversion.

e marginal covariance of the stationary sources is ﬁxed to the identity I. For

each epoch, the variance of the non-stationary sources is sampled uniformly over

the intervals (1, α)and (1/α, 1) with α > 1. at is, each non-stationary source is

at most αtimes larger or αtimes smaller than the stationary sources. e covariance

between the two sets of sources is chosen by drawing random mixing coeﬃcients.

In order to investigate the more realistic scenario of non-gaussian data (heavier

tails, outliers), we sample Gaussian data X∼ N(0,1), exponentiate it and preserve

the sign: X′= sign(X)|X|k, where the parameter kcontrols the degree of the

5.2. SIMULATIONS ON SYNTHETIC DATA 75

non-gaussianity; we report Pearson’s kurtosis instead of k. e data is then linearly

transformed and translated to have the chosen epoch mean and covariance.

A performance measure for SSA

As we have seen in Chapter , only the true non-stationary subspace can be identi-

ﬁed from the mixed signals. A natural performance measure for SSA is therefore the

deviation between the true and the found non-stationary subspace. e alternative,

a measure based on the level of stationarity of the estimated stationary sources, is

more diﬃcult to deﬁne because even for an ideal separation, the estimated epoch

mean and covariance matrices will not be exactly equal. e expected deviation

from equality in the case of ideal separation depends on properties of the data, such

as the number of samplesor the heaviness ofthe tails (see Section .. for a more in-

depth discussion). Since these are precisely the parameters of interest in our simula-

tions, it is more convenient to adopt a performance measure based on the deviation

from the true subspace.

Following [Meinecke ()] we deﬁne a subspace error based on the principal

angles between two subspaces. Let Uand Vbe two linear subspaces of the vector

space Vwith dim U≤dim V. e ﬁrst principal angle θ1is deﬁned as the smallest

angle between any two vectors u∈Uand v∈V,

cos θ1= min {u⊤v

∥u∥∥v∥u∈U, v ∈V}=∠(u1, v1),

where the vectors u1and v1that minimise the angle are called the principal vectors.

e remaining principal angles θ2, . . . , θdim Uare found recursively in the orthog-

onal complement of the previous principal vectors,

cos θk+1 = min {u⊤v

∥u∥∥v∥u∈U, u ⊥u1, . . . , uk, v ∈V, v ⊥v1, . . . , vk}.

us the principal vectors in each subspace are mutually orthogonal. If a principal

vector θkis zero then Uand Vshare a common subspace and uk, vk∈U∩V.

Hence the dimension of the intersection of Uand Vis equivalent to the number

of zero principal angles. e principal angles and vectors can be found using the

singular value decomposition (SVD). Let the columns of the matrices U′and V′

be orthonormal bases for the spaces Uand Vrespectively. From the singular value

decomposition ˆ

UΣˆ

V⊤= (U′)⊤V′, we obtain the principal angles from the singular

values, cos θk= Σkk and the principal vectors are the columns of ˆ

Uand ˆ

To quantify the diﬀerence between two k-dimensional subspaces spanned by the

columns of A, B ∈

d×k, we deﬁne the subspace error as the average sin2of the

76 CHAPTER 5. SIMULATIONS AND APPLICATIONS

principal angles θ1, . . . , θkbetween them,

E(A, B) = 1

∑

i=1

sin2(θi).

is error is zero when the true subspaces are identical and one if they are orthog-

onal to each other. e subspace error can be interpreted in terms of the sources.

It is the percentage of the signal power that a non-stationary sources loses if it is

projected from the true onto the estimated non-stationary sources. Conversely, for

the estimated stationary sources ˆss(t) = ˆ

Psx(t)this means that E(ˆ

An, An)is the

percentage of non-stationary signal power that they contain.

However, the maximum possible subspace error depends on the number of di-

mensions of the two subspaces in relation to the total number of dimensions, i.e.

on the number of dimensions in which they can possibly diﬀer. For example, two

nine dimensional subspaces Uand Vin ten dimensions have a maximal subspace

error if there exists a one-dimensional subspace which is contained in Uand in V⊥

or vice versa. In this case, exactly one of the principal angles is 1

2πand all others

are zero, so that the subspace error is /. In general, two k-dimensional subspaces

can only diﬀer by a d−kdimensional subspace, so that the maximum subspace

error is min {1,d−k

k}. is worst case is a very special case, in the sense that unless

we speciﬁcally construct it, any set of basis vectors is expected to have a lower sub-

space error. As a less pathological baseline, in each simulation we compare against

random projections [Rahimi and Recht ()].

19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1

50%

100%

Subspace error (% of rand. projection)

Number of non−stationary directions

Low non−stationarity (α=3)

Medium non−stationarity (α=5)

Strong non−stationarity (α=10)

Figure .: Performance for varying numbers of non-stationary directions (hor-

izontal axis) measured in terms of the subspace error relative to random projec-

tions. e total number of dimensions is ﬁxed to .

5.2. SIMULATIONS ON SYNTHETIC DATA 77

5.2.1 Dimensions, sample size, epochs, kurtosis

In this section we present the results of our simulations. Unless otherwise stated,

eachplotshowsmediansubspace errorsoverrandomrealisationsofthedatasets,

with error bars extending from the  to the  quantile.

In the ﬁrst set of experiments, we investigate the inﬂuence of the number of sta-

tionary dimensions. e total number of dimensions is set to , the number of

epochs is  and in each epoch we observe  samples with non-stationarity in

the covariance matrices of degree α= 3. Figure . shows the result. In order to

obtain comparable results across diﬀerent dimensionality, we report the subspace

error relative to random projections. We see that in terms of the subspace error,

larger non-stationary subspaces are more diﬃcult to ﬁnd. is is not only a nu-

merical eﬀect: as we have kept the number of distinct epochs constant, the relative

amount of available information (i.e. observed variation in the distribution of the

non-stationary sources) is smaller for the larger subspaces, which makes them more

diﬃcult to identify.

12 25 50 100 500 1000

0.1

0.2

0.3

0.4

Subspace error

Number of samples per epoch (log)

Low non−stationarity (α=3)

Medium non−stationarity (α=10)

Strong non−stationarity (α=50)

Random projection

Figure .: Inﬂuence of the epoch sample size (horizontal axis) on the perfor-

mance of SSA, measured in terms of the subspace error (vertical axis).

In the next step, we analyse how the error of SSA depends on the number of

available samples in each epoch. e setup is somewhat artiﬁcial: in practice, we

divide a ﬁxed number of samples into epochs; we can usually only vary the num-

ber of samples in each epoch by changing the overall segmentation, which changes

the true distribution of the data in each epoch. However, in order to isolate the ef-

fect of small sample errors, we vary the number of samples available in each epoch

from a ﬁxed epoch source distribution. e total number of dimensions is  and

the number of stationary directions is . Figure . shows the results. We see that

even for epochs as small as eleven samples, SSA can still ﬁnd a projection that is

signiﬁcantly better than a random projection. is suggests that small sample er-

78 CHAPTER 5. SIMULATIONS AND APPLICATIONS

rors average out across epochs to some extent, an eﬀect that we also observe in the

next set of simulations. Moreover, we see that the eﬀect of estimation errors in the

epoch moments is relative to the strength of the non-stationarity. is makes sense:

a strong change in variance is less easily masked by small sample ﬂuctuations than

a weaker non-stationarity.

2 5 10 20 50 100

0.1

0.2

0.3

0.4

0.5

Subspace error

Number of epochs (log)

Low non−stationarity (!=3)

Medium non−stationarity (!=5)

Strong non−stationarity (!=10)

Random projection

Spurious stationary

directions

Small sample errors

Figure .: Inﬂuence of the number of epochs (horizontal axis) with distinct

moments on the performance of SSA, measured in terms of the subspace error

(vertical axis). e total number of samples is ﬁxed to .

Now we consider a diﬀerent scenario: the total number of samples is ﬁxed to

 and we vary the number of epochs with random source covariance matrix (ﬁve

stationary and ﬁve non-stationary directions). is means that for two epochs, their

sample size will be  each, whereas for the largest number of epochs (), each

contains only  samples. In other words, we trade small sample errors for the num-

ber of distinct distributions, or accuracy of information for amount of information.

While this is an artiﬁcial scenario (we can never vary the amount of information in a

dataset), it is directly related to an important practical problem, namely the design of

the epoch structure. One usually faces the dilemma of wanting a ﬁne-grained epoch

setup, to capture as much variation as possible and increase the temporal resolution,

while controlling the detrimental eﬀect of small sample estimates.

e results shown in Figure . convey a positive message: only as the epochs

become extremely small ( samples in  dimensions) does the performance de-

teriorate markedly. As we have seen before, small sample errors average out across

epochs, andat thesametime, SSAbeneﬁtsfromfurthervariation inthenon-stationary

space. In the opposite case, few large epochs, we observe the eﬀect of spurious sta-

tionary directions; the subspace error reaches the plateau exactly at the number of

epochs given by our theoretical result (see Section .).

All simulations presented so far have one unrealistic aspect in common: Gaus-

5.2. SIMULATIONS ON SYNTHETIC DATA 79

3 10 100

0.1

0.2

0.3

0.4

0.5

Subspace error

Pearson’s Kurtosis (heavy−tailedness)

Epoch size m=40

Epoch size m=300

Epoch size m=2500

Random projection

Figure .: Inﬂuence of the fatness of the tails of the sources (horizontal axis) on

the performance of SSA, measured in terms of the subspace error (vertical axis).

e source data is sampled from a super-gaussian with Pearson’s kurtosis shown

on the horizontal axis.

sian sources. Real data, however, is notoriously non-gaussian, apart perhaps from

pure noise components; real data has heavier tails and is almost always contami-

nated by outliers. Moreover, interesting components are usually distinctively non-

gaussian, which is reﬂected by the fact that many algorithms explicitly search for

components with non-Gaussian properties, e.g. ICA [Hyvarinen ()], projec-

tion pursuit [Huber (), Friedman and Tukey ()] or non-gaussian compo-

nent analysis [Blanchard et al. ()]. Outliers and heavy tails lead to stronger ﬂuc-

tuations in ﬁnite samples between epochs, even for stationary sources, leading to

higher estimation errors of the true non-stationary subspace. e absolute scale of

the errors we have reported is therefore probably too optimistic for most real data

sets¹.

erefore, in the ﬁnal set of simulations, we investigate how the error behaves

as we fatten the tails of the source distribution. As before, the total number of di-

mensions is ten with ﬁve stationary directions, we have  epochs and the non-

stationarity level is set to α= 10. Figure . shows the results over various epoch

sample sizes. Interestingly, the performance does not deteriorate rapidly as the kur-

tosis grows. As one would expect, small sample sizes strongly exacerbate the inﬂu-

ence of outliers.

¹is is in line with the aim of these simulations, which is to show the relative performance across

parameters. Any assumption about the source distribution is somewhat arbitrary and introduces a

systematic bias.

80 CHAPTER 5. SIMULATIONS AND APPLICATIONS

5.2.2 Testing for stationarity

In Section .., we have seen that the SSA objective function can also be derived

as a likelihood ratio test [Neyman and Pearson ()]. In this section, we evaluate

the eﬀectiveness of this test and compare it against a data-driven nonparametric test

based on resampling.

e value of the SSA objective function is a measure of stationarity. However,

for a data set at hand, its value alone does not provide a clear indication of whether

the data is truly stationary, in the sense that its epochs have been sampled from dis-

tributions with the same mean and covariance matrix. at is because small sample

errors introduce diﬀerences in moment estimates between epochs, even for a data

that has been sampled from a stationary model.

erefore, one way of making sense of the objective function is to compare its

value against its distribution on truly stationary data, and decide for non-stationarity

if it exceeds a chosen quantile (critical value). In terms of hypothesis testing, station-

arydatais our null hypothesisH0whichwereject atacertain(one-sided) conﬁdence

level.

−5 0 5 10 15 20 25 30 35 40

Normalized SSA objective function

Density

Source kurtosis 3 (Gaussian)

Source kurtosis 5.9

Source kurtosis 11.4

Source kurtosis 21.8

Figure .: Distribution of the normalised SSA objective function for stationary

sources with varying kurtosis estimated from random samples.

In Section .., we mentioned that the distribution of the normalized objective

function,

√2Λ(X)−√nd(d+ 3) −1,

approximately follows a standard normal distribution on stationary data with Gaus-

sian sources, where Λis the unnormalized objective function, nis the number of

epochs and dis the number of dimensions. Figure . shows estimates of the dis-

tribution of the normalised objective function on simulated stationary data in ﬁve

dimensions of  epochs, each with  samples, over  random realisations.

5.2. SIMULATIONS ON SYNTHETIC DATA 81

For Gaussian sources, the empirical distribution of the normalised objective re-

semblesanormal density, butit isnot normal according to theKolmogorov�Smirnov

test. However, even for moderate levels of kurtosis, the distribution moves away

from N(0,1). us a stationarity test based on quantiles of the χ2distribution is

bound to perform poorly on almost all real data sets, in that it has a high false posi-

tive rate (type I error).

However, for non-Gaussian data the distribution of the objective function un-

der H0is in general unknown. Let us therefore take a closer look at the resampling

approach (see Chapter ). In short, we put the data from all epochs together, ran-

domly divide it into epochs, on which we then evaluate the SSA objective function.

Each random shuﬄing provides us with one sample from the distribution of the

SSA objective function on stationary data. By doing this repeatedly, we obtain a set

of samples from which we can estimate a critical value (e.g. the  quantile) for

testing.

As an illustration, we compare the performance of the χ2test for stationarity

with the test based on resampling on a synthetic data set. To this end, for each degree

of non-gaussianity, we randomly generate  stationary and  non-stationary

data sets and apply both test strategies at the  conﬁdence level; the parameters

of the data set are the same as the one used for Figure ., the level of non-stationarity

is low (α= 2). In light of our previous observations, we are mainly interested in the

false positive rate (FPR).

3 3.5 4 4.5 5

20%

40%

60%

80%

100%

Pearson’s Kurtosis

False positive rate (FPR)

Resampling

Chi−squared

Figure .: Comparison of two strategies for stationarity testing: comparing

against quantiles from the standard normal distribution (the theoretical limit) vs.

quantiles estimated from resampling the data set.

Figure . shows the result. For Gaussian sources (with Pearson’s kurtosis equal

to ), the false positive rate of both tests corresponds exactly to the chosen conﬁ-

dence level of . However, already for moderate deviations form normality, the

increasing likelihood of outliers leads to rapidly rising false positive rate of the para-

metric test; at kurtosis , it has become completely useless, which is in line with our

observation in Figure .. On the other hand, the quantiles estimated from  re-

82 CHAPTER 5. SIMULATIONS AND APPLICATIONS

samples of the shuﬄed data set are remarkable reliable: the false positive rate stays at

 while the true positive rate (not shown here) is constantly . Of course, the

performance of the resampling test also depends on several factors (number of sam-

ples, type of stationary distribution, etc.), and we do not have theoretical guarantees

yet. Even so, resampling can be a viable approach in practice.

5.2.3 Algebraic SSA vs. optimisation

Comparing the performance of the algebraic and the optimisation-based SSA algo-

rithm is not straightforward: for a given dataset, the former has a unique solution

computed in a ﬁxed amount of time whereas the latter is initialised randomly and

stops when a somewhat arbitrary convergence criterion has been met. Moreover, for

“standard” setting of these convergence criterion, the runtime of both algorithms

diﬀer markedly for a wide range of setting. As we had seen before, the runtime of

the algebraic algorithm grows rapidly, in particular when the number of available

epochs is not very large.

11 50 100 500 1000

0.1

0.2

0.3

0.4

0.5

0.6

Subspace error

Number of samples per epoch (log)

SSA

Algebraic SSA

Random

Figure .: Comparison of the algebraic and the optimisation-based SSA algo-

rithm w.r.t. the epoch samples size.

It is for these reasons that we do not attempt an exhaustive comparison, the re-

sults of which would be diﬃcult to interpret. Instead, we focus on one case where

the runtime of both algorithms is comparable and illustrates a general trend. In

this simulation, we use ten dimensional data (ﬁve stationary directions), a moder-

ate level of non-stationarity (α= 5), and vary the number of samples in each of the

 epochs.

Figure . shows the results: until the epochs have reached a critical size of 

samples, the optimisation approach outperforms the algebraic algorithm. e rea-

son for this is that the algebraic algorithm is more prone to small sample errors than

optimisation-SSA. at is because the coeﬃcient vector space that it operates in has

5.3. APPLICATION TO EEG ANALYSIS 83

much higher dimension than the data space, in which the optimisation takes place.

However, as the small sample errors get weaker in magnitude (on large epochs), it

clearly outperforms the standard SSA algorithm. at is because the gradient-based

optimisation converges slowly due to the ﬂatness of the objective function around

the true solution.

5.3 Application to EEG analysis

In this second part of this chapter, we apply SSA to data from electroencephalogra-

phy (EEG) [Berger (), Caton ()] recordings. EEG data analysis is a prime

application domain for SSA, because the EEG potential is well approximated by a

linear transformation of ﬂuctuations of macroscopic potentials in the cortex. is

has lead to the tremendous success of linear methods for the analysis of brain activity

(see e.g.[Parra et al. ()]).

First of all, we show that PCA and ICA do not ﬁnd non-stationary brain sources

onadata setfrombrain-computer-interfacing(BCI), andthatthemostnon-stationary

task-locked components correspond to the neural response to the task. Next, we

investigate the role of non-stationarities as one of the possible causes behind de-

teriorating performance of BCIs over time. We do this by removing certain non-

stationary directions in a pre-processing step, which for some subjects leads to a

signiﬁcant improvement of the error rate. In the last part of this section, we demon-

stratethatSSArecoversacertaintype ofnon-stationarybrainsource(auditorysteady

state evoked responses) that has been induced by an experimental paradigm. Here

we can measure the eﬀectiveness using the correlation with the stimuli. e analysis

shows that both ICA and PCA fail to recover the desired source.

5.3.1 Electroencephalography

e electroencephalography (EEG) [Berger (), Caton ()] is an electric po-

tential² measured on the surface of the scalp. e EEG is caused both by brain ac-

tivity and other physiological sources, e.g. nearby muscles.

e following overview is largely based on [Kandel et al. ()]. e cortical

source of the EEG is the synchronous activity of neurons, which are oen considered

as the atomic units of the cortex in models of the brain. e human brain consists of

around 1010 neurons, each of which is connected to its neighbouring neurons by a

dendritic tree and one axon. In a neuron, information is processed and transmitted

in the form of electric potentials. Every neuron maintains a membrane potential, i.e.

a voltage diﬀerence across its plasma membrane, which is controlled by the open-

ing and closing of ion channels through a complex electrochemical mechanism. A

²More precisely, it is a diﬀerence in potential between an electrode in the scalp and a reference

electrode, oen positioned on the nose or ear lobes.

84 CHAPTER 5. SIMULATIONS AND APPLICATIONS

neuron receives electrochemical stimulation from other neurons via its dendrites; if

the membrane potential exceeds a certain threshold, it is released as an action po-

tential which travels along the axon to other neurons. is eﬀect is called spiking.

When large, spatially aligned neuron populations spike synchronously, their cur-

rent ﬂow combines to produce macroscopic ﬁeld potentials, which are measurable

on the surface of the scalp.

e EEG signal not only reﬂects the ﬂuctuations of the ﬁeld potentials gener-

ated inside the brain, but also non-neural artefacts, which are of physiological or

technical origin. Prominent physiological artefacts result from muscle activity (e.g.

face, tongue, eye blinks and movement), heart beatand pulse. Technical artefacts are

oen generated by the power supply or other equipment. e signal power of arte-

factual sources is usually much stronger than the brain sources. Removing artefacts

is a key step in EEG signal processing. For an extensive review of artefact rejection

methods see [Fatourechi et al. ()].

ere are two general strategies for removing artefacts: discarding subsets (tri-

als) of the data with strong artefact activity, or removing artefactual sources by re-

gression and substraction [Cro and Barry ()], projection or spatio-temporal

ﬁltering. e ﬁrst strategy leads to a loss of data points while the second reduces

the number of dimensions and, if successful, preserves the relevant neural signal. A

common approach to artefact removal, which we will adapt in this chapter, is based

on independent component analysis [Makeig et al. (), Jung et al. ()]. In a

ﬁrst step, the data is decomposed into independent components which are then in-

dividually assessed to identify the artefactual components, oen in a semi-automatic

way. Rejecting independent components is appealing, because it ensures that one

does not remove information contained in dependencies and artefacts are usually

considered to be independent of cortical activity.

e manual classiﬁcation into artefact/non-artefact is based on a number of

heuristics, that take into account the frequency content, the scalp pattern, the time

course and sometimes also the correlation with auxiliary measurements. For an ex-

tensive evaluation of the standard manual artefact removal protocol, we refer the

reader to [McMenamin et al. ()].

For our purpose, we used an ICA basis computed by the TDSEP algorithm

[Ziehe and Müller (b)], which was preceded by a PCA pre-processing to 

of the variance (reducing the number of dimensions from  to approximately ).

Each component was inspected manually. No information other than the frequency

spectrum, the scalp plot and the time course was available during artefact classiﬁca-

tion.

EEG-based Brain-Computer-Interfacing

e goal of brain-computer-interfacing (BCI) [Dornhege (), Vidal ()] is to

transmit information directly from a human brain to a computer, circumventing

5.3. APPLICATION TO EEG ANALYSIS 85

peripheral nerves or muscles. is is achieved by voluntary changes in the brain

state, that can be detected by a computer and turned into a control signal.

e most popular BCI paradigm is based on modulations of the sensorimo-

tor rhythm (SMR), also called µ-rhythm. In most subjects, the SMR can be ob-

served in cortical EEG aer appropriate signal processing. e SMR can be used for

brain-computer-interfacing because it is modulated by the neural processing of mo-

tor commands, which can be controlled voluntarily by most subjects. In particular,

the µ-rhythm responds to motor commands that are only imagined and not actually

executed. is makes it possible to generate a control signal by thought alone.

Imaginedmovementscausetheso-called event-relateddesynchronisation(ERD)

[Pfurtscheller and Lopes da Silva ()] of neuron populations in the motor cortex

which lead to an attenuation of the SMR. e location of the SMR inside the brain

corresponds to the part of the body that is to be moved. Imagined movements of

the le and right hand cause ERD in the right and le motor cortex respectively. In

order to translate imagined movements into a control signal, the central task is to

distinguish between the spatial localisation of changes in the SMR. However, since

the SMR has low power relative to other brain sources (e.g. from the visual cortex)

and artefactual activity, it is not visible in the raw EEG signal, but requires spatio-

temporal ﬁltering to enhance the signal-to-noise ratio e.g. using the common spatial

patterns (CSP) algorithm [Koles (), Blankertz et al. ()].

Dataset and experimental setup

For our comparison to PCA and ICA in Section .. and the application to BCI

in Section .., we use a dataset from a study [Blankertz et al. ()] which in-

cluded only BCI-novices. From this study, we select the  subjects () who were

recorded in Berlin.

Between the calibration runs of imagined movement, partici-

pants were confronted with a computerized version of the d2-test

(Brickenkamp and Zillmer, 1998), which was part of an extended

psychological test battery (results will be presented elsewhere).

After the calibration, participants performed three runs each of

100 trials with BCI ‘feedback’(for some participants only one or two

runs were recorded due to fatigue (N=3) or lack of time (N=7)).

The ﬁrst 20 trials were used for adapting the bias of the classiﬁer (see

section ‘BBCI feedback’). Each trial of feedback started with a period

of 2 s with a black ﬁxation cross in the center of a gray screen. Then

an arrow appeared behind the cross to indicate the target direction

of that trial (left or right for motor imagery classes left hand and

right hand and downward for class foot) and 1 s later the cross turned

purple and started moving according to the classiﬁer output, as

described in section ‘BBCI feedback’. If the foot class was selected, the

cursor moved on a corner shaped trajectory: For left hand vs. foot,

e.g., the cursor moved from the center to the left of the screen for

detected left hand motor imagery, and the cursor moved from the

center downwards for detected foot imagery. After 4 s of cursor

movement the cross froze at the ﬁnal position and turned black

again. Two seconds later the cross was reset to the center position

and the next trial began. Hits or misses were counted according to

this ﬁnal position, but the score was only indicated during a break of

15 s after every block of 20 trials (see Fig. 3 for timing during

feedback runs).

BBCI feedback

The EEG signals of the calibration measurement were band-pass

ﬁltered in a subject-speciﬁc frequency band and temporally ﬁltered in

a subject-speciﬁc time interval. The start of that interval varied

between 480 ms and 2720 ms (median 930 ms) after cue onset and its

end varied between 1700 ms and 4990 ms (median 3810 ms). The

median length of the interval was 2730 ms (range 1000 ms to

3890 ms). Details on the heuristics underlying the selection of

frequency band and time interval can be found in Blankertz et al.

(2008b).Notethatalsosubject-optimizedspatialﬁlters were

determined by common spatial pattern (CSP) analysis (Blankertz

et al. 2008b). From these signals the log-variance was calculated in

each trial of the calibration data. This procedure resulted in a feature

vector with dimensionality equal to the number of (heuristically)

selected CSP ﬁlters (Blankertz et al., 2008b). To our experience, those

features can be classiﬁed well by linear methods like linear

discriminant analysis (LDA). Other paradigms and feature representa-

tions may however require non-linear modeling (see, e.g., Müller

et al., 2001; Müller et al., 2003).

For online operation, features were calculated every 40 ms with a

sliding window of 750 ms (applying CSP ﬁlters, band-pass ﬁltering,

calculating log-variance and applying the LDA classiﬁer, see Blankertz

et al. (2008b). The output of the classiﬁer was translated into cursor

movement in a rate control manner: At the beginning of each trial, the

cursor started in the center of the screen and every 40 ms a fraction of

the classiﬁer output was added to the current cursor position (see also

section ‘Experimental setup’). CSP ﬁlters calculated from the initial

calibration measurement were not adapted during online operation.

The bias of the linear classiﬁer was adapted on the basis of the data of

the ﬁrst 20 trials of each feedback run (see Krauledat et al., 2008). (By

bias we denote the term bin the separating hyperplane formulation

y=wx+bwhich can be used to adjust the output of a binary classiﬁer

toward one class or the other). These 20 trials were not included in the

calculation of the feedback performance.

Performance predictor

To construct our novel SMR predictor, only a short two minute

recording of EEG under the condition ‘relax with eyes open’using two

Laplacian channels (C3 and C4, calculated from nine original

monopolar channels) was required. For this purpose we used the

concatenated segments of the same condition during the baseline

recording (see section ‘Experimental setup’).

From these data we calculated the power spectral density (PSD) in

the Laplace-ﬁltered channels C3, C4 and determined for each of those

channels the maximum difference between the PSD curve and a ﬁtof

the 1/fnoise spectrum as explained below (cf. Fig. 4). These two

values were estimates of the strength of the SMR over the hand areas.

The SMR predictor was calculated as the average of those two values.

It quantiﬁed the potential for desynchronization of the SMRs at rest as

an indicator of SMR strength during feedback.

Fig. 2. Course of a calibration trial. 2 s ﬁxation cross, 4 s motor imagery cued by arrows (left, right, or down), and 2 s blank screen.

Fig. 3. Course of a feedback trial. 2 s ﬁxation cross (black), 1 s indication of target direction cued by arrows, 4 s feedback (cross turns magenta and moves according to BBCI classiﬁer),

2 s indication of result (cross turns black and freezes at its ﬁnal position).

1305B. Blankertz et al. / NeuroImage 51 (2010) 1303–1309

Figure .: Trial structure during calibration: ﬁrst, a ﬁxation cross appears

for s, then a visual cue instructs the user to imagine a certain movement for

s (here: le hand), which is followed by a s rest period. Figure taken from

[Blankertz et al. ()].

e EEG was recorded from the scalp with a multi-channel ampliﬁer (BrainAmp

DC by Brain Products, Munich, Germany) using  Ag/AgCl electrodes (reference

86 CHAPTER 5. SIMULATIONS AND APPLICATIONS

at nasion; manufacturer EasyCap, Munich, Germany) and sampled at  Hz with

a band-pass ﬁlter of . Hz to  Hz.

For each subject, a single session was conducted, divided into a calibration and

a feedback phase. During calibration, a visual cue prompted the subject to imagine

one of the following motor tasks: movement of le hand, right hand, and right foot

or both feet movement³. In total,  trials for each class were recorded; see Figure .

for an illustration of the trial structure. ese trials were used to train a classiﬁer on

bandpower features extracted with CSP. Aer the calibration phase, the participants

performed the same task with feedback, i.e. a visual cursor indicating the current

classiﬁer output.

5.3.2 Comparison to PCA and ICA

Before we look at the results of SSA, we ask the baseline question: could existing

methods do the job? Even though PCA and ICA optimise diﬀerent criteria, and

will in general ﬁnd diﬀerent bases than SSA, the important question in practice is

whether the solutions diﬀer signiﬁcantly enough to justify a new method.

On EEG data, it is not inconceivable, a priori, that the strongest distribution

changes are exhibited by independent brain sources. Moreover, the degree of non-

stationarity may be related to the average power. If that were the case, PCA or ICA

could provide us with components that are useful for analysing non-stationarities.

Before we proceed to results, let us clarify what it means for a basis to be “useful

for analysing non-stationarities”. Informally speaking, it implies that it separates the

stationary from the non-stationary signal contributions. In the ﬁrst step, we assess

this by considering the degree of non-stationarity of the individual basis elements

univariately. e sorted valuesprovideuswith a non-stationarity spectrum forabasis

on a particular data set. In terms of this spectrum, a basis is informative if it decays

rapidly and is uninformative if it is ﬂat. In this way, the interpretation is similar to

the eigenvalue spectrum of PCA. However, unlike PCA’s eigenvalue spectrum, the

non-stationarity spectrum does not sum to the same value for all orthogonal bases.

Whereas every basis captures the same total variance, the same is not true for the

non-stationarities. As we had seen in Chapter , a non-stationary data set can seem

completely stationary in a basis when the distribution changes are entirely conﬁned

to changes in the dependencies.

In a second step, we analyse whether a multivariate approach is necessary to ﬁnd

the most non-stationary subspace. To that end, we compare the subspace obtained

by combining the components found univariately to a subspace obtained by the joint

optimisation of its basis.

³e type of foot movement could be chosen by the subject.

5.3. APPLICATION TO EEG ANALYSIS 87

Setup and results

e general setup of our analysis is as follows. For each subject, we select all cali-

bration trails from the data set described in the previous section. Firstly, we apply a

PCA pre-processing to preserve  of the variance, which reduces the number of

dimensions from  input channels (a subset of electrodes) to around . en we

apply a bandpass ﬁlter between . and  Hz. No trial is rejected. e SSA epochs

are sliding windows of .s length with a  overlap. SSA is applied in deﬂationary

mode, unless otherwise stated.

We begin our analysis by looking at the results of a single subject. e non-

stationarity spectrum of PCA, ICA and SSA is shown in Figure .. e directions

are ordered by degree of non-stationarity (SSA objective function value). For ICA

and PCA, we applied the SSA objective function using the same epoch structure as

for SSA.

1 5 10 15 20 25 30 35 40

Non−stationarity

Directions sorted by non−stationarity

SSA

PCA

ICA

Artefact

5 10 20 30 40

Component 1

Component 17

Loose electrode (#1)

Alpha rhythm (#17)

Figure .: Non-stationary spectrum for SSA (red), PCA (blue) and ICA (green)

for one exemplary BCI subject. For each method, the found directions are ordered

(horizontal axis) by the degree of non-stationarity (vertical axis). e red circle

indicates that an SSA direction corresponds to an artefact in the EEG data. e

scalp maps visualise the basis of component  and , next to it are the frequency

spectra of the sources.

As we can see, all PCA projections (blue curve) pick up a similar amount of non-

stationary contributions. Moreover, we observe that the total non-stationarity cap-

tured by PCA is much less than both ICA and SSA. is suggests that signal power is

not strongly correlated with non-stationarity. e basis found by ICA (green curve)

picks up a similar amount of non-stationarity among its ﬁrst ﬁve directions to the

SSA solution (red curve). However, a closer inspection of the most non-stationary

sources found by SSA reveals that they are non-neural artefacts (indicated by the red

circles).

Since we are interested in ﬁnding and analysing brain sources, we remove the

artefacts in a pre-processing step. e results in the le panel of Figure . show that

88 CHAPTER 5. SIMULATIONS AND APPLICATIONS

1 5 10 15 20 25 30

Non−stationarity

Directions sorted by non−stationarity

SSA

PCA

ICA

1 5 10 15 20 25 30

Non−stationarity / variance

PCA directions sorted by variance

Variance

Non−stationarity

Figure .: Non-stationary spectrum aer artefact rejection: degrees of non-

stationarity for SSA (red), PCA (blue) and ICA (green) basis. Variance of PCA

directions vs. non-stationarity: the blue bars show the eigenvales, the red curve is

the degree of non-stationarity.

aer artefact rejection, PCA and ICA do not provide a basis that discerns stationary

from non-stationary contributions.

Even though PCA did not ﬁnd non-stationary directions relative to the SSA re-

sult, the variance of the PCA components may still be correlated with their strength

of non-stationarity. In the panel of Figure ., we plot the eigenvalues against the

degree of non-stationarity. However, we see that on this dataset, there is no associ-

ation between signal power and non-stationarity.

40%

50%

60%

70%

80%

PCA ICA

Total found non−stat. relative to SSA

0.1% 1% 10% 100%

10%

25%

50%

100%

Share of total variance

Share of total non−stat.

PCA component

Figure .: e le panel shows the distribution of the total non-stationarity

of the PCA and the ICA basis relative to the SSA basis over all subjects. In the

right panel, we plot for each PCA component its variance against its degree of

non-stationarity.

e analysis of all remaining  subjects reveals that these ﬁndings are consis-

tent over all subjects. Figure . shows the results aer artefact rejection. In the

5.3. APPLICATION TO EEG ANALYSIS 89

le panel, we see that SSA ﬁnds signiﬁcantly more non-stationary components than

PCA and ICA over all subjects. e right panel conﬁrms that there is no systematic

relationship between the power of a PCA component and its non-stationarity.

−4 −3 −2 −1 0 1 2

Degree of non−stationarity (log)

Frequency

25%

50%

75%

100%

Artefact likelihood

0.6

0.7

0.8

0.9

AUC

Figure .: Degree of non-stationary (horizontal axis) vs. artefact likelihood

(right vertical axis) for all  subjects. e plot combines a histogram over non-

stationarity strengths (blue bars, le vertical axis) with a curve (red) indicating

the artefact likelihood. e artefact likelihood is the percentage of artefactual di-

rections in a centered window of length .

e results for the single subject suggest that there is a relationship between the

strength of non-stationarity and the likelihood that anSSA component isartefactual.

In order to assess this systematically, we manually classify the SSA components of

all subjects into artefact/non-artefact.

Figure . shows the results. e histogram shows the distribution of the non-

stationarity over all 40·40 = 1600 SSA directions, along with an empirical estimate

of the conditional artefact likelihood (red curve). e shape of this curve indicates

that the non-stationarity is a strong predictor for artefacts; in fact, above a certain

threshold every component found in this dataset is artefactual. ese ﬁnding are

conﬁrmed by the AUC scores shown in the right panel of Figure .. Not only is the

degree of non-stationarity a good predictor across all directions of all subjects, but

it also performs well for most subjects individually: the median AUC over subjects

is above . and the  percentile well above ..

So far, we have evaluated and compared the SSA solution from the perspective

of the non-stationary spectrum. is is a univariate measure, which ignores the

changes in the correlations between the non-stationary variables. Correspondingly,

the non-stationary directions have been optimised univariately in deﬂation mode.

Whether a univariate approach to ﬁnding the non-stationary subspace is suf-

ﬁcient depends on the underlying data generating mechanisms. In order to verify

whether this is the case, we compare the non-stationarity of subspaces found by uni-

variate (deﬂation mode) and multivariate SSA. More speciﬁcally, for univariate SSA

we choose as a basis for the d-dimensional non-stationary subspace the top dmost

90 CHAPTER 5. SIMULATIONS AND APPLICATIONS

1 5 10 15 20 25

100%

150%

200%

250%

Number of non−stat. variables

Non−stationarity

SSA multivariate

SSA univariate

Random subspace

Non-stationary correlations

Figure .: Comparison of univariate (deﬂation mode) SSA vs. multivariate op-

timisation of a non-stationary subspace by SSA over subspaces of dimension one

to  (horizontal axis). e vertical axis shows the non-stationarity relative to the

median of a random subspace.

non-stationary directions. If there is no advantage to a multivariate optimisation,

then the degree of non-stationarity of the two subspaces should be similar.

Figure . shows the results for one subject. Multivariate SSA clearly outper-

forms deﬂation mode SSA: for a ten-dimensional subspace, the subspace found by

multivariate SSA is nearly twice as non-stationary. is indicates that there are

strong non-stationarities in the correlation between latent variables, which calls for

a multivariate approach to the analysis of distribution changes in EEG data. Even

though univariate solutions have the appealing property of being unique (no arbi-

trary choice of basis), in this example we see that they do not reveal the full picture.

A canonical way to arrive at a unique basis for a non-stationary subspace would be

a post-processing in the obtained source-space using deﬂationary SSA.

The strongest task-locked non-stationary direction

In the last part of this section, we show in an example that the most non-stationary

task-locked source found by SSA has a meaningful interpretation in terms of the

expected neural response of the task. Here, we apply SSA not to the concatenation

of all trials of the calibration phase, but use a sliding window within the trials of

all classes to deﬁne our epochs, i.e. our analysis is locked to the motor imager task.

e sliding window has length .s and an overlap of . At each position within

the epoch, we average the covariance matrix over all trials of the same class (le or

right).

In this task-locked setup, one would expect the strongest non-stationary source

to correspond to the attenuation of the SMR. Figure . shows that this is indeed

5.3. APPLICATION TO EEG ANALYSIS 91

Left Right

0s 2s 6s 8s

Time within trial at center of window

Power

left

right

5 10 20 30 40

Figure .: Illustration of the most non-stationary source within the trials of

classleandrightrespectively. etoppanelshowsthemedian powertimecourse

for both sources on the corresponding trials; the bottom panels show the scalp

pattern and frequency spectra.

the case, for both classes. e power time series shows the clear response to the

stimuli (prompt for imagination) s aer the beginning of the trial.

While this is an example speciﬁcally constructed to allow for comparison against

a ground truth, there exist completely unsupervised analysis setups where no la-

bels or other auxiliary information to guide the extraction of useful components

are available (e.g. in the analysis of spontaneous activity [Bießmann et al. (),

Biswal et al. ()] or resting state networks). It is in these cases that the most non-

stationary components may be precisely the ones that are most interesting, which,

as we have seen, PCA or ICA can not, in general, ﬁnd.

5.3.3 Non-stationarities in brain-computer-interfacing

A major impediment to the more widespread use of EEG-based brain-computer-

interfaces is their deteriorating performance over time, for which several adaptation

techniques have been proposed [Vidaurre et al. (), Shenoy et al. ()]. Dur-

ing the calibration phase, the subject is prompted to produce motor imagery which is

then used to train a classiﬁer. However, as time passes, these parameters can become

suboptimal. is may be due to changes in the mental state (e.g. increased tiredness

of the subject) or in response to changing conditions during the application phase,

where the environment is less controlled than during the calibration phase.

In this section, we present some ﬁrst steps towards investigating the role of non-

stationary brain sources using SSA. Our hypothesis is that (a) the generalisation

92 CHAPTER 5. SIMULATIONS AND APPLICATIONS

error can suﬀer from learning along directions where the distribution of the data

changes and (b) that we can identify these directions using the training set. at is,

we assume that the relevant non-stationary directions on the test set already show

non-stationary behaviour during the training condition. We test this hypothesis by

adding a pre-processing step to the standard oﬄine analysis in which we remove

non-stationary directions. e performance is evaluated on the test set.

Setup and results

e aim of this SSA pre-processing is to remove non-stationary directions, while not

removing too much class-discriminative information. Clearly, this is a trade-oﬀ. In

order to keep the analysis simple, we merely choose an epoch structure which re-

ﬂects this: each SSAepochconsists of two trials, onefromeachclass, chosenchrono-

logically. Moreover, in order to detect those non-stationarities most likely to aﬀect

the CSP+LDA solution, we bandpass ﬁlter the data to the most discriminative fre-

quency band, selected by the standard heuristics [Blankertz et al. ()]. e most

non-stationary directions are found using SSA in deﬂation mode in order to allow

for a step-wise procedure and an interpretation of individual components.

0 2 4 6 8 10 12 14 16 18 20 22 24 26

20%

30%

40%

50%

Number of removed directions

Error rate

SSA

Random projection

5 10 20 30 40

Component #7 Component #26

Figure .: Results of the SSA pre-processing for one subject (S) over varying

numbers of removed non-stationary directions (horizontal axis); the vertical axis

shows the error rate on the test set. e error tube extends from the  to the 

over resamples of the test data; the two scalp plots correspond to the directions

that were removed in the step indicated by the black circle.

From the  subjects of the VitalBCI study, we select the bottom  subjects

in terms of classiﬁcation accuracy on the test set. It is for this group of subjects

5.3. APPLICATION TO EEG ANALYSIS 93

that we conjecture that distribution changes have a role to play: despite strongly

discriminative CSP components found on the training set, the performance on the

test set is sub-standard.

S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12

−50%

−30%

−10%

+10%

Subjects

Relative change in error rate

SSA

Random projections

Figure .: Results of the SSA pre-processing over all  selected subjects. e

vertical axis shows the change in test error rate relative to the baseline; for each

subject (horizontal axis) we report its distribution (blue boxplot) over removing

the top k= 1, . . . , 25 most non-stationary directions compared to the eﬀect of

removing the same number of random directions (black boxplot).

e crucial parameter of the SSA pre-processing (apart from the epoch struc-

ture)is howmany resp. which non-stationary directionsshould be removed. Clearly,

at some point we will inevitably remove important discriminative information.

e results for one subject (S) are shown in Figure .. We see that aer

removing the -th most non-stationary direction, the error rate drops from 

(chance level) to below . e scalp plot of the removed component suggests that

we have removed the α-rhythm, whose strength is associated with the level of fa-

tigue or tiredness. is source overlaps with the discriminative SMR, both spatially

(in electrode space) and in terms of frequency content. It is therefore likely that the

power of the α-rhythm contributes to the feature vectors. If the power of the α-

rhythm changes between training and test (e.g. due to increase tiredness), then this

may lead to suboptimal performance on the test set.

As we continue to remove non-stationary directions, the error rate eventually

drops below . However, with the removal of the -th most non-stationary

source, the error rate increases strongly. As we can see from the scalp plot, the pat-

tern resembles the characteristic pattern of the µ-rhythm associated with the right

motor cortex. By removing it, we remove discriminative information which is re-

ﬂected in the increased error rate.

e results over all subjects are shown in Figure . and Figure .. For each

subject, we remove the top kmost non-stationary components and report the dis-

94 CHAPTER 5. SIMULATIONS AND APPLICATIONS

35% 40% 45% 50%

35%

40%

45%

50%

Baseline test error

Test error after SSA pre−proc.

Figure .: Evaluation of the SSA pre-processing for BCI. Each dot corresponds

to a subject; the black dots show the median test error rate over the top 1, . . . , 25

most non-stationary directions. e red dots show the test error rate for an opti-

mal choice of this parameter.

tribution of results over all kfrom 1to 25. Apart from the three subjects S, S and

S, the median eﬀect of the SSA pre-processing is a reduction in error rate. For the

subjects S and S, the most non-stationary direction found in the ﬁrst step is highly

correlated with the label. An expert manually evaluating each direction on the train-

ing set would have detected this (characteristic scalp plot and frequency spectrum);

however, in our automatic procedure, this leads to a drop in performance.

Whether this pre-processing can be turned into a practical approaching for BCI

depends on whether the non-stationary directions that are to be removed can be se-

lected automaticallybysomecriterion. Potential approachesincludecross-validation

on the training set; a heuristic combining the degree of non-stationarity with a mea-

sure of discriminatory power; or an automatic data-driven classiﬁcation of harmful

vs. benign distribution changes based on features of the components.

However, what this analysis has shown is that there exist non-stationary compo-

nents which are highly detrimental to BCI performance for some subjects (subject

 improved from  (chance level) down to almost  error rate) and that SSA

oﬀers a means of ﬁnding them.

5.3.4 Recovering auditory steady-state evoked potentials

In this section, we use SSA to recover a brain source that has been induced by an

experimental stimulus, namely the auditory steady-state evoked potential (ASSEP).

is evoked potential is generated in the brain in response to an auditory stimu-

lus which occurs at a constant frequency. It has been shown that such a steady-

state stimulus evokes a response in the auditory system at the stimulation frequency

5.3. APPLICATION TO EEG ANALYSIS 95

Pattern

10 20 30 40 50 60 70 80

Spectrum

500 1000 1500 2000 2500 3000

40 Hz power time course vs. target function

Power

Time index

Target function

SSA component

Stimulation frequency

Figure .: Component found by SSA that has the highest correlation with

the target (stimulus). e top le panel shows the scalp plot (basis vector) cor-

responding to the source; the top right panel shows its frequency content. e

bottom panel shows the power time course of the SSA component (green) and

the target function (blue).

[Plourde et al. (), Galambos et al. (), Picton et al. ()]. Moreover, the

amplitude of the ASSEP has been found to correlate to the intensity of the steady-

state stimulus [Rodriguez et al. ()]. us, by modulating the intensity of the

stimulus, we can induce a non-stationary brain source.

We analyse a dataset presented by [Dähne et al. ()] and follow a similar

evaluation strategy. In this experiment, a  Hz pure tone is amplitude modu-

lated by a  Hz cosine. In order to continuously vary the intensity of the stimulus,

it is multiplied by a slowly varying function which modulates the loudness between

 and  dB relative to the subject-speciﬁc hearing level. e EEG signals were

recorded using  wet Ag/AgCl electrodes (Fast’n Easy Cap, EasyCap GmbH) placed

according to the - system. Signals were ampliﬁed using two -channel ampli-

ﬁers (Brain Products), sampled at  kHz and ﬁltered by an analog bandpass ﬁlter

between . and  Hz. For the oﬄine analysis, the signals were down-sampled to

 Hz and a notch ﬁlter around  Hz was applied to attenuate line noise.

We analyse a block of three minutes of continuous stimulation. SSA is applied

to ﬁnd the most non-stationary sources in deﬂation mode; epochs are deﬁned by

a sliding window of .s length and step .s. In a pre-processing step, the data is

bandpassﬁltered to awindowaround theexpected Hzpeak; the widthofthiswin-

dow is varied during the analysis. Each obtained component is evaluated in terms of

96 CHAPTER 5. SIMULATIONS AND APPLICATIONS

2 4 6 8 10 12 14 16 18 20 None

0.1

0.2

0.3

0.4

0.5

0.6

Width of filter window centered at 40Hz

Correlation with target

SSA

PCA

ICA

Figure .: Comparison of PCA, FastICA and SSA in recovering the auditory

steady-statecomponent. ehorizontalaxis showsthemaximumcorrelationwith

the target signal (the stimulus) over all components of each method for varying

breadths of bandpass ﬁlters around  Hz.

its correlation to the intensity modulation time series. We compare the components

found by SSA with the PCA and FastICA [Hyvärinen ()] solutions.

Figure . shows the SSA direction with the highest correlation to the tar-

get found without bandpass ﬁltering (corresponding to the right most bars in Fig-

ure .). e scalp plot shows the characteristic broad central pattern. e power

spectrum (top right panel) exhibits a clear peak at  Hz, the stimulation frequency;

and in the bottom panel, we see that the intensity modulation is correlated with the

power of the SSA component in a small window around  Hz.

Figure . shows the results of the comparison of SSA against PCA and Fas-

tICA. e bars show the maximum correlation coeﬃcient over all components for

each method. Overall, SSA signiﬁcantly outperforms PCA and FastICA, probably

because it is the only method that maximises a criterion which is related to the char-

acteristics of this source.

5.4 Summary and Discussion

e simulations on synthetic data revealed the fundamental relationship between

the performance of SSA and basic parameters of the data. One result stands out: es-

timation errors in individual epochs average out across epochs. is is good news for

the practitioner, who is oen keen to increase the temporal resolution and capture

as much variation as possible, which means making the epochs smaller.

Another important outcome is the invalidation of the likelihood ratio test for

real, i.e. non-gaussian, data. On the other hand, a test based on resampling per-

5.4. SUMMARY AND DISCUSSION 97

formed remarkably well. Even so, more robust analytic test procedures would be

highly desirable to alleviate the computational burden of resampling. A straightfor-

ward approach might be to insert a more heavy-tailed distribution into the likeli-

hood ratio formalism or to expand the objective function around higher-order mo-

ments.

A comparison of the standard and the algebraic SSA algorithm conﬁrmed that

the latter is advantageous in a niche setting: large number of epochs with a high

number of samples in low dimension. Even though this is probably exactly the op-

posite of what one expects in practice, the results do highlight the attractive features

of the algebraic approach. Several possible approaches towards reducing its compu-

tational cost were discussed in Chapter .

e results on EEG analysis have shown clearly that PCA and ICA are not use-

ful for understanding non-stationarities in neural activity. e non-stationary di-

rections that ICA ﬁnds are artefacts, not brain sources. Secondly, a multivariate

approach is necessary to ﬁnd the most non-stationary subspace which captures the

non-stationarities in the dependency structure. In this respect, SSA diﬀers from

most linear latent variable methods in that it ﬁnds a subspace and not a set of indi-

vidual components. While we do not oﬀer an interpretation of the non-stationary

subspace, this clearly is an exciting direction for future research.

In two settings we have demonstrated that SSA ﬁnds solutions corresponding

to a ground truth: the most non-stationary sources in task-locked setting from BCI

clearly corresponds to the intentionally modulated µ-rhythm; and SSA successfully

ﬁnds the auditory steady state evoked potential, which PCA and ICA do not. ese

ﬁndingshighlightthe abilityofSSA to extract meaningful componentsincompletely

unsupervised settings where the relevant components are those that exhibit distri-

bution changes.

e application to BCI analysis showed that in some subjects, non-stationary

brain sources are highly detrimental to performance. Removing them in a pre-

processing step prior to calibration yielded reductions in error rates of almost 

percentage points on some subjects. While it is unclear whether SSA pre-processing

can be turned into a practical automatic step that is beneﬁcial to a suﬃciently large

number of subjects, the results show that (a) distribution changes can have a strong

inﬂuence and (b) that SSA is the right tool for analysing them.

98 CHAPTER 5. SIMULATIONS AND APPLICATIONS

Chapter 6

Conclusion

Analysing change in observed data in response to the change in a controlled vari-

able is a quintessential task in empirical research. As such, there exists a wide range

of statistical methods which address this problem. e common approach in the

multivariate case is to ﬁnd a direction in the data space associated with the change

in the controlled variable. For instance, linear discriminant analysis [Fisher ()]

ﬁnds a direction along which the distribution of the data is maximally diﬀerent be-

tween two sets of samples. In EEG analysis, common spatial patterns [Koles ()]

ﬁnd linear projections that allow us to distinguish between two experimental con-

ditions. However, the applicability of these families of methods crucially depend on

the availability of an externally controlled variable, or label. is is not always avail-

able. In this respect, SSA closes a gap because it allows for the unsupervised analysis

of change in data.

Summary of the main results

A simple example in Chapter  shows that in order to analyse changes in the joint

distribution of a multivariate data set, a univariate perspective is, in principle, in-

suﬃcient. In essence, this is because changes in the dependencies between variables

are not necessarily visible in the marginals. e two-variable example also showed

the existence of another coordinate system, better suited to our task: it separates the

stationary from the non-stationary components. is observation sets the theme

for this thesis: ﬁnding such advantageous coordinate systems.

In the next Chapter  we equip ourselves with the necessary background for this

task. We review fundamentals from mathematics, the basic framework of coordi-

nate transformations in machine learning and statistics, and the two most promi-

nent methods: principal component analysis (PCA) and independent component

analysis (ICA).

Chapter  contains our main contribution, the formulation of the stationary

subspace analysis problem and the ﬁrst algorithm to solve it. We start by introduc-

ing a generative model, discuss its identiﬁability, and deﬁne our notion of station-

arity: time-constant mean and covariance matrix. Corresponding to this deﬁnition,

100 CHAPTER 6. CONCLUSION

we derive a measure of stationarity based on epochs of the given time series, which

leads to an eﬃcient algorithm that minimises or maximises this objective to ﬁnd the

stationary and non-stationary components respectively. We discuss the problem of

spurious solutions and present a theoretical result which tells us precisely how much

data is needed to avoid them in practice.

In Chapter , we develop an alternative formulation of the SSA problem in terms

of an approximate solution to a system of polynomial equations. In contrast to the

SSA algorithm developed in the previous chapter, we do not search for the solution

guided by the gradient of an objective function but directly solve the problem al-

gebraically. is leads to more accurate solutions in certain scenarios; and allows

us to choose the number of stationary directions from data by computing a certain

eigenvalue spectrum.

Chapter  is concerned with the validation of the SSA algorithm on synthetic

data and in applications to EEG data analysis. In controlled simulations, we analyse

the inﬂuence of key parameters on the performance of the SSA algorithm, compare

two approaches for stationarity testing, and demonstrate the relative merits of the

algebraic SSA algorithm.

In three applications to EEG analysis, we show that SSA provides us with gen-

uinely new and useful results. First of all, we demonstrate that neither PCA nor

ICA allow us to uncover the non-stationarity directions in EEG data. In particu-

lar, we also show that non-stationarities are, loosely speaking, multivariate in EEG

data: the most non-stationary subspace is not the combination of the individually

maximally non-stationary directions, but can only be found by joint optimisation

of several projections. Moreover, we show that the most non-stationary direction

in a brain-computer-interfacing context has a meaningful interpretation, in that it

corresponds to the intentional attenuation of the SMR.

Secondly, we investigate the role of non-stationarities in brain-computer-inter-

facing. It has oen been hypothesised that distribution changes are one of the fac-

tors behind deteriorating performance over time as parameters calibrated during

the training sessions may be suboptimal in the long term, should the distribution

of the EEG data change due to environmental inﬂuences or changes in the brain

state. To this end, we investigate whether removing non-stationary directions (esti-

mated during calibration) can be beneﬁcial for performance. Our results show that

for some subjects, this indeed leads to a dramatic increase in performance. While

it is unclear whether this pre-processing can be turned into an automated practi-

cal method (due to diﬃculties in parameter choice), this analysis demonstrates that

SSA is a useful new tool in this context.

In the third application, we show that SSA can recover a neural component from

EEG signals which PCA and ICA do not ﬁnd. e auditory steady state evoked po-

tential is the response to an auditory stimulus which occurs at a steady frequency

(beeps). is power of this response has been found to correlate with the volume

(intensity) of the stimulus. By varying the stimulus intensity, we can induce a brain

101

response (at the stimulus frequency) that has non-stationary variance. erefore,

SSA should be able to ﬁnd it which is indeed the case. We compare the performance

of SSA, PCA and ICA in terms of the correlation with the stimulus intensity: as it

turns out, neither PCA nor ICA can ﬁnd components that are signiﬁcantly corre-

lated.

Limitations and future directions

e most relevant limitations of the presented SSA algorithm are related to its under-

lying deﬁnition of stationarity: changes in the higher order moments go unnoticed

and the time structure within the epochs is discarded. While an extension in the

former direction probably comes at a high price (in terms of sample and compu-

tational complexity), integrating time structure should be possible along the lines

of TDSEP. As many real sources are endowed with temporal structure, this should

bring about signiﬁcant improvements.

Another important question is: how to choose the number of stationary resp.

non-stationary sources? We have proposed two answers: post-hoc stationarity test-

ing using resampling, or inspecting the PCA eigenvalue spectrum in coeﬃcient vec-

tor space It would be interesting to investigate an approach based on a model se-

lection framework [Akaike (), Schwarz (), Rissanen ()] or a Bayesian

formulation that allows us to compute the posterior over possible numbers of sta-

tionary directions.

e applications to EEG analysis presented in this chapter is a very ﬁrst step. Fu-

ture research will investigate the types and potential causes of non-stationary neu-

ral components, both in task-locked and paradigmless settings, possibly in com-

bination with source reconstruction techniques [Scherg and Von Cramon (),

Herrmann et al. (), Haufe et al. ()]. In the EEG context, it may be pos-

sible to interpret the non-stationary subspace (as opposed to individual compo-

nents) using physiological constraints that reduce the number of plausible bases

[Schmidt (), Mosher and Leahy ()]. Moreover, our results showed that the

degree of non-stationarity is related to artefact-likelihood. Whether this can be

turned into a practical artefact rejection method needs to be evaluated.

Where else can the algebraic approach be advantageous? In short, whenever the

desired solution is a set of polynomials of a certain type (in the SSA case: linear pro-

jections) whose approximate vanishing set is given in terms of another set of input

polynomials, usually of higher degree. at is, both input and output are polynomi-

als and the objective of the learning task needs to be describable in terms of equa-

tions. is is, for example, the case in the simultaneous diagonalisation of several

matrices [Cardoso and Souloumiac (), Bunse-Gerstner et al. ()], whereone

aims to make the oﬀ-diagonal elements equal to zero by applying a linear basis trans-

formation.

102 CHAPTER 6. CONCLUSION

Bibliography

[Abdi ()] H. Abdi. Factor rotations in factor analyses. Encyclopedia for

Research Methods for the Social Sciences. Sage: ousand Oaks, CA, pages

–, .

[Akaike ()] H. Akaike. A new look at the statistical model identiﬁcation. Au-

tomatic Control, IEEE Transactions on, ():–, .

[Allen et al. ()] E.A Allen, E.B Erhardt, E Damaraju, W Gruner, J.M Segall, R.F

Silva, M Havlicek, S Rachakonda, J Fries, and R Kalyanam. A baseline for

the multivariate comparison of resting-state networks. Front Syst Neurosci, ,

.

[Bell and Sejnowski ()] A. J. Bell and T. J. Sejnowski. An information-

maximization approach to blind separation and blind deconvolution. Neural

Computation, ():–, November .

[Belouchrani et al. ()] A. Belouchrani, K. Abed-Meraim, J.F. Cardoso, and

E. Moulines. A blind source separation technique using second order statis-

tics. IEEE Trans. on Signal Processing, ():–, .

[Bengio et al. ()] Y. Bengio, J.F. Paiement, P. Vincent, O. Delalleau, N. Le Roux,

and M. Ouimet. Out-of-sample extensions for lle, isomap, mds, eigenmaps,

and spectral clustering. Advances in neural information processing systems, :

–, .

[Berger ()] Hans Berger. Über das elektroencephalogramm des menschen.

Archiv fur Psychiatrie und Nervenkrankheiten, :–, .

[Bießmann et al. ()] Felix Bießmann, Sergey M Plis, Frank C Meinecke, Tom

Eichele, and Klaus-Robert Müller. Analysis of multimodal neuroimaging

data. Biomedical Engineering, IEEE Reviews in, : – , .

[Biswal et al. ()] B Biswal, F Z Yetkin, V M Haughton, andJ S Hyde. Functional

connectivity in the motor cortex of resting human brain using echo-planar

mri. Magn Reson Med, ():–, Oct .

104 BIBLIOGRAPHY

[Blanchard et al. ()] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny,

and K.R. Müller. In search of non-gaussian components of a high-

dimensional distribution. e Journal of Machine Learning Research, :

–, .

[Blankertz et al. ()] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and

K.R. Muller. Optimizing spatial ﬁlters for robust eeg single-trial analysis.

Signal Processing Magazine, IEEE, ():–, .

[Blankertz et al. ()] Benjamin Blankertz, Claudia Sannelli, Sebastian Halder,

Eva M. Hammer, Andrea Kübler, Klaus-Robert Müller, Gabriel Curio, and

orsten Dickhaus. Neurophysiological predictor of SMR-based BCI perfor-

mance. NeuroImage, ():–, .

[Blythe et al. ()] Duncan A. J. Blythe, Paul von Bünau, Frank C. Meinecke,

and Klaus-Robert Müller. Feature extraction for change-point detection us-

ing stationary subspace analysis. IEEE Transactions on Neural Networks and

Learning Systems, ():–, .

[Bunse-Gerstner et al. ()] A. Bunse-Gerstner, R. Byers, and V. Mehrmann.

Numerical methods for simultaneous diagonalization. SIAM Journal on Ma-

trix Analysis and Applications, :–, .

[Caboara et al. ()] Massimo Caboara, Pasqualina Conti, and Carlo Traverse.

Yet another ideal decomposition algorithm. In Teo Mora and Harold Matt-

son, editors, Applied Algebra, Algebraic Algorithms and Error-Correcting

Codes, volume  of Lecture Notes in Computer Science, pages –.

Springer Berlin / Heidelberg, .

[Cardoso and Souloumiac ()] Jean-François Cardoso and Antoine

Souloumiac. Blind beamforming for non gaussian signals. IEE Proceedings-F,

:–, .

[Cardoso ()] J.F. Cardoso. High-order contrasts for independent component

analysis. Neural computation, ():–, .

[Cardoso and Souloumiac ()] J.F. Cardoso and A. Souloumiac. Jacobi angles

for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Ap-

plications, ():–, .

[Caton ()] Richard Caton. e electric currents of the brain. British Medical

Journal, ():, .

[Corless et al. ()] R.M. Corless, Patrizia M. Gianni, Barry M. Trager, and S.M.

Watt. e singular value decomposition for polynomial systems. Proc. ISSAC

’, pages –, .

BIBLIOGRAPHY 105

[Cover et al. ()] T.M. Cover, J.A. omas, J. Wiley, et al. Elements of informa-

tion theory, volume . Wiley Online Library, .

[Cox et al. ()] David A. Cox, John Little, and Donal O’Shea. Ideals, Varieties,

and Algorithms: An Introduction to Computational Algebraic Geometry and

Commutative Algebra, /e (Undergraduate Texts in Mathematics). Springer-

Verlag New York, Inc., Secaucus, NJ, USA, . ISBN .

[Cro and Barry ()] R.J. Cro and RJ Barry. Removal of ocular artifact from

the eeg: a review. Neurophysiologie Clinique/Clinical Neurophysiology, ():

–, .

[Dähne et al. ()] Sven Dähne, Vadim Nikulin, Frank Meinecke, Stefan Haufe,

Johannes Höhne, Michael Tangermann, and Klaus-Robert Müller. Optimal

spatial ﬁlters for correlating band power with cognitive function, .

[Dornhege ()] G. Dornhege. Toward brain-computer interfacing. e MIT

Press, .

[Eisenbud et al. ()] David Eisenbud, Craig Huneke, and Wolmer Vasconcelos.

Direct methods for primary decomposition. Inventiones Mathematicae, :

–, . ISSN -.

[Fatourechi et al. ()] M. Fatourechi, A. Bashashati, R.K. Ward, and G.E. Birch.

Emg and eog artifacts in brain computer interface systems: A survey. Clinical

Neurophysiology, ():–, .

[Fisher ()] R.A. Fisher. e use of multiple measurements in taxonomic prob-

lems. Annals of Human Genetics, ():–, .

[Friedman and Tukey ()] J.H. Friedman and J.W. Tukey. A projection pursuit

algorithm for exploratory data analysis. Computers, IEEE Transactions on,

():–, .

[Froeberg ()] Ralf Froeberg. An inequality for hilbert series of graded algebras.

Math. Scand., : – , .

[Froeberg and Hollman ()] Ralf Froeberg and Joachim Hollman. Hilbert se-

ries for ideals generated by generic forms. Journal of Symbolic Computation,

(): – , . ISSN -.

[Galambos et al. ()] R. Galambos, S. Makeig, and P.J. Talmachoﬀ. A -hz au-

ditory potential recorded from the human scalp. Proceedings of the national

academy of sciences, ():, .

[Gauß ()] Carl Friedrich Gauß. eoria motus corporum coelestium in sec-

tionibus conicis solem ambientium. Göttingen, .

106 BIBLIOGRAPHY

[Gianni et al. ()] Patrizia Gianni, Barry Trager, and Gail Zacharias. Groebner

bases and primary decomposition of polynomial ideals. Journal of Symbolic

Computation, (-): – , . ISSN -.

[Golub and Van Loan ()] G.H. Golub and C.F. Van Loan. Matrix computa-

tions, volume . Johns Hopkins Univ Pr, .

[Gretton et al. ()] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, and

A. Smola. A kernel two-sample test. e Journal of Machine Learning Re-

search, :–, .

[Guyon and Elisseeﬀ ()] Isabelle Guyon and Andre Elisseeﬀ. An introduction

to variable and feature selection. Journal of Machine Learning Research, :

–, Mar .

[Hara et al. ()] Satoshi Hara, Yoshinobu Kawahara, Takashi Washio, and Paul

von Bünau. Stationary subspace analysis as a generalized eigenvalue problem.

In Proceedings of the th international conference on Neural information pro-

cessing: theory and algorithms - Volume Part I, ICONIP’, pages –,

Berlin, Heidelberg, . Springer-Verlag.

[Hara et al. ()] Satoshi Hara, Yoshinobu Kawahara, Takashi Washio, Paul von

Bünau, Terumasa Tokunaga, and Kiyohumi Yumoto. Separation of stationary

and non-stationary sources with a generalized eigenvalue problem. Neural

Networks, :–, . ISSN -.

[Harman ()] H.H. Harman. Modern factor analysis. University of Chicago

Press, .

[Hastie and Stuetzle ()] T. Hastie and W. Stuetzle. Principal curves. Journal of

the American Statistical Association, pages –, .

[Haufe et al. ()] S. Haufe, V.V. Nikulin, A. Ziehe, K.R. Müller, and G. Nolte.

Combining sparsity and rotational invariance in eeg/meg source reconstruc-

tion. NeuroImage, ():–, .

[Heldt et al. ()] Daniel Heldt, Martin Kreuzer, Sebastian Pokutta, and Hennie

Poulisse. Approximate computation of zero-dimensional polynomial ideals.

Journal of Symbolic Computation, (): – , . ISSN -.

In Memoriam Karin Gatermann.

[Hermann ()] Grete Hermann. Die Frage der endlich vielen Schritte in der

eorie der Polynomideale - unter Benutzung nachgelassener S�tze von K.

Hentzelt. Mathematische Annalen, (): – , . ISSN -.

BIBLIOGRAPHY 107

[Herrmann et al. ()] M.J. Herrmann, J. Römmler, A.C. Ehlis, A. Heidrich, and

A.J. Fallgatter. Source localization (loreta) of the error-related-negativity

(ern/ne) and positivity (pe). Cognitive Brain Research, ():–, .

[Hinton and Roweis ()] G. Hinton and S. Roweis. Stochastic neighbor embed-

ding. Advances in neural information processing systems, :–, .

[Huber ()] P.J. Huber. Projection pursuit. e annals of Statistics, pages

–, .

[Hyvarinen ()] A. Hyvarinen. Fast and robust ﬁxed-point algorithms for in-

dependent component analysis. Neural Networks, IEEE Transactions on, 

():–, .

[Hyvärinen et al. ()] A. Hyvärinen, J. Karhunen, and E. Oja. Independent com-

ponent analysis, volume . Wiley-interscience, .

[Hyvärinen ()] Aapo Hyvärinen. Fast and robust ﬁxed-point algorithms for

independent component analysis. IEEE Transactions on Neural Networks, 

():–, May . ISSN .

[Jaynes ()] E. T. Jaynes. Information theory and statistical mechanics. Physical

Review, (–), .

[Jaynes and Bretthorst ()] E.T. Jaynes and G.L. Bretthorst. Probability theory:

the logic of science. Cambridge Univ Pr, .

[Jung et al. ()] T.P. Jung, S. Makeig, C. Humphries, T.W. Lee, M.J. Mckeown,

V. Iragui, and T.J. Sejnowski. Removing electroencephalographic artifacts by

blind source separation. Psychophysiology, ():–, .

[Jung et al. ()] T.P. Jung, S. Makeig, M.J. McKeown, A.J. Bell, T.W. Lee, and T.J.

Sejnowski. Imaging brain dynamics using independent component analysis.

Proceedings of the IEEE, ():–, .

[Kaiser ()] H.F. Kaiser. e varimax criterion for analytic rotation in factor

analysis. Psychometrika, ():–, .

[Kaltofen et al. ()] E. Kaltofen, Z. Yang, and L. Zhi. Structured low rank ap-

proximation of a sylvester matrix. Symbolic-numeric computation, pages

–, .

[Kandel et al. ()] E.R. Kandel, J.H. Schwartz, T.M. Jessell, et al. Principles of

neural science, volume . McGraw-Hill New York, .

108 BIBLIOGRAPHY

[Kawanabe et al. ()] Motoaki Kawanabe, Wojciech Samek, Paul von B¸nau,

and Frank Meinecke. An information geometrical view of stationary sub-

space analysis. In Artiﬁcial Neural Networks and Machine Learning – ICANN

, volume  of Lecture Notes in Computer Science, pages –.

Springer Berlin / Heidelberg, .

[Kiraly et al. ()] Franz J. Kiraly, Paul von Bünau, Frank C. Meinecke, Duncan A. J.

Blythe, and Klaus-Robert Müller. Algebraic geometric comparison of prob-

ability distributions. Journal of Machine Learning Research, :–, .

[Kiraly et al. ()] Franz J. Kiraly, Paul von Bünau, Jan Saputra Müller, Duncan

A. J. Blythe, Frank C. Meinecke, and Klaus-Robert Müller. Regression for

sets of polynomial equations. In JMLR Workshop and Conference Proc. Vol.

, pages –, .

[Kohonen ()] T. Kohonen. e self-organizing map. Proceedings of the IEEE,

():–, .

[Koles ()] Z.J. Koles. e quantitative extraction and topographic mapping of

the abnormal components in the clinical eeg. Electroencephalography and

Clinical Neurophysiology, ():–, .

[Kreuzer et al. ()] Martin Kreuzer, Hennie Poulisse, and Lorenzo Robbiano.

Approximate Commutative Algebra, chapter From Oil Fields to Hilbert

Schemes. Springer-Verlag Berlin Heidelberg, .

[Krick and Logar ()] Teresa Krick and Alessandro Logar. An algorithm for the

computation of the radical of an ideal in the ring of polynomials. In Harold

Mattson, Teo Mora, and T. Rao, editors, Applied Algebra, Algebraic Algorithms

and Error-Correcting Codes, volume  ofLecture Notes in Computer Science,

pages –. Springer Berlin / Heidelberg, .

[Laplagne ()] Santiago Laplagne. An algorithm for the computation of

the radical of an ideal. In Proceedings of the  international sym-

posium on Symbolic and algebraic computation. ACM, . URL

http://dx.doi.org/10.1145/1145768.1145802.

[Legendre ()] Adrien-Marie Legendre. Nouvelles méthodes pour la détermina-

tion des orbites des comètes, chapter Sur la methode des moindres quarres.

Firmin Didot, http://imgbase-scd-ulp.u-strasbg.fr/displayimage.php?pos=-

, .

[Makeig et al. ()] S. Makeig, A.J. Bell, T.P. Jung, T.J. Sejnowski, et al. Indepen-

dent component analysis of electroencephalographic data. Advances in neural

information processing systems, pages –, .

BIBLIOGRAPHY 109

[McMenamin et al. ()] B.W. McMenamin, A.J. Shackman, J.S. Maxwell,

D.R.W. Bachhuber, A.M. Koppenhaver, L.L. Greischar, and R.J. Davidson.

Validation of ica-based myogenic artifact correction for scalp and source-

localized eeg. Neuroimage, ():–, .

[Meinecke et al. ()] Frank C. Meinecke, Paul von Bünau, Motoaki Kawanabe,

and Klaus-Robert Müller. Learning invariances with stationary subspace

analysis. In Computer Vision Workshops (ICCV Workshops),  IEEE th

International Conference on, pages  –, .

[Meinecke ()] Frank Carsten Meinecke. Synchronized? Iden-

tifying Interactions from Superimposed Signals. PhD the-

sis, Berlin Institute of Technology (TU Berlin), . URL

http://opus.kobv.de/tuberlin/volltexte/2012/3386/.

[Molgedey and Schuster ()] L. Molgedey and H. G. Schuster. Separation of a

mixture of independent signals using time delayed correlations. Phys. Rev.

Lett., ():–, Jun .

[Mood et al. ()] A.M. Mood, F.A. Graybill, and D.C. Boes. Introduction to the

theory of statistics. McGraw-Hill Book Company, .

[Mosher and Leahy ()] J.C. Mosher and R.M. Leahy. Source localization us-

ing recursively applied and projected (rap) music. Signal Processing, IEEE

Transactions on, ():–, .

[Neyman and Pearson ()] J. Neyman and E.S. Pearson. On the problem of the

most eﬃcient tests of statistical hypotheses. Philosophical Transactions of the

Royal Society of London. Series A, Containing Papers of a Mathematical or

Physical Character, :–, .

[Park et al. ()] H. Park, L. Zhang, and J.B. Rosen. Low rank approximation of

a hankel matrix by structured total least norm. BIT Numerical Mathematics,

():–, .

[Parra et al. ()] L.C. Parra, C.D. Spence, A.D. Gerson, and P. Sajda. Recipes for

the linear analysis of eeg. Neuroimage, ():–, .

[Pearson ()] K Pearson. On lines and planes of closest ﬁt to systems of points

in space. Philosophical Magazine, :–, .

[Pfurtscheller and Lopes da Silva ()] G. Pfurtscheller and FH Lopes da Silva.

Event-related eeg/meg synchronization and desynchronization: basic princi-

ples. Clinical neurophysiology, ():–, .

110 BIBLIOGRAPHY

[Picton et al. ()] T.W. Picton, M.S. John, A. Dimitrijevic, and D. Purcell. Hu-

man auditory steady-state responses: respuestas auditivas de estado estable

en humanos. International journal of audiology, ():–, .

[Plourde et al. ()] G. Plourde, D.R. Stapells, and T.W. Picton. e human au-

ditory steady-state evoked potentials. Acta Oto-Laryngologica, (S):

–, .

[Plumbley ()] Mark D. Plumbley. Geometrical methods for non-negative ICA:

Manifolds, Lie groups and toral subalgebras. Neurocomputing, (–),

.

[Priestley ()] M.B. Priestley. Spectral Analysis and Time Series. AcademicPress,

.

[Quionero-Candela et al. ()] J. Quionero-Candela, M. Sugiyama,

A. Schwaighofer, and N.D. Lawrence. Dataset shi in machine learn-

ing. e MIT Press, .

[Rahimi and Recht ()] A. Rahimi and B. Recht. Weighted sums of random

kitchen sinks: Replacing minimization with randomization in learning. Ad-

vances in Neural Information Processing Systems (NIPS), .

[Raichle et al. ()] M E Raichle, A M MacLeod, A Z Snyder, W J Powers, D A

Gusnard, and G L Shulman. A default mode of brain function. Proc Natl Acad

Sci USA, ():–, Jan .

[Rissanen ()] J. Rissanen. Modeling by shortest data description. Automatica,

():–, .

[Rodriguez et al. ()] R. Rodriguez, T. Picton, D. Linden, G. Hamel, and

G. Laframboise. Human auditory steady state responses: eﬀects of intensity

and frequency. Ear and hearing, ():, .

[Roweis and Saul ()] S.T. Roweis and L.K. Saul. Nonlinear dimensionality re-

duction by locally linear embedding. Science, ():–, .

[Samek et al. ()] Wojciech Samek, Klaus-Robert Müller, Motoaki Kawanabe,

and Carmen Vidaurre. Brain-computer interfacing in discriminative and sta-

tionary subspaces. In Conf Proc IEEE Eng Med Biol Soc. IEEE EMBS, .

[Scherg and Von Cramon ()] M. Scherg and D. Von Cramon. Evoked dipole

source potentials of the human auditory cortex. Electroencephalography and

Clinical Neurophysiology/Evoked Potentials Section, ():–, .

BIBLIOGRAPHY 111

[Schmidt ()] R. Schmidt. Multiple emitter location and signal parameter es-

timation. Antennas and Propagation, IEEE Transactions on, ():–,

.

[Schölkopf et al. ()] B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear

component analysis as a kernel eigenvalue problem. Neural Computation,

():–, .

[Schwarz ()] G. Schwarz. Estimating the dimension of a model. e annals of

statistics, ():–, .

[Schweikert et al. ()] Gabriele Schweikert, Christian Widmer, Bernhard

Schölkopf, and Gunnar Rätsch. An empirical analysis of domain adaptation

algorithms for genomic sequence analysis. In Daphne Koller, Dale Schu-

urmans, Yoshua Bengio, and Léon Bottou, editors, NIPS, pages –.

Curran Associates, Inc., .

[Shenoy et al. ()] P. Shenoy, M. Krauledat, B. Blankertz, R.P.N. Rao, and K.R.

Müller. Towards adaptive classiﬁcation for bci. Journal of Neural Engineering,

:R, .

[Shimodaira ()] H. Shimodaira. Improving predictive inference under covari-

ate shi by weighting the log-likelihood function. Journal of Statistical Plan-

ning and Inference, ():–, .

[Smirnoﬀ ()] N. Smirnoﬀ. On the estimation of the discrepancy between

empirical curves of distribution for two independent samples. Bulletin de

l�Universite de Moscow, Serie internationale (Mathematiques), :–, .

[Stetter ()] Hans J. Stetter. Numerical polynomial algebra. Society for Industrial

and Applied Mathematics, . ISBN .

[Sugiyama ()] M. Sugiyama. Density ratio estimation: A new versatile tool for

machine learning. Advances in Machine Learning, pages –, .

[Sugiyama et al. ()] M. Sugiyama, M. Krauledat, and K.R. Müller. Covariate

shi adaptation by importance weighted cross validation. e Journal of Ma-

chine Learning Research, :–, .

[Sugiyama et al. ()] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von

Bünau, and M. Kawanabe. Direct importance estimation for covariate shi

adaptation. Annals of the Institute of Statistical Mathematics, ():–,

.

112 BIBLIOGRAPHY

[Sugiyama et al. ()] Masashi Sugiyama, Makoto Yamada, Paul von Bünau,

Taiji Suzuki, Takafumi Kanamori, and Motoaki Kawanabe. Direct density-

ratio estimation with dimensionality reduction via least-squares hetero-

distributional subspace search. Neural Networks, ():–, . ISSN

-.

[Tenenbaum et al. ()] J.B. Tenenbaum, V. De Silva, and J.C. Langford. A global

geometric framework for nonlinear dimensionality reduction. Science, 

():–, .

[Vidal ()] J.J. Vidal. Toward direct brain-computer communication. Annual

review of Biophysics and Bioengineering, ():–, .

[Vidaurre et al. ()] C. Vidaurre, M. Kawanabe, P. Bunau, B. Blankertz, and K.R.

Muller. Toward an unsupervised adaptation of lda for brain-computer inter-

faces. Biomedical Engineering, IEEE Transactions on, ():–, .

[von Bünau et al. ()] Paul von Bünau, Frank C. Meinecke, Simon Scholler, and

Klaus-Robert Müller. Finding stationary brain sources in EEG data. In Pro-

ceedings of the nd Annual Conference of the IEEE EMBS, pages –,

.

[Wald and Wolfowitz ()] A. Wald and J. Wolfowitz. On a test whether two

samples are from the same population. e Annals of Mathematical Statis-

tics, ():–, .

[Ziehe and Müller (a)] A. Ziehe and K.-R. Müller. TDSEP – an eﬃcient algo-

rithm for blind separation using time structure. In L. Niklasson, M. Bodén,

and T. Ziemke, editors, Proceedings of the th International Conference on Ar-

tiﬁcial Neural Networks, ICANN’, Perspectives in Neural Computing, pages

 – , Berlin, a. Springer Verlag.

[Ziehe and Müller (b)] A. Ziehe and K.-R. Müller. TDSEP – an eﬃcient algo-

rithm for blind separation using time structure. In L. Niklasson, M. Bodén,

and T. Ziemke, editors, Proc. ICANN ’, pages  – . Springer Verlag,

b.