Stationary Subspace Analysis
Towardsunderstandingnon-stationarydata
Paul von Bünau
Stationary Subspace Analysis
Towardsunderstandingnon-stationarydata
vorgelegt von Paul von Bünau
Von der Fakultät IV — Elektrotechnik und Informatik
der Technischen Universität Berlin
zur Erlangung des akademischen Grades
Doktor der Naturwissenschaen (Dr. rer. nat.)
genehmigte Dissertation
Promotionsausschuss
Vorsitzender: Prof. Dr. Olaf Hellwich (Technische Universität Berlin)
1. Gutachter: Prof. Dr. Klaus-Robert Müller (Technische Universität Berlin)
2. Gutachter: Prof. Dr. Gilles Blanchard (Universität Potsdam)
3. Gutachter: Prof. Dr. Benjamin Blankertz (Technische Universität Berlin)
Tag der wissenschalichen Aussprache: 12. September 2012
Berlin, 2012
D 83
Acknowledgements
Translated from the German, “Ph.D. advisor” becomes “doctor father”. is is quite
literally what Klaus-Robert Müller was to me. For the most part, it was his inspira-
tional enthusiasm that drew me into machine learning in the first place, and made
me return aer a two-year excursion. His kind encouragement and indefatigable
optimism helped me not to despair during the dry periods of research, when ev-
erything seems pointless. But above all, the inimitable Müller determination not to
take anything too seriously, and see the human side of everything, is what makes his
research group such an enjoyable place. ank you.
is thesis is the result of a collaboration with Frank C. Meinecke and Franz J.
Kiraly. It has been a privilege, a pleasure, and an incredible learning experience for
which I am deeply grateful.
My colleagues have been an invaluable source of scientific advice, camaraderie
and amusement, both in and out of the office, and on our various joint conference
trips. In particular, I would like to mention Mikio Braun, Stefan Haufe and Felix
Bießmann.
I am indebted to Jan Saputra Müller and Duncan A. J. Blythe for the excellent
joint work during the last years. Your support has been important in many ways.
My research would not have been possible without the tireless efforts of Andrea
Gerdes, Dominik Kühne and Imke Weitkamp. Even though I can only fathom what
most of it entails, I am well aware that it matters hugely.
eEEG datasets Iused inthis thesiswererecordedby Claudia Sannelli, orsten
Dickhaus, Sven Dähne and Johannes Höhne. is is hard work; your attention to
detail is very much appreciated.
I am sincerely grateful for the frequent detours from academia with Sebastian
Mika and Aldo Benini; to industry, and Bangladesh, respectively. Both have sig-
nificantly delayed this thesis, but also provided the change in scenery on which I
thrive.
Over the years, I have had the good fortune of meeting inspiring teachers who
generously shared their knowledge with me. While this list is inevitably incomplete,
it certainly includes Lothar Budach, Martyn Quick, Albrecht Klemm, Georg Her-
rmann, and from the first, my parents.
But what matters most is that I have found you, Louise.
vi
Financial support
e bulk of my research was supported by a teaching position at TU Berlin, funded
by the federal state of Berlin. In addition, I have received substantial travel grants
from the EU Network of Excellence PASCAL and the German Academic Exchange
Foundation (DAAD). In and , I was kindly invited to the Research-in-
Pairs(RiP)programmeofthe MathematischesForschungsinstitutOberwolfach(MFO),
core-funded by the Leibniz Gemeinscha and the state of Baden-Württemberg.
I am sincerely grateful to the European tax payer who provided this money, and
I appreciate the trust that society puts in the academic community.
Abstract
is thesis is about statistical methods for understanding change in the joint dis-
tribution of observed multivariate data over time. e setting we consider is com-
pletely explorative or unsupervised: no auxiliary information regarding the distribu-
tion changes is available.
We propose the first unsupervised method, stationary subspace analysis (SSA),
for finding a linear coordinate transformation which factorises the observed data into
stationary and non-stationary components. is is essential because the relevant
changescanoccurin the dependencies between variables, which meansthatthe input
variables can be totally uninformative: in fact, the non-stationary and the stationary
part can remain completely invisible in the observations. In practice, this is oen the
case when one can only measure superpositions of the actual variables of interest.
For example, in EEG analysis, electrodes on the surface of the scalp record activity
from several neural sources located inside the brain. As we show in this thesis, in-
vestigating changing behaviour in brain sources crucially depends on the separation
of the stationary and non-stationary components in the recorded signals.
e second main contribution of this thesis is a novel approach to finding partic-
ular types of approximative solutions to systems of polynomial equations of arbitrary
degree based on techniques from computational algebraic geometry. Using the con-
cept of generic polynomials, we show how SSA can be formulated in this framework.
is leads to a new algorithm which has a unique solution and is more accurate in
certain cases. From a theoretical perspective, the most interesting feature of this ap-
proach is that it allows us to solve the problem algebraically instead of searching for
the solution guided by an objective function. As the assumptions underpinning the
algorithm are rather general, it may be directly applicable to other problems in ma-
chine learning whose solution can be formulated in terms of polynomial equations.
viii
Zusammenfassung
Das ema dieser Dissertation sind statistische Methoden für das Verständnis zeit-
licher Veränderung der gemeinsamen Verteilung multivariater Daten. Wir betrachten
densogenanntenexplorativen Fall, in demkeinerlei zusätzlichen Informationen, z.B.
über relevante Zeitpunkte oder einen kontrollierten Stimulus, verfügbar sind.
Der zentrale Beitrag dieser Arbeit ist die Entwicklung des erstenunüberwachten
Verfahrens, Stationary Subspace Analysis (SSA), welches eine linear Koordinaten-
transformation findet das die beobachteten Daten in eine Gruppe von stationären
und nicht-stationären Komponenten faktorisiert. Dies ist unerlässlich zur Analyse
multivariaterDaten, weildie wesentlichenÄnderungen dergemeinsamenVerteilung
die Abhängigkeiten zwischen den Variablen betreffen können. Daher erlaubt die
univariate Betrachtung der Eingangsgrössen nicht notwendigerweise Rückschlüsse
auf Änderungen der gemeinsamen Verteilung. Sowohl die nicht-stationären als
auch die stationären Komponenten können nämlich in den beobachten Variablen
vollständig unsichtbar bleiben. Dies ist vor allem dann der Fall, wenn die mess-
baren Grössen Überlagerungen der tatsächlich relevanten Variablen sind, welche
nicht direkt gemessen werden können. Ein gutes Beispiel liefert die EEG Datenanal-
yse: die Elektroden auf der Kopaut messen die Beiträge einer Vielzahl neuronaler
Quellen im Gehirn. Um die zeitliche Veränderung der Verteilung dieser Quellen
zu verstehen ist es daher notwendig, die stationären von den nicht-stationären Sig-
nalanteilen zu trennen. In einer Anwendung auf EEG Daten zeigen wir zum Einen,
dass SSA dies leistet und zum Anderen, dass die populären Koordinatentransforma-
tionen Principal Component Analysis (PCA) und Independent Component Anal-
ysis (ICA) dazu nicht in der Lage sind.
Der zweite wesentliche Beitrag dieser Arbeit ist ein neuartiger Ansatz zur ap-
proximativenLösungpolynomiellerGleichungssystemeinesbestimmtenTyps. Auf-
bauend auf dem Konzept der generischen Polynome zeigen wir, das SSA als ein
solches Problem formuliert werden kann. Dies führt zu einem neuen Algorithmus
dessen Lösung nicht nur eindeutig, sondern in Spezialfällen auch exakter ist. Von
einem theoretischen Standpunkt aus gesehen ist es besonders interessant, dass es
dieser Ansatz erlaubt das SSA Problem direkt algebraisch zu lösen, anstatt wie in
einem Optimierungsverfahren nach der Lösung zu suchen. Da die zugrunde liegen-
den Annahmen eher allgemein sind, lässt sich der Algorithmus möglicherweise di-
x
rekt auf andere Probleme im Maschinellen Lernen anwenden, die als Lösung poly-
nomieller Gleichungen formuliert werden können.
Contents
1 Introduction 1
. Whythisthesis? ............................
. Scope, main contributions and publications . . . . . . . . . . . . .
. Related areas in machine learning . . . . . . . . . . . . . . . . . . .
. A roadmap through this document . . . . . . . . . . . . . . . . . .
2 Interesting directions in data 13
. Motivation ...............................
. Preliminaries and notation . . . . . . . . . . . . . . . . . . . . . . .
. Maximum variance and independence . . . . . . . . . . . . . . . .
. Summary and discussion . . . . . . . . . . . . . . . . . . . . . . .
3 Stationary subspace analysis 27
. Model and identifiability . . . . . . . . . . . . . . . . . . . . . . . .
. eSSAalgorithm...........................
. Spurious stationarity . . . . . . . . . . . . . . . . . . . . . . . . . .
. Summary and discussion . . . . . . . . . . . . . . . . . . . . . . .
4 An algebraic approach 51
. From moments to polynomials . . . . . . . . . . . . . . . . . . . .
. e vector space of polynomials . . . . . . . . . . . . . . . . . . . .
. An approximate algebraic algorithm . . . . . . . . . . . . . . . . .
. Relationship to algebraic geometry . . . . . . . . . . . . . . . . . .
. Summary and discussion . . . . . . . . . . . . . . . . . . . . . . .
5 Simulations and applications 73
. Introduction ..............................
. Simulations on synthetic data . . . . . . . . . . . . . . . . . . . . .
. Application to EEG analysis . . . . . . . . . . . . . . . . . . . . . .
. Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . .
6 Conclusion 99
xii CONTENTS
Chapter 1
Introduction
1.1 Why this thesis?
is thesis is about methods for the analysis of data. e starting point is a simple
question: given a set of multivariate samples observed at different points in time, how
does the joint distribution change over time? e setting is completely unsupervised
orexplorative: apart from thedata, no further informationorpriorknowledge which
could be helpful in identifying the changes is available.
Figure .: A change in variance over time.
In such a situation, a common approach is to resort to a visual inspection of
the data. Although unwieldy for large data sets, plotting the time course of a sin-
gle variable should, in principle, provide us with a good indication of its change in
behaviour over time (see e.g. Figure .).
1
2
Signals
Signal 1Signal 2
Epoch histograms
Figure .: Does this joint distribution change over time?
However, this approach fails as soon as we are dealing with the joint distribution
of more than one variable. Let us consider the example shown in Figure .. In ad-
2CHAPTER 1. INTRODUCTION
dition to the time courses, this plot shows the distribution of the individual variables
as histograms in four segments. Apart from the fluctuations due to finite samples,
this does not reveal any significant changes in distribution.
1
2
Signals
Signal 2
Signal 1
Figure .: e change is confined to the dependencies!
Figure . provides a different view of the same data set. From the scatter plots,
we immediately see what has been missed: the dependencies between the two vari-
ables. e correlation between the variables changes completely over time, from
highly correlated at the beginning, to uncorrelated, to highly anticorrelated towards
the end. Looking at variables individually, i.e. the marginals of the joint distribution,
does not give us any information about the dependencies between them.
Figure .: More useful directions
In light of this finding, a more useful way of looking at this data set is given by
the red and green directions shown in Figure .. By projecting the data points onto
these axes, we obtain two new variables with time courses shown in the panel below.
e first one corresponds to the green direction. As we can see in the scatter plots,
along this direction the change in the joint distribution is clearly visible: from large
variance in the first epoch to small variance in the last epoch; this is what we call a
non-stationary direction. e second variable (red) is a stationary variable, because
1.1. WHY THIS THESIS? 3
its distribution stays constant¹. e projection onto these two directions achieves a
separation of the stationary and the non-stationary content of the joint distribution.
is example was benign in that it consisted of only two variables. For higher
dimensional data sets, analysing dependencies by looking at scatter plots between
all pairs of variables (over a potentially large number of epochs) is not merely cum-
bersome, but simply infeasible. Moreover, we have no labels or target variables to
guide our search. us the question posed at the outset turns out to be surpris-
ingly difficult to answer. More importantly, this problem is not only non-trivial,
but also relevant in practice. Understanding temporal changes lies at the heart of
many scientific inquiries. Real data contains significant amount of information in
its dependency structure, and is oen high-dimensional.
In fact, in many circumstances it is well known that what we measure is a super-
position of underlying latent variables which cannot be measured directly. is cre-
ates rich dependencies across dimensions. For example, in EEG analysis, electrodes
on the surface of the scalp record contributions from a multitude of sources located
inside the brain. Similarly, in the analysis of seismic activity, one is oen confined
to surface measurements for investigating the activity inside the earth. us each
observed variable contains contributions from both the stationary, and the non-
stationary parts of the underlying data generating system.
Signals
Time
Observed time series
Time
Sources
Underlying sources
Sources
Time
Comparison of source power
Figure .: Non-stationary component masked by stronger stationary compo-
nent.
As the example in Figure . showed, this could mean that the relevant distribu-
tion changes are entirely invisible in the measurements, because they are confined
to the dependencies. Another way in which relevant distribution changes can go
unnoticed is if they are masked by relatively stronger components, see Figure ..
e opposite effect is also possible: a single non-stationary factor affects all mea-
surements, giving the false impression that the data as a whole is non-stationary.
Discerning the stationary from the non-stationary components is, therefore, a key
step towards understanding multiviariate non-stationary data. How to do this is the
topic of this thesis.
¹A precise notion of stationarity will be introduced in Chapter .
4CHAPTER 1. INTRODUCTION
Our main contribution is a method for finding a linear coordinate transforma-
tion, stationary subspace analysis (SSA), which factorises the observed data into a
set of stationary, and a set of non-stationary, sources. In the example shown in Fig-
ure ., such a transformation would be given by projecting onto the green and red
axes respectively. e type of distribution changes which our method detects are
changes in the mean and the covariance matrix. Importantly, SSA does not require
labels to specify the timing of the distribution changes (or any other form of super-
vision); and it is not limited to time series but can also be applied to sets of data
collected at different time points.
A factorisation into stationary and non-stationary components is useful in many
circumstances. As we have seen in the example, SSA is an essential tool for ex-
plorative analysis, because the observed variables can be arbitrarily ill suited for
the analysis of changes in the joint distribution. Moreover, SSA is useful for fea-
ture extraction [Guyon and Elisseeff ()] when only the stationary or the non-
stationary part is informative for the subsequent application. For example, high-
dimensional change-point detection [Blythe et al. ()] can benefit from the prior
removal of the uninformative stationary part. In some cases, SSA has been found to
be useful for domain adaptation, by restricting parameter estimation of a prediction
method to the stationary part [von Bünau et al. (), Hara et al. ()]. Finally,
SSA can extract meaningful components when the observations are (modeled as) lin-
ear superpositions of the relevant latent variables, and non-stationarity is a sensible
criterion for identifying the components of interest [Hara et al. ()]. In particu-
lar, SSA allows us to extract such components even when no prior knowledge about
the timing of the relevant activity is available, e.g. when they are not induced by an
experimental paradigm. is is the case e.g. in the analysis of spontaneous neu-
ral activity [Bießmann et al. (), Biswal et al. ()] or default mode networks
[Raichle et al. ()].
Stationary subspace analysis is not the first method to exploit dependencies be-
tween variables to find a coordinate system which is more useful in some sense.
However, none of the existing approaches are tailored to the understanding of dis-
tribution changes. As an illustration, let us compare the results of SSA with the
solutions found by principal component analysis (PCA) and independent compo-
nent analysis (ICA) on the data shown in Figure .. In Figure . we see that both
PCA and ICA produce uninformative results, albeit in different ways. In the PCA
basis (top panel), the distribution changes remain completely invisible. ICA (middle
panel), on the other hand, gives us two components that are both non-stationary:
suddenly, the stationary components have become invisible. Only SSA achieves a
clear separation into stationary and non-stationary components. In principle, this
is to be expected. ICA maximises the statistical independence of the coordinates,
which is unrelated to discerning the two groups of components. Similarly, the PCA
solution is determined by the shape of the overall covariance matrix which is not
related to non-stationarity.
1.1. WHY THIS THESIS? 5
PCA solution
ICA solution
SSA solution
Figure .: PCA and ICA do not separate the stationary and the non-stationary
components.
e second main contribution of this thesis is a new framework for the find-
ing approximate solutions (or a certain type) to systems of polynomial equation.
We demonstrate that SSA can be formulated in terms of computing the basis of an
algebraic variety given in terms of noisy polynomials estimated from data. Or, in
other words, we show how to approximate the set of solutions to noisy polynomial
equations by a linear space. What makes the SSA problem amenable to this type of
formulation is that the projections to the stationary coordinates can be described as
the set of linear projections which make the moments of the distribution equal in
the new coordinates. ese projections are therefore given by the solution to a set
of polynomials with coefficients derived from moments estimated on the observed
data. For example, the equality of the mean translates into linear polynomials and
the equality of the covariance matrix is expressed by a quadratic polynomial. Not
only does the algebraic approach have a better performance (higher accuracy) in
certain scenarios, but this viewpoint also allows us to invoke powerful devices from
algebraic geometry for reasoning about learning problems and algorithms.
From a theoretical perspective, the interesting distinction of the algebraic ap-
proach is that it allows us to directly solve a problem algebraically, which hitherto
required an iterative search procedure using gradient-based optimisation. In some
sense, this is due to the fact that in the algebraic formulation, we have explicitly in-
corporated all available information about the structure of the problem.
6CHAPTER 1. INTRODUCTION
1.2 Scope, main contributions and publications
e main contributions of this thesis can be summarised as follows.
. Motivation and formulation of a new type of unsupervised learning task, sta-
tionary subspace analysis.
. Analysis of the generative model w.r.t. identifiability invariances and spurious
solutions.
. Derivation of a computationally efficient SSA algorithm, based on a custom
optimisation procedure over the special orthogonal group.
. Development of an approximate algebraic algorithm for solving the SSA prob-
lem based on an algebraic-geometric viewpoint.
. Empirical evaluation of the SSA algorithm on synthetic data and in applica-
tions to EEG analysis.
1.2.1 Included publications
Parts of this thesis have been presented in the following publications. e results
presented in Chapter are unpublished.
• von Bünau, P., Meinecke F. C., Király, F. J., and Müller K.R. Finding Stationary
Subspaces in Multivariate Time Series. Phys. Rev. Lett. ():, .
• Müller, J. S., von Bünau, P., Meinecke F. C., Király, F. J., and Müller K.R. e
Stationary Subspace Analysis Toolbox. Journal of Machine Learning Research
:–, .
• Király, F. J., von Bünau, P., Meinecke F. C., Blythe, D. A. J., and Müller K.R.
Algebraic Geometric Comparison of Probability Distributions. Journal of Ma-
chine Learning Research :–, .
• von Bünau, P., Meinecke F. C., Scholler, S., and Müller K.R. Finding Stationary
Brain Sources in EEG Data. Proceedings of the nd Annual Conference of
the IEEE EMBS, Buenos Aires, .
• Király, F. J., von Bünau, P., Müller, J.S., Blythe, D. A. J., Meinecke F. C., and
Müller K.R. Regression for sets of polynomial equations. JMLR Workshop and
Conference Proc. (AISTATS ), Vol. , .
1.2.2 Related work not included
Since the publication of the SSA algorithm, the following applications and exten-
sions have been presented which are not part of this thesis. e author of this thesis
has been involved in some but not all of them.
1.3. RELATED AREAS IN MACHINE LEARNING 7
Application to domain adaptation e authors show that SSA can be used as a
pre-processing step to allow for the domain adaptation of classification algorithms
in the context of WiFi localisation [Hara et al. ()].
Application to geophysical data analysis and an analytic algorithm In the si-
multaenous analysis of measurements relating to the dynamics of the earth’s mag-
netic field, the relevant components are known to exhibit strong changes but can-
not be assumed to be independent. e authors show that SSA recovers mean-
ingful components where ICA fails. Moreover, a new SSA algorithm is presented
that has an analytic solution based on certain assumptions on the latent variables
[Hara et al. ()].
Application to computer vision e stationary components found by SSA can be
interpreted as temporally stable relationships between variables. e authors adopt
this viewpoint and demonstrate that it allows the extraction of invariances from im-
age data [Meinecke et al. ()].
Application to change-point detection In high-dimensional change-point de-
tection, only the non-stationary directions are informative while the rest contribute
to the number of dimensions by adding irrelevant information which is potentially
harmful to the performance. erefore, the authors show that SSA can be used as
a feature extraction method in extensive simulations and as an application to fault
detection [Blythe et al. ()].
An information-theoretical view of SSA e authors show that SSA can be un-
derstood from an information-theoretical point of view in terms of a projection in
the space of distributions; this leads to a new algorithm [Kawanabe et al. ()].
Group-Wise SSA and brain-computer-interfacing In the context of a classifica-
tion setting, it is oen desirable to identify only those non-stationarities that are not
class-related. To this end, the authors propose a new algorithm called groupwise-
SSAanddemonstrateits effectivenessinanapplicationtobrain-computer-interfacing
[Samek et al. ()].
1.3 Related areas in machine learning
With this thesis, we introduce a new family of unsupervised learning tasks. Even
though there are no algorithms with exactly the same objective, SSA is related to a
number of approaches by formal similarity, shared technical tools or a similar ap-
plication context, which we briefly summarise below.
8CHAPTER 1. INTRODUCTION
Two-sample testing Do two samples come from the same distribution? A two-
sample test [Wald and Wolfowitz (), Smirnoff ()] is a statistical hypothesis
test for deciding this problem �he result of such tests is a binary decision; no more
information about the type or extent of a difference in distribution can be gleaned
from the result. Two-sample testing may be applied in similar circumstances than
SSA, but addresses a fundamentally different question.
At the heart of a two-sample test lies a parametric or nonparametric distance
measure between two distributions, and such a function also appears in the SSA ob-
jective. Infact, weshow thatthe SSAobjectivefunctionis equivalenttoacertaintype
of two-sample test. Future work may explore the possibility of deriving high-order
SSA algorithms by borrowing from non-parametric two-sample tests, e.g. following
[Gretton et al. ()].
Domain and covariate shift adaptation e goal of domain and covariate shi
adaptation is to improve the performance of prediction methods (classification or
regression) in a scenario in which the test data follows a different distribution than
the training data. One way of covariate shi adaptation is to weight the training
data [Shimodaira (), Sugiyama et al. (), Sugiyama et al. ()] or regu-
larise w.r.t. distribution changes [Schweikert et al. ()].
In contrast to SSA, the focus is on adapting a supervised learning to distribu-
tion changes between two sets of samples, and not on investigating the distribution
change itself. However, it has been shown [Hara et al. ()] that for some appli-
cations, SSA can be used for domain adaptation by confining the learning to the
directions between training and testing data which are stationary. Note that this
need not be true in general: the stationary directions are not necessarily correlated
with the label; and a certain degree of non-stationarity along strongly informative
directions may not be harmful for prediction performance. How the possibly con-
flicting goals of stationarity and informativeness can be combined in a principled
manner is an open question.
Moreover, SSA can be useful as a feature extraction method for density ratio es-
timation [Sugiyama ()], which is an essential step in the reweighting approach
to covariate shi adaptation: only the non-stationary directions are relevant for den-
sity ratio estimation, and it has been shown [Sugiyama et al. ()] that the prior
removal of the stationary directions can increase the accuracy of density ratio esti-
mation and hence covariate shi adaptation.
PCA, ICA and factor analysis e generative model of PCA, ICA and factor anal-
ysis formally resembles the SSA model: the observed data is a linear mixture of un-
derlying latent variables. is, however, is where the similarities end. ICA assumes
that sources are independent; SSA merely presupposes that there are two groups
of sources (stationary and non-stationary), which may have arbitrary dependence
1.4. A ROADMAP THROUGH THIS DOCUMENT 9
structures. Consequently, ICA and SSA algorithms have entirely different goals:
recovering maximally independent sources vs. maximizing resp. minimizing a mul-
tivariate stationarity measure. Similarly, PCA and factor analysis find orthogonal
directions minimizing the reconstruction error which is unrelated to stationarity
resp. non-stationarity. Also, whereas PCA, ICA and factor analysis compute a set of
one-dimensional components, SSA finds two subspaces.
However, SSA is related to the family of blind source separation algorithms by
shared technical tools (a pre-whitening step plus optimisation over the special or-
thogonal group) and similar terminology.
1.4 A roadmap through this document
Chapter 2 In this chapter, we introduce the the task of finding a coordinate trans-
form of the data space with certain useful properties. Mathematical concepts are
briefly reviewed and the two most prominent methods, PCA and ICA, are discussed
in depth.
Chapter 3 Stationary subspace analysis is presented: aer outlining and analysing
the SSA model, we develop a computationally efficient algorithm, and discuss the
problem of spurious stationarity and the relationship to stationarity testing.
Chapter 4 is chapter contains the second main contribution: we present an
approximate algebraic algorithm for solving the SSA task and analyse its computa-
tional complexity.
Chapter 5 In extensive simulations on synthetic data, we investigate the perfor-
mance of SSA relative to an artificial ground truth; in applications to EEG analysis
we show that SSA provides useful new insights.
Chapter 6 e thesis concludes with a summary of the main findings, and a dis-
cussion of open problems and directions for future work.
10 CHAPTER 1. INTRODUCTION
List of all publications
e following is a list of all publications co-authored by the author of this thesis from
until .
Articles in journals
Hara, S., Kawahara, Y., Washio, T., von Bünau, P., Tokunaga, T. and Yumoto, K. Sep-
aration of stationary and non-stationary sources with a generalized eigenvalue prob-
lem Neural Networks :–, .
Kiraly, F. J., von Bünau, P., Meinecke, F. C., Blythe, D. A. J. and Müller, K. R. Alge-
braic Geometric Comparison of Probability Distributions Journal of Machine Learn-
ing Research :–, .
Blythe, D. A. J., von Bünau, P., Meinecke, F. C. and Müller, K. R. Feature Extraction
for Change-Point Detection using Stationary Subspace Analysis IEEE Transactions
on Neural Networks and Learning Systems ():–, .
Sugiyama, M., Yamada, M., von Bünau, P., Suzuki, T., Kanamori, T. and Kawanabe,
M. Direct density-ratio estimation with dimensionality reduction via least-squares
hetero-distributional subspace search Neural Networks ():-, .
Vidaurre, C., Kawanabe, M., von Bünau, P., Blankertz, B. and Müller, K. R. Toward
Unsupervised Adaptation of LDA for Brain-Computer Interfaces IEEE Transactions
on Biomedical Engineering ():–, .
Müller, J., von Bünau, P., Meinecke, F. C., Király, F. J. and Müller, K. R. e Stationary
Subspace Analysis Toolbox Journal of Machine Learning Research :–,
.
Araujo, J., von Bünau, P., Mitchell, J. and Neunhoeffer, M. Computing automor-
phisms of semigroups Journal of Symbolic Computation ():-, .
von Bünau, P., Meinecke, F. C., Király, F. J. and Müller, K. R. Finding Stationary
Subspaces in Multivariate Time Series Phys. Rev. Lett. ():, .
Sugiyama, M., Suzuki, T., Nakajima, S., Kashima, H., von Bünau, P. and Kawan-
abe, M. Direct Importance Estimation for Covariate Shi Adaptation Annals of the
Institute of Statistical Mathematics ():–, .
1.4. A ROADMAP THROUGH THIS DOCUMENT 11
Peer-reviewed contributions to conferences
Kiraly, F. J., von Bünau, P., Müller, J. S., Blythe, D. A. J., Meinecke, F. C. and Müller,
K. R. Regression for sets of polynomial equations JMLR Workshop and Conference
Proc. (AISTATS ) , .
Kawanabe, M., Samek, W., von Bünau, P. and Meinecke, F. C. An Information Ge-
ometrical View of Stationary Subspace Analysis Artificial Neural Networks and Ma-
chine Learning (ICANN ), LNCS :–, Springer, .
von Bünau, P., Meinecke, F. C., Scholler, S. and Müller, K. R. Finding Stationary
Brain Sources in EEG Data Proceedings of the nd Annual Conference of the IEEE
EMBS, .
Hara, S., Kawahara, Y., Washio, T. and von Bünau, P. Stationary subspace analysis as
a generalized eigenvalue problem Proceedings of the th international conference
on Neural information processing (ICONIP), .
Sugiyama, M., Hara, S., von Bünau, P., Suzuki, T., Kanamori, T. and Kawanabe, M.
Direct Density Ratio Estimation with Dimensionality Reduction Proceedings of
International Conference on Data Mining (SDM), .
Meinecke, F. C., von Bünau, P., Kawanabe, M. and Müller, K. R. Learning invariances
with Stationary Subspace Analysis ICCV Computer Vision Workshops, .
von Bünau, P., Meinecke, F. C. and Müller, K. R. Stationary Subspace Analysis Pro-
ceedings of ICA Conference (Paraty), LNCS :–, Springer, .
Sugiyama, M., Nakajima, S., Kashima, H., von Bünau, P. and Kawanabe, M. Direct
Importance Estimation with Model Selection and Its Application to Covariate Shi
Adaptation Advances in Neural Information Processing Systems (NIPS), .
12 CHAPTER 1. INTRODUCTION
Chapter 2
Interesting directions in data
2.1 Motivation
In this chapter we explain why a basis transformation of the data space can be use-
ful, review mathematical fundamentals, and discuss two prominent unsupervised
coordinate transforms: principal component analysis and independent component
analysis.
Data comes from measurements. However, in many circumstances, the actual
variables of interest cannot be measured directly. For example, in the neurosciences,
one is oen limited to measuring non-invasively on the surface of the scalp in order
to study the activity of sources located inside the brain. In EEG recordings, for in-
stance, each electrode measures a superposition of activity from several underlying
sources. e input signals are therefore not directly useful for analysing brain activ-
ity, because we cannot discern the contributions from the different cortical regions.
0
0
−0.5
0
0.5
Projection pursuit (F.-T. index) Principal component analysis Independent comp. analysis
Figure .: Directions found by exploratory projection pursuit (Friedman-Tukey
index), principal component analysis and independent component analysis.
However, on multivariate data one can oen combine the information of several
variables to construct more useful new variables by applying a coordinate transfor-
mation. In the previous example, a desirable transformation would separate the
contribution of the underlying brain sources into new variables. Interestingly, for
a diverse range of applications from the neurosciences to the analysis of web data,
14 CHAPTER 2. INTERESTING DIRECTIONS IN DATA
useful coordinate transformations can be found based on statistical criteria alone,
independent of the particular meaning of the observed variables or prior informa-
tion. is is the topic of this chapter.
Let us first of all consider intuitive examples for the informativeness of axes in
data space. Figure . shows three datasets. In the le panel, the scatter plot clearly
shows two clusters; however, these are not visible by looking at the individual di-
mensions. Here, a projection onto the orange direction clearly reveals this structure
in the data. e variation in the data set shown in the middle is largely confined
to the orange axis because the two variables are strongly correlated; coordinates on
this axis provide a more compact description of the data. In the right scatter plot,
the data points are distributed along two axes in a seemingly independent way, i.e.
the value of one coordinate has no influence on the distribution of the other. is
suggests that the data is generated as a linear mixture of two underlying (latent) in-
dependent factors.
Note thatthesepropertieswereapparentonlybecause we could look atthe whole
data in a scatter plot. For higher-dimensional data sets this is not possible. In
fact, each of the three examples correspond to a criterion of a family of algorithms:
projection pursuit using the Friedman-Tukey index [Friedman and Tukey ()]
finds directions with cluster structure (le panel); principal component analysis
[Pearson ()] finds directions that maximize the variance (middle panel), and in-
dependent component analysis [Hyvärinen et al. ()] finds independent sources
from a linear mixture. Incidentally, the latter approach is oen used to address the
problem sketched at the outset: under the assumption that EEG signals are gener-
ated as a linear mixture of independent brain sources, ICA allows us to recover the
neural components. Not only the new coordinates are of interest: both the elements
of the projection and the basis can oen be interpreted in terms of patterns if the sen-
sors are spatially distributed. In EEG analysis, the so-called scalp maps have become
an important tool for the interpretation of sources found by linear methods.
e remainder of this chapter is organised as follows. In the next section we
introduce some fundamental concepts from statistics and information theory. In
Section ., we review two prominent methods of finding interesting directions. Af-
ter tracing their historical development, we start with principal component analysis
followed by independent component analysis. ese are arguably the most widely
used coordinate transforms, and the ones we will compare against in the applica-
tion to EEG analysis (Chapter ). In the concluding section, we discuss the relative
merits of these algorithms.
2.2 Preliminaries and notation
In this section, we briefly review fundamentals from probability theory, statistics
and information theory. is overview is largely based on [Mood et al. ()],
2.2. PRELIMINARIES AND NOTATION 15
[Cover et al. ()] and [Jaynes and Bretthorst ()].
Random variables, joint densities, independence
e building block of statistical modeling is the the concept of the random variable.
A random variable is a formal variable which represents a sampling or measurement
process, governed by a probability distribution. Random variables are a convenient
tool to formulate probabilistic models in the familiar language of mathematical ex-
pressions.
A random variable Xis associated with a probability space (Ω,A, P). We will
not give a formal definition; it suffices to say that a probability space specifies the
Boolean algebra of random events A ⊂ P(Ω), which consists of subsets of elemen-
tary events in Ω, to which we assign probabilities by the function P:A → [0,1].
e empty set, the “sure event” Ωand all single elements of Ωare events.
In a probability space, there is no such thing as a value or an order of possible
events; it is simply a set of possible outcomes. However, in many cases one wants to
model random processes in which the outcome is a real number: we would like to
be able to speak of the probability that the outcome lies in a certain interval and to
be able to represent this graphically. A real random variable does just that: it links
the probability space and the real numbers.
Let (Ω,A, P)be a probability space. A real random variable is a function X:
Ω→
R
such that all subsets Ar⊂Ωdefined as,
Ar={ω∈Ω|X(ω)≤r},
are events in the probability space: Ar∈ Afor all r∈
R
.
is condition ensures that every interval in
R
can be assigned a probability.
For a continuous random variable, this allows for the definition of the cumulative
distribution function (CDF) and its derivative, the probability density function (PDF),
which show the distribution of the probability mass over the real numbers.
e cumulative distribution function of a real random variable Xis the function
FX:
R
→[0,1] such that,
FX(x) = P(X≤x) = P({ω|X(ω)≤x}),
for every real number x∈
R
. e probability density function of Xis the function
fxsuch that FX(x) = ∫x
−∞ fX(u)du.
So far, we have considered only single random variables. In many cases, how-
ever, the joint behavior of several random variables is of interest. To that end, we
introduce multivariate real random variables that are a vector of random variables
and have a joint probability density function.
Let X1, . . . , Xkbe real random variables over the same probability space. en
X= (X1, . . . , Xk)is a multivariate real random variable if it has a joint probability
16 CHAPTER 2. INTERESTING DIRECTIONS IN DATA
density function fX:
R
k→[0,1] such that its cumulative distribution function
FX:
R
k→[0,1] is given by
FX(x1, . . . , xk) = ∫x1
−∞ ···∫xk
−∞
fX(u1, . . . , uk)du1···duk.
A multivariate random variable consisting of kindividual random variables is
also called k-variate. In the context of a multivariate random variable, we call the
density of an individual constituent marginal density. Given the probability density
function of a multivariate random variable, we can obtain its marginals by integrat-
ing out the other random variables, respectively.
One motivation for considering the joint distribution of random variables is to
study the dependencies between them: how does the outcome of one group of vari-
ables influence the density over another group of variables. is is represented by
the conditional density. Let Xand Ybe multivariate real random variables that have
a joint distribution. e conditional probability density function of Ygiven X=x
is defined as, fY|X(y|x) = fX,Y (x,y)
fX(x)it is undefined for all x∈Rwith fX(x) = 0.
A particularly important special case is the independence of two random vari-
ables. In this case, the outcome of one variable has no influence on the distribution
of the other variable. In other words, the conditional distribution is equal to the
marginal: fY|X≡fYif Yand Xare independent. Let Xbe a k-variate real ran-
dom variable. en X1, . . . , Xkare independent, if their joint probability density
function is equal to the product of its marginals, fX(x1, . . . , xk) = ∏k
i=1 fXi(xi).
Expectation and cumulants
e expectation of a function g(X)of a random variable Xis, loosely speaking, the
average over its random argument X. Let Xbe a k-variate random variable and
g(X)be a scalar function of X. e expected value of g(X)is defined as,
E[g(X)] = ∫∞
−∞ ···∫∞
−∞
| {z }
ktimes
g(x1, . . . , xk)fX(x1, . . . , xk)dx1···dxk,
assuming that the integral exists. Since the integral is linear, the expectation is also
linear: E[αf(X) + βg(X)] = αE[f(X)] + βE[g(X)].
Properties of distributions are oen summarised in terms of its moments or cen-
tral moments, which are an infinite series of increasing order. In general, the r-th
order moment of a random variable Xis defined as µ′
r(X) = E[Xr]and the r-th
order central moment is µr= E[(X−E[X])r].
e expectation of a random variable, abbreviated as µ(X)is its first moment
and is commonly interpreted as a location parameter. e variance of a random
2.2. PRELIMINARIES AND NOTATION 17
variable, its second central moment, describe its dispersion around the mean. e
variance of X, denoted by var(X)is defined as,
var(X) = E[(X−µ(X))2] = E[X2]−µ(X)2.
e covariance between two random variables describes how they vary together.
e value of the covariance is a measure for a linear dependence between them.
For a multivariate random variable, all pairwise covariances are summarised in a
covariance matrix. e covariance of two real random variables Xand Ywith a
joint probability density is given by,
cov(X, Y ) = E[(X−µ(X))(Y−µ(Y))].
For a k-variate random variable Z, its covariance matrix Σ∈
R
k×kcontains the
covariance for each pair of variables, Σij = cov(Zi, Zj).
Moments above the mean and covariance are usually called higher order mo-
ments. e third central moment, the skewness, is oen used as a measure of asym-
metry around the mean. e kurtosis is the fourth central moment which measures
the degree of flatness (or peakedness) of a density around its centres. e standard-
ise fourth moment, or Pearson’s kurtosis, is defined as κ(X) = µ4
var(X).
Information theory
e entropy is a measure H[X]for the uncertainty of a random variable Xdefined as
H[X] = −E[log fX(X)]. e entropy is bounded from below by zero, H[X]≥0;
it is zero if and only if the random variable is a constant, i.e. non-random. e
conditional entropy of Xgiven another random variable X, denoted by H[X|Y]is
defined as H[X|Y] = −EX,Y [log fX,Y (X|Y)].e entropy of a joint distribution
can be written as H[X, Y ] = H[X] + H[Y|X].
e Kullback-Leibler (KL) divergence or relative entropy DKL is a way of compar-
ing two distributions p(x)and q(x),
DKL[p||q] = ∫+∞
−∞
p(x) log p(x)
q(x)dx.
e KL-divergence is zero if and only if the two densities are identical, and other-
wise positive (unbounded). Note, however, that the KL-divergence is not a distance
because it is not symmetric and does not satisfy the triangle equation.
A related concept to the KL-divergence is the mutual information between ran-
dom variables. It measures how much information one variable contains about the
other. e mutual information I[X, Y ]can be defined using the KL-divergence as,
I[X, Y ] = DKL[p(X, Y )||p(X)p(Y)],
i.e. it is the distance between the joint distribution and the product of its marginals.
18 CHAPTER 2. INTERESTING DIRECTIONS IN DATA
The Gaussian distribution
Many statistical techniques are based on distributional assumptions involving the
(multivariate) Gaussian distribution. In particular, the noise component in statis-
tical models is usually assumed to be Gaussian distributed. e conventional jus-
tification is that by the central limit theorem, the sum of independent identically
distributed random variables converges (in distribution) to the normal distribution.
Let Xbe a k-variate random variable that follows a multivariate Gaussian distri-
bution withmeanµandcovariancematrix Σ, whichweabbreviatebyX∼ N(µ, Σ).
e probability density function is of the multivariate Gaussian is given by,
fX(x) = det(2πΣ)−1
2exp (−1
2(x−µ)⊤Σ−1(x−µ)).
e entropy of a D-variate Gaussian H(N(µ, Σ)) is given by H(N(µ, Σ)) =
log √det(2πΣ) + D/2.Note that the entropy is thus invariant under orthogonal
transformations of variables.
For a given mean and covariance matrix, the Gaussian is the real-valued distri-
bution with the highest entropy [Jaynes ()]. From this point of view, the Gaus-
sian distribution is the least restrictive distributional assumption.
e negentropy J(X)of a random variable Xwith is the difference between its
entropy and the entropy of a Gaussian random variable with the same mean and
covariance,
J(X) = H [N(µ(X),cov(X))] −H[X].
e negentropy is nonnegative.
Maximum likelihood
One the key tasks in statistics is the estimation of model parameters from samples.
In the standard maximum likelihood (ML) approach, we select those parameters that
maximise the likelihood of observing the data at hand under our generative model
assumption. Let X ⊂
R
dbe a dataset and let ℓθ:P(
R
d)→
R
the likelihood
function of a statistical model, which is the likelihood of observing a data set given
the parameters θ. e maximum likelihood estimator for the model parameters is
given by, ˆ
θML = argmaxθlog ℓθ(X).
e standard estimators for the parameters µand Σof a d-variate Gaussian dis-
tribution are the sample mean ˆµand the sample covariance matrix ˆ
Σ,
ˆµ=1
|X| ∑
x∈X
xand ˆ
Σ = 1
|X|−1∑
x∈X
(x−ˆµ)(x−ˆµ)⊤,
respectively. Note that the latter is not the maximum likelihood estimator, but it has
the favorable property of being unbiased, i.e. its expectation over realisations of the
dataset is equal to the true parameter.
2.3. MAXIMUM VARIANCE AND INDEPENDENCE 19
2.3 Maximum variance and independence
In this section, we briefly introduce the family of statistical models corresponding
to coordinate transformation in data space and discuss the two most prominent ap-
proaches: principal component analysis (PCA) and independent component anal-
ysis (ICA).
e generative model for the data is a linear transformation of latent variables.
e observed D-variate data Xis assumed to be generated as a linear transformation
of Dlatent variables Yby an unknown full rank mixing matrix A∈
R
D×D,
X=AY.
In a time series context, the data Xas called signals denoted by x(t)and the latent
variables are called sources written as y(t). Moreover, the mixing is instantaneous,
since the observation x(t)at time tonly depends on the sources at time tand not on
contributions from previous time points. In factor analysis, the matrix Ais called
loading matrix.
e elements of the latent variable Yare referred to as components or factors.
Given only samples from X, the goal is to estimate an inverse for Awhich allows
us to recover the latent variables Y, i.e. find the transformation A−1to coordinates
corresponding to the desired components.
is is possible only by making further assumptions on the unknown parameter
Aand the statistical properties of the latent variables Y. It is these types of assump-
tions which allow to distinguish the different types of coordinate transformations.
e assumption on Aand Yalso determine up to which symmetries the true mixing
matrix Acan be identified from the observed data.
2.3.1 Principal component analysis
As an academic discipline, the search for interesting directions in data started with
Karl Pearson’s work [Pearson ()] on principal component analysis (PCA) at the
beginning of the th century. To this day, PCA is still the most widely used method
of extracting useful linear components from a set of high-dimensional data points.
e vantage point of Pearson’s work was what is now known as ordinary least squares
regression (OLS) [Gauß (), Legendre ()], a method for fitting the mean of
a univariate dependent variable (or target) yby a linear combination of indepen-
dent variables (or covariates) x1, . . . , xd. is analysis yields a coefficient αifor
each independent variable xi, which is commonly interpreted as the effect of the
independent variable on the target, conditioned on (or controlled for) fixed values
of all other covariates.
e generative model corresponding to linear regression is given by the equa-
20 CHAPTER 2. INTERESTING DIRECTIONS IN DATA
tion,
Y=
d
∑
i=1
αixi+ϵwhere ϵ∼ N(0, σ),(.)
where the only random component is the Gaussian noise term ϵ. e unknown pa-
rameters of interest are the coefficients α1, . . . αdand the variance of the noise σ. In
this model, the values of the covariates are assumed to be known exactly; only the
target is subject to measurement noise. As Pearson points out, it is quite unrealistic
that all variables apart from one can be measured exactly. Moreover, the distinc-
tion between dependent and independent variable is somewhat arbitrary: one could
equally well aim to explain the variation in the covariate x1by a linear function of
x2, . . . , xdand y. ese points are not deficiencies of the linear regression model
but reflect the aim of the analysis: to explain the variation of a particular variable in
terms of another set of variables in order to investigate the underlying mechanism.
In other words, the ordinary least squares model (.) belongs to the family of con-
ditional models, which model the distribution of a designated variable as a function
of another set of variables.
Independent variable (covariate)
Dependent variable (target)
Principal components
OLS regression
Figure .: Comparison of ordinary least squares regression (OLS) and PCA.
However, as Pearson noted, conditional modeling of a particular variable is not
necessarily optimal for explaining the total variation in the data, i.e. the joint vari-
ation of all variables x1, . . . , xdand y. See Figure . for an example. To that end,
Pearson proposed a method to find the “Closest fit to Systems of Points in Space”
[Pearson ()], which now known as principal component analysis. PCA finds an
orthogonal coordinate system that minimises the reconstruction error of the data.
e reconstruction error of a data vector is the L2-distance to its projection onto a
particular basis. e elements of this PCA basis are called principal components. e
principal components are ordered descendingly by the variance of the data along
2.3. MAXIMUM VARIANCE AND INDEPENDENCE 21
each coordinate. at is, the first principal component is the direction in the data
that has the highest variance.
We can find the principal components iteratively by minimizing the squared re-
constructionerror. Assumewearegivenad-dimensionalcentereddatasetx1, . . . , xn∈
R
dand we have found the first korthonormal principal components v1, . . . , vk. e
next principal component vk+1 is the solution to the optimisation problem,
vk+1 = argmin
∥v∥=1
n
∑
i=1
(v⊤xi)v−xi
2s.t. v⊥span{v1, . . . , vk}
= argmax
∥v∥=1
v⊤(n
∑
i=1
xix⊤
i)vs.t. v⊥span{v1, . . . , vk}
= argmax
∥v∥=1
v⊤ˆ
Σvs.t. v⊥span{v1, . . . , vk},(.)
where ˆ
Σis the sample covariance matrix of the centered dataset. is optimisation
problem has a well known global solution [Golub and Van Loan ()] given by
the eigenvectors of the sample covariance matrix,
ˆ
Σ = V DV ⊤,
where Dis a diagonal matrix with ordered positive eigenvalues λ1≥ ··· ≥ λd>0
on the diagonal and the columns of Vare the corresponding orthonormal eigenvec-
tors v1, . . . , vd∈
R
d. In other words, V⊤is an orthogonal transformation which
diagonalises the covariance matrix: V⊤ˆ
ΣV=D, i.e. the data is uncorrelated in
the PCA basis. is means that the axis of the ellipsoid representing the covariance
matrix are aligned with the principal components, as illustrated in Figure .. More-
over, the eigenvalues correspond to the length of each axis of the ellipsoid. In terms
of the optimisation problem (Equation .), each eigenvalue λiis the objective func-
tion value at the maximum, i.e. the largest variance in the orthogonal complement
of span{v1, . . . , vi−1}. us, if we choose to represent our dataset by the first k
principal components, then the fraction of the variance that we explain is given by
σPCA
k=∑k
i=1 λi/∑d
i=1 λi. Conversely, the reconstruction error as a fraction of
the total variance is 1−σPCA
k. e normalised eigenvalue spectrum therefore gives
a good indication of the contribution of each principal component to the variance
of the data. In practice, one oen encounters spectra of a characteristic shape, where
relatively few principal components explain almost all of the variance.
Applications and interpretation
Probably the most common application of PCA is dimensionality reduction (or data
compression), guided by the principle of minimizing the reconstruction error: re-
taining a certain fraction of the variance using the minimum number of dimensions.
22 CHAPTER 2. INTERESTING DIRECTIONS IN DATA
Under a Gaussian noise model, this is equivalent to de-noising, if the data lies on a
lower-dimensional subspace. In this scenario, one selects the number of dimen-
sions based on the shape of the eigenvalue spectrum, or a pre-specified fraction of
the variance that has to be captured. Depending on the application context, one
proceeds either in the PCA coordinates, or projects the data onto the PCA subspace
in the original coordinates. In factor analysis, the focus is oen on interpreting the
coefficients of the loading matrix V.
e PCA solution is meaningful only if the relative scale of the input coordinates
is not arbitrary. Rescaling a single coordinate changes its variance relative to the
other coordinates and therefore the PCA solution. If the relative scaling is arbitrary,
then the PCA basis is somewhat arbitrary.
2.3.2 Independent component analysis
Independent component analysis [Hyvärinen et al. ()] finds a new coordinate
system for the data which maximises the statistical independence of the new coordi-
nates. One motivation of this approach is to recover the underlying (supposedly in-
dependent) mechanisms which have generated the data. From an explorative point
of view, the advantage of having independent components is that they can be studied
univariately, since there are no interactions (dependencies) between them. Whereas
PCA is oen used as a technical pre-processing step in a semi-automatic fashion, the
ICA basis and coordinates are usually subject to further qualitative analysis.
0
0
0
Figure .: ICA example, from le to right: input data; whitening; rotation to
independent components. e red lines are the true independent components,
the black curve is the contour of the sample covariance matrix.
e generative model of ICA is that the observed data Xis generated as a linear
mixture of independent latent variables (or sources) Y,
X=AY,
where we assume that Ais invertible and that there are as many observed variables
as there are latent variables, both Xand Yare assumed to be d-variate.
2.3. MAXIMUM VARIANCE AND INDEPENDENCE 23
e aim of ICA is to recover the latent variables from the mixing, i.e. estimate an
inverse ˆ
A−1for the mixing matrix such that ˆ
Y=ˆ
A−1X. e true mixing matrix
is not uniquely identifiable. First of all, the sign and the scaling of the latent sources
Ycannot be determined from the observations, because we cannot distinguish be-
tween the true sign and scaling and the multiplication of the columns of Aby a
scalar. Moreover, the true order of the latent sources is not identifiable, because we
can absorb a permutation matrix either into the mixing matrix, or into the sources.
us, we can only identify Aup to arbitrary permutation and scaling of its columns¹
In other words, ICA finds an unordered set of one-dimensional subspaces.
Many ICA algorithms proceed in two steps: a whitening followed by a rotation.
In the first step, the data is decorrelated by a multiplication with a whitening matrix
W, which diagonalises the sample covariance matrix: Wˆ
ΣW⊤=I. In this basis,
the data is uncorrelated but not necessarily independent², as we can see in the middle
panel of Figure ..
Aer the decorrelation step, we find a linear transformation that leaves the unity
covariance matrix intact, RIR⊤=Iand maximises the independence of the la-
tent variables, i.e. we find a orthogonal matrix such that the demixing is given by
ˆ
A−1=RW. A large family of ICA algorithms are based on minimizing the mu-
tual information between the estimated sources, as a measure of independence. We
will see that this is equivalent to maximizing the distance between the distribution
of the estimated latent variables and the Gaussian distribution. More formally, let
us assume that the data has already been decorrelated by the whitening matrix W
such that cov(X) = I. Now we seek the optimal rotation matrix R∗that gives us
maximally independent latent variables ˆ
Y=R∗X. In terms of an optimisation
problem, this rotation is given by,
R∗= argmin
RR⊤=I
I[ˆ
Y1,..., ˆ
Yd]
(1)
= argmin
RR⊤=I
d
∑
i=1
H[ˆ
Yi]−H[ˆ
Y]
(2)
= argmin
RR⊤=I
d
∑
i=1
H[ˆ
Yi]−H [X]−log det(R).
Equality () results from the definition of mutual information and equality () fol-
lows from the fact that ˆ
Yis obtained from the observations Xby the linear transfor-
mation R. Since the entropy of Xdoes not depend on R, and since Ris a rotation
¹Under the assumption that there is not more than one Gaussian-distributed source; if that is not
the case, then only the subspace of Gaussian sources can be identified.
²However, if the data is Gaussian, then uncorrelatedness is equivalent to independence.
24 CHAPTER 2. INTERESTING DIRECTIONS IN DATA
(hence det R= 1) we get,
R∗= argmin
RR⊤=I
d
∑
i=1
H[ˆ
Yi]
(3)
= argmin
RR⊤=I
d
∑
i=1 (H[N(µ(ˆ
Yi),cov( ˆ
Yi))]−J( ˆ
Yi))
(4)
= argmin
RR⊤=I
d
∑
i=1 −J(ˆ
Yi),
where the equality () follows from the definition of the negentropy and () results
from the fact that the entropy of the Gaussian distribution is invariant under the
rotation R.
us we have seen that minimizing the mutual information is equivalent to max-
imizing the distance to a Gaussian distribution of the individual sources, as mea-
sured by the negentropy. A practical approach to maximizing the non-Gaussianity
which lends itself well to numerical optimisation is based on the higher order mo-
ments, i.e. moments beyond the mean and the covariance matrix. e JADE algo-
rithm [Cardoso ()], for instance, maximises the kurtosiskurt [ˆ
Yi]usingGradient-
based numerical optimisation.
A great number of different algorithms have been proposed to find these non-
Gaussian projections; some optimise contrast functions like the negentropy by gra-
dient descent [Bell and Sejnowski ()], others solve the problem by approximate
diagonalisation of the fourth-order cumulant [Cardoso and Souloumiac ()] or
by deflation-type fixed-point algorithms [Hyvärinen ()].
Another family of ICA algorithms exploits the time structure of the data. Ex-
amples for these algorithms are e.g. TDSEP [Ziehe and Müller (a)] (a generali-
sation of [Molgedey and Schuster ()]) or SOBI [Belouchrani et al. ()]. Be-
sides the fact that these methods have access to information that other ICA algo-
rithms do not have, they are oen numerically more stable since they rely only on
second-order moments.
Applications and interpretation
e archetypal example for an application of independent component analysis is the
so-called cocktail party problem, where several people are talking simultaneously in
a crowded room and one tries to single out one of the speakers from recordings
at several microphones distributed in the room. Assuming that the speakers are
independent from each other, and that the signals at the microphones are an instan-
taneous mixture of the speakers, this task can be solved by independent component
analysis.
2.4. SUMMARY AND DISCUSSION 25
is has lead to the slightly tongue-in-cheek metaphor of the “cerebral cocktail
party problem” that one faces in the analysis of non-invasive neural recordings. For
example, EEG potentials measured on the surface of the scalp pick up contributions
from a variety of sources located inside the brain. In EEG analysis, ICA has been
applied successfuly to extract meaningful components in a variety of contexts; see
[Jung et al. ()] for a review. In particular, the fact that the columns of the es-
timated mixing matrix can be interpreted as scalp patterns made it possible to use
ICA as a visualisation, or imaging, technique.
2.4 Summary and discussion
We have started this chapter by motivating why a linear transformation of variables
can be useful: from the perspective of a generative model, we recover the latent
variables from the linear mixing. Conversely, the explorative data analyst likes to
obtain new variables with desirable properties, such as pair-wise independence. Af-
ter a brief review of mathematical fundamentals, we introduced the basic setup of
linear coordinate transforms, followed by an overview of the two most widely used
linear methods: PCA and ICA.
Nonlinear axes?
e linear model is a simplification: the relationship between the latent variables
and the observations may well be non-linear. is means that the observed data lies
on a non-linear manifold, and what we want to find is the intrinsic coordinate sys-
tem of that manifold. Over the last decade, a large number of methods have been
proposed for nonlinear dimensionality reduction or manifold learning. e most
popular are kernel PCA [Schölkopf et al. ()], locally linear embedding (LLE)
[Roweis and Saul ()], isomap [Tenenbaum et al. ()], self-organizing-maps
[Kohonen ()], stochastic neighbour embedding [Hinton and Roweis ()],
and principal curves [Hastie and Stuetzle ()].
Despite the fact that the linear model is indeed restrictive, nonlinear methods
do not yet enjoy the same level popularity among practitioners outside the statis-
tics and machine learning communities. ere are several reasons for this. First of
all, most of the nonlinear approaches do not compute an explicit representation for
the relationship between latent variables and observations, but only the new coor-
dinates. In that sense, they do not find interesting directions. e solution is there-
fore harder to interpret. Also, it is usually not straightforward to project new data
points in the found coordinates—the algorithms need to be run again on the ex-
tended data set [Bengio et al. ()]. Moreover, as the nonlinear model is more
complex, it needs to be regularised in order to avoid overfitting. For kernel PCA
[Schölkopf et al. ()], this means choosing kernel parameters; for LLE we need
26 CHAPTER 2. INTERESTING DIRECTIONS IN DATA
to choose local neighbourhoods and a regularisation parameter, and Isomap de-
pends on the construction of a neighbourhood graph that follows the true manifold
exactly. Choosing these parameters is difficult and the solution greatly depend on
it. Lastly, nonlinear methods are, overall, less noise robust and require significantly
more data. Both of which are characteristics that do not bode well for practical ap-
plications.
Useful directions beyond maximum variance and independence
Apart frommaximumvarianceand pair-wise independence, thereare other desider-
ata for linear coordinate transforms. For example, projection pursuit [Huber (),
Friedman and Tukey ()]findsdirectionthatreveals clusterstructurein thedata;
random projections [Rahimi and Recht ()] have been found useful for large-
scale learning; and methods from EEG source reconstruction [Haufe et al. ()]
aim to recover brain sources from recordings on the scalp using a combination of
prior knowledge, physiological and statistical constraints.
However, none of these criteria aims at understanding changes in the distribution
of data. is is the problem addressed by stationary subspace analysis (SSA), which
we introduce in the next chapter.
Chapter 3
Stationary subspace analysis
How does the distribution of a data set change over time? As we had seen in Chap-
ter , this unsuspicious-seeming question can be difficult to answer in the unsuper-
vised multivariate case. at is because from the perspective of the individual input
variables, we only observe the marginals of the joint distribution, but the relevant
changes might occur in the dependencies. Moreover, we do not have labels, tim-
ing information, or a target variable that could help us identifying the distribution
changes. In a simple example we showed that the input basis can be completely
uninformative for understanding changes in the joint distribution.
is is the problem we address in this chapter. We introduce a novel type of
linear coordinate transform, stationary subspace analysis (SSA), which finds direc-
tions based on the strength of distribution changes. More specifically, SSA identifies
two subspaces: the subspace in which the distribution stays constant over time (sta-
tionary subspace) and the directions in which the changes are most pronounced
(non-stationary subspace).
Such a coordinate transform is useful in many circumstances. “Is there a tem-
poral change in my data?” and “what are these changes?” are common questions
in the explorative analysis of data, whenever there is some sort of time structure
present. is need not be an explicit time series, but also, for example, ever-growing
data sets (such as web data, like that extracted from social media sites) or survey
data data that is collected at several points in time. Finding an informative basis for
understanding distribution changes is particularly relevant in the high-dimensional
case.
Even in a less explorative setting, SSA can be the right tool to extract meaningful
components by revealing the most non-stationary components. For example, in the
analysis of spontaneous neural activity [Bießmann et al. (), Allen et al. ()],
where no timing information or target variables is available, the most non-stationary
subspace can contain the information of interest corresponding to the neural activ-
ity.
From a more methodological point of view, SSA can be used as an unsupervised
feature extraction method for certain tasks. For example, in change-point detec-
tion [Blythe et al. ()] (or temporal segmentation), one wants to detect the time
28 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
points at which the data distribution changes. However, estimating and comparing
probability distributions is difficult in high-dimensions. Here, stationary subspace
analysis helps by first finding the relevant non-stationary subspace and then confin-
ing change point dimension to that lower-dimensional basis. Similarly, in density
ratio estimation [Sugiyama ()], one is given two data sets D1and D2and the
goal is to compute the ratio of the estimated densities ˆpD1(·)/ˆpD2(·)at all points in
D1. is is notoriously difficult in high dimensions. Here, finding a basis in which
the distribution of the two data sets differ maximally is a useful pre-processing step.
Moreover, many prediction methods rely on the assumption that the distribution
of the calibration data is equal to the distribution of the data at application time
[Quionero-Candela et al. ()]. When this assumption is violated, SSA can con-
tribute to domain adaptation [Hara et al. ()] by confining the parameter esti-
mation to the stationary subspace.
e remainder of this chapter is organised as follows. In the next Section .,
we introduce the generative model of SSA along with our definition of stationarity
and discuss the identifiability of this model. e issue of identifiability is revisited
in Section ., where we encounter the phenomenon of spurious stationarity and
how it can be overcome. Section . is concerned with the algorithmic approach to
solving the SSA problem and its relationship to statistical testing. Finally, in the last
Section ., we summarise and point to open questions and related work.
3.1 Model and identiability
In the stationary subspace analysis model, the observed data is generated as a linear
mixture of two groups of latent variables (or sources): a set of variables whose joint
distribution remains constant over time (stationary) and another set of variables
whose joint distribution varies (non-stationary). e mixing matrix is assumed to
be time-constant and invertible. ere are no further assumptions on the latent vari-
ables. In particular, both the stationary and the non-stationary variables can have
arbitrary dependencies, also across the two groups. More formally, the observed
D-variate data Xis generated as,
X=ASt=[AsAn][Ss
t
Sn
t],(.)
where the random variables Ss
trepresents the dstationary variables and Sn
tare the
D−dnon-stationary directions and tis the time index t∈
N
. e first dcolumns
of the mixing matrix Asspan the stationary subspace and its last D−dcolumns An
span the non-stationary subspace.
e goal of an SSA algorithm is to invert this mixture, i.e. discern stationary and
non-stationary contributions in the observed data. Before we turn to the issue of
which parts of the model are identifiable, we need to establish more precisely what
3.1. MODEL AND IDENTIFIABILITY 29
stationary
non−stationary
non−
stationary
stationary
Figure .: Illustration of the SSA model. e non-stationary and the station-
ary source have a time-variable joint distribution, as shown in the epoch scatter
plots. e marginal distribution of the stationary direction (vertical axis) stays
constant, whereas the histograms on the horizontal axis change. Note also that
the correlation between the two sources changes between epochs, and that they
are not distinguishable by their variance (signal power).
is meant by stationarity resp. non-stationarity. In its strongest sense, stationarity
means that all properties of the time series Ss
tremain constant over time. at is,
all joint distributions Ss
Twith T⊂
N
are exactly equal. is definition does not
lend itself well to the derivation of a practical SSA algorithm, since it corresponds to
an infinite number of constraints. At the same time, only a few of the quantities to
which these constraints pertain can be estimated reliably from finite data.
A pragmatic denition of stationarity
is is reflected in our definition of stationarity: we only consider the first two mo-
ments (mean and covariance matrix) of the probability distribution and ignore the
time structure. at is, a d-variate time series Ytis stationary if
E[Yt1] = E[Yt2]and E[Yt1Y⊤
t1] = E[Yt2Y⊤
t2],
for all pairs of time points t1, t2∈
N
. is type of stationarity is also called weak
stationarity [Priestley ()] without time structure.
isclearlyisa limitednotion of stationarity. Forexample, notethat atimeseries
whose frequency content changes while its power stays constant would be deemed
stationary. Similarly, a non-stationary time series where all changes are confined to
higher order moments is weakly stationary; though this is probably rare in practice.
30 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
However, in practicalapplications we have foundthat thistype of stationarityalready
achieves interesting novel results, while not requiring large amounts of samples.
What can we recover from the mixture?
e aim of stationary subspace analysis is to invert the mixture of stationary and
non-stationary sources given only samples from the mixture X. at is, we want to
find a demixing matrix,
ˆ
A−1=[Bs
Bn]with Bs∈
R
d×Dand Bn∈
R
D−d×D,
that consists of the matrices Bsand Bnto the stationary and non-stationary direc-
tions respectively. In the following, we will refer to “changing into the source coor-
dinates” as projecting and denote Bsand Bnas the s-projection and the n-projection
respectively. Note that we assume that the true number of stationary directions dis
known. Applying these projections to the mixture yields the estimated latent vari-
ables,
ˆ
Ss
t=BsXand ˆ
Sn
t=BnX.
Before we turn to the problem of finding this demixing, let us consider its identi-
fiability. Assuming that the generative model is correct and that we know the the
true number of stationary and non-stationary directions, an important question is:
up to which symmetries can we recover the true model parameters from the mixed
signals? Our model has one parameter, the mixing matrix A=[AsAn], which
consists of a basis for the stationary and the non-stationary subspace respectively.
A criterion for identifying Amust be based on recovering the latent variables with
the prescribed properties from the mixed data.
Figure .: ere is only one stationary direction (blue) but several non-
stationary ones: projecting onto the green, red and turqoise directions yields a
non-stationary time series (see below).
3.2. THE SSA ALGORITHM 31
e only assumption we have made about the latent variables is that there are d
stationary and D−dnon-stationary ones. An SSA algorithm which achieves recov-
ers sets of sources with these characteristics has exhausted all available information.
us, if we apply such an ideal demixing ˆ
A−1to the observed data Xt,
[ˆ
Ss
t
ˆ
Sn
t]=ˆ
A−1Xt=ˆ
A−1A[Ss
t
Sn
t]=[BsAsBsAn
BnAsBnAn][Ss
t
Sn
t],
we obtain a criterion for a valid demixing in terms of its relationship to the true
mixing matrix.
First of all, in order for the stationary project Bsto yield stationary projec-
tions, all non-stationary contributions need to be removed from the mixture, i.e.
BsAn= 0. is means that we can identify a basis for the true non-stationary sub-
space, because it is uniquely given as the dual space of the rows of Bs. However, the
product BsAsis unconstrained since it corresponds to a linear transformation of the
stationary sources, which does not change their stationary nature for an example.
us we can only identify the true stationary sources up to linear transformations.
e same holds for the non-stationary sources: BnAnis arbitrary. Moreover, the
BnAsis also unconstrained, since in general we can add arbitrary stationary contri-
butions to a non-stationary source without making it stationary. Figure . shows an
example, where there is only one stationary direction, but several directions yielding
non-stationary time series. e true stationary subspace is therefore not identifiable.
3.2 The SSA algorithm
In the previous section, we have observed that we can only recover the true station-
ary sources (up to linear transformations), and that the non-stationary sources are
not identifiable in principle. We will therefore start by developing an optimisation
criterion for finding the stationary projection Bs, before we consider the second
part of the demixing matrix Bn.
Measuring stationarity
e starting point for finding the s-projection is to define a measure of stationarity.
at is, given samples x1, . . . , xmfrom a d-dimensional time series, we need a way
of quantifying its degree of stationarity. According to our definition, a time series
is stationary if its first two moments are constant over time. erefore, a natural
measure of stationarity is based on the distance of the mean and covariance ma-
trix. To jointly compare the first two moments, we need to combine the distance in
the mean and the covariance matrix in a principled way. To that end, we use the
32 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
Kullback-Leibler divergence between D-variate Gaussians, which is given by,
DKL [N(µ0,Σ0)||N(µ1,Σ1)] =
=1
2(log (det Σ1
det Σ0)+ trace (Σ−1
1Σ0)+ (µ1−µ0)⊤Σ−1
1(µ1−µ0)−D).
is means that we approximate the probability density by a Gaussian distribution.
Note, however, that in the following derivation we will not assume that the sources
are Gaussian distributed. e Gaussian merely serves as the least restrictive distri-
butional assumption for data with fixed mean and covariance matrix, in the sense
that it has the highest entropy [Jaynes ()].
In order to compute this stationarity measure from data, we need to estimate the
moments at several points in time. is we do either by segmenting the time series
into consecutive epochs, or by using a sliding window¹. Let ˆµ1, . . . , ˆµn∈
R
dand
ˆ
Σ1, . . . , ˆ
Σn∈
R
d×dbe the epoch mean and covariance matrices respectively, where
nis the total number of epochs. Moreover, let mbe the total number of samples and
m1, . . . , mn, the number of samples in each epoch.
Based on the KL-divergence between two epochs, there are now several ways
of defining a measure of stationarity for the whole time series. For example, we
could sum up the KL-divergence between all pairs of epochs. We choose a different
approach, which is computationally more efficient: comparing each epoch against
a reference epoch. Clearly, the sum of these KL-divergences is zero, if and only if,
all moments are equal and otherwise positive. As a reference epoch, we choose the
average epoch, with moments ¯µand ¯
Σ,
¯µ=
n
∑
i=1
µiand ¯
Σ =
n
∑
i=1
Σi.
erefore, given an epochised time series we measure its stationarity as,
L(µ1, . . . , µn,Σ1, . . . , Σn) =
n
∑
i=1
DKL [N(µi,Σi)||N(¯µ, ¯
Σ)],
which is zero if and only if the time series is perfectly stationary. Note that this
measure of stationarity can grow arbitrarily large.
From a measure of stationarity to an objective function
In order to find the stationary projection, we minimise the non-stationarity of the
estimated stationary directions. us the optimal stationary projection is the solu-
¹e design of the epoch structure is an important issue that we will discuss later.
3.2. THE SSA ALGORITHM 33
tion to the optimisation problem,
Bs∗= argmin
B∈
R
d×D
L(Bµ1, . . . , Bµn, BΣ1B⊤, . . . , BΣnB⊤)
= argmin
B∈
R
d×D
n
∑
i=1
DKL [N(Bµi, BΣiB⊤)||N(B¯µ, B ¯
ΣB⊤)].
In this form, the objective function does not lend itself well to numerical optimisa-
tion: without constraints on B, degenerate solutions cannot be avoided (e.g. a rank
deficient projection) and the fact that the objective contains inverses of a projected
covariance matrix can lead to numerical instability.
In order to make the objective more benign, we exploit the invariances that we
observed earlier while studying the identifiability of the model parameters. As we
will see inthe following, wecanfix themeanandthecovariancematrix oftheaverage
epoch to 0and Irespectively without loss of generality by choosing a different basis
and introducing constraints.
First of all, we can translate all epoch mean vectors without changing the so-
lution, because the distance between the epochs does not depend on their overall
location. us we centre the average epoch, i.e. translate the data such that ¯µ= 0.
Moreover, we can perform the optimisation in an arbitrary basis. us we first ap-
ply a whitening such that the average covariance matrix is the identity, ¯
Σ = I. In
order to ensure that the moments of the average epoch stay constant under any pro-
jection, we add the constraint that BB⊤=I. Recall that we can only identify the
true stationary projection up to arbitrary linear transformations, so this is not a re-
striction. is means that we have written the demixing matrix as a whitening of
the original data by a matrix Wfollowed by an orthogonal matrix, i.e. ˆ
A−1=RW
with RR⊤=I. Note, however, that only the first drows of the demixing matrix
enter into the objective when optimizing for the stationary projection, because we
only evaluate the estimated stationary sources.
In the following we assume that that the centring and whitening has already been
applied to the data such that ¯µ= 0 and ¯
Σ = I. uswe can rewritethe optimisation
34 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
problem as,
Bs∗= argmin
BB⊤=I
n
∑
i=1
DKL [N(Bµi, BΣiB⊤)||N(0, I)]
= argmin
BB⊤=I
1
2
n
∑
i=1 (−log det(BΣiB⊤) + trace(BΣiB⊤) + ∥Bµi∥2−d)
= argmin
BB⊤=I
1
2
n
∑
i=1 (−log det(BΣiB⊤) + ∥Bµi∥2)−nd
2
+1
2trace (B(n
∑
i=1
Σi)B⊤)
(1)
= argmin
BB⊤=I
1
2
n
∑
i=1 (−log det(BΣiB⊤) + ∥Bµi∥2)−nd
2+n
2trace Id×d
= argmin
BB⊤=I
1
2
n
∑
i=1 (−log det(BΣiB⊤) + ∥Bµi∥2)
| {z }
=J(B)
,(.)
where the equation above follows from the fact that we have set the average covari-
ance matrix to the identity. e gradient of the objective function Jwith respect to
the projection Bis given by
∂J
∂B =
n
∑
i=1 (BΣiB⊤)−1BΣi+Bµiµ⊤
i,(.)
fromwhichweseethattheobjectivefunctionisnotconvexduetothelog-determinant.
Gradient descent over the special orthogonal group SO(D)
Even thought we have achieved a simpler, and computationally advantageous ob-
jective function (.), the orthogonality constraint BB⊤=Imakes it difficult
to optimise. In the next step, we therefore eliminate this constraint by choosing
a parametrisation that ensures that the constraint is automatically fulfilled. To that
end, instead of optimizing a projection matrix, we are looking for an orthogonal
transformation RR⊤=Iof the data such that the first dcoordinates are stationary.
In the objective function, this means that we write Bin terms of Ras its truncation
B=IdRto the first drows (Idis an identity matrix truncated to the first drows).
In order to find the orthogonal matrix R, we use an iterative optimisation pro-
cedure using multiplicative updates by an orthogonal matrix. Starting with a ran-
dom orthogonal matrix, in the k-th step we find an orthogonal update Usuch that
3.2. THE SSA ALGORITHM 35
Rk+1 =URk. Aer termination in the ℓ-th step, the found solution is Bs∗=
IdRℓW, where Wis the whitening matrix.
But how do we find each orthogonal update? Following the standard gradient-
based optimisation strategy, what we would like to do is to identify a direction of
steepest descent in the set of orthogonal transformations and then perform a line
search along this direction to find the optimal step.
Let us first of all take a closer look at this set of orthogonal matrices. e orthog-
onal matrices form a multiplicative group: the product of two orthogonal matrices
is orthogonal and the inverse of an orthogonal matrix is orthogonal; the identity ma-
trix is orthogonal. is group is called the orthogonal group O(D)and it consists
of elements of two types: rotations with determinant +1 and reflections with deter-
minant −1. e rotations are a normal subgroup of the orthogonal group, called
special orthogonal group SO(D)▹O(D). To find the true stationary projection, we
need to project orthogonal to the non-stationary subspace, i.e. we want to remove
stationary contributions from the first dcoordinates. is can always be achieved
by a rotation of the data; and at each step we can improve on this objective by a
applying a rotation — changing the sign of one coordinate (i.e. apply a reflection)
does not change the objective. Hence we can constrain our search for the direction
of steepest descent to the subgroup SO(D), the rotation matrices.
Figure .: Illustration of the tangent space (blue plane) at one point of the group
(sphere). A direction in the tangent space (blue arrow) corresponds to a one-
parameter subgroup (or geodesic) in the group, shown as the curved red arrow.
In order to find the update rotation U, we compute the gradient of our objective
function J(U)and use it as a search direction over the special orthogonal group.
is is possible because SO(D)is a Lie group, i.e. it is a continuous differentiable
manifold. However, since the SO(D)is not a Euclidean space but a manifold, defin-
ing the gradient is a bit more involved. e gradient at a point on a manifold is
36 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
an element of the tangent space (linearisation) at that point. In a Euclidean space,
the tangent space at each element is equal to the Euclidean space itself, because it
is a linear space. On a curved manifold, this is not the case; see Figure . for an
illustration
In a Lie group, the tangent space at the identity has a special meaning: it is the
corresponding Lie algebra. Informally speaking, the Lie algebra is a vector space
spanned by the infinitesimal transformations corresponding to the group action,
which in our case, are the infinitesimal rotations.
e Lie algebra so(D)of the special orthogonal group SO(D)is the set of all
matrices M∈RD×Dthat are anti-symmetric, M=−M⊤. e element Mij can
be interpreted as the angle of an infinitesimal rotation of axis jtowards axis i. Since
we absorb the update rotations into the data, at each step we want to find the gradient
of J(U)at the identity I. e gradient is thus an element of the Lie algebra and we
can find it by using the connection between the group SO(D)and the algebra so(D)
provided by the exponential map. e exponential map,
exp : so(D)→SO(D),
is a one-to-one correspondence between the skew-symmetric matrices and the ro-
tation matrices, given by the matrix exponential. Geometrically speaking, it maps
straight lines in the tangent space onto geodesics on the manifold. See the blue and
right arrows in Figure . for an example.
Following [Plumbley ()], we parametrise the update Uby an antisymmetric
matrix U= exp(M)and calculate the gradient ∇so(D)J(exp(M)) at M= 0 to
find the search direction. Using this gradient, we can perform a line search along
the corresponding geodesic {exp (t∇so(D)J)|t∈
R
}to find the update U.
To calculate the gradient in the Lie algebra, we need a notation of distance be-
tween anti-symmetric matrices. A natural way to define a norm is to adapt the
Frobenius-norm to reflect the fact that every element of M∈so(D)appears twice,
∥M∥2=1
2∑
i,j
M2
ij,
by including the factor 1/2. is corresponds to an inner product for M1, M2∈
so(D)given by,
⟨M1, M2⟩=1
2∑
i,j
(M1)ij(M2)ij =1
2trace(M⊤
1M2).
Using this scalar product, we can implicitly define the gradient in the Lie algebra in
terms of its projection onto unit vectors. Let H∈so(D)be a unit vector ∥H∥= 1
in the algebra; then the component of the gradient in the direction of His the scalar
3.2. THE SSA ALGORITHM 37
product ⟨∇so(D)J, H⟩. Moreover, we the derivative of the update rotation U=
exp(tH)along the direction H(a so called single-parameter subgroup) is given by,
∂exp(tH)
∂t =Hexp(tH).(.)
In the following, we will obtain an explicit expression for the gradient ∇so(D)Jby
equating the derivative of Jalong Hwith the projection of the gradient ∇so(D)J
onto Hin the Lie algebra. First of all, using the chain rule, we get an expression for
the derivative of Jalong H,
∂J(exp(tH))
∂t = trace [(∂J
∂U )⊤∂U
∂t ]
(1)
= trace [(∂J
∂U )⊤
HU]
(2)
= trace [U(∂J
∂U )⊤
H]
(3)
= trace [1
2(U(∂J
∂U )⊤
−(∂J
∂U )U⊤)H]
(4)
=⟨(∂J
∂U )U⊤−U(∂J
∂U )⊤
, H⟩,
where equality 1follows from Equation .; equality 2is a property of the trace;
equality 3holds because H∈so(D)and is therefore antisymmetric; and equality 4
corresponds to our definition of the scalar product in the Lie algebra.
By the definition of the gradient, its projection onto Hmust be equal to the
partial derivative ∂J(exp(tH)/∂t,
⟨∇so(D)J, H⟩=∂J(exp(tH)/∂t
=⟨(∂J
∂U )U⊤−U(∂J
∂U )⊤
, H⟩,
and since this equality holds for all unit vectors H∈so(D)it follows that,
∇so(D)J(exp(M)) = (∂J
∂U )U⊤−U(∂J
∂U )⊤
.(.)
Using this equation, we can compute the gradient in the Lie algebra so(D)at M= 0
for any objective function J(exp(M)) = J(U)for which we know the partial
derivative∂J/∂U. is isaclassical approach, forexampleinICA [Plumbley ()].
38 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
Let us now have a look at the structure of the gradient of J(U). Since only
the first drows of the update rotation Uaffect the objective, only the first drows
of the derivative ∂J/∂U are non-zero. e non-zero part of ∂J/∂U is given by
Equation .. As we evaluate the gradient at the identity M= 0, and since the
upper d×dsubmatrix of ∂J/∂U is symmetric, it follows that the gradient in the
Lie algebra has the form,
∇so(D)J(exp(M))M=0 =∂J
∂U −(∂J
∂U )⊤
=[0Z
−Z⊤0],(.)
where Z∈
R
d×Dis the non-zero part. is block-structure has an intuitive in-
terpretation: the two square zero parts of dimension dand (D−d)correspond to
rotations within the stationary and non-stationary source space respectively. ese
do not change the objective. e non-zero part Zcontains the angles between the
two subspaces. us we can reduce the number of variables in the optimisation to
d(D−d).
e gradient (Equation .) defines a search direction in the Lie algebra, which
allows for geodesic line search over the rotations using the exponential map. e
whole procedure is summarised in Algorithm .
Algorithm Stationary subspace analysis: finding the projection Bs∗to the most
stationary sources
: function SSA(d, µ1, . . . , µn,Σ1, . . . , Σn)
: Compute the average mean ¯µ←1
n∑i=1 µi.
: Compute the average covariance matrix ¯
Σ←1
n∑i=1 Σi.
: Center the epochs µi←(µi−¯µ)∀i
: Compute the whitening matrix W←¯
Σ−1
2
: Let B←random rotation matrix in SO(D)
: Initialise the epochs µi←(BW)µiand Σi←(BW)Σi(BW)⊤∀i
: repeat
: Let G← ∇so(D)J(U)be the gradient of the objective at U=I
: Perform a line search for t∗on the geodesic {exp (t G)|t∈
R
}
: Let U←exp(t∗G)be the update rotation
: Store the update B←UB
: Rotate the epochs µi←Uµiand Σi←UΣiU⊤∀i
: until convergence
: Let Bs∗←IdBW be the stationary projection
: return Bs∗
: end function
3.2. THE SSA ALGORITHM 39
Finding the most non-stationary sources
As we have seen in the previous sections, the true non-stationary sources cannot be
identified from the mixture because one can add arbitrary stationary components
without changing their non-stationary nature. However, in practice one is oen
interesting in finding the most non-stationary directions instead of recovering the
true non-stationary directions. is, however, can not necessarily be achieved by
projecting orthogonal (in the whitenend basis) to the stationary components, i.e.
choosing the bottom D−drows of the data rotation as the non-stationary pro-
jection (see Algorithm ). e right panel of Figure . shows a situation where
the projection that is orthogonal to the true s-projection yields an almost stationary
latent variable, as the projected variance in the two epochs is almost equal. As it
turns out, whenever there are changes in the correlation between the stationary and
non-stationary variables, we can find more non-stationary directions by explicitly
maximizing the non-stationarity in a second step.
Σ1
Σ2
Σ1
Σ2
Bs
Bs
ˆ
Bn
ˆ
Bn
Figure .: e le panel shows two epoch covariance matrices where the covari-
ance between the stationary and the non-stationary direction does not change.
Here, projection orthogonal to the stationary subspace yields the most non-
stationary sources. In the right panel, the non-stationarity is mostly contained
in the time-variable covariance; projecting orthogonal to the stationary subspace
yields only mildly non-stationary sources.
Let us analyse the situation more rigorously. We consider the simple case where
we have one stationary and one non-stationary source with corresponding nor-
malised basis vectors ∥As∥= 1 and ∥An∥= 1 respectively, where ϕbe the angle
between the two spaces, i.e. cos ϕ=As⊤An. We will consider an arbitrary pair of
epochs, T1and T2, and show which projection maximises the difference in mean
∆µand variance ∆σbetween them, in relation to the true stationary projection Bs.
Let X1and X2be bivariate random variables modeling the distribution of the
data in the two epochs. According to the linear mixing model, we can write X1and
40 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
X2in terms of the underlying sources,
X1=AsXs+AnXn1
X2=AsXs+AnXn2
where the univariate random variable Xsrepresents the stationary source and the
two univariate random variables Xn1and Xn2model the non-stationary source,
whose probability distribution changes between the two epochs, i.e. Xn1is not equal
to Xn2in distribution. In the following, we will assume without loss of generality,
that the true s-projection Bs= (An)⊥is normalised, ∥Bs∥= 1. We can write the
estimated n-projection as a linear combination of Bsand An,
ˆ
Bn=αBs+βAn⊤,(.)
with α, β ∈
R
such that ∥ˆ
Bn∥= 1.
In the next step, we will observe which n-projection maximises the difference in
mean and covariance between the two epochs. Let us first consider the difference in
the mean of the estimated n-sources,
∆µ= E[ ˆ
BnX1]−E[ ˆ
BnX2] = ˆ
BnAn(E[Xn1]−E[Xn2])
which is maximal for ˆ
BnAn= 1, i.e. when ˆ
Bnis orthogonal to Bs. us, with
respect to the difference in the mean, choosing the n-projection ˆ
Bnto be orthogonal
to the s-projection is always optimal. e difference in the variance of the estimated
n-sources is,
∆σ= var[ ˆ
BnX1]−var[ ˆ
BnX2] = β2(var[Xn1]−var[Xn2])
+ 2 [αcos (ϕ+π
2)+βcos ϕ](cov[Xs, Xn1]−cov[Xs, Xn2])
| {z }
=∆σsn
.
Clearly, when there is no change in the covariance of the s- and the n-sources be-
tween the two epochs (∆σsn = 0), this difference is maximised when βis maximal,
which by Equation . implies that ˆ
Bn= (Bs)⊥. However, when the covariance
between s- and n-sources does vary (|∆σsn |>0), the derivative of ∆σwith respect
to αat α= 0 is ,
∂∆σ/∂α|α=0 = 2 cos (ϕ+π
2)∆σsn .
Hence α= 0 is not an extremum when |∆sn|>0, which means that the most
non-stationary projection is not orthogonal to the s-projection in this case.
us, in order to find the projection to the most non-stationary sources, we also
need to maximise the non-stationarity of the estimated n-sources. To that end, we
3.2. THE SSA ALGORITHM 41
simply maximise the SSA objective function (Equation .) for the n-projection, i.e.
Bn∗= argmax
BB⊤=I
1
2
n
∑
i=1 (−log det(BΣiB⊤) + ∥Bµi∥2).
e procedure is completely analogous to Algorithm , apart from the fact that we
reverse the sign of the objective.
Stationary subspace analysis in deation mode
SSA factorises the observed data into two groups of sources, both of which can only
be determined up to arbitrary linear transformations. In contrast to a PCA or ICA
solution, the results of SSA are two multidimensional subspaces and not a set of one-
dimensional components. is can make the solution more difficult to interpret,
because there is no unique set of projections resp. basis elements. erefore, in some
scenarios it is desirable to perform SSA univariately, in so-called deflation mode.
is means that we find a set of uncorrelated one-dimensional components which
are ordered by their degree of non-stationarity.
In each step, we optimise the non-stationarity of a single component; aer con-
vergence, we findthenextcomponentintheorthogonal complement(in thewhitened
basis). is procedure is called deflation mode because in each step, we reduce the
dimensionality of the data space by one. In this way, we ignore changes in the cor-
relation between variables. However, the advantage of this approach is that the so-
lution is unique (up to local minima). What is more, the fact that we have factorised
the input space into a set of one-dimensional subspace allows for a comparison of
their non-stationarity in terms of the value of the objective function.
3.2.1 Relationship to statistical testing
e absolute value of our measure of stationarity is hard to interpret: it is bounded
from below by zero for perfectly stationary sources, but it can grow arbitrarily large.
On finite samples it will never be exactly zero even for perfectly stationary data, since
fluctuations in the moment estimates introduce differences across epochs. However,
the value of the objective function is an important ingredient for interpreting an SSA
solution. For instance, in order to determine an adequate number of stationary di-
rections in a data-driven manner, one can increase the number of stationary sources
until the non-stationarity reaches a certain critical threshold. In order for this work,
one needs to normalise the objective function value such that it becomes comparable
across different numbers of dimensions, epochs and sample sizes.
In this section, we show that our measure of stationarity is equivalent to a test
statistic: under the null hypothesis of perfectly stationary Gaussian data, its distribu-
tion converges to a χ2-distribution [Blythe et al. ()]. e test statistic compares
two competing Gaussian models for the data: the simple model H0that presuppose
42 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
that each epoch has the same mean and covariance; and the alternative model HA,
under which each epoch has its own set of parameters. is is a test for model com-
parison: it tells us whether we should reject the simple model H0in favour of the
more complex HA. Since the two models are nested (i.e. H0is a special case of HA),
and since we obtain the parameters by maximum likelihood, this hypothesis can
be tested in the likelihood ratio testing framework [Neyman and Pearson ()],
where the test statistic is the log of the ratio of the likelihood of the data under H0
and HA.
More formally, let X1, . . . , Xnbe d-variate random variables modeling the dis-
tribution of the data in each of the nepochs. Aer centering and whitening, the
hypothesis can be written as follows.
H0:X1, . . . , Xn∼ N(0, I)
HA:X1∼ N(µ1,Σ1), . . . , Xn∼ N(µn,Σn)
Let Xbe our data set that is segmented into epochs X=T1∪ ··· ∪ Tnand let
LH0(X)and LA(X)be the likelihood of the data under H0and HArespectively
where the parameters are the maximum likelihood estimates. en the likelihood
ratio test statistic Λ(X)is,
Λ(X) = −2 log LH0(X)
LA(X),(.)
whichunderthe nullhypothesisH0isapproximatelyχ2distributedwithk=1
2nd(d+
3) degrees of freedom. It can be shown that by applying the variance-stabilizing
transformation,
Λ′(X) = √2Λ(X),
we arrive at a test statistic that is approximately Gaussian distribution with mean
µ=√2k−1and unit standard deviation σ= 1.
We will now show the equivalence between the test statistic Equation . and the
SSA objective function. For the sake of simplicity, let us assume that all epochs are of
equal size m=|T1|=··· =|Tn|. e estimates of the epoch means and covariance
matrices are given by ˆµi=1
m∑x∈Tixand ˆ
Σi=1
m−1∑x∈Ti(x−ˆµi)(x−ˆµi)⊤
for all epochs 1≤i≤n. As in the derivation of the SSA algorithm, we assume that
the data is centred and whitened such that 1
n∑n
i=1 ˆµi= 0 and 1
n∑n
i=1 ˆ
Σi=I.
3.2. THE SSA ALGORITHM 43
en the log-likelihood of the data under H0is given by,
log LH0(X) =
N
∑
i=1 ∑
x∈Ti
log pN(x; 0, I) = −1
2
N
∑
i=1
md log(2π) + ∑
x∈Ti
x⊤x
(1)
=−1
2
N
∑
i=1
md log(2π) + trace
∑
x∈Ti
xx⊤−ˆµiˆµ⊤
i
+mˆµiˆµ⊤
i
=−1
2
N
∑
i=1
md log(2π) + mˆµ⊤
iˆµi+ trace [(m−1)ˆ
Σi]
(2)
=−1
2
N
∑
i=1 (md log(2π) + mˆµ⊤
iˆµi)+(n
∑
i=1
(m−1))trace [I]
=−1
2
N
∑
i=1
md log(2π) + mˆµ⊤
iˆµi+ (m−1)d,
where equality follows from the identity that x⊤x= trace(xx⊤)and the equal-
ity is a consequence of the fact we have set the average epoch mean and epoch
covariance matrix to 0and Irespectively.
e log-likelihood of the data under the alternative hypothesis HAis given by,
log LHA(X) =
N
∑
i=1 ∑
x∈Ti
log pN(x; ˆµi,ˆ
Σi)
=−1
2
N
∑
i=1
md log(2π) + mlog det ˆ
Σi+ (m−1)d.
us the test statistic is,
Λ(X) = −2(log LH0−log LHA)
=m
n
∑
i=1 (−log det ˆ
Σi+ ˆµ⊤
iˆµi),
which is equivalent to the SSA objective function in the sense that its value is the
same on source estimates up to the multiplication by the epoch sample size m. In
this derivation, we assumed that each epoch consists of the same number of samples.
If we allow arbitrary sample sizes, then in the test statistic each term in the sum is
weighted by the respective epoch sample size.
As we noted earlier, the distribution of the statistic Λis only approximately χ2-
distributed and, most importantly, this holds only under the assumption that the
44 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
data is in fact Gaussian distributed. How useful this test statistic is in practice there-
fore depends crucially on (a) how well its distribution is approximated by the χ2
distribution and (b) what happens when the data is non-Gaussian.
A non-parametric approach to testing for stationarity is based on resampling.
e basic idea is to generate samples from the objective function on stationary data
sets that have the same properties as the input in terms of the shape of the distribu-
tion (kurtosis etc.). To this end, we put all data together and divide it randomly into
epochs of the same size as specified in the application of the SSA algorithm. Since
the data points are assigned randomly to epochs, every epoch is sampled from the
same distribution. Evaluating the objective function on one such assignment pro-
vides us with one sample from the objective function on stationary data of this kind.
By doing this repeatedly, we can approximate this distribution using a histogram, of
estimate quantiles for statistical testing. In Chapter , we compare the parametric
and the non-parametric approach to stationarity testing in controlled simulations.
3.3 Spurious stationarity
In Section ., we have seen that we can identify a basis for the true non-stationary
subspace: recovering stationary sources is equivalent to projecting orthogonal to the
true non-stationary subspace. However, this argument relies on the assumption that
every contribution from the non-stationary subspace yields a non-stationary latent
variable. If we observe a limited amount of variation, then this may not be the case
as the following example shows.
Figure .: Example of a spurious stationary direction. Both plots show the joint
distributions of non-stationary latent variables. In the le panel, we have observed
two different joint distributions, with the same epoch but different covariance ma-
trices (blue and red contours). ere exists a direction on which the projection
agrees (black line). is is a spurious stationary direction, because it is located
in the non-stationary space. In the right panel, we add a third joint distribution:
now there exists no spurious stationary direction.
Imagine, for instance, that we have two non-stationary latent variables and that
3.3. SPURIOUS STATIONARITY 45
we observe two different joint distributions Xn
1and Xn
2, which differ only with
respect to their covariance matrix, Σ1= cov(Xn
1)and Σ2= cov(Xn
2)whereas
µ(Xn
1) = µ(Xn
2). e le panel of Figure . shows the contours of the covariance
matrices Σ1and Σ2. Note that there exists a direction w∈
R
2(black line) on which
the variance of the two random variables is equal: w⊤Σ1w=w⊤Σ2w. Accord-
ing to our definition of stationarity, this is a stationary direction, even though it lies
in the non-stationary subspace: this is a spurious stationary direction. Only aer
we add a third covariance matrix (right panel) does this spurious stationary direc-
tion disappear. e existence of spurious stationary directions makes it impossible
to identify the true non-stationary subspace. We therefore need a criterion which
guarantees that there are no spurious stationary directions.
In order to derive a theoretical result, we consider the generic case in order to
exclude pathological cases. In other words, we assume that the true covariance and
mean of the epochs do not have any special properties, other than being equal on
the true stationary coordinates. Under this assumption, we can derive the following
bound on the number of necessary epochs.
Proposition Let nbe the number of epochs in Ddimensions and let d<Dbe
the number of stationary directions. Moreover, let us assume that the mean and
the covariance matrix of the non-stationary sources in each epoch is generic, in the
sense that with probability one no equation between their elements is fulfilled. en
for any number of epochs nfor which holds,
n > D−d
2+ 1,
there exists no spurious stationary direction. In the case where the mean vectors are
known to be constant, this bound becomes,
n > D −d+ 1.
e rigorous proof relies on results form generic algebraic geometry; it can be
found in [Kiraly et al. ()]. e geometric intuition is that each available epoch
moment reduces the degrees of freedom for a spurious stationary direction by one.
e numerator in the first bound therefore stems from the fact that each epoch pro-
vides us with two moments. us if integrated higher order moments into the SSA
algorithm, the number of required epochs would be smaller—under the assumption
that they carry information, i.e. variation in the non-stationary directions.
is bound is good news for practical applications: the number of necessary
epochs grows only linearly with the size of the non-stationary subspace. is is con-
sistent with the results of the simulations presented in Chapter .
46 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
3.4 Summary and discussion
In this chapter, we have introduced a new type of latent variable model: the ob-
served data is a linear mixture of stationary and non-stationary variables, which are
not assumed to be independent. Based on a pragmatic definition of stationarity, we
have analysed which parameters are identifiable in principle, and up to what symme-
tries. We derived a computationally efficient SSA algorithm, which optimises over
the special orthogonal group SO(D)using Lie group properties. e equivalence
of the objective function with a statistical hypothesis test for stationarity suggests
a normalisation. Finally, we investigated the phenomenon of spurious stationarity:
given a set of generic mean vectors and generic covariance matrices, under which
condition is there no subspace on which they are identical?
Beyond weak stationarity
e most relevant limitation of the current algorithm is its definition of stationarity,
which ignores temporal structure. As long as the power of a source remains constant
it is deemed stationary, no matter what happens to the frequency content. However,
in many situations it is the variation in the time structure thats is most interesting.
Similarly, changes in higher order moments go unnoticed by the current SSA objec-
tive. Even though the first two moments are easier to interpret and estimate, there
may be situations where changes e.g. in the kurtosis are relevant.
ere is a number of strategies for addressing both issues. A straightforward
approach for integrating temporal information would be to consider time-lagged
covariance matrices in addition to the epoch covariance matrices as in temporal
ICA. e advantage of this technique is that most of the optimisation problem’s ap-
pealing features can be retained. Using higher order information is less straightfor-
ward for various reasons. Whereas the Gaussian approximation is a sensible choice
for comparing distributions up to their first two moments, it is less clear how to
do this for higher moments in a way which corresponds to a natural weighting of
the distances in the individual moments. Expansions of the Kullback-Leibler diver-
gence for higher moments already become computationally unwieldy in the case of
the third and fourth moment. Non-parametric approaches oen do not lend them-
selves well to optimization, because one can not compute a derivative with respect
to a linear projection of the data.
Finding the number of stationary sources
roughout this chapter, we have tacitly assumed that the true number of stationary
sources is known. While this may be true in certain special cases, for most real data
sources it is not. Manual approaches of selecting this parameter have their limits.
In some scenarios, increasing the number of stationary sources until the estimated
3.4. SUMMARY AND DISCUSSION 47
sources become “too non-stationary” by visual inspection may be adequate. Draw-
ing this line, however, is not straightforward using just a subjective assessment of
the result. Firstly, the change in correlations is difficult to assess by visual inspec-
tion. Secondly, one needs to account for the fact that by increasing the number of
sources, more changes in the distribution are inevitable. We outlined a solution to
this problem using two kinds of stationarity tests, which are evaluated empirically in
Chapter . In the next chapter, we present an alternative approach based on an alge-
braic view. is has the advantage that one can evaluate all possible choices for the
number of stationary sources simultaneously by looking at an eigenvalue spectrum.
Design of the epoch structure
e SSA objective function evaluates distribution changes between epochs of a time
series. Unless the way in which the data is segmented into epochs is somehow de-
termined by the application context (e.g. aligned to experimental markers or cor-
responding to meaningful temporal intervals), the setup of the epoch structure is a
crucial model parameter. It defines the time scale on which distribution changes are
visible to the objective function, and the time points at which they are detectable. In
the purely unsupervised case, where one wants to make as little assumptions about
the data as possible, a good choice is to use a sliding window instead of a segmenta-
tion. In order to increase the temporal resolution, a short window is clearly desir-
able. Even though this increases the effect of small sample errors, our simulations in
Chapter show that until the windows get very short (approximately less than 2D
samples), the impact on the accuracy is negligible.
Post-processing and pre-processing, interpretation
e result of inverting the SSA model are two multivariate sets of sources: desti-
mated stationary sources and D−destimated non-stationary sources, where the
latter have been optimised for maximum non-stationarity. As we have seen before,
both groups can only be determined up to arbitrary linear transformations. e in-
dividual sources, therefore, have no particular meaning; the same is true of the basis
of the estimated non-stationary subspace. is makes it difficult to interpret an SSA
solution.
ere are several ways of making an SSA solution unique, and possibly mean-
ingful, in a post-processing step. In order to allow for univariate interpretation and
since independent sources perhaps correspond to underlying latent variables, one
can apply ICA to both sets of sources. A way of post-processing that is in line
with analysing non-stationarities is applying SSA in deflation mode to both sets of
sources. is provides us with two unique sets of sources, each of which is ordered
by its degree of non-stationarity.
As in factor analysis [Harman ()], in some applications one may be primar-
48 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
ily interested in the coefficients of the basis of the non-stationary subspace. For ease
of interpretation, it is desirable that each basis vector has only a few large coefficients,
so that one can assign (meaningful) latent variables to directions in data space. To
that end, there exist a variety of methods for post-rotating a basis to achieve this
[Abdi ()]; the most prominent is Varimax [Kaiser ()], which maximises
the variances of the coefficients in each basis vector by application of a rotation.
PCA and ICA are not SSA
PCA, ICA and SSA are similar in that their generative model is a linear mixing of la-
tent variables (see Chapter ). e crucial distinction lies in the specific assumptions
about the mixing matrix and the latent variables. First of all, in the SSA model there
are two groups of sources with prescribed properties whereas PCA and ICA assume
the existence of Dunivariate components. For this reason, it is not straightforward
to compare PCA and ICA solutions to an SSA solution. But most importantly, it is in
general not possible find the most stationary or the most non-stationary projection
by combining PCA or ICA directins in a post-processing step. at is because the
criteria of PCA and ICA algorithms are completely unrelated to stationarity resp.
non-stationarity.
Epoch 1 Epoch 2 Epoch 3 Epoch 4
Whole dataset
(input basis)
Whole dataset
(ICA basis)
PCA directions
ICA directions
Figure .: Exemplary data set from Chapter with the PCA and ICA solutions.
Figure . shows the PCA and ICA solutions for the illustrative data set first
introduced in Figure .. As we had already seen in Chapter , both PCA and ICA
yield uninformative bases in the sense that they do not separate the stationary from
the non-stationary components. In the two right panels, we see why this is the case.
PCA diagonalises the average covariance matrix. Since the vertical axis has higher
variance and is, on average, uncorrelated with the horizontal axis, PCA does not
change the basis (significantly) — the distribution changes remain hidden. ICA, on
the other hand, is looking for two independent coordinates. As we can see from the
scatter plot of the whole data set, the input coordinates are uncorrelated on average
but not independent: the data appears to be generated by a Gaussian mixture model
3.4. SUMMARY AND DISCUSSION 49
with (at least) two independent components, and it is these that the ICA algorithm ²
identifies.
²Here we have used the JADE algorithm [Cardoso ()].
50 CHAPTER 3. STATIONARY SUBSPACE ANALYSIS
Chapter 4
An algebraic approach
In the previous chapter, we presented an optimisation algorithm for finding the sta-
tionary projection. In essence, this algorithm searches the space of all possible pro-
jections guided by an objective function that is zero at the optimal solution (see
Figure .). In this chapter, we explore a different route: instead of searching the
space of possible solutions, we directly solve the SSA problem algebraically.
is is possible because the set of stationary projections can be described in
terms of equations: in the optimal stationary coordinates, all epoch mean vectors
are equal to the null vector and all epoch covariance matrices are equal to the iden-
tity matrix. More formally, an optimal stationary projection Bsis defined by the
system of linear and quadratic matrix equations,
Bsµ1= 0
.
.
.
Bsµn= 0
BsΣ1(Bs)⊤=I
.
.
.
BsΣn(Bs)⊤=I,
where nis the number of epochs. From this point of view, the SSA objective function
can be seen as measuring the extent to which this system of polynomial equations
is not fulfilled for a particular projection.
e gist of the algebraic algorithm is that we manipulate this system of equations
in a way that reveals the stationary projection. is can be understood in analogy
to solving a system of polynomial equations. However, as the coefficients of these
equations are derived from epoch moments estimated on finite sets of samples, the
exact set of solutions will always be empty in the generic case. To this end, we de-
velop a notion of finding an approximate linear space of solutions. Even though this
is an approximate solution, it can be found in closed form.
e presented algorithm was originally derived from an algebraic-geometric
52 CHAPTER 4. AN ALGEBRAIC APPROACH
0 20 40 60 80 100 120 140 160 180
Objective function
Angle of projection (α)
Objective function over all possible projections
Covariance matrices
True solution (α=50)
Local minimum (α=154)
Figure .: Illustration of the optimisation approach to SSA. e le panel shows
the contour plots of three sample covariance matrices. e black line is the true
one-dimensional subspace on which the projected variances are exactly equal
while the magenta line corresponds to a local minimum of the objective func-
tion. e right panel shows the value of the objective function over all possible
one-dimensional subspaces, parameterised by the angle αto the horizontal axis.
perspective, and previous publications [Kiraly et al. (), Kiraly et al. ()] used
this terminology extensively. In this chapter, we attempt a different presentation
that is accessible by a wider audience.
e remainder of this chapter is organised as follows. In Section ., we for-
mulate the objective of SSA in terms of a system of polynomial equations. en,
in Section ., we introduce a representation for these polynomials: the coefficient
vector space. In Section ., we show how we can find an approximate solution to
our set of equations by linear algebra in the coefficient vector space. e connec-
tion to algebraic geometry is briefly described in Section .. In the last Section,
., we summarise our findings and discuss the relative merits of the algebraic algo-
rithm and the prerequisites for using the algebraic approach in other circumstances
in machine learning.
4.1 From moments to polynomials
In this section, we start by reformulating the objective of the SSA algorithm in terms
of polynomials. Until Section ., we will assume that these polynomials are exact,
in the sense that they have been derived from the exact (true) epoch moments, and
not estimates thereof. is is of course never true in practice, where moments are
estimated from finite and noisy samples. However, this assumption allows for an
easy introduction of the main concepts before we turn to an approximate treatment.
According to the SSA mixture model, an optimal s-projection Bscan be defined
as being dual to the non-stationary subspace. Alternatively, we can define the op-
timal stationary projection in terms of the moments in the stationary coordinates
(under the assumption that there are spurious stationary directions). at is, an
4.1. FROM MOMENTS TO POLYNOMIALS 53
optimal s-projection fulfills the equations,
Bsµi=Bsµj
BsΣi(Bs)⊤=BsΣj(Bs)⊤∀1≤i=j≤n.
is set consists of n(n−1) equations between all pairs of epochs, which are re-
dundant because of transitivity; equivalently, we can compare each projected pair
of moments against one reference epoch. As in the optimisation-based algorithm,
we chose the average over all epoch as the reference, which has mean ¯µand co-
variance matrix ¯
Σ. us we characterise the set of solutions by a number of matrix
equations that is linear in the number of epochs,
Bsµi=Bs¯µ
BsΣi(Bs)⊤=Bs¯
Σ(Bs)⊤∀1≤i≤n.
Note that this set of equations is still not independent, because the equality of n−1
projected moments with the average epoch implies the equality of all. As we have
seen before, if we centre the mean vectors such that ¯µ= 0 (translating all the data by
the same vector does not change solution) and choose a basis in which the average
covariance matrix is the identity, ¯
Σ = I, (whitening), we arrive at a set of 2(n−1)
homogeneous matrix equations which equivalently describe the set of solutions,
Bsµi= 0
Bs[Σi−I] (Bs)⊤= 0 ∀1≤i≤(n−1),(.)
where Bsis a matrix with linearly independent rows.
e optimal stationary projection Bsis of course only unique up to arbitrary
linear transformations of its image, which leaves the stationary nature of the esti-
mated sources unchanged. In the following, we therefore state the goal as finding a
basis for the space of s-projections.
Definition e d-dimensional linear subspace of all s-projections S⊂
R
Dis the
dual space of the true non-stationary subspace, or in other words, the row space of
an optimal stationary projection Bs.
In the next step, we translate the system of Equations . into a set of polyno-
mials. is establishes the le hand sides of these equations as objects in their own
right, which are amenable to algorithmic algebraic manipulations.
Definition Amonomial min Dvariables T1, . . . , TDis a formal product,
m=Tα1
1Tα2
2···TαD
D,
which we abbreviate as m=Tα, where α∈
N
D
0is the vector of exponents. A
monomial is therefore uniquely identified by its exponent vector. e degree of a
54 CHAPTER 4. AN ALGEBRAIC APPROACH
monomial deg(m)is the sum of its exponents,
deg(Tα) =
D
∑
i=1
αi.
Apolynomial pin Dvariables over
C
is a sum of finitely many monomials,
p=∑
α∈
N
D
0
aαTα,
where aα∈
C
is the coefficient of the monomial Tα, and only finitely many of the
coefficients are nonzero. A polynomial is homogeneous if all monomials with non-
zero coefficients have the same degree. e evaluation of a polynomial pon a vector
of values w∈
C
Dis a number in
C
written as p(w)and is the result of replacing all
formal variables by values,
p(w) = ∑
α∈
N
D
aαwα1
1···wαD
D∈
C
.
Using this definition, we write the le hand sides of the Equations . as poly-
nomials, where the variables T1, . . . , TDcorrespond to the coordinates in the input
vector space and the coefficients are given by the elements of the mean vectors and
covariance matrices. us we obtain a set of linear polynomials ℓ1, . . . , ℓn−1and
quadratic polynomials q1, . . . , qn−1, using formal vector notation as,
ℓi=[T1···TD]µi
qi=[T1···TD]Σi
T1
.
.
.
TD
∀1≤i≤(n−1).
Note that these polynomials are homogeneous of degree one and two, respectively.
Let us now turn to its set of solutions. In the language of algebraic geometry, the
set of solutions to a set of polynomials is their vanishing set.
Definition e vanishing set Vof a set of polynomials p1, . . . , pmin Dvariables
is their common set of solutions in
C
D,
V(p1, . . . , pk) = {x∈
C
D|p1(x) = ···=pm(x) = 0}.
Whereas by definition Sis a subset of V(ℓ1, . . . , ℓn−1, q1, . . . , qn−1), the van-
ishing set need not be to the space of s-projections S. Not only is the null vec-
tor an element of the vanishing set (but not an s-projection), but by translating
4.2. THE VECTOR SPACE OF POLYNOMIALS 55
the matrix Equations . into polynomials, we have implicitly dropped the con-
straint that the s-projections must be elements of a d-dimensional linear space. e
goal of our algorithm is to find the d-dimensional linear subspace contained in
V(ℓ1, . . . , ℓn−1, q1, . . . , qn−1). Whether this subspace is unique depends on the ob-
served input polynomials ℓ1, . . . , ℓn−1, q1, . . . , qn−1. is question is addressed in
the next section.
4.2 The vector space of polynomials
e representation in which we compute with polynomials is the vector space of their
coefficients. For example, a homogeneous quadratic polynomial pin two variables,
p=α11T2
1+α12T1T2+α22T2
2,
is represented by its coefficient vector, p =[α11 α12 α22]⊤∈
C
3, where the
coordinates correspond to the monomials of degree two in two variables, T2
1,T1T2
and T2
2. Figure . illustrates how the equality of two projected covariance matri-
ces is translated into a polynomial equations that is embedded in coefficient vector
space.
Figure .: e le panel shows the contours of two covariance matrices along
with the direction von which their projections are equal. e middle panel shows
that vis the solution to a quadratic equation, and the right panel illustrates how
this equation becomes a vector in the coefficient space C2of quadratic homoge-
neous polynomials.
By representing polynomials in terms of coefficient vectors in a finite dimen-
sional vector space we lose structure. e coefficient vector space is not isomorphic
to the polynomial ring: whereas the addition of polynomials translates into the ad-
dition of coefficient vectors, this is not true for the multiplication of polynomials. In
particular, the product of two polynomials of degree at least one is not embeddable
in the same coefficient vector space, because it increases the degree of the monomi-
als. However, the coefficient vector space is a suitable representation for our algo-
rithm, as we will see later. In particular, the coefficient vector space comes with a
56 CHAPTER 4. AN ALGEBRAIC APPROACH
natural distance measure between polynomials, which allows us to treat uncertainty
in the polynomial coefficients using familiar tools from machine learning.
e representation of a polynomial in the coefficient vector space of degree k
depends on an arbitrary ordering of all monomials of degree k. For computational
reasons, we adopt the lexicographical ordering, because it lends itself to fast enu-
meration and selection of certain monomials.
Definition e coefficient vector space of homogeneous polynomials in Dvari-
ables of degree kis a
C
-vector space denoted by Ck. A polynomial pof degree
kis represented in terms of its coefficient vector p ∈Ckwhere the monomials
are ordered lexicographically. In this ordering, monomial M1=Tα1
1···TαD
Dap-
pears aer M2=Tβ1
1···TβD
D, denoted by M1≻M2, if there exists a variable Tl
such that αl> βland for all variables with smaller index T1, . . . , Tl−1it holds that
αi=βi.
For example, the lexicographical ordering of the monomials in three variables
of degree two is given by,
T2
1≻T1T2≻T1T3≻T2
2≻T2T3≻T2
3.
Let us now have a closer look at the coefficient vector space. e dimension of
Ckis equal to the number of monomials in Dvariables of degree k. is is equal to
the number of ways that we can distribute the exponents among the Dvariables.
Proposition e dimension of the coefficient vector space Ckis given by,
dim Ck= ∆(k, D) = (k+D−1
k).
Proof. A monomial of degree kin Dvariables is product of kvariables chosen with
repetition from the set {T1, . . . , TD}. e number of different monomials is there-
fore equivalent to the number of multisets of cardinality kchosen from a set of car-
dinality D.
An important sub-vectorspace
e algebraic algorithm hinges on the fact that our input polynomials are elements
of a certain sub-vectorspace of the coefficient vector space (in the exact, noise-free,
case). For the linear polynomials derived from the mean vectors this is clear: by def-
inition,
ℓ1,...,
ℓn−1are dual to Sand therefore elements of a (D−d)-dimensional
linear subspace. In the following, we see that there is such a linear subspace in the
corresponding coefficient vector space for polynomials of arbitrary degree.
Definition e set of coefficient vectors corresponding to the polynomials of de-
gree kvanishing on Sis denoted by CS
k.
4.2. THE VECTOR SPACE OF POLYNOMIALS 57
For degrees k > 1, the number of dimensions of the coefficient vector space is
not immediately obvious but straightforward to calculate, as the next result shows.
Proposition e set of coefficient vectors CS
k⊂Ckof homogeneous polynomials
of degree kvanishing on Sis a linear subspace of Ckwith dimension,
dim CS
k= ∆(k, D)−∆(k, d).
Proof. CS
kis a linear subspace of Ckbecause it is closed under linear combination:
any linear combination of polynomials in CS
kis of degree kand also vanishes on S,
hence it is also an element of CS
k. In order to get its dimension we do a linear change
of the polynomials variables as follows. Let b1, . . . , bd∈
R
Dbe an orthonormal ba-
sis for Sand let bd+1, . . . , bDbe an orthonormal basis for its complement in
R
D. We
now consider polynomials in this basis, i.e. we write them in terms of new variables
T′
1, . . . , T′
Dwhere each variable T′
icorresponds to the coordinate along bi. en
for any polynomial p ∈CS
k, the coefficients of all monomials that contain only the
variables T′
1, . . . , T′
dare zero, because pvanishes on S, and this is a sufficient con-
dition for membership in CS
k. Hence the dimension of CS
kis the total number of
monomials ∆(k, D)minus the ∆(k, d)monomials that are fixed to zero.
4.2.1 Generic polynomials and identiability
In the previous sections, we formulated the SSA problem in terms of polynomials;
introduced the coefficient vector space, and showed that the (exact, noise-free) input
polynomials are elements of certain subvectorspaces. In order to derive theoretical
results, we need to assume a generative model for the input polynomials. In essence,
this will allow us to exclude pathological cases in our arguments.
AccordingtotheSSAmodel, weassumethattheinput polynomialsℓ1, . . . , ℓn−1,
q1, . . . , qn−1do in fact vanish on the space of s-projections S. Apart from that, the
polynomials should be generic, in the sense that they have no special property. For
a rigorous definition, we refer to the supplemental material of [Kiraly et al. ()].
Intuitively, we can think of a generic polynomial as a polynomial-valued random
variable that has no “algebraic property” with probability one, apart from those
properties that are implied by the assumption that it vanishes on S. Algebraic prop-
erties of a polynomial are those that can be expressed in terms of polynomial equa-
tions in the coefficients of the polynomial. As a concrete example, any random vari-
able that has a positive continuous probability density over the coefficient vector
space CS
kyields a generic polynomial in our context.
Using the concept of generic polynomials, we can investigate under which con-
dition the true space of s-projections Scan be uniquely identified, given the input
polynomials. is means: under which condition is Sthe only d-dimensional linear
subspace in the vanishing set V(ℓ1, . . . , ℓn−1, q1, . . . , qn−1)?
58 CHAPTER 4. AN ALGEBRAIC APPROACH
Proposition Let ℓ1, . . . , ℓn−1and q1, . . . , qn−1be generic linear and quadratic
polynomials derived from nepochs which vanish on the d-dimensional linear sub-
space of s-projections S.
en Sis the the unique d-dimensional linear subspace in the vanishing set
V(ℓ1, . . . , ℓn−1, q1, . . . , qn−1)if the number of epochs is at least,
n−1≥D−d+ 1
2.
e proof can be found in [Kiraly et al. ()]. Intuitively, one can think of
each of the 2(n−1) generic polynomials as reducing the degrees of freedom by
one, until Sremains as the only d-dimensional linear subspace.
An important consequence of the genericity assumption is that the coefficient
vectors in CS
1and CS
2corresponding to the observed polynomials are each linearly
independent with probability one.
Proposition e coefficient vectors corresponding to the generic homogeneous
polynomials p1, . . . , pmof degree kwith m≤dim CS
kare linearly independent
with probability one.
Proof. If p1, . . . , pmare linearly dependent then there exists a vector pisuch that
pi∈span{pj}j=i. However, in order for this to be the case, at least one equation
between the coefficients of piand the other vectors need to be fulfilled, which holds
with probability zero.
4.2.2 Generating polynomials of higher degree
In the last section, we have shown that under genericity assumptions, our (exact,
noise-free) linear and quadratic input polynomials are linearly independent ele-
ments of the subspaces CS
1and CS
2in coefficient vector space respectively. However,
they may not provide us with a basis (which is essential for the algorithm to work),
as the number of dimensions might be larger than the number of observed poly-
nomials. In fact, this is likely to be the case in practice. As we had seen before, the
number of dimensions of the coefficient vector space grows rapidly with the number
of dimensions of the data space; recall that
dim CS
2= ∆(2, D)−∆(2, d) = 0.5[D(D+ 1) −d(d+ 1)].
Even for a moderate number of input dimensions, one would thus need a large num-
ber of epochs n≥dim CS
2+1 in order for the derived polynomials to span CS
2(see
Figure .). is is the problem addressed in this section.
For reasons that will become clear later, we focus on the quadratic input poly-
nomials and show how we can generate a larger set of polynomials of higher degree
4.2. THE VECTOR SPACE OF POLYNOMIALS 59
0 5 10 15 20 25 30
0
50
100
150
200
250
300
350
400
450
500
Number of stationary directions
Dimension of coefficient vector space
Num. of dimensions D=10
Num. of dimensions D=20
Num. of dimensions D=30
Figure .: Growth of the number of dimensions of the vector space CS
2of
quadratic homogeneous polynomials vanishing on S(vertical axis) for different
numbers of input dimensions D(blue, red, green curve) and number of station-
ary directions d(horizontal axis). e number of dimensions of CS
2is quadratic
in D.
k≥2which (a) is guaranteed to span CS
kand (b) has the same vanishing set as the
input.
Recall that a vector x∈
C
Dis an element of the vanishing set V(q1, . . . , qn−1)
if and only if xis a solution to the system of polynomial equations,
q1(x) = 0
.
.
.
qn−1(x) = 0.
Since these equations are homogeneous quadratic, multiplying each with all non-
zero monomials Mof a fixed degree k′does not change the set of solution, i.e. the
extended system,
Mq1(x) = 0
.
.
.
Mqn−1(x) = 0 ∀M=Tα1
1···TαD
Dwith
D
∑
i=1
αi=k′,
has the same vanishing set as V(q1, . . . , qn−1). Moreover, every choice of monomial
Myields a distinct set {Mq1, . . . , Mqn−1}of polynomials of degree deg M+ 2.
us if we multiply by all monomials of a certain degree, we obtain a larger set of
polynomials which is equivalent to the input in terms of its vanishing set.
60 CHAPTER 4. AN ALGEBRAIC APPROACH
Proposition Let q1, . . . , qn−1begenerichomogeneousquadraticpolynomialsvan-
ishing on the set of s-projections S. e set of polynomials obtained by multiplica-
tion with all monomials of fixed degree,
Pk={Mq1, . . . , Mqn−1|Mis a monomial of degree k−2},
contains (n−1)∆(k−2, D) = |Pk|distinct polynomials of degree k.
2 3 4 5 6
10
100
1000
10000
Degree of polynomials
Rank of polynomials
Dim. of coeff. vector space
Num. of polynomials
dim CS
4>|P4|>rank P4
|P5|>dim CS
5>rank P5
rank P6=dimCS
6
Figure .: Growth of the rank of the set of polynomials Pk(red curve) in coeffi-
cient vector space, the number of polynomials |Pk|(green curve), and the dimen-
sions of the coefficient vector space CS
k(blue curve) with the degree k(horizontal
axis), for 11 generic polynomials in d= 10 dimensions, of which one is station-
ary. For degrees smaller than three, dim CS
kis much larger than the number of
polynomials; P5contains more polynomials than dim CS
5, but the rank is smaller.
Here we need to multiply up to k= 6 to ensure that P6spans CS
6.
As the number of polynomials in Pkgrows with the degree k, so does the num-
ber of dimensions of the vector space CS
k. e number of polynomials |Pk|grows
faster than the number of dimensions of the corresponding coefficient vector space,
lim
k→∞ |Pk|
dim CS
k
= lim
k→∞
(n−1)∆(k−2, D)
∆(k, D)−∆(k, d)>1.
However, the obtained set of polynomials is in general not linearly independent, i.e.
rank Pk<|Pk|. Figure . shows experimental results on simulated data. While it
can be shown that there always exists a ksuch that the coefficient vectors of Pkspan
CS
k[Kiraly et al. ()], in practice it is very important to know what that number
is, because the computational complexity of the overall algorithm is dominated by
the number of dimensions of the coefficient vector space. us the central question
is: how does the rank of a set of generic polynomials grow with the degree k′of
monomials by which we multiply?
4.2. THE VECTOR SPACE OF POLYNOMIALS 61
is question is related to Fröberg’s conjecture [Froeberg and Hollman (),
Froeberg ()] from algebraic geometry. If it holds, then we have an exact condi-
tion on the minimum necessary degree, given by the following result.
Conjecture (based on Fröberg’s conjecture) Let P2={q1, . . . , qn−1}be generic
homogeneous quadratic input polynomials vanishing on the linear subspace of s-
projection Swith n≥D+ 1 and let Pkbe derived from P2by multiplication with
all monomials of degree k−2,
Pk={Mq1, . . . , Mqn−1|Mis a monomial of degree k−2}.
en the smallest degree k∗≥2such that the coefficient vectors corresponding to
the polynomials Pk∗span the subspace CS
k∗in coefficient vector space is the index
of the first non-positive coefficient ak∗in the power series,
∞
∑
ℓ=1
aℓtℓ=∏n−1
i=1 (1 −t2)
(1 −t)D−1
(1 −t)d
For a proof see the supplemental material of [Kiraly et al. ()]. Since the
coefficient of this power series are straightforward to calculate, using this result we
can compute the minimum necessary degree at a negligible computational cost.
Algorithm Enlarge set of polynomials by increasing the degree until it spans the
corresponding vector space of polynomials.
: function MU(q1, . . . , qn−1)
: Initialise k←2to the degree of the input polynomials.
: while ak>0do ◃Coefficient of the power series in Conjecture
: Increment degree k←k+ 1.
: end while
: Multiply input polynomials by all monomials of degree k−2,
Pk← {mq1, . . . , mqn−1|mmonomial with deg m=k−2}.
: return Pk
: end function
is result leads to Algorithm which computes for any generic input the poly-
nomials Pkspanning the corresponding coefficient vector space CS
k. e computa-
tional cost is negligible, since enumerating all monomials of a certain degree (Line )
is linear in the number of all monomials.
Note that naïve approaches for generating the set Pkare infeasible in almost all
practically relevant cases. First of all, verifying whether a set of vectors are linearly
62 CHAPTER 4. AN ALGEBRAIC APPROACH
independent entails applying a rank-revealing algorithm (e.g. Gaussian elimination)
which is at least quadratic in the number of dimensions; and coefficientvector spaces
are notoriously high-dimensional. Secondly, simply generating a “very large num-
ber” is of course not guaranteed to work in general but also leads to exceedingly
high dimensional coefficient vector spaces, in which subsequent steps of the alge-
braic algorithm need to work. As some of these steps are quadratic in the number
of dimensions, it is important to keep the degree as small as possible.
4.3 An approximate algebraic algorithm
In the previous sections we have laid the groundwork for our algorithm: we refor-
mulated the SSA problem in terms of a set of exact polynomial equations which are
represented as vectors in the coefficient vector space. Under genericity assumptions,
we have seen that the (exact, noise-free) coefficient vectors are linearly independent
elements of subspace of known dimensions (but unknown basis). en we showed
how we can generate a larger set of polynomials Pkof higher degree ksuch that its
coefficient vectors are guaranteed to span the desired subspace in coefficient vector
space.
So far, we have assumed that the input polynomials are known exactly, in the
sense that they were derived from the true epoch mean and covariance. In practice,
where cumulants are estimated from finite and noisy data, this is not the case. is
means that there exists no exact solution; a true s-projection matrix Bswill not
make the projected moments exactly equal. Hence the vanishing set of the input
polynomials does notcontain the true set of s-projections S, but only the null vector.
We therefore need to find an approximate solution.
It is here the vector space view on polynomials comes in handy, because it comes
with a natural geometric interpretation which includes a distance between polyno-
mials. As we have seen before, all quadratic homogeneous polynomials that vanish
exactly on Slie on a linear subspace CS
2of the coefficient vector space. In the ap-
proximate case, we will think of our input as lying approximately on CS
2.
e overall strategy of our algorithm is to algebraically manipulate our set of
polynomials CS
kuntil we obtain linear polynomials that have approximately the
same vanishing set. From linear polynomials, we can readily obtain the space of
s-projections Sas their approximate d-dimensional dual. is procedure can be
seen as solving the SSA problem algebraically. We proceed step-wise: starting at the
minimal necessary degree (determined by Proposition ), with each step we reduce
the degree of our set of polynomials by one.
Approximate division by a single variable
e fundamental idea behind the step-wise reduction of degree can be loosely de-
scribed as “approximate division of a set of polynomials by a single variable”. at is,
4.3. AN APPROXIMATE ALGEBRAIC ALGORITHM 63
we go from a set of polynomial Pkof degree kto set of polynomials Pk−1of degree
k−1that has approximately the same vanishing set as Pk.
Quadratic input polynomials
Quadratic homogeneous
polynomials vanishing on all
stationary projections
(linear subspace)
Quadratic homogeneous
polynomials divisible by
(linear subspace)
Approximate
intersection
Figure .: Illustration of the approximate division of a set of polynomials by a
single variable T1. e planes illustrate the true sub-vectorspaces, not our estimate
from the input polynomials.
e choice of variable by which we divide is arbitrary; by our genericity assump-
tion any variable Tiis not a basis element of S. In the following we fix it to T1. As we
have seen before, the polynomials in Pklie approximately on CS
k⊂Ck. Since we
have applied Algorithm , we can also assume that they are an approximate basis.
Let us now consider another subspace of Ck, namely the polynomials of degree k
that are divisible by T1. ey also form a linear subspace spanned by all monomials
that are divisible by T1. Let us denote this subspace by T1·Ck−1. What this abuse
of notation indicates is that T1·Ck−1is in fact isomorphic to the coefficient vector
space Ck−1of lower degree. Figure . shows a cartoon illustrating the relationship
between these two linear subspaces in the quadratic case.
Approximate division of Pkby T1consists of two steps. In the first step, we com-
pute an approximate basis for the intersection of CS
kand T1·Ck−1. More specifically,
we compute a basis for the approximate intersection of CS
k(for which we only have
an approximate basis in terms of the input polynomials) and T1·Ck−1whose basis
is known exactly by definition. We find the elements of this basis as linear combi-
nations of the polynomials in Pkby minimizing the L2-distance to T1·Ck−1using
the singular value decomposition. e elements of this basis are polynomials that
are both divisible by T1and linear combinations of our input, i.e. vanish approx-
imately on S. Let us denote the set of these polynomials by P′
k−1, and note that
they are still of degree k. In Figure ., P′
k−1is a basis for the one-dimensional ap-
proximate intersection shown as the orange line. In the next step, we remove from
each polynomial in P′
k−1all terms that are not divisible by T1. Finally, by dividing
each remaining monomial by T1, we get a set of polynomials of degree k−1called
64 CHAPTER 4. AN ALGEBRAIC APPROACH
Pk−1, whose elements lie approximately on CS
k−1. is concludes the approximate
division of Pkby T1.
Note that the choice of variable by which we divide is arbitrary. In fact, we can
divide by T1in an arbitrary basis to obtain a set of polynomials of lower degree
that lie approximately on CS
k−1. We exploit this property in the final algorithm to
generatealarger set of suchpolynomials, which we thenreduce to abasis byapplying
PCA, in order to increase the accuracy on noisy data.
The complete algorithm, explained
C3
CS
3
C2
CS
2
isomorphic
RD
S⊥
divisible
by T1
divisible
by T1
input polynomial
Multiplication of all
input polynomials by
all variables T1,...,T
D
(Algorithm 1).
Approximate division
by a variable.
Approximate division
by a variable.
isomorphic
Figure .: Illustration of the algorithm. e quadratic input polynomials are
embedded in C3(by multiplication with all variables) and lie approximately on
CS
3(red dots in the top panel). e intersection of CS
3with the subspace of poly-
nomials divisible by T1(orange line) is isomorphic to CS
2in the vector space of
homogeneous quadratic polynomials C2. By dividing out T1again (blue line) we
arrive at a basis for the linear polynomials vanishing on S, which is a basis for the
orthogonal complement S⊥.
Let us now step through the complete Algorithm . roughout the text we refer
to line numbers in that algorithm. Figure . illustrates the steps of the algorithm,
4.3. AN APPROXIMATE ALGEBRAIC ALGORITHM 65
Algorithm Compute an approximate basis for the d-dimensional space of s-
projections Sgiven generic homogeneous polynomials corresponding to epoch
mean and covariance matrices respectively.
: function ASSA(ℓ1, . . . , ℓn−1, q1, . . . , qn−1, d)
: Compute set of polynomials Pk←MU(q1, . . . , qn−1)
: Let π←(1 ···D)be the cyclic permutation of variables.
: while k > 1do
: Initialise Pk−1← {}
: for i= 1, . . . , D do
: Permute polynomial variables Pπ
k← {πi(p)|p∈ Pk}
: Compute a linearly independent set of polynomials P′
k−1in the span
of the coefficient vectors in Pπ
k,
P′
k−1⊂span Pπ
k,
of size |P′
k−1|= dim CS
k−1that is closest to the subspace of polyno-
mials divisible by T1in terms of the L2-distance of coefficient vectors
(SVD).
: Remove from each polynomial in P′
k−1all monomials not divisible
by T1(i.e. set the corresponding coefficients to zero).
: Divide by T1, invert the permutation of variables, and add to the set
of polynomials of lower degree,
Pk−1← Pk−1∪{π−i(p/T1)p∈ P′
k−1}.
: end for
: Reduce Pk−1to an approximate basis of CS
k−1using PCA on the co-
efficient vectors.
: Let k←k−1.
: end while
: Compute an approximate basis b1, . . . , bD−dfor the orthogonal com-
plement S⊥from the coefficient vectors of P1∪{ℓ1, . . . , ℓn−1}using
PCA.
: return b1, . . . , bD−d
: end function
and the isomorphisms across coefficient vector spaces associated with different de-
grees.
e main part of the algorithm deals with the quadratic input polynomials; the
linear polynomials are added at the end. In the first step, we compute a new set of
polynomials CS
kof possibly larger degree k≥2from the input, which is guaranteed
to approximately span CS
k(Line ). is corresponds to the arrow going from C2to
66 CHAPTER 4. AN ALGEBRAIC APPROACH
C3in Figure ..
e loop starting at Line is to descend the degree from kdown to one (linear
polynomials). In the nested loop beginning at Line , we repeated the approximate
division for different permutations of variables to increase the accuracy of our esti-
mate for the basis of CS
k−1. e approximative division by T1(under the permuta-
tion of variables) is implemented in Lines to .
In Line , we compute the approximate intersection with the set of polynomi-
als divisible by T1. is is a quadratic optimisation problem which can be solved
by a singular value decomposition (SVD): let Qbe a matrix defined as the row-
wise concatenation of all coefficient vectors in Pk, where we remove the columns
corresponding to monomials not divisible by T1. Using the SVD, we compute a
basis v1, . . . , vm∈
R
|Pk|for the approximate le kernel of Qof dimension m=
dim CS
k−1. e elements of this basis provide us with coefficients for mlinear inde-
pendent linear combinations of polynomials in Pksuch that the sum of the squared
coefficients of monomials not divisible by T1are minimised; this is the set P′
k−1. In
Lines and this is then simply converted into polynomials of degree k−1.
In each step in Line we approximate the basis for the polynomials of lower de-
gree CS
k−1using PCA (Line ). e quality of this estimate increases with the num-
ber of available vectors that lie approximately on CS
k−1. In order to obtain more such
coefficient vectors, we repeatedly divide out the variable T1, but each time under a
different permutation of the polynomial variables T1, . . . , TD. is corresponds to
a basis change in the input space, of which the permutation of variables is compu-
tationally attractive when rearranging and ordering polynomial coefficients. Each
permutation of variables yields different sets of polynomials, which we combine to
anapproximatebasis of CS
k−1by firstinverting thepermutationofvariablesandthen
performing PCA to reduce the number of basis elements to the correct dimension
dim CS
k−1(Line ). Finally, in the last step from quadratic to linear polynomials
we include the linear polynomials ℓ1, . . . , ℓn−1obtained from the mean vectors.
Computational complexity
e computational complexity of Algorithm is determined by the singular value
decomposition (Line ) of the coefficient matrix of the highest (initial) degree k,
which consists of the row-wise concatenation of the coefficient vectors of the initial
set of polynomials Pπ
k.
e worst-case complexity of computing the full singular value decomposition
for a general r×cmatrix is O(min{rc2, r2c}). e number of columns cof the co-
efficient matrix is equal to the number of dimensions of the coefficient vector space
Ckof homogeneous polynomials in D−1variables,
c= ∆(k, D −1),
4.3. AN APPROXIMATE ALGEBRAIC ALGORITHM 67
because we include only those monomials that are not divisible by T1. is can be
written as a polynomial function in the degree kandin the number of input variables
D,
c(k) = 1
(D−2)!(k+ 1) ···(k+D−2),
c(D) = 1
k!(D−1) ···(D−2 + k),
respectively. e number of columns cgrows like kD−3in the degree kand like
Dk−1in the number of input variables D. e number of rows rof the coefficient
matrix is equal to the number of polynomials in the set Pπ
k,
r=|Pπ
k|= (n−1)∆(k−2, D).
As for the number of columns, the number of rows rcan be written as a polynomial
in kand Dwith leading terms kD−2and Dk−3respectively.
9 11 13 15 17 19 21 23 25 27 29 31 33 35
0.5
1
5
10
25
Number of epochs
Runtime (seconds)
Degree 5 to
degree 4
Degree 4 to
degree 3
Degree 3 to
degree 2
1 2 3 4 5 6 7
0.1
1
5
10
25
Number of stationary directions
Runtime (seconds)
n=9 epochs
n=10 epochs
n=12 epochs
Figure .: e le panel shows the runtime (vertical axis, log scale) of Algo-
rithm for D= 8 dimensions and d= 4 stationary directions for various num-
bers of epochs (horizontal axis). Overall, the runtime decreases when the num-
ber of epochs grows since the necessary degree becomes smaller, and hence the
number of dimensions of the coefficient vector space. e right panel shows the
runtime (log scale) of the algorithm for three different numbers of epochs over
varying numbers of stationary directions.
For fixed number of input dimensions D, the computational complexity de-
pends on the necessary initial degree k, which depends on the number of polyno-
mials n−1and the number of stationary directions d(see Conjecture ). e fewer
polynomials available, the higher the necessary degree. is is illustrated in the le
panel of Figure .. As the number of polynomials grows (horizontal axis), we ob-
serve characteristic drops (phase transitions) in the runtime which correspond to
decreases in the necessary degree kthat translates into fewer number of columns c
of the coefficient matrix. For a constant necessary degree k(i.e. between the jumps),
68 CHAPTER 4. AN ALGEBRAIC APPROACH
we see that the runtime grows with the number of polynomials, because it increases
the number of rows r. Moreover, we observe that for D= 8 input dimensions and
d= 4 stationary directions, n= 26 generic epochs are sufficient to span the whole
subspace CS
2of the coefficient vector space.
e right panel of Figure . shows that the runtime also decreases with the
number of stationary directions. at is because the number of dimensions of the
linear subspace of polynomials CS
kvanishing on Sdecreases, which in turn leads to
a lower necessary degree.
4.4 Relationship to algebraic geometry
So far, we have largely eschewed concepts and terminology from algebraic geom-
etry in order to make the presentation accessible by the wider machine learning
community. In this section, we briefly establish this connection. We do not attempt
an exhaustive or rigorous exposition (see e.g. [Cox et al. ()] for a good intro-
duction with a view to computation); the purpose is to point to the main concepts
and rephrase the task in the language of algebraic geometry.
Algorithm exploits the algebra-geometry connection: to find a basis for the
linear subspace of s-projections (a geometrical object), we operate on its represen-
tations in terms of polynomials (elements of an algebra, the ring of polynomials).
Assuming that the quadratic input polynomials q1, . . . , qn−1are known exactly, this
corresponds to a classical problem in computational algebraic geometry: computing
generators for the radical of an ideal.
An ideal is a set of polynomials in the ring
C
[T1, . . . , TD]that correspond to
a vanishing set in
C
D. In Section . we have defined the vanishing set of a set of
polynomials. Moreover, we have observed that the space of s-projections Sis a lin-
ear subspace of the vanishing set V(q1, . . . , qn−1). Conversely, we can associate an
algebraic set V⊂
R
Dwith a set of polynomials, its ideal. e ideal of a vanishing
set is the set of all polynomials that vanish on it, denoted by I(V). A purely alge-
braic way to define an ideal is in terms of a set of so-called polynomial generators
f1, . . . , fm∈
C
[T1, . . . , TD]. If we think of the generators as polynomial equations
defining the vanishing set V(f1, . . . , fm), then the generated ideal corresponds to
the set of all polynomial consequences. ese we can derive from the generators
by multiplication with all elements of the polynomial ring and all summations of
such products. From this point of view, we can interpret Algorithm as generating
further elements of the ideal ⟨q1, . . . , qn−1⟩, up to a certain degree.
Since the space of s-projections Sis a linear space, its ideal is generated by lin-
ear polynomials, I(S) = ⟨b1, . . . , bD−d⟩. is means that every polynomial that
vanishes on Scan be written as a sum of polynomials, each having a linear factor
corresponding to an element of the orthogonal complement S⊥. If we knew the
linear generators b1, . . . , bD−d, then we could obtain a basis for S. However, since
4.5. SUMMARY AND DISCUSSION 69
our input consists of quadratic polynomials q1, . . . , qn−1∈I(S), it is not straight-
forward to extract this basis.
By assumption, the input polynomials are generic and n≥D+1, which implies
that the ideal ⟨q1, . . . , qn−1⟩contains all polynomials of degree at least two which
vanish on S. is is a unique subset of the ideal I(S). us, on the level of ideals,
the correspondence between the input and Sis one-to-one. In fact, I(S)is the so-
called radical ideal of the input ⟨q1, . . . , qn−1⟩, which is denoted by,
√⟨q1, . . . , qn−1⟩= I(S) = ⟨b1, . . . , bD−d⟩.
We can think of the radical of an ideal Ias the largest maximal ideal containing I.
Solving the SSA problem is therefore equivalent to computing a linear generating set
for the radical ideal of the input. is is a classical task in computational algebraic
geometry, for which the existence of algorithms has been known for a long time
[Hermann ()].
However, algorithms for radical computation that are suitable for implementa-
tion have only been developed in recent decades. e best known algorithms are
those of [Gianni et al. ()] (implemented in AXIOM and REDUCE); Macauly
provides [Eisenbud et al. ()]; [Caboara et al. ()] is implemented in CoCoa
along with [Krick and Logar ()], the modification by [Laplagne ()], avail-
able in SINGULAR.
All of these algorithms have two points in common. First of all, their compu-
tational worst case complexities are doubly exponential in the square of the num-
ber of variables [Laplagne ()]. Secondly, they can only be applied when the
coefficients of the polynomials are known exactly, which is never the case in prac-
tical applications of SSA where moments are estimated from data. It is for these
two reasons that we cannot solve the SSA problem using off-the-shelf general al-
gorithms for radical computation. Our algorithm addresses a new type of prob-
lem in the nascent field of approximate computational algebra [Corless et al. (),
Stetter (), Kreuzer et al. (), Heldt et al. ()].
4.5 Summary and discussion
In this chapter we have presented an algebraic algorithm for solving the SSA prob-
lem. Exploiting the algebra-geometry connection, we do not search the space of all
possible projections guided by an objective function, but compute an approximate
solution to a system of polynomial equations.
e algebraic approach has several advantages over optimisation-based tech-
niques. First of all, the algorithm has a unique solution (no local minima) which
can also be shown to be consistent, i.e. it converges to the true solution as the sam-
ple size grows. Moreover, it is straightforward to integrate cumulants of arbitrary
70 CHAPTER 4. AN ALGEBRAIC APPROACH
order, as they corresponds to polynomials of higher degree. Parallel to the com-
bination of mean (degree one) and covariance matrices (degree two), the highest
moment determines the initial coefficient vector space and the moments of lower
order are added during the step-wise reduction of the degree. In Algorithm , the
mean vectors are added in the last step (Line ), aer the quadratic polynomials
have been reduced to an approximate linear basis.
Choosing the number of stationary directions
10 35 120 220
0.01%
0.1%
1%
Sorted eigenvalues (PCA)
Percentage of variance
Num. of s−src. d=3
Num. of s−src. d=5
Num. of s−src. d=8
Figure .: PCA eigenvalue spectrum in coefficient vector space C3for three
synthetic data sets in ten dimensions with varying numbers of stationary sources.
e true number of stationary directions ddetermines the number of dimen-
sions of the subspace CS
kin coefficient vector space, on which the input polyno-
mials (multiplied up to the necessary degree k) lie approximately. By estimating
the effective number of dimensions of that subspace from the input, we can select
an appropriate number of stationary directions. To that end, we inspect the PCA
eigenvalue spectrum of the polynomials Pkin coefficient vector space Ck.
Figure . shows an example computed on synthetic data. For d= 5 stationary
directions in ten dimensions, we see that there exists a -dimensional subspace that
has significantly less variance than the rest of the data. For polynomials of degree
three in D= 10 dimensions this corresponds to five stationary directions, because a
five-dimensional stationary subspace induces a (-)-dimensional effective sub-
space in coefficient vector space,
dim CS
3= ∆(3,10) −∆(3,5) = 220 −35 = 185.
Similarly, the low-variance subspaces of 10 and 120 dimensions (blue and green
curve) correspond to stationary subspaces of three and eight dimensions respec-
4.5. SUMMARY AND DISCUSSION 71
tively. e advantage of this approach, relative to stationarity testing, is that it al-
lows us to inspect all possible number of dimensions simultaneously by computing a
single eigenvalue spectrum instead of running the optimisation procedure for each
candidate and then performing time-consuming resampling.
Alleviating the computational cost
e biggest drawback of the algebraic algorithm is the scaling of the computational
cost. e singular value decomposition (Line of Algorithm ) is cubic in the num-
ber of dimensions of the coefficient vector space — which grows rapidly with the
degree and the number of input dimensions. However, there are several directions
which could bring about a dramatic reduction of the computational complexity.
First of all, the matrix of coefficient vectors (corresponding to the set Pπ
kin Al-
gorithm ) is very sparse when it is large: while the number of dimensions of the
coefficient vector space grows quickly with the degree k, the number of nonzero co-
efficients (i.e. nonzero entries in each row of the matrix) stays the same. Moreover,
the matrix has a special structure which can be exploited. e multiplication of an
input polynomial with all monomials of a certain degree (Algorithm ) essentially
sends its coefficients to different places (columns), each corresponding to a row in
the matrix. Finally, for the purpose of approximate division by a variable, we do not
need to compute all singular values, but only the smallest ones. ere exist several
algorithms for iterative large-scale sparse structured singular value decomposition
[Kaltofen et al. (), Park et al. ()] which are yet to be explored in this con-
text.
Algebra beyond SSA?
e algebraic approach is not limited to stationary subspace analysis; our algorithm
solves a rather general problem: approximating the set of solutions to a system of
homogeneous polynomial equations of arbitrary degree by a linear space. More-
over, it would be straightforward to generalise this procedure to find approximate
“descriptions” of sets of solutions in terms of polynomials of higher degree or with
certain prescribed properties.
e main requirement for a machine learning problem to fit our algebraic ap-
proach is that its set of solutions needs to be describable in terms of a system of
approximate polynomial equations. is is an uncommon formulation, perhaps be-
cause so far there have been no efficient ways of solving such problems known to the
machine learning community. What is attractive about this framework is that poly-
nomials of arbitrary degree can be combined naturally (e.g. corresponding to certain
constraints on the solutions), without suddenly rendering the problem infeasible.
72 CHAPTER 4. AN ALGEBRAIC APPROACH
Chapter 5
Simulations and applications
5.1 Introduction
In this chapter, we evaluate the SSA algorithm in simulations and present applica-
tions to EEG data analysis.
Since SSA is a new unsupervised learning task, there is no established evaluation
strategy. To date, there exist no competing algorithms or benchmark data sets that
allow for a relative assessment. Moreover, since SSA is an unsupervised problem
there exists no natural quantitative performance measure that is inherently mean-
ingful as e.g. for classification problems. Whether a particular SSA solution is useful
is ultimately a qualitative decision, which depends on the application context. is
is reflected in our evaluation methodology, which consists of four different angles.
. Synthetic data. An artificially constructed ground truth allows us to quanti-
tatively measure the performance in terms of the deviation from the true so-
lution. We investigate the influence of key parameters, the relative efficacy of
the algebraic approach, and two methods for stationarity testing. e results
serve as a basic validation and provide insights for the domain of applicability
and parameter choice in practice.
. Comparison against PCA and ICA on real data. Even though PCA and ICA
solve a different problem than SSA, in certain situations they may yield a sim-
ilar results. For a large EEG data set, we establish that neither PCA nor ICA
yield a basis that is useful for understanding distribution changes. is un-
derlines the original contribution of the SSA approach.
. Real data with ground truth. In two situations from EEG analysis where an
expected results is induced by an experimental paradigm, we show that SSA
successfully identifies the corresponding neural components. e results are
evaluated qualitatively, in terms of the frequency content and the scalp maps,
and quantitively using the correlation with an experimental stimulus.
. Indirect evaluation in a classification setting. In the context of brain-computer-
interfacing, we demonstrate that in certain cases, SSA can remove harmful
74 CHAPTER 5. SIMULATIONS AND APPLICATIONS
non-stationary brain sources. e efficacy of this approach is reflected in
lower misclassification rates on the test set.
e remainder of this chapter is organised along these lines. In the next Sec-
tion ., we present the results on simulated data. e applications to EEG analysis
are contained in Secton .. In the last Section ., we discuss our findings and point
to future work motivated by the numerical results.
5.2 Simulations on synthetic data
Before we present results, we consider two main ingredients: (a) a method for gen-
erating synthetic data that provides us with a ground truth and allows us to vary
the parameters of interest, and (b) a measure of deviation between the true and the
found solution.
Data generation
e data is generated exactly according to the SSA model. at is, we directly sam-
ple the data for each epoch. In doing so, we assume that we have chosen the seg-
mentation into epochs optimally in the sense that we extract the maximum possible
amount of information from the data without any redundancy (e.g. overlaps intro-
duced by a sliding window). is is unrealistic; however, any other way of dividing
our synthetic data into epochs is equally arbitrary and introduces further parameters
which make the results harder to interpret.
Foreachepoch, werandomlysamplethemeanandthecovarianceof thesources,
µ1, . . . , µnand Σ1, . . . , Σn, such that the moments are identical along the first d
coordinates. e elements of the mixing matrix Aare drawn uniformly at random
from the interval [−0.5,0.5]. In all our simulations, we focus on the more difficult
case where there is no information in the mean, i.e. the mean of the non-stationary
sources is constant in each epoch. If we include information in the mean, the contri-
bution of the new SSA algorithm becomes somewhat unclear: the separation of sta-
tionary and non-stationary sources using differences in their mean can be achieved
analytically by a simple matrix inversion.
e marginal covariance of the stationary sources is fixed to the identity I. For
each epoch, the variance of the non-stationary sources is sampled uniformly over
the intervals (1, α)and (1/α, 1) with α > 1. at is, each non-stationary source is
at most αtimes larger or αtimes smaller than the stationary sources. e covariance
between the two sets of sources is chosen by drawing random mixing coefficients.
In order to investigate the more realistic scenario of non-gaussian data (heavier
tails, outliers), we sample Gaussian data X∼ N(0,1), exponentiate it and preserve
the sign: X′= sign(X)|X|k, where the parameter kcontrols the degree of the
5.2. SIMULATIONS ON SYNTHETIC DATA 75
non-gaussianity; we report Pearson’s kurtosis instead of k. e data is then linearly
transformed and translated to have the chosen epoch mean and covariance.
A performance measure for SSA
As we have seen in Chapter , only the true non-stationary subspace can be identi-
fied from the mixed signals. A natural performance measure for SSA is therefore the
deviation between the true and the found non-stationary subspace. e alternative,
a measure based on the level of stationarity of the estimated stationary sources, is
more difficult to define because even for an ideal separation, the estimated epoch
mean and covariance matrices will not be exactly equal. e expected deviation
from equality in the case of ideal separation depends on properties of the data, such
as the number of samplesor the heaviness ofthe tails (see Section .. for a more in-
depth discussion). Since these are precisely the parameters of interest in our simula-
tions, it is more convenient to adopt a performance measure based on the deviation
from the true subspace.
Following [Meinecke ()] we define a subspace error based on the principal
angles between two subspaces. Let Uand Vbe two linear subspaces of the vector
space Vwith dim U≤dim V. e first principal angle θ1is defined as the smallest
angle between any two vectors u∈Uand v∈V,
cos θ1= min {u⊤v
∥u∥∥v∥u∈U, v ∈V}=∠(u1, v1),
where the vectors u1and v1that minimise the angle are called the principal vectors.
e remaining principal angles θ2, . . . , θdim Uare found recursively in the orthog-
onal complement of the previous principal vectors,
cos θk+1 = min {u⊤v
∥u∥∥v∥u∈U, u ⊥u1, . . . , uk, v ∈V, v ⊥v1, . . . , vk}.
us the principal vectors in each subspace are mutually orthogonal. If a principal
vector θkis zero then Uand Vshare a common subspace and uk, vk∈U∩V.
Hence the dimension of the intersection of Uand Vis equivalent to the number
of zero principal angles. e principal angles and vectors can be found using the
singular value decomposition (SVD). Let the columns of the matrices U′and V′
be orthonormal bases for the spaces Uand Vrespectively. From the singular value
decomposition ˆ
UΣˆ
V⊤= (U′)⊤V′, we obtain the principal angles from the singular
values, cos θk= Σkk and the principal vectors are the columns of ˆ
Uand ˆ
V.
To quantify the difference between two k-dimensional subspaces spanned by the
columns of A, B ∈
R
d×k, we define the subspace error as the average sin2of the
76 CHAPTER 5. SIMULATIONS AND APPLICATIONS
principal angles θ1, . . . , θkbetween them,
E(A, B) = 1
k
k
∑
i=1
sin2(θi).
is error is zero when the true subspaces are identical and one if they are orthog-
onal to each other. e subspace error can be interpreted in terms of the sources.
It is the percentage of the signal power that a non-stationary sources loses if it is
projected from the true onto the estimated non-stationary sources. Conversely, for
the estimated stationary sources ˆss(t) = ˆ
Psx(t)this means that E(ˆ
An, An)is the
percentage of non-stationary signal power that they contain.
However, the maximum possible subspace error depends on the number of di-
mensions of the two subspaces in relation to the total number of dimensions, i.e.
on the number of dimensions in which they can possibly differ. For example, two
nine dimensional subspaces Uand Vin ten dimensions have a maximal subspace
error if there exists a one-dimensional subspace which is contained in Uand in V⊥
or vice versa. In this case, exactly one of the principal angles is 1
2πand all others
are zero, so that the subspace error is /. In general, two k-dimensional subspaces
can only differ by a d−kdimensional subspace, so that the maximum subspace
error is min {1,d−k
k}. is worst case is a very special case, in the sense that unless
we specifically construct it, any set of basis vectors is expected to have a lower sub-
space error. As a less pathological baseline, in each simulation we compare against
random projections [Rahimi and Recht ()].
19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1
50%
100%
Subspace error (% of rand. projection)
Number of non−stationary directions
Low non−stationarity (α=3)
Medium non−stationarity (α=5)
Strong non−stationarity (α=10)
Figure .: Performance for varying numbers of non-stationary directions (hor-
izontal axis) measured in terms of the subspace error relative to random projec-
tions. e total number of dimensions is fixed to .
5.2. SIMULATIONS ON SYNTHETIC DATA 77
5.2.1 Dimensions, sample size, epochs, kurtosis
In this section we present the results of our simulations. Unless otherwise stated,
eachplotshowsmediansubspace errorsoverrandomrealisationsofthedatasets,
with error bars extending from the to the quantile.
In the first set of experiments, we investigate the influence of the number of sta-
tionary dimensions. e total number of dimensions is set to , the number of
epochs is and in each epoch we observe samples with non-stationarity in
the covariance matrices of degree α= 3. Figure . shows the result. In order to
obtain comparable results across different dimensionality, we report the subspace
error relative to random projections. We see that in terms of the subspace error,
larger non-stationary subspaces are more difficult to find. is is not only a nu-
merical effect: as we have kept the number of distinct epochs constant, the relative
amount of available information (i.e. observed variation in the distribution of the
non-stationary sources) is smaller for the larger subspaces, which makes them more
difficult to identify.
12 25 50 100 500 1000
0.1
0.2
0.3
0.4
Subspace error
Number of samples per epoch (log)
Low non−stationarity (α=3)
Medium non−stationarity (α=10)
Strong non−stationarity (α=50)
Random projection
Figure .: Influence of the epoch sample size (horizontal axis) on the perfor-
mance of SSA, measured in terms of the subspace error (vertical axis).
In the next step, we analyse how the error of SSA depends on the number of
available samples in each epoch. e setup is somewhat artificial: in practice, we
divide a fixed number of samples into epochs; we can usually only vary the num-
ber of samples in each epoch by changing the overall segmentation, which changes
the true distribution of the data in each epoch. However, in order to isolate the ef-
fect of small sample errors, we vary the number of samples available in each epoch
from a fixed epoch source distribution. e total number of dimensions is and
the number of stationary directions is . Figure . shows the results. We see that
even for epochs as small as eleven samples, SSA can still find a projection that is
significantly better than a random projection. is suggests that small sample er-
78 CHAPTER 5. SIMULATIONS AND APPLICATIONS
rors average out across epochs to some extent, an effect that we also observe in the
next set of simulations. Moreover, we see that the effect of estimation errors in the
epoch moments is relative to the strength of the non-stationarity. is makes sense:
a strong change in variance is less easily masked by small sample fluctuations than
a weaker non-stationarity.
2 5 10 20 50 100
0.1
0.2
0.3
0.4
0.5
Subspace error
Number of epochs (log)
Low non−stationarity (!=3)
Medium non−stationarity (!=5)
Strong non−stationarity (!=10)
Random projection
Spurious stationary
directions
Small sample errors
Figure .: Influence of the number of epochs (horizontal axis) with distinct
moments on the performance of SSA, measured in terms of the subspace error
(vertical axis). e total number of samples is fixed to .
Now we consider a different scenario: the total number of samples is fixed to
and we vary the number of epochs with random source covariance matrix (five
stationary and five non-stationary directions). is means that for two epochs, their
sample size will be each, whereas for the largest number of epochs (), each
contains only samples. In other words, we trade small sample errors for the num-
ber of distinct distributions, or accuracy of information for amount of information.
While this is an artificial scenario (we can never vary the amount of information in a
dataset), it is directly related to an important practical problem, namely the design of
the epoch structure. One usually faces the dilemma of wanting a fine-grained epoch
setup, to capture as much variation as possible and increase the temporal resolution,
while controlling the detrimental effect of small sample estimates.
e results shown in Figure . convey a positive message: only as the epochs
become extremely small ( samples in dimensions) does the performance de-
teriorate markedly. As we have seen before, small sample errors average out across
epochs, andat thesametime, SSAbenefitsfromfurthervariation inthenon-stationary
space. In the opposite case, few large epochs, we observe the effect of spurious sta-
tionary directions; the subspace error reaches the plateau exactly at the number of
epochs given by our theoretical result (see Section .).
All simulations presented so far have one unrealistic aspect in common: Gaus-
5.2. SIMULATIONS ON SYNTHETIC DATA 79
3 10 100
0.1
0.2
0.3
0.4
0.5
Subspace error
Pearson’s Kurtosis (heavy−tailedness)
Epoch size m=40
Epoch size m=300
Epoch size m=2500
Random projection
Figure .: Influence of the fatness of the tails of the sources (horizontal axis) on
the performance of SSA, measured in terms of the subspace error (vertical axis).
e source data is sampled from a super-gaussian with Pearson’s kurtosis shown
on the horizontal axis.
sian sources. Real data, however, is notoriously non-gaussian, apart perhaps from
pure noise components; real data has heavier tails and is almost always contami-
nated by outliers. Moreover, interesting components are usually distinctively non-
gaussian, which is reflected by the fact that many algorithms explicitly search for
components with non-Gaussian properties, e.g. ICA [Hyvarinen ()], projec-
tion pursuit [Huber (), Friedman and Tukey ()] or non-gaussian compo-
nent analysis [Blanchard et al. ()]. Outliers and heavy tails lead to stronger fluc-
tuations in finite samples between epochs, even for stationary sources, leading to
higher estimation errors of the true non-stationary subspace. e absolute scale of
the errors we have reported is therefore probably too optimistic for most real data
sets¹.
erefore, in the final set of simulations, we investigate how the error behaves
as we fatten the tails of the source distribution. As before, the total number of di-
mensions is ten with five stationary directions, we have epochs and the non-
stationarity level is set to α= 10. Figure . shows the results over various epoch
sample sizes. Interestingly, the performance does not deteriorate rapidly as the kur-
tosis grows. As one would expect, small sample sizes strongly exacerbate the influ-
ence of outliers.
¹is is in line with the aim of these simulations, which is to show the relative performance across
parameters. Any assumption about the source distribution is somewhat arbitrary and introduces a
systematic bias.
80 CHAPTER 5. SIMULATIONS AND APPLICATIONS
5.2.2 Testing for stationarity
In Section .., we have seen that the SSA objective function can also be derived
as a likelihood ratio test [Neyman and Pearson ()]. In this section, we evaluate
the effectiveness of this test and compare it against a data-driven nonparametric test
based on resampling.
e value of the SSA objective function is a measure of stationarity. However,
for a data set at hand, its value alone does not provide a clear indication of whether
the data is truly stationary, in the sense that its epochs have been sampled from dis-
tributions with the same mean and covariance matrix. at is because small sample
errors introduce differences in moment estimates between epochs, even for a data
that has been sampled from a stationary model.
erefore, one way of making sense of the objective function is to compare its
value against its distribution on truly stationary data, and decide for non-stationarity
if it exceeds a chosen quantile (critical value). In terms of hypothesis testing, station-
arydatais our null hypothesisH0whichwereject atacertain(one-sided) confidence
level.
−5 0 5 10 15 20 25 30 35 40
Normalized SSA objective function
Density
Source kurtosis 3 (Gaussian)
Source kurtosis 5.9
Source kurtosis 11.4
Source kurtosis 21.8
Figure .: Distribution of the normalised SSA objective function for stationary
sources with varying kurtosis estimated from random samples.
In Section .., we mentioned that the distribution of the normalized objective
function,
√2Λ(X)−√nd(d+ 3) −1,
approximately follows a standard normal distribution on stationary data with Gaus-
sian sources, where Λis the unnormalized objective function, nis the number of
epochs and dis the number of dimensions. Figure . shows estimates of the dis-
tribution of the normalised objective function on simulated stationary data in five
dimensions of epochs, each with samples, over random realisations.
5.2. SIMULATIONS ON SYNTHETIC DATA 81
For Gaussian sources, the empirical distribution of the normalised objective re-
semblesanormal density, butit isnot normal according to theKolmogorov�Smirnov
test. However, even for moderate levels of kurtosis, the distribution moves away
from N(0,1). us a stationarity test based on quantiles of the χ2distribution is
bound to perform poorly on almost all real data sets, in that it has a high false posi-
tive rate (type I error).
However, for non-Gaussian data the distribution of the objective function un-
der H0is in general unknown. Let us therefore take a closer look at the resampling
approach (see Chapter ). In short, we put the data from all epochs together, ran-
domly divide it into epochs, on which we then evaluate the SSA objective function.
Each random shuffling provides us with one sample from the distribution of the
SSA objective function on stationary data. By doing this repeatedly, we obtain a set
of samples from which we can estimate a critical value (e.g. the quantile) for
testing.
As an illustration, we compare the performance of the χ2test for stationarity
with the test based on resampling on a synthetic data set. To this end, for each degree
of non-gaussianity, we randomly generate stationary and non-stationary
data sets and apply both test strategies at the confidence level; the parameters
of the data set are the same as the one used for Figure ., the level of non-stationarity
is low (α= 2). In light of our previous observations, we are mainly interested in the
false positive rate (FPR).
3 3.5 4 4.5 5
5%
20%
40%
60%
80%
100%
Pearson’s Kurtosis
False positive rate (FPR)
Resampling
Chi−squared
Figure .: Comparison of two strategies for stationarity testing: comparing
against quantiles from the standard normal distribution (the theoretical limit) vs.
quantiles estimated from resampling the data set.
Figure . shows the result. For Gaussian sources (with Pearson’s kurtosis equal
to ), the false positive rate of both tests corresponds exactly to the chosen confi-
dence level of . However, already for moderate deviations form normality, the
increasing likelihood of outliers leads to rapidly rising false positive rate of the para-
metric test; at kurtosis , it has become completely useless, which is in line with our
observation in Figure .. On the other hand, the quantiles estimated from re-
82 CHAPTER 5. SIMULATIONS AND APPLICATIONS
samples of the shuffled data set are remarkable reliable: the false positive rate stays at
while the true positive rate (not shown here) is constantly . Of course, the
performance of the resampling test also depends on several factors (number of sam-
ples, type of stationary distribution, etc.), and we do not have theoretical guarantees
yet. Even so, resampling can be a viable approach in practice.
5.2.3 Algebraic SSA vs. optimisation
Comparing the performance of the algebraic and the optimisation-based SSA algo-
rithm is not straightforward: for a given dataset, the former has a unique solution
computed in a fixed amount of time whereas the latter is initialised randomly and
stops when a somewhat arbitrary convergence criterion has been met. Moreover, for
“standard” setting of these convergence criterion, the runtime of both algorithms
differ markedly for a wide range of setting. As we had seen before, the runtime of
the algebraic algorithm grows rapidly, in particular when the number of available
epochs is not very large.
11 50 100 500 1000
0
0.1
0.2
0.3
0.4
0.5
0.6
Subspace error
Number of samples per epoch (log)
SSA
Algebraic SSA
Random
Figure .: Comparison of the algebraic and the optimisation-based SSA algo-
rithm w.r.t. the epoch samples size.
It is for these reasons that we do not attempt an exhaustive comparison, the re-
sults of which would be difficult to interpret. Instead, we focus on one case where
the runtime of both algorithms is comparable and illustrates a general trend. In
this simulation, we use ten dimensional data (five stationary directions), a moder-
ate level of non-stationarity (α= 5), and vary the number of samples in each of the
epochs.
Figure . shows the results: until the epochs have reached a critical size of
samples, the optimisation approach outperforms the algebraic algorithm. e rea-
son for this is that the algebraic algorithm is more prone to small sample errors than
optimisation-SSA. at is because the coefficient vector space that it operates in has
5.3. APPLICATION TO EEG ANALYSIS 83
much higher dimension than the data space, in which the optimisation takes place.
However, as the small sample errors get weaker in magnitude (on large epochs), it
clearly outperforms the standard SSA algorithm. at is because the gradient-based
optimisation converges slowly due to the flatness of the objective function around
the true solution.
5.3 Application to EEG analysis
In this second part of this chapter, we apply SSA to data from electroencephalogra-
phy (EEG) [Berger (), Caton ()] recordings. EEG data analysis is a prime
application domain for SSA, because the EEG potential is well approximated by a
linear transformation of fluctuations of macroscopic potentials in the cortex. is
has lead to the tremendous success of linear methods for the analysis of brain activity
(see e.g.[Parra et al. ()]).
First of all, we show that PCA and ICA do not find non-stationary brain sources
onadata setfrombrain-computer-interfacing(BCI), andthatthemostnon-stationary
task-locked components correspond to the neural response to the task. Next, we
investigate the role of non-stationarities as one of the possible causes behind de-
teriorating performance of BCIs over time. We do this by removing certain non-
stationary directions in a pre-processing step, which for some subjects leads to a
significant improvement of the error rate. In the last part of this section, we demon-
stratethatSSArecoversacertaintype ofnon-stationarybrainsource(auditorysteady
state evoked responses) that has been induced by an experimental paradigm. Here
we can measure the effectiveness using the correlation with the stimuli. e analysis
shows that both ICA and PCA fail to recover the desired source.
5.3.1 Electroencephalography
e electroencephalography (EEG) [Berger (), Caton ()] is an electric po-
tential² measured on the surface of the scalp. e EEG is caused both by brain ac-
tivity and other physiological sources, e.g. nearby muscles.
e following overview is largely based on [Kandel et al. ()]. e cortical
source of the EEG is the synchronous activity of neurons, which are oen considered
as the atomic units of the cortex in models of the brain. e human brain consists of
around 1010 neurons, each of which is connected to its neighbouring neurons by a
dendritic tree and one axon. In a neuron, information is processed and transmitted
in the form of electric potentials. Every neuron maintains a membrane potential, i.e.
a voltage difference across its plasma membrane, which is controlled by the open-
ing and closing of ion channels through a complex electrochemical mechanism. A
²More precisely, it is a difference in potential between an electrode in the scalp and a reference
electrode, oen positioned on the nose or ear lobes.
84 CHAPTER 5. SIMULATIONS AND APPLICATIONS
neuron receives electrochemical stimulation from other neurons via its dendrites; if
the membrane potential exceeds a certain threshold, it is released as an action po-
tential which travels along the axon to other neurons. is effect is called spiking.
When large, spatially aligned neuron populations spike synchronously, their cur-
rent flow combines to produce macroscopic field potentials, which are measurable
on the surface of the scalp.
e EEG signal not only reflects the fluctuations of the field potentials gener-
ated inside the brain, but also non-neural artefacts, which are of physiological or
technical origin. Prominent physiological artefacts result from muscle activity (e.g.
face, tongue, eye blinks and movement), heart beatand pulse. Technical artefacts are
oen generated by the power supply or other equipment. e signal power of arte-
factual sources is usually much stronger than the brain sources. Removing artefacts
is a key step in EEG signal processing. For an extensive review of artefact rejection
methods see [Fatourechi et al. ()].
ere are two general strategies for removing artefacts: discarding subsets (tri-
als) of the data with strong artefact activity, or removing artefactual sources by re-
gression and substraction [Cro and Barry ()], projection or spatio-temporal
filtering. e first strategy leads to a loss of data points while the second reduces
the number of dimensions and, if successful, preserves the relevant neural signal. A
common approach to artefact removal, which we will adapt in this chapter, is based
on independent component analysis [Makeig et al. (), Jung et al. ()]. In a
first step, the data is decomposed into independent components which are then in-
dividually assessed to identify the artefactual components, oen in a semi-automatic
way. Rejecting independent components is appealing, because it ensures that one
does not remove information contained in dependencies and artefacts are usually
considered to be independent of cortical activity.
e manual classification into artefact/non-artefact is based on a number of
heuristics, that take into account the frequency content, the scalp pattern, the time
course and sometimes also the correlation with auxiliary measurements. For an ex-
tensive evaluation of the standard manual artefact removal protocol, we refer the
reader to [McMenamin et al. ()].
For our purpose, we used an ICA basis computed by the TDSEP algorithm
[Ziehe and Müller (b)], which was preceded by a PCA pre-processing to
of the variance (reducing the number of dimensions from to approximately ).
Each component was inspected manually. No information other than the frequency
spectrum, the scalp plot and the time course was available during artefact classifica-
tion.
EEG-based Brain-Computer-Interfacing
e goal of brain-computer-interfacing (BCI) [Dornhege (), Vidal ()] is to
transmit information directly from a human brain to a computer, circumventing
5.3. APPLICATION TO EEG ANALYSIS 85
peripheral nerves or muscles. is is achieved by voluntary changes in the brain
state, that can be detected by a computer and turned into a control signal.
e most popular BCI paradigm is based on modulations of the sensorimo-
tor rhythm (SMR), also called µ-rhythm. In most subjects, the SMR can be ob-
served in cortical EEG aer appropriate signal processing. e SMR can be used for
brain-computer-interfacing because it is modulated by the neural processing of mo-
tor commands, which can be controlled voluntarily by most subjects. In particular,
the µ-rhythm responds to motor commands that are only imagined and not actually
executed. is makes it possible to generate a control signal by thought alone.
Imaginedmovementscausetheso-called event-relateddesynchronisation(ERD)
[Pfurtscheller and Lopes da Silva ()] of neuron populations in the motor cortex
which lead to an attenuation of the SMR. e location of the SMR inside the brain
corresponds to the part of the body that is to be moved. Imagined movements of
the le and right hand cause ERD in the right and le motor cortex respectively. In
order to translate imagined movements into a control signal, the central task is to
distinguish between the spatial localisation of changes in the SMR. However, since
the SMR has low power relative to other brain sources (e.g. from the visual cortex)
and artefactual activity, it is not visible in the raw EEG signal, but requires spatio-
temporal filtering to enhance the signal-to-noise ratio e.g. using the common spatial
patterns (CSP) algorithm [Koles (), Blankertz et al. ()].
Dataset and experimental setup
For our comparison to PCA and ICA in Section .. and the application to BCI
in Section .., we use a dataset from a study [Blankertz et al. ()] which in-
cluded only BCI-novices. From this study, we select the subjects () who were
recorded in Berlin.
Between the calibration runs of imagined movement, partici-
pants were confronted with a computerized version of the d2-test
(Brickenkamp and Zillmer, 1998), which was part of an extended
psychological test battery (results will be presented elsewhere).
After the calibration, participants performed three runs each of
100 trials with BCI ‘feedback’(for some participants only one or two
runs were recorded due to fatigue (N=3) or lack of time (N=7)).
The first 20 trials were used for adapting the bias of the classifier (see
section ‘BBCI feedback’). Each trial of feedback started with a period
of 2 s with a black fixation cross in the center of a gray screen. Then
an arrow appeared behind the cross to indicate the target direction
of that trial (left or right for motor imagery classes left hand and
right hand and downward for class foot) and 1 s later the cross turned
purple and started moving according to the classifier output, as
described in section ‘BBCI feedback’. If the foot class was selected, the
cursor moved on a corner shaped trajectory: For left hand vs. foot,
e.g., the cursor moved from the center to the left of the screen for
detected left hand motor imagery, and the cursor moved from the
center downwards for detected foot imagery. After 4 s of cursor
movement the cross froze at the final position and turned black
again. Two seconds later the cross was reset to the center position
and the next trial began. Hits or misses were counted according to
this final position, but the score was only indicated during a break of
15 s after every block of 20 trials (see Fig. 3 for timing during
feedback runs).
BBCI feedback
The EEG signals of the calibration measurement were band-pass
filtered in a subject-specific frequency band and temporally filtered in
a subject-specific time interval. The start of that interval varied
between 480 ms and 2720 ms (median 930 ms) after cue onset and its
end varied between 1700 ms and 4990 ms (median 3810 ms). The
median length of the interval was 2730 ms (range 1000 ms to
3890 ms). Details on the heuristics underlying the selection of
frequency band and time interval can be found in Blankertz et al.
(2008b).Notethatalsosubject-optimizedspatialfilters were
determined by common spatial pattern (CSP) analysis (Blankertz
et al. 2008b). From these signals the log-variance was calculated in
each trial of the calibration data. This procedure resulted in a feature
vector with dimensionality equal to the number of (heuristically)
selected CSP filters (Blankertz et al., 2008b). To our experience, those
features can be classified well by linear methods like linear
discriminant analysis (LDA). Other paradigms and feature representa-
tions may however require non-linear modeling (see, e.g., Müller
et al., 2001; Müller et al., 2003).
For online operation, features were calculated every 40 ms with a
sliding window of 750 ms (applying CSP filters, band-pass filtering,
calculating log-variance and applying the LDA classifier, see Blankertz
et al. (2008b). The output of the classifier was translated into cursor
movement in a rate control manner: At the beginning of each trial, the
cursor started in the center of the screen and every 40 ms a fraction of
the classifier output was added to the current cursor position (see also
section ‘Experimental setup’). CSP filters calculated from the initial
calibration measurement were not adapted during online operation.
The bias of the linear classifier was adapted on the basis of the data of
the first 20 trials of each feedback run (see Krauledat et al., 2008). (By
bias we denote the term bin the separating hyperplane formulation
y=wx+bwhich can be used to adjust the output of a binary classifier
toward one class or the other). These 20 trials were not included in the
calculation of the feedback performance.
Performance predictor
To construct our novel SMR predictor, only a short two minute
recording of EEG under the condition ‘relax with eyes open’using two
Laplacian channels (C3 and C4, calculated from nine original
monopolar channels) was required. For this purpose we used the
concatenated segments of the same condition during the baseline
recording (see section ‘Experimental setup’).
From these data we calculated the power spectral density (PSD) in
the Laplace-filtered channels C3, C4 and determined for each of those
channels the maximum difference between the PSD curve and a fitof
the 1/fnoise spectrum as explained below (cf. Fig. 4). These two
values were estimates of the strength of the SMR over the hand areas.
The SMR predictor was calculated as the average of those two values.
It quantified the potential for desynchronization of the SMRs at rest as
an indicator of SMR strength during feedback.
Fig. 2. Course of a calibration trial. 2 s fixation cross, 4 s motor imagery cued by arrows (left, right, or down), and 2 s blank screen.
Fig. 3. Course of a feedback trial. 2 s fixation cross (black), 1 s indication of target direction cued by arrows, 4 s feedback (cross turns magenta and moves according to BBCI classifier),
2 s indication of result (cross turns black and freezes at its final position).
1305B. Blankertz et al. / NeuroImage 51 (2010) 1303–1309
Figure .: Trial structure during calibration: first, a fixation cross appears
for s, then a visual cue instructs the user to imagine a certain movement for
s (here: le hand), which is followed by a s rest period. Figure taken from
[Blankertz et al. ()].
e EEG was recorded from the scalp with a multi-channel amplifier (BrainAmp
DC by Brain Products, Munich, Germany) using Ag/AgCl electrodes (reference
86 CHAPTER 5. SIMULATIONS AND APPLICATIONS
at nasion; manufacturer EasyCap, Munich, Germany) and sampled at Hz with
a band-pass filter of . Hz to Hz.
For each subject, a single session was conducted, divided into a calibration and
a feedback phase. During calibration, a visual cue prompted the subject to imagine
one of the following motor tasks: movement of le hand, right hand, and right foot
or both feet movement³. In total, trials for each class were recorded; see Figure .
for an illustration of the trial structure. ese trials were used to train a classifier on
bandpower features extracted with CSP. Aer the calibration phase, the participants
performed the same task with feedback, i.e. a visual cursor indicating the current
classifier output.
5.3.2 Comparison to PCA and ICA
Before we look at the results of SSA, we ask the baseline question: could existing
methods do the job? Even though PCA and ICA optimise different criteria, and
will in general find different bases than SSA, the important question in practice is
whether the solutions differ significantly enough to justify a new method.
On EEG data, it is not inconceivable, a priori, that the strongest distribution
changes are exhibited by independent brain sources. Moreover, the degree of non-
stationarity may be related to the average power. If that were the case, PCA or ICA
could provide us with components that are useful for analysing non-stationarities.
Before we proceed to results, let us clarify what it means for a basis to be “useful
for analysing non-stationarities”. Informally speaking, it implies that it separates the
stationary from the non-stationary signal contributions. In the first step, we assess
this by considering the degree of non-stationarity of the individual basis elements
univariately. e sorted valuesprovideuswith a non-stationarity spectrum forabasis
on a particular data set. In terms of this spectrum, a basis is informative if it decays
rapidly and is uninformative if it is flat. In this way, the interpretation is similar to
the eigenvalue spectrum of PCA. However, unlike PCA’s eigenvalue spectrum, the
non-stationarity spectrum does not sum to the same value for all orthogonal bases.
Whereas every basis captures the same total variance, the same is not true for the
non-stationarities. As we had seen in Chapter , a non-stationary data set can seem
completely stationary in a basis when the distribution changes are entirely confined
to changes in the dependencies.
In a second step, we analyse whether a multivariate approach is necessary to find
the most non-stationary subspace. To that end, we compare the subspace obtained
by combining the components found univariately to a subspace obtained by the joint
optimisation of its basis.
³e type of foot movement could be chosen by the subject.
5.3. APPLICATION TO EEG ANALYSIS 87
Setup and results
e general setup of our analysis is as follows. For each subject, we select all cali-
bration trails from the data set described in the previous section. Firstly, we apply a
PCA pre-processing to preserve of the variance, which reduces the number of
dimensions from input channels (a subset of electrodes) to around . en we
apply a bandpass filter between . and Hz. No trial is rejected. e SSA epochs
are sliding windows of .s length with a overlap. SSA is applied in deflationary
mode, unless otherwise stated.
We begin our analysis by looking at the results of a single subject. e non-
stationarity spectrum of PCA, ICA and SSA is shown in Figure .. e directions
are ordered by degree of non-stationarity (SSA objective function value). For ICA
and PCA, we applied the SSA objective function using the same epoch structure as
for SSA.
1 5 10 15 20 25 30 35 40
Non−stationarity
Directions sorted by non−stationarity
SSA
PCA
ICA
Artefact
5 10 20 30 40
dB
Hz
5 10 20 30 40
dB
Hz
Component 1
Component 17
Loose electrode (#1)
Alpha rhythm (#17)
Figure .: Non-stationary spectrum for SSA (red), PCA (blue) and ICA (green)
for one exemplary BCI subject. For each method, the found directions are ordered
(horizontal axis) by the degree of non-stationarity (vertical axis). e red circle
indicates that an SSA direction corresponds to an artefact in the EEG data. e
scalp maps visualise the basis of component and , next to it are the frequency
spectra of the sources.
As we can see, all PCA projections (blue curve) pick up a similar amount of non-
stationary contributions. Moreover, we observe that the total non-stationarity cap-
tured by PCA is much less than both ICA and SSA. is suggests that signal power is
not strongly correlated with non-stationarity. e basis found by ICA (green curve)
picks up a similar amount of non-stationarity among its first five directions to the
SSA solution (red curve). However, a closer inspection of the most non-stationary
sources found by SSA reveals that they are non-neural artefacts (indicated by the red
circles).
Since we are interested in finding and analysing brain sources, we remove the
artefacts in a pre-processing step. e results in the le panel of Figure . show that
88 CHAPTER 5. SIMULATIONS AND APPLICATIONS
1 5 10 15 20 25 30
Non−stationarity
Directions sorted by non−stationarity
SSA
PCA
ICA
1 5 10 15 20 25 30
Non−stationarity / variance
PCA directions sorted by variance
Variance
Non−stationarity
Figure .: Non-stationary spectrum aer artefact rejection: degrees of non-
stationarity for SSA (red), PCA (blue) and ICA (green) basis. Variance of PCA
directions vs. non-stationarity: the blue bars show the eigenvales, the red curve is
the degree of non-stationarity.
aer artefact rejection, PCA and ICA do not provide a basis that discerns stationary
from non-stationary contributions.
Even though PCA did not find non-stationary directions relative to the SSA re-
sult, the variance of the PCA components may still be correlated with their strength
of non-stationarity. In the panel of Figure ., we plot the eigenvalues against the
degree of non-stationarity. However, we see that on this dataset, there is no associ-
ation between signal power and non-stationarity.
40%
50%
60%
70%
80%
PCA ICA
Total found non−stat. relative to SSA
0.1% 1% 10% 100%
5%
10%
25%
50%
100%
Share of total variance
Share of total non−stat.
PCA component
Figure .: e le panel shows the distribution of the total non-stationarity
of the PCA and the ICA basis relative to the SSA basis over all subjects. In the
right panel, we plot for each PCA component its variance against its degree of
non-stationarity.
e analysis of all remaining subjects reveals that these findings are consis-
tent over all subjects. Figure . shows the results aer artefact rejection. In the
5.3. APPLICATION TO EEG ANALYSIS 89
le panel, we see that SSA finds significantly more non-stationary components than
PCA and ICA over all subjects. e right panel confirms that there is no systematic
relationship between the power of a PCA component and its non-stationarity.
−4 −3 −2 −1 0 1 2
Degree of non−stationarity (log)
Frequency
0%
25%
50%
75%
100%
Artefact likelihood
Artefact likelihood
0.6
0.7
0.8
0.9
1
AUC
Figure .: Degree of non-stationary (horizontal axis) vs. artefact likelihood
(right vertical axis) for all subjects. e plot combines a histogram over non-
stationarity strengths (blue bars, le vertical axis) with a curve (red) indicating
the artefact likelihood. e artefact likelihood is the percentage of artefactual di-
rections in a centered window of length .
e results for the single subject suggest that there is a relationship between the
strength of non-stationarity and the likelihood that anSSA component isartefactual.
In order to assess this systematically, we manually classify the SSA components of
all subjects into artefact/non-artefact.
Figure . shows the results. e histogram shows the distribution of the non-
stationarity over all 40·40 = 1600 SSA directions, along with an empirical estimate
of the conditional artefact likelihood (red curve). e shape of this curve indicates
that the non-stationarity is a strong predictor for artefacts; in fact, above a certain
threshold every component found in this dataset is artefactual. ese finding are
confirmed by the AUC scores shown in the right panel of Figure .. Not only is the
degree of non-stationarity a good predictor across all directions of all subjects, but
it also performs well for most subjects individually: the median AUC over subjects
is above . and the percentile well above ..
So far, we have evaluated and compared the SSA solution from the perspective
of the non-stationary spectrum. is is a univariate measure, which ignores the
changes in the correlations between the non-stationary variables. Correspondingly,
the non-stationary directions have been optimised univariately in deflation mode.
Whether a univariate approach to finding the non-stationary subspace is suf-
ficient depends on the underlying data generating mechanisms. In order to verify
whether this is the case, we compare the non-stationarity of subspaces found by uni-
variate (deflation mode) and multivariate SSA. More specifically, for univariate SSA
we choose as a basis for the d-dimensional non-stationary subspace the top dmost
90 CHAPTER 5. SIMULATIONS AND APPLICATIONS
1 5 10 15 20 25
100%
150%
200%
250%
Number of non−stat. variables
Non−stationarity
SSA multivariate
SSA univariate
Random subspace
Non-stationary correlations
Figure .: Comparison of univariate (deflation mode) SSA vs. multivariate op-
timisation of a non-stationary subspace by SSA over subspaces of dimension one
to (horizontal axis). e vertical axis shows the non-stationarity relative to the
median of a random subspace.
non-stationary directions. If there is no advantage to a multivariate optimisation,
then the degree of non-stationarity of the two subspaces should be similar.
Figure . shows the results for one subject. Multivariate SSA clearly outper-
forms deflation mode SSA: for a ten-dimensional subspace, the subspace found by
multivariate SSA is nearly twice as non-stationary. is indicates that there are
strong non-stationarities in the correlation between latent variables, which calls for
a multivariate approach to the analysis of distribution changes in EEG data. Even
though univariate solutions have the appealing property of being unique (no arbi-
trary choice of basis), in this example we see that they do not reveal the full picture.
A canonical way to arrive at a unique basis for a non-stationary subspace would be
a post-processing in the obtained source-space using deflationary SSA.
The strongest task-locked non-stationary direction
In the last part of this section, we show in an example that the most non-stationary
task-locked source found by SSA has a meaningful interpretation in terms of the
expected neural response of the task. Here, we apply SSA not to the concatenation
of all trials of the calibration phase, but use a sliding window within the trials of
all classes to define our epochs, i.e. our analysis is locked to the motor imager task.
e sliding window has length .s and an overlap of . At each position within
the epoch, we average the covariance matrix over all trials of the same class (le or
right).
In this task-locked setup, one would expect the strongest non-stationary source
to correspond to the attenuation of the SMR. Figure . shows that this is indeed
5.3. APPLICATION TO EEG ANALYSIS 91
Left Right
0s 2s 6s 8s
Time within trial at center of window
Power
left
right
5 10 20 30 40
dB
Hz
5 10 20 30 40
dB
Hz
Figure .: Illustration of the most non-stationary source within the trials of
classleandrightrespectively. etoppanelshowsthemedian powertimecourse
for both sources on the corresponding trials; the bottom panels show the scalp
pattern and frequency spectra.
the case, for both classes. e power time series shows the clear response to the
stimuli (prompt for imagination) s aer the beginning of the trial.
While this is an example specifically constructed to allow for comparison against
a ground truth, there exist completely unsupervised analysis setups where no la-
bels or other auxiliary information to guide the extraction of useful components
are available (e.g. in the analysis of spontaneous activity [Bießmann et al. (),
Biswal et al. ()] or resting state networks). It is in these cases that the most non-
stationary components may be precisely the ones that are most interesting, which,
as we have seen, PCA or ICA can not, in general, find.
5.3.3 Non-stationarities in brain-computer-interfacing
A major impediment to the more widespread use of EEG-based brain-computer-
interfaces is their deteriorating performance over time, for which several adaptation
techniques have been proposed [Vidaurre et al. (), Shenoy et al. ()]. Dur-
ing the calibration phase, the subject is prompted to produce motor imagery which is
then used to train a classifier. However, as time passes, these parameters can become
suboptimal. is may be due to changes in the mental state (e.g. increased tiredness
of the subject) or in response to changing conditions during the application phase,
where the environment is less controlled than during the calibration phase.
In this section, we present some first steps towards investigating the role of non-
stationary brain sources using SSA. Our hypothesis is that (a) the generalisation
92 CHAPTER 5. SIMULATIONS AND APPLICATIONS
error can suffer from learning along directions where the distribution of the data
changes and (b) that we can identify these directions using the training set. at is,
we assume that the relevant non-stationary directions on the test set already show
non-stationary behaviour during the training condition. We test this hypothesis by
adding a pre-processing step to the standard offline analysis in which we remove
non-stationary directions. e performance is evaluated on the test set.
Setup and results
e aim of this SSA pre-processing is to remove non-stationary directions, while not
removing too much class-discriminative information. Clearly, this is a trade-off. In
order to keep the analysis simple, we merely choose an epoch structure which re-
flects this: each SSAepochconsists of two trials, onefromeachclass, chosenchrono-
logically. Moreover, in order to detect those non-stationarities most likely to affect
the CSP+LDA solution, we bandpass filter the data to the most discriminative fre-
quency band, selected by the standard heuristics [Blankertz et al. ()]. e most
non-stationary directions are found using SSA in deflation mode in order to allow
for a step-wise procedure and an interpretation of individual components.
0 2 4 6 8 10 12 14 16 18 20 22 24 26
20%
30%
40%
50%
Number of removed directions
Error rate
SSA
Random projection
5 10 20 30 40
dB
Hz
5 10 20 30 40
dB
Hz
Component #7 Component #26
Figure .: Results of the SSA pre-processing for one subject (S) over varying
numbers of removed non-stationary directions (horizontal axis); the vertical axis
shows the error rate on the test set. e error tube extends from the to the
over resamples of the test data; the two scalp plots correspond to the directions
that were removed in the step indicated by the black circle.
From the subjects of the VitalBCI study, we select the bottom subjects
in terms of classification accuracy on the test set. It is for this group of subjects
5.3. APPLICATION TO EEG ANALYSIS 93
that we conjecture that distribution changes have a role to play: despite strongly
discriminative CSP components found on the training set, the performance on the
test set is sub-standard.
S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 S11 S12
−50%
−30%
−10%
0%
+10%
Subjects
Relative change in error rate
SSA
Random projections
Figure .: Results of the SSA pre-processing over all selected subjects. e
vertical axis shows the change in test error rate relative to the baseline; for each
subject (horizontal axis) we report its distribution (blue boxplot) over removing
the top k= 1, . . . , 25 most non-stationary directions compared to the effect of
removing the same number of random directions (black boxplot).
e crucial parameter of the SSA pre-processing (apart from the epoch struc-
ture)is howmany resp. which non-stationary directionsshould be removed. Clearly,
at some point we will inevitably remove important discriminative information.
e results for one subject (S) are shown in Figure .. We see that aer
removing the -th most non-stationary direction, the error rate drops from
(chance level) to below . e scalp plot of the removed component suggests that
we have removed the α-rhythm, whose strength is associated with the level of fa-
tigue or tiredness. is source overlaps with the discriminative SMR, both spatially
(in electrode space) and in terms of frequency content. It is therefore likely that the
power of the α-rhythm contributes to the feature vectors. If the power of the α-
rhythm changes between training and test (e.g. due to increase tiredness), then this
may lead to suboptimal performance on the test set.
As we continue to remove non-stationary directions, the error rate eventually
drops below . However, with the removal of the -th most non-stationary
source, the error rate increases strongly. As we can see from the scalp plot, the pat-
tern resembles the characteristic pattern of the µ-rhythm associated with the right
motor cortex. By removing it, we remove discriminative information which is re-
flected in the increased error rate.
e results over all subjects are shown in Figure . and Figure .. For each
subject, we remove the top kmost non-stationary components and report the dis-
94 CHAPTER 5. SIMULATIONS AND APPLICATIONS
35% 40% 45% 50%
35%
40%
45%
50%
Baseline test error
Test error after SSA pre−proc.
Figure .: Evaluation of the SSA pre-processing for BCI. Each dot corresponds
to a subject; the black dots show the median test error rate over the top 1, . . . , 25
most non-stationary directions. e red dots show the test error rate for an opti-
mal choice of this parameter.
tribution of results over all kfrom 1to 25. Apart from the three subjects S, S and
S, the median effect of the SSA pre-processing is a reduction in error rate. For the
subjects S and S, the most non-stationary direction found in the first step is highly
correlated with the label. An expert manually evaluating each direction on the train-
ing set would have detected this (characteristic scalp plot and frequency spectrum);
however, in our automatic procedure, this leads to a drop in performance.
Whether this pre-processing can be turned into a practical approaching for BCI
depends on whether the non-stationary directions that are to be removed can be se-
lected automaticallybysomecriterion. Potential approachesincludecross-validation
on the training set; a heuristic combining the degree of non-stationarity with a mea-
sure of discriminatory power; or an automatic data-driven classification of harmful
vs. benign distribution changes based on features of the components.
However, what this analysis has shown is that there exist non-stationary compo-
nents which are highly detrimental to BCI performance for some subjects (subject
improved from (chance level) down to almost error rate) and that SSA
offers a means of finding them.
5.3.4 Recovering auditory steady-state evoked potentials
In this section, we use SSA to recover a brain source that has been induced by an
experimental stimulus, namely the auditory steady-state evoked potential (ASSEP).
is evoked potential is generated in the brain in response to an auditory stimu-
lus which occurs at a constant frequency. It has been shown that such a steady-
state stimulus evokes a response in the auditory system at the stimulation frequency
5.3. APPLICATION TO EEG ANALYSIS 95
Pattern
10 20 30 40 50 60 70 80
10
20
30
Spectrum
Hz
dB
500 1000 1500 2000 2500 3000
40 Hz power time course vs. target function
Power
Time index
Target function
SSA component
Stimulation frequency
Figure .: Component found by SSA that has the highest correlation with
the target (stimulus). e top le panel shows the scalp plot (basis vector) cor-
responding to the source; the top right panel shows its frequency content. e
bottom panel shows the power time course of the SSA component (green) and
the target function (blue).
[Plourde et al. (), Galambos et al. (), Picton et al. ()]. Moreover, the
amplitude of the ASSEP has been found to correlate to the intensity of the steady-
state stimulus [Rodriguez et al. ()]. us, by modulating the intensity of the
stimulus, we can induce a non-stationary brain source.
We analyse a dataset presented by [Dähne et al. ()] and follow a similar
evaluation strategy. In this experiment, a Hz pure tone is amplitude modu-
lated by a Hz cosine. In order to continuously vary the intensity of the stimulus,
it is multiplied by a slowly varying function which modulates the loudness between
and dB relative to the subject-specific hearing level. e EEG signals were
recorded using wet Ag/AgCl electrodes (Fast’n Easy Cap, EasyCap GmbH) placed
according to the - system. Signals were amplified using two -channel ampli-
fiers (Brain Products), sampled at kHz and filtered by an analog bandpass filter
between . and Hz. For the offline analysis, the signals were down-sampled to
Hz and a notch filter around Hz was applied to attenuate line noise.
We analyse a block of three minutes of continuous stimulation. SSA is applied
to find the most non-stationary sources in deflation mode; epochs are defined by
a sliding window of .s length and step .s. In a pre-processing step, the data is
bandpassfiltered to awindowaround theexpected Hzpeak; the widthofthiswin-
dow is varied during the analysis. Each obtained component is evaluated in terms of
96 CHAPTER 5. SIMULATIONS AND APPLICATIONS
2 4 6 8 10 12 14 16 18 20 None
0
0.1
0.2
0.3
0.4
0.5
0.6
Width of filter window centered at 40Hz
Correlation with target
SSA
PCA
ICA
Figure .: Comparison of PCA, FastICA and SSA in recovering the auditory
steady-statecomponent. ehorizontalaxis showsthemaximumcorrelationwith
the target signal (the stimulus) over all components of each method for varying
breadths of bandpass filters around Hz.
its correlation to the intensity modulation time series. We compare the components
found by SSA with the PCA and FastICA [Hyvärinen ()] solutions.
Figure . shows the SSA direction with the highest correlation to the tar-
get found without bandpass filtering (corresponding to the right most bars in Fig-
ure .). e scalp plot shows the characteristic broad central pattern. e power
spectrum (top right panel) exhibits a clear peak at Hz, the stimulation frequency;
and in the bottom panel, we see that the intensity modulation is correlated with the
power of the SSA component in a small window around Hz.
Figure . shows the results of the comparison of SSA against PCA and Fas-
tICA. e bars show the maximum correlation coefficient over all components for
each method. Overall, SSA significantly outperforms PCA and FastICA, probably
because it is the only method that maximises a criterion which is related to the char-
acteristics of this source.
5.4 Summary and Discussion
e simulations on synthetic data revealed the fundamental relationship between
the performance of SSA and basic parameters of the data. One result stands out: es-
timation errors in individual epochs average out across epochs. is is good news for
the practitioner, who is oen keen to increase the temporal resolution and capture
as much variation as possible, which means making the epochs smaller.
Another important outcome is the invalidation of the likelihood ratio test for
real, i.e. non-gaussian, data. On the other hand, a test based on resampling per-
5.4. SUMMARY AND DISCUSSION 97
formed remarkably well. Even so, more robust analytic test procedures would be
highly desirable to alleviate the computational burden of resampling. A straightfor-
ward approach might be to insert a more heavy-tailed distribution into the likeli-
hood ratio formalism or to expand the objective function around higher-order mo-
ments.
A comparison of the standard and the algebraic SSA algorithm confirmed that
the latter is advantageous in a niche setting: large number of epochs with a high
number of samples in low dimension. Even though this is probably exactly the op-
posite of what one expects in practice, the results do highlight the attractive features
of the algebraic approach. Several possible approaches towards reducing its compu-
tational cost were discussed in Chapter .
e results on EEG analysis have shown clearly that PCA and ICA are not use-
ful for understanding non-stationarities in neural activity. e non-stationary di-
rections that ICA finds are artefacts, not brain sources. Secondly, a multivariate
approach is necessary to find the most non-stationary subspace which captures the
non-stationarities in the dependency structure. In this respect, SSA differs from
most linear latent variable methods in that it finds a subspace and not a set of indi-
vidual components. While we do not offer an interpretation of the non-stationary
subspace, this clearly is an exciting direction for future research.
In two settings we have demonstrated that SSA finds solutions corresponding
to a ground truth: the most non-stationary sources in task-locked setting from BCI
clearly corresponds to the intentionally modulated µ-rhythm; and SSA successfully
finds the auditory steady state evoked potential, which PCA and ICA do not. ese
findingshighlightthe abilityofSSA to extract meaningful componentsincompletely
unsupervised settings where the relevant components are those that exhibit distri-
bution changes.
e application to BCI analysis showed that in some subjects, non-stationary
brain sources are highly detrimental to performance. Removing them in a pre-
processing step prior to calibration yielded reductions in error rates of almost
percentage points on some subjects. While it is unclear whether SSA pre-processing
can be turned into a practical automatic step that is beneficial to a sufficiently large
number of subjects, the results show that (a) distribution changes can have a strong
influence and (b) that SSA is the right tool for analysing them.
98 CHAPTER 5. SIMULATIONS AND APPLICATIONS
Chapter 6
Conclusion
Analysing change in observed data in response to the change in a controlled vari-
able is a quintessential task in empirical research. As such, there exists a wide range
of statistical methods which address this problem. e common approach in the
multivariate case is to find a direction in the data space associated with the change
in the controlled variable. For instance, linear discriminant analysis [Fisher ()]
finds a direction along which the distribution of the data is maximally different be-
tween two sets of samples. In EEG analysis, common spatial patterns [Koles ()]
find linear projections that allow us to distinguish between two experimental con-
ditions. However, the applicability of these families of methods crucially depend on
the availability of an externally controlled variable, or label. is is not always avail-
able. In this respect, SSA closes a gap because it allows for the unsupervised analysis
of change in data.
Summary of the main results
A simple example in Chapter shows that in order to analyse changes in the joint
distribution of a multivariate data set, a univariate perspective is, in principle, in-
sufficient. In essence, this is because changes in the dependencies between variables
are not necessarily visible in the marginals. e two-variable example also showed
the existence of another coordinate system, better suited to our task: it separates the
stationary from the non-stationary components. is observation sets the theme
for this thesis: finding such advantageous coordinate systems.
In the next Chapter we equip ourselves with the necessary background for this
task. We review fundamentals from mathematics, the basic framework of coordi-
nate transformations in machine learning and statistics, and the two most promi-
nent methods: principal component analysis (PCA) and independent component
analysis (ICA).
Chapter contains our main contribution, the formulation of the stationary
subspace analysis problem and the first algorithm to solve it. We start by introduc-
ing a generative model, discuss its identifiability, and define our notion of station-
arity: time-constant mean and covariance matrix. Corresponding to this definition,
100 CHAPTER 6. CONCLUSION
we derive a measure of stationarity based on epochs of the given time series, which
leads to an efficient algorithm that minimises or maximises this objective to find the
stationary and non-stationary components respectively. We discuss the problem of
spurious solutions and present a theoretical result which tells us precisely how much
data is needed to avoid them in practice.
In Chapter , we develop an alternative formulation of the SSA problem in terms
of an approximate solution to a system of polynomial equations. In contrast to the
SSA algorithm developed in the previous chapter, we do not search for the solution
guided by the gradient of an objective function but directly solve the problem al-
gebraically. is leads to more accurate solutions in certain scenarios; and allows
us to choose the number of stationary directions from data by computing a certain
eigenvalue spectrum.
Chapter is concerned with the validation of the SSA algorithm on synthetic
data and in applications to EEG data analysis. In controlled simulations, we analyse
the influence of key parameters on the performance of the SSA algorithm, compare
two approaches for stationarity testing, and demonstrate the relative merits of the
algebraic SSA algorithm.
In three applications to EEG analysis, we show that SSA provides us with gen-
uinely new and useful results. First of all, we demonstrate that neither PCA nor
ICA allow us to uncover the non-stationarity directions in EEG data. In particu-
lar, we also show that non-stationarities are, loosely speaking, multivariate in EEG
data: the most non-stationary subspace is not the combination of the individually
maximally non-stationary directions, but can only be found by joint optimisation
of several projections. Moreover, we show that the most non-stationary direction
in a brain-computer-interfacing context has a meaningful interpretation, in that it
corresponds to the intentional attenuation of the SMR.
Secondly, we investigate the role of non-stationarities in brain-computer-inter-
facing. It has oen been hypothesised that distribution changes are one of the fac-
tors behind deteriorating performance over time as parameters calibrated during
the training sessions may be suboptimal in the long term, should the distribution
of the EEG data change due to environmental influences or changes in the brain
state. To this end, we investigate whether removing non-stationary directions (esti-
mated during calibration) can be beneficial for performance. Our results show that
for some subjects, this indeed leads to a dramatic increase in performance. While
it is unclear whether this pre-processing can be turned into an automated practi-
cal method (due to difficulties in parameter choice), this analysis demonstrates that
SSA is a useful new tool in this context.
In the third application, we show that SSA can recover a neural component from
EEG signals which PCA and ICA do not find. e auditory steady state evoked po-
tential is the response to an auditory stimulus which occurs at a steady frequency
(beeps). is power of this response has been found to correlate with the volume
(intensity) of the stimulus. By varying the stimulus intensity, we can induce a brain
101
response (at the stimulus frequency) that has non-stationary variance. erefore,
SSA should be able to find it which is indeed the case. We compare the performance
of SSA, PCA and ICA in terms of the correlation with the stimulus intensity: as it
turns out, neither PCA nor ICA can find components that are significantly corre-
lated.
Limitations and future directions
e most relevant limitations of the presented SSA algorithm are related to its under-
lying definition of stationarity: changes in the higher order moments go unnoticed
and the time structure within the epochs is discarded. While an extension in the
former direction probably comes at a high price (in terms of sample and compu-
tational complexity), integrating time structure should be possible along the lines
of TDSEP. As many real sources are endowed with temporal structure, this should
bring about significant improvements.
Another important question is: how to choose the number of stationary resp.
non-stationary sources? We have proposed two answers: post-hoc stationarity test-
ing using resampling, or inspecting the PCA eigenvalue spectrum in coefficient vec-
tor space It would be interesting to investigate an approach based on a model se-
lection framework [Akaike (), Schwarz (), Rissanen ()] or a Bayesian
formulation that allows us to compute the posterior over possible numbers of sta-
tionary directions.
e applications to EEG analysis presented in this chapter is a very first step. Fu-
ture research will investigate the types and potential causes of non-stationary neu-
ral components, both in task-locked and paradigmless settings, possibly in com-
bination with source reconstruction techniques [Scherg and Von Cramon (),
Herrmann et al. (), Haufe et al. ()]. In the EEG context, it may be pos-
sible to interpret the non-stationary subspace (as opposed to individual compo-
nents) using physiological constraints that reduce the number of plausible bases
[Schmidt (), Mosher and Leahy ()]. Moreover, our results showed that the
degree of non-stationarity is related to artefact-likelihood. Whether this can be
turned into a practical artefact rejection method needs to be evaluated.
Where else can the algebraic approach be advantageous? In short, whenever the
desired solution is a set of polynomials of a certain type (in the SSA case: linear pro-
jections) whose approximate vanishing set is given in terms of another set of input
polynomials, usually of higher degree. at is, both input and output are polynomi-
als and the objective of the learning task needs to be describable in terms of equa-
tions. is is, for example, the case in the simultaneous diagonalisation of several
matrices [Cardoso and Souloumiac (), Bunse-Gerstner et al. ()], whereone
aims to make the off-diagonal elements equal to zero by applying a linear basis trans-
formation.
102 CHAPTER 6. CONCLUSION
Bibliography
[Abdi ()] H. Abdi. Factor rotations in factor analyses. Encyclopedia for
Research Methods for the Social Sciences. Sage: ousand Oaks, CA, pages
–, .
[Akaike ()] H. Akaike. A new look at the statistical model identification. Au-
tomatic Control, IEEE Transactions on, ():–, .
[Allen et al. ()] E.A Allen, E.B Erhardt, E Damaraju, W Gruner, J.M Segall, R.F
Silva, M Havlicek, S Rachakonda, J Fries, and R Kalyanam. A baseline for
the multivariate comparison of resting-state networks. Front Syst Neurosci, ,
.
[Bell and Sejnowski ()] A. J. Bell and T. J. Sejnowski. An information-
maximization approach to blind separation and blind deconvolution. Neural
Computation, ():–, November .
[Belouchrani et al. ()] A. Belouchrani, K. Abed-Meraim, J.F. Cardoso, and
E. Moulines. A blind source separation technique using second order statis-
tics. IEEE Trans. on Signal Processing, ():–, .
[Bengio et al. ()] Y. Bengio, J.F. Paiement, P. Vincent, O. Delalleau, N. Le Roux,
and M. Ouimet. Out-of-sample extensions for lle, isomap, mds, eigenmaps,
and spectral clustering. Advances in neural information processing systems, :
–, .
[Berger ()] Hans Berger. Über das elektroencephalogramm des menschen.
Archiv fur Psychiatrie und Nervenkrankheiten, :–, .
[Bießmann et al. ()] Felix Bießmann, Sergey M Plis, Frank C Meinecke, Tom
Eichele, and Klaus-Robert Müller. Analysis of multimodal neuroimaging
data. Biomedical Engineering, IEEE Reviews in, : – , .
[Biswal et al. ()] B Biswal, F Z Yetkin, V M Haughton, andJ S Hyde. Functional
connectivity in the motor cortex of resting human brain using echo-planar
mri. Magn Reson Med, ():–, Oct .
104 BIBLIOGRAPHY
[Blanchard et al. ()] G. Blanchard, M. Kawanabe, M. Sugiyama, V. Spokoiny,
and K.R. Müller. In search of non-gaussian components of a high-
dimensional distribution. e Journal of Machine Learning Research, :
–, .
[Blankertz et al. ()] B. Blankertz, R. Tomioka, S. Lemm, M. Kawanabe, and
K.R. Muller. Optimizing spatial filters for robust eeg single-trial analysis.
Signal Processing Magazine, IEEE, ():–, .
[Blankertz et al. ()] Benjamin Blankertz, Claudia Sannelli, Sebastian Halder,
Eva M. Hammer, Andrea Kübler, Klaus-Robert Müller, Gabriel Curio, and
orsten Dickhaus. Neurophysiological predictor of SMR-based BCI perfor-
mance. NeuroImage, ():–, .
[Blythe et al. ()] Duncan A. J. Blythe, Paul von Bünau, Frank C. Meinecke,
and Klaus-Robert Müller. Feature extraction for change-point detection us-
ing stationary subspace analysis. IEEE Transactions on Neural Networks and
Learning Systems, ():–, .
[Bunse-Gerstner et al. ()] A. Bunse-Gerstner, R. Byers, and V. Mehrmann.
Numerical methods for simultaneous diagonalization. SIAM Journal on Ma-
trix Analysis and Applications, :–, .
[Caboara et al. ()] Massimo Caboara, Pasqualina Conti, and Carlo Traverse.
Yet another ideal decomposition algorithm. In Teo Mora and Harold Matt-
son, editors, Applied Algebra, Algebraic Algorithms and Error-Correcting
Codes, volume of Lecture Notes in Computer Science, pages –.
Springer Berlin / Heidelberg, .
[Cardoso and Souloumiac ()] Jean-François Cardoso and Antoine
Souloumiac. Blind beamforming for non gaussian signals. IEE Proceedings-F,
:–, .
[Cardoso ()] J.F. Cardoso. High-order contrasts for independent component
analysis. Neural computation, ():–, .
[Cardoso and Souloumiac ()] J.F. Cardoso and A. Souloumiac. Jacobi angles
for simultaneous diagonalization. SIAM Journal on Matrix Analysis and Ap-
plications, ():–, .
[Caton ()] Richard Caton. e electric currents of the brain. British Medical
Journal, ():, .
[Corless et al. ()] R.M. Corless, Patrizia M. Gianni, Barry M. Trager, and S.M.
Watt. e singular value decomposition for polynomial systems. Proc. ISSAC
’, pages –, .
BIBLIOGRAPHY 105
[Cover et al. ()] T.M. Cover, J.A. omas, J. Wiley, et al. Elements of informa-
tion theory, volume . Wiley Online Library, .
[Cox et al. ()] David A. Cox, John Little, and Donal O’Shea. Ideals, Varieties,
and Algorithms: An Introduction to Computational Algebraic Geometry and
Commutative Algebra, /e (Undergraduate Texts in Mathematics). Springer-
Verlag New York, Inc., Secaucus, NJ, USA, . ISBN .
[Cro and Barry ()] R.J. Cro and RJ Barry. Removal of ocular artifact from
the eeg: a review. Neurophysiologie Clinique/Clinical Neurophysiology, ():
–, .
[Dähne et al. ()] Sven Dähne, Vadim Nikulin, Frank Meinecke, Stefan Haufe,
Johannes Höhne, Michael Tangermann, and Klaus-Robert Müller. Optimal
spatial filters for correlating band power with cognitive function, .
[Dornhege ()] G. Dornhege. Toward brain-computer interfacing. e MIT
Press, .
[Eisenbud et al. ()] David Eisenbud, Craig Huneke, and Wolmer Vasconcelos.
Direct methods for primary decomposition. Inventiones Mathematicae, :
–, . ISSN -.
[Fatourechi et al. ()] M. Fatourechi, A. Bashashati, R.K. Ward, and G.E. Birch.
Emg and eog artifacts in brain computer interface systems: A survey. Clinical
Neurophysiology, ():–, .
[Fisher ()] R.A. Fisher. e use of multiple measurements in taxonomic prob-
lems. Annals of Human Genetics, ():–, .
[Friedman and Tukey ()] J.H. Friedman and J.W. Tukey. A projection pursuit
algorithm for exploratory data analysis. Computers, IEEE Transactions on,
():–, .
[Froeberg ()] Ralf Froeberg. An inequality for hilbert series of graded algebras.
Math. Scand., : – , .
[Froeberg and Hollman ()] Ralf Froeberg and Joachim Hollman. Hilbert se-
ries for ideals generated by generic forms. Journal of Symbolic Computation,
(): – , . ISSN -.
[Galambos et al. ()] R. Galambos, S. Makeig, and P.J. Talmachoff. A -hz au-
ditory potential recorded from the human scalp. Proceedings of the national
academy of sciences, ():, .
[Gauß ()] Carl Friedrich Gauß. eoria motus corporum coelestium in sec-
tionibus conicis solem ambientium. Göttingen, .
106 BIBLIOGRAPHY
[Gianni et al. ()] Patrizia Gianni, Barry Trager, and Gail Zacharias. Groebner
bases and primary decomposition of polynomial ideals. Journal of Symbolic
Computation, (-): – , . ISSN -.
[Golub and Van Loan ()] G.H. Golub and C.F. Van Loan. Matrix computa-
tions, volume . Johns Hopkins Univ Pr, .
[Gretton et al. ()] A. Gretton, K.M. Borgwardt, M.J. Rasch, B. Schölkopf, and
A. Smola. A kernel two-sample test. e Journal of Machine Learning Re-
search, :–, .
[Guyon and Elisseeff ()] Isabelle Guyon and Andre Elisseeff. An introduction
to variable and feature selection. Journal of Machine Learning Research, :
–, Mar .
[Hara et al. ()] Satoshi Hara, Yoshinobu Kawahara, Takashi Washio, and Paul
von Bünau. Stationary subspace analysis as a generalized eigenvalue problem.
In Proceedings of the th international conference on Neural information pro-
cessing: theory and algorithms - Volume Part I, ICONIP’, pages –,
Berlin, Heidelberg, . Springer-Verlag.
[Hara et al. ()] Satoshi Hara, Yoshinobu Kawahara, Takashi Washio, Paul von
Bünau, Terumasa Tokunaga, and Kiyohumi Yumoto. Separation of stationary
and non-stationary sources with a generalized eigenvalue problem. Neural
Networks, :–, . ISSN -.
[Harman ()] H.H. Harman. Modern factor analysis. University of Chicago
Press, .
[Hastie and Stuetzle ()] T. Hastie and W. Stuetzle. Principal curves. Journal of
the American Statistical Association, pages –, .
[Haufe et al. ()] S. Haufe, V.V. Nikulin, A. Ziehe, K.R. Müller, and G. Nolte.
Combining sparsity and rotational invariance in eeg/meg source reconstruc-
tion. NeuroImage, ():–, .
[Heldt et al. ()] Daniel Heldt, Martin Kreuzer, Sebastian Pokutta, and Hennie
Poulisse. Approximate computation of zero-dimensional polynomial ideals.
Journal of Symbolic Computation, (): – , . ISSN -.
In Memoriam Karin Gatermann.
[Hermann ()] Grete Hermann. Die Frage der endlich vielen Schritte in der
eorie der Polynomideale - unter Benutzung nachgelassener S�tze von K.
Hentzelt. Mathematische Annalen, (): – , . ISSN -.
BIBLIOGRAPHY 107
[Herrmann et al. ()] M.J. Herrmann, J. Römmler, A.C. Ehlis, A. Heidrich, and
A.J. Fallgatter. Source localization (loreta) of the error-related-negativity
(ern/ne) and positivity (pe). Cognitive Brain Research, ():–, .
[Hinton and Roweis ()] G. Hinton and S. Roweis. Stochastic neighbor embed-
ding. Advances in neural information processing systems, :–, .
[Huber ()] P.J. Huber. Projection pursuit. e annals of Statistics, pages
–, .
[Hyvarinen ()] A. Hyvarinen. Fast and robust fixed-point algorithms for in-
dependent component analysis. Neural Networks, IEEE Transactions on,
():–, .
[Hyvärinen et al. ()] A. Hyvärinen, J. Karhunen, and E. Oja. Independent com-
ponent analysis, volume . Wiley-interscience, .
[Hyvärinen ()] Aapo Hyvärinen. Fast and robust fixed-point algorithms for
independent component analysis. IEEE Transactions on Neural Networks,
():–, May . ISSN .
[Jaynes ()] E. T. Jaynes. Information theory and statistical mechanics. Physical
Review, (–), .
[Jaynes and Bretthorst ()] E.T. Jaynes and G.L. Bretthorst. Probability theory:
the logic of science. Cambridge Univ Pr, .
[Jung et al. ()] T.P. Jung, S. Makeig, C. Humphries, T.W. Lee, M.J. Mckeown,
V. Iragui, and T.J. Sejnowski. Removing electroencephalographic artifacts by
blind source separation. Psychophysiology, ():–, .
[Jung et al. ()] T.P. Jung, S. Makeig, M.J. McKeown, A.J. Bell, T.W. Lee, and T.J.
Sejnowski. Imaging brain dynamics using independent component analysis.
Proceedings of the IEEE, ():–, .
[Kaiser ()] H.F. Kaiser. e varimax criterion for analytic rotation in factor
analysis. Psychometrika, ():–, .
[Kaltofen et al. ()] E. Kaltofen, Z. Yang, and L. Zhi. Structured low rank ap-
proximation of a sylvester matrix. Symbolic-numeric computation, pages
–, .
[Kandel et al. ()] E.R. Kandel, J.H. Schwartz, T.M. Jessell, et al. Principles of
neural science, volume . McGraw-Hill New York, .
108 BIBLIOGRAPHY
[Kawanabe et al. ()] Motoaki Kawanabe, Wojciech Samek, Paul von B¸nau,
and Frank Meinecke. An information geometrical view of stationary sub-
space analysis. In Artificial Neural Networks and Machine Learning – ICANN
, volume of Lecture Notes in Computer Science, pages –.
Springer Berlin / Heidelberg, .
[Kiraly et al. ()] Franz J. Kiraly, Paul von Bünau, Frank C. Meinecke, Duncan A. J.
Blythe, and Klaus-Robert Müller. Algebraic geometric comparison of prob-
ability distributions. Journal of Machine Learning Research, :–, .
[Kiraly et al. ()] Franz J. Kiraly, Paul von Bünau, Jan Saputra Müller, Duncan
A. J. Blythe, Frank C. Meinecke, and Klaus-Robert Müller. Regression for
sets of polynomial equations. In JMLR Workshop and Conference Proc. Vol.
, pages –, .
[Kohonen ()] T. Kohonen. e self-organizing map. Proceedings of the IEEE,
():–, .
[Koles ()] Z.J. Koles. e quantitative extraction and topographic mapping of
the abnormal components in the clinical eeg. Electroencephalography and
Clinical Neurophysiology, ():–, .
[Kreuzer et al. ()] Martin Kreuzer, Hennie Poulisse, and Lorenzo Robbiano.
Approximate Commutative Algebra, chapter From Oil Fields to Hilbert
Schemes. Springer-Verlag Berlin Heidelberg, .
[Krick and Logar ()] Teresa Krick and Alessandro Logar. An algorithm for the
computation of the radical of an ideal in the ring of polynomials. In Harold
Mattson, Teo Mora, and T. Rao, editors, Applied Algebra, Algebraic Algorithms
and Error-Correcting Codes, volume ofLecture Notes in Computer Science,
pages –. Springer Berlin / Heidelberg, .
[Laplagne ()] Santiago Laplagne. An algorithm for the computation of
the radical of an ideal. In Proceedings of the international sym-
posium on Symbolic and algebraic computation. ACM, . URL
http://dx.doi.org/10.1145/1145768.1145802.
[Legendre ()] Adrien-Marie Legendre. Nouvelles méthodes pour la détermina-
tion des orbites des comètes, chapter Sur la methode des moindres quarres.
Firmin Didot, http://imgbase-scd-ulp.u-strasbg.fr/displayimage.php?pos=-
, .
[Makeig et al. ()] S. Makeig, A.J. Bell, T.P. Jung, T.J. Sejnowski, et al. Indepen-
dent component analysis of electroencephalographic data. Advances in neural
information processing systems, pages –, .
BIBLIOGRAPHY 109
[McMenamin et al. ()] B.W. McMenamin, A.J. Shackman, J.S. Maxwell,
D.R.W. Bachhuber, A.M. Koppenhaver, L.L. Greischar, and R.J. Davidson.
Validation of ica-based myogenic artifact correction for scalp and source-
localized eeg. Neuroimage, ():–, .
[Meinecke et al. ()] Frank C. Meinecke, Paul von Bünau, Motoaki Kawanabe,
and Klaus-Robert Müller. Learning invariances with stationary subspace
analysis. In Computer Vision Workshops (ICCV Workshops), IEEE th
International Conference on, pages –, .
[Meinecke ()] Frank Carsten Meinecke. Synchronized? Iden-
tifying Interactions from Superimposed Signals. PhD the-
sis, Berlin Institute of Technology (TU Berlin), . URL
http://opus.kobv.de/tuberlin/volltexte/2012/3386/.
[Molgedey and Schuster ()] L. Molgedey and H. G. Schuster. Separation of a
mixture of independent signals using time delayed correlations. Phys. Rev.
Lett., ():–, Jun .
[Mood et al. ()] A.M. Mood, F.A. Graybill, and D.C. Boes. Introduction to the
theory of statistics. McGraw-Hill Book Company, .
[Mosher and Leahy ()] J.C. Mosher and R.M. Leahy. Source localization us-
ing recursively applied and projected (rap) music. Signal Processing, IEEE
Transactions on, ():–, .
[Neyman and Pearson ()] J. Neyman and E.S. Pearson. On the problem of the
most efficient tests of statistical hypotheses. Philosophical Transactions of the
Royal Society of London. Series A, Containing Papers of a Mathematical or
Physical Character, :–, .
[Park et al. ()] H. Park, L. Zhang, and J.B. Rosen. Low rank approximation of
a hankel matrix by structured total least norm. BIT Numerical Mathematics,
():–, .
[Parra et al. ()] L.C. Parra, C.D. Spence, A.D. Gerson, and P. Sajda. Recipes for
the linear analysis of eeg. Neuroimage, ():–, .
[Pearson ()] K Pearson. On lines and planes of closest fit to systems of points
in space. Philosophical Magazine, :–, .
[Pfurtscheller and Lopes da Silva ()] G. Pfurtscheller and FH Lopes da Silva.
Event-related eeg/meg synchronization and desynchronization: basic princi-
ples. Clinical neurophysiology, ():–, .
110 BIBLIOGRAPHY
[Picton et al. ()] T.W. Picton, M.S. John, A. Dimitrijevic, and D. Purcell. Hu-
man auditory steady-state responses: respuestas auditivas de estado estable
en humanos. International journal of audiology, ():–, .
[Plourde et al. ()] G. Plourde, D.R. Stapells, and T.W. Picton. e human au-
ditory steady-state evoked potentials. Acta Oto-Laryngologica, (S):
–, .
[Plumbley ()] Mark D. Plumbley. Geometrical methods for non-negative ICA:
Manifolds, Lie groups and toral subalgebras. Neurocomputing, (–),
.
[Priestley ()] M.B. Priestley. Spectral Analysis and Time Series. AcademicPress,
.
[Quionero-Candela et al. ()] J. Quionero-Candela, M. Sugiyama,
A. Schwaighofer, and N.D. Lawrence. Dataset shi in machine learn-
ing. e MIT Press, .
[Rahimi and Recht ()] A. Rahimi and B. Recht. Weighted sums of random
kitchen sinks: Replacing minimization with randomization in learning. Ad-
vances in Neural Information Processing Systems (NIPS), .
[Raichle et al. ()] M E Raichle, A M MacLeod, A Z Snyder, W J Powers, D A
Gusnard, and G L Shulman. A default mode of brain function. Proc Natl Acad
Sci USA, ():–, Jan .
[Rissanen ()] J. Rissanen. Modeling by shortest data description. Automatica,
():–, .
[Rodriguez et al. ()] R. Rodriguez, T. Picton, D. Linden, G. Hamel, and
G. Laframboise. Human auditory steady state responses: effects of intensity
and frequency. Ear and hearing, ():, .
[Roweis and Saul ()] S.T. Roweis and L.K. Saul. Nonlinear dimensionality re-
duction by locally linear embedding. Science, ():–, .
[Samek et al. ()] Wojciech Samek, Klaus-Robert Müller, Motoaki Kawanabe,
and Carmen Vidaurre. Brain-computer interfacing in discriminative and sta-
tionary subspaces. In Conf Proc IEEE Eng Med Biol Soc. IEEE EMBS, .
[Scherg and Von Cramon ()] M. Scherg and D. Von Cramon. Evoked dipole
source potentials of the human auditory cortex. Electroencephalography and
Clinical Neurophysiology/Evoked Potentials Section, ():–, .
BIBLIOGRAPHY 111
[Schmidt ()] R. Schmidt. Multiple emitter location and signal parameter es-
timation. Antennas and Propagation, IEEE Transactions on, ():–,
.
[Schölkopf et al. ()] B. Schölkopf, A.J. Smola, and K.-R. Müller. Nonlinear
component analysis as a kernel eigenvalue problem. Neural Computation,
():–, .
[Schwarz ()] G. Schwarz. Estimating the dimension of a model. e annals of
statistics, ():–, .
[Schweikert et al. ()] Gabriele Schweikert, Christian Widmer, Bernhard
Schölkopf, and Gunnar Rätsch. An empirical analysis of domain adaptation
algorithms for genomic sequence analysis. In Daphne Koller, Dale Schu-
urmans, Yoshua Bengio, and Léon Bottou, editors, NIPS, pages –.
Curran Associates, Inc., .
[Shenoy et al. ()] P. Shenoy, M. Krauledat, B. Blankertz, R.P.N. Rao, and K.R.
Müller. Towards adaptive classification for bci. Journal of Neural Engineering,
:R, .
[Shimodaira ()] H. Shimodaira. Improving predictive inference under covari-
ate shi by weighting the log-likelihood function. Journal of Statistical Plan-
ning and Inference, ():–, .
[Smirnoff ()] N. Smirnoff. On the estimation of the discrepancy between
empirical curves of distribution for two independent samples. Bulletin de
l�Universite de Moscow, Serie internationale (Mathematiques), :–, .
[Stetter ()] Hans J. Stetter. Numerical polynomial algebra. Society for Industrial
and Applied Mathematics, . ISBN .
[Sugiyama ()] M. Sugiyama. Density ratio estimation: A new versatile tool for
machine learning. Advances in Machine Learning, pages –, .
[Sugiyama et al. ()] M. Sugiyama, M. Krauledat, and K.R. Müller. Covariate
shi adaptation by importance weighted cross validation. e Journal of Ma-
chine Learning Research, :–, .
[Sugiyama et al. ()] M. Sugiyama, T. Suzuki, S. Nakajima, H. Kashima, P. von
Bünau, and M. Kawanabe. Direct importance estimation for covariate shi
adaptation. Annals of the Institute of Statistical Mathematics, ():–,
.
112 BIBLIOGRAPHY
[Sugiyama et al. ()] Masashi Sugiyama, Makoto Yamada, Paul von Bünau,
Taiji Suzuki, Takafumi Kanamori, and Motoaki Kawanabe. Direct density-
ratio estimation with dimensionality reduction via least-squares hetero-
distributional subspace search. Neural Networks, ():–, . ISSN
-.
[Tenenbaum et al. ()] J.B. Tenenbaum, V. De Silva, and J.C. Langford. A global
geometric framework for nonlinear dimensionality reduction. Science,
():–, .
[Vidal ()] J.J. Vidal. Toward direct brain-computer communication. Annual
review of Biophysics and Bioengineering, ():–, .
[Vidaurre et al. ()] C. Vidaurre, M. Kawanabe, P. Bunau, B. Blankertz, and K.R.
Muller. Toward an unsupervised adaptation of lda for brain-computer inter-
faces. Biomedical Engineering, IEEE Transactions on, ():–, .
[von Bünau et al. ()] Paul von Bünau, Frank C. Meinecke, Simon Scholler, and
Klaus-Robert Müller. Finding stationary brain sources in EEG data. In Pro-
ceedings of the nd Annual Conference of the IEEE EMBS, pages –,
.
[Wald and Wolfowitz ()] A. Wald and J. Wolfowitz. On a test whether two
samples are from the same population. e Annals of Mathematical Statis-
tics, ():–, .
[Ziehe and Müller (a)] A. Ziehe and K.-R. Müller. TDSEP – an efficient algo-
rithm for blind separation using time structure. In L. Niklasson, M. Bodén,
and T. Ziemke, editors, Proceedings of the th International Conference on Ar-
tificial Neural Networks, ICANN’, Perspectives in Neural Computing, pages
– , Berlin, a. Springer Verlag.
[Ziehe and Müller (b)] A. Ziehe and K.-R. Müller. TDSEP – an efficient algo-
rithm for blind separation using time structure. In L. Niklasson, M. Bodén,
and T. Ziemke, editors, Proc. ICANN ’, pages – . Springer Verlag,
b.