Cognitive Computation
https://doi.org/10.1007/s12559-020-09754-0
Echo State Networks and Long Short-Term Memory for Continuous
Gesture Recognition: a Comparative Study
Doreen Jirak¹ · Stephan Tietz² · Hassan Ali¹ · Stefan Wermter¹
Received: 20 February 2020 / Accepted: 15 July 2020
©The Author(s) 2020
Abstract
Recent developments of sensors that allow tracking of human movements and gestures enable rapid progress of applications
in domains like medical rehabilitation or robotic control. Especially the inertial measurement unit (IMU) is an excellent
device for real-time scenarios as it rapidly delivers data input. Therefore, a computational model must be able to learn
gesture sequences in a fast yet robust way. We recently introduced an echo state network (ESN) framework for continuous
gesture recognition (Tietz et al., 2019) including novel approaches for gesture spotting, i.e., the automatic detection of the
start and end phase of a gesture. Although our results showed good classification performance, we identified significant
factors which also negatively impact the performance like subgestures and gesture variability. To address these issues, we
include experiments with Long Short-Term Memory (LSTM) networks, which is a state-of-the-art model for sequence
processing, to compare the obtained results with our framework and to evaluate their robustness regarding pitfalls in the
recognition process. In this study, we analyze the two conceptually different approaches processing continuous, variable-
length gesture sequences, which shows interesting results comparing the distinct gesture accomplishments. In addition,
our results demonstrate that our ESN framework achieves performance comparable to that of the LSTM network but has
significantly lower training times. We conclude from the present work that ESNs are viable models for continuous gesture
recognition delivering reasonable performance for applications requiring real-time performance as in robotic or rehabilitation
tasks. From our discussion of this comparative study, we suggest prospective improvements on both the experimental and
network architecture level.
Keywords Continuous gesture recognition · Echo state networks · Long Short-Term Memory
Doreen Jirak
[email protected]hamburg.de
¹ Department of Informatics, Knowledge Technology,
University of Hamburg, Vogt-Kölln-Str. 30, 22527
Hamburg, Germany
² Technical University of Berlin, Strasse des 17. Juni 135,
10623 Berlin, Germany

Introduction
Continuous gesture recognition is a challenging task due to
three critical aspects: (1) the correct identification of the
start and end of the actual gesture, called subgesture, (2)
the recognition of a gesture of possibly variable length,
also called inter-subject variability, and (3) the accurate
distinction between an active gesture and subtle movements
or silent phases like pauses. The correct yet fast recognition
of gestures is an important research area predominantly
for vision-based applications in human-robot interaction
(HRI) or human-computer interaction (HCI). Although
visual gesture recognition allows the most intuitive interface
between a human and an agent, it is also the most
challenging task, ranging from the recording procedures
and the preprocessing of a huge number of video streams to
computational models with low latency and high
recognition rates. In recent years, deep learning techniques
emerged as a new way to learn huge datasets using
GPU computing. Especially for gesture recognition, they
achieved high accuracy on benchmarks like ChaLearn [7].
Learning sequences demands some memory mechanism
as is implemented in recurrent neural networks (RNNs).
Training of deep models, i.e., network architectures with
many layers, through gradient propagation often suffers
from effects of exploding or vanishing gradients [3,
4]. To address this issue, Long Short-Term Memory
(LSTM) networks [13] have been proposed, whose gating
mechanisms integrated into an RNN architecture overcome
the error-prone gradient computations. An alternative
paradigm to the traditional RNN training subsumed under
the term “Reservoir Computing” (RC) [26] has become
popular and showed high performance in time-series
prediction. A special implementation called echo state
networks (ESNs), proposed by Jäger [14], has been
successfully applied to language processing [12, 25],
navigation tasks [6] and central pattern generation [28]. The
RC community has also been growing in recent years due to
the successful implementation of reservoirs in hardware [2,
23], supporting real-world applications like human action
recognition [1]. Although gestures are sequences similar to
sentences, human actions, or path trajectories, surprisingly
little is known about the potential application of ESNs to the
task of gesture recognition [8,15].
In this article, we present an extension of our previous
study on sensor-based continuous gesture recognition [24].
Although this research is dominated by vision data, other
important areas like rehabilitation, limb prosthesis or
controlling virtual environments use sensors like the so-
called “inertial measurement unit” (IMU). The reason is that
this sensor delivers movement data for direct input very fast
which allows real-time control for hand or arm movements
or instantaneous reactions in robotic interfaces.
Our paper is structured as follows: we will first review
recent research on smart devices and the different learning
techniques for continuous gesture recognition. In the
subsequent section, we will summarize our ESN framework
including the data recordings introduced earlier [24],
followed by the explanations of our new experiments using
an LSTM network. The performance of both approaches
will be compared in the evaluation section and contrasted
with other approaches in our discussion. We conclude our
paper with suggestions for prospective applications.
Related Work
Wearable or smart devices have influenced different
research domains such as controlling games and media
applications or prostheses and rehabilitation. Recent work
on continuous gesture recognition using smart sensors
primarily uses a set of standard learning techniques such
as dynamic time warping and only a few studies apply
recurrent neural networks to the task. Gupta et al. [10]
introduced an algorithm, which maps the sensory stream
from the gyroscope and accelerometer of a Samsung mobile
phone into a gesture codebook. To distinguish between an
actual gesture and no gestural activation, the dynamic time
warping (DTW) algorithm was used. Basically, the DTW
procedure aligns two sequences which may vary in speed
and, based on predefined similarity measures, can classify
a sequence to its most similar sequence or “template” in a
data corpus. On a set of 6, respectively, 12 gestures (actually
mirrored), their approach achieved an average accuracy
of 90% and 94% for users using the so-called portrait
mode. Interestingly, the performance dropped significantly
compared with uWave [19], a gesture recognition system for
accelerometer data introduced earlier. However, due to the
lack of benchmark data, a fair comparison between different
systems is difficult. Although the authors demonstrated
good performance with a rather simple approach, they fall
short of the number of gestures and it remains open whether
the creation of a codebook would scale up when extending
the gesture vocabulary.
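The DTW template matching described above can be illustrated with a minimal sketch. It is not the implementation of Gupta et al. [10]: the absolute difference as local cost, the one-dimensional toy signals, and the names `dtw_distance`, `classify`, and `templates` are assumptions made only for this example.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j]: minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local similarity measure
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


def classify(sequence, templates):
    """Return the label of the template with the smallest DTW distance."""
    return min(templates, key=lambda label: dtw_distance(sequence, templates[label]))


# Hypothetical templates; the query is the "up" movement at half speed.
templates = {
    "up": [0, 1, 2, 3, 2, 1, 0],
    "down": [0, -1, -2, -3, -2, -1, 0],
}
slow_up = [0, 0, 1, 1, 2, 2, 3, 3, 2, 2, 1, 1, 0, 0]
label = classify(slow_up, templates)
```

Because the warping path may repeat samples, the slower query still aligns with its template, which is why DTW tolerates differences in execution speed.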
Yang et al. [29] presented a system using data from
a surface electromyography sensor (sEMG) on an arm.
A sliding window procedure was used to segment the
sensor stream and a threshold applied to separate active
gestures from unintentional gestures or noise. To model
the gestures, Gaussian Mixture Models and Hidden Markov
Models (GMM; HMM) were trained and the divergence
between two models used for evaluation. The Kullback-
Leibler divergence displayed the difference between any
two models and was used to distinguish the 6 gesture
classes. The system achieved 97–100% accuracy; however,
the whole procedure was tested on samples from one person
only. Also, the performance time for the gestures was
always set to 4 seconds, which means that the models
neither captured any variances between users nor gesture
performance time. This aspect lowers the generalization to
other users and limits the system application. Furthermore,
the chosen model suite is known to be hard to train and thus,
again, the question of whether the system would scale well
to the number of gestures remains open.
In the context of HCI, wearables are also used in
the gaming domain because the sensor input directly
measured from the subject allows real-time processing. In
this regard, Li et al. [18] presented a system to control the
Jump&Go Fly Bird game. Gyroscope and accelerometer
data were collected from a wristband and gestures manually
segmented using video information to mark the start and
end phase of a gesture. A sliding window approach in
combination with DTW classified the game control gesture,
which was limited to raising a hand. All other gestures were
subsumed as “other” gesture. Although the system achieved
an F1 score of up to 99%, the application of this gesture
recognition system is restricted to binary classification.
Moreover, manual labeling becomes a time factor when
increasing the gesture vocabulary, thus more sophisticated
methods for gesture spotting need to be developed.
One such system presenting the application of recurrent
neural networks (RNNs) was introduced in the SLOTH
architecture [5]. A triaxial wearable accelerometer provided
data for 6 defined gesture classes and, additionally,
a “no gesture” class. The gesture spotting was trained
with a Long Short-Term Memory (LSTM) network. A
subsequent continuous gesture recognition module (CGR)
then classified the different gestures using two sets: one
dataset comprised sequences from 9 participants with a total
of 540 sequences, evaluated offline yielding an accuracy of
96.9%. A second dataset was restricted to one participant
only with combinations of different gestures. The CGR
module then classified the incoming data stream in an online
fashion, which substantially decreased the recall of gestures
while still yielding an average accuracy of 79.7% up to
90.6% when changing some critical system parameters.
In particular, the “no gesture” class was misclassified in 100% of the cases,
which the authors [5] explained by the accelerometer
settings in the device. The differences in the recall for
the other gestures may also be subject to the so-called
subgesture problem, i.e., an online classifier may output the
incorrect label because it confuses similar start patterns of
different gestures.
Sosin et al. [22] included domain adaptation in their
system to enhance the generalization of gesture recognition
to other persons. An sEMG sensor tracked hand movements
from 5 subjects in two conditions, mobile and immobile
wrist. An additional Leap motion device indicated the
correct position of the hand. The authors compared simple
and gated recurrent neural units and, additionally, trained
the RNNs with adversarial domain adaptation (ADA).
Employing ADA can help to prevent overfitting, and thus,
a trained network can be transferred to test different
subjects. The study revealed the superior performance of the
simple units in combination with ADA for both conditions,
evaluated using the (normalized) root mean-squared error,
which measures the displacement between the angle of
the correct gesture to the predicted one. The system is,
however, limited by the installation of the Leap motion
controller, which needs additional calibration for every
new user.
Similarly, Han and Yoon [11] proposed a system for the
recognition of 6 gestures obtained from a wireless triaxial
gyroscope worn by 5 subjects. The gyro data was preprocessed
using a sliding window and the normalized covariance
between the sequences calculated. The gesture with the maximum
covariance was then assigned
as the correct gesture and compared with a reference vector
from a trained support vector machine (SVM). The evaluation
of the single gestures performed by 4 of the participants
showed an average accuracy of 97% where the confusion
could be traced back to symmetric gestures like “left-right”
or “up-down.” However, the system was optimized to every
user with a preceding customization session. The evaluation
is, therefore, biased, as the session familiarizes each subject
with the task resulting in a stable gesture performance
yielding low variances among the gesture classes and clear
gesture waveforms. Regarding the symmetric gestures, the
system also showed no improvement when trained specifically
for a multimedia application.
A recent study by Wang and Ma [27], similar to the
one presented in this paper, introduced a recognition system
for 10 gestures performed by 40 users. The IMU from a
wearable sensor was used, yielding six features that were
corrected for unit differences and further projected to a
lower-dimensional space by principal component analysis.
The continuous gestures were segmented with a sliding
window and manually labeled. An SVM was trained to
output class labels, while a DTW enhanced the correct
recognition for variable-length sequences. The experiments
were divided into three scenarios: the recognition of a single
gesture, a sequence of different but predefined gestures,
and the recognition of arbitrary gesture combinations. The
single gesture condition achieved an average accuracy of
93.14% and when including a timing threshold, 97.28%,
mostly confusing a “circle” gesture or “up-down.” However,
single gestures are hardly relevant to the task of recognizing
gestures in a continuous stream. For the two other
experiments, the performance dropped to 86% for a
predefined sequence of gestures and decreased even more
when considering a random gesture sequence to 60%.
This result confirms that continuous gesture recognition
is a nontrivial task, heavily influenced by inter-subject
variability in gesture performance.
Dataset and Methodology
We created our own dataset because we are using sensor-
based data for gesture recognition and no public dataset is
available for this task. Therefore, we make our data and
the code of our models publicly available¹. We also explain
the experimental settings for both architectures used in this
study: an echo state network (ESN) and a Long Short-Term
Memory (LSTM) network.
Dataset Creation
First, we defined 10 gesture classes, shown in Fig. 1, inspired
by Lee et al. [17]. We chose the gesture types we think
correspond best to an action as an alternative to swiping or
speech commands. For instance, the first two snap gestures
in Fig. 1 can be used to slide photos to the left or right, or to
control the volume of a music application. Second, we set
up an experimental environment as demonstrated in Fig. 2
and invited 5 participants who performed each gesture 10
times, resulting in a total of 500 variable-length sequences.
The input data was collected from the inertial measurement
unit (IMU) of a smartphone with Android OS.

¹ https://github.com/swtietz/UHH-IMU-gestures-comparison

Fig. 1 We defined ten gestures used for our experiments. The challenge for a system is to recognize variable-length gestures (cf. snap vs. shake),
where one shorter gesture may be a subgesture of another, and to distinguish gesture symmetry (e.g., up vs. down). (Figure from [24])
[Panels: Snap Left, Snap Right, Snap Forward, Snap Backward, Bounce Up, Bounce Down, Rotate Horizontal, Rotate Vertical, Shake Left Right, Shake Forward Backward]

The IMU delivers 9 features: the three orientation axes {x, y, z}, the
rotation velocity along these axes, and, correspondingly,
the acceleration. Before feeding the sensor values into the
network, they are normalized channel-wise, such that the
maximum norm of each three-dimensional signal is 1. The
normalization has been carried out over the whole dataset.
The signals have a frequency of 30 Hz, i.e., one time
step corresponds to ≈0.03 s. Figure 2 outlines the data
recordings: a participant was seated at a table opposite to
a supervisor, who tracked each gesture performance and
marked both the gesture onset and the gesture finish. Each
participant was instructed on how to hold the phone and
which gesture in which direction is expected for each trial
(e.g., shake left-right).

Fig. 2 The settings for the data recordings. A subject was seated opposite to the supervisor who gave instructions on when to lift the phone and which gesture type is expected. No time constraints or any further help on the gesture performance was given
[Diagram labels: Data Collection; Supervisor operating Teacher Server; Participant with smartphone and Gesture Recognition App; marker stream …000222221111…; Sensor values: gyroscope, accelerometer and fused values; Manually set markers, produced by teacher; One .npz-file per gesture, each containing 10 samples; Preprocessing module: splitting by gesture, adding ground truth]

However, neither time restrictions nor
help was given. After the performance of all trials per
gesture type, the participants had a little break to avoid
fatigue. After the recordings, the ground truth was added
(cf. [24]).
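The channel-wise normalization mentioned in this section can be read as dividing each three-dimensional sensor signal by its largest Euclidean norm over the whole dataset. The following sketch shows that reading only. The feature order (orientation, rotation velocity, acceleration) and the name `normalize_sensors` are assumptions for illustration and are not taken from the published code.

```python
import math

# Assumed layout of the 9 features: three consecutive 3-D sensor groups.
SENSOR_GROUPS = [(0, 3), (3, 6), (6, 9)]


def normalize_sensors(samples):
    """Scale each 3-D sensor group so that its largest vector norm is 1.

    samples: list of time steps from the whole dataset, 9 values each.
    """
    # Largest Euclidean norm per sensor group over all time steps.
    max_norms = [
        max(math.sqrt(sum(v * v for v in step[lo:hi])) for step in samples)
        for lo, hi in SENSOR_GROUPS
    ]
    normalized = []
    for step in samples:
        row = []
        for (lo, hi), norm in zip(SENSOR_GROUPS, max_norms):
            scale = norm if norm > 0 else 1.0  # leave all-zero sensors untouched
            row.extend(v / scale for v in step[lo:hi])
        normalized.append(row)
    return normalized


# Two toy time steps standing in for the full recording.
data = [
    [3.0, 4.0, 0.0, 0.0, 0.0, 2.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.5, 0.0],
]
scaled = normalize_sensors(data)
```

Scaling the three axes of one sensor by a common factor keeps the direction of each measurement intact, which would not hold if every axis were scaled separately.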
All experiments follow the leave-one-out cross-validation
(LOOCV) protocol presented in our previous study [24]:
we use data from n−1 subjects as the training set, where
streams are segmented into individual gestures, concatenated
to continuous sequences and shuffled randomly. Data
from the remaining person is then used as the test set. We
used grid search to obtain optimal parameters for both
architectures. We will now explain the specific configurations of
the two networks we used in this study.
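The evaluation protocol can be sketched as follows. The sketch only illustrates the leave-one-subject-out split and the enumeration of the ESN grid from Table 1. The subject identifiers, the tuple format of a gesture sample, and the function names are hypothetical, and how the grid search was combined with the folds in the actual experiments is not specified here.

```python
import itertools
import random

# Grid values as listed in Table 1; reservoir size and connectivity are fixed there.
ESN_GRID = {
    "input_scaling": [1, 5, 9, 13],
    "rho": [0.1, 0.4, 0.7, 1.0, 1.3],
    "alpha": [0.1, 0.3, 0.5, 0.7, 0.9],
    "lam": [0.01, 0.1, 1, 10],
}


def grid_configurations(grid):
    """Enumerate every hyperparameter combination as a dict."""
    names = list(grid)
    return [dict(zip(names, values)) for values in itertools.product(*grid.values())]


def loocv_splits(gestures_by_subject, seed=0):
    """Yield (held_out_subject, training_gestures, test_gestures) per subject."""
    rng = random.Random(seed)
    for held_out in sorted(gestures_by_subject):
        train = [
            gesture
            for subject, gestures in gestures_by_subject.items()
            if subject != held_out
            for gesture in gestures
        ]
        rng.shuffle(train)  # random gesture order before concatenation into a stream
        yield held_out, train, list(gestures_by_subject[held_out])


# Toy stand-in for the dataset: 5 subjects x 10 gesture classes x 10 repetitions.
data = {
    f"S{k}": [(f"S{k}", gesture, rep) for gesture in range(10) for rep in range(10)]
    for k in range(1, 6)
}
splits = list(loocv_splits(data))
configs = grid_configurations(ESN_GRID)
```

With 5 subjects, every fold trains on 400 gesture samples and tests on the 100 samples of the held-out person, and the grid above contains 4 × 5 × 5 × 4 = 400 configurations.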
Experimental Settings for the ESN
An echo state network (ESN) [14] is a specific implementation
of the Reservoir Computing paradigm [26]. The model
is separated into a randomly initialized hidden layer or
“reservoir,” which stays fixed, and a trainable readout. The
key idea is to project any input to the reservoir into a high-
dimensional feature space similar to kernels used by support
vector machines. The projection allows the application of
simpler training techniques, usually linear models. Given an
input u, the reservoir states x are computed as:

$$\tilde{x}(t+1) = f\big(u(t+1)\,W_{in} + x(t)\,W_{res} + y(t)\,W_{fb} + \nu(t)\big) \qquad (1)$$

$$x(t+1) = (1-\alpha)\,x(t) + \alpha\,\tilde{x}(t+1) \qquad (2)$$

where f is the activation function (here tanh) and α is the
leak rate. The layer-wise connectivity matrices W_* for the
input, reservoir, and feedback remain fixed while training
the network. We initialize W_in sparsely with only 10% of
all weights set. Inputs are multiplied with the input scaling
parameter before being fed to W_in. W_res is initialized fully
connected with Gaussian weights and then re-scaled to
the desired spectral radius. As we are using the ESN for
supervised learning, we set the feedback matrix W_fb and the
noise term ν to 0. We also used the full training sequences,
i.e., no states were discarded. Given the teacher signal Y and
the state matrix X, in which all states are collected:

$$Y = g(W_{out} X) \qquad (3)$$

where g is the linear output activation function, the output
weights W_out can be computed using ridge regression:

$$W_{out} = Y X^{T} (X X^{T} + \lambda I)^{-1} \qquad (4)$$

where λ is the regularization coefficient and I is the identity
matrix. The general ESN architecture is shown in Fig. 3.
The ESN used in this study has a 9-dimensional input layer,
which corresponds to the sensor values obtained from the
data collection. Additionally, we employed input scaling to
exploit the range of the used tanh activation function [20],
i.e., all input values are multiplied with a certain scaling
coefficient. We fixed the reservoir to 400 neurons, which achieved
the best performance, after having tested sizes starting from
25 neurons, doubling the size at each step, up to 3600. We stopped
our trial experiments for the reservoir size at this value as the
performance started to decrease significantly. We observed
a plateau-like behavior in the performance for 400, 800,
and 1600 neurons, which is why we agreed on a smaller
reservoir size for faster training. The output layer consists of
10 neurons, representing the 10 gesture classes. We trained
the ESN using ridge regression. All hyperparameters are
summarized in Table 1 with the best value resulting from
our grid search in bold.
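The reservoir update (Eqs. 1 and 2) and the ridge regression readout (Eq. 4) can be written down compactly with NumPy. This is a minimal sketch and not the published implementation: the uniform distribution of the non-zero input weights, the random toy data, and the particular values picked from the grid in Table 1 are assumptions, and the feedback and noise terms are omitted because they are set to 0 in this study.

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_res, n_out = 9, 400, 10   # sizes reported in this section
kappa = 0.1                       # connectivity of W_in (Table 1)
# Arbitrary picks from the grid in Table 1, not the tuned values.
input_scaling, rho, alpha, lam = 1.0, 0.7, 0.3, 0.1

# Sparse input weights: roughly a fraction kappa of the entries is non-zero.
W_in = rng.uniform(-1, 1, (n_in, n_res)) * (rng.random((n_in, n_res)) < kappa)
# Fully connected Gaussian reservoir, rescaled to spectral radius rho.
W_res = rng.standard_normal((n_res, n_res))
W_res *= rho / np.max(np.abs(np.linalg.eigvals(W_res)))


def run_reservoir(U):
    """Collect leaky reservoir states for inputs U of shape (T, n_in)."""
    x = np.zeros(n_res)
    states = np.empty((len(U), n_res))
    for t, u in enumerate(U):
        x_tilde = np.tanh(input_scaling * u @ W_in + x @ W_res)  # Eq. 1, W_fb = 0, nu = 0
        x = (1 - alpha) * x + alpha * x_tilde                    # Eq. 2
        states[t] = x
    return states.T  # state matrix X of shape (n_res, T)


def train_readout(X, Y):
    """Ridge regression readout: W_out = Y X^T (X X^T + lam I)^-1 (Eq. 4)."""
    A = X @ X.T + lam * np.eye(X.shape[0])
    # A is symmetric, so solving A Z = X Y^T and transposing gives W_out.
    return np.linalg.solve(A, X @ Y.T).T


# Toy stream: random inputs and a teacher signal with one active gesture.
T = 200
U = rng.standard_normal((T, n_in))
Y = np.zeros((n_out, T))
Y[3, 50:100] = 1.0

X = run_reservoir(U)
W_out = train_readout(X, Y)
Y_hat = W_out @ X  # linear readout, Eq. 3 with g as the identity
```

Only `W_out` is fitted, and it is obtained from a single linear solve instead of iterative gradient descent, which is consistent with the short ESN training times reported in Table 2.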
Fig. 3 Layout of an echo state network, in our experiment used with
leaky neurons. The hyperparameters correspond to the ones in Table 1

Table 1 Reservoir parameters: ρ is the spectral radius, α is the leak
rate, κ denotes connectivity, and λ is the regularization coefficient

Hyperparameter    Range
Reservoir size    400
κ                 0.1
Input scaling     [1, 5, 9, 13]
ρ                 [0.1, 0.4, 0.7, 1.0, 1.3]
α                 [0.1, 0.3, 0.5, 0.7, 0.9]
λ                 [0.01, 0.1, 1, 10]

Input scaling is a factor with which the input is multiplied before being
fed into the network

Experimental Settings for the LSTM
An LSTM [13] is an RNN capable of learning both short-
term and long-term dependencies. The core concept of
LSTMs lies in the cell state, whose information is regulated
through three gates. First, an input gate controls which
values are stored in the cell state. Which information is
removed is then decided by a forget gate. Finally, an output
gate is responsible for choosing which values are used for
the activation of an LSTM unit. Figure 4 shows the LSTM
architecture used in this study. The calculations for LSTM
training are shown in Eqs. 5–7 for the gate mechanisms.
Equations 8 and 9 show the calculations within the layers,
i.e., the cell states and the hidden states as used in our
implementation.

$$i(t) = \sigma\big(W_{ix}\,x(t) + W_{ih}\,h(t-1) + b_i\big) \qquad (5)$$

$$f(t) = \sigma\big(W_{fx}\,x(t) + W_{fh}\,h(t-1) + b_f\big) \qquad (6)$$

$$o(t) = \sigma\big(W_{ox}\,x(t) + W_{oh}\,h(t-1) + b_o\big) \qquad (7)$$

$$c(t) = f(t) \circ c(t-1) + i(t) \circ \tanh\big(W_{cx}\,x(t) + W_{ch}\,h(t-1) + b_c\big) \qquad (8)$$

$$h(t) = \tanh(c(t)) \circ o(t) \qquad (9)$$

where i(t), f(t), and o(t) denote the input, forget, and output
gates respectively. x(t), c(t), and h(t) are the LSTM unit
input, cell state, and hidden state respectively. W and b
represent the weights and biases, σ is the sigmoid activation
function, and ◦ is the element-wise multiplication.
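To make Eqs. 5–9 concrete, the following NumPy sketch performs the forward pass of a single LSTM layer step by step. It is not the PyTorch implementation used in the experiments: the parameter dictionary, its key names, and the random initialization are assumptions for illustration, and `n_hidden = 80` mirrors the number of cells chosen in this section.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM time step following Eqs. 5-9; p holds weights and biases."""
    i = sigmoid(p["W_ix"] @ x + p["W_ih"] @ h_prev + p["b_i"])  # input gate, Eq. 5
    f = sigmoid(p["W_fx"] @ x + p["W_fh"] @ h_prev + p["b_f"])  # forget gate, Eq. 6
    o = sigmoid(p["W_ox"] @ x + p["W_oh"] @ h_prev + p["b_o"])  # output gate, Eq. 7
    g = np.tanh(p["W_cx"] @ x + p["W_ch"] @ h_prev + p["b_c"])  # candidate cell input
    c = f * c_prev + i * g                                      # cell state, Eq. 8
    h = np.tanh(c) * o                                          # hidden state, Eq. 9
    return h, c


n_in, n_hidden = 9, 80
rng = np.random.default_rng(0)
params = {}
for gate in "ifoc":
    params[f"W_{gate}x"] = rng.uniform(-0.1, 0.1, (n_hidden, n_in))
    params[f"W_{gate}h"] = rng.uniform(-0.1, 0.1, (n_hidden, n_hidden))
    params[f"b_{gate}"] = np.zeros(n_hidden)

# Run the cell over 30 random time steps (about 1 s of data at 30 Hz).
h = np.zeros(n_hidden)
c = np.zeros(n_hidden)
for x in rng.standard_normal((30, n_in)):
    h, c = lstm_step(x, h, c, params)
```

The additive update of `c` in Eq. 8 is what lets information persist over many time steps, while the gates decide per unit how much is written, kept, and exposed.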
We implemented a simple recurrent network in PyTorch²
consisting of one LSTM layer and a 10-cell fully connected
linear readout layer with no bias stacked on top. In contrast
to the ESN, where the different sensors have been manually
tuned [24], we now scale each channel of each sensor
individually to have a standard deviation of 1. We tried 10,
20, 40, 80, and 160 recurrent cells in the hidden layer but
found that only small gains in F1 score and accuracy could
be achieved when increasing the number of neurons beyond
40 and, therefore, chose 80 LSTM cells for further
experiments.
Although we are in a classification setting we treat the
task as a regression problem and use the mean-squared
error for training. This is due to the fact that we integrate
gesture pauses, i.e., sequences with no gestures, in between
the gestures, during which we want no neuron to be
active. When using categorical cross-entropy, and therefore
softmax as the final activation function, the network cannot
represent “no gesture” sequences, because the softmax outputs
always sum to 1. We use the Adam
optimizer [16] with a learning rate η set to 0.001, and train
for a maximum of 100 epochs. We use 10% of our training
set as a validation set and apply early stopping once the
validation error starts to increase.
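The choice of a regression target over a softmax output can be illustrated in a few lines. This is a sketch under stated assumptions: the per-time-step label format with `None` for pauses and the helper names are hypothetical and only serve to show why an all-zero target is representable with a mean-squared error but not with softmax.

```python
import math

N_CLASSES = 10


def frame_targets(labels, n_classes=N_CLASSES):
    """Per-time-step targets: one-hot for a gesture, all zeros for a pause."""
    targets = []
    for label in labels:
        row = [0.0] * n_classes
        if label is not None:  # None marks a "no gesture" time step
            row[label] = 1.0
        targets.append(row)
    return targets


def mse(outputs, targets):
    """Mean-squared error over all time steps and output neurons."""
    total = sum(
        (o - t) ** 2
        for out, tgt in zip(outputs, targets)
        for o, t in zip(out, tgt)
    )
    return total / (len(outputs) * len(outputs[0]))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    norm = sum(exps)
    return [v / norm for v in exps]


labels = [None, None, 2, 2, 2, None]  # pause, gesture class 2, pause
targets = frame_targets(labels)

# A silent network output matches the pause targets exactly.
silent = [[0.0] * N_CLASSES for _ in labels]
pause_error = mse(silent[:2], targets[:2])

# Softmax outputs sum to 1 no matter how negative the logits are.
probs = softmax([-20.0] * N_CLASSES)
```

Since the largest softmax output can never fall below 1/10 for ten classes, a softmax readout would need an explicit eleventh "no gesture" class to express a pause, whereas the regression readout simply stays silent.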
² https://pytorch.org/

Fig. 4 The implemented LSTM architecture. We used only one
recurrent layer but performed a grid search over the number of cells
[Diagram labels: Input, LSTM Layer, MSE]

Evaluation Scheme and Results
The major challenge in continuous gesture recognition is
the gesture spotting followed by the correct classification of
the actual gesture. While the gesture spotting is problematic
due to variable-length gestures, the classification is often
hampered by so-called subgestures. Dynamic gestures, such
as commands, often share the same start movements, e.g.
lifting an arm or the phone. Therefore, the whole gesture
sequence has to be parsed first and then mapped to the
correct gesture label. Figure 5 illustrates the problem
with an example taken from our LSTM model: while the whole
sequence is a shake, the start is misclassified as snap.
In our previous study [24], we introduced an evaluation
mapping scheme from actual gesture sequences to their
corresponding ground truth sequences (Fig. 6). Due to the
variable-length sizes of the gesture samples and the “no
gesture” condition, we suggested the following: We apply
a ReLU non-linearity over the network output and sum up
all remaining activity at every time step. If the total activity
is above a predefined threshold of 0.4, we start summing
up all individual outputs over time until the total activity
falls below the threshold again. The whole segment is then
labeled as belonging to the class of the neuron that had the
highest total activity. On the resulting segments, we run our
mapping algorithm:
– Only one true positive (TP) mapping is allowed
– A wrong prediction is counted as “wrong gesture”
(WG)
– A prediction that does not overlap with a class segment is
counted as false positive (FP)
– An actual class without a mapping is a false negative
(FN)
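The segmentation step and the mapping rules listed above can be sketched as follows. This is one possible reading, not the reference implementation from [24]: the segment format `(start, end, label)` with an exclusive end, the choice of the ground-truth segment with the largest overlap, and the rule that a "wrong gesture" also consumes the mapping of its segment are assumptions made for illustration.

```python
def segment_predictions(outputs, threshold=0.4):
    """Turn per-step outputs into (start, end, label) segments, end exclusive."""
    segments = []
    start, totals = None, None
    for t, frame in enumerate(outputs):
        active = [max(0.0, v) for v in frame]  # ReLU over the network output
        if sum(active) > threshold:
            if start is None:                  # total activity crossed the threshold
                start, totals = t, [0.0] * len(frame)
            totals = [a + b for a, b in zip(totals, active)]
        elif start is not None:                # activity fell below the threshold
            segments.append((start, t, totals.index(max(totals))))
            start = None
    if start is not None:                      # stream ended inside a segment
        segments.append((start, len(outputs), totals.index(max(totals))))
    return segments


def overlap(a, b):
    """Number of time steps shared by two segments."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def map_segments(predicted, truth):
    """Count TP, WG, FP and FN under the assumed mapping rules."""
    counts = {"TP": 0, "WG": 0, "FP": 0, "FN": 0}
    used = set()  # ground-truth segments that already received a mapping
    for pred in predicted:
        best = max(range(len(truth)), key=lambda k: overlap(pred, truth[k]), default=None)
        if best is None or overlap(pred, truth[best]) == 0 or best in used:
            counts["FP"] += 1
            continue
        used.add(best)
        counts["TP" if pred[2] == truth[best][2] else "WG"] += 1
    counts["FN"] = len(truth) - len(used)
    return counts


# Toy output of a network with two output neurons over 11 time steps.
outputs = [
    [0.0, 0.0], [0.5, 0.1], [0.9, 0.0], [0.1, 0.1], [0.0, 0.0], [0.0, 0.8],
    [-0.3, 0.7], [0.0, 0.0], [0.0, 0.1], [0.0, 0.2], [0.0, 0.0],
]
truth = [(1, 3, 0), (5, 7, 0), (8, 10, 1)]
predicted = segment_predictions(outputs)
counts = map_segments(predicted, truth)
```

In the toy stream, the first segment is a correct detection, the second carries the wrong label, and the weak activity during the third gesture never crosses the threshold, so that gesture is missed.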
Table 2 reports the average accuracy and F1 score of
the test sets averaged over all subjects. The values in
parentheses denote the standard deviation. We also provide
the corresponding training times for comparison. While the
LSTM shows superior performance for the F1 score, the
difference in accuracy between both models is less apparent. The
table also shows shorter training times for the ESN framework
(2.6 s vs. 88.9 s on average).

Fig. 5 Example of the output activation from the LSTM architecture.
Activation for all 10 output neurons is plotted over the first 500 time
steps, the ground truth is shown with transparent boxes. The subgesture
problem shows very nicely for the shake gestures: The network commonly
predicts a snap gesture during the early, ambiguous phase and
only after a whole shake is performed the shake neuron turns active
As explained before, gesture sequences in a continuous
stream are easily misclassified due to subgestures and
variances in the performance of a gesture. Therefore, we
analyzed the results from each participant for both the ESN
and the LSTM to explain the possible error sources. Table 3
shows the individual evaluations for the ESN and the LSTM.
The different F1 scores among the subjects visible in the
table indicate high inter-subject variability. The values in
the parentheses denote the standard deviation, which is low
for all subjects.
We show in Figs. 7, 8, and 9 the worst, average, and
best performance from the set of our participants evaluated
from our ESN (the participant names are anonymous). The
misclassifications result from the average of all trials per
subject, which explains why the values are not integers.
Most confusion between the individual gestures is shown
in Fig. 7. Noteworthy, the gestures {snap left, snap right}
are confused with the longer gesture shake left-right,
which supports that the subgesture problem affects the
performance. Similarly, the snap backward gesture is
misclassified as shake up-down. Further, the directions in
the shake gestures are confused between up-down and left-
right. Notably, the “no gesture” class also negatively influences
the performance by producing misclassification with almost
all gesture classes (cf. bottom row and final column in
Fig. 7). The effect of a possible prefix gesture is also shown
in Fig. 8 for the snap left gesture; however, most of the
misclassifications stem from wrong predictions of the “no
gesture” class. Finally, Fig. 9 shows the highest score in the
gesture performance, emphasizing again the confusion of
“no gesture” with other gesture sequences.
The right column of Table 3 demonstrates the results
from each participant for the LSTM. Figures 10, 11, and
12 provide insights into the misclassification from the
worst to the best performance as produced by our LSTM
experiments. Again the overall performance is affected by
confusion of subgestures like snap left with the corresponding
shake gesture. The snap backward gesture is mostly
confused with shake up-down. Moreover, the symmetry of
the bounce gesture has negatively influenced the prediction
results for the best performance (Fig. 12). Similarly,
the average performance as shown in Fig. 11 results primarily
from misclassifications of the snap left gesture. Finally,
the best performance shown in Fig. 10 has the most prominent
confusion between the symmetric bounce up-down
gestures.

Fig. 6 Example of our proposed mapping scheme [24]. The target gesture
stream consists of gesture 1, followed by a silent phase or “no
gesture,” another performance of gesture 1 and a subsequent pause and
finally gesture 2. In case number 2, the erroneous prediction results in
a “wrong gesture” followed by a false positive as we constrained gesture
segments to be mapped as correct only once. The false negative is
a consequence of the long “no gesture” prediction (case number 6)

Table 2 Average F1 score, accuracy on the test sets, and training times

                      ESN            LSTM
F1 score              0.78 (0.09)    0.87 (0.07)
Accuracy              0.87 (0.03)    0.93 (0.04)
Train time (in sec)   2.6 (0.03)     88.9 (15.1)

Standard deviations are given in parentheses
Discussion
We presented an experimental study on continuous gesture
recognition to compare the performance between an
echo state network (ESN) and Long Short-Term Memory
(LSTM) network. Both networks are special architectures
of recurrent neural networks and have been successfully applied to
sequence processing. Given the inherent variances on how
to perform gestures, so-called inter-subject variability,
we were interested in the computational performance of
both approaches as they are conceptually different. We
used a dataset introduced in an earlier work [24] for
training and testing both models. Our evaluation showed
that our ESN framework achieved an accuracy comparable
to the LSTM, while being faster to train (cf. Table 2).
Therefore, we conclude ESNs to be a viable model for
continuous gesture recognition using sensor data and, due
to their fast recognition times, to be an ideal candidate
for tasks that require real-time processing. The applications
are manifold and range from simple body tracking to
concrete rehabilitation of the limb apparatus as shown for,
e.g., sEMG devices [29]. Recent approaches primarily use
simple techniques like sliding windows and DTW or an
SVM. However, all the techniques need special tuning of
the window size or the specific kernel [27], which led
to an additional subject customization step in the system
[11]. In our study, we show how ESNs learn different
activity patterns while being able to distinguish between
gestures, subtle movements, and pauses. This highly
facilitates gesture spotting where no extra hardware is
needed [22]. Moreover, the working principles behind both
approaches presented here address time-varying patterns in
contrast to studies using a fixed time window [27, 29],
which masks the inter-subject variability problem. The
nontrivial problem of learning these variances is highlighted
by the study presented by Wang and Ma [27]: while their
system achieved 86.99% accuracy for a fixed number of
gestures concatenated into one stream, the performance
drops to 60% for arbitrary sequences. In contrast, our
experiments achieved an average accuracy of 93% for
LSTM and 87% for the ESN using randomly shuffled
gestures. Learning distinct gesture patterns performed by
human subjects is key to a flexible recognition system,
which provides an intuitive interface without preliminary
customer calibrations or fixation of gesture lengths to
arbitrary values. We believe that our experiments show a
new research direction in a domain that is still dominated by
classic computational techniques like DTW.
Other important factors affecting the performance
are subgestures and symmetric gestures [11]. In our
experiments, the individual gestures snap left and snap right
as well as the snap forward and snap backward gestures
were often misclassified as their corresponding
shake gestures in both networks. Also, classification errors
occurred between snap up and snap down with snap forward
and snap backward. We hypothesize that those sequences
are too short and, as we did not provide any help on
the actual gesture performance, the participant might have
held the phone such that the sensors detect a backward
motion. The influence of misclassification caused by the
issues described is more prominent for our ESN framework.
The individual evaluation revealed many errors in the
upper triangular part of the confusion matrix for the worst
performance (cf. Fig. 7), which corresponds to low recall.
As a result, the low recall negatively impacts the F1 score for
the ESN, which is 9% worse than the LSTM. However, the
difference in the overall accuracy is less pronounced.
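The relation between off-diagonal entries of a confusion matrix, recall, F1 score, and accuracy can be sketched in a few lines of Python. The sketch rests on assumptions that are not taken from our experiments: rows are taken to be true classes and columns predicted classes, the F1 score is macro-averaged over classes, and the three-class matrix `cm` is a made-up toy example, not the data behind Fig. 7.

```python
def per_class_metrics(cm):
    """Return (precision, recall, F1) per class for a square confusion matrix.

    Assumed convention: cm[t][p] counts samples of true class t predicted as p.
    """
    k = len(cm)
    metrics = []
    for c in range(k):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # rest of row c: missed samples
        fp = sum(cm[r][c] for r in range(k)) - tp   # rest of column c: false alarms
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def macro_f1(cm):
    """Unweighted mean of the per-class F1 scores (assumed averaging scheme)."""
    metrics = per_class_metrics(cm)
    return sum(f1 for _, _, f1 in metrics) / len(metrics)


# Hypothetical toy matrix: half of class 0 is predicted as class 2,
# i.e., the errors sit in the upper triangular part.
cm = [
    [5, 0, 5],
    [0, 10, 0],
    [0, 0, 10],
]

for label, (p, r, f1) in enumerate(per_class_metrics(cm)):
    print(f"class {label}: precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
print(f"accuracy={accuracy(cm):.3f} macro F1={macro_f1(cm):.3f}")
```

Under this convention, errors in row 0 lower the recall of class 0 and the precision of the class that absorbs them, so two per-class F1 scores drop while every correctly classified sample still counts fully towards accuracy.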
Interestingly, the performance of participant J changed
from worst in the ESN model to best for the LSTM
and vice versa for the best performance of subject L (cf.
Table 3). When looking into our data, we observed highly
distinct gesture trajectories for participant J for each gesture
while subject L performed each gesture similarly. The better
performance of the LSTM for varying gesture sequences
confirms the superior performance for the F1 score and
the robustness of this network to recognize variable-length
gestures. We explain the switched role for subject L with our
mapping algorithm, which is tailored to the ESN output. We
think that a modular approach as presented by Carfi et al.
[5] would yield a better evaluation as a subsequent classifier
is explicitly used. Finally, both models were error-prone to
subgestures and variability of gesture sequences, especially
pronounced for symmetric gestures, which supports that
continuous gesture recognition is a nontrivial task. More
research on these aspects in the future could be useful to
expand the application area.
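The per-subject evaluation summarized in Table 3 follows a leave-one-subject-out scheme, in which each participant serves once as the test set while the remaining four are used for training. A minimal Python sketch of such a split is given below. The sample layout (a list of dictionaries with `subject`, `sequence`, and `label` keys), the function name `leave_one_subject_out`, and the placeholder sequences are assumptions for illustration and do not reflect the format of our dataset.

```python
def leave_one_subject_out(samples):
    """Yield (held_out_subject, train_samples, test_samples) for every subject."""
    subjects = sorted({s["subject"] for s in samples})
    for held_out in subjects:
        train = [s for s in samples if s["subject"] != held_out]
        test = [s for s in samples if s["subject"] == held_out]
        yield held_out, train, test


# Placeholder data: two dummy sequences per subject (not real recordings).
samples = [
    {"subject": subject, "sequence": [0.0, 0.0], "label": "snap left"}
    for subject in ["J", "Ni", "S", "Na", "L"]
    for _ in range(2)
]

folds = list(leave_one_subject_out(samples))
for held_out, train, test in folds:
    train_subjects = sorted({s["subject"] for s in train})
    print(f"test: {held_out}  train: {train_subjects}  "
          f"({len(train)} train / {len(test)} test sequences)")
```

Because no sequence of the held-out subject is seen during training, the test score of each fold reflects how well a model generalizes to an unseen way of performing the gestures.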

Table 3 Individual evaluations from our ESN and LSTM experiments with one subject as test set. Standard deviations are given in parentheses

                          ESN                        LSTM
Test set  Train sets      F1 train     F1 test      F1 train     F1 test
J         L, S, Ni, Na    0.97 (0.01)  0.64 (0.04)  0.98 (0.01)  0.95 (0.02)
Ni        Na, J, L, S     0.97 (0.01)  0.81 (0.05)  0.97 (0.02)  0.83 (0.04)
S         Ni, Na, J, L    0.97 (0.01)  0.80 (0.05)  0.98 (0.02)  0.90 (0.08)
Na        J, L, S, Ni     0.98 (0.01)  0.80 (0.06)  0.98 (0.01)  0.88 (0.05)
L         S, Ni, Na, J    0.96 (0.01)  0.88 (0.05)  0.96 (0.03)  0.80 (0.04)

The main question regarding the further application of
our system addresses the size of the gesture vocabulary
and the number of samples in the dataset. Although the 10
gestures used in this study are a high number compared with
recent research [5,10,18,29], and although we involved 5 participants
to include gesture diversity, our experiments are evaluated
on a total of only 500 sequences. We obtained high
performance for the LSTM both in F1 score and accuracy,
the latter comparable with our ESN framework. It would be
interesting to substantiate our results on
a larger dataset. Unfortunately, to date no benchmark
dataset for sensor-based continuous gesture recognition
exists, which precludes a reasonable and fair comparison
of different computational architectures. We hypothesize
that extending the gesture vocabulary with more and distinct
gesture types together with a significant increase in the
sample size will be challenging for the standard ESN
architecture. Having made our dataset publicly available, we hope
that the data issue will gain attention from more researchers,
yielding a more diverse dataset in the future. Variability in
the gesture performance among the subjects but also for
every gesture itself and the subgesture problem will then
be more pronounced. It remains an open question whether
more sophisticated ESN architectures like Deep ESN [9]
or different ESN topologies [21] are key to scaling up the
gesture recognition tasks. Comparisons with deep learning
approaches such as LSTM will shed further light on the
applicability of models from different paradigms to
real-world scenarios such as sensor interfaces or human-robot
interaction.
Fig. 7 Confusion matrix of the
worst performance of a
participant of our study resultant
from our ESN. The
misclassifications are mainly
among short gestures predicted
to be a longer but similar gesture
(subgesture) and for symmetric
gestures
Fig. 8 Confusion matrix derived
from a subject with average
performance resultant from our
ESN. Most confusions are
between bounce gestures. The
subgesture problem is less
pronounced
Fig. 9 Confusion matrix of the
best performance among all
subjects resultant from our ESN.
Only a few misclassifications
occur
Fig. 10 Confusion matrix of the
worst performance among all
subjects resultant from our
LSTM architecture. Most of the
misclassifications stem from the
shake gestures and their
corresponding subgestures
Fig. 11 Confusion matrix
derived from a subject with
average performance resultant
from our LSTM architecture.
The main source of confusion is
the left gesture
Fig. 12 Confusion matrix of the
best performance among all
subjects resultant from our
LSTM architecture. The only
significant confusion is between
the bounce gestures
Conclusion
Continuous gesture recognition is a challenging task due to
high variances of gestures in a stream and the easy
confusion of inherently similar gestures. The goal of
our present study was a performance comparison of a
previously introduced echo state framework with a Long
Short-Term Memory network, which is a state-of-the-art
model for sequence processing. Our results confirm the
robust processing of continuous gesture streams for both the
LSTM and the ESN model, the latter showing comparable
performance. As their training is much faster than that of the
LSTM, echo state networks are suitable computational models for
experiments that require real-time processing. Our study reveals
the impact of variability in the gesture performance and of
subgestures on the recognition performance of both models. We
assume that these factors will have a stronger effect when
considering a larger gesture vocabulary with more data from
subjects than currently available. To date, little is known
about the capabilities of those networks in large gesture
recognition scenarios. We hypothesize that further research
on echo state networks will lead to novel developments
of network architectures, resulting in potential applications
for many domains such as rehabilitation or human-robot
interaction.
Acknowledgments The authors would like to thank the anonymous
reviewers for their valuable comments on an earlier version of the
manuscript.
Compliance with Ethical Standards
The authors declare that they have no conflict of interest.
Funding Information Open Access funding provided by Projekt DEAL.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as
long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence, and indicate
if changes were made. The images or other third party material in
this article are included in the article’s Creative Commons licence,
unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your
intended use is not permitted by statutory regulation or exceeds
the permitted use, you will need to obtain permission directly from
the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
References
1. Antonik P, Marsal N, Brunner D, Rontani D. Human
action recognition with a large-scale brain-inspired photonic
computer. Nat Mach Intell. 2019;1:530–537.
https://doi.org/10.1038/s42256-019-0110-8.
2. Argyris A, Bueno J, Fischer I. Photonic machine learning
implementation for signal recovery in optical communications.
Scientific Reports. 2018;8:8487.
https://doi.org/10.1038/s41598-018-26927-y.
3. Bengio Y, Boulanger-Lewandowski N, Pascanu R. Advances in
optimizing recurrent networks. In: IEEE International conference
on acoustics, speech and signal processing; 2013. p. 8624–8628.
4. Bengio Y, Simard P, Frasconi P. Learning long-term
dependencies with gradient descent is difficult. IEEE Transactions on
Neural Networks. 1994;5(2):157–166.
https://doi.org/10.1109/72.279181.
5. Carfi A, Motolese C, Bruno B, Mastrogiovanni F. Online
human gesture recognition using recurrent neural networks and
wearable sensors. In: 2018 27th IEEE international symposium
on robot and human interactive communication (RO-MAN); 2018.
p. 188–195.
6. Dasgupta S, Wörgötter F, Manoonpong P. Information dynamics
based self-adaptive reservoir for delay temporal memory tasks.
Evolving Systems. 2013;4(4):235–249.
7. Escalera S, Baró X, Gonzàlez J, Bautista MA, Madadi M,
Reyes M, Ponce-López V, Escalante HJ, Shotton J, Guyon I.
ChaLearn looking at people challenge 2014: Dataset and results.
In: Agapito L, Bronstein MM, Rother C, editors. Computer
Vision - ECCV 2014 Workshops. Cham: Springer
International Publishing; 2015. p. 459–473.
8. Gallicchio C, Micheli A. A reservoir computing approach for
human gesture recognition from kinect data. In: S. Bandini,
G. Cortellessa, F. Palumbo (eds.) Proceedings of the Artificial
Intelligence for Ambient Assisted Living 2016 co-located with
15th International Conference of the Italian Association for
Artificial Intelligence (AIxIA 2016), Genova, Italy, November
28th, 2016, CEUR Workshop Proceedings, vol. 1803, pp. 33–42.
CEUR-WS.org; 2016.
9. Gallicchio C, Micheli A, Pedrelli L. Design of deep echo state
networks. Neural Netw. 2018;108:33–47.
https://doi.org/10.1016/j.neunet.2018.08.002.
10. Gupta HP, Chudgar HS, Mukherjee S, Dutta T, Sharma K.
A continuous hand gestures recognition technique for
human-machine interaction using accelerometer and gyroscope
sensors. IEEE Sensors J. 2016;16(16):6425–6432.
https://doi.org/10.1109/JSEN.2016.2581023.
11. Han H, Yoon SW. Gyroscope-based continuous human hand
gesture recognition for multi-modal wearable input device for
human machine interaction. Sensors. 2019;19:11.
12. Hinaut X, Dominey PF. Real-time parallel processing of
grammatical structure in the fronto-striatal system: a recurrent
network simulation study using reservoir computing. PloS one.
2013;8(2):e52946.
13. Hochreiter S, Schmidhuber J. Long short-term memory. Neural
computation. 1997;9:1735–80.
14. Jaeger H. Tutorial on training recurrent neural networks, covering
BPPT, RTRL, EKF and the "echo state network" approach.
GMD-Forschungszentrum Informationstechnik. 2002.
15. Jirak D, Barros P, Wermter S. Dynamic gesture recognition
using echo state networks. In: Proceedings of the European
Symposium of Artificial Neural Networks and Machine Learning,
pp. 475–480; 2015.
16. Kingma DP, Ba J. Adam: A method for stochastic optimization.
2014. arXiv:1412.6980.
17. Lee MC, Cho SB. Mobile gesture recognition using hierarchical
recurrent neural network with bidirectional long short-term
memory. In: Proceedings of UBICOMM; 2012. p. 138–141.
18. Li Y, Wang T, Khan A, Li L, Li C, Yang Y, Liu L. Hand gesture
recognition and real-time game control based on a wearable band
with 6-axis sensors. 2018.
19. Liu J, Zhong L, Wickramasuriya J, Vasudevan V. uWave:
Accelerometer-based personalized gesture recognition and its
applications. Pervasive and Mobile Computing. 2009;5(6):657–
675. PerCom 2009.
20. Lukoševičius M, Jaeger H. Reservoir computing approaches
to recurrent neural network training. Computer Science Review.
2009;3(3):127–149. https://doi.org/10.1016/j.cosrev.2009.03.005.
21. Qiao J, Li F, Han H, Li W. Growing echo-state network
with multiple subreservoirs. IEEE Trans Neural Netw Learn
Syst. 2017;28(2):391–404.
https://doi.org/10.1109/TNNLS.2016.2514275.
22. Sosin I, Kudenko D, Shpilman A. Continuous gesture
recognition from sEMG sensor data with recurrent neural
networks and adversarial domain adaptation. In: 2018 15th
international conference on control, automation, robotics and
vision (ICARCV); 2018. p. 1436–1441.
23. Tanaka G, Yamane T, Héroux JB, Nakane R, Kanazawa N,
Takeda S, Numata H, Nakano D, Hirose A. Recent advances
in physical reservoir computing: A review. Neural Networks.
2019;115:100–123.
24. Tietz S, Jirak D, Wermter S. A reservoir computing framework
for continuous gesture recognition. In: Tetko IV, Kůrková V,
Karpov P, Theis F, editors. Artificial neural networks and machine
learning – ICANN 2019: workshop and special sessions. Cham:
Springer International Publishing; 2019. p. 7–18.
25. Triefenbach F, Jalalvand A, Schrauwen B, Martens JP.
Phoneme recognition with large hierarchical reservoirs. In: J.D. Lafferty,
C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel, A. Culotta
(eds.) Advances in Neural Information Processing Systems 23,
pp. 2307–2315. Curran Associates, Inc.; 2010.
http://papers.nips.cc/paper/4056-phoneme-recognition-with-large-hierarchical-reservoirs.pdf.
26. Verstraeten D, Schrauwen B, D’Haene M, Stroobandt D. An
experimental unification of reservoir computing methods. Neural
Networks. 2007;20(3):391–403. Echo State Networks and Liquid
State Machines.
27. Wang Y, Ma H. Real-time continuous gesture recognition with
wireless wearable IMU sensors. In: 2018 IEEE 20th international
conference on e-health networking, applications and services
(Healthcom); 2018. p. 1–6.
https://doi.org/10.1109/HealthCom.2018.8531095.
28. Wyffels F, Schrauwen B. Design of a central pattern generator
using reservoir computing for learning human motion. In: 2009
Advanced technologies for enhanced quality of life, pp. 118–122;
2009.
29. Yang J, Pan J, Li J. sEMG-based continuous hand gesture
recognition using GMM-HMM and threshold model. In: 2017 IEEE
International conference on robotics and biomimetics (ROBIO),
pp. 1509–1514; 2017.
Publisher’s Note Springer Nature remains neutral with regard to
jurisdictional claims in published maps and institutional affiliations.