scieee Science in your language
[en] (orig)
Features and Machine Learning Systems
for Structured and Sequential Data
vorgelegt von
Dipl.Inf.
Guido Schwenk
von der Fakultät IV - Elektrotechnik und Informatik
der Technischen Universität Berlin
zur Erlangung des akademischen Grades
Doktor der Ingenieurwissenschaften
- Dr.-Ing. -
genehmigte Dissertation
Promotionsausschuss:
Vorsitzender: Prof. Dr. Benjamin Blankertz
Gutachter: Prof. Dr. Klaus-Robert Müller
Gutachter: Prof. Dr. Konrad Rieck
Gutachterin: Prof. Dr. Anja Feldmann
Tag der wissenschaftlichen Aussprache: 18. März 2019
Berlin 2019
To my beloved wife and our children, for their never ending support.
Acknowledgements
I would like to thank Prof. Dr. Klaus-Robert Müller and Prof. Dr. Konrad Rieck
for their enduring optimism and encouragement.
Summary
Modern web and communication technology relies heavily on sequential and
structured data for its process execution and communication protocols. Due to its
complex properties, a manual analysis and detection of problems on this data is too
time-consuming and expensive, and hence not feasible. As a consequence, features
and automatic learning systems on this type of data are highly sought after.
To address these issues, the thesis proposes features and systems for learning
on structured, sequential and temporal data, both in abstract and in concrete form,
with a focus on analyses in the fields of IT security and Quality of Service, on the
data domains of analysis data of malware binaries and JavaScript code, as well as
on mobile network communication data. The proposed features and feature combi-
nations cover various statistical, non-behavioral and behavioral, stateless, stateful,
structural and temporal concepts, and are used individually and in a complemen-
tary manner, e.g. via hierarchical or ensemble approaches. The proposed learn-
ing systems are evaluated against competitive approaches, where they outperform
commonly used and state-of-the-art methods, including approaches using neural
networks.
Specific practically relevant aspects are also addressed in depth, like high levels
of automation to extend the scope of the system application, different re-training
procedures, or the calibration of metrics relevant for the specific domain. To im-
prove the interpretability of the system processes and their results to increase the
system reliability and its level of trust, different visualization approaches are pro-
posed, focussing on interpretable and transparent feature projections and relevance
analyses. These additional discussions on the proposed ideas further support a
potential adaptation of the proposed ideas to concrete application scenarios.
Zusammenfassung
Moderne Internet und Kommunikationstechnologien nutzen sequentielle und
strukturierte Daten zur Ausführung ihrer Prozesse und Kommunikationsprotokolle.
Aufgrund deren komplexer Eigenschaften ist eine manuelle Erkennung und Ana-
lyse von Problemen auf diesen Daten zu zeitaufwendig und teuer und daher oft
nicht realisierbar. Infolgedessen sind automatisierte Lernsysteme, die auf solchen
Daten arbeiten und dies ermöglichen, sehr gefragt.
Die vorliegende Dissertation adressiert dies in mehrfacher Hinsicht. So wer-
den Eigenschaften, Systeme und Merkmale von strukturierten, sequentiellen und
temporalen Daten diskutiert, sowohl in abstrakter wie auch konkreter Form am
Beispiel von Analysendaten von JavaScript Code und Schadsoftware im Bere-
ich der IT Sicherheit, sowie auf Mobilfunkkommunikationsdaten zu Zwecken der
Qualitätssicherung. Es werden verschiedene individuelle wie auch kombinierte
statistische, verhaltensbasierte, zustandslose, zustandsbasierte, strukturelle sowie
temporale Merkmalsarten eingeführt und analysiert. Dabei werden deren Eigen-
schaften sowohl im individuellen Gebrauch wie auch im Verbundgebrauch ana-
lysiert, beispielsweise in Form von hierarchischen Merkmalsverbänden oder En-
semble-Ansätzen. Die Klassifikationsleistungen und Merkmalseigenschaften der
vorgestellten Lernsysteme werden im Rahmen umfangreicher Evaluationen mit
konkurrierenden Ansätzen verglichen. Dabei zeigen sich sehr gute Ergebnisse der
vorgestellten Methoden, selbst im Vergleich zu state-of-the-art Methoden wie neu-
ronalen Netzwerken.
Zusätzlich werden praktisch relevante Aspekte der besprochenen Probleme
adressiert, um deren Potential einer realen Anwendung zu erhöhen. Dazu gehören
beispielsweise ein hoher Grad an Automatisierung der vorgeschlagenen Systeme,
verschiedene Trainingsprozeduren, sowie Möglichkeiten der Kalibrierung von Me-
triken, die für das besprochene Anwendungsgebiet relevant sind. Auch die Mög-
lichkeiten der Interpretierbarkeit und Transparenz der vorgeschlagenen Systeme
werden besprochen und mit verschiedenen Methoden adressiert, um dadurch das
Vertrauen in die vorgestellten automatisierten Lernsysteme zu erhöhen. Unter an-
derem werden dazu verschiedene Visualisierungsmethoden für Daten und Merk-
male, sowie Möglichkeiten der Ergebnisrelevanzanalyse vorgestellt.
Contents
1 Introduction 1
1.1 Thesis Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Own Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 List of Publications . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Adaptive Detection of Covert Communication in HTTP Requests 8
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Dynamics of HTTP Communication . . . . . . . . . . . . . . . . 10
2.3 The DUMONT System . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.1 Features of HTTP Requests . . . . . . . . . . . . . . . . 11
2.3.2 Anomaly Detection . . . . . . . . . . . . . . . . . . . . . 13
2.3.3 Training DUMONT ..................... 14
2.3.4 Limiting Evasion . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Empirical Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.1 Evaluation Data . . . . . . . . . . . . . . . . . . . . . . . 16
2.4.2 Evaluation Setup . . . . . . . . . . . . . . . . . . . . . . 18
2.4.3 System Performance . . . . . . . . . . . . . . . . . . . . 18
2.4.4 Comparative Evaluation . . . . . . . . . . . . . . . . . . 19
2.5 Related Work and Limitations . . . . . . . . . . . . . . . . . . . 21
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3 Autonomous Learning for the Detection of JavaScript Attacks 23
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.1 Client-based Honeypots . . . . . . . . . . . . . . . . . . 25
3.2.2 Analysis Systems . . . . . . . . . . . . . . . . . . . . . . 26
3.2.3 Attack-specific Detection . . . . . . . . . . . . . . . . . . 26
3.2.4 Learning-based Detection . . . . . . . . . . . . . . . . . 27
3.3 Data Acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.1 Collection Framework . . . . . . . . . . . . . . . . . . . 28
3.3.2 Collected Data Sets . . . . . . . . . . . . . . . . . . . . . 30
3.4 Learning-Based Detection . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Feature Extraction . . . . . . . . . . . . . . . . . . . . . 32
3.4.2 Learning-Based Detectors . . . . . . . . . . . . . . . . . 33
3.4.3 Learning Framework . . . . . . . . . . . . . . . . . . . . 35
3.5 General Performance Evaluation . . . . . . . . . . . . . . . . . . 36
3.5.1 Performance of AV-Scanners . . . . . . . . . . . . . . . . 36
3.5.2 Performance of Detectors . . . . . . . . . . . . . . . . . 36
3.5.3 Misclassification Analysis . . . . . . . . . . . . . . . . . 38
3.5.4 Performance of Different Learning Methods . . . . . . . . 39
3.6 Analysis of Re-Training Procedures . . . . . . . . . . . . . . . . 40
3.6.1 Regular Re-Training . . . . . . . . . . . . . . . . . . . . 40
3.6.2 One-Time Training . . . . . . . . . . . . . . . . . . . . . 42
3.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4 Classification of Structured Validation Data
using Stateless and Stateful Features 45
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.2.1 Feature Spaces for Structured Temporal Data . . . . . . . 48
4.2.2 Token n-gram Features . . . . . . . . . . . . . . . . . . . 49
4.3 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.4 Feature Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.4.1 Stateless Features Γs.less . . . . . . . . . . . . . . . . . . 55
4.4.2 Stateful Features Γs.ful ................... 59
4.5 Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 60
4.5.1 Learning Methods on Γs.less Features . . . . . . . . . . . 61
4.5.2 Learning Methods on Γs.ful Features . . . . . . . . . . . 61
4.5.3 Learning Methods on Γcomb Features . . . . . . . . . . . 62
4.6 Details for the Practical Application . . . . . . . . . . . . . . . . 63
4.6.1 Batch vs. Online Classification . . . . . . . . . . . . . . . 63
4.6.2 Interpretation of the Results . . . . . . . . . . . . . . . . 65
4.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
4.7.1 Evaluation Results . . . . . . . . . . . . . . . . . . . . . 68
4.7.2 Evaluation Interpretation . . . . . . . . . . . . . . . . . . 71
4.8 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . 72
4.9 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5 Feature Spaces and a Learning System for Structural-Temporal Data 75
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
5.3 Use Case Description . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.1 Data Set Properties . . . . . . . . . . . . . . . . . . . . . 79
5.3.2 Failure Classes . . . . . . . . . . . . . . . . . . . . . . . 82
5.4 Structural and Temporal Feature Spaces . . . . . . . . . . . . . . 84
5.4.1 Base Processing . . . . . . . . . . . . . . . . . . . . . . 85
5.4.2 ΓqT Features . . . . . . . . . . . . . . . . . . . . . . . . 86
5.4.3 ΓTFeatures . . . . . . . . . . . . . . . . . . . . . . . . . 87
5.4.4 ΓSFeatures . . . . . . . . . . . . . . . . . . . . . . . . . 88
5.4.5 ΓS+TFeatures . . . . . . . . . . . . . . . . . . . . . . . 88
5.4.6 ΓST Features . . . . . . . . . . . . . . . . . . . . . . . . 88
5.5 System Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.5.1 Model Class Detection and Prediction . . . . . . . . . . . 93
5.5.2 Learning Methods . . . . . . . . . . . . . . . . . . . . . 94
5.5.3 Combined Classification System . . . . . . . . . . . . . . 96
5.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.6.1 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . 99
5.6.2 Experiments on MFC data . . . . . . . . . . . . . . . . . 100
5.6.3 Experiments on AFC data . . . . . . . . . . . . . . . . . 108
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6 Thesis Conclusion 110
7 Appendix 114
7.1 Learning Methods . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.1.1 Support Vector Machines . . . . . . . . . . . . . . . . . . 114
7.1.2 One Class SVM . . . . . . . . . . . . . . . . . . . . . . 116
7.1.3 MLP . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
7.1.4 RNN and LSTM . . . . . . . . . . . . . . . . . . . . . . 118
7.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . 120
7.3 Using Metrics for Data with unbalanced Class Sizes . . . . . . . . 122
7.4 Visualizations for Interpretable Systems . . . . . . . . . . . . . . 125
7.4.1 Weight Vector Visualization . . . . . . . . . . . . . . . . 125
7.4.2 Visualizing the δnmatching function ˆ
Φof ΓST . . . . 131
List of Figures
1.1 Overview: From Source Data to Analysis Data . . . . . . . . . . 3
2.1 Examples of the dynamics in the recorded HTTP communication. 10
2.2 Frequency distribution and anomaly scores per request length. . . 13
2.3 Hierarchical detection layers. . . . . . . . . . . . . . . . . . . . . 14
2.4 Calibrating a detector using the ROC curve. . . . . . . . . . . . . 16
2.5 Contribution of each detector to the detection performance. . . . . 20
3.1 A framework for acquisition and analysis of JavaScript code. . . . 28
3.2 Number of malicious URLs visited per day. The dashed line rep-
resents a weekly average. . . . . . . . . . . . . . . . . . . . . . . 31
3.3 Average size of JavaScript code collected per day from benign
(top) and malicious (bottom) URLs. The dashed line represents
a weekly average. . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4 Visualization of different models achieved by different values of ω. 34
3.5 Histograms of the predicted scores of the optimized models of
the static (top) and dynamic (bottom) detectors on the complete
dataset. The dashed line represents Θ0.001. . . . . . . . . . . . . . 38
3.6 False (top) and true (bottom) positive rates of the static detector per
day with regular re-training, using Θ0.001. . . . . . . . . . . . . . 41
3.7 False (top) and true (bottom) positive rates of the dynamic detector
per day with regular re-training, using Θ0.001. . . . . . . . . . . . 41
3.8 False (top) and true (bottom) positive rates of the combined detec-
tors per day with regular re-training, using Θ0.001.......... 42
3.9 False (top) and true (bottom) positive rates of the dynamic detector
per day, in the one-time training setup, using Θ0.001. . . . . . . . . 43
4.1 Main components of the processing chain. . . . . . . . . . . . . . 47
4.2 Conceptual example of the stateful Γs.ful (left) and stateless Γs.less
(right) feature approach for 4events of a structured sequence sample 54
4.3 Stateless Parameter-Value ratios (top) and Feature Space Sizes (bot-
tom) for the discussed data sets . . . . . . . . . . . . . . . . . . . 58
4.4 Positively contributing dimensions ˆw+(left) and negatively con-
tributing dimensions ˆw(right) of weight vector ˆwand correspond-
ing features of samples s1and s2. . . . . . . . . . . . . . . . . . 66
4.5 Box plots of the AUC-results (in %) of the comparative evaluation
of the feature spaces and classifiers on the 2014 campaign data
(top: debug-analysis data, bottom: tshark-analysis data). . . . . . 69
4.6 Box plots of the AUC-results (in %) of the comparative evaluation
of the feature spaces and classifiers on the 2015 campaign data
(top: debug-analysis data, bottom: tshark-analysis data). . . . . . 70
5.1 Training of the detection and prediction system . . . . . . . . . . 94
5.2 Application of the detection and prediction system . . . . . . . . 98
5.3 MCP evaluation on MFC data for ΓST : F1 scores (macro) in % for
ranges of δ[0,60] (left) and δ[0,10] (right) . . . . . . . . . . 101
5.4 Percentage of MC samples in the High-confidence set, for θdb
[0,1.0], for ΓT,ΓS,ΓS+Tand ΓST . . . . . . . . . . . . . . . . . 103
5.5 Combined classification results on MFC data: precision, recall, ef-
fective recall (from top to bottom) in % for θdb [0,1.0], for ΓT,
ΓS,ΓS+Tand ΓST . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.6 Precision (left) and Effective Recall (right) of ΓST against ΓS+T
(in %) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
7.1 Graphical illustration of the One Class SVM . . . . . . . . . . . . 116
7.2 Exemplary fully-connected Multi Layer Perceptron Network . . . 117
7.3 Unrolled RNN node loop . . . . . . . . . . . . . . . . . . . . . . 118
7.4 Notation for the graphical LSTM explanations . . . . . . . . . . . 118
7.5 Simple activation function layout in RNN . . . . . . . . . . . . . 119
7.6 Structure of an LSTM node . . . . . . . . . . . . . . . . . . . . . 119
7.7 Base metrics and their interpretation . . . . . . . . . . . . . . . . 120
7.8 w-visualizations based on the L1 norm, correlated with Strain . . 127
7.9 w-visualizations based on the L1 norm, decorrelated from Strain . 128
7.10 w-visualizations based on the L2 norm, correlated with Strain . . 129
7.11 w-visualizations based on the L2 norm, decorrelated from Strain . 130
7.12 Prefix-tree visualization of sequence sprojected on model sequences
SM. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
7.13 Compressed graph visualization of sequence sprojected on model
sequences SM. . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
List of Tables
2.1 Length features of HTTP requests. . . . . . . . . . . . . . . . . . 12
2.2 Structural features of HTTP requests. . . . . . . . . . . . . . . . 12
2.3 Entropy features of HTTP requests. . . . . . . . . . . . . . . . . 12
2.4 Temporal features of HTTP requests. . . . . . . . . . . . . . . . . 13
2.5 Web traffic of six users recorded over 90 days. . . . . . . . . . . . 17
2.6 Statistics of malicious web traffic . . . . . . . . . . . . . . . . . . 18
2.7 Detection performance of DUMONT . . . . . . . . . . . . . . . . 19
3.1 Details of the acquired datasets. . . . . . . . . . . . . . . . . . . 31
3.2 False positive and true positive rates of two anti-virus tools. . . . . 36
3.3 Average detection performance of the different detectors. . . . . . 37
3.4 Average true positive rates of the different detectors, tested on the
AV-alerts. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.5 Average true positive rates of the different models. . . . . . . . . 39
3.6 Average true positive rates of the different detectors applied during
one-time training. . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.1 Sizes of the data sets of valid and invalid failures . . . . . . . . . 54
4.2 Sizes of the Stateless Token Feature Spaces . . . . . . . . . . . . 56
4.3 Sizes of the Stateful Token Feature Spaces . . . . . . . . . . . . . 59
4.4 Mean AUC Results and Standard Deviation (in %) of the 2014 and
2015 campaigns. . . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.1 Data set statistics: Number of events e#, of failure sub classes c#
and of samples s#. . . . . . . . . . . . . . . . . . . . . . . . . . 81
5.2 Exemplary sequence of successful bi-directional mobile commu-
nication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
5.3 Two artificial communication sequences x1and x2........ 86
5.4 Exemplary snippet of the binarized ˆ
Φ-projections of a sequence s
before and after dimensionality reduction . . . . . . . . . . . . . 92
5.5 Results of the individual MCP and MCD evaluations on MFC data
(%) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
5.6 Results of combined classification evaluation on MFC data (in %) 106
5.7 Values of minimal θer required to reject hypotheses HA
0and HB
0
at α= 0.05 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.8 Results of the individual MCP evaluations on AFC data (in %) . . 108
7.1 Representative data set properties of Chapters 2, 3 and 4 . . . . . 124
7.2 Representative classification model properties of Chapters 2, 3 and 4124
7.3 Exemplary token 3-gram dimensions of wof debug-data . . . . . 125
Chapter 1
Introduction
Motivation
Modern internet and communication technologies rely heavily on the processing
of functionally structured textual and numerical data, like executed code or data
transmitted in accordance with network communication protocols. Necessarily,
the structures contained in this data are semantically relevant for the process ex-
ecution. As these processes also require a sequential execution of its functional
components, this data can also be viewed as sequential data. As a result of the
complexity of these processes, practical applications need to address various re-
lated problems, like the detection of malicious code or communication to enable a
sufficient level of IT security, or the classification and validation of network com-
munication to enable a higher quality of the respective communication service.
While manual problem analysis on such data is possible, it is time-consuming and
expensive. Consequently automatically learning systems addressing such prob-
lems are highly sought after. To enable such systems, data representations as well
as learning methods are required which allow an efficient processing of this type
of structured, sequential data. Additionally, these data representations and learn-
ing methods also need to be combined in a highly automated manner, reducing the
requirement for manual processing steps. And finally these systems should also
be transparent, enabling the interpretation of the processing steps and the results.
These objectives build the practical motivation of this thesis, i.e. is to propose and
analyze features, learning methods and complete systems to automatically learn,
detect, predict and interpret problems based on sequential and structured code and
communication data. Consequently, it aims at answering the following questions:
Which properties are relevant for structural and sequential data of network
communication and code?
Which types of features are best suited to represent those properties - and
how are they extracted effectively?
Which learning methods and approaches are best suited for solving the dif-
ferent relevant objectives?
1
What is needed to achieve highly automated systems utilizing these features
and learning methods?
As each thesis chapter focusses on slightly different types of data and use-
cases, detailed discussions of the related research are provided in each respective
chapter. Generally speaking though, learning on structured as well a structured
sequential data is relevant in various research areas. As such relevant methods of
the research field of Natural language processing [BPX+07, WM12] are applied to
Sequence learning [LMP01, Gra12] or to solve IT security objectives like intrusion
detection [Rie09] or malware classification [RTWH11]. In the more recent field
of process mining [VDAADM+11, ERF16] this is extended to state-based event
sequences with additional temporal features. While the research provided in those
areas addresses some of the problems we will face within this thesis, the com-
plex nature of the utilized data often requires deviating adaptations of the utilized
methods and approaches.
Data Domains
The thesis discusses general properties of sequential structural and temporal data,
to allow an abstraction of the proposed ideas to other research areas. But to high-
light the practical relevance and consequences of the proposed ideas, it also focuses
on a set of concrete data domains. In the first half of the thesis these are the prac-
tical aspects of different types of concrete features in the domain of IT security,
namely the detection of covert malicious HTTP communication in Chapter 2, and
the detection of malicious Javascript code in Chapter 3. In the second half the
focus is extended to more abstract analyses of different types of features and fea-
ture concepts, with a focus on the detection and classification of erroneous process
behavior in internet communication protocols in Chapter 4 and in specifically pre-
processed mobile network communication protocol data in Chapter 5. These focus
points of the different chapters are illustrated in Figure 1.1.
Properties and Features
To provide a broad view on the data properties and their relevant features, various
statistical, behavioral, non-behavioral, stateless, stateful, structural and temporal
features are discussed, both for their individual representations, as well as for dif-
ferent ways of combining them via hierarchical features, feature concatenations
and ensemble approaches. Thus Chapter 2 starts with introducing different con-
crete features like length and entropy values, and it also introduces structural and
temporal features representing highly specific properties. Chapter 3 builds on this
idea of concrete numerical features, and further extends it by introducing token
n-grams in the problem domain. These are then further developed into stateless
features and compared against stateful features in Chapter 4, before they are finally
extended to the temporal structural feature domain in Chapter 5, where they are
compared against different sequential and non-sequential types of features.
2
Figure 1.1: Overview: From Source Data to Analysis Data
Learning and Detection Systems
To solve the concrete use case objectives and to enable a better analysis the ca-
pabilities of the different feature spaces, complete machine learning systems are
proposed, allowing for extensive evaluations of the feature performances in their
context. The proposed systems and features are evaluated against competitive ap-
proaches, where they outperform commonly used and state-of-the-art methods, and
even methods based on neural networks. Additionally specific practically relevant
aspects are addressed in depth. As such achieving a high level of automation is rele-
vant to reduce the amount of manual system processes, largely extending the scope
of the system application. As different application domains require systems which
allow for the calibration of system relevant metrics, such calibration procedures are
proposed as well. Finally also the interpretation of the system processes and their
results need to be enabled, to increase the system reliability and its achieved level
of trust. This is achieved through different proposed analysis and visualization
approaches, focussing on transparent feature projections and a transparent feature
relevance analysis.
Thesis Contributions
As a result of these analyses, this thesis provides the following practical and theo-
retical contributions:
Analysis of the properties of structured, sequential data, specifically of net-
work communication and code.
Proposal of novel features or feature combinations to represent those prop-
erties.
Proposal of novel ways of combining learning methods to achieve the re-
spective classification and detection objectives.
3
Proposal and evaluation of highly automated systems for solving specific
practically relevant learning problems.
In depth evaluation of the proposed features, learning methods and systems,
competitively compared against approaches used in related research areas
like IT security, sequence learning and process learning.
Proposal of methods and graphical representations to increase the trans-
parence and interpretability of the system and the obtained results.
4
1.1 Thesis Roadmap
Chapter 2 proposes and evaluates different individual and combined statistical fea-
tures, which are also used complementarily in a hierarchical manner. A complete
machine learning system for the detection of covert and tunneled outbound HTTP
communication is proposed, preventing malicious activity and hardening networks
against malware proliferation. The proposed unsupervised detection system allows
for a calibration of the false positive rate, as required by the respective potential
application scenario. It is evaluated on the analysis data of real-world malware
binaries, collected over 90 days of user-data, achieving very good detection perfor-
mances.
Chapter 3 proposes and evaluates different behavioral and non-behavioral fea-
tures, calculated for dynamically and statically analyzed data for an empirical
study of a fully automated system for collecting, analyzing and detecting malicious
JavaScript code. The proposed supervised detection system allows for a calibra-
tion of the false positive rate, as required by the respective potential application
scenario, as well as for a complementary combination of the utilized detectors. It
is evaluated on a large dataset of benign and malicious webpages, achieving very
good detection performances.
Chapter 4 proposes and evaluates different stateless, stateful and their comple-
mentarily combined features. The proposed supervised classification system aims
for the classification of mobile validation data for service quality and system de-
pendability of mobile communication data, and is evaluated on two extensive data
sets of real-world data, for which two competitive data representations are analyzed
for their individual and combined performance and general applicability, achieving
very good classification results. The proposed system also allows for a relevance
ranking via the analysis of the trained weight vector, for which a visualization is
introduced, allowing the interpretation of the classification results.
Chapter 5 proposes and evaluates structural, temporal and complementarily com-
bined features for the detection and prediction of mobile failure data for service
quality and system dependability, for a specific type of mobile network commu-
nication data. The proposed supervised classification system is evaluated on real-
world data sets, where it performed better than methods used in related work. The
system also allows for the calibration of a high precision and effective recall. Ad-
ditionally different hypotheses for the different feature spaces are analyzed, high-
lighting the practical relevance of the proposed feature space. Additionally a visu-
alization method for the interpretation of the proposed feature space projection is
provided.
5
1.2 Own Contributions
Chapter 2 is based on the publication "Adaptive Detection of Covert Commu-
nication in HTTP Requests" by Guido Schwenk and Konrad Rieck, published in
the Proceedings of the European Conference on Computer Network Defense 2011,
EC2ND11. My contributions to the described work are extensive, as I set up the
virtual network and implemented and conducted the execution of the collected bi-
naries and PDF files, collecting the outbound HTTP communication in the process.
Afterwards I also conducted all analyses and evaluations on this data.
Chapter 3 is based on the publication "Autonomous Learning for Detection of
JavaScript Attacks: Vision or Reality?" by Guido Schwenk, Alexander Bikadorov,
Tammo Krueger and Konrad Rieck, published in the Proceedings of the 2012 ACM
CCS Workshop on Artificial Intelligence and Security, AISec 2012. My contri-
butions to the described work are extensive and cover the feature extraction, the
application of the detectors and all subsequent evaluations.
Chapter 4 is based on the publication "Classification of Structured Validation Data
using Stateless and Stateful Features" by Guido Schwenk, Ralf Pabst and Klaus-
Robert Müller, Journal of Computer Communications (Elsevier), 2019. With the
exception of the actual data collection, which was done externally, I conducted all
of the described work, ranging from the data pre-processing over the definition and
extraction of features to the implementation and evaluation of the learning methods
and feature types, including all discussed analyses and data visualizations.
Chapter 5 is based on the publication "Feature Spaces and a Learning System for
Structural-Temporal Data" by Guido Schwenk, Ben Jochinke and Klaus-Robert
Müller, submitted to PloS one, December 2018. With the exception of the ac-
tual data collection, which was done externally, I conducted all of the described
work, ranging from the data pre-processing over the definition and extraction of
features to the definition, implementation and evaluation of the proposed detection
and prediction system, including the significance analyses and the proposed data
visualizations.
6
1.3 List of Publications
"Classification of Structured Validation Data using Stateless and Stateful
Features" by Guido Schwenk, Ralf Pabst, Klaus-Robert Müller, Journal of
Computer Communications (Elsevier), 2019
"Feature Spaces and a Learning System for Structural-Temporal Data" by
Guido Schwenk, Ben Jochinke, Klaus-Robert Müller, submitted to PloS one,
December 2018
"A Close Look on n-Grams in Intrusion Detection" by Christian Wressneg-
ger, Guido Schwenk, Daniel Arp, Konrad Rieck, Proceedings of the 2013
ACM CCS Workshop on Artificial Intelligence and Security, AISec 2013
"Autonomous Learning for Detection of JavaScript Attacks: Vision or Re-
ality?" by Guido Schwenk, Alexander Bikadorov, Tammo Krueger, Konrad
Rieck, Proceedings of the 2012 ACM CCS Workshop on Artificial Intelli-
gence and Security, AISec 2012
"Detecting Behavioral and Structural Anomalies in MediaCloud Applica-
tions" by Guido Schwenk, Sebastian Lapuschkin, arXiv September 2012
"Adaptive Detection of Covert Communication in HTTP Requests" by Guido
Schwenk, Konrad Rieck, Proceedings of the European Conference on Com-
puter Network Defense 2011, EC2ND11
"Botzilla: Detecting the "Phoning Home" of Malicious Software" by Konrad
Rieck, Guido Schwenk, Tobias Limmer, Thorsten Holz and Pavel Laskov,
ACM Symposium on Applied Computing 2010, SAC10
7
Chapter 2
Adaptive Detection of
Covert Communication in
HTTP Requests
Summary
The infection of computer systems with malicious software is an endur-
ing problem of computer security. Avoiding an infection in the first place is a
hard task, as computer systems are often vulnerable to a multitude of attacks.
However, to explore and control an infected system, an attacker needs to es-
tablish a communication channel with the victim. While such a channel can
be easily established to an unprotected end host in the Internet, infiltrating
a closed network usually requires passing an application-level gateway - in
most cases a web proxy - which constitutes an ideal spot for detecting and
blocking unusual outbound communication.
This chapter introduces DUMONT, a system for detecting covert out-
bound HTTP communication passing through a web proxy. DUMONT learns
profiles of normal HTTP requests for each user of the proxy and adapts to
individual web surfing characteristics. The profiles are inferred from a di-
verse set of features, covering the structure and content of outbound data,
and allowing for automatically identifying tunnels and covert channels as
deviations from normality. While this approach does not generally rule out
sophisticated covert communication, it significantly improves on state-of-
the-art methods and hardens networks against malware proliferation. This
capability is demonstrated in an evaluation with 90 days of web traffic, where
DUMONT uncovers the communication of malware, tunnels and backdoors
with few false alarms.
8
2.1 Introduction
Computer networks face a wide variety of threats from malicious software (mal-
ware). Just a few years ago, malicious software could be categorized into a few
basic classes, but nowadays we are confronted with a plethora of malicious tools
developed by an underground economy for monetary gains [e.g., FPPS07, HEF09,
SGHSV11]. This malware is characterized by versatile functionality and the capa-
bility to take numerous routes to a victim, ranging from malicious documents and
shortened links to drive-by downloads and targeted attacks. In practice, detecting
and eliminating all these infection vectors has proven to be an intractable task and
thus millions of hosts in the Internet are plagued by malicious software.
Once compromised, infected machines are regularly misused for illegal activ-
ities, such as gathering personal data, distributing spam messages or conducting
attacks against other hosts. All these activities inherently require establishing a
communication channel that enables the attacker to retrieve data and control the
infected system. Such a channel can be trivially established to an unprotected
host, for example, by directly sending network packets as performed by the trojans
Storm and Nugache [SDHD07, HSD+08]. As a result, a large body of research
has studied methods for detecting direct communication with infected hosts [e.g.
GZL08, GPZL08, WBH+09, RSL+10]. However, enterprise and government net-
works are often shielded from the Internet by an application-level gateway - typi-
cally in form of a packet filter and a web proxy - and thus no direct communication
with infected machines can be established. In this setting, the malicious software
is required to tunnel its communication through the web proxy and there is a need
for methods capable of detecting tunneled and covert communication in HTTP.
For this purpose this chapter introduces DUMONT, an anomaly detection sys-
tem for identifying tunneled and covert communication passing through a web
proxy. DUMONT learns profiles of normal HTTP requests for each user of the
proxy and thereby adapts to the individual web surfing characteristics of each user.
The individual profiles are inferred from a diverse set of features, covering the
structure and content of outbound data. Using these profiles, tunnels and covert
communication of malicious software can be identified as deviations from normal-
ity, where respective requests can be put on hold and further investigated before
leaving the network. Similarly, DUMONT can be applied for analysis of suspicious
files in a sandbox, where it can detect unusual web traffic, for instance, when a
spyware program transfers gathered data to a remote host.
Detecting covert channels in the general case is a very ambitious task and
clearly DUMONT can not spot arbitrarily sophisticated covert communication, for
example, using the timing of requests for encoding information. However, the
involved implementation and low transmission rates of such advanced channels
render them less attractive for adversaries. In practice DUMONT significantly im-
proves on the detection capabilities of related methods such as WEBTAP [BP04]
and raises the bar for malware authors to comprise networks. In an empirical eval-
uation with 90 days of web traffic from six users, DUMONT allows to identify the
9
majority of malicious software, tunnels and backdoors with a false-positive rate
of 0.35%, whereas the rule-based method WEBTAP suffers from over 3% false-
positives due to the dynamics of web traffic.
The rest of this chapter is structured as follows: In Section 2.2 the dynamics of
HTTP are discussed. The DUMONT system and underlying learning techniques are
presented in Section 2.3 and evaluated in Section 2.4. Section 2.5 presents related
work and Section 2.6 the conclusion.
2.2 Dynamics of HTTP Communication
The HTTP protocol features a diversity of properties, exploitable to learn some-
thing about a user’s communication behavior. While most of them look static at
a first glance, they show a rather dynamic behavior in practice. When analyzing
for example, which web sites a set of users visits during a defined time period,
one might consider creating a whitelist of benign web sites sufficient for stopping
outbound communication to malicious sites. This assumption is unrealistic, as Fig-
ure 2.1(a) illustrates.
(a) Unknown sites per day (b) Unknown user-agents per day
(c) Distribution of request lengths
Figure 2.1: Examples of the dynamics in the recorded HTTP communication.
Formerly unknown (i.e. first-time seen) benign web sites do always occur, be it
through the evolution of the Internet or just through the normal web behavior of the
10
user. Furthermore previously benign web sites might have been infected recently,
making them no longer viable for a benign whitelist. For those reasons we do not
learn the concrete web site addresses, but model them indirectly using machine
learning.
Another dynamic behavior can be observed in the occurrences of HTTP head-
ers. For example, creating a simple whitelist of the appearances of the header User-
Agent for individual users can provide a means to detect deviations and suspicious
combinations (e.g. of the operating system and the web browser). However, this
static approach is also not sufficient, as Figure 2.1(b) illustrates. Previously unseen
User-Agents occur all the time, be it due to changing tools on the client side or
simple version changes in the different web clients. A more indirect method of
modeling HTTP requests is necessary here as well.
Another question is, whether adaptive learning on the data of individual users
provides advantages over learning on an agglomerated dataset of several users to-
gether. As Figure 2.1(c) illustrates for the distribution of a single feature of HTTP
traffic, namely the lengths of the requests, the same feature may show a different
statistical behavior for each user. As learning a representative model of normality
requires those features to have consistent statistics, learning on data of individual
users is preferable to learning on data of all users combined.
These three examples demonstrate that the dynamics of HTTP communication
can hardly be tackled by rule-based methods, such as WEBTAP [BP04]. Hence
a learning-based approach to the detection of covert communication in HTTP is
applied here, capable of adapting to the individual characteristics of each user.
2.3 The DUMONT System
In the following, the design of our system DUMONT and its inner workings are
presented. The selection of features of outbound HTTP traffic is discussed in Sec-
tion 2.3.1, while the necessary learning method as well as the design of the detector
are introduced in Section 2.3.2. The concrete necessities and technical details of
the operation are then laid out in Section 2.3.3.
2.3.1 Features of HTTP Requests
According to RFC2616 [FGM+99] an HTTP request starts with a method, e.g.
GET or POST. A method requires an URI, which may include pairs of parameters
and values. After the URI, HTTP headers are defined, again consisting of pairs of
parameters and values. If a POST request is triggered, typically a body of data con-
cludes the request. Additional to features of this data, each request is triggered at a
certain point in time, whose features can be stored as well. Based on those elements
of HTTP requests 17 descriptive features grouped in 4 semantic sets are extracted.
In networks secured with DUMONT, the use of the HTTP method CONNECT is to
be restricted, as this method implements a standard tunnel protocol. Allowing an
11
unmonitored use of such a communication channel does undermine the objective
of DUMONT. Therefore requests using that method are not taken into account.
Length features: The set of length features is depicted in Table 2.1. It describes
length values of different parts of the request, such as URI and body, for later
detecting deviations from those values.
Feature Description
l1Length of request
l2Length of URI
l3Total length of URI parameters
l4Total length of headers
l5Length of request body
Table 2.1: Length features of HTTP requests.
Structural features: The set of structural features is shown in Table 2.2. It con-
tains values describing the structure of an HTTP request by statistical measures,
such as the average length of URI parameter names or header values. This per-
spective allows identifying outbound data otherwise hidden through distribution
over different headers or parameters.
Feature Description
s1Average length of URI parameter names
s2Average length of URI parameter values
s3Average length of header names
s4Average length of header values
Table 2.2: Structural features of HTTP requests.
Entropy features: The set of entropy features, depicted in Table 2.3, contains
entropy values for different bit widths. These values allow an estimation of the
information content in the analyzed request, where the different bit widths cover
the request content at different granularity.
Feature Description
e18-bit entropy of request
e216-bit entropy of request
e324-bit entropy of request
e432-bit entropy of request
Table 2.3: Entropy features of HTTP requests.
Temporal features: The set of temporal features is illustrated in Table 2.4.
These features enable the analysis of temporal traffic characteristics and help to
spot unusual communication activity.
12
Feature Description
t1Number of requests in last minute
t2Number of outbound bytes in last minute
t3Hour of HTTP request
t4Week day of HTTP request
Table 2.4: Temporal features of HTTP requests.
2.3.2 Anomaly Detection
Our system DUMONT makes use of a standard learning technique - the One-Class
SVM [TD99, SPST+01] - for learning a model of normality for the different HTTP
features. Formally, a One-Class SVM describes a hypersphere. This sphere en-
closes given data with a minimal volume. Anomalies are detected through their
distance from the center of the learned sphere, resulting in a high anomaly score.
To compensate outliers and noise, as well as to optimize false-positive and detec-
tion rates, a soft margin is used. This way not all normal data points are required
to reside within the sphere. By using specialized functions, so-called kernels, the
sphere can be embedded into a high-dimensional feature space, facilitating the
modeling of more complex structures with non-linear representations. In the setup
discussed here, Gaussian kernels [MMR+01] are applied to achieve this. More de-
tails on support vector machines for one class learning, their optimization problem
and decision function, as well as more details on the use of kernels can be found in
Chapter 7.1.
(a) Length of requests
Figure 2.2: Frequency distribution and anomaly scores per request length.
A non-linear model for normality is illustrated in Figure 2.2, on the example
of the length of requests. The left y-axis shows the frequency of different request
lengths for one user and the right y-axis shows a function of the anomaly score
computed using a One-Class SVM with Gaussian kernels. In principle longer re-
quests cause a higher anomaly score. However, the local minimum at a request
13
length of 1,600 byte demonstrates the advantage of a non-linear representation to
model ranges of normality more subtle than simple upper and lower bounds can
do.
Hierarchical detection layers
Another important concept realized in DUMONT is the combination of individual
detectors in hierarchical layers. The trained models for individual features respec-
tively (detection layer 0) provide the capability to detect covert channels reflected
in single features. However, by training models on the combined features of the
four feature sets (detection layer 1), the detection capability can be further in-
creased to also identify anomalies in the combination of features. This way for
example, anomalous requests can be detected, which are normal in both length and
entropy, but anomalous in their combination.
Figure 2.3: Hierarchical detection layers.
The concept is illustrated in Figure 2.3, depicting detection layer 0 and detec-
tion layer 1, as well as detection layer 2, which consists of a trained model for all
individual features combined, enabling even more synergistic effects. When an-
alyzing an unclassified request, all of those detectors are applied, such that each
detector decides, whether the request is normal or anomalous. DUMONT classifies
a request as anomalous, if at least one detector triggers an alarm. This approach
provides a maximal detection rate, though increasing the false-positive rate as well.
The alternative - classifying a request as anomalous when at least ndetectors trig-
ger alarms - is no option here, as it allows malware to cover its communication by
hiding information within the features of n1detectors.
2.3.3 Training DUMONT
For the training of DUMONT several steps are necessary. Initially the normal data
is split into training, validation and testing datasets, maintaining the temporal order
of the requests. After training and selecting a suitable model for each detector, a
sample of malicious communication is used to calibrate the hierarchical detectors
of DUMONT.
14
Model selection
Training a model of normality with a One-Class SVM requires optimizing two
parameters, namely the width of the Gaussian kernel and the “softness” of the One-
Class SVM. For each combination of these parameters a different SVM model is
obtained. From those models a suitable one is picked by the following heuristic:
1. A threshold is set to define an upper bound on the desired false-positive rate
on the training dataset. Of the calculated models the one with the highest
false-positive rate below this threshold is selected.
2. If models with identical false-positive rates occur, the one with the highest
number of support vectors is selected, as this corresponds to the best adapta-
tion to the training data.
Automatic calibration
After selecting a suitable model for each detector on the normal training dataset,
the representation quality of the model is further optimized by calibrating the radius
of the soft margin of the SVM on the validation dataset using a Receiver Operating
Characteristic (ROC) curve. To generate a ROC, the validation data and a sample of
malicious requests is processed with DUMONT using different thresholds (radius).
In principle, a good threshold corresponds to the point closest to (0.0,1.0) in the
ROC. As a low false-positive rate is the primary objective, however, the use of
this point is not recommended, as it corresponds to a one-to-one ratio of false-
positive and detection rate. The focus of DUMONT has to be a low false-positive
rate, because each of the detectors in Figure 2.3 is able to trigger an alarm if it
detects an anomaly. As a result the individual false alarms of the detectors are
accumulated, rendering it highly important to keep the false-positive rate of each
detector low. Though this results in generally smaller detection rates as well, the
negative impact on the overall detection rate is small, since the correctly classified
anomalies of the individual detectors are accumulated as well.
To assure a low false-positive rate for each of the individual detectors, the
method illustrated in Figure 2.4 is implemented. Thresholds corresponding to a
suitable ratio of false-positive and detection rate are found at the ascending gradi-
ents before any local plateau of the ROC curve. They are retrieved by positioning a
linear function in (0.0,1.0) and selecting the point which is closest to that function.
Two examples of such points are depicted in Figure 2.4 as black dots. While both
of them have a good ratio of false-positive and detection rate, only the left one,
selected by a linear function with a higher gradient, yields a low false-positive rate.
Consequently the linear function with the steepest gradient is selected to determine
the threshold for each detector in our system. We chose this approach to allow
for a direct influence on the calibration of the false-positive rate. Otherwise also
ensemble approaches [Die00] would be a viable option to train these classifiers.
15
lower false-positive rate
higher detection rate
Figure 2.4: Calibrating a detector using the ROC curve.
2.3.4 Limiting Evasion
One problem of anomaly detection in general are evasion attacks that aim at poison-
ing the learning data [KL10]. If an adversary knows the distribution of the features
of the normal requests, he can tune his malware to generate seemingly normal traf-
fic with high anomaly scores. Such data points near the margin of the One-Class
SVM can shift the model towards any direction desired by the attacker, resulting
in a setup where once anomalous data is now classified as normal. Fortunately,
different methods have been developed to increase the robustness of anomaly de-
tection and to minimize the influence of an adversary. In particular, the techniques
of bootstrapping [BS85] and sanitization [CSL+08] can be applied to adjust and
filter out anomalous data from the training corpus.
2.4 Empirical Evaluation
For the empirical evaluation, datasets of normal and malicious HTTP requests have
been collected and used for different experiments. The datasets and the results of
those experiments, as well as a comparison with a state-of-the-art approach, are
illustrated in this section.
2.4.1 Evaluation Data
For collecting normal outbound HTTP requests, a dedicated proxy server has been
set up at our institute. After discussing considerations of data privacy, six users
were willing to use the proxy server for web access. In the resulting traffic dumps
of outbound HTTP traffic, the IP addresses of all users have been pseudonymized.
16
Statistics of the resulting data set of 90 days is presented in Table 2.5. In total the
six users generated 143MB of HTTP requests, consisting of 182,996 requests with
altogether 173 days of usage and 5,272 requests per day.
Number of Data Active Requests per
Requests volume Days active day
User 1 116,565 85 MB 50 2,331
User 2 29,723 36 MB 43 691
User 3 18,834 11 MB 52 362
User 4 9,882 6 MB 10 988
User 5 4,001 2 MB 8 500
User 6 3,991 3 MB 10 399
Total 182,996 143 MB 173 5,272
Table 2.5: Web traffic of six users recorded over 90 days.
For collecting malicious HTTP communication data, different sources have
been used. In particular, samples of malicious software have been obtained from
the Internet Early Warning system [EFG+10] hosted at the University of Mannheim
and different honeypots running at our institute. In total a dataset of 2.765 mali-
cious executable files and PDF documents has been collected for our experiments.
To retrieve the kind of communication data relevant for our problem, a small
virtual network has been set up, where the binaries and PDF documents are auto-
matically executed in a virtual machine running Windows XP, providing each of
them a time frame of 15 minutes to get active. The connections triggered by the
malware are redirected to a virtual machine simulating the Internet using TRU-
MANBOX [Gor08]. To model the desired network layout, a HTTP proxy has been
included in our setup. Any HTTP connections from the Windows machine had to
find and use that proxy, using information found in the preferences and registry
entries prepared on the Windows machine. Of the 2,765 malicious files only 695
have been capable of doing this, whereas several common malware families failed
to correctly communicate with the web proxy and would not have been able to
establish a communication outside a closed network.
Besides malicious software, there also exist public tools for establishing out-
bound communication channels to a system. In particular, the web backdoors
MATAHARI1and RWW-SHELL2are included in the conducted experiments for
creating covert communication. Both backdoors have been run with a polling in-
terval of 10 seconds, executing 10–20 shell commands in each session. Moreover,
the common tunnel software HTTPTUNNEL3is considered for tunneling various
traffic through the web proxy.
It is noteworthy that we also executed additional experiments with other soft-
1A Simple Reverse HTTP Shell, http://matahari.sourceforge.net
2Placing Backdoors Through Firewalls, http://www.thc.org/releases.php
3GNU HTTP Tunnel, http://www.nocrew.org/software/httptunnel.html
17
# Sessions # Requests
Malicious software 695 12,899
HTTP tunnels 11 164
Web backdoors 12 345
Table 2.6: Statistics of malicious web traffic
ware for establishing tunneled communication, such as CORKSCREW, SKYPE and
TEAMVIEWER. However, these tools make use of the CONNECT method for com-
munication and have been excluded from our experiments, as they can be trivially
detected and blocked. The statistics of the resulting dataset of malicious HTTP
requests are presented in Table 2.6.
2.4.2 Evaluation Setup
To conduct the training of DUMONT in our experiments, the temporally first third
of the normal data is selected. The trained models are then validated and calibrated
on the temporally second third of the normal data and the validation partition of
the malicious data. For realistic experimental results, the validation is repeated ten
times, each time with a newly randomized set of validation data of the malicious
dataset. For testing, the detectors are applied on the remaining temporally last third
of the normal dataset, as well as the malicious test data, remaining in each of the
randomizations. Due to that approach the final false-positive and detection rates
are presented as the average values of those ten repetitions.
The features t1t4and tare not included in the evaluation. In practice they
help detecting malicious outbound traffic at unusual times or with an unusual re-
quest frequency. Due to our methods of collecting malicious requests, however,
this could not be fully tested, because for collecting malicious communication data
the binaries and infected documents have been executed over night and weekend
as well. Due to the resulting time stamps the corresponding time features contain
artifacts and thus are easily distinguishable from benign communication.
2.4.3 System Performance
The detection and false-positive rates of DUMONT for each user are presented in
Table 2.7. Applied on the traffic of tunnels, web backdoors and malicious software,
DUMONT performs decently with detection rates of 100.0%, 94.3% and 89.3%
respectively. The average false-positive rate reaches a value of 0.35%. While the
detection rates of tunnels remain static among all users, the detection of backdoors
and malware is strongly user-dependent due the variance of HTTP traffic. This
variance is also the reason why rule-based methods are limited in detecting these
covert channels, as we will see in Section 2.4.4.
In principle, each hierarchical detector contributes to the final detection perfor-
mance of DUMONT. Covert communication can be spotted in all of the features
18
Detection Rates FP rates
HTTP HTTP Malware Malware Benign
tunnels backdoors (faked UA) web traffic
User 1 100.0 % 88.0 % 79.9 % 67.5 % 0.14 %
User 2 100.0 % 100.0 % 96.5 % 94.3 % 0.39 %
User 3 100.0 % 100.0 % 98.7 % 89.4 % 0.23 %
User 4 100.0 % 84.0 % 80.8 % 73.4 % 1.18 %
User 5 100.0 % 94.0 % 82.1 % 69.7 % 1.12 %
User 6 100.0 % 100.0 % 97.6 % 97.9 % 4.05 %
Average 100.0 % 94.3 % 89.3 % 82.0 % 0.35 %
Table 2.7: Detection performance of DUMONT
extracted from HTTP requests, as can be seen in Figure 2.5, which depicts the
average contribution of the individual detectors on the false-positive and the true-
positive rates. The false positives triggered by DUMONT are mainly caused by
large data uploads and most notably cookies. While both types of requests are gen-
erally useful for interacting with the Internet, it is obvious that especially in our
scenario they represent an inherent risk. Both methods are used to send a bigger
and often encrypted amount of data to a server located outside of the protected net-
work, which is what our system is designed to prevent. Such false positives could
even be interpreted as true positives when found in a network with stricter security
protocols. Since this is not our initial assumption, they are kept as false positives
here.
DUMONT is implemented in Java, with no special performance optimization.
On a single core of an Intel Core2 Duo with 3.00GHz, the whole normal dataset,
containing the requests of six users of 90 days, can be processed (i.e. extracting
the features and applying the trained detectors) within five minutes. This equates
to a run-time performance of approximately 1.300 requests per second.
2.4.4 Comparative Evaluation
The second experiment conducts a comparison of the detection performance of
DUMONT to WEBTAP [BP04], which detects covert communication using a mix
of filters, trained rules and threshold values. The comparison is conducted on the
dataset introduced in Section 2.4.1. In terms of the detection rate, WEBTAP iden-
tifies 100% of the traffic of tunnels, web backdoors and malicious software. As
discussed in the previous section, DUMONT performs slightly worse. But in terms
of false-positive rate, DUMONT significantly outperforms WEBTAP, which flags
3.6% of the requests as covert communication, thus generating more than ten times
more false alarms in our experiments. These false alarms are due to the dynamics
of HTTP traffic and can be attributed to changing header names and values.
To further illustrate this shortcoming, the influence of header changes on the
19
(a) Contribution to false positives.
(b) Contribution to true positives
Figure 2.5: Contribution of each detector to the detection performance.
20
overall detection performance is investigated. To this end, the User-Agent of each
malicious request is changed to the one most frequently used by the tested user.
As a result, all malicious requests contain ”faked” user agents. This method of
masquerading malicious web traffic can be implemented into malware with only
little effort. In this setting, the detection rate of WEBTAP drastically decreases
from 100% to 3.7%. The performance of DUMONT, however, as presented in
Table 2.7 decreases only slightly from 89% to 82%. Obviously, the detection of
WEBTAP strongly depends on static header fields and thus can be easily thwarted.
The comparison shows strikingly that already a little effort on the attackers side
results in a security loss if the employed detection method can not be adapted to
the individual characteristics and dynamics of HTTP communication.
2.5 Related Work and Limitations
Closest to our system DUMONT is the work of Borders and Prakash [BP04], which
derive rule-based techniques for detecting covert communication in web traffic.
While effective in different settings, these approaches assume that HTTP commu-
nication remains static over time and that features extracted from requests are sta-
tionary. With the adoption of HTTP as a generic communication protocol for many
applications, these assumptions fail in practice and a more adaptive approach for
modeling normality is needed.
The most common approach for detecting malware is the use of signatures.
While in the past mainly focusing onto inbound traffic [Pax99, BBCP04, NKS05],
recent work has studied detecting outbound malicious HTTP traffic via automati-
cally generated signatures [RSL+10, RPF10, WBH+09]. In the case of a secured
network, where malware tries to establish a covert outbound channel to leak spe-
cific information, an adversary surely will avoid using known patterns of malware
communication and thus signature-based detection is not effective. Finally, various
research on detecting the communication of bot networks [e.g. GZL08, GPZL08]
is related to our approach and makes use of similar concepts. Yet this work fo-
cuses on identifying direct communication with end hosts and is not suitable for
determining anomalous requests in a web proxy.
A different strain of research has studied techniques for circumventing the
leakage of confidential data by monitoring sensitive data in host systems [SS11,
KPPK10]. Although effective in practice, these approaches work only on a system
where memory access can be monitored. By contrast, DUMONT can be directly de-
ployed in a network without modifying the operating system of connecting hosts.
One of the limitations of detecting covert channels is based on general cod-
ing theory [McH95]. For HTTP, a good example is described by Feamster et.al.
[FBH+02], where one party monitors accesses to certain benign web sites, while
another party accesses those web site in a specifically arranged pattern. The trans-
ferred information is hidden within that pattern and therefore completely unde-
tectable. Though fortunately such approaches limit the bandwidth of information
21
to a minimum, they remain problematic, especially considering long term informa-
tion leakage through insiders.
2.6 Conclusion
This chapter presented a novel approach for detecting covert and tunneled commu-
nication passing through a web proxy. Our system DUMONT builds on hierarchical
detectors that can identify anomalous communication in various features of HTTP
requests. By using machine learning techniques, DUMONT can be applied to the
individual traffic of each user and thus can adapt to particular web surfing charac-
teristics automatically. It is demonstrated empirically that this setting provides a
better detection performance than current static approaches, where DUMONT can
identify the communication of malicious software, tunnels and backdoors with only
few false alarms.
An interesting topic for future work is to further extend the set of features.
For example in a hybrid approach, features from keystrokes or mouse movement
[ZP00] might be added to our system to achieve an improved detection perfor-
mance. Similarly, daily bandwidth limitations as used in WEBTAP could easily be
implemented to complement our approach. Finally, the integration of DUMONT
into different network environments, e.g. for mobile devices or sensors, may pro-
vide perspectives for network-based detection of unknown malicious activity.
22
Chapter 3
Autonomous Learning for the
Detection of JavaScript Attacks
Summary
Malicious JavaScript code in webpages is a pressing problem in the In-
ternet. Classic security tools, such as anti-virus scanners, are hardly able to
keep ahead of these attacks, as their obfuscation and complexity obstructs
the manual generation of signatures. Recently, several methods have been
proposed that combine JavaScript analysis with machine learning for auto-
matically generating detection models. However, it is open how these meth-
ods can really operate autonomously and update detection models without
manual intervention. This chapter presents an empirical study of a fully au-
tomated system for collecting, analyzing and detecting malicious JavaScript
code. The system is evaluated on a dataset of 3.4 million benign and 8,282
malicious webpages, which has been collected in a completely automated
manner over a period of 5 months. The results of our study are mixed: For
manually verified data excellent detection rates up to 93% are achievable,
yet for fully automated learning only 67% of the malicious code is identified.
This chapter concludes with a discussion of the limiting factors, which would
indeed enable a fully automated system, once they are solved.
23
3.1 Introduction
According to a study of Symantec [Sym11], the number of JavaScript attacks in
the Internet has almost doubled in the year 2010, reaching peaks of over 35 million
attacks per day. As part of these attacks, malicious JavaScript code is planted on
webpages, such that a user visiting the webpage is automatically attacked and un-
noticably infected with malicious software. The success of these attacks is rooted
in the close interaction of the JavaScript interpreter with the web browser and its
extensions. Often it is possible with a few lines of code to probe and exploit vul-
nerabilities in the browser environment [DHM08, EKK09].
Unfortunately, the detection of malicious JavaScript code is a challenging task:
JavaScript attacks are small programs that are executed in the web browser. The
attacker can build on the full flexibility of interpreted code, which allows him to
easily obfuscate his code as well as dynamically exploit different types of vulner-
abilities. Common security tools, such as anti-virus scanners, are hardly able to
keep abreast of these attacks, as the obfuscation and complexity obstruct the man-
ual generation of effective signatures. As a result, malicious JavaScript code is
often insufficiently detected due to a lack of up-to-date signatures [RKD10].
As a remedy, several detection methods have been proposed that combine
JavaScript analysis with techniques from the area of machine learning. These
methods build on the ability of machine learning to automatically generate de-
tection models from known samples of benign and malicious JavaScript code and
thereby avoid the manual crafting of signatures. Common examples are the de-
tection systems CUJO [RKD10], ZOZZLE [CLZS11] and ICESHIELD [HFH11],
which are capable of accurately identifying malicious code in webpages at run-
time with few false alarms.
Learning-based detection provides a promising ground for mitigating the threat
of malicious webpages. However, to take effect and provide advantages over
signature-based tools, learning-based methods need to operate with very little man-
ual intervention. From the acquisition of training data to the generation of detection
models, the learning process needs to be largely automatic to quickly adapt to the
development of malicious software. Previous work has ignored this issue of auto-
matic learning and it is open whether learning-based detection methods can really
operate autonomously over a longer period of time.
This chapter tests the feasibility of automatic learning and presents an em-
pirical study of a fully automated system based on the detector CUJO [RKD10].
The system (a) retrieves benign and malicious JavaScript code from the Internet,
(b) identifies malicious functionality using client-based honeypots and (c) learns a
detection model from features of static and dynamic analysis in regular intervals.
The system is evaluated on a dataset of 3.4 million benign and 8,282 malicious
webpages, which has been acquired over a period of 5 months. In particular the
detection performance as well as the learning process over time are studied, for dif-
ferent features and learning methods, such as anomaly detection and classification
approaches.
24
The results of our study are mixed: In line with previous work, the system at-
tains a high detection rate of 93% if applied to manually verified data. However, in
a fully automated setting it identifies only 67% of the malicious code in webpages
- irrespective of the used features and learning methods. Two main factors that
contribute to this decrease are identified:
Semantic gaps: It is considerably hard to verify the presence of malicious
activity during the visit of a webpage and use the exact same information at a
later stage for learning. If both stages differ only slightly, malicious activity
may be present but is not exposed to the learning method.
Time delays: JavaScript attacks are very volatile and often active for only a
few hours. Due to the large amount of processed data, a significant amount
of time may pass between the verification of a malicious webpage and the
resulting learning stage. If the malicious code is not present anymore, the
detection model is trained on incomplete data.
The overall conclusion from this study is that fully automated systems for de-
tection of JavaScript attacks are still an open problem and there exist several prac-
tical challenges that need to be addressed first.
The rest of Chapter 3 is structured as follows: The related work is discussed
in Section 3.2. Section 3.3 then introduces our framework for data acquisition
and presents details of the collected JavaScript code. Section 3.4 describes the
features and learning methods used in our system. Section 3.5 presents the em-
pirical results of our study and discusses their implications, which is continued in
Section 3.6 with the analysis of different practically relevant training approaches.
Finally Section 3.7 concludes this chapter.
3.2 Related Work
Before presenting the study on learning-based detection of malicious JavaScript
code, some related work is reviewed first. In particular we discuss related ap-
proaches for analyzing and detecting malicious code in webpages. These ap-
proaches can be roughly categorized into client-based honeypots,analysis systems
and detection systems, where these categories are not rigid and some systems im-
plement a mixture of functionalities.
3.2.1 Client-based Honeypots
To systematically monitor and understand the phenomena of JavaScript attacks,
several honeypot systems have been devised that visit webpages and mimic the
behavior of users. One class of these systems are high-interaction honeypots,
e.g. [WBJ+06, PMRM08, SS06, Roa07], which operate a real browser in a sand-
box environment and detect attacks by monitoring unusual state changes in the en-
vironment, such as modified system files. Another class of these systems are low-
interaction honeypots, which only emulate the functionality of web browsers and
25
corresponding vulnerabilities for tracking malicious activity, e.g. [IHF08, Naz09,
BMB10].
Both types of honeypots are valuable sources for collecting JavaScript attacks,
especially in combination with systems for efficient retrieval of potentially ma-
licious webpages [IBC+12]. In contrast to server-based approaches, client-based
honeypots are capable of actively searching for malicious code and allow to capture
instances of novel attack campaigns early on. As a consequence, client-based hon-
eypots are widely used and can be considered a standard for monitoring JavaScript
attacks in the wild.
3.2.2 Analysis Systems
Collecting malicious JavaScript code, however, is only a first step in crafting ef-
fective defenses. A second strain of research has thus focused on methods for
automatically analyzing the collected code and extracting security-relevant infor-
mation, such as patterns indicative for attacks. Most notable here is the community
service WEPAWET that is backed by a chain of analysis tools for collecting, filter-
ing and analyzing JavaScript code [CKV10, CCVK11, IBC+12]. The service au-
tomatically analyzes webpages using an emulated browser environment and is able
to identify anomalous behavior in the code using machine learning techniques.
In contrast to WEPAWET, which performs a more general analysis of web-
page content, other systems address particular aspects of JavaScript attacks, e.g.
[KLN+11, KLZS12]. For example, the analysis system ROZZLE implements an
involved multi-path execution for JavaScript code. Instead of following a single
execution flow, the method inspects multiple branches of execution and thereby
exposes hidden and conditional functionality of JavaScript attacks.
Although very effective in analyzing code and identifying JavaScript attacks,
the presented analysis systems are mainly designed for offline application and in-
duce an overhead which is prohibitive for real-time detection. For example, Cova
et al. [CKV10] report an average processing time of 25 seconds per webpage for
WEPAWET. For this reason, methods for offline analysis are not considered in our
study - even if they employ learning-based components. Nevertheless, many of the
techniques implemented for offline analysis are also applicable in online detection
systems [KLZS12].
3.2.3 Attack-specific Detection
The first methods capable of detecting malicious code at run-time have been pro-
posed for specific types of JavaScript attacks, e.g. [RLZ08, EWKK09]. These
methods proceed by monitoring the browser environment for known indicators of
certain attack types. For example, the system NOZZLE scans string objects for frag-
ments of executable code, a typical indication of heap-spraying and other memory
corruption attacks. While these approaches provide a low run-time, they are inher-
ently limited to particular attacks and do not provide a generic protection from ma-
26
licious JavaScript code. A more generic detection of JavaScript attacks is achieved
by the systems BLADE [LYPL10] and ARROW [ZSSL11], which identify attacks
using indicators outside the browser environment. In particular, BLADE spots and
blocks the covert installation of malware as part of drive-by downloads, whereas
ARROW generates detection patterns for the URLs involved in JavaScript attacks.
Both methods intentionally do not analyze JavaScript code and are thus indepen-
dent of specific attack types. However, by ignoring the actual attack code, these
methods critically depend on the presence of the considered indicators in practice.
3.2.4 Learning-based Detection
The demand for a generic detection of malicious code has finally motivated the de-
velopment of efficient learning-based detection systems, such as CUJO [RKD10],
ZOZZLE [CLZS11], and ICESHIELD [HFH11], which are the main focus of our
study. These systems analyze webpages at run-time and discriminate benign from
malicious JavaScript code using machine learning techniques. In contrast to offline
analysis, they induce only a minor run-time overhead and can be directly applied
for protecting end user systems.
At the core of these learning-based approaches are two central concepts: the
considered features and the learning model for detecting attacks. For example,
ZOZZLE mainly extracts features from a static analysis of JavaScript code, whereas
ICESHIELD monitors the execution of code dynamically and constructs behavioral
features. Moreover, many efficient detection systems employ a supervised classifi-
cation approach for learning, while the offline system WEPAWET successfully uses
unsupervised anomaly detection for identifying attacks. We study these concepts
and related differences in the conducted evaluation in more detail.
3.3 Data Acquisition
A key for evaluating learning-based detection systems is a realistic dataset of ma-
licious and benign JavaScript code. Previous work has suggested to automatically
acquire such data using client-based honeypots and offline analysis systems. This
is clearly a promising approach, as it allows for automatically updating and re-
training learning-based systems on a regular basis. However, almost no research
has explored this approach in depth. Most of the results reported for learning-based
detection have been obtained on a single dataset with manually cleansed training
data.
This chapter investigates how learning-based systems perform if they are reg-
ularly and automatically updated with malicious and benign data without human
sanitization. To this end we have devised a framework that visits malicious and be-
nign webpages on a daily basis and returns reports for static and dynamic analysis
of the contained JavaScript code.
27
3.3.1 Collection Framework
An overview of the collection framework is presented in Figure 3.1. The frame-
work is constructed using existing security instruments, such as public services and
client-based honeypots, and only serves the purpose of automatically retrieving
large amounts of benign and malicious JavaScript code. Note that the framework
is not designed to gain insights into the malware ecosystem. Moreover, learning-
based components such as PROPHILER [CCVK11] and JSAND [CKV10] are delib-
erately excluded from the framework, as they may bias the evaluation of learning-
based detection methods towards their specific feature sets.
Figure 3.1: A framework for acquisition and analysis of JavaScript code.
Sources for URLs
At the start of each day sources of potentially benign and malicious URLs are
harvested. For the benign URLs we considered rankings and listing of popular
webpages. In particular, we randomly sampled 25,000 URLs per day from the
Alexa ranking, which lists the top 1 million web pages according to visitors and
page views. But popular web pages are not necessarily attack-free. In fact, at-
tackers invest considerable effort into comprising popular webpages and exposing
malicious code to a large group of users. Consequently, it cannot be ruled out that
some of the 25,000 URLs are compromised, yet we assume that the vast majority
of the URLs is benign. Moreover, we take precautions in the later verification to
filter out known instances of JavaScript attacks.
For collecting potentially malicious URLs, we visited common blacklists and
services tracking malicious URLs. As an example, the database service HAR-
MUR [LC11] was queried for all malicious URLs that have been submitted in the
last 24 hours. Furthermore, we regularly retrieve URLs from search engines using
“dangerous” search terms. In total our framework collects about 8,000 potentially
malicious URLs per day. Similarly to the benign sources, these URLs are not
guaranteed to be malicious and thus the subsequent verification of the data is an
indispensable step.
28
Verification
The collected benign and malicious URLs are far from being an ideal source for
training and evaluating learning-based systems. First, the data is not guaranteed to
be of a certain type and, second, even if a URL points to a malicious webpage, this
does not necessarily indicate that JavaScript code is involved. Note that phishing
and other scam webpages are also often flagged as malicious but do not contain any
malicious JavaScript code. To improve the quality of our data, a verification stage
is employed that filters the data, such that malicious and benign JavaScript code is
obtained with high probability. For this task we prioritized tools that could also be
applied in a practical deployment scenario, that is, the verification is conducted on
a single workstation, using only techniques that return results within a few hours.
The following two verification procedures are applied to benign and malicious
URLs, respectively:
(a) For improving the quality of benign URLs, blacklisted URLs are filtered out
from the 25,000 URLs collected per day. To carry out this task efficiently, the
Google Safe Browsing service is used, which provides a frequently updated
list of malicious URLs. The main goal of this stage is to remove known
malicious code from the benign dataset.
(b) All collected malicious URLs are analyzed using a high-interaction honey-
pot. In particular, the honeypot SHELIA [Roa07] is used, which monitors a
target application and employs taint tracking for identifying memory corrup-
tion and code injection attacks. As target application the Internet Explorer 6
is used.
As a result of this analysis, the collected data can be refined to webpages that
are either (a) likely benign and not contained in blacklists or (b) likely malicious
and cause memory corruption or redirection of control flow during execution. We
focus on a particular setting, namely SHELIA and the Internet Explorer 6, as this
browser version is known for several public vulnerabilities and a frequent target
of attacks. Nevertheless, our framework could be easily extended to also support
other browsers and client-based honeypots for the verification stage.
Javascript Analysis
While learning-based detection systems generally follow the same design template,
their inner working differs fundamentally. For example, ZOZZLE closely interacts
with the JavaScript interpreter of the Internet Explorer, ICESHIELD uses frozen
DOM objects for tracking JavaScript execution and CUJO makes use of a dedi-
cated sandbox for analyzing code. Clearly, integrating all these different technical
systems into a single prototype and conducting a fair comparison would be an in-
tractable task.
29
Therefore we focus on a single system, namely CUJO, as it provides static and
dynamic analysis of JavaScript code. While there are several subtle differences in
how the code is analyzed, the reports of the static and dynamic analysis are ba-
sically similar to the representations of code used in the other systems. Overall,
we are not interested in benchmarking concrete implementations but rather com-
paring underlying concepts, such as the use of static or dynamic code analysis for
detection.
Static analysis. The static analysis in our framework first assembles the code
base for a given URL by following redirects and downloading referenced JavaScript
code. The assembled code is then parsed using a YACC grammar and the parsed
representation is passed as a report for further analysis. A detailed description of
the parsing process is presented by Rieck et al. [RKD10].
Dynamic analysis. Additionally, a dynamic analysis of the JavaScript code is
conducted. This analysis uses an enhanced version of ADSANDBOX (ADS), an
efficient sandbox for JavaScript code. This sandbox is embedded in the Internet
Explorer and allows to observe the behavior of JavaScript code in a secure envi-
ronment. All interactions of the code with the virtual browser are recorded and a
detailed report of the code’s behavior is generated. A description of the sandbox is
provided by Dewald et al. [DHF10].
The presented framework enables us to automatically collect benign and mali-
cious webpages and to generate analysis reports for each webpage on a daily basis.
It is necessary to note that analyzing over 25,000 URLs and processing 8,000 URLs
with a honeypot each day is a challenging task. As a consequence, several hours
may pass during the processing of a webpage and a verified malicious URL may not
necessarily expose malicious activity when it is later visited using the JavaScript
analysis. This problem of time delays is inherent to our setting and would also ex-
ist in a practical application. Hence, the data used here is not artificially corrected,
and inactive attacks are left in the malicious dataset.
3.3.2 Collected Data Sets
The framework for data acquisition has been deployed on April 19th 2011 at our
site and collected malicious and benign JavaScript code for a period of 5 months
(137 days). Though the data is not artificially corrected and inactive attacks are
left in the malicious dataset, the reports of ADSANDBOX are filtered to remove
those that are broken, empty, timed out during the analysis or are too small to show
any specific behavior, since such reports corrupt the training process with their
inconsistent features. Table 3.1 shows the total number of webpages before and
after filtering, and the resulting dataset sizes.
As described in Section 3.3.1, initially 25,000 benign and 8,000 potentially
malicious URLs have been collected per day. The applied verification significantly
reduced the number of malicious URLs, as only a fraction of the candidate URLs
has been capable to trigger an attack in our honeypot. Figure 3.2 depicts the number
30
Benign Malicious
Total number of URLs 3,400,000 8,282
Number of filtered URLs 2,900,000 3,220
Total size of dataset 359,000 MB 130 MB
Size of filtered dataset 291,000 MB 77 MB
Unique filtered URLs 1,200,000 2,146
Table 3.1: Details of the acquired datasets.
of malicious URLs that have been visited and verified per day.
The corresponding size of the downloaded JavaScript code per day for the be-
nign and the malicious URLs, averaged for each day, is depicted in Figure 3.3.
An interesting observation is the continuous growth of the amount of JavaScript
code found at the benign URLs, reflecting the generally increasing importance of
JavaScript. The peak at day 78 is due to a down-time of our analysis system and
the resulting queue of unvisited webpages. For the malicious URLs a certain kind
of recurring structure seems to exist, showing a weekly periodicity in the peaks and
average lines. This finding can be credited to compromised workstation systems
involved in the attacks, e.g. by redirecting traffic or hosting landing pages.
Figure 3.2: Number of malicious URLs visited per day. The dashed line represents
a weekly average.
3.4 Learning-Based Detection
All learning methods heavily rely on the features chosen to model a certain prob-
lem. The features for the detection of malicious JavaScript code require the ability
to reflect patterns in the code as well as monitored malicious behavior. Subse-
quently a proper learning method has to be selected, which is capable of handling
those features. To overcome the limitation of choosing either a supervised or an
unsupervised learning method, we discuss a way of evaluating both types of learn-
ing methods in a single setup. The following sections present this setup, the set of
features and learning methods considered in our study, and the resulting learning
framework.
31
Figure 3.3: Average size of JavaScript code collected per day from benign (top)
and malicious (bottom) URLs. The dashed line represents a weekly average.
3.4.1 Feature Extraction
We focus on features that can be extracted from the available static and dynamic
reports. Those features are independent of specific attack characteristics, while
reflecting relevant properties of the contained JavaScript behavior. In particular
token n-gram-features are applied to extract short string features from the reports,
similar to the q-gram features implemented in CUJO. Moreover, we use specific
numeric features derived from WEPAWET and ICESHIELD.
Token n-gram features
Token n-grams denote sequences of nwords. To extract token n-grams from a
static or dynamic report, the report is transformed into a sequence of words, sepa-
rated by white-space characters. The desired token n-grams are extracted by slid-
ing a window with the size of nwords over the report and storing the consecutive
nwords found at each position. The following example depicts this procedure for
a snippet of a static report, using a value of n= 3:
ID=ID+x{ (ID = ID), (= ID +), (ID + x) }.
The procedure works similar for the dynamic reports, where the SET command
is executed on a variable x.y, leading to a similar list of sequences:
SET x.y to a{(SET x.y to), (x.y to a)}.
32
In the next step the token n-grams of each report are mapped into a vector space
spanned by all possible token n-grams, i.e. each dimension is associated with the
occurrences of one particular token n-gram. This results in a high-dimensional, yet
sparse vector for each report, which can be efficiently processed [RKD10].
Numeric features
Additionally, 15 different numeric features are derived by analyzing the static and
dynamic reports, similar to the way such features are extracted in WEPAWET and
ICESHIELD. These features are designed bearing specific malicious JavaScript
behavior in mind. Exemplary features inspired by WEPAWET are counts of the
occurrences of document.location or document.referrer, the number
of instantiated components, the number of times code is executed dynamically or
the ratio of string definitions and string uses. Examples of features inspired by
ICESHIELD are the number of occurrences of dynamically injected code, occur-
rences of potentially dangerous MIME-types or abnormally long strings used in
decoding functions.
Unfortunately the features of WEPAWET and ICESHIELD could not be accu-
rately re-produced, as some of them require information which is not contained
in our analysis reports. Thus the combination of this incomplete feature set and
our heterogeneous data leads to an insufficient detection performance for these nu-
meric features. To avoid a misleading presentation of the results, numeric features
are omitted in the following experiments.
3.4.2 Learning-Based Detectors
Taking insights from CUJO, the behavior of different learning models is stud-
ied using Support Vector Machines (SVM) [MMR+01, SS02] with linear kernels.
Though other approaches like decision trees work as well, SVMs offer a high per-
formance and are robust against noise in the data. The feature space depends on the
choice of the kernel. Focusing specifically on SVMs with a linear kernel has two
advantages. The first one is the ability to process big datasets of very high dimen-
sionality very fast. While Gaussian kernels often lead to better results than using
a linear kernel, this comes at a massively increased cost of run-time and memory
requirements. The second advantage is the opportunity to parametrically balance
the influence of differently sized datasets during training. Before discussing this in
more detail, some conceptual basics are covered.
Basics of two-class SVMs
A two-class SVM model is trained on datasets of two classes. Objective of the
training is to find a hyperplane between the two classes which separates them with
a maximal margin. The data points of each class with feature vectors closest to this
margin are denoted as support vectors. In our setting, those two classes correspond
33
to the reports retrieved from the benign and the malicious URLs. For each report
the feature vectors are extracted. Once the model is trained on the feature vectors
of those two datasets, an unknown report can be classified very fast. For this pur-
pose the feature vector of this unknown report is extracted. Afterwards its position
and distance from the hyperplane allows to predict the corresponding class mem-
bership. Figure 3.4 depicts an example of benign (white) and malicious (black)
data points, where different models have been trained on. The dashed line of the
middle image represents the separating hyperplane.
Figure 3.4: Visualization of different models achieved by different values of ω.
Two-class learning methods work best if the two classes have a comparable
size. If one of the classes is much bigger, however, this approach often leads to
suboptimal models. In that case one-class learning is often the better approach,
allowing to learn a model on a huge dataset of one class alone.
Transition to One-Class SVMs
Two-class and one-class learning methods both have their advantages and short-
comings. Focusing only on unsupervised one-class methods means ignoring our
available malicious data, which makes calibrating the detector much harder. Fo-
cussing only on the supervised two-class methods, however, leaves the problem of
the imbalanced size of both datasets. To achieve a smooth transition between both
methods, the weight parameter ωin included into the model selection phase of the
learning method. A high value of ωincreases the weight on one specific class,
which means that the two-class SVM operates like a one-class SVM for this class.
This is done by applying a higher penalty for misclassifications of this class. It is
also possible to choose ωsuch that it balances two differently sized classes. An
example of the models resulting in using three different values of ωis illustrated in
Figure 3.4 on a toy data distribution.
The utilization of ωduring model selection facilitates a direct comparison of
the detection performance of a two-class learning method with both a benign one-
class learner and a malicious one-class learner, simply by applying different values
of ωduring model selection. A hope is that a properly defined ωcould lead to a
model better adapted to the imbalance of the two classes. As a result, neither the
34
one-class nor the two-class models would be expected to be the optimal solution,
but instead an optimized model in between.
More details on the formulation of the optimization problem and the decision
functions for the One Class and the Two Class SVMs, as well as their similarities
and differences are provided in Chapter 7.1.
3.4.3 Learning Framework
The long-term dataset acquired during this study, as well as the different selected
features and learning methods allow for an extensive experimental evaluation. This
whole evaluation is conducted in an offline manner under utilization of the SVM
library LibLinear [FCH+08] for training the model. The extraction of the token
n-gram features from the static and the dynamic reports leads to a static and a
dynamic detector, respectively. Taking insights from CUJO, the parameter nis
fixed to n= 4 for the static reports and to n= 3 for the dynamic reports.
To evaluate the behavior of the different detectors during the whole time pe-
riod of the study, the system is trained and tested weekly, i.e. every seven days
all models are re-trained. To cope with the circumstance that the number of mali-
cious reports is much smaller than the number of available benign reports, all past
malicious reports are used during the training phase, but only the last two weeks
of benign reports are used. This two week dataset is halved, such that the newest
reports constitute the training data, while the older reports constitute the validation
data. On these datasets a model selection for both the static and dynamic models
is performed over the cost parameter cand the parameter ω. Then for each model
the hyperplane is calibrated as follows: First the false positive rate of the current
model is calculated on the validation data. Then the hyperplane is shifted such
that a preselected false positive rate on the validation data, FPval(Θ), defined by a
threshold Θ, is not exceeded. Finally the model with the highest true positive rate
on the validation data is selected.
Being able to define Θis a vital component of training a learning based de-
tector, suitable for the network environment at hand, because some environments
simply require lower false positive rates than others. When testing the best model
this targeted false positive rate is achieved with only minor deviations. Conse-
quently no achieved false positive rates are shown in the evaluation, as they are
close to the target values defined using the parameter Θ.
In our practical evaluation we pick a value of FPval(Θ) = 0.001 (i.e. 0.1%)
as a good compromise of low false positive and high true positive rate. For some
experiments, however, also FPval(Θ) = 0.0001 is considered to highlight specific
properties. For convenience the Θwith FPval(Θ) = 0.001 is further on denoted as
Θ0.001, and Θwith FPval(Θ) = 0.0001 is denoted as Θ0.0001. Note that the com-
bination of the collection framework of Section 3.3.1 and this learning framework
realizes a fully automated system for collecting, analyzing and detecting malicious
JavaScript code.
35
3.5 General Performance Evaluation
The evaluation starts by investigating, whether learning-based detectors can eas-
ily be integrated in a completely automated tool-chain without manually sanitized
data. This is empirically supported by this section, which provides a long-term
evaluation of the performance of the different detectors and features utilized in our
learning framework. First of all the detection performance of two commonly used
anti-virus tools is determined on our datasets, which is compared to the perfor-
mance of our learning framework. In the next step the performance of our learning
framework on a sanitized dataset is shown. After that examples and reasons for
misclassified reports and the influence of ωare discussed.
3.5.1 Performance of AV-Scanners
The two anti-virus tools AVIRA ANTIRVIR and AVG ANTIVIRUS have been tested
on the original JavaScript code of our complete benign and malicious datasets.
Both employ different analysis engines capable of detecting maliciously behaving
JavaScript code. Table 3.2 shows the results of this analysis.
Avira AVG
JavaScript Code Antivir Anti-Virus
Benign URLs FP 0.0007 0.0003
Malicious URLs TP 0.2760 0.3140
Table 3.2: False positive and true positive rates of two anti-virus tools.
The achieved true positive rates are quite low. This result comes as a surprise,
considering the extensive initial verification steps of our system, and considering
the circumstance that the anti-virus tools have been applied several months after
the last day of data collection, which gave the anti-virus vendors enough time to
update their signatures. The achieved false positive rates range between the tar-
geted false positive rates of FPval0.001)and FPval0.0001). This permits an
easy comparison of the corresponding true positive rates to those achieved by our
detectors.
3.5.2 Performance of Detectors
Additional to the performance of the individual static and dynamic detectors, the
performance of a disjunctive combination of both detectors is evaluated as well.
This combined detector triggers an alarm if either the static or the dynamic detec-
tor does so. The resulting average performance values of the different detectors,
obtained from weekly re-trained models tested on the following week, are listed in
Table 3.3.
While the static and dynamic detectors are nearly on par, the combination of
both is significantly higher. In comparison to the performance of the anti-virus
36
Detector Θ0.0001 Θ0.001
Static 0.3525 0.5409
Dynamic 0.4931 0.6208
Combined 0.5451 0.6733
Table 3.3: Average detection performance of the different detectors.
tools, all detectors show a good detection performance. Especially the combined
detector is able to classify approximately twice as much malicious URLs correctly.
Surprisingly however, none of the considered features or learning methods attains
a detection rate of more than 90%, as reported from previous work, which used
manually sanitized datasets. To investigate this further we decide to create a subset
of the malicious JavaScript code using the anti-virus tools as an additional saniti-
zation instance. The assumption is that previous work always used manually sani-
tized datasets, so this way of automatic sanitization is expected to result in a better
detection performance. In this new AV-Alerts dataset only those 890 URLs are in-
cluded which both anti-virus tools raised an alert for. The results of the evaluation
of our learning-based detection methods on this dataset are listed in Table 3.4.
Detector Θ0.0001 Θ0.001
Static 0.7264 0.8446
Dynamic 0.7394 0.8480
Combined 0.8450 0.9319
Table 3.4: Average true positive rates of the different detectors, tested on the AV-
alerts.
Especially the combined detector shows an impressively increased detection
rate for both values of Θ, reaching a performance much closer to that of methods
which are solely tested on manually sanitized data. This illustrates that under
the assumption of a sanitized dataset the results of previous papers can be repro-
duced.
The application in a completely automated system, however, drastically de-
creases the detection performance. We see two main reasons for this unexpected
behavior. While the verification phase of the collection framework acts properly
in defining, whether the scripts of a potentially malicious URL really behave ma-
liciously or not, we can not be certain that the specific malicious behavior SHE-
LIA detected is also detected and reported by ADSANDBOX. Because the learned
model builds on those reports of ADSANDBOX, it learns features that do not con-
tain the initial malicious behavior any longer. We denote this as the semantic gap,
because it is caused by the discrepancy of detection and learning mechanisms. The
second reason is of a temporal kind. Because JavaScript attacks are very volatile
and often active for only a few hours, any delay between the verification step and
the creation of the reports may lead to a worse learned model, because the attack
37
may already be inactive again and the malicious code not present anymore. We
denote this as time delays.
3.5.3 Misclassification Analysis
There exist different reasons for misclassifications. One general reason is that the
datasets are noisy, i.e. not all members of a class behave in accordance to their
label. For the benign reports this means that, as explained in Section 3.3.1, the
verification phase for the JavaScript code of benign URLs does not guarantee to
exclude all malicious JavaScript code. Due to the semantic gap this is even more
complex for the reports of malicious URLs. SHELIA might respond to a type of
malicious behavior, which the static or dynamic reports of ADSandbox do not re-
flect. As a result the dynamic and static reports look benign. The dynamic reports
might even be empty or have completely failed to execute. To test this last as-
sumption, the malicious dataset has been further filtered, leaving only those 2,179
malicious URLs which did not result in errors in the dynamic execution. When
testing the dynamic detectors on this dataset, the average detection performance
increased only very slightly (approximately 1.5%), meaning that these errors in the
dynamic execution do not influence the quality of the reports significantly.
Figure 3.5: Histograms of the predicted scores of the optimized models of the
static (top) and dynamic (bottom) detectors on the complete dataset. The dashed
line represents Θ0.001.
To get a better idea of the concrete reasons for misclassifications in our system,
we analyze the false positives and the false negatives that occurred using a rep-
resentative model on the complete dataset. The histograms of the corresponding
predicted scores of the static and the dynamic detectors, as well as the concrete
values of Θ0.001, are illustrated in Figure 3.5. As a result it is found that many of
the false positives actually are malicious. Concrete examples of code injection, the
38
dynamic execution of long obfuscated sequences, redirects to suspicious websites,
hidden iframes and heapspraying occurred. Those false positives that have really
shown a benign behavior often contained features similar to malicious features, e.g.
dynamic execution of obfuscated code or hidden iframes, which renders a proper
classification difficult.
The verification process of the malicious URLs is more elaborated, especially
due to the incorporation of SHELIA. Therefore none of the false negatives is ex-
pected to contain benign JavaScript code. For this reason it is an indicator of the
semantic gap that many samples of the false negatives did not expose any behavior
at all. Other misclassifications have been caused by malicious code looking very
similar to benign code, i.e. omitting features usually found in malicious reports,
like heapspraying, code injection or the dynamic execution of obfuscated code.
3.5.4 Performance of Different Learning Methods
Another interesting question is the impact of the class-weight parameter ωon the
performance of the different detectors. Specifically, whether the detection perfor-
mance of the normal two-class model can be optimized by this step, and how the
different one-class approaches perform. The results are listed in Table 3.5. The
low performance of the different one-class models does not come as a surprise.
The learned model simply can not rely on one class only, and the calibration does
not work that well either. The detection performance of the two-class model is
boosted by the use of the optimized ω, resulting in much better results than those
of the generic two-class model. The optimal models of our evaluation always use
values of ωcorresponding to a model in between a benign one-class model and a
two-class model. An analysis of the development of the optimal values of ωduring
the re-training setup also shows a steady shift of ωtowards the benign one-class
model, caused by the continuously growing number of malicious data points.
Θ0.001
Learning Models Static Dynamic
One-Class Benign 0.0375 0.0303
One-Class Malicious 0.1314 0.1757
Two-Class 0.4560 0.4794
Two-Class Optimized 0.5409 0.6208
Table 3.5: Average true positive rates of the different models.
Integrating a class weighting into the training phase of a detector to balance
differently sized datasets improves the overall performance. The insight that the
chosen optimal ωresides closer to a benign one-class learning model is helpful as
well, because the benign one-class detector is due to the stability of its database,
i.e. the better availability of benign data, more desirable than a model trained on
continuously changing malicious data.
39
3.6 Analysis of Re-Training Procedures
As this aspect greatly impacts the possibility to achieve a highly automated system
in practice, we will now discuss the influence of different re-training procedures on
the performance of the different detectors, aided by the previous long-term evalua-
tion. The functionality and quantity of JavaScript code in real-life websites grows
steadily, as we can see in Figure 3.3. A model should reflect this, because a model
trained only once might be out-dated very soon. Thus a learning-based detector
which is regularly re-trained on the latest datasets is assumed to achieve better
models. The biggest disadvantage of this approach is the continuous requirement
to regularly spend time and resources to compute an updated detector. While such
learning efforts can be minimized, e.g. by using incremental learning [LGKM06], a
confirmation of the assumed advantage of regular re-training in the domain at hand
has yet to be done. For this purpose the following sections focus on a comparison
of the long-term performance of the different detectors, either utilizing frequent
re-training or conducting a one-time training only.
3.6.1 Regular Re-Training
In the re-training setup, the detectors are re-trained each week and tested on the
following week. The corresponding overall average performance results have been
discussed in Section 3.5.2. In the Figures 3.6 - 3.9 the weekly average is repre-
sented by the dashed line. The light vertical bars contained in the figures visualize
days where for none of the 8,000 verified malicious URLs an actual malicious
behavior could be exposed.
Figure 3.6 illustrates the performance values of the static detector. A first ob-
servation is the immense variance of both false and true positive rates. The picture
is slightly different for the dynamic detector, illustrated in Figure 3.7. Its false pos-
itive rates show much less variance, while its true positive rates fluctuate massively,
even more than the ones of the static detector. An especially interesting observa-
tion is, that the true positive rates of both the static and dynamic detector are weak
in the beginning, while generally improving the average performance afterwards.
This is caused by the low number of malicious data available for training at the
very beginning of the evaluation. With the steadily increasing number of malicious
reports available during training, however, better models are achieved. Another in-
teresting observation is the performance drop of the models closely following day
78. At that day a huge amount of URLs has been updated after a down-time of the
collection framework, which had an impact on the dynamic detector, but not the
static detector.
The long-term performance of the combined detector, depicted in Figure 3.8, is
more stable than the one of the individual detectors. Especially the periods of low
performance of the dynamic detector are nicely backed up by the static detector.
The general false positive rate has doubled as well, but the combination of the
detectors still performs best.
40
Figure 3.6: False (top) and true (bottom) positive rates of the static detector per
day with regular re-training, using Θ0.001.
Figure 3.7: False (top) and true (bottom) positive rates of the dynamic detector per
day with regular re-training, using Θ0.001.
41
Figure 3.8: False (top) and true (bottom) positive rates of the combined detectors
per day with regular re-training, using Θ0.001.
3.6.2 One-Time Training
In the one-time training setup a single model is trained on both the static and the
dynamic reports, respectively. Because a sufficient amount of malicious data is
vital for achieving a good detection performance, the model is not trained on the
first weeks, but on the seventh, where a sufficient amount of malicious data is
finally available. These models are then tested on all consecutive days. Note that
for reasons of comparability the average detection performances of the previous
tables are calculated on this range as well.
The dynamic detector exposes some interesting properties when comparing its
re-training performance, depicted in Figure 3.7, with the corresponding one-time
performance, depicted in Figure 3.9. The re-trained models show a much more
varying performance than the model which has been trained only once. Especially
the true positive rate is much more stable. The implication for the dynamic detector
is, that a re-trained model is not necessarily better than an existing one. The static
detector does not expose such varying behavior.
In terms of the long-term detection performance during the one-time training
setup, an initial expectation was that applying an old model on newer data could
lead to a steady performance decrease. This could not be observed for the static de-
tector, where the false positive rate remained very stable and even the true positive
rates follow closely the development curves observed in the re-training setup. The
performance of the dynamic detector in Figure 3.9 does comply with that initial
expectation a little more. The false positive rate, while stable for quite some time,
starts to increase from day 100, and the true positive rate is below the average for
42
Figure 3.9: False (top) and true (bottom) positive rates of the dynamic detector per
day, in the one-time training setup, using Θ0.001.
nearly two weeks at that time. These results suggest that using the detectors alone
for such a long time without re-training is no viable option.
Detector Θ0.0001 Θ0.001
Static 0.2444 0.5027
Dynamic 0.3371 0.5531
Table 3.6: Average true positive rates of the different detectors applied during one-
time training.
Table 3.6 finally shows the average results during the evaluation of the one-
time training period. The performance values are generally lower than those of
the re-training experiments, especially for Θ0.0001 (see Table 3.3 for comparison).
The implications for the static detector are that while the detection performance
does not decrease or vary that much during the one-time training setup, its average
performance is considerably lower than the one achieved with regularly re-trained
models. The average performance of the dynamic detector also suffers, but as
discussed above, the lower variance of the detection performance of the dynamic
detector makes a less frequent re-training of this detector a viable option.
3.7 Conclusion
This chapter investigated the feasibility to combine a learning-based system for
the detection of malicious JavaScript code with a completely automated system
43
for the collection and analysis of JavaScript code. The behavior and detection
performance of different learning-based detectors, based upon different features
and learning methods, have been evaluated on a huge set of automatically collected
data, where previous work solely used manually sanitized datasets.
The results of this evaluation have shown that the vision of a complete, auto-
mated, learning-based system has not been completely achieved. Two main factors
have been identified as limitations to this approach and its results. The first reason
is the semantic gap occurring when malicious activity is present during the visit
and verification of a webpage but no longer during the subsequent learning stage.
Already a slight discrepancy in this information transfer means that malicious ac-
tivity may be present but is not exposed to the learning method. The second reason
is the time delay, caused by the volatility of JavaScript attacks which are often
active for only a few hours. Due to the large amount of data accumulated and pro-
cessed during the visit and verification of the URLs, a significant amount of time
may pass between the verification of a malicious webpage and the resulting learn-
ing stage. Consequently the malicious code may not be present anymore, which
results in a sub-optimal detection model, because it is trained on incomplete data.
Fortunately these limitations can be overcome by creating an integrated system
which combines the components of collection, analysis and detection very closely.
As a result the complete behavior, exposed during the collection and verification
of the malicious URLs, should be available to the learning component with a min-
imum time delay.
Furthermore the evaluation has shown that better models can be learned by in-
tegrating the class-specific weight ωin the model selection and taking care of a
sufficient amount of malicious data. Combining different detectors is also impor-
tant, because they often supplement each other. And finally it has been shown that
a regular re-training mostly results in a better detection performance than using a
single model for a longer time period.
To bring autonomous learning to reality, a critical step is the design and devel-
opment of an integrated analysis and learning system. Besides that another impor-
tant idea is the investigation of a more intelligent way of re-training the different
models based on their comparative performance. For example one could re-train
one detector regularly, but instead of relying completely on the newly trained de-
tector, just use it in parallel to the current one. Also methods of online learning are
an interesting option for that purpose.
44
Chapter 4
Classification of
Structured Validation Data using
Stateless and Stateful Features
Summary
To reliably identify problems impacting the service quality and system
dependability of mobile communication networks, the monitored data needs
to be validated. This chapter proposes and evaluates analysis methods, fea-
tures and learning methods for the automatic validation of such data, with
a special focus on failure data of mobile communication data. This data
can be analyzed for discriminating failures caused by problems in the infras-
tructure (valid failures) from those caused by other circumstances like de-
vice imperfections (invalid failures), with the purpose of filtering the invalid
failures, which effectively increases both dependability and value of the un-
derlying data. To represent the complex structural and temporal properties
of the mobile communication data, two complementary feature representa-
tions are proposed and compared, followed by a discussion of classification
methods which are suitable for these feature spaces and for an interpreta-
tion of their results to support manual auditing. Their classification perfor-
mances on these feature spaces are evaluated and compared to competitive
approaches. In the evaluation a classification performances of up to 97%
AUC-ROC is achieved. This renders our approach a good alternative to us-
ing manual matching rules, which require costly expert-knowledge and are
much more time-consuming to define and maintain - while also highlighting
the relevance of combining feature spaces of different problem perspectives.
Additionally it is shown that using non-proprietary data analysis can enable
feature representations nearly as expressive as those created by using propri-
etary analysis methods, which allows a broader application of the proposed
methods, due to the lower processing requirements.
45
4.1 Introduction
The mobile telecommunication sector has grown rapidly in the last decades, with
billions of mobile devices all around the world [CLC+16] utilized today. Operat-
ing all those connections, in rural or urban areas, for static or fast-moving clients,
requires a well developed telecommunication infrastructure. Optimizing the cov-
erage of this infrastructure to reach as many people as possible, while maintaining
the required quality of service, is an ongoing process, as users, locations and uti-
lized protocols fluctuate. Therefore regular controls of the service quality data are
needed to get an updated overview of the current state of the infrastructure, specifi-
cally on the most relevant locations, like densely populated urban areas. While the
network service providers could in principle try gathering this kind of data directly
on the customers devices, this is hindered by data privacy protection rules. Addi-
tionally the service providers are interested in measuring specific parameters like
signal quality or bandwidth, which can only be derived via low-level access to the
corresponding mobile devices. There might also be areas where the service quality
is so low, that as a result data gathered from customers devices would simply not
contain enough information to draw any useful conclusions.
For those reasons infrastructure data acquisition campaigns are conducted on
a regular basis, either by the mobile service providers themselves or by third party
contractors, which often provide analyses on this data, resulting in actionable re-
sults for the provider. Given the amount of time and money required for gathering
this kind of data, a high reliability of its labels is indispensable to assure correct
subsequent analyses. Failures, i.e. communication sequences which could not be
successfully completed, are of specific relevance for the service quality and system
dependability, as their analysis might point towards more severe problems in the
infrastructure, e.g. coverage holes or insufficient interworking or malfunction of
individual network elements on the involved protocol levels. Hence the detection
and explanation of failures is highly relevant for the operator of the infrastructure,
since they can potentially be solved once identified, before negatively affecting
a large group of customers. For those reasons they are labeled as valid failures.
On the contrary, sequences might also fail due to erroneous measurement environ-
ments or third-party server problems. Such invalid failures are not desired and need
to be detected and removed from the data, as they are not relevant for the network
provider, due to their relation to the measurement environment and other causes
outside of its access. Discriminating those two types of failures is a crucial first
step of data sanitization, resulting in a cleaner and better interpretable data foun-
dation, as removing the invalid failures also strongly increases the dependability of
the network system, since the valid failures provide a less noisy and more reliable
data source for further analyses.
Current practical solutions for discriminating valid from invalid failures rely
on manual analysis and labeling by domain experts, which is time consuming and
costly. To accelerate and replace this manual analysis, we are proposing and eval-
uating different feature types and learning methods to achieve this objective with
46
a semi-automatic approach. While the analyzed learning methods have been ap-
plied in the fields of NLP, process mining or sequence learning, they have not been
applied to the classification problem of valid and invalid failures. Thus evaluat-
ing their classification performance on this specific problem, and discussing the
practically relevant aspects is one part of the motivation of this chapter.
While the specificity of the classification problem does allow using prior knowl-
edge to define handcrafted features for separating the valid from the invalid failures,
their use is limited and expensive. For defining such features, the communication
data needs to be analyzed to extract statistics and key performance indicators (KPI),
which can then be used to find relevant properties. Besides the computational costs
and the costs of manually defining such features through expert knowledge, this
approach also produces features which are highly focused on specific data aspects,
while potentially ignoring other factors in the traced data, which might addition-
ally support the classification result. Therefore this chapter is also motivated by
the need to evaluate the potential of applying a freely available communication
data analysis method, instead of relying on proprietary analysis methods, which
further reduces the cost (both in human and computational resources) of creating
meaningful features. As the analyzed validation data offers both textual and se-
quential properties, specific stateless and stateful features are defined. which focus
on these different data properties respectively. This allows an analysis of the nature
of this type of data, by comparatively evaluating the effectiveness of each of those
feature spaces individually, as well as combined. Consequently the results of this
analysis are relevant for other fields, as they can be transferred to similar types of
data.
Our approach is semi-automatic in that it initially requires manually labeled
samples in the first phase of a new campaign. On this data classification mod-
els are trained, which can then be applied automatically on later sequences of the
same campaign, supporting or replacing the otherwise required expensive manual
analysis. This general processing is illustrated in Figure 4.1.
Figure 4.1: Main components of the processing chain.
Summarizing these motivational aspects, the analyses in Chapter 4 offer the
following contributions:
proposal of a system to semi-automatically solve learning problems on struc-
tured temporal data, applied to the use-case of the classification of mobile
communication validation data
comparative discussions of the required pre-processing, and evaluation of the
potential of replacing expensive proprietary data pre-processing with com-
47
putationally less expensive pre-processing, provided by freely available ana-
lysis and pre-processing tools
proposal and analysis of suitable complementary stateless and stateful fea-
ture spaces for structured temporal data, which are evaluated comparatively,
both individually and combined, using different classification methods com-
monly used by competitive approaches in related research fields
additional discussions on the practical applicability of these feature spaces
and classifiers, with a focus on system interpretability and reliability
Chapter 4 is structured as follows: This introduction is followed in Section 4.2
by a discussion of the related work. The datasets and feature spaces utilized are
discussed in Sections 4.3 and 4.4, while Sections 4.5 and 4.6 discuss the com-
pared learning methods and the consequences of their practical application. Sec-
tion 4.7 competitively evaluates the classification performance of the proposed fea-
ture spaces, and Section 4.9 concludes this chapter.
4.2 Related Work
This section will discuss the related work and research, specifically the types and
applications of stateless and stateful feature spaces for structured temporal data and
token n-gram features. The learning methods utilized in related research fields will
be introduced separately in Section 4.5, as they have to be discussed in the concrete
context of the selected analysis methods and feature spaces.
4.2.1 Feature Spaces for Structured Temporal Data
Our main focus is to provide two complementary feature spaces obtained by pre-
processing structured temporal data. We are talking about structured temporal data
in the sense, that individual samples have to consist of a sequence of events, where
each event is structured by and consisting of lexical and non-lexical tokens, where
a token is a sequence of lexical and non-lexical characters. There exist various
methods utilizing stateful and stateless data representations to solve problems on
similar data. Stateful feature spaces are characterized through the way their data
can be represented as a set of states and labeled or unlabeled transitions between
those states. Exemplary types of this data representation are finite-state machines
[Min67, Hop71], graphs [CSRL01, SB14] or event-based process representations
[VDAADM+11]. The flow of executed source code can also be represented this
way, e.g. with call graphs as in [GDDC97, GYAR13]. Network communication
data can also be represented in this manner, as it follows strict protocol defini-
tions, which can be used to define such states and transitions. Once learned on
a set of recorded network communication data, a stateful data representation can
be used for different purposes, e.g. to learn and analyze the behavior of this spe-
cific network. Examples of this can be found in [CYLS07, CWKK09, CPWK06,
48
CPC+08, KGKR12, LJXZ08, NBFS06, WCKK08], where the authors used state-
ful data representations to learn, replay or simulate network behavior, and also
in [CSS+10, GWY+15, LMD05, WBC10], where more specific communication
patterns, like those of botnets or malware have been learned successfully.
Often such complex types of data representation are not required, especially
when the discriminative properties required for the objective, e.g. multi-class clas-
sification, do not rely on state transitions. In that case stateless data representa-
tions offer a compelling solution. Instead of creating complex feature models, e.g.
by explicitly encoding state transitions, single samples can easily be represented
as individual feature vectors. Due to its lower complexity, stateless data repre-
sentations allow a wider application of learning methods, while still achieving an
often remarkable degree of representativeness of the information contained in the
data, and correspondingly a high accuracy of the models trained upon it. Exam-
ples of stateless data representations and their use for training models of specific
learning methods can be found in [CPWK06, BGH14, KKR11]. Both types of
data representation can be retrieved from varying types of sequential or temporal
data, e.g. from network traces like pcap dumps [II07], from programming lan-
guage source code [RKD10, SBKR12], or from pre-processed data, e.g. analysis
files which are the result of static or dynamic analysis, as for example explained
in [DHF10, RTWH11]. Whereas the static analysis of raw data can be straight
forward, e.g. by parsing lexical data into a direct textual representation, dynamic
analysis is much more complex, involving the execution or replay of code blocks or
network communication, allowing for a much more detailed analysis of the respec-
tive process flow. To reproduce state transitions, complex dynamic analysis is often
required for a stateful representation of the data. However, it can also function as
a good basis for a stateless feature representation. As a result, the feature space
might be stateless - but its dynamically analyzed source data contains relevant in-
formation about the process states and its transitions. In [SBKR12] we showed for
different types of network communication data and programming language code,
that for solving classification problems, most often an initial static analysis is suffi-
cient, while more complex data types require additional details of the process flow,
only available through dynamic analysis. Finally the authors of [ABCM09] used
a proximity-based feature extraction for log files, which works best on data with
homogenous features with very discriminative positions, but is less suitable for our
more heterogenous, less position-dependent data.
4.2.2 Token n-gram Features
When trying to achieve a stateless feature representation, the most obvious way
is to project the lexical data into a vector representation. If there are numer-
ical values contained in the data, and their range is known, one could start to
parse those values and create numerical feature vectors based on them, e.g. as
in [SR11]. If this is not the case, or if the data is more mixed, the more gen-
eral n-gram feature representation can be used successfully. n-gram features are
49
sequences of nconsecutive characters or character sequences, like byte n-grams,
consisting of nconsecutive bytes, word n-grams, or token n-grams, where a to-
ken denotes an arbitrary byte-sequence. The basic idea is to see n-grams as a
way to integrate local context into the description of the data to increase its in-
formation content. This type of feature representation has first been used in nat-
ural language processing [BPX+07, GSZ13, PSC15, WM12], even including the
use of dynamic values of nlately, e.g. in [LXLZ15]. In recent years they have
been applied to other fields as well, especially to those relying on lexical data
representations similar to those of natural language processing, like the analysis
of network communication data for IT security purposes. Some examples of re-
search in this field are the works [Rie09, RTWH11], where n-gram feature spaces
of natural language processing are applied to intrusion detection problems. In
[ORLS14, PAF+09, WPS06, WS04] the authors use different types of n-grams
to successfully analyze network communication for detecting server-side attacks
or intrusion attempts, while in [RKD10, LŠ11] the authors use token n-grams for
identifying malicious JavaScript code in web pages and even PDF documents. A
similar approach of discussing the impact of n-gram properties of the dataset on
the classification performance can be found in [WSAR13]. Although token n-
gram feature spaces are very high-dimensional, the computing power of todays
processors, coupled with modern learning methods like support vector machines
[SPST+01, TD04] or neural networks [Bis95, MOM12] enabled using those fea-
ture spaces for solving complex classification tasks. Additional tricks further ease
the use of those high dimensional feature spaces, e.g. using efficient data structures
like Bloom filters [Blo70], using hashed feature vector representations [WDL+09]
or hash kernels [SPD+09], or further reducing the dimensionality by applying ad-
ditional sanitization steps during pre-processing.
4.3 Datasets
For our analysis we are using real-world mobile communication log datasets which
are recorded an analysis of mobile network infrastructure coverage in Germany. As
such the data is highly representative for and comparable with similarly collected
network data. The data sets cover 22 consecutive days of data collected in the year
2014 and 25 days of 2015. The data was gathered by a fleet of cars, equipped
with roof-mounted antennas and sets of multiple android smartphones. To bet-
ter reproduce the geographical diversity, mobility and varying radio conditions of
smartphone usage in practice, the cars are moving during the execution of those
tasks. Each of the smartphones runs a dedicated measurement application which
automatically executes different sequences of mobile communication. The result-
ing communication data is traced and stored in two formats. Pcap log files store
the actual data transmissions, while Debug-logs store events of the participating
smartphones, network events and also early dynamic analyses of additional prop-
erties like the success of network events like hand-shakes or transmissions. Each
50
of the test sequences consists of a set of consecutively executed tasks. The nature
of these service tests is to analyze the data service quality, as opposed to e.g. voice
service quality. Therefore each of those tasks conducts a specific test, like connect-
ing to a video-hosting website (e.g. www.youtube.com) and download a specific
video, or connecting to a news website and visiting a number of articles. All of
this requires the use of different protocols of the TCP/IP-stack, i.e. TCP, DNS and
HTTP.
Based on this data one can conduct analyses of different properties of the tested
mobile network, e.g. on aspects like cell coverage, signal strength or ping statis-
tics. An important type of analysis is the sequence validation, i.e. the verification
whether the sequence realized a successful communication. In practice this can
be solved heuristically, e.g. by setting and requesting flags for crucial protocol
flow positions, s.t. their completeness defines a successful sequence. However,
sequences that have been verified to be unsuccessful are more relevant - and more
problematic. The samples we are focusing on are of this problem class, here de-
noted as failures. Those failures can be analyzed to discriminate valid and invalid
ones. Invalid failures can be caused by measurement errors like an erroneous test
methodology or a malfunctioning packet sniffer, but also by factors outside of the
access of both the measurement environment and the network provider. Examples
for this are unavailable third-party servers which are required in test sequences, or
unsupported protocol types like HTTPS. Because the measurement tools rely on
specific events and triggers within the test sequences to continue their processing,
HTTPS redirects are particularly problematic as they hide those events and triggers
within their encrypted content. Valid failures on the other hand are those that are
neither directly related to measurement problems nor to problems with third-party
servers, and are therefore relevant for the infrastructure provider. Such failures
range from application problems (e.g. failing replay of online videos), connection
errors and server problems to problems with the protocol stack, like TCP or DNS
problems, like unmatched pairs of SYN-ACK. Hence these failures are caused by
the network infrastructure itself and are therefore of the highest interest for the
network provider, because they point directly to weak points in their infrastructure.
Equipped with that knowledge, the provider can fix them and thereby improve their
network and communication service quality.
The concrete task to solve on this data is the classification of those valid and
invalid failures. For our data set class labels are available. Those labels have been
created manually by experts, which is a hard and time consuming task due to the
variety of causes and the required expert knowledge. Fortunately the problem can
be approached by using supervised classification methods, thus essentially labeling
the data automatically, s.t. in the best case no further manual analysis is required,
and in the worst case at least subsequently conducted manual analyses are eased
and lead to more reliable results faster.
We are discussing two different types of data representation - debug log data
and pcap log data - which need to be pre-processed independently. The debug-logs
have the advantage of logging different types of events and are also containing
51
results of early analyses processed during logging. As such they include many
details which can not be easily obtained after the measurements are finished. A
small disadvantage of this approach is that models learned on data collected this
way are not easily applicable to data from other sources. We call the data created
by using this analysis approach debug data, as its proprietary properties are closely
related to reporting logs of debugging tools.
The format of the debug logs of the 2014 dataset is illustrated in the two exem-
plary events below, showing the format of the utilized sequence timestamp, mes-
sage index, message timestamp and the data, consisting of a keyword in square
brackets, and the actual message. The debug data of both years contains additional
tokens, which add noise to the data. Those tokens (marked as underlined) are re-
moved, as they are either redundant or do not contain information relevant for the
classification task, but are instead based on information about the measurement
itself, which could be learned by the model, and may lead to false assumptions,
e.g. that a specific time range has a high probability for the occurrence of a certain
class, which would hardly be transferrable to data of a different time period.
2014-07-23 13:02:02.484 0000312
MsgTime: 2014-07-23 13:02:02.484
[MasterQueue] Removed cast: 8990956
2014-07-23 13:02:02.723 0000313
MsgTime: 2014-07-23 13:02:02.723
[QOS EVAL] QOS-BASE: SUCCESS
([SMARTPHONE HTTP LIVE URL]
TCP, DNS): Duration 710,00 ms
The structure of the data of the 2015 debug logs is overall similar to those of the
2014 campaign. However the occurring keywords have largely changed, as the
utilized measurement tools and their output are under constant development. These
fluctuations in the debug data format are a disadvantage in comparison to the tshark
data, as it perturbs comparative analyses between different debug data sets. The
following lines show examples of two single events of debug data of the 2015
campaign.
52
2015-02-05 16:06:51.764 0001169
[TASK] (DEBUG) service:=FServices[i]
Assigned(FSequence.FUsecase): True;
2015-02-05 16:06:52.002 0001170
[TRACING] (APP_EVENT) TloEvent:
Received; TloSource: SMARTPHONE HTTP;
TloDestination: EVALUATOR;
Event: Plugin ready; Plugin: HTTP;
To achieve a standardized, more general data representation, the binary pcap-data
is converted by applying TSHARK 1, a parametrically configurable tool for the
analysis of network traffic traces, which is able to parse the contained TCP-stack
communication, concatenate the individual frames and finally output a readable
ASCII-file. We did not extract the hex-representation, the frame-wise details or
the HTTP-payload, as these represent redundant information and are not required
for a proper data description. Using the non-proprietary TSHARK for processing
the network traces has the advantage of increasing the model compatibility to data
collected by other methods, which eases a potential model transfer, at the cost of
eventually missing some details, which are included in the proprietary and more
dynamic analysis approach. We call the data created by using this analysis ap-
proach tshark data. The exemplary snippet below illustrates two events, i.e. two
lines of a tshark log-file.
8319 137.194419
10.51.71.246 -> 173.194.44.95
HTTP 542 GET / HTTP/1.1
8320 137.836234
173.194.44.95 -> 10.51.71.246
TCP 68 80 -> 58780 [ACK] Ack=5
Win=159 Len=0 TSval=1049 TSecr=1
We can see that this data format is very compressed, containing primarily the IP-
addresses of both clients and the respective content, and does therefore not require
removing additional fields. While both debug data and tshark data share an event-
based, sequential data representation, their features differ. Due to its proprietary
nature, the debug data contains many details about the execution of specific appli-
cations, the connection to certain websites and services, and even statistical details
about the incoming and outgoing traffic. While the tshark-data contains signifi-
cantly less detailed data, the availability of the required tools is its biggest advan-
tage. As such, one of the objectives of this chapter is also to analyze, whether a
1tshark - the wireshark network analyzer
53
freely available analysis tool like TSHARK is able to produce analysis logs equally
expressive as the proprietary debug-logs, which would enable its application to
other fields that also allows access to pcap-data of its communication.
Data set Data type |Svalid| |Sinvalid|
2014 debug 4,550 393
tshark 4,045 373
2015 debug 11,887 318
tshark 11,743 316
Table 4.1: Sizes of the data sets of valid and invalid failures
Table 4.1 lists the number of samples of both classes of each dataset under uti-
lization of these analysis approaches. The 2014 data offers less samples than the
2015 data. The differences in the numbers of samples between both datatypes of
one dataset exist, because some samples were analyzable using the proprietary de-
bug approach, while giving empty results when using the tshark approach, caused
by the missing support for the analysis of specific network features in tshark, as
well as the already mentioned deeper insights into the data, possible only with the
debug approach. Due to these two different pre-processing approaches, the result-
ing log files differ in both their statistical properties and in their contained token
sets. The implications of those differences for the classification problems are dis-
cussed in the following section about the resulting feature spaces.
4.4 Feature Spaces
This section introduces the feature extraction methods which define complemen-
tary stateless and stateful data representations to capture different data properties.
Figure 4.2: Conceptual example of the stateful Γs.ful (left) and stateless Γs.less
(right) feature approach for 4events of a structured sequence sample
54
Assume a sequence sconsists of kevents e1, e2, ..., ek, where each event e
consists of up to ltokens t1, t2, ..., tl. Then the stateful feature space focuses on
representing the state transitions of the process flow within the sequence (intra-
sequence state transitions, i.e. the transitions between the events e) to finally dis-
criminate both classes. The stateless approach is not aware of these transitions, but
instead represents the sequences using a detailed static, high-dimensional token n-
gram feature space (based on the intra-event tokens t) to model the data. Figure
4.2 illustrates these concepts. Using this approach creates two feature spaces which
complement each other, as they rely on different data properties. Additionally the
reliability of the failure classification process is increased by selecting interpretable
feature spaces. Finally we will discuss differences between batch and online pro-
cessing of the feature space, which is highly relevant for practical applications.
4.4.1 Stateless Features Γs.less
To achieve a high generalization, token n-grams are utilized for representing in-
dividual features of our samples. To describe the data in terms of token n-grams,
each sample sis first represented as a list of lexical and non-lexical tokens. Moving
a window of ntokens over this representation allows extracting all token n-grams
of length n. We assume to have a dataset S, defined by its mindividual samples
s1, . . . , sm. The function ϕ(s, n)extracts the unsorted token n-gram features of the
single sample sS. The function Φ(S, n)agglomerates and sorts them to yield
the feature space of the unified sets of token n-grams: Φ(S, n) = m
j=1 ϕ(sj, n).
Now we define fiΦ(S, n)to be the ith dimension in the sorted token n-gram
feature space. Based on that we can create a binary feature space. Using the final
projection function Γs.less(s, Φ(S, n)) we can project each sample sinto the token
n-gram feature space Φ(S, n),fiΦ(S, n), s.t.
Γs.less(s, Φ(S, n)) = 1,if scontains fi
0,otherwise.
Exemplarily we can project a subsequence of a sample
ˆs=’HTTP 542 GET HTTP/1.1’
into the subset of a feature space ˆ
Φ(S, n)for n= 2 as follows:
Γs.less(ˆs, ˆ
Φ(S, n)) =
1
1
1
0
for ˆ
Φ(S, 2) =
’HTTP 542’
’542 GET’
’GET HTTP/1.1’
’TCP 68’
Finally we normalize the feature vector by the vector sum of Γ(s, Φ(S, n)), s.t. the
vector norm equals one. This helps balancing longer and shorter samples. As a
result we gain feature vectors for all samples sS, which have the same length
55
of |Φ(S, n)|, i.e. the number of token n-grams in Φ(S, n), because all of them are
projected onto the same feature space Φ(S, n).
Before these token n-gram projections of both debug and tshark representa-
tions of the mobile communication data can be created, their individual token sets
have to be extracted. To separate the individual tokens in the lines of the debug- and
tshark logs, the following delimiters are selected: ()[]:;, *=."%<>. Specifi-
cally the = is highly important, as it allows creating semantically relevant pairs of
variable and value, which are then included in the corresponding token n-gram. To
keep the notation simple, we are using from now on Φ(S)if the choice of ncan be
arbitrary, or simply Φif the choice of ncan be arbitrary, and the required dataset
Scan be derived from the respective context. After tokenizing the data we need
to quantize strings tokens representing large numbers, also known as binning, i.e.
grouping data into bins. It is used to enable discrete feature spaces and reduce their
overall dimensionality. Without quantization, the overall number of token n-gram
dimensions would fast become very large and computationally costly. Furthermore
a string-comparison between numbers only matches if both are identical and not
just metrically close. Therefore it is required to quantize different numerical fields
in the log-files, especially the variable values in the tshark data and the relative
timestamps of each task. In the tshark-data, the numerical values of the following
parameters have been quantized (i.e. divided and converted into an Integer-value):
the values of Seq (i.e. the sequence number) by 10, the values of Ack and Win
by 100 and the values of TSval and TSecr by 1 million. This approach projects
similar values into the same quantization bins, which allows using them as quali-
tative features. The quantization sizes have been selected s.t. a sufficient number
of values is projected into the same bin. The tshark-data contains timestamps of
the individually executed tasks as relative values, i.e. starting at zero for each new
task. Using such relative timestamps (as opposed to absolute ones) enables the
comparison of inter-task and inter-sample behavior. To increase the comparability,
each of those relative timestamps is quantized as well, using a bin size of 1 second.
Again the idea is to project temporarily close moments into the same bins, similar
to a temporal sliding window, and thereby enabling their use as qualitative features.
It may seem counterintuitive to restrict the existing numerical values by applying
quantization. Actually it would be possible to extract those specific numerical val-
ues and learn separate numerical models on them, but this is neither the focus here,
nor is it easily compatible with the use of token n-gram vectors.
Data set |Φ(Sdebug,1)| |Φ(Stshark,1)|
2014 138,940 199,577
2015 547,535 315,606
Table 4.2: Sizes of the Stateless Token Feature Spaces
Properly handling the size of the feature space to achieve a compromise be-
tween size and expressiveness is a major concern when constructing a feature
56
space. Reducing the size of the feature space helps in achieving lower compu-
tational costs, but bears the risk of removing dimensions relevant for specific data
set properties. The sizes of the resulting token feature spaces (i.e. n= 1) are
listed in Table 4.2. The 2015 campaign contains much more tokens for both types
of data representation. For the debug-data this is mostly due to the fact, that for
the 2015 campaign a lot more keywords were introduced in the debug protocol.
Additionally the individual test sequences of the 2015 campaign are also slightly
longer, which increases the feature space sizes of both the debug and the tshark data
representations. For our experiments we are using a token n-gram size of n= 2,
because even n-gram sizes tend to represent pairs of parameters and values nicely,
present and highly relevant in both types of data representation. To decide on a
good value of n, the ratio fpm(Φ(S, n)) for the stateless feature space Φ(S, n)of a
given data set Sand a selected nis calculated, which is the ratio of token n-grams
with parameter-value pairs to all token n-grams within this Φ(S, n):
fpm(Φ(S, n)) =
tΦ(S,n)
fpm(t)
|Φ(S, n)|
with fpm(t)allowing to find adjacent parameter-value tokens as follows:
fpm(t) =
1if i {2, ..., n}with
ˆ
ti1∈ Rˆ
tiRˆ
ti1Rˆ
ti∈ R
0else
where {ˆ
t1, ..., ˆ
tn}are the tokens in the token n-gram t.n-grams of size 2also offer
a good compromise between feature space size and expressiveness, which is espe-
cially important when combining the feature vectors, as large differences in their
respective dimensionality would decrease their feature compatibility, reducing the
performance of the classifiers trained on these combined feature vectors. The top
plot in Figure 4.3 shows these respective values of the parameter-value ratios for
the discussed data sets, for a range of relevant values of n. The bottom plot in
Figure 4.3 additionally illustrates the sizes of the feature spaces for token n-grams
with n[1, ..., 5]. We can see the values of the debug-data having a lower gradient
than those of the tshark-data. This is caused by the higher n-gram variance in the
tshark data, and specifically in the very heterogenous tshark 2015 dataset.
Using this approach to define Γs.less also has some limitations.Γs.less is de-
fined by a single properly sized nfor the definition of its token n-grams, which
specifically aims for a high parameter-value ratio. As a result, a larger context
defined by a token sequence larger than nmight not be sufficiently represented.
While using multiple or larger values of nwould help alleviating this problem, it
would also lower the influence of the parameter-value pairs, and is therefore less
optimal. Further limitations are introduced by choosing fixed quantization values,
which were chosen here as a compromise to allow both sufficient feature discrimi-
57
Figure 4.3: Stateless Parameter-Value ratios (top) and Feature Space Sizes (bottom)
for the discussed data sets
58
nation and a sufficiently low feature space size, but may vary in other applications
with different value distributions.
4.4.2 Stateful Features Γs.ful
In the stateless feature space all occurring token n-grams are used to define the
feature space and project the samples. Now we want to focus on finding combi-
nations of states and transitions reflecting the stateful behavior of each sample, to
project the data sets into a stateful feature representation. The states should reflect
measurements at consecutive moments and should also allow constituting similar-
ity between similarly behaving samples. A shared property of the log-files is that
each of their lines can be separated into their respective string tokens, and that
each line represents a single, consecutive event (or state), temporally located by its
timestamp. We use combinations of selected tokens shared between the samples
to define those states, and their sample-wise order to represent the transitions of
the underlying state-machine, resulting in a stateful representation of each sample.
Note that this representation is similar to event-based process representations, as
commonly used in process mining [VDAADM+11], which motivates the upcom-
ing selection of learning methods used in this related research field to enable a
comparative classification performance evaluation.
Dataset |Φ(Sdebug)| |Φ(Stshark)|
2014 45 3,181
2015 65 3,462
Table 4.3: Sizes of the Stateful Token Feature Spaces
Similar to the procedure for the stateless features, we denote a function Φ(S)to
extract the token combinations relevant for representing the state-based semantics
of each sample. For the tshark-data we start with the steps described in Section 4.3,
i.e. tokens which are irrelevant for the stateful feature space, like timestamps or
frame numbers, are removed, as long as the consecutive order of the log file entries
is kept intact. Tokens that look most promising in terms of the protocol state-
machine are the utilized protocols and ports. The IP-addresses are not so useful, as
they are hardly generalizable due to their variation over different smartphones and
campaigns, restricting the potential to compare data of different devices. However,
the tokens representing the utilized protocols and ports (like HTTP 540) allow a
detailed analysis of the state behavior within the individual sample, while being
highly comparable between different samples and even across campaigns.
The approach for the debug data is similar. For the debug 2015 data there are
up to four interesting tokens, which are used by the debug-system itself to describe
its different states. These are hierarchical tokens like: [ADB], [APP_EVENT],
[TloEvent], [Received]. The debug 2014 data does not use such hier-
archical debugging keywords, but instead uses single, non-hierarchical keywords
59
like [COMMAND], [DASHBOARD] or [SYSTEM_LOAD]. The resulting feature
space sizes of both debug and tshark and debug data are shown in Table 4.3.
In contrast to the Γs.less features, the token combinations collected here by
Φ(S)can not be directly used to create feature representations, as they would have
different lengths for all samples, which is based on the number of events per sam-
ple. However, feature vectors of identical length are a necessary requirement to
allow processing by subsequently applied learning methods. We will use Φ(S)to
create the Γs.ful(s, Φ(S)) feature vector representations to achieve this goal. As
this strongly depends on the selected learning method, we will use Section 4.5.2 to
further discuss this topic.
By restricting the set of tokens utilized to define the states, this approach of
Γs.ful obviously has its limitations. One limitation is that such an approach prefers
data sets with more tokens and more possible token combinations for defining
states, which allow a more discriminative description of the occurring transitions.
Furthermore this approach also assumes that the respective state-based behavior of
the process state-machine is sufficiently reflected by these tokens, which may pose
to be a problem in practice, if the utilized analysis tools do not provide the required
analysis depth to achieve such a detailed description.
4.5 Learning Methods
The choice of learning methods is largely constrained by two factors. The learn-
ing methods need to be able to (a) perform well on our feature spaces to be suit-
able for the task, with primary objectives being reasonable computational costs
and scalability with |S|to perform well in a practical application, and (b) also
need to provide interpretable results to further increase the dependability of the
model, s.t. they can be used in a processing chain with a manual analysis step in
post-processing [BBM+15, BSH+10]. If there are only unlabeled training sam-
ples available, past research [II07] has shown that unsupervised learning methods
like one class support vector machines [SWS+00, MMR+01] can work well on
token n-gram feature spaces [WSAR13] of abstract source code feature spaces
[YWGR13]. Once labels are available, however, better results are achieved using
supervised classification methods like Support Vector Machines [CV95, STC04]
or Neural Networks [LXLZ15], especially for network communication data, e.g.
in the domain of IT security [WSAR13]. For this reason they should be preferred
if the sample labels are available. If the samples are only partially labeled, ap-
proaches like active learning [TK02] and the close relation of specific supervised
and unsupervised learning methods, as elaborated in [SBKR12] open interesting
approaches for using both methods complementary. In terms of interpretability,
the Support Vector Machines also have the advantage of allowing nice interpreta-
tions of the model and classification results [HMG+14]. Due to those properties,
their overall good performance and the availability of labeled samples of the in-
dividual valid and invalid failures, Support Vector Machines will be our primary
60
learning method. More details on the optimization problem, the decision function
and the use of kernels for Support Vector Machines are provided in Chapter 7.1.1.
For a comparison with the classification performances with other relevant and
commonly used learning methods, we are additionally evaluating a Multi Layer
Perceptron Classifier [PVG+11] on both the stateless and the stateful feature spaces.
MLP are widely used in process mining [MLI+15, LWR+10], which makes its
classification performance relevant, as the stateful feature representation is similar
to the event-based process representation widely used in process mining. But this
method is equally interesting for the performance comparison using the stateless
features, as it originates from the field of NLP sequence classification [Col02]. Due
to its non-linear and non-convex loss function the interpretability of its results is
harder than that of the SVM though, which is why we do not focus stronger on
this classification method. More details on the definition and application of MLP
classifiers are provided in Chapter 7.1.3.
4.5.1 Learning Methods on Γs.less Features
Kernel methods are known to perform very well in supervised scenarios (e.g. [CV95,
STC04]). We use a linear kernel, since research in [YHL12] and also our own re-
search has shown that using a linear kernel with qualitative features yields good
results at low computational cost, if the feature space is sufficiently large and the
feature representation of each sample is sufficiently non-sparse. Both of those con-
ditions are met here. Thus a model of a two class SVM is learned on the feature
vectors of a subset of the training data, Γs.less(Strain). Since it is a supervised
learning method, labeled samples of both relevant classes are required. The ob-
jective of the training is to find a hyperplane, separating the classes pairwise with
a maximal margin. The hyperplane can be described by its normal wand its dis-
tance from the coordinate origin, denoted by a bias term b. The samples of each
class defining the borders of this margin are denoted as support vectors. Once the
model is trained on Γs.less(Strain), a new sample sis easily classifiable by calcu-
lating its direction and distance from the hyperplane via a prediction function, s.t.
sbelongs to the positively labeled class if w, s+b > 0and to the negatively
labeled class if w, s+b < 0. Further details of support vector machines can be
found in Chapter 7, as well as in [MMR+01]. On the stateless feature space we
are additionally evaluating a Bayes classifier, commonly used in process mining
[PSBDL18], which predicts the most probable class affiliation based on the scalar
product of the respective feature vector with the class-wise term-frequencies.
4.5.2 Learning Methods on Γs.ful Features
The evaluation of the failure classification on sample projections into a stateful
feature space Γs.ful requires several processing steps to achieve feature vectors of
equivalent length, enabling the application of support vector machines. After in-
dexing the defined states an index for each line of the log-file is obtained, resulting
61
in a feature vector. To yield feature vectors of identical length, we can interpret the
sequences as directed graphs, with the state tokens defining the nodes ν, connected
by transition edges defined by their order. This allows us to create an adjacency
matrix Auin an unsupervised manner, i.e. without concern about the label of the
individual log-file, resulting in a feature space capable of representing both classes.
Aucontains the probabilities of the transitions between the state nodes. E.g. the
nodes νiof a state-space feature vector of v= [ν2, ν4, ν3, ν1]correspond to the
following consecutive transitions: [ν2ν4, ν4ν3, ν3ν1]. At this point it
is also possible to create adjacency matrices Asin a supervised manner, i.e. one
matrix for each class, which can be used to define a classifier based on the highest
class path probability in the first order Markov chain. We use this classifier as a
comparison evaluation method for the stateful feature space, which is especially
relevant, as Markov Classifiers are commonly used in the research field of process
mining [PNC11, LGN12, LSD+15, ULD16, LCDF+15].
To be able to apply a support vector machine on this feature representation,
sparse sample-wise feature vectors operating on the same feature space are re-
quired, i.e. the feature vectors need to be of identical length. The current stateful
representation of the samples based on their state transitions does not fulfill this
requirement. However, each sample can be projected onto the unsupervised adja-
cency matrix Au, resulting in a sparse representation containing only the transitions
of this sample. The resulting matrix can be vectorized to finally yield equivalently
sized feature vectors. When comparing the classification performance of an SVM
with a non-linear kernel, operating on feature vectors of concrete transition prob-
abilities of Au, with that of an SVM with a linear kernel, operating on vectors
of binary values for each occurring transition within Au, the linear kernel SVM
consistently performs better, which is why we use the latter for our subsequent
experiments.
4.5.3 Learning Methods on Γcomb Features
Since both proposed feature spaces focus on different data properties, their com-
bined expressiveness is highly interesting. For that purpose both qualitative Γs.less
and Γs.ful feature vectors of each sample are concatenated, yielding a combined
feature vector Γcomb(s),sS. This representation is then evaluated with the
SVM and MLP methods, as described above. The idea is that both feature vectors
should complement each other, giving a classifier trained on them the opportu-
nity to optimize for feature combinations between both of them, finally achieving
a higher or more robust classification performance than classifiers on each of the
individual feature vectors alone. We tested this second approach by using an en-
semble approach [Die00] of combining the results of classifiers of both Γs.less and
Γs.ful feature vectors disjunctively. As this did not perform competitively at all,
the respective results are not included in the evaluation.
62
4.6 Details for the Practical Application
This section addresses topics which are relevant to achieve a working practical
application and the required level of acceptance for the proposed system, namely
the differences in two re-training approaches for the system, and the possibilities
of interpreting the final classification results with respect to the most important
features.
4.6.1 Batch vs. Online Classification
While the classification approach proposed here could also be extended to inter-
campaign classification (i.e. training on the data of one campaign to predict sam-
ples of another campaign), this introduces new problem like partially incompatible
feature sets due to model shift and concept drift [SK12, VBMKM09]. As a result
we are only evaluating single corpus, single campaign classification problems, in
which a single corpus of logged and pre-processed data is used for both training
and testing of classification models. To obtain robustness of the classifier, it is cru-
cial that the campaign-specific parameters utilized during the data logging process
of recording the network traffic, as well as the parameters for the processing of the
resulting log files have not been changed during this process. Those parameters
could include logging different IP layers, changing the degree of details, or adding
or removing specific parameters of the logging environment. Under this premise
we can assume a high inter-similarity for the samples of the corpus. The task is
now to train a classification model over feature projections of a data set of labeled
samples, Strain, to predict the labels of a set of samples Stest. In practice this
problem can be approached with a batch or an online methodology, the choice of
which depends on the available computational resources and the time constraints,
i.e. how fast a predicted label for a new sample is required.
Data: Strain, Stest
Result: Classification Scores of Stest
for Strain,Stest do
Create Φ = Φ(Strain)
Create Γ(s, Φ),s {Strain, Stest}
Train Mon all Γ(s, Φ),sStrain
Classify Γ(s, Φ) with M,sStest
end
Algorithm 1: Batch Classification Algorithm
The single-pass batch-approach, described in Algorithm 1, creates the feature
space Φand its projections Γ {Γs.less,Γs.ful}over the complete set of Strain,
trains the classification model Mon the projection Strain, and finally allows for
classification of the projections of Stest. This differs in some major aspects from
the online approach, shown in Algorithm 2. Here we assume not to have Strain
63
and Stest available instantly, but instead receive subsets of Strain and Stest consec-
utively, which we will denote as ˆ
Strain and ˆ
Stest. While this sounds cumbersome,
it is practical reality: while the campaign data is collected, new samples arrive all
the time. While one could wait until all data is available and a manually labeled sub
set Strain is created, this is not done in practice. Instead continuously new sets of
samples are received, of which some are manually labeled ( ˆ
Strain) and others are
unlabeled ( ˆ
Stest). The online approach is designed to allow to directly incorporate
these new strain ˆ
Strain into the model, while also allowing to directly classify
the new stest ˆ
Stest based on the feature space defined so far. As a result the fea-
ture space over Strain can not be created once in a single pass, but a more dynamic
multi-pass approach has to be used, where for each new set ˆ
Strain the feature space
and the projections are recalculated, and the model retrained. This enables the re-
quired optimization process to take the dimensions of all labeled samples Strain
into account. Once this is done, the new stest ˆ
Stest can be classified to get a pre-
diction as fast as possible. Since later feature spaces Φare based on a larger set of
Strain, their resulting models become better and better. Therefore it is reasonable
to repeat the classification, even for already classified samples of Stest, to increase
the classification accuracy.
Data: ˆ
Strain Strain,ˆ
Stest Stest
Result: Classification Scores of stest Stest
Strain =,Stest =
foreach ˆ
Strain,ˆ
Stest do
Strain =Strain ˆ
Strain
Stest =Stest ˆ
Stest
Create Φ = Φ(Strain)
Create Γ(s, Φ),s {Strain,ˆ
Stest}
Train Mon all Γ(s),sStrain
Classify Γ(s)with M,sˆ
Stest
end
Algorithm 2: Online Classification Algorithm
Each of those approaches has advantages, but also introduces some limitations.
Whereas the online approach allows the fastest classification of new samples, it re-
quires re-defining the feature space with each new ˆ
Strain, which results in a higher
computational cost and a slighty lower classification accuracy. If the practical ap-
plication allows for a certain time buffer, then the batch approach is more suitable.
It requires collecting all available labeled samples before being able to classify
Stest, has a lower computational cost as a result, and has also a better classification
accuracy. Additionally the model optimization in the batch approach tries to opti-
mize a larger number of dimensions than the respective optimization in the online
approach - at least, until the last labeled ˆ
Strain has been processed. This leads to
the batch model being more susceptible to over-generalization, while the online
model is more susceptible to over-fitting. In fact the batch approach is a special
64
case of the online approach, with ˆ
Strain =Strain and ˆ
Stest =Stest.
All of this has to be taken into account when applying both of those approaches
in practice. We recommend that one could start with the online approach to yield
results fast, while being aware of a potentially high false positive rate. The abso-
lutely highest-scoring sequences can be treated with the highest confidence in the
correctness of the classification results, while absolutely lower-scoring sequences
should be held back for further analysis. In this step the visualization described
in Section 4.6.2 can be of great help. After a certain number of test sequences
has been collected, the models converge towards those of the batch approach and
should be applied to the samples of Stest - at least to those held-back sequences.
In practice the decision between the batch and the online approach further depends
on the inter-sample similarity of the respective data, eventually requiring learning
methods capable of online learning (e.g. [LGKM06]). Since our data has a high
inter-sample similarity and thus a low expected model shift, we are using the batch
approach, as it provides a sufficiently exact classification performance baseline, as
well as easier to maintain and better interpretable classification models.
Creating both stateless and stateful feature space and projecting individual
samples into it has a computational complexity linear to the number of tokens
in S,O(|Φ(S, 1)|). This computational complexity remains for the batch classi-
fication. The complexity becomes O(|Φ(S, 1)| · |Strain| · |Stest|)in the case of
strain =ˆ
Strain and stest =ˆ
Stest, which means it still remains linear to the num-
ber of tokens - even when using the online classification algorithm with subsets
consisting of single samples.
4.6.2 Interpretation of the Results
As the practical applicability of our method is highly relevant, we also focus on
finding ways to increase the explainability and interpretability of the learning pro-
cess and the results. Interpretability of the model and the samples in the defined
feature space is very important to achieve a trustworthy model, as only such a
model will be dependable enough to be applied in practice. We therefore include
the capability to inspect the selected dimensions and weights of the model on
whether they are reasonably chosen. We also describe a confidence ranking to
estimate, which samples are more reliably classified correctly - and which are less
reliable, and may thus need to be inspected closer. The research of the interpreta-
tion of models and results of learning methods is of increasing importance, as the
adoption of complex models and methods of machine learning is becoming more
widespread. Exemplary research is found e.g. in [RSG16, VKMG17], where the
authors discuss methods for explaining the models and samples of various learning
methods, and in [BBM+15, BSH+10, KSA+17, MBKM13, MLB+17, MSM18],
where specifically interpretative methods and their evaluation for non-linear kernel
methods and neural networks are discussed.
The use of linear kernels in our Support Vector Machines allows the inter-
pretation of both the model and the classification results through analyzing the
65
Figure 4.4: Positively contributing dimensions ˆw+(left) and negatively contribut-
ing dimensions ˆw(right) of weight vector ˆwand corresponding features of sam-
ples s1and s2
weight vector wof the model, which is defined over the utilized feature space
Γ(Strain), with Γ {Γs.less,Γs.ful,Γcomb}. As the classification is done via the
scalar product of wand s, the classification outcome is determined by either a few
large weights, a large set of dimensions with smaller weights - or a combination of
both. To find those dimensions, wcan be analyzed directly, e.g. by sorting dimen-
sions according to their weights in w, and then start with analyzing the dimensions
with the largest values, assuming they are largely important. However, we are in-
terested in those dimensions contributing most to the classification results - which
are not necessarily those dimensions of wwith the largest weights. Vector wof
the trained model relies on the co-occurrence frequency of the dimensions found
in Strain. Relevant dimensions which occur highly correlated within Strain will
have lower weights in w, while relevant dimensions which occur uncorrelated in
Strain will have higher weights in w. This influence can be removed by utilizing
the covariance matrix over Strain, calculating ˆwT=wT·cov(Strain), thereby in-
creasing the interpretable weights of wfor the highly correlated dimensions. This
gives us a final new ˆw, which allows detecting absolutely relevant dimensions -
irrespective of the underlying Strain. More details of this approach are discussed
in [HMG+14]. By increasing the sparsity of vector wthe interpretability can be
further improved, as this removes the noise of dimensions with smaller weights.
This can be achieved by using the L1 norm during model training, which reduces
the overall number of support vectors and thereby the density of w, leaving only
the most relevant dimensions for interpretation.
Figure 4.4 depicts an exemplary visualization of the interpretation of the weight
vector ˆw. The black bars in the colored rows ˆw+and ˆwdepict the non-empty
dimensions contained in the weight vector ˆw, which are contributing to a positive
(ˆw+) and a negative ( ˆw) classification score, respectively. They are displayed on
a log-scale, ordered descendingly from left to right. To increase the interpretability,
the scales are centered around the bias b. The ranges of the scales are also limited,
such that only the numerical range is covered which is also covered by concrete
weights in w. For the examples s1of class -1 (invalid failures) and s2of class +1
(valid failures) this means, that dimensions matching with ˆw+(green area) increase
the probability of the sample sto achieve a classification outcome w, s> b, i.e.
belong to valid failures (the positively labeled class). Dimensions matching with
ˆw(red area) increase the probability of the sample to achieve a classification
outcome w, s< b, i.e. belong to invalid failures (the negatively labeled class).
With s=ˆw, sand s+=ˆw+, sbeing the absolute sum of the negatively
66
and the positively contributing matching dimensions, this corresponds to what we
see for the exemplary samples s1and s2, as s
1> s+
1indeed leads to a negative
classification for s1, whereas s
2< s+
2leads to a positive classification for s2.
This visualization is specifically useful in practice, as already mentioned in
Section 4.6.1. Once the test sequences are classified, the respective samples can be
sorted according to their classification score. By plotting the corresponding ˆwvisu-
alization for each sample at its side, we achieve an overview over all samples, with
those dimensions highlighted that contributed most to the classification result. This
allows to get a fast overview over the whole test set, which dimensions are shared
between samples and which are most relevant for each of both classes. Due to
its ordering it also helps in selecting potential candidates for a succeeding manual
analysis, as those samples with the absolutely highest scores are already classified
correctly with the highest probability, whereas those with a score close to the bias,
i.e. samples close to the decision hyperplane of the SVM should either be inspected
manually to assure a higher classification performance, or re-classified later with
potentially better model trained on a larger set of Strain. Section 7.4.1 contains an
additional discussion with more detailed examples illustrating this concept.
4.7 Evaluation
For the evaluation of the classification performances we are using the ROC-AUC
values (i.e. the area under the receiver operation characteristics curve). Using
the ROC-AUC allows us to calculate a performance metric over our unbalanced
classes, because the ROC-curve relies on the percentage ratios of the true and false
positive rates. As a result its AUC provides an upper bound on the classification
performance, achievable by a model defined by choosing a specific point on the
ROC-curve, corresponding to a specific compromise between false and true posi-
tive rates. Training a classifier for a certain value of true or false positive rate can
be achieved by using a calibration process on a separate validation data set. Cali-
brating a classifier for a practical application always requires a preference decision:
a low false positive rate or a high true positive rate. Since this decision strongly
depends on the specific practical use case, which can prefer either one or the other,
we are not conducting this calibration, but instead rely on the ROC-AUC value.
The method we introduced in Section 4.6.2 additionally allows interpreting the
results of the classification process in practice, which is specifically relevant for
the samples near the decision boundary and helps in further improving a selected
model calibration process. While learning a calibrated classifier is easily done on
our proposed classification methods by shifting the bias term bduring the calibra-
tion, it is not easily possible on the competing learning methods we selected for the
evaluation. Therefore, and because of its other advantages, we think directly using
the ROC-AUC performance metric for performance comparisons is more suitable
than using concrete values of false and true positive rates of calibrated models.
As described in Section 4.5, comparisons of different classification methods
67
have already been conducted in other domains. Supervised learning using support
vector machines often achieves good and robust results, which is why we designed
our feature spaces to work with them. Additionally we also conducted experiments
using unsupervised learning, specifically one-class support vector machines, to
learn models representing valid failures, which can then be utilized to detect deviat-
ing samples, namely the invalid failures. However, the classification results proved
to be worse than those of our supervised approaches, which is mainly caused by the
circumstance that discrimination of both classes is much better when using features
of both classes - and not only features of the dominating class. This high relevance
of features specific to the class of invalid failures is also exemplarily visible in the
visualization described in Section 4.6.2, where the average size of the weights of
ˆωis larger than those of ˆω+.
The number of collected invalid failures is similar for both data sets (between
316 and 393 samples). The number of valid failures varies however, ranging from
4,045 to 4,550 for the 2014 data set, and from 11,743 to 11,887 for the 2015 data
set. To adapt our evaluation setup to these circumstances and to assure compara-
bility of the results, we randomly extract for each evaluation run the same number
of samples for both data sets. We are using 300 samples of valid and 100 samples
of invalid failures, split in equally sized training and test sets.
4.7.1 Evaluation Results
Section 4.3 explained how the test cases are designed to execute different process-
ing chains on the underlying protocol layers and their inherent states and transi-
tions, while Section 4.4 illustrates how the stateful feature space focuses on de-
scribing the transitions, while the stateless feature space focusses on describing
details at the single states. Therefore we have a high variance in the properties
of the 2014 and 2015 campaigns, huge differences in the their respective debug-
and tshark-analysis data, and strongly varying size and expressiveness of the re-
sulting stateless and stateful feature representations. As such we expect largely
varying evaluation results over all those properties, which will be discussed now.
The results of our evaluations on the 2014 campaign data are shown in Figure
4.5, whereas the results of our evaluations on the 2015 campaign data are shown
in Figure 4.6. Additionally Table 4.4 shows the precise results achieved on both
campaigns.
When analyzing the results, the Markov and Bayes classifiers show the lowest
overall performance. In general the SVM and MLP classifiers on the stateful fea-
ture space perform slightly worse than those on the stateless feature space, which
means the stateless features are describing the data better. Since the results using
the stateless feature space outperformed the results on the stateful feature space,
obviously its more detailed description of the individual states is more beneficial.
While this underlines the importance of selecting proper features for any given data
set, combining both types of features is even more beneficial and partly alleviates
the need for such a decision. This is illustrated by the results on the combined fea-
68
Figure 4.5: Box plots of the AUC-results (in %) of the comparative evaluation of
the feature spaces and classifiers on the 2014 campaign data (top: debug-analysis
data, bottom: tshark-analysis data).
ture vectors, which are consistently better than those on the individual feature vec-
tors, having either the highest average results (2014 tshark with SVM, 2015 debug
with SVM) or a good and practically relevant compromise of near-optimal average
results and small variance (2014 debug with SVM, 2015 tshark with SVM).
When comparing the performances of Two-Class SVM and the MLP Clas-
sifier, a slight advantage for the SVM approach can be observed. Where on both
campaigns both classifiers show a similar behavior in the results variance, the SVM
largely outperforms the MLP in terms of average classification performance. While
neural networks steadily increased their relevance and performance in recent years
on varying types of data, obviously SVM classifiers are still a valid choice for this
kind of data - with the additional advantages of easier interpretability and more
reliable convergence. We can also see that the results on the debug-analysis data
outperform those of the tshark-analysis data. While this was expected, given the
more dynamic nature of the proprietary debug-analysis data format and its larger
number of resulting features, the good performance that can be achieved with the
more generic tshark-analysis data is surprising - especially as this is a freely avail-
able analysis tool.
When comparing the performances achieved on the difference campaigns, the
results on the 2015 campaign data outperform those of the 2014 campaign data.
Since we took care of utilizing identically sized sets of Strain and Stest for both
69
Figure 4.6: Box plots of the AUC-results (in %) of the comparative evaluation of
the feature spaces and classifiers on the 2015 campaign data (top: debug-analysis
data, bottom: tshark-analysis data).
campaigns, these differences originate in the different expressiveness of the two
campaigns, represented by the differences in the number of tokens, as illustrated in
Table 4.2 and Table 4.3. In the 2015 campaign better designed and more varying
test cases lead to an improved classification performance on both types of anal-
ysis data, while the inclusion of additional keyword in the debug-analysis data
further improved the expressiveness of the resulting feature vectors, and thus the
final classification performances. Additionally the 2014 campaign contains more
heterogenous types of invalid failures, making it harder to learn homogenous fea-
ture sets. In the 2015 campaign, invalid failures - which are caused by HTTPS
traffic influencing the measurements - dominate. As they provide easier to learn
feature characteristics, this also leads to improved results on the 2015 data. Fur-
thermore we see differences in both feature spaces in that the debug features allow
a higher classification performance on the 2015 data than on the 2014 data (with
a difference of 1.9% with SVM on Γcomb), while the corresponding performance
difference on the tshark data is smaller (0.18% with SVM on Γcomb). We see the
main reason for this behavior again in the extended expressiveness of the debug
2015 features, which allow are more detailed description of the occurring behav-
ior and hence more capable classification models. Similar to what we found in
[SBKR12], we also see our expectation confirmed, that dynamic analysis - as used
for the debug data - yields better classification results.
70
2014 Debug Tshark
Γs.ful
Markov 90.96 ±1.30 76.83 ±4.46
MLP 92.24 ±1.09 93.19 ±1.95
SVM 94.07 ±1.26 93.34 ±1.51
Γs.less
Bayes 90.68 ±1.14 90.73 ±1.18
MLP 94.75 ±0.86 94.89 ±1.42
SVM 95.15 ±1.29 94.76 ±1.34
Γcomb
MLP 94.99 ±1.29 94.75 ±1.01
SVM 95.09 ±1.08 94.92 ±1.16
2015 Debug Tshark
Γs.ful
Markov 81.88 ±1.56 87.70 ±1.59
MLP 94.65 ±1.18 94.16 ±0.63
SVM 95.28 ±1.14 94.32 ±1.18
Γs.less
Bayes 89.31 ±1.62 92.66 ±1.34
MLP 96.82 ±0.88 95.18 ±0.81
SVM 96.88 ±0.72 95.17 ±1.16
Γcomb
MLP 96.91 ±0.88 94.70 ±1.10
SVM 96.99 ±0.84 95.10 ±0.98
Table 4.4: Mean AUC Results and Standard Deviation (in %) of the 2014 and 2015
campaigns.
4.7.2 Evaluation Interpretation
As discussed in Section 4.6.2, we are also interested in interpreting the results of the
SVM-classification by analyzing the semantics of the most relevant dimensions of
ˆw. Whereas the results discussed in the previous section are achieved with a token
n-gram size of n= 2, using n= 4 lead to similar results, but allows a much better
interpretation due to its larger context, which is why we are using the 4-grams for
the analysis in this section.
In the models trained on the 2015 data set, the dominance of HTTPS connec-
tions leading to invalid failures is well represented in both feature spaces of de-
bug and tshark data, with the respective features weighted highly negative within
ˆw. In the models trained on the tshark data the most relevant dimensions in the
stateful feature space are those state transitions that directly represent the use of
HTTPS via the keyword TLS, like in TLSv1 115 TCP 76 or TLSv1 115
TLSv1.2 260. In the models trained on the stateless feature space, the cor-
responding token 4-grams are similar, e.g. TLSv1 2 260 Client or TLSv1
2 385 Application. Besides those similarities between both feature spaces,
there are also features that are unique to the respective feature space. In the stateful
feature space different inter-protocol transitions are identified as causes for invalid
failures, like DNS 158 TCP 76 or TCP 753 TCP 68, whereas in the
stateless feature token 4-grams of specific transmission sequences function sim-
71
ilarly, e.g. Seq 427 Ack 5 or Seq 427 Ack 9. In this way both feature
spaces complement each other, as they focus on different aspects of the network
behavior. In practice, an additional manual analysis could start with those specific
ports or transmission sequences to find the cause and semantics of these invalid
failures, or to verify the predictions of the classifiers, as such dimensions could
also be an indicator for an overfitting during model training. Features contributing
to a positive classification (i.e. valid failures) are less specifically defined, with
state transitions and tokens like HTTP/XML 996 TCP 1416 in the stateful
feature space and 95 TCP 76 TCP in the stateless feature space.
On the debug representation of the 2015 data we observe a similar behav-
ior. There are dimensions that focus specifically on HTTPS and are contributing
strongly to invalid classifications, like Redirect to https //m in the state-
less feature space and
DBG Log HTTP_SP_EVAL GET DBG Log HTTP_SP_EVAL Redirect
in the stateful feature space. Other very specific dimensions target concrete web-
sites which reliably contribute to invalid failures, like the token 4-gram youtube
com service SMARTPHONE, but most dimensions are less specific and would
require a manual analysis to really identify their semantics. On the 2014 data the
situation is similar, with the exception that invalid failures caused by HTTPS are
not that dominant. As a result we see more general features, with the exception of
quite specific token 4-grams in the debug representation which contain concrete
error messages like Session error SA failure or failure on TCP
layer. Obviously each feature spaces focusses on different highly relevant to-
ken n-grams, which - together with the overall better classification performance of
the classifiers on the combined feature vectors - shows the potential of using those
complementary feature spaces to achieve better overall classification performances
and a better insight into the respective data semantics.
4.8 Practical Considerations
Besides the classification performance also the overall processing runtime can pose
to be a limitation for practical applications. As discussed in Section 4.4, both fea-
ture spaces have a computational complexity linear to the number of tokens within
the respective feature space. In terms of overall size and the amount of analysis
required, the stateful feature space is much less complex than the stateless feature
space, but still achieves a classification performance competitive to the stateless
approach. In that regard it could be a viable option for systems which focus less on
an optimal classification performance, but more on low computational complexity.
For an optimal classification performance though, using the stateless or combined
feature vectors is recommended. Selecting either batch or online processing in the
feature extraction allows further adaptation of the computation to the requirements
of any given practical application. Both methods are also easily parallelized due
72
to their underlying vector operations. The same holds for learning and applying
the proposed classification methods, rendering the complete processing chain a
flexible and well performing solution for solving classification tasks on similarly
structured temporal data.
4.9 Conclusion
In this chapter we have shown how to create stateless and stateful feature spaces
from structural-temporal data, resulting in feature representations specifically de-
signed to complement each other, to cover a wider spectrum of the properties defin-
ing the data, and to be suitable for a subsequent application of supervised learning
methods. We discussed their properties and consequences for practical application
and how to use them for discriminating data classes based on the different repre-
sentation of their behavior. We competitively evaluated the classification perfor-
mances on data representations of real-world data sets of mobile communication
sequences, pre-processed using a non-proprietary and proprietary analysis meth-
ods. We showed that using the less expensive non-proprietary data analysis can en-
able feature representations nearly as expressive as proprietary ones, which widens
the applicability of the proposed automatic learning approach. We also evaluated
competitively the classification performance of various classifiers commonly used
in the related fields, operating on the different feature space projections of the
created data representations. A good classification performance was achieved both
absolutely and in comparison to competitive approaches, specifically from the field
of process mining, which highlights the suitability of the proposed analysis meth-
ods, features and learning methods for the otherwise manually solved problem of
communication validation data classification. We also showed that the best final
classification performance can be achieved by combining both stateless and state-
ful features. To further pronounce the practical applicability we also showed how
to interpret the model to allow discriminating reliable from unreliable classification
results, allowing to potentially decide for a succeeding manual analysis or later re-
classification.
Besides their advantages, the proposed approaches also have different limiting
requirements, like selecting appropriate values for the quantization and nto de-
fine the Γs.less token n-grams, using data sufficiently expressive to provide state-
defining token combinations for Γs.ful, adapting to the overfitting of the batch
learning approach or the additional computational requirements of an online learn-
ing approach, or selecting the appropriate feature space based on the given compu-
tational restrictions. However, their advantages outweigh these limitations.
In terms of future work it could be promising to further enhance the comple-
mentary effects of both stateful and stateless features and create a better combined
feature space, e.g. by gathering the most relevant stateless and stateful data prop-
erties directly during feature extraction. Further improvements could be thought of
by integrating neural networks as learning methods. While interesting approaches
73
like [MCCD13] allow learning on textual data using feedforward or recurrent neu-
ral networks, and methods like [ERF16, DL15] use LSTM to solve learning prob-
lems on process or sequence data, there are no approaches available which are
capable of simultaneously handling the rich structural and sequential properties
contained in the discussed data. These ideas will be analyzed in Chapter 5 for a
slightly different data type.
74
Chapter 5
Feature Spaces and
a Learning System for
Structural-Temporal Data
Summary
The service quality and system dependability of real-time communica-
tion networks strongly depends on the analysis of monitored data, to identify
concrete problems and their causes. Many of these can be described by ei-
ther their structural or temporal properties, or a combination of both. As
current research is short of approaches sufficiently addressing both proper-
ties simultaneously, we propose a new feature space specifically suited for
this task, which we analyze for its theoretical properties and its practical rel-
evance. We evaluate its classification performance when used on real-world
data sets of structural-temporal mobile communication data, and compare it
to the performance achieved of feature representations used in related work.
For this purpose we propose a system which allows the automatic detection
and prediction of classes of pre-defined sequence behavior, greatly reduc-
ing costs caused by the otherwise required manual analysis. With our pro-
posed feature spaces this system achieves a precision of more than 93% at
recall values of 100%, with an up to 6.7% higher effective recall than oth-
erwise similarly performing alternatives, notably outperforming alternative
deep learning, kernel learning and ensemble learning approaches of related
work. Furthermore the supported system calibration allows separating reli-
able from unreliable predictions more effectively, which is highly relevant
for any practical application.
75
5.1 Introduction
Sequences of structural and temporal data combine properties of symbolic se-
quences and multi-variate time series, in that a single sequence sof length rhas
the format s= [(t(e1), e1), ..., (t(er), er)], i.e. each event eioccurs at timestamp
t(ei)within s, s.t. t(ei) = t(s)i. Additionally each sequence scan have a label
y(s). When trying to process such data with temporal properties (i.e. semantically
relevant time intervals of varying length between individual events) and structural
properties (i.e. a semantically relevant order and context of events and their rep-
resented behavior), research is often faced with different problems, like varying
sequence lengths and the lack of feature spaces that allow representing both tem-
poral and structural properties sufficiently well. With suitable feature spaces, pro-
cesses that rely on such data can be better analyzed and represented, consequently
allowing the application of a wide range of learning methods on those features.
This is true for all processes that create time-dependent structured data, like
multi-layer protocol-based network communication, whose data can be recorded
in real-time by logging systems. This allows the analysis of its structural-temporal
data, utilizing its structural properties (e.g. protocol state-machine behavior) and
its temporal properties (e.g. response timings) to continuously improve the qual-
ity of the respective communication service. The problem becomes even more
complicated when analyzing the data of multi-directional real-time communica-
tion setups, as in video conferencing systems, in cloud infrastructures [SB14] or
even in industrial infrastructure [KL14] networks. In those cases the temporal and
structural properties are contained in multiple interacting event sequences, and the
temporal properties are of vital relevance for the service quality of the system.
To be able to apply machine learning methods to solve the different types of
problems occurring on such data (e.g. classification or prediction), specific fea-
ture spaces are required that properly represent those structural and temporal data
properties and allow projecting sequences of arbitrary length onto feature vectors
of homogenous length. To achieve this we analyze the structural and temporal
properties of such highly time-dependent multi-client sequence data, and propose
a new combined feature space of homogenous length. We also show with an exten-
sive competitive evaluation and statistical analysis how this feature space succeeds
in this task.
As practical use case for structural temporal data we focus on bi-directional
multi-client real-time mobile communication data, used to solve the specific prob-
lem of detecting and predicting known sequence classes on data of both known and
unknown sequence classes. On this data all of the previously mentioned properties
are relevant, i.e. the order and context of the events are class specific and relevant,
as are both the individual and the interacting sequences of both clients, as well
as the temporal properties represented in the contained timestamps and their sub-
sequences. For this use case we introduce and evaluate a system for the automatic
classification of failures of communication sequences between mobile clients. This
system allows supporting or replacing the expensive manual classification usually
76
handled by domain experts. We conduct a detailed analysis of the practical im-
plications and requirements, especially on how to calibrate the system for a high
precision while achieving a reasonable effective recall. On this system we compar-
atively evaluate the classification performances when using the proposed feature
spaces with baseline, as well as competitive deep learning, kernel learning and
ensemble methods from the fields of process mining and sequence classification,
allowing to draw implications on their general suitability and their individual ad-
vantages and disadvantages.
While we are focusing our analyses on the specific area of mobile communi-
cation, the proposed system for the detection and prediction of pre-defined classes
of sequential data, as well as the proposed feature spaces can be applied in all ar-
eas working with structural temporal data, specifically for problems that require
the incorporation of real-time or multi-client system properties. This enables more
precise classifiers and reduces the amount of required manual analysis, whose ex-
pensive cost would otherwise prevent the scaling of such a classification system to
large scale data sets.
Summarizing the primary objectives, we aim to overcome the restrictions of
existing approaches of process mining and sequence learning by combining struc-
tural and temporal features, finally integrating all into one system. As such the
contributions of this chapter are the following:
1. Comparison of the classification performance of features and learning me-
thods commonly used in process mining and sequence classification, with
the proposed combined feature space
2. Theoretical analysis of a baseline combined feature space against the pro-
posed combined feature space
3. Proposal of a combined detection and prediction system for failure classifica-
tion of structural temporal data of mobile communication, with an additional
focus on the calibration of practically relevant metrics
Hence Chapter 5 is structured as follows: This introduction is followed by a
discussion of the related work in Section 5.2. The use case and the utilized datasets
are discussed in depth in Section 5.3, followed by the introduction of the relevant
feature spaces in Section 5.4. Afterwards the components of the proposed system
for sequence class detection and prediction are described in Section 5.5, including
the utilized learning methods. This is followed by the evaluation and a statistical
analysis of the classification system and the introduced feature spaces in Section
5.6, before finally reaching the conclusion in Section 5.7.
5.2 Related Work
We are interested in analyzing and evaluating features spaces representing the
unique properties of sequences of structural, temporal data, whose inter-event struc-
tural properties have a semantically relevant relation to each other. We do this
77
with a focus on mobile communication data, as an example of real-time protocol-
based network communication data, where both the raw data logs and logs of ad-
ditional dynamic analysis allow representing the contained structural dependen-
cies. Further details on mobile communication protocol behavior can be found in
[Sau10]. While similar analyses have been done for network communication, as
e.g. in [CYLS07, CWKK09, CPWK06, CPC+08, KGKR12, LJXZ08, NBFS06,
WCKK08], most research focusses on the structural data properties, and less on
the temporal properties relevant for the real-time execution of such processes.
This makes our research unique but also highly relevant for commercial pro-
cesses, where such approaches are still highly sought after, e.g. in the form of
process mining [VDAADM+11]. In this field different objectives are solved on
business analytics data, which often comes in a format similar to our, i.e. consist-
ing of events and timestamps. As such [PNC11] are using Markov Classifiers to
predict the time remaining to completion of a business case, which is also done
by [PSBDL18] using Naive Bayes Classification. The verification of LTL (lin-
ear temporal logic) compliance is approached in [MDFDG14] by using Decision
Trees. One large research topic is also the prediction of the next events during run-
time, for which LSTMs [ERF16], Decision Trees [CLF+14, LSD+15, ULD16],
Markov Classifiers [LGN12, LSD+15, ULD16] or Multi Layer Perceptrons (MLP)
[MLI+15] are used. In the latest research projects, [TVLRD17] are using LSTM to
simultaneously allow the runtime prediction of the next events and the prediction
of the remaining time to case completion, [LWR+10] are using MLP for detecting
service level agreement violations, and [LCDF+15] are using Markov Models and
Decision Trees for predicting the achievement of a performance objective, or ful-
fillment of a compliance rule. While these approaches sound promising, they rely
primarily on the structural information provided by the event sequence, and less
on temporal data. The high dynamic interactivity between mobile network com-
munication processes is also not easily represented by these approaches, as they
require their training data to already contain all possible process behavior. Also
these methods rely on large labeled training data sets, which can not necessarily be
provided in our envisioned setup. Also none of the methods allows dynamically
learning features relevant for predicting manually assigned class labels, which are
per definition less clear defined than labels based on the violation of concrete event-
driven constraints.
Since we are operating on sequential data, one could also use methods orig-
inally from the field of sequence labeling, where the task is to predict a label
for each event within the sequence, which was solved using methods like Hidden
Markov Models [Rab89], Conditional Random Fields [LMP01], MLP [Col02], but
recently also Recurrent Neural Networks and specifically LSTM Neural Networks
[DL15, Gra12, HGX+17]. Since in our objective data each event is already la-
beled though, we are not interested in predicting such individual labels, but instead
require predictions based on the behavior represented by the complete sequences.
Using predictions of such individual labels to describe a complete sequence label is
also no option, as a defined sequence label can depend on structural-temporal prop-
78
erties not represented in event-wise predictions. Since the event order and the inter-
event durations are semantically relevant, using methods of sequence alignment
[TLLP11] is also not directly possible. Also methods like dynamic time warp-
ing [KR05] are not directly applicable, as they are not designed to process tem-
poral data with additional structural properties, or multi-client dependencies. As
such we need methods producing a label for the whole sequence, as is done in se-
quence classification [XPK10, Die02]. Of those methods support vector machines
[LSTCW01, LS05, SRS05a, SRS05b, SCW+03, CV95, MMR+01] are a well es-
tablished method for feature-based sequence classification, which we are specifi-
cally interested in as we are trying to reflect the data properties by specific feature
spaces. SVM approaches have shown a reliable performance, specifically when us-
ing non-sequential n-gram feature spaces for the respective sequential data, as done
in the fields of natural language processing [BPX+07, GSZ13, PSC15, WM12], in-
trusion detection [RTWH11, Rie09], malware detection [RKD10, LŠ11, SBKR12]
and the analysis of network communication [ORLS14, PAF+09, WPS06, WS04].
For our feature spaces we are using different types of features to represent
the temporal and structural data properties within the feature spaces. Specifically
we are relying on token n-grams, i.e. sequences of narbitrary tokens. This fea-
ture representation originates in the field of natural language processing [BPX+07,
GSZ13, PSC15, WM12], but has also been extended to network communication
[RTWH11, Rie09, ORLS14, PAF+09, WS04]. However, we are extending this
structural feature type by including additional temporal information, and by also
integrating a wider context for each token n-gram, an idea similar to the integra-
tion of additional context information as introduced in [MCCD13] for the CBOW
(continuous bag of words) and Skip-gram models.
All of this shows that our approach can be clearly distinguished from other ap-
proaches of related work. Additionally our proposed system combines supervised
methods for detecting and predicting problem classes in a novel way, allowing a
calibration to the metrics relevant in this specific problem domain, similar to what
has been done in [MLI+15].
5.3 Use Case Description
We will now discuss the properties of the data and the contained problem classes
of our concrete use case of failure classification for sequence data of mobile com-
munication, and why these are representative for the analysis of structural and tem-
poral aspects.
5.3.1 Data Set Properties
We are using mobile communication data of a specific format as a concrete exam-
ple to discuss and show the properties of sequential structural and temporal data. It
was recorded to provide a wide-ranging quality analysis of the underlying network
infrastructure. The data is collected by a fleet of specialized cars equipped with
79
roof-mounted antennas and multiple android smartphones. The mobile phones run
test sequences, which consist of automatically calling a phone within one of the
other cars, establishing a connection, playing a voice chat of 1 minute and finally
closing the connection. This whole process is monitored, recorded and consecu-
tively analyzed by a system which focuses on various key performance indicators
(KPIs) and statistics, radio frequency (RF) values and the successful completion of
the key processing steps of the respective protocol state machines of UMTS, LTE
and GSM. Since the cars are moving while all of this takes place, the recorded
data also contains switches between different transmission technologies (e.g. the
sequence may start in GSM, switches then to LTE, and finishes in UMTS). The
resulting data already allows for drawing simple conclusions, e.g. on whether
communication sequences have been successful at all (i.e. the relevant KPI and
RF values did not show negative deviations, and all relevant protocol states have
been completed successfully), whether they dropped in the middle (i.e. important
protocol states have not been completed), or whether they have failed for other
reasons.
Those failed sequences are then manually analyzed to determine their reason of
failure and potentially even its cause. This process is called failure classification,
allowing to assign a specific failure class to a failed sequence. Doing this manu-
ally by looking at many hundred log file entries is a tedious, time-consuming and
expensive task. By using statistics over the KPIs, RF values and the protocol states
for rule-based approaches, this can partially be automated, specifically for fail-
ures with simpler behavioral patterns. A more versatile machine learning approach
could however help solving this problem better, potentially allowing to cover less
clearly defined failure classes, while also adding some flexibility when trying to
find new failure classes, caused by to changes in the communication technology
backbone architecture, e.g. with the upcoming 5G [CFH14] technology.
For our analyses we collected two different data sets: the MFC data set, con-
taining manually labeled and unlabeled failure class samples, and the AFC data
set, containing failure class samples which are automatically labeled by a rule-
based approach, as well as unlabeled samples. Table 5.1 shows some of their most
important characteristics. Each contained sample represents a call sequence, which
utilizes at least the GSM, UMTS or LTE protocol, following its respective speci-
fication and call phases, which is reflected in the logged events and the respective
timestamps. Instead of using all the recorded events though, we are mainly inter-
ested in those that are relevant for the potential failure classes. Hence we are using
filters based on rules that have been defined by experts with deep domain knowl-
edge, removing all events that contain redundant or irrelevant information. As a
result we obtain a final set of base events, which together with their different states
(e.g. different reasons for a location update reject), and in combination with the
respectively utilized protocol, leads to overall 335 different event types.
Both data sets will serve different purposes in Chapter 5, because the details of
the different labeling approaches used for both data sets are expected to impact the
AFC evaluation performance negatively. The filtered events available in both data
80
MFC AFC
Average e#44.11 ±13.51 29.75 ±27.24
Main failure classes c#s#c#s#
CSFB Failure 2 48 2 370
Congestion Failure - - 1 60
Core Network Failure 1 30 8 682
E2E Failure 1 19 2 169
UE Failure 1 21 - -
Other Failures 39 86 39 360
Sum 44 204 52 1,641
Unlabeled Samples - 5,873 - 1,623
Table 5.1: Data set statistics: Number of events e#, of failure sub classes c#and
of samples s#
sets are selected to cover all important call phases and event sequences potentially
relevant for a proper class prediction. As such they are theoretically sufficient to
properly reproduce the MFC labels. However, they are insufficient to achieve a
similar performance for the AFC data, where also data of additional events, as well
as relevant KPI and RF data has been used for defining the classes, none of which
we have access to in our structural-temporal data. As a result of these restrictions
Table 5.1 shows that the average number of events per sequence is much smaller
for AFC data, and its variance is much larger, reflecting a larger class variance
and making a proper discrimination harder. Additionally, the AFC data contains
8 very similar variations of the Core Network Failure class, which are harder to
discriminate as well. Due to these shortcomings of the AFC data set, the smaller
MFC data set is more relevant for our purpose, as its labels provide a better ground
truth. Since this is a problem in the AFC data set, we will restrict our analyses
on the AFC data to those which are expected to provide valuable insights on this
data set only, namely how well the proposed feature spaces and learning methods
can still reproduce the AFC labels under these conditions, and - given its larger
number of samples - how much of a performance improvement we can expect
when increasing the training data size.
We will now discuss the format of the contained sequences. Each sample in our
data sets consists of the bi-directional communication sequence between a caller
and a callee. We denote the caller as the MOC client (mobile originated call)
and the callee as the MTC client (mobile terminated call). The samples contain
highly structured sequential data (order of events) with highly relevant temporal
components (inter-event durations), all of which are semantically relevant for dis-
criminating the failure classes, e.g. reflecting call phases being incomplete or too
long, or reflecting an anomalous order of events. Table 5.2 shows an exemplary
event sequence of a successful call, starting in LTE and proceeding and ending in
81
UMTS. This example also introduces the new timestamp format ts.rel, denoting
sequence-relative timestamps, defined for a sequence sas ts.rel(ei) = t(ei)t(e0)
(or ts.rel(s)i=t(s)it(s)0), i.e. the timestamp of the first sequence event is
set to 0.0, and all other timestamps are reset relative to this value. In the example
the MOC client is set up until ts.rel = 12.232, then the MTC client is set up until
ts.rel = 14.231. Once the clients are connected at ts.rel = 30.103 the call takes
place, before they are finally disconnected again. The abbreviations represent the
following event types: extended service (ES) request, security mode (SM) com-
mand and complete, connection management service (CMS) request, radio bearer
(RB) setup and extended service (ES) request.
5.3.2 Failure Classes
Before analyzing how the concrete sequence properties are utilized in the feature
spaces, we first need to discuss the existing failure classes and their defining struc-
tural and temporal properties in more detail, to provide a better practical back-
ground of the problem domain. Table 5.1 contains an overview and provides rel-
evant class statistics for the selected, sufficiently sized failure sub-classes of the
listed main failure classes. It also contains details about the samples of insuffi-
ciently sized failure classes, grouped together in the set of Other Failures, as well
as the additional number of unlabeled samples per data set.
CSFB Problems
A circuit switch fallback is conducted when the current network (e.g. LTE) does
not sufficiently fulfill the current connection requirements (e.g. signal strength, cell
coverage, sufficient response times) or suffers from other problems, while at the
same time an older network (e.g. GSM) is available. It can also occur when one of
the communicating clients is not LTE capable. Handing over the correct connection
state to another protocol can be problematic, though. Our data set contains cases
of failures that occurred when the call setup was not properly continued after the
location area update (LAU) and the routing area update (RAU), when the current
network did not allow a proper release for redirection, or when the redirection to
the older network simply took too long.
Congestion Problems
These problems can occur when the network is overloaded, s.t. problems with the
connection or response timings occur. In our data sets we have sample sequences
where the connection downlink disconnected too early, leading to an interrupted
connection, and other cases, where no circuit channel was available, completely
preventing to establish a connection.
82
MOC client MTC client
ts.rel protocol event event protocol
0.0 LTE ES Request
1.271 LTE SM Command
1.271 LTE SM Complete
2.385 UMTS SM Command
2.386 UMTS SM Complete
3.298 UMTS CMS Request
3.864 UMTS SM Command
3.864 UMTS SM Complete
4.273 UMTS Setup
4.826 UMTS Call Proceeding
5.192 UMTS RB Setup
5.392 ES Request LTE
6.209 SM Command LTE
6.209 SM Complete LTE
12.232 UMTS RB Setup Complete
MOC client is set up
12.577 SM Command UMTS
12.577 SM Complete UMTS
13.130 Setup UMTS
13.253 Call Confirmed UMTS
13.783 RB Setup UMTS
14.231 RB Setup Complete UMTS
MTC client is set up
14.799 Alerting UMTS
14.865 UMTS Alerting
15.721 Connect UMTS
16.254 Connect Ack UMTS
18.237 UMTS Connect
19.020 UMTS Connect Ack
29.923 UMTS SM Command
29.924 UMTS SM Complete
30.032 UMTS RB Setup
30.103 UMTS RB Setup Complete
Clients are connected and the call takes place
93.028 UMTS Disconnect
93.427 Disconnect UMTS
Table 5.2: Exemplary sequence of successful bi-directional mobile communication
83
Core Network Problems
This failure class represents more general problems, where the causes might be
similar to those of the previous failure classes, e.g. an unexpected downlink dis-
connect or problems with LAU and RAU. The previous classes, however, contain
additional semantic properties that lead to the classification as CSFB (fallback to
older technology) or congestion problem (high latency, low bandwidth), which are
missing here. Failure sub classes dominant in our data sets contain cases of unex-
pected downlink disconnect, unreachable MTC clients, or no or too slow reply to
LAU or RAU. The high similarity to other classes, as well as the fact that only mi-
nor differences exist between its individual sub classes makes discriminating them
harder, which is specifically relevant for the AFC data, with its higher number of
samples of these sub-classes.
E2E Problems
End to end problems occur beyond the scope of the core network. In our data set
two of its sub classes are prominently represented. Unexpected downlink (DL)
radio resource control (RRC) connection releases are symptoms of problems dur-
ing the downlink authentication phase of the connection, causing it to fail. This
also holds for missing downlink setup failures, which already fail at an even earlier
state.
UE Problems
Besides those network and protocol related failures, problems can also occur on
the devices themselves. The MFC data set contains samples of potential firmware
issues, leading to such problems.
Other Failures
Sequences of failure classes that contain only very few samples are not useable for
a proper evaluation. Those samples are all re-labeled as Other Failures, allowing
to use them in the model class detection step of our system.
5.4 Structural and Temporal Feature Spaces
One of the objectives of Chapter 5 is to analyze feature spaces capable of represent-
ing sequential data with structural and temporal properties, like the one detailed
in the previous section, and to propose a feature space suited to better represent
those properties. To achieve this, we discuss five different feature spaces ΓqT ,ΓT,
ΓS,ΓS+Tand ΓST .ΓqT is based on a sequential representation of the data, as
commonly used in process mining [VDAADM+11]. Since we are specifically in-
terested to additionally integrated temporal information in our feature spaces, ΓqT
optionally allows the inclusion of quantized temporal information. ΓTfocusses on
84
the temporal information in a re-ordered sequential representation, while ΓSfo-
cusses on the structural information in a non-sequential representation, essentially
using a token n-gram approach, as described in Section 5.2. Finally we propose
ΓS+Tand ΓST to show the advantages of integrating structural and temporal in-
formation complementarily into a single feature space, which is expected to allow
a better data representation, compared to using only structural or temporal features
alone. We discuss those feature spaces abstract, but also discuss unique properties
that are specifically relevant for our use case. As such some of those discussions are
exemplary adaptations of the abstract feature space properties to our concrete use
case. However, this should not be seen as a restriction on the general applicability
of these feature spaces, as they can be adapted to other use cases as well.
Nomenclature
MCiModel class i, i.e. sequence class sufficiently sized to train a classifier
s#MC Configuration parameter for the number of samples in this model class
|s|Number of events in a sequence s
¬MC non Model Class, containing samples of insufficiently sized classes
MCP Model Class Predictor
MCD Model Class Detector
ΓqT Quantized temporal feature space
ΓTTemporal feature space
ΓSStructural feature space
ΓS+TConcatenated structural temporal feature space
ΓST Structural temporal feature space
SSet of sequences s
SMSet of model sequences sM, defining the ΓST feature space model
ΘMCD Confidence rating based on the MCD
Θdb Confidence rating based on the decision boundaries of the MCP
θdb Calibration parameter for the MCP decision boundaries
5.4.1 Base Processing
All of the following feature spaces require an identical base processing, for which
we need a second timestamp format, enabling the consideration of relative time-
dependencies (i.e. local delays): the event-relative timestamps te.rel, defined as
te.rel(ei) = ts.rel(ei)ts.rel(ei1)(or te.rel(s)i=ts.rel(s)its.rel(s)i1) for a
given sequence s.
One objective for our proposed feature spaces is, that they represent multi-
client behavior within the sequential data. This is relevant for our use case, because
some failures can be caused by erroneous behavior on the MOC side of the call,
while others are caused by problems on the MTC side - or even by problems on
85
both sides, individually or combined. To reflect those structural properties, we need
to create the sub-sequences sMOC and sMT C to contain exclusively the events of
MOC and MTC respectively, with |s2C|=|sMOC|+|sMT C |. These represen-
tations allow the feature spaces to omit events of the respective opposite client.
Table 5.2 shows this behavior exemplarily for the MOC-events at ts.rel = 5.192
and ts.rel = 12.232. They are interrupted by three MTC events in the s2Cse-
quence, but are represented consecutively in the sMOC sequence. As a result, a
sequence scan be described by a triple of sequences s2C,sMOC and sMT C. If
not stated otherwise, we use ssynonymously with s2Cto denote the complete 2-
client communication. As such the example in Table 5.2 effectively illustrates an
s2Csequence. This definition is extended to the whole set of sequences. Where
previously we denoted all sequences sin a set Sas sS, we can now also
denote the sets of sequences of each different representation, i.e. s2CS2C,
sMOC SMOC and sMT C SMT C . This also allows extending the defini-
tion of ts.rel to those representations, in that ts.rel(s2C)denotes the vector of the
sequence-relative timestamps of s2C, and ts.rel(sMOC )and ts.rel(sMT C )denote
those of the client-wise sequence representations. The same holds for te.rel. Note
that ts.rel is set to 0.0for the first event of each sequence representation respec-
tively (i.e. ts.rel(s)i=t(s)it(s)0for s {sMOC, sMT C }), to allow for a better
comparability of sequences of the same client type.
Using an exemplary set of event identifiers E={A,B,C,D}allows
creating two artificial example sequences x1and x2and their timestamps in the
two formats, as shown in Table 5.3. These will be used in the next sections to
illustrate various aspects of the different feature spaces.
x1x2
ts.rel te.rel eits.rel te.rel ei
0.0 0.0 D 0.0 0.0 D
2.211 2.211 A 1.823 1.823 B
4.341 2.130 B 3.230 1.407 B
6.207 1.866 C 5.103 1.873 C
18.075 11.868 A 25.330 20.227 B
30.548 5.218 A
Table 5.3: Two artificial communication sequences x1and x2
5.4.2 ΓqT Features
ΓqT features are based on the sS2Cfeature vectors, essentially representing
the most common type of data representation used in the related work of process
mining. We extend this representation additionally by quantized temporal fea-
tures. The idea is to get a sequential representation of the event identifiers, which
is specifically suited for classifiers used in process mining, but to additionally in-
clude temporal information. To achieve this, we start with the event indices in
86
s. Consecutive events with a temporal distance closer than a predefined minimum
interval θmi maintain their current position in the event index sequence. If con-
secutive events exceed θmi, an additional empty event eis inserted, reflecting the
larger interval between those consecutive events. This is repeated until the next
event is reached. As a result, smaller values of θmi introduce more events eand
lead to larger, sparser feature vectors, while larger values of θmi introduce less
empty events, thereby decreasing the feature vector length while increasing the
density - until a completely dense feature vector is achieved, containing no empty
events eat all. Via model selection a value of θmi = 5.0sis selected as a compro-
mise capable of filling large temporal gaps occurring in data (which often represent
overly long durations between two protocol states) while not increasing the over-
all feature vector length too much in less relevant regions of the sequence. For
θmi = 5.0s, the resulting ΓqT feature vector for sequence x1in Table 5.3 is thus
ΓqT (x1)=[D,A,B,C, e, e,A], s.t. the large te.rel(A)is represented by
two instances of e. Choosing θmi >60.0sallows eliminating all occurrences of
e, which is identical to the original sequence of events, without the additional tem-
poral information provided by the einserted in the sequence. When using ΓqT in
this way, we denote it as ΓqT , which allows highlighting the performance differ-
ences when using both types of feature representations with competing classifiers
of the process mining domain.
5.4.3 ΓTFeatures
To create the set of temporal features ΓT, we use the set S2C. The idea of ΓTis
to create a feature space which projects properties of the ith occurrence of each
event type in a sequence sonto the same dimension. The projected property is
the timestamp ts.rel of the respective event, allowing to compare it with the times-
tamps of the ith occurrence of the same event type of other sequences. We start by
calculating the occurrence frequency f(e, s)for each event type eEin each se-
quence sS2C. Then we define its maximum value as me=max(f(e, s)),e
E, sS2C, calculating what is the most frequent occurrence of event type e
in any sequence. Furthermore a function κ(e, s, i)is required, returning the ith
occurrence of ein s. For example the simple sequence x1in Table 5.3 has two
occurrences of the event type ’A’. As such, κ(A, x1,2) returns e5=A, i.e. the
5th event in sis the 2nd occurrence of ’A in s, with ts.rel(κ(A, x1,2)) = 18.075.
Now a function ts(s, e, me)can be defined, providing a vector of these timestamps
ts.rel of all occurrences of a selected event in a sequence, and 0.0otherwise:
ts(s, e, me)=[τ(s, e, 1), ..., τ(s, e, me)]
with
τ(s, e, i) = ts.rel(κ(e, s, i)) if if(e, s)
0.0else
Concatenating the resulting vectors of ts(s, e, me)for a sorted list of all eEvia
the concatenate()-function yields the final feature vector for a sequence s, which
87
has for all samples the same length of me,eE. A small example should
make this better understandable. Table 5.3 shows two simple sequences {x1, x2}=
X, for which the maximum frequencies me=max(f(e, x)),xXof each
e {A,B,C,D}are mA= 2,mB= 3,mC= 1,mD= 1. As a result the
feature vectors have the following format:
ΓT(s) = concatenate([ts(s,A,2), ts(s,B,2), ts(s,C,1), ts(s,D,1)]),
for s {x1, x2}, resulting in the following final feature vectors:
ΓT(x1) = [2.211,18.075,4.341,0.0,0.0,6.207,0.0]
and
ΓT(x2) = [30.548,0.0,1.823,3.230,25.330,5.103,0.0].
5.4.4 ΓSFeatures
The structural ΓSfeatures are event n-gram features, similar to the previously used
token n-gram features, and as such representing a feature representation commonly
used in the related works of sequence classification. Therefore we extract for all
s2C,sMOC and sMT C all event n-grams and index their sorted list, spanning the
final feature space ΓS- including the client-specific n-grams of sMOC and sMT C.
We denote an event n-gram of a sequence feature vector svia its vector indices in
interval notation, i.e. s[i,i+n)denotes the event n-gram from position i(inclusive)
to position i+n(exclusive). The ΓSfeature vector of sequence sample sis then
defined via the binary occurrence of the respective event n-gram within s. When
using the examples x1and x2from Table 5.3, the sorted list of n-grams for n= 3
is [’DAB’, ’ABC’, ’BCA’, ’DBB’, ’BBC’, ’BCB’, ’CBA’], resulting in the final ΓS
feature vector ΓS(x1) = [1,1,1,0,0,0,0].
5.4.5 ΓS+TFeatures
One base hypothesis of Chapter 5 is that the classification performance can be
increased by using a complementary structural and temporal feature space. For the
structural-temporal ΓS+Tfeature space we treat the feature vectors of ΓSand ΓT
as equivalent. Because of it binary format, ΓSalready produces qualitative feature
vectors, but ΓTproduces quantitative feature vectors. If we binarize its values,
we create a qualitative representation, which we can simply concatenate with the
ΓSfeature vector. This is used here to provide a baseline complementary feature
space, before defining the more complex complementary feature space ΓST .
5.4.6 ΓST Features
For the structural-temporal ΓST feature space we will first need an analysis of the
representative capabilities we specifically want to achieve with this feature space.
As such, we will start this section with an analysis of some feature requirements,
88
before explaining how these requirements are met by creating the data representa-
tion via structural-temporal δnmatching and the use of model sequences.
Context and Position
Metrics and features for structured, sequential data should reflect its specific prop-
erties. A sample of such data could be described by the occurrence of single n-
grams (as done in ΓS). But this description can be improved when these n-grams
are also analyzed in terms of their broader context and position. As such two sim-
ilarly positioned n-grams might be identical, but their respective neighbor events
(their context) might be different, which should prevent or penalize a match be-
tween them. This is highly relevant in data which is created by protocol-driven
processes, like mobile communication data, which follows specific protocol states
(e.g. for the radio bearer setup or the security parameter negotiations), all requir-
ing specific events in their context. Thus it is important to focus on comparing
contextualized n-grams with each other, i.e. events at the call setup should not be
compared with those in the final call phases.
Model Sequences
For the definition of ΓST the concept of model sequences needs to be introduced.
Projecting each of the sSonto a feature space spanned by these model se-
quences yields projected samples of the same length, while at the same time incor-
porating both temporal and structural properties. The use of model sequences is
based on the idea of defining the features of a sequence sbased on its similarity
to each model sequence sMin the set of model sequences SM, which thus defines
a feature space model. To this purpose we define the set of model sequence rep-
resentations as a triple SM= (SM
2C, SM
MOC, SM
MT C )just as we did for our actual
sequences. By consecutively indexing the sequences and the events within SMwe
effectively span a feature space of size sMSM|sM|. Note that the model se-
quences do not have to be labeled, and also do not have be mutually exclusive to the
set of training or test sequences S={S2C, SMOC, SMT C }, as we are not using the
labels of the model sequences in any way. We rely instead on the relevance of their
contained structural and temporal properties, offering insight into relevant types of
behavior, required for the class discrimination. However, as the feature space is
spanned by using the model sequences, their labels could potentially be used to in-
crease the contained number of different features, or to balance the representation
of features of more complex failure classes against those of simpler ones.
Defining the Structural Temporal Features
The ΓST feature space is based on the idea of representing structural and temporal
properties of the respective sequences. In this paragraph we will discuss, how to
achieve this by using n-grams and model sequences in structural-temporal match-
ing procedure, with a focus on the feature properties of context and position. We
89
define the context of each event by the size of the n-grams and the parameter of po-
sitional variance δ, and the positional properties by the actual matching procedure.
The idea of this procedure is to match each n-gram of each sSwith the n-grams
of each sMsorted(SM), requiring identical positions of both matching n-grams
within the two sequences - and then loosen this requirement via the parameter δ.
The result of this matching for each sSis a vector of its structural similarity
to each of the model sequences sM, for each of its sequence representations s2C,
sMOC and sMT C. Specifically the additional matches on the SM
MOC and SM
MT C
are relevant for failures, as they allow detecting event chains of a single client, au-
tomatically crossing the gap caused by interfering events of the other client. We
achieve this by a structural δ-nmatching function, denoted as ˆ
Φ(s, sM,ˆsM, δ, n),
where ˆsMdenotes a zero-vector of length |sM|. By iterating over all indices, this
vector is populated as follows:
ˆ
Φ(s, sM,ˆsM, δ, n) =
inc(ˆsM
[i+j,i+n+j),
1) if sM
[i+j,i+n+j)=s[i,i+n)
ˆsM
[i+j,i+n+j)else
for δ0with i= [1,|sM| n+ 1],j[δ, δ],(i+j)1and
(i+n+j) |sM|+ 1, with inc(x, y) = x +y denoting here the element-
wise incrementation of vector ˆsMby a one-vector
1 = (1, ..., 1) of length n. Here
we are using the indexing method introduced for ΓSto denote individual n-grams,
s.t. s[i,i+n)denotes the n-gram of the events [ei, ..., ei+n1]in the sequence s.
As such we essentially compare s[i,i+n)with sM
[i+j,i+n+j), and allow a positional
variance in the model sequence by defining jover the range of [δ, δ].
Using the exemplary sequence x1and x2of Table 5.3, with x1as
s= [D,A,B,C,A]
and x2as model sequence
sM= [D,B,B,C,B,A],
the structural δ-nmatching with δ= 1 and n= 1 yields the vector
ˆ
Φ(s, sM,ˆsM,1,1) = [1,1,1,1,0,1],
with event ’B’ and the final ’A making use of the positional variance introduced
by δ.
Since we are not only interested in the structural properties of our data, we
will now extend ˆ
Φby integrating the event-relative timestamps te.rel as temporal
properties, obtaining the final structural-temporal projection function Φ. The idea
is to calculate the absolute differences of the te.rel of the structurally matching
events of sand sM. This is done by modifying the previously used incrementation
function int(), giving rise to the final definition of Φ:
90
Φ(s, sM,ˆsM, δ, n) =
inc(ˆsM
[i+j,i+n+j),abs(x, y)) if sM
[i+j,i+n+j)=s[i,i+n)
with x =te.rel(sM)[i+j,i+n+j)
and y =te.rel(s)[i,i+n)
ˆsM
[i+j,i+n+j)else
for δ0with i= [1,|sM| n+ 1],j[δ, δ],(i+j)1and (i+j+n)
|sM|+1, with the function abs(x, y)defining a vector by calculating the absolute
element-wise difference between two vectors x and y, which is in this case the
temporal difference of the respective matching events of sand sM. As this can
result in multiple matches per event in ˆsM, we finally average each field in ˆsMby
its number of matches. Applying this final formulation to the previous example
sequences and their event-relative timestamps
te.rel(s) = [0.0,2.211,2.130,1.866,11.868]
and
te.rel(sM) = [0.0,1.823,1.407,1.873,20.227,5.218]
yields the final feature vector
Φ(s, sM,ˆsM,1,1) = [0.0,0.307,0.723,0.007,0,6.65].
Since Φis defined over a single sM, it has to be executed for all sM, with
sMsorted(SM), and the resulting vectors ˆsMhave to be concatenated via the
concatenate()-function, giving rise to the final definition of ΓST :
ΓST (s, SM, δ, n) = concatenate(Φ(s, sM,ˆsM, δ, n)),sMsorted(SM)
The final feature vector ΓST (s, SM, δ, n)has the length sMSM|sM|, and con-
tains the structural-temporal δnmatches of the sequence sand all model se-
quences SM. The O-complexity of this whole process is linear, as creating the
feature space by indexing the model sequences SMis done in linear time, and pro-
jecting a single sequence sonto SMis linear to the number of model sequences
|SM|, as it requires matching the n-grams of each swith those of every sMSM.
Explaining the Semantics
The objective of our projection Φis to achieve features highlighting differences in
structurally similar, but temporally different sequences, i.e. we aim for a way to
define similar features for sequences with similar structural and temporal behavior,
while achieving different feature vectors for those which are structurally different,
or which are structurally similar, but temporally different. As one can see, the
value of a single dimension is 0if there is no structural match, it is very small if
91
te.rel(s[i,i+n))and te.rel(sM
[i+j,i+n+j))are similar, and it is large if te.rel(s[i,i+n))
and te.rel(sM
[i+j,i+n+j))strongly deviate.
Once the feature vector of sis calculated, it is utilized in the classifier, where
this feature projection does indeed allow focussing on the desired differences. If
the projections of both samples have small values for a dimension, those small val-
ues contribute to a small distance between two samples, yielding a high similarity
between the samples. If the projections of both samples have similarly large values
for a dimension, these can contribute to a small distance between two samples - but
only if they are similarly large. This is only the case if both samples have a simi-
larly large deviation from the timestamps of the model sequence, which is only the
case, if they show a similar temporal behavior. If the projections of both samples
have differing values for a dimension, these values increase the distance between
both samples, emphasizing their inter-sample difference for this dimension.
By using the interpretation methods described in Chapter 4.6.2, ΓST also al-
lows inspecting, which dimensions are of highest importance for the classification.
This allows a better interpretation of the classification results, and also enables a
knowledge transfer into potentially faster rule-based algorithms. Since the ΓST
components also encode the temporal position of the relevant n-grams, this can
also be used to synchronize the data with existing radio frequency (RF) data, which
itself can lead to its own set of failures (e.g. coverage or interference failures),
caused for example by low signals of the reception level (RXLEV), the reception
quality (RXQUAL) or the received signal code power (RSCP).
Limitations of ΓST Features
One limitation of ΓST is the high dimensionality of the resulting feature space. To
address this issue and to simplify the subsequent projection, classifier training and
application, we reduce the final feature space by utilizing the redundancy between
the S2Cand SMOC, and the S2Cand SMT C feature vectors, once they are projected
with the structural δnmatching of ˆ
Φ(s, sM,ˆsM, δ, n). Table 5.4 shows the
binarized ˆ
Φfeatures of a snippet of the complete and the reduced feature vector.
Before dimensionality reduction
s2C011110110101101101110101111011
sMOC 011110100000000001110101100000
sMT C 000000010101101100000000011011
After dimensionality reduction
s2C0111101101-11011-1110101111-11
sMOC 01---10-----01---110---
sMT C ----010----10-----011--
Table 5.4: Exemplary snippet of the binarized ˆ
Φ-projections of a sequence sbefore
and after dimensionality reduction
92
The dimensionality reduction is achieved by first removing single client dimen-
sions of the currently inactive client which are always zero, i.e. sMOC-dimensions
whenever the MTC client is active, and vice versa. Furthermore those representa-
tions of dimensions are removed which are redundant between the single client and
the multi-client vector, i.e. between sMOC and s2C, and sMT C and s2C, keeping
only the s2Cdimension. Since the ˆ
Φserve as the basis for the Φfeatures, these
can simply be projected onto the new, reduced representation afterwards, which,
when concatenated, constitute the final ΓST feature vector. To further reduce the
dimensionality of ΓST , one could also select features that are most relevant for the
classification task, e.g. by applying efficient feature selection methods like RDE
[BBM08].
Another related limitation of ΓST exists in form of a potentially problematic
class-wise feature balancing. If SMcontains an unbalanced number of samples per
sequence class, this will lead to many, potentially redundant features for over rep-
resented data aspects (e.g. class specifics), while other class- or data aspects could
remain nearly uncovered, due to an insufficient number of sMrepresenting these
aspects. This makes multi-class learning harder, because these over represented
features might outweigh less represented features and may therefore produce re-
sults prone to classify the corresponding class. To address this issue and to further
improve the expressiveness of the resulting ΓST features, the sequences for SM
should be carefully selected, to properly and equally represent all classes. While
we made sure not to use duplicate sequences in any of our data sets, we did not
include such an additional sequence selection optimization.
5.5 System Layout
This section will introduce the actual system for the detection and prediction of
classes of sequence behavior and the utilized learning methods. The system de-
scription will be kept as abstract as possible, to allow an application to other rele-
vant use cases. Figure 5.1 shows the training phase of the system, in which the sets
of sequences Sare used to create the feature vectors, which are then used to train
the required classifiers, responsible for the detection and prediction of properly
represented model classes.
5.5.1 Model Class Detection and Prediction
In our use case, the classes of sequence behavior are defined by the different ways
communication sequences between both clients can fail. When samples of failed
sequences of a new data campaign need to be classified, we could assume a hypo-
thetical scenario in which no new failure classes are found in the new campaign,
i.e. all potential failure classes have already been seen before. However, this is not
true in practice, where new types of previously unseen failures occur indeed. This
is also true in our data sets, where only a limited number of failures classes have a
sufficient size to properly evaluate supervised classification models with them. We
93
Figure 5.1: Training of the detection and prediction system
denote such classes as model classes, or MC. Sequences of Other Failures, i.e. of
insufficiently large failure classes are denoted as non model classes, or ¬MC.
As a consequence we design our system to contain two major components,
which allow to detect whether a new sample is potentially an MC sample, and if
that is the case, to predict the model class. Accordingly, those steps are called the
model class detection (MCD) and the model class prediction (MCP). For the MCP a
classifier is trained in a multi class approach, learning to discriminate only samples
of the MC, but not the ¬MC. For the MCD multiple classifiers are trained, one for
each MC. Each of those classifiers is trained in a two class approach, learning to
discriminate the respective MC against samples of ¬MC. After training the MCP
and MCD classifiers, we can predict the failure class of a new sample by predicting
aMC with the MCP classifier, and then using the MCD classifier trained for this
MC to confirm or reject this prediction.
5.5.2 Learning Methods
While unsupervised or semi-supervised methods have shown to produce good re-
sults on textual and structural data and could be relevant to our problem of model
class detection, supervised methods are regularly outperforming them and are the
preferred solution, if labeled training data is available. As discussed in Section
5.2, Decision Trees, Markov Classifiers and LSTM are learning methods widely
used in the domain of process mining, while MLP and SVM are more widely used
on non-sequential data representations of sequential data. For these reasons we
are conducting our evaluations on those methods, to provide a broad picture of
the classification performances achievable on the discusses sequential (ΓqT ), non-
sequential (ΓT,ΓS,ΓS+T) and semi-sequential (ΓST ) feature representations. We
also include the classification performance using k-nearest neighbors to provide
an additional baseline. All of the models below have been chosen using standard
cross-validation based model selection.
94
Decision Tree
Decision trees model training data based on their sequence, where their shared pre-
fix paths build the root of the tree, which branches along the sequence down to the
leafs, annotating the transitions with their respective probabilities. They are widely
used, specifically in process mining in [MDFDG14, CLF+14, LSD+15, ULD16]
and as random forest of decision trees[LCDF+15]. We use them as classifier, based
on the sequential ΓqT features, using additional suffix padding to achieve equiva-
lent length sequences.
Markov Classifier
Markov Models are commonly used in process mining [PNC11, LGN12, LSD+15,
ULD16], where they are primarily used for predicting objectives like the remaining
time or the next event, and not for sequence classification. It can also be used for
classification though, as Markov Models represent the data of each class in the
training data as a Markov process. This allows calculating the class-wise path
probabilities for a new sequence, and predicting the most probable class by the
highest overall path probability. We apply such a classifier on the sequential data
representations of the ΓqT feature spaces.
LSTM RNN
Recurrent Neural Networks with Long Short-Term Memory nodes are a type of
classifier recently used e.g. in process mining [TVLRD17, ERF16]. Deep Learn-
ing approaches work best with large training data sets. In our use case, getting a
large amount of labeled samples is not easy, thus a deep learning approach might
not be the best way to address this problem. However, recurrent neural networks
(RNN) with long short-term memory (LSTM) units have shown great performance
on sequence prediction problems, s.t. evaluating their performance on this prob-
lem is still highly interesting. For our experiments we are using the tensorflow
[AAB+15] implementation of RNN with LSTM. The ΓqT features are specifically
designed with an LSTM RNN in mind. To achieve samples of homogenous length
per batch, we added empty events to the end of each sequence. Since the history of
each event is of specific relevance in the event sequence handling in LSTMs, this
suffix-padding is a good solution, as it allows to assure that the starting events are
not empty. In our experiments we achieve the best results when using one-hot la-
bel encoding, a single hidden layer of 20 nodes, 300 epochs and a batch size of 10.
Further details on the concepts and ideas behind LSTMs, as well as their formal
definitions, are provided in Chapter 7.1.4.
KNN
K-nearest neighbor classifiers are classical distance-based baseline classifiers from
the field of natural language processing. We achieve the best results with a value
95
of k= 5, using the euclidean distance while additionally weighing points by the
inverse of their distance, s.t. closer points have a larger impact.
MLP
Multi Layer Perceptrons are a widely used type of neural network sequence clas-
sifier, e.g. in [LWR+10, MLI+15]. We achieve the best results by using a single
hidden layer with 80 nodes and the identity function as activation function. Fur-
ther explanations on the reasoning behind MLP classifiers are provided in Chapter
7.1.3.
SVM
Support Vector Machines are a supervised learning methods, training a maximal
margin separating hyperplane between linearly separable class data. While this can
also be extended to non-linearly separable class data, we are using a linear kernel,
which has shown very good results given sufficiently high-dimensional data, and
specifically for protocol-based communication data [ORLS14, PAF+09, WPS06,
WS04]. For the MCP evaluation we are using a one-vs-rest (OVR) approach, as
this includes calculating a separating hyperplane for each model class MC, which
allows a confidence calibration to optimize the system precision, as explained in the
next section. Further details on the optimization problem of 2C-SVM, its decision
functions and the use of kernels are provided in Chapter 7.1.1.
5.5.3 Combined Classification System
We combine different classifiers to separate each MC from ¬MC samples (via the
MCD), and to predict the correct MC (via the MCP). The corresponding predic-
tions are obtained by applying their respective prediction functions to the feature
vector Γ(s)of a test sample s, using one of the previously defined feature spaces,
i.e. Γ {ΓqT ,ΓT,ΓS,ΓS+T,ΓST }. The function for the model class prediction
classifier is FMCP , with
yMCP =FMCP (Γ(s))
and with yMCP {MC1, ..., MCk}, the set of all kmodel class labels. The
prediction function for the model class detector is FMCD(MCi), obtaining the
prediction of the classifier trained for MCivia
yMCD =FMCD(MCi,Γ(s))
with yMCD {MCi,¬MC}. As one can see, FMCP returns one of the MC-
labels, and FMCD(MCi)returns the label of the class MCi, or ¬MC.
We are combining these prediction functions by using two confidence ratings as
a way to ensure a higher confidence in the predictions of the combined (MCP and
MCD) classifier, as this helps largely improving the overall classification precision
of our approach, which is crucial in the practical application. These confidence
96
ratings produce the binary results High and Low. They can be logically combined
and can be interpreted either as providing support for the prediction of the MCP
(High confidence) or objecting against its prediction (Low confidence).
Since the output of FMCD is limited to a single MCiand ¬MC, it is used
as the first confidence rating for yMCP , answering the question of whether the
provided yMCP really belongs to MCi, or whether it belongs to ¬MC. For this
purpose, the confidence rating function ΘMCD(Γ(s)) {High,Low}is defined
as:
ΘMCD(Γ(s)) = High if FMCD(yMCP ,Γ(s)) = yMCP
Low else
To obtain further confidence on the classification result of FMCP , we define
an additional confidence rating Θdb. It uses the decision boundary of the MCP
classifier of the predicted class. The idea behind Θdb is to calibrate the decision
boundary of each MCP classifier towards more conservative values, requiring a
sample with yMCP =MCito cross a stricter decision boundary for this MCito
obtain a High confidence for Θdb, reducing the false positive rate and increasing the
precision. As it is defined over the decision scores, it can be applied to any classifier
which provides access to its decision scores or probabilities. For this purpose we
need an additional function for accessing the prediction score of the MCP via the
function DMCP (Γ(s)). We also require the existing bias bof the MCP classifier of
yMCP , and a parameter θDB 0to define the new bias bdb =b+ (Db)·θdb,
with Drepresenting the mean of those decision scores of the training samples
that have been correctly classified by the MCP. As such, the parameter θdb gives
rise to the following definition of the confidence rating Θdb:
Θdb(Γ(s), θdb) = High if DMCP (Γ(s)) + bdb >0
Low else
For example in the two-class definition of the SVM the decision function (as
defined in Formula 7.1 in Chapter 7.1.1) is FMCP (Γ(s)) = sign(w, Γ(s)+b),
with wbeing the weight vector of the trained model. Here we achieve a positive
prediction if w, Γ(s)+b > 0. The respective function to access the score is then
DMCP (Γ(s)) = w, Γ(s)+b.
θdb can be changed dynamically during model selection, which allows for a
calibration of the model precision, similar to other methods of false positive cali-
bration as used e.g. [SBKR12]. The confidence ratings are then processed together
with the prediction yMCP to produce the final predicted label Fcombined(Γ(s))
{MC1, ..., MCk,¬MC}, defined as follows:
Fcombined(Γ(s)) =
FMCP (Γ(s)) if ΘMCD(Γ(s)) = High
Θdb(Γ(s), θdb) = High
¬MC else
97
The effect of the confidence ratings and the consequently created High - con-
fidence prediction subset on the applied evaluation metrics will be elaborated in
the next section. Now that the system is trained, i.e. the ΓST feature space model
is defined and the MCD and MCPs are trained and calibrated, we can apply the
system to classify unlabeled sequences, as illustrated in Figure 5.2. As just de-
scribed, the final prediction Fcombined(Γ(s)) of this combined classifier for a given
sequence sonly provides the label of the predicted MCiif ΘMCD(Γ(s)) = High
and Θdb(Γ(s), θdb) = High, and ¬MC otherwise. This, together with the still
accessible Θ-confidence ratings allows for an effective filtering of reliable and un-
reliable predictions, which is highly relevant in the practical application.
Figure 5.2: Application of the detection and prediction system
System Limitations
One limitation of this combined system is the currently unused opportunity of a
closer integration of the currently separated MCP and MCD steps during the train-
ing and the application phase. This provides an opportunity to further increase the
system classification performance, as both steps could complement each other and
provide class information currently not available to the respective other. Another
limitation is the requirement to define Θdb via model selection, which could be
problematic given the potentially small data sets available.
5.6 Evaluation
In this section we start with describing the parameters and settings used for the
feature spaces and classifiers during the evaluation, as well as the utilized met-
rics. Afterwards we evaluate the model class prediction (MCP) component of the
proposed system. For this purpose we evaluate the ΓqT feature space with LSTM
classifiers, and the ΓT,ΓS,ΓS+Tand ΓST feature spaces with the MLP, KNN and
SVM classifiers. We also conduct additional comparative analyses on various sta-
tistical properties of the feature spaces, allowing for more detailed conclusions on
their individual advantages and disadvantages. Finally we select the best perform-
ing combinations and evaluate their model class detection (MCD) and combined
98
MCP and MCD classification performance, simulating the complete system work-
flow. For all evaluations we used 20 times random sampling, each with a 5 times
cross validation. For a proper evaluation we made sure all conducted comparisons
between classifiers and feature spaces were done on the same respective samplings.
For the event n-gram sizes of the ΓS,ΓS+Tand ΓST feature spaces we used a fixed
n-gram size of n= 2, which yielded the overall best results. We also tested using
multiple values of nsimultaneously, as e.g. described in [LXLZ15]. However, the
performance increase was only minimal. As a convention, all results are listed as
the mean and standard deviation in percent.
5.6.1 Evaluation Metrics
The general system application of reliably detecting and predicting known amidst
unknown sequence classes, and its concrete practical application in failure classifi-
cation enforces a focus on two primary objectives: (1) to obtain reliable predictions
(2) for as many MC samples as possible. Precision, recall and the F1 score capture
those aspects. For the evaluation of the MCP and the MCD classifiers they are cal-
culated for multiple cross validation repetitions, s.t. we chose to calculate their un-
weighted class mean (denoted with the keyword macro), because we already con-
figured the sampling procedure to produce similarly sized model classes. When-
ever the ¬MC class participated in the evaluation (in MCD and combined classi-
fiers) we excluded it from the calculation of the classwise mean values, because
our focus is on the correct detection of MC samples. By combining MCP and
MCD classifiers and applying confidence ratings, we effectively create a filter, al-
lowing us to focus solely on the created High-confidence subset of the predictions,
expected to contain only those samples and labels which truly are of a model class
MC and are correctly classified as such. Consequently we also have to adapt our
metrics. The set Pof the positive samples is defined as the set of samples contained
in the High-confidence subset, which is split into the subsets T P and FP.TP con-
sists of the correctly predicted samples of MC in this subset, and FP consists of
falsely predicted samples of MC and also samples of ¬MC in this subset. The set
Nof the negative samples is defined as the set of samples in the Low-confidence
subset and is split into the sets TN and FN.TN consists of samples of ¬MC
and falsely predicted MC, and FN consists of samples of correctly predicted sam-
ples of MC. Based on these values precision and recall are defined as usual, with
precision =TP/P and recall =TP/(TP +FN). However, to correctly ad-
dress objective (2) we have to calculate an additional effective recall by considering
all existing MC samples, not only those in the High-confidence subset. Therefore
we define the effective recall as the recall of correctly predicted samples of MC
over the sum of samples in all MC, i.e. effective recall =TP/ s#MCi,MCi.
Together with the precision over the High-confidence subset, this metric allows for
a conclusive analysis of the overall system classification performance, as provided
by the combined classification processing.
99
5.6.2 Experiments on MFC data
To be able to apply the selected learning methods, we have to assure sufficiently
sized failure classes. To obtain any model classes at all, we sample only from
classes with a minimum size of 15 sequences. To improve the interpretability of
our classification results we opted for similarly sized failure classes, which we
achieved by limiting the size of each failure class to a maximum of 25 sequences.
As such, the number of samples per MC used for the evaluation, s#MC, is 15 <
s#MC 25. To simulate the complete system, we also need members of the Other
Failures class, of which we used all 86 available sequences, i.e. s#¬MC = 86. In
the MCD evaluation this allows highlighting the detection purpose of the method,
as the ratio of samples of MCito ¬MC is approximately 1:4. In the combined
evaluation the ratio of all MC samples to ¬MC samples is approximately 1:1,
which helps in the interpretation of the classification results. For defining the ΓST
features we used all 6,077 labeled and unlabeled MFC samples as model sequences
SM. After applying an additional redundancy-based dimensionality reduction, this
resulted in a feature space of 294,435 dimensions. As previously described this
use of the unlabeled sequences helps to extract additional information about the
behavior of the projected sequence.
Evaluation of the individual MCP and MCD classifiers
The purpose of the first set of experiments is to find the best performing learning
method for all feature spaces, s.t. we can restrict the further experiments to this
learning method, allowing to focus on the feature space analyses. Since ΓST is
further parametrized by δ, we start with analyzing the impact of different values
of δon the MCP classification performance of ΓST using the SVM classifier. The
mean sample length in the MFC data set is 44.11 events, with a standard deviation
of 13.51 events. Since δencodes the positional variance of the matched n-grams
within the projected sequence, it does not make sense to increase its size beyond a
value of δ= 60, at which the complete average sequence length is covered. Since
we expect a high importance of a similar positioning of the matched n-grams, we
expect better results for smaller values of δ. The left plot in Figure 5.3 shows the
results for δ[0,60]. As expected we achieve the best results with a value of
δ10. For that reason we focused stronger on the range of δ[0,10], which is
illustrated in the right plot in Figure 5.3, allowing to further reduce the selection of
an optimal value down to δ= 5. As this value allows a positional variance of ±5
events on SM, it can additionally be explained by the circumstance that event sub-
sequences, which are crucial to the protocol, like the call setup (also illustrated in
Table 5.2) take around 10 events in the s2Crepresentation, requiring any sequence
to match the contained events.
Now that we know a proper setting of δwe can conduct a comparative eval-
uation of all MCP classifiers on the MFC data set, using all feature spaces. The
results are shown in Table 5.5. Of the process mining methods applied on ΓqT and
100
Figure 5.3: MCP evaluation on MFC data for ΓST : F1 scores (macro) in % for
ranges of δ[0,60] (left) and δ[0,10] (right)
ΓqT , the decision tree showed the overall best performance, specifically on ΓqT ,
i.e. the original sequence representation. However, both the Markov and the LSTM
classifier achieve an improved performance on the temporally enriched ΓqT feature
space. When compared to the performance on the other feature spaces though, all
of those approaches commonly used in the field of process mining are clearly out-
performed by the other feature spaces and learning methods. While this was to be
expected, given that the process mining approach, and specifically deep learning
approaches like LSTM usually require much larger training data sets, also the lack
of the additionally included concrete temporal information is highly relevant, since
they are even outperformed by ΓT, which only contains strongly reduced structural
information about the data. Hence the SVM classifier performs best on all feature
spaces, outperforming the otherwise widely used MLP, as well as (obviously) the
KNN approach. For those reasons we are using it for the remaining experiments.
The results of the SVM classifier also show, that in the optimal scenario in which all
MC are known, good results can already be achieved without using the proposed
θdb system calibration.
The results of the MCD evaluation using the SVM classifier are shown at the
bottom of Table 5.5. Obviously discriminating the MC and ¬MC is harder than
separating the MC in the MCP setting, which is to be expected, as the ¬MC
samples are very heterogenous. However, using semi-supervised learning via a
One-Class SVM to model each MCiagainst the ¬MC performed even worse. Of
all feature spaces ΓS+Tperforms best, while the specifically crafted ΓST feature
space is slightly outperformed by all other feature spaces. While one might think
that ΓST does not look that promising yet, the combined classifier evaluation will
show, that it performs better than its competitors in the final system layout, when
the effective recall becomes relevant.
Evaluation of the combined classifiers
Now that we established some understanding of the performance of the individual
MCD and MCP classifiers, we will now evaluate for each qualified feature space
its combined classification performance, also integrating the previously described
confidence ratings. We do this to find the feature space which has the highest
precision, at the highest possible effective recall, which is highly relevant for an
101
MCP F1 Score (macro) Precision (macro)
ΓqT
Markov 30.90 ±8.06 34.62 ±10.97
DCT 52.58 ±10.12 56.03 ±11.26
LSTM 45.36 ±11.78 49.25 ±13.42
ΓqT
Markov 33.57 ±6.91 34.22 ±9.18
DCT 48.89 ±9.74 52.45 ±11.44
LSTM 47.38 ±7.85 52.55 ±8.14
ΓT
KNN 77.57 ±9.05 80.27 ±9.13
MLP 80.94 ±6.83 83.55 ±6.98
SVM 84.72 ±7.34 87.33 ±6.65
ΓS
KNN 77.51 ±8.89 81.79 ±7.96
MLP 83.07 ±7.00 85.58 ±6.41
SVM 83.60 ±9.01 86.17 ±8.51
ΓS+T
KNN 79.10 ±8.14 82.44 ±7.89
MLP 84.27 ±7.81 87.08 ±7.24
SVM 85.25 ±7.17 87.67 ±6.83
ΓST
KNN 77.67 ±8.40 79.93 ±8.31
MLP 78.53 ±8.31 82.03 ±7.55
SVM 83.67 ±7.13 85.70 ±6.77
MCD F1 Score (macro) Precision (macro)
ΓT
SVM
62.91 ±19.43 64.54 ±20.33
ΓS63.33 ±21.72 70.68 ±23.85
ΓS+T65.53 ±20.64 72.57 ±22.21
ΓST 58.63 ±25.17 76.72 ±28.09
Table 5.5: Results of the individual MCP and MCD evaluations on MFC data (%)
102
effective system in the practical application. Achieving high precision predictions
means that we can trust the results to be correctly classified and to not contain any
samples of ¬MC falsely being classified as an MC sample. And getting the high
effective recall means, we get this predictive behavior for a larger portion of the
MC samples that are actually contained in the test set. This aspect is illustrated
in Figure 5.4, which shows the percentage of High-confidence samples of MC
to all samples of MC in the test set, under a shifting parameter θdb [0,1.0].
The more θdb is increased, the less samples of MC are actually contained in the
High-confidence subset, reducing the potential effective recall.
Figure 5.4: Percentage of MC samples in the High-confidence set, for θdb
[0,1.0], for ΓT,ΓS,ΓS+Tand ΓST
For comparing the combined classification performance of the different feature
spaces with the SVM classifier, we need to select values of θdb representing prac-
tically relevant values of precision and recall, which are similar for the respective
feature spaces. Figure 5.5 contains the precision, recall and the effective recall of
the classification results for θdb [0,1.0]. The precision starts to reach 100% for
most classifiers at θdb = 0.8. At this value also the recall reaches the maximum
of 100%. When looking at the concrete values of mean and standard deviation,
shown in Table 5.6, we see that the recall can not be further increased, and that the
corresponding precision can be selected s.t. it is around 93% for ΓST ,ΓS+Tand
ΓS. Since further increasing the precision would not further increase the recall,
and 93% is already a reasonable system precision, we will use this value and the
respective settings of θdb for the further analyses. Thus we are using θdb = 0.8for
ΓSand ΓS+T, and θdb = 0.9for ΓST . In this respect, the performance of ΓTwas
not sufficiently high to achieve similar values of precision and recall, which is why
we used θdb = 1.1there, achieving relatively close values for further analyses.
Now we need to evaluate which feature space offers the best effective recall,
103
Figure 5.5: Combined classification results on MFC data: precision, recall, effec-
tive recall (from top to bottom) in % for θdb [0,1.0], for ΓT,ΓS,ΓS+Tand
ΓST
104
i.e. the fraction of MC samples that can be recovered from the set of test sam-
ples, with those high values of precision and recall previously described. At their
respective values of θdb,ΓShas the lowest effective recall of 25.05%, followed
by ΓS+Twith 26.27%, and ΓST with an effective recall of 31.79%. Thus ΓST
produces a precision performance similar to both ΓS+Tand ΓS, while achieving
a 5.5% higher effective recall as ΓS+T, and a 6.7% higher effective recall as ΓS.
This means, we can get for 31.79% of the MC samples in the test set a correct
prediction with 93.11% precision and 100% recall when using ΓST , compared to
26.27% effective recall, 93.51% precision and 100% recall when using ΓS+T, and
even worse when using ΓS. These results are highly relevant in practice, as they
effectively allow filtering near-certain from uncertain predictions with a very high
precision. These results are also highlighting the effectiveness of both combined
feature spaces, with a significant advantage for the ΓST feature space. In that re-
gard both structural-temporal feature spaces ΓS+Tand ΓST outperform ΓSand ΓT:
Whereas ΓThas a relatively good effective recall, but a relatively low precision,
ΓShas an acceptable precision, but a low effective recall. This renders both feature
spaces less practically relevant than their combined counter parts, highlighting the
relevance of combined structural-temporal feature spaces.
The dimensions most relevant for the respective classification results in this use
case were security handshake events, followed by the existence of events represent-
ing a successful response to the most relevant key protocol states, like a successful
radio bearer setup. As we saw in the MCP evaluation, the temporal features are
also relevant and utilized in both combined feature spaces. The ΓST feature space
also has an advantage here, as its SM-based feature space allows locating the con-
crete positions and structural-temporal properties of the relevant events within the
sequence, which are in the evaluation often identified as responses occurring too
late in the sequence, or security mode negotiations at anomalous sequence posi-
tions. All of this can then be used to obtain deeper insights into the data, which can
help manual analysts to limit the number of causes for this specific failure class.
Due to their potential in combining classifiers of different features spaces,
we also evaluate the classification performance of an ensemble method [Die00],
namely the ensemble classifier E, which could potentially further optimize preci-
sion and effective recall. It predicts the combined classification results by using a
majority voting over the predicted labels of ΓS,ΓS+Tand ΓST . Table 5.6 shows
its results when using their default trained models of θdb = 0.0, and for their opti-
mized decision boundary models, using θdb = 0.8for ΓSand ΓS+T, and θdb = 0.9
for ΓST respectively. ΓThas not been used due to its lower performance. When
using the default models at θdb = 0.0, the ensemble classifier in fact achieves the
best precision at the cost of the effective recall, an effect similar to the trade-off of
θdb. For the optimized models of θdb {0.8,0.9}the ensemble classifier achieved
worse results though.
105
θdb F1 Score Precision Recall Eff. Recall
ΓT
0.076.37 ±7.90 69.09 ±8.22 91.12 ±7.73 61.57 ±9.39
1.190.71 ±13.28 87.40 ±16.75 98.44 ±7.67 17.25 ±7.49
ΓS
0.080.82 ±7.18 73.74 ±8.42 95.32 ±5.55 57.94 ±10.17
0.895.01 ±6.26 92.65 ±8.92 99.90 ±1.00 25.05 ±7.98
ΓS+T
0.081.23 ±7.65 75.28 ±8.09 95.02 ±5.54 60.50 ±12.77
0.895.64 ±5.54 93.51 ±8.13 100.0 ±0.0 26.27 ±8.64
0.997.36 ±4.86 95.97 ±7.39 100.0 ±0.0 20.38 ±8.09
ΓST
0.081.52 ±8.83 77.74 ±10.97 91.73 ±7.35 50.84 ±8.28
0.891.72 ±8.29 88.29 ±10.34 99.17 ±4.49 39.88 ±8.81
0.995.33 ±6.32 93.11 ±9.16 100.0 ±0.0 31.79 ±10.33
E
0.081.98 ±8.33 78.53 ±9.07 90.10 ±9.40 55.78 ±10.73
0.888.74 ±13.49 89.76 ±13.65 90.72 ±13.58 23.59 ±7.76
0.9
Table 5.6: Results of combined classification evaluation on MFC data (in %)
Significance analysis
To further substantiate the results of the previous section, we conduct significance
tests on both of our theoretical hypotheses, namely that the combined feature
spaces ΓS+Tand ΓST outperform the base feature spaces ΓTand ΓSin terms
of effective recall at similar precision (Hypothesis HA), and that the more com-
plex combined feature space ΓST outperforms the simpler combined feature space
ΓS+Tunder the same premises (Hypothesis HB). For the formulation of the hy-
potheses we denote θer as the minimal effective recall.
For hypothesis HAthe null hypothesis HA
0is defined as follows: When us-
ing ΓSor ΓTfor achieving a test set precision mean of 93%, a fraction of p0
samplings have an effective recall θer. The alternative hypothesis HA
1is then
defined as follows: When using ΓST or ΓS+Tfor achieving a test set precision
mean of 93%, a fraction of ˆpsamplings have an effective recall θer. Now we
can formulate the question for hypothesis HA: Is there sufficient evidence at the
α= 0.05 level to conclude that the effective recall for the high precision classifica-
tion performance is increased, when using one of the combined feature spaces ΓST
or ΓS+Tinstead of one of the individual feature spaces ΓTor ΓS? And at which
minimal effective recall θer does this hold? The results for the minimal θer, at
which we can reject HA
0in favor of HA
1(i.e. above which pαalways holds for
the resulting p-values) are shown in Table 5.7, for each pair of base and combined
feature space, as calculated on the same sampling that have also been used for the
previous combined MFC evaluation. We can see that HA
0can be rejected for ΓT
for values of θer 5%, i.e. for nearly all values of θer, excluding those which
do not occur in the combined feature spaces due to their generally higher effective
recall. For ΓS,HA
0can be rejected for θer 9% for ΓST , and for θer 14% for
106
ΓS+T. This means ΓST is better for a larger number of samplings, while ΓS+T
starts outperforming ΓSlater - both of which is also relevant for hypothesis HB.
ΓS+TΓST
HAΓT5% 5%
ΓS14% 9%
HBΓS+T- 30%
Table 5.7: Values of minimal θer required to reject hypotheses HA
0and HB
0at
α= 0.05
For hypothesis HBthe null hypothesis HB
0is defined as follows: When using
ΓS+Tfor achieving a test set precision mean of 93%, a fraction of p0samplings
have an effective recall θer. The alternative hypothesis HB
1is then defined anal-
ogous: When using ΓST for achieving a test set precision mean of 93%, a fraction
of ˆpsamplings have an effective recall θer. The question for hypothesis HB
is then: Is there sufficient evidence at the α= 0.05 level to conclude that the ef-
fective recall for the high precision classification performance is increased, when
using the complex combined feature space ΓST instead of the simpler combined
feature space ΓS+T? And at which minimal effective recall θer does this hold?
As shown in Table 5.7, HB
0can be rejected for all values of θer 30%, show-
ing that ΓST indeed outperforms ΓS+T, a fact that is further strengthened by the
performance advantage of ΓST over ΓS+T, as shown for hypothesis HA.
Figure 5.6: Precision (left) and Effective Recall (right) of ΓST against ΓS+T(in
%)
We will now further elaborate hypothesis HBby analyzing the distribution of
precision and effective recall, when using either feature space on the same test
sets. The results of this performance variance analysis are shown in Figure 5.6. As
previously stated, and shown in Table 5.6, we conducted the significance tests on
107
SVM classification models calibrated for an average precision 93%. As shown
in the left plot of Figure 5.6, the models of both ΓS+Tand ΓST show similar
performance distributions, with a slight advantage for ΓS+T, due to its slightly
higher average precision of 93.51%, compared to the 93.11% of ΓST . However,
as the right plot shows, the effective recall on the same test sets is much balanced
towards ΓST , clearly supporting hypothesis HB. As a result these analyses support
our conclusion that ΓST is the most advantageous feature space in the discussed
use case.
5.6.3 Experiments on AFC data
Due to the already discussed shortcomings of the AFC data set properties, we are
only interested to see, whether the MCP are capable of discriminating the model
classes of the AFC data set at all, and how much of a performance improvement
we can expect with a larger training data set, which is possible only on the AFC
data set. Similar to the class size restrictions described for the MFC evaluation,
we have to ensure sufficiently large as well as similarly sized failure classes. To
reflect smaller and larger training data sets, we evaluate two different setups. The
first setup is defined with comparability to the MFC evaluations in mind. Hence we
use s#MC = 25, resulting in the 13 sufficiently sized failure classes listed in Table
5.1. In the second setup, selecting s#MC = 100 allows for a larger training data
set, resulting in 4 sufficiently sized failure classes. For defining the ΓST features
we used all 3,264 sequences as model sequences SM, resulting in a feature space
of 144,938 dimensions after redundancy-based dimensionality reduction.
MCP c#MC F1 Score (macro) Precision (macro)
ΓS+T25 46.84 ±4.71 49.17 ±6.17
ΓST 42.00 ±5.63 43.97 ±6.12
ΓS+T100 71.15 ±4.09 72.08 ±4.17
ΓST 62.90 ±6.05 63.67 ±6.08
Table 5.8: Results of the individual MCP evaluations on AFC data (in %)
Table 5.8 shows the results of the MCP evaluations on the AFC data set. Due
to the differences in the MFC and the AFC data, we expected a worse classification
performance than on the MFC data, which indeed occurs. However, for the larger
sets of training data with s#MC = 100 the results are largely improved, which
shows, that the AFC data set still contains a sufficient number of discriminative
features to enable classification. This also documents the potential for an equally
increased classification performance on the MFC data, once more class-wise train-
ing data is there available as well - which also applied to the general use case of
similar classification problems.
108
5.7 Conclusion
This chapter addresses theoretical and practical issues, relevant when analyzing
real-time log data of structural, temporal processes using specific structural-tem-
poral feature spaces, specifically when solving mobile communication failure clas-
sification problems. On the theoretical side we present an analysis of structural
and temporal data properties, specifically on the discussed format of mobile com-
munication data. We introduce and discuss novel individual and combined feature
spaces utilizing those properties to obtain a good data representation. We con-
duct a comparative performance evaluation of these feature spaces against feature
spaces commonly used in related work, on a range of classifiers. We also show in
various evaluations and via hypothesis testing that both of our combined temporal
structural feature spaces ΓS+Tand ΓST outperform their competition counterparts
from the research fields of sequence learning and process mining, and that the novel
ΓST feature space excels in classification performance when being compared to all
other approaches, including an ensemble method. We also discussed and applied a
solution to address the main limitation of the ΓST features, the high feature space
dimensionality.
On the practical side we propose a system for the detection and prediction of
classes of pre-defined sequence behavior, applied on the use case of the automatic
classification of mobile communication failures using the proposed feature spaces
and supervised learning, for which we also show how to maximize its classification
precision and effective recall via a calibration procedure. We highlight the impor-
tance of properly labeled training data, for which we show that our proposed ΓST
feature space is able achieve a precision of more than 93% while having the advan-
tage of an up to 6.7% higher effective recall than the other feature spaces. These
results are highly relevant in practice, as they effectively allow separating reliable
from unreliable predictions. And with the higher effective recall more reliable pre-
dictions can be obtained, further reducing the costs of otherwise unfeasible manual
analysis processes.
As an outlook it would be interesting to evaluate the potential of word vector
representations like those of [MCCD13] for corpora of structural-temporal data.
This would not necessarily reflect the temporal data aspects, and would also re-
quire data sets much larger than currently available. However, the sequential and
contextual aspects of the event relations could potentially be covered, which could
help improving the interpretability of the internal process relations, as well as the
overall classification performances.
109
Chapter 6
Thesis Conclusion
This thesis proposed and analyzed properties and features of sequential and struc-
tural data, as well as learning methods and automated systems for this data, specif-
ically in the domains of network communication and code. These are the three
primary aspects of this thesis, whose individual results and contributions we will
now summarize, followed by the final outlook.
Automated Systems
For this purpose various highly automated systems were introduced, all of which
reduce the necessity for additional manual processing, which is relevant for a wide
range of web and communication analysis problems, and provides answers to the
thesis question
What is needed to achieve highly automated systems utilizing these features
and learning methods?
Chapter 2 introduced a novel approach for detecting covert and tunneled commu-
nication of malware using hierarchical detectors for anomalous communication
in features of HTTP requests. This system can be adapted to the web surfing
characteristics of individual users and can identify the communication of mali-
cious software, tunnels and backdoors. Chapter 3 investigated creating a combined
learning-based system for the detection of malicious JavaScript code and a com-
pletely automated system for the collection and analysis of JavaScript code. Chap-
ter 4 proposed a system for the validation of mobile communication sequences,
which was competitively evaluated using different data representations based on
different analysis methods. Chapter 5 proposes a system for the detection and pre-
diction of classes of pre-defined sequence behavior, applied on the use case of the
automatic classification of log-data of mobile communication failures. As the the-
sis also provides analyses of specific practically relevant topics of the proposed
systems, like the impact of re-training procedures, the interpretability of the re-
sults, or methods to determine the reliability of the results, the proposed systems
110
allow a good adaptation to practical applications. These practical thesis aspects are
summarized by the following contributions:
Proposal and evaluation of highly automated systems for solving specific
practically relevant learning problems.
Proposal of methods and graphical representations to increase the trans-
parence and interpretability of the system and the obtained results.
Properties and Features
The thesis proposed and analyzed various forms of features and feature combi-
nations with regard to their applicability to sequential and structured data and its
specific problems. Specifically token n-grams were found to be highly applica-
ble, especially when extended with additional temporal properties. To address the
research questions
Which properties are relevant for structural and sequential data of network
communication and code?
Which types of features are best suited to represent those properties - and
how are they extracted effectively?
the thesis starts in Chapter 2 with utilizing concrete structural, temporal and sta-
tistical features like the length or number of individual HTTP requests, as well
as their hierarchical combination. While these implicitly stateless features work
well for this use case, they can not easily be generalized, as they directly represent
concrete data properties. As a consequence Chapter 3 starts utilizing the more gen-
eral token n-gram features, which is continued in Chapter 4, where token n-gram
features are explicitly defined as stateless features. By additionally introducing a
stateful feature representation, this chapter allows discussions and comparisons of
these stateless and stateful feature spaces. These analyses found that using state-
less feature representations requires a higher computational effort (due to its high
dimensionality), but usually slightly outperforms stateful features. However, as
their performance gap is not very large, stateful features are still a viable option
- especially considering their lower computational requirements. Afterwards their
potential to complement each other was evaluated, resulting in the highest classi-
fication performance. Furthermore it was shown that using non-proprietary data
analysis techniques can enable feature representations nearly as expressive as pro-
prietary ones, widening the applicability of the proposed automatic learning ap-
proach. Finally Chapter 5 further extends the analysis of combining stateless and
stateful features, by introducing novel feature spaces which allow the inclusion of
both structural and temporal data properties. Via hypothesis testing and in a broad
evaluation on a range of different learning methods, these feature spaces are shown
to perform better than various other relevant feature spaces from the research fields
of sequence learning and process mining. These thesis aspects are summarized by
the following contributions:
111
Analysis of the properties of structured, sequential data, specifically of net-
work communication and code.
Proposal of novel features or feature combinations to represent those prop-
erties.
Learning Methods and Evaluation
The detection and classification performance of the proposed features was eval-
uated using a wide range of learning methods, from classical machine learning
methods like Markov classifiers [LSD+15] and Bayes classifiers [PSBDL18] over
kernel based learning methods like One-Class SVM [SWS+00] and Two-Class
SVM [MMR+01, SS02] to neural network classifiers like MLP [Sch15] and LSTM
[HS97], all of which are relevant in related research fields, like sequence learning
or process mining. During the research Support Vector Machines proved to show
the most reliable performance, especially when handling small training data sets,
and very high dimensional feature spaces. Thus Chapter 2 uses the One-Class
SVM very effectively, achieving very good detection performance and false posi-
tive rates. This is further emphasized in Chapter 3, which also achieves very good
results. Here it is shown that One-Class SVMs can be further improved by addi-
tionally including labeled data of the second class, shifting its interpretation more
towards the supervised Two-Class SVM. It was also shown that combining dif-
ferent detectors further improved the system performance. Chapter 4 continues
these analyses by relying on supervised learning, showing the advantages of the
Two-Class SVMs in the evaluation against competitive methods, as well as the ad-
ditional advantages of using the trained model for the interpretation of the results.
Finally Chapter 5 shows the advantages of Two-Class SVMs on a different objec-
tive, against competitive learning methods. In the proposed system these learning
methods are combined with Multi-Class SVM detectors, which results in a good
classification performance and the added benefit of estimating the reliability of the
predicted results. These thesis aspects are summarized by the following contribu-
tions:
Proposal of novel ways of combining learning methods to achieve the re-
spective classification and detection objectives.
In depth evaluation of the proposed features, learning methods and systems,
competitively compared against approaches used in related research areas
like IT security, sequence learning and process learning.
Limitations and Outlook
Both practical and theoretical limitations were found during the course of this the-
sis’ research, but none of them represent insurmountable obstacles. For the practi-
cal implementation and evaluation of the complete systems it was often crucial to
112
achieve a close and high-performance integration of the system components. As
a result the system proposed in Chapter 3 could not be automated as expected,
because the lifetime of the malicious activity and the volatility of the JavaScript
attacks required a much closer integrated system - a problem, a practical applica-
tion could solve with a sufficiently performing implementation. Similarly Chapter
5 required a complex implementation of the projection methods to achieve a highly
parallelized processing chain. Also theoretical limitations like the additional model
selection parameters and sufficiently discriminative training data (Chapter 4) or an
improved feature space dimensionality reduction (Chapter 5) did not represent un-
solvable problems.
Apart from these main limitations there is still room for improvement. Combin-
ing different feature spaces which represent different data aspects still has a very
high potential to create more expressive features. Methods like RDE [BBM08]
could help at this point by improving the selection of relevant dimensions of those
feature spaces. Furthermore word vector feature representations like those de-
scribed in [MCCD13] could help improving the results obtained with LSTM neural
networks, like those used in Chapter 5. While this combination of feature repre-
sentation and learning method showed great results on other sequential data, and
could potentially cover the sequential and contextual aspects of the occurring event
relations, integrating the temporal aspects of the datasets used in this thesis would
still require additional research.
113
Chapter 7
Appendix
7.1 Learning Methods
This section provides further explanations and formal notations on the learning
methods most prominently used throughout the thesis.
7.1.1 Support Vector Machines
Support vector machines [MMR+01, SS02] are a supervised learning method based
on the idea of finding the maximum margin hyperplane between data distributions
of two linear separable, labeled classes, which can also be extended to non-linearly
separable class distributions by use of the kernel trick. Given labeled training
samples (xi, yi)with i[1, m]and yi {−1,1}of two classes, each hyper-
plane separating the classes can be described through an orthonormal vector w
and its distance bfrom the origin. It can thus be described by the linear equation
w, x+b= 0. A hyperplane splits the feature space in two parts, with one class
on either side. As such one can describe the class label of each training sample as
yi=sgn(w, xi+b). During the training phase wand bare optimized s.t. the
margin between both classes is maximized. For linear separable data this is done
by minimizing 1
2||w||2, s.t. the side condition yi(w, xi+b)1,i[1, m]
holds.
This formulation can be extended by a slack variable ξi>0penalizing non-
zero dimensions, and a constant C, which limits the number of samples allowed to
lie within the margin, e.g. when the data is not totally linear separable. As a result
the optimization problem becomes minimizing wand ξ, i.e.
min
w
1
2||w||2+C
m
i=1
ξi
s.t.
yi(w, xi+b)1ξi,i[1, m].
114
If we want to separate non-linear separable classes, a dual problem formulation
is required, additional to the primal problem just formulated. As we can represent
was linear combination of training samples, with w=m
i=1 αiyixi, we can derive
the dual form via the Lagrange multipliers and the Karush-Kuhn-Tucker condition,
resulting in the optimization problem to maximize α, i.e.
max
α
m
i=1
αi1
2
m
i=1
m
j=1
αiαjyiyjxi, xj
s.t.
0αiC,
m
i=1
αiyi= 0.
This results in the following classification rule:
f(x) = sgn(w, x+b) = sgn(
m
i=1
αiyixi, x+b)(7.1)
This also explains the origin of the name of the method, as support vectors are
a subset of xi, whose Lagrange variables αi= 0 and which lie either on or in
the margin (if ξi>0). We can now extend this classifier to non-linear separable
classes by using kernel functions, which are based on the idea of using a high-
dimensional hyperplane for separating the classes - which can be integrated in the
SVM without explicitly projecting the samples into this high-dimensional feature
space. This is called the ’Kernel Trick’. As such we start by defining a projection
function ϕ:Rd1Rd2, x → ϕ(x), with d1< d2. Since we are using only the
scalar product of xiand xjin the optimization problem, we can replace xi, xjin
the input space Rd1with ϕ(xi), ϕ(xj)in the high-dimensional feature space Rd2.
Now we can use a positive definite kernel function
k(xi, xj) = ϕ(xi), ϕ(xj)
to define the new classification method
f(x) = sgn(w, ϕ(x)+b) = sgn(
m
i=1
αiyik(xi, x) + b)(7.2)
with w=m
i=1 αiyiϕ(xi). An example of such a kernel function is the widely
used radial basis function (RBF) kernel, defined as
k(xi, xj) = exp(||xixj||2
2σ2),
with σas the kernel width, which has to be set through model selection.
115
7.1.2 One Class SVM
Using an SVM approach for novelty or outlier detection was first discussed in
[SWS+00], which allowed defining a separating hyperplane for describing a single
(unlabeled) data distribution in the form of a One Class SVM. Given a data distri-
bution xX, the objective is to detect samples deviating from this distribution
and label them as novel or anomalous. Similar to the previously explained Support
Vector Machines, the kernel trick can be applied, i.e. one can choose a kernel to
define a function ϕ(x), projecting the data into a high-dimensional feature space.
Now a hyperplane can be defined, separating the origin and the given data distri-
bution. By maximizing its distance from the origin, we obtain a description of the
data distribution. A novel or anomalous data point is then labeled as such. Ad-
ditionally an anomaly score can be assigned to this data point, e.g. based on the
distance to the selected hyperplane. Figure 7.1 illustrates these concepts.
Figure 7.1: Graphical illustration of the One Class SVM
Due to its similarity to the Two Class SVM formulation, also its optimization
problems are similar. Thus the primal optimization problem is again formulated as
the following minimzation
min
w,b,ξ
1
2||w||2b+1
m
i=1
ξi
s.t.
w, ϕ(xi) bξiand ξi0,i[1, m],0ν1,
where νallows a better model selection, providing an upper bound on the fraction
of outliers. The respective dual optimization problem is formulated as follows:
max
α1
2
m
i=1
m
j=1
αiαjk(xi, xj)
s.t. m
i=1
αi= 1 and 0αi1
,i[1, m].
116
Similar to the SVM definition, the decision function utilizes the position of the
predicted data point respective to the separating hyperplane and can be defined as
f(x) = sgn(w, ϕ(x)+b).(7.3)
Another approach on the use of support vectors and kernels to describe a data
distribution and use it for novelty detection, which is equivalent to the previously
described one but slightly more intuitive, was provided in [TD04]. Here the opti-
mization objective is formulated as minimizing an enclosing hypersphere around
the data distribution, with the anomaly score of a newly classified data point being
defined by its distance from the center of this hypersphere.
7.1.3 MLP
Artificial neural networks are based on the idea of reproducing the way neurons
work in the brain. In this sense perceptrons [Ros58] were introduced to repre-
sent a single neuron. It receives input values of previous neurons and processes
their weighted sum with an activation function, finally producing an output in the
range of [0,1]. Whereas the original perceptron was restricted to using thresholded
activation functions though, a single neuron in a multilayer perceptron network
[Sch15] allows using arbitrary activation functions. In such a feedforward neural
network, multiple of such perceptrons are combined in multiple layers, namely the
input layer, several hidden layers, and a final output layer. Figure 7.2 illustrates
this concept, showing the different layers and the respective neuronal nodes, here
using a sigmoid activation function. Notice that each previous node is fully con-
nected to each node of the subsequent layer.
Figure 7.2: Exemplary fully-connected Multi Layer Perceptron Network
Each node has a weight vector assigned to its input data. By comparing the
labels of the input data with the created output labels, an error function can be
defined. During the training procedure of the MLP network the weight vectors
117
are dynamically adapted to reduce this error function. This is achieved using a
gradient descent over the calculated error surface, followed by a backpropagation
of the error to the respective weight vectors. This basis of most neural network
learning methods is described in detail in [RHW86].
7.1.4 RNN and LSTM
As an extension of neural network architectures to sequential data, Recurrent Neu-
ral Networks [RHW86] have been designed to enable access to information of pre-
vious events. This is achieved by extending classical feedforward neural network
architectures by integrating loops into the processing of each node, illustrated in
Figure 7.3. Here a node architecture Aof a neural network layer processes input
data xt, i.e. data xat moment t- but additionally utilizes a loop containing the
output data of the previous step t1. This concept becomes clear in the unrolled
node chain.
Figure 7.3: Unrolled RNN node loop
While RNNs are able to properly represent sequences with a short temporal
context (i.e. contextually relevant data points are always close together), they fail
in representing information represented in a larger temporal context (i.e. contex-
tually relevant data points are further away). For this purpose Long Short Term
Memory networks (LSTMs) [HS97] were designed, conceptually capable of prop-
erly representing this long-term dependency problem. Before discussing their spe-
cific properties, Figure 7.5 shows the notation used to describe those properties
graphically.
Figure 7.4: Notation for the graphical LSTM explanations
Commonly used neural network node operate using specific activation func-
tions (also known as transfer function) to determine how strong the respective neu-
ron is activated, by defining its output based on the summed and weighted input
vectors. These activation functions are usually non-linear, to represent complex
118
ways of combining the input values, e.g. the logistic function (also known as sig-
moid function, due to its shape), producing output values in the range of [0,1], or
the hyperbolic tangent function (tanh), returning values in the range of [1,+1].
Figure 7.5 illustrates this with a tanh activation function.
Figure 7.5: Simple activation function layout in RNN
An LSTM node consists of a complex arrangement of different activation func-
tions. Additionally, the so called cell state provides a means to memorize previous
events. These concepts provide an efficient way of including new relevant infor-
mation and to remove older no longer relevant information. Figure 7.6 illustrates
these concepts, processing the input data xt, the previous output data ht1and the
previous cell state Ct1.
Figure 7.6: Structure of an LSTM node
The cell state is modified by so-called "gates", which consist of a sigmoid
neural network layer and a point-wise operation. As its respective sigmoid layer
provides a weighting in the range of [0,1], these gate allow removing or adding
information as needed. The first one is the "forget gate layer", which utilizes ft
defined as follows:
ft=σ(Wf,[ht1, xt]+bF)
thereby assessing, which of the previous information to discard. The second one is
the "input gate layer", creating a new vector ˜
Cto populate the cell state, which is
then also filtered via the sigmoid function. This is defined as follows:
˜
Ct=tanh(WC,[tt1, xt]+bC)
119
it=σ(Wi,[ht1, xt]+bi)
Now the cell state can be updated as follows:
Ct=ftCt1+it˜
Ct
Finally the new output htis assembled, based on a filtered version of the previous
output ht1and the updated cell state Ct.
ot=σ(Wo,[ht1, xt]+bo)
ht=ottanh(Ct)
7.2 Evaluation Metrics
We are using the following metrics for the evaluations conducted throughout the
thesis, all of which are based on the base metrics of true positives (TP), false posi-
tives (FP), true negatives (TN) and false negatives (FN), as illustrated in Figure 7.7.
Note that not all of them can be used for each type and combination of classifier,
so their individual requirements are discussed separately in the respective chapters.
Section 7.3 provides additional discussions on the impact of unbalanced classes
on the evaluation of the system performance and the selection of the respective
metrics.
Figure 7.7: Base metrics and their interpretation
Precision
The precision focusses on the predicted positives. It provides a view on the ratio of
correctly positive predicted values. This metric is very important, if the objective
is to focus on one of the two classes, finding its - and only its - samples. Precision
is also known as true positive rate or confidence, and is defined as follows:
precision =TP
TP +FP
120
Recall
The recall focusses on the real positives. It provides a view on the ratio of positively
predicted samples found over all positive samples. It is also known as sensitivity.
It is defined as follows:
recall =TP
TP +FN
False Positive Rate
The false positive rate provides the ratio of erroneously positively predicted sam-
ples. It is defined as follows:
false positive rate =F P
TP +FP
Accuracy
The accuracy focusses on correct predictions, i.e. it provides a view on the ratio of
overall correctly predicted samples of all samples of both classes. It is defined as
follows:
accuracy =TP +T N
TP +FP +TN +FN
ROC-AUC
For this metric the Receiver Operation Characteristics curve is calculated over the
false positive and true positive values, achieved by shifting the decision boundary
of the trained classifier along the bias values defined by the decision scores (i.e.
the distances) of all test samples. Afterwards its enclosed area is calculated, which
has to be maximized to achieve the best classification performance. Additionally,
this curve also allows for the calibration of classification models on validation data,
e.g. by setting thresholds for the maximum allowed false positive rate of the final
model.
F1score
The F1 score is the balanced Fscore, which is the harmonic mean of precision and
recall. It is defined as follows:
F1(ytrue, ypred, c) = 2 ·precision ·recall
precision +recall
The parameter cis the class label of the class that is to be predicted positively
(or detected) and is thus the primary class for this F1score. It focusses on both
the predicted positives and the real positives. The F1score is only defined for the
binary case, i.e. for two classes, but can be extended to cover more classes by
averaging procedures, as described below.
121
Multi-class extension
Above metrics are always defined on two classes: a positive and a negative class.
Using multi-class classifiers (as e.g. in Chapter 5) we need to be able to evalu-
ate their performance together. To achieve a better representation of potentially
unweighted class sizes, we are using two ways of averaging:
macro averaging, which calculates a value per positive class, and averages
over all
weighted averaging, which calculates a value per positive class, and weighs
it with the number of true samples of this class
This is described below for the F1score, but can be calculated equivalently for
accuracy, precision, recall and ROC-AUC. Thus F1macro is defined as follows:
F1scoremacro(ytrue, ypred, C) =
cC
F1(ytrue, ypred, c)
|C|
with |C|=|set(cytrue)|, i.e. the number of unique classes c. This means,
the F1score is calculated for each c, and then the unweighted average is calcu-
lated. Therefore this approach does not take class size variance into account - but
F1scoreweighted does, and is defined as follows:
F1scoreweighted(ytrue, ypred, C) =
cC
F1(ytrue, ypred, c)· |c|
cC
|C|
with |c|being the number of samples of this class c. This means, F1is again
calculated for each c, but the average is weighted by the number of samples of
each respective c. Therefore this approach alters F1scoremacro by taking class
size imbalance into account.
7.3 Using Metrics for Data with unbalanced Class Sizes
Learning on data with unbalanced class sizes requires additional care in the selec-
tion of learning methods and performance evaluation metrics. We already covered
some aspects of this in Chapter 3.4.2 with the class weighting and in Chapter 5.5
with the combined classification system and the confidence ratings. By discussing
the implications of the Bases-Rate Neglect and the use of the accuracy as a metric,
this appendix adds further analyses on this topic. The Base-Rate Neglect describes
an effect occurring in detection systems applied to sets of populations with largely
differing size. In the context of IT security it was first raised in [Axe00], where
it was discussed for intrusion detection systems, in which large numbers of be-
nign reports are discriminated against very small numbers of malicious intrusion
122
attempts - a property that is also true for the use cases discussed in Chapters 2, 3
and (to some extent) Chapter 4. The consequence of this effect is that a trained
intrusion detection system requires a false positive rate (FP R) much lower than
the probability that an actual intrusion ˆ
Imay occur (P(ˆ
I)), i.e. FPR << P(ˆ
I).
If this requirement is not fulfilled, it is more likely that a detected intrusion is in
fact a benign sample (a false positive) than being a correctly predicted intrusion (a
true positive). This is true despite otherwise high true positive rates of the applied
detector.
We will now analyze this more formally. We denote Ito be an incident, rep-
resenting an intrusion (as in [Axe00]), a covert HTTP request (in Chapter 2), a
malicious JavaScript containing URL (in Chapter 3) or an invalid failure sample
(in Chapter 4). We denote Ato be an alarm caused by the detection of an inci-
dent Ithrough the respective detection system. Then we can define the following
probabilities:
P(A|I): True Positive Rate
P(AI): False Positive Rate
P(I): Probability that incident Ihappens
P(¬I): Probability that incident Idoes not happen
with P(I) = # of samples p.d./# of incidents p.d.1. These values can be
placed in Bayes’ Theorem
P(A|B) = P(A)·P(B|A)
P(B)
to obtain the following formulation for calculating the Bayes Detection Rate
P(I|A):
P(I|A) = P(I)·P(A|I)
P(I)·P(A|I) + P(¬I)·P(AI)
By providing the probability that an alarm Ais indeed correctly predicting an
incident I,P(I|A)allows an intuitive interpretation of the practical usability of
detection systems for unbalanced class data. We will now shortly discuss this for
some examples. For these examples also the respective Accuracy values are shown.
As the accuracy balances true positive and true negative values equivalently, it is
highly prone to reflect the detection performance on the benign class, achieving
indifferentiable results in the process. Due to this limited capability of the accuracy
to properly reflect the classification performance of system applied to data with
very unbalanced class sizes, it was not used in this thesis.
Table 7.1 shows the relevant properties of exemplarily selected data sets of
Chapters 2, 3 and 4. Their Bayes Detection Rates P(A|I)and their accuracy val-
ues are shown in Table 7.2. For the DUMONT detector the resulting probability
that an alarm is indeed caused by a covert communication attempt is very high,
123
Chapter 2 Chapter 3 Chapter 4
Requests Filtered URLs 2014 Campaign
Benign Malicious Benign Malicious Valid Invalid
Timespan 90 days 137 days 22 days
Samples 182,996 12,899 2,900,000 3,220 4,500 400
Samples p.d. 2,033 143 21,267 23 205 18
P(I)0.066 0.001 0.081
P(¬I)0.934 0.999 0.919
Table 7.1: Representative data set properties of Chapters 2, 3 and 4
Chapter 2 Chapter 3 Chapter 4
DUMONT Competing Combined Exemplary
Detector AV Scanners Detectors SVM(Γs.less)
Accuracy 0.9873 0.9987 0.9994 0.9147
P(A|I)0.857 0.295 0.545 0.901
P(AI)0.0035 0.0005 0.0001 0.084
P(I|A)0.95 0.371 0.845 0.485
Table 7.2: Representative classification model properties of Chapters 2, 3 and 4
i.e. P(I|A) = 95%, making these predictions very reliable. In Chapter 3 the
probability that an alarm is indeed caused by a JavaScript attack is 37.1 % for the
competing AV Scanners and 84.5 % for the proposed combined detectors (on all
samples). Thus the Bayes Detection Rate allows an easy identification of the bet-
ter model - while the accuracy barely changes for the two models at all and does
therefore not allow a proper performance comparison. The selected SVM classifier
on Γs.less features, representing the classification performance of the system pro-
posed in Chapter 4, achieves a P(I|A)of 48.5%. Hence each correctly predicted
invalid sample also adds (approximately) one incorrectly predicted valid sample.
Nonetheless the predictor remains useable, since the ratio of valid to invalid sam-
ples is merely 11.25 :1, barely meeting the assumption of the Base-Rate Neglect.
Hence this approach still provides a filter, reliably preselecting candidate samples.
For the analysis of these candidates the proposed methods for reliability estima-
tion, like the the weight vector visualization of Chapter 4.6.2 and the confidence
ratings of Chapter 5.5 provide additional help. This percentage can also be further
alleviated by choosing other models with a lower P(AI)or applying the class
weighting, described in Chapter 3.4.2. Since the class data used for the evaluations
of Chapter 5 are relatively balanced, the assumption of the Base Rate Neglect does
not hold. Therefore we will not discuss its Bayes Detection Rate here.
124
7.4 Visualizations for Interpretable Systems
To be able to effectively utilize machine learning approaches in productive sys-
tems, they need to address more aspects than just the classification performance.
To achieve a trustworthy system, transparence of the learning procedure and the
utilized features, as well as interpretability of the classification results are equally
important. In this section we will discuss two visualization methods specifically
developed to address those topics. As such Section 7.4.1 provides an extended
discussion of the weight vector visualization introduced in Section 4.6.2, which
allows additional insights into the classification results on whole data sets. Section
7.4.2 on the other hand discusses a way of visualizing the structural δnmatching
function ˆ
Φ, as defined for the ΓST feature space in Section 5.4.6, providing insights
into this feature space projection procedure. Further analyses on the relevance of
the interpretability of more general classification systems for reliable results can be
found in [LWB+19].
7.4.1 Weight Vector Visualization
As already discussed in Section 4.6.2, we can visualize the values of the individ-
ual dimensions of the wof a trained SVM model to estimate the relevance of the
respective dimensions for the underlying classification model. Table 7.3 illustrates
such exemplary dimensions and its values wiin the weight vector w, on the ex-
ample of the token 3-gram debug-data of Chapter 4. As discussed there, we can
also further improve the interpretability of these values by decorrelating wusing
the covariance matrix of Strain. To highlight its impact, figures 7.8, 7.9, 7.10 and
7.11 show the visualizations of the original correlated and the decorrelated wof a
whole data set, for models trained with the L1 and L2 norm.
witoken 3-gram dimension di
-0.674 Redirect to https
-0.431 028 TASK_RESULT HTTP
... ...
-0.124 175 SP_ICMP EV_SA_SUCCESS
... ...
-0.001 028 STATEMACHINE VMCCRAT
... ...
0.075 175 DTS MODULE
... ...
0.286 23 kbit/s smart
0.323 Overall Duration 10000
Table 7.3: Exemplary token 3-gram dimensions of wof debug-data
125
As described in Chapter 4, the bottom row shows the respective values of w,
sorted descending. Each row contains the data for a single sample x. The first
parameter denotes the sample ID, followed by the ytrue and ypred, the true and the
predicted class label. This is followed by an identity verification of ytrue =ypred,
which is denoted by the letter "T", if found true. Finally the widixis
shown, i.e. the sum of the wiof those dimensions dithat occur in the sample x-
all of which are then shown as the black bars in the graphical representations.
By creating these visualizations for a whole data set, sorted by the sum of the
matching weights, we can determine dimensions that are relevant for whole subsets
of samples, indicating a potential class relationship between those samples. The vi-
sualization also allows conclusions about the confidence in the predictions for the
individual samples: whereas the samples at the top and the bottom end are predom-
inantly classified correctly (lying farthest from the separating hyperplane of the uti-
lized SVM), those in the middle lie closer to the separating hyperplane, increasing
the risk for potential misclassifications. This can easily be seen when closer in-
specting the identity verification, which shows most misclassifications around this
region.
When using the L1 norm during the training phase, the dimensions are al-
ready distinguishably separated, s.t. the interpretation of the individual dimen-
sional relevance is in fact easier on the original sparse solution, than on the dense,
decorrelated version, where many additional dimensions suddenly become equally
relevant. However, in our experiments the L2 norm generally allowed a better clas-
sification performance, and is therefore the preferred approach. In this case, as
shown in figures 7.10 and 7.11, a comprehensive interpretation of the relevant di-
mensions absolutely requires the decorrelation of w, as its weights wilie otherwise
indistinguishably close to each other.
126
Figure 7.8: w-visualizations based on the L1 norm, correlated with Strain
127
Figure 7.9: w-visualizations based on the L1 norm, decorrelated from Strain
128
Figure 7.10: w-visualizations based on the L2 norm, correlated with Strain
129
Figure 7.11: w-visualizations based on the L2 norm, decorrelated from Strain
130
7.4.2 Visualizing the δnmatching function ˆ
Φof ΓST
To obtain additional insights into the δnmatching procedure, as defined for
the ΓST feature space in Section 4.6.2, we developed a tree-based visualization,
as shown exemplarily in Figure 7.12. The idea is to obtain the ability to visually
inspect similarities between a projected sequence sand a selected set of model
sequences SM, utilized in the structural δnmatching function ˆ
Φ. To achieve this,
the model sequences sMSMwith SM {SM
2C, SM
MOC, SM
MT C }are modeled as
a tree, with the events as individual nodes. To reduce the number of depicted nodes,
the tree is converted into a prefix tree, i.e. the prefix-paths of sequences are merged,
if they are identical. To further reduce the complexity, also the suffix-paths can
be merged, resulting in a compressed graph, depicted in Figure 7.13, which also
contain the final colored highlights for the structural δnmatching, for values of
n= 2 and δ= 1. To allow the visual inspection, nodes of the projected sequence
sare highlighted with a bold black frame, and their structural matches with bold
blue frames. Using these colored highlights, the types of sequences contained in
the visualized SM, as well as relevant or dominant sub-sequences can be easily
distinguished.
131
Figure 7.12: Prefix-tree visualization of sequence sprojected on model sequences
SM
132
Figure 7.13: Compressed graph visualization of sequence sprojected on model
sequences SM
133
Bibliography
[AAB+15] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo,
Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jef-
frey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow,
Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia,
Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Lev-
enberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek
Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit
Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Van-
houcke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete
Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xi-
aoqiang Zheng. TensorFlow: Large-scale machine learning on
heterogeneous systems, 2015. Software available from tensor-
flow.org.
[ABCM09] Michal Aharon, Gilad Barash, Ira Cohen, and Eli Mordechai.
One Graph Is Worth a Thousand Logs: Uncovering Hid-
den Structures in Massive System Event Logs, pages 227–243.
Springer Berlin Heidelberg, Berlin, Heidelberg, 2009.
[Axe00] Stefan Axelsson. The base-rate fallacy and the difficulty of in-
trusion detection. ACM Transactions on Information and System
Security (TISSEC), 3(3):186–205, 2000.
[BBCP04] J. Beale, A.R. Baker, B. Caswell, and M. Poor. Snort 2.1 Intru-
sion Detection. Syngress Publishing, 2nd edition, 2004.
[BBM08] Mikio L Braun, Joachim M Buhmann, and Klaus Robert Müller.
On relevant dimensions in kernel feature spaces. Journal of Ma-
chine Learning Research, 9:1875–1908, 2008.
[BBM+15] Sebastian Bach, Alexander Binder, Grégoire Montavon, Fred-
erick Klauschen, Klaus-Robert Müller, and Wojciech Samek.
On pixel-wise explanations for non-linear classifier decisions by
layer-wise relevance propagation. PLOS ONE, 10(7):1–46, 07
2015.
134
[BGH14] Georges Bossert, Frédéric Guihéry, and Guillaume Hiet. To-
wards automated protocol reverse engineering using semantic
information. In Proceedings of the 9th ACM symposium on In-
formation, computer and communications security, pages 51–
62. ACM, 2014.
[Bis95] Christopher M. Bishop. Neural Networks for Pattern Recogni-
tion. Oxford University Press, Inc., New York, NY, USA, 1995.
[Blo70] Burton H Bloom. Space/time trade-offs in hash coding with al-
lowable errors. Communications of the ACM, 13(7):422–426,
1970.
[BMB10] Armin Büscher, Michael Meier, and Ralf Benzmüller. Throwing
a monkeywrench into web attackers plans. In Proc. of Commu-
nications and Multimedia Security (CMS), 2010.
[BP04] K. Borders and A. Prakash. Web tap: detecting covert web traf-
fic. In Proc. of Conference on Computer and Communications
Security (CCS), pages 110–120, 2004.
[BPX+07] Thorsten Brants, Ashok C Popat, Peng Xu, Franz J Och, and
Jeffrey Dean. Large language models in machine translation. In
In Proceedings of the Joint Conference on Empirical Methods in
Natural Language Processing and Computational Natural Lan-
guage Learning, pages 858–867. Citeseer, 2007.
[BS85] R. Beran and M.S. Srivastava. Bootstrap tests and confidence
regions for functions of a covariance matrix. Annals of Statistics,
13(1):95–115, 1985.
[BSH+10] David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki
Kawanabe, Katja Hansen, and Klaus-Robert Müller. How to
explain individual classification decisions. J. Mach. Learn. Res.,
11:1803–1831, August 2010.
[CCVK11] D. Canali, M. Cova, G. Vigna, and C. Kruegel. Prophiler: a fast
filter for the large-scale detection of malicious web pages. In
Proc. of the International World Wide Web Conference (WWW),
April 2011.
[CFH14] Woon Hau Chin, Zhong Fan, and Russell Haines. Emerging
technologies and research challenges for 5g wireless networks.
IEEE Wireless Communications, 21(2):106–112, 2014.
[CKV10] Marco Cova, Christopher Kruegel, and Giovanni Vigna. De-
tection and analysis of drive-by-download attacks and malicious
135
JavaScript code. In Proc. of the International World Wide Web
Conference (WWW), 2010.
[CLC+16] Patrik Cerwall, Anette Lundvall, Stephen Carson, Richard
Möller, Susanna Bävertoft, Anna Jacobsson, Git Sellin, Michael
Björn, Vishnu Singh, Stephen Carson, Reiner Ludwig, Lasse
Wieweg, Jonas Edstam, Per Lindberg, and Kati Öhman. Er-
icsson mobility report: On the pulse of the networked society.
Technical report, Ericsson, Stockholm, Sweden, June 2016.
[CLF+14] Michelangelo Ceci, Pasqua Fabiana Lanotte, Fabio Fumarola,
Dario Pietro Cavallo, and Donato Malerba. Completion time
and next activity prediction of processes using sequential pat-
tern mining. In International Conference on Discovery Science,
pages 49–61. Springer, 2014.
[CLZS11] Charlie Curtsinger, Benjamin Livshits, Benjamin Zorn, and
Christian Seifert. Zozzle: Fast and precise in-browser javascript
malware detection. In Proc. of USENIX Security Symposium,
2011.
[Col02] Michael Collins. Discriminative training methods for hidden
markov models: Theory and experiments with perceptron algo-
rithms. In Proceedings of the ACL-02 conference on Empirical
methods in natural language processing-Volume 10, pages 1–8.
Association for Computational Linguistics, 2002.
[CPC+08] Weidong Cui, Marcus Peinado, Karl Chen, Helen J Wang, and
Luis Irun-Briz. Tupni: Automatic reverse engineering of in-
put formats. In Proceedings of the 15th ACM conference on
Computer and communications security, pages 391–402. ACM,
2008.
[CPWK06] Weidong Cui, Vern Paxson, Nicholas Weaver, and Randy H
Katz. Protocol-independent adaptive replay of application di-
alog. In NDSS, volume 4, pages 279–293, 2006.
[CSL+08] G. Cretu, A. Stavrou, M. Locasto, S.J. Stolfo, and A.D.
Keromytis. Casting out demons: Sanitizing training data for
anomaly sensors. In Proc. of IEEE Symposium on Security and
Privacy, 2008.
[CSRL01] Thomas H. Cormen, Clifford Stein, Ronald L. Rivest, and
Charles E. Leiserson. Introduction to Algorithms. McGraw-Hill
Higher Education, 2nd edition, 2001.
136
[CSS+10] Chia Yuan Cho, Eui Chul Richard Shin, Dawn Song, et al. In-
ference and analysis of formal models of botnet command and
control protocols. In Proceedings of the 17th ACM conference on
Computer and communications security, pages 426–439. ACM,
2010.
[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks.
Machine learning, 20(3):273–297, 1995.
[CWKK09] Paolo Milani Comparetti, Gilbert Wondracek, Christopher
Kruegel, and Engin Kirda. Prospex: Protocol specification ex-
traction. In Security and Privacy, 2009 30th IEEE Symposium
on, pages 110–125. IEEE, 2009.
[CYLS07] Juan Caballero, Heng Yin, Zhenkai Liang, and Dawn Song.
Polyglot: Automatic extraction of protocol message format us-
ing dynamic binary analysis. In Proceedings of the 14th ACM
conference on Computer and communications security, pages
317–329. ACM, 2007.
[DHF10] Andreas Dewald, Thorsten Holz, and Felix C Freiling. Adsand-
box: Sandboxing javascript to fight malicious websites. In Pro-
ceedings of the 2010 ACM Symposium on Applied Computing,
pages 1859–1864. ACM, 2010.
[DHM08] Mark Daniel, Jake Honoroff, and Charlie Miller. Engineering
heap overflow exploits with JavaScript. In Proc. of USENIX
Workshop on Offensive Technologies (WOOT), 2008.
[Die00] Thomas G Dietterich. Ensemble methods in machine learning.
In International workshop on multiple classifier systems, pages
1–15. Springer, 2000.
[Die02] Thomas G Dietterich. Machine learning for sequential data:
A review. In Joint IAPR International Workshops on Statisti-
cal Techniques in Pattern Recognition (SPR) and Structural and
Syntactic Pattern Recognition (SSPR), pages 15–30. Springer,
2002.
[DL15] Andrew M Dai and Quoc V Le. Semi-supervised sequence
learning. In Advances in neural information processing systems,
pages 3079–3087, 2015.
[EFG+10] Markus Engelberth, Felix Freiling, Jan Goebel, Christian
Gorecki, Thorsten Holz, Ralf Hund, Philipp Trinius, and Carsten
Willems. The InMAS approach. In Proc. of European Workshop
on Internet Early Warning and Network Intelligence (EWNI),
January 2010.
137
[EKK09] Manuel Egele, Engin Kirda, and Christopher Kruegel. Mitigat-
ing drive-by download attacks: Challenges and open problems.
In Proc. of Open Research Problems in Network Security Work-
shop (iNetSec), 2009.
[ERF16] Joerg Evermann, Jana-Rebecca Rehse, and Peter Fettke. A deep
learning approach for predicting process behaviour at runtime.
In International Conference on Business Process Management,
pages 327–338. Springer, 2016.
[EWKK09] Manuel Egele, Peter Wurzinger, Christopher Kruegel, and Engin
Kirda. Defending browsers against drive-by downloads: Mit-
igating heap-spraying code injection attacks. In Detection of
Intrusions and Malware & Vulnerability Assessment (DIMVA),
2009.
[FBH+02] N. Feamster, M. Balazinska, G. Harfst, H. Balakrishnan, and
D. R. Karger. Infranet: Circumventing web censorship and
surveillance. In Proc. of USENIX Security Symposium, pages
247–262, 2002.
[FCH+08] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin.
LIBLINEAR: A library for large linear classification. Journal of
Machine Learning Research (JMLR), 9:1871–1874, 2008.
[FGM+99] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter,
P. Leach, and T. Berners-Lee. Hypertext Transfer Protocol
HTTP/1.1. RFC 2616 (Draft Standard), June 1999. Updated by
RFC 2817.
[FPPS07] Jason Franklin, Vern Paxson, Adrian Perrig, and Stefan Savage.
An Inquiry Into the Nature and Causes of the Wealth of Internet
Miscreants. In Proc. of Conference on Computer and Communi-
cations Security (CCS), pages 375–388, 2007.
[GDDC97] David Grove, Greg DeFouw, Jeffrey Dean, and Craig Cham-
bers. Call graph construction in object-oriented languages. In
Proceedings of the 12th ACM SIGPLAN Conference on Object-
oriented Programming, Systems, Languages, and Applications,
OOPSLA ’97, pages 108–124, New York, NY, USA, 1997.
ACM.
[Gor08] Christian Gorecki. Truman box: Safe malware analysis by sim-
ulating the internet. Master’s thesis, University of Mannheim,
2008.
138
[GPZL08] G. Gu, R. Perdisci, J. Zhang, and W. Lee. BotMiner: Clus-
tering Analysis of Network Traffic for Protocol- and Structure-
Independent Botnet Detection. In Proc. of USENIX Security
Symposium, 2008.
[Gra12] Alex Graves. Supervised sequence labelling with recurrent neu-
ral networks. ISBN 9783642212703. URL http://books. google.
com/books, 2012.
[GSZ13] M Ghiassi, J Skinner, and D Zimbra. Twitter brand sentiment
analysis: A hybrid system using n-gram analysis and dynamic
artificial neural network. Expert Systems with applications,
40(16):6266–6282, 2013.
[GWY+15] Hugo Gascon, Christian Wressnegger, Fabian Yamaguchi,
Daniel Arp, and Konrad Rieck. Pulsar: Stateful black-box
fuzzing of proprietary network protocols. In Security and Pri-
vacy in Communication Networks, pages 330–347. Springer,
2015.
[GYAR13] Hugo Gascon, Fabian Yamaguchi, Daniel Arp, and Konrad
Rieck. Structural detection of android malware using embed-
ded call graphs. In Proceedings of the 2013 ACM workshop on
Artificial intelligence and security, pages 45–54. ACM, 2013.
[GZL08] G. Gu, J. Zhang, and W. Lee. BotSniffer: Detecting Botnet Com-
mand and Control Channels in Network Traffic. In Proc. of Net-
work and Distributed System Security Symposium (NDSS), 2008.
[HEF09] Thorsten Holz, Markus Engelberth, and Felix Freiling. Learn-
ing More About the Underground Economy: A Case-Study of
Keyloggers and Dropzones. In Proc. of European Symposium
on Research in Computer Security (ESORICS), 2009.
[HFH11] Mario Heiderich, Tilman Frosch, and Thorsten Holz. IceShield:
Detection and mitigiation of malicious websites with a frozen
dom. In Recent Adances in Intrusion Detection (RAID), Septem-
ber 2011.
[HGX+17] Zhen He, Shaobing Gao, Liang Xiao, Daxue Liu, Hangen He,
and David Barber. Wider and deeper, cheaper and faster: Ten-
sorized lstms for sequence learning. In Advances in Neural In-
formation Processing Systems, pages 1–11, 2017.
[HMG+14] Stefan Haufe, Frank Meinecke, Kai Görgen, Sven Dähne, John-
Dylan Haynes, Benjamin Blankertz, and Felix Bießmann. On the
interpretation of weight vectors of linear models in multivariate
neuroimaging. Neuroimage, 87:96–110, 2014.
139
[Hop71] John E. Hopcroft. An n log n algorithm for minimizing states in
a finite automaton. Technical report, Stanford, CA, USA, 1971.
[HS97] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term
memory. Neural computation, 9(8):1735–1780, 1997.
[HSD+08] T. Holz, M. Steiner, F. Dahl, E. Biersack, and F. Freiling. Mea-
surements and mitigation of peer-to-peer-based botnets: A case
study on storm worm. In Proc. of USENIX Workshop on Large-
Scale Exploits and Emergent Threats (LEET), 2008.
[IBC+12] Luca Invernizzi, Stefano Benvenuti, Paolo Milani Comparetti,
Marco Cova, Christopher Kruegel, and Giovanni Vigna.
EVILSEED: A guided approach to finding malicious web pages.
In Proc. of IEEE Symposium on Security and Privacy, May
2012.
[IHF08] Ali Ikinci, Thorsten Holz, and Felix Freiling. Monkey-spider:
Detecting malicious websites with low-interaction honeyclients.
In Proc. of Conference “Sicherheit, Schutz und Zuverlässigkeit”
(SICHERHEIT), pages 891–898, 2008.
[II07] K. L. Ingham and H. Inoue. Comparing anomaly detection tech-
niques for HTTP. In Recent Adances in Intrusion Detection
(RAID), pages 42 62, 2007.
[KGKR12] Tammo Krueger, Hugo Gascon, Nicole Krämer, and Konrad
Rieck. Learning stateful models for network honeypots. In Pro-
ceedings of the 5th ACM workshop on Security and artificial in-
telligence, pages 37–48. ACM, 2012.
[KKR11] Tammo Krueger, Nicole Krämer, and Konrad Rieck. ASAP: Au-
tomatic Semantics-Aware Analysis of Network Payloads, pages
50–63. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011.
[KL10] Marius Kloft and Pavel Laskov. Online anomaly detection un-
der adversarial impact. In Proceedings of the Thirteenth Interna-
tional Conference on Artificial Intelligence and Statistics, pages
405–412, 2010.
[KL14] Eric D Knapp and Joel Thomas Langill. Industrial Network Se-
curity: Securing critical infrastructure networks for smart grid,
SCADA, and other Industrial Control Systems. Syngress, 2014.
[KLN+11] Sandeep Karanth, Srivatsan Laxman, Prasad Naldurg, Rama-
rathnam Venkatesan, J. Lambert, and Jinwook Shin. ZDVUE:
Prioritization of javascript attacks to surface new vulnerabilities.
140
In Proc. of CCS Workshop on Security and Artificial Intelligence
(AISEC), October 2011.
[KLZS12] Clemens Kolbitsch, Benjamin Livshits, Benjamin Zorn, and
Christian Seifert. Rozzle: de-cloaking internet malware. In Proc.
of IEEE Symposium on Security and Privacy, May 2012.
[KPPK10] V. Kemerlis, V. Pappas, G. Portokalidis, and A. Keromytis.
ileak: A lightweight system for detecting inadvertent informa-
tion leaks. In Proc. of European Conference on Computer Net-
work Defense (EC2ND), 2010.
[KR05] Eamonn Keogh and Chotirat Ann Ratanamahatana. Exact in-
dexing of dynamic time warping. Knowledge and information
systems, 7(3):358–386, 2005.
[KSA+17] Pieter-Jan Kindermans, Kristof T Schütt, Maximilian Alber,
Klaus-Robert Müller, and Sven Dähne. Patternnet and pat-
ternlrp - improving the interpretability of neural networks.
arXiv:1705.05598, 2017.
[LC11] C. Leita and M. Cova. HARMUR: Storing and analyzing historic
data on malicious domains. In Proc. of Workshop on Building
Analysis Datasets and Gathering Experience Returns for Secu-
rity (BADGERS), April 2011.
[LCDF+15] Anna Leontjeva, Raffaele Conforti, Chiara Di Francescomarino,
Marlon Dumas, and Fabrizio Maria Maggi. Complex symbolic
sequence encodings for predictive monitoring of business pro-
cesses. In International Conference on Business Process Man-
agement, pages 297–313. Springer, 2015.
[LGKM06] P. Laskov, C. Gehl, S. Krüger, and K. R. Müller. Incremen-
tal support vector learning: Analysis, implementation and appli-
cations. Journal of Machine Learning Research, 7:1909–1936,
September 2006.
[LGN12] Mai Le, Bogdan Gabrys, and Detlef Nauck. A hybrid model for
business process event prediction. In Research and Development
in Intelligent Systems XXIX, pages 179–192. Springer, 2012.
[LJXZ08] Zhiqiang Lin, Xuxian Jiang, Dongyan Xu, and Xiangyu Zhang.
Automatic protocol format reverse engineering through context-
aware monitored execution. In NDSS, volume 8, pages 1–15,
2008.
141
[LMD05] Corrado Leita, Ken Mermoud, and Marc Dacier. Scriptgen: an
automated script generation tool for honeyd. In Computer Secu-
rity Applications Conference, 21st Annual. IEEE, 2005.
[LMP01] John Lafferty, Andrew McCallum, and Fernando CN Pereira.
Conditional random fields: Probabilistic models for segmenting
and labeling sequence data. ICML ’01 Proceedings of the Eigh-
teenth International Conference on Machine Learning 8, pages
282–289, 2001.
[LS05] Ming Li and Ronan Sleep. A robust approach to sequence clas-
sification. In Tools with Artificial Intelligence, 2005. ICTAI
05. 17th IEEE International Conference on, pages 5–pp. IEEE,
2005.
[LŠ11] Pavel Laskov and Nedim Šrndi´
c. Static detection of malicious
javascript-bearing pdf documents. In Proceedings of the 27th
Annual Computer Security Applications Conference, pages 373–
382. ACM, 2011.
[LSD+15] Geetika T Lakshmanan, Davood Shamsi, Yurdaer N Doganata,
Merve Unuvar, and Rania Khalaf. A markov prediction model
for data-driven semi-structured business processes. Knowledge
and Information Systems, 42(1):97–126, 2015.
[LSTCW01] Huma Lodhi, John Shawe-Taylor, Nello Cristianini, and Christo-
pher JCH Watkins. Text classification using string kernels. In
Advances in neural information processing systems, pages 563–
569, 2001.
[LWB+19] Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder,
Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller.
Unmasking clever hans predictors and assessing what machines
really learn. Nature communications, 10(1):1096, 2019.
[LWR+10] Philipp Leitner, Branimir Wetzstein, Florian Rosenberg, Anton
Michlmayr, Schahram Dustdar, and Frank Leymann. Runtime
prediction of service level agreement violations for composite
services. In Service-oriented computing. ICSOC/ServiceWave
2009 workshops, pages 176–186. Springer, 2010.
[LXLZ15] Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. Recurrent con-
volutional neural networks for text classification. In AAAI, pages
2267–2273, 2015.
[LYPL10] L. Lu, V. Yegneswaran, P. A. Porras, and W. Lee. BLADE: An
attack-agnostic approach for preventing drive-by malware infec-
142
tions. In Proc. of Conference on Computer and Communications
Security (CCS), October 2010.
[MBKM13] G. Montavon, M. L. Braun, T. Krueger, and K. R. Muller. An-
alyzing local structure in kernel-based learning: Explanation,
complexity, and reliability assessment. IEEE Signal Processing
Magazine, 30(4):62–74, July 2013.
[MCCD13] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Ef-
ficient estimation of word representations in vector space. arXiv
preprint arXiv:1301.3781, 2013.
[McH95] John McHugh. Handbook for the Computer Security Certifica-
tion of Trusted Systems, chapter Covert Channel Analysis. Naval
Research Laboratory, 1995.
[MDFDG14] Fabrizio Maria Maggi, Chiara Di Francescomarino, Marlon Du-
mas, and Chiara Ghidini. Predictive monitoring of business pro-
cesses. In International Conference on Advanced Information
Systems Engineering, pages 457–472. Springer, 2014.
[Min67] Marvin L. Minsky. Computation: Finite and Infinite Machines.
Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1967.
[MLB+17] Grégoire Montavon, Sebastian Lapuschkin, Alexander Binder,
Wojciech Samek, and Klaus-Robert Müller. Explaining nonlin-
ear classification decisions with deep taylor decomposition. Pat-
tern Recognition, 65:211 222, 2017.
[MLI+15] Andreas Metzger, Philipp Leitner, Dragan Ivanovi´
c, Eric
Schmieders, Rod Franklin, Manuel Carro, Schahram Dustdar,
and Klaus Pohl. Comparing and combining predictive business
process monitoring techniques. IEEE Transactions on Systems,
Man, and Cybernetics: Systems, 45(2):276–290, 2015.
[MMR+01] K-R Muller, Sebastian Mika, Gunnar Ratsch, Koji Tsuda, and
Bernhard Scholkopf. An introduction to kernel-based learning
algorithms. IEEE transactions on neural networks, 12(2):181–
201, 2001.
[MOM12] Grégoire Montavon, Genevieve B Orr, and Klaus-Robert Müller.
Neural networks: tricks of the trade. Springer, 2012.
[MSM18] Grégoire Montavon, Wojciech Samek, and Klaus-Robert Müller.
Methods for interpreting and understanding deep neural net-
works. Digital Signal Processing, 73:1–15, 2018.
143
[Naz09] J. Nazario. A virtual client honeypot. In Proc. of USENIX Work-
shop on Large-Scale Exploits and Emergent Threats (LEET),
2009.
[NBFS06] James Newsome, David Brumley, Jason Franklin, and Dawn
Song. Replayer: Automatic protocol replay by binary analy-
sis. In Proceedings of the 13th ACM conference on Computer
and communications security, pages 311–321. ACM, 2006.
[NKS05] J. Newsome, B. Karp, and D. Song. Polygraph: Automatically
generating signatures for polymorphic worms. In Proc. of IEEE
Symposium on Security and Privacy, pages 120–132, 2005.
[ORLS14] Aditya Oza, Kevin Ross, Richard M Low, and Mark Stamp. Http
attack detection using n-gram analysis. Computers & Security,
45:242–254, 2014.
[PAF+09] Roberto Perdisci, Davide Ariu, Prahlad Fogla, Giorgio Giacinto,
and Wenke Lee. Mcpad: A multiple classifier system for ac-
curate payload-based anomaly detection. Computer networks,
53(6):864–881, 2009.
[Pax99] V. Paxson. Bro: A system for detecting network intruders in
real-time. Computer Networks, 31(23–24):2435–2466, Decem-
ber 1999.
[PMRM08] Niels Provos, Panayiotis Mavrommatis, Moheeb A. Rajab, and
Fabian Monrose. All Your iFRAMEs Point to Us. In Proc. of
USENIX Security Symposium, 2008.
[PNC11] Suraj Pandey, Surya Nepal, and Shiping Chen. A test-bed for the
evaluation of business process prediction techniques. In Collab-
orative Computing: Networking, Applications and Worksharing
(CollaborateCom), 2011 7th International Conference on, pages
382–391. IEEE, 2011.
[PSBDL18] Mirko Polato, Alessandro Sperduti, Andrea Burattin, and Mas-
similiano De Leoni. Time and activity sequence prediction of
business process instances. Computing, pages 1–27, 2018.
[PSC15] Joris Pelemans, Noam Shazeer, and Ciprian Chelba. Pruning
sparse non-negative matrix n-gram language models. In Pro-
ceedings of Interspeech, pages 1433–1437, 2015.
[PVG+11] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg,
J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot,
144
and E. Duchesnay. Scikit-learn: Machine learning in Python.
Journal of Machine Learning Research, 12:2825–2830, 2011.
[Rab89] Lawrence R Rabiner. A tutorial on hidden markov models and
selected applications in speech recognition. Proceedings of the
IEEE, 77(2):257–286, 1989.
[RHW86] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams.
Learning representations by back-propagating errors. nature,
323(6088):533, 1986.
[Rie09] Konrad Rieck. Machine learning for application-layer intrusion
detection. Technology, page 151, 2009.
[RKD10] Konrad Rieck, Tammo Krueger, and Andreas Dewald. Cujo: ef-
ficient detection and prevention of drive-by-download attacks. In
Proceedings of the 26th Annual Computer Security Applications
Conference, pages 31–39. ACM, 2010.
[RLZ08] Paruj Ratanaworabhan, Benjamin Livshits, and Benjamin Zorn.
Nozzle: A defense against heap-spraying code injection at-
tacks. Technical Report MSR-TR-2008-176, Microsoft Re-
search, 2008.
[Roa07] Joan Robert Roaspana. SHELIA: a client honeypot for client-
side attack detection. Master’s thesis, Vrije Universiteit Amster-
dam, 2007.
[Ros58] Frank Rosenblatt. The perceptron: a probabilistic model for in-
formation storage and organization in the brain. Psychological
review, 65(6):386, 1958.
[RPF10] W. Lee R. Perdisci and N. Feamster. Behavioral clustering of
HTTP-based malware and signature generation using malicious
network traces. In Proc. of USENIX Symposium on Networked
Systems Design and Implementation (NSDI), pages 391–404,
2010.
[RSG16] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. Why
should i trust you?: Explaining the predictions of any classi-
fier. In Proceedings of the 22nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages
1135–1144. ACM, 2016.
[RSL+10] Konrad Rieck, Guido Schwenk, Tobias Limmer, Thorsten Holz,
and Pavel Laskov. Botzilla: Detecting the "phoning home" of
malicious software. In Proc. of 25th ACM Symposium on Ap-
plied Computing (SAC), pages 1978–1984, March 2010.
145
[RTWH11] Konrad Rieck, Philipp Trinius, Carsten Willems, and Thorsten
Holz. Automatic analysis of malware behavior using machine
learning. Journal of Computer Security, 19(4):639–668, 2011.
[Sau10] Martin Sauter. From GSM to LTE: an introduction to mobile
networks and mobile broadband. John Wiley & Sons, 2010.
[SB14] Guido Schwenk and Sebastian Bach. Detecting behavioral and
structural anomalies in mediacloud applications. arXiv preprint
arXiv:1409.8035, 2014.
[SBKR12] Guido Schwenk, Alexander Bikadorov, Tammo Krueger, and
Konrad Rieck. Autonomous learning for detection of javascript
attacks: vision or reality? In Proceedings of the 5th ACM work-
shop on Security and artificial intelligence, pages 93–104. ACM,
2012.
[Sch15] Jürgen Schmidhuber. Deep learning in neural networks: An
overview. Neural networks, 61:85–117, 2015.
[SCW+03] Rong She, Fei Chen, Ke Wang, Martin Ester, Jennifer L Gardy,
and Fiona SL Brinkman. Frequent-subsequence-based predic-
tion of outer membrane proteins. In Proceedings of the ninth
ACM SIGKDD international conference on Knowledge discov-
ery and data mining, pages 436–445. ACM, 2003.
[SDHD07] S. Stover, D. Dittrich, J. Hernandez, and S. Dietrich. Analysis of
the Storm and Nugache Trojans: P2P is here. USENIX ;login:,
32(6):18–27, 2007.
[SGHSV11] Brett Stone-Gross, Thorsten Holz, Gianluca Stringhini, and Gio-
vanni Vigna. The underground economy of spam: A botmas-
ter’s perspective of coordinating large-scale spam campaigns. In
Proc. of USENIX Workshop on Large-Scale Exploits and Emer-
gent Threats (LEET), 2011.
[SK12] Masashi Sugiyama and Motoaki Kawanabe. Machine learning
in non-stationary environments: Introduction to covariate shift
adaptation. MIT press, 2012.
[SPD+09] Qinfeng Shi, James Petterson, Gideon Dror, John Langford,
Alex Smola, and S.V.N. Vishwanathan. Hash kernels for struc-
tured data. J. Mach. Learn. Res., 10:2615–2637, December
2009.
[SPST+01] B. Schölkopf, J. Platt, J. Shawe-Taylor, A.J. Smola, and R.C.
Williamson. Estimating the support of a high-dimensional dis-
tribution. Neural Computation, 13(7):1443–1471, 2001.
146
[SR11] Guido Schwenk and Konrad Rieck. Adaptive detection of covert
communication in http requests. In Computer Network Defense
(EC2ND), 2011 Seventh European Conference on, pages 25–32.
IEEE, 2011.
[SRS05a] Sören Sonnenburg, Gunnar Rätsch, and Christin Schäfer. Learn-
ing interpretable svms for biological sequence classification. In
Annual International Conference on Research in Computational
Molecular Biology, pages 389–407. Springer, 2005.
[SRS05b] Sören Sonnenburg, Gunnar Rätsch, and Bernhard Schölkopf.
Large scale genomic sequence svm classifiers. In Proceedings
of the 22nd international conference on Machine learning, pages
848–855. ACM, 2005.
[SS02] B. Schölkopf and A.J. Smola. Learning with Kernels. MIT Press,
Cambridge, MA, 2002.
[SS06] Christian Seifert and Ramon Steenson. Capture honeypot
client (Capture-HPC). Victoria University of Wellington, NZ,
https://projects.honeynet.org/capture-hpc,
2006.
[SS11] M. Ben Salem and S. J. Stolfo. Decoy document deployment for
effective masquerade attack detection. In Detection of Intrusions
and Malware & Vulnerability Assessment (DIMVA), pages 35–
54, 2011.
[STC04] John Shawe-Taylor and Nello Cristianini. Kernel methods for
pattern analysis. Cambridge university press, 2004.
[SWS+00] Bernhard Schölkopf, Robert C Williamson, Alex J Smola, John
Shawe-Taylor, and John C Platt. Support vector method for nov-
elty detection. In Advances in neural information processing
systems, pages 582–588, 2000.
[Sym11] Symantec Internet Security Threat Report: Trends for 2010. Vol.
16, Symantec, Inc., 2011.
[TD99] D.M.J. Tax and R.P.W. Duin. Support vector domain description.
Pattern Recognition Letters, 20(11–13):1191–1199, 1999.
[TD04] David MJ Tax and Robert PW Duin. Support vector data de-
scription. Machine learning, 54(1):45–66, 2004.
[TK02] Simon Tong and Daphne Koller. Support vector machine active
learning with applications to text classification. J. Mach. Learn.
Res., 2:45–66, March 2002.
147
[TLLP11] Julie D Thompson, Benjamin Linard, Odile Lecompte, and
Olivier Poch. A comprehensive benchmark study of multiple
sequence alignment methods: current challenges and future per-
spectives. PloS one, 6(3):e18093, 2011.
[TVLRD17] Niek Tax, Ilya Verenich, Marcello La Rosa, and Marlon Du-
mas. Predictive business process monitoring with lstm neural
networks. In International Conference on Advanced Informa-
tion Systems Engineering, pages 477–492. Springer, 2017.
[ULD16] Merve Unuvar, Geetika T Lakshmanan, and Yurdaer N Do-
ganata. Leveraging path information to generate predictions for
parallel business processes. Knowledge and Information Sys-
tems, 47(2):433–461, 2016.
[VBMKM09] Paul Von Bünau, Frank C Meinecke, Franz C Király, and Klaus-
Robert Müller. Finding stationary subspaces in multivariate time
series. Physical review letters, 103(21):214101, 2009.
[VDAADM+11] Wil Van Der Aalst, Arya Adriansyah, Ana Karla Alves
De Medeiros, Franco Arcieri, Thomas Baier, Tobias Blickle, Ja-
gadeesh Chandra Bose, Peter Van Den Brand, Ronald Brandtjen,
Joos Buijs, et al. Process mining manifesto. In International
Conference on Business Process Management, pages 169–194.
Springer, 2011.
[VKMG17] Marina M-C Vidovic, Marius Kloft, Klaus-Robert Mueller,
and Nico Goernitz. Ml2motif—reliable extraction of discrim-
inative sequence motifs from learning machines. PloS one,
12(3):e0174392, 2017.
[WBC10] Sean Whalen, Matt Bishop, and James P Crutchfield. Hidden
markov models for automated protocol learning. In Security and
Privacy in Communication Networks, pages 415–428. Springer,
2010.
[WBH+09] Peter Wurzinger, Leyla Bilge, Thorsten Holz, Jan Goebel,
Christopher Kruegel, and Engin Kirda. Automatically gener-
ating models for botnet detection. In Proc. of European Sympo-
sium on Research in Computer Security (ESORICS), pages 232–
249, 2009.
[WBJ+06] Yi-Min Wang, Doug Beck, Xuxian Jiang, Roussi Roussev, Chad
Verbowski, Shuo Chen, and Samuel T. King. Automated web
patrol with strider honeymonkeys: Finding web sites that exploit
browser vulnerabilities. In Proc. of Network and Distributed Sys-
tem Security Symposium (NDSS), 2006.
148
[WCKK08] Gilbert Wondracek, Paolo Milani Comparetti, Christopher
Kruegel, and Engin Kirda. Automatic network protocol analy-
sis. In Proceedings of the 15th Annual Network and Distributed
System Security Symposium (NDSS’08), pages 1–18, 2008.
[WDL+09] Kilian Weinberger, Anirban Dasgupta, John Langford, Alex
Smola, and Josh Attenberg. Feature hashing for large scale mul-
titask learning. In Proceedings of the 26th Annual International
Conference on Machine Learning, ICML ’09, pages 1113–1120,
New York, NY, USA, 2009. ACM.
[WM12] Sida Wang and Christopher D Manning. Baselines and bigrams:
Simple, good sentiment and topic classification. In Proceedings
of the 50th Annual Meeting of the Association for Computational
Linguistics: Short Papers-Volume 2, pages 90–94. Association
for Computational Linguistics, 2012.
[WPS06] Ke Wang, Janak J Parekh, and Salvatore J Stolfo. Anagram:
A content anomaly detector resistant to mimicry attack. In Re-
cent Advances in Intrusion Detection, pages 226–248. Springer,
2006.
[WS04] Ke Wang and Salvatore J Stolfo. Anomalous payload-based net-
work intrusion detection. In Recent Advances in Intrusion De-
tection, pages 203–222. Springer, 2004.
[WSAR13] Christian Wressnegger, Guido Schwenk, Daniel Arp, and Kon-
rad Rieck. A close look on n-grams in intrusion detection:
anomaly detection vs. classification. In Proceedings of the 2013
ACM workshop on Artificial intelligence and security, pages 67–
76. ACM, 2013.
[XPK10] Zhengzheng Xing, Jian Pei, and Eamonn Keogh. A brief survey
on sequence classification. ACM Sigkdd Explorations Newslet-
ter, 12(1):40–48, 2010.
[YHL12] Guo-Xun Yuan, Chia-Hua Ho, and Chih-Jen Lin. Recent ad-
vances of large-scale linear classification. Proceedings of the
IEEE, 100(9):2584–2603, 2012.
[YWGR13] Fabian Yamaguchi, Christian Wressnegger, Hugo Gascon, and
Konrad Rieck. Chucky: Exposing missing checks in source code
for vulnerability discovery. In Proceedings of the 20th ACM
Conference on Computer and Communications Security, pages
499–510. ACM, 2013.
[ZP00] Y. Zhang and V. Paxson. Detecting backdoors. In Proc. of
USENIX Security Symposium, pages 157–170, 2000.
149
[ZSSL11] Junjie Zhang, Christian Seifert, Jack W. Stokes, and Wenke Lee.
ARROW: GenerAting SignatuRes to Detect DRive-By DOWn-
loads. In Proc. of the International World Wide Web Conference
(WWW), 2011.
150