
https://doi.org/10.3758/s13428-022-01844-1
A machine learning-based procedure for leveraging clickstream
data to investigate early predictability of failure on interactive tasks
Esther Ulitzsch1 · Vincent Ulitzsch2 · Qiwei He3 · Oliver Lüdtke1,4
Accepted: 14 March 2022
©The Author(s) 2022
Abstract
Early detection of risk of failure on interactive tasks comes with great potential for better understanding how examinees
differ in their initial behavior as well as for adaptively tailoring interactive tasks to examinees’ competence levels. Drawing
on procedures originating in shopper intent prediction on e-commerce platforms, we introduce and showcase a machine
learning-based procedure that leverages early-window clickstream data for systematically investigating early predictability
of behavioral outcomes on interactive tasks. We derive features related to the occurrence, frequency, sequentiality, and
timing of performed actions from early-window clickstreams and use extreme gradient boosting for classification. Multiple
measures are suggested to evaluate the quality and utility of early predictions. The procedure is outlined by investigating
early predictability of failure on two PIAAC 2012 Problem Solving in Technology Rich Environments (PSTRE) tasks. We
investigated early windows of varying size in terms of time and in terms of actions. We achieved good prediction performance
at stages where examinees had, on average, at least two thirds of their solution process ahead of them, and the vast majority
of examinees who failed could potentially be detected to be at risk before completing the task. In-depth analyses revealed
different features to be indicative of success and failure at different stages of the solution process, thereby highlighting the
potential of the applied procedure for gaining a finer-grained understanding of the trajectories of behavioral patterns on
interactive tasks.
Keywords Interactive tasks · Early prediction · Extreme gradient boosting · Time-stamped action sequences ·
Clickstreams · PIAAC
Supplemental online materials for this article can be found in the
OSF and are available via the following link: https://osf.io/7gcfd
Esther Ulitzsch
ulitzsch@leibniz-ipn.de
1 IPN – Leibniz Institute for Science and Mathematics
Education, Educational Measurement, Olshausenstraße 62,
24118 Kiel, Germany
2 Technical University Berlin, Berlin, Germany
3 Educational Testing Service, Princeton, NJ, USA
4 Center for International Student Assessment,
Munich, Germany
Introduction
Interactive tasks mirror dynamic, real-life environments,
aiming at a more realistic assessment of what examinees
know and can do. Prominent examples for these environments
are the simulated email, web pages, and spreadsheet
environments employed in the Programme for the International
Assessment of Adult Competencies (PIAAC; OECD,
2013) to measure problem solving in technology-rich envi-
ronments (PSTRE), or the interactive problem-solving tasks
administered in the Programme for International Student
Assessment 2012 (PISA; OECD, 2014). Being computer-
administered, assessments using interactive tasks support
logging clickstream data in the form of time-stamped action
sequences, documenting the type, order, and timing of
the actions examinees executed when trying to solve the
given tasks. This rich source of additional data comes with
great potential for a nuanced understanding of response
processes and allows researchers to move from investigating whether
to how examinees solved a task (Greiff, Wüstenberg, &
Avvisati, 2015), for instance, by identifying typical strate-
gies (e.g. He, Borgonovi, & Paccagnella, 2021; Ulitzsch
et al., 2021b; Vista, Care, & Awwal, 2017; Wang, Tang, Liu,
& Ying, 2020; Zhu, Shu, & von Davier, 2016) or investi-
gating which behavioral patterns distinguish success from
failure on a task (e.g. Han, He, & von Davier, 2019; He &
von Davier, 2015; Qiao & Jiao, 2018; Salles, Dos Santos, &
Keskpaik, 2020).
Published online: 1 June 2022
Behavior Research Methods (2023) 55:1392–1412
In this study, we introduce a procedure for systematically
investigating whether and how early performed actions as
well as the time required for their execution already contain
sufficient information for predicting the outcome of exami-
nees’ behavioral trajectories, that is, success or failure, and
for identifying examinees at risk of failure before they com-
plete the task. To this end, we make use of early-window
clickstream data, i.e., time-stamped action sequences com-
prising only initially performed actions and the associated
time stamps. We consider predictions to be useful if accurate
predictions can be achieved at stages where the major-
ity of examinees have the greater part of their solution
process still ahead of them and the majority of exami-
nees who failed could potentially be detected to be at risk
before completing the task. Investigating early predictability
comes with great potential for a finer-grained understand-
ing of how examinees approach interactive tasks and may
potentially aid in improving the testing procedure. More
specifically, first, investigating early-window clickstream
data may improve our understanding of behavioral pat-
terns of early interactions with interactive tasks (e.g., initial
exploration or planning behavior) that distinguish behav-
ioral trajectories of examinees succeeding or failing on a
task. This knowledge can then be used to refine theories
on test-taking behavior or be employed in interventions that
aid students in improving their skills for initial exploration
of complex problem-solving tasks. Second, such analyses
support investigating whether it is possible to dynamically
track examinees’ risk of failure as they interact with the
task. Once risk of failure can reliably be inferred from early
interactions, this knowledge may—when combined with a
good understanding of the sources of failure—be put into
action by providing early support in real time such as hints
or reformulations of the task that may aid examinees at risk
of failing to successfully complete the task.
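The notion of early-window clickstream data can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the event representation (seconds since task onset paired with an action label), the action labels, and the function names are hypothetical and not taken from the PIAAC log files. It shows the two kinds of early windows considered in this study, one defined by the number of performed actions and one defined by elapsed time.

```python
from typing import List, Tuple

# One logged event: (seconds since task onset, action label).
# This representation is an assumption made for illustration.
Event = Tuple[float, str]


def window_by_actions(events: List[Event], k: int) -> List[Event]:
    """Keep only the first k performed actions."""
    return events[:k]


def window_by_time(events: List[Event], t_max: float) -> List[Event]:
    """Keep only actions performed within the first t_max seconds."""
    return [event for event in events if event[0] <= t_max]


# Hypothetical clickstream; the action labels are invented.
clickstream = [
    (0.0, "START"),
    (4.2, "OPEN_EMAIL"),
    (9.8, "CLICK_LINK"),
    (21.5, "OPEN_FORM"),
    (40.1, "SUBMIT"),
]

early_actions = window_by_actions(clickstream, 3)
early_time = window_by_time(clickstream, 10.0)
```

Whether an action logged exactly at the window boundary belongs to the window is a design choice; the sketch includes it.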
Although rarely encountered in the context of interactive
tasks, the objective of predicting behavioral outcomes from
early-window clickstream data is not unknown in the behav-
ioral sciences and has been successfully addressed in various
applications, ranging from predicting grades or dropout from
early uses of online learning management systems (e.g.
Baker, Lindrum, Lindrum, & Perkowski, 2015; Lykourent-
zou, Giannoukos, Nikolopoulos, Mpardis, & Loumos,
2009; Mongkhonvanit, Kanopka, & Lang, 2019; White-
hill, Williams, Lopez, Coleman, & Reich, 2015) to pre-
dicting purchase events from early browsing behavior on
e-commerce platforms (e.g. Awalkar, Ahmed, & Nevrekar,
2016; Hatt & Feuerriegel, 2020; Requena, Cassani, Tagli-
abue, Greco, & Lacasa, 2020; Toth, Tan, Di Fabbrizio, &
Datta, 2017). In the present study, we build on these pre-
viously applied procedures for early-window clickstream
data and explore whether and how they can be adapted to
the context of early prediction of behavioral outcomes on
interactive tasks in general and failure in particular.
In what follows, we first review previous research
on using process data to better understand behavioral
patterns differentiating correct from incorrect responses.
Subsequently, we provide a short overview of approaches
to early prediction of shopper intent on e-commerce
websites. We then use these approaches as a blueprint and
starting point for introducing a procedure for systematically
investigating early predictability of behavioral outcomes on
interactive tasks. The procedure is outlined by assessing
early predictability of failure on two tasks from the
PIAAC PSTRE domain. Finally, we discuss implications
and identify potentials for future work.
Using clickstream data to differentiate correct
from incorrect responses
Providing a rich description of how examinees attempted
the administered tasks, clickstream data from computer-
based interactive tasks have recently gained much attention
in psychometrics, psychology, and educational sciences.
Within this stream of research, both theory-driven and
exploratory approaches to investigating behavioral patterns
related to success and failure on interactive tasks emerged.
Herein, however, the predominant aim has been to
investigate behavioral patterns rather than to predict
behavioral outcomes.
Theory-driven approaches Theory-driven approaches com-
monly aim at corroborating theories on solution and test-
taking behavior. Based on subject-matter theory, click-
stream data are used for the construction of behavioral
indicators. Examples for such indicators are the applica-
tion of specific strategies (such as vary-one-thing-at-a-time,
VOTAT; Greiff et al., 2015; or other expert-defined strate-
gies as in Hao, Shu, & von Davier, 2015; He, Borgonovi, &
Paccagnella, 2019), the degree of automation of procedural
knowledge as indicated by the time spent on automatable
subtasks (e.g., drag-and-drop events; Stelter, Goldhammer,
Naumann, & Rölke, 2015), planning behavior as indicated,
e.g., by the time required for performing the first action
(Albert & Steinberg, 2011; Eichmann, Goldhammer, Greiff,
Pucite, & Naumann, 2019), or disengaged behavior as indi-
cated by short times spent on task and few actions (Sahin
& Colvin, 2020). Subsequently, these behavioral indica-
tors can be related to performance in order to investigate
whether the considered behaviors are related to successful
task completion as hypothesized.
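Two of the indicators mentioned above, the time required for the first action and disengagement as indicated by short time on task and few actions, can be sketched as follows. This is a minimal illustration, not the operationalization used in the cited studies: the thresholds, the function names, and the approximation of time on task by the time stamp of the last logged action are all assumptions.

```python
from typing import List, Optional, Tuple

Event = Tuple[float, str]  # (seconds since task onset, action label)


def time_to_first_action(events: List[Event]) -> Optional[float]:
    """Seconds elapsed before the first logged action (None if nothing was logged)."""
    return events[0][0] if events else None


def looks_disengaged(events: List[Event],
                     max_time: float = 15.0,
                     max_actions: int = 3) -> bool:
    """Flag short time on task combined with few actions.

    Both thresholds are hypothetical; time on task is approximated
    by the time stamp of the last logged action.
    """
    if not events:
        return True
    time_on_task = events[-1][0]
    return time_on_task <= max_time and len(events) <= max_actions


# Hypothetical clickstreams with invented action labels.
engaged = [(12.4, "OPEN_EMAIL"), (30.0, "CLICK_LINK"), (55.3, "NEXT")]
rapid = [(2.1, "NEXT")]
```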
Applications of theory-driven approaches have markedly
deepened the understanding and refined theories of test-
taking behavior on interactive tasks. For predicting the
outcomes of behavioral trajectories, however, purely theory-
driven approaches are limited. First, for the construction
of theory-derived indicators, clickstreams are scanned for
occurrences of specific strategies. Hence, when prediction
rather than corroborating theories is the primary research
objective, potentially useful information is discarded.
Second, some of these indicators may be constructed only
on the basis of longer sequences and/or when the solution
process is already at more advanced stages, such that the
behavioral patterns used for indicator construction may not
often be encountered in early-window clickstream data.
VOTAT, for instance, is a complex strategy that manifests
itself in sequences of actions that may occur only in
later stages of the solution process when examinees have
acquainted themselves with the task environment.
Exploratory approaches In recent years, a plethora of
exploratory approaches to identifying features distinguish-
ing correct from incorrect clickstreams has been developed
and applied (Chen, Li, Liu, & Ying, 2019; Han et al., 2019;
He & von Davier, 2016; Qiao & Jiao, 2018; Salles et al.,
2020; Ulitzsch, He, & Pohl, 2021a). Features derived from
clickstreams comprise generic features commonly used in
sequence mining or natural language processing (e.g., n-grams
as in He & von Davier, 2015, 2016; Liao, He, & Jiao,
2019; Ulitzsch et al., 2021a), task-specific features, created
based on subject-matter knowledge on behavioral patterns
to be expected on the task (Chen et al., 2019; Salles et al.,
2020), or a combination of the two (Qiao & Jiao, 2018; Han
et al., 2019). These features are then fed to classifiers or
prediction models, or analyzed using sequence mining tech-
niques to identify features that best distinguish correct from
incorrect clickstreams.
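As an illustration of the generic features mentioned above, the following sketch counts n-grams, i.e., contiguous subsequences of n actions, in an action sequence. The action labels and the function name are hypothetical, and the cited studies may additionally weight or select n-grams, which is not shown here.

```python
from collections import Counter
from typing import List, Tuple


def ngram_counts(actions: List[str], n: int) -> Counter:
    """Count all contiguous subsequences of n actions."""
    return Counter(
        tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)
    )


# Hypothetical action sequence; the labels are invented.
actions = ["START", "OPEN_EMAIL", "CLICK_LINK", "OPEN_EMAIL", "CLICK_LINK"]

unigrams = ngram_counts(actions, 1)
bigrams = ngram_counts(actions, 2)
```

The resulting counts can serve as columns of a feature matrix, with one row per examinee and one column per n-gram observed in the data.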
Note that commonly the objective of such approaches is
not prediction but rather to better understand examinees’
attempts to solve the administered tasks by uncovering
key behavioral patterns that distinguish success from
failure. Aimed at gaining insights on the whole solution
process, these approaches leverage the whole of information
contained in collected clickstreams—from opening the task
to proceeding to the next one. As the actions performed on
interactive tasks are an inherent part of the solution process,
correct and incorrect clickstreams have been found to be
well distinguishable. For a PISA 2012 problem-solving
task, for instance, Qiao and Jiao (2018) reported specificity
and sensitivity of more than .90 for various classifiers
being fed n-grams extracted from action sequences.
Analyzing an interactive math item from the French Cycle
des Évaluations Disciplinaires Réalisées sur Échantillons
(subject-related sample-based assessment cycle; CEDRE),
Salles et al. (2020) obtained an area under the receiver
operating characteristic curve (AUC ROC) value of .78
from random forest analyses using theory-derived, task-
specific features. Such good performance, however, may not
necessarily be achievable for predictions based on early-
window clickstream data, which are the focus of the present
study. First, behavioral patterns distinguishing success from
failure may be encountered only at later stages of the
solution process, while differences in the very first actions,
stemming, for instance, from initial exploration behavior,
may be less pronounced. Second, information contained in
early-window clickstream data from interactive tasks may
be rather sparse. For instance, across the 14 tasks of the
PIAAC 2012 domain, average sequence length ranged from
10.8 to 96.9 (Tang, Wang, Liu, & Ying, 2020b). If we were
to predict outcomes of behavioral trajectories after what
would, on average, be the middle of the solution process,
on some tasks, predictions would need to be made on the
basis of as few as five actions and the associated timing
information.
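The performance measures referred to above can be computed from true outcomes and classifier output as sketched below. The sketch assumes that failure is coded as the positive class (1) and uses invented labels and scores; AUC ROC is computed through its interpretation as the probability that a randomly drawn positive case receives a higher score than a randomly drawn negative case, with ties counted as one half.

```python
from typing import List, Tuple


def sensitivity_specificity(y_true: List[int],
                            y_pred: List[int]) -> Tuple[float, float]:
    """Sensitivity (true positive rate) and specificity (true negative rate)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc_roc(y_true: List[int], scores: List[float]) -> float:
    """AUC ROC as the share of positive-negative pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example: 1 = failure (assumed positive class), 0 = success.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.7, 0.4, 0.5, 0.2, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # hypothetical cutoff of .5

sens, spec = sensitivity_specificity(y_true, y_pred)
auc = auc_roc(y_true, scores)
```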
Predictive approaches So far, the predominant goal of anal-
yses of clickstream data has been to gain a better under-
standing of behavioral patterns rather than making pre-
dictions. Nevertheless, just recently, predictive approaches
started to emerge.
Tang, Wang, He, Liu, and Ying (2020a) investigated
whether action sequence data from one PIAAC PSTRE
task can predict performance on another one. To that end,
the authors determined the discrepancy between action
sequences from each PIAAC PSTRE task by drawing on a
dissimilarity measure originating from clickstream analysis
and subsequently extracted item-specific latent features via
multidimensional scaling. Using logistic regression, the
authors then investigated whether features derived from one
task can predict success or failure on another one over
and above performance on the predicting task. For most
of the item pairs, Tang et al. (2020a) reported a marked
improvement in prediction accuracy when features were
included, highlighting the vast potential of information
contained in sequence data for predicting the performance
of examinees.
Chen et al. (2019) proposed a model-based approach for
dynamic prediction of behavioral outcomes. The authors
proposed to include features as time-varying covariates
in an event history model, which at any given time of
the solution process can be used to predict outcomes of
the solution process, i.e., success or failure as well as
time spent on the task. Their study is an important con-
tribution as it showcased and initiated the discussion on
the utility of clickstream data for dynamic predictions
of behavioral outcomes on interactive tasks. Nevertheless,
Chen et al. (2019) critically remarked that although employ-
ing a prediction model rather than using black-box machine
learning methods allows retrieving interpretable parame-
ters, it comes at the price of strong assumptions on data-
generating processes which, given the complexity of click-
stream data, renders the model likely to “not most closely
approximate the data-generating process” (Chen et al.,
2019, p. 4), potentially yielding biased predictions. Among
others, these assumptions concern the functional form of
the relationship between considered features and behav-
ioral outcomes. Further, regression weights are assumed to
be time-invariant, implying that the considered features are
equally predictive at different stages of the solution process.
This need not be the case. Actions related to
task exploration, for instance, may be positively related to
success at early stages, capturing examinees’ willingness to
thoroughly explore the task environment, but may be indicative
of risk of failure at later stages of the solution process,
when such actions are no longer beneficial for successful
task completion. Analyzing a PISA 2012 problem-solving
task, Chen et al. (2019) retrieved a satisfactory AUC ROC
value of .72 only at later stages of the solution process when
the median time spent on the task had already passed, which
may be considered as a benchmark for subsequent studies.
Using early-window clickstream data for shopper
intent prediction
In fields where clickstream data is a more established
source of behavioral data, predicting behavioral outcomes
from early-window clickstream data is a common problem
statement. In the present study, we turn our attention to
procedures employed in the context of predicting behavioral
outcomes based on clickstream data from e-commerce
websites. In this vein of research, clickstream data is
commonly used for predicting whether users are at risk
for leaving the page without purchases (see Awalkar
et al., 2016; Bertsimas, Mersereau, & Patel, 2003; Hatt
& Feuerriegel, 2020; Requena et al., 2020; Toth et al.,
2017, for examples). Early detection of such risks may
trigger automated interventions, such as offering discounts
that may nudge customers into purchasing. To that end,
a plethora of supervised classifiers has been employed,
ranging from predictive models for sequential data such as
hidden Markov models (as in Hatt & Feuerriegel, 2020)
or recurrent neural networks (as in Toth et al., 2017) to
classifiers trained on features derived from clickstream
data such as extreme gradient boosting or support vector
machines (as in Requena et al., 2020). Features considered
comprise information on the action level such as uni- and
bigrams (Requena et al., 2020), aggregates such as the
number of performed clicks or the maximum time elapsed
between subsequent clicks as well as metadata such as the
day of the week when the session was initiated (Awalkar
et al., 2016). Research on predictions of behavioral
outcomes has repeatedly demonstrated that clickstream data
is well suited for making accurate predictions at relatively
early points in time based on rather sparse data.
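The aggregate and metadata features named above can be sketched as follows. The function and key names are hypothetical, and the time stamps are invented; the sketch only illustrates the number of performed clicks, the maximum time elapsed between subsequent clicks, and the day of the week on which the session was initiated.

```python
from datetime import datetime
from typing import Dict, List


def aggregate_features(timestamps: List[datetime]) -> Dict[str, float]:
    """Aggregate features from the click time stamps of one session."""
    if not timestamps:
        raise ValueError("a session needs at least one click")
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(timestamps, timestamps[1:])]
    return {
        "n_clicks": len(timestamps),
        "max_gap_seconds": max(gaps, default=0.0),
        # Monday is coded 0 and Sunday 6 (Python's datetime convention).
        "weekday": timestamps[0].weekday(),
    }


# Invented session that starts on Wednesday, 1 June 2022.
session = [
    datetime(2022, 6, 1, 10, 0, 0),
    datetime(2022, 6, 1, 10, 0, 45),
    datetime(2022, 6, 1, 10, 1, 5),
]

features = aggregate_features(session)
```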
Data structures from e-commerce websites can be
expected to resemble those encountered in interactive tasks,
rendering it worthwhile to investigate whether procedures
applied in the context of e-commerce also perform well in
the context of interactive tasks. First, interactive tasks such
as those employed in the PIAAC PSTRE domain oftentimes
mirror interfaces of web applications to evoke real-life
problem-solving behavior. Second, clickstreams from e-
commerce websites tend to be rather short. Requena et al.
(2020), for instance, based their analyses of shopper intent
prediction on browsing sessions with action sequences
of length 5 to 155, closely resembling typical ranges
encountered in clickstream data from interactive tasks.
Across all 14 tasks of the PIAAC PSTRE domain, for
instance, the minimum action sequence length was 3 and
maximum action sequence length ranged from 51 to 398
(Zhang, Tang, He, Liu, & Ying, 2021).
Due to these resemblances in typical data structures, pro-
cedures employed for investigating the early predictability
of shopper intent offer a promising tool for investigating
the early predictability of failure or success on interactive
tasks. In the present study, we draw on and adapt procedures
that have recently been employed by Requena et al. (2020)
in their systematic and exhaustive study of early shopper
intent prediction. Requena et al. (2020) created multiple
subsets of action sequences that were trimmed to all but
those actions that fell into a given early window. Next,
the authors compared the performance of multiple machine
learning algorithms on these subsets to investigate at which
point early-window action sequences contained sufficient
information to achieve accurate predictions. Among others,
Requena et al. (2020) achieved good results with extreme
gradient boosting, where AUC ROC values exceeded .70 as
soon as early action sequences were of at least length seven.
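The bookkeeping behind this kind of analysis, namely trimming all sequences to a given early window and tracking a performance measure as the window grows, can be sketched as follows. Note the strong simplification: instead of a trained classifier such as extreme gradient boosting, the sketch uses a placeholder risk score (absence of a hypothetical key action in the window). The sequences, labels, and names are invented and serve only to show the loop over window sizes.

```python
from typing import Dict, List

KEY_ACTION = "OPEN_FORM"  # hypothetical key action, invented for illustration


def auc_roc(labels: List[int], scores: List[float]) -> float:
    """Share of positive-negative pairs ranked correctly (ties count one half)."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def risk_score(window: List[str]) -> float:
    """Placeholder for a trained model: high risk if the key action is absent."""
    return 0.0 if KEY_ACTION in window else 1.0


def auc_by_window(sequences: List[List[str]],
                  labels: List[int],
                  window_sizes: List[int]) -> Dict[int, float]:
    """Trim every sequence to its first k actions and evaluate AUC per k."""
    result = {}
    for k in window_sizes:
        scores = [risk_score(seq[:k]) for seq in sequences]
        result[k] = auc_roc(labels, scores)
    return result


# Invented data: label 1 = failure, 0 = success.
sequences = [
    ["START", "OPEN_EMAIL", "CLOSE_EMAIL", "OPEN_EMAIL", "NEXT"],
    ["START", "OPEN_EMAIL", "CLICK_LINK", "NEXT"],
    ["START", "OPEN_FORM", "FILL_FIELD", "SUBMIT"],
    ["START", "OPEN_EMAIL", "CLICK_LINK", "OPEN_FORM", "FILL_FIELD", "SUBMIT"],
]
labels = [1, 1, 0, 0]

performance = auc_by_window(sequences, labels, [2, 4])
```

In an actual analysis, the placeholder score would be replaced by out-of-sample predictions of a classifier trained separately on the features of each early window.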
Objective and research questions
Adapting machine learning-based procedures originally
employed by Requena et al. (2020) for investigating early
predictability of shopper intent on e-commerce websites, the
present study introduces and showcases a procedure for the
systematic investigation of early predictability of behavioral
outcomes on interactive tasks in educational assessment.
When introducing the procedure, we suggest features that
may be derived from clickstream data from interactive tasks
as well as measures to be tracked that aid in evaluating
the quality and utility of early predictions. We outline the
procedure by investigating the potential of early-window
clickstream data for early prediction of risk of failure on two
PSTRE tasks from PIAAC 2012, addressing the following
research questions:
RQ1 Establishing a baseline: How well can customary
supervised classifiers on the basis of features con-
structed from complete clickstream data, capturing
the whole solution process, identify failure on the
task?
RQ2 Investigating the accuracy of early predictions: How
early in terms of a) the number of performed actions
as well as b) elapsed time can customary supervised
classifiers on the basis of features constructed from
early-window clickstream data accurately predict
failure on the task?
RQ3 Investigating feature importance: Which features
constructed from early-window clickstream data
display the highest predictive importance at different
phases of the solution process?
Materials and methods
Data
We made use of clickstream data from the items U23
(“Lamp Return”) and U02 (“Meeting Rooms”) from the
PIAAC 2012 PSTRE domain. In PIAAC 2012, problem-
solving items were administered with fixed positions and
without time limits. “Meeting Rooms” is located in the
middle of the second problem-solving cluster (PS2), while
“Lamp Return” is administered at the very end of PS2.
Hence, when approaching “Meeting Rooms” and “Lamp
Return”, examinees were already exposed to different
PIAAC PSTRE task environments and had the opportunity
to accumulate pre-familiarity with these environments. We
chose these items as they strongly differ in their difficulty
as well as in the amount of initial task exploration required
prior to performing key actions for solving the task, both
potentially impacting early predictability. Very difficult or
very easy items yield highly imbalanced data sets which
may challenge classifiers (see, e.g., Ruisen et al., 2018).
The amount of initial task exploration required prior to
performing key actions may impact how distinguishable
early-window clickstream data associated with success
or failure are because differences in initial exploration
behavior may be less pronounced and differences in
performing key actions for solving the task may emerge
only at later stages of the solution process.
“Lamp Return” involves both web page and email
environments and requires examinees to navigate through
an online lamp shop to complete an explicitly specified
consumer transaction. To that end, examinees have to
submit a request, retrieve an email message, and fill out
an online form. Examinees receive partial credit if at
least one of the fields of the online form is filled out
correctly. Figure 1 displays an example item with email
and web environments (from the Education and Skills
Online Assessment) that shares a comparable item interface
with the PIAAC item “Lamp Return”. “Meeting Rooms”
involves email, web, and word processor environments1
and requires examinees to navigate through emails, identify
relevant requests for meeting room reservations, and
subsequently submit these meeting room requests via a
simulated online reservation site. A conflict between one
request and the existing schedule presents an impasse to be
resolved.
“Lamp Return” and “Meeting Rooms” are located at
Proficiency Levels 2 and 3, respectively,2 and, with item
difficulties of 321 and 346, respectively, represent items of
medium and high difficulty (OECD, 2013). For getting to
and filling out the lamp return form, it is not necessary
to exhaustively explore the task’s environment. As such,
the item can be solved in a rather linear manner and
only requires a minimum of 17 actions (including actions
performed for filling out the return form) for receiving
full credit (He et al., 2021). Key actions required for
successful task completion can therefore be expected to be
commonly encountered in early-window clickstream data
associated with successful task completion. This is different
for “Meeting Rooms”, which requires examinees to seek
and integrate information from multiple environments
before filling out the meeting room reservation forms.
Due to the higher necessity of initial task environment
exploration, “Meeting Rooms” requires a minimum of 25
actions for receiving full credit (He et al., 2021). Initial
task exploration is likely to be non-linear, with examinees
switching between different environments to compare and
integrate the displayed information.3 Key actions required
for successfully submitting the reservation forms may
therefore be commonly encountered only at later stages
of the solution process. Based on these considerations, we
expected early predictability for “Lamp Return” to be less
challenging than for “Meeting Rooms”.
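The correspondence between the reported item difficulties and proficiency levels follows from the PSTRE level boundaries (below Level 1: 0–240, Level 1: 241–290, Level 2: 291–340, Level 3: 341–500). A minimal sketch of this mapping is given below; the function name is hypothetical, and assigning non-integer scores between two ranges to the higher level is an assumption.

```python
from bisect import bisect_left

# Upper bounds of "below Level 1", "Level 1", and "Level 2" on the PSTRE scale.
UPPER_BOUNDS = [240, 290, 340]
LEVELS = ["Below Level 1", "Level 1", "Level 2", "Level 3"]


def pstre_level(score: float) -> str:
    """Map a score on the 0-500 PSTRE scale to its proficiency level."""
    if not 0 <= score <= 500:
        raise ValueError("PSTRE scores range from 0 to 500")
    return LEVELS[bisect_left(UPPER_BOUNDS, score)]


lamp_return_level = pstre_level(321)
meeting_rooms_level = pstre_level(346)
```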
We analyzed clickstream data from examinees from
Ireland, Japan, the Netherlands, the United Kingdom,
and the United States who were administered “Lamp
1 Note that the word processor is an optional environment instead
of a compulsory one, designed to assist examinees to summarize
information extracted from the email requests.
2 The PSTRE performance is defined by four levels: below Level 1 (0–
240), Level 1 (241–290), Level 2 (291–340) and Level 3 (341–500).
For more details, refer to OECD (2013).
3 The manifold ways of how this initial task exploration manifests itself
in action sequences is, among others, reflected in the comparably low
similarity of sequences to expert-defined optimal strategies (He et al.,
2019).