A machine learning-based procedure for leveraging clickstream data to investigate early predictability of failure on interactive tasks [original]

https://doi.org/10.3758/s13428-022-01844-1

Esther Ulitzsch1 · Vincent Ulitzsch2 · Qiwei He3 · Oliver Lüdtke1,4

Accepted: 14 March 2022

© The Author(s) 2022

Abstract

Early detection of risk of failure on interactive tasks comes with great potential for better understanding how examinees

differ in their initial behavior as well as for adaptively tailoring interactive tasks to examinees’ competence levels. Drawing

on procedures originating in shopper intent prediction on e-commerce platforms, we introduce and showcase a machine

learning-based procedure that leverages early-window clickstream data for systematically investigating early predictability

of behavioral outcomes on interactive tasks. We derive features related to the occurrence, frequency, sequentiality, and

timing of performed actions from early-window clickstreams and use extreme gradient boosting for classification. Multiple

measures are suggested to evaluate the quality and utility of early predictions. The procedure is outlined by investigating

early predictability of failure on two PIAAC 2012 Problem Solving in Technology Rich Environments (PSTRE) tasks. We

investigated early windows of varying size in terms of time and in terms of actions. We achieved good prediction performance

at stages where examinees had, on average, at least two thirds of their solution process ahead of them, and the vast majority

of examinees who failed could potentially be detected to be at risk before completing the task. In-depth analyses revealed

different features to be indicative of success and failure at different stages of the solution process, thereby highlighting the

potential of the applied procedure for gaining a finer-grained understanding of the trajectories of behavioral patterns on

interactive tasks.

Keywords Interactive tasks · Early prediction · Extreme gradient boosting · Time-stamped action sequences · Clickstreams · PIAAC

Introduction

Supplemental online materials for this article can be found in the OSF and are available via the following link: https://osf.io/7gcfd

Esther Ulitzsch
ulitzsch@leibniz-ipn.de

1 IPN – Leibniz Institute for Science and Mathematics Education, Educational Measurement, Olshausenstraße 62, 24118 Kiel, Germany

2 Technical University Berlin, Berlin, Germany

3 Educational Testing Service, Princeton, NJ, USA

4 Center for International Student Assessment, Munich, Germany

Interactive tasks mirror dynamic, real-life environments,

aiming at a more realistic assessment of what examinees

know and can do. Prominent examples of these environments

are the simulated email, web pages, and spreadsheet

environments employed in the Programme for the International

Assessment of Adult Competencies (PIAAC; OECD,

2013) to measure problem solving in technology-rich envi-

ronments (PSTRE), or the interactive problem-solving tasks

administered in the Programme for International Student

Assessment 2012 (PISA; OECD, 2014). Being computer-

administered, assessments using interactive tasks support

logging clickstream data in the form of time-stamped action

sequences, documenting the type, order, and timing of

the actions examinees executed when trying to solve the

given tasks. This rich source of additional data comes with

great potential for a nuanced understanding of response

processes and allows moving from investigating whether

to how examinees solved a task (Greiff, Wüstenberg, &

Avvisati, 2015), for instance, by identifying typical strate-

gies (e.g. He, Borgonovi, & Paccagnella, 2021; Ulitzsch

et al., 2021b; Vista, Care, & Awwal, 2017; Wang, Tang, Liu,

& Ying, 2020; Zhu, Shu, & von Davier, 2016) or investigating

which behavioral patterns distinguish success from

failure on a task (e.g. Han, He, & von Davier, 2019; He &

von Davier, 2015; Qiao & Jiao, 2018; Salles, Dos Santos, &

Keskpaik, 2020).

Published online: 1 June 2022

Behavior Research Methods (2023) 55:1392–1412

In this study, we introduce a procedure for systematically

investigating whether and how early performed actions as

well as the time required for their execution already contain

sufficient information for predicting the outcome of exami-

nees’ behavioral trajectories, that is, success or failure, and

for identifying examinees at risk of failure before they com-

plete the task. To this end, we make use of early-window

clickstream data, i.e., time-stamped action sequences com-

prising only initially performed actions and the associated

time stamps. We consider predictions to be useful if accurate

predictions can be achieved at stages where the major-

ity of examinees have the greater part of their solution

process still ahead of them and the majority of exami-

nees who failed could potentially be detected to be at risk

before completing the task. Investigating early predictability

comes with great potential for a finer-grained understand-

ing of how examinees approach interactive tasks and may

potentially aid in improving the testing procedure. More

specifically, first, investigating early-window clickstream

data may improve our understanding of behavioral pat-

terns of early interactions with interactive tasks (e.g., initial

exploration or planning behavior) that distinguish behav-

ioral trajectories of examinees succeeding or failing on a

task. This knowledge can then be used to refine theories

on test-taking behavior or be employed in interventions that

aid students in improving their skills for initial exploration

of complex problem-solving tasks. Second, such analyses

support investigating whether it is possible to dynamically

track examinees’ risk of failure as they interact with the

task. Once risk of failure can reliably be inferred from early

interactions, this knowledge may—when combined with a

good understanding of the sources of failure—be put into

action by providing early support in real time such as hints

or reformulations of the task that may aid examinees at risk

of failing to successfully complete the task.

Although rarely encountered in the context of interactive

tasks, the objective of predicting behavioral outcomes from

early-window clickstream data is not unknown in the behav-

ioral sciences and has been successfully addressed in various

applications, ranging from predicting grades or dropout from

early uses of online learning management systems (e.g.

Baker, Lindrum, Lindrum, & Perkowski, 2015; Lykourent-

zou, Giannoukos, Nikolopoulos, Mpardis, & Loumos,

2009; Mongkhonvanit, Kanopka, & Lang, 2019; White-

hill, Williams, Lopez, Coleman, & Reich, 2015) to pre-

dicting purchase events from early browsing behavior on

e-commerce platforms (e.g. Awalkar, Ahmed, & Nevrekar,

2016; Hatt & Feuerriegel, 2020; Requena, Cassani, Tagli-

abue, Greco, & Lacasa, 2020; Toth, Tan, Di Fabbrizio, &

Datta, 2017). In the present study, we build on these pre-

viously applied procedures for early-window clickstream

data and explore whether and how they can be adapted to

the context of early prediction of behavioral outcomes on

interactive tasks in general and failure in particular.

In what follows, we first review previous research

on using process data to better understand behavioral

patterns differentiating correct from incorrect responses.

Subsequently, we provide a short overview of approaches

to early prediction of shopper intent on e-commerce

websites. We then use these approaches as a blueprint and

starting point for introducing a procedure for systematically

investigating early predictability of behavioral outcomes on

interactive tasks. The procedure is outlined by assessing

early predictability of failure on two tasks from the

PIAAC PSTRE domain. Finally, we discuss implications

and identify potentials for future work.

Using clickstream data to differentiate correct

from incorrect responses

Providing a rich description of how examinees attempted

the administered tasks, clickstream data from computer-

based interactive tasks have recently gained much attention

in psychometrics, psychology, and educational sciences.

Within this stream of research, both theory-driven and

exploratory approaches to investigating behavioral patterns

related to success and failure on interactive tasks emerged.

Herein, however, the predominant aim has been to

investigate behavioral patterns rather than to predict

behavioral outcomes.

Theory-driven approaches Theory-driven approaches com-

monly aim at corroborating theories on solution and test-

taking behavior. Based on subject-matter theory, click-

stream data are used for the construction of behavioral

indicators. Examples of such indicators are the application

of specific strategies (such as vary-one-thing-at-a-time,

VOTAT; Greiff et al., 2015; or other expert-defined strate-

gies as in Hao, Shu, & von Davier, 2015; He, Borgonovi, &

Paccagnella, 2019), the degree of automation of procedural

knowledge as indicated by the time spent on automatable

subtasks (e.g., drag-and-drop events; Stelter, Goldhammer,

Naumann, & Rölke, 2015), planning behavior as indicated,

e.g., by the time required for performing the first action

(Albert & Steinberg, 2011; Eichmann, Goldhammer, Greiff,

Pucite, & Naumann, 2019), or disengaged behavior as indi-

cated by short times spent on task and few actions (Sahin

& Colvin, 2020). Subsequently, these behavioral indica-

tors can be related to performance in order to investigate

whether the considered behaviors are related to successful

task completion as hypothesized.

Applications of theory-driven approaches have markedly

deepened the understanding and refined theories of test-

taking behavior on interactive tasks. For predicting the

outcomes of behavioral trajectories, however, purely theory-

driven approaches are limited. First, for the construction

of theory-derived indicators, clickstreams are scanned for

occurrences of specific strategies. Hence, when prediction

rather than corroborating theories is the primary research

objective, potentially useful information is discarded.

Second, some of these indicators may be constructed only

on the basis of longer sequences and/or when the solution

process is already at more advanced stages, such that the

behavioral patterns used for indicator construction may not

often be encountered in early-window clickstream data.

VOTAT, for instance, is a complex strategy that manifests

itself in sequences of actions that may occur only in

later stages of the solution process when examinees have

acquainted themselves with the task environment.

Exploratory approaches In recent years, a plethora of

exploratory approaches to identifying features distinguish-

ing correct from incorrect clickstreams has been developed

and applied (Chen, Li, Liu, & Ying, 2019; Han et al., 2019;

He & von Davier, 2016; Qiao & Jiao, 2018; Salles et al.,

2020; Ulitzsch, He, & Pohl, 2021a). Features derived from

clickstreams comprise generic features commonly used in

sequence mining or natural language processing (e.g.,

n-grams as in He & von Davier, 2015, 2016; Liao, He, & Jiao,

2019; Ulitzsch et al., 2021a), task-specific features, created

based on subject-matter knowledge on behavioral patterns

to be expected on the task (Chen et al., 2019; Salles et al.,

2020), or a combination of the two (Qiao & Jiao, 2018; Han

et al., 2019). These features are then fed to classifiers or

prediction models, or analyzed using sequence mining tech-

niques to identify features that best distinguish correct from

incorrect clickstreams.
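The generic n-gram features mentioned above can be illustrated with a minimal sketch. It assumes an action sequence is simply a list of action labels; the labels and the helper name `ngram_counts` are hypothetical and are not taken from any of the cited studies.

```python
from collections import Counter
from typing import Sequence


def ngram_counts(actions: Sequence[str], n: int) -> Counter:
    """Count contiguous n-grams (as tuples) in one action sequence."""
    return Counter(
        tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)
    )


# Hypothetical action labels, for illustration only.
sequence = ["start", "open_email", "click_link", "open_email", "click_link"]

unigrams = ngram_counts(sequence, 1)  # how often each single action occurs
bigrams = ngram_counts(sequence, 2)   # how often each ordered pair occurs
```

Counts of this kind can be arranged into one row per examinee and passed to a classifier; how the cited studies weight or select n-grams differs and is not reproduced here.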

Note that commonly the objective of such approaches is

not prediction but rather to better understand examinees’

attempts to solve the administered tasks by uncovering

key behavioral patterns that distinguish success from

failure. Aimed at gaining insights on the whole solution

process, these approaches leverage the whole of information

contained in collected clickstreams—from opening the task

to proceeding to the next one. As the actions performed on

interactive tasks are an inherent part of the solution process,

correct and incorrect clickstreams have been found to be

well distinguishable. For a PISA 2012 problem-solving

task, for instance, Qiao and Jiao (2018) reported specificity

and sensitivity of more than .90 for various classifiers

being fed n-grams extracted from action sequences.

Analyzing an interactive math item from the French Cycle

des Évaluations Disciplinaires Réalisées sur Échantillons

(subject-related sample-based assessment cycle; CEDRE),

Salles et al. (2020) obtained an area under the receiver

operating characteristic curve (AUC ROC) value of .78

from random forest analyses using theory-derived, task-

specific features. Such good performance, however, may not

necessarily be achievable for predictions based on early-

window clickstream data, which are the focus of the present

study. First, behavioral patterns distinguishing success from

failure may be encountered only at later stages of the

solution process, while differences in the very first actions,

stemming, for instance, from initial exploration behavior,

may be less pronounced. Second, information contained in

early-window clickstream data from interactive tasks may

be rather sparse. For instance, across the 14 tasks of the

PIAAC 2012 domain, average sequence length ranged from

10.8 to 96.9 (Tang, Wang, Liu, & Ying, 2020b). If we were

to predict outcomes of behavioral trajectories after what

would, on average, be the middle of the solution process,

on some tasks, predictions would need to be made on the

basis of as few as five actions and the associated timing

information.
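For readers less familiar with the sensitivity and specificity values cited in this paragraph, the following sketch shows how both are obtained from binary labels and binary predictions. It assumes that 1 codes the class of interest and 0 the other class; the data are invented for illustration.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for labels coded 1 (positive) and 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must occur in y_true")
    return tp / (tp + fn), tn / (tn + fp)


# Invented labels and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

Sensitivity is the share of true positives that are detected, and specificity is the share of true negatives that are correctly left unflagged.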

Predictive approaches So far, the predominant goal of anal-

yses of clickstream data has been to gain a better under-

standing of behavioral patterns rather than making pre-

dictions. Nevertheless, just recently, predictive approaches

started to emerge.

Tang, Wang, He, Liu, and Ying (2020a) investigated

whether action sequence data from one PIAAC PSTRE

task can predict performance on another one. To that end,

the authors determined the discrepancy between action

sequences from each PIAAC PSTRE task by drawing on a

dissimilarity measure originating from clickstream analysis

and subsequently extracted item-specific latent features via

multidimensional scaling. Using logistic regression, the

authors then investigated whether features derived from one

task can predict success or failure on another one over

and above performance on the predicting task. For most

of the item pairs, Tang et al. (2020a) reported a marked

improvement in prediction accuracy when features were

included, highlighting the vast potential of information

contained in sequence data for predicting the performance

of examinees.

Chen et al. (2019) proposed a model-based approach for

dynamic prediction of behavioral outcomes. The authors

proposed to include features as time-varying covariates

in an event history model, which at any given time of

the solution process can be used to predict outcomes of

the solution process, i.e., success or failure as well as

time spent on the task. Their study is an important con-

tribution as it showcased and initiated the discussion on

the utility of clickstream data for dynamic predictions

of behavioral outcomes on interactive tasks. Nevertheless,

Chen et al. (2019) critically remarked that although employ-

ing a prediction model rather than using black-box machine

learning methods allows retrieving interpretable parame-

ters, it comes at the price of strong assumptions on data-

generating processes which, given the complexity of click-

stream data, renders the model likely to “not most closely

approximate the data-generating process” (Chen et al.,

2019, p. 4), potentially yielding biased predictions. Among

others, these assumptions concern the functional form of

the relationship between considered features and behav-

ioral outcomes. Further, regression weights are assumed to

be time-invariant, implying that the considered features are

equally predictive at different stages of the solution process.

This need not be the case. Actions related to

task exploration, for instance, may be positively related to

success at early stages, capturing examinees’ willingness to

thoroughly explore the task environment, but may be indicative

of risk of failure at later stages of the solution process,

when such actions are no longer beneficial for successful

task completion. Analyzing a PISA 2012 problem-solving

task, Chen et al. (2019) retrieved a satisfactory AUC ROC

value of .72 only at later stages of the solution process when

the median time spent on the task had already passed, which

may be considered as a benchmark for subsequent studies.
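The AUC ROC values used as benchmarks in this section can be read as the probability that a randomly drawn positive case receives a higher predicted score than a randomly drawn negative case, with ties counted as one half. The sketch below computes this quantity by direct pairwise comparison; it is an illustrative implementation with invented scores, not the computation used in the cited studies.

```python
from typing import Sequence


def auc_roc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("both classes must occur in y_true")
    # A positive case "wins" a pair if it is scored higher; ties count 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented labels (1 = class of interest) and predicted scores.
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
auc = auc_roc(y_true, scores)
```

A value of .50 corresponds to chance-level ranking and 1.00 to perfect separation. The pairwise loop is quadratic in the number of cases, which is acceptable for a sketch but not for large samples.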

Using early-window clickstream data for shopper

intent prediction

In fields where clickstream data is a more established

source of behavioral data, predicting behavioral outcomes

from early-window clickstream data is a common problem

statement. In the present study, we turn our attention to

procedures employed in the context of predicting behavioral

outcomes based on clickstream data from e-commerce

websites. In this vein of research, clickstream data is

commonly used for predicting whether users are at risk

for leaving the page without purchases (see Awalkar

et al., 2016; Bertsimas, Mersereau, & Patel, 2003; Hatt

& Feuerriegel, 2020; Requena et al., 2020; Toth et al.,

2017, for examples). Early detection of such risks may

trigger automated interventions, such as offering discounts

that may nudge customers into purchasing. To that end,

a plethora of supervised classifiers has been employed,

ranging from predictive models for sequential data such as

hidden Markov models (as in Hatt & Feuerriegel, 2020)

or recurrent neural networks (as in Toth et al., 2017) to

classifiers trained on features derived from clickstream

data such as extreme gradient boosting or support vector

machines (as in Requena et al., 2020). Features considered

comprise information on the action level such as uni- and

bigrams (Requena et al., 2020), aggregates such as the

number of performed clicks or the maximum time elapsed

between subsequent clicks as well as metadata such as the

day of the week when the session was initiated (Awalkar

et al., 2016). Research on predictions of behavioral

outcomes has repeatedly demonstrated that clickstream data

is well suited for making accurate predictions at relatively

early points in time based on rather sparse data.
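Two of the aggregate features named above, the number of performed clicks and the maximum time elapsed between subsequent clicks, can be sketched as follows. The sketch assumes that each session is available as an ascending list of click timestamps in seconds; the function name and the data are hypothetical.

```python
from typing import Dict, Sequence


def aggregate_features(timestamps: Sequence[float]) -> Dict[str, float]:
    """Summarize one session from its click timestamps (seconds, ascending)."""
    # Time elapsed between each click and the one that follows it.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    return {
        "n_clicks": len(timestamps),
        # A session with a single click has no gap; report 0.0 in that case.
        "max_gap": max(gaps, default=0.0),
    }


# Invented timestamps, for illustration only.
session = [0.0, 2.5, 3.0, 9.0]
features = aggregate_features(session)
```

Metadata such as the weekday on which a session started would be added as further entries in the same dictionary.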

Data structures from e-commerce websites can be

expected to resemble those encountered in interactive tasks,

rendering it worthwhile to investigate whether procedures

applied in the context of e-commerce also perform well in

the context of interactive tasks. First, interactive tasks such

as those employed in the PIAAC PSTRE domain oftentimes

mirror interfaces of web applications to evoke real-life

problem-solving behavior. Second, clickstreams from e-

commerce websites tend to be rather short. Requena et al.

(2020), for instance, based their analyses of shopper intent

prediction on browsing sessions with action sequences

of length 5 to 155, closely resembling typical ranges

encountered in clickstream data from interactive tasks.

Across all 14 tasks of the PIAAC PSTRE domain, for

instance, the minimum action sequence length was 3 and

maximum action sequence length ranged from 51 to 398

(Zhang, Tang, He, Liu, & Ying, 2021).

Due to these resemblances in typical data structures, pro-

cedures employed for investigating the early predictability

of shopper intent are a promising tool for investigating

the early predictability of failure or success on interactive

tasks. In the present study, we draw on and adapt procedures

that have recently been employed by Requena et al. (2020)

in their systematic and exhaustive study of early shopper

intent prediction. Requena et al. (2020) created multiple

subsets of action sequences that were trimmed to retain only

those actions that fell into a given early window. Next,

the authors compared the performance of multiple machine

learning algorithms on these subsets to investigate at which

point early-window action sequences contained sufficient

information to achieve accurate predictions. Among others,

Requena et al. (2020) achieved good results with extreme

gradient boosting, where AUC ROC values exceeded .70 as

soon as early action sequences were of at least length seven.
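The trimming step described above can be sketched as follows, assuming that a clickstream is stored as a list of (action, timestamp) pairs in chronological order. The window defined by a number of actions mirrors the procedure attributed to Requena et al. (2020); the time-based variant reflects the time windows investigated in the present study. Function names and data are hypothetical.

```python
from typing import List, Tuple

Event = Tuple[str, float]  # (action label, seconds since task start)


def window_by_actions(events: List[Event], k: int) -> List[Event]:
    """Keep only the first k actions of a chronologically ordered clickstream."""
    return events[:k]


def window_by_time(events: List[Event], t_max: float) -> List[Event]:
    """Keep only actions performed within the first t_max seconds."""
    return [event for event in events if event[1] <= t_max]


# Invented clickstream, for illustration only.
events = [
    ("start", 0.0),
    ("open_email", 4.2),
    ("click_link", 9.8),
    ("fill_form", 31.5),
    ("submit", 40.0),
]

early_by_actions = window_by_actions(events, 3)
early_by_time = window_by_time(events, 10.0)
```

Applying such a function with increasing window sizes yields the series of early-window subsets on which classifiers can be trained and compared.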

Objective and research questions

Adapting machine learning-based procedures originally

employed by Requena et al. (2020) for investigating early

predictability of shopper intent on e-commerce websites, the

present study introduces and showcases a procedure for the

systematic investigation of early predictability of behavioral

outcomes on interactive tasks in educational assessment.

When introducing the procedure, we suggest features that

may be derived from clickstream data from interactive tasks

as well as measures to be tracked that aid in evaluating

the quality and utility of early predictions. We outline the

procedure by investigating the potential of early-window

clickstream data for early prediction of risk of failure on two

PSTRE tasks from PIAAC 2012, addressing the following

research questions:

RQ1 Establishing a baseline: How well can customary

supervised classifiers on the basis of features con-

structed from complete clickstream data, capturing

the whole solution process, identify failure on the

task?

RQ2 Investigating the accuracy of early predictions: How

early in terms of a) the number of performed actions

as well as b) elapsed time can customary supervised

classifiers on the basis of features constructed from

early-window clickstream data accurately predict

failure on the task?

RQ3 Investigating feature importance: Which features

constructed from early-window clickstream data

display the highest predictive importance at different

phases of the solution process?

Materials and methods

Data

We made use of clickstream data from the items U23

(“Lamp Return”) and U02 (“Meeting Rooms”) from the

PIAAC 2012 PSTRE domain. In PIAAC 2012, problem-

solving items were administered with fixed positions and

without time limits. “Meeting Rooms” is located in the

middle of the second problem-solving cluster (PS2), while

“Lamp Return” is administered at the very end of PS2.

Hence, when approaching “Meeting Rooms” and “Lamp

Return”, examinees were already exposed to different

PIAAC PSTRE task environments and had the opportunity

to accumulate familiarity with these environments. We

chose these items as they strongly differ in their difficulty

as well as in the amount of initial task exploration required

prior to performing key actions for solving the task, both

potentially impacting early predictability. Very difficult or

very easy items yield highly imbalanced data sets which

may challenge classifiers (see, e.g., Ruisen et al., 2018).

The amount of initial task exploration required prior to

performing key actions may impact how distinguishable

early-window clickstream data associated with success

or failure are because differences in initial exploration

behavior may be less pronounced and differences in

performing key actions for solving the task may emerge

only at later stages of the solution process.

“Lamp Return” involves both web page and email

environments and requires examinees to navigate through

an online lamp shop to complete an explicitly specified

consumer transaction. To that end, examinees have to

submit a request, retrieve an email message, and fill out

an online form. Examinees receive partial credit if at

least one of the fields of the online form is filled out

correctly. Figure 1 displays an example item with email

and web environments (from the Education and Skills

Online Assessment) that shares a comparable item interface

with the PIAAC item “Lamp Return”. “Meeting Rooms”

involves email, web, and word processor environments1

and requires examinees to navigate through emails, identify

relevant requests for meeting room reservations, and

subsequently submit these meeting room requests via a

simulated online reservation site. A conflict between one

request and the existing schedule presents an impasse to be

resolved.

“Lamp Return” and “Meeting Rooms” are located at

Proficiency Levels 2 and 3, respectively,2 and, with item

difficulties of 321 and 346, respectively, constitute items of

medium and high difficulty (OECD, 2013). For getting to

and filling out the lamp return form, it is not necessary

to exhaustively explore the task’s environment. As such,

the item can be solved in a rather linear manner and

only requires a minimum of 17 actions (including actions

performed for filling out the return form) for receiving

full credit (He et al., 2021). Key actions required for

successful task completion can therefore be expected to be

commonly encountered in early-window clickstream data

associated with successful task completion. This is different

for “Meeting Rooms”, which requires examinees to seek

and integrate information from multiple environments

before filling out the meeting room reservation forms.

Due to the higher necessity of initial task environment

exploration, “Meeting Rooms” requires a minimum of 25

actions for receiving full credit (He et al., 2021). Initial

task exploration is likely to be non-linear, with examinees

switching between different environments to compare and

integrate the displayed information.3 Key actions required

for successfully submitting the reservation forms may

therefore be commonly encountered only at later stages

of the solution process. Based on these considerations, we

expected early predictability for “Lamp Return” to be less

challenging than for “Meeting Rooms”.
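The assignment of the two items to proficiency levels follows from the level bands given in the footnote on PSTRE performance levels. The sketch below encodes those bands; treating each upper bound as inclusive for non-integer values is an assumption made here for illustration, and the function name is hypothetical.

```python
def pstre_level(score: float) -> str:
    """Map a PSTRE scale value to its proficiency level band."""
    if not 0 <= score <= 500:
        raise ValueError("PSTRE scale values range from 0 to 500")
    if score <= 240:   # assumption: upper bounds are inclusive
        return "Below Level 1"
    if score <= 290:
        return "Level 1"
    if score <= 340:
        return "Level 2"
    return "Level 3"


# Item difficulties reported in the text.
lamp_return_level = pstre_level(321)
meeting_rooms_level = pstre_level(346)
```

With the reported difficulties, "Lamp Return" falls into Level 2 and "Meeting Rooms" into Level 3, matching the classification stated above.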

We analyzed clickstream data from examinees from

Ireland, Japan, the Netherlands, the United Kingdom,

and the United States who were administered “Lamp

1 Note that the word processor is an optional rather than a

compulsory environment, designed to help examinees summarize

information extracted from the email requests.

2 PSTRE performance is defined by four levels: below Level 1 (0–

240), Level 1 (241–290), Level 2 (291–340), and Level 3 (341–500).

For more details, refer to OECD (2013).

3 The manifold ways in which this initial task exploration manifests itself

in action sequences are, among others, reflected in the comparably low

similarity of sequences to expert-defined optimal strategies (He et al.,

2019).
