
psychometrika—vol. 86, no. 1, 190–214
March 2021
https://doi.org/10.1007/s11336-020-09743-0
COMBINING CLICKSTREAM ANALYSES AND GRAPH-MODELED DATA
CLUSTERING FOR IDENTIFYING COMMON RESPONSE PROCESSES
Esther Ulitzsch
IPN – LEIBNIZ INSTITUTE FOR SCIENCE AND MATHEMATICS EDUCATION
Qiwei He
EDUCATIONAL TESTING SERVICE
Vincent Ulitzsch,Hendrik Molter,André Nichterlein and
Rolf Niedermeier
TECHNISCHE UNIVERSITÄT BERLIN
Steffi Pohl
FREIE UNIVERSITÄT BERLIN
Complex interactive test items are becoming more widely used in assessments. Being computer-
administered, assessments using interactive items allow logging time-stamped action sequences. These
sequences pose a rich source of information that may facilitate investigating how examinees approach an
item and arrive at their given response. There is a rich body of research leveraging action sequence data
for investigating examinees’ behavior. However, the associated timing data have been considered mainly
on the item-level, if at all. Considering timing data on the action-level in addition to action sequences,
however, has vast potential to support a more fine-grained assessment of examinees’ behavior. We provide
an approach that jointly considers action sequences and action-level times for identifying common response
processes. In doing so, we integrate tools from clickstream analyses and graph-modeled data clustering
with psychometrics. In our approach, we (a) provide similarity measures that are based on both actions and
the associated action-level timing data and (b) subsequently employ cluster edge deletion for identifying
homogeneous, interpretable, well-separated groups of action patterns, each describing a common response
process. Guidelines on how to apply the approach are provided. The approach and its utility are illustrated
on a complex problem-solving item from PIAAC 2012.
Key words: action sequences, response times, complex problem solving, cluster editing.
1. Introduction
Interactive items in low-stakes large-scale assessments are designed to provide authentic
tasks and, as such, to better reflect what examinees know and are able to do than traditional test
items can (Goldhammer, Naumann, & Keßel, 2013). Such kind of items is used, for example,
in the Problem Solving in Technology-Rich Environments (PSTRE) domain in the Programme
for the International Assessment of Adult Competencies (PIAAC, OECD, 2013) and the collab-
orative problem solving domain in the Programme for International Student Assessment (PISA,
OECD, 2017). Understanding response processes to interactive tasks is paramount for assess-
ing whether these indeed capture the construct to be measured. As noted in the Standards for
Educational and Psychological Testing “construct interpretations oftentimes involve more or less
Correspondence should be made to Esther Ulitzsch, Educational Measurement, IPN – Leibniz Institute for Science
190
© 2021 The Author(s)

ESTHER ULITZSCH ET AL. 191
Time
0 1 2 3 4 5 6 7 8
4
D E
3
A B C
2
A D B E C
1
A B C
Figure 1.
Schematic representation of time-stamped action sequences for four hypothetical examinees
explicit assumptions about the cognitive processes engaged” (American Educational Research
Association, American Psychological Association, & National Council on Measurement in Edu-
cation and Joint Committee on Standards for Educational and Psychological Testing, 2014, p. 15).
Therefore, “theoretical and empirical analyses of the response processes” (American Educational
Research Association et al., 2014, p. 15) are recommended for assessing whether the response
processes applied by examinees fit with the interpretation of the construct to be measured.
Being computer-administered, assessments using interactive items allow logging time-
stamped action sequences. These sequences, illustrated schematically in Fig. 1, document both the
particular actions executed and the time required for their execution. Various approaches exist that
leverage action sequence data for investigating how examinees interact with interactive items (e.g.,
He & von Davier, 2015; Qiao & Jiao, 2018; Tang, Wang, He, Liu, & Ying, 2020; Tang, Wang, Liu,
& Ying, 2020). The associated timing data, however, have mainly been considered on an aggre-
gate level (e.g., time spent on task as opposed to time required for the single actions executed for
completing the task), if at all. Since differences in timing can be indicative of different underlying
cognitive processes even if the same actions are performed, considering action sequences jointly
with the time elapsed between the actions constituting the sequences has vast potential to support
a more fine-grained assessment of examinees’ interactions with interactive items. For instance, it
may support detecting parts of response processes that are more time consuming for examinees,
e.g., due to being cognitively more challenging.
To motivate the use of time-stamped action sequences for a more in-depth assessment of
response processes, we consider action patterns for Examinees 1 and 3 in Fig. 1, performing the
same action sequence within a comparable amount of time. However, while Examinee 1 executed
his or her first action rather quickly, Examinee 3 initially required more time but then performed
all actions in quick succession. These differences in time elapsing between actions may mark
different response processes. Examinee 3 might have spent long time for carefully planning how
to approach the task, while Examinee 1 might have planned on-the-go, resulting in a shorter time
to first action but longer time between subsequent actions required for planning the next step.
Such differences cannot be uncovered by solely considering action sequences or time spent on
the whole task but needs considering action-level timing data.
In this article, we aim at making use of the whole of information contained in time-stamped
action sequences and provide an approach that jointly considers action sequences and the corre-
sponding sequence of time elapsed between the actions for identifying common response pro-
cesses. For doing so, we combine data mining techniques originally developed for the analysis of
clickstream data with graph-modeled data clustering.

192 PSYCHOMETRIKA
Theremainderofthisarticleisstructuredas follows: First, we provideanoverviewof previous
approaches for making use of action sequences, timing data, or both. We then present an approach
for identifying common response processes that is based on the information contained in time-
stamped action sequences. We illustrate the insights that can be gained on the basis of this approach
by applying it to a PIAAC PSTRE task.
1.1. Using Action Sequences for Investigating Response Processes
Making use of action sequences for investigating examinees’ interactions with interactive
tasks is a rapidly growing stream of research. One of the main challenges for making use of action
sequences is how to meaningful aggregate this usually enormous amount of data (von Davier,
Khorramdel, He, Shin, & Chen, 2019).
In the case that subject-matter theory exists on how examinees approach interactive tasks,
theory-derived indicators can be constructed (e.g., whether examinees employed a certain solution
strategy or not). These can then be related to other variables of interest (Greiff, Niepel, Scherer,
& Martin, 2016; Greiff, Wüstenberg, & Avvisati 2015) or even be employed as indicators in mea-
surement models (LaMar, 2018). However, given that action sequence data are usually complex,
reflecting the wide diversity of human behavior (Tang, Wang, Liu, & Ying, 2020), most of the
approaches for such data are exploratory in nature.
Visual approaches aim at providing graphical frameworks for depicting action sequence data
that assist discovering meaningful patterns in the data, e.g., important actions or pathways (Vista,
Care, & Awwal, 2017; Zhu, Shu, & von Davier, 2016). Similar objectives have been pursued by
employing data mining techniques for identifying single actions or subsequences (n-grams) that
are associated with success or failure on an interactive task or that differentiate between different
proficiency groups (He & von Davier, 2015;2016; Liao, He, & Jiao, 2019; Qiao & Jiao, 2018;
Stadler, Fischer, & Greiff, 2019).
Another common approach for detecting patterns in action sequence data for the purpose
of investigating examinees’ interactions with interactive tasks is to compress the information
contained in differences between any two action sequences into distance measures. In this context,
distance measures can either be defined to describe how action sequences differ from each other
(Tang, Wang, He, et al., 2020) or with regard to expert-defined optimal strategies (Hao, Shu, &
von Davier, 2015; He, Borgonovi, & Paccagnella, 2019a) and are usually derived by drawing on
techniques from natural language processing, such as the Levenshtein edit distance (Hao et al.,
2015) or longest common subsequences (LCSs; He, Borgonovi, & Paccagnella, 2019a; Sukkarieh,
von Davier, & Yamamoto, 2012), or from the field of clickstream analysis (Tang, Wang, He,
et al., 2020). The information contained in such distance measures can then be further used in
employing exploratory dimensionality reduction techniques such as principal component analysis
and hierarchical clustering (Eichmann, Greiff, Naumann, Brandhuber, & Goldhammer, 2020;Hao
et al., 2015), or multidimensional scaling (Tang, Wang, He, et al., 2020). When distance from
expert-defined optimal strategies is assessed, distance measures can be related to other variables of
interest, for example, proficiency. This allows assessing whether similarity with optimal strategies
indeed contains information on examinees’ proficiency levels (He, Borgonovi, & Paccagnella,
2019a).
Recently, new approaches have been developed that draw on machine learning techniques for
assessing response processes by complexity reduction. Recurrent neural networks, for instance,
have successfully been applied for extracting latent features for parsimoniously describing
response processes (Tang, Wang, Liu, & Ying, 2020) or for breaking down individual processes
into a sequence of subprocesses (Wang, Tang, Liu, & Ying, 2020).

ESTHER ULITZSCH ET AL. 193
1.2. Using Timing Data for Investigating Response Processes
Using timing data for inferring the nature of cognitive processes has a long tradition in
psychology and is a key element for drawing inferences about cognitive and behavioral processes
in a variety of paradigms and theoretical frameworks (see De Boeck & Jeon, 2019; Kyllonen & Zu,
2016, for overviews). These are built on the rationale that differences in timing data are indicative
of qualitative or quantitative differences in cognitive processes that differ in the time required
for their execution. A prominent example for such differences is the distinction between solution
and rapid guessing behavior in the context of multiple-choice items, where both processes can
result in choosing the same answer on a multiple-choice item but are likely to be associated with
rather different response times (Wise, 2017). In the context of traditional test items (i.e., items
with a multiple-choice or open-response format) there is a rich body of research using timing
data for better understanding response behavior, e.g., by assessing how examinees allocate their
time during the assessment (e.g., Fox & Marianti, 2016) or for detecting differences in response
processes (e.g., Molenaar, Oberski, Vermunt, & De Boeck, 2016; Partchev & De Boeck, 2012;
Ulitzsch, von Davier, & Pohl, 2019;2020 ; Wang & Xu, 2015; Wang, Xu, Shang, & Kuncel, 2018;
Weeks, von Davier, & Yamamoto, 2016).
In the context of interactive tasks, research focusing on timing data has mainly focused on
item-level time, for instance, to investigate how time spent on an item is related to proficiency
(Goldhammer et al., 2014; Naumann & Goldhammer, 2017; Scherer, Greiff, & Hautamäki, 2015).
There are, however, some exceptions. Stelter, Goldhammer, Naumann, and Rölke (2015) assessed
time spent on pre-selected, automatable subtasks such as drag-and-drop events or setting a book-
mark via the toolbar of a browser. The authors argued that shorter time spent on automatable
subtasks indicates a higher degree of automation of the procedural knowledge needed to execute
these subtasks. In support of this, the authors showed that examinees with shorter time spent on
automatable subtasks were more likely to succeed on PIAAC PSTRE tasks, indicating higher
levels of proficiency. In a similar vein, Albert and Steinberg (2011) assessed whether planning
time, defined as the time elapsed from beginning the task until performing the first action, is
related to successful task completion. Using data from the PISA 2012 problem solving domain,
Eichmann, Goldhammer, Greiff, Pucite, and Naumann (2019) built on that work and derived
indicators that allow depicting planning behavior in greater detail. The authors considered (a) the
longest time interval elapsed between actions, conceptualized as the longest planning interval, (b)
the time elapsed until the longest planning interval occurred as a measure for the time when (most
of) the planning takes place, and c) the variance of times elapsed between any two successive
actions, giving the variation in planning time. Both Albert and Steinberg (2011) and Eichmann et
al. (2019) could show that planning time is beneficial for successful task completion. It is noted
that the objective of these studies was to assess the predictive power of features derived from
action-level times for successful task completion. They do, however, not allow for disentangling
and describing different response processes in terms of the types and order of performed actions.
1.3. Combining Information from Action Sequences and Timing Data
Few approaches exist that consider both information from action sequences and timing data
for the purpose of investigating examinees’ interactions with interactive tasks. The majority of
these approaches considers timing data only on the item-level, that is, takes into account the total
time spent on an item rather than the time taken for each performed action (time to action).
In a confirmatory approach, De Boeck and Scalise (2019) considered both action sequences
and timing data as aggregated variables, employing the number of actions, total time spent on the
item, and performance as indicators of a three-dimensional latent variable model. This framework
allows assessing how the number of actions taken and the time required for solving a task relate
to proficiency.

194 PSYCHOMETRIKA
Exploratory approaches jointly considering information on actions and timing are predom-
inantly aimed at identifying groups differing in their interaction with the tasks. To that end, He,
Liao, and Jiao (2019b)usedk-means clustering based on actions, the length of action sequences,
and time spent on the item. Xu, Fang, Chen, Liu, and Ying (2018) employed latent class analyses
based on the frequency of recurrent actions and time spent on the item. Their approach allowed
the detection of classes differing in the degree of efficiency of solution behavior—defined in terms
of how often the same actions were performed—and to assess differences in time spent on the
item between the classes.
Another approach to analyze timing data in addition to action sequences is to treat time
spent on task as a covariate that can be considered for further investigation of identified response
processes. Wang et al. (2020), for instance, assessed whether the time spent on a task is related
to employing different solution strategies.
One exception considering action sequences jointly with non-aggregated timing data is the
work conducted by Chen, Li, Liu, and Ying (2019), who presented an exploratory event history
approach to simultaneously predict total time spent on the item as well as the final response.
Variousfeaturesderivedfrom actionsequences (suchasactionfrequenciesor indicatorsofwhether
a certain action has previously been executed) were used as predictors. By considering features
derived from action sequences in continuous time, the approach takes the timing of actions into
account. Note that the objective of this approach is closely related to approaches that aim at
identifying key actions and subsequences that are relevant for success on an interactive item
(seeHe&vonDavier,2015; Liao et al., 2019; Qiao & Jiao, 2018). It does, as such, not aim
for describing differences in response processes, considering the type and timely order of action
sequences.
In sum, there is a rich and rapidly growing body of research aiming to make use of the
information contained in time-stamped action sequences, either for assessing the predictive power
of actions and timing for successful task completion or for investigating differences in response
processes. The latter class of approaches, however, so far has considered timing data only on
the aggregate item-level. By considering aggregated features such as time spent on task rather
than time elapsed between actions, these previous approaches neglect that examinees may differ
in the time—and, as such, the underlying cognitive processes—required for executing specific
subprocesses. Therefore, the aim of this article is to develop a new approach that can make use
of the whole of information contained in time-stamped action sequences for a more in-depth
investigation of the behavioral processes underlying task completion.
2. Proposed Method
We propose a two-step approach that integrates tools from clickstream analyses and graph-
modeled data clustering with psychometrics and combines action sequences and action-level times
into one analysis framework. We leverage the information contained in action patterns as given by
action sequences and action-level times (a) to determine the degree of similarity between action
patterns and (b) to identify common response processes. For identifying subgroups of persons with
similar action patterns, we propose performing cluster editing—a graph-modeled data clustering
technique—on the similarity measures.
In the following, we first present two similarity measures considering action sequences and
times to action that vary in their degree of sensitivity to time-wise differences. We then introduce
cluster editing as a mean for identifying common response processes given by homogeneous
subgroups with similar action patterns. An existing integer linear programming (ILP) formulation
of the cluster editing problem is explained.
Loading more pages...