Combining Clickstream Analyses and Graph-Modeled Data Clustering for Identifying Common Response Processes

psychometrika—vol. 86, no. 1, 190–214

March 2021

https://doi.org/10.1007/s11336-020-09743-0

COMBINING CLICKSTREAM ANALYSES AND GRAPH-MODELED DATA

CLUSTERING FOR IDENTIFYING COMMON RESPONSE PROCESSES

Esther Ulitzsch

IPN – LEIBNIZ INSTITUTE FOR SCIENCE AND MATHEMATICS EDUCATION

Qiwei He

EDUCATIONAL TESTING SERVICE

Vincent Ulitzsch, Hendrik Molter, André Nichterlein, and Rolf Niedermeier

TECHNISCHE UNIVERSITÄT BERLIN

Steffi Pohl

FREIE UNIVERSITÄT BERLIN

Complex interactive test items are becoming more widely used in assessments. Being computer-

administered, assessments using interactive items allow logging time-stamped action sequences. These

sequences constitute a rich source of information that may facilitate investigating how examinees approach an

item and arrive at their given response. There is a rich body of research leveraging action sequence data

for investigating examinees’ behavior. However, the associated timing data have been considered mainly

on the item-level, if at all. Considering timing data on the action-level in addition to action sequences,

however, has vast potential to support a more fine-grained assessment of examinees’ behavior. We provide

an approach that jointly considers action sequences and action-level times for identifying common response

processes. In doing so, we integrate tools from clickstream analyses and graph-modeled data clustering

with psychometrics. In our approach, we (a) provide similarity measures that are based on both actions and

the associated action-level timing data and (b) subsequently employ cluster edge deletion for identifying

homogeneous, interpretable, well-separated groups of action patterns, each describing a common response

process. Guidelines on how to apply the approach are provided. The approach and its utility are illustrated

on a complex problem-solving item from PIAAC 2012.

Key words: action sequences, response times, complex problem solving, cluster editing.

1. Introduction

Interactive items in low-stakes large-scale assessments are designed to provide authentic

tasks and, as such, to better reflect what examinees know and are able to do than traditional test

items can (Goldhammer, Naumann, & Keßel, 2013). Such items are used, for example,

in the Problem Solving in Technology-Rich Environments (PSTRE) domain in the Programme

for the International Assessment of Adult Competencies (PIAAC, OECD, 2013) and the collab-

orative problem solving domain in the Programme for International Student Assessment (PISA,

OECD, 2017). Understanding response processes to interactive tasks is paramount for assess-

ing whether these indeed capture the construct to be measured. As noted in the Standards for

Educational and Psychological Testing “construct interpretations oftentimes involve more or less

Correspondence should be made to Esther Ulitzsch, Educational Measurement, IPN – Leibniz Institute for Science

and Mathematics Education, Olshausenstraße 62, 24118 Kiel, Germany. Email: [email protected]

Figure 1. Schematic representation of time-stamped action sequences for four hypothetical examinees (horizontal axis: time, 0 to 8).

explicit assumptions about the cognitive processes engaged” (American Educational Research

Association, American Psychological Association, & National Council on Measurement in Edu-

cation and Joint Committee on Standards for Educational and Psychological Testing, 2014, p. 15).

Therefore, “theoretical and empirical analyses of the response processes” (American Educational

Research Association et al., 2014, p. 15) are recommended for assessing whether the response

processes applied by examinees fit with the interpretation of the construct to be measured.

Being computer-administered, assessments using interactive items allow logging time-

stamped action sequences. These sequences, illustrated schematically in Fig. 1, document both the

particular actions executed and the time required for their execution. Various approaches exist that

leverage action sequence data for investigating how examinees interact with interactive items (e.g.,

He & von Davier, 2015; Qiao & Jiao, 2018; Tang, Wang, He, Liu, & Ying, 2020; Tang, Wang, Liu,

& Ying, 2020). The associated timing data, however, have mainly been considered on an aggre-

gate level (e.g., time spent on task as opposed to time required for the single actions executed for

completing the task), if at all. Since differences in timing can be indicative of different underlying

cognitive processes even if the same actions are performed, considering action sequences jointly

with the time elapsed between the actions constituting the sequences has vast potential to support

a more fine-grained assessment of examinees’ interactions with interactive items. For instance, it

may support detecting parts of response processes that are more time consuming for examinees,

e.g., due to being cognitively more challenging.

To motivate the use of time-stamped action sequences for a more in-depth assessment of

response processes, we consider action patterns for Examinees 1 and 3 in Fig. 1, performing the

same action sequence within a comparable amount of time. However, while Examinee 1 executed

his or her first action rather quickly, Examinee 3 initially required more time but then performed

all actions in quick succession. These differences in time elapsing between actions may mark

different response processes. Examinee 3 might have spent a long time carefully planning how

to approach the task, while Examinee 1 might have planned on-the-go, resulting in a shorter time

to first action but longer time between subsequent actions required for planning the next step.

Such differences cannot be uncovered by considering only action sequences or time spent on

the whole task; uncovering them requires action-level timing data.
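The contrast between the two examinees can be made concrete with a minimal sketch. The time stamps below are hypothetical values chosen for illustration and are not read from Fig. 1; the helper name `inter_action_times` is likewise an assumption, not part of the proposed method.

```python
from typing import List


def inter_action_times(timestamps: List[float], start: float = 0.0) -> List[float]:
    """Return the time elapsed before each action.

    The first entry is the time from task start to the first action;
    each later entry is the time since the preceding action.
    """
    previous = [start] + timestamps[:-1]
    return [t - p for t, p in zip(timestamps, previous)]


# Hypothetical time stamps: both examinees perform A, B, C and finish at t = 7.
examinee_1 = [1.0, 4.0, 7.0]  # quick first action, longer pauses afterwards
examinee_3 = [5.0, 6.0, 7.0]  # long initial pause, then quick succession

times_1 = inter_action_times(examinee_1)
times_3 = inter_action_times(examinee_3)
```

Under these assumed values, both examinees share the same action sequence and the same total time, yet their sequences of inter-action times differ, which is the information that item-level timing discards.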

In this article, we aim to make use of all the information contained in time-stamped

action sequences and provide an approach that jointly considers action sequences and the corre-

sponding sequence of time elapsed between the actions for identifying common response processes.

To do so, we combine data mining techniques originally developed for the analysis of

clickstream data with graph-modeled data clustering.

The remainder of this article is structured as follows: First, we provide an overview of previous

approaches for making use of action sequences, timing data, or both. We then present an approach

for identifying common response processes that is based on the information contained in time-

stamped action sequences. We illustrate the insights that can be gained on the basis of this approach

by applying it to a PIAAC PSTRE task.

1.1. Using Action Sequences for Investigating Response Processes

Making use of action sequences for investigating examinees’ interactions with interactive

tasks is a rapidly growing stream of research. One of the main challenges for making use of action

sequences is how to meaningfully aggregate this usually enormous amount of data (von Davier,

Khorramdel, He, Shin, & Chen, 2019).

In the case that subject-matter theory exists on how examinees approach interactive tasks,

theory-derived indicators can be constructed (e.g., whether examinees employed a certain solution

strategy or not). These can then be related to other variables of interest (Greiff, Niepel, Scherer,

& Martin, 2016; Greiff, Wüstenberg, & Avvisati 2015) or even be employed as indicators in mea-

surement models (LaMar, 2018). However, given that action sequence data are usually complex,

reflecting the wide diversity of human behavior (Tang, Wang, Liu, & Ying, 2020), most of the

approaches for such data are exploratory in nature.

Visual approaches aim at providing graphical frameworks for depicting action sequence data

that assist discovering meaningful patterns in the data, e.g., important actions or pathways (Vista,

Care, & Awwal, 2017; Zhu, Shu, & von Davier, 2016). Similar objectives have been pursued by

employing data mining techniques for identifying single actions or subsequences (n-grams) that

are associated with success or failure on an interactive task or that differentiate between different

proficiency groups (He & von Davier, 2015, 2016; Liao, He, & Jiao, 2019; Qiao & Jiao, 2018;

Stadler, Fischer, & Greiff, 2019).
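As a rough illustration of the n-gram idea, the following sketch extracts contiguous subsequences from action sequences and counts, per group, how many sequences contain each one. The function names and the example data are hypothetical; whatever weighting or selection procedures the cited studies apply are not reproduced here.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(actions: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous subsequences of length n, in order of occurrence."""
    return [tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)]


def sequences_containing(sequences: Sequence[Sequence[str]], n: int) -> Counter:
    """Count, for each n-gram, the number of sequences in which it occurs."""
    counts: Counter = Counter()
    for actions in sequences:
        # set() ensures each sequence contributes at most once per n-gram
        counts.update(set(ngrams(actions, n)))
    return counts


# Hypothetical action sequences, grouped by response correctness.
correct = [["A", "B", "C"], ["A", "D", "B", "C"]]
incorrect = [["D", "E"], ["A", "D", "E"]]

in_correct = sequences_containing(correct, 2)
in_incorrect = sequences_containing(incorrect, 2)
```

In this toy example the bigram (B, C) occurs only in the correct group and (D, E) only in the incorrect group, which is the kind of contrast such analyses look for.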

Another common approach for detecting patterns in action sequence data for the purpose

of investigating examinees’ interactions with interactive tasks is to compress the information

contained in differences between any two action sequences into distance measures. In this context,

distance measures can either be defined to describe how action sequences differ from each other

(Tang, Wang, He, et al., 2020) or with regard to expert-defined optimal strategies (Hao, Shu, &

von Davier, 2015; He, Borgonovi, & Paccagnella, 2019a) and are usually derived by drawing on

techniques from natural language processing, such as the Levenshtein edit distance (Hao et al.,

2015) or longest common subsequences (LCSs; He, Borgonovi, & Paccagnella, 2019a; Sukkarieh,

von Davier, & Yamamoto, 2012), or from the field of clickstream analysis (Tang, Wang, He,

et al., 2020). The information contained in such distance measures can then be further used in

employing exploratory dimensionality reduction techniques such as principal component analysis

and hierarchical clustering (Eichmann, Greiff, Naumann, Brandhuber, & Goldhammer, 2020; Hao

et al., 2015), or multidimensional scaling (Tang, Wang, He, et al., 2020). When distance from

expert-defined optimal strategies is assessed, distance measures can be related to other variables of

interest, for example, proficiency. This allows assessing whether similarity with optimal strategies

indeed contains information on examinees’ proficiency levels (He, Borgonovi, & Paccagnella,

2019a).
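To illustrate the two sequence comparison techniques named above, the following sketch computes the Levenshtein edit distance and the length of the longest common subsequence for two action sequences using standard dynamic programming. It assumes unit costs for insertions, deletions, and substitutions; the cited studies may weight or normalize these quantities differently, and the example sequences are hypothetical.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest (not necessarily contiguous) common subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Hypothetical action sequences: an assumed reference strategy and an observed sequence.
reference = ["A", "B", "C"]
observed = ["A", "D", "B", "E", "C"]

edit_distance = levenshtein(reference, observed)
common_length = lcs_length(reference, observed)
```

Here the observed sequence contains the reference sequence as a subsequence, so the longest common subsequence has the full length of the reference, while the edit distance counts the two additional actions.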

Recently, new approaches have been developed that draw on machine learning techniques for

assessing response processes by complexity reduction. Recurrent neural networks, for instance,

have successfully been applied for extracting latent features for parsimoniously describing

response processes (Tang, Wang, Liu, & Ying, 2020) or for breaking down individual processes

into a sequence of subprocesses (Wang, Tang, Liu, & Ying, 2020).

1.2. Using Timing Data for Investigating Response Processes

Using timing data for inferring the nature of cognitive processes has a long tradition in

psychology and is a key element for drawing inferences about cognitive and behavioral processes

in a variety of paradigms and theoretical frameworks (see De Boeck & Jeon, 2019; Kyllonen & Zu,

2016, for overviews). These are built on the rationale that differences in timing data are indicative

of qualitative or quantitative differences in cognitive processes that differ in the time required

for their execution. A prominent example for such differences is the distinction between solution

and rapid guessing behavior in the context of multiple-choice items, where both processes can

result in choosing the same answer on a multiple-choice item but are likely to be associated with

rather different response times (Wise, 2017). In the context of traditional test items (i.e., items

with a multiple-choice or open-response format) there is a rich body of research using timing

data for better understanding response behavior, e.g., by assessing how examinees allocate their

time during the assessment (e.g., Fox & Marianti, 2016) or for detecting differences in response

processes (e.g., Molenaar, Oberski, Vermunt, & De Boeck, 2016; Partchev & De Boeck, 2012;

Ulitzsch, von Davier, & Pohl, 2019, 2020; Wang & Xu, 2015; Wang, Xu, Shang, & Kuncel, 2018;

Weeks, von Davier, & Yamamoto, 2016).

In the context of interactive tasks, research on timing data has mainly focused on

item-level time, for instance, to investigate how time spent on an item is related to proficiency

(Goldhammer et al., 2014; Naumann & Goldhammer, 2017; Scherer, Greiff, & Hautamäki, 2015).

There are, however, some exceptions. Stelter, Goldhammer, Naumann, and Rölke (2015) assessed

time spent on pre-selected, automatable subtasks such as drag-and-drop events or setting a book-

mark via the toolbar of a browser. The authors argued that shorter time spent on automatable

subtasks indicates a higher degree of automation of the procedural knowledge needed to execute

these subtasks. In support of this, the authors showed that examinees with shorter time spent on

automatable subtasks were more likely to succeed on PIAAC PSTRE tasks, indicating higher

levels of proficiency. In a similar vein, Albert and Steinberg (2011) assessed whether planning

time, defined as the time elapsed from beginning the task until performing the first action, is

related to successful task completion. Using data from the PISA 2012 problem solving domain,

Eichmann, Goldhammer, Greiff, Pucite, and Naumann (2019) built on that work and derived

indicators that allow depicting planning behavior in greater detail. The authors considered (a) the

longest time interval elapsed between actions, conceptualized as the longest planning interval, (b)

the time elapsed until the longest planning interval occurred as a measure for the time when (most

of) the planning takes place, and (c) the variance of times elapsed between any two successive

actions, giving the variation in planning time. Both Albert and Steinberg (2011) and Eichmann et

al. (2019) could show that planning time is beneficial for successful task completion. It is noted

that the objective of these studies was to assess the predictive power of features derived from

action-level times for successful task completion. They do not, however, allow for disentangling

and describing different response processes in terms of the types and order of performed actions.
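The action-level timing indicators described in this paragraph can be sketched as follows. This is an illustrative reading rather than the cited authors' implementation: the function name, the choice to measure the onset of the longest interval from task start, the use of the population variance, and the inclusion of the time to first action among the intervals are all assumptions.

```python
from statistics import pvariance
from typing import Dict, List


def planning_indicators(timestamps: List[float], start: float = 0.0) -> Dict[str, float]:
    """Derive timing indicators from the time stamps of one examinee's actions.

    Assumes at least one action and time stamps in ascending order.
    """
    previous = [start] + timestamps[:-1]
    intervals = [t - p for t, p in zip(timestamps, previous)]
    longest = max(intervals)
    # Assumption: the onset of the (first) longest interval marks when it "occurred".
    onset = previous[intervals.index(longest)]
    return {
        "time_to_first_action": intervals[0],           # planning time
        "longest_interval": longest,                    # indicator (a)
        "time_until_longest_interval": onset - start,   # indicator (b)
        "interval_variance": pvariance(intervals),      # indicator (c), population variance
    }


# Hypothetical time stamps for two examinees.
early_planner = planning_indicators([5.0, 6.0, 7.0])
on_the_go = planning_indicators([1.0, 4.0, 7.0])
```

For the first hypothetical examinee the longest interval precedes the first action, whereas for the second it begins only after the first action has been performed.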

1.3. Combining Information from Action Sequences and Timing Data

Few approaches exist that consider both information from action sequences and timing data

for the purpose of investigating examinees’ interactions with interactive tasks. The majority of

these approaches considers timing data only on the item-level, that is, takes into account the total

time spent on an item rather than the time taken for each performed action (time to action).

In a confirmatory approach, De Boeck and Scalise (2019) considered both action sequences

and timing data as aggregated variables, employing the number of actions, total time spent on the

item, and performance as indicators of a three-dimensional latent variable model. This framework

allows assessing how the number of actions taken and the time required for solving a task relate

to proficiency.

Exploratory approaches jointly considering information on actions and timing are predom-

inantly aimed at identifying groups differing in their interaction with the tasks. To that end, He,

Liao, and Jiao (2019b) used k-means clustering based on actions, the length of action sequences,

and time spent on the item. Xu, Fang, Chen, Liu, and Ying (2018) employed latent class analyses

based on the frequency of recurrent actions and time spent on the item. Their approach allowed

the detection of classes differing in the degree of efficiency of solution behavior—defined in terms

of how often the same actions were performed—and to assess differences in time spent on the

item between the classes.

Another approach to analyze timing data in addition to action sequences is to treat time

spent on task as a covariate that can be considered for further investigation of identified response

processes. Wang et al. (2020), for instance, assessed whether the time spent on a task is related

to employing different solution strategies.

One exception considering action sequences jointly with non-aggregated timing data is the

work conducted by Chen, Li, Liu, and Ying (2019), who presented an exploratory event history

approach to simultaneously predict total time spent on the item as well as the final response.

Various features derived from action sequences (such as action frequencies or indicators of whether

a certain action has previously been executed) were used as predictors. By considering features

derived from action sequences in continuous time, the approach takes the timing of actions into

account. Note that the objective of this approach is closely related to approaches that aim at

identifying key actions and subsequences that are relevant for success on an interactive item

(see He & von Davier, 2015; Liao et al., 2019; Qiao & Jiao, 2018). As such, it does not aim

to describe differences in response processes, considering the type and temporal order of

actions.

In sum, there is a rich and rapidly growing body of research aiming to make use of the

information contained in time-stamped action sequences, either for assessing the predictive power

of actions and timing for successful task completion or for investigating differences in response

processes. The latter class of approaches, however, so far has considered timing data only on

the aggregate item-level. By considering aggregated features such as time spent on task rather

than time elapsed between actions, these previous approaches neglect that examinees may differ

in the time—and, as such, the underlying cognitive processes—required for executing specific

subprocesses. Therefore, the aim of this article is to develop a new approach that can make use

of all the information contained in time-stamped action sequences for a more in-depth

investigation of the behavioral processes underlying task completion.

2. Proposed Method

We propose a two-step approach that integrates tools from clickstream analyses and graph-

modeled data clustering with psychometrics and combines action sequences and action-level times

into one analysis framework. We leverage the information contained in action patterns as given by

action sequences and action-level times (a) to determine the degree of similarity between action

patterns and (b) to identify common response processes. For identifying subgroups of persons with

similar action patterns, we propose performing cluster editing—a graph-modeled data clustering

technique—on the similarity measures.

In the following, we first present two similarity measures considering action sequences and

times to action that vary in their degree of sensitivity to time-wise differences. We then introduce

cluster editing as a means for identifying common response processes given by homogeneous

subgroups with similar action patterns. An existing integer linear programming (ILP) formulation

of the cluster editing problem is explained.
