What are the Problem Makers:
Discovering the Most Frequently Changed
Activities in Adaptive Processes
Chen Li1?, Manfred Reichert2, and Andreas Wombacher3
1Information System Group, University of Twente, The Netherlands
2Institute of Databases and Information System, Ulm University, Germany
3Database Group, University of Twente, The Netherlands
Abstract. Recently, a new generation of adaptive Process-Aware In-
formation System (PAIS) has emerged, which enables dynamic service
changes (i.e., changes of instances derived from a composite service and
process respectively). This, in turn, results in a large number of pro-
cess variants derived from the same process model, but differing in their
structure due to the applied changes. Since such process variants are ex-
pensive to maintain, the process model should evolve accordingly. It is
therefore our goal to discover those activities that have been more often
involved in process (instance) adaptations than others, such that we can
focus on them when re-designing the process model. This paper provides
two approaches to rank activities based on their involvement in process
adaptations and process configurations respectively. The first approach
allows to precisely rank the activities, but it is very expensive to perform
since the algorithm is at N P level. We therefore provide as alternative
approach an approximation ranking algorithm which computes in poly-
nomial time. The performance of the approximation algorithm is evalu-
ated and compared through a comprehensive simulation of 3600 process
models. By applying statistical significance tests, we can also identify
several factors which influence the performance of the approximation
ranking algorithm.
1 Introduction
In today’s dynamic business world, success of an enterprise increasingly depends
on its ability to react to changes in its environment in a quick, flexible and
cost-effective way [16]. Along this trend a variety of process and service support
paradigms as well as corresponding specification languages (e.g., WS-BPEL, WS-
CDL) have emerged. In addition, different approaches for flexible processes and
?This work was done in the MinAdept project, which has been supported by the
Netherlands Organization for Scientific Research (NWO) under contract number
612.066.512.
2
services respectively exist [17, 20]. Generally, adaptations of composite services
and processes are not only needed for configuration purposes at buildtime, but
also become necessary during runtime to deal with exceptional situations and
changing needs; i.e., for single instances of composite services and processes
respectively, it must be possible to dynamically adapt their structure (e.g. to
insert, delete or move activities during runtime).
In response to this need adaptive process management technology has emerged
[29]. It allows to adapt and configure process models at different levels. This, in
turn, results in large collections of process model variants (process variants for
short), which are created from the same process model, but slightly differ from
each other in their structure. Fig. 1 depicts an example. The left hand side shows
a high-level view on a patient treatment process as it is normally executed: a pa-
tient is admitted to a hospital, where he first registers, then receives treatment,
and finally pays. In emergency situations, however, it might become necessary
to deviate from this model, e.g., by first starting treatment of the patient and
allowing him to register later during treatment. To capture this behavior in the
model of the corresponding process instance, we need to move activity receive
treatment from its current position to a position parallel to activity register.
This leads to an instance-specific process model variant S0as shown on the right
hand side of Fig. 1. Generally, a large number of process variants may exist in a
Process-Aware Information System (PAIS) at both the process type and process
instance level [14].
S[∆>S’
receive
treatment
Admitted
a) S: original process model
register pay
b) S’: final execution & change
register
receive
treatment
pay
AND-Split AND-Join
admitted
∆=Move (S, register, admitted, pay)
e=<admitted, receive treatment, register, pay>
Fig. 1. Original Process Model S and Process Variant S’
In most approaches which allow for the adaptation and configuration of pro-
cess models, the resulting process variants have to be maintained separately.
Then even simple changes (e.g. due to new laws) often require manual re-editing
of a large number of process variants. Over time this leads to divergence of the
respective process models, which aggravates their maintenance significantly [31].
Considering a given reference process model and analyzing the collection of
process variants configured from it, this paper aims at finding the problem mak-
ers, i.e., the activities that are involved in process adaptations more often than
others. These activities, in turn, cause most deviations from the reference process
model and thus lead to highest configuration effort. In particular, we provide al-
gorithms that solely use the reference process model and a collection of variants
derived from it as input; i.e., we do not require the presence of a change log [18,
3
19] The discovered information is particularly useful for monitoring the devia-
tions from the predefined process model or for redesigning it through learning
from past executions.
Based on the two assumptions that: (1) process models are block-structured
[17] (like for example BPEL 4) and (2) all activities in a process model have
unique labels, this paper deals with the following fundamental research ques-
tion:Given a reference process model and a collection of process variants config-
ured from it, how to rank the activities according to their involvement in struc-
tural process adaptations (i.e., the adaptations that become necessary when con-
figuring the process variants)?
The remainder of this paper is organized as follows: Section 2 gives back-
ground information needed for understanding this paper. To illustrate our algo-
rithms, we provide an running example in Section 3. We provide a precise, but
expensive ranking algorithm in Section 4 and a more efficient approximation
ranking algorithm in Section 5. To test the performance of the two algorithms,
we conduct comprehensive simulation. Section 6 describes the setup of this sim-
ulation whereas Section 7 presents its result. Finally, Section 8 discusses related
work and Section 9 concludes with a summary and an outlook.
2 Backgrounds
We first introduce basic notions needed in the following:
Process Model: Let Pdenote the set of all sound process models. A par-
ticular process model S= (N, E, . . .)∈ P is defined as well-structured Activity
Net [17, 29]. Nconstitutes the set of process activities and Ethe set of control
edges (i.e., precedence relations) linking them. To limit the scope, we assume
Activity Nets to be block-structured like in BPEL. An example is provided in
Fig. 1.
Process change A process change is accomplished by applying a sequence of
change operations to the process model Sover time [17]. Such change operations
modify the initial process model by altering the set of activities and their order
relations. Thus, each application of a change operation results in a new process
model. We define process change and process variant as follows:
Definition 1 (Process Change and Process Variant). Let Pdenote the
set of possible process models and Cbe the set of possible process changes. Let
S, S0∈ P be two process models, let ∆∈ C be a process change expressed in
terms of a high-level change operation, and let σ=h∆1, ∆2, . . . ∆ni∈C∗be a
sequence of process changes performed on initial model S. Then:
–S[∆iS0iff ∆is applicable to Sand S0is the (sound) process model resulting
from the application of ∆to S.
–S[σiS0iff ∃S1, S2, . . . Sn+1 ∈ P with S=S1,S0=Sn+1, and Si[∆iiSi+1
for i∈ {1, . . . n}. We denote S0as variant of S.
4see [28] for a technique transforming an unstructured process a model to block (tree)
structured model
4
Examples of high-level change operations include insert activity,delete ac-
tivity, and move activity as implemented in the ADEPT change framework [17].
While insert and delete modify the set of activities in the process model, move
changes activity positions and thus the order relations of the process model. For
example, operation move(S, A,B,C) moves activity Afrom its current position
within process model Sto the position after activity Band before activity C.
Operation delete(S, A), in turn, deletes activity Afrom process model S. Issues
concerning the correct use of these operations, their generalizations, and formal
pre-/post-conditions are described in [17]. Though the depicted change oper-
ations are discussed in relation to our ADEPT approach, they are generic in
the sense that they can be easily applied in connection with other process meta
models as well [29]. For example, a process change as described in the ADEPT
framework can be mapped to the concept of life-cycle inheritance known from
Petri Nets [25]. We refer to ADEPT since it covers by far most high-level change
patterns and change support features when compared to other approaches [29],
and it offers a fully implemented adaptive process engine.
Definition 2 (Distance and Bias). Let S,S0∈ P be two process models.
Then: Distance d(S,S0)between Sand S0corresponds to the minimal number
of high-level change operations needed to transform process model Sinto process
model S0; i.e., d(S,S0):= min{|σ| | σ∈ C∗∧S[σiS0}. Furthermore, a sequence of
change operations σwith S[σiS0and |σ|=d(S,S0)is denoted as a bias between S
and S0. All the biases are summarized in a set B(S,S0)={σ∈ C∗| |σ|=dS,S0},
which we denote this set as the bias set.
The distance between two process models Sand S0is the minimal number
of high-level change operations needed for transforming Sinto S0. Usually, it
measures the complexity for model transformation. The corresponding sequence
of change operations is denoted as bias between Sand S0. Generally, it is possible
to have more than one minimal sequence of change operations to realize the
transformation from Sinto S0, i.e., given two process models Sand S0their
bias is not necessarily unique [26, 12]. As example consider Fig. 1. Here, the
distance between model Sand process variant S0is one, since we only need to
perform one change operation ∆1=move(S, register, addmitted, pay) to
transform Sinto S1. However, it is also possible to transform Sinto S0with ∆2=
move(S, receive treatment, admitted, pay). Therefore, we obtain B(S,S0)=
{∆1, ∆2}as bias set. In general, determining the bias and distance between two
process models has complexity at NP level (see [12] for a computation method).
Here, we use high-level change operations rather than change primitives (i.e.
elementary changes like adding or removing nodes and edges) to measure the
distance between process models. This allows to guarantee soundness of process
models and also provides a more meaningful measure for distance [12].
Trace: A trace ton process model S= (N, E, . . .) denotes a valid and
complete execution sequence t≡< a1, a2, . . . , ak>of activities ai∈Naccording
to the control flow set out by S. All traces Scan produce are summarized in
trace set TS.t(a≺b) is denoted as precedence relation between activities a
5
and bin trace t≡< a1, a2, . . . , ak>iff ∃i<j:ai=a∧aj=b. Here, we
only consider traces composing ’real’ activities, but no events related to silent
ones (i.e. activity nodes which contain no action and exist only for control flow
purpose [12]). Finally, we consider two process models being the same if they are
trace equivalent, i.e., S≡S0iff TS≡ TS0. The stronger notion of bi-similarity [8]
is not required here.
3 Running Example
Fig. 2 gives an example, which is used for illustration purpose through out this
paper. Regarding this example, out of a reference model S, five different process
variants Si∈ P (i= 1,2, . . . 5) have been configured, which are weighted based
on the number of process instances created from them. In our example, 30%
of all process instances were executed according to process variant S1, while
15% of the instances ran on S2. If we only know process variants, but have no
runtime information about related instance executions, we assume the variants
to be equally weighted; i.e., every process variant then has weight 1/n, where n
corresponds to the number of given variants.
Process configuration / adaptation
S
1
: 30%
S
2
: 15%
S
3
: 20%
S
4
: 20%
S
5
: 15%
B C
E
AD
DA C B E
A
D
E
B C
B E
A
C
D
A E
B
C
D
Weight of process variant,
based on number of executions
Distance (S, S
1
) = 1
Distance (S, S
2
) = 2
Distance (S, S
3
) = 2
Distance (S, S
4
) = 1
Distance (S, S
5
) = 2
Average weighted distance =
1.5
change / instance
A B C D E
S: reference process model
AND-Split AND-Join XOR-Split XOR-Join
Fig. 2. One illustrative example
We first compute the distances (cf. Def. 2) between process model Sand its
variants. For example, when comparing Swith S1we obtain distance one, i.e.,
we only need to perform one change operation (i.e. move(S, E,A,D)) to trans-
form Sinto S1. Or when comparing Swith S2, needed change operations are
move(S, D,B,C) and move(S, E,B,C), and distance between Sand S2is two.
Based on the weight of each variant, we can compute the average weighted dis-
tance between reference model Sand its variants; e.g., the distances between
6
Sand Si, and the weights are depicted in Fig. 2. As average weighted distance
we obtain 1 ×0.3 + 2 ×0.15 + 2 ×0.2 + 1 ×0.2+2×0.15 = 1.5. This means
we need to perform on average 1.5 change operations to transform the reference
model to a process variant and corresponding instance respectively. Generally,
the average weighted distance between a reference model and its process variants
expresses how ”close” they are; i.e., the higher the average weighted distance is,
the greater configuration efforts have become
Though we are able to compute the distances between the reference model
and each variant, it is not clear which activities are most involved in these
configurations. Clearly, we should focus on the activities which are more often
involved in these configurations, as they cause the major effort with respect
to process configuration. In this context, we need an approach for ranking the
activities based on their involvement in process variants configurations. Since
we do not presume any run-time data (like change logs [19], or execution logs
[27]), it is not possible to know which activities have been involved in process
configurations particularly. In the following section, we will provide a method to
measure the potential involvement of each activity aiin process configurations,
which we denote as change impact i.e., CI(ai). And based on the change
impact of each activity, we are able to rank the activities, lets denote such
ranking as change impact ranking list.
In this paper, we are interested in detecting changes of order relations in pro-
cess models. Therefore, we only consider move operations but factor out insert
or delete operations. Note that the latter can be easily detected by comparing
activity sets of two process models (for details, see [10, 12]).
4 Computing the Precise Change Impact Ranking List
Since we do not presume the presence of a change log or execution log respec-
tively, the major information we can use for our analysis are the bias set BS,Si
which describe the structural differences between the reference process model
and each of its variant Si. From the bias set, we are able to compute the mini-
mal number of change operations needed to transform the reference model Sinto
a particular variant Si. The bias set, therefore, can be considered as a purified
change log for our analysis. In this section, we present a method to compute the
change impact of each activity based on the bias sets. A general description of
our approach is as follows:
1. We compute the bias sets BS,Siwhich describe the structural difference be-
tween the reference model Sand all the variants Si(cf. Section 4.1).
2. For each bias set BS,Si, we measure the involvement of each activity aiin
the captured adaptation. (cf. Section 4.2).
3. The change impact CI(ai) of an activity aiis then measured by its involve-
ment in all bias sets BS,Si(cf. Section 4.2).
7
4.1 Computing Biases
Let us re-consider the example from Fig. 2. By scanning the reference process
model Sand a process variant Si(i= 1 ...5), we are able to compute bias set
BS,Si[12]. This bias set contains all possible sequences of change operations
transforming Sinto Siwith minimal number of change operations. However,
the definition of bias set is too strict in our context, since we are only interested
in the activities being involved in adaptation rather than the order in which the
different changes were applied. For example, the bias set B(S,S2)comprise the
two changes σ1, σ2where σ1=< ∆1, ∆2>with ∆1=move(S, D, B, C) and
∆2=move(S, E, B, C) and σ2=< ∆2, ∆1>. Although σ16=σ2, this difference
is not relevant in our context since we are only interested in the activities being
changed rather than the order in which different changes were applied.
Therefore, we keep the granularity of our bias analysis only on the activities
that have been involved in changes changed rather than the applied change
operations. Regarding our example, we only want to document these activities
i.e., {D,E}in the context of σ1(see above) rather than the change operation σ1
itself. When only looking at the changed activities, σ1does not differ from σ2
since the activities concerned by the changes in the two biases are exactly the
same.
Definition 3 (Changed Activity Set).
We define set Aσrepresenting the activities changed by any change operation
of bias σ, i.e., Aσ={ai|(aiis an activity changed by ∆i)∧(∆i∈σ)}. We define
C(S,S0)={Aσ|σ∈ B(S,S0)}as the Changed Activity Set of Sand S0.
According to Def. 3, an element Aσof the Changed Activity Set C(S,S0)
corresponds to a set representing the activities changed by bias σ. Regarding
our example from Fig. 2, the changed activity sets of the reference model Sand
its variants Si(i= 1 . . . 5) are listed in Table 1.
Models Changed Activity Set
C(S,S1){{E}}
C(S,S2){{D,E},{C,D},{C,E}}
C(S,S3){{B,D},{B,E},{C,D},{C,E}}
C(S,S4){{C},{D}}
C(S,S5){{B,D},{B,E},{C,D},{C,E}}
Table 1. the Changed activity sets between reference model and all variants
As example, consider C(S,S2). We can either move activities Dand E, or activ-
ities Cand Dor activities Cand Eto transform model Sinto S2. When further
analyzing Table 1, we can see that C(S,S3)is exactly the same as C(S,S5)though
the two process models S3and S5are quite different (cf. Fig. 2). The reason
behind is that, for example, regarding move operations, the Changed Activity
8
Set only documents which activity has potentially been changed but does not
specify to which position the activity have been moved to. Consequently, bias
sets B(S,S3)and B(S,S5), which document the complete information about the
changes, are rather different. However, in connection with the research ques-
tion described in Section 1, we are only interested in activities which have been
potentially involved in a change.
4.2 Computing the Change Impact CI(ai) of Each Activity ai
We measure the change impact CI(ai) of each activity aiby computing its
contribution to the average weighted distance between the reference model and
its variants. As example, consider distance d(S,S2)between reference model S
and variant S2, which is 2. Here we want to measure how much each activity
has contributed to this distance. We measure it by analyzing changed activity
set CS,S2.
In general, for a given changed activity set, we enumerate all possible so-
lutions to transform the reference process model Sinto the variants Si, i.e.,
∀Aσ∈ C(S,Si):∃σ∈ B(S,Si). Note that we do not have any information about
which sequence of change operations was applied to configure Siout of S. There-
fore, to each Aσ∈ CS,Siwe assign same weight d(S,Si)/|CS,Si|. Finally, for a
particular activity aj∈ Aσ∈ CS,Si, we choose d(S,Si)
|C(S,Si)|×1
|Aσ|.
Consider again the definition for the distance between process models Sand
Si(cf. Def. 2). Then d(S,Si)=|Aσ|holds, since bias σis a sequence with minimal
number of change operations to transform Sinto Si. Consequently, the change
impact for each activity ajbetween Sand a particular Sican be computed as
|{Aσ∈C(S,Si)|aj∈Aσ}|
|C(S,Si)|.
Let Sbe the reference model and let Si(i= 1, . . . , n) be weighted process
variants Si(with weight wiand Σn
1wi= 1) derived from S. Then the Change Im-
pact CI(aj) of a particular activity aj, which measures its potential involvement
in process adaptations, can be computed as follows:
CI(aj) =
n
X
i=1
wi×|{Aσ∈ C(S,Si)|aj∈ Aσ}|
|C(S,Si)|(1)
Figure 3 summarizes the Change Impacts of the activities from our example
(cf. Figure 2). When reading the depicted table horizontally, it shows the poten-
tial involvement of each activity when configuring the reference model Sinto a
particular variant Si. As example take S4. The distance between Sand S4is one,
since both activities Cand Dcould be potentially involved in the corresponding
configuration (cf. Table 1). Each of the two activities therefore obtain Change
Impact of 0.5 from this particular variant. When reading Figure 3 vertically, it
shows the Change Impact of each activity based on the observation of all vari-
ants. For example, activity Bshows Change Impact of 0.5 for configuring variant
S3and 0.5 for configuring variant S5. Considering the weight of each variant, the
change impact of activity BCI(B) can be computed by Formula (1). As result
9
we obtain 0.175, which means that on average we need to move activity B0.175
times when configuring a variant out of the given reference model.
Activity
Variant
Fig. 3. The change impact for each activity
Based on the change impact of each activity, we can also rank them accord-
ingly. Clearly, activity Ehas the highest change impact and therefore should be
ranked first. Consequently, if we need to find an activity of the reference process
model for re-positioning, activity Ewill be the first candidate. Reason is that it
has been reconfigured (i.e., re-positioned) more often than the other activities.
In fact, we have determined the contribution each activity has on the average
weighted distance between the reference model and its variants. If we sum up
the change impact of all activities, we obtain 0 + 0.175 + 0.375 + 0.375 + 0.575
= 1.5. Note that this corresponds to the actual value of the average weighted
distance (cf. Fig. 2).
4.3 Discussion
The approach we described in this section is very precise: all possible changes be-
tween the reference model and a process variant are enumerated and the change
impact of a particular activity is computed by analyzing the reference model
and all variants. However, enumerating all possible changes between two models
is a NP problem, this approach can be very expensive. We need to call the NP
algorithm every time we want to find possible changes between the reference
model and a particular variant. Therefore, this approach will not scale up. If
we have to deal with a large number of variants with complex structures (like
models with tens up to hundreds of activities). In the next section, we introduce
an approximation algorithm to solve the problem in an efficient way.
10
5 Compute the Approximation Change Impact Ranking
List
To reduce the complexity for computing the change impact of each activity, we
now introduce an approximation algorithm which only requires polynomial time
to compute the ranking result. We first introduce the notion of order matrix in
Section 5.1 and the notion of aggregated order matrix in Section 5.2. Section 5.3
then presents our approximation ranking algorithm.
5.1 Representing Process Models as Order Matrices
Theoretical backgrounds of high-level change operations have been extensively
discussed in ADEPT [17]. One key feature of our ADEPT change framework
is to maintain the structure of the unchanged parts of a process model [17].
For example, when deleting an activity this neither influences the successors
nor predecessors of this activity, and therefore also not their order relations.
To incorporate this feature in our approach, rather than only looking at di-
rect predecessor-successor relationships between activities (i.e. control edges),
we consider the transitive control dependencies for each activity pairs; i.e. for a
given process model S= (N, E, . . .)∈ P, we examine for every pair of activities
ai, aj∈N,ai6=ajtheir transitive order relations. Logically, we determine order
relations by considering all traces the process model may produce (cf. Section 2).
Results are aggregated in an order matrix A|N|×|N|, which considers four types
of control relations (cf. Def. 4):
Definition 4 (Order matrix). Let S= (N, E, . . .)∈ P be a process model
with N={a1, a2, . . . , an}. Let further TSdenote the set of all traces producible
on S. Then: Matrix A|N|×|N|is called order matrix of Swith Aij representing
the order relation between activities ai,aj∈N,i6=jiff:
–Aij = ’1’ iff (∀t∈ TSwith ai, aj∈t⇒t(ai≺aj))
If for all traces containing activities aiand aj,aialways appears BEFORE
aj, we denote Aij as ’1’, i.e., aialways precedes of ajin the flow of control.
–Aij = ’0’ iff (∀t∈ TSwith ai, aj∈t⇒t(aj≺ai))
If for all traces containing activities aiand aj,aialways appears AFTER
aj, we denote Aij as a ’0’, i.e. aialways succeeds of ajin the flow of control.
–Aij = ’*’ iff (∃t1∈ TS, with ai, aj∈t1∧t1(ai≺aj))∧(∃t2∈ TS, with
ai, aj∈t2∧t2(aj≺ai))
If there exists at least one trace in which aiappears before ajand another
trace in which aiappears after aj, we denote Aij as ’*’, i.e. aiand ajare
contained in different parallel branches.
–Aij = ’-’ iff ( ¬∃t∈ TS:ai∈t∧aj∈t)
If there is no trace containing both activity aiand aj, we denote Aij as ’-’,
i.e. aiand ajare contained in different branches of a conditional branching.
Fig. 4 shows an example. Besides control edges, which express direct predecessor-
successor relationships, process model Salso contains four kinds of control con-
nectors: AND-Split and AND-Join (corresponding to a flow activity in BPEL),
11
A
C
B
E
F
GD
Process model S Order matrix of S
AND-Split
AND-Join
XOR-Split
XOR-Join
‘0’ : successor
‘1’ : predecessor
‘*’ : AND-block
‘-’ : XOR-block
Fig. 4. Process model and its order matrix
and XOR-Split and XOR-join (corresponding to a switch or pick activity in
BPEL). The depicted order matrix represents all four described relations. For
example activities Aand Bwill never appear in the same trace since they are con-
tained in different branches of an XOR block. Therefore, we assign ’-’ to matrix
element AAB . Similarly, we can obtain the relation for each pair of activities.
The main diagonal of the matrix is empty since we do not compare an activity
with itself.
Under certain conditions, an order matrix uniquely represents the process
model it was created from. This is stated by Theorem 1. Before giving this
theorem, we need to define the notion of substring of trace:
Definition 5 (Substring of trace). Let S∈ P be a process model and let
t, t0∈ TSbe two traces on S. We denote tas sub-string of t0iff [∀ai,aj∈t,
t(ai≺aj)⇒ai,aj∈t0∧t0(ai≺aj)] and [∃ak∈N:ak/∈t∧ak∈t0].
Theorem 1. Let S, S0∈ P be two process models with same activity set N=
{a1, a2, . . . , an}. Let further TS,TS0be the related trace sets and An×n,A0
n×nbe
the order matrices of Sand S0. Then S6=S0⇔A6=A0, if [¬∃t1, t0
1∈ TS:t1is
a substring of t0
1] and [¬∃t2, t0
2∈ TS0:t2is a substring of t0
2].
We give a proof of Theorem 1 in [12]. According to this theorem, there will
be a one-to-one mapping between a process model Sand its order matrix A, if
the substring constraint is met. (Note that the substring constraint can be easily
checked and handled [12]); i.e., if the conditions of Theorem 1 are met, the order
matrix will uniquely represent the process model. Analyzing its order matrix (cf.
Def. 4) will then be sufficient in order to analyze the process model.
5.2 Aggregated Order Matrix
In order to analyze a given collection of process variants, we first compute the
order matrix for each of these process variants (cf. Def. 4). Regarding our example
from Fig. 2, for instance we obtain five order matrices (cf. Fig. 5). Then, we
analyze the order relation for each pair of activities considering all five order
matrices derived before. As the order relation between two activities might be not
12
the same in all order matrices, this analysis does not result in a fixed relationship,
but provides a distribution for the four types of order relations (cf. Def. 4).
Regarding our example, in 65% of all cases activity Cis a successor of activity
B(as for variants S1,S2,S4), in 20% of all cases Cis a predecessor of B(as in
S3), and in 15% of the cases, Band Care contained in different branches of an
XOR block (as in S5) (cf. Fig. 5). Therefore, we can define the order relation
between two activities aand bas 4-dimensional vector Vab = (v0
ab, v1
ab, v∗
ab, v−
ab):
each field corresponds to the frequency of the corresponding relation type (’0’,
’1’, ’*’ or ’-’) as specified in Def. 4. Take our example from Fig. 5: here v1
CB = 0.65
corresponds to the frequency of all cases with activities Band Chaving order
relation ’1’, i.e. all cases for which Cprecedes B. Regarding our example, we
obtain VCB = (0.65,0.2,0,0.15).
We define an aggregated order matrix as follows:
Definition 6 (Aggregated Order Matrix).
Let Si∈ P,i= 1,2, . . . , n be a collection of process variants with same activ-
ity set N. Let further Aibe the order matrix of Si, and let weight wirepresent
the relative frequency of process instances executed on basis of Si. The Aggre-
gated Order Matrix of all process variants is defined as 2-dimensional matrix
Vm×mwith m=|N|and each matrix element vjk = (v0
jk, v1
jk, v∗
jk, v−
jk)being a
4-dimensional vector. For τ∈ {0,1,∗,−}, element vτ
jk expresses to what percent-
age, activities ajand akhave order relation τwithin the collection of process
variants Si. Formally:
∀aj, ak∈N, aj6=ak,∀τ∈ {0,1,∗,−} :vτ
jk =(Pn
i=1,Aijk =0τ0wi)/(Pn
i=1 wi).
0 1
*-
‘0’ : successor
‘1’ : predecessor
‘*’ : AND-block
‘-’ : XOR-block
S
1
: 30%
S
2
: 15%
S
3
: 20%
S
4
: 20%
S
5
: 15%
V
V
CB
= (0.65, 0.20, 0, 0.15)
‘0’ : 65%
‘1’ : 20%
‘*’ : 0%
‘-’ : 15%
O
r
d
e
r
m
a
t
r
i
c
e
s
Aggregated
order matrix
Fig. 5. Aggregated order matrix V
13
Fig. 5 shows the aggregated order matrix of the process variants from Fig.
2. In an aggregated order matrix, main diagonal is always empty since we do
not specify the order relation of an activity with itself. For all other elements, a
non-filled value in a certain dimension means it corresponds to zero.
5.3 Approximation Algorithm for Ranking Activities According to
Their Change Impact
We have introduced the aggregated order matrix to reflect the fact that the
execution orders between two activities may not be the same in different variants.
For example, the execution orders between activities Cand Bcan be represented
by VCB = (0.65,0.2,0,0.15). When reconsidering the reference process model
from Fig. 2, we can see that the order relation between activities Cand Bis
”0”, i.e., Cis a successor of B. If we built an aggregated order matrix Vref
purely based on this reference model, as relationship between Cand Bwe would
obtain Vref
CB = (1,0,0,0), i.e., Cwould then always be a successor of B. When
comparing VCB = (0.65,0.2,0,0.15) (which represents the variants) and Vref
CB
(which represents the reference model), we can easily see that these two vectors
are not the same. This indicates that when configuring reference model into the
variants, the position of Bor Cmight have changed afterwards.
Configuring the reference model into a variant is realized by applying a se-
quence of change operations (cf. Def. 1). Such an operation either changes the
activity set (like insert and delete operations) or the original relationship be-
tween the activities (e.g. move operation) of the reference model. Interestingly,
the changes of the activity relations are represented by the changes of the order
matrix. When configuring the reference model into variant S1, for example, we
need to move activity Eto the position between Aand D, (i.e., move(S, E, A, D)).
This change influences the order relation of Ewith the other activities. In the
reference model Sfor example, Esucceeds A, C, and D. Regarding process vari-
ant S1configured out of it by applying a move operation, the order relations
between Eon the one side and B, C and Don the other side is changed, i.e., E
now precedes Dand is allocated in parallel to Band C.
Generally, we can assume that the more an activity is moved, the more
its order relation differs from the original one. To quantitatively measure this
difference, we compare the order relations set out by the reference model with
those of the aggregated order matrix.
Before providing the comparison method, we first introduce function f(α, β)
which expresses the closeness between two vectors α= (x1, x2, ..., xn) and β=
(y1, y2, ..., yn):
f(α, β) = α·β
|α|×|β|=Pn
i=1 xiyi
pPn
i=1 x2
i×pPn
i=1 y2
i
(2)
f(α, β)∈[0,1] computes the cosine value of the angle θbetween vectors α
and βin Euclidean space. If f(α, β) = 1 holds, αand βexactly match in their
directions; f(α, β) = 0 means, they do not match at all. Regarding our running
14
examples, for instance, we obtain f(VCB, V ref
CB )=0.933. This number indicates
high similarity between the order relations of the reference model and the ones
of the variants (which are represented by the aggregated order matrix).
Based on these considerations, the change impact of a particular activity can
be measured using the following formula. To differentiate it from Formula (1),
we denote change impact computed by this approximation as CIa(ai).
CIa(ai)5=Px∈N\{ai}f2(Vaix, V ref
aix)
|N|−1(3)
CIa(ai)∈[0,1] corresponds to the average square mean value of the similarity
(measured by Formula (2)) between activity aiand the rest of activities. It there-
fore approximately reflects how much aihas been re-configured. If CIa(ai) = 1
holds, activity aiwill exactly have same order relations with respect to the other
activities in both the reference model and all the variants. For this case, We can
therefore assume that it has not been moved. Weight(ai) = 0, in turn, means
that the order relation of aias reflected shown in the reference model is com-
pletely different from the order relations in the different variants. We therefore
assume that in such case activity aihas been moved a lot. Note that our ranking
is based on descending orders, i.e., the higher the change impact CIa(ai) is, the
lower the chance will be that it has been potentially moved and the lower such
activity should be ranked.
Regarding our example from Fig. 2, the ranking result of the five activities
is shown in Table 2.
Activity E D C B A
CIa(ai) 0.6641 0.7384 0.8678 0.9280 1.0000
Rank 1 2 3 4 5
Table 2. Approximate ranking result
From Table 2, we can clearly derive a ranking order of the activities based
on their potential involvement in changes. Activity Eis moved most frequently
while activity Ais the least moved one.
5Note that this is not a precise measure since not only the execution orders of the
moved activities are affected, but the other activities may be influenced by a change
operation as well. For example, when configuring Sinto S1, we actually only need to
move activity E. However, the execution orders of the remaining activities are also
changed, e.g., activities B, C and D. Reason is that move operations can globally
influence the execution order while our measure only examines the local information
between every pair of activities.
15
5.4 Comparing the Precision ranking Algorithm and the
Approximation Ranking Algorithm
The approximation algorithm presented in Section 5.3 is a polynomial algorithm,
i.e., the complexity for computing the change impact CIa(ai) of activity aivalue
of an activity is at O(n2×m) where nis the number of activities in each variant
and mis the number of variants. 6Compared to the NP level complexity of the
precise ranking algorithm, efficiency of the approximation ranking algorithm is
much better. However, we still have to validate the performance of the approx-
imation algorithm, i.e., we must show how close it is to the real optimum (i.e.,
the ranking provided by the precise ranking algorithm). Regarding our running
example, comparison results are summarized in Table 3.
Precise ranking algorithm
Activity E D C B A
CI(ai) 0.5750 0.3750 0.3750 0.1750 0.0000
Rank 1 2 2 4 5
Approximation ranking algorithm
Activity E D C B A
CIa(ai) 0.6641 0.7384 0.8678 0.9280 1.0000
Rank 1 2 3 4 5
Table 3. Comparing the ranking results of the two algorithms
Table 3 shows the ranking results for our example from Fig. 2 using both
precise ranking algorithm and the approximation ranking algorithm. More pre-
cisely, tt shows the change impact of each activity, as well as its ranking order.
This simple comparison already indicate that that the performance of the ap-
proximation ranking algorithm is quite good. Here, it generates the same ranking
order as the precise ranking algorithm does.
Of course, such a simple comparison is far from being sufficient. First, as
the comparison is only based on one example, it is not allowed to draw general
conclusion about the performance of the approximation algorithm. Clearly, it
cannot NOT perfectly match the precise ranking algorithm, since the latter one
has NP level complexity. While the approximation ranking algorithm only tries
to achieve an approximation using a polynomial algorithm. A more systematic
comparison is required to evaluate the real performance of the approximation
ranking algorithm. In this context, we are also interested in those factors that
influence the performance of the approximation ranking algorithm. In the fol-
lowing, we try to answer the following two questions:
6The complexity is computed based on the assumption that we have already had
the order matrices for the reference model and the variants. Since the complexity for
computing the order matrix from a process model is at O(n2), the overall complexity
for computing the ranking value of an activity is O(m×n2+n) where nis the number
of activities and mis the number of variants.
16
1. How good does the approximation algorithm perform, i.e., how close are its
ranking results in comparison to the precise ranking results?
2. What factors have influence on the performance of the approximation ranking
algorithm, i.e., based on what conditions does it perform better or worse?
6 Simulation
We use simulation to answer the above questions, i.e., by generating hundreds
or thousands of examples, we are able to conclude statistically how good the
performance of the algorithm is and to test which factors significantly influence
it.
In order to analyze the influential factors, we require the dataset for simula-
tion to be well structured and well understood, i.e., we need to know the features
of the dataset which we are analyzing in order to determine which parameters
are more important than others. So far, there are no such real-life data available
for simulation. And even if this had been the case, such data would certainly
not cover all the scenarios we want to examine. Therefore, we use automatically
created datasets to run our simulations.
This section will describe how the dataset are generated. Since both the
ranking algorithms require a reference process model and a collections of process
variants derived from it, this section has been divided into three subsections:
1. We first describe an algorithm to randomly generate a reference model in
Section 6.1.
2. Besides purely examining the performance of our algorithms, we are also
interested in whether the performance of our algorithms can be influenced
by some external parameters (like the size of the models or the similarity
between the models, etc). Therefore, Section 6.2 describes which parameters
will be considered when configuring the reference model into process variants.
3. Section 6.3 then describes how we adjust the different parameters in config-
uring the process variants.
In the following, we generate 36 groups of datasets using different values for
the parameters we consider. For each of these groups, we generate 1 reference
model and 100 process variants configured out of the reference model (i.e., we
consider 3636 process models). For creating the data sets, we assume different
scenarios. We compare the two ranking algorithms based on the ranking results
we obtain from the 36 groups. In the following, we describe the scenarios we
used to generate the datasets and in the next section (cf. Section 7), we will
evaluated the ranking result of the different algorithms.
6.1 Generating the Reference Process Model
Our general idea of randomly generating (block structured) reference model is to
cluster blocks, i.e., we randomly cluster activities (blocks) into a bigger block and
17
input : Set of activities aithe process model to be generated should contain,
i= (i,...,n)
output: Valid process model S
Define each activity aias a basic block Bi,i= (1,...,n);1
Define set B:= {B1,...,Bn}/* initial state */ ;2
while |B| >1do3
randomly selected two blocks Bi, Bj∈ B ;4
randomly select an order relation τ∈ {0,1,∗,−} ;5
build block Bkwhich contains sub-blocks Biand Bjhaving order relation τ6
;
B:= B \ {Bi, Bj};7
B:= BS{Bk};8
end9
S:= B0with B0∈ B10
Algorithm 1: Randomly generating a reference model
this clustering continues iteratively until all the activities (blocks) are clustered
together. The detail of our approach is depicted in algorithm 1.
To illustrate how Algorithm 1 works, an example is given in Fig. 6. As input a
set of activities {A,B,C,E and E}are given, and the goal is to construct a valid,
block-structured process model Sout of them. The algorithm starts by consid-
ering each activity aias basic block Bi, and adding these blocks to set B(lines
1 and 2). Regarding our example, B={{A},{B},{C},{D},{E}}. The algorithm
first randomly select two blocks Bi, Bj(lines 4) and link them with a randomly
chosen order relation τ(lines 5 and 6). Regarding our example, blocks {B}and
{C}are selected to construct a new block {B, C}with a randomly chosen order
relation 1 (which means Bprecedes C). The newly created block {B, C}will then
replace blocks {B}and {C}in the block set B, i.e., B={{A},{B,C},{D},{E}}
(lines 7 and 8). This procedure (lines 4-8) is repeated until block set Bcontains
one single block B0(B0={A,B,C,D,E}regarding our example). This block then
represents our randomly generated process model S. (line 10). Fig. 6 shows the
process model we randomly generated as well as the block constructed in each
iteration.
AB C E D
1
*
1
0
Order relation
randomly chosen
Result
BC
E
D
A
1234
Fig. 6. Example of generating random process model
18
In practise, certain order relations are used more often than others. For exam-
ple, the predecessor-successor relation is used more frequently than AND/XOR-
splits [33]. When randomly generating a process model, we therefore take this
into account as well. Rather than randomly setting the order relation for two
blocks, we set the probability for choosing AND-split (τ=0∗0) and XOR-
split (τ=0−0) to 10% respectively, while predecessor-successor relationships
(τ={0,1}) are chosen with probability of 80%.
We therefore randomly generate 3 reference models, containing 10, 20 and
50 activity respectively 7. According to [15], process models containing more
than 50 activities have high risk of errors. therefore, it is not recommended to
design such large model. Following this guideline, we also set the largest size of
a process model for 50 activities in our simulation.
6.2 Parameters Considered for Generating Process Variants
Taking a generated reference process model, we control how variants are config-
ured by adjusting specific parameters. For example, these parameters determine
how many change operations needed to perform to configure a particular vari-
ant and where activities should be moved to and so forth. Basically, we have
considered the following parameters when generating the process variants.
1. Parameter 1 (Size of Process Models) The size of a variant (i.e., the
number of its activities) can potentially influence results. Therefore, we need
to check the behavior of our algorithm when applying them to variants of
different sizes. This is also important to test the scalability of the approxi-
mation algorithm, whose performance should not depend on the size of the
sample data.
2. (Parameter 2 (Similarity of Process Variants) This parameter mea-
sures how ”close” these variants are, e.g., whether or not the variants are
similar to each other. In this context, similarity measures how difficult it is
to configure one variant into another [12].
3. Parameter 3 (The Activities Been Changed) Often we can observe
that the probability with which activities are changed is not uniformly dis-
tributed, i.e., some activities might be involved in changes more often than
others. We therefore want to analyze whether the probability distribution of
how frequently an activity is changed would thus influence the performance
of our algorithms.
4. Parameter 4 (Position Where Activities Are Moved To) Change can
be local or global. A local change will influence the order relations of only
very few activities (e.g., when swapping the order of two directly succeeding
activities) while global ones can influence quite a lot of order relations in a
process model (for example when moving one activity from the beginning
to the end). Therefore, we need to check whether or not this influences the
results.
7Each of these model will be used in 12 groups of datasets, since we want to avoid the
influence of the reference model. A detailed discussion can be found later in Section
6.3
19
6.3 Method for generating Data Sets
The dataset for our simulation analysis is generated by a randomly generating a
reference model (cf. Section 6.1) and a collection of variants configured based on
this reference model. When configuring these variants, by applying a sequences
of change operations to the reference model, we vary the parameters described
in Section 6.2 and consequently generate variants based on different scenarios.
The following choices are available for the different parameters:
Parameter 1 (Size of Process Models) This parameter controls how
many activities shall be contained in a process model. There can be three options:
1. Small-sized models : 10 activities per variant
2. Medium-sized models: 20 activities per variant
3. Large-sized models: 50 activities per variant
(Parameter 2 (Similarity of Process Variants) The closeness between
the variants is measured by determining the total number of change operations
we have to apply when generating variants (cf. Def. 2). Three possible choices
exist:
1. Small change: 10% of activities are moved
2. Medium change: 20% of activities are moved
3. Large change: 30% of activities are moved
For example, for the datasets comprising large-size process variants (i.e.,
variants with 50 activities), medium-change would mean to randomly change 10
activities when generating generate a variant of the given a reference model. This
way, we can control the distance between the reference model and its variant.
And indirectly, we can control the similarity between variants since they are all
controlled in a certain distances with the reference model.
Parameter 3 (The Activities Been Changed) This parameter controls
which activities are moved when generating the variant models. We consider two
scenarios:
1. In this scenario, we randomly pick the activities be moved, i.e., each activity
is assumed to have the same probability to be involved in a move operation.
As example assume that we need to move two activities in order to configure
one particular variant out of the reference model. In this scenario, we assume
each activity in the reference model would have the same probability to
be chosen, i.e., we can randomly pick two different activities. Since every
time activities are selected randomly, there will be no activity which has
been moved significantly more often than others in the collection of variants.
Please note that it does not mean every activity will be changed exactly for
the same number of times when configuring the collection of process variants.
Random differences will occur but these differences are not significant enough
in a statistical sense. Table 4 shows one example for random picking activities
with small model (10 activities) and small change (10% of them are changing,
i.e., one change).
20
2. Activities are selected based on Gaussian distribution. Very often
we can observe that some activities are involved in changes than others. We
therefore have to simulate the situation in which activities are not been in-
volved in change with same frequency. We set the probabilities for changing
(i.e., moving) the activities by using a Gaussian distribution, i.e., some ac-
tivities are assumed to be moved more often than others when generating
the process variants. If ncorresponds to the number of activities in the ref-
erence model and we randomly give a permutation of the activity set: the
probability for selection the number nactivity follows Gaussian distribution
X:(n/2,(n/10)2). This means the expected mean of the distribution is n/2
and its expected standard deviation is n/10. The Table 4 gives a general
ideal of such distribution for a dataset with small size model (comprising 10
activities) and only small change (10% of them are changing) are performed
to configure each process variant:
Activity A B C D E F G H I J
Number of times Selected activity randomly
activity being 15 10 9 10 9 12 7 6 11 11
involved in Selected activity based on a Gaussian distribution
changes operations 0 0 2 15 34 35 14 0 0 0
Table 4. When configuring 100 variants from the reference model, this table shows
the number of times one particular activity being involved in change operations based
on either random selection or Gaussian distribution
Parameter 4 (Position Where Activities Are Moved To). While pa-
rameter 3 determines how frequent an activity is moved, Parameter 4 controls the
position to which corresponding activities are moved to. Clearly, a local change
(e.g., to swap the order of two directly succeeding activities) has less effects on
order relations than a global change (e.g., move an activity from the beginning
to the end). This section analyzes whether this will influence the performance
of our ranking algorithms.
When performing a move operation, one important issue has to be considered.
Since we only consider block-structured process models, move operations must
not ”destroy” this block structure. Given the reference model San the candidate
activity aifor moving, we perform the following three steps to guarantee block-
structure of the resulting model:
1. We first remove activity aifrom the process model S.
2. We enumerate all possible blocks the modified process model contains. A
block can be one single activity or a self-contained part of the process model
or even the model itself. (See Appendix A for an algorithm enumerating all
possible blocks in a process model). Also note that the number of possible
candidate blocks is normally very large, e.g., several hundred potential blocks
can be identified for a large process model (with 50 activities).
21
3. We cluster the candidate activity airandomly with one of the potential
blocks we enumerated in step 2. The clustering technique will be similar to
how we generate a random process model in Section 6.1. i.e., we randomly
select an order relation τ∈ {0,1,∗,−} as the order relation between activity
aiand the selected block. Therefore, they are clustered together to form a
larger block.
Following these three steps, we can guarantee that the resulting process model
is sound and block-structured. Every time we cluster an activity with a block,
we actually move this activity to the position where it can form a bigger block
together with the selected one. On example is given in Fig. 7:
S
Potential blocks:
{B}, {C}, {D}, {E}, {B,C},
{B,C,E}, {B,C,D,E}
Delete
Activity
A
Cluster
block {B,C}
with
Activity A
with
order relation
1
(precede)
BC
E
AD
BC
E
D
S
r
S
d
A C
D
B
E
Fig. 7. Example of how to guarantee block structure when perform a move operation
Given a process model S, we would like to know where we can move say-
ing activity Ato, so that the resulting model maintains block-structured. The
first step is to delete this particular activity (Ain this case) from the process
model and we obtain an intermediate process model Sd. Then, we enumerate
all possible blocks such model contains. Regarding our example, Sdcontains the
following seven blocks {{B},{C},{D},{E},{B,C},{B,C,E},{B,C,D,E}}}. We can
then cluster activity Awith any one of the blocks with order relation τbetween
the activity and the selected block, τ∈ {0,1,∗,−}. Regarding our example,
block {B,C}are chosen and it is clustered with Awith order relation 1, i.e., A
precedes block {B,C}.
If we know where one particular activity aican be moved to, we then need
to determine whether it is a local change or a global change. Clearly, if aiis
clustered with a block close to its original position, it should be a local change.
Otherwise, if aiis clustered with a block far away from where it used to be, such
move should be considered as a globe one. Therefore, whether a change is a local
change or a global change is determined by the distance the moved activity and
the block which it will be clustered with.
In general, the distance between an activity and a block is measured by the
differences between them and the remaining activities. A detailed discussion can
be found in Appendix B. Regarding our example in Fig. 7, the distances between
activity Aand all possible blocks it can be clustered with are summarized in Table
5.
Parameter 4 therefore can control whether we perform more local changes or
more globe changes. We consider two options for determining such target block.
22
1. Random block selection In this scenario, we randomly select a block
from the list of all possible blocks, i.e., each potential block has the same
probability to be chosen. This way, block selection is not controlled and both
local and global changes are possible. Regarding our example in Fig. 7, the
probability for selecting all 7 potential blocks equals 1/7 = 0.143 (cf. Table
5).
2. Selection of block that is close to the activity to be moved Blocks
with shorter distance to the activity to moved have higher probability to
be chosen. In order to realize this, we first rank all blocks based on their
distance to the activity to be moved, and the probability to choose the ith
block is 2·1
√2πe−(5i/m
2)2, where mequals the number of blocks (this curve
has similar shape as Gaussian distribution X:(0,12) within the interval of
[0,5]). Regarding our example in Fig. 7, the probability for choosing each
block is also summarized in Table 5. Clearly, the probability for choosing a
closer block is a lot higher than the probability for choosing a block which
has higher distance. Therefore, following this scenario we are expected to
perform more local change compared to the other scenario.
Blocks {B} {B,C} {B,C,E} {C} {E} {B,C,D,E} {D}
Distance to Activity A2.45 3.10 3.31 3.74 3.74 3.90 5.65
Rank 2.45 3.10 3.31 3.74 3.74 3.90 5.65
The probability Select block randomly
of this block 0.143 0.143 0.143 0.143 0.143 0.143 0.143
being selected Select block based on a Gaussian distribution
0.618 0.288 0.08 0.013 0.001 8.19E-5 2.97E-6
Table 5. When selecting a block to be clustered with A, this table shows the proba-
bility of selecting each candidate block based on either random selection or Gaussian
distribution
6.4 Simulation Setup
Section 6.3 has discussed four parameters we considered. Regarding parameter
1 and 2, we can choose between 3 values, while parameter 3 and parameter 4
respectively allow for setting two values, i.e., in total we have 3 ×3×2×2 = 36
possible configurations of these four parameters and consequently should con-
sider 36 groups of datasets. These 4 parameters, and consequently should consist
For example, one particular group may prescribe 10 activities for a variant, where
each variant is generated by moving 10% of activities from the reference model.
When performing the move operations, the activities are selected based on the
Gaussian distribution whereas the block determining the position to which we
move these activities to is randomly selected.
We document the following information when generating the datasets:
23
1. Original reference model, i.e., the model based on which we perform the
changes. We have sketched a method to randomly generate a reference model
in Section 6.1. In addition, to make results comparable, we use same reference
model for all groups with same number of activities. For example, all groups
with parameter 1 = ”small size” use same reference model containing 10
activities.
2. 100 process variants. Based on a given reference model, we generate each
variant by configuring the reference model according to the different sce-
narios as described in Section 6.3. For each group, we generate 100 process
variants. Note that although the 100 variants are generated by following a
same scenario, they are not necessarily same. The reason is that the sce-
nario we described In Section 6.3 only depicted the feature of the collection
of variants but not a particular variant. For example take Table 4. When
following the scenario to select activity randomly, i.e., each activity has the
same probability to be chosen when configuring a process variant, it is ac-
tually not the case that each activity has been chosen for exactly a same
amount of times.
3. Ranking Results for Each Group Based on the reference process model
and the 100 process variants, we can rank the activities based on the precise
ranking algorithm (cf. Section 4) as well as the approximation ranking algo-
rithm (cf. Section 5). Table 6 shows an example. In this group, each variant
contains 10 activities, and it is generated by moving 10% of the activities in
the reference model. When performing the changes, activities and blocks are
randomly selected. 8
Precise ranking result
Activity A F I B D J E C G H
CI(ai) 0.1450 0.1250 0.1100 0.1000 0.0999 0.0999 0.0900 0.0800 0.0700 0.0699
Rank 1 2 3 4 5 5 7 8 9 10
Approximation ranking result
Activity A F J I D G E B C H
CIa(ai) 0.9787 .9792 0.9903 0.9904 0.9908 0.9911 0.991726 0.991728 0.9921 0.9923
Rank 1 2 3 4 5 6 7 8 9 10
Table 6. Precise and approximation ranking result
From Table 6, it becomes clear that the precise ranking algorithm and the
approximation ranking algorithm does not always provide same ranking results.
For example, Activity Iis ranked third regarding the precise ranking result, but
ranked fourth regarding the approximation ranking result. In the next section, we
provide a method to evaluate the differences between the two ranking algorithm.
8All datasets and ranking results are available at:
http ://wwwhome.cs.utwente.nl/ lic/Resources.html
24
7 Evaluation
In the following, we analyze the ranking results provided by our two algorithms
by means of simulation. This section consists of four parts:
1. Pre-processing the ranking results, i.e., tie-breaking (cf. Section 7.1).
2. Defining the approach for evaluating the ranking results (cf. Section 7.2)
3. Analyzing the performance of the approximation ranking algorithm (cf. Sec-
tion 7.3)
4. Analyzing the parameters that have significant influence on the ranking re-
sults (cf. Section 7.4)
7.1 Pre-process of the ranking result
When examining the precision ranking list in Table 6, we can see activities D
and Jhave the same change impact and consequently have the same rank. This
triggers a question about how to handle the case in which several activities have
the same rank, i.e., how to break the tie.
Most algorithms handle tie-breaking randomly, i.e., they enforce a random
order of the activities with same rank. Unfortunately, this method does not work
in our context. Consider a situation in which the precise ranking algorithm as-
signs to all activities the same change impact, i.e., all activities have same rank.
If we enforce a random order in connection with the precise ranking algorithm,
the matching between the precise ranking and the approximation ranking algo-
rithms would be fully dependent on such random order. If the random order is
”luckily” the same as in the approximation algorithm, the match will be per-
fect. However, if the random order is ”unfortunately” generated differently, the
opposite will be the case.
To handle this tie-breaking problem, we enforce an order for activities with
same rank by additionally considering the order of the other ranking algorithm.
As example, consider the case in Table 6. As activity Dand Jhave the same rank
in the precise ranking algorithm, we sort Dand Jby considering their rank in the
approximation ranking algorithm. As in the approximation ranking algorithm
Jprecedes Din the rank, we also put Jbefore Dwhen breaking the tie. In this
case, the precise ranking result will change, see Table 7.
Clearly, enforcing an order this way also bears the risk that we artificially
make the two ranks provided by the two algorithms more similar with each other.
To have a clear understanding of this problem, we also document the number
of all possible solutions to break the tie whenever we perform a tie-breaking. If
nactivities have same change impact value, the number of possible solutions to
break such tie equals the total number of sequences these nactivities can make
(their permutation). And we can compute it by Pn
n=n!. Regarding our example
from Table 7, two activities Jand Dhave the same impact factor, therefore we
have P2
2= 2! = 2 solutions, i.e., either JD or DJ. This total number of possible
solutions ptherefore can represent the risk we bear when break the tie using the
above mentioned way. The reason is that our tie-breaking approach only take 1
25
Precise ranking result
Activity A F I B J D E C G H
Change impact 0.1450 0.1250 0.1100 0.1000 0.0999 0.0999 0.0900 0.0800 0.0700 0.0699
Rank 1 2 3 4 5 5 7 8 9 10
Approximation ranking result
Activity A F JIDG E B C H
Change impact 0.9787 .9792 0.9903 0.9904 0.9908 0.9911 0.991726 0.991728 0.9921 0.9923
Rank 1 2 3 4 5 6 7 8 9 10
Table 7. Precise ranking algorithm after resort according to the approximation ranking
algorithm
particular sequence from the ppossible solutions. Therefore, the higher pis, the
higher the chance we will artificially make the two ranking results more similar.
The pvalues for all groups are shown in Appendix C.
In Appendix C, the pvalue for approximation ranking algorithms all equals
1 in all groups of our dataset. This means that no activities have equal change
impact values provided by the approximation raking algorithm and therefore no
tie-breaking is performed to pre-process the approximation rankings. However,
when checking the precise ranking result, we can see some high pvalues. It
means that we have performed tie-breaking for some activities in the ranking.
However, these pvalues are not large enough to threaten the validity of our
tie-breaking strategy. Reason is that they are still considerably low compared
with the total number of possible sequences these activities can make, i.e., the
ranking algorithm is not based on a random sequence of activities.
7.2 Evaluation Approach
In this section, we evaluate the performance of the approximation ranking algo-
rithm, i.e., we measure how close our approximation is to the ”real” optimum
(i.e., the precise ranking).
Precision is a widely used notion for measuring the performance of rank-
ing algorithms in different domains [23, 1]. In our context, we can consider the
precise ranking algorithm provides the ranking order we want to have, while
the approximation ranking result is the one we actually get. As the activities
are ranked differently, the sub-sets of the top nranked activities are necessarily
same. For example, let us compare the top three activities as ranked in the two
ranking algorithms (cf. Table6). While the precise ranking list contains activi-
ties A, F and I, the approximation ranking list comprises activities A,F and J.
Consequently, if we need to pick the top 3 activities, the two ranking algorithms
will provide different activity sets. The difference between the top nactivities in
the two ranking list can be measured using Precision(n):
Definition 7 (Precision(n)). Let P(n)be the set containing the top nranked
activities provided by the precise ranking algorithm. Let further A(n)be the set
26
containing the top nranked activities as provided by the approximation ranking
algorithm. We define Precision(n) as follows:
Precision(n) = |P(n)TA(n)|
|P(n)|
precision(n) reflects the ratio of the top nranked activities shared in the
result lists of both the precise and approximation ranking algorithm. Def. 7 uses
the precise ranking result as reference list, i.e., the optimum to which we want
to be close. The approximation ranking result in turn, corresponds to the infor-
mation we actually get. Consequently, precision(n) measures how much ”useful
information” about the actual top nactivities we can get from the application
of approximation algorithm. As example, consider Table 6, when comparing
the top 3 activities of the precision ranking result with those of the approx-
imation ranking result, we can see that activities Aand Fhave been correc-
tively selected, whereas this does not apply to activity I(i.e., Iis not contained
in the top 3 activities suggested by the precise ranking algorithm). Therefore,
precision(3) = 2/3=0.6667. Table 8 shows all precision values concerning the
top nranked activities.
Precise ranking result
Activity A F I B J D E C G H
Weight 0.1450 0.1250 0.1100 0.1000 0.0999 0.0999 0.0900 0.0800 0.0700 0.0699
Rank 1 2 3 4 5 6 7 8 9 10
Approximation ranking result
Activity A F J I D G E B C H
Weight 0.9787 .9792 0.9903 0.9904 0.9908 0.9911 0.991726 0.991728 0.9921 0.9923
Rank 1 2 3 4 5 6 7 8 9 10
precision(n) for top n activities
top n activity 1 2 3 4 5 6 7 8 9 10
precision(n) 1.0000 1.0000 0.6667 0.7500 0.8000 0.8333 0.8571 0.8750 1.000 1.000
Table 8. precision table
Here precision(1) and precision(2) equal 1. This means that the top ranked
nactivities (n= (1,2)) are the same for the precise ranking list and the approx-
imation ranking list. However, as the approximation algorithm ranks activity J
at the third place, we see a decrease on precision(3) since Ishould be ranked
third as done in the case of the precise ranking list. Then, Precision(n) keeps in-
creasing as nincreases until it reaches 1 again at the end of the chart (n= 9,10).
Trivially, precision(n) will always be 1 if nequals the number of activities. We
have additionally plotted the precision values in Fig. 8.
We can derive the curve depicted by plotting and interpolating all precision
values in Fig. 8. Besides this precision curve, we plot a line precision(n) = 1,
n= (1,2, . . . , 10). Further more we have marked the surface area between the two
27
precision chart
0.0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1.0
1 2 3 4 5 6 7 8 9 10
Top n activities
precision value
surface area
n
Fig. 8. Surface area for the precision chart
curves. The size of this surface area then can be used to evaluate the performance
of the approximation ranking algorithm for the given dataset. Trivially, if the
approximation ranking algorithm delivers the same ranking result as the precise
ranking algorithm, the precision value for any top ranked nactivities will equal
1, i.e., the precision curve matches the optimum curve precision(n) = 1 with n=
1,2,...,10. If the precise and approximation ranking algorithms provide different
ranking results, there will be a number n0the precision value precision(n0) does
not equal to 1. In this case, the precision curve deviates from the optimum curve,
i.e., there is space between the two curves. The larger this area is, the bigger the
difference of the two ranking results will be, and the worse the approximation
algorithm actually performs. Regarding our example, the surface area between
the two curves equals 1.12. To make the surface are independent from the number
of activities and to normalize the value within interval [0,1], we measure the ratio
between the surface area and the area created by the optimal line (the rectangle
between (0,0) and (1,10) in our case). This value is 0.112 in our example. This
number can be used as indicator showing how close the approximation line is
to the real optimum line, i.e., showing how good the approximation algorithm
works. Obviously, the lower the value is, the better the approximation algorithm
works. 9Altogether, the proposed method (i.e., measuring the ratio between
the precision line and the optimum line), is used to evaluate the performance of
our approximation algorithm in the conducted simulation which consists of 36
groups of datasets.
9This evaluation method is inspired by the precision-recall curve used in information
retrieval [1] or statistics [22]. We omitted ”recall” since in our context, it always
equals n/m for the top nranked activities in a rank list of size m. We do not apply
correlation analysis in statistics [22] since we are interested in the ranking order
rather than the exact change impact of activities.
28
7.3 Evaluation Result
In Appendix C, we give a short summary of the evaluation results for all groups.
For each group, we indicate the influence of tie-breaking policy as described in
Section 7.1. We also show the surface area (precision) introduced in Section 7.2.
A complete list, including the weights the ranking orders, etc, is available at:
http://wwwhome.cs.utwente.nl/
~
lic/Resources.html.
In the next two sub-sections we visualize the results in two different perspec-
tives.
Surface Area Values Distributions We first analyze the distributions of the
surface area values at different value ranges. A standard method is to depict the
histogram [22]. A histogram shows the distribution of surface areas into different
intervals. The result is shown in Fig. 9. The value range of the surface area is [0,
0.4] in all 36 groups. The histogram shows eight sub-intervals with an increment
of 0.05 in each interval. From Fig. 9, it becomes clear that in most groups, i.e.,10
out of 36 group, the surface area falls into interval [0.15,0.2). If we enlarge this
interval to [0.1, 0.25), more than 60% of the groups are covered. When computing
the mean and standard deviation of the surface areas, we obtain as mean 0.1933
and as standard deviation 0.0871.
Mean:
19.33%
Standard div:
8.71%
Fig. 9. Histogram of surface area value
To have a better understanding of the distribution of the surface areas, we
also tested whether the surface area follows a Gaussian distribution. Here, we
apply One-Sample Kolmogorov −Smirnov Test [22] to test our assumption.
29
The Kolmogorov −Smirnov Z value is 0.555, which indicates that the data
have 91.8% confidence following a Gaussian distribution. Clearly, the expected
mean is 0.1933 while the standard deviation is 0.0871. The distribution curve is
depicted in Fig. 9.
In addition, We also tested the confidence interval [22] of this surface area
since it is an important factor to measure the performance of the algorithm.
The 95% confidence interval is [0.1637,0.2225]. This indicates that the mean of
surface area has 95% probability falling into the interval [0.1637,0.2225].
Average Precision Values at Top nRanked Activities In the former
subsection, we have analyzed the distributions of surface area values in differ-
ent groups. This analysis shows the general performance of the approximation
ranking algorithm, and indicates that the average imprecision, measured by the
surface area, is around 20%. However, in most use cases we are mainly interested
in the activities at the top of the ranking list. A typical question is for example,
how the approximation algorithm will perform if we only compare the precision
of the top 10% ranked activities? In the following, we evaluate the precision of
our approximation algorithm at different positions of the ranking order.
Based on the 36 dataset groups, we evaluate the precision value for top
n%(n= 10,20,...,100) ranked activities of the ranking list. The mean as well
as the standard deviation of the precision values are depicted in Fig. 10.
mean and standard deviation of precisions
0
0.2
0.4
0.6
0.8
1
1.2
10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
top n% of ranking list
mean & standard deviation
mean
standard deviation
Average precision:
80.00%
Average S-Div:
13.77%
Fig. 10. mean and standard deviation of precision values at top n% of ranking list
The average precision value starts at 55% with standard deviation of 37.53%
when only evaluating the top 10% of the ranking list. The average precision
values keep increasing as we enlarge the ranking list we compare. This value
becomes stable after evaluating the top 30% of the ranking list. Without any
surprise, the value ends with 100% at the end of the curve.
Our simulation results show that the approximation algorithm performs rel-
atively poor when only considering the first activities of the ranking list, but
30
quickly becomes stable when we enlarge the ranking list we analyze. Note that
even 55% of precision at the beginning of the ranking list is considered being ac-
ceptable in the field of domain-specific information retrieval, where the precision
values are quite low and vary largely [1].
In conclusion, we can now answer the first research question formulated in
Section 5.4 about the performance of the approximation ranking algorithm. The
average surface area, which measures the imprecision of the approximation rank-
ing algorithm is 19.33% and with a probability of 95% falls into the interval
[0.1637,0.2225]. The precision of the approximation ranking algorithm is 55%
when only measuring the top 10% activities. This value increases as the rank
list increases, and will becomes stable after considering more than 30% of the
top ranked activities.
7.4 Significance test
In this section, we analyze which of the four parameters presented in Section 6.2
have significant influence on the surface area, i.e., which parameters do signifi-
cantly influence the performance of the approximation ranking algorithm.
We assume that all parameters are independent from each other. In the
following, we analyze the situation in which one parameter is varying while the
other parameters remain constant. For example, assume we want to measure
the influence of Parameter 2 (similarity between variants). Then we need to
divide our 36 groups of datasets into 3 sub-groups: one group with parameter
2 = ”small”, another one with Parameter 2 = ”medium”, a third one with
Parameter 2 = ”large”. We then analyze whether or not the results from the
three groups are significantly different from each other. If so, Parameter 2 has
significant influence on the results of the approximation ranking algorithm.
Therefore, for each of the four parameters n(n= 1,2,3,4), we want to test
the following null hypothesis:
Hn
0: Parameter nhas no influence on the surface area.
If the hypothesis is tested to be statistically significant (i.e., probability is
larger than 5%), we will accept the hypothesis, i.e. Parameter nis then assumed
to have no influence on the performance of the approximation algorithm. If this
is not the case, we need to reject this hypothesis, i.e., the parameter then sig-
nificantly influences the performance of the approximation ranking algorithm.
Comparable to most of the hypothesis tests, we assume that errors are indepen-
dent and follow normal distribution [22, 9].
Regarding our example, Parameter 1 and 2 have three options to choose
while parameter 3 and 4 have two (cf. Section 6.2). We will first explain how to
measure parameter 3 and 4 and then parameter 1 and 2.
Parameter 3 and 4 We apply ttest to evaluate the statistical significance
of hypothesis H0[22] for Parameter 3 and 4. The significance test attempts to
disprove this hypothesis by determining a probability value (p-value), i.e., which
measures the probability that the observed difference could have occurred by
31
chance. If the p-value is less than a given threshold value α(which is often set to
0.05), we can reject H0and conclude that the tested parameter has significant
influence on the surface area.
We first examine influence of Parameter 3. For this purpose, we divide our
dataset into 2 sets Xand Y:Xcontains all the groups with Parameter 3 =
”Random”, while Ycomprises groups with Parameter 4 = ”Gaussian”. Since
we have totally 36 groups of datasets, we have |X|=|Y|= 36/2 = 18. Let
Xi∈Xand Yi∈Ybe the surface area values of two groups with parameter 3
varying while the remaining three parameters have the same values, i= 1, . . . , n
and n= 18. We define Di=Yi−Xifor i= 1, . . . , n, and Dis the average
value of Di(i= 1, . . . , n). Assume that the model is additive, i.e., the observed
difference Dican be measured by the group different θplus an independent error
²i(Di=θ+²i). Our null hypothesis H0is then θ= 0, which means the two
groups are the same. The mathematical details for the t-test are given below.
t-test
t=D
s(Di)/√nwith
D=1
n
n
X
i=1
Diand s(Di) = v
u
u
t
1
n−1
n
X
i=1
(Di−D)2
Distribution under H0should follow Student’s twith n−1 degrees of freedom
[22].
Using the same approach we can test Parameter 4; results are shown in Table
9.
Parameter Mean diff. Standard dev. t value Probability Significant? has influence?
Parameter 3 -0.1043 0.0802 -5.5206 3.70E-5 NO YES
Parameter 4 0.0776 0.0799 4.1225 7.15E-4 NO YES
Table 9. Significance test result for Parameter 3 and 4
For Parameter 3, the result shows that groups in Y(groups with Parameter
3 = ”Gaussian”) perform better than groups in X(groups with Parameter 3 =
”Random”), since the mean difference is a negative value. The significance test
indicates very low probability that such difference is generated by random errors
(probability = 3.70E-5). Because of this, we need to reject H0, i.e., Parameter 3
can significantly influence the performance of the approximation ranking algo-
rithm. And the more changes are performed based on few activities, the better
the approximation ranking algorithm performs.
Similar analysis can be made for Parameter 4. Regarding Table 9, we can
see that groups in Xare better than groups in Y, since the mean difference is a
positive value. The two groups are also significantly different to each other since
the probability value is very low (7.15E-4). We can conclude that Parameter 4
32
is also can significantly influence the performance of the approximation rank-
ing algorithm: the more global change we apply, the better the approximation
algorithm performs.
Since parameter 3 and 4 have significant influence on the performance of the
approximation algorithm, we further analyze the groups of datasets partitioned
by the values of parameter 3 and 4. We first divided the 36 groups of dataset
into two subsets, one with Parameter 3 being set to ”Random” and another
with Parameter 3 set to ”Gaussian”. Similar to the approach described in Fig.
10, we analyze the mean and standard deviations of the precision values at top
n% ranked activities in the two sub-groups respectively. Similar analysis can be
applied to parameter 4, i.e., we divide the datasets into two subsets based on
the value of Parameter 4, and analyze the mean and standard deviation of the
top n% precision values. Results are depicted in Fig. 11.
Mean and SDev of top n% prevision values of
the two sub-groups divided by parameter 3
0
0.2
0.4
0.6
0.8
1
1.2
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
random(mean) gaussian(mean)
random(SDev) gaissian(SDev)
0.7827
0.1338
0.8573
0.1106
Mean and SDev of top n% prevision values of
the two sub-groups divided by parameter 4
0
0.2
0.4
0.6
0.8
1
1.2
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
random(mean) gaussian(mean)
random(SDev) gaussian(SDev)
0.8696
0.1118
0.7704
0.1415
SDev
mean
mean
SDev
Fig. 11. Comparing the mean and standard deviation of the top n% precision values
in different groups
We plot the mean and standard deviation of the subgroups in the same graph.
It then becomes clear that for Parameter 3, the group with value ”Gaussian”
(dashed line) performs better than the one with value ”Random”. Reason is that
average precisions are higher while standard deviations are lower. For Parameter
4, the group with value ”Random” then performs better than the group with
value ”Gaussian” for the same reasons as described above. By plotting the mean
and standard deviation of the precision values for the top n% activities, we can
derive the same result as we obtained through the significance test.
Parameter 1 and 2 Since Parameter 1 and 2 both have three values to choose
(”small”, ”medium”, and ”large”), we cannot use the t-test approach described
above. The standard approach to this type of problem is to examine the data
using a two-way Analysis of Variance (ANOVA) [22, 9].
We first analyze Parameter 1. Let us divide the dataset into sub-sets Yj, j =
(1, . . . , m) based on the value of Parameter 1. Let yij, i = (1, . . . , n) be the
surface area of a group in set Yj. Such a group in Yjshould have same index i
33
as a group in Y0
jin case parameter 2, 3 and 4 all have the same value for these
two groups. Regarding our example, since Parameter 1 can have three values
(”Small”, ”Medium” and ”Large”), mis then 3 and since n=|Yj|then nis
36/3 = 12. Let ybe the average surface area of all groups while yjrepresent the
average surface area of the groups in Yj. The standard model is as follows:
yij =µ+αi+βj+²ij
This means that each observation yij can be broken down into the true mean
performance µ, the group effect αi, the influence of parameter βj, and the error
²ij. We assume αand βto be independent and additive. Two-way ANOVA can
be computed as follows:
FA=MSR
MSE =
nPj(yj−y)2
m−1
Pi,j (yij −yi−yj+y)2
(n−1)(m−1)
MSR corresponds to the mean-squared difference between groups Yjand
MSE is the mean-squared error. We can test our hypothesis for βj= 0, j=
1, . . . , m. The assumption is that errors are independent and normally dis-
tributed. The probability of accepting our hypothesis H0follows Fdistribution
with n−1 and (n−1)(m−1) degrees of freedom. The test result is given in
Table 10.
Parameter F value Probability Significant? has influence?
Para 1 1.3665 0.2758 YES NO
Para 2 0.6545 0.5295 YES NO
Table 10. Significance test result for parameter 2 and 3
For Parameter 1, the probability of H1
0being true is 0.2758 and therefore sig-
nificant. This means that we can accept the hypothesis that Parameter 1 does
not influence the size of the surface area. To be more precise, the number of
activities in each variant does NOT influence the performance of the approx-
imation algorithm. Similarly, Parameter 2 also has no influence on the size of
the surface area, i.e., whether the variants are similar or not also does NOT
influence the performance of the approximation algorithm.
The statistical results of Parameter 1 also indicate that the performance
of our approximation algorithm is able to scale up, since the goodness of our
approximation algorithm performs does not depend on the size of the process
models.
8 Related work
Ranking techniques have been wildly used in fields like information retrieval [1]
or data mining [23]. In information retrieval, for example, a query results in a list
34
of web sites or documentations, which are ranked according to the relevance of
the searched object. In [13], the authors also provide a technique to retrieve sim-
ilar process models using a query process model. However, this research focuses
on retrieving relevant information rather than analyzing the structure of pro-
cess models. In the workflow field, conformance checking techniques are widely
used to measure the match between the designed process model and its actual
execution behavior [21]. Such technique has also been applied in certain process
mining approaches like genetic mining [4]. [6] also represented a process mining
technique by discovering a collection of process variants. However, a prerequisite
of this approach is a valid change log which documenting all the changes when
configuring the reference process model to each variants. Clearly, such change
log is not always available in our context. Similar techniques for conformance
checking have been applied in process monitoring where people focus on handling
exceptional situations and measuring fulfillment of business rules. [5, 32]. In the
field of web services, service monitoring techniques are also used to monitor the
behavior of the agreed service compositions. Violations of these agreement can
be identified and also be punished [2]. Similar techniques also apply in business
IT alignment measures [24] or security checks [30], where un-predefined business
rules or security protocols are automatically identified and measured. However,
most of the above mentioned approaches analyze behavior in-consistencies to
measure the matching between the designed model and real executions. This be-
havior is different than the structural change on which we focused in this paper
(see [11] for a detailed comparison). Also, few of the above mentioned approach
are able to provide a detailed analysis of every individual process activity based
on the observed in-consistencies.
9 Summary and Outlook
The key contribution of this paper is to provide both a precise ranking algo-
rithm and an approximation ranking algorithm to rank the activities according
to the degrees with which they are potentially involved in reconfiguration a given
reference process model. Using these techniques, we are able to identify which
activities have been configured more often than others. Such information is valu-
able for identifying optimization for the currently used (reference) process model
when understanding the reference model or when re-engineering process models.
It can also be used in process monitoring to identify which parts of a process
(i.e., composite service) has been adapted more often than others during run
time execution.
The precise ranking algorithm is precise but also time-consuming. Therefore,
we introduced the approximation ranking algorithm, which can be computed in
polynomial time, and evaluated it by performing an simulation. After having
analyzed about 3600 process models, we demonstrated the performance of the
approximation ranking algorithm and identified several parameters which can
significantly influence the performance of the approximation ranking algorithm.
The overall preciseness of the approximation ranking algorithm is around 80%
35
and the performance is not dependent on the size or the similarities between
process variants. This result indicates that the approximation ranking algorithm
is able to scale up since the performance is not dependent on the size of process
model we analyze.
The proposed approach allows to identify the activities which are the problem
makers, i.e., which are potentially responsible for change changes when config-
uring the process variants. Our next step is to also make use of the suggested
technique for process variant mining [10]. In process variant mining, we can
discover a reference process model by mining a collection of process variants.
However, such technique does not consider the original reference model and
therefore could generate a spaghetti-like structure which is too different from
the original reference model. This is surely not preferred since re-engineering the
reference model from the old version to the newly discovered one would be too
complex and costly. The ranking algorithm discussed in this paper provides an
opportunity to take the original reference model into consideration. This way,
we can only focus on the highly ranked activities and the trivial configurations
will not be considered in the end result.
References
1. H.M. Blanken, A.P. de Vries, H.E. Blok, and L. Feng. Multimedia Retrieval.
Springer, 2007.
2. L. Bodenstaff, A. Wombacher, M. Reichert, and M. C. Jaeger. Monitoring de-
pendencies for SLAs: The mode4sla approach. In IEEE SCC (1), pages 21–29,
2008.
3. T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algo-
rithms, Second Edition. The MIT Press and McGraw-Hill Book Company, 2001.
4. A.K. Alves de Medeiros. Genetic Process Mining. PhD thesis, Eindhoven Univer-
sity of Technology, NL, 2006.
5. D. Grigori, F. Casati, U. Dayal, and M. Shan. Improving business process quality
through exception understanding, prediction, and prevention. In VLDB ’01, pages
159–168, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
6. C.W. G¨unther, S. Rinderle-Ma, M. Reichert, W.M.P. van der Aalst, and J. Recker.
Using process mining to learn from process changes in evolutionary systems. Int’l
Journal of Business Process Integration and Management, 3(1):61–78, 2008.
7. T. H. Haveliwala. Topic-sensitive pagerank: A context-sensitive ranking algorithm
for web search. IEEE Trans. Knowl. Data Eng., 15(4):784–796, 2003.
8. J. Hidders, M. Dumas, W.M.P. van der Aalst, A.H.M. ter Hofstede, and J. Verelst.
When are two workflows the same? In CATS ’05:, pages 3–11, Darlinghurst, Aus-
tralia, 2005.
9. David Hull. Using statistical testing in the evaluation of retrieval experiments. In
SIGIR ’93: Proceedings of the 16th annual international ACM SIGIR conference
on Research and development in information retrieval, pages 329–338, New York,
NY, USA, 1993. ACM.
10. C. Li, M. Reichert, and A. Wombacher. Discovering reference process models by
mining process variants. In ICWS’08, pages 45–53. IEEE Computer Society, 2008.
11. C. Li, M. Reichert, and A. Wombacher. Mining process variants: Goals and issues.
In IEEE SCC (2), pages 573–576. IEEE Computer Society, 2008.
36
12. C. Li, M. Reichert, and A. Wombacher. On measuring process model similarity
based on high-level change operations. In ER ’08, pages 248–262. Springer LNCS
5231, 2008.
13. R. Lu and S. W. Sadiq. On the discovery of preferred work practice through
business process variants. In ER, pages 165–180. Springer, 2007.
14. R. Lu and S.W. Sadiq. Managing process variants as an information resource. In
BPM’06, pages 426–431, 2006.
15. Jan Mendling, Gustaf Neumann, and Wil M. P. van der Aalst. Understanding
the occurrence of errors in process models based on metrics. In CoopIS’07, LNCS
4803, pages 113–130, 2007.
16. B. Mutschler, M. Reichert, and J. Bumiller. Unleashing the effectiveness of
process-oriented information systems: Problem analysis, critical success factors
and implications. IEEE Transactions on Systems, Man, and Cybernetics (Part
C), 38(3):280–291, 2008.
17. M. Reichert and P. Dadam. ADEPTflex -supporting dynamic changes of workflows
without losing control. Journal of Intelligent Information Systems, 10(2):93–129,
1998.
18. S. Rinderle, M. Jurisch, and M. Reichert. On deriving net change information from
change logs - the deltalayer-algorithm. In BTW’07, pages 364–381, 2007.
19. S. Rinderle, M. Reichert, M. Jurisch, and U. Kreher. On representing, purging, and
utilizing change logs in process management systems. In BPM’06, pages 241–256.
LNCS 4102, Springer, 2006.
20. M. Rosemann and W.M.P. van der Aalst. A configurable reference modelling
language. Inf. Syst., 32(1):1–23, 2007.
21. A. Rozinat and W. M. P. van der Aalst. Conformance checking of processes based
on monitoring real behavior. Inf. Syst., 33(1):64–95, 2008.
22. D.J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures.
CRC Press, 2004.
23. P.N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-
Wesley, 2005.
24. W. M. P. van der Aalst. Business alignment: using process mining as a tool for
delta analysis and conformance testing. Requir. Eng., 10(3):198–211, 2005.
25. W.M.P. van der Aalst and T. Basten. Inheritance of workflows: an approach to
tackling problems related to change. Theor. Comput. Sci., 270(1-2):125–203, 2002.
26. W.M.P. van der Aalst and T.Basten. Identifying commonalities and differences
in object life cycles using behavioral inheritance. In ICATPN ’01, pages 32–52,
London, UK, 2001. Springer-Verlag.
27. W.M.P. van der Aalst, B.F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and
A.J.M.M. Weijters. Workflow mining: a survey of issues and approaches. Data
Knowl. Eng., 47(2):237–267, 2003.
28. J. Vanhatalo, H. Volzer, and J. Koehler. The refined process structure tree. In
BPM’08, pages 100–115. Springer, 2008.
29. B. Weber, M. Reichert, and S. Rinderle-Ma. Change patterns and change support
features - enhancing flexibility in process-aware information systems. Data and
Knowledge Engineering, 66(3):438–466, 2008.
30. B. Weber, M. Reichert, W. Wild, and S. Rinderle. Balancing flexibility and security
in adaptive process management systems. In CooplS ’05, pages 59–76, 2005.
31. M. Weske. Business Process Management. Springer, 2007.
32. M. zur Muehlen and M. Rosemann. Workflow-based process monitoring and con-
trolling - technical and organizational issues. In HICSS ’00, page 6032, Washington,
DC, USA, 2000. IEEE Computer Society.
37
33. Michael zur Muehlen and Jan Recker. How much language is enough? theoretical
and practical use of the business process modeling notation. In CAiSE’08, pages
465–479. LNCS 5074, Springer, 2008.
A Block-enumerating algorithm
Let S= (N, E, . . .)∈ P be a process model with N=a1, . . . , an. Let further
A|N|×|N|be the order matrix of S. Two activities aiand ajcan form a block if
and only if ∀ak∈N\ai, aj:Aik =Ajk, i.e., two activities can form a block if
and only if then shows exactly same order relations to the rest of activities. We
can describe the block detecting algorithm as follows:
input : A process model Sand its order matrix A
output: A set BS with all possible blocks
Define BSxbe a set of blocks containing blocks with xactivities. x= (1, . . . , n);1
Define each activity aias a block B,i= (1,...,n);2
Put all these blocks Bin BS1/* initial state */ ;3
for i= 2 to n/* Compute BSn*/;4
do5
let j = 1; let k = i;6
while j < k do7
k = i - j;8
forall (∀Bj∈BSj)and (∀Bk∈BSk);9
/* judge whether Bjand Bkcan form a block */;
do10
if BjTBk=∅then11
if ((∀aα∈Bj) and (∀aβ∈Bk) and (∀aγ∈N\BjSBk)) ;12
satisfies Aαγ =Aβγ then13
Define Bi=BjSBk;14
Put Biin BSi;15
end16
end17
end18
j ++ ;19
end20
end21
BS =Sx∈(1,...,n)BSx
22
Algorithm 2: Block enumerating algorithm
In the block enumerating algorithm, the initial state is that each activity
forms a single block for itself, i.e., blocks only with one activity in it (line 3).
Then we try to compute each block set BSxwhich contains all blocks with x
activities in each block (line 4). Line 8 means that a block containing iactivities
can be created by merging two disjoint blocks containing jand kactivities
38
(i=j+k). Lines 9 to 18 shows the algorithm to check whether the two block Bj
and Bkare able to merge into a bigger block: two blocks can merge iff any two
activities aα,aβ from the two blocks Bjand Bkshows unique order relations
to the rest of activities outside the two blocks Bjand Bk(line 11 to 18). The
block set BS containing all possible blocks is the union of block sets BSxwith
x= (0, . . . , n).
The complexity of this algorithm in worst-case scenario is 2nwhile nequals
the number of activities. However, this complexity only happens when any com-
bination of activities is able to form a block (like a process model with all ac-
tivities in parallel with each other). During our simulation, for most of time we
can enumerate the block from all process models within just a few second. This
indicates the complexity in practise is not so high.
B Distance measure
A process model Scan be represented by an order matrix Ashowing the or-
der relations between activities in the model. We therefore measure the distance
between two activities ai,ajby their relations towards other activities. For ac-
tivity ai, we can compute the distributions of activities in the four types of order
relation (cf. 5.1), i.e., we compute how many predecessor aihas and how many
successor aihas, etc. Such distribution can be represented in a 4-dimensional
vector pi= (ni0, ni1, ni∗, ni−) where ni0counts the number of predecessor aihas,
ni1counts how many successors aihas, etc. Similarly, we can compute vector pj
for activity aj. Therefore, the distance between two activities ai,ajin process
model Scan be measured by the geometry distance between piand pj.
For example, take Fig. 4. We can represent activity Cwith pC= (3,2,1,0)
and Fwith pF= (4,1,1,0). The distance between Cand Fis (3 −4)2+ (2 −
1)2+ (1 −1)2+ (0 −0)2= 1.41. Similarly, we can compute the distance between
activity Cand Gis 3.74. Based on the distances, we can see that activity Fis
closer to Cthan activity Gis. It is also clear in the process model that activity
Fis a direct successor of Cwhile Gis not. The distance between an activity and
a block is measured by the average distances between such activity and all the
activities in the block.
We did not choose some common distance measures like shortest path in
graph theory [3] or common number of predecessor / successor in tree structure
[7]. The reason is that these measures can not show the global influence of a
change operation. For example take Fig. 4. If we switch the order between Cand
F, it will have lower influence than if we switch the order of Fand G. Although
these two pairs have both direct predecessor-successor relationship, changing F
and Gwill also influence activity E. Such difference can only be detected if we
use the above mentioned distance measure: distance between Cand Fis 1.41,
while distance between Fand Gis 2.45.
39
C Precision values in different groups
Group Para 2 Para 3 Para 4 Para 5 No. PreList * No. AppList Precision
1 Small Small Random Random 2 1 0.1218
2 Small Small Random Gaussian 4 1 0.2797
3 Small Small Gaussian Random 6 1 0.1379
4 Small Small Gaussian Gaussian 6 1 0.2323
5 Small Medium Random Random 1 1 0.1801
6 Small Medium Random Gaussian 2 1 0.3400
7 Small Medium Gaussian Random 2 1 0.0912
8 Small Medium Gaussian Gaussian 2 1 0.1746
9 Small Large Random Random 1 1 0.1329
10 Small Large Random Gaussian 1 1 0.2571
11 Small Large Gaussian Random 1 1 0.1911
12 Small Large Gaussian Gaussian 1 1 0.0569
13 Medium Small Random Random 32 1 0.1973
14 Medium Small Random Gaussian 1.008E+4 1 0.2704
15 Medium Small Gaussian Random 432 1 0.0517
16 Medium Small Gaussian Gaussian 54 1 0.1752
17 Medium Medium Random Random 12 1 0.1805
18 Medium Medium Random Gaussian 4 1 0.3800
19 Medium Medium Gaussian Random 720 1 0.0497
20 Medium Medium Gaussian Gaussian 96 1 0.1868
21 Medium Large Random Random 4 1 0.1141
22 Medium Large Random Gaussian 16 1 0.3685
23 Medium Large Gaussian Random 12 1 0.1078
24 Medium Large Gaussian Gaussian 24 1 0.1852
25 Large Small Random Random 1.271E+16 1 0.1099
26 Large Small Random Gaussian 3.583E+8 1 0.2201
27 Large Small Gaussian Random 1.229E+20 1 0.0973
28 Large Small Gaussian Gaussian 2.152E+14 1 0.1797
29 Large Medium Random Random 2.877E+10 1 0.1868
30 Large Medium Random Gaussian 8.847E+6 1 0.2412
31 Large Medium Gaussian Random 1.672E+11 1 0.1402
32 Large Medium Gaussian Gaussian 4.977E+7 1 0.2347
33 Large Large Random Random 1.084E+12 1 0.2212
34 Large Large Random Gaussian 1024 1 0.3752
35 Large Large Gaussian Random 1.161E+8 1 0.2281
36 Large Large Gaussian Gaussian 13824 1 0.2596
Table 11. Precision values for different groups
*Note: the maximal possible number of permutations:
1. Para 2 = ”Small” : 10! = 3.629E+6
2. Para 2 = ”Medium” : 20! = 2.433E+18
3. Para 2 = ”Large” : 50! = 3.041E+64