scieee Science in your language
[en] (orig)
Engineering a Highly Scalable
Object-aware Process Management Engine
Using Distributed Microservices
Kevin Andrews, Sebastian Steinau, and Manfred Reichert
Institute of Databases and Information Systems, Ulm University, Germany
{firstname.lastname}@uni-ulm.de
Abstract. Scalability of information systems has been a research topic
for many years and is as relevant as ever with the dramatic increases in
digitization of business processes and data. This also applies to process-
aware information systems, most of which are currently incapable of
scaling horizontally, i.e., over multiple servers. This paper presents the
design science artifact that resulted from engineering a highly scalable
process management system relying on the object-aware process man-
agement paradigm. The latter allows for distributed process execution
by conceptually encapsulating process logic and data into multiple in-
teracting objects that may be processed concurrently. These objects, in
turn, are represented by individual microservices at run-time, which can
be hosted transparently across entire server clusters. We present mea-
surement data that evaluates the scalability of the artifact on a compute
cluster, demonstrating that the current prototypical implementation of
the run-time engine can handle very large numbers of users and process
instances concurrently in single-case mechanism experiments with large
amounts of simulated user input. Finally, the development of scalable
process execution engines will further the continued maturation of the
data-centric business process management field.
1 Problem Definition
For decades, researchers have been examining parallelism, concurrency, and scal-
ability in computer hard- and software. The topic of scalability also became
instantly relevant to workflow management systems (WfMS) when they first
showed up on the market, as they were explicitly built with large-scale appli-
cations in mind [16]. First attempts to create scalable WfMS applied existing
scalable architecture principles. The resulting approaches, such as WIDE [7],
OSIRIS [17], ADEPTdistribution [5], and Gridflow [6], focused on the system ar-
chitecture point of view, largely ignoring other aspects, such as role assignments,
permissions, and data flow. However, the process models these approaches, espe-
cially Gridflow, are meant to support, are typically high-performance production
workflows, where these aspects merely play a secondary role [16].
Furthermore, with the increasing digitization of business processes and data in
recent years, the scalability and speed of process management systems that fo-
cus on the execution of human-centric processes has become a relevant topic.
This is most noticeable in the widespread availability of commercial cloud-based
BPM solutions and even academic implementations [18]. Such cloud-based BPM
solutions rely heavily on a highly scalable back end architecture to enable multi-
tenancy without creating individual virtual machines for each customer. Cur-
rently, however, no truly hyperscale process execution engine exists that com-
bines a highly scalable back end with a process execution concept that allows
more users to work on the same process instance than is possible in traditional,
activity-centric process management, where the number of concurrently working
users is typically limited by the branches in the process model [4].
This paper presents our in-depth experiences and details of the artifact that
shall fill this gap: the PHILharmonicFlows process engine, which is implemented
based on the object-aware process management paradigm [13]. Section 3 presents
the solution objectives, detailing what we set out to achieve. Before delving into
the artifact design and development in Section 4, we establish the necessary fun-
damentals for object-aware process management in Section 2. The evaluation of
the artifact can be found in Section 5. Section 6 gives a brief overview on related
work. Finally, Section 7 summarizes the contribution and gives an outlook.
2 Fundamentals of Object-Aware Process Management
PHILharmonicFlows, the object-aware process management framework we are
using as a test-bed for the concepts presented in this paper, has been under de-
velopment for many years at Ulm University [4,13,20,3,14]. PHILharmonicFlows
takes the idea of a data-driven and data-centric process management system,
enhancing it with the concept of objects. One such object exists for each busi-
ness object present in a real-world business process. As can be seen in Fig. 1,
an object consists of data, in the form of attributes, and a state-based process
model describing the object lifecycle.
Amount: IntegerAmount: IntegerAmount: Integer Date: DateDate: DateDate: Date Approved: BoolApproved: BoolApproved: Bool
Initialized Decision Pending
Approved
Rejected
AmountAmount DateDate ApprovedApproved
Comment: StringComment: StringComment: String
Approved == true
Approved == false
Transfer
Lifecycle
Attributes
Assignment: Customer Assignment: Checking Account Manager
CommentComment
Fig. 1. ”Transfer” Object, with Lifecycle Process and Attributes
2
The attributes of the Transfer object (cf. Fig. 1) include Amount,Date,Com-
ment, and Approved. The lifecycle process, in turn, describes the different states
(Initialized,Decision Pending,Approved, and Rejected), a Transfer object may
pass during process execution. In turn, a state contains one or more steps, each
referencing exactly one of the object attributes, thereby forcing that attribute
to be written at run-time. The steps are connected by transitions, allowing them
to be arranged in a sequence. The state of the object changes when all steps in
a state are completed. Finally, alternative paths are supported in the form of
decision steps, an example of which is the Approved decision step.
As PHILharmonicFlows is data-driven, the lifecycle process for the Transfer
object can be understood as follows: The initial state of a Transfer object is Ini-
tialized. Once a Customer has entered data for the Amount and Date attributes,
the state changes to Decision Pending, which allows an Account Manager to in-
put data for Comment and Approved. Based on the value of Approved, the state
of the Transfer object changes to either Approved or Rejected. Obviously, this
fine-grained approach to modeling business processes increases complexity com-
pared to the activity-centric paradigm, where the minimum granularity of a user
action is one atomic activity or task, instead of an individual data attribute.
Bank Transfer DecisionBank Transfer Decision
27.000
03.06.2017
true
Amount
Date
Approved*
Submit
Comment
Fig. 2. Example Form
Additionally, the object-aware approach allows for
automated form generation at run-time. This is
facilitated by the lifecycle process of an object,
which dictates the attributes to be filled out be-
fore the object may switch to the next state, re-
sulting in a personalized and dynamically created
form. An example of such a form, derived from
the lifecycle process in Fig. 1, is shown in Fig. 2.
Customer 1
Checking
Account 1
Employee 1
Customer 2
Checking
Account 2
Checking
Account 3
Transfer 2 Transfer 1 Transfer 3
Fig. 3. Data Model
Note that a single object and its resulting form
only constitutes one part of a PHILharmonicFlows
process. To allow for complex executable business
processes, many different objects may have to be
involved [20]. The entire set of objects and rela-
tions present in a PHILharmonicFlows process is
denoted as the data model, an example of which
can be seen in Fig. 3. In addition to the objects,
the data model contains information about the re-
lations existing between them. A relation consti-
tutes a logical association between two objects,
e.g., a relation between a Transfer and a Checking
Account. The resulting meta information, i.e., the
information that the Transfer in question belongs
to this specific Checking Account, can be used to
coordinate the execution of the two objects.
Finally, complex object coordination, which is indispensable for processes with
interacting business objects, is supported as well [20,19]. As objects publicly
advertise their state information, the current state of an object is utilized as an
3
abstraction to coordinate with other objects through a set of constraints, defined
in a coordination process. As an example, consider the following constraint: “Only
4Transfer objects that have a relation to the same Checking Account may be in
the Approved state at the same time”. A coordination process is always attached
to one object, and can coordinate all objects related to that object.
In summary, objects each encapsulate data (i.e., the attributes) and logic (i.e.,
object behavior) which is based on attribute values (i.e., the lifecycle process).
Therefore, objects are largely independent of each other, apart from the relations
that exist between them, which are necessary for their orchestration by the
coordination processes.
3 Solution Objectives
The artifact we present in this work is the distributed object-aware process ex-
ecution engine PHILharmonicFlows. As stated in the problem definition, there
are two main objectives we wish to achieve with this artifact: develop a highly
scalable process management system, which uses an underlying process manage-
ment concept that allows for a multitude of users to work concurrently on the
same process instance (i.e., Objective 1) and is engineered in a way so that it
scales out well when more hardware resources are added (i.e., Objective 2).
Objective 1 can be achieved by breaking with the traditional activity-centric
process management approach, i.e., employing an approach such as artifact-
centric [8], case handling [1], or object-aware process management [14]. Due
to the fact that in such approaches the tasks users have to perform are not
shoehorned into atomic activities, such as forms, which only one person at a
time can view and edit, they are intrinsically more scalable from a user point of
view as more users may execute parts of process instances concurrently [4].
Objective 2, engineering the process engine in a way such that it becomes highly
scalable in regards to the hardware it runs on, requires a more precise explana-
tion. First, it is necessary to clarify that the goal is not to create a system that
scales up well, i.e. vertically with a more powerful processor, but scales out well
horizontally with more computers added to a data center or computing cluster,
as this is generally more desirable in cloud scenarios. In particular, we aim at
achieving ideal linear speedup, i.e., increasing the amount of available processing
power by factor pshould speed up process execution by p[12].
Finally, while one can design an activity-centric process management engine in
a way that allows it to scale well on the hardware side, the benefits of combining
the necessary software architecture aspects with the conceptual aspects of object-
aware process management are what make the artifact original.
4 Design and Development
The PHILharmonicFlows process engine has undergone many development cy-
cles, starting with an extensive research phase concerning the conceptual foun-
4
dations (cf. Section 2). The most important architectural iteration concerned
the scalability aspects described in this paper.
4.1 Architectural Challenges
Initially, we had opted for a relational database to hold the process execution
state, as it is common in many other process engines. However, during the
testing phase of the development iteration, we noticed that the prototype was
plagued by severe scalability issues when confronting it with large numbers of
concurrent users and objects. Note that, as opposed to more traditional (i.e.,
activity-centric) process management technology, object-aware process manage-
ment needs to react to very fine-grained user actions. Due to the nature of the
lifecycle processes and the dynamically generated forms, the engine has to react
to each data element input by users at run-time. In consequence, there is a much
higher frequency at which small workloads have to be completed by the process
engine. Note that this is the price one has to pay for the increase in flexibility
compared to activity-centric systems, which use predefined forms and allow for
far fewer possible execution variants.
As mentioned, a relational database is not ideally suited as the backbone for such
a data-centric and data-driven system, as each of the actions a user completes
on one object may have effects on multiple other objects, each represented by
rows in database tables. As a consequence, a large number of rows (for various
attributes, objects, and relations) have to be loaded into memory to determine
the effects of a user action. In particular, if actions lead to large amounts of small
changes to other objects, time is wasted on locking/unlocking rows and tables,
slowing the system down considerably. Furthermore, the necessary communica-
tion between the individual objects is not predictable up front, as it depends
on (1) the coordination process, (2) the structure of the data model, i.e., the
relations that exist between the objects in question, and (3) the current state of
each object. As these factors may change at any time during process execution,
common techniques (e.g. relying on query optimization) are not ideal for our use
case. In consequence, at least conceptually, object-aware process management
has an unpredictable n-m-n communication pattern, a very simple depiction of
which can be seen in Fig. 4. After carefully examining this unpredictable pattern
we opted for a distributed approach to persisting the state of objects.
O2O2
O2 O2 O2
O2
O2
O3
O4
O4O1
O1O1
Client 6Client 6 Client 2Client 2 Client 9Client 9
Client 5Client 5 Client 3Client 3
Client 12Client 12
O1
Client 14Client 14 Client 13Client 13
NameName
Client Machine
Name
Object Instance
(lifecycle + data)
Fig. 4. Object Communication Pattern (Simplified)
5
4.2 Design Methodology
After experimenting with highly distributed and lightweight document databases,
such as RethinkDB and Couchbase, and not achieving satisfying performance
results, we decided to not only distribute the persistence layer, but also the
computation. To facilitate this, we applied actor model theory. The actor model
is a well established theoretical foundation for concurrency in parallel systems in
which the computation and persistence layer are split into largely independent
primitives called actors [2]. Furthermore, it is supported by a number of frame-
works for highly scalable applications and supports communication patterns such
as the one present in object-aware process management (cf. Fig. 4).
Background on Actor Model Theory The following gives an overview on the
theory behind the actor model. In essence, an actor consists of a message queue
and a store for arbitrary data. The actor can receive messages and handle them
using the data from its store or by sending messages to other actors. An actor,
however, may only complete exactly one task at a time, i.e., an actor currently
servicing a request from its message queue may only work on answering that one
message, whereas all others are ignored until the current message is removed
from the queue. An actor system usually consists of a very large number of
actors of different types, which each type having different functionality which
can be completed by any instances of that type.
An example of the communication between a set of actors of different types, A,
B, and C, can be seen in Fig. 5. Note that actor A receives a request from an
external source, depicted by the message in its message queue. As each actor
represents a small unit of functionality and data, an actor of type A might not
have all the information or functionality to service the request, which necessitates
communication with the other actors, e.g., B and C. However, the part of the
request redirected at actor B triggers communication between Actor B and actor
C. Due to the single conceptual “thread”, as well as the forced message queuing,
most concurrency problems concerning persistence and computation, such as
race conditions and dirty read/writes, cannot occur in an actor system.
Thread
Data
ActorB
Messages
Thread
Data
Actor C
Messages
Thread
Data
ActorA
Messages
Fig. 5. Actor Communication
6
Applying the Actor Model While the actor model may not be fitting for
systems with few, yet long-running, tasks, it is well suited to handle the chal-
lenges posed by a multitude of communicating and interacting objects present
in an object-aware data model. While the actor model itself guarantees a highly
scalable distributed system with few concurrency problems, the main challenge
is to actually apply it to an existing concept or software system. In the context
of our work, this means finding the correct mapping of the various conceptual
elements of object-aware process management to individual actors. To this end,
we elect to treat each object in a data model as an actor, which allows us to keep
the conceptual elements contained in the object, i.e., the lifecycle process on the
computation side, and the the attributes on the persistence side, together as one
unit. Note that this ensures that updates to individual objects, such as changes
to attribute values (and the resulting updates to the lifecycle processes), can be
handled independently from unrelated objects.
Note that the coordination processes (cf. Section 2) coordinate the various ob-
jects based on (1) their current state as well as (2) their relations to other objects.
To enable (1), objects that advance to a different state after one of their attribute
values is changed must inform their coordination process of the respective state
change. We facilitate this by leveraging the actor message pattern described in
Section 4.2. To enable (2), we encapsulate the relations themselves as actors.
This allows the actor representing a new relation to send a message informing
the coordination processes of its creation. Finally, to ensure that the coordina-
tion processes handle such messages in the same way as objects, i.e., ordered
and race condition free, we redesigned them to be actors as well.
Cluster
Server
Microservice
Actor Model
Conceptual
Element
Fig. 6. Encapsulation Dia-
gram of Conceptual Elements,
Actors, and Microservices
To be precise, all high-level conceptual elements
present in object-aware processes, i.e., objects,
relations, coordination processes, and the data
model, are represented as actors in the PHILhar-
monicFlows engine. As the actor model allows for
independent execution of logic, using only well-
regulated message exchanges at certain points, ac-
tors can be hosted as individual microservices in
computation environments. This results in a five
layer concept for the process management engine,
a sketch of which is shown in Fig. 6. As Fig. 6 im-
plies, all high-level conceptual elements of object-
aware process management can be interpreted as actors, allowing them to be
hosted in microservices on a server that can be part of a cluster. Clearly, Fig. 6
abstracts from the fact that multiple instances of any conceptual element may
exist, resulting in one microservice per instance of the conceptual element exist-
ing at run-time.
Note that we chose to keep the more fine-grained conceptual elements, such as
attributes, permissions, roles, and lifecycle states and steps as state information
inside the actors representing the objects. Taking the checking of an attribute
7
permission as an example, as shown in Fig. 7, it becomes evident why having
the permissions and attributes as independent actors would make little sense.
Employee1
(1) hasWritePermission(CheckingAccount1,Balance,Opened)
P1(write,rr1,CheckingAccount,Balance,Opened,[SecurityLevel=0])P1(write,rr1,CheckingAccount,Balance,Opened,[SecurityLevel=0])P1(write,rr1,CheckingAccount,Balance,Opened,[SecurityLevel=0])
P2(write,rr2,Transaction,Amount,Sent,[])P2(write,rr2,Transaction,Amount,Sent,[])P2(write,rr2,Transaction,Amount,Sent,[])
Pn(...)Pn(...)Pn(...)
RR1(CustomerToEmployee,[Department=AccountManagement])RR1(CustomerToEmployee,[Department=AccountManagement])
RR2(CustomerToTransaction,[Department=AccountManagement“]
)
RR2(CustomerToTransaction,[Department=AccountManagement“]
)
RRN(...)RRN(...)
Permissions Roles
Lifecycle Instance Attribute Instances
Department[String] : AccountManagementDepartment[String] : AccountManagementDepartment[String] : AccountManagement
Name[String] : Paul DentonName[String] : Paul DentonName[String] : Paul Denton
CheckingAccount1
Lifecycle Instance Attribute Instances
SecurityLevel[Byte] : 0SecurityLevel[Byte] : 0SecurityLevel[Byte] : 0
Balance[Integer] : 133700Balance[Integer] : 133700Balance[Integer] : 133700
Interest[Float] : 1.2Interest[Float] : 1.2Interest[Float] : 1.2
Initialized Opened
Closed
Frozen
(2) Search matching permission
(3) Get role information
(4) Check attribute restriction
(5) Request attribute value
(8) Return true
(6) Get attribute value
(7) Return 0
Fig. 7. Permission Request
In the example, two objects, Employee1 and CheckingAccount1, each represented
by an actor at run-time, are shown. Note that only markings (1),(5),(7), and
(8) constitute messages being passed between actors, all other markings are just
operations that are completed using data local to the respective actor, the details
of which are not relevant at this point, but can be found in previous work [3].
After the initial request (1) arrives in the message queue of Employee1, it has
to complete a number of operations. Furthermore, this specific request causes
the Employee1 actor to communicate (5) with the CheckingAccount1 actor. In
particular, granting the permission depends on the value of the SecurityLevel
attribute, which is only present in the actor for CheckingAccount1. Finally, after
actor CheckingAccount1 returns the value of the attribute (7), actor Employee1
can return the result of the request (8).
8
Obviously, as the two actors for Employee1 and CheckingAccount1 may be lo-
cated on different servers in the computing cluster, a request like this can in-
troduce some communication overhead. However, as our goal is not to create an
artifact that can resolve a single request to an object-aware process engine as
fast as possible, but instead scales out well, this overhead is negligible, as shown
by our measurement results in Section 5. A more fine-grained decomposition,
e.g., putting every permission, attribute, and step of the lifecycle process into
individual actors, would render no additional benefit as the information they
represent is often only needed by the object they belong to.
In summary, we chose the actor model as a basis for the PHILharmonicFlows en-
gine as it fits well to the already established conceptual elements of object-aware
process management. Furthermore, as each actor maintains its own which solely
this actor may access, we chose the granularity of decomposition of conceptual
elements to actors in a way such that requests to an actor, e.g., an object, can
be serviced without too much communication overhead. Moreover, having all
data of an object located in the corresponding actor prevents common concur-
rency problems, such as dirty reads on the attributes, as the data may only be
accessed by one incoming request to the actor at the same time. Finally, the ar-
chitecture, which is based on the actor model allows us to host each actor in its
own microservice. As microservices enable high degrees of horizontal scalability,
any computational logic that can be partitioned into individual actors can be
scaled across multiple servers well. This means that scaling out the PHILhar-
monicFlows engine horizontally is possible on the fly, which makes it ideal for
scalable business process execution in cloud scenarios.
Architecture Several actor model based microservice frameworks exist, notable
ones include Akka1(Java), CAF2(C++), and Orleans3(.NET). As the preexisting
code base of PHILharmonicFlows was written in .NET, we utilize a modified
version of Orleans, which is part of the open source Service Fabric Reliable Actors
SDK4, for our prototype. Service Fabric allows us to run actor-based applications
on development machines, on-premise research computing clusters, and in the
cloud. This is enabled by a transparent placement of instantiated actors across
all available servers. A diagram of the entire architecture of our engine is shown
in Fig. 8. The “Actor Services” and “Framework Services” are hosted across all
servers in the cluster, ensuring that they are all fully capable of instantiating
actor microservices, such as those shown above “Instantiated Microservices”. All
depicted elements are implemented and working to the specifications of object-
aware process management. Finally, note that the engine is still only considered a
prototype for our artifact, as we have not yet conducted technical action research
with real-world clients.
1https://github.com/akka
2https://github.com/actor-framework
3https://github.com/dotnet/orleans
4https://github.com/Azure/service-fabric-services-and-actors-dotnet
9
ClientsClients
User Simulation
+ Monitoring
Tool
ClusterCluster
Modeling ClientModeling Client
Runtime ClientRuntime Client
Shared REST
Interface
(Swagger) HTTP Cluster
Gateway
(Stateless)
Cluster ServersCluster Servers
Firewall
(Only HTTP
Communication)
Firewall
(Only HTTP
Communication)
Service
Fabric
Hosting &
Activation
Communication
Subsystem
Communication
Subsystem
Transport
Subsystem
Transport
Subsystem
Reliable State
Manager
Relation
Actor Service
Object
Actor Service
Coordination
Actor Service
Data Model
Actor Service
Object
Instance #1
Object
Instance #1
Object
Instance #2
Object
Instance #2
Object
Instance #3
Object
Instance #3
Object
Instance #4
Object
Instance #4
Data Model
Instance #1
Data Model
Instance #1
Object
Instance #6
Object
Instance #6
Coordination
Instance #2
Coordination
Instance #2
Relation
Instance #1
Relation
Instance #1
Relation
Instance #2
Relation
Instance #2
Coordination
Instance #1
Coordination
Instance #1
Object
Instance #9
Object
Instance #9
Object
Instance #5
Object
Instance #5
Relation
Instance #6
Relation
Instance #6
Relation
Instance #3
Relation
Instance #3
Instantiated Microservices
Actor Services Framework Services
Fig. 8. Architecture Overview
5 Evaluation and Demonstration
As other aspects of the PHILharmonicFlows process management system are still
under development, both conceptually and from an implementation perspective,
we have not completed technical action research to verify the solution objec-
tives. Instead, we are currently conducing single-case mechanism experiments
to evaluate the capabilities of the developed architecture, the results of which
we present in this section. Before, however, we discuss conceptual performance
limitations of object-aware processes as well as the measurement methodology.
5.1 Conceptual Performance Limitations
As each object in an object-aware data model can be interacted with individ-
ually, there are hardly any bottlenecks to be expected for the execution of the
object lifecycle processes. However, there is the issue of coordinating the exe-
cution of inter-related objects that needs particular consideration. In general,
coordination processes are informed by the objects when their states change due
to the completion of steps in their lifecycle processes. How often this happens
10
depends on the lifecycle process model of the object in question. In particular,
as these are merely fire-and-forget messages to the coordination process, they do
not actually impact the performance of the object.
Obviously, coordination processes with constraints, such as the example given in
Section 2, are necessary to model real-world processes. However, they do not con-
stitute a scalability issue in the computing sense, as it is often necessary to wait
for other parts of a process to complete in any process management paradigm. To
enable such coordination constraints, it has to be clear which objects are linked
to each other, i.e. have relations between them [19]. The structure of all objects
and the relations between them constitute the data model of an object-aware
process. As all relations are unidirectional and acyclic, the data model forms a
tree-like directed acyclic graph structure, as shown in Fig. 3.
The central performance limitation of the artifact stems from the fact that link-
ing a new object to the existing data model causes a recursive update all the
way to the root object. This is necessary since any objects with coordination
processes on higher levels of the data model have to be informed of new objects
being attached on lower levels in order to coordinate them properly. To ensure
that this operation is completed in an ordered fashion, it is completed by the
actor representing the data model at run-time. As actors may only complete one
action at a time, the linking of objects constitutes a performance bottleneck.
However, objects can only create relations to other objects as specified at design
time in the process model. Thus, the number of relations creatable at run-time
is directly limited by the business process itself. In summary, as in most cases
objects are only related to one or two other objects directly (others transitively),
only few relation creation operations are conducted for each object at run-time.
5.2 Experiment Scenario and Tooling
As our engine is an actor system running on Service Fabric (cf. Section 4.2), it
can be deployed to a cluster with an arbitrary number of servers. For our single-
case mechanism experiments, we employ an on-premise cluster with eight servers
and a total of 64 logical processors (32 cores @ 2.4Ghz + hyper-threading). The
data model used for the experiments is the Banking scenario presented in Section
2, an example of a typical object-aware process model.
Obviously, an object must be instantiated in a microservice before it can be
linked to other objects or be manipulated through attribute value changes. How-
ever, the order in which the attributes of an object are set, or whether they are
set before or after linking the object to others, has no influence on outcome after
all actions are completed. This is due to the fact that object-aware execution is
data-driven, which allows us to design the experiment as follows: (1) we instan-
tiate all objects required for the scenario a predefined number of times, (2) we
link the objects together using a predefined schema, and, finally, (3) we supply
predefined attribute values to each object we instantiated, thereby executing
their lifecycle processes. Our tooling simulates user interaction directly through
the run-time client via REST, using the architecture shown in Fig. 8.
11
We can measure times for all three action types, i.e., instantiation (1), linking
(2), and execution (3), as well as the total time for a run. This allows us to
discuss the three distinct actions and to observe the impact of changing various
parameters on the time they take in relation to the total time.
5.3 Statistical Measurement Methodology
The measurements follow the guidelines for measuring the performance of par-
allel computing systems, as defined in [12]. As a parallel computing system can
render vastly different numbers for the same experiment scenario when run mul-
tiple times, it is hard to measure exact execution times.
Instead of simply measuring the mean or median execution times over a prede-
fined arbitrary number of runs, we use the statistically exact method presented
in [12] to determine the number of runs necessary to ensure high confidence in
an adequately small confidence interval. As we are dealing with non-normally
distributed data, the number of runs has to be determined dynamically i.e.,
there is no analytical method for predetermining it. As it can take several hun-
dred minute long runs to get a narrow confidence interval, this exact calculation
becomes necessary to not waste hours trying to find an adequate number of runs.
To this end, we run Algorithm 1 after every individual experiment run to ensure
that after nruns the confidence interval is tight enough and the confidence in
the interval is high enough. The inputs are the already completed experiment
runs (runs), the maximum width of the confidence interval as a percentage of the
confidence interval average (%CI ), and the minimum confidence in the confidence
interval ((1 α)min). The algorithm returns interval [timelRank, timeuRank], i.e.,
the total time of the runs at the lower and upper bounds of the confidence
interval, as well as (1 α), i.e., the exact confidence level of the returned interval.
Note that Algorithm 1 uses the statistical method described in [15], Annex A,
to determine which of the already measured runs constitute the bounds of the
confidence interval with the required confidence (1 α)min.
Algorithm 1 checkCIWidth
Require: runs[],%CI ,(1 α)min
Ensure: nruns 6.no confidence interval possible for n5
q0.5.quantile 0.5 = median
confidenceLevels[] [(lRank, uRank, (1 α)]
for j= 1 to ndo
for k= 1 to ndo
confidenceLevels BinomialCDF (q, n, k 1) BinomialCDF (q, n, j 1)
end for
end for
for all (lRank, uRank, 1α)in confidenceLevels do
timelRank runs[lRank].time
timeuRank runs[uRank].time
timemean (timelRank +timeuRank )/2
CIwidthmax timemean ×%CI
if timeuRank timelRank CIwidthmax and 1α(1 α)min then
return (timelRank , timeuRank ,1α)
end if
end for
12
5.4 Results
We evaluated how well the artifact achieves the solution objectives (cf. Section 3)
based on the measurement results from the single-case mechanism experiments
presented in this section. The objective of the experiments was (1) to determine
the limits of how many concurrent objects can be processes on the given eight
server hardware setup, (2) find bottlenecks by simulating mass user interactions
and (3) measure the impact of object lifecycle complexity on execution times.
To this end, we conducted the following series of single-case mechanism experi-
ments with an example process model and simulated user input to the current
PHILharmonicFlows engine prototype. While this is not technical action re-
search in the real world, as explained in Section 5.2, the engine is interacted
with exactly as if thousands of real users were executing the individual objects
concurrently. We utilize these results to demonstrate the viability of a highly
scalable actor model based microservice architecture for data-centric process
management systems in general.
Each results table row represents an experiment series with a specific configura-
tion and the following in- and outputs:
Header Description
Input
nDnumber of data models created
Oobject function, type and number of objects created per data model
Rrelation function, A:B:100 indicates 100 relations between objects A and B
Output
nOnumber of created objects across all data models
nRnumber of created relations across all data models
CI tinst confidence interval for total instantiation time
CI tlink confidence interval for total linking time
CI texec confidence interval for total lifecycle execution time
CI ttotal confidence interval for total run time of experiment
1αexact confidence in the CI ttotal confidence interval
nnumber of runs completed to achieve 1αconfidence in CI ttotal
All confidence intervals in all result tables are required to have a maximum
width of 5% in respect to the median measured time, i.e., %CI = 0.05. Further,
the confidence in each interval must be at least 95%, i.e., 1αmin 0.95.
As this means that a measurement with a suspected median time of 1000ms is
required to have a confidence interval of [975ms, 1025ms]5, and the confidence
that the real median time lies in this interval must be at least 95%. Finally,
all time measurements are specified in the mm:ss.ms format. Table 1 shows the
measurement results for a scenario in which we instantiate, link, and execute
varying numbers of Checking Accounts (CA) and Transfers (TR).
550ms width, 5% of 1000ms
13
Table 1. Simple Data Model: Checking Accounts (CA) and Transfers (TR)
nDO R nOnRCI tinst CI tlink CI texec CI ttotal 1α n
1CA:100,
TR:1000 TR:CA:1000 1100 1000 [00:00.960,
00:01.003]
[00:03.002,
00:03.190]
[00:00.422,
00:00.603]
[00:04.499,
00:04.969] 96.49 32
1CA:500,
TR:5000 TR:CA:5000 5500 5000 [00:04.452,
00:04.540]
[00:14.037,
00:14.984]
[00:01.898,
00:03.622]
[00:21.075,
00:23.272] 95.67 24
1CA:1000,
TR:10000 TR:CA:10000 11000 10000 [00:08.657,
00:08.993]
[00:29.127,
00:33.835]
[00:04.499,
00:06.333]
[00:42.898,
00:46.550] 96.14 11
2CA:500,
TR:5000 TR:CA:5000 11000 10000 [00:04.723,
00:04.979]
[00:16.713,
00:18.122]
[00:02.810,
00:04.482]
[00:24.648,
00:27.076] 95.67 25
4CA:250,
TR:2500 TR:CA:2500 11000 10000 [00:02.943,
00:03.307]
[00:11.161,
00:12.908]
[00:02.707,
00:04.004]
[00:17.742,
00:19.554] 95.86 20
8CA:125,
TR:1250 TR:CA:1250 11000 10000 [00:04.130,
00:04.516]
[00:10.109,
00:10.670]
[00:03.239,
00:05.202]
[00:18.493,
00:20.439] 95.3 37
A few things are noteworthy here. First, regarding the first four series, in which
all objects and relations were part of a single data model, one can observe almost
ideal linear scaling of the various median times with the respective workloads,
as evidenced by Fig 9. This demonstrates the capability of the microservices on
the cluster to handle increasing concurrent workloads extremely well. Any below
linear curve would indicate the system becoming congested.
Second, note the super-linear scaling of CI tlink in the four series with 10.000
relations spread over an increasing number of data models (cf. Fig. 10). This in-
dicates a conceptual bottleneck due to the fact that the linking operations must
be ordered to ensure correct recursive updating of the entire data model, as ex-
plained in Section 5.1. To this end, each linking operation is queued and executed
by the actor representing the data model in question, reducing concurrency.
0
5000
10000
15000
20000
25000
30000
35000
40000
45000
50000
01000 2000 3000 4000 5000 6000 7000 8000 9000 10000
MILLISECONDS
AMOUNT RELATIONS
Instantiation ms Linking ms Execution ms Total ms
Fig. 9. Single Data Model
Fig. 10. Multiple Data Models
14
Table 2. Complex Data Model: Employees (EM), Customers (CU), Checking Accounts
(CA), and Transfers (TR)
nDO R nOnRCI tinst CI tlink CI texec CI ttotal 1α n
1
EM:1,
CU:10,
CA:100,
TR:1000
CU:EM:10,
CA:CU:100,
TR:CA:1000,
1111 1110 [00:00.926,
00:00.993]
[00:05.424,
00:06.160]
[00:00.418,
00:00.482]
[00:06.892,
00:07.573] 96.48 14
1
EM:5,
CU:50,
CA:500,
TR:5000
CU:EM:50,
CA:CU:500,
TR:CA:5000,
5555 5550 [00:04.518,
00:04.754]
[00:27.701,
00:30.809]
[00:03.800,
00:05.538]
[00:37.184,
00:40.179] 96.09 9
1
EM:10,
CU:100,
CA:1000,
TR:10000
CU:EM:100,
CA:CU:1000,
TR:CA:10000,
11110 11100 [00:09.110,
00:09.380]
[00:57.553,
01:02.829]
[00:06.504,
00:08.429]
[01:12.990,
01:19.123] 96.14 12
Table 2 shows the experiment results when creating a significantly more complex
data model, with all objects and relations shown in Fig. 3. The objects involved
are Checking Account (CA), Employee (EM), Transfer (TR), and Customer
(CU). However, we adjusted the total amount of objects created to be almost
identical to the experiment series shown in the first three rows of Table 1. Note
the large increase in the CI ttotal interval when comparing Table 2 with Table
1. However, the increase clearly stems from the higher effort when linking new
objects into the more complex data model, as evidenced by the higher CI tlink
values, as the average CI tinst and CI texec values are similar.
Second, consider the wide CI texec intervals in comparison to the very tight
CI tinst intervals (only the CI ttotal interval width is fixed at 5%). This can be
explained with the fact that, while both instantiation and execution are tied to
the complexity of the respective lifecycle process, instantiation is done by the
actor service, while execution is done by the thousands of object actors created
during instantiation and spread across the cluster. As we only have four actor
services active during the experiment series shown in Table 2, the actor frame-
work will distribute them to four distinct servers for each experiment run. As
every server is equal, this leads to almost identical measured times across all ex-
periments. However, as the underlying actor framework does not have knowledge
of the varying complexity of the lifecycles processes in the different objects, the
thousands of objects will also be distributed at random across the eight servers.
In consequence, as some objects take longer to execute their lifecycles than oth-
ers, servers might be unequally tasked in this scenario, leading to comparatively
wide confidence intervals.
Consider the following as a very simplified and extreme example of this effect:
objects of type A may take one second to execute their lifecycle process, whereas
objects of type B may take ten seconds to execute their lifecycle process. If there
15
are two servers and ten objects of each type, in one experiment run both servers
might be tasked with executing five objects of type A and five objects of type B,
which would take 55 seconds in total. Another run might see all objects of type
A executed by one server, taking ten seconds, and all objects of type B executed
by the other server, taking 100 seconds.
To further examine these differences in execution times , we instantiated and
executed each object type used in Table 2 10.000 times. The results of this ex-
periment can be seen in Table 3. As there are no relations in this experiment, we
instead included the number of attributes, steps, and states the objects possess,
as they give an indication as to how complex their lifecycle processes are. The
results show the clear correlations between lifecycle complexity and instantia-
tion/execution time.
Table 3. Comparison of Object Performance: Employees (EM), Customers (CU),
Checking Accounts (CA), and Transfers (TR)
nDO Attributes Steps States CI tinst CI texec CI ttotal 1α n
1 CA:10000 1 3 3 [00:06.537,
00:06.757]
[00:04.076,
00:04.753]
[00:10.731,
00:11.529] 95.1 16
1 EM:10000 2 4 3 [00:08.158,
00:08.726]
[00:04.415,
00:05.857]
[00:12.804,
00:14.118] 96.87 6
1 TR:10000 4 6 4 [00:08.350,
00:08.723]
[00:06.095,
00:07.521]
[00:14.499,
00:15.929] 95.72 30
1 CU:10000 8 11 4 [00:11.974,
00:12.358]
[00:11.222,
00:13.784]
[00:23.576,
00:25.979] 95.64 41
6 Related Work
There is a large amount of literature concerning the scalability of software sys-
tems, little of which is concerned with process management systems. While the
works mentioned in the introduction, WIDE [7], OSIRIS [17], ADEPTdistribu-
tion [5], and Gridflow [6] are fairly dated and do not take modern architectures
into consideration, there is some newer research that does.
[9] and [11] present architectures for distributing business process workloads
between a client-side engine and a cloud-based engine, taking into consideration
that users might not want to store their business data in the cloud. Thus, the
approaches suggest to primarily run compute-intensive workloads on the cloud-
based engine and transfer business data only when necessary. [9] further presents
a method for decomposing the process model into two complementary process
models, one for the client engine, and one for the cloud engine.
[10] introduces a resource controller for cloud environments that monitors pro-
cess execution and can predict how compute intensive future workloads will be
16
depending on previous executions. This data is used for auto-scaling the virtual
machines the process engine and external systems are running on.
[18] gives an in-depth overview of the state of the art concerning cloud business
process architecture, revealing that most current approaches use virtual machine
scaling, simply replicating process engines to achieve scalability.
7 Summary
In this work, we demonstrated the viability of creating a hyperscale process man-
agement system using a data-centric approach to process management on the
conceptual side as well as actor model theory and microservices on the archi-
tecture side. Thus, we maximize the possible concurrent actions conducted by
process participants at any given time during process execution, without simply
relying on virtual machine scaling as most of the related work does. Our measure-
ments show that the PHILharmonicFlows engine scales linearly with the number
of objects created and executed (i.e., user-generated workload) in a single pro-
cess instance. Furthermore, the measurements give insights into what the engine
is capable of, with even this first prototype with no “professional” optimization
executing the lifecycle processes of 11.000 objects in around five seconds. This
is even more impressive considering the fact that these are full executions of
the respective lifecycle processes. As the example objects have an average of 4-5
attributes and steps this equates to around 50.000 user interactions the engine
is handling in the aforementioned five second time frame.
While the data from our single-case mechanism experiments shows that data-
centric process engines have the potential to scale extremely well and to handle
many users working on a single process instance at the same time, we identified
a conceptual bottleneck in the linking of newly created objects to the existing
data model. As this has to be handled centrally by the data model, it cannot
be done concurrently. However, in a more real-world scenario with more running
data models and less objects per data model, this is not much of an issue. This is
especially true when considering that in general the ratio of object instantiations
and linking to actual execution actions, such as form inputs by users, will be low.
In future work, we will work on optimizing the engine further and subjecting
it to more real-world usage when conducing technical action research. Another
important task is comparing the performance of PHILharmonicFlows to activity-
centric process management systems, although finding a process scenario in
which the two very different paradigms are scientifically comparable from a per-
formance perspective will not be easy. Finally, as one of our stated goals is to
increase the maturity and acceptance of data-centric process management, we
are actively developing advanced capabilities, such as ad-hoc changes and variant
support, on top of the presented artifact.
Acknowledgments. This work is part of the ZAFH Intralogistik, funded by the
European Regional Development Fund and the Ministry of Science, Research and
the Arts of Baden-Wᅵrttemberg, Germany (F.No. 32-7545.24-17/3/1)
17
References
1. Van der Aalst, W.M.P., Weske, M., Grünbauer, D.: Case handling: a new paradigm
for business process support. Data & Knowledge Engineering 53(2), 129–162 (2005)
2. Agha, G., Hewitt, C.: Concurrent programming using actors. In: Readings in Dis-
tributed Artificial Intelligence, pp. 398–407. Elsevier (1988)
3. Andrews, K., Steinau, S., Reichert, M.: Enabling fine-grained access control in
object-aware process management systems. In: EDOC. pp. 143–152. IEEE (2017)
4. Andrews, K., Steinau, S., Reichert, M.: Towards hyperscale process management.
In: Proc EMISA. pp. 148–152 (2017)
5. Bauer, T., Reichert, M., Dadam, P.: Intra-subnet load balancing in distributed
workflow management systems. Intl J of Coop Inf Sys 12(03), 295–323 (2003)
6. Cao, J., Jarvis, S.A., Saini, S., Nudd, G.R.: Gridflow: Workflow management for
grid computing. In: Intl Symp on Cluster Comp and the Grid. pp. 198–205 (2003)
7. Ceri, S., Grefen, P., Sanchez, G.: WIDE - a distributed architecture for workflow
management. In: RIDE-WS. pp. 76–79. IEEE (1997)
8. Cohn, D., Hull, R.: Business artifacts: A data-centric approach to modeling busi-
ness operations and processes. Bulletin IEEE TCDE 32(3), 3–9 (2009)
9. Duipmans, E.F., Pires, L.F., da Silva Santos, L.O.B.: Towards a bpm cloud ar-
chitecture with data and activity distribution. In: Enterprise Distributed Object
Computing Conference Workshops (EDOCW), 2012 IEEE 16th International. pp.
165–171. IEEE (2012)
10. Euting, S., Janiesch, C., Fischer, R., Tai, S., Weber, I.: Scalable business process
execution in the cloud. In: Cloud Engineering (IC2E), 2014 IEEE International
Conference on. pp. 175–184. IEEE (2014)
11. Han, Y.B., Sun, J.Y., Wang, G.L., Li, H.F.: A cloud-based bpm architecture with
user-end distribution of non-compute-intensive activities and sensitive data. Jour-
nal of Computer Science and Technology 25(6), 1157–1167 (2010)
12. Hoefler, T., Belli, R.: Scientific benchmarking of parallel computing systems. In:
IEEE SC15. p. 73. ACM (2015)
13. Künzle, V., Reichert, M.: PHILharmonicFlows: towards a framework for object-
aware process management. JSME 23(4), 205–244 (2011)
14. Künzle, V., Reichert, M.: Striving for object-aware process support: How existing
approaches fit together. In: SIMPDA. pp. 169–188. Springer (2011)
15. Le Boudec, J.Y.: Performance evaluation of computer systems. Epfl Press (2010)
16. Leymann, F., Roller, D.: Production workflow: concepts and techniques. Prentice
Hall PTR Upper Saddle River (2000)
17. Schuler, C., Weber, R., Schuldt, H., Schek, H.J.: Scalable peer-to-peer process
management - the osiris approach. In: IEEE ICWS. pp. 26–34 (2004)
18. Schulte, S., Janiesch, C., Venugopal, S., Weber, I., Hoenisch, P.: Elastic business
process management. FGCS 46, 36–50 (2015)
19. Steinau, S., Andrews, K., Reichert, M.: The Relational Process Structure. In: 30th
Int’l Conf. on Advanced Information Systems Engineering (CAiSE). pp. 53–67.
Springer (2018)
20. Steinau, S., Künzle, V., Andrews, K., Reichert, M.: Coordinating business processes
using semantic relationships. In: Proc CBI. pp. 143–152 (2017)
18