scieee Science in your language
[en] (orig)
On Deriving Net Change Information From Change Logs
The DELTALAYER-Algorithm
Stefanie Rinderle1, Martin Jurisch1, Manfred Reichert2
1Institute DBIS, Ulm University, Germany, {stefanie.rinderle, martin.jurisch}@uni-ulm.de
2IS Group, University of Twente, The Netherlands, [email protected]
Abstract: The management of change logs is crucial in different areas of information
systems like data replication, data warehousing, and process management. One bar-
rier that hampers the (intelligent) use of respective change logs is the possibly large
amount of unnecessary and redundant data provided by them. In particular, change
logs often contain information about changes which actually have had no effect on
the original data source (e.g., due to subsequently applied, overriding change opera-
tions). Typically, such inflated logs lead to difficulties with respect to system perfor-
mance, data quality or change comparability. In order to deal with this we introduce
the DeltaLayer algorithm. It takes arbitrary change log information as input and pro-
duces a cleaned output which only contains the net change effects; i.e., the produced
log only contains information about those changes which actually have had an effect
on the original source. We formally prove the minimality of our algorithm, and we
show how it can be applied in different domains; e.g., the post-processing of differ-
ential snapshots in data warehouses or the analysis of conflicting changes in process
management systems. Altogether the ability to purge change logs from unnecessary
information provides the basis for a more intelligent handling of these logs.
1 Introduction
1.1 Problem Description
The management of log information is crucial in different areas of information systems. In
addition to, for example, transaction logs in database management systems the optimized
processing of change (log) information (e.g., differential snapshots [LGM96]) is gaining
more and more importance. In this context, one prominent example is the update of a data
warehouse based on change information. More precisely, for such an update heteroge-
nous data sources are monitored. According to the particular monitoring strategy, updates
within the data sources are then fed into the staging area of the data warehouse (cf. Figure
1a). How these data updates can be determined depends on the kind of data sources. For
example, database systems offer mechanisms such as replication or triggers in this con-
text which ease the extraction and processing of update information. However, very often
data updates have to be extracted from sources (e.g., legacy systems) which do not offer
any support for getting this update information. In such cases, the only way is to produce
snapshots of the data sources before and after the update and to calculate the difference
(i.e., the differential snapshot) between them [LGM96, JNS+97] (cf. Figure 1a). A differ-
Proc. BTW 2007, Aachen, Germany, March 2007
ential snapshot can be expressed by a set of change operations (i.e., INSERT, UPDATE,
DELETE operations) leading from the source data to the target data.
In addition to data warehouses, change information also plays an important role in the
domain of adaptive process management technology [RRD04b]. Such systems enable
process changes at different levels, which occur frequently in practice [Wes01]: Single
process instances may have to be adapted, for example, to deal with exceptional situa-
tions (e.g., by adding, deleting, or shifting process steps, cf. Figure 1b). Furthermore
process templates1may have to be changed, e.g., to react on new regulations or to im-
plement process optimizations. The information about these changes has to be stored for
several reasons. In [RRJK06] we discuss traceability (e.g., for the medical domain) and
correctness checks in the context of concurrent process changes as important use cases.
Delta Layer
Differential
Snapshots
Differential
Snapshots
Staging Area
Legacy
Systems
Differential
Snapshots
DeltaLayer
Post
Processing
Load Netto
Change
a) Post Processing of
Differential Snapshots
A B D
C
E
XDeltaLayer
b) Correctness Checks in Adaptive PMS
Process Template S:
Change Log for S:
INSERT((X, (A,B)),
DELETE((B, NULL))
INSERT((B, NULL))
Process Instances:
A B D
C
E
X
Instance I1:
Change Log for I1:
INSERT((X, (A,B)),
A B D
C
E
Instance I2:
Change Log for I2:
UPDATE((A, (B,D))*,
UPDATE((A, (D,E))
DeltaLayer
DeltaLayer
Correctness Checks:
Are there conflicting
changes at template and
instance level?
*i.e., shift A from current position to
position between B and D
Figure 1: Different Use Cases for the DELTALAYER Approach
Both application domains, data warehouses and process management systems, show that
change logging is very important. However, one problem hampers the management and
the use of change logs: Very often, change logs consists of a possibly large amount of
unnecessary information. In data warehouses, for example, this information comprises
wasteful DELETE/INSERT or INSERT/DELETE pairs within the differential snapshots;
these pairs are often produced when calculating the differential snapshots. A wasteful
DELETE/INSERT pair expresses the deletion of a database entry X followed by the in-
sertion of X (i.e., the same entry) in the sequel. The same problem arises in the context of
1Process templates describe the general structure of a process (e.g., control and data flow), cf. Figure 1b.
process management systems where change logs may contain change operations overrid-
ing the effects of previous changes. One example is illustrated for process instance I2 in
Figure 1b: The first user updates the position of process activity A to (B, D) followed by
an update of the second user to position (D, E) (i.e., the overall effect for both UPDATE
operations is UPDATE(A,(D,E)). The presence of such unnecessary information within
change logs may cause a number of problems and difficulties:
Performance problems due to inflated logs: In case of differential snapshots this
may lead to performance problems in case of tight update windows [DDJ+98].
Data quality problem: In data warehouses, redundant information from differential
snapshots has to be cleaned within the staging area. Redundant change information
might be desirable for traceability reasons on the one hand. On the other hand,
DELETE/INSERT and INSERT/DELETE pairs do not have any real background
and therefore should to be cleaned before analyzing the data.
Comparability problem: In process management systems many correctness checks
(e.g., dealing with concurrently applied changes) are based on comparing change
logs. In this context, unnecessary information leads to non-comparable change logs,
which significantly hampers correctness checks in the sequel [RRD04a, RRJK06].
This paper aims at tackling these problems which might be relevant for other use cases as
well (e.g., when updating materialized views in data warehouses).
1.2 Contribution
In [RRJK06] we have already discussed requirements for capturing (process) change logs
in adaptive process management systems. We have also shown how to purge unnecessary
information from process change logs at the logical level. In this paper, we present the
DELTALAYER algorithm which takes arbitrary change log information as input and pro-
duces a cleaned output which only contains the net change effects (i.e., information about
changes which actually have had an effect on the database, the data source, or the process
repository). The resulting net changes are considered as being correct if they are mini-
mal with respect to the sets of inserted, deleted, and updates tuples. This constitutes an
extension of the minimality notion as used in, for example, approaches for updating mate-
rialized views [GL95]. When compared to approaches on transaction equivalence [AV88]
which mainly operate at the logical level, our approach tackles the challenge to efficiently
realize minimized change logs at the physical level. Contrary to [AV88], for example, the
DeltaLayer algorithm produces output (i.e., minimized change logs) which can be directly
used for data extraction in data warehouses or correctness checks within a process man-
agement system. This also implies that our approach is independent of any transactional
concepts (i.e., transactions and associated recovery mechanisms can be specified on top of
the change logs information as considered in this paper).
Advertisement
Since the output of the DELTALAYER algorithm is structured it can be accessed and
queried more easily, for example, to post-process the net changes: When updating ma-
terialized views in data warehouses, it might be useful to determine which attributes are
affected by changes and which are not. Regarding adaptive process management systems
the access to specific entities (such as process activities) support the efficient processing
of the necessary correctness checks (e.g., whether two changes are conflicting) [RRD04a].
After presenting the DELTALAYER algorithm we will show how it can be implemented
within commercial systems such as Oracle database or IBM DataPropagator as well as
within differential snapshot algorithms (e.g., the Window algorithm as, for example, pre-
sented in [LGM96]) in order to purge the output from wasteful information. For appli-
cations such as data warehouses, it is sufficient that the DELTALAYER algorithm is able
to process the basic operations DELETE,UPDATE, and INSERT [LGM96]. However, it
offers the possibility to process more complex changes on top of these basic operations as
well (e.g., transactions, set-oriented updates, or ”high-level” process graph changes). We
will show this by means of an example for a ”high-level” operation in the area of adaptive
process management systems. In this context we can also see that the minimality property
of the DELTALAYER algorithm is crucial. Minimality guarantees that two arbitrary change
logs become comparable after running the DELTALAYER algorithm on them. This, in turn,
is essential for any check which requires comparison of change information.
The paper is organized as follows: In Section 2 fundamental notions are introduced. We
present the DELTALAYER algorithm and prove its minimality property in Section 3. Sec-
tion 4 discusses several applications of the DELTALAYER algorithm. In Section 5 we
discuss related work, and we finish with a summary and outlook in Section 6.
2 Fundamentals
In this section we introduce fundamental notions which we use further on in the paper and
which are necessary for a basic understanding.
First of all, we restrict our considerations to the following set of basic change operations:
CO := {INSERT(T), UPDATE(T), DELETE(T)}2. We assume that tuple T has a
key T.Key and a list of attributes T.A with A= [a1, ..., an]. Using this notion, for ex-
ample, differential snapshots can be easily expressed: INSERT(T) refers to the insertion
of a data field with key T.Key and attribute list T.A (cf. [LGM96]). Regarding change
logs in adaptive process management systems, INSERT(T) expresses the insertion of a
new activity node T.Key into the underlying process template (e.g., activity node X is
inserted into process template S, cf. Figure 1b). The additional parameters stored within
T.A might, for example, specify the position where T.Key is inserted. Furthermore,
UPDATE(T) expresses an update of a data field having key T.Key with attributes T.A
or an update of the position of a process activity node T.Key within a process template
(e.g., updating the current position of activity node A to new position (D, E) for instance
I2, cf. Figure 1b). Finally, for a delete operation it is sufficient to only specify the key of
2In Section 4.3 we show how our approach can be extended to other change operations as well.
the tuple to be deleted.
In the following we denote attribute values which remain non-specified as NULL. This
might be used, for example, when applying UPDATE operations for which only some of
the attributes are updated whereas for the others the old values are kept (cf. Table 1).
One or several change operations as described above can be used to express modifications
on data sources (e.g., update of source tables contained in differential snapshots). More
precisely, we denote an ordered sequence of change operations as change log. Formally:
cL :=< op1, ..., opn>where opiCO (i = 1, .., n) and opiaffects tuple T having
key T.Key and attribute list T.A. Note that an operation opiis performed before another
operation opj(opi, opjcL) for i < j. As running example we use the following
simple change log for the remainder of this paper:
Change Log 1 (Example Change Log) <
op1= INSERT((2, "HELLO")),
op2= INSERT((3, "BYE")),
op3= DELETE((2, NULL)),
op4= INSERT((4, "BYEBYE")),
op5= UPDATE((4, "HELLO")),
op6= UPDATE((4, "BYE"),
op7= DELETE((5, NULL)),
op8= INSERT((5, "CIAO")),
op9= INSERT((6, "HI")),
op10 = INSERT((7, "HIHI") >
When analyzing this sample change log we can see that its actual effects on the original
data source can be captured by a fraction of the original change operations, i.e.,
<INSERT((3, "BYE")), INSERT((4, "BYE")), UPDATE((5, "CIAO"))3,
INSERT((6, "HI")), INSERT(7, "HIHI"))>.
From literature studies and experiences with our process management system we learned
that unnecessary change pairs within change logs occur rather often. Examples are dif-
ferential snapshots [LGM96] and change logs in process management systems [RRJK06].
Regarding the first case, wasteful change pairs are produced by the algorithms to compute
the differential snapshots; in the latter case the unnecessary information might be caused
by users trying out the best solutions for change operations (see, for example, instance I2
in Figure 1b). In any case such wasteful change pairs might cause performance problems
[DDJ+98] and / or hamper correctness checks based on the change logs [RRJK06]. In Ta-
ble 1 we summarize all different kinds of wasteful change pairs which might occur within
change logs. Table 1 also analyzes their actual effects on the data source.
3In this case we have to check if attribute value CIAO means an update on the original table, otherwise the
INSERT((5, "CIAO")) operation would be unnecessary too.
Advertisement
Table 1: Wasteful Change Pairs within Change Logs
Change Pair Effect Example
1. INSERT(T)/DELETE(T) neutral INSERT((2, "HELLO")),
DELETE((2, NULL))
2. UPDATE(T)/DELETE(T) delete UPDATE((2, "HELLO")),
DELETE((2, NULL))
=DELETE((2, NULL))
3. DELETE(T)/INSERT(T) a) same attribute values as tuple (2,
..) had in source before =neutral
DELETE((2, NULL)),
INSERT((2, "HELLO"))
b) different attribute values as tuple
(2, ..) had in source before =up-
date
DELETE((2, NULL)),
INSERT((2, "BYE"))
=UPDATE((2, "BYE"))
4. UPDATE(T)/UDPATE(T) a) second update has same attribute original attribute values of T in source
values as T in source (2, "HELLO")
=neutral UPDATE((2, "BYE")),
UPDATE((2, "HELLO"))
b) updates attribute values original attribute values of T in source
compared to attribute values (2, "HELLO")
of T in source UPDATE((2, "BYE")),
UPDATE((2, "BYEBYE"))
=update =UPDATE((2, "BYEBYE"))
c) value update for more than one original attribute values of T
attribute =in source (2, "HELLO", 100)
merge into one update UPDATE((2, "BYE", NULL))
UPDATE((2, NULL, 300))
=UPDATE((2, "BYE", 300))
5. INSERT(T)/UPDATE(T) a) for one attribute =insert INSERT((2, "HELLO")),
UPDATE((2, "BYE"))
=INSERT((2, "BYE"))
b) for more than one attribute INSERT((2, "BYE", NULL))
=merge into insert operation UPDATE((2, NULL, 11))
=INSERT((2, "BYE", 11))
3 On Calculating the Minimal Effect of Change Logs the DELTA-
LAYER Algorithm
So far we have discussed different kinds of wasteful change combinations within change
logs (cf. Section 2). In this section we present the DELTALAYER algorithm which purges
a change log from change combinations as summarized in Table 1 and produces a ”net
information” output (i.e., an output only containing change operations which reflect the
delta between target and source table).
3.1 The DELTALAYER Algorithm
Generally, the DELTALAYER algorithm (cf. Algorithm 1) receives a change log as input
(such as Example 1) and produces an output within the DeltaLayer format4[RRJK06]
4We abstract from a concrete representation since different applications usually require different approaches
in this context. One possibility for a DeltaLayer representation is presented in Figure 2. We will discuss other
approaches as used in commercial systems (e.g., the condensed change data table format in IBM Data Propa-
gator) in Section 4.1. Here, for example, the entities affected by change operations opiare stored within one
which only contains the net information of the input change log. In (1) the DeltaLayer
structure is initialized; either a new DeltaLayer structure is created or an already existing
one is used. Then the algorithm steps through the change log (i.e., an ordered sequence of
change operations).
Algorithm 1 (DELTALAYER: ChangeLog cL :=< op1, ..., opn>7→ DeltaLayer DNet)
DeltaLayer DNet := newDeltaLayer
forall (i = 1, ..., n) do
if opi= INSERT(T) then
DNet = insertTuple(DNet, T)
else if opi= DELETE(T) then
DNet = deleteTuple(DNet, T)
else if opi= UPDATE(T) then
DNet = updateTuple(DNet, T);
fi
od
For each change operation opiits type is determined (i.e., INSERT, UPDATE, DELETE).
Based on the current change type, Algorithm 1 calls associated functions insertTuple(...),
deleteTuple(...), or updateTuple(...). Basically, these functions check if
and what effect current operation opihas on the target data table. For this decision the
effects of already applied change operations have to be taken into account (cf. Table 1).
Let us assume that Algorithm 1 calls function insertTuple(...) with the current
DeltaLayer and the current change operation opi=INSERT(T). First of all, function
insertTuple(...) checks whether tuple T has already been deleted before (i.e., by
a change operation opkwith k < i). This corresponds to case 3 in Table 1. As we
can see from this table, function insertTuple(...) now has to distinguish whether
the attribute values used when applying opiare the same or different from the attribute
values having been used for prior opk=DELETE(T) operation (i.e., the original values
in the source table in case of a DELETE operation). This is done in lines (?). If the
attribute values used by opiand opkare different, obviously, opihas to be stored within the
DeltaLayer as UPDATE(T) operation (i.e., if update = true ()). In any case, T is deleted
from the set of deleted tuples and therefore the previous DELETE(T) operation opkis
correctly purged from DeltaLayer. Except the DELETE(T)/INSERT(T) combination,
according to Table 1, there is no other possibly wasteful change combination where an
INSERT operation is applied in the second place. Therefore, if T 6∈ Dold.deletedTuples,
T can be added to the set of inserted tuples and INSERT(T) is currently stored within
DeltaLayer.
Function 1 (insertTuple (DeltaLayer Dold, newTupel Tnew)7→ DeltaLayer DNet)
DNet := Dold
boolean contains := false
table including a column where the type of the particular change operation is indicated (e.g., Ifor an INSERT
operation). Another possibility would be to keep an own table for INSERT,UPDATE, and DELETE operations.
The choice may depend on factors such as performance or query optimization.
Advertisement
forall (T Dold.deletedTupels) do //
if (T.Key = Tnew.Key) then //
boolean update := false
Tupdate = new T<Tnew.Key,{}>
n = |T.A|
forall (k = 1, ...,n) do
if (T.ak6=Tnew.ak)then //?
Tupdate.ak= Tnew.ak//?
update = true
fi
od
if (update = true) then //
DNet.updatedTupels := DNet.updatedTupels Tupdate//
fi
DNet.deletedTupels := DNet.deletedTupels \T
contains := true
break
fi
od
if (contains = false)
DNet.insertedTupels := DNet.insertedTupels Tnew
fi
return DNet
We omit the code for functions deleteTuple(...) and updateTuple(...) due
to space restrictions, but explain their essence in the following: For opi=DELETE(T),
if there has been a previous operation opkwith opk=INSERT(T) (k < i), opkand opi
are purged from DeltaLayer by function deleteTuple(...). Reason is that combi-
nation INSERT(T)/DELETE(T) has no actual effect according to case 1 in Table 1.
By contrast, case 2 in Table 1 is more interesting: Although there has been a previous
UPDATE(T) operation, we have to insert the original attribute values into the DeltaLayer
which have been contained within the source table before any change operation has been
applied. Otherwise the DeltaLayer does not reflect correctly applicable changes. To get
correct values it is not sufficient to check the source table due to intermediate (and already
purged) UPDATE operations. Therefore, one alternative is to look up the original values
from the target table which might decrease the performance of the algorithm (see Sec-
tion 3.2). However doing so does not cause a blocking of the target table and therefore
does not lead to an increase of the update window. Alternatively, the original attributes
can be stored in an auxiliary data structure as soon as an UPDATE operation is applied.
This raises storage needs but increases performance. The mechanisms used in function
updateTuple(...) can be seen as a combination of those ones used for functions
insertTuples(...) and deleteTuples(...).
Figure 2 shows how the DELTALAYER algorithm works on the input changes presented in
Example 1. The crossed-out entries reflect the purged entries within the DeltaLayer.
op1INSERT UPDATE DELETE
2, („Hello“)
INSERT UPDATE DELETE
2, („Hello“)
op2
3, („BYE“)
INSERT UPDATE DELETE
2, („Hello“)
op3 *
3, („BYE“)
INSERT UPDATE DELETE
op4 3, („BYE“)
4, („BYEBYE“)
INSERT UPDATE DELETE
op5 3, („BYE“)
4, („BYEBYE“)
4, („HELLO“)
INSERT UPDATE DELETE
op6 3, („BYE“)
4, („BYE“)
4, („HELLO“)
INSERT UPDATE DELETE
op7 3, („BYE“)
4, („BYE“)
5, („Hi“)
INSERT UPDATE DELETE
op8 3, („BYE“)
4, („BYE“)
5, („Hi“) 5, („CIAO“)
op9+ op10 simple insert, no purge
Result: INSERT UPDATE DELETE
3, („BYE“)
4, („BYE“)
6, („Hi“)
7, („HiHi“)
5, („CIAO“)
t
Input Change Log:
op1 = INSERT((2, „HELLO“))
op2 = INSERT((3, „BYE“))
op3 = DELETE((2, NULL))
op4 = INSERT((4, „BYEBYE“))
op5 = UPDATE((4, „HELLO“))
op6 = UPDATE((4, „BYE“))
op7 = DELETE((5, NULL))
op8 = INSERT((5, „CIAO“))
op9 = INSERT((6, „HI“))
op10 = INSERT((7, „HIHI“))
*(2, Hello) is deleted by
DELETE(2, NULL)
Figure 2: Applying DELTALAYER to Example Change Log
Note that the algorithm can also be used if the Delta Layer is updated each time a change
operation occurs, i.e., the input of algorithm 1 becomes a one-element change log. This
approach is, for example, applied in our adaptive process management system ADEPT2,
i.e., each time a change operation is applied either to a process instance or to a process
template, the according Delta table is updated using the DELTALAYER algorithm. In Sec-
tion 4.2 we will show how the same principle can be directly integrated within the Window
algorithm producing differential snapshots.
3.2 Performance Considerations for the DELTALAYER Algorithm
In this section, we provide some performance considerations for the DELTALAYER algo-
rithm. Obviously, the complexity of Algorithm 1 without considering functions
insertTuple(...), deleteTuple(...), and updateTuple(...) is O(n)
with n equals the number of change operations contained in the input change log. More
interesting are the complexity considerations for functions
insertTuple(...), deleteTuple(...), and updateTuple(...).
For function insertTuple(...) we obtain a complexity of O(nc) = O(n)since
this function consists of a scan of set deletedTupels with nelements =O(n)(cf line ),
a simple comparison in line (), and a check of all cattributes of T =O(c)with c is
constant. Furthemore, adding T to the set of updated or inserted tuples as well as deleting
T from the set of deletedTuples (if necessary) is all of complexity O(1).
Advertisement
Function deleteTuple(...) also has a complexity of O(n): The check of the set of
inserted tuples has a complexity of O(n)(nequals the number of inserted tuples). The scan
of the set of updated tuples is of complexity O(m)with m equals the number of updated tu-
ples. Therefore we obtain a complexity of O(n+m) = O(n)for deleteTuple(...)
(delete and insert into the sets of inserted, deleted, and updates tuples are again of a com-
plexity of O(n)).
For function updateTupel(...) we obtain a worst case complexity of O(n2): First
of all there is a scan of the set of inserted tuples having complexity of O(n)with n equals
the number of inserted tuples. The following check of the attribute set of a tuple T has
complexity of O(c)with c equals the number of attributes of T. The conditional scan of
the set of updated tuples is of complexity O(m)(m equals the number of updated tuples).
Then the number of attributes of tuple T is checked again with complexity of O(k)with
k equals the number of attributes of tuple T. Then the target table has to be accessed with
complexity of O(t)with t equals the number of tupels in the target table. Note that here no
locks on the target table become necessary. Furthermore we can decrease the complexity
by storing the original attribute values of the affected tuples. Finally, a scan of the attributes
of tuple T in the target table becomes necessary having complexity O(l)with l equals the
number of attributes of tuple T. Altogether, the complexity of updateTuple(...)
turns out as O(nc)+O(mktl)where c, k, and l are constant. Consequently, the
complexity is O(n)+O(mt)=O(n)+O(n2)=O(n2).
Generally speaking, the number of the tuple in the DeltaLayer is quite small when com-
pared to the number of tuples in source and target tables since only the minimal set of
change operations is stored. Therefore scanning the sets of inserted, updated, and deleted
tuples can be accomplished rather quickly. Generally, processing DELTALAYER algorithm
is not the part potentially causing performance problems. The critical part is feeding the
data into the data warehouse afterwards since the target tables have to be locked during this
time (update window). The DELTALAYER algorithm minimizes the volume of the data to
be fed into the data warehouse significantly and therefore can increase the update window
quite dramatically. Furthermore, as we will show in Section 4.2, the DELTALAYER algo-
rithm has not to be applied once for all (and possibly large) change logs. It is possible
to directly integrate the DELTALAYER algorithm into, for example, differential snapshot
algorithms. Therefore the different data mechanisms can work with the DELTALAYER
algorithm asychronously regaring the time updates on the data tables occur.
3.3 Correctness of the DELTALAYER Algorithm
In this section we show that the output produced by Algorithm 1 is correct, i.e., the output
captured within the DeltaLayer is minimal. Minimality has been considered as important
requirement in the context of updating materialied views as well. For example, in [GL95]
minimality of the bag algebra expressions for updating a derived view has been proven for
INSERT and DELETE operations. UPDATE operations applied to the base view have not
been directly taken into account since they have been always transformed into associated
DELETE/INSERT pairs. This is contrary to minimality itself.
In the following, we will adopt the mimimality requirements for INSERT and DELETE
operations for our algorithmic approach but will extend it by considering UPDATES.
Definition 1 (Minimality of the DeltaLayer)Let Sbe the tuple set before the change and
let S0be the tuple set after applying change operations opi(i = 1, ..., n) captured within
change log cL =< op1, ..., opn>. Let further the sets of actually inserted, deleted, and
updated tuples be defined as follows:
1. Actually deleted tuples: 5cL := {t|tS\S’: 6 t’ S’ with t’.Key = t.Key}
2. Actually inserted tuples: 4cL = := {t|tS’ \S: 6 t’ S with t’.Key = t.Key}
3. Actually updated tuples: .cL = := {t|tS\S’ tS’ \S: t’ S’ with t’.Key = t.Key}
Then we call a change log or related data structure (e.g., a delta layer) Dminimal if and
only if the following conditions hold:
1. D.deletedTuples = 5cL
2. D.insertedTuples = 4cL
3. D.updatedTuples = .cl
The definitions of 5cL, 4cL, and .cL are illustrated in Figure 3a from a set-based point
of view. As it can be seen from Figure 3b the distinction between updated and inserted or
updated and deleted tuples is done by comparing keys.
S S’
deleted or updated
neutral /
no change
inserted
or updated
a) Set-based View: b) Key-based View:
S.Key S’.Key
deleted
neutral or
updated inserted
Figure 3: Sets of Inserted, Deleted, and Updated Tuples
Theorem 1 (Minimality of DeltaLayer)Let S be a tuple set and let cL be a change log
transforming tuple set Sinto tuple set S0. Let further DeltaLayer Dbe the output resulting
from the application of Algorithm 1 to cL. Then Dis minimal according to Definition 1.
Due to lack of space we only sketch the proof of Theorem 1 for the set of inserted tuples
D.insertedT uples. The proofs for deleted and updated tuples, however, can be accom-
plished in a similar way by contradiction.
Proof Sketch:
Proof by contradiction =we have to show
Related document tools
Make document review easier
Plag is useful when academic text needs an additional similarity check. Identific helps make review workflows more structured. They support better decisions before a document is accepted.
1. D.deletedTuples 6=5cL
2. D.insertedTuples 6=4cL (*)
3. D.updatedTuples 6=.cl
= ¬ (D produced by Algorithm 1)
We proof (*) = ¬ (D produced by Algorithm 1).
Auxiliary assumptions: a) t used in the following tStS0holds, b) INSERT(t)
is the last entry in cL for t, c) it is not possible to apply an INSERT(t) operation after an
UPDATE(t) operation (would be rejected by the database system,
D.insertedTuples 6=4cL
D. insertedTuples 6={t|tS’ \S: 6 t’ S with t’.Key = t.Key}=
t1 D.insertedTuples with
1. t1 S\S’
2. t1 SS’
3. (t1 S’ \S t2 S with t1.Key = t2.Key)
=
1) t1 S\S’ =t1 has been deleted or updated =contradiction to (34) of Algorithm
1 and to b) and c) of auxiliary assumptions
2) t1 SS’ =(t1 has not been affected by any change) INSERT(t1)/DELETE(t1)
pair in cL (with same attributes) DELETE(t1)/INSERT(t1) pair in cL UPDATE(t1)
/UPDATE(t1) pair in cL (with original values of t1 in S for last update)5
=contradiction to 1) contradiction to (14,28,33) contradiction to (40,42,56,57)
contradiction to (76,77,97) of Algorithm 1
3) (t1 S’ \S t2 S with t1.Key = t2.Key) =UPDATE(t1) in cL =contradic-
tion to (63) of Algorithm 1 and to b) and c) of auxiliary assumptions 2
4 On Applying the DELTALAYER Algorithm
In this section we provide different application scenarios for the DELTALAYER algorithm.
They range from commercial systems such as Oracle or IBM DataPropagator to an exten-
sion of the used change operations within process management systems.
5For all change combinations see Table 1.
4.1 Application within Commercial Systems
In the following, we show how the DELTALAYER algorithm can be applied in the context
of two main commercial systems which offer support for data warehouses:
Oracle Database: Within the data warehousing guide of Oracle [Dat03] different possibil-
ities for data change capture are summarized. Closely related to the problems described
in this paper is the merge statement which is an extension of SQL. Basically, the merge
statement is used to transform data from source tables into the format of the target table.
(1) MERGE INTO products t USING products delta s
(2) ON (t.prod id=s.prod id)
(3) WHEN MATCHED THEN UPDATE SET
(4) t.prod list price=s.prod list price, t.prod min price=s.prod min price
(5) WHEN NOT MATCHED THEN INSERT (prod id, prod name, prod desc,
prod subcategory,
(6) prod subcategory desc, prod category, prod category desc, prod status,
(7) prod list price, prod min price)
(8) VALUES (s.prod id, s.prod name, s.prod desc, s.prod subcategory,
(9) s.prod subcategory desc, s.prod category, s.prod category desc,
(10) s.prod status, s.prod list price, s.prod min price);
Figure 4: MERGE Statement provided by Oracle Data Warehousing Guide [Dat03]
In the above statement products t refers to the target table and products delta
refers either to the source or to the delta table. In (2) the merge criteria are stated. If the
merge criteria are fulfilled then the associated tuples within the source / delta tables are
updated (3-4). Otherwise new tuples are inserted (5-10) (where column name of source
and target table do not necessarily have to match). The merge-statement offers the pos-
sibility of expressing an ”if-then-else”-like semantics (cf. Figure 4.1). Therefore if could
be used in a slightly modified way to implemented the ”if-then-else”-constructs of the
DELTALAYER algorithm. Using a table representing the changes of an unpurged log as
input, the modified merge-statements (one for each change operation) produce a correct
net delta in the output table.
Another, maybe even simpler possibility is to implement the DELTALAYER algorithm
within JAVA and make it accessible using the mechanism of user-defined functions of
Oracle database.
IBM DataPropagator: Another commercial system dealing with delta information in the
context of replicating distributed data sources is IBM DataPropagator [IBM95]. Using
IBM DataPropagator, basically, it is possible to produce so called condensed change data
tables which are supposed to contain net update information. However, condensed change
data tables are neither minimal nor correct in most cases. Let us apply the triggers pre-
sented in [IBM95] on the change combinations summarized in Table 1. If, for example, a
change of type DELETE(T) is applied, the before trigger, first of all, deletes any entry for
which key equals T.Key holds (regardless whether it is an update or insert operation) and
the after trigger stores the DELETE(T) operation afterwards. Doing so, in case of unnec-
Advertisement
essary INSERT(T)/DELETE(T) pairs the DELETE(T) operation remains within the
condensed change data table (therefore the table is not minimal). Furthermore, this re-
sult is even not correct if key T.Key is not key of the data warehouse schema afterwards
(otherwise the database system would simply reject the application of the DELETE(T)
operation). In case of unnecessary INSERT(T)/UPDATE(T) pairs the triggers always
produce an incorrect output since tuple T having key T.Key is no longer present when the
UPDATE(T) operation is applied. Consequently, the triggers provided for IBM DataProp-
agator so far could be improved by implementing them according to the DELTALAYER
algorithm. Due to lack of space we abstain from details here.
4.2 Post Processing of Differential Snapshots
Another application for the DELTALAYER algorithm is post processing the output of dif-
ferential snapshot algorithms. As already mentioned in [LGM96] this would constitute
a valuable extension in order to minimize update windows within the data warehouse
[DDJ+98]. One alternative in this context is to take the whole differential snapshot as in-
put for the DELTALAYER algorithm (real post processing). However, we can also think of
directly integrating the DELTALAYER algorithm into the differential snapshot algorithms.
For the Window algorithm [LGM96], for example, this can be accomplished rather easily:
instead of writing entries from AgingBuffer1 into the DELETE queue and entries from
AgingBuffer2 into the INSERT queue all entries from the aging buffers can be directly
written into the DeltaLayer. There an immediate purge of wasteful INSERT/DELETE
pairs takes place.
4.3 An Extension Towards Arbitrary Change Operations
INSERT/DELETE/UPDATE operations are rather simple changes. In applications such
as data warehouses [GM95] or process management systems more advanced change op-
erations become necessary in practical applications [RRJK06]. In our ADEPT process
management system, for example, we offer the change framework depicted in Figure 5a.
Here, first of all, we distinguish between primitive and high-level change operations in or-
der to offer better user support. Thereby the high-level change operations are constructed
by combining an arbitrary amount of change primitives. Furthermore, high-level change
operations are equipped with a set of (formal) pre- and post-conditions which guarantee
the correctness of the resulting process when applying a high-level change operation.
The main differences between high-level operations and change primitives are their prop-
erties and the intention they are used for. High-level operations act as an interface pro-
vided to the end-user. Furthermore contrary to the set of change primitives the set
of high-level change operations does not necessarily have to be closed (i.e. the process
management system can be extended by new high-level change operations any point in
time). Change primitives are mainly used to create the data structures (in particular the
DeltaLayer) based on which (high-level) changes are represented at system-level. More
precisely, there is a mapping for each high-level change operation to associated change
primitives (including those high-level operations which newly defined within the process
management system). Based on the concept of mapping high-level change operations at
system level to a change primitive representation, the associated change primitives can
be used for manipulating the internal representation, i.e. the DeltaLayer. Therefore, for
high-level change operations the DELTALAYER algorithm can again be used in order to
keep the DeltaLayer correct and minimal. Minimality of the change information, in turn,
is crucial for correctness checks in the context of concurrently applied changes which are
mainly based on comparing the associated change logs.
In Figure 5b, for example, a process template change is illustrated which is accomplished
by applying two high-level change operations sInsert(...) and sMove(...) at
the user level. At the system level, the high-level operation have been automatically trans-
formed into a change primitive representation. Note that the DELTALAYER algorithm
presented in this paper can be easily adapted to the change primitives used in adaptive
process management systems (e.g., addNode(...), addEdge(...), or
addDataElement(...)) as it has been shown in [RRJK06]. Based on this primi-
tive representation, the DELTALAYER algorithm produces the purged DeltaLayer which
automatically keeps a minimal change information for correctness checks at any time.
Enter
order Examine
patient Deliver
report
Inform
Patient Prepare
Patient
Template Version S := S(T,1)
Enter
order Lab test
Inform
patient
Prepare
Patient Examine
Patient
patData patData
Template Version S‘ := S(T,2)
cLT1(S) = (sInsert(S, Lab test, Examine Patient, Deliver report),
sMove(S, Inform Patient, Prepare Patient, Examine Patient))
Lab test
Deliver
report
cLprimT1=
(addNode(S, Lab test), removeEdge(S, Examine patient, Deliver Report, Ctrl),
addEdge(S, Examine Patient, Lab test, Ctrl), addEdge(S, Lab test, Deliver report, Ctrl),
removeEdge(S, Enter order, Inform patient, Ctrl), removeEdge(S, Inform patient, Prepare Patient, Ctrl),
removeEdge(S, Prepare patient, Examine patient, Ctrl), addEdge(S, Enter order, Prepare patient, Ctrl),
addEdge(S, Prepare Patient, Inform patient, Ctrl), addEdge(S, Inform patient, Examine Patient, Ctrl))
Change Logs High-Level Change Operations
sInsert (…), delAct(…), …
Change Primitives
addNode (…), deleteNode(…), …
a) ADEPT Change Framework
MAPPING
b) Process Template Change
Purged Delta Layer
Figure 5: a) ADEPT Change Framework and Example Process Template Change
Advertisement
5 Related Work
There are several suggestions how to extract and feed updates from data sources into a
data warehouse. One complete architecture has been proposed within the WHIPS project
[HGMW+95] ranging from data extraction to the update of materialized views.
Snapshot-Based Approaches: If data sources do not offer any support for extracting data
updates (typical for legacy systems) the only way to do so is to build snapshots of the
files before and after the change. Then the difference between those snapshots is com-
puted, e.g., by the algorithm proposed in [LGM96]. Commercial systems such as Oracle
database offer the possibility to create different snapshots [BG04]. The resulting change
log is loaded into the data warehouse afterwards. However, such algorithms produce
wasteful change combinations as well, i.e., unnecessary INSERT(T)/DELETE(T) or
DELETE(T)/INSERT(T) pairs as depicted in Table 1. This may cause overhead when
loading the files afterwards [LGM96] as well as unnecessary checks during the transfor-
mation phase within the staging area later on. Therefore a post processing as proposed in
this paper would be beneficiary for working with differential snapshots.
Replication-Based Approaches: Replication approaches range from copying database ta-
bles to storing the updates tuples within so called delta tables. In particular the latter is
related to our approach.
Materialized Views: The principles of net changes and delta tables may be also interesting
for the maintenance of views in data warehouses [LYG99, LYC+00, ZGHW95, GL95],
i.e., the problem of propagating changes of a base table Bto a view Vderived from B.
Here the incremental update of views is of particular interest, i.e., updated view V0is cal-
culated as original view Vminus the set of deleted tuples 5V plus the set of inserted tuples
4V. The challenge is to determine sets 5V and 4V. This comprises two aspects: first
of all, it has to be determined which tuples have been changed in Band secondly, how to
calculate the sets 5V and 4V from that (e.g., the necessary aggregations). Our approach
can be taken to optimize the result of step one, the determination of changes applied to B:
Since UPDATES on Bare expressed as INSERT and DELETE pairs [LYG99] the mini-
mality requirement for the view updates only refer to INSERT and DELETE operations
[GL95]. However, our algorithm additionally considers minimality regarding UPDATE
operations.
Object-Oriented Approaches: Delta objects have been proposed for object-oriented data-
bases [SBDU97]. Here the changes are associated with the objects, not stored within a
gloabl delta log. Furthermore, the purpose of managing the delta inforamtion is different.
In this context, delta information is, for example, used for debugging and run-time testing.
Adaptive Process Management: In general, adaptivity in process management systems has
been a hot topic in literature for many years. However, there are only few approaches
dealing with an efficient implementation of advanced process management functionality
[Wes98, KAS+03]. So far, they have neglected issues related to change log management.
Our ADEPT system is one of the very few available research prototypes for adaptive,
high-performance process management [RRD04b, RRKD05, RRJK06].
6 Summary and Outlook
In this paper we have introduced the DELTALAYER algorithm which receives a change log
as input and produces a structured net change within the DeltaLayer format as output. We
have shown that the algorithm produces a correct (i.e., minimal) net change with respect
to the set of inserted, deleted, and updated tuples. Furthermore in the paper different
application areas for the DELTALAYER algorithm have been provided such as replication
mechanisms in commercial systems (e.g., Oracle or IBM DataPropagator), post processing
of differential snapshots to feed data updates into data warehouses, or correctness checks
for adaptive process management systems.
At the moment we are integrating the DELTALAYER algorithm within the implementation
of our process management engine ADEPT2 (for more information see www.aristaflow.de).
This is also particulary interesting since maintaining a DeltaLayer provides transparency
to any change operations which is added to the change framework afterwards. Based on
the implementation we will conduct performance studies within different scenarios.
In future work we want to study more application areas for the DELTALAYER algorithm.
One example is the cleaning of data within the staging area of a data warehouse (cf. Figure
6). The idea is to purge data (if desired!) after necessary transformation using metadata
within a DeltaLayer. Doing so, for example, redundancies can be purged.
Figure 6: Cleansing Data from Different Sources withing Staging Area
References
[AV88] S. Abiteboul and V. Vianu. Equivalence and Optimization of Relational Transactions.
Journal of the Assocation of Computing Machinery, 35(1):70–120, 1988.
[BG04] A. Bauer and H. G¨
unzel. Data Warehouse Systems. dpunkt, 2004.
[Dat03] Oracle Database. Data Warehousing Guide, 10g Release 1 (10.1), December 2003.
[DDJ+98] L. Do, P. Drew, W. Jin, V. Jumani, and D. Van Rossum. Issues in Developing Very
Large Data Warehouses. In Proc. Int’l Conf. VLDB, pages 633–636, 1998.
[GL95] T. Griffin and L. Libkin. Incremental Maintenance of Views with Duplicates. In
Proc. Int’l SIGMOD Conf., pages 328–339, 1995.
Advertisement
[GM95] A. Gupta and I. Mumick. Maintenance of Materialized Views: Problems, Tech-
niques, and Applications. IEEE Data Engineering Bulleting, 18(2):3–18, 1995.
[HGMW+95] J. Hammer, H. Garcia-Molina, J. Widom, W. Labio, and Y. Zhuge. The Stanford
Data Warehousing Project. IEEE Data Engineering Bulletin, June, 1995.
[IBM95] IBM. Data Where You Need It, The DPropR Way! DataPropagator Relational Solu-
tions Guide. Technical Report GG24-4492-00, IBM International Technical Support
Organization, San Jose Center, 1995.
[JNS+97] H. Jagadish, P. Narayan, S. Seshadri, S. Sudarshan, and R. Kanneganti. Incremental
Organization for Data Recording and Warehousing. In Proc. Int’l Conf. VLDB, pages
16–25, 1997.
[KAS+03] K. Kochut, J. Arnold, A. Sheth, J. Miller, E. Kraemer, B. Arpinar, and J. Cardoso.
IntelliGEN: A Distributed Workflow System for Discovering Protein-Protein Inter-
actions. DPD, 13(1):43–72, 2003.
[LGM96] W. Labio and H. Garcia-Molina. Efficient Snapshot Differential Algorithms for Data
Warehousing. In Proc. Int’l Conf. VLDB, pages 63–74, 1996.
[LYC+00] W. Labio, J. Yang, Y. Cui, H. Garcia-Molina, and J. Widom. Performance Issues in
Incremental Warehouse Maintenance. In Proc. Int’l Conf. on Very Large Databases,
pages 461–472, 2000.
[LYG99] W. Labio, R. Yerneni, and H. Garcia-Molina. Shrinking the Warehouse Update Win-
dow. In Proc. Int’l SIGMOD Conference, pages 383–394, 1999.
[RRD04a] S. Rinderle, M. Reichert, and P. Dadam. Disjoint And Overlapping Process Changes:
Challenges, Solutions, Applications. In CoopIS’04, pages 101–120, 2004.
[RRD04b] S. Rinderle, M. Reichert, and P. Dadam. Flexible Support Of Team Processes By
Adaptive Workflow Systems. DPD, 16(1):91–116, 2004.
[RRJK06] S. Rinderle, M. Reichert, M. Jurisch, and U. Kreher. On Representing, Purging, and
Utilizing Change Logs in Process Management Systems. In Proc. Int’l Conf. BPM,
pages 241–256, 2006.
[RRKD05] M. Reichert, S. Rinderle, U. Kreher, and P. Dadam. Adaptive process management
with ADEPT2. In ICDE’05, pages 1113–1114, 2005.
[SBDU97] A. Sundermier, T. Ben Abdellatif, S.W. Dietrich, and S.D. Urban. Object Deltas in an
Active Database Development Environment. In Proc. Int’l Conf. DOOD-97, pages
211–229, 1997.
[Wes98] M. Weske. Object-Oriented Design of a Flexible Workflow Management System. In
ADBIS98, pages 119–131, 1998.
[Wes01] M. Weske. Formal Foundation and Conceptual Design of Dynamic Adaptations in a
Workflow Management System. In HICSS-34, 2001.
[ZGHW95] Y. Zhuge, H. Garcia-Molina, J. Hammer, and J. Widom. View Maintenance in a
Warehousing Environment. In Proc. Int’l SIGMOD Conf., pages 316–327, 1995.