Document [original]

Fakultät für Elektrotechnik, Informatik und Mathematik

Institut für Informatik

Arbeitsgruppe Softwaretechnik

Warburger Straße 100

33098 Paderborn

Data-oriented Reengineering

Dissertation submitted in partial fulfillment

of the requirements for the degree of

„Doctor of Natural Science“ (Dr. rer. nat.)

Schriftliche Arbeit

zur Erlangung des Grades

"Doktor der Naturwissenschaften" (Dr. rer. nat.)

vorgelegt von

Dipl.-Inform. Jörg P. Wadsack

Ferrariweg 34

33102 Paderborn

Paderborn, Dezember 2003

iii

ABSTRACT

Today, information system evolution primarily consists of extending the legacy systems

and migrate them to modern platforms related to the Web and mobile devices. This dis-

sertation tackles the problem of understanding and adapting legacy web information sys-

tems based on the systems’ data. In this context several methods and tools have been

proposed for reengineering. Since legacy systems have grown over years and lack suffi-

cient documentation, reengineering is a complex and hard task. Tool supported reengi-

neering approaches and processes reduce the complexity and risks during web

information system maintenance. Still, current reengineering approaches and tools often

tackle only specific system parts or dedicated maintenance aspects. This thesis aims to

overcome these limitations by providing a process that combines tools for reengineering

the data as well as the applications. Our focus is to handle uncertain knowledge by sus-

taining human exploration and iteration during a tool supported reengineering process. In

practice uncertain knowledge during reengineering plays a fundamental role but it is often

neglected by exiting approaches due to idealistic assumptions.

The presented data-oriented reengineering process provides concepts that combine exist-

ing data and application design reengineering approaches to maintain web information

systems. Since it is unrealistic to presume that reengineering can follow a strictly waterfall

like process without iterations, inconsistencies occur during the reengineering process.

The chosen combined approaches fulfil the requirement to deal with such inconsistencies.

We based our process on models because models provide the possibility to represent (in-

consistent) knowledge at different levels of abstraction. Moreover, models can be accu-

rate enough to enable code generation. In this dissertation, we base the system’s models

on graphs. We employ graph transformation theory to provide mechanisms that detect,

handle and resolve inconsistencies of the models automatically. The results are imple-

mented within the REDDMOM project. We use the FUJABA TOOL SUITE for tool integration

and evaluate our concepts with a case study in the Health Care domain.

AKNOWLEDGEMENTS

Many people have influenced my research during the past five years. I am especially

obliged to Wilhelm Schäfer who supported me in every occasion and provided a phenom-

enal working environment. Special thanks go to Jens Jahnke and Albert Zündorf. Both are

„responsible“ for this thesis since they convinced me that I am able accomplish it. Beside

office and beer sharing, the fruitful discussions with them were the basis for most of the

thesis results. Jens Jahnke welcomed my as a member of his research group at the Univer-

sity of Victoria and as a guest in his house. Many thanks to Anke at this point! During this

period the fundamental parts of this thesis were settled. A special thank you goes to Jörg

Niere for many fruitful discussions about our theses.

The achievement of this thesis would not have been possible without the contributions of

many colleagues and students. I thank all persons involved in the REDDMOM project. Ev-

eryone that somehow contributed to the FUJABA environment: thanks. Further, I thank all

people at the University of Victoria, especially those involved in the palliative care case

study, for the great support during my stay. Finally, I thank all my colleagues of the Soft-

ware Engineering Group sharing ideas, jokes, planes, beds, offices, (sparkling) wine, beer

and coffee with me.

Special thanks go the „proof readers“: Holger Giese, Jens Jahnke, Ekkart Kindler, Jörg

Niere, Matthias Meyer, Wilhelm Schäfer, Matthias Tichy, Robert Wagner, Lothar Wen-

dehals and Albert Zündorf.

I thank Jürgen Maniera for the great technical support. I am obliged Jutta Haupt who

helped me surviving the bureaucracy jungle and for many chats.

Without the support of my friends and my family, especially my wife Sonja and my son

Simon, this would never have been possible.

I love you.

To my family

CONTENTS

LIST OF FIGURES AND TABLES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XIII

CHAPTER 1: INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

1.1 Background: Web Information System Reengineering . . . . . . . 17

1.2 Problem Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

1.3 Our Data-oriented Reengineering Approach . . . . . . . . . . . . . . . 18

1.4 Dissertation Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

CHAPTER 2: DATA-ORIENTED REENGINEERING: A CASE STUDY . . . . . . 23

2.1 An Health Care Web Information System . . . . . . . . . . . . . . . . . 24

2.1.1 The Legacy (Web) Information System . . . . . . . . . . . . . 24

2.1.2 The Considered Target Web Information System . . . . . . 25

2.2 The Data-oriented Reengineering process . . . . . . . . . . . . . . . . . 27

2.2.1 Understanding Phase. . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.2.2 Adapting Phase . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

2.2.3 Model Maintenance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

CHAPTER 3: DATA-ORIENTED REVERSE ENGINEERING . . . . . . . . . . . . . 37

3.1 Reverse Engineering Data Components. . . . . . . . . . . . . . . . . . . 37

3.1.1 Reverse Engineering Steps . . . . . . . . . . . . . . . . . . . . . . . 38

3.1.2 Relationships and Data Dependencies . . . . . . . . . . . . . . 40

3.1.3 Model Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

3.2 Data Model Recovery. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.2.1 Schema Recovery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.2.2 Retrieval of Hidden Schema Parts . . . . . . . . . . . . . . . . . 50

3.2.3 Schema Mapping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.2.4 Conceptual Schema Refactoring . . . . . . . . . . . . . . . . . . . 63

3.3 Relationship Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.3.1 Code Fragment Extraction and Parsing . . . . . . . . . . . . . 67

3.3.2 Pattern Definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73

3.3.3 Pattern Instance Retrieval. . . . . . . . . . . . . . . . . . . . . . . . 79

3.3.4 Handling Uncertainty using Fuzzy Beliefs. . . . . . . . . . . 85

3.4 Tool Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

3.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96

3.6 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

CHAPTER 4: DATA COMPONENT EXTENSION . . . . . . . . . . . . . . . . . . . . 101

4.1 Extension Approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.2 Data Component Clustering and Classification. . . . . . . . . . . . 103

4.2.1 Data Component Clustering . . . . . . . . . . . . . . . . . . . . . 103

4.2.2 Clustering Strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . 107

4.2.3 Data Component Classification . . . . . . . . . . . . . . . . . . 110

4.3 Architectural Patterns for Data Mediation. . . . . . . . . . . . . . . . .111

4.3.1 Architectural Pattern: Data Portal. . . . . . . . . . . . . . . . . 113

4.3.2 Architectural Pattern: Data Fusion . . . . . . . . . . . . . . . . 116

4.3.3 Architectural Pattern: Data Transducer . . . . . . . . . . . . 119

4.3.4 Architectural Pattern: Data Connection . . . . . . . . . . . . 121

4.3.5 Architectural Pattern Application Examples . . . . . . . . 124

4.4 Access Layer Generation and Model Execution . . . . . . . . . . . 126

4.4.1 Transactional Access Layer . . . . . . . . . . . . . . . . . . . . . 126

4.4.2 Mediation Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

4.4.3 Publishing Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131

4.4.4 Multimedia extension . . . . . . . . . . . . . . . . . . . . . . . . . . 134

4.5 Tool Support. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

4.7 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

CHAPTER 5: MODEL CONSISTENCY MANAGEMENT . . . . . . . . . . . . . . . 143

5.1 Data Component Model Consistency Maintenance . . . . . . . . . 143

5.2 Graph-based History Mechanism . . . . . . . . . . . . . . . . . . . . . . . 145

5.2.1 Background: History Graph Mechanism. . . . . . . . . . . . 145

5.2.2 Simple Undo History Graph Mechanism . . . . . . . . . . . 152

5.2.3 Selective Undo History Graph Mechanism. . . . . . . . . . 158

5.2.4 Composed History Graph Mechanism . . . . . . . . . . . . . 163

5.3 Tool Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168

5.4 Related work. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170

5.5 Summary and Future Work. . . . . . . . . . . . . . . . . . . . . . . . . . . . 172

CHAPTER 6: CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

6.1 Summary. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173

6.2 Transferability of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175

6.3 Open Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

6.4 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

Chapter 1: Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

Chapter 2: Data-oriented Reengineering: A Case Study . . . . . 183

Chapter 3: Data-Oriented Reverse Engineering . . . . . . . . . . . . 186

Chapter 4: Data Component Extension . . . . . . . . . . . . . . . . . . 197

Chapter 5: Model Consistency Management . . . . . . . . . . . . . . 206

Chapter 6: Conclusions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

INDEX . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . CCXI

xii

xiii

LIST OF FIGURES AND TABLES

CHAPTER 1: INTRODUCTION

Figure 1.1: Data-oriented Reengineering process.......................................................20

CHAPTER 2: DATA-ORIENTED REENGINEERING: A CASE STUDY

Figure 2.1 Health Care Web Information System: Considered Target System ..........26

Figure 2.2: Case study understanding phase...............................................................28

Figure 2.3: Parsed data models overview ...................................................................28

Figure 2.4: Table clone example.................................................................................29

Figure 2.5: Palliative care conceptual schema excerpt in UML.................................30

Figure 2.6: Case study adating phase..........................................................................31

Figure 2.7: Clustering example...................................................................................32

Figure 2.8: New data components integration examples............................................32

Figure 2.9: : Case study administrating phase ............................................................33

Figure 2.10: Consistency violation example...............................................................34

CHAPTER 3: DATA-ORIENTED REVERSE ENGINEERING

Figure 3.1: Web information system understanding...................................................38

Figure 3.2: Data Component Reverse Engineering - Overview .................................39

Figure 3.3: Relationships and Data Dependencies Overview.....................................41

Figure 3.4: Conceptual schema view (relationships)..................................................46

Figure 3.5: Pysical aspect view as package diagram..................................................47

Figure 3.6: Conceptual schema view example ...........................................................48

Figure 3.7: Variant and optimisation structure example.............................................51

Figure 3.8: Patient variant example ............................................................................52

Figure 3.9: Optimisation structure example ...............................................................53

Figure 3.10: Triple-graph-grammar fundamental idea example.................................53

Figure 3.11: Schema mapping graph model ...............................................................56

Figure 3.12: MapEntityToClass mapping rule............................................................58

Figure 3.13: MapEntityToClass relating rule (story diagram)....................................59

xiv

Figure 3.14: MapEntityToClass foward rule (story pattern)...................................... 60

Figure 3.15: MapEntityToClass reverse rule (story pattern)...................................... 60

Table 3.1: Mapping Rule Overview ................................................................. 61

Figure 3.16: Mapping rule features in the MapAttrToAttr rule ................................. 62

Figure 3.17: Graphical constraint with story diagram................................................ 62

Figure 3.18: moveAttribute refactoring operation ..................................................... 63

Figure 3.19: splitClass refactoring operation ............................................................. 64

Table 3.2: Refactoring operation overview ...................................................... 65

Figure 3.20: Extracting code fragments of interrest................................................... 68

Figure 3.21: Code fragment of interest ...................................................................... 70

Figure 3.22: sliced code fragment of interest............................................................. 71

Figure 3.23: Type Graph Model................................................................................. 73

Figure 3.24: Simplified annotated abstract syntax graph instance............................. 74

Figure 3.25: IND pattern definition............................................................................ 75

Figure 3.26: Sample IND code and annotated abstract syntax graph ........................ 76

Figure 3.27: IND annotations..................................................................................... 76

Figure 3.28: R-IND pattern definition........................................................................ 77

Figure 3.29: Association code example and pattern definition.................................. 78

Figure 3.30: Replication code example and pattern definition .................................. 79

Figure 3.31: Duplication pattern definition................................................................ 81

Figure 3.32: Sample analysis execution..................................................................... 82

Figure 3.33: Retrieval process statechart ................................................................... 85

Figure 3.34: Code samples for duplication ................................................................ 86

Figure 3.35: Alternative duplication pattern definitions ............................................ 87

Figure 3.36: Imprecise duplication and insert pattern definitions.............................. 88

Figure 3.37: Sample analysis execution with fuzzy values........................................ 89

Figure 3.38: Annotated conceptual view.................................................................... 91

Figure 3.39: Conceptual schema view (data dependencies) ...................................... 91

Figure 3.40: Architecture of the REDDMOM reverse engineering tools...................... 93

Figure 3.41: SplitClass composed transformation example....................................... 94

CHAPTER 4: DATA COMPONENT EXTENSION

Figure 4.1: Web information system adaption ......................................................... 101

Figure 4.2: Data Component Extension: Overview..................................................102

Figure 4.3: Sample of the palliative care conceptual schema...................................105

Figure 4.4: Clustered palliative care conceptual schema: sample 1 .........................106

Figure 4.5: Clustered palliative care conceptual schema: sample 2 .........................106

Figure 4.6: Schema Integration.................................................................................108

Figure 4.7: Application Integration ..........................................................................108

Table 4.1: Relationship Weighting for Clustering ..........................................109

Figure 4.8: Architectural patterns: overview ............................................................112

Figure 4.9: Data Portal Pattern use ...........................................................................113

Figure 4.10: Structure of the Data Portal Pattern......................................................114

Figure 4.11: Participant View of the Data Portal Pattern..........................................115

Figure 4.12: Data Portal Pattern sample ...................................................................116

Figure 4.13: Data Fusion Pattern use........................................................................117

Figure 4.14: Structure of the Data Fusion Pattern ....................................................117

Figure 4.15: Data Fusion Pattern sample..................................................................118

Figure 4.16: Data Transducer Pattern use.................................................................119

Figure 4.17: Structure of the Data Transducer Pattern .............................................120

Figure 4.18: Data Transducer Pattern sample...........................................................121

Figure 4.19: Data Connection Pattern use ................................................................122

Figure 4.20: Structure of the Data Connection Pattern.............................................123

Figure 4.21: Architectural pattern application examples..........................................125

Figure 4.22: Modelled and generated layers.............................................................126

Figure 4.23: Access layer generator overview..........................................................127

Figure 4.24: ACID transaction related patterns [Gra99] ..........................................127

Figure 4.25: Transactional access layer generator overview....................................128

Figure 4.26: Examples of links between different object kinds ...............................130

Figure 4.27: Examples of a <<search>> link............................................................130

Figure 4.28: Sample Data Fusion pattern modelled with a story diagram ...............131

Figure 4.29: Abstract syntax graph publishing.........................................................132

Figure 4.30: Sample publishing layer: web portal....................................................132

Figure 4.31: Sample publishing layers for a data transducer....................................133

Figure 4.32: Architecture of the Reddmom extension tools......................................135

Figure 4.33: Design transformations pushAttribute and generalise .........................136

xvi

Table 4.2: (Re)Design Transformations ......................................................... 136

Figure 4.34: Pattern Editor: pattern instantiation example ...................................... 138

CHAPTER 5: MODEL CONSISTENCY MANAGEMENT

Figure 5.1: Web information system model maitenance.......................................... 143

Figure 5.2: Model Consistency Management: Overview......................................... 145

Figure 5.3: History (GXL) Graph model ................................................................. 147

Figure 5.4: Graph production splitClass .................................................................. 148

Figure 5.5: Template of History Graph Transformation .......................................... 148

Figure 5.6: History Graph Transformation splitClass.............................................. 149

Figure 5.7: Application of production splitClass ..................................................... 149

Figure 5.8: Basic structure of a History Graph ........................................................ 151

Figure 5.9: Sample History Graph ........................................................................... 152

Figure 5.10: Interaction with the History Graph Mechanism .................................. 153

Figure 5.11: Undo History Graph Mechanism......................................................... 155

Figure 5.12: Determine affected transformations..................................................... 156

Figure 5.13: Affected History Graph: simple undo.................................................. 157

Figure 5.14: Updated History Graph: simple undo.................................................. 157

Figure 5.15: Reevaluate transformations ................................................................. 159

Figure 5.16: History Graph: directly affected transformations ................................ 161

Figure 5.17: History Graph: indirectly affected transformations............................. 161

Figure 5.18: History Graph: reevaluated transformations........................................ 162

Figure 5.19: Composed History Graph: overview ................................................... 164

Figure 5.20: Composed History Graph: reevaluation .............................................. 165

Figure 5.21: History Graph Sequence Example....................................................... 166

Figure 5.22: Reevaluation of HG II.a....................................................................... 167

Figure 5.23: Reevaluation of HG III.a ..................................................................... 167

Figure 5.24: Reevaluation of HG IV........................................................................ 168

Figure 5.25: Architecture of the History Graph Mechanism tool support ............... 169

CHAPTER 6: CONCLUSIONS

Figure 6.1 Tool Support for the Data-oriented Reengineering process.................... 174

CHAPTER 1: INTRODUCTION

Observe constantly that all things take place by change, and ac-

custom thyself to consider that the nature of the Universe loves

nothing so much as to change the things which are, and to make

new things like them.

MARCUS AURELIUS ANTONINUS

Roman emperor and philosopher (121 - 180)

1.1 Background: Web Information System Reengineering

Current trends in the field of information technology, like eHealth, eGovernment or ePro-

curement, lead to the emergence of web information systems. Today’s (web) information

systems require constantly functional extensions due to new requirements, i.e., the heter-

ogeneous distributed information systems become increasingly heterogeneous and dis-

tributed. Further, these (web) information systems interface more and more with the

clients through the Web and include a growing number of mobile devices.

These web information systems are legacy systems, i.e., they are inherited systems which

have evolved over years and still evolve. These systems are mission-critical to the com-

panies. Further, these systems are generally poorly documented and only partially under-

stand. New development or even replacement of the legacy systems is often not

practicable. Web information system’s reengineering is the only solution to keep the mis-

sion-critical systems running and to manage their inherent evolution. „Reengineering (...)

is the examination and alteration of a subject system to reconstitute it in a new form and

the subsequent implementation of the new form.“ -- [CC90]

The evolution of web information systems is intrinsic to their existence. The enduring,

highly complex and rapid evolution of web information systems requires continuous re-

engineering processes. Iterations and incremental changes occur during reengineering

processes. Reengineering is typically highly explorative, i.e., human-centered, because of

incomplete and uncertain knowledge. To manage the reengineering process’ complexity

tool support is needed [JW00b].

Maintenance covers over two thirds of the expenses spent in information technology

[Bas90, ZSG79]. Reengineering, including re-documentation, activities accounts for the

largest part of these budgets. Nevertheless, the legacy web information systems have to

be kept running and thus the business logic and the data have to be reengineered. The busi-

ness logic processes the data. Consequently, understanding a web information system

INTRODUCTION

consists of understanding the data as well as the business logic. Finally, the maintenance

of the regained knowledge is crucial for the reengineering process and necessary to avoid

cost explosion.

1.2 Problem Definition

The persistent data structure is the central part of a legacy web information system

[Aik96]. Thus, the basis for web information system reengineering is to understand the

data organisation of the system. The importance of the system’s data and its organisation

is further reflected in the increasing interest dedicated to data security, data mining, etc.

The volume of data, the data heterogeneity and the data distribution in web information

systems determine the system’s complexity. Since volume, heterogeneity and distribution

of the data increase, the complexity of the system increases. Moreover data consistency

(management) increases the system’s complexity. Data consistency is complicated by the

volume, the heterogeneity and the distribution of the data. This complexity of web infor-

mation systems makes data reengineering hard.

The fact that business logic and data (structures) are interwoven, makes data reengineer-

ing of web information systems even harder because it increases the complexity of the

systems further. Unfortunately, systematic processes and techniques for data reengineer-

ing of web information systems are hardly available.

Assume a simple two tier architecture for web information systems composed of the data

and the applications. Existing reengineering approaches cover either the data or the appli-

cations. Feasible data reengineering approaches tackle only single data repositories or da-

tabases, e.g., [EH99, Jah99]. Other approaches focus on application design recovery, e.g.,

[MTW93, GCBM96, KSRL01, Nie03]. At a higher level of abstraction, application archi-

tecture recovery, e.g., [BH99, GAK99, SK01, PG02], identifies the component interac-

tion but rarely establishes a link to the data or application design.

Process guidelines in form of strategic decisions for legacy system reengineering are pre-

sented, e.g., in [Mil98, War99]. System evolution support is often considered only for new

software, e.g., [Gro01, OLHL02], or only of small dedicated parts, e.g., [ORH02]. A pro-

cess that supports system evolution, especially system extension, considering the data and

the applications do not exist. Such a process must recover documentation to enable under-

standing and facilitate reliable extension to keep the legacy web information systems run-

ning.

1.3 Our Data-oriented Reengineering Approach

Our approach is focused on reengineering the data organisation of web information sys-

tems. We call our approach data-oriented reengineering. We base our data-oriented re-

engineering process on component-based software engineering and model-driven

development (for forward engineering). Since the starting point is a legacy system, our

OUR DATA-ORIENTED REENGINEERING APPROACH

process deals with pre-existent (data-related) system parts, i.e., data components (e.g., a

database or a data replication service). Our approach is model-based because abstraction

through models enables better understanding and facilitates system extension.

We provide a semi-automatic process for data reengineering of web information systems

that manages incomplete and uncertain knowledge and supports iterations. To resolve in-

consistency and uncertainty emerging during the process, human interaction is needed. It-

eration is needed to explore the existing system by making assumptions that can be

validated or refuted. To unburden the reengineer from error-prone and recurrent tasks,

tool support automating these tasks is indispensable.

We divided the process in two phases

•Understanding the system’s data components and how they interoperate and

•Adapting the data components to the (new) requirements.

Figure 1.1 shows the data-oriented reengineering process. The first phase is Data-oriented

Reverse Engineering. „Reverse engineering is the process of analyzing a subject system

to (1) identify the system’s components and their interrelationships and (2) create repre-

sentations of the system in another form or at higher level of abstraction.“ -- [CC90].

We start with the reengineering of the data. The data models of the different legacy system

databases are recovered based on Jahnke’s data reverse engineering approach [Jah99].

These data models are recovered and mapped to a representation at higher level of abstrac-

tion. Further, these data models are refactored by the reengineer.

The application code is analysed to restore hidden data dependencies that are fundamental

for the system’s understanding. These data model relationships are retrieved based on

Niere’s (application) design recovery approach [Nie03]. Further, application models are

retrieved and linked to the data models. The result of this understanding phase are Data

Component Models.

The Data Component Extension is the second phase. Since in both phases the Data Com-

ponent Models are transformed, for some activities we use the same tools in both phases.

The adaptation of the data components aims at extending the system by transforming the

retrieved conceptual data component models and design new parts by model-driven de-

velopment. Therefore the retrieved and new data components are clustered and classified

before they are redesigned. We propose architectural patterns to (re)structure the data

components. Based on the regained knowledge an interface to the legacy databases can

automatically be generated to access the data. Finally, the old retrieved and newly created

models can be executed.

To support exploration and iteration during long-lasting reengineering, the Data Compo-

nent Models have to be maintained. Model Maintenance means in this context that the dif-

ferent models are kept consistent to each other, i.e., changes in one model that affect other

models are propagated. Therefore, we provide model consistency management. The web

INTRODUCTION

information system Data Component Models, including their representation at different

level of abstraction, are internally represented as graphs. We elaborate a mechanism,

called History Graph Mechanism, that provides consistency for all model-based systems

parts through explicit model transformation logging.

We consider the two phases as distinguishable states in the recurring cycle of iterations

during data-oriented reengineering. Our process is flexible such that the phases as well as

the activities performed during these phases are extendable and exchangeable. This im-

plies that the participating tools can be exchanged by newer versions or by more adequate

tools, and that new tools can be integrated.

The main research contributions of this dissertation have partly been published and can

be summarised as follows:

Web

Information

System

1. Understanding:

Data-oriented Reverse

Engineering

1.1 Data Model Recovery

1.2 Mapping & Refactoring

2. Relationship Retrieval

2. Adapting:

Data Component Extension

1. Data Component Clustering & Classification

Data Component (Re)Design

2. Architectural Pattern Application

3. Generation and Model Execution

Model Maintenance:

Model

Consistency

Management

History Graph

Mechanism

Figure 1.1: Data-oriented Reengineering process

Data

Component

Models

New

Applications

legacy

system

new

applications

database schemas and partial

application models

data access

information flow

model transformations

model correspondances

DISSERTATION OUTLINE

• We have developed the data-oriented reengineering process to support the mainte-

nance of web information system. The process allows iterations and the manage-

ment of uncertain and inconsistent reengineering knowledge. The regained

information is documented with conceptual models. Enduring reengineering after an

initial reverse engineering phase is possible [SWJ04, Wad03, JSWZ02, WJ02,

GW01, JW99a].

• Our data-oriented reverse engineering is a combination of a data reverse engineer-

ing approach and an application design recovery approach. To this end, we defined

patterns to detect dependencies between the data component models in the applica-

tion code [NWW03, WNGJ02, NSW+02, NWZ01, NWW01, JNW00, JW00a,

JW99c].

• For web information system extension we provide

- clustering to partition the web information system into different data components

and a classification for these data components, and

- architectural patterns to (re)structure the web information systems [JGW02]

before the parts required for extension can be (re)design.

• Based on the Data Component Models we provide code generation for an access

layer and model exection. We maintain all the (regained) knowledge about the web

information system in these models at the different level of abstraction. This enables

code generation for accessing the legacy databases as well as for the newly designed

parts.

• Iteration and exploration during the process as well as code generation are based on

the Model Maintenance. We keep the Data Component Models consistent with our

History Graph Mechanism [WJ03, JWZ02, ZWR01, JW99b].

• We implement our approach in the REDDMOM project as part of the FUJABA TOOL

SUITE (FUJABATS), partially by integrating existent tools, and use a case study in the

Health Care domain as a proof of concept [BGN+03, NNWZ00].

1.4 Dissertation Outline

This dissertation is organised as follows:

Chapter 2 describes a case study in the growing field of Health Care. We describe the

overall case study and point out our involvement. We use this case study to exemplify our

concepts and techniques, i.e., it serves as running example for this dissertation.

Chapter 3 introduces our data-oriented reverse engineering process that enables legacy

web information system understanding. We use a flexible combined data and design re-

verse engineering approach based on graph transformations. The recovered database log-

ical schemas are mapped into a conceptual representation. This rough object-oriented

representation is improved by refactoring operations and enriched with inter-schema de-

INTRODUCTION

pendencies and relationships to application models. Inconsistency and uncertainty occur-

ring during the process is managed using fuzzy logic.

Chapter 4 covers the model-driven extension of the web information systems. To classify

the data components we cluster them following the adopted integration strategy. We pro-

pose architectural patterns for data mediation. Based on the retrieved web information

system and newly designed models, code is generated.

Chapter 5 outlines the consistency management of the compiled data component models.

Our History Graph Mechanism is based on composed graphs of the models where selec-

tive undo operations can be performed. The consistency between model and code is re-

solved by code generation.

All three technical chapters (3, 4 and 5) are closed with the corresponding tool support,

related work and a short discussion. Chapter 6 concludes the dissertation. The transfer-

ability of results is presented and open questions and future directions of this work are dis-

cussed.

CHAPTER 2: DATA-ORIENTED REENGINEERING:

A CASE STUDY

Pain (any pain--emotional, physical, mental) has a message. The

information it has about our life can be remarkably specific, but

it usually falls into one of two categories: "We would be more

alive if we did more of this," and, "Life would be more lovely if we

did less of that." Once we get the pain's message, and follow its

advice, the pain goes away.

PETER MCWILLIAMS, IN LIFE 101

United States Author (1950 - 2000)

The aim of this chapter is to corroborate the need of a data-oriented reengineering with a

case study. Current approaches provide only little support for data-oriented reengineering

of web information systems. Most of them do not consider the evolutionary and explor-

atory nature of the (data-oriented) reengineering process [HEH+95]. Tools are indispens-

able to unburden the reengineer from manually error-prone and/or repeating assured tasks.

A semi-automatic data-oriented reengineering process able to handle inconsistencies and

uncertainty is required. This process must allow human interaction and iteration to sustain

exploration and evolution.

Based on the definitions of reengineering [CC90, TS95] and data reverse engineering

[HA00, HHH+00] we define:

- Data-oriented reengineering is the systematic transformation of existing system

parts concerning the persistent data into a new form to realise improvements. This

includes some data-oriented reverse engineering followed by some data-oriented

forward engineering activities.

- Data-oriented reverse engineering is the process of recovering a conceptional

(high level of abstraction) representation of the persistent data to facilitate the

understanding of the data structure and interrelationships of an existent system.

- Data-oriented forward engineering is the process of modifying an existing

system’s conceptional, implementation-independent data structure representation

followed by the physical implementation.

The case study of the Health Care web information system shows that our tool-supported

data-oriented reengineering approach is able to handle uncertainty and inconsistencies

during the data-oriented reverse engineering process. Further, the case study demonstrates

that our approach supports iterations. In this chapter we give a case study overview, de-

tailled examples are given within the next chapters.

DATA-ORIENTED REENGINEERING: A CASE STUDY

2.1 An Health Care Web Information System

In Health Care data is fundamental. Indeed, without patient data, drug data, health re-

search data, etc., prevention or treatment is not possible. In Health Care, data has long

only been collected, but over the last few years it has been realised that valuable knowl-

edge may be contained within the data. The existing web information systems have to be

extended for achieving data knowledge extraction beside data collection and data storage

[RT02, Wil03].

By using data analysis tools such as data mining tools, knowledge can be extracted from

the data so that it may be used to improve the delivery of Health Care. Knowledge discov-

ery within Health Care has been described as an enumeration of symbols and the arrange-

ment of those symbols into meaningful structures [Cim00]. Such a process is made easier

if the data, representing the symbols, share common terminology and definitions. How-

ever, with different organisations having different data models and data definitions, data

that is gathered for analysis is often heterogeneous. By using mediation technology it is

possible to gather data from multiple health organizations and converge them, e.g., into a

common data warehouse with a common data definition [Wei02].

To build a Health Care web information system, which contains information from a mul-

titude of sites and can be widely accessed, data management facilities have to be put in

place. Only collecting data, storing data and extracting knowledge from data is not

enough. Beside the management of data on the local sites, a global data management is

required. Further various mediation strategies are needed to fulfil the requirements of to-

day’s various static and mobile clients.

A central point of (Health Care) web information systems is that they base on distributed

legacy systems containing the data. These legacy systems have to be analysed before they

can be integrated together.

2.1.1 The Legacy (Web) Information System

The case study involves a Palliative Care Network System in Canada. The goal is to es-

tablish a network between different canadian Palliative Care Centers. The current situa-

tion is that several databases exist in each center. The different centers are not connected

and have different data models. Further a Palliative Care Data Warhouse, where all data

are collected and processed, is planned.

Data from a single Health Care institution is limited in its knowledge discovery potential.

The ideal situation is to incorporate data from multiple sites to increase the knowledge

discovery potential. However, that presents the problem of different formats and defini-

tions of data from different sites. Palliative Care Center Victoria, e.g., has different data

definitions and data collection requirements than either Palliative Care Center X or Y1. A

1. For political reasons, we are not able to make public the names of the involved centers.

further complication is that the data at different institutions may actually be stored in dif-

ferent database applications such as Microsoft Access, Microsoft SQL Server or Oracle.

Another problem, which is experienced, is that there may be different information needs

within the same center. Palliative Care Centers X or Y may gather large volumes of patient

data, whereas individual sub-organisations such as pharmacy, lab and finance may only

need certain subsets of that data.

Another problem is the actual collection of data. As traditionally done in health institu-

tions, much of the data collection such as patient charting is done in a paper-based format

that is later transcribed into an information system. Because that method requires a dupli-

cation of efforts, there is a potential for errors to be made during the process. Collecting

patient data with mobile devices such as PDAs, electronic tablets and cellular phones is a

way to avoid this error prone effort. Many Health Care institutions are experimenting in

this direction [MRF+03]. However, because of differences in data collections from the

different mobile devices it can still give a heterogeneous collection of data.

Currently, the University of Victoria is building this Palliative Care Network System. The

tasks are (1) the reengineering and of the existing systems; and (2) the construction of an

Health Information Grid for sharing the information [BBD+03].

The first task, i.e., data-oriented reengineering, is subject of the thesis. Our involvement

in the project is restricted to a part of this Palliative Care Network System. This has two

main reasons. Firstly, the period of our involvement was limited. Secondly, for political

reasons not all the needed system insights were given to us or can be presented in this the-

sis. Therefore we concentrate on one Health Care web information system: the Palliative

Care Center Victoria.

The second task, the integration of different Health Care systems in a configurable net-

work of interconnected organisations, is achieved by wrapping the existing systems

[OBJ03]. These wrappers use web services and industry standards to provide a uniform

and adaptable interface with functionalities for interoperability as well as secured and

controlled access among the individual systems. This is done within a Grid Federation

Envelope described in [Ona03].

2.1.2 The Considered Target Web Information System

Figure 2.1 depicts the target systems architecture. The Palliative Care Center Victoria is

composed of three Microsoft Access databases (DB), an Application and two portals. After

reverse engineering and wrapping the three databases, the data mediation portal (DMP)

and the WebPortal are constructed. The WebPortal enables new access from inside the

center with browsers. This WebPortal runs in parallel with the existing applications. The

data mediation portal (DMP) provides access through Data Mediation Services to the oth-

er two subsystems.

DATA-ORIENTED REENGINEERING: A CASE STUDY

The system incorporates three kinds of static clients in form of browsers. Except the data

flow from the WebPortals to the browsers, all other data is exchanged via Data Mediation

Services.

The Data Mediation Services play a central role in this architecture by managing the data

flow between the other subsystems. The benefit of such services is their flexibility, loose

coupling and technology independence. Each service passes the data to a mediator that

puts the data into a common format before storing it in a database. The data, regardless of

its initial format or the application in which it was created, is now in a format which is

much more valuable for data analysis and knowledge discovery.

The Information Management System is composed of a database (DB), an Application and

portals. The database stores all system users’ information like roles, access rights, docu-

ment versions, etc. The Application incorporates the management tasks regarding access

Palliative Care

Center Victoria

Medical

Diagnosis ES

Information

Management

System

Data

Mediation

Services

Figure 2.1 Health Care Web Information System: Considered Target System

web portal or data

mediation portal (DMP)

connection

DMP

Internal

Center

Browser

DMP

WebPortal

Physician

Browser

Patient

Browser

WebPortal

Application

DMP

Appli-

cation

Appli-

cation

WebPortal

data flow

(legacy) subsystem

database

web browser

application

THE DATA-ORIENTED REENGINEERING PROCESS

rights and documentation. The WebPortal allows access from all browsers with respect to

the users’ role. The data mediation portal (DMP) is used, via the Data Mediation Services,

by the other subsystems for user identification or document access. Further it is extended

to handle multimedia data.

The Information Management System is based on two systems, DSD and BEETLE, that are

customised for handling documents in a medical context. At the software engineering

group from the University Paderborn the distributed software development (DSD) system

[GGT01] was build. The core of the system is based on the concurrent version system

[CVS]. The concurrent version system functionality is encapsulated in Jini services

[AOS+99]. A pure Java client permits the access of the Jini services and the Sybase data-

base. Moreover a document report system called BEETLE exists. This system is based on

a MySQL database, Java and Java Server Pages. The combination of these two systems

enables the versioning and processing of documents during workflows in the Health Care

domain.

The Medical Diagnosis Expert System is a prototype that includes one database (DB)

hosting the knowledge base. The Application includes the inference and the inference Ap-

plication Programming Interface (API). This API is accessible through two portals. The

WebPortal permits access from a browser used by physicians. The data mediation portal

(DMP) permits the access of electronic patient records for the diagnosis and treatment.

2.2 The Data-oriented Reengineering process

2.2.1 Understanding Phase

We follow a divide and conquer approach in the understanding phase and aim for a rapid

(semi-)automatic bottom-up reverse engineering step followed by a top-down explorative

human-driven reverse engineering. This activity is based on Jahnke’s database reverse

engineering approach VARLET [Jah99] and Niere’s incremental pattern instance recogni-

tion approach [Nie03].

Figure 2.2 gives an overview of the understanding phase. Conceptual schemas of the da-

tabases are retrieved. During this process logical schemas are also recovered. The rela-

tionships are retrieved from the application code. In addition to the schemas, parts of the

applications that are related to the schemas are also retrieved. The results of this phase is

one cluster that contains data models from the different (web) information system.

The process starts with the reverse engineering of the logical schemas. Figure 2.3 sketches

the schemas of the three databases, namely hospdata, bvmtdata and outcomes_be, of the

Palliative Care Center Victoria. The parsed initial logical schemas only contain the prima-

ry and foreign keys that are explicitly defined in the databases. This step is followed by

the schema analysis that consists of structural completion and semantical enrichment

[Jah99]. The next step is the mapping from the logical schemas to conceptual schemas. A

set of triple-graph-grammar rules [SL96] creates this conceptual representation in the

DATA-ORIENTED REENGINEERING: A CASE STUDY

Unified Modeling Language (UML) [BRJ99]. The resulting conceptual schemas are then

refactored by the reengineer to improve the understandability. Since these steps stem from

the VARLET approach, we skip the details here, and refer to Section 3.2 and [Jah99].

data-

base

logical

schema

conceptual

schema

appli-

cation relationship

partial appl.

model

cluster information flow

data access

bvmtoutcomes

hopsdata bvmtdata

outcomes_be DB_DSD

BeetleDB

VCM

Bug_Report

Figure 2.2: Case study understanding phase

Figure 2.3: Parsed data models overview

hospdata bvmtdata

outcomes_be

THE DATA-ORIENTED REENGINEERING PROCESS

The results of these first reverse engineering steps are independent refactored conceptual

schemas. Since the VARLET approach supports an iterative explorative process that man-

ages inconsistencies, in this phase the reengineer has two possibilities. The reengineer can

either rapidly produce uncertain abstract results or analyse sequentially the (independent)

schemas completely.

A crucial factor for understanding the web information system’s data structure is the re-

trieval of the implicit inter-schema relationships. Figure 2.4 shows an example for a typ-

ical situation occurring in web information system evolution. Two tables Bereavement

Reason exist in two different databases (hospdata and bvmtdata) storing the same infor-

mation at the beginning of system’s evolution. During maintenance activities only the ta-

ble in the bvmtdata database is updated. The table in the hospdata database remains as

inconsistent copy. Such relationships are valuable indicators for maintaining the system,

i.e., for the reengineering responsible for the system extension and improvement. Other

examples of inter-schema relationships are described in Section 3.3 and [WNGJ02].

The detection of such inter-schema relationships is based on source code analysis with a

pattern instance recognition mechanism. The first step is the identification of code frag-

ments of interest with island grammar parsing [vDK99, Moo01] and slicing [Wei84].

These code fragments are then analysed with Niere’s incremental pattern instance recog-

nition approach [Nie03]. Patterns in source code that indicates (inter-schema) relation-

ships are not certain. Niere’s approach enables an iterative recovery process definition

that can handle uncertainty and is based on (composed) pattern definition. Examples are

given in Chapter 3. Here again the chosen approach enables an iterative explorative pro-

cess that manages inconsistencies, which rapidly produces first results.

Figure 2.5 shows some inter-schema relationships of the three databases hospdata, bvmt-

data and outcomes_be. The central database is the hospdata database which contains all

Patient related data. The outcomes_be database stores results from examinations and fi-

nally the bvmtdata database includes information related to bereavement. Figure 2.5

shows (four relationships inside the hospdata schema and) four relationships between da-

Figure 2.4: Table clone example

‘Bereavement Reason’ tables in the hospdata

and bvmtdata databases

DATA-ORIENTED REENGINEERING: A CASE STUDY

tabase schemas. Firstly, we have the duplication between the two Bereavement Reason

classes. Secondly, a duplication of the deathdate between Patient and Deceased is re-

trieved. Thirdly, a replication between DrugProfile and tblDose is detected, which main-

tains information about the last drug dose, purpose and route that was given to a patient.

Fourthly, an inter-schema foreign key relates classes Patient and tblOutcomes.

Finally, the conceptual schemas can be refactored, compared to each other and views can

be build from parts of the different schemas. One task from the adaptation phase can be

used in the understanding phase: clustering. For further understanding clusters can be

build [TWBK89] and used as views for exploring different system aspects.

2.2.2 Adapting Phase

After understanding major parts of the system, the reengineer needs support for exten-

sions. Figure 2.6 gives an overview of the adapting phase. New system parts were con-

structed. The different old and new data component models are clustered, (re)structured,

(re)designed and connected. Finally, new application are generated that accesses existing

or new databases.

The first step in the adapting phase is clustering the web information system parts to data

components. Out of the recovered conceptual schemas and their interrelationships, classes

can be grouped according to their interrelationships. An example is the clustering pro-

Figure 2.5: Palliative care conceptual schema excerpt in UML

hospdata

outcomes_be

bvmtdata

THE DATA-ORIENTED REENGINEERING PROCESS

posed by [SPPB02] that groups classes (representing entities) according to the primary

keys of their corresponding entities, cf. Figure 2.7.

Further, in iteration with redesign, the clustering can support the reengineer in the concep-

tual partitioning of the data related parts based on relationships to application classes, i.e.,

construct external schemas. In this case study, an external schema to the Palliative Care

Center Victoria databases was constructed that interfaces the Diagnosis Expert System

prototype application. Moreover, clusters of classes to data components and clusters of

data components are built and classified into corporate data components, data mediation

components and ubiquitous data components to support the reengineer.

The next step is redesign. We propose four architectural patterns to facilitate data media-

tion between the different data components. The Data Portal pattern can be seen as a fa-

cade or interface to data components. The Data Fusion pattern and the Data Transducer

pattern merges and transform exchanged data. The exchange itself is realised with the

Data Connection pattern that cope with update, synchronisation and reachability prob-

lems of participating data components. To redesign the models themselves we provide a

(design) pattern instantiation mechanism and a redesign transformation catalogue. Rede-

signed or newly designed data components are included in an existing data component

cluster or form a new one.

data-

base

logical

schema

conceptual

schema

appli-

cation relationship

partial appl.

model

cluster information flow

data access

Document Management

Patient Data Application

Figure 2.6: Case study adating phase

Diagnosis Expert System

hopsdata bvmtdata

outcomes_be Knowledge Base DB_DSD

BeetleDB

existing

parts

existing

parts

DATA-ORIENTED REENGINEERING: A CASE STUDY

In this case study a prototype of a Medical Diagnosis Expert System was constructed.

Currently the medical diagnosis expert system is a sample prototype. The prototype is

composed of a Knowledge Base, the Expert System application and a user interface (Di-

agnosis GUI). These data components collaborate with the existing web information sys-

tem of the Palliative Care Center Victoria. Figure 2.8 shows an overview of this extension.

Figure 2.7: Clustering example

class

association

cluster

Bereavement

Reason

Patient Drug Profile

Deceased

tblDose

Drugs

<<copy>>

bereavement

reason

deathdate

reason

bereavement

profile

drugdose

<<replica>>

drugs

tblOutcomes

outcomes

<<inter-schema>>

BVMNT_CODE

VHS_ID# DrugID

OutcomesID DoseID

0..n 0..n Bereavement

Reason

0..n

0..1

0..n

0..1

0..n 0..n

0..n

0..1

0..10..1 0..n

0..n

Figure 2.8: New data components integration examples

service

Diagnosis

GUI

hospiceGUI

Knowledge Base

outcomes

Expert_System

bvmtdata

palliative care

schema

outcomes_be

hopsdata

data flow

persistent tier

application tier

presentation tier

black

grey

dashed

existing parts

integrating parts

extending parts

THE DATA-ORIENTED REENGINEERING PROCESS

The model correspondences retrieved during the understanding phase and added during

the adapting phase are used to generate an object-oriented access layer to the existing da-

tabases.

The adapting phase terminates with model execution facilities. Modelling and generating

an entire application, using the chosen (middleware) technologies, is a hard task. The aim

in this thesis for model execution is on the one hand to facilitate the modelling and gen-

eration of small data mediation components (web services) that exchange data in eXch-

nageable Markup Language (XML). On the other hand to facilitate prototyping based on

the retrieved system models.

2.2.3 Model Maintenance

The data-oriented reverse engineering and the data component extension are not fully au-

tomated nor performed sequentially. On the contrary, they are part of an iterative process,

i.e., reengineers explore the web information system in several iterations to understand

and adapt it. Therefore the models have to be maintained. The data models and their inter-

relationships have to be consistent to preserve the consistency of the data. We developed

the History Graph Mechanism that provides model consistency through explicit model

transformation logging and selective undo for iterations, cf. Chapter 5. Figure 2.9 depicts

that changes ( ) can then be traced through the models ( ).

data-

base

logical

schema

conceptual

schema

appli-

cation relationship

partial appl.

model

cluster information flow

data access

bvmt

Document Management

outcomes

Patient_Data

Figure 2.9: : Case study administrating phase

Expert_System (Diagnosis)

hopsdata bvmtdata

outcomes_be Knowledge Base DB_DSD

BeetleDB

VCM

Bug_Report

DATA-ORIENTED REENGINEERING: A CASE STUDY

A sample is depicted in Figure 2.10. A duplication between class Deceased and class Pa-

tient is retrieved in a first understanding phase. The duplication is based on the attributes

deathdate and death_date.

In a second understanding phase the reengineer detects variants in table ’Patient’. This has

to be reflected in the conceptual model, i.e., class Patient becomes the superclass of new

classes DischargedPatient and DeceasedPatient. Because attribute death_date has

moved from class Patient to class DeceasedPatient, the duplication between class De-

ceased and class Patient is no longer valid. Instead it has to exist between class Deceased

and class DeceasedPatient.

The History Graph Mechanism reestablish consistency by undoing previously applied

model transformations. This means that the creation of the duplication between class De-

ceased and class Patient is undone. Other schema transformations, e.g., during clustering

or redesign, are also affected from this change. Not all of them are necessarily undone;

only those transformations that are not longer valid are undone. Finally, the reengineer

can browse the list of undone schema transformations and may reapply some of them in

a slightly different context like it is the case here for the duplication deathdate between

class Deceased and class DeceasedPatient.

Figure 2.10: Consistency violation example

Patient class inheritance

DischargedPatient

discharge_date: timestamp

discharge_time: timestamp

Patient

VHS_ID#: string

...

Patient

VHS_ID#: string

...

discharge_date: timestamp

discharge_time: timestamp

...

death_date: timestamp

death_time: timestamp

place_of_death: string

Deceased

VHS_ID#: string

...

deathdate: timestamp

<<copy>>

deathdate

<<copy>>

deathdate

association

<<copy>>

deathdate

Deceased

VHS_ID#: string

...

deathdate: timestamp

DeceasedPatient

death_date: timestamp

death_time: timestamp

place_of_death: string

change

0..1

THE DATA-ORIENTED REENGINEERING PROCESS

For the sake of completeness, we shortly report about the Information Management Sys-

tem that coordinates the management task regarding access rights and documentation. In

the understanding phase the two systems DSD and BEETLE were reverse engineered.

Since DSD and BEETLE are stand alone applications, the outcome of the understanding

phase were two coexisting conceptual schemas that handle similar data. One problem was

that they were based on two different databases. Data was hold twice without replication

concept neither implementation. Further different client technologies was used and the

data access was implemented with embedded Structured Query Language (SQL) [Dat89]

statements.

In order to use this system as an information management system in Health Care follow-

ing requirements had to be fulfilled. The document handling had to be extended to multi-

media data, e.g., the parallel versionning of image files (radiology pictures) with

corresponding audio files (explanations), the management of different movie files that be-

long together (sequences from an operation), etc. Further, the integration of the two data-

base schemas was needed to achieve consistent document management. After schema

integration, two options arise. Firstly, use the new virtual schema as external schema and

stepwise migrate both applications to use it. Secondly unify the databases to one database

and stepwise migrate both applications. We choose the second option because of technical

reasons. The concrete resulting tasks (database integration, application migration and ex-

tension) are not further disscused in this thesis.

DATA-ORIENTED REENGINEERING: A CASE STUDY

CHAPTER 3: DATA-ORIENTED REVERSE

ENGINEERING

If data cannot be correctly understood, it cannot be combined

with other information. Instead, it is just data pollution.

UNKNOWN

3.1 Reverse Engineering Data Components

Web information systems consist of large amount of heterogeneous data spread over mul-

tiple locations and platforms. These systems have to be understood for maintenance and

especially for extension. As the persistent data structure is the central part of a legacy

(web) information system [Aik96], our data-oriented reverse engineering approach fo-

cusses on the web information system parts related to the data, i.e., data components:

A data component is a unit that encapsulates capabilities and functionalities to

store and/or manage data.

In this thesis a data component is mainly composed of one or more models, i.e.,

persistent data models (schemas) and/or application models, and relationships

between the models. A data component can be composed (1) exclusively of sche-

mas, (2) exclusively of application models or (3) of a combination of schemas and

application models.

Further, each data component has a defined interface to interact with other data

components. Data components can be composed and exchange data via connec-

tions. A data component can be deployed independently. The basic idea of a data

component is based on the component / connection philosophy presented by

Szyperski [Szy99]. Note that the legacy data components may not fulfill all these

characteristics entirely.

We distinguish between three kinds of data components:

• data components within corporate organisations (corporate data components),

• data components among corporate organisations (data mediation components) and

• data components between corporate information systems and mobile, embedded

smart devices (ubiquitous data components).

The focus of our approach is the recovery and retrieval of the schemas from the distributed

databases of legacy (web) information systems. This implies that along with the schemas

the dependencies resulting from distribution have to be reverse engineered. The black

DATA-ORIENTED REVERSE ENGINEERING

parts of Figure 3.1 illustrate this situation. By analysing the databases and the legacy ap-

plications, (1) the logical schemas are recovered and then (2) the conceptional schemas

with inter-schema relationships and parts of application models are retrieved. Several ap-

proaches and tools exist for these reverse engineering subtasks. In this approach we use,

combine and extend some of them. The light grey parts of Figure 3.1 can be ignored here.

3.1.1 Reverse Engineering Steps

We divide the reverse engineering of web information systems into two steps. Firstly, the

schema of each database is recovered and conceptualised. Secondly, the data dependen-

cies between these schemas and the application model parts are retrieved.

Figure 3.2 gives an overview of these reverse engineering activities. Four data compo-

nents are depicted but only two of them store data and consequently contains schemas. In

the activity 1.1 Data Model Recovery1 the physical schema is parsed and represented in

an Extended Entity Relationship (EER) [Che76, BCN92] notation. This EER model rep-

resents the logical schema. It is (structurally) enriched and (sematically) completed during

an iterative explorative analysis process involving reengineers and domain experts. Next,

the logical schema is mapped to a conceptual schema. This initial object oriented concep-

tual schema in UML notation can then be refactored by model transformations (1.2 Map-

ping and Refactoring).

1. The Activity 1.1 Data Model Recovery is represented twice for layout reason.

LS1 LSi LSj LSm

data-

base

logical

schema

conceptual

schema

appli-

cation relationship

partial appl.

model

cluster information flow

data access

DB1

legacy App.

CS1

new App.

DBi DBj DBm DBp

DBn

legacy App. legacy App.

CSi CSj CSn

LSn

CSp

LSp

new App.

Figure 3.1: Web information system understanding

new App.

CSm

REVERSE ENGINEERING DATA COMPONENTS

Once the schemas are identified, enriched and completed, the definition of how they in-

teroperate is needed. In the second process activity 2. Relationship Retrieval the relation-

ships between the different data component models are investigated. This task also

involves code reverse engineering and consequently uncertainty management. Again

(only) a semi-automatic, iterative approach is feasible. The reengineer typically refactores

the resulting conceptual schemas.

Logical ERR

Schema

Representation

Reengineer

Domain

Expert

1.1 Data

Model

Recovery

2. Rela-

tionship

Retrieval

Reengineer

Med.

Port.

Inter-package

Relationship

Details

Data Component

Extension

conceptual dependency

information flow

link to next process step

Data Component

Data

Compo-

nent

Data

Compo-

nent

Data

Component

Doc.

Code

Doc.

Code

Data

Component

(class diagram)

Figure 3.2: Data Component Reverse Engineering - Overview

1.2 Mapping

and Refactoring

Domain

Expert

1.1 Data

Model

Recovery

DB DB

Conceptual UML

Representation

Logical ERR

Schema

Representation

Doc. Documentation DB database

Code application code

data

component

activity data component conceptual representation

(DB: Database; Med.: Data Mediation; Por.: Data Portal)

data component logical

representation

users and

experts relationship

list

DATA-ORIENTED REVERSE ENGINEERING

From each activity, backward iterations to the precedent process activities are needed be-

cause of misinterpreted, incomplete or even erroneous intermediate results. Summarising,

the two activities of the data-oriented reverse engineering process are done in iterations.

3.1.2 Relationships and Data Dependencies

Web information systems underlie frequent changes and integration. In contrast to non-

distributed systems no overall structural representations are available and thus system un-

derstanding is rather difficult. To understand the system data components entities and re-

lationships, the data models have to be reverse engineered. While recovering entities is

relatively simple, relationship retrieval can become quite complex.

Often occurring relationships are foreign keys or attributes named like attributes of other

entities (“immaterial foreign keys“). Complex functional dependencies are used when the

relevant relationships are computed on-the-fly by combining the values of multiple stored

attributes. While revealing the first case is relatively simple [Jah99], instances of the latter

case are rather hard to detect. Moreover, assume a set of database entries which is spread

over the different databases. They might depend on each other in various different ways

and the intended coupling can have different semantic properties. Relevant relationships,

however, will manifest themselves in the applications.

Relationships are dependencies between entities based on attribute relations. These rela-

tionships exist between attributes within an entity or between attributes of different enti-

ties. Further relationships exist between attributes of entities in different schemas. We

distinguish between:

• intra-entity relationships:

relationships between attributes within one entity.

• inter-entity relationships:

relationships between attributes of different entities, that we divide in:

- intra-schema relationships:

relationships between attributes of different entities within one schema.

- inter-schema relationships:

relationships between attributes of different entities of different schema.

Figure 3.3 gives an overview of relationships that we consider, as a class diagram. Exam-

ples will be given in Section 3.3.2; others can be found in [WNGJ02].

A data component model is composed of Entities with Attributes and Relationships. A Re-

lationship is a Homonym, a Primary Key, a Data Dependency, a Usage Relationship or a

Join. A Data Dependency is an inclusion dependency (IND), a Redundancy dependency

or a Constraint. Relationships classes are mostly based on Attribute indicators, i.e., rela-

tions between Attributes and attribute properties. Attribute properties (AttrProp) are:

• name similarity (NS)

names of two attributes are syntactically close to each other

REVERSE ENGINEERING DATA COMPONENTS

• name equivalence (NE)

names of two attributes are equal

• type compatibility (TC)

values of two attributes can be exchanged without converting the values, e.g., ’char’

and ’varchar(18)’, ’varchar(30)’ and ’varchar(50)’ or ’numeric’ and ’double’.

• type equivalence (TE)

types of two attributes are equal

AttrProp

AttrSim

Attibute Entity

Join

Relationship

PrimaryKey

Homonym

Association

IND

Data Dependency

Redundancy

Replication

Usage Constraint

Figure 3.3: Relationships and Data Dependencies Overview

DELETE

AttrNS

AttrNE

AttrEqu

0..n 0..n

used

1..n

0..n

0..n 0..n

1..n

0..n

0..n 1

1..n

1..2 0..n

0..1

2..n

0..1

1..n

0..n

0..n 2

1..n

participate

joins

attrs entities

keys

homonyms

relships

similarities

propreties

composed_by

composed_of

Surplus

Inheritance

Duplication

Synonym

UPDATE

INSERT

SELECT

Entity class

1..2 0..n

pk association

generalisation

Aggregation

2..n

0..n

DATA-ORIENTED REVERSE ENGINEERING

We combine these attribute properties to four different indicators for attributes:

• attribute similarity (AttrSim=NS&TC),

• attribute name similarity (AttrNS=NS&TE),

• attribute name equivalence (AttrNE=NE&TC) and

• attribute equivalence (AttrEqu=NE&TE).

Note that the attribute indicator classes follow an inheritance hierarchy, e.g., if two at-

tributes are equivalent they are also similar. This also holds for other classes involved in

inheritance hierarchies, cf. Figure 3.3.

Primary Key

The sole pure intra-entity relationship we consider here is the PrimaryKey. A primary key

is one attribute or a set of attributes that enables the unambiguous identification of an en-

tity. This concept stems from the relational database model [Cod70].

Homonym

A relationship where two attributes of two different entities have the same name but store

different information is a Homonym. These are attributes that are not related through the

application code but solely by name equivalence. Homonyms help to avoid confusion

about equivalent named attributes during the understanding process. Note that omitted re-

lationship update during integration may lead to false homonyms. Indeed, attributes with

the same name that store the same information but without explicit relationship in the ap-

plication code can be recovered as homonyms. Checking the data can help the reengineer

to detect such false homonyms.

Joins

The basis for all data dependencies are relationships between attributes, i.e., Join opera-

tions (joins). Following SQL we consider four types of joins: INSERT, DELETE, UP-

DATE and SELECT. Attributes in access code (not only SQL statements) that implements

such a join operation will be assigned as taking part in the operation, i.e., joins are

composed_by attributes.

Data Dependencies

Data dependencies are relations between entities, more precisely between their attributes.

We classify them into three basic kinds:

• redundancy dependency:

the same information is held - and maintained - (at least) twice

• inclusion dependency:

an attribute (set of attributes) in one entity holds a part or the same information as an

attribute (set of attributes) of a second entity

• constraint dependency:

condition(s) over two or more data dependencies to assign information

REVERSE ENGINEERING DATA COMPONENTS

Redundancy Dependency

The first inter-schema Data Dependency type is called Redundancy dependency. A re-

dundancy dependency can be Synonym, Surplus, Duplication or Replication. In our ap-

proach redundancy dependencies must be related within the application code.

Synonym

Synonyms are attributes that hold the same information but have different names. Non

name similar, type compatible attributes where information is inserted and updated to-

gether are synonym candidates. Exclusively an insert operation or exclusively an update

operation is not sufficient to indicate a synonym. Therefore, synonyms imply joins (at

least one insert and one update), type compatibility but no name similarity (includes name

equivalence).

In some cases data is never updated because the information system uses keys that are

never altered. For this case further more complex investigation has to take place which is

based on attribute semantic resemblance. Semantic resemblance and in consequence se-

mantic equivalence (synonym) are hard to determine and subject of current work.

Surplus

We have "real" redundancy if the same information is maintained but not inserted togeth-

er. In addition to attribute similarity, a Surplus dependency is revealed when the attributes

are updated but not inserted together (at least one update but no insert operation).

Duplication

We talk about Duplication if an explicit copy is made (at specific points in the application)

but the copied information is not kept consistent. Thus, attribute similarity with at least

one insert operation but no update operation indicates a duplication. Note that the inser-

tion do not have to take place at the same time, i.e., the insert operations may be spread

over the code. Such duplications are hard to detect.

Replication

Replication is an explicit copy, which is held consistent, i.e., a controlled redundancy.

Thus, it is the occurrence of attribute similarity and at least one insert and one update op-

eration. In general these operations will be performed together, but, in analogy with the

duplication, the operations can be distributed in the code.

IND (Inclusion Dependency)

Inclusion dependencies (INDs) are known from (single) relational databases [EN94]. We

identify an IND as a data dependency where the attributes are similar and at least one se-

lect operation exists. They build the basis for interpreting the semantics of foreign keys

DATA-ORIENTED REVERSE ENGINEERING

(associations and inheritances). We use the classical definition of the inclusion dependen-

cy in data reengineering. We classify INDs into R-INDs, C-INDs or I-INDs according to

[FV95]. An R-IND describes a relationship which is not separated by a separate entity (re-

lation schema). An C-IND describes a cardinality constraint and an I-IND describes an is-

a relationship (inheritance).

Association (R-IND, C-IND)

An Association is an inclusion dependency, i.e., an R-IND with an C-IND as option. An

R-IND is an IND where the included attribute (set of attributes) is a PrimaryKey of the cor-

responding data (Entity). In contrast to foreign keys, an association can have a n:m cardi-

nality. This corresponds to a reference table and a 1:n foreign key and a 1:m foreign key.

In case of EER diagrams, which represent databases, we only have inclusion dependen-

cies. The cardinality of an association is determined by the R-IND and the C-IND if it is

classified as one.

Aggregation

An Aggregation is a special kind of association which represents a „has_a“ relationship.

In contrast to an association, an aggregation represents a „whole / part“ relationship mod-

el [BRJ99].

Inheritance (I-IND)

Inheritance is a generalisation.

„A generalization is a relationship between a general thing (called the superclass or

parent) and a more specific kind of that thing (called subclass or child).“

[BRJ99]

In terms of EER diagrams it is an IND classified as I-IND.

Constraint

Dependencies that relate data (attributes) in more complex manner are classified as Cons-

traint dependencies. A constraint incorporates parts of the application model, i.e., business

logic. Therefore an exact definition as well as a sub-classification of constraint dependen-

cies are hard to give. The notion of a constraint can be refined when used in further anal-

ysis and re-design steps to identify the essential business rules and the business logic in

systems that affect multiple databases.

Usage Relationship

In addition to the data dependencies it is useful for the reengineer to understand relation-

ships between the persistent and transient parts of the web information system. We clas-

sify this kind of relationship as Usage relationship. A usage relationship is the usage of

REVERSE ENGINEERING DATA COMPONENTS

data by transient parts of the web information system, i.e., an interface using the data de-

pendencies for the connection to the "transient world", e.g., the application or the internet.

3.1.3 Model Representation

Data storage is mainly done in tables, records or files. Data management uses multiple

data types which are assigned with values from the data storage structures. To represent

these aspects we consider two models, namely the EER model and UML diagrams as ob-

ject-oriented model.

Internally we use an object-oriented abstract syntax graph to represent the data component

models (EER and UML) as well as the data manipulating code (application models). We

refer to [Zün01] and [Nie03] for details.

We use EER diagrams to represent data model’s data structures at the logical view level.

A logical schema is composed of entities with attributes and relationships. The attributes

have assigned data types which allows the distinction of the type of data stored. Attributes

are used for identification (primary key) and search (indexes). The different relationships

are represented with foreign keys or integrity constraints.

UML Class diagram

The UML enables the conceptual representation of data management aspects indepen-

dently from the physical data storage kind. Instead of entities we have classes with at-

tributes. Relationships are associations, aggregations, generalisations and constraints.

Moreover, at the conceptual level, we consider data dependencies that occur with the dis-

tribution of the data component models. Of course also data types are used although they

are different from the physical ones.

The reasons to introduce a conceptual representation are on the one hand the higher level

of abstraction for understanding and on the other hand the representation of behaviour.

Data management, i.e., data access and manipulation, is done in methods, procedures or

functions which can be represented with UML.

As mentioned we represent an entity in the conceptual schema as a class. We differentiate

the entities, i.e., persistent classes, from transient classes that do not represent persistent

data structures. We mark a persistent class with the stereotype <<persistent>>. Attributes

are represented with their respective types according to the type mapping.

We represent all relationships of Figure 3.3 except the attribute similarity (including sub-

classes), INDs1 and Joins. They are schema annotations from the code that are used to re-

cover data dependencies. Beside the fact that representing all these three kind of

1. Note that INDs are represented in the logical schema (EER diagram) but only refined INDs, i.e.,

R-INDs, C-INDs and I-INDs, are mapped to the conceptual schema (UML class diagram).

DATA-ORIENTED REVERSE ENGINEERING

relationships will overload and confuse the schema representation, their information is en-

closed in the represented data dependencies.

Relationships are represented with UML associations except primary key, inheritance and

constraint. We conserve the primary key information, when existing, to enable mappings

from classes (conceptual schema) to entities (logical schema). Attributes belonging to a

primary key are underlined. Inheritance is represented by UML generalisation. A con-

straint is represented within an additional class with stereotype <<constraint>> containing

the constraint itself. The relationships to the (attributes of the) classes involved in the con-

straint are represented as aggregations, cf. Figure 3.4.

All relationships represented as associations and are always assigned a stereotype, except

the association itself:

• Homonym <<homonym>>

• Synonym <<synonym>>

• Surplus <<surplus>>

• Duplication <<copy>>

• Replication <<replica>>

• Association none or <<inter-schema>>

• Usage Relationship <<usage>>

The abstract relationships used for the relationship retrieval are of course not assigned a

stereotype because they cannot be instantiated. To further improve the understanding we

distinguish between intra-schema and inter-schema associations. We add a stereotype

<<inter-schema>> for inter-schema associations, cf. Figure 3.39.

Figure 3.4: Conceptual schema view (relationships)

REVERSE ENGINEERING DATA COMPONENTS

UML Package diagram

We represent the logical schema with EER diagrams, the conceptual schemas with UML

class diagrams, but how can we represent architectural aspects like distribution, redundan-

cy or views? Since the architectural aspects are considered based on the conceptual sche-

mas, using UML for representation purposes seems to be the most appropriate notation.

Schemas are represented by package diagrams, i.e., the entities, attributes and relation-

ships are represented in class diagrams which are enclosed in a package. The relationships

between the entities of different packages are grouped and represented as inter-package

relationship. Further, we employ the following UML notation elements:

• stereotype, to differentiate packages (and classes and relationships)

• note, to show comments or annotations.

We further use packages to visualise packages of data manipulating classes, e.g., con-

straints, web interfaces or portals. Representation problems occur when a class is con-

tained in more than one package that is to visualise. Such we restrict the representation of

inter-connected packages in package diagrams to packages that do not have overlapping

classes. If views are build such that a class can participate in more that one view, we rec-

ommend to use view diagrams instead of package diagrams.

Nevertheless, inter-connected packages are useful to represent certain kind of views on

the system. Of course the representation of the physical aspects (how the system is phys-

ically distributed) is one of these views. This corresponds to the conceptual schemas and

the manipulating data classes as they are stored and organised in the databases and files,

cf. Figure 3.5. Another example of a system view divided in disjunctive connected pack-

ages are representations resulting from clustering. A clustering criterion may be to group

the classes related by inheritances and redundancy relationships. For such purposes pack-

ages that contain packages, as long as the packages remain disjunctive, can be visualised.

Figure 3.5: Pysical aspect view as package diagram

<<database>>

hospdata

<<database>>

bvmtdata

<<database>>

outcomes_be

<<directory>>

outcomes

<<directory>>

bvmt

<<directory>>

admin

<<directory>>

service

DATA-ORIENTED REVERSE ENGINEERING

In case that view diagrams are judged not to be appropriate, a set of overlapping views on

the system can also be organised in a package diagram. Overlapping packages can be vi-

sualised together without any connection between them. Such a diagram may be the result

of manual analysis or can be the result of the automatic application of a view builder. Then

the packages will be represented next to each other without any (visible) connection. Only

the package name will guide the reengineer. Again, packages can be organised in sub-

packages if desired.

View diagram

A view diagram is a sub-diagram of a class diagram where only selected classes and rela-

tionships are visualised. The idea of views is not new and well known in the database field

[EN94, Dat00]. Database (logical) views are directly mapped to conceptual schema

views, cf. Figure 3.11. Note that a materialised view correspond more to a logical schema

than to a database view since they are redundant subschemas and should be handled ac-

cordingly.

Further (conceptual) schema views can be defined by the reengineer. Schema views are

representations of parts of schemas. That these parts are from a single schema is not man-

datory. Since schema views represent aspects from the whole system, they are not fixed

because changes of the underlying schemas should be reflected in the views. The reengi-

neer can use predefined filters or define filters which dynamically calculate the schema

views.

A filter is defined through the context of a class (schema entity) or a set of classes. The

context of a class is the set of reachable classes from this class via relationships. The ex-

ample of Figure 3.6 shows the 1-context relative to the inheritance relationships of class

Patient. Operations can be either applied to schema elements in the corresponding under-

Figure 3.6: Conceptual schema view example

DATA MODEL RECOVERY

lying schema or in one of the schema views. Note that also data manipulating (transient)

classes can be part of the views. Further details concerning view and filter definition can

be found in [Rec01].

3.2 Data Model Recovery

Recovering the data structures of the web information model is the basis of the data-ori-

ented reengineering process. We enhance the understanding of the data models by provid-

ing a conceptual knowledge-centric view on the models. We use annotations on the data

models to represent the automatically and manually recovered knowledge. A fundamental

aspect is that data model recovery is embedded in the iterative reverse engineering pro-

cess.

The data model recovery is divided into two main activities: recovering the physical sche-

ma into a logical schema and constructing a conceptual schema. In the first activity the

explicit data model elements can be parsed but hidden schema parts have to be recovered,

i.e., data model elements that are implicitly enclosed in the schemas. The construction of

the conceptual schema is done be an automated ad-hoc mapping followed by a manual re-

factoring phase.

3.2.1 Schema Recovery

The data of the data repositories participating in a web information system is organised in

schemas. A data model can be represented by different schemas at different levels of ab-

straction. The common schema levels are the physical schema, the logical schema and the

conceptual schema. We assume that the data model exists as physical schema. Such we

recover a logical schema (before transforming it to a conceptual schema).

The first step is to recover the explicit data structures contained in the physical schema.

We achieve this by parsing the according physical schema. A physical schema can be

parsed through JDBC1, or an XMLSchema2 or a Document Type Definition (DTD) pars-

er. JDBC drivers are provided for almost all data repositories. Note that we are not re-

stricted to Java applications, we only use java as programming language to recover the

schemas.

The easiest schema parts to recover are entities and attributes. They can be read (parsed)

directly from the database. Depending on the database management system logical prop-

erties like primary keys can also be parsed. However, in most cases not all schema ele-

ment properties are explicitly defined.

The most valuable schema elements for understanding are the dependencies between en-

tities, namely foreign keys, inclusion dependencies, links and references. Again some of

1. The JDBC API provides Java applications with access to all major database systems and file

formats, via SQL. According to Sun, JDBC is not an acronym for Java Database Connectivity.

2. http://www.w3.org/XML/Schema

DATA-ORIENTED REVERSE ENGINEERING

them may be explicitly encoded in the database schemas, but most of them are hidden in

the application code.

In case of a relational database additional information may enrich the schema knowledge,

e.g., indices, stored procedures or (materialised) views. Further in federated or multi-da-

tabases the different schemas (local, component, export) may also be a valuable source of

information.

3.2.2 Retrieval of Hidden Schema Parts

As mentioned above parts of the schema are hidden and cannot be parsed directly from

the data repository. The most hidden parts are also the most valuable for the reengineers

understanding. There are two main reasons why schema parts are hidden, i.e., not declared

explicitly. Firstly, the database system does not have the language constructs to express

all schema parts in an appropriate way. Secondly, like the lack of correct and complete

documentation, the schemas are not attached with all the possible and desirable descrip-

tions for many different reasons.

Why do we need this hidden schema parts? Well, basically because they enrich the data

schemas with valuable reengineering knowledge. We distinguish between two ways for

retrieving hidden schema parts: (1) a semi-automatic process where indicators of missing

schema elements are retrieved automatically, before the reengineer validates or refutes

them; (2) a manual process where the reengineer annotates the schema with knowledge

that he derives from information sources like interviews, documentation, code reviews,

tests or application executions.

To enrich the schemas with knowledge, operations to create schema annotations are need-

ed, e.g., createPrimaryKey. Indeed, assume a user has gained knowledge of a certain sche-

ma for example by using a data retrieval tool. He afterwards wants to transfer this

knowledge into that schema so that other reengineers can benefit from this regained

knowledge.

These schema annotation operations are also used by the semi-automatic process. The an-

notations are endowed with a fuzzy value between 0 and 100 to express the certainty of

the according annotation. Depending on his certitude the reengineer can also add uncer-

tain knowledge to the schema. In case of the automated recovery the certainty degree de-

pends on indicator significance that is predefined. The underlying logical model is

presented in [Jah99].

Operations are needed for each schema element. Three different operations per schema

element exist: create an element, remove an element and update an element. The create

operation is always attached a fuzzy value. The delete operations can be bundled to a gen-

eral deleteElements operation. Updates can be subdivided into several operations, e.g.,

updateAttribute can be subdivided into renameAttribute, changeType and changeFuzzy-

Value.

DATA MODEL RECOVERY

The basic element types for a schema are entity, attribute, primary key and foreign key.

Primary keys can be tagged as candidate keys. Foreign keys can be tagged as IND and

classified as R-IND, I-IND or C-IND, cf. page 43.

Special elements resulting from reverse engineering practice or database features are vari-

ant, index, view, stored procedure, optimization structure and materialised view. A variant

of an entity is a sub-categorisation of the entity into different logical entities, e.g., the en-

tity patient has attributes for death date & time and discharge date & time. We have two

variants for patient here; a patient can only have a death date & time or a discharge date

& time. The other elements are well known from the database field and such we refer to

the corresponding literature.

Automatic schema recovery covers all elements. Entities and attributes are easy to recover

through JDBC. Relationship (primary and foreign keys) retrieval will be discussed in Sec-

tion 3.3. Indices, views, stored procedures and materialised views need special parsing de-

pending on the database management system. Variants and optimization structures are

more likely to be retrieved.

Figure 3.7 shows three tables and an IND as an excerpt of a database schema from the

palliative care information system. This example is depicted as it was parsed with the JD-

BC, i.e., no analysis was performed nor annotations added. The table Patient contains

three variants and table Medical contains optimisation structures. Note that the foreign key

Reference3 is automatically retrieved through the JDBC based parser.

Figure 3.7: Variant and optimisation structure example

DATA-ORIENTED REVERSE ENGINEERING

Variants can be found in code or by analysing the data. Figure 3.8 shows a simplified code

fragment, sketches the three variants as three sample tuples and depicts a possible concep-

tual representation. Tuples belonging to variant 1 (patient is discharged) have NULL val-

ues for the three attributes related to the death of a patient. Variant 2 (patient is deceased)

has the NULL values for the discharged date and time. If the patient is still in treatment

(variant 3) all five attributes concerning the death or discharging of a patient have NULL

values. Conceptually this results in three classes Patient, Discharged and Deceased de-

pending the status of the patient. The code fragment outlines a status request for a patient

of a given patientID. Note that this code fragment is a sample, still even more complex

fragments can be detected and thus serve as indicators. For further details concerning vari-

ants we refer to [Jah99].

Several optimisation structures occur in the table Medical of Figure 3.7. Attributes RE-

FERRAL DX and DISEASE stores the patient’s referral reason and disease. The other at-

tributes store two biopsies where BIOPSY1 DATE corresponds BIOPSY2 DATE,

SPECIFIC_DISEASE corresponds PRIMARY_CANCER_2, etc. Each biopsy has six at-

tributes to store metastases sites. Figure 3.9 shows a code sample as indicator for the op-

timised 1:n foreign key between Metastases Sites to Medical. Patients are searched where

metastases where detected at SITE :site. Therefore all attributes of both biopsies has to be

compared to the corresponding METNUM.

The table Medical would correspond to three tables if no optimisation were introduced: a

table „Disease“ to store the patient’s referral reason and disease; a table „Biopsy“ storing

the different biopsies for a patient; and a table „Biopsy_Metastases_Sites“ for the detected

metastases sites during a biopsy. This last table would have the 1:n foreign key to Meta-

stases Sites. For further details concerning optimisation structures we refer to [Bew98].

Figure 3.8: Patient variant example

...

date = „select DEATH DATE, DEATH

TIME from Patient where

VHS_ID# = :patientID“;

status = „deceased“;

if (NULL == date) {

date = „select DISCHARGE DATE,

DISCHARGE TIME from Patient

where VHS_ID# = :patientID“;

status = „discharged“; }

if (NULL == date) {

status = „in treatment“; }

...

variant DEATH DATE DEATH TIME PLACE OF DEATH DISCHARGE DATE DISCHARGE TIME VHS_ID# ...

1 NULL NULL NULL

2 NULL NULL

3 NULL NULL NULL NULL NULL

Patient

VHS_ID#: string

...

Deceased

death_date: timestamp

death_time: timestamp

place_of_death: string

Patient class

inheritance

Discharged

discharge_date: timestamp

discharge_time: timestamp

DATA MODEL RECOVERY

3.2.3 Schema Mapping

Once the logical schema is recovered, at least judged as recovered by the reengineer at

this point in time, it can be mapped into a conceptual schema. The conceptual schema pro-

vides a higher level of abstraction, which makes it easier to understand the major points

of interest. The conceptual representation is done with UML class diagrams. Such a map-

ping between logical schemas and the conceptual schemas is needed. The schema map-

ping is described in detail in [Jah99], we will only present the relevant parts here.

This meta-model mapping is done with triple-graph-grammars [Sch94, Lef95, SL96]. The

fundamental idea of triple-graph-grammars is that two documents are integrated and such

kept consistent through a third document, the integration document. Figure 3.10 depicts

an example for document integration with triple-graph-grammars. A triple-graph-gram-

mar rule relates elements of document A to elements of the integration document and

those elements of the integration document to elements of document B. In Figure 3.10 this

can be seen for a<---i--->b or (a1,a2)<---i1--->b1.

Figure 3.9: Optimisation structure example

...

M = „select METNUM from METASTASES SITES where SITE = :site“;

...

patientID = „select VHS_ID from MEDICAL where

(MET SITE1 = :M or MET SITE2 = :M or

MET SITE3 = :M or MET SITE4 = :M or

MET SITE5 = :M or MET SITE6 = :M)

(MET2 SITE1 = :M or MET2 SITE2 = :M or

MET2 SITE3 = :M or MET2 SITE4 = :M or

MET2 SITE5 = :M or MET2 SITE6 = :M) “;

...

document A dokument Bintegration document

Figure 3.10: Triple-graph-grammar fundamental idea example

i1 b1

aib

intra-document relation inter-document relation

DATA-ORIENTED REVERSE ENGINEERING

From a triple-graph-grammar rule, three rules are generated. These rules can be applied

to the documents without any rule ordering or prioritisation. Firstly, the forward rule maps

elements of the left document to elements of the right document, e.g., a of document A

exist and i and b are created. Secondly, the reverse rule maps elements of the right docu-

ment to elements of the left document, e.g., b of document B exist and i and a are created.

Thirdly, the relating rule which relates elements of the left document and elements of the

right document, e.g., a of document A and b of document B exist and i is created. If none

of these three rules can be applied to an element of document A or B, this element is re-

moved.

The gererated rules are graph productions. Generally, graph productions are defined as a

pair of graphs, a set of application conditions, and a set of attribute-transfer-clauses. The

two graphs are called the left-hand side and the right-hand side of the production, respec-

tively. Graph elements of the left-hand side that also appear on the right-hand side are

called the interface graph, i.e., they are neither created nor deleted. Graph productions

and their application semantics have been formalised based on algebra theory, for a com-

plete formalisation of these concepts we refer to [Roz99].

The characteristic of the triple-graph-grammar formalism is that it permits consistency

management. For this purpose the integration document has a crucial importance. Assume

that two documents are related like in Figure 3.10 document A and B. During editing do-

cument A we remove element a. In case that we do not have an integration document, we

cannot decide whether we have to remove b or (re-)create a. In the case that we have the

integration document, after removing a, i and b exist and we know that they should also

be removed. This situation also could be handled with some overhead without an integra-

tion document, but consider the more complex case that we remove a1. The relation is di-

fined as (a1,a2)<---i1--->b1, such i1 and b1 have to be removed. Without an integration

document the removal of b1 would result in the removal of a2, as inverse consequence of

the document consistency.

In our case this means that the logical meta-model instances are related to the conceptual

meta-model instances and later kept consistent. Hence we need a mapping model, i.e., the

integration document that connects the two meta-models and a set of rules. Figure 3.11

shows the mapping model. This mapping model is adapted from the migration graph mod-

el from [Jah99]. We will give a short overview of the mapping model.

The meta-model for the analysed logical schema is represented as an abstract syntax graph

on the left side of Figure 3.11. The node of type LSchema represents the root of the model

for a logical schema. This root contains following classes: LView, which represents views

of the logical schema; Entity, which represents the entities; and LType which represents

the types of the entities attributes (LAttribute). Each Entity has an attribute ename that

stores the name of the represented Entity. An Entity is composed by a non-empty set of

named (vname) Variants, a primary key (LKey) that is referenced by the pkey association,

and a set of alternative keys which are referenced by the association akeys. Each Variant

DATA MODEL RECOVERY

contains a set of foreign-keys (FKey) and a set of attributes (LAttribute). An attribute has

an ltype association to refer to its type. An IND is represented by one of the three types I-

IND, C-IND and R-IND (cf. page 43 and [FV95]), and has two associations lkey and fkey

that point to a primary key and a foreign key.

The rational for selecting the conceptual schema is driven by following observations.

Many variations and extensions of the Entity-Relationship (ER) model [Che76] have been

proposed to facilitate the description of data structures. The most common extensions to

extended Entity-Relationship models are concepts for abstraction like aggregation and in-

heritance [BCN92]. In our application domain, i.e., web information systems, the distrib-

uted programming language Java is widely used for integration and extension of legacy

systems. Multiple inheritance is not allowed by the type system of Java [SCC+93]. Such

we have restricted our conceptual model to classes to have at most one generalisation.

Note that this restriction affects only the data models, that we want to access through a

Java layer, and not the application programming languages of the reengineered legacy

system.

The meta-model that specifies the chosen conceptual model is displayed on the right side

of Figure 3.11. The class CSchema is the root of the conceptual schema abstract syntax

graph. By analogy with the logical schema, this CSchema contains views (CView), class-

es (Class) and attribute types (CType). A boolean attribute (abstract) is used to store the

information whether a class is abstract or not, i.e., whether a class can be instantiated. The

name of a class is stored in attribute clname. Inheritance relationships are represented by

nodes of type Inheritance with two associations sub and sup to the subclass and the su-

perclass, respectively. Classes are composed by a set of CAttributes and an optional key

(CKey). A CKey is composed by a non-empty set of CAttributes. Each Association has a

name (aname) and attributes srcname and tarname store the role names of the classes that

participate as source and target of the relationship, respectively. Attributes srccard and

srctotal, and tarcard and tartotal represent the information about the cardinality of the

source class and target class, respectively. The value of attribute srccard (tarcard) defines

the maximum cardinality for the source (target) of the relationship. If the attribute srctotal

(tartotal) is true, the relationship is total with respect to its source (target), cf. [EN94].

The schema mapping model connects the abstract syntax graphs of the logical and the

conceptual schema and represents their associations. The elements of the schema map-

ping model are depicted in Figure 3.11. Their purpose is motivated and described in more

detail in [Jah99].

MapSchema is used to connect the roots of both abstract syntax graphs. To map attribute

types MapType is used. Each variant in the logical schema is mapped to a concrete class

in the conceptual schema. If an entity has more than one variant, it implies usually that

common attributes are comprised which leads to an inheritance hierarchy with abstract

classes in the conceptual schema. Consequently, an abstract class is mapped to more than

one variant, namely all variants which are represented by its concrete subclasses. These

DATA-ORIENTED REVERSE ENGINEERING

correspondences among classes and variants are represented by nodes of type MapVari-

ant.

Inheritance relationships in the conceptual model can correspond in two different ways to

constructs in the logical schema. Firstly, they can correspond to the inclusion of more spe-

cific variants in less specific variants that belong to the same entity. This is represented by

an inheritance relationship in the conceptual model which is mapped by MapVarToInh.

Figure 3.11: Schema mapping graph model

LSchema

LType

ltname: string

Entity

ename: string

LKey

LAttribute

laname: string

FKey

IND

I-IND

C-IND

R-IND

invkb: boolean

Variant

vname: string

LView

lvname: string

CSchema

CType

ctname: string

Class

clname: string

abstract: boolean

CView

cvname: string

Inheritance

CKey

CAttribute

caname: string

Association

aname: string

srcname: string

tarname: string

srctotal: boolean

tartotal: boolean

srccard: integer

tarcard: integer

MapSchema

MapType

MapView

MapVariant

MapVarToInh

MapIIND

MapKey

MapAttribute

MapRIND MapAssoc

0..n

0..1

0..n

1..n

0..n

1..n

0..n

1..n

10..n

0..n

1..n

ltypes

lviews

entities

ltype

variants

pkey

akeys

fkeys

lattributes

lkey

fkey

fkattrs

lattrs

ents

Entity class

1..n 0..n

attrs association

generalisation

0..n

ctypes

ctyp

cviews

0..n

cls

0..n

classes

0..n

sub

super 1111

0..1

0..n

1..n

0..1

0..n

cattributes

cattrs

ckey

src tar

0..n

left

right

0..1 0..1

10..1 0..1 1

1..n

0..1 0..1

0..n

0..1

1..n

0..1

0..n

left_vg

left_vs

1..n

0..n

0..1

10..1

right_id

0..n

0..1

0..1 1

0..n

0..n 0..1 1

1 0..1 0..n 0..n

attrs_via

assocs_via

DATA MODEL RECOVERY

Association left_vs is used to reference the variant which is more specific, while associa-

tion left_vg references the variant which is more general. Secondly, the other possibility

is to map INDs in the logical schema that have been classified as inheritance relationships

(I-INDs) in the analysis process to inheritance relationships. In this case, the mapping is

represented by a node of type MapIIND.

MapKey is used to map primary keys in the logical schema to keys in the conceptual sche-

ma. According to the ODMG 3.0 (Object Data Management Group) data model

[CBB+00], our conceptual model includes the notion of unique object identifiers for in-

stances of classes. Note that it is not required that every class contains a value-based key.

Still, if we aim for object-relational data integration, unique object identifiers have to be

resolved to value-based keys in the logical data model. For this purpose, every class has

an association of type right_id that references a MapKey class in the schema mapping.

Attributes are mapped to attributes by nodes of type MapAttribute. To provide the flexi-

bility to allow for different alternative schema mappings, we admit that attributes of a sin-

gle class can be mapped to attributes in different entities. For such attributes, the access

path from the entity that includes the value-based key associated to the class and the entity

which includes such attributes has to be maintained. This is done by the association

attrs_via. If a MapAttribute node does not have an attrs_via link, the mapped column be-

longs to the entity that contains the key referenced by the right_id association of the class

that contains the mapped attribute. Otherwise, the mapped column belongs to a different

entity and the attrs_via link of the corresponding MapAttribute node refers to a set of

MapRIND nodes. These nodes represent the access path from the entity that contains the

key referenced by the right_id association to the entity that contains the mapped column.

Each MapRIND is connected to R-IND which logically represents a foreign key that has to

be dereferenced to access the mapped column. In analogy with columns, MapAssoc and

assocs_via associations are used to map Associations to sets of foreign keys (represented

by R-INDs).

An example for a mapping rule is given in Figure 3.12. The MapEntityToClass mapping

rule relates an entity including variant and primary key to a class with corresponding key.

The prerequisite for this rule is that the logical root (LSchema) has a correspondent con-

ceptual root (CSchema). This is the case if they are related by a MapSchema node. The

prerequisites are represented in black whereas the elements which have to be related are

in grey (green) and marked by a <<create>> stereotype. The side of each element is indi-

cated by <<left>>, <<map>> and <<right>> corresponding to the left (logical schema), in-

tegration (mapping schema) and right (conceptual schema) document, respectively. Note

that for execution reasons <<input>> and <<output>> nodes can be defined and are

marked correspondingly. Further constraints and assignments can be expressed (e.g.,

LEFT : cl.clname:=v.vname;). The side stereotype expresses if the constraint or assign-

ment is considered during generation.

DATA-ORIENTED REVERSE ENGINEERING

From this mapping rule three rules are generated: a rule that creates the conceptual ele-

ments from the logical ones (forward rule), a rule for the opposite direction (reverse rule)

and a rule which relates logical and conceptual elements if they already exist (relating

rule). The resulting rules are generated as story diagrams. A story diagram is an activity

diagram where each activity contains either (Java) code or a story pattern. A story pattern

is a graph production with an adapted collaboration diagram notation. We refer to

[FNTZ98] for details.

Figure 3.13 shows the relating rule MapEntityToClassRelating as a story diagram. The

MapSchema node ms is the input node, i.e., the node where the subgraph matching starts.

The first activity initialises the MapVariant output node outParam. The next activity is a

story pattern. Starting form the bound object ms, the black nodes are searched. For the re-

lating rule, these are all logical and conceptual elements belonging to an entity and its cor-

responding class including the variant and keys. The variant v and class cl have the same

name, cl is not abstract and does not have a value-based key assigned (empty(cl.-right_id-

>)). These constraints are all marked as MAP or RIGHT/MAP in the mapping rule. Once

the subgraph has a corresponding match in the abstract syntax graph, the nodes in grey

(green) will be created with the corresponding links. In this case MapKey and MapVariant

nodes are created to relate the entity e, variant v and key lk to class cl and key ck. If the

subgraph matching and creation of nodes is successful then the last activity assigns the

Figure 3.12: MapEntityToClass mapping rule

DATA MODEL RECOVERY

MapVariant node mv to outParam. Otherwise the outParam remains null as assigned dur-

ing initialisation. Finally the rule returns the outParam.

In analogy with the relating rule the forward rule (Figure 3.14) and the reverse rule (Figure

3.15) are generated. Only the story patterns are presented because the remaining of the

story diagram is identical to Figure 3.13 except the rule names. The constraint and assign-

ments are generated accordingly to the side they were defined for.

Other mapping rules are needed to establish a full correspondence between the logical and

conceptual schema. The Table 3.1 gives an overview over the mapping rules needed for

a bidirectional mapping and consistency preservation between the schemas.

Catalogues of (forward) mapping rules are given in [Wad98] and [Jah99]. To specify all

rules special graph-grammar constructs are employed. The mapping rule MapAttrToAttr

depicted in Figure 3.16 contains examples of a path construct, optional nodes and graph-

ical constraints. A path enables the navigation through the abstract syntax graph from one

node (or set of nodes) to another node (or set of nodes) even when they are not directly

related by an association. A simple path from v:Variant to lk:LKey is given in Figure 3.16.

This path <-variants-.-pkey-> accesses the entity primary key from the variant it starts

from. Optional nodes are used in this rule to relate attributes that belong to a key. If an

attribute belongs to a key the corresponding key nodes will be matched and the needed

Figure 3.13: MapEntityToClass relating rule (story diagram)

DATA-ORIENTED REVERSE ENGINEERING

links to reflect this information will be created. If an attribute do not belong to a key the

nodes containing information about a key will not be matched. The generated rules will

Figure 3.14: MapEntityToClass foward rule (story pattern)

Figure 3.15: MapEntityToClass reverse rule (story pattern)

DATA MODEL RECOVERY

not fail because the nodes where declared as optional and thus are not required for the suc-

cessful application of the rules. Finally, we have two graphical constraints.

Figure 3.17 shows the graphical constraint IsNotForeignKeyAttribute. A graphical con-

straint is expressed with a story diagram and attached to a node. The constraint is only

checked if the attached node is a prerequisite of the rule. IsNotForeignKeyAttribute checks

if the logical attribute is not a foreign key attribute in at least one variant. Logical at-

tributes that contains information about foreign keys have no related attribute in the con-

ceptual schema. This information is implicitly contained in associations. For this reason

no such check is needed when a conceptual attribute is mapped to a logical attribute. The

story diagram checks for the given attribute la that is contained in a variant, if a path via

a foreign key from this variant to the given attribute does exist. If this is the case then the

success variable becomes true and the constraint (“if (success)”) will return “false”, oth-

erwise the failure variable becomes true and consquently the constraint will return “true”.

The presented mapping is partial, e.g., many-to-many associations or aggregations are not

considered. Mapping rules for such constructs can be defined in analogy with the present-

ed rules. These rules lead to ambiguities, e.g., relating an R-IND to an association or an

mapping rule description

MapSchema Relates the root nodes of the logical and conceptual schema.

MapType Relates the logical types to the conceptual types. In our case mainly data-

base types to Java types, e.g., varchar to string.

MapView Relates database views to UML class diagram views, cf. Section 3.1.3.

MapEntityToClass Relates entities to concrete classes, including the primary keys.

MapVariantToConcreteClass Relates additional entity variants to concrete classes.

MapVariantToAbstractClass Relates variants to abstract classes in case that variant structures repre-

sent inheritance hierarchies.

MapVariantToInheritance Relates variant hierarchal structures with the inheritance relationships of

the corresponding hierarchy of concrete and abstract classes.

MapAttrToAttr Relates the attributes of the entity variants and the related classes.

MapRINDToAssoc[1:1] Relates an inversely key-based R-IND to a total one-to-one association.

MapRINDToAssoc[N:0,1] Relates a non inversely key-based R-IND that has an inverse C-IND to a

left-total one-to-many association.

MapRINDToAssoc[0,N:0,1] Relates all R-INDs that are not inversely key-based and do not have an

inverse C-IND to partial one-to-many associations.

MapIINDToInheritance Relates I-INDs to inheritances.

Table 3.1: Mapping Rule Overview

DATA-ORIENTED REVERSE ENGINEERING

aggregation. These ambiguities can be resolved by ordering mapping rules or adding fur-

ther semantic annotations to the logical schema. Our experience shows that the number

and the complexity of rules then grows considerably. We adopt a solution that enables a

fully automatic mapping between the logical and conceptual schema with the limited set

of twelve rules of Table 3.1. The reengineer can use conceptual schema refactoring oper-

ations to add conceptual constructs to the conceptual schema. In Section 3.2.4 we present

an extensible catalogue of conceptual schema refactoring operations, which maintain au-

tomatically the correspondences to the logical schema.

Figure 3.16: Mapping rule features in the MapAttrToAttr rule

path

optional nodes

graphical

constraints

Figure 3.17: Graphical constraint with story diagram

DATA MODEL RECOVERY

3.2.4 Conceptual Schema Refactoring

According to [Fow99], refactoring is the modification of the internal structure of software

without changing its observable behaviour, to make the software easier to understand,

modify and maintain. The fundamental idea of refactoring is also reflected in conceptual

schema refactoring, i.e., modifying the conceptual schema structure without changing the

(observable) information capacity, to make the schema easier to understand, use and

maintain.

Refactoring is a complex task regarding the impact analysis that a refactoring operation

has to perform before it can be applied. This holds for software refactoring [Fow99], da-

tabase (physical schema) refactoring [Amb02] and even logical schema refactoring. For

conceptual schema refactoring this is different because the conceptual schema is not (yet)

used by third parties. Indeed, the conceptual schema is newly created for understanding

and will be used as capsule for the data access only after the reengineering. Two factors

must be taken into account for conceptual schema refactoring: (1) the schema transforma-

tions must preserve the information capacity and (2) consistency between the conceptual

schema and logical schema has to be preserved. While the first will be discussed subse-

Figure 3.18: moveAttribute refactoring operation

DATA-ORIENTED REVERSE ENGINEERING

quently, the latter is satisfied (guaranteed) by the mapping consistency mechanism with

triple-graph-grammars (cf. Section 3.2.3).

Refactoring operations must preserve the information capacity [BCN92] of the schema.

Crucial for an information capacity preserving operation is that the schema contains data

structures, which enclose the same data before and after the operation application. An ap-

plication, which runs and uses data described by the schema, must still run after schema

refactoring, i.e., the data is still fully available. Formal graph transformations that pre-

serve information capacity are presented for database integration [GO00] and database re-

engineering [JZ99].

An example for a basic refactoring operation is moveAttribute. Figure 3.18 shows the im-

plementation of moveAttribute as a story diagram. Attribute attr is moved from class cl1

to class cl2 via association assoc. The prerequisite for this operation is that the move op-

eration is done via a one-to-one association. This is done with assertions. An assertion is

a constraint that is part of the matched node, e.g., srccard==1. Further the association has

to be total with respect to class cl1. In other cases moving an attribute via an association

would change the information capacity. Further the attribute has to have a link to a

MapAttribute node. This ensures that attr has a corresponding attribute in the logical sche-

Figure 3.19: splitClass refactoring operation

DATA MODEL RECOVERY

ma. If the activity is applied the link between cl1 and attr is deleted and a link between cl2

and attr is created. After successful completion of this activity possible access pathes have

to be updated. Note, if no MapRIND node is found the operation still terminates correctly.

Refactoring operations can be divided in basic schema transformations and composed

schema transformations. Indeed much operations are based on several basic operations.

Figure 3.19 shows splitClass which is an example for a composed refactoring operation.

On the left hand the available transformations are listed. The new composed transforma-

tion is on the right side. The splitClass operation is composed of two basic design trans-

formations createClass and createAssoc. Before moving the attributes from the old class

to the new class, the correspondences to the logical schema are maintained with mapping

transformations. createLink creates a link of type ’right_id’ between the given new class

and MapKey node. The new class is mapped to all variants, the old class is mapped to

(createMapNode). Finally, the basic refactoring operation moveAttribute is applied in a

loop to a passed set of attributes. Optionally the new class is aggregated (Aggregate).

Table 3.2 gives an overview of all refactoring operations classified into basic refactoring

(BR) and composed refactoring (CR) operations. Note that operations, which do not pre-

serve information capacity if they are applied separately (design operations), may be

needed for composed refactoring operations, e.g., createClass in splitClass. Further, the

correspondences to the logical schema have to be maintained, cf. MappingOperations Fig-

ure 3.19.

name description kind

renameClass Updates the name of a class with a given new class name. BR

renameAttribute Updates the name of an attibute with a given new attribute name. BR

renameRelationship Updates the name of a relationship with a given new relationship name. BR

Aggregate Converts an association to an aggregation. Limited to associations inside

packages.

Disaggregate Converts an aggregation to an association. BR

MoveAttribute Moves attributes from a class to an associated class via a given one-to-

one relationship.

SwapAssocDirection Exchanges the source and target of a given association. BR

renamePackage Updates the name of a package with a given new package name. BR

moveClass Moves a class from a package to another package if this class is not

aggregated to any other class. A constraint class cannot be moved alone.

unifyAttrs Merges two attributes to one attribute if they are related by a redundancy

dependency.

Table 3.2: Refactoring operation overview

DATA-ORIENTED REVERSE ENGINEERING

3.3 Relationship Retrieval

The great data heterogeneity existent in web information systems and the various mecha-

nisms used to integrate the subsystems make relationship retrieval quite complex. There-

fore, we divide the retrieval problem into smaller problems, namely (1) the single schema

recovery (cf. Section 3.2), (2) extracting code fragments of interest from the web infor-

mation system and (3) the retrieval of relationships based on the extracted code fragments

and the single schemas.

Retrieving relationships between schemas makes sense only if these are considered as

completely reverse engineered. A complete schema in this context means complete after

data model recovery. Nevertheless, not all schemas have to be recovered; our incremental

process allows the addition of schemas later in the process such that relationships to these

schemas can then be retrieved. Due to results of the relationship retrieval those schemas

may have to be further analysed which can engender changes.

In contrast to the recovery of a single data model the inter-schema relationships are locat-

ed in access and management data components, i.e., application code. Hence a code re-

covery mechanism is needed. Since we are only interested in the data related aspects of

the code and parsing the entire existing code may not be scalable, we extract code frag-

ments of interest. In some cases simple incremental parsing will be sufficient. In more

complex cases we use island grammar parsing [vDK99, Moo01] and slicing [Wei84].

AssocToClass Replaces an association between two classes by an intermediate class

with two associations with calculated cardinalities to the initial two clas-

ses. Limited to relationships inside packages.

ClassToAssoc Replaces a class with two associations, both not many-to-many, to an

association with calculated cardinalities.

SplitClass Divides a class into two classes associated by a one-to-one association or

aggregation and moves the given attributes to the new class.

MergeClasses Merges two classes associated by a one-to-one relationship or an aggre-

gation into a single class.

moveClasses Moves a given set of classes from a package to another package if these

classes are not aggregated to any other classes outside the given set.

MergePackages Merges two packages into a single package. CR

SplitPackages Divides a package into two packages and moves the given classes to the

new package.

name description kind

Table 3.2: Refactoring operation overview

RELATIONSHIP RETRIEVAL

For the relationship retrieval, we use a pattern instances detection mechanism of patterns

which interrelate data component model elements [Nie03]. The pattern instance detection

is based on a (design) pattern instance recovery mechanism [NSW+02] which uses fuzzy

logic to express uncertainty [NWW01, NWW03]. Validation of detected relationships

which results in the adaptation of the associated fuzzy value can be done by several satel-

lite activities. Finally, the reengineer confirms or rejects the retrieved relationships.

3.3.1 Code Fragment Extraction and Parsing

Relationships between attributes of entities in different schemas may occur everywhere in

the application code and in various kinds. Such the problems we face for code fragment

extraction are to determine which fragments are fragments of interest and how these frag-

ments looks like. In case of web (thus, heterogeneous) information systems, however, the

extraction problem is even more difficult because it deals with multiple different program-

ming languages. Such we have to resolve an occurrence heterogeneity problem.

The code fragments of interest are those which manipulate the data. Such we have to ex-

tract all fragments that somehow accesses or manipulates the data through their corre-

sponding data structures. This implies that we have to know if the data manipulating

fragments are delimited by keywords and if they are multi-lingual.

In general, distributed transactions are used to ensure data integrity for the access and ma-

nipulation of multiple databases. The code used within such a distributed transaction may

be spread over a set of methods. In practice besides local method calls also remote method

calls are used. Within the application tier the current transaction context is propagated ei-

ther in an implicit [COS98, OTS98] or explicit [AOS+99] manner. The resulting transac-

tion boundaries can determine the relevant excerpt of code fragments for the analysis of

relationships. Such the fragments are delimited which reduces the lines of code dramati-

cally and allows us to handle also large applications.

Further, relationships are included in the application code where the database access is

done by using API’s provided by the database itself. For example, JDBC is a standard in-

terface to access various kinds of databases. The interface provides data structures which

can be accessed and modified in the Java programming language. Hence a complete anal-

ysis of an application is too expensive regarding the analysis time, we can use the JDBC

interface declarations as starting points for extracting fragments of interest only. This also

holds for specific in house interfaces.

An easy case of fragment extraction is when the data manipulation is done via query log

files. Query log files contain all the access and manipulation operations normally in the

database manipulation language, e.g., SQL. These files are called by the application code

to perform the data manipulations, such as done with scripts. Such a query log files can

be seen as a result of fragment extraction. In fact, in this case is more a fragment detection

than extraction. Nevertheless, the files also have to be located.

DATA-ORIENTED REVERSE ENGINEERING

To find the fragments of interest we apply two techniques: island grammar parsing and

slicing. The island grammar parsing covers the cases where the fragments are delimited

but not entirely the multi-language problem. Since we start from the data models, we

know the entities and attribute names. We can use a slicer to find all the occurrences of

these entities and attributes. To extract the fragments of interest, they do not need to be

delimited and multi-language fragments can be handled. The drawback of slicing is the

size and number of fragments extracted this way. Indeed, slicing the application code for

all entities and attributes would cover huge parts of the code. Only selective slicing is

manageable and scalable.

Therefore we define the following extraction process. We start with the identification of

fragments by using an island grammar parser, cf. Figure 3.20. So called islands are ex-

tracted according to the grammar definition. Next, in case that an island is encapsulated

in a method, we extract the method signature and slice the method calls backwards. Final-

ly, the such extracted code fragments are parsed and attached to the abstract syntax graph

representing the data component models.

Island grammar parsing

To automate the extraction of code fragments of interest, we make use of parser technol-

ogy to import the legacy code into the abstract syntax graph that can be further processed.

This is not new and has been done for decades in numerous reverse engineering tools. To-

day, parsers, or parser generating grammars, are available for a wide variety of languages,

including database languages like SQL and programming languages like COBOL, C and

Java. These parsers have been used to build extractors for reverse engineering homoge-

neous systems, i.e., systems implemented in a single language.

In case of web (thus, heterogeneous) information systems, however, the extraction prob-

lem is more difficult because it deals with multiple different programming languages. Us-

1. Island

Grammar

Parser

Slicer

Parser

application code

Figure 3.20: Extracting code fragments of interrest

abstract

syntax

graph

code

island slice

step

information flow

backward slicing

RELATIONSHIP RETRIEVAL

ing multiple different parsers solves the problem only partially, because this approach

fails to capture the relationships between software artefacts written in different languages.

This problem gets worse for multi-language systems where certain languages are embed-

ded in other languages. This situation is typical for many web information systems: most

database management systems provide proprietary data manipulation languages embed-

ded in various host languages like C, COBOL, Java, etc. In addition, the code fragments

may contain code pieces from multiple modules integrated in the application tier via in-

teroperable interfaces [Cha96, Ib96, COR99].

Parsers for extracting relationships among distributed schemas have to deal with code

fragments, that are amalgamations of different languages including proprietary dialects.

This feature renders the reuse of existing parsers highly unlikely. In addition, experiences

show that the construction of multi-lingual custom parsers might become a fairly complex

task. The reduction of reverse engineering effort achieved by the resulting extractor may

be lost when building and adapting a multi-lingual custom parser. However, the code frag-

ments of interest are not arbitrary amalgamations of different languages but rather the well

separated code fragments executed within distributed transactions. Therefore, we can

simplify the task by looking at its specific characteristics.

Still, it is important to note that the code fragments of interest for extracting relationships

among distributed schemas are typically in a fairly small subset of the multi-language

grammar. Therefore, it is viable to construct more simple parsers that filter out only those

"interesting" parts of the multi-language syntax and ignore everything else. A naïve way

of performing this filtering is using a pre-processing step with a lexical analyser such as

the Unix grep command.

Reverse engineering researchers have started to investigate more powerful approaches,

one of them being island grammars [vDK99]. An island grammar can informally be de-

fined as a set of production rules that describe the language fragments of interest (so-

called islands) plus another set of production rules that catch the rest (so-called water).

Obviously, the idea behind the concept of island grammars is to make the water signifi-

cantly less descriptive than the islands in order to decrease the complexity of the associ-

ated parser. For a formal definition of island grammars and how to build robust island

grammar parsers we refer to [Moo01].

The definition of an island grammar for our application, the extraction of relationships

among distributed, heterogeneous information systems, is a task that requires an intimate

understanding of the concept of island grammars and a highly explorative process. To

simplify this process, [Chu04] developes a tool for interactively creating island grammars

based on code examples of interest identified by the user.

This tool, called BUFFY, initially assumes that the entire input code represents water.

When the user identifies instances of interesting code fragments, BUFFY suggests a set of

island productions for this instance. Subsequently, the user can interactively correct and

refine these productions to characterise the associated island. Then, the user can generate

DATA-ORIENTED REVERSE ENGINEERING

a prototype extractor and run it against other parts of the input code in order to verify if

this island recognises other instances of this pattern. Depending on the result of this veri-

fication step the user might iteratively refine the description of the island. Resulting, the

public class Database implements DatabaseBackendInterface

...

public String getValue(int databaseConnectionID, String keyName, String table-

Name, String[] attributeName, String[] attributeValue, String[] type)

{

ResultSet result = null;

String sqlString = "SELECT " + keyName + " FROM " + tableName + " WHERE ";

// loop around the array

for(int n = 0; n<attributeName.length; n++)

{

if( n == (attributeName.length - 1) )

{

if( type[n].toUpperCase().equals("STRING") ||

type[n].toUpperCase().equals("DATE") )

{

sqlString = sqlString + attributeName[n] + " = '" +

attributeValue[n] + "'";

} else {

sqlString = sqlString + attributeName[n] + " = " + attributeValue[n];

}

} else{

if( type[n].toUpperCase().equals("STRING") ||

type[n].toUpperCase().equals("DATE") )

{

sqlString = sqlString + attributeName[n] + " = '" +

attributeValue[n] + "' AND ";

} else {

sqlString = sqlString + attributeName[n] + " = " +

attributeValue[n] + " AND ";

}

String retValue = "@ERROR@";

try {

this.lastUsedConnection.put( new Integer(databaseConnectionID),

new Long(System.currentTimeMillis()) );

result = select( databaseConnectionID, sqlString );

result.next();

retValue = result.getString( keyName );

} catch (Exception ex) {

retValue = "@ERROR@";

}

return retValue;

}

...

Figure 3.21: Code fragment of interest

RELATIONSHIP RETRIEVAL

fragment extractor yields islands (code fragments) delimited by transaction boundaries or

method definition including the indicators for the relationship retrieval.

Figure 3.21 shows an example of a code fragment of interest. In this case it is the whole

method getValue. The island grammar production for this code fragment is relatively sim-

ple: fragments of interest contains the keywords „SELECT“, „FROM“ and „WHERE“

and methods build the fragments boundaries. Still, this fragment on its own is not a com-

plete indicator, to identify the elements involved further recovery is needed.

Slicing

To trace the use of code fragments of interest inside methods we employ slicing [Wei84].

Various notions of slicing have been proposed. For an overview we refer to [Tip95,

HG98]. To slice the programs for nested method calls interprocedural static backward

slicing is appropriate. Starting from the method declaration the calls of this method are

sliced backwards.

in DatabaseBackendInterface:

public String getValue(int databaseConnectionID, String keyName,

String tableName, String[] attributeName, String[] attributeValue,

String[] type) throws RemoteException;

in DatabaseConnection:

public String getValue( String keyName, String tableName, String[] attribute-

Name, String[] attributeValue, String[] type)

{

String retValue = "";

try {

this.checkConnection();

retValue = this.databaseBackend.getValue( this.databaseConnectionID,

keyName, tableName, attributeName, attributeValue, type);

} catch (Exception ex) {

if(this.out)

{

ex.printStackTrace();

}

retValue = "";

}

return retValue;

}

in XMLWorker (lines 479 and 559):

roleid=this.databaseConnection.getValue("roleid", "roles",

new String[]{"userid", "projectid", "rolename"},

new String[]{userid, projectid, "programmer"},

new String[]{"int", "int", "String"});

Figure 3.22: sliced code fragment of interest

DATA-ORIENTED REVERSE ENGINEERING

Figure 3.22 sketches an example where the method containing the code fragment from

Figure 3.21 is sliced backwards. Starting point is the method getValue in class Database.

Class Database implements the (abstract) class DatabaseBackendInterface where getVa-

lue is declared. In the class DatabaseConnection method getValue is called inside another

method getValue. Note that the method signatures are different. This second method get-

Value is finally called in class XMLWorker with concrete values.

The difficulty of such slices is to know the depth of nested method calls. Two solutions

are considered. Firstly, from our experience a depth of 3 is largely sufficient to cover all

nested methods until the call with concrete values. The drawback is that often the slices

produced are larger than required because the nested method call depth is less than 3. Con-

trariwise all nested method calls greater than 3 are not sliced to their end. Secondly, the

slicing can be done user-driven until the method call with concrete values. The drawback

here is that the fragment extraction is not longer automatic and thus user intensive.

We opt for the first solution because we search for indicators only. Moreover we try to

reduce the user involvement in this early activity. The example of Figure 3.21 and Figure

3.22 is of depth 1. The starting method declaration is getValue from Figure 3.21. The first

nested call is in class DatabaseConnection. The call in XMLWorker is not nested; it is the

call with concrete values. Note that the method calls sliced backwards of depth 2 and 3

were omitted in this example.

Parsing

Our pattern instance recovery approach is graph based. Therefore, we represent the sche-

mas as well as the manipulating source code as an abstract syntax graph. This abstract syn-

tax graph is an object-oriented graph, cf. [Zün01, Nie03]. The code fragments are parsed

into abstract syntax graphs. Additionally, these (code fragments) abstract syntax graphs

are attached to internal schema abstract syntax graphs.

The output of the island grammar parser are either code fragments of interest embedded

in methods (followed by slicing) or standalone code fragments of interest. In both cases

the result will be an abstract syntax graph. In case that the corresponding method calls

have to be sliced the abstract syntax graph is not yet attached to schema abstract syntax

graph. The standalone code fragments are directly attached to the schema abstract syntax

graph.

The slicing results and query log files have to be parsed and attached to the schema ab-

stract syntax graph. In case of the slicing the linking elements are the method declaration

headers. Otherwise, the linking of the abstract syntax graphs is done via attributes and

classes of the object-oriented graph.

For mid-sized unilingual web information systems a sequential incremental parsing can

avoid code fragment extraction. The files containing code are parsed sequentially and the

resulting abstract syntax graphs are only stored temporally. An iterative process is then

RELATIONSHIP RETRIEVAL

needed that combines the parsing and pattern instance recovery. A file is parsed followed

by the pattern instance recovery. Only the recovered pattern annotations remain in the

schema abstract syntax graph and the code abstract syntax graph is deleted. Then the next

file is parsed and so on. Classes, attributes and method headers are directly parsed and rep-

resented in an abstract syntax graph. The method bodies are incrementally parsed on de-

mand. This parsing on demand has then to be triggered from the pattern instance retrieval

process.

3.3.2 Pattern Definition

Detecting instances of patterns for relationship retrieval requires that patterns be formally

defined, since informally described parts of the patterns are not amenable to (semi-)auto-

matic recovery. The most formal part of patterns is the structure part. In our approach pat-

terns (and subpatterns) are defined with respect to the abstract syntax graph. Subpatterns

define abstract syntax graph structures that are constituent parts of other patterns or sub-

patterns.

Using an abstract syntax graph representation has several advantages over using a textual

source code representation. It is easily combinable with representation of the schemas

which is generally done with graphs, in our case also with abstract syntax graphs. It avoids

white space and formatting problems. It automatically normalises the code for simple syn-

tactic variants such as ’SQL SELECT’ vs. „Select“. Abstract syntax graphs also provide

additional information, such as identifier application and declaration links, that is useful

in further analysis.

We use a simplified abstract syntax graph model for readability reasons for examples in

this thesis. Figure 3.23 shows an excerpt of the type graph model underlying the pattern

definition as an UML class diagram. We omit the complete graph model and just present

the classes Node, ASGNode and Annotation. Each element of the abstract syntax graph is

represented by a corresponding subclass of class ASGNode. Each annotation used in a

pattern definition is a subclass of class Annotation. The attribute kind qualifies the pattern,

e.g. SELECT and Comparison in Figure 3.23. The Annotation nodes are associated to

Node nodes, i.e., ASGNode nodes or Annotation nodes, in the graph by a qualified asso-

kind: String

fuzzyValue: Float

threshold: Float

Annotation

Figure 3.23: Type Graph Model

ASGNode Node class

generalisation

association

< annotation

<<abstract>>

Node {qualified}

0..n 0..n

DATA-ORIENTED REVERSE ENGINEERING

ciation. The qualifier must be unique. The other two attributes of Annotation are used for

handling uncertainty in the retrieval process.

Recognising an instance of a pattern or subpattern in the abstract syntax graph under anal-

ysis results in the addition of a corresponding annotation. The oval-shaped node :SELECT

shown in Figure 3.24 is an annotation which identifies attributes refid and fileid as partic-

ipating in a select join. In generally, oval-shaped nodes are annotations which identify cer-

tain subgraphs of the abstract syntax graph as matching the named subpatterns. The

rectangles represent nodes of the abstract syntax graph. The nodes are linked together in

accordance to the underlying type graph model. The indirect link between class “Update-

Service“ and :PTAssignment indicates that the assignment is contained in the class, i.e.,

abstract syntax graph nodes where omitted. Further we omitted abstract syntax graph

nodes in the WHERE clause and just show the :Comparison annotation.

Figure 3.24: Simplified annotated abstract syntax graph instance

// Select join embedded in Java code

result=this.databaseConnection.selectSQLVector(

"Select files.path

from product, files

where (product.refid = files.fileid)

AND

(files.name = updateFile )"

);

:Xyz

:XYZ

direct

node

annotation

objects

:Operation

name=”select-

SQLVector”

:PTDot

:Class

name=”UpdateService”

:PTAssignment

:PTVariable

name=”this”

:PTLink

:SELECT

:SQLStatement

:AND

:PTDot

:Class

name=”ResultType”

:Attribute

name=”result”

lhs

rhs

right

left

type

right

target

methods

type

source

:Class

name=”Database-

Connection”

params

:Param

left

:Select

:From

:Where

select

where

from

:Attribute

name=”path”

attr

:Class

name=”product”

:Class

name=”files”

entity

:Attribute

name=”refid”

:Attribute

name=”fileid”

type

:Attribute

name=”name”

:Variable

name=”updateFile”

and

left

right

left

right

comp

comp type type

composed_by

indirect

links

:Comparison

RELATIONSHIP RETRIEVAL

In our approach each pattern is formally defined by a graph production. The correspond-

ing graph production annotates the abstract syntax graph with additional nodes and links

(the oval-shaped nodes and corresponding links in Figure 3.24) to indicate which sub-

graph of an abstract syntax graph correspond to the pattern. Such subgraphs can then be

used by rules defining other patterns that contain the defined pattern as a constituent part.

The rule definition is graph based, means the approach needs only a graph representation

of the schemas, code or any other graph representing information of a system, e.g. data-

flow or control-flow graphs. Therefore, the presented approach is not bound to any par-

ticular program language or any particular programming paradigm.

As an example of such a pattern definition, Figure 3.25 shows the graph production de-

fining a pattern which is an IND between two entities1. In the notation used, the subgraph

to be matched in the host graph is defined by the black nodes and edges. The subgraph to

be added is defined by the grey (green) node annotated with the stereotype <<create>>

and grey (green) edges. This simple notation can be used because the rules only add in-

formation to the host graph and never delete any. The formal definition and theory under-

lying such graph productions is given in [SWZ95]. The stereotype <<trigger>>, as well as

the values 100 / 0, will be explained later in this chapter.

Figure 3.26 shows an example of an SQL query which contains two INDs. We have oc-

currences of the SELECT and AttrSim pattern for (users.userid, roles.userid) and

1. Note that internally, i.e., in the abstract syntax graph, all schema elements are

represented in an object oriented graph like defined in [Zün01] and [Nie03].

Figure 3.25: IND pattern definition

DATA-ORIENTED REVERSE ENGINEERING

(roles.projectid, projects.projectid). Applying the graph production from Figure 3.25 to

this code will result in the annotations of two INDs. One between entities users and roles

and one between entities roles and projects. These three entities are part of the same log-

ical schema, i.e., from the DSD database (cf. Chapter 2). The resulting IND annotations in

the abstract syntax graph are shown in the lower part of Figure 3.26.

Figure 3.27 shows the IND annotations presented to the reengineer, i.e., between entities

in the EER diagram.

Figure 3.26: Sample IND code and annotated abstract syntax graph

// get the email addresses from project +mailModuleName+ for all persons having

the role rname ...

result=this.databaseConnection.selectSQLVector("Select users.email from us-

ers,roles,projects where (users.userid = roles.userid ) AND (roles.projectid =

projects.projectid) AND (projects.name='"+mailModuleName+"') AND (roles.role-

name=rname)");

:SELECT

:Attribute

name=”userid”

:Class

kind=”Entity”

name=”users”

:Attribute

name=”userid”

:Attribute

name=”projectid”

:AttrSim

:SELECT

:AttrSim

:IND

:Attribute

name=”projectid”

:IND

:Xyz

:XYZ

direct

node

annotation

objects

new

links

schema

attrs

composed_by

similarities similarities

composed_by

similarities

composed_by

entities entities

entities

joins

:Class

kind=”Entity”

name=”roles”

:Class

kind=”LSchema”

name=”DSD”

:Class

kind=”Entity”

name=”roles”

Figure 3.27: IND annotations

RELATIONSHIP RETRIEVAL

An IND is a subpattern for an R-IND. In case that two entities were annotated with an IND

and that one side of the participating Attributes are a PrimaryKey we have an R-IND, cf.

Figure 3.28. To ensure that the Attributes composing the IND are chosen the SELECT an-

notation is used. Finally, we ensure that the IND was not already annotated as IIND be-

cause I-INDs and R-INDs exclude themselves mutually, cf. [FV95]. The IND examples

of Figure 3.26 are R-INDs because userid and projectid are primary keys of users and pro-

jects, respectively.

The graph productions for I-INDs and C-INDs are similar to the R-IND rule. Definitions

of these patterns and subpatterns, e.g., AttrSim, PrimaryKey or SELECT, can be found in

[Bew98] and [Jah99]. Note that annotations can also by added be the reengineer.

In case of inter-schema relationships the pattern definition, i.e., the graph productions, are

composed of elements of the conceptual schemas. Figure 3.29 shows the Association pat-

tern specification which is in analogy with the specification of an R-IND (Figure 3.28).

The code example shows an association between tblOutcomes in database outcomes_be

and Patient in database hospdata. The key is VHS_ID# from Patient. In analogy with R-

IND, Association and Inheritance exclude themselves mutually.

Finally, we present a definition of a redundancy dependency namely replication. The ex-

ample code of Figure 3.30 shows an INSERT query for tblDose followed by a nested

UPDATE /SELECT query that indicates replication. Attributes LastDose, LastPurpose

and LastRoute from hospdata.DrugProfile are updated with the corresponding attribute

values of Dose, Purpose and Route from tblOutcomes.tblDose. The WHERE clause can

be discarded for replication retrieval. The Replication pattern definition of Figure 3.30 ex-

actly covers such situations. However replication can be implemented in different ways.

Figure 3.28: R-IND pattern definition

DATA-ORIENTED REVERSE ENGINEERING

Defining a pattern for every possible implementation would result in a multitude of pat-

tern definitions and consequently graph productions. For this reason we group the defini-

tions. An example is the Comparison pattern:

•product.refid = files.fileid (Figure 3.26)

•DrugID = (SELECT DrugID FROM ... (Figure 3.30)

In both cases a comparison of the attributes is done within the WHERE clause from a SEL-

ECT statement. The abstract syntax graph enables that one pattern definition covers both

implementations. Another example is the R-IND and Association pattern definitions.

These two patterns differ only in the schema they can be applied. If we ignore that the par-

ticipating entities are either entities or classes and treat an IIND as an Inheritance, both

pattern definitions are identical.

Figure 3.3 represents an excerpt of the domain model that underlies the pattern definition,

i.e., it gives an overview of the relationships and data dependencies. In the domain model

we do not distinguish between elements of the logical or conceptual schema. For example

we have entity, attribute, PrimaryKey, IND, association, inheritance, etc. Depending

whether we are in the logical or conceptual schema an entity is an entity or a class, or an

Figure 3.29: Association code example and pattern definition

// get the the symtomes for a patient

SQL SELECT Patient.VHS_ID#, tblOutcomes.symtomeDescription, tblOutcomes.Date,

tblOutcomes.Time FROM outcomes_be.tblOutcomes, hospdata.Patient WHERE tblOut-

comes.VHSID = Patient.VHS_ID# AND Patient.SURNAME = :SN AND Patient.FIRST NAME

= :FN;

RELATIONSHIP RETRIEVAL

inheritance is an IIND or inheritance. Note that, e.g., INDs are annotated in both schemas

but only shown in the logical schema representation (EER diagram).

Nevertheless, grouping of annotations in the domain model and of pattern definitions still

results in a multitude of graph productions. We refer to [NWW03] were this is shown for

association detection in Java code. We overcome this problem by introducing uncertainty

in our graph productions and thus pattern definitions. This will be discussed in Section

3.3.4.

3.3.3 Pattern Instance Retrieval

We described an effective formalism for defining a catalogue of patterns as the basis for

pattern instance retrieval. The retrieval process for pattern instances (in web information

systems) is inevitably an iterative one. Typically, the reverse engineer first applies an ini-

tial set of patterns, then repeatedly examines the results, adjusts the patterns to address

perceived deficiencies and reapplies them until a satisfactory outcome is achieved. To

support this process the engineer needs tool support that applies the pattern instance re-

trieval process to the system involved and displays the results obtained.

Figure 3.30: Replication code example and pattern definition

SQL INSERT INTO tblOutcomes.tblDose VALUES (dose_id, PatientID, date, time,

drug, dose, route, purpose, provider, NULL, NULL, NULL);

...

SQL UPDATE hospdata.DrugProfile SET LastDose, LastPurpose, LastRoute SELECT

Dose, Purpose, Route FROM tblOutcomes.tblDose WHERE (VHSID = PatientID) AND

(DrugID = (SELECT DrugID FROM hospdata.Drugs WHERE Description = drug)) ;

DATA-ORIENTED REVERSE ENGINEERING

To devise a tool that meets this requirement we adopt a threefold strategy. Firstly, we min-

imise the rule scalability problems by adopting the best available analysis algorithm. Sec-

ondly, we adapt this algorithm to deliver useful results incrementally rather than on

completion. Thirdly, we involve the reverse engineer in the analysis process, to avoid un-

necessary computation of unwanted analysis results. We give an overview of this strategy,

deeper details can be found in [Nie03].

The basic analysis algorithm

Pattern-based retrieval is a deductive analysis problem where patterns, or rules, are re-

peatedly applied to the abstract syntax graph to arrive at the most complete characterisa-

tion of the system permitted by the rules. Pure deductive analysis algorithms typically

apply the rules involved level by level, bottom-up1, according to their natural hierarchy,

and produce useful results only when analysis is complete. Results from other researchers,

such as [Qui94] and [Wil96], suggest that a reverse engineering tool providing fully au-

tomatic analysis based on this approach cannot scale for larger software systems.

Where patterns are defined as graph productions, as in our case, graph transformation sys-

tems are the natural choice for implementing the tool. However, the scalability problem

also applies to graph transformation systems such as PROGRES [Zün95] or AGG [AGG],

which apply the rules in an arbitrary sequence usually determined by the internal data

structures used.

In practical application it has been experienced that rules are generally applied in a given

context. Therefore, FUJABATS, in contrast to other graph transformation systems, applies

rules given a context, normally one object in the graph. The advantage is that it reduces

the runtime complexity of the rule matching algorithm to polynomial size, whereas the

original sub-graph matching problem is NP-complete [Meh84]. This restriction to the

original theory of graph grammars has been shown not to be a problem. For more details

we refer to [Zün01] and [FNTZ98].

By adopting FUJABATS as the platform for our tool, we therefore reduce the problem of

scalability compared to systems using standard approaches to deductive analysis. For the

reverse engineer, however, this does not necessarily solve the performance problems in-

volved.

Adapting the analysis algorithm

Although FUJABATS reduces the computational complexity of analysis, a fully-automatic

tool based on FUJABATS is still undesirable, as the results are made available only when

1. In comparison, pure top-down approaches starting with top-level rules in the topology hierarchy

are only of theoretical interest, because of the search-space implied. Even when a specific rule is

identified for application, without an adequate starting context its top-down application is

impractical.

RELATIONSHIP RETRIEVAL

analysis is complete. Given that reverse engineering is an iterative process, such tool be-

haviour does not lead to an efficient overall process.

Suppose, for example, the Duplication pattern of Figure 3.31 does not include the attribute

similarity (AttrSim) check. The resulting false positives manifest themselves early in the

analysis, but the reverse engineer has to wait until analysis is complete to recognise them.

For reverse engineering, therefore, a semi-automatic process is likely to be more effective,

in which useful intermediate results are produced and the engineer is allowed to interact

with them, either to add information and request that analysis continues or to revise the

rule definitions and restart analysis.

To support such a process, the analysis algorithm itself must produce intermediate results

useful to the engineer as early as possible, and be amenable to interruption and resumption

without loss of results to date. Since the results most useful to the engineer are those pro-

duced by rules at the highest levels in the rule hierarchy, we adopt an analysis algorithm

which combines a bottom-up strategy and a top-down strategy. Note that the algorithm

affects only the execution sequence of patterns and does not violate their formalisation as

graph productions.

To define the algorithm, the dependency hierarchy of the rules is levelled, such that each

rule has a level number. A rule depending only on objects in the initial abstract syntax

graph gets number 1. A rule depending on other rules, i.e., whose definition includes an-

notations created by other rules, gets a higher number consistent with the natural topolog-

ical order of the rules. Rules included in cycles concerning their dependencies get the

same level number and are marked as recursive.

Figure 3.32 shows a snapshot of our analysis algorithm. The grey rectangle at the bottom

represents all objects in the abstract syntax graph. The black oval identifies an annotation

Figure 3.31: Duplication pattern definition

DATA-ORIENTED REVERSE ENGINEERING

already created by bottom-up analysis (with a link to the annotated object a1) while grey

ovals, together with the grey links, represent a top-down analysis in progress. Directed

arcs indicate the scheduling sequence of the rules. Variables at the arcs represent objects

passed to the scheduled rule as context. We omit the links for :AttrSim, :NS and :TC for

readability reasons.

Bottom-up strategy

After the abstract syntax graph is created, the analysis starts in bottom-up mode. Initially,

all abstract syntax graph objects schedule level 1 rules, i.e., those depending on abstract

syntax graph objects only. Scheduling initially only level 1 rules is sufficient to ensure that

all necessary rule applications are eventually considered. It avoids many top-down fail-

ures that would otherwise occur, because the information available is not enough to estab-

lish a high level rule. Consider, for example, a Duplication rule scheduled by a single class

or attribute. The inherent search space is too large to justify its top-down investigation.

An object o scheduling a rule R creates a rule/context pair R(o) which is added to a bot-

tom-up priority queue held in descending order of rule level number. The use of rule level

numbers to order the rule/context pairs in the bottom-up queue is not critical. Any order-

ing that promotes higher-level rules will do. This fact can be exploited to further tune the

algorithm [NSW+02].

The algorithm continues in bottom-up mode by dequeueing the first rule/context pair, in

our example INSERT(i1), cf. Figure 3.32. This rule is immediately applicable, so an IN-

SERT annotation i is created, annotating the attribute a1, which is accessible via the IN-

SERT object i1. In contrast to abstract syntax graph objects, which schedule level 1 rules

Figure 3.32: Sample analysis execution

c1:Class a1:Attribute a2:Attribute

abstract syntax graph ... ...

... ...

...

i:INSERT

attrs

attrs i1:INSERT

attr

:Duplication

s:SELECT

:AttrSim

composed_by

entities entities

a1,a2

:NS :TC

a1,a2 a1,a2

composed_by

joins

part_of

c2:Class

:Class ASGNode

execution direction

annotating links

:INSERT annotation

RELATIONSHIP RETRIEVAL

only, creation of i schedules all rules depending on the INSERT rule, e.g., the top-level Du-

plication rule.

Since Duplication is a top-level rule, the pair Duplication(i) is taken next from the bottom-

up queue. At this point, however, Duplication(i) cannot be applied successfully, since an-

notations have yet to be created by the other rules on which Duplication depends (SELECT

and AttrSim).

Top-down strategy

When a rule that depends on other rules cannot be applied in bottom-up mode, the algo-

rithm switches to top-down mode, which uses a separate top-down priority queue. The

top-down strategy tries to make the other rules create the missing annotations based on

currently available information. In this case, the search space is quite strictly delimited by

the information available, e.g., that inherent in the i:INSERT.

Consideration of Duplication(i) in top-down mode thus schedules the SELECT and the At-

trSim rules, cf. Figure 3.32. Where such rules depend on other rules, rule scheduling con-

tinues recursively. In our case, for example, the AttrSim rule now schedules NS rule and

TC rule, as Figure 3.3 implies.

To establish a Duplication consistent with i, the relevant context for the SELECT rule is

the INSERT annotation i. For the AttrSim rule the attributes a1 and a2 obtained from i and

s, respectively, are an alternative context, which have to be considered.

The rule/context pair at the front of the top-down mode queue is not dequeued if the rule

involved schedules other lower-level rules. Instead, pairs added to the top-down queue are

queued in ascending order of their level number. This means that the higher-level rule will

be reconsidered after the lower-level rules on which it depends (if these succeed). Using

a priority queue rather than a stack means that the top-down algorithm goes as far down

the abstract syntax graph as quickly as possible. This encourages earliest possible failure

in top-down mode, while maintaining an appropriate sequence of rule applications for

top-down success. If a rule marked as recursive is added to the top-down queue, however,

stack behaviour is adopted until all rules marked as recursive have been removed from the

stack/queue.

When the rule at the front of the top-down queue can be applied, a corresponding annota-

tion is created, all dependent rules are scheduled for bottom-up consideration, the front

entry of the top-down queue is dequeued and the next element of the top-down queue is

considered. The newly scheduled rules join the bottom-up queue since they represent

analysis results that would have been created later in bottom-up mode anyway and need

further investigation.

The algorithm runs in top-down mode until the top-down queue is empty or a rule in the

queue fails with no alternative contexts left to explore. The first case means that the rule

that started this top-down phase has been successfully applied, in our example the Dupli-

DATA-ORIENTED REVERSE ENGINEERING

cation rule. In the second case the starting rule cannot be applied in the given context. In

either case the algorithm switches back to bottom-up mode.

Intermediate results

With the algorithm as described, each annotation once created represents an intermediate

result that is not affected by subsequent analysis. In principle, therefore, the execution can

be interrupted for inspection of results by the engineer at any stage. In practice, however,

it is illogical to allow interruption during a top-down interlude, when some but not all of

a closely related set of annotations may have been created.

Since the algorithm tries to establish high level rules using the top-down strategy, the in-

termediate results are likely to be useful information for the reverse engineer, e.g., redun-

dancy dependency patterns. The engineer reviews such patterns to determine if the

analysis should continue on the current basis. The algorithm is also robust to certain

changes by the engineer prior to resumption. Addition of annotations by the engineer is

valid at this stage, provided these add all corresponding rule/context pair for dependent

rules to the bottom-up queue. Marking a rule as ‘to-be-deleted’1 is also acceptable, as the

consequences of deletion can be systematically propagated to both the results to date and

the resumed analysis. Such actions may be useful to the engineer as ‘proofing actions’ pri-

or to permanent change to the rules themselves. Any addition or modification to the rules,

however, invalidates the analysis to date and requires restart of the overall analysis.

The overall analysis finishes when the bottom-up queue is empty. In this case the algo-

rithm has analysed all abstract syntax graph objects and created annotations on the objects

for all rules that could be applied.

Integration of the reverse engineer

Integrating the described analysis algorithm into a semi-automatic reverse engineering

process is easy because it is interruptible. Figure 3.33 shows our pattern retrieval process

as a statechart. The process starts by creating abstract syntax graph representation, fol-

lowed by loading a particular pattern catalogue. The engineer can then make initial mod-

ifications before starting the analysis algorithm by sending a start() event.

The complex state on the left-hand side with its two internal states bottom-up strategy and

top-down strategy represents the analysis algorithm described above. The algorithm halts,

and the reverse engineer can look at the results, if the algorithm has finished or the reverse

engineer interrupts the execution by sending a stop() event. As mentioned above, it is log-

ical to confine such interruptions to bottom-up mode purely for pragmatic reasons.

The reverse engineer then has the opportunity to look at the results to see if the patterns

selected still seem appropriate. By sending an adapt() event, he/she can mark patterns for

deletion or create annotations that will steer the algorithm to a part of the source code that

1. Note that the rules are not deleted but only marked as ’to-be-deleted’ for exploration.

RELATIONSHIP RETRIEVAL

he/she wants to have analysed. Resuming rather than restarting the algorithm systemati-

cally propagates the consequences of such changes to both the prior and subsequent anal-

ysis. If the analysis to date fails to meet the engineer’s needs in other ways, the patterns

can be modified, but in this case the whole analysis must restart from the beginning.

3.3.4 Handling Uncertainty using Fuzzy Beliefs

The performance of the retrieval process is crucial regarding the involvement of the re-

engineer in the process. The presented retrieval process solves the performance problem

only partially. As mentioned before large abstract syntax graphs, i.e., a large search space,

creates this performance problem. The large number of graph productions describing pre-

cisely a pattern intensifies the performance problem. Reducing the search space means

that the system is only analysed partially, i.e., information is lost [NWW03, Nie03]. Re-

ducing the number of graph productions means to reduce the number of patterns covered

and thus get incomplete results. Identifying common parts of different graph productions

and replace them by one graph production leads to many false positives. The reengineer

is then not longer able to distinguish correct results from false positives.

Our approach is to allow reducing the number of graph productions and therefore get im-

precise results but also allow expressing the degree of impreciseness by assigning fuzzy

beliefs to the graph productions. In addition to the fuzzy results, our approach allows to

filter the results with fuzzy values higher than a certain threshold. This enables the reengi-

neer to profit from appropriate time-limit results and valuing the found matches of a pat-

tern. Our experience shows that even precise pattern definitions produce false positives.

Such this approach is a trade of between the number of graph productions and the number

Figure 3.33: Retrieval process statechart

create abstract

syntax graph

load pattern

catalogue

modify, add,

delete pattern

show

(intermediate)

results

mark patterns for

deletion; create

annotations

bottom-up

strategy

top-down

strategy

analysis algorithm

[change

Strat]

start()

[change

Strat]

stop() or

[finished]

continue()

ready()

adapt()

modify()

load states

control flow

start()

DATA-ORIENTED REVERSE ENGINEERING

of false positives to produce rapid intermediate results with the benefit that the uncertainty

is presented to the reengineer.

Indicator Impreciseness

Another reason for adding impreciseness to graph productions is the impreciseness of in-

dicators that can be found in the code. This uncertainty problem of indicator-based retriev-

al processes is discussed [Jah99] in the context of (single) database reengineering

processes. The code example 2 of Figure 3.34 shows such indicator impreciseness for a

Duplication pattern. The two INSERT statements get the same value :DD and are precon-

ditioned to be „close“ to each other, i.e., within the same transaction boundaries. Never-

theless we do not have the certainty that the value of :DD is not changed between the two

statements.

Figure 3.34 shows three examples of possible code fragments that indicate duplication. In

each example the DEATH DATE of Patient is duplicated into DEATHDATE of Deceased.

These three occurrences of a duplication are covered by the graph productions of Figure

3.31 and Figure 3.35.

We introduce a so called fuzzy belief for each graph production that expresses its precise-

ness. The fuzzy belief of a rule is a value between 0 and 100. By this value, the reengineer

expresses his estimation that, e.g., 20% of all matches are false positives. Thus, 80% of

all matches would be correct and the value would be 80. In Figure 3.31 and Figure 3.35

the fuzzy belief is defined in the green Duplication annotations that are created when the

graph productions are applied. The fuzzy belief is the first value in the pair of numbers;

the second is explained in the adapted analysis execution. Note that a fuzzy belief of 100

for all graph productions is equivalent to the retrieval process where uncertainty is not

considered like described in Section 3.3.3.

Figure 3.34: Code samples for duplication

// example 1

SQL INSERT INTO bvmtdata.Deceased (DEATHDATE) SELECT Patient.(DEATH DATE) FROM

hospdata.Patient WHERE Patient.VHS_ID# = Deceased.VHS_ID#;

...

// example 2

SQL INSERT INTO bvmtdata.Deceased (DEATHDATE) VALUES :DD;

...

SQL INSERT INTO hospdata.Patient (DEATH DATE) VALUES :DD;

...

// example 3

DD = „SQL SELECT Patient.(DEATH DATE) FROM hospdata.Patient WHERE Pa-

tient.VHS_ID# = PatienID;

SQL INSERT INTO bvmtdata.Deceased (DEATHDATE) VALUES :DD;

...

RELATIONSHIP RETRIEVAL

Managing Uncertainty

An exact graph production for each implementation variant has the benefit of few false

positives in the analysis results, but it leads to performance problems because of the high

number of rules. We introduce further abstraction into the rule definitions so that one or

only few graph productions cover a large variety of implementations. This reduces the

number of rules and improves the performance. To define a new graph production with

further abstraction the reengineer, first, chooses a set of implementations or pattern defi-

nitions that should be covered by the new rule. Common parts have to be identified. The

common parts, then, form the new graph production. All implementations or patterns

from the set are then at least covered by this new graph production. In addition, there will

be implementations or patterns that were not intended to be found by the rule. Some of

them are not instances of the searched pattern, i.e., false positives. Others are correct

matches that were not considered when defining the graph production. This is a kind of

uncertainty that has also to be managed.

Figure 3.36 depicts a new graph production for Duplication that is designed to replace all

three graph productions from Figure 3.31 and Figure 3.35. Therefore, the definition of IN-

SERT and InsertValue were revisited. The nested INSERT / SELECT of example 1 as

well as the examples 2 and 3 in Figure 3.34 are covered by imprecise InsertValue graph

Figure 3.35: Alternative duplication pattern definitions

DATA-ORIENTED REVERSE ENGINEERING

productions. Consequently, all three examples are covered by the new INSERT definition

and thus by the Duplication graph production.

The impreciseness of a graph production that stems from reducing the number of rules has

to be valued to be useful to the reengineer. The value should describe the ratio between

correct matches and all matches of a certain rule including false positives, i.e. the precise-

ness of a rule. After all applications of a rule, this ratio can be calculated. At the time of

the rule definition the number of false positives produced by the rule can only be estimat-

ed.

In our example we estimate that 30% of all matches for Duplication are false positives. We

reduced it compared to the three „precise“ graph productions. Only the indicator impre-

ciseness was considered so far. Thus, the ratio is 70. For the INSERT annotation we esti-

mate the ratio to 80.

Adapted analysis execution

In the following we revisit a part of the algorithm to show how the fuzzy values of the

results are calculated and influence the analysis execution. The graph productions create

annotations with a certain fuzzy value. The fuzzy value is calculated during rule applica-

tion for each match of the graph production. They are stored in the fuzzyValue attribute

of the Annotation nodes (cf. Figure 3.23). The fuzzy beliefs of the graph productions are

the basis for the fuzzy value calculation. The fuzzy value of a match is computed as the

minimum of the fuzzy values of all annotation nodes occurring in the match and the fuzzy

belief of the rule [Jah99, Nie03]. Thus, calculation of the fuzzy values is similar to fuzzy

grammars, cf. [Zad65].

Figure 3.36: Imprecise duplication and insert pattern definitions

RELATIONSHIP RETRIEVAL

To apply the Duplication graph production of Figure 3.36 two annotations (:INSERT and

:AttrSim) are needed. The fuzzy belief of the Duplication graph production is 70. In Figure

3.37 a cut-out of the graph during an analysis execution is shown. The INSERT graph pro-

duction has created an annotation i with value 80 on insert i1. Upon this annotation i the

Duplication graph production has created annotation d. Since a Duplication also needs an

AttrSim annotation top-down analysis is performed to detect a with value 95. A fuzzy val-

ue of 70 (the minimum of 80, 95 and 70) is assigned to the annotation node d. The fuzzy

belief of 70 of the Duplication graph production is the minimum, compared to 80 and 95

of the other two annotations. Thus, the fuzzy belief of the graph production is an upper

bound for the fuzzy values computed.

A node in the graph production may have multiple valid matching annotation nodes in the

host graph that have different fuzzy values. In that case the annotation node with the max-

imum fuzzy value is chosen as match such that the new annotation created by the graph

production application uses the most reliable source of information. Assume that later a

second annotation i’:INSERT with value 60 is detected. This annotation would not be con-

sidered furthermore because it has a fuzzy value of 60 which is lower than the fuzzy value

80 of annotation i used in the match of the Duplication. Therefore, the fuzzy value of an

annotation corresponds to the maximum fuzzy value of all derivations of the annotation,

cf. [Zad65].

As a way to limit the graph production applications to reasonable cases we introduced

thresholds, cf. attribute threshold in Figure 3.23. In Figure 3.36 the threshold is defined in

the grey (green) annotation nodes that are created when the graph productions are applied.

Figure 3.37: Sample analysis execution with fuzzy values

abstract syntax graph

... ...

...

i:INSERT

attrs i1:INSERT

attr

composed_by i1

entities

a1,a2

d:Duplication

a:AttrSim

composed_by

similarities

i2:INSERT

attr

i’:INSERT

attrs

:Class ASGNode

:INSERT annotation

execution direction

annotating links

c2:Class

a1:Attribute

a2:Attribute

c1:Class

DATA-ORIENTED REVERSE ENGINEERING

It is the second value in the pair, i.e., 40 for Duplication and 50 for INSERT. In our sample

analysis execution of Figure 3.37 the value of the annotation i is 80. The annotation d is

created because 80, as well as 95 later for a, is greater than 40 (the Duplication threshold).

If any annotation that is part of the subgraph to match has a fuzzy value lower than the

threshold, the graph production would not be applied. This helps to minimise computation

time and memory resources that would otherwise be used for investigation of unreliable

and imprecise information. Thus, the thresholds improve the scalability of our approach.

They are chosen by the rule developer based on personal experience and/or historical data.

Note that the threshold of the Duplication (40) is lesser than the threshold of INSERT (50).

Defining the threshold of a subpattern (INSERT) higher than the threshold of a pattern

(Duplication) based on this subpattern makes only sense if the pattern is based on another

subpattern (AttrSim). In case that the annotation a would have a fuzzy value of 30 the Du-

plication is not created because the value 30 is lesser than the Duplication threshold of 40,

but a does not affect the INSERT subpattern with threshold 50.

Tuning Fuzzy Beliefs and Values

One problem remaining in this incremental semi-automatic pattern instance retrieval pro-

cess is the exact choice of the fuzzy beliefs and thresholds of the graph productions. Both

values are estimated by the reengineer or at the best adapted manually over the time by

the reengineer during the incremental process. In [Nie03] an automatic adaptation of the

fuzzy beliefs and thresholds is presented. This adaptation is based on the comparison be-

tween the given values and calculated values from pattern instance occurrences based on

the corrections of the reengineer.

The impreciseness of the indicators can be further reduced. Therefore validation of detect-

ed relationships, which results in the diminution, confirmation, attenuation or augmenta-

tion of the associated fuzzy values, is done by several satellite activities.

One possibility to verify assumptions is looking at the data, e.g., checking a primary key

by validating the uniqueness of each data column. Assume that a column, which is anno-

tated to be a primary key, contains identical values several times. This indicates that the

primary key assumption may be false. The number of counterexamples relative to the

number of present data determines the fuzzy value in this case, cf. [Jah99]. Note that if no

counterexample is found for enough present data the fuzzy value will be augmented.

Finally, the reengineer has to accept or reject the annotations [Jah99, Wen01, Nie03]. An

annotation that is confirmed by the reengineer is attributed a fuzzy value of 100. In case

of rejection the fuzzy value is set to 0. For this activity of confirming or rejecting annota-

tions, the reengineer has the possibility to look into the code. Slicing the involved entities

for a recovered relationship shows uses of the entities different from pattern instances in-

dicators. Some code fragments may not be covered by the island grammar parser and/or

the pattern instance retrieval and can therefore provide further information to the reengi-

neer.

RELATIONSHIP RETRIEVAL

Figure 3.38 shows some annotations from the examples of Section 3.3. We see the Asso-

ciation between Patient and tblOutcomes (Figure 3.29) and the Replication between Drug-

Profile and tblDose (Figure 3.30). The Duplication between Patient and Deceased (Figure

3.34 and Figure 3.36) is Confirmed by the reengineer. The association profile was retrieved

as R-IND in the logical schema and mapped to the conceptual schema.

Figure 3.38: Annotated conceptual view

Figure 3.39: Conceptual schema view (data dependencies)

DATA-ORIENTED REVERSE ENGINEERING

Finally, when the reengineer confirms a data dependency pattern instance it is trans-

formed into an association with the corresponding stereotype, cf. Figure 3.39. Note that

the reengineer has the possibility to confirm all pattern instances greater or equal a cust-

omisable fuzzy value.

3.4 Tool Support

The tool support for the data-oriented reengineering process is located in the REDDMOM

project. REDDMOM is mainly based on the FUJABATS and part of the FUJABA Reengineer-

ing tool (FUJABARE). FUJABATS has a plug-in architecture allowing developers to add

functionality easily while retaining full control over their contributions. This plug-in ar-

chitecture enables extension and integration at the meta-model level. Details can be found

in [BGN+03].

REDDMOM (and FUJABARE) uses this plug-in architecture to reuse and combine existing

tools. Figure 3.40 shows an overview of the REDDMOM tools for data-oriented reverse en-

gineering. Several figures in the precedent sections show screenshots from the different

tool user interfaces. The grey rectangles represent tools from other universities that we

use. Dashed rectangles express that the tools in question are (largely) generated. The tools

inside the grey area were developed in FUJABARE. The remaining tools were constructed

in REDDMOM.

Data Model Parser

We use the JDBC API as interface to parse the physical data models of the different data

sources. Presumed that a JDBC driver exists for a data storage facility the Data Model Par-

ser can parse the data structure. Note that meta-data can only be parsed if they were in-

troduced in the data repository. An XML parser and especially an XMLSchema parser is

currently inserted in FUJABARE. From the output of these parsers the corresponding ab-

stract syntax graph is directly created.

EER Editor

The EER Editor represents the logical (schema) abstract syntax graph in an EER diagram,

i.e., entities, attributes and relationships. We added specific data reverse engineering con-

structs, e.g., views or variants, to the editor. It is similar to the analysis Front-end of the

VARLET ANALYST [Jah99]. The schema elements and detected pattern annotations are

represented. The reengineer can apply schema transformations (e.g. createRIND) and an-

notation related operations (e.g. confirmAnnotation or setFuzzyValue).

Triple-Graph-Grammar Editor

To provide inter-model consistency between the logical and conceptual schema we use

triple-graph-grammar rules, cf. Section 3.2.3. These rules can be defined in the Triple-

Graph-Grammar Editor. From these rules in a first step story diagrams are generated and

TOOL SUPPORT

in a second step Java code is generated with FUJABARE, see Figure 3.12 to Figure 3.17.

The mapping graph model (Figure 3.11) connects the logical and conceptual model via

the FUJABATS meta-model integration pattern [BGN+03].

Mapping Rules

The Mapping Rules are the generated Java code from mapping rules defined as triple-

graph-grammar rules in the Triple-Graph-Grammar Editor. For runtime optimisation we

ordered the rule application sequence and assign each rule a context with an input param-

eter, i.e., a node of the abstract syntax graph where to start the sub-graph matching.

Pattern

Specification

Editor

(schema ASG model)

Figure 3.40: Architecture of the REDDMOM reverse engineering tools

Data Model

Parser

(via JDBC)

EER Editor

(logical schema

ASG model)

UML Editor

(conceptual schema

ASG model)

Tranformation Editor

(schema ASG model)

Mapping

Rules

Triple-Graph-

Grammar Editor

(mapping graph model)

Reengineer

Domain

Expert

Pattern

Instance

Retrieval

(GFRN / FPN)

Island Grammar

Parser

(BUFFY, MANGROOVE)

Java Slicer

(BANDERA)

Code Parser

(Java/SQL,

JavaCC)

Reengineer

Domain

Expert

Web

Information

System

generated/generates

information flow

provided by

FUJABARE

DATA-ORIENTED REVERSE ENGINEERING

Transformation Editor

The reengineer can define basic graph transformations with REDDMOM, cf. Figure 3.18.

The Transformation Editor enables the reengineer to construct composed graph transfor-

mations out of basic or composed graph transformations. In REDDMOM the Transformati-

on Editor permit the construction of graph transformations for the logical as well as the

conceptual schema, cf. Figure 3.41. Wildcards are used to assign the parameters of the

SplitClass operation to the parameters of createClass, createAssoc, moveAttribute, etc.,

cf. Figure 3.19.

Code Parser

The abstract syntax graph is produced by the JavaCC source code parser [JCC]. Our Code

Parser can read SQL and Java. The Java parser stems from FUJABARE and REDDMOM

added the SQL parser. The Java parser is incremental, i.e., classes, attributes and method

headers are directly parsed and presented in an UML class diagram. The method bodies

are only parsed on demand when a user wants to see a method body. Both parsers are cur-

rently adapted to incrementally parse files, like described in Section 3.3.1. We refer to

[Sch01] for incremental parser details.

Figure 3.41: SplitClass composed transformation example

TOOL SUPPORT

UML Editor

The UML Editor covers UML class diagrams, views [Rec01] and story diagrams

[FNTZ98] with the respective functionalities. It is the core of FUJABATS together with the

Java code generation. REDDMOM added stereotypes for associations to the class diagram.

The story patterns, i.e., the activities in story diagrams, were extended with a path con-

struct.

Pattern Specification Editor

The Pattern Specification Editor was developed in [Pal01] and part of FUJABARE. In Sec-

tion 3.3.2 several pattern definitions are shown. Further, the Pattern Specification Editor

contains a pattern catalogue, that is automatically created during pattern specification,

which shows the dependencies between the pattern graph productions, cf. [Nie03].

Pattern Instance Retrieval

The Pattern Instance Retrieval tool is based on the Generic Fuzzy Reasoning Nets

(GFRN) [JSZ97, Jah99]. The analysis execution takes place on Fuzzy Petri Nets [KM96].

Details of the implementation can be found in [Wen01] and [Nie03]. This tool is part of

FUJABARE and runs the algorithm presented in Section 3.3.3 and Section 3.3.4.

Island Grammar Parser

Island Grammar Parsers can either be generated by MANGROOVE or BUFFY. Both tools

are prototypes from other researchers; we present them shortly.

Mangrove [Moo01] is a parser generator that takes an island grammar in SDF [HHKR89]

format and produces an extractor for this (partial) language. The definition of an island

grammar requires an intimate understanding of the concept of island grammars and is a

highly explorative process. Still, simple island grammars can also produce good results.

BUFFY [Chu04] generates extractors based on code fragment examples. The user intro-

duces an interesting code fragment into BUFFY. Then, BUFFY suggests a set of island pro-

ductions for this instance. Subsequently, the user can interactively correct and refine these

productions to characterise the associated island. Then, the user can run the generated ex-

tractors against the code and verify the island definition. Depending on the result of this

verification step the user might iteratively refine the description of the island.

Java Slicer

BANDERA [HDZ00] is a Java model checker which incorporates a Java Slicer. This slicer

enables the detection of method calls of code fragments of interest such as described in

Section 3.3.1 and the slices for entities involved in a retrieved relationship, cf. Section

3.3.4.

DATA-ORIENTED REVERSE ENGINEERING

3.5 Related Work

Considerable effort has been made to develop concepts and methods to reverse engineer

legacy information systems. Some of these methods have been implemented in computer-

aided reengineering tools to automate laborious activities and reduce the complexity of

the data reengineering problem. As the persistent data structure is the central part of a leg-

acy information system [Aik96], many approaches focus on

• data model (database schema) analysis and/or

• model (schema) translation and transformation.

Most existing approaches to data model analysis aim to recover a complete model (logical

schema), e.g., [PB94, SLGC94]. Premerlani and Blaha propose a set of simple, loosely

coupled tools for textual search and data analysis, e.g., grep, awk-scripts, and predefined

database queries [PB94]. Signore et al. present a knowledge-based approach based on

backward reasoning to infer schema constraints from collected indicators [SLGC94]. An

example for an approach that annotates schemas semantically is presented in [RH96,

RH97] based on the idealistic assumption of a structurally complete schema description

[HCTJ93].

Various algorithms for canonical translation of logical to conceptual schemas (e.g.,

[BDH+87, NA87, JK89, MM90, SK90, And94, PKBT94, MCAH95, Fon97, RH97]) are

fully automatic approaches and often make unrealistic assumptions about the quality of

the legacy system, like structurally complete schema description. All these approaches

provide very little support for iteration and in particular lack the ability to detect and prop-

agate inconsistencies between a (modified) legacy system and its representation.

Vossen and Fahner suggest combining an initial automatic translation with a subsequent

manual transformation phase but do not provide any tool support for this human-intensive

activity [FV95]. In [BGD97], Behm et al. propose an interactive schema migration envi-

ronment that provides a set of alternative schema mapping rules. In this approach, which

is similar to the migration environment presented in [JSZ96], the reengineer repeatedly

chooses an adequate mapping rule for each schema artifact that has to be mapped. It

turned out that the set of rules become very large and that a transformation phase after an

initial translation phase is much easier for the reengineer. Jeusfeld and Johnen propose an

approach to schema migration that employs a generic meta-model as mediator [JJ94]. The

advantage of this approach over direct translation is questionable as it was only evaluated

for translation of relational schemas to ER schemas.

The observations described by Blaha and Premerlani motivate user-centric, interactive re-

verse engineering approaches [BP98]. Indeed, one of the most important limitations of

most data reverse engineering tools is that they do not consider the evolutionary and ex-

ploratory nature of the reengineering process [HEH+95]. Therefore, Hainaut et al. skip the

initial translation step completely and also use a common generic data model that sub-

sumes conceptual constructs as well as logical (and physical) constructs [Hai89,

RELATED WORK

HHH+96]. In the DB-MAIN tool [EH99], the reverse engineering process is invoked by

predefined scripts which look at the application code [HHH+99] to extract data structures.

Data depenpency elicitation is supported through variable dependency graph and program

slicing [HH01]. Based on the common data model, they have defined a catalogue of sche-

ma transformations which are used to gradually replace low-level implementation con-

structs by more abstract concepts [HTJC94]. However, the execution of in-place

transformations impedes the iterative process because the original (logical) schema im-

plementation is lost during the process.

A notable exception is VARLET [Jah99] that covers all phases of database reverse engi-

neering from schema recovery up to building a conceptual schema. In VARLET an inter-

active process to handle uncertainty and inconsistency during recovery of information

models (comprising relationships) is based on Generic Fuzzy Reasoning Nets [JSZ97]

which revert to code and data analysis. Moreover it considers variants structures that are

largely used in forward engineering [BCN92, HHEH96] during analysis. This analysis

phase is followed by an initial automatic translation and schema transformation. It pro-

vides many adapted conceptual redesign transformations proposed by [BP98]. The VAR-

LET approach is limited to single schemas.

Valuable stand-alone approaches exist that focus on the retrieval of semantic constraints.

Relationship retrieval based on stereotypical (code) patterns are presented in [PKBT94,

PTBK96] and [And94]. Soutou presents an algorithm to recover n-ary associations

[Sou98b] followed by a method to retrieve aggregations [Sou98a]. In [BR97] methods

from the inductive logic programming domain are adopted to detect relationships.

All these data reverse engineering approaches are limited to one model (single schema)

analysis and do not consider distributed schemas. They lack the flexibility for recovering

inter-schema relationships. An overview of reverse engineering methods and tools that

can further be used and adapted for data reverse engineering is given in [MJS+00]. The

field of recovering inter-schema relationships is poorly explored.

A catalogue for inter-database/-schema dependencies has not been published yet. Rusink-

iewisz et al. [RSK91] present two examples of interdatabase dependencies: (1) replicated

data that is characterised as "identical copies of data in two or more databases" for which

"we can tolerate inconsistencies (...) for no more than one day" by the authors; (2) exis-

tential constraints which are, e.g., referential integrity constraints requiring immediate up-

dates. Both correspond to the redundancy dependency with respect to copies that have to

be kept consistent. Our approach also discovers possible redundant data by duplicated

schema elements.

A theory of attribute equivalence in databases on a semantic basis is presented in

[LNE89]. The approach uses semantic attribute equivalence for integration of database

schemas. Therefore, the characteristics of the attribute equivalence are very detailed and

restrictive. In contrast to schema integration, schema reverse engineering needs flexible

and general attribute property (characteristic) definitions.

DATA-ORIENTED REVERSE ENGINEERING

Identifying and solving conflicts in inter-schema knowledge in cooperative information

systems has been presented in various references, e.g. [BLN86, CL93, TGF00]. In these

approaches the discovery and representation of inter-schema assertions is studied to

"make explicit the knowledge which a human integrator uses implicitly to identify seman-

tic similar schema concepts" [TGF00]. This is different from the inter-schema knowledge,

i.e., explicit dependencies between the distributed databases, we retrieve.

Several approaches exist for pattern recognition. Harandi and Ning [HN90] present pro-

gram analysis based on an Event and Plan Base. The analysis process starts by firing ru-

dimentary events constructed from source code. Plans define the correlation between one

or more (incoming) events and they fire a new event corresponding to the intention of the

plan. Each plan definition corresponds to exactly one implementation variant, which leads

to a high number of definitions. This applies also to the approach of Paul and Prakash

[PP94], where a matching algorithm for syntactic patterns based on a non-deterministic

finite automaton is introduced.

An approach to recognise clichés, i.e., commonly used computational structures, is pre-

sented in [Wil96], within the GRASPR system. Legacy code to be examined is represent-

ed as flow graphs by GRASPR, clichés are encoded as an attributed graph grammar. The

recognition of clichés is formulated as the sub-graph parsing problem which is NP-com-

plete [Meh84]. This approach allows analysing only some thousand lines of code, which

was sufficient to detect data structures or search and sorting algorithms. Applying the ap-

proach to larger programs had failed.

Krämer and Prechelt [KP96] use Prolog in order to detect design patterns [GHJV95] in

C++ source code. The source code is parsed into facts and rules describing the relations.

Prolog’s execution mechanism applies the rules in arbitrary sequence and uses back-

tracking where necessary. The approach is able to analyse larger programs, but its precise-

ness is very low, because the approach uses information of header files only. An analysis

of method bodies is not supported.

An approach producing more precise results is presented by Antoniol et al. [AFC98,

TA99]. They use metrics, such as the number of method calls within a method body, to

include a method body analysis without time-intensive graph productions. Unfortunately,

the used metrics are inadequate to express detailed information, e.g. method calls within

loops. Therefore, a lot of false positives remain.

Keller et. al. present an approach [KSRP99] to recover design patterns. Patterns are de-

fined using UML and a pattern matching algorithm matches patterns on an abstract syntax

graph representation of the source code, also using the UML notation. The matching pro-

cess is executed using scripts and adoption of patterns is hard to follow especially when

patterns are highly interrelated. In addition Seemann and von Gudenberg [SvG98] present

an approach to recover design patterns starting with inheritance relations, call graphs,

naming conventions, and programming guide lines. The pattern definition of higher order

patterns allows a reverse engineer to compose patterns out of subpatterns and reduces

SUMMARY AND FUTURE WORK

thereby the number of definitions. Both approaches are also feasible for the pattern based

analysis task, but cannot deal with large programs.

Radermacher [Rad99] uses the graph rewrite system Progres [Zün95] to match patterns

on the program. Patterns are defined as graph transformation rules and thus similar to

ours. Radermacher uses the execution mechanism of the Progres environment, whereas

the execution is not incremental.

3.6 Summary and Future Work

In this chapter we presented our data-oriented reverse engineering process to legacy (web)

information systems based on existing approaches and tools. The chapter started with a

process overview. Then we presented a classification for relationships of data component

models in web information systems. After sketching our way of data component model

representation, the process was presented. Based on the VARLET approach [Jah99] we pre-

sented the data model recovery that consists of schema recovery, hidden schema part re-

trieval, schema mapping and schema refactoring. The relationship retrieval started with

code fragment extraction and parsing based on island grammars [vDK99] and slicing

[Wei84]. The pattern instance retrieval, based on Niere’s approach [Nie03], was present-

ed in three steps: the pattern definition, the inference algorithm and the handling of uncer-

tainty. Finally, we pointed out tools that support the iterative nature of this process.

The hard part of this process is the relationship retrieval. Several additional indicators can

be used for tuning the certainty of the retrieved relationships. Data profiling tools, e.g.,

SAS/STAT®1, dfPower® Studio2 or Evoke Axio™3, permit to find schema overlapping,

schema incompleteness or redundancy. Rank aggregation techniques, e.g., like presented

in [SA03], enables a relationship weighting. In analogy, the number of identical (similar)

indicators can be used for tuning. This also holds for software clones. In case that we find

multiple indicators for the same relationship but that the code fragments in question are

clones, the fuzzy value may be adapted. Dead code analysis that finds out if the code frag-

ments are still executed, may also be used to adapt the fuzzy value of a retrieved relation-

ship. Note that relationships detected in dead code are also of interest and should not be

rejected automatically. Finally, the pattern based recovery can be extended to dynamic

analysis [Wen03]. For code fragment extraction the Multi-Language Tool [LCBO03],

which detect dependencies between Java and C/C++ code, can be adapted to our purposes

1. http://www.sas.com

2. http://www.dataflux.com/

3. http://www.evokesoft.com

DATA-ORIENTED REVERSE ENGINEERING

100

101

CHAPTER 4: DATA COMPONENT

EXTENSION

He who moves not forward, goes backward.

JOHANN WOLFGANG VON GOETHE

German dramatist, novelist, poet & scientist (1749 - 1832)

4.1 Extension Approach

After understanding the web information system, the next step is to extend it in order to

meet new requirements. Our approach focusses on the modelling of new applications and

schemas, followed by the integration of these models. In Chapter 3 we describe how con-

ceptual data component models of a legacy web information system can be retrieved. We

base our extension approach on these retrieved models, i.e., models of the old data com-

ponents, and the models of the new data components. Figure 4.1 shows the system parts

considered for data component extension. The central parts are the conceptual models that

permit, among other things, to generate access to the databases of the legacy web infor-

mation system.

data-

base

logical

schema

conceptual

schema

appli-

cation relationship

partial appl.

model

cluster information flow

data access

LS1 LSi LSj LSm

legacy App.

CS1

new App.

legacy App. legacy App.

CSi CSj CSn

LSn

CSp

LSp

new App.

Figure 4.1: Web information system adaption

new App.

CSm

DB1 DBi DBj DBm DBp

DBn

DATA COMPONENT EXTENSION

102

The process of data component extension consists of integrating the different retrieved

parts and new parts to (clusters of) data components. The integration of data components

requires the following activities: Data Component Clustering, Classification and (Re)De-

sign, cf. Figure 4.2. These activities are preformed iteratively by the reengineer. Then, we

(re)structure the web information system with architectural patterns. We conclude the in-

tegration with the Generation of a data access layer and Model Execution.

The Data Component Clustering is the starting point of our approach. We use the data

components knowledge retrieved during the reverse engineering phase to cluster the sys-

tem into data components. This supports the reengineer during data component integra-

tion.

Then, the data components are classified in either Corporate Data Components, Data Me-

diation Components or Ubiquitous Data Components. This classification helps to de-

scribe the data components’ functionality and the actions they perform. Of course, during

Data-Oriented Reverse

Engineering

1.2. Data Component

Classification

Clusters

3. Generation

and Model

Execution

Data Components

Figure 4.2: Data Component Extension: Overview

Ubiquitous Data

Components

Data Mediation

Components

Corporate Data

Components

Web

Information

System

Reengineer

Domain

Expert

Data Component

(Re)Design

2. Architectural

Pattern

Application

Reengineer

Domain

Expert

1.1. Data Component

Clustering

users and experts

data component

cluster

information flow

prototype

system

Classified

Clusters

activity

link to previous

process step

DATA COMPONENT CLUSTERING AND CLASSIFICATION

103

DATA COMPONENT CLUSTERING AND CLASSIFICATION

the classification some data components may need to be adjusted concerning their cluster

affiliation.

During Architectural Pattern Application the web information system is (re)structured fol-

lowing four architectural patterns we propose.

During these activities, the reengineer may redesign data components that exist. Data

component parts, that are strongly related, have to be refined and / or adapted. Missing

data component parts or data components which are needed for new functionality have to

be designed. Data Component (Re)Design enables the reengineer to implement these

parts with design patterns, like suggested by [GHJV95], and/or to use (Re)Design Trans-

formations we propose. Since design patterns and model-driven design are well docu-

mented and understood, we pass on explaining it in this thesis and only consider it for tool

support, cf. Section 4.5.

The retrieved models enable the Generation of object-oriented access layers for the legacy

databases. Of course, this can also be done for new schemas/databases if they are mod-

elled accordingly. Further, this approach enables the modelling of functionality such that

data components can interact, i.e., exchange data. We propose to construct prototypes to

validate the (re)design. If the reengineer follows this model-driven design approach, the-

ses prototypes can be run by Model Execution.

4.2 Data Component Clustering and Classification

Before the reengineer designs new system parts, he may rearrange the existing web infor-

mation system by splitting it up into clusters (data components) and classify them. Thus,

the first activity of data component extension is data components clustering followed by

their classification. Generally the retrieved system parts will not fit the reengineers mind

model of the system. Thus, redesign takes place in parallel.

Before explaining the clustering, we clarify the membership of persistent classes to data

components. A persistent class represents the data structure from its corresponding data-

base entity, but a persistent class can exist and be deployed independently. This means that

a persistent class and its corresponding database entity can be located in different data

components. A persistent class can access the data via a defined interface. Consequently,

persistent classes representing entities from the same database do not necessarily belong

to the same data component, i.e., the same cluster.

4.2.1 Data Component Clustering

Resulting from the reverse engineering we have the logical schemas and the redesigned

conceptual schemas, including the mapping from the logical schemas to the conceptual

schemas. To facilitate (re)design we build data component clusters, which are classified

in the next step.

DATA COMPONENT EXTENSION

104

These clusters enable different views on the system. Entity and schema comparison along

the relationships, especially the redundancy dependencies, is possible. For example,

strongly related entities of different schemas can be grouped to sustain schema integra-

tion. Further the classification of clusters is simpler and faster than the classification of

each data component by itself. Finally, clustering helps to integrate new data components

and thus identifying interfaces and connectors.

The advantages and benefits of clustering are primarily the semi-automatic aspect of the

task and the endorsement of our iterative reengineering approach, i.e., repetitions of the

same clustering tasks are allowed and wanted. Further, the clustering can be used for un-

derstanding by constructing views.

We use the BUNCH tool [MMCG99] to compute clusters. Therefore we will shortly intro-

duce the clustering approach used in BUNCH. The basic assumption is that the modules

and dependencies are mapped to a Module Dependency Graph (MDG):

„ MDG = ( M , R ) is a graph where M is the set of named modules of a software

system, and R⊆M×M is the set of ordered pairs 〈u,v〉 that represent the source-

level dependencies (e.g., procedural invocation, variable access) between modules u

and v of the same system. “ [MMCG99]

In our case M is the set of named entities or data components (modules) and such

〈u,v〉∈R are pairs of entities or data components that are related. Further in BUNCH it

is possible to attribute a weight, in form of an integer, to each pair. The clustering problem

is to find a „good partition“ of the MDG. A „good partition“ is the decomposition of all

modules into disjoint clusters, where (highly) interdependent modules are grouped into

clusters and in opposite (highly) independent modules are assigned to different clusters.

Based on our experience and the observations in [Lun98] difficulties and limitations of

clustering activities can occur because:

• Reverse engineering does not generate precise and complete information, i.e., not

all relationships are considered or identified

• Occurrence of omnipresent modules [MOTU93] or data components, i.e., data com-

ponents that have much more connections than other data components and which

seem not to belong to any cluster

• Some data components perform a specific task and are comparatively small to other

data components

• The existing partitioning was not well done by the designer(s)

• Inappropriate choice of clustering techniques

These limitations coincide with the limitations that were detected in a first version of

BUNCH and are mainly resolved in the BUNCH version we use [MMCG99]. Two lists of

omnipresent modules, one for clients and one for suppliers, can be specified either man-

ually or automatically. These client and supplier modules are assigned to two separate

DATA COMPONENT CLUSTERING AND CLASSIFICATION

105

DATA COMPONENT CLUSTERING AND CLASSIFICATION

clusters. Further, user-specified clusters enable the handling of predefined (by domain-

knowledge) or existing clusters. Optionally the locking of these clusters can be enabled

and the addition of modules to the user-specified clusters is possible. To handle the inte-

gration of new modules or when modules undergo structural changes the orphan adoption

technique [TH97] was introduced to BUNCH. Scalability and performance problems are

minimised because of the predefined clusters, e.g., conceptual schemas, and algorithms

for sub-optimal results, i.e., hill-climbing and genetic algorithms. For more details we re-

fer to [MMR+98, MMCG99].

Figure 4.3 shows an excerpt of the palliative care conceptual schema, which contains 66

classes and about 100 relationships. The different relationships are marked by stereotypes,

except the intra-schema associations.

Figure 4.4 and Figure 4.5 depict clustering results. The clusters, which are or can be seen

as subsystems, are represented with UML package diagrams. This corresponds to the

UML notation of subsystems as suggested in [HC01]. Each cluster is visualised by one

package; such a package can contain a class diagram or other packages. We do not visu-

alise relationships between elements of different packages for clarity reasons. Such rela-

tionships are bundled to one relation between packages. Details of the relationships are

shown as a list.

Figure 4.4 shows the clustering of the conceptual schema corresponding to the physical

schemas. The conceptual schema is divided into the three database schemas bvmtdata,

hopsdata and outcomes_be. Relationships between packages, i.e., the inter-schema rela-

tionships, are bundled and the details available as a list. In this sample this is the case for

Figure 4.3: Sample of the palliative care conceptual schema

class

association

Bereavement

Reason

Patient Drug Profile

Deceased

tblDose

Drugs

<<copy>>

bereavement

reason

deathdate

reason

bereavement

profile

drugdose

<<replica>>

drugs

tblOutcomes

outcomes

<<inter-schema>>

0..n 0..n Bereavement

Reason

0..n

0..1

0..n

0..1

0..n 0..n

0..n

0..1

0..10..1 0..n

0..n

DATA COMPONENT EXTENSION

106

the duplications bereavement reason and deathdate, the replication drugdose and the in-

ter-schema association outcomes.

Bereavement

Reason

Deceased

tblDose

Figure 4.4: Clustered palliative care conceptual schema: sample 1

tblOutcomes

bvmtdata hospdata

outcomes_be

class association

package bundled

associations

outcomes ...

drugdose ...

bereavement

reason ...

deathdate ...

Bereavement

Patient Drug Profile

Drugs

reason

bereavement

profile

drugs

Bereavement

Reason

0..n

0..1

0..n

0..1 0..n

0..n

Figure 4.5: Clustered palliative care conceptual schema: sample 2

class association

package bundled

associations

bereavement reason

patient

drug profile

bereavement ...

profile ...

Bereavement

Reason

<<copy>>

bereavement

reason

0..n 0..n

0..n

0..1

Bereavement

Reason

Patient

Deceased <<copy>>

deathdate

tblOutcomes

outcomes

<<inter-schema>>

0..1

0..n

0..1

0..1 Drug Profile

tblDose

Drugs

drugdose

drugs

0..n

DATA COMPONENT CLUSTERING AND CLASSIFICATION

107

DATA COMPONENT CLUSTERING AND CLASSIFICATION

Figure 4.5 shows the clustering of classes related by redundancy dependencies. We attrib-

uted a weight of ’10’ for every redundancy dependency and a weight of ’1’ for all other

relationships. In Figure 4.5 we can see that classes participating on a redundancy depen-

dency, i.e., the redundancy dependency pairs (Bereavement Reason, Bereavement Rea-

son), (Patient, Deceased) and (Drug Profile, tblDose), are located in the same packages.

Changing the clustering parameters does not affect the result except for one class: Be-

reavement is either located in the bereavement reason package or the patient package.

Omitting the weights for the relationships has only a limited impact, although the redun-

dancy dependency pairs are not clustered in the same package each time. The reason for

this limited impact is the size of the chosen sample which does not allow much variations.

The result of the clustering is the partitioning of the web information system into data

components.

4.2.2 Clustering Strategies

For data component extension, we emphasize the integration of existing data components

with each other or the integration of existing and new data components. To determine the

way of clustering. i.e., the clustering strategy, we need to determine the integration strat-

egy. To clarify the integration scenario we assume a three tier architecture. The schema

tier describes the information data structure (schemas) of our web information system.

The application tier contains all the functionality to manipulate the information stored in

the schemas. The graphical user interface (GUI) tier represents the interface to the real

world, i.e., the users.

Schema versus Application Integration

Integration of web information systems can be done in various ways. To discuss the inte-

gration strategies we consider the two following extreme strategies: (1) schema integra-

tion which provides a consistent virtual schema located between the schema and

application tier containing the integrated schemas; (2) application integration which re-

sults in an additional virtual application tier in-between the application and GUI tier.

In Figure 4.6 the schema integration is presented. The different application specific sche-

mas are integrated into an overall virtual schema (grey). The schemas can be merged into

a single (virtual) schema. A virtual schema can be a physical virtual schema which is

served by a single homogenous distributed database management system. A virtual sche-

ma can also be realised by code with a schema access layer. System evolution in form of

an additional application would result in (optionally) a new schema, an updated virtual

schema, a new application and (optionally) a new GUI.

The application integration scenario in contrast would integrate an additional application

without modifying the different schemas, cf. Figure 4.7. Instead, the virtual application

tier (grey) is used to coordinate the applications as required. This additional application

tier does further permit to use different heterogeneous and physically distributed database

DATA COMPONENT EXTENSION

108

management systems when support for distributed transaction processing standards

[XA94] is present. For example this is the case when extending the system by a new ap-

plication with its own new schema and GUI, cf. dashed parts in Figure 4.7. The required

coordination however has to be realized by code in the virtual application tier. The identi-

fied application integration strategy is strongly related to enterprise application integra-

tion [Lin99] approaches, but we assume that the application composition does not involve

event processing as realised either by polling on a shared database or using specific appli-

cation APIs.

The application integration approach requires the realisation of coordination activities. A

suitable solution is to employ available middleware technology. For the schema integra-

tion approach the same technology can be applied when the virtual schema tier is realised

by code, still, also database technology like views can be used. Depending on whether a

code or strict database solution has been chosen, the flexibility of the virtual tier permits

variations to a great extent. A more detailed comparison of both integration strategies can

be found in [GW01].

Clustering Strategy Decision

In addition to the integration strategy, the reengineer has to decided which aspect is more

important for the integration and consequently for the clustering. We distinguish between

two aspects. Firstly, the aspect of data similarity preponderates, i.e., the integration is con-

application

Figure 4.6: Schema Integration

GUI

schema

virtual schema

application

schema

data flow

persistent tier

application tier

presentation tier

black

grey

dashed

existing parts

integrating parts

extending parts

Figure 4.7: Application Integration

virtual application

application

GUI

schema

application

schema

application

data flow

persistent tier

application tier

presentation tier

black

grey

dashed

existing parts

integrating parts

extending parts

DATA COMPONENT CLUSTERING AND CLASSIFICATION

109

DATA COMPONENT CLUSTERING AND CLASSIFICATION

centrated around data that contains similar information. Secondly, the aspect of functional

similarity preponderates, i.e., the integration is concentrated around application function-

ality that perform similar data manipulation tasks.

Based on these criteria, Table 4.1 shows which relationships have to be highly weighted

for the clustering.

Regarding schema integration and the data similarity aspect, two relationship kinds give

the best hints. Entities that have similar primary keys generally contains similar informa-

tion. Thus clustering classes together, where the corresponding entities have similar pri-

mary keys, enables schema integration regarding the data similarity aspect. This also

holds for classes related by redundancy dependencies because the redundancy dependen-

cies express that similar data is stored in the participating classes (entities).

In case of schema integration and the functionality similarity aspect, several relationship

kinds have to be considered. Since the preponderate aspect is functionality, relationships

that relate classes by means of data manipulation purposes are relevant. In our case this

includes replication, constraints, usage and inter-schema relationships. Replication is

done for purpose of data access efficiency or data safety. The other three relationships

form the basis for data manipulation, accessing data (usage), relating data (inter-schema)

and controlling data (constraints).

Clustering for application integration is harder. Regarding the data similarity aspect, the

first step is the clustering of persistent classes that contain similar information. Then from

these predefined clusters, the transient classes related via usage relationships are clus-

tered. Thus, the transient classes manipulating similar information are clustered together.

Regarding application integration and the functional aspect, two scenarios exist. Firstly,

like for schema integration, relationships that relate classes by means of data manipulation

(except data access) purposes are relevant, i.e., replication, constraints and inter-schema

relationships. After clustering the persistent classes in analogy with the schema integra-

tion case, the transient classes related via usage relationships are clustered considering the

predefined clusters. Secondly, we start with the clustering according to the physical sche-

Schema integration Application integration

data

similarity

similar primary keys

redundancy dependencies

usage relationships preceded by

similar primary keys and/or

redundancy dependencies

functional

similarity

usage relationships

inter-schema relationships

replications

constraints

usage relationships preceded by

replications, constraints and

inter-schema relationships

usage relationships preceded by

intra-schema relationships

Table 4.1: Relationship Weighting for Clustering

DATA COMPONENT EXTENSION

110

mas (intra-schema relationships). Based on these predefined clusters, the transient classes

are clustered together that manipulate the same physical schema (via usage relationships).

Based on the clustering startegy, two ways of clustering are possible in our case: from

coarse-grained clusters to fine-grained clusters and vice versa.

In the first case we start with a general clustering to find a coarse-grained partition of en-

tities and data components. Next these coarse-grained clusters are then partitioned in

smaller clusters by weighting the dependency types.

In the second case the fine-grained clustering by weighting the dependency types takes

place first. Then the resulting clusters themselves are clustered by weighting the number

of dependencies between the clusters. This weighting can be refined by a combined cal-

culated value of the number of dependencies and the weight of dependency types. In both

cases predefined clusters, e.g., conceptual schemas, have to be respected.

Which way of clustering is best for data component integration in web information sys-

tems depends on the reengineering settings. In case of multiple (small) systems that have

to be integrated into a web information system a first coarse-grained clustering is certainly

meaningful. This gives the reengineer an overview about how the existing and new data

components are related. This holds especially when each involved conceptual schema

stems from exactly one legacy system.

The opposite case is when few (bigger) systems, that should remain independent as far as

possible, have to be integrated. The most data components will remain as they are and

only for the new data components a fine-grained clustering is needed. Next, these new

fine-grained clusters can be clustered together with the existing ones from the legacy sys-

tems. The final decision of how the clustering takes place is therefore the reengineers’

task, depending on his experience and the settings. Note that in some cases both ways may

be tried, followed by a partition comparison.

4.2.3 Data Component Classification

In Chapter 3 we introduced three data component kinds: corporate data components, data

mediation components and ubiquitous data components. In this thesis every data compo-

nent is assigned to exactly one of these three kinds.

Once the clusters are determined, the data components can be classified according to the

three data component kinds. In case that a cluster contains more than one data component,

all these data components normally are of the same kind. Therefore, when a data compo-

nent kind is assigned to a cluster, this means that this kind is assigned to all data compo-

nents in this cluster. In case that a data component in a cluster has a different kind than the

cluster kind, we assigned this different kind to the data component in question. Such an

exception is a cluster of omnipresent data components.

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

111

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

A data component (cluster) should be classified as “corporate” if it is part of mediation

within the same organisation. From a user perspective such components (clusters) look

like a single system. Corporate middleware from one vendor is generally used for inte-

grating such data components. In general, a data component cluster that incorporates a

data source will be classified as “corporate” or „ubiquitous“. Indeed, the storage of infor-

mation mainly takes place within an organisation or a mobile device. Note that the data

interfaces and portals are classified accordingly.

Omnipresent data components, i.e., a data component that is connected to many other data

components, should be avoided. This situation corresponds more to a monolithic system

than to a web information system. Still, within a cluster, i.e., corporate or ubiquitous data

component, such central data components can exist. This can be seen as „local omnipres-

ence“ which is resolved locally depending on the technology employed in the organisation

or mobile device in question. The reengineer should distribute the data components to ex-

isting clusters or created explicitly a cluster with them. Nevertheless, such omnipresent

data components can be interesting for reuse purposes.

Data component clusters that do not contain a data source and are not omnipresent, i.e.,

remain, should be classified as „mediation“. Indeed, these remaining clusters generally

exchange or preprocess (format) data. They connect largely autonomous data compo-

nents, i.e., they typically handle data that is replicated, duplicated and synchronised ac-

cording to certain interoperability policies.

During the classification inconsistencies may emerge that require a cluster reevaluation

or even a data component redesign. For this reason iteration support is needed. In a sim-

plified way two iteration loops exist. Firstly, iteration that goes back to the clustering and

redesign. Entities, relationships or data components may be changed (redesigned) or re-

located in another cluster. Further, the clustering may be redone or reevaluated, e.g., by

applying the orphan adoption technique. Secondly, a wider iteration can even go back to

the reverse engineering. The integration steps or problems during integration may suggest

further and deeper investigations of the underlying legacy systems.

4.3 Architectural Patterns for Data Mediation

As stated before we propose architectural patterns to (re)structure the web information

system. These architectural patterns can be implemented by instantiating design patterns.

Note that we do not restrict the design patterns those proposed by [GHJV95].

Traditional architectures for web information systems are based on the procedure-call par-

adigm, i.e., clients call service operations on server objects. This traditional development

paradigm implies that the client programmer knows about the servers at the development

time of the client software. This is disadvantageous in case of rapidly evolving architec-

tures. Therefore, software engineers have recently started to migrate to a new paradigm

called component-oriented software development.

DATA COMPONENT EXTENSION

112

The component-oriented software development paradigm promotes connection-based

programming. This means that components are defined with well-defined interfaces in

partial ignorance of each other. The actual connections among components are instantiat-

ed and deployed later. In other words, connections are now treated as “active” first-order

citizens in distributed architectures [Szy99]. This supports networked evolution because

it facilitates adding and removing connections with little changes to the components in a

system. The patterns we propose are component-based and promote connection-based

programming for web information systems.

In general, architectural patterns define the responsibility of typical parts of a system. Fur-

thermore, they provide rules and guidelines for relationships between those components.

Architectural patterns express and separate the concerns of fundamental structures in soft-

ware systems [BMR+96].

We present four architectural patterns for data mediation:

• Data Portal

•Data Fusion

• Data Transducer

• Data Connection

Figure 4.8 depicts their relationships. Distributed information management (corporate or

ubiquitous data) components are interfaced by means of the Data Portal pattern. The Data

Portal is just one of three general Mediator Component patterns that can be connected

through instances of the Data Connection pattern. The Data Fusion pattern is used to

merge information from separate sources. Finally, the Data Transducer translates data

into a different structure. This pattern is used for mediating among different data repre-

sentations, as well as for rendering data for presentation to human clients (e.g., in web

browsers). We describe these four architectural patterns in the next sections.

connection

specialisation

Figure 4.8: Architectural patterns: overview

Data

Connection

Mediator

Component

Data

Transducer

Data

Fusion Data

Portal

connection A

connection B

0..n

0..n 1

input

2..n

architectural

pattern

generalisation

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

113

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

4.3.1 Architectural Pattern: Data Portal

Name

Data Portal

Intent

A Data Portal is an interface of web information subsystems to the world. Its purpose is

to make selected parts of the data (maintained within a subsystem) accessible for autho-

rised external clients (services or users). We distinguish between two different versions of

this pattern: Export Data Portal, which is responsible for making internal data available

outside the subsystem; and Import Data Portal, which is used to import information from

external sources into the subsystem data source(s). In practical applications, a clear dis-

tinction between these two patterns might not always be possible. The combination of ex-

porting internal data and importing external data is an Export & Import Data Portal.

Motivation

Three main motivations exist for the Data Portal, namely resolving heterogeneity, provid-

ing data safety, and increasing data availability. Heterogeneity reflects on the fact that dif-

ferent web information system participants utilise various heterogeneous platforms and

technologies for their data repositories. Interoperability between these different partici-

pants requires that this heterogeneity is resolved. The Data Portal serves this purpose by

exploiting interoperability standards.

The requirement for data safety stems from the fact that, in many cases, external clients

should not have access to all the data stored in internal databases (e.g., for protecting in-

tellectual property, personal or sensitive information, etc). The Data Portal provides a lev-

el of isolation between the external schema (as it is perceived by the external client), and

the source schemas of the internal databases in question.

Finally, the Data Portal facilitates fast access to a unified view of internal data structures.

This is needed because data is often distributed among various different transactional and

analytical repositories within organisations. The Data Portal serves as a façade for these

bvmtdata outcomes_be

hospdata

outcomes service

bvmt

admin

bvmtdata outcomes_be

hospdata

outcomes service

bvmt

admin

Data

Portal

data source

clients

data flow

data portal

Figure 4.9: Data Portal Pattern use

DATA COMPONENT EXTENSION

114

various data sources, in which the external schema is a buffered view of the collection of

schemas of all the involved data sources.

The Data Portal provides more than only a unified interface, it also handles safety and het-

erogeneity aspects. An example for the use of a data portal is given in Figure 4.9. The pal-

liative care data sources are encapsulated for its accessing clients.

Applicability

The Data Portal pattern can be used whenever a subsystem needs to make available part

of its data to participants outside its intranet. The Export Data Portal specialises in allow-

ing external participants to query the internal data sources, while the Import Data Portal

is used to update the internal data sources with data from the outside of the subsystem.

Structure

Figure 4.10 shows the structure of the Data Portal pattern.

Participants

Figure 4.11 shows the participants of the Data Portal pattern.

• source schemas

are the schemas of the different data sources of the web information subsystem.

• external schema

is the schema of the subsystem data as seen by the external clients.

• mapping function

is a function that converts the source schemas (each conforming to the different data

sources) to the external schema.

Collaborations

Each of the internal data sources is described with its own schema (a source schema).

An external client, however, is not expected to be able to see any of these source schemas.

Figure 4.10: Structure of the Data Portal Pattern

connection A

connection B

Export

Data

Portal

Import

Data

Portal

Import

Export

Data

Portal

connection

clients

data portal

generalisation

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

115

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

Instead, the client has access only to the “view” that the subsystem allows. This view is

described using the external schema. The mapping function is responsible for resolving a

request from the client (based on the external schema) into a request to the internal data-

bases (described by the source schemas).

Consequences

There is a single entry point to external access and updates of the internal subsystem’s da-

tabases. From the point of view of the (external) client, there is a single schema that cor-

responds to a single data source. The client does not have to worry about the different data

sources, their types (whether they are ODBC, file-based, or other) nor how to query or up-

date them individually. The subsystem, on the other hand, can filter and block access to

sensitive information in the source databases. A disadvantage is that the complexity of

building this access layer is significantly higher than using direct access, e.g., by using

JDBC.

Example (Case Study)

Figure 4.12 shows a sample instantiation of the pattern. Each of the three databases of the

palliative care hospice has one (logical) source schema. These schemas are mapped to a

(conceptional) external export schema and a (conceptional) external import schema.

These schemas are the basis for the Export Data Portal and Import Data Portal. An addi-

tional Import&Export Data Portal is built for clients that receive and pass data to the data-

bases. The used Data Connection pattern is explained in Section 4.3.4.

Figure 4.11: Participant View of the Data Portal Pattern

data

source I

data

source II

data

source III

data

source IV

source

schema III

source

schema II

source

schema I

source

schema IV

mapping

function

external

schema

data source

schema

function

correspon-

dance

information

flow

Data

Portal

data

portal

DATA COMPONENT EXTENSION

116

Related patterns

Facade: Provides a unified interface to a set of interfaces in a subsystem. Facade defines

a higher-level interface that makes the subsystem easier to use [GHJV95].

Patterns similar to the Facade pattern found in [Ris00] are: Facade [Coc96], Whole-Part

[BMR+96], Wrapper Facade [Sch99], Abstract Database Interface [ABM96], Shared re-

pository and (Legacy) Wrapper [Mul95].

Data Abstraction and Object-Oriented Organization: In the architectural style based on

data abstraction and object-oriented organisation, data representations and their associat-

ed primitive operations are encapsulated in an abstract data type or object [SG96].

4.3.2 Architectural Pattern: Data Fusion

Name

Data Fusion

Intent

To combine the data received from two or more data sources, into a single, unified data

source.

Figure 4.12: Data Portal Pattern sample

hospdata

outcomes_be

bvmtdata bvmtdata

schema

outcomes_be

schema

hospdata

schema

mapping

function

external

export

schema

data source

schema

function

correspondance

data flow

mapping

function

external

import

schema

clients

connection

Export

Data

Portal

Import

Data

Portal

Import

Export

Data

Portal

data portal

Data

Connection

Data

Connection

data connection

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

117

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

Motivation

One of the goals of the semantic Web (as defined by the W3C) is the availability of a va-

riety of data sources. These data sources will be mined by intelligent agents, which will

understand their schemas, manipulate and combine their data, and then present it to the

client as a unified data source. The client does not need to deal with these complexities,

and instead, can assume a unique data source. Figure 4.13 shows a sample for the use of

the Data Fusion Pattern. Data is collected from two data sources (:Source1 and :Source2).

Then it is merged and sent to the :Client.

Applicability

A web information system often combines and aggregates data from several sources into

a common data stream.

Structure

Figure 4.14 shows the structure of the Data Fusion pattern.

Figure 4.13: Data Fusion Pattern use

:Source2 :Data Fusion :Client

collect()

merge()

send()

data

:Source1

objects

return

message

activation

lifeline

Figure 4.14: Structure of the Data Fusion Pattern

connection A

input

Collector

Merger

2..n

packages

connection stream

Data Fusion Server Data

Connection

Client Data

Connection

data

fusion

data

connection

data fusion components

DATA COMPONENT EXTENSION

118

Participants

•Server Data Connections

A Data Connection is created to connect the Data Fusion to each of the servers.

• Client Data Connection

The Data Connection to the client.

• Collector

Queries the different servers and receives their results.

• Merger

The Merger is responsible for merging the results from the different data sources.

Collaborations

The instance of the Data Fusion pattern creates Server Data Connections to each of the

servers. When the instance of the pattern receives a query, it translates it into a sequence

of queries, each intended for a different server. The Collector is then responsible for exe-

cuting these queries (using a Data Connection to the corresponding server). The Collector

receives the results and passes them to the Merger, who proceeds to combine them. The

resulting data is then sent to the client using the Client Data Connection.

Consequences

The availability of data, e.g. in XML, from different Mediator Components requires the

existence of Data Fusion instances that merge these sources into a unified data source. A

client application that uses a Data Fusion pattern does not need to worry about the com-

plexities of accessing and merging multiple data sources.

Example (Case Study)

The medical diagnosis expert system needs data from the knowledge base as well as from

the palliative care hospice, cf. Chapter 2. Based on the patients data, e.g., symptoms and /

Figure 4.15: Data Fusion Pattern sample

input

Server Data

Connection

Server Data

Connection

Data

Fusion Client Data

Connection

Knowledge

Base

Data Portal

Palliative

Care

Data Portal

data

portal

connection

data

connection

data

fusion

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

119

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

or checkup results, the knowledge base is queried for diagnostic suggestions. Figure 4.15

shows that the Data Fusion receives data from both Export Data Portals via the Server

Data Connections. The merged data, i.e., the diagnostic suggestions are then passed to the

diagnosis GUI via the Client Data Connection.

Related patterns

Glue: Join a number of (multimedia) artefacts into a single composite artefact [CL00].

4.3.3 Architectural Pattern: Data Transducer

Name

Data Transducer

Intent

Convert data of a given source format into data of a different target format.

Motivation

There are cases where the client might not be able to interpret the data in its original for-

mat. In that case, the Data Transducer pattern converts the source data into a target format

that the client expects. Figure 4.16 shows a sample for the use of the Data Transducer Pat-

tern. Data sent from a data :Source is converted into the target format and sent to the :Cli-

ent.

Applicability

Whenever an application requires data in a different format than the one that the data

source provides, the Data Transducer pattern can be used. A client application might be

designed with a given format in mind, but the server might provide data in a different for-

mat. Transducing the data from one format to another will allow both applications to in-

teract without changes in either one of them. For example, the client expects data in an

XML document with a different XMLSchema definition or DTD than the source produc-

es.

Figure 4.16: Data Transducer Pattern use

:Source :Data Transducer :Client

convert()

send()

objects

return

message

activation

lifeline

DATA COMPONENT EXTENSION

120

Structure

Figure 4.17 depicts the structure of the Data Transducer pattern.

Participants

• Source Data Connection

A Data Connection to the source of the data.

• Target Data Connection

A Data Connection to the consumer of the data.

• Source Format

The format of the original source data.

• Target Format

The desired format for the resulting data.

• Transducing Function

A function that converts the data from the source format into the target format.

Collaborations

The producer of data is connected to the Data Transducer using a Source Data Connec-

tion; similarly, the consumer of the data is connected to the Data Transducer using a Tar-

get Data Connection. The Source Data Connection pulls the data to be transduced. This

data conforms to the source format and it is used as input to the transducing function. The

output data, which conforms to the target format, is then fed to the Target Data Connec-

tion.

Consequences

This pattern allows the interaction of two applications that are designed with different

schemas, i.e, that exchange data in different formats, to interoperate. The two applications

do not need to be aware that the exchange format of the other is different. The disadvan-

tage of this pattern is that the transduction could lose information because the target sche-

ma might not be able to convey all the data of the source schema.

Example (Case Study)

The example of Figure 4.18 shows a Data Transducer acting as renderer. The patient data

from the Palliative Care Export Data Portal is rendered into a PDA compatible format.

connection

data

connection

data

transducer

Figure 4.17: Structure of the Data Transducer Pattern

connection B connection A

Data

Transducer

Source Data

Connection

Target Data

Connection

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

121

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

Related patterns

Adapter: Converts the interface of a class into another interface clients expect. Adapter

lets classes work together that could not otherwise because of incompatible interfaces

[GHJV95].

4.3.4 Architectural Pattern: Data Connection

Name

Data Connection

Intent

Proactive service for pulling data of interest from a given data source and pushing this

data to a given data sink.

Motivation

Traditionally, data exchange is controlled by either the client or the server of a web infor-

mation system. This concept limits scalability and flexibility for rapidly evolving distrib-

uted architectures of web information systems. Therefore, engineers have begun to treat

(data) connections as first-order citizens in their architectural designs. This trend is merely

reflecting on a general trend in current software engineering practice from the traditional

call-procedure paradigm to the new paradigm of connection-based programming [Szy99].

For the application domain of engineering web information systems, a Data Connection

is an instance that controls the exchange of electronic data among several components. A

Data Connection is not a placeholder but a system component. Figure 4.19 shows two

samples for the use of the Data Connection Pattern. On the left side we have a static link

to the :Source and a dynamic link to the :Client. The Connection Policy is an Updater, i.e.,

the :Source updates the :Client. The :DataConnection stores the data until the :Client con-

nects to it and an update can be performed. On the right side both links are static, but we

Palliative Care

Import &

Export Data

Portal

Figure 4.18: Data Transducer Pattern sample

PDA Import

Data Portal

Source Data

Connection

Data

Transducer

connection

Target Data

Connection

data

portal

data

connection

data

transducer

DATA COMPONENT EXTENSION

122

have a Synchroniser Connection Policy. The :DataConnection collects the data from

:Source and :Client, syncrhronises it and updates the :Source and the :Client.

Applicability

Whenever we need to constantly retrieve data from or send data to a mediator component.

Structure

Figure 4.20 shows the structure of the pattern.

Participants

•server connection and client connection

These connections are created as instance of the Data Connection pattern to the ser-

ver and the client, and can be either static or dynamic links. A Static Link assumes a

reliable connection between the client and the server, while a Dynamic Link can

handle a change in the Internet Protocol (IP) address of the client and temporal dis-

connection between client and server (both situations are common in mobile

devices).

• Connection Policy

Indicates the properties of the data connection. An Updater (data connection) is only

concerned with handling continuous updates to the client (stock market prices, news

updates), while the Synchroniser assumes that the client has a copy of the data and

needs to synchronise it with the master copy in the server.

• Server

The Mediator Component that exports data, i.e., to which the client wants to con-

nect.

•Client

The Mediator Component that imports data, i.e., that needs to connect to a server.

Figure 4.19: Data Connection Pattern use

update()

connect()

check()

update()

:Source :Data Connection :Client

update()

:Source :Data Connection :Client

collect()

data

synchronise()

update()

objects

return

message

activation

lifeline

collect()

data

update()

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

123

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

Collaborations

When instantiated, the Data Connection pattern creates communication links to both, the

client and the server. The instance of the Data Connection pattern, according to its con-

nection policy, receives requests from the client and converts them into requests to the

server. The reply from the server is then translated and sent to the client. The Data Con-

nection is responsible for the issues related to the connection to the component, such as

authentication, encryption, roaming, temporal disconnection, etc.

Consequences

The main benefit of connection-based programming is late binding of Mediator Compo-

nents. This means that it becomes possible to develop Mediator Components with well-

defined interfaces more-or-less in isolation from each other, and flexibly connect them at

a later point in time. The Data Connection pattern is also responsible for handling the

complexities of the communication with the Mediator Component, allowing the client to

be unaware of them.

Example (Case Study)

Figure 4.12 depicts two static Data Connections with an Updater policy. A dynamic Data

Connection with Synchroniser policy is the Target Data Connection of Figure 4.18.

Figure 4.20: Structure of the Data Connection Pattern

connection specialisation

Connection

Policy

Updater

Synchoniser

Link

Static

Link

Dynamic

Link

B: Link

A: Link

server

connection

instance of

Data Connection

data connection data connection composants

client

connection

DATA COMPONENT EXTENSION

124

Related patterns

Broker: Produces distributed software systems with decoupled components that interact

by remote service invocations. A broker component is responsible for coordinating com-

munication, such as forward requests, as well as for transmitting results and exceptions

[BMR+96], see also Broker [Mul95].

Connector: Decouple service initialisation from the services provided [Sch97], see also

Client-Dispatcher-Server (p. 34) and Forwarder/Receiver [BMR+96].

Data Filter Architecture Pattern: Filters the contents of client requests in a distributed

system, according to predefined policies. Filtering can occur locally or remotely [FF99].

Mediator: Define an object that encapsulates how a set of objects interact. Mediator pro-

motes loose coupling by keeping objects from referring to each other explicitly, and it lets

you vary their interaction independently [GHJV95].

Pipes and Filters: Provides filter components which encapsulate processing steps for data

streams. The pipes pass through the data between adjacent filters [Meu95, BMR+96,

SG96].

Proxy: Provide a surrogate or placeholder for another object to control access to it

[GHJV95].

4.3.5 Architectural Pattern Application Examples

Figure 4.21 shows three prototypes in the Health Care web information system where ar-

chitectural patterns are applied. The overall architecture integrates the three organisation-

al sites, namely the Palliative Care Center Victoria, the Medical Diagnosis ES (expert

system) and Information Management System and the ubiquitous site Physician PDA.

Further a Physician Browser and a Patient Browser are integrated. Different Data Portals

per site hide the complexity and heterogeneity.

The Physician PDA has the purpose of viewing and collecting data from clinical practice.

This mediation is realised by the prototype 1 by means of two Data Connections and a

Data Transducer pattern for rendering clinical data on the PDA. The link from the Target

Data Connection to the PDA Import & Export Data Portal is a dynamic link and the con-

nection policy is a Synchronizer since the PDA can be used off-line (see Section 4.3.4 and

Figure 4.18).

Furthermore, Figure 4.21 shows that the Data Fusion pattern is used for combining the

data sets from the different organisational sites. In cases where data sets are not structur-

ally compliant to the chosen web information systems format, a Data Transducer pattern

is instantiated to provide the structural translation, cf. data stemming from the Information

Management System. All instances of Data Connection deployed between the organisa-

tional sites enact an Updater policy, because information is communicated only in one di-

rection. Data Transducers render the data for the browsers.

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

125

ARCHITECTURAL PATTERNS FOR DATA MEDIATION

Prototype 2 enables the patients to access their data via a browser (Patient Browser). The

prototype 3 which consists of all architectural patterns that are not of the two other proto-

types, permits data access for the physicians. Note that the input from the Physician

Browser is transduced before it is passed to the organisational sites.

Translator

Renderer

Translator

Figure 4.21: Architectural pattern application examples

Palliative Care Center Victoria

Medical Diagnosis ES

Information

Management

System

prototype 2

bvmtdata

outcomes_be

hospdata

Physician

Browser

Knowledge Base

DSD_DB

Physician PDA

Patient

Browser

Import &

Export

data

source

browser dynamic connection

data flow static connection

corporate or ubiquitous

system parts

schemas & mappings

Target/

Server

Target

Source

Target

Source

Target

Client/

Source

Client/

Source

Server

Server Server

Server

Target/

Server

Source

Client/

Source

Target

Renderer Renderer

Translator

Target

Renderer

Export

Import

Import &

Export

Import &

Export

data

connection

data

portal

data

transducer

data

fusion

prototype 1

DATA COMPONENT EXTENSION

126

4.4 Access Layer Generation and Model Execution

During the extension, i.e., clustering, classifying and restructuring, the retrieved corre-

spondences to the databases are maintained through the schema mappings. For new sche-

mas these correspondences are established. Thus, every element of a conceptual schema

has a correspondent in the logical schema and consequently in the database. This enables

us to generate a transactional access layer to support open nested transactions on the ex-

tended web information system.

Further, we validate the extensions by implementing prototypes. For prototyping, we

model only system (data component) parts. The focus is to provide code for the manipu-

lation and exchange of data. The mediation layer provides transition between persistent

and transient classes, i.e., usage relationships, and the manipulation of data inside the tran-

sient classes. The publishing layer reads and writes XML documents that contain the data

to exchange.

Figure 4.22 depicts the layers along with their connections, the data and the users. The

data access is done via the transactional access layer where the data portal pattern is im-

plemented. Implemented data fusion and/or data transducer patterns occur in the media-

tion layer. The mediation layer has connections to publishing layers, which are interfaces

between two mediation layers or between a mediation layer and a GUI.

4.4.1 Transactional Access Layer

The complexity of building an access layer is significantly higher than using direct access,

e.g., by using JDBC. Nevertheless the retrieved conceptual schemas (cf. Chapter 3) rep-

resent an object-oriented access layer to the data of the web information system. The re-

engineer does not have to know the logical or even physical schemas nor which specific

databases are involved. An access layer can be generated from the conceptual schemas

and enables database independent access to the data.

Figure 4.23 shows an overview of the access layer generator. After a conceptual schema

is retrieved from the data sources, the generator uses the retrieved information to gener-

ate an access layer. The logical schemas, schema mappings and conceptual schema are

database

mediation

layer

data flow

persistent tier

application tier

presentation tier

transactional

access layer

publishing

layer

Figure 4.22: Modelled and generated layers

user

GUI

ACCESS LAYER GENERATION AND MODEL EXECUTION

127

ACCESS LAYER GENERATION AND MODEL EXECUTION

used to assign each generated conceptual schema element to the corresponding logical

schema element. The configuration enables the explicit allocation of the logical schema

elements to the databases internal physical schema elements via JDBC. The generated ac-

cess layer can then access the data.

ACID transaction

Further, the access layer has to support transactions in order to allow distributed and con-

current access. Grand [Gra99] presented four transaction related patterns. Figure 4.24

shows the four patterns which are a solid basis for transaction management. The ACID

transaction pattern enables the compliance of the ACID (atomicity, consistency, isolation,

durability) properties of a transaction. The nesting of transactions is done within the com-

posite transaction pattern. The two phase commit pattern enables the atomicity for the

composite transaction pattern. History management for ACID transactions is done by the

audit trail pattern.

We only explain the ACID transaction pattern that directly accesses the data and omit a

detailed description of the three other patterns. The ACID transaction pattern has to be re-

alised in accordance with the generated access layer, i.e. the schemas and mapping. Such,

Figure 4.23: Access layer generator overview

databases

data sources logical

schemas

logical

schemas

logical

schemas

logical

schemas

logical

schemas

logical

schemas

logical

schemas

conceptual

schema

configuration

generator

access

layer information flow

data access

schema

mappings

Figure 4.24: ACID transaction related patterns [Gra99]

logical

schemas

ACID

transaction

logical

schemas

composite

transaction

logical

schemas

audit

trail

logical

schemas

two phase

commit

uses

pattern

DATA COMPONENT EXTENSION

128

the generation of the transactional access layer consist of two intertwined tasks: generat-

ing the (data) access layer and implementing the ACID transaction pattern.

We introduce an intermediate schema that is an object-oriented counterpart of the logical

schema. A logical schema represents the physical schema and thus the data structure of

the database. The intermediate schema, i.e., the object-oriented part of this logical sche-

ma, is responsible for the object-oriented encapsulation of the data. We have a one-to-one

correspondence between the relational and object-oriented part of the logical schema. Fig-

ure 4.25 depicts these one-to-one mappings which precede the schema mappings. Note

that the schema mappings as presented in Section 3.2.3 are not affected because the in-

termediate schemas are “copies” of the logical schemas.

The intermediate schemas are used to guarantee ACID properties. The generator is di-

vided in two generators: the intermediate layer generator and the object-oriented layer

generator, cf. Figure 4.25. The transaction manager comprises the ACID transaction im-

plementation, is considered for generation and used by the transactional access layer dur-

ing data access. Additionally to the object-oriented layer, the intermediate layer is

generated.

Since the transactional access layer accesses distributed heterogeneous data sources, not

only databases, it has to guarantee the ACID properties by itself. To guarantee atomicity

and isolation the additional intermediate layer is needed to cache values during transac-

tion execution on the object-oriented layer. Atomicity is realised by caching the old values

in the intermediate layer during transaction execution. Only after a successful storage in

the database the new values are stored in the intermediate layer. In case of a transaction

failure the old values from the intermediate layer overwrite the new values of the object-

oriented layer. The same cache mechanism is used for isolation. When during a transac-

Figure 4.25: Transactional access layer generator overview

databases

databases logical

schemas

logical

schemas

logical

schemas

logical

schemas

logical

schemas

logical

schemas

logical

schemas

conceptual

schema

configuration

one-to-one

mappings

generator

transactional

access layer

information flow

data access

logical sche-

mas

logical sche-

mas

logical sche-

mas

logical sche-

mas

logical sche-

mas

intermediate

schemas

schema

mappings

intermediate

layer

object-oriented

layer

intermediate layer

generator

object-oriented

layer generator

transaction

manager

ACCESS LAYER GENERATION AND MODEL EXECUTION

129

ACCESS LAYER GENERATION AND MODEL EXECUTION

tion execution another transaction wants to access a same data value, the old consistent

value from the intermediate layer will be provided. This is realised with a read-/write-lock

mechanism.

Consistency and durability are harder to achieve. Consistency depends on a correct trans-

actional access layer. This includes a correct transaction manager. Correctness is ensured

through the generation of the transactional access layer. Repeated manual implementa-

tion is an error prone task compared to code generation. Durability depends on the under-

lying data sources. A database management system provides durability whereas it is hard

to guarantee for a file system. For further details of the transactional access layer genera-

tion we refer to [Wol01].

4.4.2 Mediation Layer

The transactional access layer provides access to the persistent data. We divide the medi-

ation layer in data access interfaces and data manipulation services. Both layer parts are

modelled with UML class diagrams and story diagrams. Code is than generated from

these models. A data access interface establishes the transition between the persistent and

transient parts of the web information system. Data access interfaces are normally built

on top of a transactional access layer. Data processing is done within data manipulation

services. Data manipulation services are generally situated between a data access inter-

face and a publishing layer or between two publishing layers.

Data access interfaces

Data access interfaces implement external schemas. Modelling such interfaces means

modelling transitions between persistent and transient objects. Links between transient

objects only exist if the participating objects exist, e.g., d: Diagnosis and pi1: PatientInfo

in Figure 4.26. For links between persistent objects we have a similar situation, a link ex-

ists if the participating persistent objects exist in the data source. In contrast to the tran-

sient case, a lookup into the data source may be needed, e.g., p1: Patient and p3: Patient

in Figure 4.26. This is handled by the transaction manager. If the objects already exist as

(persistent) memory objects the lookup is not needed, e.g., p1: Patient and n: Notes in

Figure 4.26.

In all cases we have the same clear modelling concept: a link is established if the source

and target exist.

For links on the transition, i.e., links between persistent and transient objects, we have two

situations. Firstly, the persistent object exists already as memory object and thus can be

accessed from a transient object like another transient object, e.g., pi1: PatientInfo and

p1: Patient in Figure 4.26. Secondly, if the persistent object does not exist as memory ob-

ject, a data source lookup has to take place to create the correspondent memory object be-

fore it can be accessed, pi3: PatientInfo and p3: Patient in Figure 4.26.

DATA COMPONENT EXTENSION

130

The present clear modelling concept is obfuscated if no distinction between these two sit-

uations can be made. Both situations take place in the transient world. In the first case the

source and target of the links exist. In the second case only the source exists and the target

has to be created.

The first situation is handled with the existing constructs, i.e., a link between a transient

object and a persistent object only considers existing memory objects. To cover the second

situation we introduced a stereotype <<search>>. When a transient object has to access a

persistent object whether the persistent object already exist as memory object or not, in

the story pattern the accessed persistent object and the corresponding link are marked with

the stereotype <<search>>, cf. Figure 4.27. The semantics of <<search>> is as follows:

p1: Patient

pi1: PatientInfo

d: Diagnosis

n: Notes

Patient

VHS_ID# CITY POSTAL CODE ...

00-0001 SIDNEY V8L-3H6

00-0002 VICTORIA V8X-1P2

00-0003 VICTORIA V8S-4H7

Figure 4.26: Examples of links between different object kinds

VHS_ID# ==00-0001

p3: Patient

VHS_ID# ==00-0003

(persitent)

memory objects

persitent

objects

transient

objects

lookup

object link

pi3: PatientInfo

Figure 4.27: Examples of a <<search>> link

ACCESS LAYER GENERATION AND MODEL EXECUTION

131

ACCESS LAYER GENERATION AND MODEL EXECUTION

try to find the persistent object as memory object; if it does not exist as memory object try

to create this object with a data source lookup; only if in both cases the persistent object

cannot be matched, the link cannot be established. Figure 4.27 shows a <<search>> link

for the transition between pi1: PatientInfo and p1: Patient in a story diagram.

Data manipulation services

The data can be accessed. Next, it is processed by manipulating objects inside the data ma-

nipulation services. We call data objects the objects that hold data. The data is exchanged

via XML documents. These XML documents are internally represented by the data ma-

nipulation services as data objects. Data is processed either with data objects that directly

access the data through the data access interfaces or with data objects that are received

through the publishing layer. The object manipulation itself is done with story diagrams.

Figure 4.28 depicts a story diagram which models a Data Fusion pattern. Recorded patient

Symptoms are collected from the hospdata database followed by a check for a possible

disease in the knowledge base. Possible diseases are linked to the current Diagnosis ob-

ject. The possible diseases can then be published to the physician for diagnostic support.

4.4.3 Publishing Layer

The publishing layer writes data object structures into XML documents and vice versa,

i.e., reads XML documents and creates the corresponding data objects. In order to perform

these read and write operations the publishing layer needs a referencing data object model

for the exchanged data. Data object models are represented as class diagrams. Conform-

Figure 4.28: Sample Data Fusion pattern modelled with a story diagram

DATA COMPONENT EXTENSION

132

ing to these data object models the publishing layer reads XML documents into data ob-

jects or writes data objects into XML documents.

The read and write operations from data object models to XML documents are depicted

in Figure 4.29. A data object model is internally represented as abstract syntax graph. An

abstract syntax graph is parsed into the FUJABATS proprietary FPR (Fujaba PRojekt file)

format, which is itself parsed into a proprietary FujabaXML format. Finally this XML doc-

ument can be transformed with XSLT (Extensible Stylesheet Language Transformation)

to other formats like GXL (Graph eXchange Lnaguage), XMLSchema, (XML Metadata

Interchange) XMI, etc. The way back is, in analogy, from XML documents to FujabaXML

(via XSLT), followed by an XML parser into a DOM (Document Object Model) structure

and finally into an abstract syntax graph. Details of abstract syntax graph publishing can

be found in [HL02].

We give two samples for publishing layers. Firstly, we present a HTML-based data portal.

which is a special case of publishing layer. The HTML-based data portal reads HTML

template pages and returns them as filled HTML pages. Figure 4.30 shows a scenario for

abstract

syntax

graph

FPR

DOM

FUJABA

XML

GXL

XMLSchema

XMI

...

XSLT

Figure 4.29: Abstract syntax graph publishing

FPR / XML

parser

XML

parser

FUJABATS

internal

parsers

format data flow

abstract syntax graph

abstract

syntax

graph

modified

abstract

syntax

graph data access

interface

data manipulation

service

Patient: 00-0001

Symptoms:

cough, nausea

Possible diseases:

disease1

disease2

...

Patient: 00-0001

Symptoms:

cough, nausea

Possible diseases:

1) Lung Neoplasms

2) Chronic Obstructive

Pulmonary Disease

3) ...

Figure 4.30: Sample publishing layer: web portal

HTML page servlet

abstract syntax graph

data flow

mediation layer parts

(1) (2) (3)

(6)

(4) (5)

(7)

(8)

ACCESS LAYER GENERATION AND MODEL EXECUTION

133

ACCESS LAYER GENERATION AND MODEL EXECUTION

filling a HTML template page. The template page, i.e., the page with placeholders

disease1, disease2, etc., is read by the DiagnosisSystem servlet (1). The abstract syntax

graph corresponding to the template page is built (2) in accordance to HTML 4.011. This

abstract syntax graph is then passed to a data manipulation service (3). The data manip-

ulation service accesses the data source via a data access interface (4) and (5) and modi-

fies the abstract syntax graph. The modified abstract syntax graph is then parsed (7) into

the result HTML page and posted to the client (8). Details of the modelling and code gen-

eration of servlets are described in [Kam03].

Secondly, a publishing layer that reads data and a publishing layer that writes data for a

data transducer. Figure 4.31 shows the correspondence between the input Patient descrip-

tion and the output Patient description. The input document only shows the Surname ob-

ject description in GXL (upper part of Figure 4.31). This input document is read into an

1. http://www.w3.org/TR/html401

Figure 4.31: Sample publishing layers for a data transducer

...// input document ’sd’

<attr name="de.uni_paderborn.fujaba.uml.UMLObject::objectType"> <string>Surname</string> </attr>

</node>

...

... // output document ’td’

<attr name="de.uni_paderborn.fujaba.uml.UMLObject::objectType"> <string>Patientname</string> </attr>

</node>

<attr name="de.uni_paderborn.fujaba.uml.UMLAttrExprPair::name"> <string>value</string> </attr>

<attr name="de.uni_paderborn.fujaba.uml.UMLAttrExprPair::expression"> <string>s.value, f.value</string> </attr>

</node>

...

read

write

transduced

DATA COMPONENT EXTENSION

134

object structure with nodes for the document sd, the surname s: Surname and the first-

name f: Firstname. These objects are transduced into the target document (td) object struc-

ture. Figure 4.31 shows the story pattern, which creates p: Patientname that unifies the

s: Surname and f: Firstname values. Finally the output document is written. In Figure

4.31 only the Patientname object, including its value and the edge between them, is de-

picted. The read and write operations are performed through the abstract syntax graph

publishing, cf. Figure 4.29.

4.4.4 Multimedia extension

Medical data nowadays consists among other data of multimedia data, e.g., videos or

commented radiographs. Therefore the existing legacy systems have to be extended to

handle multimedia data. Multimedia data owns peculiarities. We consider the following

meaningful peculiarities:

• huge size (gigabytes are the lower bound),

• enclosed meta data (e.g., title and duration of a video, ...),

• particular formats (e.g., video, associated images and audio, ...),

• specific operations (e.g., displaying, real time streaming, ...),

• time dependence (e.g., life streaming audio and video synchronisation, ...) and

• combination of discrete and continuous data (e.g., associated images and audio, ...).

To sustain these peculiarities, extensions in the data tier are needed. The extension of a

data source is hard to achieve. Therefore the transactional access layer has to handle some

of the extensions. By dividing the data for storage and re-combining it for retrieval the

size, format and meta data problems can be handled. Unsupported formats and meta data

are resolved by storing extra data, e.g., format characteristics are stored additionally to the

essential data.

Dividing the data is needed because of capacity restrictions for data storage and efficiency

of data retrieval. This leads to data distribution and thus to the need of ACID transaction

management, which is provided by the transactional access layer. The read-/write-lock

mechanism has to be refined. Further the strategy pattern [GHJV95] is used for providing

different partitioning/combining algorithms. For further details we refer to [Wag01].

The extensions for multimedia data in the mediation layer are numerous. Modelling for

indexing, sorting and searching multimedia data are required. Further multimedia data has

to be composed and synchronised before it can be published. Many approaches exist to

cope with these issues, e.g., [vC97, Man99, CL00, ES02].

The publishing layer would be mainly responsible for displaying the multimedia contents

to the clients. Various solutions exists how to achieve publishing, real-time displaying,

etc. These problems are out of the scope of this dissertation, such we refer to the literature

for details.

TOOL SUPPORT

135

TOOL SUPPORT

4.5 Tool Support

The tool support for the data component integration process is located in the REDDMOM

project. REDDMOM (and FUJABARE) uses the FUJABATS plug-in architecture [BGN+03]

to reuse and combine existing tools. Figure 4.32 shows an overview of the REDDMOM

tools for data component extension. The Clustering Tool (BUNCH) in grey is available from

the Drexel University in Philadelphia. The dashed part of the web information system ex-

presses the generated parts. The tools inside the grey area are provided by FUJABARE. The

remaining tools were constructed in REDDMOM.

UML Editor

The UML Editor is currently extended with UML package diagrams and consists of UML

class diagrams, views and story diagrams as described in Section 3.4. Packages can con-

tain other packages or a UML class diagram view. Further, REDDMOM introduces the pos-

sibility to import and export UML class diagrams (data object models) as (XML)

documents, e.g., in XMLSchema or GXL [HWSS00]. Further, REDDMOM extended the

story patterns with the <<search>> stereotype construct.

Transformation Editor

The Transformation Editor is described in Section 3.4. In analogy with refactoring opera-

tions, design operations can be defined. We give a short overview of the design operations

to give the reader insight into (re)design.

Figure 4.32: Architecture of the REDDMOM extension tools

UML Editor

(conceptual schema

ASG model)

Tranformation Editor

(schema ASG model)

Web

Information

System

generated/generates

information flow

provided by

FUJABARE

Reengineer

Domain

Expert

Pattern Definition

and Use Editor

(conceptual schema

ASG model)

Generator

(logical & intermediate

schema ASG model,

mappings, configuration,

transaction manager,)

Clustering

Tool

(BUNCH)

Dobs /

DoLittle

DATA COMPONENT EXTENSION

136

Figure 4.33 depicts pushAttribute (in a story diagram) as example for a basic design op-

eration and generalise as example for a composed design operation. pushAttribute is a

simple operation, the given attribute attr is pushed from the class clazz to the given sub-

or superclass cl. A concrete or an abstract class is created by generalise, followed by push-

ing the given attributes and relationships to the newly created superclass. The parameters

for a composed design transformation are assigned in another dialog, cf. Chapter 3.

Since the design operations change the conceptual schema information capacity, these

changes have to be reflected in the data sources. The logical (and physical) schema is up-

dated by the mapping rules (cf. Section 3.2.3), e.g., the new superclass created by gener-

alise is mapped to a new variant of the entity corresponding to the superclass’ subclass.

Note that also refactoring operations (cf. Section 3.2.4) can be used for implementing

composed design transformations.

In Table 4.2 we present design operations. The refactoring operations of Table 3.2. are a

subset of design operations and thus not listed again. The design operations are classified

into basic design (BD) and composed design (CD) operations.

name description

createClass Creates a new (concrete) class. BD

createAttribute Creates an attribute in a given class. BD

createAssociation Creates an association between two given classes. BD

createInheritance Creates an inheritance between two given classes. BD

Table 4.2: (Re)Design Transformations

Figure 4.33: Design transformations pushAttribute and generalise

TOOL SUPPORT

137

TOOL SUPPORT

createKey Creates a key for a given classes. BD

createPackage Creates a new package. BD

nestingPackage Interleaves two given packages. BD

addToPackage Adds given classes to a given package. BD

changeRelshipCard Modifies the cardinality of an association or aggregation. BD

changeAttrType Changes the type for a given attribute. BD

unifyAttributes Merges two or more given attributes with the same type

into one of the given attributes.

convertAbstract Sets a given concrete class to an abstract class. BD

convertConcrete Sets a given abstract class to a concrete class. BD

pushClass Moves a given class from its package to a given super- or

sub package.

pushAttribute Moves a given attribute from its class to a given speciali-

sation or generalisation.

pushRelationship Moves a given association or aggregation from its class

to a given specialisation or generalisation.

remove Deletes a given conceptual schema element. BD

createAbstractClass Creates a new abstract class. CD

generaliseNew Creates a new concrete or abstract superclass without

attributes for a given class.

specialiseNew Creates a new concrete or abstract subclass without attri-

butes for a given class.

generalise Creates a concrete or abstract superclass for a given class

and pushes up the given attributes and relationships.

specialise Creates a concrete or abstract subclass for a given class

and pushes down the given attributes and relationships.

superPackage Creates a super package for a given package. CD

name description

Table 4.2: (Re)Design Transformations

DATA COMPONENT EXTENSION

138

Pattern Definition and Use Editor

Today, patterns that represent good design solutions to recurring problems are frequently

used. The design patterns of [GHJV95] are well known, but patterns for all kinds of ap-

plication domains exist and can be found, e.g., in [Ris00].

The Pattern Definition and Use Editor enables the reengineer to define (design) patterns.

A pattern consists of a structural part which defines classes and structural relationships,

e.g. associations, between them and a behavioral part which defines how the classes in-

teract. Since our design takes place on the conceptual schemas we use UML class dia-

grams for the specification of the structural part of a pattern. The behavioral parts, i.e., the

methods, are specified with story diagrams. Finally, the defined pattern is exported to

GXL.

The use of a pattern is realised by instantiating it, i.e., integrating it into an actual model

of a system. The corresponding GXL description is imported into a UML class diagram.

The different constituents of a pattern, e.g., classes or methods, can be matched to existing

UML class diagram elements. Some of the classes, associations, and methods of the pat-

tern may already exist in the actual design. The users then has to assign existing design

subPackage Creates a sub package for a given package. CD

removeAll Deletes all given conceptual schema element. CD

name description

Table 4.2: (Re)Design Transformations

Figure 4.34: Pattern Editor: pattern instantiation example

TOOL SUPPORT

139

TOOL SUPPORT

elements to elements specified by the pattern, i.e., the existing design elements and the

pattern elements are merged. The remaining elements of the pattern are then created dur-

ing the instantiation. This allows a seamless integration of patterns into an existing design.

Of course, for newly designed data components or data component parts, patterns can also

be instantiated and build the starting point for a new design.

Figure 4.34 shows an example for a pattern instantiation. We refined the chain of respon-

sibility pattern [GHJV95] for multimedia handling. A Server passes the file to process to

a Handler which is either an AudioHandler, a PictureHandler or a MovieHandler, cf. the

class diagram of Figure 4.34. Given an MMFile and a Server, the extended chain of re-

sponsibility is instantiated to handle the MMFile instances. The right part of Figure 4.34

shows the corresponding dialog. It depicts the merging of the existing Server with the pat-

tern’s Server class. All other classes are created. In analogy, this is applied to methods,

attributes and associations.

Clustering Tool (BUNCH)

BUNCH is a clustering tool intended to aid the software developer and maintainer in un-

derstanding, verifying and maintaining a source code base. To do this, BUNCH lets the user

evaluate the quality of an application's modularisation, by analysing the source code

graph. BUNCH relies solely on the information contained in a module dependency file,

considering nodes as program units or modules, such as files or classes, and edges be-

tween the nodes as calls or relationships between those modules, such as function calls or

inheritance relationships. With this graph, BUNCH can find what a "good" clustering for

the system is (thus helping when documentation of the code is nonexistent or outdated),

and it can also use predefined clusters to measure or improve the quality of the system's

clustering. The BUNCH tool is described in [MMCG99], more details can be found on the

BUNCH web pages (http://serg.cs.drexel.edu/SERG/Projects/Bunch/).

Generator

The Generator for the different layers is based on the Java code generation included in FU-

JABARE. Extensions in REDDMOM were done for the database access layer (persistent ob-

jects and multimedia objects), the servlets and the data access interface code generation.

Details for these extensions are described in [Wol01], [Wag01] and [Kam03].

Dobs / DoLittle

To facilitate simulation and graphical debugging of the generated application, the FUJA-

BATS environment provides the DOBS (Dynamic Object Browsing System) plug-in. DOBS

uses Java runtime type information and dynamic method invocation features for analysing

Java runtime object structures and for the user driven execution of application specific

methods. DOLITTLE is an extension of DOBS for the transactional access layer, i.e., to

browse persistent objects enclosed in transactions with DOBS. For details we refer to

[Wol01] and [GZ02].

DATA COMPONENT EXTENSION

140

4.6 Related Work

Techniques to determine clusters (subsystems) using source code component similarity

[MOTU93, SH94, CS90], concept analysis [LS97, vK99, Anq00], or information avail-

able from the system implementation such as module, directory, and/or package names

[AL99] are available. An introduction and overview to clustering can be found in

[Wig97]. Mostly clustering is used for software reverse engineering and re-modularisa-

tion [DB00, SMB+03]. The BUNCH software clustering system [MMR+98, MMCG99],

unlike the other software clustering techniques, uses search techniques to determine the

subsystem hierarchy of a software system, and has been shown to produce good results

for many different types of systems.

In the context of data models clustering is used for ER schema abstraction [FM86]. Other

approaches facilitate user comprehension [TWBK89], consider relationship abstraction

and clustering [JOS93] or cluster relations for databases reverse engineering [SPPB02].

All these approaches have in common that they focus on large ER schemas and conse-

quently on intra-schema relationships.

Architectural patterns are related to the notion of architectural styles [SG96]. Architectur-

al styles do not result in a complete architecture but can rather be seen as an architectural

framework. Architectural styles are specific views for one subsystem at different levels of

a system. In contrast, architectural patterns [BMR+96] are problem oriented. They ex-

press and separate the concerns of fundamental structures in software systems

The Façade pattern defines and provides a unified interface to a set of interfaces in a sub-

system [GHJV95]. The presented Data Portal pattern is an architectural variant of the

Façade design pattern. Data Portals provide unified interfaces to data components of the

system. A possible technology-driven instantiation of Data Portal is the Abstract Data-

base Interface pattern [ABM96]. Abstract Database Interface makes an application inde-

pendent from the underlying database platform.

Buschmann at al. [BMR+96] define two architectural patterns that are related to Data

Connection. The first pattern is called Broker and coordinates communication of decou-

pled components that interact by remote service invocations. The second pattern is Pipes

and Filters that provides filter components which encapsulate processing steps for data

streams. The pipes pass through the data between adjacent filters. Further, Gomaa, Mena-

scé and Shin [GMS01] described component interconnection patterns for synchronous,

asynchronous and brokered communication. A designer of a new distributed application

can use their patterns for appropriate component interaction. In contrast to those commu-

nication centric patterns, Data Connection mediates data and can be compared to the Me-

diator pattern [GHJV95] at architectural level.

In terms of extension, only few approaches tackle the problem of system migration. Behm

et al. [BGD97] and [Fon97], e.g., propose a complete replacement of relational by object-

oriented databases. Based on the schema mappings, they present algorithms to automati-

RELATED WORK

141

RELATED WORK

cally migrate the data. According to [War99], system „replacement“ is an expensive and

risk-prone task that should only be considered when continued maintenance or reengi-

neering is not feasible.

Data component extension is based on reverse engineered data models, thus in this thesis

we focus on integration of the data related legacy system parts. A prerequisite that inte-

gration has to meet in order to fulfil the new technology requirements is to enable the leg-

acy system to act as components, i.e., black boxes that exchange data. Sneed presents an

approach to wrap legacy systems with interfaces exchanging the data via XML [Sne02].

A prerequisite of this approach is to reengineer the existing interfaces. An approach com-

bining model-integrating computing with component middleware is presented

[GSNW02]. The advantage of this approach is that components can be reused and the

modelling tools have only to generate the interfaces. The drawback is that this process

starts with the integrated data models.

The Clio project is a system that manages data transformation and integration [MHH+01].

It supports the generation and management of schemas, correspondences between sche-

mas and mappings (queries) between schemas. The weakness of Clio is the inter-schema

correspondences mining that could be augmented as stated by the authors themselves.

Schema mappings are the basis for providing uniform data access to heterogeneous dis-

tributed database systems. Many approaches use object-oriented data models to integrate

multiple databases, e.g., [RL82, ADD+91, CHK+91, FGS93]. The TSIMMIS project

[PGMW95] strives to integrate a broad range of repositories. The Garlic project

[CHS+95, HMN+99] provides an object-oriented view, not only to databases and record-

based files, but also in media-specific data repositories with specialised search facilities.

The integration itself in form of wrappers and mediators [Wie92, Wie95] are poorly sup-

ported.

The InterDB approach [THB+98] integrates the construction of abstract interfaces to ac-

cess independent heterogeneous distributed databases in DB-Main. It provides generation

of conceptual wrappers, i.e., software layers that interface a database based on the recov-

ered conceptual schema [TCHH99]. These wrappers are based on data conversion pro-

grams which consist of a concatenation of the applied schema transformations. The

drawback here is that this approach can hardly be used as input for commercial off-the-

shelf middleware because the schema correspondences are implicitly defined in the data

conversion programs. Approaches like [Sie98, CER99, Tho99, Obj99b, Obj99a] provide

such support based on previous determined schema mappings.

Remaining problems are the construction of such schema mappings and that the reengi-

neer is not aware of the overlapping schema information. The VARLET approach [Jah99]

maintains an explicit schema mapping during the reverse engineering of single schemas

that can serve as input. Still, given schemas of different data sources have to be related.

Schema integration and matching approaches go in this direction. Schema integration

[RR98] mainly tackles the reverse engineering problem of determining inter-schema re-

DATA COMPONENT EXTENSION

142

lationships. Schema matching provides a mapping between two schemas that semantical-

ly correspond to each other. A detailed overview on existing partially automated schema

matching techniques is given in [RB01]. Schema matching techniques are used for data

integration and to detect schema overlapping. The drawback of most tools is that they as-

sume semantically similar schemas and/or are focussed on one-to-one matching.

An approach that defines model mappings based on semantics approximation is presented

in [Köl98]. The mappings form the basis to establish a cooperative coexistence of new and

legacy information systems. The weakness of this approach lies in the manual analysis of

the legacy systems to derive a flat reverse object-oriented model.

For multimedia extension, Cybulski and Linden [CL00] describe a pattern language that

is used to define a multimedia authoring environment capable of producing and utilising

multimedia components. Each of the presented patterns describes one well-known ap-

proach to multimedia authoring, e.g. joining and breaking artefact groups, defining and

filling in templates, arranging and re-arranging artefact collections, creating and holding

presentations, synchronising multiple multimedia channels, etc. Van den Broecke and

Coplien propose a pattern-based approach to design software for the rapidly changing

field of multimedia networking [vC97]. Engels and Sauer [ES02] present an approach to

object-oriented hypermedia and multimedia modeling with OMMMA-L a visual, object-

oriented modelling language that is an extension of the UML. To search in large amount

of information, Manolescu works with an alternative, a simpler representation of the data,

i.e., the representation contains only the information that is relevant for the problem at

hand [Man99].

4.7 Summary and Future Work

In this chapter we presented our approach to extend the web information system. The

chapter started with a process overview. The process starts with the clustering of data

component models based on the BUNCH Tool [MMCG99]. After discussing different

clustering strategies, the data components resulting from clustering are classified in three

kinds. This classification is followed by (re)structuring the web information system based

on the presented architectural patterns. The maintained data component model dependen-

cies enable the generation of an object-oriented transactional access layer to the legacy

databases. Next, a short overview for modelling prototypes and multimedia extension was

given. Finally, we presented the tool support for this process.

A consequent step is the integration of patterns that enable further generation and thus

model execution, like presented in [HK02] for Enterprise JavaBeans or in [LB03] for

ubiquitous computing. Further, architectures of web information systems are constantly

in flux and organisations have to be able to respond quickly to changing requirements.

This engenders component-oriented development as presented in [CWMY02] or model-

driven development of web services [Amb02].

143

CHAPTER 5: MODEL CONSISTENCY

MANAGEMENT

Historical knowledge is indispensable for those who want to build

a better world.

LUDWIG VON MISES

Austrian economists (1881 - 1973)

5.1 Data Component Model Consistency Maintenance

In the two previous chapters, we present reengineering activities for web information sys-

tems that are based on their data component models. These models represent the old and

new data components at different levels of abstraction. The models depend on each other.

Inconsistencies between those models often cause update problems. Whenever a reengi-

neer discovers new information, e.g., about the real semantics of implementation con-

structs in the legacy physical schema, the conceptual representation that has been created

so far must be updated accordingly.

data-

base

logical

schema

conceptual

schema

appli-

cation relationship

partial appl.

model

cluster information flow

data access

LS1 LSi LSj LSm

DB1

new App.

DBi DBj DBm DBp

DBn

new App.

Figure 5.1: Web information system model maitenance

new App.

CS1 CSi CSj CSn CSp

CSm

legacy App.legacy App. legacy App.

LSp

LSn

MODEL CONSISTENCY MANAGEMENT

144

Figure 5.1 shows how the models depend on the databases and legacy applications; and it

shows the dependencies and relationships between models. A typical source of inconsis-

tencies are on-the-fly modifications to the implementation of the legacy data sources due

to urgent requirements while reengineering activities are in progress. These iterations dur-

ing the reengineering activities that imply model changes lead to inconsistencies between

the different models. Detecting and eliminating such inconsistencies manually is a time-

consuming and error prone activity. Hence, a commonly used alternative is to discard all

created (conceptual) models of the system and generate default representations anew. In

this case, the reengineering work that has been performed manually so far is lost and has

to be repeated. Obviously, both alternatives are unsatisfactory.

The overall goal (of our research) is the development of mechanisms that facilitate incre-

mental and iterative reengineering processes. Traceability is the most important prerequi-

site for such a mechanism, i.e., in case of process iterations, we have to be able to trace

and propagate modifications of the initial representation of the legacy system to its trans-

formed representation in order to re-establish model consistency. The requirement for

traceability implies some sort of “logbook” about the interdependencies of all operations

invoked by the environment, i.e., a reengineer (user) or a tool.

We ensure tracability inside and between the models representing the web information

system. This means that all reengineering activities have to be model-based. Since our

models are maintained as abstract syntax graphs, only modifications performed with

graph productions are considered for tracability. Therefore, this only holds for model-

based reverse engineering and model-driven development for the extension activities,

e.g., manual code changes cannot be considered. Further, the legacy applications cannot

be updated because we lack their models and thus tracability for them.

We have developed an incremental approach to traceability in graph-based activities. It

turns out that graph-based structure is most suitable to maintain the interdependencies. We

call the corresponding graph History Graph because it reflects the transformation history

of the different models. The developed History Graph Mechanism provides the traceabil-

ity necessary for iterative (reengineering) processes. The core idea is an explicit depen-

dency relation among all performed graph transformations. Let us assume that a software

artefact has been updated during the iteration. The dependency information stored in the

proposed History Graph Mechanism enables the selective removal of only those transfor-

mations that (transitively) depend on the changed information.

Figure 5.2 shows an overview of the History Graph Mechanism that provides model con-

sistency management. The Data-Oriented Reverse Engineering and Data Component Ex-

tension activities transform the models (abstract syntax graphs) representing the web

information system data components. The History Graph Mechanism receives as input a

start graph and History Graph Transformations, i.e., all executed operations that modify

the web information system data components. To re-establish consistency between mod-

GRAPH-BASED HISTORY MECHANISM

145

GRAPH-BASED HISTORY MECHANISM

els the reengineer runs the History Graph Mechanism. The output produced is on the one

hand the consistent Updated Models and on the other hand the Undone History Graph

Transformations. The Updated Models can then be modified, e.g., reapply adapted un-

done operations, and further data-oriented reengineering activities can take place.

Note that the History Graph Mechanism establish consistency for the models. The consis-

tency to the generated code has to be established by generating the code anew.

5.2 Graph-based History Mechanism

We have employed the concept of History Graphs for consistency management during it-

erations in reengineering processes since 1998. We present background knowledge and

basics before we explain the History Graph Mechanism in more details.

5.2.1 Background: History Graph Mechanism

Assume that midway during a database reverse engineering process the reengineer learns

about additional dependencies in the legacy database. Ideally, (s)he would like to add this

new information to the initially extracted representation of the legacy database and vali-

date those operations which might have to be undone due to the new knowledge. The His-

tory Graph Mechanism was developed exactly to serve this purpose. With its help the

reengineer can get back to the initial representation of the legacy database, make all de-

Updated

Models

Figure 5.2: Model Consistency Management: Overview

1.1 Data

Model

Recovery

2. Relationship

Retrieval

1.2 Mapping

and

Refactoring

1. Data Component

Clustering and

Classification

3. Generation

and Model

Execution

History Graph

Transformation

(graph production)

Undone

History Graph

Transformation

(graph production)

Data-Oriented Reverse Engineering Data Component Extension

History

Graph

Mechanism

Reengineer

Domain

Expert

Reengineer

Domain

Expert

users and experts

data component conceptual representation

graph transformation

information flow

activity

Data Component

(Re)Design

2. Architectural

Pattern

Application

MODEL CONSISTENCY MANAGEMENT

146

sired changes, and determine which operations might have lost their validity [JW99]. In

the VARLET project [Jah99], we implemented a first History Graph Mechanism prototype

[Wad98]. This prototype was tightly integrated with the VARLET tool and could not be re-

used by other existing tools.

Therefore, we have developed a lightweight History Graph Mechanism that can loosely

be integrated with existing reengineering tools [JWZ02]. Graphs are commonly used by

reengineering tools for internal representation of software artefacts. Of course, the specif-

ic graph models used in different tools might comprise variations with respect to their ex-

pressiveness. Nevertheless, most graph models have in common that they support

different node- and edge types, attributes and directed edges. We use the GXL graph mod-

el [HWSS00]. Note that the History Graph Mechanism does not depend on this particular

model.

The History Graph Mechanism makes use of the theoretical concept of graph transforma-

tion systems [Roz99]. The History Graph Mechanism is a specific implementation of the

general concept of a graph process as introduced by Corradini et al. [CMR96]. Corradini

defines a graph process as a partially ordered structure, plus suitable mappings which re-

late the elements of this structure to those of a given typed graph grammar. This theoret-

ical basis allows us to uniformly describe various kinds of processes in terms of graph

processes.

History Graph Model

The History Graph Model is based on the GXL graph model. In the following we give a

brief, semi-formal introduction to the GXL graph model. We refer to [Roz99] for a com-

plete formalisation of attributed graph models. A more in-depth discussion of GXL can

be found in [HWSS00]. Figure 5.3 shows a UML specification of the GXL graph model.

A Graph can be directed or undirected, defined in isdirected from LocalConnection. It

contains GraphElements in form of Nodes, Edges, and Relations. Relations contain a set

of links pointing to one graph element each. All graph elements have unique identifiers.

GXL also includes the notion of hypergraphs, i.e., graph elements can themselves contain

sub-graphs. Furthermore, GXL supports typed graph elements (where the type of a graph

element is itself a graph element of a type graph). We omit a more detailed discussion of

the GXL typing concept since it is not necessary for the History Graph Mechanism. Note

that the actual exchange of GXL graph instances is performed in a canonical textual for-

mat based on XML [HWSS00].

Further Figure 5.3 shows the graphical specification for the History Graph model. We ex-

tend the GXL graph model for maintaining the History Graph: we introduced Transfor-

mation nodes that carry a timestamp and a name. The timestamp is used to log the time

when a History Graph Transformation has occurred. The name of the corresponding op-

GRAPH-BASED HISTORY MECHANISM

147

GRAPH-BASED HISTORY MECHANISM

eration is logged within the name attribute. Input and output dependencies are represented

by the ordered associations in and out.

History Graph Transformations

We use graph productions (cf. Section 3.2.3) to formalise and implement History Graph

Transformations. The left-hand side of a graph production represents the input of the cor-

responding (History Graph) transformation, while its output is represented by the right-

hand side. For example, let us consider splitClass (cf. Figure 3.19) in Figure 5.4.

Unique identifiers of the nodes allow identifying identical nodes on both sides of the pro-

duction, e.g., cl is passed as input parameter and a bound object on both sides. Node cl

represents the class to be splitted. Nodes attrs represent the set of attributes that will be

moved to the new class (Node newClName: Class). Nodes ca: cattributes represents the

edges from the class cl to the attributes attrs. These edges do not occur an the right-hand

side of the production. Consequently, these edges will be deleted during the productions

application. On the right-hand side new nodes occur. Node newClName: Class represent

the new class. Node assocName: Association represents the new association between the

old class cl and the new class newClName: Class. The source and target edges are repre-

sented by the node :src and node :tar, respectively. Finally, the edges :cattributes between

Figure 5.3: History (GXL) Graph model

AttributedElement

GXLDocument

Type TypedElement

Graph

Attribute

GraphElement

Node

isdirected

Edge

Relation

name

kind

role

edgeids

hypergraph

edgemode

Relend

role

direction

startorder

endorder

0..n

0..1

0..1 0..n

0..n

0..1

0..n 11

0..n

hasAttribute

refersType

hasType

0..n

contains contains

contains

refersDocument

hasRelend

from

relatesTo LocalConnection

composition

aggregation

association

attr

class class

generalisation

Transformation

name

timestamp

0..n0..n

in out

0..10..1

{ordered}

MODEL CONSISTENCY MANAGEMENT

148

the new class cl and the attributes attrs are created. Note that an identifier is not mandatory

for nodes that are created.

In order to maintain the application contexts of transformations, i.e., if we want to main-

tain input/output dependencies of applied transformations, we have to store information

about the matches for the corresponding graph productions. We identify the graph ele-

ments of the left- and right-hand sides and represent them explicitly in the History Graph.

Figure 5.5 shows the template for transformations.

GraphElements that appear on the left-hand side but not on the right-hand side, i.e., de-

leted elements, are associated to the Transformation node by in links. We call such nodes

consumed History Graph elements. In opposite, GraphElements that appear on the right-

hand side but not on the left-hand side, i.e., that are created, are associated to the Trans-

formation node by out links. These nodes are called the current History Graph elements.

A node that appear on both sides is associated to the corresponding transformation by an

Figure 5.4: Graph production splitClass

production splitClass (cl:Class, newClName: string, assocName: string,

attrs:CAttributes [0..n];

ca: cat-

tributes

newCl-

Name: Class

: src : tar

assocName:

Association

attrs

from

Graph

Element

’to’ and ’from’

links between

GraphElement

and Edge in

GXL

left-hand side right-hand side

attrs : cattri-

butes

Figure 5.5: Template of History Graph Transformation

GraphElement

GraphElementTransformation

out

in/out

in out

TGG

History Graph

Transformation

(bold) current History

Graph Element

(grey) consumed

History Graph Element

Transformation

’in’ and ’out’ edges

GRAPH-BASED HISTORY MECHANISM

149

GRAPH-BASED HISTORY MECHANISM

in link and by an out link. We represent this situation with a single link labelled in/out.

These nodes are also current History Graph elements.

The corresponding transformation to the graph production splitClass example of

Figure 5.4 is depicted in Figure 5.6. The splitClass: Transformation has five input nodes

and seven output nodes. In addition to the three left-hand side nodes of the graph produc-

tion, the two (string) input parameters are represented as (consumed) nodes in the trans-

:cattri-

butes

Figure 5.6: History Graph Transformation splitClass

from

cl:

Class

newCl-

Name:

string

ca :cat-

tributes

attrs:

CAttri-

butes

splitClass:

Transformation

newCl-

Name: Class

assoc-

Name:

string

: src

: tar

assocName:

Association

History Graph

Transformation

(bold) current History

Graph Element

(grey) consumed

History Graph Element

Transformation

’in’ and ’out’ links

’to’ and ’from’ links

between GraphElement

and Edge in GXL

(grey means consumed)

from

in/out

out

Figure 5.7: Application of production splitClass

ca:cat-

tributes

patient:

src

address:

Association

Patient:

Class

STREET ADDRESS,

CITY,

POSTAL CODE,

HOME PHONE,

WORK PHONE

from

splitClass

G’to’ and ’from’ links

between GraphElement

and Edge in GXL

remaining

graph

elements

:cattri-

butes

STREET ADDRESS,

CITY,

POSTAL CODE,

HOME PHONE,

WORK PHONE

from

remaining

graph

elements

links to the

remaining

graph elements

Gremaining graph elements

graph elements participating

in the graph production

Patient:

Class

address:

tar

Address:

Class

MODEL CONSISTENCY MANAGEMENT

150

formation. This is necessary to maintain the entire application context of the

transformation. The current nodes are identical to the right-hand side nodes of the graph

production.

Application of History Graph Transformation

In analogy with the treatment of transformations as graph productions, we can view trans-

formation applications as applications of graph productions. As an example, Figure 5.7

shows an application of transformation splitClass to class Patient. The left-hand side of

Figure 5.7 shows the left-hand side subgraph that fulfills all left-hand side application

conditions of the graph production splitClass. A set of attributes (STREET ADDRESS,

CITY, POSTAL CODE, HOME PHONE and WORK PHONE) is linked to class Patient via

the edge :cattributes and link from. This match is depicted by the three nodes with plain

border. The remaining graph is represented by the remaining graph elements with a

dashed border. Links are represented accordingly. The right-hand side of Figure 5.7

shows the subgraph that contains all elements that occur on the right-hand side of the

graph production splitClass. In this example class Address and Association address, with

roles (edges) patient and address, are created. The attributes are linked to class Address.

Creating Reengineering Histories with Graph Processes

Graph transformation systems are typically defined as a start graph and a set of graph pro-

ductions. Graph elements on the left-hand side of a graph production that do not appear

on the right-hand side are deleted. The main difference to the History Graph Mechanism

is that deleted nodes are not removed (from the History Graph) but are isolated, i.e., all

their incomming and outgoing links in the corresponding abstract syntax graph are deleted

and only the links to the transformation nodes are maintained.

Figure 5.8 illustrates the basic structure of a History Graph: applied transformations are

explicitly represented by n-nodes with corresponding input and output edges to graph el-

ements (G-nodes). The number n gives the chronological order of the transformations ac-

cording to their timestamps. The current valid graph is composed of the nodes with bold

borders. Consumed graph elements, i.e., graph elements that are isolated during transfor-

mation application, have grey borders.

The History Graph, which represents a particular model transformation history, contains

the current abstract syntax graph of the web information system as a subgraph. This sub-

graph can easily be determined by filtering all graph elements that have not been con-

sumed by transformations, i.e., that are not exclusively sourced by an in edge that points

to a transformation node (bold G-nodes in Figure 5.8). Likewise, the History Graph also

subsumes all previous states of the abstract syntax graph during the interactive transfor-

mation process. Thus, it is possible to trace back to any past state in the editing history.

GRAPH-BASED HISTORY MECHANISM

151

GRAPH-BASED HISTORY MECHANISM

Formal definitions of the History Graph, History Graph Transformations (graph produc-

tions) and the application of transformations in the History Graph can be found in [Jah99,

JWZ02, JSWZ02].

Example

Figure 5.9 shows a snapshot of a History Graph of our case study. It is the same graph as

depicted in Figure 5.8, where the graph elements and transformations are represented as

G-nodes and n-nodes, respectively.

Starting points are the variants Deceased and Patient of the corresponding entities. The

mapping rule MapVariantToConcreteClass is applied for both variants. The correspond-

ing nodes, especially classes Deceased and Patient, are created (cf. Section 3.2). Both

mapping rules are stored in the History Graph as transformations. We replaced the times-

tamp of the six transformations occurring by numbers from 1 to 6 according to the chro-

nological order they were applied, i.e., transformation with timestamp {1} was applied

before transformation with timestamp {2}. Next, the transformation splitClass is applied

on class Patient (cf. Figure 5.7). During recovery activity the Annotation ’Duplication’ be-

tween the classes Deceased and Patient is created. We omit the precondition details and

represent them as one node set. The Annotation ’Duplication’ is confirmed and replaced by

the Association ’deathdate’ with stereotype <<copy>>, cf. Figure 3.38 and Figure 3.39.

Finally, a second transformation splitClass is applied on class Deceased.

in/out

out

in/out

from

out out

out

out to

from

out

in/out

in from

from

out out

out

from

from from

in/out

out

Figure 5.8: Basic structure of a History Graph

History Graph

Transformation

(bold) current History

Graph Element

(grey) consumed

History Graph Element

Transformation

’in’ and ’out’ links

’to’ and ’from’ links

between GraphElement

and Edge in GXL

(grey means consumed)

Gin

4out

in/out

out

out out

from

in/out

out

from

out

out G

from

Gin/out

MODEL CONSISTENCY MANAGEMENT

152

5.2.2 Simple Undo History Graph Mechanism

In this section, we describe how the History Graph can be used for incremental undo

(change propagation). The sequence diagram in Figure 5.10 illustrates an interaction sce-

nario between the environment (Tool/Reengineer) and the History Graph Mechanism.

Note that the diagram covers a single iteration only.

Figure 5.9: Sample History Graph

cattri-

butes

from

Class

’Deceased’

boolean

’false’

string

’Contact’

cattri-

butes

CAt-

tribute

’attrs1’

Transformation

’splitClass’

{6}

to Class

’Contact’

string

’contact’

src

tar

Association

’contact’

from

in/out

out

in/out

cattri-

butes

from

Class

’Patient’

boolean

’true’

string

’Address’

cattri-

butes to

Transformation

’splitClass’

{3}

to Class

’Address’

string

’address’

src

tar

Aggregation

’address’

from

in/out

out

src

tar

Association

’deathdate’

’<<copy>>’

Variant

’Deceased’

Transformation

’MapVariantTo

ConcreteClass’

{2}

Variant

’Patient’

Transformation

’createDuplication’

{4}

Transformation

’createAssociation’

{5}

in/out

Transformation

’MapVariantTo

ConcreteClass’

{1}

Map-

Variant

left

right

Map-

Variant

left

right

Annotation

’Duplication’

CAt-

tribute

’attrs2’

from

in/out

out

from

out

out out

in/out

out in

in out

out

from

in/out

out

from out

out

THistory Graph

Transformation G(bold) current History

Graph Element G(grey) consumed

History Graph Element

Transformation

’in’ and ’out’ links

’to’ and ’from’ links between

GraphElement and Edge in

GXL (grey means consumed)

attrs1:

FIRSTNAME, LASTNAME, TITEL,

STREET, CITY, POSTAL, HOME

PHONE, BIRTHDATE

attrs2:

STREET ADDRESS, CITY,

POSTAL CODE, HOME

PHONE, WORK PHONE

:entities

out in

out

from

Precondi-

tions

in/out

from

GRAPH-BASED HISTORY MECHANISM

153

GRAPH-BASED HISTORY MECHANISM

The process starts with the creation of the start graph by passing it to the History Graph

Mechanism (addGraph(startGraph)). During this operation a copy of the startGraph is

made and stored. Then, a sequence of transformations is performed. These transforma-

tions are logged in the History Graph. This is done by calling function addTransforma-

tion(leftGraph_i, rightGraph_i, name_i), where leftGraph_i and rightGraph_i denote in-

and output of the transformation with name_i, respectively. For each call, the History

Graph Mechanism automatically creates the transformation node and returns trafo_i, the

signature (name and input parameters) and the timestamp, to the environment.

At any point in time, the valid graph before the execution of trafo_j (currentGraph) or the

transformation history ({trafo_1, ..., trafo_n}) can be requested. When invoking the cre-

ateHistory() operation, the list of all valid applied transformation {trafo_1, ..., trafo_n} is

Figure 5.10: Interaction with the History Graph Mechanism

Tool/Reengineer

addGraph (startGraph)

History Graph Mechanism

...

addTransformation (leftGraph_n, rightGraph_n, name_n)

addTransformation (leftGraph_1, rightGraph_1, name_1)

createHistory()

( {trafo_1, ..., trafo_n} )

getGraph(trafo_1)

( trafo_1 )

( trafo_n )

( startGraph )

addChanges(startGraph, changedStartGraph)

propagateChanges()

( newCurrentGraph )

getGraph(trafo_j)

( currentGraph )

MODEL CONSISTENCY MANAGEMENT

154

returned. Based on the transformation list, any intermediate graph can be restored by call-

ing the getGraph(trafo_i) operation.

When an iteration of the reengineering process is required, the initial startGraph can be

restored by invoking getGraph(trafo_1). Since trafo_1 is the first transformation, the valid

graph before the execution of the first transformation is the initial startGraph. Subse-

quently, the user can perform all necessary additions and/or modifications to the initial

graph of the legacy systems and submit the changes (addChanges(startGraph, changed-

StartGraph)). Likewise, the user may perform changes to any intermediate graph by re-

storing it, by editing it and finally by submitting the changes. Now the consistency can be

re-established through an incremental undo of all transformations that depend on the

changes to the start graph (propagateChange()). The result of this operation is the current

valid graph newCurrentGraph.

The algorithm in Figure 5.11 describes the work performed by the History Graph Mech-

anism. At the beginning, it receives the start graph from the reengineering tool. This start

graph (startGraph) becomes the initial history graph (HG) and further help graphs are ini-

tialised. Then, the History Graph Mechanism performs a Loop where iterations can take

place. The History Graph Mechanism logs all invoked transformations in HG. If process

iterations occur, the History Graph Mechanism recovers the requested graph (resultG) and

returns it (line 9). Line 7 specifies that the requested graph resultG can be recovered by

taking all graph elements from HG that:

• do not represent transformations (type(g)≠’Transformation’),

• do not participate in any transformation (g.-in->=∅) or

• are the output of transformations,

- which have been executed before the given transformation

(t1.timestamp < trafo_j.timestamp) and

- which are not consumed by a previous transformation.

All isolated graph elements that are again valid in resultG are restored (line 8 - restoreEle-

ments(resultG)). How graph elements can be restored is described in Section 5.2.3.

Once the modifications to the graph are performed, they are sent back as modified graph

changedG to the History Graph Mechanism (line 11). Then the affected transformation

are calculated and returned as affectedT (line 12). The method calculateAffectedTransfor-

mation( HG, originalG, changedG) is presented in Figure 5.12.

The affected transformations (affectedT) and all graph elements that are exclusively out-

put of the affected transformations ( { g∈HG | ∃ t∈affectedT, g := (t.-out-> ∧ ¬(t.<-in-)) } )

are removed (line 14). Next, all isolated input graph elements that are valid again are re-

stored; and all superfluous input graph elements of deleted affected transformations are

removed (line 15). The resulting current graph updatedG (line 16) is processed in analogy

GRAPH-BASED HISTORY MECHANISM

155

GRAPH-BASED HISTORY MECHANISM

with resultG (line 7) and then retured (line 17). The only difference is that no given trans-

formation has to be taken into account. Finally, a the list of undone transformation dis-

played (line 18).

Figure 5.12 shows how the input/output dependencies in the History Graph are used to de-

tect all transformation applications which are affected by the modifications in the graph

changedG compared to the graph resultG. It requires the calculation of the changeSet

which represents the set of all graph elements that have been modified, cf. diff (originalG,

changedG) in line 3. First, all transformations that have a changed graph element as input

Algorithm Undo History Graph Mechanism

1. Graph HG, resultG, updatedG, affectedT;

2. addGraph (startGraph);

3. HG := startGraph; // startGraph becomes the initial History Graph

resultG := HG; // initialise the result graph

updatedG := HG; // initialise the unpdated graph

affectedT := ∅; // initialise the affected transformations

4. Loop

5. addTransformation (leftGraph_i, rightGraph_i, name_i);

// log all transformations in HG

6. if (getGraph(trafo_j) )// if the graph before trafo_j is reqeusted

7. { resultG := { g∈HG | type(g)≠’Transformation’ ∧

( (g.-in->=∅) ∨ ( ∃ t1∈HG, t1:=(g.<-out-) | ∀ t2∈HG, t2:=(g.-in->) :

type(t1)=’Transformation’ ∧ type(t2)=’Transformation’ ∧

t1.timestamp < trafo_j.timestamp ∧

t1.timestamp >= t2.timestamp ) ) };

// recover the current graph before trafo_j

8. resultG := restoreElements(resultG);

9. return resultG;

10. };

11. addChanges(originalG, changedG) // receive the changed graph

12. if ( propagateChanges() )// if request to propagate changes is received

13. { affectedT := calculateAffectedTransformation( HG, originalG, changedG );

// determine all affected transformations

14. HG := HG–(affectedT ∪ { g∈HG | ∃ t∈affectedT, g := (t.-out-> ∧ ¬(t.<-in-)) } );

// undo affected transformations

15. HG := restoreElements(HG); // restore or remove isolated elements

16. updatedG := { g∈HG | type(g)≠’Transformation’ ∧

( (g.-in->=∅) ∨ ( ∃ t1∈HG, t1:=(g.<-out-) | ∀ t2∈HG, t2:=(g.-in->) :

type(t1)=’Transformation’ ∧ type(t2)=’Transformation’ ∧

t1.timestamp >= t2.timestamp ) ) };

// compute updated current graph

17. return updatedG;

18. displayUndoReport (affectedT);

19. };

20. EndLoop

End

Figure 5.11: Undo History Graph Mechanism

MODEL CONSISTENCY MANAGEMENT

156

are added to the affected transformations set affectedT (line 4). Then, all subsequent trans-

formations, i.e., transformations that are reachable from affected transformations and that

have been executed after the affected transformations, are added to the affected transfor-

mation set affectedT (line 7). This is done as long as no new affected transformation is

found (line 5, 6 and 8). Finally, the set of affected transformation is returned (line 9).

Figure 5.13 illustrates the History Graph Mechanism applied on the case study snapshot

from Figure 5.9. We use the representation of Figure 5.8 for layout reasons.

We assume that the History Graph exists and start with the changePropagation() call (cf.

line 11 in Figure 5.11). The lifted index ( )represents a changed graph element (Variant

’Patient’ in Figure 5.9). Starting from this graph element all transformations depending on

it are marked affected as described in Figure 5.12. This is shown by the hand with a point-

ing index ( ). We marked all affected graph elements with a circle ( ) to enhance

the readers understanding. In Figure 5.13 all transformations except ’2’ are affected.

Transformation ’2’ does not follow an affected transformation, i.e., it has no input graph

element that is affected.

The removed graph elements are highlighted with a fist ( ) in Figure 5.14. Note that

not all output graph elements of affected transformations are deleted. Output graph ele-

ments that are also input graph elements of an affected transformation are retained. Fur-

ther, an output graph element of an affected transformation, which is also an output graph

element of another not affected transformation, is retained.

Obviously, in both cases the graph elements are part of the currently valid graph. Further-

more not all consumed input graph elements are retained. Consumed input graph elements

that are isolated, i.e., which have no link to any graph element, are deleted.

Graph calculateAffectedTransformation( Graph HG, Graph originalG, Graph changedG )

{

1. Graph changeSet, affectedT, affectedT’;

2. changeSet := ∅; // initialise the changed elements set

affectedT := ∅; // initialise the affected transformations

affectedT’ := affectedT; initialise help graph

3. changeSet := diff (originalG, changedG);

// determine the changes between the two graphs resultG and changedG

4. affectedT := { t∈HG | type(t) = ’Transformation’ ∧ ( ∀ g∈changeSet, t = g.-in-> ) };

5. Repeat // collect all following transformations

6. affectedT’ := affectedT;

7. affectedT := { t∈HG | type(t) = ’Transformation’ ∧

( ∀ t’∈affectedT, t = t’.-out->.-in-> ∧ t.timestamp >= t’.timestamp ) };

8. Until (affectedT == affectedT’)

9. return affectedT

}

Figure 5.12: Determine affected transformations

GRAPH-BASED HISTORY MECHANISM

157

GRAPH-BASED HISTORY MECHANISM

in/out

out

in/out

from

out out

out

out to

from

out

in/out

in from

from

out out

out

out to

from

from from

in/out

out

Figure 5.13: Affected History Graph: simple undo

transformation

(bold) current

element

(grey) consumed

element

Gin

4out

in/out

out

out out

from

in/out

out

from

out

out G

from

Gin/out

changed

element

affected

transformation

affected

element

’in’ and

’out’ links

’to’ and ’from’ links

(grey means consumed)

in/out

out

in/out

from

out out

out

out to

from

out

in/out

in from

from

out out

out

out to

from

from from

in/out

out

Figure 5.14: Updated History Graph: simple undo

transformation

(bold) current

element

(grey) consumed

element

’in’ and

’out’ links

’to’ and

’from’ links

(grey means

consumed)

Gin

4out

in/out

out

out out

from

in/out

out

from

out

out G

from

Gin/out

changed

element

elements and

transformations

to be deleted

MODEL CONSISTENCY MANAGEMENT

158

The remaining graph elements (cf. Figure 5.9) are the Variant ’Deceased’, the Class ’De-

ceased’ and the corresponding mapping nodes (left, MapVariant, right). Additionally, the

Transformation ’MapVariantToConcreteClass’ remains. Further the CAttributes ’attrs1’

inclusive the edges cattributes, the Preconditions, the Variant ’Patient’ and the CAttributes

’attrs2’ are not deleted.

5.2.3 Selective Undo History Graph Mechanism

The impact of one little change seems to be massive, but considering the whole palliative

care information system only a small part is affected. Nevertheless, transformations were

undone that not needed to. Obviously, the splitClass transformation applied to the Class

’Deceased’ is still valid since it is not influenced by Variant ’Patient’. To reduce the loss

of reengineers work further, we elaborate a selective undo History Graph Mechanism.

The main difference to the single undo History Graph Mechanism is that the affected

transformations are not simply undone, i.e., removed, but they are reevaluated. The His-

tory Graph Mechanism exports the transformation to be reevaluated to the environment

(tool/reengineer) and gets the successful or failed transformation back. The algorithm re-

mains the same as the one of Figure 5.11 except to line 14; restoreElements(HG) can be

discarded because transformations are reevaluated, i.e., the consumed input graph ele-

ments of transformations that will be valid again are restored before the transformations

reevaluation. Further, calculateAffectedTransformation( HG, resultG, changedG) of

Figure 5.12 is replaced by reevaluateTransformation( HG, resultG, changedG), cf. Figure

5.15.

Lines 1 to 10 are in analogy with the lines 1 to 9 in Figure 5.12. A changedSet is calcu-

lated and all depending transformations are collected in affectedT. These directly affected

transformations have to be validated, i.e., are then reevaluated.

However, some of these transformations depend on input graph elements which were con-

sumed or which values were changed by a transformation. These isolated graph elements

have to be reproduced before the dependent transformation can be reevaluated. Reproduc-

ing these graph elements means to reevaluate all transformations that have been applied

to produce them. Some of the transformations that have to be reevaluated might not have

been marked because they are not directly affected by the changed graph elements. Hence,

we need a further affected transformation collection, i.e., collecting the indirectly affected

transformation. This is done in lines 11 to 14. Note that the reapplication of the transfor-

mations also reproduces the links between graph elements.

The selection of the transformations, which have to be reevaluated, for the graph elements

that have to be restored in the simple undo case is done in analogy. After collecting all

isolated input graph elements that are valid again, the indirectly affected transformations

are collected and then reevaluated. In case that the graph element to be restored is an ini-

tial graph element, i.e., member of the start graph, we use the copy of the start graph. The

GRAPH-BASED HISTORY MECHANISM

159

GRAPH-BASED HISTORY MECHANISM

Graph reevaluateTransformation(Graph HG, Graph originalG, Graph changedG )

{

1. Graph changeSet, affectedT, reevaluateT, dependendT, dependendT, affectedT’;

2. changeSet := ∅; // initialise the changed elements set

affectedT := ∅; // initialise the affected transformations

reevaluateT := ∅; // initialise the transformations to be reeavluate

dependentT := ∅; // initialise the dependend transformations

failedT := ∅; // initialise the affected transformations

affectedT’ := affectedT; initialise help graph

3. Transformation oldT, newT;

4. oldT := ∅;

newT := ∅;

5. changeSet := diff (originalG, changedG);

// determin the changes between the two graphs resultG and changedG

6. affectedT := { t∈HG | type(t) = ’Transformation’ ∧ ( ∀ g∈changeSet, t = g.-in-> ) };

7. Repeat // collect all directly affected transformations

8. affectedT’ := affectedT;

9. affectedT := { t∈HG | type(t) = ’Transformation’ ∧

( ∀ t’∈affectedT, t = t’.-out->.-in-> ∧ t.timestamp >= t’.timestamp ) };

10. Until (affectedT == affectedT’)

11. Repeat // collect all indirectly affected transformations

12. affectedT’ := affectedT;

13. affectedT := { t∈HG | type(t) = ’Transformation’ ∧

( ∀ t’∈affectedT, (t’.<-out-)∉changeSet : t = t’.<-out-.<-in- };

14. Until (affectedT == affectedT’)

15. Repeat

16. reevaluateT := { t∈HG | type(t) = ’Transformation’ ∧

( ∀ t’∈HG, t’ = t.<-out-.<-in- : t’∉affectedT) };

// select all transformations that do not follow an affected one

17. oldT := t∈reevaluateT; // select one transformation to be reevaluate

18. newT := reevaluate (HG, oldT);

19. if (newT ≠ ∅) // succesful reapplied transfomation oldT

20. { replace (oldT, newT); // replace oldT by newT

21. affectedT := affectedT-{oldT}; // clean up affectedT

22. } else // failed reapplied transfomation oldT

23. { Repeat // collect all depending transformations

24. affectedT’ := dependentT;

25. dependentT := { t∈affectedT | ∃ g∈HG, g=oldT.-out-> ∧ g≠oldT.<-in- :

t = g.-in-> }

26. Until (dependentT == affectedT’)

27. affectedT := affectedT-dependentT; // clean up affectedT

28. failedT := failedT ∪ {oldT} ∪ dependentT;

29. }

30. Until (affectedT == ∅)

31. return failedT;

32.}

Figure 5.15: Reevaluate transformations

MODEL CONSISTENCY MANAGEMENT

160

unique identifiers of GXL enable an unambiguous matching and update of the graph ele-

ments in question.

Next, the transformations that do not depend on another affected transformation are cho-

sen (line 16). Reevaluating transformations that do not have actual input graph elements

is superflous. Out of the transformations with actual input, one transformation oldT is se-

lected and reevaluated (lines 17 and 18). Reevaluating a transformation means to apply

the corresponding transformation anew to the current (maybe changed) graph elements.

The reevaluation itself is performed by the environment (tool or reengineer) in order to

check the application constraints. The new transformation newT is returned. Note that for

some reevaluations initial graph elements may be restored first by the unambiguous

matching and update. Finally, when the reevaluated transformation is successfully reap-

plied (newT ≠ ∅), it is replaced by the new transformation (line 20) and removed from the

set of affected transformations (line 21).

In the case that the transformation has lost its applicability (line 22), all dependent trans-

formations dependentT are collected (lines 23 to 26). The dependent transformations are

also no longer applicable because their input is not revalidated. Subsequently, the failed

transformation oldT and these dependent transformations dependentT are removed from

the affected transformations (line 27) and added to the failed transformations failedT (line

28). The reevaluation finishes when all affected transformations were reevaluated (line

30). Instead of all transformations that are affected by the changes, only transformations

that are not longer applicable are returned to the algorithm of Figure 5.11 (line 31).

To illustrate the History Graph Mechanism process Figure 5.16 to Figure 5.18 show the

selective undo applied on the case study snapshot from Figure 5.9. Again we use the rep-

resentation of Figure 5.8 for layout reasons.

Figure 5.16 shows the directly affected transformations (marked with ) and graph el-

ements (marked with ) by the change of Variant ’Patient’ (marked with ). Entity

’Patient’ is not longer composed of the single Variant ’Patient’ but of three Variants: ’Pa-

tient’, ’Discharged’ and ’Deceased’, cf. Figure 3.8. Thus the Class ’Patient’, which is the

output of transformation with timestamp ’1’, no longer contains the same attributes prior

to the change. Note that the directly affected transformations are the same as the affected

transformations in the simple undo case. Therefore, Figure 5.13 and Figure 5.16 are iden-

tical.

Figure 5.17 shows all indirectly affected transformations( ). The indirectly affected

graph elements are marked in consequence ( ).

The marked transformations are then reevaluated in the predefined order of their input/

output dependencies. Figure 5.18 shows the reevaluation. In this case, the first reevaluat-

ed transformation is either transformation ’1’ or transformation ’2’. Transformation ’3’

depends only on transformation ’1’ and thus can precede transformation ’2’. Transforma-

GRAPH-BASED HISTORY MECHANISM

161

GRAPH-BASED HISTORY MECHANISM

in/out

out

in/out

from

out out

out

out to

from

out

in/out

in from

from

out out

out

out to

from

from from

in/out

out

Figure 5.16: History Graph: directly affected transformations

transformation

(bold) current

element

(grey) consumed

element

’in’ and

’out’ links

’to’ and

’from’ links

(grey means

consumed)

Gin

4out

in/out

out

out out

from

in/out

out

from

out

out G

from

Gin/out

changed

element

directly affected

transformation

affected

element

in/out

out

in/out

from

out out

out

out to

from

out

in/out

in from

from

out out

out

out to

from

from from

in/out

out

Figure 5.17: History Graph: indirectly affected transformations

transformation

(bold) current

element

(grey) consumed

element

’in’ and

’out’ links

’to’ and

’from’ links

(grey means

consumed)

Gin

4out

in/out

out

out out

from

in/out

out

from

out

out G

from

Gin/out

changed

element

directly affected

transformation

affected

element

indirectly

affected

transformation

MODEL CONSISTENCY MANAGEMENT

162

tion ’3’ is successfully reevaluated because the involved attributes remained in Class ’Pa-

tient’. No other transformation is reevaluate prior these three transformations because all

three remaining affected transformations depend on them. Next transformation ’4’ is re-

evaluated. Because of the changes (missing attributes) of the new Variant ’Patient’, the ap-

plication of transformation ’4’ (createDuplication) fails. Subsequently transformation ’5’

(createAssociation) is no longer applicable because it depends on transformation ’4’. This

is shown by the symbol. Transformations ’6’ is not removed from the affected trans-

formation set although it depends on transformation ’4’. The reason is that the linking

graph element is also input of transformation ’4’ and thus a reevaluation can be success-

ful. Indeed, in our case splitClass (transformation ’6’) is still applicable. Transformation

’6’ is successfully reevaluated because Class ’Deceased’ did not changed. Finally, all

transformations that are no longer applicable and exclusively their output graph elements

are deleted (marked with ).

In a first time, the impact seems to be even more massive then for the simple undo case.

Again, considering the whole palliative care information system only a small part is af-

fected. The advantages of reevaluating the transformations are obvious comparing

Figure 5.14 and Figure 5.18. In case of the simple undo all transformations are undone ex-

cept transformation ’2’ whereas in the selective undo case only transformations ’3’ and

’5’ were undone.

in/out

out

in/out

from

out out

out

out to

from

out

in/out

in from

from

out out

out

out to

from

from from

in/out

out

Figure 5.18: History Graph: reevaluated transformations

transformation

(bold) current

element

(grey) consumed

element

’in’ and

’out’ links

’to’ and

’from’ links

(grey means

consumed)

Gin

4out

in/out

out

out out

from

in/out

out

from

out

out G

from

Gin/out

changed

element

elements and

transformations

to be deleted

reapplied

transformations

failed

transformations

no longer

applicable

transformations

GRAPH-BASED HISTORY MECHANISM

163

GRAPH-BASED HISTORY MECHANISM

5.2.4 Composed History Graph Mechanism

So far the History Graph Mechanism is based on one single graph which traces all the

transformations. To further sustain iterative and explorative processes we introduce the

composed History Graph Mechanism. We get following benefits from the composition of

History Graphs:

• branches

Try-outs during iteration are possible, i.e., from a common start graph several

branches can be processed in parallel. To get the best possible result the same task

can be performed by two distinct reengineers and later be compared. The merging of

graphs resulting from different branches is not addressed for the moment.

• structured process

Each process activity can be attributed an own History Graph (as far as it is possi-

ble). Thus, a History Graph is self contained for each process activity. This enables

parallel work since the transformations are logged in different History Graphs, e.g.,

working in parallel on distinct graphs that are only united later. Working in parallel

on graphs that depend on each other is only meaningful if the impact of changes is

limited.

• impact reduction

A consequence of several smaller History Graphs is that the number of transforma-

tions to be reevaluated is reduced. The reason is that inside a History Graph reevalu-

ation no check is made if the output graph elements are „really“ affected. This could

be achieved by investigating if a graph element was changed by the reevaluated

transformation. This investigation automatically takes place for some transforma-

tions during the composition of History Graphs.

• prevention of scaling problems

We have shown that our approach is scalable [JSWZ02]. Nevertheless, scaling

problems may occur if a single History Graph stores all transformations for long-

lasting processes. The division into several smaller History Graphs prevents such

scaling problems.

Three composition kinds of History Graphs are provided by the composed History Graph

Mechanism, i.e., sequence, union and branch. Figure 5.19 shows these kinds in the figure

compartments B to D. Compartment A depicts a History Graph chronology: the start

graph is transformed into the current graph; both are included in the History Graph (com-

plete graph).

A sequence of History graphs is shown in compartment B. In the History Graph I, graph 1

is transformed into graph 2. In History Graph II, graph 2 is transformed into graph 3. The

sequence is realised by copying the current graph of History Graph I (graph 2) and using

it as start graph of History Graph II.

MODEL CONSISTENCY MANAGEMENT

164

In analogy, compartment C shows the union of History Graphs. The current graphs

(graph 2a and graph 2b) of two or more parallel History Graphs (History Graph I.a and

History Graph I.b) are united to one graph. This graph (graph 2a ∪ graph 2b) is the start

graph of the subsequent History Graph (History Graph II).

Finally, branching of History Graphs is achieved by duplicating a current graph (graph 2

of History Graph I) and introducing it as start graph in two or more subsequent History

Graphs (History Graph II.a and History Graph II.b), cf. compartment D.

The reevaluation in composed History Graphs works inside each History Graph like the

reevaluation in single History Graphs. Only the transition, i.e., the correspondence of cur-

rent and start graph, from one History Graph to another has to be determined. Figure 5.20

History Graph :

start graph current graph

sequence:

union:

graph 1 graph 2 graph 2 graph 3

graph 1a graph 2a

graph 1b graph 2b graph 2a ∪

graph 2b

graph 3

branch:

graph 1 graph 2

graph 2

graph 3a

graph 3b

Figure 5.19: Composed History Graph: overview

History Graph I History Graph II

History Graph I.a

History Graph I.b History Graph II

History Graph II.a

History Graph II.b

History Graph I

complete graph

transformations

graph

compartment A

compartment B

compartment C

compartment D

GRAPH-BASED HISTORY MECHANISM

165

GRAPH-BASED HISTORY MECHANISM

shows the reevaluation in case of the sequence of History Graphs. Two situations can oc-

cur. Firstly, the manual change takes place on graph 2. Then a (selective) undo is done

inside the HG II and HG I is not affected. Secondly, the manual change takes place in

graph 1. After reevaluation the graph 2 is updated to graph 2’. To transfer the changes,

graph 2’ is passed to the HG II and becomes the new start graph of HG II. This is done by

calling addChanges(graph 2 , graph 2’); graph 2 exist as copy in HG II and graph 2’ is the

updated current graph of HG I. In term of changes:

changes in start graph (HG II) =

diff ( old startGraph(HG II), updated currentGraph(HG I) )

For the two remaining composition kinds, we only present the changes:

• union:

- changes in start graph (HG II) =

diff ( old startGraph(HGII), updated currentGraph(HGI.a ∪ HGI.b) )

• branch:

- changes in start graph (HG II.a) =

diff ( old startGraph(HGII.a), updated currentGraph(HGI) )

- changes in start graph (HG II.b) =

diff ( old startGraph(HGII.b), updated currentGraph(HGI) )

Figure 5.21 shows an example of History Graph composition in the context of our case

study. The reengineering process starts with the three databases hospdata, bvmtdata and

outcomes_be. To trace and maintain all applied transformations, fourteen history graphs

are involved.

sequence: =>

graph 1 graph 2’ graph 2’ graph 3

Figure 5.20: Composed History Graph: reevaluation

HG I HG II

manual change (selective) undo transfered changes

transformations graph changed graph reevaluated

transformations updated graph

sequence: =>

graph 1 graph 2 graph 2 graph 3

HG I HG II

manual change (selective) undo

MODEL CONSISTENCY MANAGEMENT

166

Each database is parsed and represented as an initial physical schema. In Figure 5.21 we

only quote the qualifier of the schemas, e.g., physical schema is quoted with ’physical’.

Each physical schema is retrieved into a relational schema (HG I.a, HG I.b and HG I.c).

Then each relational schema is mapped to a conceptual schema (HG II.a, HG II.b and

HG II.c) that is refactored (HG III.a, HG III.b and HG III.c). Those transformations can take

place in parallel and thus three initial physical schemas are transformed in three concep-

tual refactored schemas.

The three refactored schemas are united into one composed schema. Since these refac-

tored schemas are disjoint, i.e., they do not have any inter-schema relationships this can

easily be done. The inter-schema relationships are investigated next. The composed sche-

ma is transformed into a completed schema (HG IV). At this point a single user or more

than one user may try out some redesign transformations on the same completed schema

and decide later which branch will persist. In the scenario of Figure 5.21 two branches are

created. The corresponding (redesign) transformations are logged in HG V.a and HG V.b,

hospdata

physical

hospdata

relational

HG I.a

hospdata

relational

hospdata

conceptual

HG II.a

bvmtdata

physical

bvmtdata

relational

HG I.b

outcomes_be

physical

outcomes_be

relational

HG I.c

bvmtdata

relational

bvmtdata

conceptual

HG II.b

outcomes_be

relational

outcomes_be

conceptual

HG II.c

hospdata

conceptual

HG III.a

hospdata

refactored

bvmtdata

conceptual

HG III.b

bvmtdata

refactored

outcomes_be

conceptual

HG III.c

outcomes_be

refactored

palliative

composed

HG IV

palliative

completed

Figure 5.21: History Graph Sequence Example

palliative

redesigned

HG V.a

palliative

completed

palliative

clustered

HG VI

palliative

redesigned

palliative

clustered

HG VII

palliative

classified

palliative

redesigned

HG V.b

palliative

completed

branch can

be continued

or aborted

information flow

start & current graph

History

Graph

GRAPH-BASED HISTORY MECHANISM

167

GRAPH-BASED HISTORY MECHANISM

respectively. We assume that the branch logged by HG V.a is chosen. HG VI maintains the

transformation related to clustering. Finally, the classification operations are retained in

HG VII.

One benefit of the composed History Graph Mechanism is the impact reduction. This is

shown in Figure 5.22 to Figure 5.24 where we replay the reevaluation of Section 5.2.3.

We start with the changed Variant ’Patient’ in HG II.a, cf. Figure 5.22. The only affected

transformation is transformation ’1’ because all other transformations, i.e., ’2’ to ’6’ are

located in other History Graphs. Transformation ’MapVariantToConcreteClass’ (transfor-

mation ’1’) is successfully reapplied.

During refactoring, Class ’Patient’ was split into Class ’Patient’ and Class ’Address’.

Since transformation ’1’ was reapplied, changes occurred, but Class ’Patient’ itself was

not changed. Thus the transfer of changes from HG II.a to HG III.a do not affect any input

graph element of transformation ’3’, i.e., splitClass, cf. Figure 5.23.

Even if in HG III.a no transformation was affected, the transfer of changes is pursued. In-

deed, the updated graph elements of HG II.a have an impact on HG IV. The mapping of

the changed Variant ’Patient’ engender changes in the Preconditions of transformation ’4’,

cf. Figure 5.24. Like in Figure 5.18, the application of transformation ’4’ (createDuplica-

out

from

Figure 5.22: Reevaluation of HG II.a

transformation

(bold) current

element

(grey) consumed

element

’in’ and

’out’ links

’to’ and

’from’ links

in/out

out

from

changed

element

directly affected

transformation

affected

element

reapplied

transformations

in/out

in from

from

out out

out

out to

from

in/out

Figure 5.23: Reevaluation of HG III.a

Gin

from

nG transformation

(bold) current

element

(grey) consumed

element

G’in’ and

’out’ links

’to’ and ’from’ links

(grey means consumed)

MODEL CONSISTENCY MANAGEMENT

168

tion) fails and transformation ’5’ (createAssociation) is no longer applicable. Thus the

concerned graph elements are deleted.

Class ’Deceased’ is affected, but not changed. In analogy with HG III.a, no impact on

HG V.a and HG V.b is transferred. Note that the changes still have to be transferred further.

In our case the deletion of the duplication Association ’deathdate’ affects the clustering

and consequently HG VI. We refer to Section 4.2.1 for details of the clustering.

Comparing the reevaluation between the single History Graph Mechanism and the com-

posed History Graph Mechanism in our example, only 2 instead of 5 transformations are

reevaluated. The reason therefore is the duplication of graph elements in the different His-

tory Graphs. On the one hand, a check if the graph element has really changed is done by

transferring the changes to the subsequent History Graph. On the other hand, graph ele-

ments do not have to be restored because they exist in their current state in the correspond-

ing History Graph.

5.3 Tool Support

The GXL-based History Graph Mechanism tool support is joint work with the netlab

Group located at the University of Victoria, Canada. In our context it is located in the

REDDMOM project. REDDMOM is mainly based on the FUJABATS and especially on parts

of the FUJABARE. The History Graph Mechanism is independent from the tool support for

the undestading phase and the adaption phase. It is accessed via an API. Figure 5.25 shows

an overview of the History Graph Mechanism tool support for model consistency man-

agement. The filled tool parts are currently under development.

from

in/out

out

Figure 5.24: Reevaluation of HG IV

4out

in/out

in/out in

out

out G

from

Gin/out

Gtransformation

(bold) current

element

(grey) consumed

element

’in’ and

’out’ links

’to’ and

’from’ links

(grey means

consumed)

changed

element

elements and

transformations

to be deleted

directly affected

transformation

failed

transformations

no longer

applicable

transformations

affected

element

from

TOOL SUPPORT

169

TOOL SUPPORT

External tool (prerequisites):

A tool has to meet certain requirements to be able to interface with our History Graph

Mechanism. We have minimised these requirements as much as possible to enable the in-

tegration with many different environments. Basically, a tool has to be able to

• import and export its internal abstract syntax graph structure

(preferably in GXL),

• report for each transformation on the abstract syntax graph

- the graph elements that represent the input of the transformation (in GXL)

- the graph elements that represent the output of the transformation (in GXL), and

• provide an API that enables the reevaluation of a transformation specified by its

name and with given input (in GXL).

GXLHISTORYGRAPH:

The API of the GXLHISTORYGRAPH comprises following operations:

• addGraph(graph)

History Graph initialisation, makes a graph copy.

• addTransformation(leftGraph_i, rightGraph_i, name_i)

build the History Graph, returns transformation ’trafo_i’.

• createHistory()

returns the list of all applied transfomations {trafo_1, ..., trafo_n}.

• getGraph(trafo_j)

calculates the valid graph before the application of transformation trafo_j.

• addChanges(graph1, graph2)

transfers the changes between graph1 and graph2 into the History Graph.

• propagateChange()

determines the impact of the changes and returns a „change-consistent“ graph.

Figure 5.25: Architecture of the History Graph Mechanism tool support

External tool

(e.g. FUJABARE) API

GXLHistoryGraph

(simple undo)

(selective undo)

Composed History

Graph

API information flow

MODEL CONSISTENCY MANAGEMENT

170

All graphs passed to GXLHISTORYGRAPH must be in GXL format.

The simple undo determine the affected transformations and undo them. This is done by

deleting the affected transformations and the corresponding graph elements

The selective undo only undo the transformations that cannot be successfully reevaluated.

Composed History Graph:

The API of the Composed History Graph comprises following operations:

• createHistoryGraph()

creates a new History Graph and returns it. All History Graph (in our case GXLHis-

toryGraph) API operations are available.

• createHistory()

returns the list of all History Graphs {HG I, ..., HG N}.

• getHistoryGraph(HG x)

returns the History Graph HG x.

• propagateChange()

transfers the changes to all History Graphs.

All History Graph API operations are passed to the corresponding GXLHISTORYGRAPH

operations. The GXLHISTORYGRAPH can be replaced by other History Graph implementa-

tions. If an external tool uses a format different from GXL, two solutions are possible.

Firstly, a new History Graph that understands this format has to be implemented. Second-

ly, a converter from this format into GXL and vice versa is needed.

5.4 Related work

The described research is related to various existing approaches maintaining inter-docu-

ment consistency in the software (forward) engineering process. Perhaps the most prom-

inent technology in this regard are mechanisms for version and configuration

management like, e.g., the concurrent versions system [CVS]. In principle, mechanisms

like the concurrent versions system can be used to create version histories of recovered

design documents. However, such version histories do not consider the actual structural

dependencies among the transformations performed in the abstraction process. Instead,

the version history merely reflects a time-oriented view on intermediate stages reached in

the (last iteration of the) recovery process. Consequently, traditional version management

systems can only provide limited support for iteration.

Lefering and Schürr suggest a dedicated formalism called Triple Graph Grammars for

specifying structure dependencies among abstract syntax graphs [SL96]. This formalism

is used in the IPSEN project for generating integration tools that maintain consistency and

propagate changes from software design documents to the implementation code and vice

RELATED WORK

171

RELATED WORK

versa [Nag96]. Nowadays, most commercially available development environments pro-

vide (limited) support for change propagating between software documents on different

levels of abstraction.

Unfortunately, very few reengineering tools provide similar support for propagating

changes of the knowledge about a legacy system. In most current tools, the legacy system

analysis and information extraction, the information transformation and abstraction, and

the system and information extension (integration) cannot be iterated without losing the

interactive work performed. They impose a strictly phase-oriented, waterfall-like reengi-

neering process, without the support for iteration. This is an important limitation in prac-

tice, as iterations between reverse engineering and integration steps occur frequently:

when a reengineer learns more about the abstract design of a legacy system, (s)he often

refutes some initial assumptions and does further investigations. Moreover, reengineering

projects might have durations from several months up to years. Urgent (on-the-fly) mod-

ifications of the original system during this period have to be reflected in the target sys-

tem, which, of course, leads to the demand for iteration in the reengineering process. This

problem of consistency management in process iteration is only addressed in DB-MAIN

[HHE+99] and VARLET [Jah99].

DB-MAIN logs the history of all transformations during the reengineer’s interaction.

Each performed transformation is recorded precisely and completely to make its inversion

possible. The history has to be monotone and linear. This makes it possible that changes

can be propagated from one schema level to another. In case that a transformation record-

ed in the history becomes invalid the reengineer has to resolve the inconsistencies manu-

ally. DB-MAIN does not allow the reengineer to get back (and modify) the initial state of

the legacy system. Exploration (branches) and concurrent work are not supported.

The VARLET environment supports iterative data analysis and abstraction processes

[JW99]. VARLET logs all schema transformations invoked by the user in an automated

logbook (including their pre- and post-conditions and interdependencies). Moreover, the

initial, low-level representation of the legacy schema remains available to the user. When-

ever the initial representation of the legacy system is changed, VARLET uses the logged

transformation dependencies to determine which interactive operations are affected by the

modifications. The pre-conditions of these transformations are then reevaluated and only

those transformations that fail this test are undone. The consistency of the extracted infor-

mation on the legacy system and its interactively created abstraction has been re-estab-

lished. This consistency mechanism is encoded in VARLET and thus cannot be reused by

other tools.

Two related approaches that address change propagation are the Evolvable View Environ-

ment [NR99] and ArchDiff [vv02]. Nica and Rudensteiner [NR99] present an approach

that updates view extents after view synchronisation driven by schema changes. Similar

to our approach, they exploit the knowledge: how the view definition was synchronised

MODEL CONSISTENCY MANAGEMENT

172

and which changes were performed on the view definition after synchronisation. Van der

Westhuizen and van der Hoek [vv02] present algorithms that can be used for understand-

ing architectural changes and propagating those changes among individual architectures

in the product line. The strength of the algorithms lies in their use of a simple, XML-based

representation for capturing architectural changes. Different algorithms build upon this

representation to determine not only those architectural elements that have been added or

removed, but also those sets of elements that represent replacements within the architec-

ture.

5.5 Summary and Future Work

This chapter started with an overview of our model consistency management. The pre-

sented History Graph Mechanism enables to restore consistency between models that rep-

resent web information systems. We provided the background information for our graph-

based approach. Next we explained how model changes are propagated and discussed

successively simple undo and selective undo to restore consistency. Finally, we presented

how History graphs can be combined and the consequent tool support.

We allow composition and duplication (results in branches) of history graphs. This pro-

vides flexibility and sustains iterations, e.g. a parallel try in multiple branches followed

by choosing one branch. The choosing can be done after change propagation to all branch-

es. We recommend that all branches are erased and are not longer used to avoid confu-

sions and runtime overhead. The composition can be refined by passing subgraphs, but

this requires an exact definition of filter (user view). Future improvements would be the

splitting and merging of graphs. This would provide more flexibility, e.g., by merging

branches. Therefore transformations covering several graphs may be needed, what makes

a consistency management hard. Moreover, possibilities to suppress, add or replace a

transformation would increase the usability and support of iteration.

173

CHAPTER 6: CONCLUSIONS

Fundamental progress has to do with the reinterpretation of basic

ideas.

-- ALFRED NORTH WHITEHEAD

British mathematician, logician and philosopher (1861-1947)

6.1 Summary

The costs spent for reengineering are no garantee for successful reengineering. In

[BSS+99] ten reasons that reengineering efforts fail are presented. Seven reasons tackle

project management and technical issues. The remaining three reasons are matched by our

data-oriented reengineering process:

• „The organization does not have its legacy system under control.“ (reason 4)

We provide understanding of a web information system by data-oriented reverse

engineering and use clustering to partition the web information system.

• „Software architecture is not a primary reengineering consideration.“ (reason 6)

We consider relationships between data components and proposed architectural pat-

terns to (re)restructure the web information system.

• „There is no notion of a separate and distinct „reengineering process“. “ (reason 7)

We propose our data-oriented reengineering process that includes the management

of the models representing the web information system.

Our data-oriented process is composed of two phases: the understanding phase and the

adapting phase. Figure 6.1 shows the flexible tool support, which sustains iterations and

explorations, that we provide for these two phases. Basic information about the data is re-

gained by parsing the web information system and represented in Data Component Mo-

dels. These Data Component Models are interactively completed, abstracted and

transformed by the reengineer. The web information system can then be extended with

further models. From the regained and new models an access layers and new applications

are generated.

One of the major contributions of this dissertation is the classification and retrieval of re-

lationships in web information systems. We combined existing approaches and tools to re-

cover the data structure of a web information system particularly the data dependencies

between the data components. During this reverse engineering process uncertainty is ex-

pressed with fuzzy logic and inconsistencies are tolerated.

CONCLUSIONS

174

Once models of the data components are reverse engineered, these retrieved models are

redesigned and new functionality, i.e., schemas and/or applications, is modeled. To this

end, we provide clustering of the Data Component Models followed by classification of

the data components. We presented clustering strategies depending on the chosen integra-

tion strategy. To (re)structure the data components we elaborate four architectural pat-

terns.

Web

Information

System

1. Understanding:

Data-oriented Reverse

Engineering

2. Adapting: Data Component Extension

Maintenance:

Model-driven

Consistency

Management

Figure 6.1 Tool Support for the Data-oriented Reengineering process

Data

Component

Models

New

Applications

Data Model

Parser

Code Parser

Java Slicer

Island Grammar Parser

Pattern

Specification

Editor

Pattern

Instance

Retrieval

Triple-

Graph-

Grammar

Editor

Mapping

Rules

EER

Editor

Tranformation

Editor

UML

Editor

Generator

Dobs /

DoLittle

Pattern Definition

and Use Editor

Clustering Tool

GXLHistoryGraph

Composed

History Graph

legacy

system

new

applications

database schemas and partial

application models

data access information flow model transformations model correspondances

Domain Experts

and Reengineers

TRANSFERABILITY OF RESULTS

175

TRANSFERABILITY OF RESULTS

After adapting and extending the web information, we provide code generation facilities.

Firstly, based on the maintained dependencies between the physical schemas and concep-

tual, object-oriented schemas, we generated transactional object-oriented access layers

for the legacy databases. Secondly, for newly modeled data components we generated

Java code that accesses the legacy system via the transactional access layer. The aim is to

enable the construction of data component prototypes by model execution.

We maintain the knowledge of the system by model consistency management. We make

use of graph transformations for the model-based reverse engineering and the model-driv-

en development. We developed an incremental graph-based model change propagation

mechanism, i.e., the History Graph Mechanism, which preserve consistency between

models. This mechanism enables the selective undo of (graph) transformations that be-

come invalid because of changes performed after the transformations’ application.

Finally, we evaluated our data-oriented reengineering process and tool support with a case

study for proof of concepts. This case study deals with the reengineering of a web infor-

mation system in the health care domain.

6.2 Transferability of Results

Most of our results are not limited to the reverse engineering of legacy web information

systems like presented in the case study, i.e., composed of databases and object-oriented

application code. Our data-oriented reverse engineering process can be applied to other

data sources, e.g., file based, and non object-oriented application code, e.g., COBOL.

Similar challenges exist in architectural reverse engineering. As our approach is based on

a combination of data reverse engineering and design recovery, especially design pattern

instances recovery, it can be extended to architectural aspects: (1) either the extension can

be done by adding or exchanging tools for recover architectural models and dependencies;

or (2) the existing parsers and pattern definitions can be adapted or extended.

The relationship retrieval is based on a data related pattern instance recovery approach.

Business rules also depend on the information systems’ data. Defining business rules with

patterns anchored on the systems’ data structure enable their detection with our approach.

The generation of the transactional object-oriented access layer and the model consistency

management enables incremental web information system migration [BS95]. The access

layer builds an interface between the data and the applications. First, the applications’ data

access is shifted to exclusive data access via the generated access layer. Then, the physical

schemas can be changed or migrated in a controlled way and with limited impact into the

generated access layer. Last, data transfer can be automated due to the explicit data com-

ponent model maintenance. The model consistency management enables a long-lasting

maintenance strategy by keeping the models consistent during parallel system evolution.

Different engineering tools have different application domains (e.g., software product line

development, architectural recovery or program understanding) but the vast majority of

CONCLUSIONS

176

tools have in common that they are graph-based, i.e., they store models in abstract syntax

graphs. Most of the engineering processes supported by these tools are iterative. Our His-

tory Graph Mechanism can propagate changes occurring during iteration through any

graph structure. Therefore, every graph-based tool providing the simple History Graph

Mechanism API can benefit from incremental change propagation.

6.3 Open Problems

The case study revealed open problems that need further investigation. The definitions of

the usage and constraint relationship should be subdivided to enable more detailed anal-

ysis and clustering. The kinds of data dependencies used, e.g., replication or duplication,

would provide additional helpful knowledge to the reengineer. Another useful informa-

tion regarding the extension would be how the data dependency is used, e.g., encapsulated

in a method call or directly by an embedded SQL statement.

A related open problem is the clustering. First results during evaluation have demonstrat-

ed the clustering usefulness. The existing clustering approaches and tools are generally

focused on source code re-modularisation or schema abstraction. A deeper investigation

of data component clustering focusing on data distribution is desirable to reach more ac-

curate and reliable results.

Our approach is limited to the generation of transactional object-oriented access layers to

existing databases and of prototypes. Many practical scenarios exist that require platform-

specific application design and thus specific generation. A major improvement would be

the code generation for middleware technologies. Common solution for information sys-

tem integration middleware within organisations is distributed transaction processing

[XA94] as provided by transaction monitors [Hud94, Hal96] and middleware transaction

services [OTS98, JTS99]. A more scalable solution is reliable messaging [Hou98, Lew99]

which results in a reliable asynchronous processing scenario. Liebig and Tai propose an

integration of message-oriented transactions and distributed object transaction to middle-

ware mediated transactions [LT01].

Existing tools can be integrated by XML file exchange. Nevertheless, tighter tool integra-

tion, e.g., at the meta-model level like realised in the FUJABATS [BGN+03], is desirable.

Further, layout information is often lost during iteration and information transfer between

tools.

6.4 Future Directions

More possible information sources for the data-oriented reverse engineering process can

be found, like sketched in [HHH+00]. One valuable source of information would be to

record the user interactions on the system to find out and increase certainty of data depen-

dencies. Depending on the existent application the databases are queried in various forms.

Approaches that traces user interactions for system understanding are presented in

FUTURE DIRECTIONS

177

FUTURE DIRECTIONS

[MC01, ESS02]. We implemented a proprietary approach that propose optimisation mea-

sures depending on the kind of query and the duration a query needs in the different sys-

tem’s tiers [Rot03, Wad03]. This result can be transferred and used for web information

system understanding.

The retrieval can be further improved by adding and combining reverse engineering tech-

niques to our approach [MBPRR01]. In [DRT98] three reverse engineering patterns are

presented: (1) code duplication detection, (2) architectural extraction using prototyping

and (3) inferring hot spots (a variation in the application domain) from overridden meth-

ods. The code duplication detection can be used to weight the code fragments. Since we

provide prototyping facilities, a further iteration can take place applying this pattern. In-

ferring hot pots can help to recover hidden dependencies. „A hidden dependency is a re-

lationship between two seemingly independent components and it is caused by a data flow

inside a third software component.“ -- [YR01] Further the automatic extraction of inter-

faces like presented in [WML02] would help the reengineer when (re)structuring the web

information system.

The current information technology trends require the extension of legacy web informa-

tion systems in two directions. Firstly, integrating mobile devices is needed to meet the

ubiquitous computing requirements. Mascolo et al. present a data-sharing middleware for

mobile computing named XMIDDLE [MCZE02]. The sharing of XML documents across

heterogeneous mobile hosts is provided by XMIDDLE, allowing on-line and offline access

to data. Replication transparency is abandoned by XMIDDLE to achieve an acceptable per-

formance and scalability. Secondly, a seamless integration to the internet is still a chal-

lenge. Many approaches exist in this direction, e.g., [MG00, KK01, MMK02, SL02].

CONCLUSIONS

178

179

REFERENCES

Chapter 1: Introduction

[Aik96] P. Aiken. Data Reverse Engineering: Slaying the Legacy Dragon.

McGraw-Hill, 1996.

[Bas90] V.R. Basili. Viewing Maintainance as Reuse-Oriented Software Develop-

ment. IEEE Software, 7(1):19–25, September 1990. IEEE Computer Soci-

ety Press.

[BGN+03] S. Burmester, H. Giese, J. Niere, M. Tichy, J.P. Wadsack, R. Wagner,

L. Wendehals, and A. Zündorf. Tool Integration at the Meta-Model Level

within the FUJABA Tool Suite. In Proceedings of the Workshop on Tool-

Integration in System Development (TIS), Helsinki, Finland, (ESEC / FSE

2003 Workshop 3), pages 51–56, September 2003. online at http://

www.es.tu-darmstadt.de/english/events/tis/documentation/Proceed-

ings.p%df.

[BH99] I.T. Bowman and R.C. Holt. Reconstructing Ownership Architectures To

Help Understand Software Systems. In Proceedings of the 7th Interna-

tional Workshop on Program Comprehension, Pittsburgh, USA, pages 28–

37. IEEE Computer Society Press, May 1999.

[CC90] Elliot J. Chikofsky and James H. Cross II. Reverse Engineering and

Design Recovery: A Taxonomy. IEEE Software, 7(1):13–17, January

1990. IEEE Computer Society Press.

[EH99] V. Englebert and J.-L. Hainaut. DB-MAIN: A Next Generation Meta-

CASE. Journal of Information Systems - Special Issue on Meta-CASEs,

24(2):99–112, June 1999. Elsevier Science Publishers B.V (North-Hol-

land).

[GAK99] G.Y. Guo, J.M. Atlee, and R. Kazman. A Software Architecture Recon-

struction Method. In Proceedings of the 1st Working IFIP Conference on

Software Architecture, San Antonio, USA, pages 22–24. Kluwer Aca-

demic Publishers, February 1999.

180

[GCBM96] W.G. Grisworld, M.I. Chen, R.W. Bowdidge, and J.D. Margenthaler. Tool

Support for Planning the Restructuring of Data Abstractions in Large Sys-

tems. In Proceedings of the 4th ACM SIGSOFT Symposium on the Foun-

dations of Software Engineering, San Francisco, USA, pages 33–45. ACM

Press, October 1996.

[Gro01] Object Management Group. Model Driven Architecture (MDA) Edited by

Joaquin Miller and Jishnu Mukerji. online at http://www.omg.org/cgi-bin/

doc?ormsc/2001-07-01, 492 Old Connecticut Path, Framingham, MA

01701, USA, 2001.

[GW01] H. Giese and J.P. Wadsack. Reengineering for Evolution of Distributed

Information Systems. In Proceedings of the 3rd International Workshop on

Net-Centric Computing: Migrating to the Web, Toronto, Canada, pages

36–39. ACM Press, May 2001.

[Jah99] J.H. Jahnke. Management of Uncertainty and Inconsistency in Database

Reengineering Processes. PhD thesis, University of Paderborn, Pader-

born, Germany, September 1999.

[JGW02] J.H. Jahnke, D.M. German, and J.P. Wadsack. Architecturals Patterns for

Data Mediation in Web-centric Information Systems. In Proceedings of of

the 3rd ICSE Workshop on Web Engineering, Orlando, USA, May 2002.

[JNW00] J.H. Jahnke, J. Niere, and J.P. Wadsack. Automated Quality Analysis of

Component Software for Embedded Systems. In Proceedings of the 8th

International Workshop on Program Comprehension, Limerick, Irland,

pages 18–26. IEEE Computer Society Press, June 2000.

[JSWZ02] J.H. Jahnke, W. Schäfer, J.P. Wadsack, and A. Zündorf. Supporting Itera-

tions in Exploratory Database Reengineering Processes. Journal of Sci-

ence of Computer Programming, 45(2-3):99–136, November 2002.

Elsevier Science Publishers B.V (North-Holland), (Special Issue on Soft-

ware Maintenance and Reengineering).

[JW99a] J.H. Jahnke and J.P. Wadsack. Human-centered Reverse Engineering

Environments should Support Human Reasoning. In Proceedings of the 1st

Workshop on Soft Computing Applied to Software Engineering, Limer-

ick, Ireland, pages 77–84. Limerick University Press, April 1999.

181

[JW99b] J.H. Jahnke and J.P. Wadsack. Integration of analysis and redesign activ-

ities in information system reengineering. In P. Nesi and C. Vernoef, edi-

tors, Proceedings of the 3rd European Conference on Software

Maintenance and Reengineering, Amsterdam, The Nederlands, pages

160–168. IEEE Computer Society Press, March 1999.

[JW99c] J.H. Jahnke and J.P. Wadsack. Varlet: Human-Centered Tool Support for

Database Reengineering. In J. Ebert, B. Kullbach, and F. Lehner, editors,

Proceedings 1st of the Workshop Software Reengineering. Bad Honnef,

Germany, pages 149–156. Fachberichte Informatik, Universität Koblenz-

Landau, July 1999.

[JW00a] J.H. Jahnke and J.P. Wadsack. The Varlet Analyst: Employing Imperfect

Knowledge in Database Reverse Engineering Tools. In In Proceedings of

the 3rd International Workshop on Intelligent Software Engineering, Lim-

erick, Ireland, pages 59–69. IEEE Computer Society Press, June 2000.

[JW00b] J.H. Jahnke and A. Walenstein. Reverse Engineering Tools as Media for

Imperfect Knowledge. In Proceedings of the 7th Working Conference on

Reverse Engineering, Brisbane, Australia, pages 22–31. IEEE Computer

Society Press, November 2000.

[JWZ02] J.H. Jahnke, J.P. Wadsack, and A. Zündorf. A History Concept for Design

Recovery Tools. In Proceedings of the 4th European Conference on Soft-

ware Maintenance and Reengineering, Budapest, Hungary, pages 59–69.

IEEE Computer Society Press, March 2002.

[KSRL01] R.K. Keller, R. Schauer, S. Robitaille, and B. Lagu. The SPOOL

Approach to Pattern-Based Recovery of Design Components. In

H. Erdogmus and O. Tanir, editors, Advances in Software Engineering

Topics in Evolution, Comprehension, and Evaluation. Springer Verlag,

2001.

[Mil98] H.W. Miller. Reengineering Legacy Software Systems. Digital Press,

1998.

[MTW93] H. Müller, S. Tilley, and K. Wong. Understanding Software Systems

using Reverse Engineering Technology: Perspectives from the Rigi

Project. In Proceedings of the 1993 IBM/NRC CAS Conference, Toronto,

Canada, pages 217–226. IBM, October 1993.

[Nie03] J. Niere. Inkremetelle Mustererkennung. PhD thesis, University of Pader-

born, Paderborn, Germany, December 2003.

182

[NNWZ00] U.A. Nickel, J. Niere, J.P. Wadsack, and A. Zündorf. Roundtrip Engineer-

ing with FUJABA. In J. Ebert, B. Kullbach, and F. Lehner, editors, Proc of

2nd Workshop on Software-Reengineering (WSR), Bad Honnef, Ger-

many, pages 31–34. Fachberichte Informatik, Universität Koblenz-Lan-

dau, August 2000.

[NSW+02] J. Niere, W. Schäfer, J.P. Wadsack, L. Wendehals, and J. Welsh. Towards

Pattern-Based Design Recovery. In Proceedings of the 24th International

Conference on Software Engineering, Orlando, Florida, USA, pages 338–

348. ACM Press, May 2002.

[NWW01] J. Niere, J.P. Wadsack, and L. Wendehals. Design Pattern Recovery

Based on Source Code Analysis with Fuzzy Logic. Technical Report tr-ri-

01-222, University of Paderborn, Paderborn, Germany, March 2001.

[NWW03] J. Niere, J.P. Wadsack, and L. Wendehals. Handling Large Search Space

in Pattern-Based Reverse Engineering. In Proceedings of the 11th Interna-

tional Workshop on Program Comprehension, Portland, USA, pages 274–

280. IEEE Computer Society Press, May 2003.

[NWZ01] J. Niere, J.P. Wadsack, and A. Zündorf. Recovering UML Diagrams from

Java Code using Patterns. In J.H. Jahnke and C. Ryan, editors, Proceed-

ings of the 2nd Workshop on Soft Computing Applied to Software Engi-

neering, Enschede, The Netherlands. Centre for Telematics and

Information Technology, University of Twende, The Netherlands, Febru-

ary 2001. online at http://trese.cs.utwente.nl/scase/scase-2/Proceed-

ings.pdf.

[OLHL02] A. Orso, D. Liang, M.J. Harrold, and R. Lipton. Gamma System: Contin-

uous Evolution of Software after Deployment. In Proceedings of the 2002

International Symposium on Software Testing and Analysis, pages 65–69.

ACM Press, July 2002.

[ORH02] A. Orso, A. Rao, and M. J. Harrold. A Technique for Dynamic Updating of

Java Software. In Proceedings of the 2002 IEEE International Conference

on Software Maintenance, Montréal, Canada, pages 649–658. IEEE Com-

puter Society Press, October 2002.

[PG02] M. Pinzger and H. Gall. Pattern-Supported Architecture Recovery. In Pro-

ceedings of the 10th International Workshop on Program Comprehension,

Paris, France, pages 53–63. IEEE Computer Society Press, June 2002.

183

[SK01] K. Sartipi and K.Kontogiannis. A Graph Pattern Matching Approach to

Software Architecture Recovery. In Proceedings of the 2001 IEEE Interna-

tional Conference on Software Maintenance, Florence, Italy, pages 408–

418. IEEE Computer Society Press, November 2001.

[SWJ04] W. Schäfer, J.P. Wadsack, and J.H. Jahnke. Software Reengineering - Die

Suche nach verlorener Information. ForschungsForum Paderborn, 7-2004,

January 2004. Paderborner Universitätsmagazin, (to appear).

[Wad03] J.P. Wadsack. Architectural Issues in Data Reengineering. In Report of

the Dagstuhl-Seminar 03061 on Software Architecture Recovery and

Modeling. Schloss Dagstuhl, Germany, February 2003.

[War99] I. Warren. The Renaissance of Legacy Systems - Method Support fpr Soft-

wrae-System Evolution. PRACTIONER SERIES. Springer Verlag, 1999.

[WJ02] J.P. Wadsack and J.H. Jahnke. Towards Model-Driven Middleware Main-

tenance. In Proceedings of the OOPSLA 2002 Workshop on Generative

Techniques in the context of Model-Driven Architecture, Seattle, USA,

November 2002. online at http://www.softmetaware.com/oopsla2002/

mda-workshop.html.

[WJ03] J.P. Wadsack and J.H. Jahnke. A History Concept for Recovery and Design

Tools. In Report of the Dagstuhl-Seminar 03061 on Software Architecture

Recovery and Modeling. Schloss Dagstuhl, Germany, February 2003.

[WNGJ02] J.P. Wadsack, J. Niere, H. Giese, and J.H. Jahnke. Towards Data Depen-

dency Detection in Web Information Systems. In Proceedings of the ICSM

2002 Database Maintenance and Reengineering Workshop, Montréal,

Canada, pages 47–64. IEEE Computer Society Press, October 2002.

[ZSG79] M.V. Zelkowitz, A.C. Shaw, and J.D. Gannon. Principles of Software

Engineering and Design. Prentice Hall, 1979.

[ZWR01] A. Zündorf, J.P. Wadsack, and I. Rockel. Merging graph-like object struc-

tures. In Proceedings of the 10th International Workshop on Software

Configuration Management, Toronto, Canada, May 2001.

Chapter 2: Data-oriented Reengineering: A Case Study

[AOS+99] K. Arnold, B. Osullivan, R. W. Scheifler, J. Waldo, A. Wollrath, and

B. O’Sullivan. The Jini(TM) Specification. Addison-Wesley, June 1999.

184

[BBD+03] I. Bilykh, Y. Bychkov, D. Dahlem, J.H. Jahnke, G. McCallum, C. Obry,

A. Onabajo, and C. Kuziemsky. Can GRID Services Provide Answers to

the Challenges of National Health Information Sharing? In D.A. Stewart,

editor, Cascon 2003: Meeting of Minds, Toronto, Canada. IBM, October

2003.

[BRJ99] G. Booch, J. Rumbaugh, and I. Jacobson. The Unified Modeling Lan-

guage User Guide. Addison-Wesley, Reading, Massachusetts, USA, 1st

edition, 1999.

[CC90] Elliot J. Chikofsky and James H. Cross II. Reverse Engineering and

Design Recovery: A Taxonomy. IEEE Software, 7(1):13–17, January

1990. IEEE Computer Society Press.

[Cim00] J. J. Cimino. From Data to Knowledge through Concept-oriented Termi-

nologies. American Medical Informatics Association, 7(3):288–297,

2000. Hanley & Belfus, Inc. Medical Publishers.

[CVS] CVS. Concurrent Versions System - The open standard for version con-

trol. http://www.cvshome.org/.

[Dat89] C.J. Date. A Guide to the SQL standard. Addison-Wesley, 1989.

[GGT01] M. Gehrke, H. Giese, and M. Tichy. A Jini-supported Distributed Version

and Configuration Management System. In Proceedings of the 2001 Inter-

national Symposium on Convergence of IT and communications, Denver,

USA, August 2001.

[HA00] K. Hogshead Davis and P. Aiken. Data Reverse Engineering: A Historical

Survey. In Proceedings of the 7th Working Conference on Reverse Engi-

neering, Brisbane, Australia, pages 70–78. IEEE Computer Society Press,

November 2000.

[HEH+95] J-L. Hainaut, V. Englebert, J. Henrard, J-M. Hick, and D. Roland.

Requirements for information system reverse engineering support. In Pro-

ceedings of the 2nd Working Conference on Reverse Engineering, Tor-

onto, Canada, pages 136–145. IEEE Computer Society Press, July 1995.

[HHH+00] J-L. Hainaut, J. Henrard, J-M. Hick, J. Englebert, and D. Roland. The

Nature of Data Reverse Engineering. In Proceedings of the Data Reverse

Engineering Workshop EuroRef, 7th Reengineering Forum, Reengineer-

ing Week 2002. Zurich, Switzerland, March 2000.

185

[Jah99] J.H. Jahnke. Management of Uncertainty and Inconsistency in Database

Reengineering Processes. PhD thesis, University of Paderborn, Pader-

born, Germany, September 1999.

[Moo01] L. Moonen. Generating Robust Parsers using Island Grammars. In Pro-

ceedings of the 8th Working Conference on Reverse Engineering, Stut-

tgart, Germany, pages 13–22. IEEE Computer Society Press, October

2001.

[MRF+03] M. A. Munoz, M. Rodrigez, J. Favela, A.I. Martinez-Garcia, and

V.M. Gonz lez. Context-Aware Mobile Communication in Hospitals.

IEEE Computer, 36(9):38–46, September 2003. IEEE Computer Society

Press.

[Nie03] J. Niere. Inkremetelle Mustererkennung. PhD thesis, University of Pader-

born, Paderborn, Germany, December 2003.

[OBJ03] A. Onabajo, I. Bilykh, and J.H. Jahnke. Wrapping Legacy Medical Sys-

tems for Integrated Health Network. In NET.ObjectDAYS 2003, Erfurt,

Germany. Springer Verlag, September 2003.

[Ona03] A. Onabajo. GRID FEDERATION ENVELOPE (GFE): Federating Medi-

cal Information Systems. Master’s thesis, Department of Computer Sci-

ence, University of Victoria, Victoria, Canada, September 2003.

[RT02] W. Raghupathi and J. Tan. Strategic IT Applications oin Health Care.

Communications of the ACM, 45(12):56–61, December 2002. ACM

Press.

[SL96] A. Schürr and M. Lefering. Specification of Integration Tools. In M. Nagl,

editor, Building Tightly Integrated Software Development Environments:

The IPSEN Approach, volume 1170 of Lecture Notes in Computer Sci-

ence, pages 324–334. Springer Verlag, 1996.

[SPPB02] P. Sousa, M.L. Pedro-de-Jesus, G. Pereira, and F. Brito e Abreu. Cluster-

ing Relations into abstract ER Schemas for database reverse engineering.

Journal of Science of Computer Programming, 45(2-3):137–153, Novem-

ber 2002. Elsevier Science Publishers B.V (North-Holland), (Special Issue

on Software Maintenance and Reengineering).

[TS95] S.R. Tilley and D.B. Smith. Perspectives on Legacy Systems Reengineer-

ing (draft). Reengineering Center, Softwrae Engineering Institute, Carn-

egie Mellon University, 1995. (available online at http://

www.sei.cmu.edu/reengineering/lsysree.pdf).

186

[TWBK89] T. Teorey, G. Wei, D. Bolton, and J. Koenig. ER Model Clustering as an

Aid for User Communication and Documentation in Database Design.

Communications of the ACM, 32(8):975–987, August 1989. ACM Press.

[vDK99] A. van Dreusen and T. Kuipers. Building Documentation Generators. In

Proceedings of the 9th International Conference on Software Mainte-

nance, Oxford, UK, pages 40–49. IEEE Computer Society Press, Septem-

ber 1999.

[Wei84] M. Weiser. Program slicing. IEEE Transactions on Software Engineering,

10(4):352–357, July 1984. IEEE Computer Society Press.

[Wei02] G. Weiss. Welcome To The (Almost) Digital Hospital. IEEE Spectrum,

39(3):44–49, March 2002. IEEE Computer Society Press.

[Wil03] E.V. Wilson. Asychronous Health Care Communication. Communica-

tions of the ACM, 46(6):79–84, June 2003. ACM Press.

[WNGJ02] J.P. Wadsack, J. Niere, H. Giese, and J.H. Jahnke. Towards Data Depen-

dency Detection in Web Information Systems. In Proceedings of the ICSM

2002 Database Maintenance and Reengineering Workshop, Montréal,

Canada, pages 47–64. IEEE Computer Society Press, October 2002.

Chapter 3: Data-Oriented Reverse Engineering

[AFC98] G. Antoniol, R. Fiutem, and L. Christoforetti. Design pattern recovery in

object-oriented software. In Proceedings of the 6th International Work-

shop on Program Comprehension, Ischia, Italy, pages 153–160. IEEE

Computer Society Press, June 1998.

[AGG] Technical University of Berlin. AGG, the Attributed Graph Grammar sys-

tem. Online at http://www.tfs.cs.tu-berlin.de/agg.

[Aik96] P. Aiken. Data Reverse Engineering: Slaying the Legacy Dragon.

McGraw-Hill, 1996.

[Amb02] S. Ambler. Agile Database Techniques : Effective Strategies for the Agile

Software Developer, chapter 12: Database Refactoring. John Wiley and

Sons, Inc., October 2002.

[And94] M. Andersson. Extracting an Entity Relationship Schema from a Rela-

tional Database through Reverse Engineering. In Proceedings of the 13th

International Conference of the Entity Relationship Approach, Manches-

ter, volume 881 of Lecture Notes in Computer Science, pages 403–419.

Springer Verlag, December 1994.

187

[AOS+99] K. Arnold, B. Osullivan, R. W. Scheifler, J. Waldo, A. Wollrath, and

B. O’Sullivan. The Jini(TM) Specification. Addison-Wesley, June 1999.

[BCN92] C. Batini, S. Ceri, and S.B. Navathe. Conceptual Database design. Ben-

jamin/Cummings, 1992.

[BDH+87] H. Briand, C. Ducateau, Y. Hebrail, D. Herin-Aime, and

J. Kouloumdjian. From Minimal Cover to Entity-Relationship Diagram.

In Proceedings of the 6th International Conference of the Entity-Relation-

ship Approach, New York, USA, pages 287–304. North-Holland, Novem-

ber 1987.

[Bew98] B. Bewermeyer. Cliche-Erkennung in relationalen Datenbankanwendun-

gen. Master’s thesis, University of Paderborn, Department of Mathematics

and Computer Science, Paderborn, Germany, September 1998.

[BGD97] A. Behm, A. Geppert, and K. R. Dittrich. On the Migration of Relational

Schemas and Data to Object-Oriented Database Systems. In Proceedings

5th International Conference on Re-Technologies for Information Sys-

tems, Klagenfurt, Austria, pages 13–33. Österreichische Computer Gesell-

schaft, December 1997.

[BGN+03] S. Burmester, H. Giese, J. Niere, M. Tichy, J.P. Wadsack, R. Wagner,

L. Wendehals, and A. Zündorf. Tool Integration at the Meta-Model Level

within the FUJABA Tool Suite. In Proceedings of the Workshop on Tool-

Integration in System Development (TIS), Helsinki, Finland, (ESEC / FSE

2003 Workshop 3), pages 51–56, September 2003. online at http://

www.es.tu-darmstadt.de/english/events/tis/documentation/Proceed-

ings.p%df.

[BLN86] C. Batini, M. Lenzerini, and S. B. Navathe. A Comparative Analysis of

Methodologies for Database Schema Integration. ACM Computing Sur-

veys, 18(2):323–364, 1986. ACM Press.

[BP98] M. Blaha and W. Premerlani. Object-Oriented Modeling and Design for

Database Applications. Prentice Hall, 1998.

[BR97] H. Blockeel and L. De Raedt. Relational knowledge discovery in data-

bases. In S. Muggleton, editor, Proceedings of the 6th International Work-

shop on Inductive Logic Programming, Stockholm, Sweden, volume 1314

of Lecture Notes in Artificial Intelligence, pages 199–211, Berlin, August

1997. Springer Verlag.

188

[BRJ99] G. Booch, J. Rumbaugh, and I. Jacobson. The Unified Modeling Lan-

guage User Guide. Addison-Wesley, Reading, Massachusetts, USA, 1st

edition, 1999.

[CBB+00] R. G. G. Cattell, D. Barry, M. Berler, J. Eastman, D. Jordan, C. Russell,

O.Schadow, T. Staniendam, and F. Velez. The Object Data Standard:

ODMG 3.0. Morgan Kaufmann Publishers, San Francisco (CA), USA,

2000.

[Cha96] D. Chappell. Understanding ActiveX and OLE - A Guide for Developers

and Managers. Microsoft Press, 1996.

[Che76] P.P. Chen. The Entity-Relationship Model – Toward a unified view of data.

ACM Transactions on Database Systems, 1(1):9–36, 1976. ACM Press.

[Chu04] D. Church. Using mildly context-sensitive island grammars for semi-

structured data extraction. Master’s thesis, Department of Computer Sci-

ence, University of Victoria, Victoria, Canada, forthcoming in 2004.

[CL93] T. Catarci and M. Lenzerini. Representing and Using Interschema Knowl-

edge in a CooperativeInformation Systems. International Journal of Intel-

ligent and Cooperative Information Systems, 2(4):375–398, 1993. IEEE

Computer Society Press.

[Cod70] E.F. Codd. A Relational Model of Data for Large Shared Data Banks.

Communications of the ACM, 13(6):377–387, June 1970. ACM Press.

[COR99] CORBA-2.3.1. The Common Object Request Broker: Architecture and

Specification, CORBA/IIOP 2.3.1 Specification. Object Managment

Group, October 1999. Revision 2.3.1: OMG Technical Document formal/

99-10-07.

[COS98] COSS-2.3. CORBAservices: Common Object Services Specification.

Object Managment Group, December 1998. Revision 2.3, OMG technical

document 98-12-09.

[Dat00] C.J. Date. An introduction to database systems. Addison-Wesley, 7th edi-

tion, 2000.

[EH99] V. Englebert and J.-L. Hainaut. DB-MAIN: A Next Generation Meta-

CASE. Journal of Information Systems - Special Issue on Meta-CASEs,

24(2):99–112, June 1999. Elsevier Science Publishers B.V (North-Hol-

land).

189

[EN94] R. Elmasri and S.B. Navathe. Fundamentals of Database Systems. Ben-

jamin/Cummings, Redwood City, 2nd edition, 1994.

[FNTZ98] T. Fischer, J. Niere, L. Torunski, and A. Zündorf. Story Diagrams: A new

Graph Rewrite Language based on the Unified Modeling Language. In

G. Engels and G. Rozenberg, editors, Proceedings of the 6th International

Workshop on Theory and Application of Graph Transformation, Pader-

born, Germany, volume 1764 of Lecture Notes in Computer Science,

pages 296–309. Springer Verlag, November 1998.

[Fon97] J. Fong. Converting Relational to Object-Oriented Databases. ACM SIG-

MOD Record, 26(1):53–58, March 1997. ACM Press.

[Fow99] M. Fowler. Refactoring: Improving the Design of Existing Code. Addison-

Wesley, 1999.

[FV95] C. Fahrner and G. Vossen. Transforming Relational Database Schemas

into Object-Oriented Schemas according to ODMG-93. In Proceedings of

the 4th International Conference on Deductive and Object-Oriented Data-

bases, Singapur, volume 1013 of Lecture Notes in Computer Science,

pages 429–446. Springer Verlag, 1995.

[GHJV95] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Ele-

ments of Reusable Object Oriented Software. Addison-Wesley, Reading,

MA, 1995.

[GO00] C.L. Gittens and S.L. Osborn. Database Integration Using Graph Trans-

formation. In Joint 2000 APPLIGRAPH and GETGRATS Workshop on

Graph Transformation Systems, Berlin, Germany, pages 96–105. online at

http://tfs.cs.tu-berlin.de/gratra2000, March 2000.

[Hai89] J.-L. Hainaut. A Generic Entity-Relationship Model. In Proceedings of the

IFIP WG 8.1 Conference on Information System Concepts: an in-depth

analysis, Namur, Belgium. North-Holland, 1989.

[HCTJ93] J-L. Hainaut, M. Chandelon, C. Tonneau, and M. Joris. Contribution to a

Theory of Database Reverse Engineering. In Proceedings of the Working

Conference on Reverse Engineering, Baltimore, USA, pages 161–170.

IEEE Computer Society Press, May 1993.

[HDZ00] J. Hatcli, M.B. Dwyer, and H. Zheng. Slicing software for model construc-

tion. Higher-Order and Symbolic Computation, 3(4):315–253, 2000. Klu-

wer Academic Publishers.

190

[HEH+95] J-L. Hainaut, V. Englebert, J. Henrard, J-M. Hick, and D. Roland.

Requirements for information system reverse engineering support. In Pro-

ceedings of the 2nd Working Conference on Reverse Engineering, Tor-

onto, Canada, pages 136–145. IEEE Computer Society Press, July 1995.

[HG98] M. Harman and K.B. Gallagher. Program Slicing - Introduction to the

Special Issue on Program Slicing. Information and Software Technology,

40(11):577–581, November/December 1998. Elsevier Science Publishers

B.V (North-Holland).

[HH01] J.-L. Hainaut and J. Henrard. Data dependency elicitation in database

reverse engineering. In Proceedings of the 5th European Conference on

Software Maintenance and Reengineering, Lisbon, Portugal, pages 11–19.

IEEE Computer Society Press, 2001.

[HHEH96] J.-L. Hainaut, J.-M. Hick, V. Englebert, and J. Henrard. Understanding

Implementation of Is-A Relations. In Proceedings of the 15th International

Conference on the Entity-Relationship Approach, Cottbus, Germany, vol-

ume 1157, pages 42–50. Springer Verlag, 1996.

[HHH+96] J.-L. Hainaut, J. Henrard, J.-M. Hick, D. Roland, and V. Englebert. Data-

base Design Recovery. In Proceedings of the 8th Conference on Advance

Information Systems Engineering, Crete, Greece, volume 1080 of Lecture

Notes in Computer Science, pages 272–300. Springer Verlag, 1996.

[HHH+99] J. Henrard, J-L. Hainaut, J-M. Hick, D. Roland, and J. Englebert. Data

structure extraction in database reverse engineering. In Proceedings of

the 1st International Workshop on Reverse Engineering in Information

Systems, Paris, France, pages 149–160, November 1999.

[HHKR89] J. Heering, P.R.H. Hendriks, P. Klint, and J. Rekers. The Syntax Definition

Formalism SDF - Reference Manual -. SIGPLAN Notices, 24(11):43–75,

1989. ACM Press.

[HN90] M. T. Harandi and J. Q. Ning. Knowledge Based Program Analysis. IEEE

Transactions on Software Engineering, 7(1):74–81, 1990. IEEE Computer

Society Press.

[HTJC94] J.-L. Hainaut, C. Tonneau, M. Joris, and M. Chandelon. Transformation-

Based Database Reverse Engineering. In Proceedings of the 12th Interna-

tional Conference on the Entity-Relationship Approach, Dallas, USA, vol-

ume 823, page 364. Springer Verlag, 1994.

191

[Ib96] CORBA IDL-binding. Information technology – Information Resource

Dictionary System (IRDS) Services Interface. ISO/IEC, 1996. Amendment

3:1996 to ISO/IEC 10728:1993 CORBA IDL binding.

[Jah99] J.H. Jahnke. Management of Uncertainty and Inconsistency in Database

Reengineering Processes. PhD thesis, University of Paderborn, Pader-

born, Germany, September 1999.

[JCC] WebGain, Inc. JavaCC, the Java Parser Generator. Online at http://

www.experimentalstuff.com/Technologies/JavaCC/ (last visited June

2003).

[JJ94] U.A. Johnen and M.A. Jeusfeld. An Executable Meta Model for Re-Engi-

neering of Database Schemas. In Proceedings of the 13th International

Conference on the Entity-Relationship Approach, Manchester, UK, num-

ber 885 in Lecture Notes in Computer Science, pages 533–547. Springer

Verlag, March 1994.

[JK89] P. Johannesson and K. Kalman. A Method for Translating Relational

Schemas into Conceptual Schemas. In F.H. Lochovsky, editor, Proceed-

ings of the 8th International Conference on Entity-Relationship Approach,

Toronto, Canada, pages 271–285. North-Holland, October 1989.

[JSZ96] J. Jahnke, W. Schäfer, and A. Zündorf. A Design Environment for Migrat-

ing Relational to Object-Oriented Database Systems. In Proceedings of

the 6th International Conference on Software Maintenance, Monterrey,

USA, pages 163–170, November 1996.

[JSZ97] J.H. Jahnke, W. Schäfer, and A. Zündorf. Generic Fuzzy Reasoning Nets

as a basis for reverse engineering relational database applications. In

Proceedings of the 6th European Software Engineering Conference, vol-

ume 1302 of Lecture Notes in Computer Science, pages 193–210. Springer

Verlag, September 1997.

[JZ99] J.H. Jahnke and A. Zündorf. Applying Graph Transformations To Data-

base Re-Engineering. In H. Ehrig, G. Engels, H.-J. Kreowski, and

G. Rozenberg, editors, Handbook of Graph Grammars and Computing by

Graph Transformation, volume 2 - Application, Languages and tools.,

pages 267–284. World Scientific, Singapore, 1999.

[KM96] A. Konar and A. K. Mandal. Uncertainty Management in Expert Systems

Using Fuzzy Petri Nets. IEEE Transactions on Knowledge and Data Engi-

neering, 8(1):96–105, February 1996. Academic Press, London.

192

[KP96] C. Krämer and L. Prechelt. Design recovery by automated search for

structural design patterns in object-oriented software. In Proceedings of

the 3rd Working Conference on Reverse Engineerinung, Monterey, CA,

pages 208–215. IEEE Computer Society Press, November 1996.

[KSRP99] R.K. Keller, R. Schauer, S. Robitaille, and P. Page. Pattern-Based

Reverse-Engineering of Design Components. In Proceedings of the 21st

International Conference on Software Engineering, Los Angeles, USA,

pages 226–235. IEEE Computer Society Press, May 1999.

[LCBO03] P.K. Linos, Z. Chen, S. Berrier, and B. O’Rourke. A Tool for Understand-

ing Multi-Language Program Dependencies. In Proceedings of the 11th

International Workshop on Program Comprehension (IWPC), Portland,

USA, pages 64–72. IEEE Computer Society Press, May 2003.

[Lef95] Martin Lefering. Integrationswerkzeuge in einer Softwareentwicklung-

sumgebung. Informatik. Verlag Shaker, 1995.

[LNE89] J.A. Larson, S.B. Navathe, and R. Elmasri. A Theory of Attribute Equiva-

lence in Databases with Application to Schema Integration. IEEE Trans-

actions on Software Engineering, 15(4):449–463, April 1989. IEEE

Computer Society Press.

[MCAH95] P. Martin, J.R. Cordy, and R. Abu-Hamdeh. Information Capacity Pre-

serving of Relational Schemas Using Structural Transformation. Techni-

cal report, Dept. of Computing and Information Science, Queen’s

University, Kingston, Ontario, Canada, November 1995.

[Meh84] K. Mehlhorn. Graph Algorithms and NP-Completeness. Springer Verlag,

1st edition, 1984.

[MJS+00] H.A. Müller, J.H. Jahnke, D.B. Smith, M.A. Storey, and K. Wong.

Reverse Engineering: a roadmap. In A. Finkelstein, editor, Future of Soft-

ware Engineering. International Conference on Software Engineering,

Limerick, Ireland, pages 47–60. ACM Press, June 2000.

[MM90] R.W. Mathews and W.C. McGee. Data modeling for software develpment.

IBM Systems Journal, 29(2):228–235, 1990. IBM.

[Moo01] L. Moonen. Generating Robust Parsers using Island Grammars. In Pro-

ceedings of the 8th Working Conference on Reverse Engineering, Stut-

tgart, Germany, pages 13–22. IEEE Computer Society Press, October

2001.

193

[NA87] S.B. Navathe and A.M. Awong. Abstracting Relational and Hierarchical

Data with a Semantic Data Model. In Proceedings of the 6th International

Conference of the Entity-Relationship Approach, New York, USA, pages

305–333. North-Holland, November 1987.

[Nie03] J. Niere. Inkremetelle Mustererkennung. PhD thesis, University of Pader-

born, Paderborn, Germany, December 2003.

[NSW+02] J. Niere, W. Schäfer, J.P. Wadsack, L. Wendehals, and J. Welsh. Towards

Pattern-Based Design Recovery. In Proceedings of the 24th International

Conference on Software Engineering, Orlando, Florida, USA, pages 338–

348. ACM Press, May 2002.

[NWW01] J. Niere, J.P. Wadsack, and L. Wendehals. Design Pattern Recovery

Based on Source Code Analysis with Fuzzy Logic. Technical Report tr-ri-

01-222, University of Paderborn, Paderborn, Germany, March 2001.

[NWW03] J. Niere, J.P. Wadsack, and L. Wendehals. Handling Large Search Space

in Pattern-Based Reverse Engineering. In Proceedings of the 11th Interna-

tional Workshop on Program Comprehension, Portland, USA, pages 274–

280. IEEE Computer Society Press, May 2003.

[OTS98] OTS-1.1. Transaction Service Specification. Object Management Group,

February 1998. The Common Object Request Broker: Architecture and

Specification, CORBA/IIOP 1.1 Specification, Revision 1.1: OMG Tech-

nical Document formal/97-12-17.

[Pal01] M. Palasdies. Design-Pattern Spezifikation und Erkennung auf Basis von

Story-Diagrammen. Master’s thesis, University of Paderborn, Department

of Mathematics and Computer Science, Paderborn, Germany, May 2001.

[PB94] W.J. Premerlani and M.R. Blaha. An Approach for Reverse Engineering of

Relational Databases. Communications of the ACM, 37(5):42–49, May

1994. ACM Press.

[PKBT94] J-M. Petit, J. Kouloumdjian, J-F. Boulicaut, and F. Toumani. Using Que-

ries to Improve Database Reverse Engineering. In Proceedings of 13th

International Conference of the Entity-Relationship Approach, Manches-

ter, UK, number 885 in Lecture Notes in Computer Science, pages 369–

386. Springer Verlag, March 1994.

[PP94] S. Paul and A. Prakash. A Framework for Source Code Search Using Pro-

gram Patterns. IEEE Transactions on Software Engineering, 20(6):463–

475, June 1994. IEEE Computer Society Press.

194

[PTBK96] J-M. Petit, F. Toumani, J. Boulicaut, and J. Kouloumdjian. Towards the

reverse engineering of denormalized relational databases. In Proceedings

12th International Conference on Data Engineering, pages 218–227, New

Orleans, February 1996. IEEE Computer Society Press.

[Qui94] A. Quilici. A Memory-Based Approach to Recognizing Programming

Plans. Communications of the ACM, 37(5):84–93, May 1994. ACM

Press.

[Rad99] A. Radermacher. Support for Design Patterns through Graph Transfor-

mation Tools. In Proceedings of 1999 International Workshop and Sympo-

sium on Applications Of Graph Transformations With Industrial

Relevance, Kerkrade, The Netherlands, volume 1779 of Lecture Notes in

Computer Science, pages 111–126. Springer Verlag, 1999.

[Rec01] C. Reckord. Entwurf eines generischen Sichtenkonzeptes für die Entwick-

lungsumgebung Fujaba. Bachelor’s thesis, University of Paderborn,

Department of Computer Science, Paderborn, Germany, February 2001.

[RH96] S. Ramanathan and J. Hodges. Reverse Engineering Relational Schemas

to Object-Oriented Schemas. Technical Report MSU-960701, Department

of Computer Science, Mississippi State University, USA, July 1996.

[RH97] S. Ramanathan and J. Hodges. Extraction of Object-Oriented Structures

from Existing Relational Databases. ACM SIGMOD Record, 26(1),

March 1997. ACM Press.

[Roz99] G. Rozenberg, editor. Handbook of Graph Grammars and Computing by

Graph Transformation, volume 1. World Scientific, Singapore, 1999.

[RSK91] M. Rusinkiewisz, A. Sheth, and G. Karabatis. Specifying Interdatabase

Dependencies in a Multidatabase Environment. IEEE Computer,

24(12):46–53, December 1991. IEEE Computer Society Press.

[SA03] M. M. Sufyan Beg and N. Ahmad. Soft Computing Techniques for Rank

Aggregation on the World Wide Web. World Wide Web: Internet and Web

Information Systems, 6(1):5–22, March 2003. Kluwer Academic Publish-

ers.

[SCC+93] Y.-P. Shan, T. Cargill, B. Cox, W. Cook, M. Loomis, and A. Snyder. Is

Multiple Inheritance Essential to OOP? SIGPLAN Notices - Proccedings

of the 1993 Conference on Object-Oriented Programming Systems, Lan-

guages, and Applications, 28(10):360–363, October 1993. ACM Press.

195

[Sch94] Andy Schürr. Specification of Graph Translators with Triple Graph

Grammars. In Proceedings of the 20th International Workshop on Graph-

Theoretic Concepts in Computer Science, pages 151–163, Herrschin, Ger-

many, June 1994. Spinger Verlag.

[Sch01] M.A. Schwarz. Integration eines inkrementellen Parsing - Algorithmus in

Fujaba. Bachelor’s thesis, University of Paderborn, Department of Com-

puter Science, Paderborn, Germany, September 2001.

[SK90] F.N. Springsteel and C. Kou. Reverse Data Engineering of E-R Designed

Relational Schemas. In Proceedings of the 1st Databases, Parallel Archi-

tectures and their Applications, Miami Beach, USA, pages 438–440.

Springer Verlag, March 1990.

[SL96] A. Schürr and M. Lefering. Specification of Integration Tools. In M. Nagl,

editor, Building Tightly Integrated Software Development Environments:

The IPSEN Approach, volume 1170 of Lecture Notes in Computer Sci-

ence, pages 324–334. Springer Verlag, 1996.

[SLGC94] O. Signore, M. Loffredo, M. Gregori, and M. Cima. Reconstruction of ER

Schema from Database Applications: a Cognitive Approach. In Proceed-

ings of 13th International Conference on the Entity-Relationship

Approach, Manchester, UK, number 885 in Lecture Notes in Computer

Science, pages 387–402. Springer Verlag, March 1994.

[Sou98a] C. Soutou. Inference of Aggregate Relationships through Database

Reverse Engineering. In Proceedings of 17th International Conference on

Conceptual Modeling, Singapore, volume 1507 of Lecture Notes in Com-

puter Science, pages 135–149. Springer Verlag, November 1998.

[Sou98b] C. Soutou. Relational Database Reverse Engineering: Extraction of Car-

dinality Constraints. Data and Knowledge Engineering, 28(2):161–207,

November 1998. Elsevier Science Publishers B.V (North-Holland).

[SvG98] J. Seemann and J.W. von Gudenberg. Pattern-Based Design Recovery of

Java Software. ACM SIGSOFT Software Engineering Notes, 23(6):10–

16, November 1998. ACM Press.

[SWZ95] A. Schürr, A.J. Winter, and A. Zündorf. Graph Grammar Engineering

with PROGRES. In W. Schäfer and P. Botella, editors, Proceedings of 5th

European Software Engineering Conference, Barcelona, Spain, volume

989 of Lecture Notes in Computer Science, pages 219–234. Springer Ver-

lag, September 1995.

196

[Szy99] C. Szyperski. Component Software Beyond Object-Oriented Program-

ming. Addison-Wesley, 1999.

[TA99] P. Tonella and G. Antoniol. Object Oriented Design Pattern Inference. In

Proceedings of the 9th International Conference on Software Mainte-

nance, Oxford, UK, pages 230–238. IEEE Computer Society Press, Sep-

tember 1999.

[TGF00] A-R. H. Tawil, W. A. Gray, and N. J. Fiddian. Discovering and Repre-

senting InterSchema Semantic Knowledge in a Cooperative Multi-Infor-

mation Server Environment. In M. T. Ibrahim, J. Küng, and N. Revell,

editors, Proceedings of the 11th International Conference on Database and

Expert Systems Applications, London, UK, volume 1873 of Lecture Notes

in Computer Science, pages 548–562. Springer Verlag, September 2000.

[Tip95] F. Tip. A Survey of Program Slicing Techniques. Journal of programming

languages, 3(3):121–189, September 1995. Chapman & Hall.

[vDK99] A. van Dreusen and T. Kuipers. Building Documentation Generators. In

Proceedings of the 9th International Conference on Software Mainte-

nance, Oxford, UK, pages 40–49. IEEE Computer Society Press, Septem-

ber 1999.

[Wad98] J.P. Wadsack. Inkrementell Konsistenzerhaltung in der transformations-

basierten Datenbankmigration. Master’s thesis, University of Paderborn,

Department of Mathematics and Computer Science, Paderborn, Germany,

1998.

[Wei84] M. Weiser. Program slicing. IEEE Transactions on Software Engineering,

10(4):352–357, July 1984. IEEE Computer Society Press.

[Wen01] L. Wendehals. Cliché- und Mustererkennung auf Basis von Generic Fuzzy

Reasoning Nets. Master’s thesis, University of Paderborn, Department of

Mathematics and Computer Science, Paderborn, Germany, October 2001.

[Wen03] L. Wendehals. Improving Design Pattern Instance Recognition by

Dynamic Analysis. In Proceedings of the ICSE 2003 Workshop on

Dynamic Analysis, Portland, USA, pages 29–32, May 2003. online at

http://www.cs.nmsu.edu/ jcook/woda2003/woda2003.pdf.

197

[Wil96] L.M. Wills. Using Attributed Flow Graph Parsing to Recognize Pro-

grams. In J.E. Cuny, H. Ehrig, G. Engels, and G. Rozenberg, editors, Pro-

ceedings of 5th International Workshop on Graph Grammars and Their

Application to Computer Science, Williamsburg, USA, volume 1073 of

Lecture Notes in Computer Science, pages 170–184, Williamsburg, Vir-

ginia, 1994, November 1996. Springer Verlag.

[WNGJ02] J.P. Wadsack, J. Niere, H. Giese, and J.H. Jahnke. Towards Data Depen-

dency Detection in Web Information Systems. In Proceedings of the ICSM

2002 Database Maintenance and Reengineering Workshop, Montréal,

Canada, pages 47–64. IEEE Computer Society Press, October 2002.

[Zad65] L.A. Zadeh. Fuzzy Sets. Information and Control, 8:338–353, 1965.

Elsevier Science Publishers B.V (North-Holland).

[Zün95] A. Zündorf. PROgrammierte GRaphErsetzungsSysteme. PhD thesis,

RWTH Aachen, 1995.

[Zün01] A. Zündorf. Rigorous Object Oriented Software Development. University

of Paderborn, 2001. (draft available online: http://www.uni-paderborn.de/

fachbereich/AG/schaefer/Personen/Ehemalige/Zuend%orf/

AZRigSoftDraft_0_2.pdf).

Chapter 4: Data Component Extension

[ABM96] A. Aarsten, D. Brugali, and G. Menga. Patterns for Three-Tier Client/

Server Applications. In Proceedings of the 3rd Conference on the Pattern

Languages of Programs, Urbana-Champaign, USA. Washington Univer-

sity, Technical Report# WUCS-97-07, September 1996.

[ADD+91] R. Ahmed, P. De Smedt, W. Du, W. Kent, M.A. Ketabchi, W. Litwin,

A. Rafii, and M.-C. Shan. The Pegasus Heterogeneous Multidatabase Sys-

tem. IEEE Computer, 24(12), December 1991. IEEE Computer Society

Press.

[AL99] N. Anquetil and T. Lethbridge. Recovering software architecture from the

names of source files. In Proceedings of the 6th Working Conference on

Reverse Engineering, Atlanta, USA, pages 235–255. IEEE Computer

Society Press, October 1999.

[Amb02] S.W. Ambler. Deriving Web services from UML models. developerWorks

newsletter: technology edition, March 2002. online at http://www-

106.ibm.com/developerworks/webservices/library/ws-uml1.

198

[Anq00] N. Anquetil. A comparison of graphs of concept for reverse engineering.

In Proceedings of the 8th International Workshop on Program Compre-

hension, Limerick, Irland, pages 231–240. IEEE Computer Society Press,

June 2000.

[BGD97] A. Behm, A. Geppert, and K. R. Dittrich. On the Migration of Relational

Schemas and Data to Object-Oriented Database Systems. In Proceedings

5th International Conference on Re-Technologies for Information Sys-

tems, Klagenfurt, Austria, pages 13–33. Österreichische Computer Gesell-

schaft, December 1997.

[BGN+03] S. Burmester, H. Giese, J. Niere, M. Tichy, J.P. Wadsack, R. Wagner,

L. Wendehals, and A. Zündorf. Tool Integration at the Meta-Model Level

within the FUJABA Tool Suite. In Proceedings of the Workshop on Tool-

Integration in System Development (TIS), Helsinki, Finland, (ESEC / FSE

2003 Workshop 3), pages 51–56, September 2003. online at http://

www.es.tu-darmstadt.de/english/events/tis/documentation/Proceed-

ings.p%df.

[BMR+96] F. Buschmann, R. Meunier, H. Rohnert, P. Somerlad, and M. Stal. Pat-

tern-Oriented Software Architecture - A System of Patterns. John Wiley

and Sons, Inc., 1st edition, 1996.

[CER99] CERMICS Database Team, 2004 route des lucioles, BP93, 06902 Sophia

Antipolis Cedex, France. ObjectDRIVER V1.1 User Manual, 1999.

[CHK+91] T. Connors, W. Hasan, C. Kolovson, M.A. Neimat, D. Schneider, and

K. Wilkinson. The Papyrus Integrated Data Server. In Proceddings of the

1st International Parallel and Distributed Information Systems Confer-

ence, Miami Beach, USA, page 139. IEEE Computer Society Press,

December 1991.

[CHS+95] M-J. Carey, L.M. Haas, P.M. Schwarz, M. Arya, W.F. Cody, R. Fagin,

M. Flickner, A.W. Luniewski, W. Niblack, D. Petkovic, J. Thomas, J.H.

Williams, and E.L. Wimmers. Towards Heterogeneous Multimedia Infor-

mation Systems: The Garlic Approach. In M.T. Ozsu and M.C. Shan, edi-

tors, 5th International Workshop on Research Issues in Data Engineering

- Distributed Object Management, Taipei, Taiwan, pages 124–131. IEEE

Computer Society Press, March 1995.

[CL00] J.L. Cybulski and T. Linden. Composing Multimedia Artifacts for Reuse.

In N. Harrison, B. Foote, and H. Rohnert, editors, Pattern Languages of

Program Design 4, pages 461–488. Addison-Wesley, 2000.

199

[Coc96] A. Cockburn. Prioritizing Forces in Software Design. In J.M. Vlissides,

J.O. Coplien, and N.L. Kerth, editors, Pattern Languages of Program

Design 2, pages 317–333. Addison-Wesley, 1996.

[CS90] S. Choi and W. Scacchi. Extracting and restructuring the design of large

systems. IEEE Software, 7(1):66–71, January 1990. IEEE Computer Soci-

ety Press.

[CWMY02] F. Chen, Q. Wang, H. Mei, and F. Yang. An Architecture-Based Approach

for Component-Oriented Development. In In Proceedings of the 26th

Annual International Computer Software and Applications Conference,

Oxford, England, pages 450–455. IEEE Computer Society Press, August

2002.

[DB00] J. Davey and E. Burd. Evaluating the Suitability of Data Clustering for

Software Remodularisation. In Proceedings of the 7th Working Confer-

ence on Reverse Engineering, Brisbane, Australia, pages 268–277. IEEE

Computer Society Press, November 2000.

[ES02] G. Engels and S. Sauer. Object-oriented Modeling of Multimedia Applica-

tions. In S.K. Chang, editor, Handbook of Software Engineering and

Knowledge Engineering, volume 2, pages 21–53. World Scientific, Sin-

gapore, 2002.

[FF99] E.B. Fernandez and R. Flanders. Data Filter Architecture Pattern. In Pro-

ceedings of the 6th Conference on the Pattern Languages of Programs,

Urbana-Champaign, USA, August 1999. online at http://jerry.cs.uiuc.edu/

plop/plop99/proceedings.

[FGS93] D.D. Fang, S. Ghandeharizadeh, and A. Si. The Design, Implementation,

and Evaluation of an Object-Based Sharing Mechanism for Federated

Database Systems. In Proceedings of the 9th International Conference on

Data Engineering, Vienna, Austria, pages 467–475. IEEE Computer Soci-

ety Press, April 1993.

[FM86] P. Feldman and D. Miller. Entity Model Clustering: Structuring A Data

Model By Abstraction. The Computer Journal, 29(4):348–360, 1986.

Oxford University Press.

[Fon97] J. Fong. Converting Relational to Object-Oriented Databases. ACM SIG-

MOD Record, 26(1):53–58, March 1997. ACM Press.

200

[GHJV95] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Ele-

ments of Reusable Object Oriented Software. Addison-Wesley, Reading,

MA, 1995.

[GMS01] H. Gomaa, D.A. Menasc, and M. E. Shin. Reusable component intercon-

nection patterns for distributed software architectures. In In Proceedings

of the 2001 Symposium on Software Reusability: putting software reuse in

context, Toronto, Canada, pages 69–77. ACM Press, May 2001.

[Gra99] M. Grand. Transaction Patterns: A Collection of Four Transaction

Related Patterns. In Proceedings of the 6th Conference on the Pattern Lan-

guages of Programs, Urbana-Champaign, USA, August 1999. online at

http://jerry.cs.uiuc.edu/ plop/plop99/proceedings.

[GSNW02] A. Gokhale, D.C. Schmidt, B. Natarajan, and N. Wang. Applying Model-

Integrating Computing to Component Middleware and Enterprise Appli-

cation. Communications of the ACM, 45(10):65–69, October 2002. ACM

Press.

[GW01] H. Giese and J.P. Wadsack. Reengineering for Evolution of Distributed

Information Systems. In Proceedings of the 3rd International Workshop on

Net-Centric Computing: Migrating to the Web, Toronto, Canada, pages

36–39. ACM Press, May 2001.

[GZ02] L. Geiger and A. Zündorf. Graph Based Debugging with Fujaba. Elec-

tronic Notes in Theoretical Computer Science, 72(2), November 2002.

Elsevier Science Publishers B.V (North-Holland).

[HC01] G.T. Heinemann and W.T. Councill. Component-Based Software Engi-

neering - Putting the Pieces Together. Addison-Wesley, 2001.

[HK02] I. Hammouda and K. Koskimies. Generating a Pattern-Based Application

Development Environment for Enterprise JavaBeans. In In Proceedings of

the 26th Annual International Computer Software and Applications Con-

ference, Oxford, England, pages 856–866. IEEE Computer Society Press,

August 2002.

[HL02] P. Hoven and M. Liebrecht. Entwurf und Implementierung einer Import/

Export Funktionalität für die Entwicklungsumgebung Fujaba. Bachelor’s

thesis, University of Paderborn, Department of Computer Science, Pader-

born, Germany, August 2002.

201

[HMN+99] L.M. Haas, R.J. Miller, B. Niswonger, M. Tork Roth, P.M. Schwarz, and

E.L. Wimmers. Transforming Heterogeneous Data with Database Mid-

dleware: Beyond Integration. Bulletin of the IEEE Computer Socisty

Technical Committee on Data Engineering, 2(1):31–36, March 1999.

IEEE Computer Society Press.

[HWSS00] R.C. Holt, A. Winter, A. Schürr, and S. Sim. GXL: Towards a Standard

Exchange Format. In Proceedings of the 7th Working Conference on

Reverse Engineering, Brisbane, Australia, pages 162–171, Brisbane, Aus-

tralia, November 2000. IEEE Computer Society Press.

[Jah99] J.H. Jahnke. Management of Uncertainty and Inconsistency in Database

Reengineering Processes. PhD thesis, University of Paderborn, Pader-

born, Germany, September 1999.

[JOS93] P. Jaeschke, A. Oberweis, and W. Stucky. Extending ER Model Clustering

by Relationship Clustering. In R. Elmasri, V. Kouramajian, and

B. Thalheim, editors, Proceedings of the 12th International Conference on

Entity-Relationship Approach, Arlington, Texas, USA, volume 823 of

Lecture Notes in Computer Science, pages 451–462. Springer Verlag,

December 1993.

[Kam03] M.L. Modjo Kamneng. Entwurfsunstresttzung Web-basierter Schnitts-

tellen aud Basis der UML. Master’s thesis, University of Paderborn,

Department of Computer Science, Paderborn, Germany, April 2003.

[Köl98] U. Kölsch. Object-Oriented Re-Engineering of Information Systems in a

Heterogeneous Distributed Environment. In Proceedings of the 5th Work-

ing Conference on Reverse Engineering, Hawaii, USA, pages 104–114.

IEEE Computer Society Press, October 1998.

[LB03] J.A. Landay and G. Borriello. Design Patterns for Ubiquitous Computing.

IEEE Computer, 36(8):93–95, August 2003. IEEE Computer Society

Press.

[Lin99] D. Linthicum. Enterprise Application Integration. Addison-Wesley, 1999.

[LS97] C. Lindig and G. Snelting. Assessing modular structure of legacy code

based on mathematical concept analysis. In Proceedings of the 19th Inter-

national Conference on Software Engineering, Boston, USA, pages 349–

359. ACM Press, May 1997.

202

[Lun98] C.-H. Lung. Software architecture recovery and restructuring through

clustering techniques. In Proceedings of the 3rd international workshop on

Software architecture, Orlando, USA, pages 101–104. ACM Press,

November 1998.

[Man99] D.-A. Manolescu. Feature Extraction: A Pattern for Information

Retrieval. In N. Harrison, B. Foote, and H. Rohnert, editors, Pattern Lan-

guages of Program Design 4, pages 391–412. Addison-Wesley, December

1999.

[Meu95] R. Meunier. The Pipes and Filters Architecture. In J.O. Coplien and D.C.

Schmidt, editors, Pattern Languages of Program Design 1, pages 427–440.

Addison-Wesley, October 1995.

[MHH+01] R.J. Miller, M.A. Hernandez, L.M. Haas, L. Yan, C.T.H. Ho, and R. Fagin

andL. Popa. The Clio Project: Managing Heterogeneity. ACM SIGMOD

Record, 30(1):78–83, March 2001. ACM Press.

[MMCG99] S. Mancoridis, B.S. Mitchell, Y. Chen, and E.R. Gansner. Bunch: A clus-

tering tool for the recovery and maintenance of software system struc-

tures. In Proceedings of the 9th International Conference on Software

Maintenance, Oxford, UK, pages 50–59. IEEE Computer Society Press,

August 1999.

[MMR+98] S. Mancoridis, B.S. Mitchell, C. Rorres, Y. Chen, and E.R. Gansner.

Using automatic clustering to produce high-level system organizations of

source code. In Proceedings 6th International Workshop on Program

Comprehension, Ischia, Italy, pages 45–52. IEEE Computer Society Press,

June 1998.

[MOTU93] H.A. Müller, M.A. Orgun, S.R. Tilley, and J.S. Uhl. A Reverse Engineer-

ing Approach To Subsystem Structure Identification. Journal of Software

Maintenance, 5(4):181–204, December 1993. John Wiley and Sons, Inc.

[Mul95] D.E. Mularz. Pattern-Based Integration Architectures. In J.O. Coplien and

D.C. Schmidt, editors, Pattern Languages of Program Design 1, pages

441–452. Addison-Wesley, October 1995.

[Obj99a] The Object People Inc., 885 Meadowlands Dr., Suite 509, Ottawa,

Ontario. TOPLink for Java 2.0 User’s Manual, 1999.

[Obj99b] ObjectMatter Inc., 2450 S.W. 137 Ave. Suite 206 Miami, Fl. 33175,

UNITED STATES. Objectmatter VBSF Object-Relational Framework

V2.02 User Manual, 1999.

203

[PGMW95] Y. Papakonstantinou, H. Garcia-Molina, and J. Widom. Object Exchange

Across Heterogeneous Information Sources. In P.S. Yu and A.L.P. Chen,

editors, Proceedings of the 11th International Conference on Data Engi-

neering, Taipei, Taiwan, pages 251–260. IEEE Computer Society Press,

March 1995.

[RB01] E. Rahm and P.A. Bernstein. A survey of approaches to automatic schema

matching. The International Journal on Very Large Data Bases,

10(4):334–450, December 2001. Springer Verlag.

[Ris00] L. Rising. The Pattern Almanac 2000. Addison-Wesley, 2000.

[RL82] R. Rosenberg and T. Landers. An Overview of MULTIBASE. In

H. Schneider, editor, Distributed Databases, pages 153–184. North Hol-

land, 1982.

[RR98] S. Ram and V. Ramesh. Schema integration: Past, Present, and Future. In

A. Elamagrid, M. Rusinkiewicz, and A. Sheth, editors, Management of

Heterogeneous and Autonomous Databse Systems, pages 119–155. Mor-

gan-Kaufmann, San Mateo, CA, 1998.

[Sch97] D.E. Schmidt. Acceptor and Connector. In R.C. Martin, D. Riehle, and

F. Buschmann, editors, Pattern Languages of Program Design 3, pages

191–229. Addison-Wesley, October 1997.

[Sch99] D.C. Schmidt. Wrapper Facade: A Structural Pattern for Encapsulating

Functions within Classes. In C++ Report, volume 11. SIGS Publications,

February 1999.

[SG96] M. Shaw and D. Garlan. Software Architecture: Perspectives on an

emerging Discipline. Prentice Hall, 1996.

[SH94] R.W. Schwanke and S.J. Hanson. Using Neural Networks to Modularize

Software. Machine Learning, 15(2):137–168, May 1994. Kluwer Aca-

demic Publishers.

[Sie98] Siemens AG - C-LAB, Fürstenallee 11, 33102 Paderborn, Germany.

OpenDM ODMG User’s Guide, 1998.

[SMB+03] M. Saeed, O. Maqbool, H.A. Babri, S.Z. Hassan, and S.M. Sarwar. Soft-

ware Clustering Techniques and the Use of Combined Algorithm. In Pro-

ceedings of the 7th European Conference On Software Maintenance And

Reengineering, Benevento, Italy, pages 301–306. IEEE Computer Society

Press, March 2003.

204

[Sne02] H.M. Sneed. Using XML to Integrate Existing Software Systems into the

Web. In In Proceedings of the 26th Annual International Computer Soft-

ware and Applications Conference, Oxford, England, pages 167–172.

IEEE Computer Society Press, August 2002.

[SPPB02] P. Sousa, M.L. Pedro-de-Jesus, G. Pereira, and F. Brito e Abreu. Cluster-

ing Relations into abstract ER Schemas for database reverse engineering.

Journal of Science of Computer Programming, 45(2-3):137–153, Novem-

ber 2002. Elsevier Science Publishers B.V (North-Holland), (Special Issue

on Software Maintenance and Reengineering).

[Szy99] C. Szyperski. Component Software Beyond Object-Oriented Program-

ming. Addison-Wesley, 1999.

[TCHH99] P. Thiran, A. Chougrani, J-M. Hick, and J-L. Hainaut. Generation of Con-

ceptual Wrappers for Legacy Databases. In Proceedings of the 10th Inter-

national Conference on Database and Expert Systems Applications,

Florence, Italy, pages 678–687. Springer Verlag, September 1999.

[TH97] V. Tzerpos and R. Holt. The orphan adoption problem in architecture

maintenance. In Proceedings of the 4th Working Conference on Reverse

Engineering, Amsterdam, The Netherlands, pages 76–83. IEEE Computer

Society Press, October 1997.

[THB+98] P. Thiran, J-L. Hainaut, S. Bodart, A. Deflorenne, and J-M. Hick. Interop-

eration of Independent, Heterogeneous and Distributed Databases. Meth-

odology and CASE Support: the InterDB Approach. In Proceedings of the

3rd IFCIS International Conference on Cooperative Information Systems,

New York, USA, pages 54–63. IEEE Computer Society Press, August

1998.

[Tho99] Thought Inc., 657 Mission Street, Suite 202, San Francisco, CA 94105,

USA. CocoBase WhitePaper, 1999.

[TWBK89] T.J. Teorey, G. Wei, D.L. Bolton, and J.A. Koenig. ER Model Clustering

as an Aid for User Communication and Documentation in Database

Design. Communications of the ACM, 32(8):975 – 987, August 1989.

ACM Press.

[vC97] J.A. van den Broecke and J.O. Coplien. Using Design Patterns to Build a

Framework for Multimedia Networking. In Bell Labs Technical Journal,

pages 166–187. Lucent Technologies Inc., winter 1997.

205

[vK99] A. van Deursen and T. Kuipers. Identifying objects using cluster and con-

cept analysis. In Proceedings of the 21st International Conference on Soft-

ware Engineering, Los Angeles, USA, pages 246–255. ACM Press, May

1999.

[Wag01] S. Wagner. Datenbank-Erweiterungen für multimediale Anwendungen.

Master’s thesis, University of Paderborn, Department of Mathematics and

Computer Science, Paderborn, Germany, 2001.

[War99] I. Warren. The Renaissance of Legacy Systems - Method Support fpr Soft-

wrae-System Evolution. PRACTIONER SERIES. Springer Verlag, 1999.

[Wie92] G. Wiederhold. Mediatiors in the architecture of duture information sys-

tems. IEEE Computer, 14(2), March 1992. IEEE Computer Society Press.

[Wie95] G. Wiederhold. Mediation in information systems. ACM Computing Sur-

veys, 27(2), m 1995. ACM Press.

[Wig97] T.A. Wiggerts. Using Clustering Algorithms in Legacy Systems Remodu-

larization. In Proceedings of the 4th Working Conference on Reverse

Engineering, Amsterdam, The Netherlands, pages 33–43. IEEE Computer

Society Press, October 1997.

[Wol01] F. Wolf. Entwicklung eines Generators fr eine objektorientierte Zugriffss-

chicht auf einer relationalen Datenbank. Master’s thesis, University of

Paderborn, Department of Computer Science, Paderborn, Germany, April

2001.

[XA94] XA. Distributed Transaction processing: The XA+ Specification, version

2. X/Open Group, 1994. X/Open Company, Reading, UK.

Chapter 5: Model Consistency Management

[CMR96] A. Corradini, U. Montanari, and F. Rossi. Graph Processes. Fundamenta

Informaticae, 26(3):241–265, June 1996. IOS Press, Amsterdam.

[CVS] CVS. Concurrent Versions System - The open standard for version con-

trol. http://www.cvshome.org/.

[HHE+99] J.-M. Hick, J.-L. Hainaut, V. Englebert, D. Roland, and J. Henrard. Strat-

egies pour l’evolution des applications de bases de donnees relation-

nelles: l’approche DB-MAIN. In In XVIIe congress INFORSID, La Garde,

France, June 1999.

206

[HWSS00] R.C. Holt, A. Winter, A. Schürr, and S. Sim. GXL: Towards a Standard

Exchange Format. In Proceedings of the 7th Working Conference on

Reverse Engineering, Brisbane, Australia, pages 162–171, Brisbane, Aus-

tralia, November 2000. IEEE Computer Society Press.

[Jah99] J.H. Jahnke. Management of Uncertainty and Inconsistency in Database

Reengineering Processes. PhD thesis, University of Paderborn, Pader-

born, Germany, September 1999.

[JSWZ02] J.H. Jahnke, W. Schäfer, J.P. Wadsack, and A. Zündorf. Supporting Itera-

tions in Exploratory Database Reengineering Processes. Journal of Sci-

ence of Computer Programming, 45(2-3):99–136, November 2002.

Elsevier Science Publishers B.V (North-Holland), (Special Issue on Soft-

ware Maintenance and Reengineering).

[JW99] J.H. Jahnke and J.P. Wadsack. Integration of analysis and redesign activ-

ities in information system reengineering. In P. Nesi and C. Vernoef, edi-

tors, Proceedings of the 3rd European Conference on Software

Maintenance and Reengineering, Amsterdam, The Nederlands, pages

160–168. IEEE Computer Society Press, March 1999.

[JWZ02] J.H. Jahnke, J.P. Wadsack, and A. Zündorf. A History Concept for Design

Recovery Tools. In Proceedings of the 4th European Conference on Soft-

ware Maintenance and Reengineering, Budapest, Hungary, pages 59–69.

IEEE Computer Society Press, March 2002.

[Nag96] M. Nagl, editor. The IPSEN Approach, volume 1170 of Lecture Notes in

Computer Science. Springer Verlag, 1996.

[NR99] A. Nica and E.A. Rundensteiner. View Maintenance after View Synchro-

nization. In International Database Engineering and Application Sympo-

sium, Montréal, Canada, pages 215–223. IEEE Computer Society Press,

June 1999.

[Roz99] G. Rozenberg, editor. Handbook of Graph Grammars and Computing by

Graph Transformation, volume 1. World Scientific, Singapore, 1999.

[SL96] A. Schürr and M. Lefering. Specification of Integration Tools. In M. Nagl,

editor, Building Tightly Integrated Software Development Environments:

The IPSEN Approach, volume 1170 of Lecture Notes in Computer Sci-

ence, pages 324–334. Springer Verlag, 1996.

207

[vv02] C. van der Westhuizen and A. van der Hoek. Understanding and Propa-

gating Architecutural Changes. In J. Bosch, W.M. Gentleman,

C. Hofmeister, and J. Kuusela, editors, IFIP 17th World Computer Con-

gress - TC2 Stream / 3rd IEEE/IFIP Conference on Software Architecture,

Montréal, Canada, volume 224 of IFIP Conference Proceedings, pages

95–109. Kluwer Academic Publishers, August 2002.

[Wad98] J.P. Wadsack. Inkrementell Konsistenzerhaltung in der transformations-

basierten Datenbankmigration. Master’s thesis, University of Paderborn,

Department of Mathematics and Computer Science, Paderborn, Germany,

1998.

Chapter 6: Conclusions

[BGN+03] S. Burmester, H. Giese, J. Niere, M. Tichy, J.P. Wadsack, R. Wagner,

L. Wendehals, and A. Zündorf. Tool Integration at the Meta-Model Level

within the FUJABA Tool Suite. In Proceedings of the Workshop on Tool-

Integration in System Development (TIS), Helsinki, Finland, (ESEC / FSE

2003 Workshop 3), pages 51–56, September 2003. online at http://

www.es.tu-darmstadt.de/english/events/tis/documentation/Proceed-

ings.p%df.

[BS95] M.L. Brodie and M. Stonebraker. Migrating Legacy Systems - Gateways,

Interfaces & The Incremental Approach. Morgan Kaufmann Publishers,

San Francisco, 1995.

[BSS+99] J. Bergey, D. Smith, Tilley S, N. Weiderman, and S. Woods. Why Reengi-

neering Projects Fail. Technical Report CMU/SEI-99-TR-010, Carnegie

Mellon Software Engineering Institute, Pittsburg, USA, April 1999.

[DRT98] S. Demeyer, M. Rieger, and S. Tichelaar. Three Reverse Engineering Pat-

terns, 1998. darft available at http://www.iam.unibe.ch/ famoos/

Deme98p/threerevpat.pdf.

[ESS02] M. El-Ramly, E. Stroulia, and P. Sorenson. Mining System-User Interac-

tion Traces for Use Case Models. In Proceedings of the 10th International

Workshop on Program Comprehension, Paris, France, pages 21–30. IEEE

Computer Society Press, June 2002.

[Hal96] C. L. Hall. Building Client/Server Applications Using TUXEDO. John

Wiley and Sons, Inc., 1996.

[Hou98] P. Houston. Building Distributed Applications with Message Queuing

Middleware. Microsoft Cooperation, 1998.

208

[Hud94] E. S. Hudders. CICS: A Guide to Internal Strucure. John Wiley and Sons,

Inc., 1994.

[JTS99] JTS. Java Transaction Service (JTS). Sun Microsystems Inc., December

1999. Version 1.0.

[KK01] C. Kerer and E. Kirda. Layout, Content and Logic Separation in Web

Engineering. In S. Murugeasn and Y. Deshpande, editors, Web Engineer-

ing - Managing Diversity and Complexity of Web Application Develop-

ment, volume 2016 of Lecture Notes in Computer Science, pages 135–

147. Springer Verlag, 2001.

[Lew99] R. Lewis. Advanced Messaging Applications with MSMQ and MQSeries.

Que, 1999.

[LT01] C. Liebig and S. Tai. Middleware Mediated Transactions. In Proceedings

of the 3rd International Symposium on Distributed Objects & Applica-

tions. Rome, Italy, pages 340–350. IEEE Computer Society Press, Sep-

tember 2001.

[MBPRR01] R.T. Mittermeir, A. Bollin, H. Pozewaunig, and D. Rauner-Reithmayer.

Goal-Driven Combination of Softwrae Comprehension Approcahes for

Component Based Development. In Proceedings of the 2001 symposium

on Software reusability: putting software reuse in context, Toronto, Can-

ada, pages 95–102. ACM Press, May 2001.

[MC01] J. Moe and D.A. Carr. Understanding Distributed Systems via Execution

Trace Data. In Proceedings of the 9th International Workshop on Program

Comprehension, Toronto, Canada, pages 60–69. IEEE Computer Society

Press, May 2001.

[MCZE02] C. Mascolo, L. Capra, S. Zachariadis, and W. Emmerich. XMIDDLE: A

Data-Sharing Middleware for Mobile Computing. Journal on Wireless

Personal Communications, 2002. Kluwer Academic Publishers.

[MG00] S.A. Mondal and K.D. Gupta. Choosing a Middleware for Web-Integra-

tion of a Legacy Application. Software Engineering Notes, 25(3):50–53,

May 2000. ACM Press.

[MMK02] M. Morrison, J. Morrison, and A. Keys. Integrating Web Sites and Data-

bases. Communications of the ACM, 45(9):81–86, September 2002. ACM

Press.

209

[OTS98] OTS-1.1. Transaction Service Specification. Object Management Group,

February 1998. The Common Object Request Broker: Architecture and

Specification, CORBA/IIOP 1.1 Specification, Revision 1.1: OMG Tech-

nical Document formal/97-12-17.

[Rot03] A. Rott. Werkzeug-untersttzte Optimierung komplexer Datenbank- und

Applikationsserverumgebungen auf Basis von Zugriffsanalysen. Master’s

thesis, University of Paderborn, Department of Computer Science, Pader-

born, Germany, January 2003.

[SL02] T. Schattkowsky and M. Lohmann. Rapid Development of Modular

Dynamic Web Sites using UML. In J.-M. Jézéquel, H. Hussmann, and

S. Cook, editors, In Proceedings of the 5th International Conference on the

Unified Modeling Language, Dresden, Germany, volume 2460 of Lecture

Notes in Computer Science, pages 336–350. Springer Verlag, October

2002.

[Wad03] J.P. Wadsack. Enterprise Application Performance Optimization based on

Request-centric Monitoring and Diagnosis. In Proceedings of the Work-

shop on Remote Analysis and Measurement of Software Systems, Port-

land, USA, pages 27–30. IEEE Computer Society Press, May 2003.

[WML02] J. Whaley, M.C. Martin, and M.S. Lam. Automatic Extraction of Object-

Oriented Component Interfaces. In Proceedings of the 2002 International

Symposium of Software Testing and Analysis, Roma, Italy, pages 218–

228. ACM Press, July 2002.

[XA94] XA. Distributed Transaction processing: The XA+ Specification, version

2. X/Open Group, 1994. X/Open Company, Reading, UK.

[YR01] Z. Yu and V. Raijlich. Hidden Dependencies in Program Comprehension

and Change Propagation. In Proceedings of the 9th International Work-

shop on Program Comprehension, Toronto, Canada, pages 293–299. ACM

Press, May 2001.

210

ccxi

INDEX

access layer

generator 127

generator (transactional) 128

transactional 126

adapting phase

see process

annotation 74

application architecture recovery 18

application design recovery 18

application model 37

architectural pattern 112

Data Fusion 116

Data Portal 113

Data Transducer 119

association 44

aggregation 44