Document [original]

MANAGEMENT OF UNCERTAINTY AND INCONSISTENCY

IN DATABASE REENGINEERING PROCESSES

Dissertation submitted in partial fulfillment

of the requirements for the degree of

“Doctor of Science”

in the Department of Mathematics and Computer Science

at the University of Paderborn

Schriftliche Arbeit

zur Erlangung des Grades

“Doktor der Naturwissenschaften”

im Fachbereich Mathematik-Informatik

der Universität Paderborn

Dipl. Inform. Jens H. Jahnke

Universität Paderborn

Fachbereich Mathematik-Informatik

D-33095 Paderborn

Germany

August 1999

ABSTRACT

This dissertation tackles one of the most urgent problems in today’s information technology, namely

the renovation and migration of legacy information systems to modern platforms and net-centric

architectures. In this context, several methods, tools, and processes have been proposed to support

reengineering and modernizations of legacy database applications. This can be a complex task

because many legacy databases have grown over several generations of programmers and lack a

sufficient documentation. Computer-aided reengineering methods and processes have a great

potential to reduce the complexity and risks involved in database design recovery and migration

projects. Still, current reengineering tools are hardly adopted for practical problems in industry

because they often make idealistic assumptions about the structure of legacy systems and the

characteristics of reengineering processes. The goal of this thesis is to provide concepts and

techniques to overcome these severe limitations. In particular, our focus is on developing

mechanisms to manage uncertainty and inconsistency in computer-aided databases reengineering

processes. In practice, uncertain knowledge plays an important role in activities aiming to recover

conceptual design documents for large idiosyncratic implementation structures. This fact is

neglected in current database reengineering methods and tools.

In this dissertation, we identify and extend a theory that provides a suitable basis to deal with

uncertain reengineering knowledge and allows to implement practical tools and environments to

support reengineering processes. The requirement for consistency management considers the fact

that it is unrealistic to presume that database reengineering processes can be executed in a number

of sequential phases or steps without iterations. In practice, larger reengineering projects comprise

many process iterations due to various reasons like incomplete knowledge about legacy

implementation structures or necessary "on-the-fly" modifications of the legacy system. Detecting

and removing inconsistencies caused by such iterations significantly increase costs and durations of

current reengineering projects. In this thesis, we employ graph transformation theory to develop

mechanisms which allow to detect and eliminate inconsistencies between legacy schema

implementations and their abstract representation, automatically. Our results have been

implemented in the database reengineering environment Varlet and evaluated with an industrial

project. They are suitable to complement many existing approaches in the domain of information

system reengineering and migration. As an example, we describe the integration of Varlet with an

existing middleware product for data integration.

ACKNOWLEDGMENTS

During the last five years in Paderborn, many people have influenced my research and significantly

contributed to the results presented in this dissertation. I am especially obliged to Wilhelm Schäfer

who supported me throughout this period and provided an interesting and constructive working

environment. I learned a lot from our fruitful discussions. Many thanks to Hausi Müller who

welcomed me as an external member of his research group at the University of Victoria, B.C., for

nearly four months. His advice in the early phase of structuring my written thesis as well as his

detailed comments on a preliminary version were very valuable and had a significant impact on the

final result. I am grateful to Gregor Engels for many interesting comments and discussions in our

weekly graduate seminars. My special thanks go to Albert Zündorf with whom I shared one office,

several thoughts, and many beers. Over the years, our collaboration has been productive as well as

pleasant. Albert did a great job in proofreading this dissertation.

The achievements made in this dissertation would not have been possible without the practical and

theoretical contributions of a number of graduate students at the University of Paderborn. I was

particularly inspired by supervising several master theses in the context of this project. In his thesis,

Jens Holle implemented the very first prototype of the Varlet schema migration environment. His

work gave us important insight into the benefits and limitations of using triple graph grammars for

conceptual abstraction. Christian Rummel formalized a conceptual data model and elaborated a

catalog of schema redesign transformations. Jörg Wadsack realized the mechanism for incremental

change propagation presented in this dissertation. Heike Schalldach developed a generator for

object-relational middleware components based on graph-oriented schema dependencies. In

collaboration with another local research institute (C-Lab), Ulrich Nickel applied our techniques to

a practical case study in the domain of engineering information systems. Markus Westerfeld

extended Varlet by generic mechanisms (e.g., XML views) to integrate commercial middleware

components like ObjectDRIVER. Melanie Heitbreder implemented the core of the Varlet schema

analysis tool, namely the possibilistic inference engine. Christoph Strebin extended this inference

engine by concepts and techniques that enable self-adaptation of reverse engineering heuristics.

Barbara Bewermeyer developed a flexible detection mechanism for stereotypical source code

patterns. Additionally, I would like to thank all students who implemented the current Varlet user

interface, namely Martin Bierschenk, Frank Eckhardt, Sven Meyer zu Eißen, Hajo Köhler, Ralf

Langer, Carsten Matysczok, Jens Rehpöhler (who also designed the Varlet Web page at

www.upb.de/varlet), and Swen Thümmler. Special thanks to Michael Kisker and Felix Wolf who

worked on this project as research assistants.

I would like to thank eps Bertelsmann, Gütersloh, Germany, for providing us with an industrial-

strength case study for our research prototype. Thanks to the Database Team at CERMICS, Sophia

Antipolis Cedex, France, and the Progres Team at RWTH Aachen, Germany, for their technical

support.

Thanks to Reiko Heckel, Jörg Niere, Heike Schalldach, Jörg Wadsack, and Anke Weber for

proofreading parts of this dissertation. Thanks also to Olaf Neumann and Sabine Sachweh who have

always been eager to discuss new ideas. Jutta Haupt has been a great help in navigating through the

"jungle of paper works" at the University of Paderborn. Many thanks to Jürgen Maniera who did a

great job in keeping the machines running and spreading good mood. I will definitely miss this team.

Still, most important during my studies in Dortmund and Paderborn has been the support and love

of my family and my wife Anke Weber.

To Meta Jahnke.

CONTENTS

List of Figures xiii

List of Definitions xvii

1 Introduction 1

1.1 Background: the dilemma of software legacies . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 Database reengineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Problem definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.4 The approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.5 Dissertation outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2 Database Reengineering - A Case Study 11

2.1 A legacy product and document information system . . . . . . . . . . . . . . . . . . . . . 11

2.2 Migration target: a distributed marketing information system . . . . . . . . . . . . . . 11

2.2.1 Functional requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.2 Technical requirements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.3 Migration strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.4 The reengineering process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.4.1 Legacy schema analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.4.2 Conceptual schema migration and redesign . . . . . . . . . . . . . . . . . . . . . 24

2.4.3 Implementation of changes and a middleware for data integration . . . 26

2.5 Summary and concluding remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3 A Theory to Manage Imperfect Knowledge 33

3.1 Requirements on formalisms to manage DBRE knowledge . . . . . . . . . . . . . . . . 33

3.1.1 Quantitative representation of uncertainty . . . . . . . . . . . . . . . . . . . . . . 35

3.1.2 Representation and indication of contradicting knowledge . . . . . . . . . 36

3.1.3 Reasoning about incomplete knowledge . . . . . . . . . . . . . . . . . . . . . . . . 36

3.1.4 Representation of ignorance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

3.1.5 Computational tractability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2 Evaluation of theories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

3.2.1 Production systems with confidence factors . . . . . . . . . . . . . . . . . . . . . 38

3.2.2 Probabilistic reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

3.2.3 Credibilistic reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

3.2.4 Fuzzy reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

3.2.5 Possibilistic reasoning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3.3 Summary and conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4 GFRN as a Basis for Legacy Schema Analysis 55

4.1 Supporting human-centered schema analysis processes . . . . . . . . . . . . . . . . . . . 55

4.2 Specification of database reengineering knowledge . . . . . . . . . . . . . . . . . . . . . . 57

4.2.1 Informal introduction to GFRNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.2.2 Integration of automatic analysis operations . . . . . . . . . . . . . . . . . . . . 63

4.2.3 Formal definition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

4.3 Knowledge inference with GFRN specifications . . . . . . . . . . . . . . . . . . . . . . . . 76

4.3.1 A fuzzy Petri net model for non-monotonic reasoning . . . . . . . . . . . . . 77

4.3.2 The inference process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.4 Implementing the Varlet Analyst . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.4.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99

4.4.2 User interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

4.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

4.6 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

4.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

5 Conceptual Schema Migration and Data Integration 113

5.1 The migration graph model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

5.1.1 Graph-based representation of logical and conceptual schema . . . . . 116

5.1.2 The schema mapping graph model . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5.2 A graphical formalism to implement schema translators . . . . . . . . . . . . . . . . . 121

5.2.1 Triple graph grammars . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

5.2.2 Mapping variants to class hierarchies . . . . . . . . . . . . . . . . . . . . . . . . . 127

5.2.3 Mapping columns to class attributes . . . . . . . . . . . . . . . . . . . . . . . . . . 132

5.2.4 Mapping inclusion dependencies to relationships . . . . . . . . . . . . . . . 133

5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

5.3 Conceptual schema redesign . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

5.3.1 Schema redesign transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

5.3.2 An extensible catalog of schema redesign transformations . . . . . . . . 138

5.3.3 Complex schema redesign transformations . . . . . . . . . . . . . . . . . . . . 144

5.4 Incremental change propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

5.4.1 The history graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

5.4.2 The propagation mechanism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

5.5 Implementing the Varlet Migrator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

5.5.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

5.5.2 User interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159

5.6 Data integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164

5.6.1 Generating descriptions for relational and object-oriented schemas . 166

5.6.2 Generating object-relational mapping descriptions . . . . . . . . . . . . . . 167

5.7 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

5.8 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

5.8.1 Conceptual schema migration and consistency management . . . . . . . 178

5.8.2 Data integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

5.9 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

6 Conclusions and Future Perspectives 181

6.1 Major contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

6.2 Transferability of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182

6.3 Open problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

6.4 Future perspectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184

A Additional Definitions and Specifications 187

A.1 Interpretation of a logical schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187

A.2 Specification of the migration graph model . . . . . . . . . . . . . . . . . . . . . . . . . . . 188

B A Catalog of Redesign Transformations 195

References 217

Index 233

Abbreveations 237

LIST OF FIGURES

Figure 1.1. Reengineering process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .3

Figure 1.2. Conceptual schema as a starting point for subsequent DBRE activities . . . . . . . . . . .4

Figure 1.3. CARE tool classification according to the role of human knowledge . . . . . . . . . . . . .6

Figure 1.4. Proposed DBRE approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .8

Figure 2.1. Existing Product and Document Information System (PDIS) . . . . . . . . . . . . . . . . . .12

Figure 2.2. Planned Marketing Information System (MIS) . . . . . . . . . . . . . . . . . . . . . . . . . . . . .13

Figure 2.3. Gradual migration strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .14

Figure 2.4. The planned reengineering process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .15

Figure 2.5. Constraints resulting from the schema catalog . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16

Figure 2.6. Potential constraints indicated by naming heuristics . . . . . . . . . . . . . . . . . . . . . . . . .16

Figure 2.7. Detail of PDIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .17

Figure 2.8. Contradicting indicators for key constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18

Figure 2.9. Potential foreign keys indicated by join patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . .18

Figure 2.10. Result of the structural completion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .19

Figure 2.11. Assumed hidden common domain relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20

Figure 2.12. Labeled variants and additional foreign keys of table PRODREF . . . . . . . . . . . . . . .20

Figure 2.13. Variants of table PRODREF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .20

Figure 2.14. Detected optimization and aggregation structures . . . . . . . . . . . . . . . . . . . . . . . . . . .21

Figure 2.15. Implication of relational constraints on the cardinality of relationships . . . . . . . . . .22

Figure 2.16. Summary of analysis results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .23

Figure 2.17. Conceptual schema for PDIS (detail) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .25

Figure 2.18. Extended conceptual schema for MIS (detail) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .26

Figure 2.19. Extended conceptual schema for MIS after iteration (detail) . . . . . . . . . . . . . . . . . . .27

Figure 2.20. Implemented extensions of the logical schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . .28

Figure 2.21. MIS architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .28

Figure 2.22. Design of the middleware layer (detail) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .29

Figure 3.1. Reference architecture of KBS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .34

Figure 3.2. Sample fuzzy sets with continous and discrete membership functions . . . . . . . . . . .45

Figure 3.3. Sample fuzzy sets for fuzzy predicates AName,LargeExt, and MediumExt . . . . . . .47

Figure 3.4. Evaluation summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53

Figure 4.1. The proposed schema analysis process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .56

Figure 4.2. Simple GFRN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

Figure 4.3. Implication with constraint and negation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

Figure 4.4. Implication with conjunction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .59

Figure 4.5. Similarity measures for the seven sample attribute names with the string userid. . .60

Figure 4.6. Implication with threshold . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .60

Figure 4.7. Premise with universal quantifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .61

Figure 4.8. Variable aggregation and composition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62

Figure 4.9. Combination of heuristics in a single GFRN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62

Figure 4.10. Characteristics for classifying automatic analysis operations . . . . . . . . . . . . . . . . . .64

Figure 4.11. GFRN with data- and goal-driven predicates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .64

Figure 4.12. Goal-driven analysis operation validate_IND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .65

List of Figures

XIV LIST OF FIGURES

Figure 4.13. N(validIND2(i,v)) for the case of no counterexamples . . . . . . . . . . . . . . . . . . . . . . . .65

Figure 4.14. GFRN to illustrate the formalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .68

Figure 4.15. Translation algorithm GFRN2NPL1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .69

Figure 4.16. Translation algorithm Impl2NPL1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .70

Figure 4.17. Algorithm OperateGFRN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .73

Figure 4.18. Excerpt of case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .75

Figure 4.19. Necessity degrees for the facts produced by Ω(ANameIsRSName+ID1)(B). . . . . . . .75

Figure 4.20. Necessity degrees for the facts produced by ω(validKey1)(B,validKey1(x)). . . . . . . .76

Figure 4.21. Belief revision phase 1: computation of fuzzy truth tokens . . . . . . . . . . . . . . . . . . . .79

Figure 4.22. Belief revision phase 2: Computation of FBMs . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79

Figure 4.23. The proposed iterative and interactive inference process . . . . . . . . . . . . . . . . . . . . . .82

Figure 4.24. Representation of an expanded GFRN implication (sample) . . . . . . . . . . . . . . . . . . .83

Figure 4.25. Forward and backward expansion (sample) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .84

Figure 4.26. Information sources for inference example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .85

Figure 4.27. GFRN to exemplify the inference process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .86

Figure 4.28. FPN after the first expansion/evaluation cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .86

Figure 4.29. FPN after second expansion/evaluation cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .87

Figure 4.30. FPN after third expansion/evaluation cycle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .88

Figure 4.31. Additional implications to specify necessary conditions for R-INDs . . . . . . . . . . . .88

Figure 4.32. FPN after considering human input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .89

Figure 4.33. Final analsysis result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .89

Figure 4.34. Representation of human assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .90

Figure 4.35. Algorithm GFRNInference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .91

Figure 4.36. Algorithm CreatePlace . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93

Figure 4.37. Algorithm ExpandFPN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .94

Figure 4.38. Algorithm ComputeBindingsForImpl . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .96

Figure 4.39. Algorithm ComplementBindings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .97

Figure 4.40. Example GFRN for termination problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .98

Figure 4.41. Architecture of the Varlet Analyst . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .100

Figure 4.42. Customization Front-End . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .102

Figure 4.43. Customization Front-End (2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .103

Figure 4.44. Analysis Front-End (overview) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .104

Figure 4.45. Analysis Front-End (detail view) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .105

Figure 4.46. Graphical and textual documentation of an analyzed logical schema . . . . . . . . . . .107

Figure 5.1. Incremental schema migration and generative data integration . . . . . . . . . . . . . . . .115

Figure 5.2 Migration graph model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .117

Figure 5.3. Graph test DuplicateClassName . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .119

Figure 5.4. Sample situation: correspondence among variant and inheritance structures . . . . .120

Figure 5.5. Graph production AddRSToLSchema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .122

Figure 5.6. Mapping rule MapRSToClass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .123

Figure 5.7. Reverse production MapRSToClassrv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .124

Figure 5.8. Forward production MapRSToClassfw . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125

Figure 5.9. Startgraph for schema migration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .126

Figure 5.10. Algorithm MapSchema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .126

Figure 5.11. Mapping rule MapVariantToConcreteClass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .127

Figure 5.12. Example RS Tenant with two variants an their conceptual representation . . . . . . .128

Figure 5.13. Example application of rules MapRSToClass and MapVariantToConcreteClass . .128

Figure 5.14. Production MapVariantsToAbstractClassrv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .129

Figure 5.15. Example application of production MapVariantsToAbstractClassrv . . . . . . . . . . . .130

Figure 5.16. Production MapVariantsToInheritancerv . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .131

LIST OF FIGURES XV

Figure 5.17. Example application of production MapVariantsToInheritancerv . . . . . . . . . . . . . .132

Figure 5.18. Mapping rule MapColToAttr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .133

Figure 5.19. Mapping rule MapRINDToAssoc[1:1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .134

Figure 5.20. Mapping rule MapRINDToAssoc[N:0,1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .135

Figure 5.21. Mapping rule MapRINDToAssoc[0,N:0,1] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .135

Figure 5.22. Mapping rule MapIINDToInheritance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .136

Figure 5.23. Catalog of conceptual redesign transformations . . . . . . . . . . . . . . . . . . . . . . . . . . .139

Figure 5.24. Schema transformation SplitClass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .140

Figure 5.25. Schema transformation MoveAttribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .141

Figure 5.26. Schema transformation Generalize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .142

Figure 5.27. Schema transformation PushUpAttribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .143

Figure 5.28. Complex transformation MoveOverAggregation . . . . . . . . . . . . . . . . . . . . . . . . . . .145

Figure 5.29. Basic structure of a history graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .146

Figure 5.30. Template of transformation Generalize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .147

Figure 5.31. History graph model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .149

Figure 5.32. Phase I: forward propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .151

Figure 5.33. Phase II: backward propagation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .152

Figure 5.34. Phase III: reevaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .152

Figure 5.35. Phase IV: translation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .153

Figure 5.36. Transaction PropagateChange . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .154

Figure 5.37. Graph test Generalize_getParams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .155

Figure 5.38. Production Generalize_withParams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .156

Figure 5.39. Architecture of the Varlet Migrator . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .157

Figure 5.40. Using the Progres environment to extend module Redesign Transformation . . . . .158

Figure 5.41. Logical schema after first analysis step and its initial conceptual translation . . . . .160

Figure 5.42. Redesigned conceptual schema (Migration Front-End) . . . . . . . . . . . . . . . . . . . . .161

Figure 5.43. Completed logical schema (top) and updated logical schema (bottom) . . . . . . . . .162

Figure 5.44. Implementation of conceptual extensions (Analysis Front-End) . . . . . . . . . . . . . . .164

Figure 5.45. ObjectDRIVER overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .165

Figure 5.46. Integration of the ObjectDRIVER middleware generator as a back-end for Varlet .166

Figure 5.47. Relational schema description for ObjectDRIVER . . . . . . . . . . . . . . . . . . . . . . . . .167

Figure 5.48. Object schema description for ObjectDRIVER . . . . . . . . . . . . . . . . . . . . . . . . . . . .168

Figure 5.49. Mapping description for classes and subclasses . . . . . . . . . . . . . . . . . . . . . . . . . . . .169

Figure 5.50. Test getClassInstantiationConstraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .169

Figure 5.51. Mapping description for base table attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . .170

Figure 5.52. Test getAttrMappedToColInBaseTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .170

Figure 5.53. Mapping description for remote attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .171

Figure 5.54. Test getAttrMappedToColInRemoteTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .171

Figure 5.55. Mapping description for base table relationships . . . . . . . . . . . . . . . . . . . . . . . . . . .172

Figure 5.56. Test getRelMappedToBaseTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .172

Figure 5.57. Mapping description for remote relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . .173

Figure 5.58. Test getRelMappedToRemoteTable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .173

Figure 5.59. Mapping description for IND-based inheritance relationships . . . . . . . . . . . . . . . .174

Figure 5.60. Test getInheritMappedToI_IND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .174

Figure 5.61. Mapping Description for ObjectDRIVER . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .175

Figure 5.62. MIS application code (example) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .176

Figure 6.1. Self-adapting analysis process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .185

Figure B.1. Transformation Aggregate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .196

Figure B.2. Transformation AssociationToClass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .197

Figure B.3. Transformation ChangeAssocCardinality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .198

XVI LIST OF FIGURES

Figure B.4. Transformation ChangeAttributeType . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .198

Figure B.5. Transformation ClassToAssociation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .199

Figure B.6. Transformation CreateAssociation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .200

Figure B.7. Transformation CreateAttribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .200

Figure B.8. Transformation CreateClass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .201

Figure B.9. Transformation CreateInheritance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .201

Figure B.10.Transformation CreateKey . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .202

Figure B.11.Transformation ConvertAbstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .202

Figure B.12.Transformation ConvertConcrete . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .203

Figure B.13.Transformation DisAggregate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .204

Figure B.14.Transformation Generalize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .205

Figure B.15.Transformation MergeClass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .206

Figure B.16.Transformation MoveAttribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207

Figure B.17.Transformation PushDownAttribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .208

Figure B.18.Transformation PushDownAssociation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .209

Figure B.19.Transformation PushUpAttribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .210

Figure B.20.Transformation PushUpAssociation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .211

Figure B.21.Transformation Remove . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .212

Figure B.22.Transformation RenameClass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .212

Figure B.23.Transformation RenameAttribute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .212

Figure B.24.Transformation RenameRelationship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .213

Figure B.25.Transformation SplitClass . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .213

Figure B.26.Transformation Specialize . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .214

Figure B.27.Transformation SwapAssocDirection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .215

LIST OF DEFINITIONS

Definition 1.1 Legacy software system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1

Definition 1.2 Software reengineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2

Definition 1.3 Reverse engineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .2

Definition 1.4 Database reengineering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4

Definition 3.1 Data model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .37

Definition 3.2 Database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .38

Definition 3.3 Relational database . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .38

Definition 3.4 (Notation) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .38

Definition 3.5 Flattening . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .38

Definition 3.6 Basic probability assignment, focal proposition . . . . . . . . . . . . . . . . . . . . . . . . . .42

Definition 3.7 Combination of evidences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43

Definition 3.8 Belief function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .43

Definition 3.9 Plausibility function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .44

Definition 3.10 Fuzzy set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .45

Definition 3.11 t-norm and t-conorm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .46

Definition 3.12 Fuzzy relation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47

Definition 3.13 Fuzzy logical operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48

Definition 3.14 Fuzzy inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .48

Definition 3.15 Necessity-valued formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .49

Definition 3.16 Classical projection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50

Definition 3.17 α-cut . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .50

Definition 3.18 Partial contradicting set of formulae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51

Definition 3.19 Best model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51

Definition 3.20 Formal system for NPL1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .51

Definition 4.1 Signature of an analyzed logical schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .58

Definition 4.2 Signature of a GFRN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .66

Definition 4.3 Context sensitive syntax . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .67

Definition 4.4 Declarative semantics of GFRNs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .68

Definition 4.5 Extent of a predicate . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71

Definition 4.6 Data-driven analysis operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71

Definition 4.7 Goal-driven analysis operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .71

Definition 4.8 Application context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72

Definition 4.9 Expansion of formulae over a finite universe . . . . . . . . . . . . . . . . . . . . . . . . . . . .72

Definition 4.10 Occurrence of literals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72

Definition 4.11 Semantics of automatic analysis operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . .72

Definition 4.12 Fuzzy Petri net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .78

Definition 4.13 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80

Definition 4.14 Predecessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80

Definition 4.15 Axiom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .80

Definition 4.16 Axiom-based marking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81

Definition 4.17 Grounded place . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .92

Definition 4.18 Derivability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .95

List of Definitions

XVIII LIST OF DEFINITIONS

Definition 4.19 Derivation sink . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .95

Definition 5.1 Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .115

Definition 5.2 Graph production . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .121

Definition 5.3 Application of a production . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .121

Definition 5.4 1-context of a set of nodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .148

Definition 5.5 Context of a transformation application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .148

Definition 5.6 History graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .149

Definition 5.7 Application of transformations to a history graph . . . . . . . . . . . . . . . . . . . . . . .150

Definition A.1 Interpretation of a logical schema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .187

CHAPTER 1 INTRODUCTION

To better meld into the software development practice, CASE tools should adopt a

programmer’s mental model of software projects. In particular, CASE tools should support

soft aspects of software development as well as rigorous modeling, provide a natural process-

oriented development framework rather than a method-oriented one, and play a more active

role in software development than current CASE tools.

Jarzabek and Huang, CACM 8/98. [JH98b]

1.1 Background: the dilemma of software legacies

role of

information

management

Effective and efficient information management is a crucial factor for the competitiveness of

today’s companies. It enables them to respond to changing conditions on a global market,

quickly. Emerging key technologies like the World Wide Web (Web), Object-Orientation (OO),

Client/Server (CS) applications, and open system standards (e.g., CORBA [Vin97], DCE

[Fou92]) greatly influence modern business processes. Besides new applications in the area of

Electronic Commerce [Ten98], there has been increasing interest in using enterprise-wide data

integration to build management information systems and decision support systems [JP92].

While new company start-ups are able to purchase software that takes advantage of the latest

technology, longer established enterprises have to deal with various pre-existing software

systems. In many cases, these pre-existing systems have to be extended or modified to fit new

requirements and exploit emerging technologies. Such modifications are often difficult to

achieve in older software. These systems are usually called legacy software systems (LSS).

legacy systems

characteristics

Many LSS have evolved over several generations of programmers. They do not satisfy the

flexibility and growth requirements of modern enterprises. They were built with focus on

efficiency rather than on interoperability and maintainability. LSS are often badly documented,

which adds to the complexity of achieving modifications. In many cases, technical design

documents are obsolete, have been lost, or have never existed. But even without the need for

extensive modifications, LSS are increasingly expensive to maintain, because they are usually

operated centrally and based on old hard- and software platforms. On the other hand, LSS are of

great value if they incorporate important business knowledge and manage a vast amount of

mission-critical business data. The described characteristics reflected in the following definition

which is a combination of the definitions given by Schick [Sch95a] and Umar [Uma97],

respectively.

Definition 1.1 Legacy software system

Any software system of value that significantly resists modification and evolution to meet new

❑

and constantly changing business requirements is called legacy software system (LSS).

2 INTRODUCTION

dealing with

legacy systems

cold turkey

Enterprises have to handle the dilemma of software system legacies in order to remain

competitive and respond to emerging business requirements. One solution is to replace LSS

completely by new systems that have been rebuilt from scratch and meet the current

requirements. This strategy is called cold turkey [Uma97, BS95]. For complex systems, such a

project might require up to several years and implies a significant risk. During this time, the LSS

is likely to evolve according to urgent business requirements and additional new features. It is

often a problem to ensure that the development of the new software system evolves in step with

the evolving LSS. In general, cold turkey is only cost-effective for software with long expected

lives and high demands for flexibility. However, in case of mission-critical applications, e.g.,

systems that have to be operational for 24 hours a day (e.g., billing systems), this solution is

hardly viable. This is due to the fact that new complex systems are typically far from being

faultless.

reengineering Because of the aforementioned problems to replace existing software systems there has been

increasing interest in concepts, methods, and tools to migrate LSS gradually to fit new

requirements. The corresponding process is usually called reengineering (RE). This strategy

tries to analyze and decompose LSS in order to migrate some of their subcomponents to new

technologies while other legacy components are replaced or remain unchanged [BS95]. Typical

candidates for such components are user interfaces, data management components, and data

processing units. Compared to cold turkey, RE aims to achieve step-wise improvements in

shorter time and with minimized risk. The following definition of the RE process has been

adopted from Chikofsky and Cross [CI90]:

Definition 1.2 Software reengineering

The analysis of an LSS to reconstitute it in a new form and the subsequent implementation of

❑

the new form is called software reengineering (RE).

reengineering

process According to Definition 1.2, software reengineering processes mainly consist of two

subsequent phases, namely reverse engineering (RvE) and forward engineering (FE). RvE

activities investigate an LSS to gain abstract information about its internal structure. The

purpose of these activities are to improve the human understanding of an LSS. Chikofsky and

Cross give the following definition of the RvE process [CI90]:

Definition 1.3 Reverse engineering

Reverse engineering is the process of analyzing a subject system with two goals in mind:

•to identify the system’s components and their interrelationships; and,

❑

•to create representations of the system in another form or at a higher level of abstraction.

The formulation of the above definition (...two goals in mind...) reflects on the fact that the RvE

task is generally considered to be human-intensive, i.e., it requires a well-trained staff and a high

amount of expert knowledge [ALV93, Big90]. Subsequently, the produced abstract information

is used as a basis to plan the necessary modifications of the LSS and estimate the required effort.

Such planning activities are crucial to manage the risk of RE projects [Sne95]. The forward

engineering phase aims to implement the planned changes. Often several iterations between the

different phases are needed to yield the desired target system. Figure 1.1 illustrates the

described evolutionary reengineering process.

DATABASE REENGINEERING 3

1.2 Database reengineering

mass changes

w.r.t. data

representation

Software evolution and maintenance problems might be caused by all kinds of new or changed

requirements. However, in his keynote for the 1998 Working Conference on Reverse

Engineering, McCabe has identified a number of requirements, which are currently of special

importance because they are responsible for significant mass changes in today’s business

software [McC98]. Among these central requirements are the Year-2000 problem [Mar97], the

Euro-conversion problem [Gro98], and the ability to compete on a global, electronic market.

The primary concern of all these requirements is the issue of how business data should

adequately be represented in software systems. The addressed problems range from simple

questions, e.g., for the number of digits that are necessary to represent a date (Year-2000

problem), up to complex architectural decisions, e.g., how to federate data maintained by

diverse (formerly autonomous) information systems and integrate these systems with the Web

to facilitate electronic commerce.

importance of

data structures

If an LSS has to be adapted to one of these requirements, a conceptual documentation of its data

structure is thus often a necessary prerequisite to achieve the maintenance goal. Moreover, a

conceptual data structure is an excellent starting point for the migration to modern

programming languages, as they are usually data-oriented [GK93]. This is because it reflects

major business rules but is fairly independent from procedural application code.

database

reengineering

The importance of a sound understanding of legacy data structures in RE projects has been

pointed out by several researchers and practitioners [Aik95, HEH+98, GK93]. Unfortunately, a

corresponding documentation is missing, obsolete, or inconsistent for many existing LSS. The

process of recovering such a documentation from an LSS is called data reverse engineering

(DRvE) [Aik95]. Today, many existing LSS in business applications maintain data in some kind

of database management system (DBMS). In these cases the described design recovery process

is also more specifically called database reverse engineering (DBRvE), respectively, database

reengineering (DBRE) if a subsequent modification of the LSS is considered.

Figure 1.1. Reengineering process

reverse

engineering

5 EXEC SQL DECLARE c8

CURSOR FOR

16SELECT d FROM

DOCUMENT d, KEYW k

5 EXEC SQL

DECLARE c8

CURSOR FOR

16SELECT d FROM

forward

engineering

legacy system abstract

documentation

target system

planning

plan of needed

changes/effort

iteration

information flow

5 EXEC SQL DECLARE c8

CURSOR FOR

16SELECT d FROM

DOCUMENT d, KEYW k

5 EXEC SQL

DECLARE c8

CURSOR FOR

16SELECT d FROM

4 INTRODUCTION

Definition 1.4 Database reengineering

Database reengineering (DBRE) aims on recovering a consistent conceptual model for the

persistent data structure of an legacy database (LDB). Subsequently, this conceptual model is

used to reconstitute the LDB in a new form. In addition to the LDB schema catalog and the

stored data, DBRE processes might consider the same information sources (and employ

❑

similar techniques) as general RE processes.

In general, DBRvE processes are in general more structured than arbitrary DRvE processes

[Aik95]. Consequently, the potential for tool support and automation is much higher in DBRE.

The main reason for this is that the used DBMS already provides the reengineer with some basic

information about the implemented physical data structure in form of a schema catalog. Still,

important structural and semantic information about the data structure might not be explicit but

indicators for this information might be found in different parts of the LDB, including its

procedural code, stored data, and obsolete documentation. Moreover, domain experts and

developers might be able to contribute further valuable information about the LDB. The DBRE

problem is to find, assess, and merge these indicators to create a consistent conceptual DB

schema (cf. Figure 1.2). In many cases, heuristics and uncertain expert knowledge have to be

employed.

1.3 Problem definition

tool support In the last two decades, many researchers have developed concepts and techniques for

automating certain DBRE activities, in order to reduce the complexity of the DBRE task

[vdBKV97]. Many of these approaches have been implemented in computer-aided RE (CARE)

tools and some of them have proven to be useful for practical applications. CARE tools seem to

have great potential to assist the reengineer, e.g., by performing laborious analysis steps,

browsing information about legacy software artifacts, and guiding the DBRE process. However,

data

code

schema

developer

name

Apartment

Tenant

MainTenant

is a

rent

hires

is a

Tenant

ApHouse

house_id flats

street city

SubTenant

has

name

Apartment

Tenant

MainTenant

rent

SubTenant

has

street city

hires

is a is a

Tenant

domain expert

documentation

obsolete

data

integration

migration

analyze

dynamic

parts

conceptual schema

developer

Figure 1.2. Conceptual schema as a starting point for subsequent DBRE activities

catalog

                   

                       

                   

        

         ! "   ! #  

 $    "   $    % & 

PROBLEM DEFINITION 5

such tools are rarely used in industry [Sto98]. Researchers and practitioners have identified the

most significant reasons for this as their lack of customizability [MNB+94] and human-

awareness [JH98b, Sto98]. Furthermore, they do not allow for incremental and iterative RE

processes [WSK97].

customizabilityCustomizability is a crucial requirement on CARE tools, because LDBs differ with respect to

many technical and non-technical parameters. They are based on diverse (old) hard- and

software platforms, use miscellaneous programming languages, contain various optimization

structures and arcane coding concepts (idiosyncrasies [BP95, HHEH96]), and comprise

different naming conventions. Furthermore, DBRE projects may be driven by a great variety of

different goals. Such goals range from fixing defects (e.g., Year-2000-Problem [Mar97]), over

extending or integrating data structures, up to completely changing the architecture of an LDB,

e.g., migrating from a procedural, monolithic, and autonomous legacy system to an open,

distributed, and object-oriented application.

While current compiler technology allows to generate parsers for different programming

languages based on abstract specifications [Slo95], most existing CARE tools still employ

general-purpose programming languages to implement RE heuristics, analysis operations, and

processes. As a consequence, these heuristics and processes can hardly be customized for

changing application contexts. Some more advanced approaches aim to tackle this problem by

providing application programming interfaces (APIs) [BM98] or interpreters for scripting

languages [Rat98, TWSM94]. Such interfaces offer the flexibility to adapt CARE tools with less

effort or even without the need for recompilation. Still, a limitation of these approaches is their

low level of abstraction: RE heuristics and processes have to be programmed in form of

procedural scripts, even though they would be more adequately described in a declarative

formalism, e.g., in form of textual or graphical rules [SLGC94, HK94, PS92]. Moreover, CARE

tools should provide an open architecture, i.e., they should allow the integration of other tools

(e.g., parsers, analyzers, extractors, and transformers).

human-awarenessOne of the most valuable information sources in RE are humans. Developers, operators,a and

domain experts might be able to contribute important knowledge about a subject LDB. Hence,

CARE tools should be human-aware, i.e., they should consider human knowledge and

interaction in the supported (DB)RE process. The human-awareness of existing CARE tools

can be characterized according to two main aspects. The first aspect regards the role of human

knowledge in the RE process, while the second aspect regards its representation.

a In order to clearly distinguish between the user of a CARE tool and a user of an LSS, we use the term operator

whenever we refer to an LSS user.

role of human

knowledge

A comparison of CARE tools according to the role of human knowledge concerns the question:

at which point in the supported RE process is human knowledge considered? We classify

existing approaches as either human-excluded,human-involved, or human-centered

(cf. Figure 1.3).

6 INTRODUCTION

Human-excluded CARE tools perform fully-automatic analysis or conceptual abstraction

operations on a subject LSS (e.g., [Fon97, RH97, Hüs98, FV95, MCAH95, MN95, SLGC94,

Wil94, RHSR94, LS97, vDM98]). As an output, they produce (a number of) analysis reports

that can be used as a starting point for manual semantic abstraction and redesign activities.

Human knowledge and intervention is not considered in these batch-oriented tool processes.

Many CARE tools involve humans in partly interactive RE processes. Such approaches usually

start with an automatic analysis of the LSS in order to extract important structural information.

Based on the analysis results, the user can subsequently explore the LSS, and interactively add

further semantic abstractions. Examples for such more sophisticated approaches are

[Hol98, KWDE98, LO98, Nov97, FHK+97, BGD97, YHC97, ONT96, SM95, MAJ94, AL94,

MWT94]. We call these tools human-involved as opposed to human-centered, because human

knowledge is only considered in the second stage of the supported RE process (abstraction/

redesign, cf. Figure 1.3). Finally, we denote CARE tools as human-centered, if they enable

interactive RE processes including both kinds of activities, software analysis and abstraction/

redesign, e.g., [HEH+98, AG96, MNS95].

abstraction

intermediate

documentation

analysis

conceptual

human-centered tool

human-involved tool

human-excluded tool

Figure 1.3. CARE tool classification according to the role of human knowledge

analysis

intermediate

documentation conceptual

design

intermediate

documentation conceptual

design

/ redesign

abstraction

/ redesign

abstraction

/ redesign

design

5 EXEC SQL DECLARE c8

CURSOR FOR

16SELECT d FROM

DOCUMENT d, KEYW k

5 EXEC SQL

DECLARE c8

CURSOR FOR

16SELECT d FROM

LSS

5 EXEC SQL DECLARE c8

CURSOR FOR

16SELECT d FROM

DOCUMENT d, KEYW k

5 EXEC SQL

DECLARE c8

CURSOR FOR

16SELECT d FROM

LSS

5 EXEC SQL DECLARE c8

CURSOR FOR

16SELECT d FROM

DOCUMENT d, KEYW k

5 EXEC SQL

DECLARE c8

CURSOR FOR

16SELECT d FROM

LSS

THE APPROACH 7

representation of

human knowledge

The second aspect of human-awareness considers the way how human knowledge is

represented in CARE environments. RvE activities deal with various heuristics that deliver

uncertain analysis results and reengineers have uncertain assumptions about the internal

realization of LDBs. Existing CARE tools do not consider this human mental model and

represent assumptions and analysis results without a measure for their confidence. Furthermore,

RvE activities generally have an evolutionary and explorative nature. It frequently occurs that

heuristics deliver contradicting analysis result, i.e., the reengineer discovers indicators in favor

of a given hypothesis as well as against it. Current CARE tools do not tolerate such

contradictions and most of them do not even indicate them. This is a severe limitation because

in a later stage of the RvE process it might become clear that the hypothesis that has been

chosen in such a situation has to be refuted. In this case, the knowledge about the indication of

its alternative has been lost due to the inability to represent “both sides of the coin”.

iterationsAnother problem of currently existing CARE tools is their limited support for process

iterations. They usually assume that the process of knowledge accumulation is monotonic and

prescribe a strictly phase-oriented methodology. In practice, this is an important limitation as

iterations between analysis and abstraction steps occur frequently: When a reengineer learns

more about the abstract design of an LSS, (s)he often refutes some initial assumptions or does

some further investigations. For example, as soon as an intermediate abstraction of an LSS has

been created it can be discussed with domain experts which might elicit additional information.

In many cases, this new information contradicts to some initial assumptions. Strictly phase-

oriented tools do not aid the reengineer in detecting and resolving such inconsistencies. In case

of iterations with early analysis activities the reengineer loses the work (s)he has performed

interactively in later abstraction and redesign activities.

1.4 The approach

In this dissertation, we developed concepts and techniques that allow to build CARE

environments which overcome the aforementioned limitations of current approaches in the

DBRE domain. We propose a process that consists of two main phases, namely schema analysis

and conceptual schema migration (cf. Figure 1.4). In the first phase, the different parts of the

LDB are analyzed to obtain a consistent and complete logical schema for the implemented

physical representation. In the second phase (migration), this logical schema is transformed into

a conceptual schema that is a suitable basis for subsequent maintenance activities like schema

extension and federation, data integration, distribution, and code migration.

GFRN to achieve

customizability

We developed a dedicated graphical language named Generic Fuzzy Reasoning Nets (GFRN) to

customize the analysis process of CARE tools according to their specific application context.

GFRN specifications separate declarative knowledge from operational aspects. They provide a

high level of abstraction and extensibility. Analysis operations that have been developed in other

(DB)RE approaches can easily be integrated with GFRN specifications. We implemented a

prototype CARE environment that is parameterized by GFRN specifications and includes a

customization front-end for this purpose.

8 INTRODUCTION

analysis guided

by possibilistic

inference engine

We reflect the mental model of the reengineer by representing RvE knowledge in the framework

of possibility theory [DLP94]. This approach allows to deal with uncertain and contradicting

analysis results. We developed a non-monotonic inference engine (IE) that supports the

reengineer in his/her DBRvE activities by propagating and indicating measures of credibility

and contradiction. For this purpose, the IE interprets the declarative knowledge that has been

specified in the GFRN specification. In addition, the IE is also capable of executing the analysis

operations that are specified in the GFRN. This is done automatically during the DBRE process

to search for indicators or validate intermediate hypotheses. With this approach, we obtain a

CARE tool that plays a more active role in the DBRE process than existing tools.

user interaction A graphical dialog component visualizes the current knowledge about the persistent structure of

an LDB to the user. This component provides powerful abstraction and query mechanisms to

focus the reengineers attention on the most controversial parts of the legacy schema. It enables

the reengineer to enter the results of manual investigations or add new hypotheses that might be

falsified or supported by the IE. Hence, our approach intertwines automatic and manual analysis

activities in an explorative and evolutionary process that is guided by the IE until a consistent

(and definite) logical schema is obtained.

iterations between

analysis and

migration

We applied graph grammars [Roz97] to map the analyzed logical schema into a conceptual

(OO) data model. The resulting conceptual schema can interactively be enhanced and

redesigned to exploit additional abstraction mechanisms and migrate to new requirements. The

available redesign operations are formally defined by graph transformation rules. Based on this

formalization, we developed a consistency management component that incrementally

propagates modifications of the logical schema to its (redesigned) conceptual representation in

migration

logical schema

conceptual schema

data code schema documentation

obsolete

catalog

                 

                     

   

                

           

         ! "   !

schema

analysis

extension migrationfederation integration distribution

Figure 1.4. Proposed DBRE approach

domain expert

reengineer (relational model)

schema

cycl_join

i1: 0.7 v2⊆v1

i2: 0.3

v1i10: 1.0

i7: 0.6

sel_dist

key IND

validIND

validKey

i3: 1.0

i5: 1.0

v2=π2(v1)

i9: 1.0

equiv

i8: 0.5

tcomp

nsimilar

i6: 0.8

v2v1v1

GFRN

specified by

(object-oriented model)

Graph

Grammar

specified by

THE APPROACH 9

case of process iterations. This unburdens the reengineer from the error-prone and time-

consuming task to determine such inconsistencies manually. The developed consistency

management component can be viewed as an adaption of general techniques described in

[Nag96] to the (DB)RE domain.

The particular research contributions of this dissertation have partly been published in [JSZ96,

JSZ97, JZ97, JNW98, JH98a, JZ98, JW99b, JZ99, JS99, JW99a, JW99c, JSWZ99] and can be

summarized as follows:

•We elaborated a catalog of requirements on a theory to manage imperfect knowledge in

human-centered DBRE environments.

•We used these requirements to evaluate the appropriateness of major theories for managing

imperfect knowledge with respect to the application domain of DBRE. We showed that

fuzzy set theory and possibilistic logic [DP88] provide a suitable basis for our application.

•We defined Generic Fuzzy Reasoning Nets (GFRNs), as an abstract, graphical formalism to

specify domain-specific DBRE knowledge and schema analysis processes.

•In order to interpret GFRN specifications, we developed a non-monotonic inference engine

(IE) that allows for user interaction and automatically executes specified analysis operations

to refute or support hypotheses. As a basis for this IE, we extended the fuzzy Petri net (FPN)

model described by Konar and Mandal [KM96] to represent and propagate contradicting

situation-specific knowledge.

•We employed graph grammars [Roz97] to map the relational data model to a formally

defined conceptual data model.

•We formalized conceptual redesign operations by graph transformation rules.

•We defined a data structure (called migration graph model) that represents the mapping

between the logical (relational) schema and its conceptual (object-oriented) representation.

The migration schema is updated incrementally with every redesign operation and allows to

develop

•a consistency management mechanism that incrementally propagates changes in the

logical schema to the conceptual schema to support iterations in the DBRE process;

•an update mechanism that automatically implements semantic modifications of the

conceptual schema to the logical schema;

•an automatic generator for textual schema mapping descriptions as the input for

commercial off-the-shelf (COTS) middleware components. In particular, we describe the

integration of the object-relational middleware product ObjectDRIVER [CER99] which

provides seamless integration of distributed object-oriented applications and legacy data

according to the ODMG standard [CBB+97].

•We implemented our approach in a prototype DBRE environment (called Varlet) and we use

an industrial case study for evaluation purposes.

10 INTRODUCTION

1.5 Dissertation outline

Chapter 2 This dissertation is organized as follows. In the next chapter, we characterize the application

domain of DBRE with a motivating sample scenario that summarizes our experiences with an

industrial case study. By means of this scenario, we point out a number of observations in order

to provide the motivation for CARE tools that are able to deal with uncertain reengineering

knowledge and process iterations. We revisit details of this scenario throughout the dissertation

to exemplify and evaluate different aspects of our approach.

Chapter 3 Chapter 3 elaborates central requirements on a formalism to represent and reason about DBRE

knowledge in human-centered CARE environments. We use these domain-specific

requirements to evaluate different theories on managing imperfect knowledge with respect to

their suitability for the application domain of DBRE.

Chapter 4 Subsequently, we introduce Generic Fuzzy Reasoning Nets (GFRN) as a dedicated formalism to

specify, execute, and customize DBRE heuristics and processes. The execution of GFRN

specifications is based on an inference mechanism that employs an FPN model which enables

non-monotonic reasoning under uncertainty and contradiction. The last part of Chapter 4

describes an implementation of these concepts and mechanisms in a customizable DBRE tool

that guides the user in an evolutionary schema analysis process.

Chapter 5 In Chapter 5, we describe a flexible approach to database schema migration based on graph

transformation systems. We employ graphical mapping rules to yield an initial translation that is

subsequently enhanced by applying an extensible set of conceptual redesign transformations.

Redesign transformations are defined by graph productions which fosters human

comprehension and provides a sound basis for their semantics. Furthermore, we describe a

mechanism for incremental consistency management to support iterations between intertwined

analysis and redesign steps. After the LDB schema has been analyzed and migrated to a suitable

conceptual target schema, we describe the integration of middleware components that

encapsulate LDBs with object-oriented interfaces. This flexible approach facilitates gradual

migration to new technologies like object-orientation, Java, and the Web, because autonomous

legacy applications can be preserved.

Chapters 6 Both technical chapters (4 and 5) are closed with a discussion and evaluation of our results with

practical case studies and a comparison with related work. Finally, Chapter 6 provides a

summary of our major contributions and identifies a number of open problems and future

directions of our approach.

CHAPTER 2 DATABASE REENGINEERING-

A CASE STUDY

In this chapter, we introduce a database reengineering (DBRE) sample scenario that

summarizes some experiences we made in an industrial project with two German companies.

The reason for this chapter is to elicit characteristic observations about DBRE activities to

motivate our approach. It presents one coherent application example that integrates the

different aspects covered in this thesis. We will revisit this example throughout the dissertation

and it will be used to evaluate the developed concepts and techniques. Even though the

background of the presented scenario is a concrete industrial case study, the presented

implementation details have been changed to protect copy rights, simplify the presentation, and

consider experiences with other projects. We presume that the reader is familiar with the basic

terminology of relational DBs. An excellent introduction to this subject is given by Elmasri and

Navathe [EN94].

2.1 A legacy product and document information system

PDIS overviewThe case study deals with a legacy product and document information system (PDIS) of an

international enterprise that produces a great variety of drugs and other chemical products. The

original version of PDIS was developed in Cobol [McC75] as an information system to

maintain the company’s product catalog using a relational DB (DB2) [Dat84]. Subsequently,

the functionality was extended to support the maintenance of information about documents

related to stored products. PDIS has evolved over more than ten years and has been subject to

many modifications (e.g., according to new national and international safety regulations and

changing organizational structures). PDIS has become the central source for information about

more than 100,000 products currently available and it contains more than 50,000 references to

documents including product specifications, safety classifications, research reports, and

marketing statistics. PDIS is accessed by members of the central hotline at the company

headquarters. They use it to answer questions of customers, product managers, and researchers

about products and available documents (cf. Figure 2.1). The functionality of the system is

considered to be mission-critical, i.e., the company depends on the service provided by PDIS.

PDIS problemsToday, PDIS has become increasingly expensive to operate. It requires a huge staff of hotline

members at the company headquarters to answer all ingoing inquiries. Different international

time zones demand for extended business hours of this hotline service. Furthermore, old hard-

and software platforms and the lack of a consistent technical documentation result in serious

problems in maintaining the system.

2.2 Migration target: a distributed marketing information

system

migration

objectives

Because of the problems described above, the Information Technology (IT) department plans

to employ Internet-technology to establish a distributed Web-based marketing information

12 DATABASE REENGINEERING- A CASE STUDY

system (MIS) that covers and extends the functionality of PDIS. The aim of this project is to

reduce expenses and increase the availability and currency of the stored data.

2.2.1 Functional requirements

Analogously to the old PDIS, the primary purpose of the planned MIS is to store and retrieve

up-to-date product information. However, the new system aims to provide customers and

employees with direct access to product data (24 hours a day). This will drastically reduce the

expenses to operate the central hotline service at the company headquarters. Another goal is to

improve the availability of information and unburden the company’s marketing department by

providing on-line versions of frequently accessed documents. Moreover, the IT department

aims to integrate their sales, distribution and financial controlling system (SAP R/3, [KKM98])

with the new MIS, to increase the currency of marketing information. User statistics created by

MIS will be transferred to the company’s data warehouse that is used for strategic planning. A

schematic overview of the planned MIS is given in Figure 2.2.

Other functional requirements on MIS are implied by the heterogeneity of its prospected users.

In contrast to the old PDIS, the new system will not be accessed by well-trained staff but by

geographically distributed users in various roles and with different knowledge about the

system. Thus, MIS has to provide a simple user-interface with on-line help for inexperienced

customers, as well as more sophisticated access mechanisms for experts (e.g., product

managers). Furthermore, there have to be authorization and security mechanisms, as certain

information may not be publicly available.

database

hotline

customers

product

researchers

managers

product information,

reports, and statistics

central

12 hours/day

Figure 2.1. Existing Product and Document Information System (PDIS)

MIGRATION STRATEGY 13

2.2.2 Technical requirements

The most important technical requirement on MIS is for high availability. This is due to the fact

that its functionality is considered to be mission-critical. Consequently, the MIS client has to

run on a wide range of different hard- and software platforms and the MIS server has to be

reliable and fault tolerant. Of course, the new MIS should be extendable and overcome the

maintenance problems of PDIS.

2.3 Migration strategy

The object-oriented programming language Java [GJS97] was chosen to implement the MIS. It

facilitates to meet most technical requirements, because of its support for distributed,

heterogeneous architectures and its built-in security concept. In addition, the IT department has

selected the Unified Modeling Language (UML) [RJB99, BRJ99] to specify and document

MIS.

In order to be able to test and improve the reliability of the new system, the company plans to

migrate gradually from the old PDIS to the new MIS. This means that both systems have to be

operated in parallel until the MIS runs stable. During this period, customers and employees

will still have the possibility to access information via the hotline service. This strategy entails

that both systems must access the same up-to-date information. One possible solution was to

create a completely new DB for the MIS and periodically replicate and propagate information

changes between the MIS and the PDIS. However, this would result in temporary

inconsistencies and a low currency of stored information. Thus, the IT department decides to

integrate both systems by using a common DB. The plan is to decompose the PDIS data

management component to extend and reuse it in the MIS. The current realization of this

component in DB2 facilitates this decomposition. However, a conceptual design of the

implemented data structure is not available. Thus, the data management component has to be

reverse engineered before it can be extended.

database

customers

product

researchers

managers

online

controlling

Web-gateway

World Wide

Web

information

system

Figure 2.2. Planned Marketing Information System (MIS)

24 hours/day

14 DATABASE REENGINEERING- A CASE STUDY

As the information provided by the MIS will be an extension of the data stored in the PDIS, the

necessary changes of the DB schema can be made in a way that preserves compatibility with

the procedural rest of the PDIS. This allows to run the legacy application with no or only little

modification on top of the extended DB schema. The integration of the common DB with the

object-oriented rest of the new MIS will be realized by an object-relational middleware layer.

The purpose of this layer is to perform the transformation between Java-objects and relations.

The implementation of the middleware will be based on the standardized Java database

connectivity (JDBC [PM96, YLQ98]). The resulting gradual migration strategy is illustrated in

Figure 2.3.

2.4 The reengineering process

The planned RE process consists of several subsequent activities, which are shown in

Figure 2.4. The first two activities aim to recover an abstract model for the persistent data

structure of PDIS. The first step is to extract the available schema catalog from DB2. Then, the

resulting physical schema is structurally completed and semantically enriched by adding

further information gained in analyzing PDIS and querying human experts. This analysis

activity produces the logical schema (cf. Figure 1.4). Subsequently, the next activity

(migration) maps the logical schema to a (semantically equivalent) conceptual schema in

UML-notation.

The following activities are forward engineering steps. The first step (redesign) extends the

reverse engineered conceptual schema according to the additional requirements on MIS. The

resulting extended conceptual schema is the basis to employ UML to specify dynamic

properties of MIS (e.g., object interaction and classes for transient objects). Moreover, the

extended conceptual schema also serves as input for an activity to modify the implemented DB

schema accordingly and develop the object-relational middleware for MIS. Finally, the last two

activities aim to implement the Java classes for MIS and migrate the legacy data to the

extended relational schema.

DB2

PDIS

reengineering DB2 (ext.)

MIS

OO-REL

middleware

(COBOL)

(Java)

current

architecture intermediate

architecture target

architecture

Figure 2.3. Gradual migration strategy

THE REENGINEERING PROCESS 15

problem of

inconsistency

Subsequently, we will use our sample scenario to exemplify each activity in the shaded area of

the displayed RE process (Figure 2.4). We will point out typical observations and experiences

made with each activity. In particular, we will also show that problems might be caused by

inconsistencies among documentsa used in different stages of this process. Such

inconsistencies might arise for several reasons, e.g., due to process iterations or human failures

during manual activities. In our sample scenario, we will exemplify one instance of such an

iteration. However, according to the evolutionary and explorative nature of DBRE processes,

often several such iterations occur in practice.

2.4.1 Legacy schema analysis

a We use the common term documents to denote the various representations of information used as in-

or output of RE activities.

activity overviewThis activity starts with the physical schema catalog extracted from the LDB. In most cases,

this schema catalog lacks important structural and semantical information that is needed in

order to recover a correct logical schema. Furthermore, legacy schemas often comprise

optimization structures and de-normalizations that obscure their original meaning. The goal of

this first activity is thus to detect these de-normalizations and hidden constraints in order to

produce a structural complete and semantically enriched legacy schema (cf. [FV95]).

PDIS exampleLet us revisit our sample scenario to exemplify this activity. Figure 2.7 shows a small detail of

different parts of PDIS, including its schema catalog, a small snapshot of its data, and four

selected segments of its procedural code. Moreover, Figure 2.7 illustrates that human experts

are another valuable source of information about an LDB. For the sake of simplicity, we will

refer only to this detail of PDIS throughout this sample scenario. That means we will only

consider the eight relational tables shown, instead of all 85 tables of the real case study.

PDIS

(COBOL)

DB2

MIS

OO-REL

middleware

(Java)

name

Apartment

Tenant

MainTenant

rent

SubTenant

has

street city

hires

is a is a

Tenant

DB2

name

Apartment

Tenant

MainTenant

rent

SubTenant

has

street city

hires

is a is a

Tenant

Tenant ApHouse

house_idflats

analysis of

LDB schema

analyzed

logical schema

migration

conceptual

redesign

conceptual modify

log.schema

and develop

middleware

extended

log.schema

migrate data

specify

dynamic

model implement

Figure 2.4. The planned reengineering process

' ( ) * + , - ( - , .

/ ) * + , - ( - , .

- 0 1 2 3 4 * , - 2 0 5 2 6

7 8 9 8

iterations

schema schema (ext.)

16 DATABASE REENGINEERING- A CASE STUDY

Structural completion

In case of a relational database, the activity of structural

completion mainly consists of detecting all key

candidates and foreign keys, which are not declared

explicitly in the schema catalog. According to

Figure 2.7, the schema catalog of PDIS includes

definitions of tables and some index structures. A

definition of a unique index trivially implies a key

constraint on the corresponding table. Hence, the schema catalog gives information about the

five key constraints displayed in Figure 2.5.

heuristics

as indicators In order to detect additional key candidates, the reengineer has to make further investigations.

Typically, (s)he has to make use of heuristics to find indicators for the desired information.

Such heuristics for relational information systems are described for example in [HHEH96,

FV95, BP95, PKBT94, PB94, SLGC94, And94, ALV93].

naming

conventions Simple, commonly used heuristics check

column names for typical naming

conventions. In our example, the reengineer

assumes that the id columns of tables

DOCREF and PRODREF represent key

values, as many programmers use similar

names to label key columns. Likewise, (s)he

assumes usrid to represent a key column of

table USER, because keys are often named

after their tables with a supplemented string

“id”. Furthermore, the reengineer might

check column names in different tables for

similarities in order to detect foreign key

constraints. Obviously, such potential

foreign keys have to be type compatible with their referenced columns. Moreover, there should

be a key constraint over the columns referenced. One possible application of this heuristic to

our example is to infer a potential foreign key from columns cg and pg of table PRODUCT to

the equally named columns in table PRODGRP. Analogously, the reengineer might also

compare column names with names of tables. For example, (s)he might notice that the name of

column usr of table DOCUMENT is very similar to the name of table USER. This suggests a

foreign key constraint between these two tables, even despite the fact that column usr

(DOCUMENT) is not exactly type compatible with the referenced key column usrid (USER).

Such slight type incompatibilities among columns with identical meaning occur frequently in

LDB applications. Naming heuristics similar to those described above can be used to indicate

the rest of the potential constraints listed in Figure 2.6.

Key: COMGRP(cgid)

Key: PRODGRP(cg,pg)

Key: DOCUMENT(docno)

Key: PRODUCT(no,pg,cg)

Key: KEYW(keyw,seqn)

Figure 2.5. Constraints resulting

from the schema catalog

Key?: DOCREF(id)

Key?: PRODREF(id)

Key?: USER(usrid)

Foreign key?: PRODUCT(cg,pg) ->

PRODGRP(cg,pg)

Foreign key?: DOCUMENT(usr) -> USER(usrid)

Foreign key?: PRODGRP(cg) ->

COMGRP(cgid)

Foreign key?: DOCREF(sdoc), DOCREF(tdoc) ->

DOCUMENT(docno)

Foreign key?: KEYW(doc1)...KEYW(doc5) ->

DOCUMENT(docno)

Figure 2.6. Potential constraints indicated

by naming heuristics

THE REENGINEERING PROCESS 17

+ 3

* ,

, * : ;

9 < = >

' ) / ?

- @ A B C D ) E ) ' F

@ 2 + A B C D ) E ) ' F

, @ 2 + A B C D ) E ) ' G

+ 3

* ,

- 0 @

9 H <

' ) / B

< I

2 0

< = >

' ) / ?

@ 2 + G F

+ 3

* ,

, * : ;

9 J

= < K >

D ?

0*4

> L M

' ? N O G F

0 2 A B C D ) E ) ' F

P Q

A B C D ) E ) ' F

A B C D ) E ) ' G G F

+ 3

* ,

9 7

0 - R

7 9

- 0 @

9 H J

= <

< I

2 0

= < K >

D ? 0 2 S

P Q

S +

G F

+ 3

* ,

, * : ;

9 J

= <

' ) / ?

- @ A B C D ) E ) ' F

P Q

A B C D ) E ) ' F

3 2 @ A B C D ) E ) ' F

A B C D ) E ) ' F

@ 2 + A B C D ) E ) ' G

+ 3

* ,

- 0 @

9 H J

' ) / B

< I

2 0

= <

' ) / ? @ 2 + G F

+ 3

* ,

, * : ;

9 K T

) ' ?

7 8

3 - @ A

> L M

' ? U O G F

0*4

> L M

' ? N O G F

, A

> L M

' ? U V G F

0*4

> L M

' ? U V G F

* @ @ 3 A

> L M

' ? W O G F

; 2 A

> L M

' ? U V G F

;

> L M

' ? U V G G

+ 3

* ,

, * : ;

9 X

) Y Z ?

[

. 6 A

> L M

' ? \ O G F

8 9

R 0 A B C D ) E ) ' F

@ 2 + U A B C D ) E ) ' F

@ 2 + \ A B C D ) E ) ' F

@ 2 + ] A B C D ) E ) ' F

@ 2 + W A B C D ) E ) ' F

@ 2 + N A B C D ) E ) ' G

+ 3

* ,

9 7

0 - R

7 9

- 0 @

9 H X

) Y Z B

< I

2 0

) Y Z ? [

. 6 S

8 9

R 0 G

+ 3

* ,

, * : ;

9 > = ^

E '

- @ A B C D ) E ) ' F

0*4

> L M

' ? U V G G F

+ 3

* ,

9 7

0 - R

7 9

- 0 @

9 H > = ^

E '

< I

2 0

> = ^

E '

? +

- @ G F

+ 3

* ,

, * : ;

9 J

= <

E '

A B C D ) E ) ' F

4 * 0 *

3 A

> L M

' ? W O G F

P Q

A B C D ) E ) ' F

0 * 4

> L M

' ? U V G G F

+ 3

* ,

9 7

0 - R

7 9

- 0 @

9 H J

= <

E '

< I

2 0

= <

E '

? +

P Q

G F

+ 3

* ,

, * : ;

9 < = > K ^

) C D ?

@ 0 * 4

> L M

' ? \ N N G F

@ 2 + 0 2 A B C D ) E ) ' F

( * ; - @ A

> L M

' ? V G F

, _ 2 3 A

> L M

' ? \ N N G F

7 8

3 A

> L M

' ? ] O G F

3 @ A B C D ) E ) ' G F

+ 3

* ,

9 7

0 - R

7 9

- 0 @

9 H < = >

< I

2 0

< = > K ^

) C D ? @ 2 + 0 2 G F

Figure 2.7. Detail of PDIS

` a b c ` d e f g f

L h =

? @

9 8

, i G U O j i U O O \ O U O

; * + / ;

9 H k

' U

@ V \ U V V O ]

; -

k L

N O U l ] l O V

I k

N O O 3

+ , - 2 0 \ V m N O ]

g f n o ` a b c

] ; * + R

7 9

U O ; * : 2 3 * , 2 3 . + _

4 - + * ;

V 4

@ - + * ;

8 7 P P

; .

PRODUCT

COMGRP

g f b a ` a f c p e f f p e ` a b c

]

+ _ O ] V O 4

, * ; * @ _

9 8

- (

; * + R

7 9

U O 4

; O V \ O ; - R

- @ ; * : 2 3 * , 2 3 . + _

4 - + * ;

V :

9 8

O U l O

_ * 3 4 *

PRODGRP

o ` a b c o d g ` d q a r n o a s t u d p s v p p o

* ;

9 8

i V l w W V w U i \ i m V

7Q9

3 U O U

8P9

+ - x + i

L h =

? @ G W w l ] m ] U i O l i m m C -

V O

5 .

3 4

, * ; ; * + W U \ O O W ] U i U \ i m w

9 9

; U m O

+ 2

, * , 4 , U y m m V N V l ] U i U \ i m V D _

0 U O U

DOCUMENT

n o e f e p d o g f o d g

] C

j j C

j j N m V N V l

] l O U l ] V l w W V w

V V O C

j j ] U \ O O W

\ C

j j C

j j m V N V l

PRODREF

n o v o d g t o d g

\ m V N V l l w W V w

W m V N V l m W w l ]

U U \ O O W w l ] V O

\ V \ l N l O ] l N W

DOCREF

s v p n o ` a b c o e t v ` a b c a o o p t c r d t c r e

] z 2 _ 0 {

9 8

O U

j * : ] W O l O \ O C

j j

U O

* 0 1 3

+ _ 4 - , |

J > < 8

+ _ O \

1 x +

Z W N O ] N ] O N V w V w

L 9

- 0 3 - + _

^ 7

; ;

; O V

4 {

] N O V ] ] U N \ w U V

USER

}

c ~  v c  ` o d g  o d g  o d g  o d g  o d g 

, 2 4 2 : - ;

W O U \ O O W U \ O O N U \ O O l U \ U O V m O \

, 2 4 2 : - ;

W U V m O ] V m O W C

j j C

j j

, * , -

, - + ] O W l w W V w m V N V l W V N l ] C

j j C

j j

KEYW

O U

= >

)

< K

' )

B  B

C i

i i i

g d o c v c f b c ` t  

O \ )

)

> T 

) j )

U S

\ B C D

8 k

+ U S

8 k

+ \

O ] / '

= ^ K T

) '

U S

K T

) '

\ S

<= >K^

) C D @

O W Z

) ' ) @ 2 + 0 2  A @ 0 2

@ i

7 8

3 

U i

7 8

3 - @

O N

< 7

\ i @

, 

U i @

< 7

\ i * @ @ 3 

U i * @ @ 3

O l

U i

0*4



\ i

0*4

) C

< k

)

i i i

g d o c v c f b c ` t  

O w )

)

> T 

) j )

D B C

D  B C D

8 k

O V / '

= ^ K T

) ' Z

) ' )

0 * 4

 A

C * 0 @ @

,  A

)

O m ) C

< k

)

i i i

g d o c v c f b c ` t  

U O )

)

> T 

)

' ) + U

> K

T =

' /

U U

) j )

/ '

= ^ J

= < K >

< = > K ^

) C D @ S

= <

' ) / 3

U \ Z

) ' ) , - , ;

 A D

3 i @ 2 +  @ i @ 2 + 0 2

U ]

3 i

3 2 @ 

i 0 2

3 i

P Q



P Q M

3 i +



i +

U W ) C

< k

)

i i i

g d o c v c f b c ` t  

U N )

)

> T 

)

' ) + V

> K

T =

' /

U l

) j )

D @ / '

= ^ < = > K ^

) C D @ S

) Y Z [

U w Z

) ' ) [ i [

. 6  A

8 9

* 3 + _ 6 2 3 @

U V ? @ 2 + U  @ 2 + 0 2

' @ 2 + \  @ 2 + 0 2

U m

' @ 2 + ]  @ 2 + 0 2

' @ 2 + W  @ 2 + 0 2

i i i i

PDIS

schema catalog

data procedural

code

I assume, id is

a key of table

DOCREF...

18 DATABASE REENGINEERING- A CASE STUDY

code patterns Indicators for potential constraints might also be found in the procedural code of the legacy

system, e.g., in form of stereotypical code patterns. Andersson (informally) describes the idea

of searching legacy code for typical SQLa queries that serve as indicators for constraints about

the database schema [And94]. The four code segments of COBOL-embedded SQL presented

in Figure 2.7 are instances of such code patterns.

aStructured Query Language [Dat89]

code segment 1:

cyclic join pattern Code segment 1 is an instance of the so-called cyclic join pattern [And94]. The purpose of the

query is to deliver contact information about the user (u1) who is responsible for a given

document and another person (u2) who works close to this user. The corresponding query is

called cyclic join, because it selects two rows in the same table. At this, an inequality condition

assures that these two rows are not identical. Hence, a cyclic join serves as an indicator for a

key candidate. In code segment 1, the inequality condition is applied to column sname of table

USER, which indicates an alternative key for this table.

code segment 2:

select distinct

pattern

The second code segment is an example for the fact

that indicators might even be contradicting. It is an

instance of the so-called select distinct pattern. It

selects a person record according to a given value for

columns sname and dpt. At this, the keyword

DISTINCT is used in SQL to eliminate multiple equal

rows in the result of a query. However, such equal

rows can only occur, if columns sname and dpt do not

represent keys in table USER. This code segment is

thus a negative indicator that sname represents a key

and contradicts to the previously found cyclic join indicator (cf. Figure 2.8). However, the

select distinct indicator generally has a lower credibility than the cyclic join pattern, because

many programmers tend to use the keyword DISTINCT even if it is not needed. The reengineer

has to make further investigations (e.g., examine the available data) in order to resolve such

contradicting analysis results.

code segment 3:

join pattern Code segment 3 is an instance of a so-called join

pattern. It joins table PRODREF with table

DOCUMENT and table PRODUCT, respectively. In

our scenario, the reengineer uses this join statement

as an indicator for two potential foreign keys in table

PRODREF (cf. Figure 2.9). Likewise, joins in other

queries might be found as indicators for the

additional potential foreign keys listed in Figure 2.9.

data analysis The reengineer can use a snap shot of the available data (DB extension) to validate assumed

(potential) constraints about the legacy schema. Of course, hypotheses can only be falsified but

not proved by means of data. Still, the fact that an assumed constraint holds for a huge amount

of data can provide further support for this hypothesis. In our sample scenario, we will only use

the small amount of sample data represented in Figure 2.7. It shows that most of the potential

Key?: USER(sname)

Figure 2.8. Contradicting indicators

for key constraint

cyclic join select distinct

(segment 2)(segment 1)

pro con

Foreign key?: PRODREF(doc) ->

DOCUMENT(docno)

Foreign key?: PRODREF(prod,pg,cg) ->

PRODUCT(no,pg,cg)

Foreign key?: PRODGRP(cg) ->

COMGRP(cgid)

Foreign key?: PRODGRP(manager) ->

USER(sname)

Figure 2.9. Potential foreign keys

indicated by join patterns

THE REENGINEERING PROCESS 19

constraints cannot be falsified via the available extension. However, the entries in tables

PRODREF and DOCREF contain counterexamples for the initial belief of the reengineer that

the name id might label key columns in these tables (cf. Figure 2.6 and Figure 2.7). By talking

to PDIS users in the company hotline, the reengineer learns that references from documents to

products or other documents are uniquely numbered per each referencing document. This

additional knowledge leads to the assumption that columns id and doc, respectively, id and

sdoc, represent keys for the corresponding tables PRODREF and DOCREF. A new

investigation of the available data supports this assumption, because it holds for the huge

extension of these two tables (which is not shown in Figure 2.7). Consequently, Figure 2.10

summarizes the result of the structural completion of our schema detail.

Semantical enrichment

Semantical enrichment aims to classify and annotate LDB schema components according to

detected optimization structures and higher level concepts like inheritance and aggregation

[EN94]. This activity usually starts with an LDB schema that has already been structurally

completed [FV95]. Nevertheless, both activities, structural completion and semantic

enrichment, are highly intertwined in practice. This means that often important structural

information is discovered during semantic enrichment, that has not been detected before. Our

sample scenario will reflect on this experience.

inheritance

structures

Relational data models often contain indicators for hidden inheritance structures. Similar or

synonymous names of tables or groups of columns might represent hints for such structures. In

our sample scenario, the reengineer knows that PDIS maintains references among different

documents and products. Due to this domain knowledge and the similarity of the names of

tables PRODREF and DOCREF, (s)he assumes a hidden inheritance structure. (S)he beliefs

that the purpose of tables PRODREF and DOCREF is to store references from documents to

other documents and from documents to products, respectively. Thus, her/his first assumption

is that there might be a (hidden) common domain relation [FV95] REFERENCE that covers all

references in general (cf. Figure 2.11).

Foreign key: PRODUCT(cg,pg) ->

PRODGRP(cg,pg)

Foreign key: DOCUMENT(usr) ->

USER(usrid)

Foreign key: PRODGRP(cg) ->

COMGRP(cgid)

Foreign key: DOCREF(sdoc), DOCREF(tdoc) ->

DOCUMENT(docno)

Foreign key: KEYW(doc1)...KEYW(doc5) ->

DOCUMENT(docno)

Foreign key: PRODREF(prod,pg,cg) ->

PRODUCT(no,pg,cg)

Foreign key: PRODGRP(manager) ->

USER(sname)

Figure 2.10. Result of the structural completion

Key: COMGRP(cgid)

Key: PRODGRP(cg,pg)

Key: PRODGRP(manager)

Key: DOCUMENT(docno)

Key: PRODUCT(no,pg,cg)

Key: USER(usrid)

Key: USER(sname)

Key: PRODREF(id,doc)

Key: DOCREF(id,sdoc)

Key: KEYW(keyw,seqn)

20 DATABASE REENGINEERING- A CASE STUDY

variant records However, when the reengineer

considers other available information

sources, (s)he has to refute the initial

assumption that tables PRODREF and

DOCREF serve for separate

specialized concerns: the DB

extension shows an overlapping

among key values (id,doc) of both

tables. In fact, each key value in table

DOCREF seems to imply an equal key value in table PRODREF. Furthermore, all of these

implied rows seem to comprise NULL-values in all other columns (cf. Figure 2.7). After a

more detailed investigation of the available data of table PRODREF the reengineer discovers

that there are actually four different variants of entries in this table, which are displayed in

Figure 2.13. By talking to PDIS users, the reengineer learns that the system not only allows for

references from documents to products but also to product groups and commodity groups.

Moreover, (s)he learns that all references, either among different documents or among

documents and products, have unique numbers with respect to the referencing document. This

requirement for unique reference numbers is a plausible explanation for the hypothetical

inclusion dependency between tables DOCREF and PRODREF: each entry in table DOCREF

implies a numbering place holder in table PRODREF. Together, the additional domain

knowledge allows to label the detected variants in table PRODREF and entails the three new

foreign key constraints shown in Figure 2.12. The fact that Variant 4 of table PRODREF

represents place holders for document references allows to classify the corresponding foreign

key as an inheritance (is-a) relationship [HHEH96].

PRODREF(id,pg,prod,cg,doc)

Figure 2.11. Assumed hidden common domain

relation

DOCREF(id,sdoc,tdoc)

REFERENCE(id,doc)

is-a

Foreign key: PRODREF(pg,cg) (var. 2) ->

PRODGRP(pg,cg)

Foreign key: PRODREF(cg) (var. 3) ->

COMGRP(cg)

Foreign key: DOCREF(id,sdoc) ->

PRODREF(id,doc) (var. 4)

(classified as is-a relationship)

Figure 2.12. Labeled variants and additional foreign keys of table PRODREF

Variant 1 (product reference):

PRODREF(id,pg,prod,cg,doc)

Variant 2 (product group reference):

PRODREF(id,pg,cg,doc)

Variant 3 (commodity group reference):

PRODREF(id,cg,doc)

Variant 4 (placeholder for document reference):

PRODREF(id,doc)

variant# id pg prod cg doc

1 ... ... ... ... ...

2 ... ... NULL ... ...

3 ... NULL NULL ... ...

4 ... NULL NULL NULL ...

Figure 2.13. Variants of table PRODREF

THE REENGINEERING PROCESS 21

optimization

structures

code segment 4

Another objective of semantic schema enrichment is to detect optimization structures. The five

foreign keys between table KEYW and table DOCUMENT are a typical example for such an

optimization structure. A similar structure is described by Premerlani and Blaha [PB94].

Conceptually, it represents a many-to-many relationship between keywords and documents.

Such a relationship is normally implemented as a simple join-table. However, in this case, the

developer implemented it by a number of (five) foreign keys borrowed by the keyword table.

The role of column seqn is to enable a carry over in case that one keyword is associated to more

than five documents. This means that more than five references are represented by additional

rows with the same keyword in column keyw but increasing values in column seqn. Again,

there are various possibilities to detect optimization structures, e.g., naming conventions in the

schema, characteristic procedural access code (code segment 4), and special value

combinations in the available data.

artificial keysIt is common practice to introduce additional key columns when a relational data model is

implemented. Often, sequence numbers are used for these columns, because they provide a

simpler notion of identity than composite keys which carry real application data. Joins among

tables are generally more efficient using such artificial keys. Hence, they can be considered as

another kind of optimization structure. When an LDB is reengineered such artificial

implementation structures have to be identified, as they should be suppressed in the recovered

conceptual schema. In case of PDIS, the reengineer recognizes column usrid of table USER as

an artificial key because users are conceptually identified by their short name (sname).

aggregationsAccording to the first normal form

[BCN92], the relational data model does

allow for atomic values in columns of

tables, only. This implies that if a complex

(aggregated) object structure has to be

stored in a relational DB, it has to be

decomposed into relations over atomic

values. When a legacy relational DB is

reengineered, knowledge about these

aggregate relationships is important to

recover its conceptual design. In our sample scenario, the reengineer annotates the information

that column telo and telp of table USER conceptually represent a complex attribute that

maintains different telephone numbers of users. More complex examples of detected aggregate

relationships are described by Soutou [Sou98] and Vossen [FV95]. The annotations of our

sample schema detail according to the detected optimization and aggregation structures is

shown in Figure 2.14.

cardinality

constraints

In general, a foreign key implements a many-to-one relationship between two tables. However,

this cardinality information might be defined more precisely by investigating further relational

constraints. Figure 2.15 gives an overview on the implication of such relational constraints on

the cardinality range of a relationship implemented as a foreign key A(a1...an)->B(b1...bn).

Some of these relational constraints are already included in the results of previous analysis

activities (e.g., key or not-NULL constraints), while others have not been investigated,

previously. For example, by analyzing the data a reengineer can try to find out if a given foreign

Foreign key: KEYW(doc1)...KEYW(doc5) ->

DOCUMENT(docno)

Key: KEYW(keyw,seqn)

(optimized many-many relationship)

Key: USER(usrid)

(artificial key)

Complex: telephone(USER(telo,telp))

Figure 2.14. Detected optimization and

aggregation structures

22 DATABASE REENGINEERING- A CASE STUDY

key also entails an inclusion dependency (IND) [EN94] in the reverse direction.a In this case,

the minimum lower bound of the left side of the corresponding relationship is one, i.e., the

relationship is left-total (cf. last row of Figure 2.15).

a This special kind of inclusion dependency is called C-IND by Vossen and Fahrner [FV95].

problems of scale:

completeness

and consistency

The primary goal of the presented sample scenario is to characterize the involved activities.

Hence, it is not intended to be complete but it describes the most important steps of schema

analysis for relational LDBs. There are many other analysis activities and methods dealing

with various data models (cf. Section 2.5). For obvious reasons, our sample scenario covers

only a small detail of the real case study. Figure 2.16 summarizes the results of the legacy

schema completion and enrichment activity for this detail. The real system consists of 85

relational tables, 347 attributes, 111 foreign keys, several hundred thousand lines of procedural

code and a huge database extension. As a consequence of this scale, performing the described

process manually becomes a time-consuming, tedious, and error-prone task. The reengineer is

likely to overlook some indicators for important semantic information and it is difficult to keep

the resulting semantic information consistent.

iterative process Because of the reasons described above, it is idealistic to presume a strictly phase-oriented,

waterfall-type DBRE process. In practice, reengineers often start abstraction and forward

engineering activities based on analysis information about the logical schema which still might

be incomplete or inconsistent. During these activities reengineers accumulate additional

knowledge about abstract concepts of the LDB. With this understanding, they often go back in

the RE process and make some further investigation to refute or add some analysis results. In

order to reflect on this experience in our sample scenario, we assume that in an initial analysis

of PDIS the reengineer has not noticed the different variants of table PRODREF and the

alternative key of table USER. The resulting incomplete analysis information is shown in black

color in Figure 2.16.

relational constraint Min(xl) Max(xu) Min(yl) Max(yu)

Foreign Key: A(a1...an)->B(b1...bn) 0 N 0 1

Not NULL: A(a1...an) 0 N 1 1

Key: A(a1...an) 0 1 1 1

IND: B[b1...bn]⊆A[a1...an] 1 N 0 1

Figure 2.15. Implication of relational constraints on the cardinality of relationships

Cardinality range of represented relationship: [xl,xu] : [yl,yu]

THE REENGINEERING PROCESS 23

Observations

With the presented sample scenario we can observe a number of typical characteristics about

the activity of analyzing legacy schemas. They are summarized in the following statements:

Legacy schema analysis

O1. involves heuristics and imprecise facts, i.e., it deals with uncertain knowledge;

O2. deals with idiosyncratic coding concepts and optimization structures;

O3. involves heuristics with credibilities that depend on technical and non-technical

parameters of the LDB (e.g., used hard/software platforms and personal

programming style or naming conventions, respectively);

Foreign key: PRODUCT(cg,pg) ->

PRODGRP(cg,pg)

(cardinality range [1,N]:[1,1])

Foreign key: DOCUMENT(usr) ->

USER(usrid)

(cardinality range [0,N]:[1,1])

Foreign key: PRODGRP(cg) ->

COMGRP(cgid)

(cardinality range [1,N]:[1,1])

Foreign key: DOCREF(sdoc), DOCREF(tdoc) ->

DOCUMENT(docno)

(cardinality range [0,N]:[1,1])

Foreign key: KEYW(doc1)...KEYW(doc5) ->

DOCUMENT(docno)

(optimized many-many relationship

with cardinality range [1,N]:[1,N])

Foreign key: PRODREF(prod,pg,cg) (var. 1) ->

PRODUCT(no,pg,cg)

(cardinality range [0,N]:[1,1])

Foreign key: PRODREF(pg,cg) (var. 2) ->

PRODGRP(pg,cg)

(cardinality range [0,N]:[1,1])

Foreign key: PRODREF(cg) (var. 3) ->

COMGRP(cg)

(cardinality range [0,N]:[1,1])

Foreign key: DOCREF(id,sdoc) ->

PRODREF(id,doc) (var. 4)

(classified as IS-A relationship

with cardinality range [1,1]:[1,1])

Foreign key: PRODGRP(manager) ->

USER(sname)

cardinality range [0,1]:[1,1])

Figure 2.16. Summary of analysis results

Variant 1 (product reference):

PRODREF(id,pg,prod,cg,doc)

Variant 2 (product group reference):

PRODREF(id,pg,cg,doc)

Variant 3 (commodity group reference):

PRODREF(id,cg,doc)

Variant 4 (placeholder for document reference):

PRODREF(id,doc)

Complex: telephone(USER(telo,telp))

Key: COMGRP(cgid)

Key: PRODGRP(cg,pg)

Key: PRODGRP(manager)

(alternative key)

Key: DOCUMENT(docno)

Key: PRODUCT(no,pg,cg)

Key: USER(usrid)

(artificial key)

Key: USER(sname)

(alternative key)

Key: PRODREF(id,doc)

Key: DOCREF(id,sdoc)

Key: KEYW(keyw,seqn)

24 DATABASE REENGINEERING- A CASE STUDY

O4. combines contradicting indicators and assumptions from various information

sources, which may result in contradicting analysis results;

O5. deals with incomplete information and non-monotonic reasoning processes, i.e.,

new analysis results might refute initial hypotheses;

O6. is a human-intensive process that can be supported by semi-automatic analysis

operations; and

O7. produces abstract information about the LDB by aggregating and classifying legacy

schema components.

2.4.2 Conceptual schema migration and redesign

conceptual

migration In this activity, the reengineer uses his/her domain knowledge and the analysis results about the

logical schema to produce a corresponding conceptual schema. This is a creative design task

because, in general, there are many possible conceptual models for one single logical schema.

For the selected detail of our case study, the reengineer designs the conceptual schema

presented in Figure 2.17 as a UML [RJB99] object model. It contains classes for the central

entities of our schema detail. Associations have been created mostly according to the detected

foreign keys with their annotated cardinality information. However, the conceptual schema

abstracts from optimization structures and artificial keys. The two tables PRODREF and

DOCREF are conceptually represented as ordered many-to-many associations, where the order

is determined by their columns named id. Furthermore, the reengineer has decided to represent

a user’s department as a separate class.a

a Class Department represents a so-called weak entity [BCN92].

conceptual

redesign After creating an up-to-date conceptual schema for PDIS, the reengineer makes the necessary

modifications and extensions to meet the new requirements for MIS. Figure 2.18 shows such

an extended conceptual schema for our sample scenario. According to the requirement to

maintain documents on-line, the reengineer has introduced two specializations of class

Document, namely OfflineDocument and OnlineDoccument. The qualified association (master)

among these new classes specifies that each on-line document must have exactly one archived

master document. On the other hand, an off-line document might also be available on-line in

different formats. Moreover, Figure 2.18 contains two new classes (Employee and Customer)

and some new attributes to represent the different roles of users in MIS. Finally, the reengineer

refined some cardinality constraints in the extended conceptual schema.

iteration By discussing the designed conceptual schema with developers of MIS and users of PDIS, the

reengineer learns that PDIS maintains not only cross references from documents to products

and other documents, but also from documents to product groups and commodity groups.

Therefore, (s)he returns to the analysis phase to investigate this indication. During this

investigation, (s)he detects the different variants of table PRODREF, the alternative key of table

USER, and the missing foreign key between PRODGRP and USER (cf. Figure 2.16).

THE REENGINEERING PROCESS 25

update of

conceptual schema

After modifying the analysis results about the logical PDIS schema, consistency with the

redesigned conceptual schema has been lost. In order to re-establish consistency, the

reengineer has to trace the impact of the new analysis results on the conceptual schema and to

perform necessary changes and extensions. Figure 2.19 shows a detail of the conceptual

schema for MIS that has been updated according to the current analysis results. Due to the

common unique numbering of cross references (cf. Section 2.4.1), it is no longer correct to

represent document and product references as distinct many-to-many associations. Hence, the

reengineer introduces a new abstract class XRef as a generalization of all types of references.

Moreover, the newly detected foreign key implies a partial one-to-one association between

classes ProductGroup and Employee. However, the former attribute manager of class

ProductGroup had to be deleted because its relational counterpart has been borrowed from

table USER to implement this association.

problems of scale:

correctness and

consistency

For larger examples, the design of a correct conceptual schema for a given legacy schema is a

complex task that is prone to error. Modifications of the LDB and iterations between schema

migration and reverse engineering activities often result in inconsistencies between the logical

LDB schema and the corresponding conceptual schema. Resolving these inconsistencies early

is crucial for the success of any DBRE project, as the conceptual schema is the basis for many

subsequent migration and forward engineering activities. However, the complexity of real-

world systems makes it difficult to keep track of the impact of changing information about the

LDB on the designed conceptual schema, manually.

User

shortName: String

addr: String

Telephone

office: String

private: String

Department

Document

title: String

number: Integer

validUntil: Date

Keyword

word: String

author: String

confidential: Bool

Commodity

Group

id: Integer

Product

Group

id: Integer

manager: String

Product

number: Integer

1..*

0..*

{ordered}

ref.Products

{ordered}

0..* 0..*

0..*

ref.Documents

describedBy

1..*

worksFor

isResponsibleFor

0..*

Figure 2.17. Conceptual schema for PDIS (detail)

26 DATABASE REENGINEERING- A CASE STUDY

2.4.3 Implementation of changes and a middleware for data integration

The next step in our RE process is to implement the extended conceptual schema for MIS in

the relational DB. According to the requirements defined in Section 2.2.1, the necessary

schema modifications should be performed in a way such that the legacy PDIS will still be able

to access the stored data (with no or only minor changes to its application code). Figure 2.20

shows that the conceptual changes can be implemented in a canonical way for our sample

scenario. The new classes OnlineDocument and OfflineDocument are implemented as

extensions of the existing table DOCUMENT. This solution has been chosen because the

current legacy data about documents actually represents off-line documents which are available

at the company head quarter ("HQ"). Hence, the data in table DOCUMENT does not have to be

reorganized and the legacy PDIS application code can still be used to access the document

data. Similar compatibility reasons have driven the decision to implement the new subclasses

Employee and Customer as two variants in the existing table USER. (All users which are stored

by PDIS so far are in fact employees, namely the members of the telephone hotline service.)

The additional columns in tables DOCUMENT and USER are added using the SQL

modification command alter table and providing a default value. The new association master is

implemented as a foreign key which implies a referential integrity constraint.

User

addr: String

Telephone

office: String

private: String

Employee

shortName: String Customer

company: String

trusted: Bool

Department

1..*

worksFor

Document

title: String

number: Integer

validUntil: Date

Keyword

word: String author: String

confidential: Bool

describedBy

1..8 1..*

responsibleFor

0..*

Commodity

Group

id: Integer

Product

Group

id: Integer

manager: String

Product

number: Integer

1..*

0..10

{ordered}

ref.Products

0..*

Online

format: Format

Document

contents: Blob

Offline

archive: string

Document

{ordered}

0..10

0..*

ref.Documents

0..1

master

Figure 2.18. Extended conceptual schema for MIS (detail)

format

{disjoint}

THE REENGINEERING PROCESS 27

implementation

alternatives

There are other possible implementations of the extended conceptual schema, e.g., new tables

for each subclass, column replication, etc. Selecting one alternative is mostly a trade-off

between minimized redundancy (well-defined schema structure) and efficiency. A

comprehensive overview on relational implementation alternatives of conceptual structures is

given by Fussell [Fus97]. In particular, in the DBRE domain, reengineers often have to

compromise between well-designed schema modifications and the need for compatibility with

legacy applications in order to enable gradual migration of legacy systems.

User

addr: String

Telephone

office: String

private: String

Customer

company: String

Document

title: String

number: Integer

validUntil: Date

Keyword

word: String author: String

confidential: Bool

describedBy

1..8 1..*

responsibleFor

0..*

Commodity

Group

id: Integer

Product

Group

id: Integer

Product

number: Integer

1..*

Online

format: Format

Document

contents: Blob

Offline

archive: string

Document

10..1

master

Figure 2.19. Extended conceptual schema for MIS after iteration (detail)

format

{disjoint}

XRef

no: Integer

DocRef ProdRef

ProdGrpRef

ComGrpRef

referencedBy

reference reference

reference

1 1

1 0..* 10..*

0..*

0..* 1

{disjoint}

Employee

shortName: String

trusted: Bool

Department

1..*

worksFor

manager

10..1

28 DATABASE REENGINEERING- A CASE STUDY

MIS architecture

and rationales In order to exploit the benefits of the conceptual schema migration the IT department decided

to employ the object-oriented paradigm [JEJ95] to develop MIS. However, this implied that the

developers may not use one of the various available DB Web-gateways [EKR97] or DB access

libraries (e.g., JDBC) in their application code. These solutions deal with direct textual queries

to the legacy schema. Using them in the application code would violate the most important

principle of object-oriented systems, namely encapsulation. As a consequence, every change in

the legacy schema would entail changes to the MIS application code. Therefore, the developers

decided to implement an object-relational middleware layer that hides the concrete

representation of objects in the DB from the application code. The objective is to increase the

maintainability of the new MIS by designing a middleware API that is compliant to the Java

language binding specified in the Object Database Management Group (ODMG) standard

[CBB+97]. Figure 2.21 shows that the planned object-oriented API is internally based in the

JDBC-gateway for DB2.

middleware design An outline for the design of the desired middle-

ware is given in Figure 2.22. Of course, there are

other possible designs that realize an ODMG

compliant API, e.g., the Java mechanisms of class

and interface inheritance could be used

differently. Still, all possible designs will have

certain classes which are specific to the

application (MIS) and other classes, which are

application independent (generic). The shaded

outer part of Figure 2.22 contains examples for

generic ODMG classes, including a common root

class for all persistent objects (ODMGObject), a

transaction manager (Transaction), and a class to

create and look up named database objects

(Database).

additional attribute login of class User:

alter table USER add column

implementation of class Employee:

alter table USER add column

trusted: BOOLEAN with default FALSE;

(legacy user data represents Employees)

implementation of class Customer:

alter table USER add column

company: VARCHAR(80) with default NULL;



mplementation of class OfflineDocument:

alter table DOCUMENT add column

archive: VARCHAR(80) with default "HQ";

(legacy documents are OfflineDocuments)

implementation of class OnlineDocument:

alter table DOCUMENT add column

format: INTEGER with default NULL;

alter table DOCUMENT add column

contents: BLOB with default NULL;

alter table DOCUMENT add column

master: INTEGER with default NULL;

implementation of association master:

foreign key DOCUMENT(master) references

DOCUMENT(docno) on delete cascade;

Figure 2.20. Implemented extensions of the logical schema

Legacy Database

(extended)

MIS

JDBC-Gateway

Object-Relational

Middleware

(ODMG)

Figure 2.21. MIS architecture

(application code)

CLIENTSERVER

WEB

THE REENGINEERING PROCESS 29

The inner (unshaded) part of Figure 2.22 exemplifies that application specific classes are not

only implemented for each class in the conceptual schema, but also for collections of their

instances and their entire extends, respectively. The class members of application specific API

classes are mainly designed in a canonical way by adding read and write accessor methods for

each class attribute and association.a Using these methods to traverse an association

automatically triggers the translation of relational data to Java-objects in the run-time cache of

the current transaction. Hence, the implementation details of the LDB are completely hidden

from the user of the middleware. Of course, application specific middleware classes may also

contain additional methods that encapsulate more sophisticated queries to the legacy database

(LDB).

a For the sake of simplicity, we did not list all methods and attributes of the classes depicted in

Figure 2.22.

Document

title: String

number: Integer

validUntil: Date

author: String

confidential: Bool

Database

void open(String dbname, int accessmode)

ODMGObject lookup( String name)

void bind(ODMGObject obj, String name)

Transaction

Transaction(Database &db, String name)

void start(), abort()

Boolean commit()

JDBCStatement getStatement()

RefList: getReferences()

KeyWordList getKeywords()

ODMGObject

Boolean isValid()

Boolean isDirty()

void setDirty()

ODMGList

LinkedList (JDK)

KeyWord

DocList getDocs()

void addToDocs(Document doc)

void delFromDocs(Document doc)

DocList DocExtend

KeyWordExtendKeyWordList

ODMGExtend

describedBy

1..8 1..*

pendingTXs

0..* 1

visited

0..*

Figure 2.22. Design of the middleware layer (detail) inherit class

implement interface

application independent framework

docskeywords

keyword: String

30 DATABASE REENGINEERING- A CASE STUDY

Observations

This case study led to the following observations:

O8. Conceptual abstraction (and redesign) of a logical schema is a creative (re-)design

task; there are many alternative conceptual schemas.

O9. Conceptual translation of complex schemas is error-prone; due to the semantic gap

between conceptual and logical data models it is often non-trivial to decide whether

a created conceptual schema is correct, i.e., semantically equivalent to the

implemented data structure.

O10. Increasing conceptual knowledge about LDBs often cause iterations with further

analysis activities. Other causes for iterations are on-the-fly modifications of the

LDB schema during an ongoing RE project.

O11. Iterations cause inconsistencies between the logical and conceptual schema which

are often difficult to detect and resolve.

O12. Modifications to the original schema can be performed canonically according to the

redesigned conceptual schema.

2.5 Summary and concluding remarks

relevance of

the scenario The case study presented in this chapter describes a typical example for current industrial

DBRE projects. The emerging requirement to compete on a global electronic market has

become one of the major driving forces to integrate existing LDBs with Web-based

technologies [LS98b]. An increasing number of companies seek a strategic business advantage

in establishing Web-based information systems. Lederer et al. give an overview on the benefits

expected from this technology [LMS98]. There are various reports about similar projects of

reengineering LDBs to the Web. Many of them deal with data stored in relational DBMS, e.g.,

Umar [Uma97, pp. 461-464], Fryer [Fry95], and Simpson [Sim94] report on the integration of

DB2-based mainframe applications with the Web. Other case studies deal with different data

models, e.g., the hierarchical data model (IMSa) [Uma97, pp. 464-468] and the network data

model [FH97, p. 227]. For some scenarios it is not necessary to make the transition to a fully

object-oriented API. In these cases, HTMLb language extensions or Web-gateways can be used

to integrate LDBs with Web services. However, these solutions are insufficient for companies

which aim on encapsulating and integrating various heterogeneous DBs to achieve enterprise-

wide IS infrastructures.

a Information Management System - hierarchical DBMS on mainframes (IBM)

b HyperText Markup Language [Bar94]

migration strategy Like the migration strategy described in our scenario, many approaches make use of the fact

that LDBs are data-decomposable [Uma97], i.e., they maintain their data in some kind of

DBMS which can be integrated with the new technology. Similar strategies can also be applied

to migrate other components of a legacy system, e.g., its user interface [BS95]. Recently, this

idea of decomposing legacy systems in order to reuse certain components and substitute or

enhance others has been described in general in terms of the so-called divide-and-modernize

reengineering pattern [SP98].

SUMMARY AND CONCLUDING REMARKS 31

In our scenario, we only describe the integration of one single LDB with new technologies.

Additional problems arise if we also consider the integration of several autonomous LDB, e.g.,

the problem of mediating among different component schemas. This issue is tackled by Lincke

and Schmid [LS98a], again, by using the example of electronic product catalogs.

DBRE processThe process described in our case study covers the major aspects of typical DBRE activities in

the area of relational systems. For each of these activities we pointed out a number of important

observations. A more general discussion about DBRE processes has been presented by Hainaut

et al. [HCTJ93]. However, even if we consider other data models and architectures our

observations are still appropriate, as they reflect on inherent characteristics of the DBRE

domain [ALV93, Big90].

role of the

scenario

Analogously to our observations, most techniques presented in this thesis are not restricted to

the described scenario only, i.e., the integration of relational LDBs with object-oriented, Web-

based technologies. The primary purpose of the described scenario is rather to motivate these

techniques and to provide a coherent application example for their evaluation. We will refer to

the elicited observations in the following technical chapters to define the major requirements

for DBRE tool support.

CHAPTER 3 A THEORY TO MANAGE

IMPERFECT KNOWLEDGE

knowledge-based

system

A major goal of this dissertation is to develop a formalism to specify and customize database

reengineering (DBRE) knowledge and to implement mechanisms that allow to apply this

knowledge in human-centered CARE environments. In principle, the desired system is similar

to a knowledge-based or expert system [Kas96]. However, the term expert system has originally

been introduced in the community of artificial intelligence (AI) to describe a computer

program which imitates human experts [BC90]. In this sense, the desired DBRE environment

can hardly be considered as an expert system. This is because its primary task is to support the

reengineer by unburdening her/him from stereotypical and error-prone activities and focussing

her/his attention on the parts of the LDB where human common-sense and intuition is required.

Consequently, the reengineer will not be replaced but the goal of the desired DBRE

environment is to employ her/his capabilities in a more efficient and effective way. Recently,

the term knowledge-based system (KBS) has been used in a broader sense: the main

characteristic of a KBS is that it consists of a formal description of domain knowledge, a fact

base, and a separate component including a number of problem solving strategies to execute

the knowledge [BL97].

In order to develop a knowledge-based tool to assist legacy schema analysis, it is crucial to

characterize the problem domain of DBRE, carefully. With our case study, we have already

made some important observations about the nature of DBRE processes and activities (cf.

Section 2.4.1, on page 15). Other researchers and practitioners report on similar experiences

with DBRE projects, e.g., [BP95, BRH95, PKBT94, And94, PB94, AMR94, Sne91]. It is the

purpose of Section 3.1 to use these observations to derive central requirements on a formalism

that is suitable to manage DBRE knowledge in human-centered CARE environments.

Subsequently, we will review different theories of imperfect knowledge representation and

reasoning, and evaluate their suitability for our specific application domain (Section 3.2). In

Section 3.3, we compare the reviewed approaches based on the evaluation results. This

comparison enables us to conclude which general theories and techniques are most suitable to

be applied in the DBRE context.

3.1 Requirements on formalisms to manage DBRE knowledge

The requirements that have to be fulfilled by a formalism that is suitable to manage imperfect

knowledge in human-centered DBRE environments cover different aspects. Some aspects like

clarity and maintainability should generally be considered whenever a language for knowledge

representation is developed. Still, other aspects depend on our specific class of problems. A

good way to introduce these aspects is to look at a typical reference architecture for KBS in

Figure 3.1 [Kas96].

A core component of each such systems is a knowledge base which contains situation-

independent domain knowledge, that has explicitly been specified by experts and/or implicitly

34 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

been learned from sample situations. Some of the central questions are: What kind of

knowledge has to be represented? How can the knowledge be specified or adapted? How is the

knowledge represented internally?

Other core components of a KBS are the inference engine and the fact base. The inference

engine is an implementation of a set of problem solving strategies. It interprets the knowledge

represented in the knowledge base and applies it to the available data residing in the fact base

in order to infer additional facts about the current situation. We have to cover questions like:

What kind of data has to be stored in the fact base? Where does this data come from? What is

the right problem solving strategy?

Finally, user interaction plays an important role in KBS. On one hand, the inferred data has to

be communicated to the user (user dialog), and on the other hand, the system should be able to

explain the way how this result has been obtained (explanation). Questions of interest include

How can facts adequately be presented to the user? What explanation or query functionality is

required?

In the following subsections, we will investigate the mentioned questions more thoroughly for

the application domain of DBRE and elaborate central requirements that have to be fulfilled by

a theory to manage DBRE knowledge.a

a We do not claim that the following requirements are sufficient for an approach to manage DBRE knowledge but

they are necessary and allow to identify the most promising theory.

Knowledge Base

(domain knowledge) Fact Base

(situation knowledge)

Inference Engine

(problem solving)

User Dialog

(situation specific)

Implicit Knowledge Acquisition

(learning from situations)

Explicit Knowledge Acquisition

(knowledge specification) Explanation

(situation knowledge)

information flow

Figure 3.1. Reference architecture of KBS

REQUIREMENTS ON FORMALISMS TO MANAGE DBRE KNOWLEDGE 35

3.1.1 Quantitative representation of uncertainty

Our case study has demonstrated that DBRE expert knowledge includes various heuristics

(cf. O1 in Section 2.4.1). In general, heuristics represent a form of imperfect knowledge about

the real world under consideration. They are employed when it is not tractable to use definite

knowledge, e.g., in the case when necessary information is unknown, or when it takes too

much effort to use definite knowledge. The drawback of using heuristics is that they might not

be valid in some situations, i.e., they might lead to unsatisfactory results.

quantitative

vs. qualitative

approaches

The approaches which have been introduced to formalize uncertain knowledge can be

distinguished in two major categories, namely quantitative and qualitative approaches [Hül96].

In quantitative approaches, each piece of knowledge has an associated valuation which is

represented by a real number in a closed interval. The concrete semantics of this valuation

depends on the underlying theoretical framework of each approach, e.g., probability theory.

However, the different valuations in all quantitative approaches have in common that they

define a measure for the degree of validity of the corresponding pieces of knowledge. When a

new piece of knowledge p is derived from a combination of existing pieces of knowledge

{p1,..,pn}, the valuation for p is computed by combining the valuations of {p1,..,pn}.

Many critics have argued that real numbers are not adequate to represent the uncertainty of

human knowledge [Her94, Sch92]. They point out that common sense reasoning has a

qualitative, rather than numerical nature. Hence, qualitative approaches to uncertain knowledge

representation allow to make propositions like “p is likely” or “p is possible”. In principle,

qualitative approaches are specializations of quantitative approaches with a small, finite

domain of possible values [BC90]. Some qualitative approaches use an internal knowledge

representation based on real numbers to represent uncertainty. In these cases, it is often

possible to specify or obtain a quantitative measure for the certainty of a piece of knowledge,

which is in contrast to “purely” qualitative approaches.

In order to decide whether a qualitative or a quantitative approach is more suitable for the

DBRE domain, we have to consider the primary purpose of our application. In the introduction

to this chapter, we stated that a central functionality of the projected DBRE environment is to

direct the reengineer‘s attention to the most controversial parts of the legacy system. Hence, we

can classify our application as a selection problem. Purely qualitative approaches are less

suitable for these kinds of problems, as their small domain of possible truth-values is less

selective than real numbers. For example, it is less informative for a reengineer to know that a

given foreign-key might possibly represent either an association or an inheritance relationship,

than to know that the confidence of the association is measured to 0.2, whereas the inheritance

relationship has a confidence measure of 0.7 (in a valuation interval from 0 to 1). As a

consequence of this discussion, we impose the first requirement on the desired formalism.

R1. A formalism to specify DBRE expert knowledge has to allow for a quantitative

representation of uncertain domain and situation-specific knowledge.

36 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

3.1.2 Representation and indication of contradicting knowledge

In Section 2.4.1, we have exemplified that in order to recover an up-to-date documentation

about an LDB the reengineer has to combine indicators and assumptions that stem from

various sources. Furthermore, we have observed that this information is likely to be (partly)

contradicting (cf. Observation O4). In general, dealing with contradicting domain and situation

specific knowledge seems to be inevitable in our application domain. On one hand,

contradicting domain specific knowledge is often introduced by acquiring heuristics from

different DBRE experts. On the other hand, contradictions in situation specific knowledge

might be injected, e.g., by combining the output of different automatic software analysis

procedures [MNL96, AT98] or by considering diverse human (situation-specific) assumptions

about the LDB. This leads to the following requirement:

R2. A formalism to manage DBRE knowledge has to allow for the representation of

contradicting knowledge with domain-specific and situation-specific character.

Still, tolerating contradicting knowledge is not enough, because a major goal of the DBRE

process is to produce a documentation of the LDB that has to be consistent, eventually. Hence,

a central task of the reengineer is to find and resolve all contradictions in the situation-specific

knowledge. A knowledge-based DBRE environment should support this process by indicating

contradicting knowledge to the user.

R3. A formalism to manage DBRE knowledge has to be able to indicate contradicting

situation-specific knowledge.

3.1.3 Reasoning about incomplete knowledge

Traditionally, many approaches to knowledge representation and reasoning make the so-called

closed world assumption, which entails that all relevant information is known before the

reasoning process starts. This kind of reasoning process is called monotonic because no

additional information that might become available later can lead to the falsification of a

conclusion. From the DBRE case study, we learn that we have to give up the closed world

assumption for our application domain (cf. Observation O5, on page 24). On the contrary, we

have shown that reengineers generally begin their reasoning process with incomplete

information in terms of initial assumptions and analysis results. This information might lead to

intermediate hypotheses which might be refuted or supported as soon as new information

becomes available from further investigations. Consequently, we have to make an open world

assumption, which involves a non-monotonic reasoning process.

R4. A formalism to manage DBRE knowledge has to be able to deal with incomplete

knowledge in a non-monotonic reasoning process.

3.1.4 Representation of ignorance

DBRE uses heuristic knowledge in combination with positive and negative indicators and

human assumptions to infer new (uncertain) situation specific knowledge. For example, in our

case study, we show that an instance of a cyclic-join pattern over a given attribute x serves as a

positive indicator for the fact that x is a key, whereas an instance of a select-distinct pattern

over x represents an indicator against this assumption (cf. page 18). However, in the absence of

EVALUATION OF THEORIES 37

any such indication nothing is known about whether or not attribute x might be a key. This state

of (partial) ignorance cannot be described with statements like "x is not a key" or "x is a key

with 50% chance". Thus, we require the following criteria.

R5. A formalism to manage DBRE knowledge has to be able to represent partial

ignorance about situation-specific knowledge.

3.1.5 Computational tractability

The criteria discussed above reflect on qualitative properties of the desired formalism for

knowledge management. However, we are interested in selecting this formalism in order to

solve a particular class of problems in DBRE. Even if we employ a knowledge representation

that satisfies all of the above requirements but cannot be executed on a computer with the

efficiency that is necessary for practical applications we have not solved the problem. In the

DBRE domain, we have to deal with a large amount of information in terms of several hundred

tables, millions of lines of code, and a vast amount of business data. It is crucial to find a

solution that scales up to practical applications. Therefore, we need to take into account

another criterion:

R6. A formalism to manage DBRE knowledge should scale up to practical applications.

3.2 Evaluation of theories

This section contains a survey of major approaches to manage imperfect knowledge, namely

production systems with confidence factors (Section 3.2.1), probabilistic reasoning

(Section 3.2.2), credibilistic reasoning (Section 3.2.3), fuzzy reasoning (Section 3.2.4), and

possibilistic reasoning (Section 3.2.5). We use the requirements that have been elaborated in

the previous Section 3.1 to evaluate each approach according to its suitability for the

application domain of DBRE. We concentrate only on quantitative (or hybrid) approaches

because of our first requirement. Even though purely qualitative formalisms (e.g., modal logic

[Lem77, Gär75, HR87], default logic [MT93, Poo88], and multi-valued logic [BB92]) are not

suitable for our particular purpose they have proven useful in many other application contexts.

An interesting comparison of modal and many-valued logic with most quantitative approaches

evaluated in this dissertation has been presented by Hajek [Haj94].

Notation and basic definitions

Before we start our discussion of the different approaches in Section 3.2.1, we need a more

formal notion of a (relational) DB. Furthermore, we define some notational conventions that

are used throughout this dissertation.

Definition 3.1 Data model

Adata model is a tuple M:= (C, O), where C is a set of concepts that are used to describe the

❑

structure of data and O is a set of operations to handle the data represented by elements of C.

38 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

Definition 3.2 Database

Adatabase is a tuple DB:=(M,S,δ(S),C,D), where M is a data model, S is a data structure that

is represented by concepts of M (S is also called schema), δ(S) is the extension of S (also called

data), C is an application program that uses operations of M (C is often represented by its

❑

source code), and D is the documentation of DB.

Definition 3.3 Relational database

Arelational database is a database RDB:=(M,S,δ(S),C,D), where M is the relational algebra;

S=(R, ∆) is a relational database schema where R={r1,...,rn}, n∈, is a finite set of relation

schemas (RS); and ∆ is a set of inclusion dependencies (INDs). Each r∈R is a tuple

r=(X, ∑,Θ), where X is a finite set of attributes;∑ is a finite set of key dependencies; and Θ is

❑

a finite set of not-null constraints.

Definition 3.4 (Notation)

•



denotes the universe of discourse.

•SET denotes the infinite set of all sets.

•REL⊂SET denotes the infinite set of all relations.

•LIST denotes the infinite set of all lists.

•We extend the applicability of the set operators {∈,⊆,⊂,⊇,⊃} on lists,

e.g., for a list <e1,..,en>∈LIST we define e∈<e1,..,en>⇔e∈{e1,..,en}, etc.

•



(S)∈SET denotes the power set of a given S∈SET.

•For a given set S, let |S| denote the cardinality of S.

•FUN denotes the infinite set of all functions.

•For a given function f∈FUN and an argument a∈



we denote def(f(a)) iff f is defined for a.

•L0, L1 denote the language of propositional logic and first-order logic [Rog71],

respectively.

•



{L} denotes the infinite set of all valid expressions in a given language L.

•Let denote the tautology and let denote the contradiction, i.e., the logical formula that

is always and never fulfilled, respectively.

❑

•Let RDB denote the infinite set of all possible relational databases.

Definition 3.5 Flattening

We define a function that transitively flattens nested sets and lists, i.e., flatten:SET∪LIST→SET,

with

❑

3.2.1 Production systems with confidence factors

An early approach to represent and reason about uncertain expert knowledge has been

proposed by Shortliffe et al. [Sho74, BS84]. It has been implemented in the well-known expert

system called MYCIN and applied to problems of medical diagnosis. In Shortliffe’s approach,

flatten S( ) x s S∈( ) x flatten s( )∈ if def flatten s( )( )

x=s else

∧

î 

 

 

EVALUATION OF THEORIES 39

propositions and implication rules are associated with a measure of belief (MB) and a measure

of disbelief (MD), both being numbers between 0 and 1. These measures are then combined

into a single number called certainty factor (CF). The CF of a proposition u∈



is computed as

CF(u):=MB(u)-MD(u),CF(u)∈[-1,1]. The general form of an implication rule with CF is

IF u1THEN u2, with CF=c

where u1∈



can be an arbitrary complex proposition and u2∈



has to be atomic. Confidences

of complex propositions are calculated using the minimum operation for conjunctions and the

maximum operation for disjunctions. A negation results in a change of the sign of the

corresponding CF. This means for u3,u4∈



CF(u3∧u4)=min(CF(u3),CF(u4)) (EQ 1)

CF(u3∨u4)=max(CF(u3),CF(u4)) (EQ 2)

CF(¬u3)=-1*CF(u3) (EQ 3)

The above equations can be used to determine the confidence for the entire antecedent of a rule.

This confidence is then multiplied by the confidence factor of the corresponding rule itself to

obtain the confidence for the conclusion. Generally, several rules can have the same

conclusion. In this case, the confidences that result of each rule application have to be

combined to obtain the new CF for the common conclusion. For the rules

R1:IF u1THEN u , with CF=c1

R2:IF u2THEN u , with CF=c2

let CF(u|ui) denote confidence for proposition u resulting from the application of rule Ri, with

CF(u|ui)=ci*CF(u). The combined confidence for the common conclusion u is then computed

(EQ 4)

Evaluation

only monotonic

reasoning (R4)

Many researchers have criticized the unclear semantics of the measures defined in MYCIN

[Ada76, Joh86]. However, the most significant problem of Shortliffe’s approach with respect to

our application domain is the inability to represent incomplete knowledge and execute non-

monotonic reasoning processes (Requirement R4, on page 36). All relevant knowledge has to

be known before the inference starts. In the general case of cyclic rule bases, recursive rule

applications cause problems with constantly growing confidences [BL97]. In contrast to the

original application domain of MYCIN (medical diagnosis) where acyclic rule bases were

sufficient to solve many practical problems, we need the general case of cyclic rule bases to

support the desired incremental and evolutionary DBRE process. This means that in our

particular application domain there is no strict separation of “symptoms” and “diagnoses” but

the DBRE environment should enable the reengineer to add information on arbitrary levels of

abstraction.

CF u u1u2

∧( )

CF u|u1

( ) CF u|u2

( )+CF u|u1

( )CF u|u2

( )– for CF u|u1

( ) CF u|u2

( ) 0>,

CF u|u1

( ) CF u|u2

( ) CF u|u1

( )CF u|u2

( )+ + for CF u|u1

( ) CF u|u2

( ) 0<,

CF u|u1

( ) CF u|u2

( )+

1min CF u|u1

( ) CF u|u2

( ),( )–

-------------------------------------------------------------------------------- else.

î









40 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

3.2.2 Probabilistic reasoning

Probabilistic logic extends classical logic [Rog71] with probability theory to reason about

uncertain information [NH95, Nil93, Paa88b]. Analogously to Shortliffe’s approach,

knowledge representation in probabilistic logic usually involves the specification of weighted

implication rules of the form

“IF A THEN B, with probability π”.

semantics A and B represent propositions in L1.a Given the semantics of an implication in classical logic

“→“, the probability might be interpreted as the probability that the condition (A∧B)∨¬A

holds. However, this semantics has not been adopted by most approaches to probabilistic logic.

The main argument against this semantics is that the probability of the above condition is of

little meaning in the mental model of an expert who tries to model his/her domain knowledge

[Paa88b, pp. 216]. Therefore, the probability associated with a rule is defined as a conditional

probability of the consequent given the fulfillment of the antecedent, i.e.,

(EQ 5)

aIn order to keep this survey simple, we restrict this introduction to the propositional case, which is mostly used in

probabilistic KBS. Approaches to define the semantics of probability-valued formulae in L1 are described, e.g.,

by Halpern [Hal90] and Fenstad [Fen67].

subjective

probability The traditional estimation of probabilities involves a large number of repetitions of a given

situation. The estimated probability for an event is then based on the frequency of occurrences

of this event divided by the total number of experiments performed. However, this frequency

concept is not applicable in most applications of probabilistic expert systems, because it is

rarely possible to observe a large number of identical experiments. For these cases, the theory

of subjective probabilities [Ber80, pp. 61ff] has been created. A subjective probability reflects

a human expert’s personal belief in the chance that the corresponding proposition is true.

Consequently, there might be different subjective probabilities for a single proposition which

have been defined by different human experts depending on their personal experience and

background.

The primary aim of probabilistic logic is to combine the probabilities provided by human

experts in order to define and evaluate a joint probability measure p over the universe of all

relevant propositions



={u1,...,un}. This universe is defined by all propositions that occur in a

probability (or conditional probability) specified for the resulting inference net.

The joint probability measure can then be defined as p:(



)→[0,1]

(EQ 6)

Ω={ω1,..,ωk} represents the set of all interpretations, i.e., all possible worlds with respect to the

problem of interest. These worlds have to be exclusive and exhaustive. Consequently, the

fundamental axioms of Kolmogorow have to be fulfilled for the probability measure [Loe78].

πp A B∧( )

p A( )

----------------------=

p ui

( ) pωj

( ), where ui

j Ji

∈

∑j Ji

∈ωj

∨= =

EVALUATION OF THEORIES 41

Inference methods for probabilistic KBS usually employ Bayesian inference networks

[HMW95, Pea98] to represent causal information. They are based on the Bayesian formula

[Loe78] which is used to calculate the so-called posterior probability of a each interpretation

ωi∈Ω for a given event uj∈



(EQ 7)

Evaluation

limited support for

contradiction (R3):

error models

An often cited problem of probabilistic reasoning is that it is unrealistic to expect that a human

expert is able to specify exact probabilities for axioms and implication rules. One approach to

tackle the problem of contradicting probabilities is the specification of error models [Paa88b].

This solution entails that each judgement of an expert has to be assessed w.r.t. its certainty. An

error model is represented by a conditional distribution p(πi|πi), where πi denotes the subjective

probability for rule i provided by an expert and πi represents the “correct” probability estimate

that would be given by a rational expert with complete information about all aspects of the

problem. In the general case, error models are specified for the subjective beliefs of many rules.

In this case, the maximum-likelihood approach can be employed to yield the most probable

solution [Paa88b]. This optimization procedure resolves contradictions in such a way that for

less reliable subjective probabilities the extent of the modification is largest.

Still, a major limitation of this approach is that the errors for different probabilities are assumed

to be statistically independent. This is only reasonable if the experts use distinct sources of

information and do not collaborate, which cannot be generally assumed in our application

domain. Clemen and Winkler [CW85] show that dependent sources of information considerably

reduce the precision of estimates. An inherent feature of using error models and the maximum-

likelihood approach is that inconsistencies are automatically resolved during the inference

process, i.e., probabilities of contradicting rules are adjusted to obtain consistency with the

available data (cf. [Paa88b]). If the available data (situation-specific knowledge) is uncertain

itself, error models can be used in the same way to specify this uncertainty. However, definite

probability values are calculated for deduced situation-specific knowledge. This means that this

approach does not allow to represent contradicting inference results explicitly, but it adapts the

uncertain input knowledge such that the inference results are consistent.

no representation

of ignorance (R5)

Furthermore, it is not possible to represent ignorance in probabilistic logic. This is because the

state of knowledge where there is an equally lack of certainty about all events (including non-

elementary ones) that are liable to occur cannot be expressed by a single probability measure

[DP88 p. 287].

computational

intractable for

DBRE (R6)

Bayesian inference is typically employed with a number of severe structural restrictions, e.g.,

events (axioms) are required to be conditionally independent, conclusions have to be exclusive,

the inference net has to be acyclic, prior probabilities are required for final and intermediate

results, or the desired probability distribution is expected to belong to a restricted class of

distributions (cf. [Paa88b]). The general problem of finding the posterior probability of a

proposition in a Bayesian network is in NP [Coo90]. Some authors have proposed inference

procedures that are less accurate and have a lower complexity for the average case, e.g.,

pωi|uj

( ) pωi

( )p uj|ωi

( )

ω Ω∈

∑ω( )p uj|ω( )

----------------------------------------------=

42 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

[Poo93]. Haber and Brown present an iterative algorithm for the general case of cyclic

inference nets [HB86], while Pearl discusses simplified procedures with special interaction

patterns [Pea86]. Some approaches employ optimization heuristics that yield approximate

solutions with less computational effort [AdBHL86]. Given the fact that in our particular

application domain (DBRE) we have to deal with cyclic inference networks and a vast amount

of propositions, probabilistic inference seems to be computationally intractable for our

purpose. Moreover, the effort that is spent in probabilistic reasoning in order to comply to the

basic axioms of probability theory does not seem to be justifiable for our application. This is

because the credibility of DBRE heuristics vary according to diverse technical and

nontechnical parameters of the current application context. Hence, experiments are not

repeatable which, according to von Mises [vM19], makes the computation of numerical

probabilities meaningless. Even, if we employ the subjectivistic approach, e.g., a gambling

situation [Nea92], instead of the relative frequency of events to define the semantic of

(subjective) probabilities, the significance of the inference results might be questionable for our

application domain. This is because the multiplicative combination of probabilities could lead

to an unreasonable amplification of estimation errors for longer inference paths.

3.2.3 Credibilistic reasoning

The mathematical theory of credibilistic reasoning based on the quantification of pieces of

subjective evidences has been introduced 1976 by Shafer [Sha76, Sha90]. This theory, which is

often referred to as Dempster-Shafer model, has been further generalized by Smets [Sme88] to

deal also with incomplete information, i.e., non-monotonic reasoning. It has been applied to

several practical problems of uncertain reasoning [Sme88, Nea92, Bau94, Bau95].

The central assumption of Shafer‘s theory is that there is a finite amount of belief (or

credibility) that is spread among the universe of all relevant propositions



. This belief is

distributed according to the available pieces of evidence. Without any loss of generality, the

total amount of belief induced by a single piece of evidence is usually scaled to 1. For each

available piece of evidence, the expert distributes this total amount of belief to a number of so-

called focal propositions. The functions that define these basic distributions are referred to as

basic probability assignment:

Definition 3.6 Basic probability assignment, focal proposition

Let



denote the set of all relevant propositions and let {E1,..,En} be a set of pieces of evidence.

The amount (mass) mi(u1) of belief which has been allocated by evidence Ei to a proposition

u1∈



(and which cannot be allocated to any other proposition u2∈



that implies u1) is called

abasic probability number. Any proposition u∈



with mi(u)>0 is called focal proposition of

evidence Ei. Basic probability numbers are assigned to propositions by a function m:



→[0,1]

called basic probability assignment, with

❑

(EQ 8)

difference to

probabilistic logic The main difference of Shafer‘s model compared to probability theory is the way how

credibilistic reasoning handles evidence which supports complex focal propositions, e.g.,

u1∨u2. In probabilistic logic, the total assigned mass of belief m(u1∨u2) has to be split between

miu( )

u U∈

∑1 1 i n≤ ≤( ),=

EVALUATION OF THEORIES 43

the two component propositions u1 and u2. If it is unknown how to distribute m(u1∨u2),

probabilists usually invoke the principle of insufficient reason [Sme88] or an argument of

symmetry to decide that m(u1∨u2) has to be split in two equal parts m(u1) and m(u2).

Credibilistic reasoning does not rely on this principles, i.e., it allows to allocate basic

probability numbers for complex propositions.

The combination of different pieces of evidence is performed by applying Dempster’s rule of

combination on the basic probability assignments [Sme88]. At this, the mass assigned to a

conjunction of (focal) propositions is defined as the product of the basic probability

assignments derived for both propositions from all available pieces of evidence.

Definition 3.7 Combination of evidences

For a proposition u∈



and a set of pieces of evidence {E1,..,En}, let mi(u) denote the basic

probability numbers assigned to u which have been derived from evidence Ei. The combined

mass of two evidences E1 and E2 supporting proposition u1=u2∧u3, denoted as m12(u1), is then

defined as:

(EQ 9)

The combination of n+1 pieces of evidences is recursively defined by applying Dempster’s rule

to combine the combination of n pieces of evidence with the next piece of evidence, e.g., m123 is

❑

computed by combining m12 with m3 in the same way.

Using Dempster‘s rule of combination, we yield a combined measure for the belief m(u) that

has specifically been committed to each proposition u∈



. However, if we want to obtain the

total degree of belief that we have about the fact that u is true, we have to add all masses of

belief that have been allocated to propositions u‘∈



that imply u. This total belief is quantified

by the so-called belief function:

Definition 3.8 Belief function

Let m:



→[0,1] be the mass function that is obtained by applying Dempster‘s rule of

combination for all available pieces of evidence. The function bel:



→[0,1] is called belief

function, with

❑

(EQ 10)

semantics of

belief and

plausibility

In [Sme88], Smets describes the semantics of the degree of belief as the degree of minimal or

necessary entailment. Besides the degree of belief, Shafer introduces another measure, which

is called plausibility. The plausibility of a proposition u1∈



is defined as the sum of the belief

allocated to all other propositions u2∈



that do not contradict to u1. Its meaning can be

described as the degree of minimal or potential entailment.

m12 u1

( ) m1a( )m2b( )

a b∧u1

a b U∈,

∑

bel u1

( ) m u2

( )

u2u1

→

∑

44 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

Definition 3.9 Plausibility function

Let m:



→[0,1] be the mass function that is obtained by applying Dempster‘s rule of

combination for all available pieces of evidence. The function pl:



→[0,1] is called plausibility

function, with

❑

(EQ 11)

The plausibility function is related to the belief function by the following equation:

pl(u)=bel( )-bel(¬u)=1-m( )-bel(¬u) (EQ 12)

Evaluation

limited support for

non-monotonic

reasoning (R4)

Shafer’s original theory did not consider incomplete knowledge. This closed-world assumption

has been too restrictive for many practical applications. In [Sme88], Smets describes a theory

of credibility that deals with incompleteness. Whenever new evidence becomes available all

basic probability assignments are changed to take this new evidence into account. This revision

is performed by Dempster’s rule of conditioning [Sme88].

computational

intractable (R6) A more severe problem that arises with the application of credibilistic reasoning in the DBRE

domain is, however, that Shafer’s approach has proven unfeasible even for moderately-sized

problems (cf. [Pro89, Voo89]). The general problem of inferring belief functions is NP hard,

because they are defined on the power set of possible answers to a question which is the

complete Boolean algebra of events with 2k elements [Paa88a].

3.2.4 Fuzzy reasoning

Fuzzy logic is a relatively young theory that can be viewed as a form of multi-valued logic

[Got88]. During the last two decades there has been a tremendous amount of research in this

area. Many practitioners have used this technology to implement and reason about vague

knowledge in a variety of application domains. We refer the interested reader to [Kas96] and

[Nov92] for a comprehensive introduction to this theory. Furthermore, [NAF99] and [FUZ98]

give a general overview on the latest results and research directions in this area. The following

introduction to the basic principles of fuzzy reasoning (and the next section on possibilistic

reasoning) is a little more detailed than the description of the previous theories as they will

provide the theoretical framework for the approach developed in Chapter 4.

fuzzy sets Fuzzy reasoning is based on the central notion of fuzzy sets introduced in 1965 by Zadeh

[Zad65]. A fuzzy set is a generalization of the concept of a set in classical mathematics.

Traditionally, each object in the universe may either be included in a given set S or excluded

from S. In this sense, a set S can be represented by its characteristic function

µS:



→{0,1}, with . (EQ 13)

pl u1

( ) m u2

( )

u2u1≠∧∑

=|=

µSu( ) 1u S∈

0u S∉

î



EVALUATION OF THEORIES 45

Zadeh‘s theory generalizes this concept by allowing objects to belong to a (fuzzy) set only

partially. Hence, the values of the characteristic function of a fuzzy set, which is called

membership function in fuzzy set theory, are real numbers in the interval [0,1].

Definition 3.10 Fuzzy set

A set of pairs F:={(u,µ(u)) | u∈



} is called fuzzy set in a universe



. The function µF:



→[0,1]

❑

is called membership function of F.

For a given object u∈



and a fuzzy setF the value µF(u) is called membership degree of u in F.

A membership degree of µF(u)=0 means that uis not a member of F and µF(u)=1 means that u

entirely belongs to F. Membership functions might be continuous or discrete. Figure 3.2 shows

two examples from our application domain. The continuous membership function on the left-

hand side defines the fuzzy set of large software systems according to their total number of

lines of code (LOC). The second example is a fuzzy set of pairs of type compatible string

attributes. It is described by a discrete membership function that is defined over the absolute

difference of length of both attributes. The diagram on the right-hand side of Figure 3.2

illustrates this fuzzy set for the case that the first attribute has a length of 80 characters.

These simple examples already demonstrate that a major benefit of using fuzzy sets is a more

adequate formalization of aspects of human reasoning. Using traditional set theory to describe

the set of type compatible string attributes, we would have to use a strict threshold value to

define the corresponding membership function. Each pair of string attributes with a difference

in length that is lower than the chosen threshold would then be considered to be (completely)

compatible, while all other pairs of attributes would be considered to be (completely)

incompatible. Obviously, this solution does not adequately represent the notion of type

compatibility in the mental model of human DBRE experts.

α-cutSeveral operations have been defined to convert fuzzy sets to traditional (crisp) sets. The most

important example is the so called α-cut, which is defined to be the (classical) subset of a given

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0 5 10 15 20 25 30 35 40

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0103104105106107

LOC(x) |length(s1)-length(s2)|

µlargeSS(x) µstrCompatible((s1,s2))

(with length(s1)=80)

Figure 3.2. Sample fuzzy sets with continous and discrete membership functions

46 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

fuzzy set F that consists of all elements in F with a membership degree greater or equal a given

value α∈[0,1].

operations on

fuzzy sets Analogously to crisp sets, it is possible to define operators like intersection (∩) and union (∪)

on fuzzy sets. The intersection operator is generally defined by an operation called t-norm,

while the union operator is defined by an operation called t-conorm.

Definition 3.11 t-norm and t-conorm

t-norm/t-conorm Two binary functions T,⊥:[0,1]×[0,1]→[0,1] are called t-norm and t-conorm, respectively, if

they fulfill the following properties:

❑

There is a functional dependency between t-norm and t-conorm operations. With the help of a

negation operation n, a t-conorm can uniquely be derived from a given t-norm and vice-versa,

i.e., ⊥(x,y)=n(T(n(x),n(y))) and T(x,y)=n(⊥(n(x),n(y))). Hence, t-norms and t-conorms are

called dual operations. In practice, the most commonly used t-norm is the minimum function,

T(x,y)=min(x,y). The corresponding t-conorm is the maximum function, ⊥(x,y)=max(x,y).

With these functions, we are able to define the following operations on two fuzzy sets A and B

which are defined over the same universe



. (Other possible choices for t-norms (t-conorms)

can be found in [Gra95]).

Union, A∪B := {(x, max(µA(x), µB(x))) | x∈



} (EQ 14)

Intersection, A∩B := {(x, min(µA(x), µB(x))) | x∈



} (EQ 15)

Equality, A=B := (∀x∈



) (µA(x)=µB(x)) (EQ 16)

Complement, ¬A := {(x, 1-µA(x)) | x∈



} (EQ 17)

fuzzy rules and

inference Fuzzy rules are vague implication rules that use fuzzy sets as predicates to express common-

sense reasoning. The most common form of such rules is “IF A(x) THEN B(y)” (Zadeh-

Mamdani-rules [Kas96]) or, more general, “IF A1(x1)and A2(x2) and ... An(xn)THEN B(y)”.

Example 3.1 Fuzzy rule

In this example, we use a fuzzy rule to describe the following sample DBRE heuristic:

"If the name of an attribute x is similar to its RS R, supplemented with the string ’id’ and if all

tuples in the extension of R have unique values in attribute x and if the extension of R is large

then x might be a key."

In the following, we will use fuzzy predicates to reason about a given attribute x that belongs to

a relation schema RS(x):

Symmetry T(x,y)=T(y,x) ⊥(x,y)=⊥(y,x)

Associativity T(x,T(y,z))=T(T(x,y),z) ⊥(x,⊥(y,z))=⊥(⊥(x,y),z)

Neutral Element T(x,1)=x ⊥(x,0)=x

Null/one element T(x,0)=0 ⊥(x,1)=1

Monotony x ≤z⇒T(x,y)≤T(z,y) x ≤z⇒ ⊥(x,y)≤⊥(z,y)

EVALUATION OF THEORIES 47

IF ANameIsRSName+ID(x) AND Unique(x) AND LargeExt(δ(RS(x)))

THEN Key(x)

Let us abbreviate the first predicate used in the above implication rule as AName. Each of those

four predicates are described by a (fuzzy) set that contains all objects in the universe which

(gradually) comply to this predicate. Figure 3.3 illustrates this for predicates AName and

LargeExt. The left-hand side of Figure 3.3 shows similarity degrees for seven sample attributes

of an RS named user. In this definition of µAName, we use the Levenshstein-distance [Lev66]

(Levensh()) to calculate a measure of similarity of two strings. The right-hand side of

Figure 3.3 shows a sample definition of fuzzy sets that define the predicates LargeExt and

MediumExt for possible extensions δ(RS(x)).

❑

If the fuzzy values in the antecedent of a fuzzy rule are known, it is possible to compute a fuzzy

value for its consequent by using methods of fuzzy inference. Fuzzy inference is based on the

notion of fuzzy implications and fuzzy compositions. In order to define these terms we have to

introduce the formal concept of a fuzzy relation.

fuzzy relationsDefinition 3.12 Fuzzy relation

Let F1,..,Fn be n fuzzy sets over objects of the universe



1,..,



n, respectively. A fuzzy relation

R(F1,..,Fn) is then defined as a fuzzy set over the cross product of the universes



1×..×



n, i.e.,

❑

R(F1,..,Fn)={((x1,..,xn), µR(x1,..,xn)) | x1∈



1,..,xn∈



Afuzzy implication, denoted as A→B, is a fuzzy relation over two fuzzy sets A and B over the

universes



A and



B, respectively. In fuzzy logic there are different ways to define an

implication. This is in contrast to propositional logic where the implication is defined by a

0.2

0.4

0.6

0.8

userid

user_id

usr_iduid id

✧

✧✧

us_ident dpt

Figure 3.3. Sample fuzzy sets for fuzzy predicates AName, LargeExt, and MediumExt

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0 200 400 600 800 1000

d=|δ(RS(x))|

µlargeExt d( )

--- d

200

--------- 2–

 

 

atan+

-----------------------------------------------=

µLargeExt(d)

µMediumExt(d)

µAName x( ) 2

--- Levensh(name(x) userid,( ))atan=

µAName(x)

µmediumExt d( ) 10 4– d300–( )2

( )exp=

48 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

single truth table. Mizumoto and Zimmermann compare 15 different fuzzy implications

[MZ82]. One commonly used implication is defined by the minimum function and has been

introduced by Mamdani [EM77] (for a discussion of other implications, we refer to [Ker92]):

A→B := {((a,b),min(µA(a),µB(b)))|a∈



A, b∈



B} (EQ 18)

Analogously to propositional logic, fuzzy logic uses and- (∧), or- (∨), and not- (¬) operators to

compose logical expressions. These operators are defined by the following compositions:

Definition 3.13 Fuzzy logical operators

Conjunction, A∧B := {(x, min(µA(x), µB(x))) | x∈



}

Disjunction, A∨B := {(x, max(µA(x), µB(x))) | x∈



}

❑

Negation, ¬A := {(x, 1-µA(x)) | x∈



}

MAX-MIN

composition Afuzzy composition of two fuzzy relations R1(A,B) and R2(B,C), denoted as R1•R2, is a

relation (R1•R2)(A,C) obtained by applying R1 and R2 after one another. A typical composition

is the MAX-MIN composition [Zad65]:

(R1•R2)(A,C)={((a,c),max{min(µR1(a,b),µR2(b,c)) | b∈



B}) | a∈



A, c∈



C} (EQ 19)

Analogously to the fuzzy implication, there are other composition operators that have been

successfully applied to fuzzy reasoning [Kas96]. A fuzzy implication and composition allow

for fuzzy inference according to the following compositional inference law. (Other possible

laws of inference are given in [Kas96].)

Definition 3.14 Fuzzy inference

Given an implication A→B and a composition •, a fuzzy set B‘ can be inferred when a fuzzy set

❑

A‘ is known, with B‘=A‘•(A→B)

3.2.4.1 Evaluation

limited support for

uncertainty (R1) Fuzzy logic has been introduced as an approach to linguistic approximation of human

knowledge. It allows to describe and reason about vague concepts but it is less suitable to deal

with uncertain knowledge. Some extensions of this theory have been proposed to overcome

this deficiency. Analogously to the approach described in Section 3.2.1, a popular approach is

to assign confidence factors (CF) to fuzzy rules and facts [Kas96, pp. 194ff]. However, this

solution has similar limitations like production systems with CF in the general case of cyclic

rule bases. Another approach to handle uncertainty is type-2 fuzzy logic [KM98]. It is based on

the concept of type-2 fuzzy sets, introduced in [Zad75]. While in normal (type-1) fuzzy sets

membership degrees are represented as real numbers, in type-2 fuzzy sets, membership degrees

are fuzzy themselves, i.e., they are defined by (type-1) fuzzy values in [0,1]. Hence, type-2

fuzzy sets can be used in situations where there is uncertainty about the membership degrees,

e.g., when the exact shape of the membership function is unknown. This approach can be

viewed as a second-order approximation, compared to type-1 fuzzy logic which represents a

first-order approximation. A qualitative disadvantage of using type-2 fuzzy logic to describe

EVALUATION OF THEORIES 49

uncertain DBRE knowledge is that second-order approximations are more difficult to handle

and compute than other approaches which include a direct notion of uncertainty.

limited support for

contradiction and

ignorance (R3,R5)

Fuzzy reasoning does not meet our requirements for representation of contradicting knowledge

(R3) and partial ignorance (R5). Recently, Zhang proposed to use bipolar fuzzy sets to overcome

this limitation [Zha98]. A bipolar fuzzy set consists of two traditional fuzzy sets which represent

degrees of compatibility or incompatibility with the associated predicate, respectively. Hence,

they allow to reason about the coexistence and interaction of contradicting relationships. In

addition, they are suitable to express partial ignorance.

3.2.5 Possibilistic reasoning

Possibility theory has been introduced by Zadeh [Zad78] in 1978 as a means for approximate

reasoning with uncertain and incomplete information. Since then, possibility theory has been

systematically developed as a calculus of uncertain logics, mainly by Dubois et al. [DP83,

DP88, DLP92, PD93, DLP94, DP97]. Like fuzzy logic, possibilistic logic has its roots in the

theory of fuzzy sets. However, both calculi serve distinct purposes. While fuzzy logic is used to

reason about vague knowledge, possibilistic logic has been developed primarily to reason

about uncertain and incomplete knowledge. In this section, we will introduce the main idea

behind the concept of possibility and we will introduce a calculus for necessity-valued

possibilistic logic. For a comprehensive introduction to possibility theory, we refer to [DLP94].

possibility

and necessity

Possibilistic logic deals with weighted formulae of the form (



,β), where



is a closed formula

in L1 and the valuation β∈[0,1] is a positive real value. The valuation represents a lower bound

on so-called degrees of necessity N(



) or degrees of possibility P(



) of the corresponding

formula



. The value of N(



) expresses to what extent the available evidence entails the truth of



, whereas P(



) expresses to what extent the truth of



does not contradict to the available

evidence.a The degree of necessity and the degree of possibility are dual measures, i.e., N(



)=1-

P(¬



). It is important to note that N(



)=0 or P(



)=1 represent the state of complete ignorance,

i.e., nothing is known about the truth of



. The following properties hold:

P( )=0; P( )=1; N( )=0; N( )=1; (EQ 20)



1∧



2)=min(N(



1),N(



2)); P(



1∧



2)=max(P(



1),P(



2)) (EQ 21)



1∨



2)≥max(N(



1),N(



2)); P(



1∧



2)≤min(P(



1),P(



2)) (EQ 22)

min(N(



),N(¬



))=0; max(P(



),P(¬



))=1 (EQ 23)

a Other possible (physical) interpretations of this mathematical model are summarized in [DP88].

necessity-valued

formulae

In the following, we will only consider necessity-valued formulae, because this fragment of

possibilistic logic is powerful enough for our application.

Definition 3.15 Necessity-valued formula

Anecessity-valued formula is a pair φ:=(



,β), where f is a well-formed formula in L 1 and

β∈[0,1] is a lower bound for the necessity degree of



, i.e., N(



)≥β. Let NPL1 denote the

❑

language of necessity-valued formula.

50 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

Sometimes it is desired to convert a set of necessity-valued formulae Φ to a set of classical

formulae or to extract those formulae from Φthat are at least certain to a given degree. The

following two operations serve these purposes.

Definition 3.16 Classical projection

For a given set of necessity-valued formulae Φ⊆



{NPL1}, the classical projection Φ* is

❑

defined as Φ*:={



| (



,β)∈Φ}.

Definition 3.17 α-cut

For a given set of necessity-valued formulae Φ⊆



{NPL1}, the α-cut Φα is defined as

❑

Φα:={(



,β) | (



,β)∈Φ ∧β≥α}.

semantics The semantics of a set of closed formulae



in L1 is defined by the subset ω of all

interpretations Ω that satisfy all formulae in



. Each such interpretation ω∈Ω is called a model



. In case of a set of formulaeΦ in NPL1 the interpretation is given by a so-called possibility

distribution over Ω that is represented by fuzzy set π of all models for Φ.π can be viewed as a

preference relation over Ω. Based on the possibility distribution π, we can define the possibility

measure P as a function



{L1}→[0,1], with P(



)=sup{π(ω),ω



}, ω∈Ω. (EQ 24)

Consequently, the dual necessity measure N(



)=1-P(¬



), induced by π is defined by



{L1}→[0,1], with N(



)=inf{1-π(ω),ω ¬



}, ω∈Ω. (EQ 25)

A possibility distribution π is said to satisfy a formula (



,β)∈



{NPL1}, iff N(



)≥β.

Consequently, a set of formulae Φ={φ

    



}⊆



{NPL1} is satisfied by a possibility distribution

π, denoted as π Φ, iff ∀i∈[1,n],πsatisfies φ



. Then, a logical formula φ

  

∈



{NPL1} is a

logical consequence of a set of formulae Φ⊆



{NPL1}, iff all possibility distributions that

satisfy Φ also satisfy φ

  

, i.e., the following condition holds.

∀π (π Φ



⇒(π φ

  

)) (EQ 26)

partial

contradiction For a consistent set of possibilistic formulae Φ, we require the fuzzy set that represents the

possibility distribution π induced by Φ to be normalized, i.e., sup{π(ω)|ω∈Ω}=1. Possibilistic

logic is also able to deal with partial contradiction if we give up the above normalization

condition, i.e., if we allow for sup{π(ω)|ω∈Ω}=1-i, i∈(0,1]. Consequently, the axiom N(⊥)=0

(EQ 20) (given for the consistent case) is no longer valid, because

N(⊥)=N(



∧¬



)=min(N(



),N(¬



))=i>0. However, the following properties still hold:

N( )=1 (EQ 27)



1∧



2)=min(N(



1),N(



2)) (EQ 28)



1∨



2)≥max(N(



1),N(



2)) (EQ 29)

(



2)⇒ N(



2)≥N(



1) (EQ 30)

EVALUATION OF THEORIES 51

Definition 3.18 Partial contradicting set of formulae

A set of formulae Φ={φ

¡ 



}⊆



{NPL1} is said to be partial contradicting (inconsistent), if

there is no normalized possibility distribution that satisfies Φ, i.e.,

Cons(Φ)=supπ Φsupω∈Ωπ(ω)<1 (EQ 31)

Cons(Φ) and Incons(Φ)=1-Cons(Φ) are called the degree of consistency or inconsistency

❑

(contradiction) of Φ, respectively.

deduction problemAccording to [DLP94], the deduction problem in possibilistic logic can be stated as follows:

Given a set of formulae Φ⊆



{NPL1} and a classical formula



that we would like to deduce

from Φ, we have to compute the best valuation β (i.e., the best lower bound of a necessity

degree) such that (



,β) is a logical consequence of Φ. This means, we ha ve to compute

Val(



,Φ)=sup{β∈(0,1]|Φ(



,β)}.

least specific poss.

distribution

This valuation is defined by the necessity measure Val(



,Φ)=NΦ(



) which is induced by the

least specific possibility distribution πΦ satisfying Φ. For a given set of formulae

Φ={(



,β



)

   

(



,β



)}⊆



{NPL1} the least specific possibility distribution πΦ is defined as

πΦ(ω)=min{1-β



| ω ¬



, i∈[1,..,n]}.

best modelπΦ imposes a preference relation over all models of Φ. In order to solve the aforementioned

deduction problem, we have to select a best model which means to choose an interpretation

ω*∈Ω that is most compatible with Φ. The degree of compatibility of a given model ω is

defined by πΦ(ω). Note, that such a best model always exists; a proof can be found in [DLP94].

Definition 3.19 Best model

Let Φ⊆



{NPL1} be a set of possibilistic formulae. Any interpretation ω*∈Ω that maximizes πΦ

❑

is called best model of Φ, i.e., πΦ(ω*)=sup{πΦ(ω)|ω∈Ω}.

inferenceIn [Lan91] and [DLP94], Lang et al. propose a formal system in terms of a set of axioms and

inference rules, that implements the described semantics of NPL1, i.e., that fulfills the

condition that every possibilistic formula φ∈



{NPL1} is a consequence of a set of possibilistic

formulae Φ⊆



{NPL1}, iff φ can be derived from Φ using the proposed formal system. Similar

versions of the following inference rule GMP (graded modus ponens) have been used in many

theoretical frameworks for uncertain reasoning, e.g., [Res76, FG90].

Definition 3.20 Formal system for NPL1

Axioms:

(A1) (

 

⇒(

 ¢

⇒

 

) 1)

(A2) ((

 

⇒(

 ¢

⇒

 £

))⇒((

 

⇒

 ¢

)⇒ (

 

⇒

 £

)) 1)

(A3) ((¬

 

⇒¬

 ¢

)⇒((¬

 

⇒

 ¢

)⇒

 

)1)

(A4) ((∀x(

 

⇒

 ¢

)⇒(

 

⇒(∀x

 ¢

)) 1), if x does not appear in

 

and is not bound in

 ¢

(A5) ((∀x



)⇒



x|t 1), if x is free for t in



Inference rules:

(GMP) (

 

,β), (

 

⇒

 ¢

,γ) (

 ¢

,min(β,γ))

(G) (



,β) ((∀x



),β), if x is not bound in



❑

(S) (



,β) (



,γ), if γ≤β

⊥

52 A THEORY TO MANAGE IMPERFECT KNOWLEDGE

3.2.5.1 Evaluation

The theory of possibilistic reasoning meets all requirements identified in Section 3.1. It is well-

suited to describe and reason about uncertain knowledge (R1). The deduction mechanism

described above deals with contradicting domain-specific and situation specific knowledge.

Dubois et al. show that for a given set of formulae φ∈



{NPL1} the degree of contradiction

Incons(φ) acts as a threshold that inhibits all formulae of φ with a valuation equal to or under

this threshold [DLP94, pp. 458ff]. If φ contains all domain-specific and situation-specific

knowledge the contradicting part of φ can be isolated by selecting all formulae φi⊆φ that have a

valuation lower or equal to Incons(φ). The part of φi that represents the (contradicting)

situation-specific knowledge can be indicated to the user. Hence, both requirements, R2 and R3,

are satisfied. Moreover, requirement R5 is fulfilled because ignorance about the truth of a

proposition u∈



can be expressed by N(u)=N(¬u)=0 or P(u)=P(¬u)=1, respectively.

The deduction operator introduced in Definition 3.20 is monotonic. However, in

[DLP94, pp. 466ff], Dubois et al. define the nontrivial deduction operator that allows only

for the deduction for formulae with a valuation greater than the degree of contradiction, i.e.,

φ(



,β) iff φ(



,β) and β>Incons(φ). (EQ 32)

Hence, the nontrivial deduction operator enables non-monotonic reasoning (requirement R4),

i.e., it is possible that φ(



,β) and φ∪φ*(



,β).

Finally, the problem of inference in possibilistic rule bases has polynomial complexity. Of

course, if general first-order formulae in NPL1 are considered the complexity of inference is

exponential with respect to the number of elements in the universe of discourse.

3.3 Summary and conclusion

In this chapter, we elaborated a catalog of major requirements on a formalism that is suitable to

manage imperfect DBRE knowledge in human-centered CARE environments. Based on these

requirements, we systematically evaluated five important approaches to represent and reason

about uncertain knowledge. We would like to emphasize that this evaluation is not general but

dedicated to our particular application domain. Other applications might impose different

criterions. In the following, we summarize the result of our evaluation in order to decide which

approach is most suitable for our purpose. Figure 3.4 shows a decision matrix that relates each

approach with each requirement imposed. In this matrix a requirement is either fulfilled (✓),

partly fulfilled (■) or failed (✕) by a given approach. Obviously, this kind of condensed

classification represents a simplified view on the results of our evaluation, i.e., it does not show

preferences between two approaches which both fulfill or fail a given requirement. Still, it

serves our purpose to identify the formalism which is most appropriate for the application to

DBRE. Moreover, a quantitative classification would be rather hypothetical without further

experimental results.

The main reasons for the unsuitability of production systems with confidence factors for our

application is their computational difficulties in the case of cyclic inference networks.

Moreover, they lack mechanisms to deal with contradicting and incomplete knowledge. Due to

their computational complexity, probabilistic and credibilistic reasoning do not scale up for

applications to practical DBRE problems: the concept of objective probability which is based

⊥

|∼

⊥

|∼

SUMMARY AND CONCLUSION 53

on the relative frequency of events does not apply to the DBRE context. Even if a subjectivistic

view on probabilities is used it is problematic to estimate their reliability (in terms of error

models). The multiplicative combination of uncertainties amplifies estimation errors which

might lead to unreasonable results. In addition, the credibilistic approach lacks an explicit

notion of contradiction. The primary focus of fuzzy reasoning is to deal with vague rather than

uncertain knowledge. Existing approaches to incorporate a notion of uncertainty in fuzzy logic

(e.g., confidence factors and Type-2 fuzzy sets) comprise significant limitations w.r.t. to our

application domain (cf. Section 3.2.4). This is in contrast to possibility theory which allows to

reason about uncertain, contradicting, and incomplete knowledge. Consequently, possibility

theory turns out to be most suitable to implement and reason about DBRE knowledge. In the

next chapter, we will use this theory as a basis to develop a dedicated, high-level formalism to

specify, customize, and execute DBRE knowledge.

Figure 3.4. Evaluation summary

Production

sytems with CFs Probabilistic

reasoning Credibilistic

reasoning Fuzzy

reasoning Possibilistic

reasoning

(uncertainty) ✓ ■

(error models) ✓(■)

(CF and

Type-2-logic)

✓

(represent. of

contradiction)

✓ ✓

(deviation of

uncertain

probabilities)

✓ ■

(interpolation) ✓

(indication of

contradiction)

(no explicit notion

of contradiction)

(adaptation of

domain-spec.

knowledge)

(no explicit

notion of

contradiction)

■

(bipolar fuzzy sets) ✓

(incomplete-

ness)

✕ ✓

(belief revision

[AGM85])

✓

(Dempster’s rule

of conditioning)

✓

(nonmonotonic

fuzzy logic, e.g.,

FNM3 [DD92])

✓

(non-trivial

deduction

operator)

(ignorance) ✓ ✕ ✓ ✓

(bipolar fuzzy sets) ✓

(computational

tractability)

■

(problem with

cycles)

✕ ✕ ✓ ✓

Approach

Requirement

CHAPTER 4 GFRN AS A BASIS FOR

LEGACY SCHEMA ANALYSIS

In our experience, lack of customizability is the single most common limiting factor in

using tools for software analysis and transformation.

Markosian et al. [MNB+94]

This chapter introduces Generic Fuzzy Reasoning Nets (GFRNs) as a dedicated formalism to

specify, customize, and execute database reengineering (DBRE) knowledge applied to schema

analysis. The development of this formalism has been driven by the requirements elaborated in

Section 3.1. It is based on possibilistic logic (and fuzzy set theory) which, according to our

evaluation in Section 3.2, is most adequate to manage imperfect knowledge in our specific

application domain. The GFRN approach enables to realize a CARE environment that supports

partial automation of the schema analysis process but provides a high amount of

customizability and extensibility. GFRNs facilitate the integration of various existing analysis

methods and the adaption of domain-specific DBRE knowledge. Our approach is human-

centered because it allows for (and depends on) human interaction in an evolutionary rather

than a phase-oriented schema analysis process. It reflects on the mental model of the

reengineer and guides her/him from initially incomplete and contradicting knowledge about a

legacy database (LDB) to a complete and consistent model of the corresponding logical

schema. This logical schema is the basis for subsequent conceptual migration and redesign

activities discussed in Chapter 5.

The structure of this chapter is as follows. In the next section, we give an overview of the

proposed schema analysis process that is supported by our approach. Section 4.2 introduces

GFRNs as a dedicated formalism to specify domain-specific DBRE knowledge and processes.

Subsequently, we develop an inference mechanism for GFRN specifications that can be

implemented in a human-centered CARE tool (Section 4.3). Section 4.4 presents the Varlet

Analyst which is a prototype implementation of the concepts developed in this chapter.

Section 4.5 reports on our experiences with applying this implementation to evaluate our

approach with practical DBRE problems. A discussion of related work in the domain of legacy

schema analysis is presented in Section 4.6. Finally, a summary of this section and its results is

given in Section 4.7.

4.1 Supporting human-centered schema analysis processes

The main purpose of this chapter is to clarify the role of GFRN specifications in the proposed

schema analysis process before we introduce the actual formalism. Moreover, the structure of

the rest of this chapter is directly motivated by this schema analysis process which is shown as

a data flow diagram in Figure 4.1.

56 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

customization

process It is important to distinguish between activities that aim to customize the prospected CARE

tool to a specific application context from activities that are involved in the actual process of

applying the tool. The activities that belong to the customization process are displayed with a

grey background in Figure 4.1. In this process, a knowledge engineer investigates the LDB in

order to determine the specific application context of the CARE tool. The result of this domain

analysis step is a number of technical and non-technical characteristics of the current LDB,

e.g., properties of the employed hard- or software platform and applied coding conventions,

respectively. Subsequently, the knowledge engineer specifies or adapts the domain-specific

DBRE knowledge that is applied in the schema analysis process according to these

characteristics. The corresponding knowledge is formally represented by a GFRN

specification.

analysis process After the tool has been customized with respect to its current application context it can be

employed to analyze the schema of the LDB which is under investigation. This analysis

process is performed semi-automatically. At first, automatic analysis operations are applied to

different legacy software artifacts including the LDB’s schema catalog, procedural code, and

the available data. The result of this initial automatic analysis is a set of (situation-specific)

facts about the LDB. Subsequently, these facts are taken as indicators which are combined with

the domain-specific knowledge specified in the GFRN to infer new knowledge about possible

schema constraints. This newly inferred knowledge might comprise definite facts as well as

uncertain and contradicting hypotheses. Some of these hypotheses might be refutable using

automatic analysis operations. We call such analysis operations goal-driven because they are

performed “on-demand” to support or refute intermediate hypotheses. According to the

domain-specific characteristics of the LDB, the GFRN specification determines which

operations are available and when they are performed.

cycl_join

i1: 0.7 v2⊆v1

i2: 0.3

i10: 1.0

i7: 0.6

sel_dist

key IND

validIND

validKey

i3: 1.0

i5: 1.0

v2=π2(v1)

i9: 1.0

equiv

i8: 0.5

tcomp

nsimilar

i6: 0.8

v2v1v1

GFRN

LDB

Figure 4.1. The proposed schema analysis process

automatic

analysis

automatic

analysis

initial

goal-

driven

presentation

/dialog

non-monotonic

inference

analysis

manual

discussion

knowledge

engineer

specification/ domain-specific

DBRE knowledge

(heuristics)

indicators

hypotheses

support/

refutation

schema,

data,

code

hypotheses/

definite facts

(inconsistent)

logical schema

hypotheses/

definite facts

queries

results

analysis

domain

reengineer

application expert

customization

DBRE

knowledge GFRN

characteristics

adaption

5 EXEC SQL DECLARE c8

CURSOR FOR

16SELECT d FROM

DOCUMENT d, KEYW k

5 EXEC SQL

DECLARE c8

CURSOR FOR

16SELECT d FROM

engine

SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 57

user interactionThe output of this non-monotonic inference process is a logical schema which might still

partially be inconsistent and uncertain. This schema is presented to the reengineer in a dialog

process that provides interactive queries to indicate the sources of such imperfect knowledge.

The reengineer might discuss this information with application experts (e.g., developers or

operators) and do further manual investigations of the LDB. As a result of these manual

activities the reengineer might enter additional hypotheses or definite facts about the LDB.

Now, the inference process is resumed, i.e., new knowledge is inferred and automatic (goal-

driven) analysis might be performed to validate hypotheses. The described semi-automatic

schema analysis process is iterated until a complete and consistent logical schema is obtained.

role of the GFRNFrom the above description it becomes clear that the domain-specific DBRE knowledge which

is defined in a GFRN serves mainly three purposes: (1) it unburdens the reengineer from

manually analyzing recurring situations and focuses her/his attention on non-standard

situations, (2) it controls the consistency of the analysis result, i.e., the logical schema, and (3)

it facilitates the customization of the CARE tool to changing application contexts.

4.2 Specification of database reengineering knowledge

In the previous section, we have described the proposed schema analysis process and clarified

the role of predefined domain-specific DBRE knowledge. The current section is dedicated to

the definition of GFRN as a formalism to specify this knowledge. According to the results of

our evaluation in Chapter 3, we have chosen possibility theory as the formal framework for the

definition of the GFRN semantics. As customizability is a crucial requirement in our

application domain, we have developed GFRN as a graphical formalism that provides a high

level of abstraction and, thus, facilitates human comprehension. In Section 4.2.1 and

Section 4.2.2, we begin with an informal introduction of GFRNs followed by their formal

definition in Section 4.2.3.

Basic definitions

Before we begin with the introduction of GFRNs as a formalism to support legacy schema

analysis, we need a more precise notion of the actual analysis result, i.e., an analyzed logical

schema of a relational database. In Section 2.4.1, we exemplified that such an analyzed schema

basically consists of a relational schema with semantical annotations. Similar to an approach

proposed by Fahrner and Vossen [FV95], we used annotations to classify INDs according to

their semantics. In addition, we generalized the notion of attributes that can contain NULL-

values to an explicit concept of different relational variants (cf. page 20). We formalize the

signature of an analyzed logical schema in Definition 4.1. As the semantics of the relational

data model is well-known we forego a formal definition of its interpretation in this chapter.

However, such a formalization is included as Definition A.1 in Appendix A. From now on, we

refer to an analyzed logical schema even if we use the expression logical schema for

abbreviation.

58 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

Definition 4.1 Signature of an analyzed logical schema

An (analyzed) logical schema for a relational DB is a tuple (T, R, ∆,

), where

•T={t1,...,tm}, m∈, is a finite set of attribute type names;

•R={n1,...,nm}, m∈, is a finite set of relation schemas (tables); each r∈R is a tuple

r:(n, X, ∑, V), where

• n is a unique name of a relation schema (RS);

• X(r)=X={x1,...,xm}, m∈, is a finite set of column signatures; each x∈X is a tuple

x:(n, c, t), where c is a unique attribute (column) name and t∈T is an attribute type;

•∑(r)=∑={σ1,...,σm}, m∈, is a finite set of keys, with σj⊆X, for j∈[1,m];

• V(r)=V={v1,...,vm}, m∈, is a non-empty, finite set of variants; each vj∈V, j∈[1,m],

is a subset of X that includes all keys, i.e., vj⊆X, ∀σ∈∑:σ⊆ vj;

•∆={d1,...,dm}, m∈, is a finite set of inclusion dependencies (INDs); each d∈∆ is a tuple

d:(l, r, I), where

• l ∈V is a variant of an RS (n, X, ∑, V)∈R and represents the left side of the IND;

• r:(n, X, ∑, V) ∈R is the RS that represents the right side of the IND;

• I={i1,...,im}, m∈, is a finite set of pairs of equivalent attributes,

for each (xl,xr)∈I, xl∈V and xr∈X.

•

:∆→{I-IND, R-IND, C-IND} is an annotation function that classifies each IND d∈∆ as

• I-IND, if d semantically represents an inheritance relationship,

• R-IND, if d semantically represents an association, and

• C-IND, if d semantically represents a cardinality constraint (cf. [FV95]).

For notational convenience, we define for any attribute x that RS(x) denotes the corresponding

❑

RS, i.e., ∀r∈R, x∈X(r): RS(x):=r.

4.2.1 Informal introduction to GFRNs

The purpose of the GFRN language is to define domain-specific knowledge and analysis

processes which are executed in a semi-automatic reverse engineering activity to recover a

logical schema that is structurally complete and semantically enriched. In the following, we

will informally discuss several example GFRN specifications that define parts of the

knowledge employed in our DBRE case study in Section 2.4.1. For each of these examples, we

denote the corresponding formal semantics in necessity-valued possibilistic logic (NPL1)

(cf. Section 3.2.5).

A GFRN specification is a graphical network of fuzzy predicates (represented as ovals) and

uncertain implications (represented as rectangles). Predicates and implications are connected

by directed arcs which are labeled by variable names. Figure 4.2 shows a simple example for a

GFRN that represents the heuristic that an instance of a cyclic-join pattern over a set of

relational attributes indicates a possible key constraint over these attributes (cf. page 18). The

corresponding GFRN contains two predicates (cyclicJoin1 and key1) and one implication. Each

predicate has a unique name which terminates with a number that denotes the arity of the

corresponding predicate.

SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 59

The premise of an implication is defined by all predicates that

are the sources of ingoing arcs of this implication. Each

implication has an associated confidence value (CV) between 0

and 1. Based on the theory of possibilistic logic, the semantics

of a CV is a lower bound of the necessity that the

corresponding implication is valid (cf. Section 3.2.5). The

semantics of an implication in a GFRN is defined by a closed

formula in NPL1, i.e., all variables are implicitly quantified by

a universal quantifier. Hence, the semantics of the GFRN in

Figure 4.2 is defined by a formula (

,0.7)∈

{NPL1}, with

:= ∀x (cyclicJoin1(x)→key1(x)).

constraints

and negation

In order to express more complex heuristics, implications can

be associated with constraints over variables that are attached

to in- and outgoing arcs. As an example, the GFRN in

Figure 4.3 represents the heuristic that an instance of a select-

distinct pattern over a set of selected attributes s serves as an

indicator against the assumption that one of the subsets k⊆s

might represent a candidate for a key (cf. page 18). Note, that

in order to simplify the GFRN syntax we use prefix notation

to denote operations defined in constraints of implications.

The negation in the conclusion of the corresponding

implication is represented by an arc with a solid arrow head.

We would like to emphasize that the CVs presented in our

examples are not absolute but depend on the specific

characteristics of the LDB under investigation. They are

adjusted according to the results of the domain analysis activity during the tool customization

process (cf. Figure 4.1). The relatively low CV of the select distinct heuristic in Figure 4.3

might reflect on the fact that by investigating code samples the knowledge engineer has

discovered that the programmers of the LDB had not been precise in using the distinct keyword

only in queries where it is necessary to suppress duplicate tuples. The semantics of the GFRN

in Figure 4.3 is defined by a formula (

,0.3)∈

{NPL1}, with

:=∀s∀k ((k⊆s→selectDist1(s))→¬key1(k)).

conjunctionLogical conjunctions are represented in the GFRN

formalism by connecting two or more predicates to the

premise of an implication. An example for such a situation

is given in Figure 4.4. The shown implication specifies the

heuristic that an inclusion dependency (IND) can be

classified as an R-IND if it is key-based, i.e., if there is a key

constraint over the attribute set on its right-hand side.

According to Definition 4.1, the signature of an IND

is represented by a tuple (l, r, i)

where i is a set of pairs of corresponding attributes, i.e.,

i={(a1,b1),(a2,b2), ...,(an,bn)}. Operation Π2(i) applied in the

constraint of the implication in Figure 4.4 represents the projection on the second component

0.7

cyclicJoin1

key1

Figure 4.2. Simple GFRN

0.3

selectDist1

key1

Figure 4.3. Implication with

constraint and negation

⊆(k,s)

0.5

k=Π2(i)

key1

R-IND1

IND1

Figure 4.4. Implication with

conjunction

Πa1...,an

,δRl

( ) Πb1...,bn

,δRr

( )⊆

60 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

of each tuple in a given relation i, i.e., k=Π2(i)={b1,b2,...,bn}. Hence, k represents the set of

attributes on the right-hand side of the IND which have to represent a key according to the

second predicate in the implication’s premise. The semantics of the GFRN in Figure 4.4 is

defined by a formula (

,0.5)∈

{NPL1}, with

:=∀k∀i ((k=Π2(i)→(key1(k)∧IND1(i))→R-IND1(i)).

thresholds A problem that arises with the use of quantitative measures for uncertainty is that the inference

might lead to a vast amount of hypotheses with a low certainty. For example, let us consider the

heuristic that a key might be indicated by an attribute name that is similar to the name of its RS

with the suffix “id” (cf. Example 3.1 on page 46). Using the Levenshtein distance [Lev66] to

measure the similarity of strings we obtain the similarity measures displayed in Figure 4.5 for

seven sample attribute names of table USER (cf. Example 3.1 on page 46). During manual

analysis a reengineer would not consider an attribute like dpt with respect to the above

heuristic, because the similarity of its name with the string “userid“ is very low. Considering

such indicators in the proposed automatic knowledge inference process would entail the

generation of many false positives. Subsequently, the reengineer would have to validate each

such hypotheses manually to obtain a definite analysis result. This contradicts to our goal to

unburden the reengineer from stereotypical activities and focus her/his attention.

In the GFRN approach, we allow to suppress incredible

indicators by assigning a threshold value (TV) to each

implication. A TV defines the minimum amount of

certainty that is needed for a premise such that the

corresponding implication is considered. The semantics of a

TV is defined by an α-cut on the fuzzy set that represents

the propositions in the premise of the corresponding

implication. For example Figure 4.6 shows an implication

that specifies the naming heuristic discussed above. It has

an associated TV of 0.2 which is represented by another

real number that is separated from the CV by a slash. The

dashed line in Figure 4.5 illustrates how this threshold

0.2

0.4

0.6

0.8

userid

user_id

usr_iduid id

✧

✧✧

us_ident dpt

Figure 4.5. Similarity measures for the seven sample attribute names with the string userid.

--- Levensh(x userid,( ))atan

threshold (α-cut)

key1

Figure 4.6. Implication with

threshold

ANameIsRSName+ID1

0.8/0.2

k=set(a)

SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 61

suppresses all propositions in the premise that have a certainty measure lower than 0.2.

Furthermore, we have to consider that the naming heuristic uses names of single attributes as

indicators for key constraints. However, key constraints are generally defined in sets of

attributes. Hence, we have to restrict the hypothetical key k to be a set of only one attribute a.

This restriction is represented by the constraint k=set(a). The semantics of the GFRN in

Figure 4.6 is defined by a formula (

,0.8)∈

{NPL1}, with

:=∀a∀k((k=set(a)→ANameIsRSName+ID1(a)∧N(ANameIsRSName+ID1(a))>0.2)→key1(k)).

premise with

inner universal

quantifier

Some heuristics consider a set of indicators to infer new

knowledge where the cardinality of this set depends on

situation-specific knowledge. For example, in our case study,

we applied a heuristic that uses an arbitrary set of pairs of

similarly named attributes in two different RS as an indicator

for a complex foreign key (IND) (cf. page 16). To be able to

specify such heuristics we have to provide means to consider

all situation-specific knowledge that fulfills certain specified

constraints, i.e., we need an explicit notion of a universal

quantifier within the premise of an implication. In the GFRN

formalism, such an inner universal quantifier (IQ) is

represented by an arc with a cancelled arrow head.

Figure 4.7 shows an example that specifies the heuristic

discussed in this paragraph. The predicate NamSim2 is

defined by the fuzzy set of pairs p:(a,a) of attrib utes with

similar names. The constraint ∈(p,i) restricts p to be a pair of corresponding attributes in the

hypothetical IND i:{(a1,a1),(a2,a2),...,(an,an)}. Furthermore, by using the constraint

disj(Π1(s),Π2(s)) we restrict the left-hand side {a1,a2,...,an} and the right-hand side

{a1,a2,...,an} of the hypothetical IND to be disjoint. Finally, we require that at-a-time the left-

hand side and the right-hand side of the concluded IND belong to one single RS (sameRS(...)).

The semantics of the GFRN in Figure 4.7 is defined by a formula (

,0.5)∈

{NPL1}, with

:=∀i(∀p(p∈i∧disj(Π1(i),Π2(i))∧sameRS(Π1(i))∧sameRS(Π2(i))→NamSim2(p)∧

N(NamSim2(p))>0.2) →IND1(i)).

The above examples show that compared with textual formulae the graphical GFRN formalism

improves the understandability of specified knowledge, significantly. In the following

examples, we will skip the translation of GFRN specifications to NPL1 for the sake of

readability. We will come back to this issue when we formally define the syntax and semantics

of GFRN specifications in Section 4.2.3.

variable

aggregation

and composition

In the previous example, we already implicitly used the concept of variable aggregation: we

used a single variable p to denote a tuple (a,a) of attributes. In general, each arc in a GFRN is

labeled either by a tuple of n variables which stands for the arguments of the connected n-ary

predicate, or by a single variable that denotes the entire tuple. This notation can be used to

aggregate variables as well as to compose new tuples. Figure 4.8 shows an application example

that combines both techniques.

0.5/0.2

∈(p,i)

IND1

NamSim2

Figure 4.7. Premise with

universal quantifier

disj(Π1(i),Π2(i))

sameRS(Π1(i))

sameRS(Π2(i))

62 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

The left implication is a more sophisticated version

of the implication in the previous example. While

the implication in Figure 4.7 restricts the attributes

on the left-hand side of the hypothetical IND to

belong to the same RS, the left implication in

Figure 4.8 strengthens this condition by restricting

them to be in the same variant. This reflects on the

experience with our DBRE case study which has

shown that the extension of an RS might comprise

different variants of tuples (cf. page 20). As a

consequence, we have to extend the definition of

predicate IND2 by an additional parameter (v) which

represents the variant for the IND’s left-hand side. In

the conclusion of the left implication in Figure 4.8

the argument of predicate IND2 is composed by

variables i and v. The implication on the right side of

Figure 4.8 specifies the definite knowledge that a hypothetical IND can only by true if it is

valid in the available data. Obviously, predicate validIND2 has to be defined over the same

formal parameters as predicate IND2. However, for the right implication in Figure 4.8 we

aggregate the pair of parameters in one variable (t).

Figure 4.9 combines the example heuristics discussed in this section in one single GFRN.

Figure 4.8. Variable aggregation

and composition

IND2

i,v

variant1

vvalidIND2

1.0/0.2

0.5/0.2

∈(p,i)

disj(Π1(i),Π2(i))

⊆(Π1(i),v)

NamSim2

sameRS(Π2(i))

Figure 4.9. Combination of heuristics in a single GFRN

key1

ANameIsRSName+ID1

0.8/0.2

k=set(a)

0.5

k=Π2(i)

R-IND2

0.3

selectDist1

⊆(k,x)

0.7

cyclicJoin1

IND2

i,v

variant1

validIND2

1.0/0.2

0.5/0.2

∈(p,i)

disj(Π1(i),Π2(i))

⊆(Π1(i),v)

pNamSim2

i,v

sameRS(Π2(i))

SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 63

4.2.2 Integration of automatic analysis operations

The GFRN formalism described so far allows to define domain-specific heuristics to reason

about situation-specific knowledge. If we want to employ this reasoning process in a semi-

automatic schema analysis process (as described in Section 4.1) we have to provide means to

integrate automatic analysis operations which retrieve situation-specific knowledge from the

LDB.

existing

operations

In the DBRE community, there is a great variety of programs and procedures that perform

analysis of different parts of an LDB. For example, in [PB94], Premerlani and Blaha report on

their experience in schema analysis using a simple but flexible tool set which mainly contains

UNIX tools [RRF90] like grep and awk . Anderson [And94] defines a number of recurring

patterns in the procedural LDB code that can be used as semantic indicators. In [Bew98],

Bewermeyer extends this collection of patterns and employs graph grammars to recognize

them in an abstract syntax graph representation. Petit et.al. [PKBT94] describe specific

database queries that can be used to extract important information from the available legacy

data.

In this section, we describe how such existing operations can be integrated with the GFRN

formalism to achieve the desired knowledge-driven analysis process. In Section 4.1, we have

distinguished between two kinds of analysis operations: (1) operations that perform an initial

analysis of the LDB, and (2) operations which are only executed on-demand to refute or

support intermediate hypotheses. Let us revisit an example from our case study to motivate this

distinction.

In Section 2.4.1, we have exemplified how indicators for foreign key constraints can be found

by employing heuristics about naming conventions of LDB schema components. One of these

heuristics searches the schema for pairs of RS that have (groups of) attributes with similar

names (cf. page 16). If such a situation can be found in an LDB schema, our heuristic leads to

an uncertain hypothesis that there might be a foreign key constraint between the two RS.

However, such a foreign key might only exist if the corresponding IND is valid in the available

data. Using the GFRN formalism we can specify this knowledge as shown in Figure 4.9. Both

fuzzy sets that define the predicates validIND2 and NamSim2 can be determined by automatic

analysis operations: the validity of INDs can be checked by predefined queries to the data and

string similarity measures can be used to check the schema for naming conventions. Still, there

is a qualitative difference between both predicates. While predicate NamSim2 serves to indicate

a semantic constraint, predicate validIND2 is used to validate this indication. Hence, the

validity of a hypothetical IND should only be checked when it has actually been indicated.

Another rational for such a goal-driven analysis is that the computational effort which was

involved in checking the validity of all possible combinations of INDs beforehand would grow

exponentially with the size of the LDB’s schema. Hence, this solution would contradict to our

requirement for scalability (cf. R6 on page 37).

data- and

goal-driven

operations

According to the above motivation, we classify automatic analysis operations as either data-

driven, i.e., they are executed before the inference process starts to provide an initial set of

indicators, or goal-driven, i.e., they are invoked on demand during the inference process. This

classification can be performed according to the guidelines displayed in Figure 4.10. If an

analysis operation delivers facts about an LDB that represent valuable indicators for semantic

64 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

constraints and this operation is computational inexpensive, then it should be classified as data-

driven. On the other hand, if an operation delivers facts that are less suitable as indicators and

this operation is computational expensive, then it should be classified as goal-driven. The

classification of other analysis operations depends on the application context, i.e., on the

concrete LDB under investigation. For example, an analysis operation that delivers valuable

indicators but is computational expensive can be classified as data-driven for a small-scale

LDB, but it should be classified as goal-driven if the LDB has a large scale.

different types of

predicates In the GFRN approach, predicates can be bound to

data- and goal-driven operations. Consequently,

such predicates are called data-driven or goal-

driven, respectively. Predicates that are not bound

to analysis operations are called dependent.

Figure 4.11 shows that data- and goal-driven

predicates are represented as bold ovals with

different colors, where black means data-driven

and grey stands for goal-driven. Furthermore, the

left-most implication in Figure 4.11 exemplifies

that the application of goal-driven predicates is not

limited to the purpose of refuting hypotheses: the

validity of a hypothetical IND in a large amount of

data delivers a good support that this hypothesis is

true. Still, hypotheses cannot be proved by means of data. Hence, we attached a CV lower than

1 to this implication.

Figure 4.12 displays an example for a goal-driven analysis operation named validate_IND

which can be bound to predicate validIND2 in Figure 4.11. The first argument of operation

validate_IND (B) represents the LDB which is the current target of the analysis. Note, that in

contrast to the other two arguments, parameter B is not represented explicitly in the GFRN.

Operation validate_IND returns a degree of necessity for and against the proposition that the

corresponding IND holds in B. The algorithm uses a local variable ψ to store all tuples that

belong to variant v on the left side of the IND i. If these tuples contain no counterexample for

the hypothetical IND the necessity of validIND2(i,v) is computed depending on the cardinality

of ψ. A large amount of data entails a higher support for the hypothesis than just a few tuples.

The generated membership function is illustrated in Figure 4.13. Otherwise, if a

counterexample can be found, the hypothesis is refuted. Note, that we have presumed correct

Figure 4.10. Characteristics for classifying automatic analysis operations

high indication low indication

computational

inexpensive data-driven data-driven

goal-driven

computational

expensive data-driven

goal-driven goal-driven

IND2

Figure 4.11. GFRN with data- and

goal-driven predicates

validIND2

1.0/0.2

0.8/0.2

i,v

variant1

0.5/0.2

∈(p,i)

disj(Π1(i),Π2(i))

⊆(Π1(i),v)

NamSim2

sameRS(Π2(i))

SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 65

legacy data for the definition of this analysis operation. If we expect the data to include some

incorrect tuples, i.e., tuples that do not comply to INDs even if they exist, we should choose an

analysis operation that gradually refutes IND hypotheses according to the relative number of

contradicting tuples.

The purpose of the pseudo code in Figure 4.12 is simply to introduce our concept of integrating

automatic analysis operations with knowledge represented in GFRN specifications. We employ

the programming language Java for concrete implementations of goal- and data-driven

analysis operations. This issue will be discussed in Section 4.4.

Operation validate_IND (B, i, v)

input: B (* B is an RDB according to Definition 3.3 *)

i:({(a1,a1),(a2,a2),...,(am,am)}), v

(* (v,RS(a1),i) is an IND signature w.r.t. Definition 4.1 *)

output: N(validIND2(i,v))∈[0,1], N(¬validIND2(i,v))∈[0,1]

begin

let r1=RS(a1,...,an);

let r2=RS(a1,...,an);

let ψ= {x∈δ(r1) | ∀a∈X(r1) : (a∈v→Πa(x)≠NULL) ∧ (a∉v→Πa(x)=NULL)}

(* ψ represents all members of variant v *)

if (* is the IND valid? *)

then let N(validIND2(i,v))=

let N(¬validIND2(i,v))=0

else let N(validIND2(i,v))=0

let N(¬validIND2(i,v))=1

end.

Πa1...,an

,ψ Πa1...,an

,δr1

( )⊆

--- ψ

100

---------

 

 

atan

Figure 4.12. Goal-driven analysis operation validate_IND

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0 200 400 600 800 1000

Figure 4.13. N(validIND2(i,v)) for the case of no counterexamples

N validIND2i v,( )( )=2

--- ψ

100

---------

 

 

atan

66 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

4.2.3 Formal definition

In the previous sections, we have informally introduced and exemplified GFRNs as a

formalism to specify DBRE knowledge and semi-automatic analysis processes. In the

following, we will give a formal definition of the syntax and semantics of this language.

4.2.3.1 Syntax of GFRN

In this section, we formalize the syntax of GFRN specifications by defining their signatures

and a set of context sensitive constraints.

Definition 4.2 Signature of a GFRN

Ageneric fuzzy reasoning net is defined by a 9-tuple GFRN:=(P, Fr, Fb, I, E, cf, th, Ω,ω),

•P =(Pd,Pg,Pt), with Pd∪Pg∪Pt= { , , ..., }, x ∈, a finite set of unique predicate

symbols with arity uq∈ , q∈[1,x], the disjoint sets Pd,Pg,Pt are called data-driven, goal-

driven, and dependent predicates, respectively.

•Fr= { , , ..., }, x ∈ is a finite set of relational function symbols with arity

uq∈, q ∈[1,x], i.e., each ∈ Fr denotes a function .

•Fb= { , , ..., }, x ∈ is a finite set of boolean function symbols with arity uq∈,

q∈[1,x], i.e., each ∈ Fb denotes a function {True,False}.

The boolean function symbol ’∈2’∈ Fb is predefined.

•I= {i1, i2, ..., ix}, x ∈, is a finite set of implications, each implication i ∈I is a tupel

i = (ι,V, K), with

•ι, an unique implication identifier,

• V={v1, v2, ..., vx}, x∈ , a set of parameter names,

• K={k1,k2,...,kx},x∈ , is a finite set of constraints over V, where each k∈K has the

form k=(w, fu,<w1,w2,...,wu>), with w1,..,wu∈V, (w ∈V∧ fu∈Fr)∨ (w=ε ∧ fu∈Fb).

•E={e1,e2,...,en}, n ∈+ is a finite set of arcs, where each e∈E is a tupel

e= (χ, l, s, d, Α), with

•χan unique arc identifier,

• l:(p,(ι,V,K)) ∈ (P ×I), a location,

• s ∈ {‘ ‘, ¬}, a sign,

• d ∈ {premise, premise_quantified, conclusion}, a type; ‘premise’ and

‘premise_quantified’ means that the arc is in the premise of the connected implication,

‘premise_quantified’ denotes an arc with a variable that has been quantified with an

IQ, ‘conclusion’ denotes an arc in the conclusion of the corresponding implication.

•Α=α or Α=<α1,α2, ..., αkq> an actualization vector, with α,αu∈V, for 1≤u≤kq.

•cf: I → (0, 1] and th: I → [0, 1) are functions that associate integer values between 0 and 1

to implications. cf is called the confidence function while the th is called the threshold

function.

•Ω: Pd→FUN and ω: Pg→FUN, are two functions that associate analysis operations to

❑

data- and goal-driven predicates.

uq:

→

uq:

uq→

SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 67

We define the following context-sensitive constraints on GFRN signatures in order to ensure

their executability and simplify the formulation of the inference and translation algorithms in

the following sections. We will denote a GFRN that complies to the following constraints as a

well-formed GFRN.

Definition 4.3 Context sensitive syntax

A GFRN ((Pd,Pg,Pt),Fr,Fb,I,E,cf,th,Ω,ω) is called well-formed if it admits to the following

syntactic constraints:

•predicates are not isolated, i.e.,

∀p∈Pd∪Pg∪Pt∃(χ,(p,i),s,d,A)∈E) ∈E

•implications have at least one predicate in their premise and exactly one predicate in their

conclusion, i.e.,

∀i∈I∃(χ,(p,i),s,d,A)∈E∃!(χ,(p,i),s,conclusion,A)∈E d∈{’premise’,’premise_quantified’}

•data- or goal-driven predicates do not occur in the conclusion of any implication, i.e.,

¬∃(χ,(p,i),s,conclusion,A)∈E p∈Pd∪Pg

•all variables of all implications are actualized, i.e.,

∀i:(ι,V,K) ∈I∀v∈V∃(χ,(p,i),s,d,A)∈E v∈A

•IQs can only be used for single variable names, i.e.,

∀(χ,l,s,premise_quantified,<α1, ..., αkq>)∈E kq=1

•for each implication, there is at most one variable which is bound by an IQ, i.e.,

❑

∀i∈I(∃(χ,(p,i),s,premise_quantified,a),(χ,(p,i),s,premise_quantified,a)∈E→χ=χ)

Example 4.1 Syntax of a GFRN

Figure 4.14 shows an example GFRN that consists of five implications and six predicates,

including two data-driven and one goal-driven predicates.

According to Definition 4.2, the signature of the depicted GFRN is defined by a tuple

G:(P,Fr,Fb,I,E,cf,th,Ω,ω), with

•predicate symbols P=(Pd,Pg,Pt), with

• data-driven predicates Pd={selectDist1, ANameIsRSName+ID1},

• goal-driven predicates Pg={validKey1},

• dependent predicates Pt={IND2, I-IND2, key1},

•relational function symbols Fr={Π11,Π21, set1},

•boolean function symbols Fb={∈2,⊆2},

•implications I={ (ι1,{s,k}, {(ε,⊆2,<k,s>)}), (ι2,{k,a}, {(k,set1,<a>)}), (ι3,{t},{}), (ι4,{t},{}),

(ι5,{i,v,k1,k2}, {(k1,Π11,<i>), (k2,Π21,<i>)}) },

•edges E={ (e1, (selectDist1,ι1), ’’, premise, <s>), (e2, (key1,ι1), ¬, conclusion, <k>),

(e3, (ANameIsRSName+ID1,ι2), ’’, premise,<a>), (e4, (key1,ι2), ’’, conclusion,<k>),

(e5, (key1,ι5), ’’, premise, <k1>), (e6, (key1,ι5), ’’, premise, <k2>),

68 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

(e7, (IND2,ι5), ’’, premise, <i,v>), (e8, (I-IND2,ι5), ’’, conclusion, <i,v>),

(e9, (key1,ι3), ¬, conclusion,<t>), (e10, (valid_key1,ι3), ¬, premise,<t>),

(e11, (key1,ι4), ’’, conclusion,<t>), (e12, (valid_key1,ι4), ’’, premise,<t>),

•confidence function cf(ι1)=0.3, cf(ι2)=0.8, cf(ι3)=1, cf(ι4)=0.8, cf(ι5)=0.6,

•threshold function th(ι1)=0, th(ι2)=0.2, th(ι3)=0.2, th(ι4)=0.2, th(ι5)=0,

•analysis operations Ω(ANameIsRSName+ID1), Ω(selectDist1), and ω(validKey1).

❑

4.2.3.2 Declarative semantics

In Section 4.3, we will use algorithmic notation to define an inference and execution

mechanism for GFRN specifications based on a fuzzy Petri net model. However, the level of

abstraction of this operational definition of the GFRN semantics is too low to facilitate sound

understanding of the meaning of GFRN specifications. Therefore, this section contains a

declarative definition of the GFRN semantics based on a canonical translation of GFRN

signatures to closed formulae in possibilistic logic. Subsequently, we formalize the semantics

of integrating automatic analysis operations with data- and goal-driven predicates in the

framework of this translation.

Definition 4.4 Declarative semantics of GFRNs

The declarative semantics of a well-formed GFRN G:=(P,Fr,Fb,I,A,cf,th,Ω,ω) is formally

defined by a canonical translation of G to NPL1. The translation algorithm is given in

❑

Figure 4.15 and Figure 4.16.

ANameIsRSName+ID1

ι2:0.8/0.2

⊆(k,s)

ι1:0.3/0 k

skey1I-IND2

Figure 4.14. GFRN to illustrate the formalization

validKey1

ι3:1/0.2

ι5: 0.6/0

k1=Π1(i)

k2=Π2(i) i,v

ι4:0.8/0.2

k=set(a)

selectDist1

IND2

i,v

SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 69

algorithm

explanation

Algorithm GFRN2NPL1 in Figure 4.15 takes the signature of a well-formed GFRN G as input

parameter and produces a closed formula

in NPL1 as an output parameter.

is initialized by

the tautology. For each implication i in G,GFRN2NPL1 calls the algorithm Impl2NPL1 in

Figure 4.16, that creates a closed formula in NPL1 representing the semantics of i. The

semantics of the entire GFRN is defined as the logical conjunction of the translation of all its

implications.

Algorithm Impl2NPL1 uses five auxiliary variables (

©¬«® ©°¯

) of type String to create the desired

NPL1 formula (

). Strings (and formulae) are concatenated by using the assignment operation

let, e.g., line 20 in Figure 4.16. Characters enclosed by quotes („“) are taken literally, while

strings which are not enclosed by quotes have to be variables. Variables (like

©°«

and v in

line 20) are evaluated and their current value is taken for the assignment operation.

If there exists an IQ in the premise of the current implication, the statement in line 20 creates a

universal quantifier for the corresponding parameter tuple in variable

©¬±

. Likewise, the first

loop uses

©¬«

to store "outer" universal quantifications for all remaining variables of i. The

second loop creates a string (

©³²

) that represents a logical conjunction of all constraints of i. The

last loop (lines 39-41) creates a string (

©°´

) that represents a logical conjunction of all predicates

in the antecedent of i, while the assignment in line 45 creates a string (

©¬¯

) that represents the

predicate in the consequent of i. Finally, the assignment operation in line 47 creates the

resulting formula in NPL1 that represents the semantics of i. We assign the identifier of the

translated implication (ι) as an index to the implication operator (→ι) to facilitate identification

of the original GFRN implication. However, there is no additional semantics to this index.

algorithm GFRN2NPL1

1) input G:(P, Fr, Fb, I, E, cf, th, Ω,ω)∈

{GFRN}

2) output

∈

{NPL1}

3) local variables

∈

{NPL1},i∈I

4) begin

5) let

= „ “

6) for each i∈Ido

7) let

„∧“ Impl2NPL1(G, i)

8) od

9) return

10) end

Figure 4.15. Translation algorithm GFRN2NPL1

70 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

Example 4.2 Translation of GFRN to NPL1

In this example, we revisit our sample GFRN in Figure 4.14 on page 68 to illustrate our

translation algorithm in Figure 4.15 and Figure 4.16. The signature of this GFRN is presented

in the previous example (4.1). It contains five implications, hence Impl2NPL1 is invoked five

times. In the following, we select implication ι5to discuss one of these invocations in detail.

algorithm Impl2NPL1(G, i)

11) input G:=(P, Fr, Fb, I, E, cf, th, Ω,ω)∈L{GFRN}, i:(ι,V, K) ∈I

12) output

∈

{NPL1}

13) local variables

ª ¶ · ª ¸ · ª ¹ · ª º · ª »

∈String; e ∈E; v,vi∈V; Vc⊆V

14) begin

15) let

ª ¶

ª ¸

ª »

= „“

16) let

ª ¹

ª º

=„ “

17)

18) // create „inner“ univ. quantifier (IQ)

19) if ∃(χ,l, s, premise_quantified, vi)∈E

20) then let

ª ¸

„∀“ vi

21) let V=V\vi

22) fi

23)

24) // create „outer“ univ. quantifiers for all remaining variables

25) for each v ∈Vdo

26) let

ª ¶

„∀“ v

27) od

28)

29) // create constraints

30) for each (w,fu,<w1,..,wu>)∈Kdo

31) if w=εthen

32) let

ª ¹

„∧“fu„(“w1,..,wu„)“

33) else

34) let

ª ¹

„∧“w„=“ fu„(“w1,..,wu„)“

35) fi

36) od

37)

38) // create predicates in premise

39) for each ((pm,i),s,t,A)∈E with t=’premise’ or t=’premise_quantified’ do

40) let

ª º

„∧“s pm „(“ A „)“

41) od

42)

43) // create predicate in conclusion

44) let ((pm,i),s,conclusion, A)∈E

45) let

ª »

=s pm „(“ A „)“

46)

47) let

= „

“

ª ¶

„

“

ª ¸

„

“

ª ¹

„→“

ª º

„∧N

“

ª º

„

≥“

¾ ¿ ¼ À ½

„

→ι“

ª »

„

,“ cf(i) „

“

48) return

49) end

Figure 4.16. Translation algorithm Impl2NPL1

SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 71

At first, the five auxiliary variables are initialized (

©« © ¯

). Variable

©±

remains empty because

there is no IQ in the premise of implication ι5. The first loop creates quantifiers for all

parameters of ι5, i.e.,

©«

= „∀i∀v∀k1∀k2“. The second loop considers the constraints of ι5, i.e.,

©²

= „ ∧ k1=Π1(i) ∧ k2=Π2(i)“. In lines 39-45, variables

©´

and

©¯

are defined to represent the

predicates in the premise and the conclusion of ι5, respectively. After this section,

©´

has the

value „ ∧ key1(k1)∧ key1(k2)∧ IND2(i,v)“ and the value of

©¯

is „I-IND2(i,v)“. Finally, the

translation of ι5 to NPL1 is created in line 47 as Impl2NPL1(ι5)=(f,0.6) with

f=∀i∀v∀k1∀k2(∧k1=Π1(i)∧k2=Π2(i)→ ∧key1(k1)∧key1(k2)∧

IND2(i,v)∧N

∧key1(k1)∧key1(k2)∧IND2(i,v)

≥0

→5I-IND2(i,v).

The resulting formula can be simplified by pruning unnecessary brackets, conjunctions with

tautologies, and preconditions due to thresholds which are equal zero (see below). The four

other implications are translated likewise and the semantics of this sample GFRN is defined by

the following formula in NPL1.

GFRN2NPL1(G)=

Impl2NPL1(ι1)=(∀s∀k((k⊆s→selectDist1(s)) →1¬key1(k)),0.3)

∧Impl2NPL1(ι2)=(∀t(ANameIsRSName+ID1(t)∧N

ANameIsRSName+ID1(t)

≥0.2

→2key1(t)),0.8)

∧Impl2NPL1(ι3)=(∀t(¬validKey1(t)∧N

¬validKey1(t)

≥0.2 →3¬key1(t)),1)

∧Impl2NPL1(ι4)=(∀t(validKey1(t)∧N

validKey1(t)

≥0.2→4key1(t)),0.8)

∧Impl2NPL1(ι5)=(∀i∀v∀k1∀k2((k1=Π1(i)∧k2=Π2(i))→

key1(k1)∧key1(k2)∧IND2(i,v)

❑

→5I-IND2(i,v)),0.6)

semantics of

analysis operations

In the rest of this section, we formalize the semantics of automatic data- and goal analysis

operations which have been attached to GFRN predicates. In Section 4.2.2, we have

exemplified that automatic analysis operations deliver situation-specific facts about the LDB

that are associated with degrees of necessity. The facts delivered by automatic analysis

operations which have been bound to GFRN predicates represent applications of these

predicates. Hence, we denote that these facts are in the extent of the corresponding predicates.

Definition 4.5 Extent of a predicate

For a given universe

the extent of a possibilistic predicate p, denoted as 〈p〉

, is defined by

❑

the set of propositions 〈p〉

={(p(u),x)|u∈

,x∈[0,1]}⊂

{NPL0}.

The concept of data- and goal-driven analysis functions is formalized as follows.

Definition 4.6 Data-driven analysis operation

For a given data-driven predicate p∈Pdthe associated data-driven analysis operation Ω(p) is

❑

defined by a function Ω(p):RDB →

(〈p〉

Definition 4.7 Goal-driven analysis operation

For a given goal-driven predicate p∈Pgthe associated goal-driven analysis operation ω(p) is

❑

defined by a function ω(p):RDB × 〈p〉

→〈p〉

×〈¬p〉

72 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

The LDB which is under investigation defines a finite universe of discourse which we will call

the application context of the LDB.

Definition 4.8 Application context

The application context

(B) of a given LDB B:(M,S,δ,C,D)∈RDB is defined by the finite

❑

power set of all software artifacts of B, i.e.,

(B)=

(flatten((M,S,δ,C,D)).a

a For the definition of function flatten see Definition 3.5 on page 38.

In the following, we make use of the fact that a set of formulae in L1 which is applied in a finite

domain can be represented by an equivalent set of formulae in L0 [BC90, pp. 35ff]. This is

done by expressing each universal quantifier by a conjunction and each existential quantifier by

a disjunction of propositions.

Definition 4.9 Expansion of formulae over a finite universe

Let Φ⊂

{NPL1} be a set of closed formulae where all variables are bound with the universal

quantifier. For a finite universe

, let Φ⇓

⊂

{NPL0} denote the expansion of Φ over

which

represents an equivalent set of formulae where all quantifiers have been eliminated by using

conjunctions, i.e.,

❑

Φ⇓

={(gi,β)| (∃g1,..,gn∈

{NPL0})(f ≡g1∧,..,∧gn in

Ã Â

∧i∈[1,n])}.

Definition 4.10 Occurrence of literals

Let f∈

{L0} be a propositional formula and let l∈

{L0} be a literal. We denote occ(f,l) iff l

❑

occurs in f as a positive literal and we denote occ¬(f,l) iff l occurs in f as a negative literal.

Now, we have the prerequisites to formalize the semantics of automatic analysis operations in

GFRN specifications.

Definition 4.11 Semantics of automatic analysis operations

The semantics of a GFRN specification is defined by the algorithm OperateGFRN which is

presented in Figure 4.17. OperateGFRN takes a GFRN and an RDB as its arguments and

❑

returns a consistent set of definite propositions about the RDB.

algorithm

explanation Algorithm OperateGFRN uses a local variable (exec) that is a two dimensional array of

boolean values which are initialized to FALSE. This array maintains information about which

goal-driven analysis operations have already been applied. In line 5, algorithm GFRN2NPL1 is

called to translate the passed GFRN to a set of formulae Φ in NPL1. Then all data-driven

analysis operations are executed on the RDB B and the resulting propositions are added to Φ

(lines 7-9). The condition in lines 13-15 checks for the existence of an implication rule

(f1→if2,β) in the expansion Φ of Φ over the universe

(B) that represents the translation of an

implication i in the GFRN. Furthermore, the condition requires that an instance of a goal-

driven predicate p(u) occurs in the premise (f1) of this rule and that its conclusion (f2) can be

deduced from Φ with a necessity higher than the threshold of i. If this condition is fulfilled and

fβ,( ) Φ∈

∪

SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 73

the goal-driven analysis operation for p(u) has not yet been executed (exec(p,u)=FALSE) the

corresponding operation is invoked and the results of the operation is added to Φ (line 19).

Subsequently, the value of exec(p,u) is set to TRUE to avoid that the same goal-driven analysis

operation is executed twice. Lines 23-26 consider hypotheses and definite facts entered by the

reengineer.

The loop from line 11 to 29 is iterated until a definite analysis result is obtained. This condition

is reached when each instance of a dependent predicate Φp(u) with a positi ve necessity

degree is either necessarily true or false with a necessity degree of 1. Consequently, we take a

necessity degree of 1 as a modal operator that overrules partial inconsistency, i.e., if N(p(u))=1

we ignore N(¬p(u))<1 and vice-versa. In the following, we will denote this mechanism of

overruling as grounding. Still, we have to exclude the case of complete inconsistency, i.e.,

N(p(u))=N(¬p(u))=1. Grounding might occur due to the result of goal-driven analysis

operations (e.g., the falsification of hypotheses with the available data) and by definite

algorithm OperateGFRN(G, B)

1) input G:=((Pd,Pg,Pt),Fr,Fb,I,E,cf,th,Ω,ω)∈

{GFRN}, B∈RDB

2) output

⊂

{L0}

3) local variables Φ ⊂

{NPL1},exec[p∈Pg,u∈

(B)]:BOOLEAN=FALSE

4) begin

5) let Φ = GFRN2NPL1(G)

6) // execute data-driven analysis operations

7) for each p∈Pddo

8) let Φ =Φ ∪Ω(p)(B)

9) end

10)

11) loop

12) let Φ=Φ⇓

(B)

13) if (∃(f1→i f2,β)∈Φ) (∃p∈Pg) (∃u∈

(B)) (∃γ∈[th(i),1])

14) (occ(f1,p(u)))∧Φ (f2,γ)) // p(u) in the antecedent of an implication that

implies a credible hypotheses */

15) then

16) if exec[p,u]=FALSE

17) then

18) // execute goal-driven analysis operations

19) let Φ =Φ ∪ω(p)(B,p(u))

20) let exec[p,u]=TRUE

21) fi

22) fi

23) if exists user input ϕ⊂

{NPL0}

24) then

25) let Φ =Φ ∪ϕ

26) fi

27) until a definite analysis results is obtained, i.e.,

28) ¬(∃p∈Pt)(∃u∈

(B))

29) ((Φ(p(u),γ)∧γ∈(0,1)∧Φ (¬p(u),γ)∧γ≠1)∨(Φ(p(u),γ)∧γ=1∧Φ (¬p(u),γ)∧γ=1)))

30) return {

| (

,1)∈Φ ∧

∈

{L0}}

31) end

Figure 4.17. Algorithm OperateGFRN

⊥

74 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

knowledge entered by the reengineer. This non-monotonic inference process will be discussed

in more detail in Section 4.3.

Example 4.3 Semantics of automatic analysis operations

In this example, we illustrate the formal semantics of automatic analysis operations in GFRN

specifications by applying the GFRN in Figure 4.14 to a small excerpt of our case study which

is given in Figure 4.18. Let this excerpt be formalized as an RDB B:(M,(R, ∆),δ,C,D)∈RDB. We

define the following automatic analysis operations for the data- and goal-driven predicates in our

sample GFRN:

•Ω(ANameIsRSName+ID1)(B)=

where Levensh(s1,s2) denotes the Levenshstein-distance [Lev66] of two strings s1 and s2.

•

•=

In the first phase of algorithm OperateGFRN,G is translated to Φ⊂

{NPL1}. The results of this

translation is given in the previous Example 4.2. Subsequently, the data-driven analysis

operations Ω(ANameIsRSName+ID1)(B) and Ω(selectDist1)(B) are executed. For each attribute

in RS USER, Ω(ANameIsRSName+ID1)(B) produces a fact with the necessity degrees shown in

Figure 4.19. Furthermore, Ω(selectDist1)(B) detects an instance of a select-distinct pattern over

attributes sname and dpt, which results in the fact (selectDist1({sname,dpt}),1).

In the first iteration of the inference loop from line 11-29, Φ is expanded to Φ⊂

{NPL0} with

respect to the application context

(B). In the following, we list the subset Φsof formulae in Φ

which are relevant for this example. Note, that due to the threshold of implication ι2

(ANameIsRSName+ID1(userid),0.8) is the only relevant fact in the analysis result of operation

application Ω(ANameIsRSName+ID1)(B).

Φs={ (selectDist1({sname,dpt}),1), (EQ 33)

(ANameIsRSName+ID1(usrid),0.8), (EQ 34)

({sname}⊆{sname,dpt} →selectDist1({sname,dpt})) →1¬key1({sname})),0.3), (EQ 35)

({sname,dpt}⊆{sname,dpt} →selectDist1({sname,dpt})) →1¬key1({sname,dpt})),0.3), (EQ 36)

(ANameIsRSName+ID1(usrid)∧N

ANameIsRSName+ID1(usrid)

≥0.2

→2key1(usrid)),0.8), (EQ 37)

(¬validKey1(usrid)∧N

¬validKey1(usrid)

≥0.2

→3¬key1(usrid)),1), (EQ 38)

(¬validKey1(sname)∧N

¬validKey1(sname)

≥0.2

→3¬key1(sname)),1) (EQ 39)

}

ANameIsRSName+ID1x( ) 12

---– (Levensh name x( ) name r( )+id,( ))atan,

 

 

r R∈

x X r( )∈

∪

ΩselectDist1

( ) B( ) selectDist1aj

{ }

j1n,[ ]∈∪

( ) 1,

 

  Ccontains select-distinct pattern

over attributes a1...an

î 

 

 

∪

ωvalidKey1

( ) B validKey1x( ),( )

v¬alidKey1x( ) 1,( ) if t1t2

,( )∃ δ RS x( )( )∈( ) t1t2Πxt1

( )=Πxt2

( )∧≠( )

validKey1x( ) 2

--- δRS x( )( )

100

--------------------------

 

 

atan,

 

  else.

î









SPECIFICATION OF DATABASE REENGINEERING KNOWLEDGE 75

Ê Ë Ì Í Î Ì Î Í Ï Ð Ì Ñ Ò Ó Ô Õ

Ö × Ë Ø Ù Ú Û Ü Ý Ô Õ Þ ß à á

â Í ã Ì Ú Û Ü Ý Ô Õ ä ß à á

Ù å æ Ú Û Ü Ý Ô Õ Þ ç à á

×â Í ã Ì Ú Û Ü Ý Ô Õ Þ ç à á

Í Ù Ù Ë Ú Û Ü Ý Ô Õ è ß à á

æ Ì Ð é Ú Û Ü Ý Ô Õ Þ ç à á

æ Ì Ð å Ú Û Ü Ý Ô Õ Þ ç à à

Figure 4.18. Excerpt of case study

ê ë ì í î ï ð ñ ò î ó ô ë ï ð ñ ò ð î î ì ô ò õ ö ô ò õ ó

÷ ø

é ù â ú Ì ×æ û Ô ü Ï Ì ×ß Þ û ý Í Ï

è ß þ ß ÿ ß



Ñ ý ý

Þ ß û Í â



Ë Ì Ù Ò Ê ù ã Ø æ

 

Û ü ×Ê ù ß ÿ

  

Ê Ì



è ä ß

ß ä ç



ç Ü Ì Ø â Ë Ø Ê ù û ÖÐ Ð Ì Ë Û Ô ü ã ÖÐ ß ç Û ù Ì ã ú Ý

ä ß ç

÷ ÷

Þ ä ÿ



Þ ç

USER

schema catalog



ö î ò ë ò



ñ ò ï ô

 





Ó Û Ò



ý Ò Ó ý Ó Û





  

     

Ú å Ì Ë ×



Ë Ì Ê

ß ç





û Ñ Ò Ó Ô



Ü Ó Ô Ó ×â Í ã Ì



Ú Ò



Í â Ù Ù å æ



Ú ü Ó











Ó Û



PDIS

precedural code

available data

0.2

0.4

0.6

0.8

usrid

✧

dpt

Figure 4.19. Necessity degrees for the facts produced by Ω(ANameIsRSName+ID1)(B).

N ANameIsRSName+ID1x( )( )

name

✧

sname

✧

addr

✧

telo

✧

telp

✧

76 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

According to the formal system for NPL1 defined in Definition 3.20 on page 51, we can deduce

(GMP) EQ34,EQ37 (key1(usrid),0.8). Hence, the condition in line 13-14 is satisfied by

(f1⇒if2,β):=EQ38, because p(u):=validKey1(usrid) occurs in f1 and f2 is deducible from Φ with a

necessity of 0.8 which is greater than th(i)=0.2. Consequently, the query in the body of the

conditional statement the goal-driven analysis operation is

executed. The sample data in Figure 4.18 contains no counter-example for the hypothesis that

usrid might be a key. Still, the function plot in Figure 4.20 shows that according to the small

size of our sample data set we get also only little support for this hypothesis, i.e.,

N(validKey1(usrid))≅0.

Let us now assume that the reengineer manually validates this automatically inferred

hypothesis. As a result of this validation (s)he acknowledges the hypothesis by a definite

proposition (key1(usrid),1). Consequently, at the end of the next iteration the inference loop

terminates because we have obtained a definite result according to the criterion specified in lines

❑

28 and 29.

The above example closes the formalization of the syntax and declarative semantics of GFRN

specifications. In the next section, we will develop a non-monotonic inference engine that

implements the described concepts and allows for efficient execution of GFRN specifications

in CARE environments.

4.3 Knowledge inference with GFRN specifications

In the previous sections, we defined GFRNs as a dedicated language to specify and customize

DBRE knowledge and processes. As described in Section 4.1, we aim to execute such

specifications in semi-automatic schema analysis processes. A prerequisite for this execution is

an inference engine that combines domain-specific GFRN specifications with situation-specific

data about the LDB under investigation. Obviously, a suitable inference engine has to meet

requirements R2 and R4 defined in Section 3.1, i.e., it has to allow for non-monotonic

⊥

ωvalidKey1

( ) B validKey1usrid( ),( )

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0 200 400 600 800 1000

|δ(RS(x)|

Figure 4.20. Necessity degrees for the facts produced by ω(validKey1)(B,validKey1(x))

in case of no counter-example.

N validKey1x( )( )

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 77

reasoning over inconsistent knowledge. In addition, efficiency is a crucial requirement for the

practical usability of the GFRN approach (cf. requirement R6 on page 37).

forwards and

backwards

reasoning

In the AI literature, reasoning problems are often characterized by search problems, i.e., by the

problem to select a method that efficiently finds a solution in the search space of all possible

options [BB94, pp. 40ff]. Generally, search methods can be classified as either forward- or

backward-oriented. Forward-oriented search methods start with initial data and successively

apply reasoning operators until a certain goal is reached, while backward-oriented methods

start with a predefined goal and try to find suitable data that allows to reach this goal. In our

application domain, we aim to enable an incremental and explorative DBRE process that

considers automatically retrieved indicators (initial data) as well as human assumptions on

different levels of abstraction (goals). Hence, we have to aim for a hybrid approach that allows

for forwards as well as backwards reasoning.

incremental

reasoning

Another problem arises with the evolutionary character of the proposed schema analysis

process (cf. Section 4.1). This process consists of iterative steps involving human interaction

and automatic knowledge inference until a consistent and complete result is obtained. New

knowledge is added in each of these iterations. However, this additional knowledge generally

affects only a part of the results of the previous inference step. Consequently, we should avoid

to recompute every inference result at each iteration. In contrast, we should aim for an

incremental reasoning mechanism that uses inference results computed in previous iterations

as far as they are not affected by the newly added knowledge.

In the following, we propose an inference engine that meets the above requirements. This

inference engine is based on an operational knowledge representation in terms of a fuzzy Petri

net (FPN) [FS97]. During the inference process, domain-specific knowledge in form of a

GFRN and situation-specific knowledge about the LDB under investigation are compiled to an

FPN that subsequently can be evaluated efficiently. This compilation process, which we will

call expansion from now on, is performed incrementally, i.e., the FPN that has been expanded

in a given iteration step is preserved and incrementally updated in subsequent iterations.

This section is devided in three parts. First, Section 4.3.1 introduces the used FPN model and

reasons about the stability of the proposed non-monotonic belief revision process. Based on

these results, we introduce and formally define the entire inference process in Section 4.3.2.

Finally, Section 4.3.2.3 discusses the complexity and scalability of our approach.

4.3.1 A fuzzy Petri net model for non-monotonic reasoning

Traditionally, Petri nets (PNs) have been applied to formalize properties of dynamic systems

[Pet81]. A rich theory of PNs has been developed since their invention in 1962 by Petri. Many

different PN models have been proposed for a great variety of applications. Recently, PNs have

been discovered for knowledge representation in rule-based expert systems [FS98]. They

combine the advantage of a graphical representation of a rule base with a formal definition of

its execution. Analogously to fuzzy rule-based systems, fuzzy Petri nets (FPN) have been

proposed for applications that deal with imperfect knowledge. A good overview has been

presented by Cardoso et al. [CVD96]. In this section, we define an FPN that is an extension of

the model described by Konar and Mandal [KM96] which itself is based on Looney’s approach

[Loo88].

78 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

Like any PN an FPN is a directed, bipartite graph with active and passive elements. The active

elements are usually called transitions while the passive elements are called places. In our

FPN, places correspond to propositions and transitions represent implication rules. Each place

carries a so-called fuzzy belief marking (FBM) which is represented by a real number between

0 and 1 in the original model of Konar and Mandal [KM96]. The actual extension of our model

is that we use two real numbers to represent FBMs, one representing (a lower bound for) the

necessity that the associated proposition is fulfilled, while the second represents the necessity

against its fulfillment. Similar to GFRNs, we use signed arcs to determine whether the positive

or the negative belief is propagated. This facilitates the representation of inconsistent

knowledge. However, our model can easily be mapped to the original model of Konar and

Mandal by using unsigned arcs and allocating two places per proposition (a positive and a

negative one). Hence, we are able to transfer the theoretic results established for the original

model to our extension. We will make use of this property when we analyze the belief revision

model with respect to its stability in case of a cyclic FPN. The signature of the FPN model is

defined in Definition 4.12.

Definition 4.12 Fuzzy Petri net

Afuzzy Petri net (FPN) is a tuple FPN:=(S, T, F; D, b, v, c, t, m) where

•S is a finite set of elements called places,

•T is a finite set of elements called transitions disjoint from S, (S∩T=∅),

•F⊆ (S × T)∪(T × S) is a flow relation,

•D is a finite set of propositions,

•b:S→D is a bijective function that maps places to propositions,

•v:F→{‘’, ¬} is a signing function,

•cf: T→ (0, 1] and th: T→ [0, 1) are functions that associate integer values between 0 and 1

to transitions; cf is called the confidence function while the th is called the threshold

function,

•m:S→ [0,1]×[0,1] is called the marking function that assigns a pair of real values to each

❑

place.

For notational convenience, we use the auxiliary marking function m:S×{‘’, ¬}→ [0,1] defined

belief revision The process of propagating FBMs in a cyclic FPN is called belief revision [KM96]. It is

performed in a number of subsequent belief revision steps (BRS). In the following, we will

describe different markings of an FPN in different BRS’ by adding the number of the BRS as

an index to the marking function m, i.e., mx+1 describes the marking of an FPN (S, T, F; D, b, v,

c, t, mx)after performing one further BRS.

Each BRS consists of two subsequent phases illustrated in Figure 4.21 and Figure 4.22. In the

first step, the output value of each transition in the FPN is computed. This output value is called

fuzzy truth token (FTT) and is defined by the equations EQ40 and EQ41 below. At first, the

minimum function is applied to the set of all incoming belief values depending on the signs of

m s x,( ) a for x=‘‘

b else

î

with m s( ) a b,( )= =

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 79

the corresponding edges. Then, the resulting value Ix(tq) is compared to the threshold of the

corresponding transition tq. If the threshold is lower or equal to this intermediate result, the

transition is said to be enabled. In this case, the transition fires with an FTT that is the

minimum of the intermediate result Ix(tq) and its confidence value cf(tq). Otherwise, the new

FTT is equal to zero.

(EQ 40)

(EQ 41)

In the second phase of each BRS, the incoming FTTs are combined to compute the new FBMs

at each place. This is done according to EQ42 and Figure 4.22 by applying the maximum

function over all incoming FTTs. Again, the signs of the arcs have to be respected.

(EQ 42)

It is important to note that a major difference of the introduced FPN model compared to

classical PN models is that tokens are not removed from the input places of an enabled

transitions that fires. On the contrary, input tokens are only copied and remain at their original

Ixtq

( ) Min mxs v s tq

,( ),( ) s S∈( ) s tq

,( ) F∈∃{ }( )=

FTTxtq

( ) Min cf tq

( ) Ixtq

( ),( ) if Ixtq

( ) th tq

( ))≥

0else

î



tqFTTx+1(tq)

v(sn,tq)

mx(s1)

mx(sn)

Figure 4.21. Belief revision phase 1: computation of fuzzy truth tokens

v(s1,tq)

mx1+ s w,( ) Max FTTx1+ t( ) (t T)∈(t s,( ) F v t s,( )=w)∧∈∃{ }( ) if (t T)(∈t s,( ) F v t s,( )=w)∧∈∃

mxs w,( ) else

î







v(t1,si)

v(tu,si)si

FTTx+1(t1)

FTTx+1(tu)

mt(si)

Figure 4.22. Belief revision phase 2: Computation of FBMs

80 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

places. This procedure is necessary for logical inference since the truth of a proposition may

imply the truth of several other conditions. Because of this speciality, well-known structural

conflicts like deadlocks and traps [Pet81] cannot occur in our model. The characteristic of

copying tokens entails another interesting property, namely the fact that belief revision can be

performed for all places simultaneously.

termination

and stability

of belief revision

The belief revision process terminates when there is no change in the marking of the FPN in

two subsequent BRS’. In this case, we say that the FPN has reached its equilibrium state. Still,

an FPN might contain places on cycles that comprise periodic oscillation (PO) of FBMs. If

such POs sustain for an infinite number of BRS’ they prevent the FPN from reaching its

equilibrium state. Such oscillating cycles are called limitcycles (LC) by Konar and Mandal

[KM96]. An FPN with LCs is said to be unstable. This notion of stability is formalized in

Definition 4.13. Furthermore, Theorem 4.1 is a result that has been established by Konar and

Mandal.

Definition 4.13 Stability

An FPN N:(S, T, F; D, b, v, c, t, m0) is said to be stable iff its marking remains unchanged after

a finite number of BRS (∃x∈) (∀s∈S)(mx(s)=mx+1(s)). In this case, it is said that N has

❑

equilibrium state in BRS x and Min{0,..,x} is called the equilibrium time.

Theorem 4.1 Equilibrium time

The number of transitions represents an upper bound for the equilibrium time of a stable FPN.

❑

(The proof of this theorem is given in [KM96].)

From Theorem 4.1 follows that after a maximum number of BRS that is equal to the number of

transitions it can be decided whether an FPN is stable. Konar and Mandal present an algorithm

that removes an LC from an unstable FPN by permanently inhibiting a selected transition on

the LC from firing. This transition is selected in such a way that the inference result of the FPN

is least affected by the modification. However, eliminating LCs might induce new LCs in

neighborhood cycles. Hence, this procedure has to be performed iteratively, in general.

The following Theorem 4.2 shows that LCs cannot occur if we start the belief revision process

with an initial marking that assigns non-zero FBMs only to places that do not have incoming

arcs. Such places are called axioms and the described marking is called an axiom-based

marking.

Definition 4.14 Predecessor

For a given place s∈S that is part of an FPN (S,T,F;D,b,v,c,t,m) the set of predecessors,

❑

denoted as pre(s), is given by pre(s)={z∈S|(∃t∈T)((t,s),(z,t)∈F)}.

Definition 4.15 Axiom

A place s∈S that is part of an FPN (S,T,F;D,b,v,c,t,m) is called axiom, denoted as axiom(s), iff

❑

it has no incoming arc, i.e., pre(s)=∅.

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 81

Definition 4.16 Axiom-based marking

An FPN (S,T,F;D,b,v,c,t,m) has an axiom-based marking, iff the following condition holds:

❑

(∀s∈S)(m(s)≠(0,0)⇒ axiom(s)).

Theorem 4.2 Stability of FPN with axiom-based markings

An FPN N:(S,T,F;D,b,v,c,t,m0) with an axiom-based marking is stable.

Proof: If N is not stable there has to be at least one places∈S with a fuzzy marking that exhibits

infinite periodic oscillation starting from a given BRS xlc, i.e.,

(∀x∈[xlc,∞))(mx(s)=mx+p(s)∧mx(s)≠mx+r(s)) (EQ 43)

with a period p∈[2,∞) and r∈[1,p-1]. In the following, we denote (a,b)≥(c,d) for two tuples

(a,b),(c,d)∈[0,1]×[0,1] iff a≥c and b≥d. Obviously, EQ43 contradicts to the following condition

(∀x∈[0,∞])(∀s∈S)(mx+1(s)≥mx(s)) (EQ 44)

which can easily be proved: EQ44 is trivially fulfilled for axioms. The initial marking for all

other (non-axiom) places s is set to m0(s)=(0,0). Hence,

(∀s∈S)(m1(s)≥m0(s)). (EQ 45)

From EQ40-EQ42 follows that for any non-axiom places s∈S in any BRS x holds

(∀z∈pre(s))(mx+1(z)≥mx(z))⇒mx+2(s)≥mx+1(s) (EQ 46)

which together with EQ45 proves that EQ44 is also fulfilled for all non-axiom places s∈S in all

❑

subsequent BRS.

Corollary 4.1

Each FPN (S,T,F;D,b,v,c,t,mx) is stable that can be obtained by subsequently performing x≥0

❑

BRS on an FPN (S,T,F;D,b,v,c,t,m0) with an axiom-based marking.

The above corollary directly follows from the inductive proof of Theorem 4.2. It grants the

stability of the FPN inference mechanism which will be employed in the next section.

4.3.2 The inference process

In this section, we develop an inference engine (IE) for GFRN specifications that allows for an

iterative and human-centered DBRE process. The proposed IE is based on the FPN model

which has previously been introduced. Again, this section is devided in two parts. In the first

part (Section 4.3.2.1), we informally outline our strategy. Subsequently, we give a detailed

formalization of the IE in Section 4.3.2.2.

4.3.2.1 Informal introduction

The control flow chart in Figure 4.23 shows the inference process that has been proposed in

Figure 4.1 on page 56 in more detail. We will start with a general description of each step in

82 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

this process. Subsequently, we will discuss each step with an example that deals with an

excerpt of our case study.

data-driven

analysis The entire inference process starts with the creation of a new FPN. Then, all data-driven

analysis operations in the GFRN are executed and axioms are added to the FPN to represent the

resulting initial knowledge about the LDB.

Figure 4.23. The proposed iterative and interactive inference process

Start

FPN extended?

Data-driven analysis

of LDB

Initialize

empty FPN

Expansion/completion

FPN according to

GFRN and facts

Evaluation of

FPN until

Goal-driven analysis

of LDB

results User dialog

input of new

hypotheses

and definite facts

consistent

User dialog

presentation

of resulting

physical schema

results

complete?

End

yes

and definite?

Grounding

of definite results

and pruning of FPN

equilibrium state

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 83

expansionIn the next step, the FPN is expanded according to the places in the FPN. This step is illustrated

in Figure 4.24 for a sample situation. It shows that an instance of a GFRN implication is

represented by a number of transitions with the same CV and TV in the FPN: the solid

transition represents the actual implication rule p2(u2)∧p3(u3)→p1(u1) and the other two

transitions represent its contraposition ¬p1(u1)→¬(p2(u2)∧p3(u3)) that has been normalized

with deMorgan’s law to ¬p1(u1)∧p2(u2)→¬p3(u3) and ¬p1(u1)∧p3(u3)→¬p2(u2). Analogously

to the GFRN formalism, we represent arcs with a negative sign by solid arrow heads. From

now on, we refer to the transition that represents the actual implication rule as the main

transition (MT) while we denote the other transitions as contraposition transitions (CTs). In

general, the number of created CTs is equal to the number of places in the antecedent of the

MT. In order to increase the readability of our FPN diagrams we use grey color for arcs that

belong to CTs.

An implication can only be expanded if all its variables can be bound such that its constraints K

are satisfied. If this precondition is fulfilled the implication can be expanded either in forward

or backward mode (cf. Figure 4.25).

•The implication is expanded forwards if all necessary propositions in the antecedent of the

MT to be created are present in the FPN, the MT would be enabled, and the MT would have

at least one positive outgoing arc. (We do not expand MTs with negative consequents in

forward mode because we are interested in inferring positive hypotheses. Such MTs are

expanded in backward mode only to refute positive hypotheses.)

•The implication is expanded backwards if there exists a proposition in the consequent of the

MT to be created that has a positive FBM that is greater or equal to the threshold of MT.

Note, that it is not required that all propositions in the antecedent and consequent of the

transitions are already present in the FPN. It is sufficient for the expansion if variable bindings

for missing propositions can be computed by applying the constraints K to the variables which

can be bound to actual parameters of present propositions. For example, consider implication

ι2 from Figure 4.14 on page 68: if the FPN contains a proposition that is suitable to bind

variable a, we can compute variable k by applying the constraint k=set(a).

goal-driven

analysis

If the FPN structure has been modified in the expansion activity, goal-driven analysis

operations are automatically executed for each newly created place that is an instance of a

Figure 4.24. Representation of an expanded GFRN implication (sample)

CV/TV

p2p3

v2v3

FPNGFRN

CV/TV

84 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

goal-driven predicate. The result of each operation is stored in the FBM of the corresponding

place. Furthermore, such places are converted to axioms, i.e., all incoming arcs are removed

from the FPN. This is necessary because indicators delivered by goal-driven analysis

operations are definite and may not be modified during the inference process.

evaluation In the next step, the FPN is evaluated using the belief revision process defined in Section 4.3.1.

The stability of the expanded FPN is guaranteed by Corollary 4.1, because we have created an

axiom-based marking.

grounding After performing the goal-driven analysis and evaluating the FPN there might be some definite

analysis results, i.e., facts that have a positive or negative necessity degree of 1. These facts are

converted to axioms in a subsequent activity that we call grounding.

automatic

expansion and

evaluation cycles

Now, the expansion of the existing FPN is resumed under consideration of the newly added

facts and the results of the evaluation. Again, goal-driven operations are executed on demand if

the FPN structure has been extended. Subsequently, the FBMs at all non-axiom places are reset

to zero, in order to create an axiom-based marking before the evaluation process is resumed.

These expansion/evaluation cycles are iterated automatically until the FPN structure remains

unmodified after an expansion step.

user dialog When the automatic expansion/evaluation cycles terminate, the inference engine checks

whether the produced analysis result is definite and consistent. If this is the case, the reengineer

has to decide if the resulting information is complete. Otherwise, the reengineer has to do some

further (manual) investigation of the LDB in order to support or refute intermediate analysis

results or add new knowledge. After this interaction step, the automatic inference process is

resumed. The entire semi-interactive process terminates when the analysis result is definite,

complete, and consistent.

Example 4.4 Inference process

We will now illustrate the described semi-automatic inference process with an example that

deals with an excerpt of our DBRE case study. Figure 4.26 shows that this excerpt consists of

the two RS USER and DOCUMENT, including some sample data. In this example, we aim to

detect foreign keys between these RS. We apply the GFRN presented in Figure 4.27 for this

purpose.

Figure 4.25. Forward and backward expansion (sample)

forward expansion backward expansion

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 85

first expansion/

evaluation cycle

The initial analysis of the LDB is performed by executing the data-driven analysis operations

that have been attached to predicates ANameIsRelName+ID1,NamSim2, and variant1. As a

result of this automatic analysis, four axioms are created in the FPN, which are represented as

doubled circles in Figure 4.28. At this, we use the following abbreviations for the actual

parameters of the displayed propositions:

                     ! "  ! 

# $ % & ' ( ) * +

) *

ß Þ û

./-

# 0

3 4 . .

Þ ß û

' 5 6 ) 7 8 9 & : ; + < = >

* 9 &

2 ?

5 @ 9 ) A 0 B

# B #

B CDCD

C E ) F ' 6 F 9 & G H I I ) 6 >

, J K

H I

C > & )

( L # B

C##

9 6 )

/ M

)

M / -

I ) N 8 O

, P

H * 6 F 7 Q > E L

, P

Þ ß

R S

/ K

) Q > E L

, P

R S

7 T

Q > E L

, P

R S

* '

/ K

) Q > E L

, P

R S

7 7 6 Q > E L

, P

R S

) I U Q > E L

, P

R S

) I T Q > E L

, P

R R

Figure 4.26. Information sources for inference example

schema catalog

PDIS

available data

9 6 )

/ M

)

M / -

I )

J ?

> N G O

3 V P

7 '

/ K

) Q > E L

, P 2

B B

R S

7 U 9 ' U Q W

3 V

O X O

, S

I F 7 Q > E L

, P

R S

& U 6 Q > E L

, P 2

B B

R S

H * 6 Q > E L

, P

R S

6 7 Q W

3 V

O X O

, R S

      " Z  " [  !     \ "      

6 E

I ) * 6 ) T ] C

D 0 C D

]

] ^ C _ 6 H ` ) 6

Þ ß Þ

* T ) 9 F @ 9 ] E a

? P

0 D

# ^ #

]

] ^ ^

F ) 6 ) C

b c

) 6

)

M /

I I

9 0

ß ß

0 #

]

] ^ D 8

) ) I #

9 U *

M / M K M

^ ^ C B C

]

] ^ C

&H'

Þ ß Þ

DOCUMENT

USER

abbreviation parameter

uu USER.usrid

un USER.name

du DOCUMENT.usr

dn DOCUMENT.dname

d{DOCUMENT.title,DOCUMENT.docno,DOCMU-

NET.valid,DOCUMENT.author,DOCUMENT.usr}

86 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

The axioms created in the FPN in Figure 4.28 show that the initial analysis has detected that

predicate ANameIsRelName+ID1 is valid to a degree of 0.8 for attribute uu. Moreover, it has

detected that there are two pairs of similarly named attributes in these RS, namely (du,uu) and

(dn,nn). An analysis of the available data shows only one variant of tuples that includes all

attributes of RS DOCUMENT. (We skip the variant of RS USER as it is not relevant for our

example.) In the first expansion step, the forward expansion rule can be applied once for

implication ι1 and twice for implication ι6. During this expansion step variable a of implication

ι1 is bound to parameter uu. Using the function set, which is defined as a simple set constructor,

the value of the second variable k is functionally determined by this binding. Note, that no CTs

are created in this first expansion step, because incoming arcs are forbidden for axioms. The

first automatic expansion/evaluation cycle finishes with the evaluation of the FBMs at the

expanded places according to EQ40-EQ42 on page 79.

ι3:0.5/0 ι7:1/0

equC2

ι4: 0.5/0.3

∈(t,i)

i,v

ι6:0.5/0.3

k=Π2(i)

key1

R-IND2

IND2

validIND2TypComp2

NamSim2

Figure 4.27. GFRN to exemplify the inference process

validKey1

ι2:1/0.2

disj(Π1(i),Π2(i))

variant1

⊆(Π1(i),v)

ι5:1/0.2

ANameIsRSName+ID1

i,v

ι1:0.8/0.2

k=set(a)

i,v

sameRS(Π2(i))

Figure 4.28. FPN after the first expansion/evaluation cycle

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 87

second expansion/

evaluation cycle

The FPN that results from the second expansion/evaluation cycle is presented in Figure 4.29. In

this cycle, the backward expansion rule can be applied once for implication ι2and twice for

implication ι7. Furthermore, the forward expansion rule is applied three times to implication ι4,

because of the IQ in its premise. The corresponding MTs are labeled t7,t9, and t11. After the

expansion, the goal-driven analysis operations that are attached to predicates validKey1 and

TypComp2 are executed for their newly added instantiations. However, the hypothetical key

constraint over attribute uu cannot be falsified automatically by the available data and the

compared pairs of attributes are (fairly) type compatible.

third expansion/

evaluation cycle

In the third expansion/evaluation cycle, the forward expansion rule can be applied to

implication ι3 that combines the knowledge about the hypothetical key constraint and the IND

over du and uu to infer an R-IND (cf. Figure 4.30). The two other INDs can be falsified by

applying the backward expansion rule to implication ι5 and executing the corresponding goal-

driven analysis operation validIND2. We say that the two corresponding places have been

grounded, because they represent definite facts (i.e., they have a negative necessity degree

of 1). They are converted to axioms in the grounding activity at the end of this expansion/

evaluation cycle. In order to increase the readability of Figure 4.30 we display only enabled

transitions and places that are connected to enabled transitions.

Figure 4.29. FPN after second expansion/evaluation cycle

88 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

human interaction The FPN shown in Figure 4.30 cannot be further expanded by applying the defined expansion

rules. Consequently, the automatic inference process terminates with an analysis result that is

still inconsistent respectively undecided. (The R-IND between DOCUMENT and USER is only

indicated with a necessity degree of 0.4). This result is presented to the reengineer in a suitable

dialog. The reengineer has to use her/his domain knowledge and perform manual investigations

to decide whether the hypothetical R-IND is valid. Let us assume that (s)he decides that the

inferred R-IND exists: (s)he adds this definite fact, which results in another grounded place in

Figure 4.32 (for proposition R-IND2({(du,uu)},d)). The other two uncertain propositions

(IND2({(du,uu)},d) and Key1({uu})) can be grounded likewise. However, this can also be done

automatically by the inference engine if we add implications to our GFRN which specify that

an IND and a key constraint is necessary for the existence of an R-IND (cf. Figure 4.31). In this

case, only one interaction is necessary to arrive at the definite analysis result presented in

Figure 4.33. We did not consider these additional implications in our GFRN in Figure 4.27

because it would have further increased the complexity of the FPNs displayed in this example.

Figure 4.30. FPN after third expansion/evaluation cycle

ι8:1/0

k=Π2(i)

key1R-IND2IND2

Figure 4.31. Additional implications to specify necessary conditions for R-INDs

kι9:1/0

i,v t

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 89

❑

Figure 4.32. FPN after considering human input

Figure 4.33. Final analsysis result

90 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

representing

human

assumptions

In the example presented above, human interaction is needed to validate (and support) a

hypothesis that has been inferred automatically. However, in an evolutionary DBRE process,

other scenarios are also possible. For example, the reengineer might annotate uncertain

assumptions in order to use the IE (in combination with the knowledge provided by the GFRN)

to validate these assumption and infer new hypotheses. Obviously, such uncertain assumptions

cannot be represented as axiomatic instances of the corresponding predicates in the GFRN,

because the FBM of axioms are immutable per definition. Therefore, we create an additional

axiom for each such assumption with a transition that leads to the actual proposition. In

Figure 4.34, this is illustrated for the simple case that the reengineer enters his/her subjective

belief that attribute dname might be a key of RS DOCUMENT. This assumption is represented

by an axiom (SB_key({dn})) with a transition that propagates the belief to the place that

represents the actual proposition. Figure 4.34 shows that this assumption is refuted by the goal-

driven analysis operation attached to predicate validKey1. Hence, the actual proposition

represented by place key({dn}) can be grounded.

In the next section, we will formalize the inference mechanism that has been introduced and

illustrated so far.

4.3.2.2 Formal definition

We start the algorithmic formalization of the process introduced in Figure 4.23 by discussing

the main inference algorithm presented in Figure 4.35. Subsequently, we give a more detailed

definition of the expansion step. This algorithm (GFRNInference) produces a set of definite

propositions based on an input that consists of a GFRN specification and a relational DB. Two

FPN variable structures are used locally to obtain this result. The first structure (N) is used for

the actual expansion and evaluation activities, while the second structure (N) stores the FPN

that was the result of the most recent expansion/evaluation cycle. Moreover, we employ a

variable X to store the set of places that are going to be axioms. Using this variable simplifies

the expansion algorithm, because we do not have to distinguish between axioms and non-

axioms in each situation when places and transitions are created. After each expansion/

evaluation cycle, we satisfy the required structural constraints (no incoming arcs for axioms)

by employing the information stored in X in a post-processing step (cf. line 39).

Figure 4.34. Representation of human assumptions

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 91

1) algorithm GFRNInference(G, B)

2) input G:=((Pd,Pg,Pt),Fr,Fb,I,E,cf,th,Ω,ω)∈

{GFRN}; B∈RDB

3) output R⊂

{L0}

4) local variables N:(S,T,F;D,b,v,c,t,mx)∈

{FPN}// current FPN

5) N:(S,T,F;D,b,v,c,t,mx)∈

{FPN}// result of the most recent exp./eval. cycle

6) X⊆S // places that are going to be axioms

7) begin

8) let N:(S,T,F;D,b,v,c,t,mx)=CreateEmptyFPN()

10) for each p∈Pddo // data-driven analysis

11) for each q∈{(w,β)∈Ω(p)(B)| (∃(χ,(p,i),s,d,A)∈E)(w=s p∧β≥th(i))} do

12) let (N,X)=CreatePlace(q, N, X, TRUE)

13) od

14) od

15)

16) loop

17) loop

18) let Dchanged={d|d∈D∧(d∉D∨mx(b-1(d))≠mx(b-1(d)))} // new/changed places

19) if Dchanged≠∅

20) then

21) let N=N // store old FPN state

22) let N:(S,T,F;D,b,v,c,t,mx)=ExpandFPN(G,N,Dchanged) // expansion

23)

24) for each z∈{s∈S|b(s)=p(u)∧p(u)∈D-D∧p∈Pg}do

25) let N=CreatePlace((ω(p)(B,p(u)) ,N, TRUE) // goal-driven analysis

26) od

27)

28) let N=ResetMarkings(N) // create axiom-based marking

29) let N=EvaluateFPN(N) // evaluation

30)

31) for each s∈{z∈S| grounded(z)} do // grounding

32) let (a,b)=mx(s)

33) if mx(s,‘‘)=1 then mx(s,¬)=0

34) else mx(s,‘‘)=0

35) fi

36) let X=X∪{s}

37) od

38)

39) let N=RemoveIncomingArcs(N, X) // satisfy structural constraints for axioms

40) fi

41) until Dchanged=∅// FPN unchanged

42)

43) for each (w,β)∈UserDialog(D,G) do // user dialog

44) CreateOrReviseAxiom((w,β) , N, G)

45) od

46)

47) until (∀p(u)∈D)(p∈Pt→p(u)∈X∨mx(b-1(p(u)),’’)=0)

48) // positive results is definite and consistent

49) return {p(u)∈X| p∈Pt∧mx(b-1(p(u)),’’)=1}

50) end

Figure 4.35. Algorithm GFRNInference

92 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

data-driven

analysis The inference algorithm starts by creating an empty FPN in line 8. Then, all data-driven

analysis operations are executed and places are created to represent the resulting indicators in

the FPN (lines 10-13). Note, that only those indicators are considered that have a credibility

weakly greater than the threshold of at least one GFRN implication that has the corresponding

data-driven predicate in its premise. The last parameter of the called algorithm CreatePlace is a

boolean value that determines whether the newly created place will be an axiom (cf.

Figure 4.36).

outer (interactive)

inference loop The outer (interactive) inference loop starts in line 16 and terminates in line 47 when all

instances of dependent predicates with a positive lower bound of necessity are represented by

axioms in the FPN. In Example 4.4, we demonstrated that such instances are converted to

axioms only if they have been grounded, i.e., if they have a positive or negative necessity of 1.

Consequently, the output of the algorithm is defined by the classical projection of all positive

propositions, i.e., all propositions that have a positive necessity of 1 (cf. line 49). User input is

considered in lines 43-45. The user might revise situation-specific knowledge by updating the

corresponding FBMs and (s)he can add new propositions by creating new axioms (cf.

Figure 4.34).

inner (automatic)

inference loop Lines 17-41 specifies the inner loop that automatically performs expansion/evaluation cycles

until the FPN remains unchanged.The statement in line 18 computes the set of all propositions

that have been added or modified in the last iteration. If this set is not empty it is used in line 22

to expand the FPN incrementally. Subsequently, goal-driven analysis operations are called for

all newly added instances of goal-driven predicates (cf. lines 24-26). Then, all FBMs at non-

axiom places are set to zero to obtain an FPN with an axiom-based marking that is evaluated

until equilibrium state in line 29. The aforementioned activity of grounding is formalized in

lines 31-37. In this activity all definite analysis results (i.e., propositions with a positive or

negative necessity of 1) are converted to axioms and partial inconsistency is removed. (A

formal definition of the notion of a grounded place is given in Definition 4.17.) Before the next

iteration of the inner loop, line 39 removes all incoming arcs for places that actually represent

axioms.

Definition 4.17 Grounded place

A place s∈S that is part of an FPN (S,T,F;D,b,v,c,t,mx) is called grounded in BRS x, denoted as

❑

groundedx(s), iff Min(a,b)<Max(a,b)=1 with mx(s)=(a,b).

Expansion process

algorithm

CreatePlace The expansion process is the process that incrementally creates and extends an FPN from a

combination of a GFRN and accumulated situation-specific knowledge. We have already

referred to algorithm CreatePlace in Figure 4.36 that is used to create instances of GFRN

predicates. In lines 6 and 7 it creates a new place that is added to the set of axioms if the

boolean argument ax is TRUE (cf. line 9). Then, the FBM of the new place is initialized

according to the sign of the represented literal (lines 11-14). Finally, the unsigned proposition

is added to the set of propositions in the FPN (lines 15 and 16).

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 93

algorithm

ExpandFPN

The algorithm that formalizes the forward and backward expansion rule from Figure 4.3.2.1 is

presented in Figure 4.37 (ExpandFPN). Algorithm ExpandFPN starts by removing all

transitions from the FPN that represent instances of implications with IQs which might be

affected by the last change in the FPN (cf. lines 6-8). This is done because some of these

transitions might lose their validity in presence of additional situation-specific knowledge. An

alternative solution for this problem is to check the corresponding constraints for each affected

transition with the new knowledge and to remove only those transitions which are no longer

valid. We have chosen the first alternative because it does not increase the computational

complexity of our algorithm (cf. Section 4.3.2.3) but it reduces its complicacy. The main loop

in algorithm ExpandFPN (lines 10-35) tries to expand each implication in the GFRN that is

affected by the changed situation-specific knowledge. In line 11, algorithm ComputeBindings-

ForImpl is called to compute all valid variable bindings for the current implication. These

bindings are returned in form of a relation that is assigned to the local variable bindingset. For

a given implication (ι,<v1,..,vx>,K)∈I, each tuple g:(u1,..,ux)∈bindingset represents a valid

binding for the variable list <v1,..,vx>. Furthermore, we define that the single elements of each

such tuple can be associatively accessed by the corresponding formal variable name, i.e.,

g[vi]=uiwith 1≤i≤x.

expansion of

transitions

The loop from line 12 to line 34 extends the FPN structure for each binding in variable

bindingset. This is done in the following steps. Firstly, all positive and negative propositions in

the antecedent and consequent of the corresponding MT are stored in the local variables Da+,

Da-, Dc+,and Dc- (lines 13-16). Then, it is checked whether the forward or backward

expansion rule can be applied (lines 18-19). If this is the case, then lines 21-23 create places for

all propositions that are not yet represented in the FPN. Subsequently, lines 26-29 create the

MT and all CTs that are necessary to represent the propositional implication, if these

transitions have not been created before (cf. line 24).

algorithm CreatePlace(d, N, X, ax)

1) input d∈

{NPL0}, N:(S,T,F;D,X,b,v,c,t,mx)∈

{FPN}; X⊆S

2) input G:(P, Fr, Fb, I, E, cf, th, Ω,ω)∈

{GFRN}; ax∈BOOL

3) output (N,X)

4) local variables s∈// place identifier

5) begin

6) let s=createID()

7) let S=S∪{s}

9) if ax=TRUE then let X=X∪{s} fi

10)

11) if d=(¬p(u), β)

12) then let mx(s)=(0,β)

13) else let mx(s)=(β,0) /* d=(p(u), β) */

14) fi

15) let D=D∪{p(u)}

16) let b(s)=p(u)

17) return (N,X)

18) end

Figure 4.36. Algorithm CreatePlace

94 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

In the following algorithm (ComputeBindingsForImpl), we make use of the fact that certain

variables of an implication can be computed from other variables by considering the

constraints specified for the implication (cf. page 83 for an example). In Definition 4.18, we

formalize this concept of variable derivability. Furthermore, we define the notion of a

derivation sink as a variable that can be derived from other variables but is not used to derive

variables itself (cf. Definition 4.19).

algorithm ExpandFPN(G, N, X, Dchanged)

1) input G:(P, Fr, Fb, I, E, cf, th, Ω,ω)∈

{GFRN}

2) input N:(S,T,F;D,b,v,c,t,mx)∈

{FPN}; Dchanged⊆D; X⊆S

3) output N:(S,T,F;D,X,b,v,c,t,mx)∈

{FPN}; X⊆S

4) local variables bindingset∈REL; g∈

(B); Da+,Da-,Dc+,Dc- ⊆

{L0}

5) begin

6) for each i∈{i:(ι,V,K)∈I|u∈

(B)∧p(u)∈Dchanged∧(χ,(p,i),s,premise_quantified,A)∈E} do

7) remove all transitions from N that have been created for implication i

8) od

10) for each i∈{i:(ι,V,K)∈I|u∈

(B)∧p(u)∈Dchanged∧(χ,(p,i),s,d,A)∈E} do

11) let bindingset=ComputeBindingsForImpl(G,i,N)

12) for each g∈bindingset do

13) let Da+={p(u1,..,ux)|(χ,(p,i),’’,premise*,<a1,..,ax>)∈E∧ui=g[ai]∧1≤i≤x}

14) let Da-={p(u1,..,ux)|(χ,(p,i),¬,premise*,<a1,..,ax>)∈E∧ui=g[ai]∧1≤i≤x}

15) let Dc+={p(u1,..,ux)|(χ,(p,i),’’,conclusion,<a1,..,ax>)∈E∧ui=g[ai]∧1≤i≤x}

16) let Dc-={p(u1,..,ux)|(χ,(p,i),¬,conclusion,<a1,..,ax>)∈E∧ui=g[ai]∧1≤i≤x}

17)

18) if Da+∪Da-⊆D /* forward expansion: if premise is fulfilled */

19) ∨Dc+∪D≠∅ /* or backward expansion: if hypothesis in the conclusion */

20) then

21) for each d∈(Da+∪Da-∪Dc+∪Dc-) - D do

22) let (N,X)=CreatePlace((d,0),N,X,G,FALSE)

23) od

24) if ExistsMT(N, i, Da+,Da-,Dc+,Dc-)=FALSE

25) then

26) let N=CreateTransition(N, i, Da+,Da-,Dc+,Dc-) // create MT

27) for each d∈da+ do

28) let N=CreateTransition(N, i, (Da+\d)∪Dc-,Da-∪Dc+,∅,{d}) od

29) for each d∈da- do

30) let N=CreateTransition(N, i, Da+∪Dc-,(Da-\d)∪Dc+,{d}, ∅)od

31) // create CTs

32) fi

33) fi

34) od

35) od

36) return (N,X)

37) end

Figure 4.37. Algorithm ExpandFPN

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 95

Definition 4.18 Derivability

Let (ι,V,K)∈I be an implication of a GFRN (P,Fr,Fb,I,E,cf,th,Ω,ω)∈

{GFRN}, v1,v2,..,vn∈V a

set of variables, and x∈{2,..,n}. We say that variable v1 is directly derivable from variables

v2,..,vn w.r.t. the constraints in K, denoted as v1∠Kv2,..,vn, iff the following condition holds:

∃(v1, f, W)∈K W⊆{v2,..,vn}

∨ ∃(vx, f, v1)∈K vx∈{v2,..,vn}∧∃f-1∈FUN ∧ ∀x defn(f(x))→f-1(f(x))=x

∨ ∃(ε,∈,v1,vx)∈K∨ ∃(ε,∈,vx,v1)∈K.

We define the notion of derivability based on the transitive closure of the above relation, i.e., a

variable v1∈V is derivable from a set of variables v2,..,vn∈V, denoted as v1∠K*v2,..,vn, iff

v1∠Kv2,..,vn

❑

∨ ∃vn+1,..,vn+m∈V (v1∠vn+1,..,vn+m ∧∀w∈{vn+1,..,vn+m} w∠K*v2,..,vn).

Definition 4.19 Derivation sink

We say that a variable v1∈V is a derivation sink of an implication i:(ι,V,K), denoted as

dsinkK(v1), iff the following condition holds:

❑

∃v2,..,vn∈V v1∠K*v2,..,vn∧ ¬∃w1,..,wq∈V (w1∠K*w2,..,wq,v1∧ ¬w1∠K*w2,..,wq)

algorithm

ComputeBindings-

ForImpl

The algorithm that computes all possible bindings for a given implication with respect to the

current propositions in the FPN is given in Figure 4.38.a In the first part of

ComputeBindingsForImpl (lines 6-22), the FPN is searched for all instances of predicates that

are connected to the current implication. Line 9 assures that only those propositions are

considered in the search that are represented by places with an FBM weakly greater than the

threshold of the current implication i. The relation of possible bindings (bindingset) is created

incrementally by binding the actual parameters of found propositions to the corresponding

variables and combining each variable binding with all (partial) binding tuples that have been

created so far and do not violate the constraints K (lines 15-18). Note, that we employ the

knowledge about the variable dependencies specified for i by excluding all variables that

represent derivation sinks. The excluded variables are derived later by applying the specified

functional dependencies. For example, in case of implication ι3 in Figure 4.27

ComputeBindingsForImpl would only search the FPN for bindings for the variable tuple (i, v)

because variable k represents a derivation sink (k is functionally determined by variable i). The

bindings for derivation sinks are computed in line 35 by a call to algorithm

ComplementBindingsForImpl according to their functional dependencies.

aTo improve the readability of this algorithm, we consider the case of GFRN arcs with a variable vector of length

one, only.

dealing with IQsIf there exists an IQ in the premise of the current implication, the two nested loops (at line 25

and line 27) compute bindings with all subsets of conjunctions over the corresponding

propositions in the FPN that satisfy the constraints K. After the completion of each binding

with respect to unbound derivable variables (line 35), the maximum conjunction for each IQ

variable is selected in lines 36-38.

96 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

algorithm ComputeBindingsForImpl(G, i, N)

1) input G:(P,Fr,Fb,I,E,cf,th,Ω,ω)∈

{GFRN}; i∈I; N:(S,T,F;D,X,b,v,c,t,mx)∈

{FPN}

2) output bindingset∈REL

3) local variables bindingset, bindingset’∈REL, g∈

(B)

4) begin

5) let bindingset=∅

6) for each {(χ,(p,i),s,d,<a>)∈E| ¬sink(a)∧¬s=premise_quantified} do

7) // for each variable that does not represent a derivation sink

8) let bindingset’=∅

9) for each {p(u)∈D| mx(p(u),s)≥th(i)} do

10) if bindingset=∅

11) then

12) let g[a]={u}

13) if ConstraintsHold(g,K) then let bindingset=bindingset∪{g} fi

14) else

15) for each g∈bindingset do

16) let g[a]={u}

17) if ConstraintsHold(g,K) then let bindingset’=bindingset’∪{g} fi

18) od

19) fi

20) od

21) let bindingset=bindingset∪bindingset’

22) od

23) if ∃(χ,(p,i),s,premise_quantified,a)∈E

24) then

25) for each {p(u)∈D| mx(p(u),s)≥th(i)} do

26) let bindingset’=∅

27) for each g∈bindingset do

28) let g[a]=g[a]∪{u}

29) if ConstraintsHold(g,K) then let bindingset’=bindingset’∪{g} fi

30) od

31) let bindingset=bindingset∪bindingset’

32) od

33) fi

34)

35) let bindingset=ComplementBindings(bindingset,i)

36) if ∃(χ,(p,i),s,premise_quantified,a)∈E // select maximal bindings for IQ

37) then let bindingset=bindingset-{g∈bindingset| ∃g’∈bindingset

38) (g’≠g∧g[a]⊂g’[a]∧g[V\a]=g[V\a])}

39) fi

40) return bindingset

41) end

Figure 4.38. Algorithm ComputeBindingsForImpl

KNOWLEDGE INFERENCE WITH GFRN SPECIFICATIONS 97

algorithm

Complement-

Bindings

The algorithm that performs the completion of each binding (ComplementBindings) is

presented in Figure 4.39. The first loop (lines 6-15) considers GFRN constraints with the

predefined boolean function ’∈’. For each constraint of the form ’∈(w1,w2)’ and each binding

tuple g it is checked which elements of g[w2] are valid bindings for w1. All new valid bindings

are added to relation bindingset. Moreover, if {g[w1]} is a valid binding for variable w2 it is

added to bindingset. The second loop (lines 18-26) uses the defined relational functions to

derive bindings for all variables that have not been bound yet. All binding tuples with unbound

variables that cannot be derived by bound variables are removed from relation bindingset.

4.3.2.3 Complexity and scalability

LDBs often consist of a large number of software artifacts and DBRE methods and tools have

to admit to this scale in order to be of practical use. In this section, we reason about the

complexity and scalability of the proposed approach to legacy schema analysis.

algorithm ComplementBindings(bindingset, G, i)

1) input G:=(P, Fr, Fb, I, E, cf, th, Ω,ω)∈

{GFRN}; i:(ι,V:<v1,..,vn>, K)∈I; bindingset∈SET

2) output bindingset∈REL

3) local variables bindingset‘∈REL; g∈

(B)

4) begin

5) let bindingset‘=bindingset

6) for each (ε,∈,<w1,w2>)∈Kdo

7) for each g∈bindingset do

8) for each u∈g[w2]do

9) let g[w1]=u

10) if ConstraintsHold(g,K) then let bindingset‘=bindingset‘∪{g} fi

11) od

12) let g[w2]={g[w1]}

13) if ConstraintsHold(g,K) then let bindingset‘=bindingset‘∪{g} fi

14) od

15) od

16)

17) let bindingset=bindingset‘

18) for each g∈bindingset do

19) let Vbound={v∈V|g(v)≠∅}

20) if ∀v∈V-Vbound ∃v2,..,vn∈Vbound v∠K* v2,..,vn

21) then

22) derive bindings of all v∈V-Vbound from g

23) else

24) let bindingset=bindingset-{g}

25) fi

26) od

27) return bindingset

28) end

Figure 4.39. Algorithm ComplementBindings

98 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

Worst-case complexity of the proposed algorithms

Complement-

Bindings We start with the analysis of the worst-case complexity of algorithm ComplementBindings (cf.

Figure 4.39). The complexity of the first loop (lines 6-15) is O(b*t), where b denotes the

number of tuples in bindingset and t is the maximal cardinality of each set-valued element of

such a tuple. The second loop (lines 18-26) has a complexity of O(b). Together, this leads to a

worst-case complexity of O(b*t) for algorithm ComplementBindings.

ComputeBindings

ForImpl Let us assume that in the first iteration of the outer loop in ComputeBindingsForImpl

(Figure 4.38, lines 6-22) predicate p1is selected. The body of the inner loop that starts in line 9

is iterated dp1 times, where dp1 denotes the number of places in the FPN that represent

instances of this predicate (with the necessary FBM). Consequently, after the first iteration of

the outer loop the relation bindingset contains dp1 tuples. For every further iteration of the outer

loop according to an additional predicate px∈p2,..,pnthat does not use an IQ, the relation

bindingset is extended by at most dpx*bx-1 elements, where dpx denotes the number of instances

of predicate px and bx-1 denotes the cardinality of relation bindingset after the most recent

iteration. Consequently, the loop from lines 6-22 has a worst-case complexity of O(da), where

d denotes the number of propositions in the FPN and a is the number of arcs connected to the

current implication. If the current implication has an IQ then the worst-case complexity of the

loops in lines 25-32 is O(dd*a), because the cardinality of bindingset might be doubled for each

iteration of the outer loop. Given the complexity of algorithm ComplementBindings, the total

worst-case complexity of algorithm ComputeBindingsForImpl is then estimated to O(dd*a).

ExpandFPN and

GFRNInference The complexity of algorithm ExpandFPN is dominated by the complexity of algorithm

ComputeBindingsForImpl. The same applies for algorithm GFRNInference, because

Corollary 4.1 on page 81 guarantees that an axiom-based FPN can be evaluated in linear time.

(Clearly, the complexity of GFRNInference also depends on the complexity of the employed

data- and goal-driven analysis operations.) Moreover, we can only reason about the complexity

of one single expansion/evaluation cycle, because the termination of the entire inference

process depends on the GFRN and is undecided without the knowledge about the semantics of

the functions that are used in the GFRN. For example, Figure 4.40 shows a GFRN which may

or may not terminate depending on its input and the semantics of the employed functions. Let

us assume that ∪() denotes the union operator, set() is a set constructor, and f(x) is true iff

’foo’∈x. Then the inference process does not terminate for an input fact p1({bar}). On the other

hand, if f(x) is true iff |x|<10 then the inference process terminates after 10 expansion/

evaluation cycles.

1/0

v2=∪(v1,set(v1))

Figure 4.40. Example GFRN for termination problem

f(v1)

IMPLEMENTING THE VARLET ANALYST 99

Discussion of analysis results

A worst case complexity of O(dd*a) for the proposed inference algorithm might seem

intractable. However, this exponential effort is only needed for implications that use universal

quantifiers (IQ). For all other implications the inference process can be performed in

polynomial time O(t*da) w.r.t. the number of situation-specific facts in the FPN. The number of

connected arcs per implication (a) is usually between 2 and 4.

In case of implications with IQs, our approach allows to control the computational complexity

by choosing higher TVs for these implications. A higher TV reduces the number d of searched

propositions and considers only the most credible ones. Hence, the reengineer can employ TVs

to weigh individual GFRN implications according to their computational complexity.

Consequently, this strategy allows to scale our approach to LDBs with different sizes. In the

next section, we report on our practical experiences.

4.4 Implementing the Varlet Analyst

The algorithms described above can easily be implemented in a procedural programming

language that provides a basic library of types and functions to deal with sets, tuples, and

relations. We have chosen the portable programming language Java [GJS97] to implement and

evaluate these algorithms in a CARE tool prototype named the Varlet Analyst. This tool

supports the first phase (schema analysis) in the DBRE process sketched in Figure 1.3 on

page 6. The following section will outline the architecture of the Varlet Analyst, whereas

Section 4.4.2 presents the user’s perspective. The Varlet Analyst is part of an integrated tool

environment (Varlet) which also supports subsequent DBRE phases (schema migration and

data integration). The remaining parts of the Varlet tool environment will be discussed in

Chapter 5.

4.4.1 Architecture

The architecture of the Varlet Analyst is shown in Figure 4.41. The entire tool comprises

approximately 30.000 lines of code. Its core component that deals with the internal GFRN

representation and inference is written in Java. The concrete design and implementation of the

inference engine (IE) including the FPN model is described by Heitbreder [Hei98]. Module

GFRN encapsulates the logical representation of GFRNs and provides functionality to store

and retrieve different specifications. Boolean and relational functions that are used in

constraints of GFRN implications are implemented in module Constraint Functions. This

module is extended during the tool customization process when additional functions are

needed (cf. Section 4.4.2.1). Likewise, additional analysis operations can be added to modules

Data-Driven Operations and Goal-Driven Operations.

Analysis operations use basic functionality provided by modules Code Pattern Extraction,

Extension Extraction, and Schema Extraction. Module Code Pattern Extraction implements

customizable detection mechanism for stereotypical code patterns (cf. page 18). Code patterns

are specified on a high level of abstraction using layered graph grammars (LGG) [RS97]. They

are stored in a pattern library that can easily be extended [Bew98]. The actual pattern

recognition algorithm is implemented in the graphical programming language Progres

100 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

[SWZ95] and described by Bewermeyer [Bew98]. Module Schema Extraction provides

functionality to extract information about the meta data of the LDB, while module Extension

Extraction allows to access the available legacy data. We use an abstract interface to facilitate

the adaption of the Varlet Analyst to different DBMS. More precisely, we employ the ODBCa

standard [Gei95] to interface existing databases because ODBC gateways are broadly available

from many DBMS vendors.

The Varlet Analyst provides two user interface components: the Customization Front-End

allows to adapt the tool (i.e., to customize the GFRN specification and analysis operations),

while the Analysis Front-End is used to apply the tool for legacy schema analysis. These user

interface components have been implemented in iTcl/Tk [Wel97]. Internally, the logical

schema is represented by an abstract syntax graph (ASG) that is initially constructed by a

DDLb parser implemented with lex&yacc [LMB92].

a Open DataBase Connectivity

b Data Definition Language

GFRN

Customization

Front-End Analysis

Front-End

FPN

Inference

Engine Logical

schema

Schema

Code Pattern

Extraction

Extension

Extraction

Data-Driven

Operations

Goal-Driven

Operations

Constraint

Functions

data

code

schema

catalog

h i j k j l m n o m j o j l p q r s q i

t u

p v w x y z { | z y } r s x y z { | z y } i

h ~   v   m j  q r s  m j  q i s



v l   j

p 

h    j  j   }    w    

u 

  q { z  q r  q { z   

LDB

Figure 4.41. Architecture of the Varlet Analyst

iTcl/Tk

Java

Progres

Java

JDBC

Java

JDBC

Extraction

DDL

Parser

lex/yacc

uses

module

Abstract DB

ODBC

Pattern

LGG

Library

IMPLEMENTING THE VARLET ANALYST 101

4.4.2 User interface

In the following, we will present the user interface of the Varlet Analyst from two different

perspectives: the next section describes the Customization Front-End which is used to adapt

our tool to a specific application context. Subsequently, we will move to the perspective of the

reengineer who uses the Analysis Front-End to recover a consistent logical schema of an LDB.

4.4.2.1 The Customization Front-End

A graphical editor for GFRN specifications represents the main component of the

Customization Front-End. This editor can be invoked from the Varlet Control Panel which is

also used to start all other tools in our CARE environment. Figure 4.42 shows a screenshot of

the GFRN editor and the Varlet Control Panel (upper right corner). Note, that for technical

reasons we use integer values between 0 and 100 to specify CVs and TVs in the Customization

Front-End. Some implications in the displayed GFRN are already familiar from previous

example specifications. They have been labeled by identifiers i1,..,i14 to make it easier to refer

to them in our explanation. In order to simplify the representation, the GFRN editor skips

variable names for implications which have the same variable associated with all of their in-

and outgoing arcs. Note that in our GFRN editor, different types of predicates (data-driven,

goal-driven, or dependent) are represented by different colors. Hence, in the grey-scale printout

of our screenshot, dependent predicates are marked by black ovals while data-driven and goal-

driven predicates are rendered with dark grey and light grey color, respectively.

description of

sample GFRN

Implications i1,..,i3have already been introduced in Section 4.2.1. Analogously to implications

i10 and i11, which have been discussed in Figure 4.11 on page 64, implications i4and i5 specify

the rule that a hypothetical key may only exist if the corresponding constraint is valid in the

available data. Implications i12,..,i14 represent a refinement of the heuristic given in Figure 4.8

on page 62 that also considers the type compatibility of attributes: i13 specifies that similar

attribute names might indicate equivalent meaning, while i14 represents the knowledge that

equivalent attributes have to be type compatible. Implication i12 formalizes the heuristic that a

set of pairs of equivalent attributes might indicate an IND. Implication i9 specifies that an

instance of a join pattern is another indicator for an IND (cf. page 18). Analogously to

implication i8, which has already been known from Figure 4.4 on page 59, implication i7

classifies an IND as inheritance relationship (I-IND) if there is a (hypothetical) key constraint

for its left- and right-hand side. Finally, implication i6 determines an IND to be a cardinality

constraint if there exists a key constraint for its left-hand side only (cf. page 21).

multiple views to

handle complexity

A typical problem of graphical languages like Petri nets, state charts, and Entity-Relationship

(ER) models is that specifications soon become too complex to be visualized in a single

diagram. For example, we would like to add further implications to the GFRN in Figure 4.42

representing our knowledge that an R-IND necessarily implies an IND and a key constraints in

the referenced table. Other implications could express that the classification of a given IND as

an R-IND or a I-IND is mutual exclusive, etc. A commonly used solution to this visualization

problem is to use multiple views on a single specification. We adopt this technique in our

implementation, i.e., there can be different views on the same GFRN specification. Each of

these views might focus on a separate aspect of the analysis process, e.g., detection of keys,

detection of INDs, and classification of INDs. Consequently, the reengineer can use another

102 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

view to add the necessary conditions for R-INDs, I-INDs, and C-INDs (i18,i16,i15 in

Figure 4.43) and to specify the mutual exclusiveness of R-INDs and I-INDs (i17).

consistency check

implementation of

analysis operations

After the reengineer has modified the GFRN specification, (s)he can invoke consistency checks

which validate that the GFRN is well-formed (cf. Definition 4.3 on page 67). In addition, these

consistency checks use the Java Reflection API [Fla97] to ensure that for each data-driven or

goal-driven predicate there exists an implementation of a corresponding analysis operation in

Java. The same applies for functions used within constraints of GFRN implications. The

screenshot in Figure 4.42 shows a consistency report which indicates three missing Java

implementations, namely the implementations for the function disj, the goal-driven operation

validIND, and the data-driven operation selDist. The reengineer uses generated code frames to

implement missing Java methods used in the GFRN specification. Obviously, implementing

additional analysis operations is the most time-consuming activity in the customization process

of the Varlet Analyst. However, our tool provides the reengineer with a predefined library of

standard functionality that facilitates this task, e.g., DB login and access, operations on result

relations, and parameterizable fuzzy membership functions. Often, it is not even necessary to

add further analysis operations but the reengineer just uses the GFRN editor to change

specified heuristics or modify their credibility.

Figure 4.42. Customization Front-End

IMPLEMENTING THE VARLET ANALYST 103

4.4.2.2 The Analysis Front-End

After the Varlet Analyst has been customized w.r.t. to its current application context it can be

applied for legacy schema analysis. The first step in this process is to extract the schema

catalog from the LDB under investigation. Subsequently, all data-driven analysis operations

specified in the GFRN are executed to deliver indicators for additional semantic constraints.

The result of this initial automatic analysis step is graphically presented to the reengineer in the

so-called Analysis Front-End (cf. Figure 4.44). In the Analysis Front-End, each box represents

a table and INDs are visualized by lines. In order to cope with large schemas, the reengineer

can choose from various levels of abstraction and create separate views on the same logical

data structure. At the beginning of the analysis of a large LDB schema it is more appropriate to

choose a view that hides details and allows to cluster groups of tables into subsections that can

be analyzed separately [SdJPeA99]. The Analysis Front-End provides different layout

algorithms to facilitate this activity.

Figure 4.43. Customization Front-End (2)

104 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

detailed

representation Figure 4.45 presents another screenshot of the Analysis Front-End with a more detailed view

on the eight sample tables from our sample scenario. It shows all attributes with their

corresponding type. Key attributes are set in bold font. Additional information might be

represented at the bottom of each table, e.g., the representation of table USER indicates the

existence of another key (2 keys) which can be displayed using the ShowNextKey command

from the Commands menu. Likewise, the representation of table PRODREF indicates the

existence of four different variants (cf. page 20). All attributes and INDs which do not belong

to the currently displayed variant are dimmed. Again the reengineer can chose from various

degrees of detail. For example, most INDs are represented as single lines in Figure 4.45, but

the reengineer selected a detailed representation of the IND between tables PRODUCT and

PRODREF. For this IND correspondences between pairs of referencing attributes are marked

by numbers. Note, that different triangular icons are used to represent INDs according to their

classification:

•symbol represents INDs without a further classification;

•an additional key symbol marks INDs which have been classified as key-based

references (R-INDs);

•symbol marks INDs which also imply an IND (C-IND) in the reverse direction;

•again, an additional key symbol represents R-INDs which imply an inverse C-IND;

•finally, key-based INDs which have been classified as inheritance relationships (I-INDs)

are marked by white triangles .

Figure 4.44. Analysis Front-End (overview)

IMPLEMENTING THE VARLET ANALYST 105

visualizing

imperfect

information

A key achievement of our approach is that we relax the requirement for consistency during the

legacy schema analysis process. Consequently, the Varlet Analyst has to provide means to

visualize such imperfect information about LDB schemas. A central problem that arises with

such a visualization stems from using a quantitative measure to represent uncertainty. We have

to avoid that the schema representation is overloaded by too many hypothetical constraints

with low credibility. We solve this problem by introducing the concept of a view threshold

which determines a lower limited of certainty for all schema artifacts displayed in the current

view. Consequently, the semantics of a view threshold is an α-cut on the fuzzy set of all certain

schema artifacts (cf. page 50). In the Varlet Analyst, the view threshold is displayed in the

status bar over the graphical window (cf. page 105). It can be changed on-the-fly by the

reengineer. Note, that the view threshold does only consider the certainty in favor of a

hypothesis but it disregards the certainty against it. Hence, the graphical view contains also

contradicting information as long as these hypotheses have not been refuted completely, i.e., as

long as they have a negative certainty lower than 100 (1).

Figure 4.45. Analysis Front-End (detail view)

inference

button

106 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

indicating

imperfect

information

The ultimate goal of the schema analysis process is to come up with a consistent logical

schema for the LDB under investigation. In an evolutionary process, the reengineer confirms or

refutes uncertain hypotheses and resolves contradictions to obtain this result. In order to do this

efficiently, a CARE tool that tolerates imperfect knowledge has to provide powerful

mechanisms to indicate imperfect information to the reengineer and guide him/her during the

analysis. For this purpose, we have developed a dedicated dialog called the Analyst’s Agenda

which is shown on the right side of page 105. The Analyst’s Agenda presents a list of uncertain

or contradicting constraints about the current view of the logical schema. For each constraint a

positive and a negative certainty is displayed. The Analyst’s Agenda provides the functionality

to sort the list items according to various criterions in ascending or descending order, e.g.,

positive certainty, negative certainty, degree of contradiction (absolute difference of both

certainties). When the reengineer selects an entry in the agenda the corresponding graphical

representation is highlighted in the main window. In Figure 4.45, the reengineer has selected

the foreign key from PRODGRP to USER which has been inferred with a certainty of 60 (0.6).

Let us assume that after investigating the form-based user interface of PDIS (s)he confirms that

the inferred foreign key in fact represents a reference between product groups and product

managers (stored in table USER). This can be done by selecting the Confirm command from

the context-sensitive menu of the Varlet Analyst (cf. Figure 4.45). Likewise, (s)he can proceed

and do further manual investigations and annotations according to the displayed agenda or

additional knowledge.

automatic

inference On the other hand, (s)he can invoke the inference engine (IE) at any point in time to resume the

automatic analysis process. This can be done by pressing the inference button on the Varlet

Analyst’s icon bar. When invoked, the IE propagates the schema modifications and executes

goal-driven operations if necessary (cf. Figure 4.23 on page 82). This automatic analysis and

inference step is performed asynchronously in a separate thread. The reason for this solution is

to allow the reengineer to continue his/her investigation during the inference process. At the

end of each inference step the schema representation is not updated automatically but the

availability of the inference result is indicated to the reengineer by changing the icon on the

inference button from an empty box to a full box. We have chosen this solution to avoid

confusion due to spontaneously updated representations. In the sample situation displayed in

Figure 4.45, the schema update produced by the IE will remove three entries from the agenda:

the selected R-IND will be removed because it has been confirmed. Moreover, the first two

entries will be removed as well because, according to the GFRN in Figure 4.43, they represent

necessary preconditions for the confirmed R-IND. When the agenda is empty the current view

of the schema is consistent w.r.t. the defined view threshold. In this case, the reengineer can

either decrease the view threshold and investigate hypotheses with lower certainty (if existent)

or (s)he can decide to neglect all remaining hypotheses with a lower certainty, produce

annotated textual and graphical documentation (cf. Figure 4.46), and continue with the

conceptual schema migration process (cf. Chapter 5).

EVALUATION 107

4.5 Evaluation

first prototypeWe have chosen an incremental approach to stepwise implement, evaluate, and refine our

approach. We created our first implementation prototype with the high-level specification

language Progres which has been developed at RWTH Aachen (Germany) [SWZ95]. In

particular, this language has been well-suited because it is based on the notion of graphsa as the

central implementation paradigm and GFRNs as well as FPNs are graph-oriented structures.

Moreover, the Progres development environment includes customizable graph visualization

tools which we employed as a rudimentary user interface for the Varlet Analyst. We used this

initial implementation with small-scale schema reverse engineering problems to validate and

refine our concepts. We learned that our approach is feasible in principle but the tool lacked

adequate abstraction mechanisms and user dialogs to make experiments with larger case

studies and attract potential users. Moreover, the performance of the tool became weak when

the FPN grew larger because every data structure in Progres is stored persistently in a graph-

oriented database with full support (and overhead) for transaction management and recovery.

a cf. Definition 5.1

second prototypeIn fall 1997, we decided to (re)implement the inference engine in Java and create a dedicated

user interface in iTcl/Tk. This enabled us to use transient FPN data structures to perform the

inference process. Still, we left the internal representation of the analysis results (i.e., the

analyzed logical schema) in the Progres repository to exploit the benefits of error recovery. The

Java inference engine was about 15 times faster than the former Progres version [Hei98].

Moreover, the Java Reflection API [Fla97] allowed us to bind existing data- and goal-driven

analysis operations to GFRN predicates on-the-fly without the need to recompile our tool.

Obviously, the implementation of new operations with a compiler-based language like Java

still needed recompilation. Hence, we considered using an interpretative scripting language to

define analysis operations, e.g., an extension of Tcl [Wel97] or Perl [WS90]. However, the

Figure 4.46. Graphical and textual documentation of an analyzed logical schema

108 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

overhead caused by the recompilation (less than one minute on a 300 MHz Sun Sparc II) was

too low to justify this effort.

case study In spring 1998, we started a collaboration with two German companies (eps Bertelsmann and

Merck KGA) who provided us with a practical case study for our approach. The database

schema consisted of 85 tables and 347 attributes; the database access component comprised

26.000 lines of code.a By implementing multiple views on the same logical schema with

various levels of abstraction, we improved the usability of the Varlet Analyst’s user interface to

visualize larger application examples. After executing all automatic data-driven analysis

operations, we made a first attempt to apply the GFRN inference engine to obtain initial

hypotheses about schema constraints. We cancelled the inference process because it did not

terminate within 30 minutes. A postmortem investigation revealed that the data-driven analysis

operation NamSim2 (cf. Section 4.2.1) was responsible for this undesired behavior: it produced

over one thousand indicators for INDs because the example schema contained many similar

column names. Most of these indicators could be falsified by the automatic analysis of the

available data with the goal-driven operation validIND2. However, this process was very time-

consuming.

a The entire system had a size of several hundred lines of code.

domain analysis The experience described above emphasized the importance of the domain analysis and GFRN

customization step (Figure 4.1 on page 56) before starting the actual schema analysis process.

The inadequacy of the aforementioned naming heuristic could be detected with little effort by

browsing the schema catalog before starting the analysis process. Within a few minutes, we

discovered that in many cases the developers named columns (which seemed to be foreign

keys) similarly to other tables. Consequently, we replaced the NamSim2-heuristic with another

heuristic which is based on similarities among column and table names to indicate INDs. The

customization of the GFRN and the implementation of the new analysis operation took less

than ten minutes. Subsequently, we restarted the analysis process. This time, the inference

engine terminated after five minutes and indicated 46 possible INDs (out of 111 actually

existing INDs). The new naming heuristic delivered 29 INDs. Another 26 INDs were indicated

by instances of join-patterns in the database access code (cf. Section 2.4.1), but 9 of these INDs

could be falsified by the automatic execution of goal-driven operation validIND2. In

combination with the specified and detected key constraints, 24 INDs were classified as R-

INDs, 11 INDs were (primarily) classified as I-INDs, and 3 INDs were classified as C-INDs.

All indicated INDs turned out to be valid. Still, we had to resolve contradictions caused by the

ambiguous classification of I-INDs (cf. Section 4.4.2.1).

user guidance We made the experience that the user needs additional guidance to detect and resolve such

contradictions: so far, our tool only supported the concept of view thresholds and the possibility

to query each schema constraint for its associated certainty (cf. Section 4.4.2.2). Using these

mechanisms to find and eliminate uncertain and contradicting information about the logical

schema turned out to be a tedious activity for larger examples. Hence, we introduced the

agenda concept described in Section 4.4.2.2 with querying, sorting, and high-lighting facilities

which drastically simplified this activity.

concurrent

inference The proposed iterative process of manual investigations, goal-driven analysis, and automatic

inference and propagation of results proved to be of great benefit for our application examples.

RELATED WORK 109

Still, the experimental users of the Varlet Analyst complained that they had to wait for the

inference engine to terminate each time they resumed the automatic analysis process. This was

disturbing because the entire analysis/inference cycle took up to several minutes for the

example schema. Therefore, we implemented the inference engine in a separate process that

ran in parallel to the Analysis Front-End. This solution allowed the users to invoke the

inference engine and proceed with their manual investigations. Still, it had the spurious side-

effect that sometimes spontaneous screen updates caused by the inference results interfered

with manual analysis activities. Hence, we decided to indicate the availability of new inference

results in the Analysis Front-End and let the user decide when the graphical representation

should be updated accordingly.

experiences with

the current tool

Our experiences with the current version of the Varlet Analyst are positive. By incorporating

imperfect DBRE knowledge, our tool provides significantly better support for schema analysis

and completion than existing approaches. With little effort, it can be customized to the

characteristics of different legacy schemas. We learned that it is especially important to adapt

heuristics that deal with naming conventions in this customization step. Even though the

current prototype still has a number of technical problems, which mainly stem from combining

multiple languages (Java, Progres, C, and iTcl/Tk), we are confident that many of our

implemented concepts have the maturity necessary to find their way into commercial DBRE

tools. Still, a frequently mentioned point of criticism with our approach is that confidences of

heuristics are hard to estimate in terms of real numbers. Introducing a limited set of symbolic

confidences to choose from (e.g., certain,more or less certain,weakly certain) could ease the

specification of heuristics.

4.6 Related work

Blaha and

Premerlani

Most existing approaches to legacy schema analysis aim to recover a complete logical schema

by following a predefined process of subsequent reverse engineering activities. Some

approaches suggest loosely coupled tools to support certain activities. In [PB94, BP95, Bla98],

Premerlani and Blaha report on their experience in schema analysis using simple tool sets

which mainly contain UNIX tools [RRF90] like grep and awk and predefined SQL queries.

They argue that a flexible, interactive approach to DBRE is more likely to succeed than batch-

oriented compilers. The proposed DBRE process is based on the Object Modeling Technique

(OMT) [RBP+91] and starts with an initial object model where each RS represents a candidate

class. Subsequently, the reengineer has to detect abstract design concepts based on a set of

informal heuristics, guidelines, and clues [BP98]. The main drawback of their approach is that

loosely coupled tools provide little support for exchanging and combining analysis results,

automatically. They lack the ability to control, propagate, and indicate inconsistencies.

Moreover, they play a mostly passive role in the DBRE process. This means that the reengineer

is responsible to invoke analysis operations for code, data, and schema inspection, manually.

Our approach overcomes this limitation and allows to integrate such existing analysis

operations in a GFRN as a common framework (cf. Section 4.2.2).

Petit et al.

Andersson

Petit et al. present an approach to analyze queries in existing application code to derive

semantic constraints about legacy schemas, e.g, INDs and inheritance relationships [PKBT94,

PTBK96]. They search the application code for stereotypical patterns like equi-joins,auto-

joins,set operations, and group-by clauses which serve as semantic indicators. Once such

110 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

semantic indicators have been detected, Petit et al. use additional queries to the available data

in order to determine further information about the hypothetical constraints, e.g., the

cardinality of associations or the direction of inheritance relationships. Similar to Petit’s

approaches, Andersson employs stereotypical code patterns as semantic indicators for key and

foreign key constraints [And94]. Both methods can be integrated in our approach in form of

analysis operations which are bound to GFRN predicates. For example, data-driven operations

can be used to search for initial indicators while goal-driven operations perform further

cardinality analysis and validation for all indicators found.

Signore et al. In [SLGC94], Signore et al. present a knowledge-based approach to DBRE that uses Prolog

clauses [Wil86] to infer schema constraints from detected semantic indicators. In a first step,

indicators for primary keys, candidate keys, and foreign keys are collected by comparing

names and types of attributes and investigating their usage in the application code. Each

indicator is stored as a Prolog clause in the fact base of the DBRE tool. The second step is

called conceptualization. In this step, predefined heuristics modeled as Prolog rules are used to

infer abstract modeling concepts like many-to-many relationships, complex attributes, and

generalizations. These pattern recognition rules can easily be adapted by the reengineer, which

is similar to our approach. Still, a significant drawback of Signore’s tool is that the employed

Prolog interpreter supports only backward reasoning, i.e., it validates hypotheses of the

reengineer but it does not create new hypotheses. Hence, it restricts the reengineer to a top-

down analysis process. Moreover, there is no tool support to detect indicators (e.g., instances of

code patterns, naming conventions, variant structures). All indicators have to be present before

the inference process and there is no mechanism to execute goal-driven analysis operations on-

demand. Another limitation of Signore’s approach is that heuristics and uncertain results are

represented as definite clauses without a valuation for their credibility or their contradiction.

Hainaut et al. A comprehensive CARE environment for schema analysis and migration (DB-Main) has been

developed since 1993 at the University of Namur, Belgium [HEH+96, HHHR96]. It provides

the reengineer with a powerful scripting language called Voyager 2 [Eng98]. DB-Main includes

several predefined scripts for extracting data structures declared in catalog tables and source

code [HEH+98]. Voyager 2 allows the reengineer to extend the set of available extractors by

new analysis operations. Even though this approach is very powerful its disadvantage is the low

level of abstraction: DBRE heuristics and processes are coded in procedural scripts whereas

declarative formalism would be more appropriate. It takes a significant amount of training to

learn how to use Voyager 2 to customize the analysis process. Moreover, extractor scripts have

a passive nature, i.e., they have to be invoked explicitly by the reengineer. In our opinion, a

combination of Voyager 2 scripts to develop data- and goal-driven analysis operations with the

declarative GFRN approach to specify heuristics and active processes seems most promising

and beneficial.

Hodges and

Ramanathan

Vossen and

Fahrner

In [RH97, RH96], Hodges and Ramanathan describe a method to identify abstract concepts

like associations, aggregations, and inheritance structures in relational schemas. Their

approach is based on the assumption that the relational schema description is structurally

complete, i.e., the reengineer has complete information about key and foreign key constraints.

This assumption is too idealistic for many existing LDB systems [HCTJ93]. Vossen and

Fahrner describe similar techniques to annotate relational schemas semantically [FV95].

However, their approach also covers the phase of structural schema completion (cf.

Section 2.4.1): they propose an algorithm to infer INDs based on equivalence classes of

SUMMARY 111

relational attributes. We specified central ideas of their method in the GFRN specification

described in this chapter.

Soutou

Blockeel and

De Raedt

Several other methods and algorithms have been proposed to detect indicators for structural or

semantical information about relational LDB schemas. Soutou presents an algorithm to recover

n-ary associations [Sou98a]. This analysis is performed in two steps: firstly, information about

key and foreign key dependencies are used to identify candidates for RS that represent n-ary

associations. Secondly, an algorithm generates tentative queries to determine cardinality

constraints for these associations. Based on the analysis results, Soutou proposed a method to

recover aggregate relationships in relational schemas [Sou98b]. Blockeel and De Raedt adopt

methods known from the domain of inductive logic programming (ILP) to detect constraints in

relational DBs [BR97]. They propose an algorithm to find relationships among different RS

that can be implemented in SQL. The GFRN approach described in this dissertation allows to

integrate such algorithms in terms of data- and goal-driven analysis operations.

DBInformer,

ERwin,

SeeData

Some tools have their primary focus on visualizing existing LDB structures. Most of these

approaches generate graphical networks of entities and relationships which can be browsed and

annotated interactively by the reengineer, e.g., DBInformer [Him97] and ERwin [Log97]. The

problem of such graph-oriented representations is that they tend to clutter for larger database

schemas (several hundreds of tables). A more scalable approach to schema visualization has

been developed at AT&T Bell Labs [AEP96]. Their tool (called SeeData) provides several

different views which display separate aspects of an LDB on various levels of abstraction.

These views also cover the relationship between the LDB schema and the corresponding

application code. This allows the reengineer to determine those parts of the code which are

affected by a given schema modification. More powerful visualization techniques and the

ability to browse the source code which is associated to certain schema artifacts would further

increase the usability if the Varlet Analyzer Front-End.

SousaAn approach to reverse engineer large LDB schemas that follows the divide-and-conquer

paradigm has been developed by Sousa et al. [SdJPeA99]. Their idea is to use information

about primary keys to cluster relations into so-called abstract entities and relationships. Each

abstract entity (and relationship) represents an excerpt of the entire LDB schema which is

reverse engineered separately. In a final step, the resulting reverse engineered subschemas are

integrated to a common schema and completed with missing elements. Sousa’s approach can

be viewed as a meta process as they do not make any assumption on the actual method which is

used for schema analysis. An integration of similar clustering techniques as pre- and

postprocessing steps in our schema analysis process could further increase the scalability of

our approach. However, one limitation of Sousa’s original method is the lack of control about

the granularity of clusters: some clusters are composed by a large numbers of relations while

others consists of a single relation.

4.7 Summary

In this chapter, we have elaborated an approach to incorporate and exploit imperfect

knowledge in human-centered DBRE processes. Our research was driven by the observation

that imperfect knowledge plays an important role in database schema analysis activities.

Currently existing approaches to DBRE do not consider imperfect knowledge. They presume a

112 GFRN AS A BASIS FOR LEGACY SCHEMA ANALYSIS

mostly monotonic schema analysis process that consists of accumulating definite (and

consistent) knowledge about an LDB until the structural and semantic information about the

schema is complete. We set up the hypothesis that by temporarily relaxing this requirement for

consistency and precision, we would be able to develop a DBRE tool that considers the human

reasoning process of reengineers more adequately.

To solve this problem, we proposed an evolutionary analysis process controlled by a non-

monotonic inference engine that propagates intermediate results and automatically invokes

analysis operations. We introduced Generic Fuzzy Reasoning Nets (GFRNs) as a dedicated,

abstract formalism to specify domain-specific heuristics and integrate automatic analysis

operations. A major concern with the development of the GFRN language was that GFRN

specifications can be customized with little effort to changing application contexts. The

motivation for this requirement was our observation that the heuristics and operations applied

in a schema analysis process depend on the specific characteristics of the LDB system under

investigation. Syntax and semantics of the GFRN language have been defined in the formal

framework of necessity-valued possibilistic logic.

Based on the notion of fuzzy Petri nets, we have developed an inference algorithm to

operationalize GFRN specifications in human-centered DBRE processes. The implementation

of this inference algorithm in a procedural programming language is straight-forward. We

experimented with implementations in Progres and Java. Early experiences with practical

application examples showed the feasibility of our approach. However, they also emphasized

the importance of dedicated user interface concepts to communicate imperfect information to

the reengineer and efficiently guide him/her to a complete and consistent analysis result. We

implemented and refined such concepts in the current version of our DBRE environment

(Varlet Analyst). An evaluation of the Varlet Analyst in an industrial project clearly showed the

benefits of our approach over existing DBRE tools and validated the hypothesis stated at the

beginning of this section.

CHAPTER 5 CONCEPTUAL SCHEMA

MIGRATION AND DATA

INTEGRATION

Someone must maintain the mapping between the entity-relationship diagram and the relations in the

database as the database evolves. This can be a difficult task.

Antis et al. [AEP96]

In the previous chapter, we developed concepts, techniques, and tools to support reengineers in

analyzing legacy database (LDB) schemas. The output of such an analysis activity is a logical

schema that has been annotated structurally and semantically as far as possible

(cf. Figure 4.46). Based on this intermediate result, the present chapter focusses on two

important subsequent database reengineering (DBRE) activities, namely conceptual schema

migration and data integration.

schema migrationAs exemplified in Chapter 2 (Section 2.4.2), conceptual schema migration aims to produce an

abstract design for an LDB schema. High-level modeling concepts like objects, aggregation,

and inheritance are employed in this human-intensive activity that cannot be performed fully

automatically [ALV93]. The resulting conceptual schema provides a level of abstraction that is

suitable to facilitate understanding and assessment of an LDB’s static structure. Furthermore, it

is a prerequisite to achieve a large variety of maintenance goals, e.g., the integration with

enabling technologies like object-orientation, the Internet, and Client-Server architectures

[Uma97].

problem of

iterations

Most currently existing computer-aided reengineering (CARE) tools that support schema

abstraction and migration generate an initial conceptual schema based on a given logical

schema (e.g., [BGD97, MCAH95, RH97, Fon97, MAJ94]). Subsequently, the reengineer uses

another tool to restructure, enhance, and annotate this initial conceptual schema (e.g., [Log97,

Rat98]). Even though, these approaches allow to validate the consistency of the created

conceptual schema itself, they hardly provide any support to check the consistency among

different documents in the entire DBRE project. This is a severe limitation because the DBRE

process has an explorative and iterative character (cf. Chapter 2). Whenever the information

about the logical schema is revised, the consistency with the conceptual schema that has been

created so far is lost. Using such loosely integrated approaches, the only possibility to re-

establish consistency automatically is to generate the conceptual schema anew. In this case,

interactive enhancement and redesign operations performed by the reengineer are lost and have

to be repeated manually.

problem of

data integration

Like in our case study, many DBRE projects focus on integrating LDBs with new technologies

rather than aiming on their complete replacement. Often, the conceptual schema is used as a

basis to define the class structure of object-oriented applications that access the LDB.

Frequently used programming languages for such applications are Java [WM97] and C++

[Str97]. In such scenarios, a programmer has to develop a middleware component for data

integration that implements the data dependencies between the logical schema and the

migrated conceptual schema. Middleware generators that have been developed to forward

114 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

engineer new information systems prescribe a canonical (mostly object-relational) mapping

that generally lacks the flexibility to integrate arbitrary pre-existing LDB schemas. Even with

approaches which focus on integrating pre-existing systems it is still the responsibility of the

reengineer to define a consistent schema mapping description.

approach:

tight integration To overcome this limitation, we adapt techniques described by Nagl et al. [Nag96] to the

DBRE domain. This means that we propose a fine-grained integration of tools used in the

different phases of the DBRE process (i.e., schema analysis and migration) by a common

migration graph structure. This approach enables incremental change propagation and

consistency preservation and, thus, supports process iterations. In addition, the migration graph

is used to map changes in the conceptual schema back to the implementation model, i.e., the

logical schema. Another benefit of this tight integration is that it allows to generate middleware

components for data integration based on the schema mapping information that is maintained

implicitly. This is a significant progress over existing approaches to middleware generation

where it is the responsibility of the reengineer to define a consistent schema mapping

description manually [CER99, Hüs97, ONT96, Rad95].

The described approach to incremental schema migration and generative data integration is

illustrated in Figure 5.1. The migration process starts with a canonical translation of the

analyzed logical schema into a conceptual data model. Then, the resulting conceptual schema

is redesigned and extended interactively by the reengineer. The grey parts in Figure 5.1 actually

belong to the schema analysis process described in Chapter 4. They are shown to emphasize

the fact that incremental schema migration in a tightly integrated DBRE environment enables

iterative and intertwined execution of analysis and migration activities. Internally, the logical

schema and the conceptual schema are represented by their abstract syntax graphs (ASG). The

dependencies between both schemas are represented by an intermediate graph called the

schema mapping graph (SMG). In case of process iterations, the information maintained in the

SMG is employed to control incremental change propagation operations that aim to re-

establish project consistency. Moreover, the SMG is taken as the basis to generate an object-

relational middleware layer without the need for the user to define schema dependencies

explicitly.

The approach outlined above is described in detail in the following subsections: Section 5.1

introduces and formalizes the migration graph model which covers both ASG representations

and the SMG model. Based on this formalization, in Section 5.2, we employ triple graph

grammars [LS96] to specify a mapping between the relational and the conceptual data model.

This mapping is used to perform an automatic translation of a relational schema to an initial

conceptual schema. In most cases, such an automatic translation is unsatisfactory and has to be

redesigned or extended to meet new requirements. Hence, in Section 5.3, we define a catalog of

conceptual schema redesign transformations that can be applied interactively by the reengineer.

Section 5.4 is dedicated to the problem of re-establishing the consistency and preserving as

many of these interactive redesign transformations as possible in case of process iterations. An

implementation of the described concepts and techniques is presented in Section 5.5. In

Section 5.6, we describe a generative approach to object-relational data integration based on

mapping information that has been created and maintained implicitly during the schema

migration process. Section 5.7 evaluates our approach and reports on our practical experiences

THE MIGRATION GRAPH MODEL 115

with application examples. A discussion of related work in this domain is presented in

Section 5.8. Finally, Section 5.9 gives a summary of the main contributions of this chapter.

5.1 The migration graph model

graphThe formal basis for the migration graph is the concept of a directed, attributed graph with

node and edge types [Eng86]. In the following, we use the term graph for abbreviation

whenever we refer to a directed, attributed graph with node and edge types. Such a graph can

be defined as shown in Definition 5.1.

Definition 5.1 Graph

G := (N, E, yN, A) is a graph over two given type label sets LN, LE with:

•N(G):=N is a finite set of nodes;

•E(G):=E ⊆ N×LE×N is a finite set of edges;

•yN(G):N→LNis a typing function for nodes;

•A is a finite set of node attributes, each a∈A is a partial function a:N→dom(a),

where ’dom(a)’ denotes the domain of attribute ’a’.

Moreover, we define the following auxiliary functions:

•s(G):E→N with and t(G):E→N with s((n1,l,n2)):=n1 and t((n1,l,n2)):=n2, return for each

edge (n1,l,n2)∈E its source and target;

❑

•yE(G):E→LE returns for each edge (n1,l,n2)∈E its label.

reengineer

schema

(Chap. 4)

analysis

translation

and change

propagation

conceptual

initial

migration

and

redesign

Figure 5.1. Incremental schema migration and generative data integration

conceptual

logical

schema schema

logical

ASG

conceptual

ASG

schema

migration graph model

middleware

generation OO

REL

information flow

represented by

schema schema

mapping

graph

116 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

graph model in

Progres In the following, we are not interested in defining particular instances of migration graphs but

we aim on defining a schema for a graph class that contains all valid migration graphs. We call

such a schema a graph model. We have used the formal specification language Progres

(PROgrammed Graph REplacement Systems) [Sch91, SWZ95] to define and implement the

graph models discussed in this dissertation.

migration graph

model The migration graph model mainly consists of two ASG models, one for the logical data model

and the other one for the conceptual model. Both ASG models are connected by an

intermediate graph model, the SMG model. Figure 5.2 shows the most important parts of this

graph model in a diagrammatic Progres notation that is similar to UML [UML97]. To avoid

confusion with classes and associations which are modeled within a conceptual DB schema,

we keep on using the graph-oriented terms node type and edge type instead of class and

association like in UML. Note, that cardinalities of edge types are denoted in form of intervals.

If no cardinality is specified in the diagram its default value is defined as [1:1]. A formalization

of the complete migration graph model in form of a textual Progres graph schema has been

included in Appendix A.

5.1.1 Graph-based representation of logical and conceptual schema

logical schema The left-hand side of Figure 5.2 represents the ASG model for the analyzed logical schema

which is derived directly from Definition 4.1 on page 58. Names of edge types that begin with

c_ represent syntactical containment relationships in the ASG model. A node of type LSchema

represents the root of the ASG model for a logical schema. This syntactical root contains a set

of nodes of type RS and LType, which represent relation schemas and column types,

respectively. Each RS node has an attribute rsname that stores the name of the represented RS.

An RS is composed by a non-empty set of Variant nodes, a primary key (LKey) that is

referenced by an c_pk edge, and a set of alternative keys which are referenced by edges of type

c_ak. Each Variant node contains a set of foreign-keys (ForKey) and a non-empty set of

columns (Column). A column has an lt edge to point to its type. An IND is represented by one

of three node types I-IND,C-IND, and R-IND with respect to its semantic classification (cf.

page 58 and [FV95]). These nodes types are derived from an abstracta node type IND which

has two out-going edge types c_f and c_k that point to a key and a foreign key node.

a In Progres schema diagrams, abstract node types are represented as boxes with sharp corners.

rational for

selecting the

conceptual schema

Since the introduction of the Entity-Relationship (ER) model in 1976 by Chen [Che76], many

variations and extensions of this conceptual data model have been proposed to facilitate the

description of data structures. The most common extensions are concepts for abstraction by

aggregation and inheritance [BCN92]. Such extended ER models have had a major influence

on the development of the key concepts for modern, object-oriented programming languages.

In the context of our application domain (DBRE), we approach the problem of choosing a

specific conceptual model from the opposite direction: the expressiveness of our conceptual

data model is mainly determined by the distributed programming language Java and its object-

oriented database binding ODMG-2.0 [CBB+97] because, currently, Java-based technology is

the migration platform that provides the greatest potential to leverage existing information

systems. The type system of Java does not allow for multiple inheritance [SCC+93]. Hence,

we have chosen a conceptual data model that restricts classes to have at most one

THE MIGRATION GRAPH MODEL 117

generalization. As a consequence, we do not have to deal with typical inheritance conflicts like

repeated inheritance and name collisions. The object-oriented data model proposed by the

OMG (Object Management Group) defines further concepts for ordered list structures and

complex attributes [CBB+97]. In our conceptual data model, we have not defined an explicit

notion of complex attributes for the sake of simplicity. This is not a severe limitation as

complex attributes can always be represented by aggregated objects. Moreover, we decided to

consider only set-valued relationships to reduce the complexity of our graph model.

Figure 5.2 Migration graph model

    

           

  ¡ ¢   £  ¤

£ ¢ ¤  ¥ ¦ § ¨ £ ¡ ¢   © ª

¡ ¥ ¢  ¥ ¦ § ¨ £ ¡ ¢   © ª

¡ ¥ ¢ ¡ « ¡ ¥ ¬ ¨  « « ¬ § ¥  ª

¡ ¥ ¢ ¤ ¥ ¢ ® ¨   ¡ § © § ¢ ª

¯  



 

¯   °  ±

¯  

       

 ²

  ¡ ¢   £  ¤

¢ £  ¥ ¦ § ¨ £ ¡ ¢   © ª

¯  



 ³



   ´     ³ 

¯   µ



² ³   ¶ 

¯   · ±  

¸ ² ³   ¶ 

¯   ² ³ 

¹   ´  º »  

  ¡ ¢   £  ¤

¥  ¥ ¦ § ¨ £ ¡ ¢   © ª

® § ¼ ¥ ½ ¬ ¡ ¨ £ ¡ ¢   © ª



· ±  

¸ · ±  

¤ ¾ ¿ À

Á Â ¨  Ã

¤ ¾ ¬ ¡

Á Ä ¨  Ã

¤ ¾ ¤ ¡

Á Â ¨  Ã

¤ ¾ ¤ ¬

Á Â ¨  Ã

¤ ¾ Å

Á Ä ¨  Ã

£ ½ Æ

Á Â ¨  Ã

£ ½ 

Á Â ¨ Ä Ã

¤ ¾ ¥ Ç

¤ ¾ ¤ Ç

Á Â ¨ Ä Ã

¤ ¾ Ç ¥

Á Â ¨  Ã

Á Ä ¨  Ã

¤ ¾ ¥ ¡ ¡

Á Â ¨  Ã

¹    ³      

  ¡ ¢   £  ¤

£ ¢ ¤ ¡ « ¡ ¥ ¬ ¨  « « ¬ § ¥  ª

£ ¢ ¤ ¤ ¥ ¢ ® ¨   ¡ § © § ¢ ª

¹ È È ´  È     

¤ ¾ ¤ « ¬

Á Â ¨  Ã

Á Ä ¨  Ã

¤ ¾ Ç ¤

Á Â ¨  Ã

Á Ä ¨  Ã

¤ ¾ ¼ Ç

Á Â ¨  Ã

Á Ä ¨  Ã

¦ ¾ ¬ £ Á Â ¨ Ä Ã¦ ¾ ¤ £

¦ ¾ ¬ ¡ Á Â ¨ Ä Ã ¦ ¾ ¤ ¡Á Â ¨ Ä Ã

¦ ¾ Å

Á Â ¨  Ã

Á Ä ¨  Ã

¦ ¾ ¤ ¬Á Â ¨ Ä Ã

¦ ¾ Å ©

Á Â ¨  Ã

Á Ä ¨  Ã ¦ ¾ Å ¾  

Á Â ¨ Ä Ã

¦ ¾    ®

Á Â ¨ Ä Ã

¦ ¾  ¾  

Á Â ¨  Ã

¦ ¾ ¬ Ç

Á Â ¨  Ã

¦ ¾ ¤ Ç

Á Â ¨ Ä Ã

¦ ¾ ¤ « ¬

Á Â ¨ Ä Ã

¦ ¾ ¥

Á Â ¨ Ä Ã

¦ ¾ ¢   ®

Á Â ¨ Ä Ã



°  ±

¡ ¥ ¢

Á Â ¨  Ã

£ ¢ ¤

Á Â ¨  Ã

¦ ¾ ¢

Á Â ¨  Ã

É  ´ °  ±



   

  ¡ ¢   £  ¤

¤ ¬  ¥ ¦ § ¨ £ ¡ ¢   © ª

¥  £ ¡ ¢ ¥ ¤ ¡ ¨  « « ¬ § ¥  ª

µ  ´    

¸ °  ±

¦ ¾ Å £

Á Â ¨  ÃÁ Ä ¨  Ã

Á Â ¨  Ã

¤ ¡

¬ ¡

Á Â ¨  Ã



  » ¶ 

  ¡ ¢   £  ¤

¤ « ¬  ¥ ¦ § ¨ £ ¡ ¢   © ª

¯   

  

¥ ¾ Å  ¥

Á Â ¨  Ã

¦ ¾  ®

Á Â ¨  Ã

Á Â ¨ Ä Ã



   

  ¡ ¢   £  ¤

  Å Ç  ¨  « « ¬ § ¥  ª

  

¤ ¾ ¼

¤ ¾ Ç

¤ ¾ ¤

Á Â ¨  Ã

¤ ¾ Æ Ç

¯     

¢ ¾ Å  ¥

  ¡ ¢   £  ¤

¤ ¡  ¥ ¦ § ¨ £ ¡ ¢   © ª

  ¡ ¢   £  ¤

¬ ¡  ¥ ¦ § ¨ £ ¡ ¢   © ª

Á Â ¨  Ã

Á Â ¨ Ä Ã

Á Â ¨  Ã

118 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

conceptual schema The right-hand side of Figure 5.2 depicts the ASG model that specifies the chosen conceptual

model. A node of class CSchema is the syntactic root of this ASG. Analogously to the logical

schema, this root contains a set of attribute types (CType) and a set of classes (Class). A

boolean attribute (abstract) is used to store the information whether a class is abstract or

concrete, i.e., whether a class can be instantiated. The name of a class is stored in attribute

clname. Inheritance relationships are represented by nodes of type Inheritance with two edge

types sub and sup which point to the participating subclass and its generalization, respectively.

Classes are composed by a set of Attribute nodes and an optional key (CKey). A CKey node

itself is composed by a non-empty set of Attribute nodes. An Attribute node stores its name

(aname) and a default value (default). Associations and aggregations are represented by node

types Association and Aggregation which are generalized to an abstract node type

Relationship. For each Relationship node, attributes srcname and tarname store the role names

of the classes that participate as source and target of the relationship, respectively. Attributes

tarcard and tartotal represent the information about the cardinality of the target class. The

value of attribute tarcard defines the maximum cardinality for the target of the relationship.a If

attribute tartotal is true the relationship is total w.r.t. to its target. The same information is

represented for the other side of an association by attributes srccard and srctotal. Note, that

these attributes are not needed for node type Aggregation because we restrict the source of an

aggregation to represent a total, single instance.

a A zero value means infinity.

graph constraints

graph tests

The migration graph model in Figure 5.2 contains the specification of a number of simple

constraints by means of cardinalities of edge types, e.g., the restriction to single inheritance.

Still, these mechanisms are not sufficient to express more complex constraints of correctness

that consider a larger graph context and attribute values. Examples for such constraints are

scoping rules like "attribute and reference names have to be unique per class" and "class

names have to be unique per schema", etc. In Progres, it is possible to denote complex

constraints by so-called graph constraints which are enforced by the graph repository on the

occurrence of predefined events (cf. [Tea99, p. 15]). In the case of constraint violations,

automatic repair operations re-establish the consistency of the graph. However, this strategy is

not suitable for our application. In evolutionary and iterative DBRE processes, the reengineer

needs a mechanism that validates correctness constraints on demand but violations should be

indicated rather than eliminated automatically. Hence, in contrast of using graph constraints,

we employ so-called graph tests to check the migration graph for violations of constraints

(cf. [Tea99, p. 20]). Graph tests allow to specify conditions for constraint violations on a high

level of abstraction. They can be performed in predefined situations to report about the

correctness of the conceptual schema. This provides the reengineer with the necessary

flexibility to react on indicated constraint violations.

Figure 5.3 shows an example for a graph test that checks for duplicate class names in the

conceptual schema. When a graph test is applied to a given migration graph it searches for a

subgraph that is an isomorphic match for the graph specified in the graphical body of the test.b

The test evaluates to true if and only if such a subgraph can be found in the migration graph. In

addition, this match has to fulfill the attribute conditions specified below the graphical body.

bEven though the general problem of finding such a match is NP-complete [Chr75], Zündorf provided the Progres

compiler with an efficient algorithm that solves this problem for most practical applications [Zün95]. The central

idea of this algorithm is to employ typing and cardinality information provided by the graph model.

THE MIGRATION GRAPH MODEL 119

Unique node numbers are used to refer to particular nodes in the condition part of the test. The

graph test in Figure 5.3 searches for two Class nodes that belong to the same conceptual

schema and have the same value in attribute clname. Likewise, the Progres specification of the

migration graph model in Appendix A includes the usual scoping and correctness constraints

for relational and object-oriented schemas as further (negative) graph tests.

5.1.2 The schema mapping graph model

The schema mapping graph (SMG) connects the ASGs of the logical and the conceptual

schema and represents their interdependencies. The graph elements of the SMG model are

displayed in grey color in Figure 5.2.a The information maintained in the SMG serves two

separate purposes: (1) it is the basis for the initial schema translation and (2) it enables the

generation of schema mapping descriptions for middleware components that facilitate data

integration. The SMG model is rather complex because it has to provide suitable flexibility to

allow for alternative schema mappings. In the following, we will give a brief overview on the

graph elements involved. Their purpose is motivated and described in more detail in the

following sections.

a Note, that all names of edge types that belong to the SMG start with m_.

mapping types

and classes

A node of type MapSch is used to connect the syntactic roots of both ASGs. MapType nodes

are used to map column types to attribute types. Each variant in the logical schema is

represented by a concrete class in the conceptual schema. However, if an RS has more than one

variant, they usually comprise common columns which implies an inheritance hierarchy with

abstract classes in the conceptual schema. Consequently, an abstract class is mapped to more

than one variant, namely all variants which are represented by its concrete subclasses. In the

SMG, correspondences among classes and variants are represented by nodes of type MapV (cf.

Figure 5.2).

mapping

inheritance

relationships

Inheritance relationships in the conceptual model can be mapped in two different ways to

constructs in the logical schema. Firstly, they can be mapped to the inclusion of more specific

variants in less specific variants that belong to the same RS. Consider Figure 2.16 on page 23

as an example for such a situation. In this example, Variant 4 of table PRODREF is less specific

than Variant 3, i.e., Variant 4 is included in Variant 3. This situation is represented by an

inheritance relationship in the conceptual model which is mapped by a node of type MapInc to

the two variants (cf. Figure 5.4). An edge of type m_vs is used to reference the variant which is

test DuplicateClassName =

condition ‘2.clname = ‘3.clname;

Figure 5.3. Graph test DuplicateClassName

‘3 : Class‘2 : Class

c_cl c_cl

‘1 : CSchema

120 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

more specific, while an edge of type m_vg references the variant which is more general. The

second possibility is to map inheritance relationships to INDs in the logical schema that have

been classified as inheritance relationships (I-INDs) in the analysis process (cf. page 20). In

this case, the mapping is represented by a node of type MapIIND.

mapping keys Nodes of type MapKey are used to map primary keys in the logical schema to keys in the

conceptual schema. According to the ODMG data model, our conceptual model includes the

notion of unique object identifiers (OIDs) for instances of classes [CBB+97]. Hence, it is not

required that every class contains a value-based key. Still, if we aim for object-relational data

integration, OIDs have to be resolved to value-based keys in the logical data model. For this

purpose, every class has an edge of type m_id that references a MapKey node in the schema

mapping graph.

mapping attributes

and relationships Attributes are mapped to columns by nodes of type MapCol. To provide the flexibility to allow

for different alternative schema mappings, we admit that attributes of a single class can be

mapped to columns in separate RS. For such remote columns, the SMG has to maintain the

access path from the RS that includes the value-based key associated to the class and the RS

which includes the remote column. This information is represented by edges of type a_via: if a

MapCol node does not have an a_via edge the mapped column belongs to the RS that contains

the key referenced by the m_id edge of the class that contains the mapped attribute. Otherwise,

the mapped column belongs to a different RS and the a_via edge of the corresponding MapCol

node points to a set of MapRIND nodes. These nodes represents the access path from the RS

that contains the key referenced by the m_id edge to the RS that contains the mapped column.

Each MapRIND node is connected to an R-IND node which logically represents a foreign key

that has to be dereferenced to access the mapped column. Analogously to columns, MapRel

nodes and r_via edges are used to map associations and aggregations to sets of foreign keys

(represented by nodes of type R-IND).

Figure 5.4. Sample situation: correspondence among variant and inheritance structures

AGRAPHICAL FORMALISM TO IMPLEMENT SCHEMA TRANSLATORS 121

5.2 A graphical formalism to implement schema translators

Most existing approaches to conceptual schema translation employ rule-based transformation

systems. Such transformation rules are often specified in a textual pattern language [MCAH95]

or in a calculus based on first-order logic and set theory [BGD97, HHHR96]. However, despite

their precise semantics such transformation rules are difficult to understand. Therefore,

researchers typically employ diagrams to explain the meaning of transformation rules.

Furthermore, some formal specifications cannot be executed directly but have to be

implemented in a programming language on a lower level of abstraction. In our approach, we

employ graph grammars to specify schema transformations because they are executable and

have the expressiveness of diagrams.

graph grammarsA number of graph grammar formalisms have been proposed based on different theories with

their specific advantages and drawbacks. A comprehensive overview on these approaches is

given in [Roz97]. The approach used in this chapter has been known as the logic-based

approach [Sch95]. It is the basis for the specification language Progres which has been

introduced and formally defined by Schürr [Sch91]. In the following, we will give an example-

driven, semi-formal introduction to the essential concepts of this graph grammar formalism

which are necessary to understand our application. Analogously to classical (textual)

grammars, a graph grammar consist of a start graph and a set of (graph) productions.

graph productionIn general, a graph production can be defined as a pair of graphs, a set of application

conditions, and a set of attribute transfer clauses (cf. Definition 5.2). The two graphs are called

the left-hand side and the right-hand side of the production, respectively. The application of a

production to a given graph is described in the following Definition 5.3. Note, that Progres

productions allow for extended concepts like optional nodes, node sets, path expressions, etc.

[Sch91]. However, the semantics of these extended concepts can be defined based on the

primitive concepts described below [Zün99].

Definition 5.2 Graph production

Agraph production is a tuple r:(P, Q, C, T), where

•P(r)=P and Q(r)=Q are two graphs over the same sets of node and edge type labels;

P(r) is called the left-hand side and Q(r) is called the right-hand side of r.

•C is a set of application conditions.

•T is a set of attribute transfer clauses.

Definition 5.3 Application of a production

A production r:(P,Q,C,T) is applied to graph G in the following five steps:

•CHOOSE an occurrence of the left-hand side P in G. P has an occurrence in graph G if

there is a morphism m:P→G which preserves source and target and labelling mappings.

Furthermore, the occurrence has to fulfill the so-called identification condition which

prescribes that elements on the left-hand side which do not occur on the right-hand side can

uniquely be identified in G, i.e., ∀x∈P\Q, x’ ∈ P: m(x)=m(x’) ⇒x = x’.

•CHECK the application conditions according to C. If they are fulfilled the occurrence of P

in G is called a match for P.

122 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

•REMOVE all elements in G which have been matched to elements in P that do not occur in

Q, i.e., remove m(P\Q) from G. If the removal of nodes causes dangling edges in G these

dangling edges are removed as well.

•ADD all elements to G which are new in Q, i.e., which do not occur in P. These new

elements are glued to G in the preserved graph elements identified by m(P∩Q). We denote

the morphism m:Q→G that identifies the (newly created) occurrence of Q in G as comatch.

•TRANSFER attribute values to nodes in G that match nodes in Q according to the attribute

transfer clauses specified in T.

In the following, we denote G↓(r,m) for the graph that is produced by the application of a

❑

production r to another graph G (in a match m).

Figure 5.5 shows a simple Progres production AddRSToLSchema which specifies the extension

of a logical schema by a new RS.a The left-hand side of production AddRSToLSchema contains

only a single node of type LSchema. If the production is applied this node is preserved because

it occurs on the right-hand side with an identical node number. Furthermore, G is extended by

three new nodes and three new edges which represent a new RS with one variant and a primary

key.

5.2.1 Triple graph grammars

Usually, a graph grammar is used to define a single graph model in terms of all possible graphs

that can be derived by applying the productions to a given start graph. They are less suitable to

specify the mapping between two different ASG models as needed in our application. In

[LS96, Lef95], Lefering and Schürr propose an extended formalism called triple graph

grammars that is dedicated to this problem. A triple graph grammar consists of a set of

mapping rules. Basically, each mapping rule consists of a production triple, i.e., it contains

three productions. Two of these productions specify equivalent extensions of the first and the

second ASG, while another production is used to extend a mapping graph that represents the

correspondences between both ASGs.

a For layout reasons, the right-hand side of a production might also be below its left-hand side.

‘1 : LSchema

production AddRSToLSchema =

::=

Figure 5.5. Graph production AddRSToLSchema

1’ = ‘1

2’ : RS

4’ : Variant

c_RS

c_pk

3’ : LKey

c_v

AGRAPHICAL FORMALISM TO IMPLEMENT SCHEMA TRANSLATORS 123

Figure 5.6 shows an example for such a mapping rule. In this notation which has been

proposed in [JSZ96] the three productions are separated by vertical, grey bars. Triple graph

grammars deal with extending productions only, i.e., no graph elements are removed. Hence, a

single graphical diagram can be used to represent both sides of an extending production in a

condensed way. For example the left production of the mapping rule in Figure 5.6 is a

condensed notation for production AddRSToLSchema in Figure 5.5. The entire mapping rule

MapRSToClass in Figure 5.6 specifies that an extension of a logical schema by a new RS

corresponds to the extension of the conceptual schema by a new class. The production in the

middle part of the mapping rule is used to update the SMG that represents the correspondence

between both ASGs.

generation of

reverse and

forward translators

A triple graph grammar allows to generate automatic translators that create conceptual

schemas from logical schemas (reverse mapping) and vice-versa (forward mapping). Such an

automatic translator consists of a set of conventional graph grammar productions. Each such

production is derived from one mapping rule specified in the triple graph grammar. A reverse

production prv is derived from a mapping rule by choosing its black parts and its left side as the

left-hand side of prv and the elements in the entire mapping rule as the right-hand side of prv

(cf. Figure 5.7). Analogously, the forward production pfw is derived by choosing the black parts

and the right-hand side of the mapping rule as the left-hand side of pfw and the elements in the

entire mapping rule as the right-hand side of pfw (cf. Figure 5.8).

attribute transfer

clauses

As defined in Definition 5.2, Progres productions might include attribute transfer clauses.

They are added in textual form under the graphical part of the production. The first attribute

transfer clause in Figure 5.7 assigns the boolean value false to attribute abstract of the new

Class node ’11. The second transfer clause transfers the name of the mapped RS to this new

node. In a triple graph grammar, we add transfer clauses (and application conditions) for both

derivable productions to each mapping rule. This is exemplified in Figure 5.6 where we use the

suffixes rv and fw to denote whether the clauses belong to the reverse or the forward

production.

mapping rule MapRSToClass =

Figure 5.6. Mapping rule MapRSToClass

‘11 : Class

‘1 : LSchema

‘2 : RS

‘4 : Variant

c_RS

c_pk

‘9 : CSchema

c_cl

conditionfw:

empty(‘11.-m_id->)

transferrv:

‘11.clname:=’2.rsname

‘11.abstract:=false

transferfw:

‘2.rsname=’11.clname

‘5 : MapSch m_cs

m_ls

‘8 : MapV

‘3 : LKey

c_v m_cl

m_v

‘6 : MapKey

m_lk ‘10 : CKey

m_ck

c_ck

m_id

‘11.abstract=false

124 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

application

conditions In Progres,application conditions often contain so-called path expressions [Tea99, p. 33]. Path

expressions consist of a sequence of edge traversals separated by dots or the ampersand

symbol, e.g., -e1-> & <-e2- defines a path over an outgoing edge of type e1 and an ingoing e2

edge. When a path expression is applied to a node n (or a set of nodes) its application returns

all nodes that can be reached from n by traversing the specified path. For example, the

expression ’11.-m_id-> in the condition part of MapRSToClassfw (Figure 5.8) returns all

variant nodes that can be reached from node ’11 over an outgoing edge of type m_id. The

boolean predicate empty returns true if and only if its argument is an empty set. This condition

is necessary to enable that several classes in an inheritance hierarchy can be mapped to variants

of the same RS: new RS nodes are created for classes only if they do not have the same value-

based key (referenced by edge m_id) as another class which has been mapped before.

Moreover, the attribute condition "‘11.abstract=false" ensures that only concrete classes are

mapped to RS’ in the logical schema.

Figure 5.7. Reverse production MapRSToClassrv

production MapRSToClass_rv =

::=

transfer 11’.abstract := false;

11’.clname := ‘2.rsname;

‘2 : RS

‘9 : CSchema

‘3 : LKey

‘4 : Variant

c_RS

c_v

c_pk

m_ls m_cs

‘5 : MapSch‘1 : LSchema

10’ : CKey

3’ = ‘3

4’ = ‘4

m_ls m_cs

5’ = ‘5

c_RS

1’ = ‘1

c_v

c_pk

2’ = ‘2 c_cl

c_ck

9’ = ‘9

11’ : Class

m_ck

m_lk 6’ : MapKey

m_v m_cl

8’ : MapV

m_id

AGRAPHICAL FORMALISM TO IMPLEMENT SCHEMA TRANSLATORS 125

start graphSimilar to start symbols of conventional textual grammars, graph grammars are applied to an

initial graph that is called start graph. In our application, the minimal start graph consists of the

syntactic root nodes for the ASGs of both schemas and graph elements that represent all

attribute and column types (cf. Figure 5.9). Pairs of equivalent atomic data types are mapped

by nodes of type MapType. The correspondences among atomic types in the logical and the

conceptual schema, respectively, depends on the concrete application context of the DBRE

tool. Different DBMS provide different data types. Hence, in our approach, the reengineer has

to enter atomic type correspondences in an initial customization dialog of our DBRE tool.a

a In some cases, it might also be necessary to implement type conversion functions. In principle, such functions

can be stored in further attribute of MapType nodes. However, we abstract from this detail in the following

discussion.

production MapRSToClass_fw =

::=

condition empty ( ‘11.-m_id-> );

‘11.abstract = false;

transfer 2’.rsname := ‘11.clname;

‘11 : Class

‘5 : MapSch

‘10 : CKey

‘1 : LSchema m_ls m_cs

c_cl

c_ck

‘9 : CSchema

5’ = ‘5

10’ = ‘10

3’ : LKey

m_ls m_cs

m_lk

4’ : Variant m_v

c_v

c_RS

1’ = ‘1

c_pk

2’ : RS m_ck

6’ : MapKey

m_cl

8’ : MapV

c_ck

11’ = ‘11

c_cl

9’ = ‘9

m_id

Figure 5.8. Forward production MapRSToClassfw

126 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

In typical DBRE scenarios, the start graph contains further parts of an analyzed logical schema

ASG which are going to be translated to conceptual schema constructs. Moreover, during the

migration process it often occurs that modifications in conceptual schemas have to be

remapped to the original logical schema. In this case, the mapping algorithm is applied to a

start graph that contains ASG elements from the logical schema as well as from the conceptual

schema (illustrated by the grey subgraph in Figure 5.9).

translation

algorithm In sections 5.2.2-5.2.4, we complement the triple graph grammar specification that defines a

mapping among logical and conceptual DB schemas. The translation process is based on the

execution of forward and reverse productions that are derived from each mapping rule. The

corresponding translation algorithm is described in Figure 5.10. It iteratively chooses a

production rfrom the set of all derived productions Rthat has a match in the current migration

graph G. Furthermore, it is validated that this match cannot be extended to a match that

includes all SMG elements on the right-hand side of r. This is to avoid multiple applications of

the same production in the same match. If such a match can be found, the corresponding

production is applied to the host graph. These steps are iteratively performed until no

production in Rfulfills the condition in lines 8 and 9.

Figure 5.9. Startgraph for schema migration

1 :LSchema

2 : LType

3 : LType

4 : LType

‘5 : LType

c_lt

...

11:CSchema

12 : CType

13 : CType

14 : CType

15 : CType

c_ct

...

7 : MapType

8 : MapType

9 : MapType

10: MapType

m_ctm_lt

m_ct

m_lt

m_ctm_lt

6 :MapSch m_cs

m_ls

logical

schema

conceptual

schema

algorithm MapSchema(R,S)

1) input R, a set of forward and reverse productions derived from a triple graph grammar

2) input S, a start graph (according to Figure 5.9)

3) output G, a migration graph (according to Figure 5.2)

4) begin

5) let G=S

6) repeat

7) let r:(P,Q,C,T)∈Rbe a production that fulfills the following conditions

8) - P has a match in Grepresented by a morphism m:P→G

9) - this match cannot be extended in G by a match for the SMG elements in Q

10) let G = G↓(r,m)

11) until no production p∈P fulfills the conditions in lines 8 and 9

12) return G

13) end

Figure 5.10. Algorithm MapSchema

AGRAPHICAL FORMALISM TO IMPLEMENT SCHEMA TRANSLATORS 127

The described algorithm defines how triple graph grammars can be employed for bi-directional

schema translation. Note, that the productions are not tested and applied in a predefined order.

(The specification of the schema mapping rules ensures confluence [Roz97, p. 105] for all

production applications.) For larger schemas this simple algorithm lacks efficiency. This

problem can be solved by implementing a procedural framework that defines an order for the

application of the derived graph productions. The procedural framework that has been

implemented in our DBRE environment is described by Holle [Hol97].

5.2.2 Mapping variants to class hierarchies

In our approach, RS in the logical schema are initially mapped to classes in the conceptual

schema. However, in contrast to other tool-based approaches to schema translation, we

consider the fact that relational DBs often comprise hidden inheritance structures in form of

different variants of tuples in RS (cf. page 20). Consequently, RS with more than one variant

are mapped to several classes which participate in an inheritance hierarchy.a In Figure 5.6, we

presented a mapping rule which maps an RS to a class. This rule is sufficient for the standard

case where an RS has only one variant of tuples.

aDue to the restriction to single inheritance, there might be variant structures that cannot be mapped to inheritance

hierarchies in our conceptual model. The reengineer has to resolve such conflicts by adding or removing variants.

MapVariantTo-

ConcreteClass

If an RS has more than one variant each additional variant has to be mapped to a concrete class

which participates in the same inheritance hierarchy like the class mapped in rule

MapRSToClass. Since the relational data model has no explicit concept for the corresponding

inheritance relationship, it is not considered in the (bi-directional) mapping rule

MapVariantToConcreteClass in Figure 5.11. (Note, that class node ’9 has been mapped by rule

MapRSToClass in Figure 5.6.)

mapping rule MapVariantToConcreteClass =

Figure 5.11. Mapping rule MapVariantToConcreteClass

‘10 : Class

‘1 :LSchema

‘2 : RS

‘4 : Variant

c_RS

‘8 : CSchema

c_cl

transferrv:

‘10.clname:=’2.rsname

‘10.abstract:=false

‘5 : MapSch m_cs

m_ls

‘7 : MapV

c_v

m_cl

m_v

‘3 : Variant

c_v ‘9 : Class

‘6 : MapV c_cl

m_cl

m_v

‘11: MapKey

m_id

conditionfw:

‘10.abstract=false

128 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

Example 5.1 Application of rules MapRSToClass and MapVariantToConcreteClass

Let us illustrate the correspondences among logical variants and concrete classes with a sample

RS (Tenant) that has two variants (cf. Figure 5.12). Note, that we use an example different from

our case study (Figure 2.13) to improve the readability of the graph representation and include

an abstract class in our consideration. Tuples belonging to the first variant of RS Tenant have

null values in column mtenant, while all remaining tuples have null values in column rent.

Conceptually, the first variant stores main tenants while the second variant represents sub

tenants. Both concrete variants share a common attribute name which gives rise to an abstract

generalization in the conceptual schema.

Figure 5.13 shows the graph representation for our example after applying rule MapRSToClass

followed by an application of rule MapVariantToConcreteClass. We skipped all nodes

representing schema, key, and type mappings in order to simplify the graph layout. Note, that

class nodes with a label "false" indicate concrete classes because this label indicates the current

value of the boolean attribute abstract.

❑

variant# name rent mtenant

1 ... ... NULL

2 ... NULL ...

Figure 5.12. Example RS Tenant with two variants an their conceptual representation

Tenant Tenant

{abstract}

name

MainTenant

rent

SubTenant

mtenant

logical schema conceptual schema

Figure 5.13. Example application of rules MapRSToClass and MapVariantToConcreteClass

application of

MapRSToClass

application of

MapVariantToConcreteClass

AGRAPHICAL FORMALISM TO IMPLEMENT SCHEMA TRANSLATORS 129

MapVariantsTo-

AbstractClassrv

Recovering inheritance hierarchies from variant structures might require the creation of

abstract classes. Abstract classes do not have corresponding constructs in the logical schema.

Consequently, we employ a unidirectional (reverse) production (MapVariantsToAbstract-

Classrv) to recover abstract classes from variant structures (cf. Figure 5.14).

Production MapVariantsToAbstractClassrv in Figure 5.14 uses a node of type MapV to map a

set of two or more variants to an abstract class if all variants in this set (5’) comprise a common

sets of columns (’7) and foreign keys (’6), respectively. In Progres, boxes with shadows (’5, ’6,

and ’7) represent node sets while a dashed shape is used to mark optional graph elements, i.e.,

the set of foreign keys (’6) is allowed to be empty.a The first application condition

"card(‘5>1)" of MapVariantsToAbstractClassrv specifies that a set of more than one variant is

needed to be mapped to an abstract class. The second condition specifies that ’5 may not

contain two distinct variants v, w where w includes v, i.e., the set of variants in ’5 has to be

minimal. Note, that the Progres set operator implies returns true if and only if its first argument

is a subset of its second argument. Furthermore, the sign # represents the inequality operator.

a For computational difficulties, the current Progres compiler (Version 9.2) does not allow the user to specify

edges among node sets. In this dissertation, we use this notation because it is easier to understand than equivalent

textual circumscriptions: whenever, we use an edge between two node sets we require the existence of an edge of

this type between each node in the first set and each node in the second set. (An implementation of the above

rules which is compliant with the current Progres compiler is described by Wadsack [Wad98]).

production MapVariantsToAbstractClassrv =

Figure 5.14. Production MapVariantsToAbstractClassrv

condition: transfer:

‘9.abstract:=true

‘9 : Class

‘4 : RS

‘5 : Variant ‘8 : MapV

‘1 : LSchema

c_RS

‘2 : MapSch m_cs

m_ls ‘3 : CSchema

‘9.clname:=’4.rsname

card(‘5)>1

c_cl

m_cl

m_v

c_v

‘10: MapKey m_id

‘7 : Column

c_col

‘6 : ForKey

c_fk

(* no variant in ‘5 inludes another variant in ’5 *)

for_all v:=’5 :: for_all w:=’5 ::

not (v.-c_col-> implies w.-c_col-> and

v.-c_fk-> implies w.-c_fk-> and v#w)

c_pk ‘11 : LKey m_lk

130 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

Example 5.2 Application of production MapVariantsToAbstractClassrv

Figure 5.15 illustrates the application of production MapVariantsToAbstractClassrv to the

example graph in Figure 5.13. Node set ’5 has been matched to both variant nodes because

they share a common column (name) and do not include each other.

❑

MapVariantsTo-

Inheritancerv

Now, that we have mapped variants to concrete and abstract classes in the conceptual schema,

we complement the inheritance hierarchy by adding Inheritance nodes to represent

generalization relationships. For this purpose, we employ a second reverse production

(MapVariantsToInheritancerv) in Figure 5.16. This production specifies that a class (’10) is a

direct generalization of class (‘7) if

•(C 1) the common properties (attributes and foreign keys) of all variants (‘6) that have been

mapped to superclass ‘7 are included in the set of properties common for all variants (‘5)

which have been mapped to the new subclass ‘10; and

•(C 2) class ‘7 is a direct superclass, i.e., there is no other class (‘11) which has been mapped

to a set of variants that includes the properties common for all variants in ‘6 but has a subset

of those properties common to all variants in ‘5.

Condition C1 is ensured by the textual application condition on the bottom-left corner of

Figure 5.16. Condition C2 is necessary to avoid the creation of transitive inheritance

relationships. It is specified in form of a negative application condition with an annotated

restriction (cf. [Tea99, p. 26]). The negative application condition is represented by a cancelled

node (‘11) which inhibits the application of the production if a match for this node can be

found that complies to the specified restriction. The textual restriction employs the use

statement to define three local variables, namely

•vars, the set of variants mapped to class ‘11,

•fks, the set of foreign keys common to all variants in vars, and

•cols, the set of columns common to all variants in vars.

Figure 5.15. Example application of production MapVariantsToAbstractClassrv

AGRAPHICAL FORMALISM TO IMPLEMENT SCHEMA TRANSLATORS 131

Note that the assignment "fks:= vars.-fk->" delivers the set of all foreign key nodes which

belong to any variant in vars. Hence, we used the Progres operator valid [Tea99, p. 27] to

further restrict this set to those foreign key nodes which belong to all variants in vars. In

addition, we have to admit common elements in node sets ‘5 and ‘6. This is specified by a so-

called folding clause in the bottom-right corner of Figure 5.16. If no folding clause was

specified the match would have to be isomorphic (cf. [Tea99, p. 25]).

If production MapVariantsToInheritancerv is applicable it creates a new Inheritance node (’9)

which is mapped over a MapInc node (‘8) to node sets ’5 and ’6. All variants which have been

mapped to the subclass of the new inheritance relationship are referenced by edges of type

m_vs, while all variants that correspond to the generalization are referenced by m_vg edges.

Example 5.3 Application of production MapVariantsToInheritancerv

Production MapVariantsToInheritancerv can be applied twice to the example graph in

Figure 5.15 The resulting graph, which completes the reverse mapping of RS Tenant to the

corresponding class hierarchy in the conceptual model is displayed in Figure 5.17. In this

representation, bold lines have been used to mark all additional edges. Note, that bold lines with

two labels (m_vg / m_vs) represent the existence of two separate edges with these labels between

the corresponding source and target nodes.

î







production MapVariantsToInheritancerv =

Figure 5.16. Production MapVariantsToInheritancerv

condition:

‘7 : Class

‘4 : RS ‘6 : Variant

‘5 : Variant ‘10: Class

‘1 : LSchema

c_RS

‘2 : MapSch m_cs

m_ls ‘3 : CSchema

‘9 : Inheritance

sup

sub

‘8 : MapInc

m_vs m_v_in

c_cl

m_vg

c_v

folding:

{‘5,‘6}

‘5=‘10.<-m_cl-.-m_v->

‘6=‘7.<-m_cl-.-m_v->

c_cl c_cl

use vars:= ’11.<-m_cl- & -m_v->;

fks:= vars.-c_fk->.valid(for_all(f:=elem(self) :: f.<-c_fk- = vars));

cls:= vars.-c_col->.valid(for_all(c:=elem(self) :: c.<-c_col- = vars)) ::

(’16 implies fks) and (‘17 implies cls) and

(fks implies ’14) and (cls implies ‘15);

‘13 : MapV m_cl

m_v

‘12 : MapV m_cl

m_v

‘17 : Column

c_col

‘16 : ForKey

c_fk

m_vs

‘16 implies ‘14

‘17 implies ‘15 C1

m_vs

‘15 : Column

c_col

‘14 : ForKey

c_fk







‘11: Class

132 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

❑

5.2.3 Mapping columns to class attributes

In the relational data model, the representation of logical entities and their relationships is

based on the simple mathematical concept of relations. Hence, columns are basically used for

two purposes: they might represent actual data values of entities or they might represent

references implemented as redundant copies of such data values in other relations (foreign

keys). Only columns that do not represent foreign keys should be mapped to attributes in the

conceptual model because it includes explicit concepts for relationships (associations and

aggregations). If we admit the existence of different variants of tuples in an RS, we have to

generalize this restriction such that only those columns are mapped to attributes which do not

belong to foreign keys in all of these variants. This restriction is considered within the first part

of the reverse application condition of mapping rule MapColToAttr (cf. the comment in

Figure 5.18).

Even though an RS with multiple variants is mapped to an inheritance hierarchy of classes,

each of its columns is mapped to only one class attribute in this hierarchy. This attribute is then

inherited by all subclasses in the hierarchy. The second part of the reverse application condition

ensures that the column is mapped to the most general class (‘8) in the inheritance hierarchy.

This requirement is represented by a conditional boolean expression [Tea99, p. 44] which

returns true if there exists no such generalization. Otherwise, it ensures that at least one variant

that has been mapped to the generalization of class ‘8 does not include column ‘2. Note, that

the operator in tests the membership of its first argument in the set represented by its second

argument.

Figure 5.17. Example application of production MapVariantsToInheritancerv

AGRAPHICAL FORMALISM TO IMPLEMENT SCHEMA TRANSLATORS 133

key columnsNodes ‘4, ‘5, and ‘9 have been declared as optional graph elements (cf. page 129) to consider

the two possible cases of mapping key columns or non-key columns. If the column

(respectively the attribute) belongs to a key this information is reflected by adding the

corresponding syntactical edges in both ASGs. The outlined arrow between nodes ‘1 and ‘4

marks a graphical path expression.

5.2.4 Mapping inclusion dependencies to relationships

In contrast to the variety of concepts for relationships in the conceptual model (inheritance,

association, and aggregation with different cardinalities), INDs are the only means to

implement references among different RS in the relational model. The schema analysis

activities described in Chapter 4 aim to narrow this semantical gap by classifying INDs either

as normal references (R-IND), cardinality constraints (C-IND), or as inheritance relationships

(I-IND) (cf. Definition 4.1 on page 58). Based on this classification, we present four mapping

rules that translate INDs to relationships in the conceptual model and vice-versa. The first three

rules map R-INDs (in combination with C-INDs) to associations with different cardinalities,

while the fourth rule maps I-INDs to inheritance relationships.

MapRINDToAssoc

[1:1]

Rule MapRINDToAssoc[1:1] in Figure 5.19 maps an R-IND which is inversely key-based to a

total one-to-one association in the conceptual model (cf. Figure 2.15 on page 22). The

restriction to inversely key-based INDs is specified by testing attribute invkb in the textual

condition part of rule MapRINDToAssoc[1:1]. Analogously to the previous mapping rules, the

rest of this condition block ensures that the new association is created among the most general

classes in the corresponding inheritance hierarchy.

mapping rule MapColToAttr =

Figure 5.18. Mapping rule MapColToAttr

‘8 : Class

conditionrv:

‘2 : Column

transferrv:

‘10.aname:=’2.colname

transferfw:

‘2.colname:=’10.aname

‘10 :Attribute

c_att

‘3 : LType ‘11 :CType

‘6 :MapCol m_a

m_col

‘7:MapType m_ct

m_lt

lt ct

c_col

not for_all v:=’1 :: ’2 in v.-c_fk->.-c_c->

(* column ’2 does not belong to foreign key in at least one variant *)

‘9 : CKey

c_ck

‘4 : LKey

c_ka

‘5:MapKey

c_kc

<-c_v-&-c_pk->

m_lk m_ck

‘1=‘8.<-m_cl-.-m_v->

‘12 : MapV m_cl

m_v

[exists v:=elem(‘1.<-m_vs-.-m_vg->) :: not ‘2 in v.-col-> | true]

(* column ’2 is not included in all variants that have been mapped to a generalization of class ‘8 *)

‘1 : Variant

134 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

MapRINDToAssoc

[N:0,1] Similar to MapRINDToAssoc[1:1] the next rule (MapRINDToAssoc[N:0,1]) in Figure 5.20

maps an R-IND that is not inversely key-based but has an inverse C-IND to a left-total one-to-

many association. This rule contains a folding clause to enable that nodes ’9 and ’10 might

represent the same class.

MapRINDToAssoc

[0,N:0,1] All remaining R-INDs (which are not inversely key-based and do not have inverse C-INDs) are

mapped to partial one-to-many associations (cf. Figure 5.21). Again, we employ a negative

graphical application condition (node ‘5) to require the absence of the inverse C-IND.

MapIINDTo

Inheritance Finally, rule MapIINDToInheritance in Figure 5.22 specifies the correspondence of I-INDs

with inheritance relationships. The condition specified for the reverse production ensures that

each class has only one generalization. Analogously to the reverse translation of variants to

class hierarchies, it might occur that an I-IND cannot be mapped because this would violate the

single inheritance condition. The reengineer has to resolve such a conflict, e.g., by changing the

classification of the IND from I-IND to R-IND.

mapping rule MapRINDToAssoc[1:1] =

Figure 5.19. Mapping rule MapRINDToAssoc[1:1]

‘8 :Association

‘2 :R_IND

‘7 : Class

‘9 : Class

m_r

‘3 : ForKey

c_k

‘1 : LKey

c_f

src

tar

transferrv:

‘8.srctotal:=true

‘8.tartotal:=true

‘8.srccard:=1

‘8.tarcard:=1

conditionfw:

‘8.srctotal=true

‘8.tartotal=true

‘8.srccard=1

‘8.tarcard=1

m_rind

conditionrv:

‘2.invkb=true

transferfw:

‘2.invkb:=true

<-m_cl- & -m_v-

<-m_cl- & -m_v->

‘4 : Column

c_c

‘5 : Variant

c_col

‘6 :MapRIND ‘10 :MapRel

r_via

‘11 : Variant

<-c_v-&-c_ak->

[exists v:=elem(‘5.<-m_vs-.-m_vg->) :: not ‘3 in v.-c_fk-> | true]

(* foreign key ’3 is not included in all variants

c_fk

that have been mapped to a generalization of class ‘9 *)

‘5=‘9.<-m_cl-.-m_v->

AGRAPHICAL FORMALISM TO IMPLEMENT SCHEMA TRANSLATORS 135

mapping rule MapRINDToAssoc[N:0,1] =

Figure 5.20. Mapping rule MapRINDToAssoc[N:0,1]

‘4 : R-IND

‘3 : Variant ‘9 : Class

‘10 : Class

‘6 : LKey

c_f

‘1 : ForKey

c_fk

c_k

src

tar

‘5 : C-IND

c_f

c_k

conditionrv:

transferrv:

‘8.srctotal:=true

‘8.tartotal:=false

‘8.srccard:=0 (* zero represents infinity *)

‘8.tarcard:=1

conditionfw:

‘8.srctotal=true

‘8.tartotal=true

‘8.srccard#1

‘8.tarcard=1

‘2 : Column

c_c

c_col

folding:

{’9,’10}

<-m_cl- & -m_v->

‘8 :Association

<-c_v-&-c_ak->

’4.invkb=false

transferfw:

‘4.invkb:=false

m_r

m_rind ‘7 :MapRIND ‘11 :MapRel

r_via

‘12 : Variant <-m_cl- & -m_v->

[exists v:=elem(‘3.<-m_vs-.-m_vg->) :: not ‘1 in v.-c_fk-> | true]

‘3=‘9.<-m_cl-.-m_v->

mapping rule MapRINDToAssoc[0,N:0,1] =

Figure 5.21. Mapping rule MapRINDToAssoc[0,N:0,1]

conditionrv:

‘4 :R-IND

‘3 : Variant ‘8 : Class

‘10 : Class

‘6 : LKey

c_f

c_k

src

tar

<-m_cl- & -m_v->

transferrv:

‘9.srctotal:=false

‘9.tartotal:=false

‘9.srccard:=0

‘9.tarcard:=1

conditionfw:

‘9.srctotal=false

‘9.tartotal=false

‘9.srccard#1

‘9.tarcard=1

‘1 : ForKey

c_fk

‘2 : Column

c_c

c_col

‘9 :Association

‘5 :C-IND

c_f

c_k

’4.invkb=false

folding:

{’8,’10}

transferfw:

‘4.invkb:=false

m_r

m_rind ‘7 :MapRIND ‘11 :MapRel

r_via

<-c_v-.-c_ak->

‘12 : Variant <-m_cl- & -m_v->

[exists v:=elem(‘3.<-m_vs-.-m_vg->) :: not ‘1 in v.-c_fk-> | true]

‘3=‘8.<-m_cl-.-m_v->

136 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

5.2.5 Discussion

The main advantage of using triple graph grammars to specify and implement schema

translators is their high level of abstraction. Graph-oriented specifications are much easier to

define, comprehend, and extend than textual formalisms. Another benefit of this approach is

that it enables the generation of bi-directional translators, because it defines correspondences

among increments in both data models. Hence, triple graph grammars are best suited to

integrate two document types with similar concepts and granularity. The previous section

demonstrates the elegance of using triple graph grammars to define correspondences among

similar concepts like INDs and relationships. Still, the triple graph grammar approach reaches

its limit when there is a significant divergence between the expressiveness of both data models

to be integrated. This was exemplified in Section 5.2.2 where we used two additional reverse

productions to recover inheritance relationships with abstract classes from variant structures in

the logical schema.

Even though the presented mapping rules define a bi-directional mapping among logical and

conceptual schemas, it is important to note that this mapping is partial: further mapping rules

are needed to define correspondences among additional conceptual constructs like aggregations

and many-to-many relationships. These mapping rules can be defined analogously to the rules

described before (cf. [Wad98]). Typically, their definition leads to ambiguities in the (reverse)

translation process from the logical to the conceptual schema. For example, a given R-IND can

be mapped to an association or an aggregation, and an RS with two foreign keys can be

mapped to a class or a many-to-many relationship (association or aggregation). Such

ambiguities can be solved by adding priorities to mapping rules [JSZ96] or extending the

logical schema by further semantic annotations, e.g., to mark an aggregation relationship. Still,

we made the experience that the number of mapping rules grows very large if we strive to

consider all possible (and reasonable) correspondences among logical and conceptual schema

constructs. We tackle this problem by combining a fully automatic schema translator generated

from a limited set of mapping rules with a set of conceptual redesign transformations. The

reengineer can use these redesign transformations to choose from alternative conceptual

constructs while the correspondences to the logical schema are kept automatically.

Mapping rule MapIINDToInheritance =

Figure 5.22. Mapping rule MapIINDToInheritance

‘8 : Inheritance

‘2 : I-IND ‘6 :MapIIND

‘7 : Class

‘9 : Class

m_i_in

c_k

‘1 : LKey

c_f

sup

sub

m_iind

<-m_cl- & -m_v->

conditionrv:

empty(’9.<-sub-)

‘3 : ForKey ‘4 : Column

c_c

‘5 : Variant

c_col

<-c_v-&-c_ak->

‘10: Variant <-m_cl- & -m_v->

CONCEPTUAL SCHEMA REDESIGN 137

5.3 Conceptual schema redesign

In the previous section, we described and specified a canonical translation from an analyzed

logical schema to a conceptual schema (and vice-versa). This canonical translation allows to

represent and assess the persistent data structure of LDBs on a higher level of abstraction by

employing object-oriented modeling concepts. Still, in most DBRE scenarios such a canonical

translation is just a first step in the schema migration process: typically, the initial conceptual

schema is restructured and extended in order to meet new requirements and fully exploit

abstract modeling concepts, e.g., aggregations and cardinality constraints. Most DBRE tools

applied in the activity of conceptual schema restructuring provide little support beyond the

functionality of conventional DB schema design tools: they just provide editor operations to

create or remove schema artifacts like entities, attributes, relationships. Most of these schema

editors are also capable of generating (new) DB schema catalogs from the conceptual model.

However, these approaches do not maintain information about the dependencies of the

restructured conceptual schema with the original LDB schema. This is a severe limitation in

case of iterations in the DBRE process because this information is needed to propagate changes

of the analyzed schema to the conceptual schema and re-establish consistency (cf. page 22).

Likewise, it is not possible to modify the original logical schema incrementally according to

extensions made in the conceptual model. Incremental schema changes are especially

important in the DBRE domain because they are local (e.g., insertion of new attributes or RS),

i.e., they allow to preserve a large amount of the legacy data. Finally, dependency information

between the logical and the conceptual schema is needed to generate middleware components

that facilitate data integration.

5.3.1 Schema redesign transformations

In our approach, we employ the notion of schema (redesign) transformations instead of simple

editing operations to overcome the described limitations. Redesign transformations have

traditionally been applied in logical DB design [BCN92, p. 424]. For example, they are used as

decomposition operations in algorithms to obtain a normalized relational DB schema [EN94].

In contrast to simple editor operations, schema transformations include a definition of the

semantics of the schema change. This semantics is declared by a definition how instances of

the source schema are translated to instances of the target schema of the transformation. Hence,

a schema transformation is often defined as a tuple (T,I), where T denotes the so-called

structure transformation and I is the instance mapping [Hai91]. The structure transformation

represents a function T:ST→S* that is defined on the subset ST⊂S* of all schemas S* that

satisfy the precondition of T. It replaces a given source schema S∈STby a target schema

S’=T(S). Consequently, the instance mapping I:µ(S)→µ(S’) converts valid database extensions

of the original schema into valid extensions of the target schema S’.µ(S) denotes the

information capacity of a given schema Swhich is defined as the set of all valid database states

(or instances) of S. According to [BCN92], a given schema transformation can be classified as

•information-preserving (IP) if its instance mapping Iis bijective or

•information-changing (IC), otherwise, namely,

•information-augmenting (IA) if Iis injective but not surjective or

•information-reducing (IR) if I is surjective but not injective.

138 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

Many approaches in the domain of DB evolution allow to reorganize the data after a redesign

transformation has been applied to the schema [Sch93, Tre95]. In our application, we focus on

integrating legacy DB schemas with distributed, object-oriented technology by generating a

middleware component that provides data integration. The necessary schema dependency

information is represented by the schema mapping graph (SMG) (cf. Section 5.6).

Consequently, we describe the semantics of schema transformations by defining the

modification to the SMG in correspondence to the structural transformation of the conceptual

schema. Using a data integration middleware, the conceptual schema represents an object-

oriented view on the implemented logical schema. Redesign transformations that are

performed to this view do not necessarily change the implemented data model. In fact, we are

interested in keeping the modifications of the legacy schema to a minimum to preserve

compatibility with existing legacy application code. Only IA transformations require actual

changes to the implementation of the schema.

insufficiency

of predefined

transformations

Several researchers have proposed catalogs of redesign transformations for different

conceptual schemas, e.g., [BKKK87, Hai91, Sch93, Tre95, BP96]. Typically, these catalogs

consist of so-called primitive transformations which serve as the basic building blocks of more

complex transformations. Banerjee et al. argue that their catalog of transformations is complete

[BKKK87]. Still, Schiefer shows examples for important schema transformations that cannot

be performed with this catalog [Sch93]. Especially, in the context of DBRE, we doubt the

feasibility of defining a complete catalog of schema redesign transformations. This is because

LDB schemas often comprise complex idiosyncratic optimization patterns and unforeseen

design structures [BP95]. An example for such complex optimization patterns is described in

Chapter 2 on page 21. In most cases, it is not sufficient to apply primitive transformations to

the building blocks of such a pattern. On the contrary, a transformation that is suitable to

normalize such a complex structure has to deal with the entire pattern. Hence, our special focus

is on providing a catalog of transformations that is easily extensible rather than trying to create

a catalog that is complete. The combination of the expressive power of graph grammar

productions with the Progres code generation mechanism [SWZ95] enables us to achieve this

goal: the catalog of redesign transformations that are provided by our schema migration tool

can easily be extended or customized on a high level of abstraction.

5.3.2 An extensible catalog of schema redesign transformations

Figure 5.23 shows an initial catalog of schema transformations which are specified and

implemented in this dissertation. A semi-formal proof of their classification as IP, IC, IR, or IA

transformations is given by Rummel [Rum98]. In the following, we will discuss four of these

transformations in more detail to illustrate our approach. The specifications for all other

transformations are presented in Appendix B.

SplitClass As a first example, we have chosen the IP redesign transformation SplitClass which is specified

as a graph production in Figure 5.24. Redesign transformations are performed interactively by

the reengineer who provides the parameters included in the signature of the graph production.

SplitClass creates a new class with name clName which is connected by a total one-to-one

association to a given class cl. Parameters oldRole and newRole contain the role names of the

pre-existing class and the new class in the created association, respectively.

CONCEPTUAL SCHEMA REDESIGN 139

Figure 5.23. Catalog of conceptual redesign transformations

In Figure 5.24 and the following graph productions, we use bold nodes and edges to make it

easier to identify the part of the production that specifies the actual change in the conceptual

schema. Thin nodes and edges represent the remaining part that specifies the corresponding

modification in the mapping graph. Production SplitClass specifies that the newly created class

(node 6’) is mapped to the same variants that have been mapped to the pre-existing class (node

1’). A new edge of type m_id represents the information that OIDs of the new class are

translated to the same value-based key like OIDs of the old class. The new association is not

mapped to any foreign key (R-IND) in the relational schema. However, it is connected to a new

node of type MapRel to indicate that the association has already a corresponding representation

in the logical schema (cf. the mapping algorithm on page 126).

Transformation Informal description Type

Aggregate Transforms an association into an aggregation IP

AssociationToClass Transforms an association between two classes to an inter-

mediate class with two associations IP

ChangeAssocCardinality Modifies the cardinality of a given association IC

ChangeAttributeType Changes the type of an attribute IC

ClassToAssociation Transforms a class that participates in two one-to-many

associations to a many-to-many association IP

CreateAssociation Creates an association between two given classes IA

CreateAttribute Creates an attribute in a given class IA

CreateClass Creates a new class IA

CreateInheritance Creates an inheritance relationship between two given

classes IA

CreateKey Creates a key for a given class IR

DisAggregate Transforms an aggregation into an association IP

Generalize Creates a generalization for a given class IA

ConvertAbstract Converts a concrete class into an abstract class IR

ConvertConcrete Converts an abstract class into a concrete class IA

MergeClasses Merges two classes which are associated by a one-to-one

relationship into a single class IP

MoveAttribute Moves an attribute from one class to an associated class

via a given one-to-one relationship IP

PushDownAttribute Moves an attribute of a given class to its specialization IR

PushDownAssociation Moves a relationship of a given class to its specialization IR

PushUpAttribute Moves an attribute of a given class to its generalization IA

PushUpAssociation Moves a relationship of a given class to its generalization IA

Remove Removes an increment from the conceptual schema IC

RenameAttribute Changes the name of an attribute IP

RenameClass Changes the name of a class IP

RenameRelationship Changes the role names of a relationship IP

Specialize Creates a specialization for a given class IA

SplitClass Splits a class in two classes connected by a one-to-one

relationship IP

SwapAssocDirection Swaps source and target of a given association IP

140 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

MoveAttribute Classes that are newly created by applying transformation SplitClass do not contain attributes

or participate in any relationship other than the newly created association. The reengineer can

use IA transformations like CreateAttribute or CreateAssociation to create new class

properties. In this case, the mapping rules defined in Section 5.2.2 are used to translate these

properties to columns and foreign keys which extend the original logical schema. Besides the

possibility to add new properties, the catalog in Figure 5.23 contains two transformations

(MoveAttribute and MoveAssociation) that allow to move class properties from one class over

an one-to-one association to another class. These transformations do not augment the

information capacity of the schema. Hence, they do not imply changes in its implementation.

The graph production for transformation MoveAttribute is presented in Figure 5.25. The two

parameters attr and assoc represent the attribute that has to be moved and the association that

connects source and target of this relocation operation. The right-hand side of production

MoveAttribute shows that the attribute which was initially aggregated in class ‘1 by a c_att

edge has been relocated to class 3’ after the transformation has been applied. The information

about the relocation is reflected in the mapping graph by adding the set of all MapRIND nodes

(‘5) to the access path of the relocated attribute which have been mapped to the association.

This is done by adding a_via edges from the attribute mapping node ‘6 to all nodes in set ‘5.a

a In Section 5.6, we use this information for generating middleware components for data integration.

production SplitClass( cl : Class ; clName : string ; newRole : string ;

oldRole : string)

::=

transfer 2’.srctotal := true;

2’.tartotal := true;

2’.srccard := 1;

2’.tarcard := 1;

2’.srcname := oldRole;

2’.tarname := newRole;

6’.clname := clName;

Figure 5.24. Schema transformation SplitClass

‘5 : Variant

‘3 : MapKey m_id

<-m_cl-

& -m_v->

‘1 = cl

5’ = ‘5

3’ = ‘3 m_id 1’ = ‘1

src

tar

2’ : Association

m_id

6’ : Class

m_cl

m_v

7’ : MapV

m_r

4’ : MapRel

CONCEPTUAL SCHEMA REDESIGN 141

Still, it is also possible that association assoc is not mapped to any MapRIND node, e.g., if it

has been created by applying the SplitClass transformation. Hence, node set ‘5 is defined to be

optional. In the case that no match can be found for node set ‘5, the mapping information of the

relocated attribute remains unchanged.

The application condition of production MoveAttribute restricts its applicability to one-to-one

associations only. The relocation of class properties over many-to-one associations is

ambiguous w.r.t. to the instance conversion and, thus, has to be prohibited. On the other hand,

relocating class properties over a one-to-many association would represent an IA

transformation. In the case that a relocation operation aims at an augmentation of the

information capacity, the corresponding properties have to be deleted from the variants mapped

to class ‘1 and added to the variants mapped to class ‘3. This can be done by a concatenation of

remove and create transformations (cf. Figure ). Strategies to reorganize the available data after

IA transformations have been developed in the domain of DB evolution [Sch93, Tre95]. One

typical solution is to insert default values for undefined attribute values.

Association ‘4 has to be total w.r.t. class ‘1 to avoid information augmentation. This

requirement is represented by a conditional boolean expression to cover the case that class ‘1 is

the source of the association or its target, respectively. The semantics of this conditional

expression is that if nodes ‘1 and ‘4 are connected by an edge of type src, attribute ‘4.srctotal is

evaluated as the result of the expression. Otherwise, the result is defined as the value of

attribute ‘4.tartotal (cf. [Tea99]).

production MoveAttribute( attr : Attribute ; assoc : Association)

::=

condition ‘4.tarcard = 1; ‘4.srccard = 1;

[ ‘1 = ‘4.-src-> :: ‘4.srctotal | ‘4.tartotal ] ; (* association is total w.r.t. class ‘1 *)

Figure 5.25. Schema transformation MoveAttribute

‘3 : Class

‘2 = attr

c_att

‘1 : Class -src->

or -tar-> -src->

or -tar->

m_a

‘6 : MapCol ‘5 : MapRIND

<-m_r- &

-r_via->

‘4 = assoc

4’ = ‘4

2’ = ‘2

1’ = ‘1

m_a

c_att

3’ = ‘3

5’ = ‘5

a_via

6’ = ‘6

142 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

Generalize The transformations described so far employ relationship concepts like association and

aggregation to redesign the structure of conceptual schemas. We propose additional

transformations to modify inheritance structures. Two important examples are transformations

Generalize and Specialize. The purpose of transformation Generalize is to create a new

generalization for the root class of an inheritance hierarchy, while transformation Specialize is

used to insert a new subclass of a given class. New classes which are created by these two

transformations are mapped to additional variants in the logical schema. We have selected this

implementation alternative because it does not entail modifications of the logical schema and a

reorganization of the legacy data. Other possible implementations of inheritance relationships

are described, e.g., by Hainaut et al. [HHEH96] and Fussell [Fus97]. Note, that transformation

Generalize creates a concrete class per default which can be converted to an abstract class

using transformation ConvertAbstract from our catalog.

The specification for transformation Generalize is presented in Figure 5.26. Its signature has

two parameters, namely the class that has to be generalized (cl) and the name of the new

superclass (clName). The bold graph elements of the corresponding production show that the

key attributes of class cl (‘2) are relocated to the new class (7’). This is because the new class is

represented by a new variant (8’) in the logical schema and each variant has to include the

primary key of its RS. The inheritance relationship itself is mapped to the inclusion of the new

variant (8’) in the existing variant (5’) by a node of type MapInc (11’).

production Generalize( cl : Class ; clName : string) =

::=

transfer 7’.clname := clName;

7’.abstract := false;

Figure 5.26. Schema transformation Generalize

‘6 : Column

‘2 : Attribute

sub

‘3 : Inheritance

c_ck

c_ka

c_att

<-m_cl-

& -m_v-> ‘1 = cl

‘4 : CKey

m_ck

-m_lk->

& -c_kc-> ‘9 : MapKey

‘5 : Variant

6’ = ‘6

2’ = ‘2

1’ = ‘1

c_ka

4’ = ‘4

sub

sup

3’ : Inheritance

c_att

c_ck

m_cl

m_v

c_col

8’ : Variant

m_v_in

m_vg

5’ = ‘5

m_v

10’ : MapV

m_vs 11’ : MapInc

m_ck

9’ = ‘9

m_id

7’ : Class

CONCEPTUAL SCHEMA REDESIGN 143

PushUpAttrSimilar to the relocation of class properties via associations (MoveAttribute,MoveAssociation),

we define redesign transformations to relocate class properties in inheritance hierarchies.

According to the common practice to denote inheritance hierarchies as inverse vertical trees we

have named these transformations PushUpAttribute,PushUpAssociation,PushDownAttribute,

and PushDownAssociation. The first two transformations are information-augmenting while

the latter two transformations are information-reducing. As an example, Figure 5.27 shows the

specification of transformation PushUpAttribute which relocates a given attribute from one

class to its generalization.

Note, that we restrict the application of relocation transformations to inheritance relationships

that have been mapped to variants of a single RS. The reason for this restriction is that

otherwise we would have to relocate the corresponding column in the logical schema to a

different RS and reorganize the data. Consequently, PushUp and PushDown transformations

cannot be applied to inheritance relationships that are mapped to I-INDs. If such a schema

modification is desired the corresponding attribute has to be removed from the subclass and

added to its generalization. Again, DB evolution strategies elaborated for example by Schiefer

[Sch93] and Tresch [Tre95] can be used to reorganize the data accordingly.

production PushUpAttribute( attr : Attribute) =

::=

folding {‘5,‘7}

Figure 5.27. Schema transformation PushUpAttribute

c_att

‘2 : Class

sub

sup

‘3 : Inheritance

‘5 : Variant

‘6 : Column <-m_a-

& -m_col-> ‘1 = attr

c_col

‘7 : Variant

m_v_in

m_vs

m_vg ‘8 : MapInc

<-m_cl-

& -m_v-> ‘4 : Class

1’ = ‘1

2’ = ‘2

sub

sup

3’ = ‘3

6’ = ‘6

c_col

7’ = ‘7

m_v_in

m_vs

m_vg

8’ = ‘8

c_col

5’ = ‘5

c_att

4’ = ‘4

m_vg

144 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

5.3.3 Complex schema redesign transformations

In the previous section, we employed graph productions to specify a catalog of primitive

schema redesign transformations. In order to facilitate maintainability of this catalog it should

be minimal, i.e., it should not contain transformations that can be simulated by executing a

sequence of other transformations in this catalog. Still, from the reengineer’s point of view it is

more convenient and efficient to use more powerful redesign transformations. For example, a

reengineer might want to relocate several attributes over an aggregation. In this case, (s)he

would prefer to select a single operation (e.g., MoveOverAggregation) instead of transforming

the aggregation into an association (primitive transformation DisAggregate), moving each

attribute separately (primitive transformation MoveAttribute), and transforming the association

back to an aggregation (primitive transformation Aggregate).

The obvious solution to meet this requirement is to provide some kind of macro mechanism

that allows to concatenate primitive transformations to more complex transformations.

However, we have to be aware of the fact that each primitive transformation has its own

application precondition. Hence, it is possible that the precondition of some intermediate

transformation is not fulfilled. Let us assume that in the above scenario the reengineer wants to

relocate attributes over a one-to-many aggregation. If the complex transformation

MoveOverAggregation is implemented as a script that calls the different primitive

transformations it will fail with the first call to MoveAttribute (because it requires a one-to-one

association). Still, the precondition of the first primitive transformation (DisAggregate) was

valid and it has been applied to the migration graph. Obviously, the result of such an aborted

complex transformation is not what the reengineer intended.

The described example motivates the need for some mechanism which guarantees that

complex transformations are executed either completely or not at all. This problem is well-

known from the domain of transaction processing in database management systems [EN94].

Hence, one solution is to use a transaction monitor that, in case of a violated precondition,

allows to recover the state of the migration graph before the execution of the complex

transformation. An alternative solution is to check all preconditions of involved primitive

transformations at the beginning of a complex transformation. However, this would involve

additional effort to rewrite those preconditions which actually depend on the output of other

primitive transformations in the complex sequence.

In our approach, we have selected the former alternative. We employ the transaction concept

which is provided by the graph-oriented database GRAS [KSW95]. The Progres language

provides control structures to specify such transactions. This is exemplified in Figure 5.28. In

this example, assoc is declared as a local variable of type Association. Primitive

transformations are invoked like simple method calls. Further complex redesign

transformations can be defined analogously, e.g., a concatenation of transformations

Generalize and PushUpAttribute.

INCREMENTAL CHANGE PROPAGATION 145

5.4 Incremental change propagation

Inconsistencies among different representations on various levels of abstraction often cause

update problems in DBRE projects. In Chapter 2, we exemplified that such inconsistencies

might be caused by process iterations (cf. page 22): whenever the reengineer discovers new

information about the real semantics of (low-level) implementation constructs all (high-level)

representations of the LDB that have been created so far must be updated accordingly. A

further typical source of inconsistencies are on-the-fly modifications to the implementation of

the LDB due to urgent requirements while the DBRE project is in progress. Detecting and

eliminating such inconsistencies manually is a time-consuming and error-prone activity.

Hence, a commonly used approach is to discard all created high-level views of the LDB and

generate default representations anew. In this case, the redesign work that has been performed

manually by the reengineer is lost and has to be repeated. Obviously, both alternatives are

unsatisfactory. Therefore, we have developed an incremental approach to consistency

management in DBRE environments. In this section, we describe an automatic mechanism to

propagate changes of an LDB’s implementation to its conceptual representation without

discarding manually performed redesign operations that remain valid.

The developed consistency management mechanism is based on the fact that our approach to

schema migration employs transformations as the fundamental concept. In Section 5.2.1, we

have shown how to derive an automatic transformation system from a triple graph grammar to

translate a logical LDB schema into an initial conceptual representation. Subsequently, we

have proposed a catalog of redesign transformations that can be applied to this conceptual

representation, interactively. The main idea of our consistency management concept is to keep

track of input/output dependencies among all transformations that have been applied to the

implemented logical schema. In the case of implementation changes or modified semantic

annotations, this dependency information is employed to detect all transformations which are

affected by the change. Each of these transformations is re-evaluated automatically to

determine if their preconditions are still applicable. Only those transformations which have lost

their applicability are discarded.

transaction MoveOverAggregation( aggr : Aggregation ; attrs : Attribute [1:n])

=use assoc : Association

DisAggregate ( aggr, out assoc )

& for all attr := attrs

MoveAttribute ( attr, assoc )

end

& Aggregate ( assoc, out aggr )

end

end;

Figure 5.28. Complex transformation MoveOverAggregation

146 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

5.4.1 The history graph

history graph In this dissertation, we have used graph productions to formalize and implement

transformations. In this sense, the left-hand side of a graph production represents the input of

the corresponding transformation, while its output is represented by the right-hand side. If we

want to maintain input/output dependencies of applied transformations, we have to store

information about the matches for the corresponding graph productions. A graph-based

structure is most suitable to maintain these dependencies. We call the corresponding graph

history graph because it reflects the migration history of an LDB schema. Figure 5.29

illustrates the basic structure of a history graph: applied transformations are explicitly

represented by T-nodes with corresponding input and output parameters. Input parameters

which have actually been removed by an applied transformation remain as place holders in the

history graph to represent the necessary dependency information (cf. C-nodes with grey shape

in Figure 5.29).

transformation

templates In order to maintain the application contexts of transformations we have to identify and

represent the graph elements on their left- and right-hand sides explicitly in the history graph.

In the Progres graph model, it is sufficient to consider node parameters only, because they

uniquely determine the application context of productions (cf. Definition 5.1 on page 115). For

example, let us consider transformation Generalize in Figure 5.26 on page 142. It has six input

node parameters and eleven output node parameters. Each parameter has a unique node

number and some of the output parameters also serve as input. Figure 5.30 shows this input/

output structure for transformation Generalize. The Parameter nodes serve as place holders for

the actual parameters of a transformation application. Hence, we call this structure a

transformation template. The parameter numbering is based on the node numbers of the

corresponding Progres production.

L P P C P

L P

PCPTP C

L P T P C P P C P

P C P

L P P C

L P T P C P P C P

L P

L P T P C P

P C

PTP C

P P C

Figure 5.29. Basic structure of a history graph

Ê Ë Ì Í Î Ï Ê Ð Î Ñ Ò Ó Ï Í Ô Î Õ Ò Ó Ò Ô Ö

Î × Õ Õ Ò Ô Ö Î Ë Ô Î Ò Ø Ö Ù Ð Î Ñ Ò Ó Ï Í Ô Î Õ Ò Ó Ò Ô Ö

Í Ð Ë Ê Ï Ö Ò Ú Î Ë Ô Î Ò Ø Ö Ù Ð Î Ñ Ò Ó Ï Í Ô Î Õ Ò Ó Ò Ô Ö

Ð Î Ñ Ò Ó Ï Ö Õ Ï Ô Ð Û Ë Õ Ó Ï Ö Í Ë Ô

Ø Ê Ï Î Ò Ñ Ë Ê Ú Ò Õ Û Ë Õ Ø Ï Õ Ï Ó Ò Ö Ò Õ

in/out

out

in/out

INCREMENTAL CHANGE PROPAGATION 147

dependencies

among edges

Even though node parameters are sufficient to determine the application context of a Progres

production, its application itself obviously depends also on edge parameters. These

dependencies cannot be represented directly in the history graph because the underlying graph

model does not allow for higher-order edges, i.e., edges that have edges as their source or

target (cf. Definition 5.1 on page 115). However, this dependency information can be

disregarded if all graph productions comply to the requirement that whenever an edge is

modified its source and target nodes have to occur on the left-hand sides. This requirement is

satisfied in all graph productions included in this dissertation. Still, Progres provides other

means to modify edges in terms of so-called redirection, embedding, and copy clauses which

can be added to productions (cf. [Tea99]). Such clauses may not be used in our approach.

restriction:

path expressions

Another problem arises with Progres productions that employ path expressions. For example,

production Generalize has two path expressions on its left-hand side (cf. Figure 5.26 on

page 142). Although path expressions represent a powerful means to specify graph traversals

they are problematic for our consistency management mechanism because they imply

additional input dependencies. In principle, it is necessary to add input dependencies to each

node that has been visited in an application of such a path expression. However, collecting all

visited nodes would imply modifications to the internal implementation of the Progres

compiler. On the other hand, prohibiting the usage of path expression completely would entail

a severe restriction for the expressiveness of our formalism. Therefore, we decided to restrict

our formalism to path expressions that have a maximum path length of two edge traversals.

This restriction allows to combine the main benefits of path expressions with a simple

(conservative) approach to consider the additional input dependencies. The idea is simply to

add input dependencies to all direct neighbors of nodes matched to the left-hand side of an

applied transformation. These additional nodes are called 1-context of the actual input

parameters of the transformation. Formal definitions for the 1-context and the entire

application context of transformations are given in Definition 5.4 and Definition 5.5.

Figure 5.30. Template of transformation Generalize

(Generalize)

Transformation

(Column)

Parameter 6

(Variant)

Parameter 5

(MapKey)

Parameter 9

(Class)

Parameter 1

(Attribute)

Parameter 2

(CKey)

Parameter 4

(MapInc)

Parameter 11

(Inheritance)

Parameter 3

(Class)

Parameter 7

(MapV)

Parameter 10

(Variant)

Parameter 8

out

in/out

out

148 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

transitive closure

in path expressions The restriction to short path expressions does also prohibit the use of Progres operators for

transitive closure (*,+) in such expressions. However, due to our experience this is no real

limitation because transitive path expressions are usually employed in transformations to check

for violations of invariant graph constraints. An example for such an invariant constraint is that

there might not be two classes with the same name. Such invariant constraints do not depend

on the actual transformation context and, thus, can be specified separately as described on

page 118. They can be validated before a transformation is finally committed. In addition, this

strategy reduces the redundancy in transformation specifications because otherwise the

corresponding condition had to be specified in all transformations that may violate it.

Definition 5.4 1-context of a set of nodes

The 1-context of a set of nodes S in a graph G is defined as the set of nodes S’ which contain all

direct neighbors of nodes in S which do not belong to S, i.e.,

❑

S’=1-context(G,S):= {n| n∈N(G)\S ∧ ∃e∈E(G) : (s(e)∈S∧ t(e)=n) ∨(t(e)∈S∧s(e)=n)}

Definition 5.5 Context of a transformation application

The context of an application of a transformation (represented by a production) r:(P,Q,C,T) to

a graph G in a match m:P→G is defined by a tuple r:(in, out, con1) of two mappings and a set:

•in:N(P)→N(G) with in(n)=m(n) for n∈N(P),

•out:N(Q)→N(G↓(r,m)) with out(n)=m(n) for n∈N(Q), where m(n) is the comatch of the

production application (cf. Definition 5.3 on page 121).

❑

•con1:=1-context(G,in(N(P)).

negative conditions Negative application conditions in graph productions cause another problem because they

specify the necessity for the absence of certain graph elements. Still, negative conditions are

frequently needed to select the right transformation. An example is given in Figure 5.21 on

page 135. Here, the absence of a C-IND node is required in order to map an R-IND to a partial

many-to-one association. If the reengineer finds out, at a later point in time, that such a C-IND

in fact exists, the transformation has to be undone. We solve the problem of negative

application conditions as follows: we require that negative nodes have to be in the 1-context of

at least one other node on the left-hand side of the production. Whenever a new node n has

been created that is used in a negative application condition, the nodes in the 1-context of n are

marked changed.

history graph

model Figure 5.31 shows a graphical Progres specification for the history graph model. According to

Definition 5.5, input and output dependencies are represented by edges of type In and Out,

whereas the nodes in the 1-context of a transformation application are referenced by con1

edges. Figure 5.31 also shows that the history graph model is an extension of the migration

graph model, i.e., the history graph contains the migration graph as a subgraph. Node type

Increment represents a generalization if all node types in the migration graph model

represented in Figure 5.2 on page 117. Edges of type actual connect parameter place holders of

transformation templates with their actual input and output parameters in the migration graph.

INCREMENTAL CHANGE PROPAGATION 149

Definition 5.6 History graph

The history graph is a graph that includes the migration graph as a subgraph. Moreover, it

contains nodes and edges that represent all application contexts of (mapping and redesign)

productions in the entire editing history. The corresponding extension of the migration graph

model (Figure 5.2 on page 117) is given in Figure 5.31. The projection of a history graph

H:(N,E,yN,A) on the current migration graph MG(H):(N’,E’,y’N, A’) includes all increments

which do not occur as in-parameters of a transformation without occuring as out-parameters of

the same transformation, i.e,

•N’:={n∈N | yN(n)∉{’Transformation’, ’Parameter’} ∧

(∀np,nt∈N, ∀ea,ei∈E: t(ea)=n ∧ s(ea)=np∧t(ei)=np ∧ s(ei)=nt∧yE(H)(ea)=’actual’ ∧

yE(H)(ei)=’In’ ⇒ ∃eo ∈ E: t(eo)=np∧ s(eo)=nt∧ yE(H)(eo)=’Out’)}

•E’:={e∈E | s(e), t(e) ∈N’}

•y’N:=yN\{’Transformation’, ’Parameter’}

❑

•A’=A

The history graph defined above is a specific implementation of the general concept of a graph

process as introduced by Corradini et al. [CMR96]. A graph process is a partially ordered

structure, plus suitable mappings which relate the elements of this structure to those of a given

typed graph grammar. According to this terminology, the Transformation and Parameter nodes

with their In,Out, and con1 edges represent the above mentioned partially ordered structure;

edges of type actual represent the mapping between this partially ordered structure and the

typed graph elements representing the logical schema, the conceptual schema, and the SMG,

respectively.

Parameter

Increment

Transformation

In con1

actual

[0:N]

[1:N]

[0:N]

[0:1]

Figure 5.31. History graph model

Out

[0:N]

[0:1]

LSchema MapSch CSchema

...

intrinsic

nr: integer

...

m_csm_ls

150 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

5.4.2 The propagation mechanism

application of

transformations to

the history graph

In order to log the application of transformations in the history graph, we have to redefine the

way how transformations (graph productions) are applied (cf. Definition 5.7). The main

difference of this definition w.r.t. Definition 5.3 on page 121 is that nodes which are deleted on

the right-hand side of production are not removed from the history graph but they are isolated,

i.e., all their in- and out-going edges in the corresponding migration graph are deleted.

Definition 5.7 Application of transformations to a history graph

A transformation that is represented by a production t:(P,Q,C,T) is applied to a history graph H

in the following five steps.

•CHOOSE an occurrence of the left-hand side P in MG(H) (analogously to Definition 5.3 on

page 121).

•CHECK the application conditions according to C.

•REMOVE all edges from H that have been matched to edges in E(P\Q) .

•ADD all elements to G which are new in Q, i.e., which do not occur in P. These new

elements are glued to G in the preserved graph elements identified by m(P∩Q).

•LOG the context of the applied transformation t:

•EXTEND H by the corresponding template for t (cf. Figure 5.30)

•EMBED the new template according to the context information, i.e.,

• create an actual edge from each parameter to the corresponding node in H, and

• create a con1 edge from the new Transformation node to each node in

1-context(MG(H), N(m(P)).

•ISOLATE all nodes in MG(H) that have been matched to nodes in N(P\Q), i.e., remove all

edges from H which belong to MG(H) and are connected to nodes in m(P\Q).

•TRANSFER attribute values to nodes in G that match nodes in Q.

change

propagation In the remainder of this section, we describe how the information stored in the history graph

can be used for incremental change propagation. Let us assume a scenario where an analyzed

logical LDB schema has been translated to a conceptual representation which subsequently has

been redesigned and extended. Our case study describes a sample situation for a change in the

logical schema during such an ongoing conceptual migration process (cf. page 22). Using the

history graph that has been created during the translation and editing history, the change

propagation process has four major phases, namely forward propagation,backward

propagation,reevaluation, and translation.

Phase I:

forward

propagation

In the first phase, the input/output dependencies in the history graph are used to detect all

transformation applications (and increments in the conceptual schema) which are affected by

the modifications in the logical schema. This step is illustrated in Figure 5.32 where L-nodes

with a pencil mark the modifications and extension of the logical schema, respectively. Note,

that in this phase, con1 edges are used in the same way like in edges to find (potentially)

affected transformation applications. However, we do not represent the 1-context of

transformation applications in Figure 5.32 (and the following diagrams) for reasons of

simplification.

INCREMENTAL CHANGE PROPAGATION 151

Phase II:

backward

propagation

Obviously, all transformation applications that have been marked in the forward propagation

step have to be validated. However, some of these transformation applications depend on input

parameters which have been consumed by a transformation. These parameters, which are only

represented by isolated place holders, have to be reproduced before the dependent

transformation can be re-evaluated. Reproducing these parameters means to re-evaluate all

transformations that have been applied to produce them. Some of the transformation

applications that have to be re-evaluated might not have been marked in the forward

propagation phase because they are not directly affected by the modification in the logical

schema. Hence, we need a further backward propagation phase to mark such indirectly

affected transformation applications in the history graph (cf. Figure 5.33).

Phase III:

reevaluation

In the third phase, the marked transformation applications are re-evaluated in the predefined

order of their input/output dependencies. Reevaluating a transformation application means to

apply the corresponding transformation anew to the current (maybe changed) parameters. Each

transformation that remains applicable remains in the history graph. Figure 5.34 shows that the

output parameters of such a transformation and the input parameters of a dependent

transformation application are actualized to the newly created conceptual schema increments.

All old parameter place holders are deleted from the history graph. Likewise, all

transformations which are no longer applicable are deleted as well. In Figure 5.34, this is

illustrated for the right-most transformation template.

L P

✗✗✗✗✗✗

Figure 5.32. Phase I: forward propagation

Ü Ý Þ ß à á Ü â à ã ä å á ß æ à ç ä å ä æ è

à é ç ç ä æ è à Ý æ à ä ê è ë â à ã ä å á ß æ à ç ä å ä æ è

à Ü Ý æ ä ì à Ý æ à ä ê è ë â à ã ä å á ß æ à ç ä å ä æ è

â à ã ä å á è ç á æ â í Ý ç å á è ß Ý æ

ê Ü á à ä ã Ý Ü ì ä ç í Ý ç ê á ç á å ä è ä ç

✗

á í í ä à è ä ì

in/out

out

in out

✎

152 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

Phase IV:

translation The purpose of the final phase in the change propagation process is to translate logical schema

increments which do not have a current representation in the conceptual schema

(cf. Figure 5.35). This is necessary for logical schema increments which have been added

during the last modification. Furthermore, translations of existing logical schema increments

might have been deleted during the reevaluation phase because the corresponding

transformation rules are no longer applicable. At the end of this translation phase, the

consistency of the logical schema with its conceptual representation has been reestablished.

✗✗✗✗✗✗

✗✗✗

✗

Figure 5.33. Phase II: backward propagation

Ü Ý Þ ß à á Ü â à ã ä å á ß æ à ç ä å ä æ è

à é ç ç ä æ è à Ý æ à ä ê è ë â à ã ä å á ß æ à ç ä å ä æ è

à Ü Ý æ ä ì à Ý æ à ä ê è ë â à ã ä å á ß æ à ç ä å ä æ è

â à ã ä å á è ç á æ â í Ý ç å á è ß Ý æ

ê Ü á à ä ã Ý Ü ì ä ç í Ý ç ê á ç á å ä è ä ç

✗

á í í ä à è ä ì

✗

ß æ ì ß ç ä à è Ü î á í í ä à è ä ì

out

in out

in/out

✎

✔

Figure 5.34. Phase III: reevaluation

Ü Ý Þ ß à á Ü â à ã ä å á ß æ à ç ä å ä æ è

à é ç ç ä æ è à Ý æ à ä ê è ë â à ã ä å á ß æ à ç ä å ä æ è

à Ü Ý æ ä ì à Ý æ à ä ê è ë â à ã ä å á ß æ à ç ä å ä æ è

â à ã ä å á è ç á æ â í Ý ç å á è ß Ý æ

ê Ü á à ä ã Ý Ü ì ä ç í Ý ç ê á ç á å ä è ä ç

✔

ç ä ï ä ð á Ü é á è ä ì á æ ì á ê ê Ü ß à á ñ Ü ä

ç ä ï ä ð á Ü é á è ä ì á æ ì æ Ý è á ê ê Ü ß à á ñ Ü ä

✔

✃

è Ý ñ ä ì ä Ü ä è ä ì

✃ ✃

✃✃ ✃

out

in out

in/out

✎

INCREMENTAL CHANGE PROPAGATION 153

realization in

Progres

The described incremental change propagation algorithm has been implemented in Progres.

This implementation is described in detail by Wadsack [Wad98]. Figure 5.36 shows the

transaction PropagateChange which formalizes the propagation process. It requires an

argument changeSet which represents the set of all logical schema increments that have been

added or modified.a In the first phase, (transitive) path expressions are used to collect all

directly affected transformation applications in the local variable affectedTrafoAppls. In the

backward propagation phase all transformation applications are added to this variable which

are needed to reproduce consumed parameters. Phase III is performed in a loop that repeatedly

chooses one transformation application (oldTrafoAppl) that does not depend on any other

transformation application in affectedTrafoAppls. Note, that the Progres operator and

computes the intersection of two sets. The following choose statement tries to reapply the

transformation in oldTrafoAppl. If this is possible and the specified invariant graph constraints

are fulfilled it actualizes the output parameters of the new transformation application.

Subsequently, the re-evaluated transformation application oldTrafoAppl is removed from the

set affectedTrafoAppls. This is done by using the Progres operator but_not which computes the

difference of two sets. In the case that the transformation in oldTrafoAppl has lost its

applicability, the else block of the choose statement in Figure 5.36 collects all dependent

transformation applications in variable depTrafoAppls. Subsequently, these transformation

applications are removed from the history graph.

a These increments are collected by the Varlet Analyst during interactive schema analysis activities.

Figure 5.35. Phase IV: translation

Ü Ý Þ ß à á Ü â à ã ä å á ß æ à ç ä å ä æ è

à é ç ç ä æ è à Ý æ à ä ê è ë â à ã ä å á ß æ à ç ä å ä æ è

â à ã ä å á è ç á æ â í Ý ç å á è ß Ý æ

ê Ü á à ä ã Ý Ü ì ä ç í Ý ç ê á ç á å ä è ä ç

out

in/out

out

in/out

✎

154 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

transaction PropagateChange( changeSet : Increment [1:n]) =

use

affectedTrafoAppls, depTrafoAppls : Transformation [0:n];

oldTrafoAppl, newTrafoAppl : Transformation

(* Phase I: forward propagation *)

affectedTrafoAppls := changeSet.( ( <-actual-

& <-In- )

or <-con1- )

& affectedTrafoAppls :=

affectedTrafoAppls.( ( -Out->

& -actual->

& ( ( <-actual-

& <-In- )

or <-con1- ) ) * )

(* Phase II: backward propagation *)

& affectedTrafoAppls :=

affectedTrafoAppls.( ( ( ( -In->

& -actual-> )

or -con1-> )

& <-actual-

& <-Out- ) * )

(* Phase III: reevaluation *)

& loop

oldTrafoAppl :=

affectedTrafoAppls.valid (empty ( (self.-In->.-actual->.<-actual-.<-Out-)

and affectedTrafoAppls ))

& choose

Reevaluate ( oldTrafoAppl, out newTrafoAppl )

& CheckGraphConstraints (* cf. page 118 *)

& ActualizeOutParams ( oldTrafoAppl, newTrafoAppl )

& affectedTrafoAppls :=

(affectedTrafoAppls but not oldTrafoAppl)

else

depTrafoAppls :=

oldTrafoAppl.( ( ( -Out->

but not -In-> )

& -actual->

& <-actual-

& <-In- ) * )

& RemoveTrafoAppls ( depTrafoAppls )

& affectedTrafoAppls :=

(affectedTrafoAppls but not depTrafoAppls)

end

(* Phase IV: remapping *)

& MapSchema (* cf. Figure 5.10 on page 126 *)

end

end;

Figure 5.36. Transaction PropagateChange

INCREMENTAL CHANGE PROPAGATION 155

adaption of

productions

In order to retrieve the necessary information about the context of transformation applications,

we have to modify the corresponding Progres productions in a way such that the matched

nodes are returned as parameters. Moreover, the described propagation algorithm requires the

possibility to re-evaluate transformations with a predetermined application context. Therefore,

we split each production that specifies a schema transformation into two separate parts, namely

agraph test and a parameterizable production that accepts a predetermined application

context. This is exemplified for transformation Generalize in Figure 5.37 and Figure 5.38.

Whenever a transformation is applied the graph test is used to deliver the input parameters for

the application context of this transformation. The 1-context can be easily computed from these

nodes (cf. Definition 5.4). If this test succeeds the corresponding parameterizable production is

invoked with the delivered input parameters. Subsequently, this production returns the output

parameters which are needed to complete the information about the application context. All

nodes which are actually deleted by a production have to be added to its right-hand side,

because they serve as isolated place holders in the history graph. During the change

propagation process the parameterizable production is re-evaluated with the actualized

application context. Note, that the described adaption of productions can be performed

automatically by a canonical pre-compilation step and does not have to be done manually.

test Generalize_getParams( cl : Class ; clName : string ;

out param1, param2, param4, param5, param6, param9 : Increment [0:n] )

return param1 := ‘1;

param2 := ‘2;

param4 := ‘4;

param5 := ‘5;

param6 := ‘6;

param9 := ‘9;

Figure 5.37. Graph test Generalize_getParams

‘5 : Variant

‘6 : Column

‘5 : Variant ‘2 : Attribute

sub

‘3 : Inheritance

c_ck

c_ka

‘4 : CKey

c_att

<-m_cl-

& -m_v-> ‘1 = cl

m_ck

-m_lk->

& -c_kc->

‘9 : MapKey

156 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

scalability The change propagation mechanism described above can efficiently be executed. Maintaining

the history graph does not add to the run-time complexity of applying schema transformations.

Each applied transformation extends the history graph by one instance of a transformation

template (cf. Definition 5.7). Hence, the space complexity of the history graph is O(n) where n

is the number of transformations applied during the conceptual schema migration process. If

we make the simplifying assumption that application conditions of graph productions can be

validated in constant time then the time complexity of algorithm PropagateChange in

Figure 5.36 is also O(n).

production Generalize_withParams( clName : string ;

param1, param2, param4, param5, param6, param9 : Increment [0:n];

out param3, param7, param8, param10, param11 : Increment [0:n])

::=

transfer 7’.clname := clName;

return param3 := 3’;

param7 := 7’;

param8 := 8’;

param10 := 10’;

param11 := 11’;

Figure 5.38. Production Generalize_withParams

‘5 = param5

‘6 = param6

‘5 = param5 ‘2 = param2

sub

‘3 : Inheritance

c_ck

c_ka

‘4 = param4

c_att

<-m_cl-

& -m_v-> ‘1 = param1

m_ck

-m_lk->

& -c_kc-> ‘9 = param9

5’ = ‘5

6’ = ‘6

5’ = ‘5

2’ = ‘2

1’ = ‘1

c_ka

4’ = ‘4

sub

sup

3’ : Inheritance

c_att

c_ck

m_cl

m_v

10’ : MapV

c_col

8’ : Variant

m_v_in

m_vg

m_vs 11’ : MapInc

m_ck

9’ = ‘9

m_id

7’ : Class

IMPLEMENTING THE VARLET MIGRATOR 157

5.5 Implementing the Varlet Migrator

Our approach to conceptual schema migration has been implemented in a tool called the Varlet

Migrator. In order to achieve the incremental and iterative DBRE process described above, the

Varlet Migrator is tightly integrated with the Varlet Analyst presented in Section 4.4. The

following section describes this integrated architecture in more detail, while Section 5.5.2 is

illustrates to the user’s perspective.

5.5.1 Architecture

The central component of the Varlet Migrator is a repository that maintains the migration

graph in the dedicated software engineering database GRAS [KSW95]. GRAS provides the

possibility to access large graphs efficiently with full support for transaction management,

recovery, and operation undo/redo. Figure 5.39 shows that the schema for this repository is

devided into logical subsections for the ASG models of logical and conceptual LDB schemas,

the mapping graph model, and the history graph model. The grey components in Figure 5.39

illustrate that GRAS is also used as repository for the Varlet Analyst.a This architecture enables

the desired tight integration among both tools.

aMore precisely, the Varlet Analyst is based on an extended version of the logical schema ASG model depicted in

Figure 5.2 which allows to represent certainty measures for constraints like keys, INDs, etc.

Logical schema

ASG model

Analysis

Front-End

Figure 5.39. Architecture of the Varlet Migrator

iTcl/Tk

Conceptual schema

ASG model

Mapping graph

model Migration graph model

GRAS

Migration

Front-End

iTcl/Tk

Redesign

transformations

Progres

Schema

translation

Progres

Relational

unparser

Progres

Object-oriented

unparser

Progres

uses

module

Command

extractor

lex/yacc

History graph

model

Consistency

management

Pogres

158 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

The internal functionality of the Varlet Migrator is entirely implemented in Progres. Module

Schema Translation implements the triple graph grammar based bidirectional schema mapping

mechanism described in Section 5.2. This module and the compiler that derives conventional

Progres productions from triple graph grammar rules is described by Holle [Hol97]. The

change propagation mechanism described in the previous section is implemented in module

Consistency Management. Module Redesign Transformations implements the extensible

catalog of primitive and complex redesign transformations discussed in Section 5.3. The

Progres development environment [SWZ95] provides a visual editor for graph productions and

transactions which has proven very useful to add redesign transformations to our catalog.

Figure 5.40 depicts a screenshot of the Progres editor that shows an implementation of a new

redesign transformation (MergeParallelAssociations) in order to deal with the optimization

structure detected in our case study (cf. page 21). Such specific transformation can be added

"on-the-fly" to the catalog of available transformations. However, one problem that remains is

that in many cases the resulting idiosyncratic schema dependencies cannot be represented by

the SMG model. This issue does not affect the conceptual schema migration process but it

disables the generation of data integration middleware components for these idiosyncratic parts

of the schema. In these cases the reengineer has the choice of extending the SMG model or

implementing the data integration components for these optimization structures manually.

Figure 5.40. Using the Progres environment to extend module Redesign Transformation

IMPLEMENTING THE VARLET MIGRATOR 159

command

extraction

Obviously, whenever new redesign transformations have been added, they should be made

available at the user interface (the so-called Migration Front-End). In order to avoid manual

changes to the Migration Front-End due to changes in the transformation catalog, we have

implemented a generic command generation mechanism that parses module Redesign

transformations and extracts signatures for all implemented transformations. These signatures

are stored in a text file which is read during start-up of the Migration Front-End to build the list

of available commands. However, a problem of this generic solution is that the generated list of

menu commands soon becomes rather huge and confusing to the user. We solved this problem

by offering context sensitive menus: whenever the user has selected a number of schema

artifacts on the screen we exploit the extracted information about the signatures of commands

to offer only those commands which accept the selected artifacts as parameters.

textual

unparsers

The Varlet Migrator includes several unparsers to generate textual representations of different

parts of the migration graph. We have implemented unparsers for language standards like SQL

[BED94] and ODL [CBB+97] as well as for proprietary formats like object-oriented schema

descriptions for O2 [LR89] and ObjectDRIVER [CER99]. The extraction of textual schema

descriptions from the migration graph is performed by traversing and unparsing the ASGs for

the logical and the conceptual schema, respectively. For this purpose, we employ a Progres

mechanism called derived attributes [Tea99] which is similar to the well-known semantic rules

in attribute grammars [Knu68, Kas80]. The concrete implementation of the derived attributes

for the textual schema descriptions is given by Holle [Hol97].

5.5.2 User interface

Let us revisit the schema migration sample scenario from Section 2.4.2, on page 24 to illustrate

the user interface of the Varlet Migrator. This scenario deals with two iterations among legacy

schema analysis and conceptual schema migration activities. The top section of Figure 5.41

shows the logical schema that is the result of the first analysis activity. This schema contains

our familiar excerpt from the PDIS case study shown in Figure 2.7.

initial translationWhen the user invokes the Varlet Migrator for the first time the current logical schema is

translated into an initial conceptual schema. This is performed according to the translation

algorithm MapSchema (cf. Figure 5.10 on page 126). The screenshot of the Migration Front-

End in the bottom section of Figure 5.41 shows that the product of this initial conceptual

translation still looks similar to the logical schema: basically, each table has been mapped to a

class and each foreign key has been mapped to an association with corresponding cardinality

constraints.

160 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

Figure 5.41. Logical schema after first analysis step (top), initial conceptual translation (bottom)

initial

translation

Analysis

Front-End

Migration

Front-End

IMPLEMENTING THE VARLET MIGRATOR 161

conceptual

redesign

Now, the reengineer can use the catalog of available transformations to redesign and extend the

conceptual schema according to the new requirements. In our sample scenario in Chapter 2, the

reengineer extended the schema by additional classes and associations to store information

about customers and on-line documents (cf. Figure 2.18 on page 26). Figure 5.42 illustrates

how the Varlet Migrator is used to perform these schema modifications. In this picture, we use

grey arrows to indicate some of the redesign transformations performed to the conceptual

schema. The dialog box entitled Execute Command shows that the reengineer is about to

transform class DOCREF into a many-to-many association. Note, that in contrast to our sample

scenario (Figure 2.18) our conceptual data model is restricted to unordered associations only

(cf. page 116).

iterationIn our sample scenario, we assumed that by talking to operators and investigating legacy data,

the reengineer detects four different variants in table PRODREF. Moreover, (s)he finds out that

column manager in table PRODGRP represents a foreign key referencing an alternative key

(sname) of table USER (cf. page 22). Using the Varlet Analyst (s)he can add this information to

the logical schema of PDIS. In the top part of Figure 5.43, we used ovals to mark the

differences between the completed logical schema and the first analysis result in Figure 5.41.

Note, that the reengineer used the filter mechanisms provided by the Analysis Front-End to

hide columns doc2,..,doc5 of the optimization structure in table KEYW.

Figure 5.42. Redesigned conceptual schema (Migration Front-End)

ò ó ô õ ö ó

÷ ø ù ú û ü ù

õ ý ý

ó ô ó ÿ õ

ù ú



÷ ø



ù ú



ÿ ó õ



û û

 

  

ÿ ó



÷ ø



ù ú



ü ù

õ ý ý

  

ý ý

 

û ú



÷ ø ù ú û ü ù

õ ý ý

Migration

Front-End



ó ÿ





õ ÿ õ

ù ù



ý ý

 

û ú



ô ý

ò ó ö

 



ý ó





162 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

change

propagation

Figure 5.43. Completed logical schema (top) and updated logical schema (bottom)

new

new new

Analysis

Front-End

Migration

Front-End

ü ù

õ ý ý

  

ý ý

 

û ú



ü ù

õýý

  

ý ý

 

û ú



IMPLEMENTING THE VARLET MIGRATOR 163

change

propagation

After modifying the logical schema, the reengineer uses the incremental consistency

management mechanism described in Section 5.4 to propagate the changes into the redesigned

conceptual schema. In Varlet, this is done by pressing the Update button in the Migration

Front-End. The bottom section of Figure 5.43 shows that the four variants in table PRODREF

have been mapped to an inheritance structure with superclass PRODREF and three subclasses

PRODREF#1-3. It is the task of the reengineer to determine reasonable names for these

classes. For example, (s)he might rename the superclass to XRef and the subclasses to ProdRef,

ProdGrpRef, and ComGrpRef like in Figure 2.19. In addition, the updated conceptual schema

contains several other changes:

•class DOCREF represents a further subclass of class PRODREF because of the I-IND

between the logical representations of these two classes,

•attribute manager has been removed from class PRODGRP because it only represents a

borrowed key in the logical schema,

•there is a new one-to-many association among classes PRODGRP and Employee because of

the newly detected foreign key manager in table PRODGRP.

Most applied redesign transformations are still valid in the updated conceptual schema. Still,

Figure 5.43 shows that the two applications of transformation ClassToAssociation to classes

DOCREF and PRODREF have been undone. This is because their application condition is

violated for classes that participate in inheritance hierarchies (cf. Figure B.5 on page 199).

Note that the grey arrows which indicate the cancelled transformations do not (yet) belong to

the user interface of the Varlet Migrator. Still, our tool provides the user with a textual update

report including information about all cancelled transformations.

implementation

of extensions

During the conceptual migration activity, the reengineer has made several modifications which

extend the information capacity of the original PDIS schema, e.g., (s)he added new subclasses,

class attributes, and associations. These changes do not have to be implemented manually but

the schema mapping mechanism described in Section 5.2 can be used to extend the logical

schema, automatically. For this purpose, the Analysis Front-End contains an Update button

similar to the Migration Front-End. Figure 5.44 shows the result of this logical schema update.

As specified in mapping rule MapVariantToConcreteClass (on page 127), the new classes have

been mapped to new variants in tables USER and DOCUMENT. All new attributes have been

mapped to columns in these tables and the association master among on-line and off-line

documents has been mapped to a cyclic foreign key master in table DOCUMENT. The SQL

unparser allows the reengineer to retrieve a textual representation of the schema modifications

which can be used to update the LDB schema catalog.

164 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

5.6 Data integration

In general, the output of a conceptual schema migration activity is an abstract design document

for an LDB schema. This documentation facilitates understanding, assessment, and

maintenance of the LDB. The techniques described in the previous sections allow to integrate

schema migration and maintenance activities in an evolutionary and intertwined process. This

helps to solve the well-known problem of keeping the conceptual design up-to-date and

consistent with the current implementation. The conceptual design gains even greater

importance in DBRE projects that aim on migrating LDB applications to new technologies,

programming languages, or architectures. Object-oriented technology is a common standard

for the development of modern cooperative information system infrastructures [Vin97,

CBB+97]. In such projects, the conceptual design is not only used as abstract documentation

but also as an object-oriented view to access the information maintained in the LDB. Such

object-oriented access layers allow to create unified views on heterogeneous component

databases and abstract from low-level implementation details like idiosyncratic data formats

and optimization constructs. By encapsulating the concrete structure of the LDB, they improve

Figure 5.44. Implementation of conceptual extensions (Analysis Front-End)

imple-

mentation

Migration

Front-End

Analysis

Front-End

DATA INTEGRATION 165

the robustness of the entire information system infrastructure w.r.t. the evolution of single

component schemas. Several so-called middleware components and libraries have been

developed to facilitate the development of object-oriented access layers, e.g., [CER99, Obj99b,

Hüs97, ONT96, Rad95]. Most approaches employ proprietary programming languages and

APIs to specify the dependencies among the component schemas and their object-oriented

representations [CER99, Obj99b, Hüs97, Rad95] while other products provide menu-driven

dialog interfaces [Obj99b, ONT96]. However, the problem that prevails with these approaches

is that the reengineer has to specify and maintain these dependencies manually.

ObjectDRIVERThe integrated approach to schema migration developed in this dissertation allows to overcome

this problem. The correspondences implicitly stored in the migration graph during the schema

migration process enable automatic generation of the dependency information necessary for

middleware components. We have chosen the middleware product ObjectDRIVER [CER99]

which has been developed by the CERMICS Database Team at Sophia Antipolis Cedex,

France, to evaluate this approach. ObjectDRIVER provides seamless integration of object-

oriented applications written in Java or C++ with legacy data sources (cf. Figure 5.45).a It

allows to create a ODMG-compliant [CBB+97] object-oriented interface that hides the

concrete database implementation. An OQL (Object Query Language) [CBB+97] interpreter

supports the formulation of adhoc-queries based on the abstract, object-oriented schema.

a Figure 5.45 has been adopted from [CER99] under permission of the CERMICS Database Team.

Figure 5.45. ObjectDRIVER overview

166 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

Integration of

ObjectDRIVER

and Varlet

The integration of the ObjectDRIVER middleware with the Varlet schema migration

environment is illustrated in Figure 5.46. Based on the information maintained in the migration

graph, Varlet generates textual descriptions for both schemas and their interdependencies

which are required as the input for ObjectDRIVER. In the following, we will use our DBRE

sample scenario to exemplify the structure of these textual descriptions and to describe how the

necessary information is extracted from the migration graph.

5.6.1 Generating descriptions for relational and object-oriented schemas

The textual schema descriptions for ObjectDRIVER have to be in a proprietary format that does

not comply to any common standard like SQL DDL [BED94] or ODL [CBB+97]. Still, the

format for the relational schema description is similar to data definitions in standard SQL.

Figure 5.47 illustrates this format for the eight RS considered in our sample scenario. The

specification of a primary key for each RS is mandatory. Those columns of an RS which belong

to such a key are marked by the suffix keyPart. Note, that we had to eliminate the optimization

structure in table KEYW because the mapping mechanism provided by ObjectDRIVER lacks

the necessary flexibility to access it (cf. page 21). An alternative solution that allows to keep

(more important) optimization structures is to disregard the corresponding RS during the

middleware generation and program the necessary data access functionality manually

afterwards.

Figure 5.46. Integration of the ObjectDRIVER middleware generator as a back-end for Varlet

ObjectDRIVER

middleware component

Varlet

schema migration environment

Relational

Schema

Description

Object

Schema

Description

Mapping

Description

DATA INTEGRATION 167

Figure 5.48 presents the ObjectDRIVER schema description for the conceptual view on the

MIS sample schema. The notation is very similar to schema definitions for the purely object-

oriented database O2 [O2 93]. Associations among classes can be implemented either as single

references or as pairs of references using the keyword inverse to specify their correspondences.

As described in Section 5.5.1, we employ derived text attributes to extract both textual schema

descriptions from the Varlet migration graph. In contrast to the generation of the schema

mapping description, this unparsing mechanism is simple and straight forward because we only

need to consider the syntactical structure of the logical and the conceptual representation.

5.6.2 Generating object-relational mapping descriptions

The schema mapping description for ObjectDRIVER is not represented in a separate file but it

is an extension of the object-oriented schema description by additional mapping directives. The

Varlet schema mapping graph stores the information needed to generate these mapping

directives (cf. Section 5.1.2, on page 119). Analogously to the generation of schema

descriptions, we use derived text attributes to unparse this textual information. However, the

derivation rules of such text attributes are less suited to facilitate understanding of our

approach, because they include many conditionals. Therefore, in the following, we specify the

extraction of the mapping description with Progres graph tests (cf. page 118) and we use our

DBRE case study to exemplify the generation of each different mapping construct for

ObjectDRIVER.

define table USER {

usrid string(10) keyPart,

name string(50),

trusted boolean,

dpt string(18),

company string(255),

sname string(18),

addr string(40),

telo string(18),

telp string(18)

};

define table KEYW {

keyw string keyPart,

doc integer

};

define schema MIS {

relationalDbms DB2;

define table COMGRP {

cgid integer keyPart,

name string(18)

};

define table PRODGRP {

cg integer keyPart,

manager string(40),

pg integer keyPart,

grpname string(18)

};

define table PRODREF {

id integer keyPart,

pg integer,

prod integer,

cg integer,

doc integer keyPart

};

Figure 5.47. Relational schema description for ObjectDRIVER

define table DOCUMENT {

docno integer keyPart,

dname string(255),

valid string(8),

rd integer,

archive string(80),

master integer,

author string(255),

usr string(30),

format integer,

contents octet

};

define table DOCREF {

id integer keyPart,

sdoc integer keyPart,

tdoc integer

};

define table PRODUCT {

name string(50),

no integer keyPart,

pg integer keyPart,

cg integer keyPart

};

168 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

classes and

subclasses Classes without a generalization are mapped to so-called base tables. In ObjectDRIVER, this

mapping is defined by the class name followed by the keyword on and the name of the base

table (cf. class XRef in Figure 5.49). Subclasses are automatically mapped to the same base

table like their generalizations. Textual constraints are used to specify which database entries

qualify as valid members of a certain subclass. Figure 5.49 illustrates this concept for the four

subclasses of class XRef. These subclasses logically correspond to the four different variants of

entries in table PRODREF (cf. page 20). For example the constraint for class PrdGrpRef in

Figure 5.49 specifies that only those tuples with a null-value in column prod but with valid

values in columns cg and pg represent product group references.

class ComGrpRef inherit XRef

type Tuple (

ref CommodityGroup

)

class ProdGrpRef inherit XRef

type Tuple (

ref ProductGroup

)

class ProdRef inherit XRef

type Tuple (

ref Product

)

class Document

type Tuple (

title String,

number integer,

validUntil String,

author String,

confidential integer,

respEmp Employee,

xrefs Set(XRef)

inverse XRef.refBy,

refBy Set(DocRef)

inverse DocRef.ref,

keyword Set(Keyword)

inverse Keyword.docs

)

class OnlineDocument inherit Document

type Tuple (

contents octet,

format integer

)

class User

type Tuple (

name String,

addr String,

telephone Telephone

)

class Employee inherit User

type Tuple (

shortName String,

trusted boolean,

worksFor Department

)

class Customer inherit User

type Tuple (

company String

)

class Telephone

type Tuple (

office String,

private String

)

class XRef

type Tuple (

no integer,

refBy Document

)

class DocRef inherit XRef

type Tuple (

ref Document

)

class Department

type Tuple (

deptName String

)

class OfflineDocument inherit Document

type Tuple (

archive String

)

class CommodityGroup

type Tuple (

name String,

id integer

prodGrps Set(ProductGroup)

inverse

ProductGroup.comGrp

)

class ProductGroup

type Tuple (

name String,

id integer,

comGrp CommodityGroup,

manager Employee,

)

class Product

type Tuple (

name String,

number integer,

prodGrp ProductGroup

)

class Keyword

type Tuple (

keyw String,

docs Set(Document)

inverse Document.keywords

)

Figure 5.48. Object schema description for ObjectDRIVER

DATA INTEGRATION 169

Figure 5.50 shows the graph test that can iteratively be called for each class cl to extract the

necessary information from the migration graph. If cl has a generalization it is matched to the

optional node ‘5 and returned in parameter supercl. The base table is identified as node ‘3 by

traversing the m_id edge and the m_lk to the primary key of the corresponding RS in the logical

schema (cf. Figure 5.2 on page 117). All variants that have been mapped to cl are collected in

node set ‘2. Node set ‘6 represents all columns that are common to the variants in node set ‘2.

These columns have to carry valid values (not null) in order to qualify for instances of class cl.

On the other hand, the set of columns that have to carry null values for instances of cl is

returned in parameter nullCols. This set is defined by all columns of the base table minus all

columns that are includes by any variant mapped to cl (node set ’4).

Figure 5.49. Mapping description for classes and subclasses

class ProdGrpRef inherit XRef

type Tuple (

ref ProductGroup,

constrainedBy((PRODREF.cg != NULL)

&& (PRODREF.pg != NULL)

&& (PRODREF.prod == NULL)

)

class ProdRef inherit XRef

type Tuple (

ref Product,

constrainedBy((PRODREF.cg != NULL)

&& (PRODREF.pg != NULL)

&& (PRODREF.prod != NULL)

)

class XRef on PRODREF

type Tuple

(

no integer,

refBy Document

)

class DocRef inherit XRef

type Tuple (

ref Document,

constrainedBy((PRODREF.pg == NULL)

&& (PRODREF.prod == NULL)

&& (PRODREF.cg == NULL))

)

class ComGrpRef inherit XRef

type Tuple (

ref CommodityGroup,

constrainedBy((PRODREF.cg != NULL)

&& (PRODREF.pg == NULL)

&& (PRODREF.prod == NULL)

)

test ClassInstantiationConstraint( cl : Class ; out rs : RS ; out nnCols : Column [0:n] ;

out nullCols : Column [0:n] ; out supercl : Class [0:1])

condition ‘2=‘1.<-m_cl-.-m_v->;

return rs := ‘3;

nnCols := ‘6;

nullCols := (‘3.-c_v->.-c_col->) but not ‘4;

supercl := ‘5;

‘4 : Column

<-m_cl-

& -m_v->

<-m_cl-

& -m_v->

& -c_col->

-m_id->

& -m_lk->

& <-c_pk-

c_v

‘3 : RS

‘5 : Class

<-sub-

& -sup->

‘1 = cl

‘6 : Column

c_col

‘2 : Variant

Figure 5.50. Test getClassInstantiationConstraint

170 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

base table

attributes Mappings of attributes which correspond to columns in the base table are described by simply

adding the key word on followed by the qualified name of corresponding columns

(cf. Figure 5.51). The graph test to check the validity of this mapping for each attribute attr is

presented in Figure 5.52. Node ‘5 represents a negative application condition which ensures

that attr is not mapped over a foreign key to a column in a different table. If this condition is

fulfilled the corresponding column in the base table is returned in parameter col.

remote attributes If a class contains attributes which belong to (remote) tables different from the corresponding

base table these tables have to be joined. So far, our sample scenario does not include such a

situation. However, let us assume a situation where MIS users who are managers have been

mapped to a specialization of class Employee named Manager. Furthermore, let us assume that

managers have an additional attribute secretariate which has been relocated via association

manager to class ProductGroup using the conceptual redesign transformations MoveAttribute

and RenameAttribute (cf. Section 5.3.2, on page 138). This scenario is illustrated on the left-

hand side of Figure 5.53. Its right-hand side contains the corresponding mapping description

for ObjectDRIVER. It shows that the relocated and renamed attribute contactInfo of class

ProductGroup is mapped to column secretariate of table USER. Both tables are joined over the

foreign key that is the logical representation of association manager.

Figure 5.51. Mapping description for

base table attributes

class Document on DOCUMENT

type Tuple (

title String

on DOCUMENT.dname,

number integer

on DOCUMENT.docno,

validUntil String

on DOCUMENT.valid,

author String

on DOCUMENT.author,

confidential integer

on DOCUMENT.rd,

...

)

test getAttrMappedToColInBaseTable( attr : Attribute ; out col : Column)

return col := ‘3;

‘5 : MapRIND

‘3 : Column

<-c_att-

& -m_id->

& -m_lk->

& <-c_pk-

‘1 = attr

m_am_col a_via

‘4 : MapCol

-c_v->

& -c_col->

‘2 : RS

Figure 5.52. Test getAttrMappedToColInBaseTable

DATA INTEGRATION 171

The graph test that specifies the extraction of the information needed to generate the mapping

description for each remote attribute attr is presented in Figure 5.54. The foreign key that is

used for the join is matched to node ‘5 by traversing edges a_via and m_rind from the column

mapping node ‘4. The remote table itself is represented by node ‘8 and returned in output

parameter remTab, while the join columns in both tables are returned in parameters k and fk,

respectively. Note, that generally it is also possible to relocate an attribute over more than one

association. In this case, more than one R-IND node can be matched to node ‘5 and several

joins have to be generated for the ObjectDRIVER mapping description. This situation cannot

be specified with one single graph test but additional control structures are necessary to extract

this information.

Figure 5.53. Mapping description for remote attributes

class ProductGroup on PRODGRP

type Tuple (

contactInfo String

on u1<USER>.secretariate

) join PRODGRP to u1 by

(u1.sname==PRODGRP.manager)

)

Employee

Manager

secretariate: String

ProductGroup

Employee

ProductGroup

contactInfo: String

Manager

            

              

manager

test getAttrMappedToColInRemoteTable( attr : Attribute ; out col : Column ; out remTab : RS ;

out fk : Column [1:n] ; out k : Column [1:n]) =

return col := ‘3;

remTab := ‘8;

fk := ‘7;

k := ‘6;

<-c_att-

& -m_id->

& -m_lk->

& <-c_pk-

‘1 = attr

m_am_col

-a_via->

& -m_rind->

‘4 : MapCol

‘6 : Column

‘7 : Column

<-c_col-

& <-c_v-

& -c_v->

& -c_col->

‘3 : Column

-c_v->

& -c_col->

‘2 : RS

-c_k->

& -c_kc->

-c_f->

& -c_c->

‘5 : R_IND

-c_v->

& -c_col->

‘8 : RS

Figure 5.54. Test getAttrMappedToColInRemoteTable

172 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

base table

relationships Relationships that have been created by splitting classes in the conceptual schema are

represented in the ObjectDRIVER mapping description as a cyclic join over the key column(s)

of the corresponding base table. In our case study, we have split class Employee in two classes

Employee and Department with an association worksFor (cf. Figure 5.42 on page 161). The

generated mapping information for reference worksFor in class Employee is given in

Figure 5.55.

The graph test in Figure 5.56 specifies that the given relationship rel may not be mapped to

foreign keys and that its source and target class have to be mapped to the same base table (node

‘4). If the test succeeds it returns the set of all primary key columns in output parameter k.

remote

relationships Similar to the mapping description for remote attributes, we have to add joins to the mapping

description of relationships if they have been mapped to foreign keys in the migration graph.

Depending on the cardinalities of such remote relationships the corresponding references in the

participating source and target classes are either declared to be set-valued or single valued.

Further cardinality constraints like specific limits and totality constraints have to be checked in

the application code which can also be generated. As an example, Figure 5.57 shows the

mapping description for the one-to-many association referencedBy among classes Document

and XRef (cf. Figure 2.19 on page 27). This association is represented as a set-valued reference

xrefs in class Document which is inverse to a single valued reference refBy in class XRef.

Figure 5.55. Mapping description for

base table relationships

class Employee inherit User

type Tuple (

shortName String

constrainedBy(USER.sname != NULL),

trusted boolean

on USER.trusted,

worksFor Department

on u1<USER>.usrid

) join USER to u1 by (u1.usrid == USER.usrid)

test getRelMappedToBaseTable( rel : Relationship ; out k : Column [1:n]) =

return k := ‘6;

src

tar

-m_id->

& -m_lk->

& <-c_pk- ‘1 : Class

-m_id->

& -m_lk->

& <-c_pk- ‘2 : Class

‘5 : MapRIND

<-m_r-

& -r_via-> ‘3 = rel

‘6 : Column

-c_pk->

& -c_kc->

‘4 : RS

Figure 5.56. Test getRelMappedToBaseTable

DATA INTEGRATION 173

Obviously, different graph tests are needed to check for the various possible cardinalities of

relationships. With respect to our example, Figure 5.58 shows the graph test that validates a

one-to-many relationship rel and retrieves the necessary information about the foreign key

which is mapped to rel. The defined folding clause is necessary to allow for cyclic joins, i.e.,

that rel has the same class as its source and target.

IND-based

inheritance

Finally, we have to specify how inheritance relationships that have been mapped to inclusion

dependencies (I-INDs) are represented in ObjectDRIVER mapping descriptions. Our sample

case study includes such a situation for the specialization DocRef of class XRef (cf. Figure 5.42

on page 161). Figure 5.59 shows that this constellation is represented by adding a join over the

foreign key columns between both participating tables in the relational schema. The

corresponding graph test in Figure 5.60 is very similar to the previous test in Figure 5.58. The

complete ObjectDRIVER mapping description for the seventeen classes in our case study is

summarized in Figure 5.61.

Figure 5.57. Mapping description for remote relationships

class Document on DOCUMENT

type Tuple (

xrefs Set on x1<XRef> (

aRefXRef

on PRODREF.id

join DOCUMENT to x1

by(DOCUMENT.docno == x1.doc)

) inverse XRef.refBy

...)

class XRef on PRODREF

type Tuple

(

...

refBy Document

on d1<DOCUMENT>.docno

) join PRODREF to d1

by(PRODREF.doc == d1.docno)

test getRelMappedToRemoteTable( rel : Relationship ; out k, fk : Column [1:n])

folding { ‘1, ‘2 }, { ‘4, ‘5 };

condition ‘3.tarcard # 1;

‘3.srccard = 1;

return k := ‘8;

fk := ‘6;

‘6 : Column

src

tar

-m_id->

& -m_lk->

& <-c_lk-

‘1 : Class

-m_id->

& -m_lk->

& <-c_pk-

‘2 : Class

<-m_r-

& -r_via->

& -m_rind-> ‘3 = rel

‘8 : Column

-c_f->

& -c_c->

-c_k->

& -c_kc->

‘7 : R_IND

-c_v->

& -c_col->

‘5 : RS

-c_pk->

& -c_kc->

‘4 : RS

Figure 5.58. Test getRelMappedToRemoteTable

174 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

object-oriented

application code The current version of ObjectDRIVER (1.1) does not yet provide further support for the

generation of object-oriented application code in Java or C++. This means that the

programmer is responsible to define all application classes with their methods. The

ObjectDRIVER data integration mechanism requires that all class properties (attributes and

references) are accessed exclusively by method calls (set and get accessor methods). Besides

the usual reasons for encapsulation, this is especially important because ObjectDRIVER creates

run-time objects from persistent (relational) data on demand only, i.e., when the corresponding

resource is accessed. This is illustrated by the sample application code for class Document in

Figure 5.62. The first statement in each accessor method is a call to the predefined method

getObject() which initiates that the object is filled with the actual data maintained in the

relational LDB. Before the first call to getObject() the object is represented by a proxy. This

lazy data migration strategy is needed to avoid efficiency problems that could otherwise be

caused by the eager generation of huge object structures due to a large amount of data. It is

possible and desirable to generate application classes with such canonical accessor methods

automatically. We have implemented such a generator for a different ODMG middleware

[Sch98].

Figure 5.59. Mapping description for IND-based

inheritance relationships

class DocRef inherit XRef

type Tuple (

ref Document

on d1<DOCUMENT>.docno

)join DOCREF to PRODREF

by((DOCREF.id == PRODREF.id)

&& (DOCREF.sdoc == PRODREF.doc)),

join d1 to DOCREF

by(d1.docno=DOCREF.tdoc)

test getInheritMappedToI_IND( inherit : Inheritance ; out k, fk : Column [1:n]) =

return k := ‘8;

fk := ‘6;

‘8 : Column

‘6 : Column

sup

sub

-m_id->

& -m_lk->

& <-c_pk-

‘1 : Class

-m_id->

& -m_lk->

& <-c_pk-

‘2 : Class

-c_f->

& -c_c->

-c_k->

& -c_kc->

-c_ak->

& -c_kc->

‘5 : RS

-c_ak->

& -c_kc->

‘4 : RS

‘7 : I_IND

<-m_i_in-

& -m_iind-> ‘3 = inherit

Figure 5.60. Test getInheritMappedToI_IND

DATA INTEGRATION 175

 

    ! " # $ % & '   (     )   *

 + ,  -  ,



 .

  * /   0 



 1    ,

  , 2 3 /  4 5 1  / 6 7 , 8



         0 9 + . . /  4 5  : ; 7



8 < = > ? @ @ A

B B . /  4 5  : ; 7 , 8 < = > ? @ @ A

B B . /  4 5  : ; 7 ,   0 = = > ? @ @ A

A C    , 2   /  4 5  : ;

 + . . , 2 7



8 = = /  4 5  : ; 7



8 A

B B . , 2 7 , 8 = = /  4 5  : ; 7 , 8 A A

 

    ! " % & '   (     )   *

 + ,  -  ,



 .

  * /   0 





  , 2 3 /  4 5 ? D - 6 7  



         0 9 + . . /  4 5  : ; 7



8 < = > ? @ @ A

B B . /  4 5  : ; 7 , 8 < = > ? @ @ A

B B . /  4 5  : ; 7 ,   0 < = > ? @ @ A

A C    , 2   /  4 5  : ;

 + . . , 2 7



8 = = /  4 5  : ; 7



8 A

B B . , 2 7 , 8 = = /  4 5  : ; 7 , 8 A

B B . , 2 7   = = /  4 5  : ; 7 ,   0 A A

 

   E ! F G H & I J   5 4 D ?  : > -

 + ,  -  ,



 .

  



 K     8

  5 4 D ?  : > - 7 0     L

          8  

  5 4 D ?  : > - 7 0 



  L

 



 0 ?   



K     8

  5 4 D ?  : > - 7  



 0 L

   (   K     8

  5 4 D ?  : > - 7    (   L



  M 0     



    8  

  5 4 D ?  : > - 7  0 L

   , :  , :  ,



 +  

   2 3 ? K :  6 7     0 L

  *  K    

2 3 )   * 6 .

   * )   *

  /  4 5  : ; 7  0

C    5 4 D ?  : > -  

 + . 5 4 D ?  : > - 7 0 



  = =

2 7 0 



A        )   * 7   * 9 + L

  * 9 + K     0 2 3 5 



  * 6 .

   * 5 



  *

  /  4 5  : ; 7  0

C    5 4 D ?  : > -   0 2

 + . 5 4 D ?  : > - 7 0 



  = = 0 2 7  0 



A        5 



  * 7   *

O  + P   0 K     O 2 3 Q  + P   0 6 .

 O  + P Q  + P   0

  Q : R S 7 O  + P

C    5 4 D ?  : > -   O 2

 + . 5 4 D ?  : > - 7 0 



  = = O 2 7 0 



A        Q  + P   0 7 0 





A C     2   5 4 D ?  : > -

 + .  2 7     0 = = 5 4 D ?  : > - 7    A

 

   E & $ T J H & I J   ? K : 

 + ,  -  ,



 .

0  ,  >    K     8

  ? K :  7 0 , 

 

   U V &   ? K : 

 + ,  -  ,



 .

    K     8

  ? K :  7     L



 8   K     8

  ? K :  7



 8   L

 0 0  K     8

  ? K :  7  0 0  L

 



 , (    - 



 ,   

  ? K :  7     0

 

   W H $ X ! Y & &   (     ?   

 + ,  -  ,



 .

 (    >    K     8



         0 9 + . ? K :  7      < = > ? @ @ A L

      0   



  

  ? K :  7       0 L

P   O  ;   5  ,       

  0 2 3 ? K :  6 7     0

A C    ? K :    0 2  + . 0 2 7     0 = = ? K :  7     0 A

 

   Z G V J ! H &   (     ?   

 + ,  -  ,



 .



  ,   + K     8



         0 9 + . ? K :  7



  ,   + < = > ? @ @ A

 

   [ & X & $ \ ! I &   ? K : 

 + ,  -  ,





 * M



 K     8

  ? K :  7  



 L

,       K     8

  ? K :  7  



 

   ] % & '   /  4 5  : ;

 + ,  -  ,





      8  

  /  4 5  : ; 7  0 L

  * 9 + 5 



    

  0 2 3 5 4 D ?  : > - 6 7 0 



 

A C    /  4 5  : ;   0 2

 + . /  4 5  : ; 7 0 



= = 0 2 7 0 



  A

 

   E ! F % & '   (     )   *

 + ,  -  ,



 .

  * 5 



    

  0 2 3 5 4 D ?  : > - 6 7 0 



 



         0 9 + . . /  4 5  : ; 7 , 8 = = > ? @ @ A

B B . /  4 5  : ; 7 ,   0 = = > ? @ @ A

B B . /  4 5  : ; 7



8 = = > ? @ @ A A

A C    5 4 D  : ;   /  4 5  : ;

 + . . 5 4 D  : ; 7  0 = = /  4 5  : ; 7  0 A

B B . 5 4 D  : ; 7  0 



= = /  4 5  : ; 7 0 



A A L

C    0 2   5 4 D  : ;

 + . 0 2 7 0 



  = 5 4 D  : ; 7  0 



 

   Z ! H # $ % & '   (     )   *

 + ,  -  ,



 .

  * D     0   + 1    ,

 



2 3 D 4  1  / 6 7



         0 9 + . . /  4 5  : ; 7



8 < = > ? @ @ A

B B . /  4 5  : ; 7 , 8 = = > ? @ @ A

B B . /  4 5  : ; 7 ,   0 = = > ? @ @ A

A C   



2   /  4 5  : ;

 + .



2 7



8 = /  4 5  : ; 7



8 A

 

   ^ I X _ I & E ! F G H & I J   (     5 



    

 + ,  -  ,



 .



       



  



         0 9 + . 5 4 D ?  : > - 7



       < = > ? @ @ A L

*          8  

  5 4 D ?  : > - 7 *     

 

   ^ ' ` _ I & E ! F G H & I J   (     5 



    

 + ,  -  ,



 .

 



(    K     8



         0 9 + . 5 4 D ?  : > - 7  



(    < = > ? @ @ A

 

   Z ! H H ! " _ J Y # ! G $   D 4  1  /

 + ,  -  ,



 .

    K     8

  D 4  1  / 7     L

 0     8  

  D 4  1  / 7



8  0 L

,   0 1  ,  K     , 2 3 /   0 



 1    , 6 .

 / 1 /   0 



 1    ,

  /  4 5 1  / 7 , 8

C    D 4  1  /   , 2

 + . D 4  1  / 7



8  0 = = , 2 7



8 A

A        /   0 



 1    , 7



  1  ,

 

    ! " G F J # ! G $   /  4 5 1  /

 + ,  -  ,



 .

    K     8

  /  4 5 1  / 7 8  ,     L

 0     8  

  /  4 5 1  / 7 , 8 L



  1  , D     0   + 1    ,

 



2 3 D 4  1  / 6 7



8  0

    8   :  ,



 +  

   2 3 ? K :  6 7     0

A C    /  4 5 1  /    2

 + .  2 7      = = /  4 5 1  / 7     8   A L

C    /  4 5 1  /  



 + .



2 7



8  0 = = /  4 5 1  / 7



8 A

 

    ! " G F J   /  4 5 ? D -

 + ,  -  ,



 .

    K     8

  /  4 5 ? D - 7     L

          8  

  /  4 5 ? D - 7   L

,   0 1  , /   0 



 1    ,

  , 2 3 /  4 5 1  / 6 7 , 8

A C    /  4 5 ? D -   , 2

 + . . , 2 7



8 = = /  4 5 ? D - 7



8 A

B B . , 2 7 , 8 = = /  4 5 ? D - 7 , 8 A A

 

   a & Y b ! "   Q : R S

 + ,  -  ,



 .

O  + P K     8

  Q : R S 7 O  + P L

0 



 K     0 2 3 5 



     6 .

 5 



5 



    

  5 4 D ?  : > - 7 0 



 

C    Q : R S   0 2

 + . Q : R S 7 0 



= = 0 2 7 0 



  A

A        5 



     7 O  + P   0 

Figure 5.61. Mapping Description for ObjectDRIVER

176 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

5.7 Evaluation

During the last three years we iteratively implemented, evaluated, and refined our approach to

conceptual schema migration and data integration. In 1996, we implemented the idea of using

triple graph grammars to describe the translation between logical and conceptual database

schemas in a first prototype of the Varlet Migrator [JSZ96]. The major motivation for this

approach was that experiments with existing tools for schema migration and data integration

showed that they provided too little flexibility for alternative schema mappings [ONT96,

Sie98]. We had the hypothesis that by using triple graph grammars to define and generate

schema translators, we would obtain a database migration environment that is easily extensible

w.r.t. alternative schema mapping rules. Moreover, the triple graph grammar approach to

incremental document integration introduced by Lefering and Schürr [LS96] seemed suitable

to overcome the inability of current tools to cope with iterations among schema analysis and

migration activities.

experiences with

triple graph

grammars

We evaluated our first prototype with small application examples and discussed the concepts

with other researchers and practitioners in this domain [JSZ97a, JSZ97b]. The ability of the

prototype to propagate incremental changes in the logical schema to the conceptual schema

and vice-versa received broad attention. In order to increase the flexibility of our schema

translation tool, we defined many alternative mapping rules. A drawback of this approach was

that it became increasingly difficult for the user of our tool to comprehend all possible

alternative translations [Wad98]. Inspired by the research of Hainaut et al. on transformation-

based database reverse engineering [HTJC94], we discovered that it is significantly easier for

the user to invoke redesign operations on a given conceptual schema than to select from many

public Employee getRespEmp() {

getObject();

return respEmp;

}

public Set getReferencedProducts() {

getObject();

SetOfObjects prods = new SetOfObjects();

Enumeration e = xref.elements();

ProdRef pr;

While (e.hasMoreElements()) {

try {

pr = (ProdRef) e.get();

prods.add(pr.getRef());

} catch (Exception) {};

e.next();

}

return prods;

}

...

}

public class Document extends ObjectDRIVERObject {

private String title;

private int number;

private String validUntil;

private String author;

private boolean confidential;

private Employee respEmp;

private Set xrefs;

private Set refBy;

private Set keywords;

private transient int status;

public Document() {

}

public getTitle() {

getObject();

return title;

}

public setTitle(String aName) {

getObject();

title=aname;

}

Figure 5.62. MIS application code (example)

EVALUATION 177

alternative translations from a logical to a conceptual schema. Hence, we decided to combine

an automatic initial schema translation step (defined by a limited set of triple graph grammar

rules) with an interactive conceptual redesign phase. This approach greatly improved the

usability of the Varlet Migrator but it also required the development of additional mechanisms

for change propagation to retain the tool’s ability to cope with process iterations. We developed

the concept of the history graph to meet this requirement (Section 5.4).

case studyThis extended version of the Varlet Migrator has been tested and refined in the context of an

industrial project in collaboration with two German companies. The analyzed logical schema

included 85 tables, 347 attributes, and 138 INDs. The automatic initial translation to the

conceptual data model took 2.5 minutes on a SUN Ultra-Sparc II with 300Mhz processor. In

experiments with several (internal and external) users, we have validated the advantages of the

proposed automatic change propagation mechanism to support process iterations. The most

frequent changes of the logical schema have been due to additional INDs or changed semantic

classifications of INDs. Depending on how many applied redesign transformations have been

affected by a given change, the propagation time ranged from about 30 seconds up to minutes.

The users considered this performance as satisfactory compared to the alternative of validating

and re-establishing the consistency, manually. Furthermore, all of them appreciated the

reliability of using a persistent graph repository and accepted to trade some of the run-time

performance for having the advantage of a recovery mechanism after a crash of the Varlet

Migrator. One common point of criticism was that the current version of our tool does not

preserve layout information (for different views) for those increments which have been

affected by the change. Still, this weakness is not an inherent characteristic of our approach but

we have chosen default layout information to simplify our implementation. Currently, we are

working on a new version of the Varlet Migrator that overcomes this problem.

middleware

generation

The possibility of using the dependency information maintained in the schema mapping graph

to generate middleware components for data integration is self-evident. Still, we had to find a

data structure that provides suitable flexibility for alternative schema mappings but is simple

enough to facilitate its maintenance and interpretation. For our first experiments, we developed

an own middleware generator as a test bed to conduct experiments with different data

structures [Sch98]. Subsequently, we investigated the possibility of integrating existing

commercial middleware generators as a back-end to our DBRE environment. We selected

ObjectDRIVER [CER99] because it has been freely available for research purposes and it

provides suitable flexibility to deal with legacy schemas. Extracting the schema mapping

description for ObjectDRIVER was possible with little effort and without any modifications to

our migration graph structure. Hence, we are confident that other middleware products can be

integrated, likewise.

178 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

5.8 Related work

According to the two main aspects covered in this chapter (schema migration and data

integration), we split the discussion of related work in two subsections: the following section

compares related approaches w.r.t. to their support for schema migration and consistency

management, whereas Section 5.6 covers the aspect of data integration.

5.8.1 Conceptual schema migration and consistency management

Vossen and

Fahrner

Behm et al.

For more than one decade, many approaches to conceptual schema migration have been

developed based on algorithms that perform canonical translations of logical to conceptual

schemas [NA87, BDH+87, JK90, MM90, SK90, And94, PKBT94, MCAH95, RH97, Fon97].

Recently, several critics have stated that these approaches provide little flexibility for different

possible schema mappings. Because of this problem, Vossen and Fahrner suggest a further

manual redesign phase after the canonical translation [FV95]. Behm et al. propose an

interactive schema migration environment that provides a set of alternative schema mapping

rules [BGD97]. In an iterative process, the reengineer chooses an adequate mapping rule for

each schema artifact that has to be mapped. This approach is similar to our migration

environment in its early stages [JSZ96]. However, we discarded this approach for several

reasons: we made the experience that in order to achieve a reasonable flexibility for alternative

schema mappings, the set of mapping rules became very large. User experiments showed that

with a growing number of alternatives it became increasingly difficult for the reengineer to

grasp the semantics of the different mapping rules and choose the best alternative. It turned out

that it is much easier for the reengineer to redesign an initial conceptual translation of the

logical schema than having to think of alternative mappings between the logical schema and

the conceptual schema, explicitly.

Jeusfeld and

Johnen Jeusfeld and Johnen propose an approach to schema migration that employs a generic meta

model as mediator [MAJ94]. This meta model includes general modeling concepts like objects,

types, and links with different cardinality. The schema migration process is performed as

follows. In a first step, the concepts of the concrete data model of the LDB are classified in

terms of concepts of the meta model. The same is done for the target data model. The

classification of the source data model is the basis to map all LDB schema artifacts to

equivalent artifacts in the meta model. Analogously, the classification of the target data model

is used to map this meta schema back to an equivalent schema in the target data model. These

mapping steps are performed in an interactive process and the tool prompts the reengineer in

case of ambiguities. Even though the idea of a common meta model as a mediator among

different concrete data models is appealing, the advantages of the described approach over a

direct translation are questionable. This is because Jeusfeld and Johnen evaluated their

approach only for the translation of relational schemas to ER schemas.

Hainaut et al. Hainaut et al. propose to skip the initial translation step completely and use a common generic

data model that subsumes conceptual constructs as well as logical (and physical) constructs

[HHHR96, Hai89]. Based on this common data model Hainaut et al. have defined a catalog of

schema transformations which are used to gradually replace low-level implementation

constructs by more abstract concepts [HTJC94]. An implicit assumption behind this approach

is that all relevant information about the legacy schema is available at the beginning of the

RELATED WORK 179

migration process. In this dissertation, we have argued that this assumption is unrealistic

because, in practice, iterations among analysis and migration activities might occur for

different reasons. However, the execution of in-place transformations (as suggested by

Hainaut) impede such iterative DBRE processes because the original LDB schema is lost

during the migration process. A possibility to overcome this limitation is to make an initial

copy of the LDB schema and perform all transformations on this copy. This initial copy

operation can be implemented as (very simple) initial schema mapping transformations and,

thus, the consistency management mechanism defined in this thesis can be used to enable

iterations.

problem of

consistency

The problem of consistency management in case of DBRE process iterations is not adequately

solved in any of the above approaches. As exemplified in the previous paragraph, the

mechanism for incremental change propagation, which has been developed in this dissertation,

can be used with little modifications to complement these approaches and overcome this

limitation. None of the above DBRE tools supports automatic propagation of extensions made

to the conceptual schema back into the original implementation.

problem of

idiosyncrasies

Most approaches referenced above presume a logical schema in third normal form [EN94].

Some authors argue that this requirement can always be satisfied by inserting a preprocessing

(normalization) step before migrating the schema [FV95]. However, this solution is not

feasible for unforeseen idiosyncratic optimization patterns [BP95]. Hence, it is important that a

DBRE tool can easily be adapted to deal with such patterns. Most existing tools do not provide

the necessary adaptability, because their schema migration process and mapping rules are hard-

coded in general programming languages. A notable exception that employs a dedicated

language to describe transformation systems (TXL) has been developed by Cordy et al.

[MCAH95]. Still, such textual transformation patterns are significantly harder to formulate and

comprehend than graphical transformation rules. Because of this reason many authors have

used diagrams to communicate their transformation rules to their readers, e.g., [BP96,

HTJC94, BCN92, Tre95]. By choosing graph grammars, the approach presented in this thesis

combines the expressiveness of diagrams with the executability of formal replacement systems.

The Progres graph grammar engineering environment allows the reengineer to specify

additional mapping rules and redesign transformations. This facilitates to add further mapping

rules to deal also with denormalized RS, e.g., the rules described by Ramanathan and Hodges

[RH97].

problem of

variant structures

The formal definition and automatic translation of variant structures in LDB schemas to

inheritance structures in the conceptual model is new in our approach. Other existing DBRE

tools do not consider variant structures even though they are broadly used in forward

engineering relational database schemas [HHEH96, BCN92].

5.8.2 Data integration

Behm et al.

Fong

Only a few of the approaches to schema migration also tackle the problem of data integration.

Behm et al. [BGD97] and Fong [Fon97] aim on a complete replacement of relational by object-

oriented databases. Based on the schema correspondences created in the schema migration

step, they present algorithms to migrate the data in a batch-oriented process. Due to our

experience, a complete replacement of relational by object-oriented database platforms is often

not desired, not viable, or implies a significant risk. Hence, there has been an increasing

180 CONCEPTUAL SCHEMA MIGRATION AND DATA INTEGRATION

industrial demand for approaches to wrap and integrate LDB systems with modern

technologies.

Hainaut et al. In the InterDB project [THB+98], Hainaut et al. use their transformation-based approach to

schema migration to generate the data integration wrappers for LDBs [TCHH99]. Their

approach is based on the definition of data conversion operations (instance mappings) for all

schema transformations. A logging mechanism records all schema transformations which have

been applied during the interactive schema migration and redesign phase. This history log is

the basis to generate a data conversion program which consists of a concatenation of the

instance mappings of all applied transformations. The main difference to the approach

described in this dissertation is that we maintain explicit schema dependencies in a schema

mapping graph (SMG). This explicit information allows us to generate declarative schema

mapping descriptions as an input for various commercial off-the-shelf (COTS) middleware

products. This is not possible or at least problematic for Hainaut’s approach because schema

correspondences are implicitly defined in operational data conversion programs.

COTS middleware

Web-gateways

Examples for commercial middleware products which require declarative textual schema

mapping descriptions are Ardent Software’s Java-Relational-Binding [Gre98], ObjectDRIVER

[ObjDrv99], OpenDM [Sie98], and CocoBase [Tho99]. Other products also provide graphical

user interfaces to build schema mappings, e.g., Object Integration Server [ONT96],

ObjectMatter [Obj99b], and TOPLink [Obj99a]. The common aim of these products is to wrap

LDB applications with a modern API that facilitates integration with object-oriented,

distributed, and platform independent technology, e.g., CORBA,COM, and Java [Uma97].

Still, in projects that focus on integrating legacy data with Web-based services it might also be

sufficient to use a more light-weight approach in terms of so-called Web-gateways. Currently,

almost every database vendor offers such a gateway solution. Typically, Web-gateways provide

the possibility to embed database queries into HTML pages. Kappel et al. present a taxonomy

for the different technical solutions in this domain [EKR97].

5.9 Summary

In this chapter, we elaborated an incremental approach to conceptual schema migration which

is based on a tight integration of tools for legacy schema analysis and conceptual translation

and redesign. A major benefit of this approach is that it provides support for iterations between

analysis and conceptual migration activities rather than imposing a strictly phase-oriented

DBRE process. We showed that a common graph repository is a suitable platform for this tight

integration. Furthermore, it allows to employ graph grammars as an abstract formalism to

facilitate specification of schema translation and redesign transformations. We argued that this

high level of abstraction is particularly important because it facilitates extension and adaption

of schema transformations due to unforeseen design patterns in LDB schemas. Based on the

concept of input/output dependencies of schema transformations, we described an incremental

change propagation mechanism that allows the reengineer to reestablish schema consistency

after iterations in the DBRE process, automatically. We used the Progres graph grammar

engineering environment to implement our approach in a customizable DBRE tool called the

Varlet Migrator. We argued that another benefit of this tight integration is the possibility to

generate schema mapping descriptions for existing middleware components. We selected the

object-relational middleware product ObjectDRIVER to validate this hypothesis.

CHAPTER 6 CONCLUSIONS AND

FUTURE PERSPECTIVES

6.1 Major contributions

Database reengineering (DBRE) activities inherently deal with uncertain information about the

internal structure of legacy systems. This uncertainty and the fact that legacy systems evolve

during ongoing migration activities often cause iterations in DBRE processes. The direct result

of such process iterations are inconsistencies between the implementation of the legacy system

and its conceptual (re)design. In this dissertation, we have explored concepts and techniques to

manage aspects of uncertainty and inconsistency in computer-aided DBRE processes. The

major contributions of our research are summarized in the following paragraphs.

selection of a

theory to manage

uncertainty

Based on our experiences with practical DBRE case studies, we elaborated a catalog of central

requirements on a theory as a basis to represent and reason about imperfect DBRE knowledge.

With this catalog we studied and evaluated major theories in the domain of approximate

reasoning. As a result of this evaluation, we have identified possibilistic logic as the theory

which is most suitable to provide the framework for our research.

GFRN as a basis

for LDB analysis

In this framework, we have developed Generic Fuzzy Reasoning Nets (GFRNs) as a dedicated

formalism to specify and adapt DBRE heuristics and processes. GFRN specifications provide

the basis to integrate and combine many existing schema analysis operations and methods. By

distinguishing between data-driven and goal-driven analysis operations, GFRNs allow for the

specification of active analysis tools. Such tools are capable of executing analysis operations

depending on the state of information about the legacy system, automatically. This is in

contrast to traditional (passive) tools where all analysis operations have to be invoked explicitly

by the user. The GFRN language has a sound declarative semantics based on a formal

translation to necessity-valued possibilistic logic. In order to execute GFRN specifications in

human-centered DBRE tools, we have developed a non-monotonic inference algorithm based

on fuzzy Petri nets.

implementation

and evaluation

Incorporating imperfect knowledge in DBRE tools has a significant impact on their user

interfaces. New concepts and interaction mechanisms are required to communicate uncertain

and contradicting information to the reengineer and guide him/her to a consistent analysis

result. In our prototype CARE tool The Varlet Analyst, we have developed filter mechanisms

and an advanced agenda concept to meet these requirements. The Varlet Analyst has been used

as a test bed to evaluate our approach to legacy schema analysis with practical case studies.

These experiments showed that the concepts and techniques developed in this thesis represent a

valuable improvement over currently existing tool support for legacy schema analysis.

flexible schema

translation

We have developed a hybrid approach to conceptual schema migration which consists of an

automatic initial translation step followed by an interactive redesign and extension phase. The

entire migration process has been specified on a high-level of abstraction using graph

transformation systems. A generation mechanism which is mainly based on the Progres graph

grammar engineering environment enables the produce executable transformation tools based

182 CONCLUSIONS AND FUTURE PERSPECTIVES

on this abstract specification. This generative approach provides a high amount of flexibility

and extensibility which is important to consider unforeseen idiosyncrasies in legacy database

(LDB) schemas. Moreover, the proposed schema mapping mechanism is bidirectional, i.e., it

allows the reengineer to map modifications in the conceptual model back to the implemented

logical schema.

incremental

consistency

preservation

Using graph grammars to specify schema translation and redesign operations enabled us to

derive a formal notion of their input/output dependencies according to left-hand side and the

right-hand side of each graph production rule. Based on these dependencies, we have defined a

data structure (history graph) that logs information about all steps performed during the

schema migration process. In case of iterations in the DBRE process, the history graph is

interpreted by an algorithm that performs incremental change propagation and reestablishes

document consistency, automatically. This technique enables to intertwine analysis and

migration activities in evolutionary DBRE processes. Consequently, our approach provides a

suitable basis to construct CARE environments which provide more adequate support for

DBRE projects than existing, strictly phase-oriented tools. We have implemented this

consistency management mechanism in the prototype CARE tool The Varlet Migrator which

has been evaluated with industrial collaboration.

heterogeneous

data integration We have demonstrated the suitability of the information maintained in our schema mapping

graph model to generate declarative schema mapping descriptions. This facilitates the

integration of our DBRE tool with various available middleware products for heterogeneous

data integration to obtain a flexible and comprehensive environment for LDB analysis,

migration, and encapsulation. We have evaluated this approach with a commercial object-

relational middleware product.

6.2 Transferability of results

schema analysis

Even though the focus of this dissertation is on reengineering legacy relational databases, most

of our results are not limited to this specific application domain. The requirements that were

used to select a suitable theory to manage uncertain DBRE knowledge remain valid in many

other scenarios that aim on software comprehension and design recovery. For example, we

have noticed similar problems and challenges in the domain of architectural design recovery

for object-oriented software. A mechanism to detect and classify design patterns [GHJV95]

would be very supportive for software comprehension. Recently, researchers have started to

investigate in techniques that can be used to detect such patterns [KSRP99, Bro96, KDBM94,

TFAM96]. As a common problem, they encountered that different software systems contain

various derivations of the same design pattern. Typically, their detection is ambiguous and

inherently deals with heuristics, e.g., naming conventions, structural characteristics, and caller/

callee relationships. Current tools for design pattern detection lack explicit concepts to deal

with imperfect knowledge. Their heuristics are often hard-coded and cannot be adapted easily.

The concepts and mechanisms in Chapter 4 are suitable to complement these approaches and

overcome their current limitation. First attempts to employ GFRNs for the detection of C++

and Java design patterns have shown that this approach is promising and feasible [Jahn97a].

conceptual

migration Many tools for conceptual abstraction and interactive redesign of software are based on some

formal notion of a transformation system [HTJC94, MCAH95, YB94, War96, PMdP98]. The

mechanism to incremental consistency management developed in this dissertation can

OPEN PROBLEMS 183

complement these approaches and enable them to deal with iterations in this migration process.

Furthermore, we have demonstrated that the application of graph grammar engineering

techniques in combination with automatic code generators can contribute significantly to

decrease the complexity of constructing and customizing tools for software abstraction and

migration.

6.3 Open problems

selection of CVs

While applying our approach to practical case studies we encountered a number of open

problems which need further investigation. One of these open problems considers the selection

of confidence values (CVs) for GFRN implications. In this dissertation, we argued that the

credibility of DBRE heuristics depend highly on various technical and non-technical

characteristics of the LDB under investigation, e.g., different naming conventions, design

paradigms, and DBMS functionality. In principle, the GFRN approach facilitates customizing

the credibility of the different heuristics used in the semi-automatic analysis process by

adjusting the CVs of implications. In Section 4.1, we proposed that this adjustment should be

done according to the results of an initial domain analysis activity. However, we have

experienced that selecting "good" CVs a-priori (before the actual analysis starts) is far from

being trivial. This is because many characteristics, especially non-technical characteristics,

remain undetected in the initial domain analysis step. Consequently, it is likely that the

reengineer starts the analysis process with suboptimal CVs whenever the application context of

the CARE tool has changed to an LDB from another company, developer team, or on a

different platform. Of course, (s)he can adjust the CVs on-the-fly during the analysis process

when (s)he learns more about the LDB implementation. Still, this entailed that every user of

our schema analysis tool also has to learn about the GFRN formalism. A much more preferable

solution was if the tool would adjust the CVs automatically during the interactive analysis

process. An automatic adaption mechanism could exploit interactive decisions of the

reengineer to decrease CVs of heuristics which have lead to false hypotheses and increase CVs

which (could) have lead to a correct indication.

top-down

migration

A different open problem considers the fact that our approach to conceptual schema migration

is limited to bottom-up migration only. This means that our technique supports incremental

creation of an abstract conceptual design from an implemented logical schema. However, there

are many practical DBRE scenarios where an abstract design is (partly) existent at the

beginning of the migration process. It often occurs that companies have (obsolete) design

documents for specific subsystems of their LDBs. Even more important are scenarios that aim

on federating several (heterogeneous) LDBs into an enterprise-wide business object model. In

the latter case, specific parts of the conceptual design are predefined and the reengineer has to

map this design to the existing LDB schema. So far, these top-down migration scenarios are not

considered by our approach.

loss of layout

information

during change

propagation

Even though the developed consistency management mechanism has been well accepted by the

users of our DBRE tool, most of them criticized that after propagating a schema update, the

layout information has been lost for certain schema increments (classes and relationships).

More precisely, the layout information has been lost for all those schema increments which

represent the output of transformations that have been re-evaluated during the change

propagation step. The reason for this irritating and annoying effect is that whenever a

184 CONCLUSIONS AND FUTURE PERSPECTIVES

transformation application is going to be re-evaluated, its former output is discarded and

reproduced. The layout information which is associated to the former output is discarded as

well. In the case that all transformation applications in a dependency chain remain valid, this

problem can be solved by copying the layout information from the former output of the last

transformation applications to their new output. The situation becomes more difficult for

transformation applications which are no longer valid. In these cases, layout information for

their former input increments are no longer available because these increments are only

represented by place holder nodes in the history graph. One possible solution is to annotate

these place holders with their layout history.

6.4 Future perspectives

generalizing

GFRNs One focus of our future research is on generalizing the GFRN approach for other applications

in the RE domain. For this purpose, we have designed and implemented the GFRN editor and

the inference engine in a modular and portable way that facilitates integration with other

CARE tools. We plan to make this component freely available for academic purposes. In a

project called FUJABA (From UML to Java And Back Again) [KNNZ99], we have started to

experiment with GFRN specifications to analyze Java software and detect design patterns.

Preliminary experiences show that this is a suitable application although the problems involved

seem to be harder than in the application described in this dissertation: we noticed that the

structure of typical object-oriented design patterns is much more complex than the structure of

most relational schema constraints. Defining complex patterns in terms of predicates and

implications results in rather large GFRN specifications which are difficult to read. Therefore,

we plan to develop a more adequate notation for such search patterns with a semantics based

on GFRN specifications. We have begun to investigate the suitability of annotated UML object-

diagrams for this purpose.

self-adaptation In a Master Thesis, we developed a first prototype for a learning mechanism that adjusts the

CVs of GFRN implications automatically during the semi-automatic analysis process [Str99].

The motivation for this research is the aforementioned problem for the user to estimate the

right CVs when the application context of the analysis tool has changed. The goal is to

minimize the classification error, i.e., to decrease the CVs of those implications which lead to a

large number of false hypotheses and increase the CVs of implications which (could) lead to

true hypotheses. The idea of our approach is to exploit the interactive feedback of the

reengineer during the analysis process to adapt the CVs in the GFRN (cf. Figure 6.1). For this

purpose, we employ techniques known from the area of neural network learning [Gal93].

Based on the hypotheses indicated by the GFRN inference engine and the (refutation and

confirmation) decisions of the reengineer, our tool creates a so-called learning task (LT). Then,

the LT is fed back into a feed-forward neural network (NN) which has been generated from the

current GFRN specification. We use the standard backpropagation algorithm [Gal93] to train

the weights in the NN that correspond to the CVs in the GFRN. Finally, the CVs in the GFRN

are adjusted according to the new weights in the NN.

The technique outlined above could be a possible basis to develop adaptive CARE tools. First

experiences with the described mechanism show that this approach is feasible [JS99]. Still,

several questions remain in this context which need further investigation. For example, a

central question is on how to select the parameters of the backpropagation algorithm (learning

FUTURE PERSPECTIVES 185

rate, momentum factor, etc.) to achieve a fast yet stable adaption process. These parameters

define the influence of the current application context of the analysis tool w.r.t. previous

experiences. The general idea is to increase the learning rate temporarily when the tool is

applied in a new RE project, in a different company, or for new subcomponent that has been

developed by another developer team. Practical case studies will play an important role within

our efforts to evaluate and refine this technique.

LDB federation

and evolution

In this dissertation, we developed methods and techniques to support reengineering and

integration of single LDB systems with object-oriented technology. However, an increasing

number of companies strive to federate several heterogeneous information systems (IS) to

achieve integrated, enterprise-wide information infrastructures [Rad95]. An important

condition for the efficiency of such net-centric IS is their ability to evolve in step with changing

market conditions and changes in the organizational structure of the company. Tools which

allow to modify and evolve net-centric IS on a high-level of abstraction have a great potential

to contribute to the desired flexibility. In the future, we plan to generalize our graph grammar

approach to schema integration and redesign for its application in IS federation and evolution

scenarios. As mentioned in Section 6.3, a first step of this generalization will be the extension

of our approach by a technique for top-down schema migration.

abstract

losslessness

criterion

In Section 5.3, we followed a broadly used approach to categorize schema transformations

according to their impact on the information capacity of the target schema w.r.t. the source

schema [BCN92, HTJC94, Tre95, Sch93]. We elaborated semi-formal proofs for these

classifications in [Rum98]. Like other researchers in the domain of schema redesign, we have

noticed that constructing such proofs requires experiences and skills which cannot be expected

from a typical reengineer who wants to extend the catalog of schema transformations available

in our DBRE environment. Therefore, it was beneficial to have an abstract losslessness

criterion which can easily be applied to proof properties of newly specified schema

transformations. In [JZ99, JZ98], we have begun to develop such a formal criterion based on

cycl_join

i1: 0.7 v2⊆v1

i2: 0.3

i10: 1.0

i7: 0.6

sel_dist

key IND

validIND

validKey

i3: 1.0

i5: 1.0

v2=π2(v1)

i9: 1.0

equiv

i8: 0.5

tcomp

nsimilar

i6: 0.8

v2v1v1

GFRN

Tenant

name

Apartment

MainTenant

rent

SubTenant

has

ApHouse

house_id

flats

streetcity

hires

is a is a

intermediate

model

RE knowledge

automatic

LDB engine

Figure 6.1. Self-adapting analysis process

goal- and data-

Tenant

name

Apartment

MainTenant

rent

SubTenant

has

ApHouse

house_id

flats

streetcity

hires

is a is a

final

model

(consistent)

(inconsistent)

refute/

support

driven analysis

reengineer

manual investigation

fdkjgh hkhkjsdhg

dfghshkghkj khjklsd

jdhfgkjfghkjsd hfgkhsdh

glkdksghkfhkjsd hfgkdhg

khfdgfdgfd fdgdfgdf

fdghdgfgj f flds

sdfghfhkfhkfg hklfghklf

learning

task

inference

backpropagation

adjusted

CVs

186 CONCLUSIONS AND FUTURE PERSPECTIVES

the rich theory of parallel graph rewriting rules [Tae96]. For the future, we plan to refine this

approach such that it can be integrated with our tool customization process to facilitate

reasoning about properties of newly added transformations.

user experiments Incorporating uncertain and contradicting knowledge in tool-based RE processes requires new

human-computer interaction schemes to eliminate this imperfect knowledge efficiently and

arrive at a consistent result. Such efficient and user-friendly interaction schemes are crucial to

exploit the benefits of this new technology and achieve broad commercial acceptance in

industry. In Section 4.4.2.2, we proposed a first user interface solution based on an advanced

agenda concept with query and filter mechanisms. This user interface has to be evaluated and

refined in practical user experiments. We will conduct these experiments in tight collaboration

with industry and the Software Engineering Group at the University of Victoria, B.C., Canada.

Their scientific background in tool evaluation [MWS97, Sto98] and the new Experimental

Software Engineering Lab at the University of Victoria represent an ideal environment to

conduct these experiments.

APPENDIX A ADDITIONAL DEFINITIONS

AND SPECIFICATIONS

A.1 Interpretation of a logical schema

The interpretation of a relational database schema is well-defined in the literature. Still, in

Definition 4.1, we substituted the problematic notion of NULL-values by a new concept of

relational variants.a Consequently, we have to define the interpretation of this new concept.

The following Definition A.1 formalizes the interpretation of a logical schema with variants.

Note, that this formalization does not include the intentional semantics of the annotation

function

. Operation Π denotes the usual relational projection of the relational algebra

[EN94].

Definition A.1 Interpretation of a logical schema

The interpretation of a logical schema (T, R, ∆,

) is a tuple:=(ℑT,ℑR,ℑ∆), where

•ℑT:T→SET is a function that maps column type names to finite sets, i.e., their domains.

•ℑR:R→REL×FUN×

{L1}, ℑR(r:(n,X,Σ,V))=(ℑX,ℑV,ℑΣ), r∈R, is a function that maps

each RS to a tuple of a relation, a function, and a constraint represented by a logical

implication;

• relation ℑX is a subset of the cartesian product of the domains of all columns

(including the special value NULL), i.e., ℑX⊆ℑT(t1)∪{NULL}×..×ℑT(tm)∪{NULL},

for X={(n,c1,t1),..., (n,cm,tm)}, m∈ ;

•function ℑV:V→REL maps variants to relations; for each variant v∈V, ℑV is a subset

of the cartesian product of the domains of all columns in v, i.e.,

v:{(n,c1,t1),...,(n,cm,tm)}∈V, m∈,ℑV(v)⊆ℑT(t1)×..×ℑT(tm);

•ℑΣ is an implication that specifies that all tuples in ℑX can uniquely be identified by

the values in their key columns, i.e., ℑΣ= ’∀s1,s2∈ℑX : (ΠΣ(s1)=ΠΣ(s2)→s1=s2)’;

•ℑ∆:∆→

{L1} is a function which maps each IND to a logical implication:

ℑ∆(d:(l, r, I))= ‘∀s1∈ℑV(l) ∃s2∈ℑX(r):(∀i∈I:Πi(s1)=Πi(s2))’,

❑

with ℑR(RS(l))=(ℑX,ℑV,ℑΣ), ℑR(r)=(ℑX,ℑV,ℑΣ).

a NULL-values often cause problems during the migration of relational to object-oriented platforms because

object-oriented data models typically lack the concept of NULL-valued attributes.

188 APPENDIX A

A.2 Specification of the migration graph model

In this section, we employ the formal specification language Progres [SWZ95] to define the

migration graph model discussed in Section 5.1.

spec MigrationGraphModel

node class Increment end;

logical schema

ASG section LogicalSchemaASG

node type LSchema : Increment end;

node type RS : Increment

intrinsic

rsname : string;

end;

node type LType : Increment

intrinsic

ltname : string;

end;

node type Variant : Increment end;

node type LKey : Increment end;

node type I_IND : IND end;

node type ForKey : Increment end;

node type C_IND : IND end;

node type R_IND : IND

intrinsic

invkb : boolean;

end;

node type Column : Increment

intrinsic

colname : string;

end;

edge type c_RS : LSchema [1:1] -> RS [0:n];

edge type c_lt : LSchema [1:1] -> LType [1:n];

edge type c_v : RS [1:1] -> Variant [1:n];

edge type c_ak : RS [1:1] -> LKey [1:1];

edge type c_col : Variant [0:n] -> Column [1:n];

edge type c_fk : Variant [1:1] -> ForKey [0:n];

edge type c_c : ForKey [0:n] -> Column [1:n];

edge type c_kc : LKey [0:n] -> Column [1:n];

APPENDIX A 189

edge type lt : Column [0:n] -> LType [1:1];

node class IND end;

edge type c_k : IND -> LKey;

edge type c_f : IND -> ForKey;

end;

conceptual schema

ASG

section ConceptualSchemaASG

node type CSchema : Increment end;

node type CType : Increment

intrinsic

ctname : string;

end;

node type Class : Increment

intrinsic

clname : string;

abstract : boolean;

end;

node type Inheritance : Increment end;

node type CKey : Increment end;

node type Attribute : Increment

intrinsic

aname : string;

default : string;

end;

node type Association : Relationship

intrinsic

srctotal : boolean;

srccard : integer;

end;

node type Aggregation : Relationship end;

edge type c_ct : CSchema [1:1] -> CType [1:n];

edge type c_cl : CSchema [1:1] -> Class [0:n];

edge type sup : Inheritance [0:n] -> Class [1:1];

edge type sub : Inheritance [0:1] -> Class [1:1];

edge type c_ck : Class [1:1] -> CKey [0:1];

edge type c_ka : CKey [0:n] -> Attribute [1:n];

190 APPENDIX A

node class Relationship is a Increment

intrinsic

srcname : string;

tarname : string;

tartotal : boolean;

tarcard : integer;

end;

edge type src : Relationship [0:n] -> Class [1:1];

edge type tar : Relationship [0:n] -> Class [1:1];

edge type c_att : Class -> Attribute [0:n];

edge type ct : Attribute [0:n] -> CType;

end;

SMG model section SchemaMappingGraphModel

node type MapSch : Increment end;

edge type m_v : MapV [0:n] -> Variant [1:n];

node type MapType : Increment end;

node type MapV : Increment end;

node type MapInc : Increment end;

node type MapIIND : Increment end;

node type MapKey : Increment end;

node type MapCol : Increment end;

node type MapRIND : Increment end;

edge type m_ls : MapSch [0:1] -> LSchema [1:1];

edge type m_cs : MapSch [1:1] -> CSchema [1:1];

edge type m_lt : MapType [0:1] -> LType [1:1];

edge type m_ct : MapType [0:1] -> CType [1:1];

edge type m_cl : MapV [0:1] -> Class [1:1];

edge type m_v_in : MapInc [0:1] -> Inheritance [1:1];

edge type m_iind : MapInc [0:n] -> Variant [1:n];

edge type m_i_in : MapIIND [0:n] -> Inheritance [1:1];

edge type m_lk : MapKey [0:n] -> LKey [1:1];

edge type m_ck : MapKey [0:1] -> CKey [1:1];

edge type m_col : MapCol [0:1] -> Column [1:1];

APPENDIX A 191

edge type m_a : MapCol [0:1] -> Attribute [1:1];

edge type m_rind : MapRIND [0:1] -> R_IND [1:1];

edge type m_vs : MapInc [0:n] -> Variant [1:n];

edge type m_id : Class [0:n] -> MapKey [1:1];

node type MapRel : Increment end;

edge type m_r : MapRel [0:1] -> Relationship [1:1];

edge type r_via : MapRel [0:n] -> MapRIND [0:n];

edge type a_via : MapCol [0:n] -> MapRIND [0:n];

end;

end.

history graph modelsection HistoryGraphModel

node type Transformation : Increment end;

node type Parameter : Increment

intrinsic

nr : integer;

end;

edge type In : Transformation [0:1] -> Parameter [1:n];

edge type Out : Transformation [0:1] -> Parameter [0:n];

edge type con1 : Transformation [0:1] -> Increment [0:n];

edge type actual : Parameter [0:n] -> Increment [0:n];

end;

graph tests to check

for constraint

violations

section Constraints

test DoubleAggregation =

folding { ‘1, ‘2 };

end;

‘2 : Class

( ( <-sub-

& -sup-> )

or ( <-sup-

& -sub-> ) ) *

‘1 : Class

tar

‘3 : Aggregation

tar

‘4 : Aggregation

192 APPENDIX A

test DuplicateAttrName =

condition ‘2.aname = ‘3.aname;

end;

test DuplicateClassName =

condition ‘2.clname = ‘3.clname;

end;

test DuplicateRelName1 =

condition ‘2.srcname = ‘3.srcname;

end;

test DuplicateRelName2 =

condition ‘2.tarname = ‘3.srcname;

end;

‘2 : Attribute ‘3 : Attribute

c_att c_att

‘1 : Class

‘3 : Class‘2 : Class

c_cl c_cl

‘1 : CSchema

‘1 : Class

src

‘2 : Relationship

src

‘3 : Relationship

‘1 : Class

tar

‘2 : Relationship

src

‘3 : Relationship

APPENDIX A 193

test DuplicateRelName3 =

condition ‘2.tarname = ‘3.tarname;

end;

test RelnameEqualAttrname =

condition ‘2.tarname = ‘3.aname;

end;

test RelnameEqualAttrname1 =

condition ‘2.srcname = ‘3.aname;

end;

‘1 : Class

tar

‘2 : Relationship

tar

‘3 : Relationship

tar

‘2 : Relationship ‘3 : Attribute

c_att

‘1 : Class

‘3 : Attribute

src

‘2 : Relationship

c_att

‘1 : Class

APPENDIX B A CATALOG OF REDESIGN

TRANSFORMATIONS

This appendix presents the specification for the primitive schema redesign transformations

implemented in this dissertation. The following table gives an overview of their purpose and

their location in this appendix.

Transformation Short description Type Page

Aggregate Transforms an association into an aggregation IP 196

AssociationToClass Transforms an association between two classes to an

intermediate class with two associations IP 197

ChangeAssoc-

Cardinality Modifies the cardinality of a given association IC 198

ChangeAttributeType Changes the type of an attribute IC 198

ClassToAssociation Transforms a class that participates in two one-to-many

associations to a many-to-many association IP 199

CreateAssociation Creates an association between two given classes IA 200

CreateAttribute Creates an attribute in a given class IA 200

CreateClass Creates a new class IA 201

CreateInheritance Creates an inheritance relationship between two given

classes IA 201

CreateKey Creates a key for a given class IR 202

ConvertAbstract Converts a concrete class into an abstract class IR 202

ConvertConcrete Converts an abstract class into a concrete class IA 203

DisAggregate Transforms an aggregation into an association IP 204

Generalize Creates a generalization for a given class IA 205

MergeClasses Merges two classes which are associated by a one-to-

one relationship into a single class IP 206

MoveAttribute Moves an attribute from one class to an associated class

via a given one-to-one relationship IP 207

PushDownAttribute Moves an attribute of a given class to its specialization IR 208

PushDown-

Association Moves a relationship of a given class to its specializa-

tion IR 209

PushUpAttribute Moves an attribute of a given class to its generalization IA 210

PushUpAssociation Moves a relationship of a given class to its generaliza-

tion IA 211

Remove Removes an increment from the conceptual schema IC 212

RenameAttribute Changes the name of an attribute IP 212

RenameClass Changes the name of a class IP 212

RenameRelationship Changes the role names of a relationship IP 213

Specialize Creates a specialization for a given class IA 214

SplitClass Splits a class in two classes connected by a one-to-one

relationship IP 213

SwapAssocDirection Swaps source and target of a given association IP 215

196 APPENDIX B

Aggregate Transformation Aggregate converts an association (rel) into an aggregation. Its application

condition specifies that the source of association rel has to be a total, single reference.

production Aggregate( rel : Association) =

::=

condition ‘3.srctotal = true;

‘3.srccard = 1;

transfer 3’.srcname := ‘3.srcname;

3’.tarname := ‘3.tarname;

3’.tartotal := ‘3.tartotal;

3’.tarcard := ‘3.tarcard;

end;

Figure B.1. Transformation Aggregate

‘4 : Class ‘2 : Class

src tar

‘3 = rel

m_r

‘5 : MapRel

4’ = ‘4 2’ = ‘2

src tar

3’ : Aggregation

m_r

5’ = ‘5

APPENDIX B 197

AssociationToClassProduction AssociationToClass specifies the reverse transformation for transformation

ClassToAssociation, i.e., it transforms an association to a class with two associations.

production AssocationToClass( assoc : Association) =

::=

folding { ‘1, ‘2 };

transfer 4’.srccard := ‘3.tarcard;

4’.srctotal := ‘3.tartotal;

4’.tarcard := 1;

4’.tartotal := true;

6’.srccard := ‘3.srccard;

6’.srctotal := ‘3.srctotal;

6’.tarcard := 1;

6’.tartotal := true;

end;

Figure B.2. Transformation AssociationToClass

‘1 : Class ‘2 : Class

src tar

‘3 = assoc

m_r

‘5 : MapRIND

r_via

‘4 : MapRel

5’ = ‘5

2’ = ‘21’ = ‘1 3’ : Class

tar src

4’ : Association

src tar

6’ : Association

m_r

r_via

7’ : MapRel r_via

m_r

8’ : MapRel

198 APPENDIX B

ChangeAssoc-

Cardinality Transformation ChangeAssocCardinality modifies the cardinality of a given association. The

choose statement determines whether the transformation application is information-reducing

(IR). If this is the case, the cardinality of the given association assoc is adjusted according to

the actual parameters of the transformation. Otherwise, the existing association is replaced by

a new association with the desired cardinality constraints. Note, that this implies the loss of all

correspondence information with the logical schema which might have existed for the original

association assoc.

ChangeAttribute-

Type Transformation ChangeAttributeType changes the type of a given attribute attr to newType.

transaction ChangeAssocCardinality( assoc : Association; srcCard : integer;

srcTotal : boolean; tarCard : integer ;

tarTotal : boolean)=

choose

when (* IR transformation? *)

((assoc.srccard > srcCard) and (assoc.tarcard > tarCard))

then

assoc.srccard := srcCard

& assoc.tarcard := tarCard

& assoc.srctotal := srcTotal

& assoc.tartotal := tarTotal

else

use newAssoc : Association

do CreateAssociation ( assoc.-src->, assoc.-tar->, assoc.srcname,

assoc.tarname,out newAssoc )

& Remove ( assoc )

& newAssoc.srccard := srcCard

& newAssoc.tarcard := tarCard

& newAssoc.srctotal := srcTotal

& newAssoc.tartotal := tarTotal

end

end;

Figure B.3. Transformation ChangeAssocCardinality

production ChangeAttributeType( attr : Attribute ; newType : CType)=

::=

end;

Figure B.4. Transformation ChangeAttributeType

‘2 : CType

‘3 = newType

-c_cl->

& -c_att->

c_ct

‘1 : CSchema

‘4 = attr

2’ = ‘2

3’ = ‘3

c_ct

1’ = ‘1

4’ = ‘4

APPENDIX B 199

ClassToAssociationTransformation ClassToAssociation transforms a class with two associations into an

association. Negative application conditions (nodes ‘6, ‘7, and ‘12) ensure that the class has

no properties other than the required two associations (‘4 and ‘5) and does not participate in

an inheritance hierarchy. The application conditions of ClassToAssociation restrict the two

associations of the given class cl to be single valued w.r.t. the participating classes (‘2,‘3).

Note, that the requirement that cl is the source of both associations can be satisfied by

executing primitive transformation SwapAssocDirection first.

production ClassToAssociation( cl : Class) =

::=

folding { ‘2, ‘3 };

condition ‘4.tarcard = 1;

‘5.tarcard = 1;

‘4.srctotal;

‘5.srctotal;

transfer 1’.srccard := ‘4.srccard;

1’.srctotal := ‘4.srctotal;

1’.tarcard := ‘5.srccard;

1’.tartotal := ‘5.srctotal;

end;

Figure B.5. Transformation ClassToAssociation

‘3 : Class

‘2 : Class

‘6 : Attribute

‘7 : Association

<-src-

or <-tar-

c_att

‘1 = cl

‘8 : MapRIND

‘9 : MapRIND

m_r

r_via

‘10 : MapRel

m_r

r_via

‘11 : MapRel

tar src

‘4 : Association

src tar

‘5 : Association

‘12 : Inheritance

<-sub-

or <-sup-

9’ = ‘9 8’ = ‘8

3’ = ‘3

2’ = ‘2 src tar

1’ : Association

m_r

r_via

r_via 4’ : MapRel

200 APPENDIX B

CreateAssociation The following transformations CreateAssociation, CreateAttribute,CreateClass,

CreateInheritance, and CreateKey extend the conceptual schema by a new association,

attribute, class, inheritance relationship, and key, respectively.

CreateAttribute

production CreateAssociation( srccl : Class ; tarcl : Class ;

srcrole : string ; tarrole : string ;

out newAssoc : Association)

::=

transfer 3’.srcname := srcrole;

3’.tarname := tarrole;

3’.srctotal := true;

3’.tartotal := true;

3’.srccard := 1;

3’.tarcard := 1;

return newAssoc := 3’;

end;

Figure B.6. Transformation CreateAssociation

‘1 = srccl ‘2 = tarcl

1’ = ‘1 2’ = ‘2

src tar

3’ : Association

production CreateAttribute( cl : Class ; attname : string ;

atype : string ; dflt : string)

::=

condition ‘2.ctname = atype;

transfer 3’.aname := attname;

3’.default := dflt;

end;

Figure B.7. Transformation CreateAttribute

‘1 = cl ‘2 : CType

2’ = ‘2

c_att

1’ = ‘1 ct

3’ : Attribute

APPENDIX B 201

CreateClassThe following transformations CreateClass,CreateAttribute,CreateAssociation,CreateKey,

and CreateInheritance extend the conceptual schema by a new class, attribute, association,

key, and inheritance relationship, respectively.

CreateInheritance

production CreateClass( name : string) =

::=

transfer 2’.clname := name;

end;

Figure B.8. Transformation CreateClass

‘1 : CSchema

2’ : Class

c_cl

1’ = ‘1

production CreateInheritance( subcl : Class ; supcl : Class) =

::=

end;

Figure B.9. Transformation CreateInheritance

‘1 = subcl ‘2 = supcl

1’ = ‘1 2’ = ‘2

sub sup

3’ : Inheritance

202 APPENDIX B

CreateKey

ConvertAbstract Transformation ConvertAbstract transforms a given concrete class cl to an abstract class.

Figure B.11 shows that the variant that has been mapped to cl (node ‘2) is removed from the

logical schema and the (new) abstract class is mapped to all variants (node set ‘4) which have

commonly been mapped to all subclasses of cl.

production CreateKey( attrs : Attribute [1:n]) =

::=

end;

Figure B.10. Transformation CreateKey

‘2 = attrs

c_att

‘1 : Class

2’ = ‘2

c_att

c_ck

1’ = ‘1 c_ka

3’ : CKey

production ConvertAbstract( cl : Class) =

::=

condition ‘1.abstract = false;

‘5 = ‘2.<-vg-;

transfer 1’.abstract := true;

end;

Figure B.11. Transformation ConvertAbstract

‘4 : Variant

‘1 = cl‘2 : Variant

c_v

‘3 : RS

m_vg

m_vs

‘5 : MapInc

m_vs

‘6 : MapInc

m_clm_v ‘7 : MapV

4’ = ‘4

1’ = ‘1

c_v

3’ = ‘3

m_vs

m_cl

m_v

7’ = ‘7

m_vs

6’ = ‘6

m_vg

5’ = ‘5

APPENDIX B 203

ConvertConcreteProduction ConvertConcrete specifies the reverse transformation for the previous

transformation ConvertAbstract, i.e., it converts an abstract class to a concrete class.

Figure B.12 shows that a new variant (9’) is added to represent instances of the (new)

concrete class. This new variant includes all foreign keys (4’) and columns (8’) which are

common to all variants that were mapped to the former abstract class.

production ConvertConcrete( cl : Class) =

::=

condition ‘1.abstract = true;

‘2 = ‘7.-m_v->;

transfer 1’.abstract := false;

end;

Figure B.12. Transformation ConvertConcrete

‘1 = cl

c_v

‘3 : RS

m_vg

‘5 : MapInc

m_vs

‘6 : MapInc

m_cl

m_v ‘7 : MapV

‘4 : ForKey

c_fk

‘8 : Column

c_col

‘2 : Variant

1’ = ‘1

c_v

3’ = ‘3

m_cl

4’ = ‘4

c_fk

8’ = ‘8

c_col

2’ = ‘2

m_vs 6’ = ‘6

m_vg

5’ = ‘5

m_v 7’ = ‘7

c_fk

c_col

9’ : Variant

c_v

204 APPENDIX B

DisAggregate Production Disaggregate specifies the inverse transformation for transformation Aggregate,

i.e., it transforms an aggregation relationship to an association.

production DisAggregate( rel : Aggregation )=

::=

transfer 3’.srcname := ‘3.srcname;

3’.tarname := ‘3.tarname;

3’.tartotal := ‘3.tartotal;

3’.tarcard := ‘3.tarcard;

3’.srctotal := true;

3’.srccard := 1;

end;

Figure B.13. Transformation DisAggregate

‘2 : Class‘1 : Class

src tar

‘3 = rel

m_r

‘4 : MapRel

2’ = ‘21’ = ‘1

src tar

3’ : Association

m_r

4’ = ‘4

APPENDIX B 205

GeneralizeTransformation Generalize creates a generalization for a given root class, i.e., a class that

does not have a superclass (cf. page 142).

production Generalize( cl : Class ; clName : string) =

::=

transfer 7’.clname := clName;

7’.abstract := false;

end;

Figure B.14. Transformation Generalize

‘6 : Column

‘2 : Attribute

sub

‘3 : Inheritance

c_ck

c_ka

c_att

<-m_cl-

& -m_v->

‘1 = cl

‘4 : CKey

m_ck

-m_lk->

& -c_kc->

‘9 : MapKey

‘5 : Variant

6’ = ‘6

2’ = ‘2

1’ = ‘1

c_ka

4’ = ‘4

sub

sup

3’ : Inheritance

c_at

c_ck

m_cl

m_v

c_col

8’ : Variant

m_v_i

m_vg

5’ = ‘5

m_v

10’ : MapV

m_vs 11’ : MapInc

m_ck

9’ = ‘9

m_id

7’ : Class

206 APPENDIX B

MergeClass Production MergeClass specifies the reverse transformation for transformation SplitClass.

Note, that one of the two classes to be merged (node ‘3) has to have no other property than the

association (assoc) that is used for the merge operation. If such properties exist they can be

relocated to class ‘2 by using primitive transformations MoveAttribute and MoveAssociation

first.

production MergeClasses( assoc : Association ; clName : string) =

::=

condition ‘1.tartotal;

‘1.tarcard = 1;

‘1.srctotal;

‘1.srccard = 1;

transfer 2’.clname := clName;

end;

Figure B.15. Transformation MergeClass

‘2 : Class

-src->

or -tar-> -src->

or -tar->

‘1 = assoc

sub

‘9 : Inheritance

‘4 : Attribute

c_att

‘6 : Relationship

<-src-

or <-tar-

‘3 : Class

2’ = ‘2

APPENDIX B 207

MoveAttributeTransformation MoveAttribute relocates an attribute from one class to another class via a

given association. This transformation is described in detail on page 140.

production MoveAttribute( attr : Attribute ; assoc : Association)

::=

condition ‘4.tarcard = 1;

‘4.srccard = 1;

[ ‘1 = ‘4.-src-> :: ‘4.srctotal

| ‘4.tartotal ] ;

end;

Figure B.16. Transformation MoveAttribute

‘3 : Class

‘2 = attr

c_att

‘1 : Class

-src->

or -tar-> -src->

or -tar->

m_a

‘6 : MapCol ‘5 : MapRIND

<-m_r-

& -r_via->

‘4 = assoc

4’ = ‘4

2’ = ‘2

1’ = ‘1

m_a

c_att

3’ = ‘3

5’ = ‘5

a_via

6’ = ‘6

208 APPENDIX B

PushDown-

Attribute Transformation PushDownAttribute specializes an attribute of a given class to its subclass. In

order to avoid the necessity to reorganize data, this transformation is restricted to inheritance

relationships that have been mapped to variants of the same RS (cf. page 143). Note, that the

negative application condition (node ‘8) prohibits that attributes are specialized which belong

to the key of the class.

production PushDownAttribute( attr : Attribute ; specCl : Class)

::=

folding {‘5,‘7}

end;

Figure B.17. Transformation PushDownAttribute

‘7 : Variant

‘6 : Column

sub

sup

‘3 : Inheritance

<-m_cl-

& -m_v->

<-m_a-

& -m_col->

‘1 = attr

c_att

c_col

‘5 : Variant

<-m_cl-

& -m_v->

‘2 = specCl

c_ck

‘4 : Class

c_ka

‘8 : CKey

m_v_in

m_vs

m_vg ‘9 : MapInc

1’ = ‘1

sub

sup

3’ = ‘3

6’ = ‘6 4’ = ‘4

5’ = ‘5

m_v_in

m_vs

m_vg

9’ = ‘9

c_col

7’ = ‘7

c_att

2’ = ‘2

sup

APPENDIX B 209

PushDown-

Association

In analogy to the transformation PushDownAttribute, the following transformation

PushDownAssocoation specializes the source role of an association a given subclass in the

inheritance hierarchy.

production PushDownAssociation( assoc : Association ; specCl : Class)

::=

folding {‘5,‘7}

end;

Figure B.18. Transformation PushDownAssociation

‘7 : Variant

‘6 : MapRIND

sub

sup

‘3 : Inheritance

<-m_cl-

& -m_v->

<-m_r-

& -r_via->

‘4 : Class

<-m_cl-

& -m_v->

‘2 = specCl

m_v_in

m_vs

m_vg ‘9 : MapInc

src

‘1 = assoc

c_fk

‘5 : Variant

<-c_f-

& <-m_rind-

‘8 : ForKey

2’ = ‘2

sub

4’ = ‘4

sup

3’ = ‘3

6’ = ‘6

m_v_in

m_vs

m_vg

9’ = ‘9

8’ = ‘8

5’ = ‘5

c_fk

7’ = ‘7

m_vg

1’ = ‘1

sup

src

210 APPENDIX B

PushUpAttribute Transformation PushUpAttribute generalizes an attribute of a given class to its superclass (cf.

page 143).

production PushUpAttribute( attr : Attribute) =

::=

folding {‘5,‘7}

end;

Figure B.19. Transformation PushUpAttribute

c_att

‘2 : Class

sub

sup

‘3 : Inheritance

‘5 : Variant

‘6 : Column <-m_a-

& -m_col-> ‘1 = attr

c_col

‘7 : Variant

m_v_in

m_vs

m_vg ‘8 : MapInc

<-m_cl-

& -m_v-> ‘4 : Class

1’ = ‘1

2’ = ‘2

sub

sup

3’ = ‘3

6’ = ‘6

c_col

7’ = ‘7

m_v_in

m_vs

m_vg

8’ = ‘8

c_col

5’ = ‘5

c_att

4’ = ‘4

APPENDIX B 211

PushUp-AssociationTransformation PushUpAssocoation generalizes the source role of an association to the

superclass in the inheritance hierarchy.

production PushUpAssociation( assoc : Association) =

::=

folding {‘5,‘7}

end;

Figure B.20. Transformation PushUpAssociation

‘5 : Variant

‘2 : Class

sub

sup

‘3 : Inheritance

c_fk

‘7 : Variant

m_v_in

m_vs

m_vg ‘8 : MapInc

<-m_cl-

& -m_v->

‘4 : Class

src

‘9 : MapRIND

<-m_r-

& -r_via->

‘1 = assoc

<-c_f-

& <-m_rind-

‘6 : ForKey

1’ = ‘16’ = ‘6

2’ = ‘2

sub

sup

3’ = ‘3

c_fk

7’ = ‘7

m_v_in

m_vs

m_vg

8’ = ‘8

c_fk

5’ = ‘5

src

4’ = ‘4

9’ = ‘9

212 APPENDIX B

Remove Transformation Remove deletes an increment from the conceptual schema.

RenameClass The following transformations RenameClass,RenameAttribute, and RenameRelationship

change the names of classes, attributes, and relationships, respectively.

RenameAttribute

production Remove( incr : Increment) =

::=

end;

Figure B.21. Transformation Remove

‘1 = incr

production RenameClass( cl : Class ; newName : string) =

::=

transfer 1’.clname := newName;

end;

Figure B.22. Transformation RenameClass

‘1 = cl

1’ = ‘1

production RenameAttribute( att : Attribute ; newName : string) =

::=

transfer 1’.aname := newName;

end;

Figure B.23. Transformation RenameAttribute

‘1 = att

1’ = ‘1

APPENDIX B 213

Rename-

Relationship

SplitClassTransformation SplitClass splits a given class in two classes which are connected by a one-

to-one associations. This transformation has been described in detail in Section 5.3.

production RenameRelationship( rel : Relationship ; newSrcname,

newTarname : string) =

::=

transfer 1’.srcname := newSrcname;

1’.tarname := newTarname;

end;

Figure B.24. Transformation RenameRelationship

‘1 = rel

1’ = ‘1

production SplitClass( cl : Class ; clName : string ; newRole : string ;

oldRole : string)

::=

transfer 2’.srctotal := true;

2’.tartotal := true;

2’.srccard := 1;

2’.tarcard := 1;

2’.srcname := oldRole;

2’.tarname := newRole;

6’.clname := clName;

end;

Figure B.25. Transformation SplitClass

‘5 : Variant

‘3 : MapKey m_id

<-m_cl-

& -m_v->

‘1 = cl

5’ = ‘5

3’ = ‘3

m_id 1’ = ‘1

src

tar

2’ : Association

m_id

6’ : Class

m_cl

m_v

7’ : MapV

m_r

4’ : MapRel

214 APPENDIX B

Specialize Transformation Specialize creates a specialization for a given class.

production Specialize( cl : Class ; clName : string) =

::=

transfer 5’.clname := clName;

end;

Figure B.26. Transformation Specialize

‘2 : Column

<-m_cl-

& -m_v->

c_col

c_v

‘9 : RS

‘1 = cl

‘4 : Variant

1’ = ‘1

2’ = ‘2

c_col

4’ = ‘4

c_v

sup

5’ : Class

sub

3’ : Inheritance

c_col

6’ : Variant

c_v

9’ = ‘9

m_v m_cl

7’ : MapV

m_v_i

m_vs

m_vg

8’ : MapInc

APPENDIX B 215

SwapAssoc-

Direction

Transformation SwapAssocDirection swaps source and target of a given associations.

production SwapAssocDirection( assoc : Association) =

::=

folding { ‘2, ‘3 };

transfer 1’.srccard := ‘1.tarcard;

1’.tarcard := ‘1.srccard;

1’.srcname := ‘1.tarname;

1’.tarname := ‘1.srcname;

1’.srctotal := ‘1.tartotal;

1’.tartotal := ‘1.srctotal;

end;

Figure B.27. Transformation SwapAssocDirection

‘2 : Class ‘3 : Class

tarsrc ‘1 = assoc

2’ = ‘2 3’ = ‘3

srctar 1’ = ‘1

REFERENCES

[Ada76] J.B. Adams. A probability model of medical reasoning and the mycin model. Mathematical

Biosience, 32:177–186, 1976.

[AdBHL86] E. H. L. Aarts, F. M. J. de Bont, J. H. A. Habers, and P. J. M. Laarhoven. A parallel statistical

cooling algorithm. In 3rd Annual Symposium on Theoretical Aspects of Computer Science,

Orsay, France,Lecture Notes in Computer Science. Springer Verlag, 1986.

[AEP96] J. M. Antis, S. G. Eick, and J. D. Pyrce. Visualizing the Structure of Large Relational

Databases. IEEE Software, pages 72–79, 1996.

[AG96] D. C. Atkinson and W. G. Griswold. The design of whole-program analysis tools. In Proc. of

the 18th Int. Conf. on Software Engineering, Berlin, Germany, pages 16–27. IEEE Computer

Society Press, 1996.

[AGM85] C. A. Alchourrón, P. Gärdenfors, and D. Makinson. On the logic of theory change: partial

meet contraction and revision functions. The Journal of Symbolic Logic, 50:510–530, 1985.

[Aik95] P. Aiken. Data Reverse Engineering: Slaying the Legacy Dragon. McGraw-Hill, 1995.

[AL94] D. Aebi and R. Largo. Methods and tools for data value re-engineering. In Applications of

Databases (ADB-94), volume 819 of Lecture Notes in Computer Science, pages 400–411.

Springer Verlag, 1994.

[ALV93] F. Abbattista, F. Lanubile, and G. Visaggio. Recovering conceptual data models is human-

intensive. In Proc. of 5th Intl. Conf. on Software Engineering and Knowledge Engineering,

San Francisco, California, USA, pages 534–543, 1993.

[AMR94] P. Aiken, A. Muntz, and R. Richards. DoD legacy systems: reverse engineering data

requirements. Communications of the ACM, 37(5):26–41, 1994.

[And94] M. Andersson. Extracting an Entity Relationship Schema from a Relational Database through

Reverse Engineering. In Proc. of the 13th Int. Conference of the Entity Relationship

Approach, Manchester, volume 881 of Lecture Notes of Computer Science, pages 403–419.

Springer Verlag, 1994.

[AT98] M. N. Armstrong and C. Trudeau. Evaluating architectural extraction tools. In Proc. of the 5th

Working Conference on Reverse Engineering, Hawaii, USA, pages 30–39. IEEE Computer

Society Press, 1998.

[Bau94] M. Bauer. Integrating probabilistic reasoning into plan recognition. In Proc. of the 11th

European Conference on Artificial Intelligence (ECAI ’94), pages 620–624. John Wiley &

Sons, 1994.

[Bau95] M. Bauer. A Dempster-Shafer approach to modeling agent preferences for plan recognition.

User Modeling and User-Adapted Interaction, 5:317–348. Wolters Kluwer Publishers, 1995.

[BB92] L. Bolc and P. Borowik. Many-valued Logics: Theoretical Foundations. Springer Verlag,

Berlin, 1992.

[BB94] A. J. Bugarin and S. Barro. Fuzzy reasoning supported by petri nets. IEEE Transactions on

Fuzzy Systems, 2(2):135–150, 1994.

References

218 REFERENCES

[BC90] T. J. M. Bench-Capon. Knowledge Representation - An Approach to Artificial Intelligence.

Academic Press, London, 1990.

[BCN92] C. Batini, S. Ceri, and S. B. Navathe. Conceptual Database design. Benjamin/Cummings,

1992.

[BDH+87] H. Briand, C. Ducateau, Y. Hebrail, D. Herin-Aime, and J. Kouloumdjian. From Minimal

Cover to Entity-Relationship Diagram. In Proc. of the 6th Intl. Conference of the Entity

Relationship Approach, New York, pages 287–304. North-Holland, 1987.

[BED94] J. S. Bowman, Sandra L. Emerson, and M. Darnovsky. The Practical SQL Handbook - Using

Structured Query Language. Addison-Wesley Developers Press, Reading, MA, USA, 1994.

[Ber80] J.O. Berger. Statistical Decision Theory. Springer Verlag, New York, 1980.

[Bew98] B. Bewermeyer. Cliche-Erkennung in relationalen Datenbankanwendungen. Master’s Thesis,

University of Paderborn, Dept. of Mathematics and Computer Science, 33095 Paderorn,

Germany, 1998.

[BGD97] A. Behm, A, Geppert, and K. R. Dittrich. On the migration of relational schemas and data to

object-oriented database systems. In Proc. 5th International Conference on Re-Technologies

for Information Systems, Klagenfurt, Austria, pages 13-33. Österreichische Computer

Gesellschaft, 1997.

[Big90] T. J. Biggerstaff. Human-oriented conceptual abstractions in the reengineering of software. In

Proc. of the 12th International Conference on Software Engineering, page 120-122. IEEE

Computer Society Press, 1990.

[BKKK87] J. Banerjee, W. Kim, H. J. Kim, and H. F. Korth. Semantics and Implementation of Schema

Evolution in Object-Oriented Databases. SIGMOD Record, 16(3):311–322, 1987.

[BL97] H. Kleine Büning and T. Lettmann. Skriptum zur Vorlesung wissensbasierte Systeme.

Scriptum for the class on Knowledge-Based Systems at the University of Paderborn, Dept. of

Mathematics and Computer Science, 33095 Paderborn, Germany, 1997.

[Bla98] M. Blaha. On reverse engineering of vendor databases. In Proc. of the 5th Working

Conference on Reverse Engineering, pages 183–190, Hawai, USA. IEEE Computer Society

Press, 1998.

[BM98] E. Baniassad and G. Murphy. Conceptual module querying for software reengineering. In

Proc. of the 20th International Conference on Software Engineering, pages 64–73. IEEE

Computer Society Press, 1998.

[BP95] M. Blaha and W. Premerlani. Observed idiosyncracies of relational database designs. In

Second Working Conference on Reverse Engineering, Toronto, Ontario, Canada. IEEE

Computer Society Press, 1995.

[BP96] M. Blaha and W. Premerlani. A catalog of object model transformations. In Proc. of 3rd

Working Conference on Reverse Engineering, Monterey, California. USA. IEEE Computer

Society Press, 1996.

[BP98] M. Blaha and W. Premerlani. Object-Oriented Modeling and Design for Database

Applications. Prentice Hall, 1998.

[BR97] H. Blockeel and L. D. Raedt. Relational knowledge discovery in databases. In Proc. of the 6th

Intl. Workshop on Inductive Logic Programming, volume 1314 of Lecture Notes in Artificial

Intelligence, pages 199–211, Berlin, August 1997. Springer Verlag.

REFERENCES 219

[BRH95] S. Bridges, S. Ramanathan, and J. Hodges. A prototype object-oriented geophysical database

system developed by re-engineering a relational database system. Technical Report MSU-

950612, Department of Computer Science, Mississippi State University, USA, June 1995.

[BRJ99] G. Booch, J. Rumbaugh, and I. Jacobson. The Unified Modeling Language User Guide.

Addison-Wesley, Reading, MA, USA, 1st edition, 1999.

[Bro96] K. Brown. Design reverse-engineering and automated design-pattern detection in smalltalk.

Technical Report TR-96-07, Department of Computer Science, North Carolina State

University, 1996.

[BS84] B. G. Buchanan and E. H. Shortliffe, editors. Rule-Based Expert Systems. Addison-Wesley,

Reading, MA, USA, 1984.

[BS95] M. L. Brodie and M. Stonebraker. Migrating Legacy Systems. Morgan Kaufmann Publishers,

San Francisco, USA, 1995.

[CBB+97] R. G. G. Cattell, D. Barry, D. Bartels, M. Berler, J. Eastman, S. Gamerman, D. Jordan,

A. Springer, H. Strickland, and D. Wade. The Object Database Standard: ODMG 2.0.

Morgan Kaufmann Publishers, Los Altos, CA, USA, 1997.

[CER99] CERMICS Database Team, ObjectDRIVER V1.1 User Manual, 2004 route des lucioles,

06902 Sophia Antipolis Cedex, France, 1999.

[Che76] P. Chen. The Entity-Relationship Model – toward a unified view of data. ACM Transactions

on Database Systems, 1(1):9–36, 1976.

[Chr75] N Christofides. Graph Theory: An Algorithmic Approach. Academic Press, New York, 1975.

[CI90] E. J. Chikofsky and J. H. Cross II. Reverse engineering and design recovery: A taxonomy.

IEEE Software, 7(1):13–17. IEEE Computer Society Press, 1990.

[CMR96] A. Corradini, U. Montanari, and F. Rossi. Graph processes. Fundamenta Informaticae,

26(3):241–265. IOS Press, Amsterdam, 1996.

[Coo90] G. F. Cooper. The computational complexity of probabilistic inference using Bayesian belief

networks. Artificial Intelligence, 42:393–405, 1990.

[CVD96] J. Cardoso, E. Valette, and D. Dubois. Fuzzy Petri nets - an overview. In Proc. of the 13th

World Congress of the Intl. Federation of Automatic Control, San Francisco, pages 443–448,

1996.

[CW85] R.T. Clemen and R.L. Winkler. Limits for the precision and value of information from

dependent sources. Operations Research, 33:427–442, 1985.

[Dat84] C. J. Date. A Guide to DB2. Addison Wesley, Reading, MA, USA, 1984.

[Dat89] C. J. Date. A Guide to the SQL standard. Addison Wesley, Reading, MA, USA, 1989.

[DD92] D. Driankov and P. Doherty. A nonmonotonic fuzzy logic. In L. A. Zadeh and J. Kacprzyk,

editors, Fuzzy Logic for the Management of Uncertainty, pages 171–190. John Wiley & Sons,

1992.

[DLP92] D. Dubois, J. Lang, and H. Prade. Dealing with multi-source information in possibilistic

logic. In Proc. of the 10th European Conference on Artificial Intelligence, pages 38–42,

Vienna, Austria. John Wiley & Sons, 1992.

[DLP94] D. Dubois, J. Lang, and H. Prade. Possibilistic Logic. In Handbook of Logic in Artificial

Intelligence and Logic Programming, pages 439–503, Clarendon Press, Oxford, 1994.

[DP83] D. Dubois and H. Prade. Unfair coins and necessity analysis: Towards a possibilistic

interpretation of histograms. Fuzzy Sets and Systems, 10(1):15–20, 1983.

220 REFERENCES

[DP88] D. Prade and H. Prade. An introduction to possibilistic and fuzzy logics. In P. Smets, E. H.

Mamdani, D. Dubois, and H. Prade, editors, Non-Standard Logics for Automated Reasoning,

pages 287–326. Academic Press, London, 1988.

[DP97] D. Dubois and H. Prade. Synthetic view of belief revision with uncertain inputs in the

framework of possibility theory. International Journal Of Approximate Reasoning, 17(2-3),

pages 295–324, 1997.

[EKR97] G. Ehmayer, G. Kappel, and S. Reich. Connecting databases to the web - a taxonomy of

gateways. In Proc. of the 8th International Conference on Database and Expert Systems

Applications, Toulouse, France, volume 1308 of Lecture Notes in Computer Science, pages

1–15. Springer Verlag, 1997.

[EM77] H. Ebrahim and D. Mamdani. Application of fuzzy logic to approximate reasoning. IEEE

Transactions on computers, volume 26, 1977.

[EN94] R. Elmasri and S. B. Navathe. Fundamentals of Database Systems. Benjamin/Cummings,

Redwood City, 2nd edition, 1994.

[Eng86] G. Engels. Graphen als zentrale Datenstrukturen in einer Software-Entwicklungsumgebung.

Ph.D. Thesis, Universität Osnabrück. VDI-Verlag, 1986.

[Eng98] V. Englebert. Voyager 2 (version 4.0) - Reference manual. Institut d’Informatique, University

of Namur, Belgium, rue grandgaggnage B-5000 Namur, Belgium, 1998.

[Fen67] J. E. Fenstad. Representations of probabilities defined on first-order languages. In J. N.

Crossley, editor, Sets, models and recursion theory. North-Holland, 1967.

[FG90] C. Froidevaux and C. Grossete. Graded default theory for uncertainty. In Proc. of the 9th

European Conference on Artificial Intelligence, Stockholm, Sweden, pages 283–288. Pitman,

London, 1990.

[FH97] J. S. P. Fong and S.-M. Huang. Information Systems Reengineering. Springer Verlag,

Singapore, 1997.

[FHK+97] P. J. Finnigan, R. C. Holt, I. Kalas, S. Kerr, K. Kontogiannis, H. A. Müller, J. Mylopoulos,

S. G. Perelgut, M. Stanley, and K. Wong. The software bookshelf. IBM Systems Journal,

36(4):564-593, 1997.

[Fla97] D. Flanagan. Java in a Nutshell: a desktop quick reference. O’Reilly & Associates, Inc., 981

Chestnut Street, Newton, MA 02164, USA, 2nd edition, 1997.

[Fon97] J. Fong. Converting relational to object-oriented databases. ACM SIGMOD Record, 26(1),

1997.

[Fou92] Open Software Foundation. Introduction to OSF/DCE. Prentice Hall, New Jersey, 1992.

[Fry95] B. Fryer. Prudential gets healthy. Information Week, pages 60–64, 1995.

[FS97] A. Fay and E. Schnieder. Fuzzy petri nets for knowledge representation and reasoning in rule-

based systems. In Proc. of the 2nd Intl. ICSC Symposium on Fuzzy Logic and Applications,

Zurich, pages 146–150, 1997.

[FS98] A. Fay and E. Schnieder. On the combination of expert systems and petri nets. In Proc. of the

7th Intl. Conference on Information Processing and Management of Uncertainty in

Knowledge-based Systems. Paris, La Sorbonne, pages 1626–1632, 1998.

[Fus97] M. L. Fussell. Foundations of object relational mapping. 1220 N. Fair Oaks Ave, #1314,

Sunnyvale, CA 94089, 1997.

[FUZ98] Proc. of 7th IEEE Intl. Conf. of Fuzzy Systems. Anchorage, USA. IEEE, 1998.

REFERENCES 221

[FV95] C. Fahrner and G. Vossen. Transforming Relational Database Schemas into Object-Oriented

Schemas according to ODMG-93. In Proc. of the 4th Intl. Conference on Deductive and

Object-Oriented Databases, 1995.

[Gal93] S. I. Gallant. Neural Network Learning and Expert Systems. The MIT Press, Cambridge, MA,

USA, 1993.

[Gär75] P. Gärdenfors. Qualitative probability as an intensional logic. Journal of Philosophical Logic,

4(2):171–185, 1975.

[Gei95] K. Geiger. Inside ODBC. Microsoft Press, 1995.

[GHJV95] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns. Addison Wesley,

Reading, MA, USA, 1995.

[GJS97] J. Gosling, B. Joy, and G. Steele. The Java Language Specification. The Java Series. Addison

Wesley, Reading, MA, USA, 1997.

[GK93] H. Gall and R. Klösch. Capsule oriented reverse engineering for software reuse. In Proc. of

the European Conference on Software Engineering, volume 717 of Lecture Notes in

Computer Science, pages 418–433. Springer Verlag, 1993.

[Got88] S. Gottwald. Mehrwertige Logik. Akademie-Verlag, Berlin, Germany, 1988.

[Gra95] A. Grauel. Fuzzy-Logik. BI, Braunschweig, Germany, 1995.

[Gre98] R. Grehan. Object marries relational — Ardent’s Java Relational Binding turns a relational

database into a Java object-oriented database management system. Byte Magazine,

23(3):101–102, 1998.

[Gro98] K. Grotenhuis. Crossing the Euro rubicon. IEEE Spectrum, 35(10):30–33, 1998.

[Hai89] J.-L. Hainaut. A generic entity-relationship model. In Information System Concepts: An In-

depth Analysis. Elsevier Science Publishers, Amsterdam, The Netherlands, 1989.

[Hai91] J-L. Hainaut. Entity-generating schema transformations for entity-relationship models. In

Proc. of the 10th Conference on the Entity-Relationship Approach, San Mateo. Springer

Verlag, 1991.

[Haj94] P. Hajek. On logics of approximate reasoning. In Knowledge Representation and Reasoning

Under Uncertainty, volume 808 of Lecture Notes in Artificial Intelligence, pages 17–29.

Springer Verlag, 1994.

[Hal90] J. Y. Halpern. An analysis of first-order logics of probability. Artificial Intelligence,

46(3):311–350, 1990.

[HB86] M. Haber and M. B. Brown. Maximum likelihood methods for log-linear models when

expected frequencies are subject to linear constraints. Journal of the American Statistical

Association, 81(394):477–482, 1986.

[HCTJ93] J-L. Hainaut, M. Chandelon, C. Tonneau, and M. Joris. Contribution to a theory of database

reverse engineering. In First Working Conference on Reverse Engineering, Baltimore, USA,

pages 161-170.IEEE Computer Society Press, 1993.

[HEH+96] J.-L. Hainaut, V. Englebert, J. Henrard, J.-M. Hick, and D. Roland. Database reverse

engineering: From requirements to CARE tools. Automated Software Engineering, 3(1-2),

1996.

222 REFERENCES

[HEH+98] J. Henrard, V. Englebert, J.-M. Hick, D. Roland, and J.-L. Hainaut. Program understanding in

database reverse engineering. In Proc. of 9th International Conference on Database and

Expert Systems Applications, Vienna, Austria, volume 1460 of Lecture Notes in Computer

Science. Springer Verlag, 1998.

[Hei98] M. Heitbreder. Eine Ausführungsmaschine für Generic Fuzzy Reasoning Nets auf Basis

unscharfer Petrinetze. Master’s Thesis, University of Paderborn, Dept. of Mathematics and

Computer Science, D-33095 Paderborn, Germany, 1998.

[Her94] D. Hernandez. Qualitative representation of spatial knowledge. Volume 804 in Lecture Notes

in Computer Science. Springer Verlag, 1994.

[HHEH96] J.-L. Hainaut, J.-M. Hick, V. Englebert, and J. Henrard. Understanding the implementation of

IS-A relations. Volume 1157 of Lecture Notes in Computer Science, pages 42-57. Springer

Verlag, 1996.

[HHHR96] J.-L. Hainaut, J. Henrard, J.-M. Hick, and D. Roland. Database design recovery. Volume 1080

of Lecture Notes in Computer Science, pages 272-300. Springer Verlag, 1996.

[Him97] Himel Inc, DBInformer User’s Manual, 17153 President Drive, Castro valley, CA 94546,

USA, 1997.

[HK94] G. T. Heineman and G. E. Kaiser. Incremental process support for code reengineering. In

Proc. of the Intl. Conference on Software Maintenance, pages 282–290. IEEE Computer

Society Press, 1994.

[HMW95] D. Heckerman, A. Mamdani, and M. P. Wellman. Real-world applications of Bayesian

networks: Introduction. Communications of the ACM, 38(3):24–26, 1995.

[Hol97] J. Holle. Ein Generator für integrierte Werkzeuge am Beispiel der objekt-relationalen

Datenbankschemamigration. Master’s Thesis, University of Paderborn, Dept. of Mathematics

and Computer Science, D-33095 Paderborn, Germany, 1997.

[Hol98] R.C. Holt. Structural manipulations of software architecture using Tarski relational algebra.

In Working Conference on Reverse Engineering, pages 210–219, Hawaii, USA. IEEE

Computer Society Press, 1998.

[HR87] J. Y. Halpern and M. O. Rabin. A logic to reason about likelihood. Artificial Intelligence,

32(3):379–405, 1987.

[HTJC94] J.-L. Hainaut, C. Tonneau, M. Joris, and M. Chandelon. Transformation-based database

reverse engineering. Volume 823 of Lecture Notes in Computer Science, pages 362-373.

Springer Verlag, 1994.

[Hül96] E. Hüllermeier. Reasoning about Systems based on incomplete an uncertain models. Ph.D.

Thesis, University of Paderborn, Dept. of Mathematics and Computer Science, D-33095

Paderborn, Germany, 1996.

[Hüs97] F. Hüsemann. Migration relationaler Datenbanken in objektorientierte Umgebungen. In

Tagungsband des 3. Fachkongresses Smalltalk und Java in Industrie und Ausbildung, Erfurt,

Germany, pages 5-10, 1997.

[Hüs98] F. Hüsemann. Eine erweiterte Schemaabbildungskomponente für Datenbank–Gateways. In

10. Workshop "‘Grundlagen von Datenbanken"’, pages 52–56, Konstanz. Konstanzer

Schriften in Mathematik und Informatik Nr. 63, Universität Konstanz, 1998.

[JEJ95] I. Jacobson, M. Ericsson, and A. Jacobson. The Object Advantage. Addison Wesley,

Workingham, UK, 1995.

REFERENCES 223

[JH98a] J. H. Jahnke and M. Heitbreder. Design recovery of legacy database applications based on

possibilistic reasoning. In Proc. of 7th IEEE Intl. Conference on Fuzzy Systems, Anchorage,

USA, pages 1332–1337.IEEE, 1998.

[JH98b] S. Jarzabek and R. Huang. The case for user-centered case tools. Communications of the ACM,

41(8):93–99, 1998.

[JK90] P. Johannesson and K. Kalman. A method for translating relational schemas into conceptual

schemas. In Entity-Relationship Approach to Database Design and Querying: Proc. of the 8th

Intl. Conference on Entity-Relationship Approach. North Holland, 1990.

[JNW98] J. H. Jahnke, U. Nickel, and D. Wagenblasst. A case study in supporting evolution of complex

engineering information systems. In Proc. of 22nd Intl. Computer Software and Applications

Conference, pages 513-520. IEEE Computer Society Press, 1998.

[Joh86] R. Johnson. Independence and Bayesian updating methods. In L. N. Kanal and J. F. Lemmer,

editors, Uncertainty in Artificial Intelligence, pages 197–201. Elsevier Science Publishers,

Amsterdam, 1986.

[JP92] V. S. Jacob and H. Pirkul. Organizational decision support systems. Intl. Journal of Man-

Machine Studies, 36(6):817–832, 1992.

[JS99] J. H. Jahnke and C. Strebin. Adaptive tool support for database reverse engineering. In Proc.

of 1999 Conference of the North American Fuzzy Information Processing Society, New York,

USA, pages 278-282. IEEE Press, 1999.

[JSWZ99] J. H. Jahnke, W. Schäfer, J. Wadsack, and A. Zündorf. Managing inconsistency in

evolutionary database reengineering processes. Science of Computer Programming, 1999.

(submitted)

[JSZ96] J. H. Jahnke, W. Schäfer, and A. Zündorf. A design environment for migrating relational to

object oriented database systems. In Proc. of the 1996 Intl. Conference on Software

Maintenance, pages 163-170. IEEE Computer Society Press, 1996.

[JSZ97] J. H. Jahnke, W. Schäfer, and A. Zündorf. Generic fuzzy reasoning nets as a basis for reverse

engineering relational database applications. In Proc. of European Software Engineering

Conference, number 1302 in Lecture Notes in Computer Science, pages 193-210. Springer

Verlag, 1997.

[JSZ97a] J. H. Jahnke, W. Schäfer, and A. Zündorf. A design environment for migrating relational to

object-oriented database systems (Abstract). In Software Engineering and Database

Technology. Dagstuhl-Seminar-Report 173, Dagstuhl, Germany, 1997.

[JZS97b] J. H. Jahnke, W. Schäfer, and A. Zündorf. The NewPORT prototype V0, with demonstration.

Joint Seminar of O2 Technology and INRIA, Versaille, France, February, 26th, 1997.

[JW99a] J. H. Jahnke and J. Wadsack. Human-centered reverse engineering environments should

support human reasoning. In Proc. of the 1st Intl. Workshop on Soft Computing Applied to

Software Engineering. Limerick, Ireland, pages 77-83. Limerick University Press, 1999.

[JW99b] J. H. Jahnke and J. Wadsack. Integration of analysis and redesign activities in information

system reengineering. In Proc. of the 3rd European Conference on Software Maintenance and

Reengineering, Amsterdam, The Netherlands, pages 160–168. IEEE Computer Society, 1999.

[JW99c] J. H. Jahnke and J. Wadsack. Varlet: Human-centered tool support for database reengineering.

In Proc. of the Workshop Software Reengineering. Bad Honnef, Germany, 1999. (to appear)

224 REFERENCES

[JZ97] J. H. Jahnke and A. Zündorf. Rewriting poor design patterns by good design patterns. In Proc.

of the ESEC/FSE Workshop on Object-Oriented Re-engineering. Technical University of

Vienna, Information Systems Institute, Distributed Systems Group, 1997. Technical Report

TUV-1841-97-10.

[JZ98] J. H. Jahnke and A. Zündorf. Using graph grammars for building the varlet database reverse

engineering environment. In Proc. of Theory and Application of Graph Transformations,

Paderborn, Germany. Technical Report tr-ri-98-201, University of Paderborn, D-33095

Paderborn, Germany, 1998.

[JZ99] J. H. Jahnke and A. Zündorf. Handbook of Graph Grammars and Computing by Graph

Transformation - Application, volume 2, chapter Applying Graph Transformations To

Database Re-Engineering. World Scientific, Singapore, 1999. (to appear)

[Kas80] U. Kastens. Ordered Attributed Grammars. Acta Informatica, 13(3):229–256, 1980.

[Kas96] N. K. Kasabov. Foundations of Neural Networks, Fuzzy Systems, and Knowledge

Engineering. MIT Press, Cambridge, 1996.

[KDBM94] K. Kontogiannis, R. DeMori, M. Bernstein, and E. Merlo. Localization of design concepts in

legacy systems. In Proc. of the Intl. Conference on Software Maintenance 1994, pages 414–

423. IEEE Computer Society Press, 1994.

[Ker92] E. E. Kerre. A comparative study of the behavior of some popular fuzzy implication operators

on the generalized modus ponens. In Fuzzy logic for the management of uncertainty. John

Wiley & Sons, New York, 1992.

[KKM98] A. Kemper, D. Kossmann, and F. Matthes. SAP R/3: A database application system. SIGMOD

Record (ACM Special Interest Group on Management of Data), 27(2), page 499, 1998.

[KM96] A. Konar and A. K. Mandal. Uncertainty management in expert systems using fuzzy petri

nets. IEEE Transactions on Knowledge and Data Engineering, 8(1):96–105, 1996.

[KM98] N. N. Karnik and J. M. Mendel. Introduction to Type-2 Fuzzy Logic Systems. In Proc. 7th

Intl. Conference on Fuzzy Systems FUZZ-IEEE’98, Anchorage, USA, pages 915–920. IEEE,

1998.

[KNNZ99] T. Klein, U. Nickel, J. Niere, and A. Zündorf. From UML to Java and back again. University

of Paderborn, Department of Mathematics and Computer Science, D-33095 Paderborn,

Germany, 1999.

[Knu68] D. E. Knuth. Semantics of Context-Free Languages. Mathematical Systems Theory, 2(2):127–

145, 1968.

[KSRP99] R. K. Keller, R. Schauer, S. Robitaille, and P. Page. Pattern-based reverse-engineering of

design components. In Proc. of the 21st International Conference on Software Engineering,

pages 226–235. ACM Press, 1999.

[KSW95] N. Kiesel, A. Schürr, and B. Westfechtel. GRAS, a graph-oriented (software) engineering

database system. Information Sciences, 20(1):21–51, 1995.

[KWDE98] B. Kullbach, A. Winter, P. Dahm, and J. Ebert. Program comprehension in multi-language

systems. In Proc. of the 5th Working Conference on Reverse Engineering, pages 135–143,

Hawaii, USA. IEEE Computer Society Press, 1998.

[Lan91] J. Lang. Logique possibiliste: aspects formels, deduction automatique, et applications. Ph.D.

Thesis, IRIT, Univ. P. Sabatier, Toulouse, France, 1991.

[Lef95] M. Lefering. Integrationswerkzeuge in einer Softwareentwicklungsumgebung. Informatik.

Verlag Shaker, 1995.

REFERENCES 225

[Lem77] E. J. Lemmon. An Introduction to Modal Logic. Basil Blackwell, 1977.

[Lev66] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals.

Soviet Physics Doklady, 6:707–710, 1966.

[LMB92] J. R. Levine, T. Mason, and D. Brown. Lex & Yacc. O’Reilly, Sebastopol, 2nd edition, 1992.

[LMS98] A. L. Lederer, D. A. Mirchandani, and K. Sims. Using WISs to enhance competitiveness.

Communications of the ACM, 41(7):94–95, 1998.

[LO98] T. Lin and L. O’Brian. FEPSS: A flexible and extensible program comprehension support

system. In Proc. of 5th Working Conference on Reverse Engineering, pages 40–49, Hawaii,

USA. IEEE Computer Society Press, 1998.

[Loe78] M. Loeve. Probability Theory. Springer Verlag, New York, 4th edition, 1978.

[Log97] Logic Works Inc., University Square at Princeton, 111 Compus Drive, Princeton NJ 08540.

ERwin User’s Guide, 3rd edition, 1997.

[Loo88] C. G. Looney. Fuzzy Petri nets for rule-based decisionmaking. IEEE Transactions on

Systems, Man, and Cybernetics, 18(1):178–183, 1988.

[LR89] C. L’Ecluse and P. Richard. The O2 Database Programming Language. In Proc. of the 15th

Intl. Conference on Very Large Data Bases, Amsterdam, The Netherlands, pages 411–422.

Morgan Kaufmann Publishers, 1989.

[LS96] M. Lefering and A. Schürr. Specification of Integration Tools. In Building tightly integrated

software development environments, volume 1170 of Lecture Notes in Computer Science,

pages 324-334. Springer Verlag, 1996.

[LS97] C. Lindig and G. Snelting. Assessing modular structure of legacy code based on mathematical

concept analysis. In Proc. of the 19th Intl. Conf. on Software Engineering, Boston, MA, USA,

pages 349-359. ACM Press, 1997.

[LS98a] D.-M. Lincke and B. Schmid. Mediating electronic product catalogs. Communications of the

ACM, 41(7):86–88, 1998.

[LS98b] G. L. Lohse and P. Spiller. Electronic shopping. Communications of the ACM, 41(7):81–87,

1998.

[MAJ94] U. A. Johnen M. A. Jeusfeld. An executable meta model for re-engineering of database

schemas. Technical Report 94-19, Technical University of Aachen, Germany, 1994.

[Mar97] R. A. Martin. Dealing with dates: Solutions for the Year 2000. Computer, 30(3):44–51, 1997.

[MCAH95] P. Martin, J. R. Cordy, and R. Abu-Hamdeh. Information capacity preserving of relational

schemas using structural transformation. Technical Report ISSN 0836-0227-95-392, Dept. of

Computing and Information Science, Queen’s University, Kingston, Ontario, Canada, 1995.

[McC75] C. L. McClure. Structured programming in COBOL. ACM SIGPLAN Notices, 10(4):25–33,

1975.

[McC98] T. J. McCabe. Does reverse engineering have a future? Keynote of the 5th Working

Conference on Reverse Engineering, Honolulu, Hawaii, USA, 1998.

[MM90] R. W. Mathews and W. C. McGee. Data modeling for software develpment. IBM Systems

Journal, 29(2):228–235, 1990.

[MN95] G. C. Murphy and D. Notkin. Lightweight source model extraction. In Proc. of ACM

SIGSOFT Symposium on the Foundations of Software Engineering, pages 116-127.

ACM Press, 1995.

226 REFERENCES

[MNB+94] L. Markosian, P. Newcomb, R. Brand, S. Burson, and T. Kitzmiller. Using an enabling

technology to reengineer legacy systems. Communications of the ACM, 37(5):58–70, 1994.

[MNL96] G. C. Murphy, D. Notkin, and E. S.-C. Lan. An empirical study of static call graph extractors.

In Proc. of the 18th Intl. Conference on Software Engineering, pages 90–98, Berlin, Germany.

IEEE, 1996.

[MNS95] G. C. Murphy, D. Notkin, and K. Sullivan. Software Reflexion Models: Bridging the Gap

between Source and High-Level Models. In Proc. of SIGSOFT’95 Third ACM SIGSOFT

Symposium on the Foundations of Software Engineering, pages 18–28. ACM Press, 1995.

[MT93] V. W. Marek and M. Truszczynski. Nonmonotonic Logic. Springer Verlag, Berlin, 1993.

[MWS97] K. Wong, and M.-A.D. Storey, H.A. Müller. How do program understanding tools affect how

programmers understand programs? In Proc. of 4th Working Conference on Reverse

Engineering, Amsterdam, Holland, pages 12–21. IEEE Computer Society Press, 1997.

[MWT94] H. A. Müller, K. Wong, and S. R. Tilley. Understanding software systems using reverse

engineering technology. In Proc. of the 62nd Congress of L’Association Canadienne

Francaise pour l’Avancement des Sciences, pages 41–48, Montreal, Canada, 1994.

[MZ82] M. Mizumoto and H. J. Zimmerman. Comparison of Fuzzy Reasoning Methods. Fuzzy Sets

and Systems, 8:253–283, 1982.

[NA87] S. B. Navathe and A. M. Awong. Abstracting Relational and Hierarchical Data with a

Semantic Data Model. In Proc. of the 6th Intl. Conference of the Entity Relationship

Approach, New York, pages 305–333. North-Holland, 1987.

[NAF99] Proc. of the 18th Conference of the North American Fuzzy Information Processing Society,

New York, USA. IEEE, 1999.

[Nag96] M. Nagl, editor. Building tightly integrated software development environments, volume 1170

of Lecture Notes in Computer Science. Springer Verlag, Berlin, 1996.

[Nea92] R. E. Neapolitan. A survey of uncertain and approximate reasoning. In Fuzzy Logic for the

Management of Uncertainty, pages 55–82. John Wiley & Sons, 1992.

[NH95] L. Ngo and P. Haddawy. Probabilistic logic programming and Bayesian networks. Volume

1023 of Lecture Notes in Computer Science, pages 286-300. Springer Verlag, 1995.

[Nil93] N. J. Nilsson. Probabilistic logic revisited. Artificial Intelligence, 59(1-2):39–42, 1993.

[Nov92] V. Novak. Fuzzy logic as a basis of approximate reasoning. In Fuzzy Logic for the

Management of Uncertainty, pages 247–264. John Wiley & Sons, 1992.

[Nov97] Novera Software Inc., 3 Burlington Woods, Burlington, MA 01830, USA. Novera EPIC

Database Builder (TM), release 1.3, September 1997.

[O2 93] O2 Technology. The O2 Application Designer’s Manual – Version 4.3. 7 rue du Parc de

Clagny, 78000 Versailles, France, 1993.

[Obj99a] The Object People Inc., 885 Meadowlands Dr., Suite 509, Ottawa, Ontario. TOPLink for Java

2.0 User’s Manual, 1999.

[Obj99b] ObjectMatter Inc., 2450 S.W. 137 Ave. Suite 206 Miami, Fl. 33175, USA. Objectmatter VBSF

Object-Relational Framework V2.02 User Manual, 1999.

[ONT96] ONTOS Inc., 3 Burlington Woods, Burlington, MA, USA. ONTOS Object Integration Server

for Relational Databases 2.0 - Schema Mapper User’s Guide, 2.0 edition, 1996.

[Paa88a] G. Paass. Discussion of Chapter 9: Belief Functions. In Non-Standard Logics for Automated

Reasoning. pages 279–280. Academic Press, London, 1988.

REFERENCES 227

[Paa88b] G. Paass. Probabilistic logic. In Non-Standard Logics for Automated Reasoning, pages 213–

251. Academic Press, London, 1988.

[PB94] W. J. Premerlani and M. R. Blaha. An approach for reverse engineering of relational

databases. Communications of the ACM, 37(5):42–49, 1994.

[PD93] H. Prade and D. Dubois. Belief revision and updates in numerical formalisms – an overview,

with new results for the possibilistic framework. In Proc. of the Intl. Joint Conferences on

Artificial Intelligence, Chambery, France. Morgan Kaufman Publishers, 1993.

[Pea86] J. Pearl. Fusion, propagation, and structuring in bayesian networks. Artificial Intelligence,

29(3), 1986.

[Pea98] J. Pearl. Bayesian networks. Technical Report 980002, University of California, Los Angeles,

Computer Science Department, USA, 1998.

[Pet81] J. L. Peterson. Petri Net Theory and Modeling of Systems. Prentice Hall, 1981.

[PKBT94] J-M. Petit, J. Kouloumdjian, J-F. Boulicaut, and F. Toumani. Using queries to improve

database reverse engineering. In Proc. of 13th Int. Conference of ERA, Manchester, volume

881 of Lecture Notes in Computer Science, pages 369–386. Springer Verlag, 1994.

[PM96] P. Patel and K. Moss. Java Database Programming With JDBC. Coriolis Group Books,

Scottsdale, AZ, USA, 1996.

[PMdP98] R. Penteado, P. C. Masiero, and A. F. do Prado. Reengineering of legacy systems based on

transformation using the object-oriented paradigm. In Proc. of 5th Working Conference on

Reverse Engineering, pages 144–153, Hawaii, USA. IEEE Computer Society Press, 1998.

[Poo88] D. Poole. A logical framework for default reasoning. Artificial Intelligence, 36(1):27–47,

1988.

[Poo93] D. Poole. Average-case analysis of a search algorithm for estimating prior and posterior

probabilities in bayesian networks with extreme probabilities. In Proc. of the Intl. Joint

Conferences on Artificial Intelligence, Chambery, France. Morgan Kaufman Publishers,

1993.

[Pro89] G. M. Provan. A logic-based analysis of Dempster-Shafer theory. Technical Report TR-89-

08, Department of Computer Science, University of British Columbia, Canada, 1989.

[PS92] B. Peuschel and W. Schäfer. Concepts and Implementation of a Rule-based Process Engine.

In Proc. of the 14th Intl. Conference on Software Engineering, Melbourne, Australia, pages

262–279. IEEE Computer Society Press, 1992.

[PTBK96] J-M. Petit, F. Toumani, J. Boulicaut, and J. Kouloumdjian. Towards the reverse engineering

of denormalized relational databases. In Proc. 12th International Conference on Data

Engineering, pages 218–227, New Orleans. IEEE Computer Society, 1996.

[Rad95] E. Radeke. Federation and Migration among Database Systems. Ph.D. Thesis, University of

Paderborn - Department of Mathematics and Computer Science, D-33095 Paderborn,

Germany 1995.

[Rat98] Rational Software Corp., 18880 Homestead Road, Cupertino, CA 95014, USA. Rational Rose

98 - Using Rational Rose / Oracle 8, 1998.

[RBP+91] J. Rumbaugh, M. Blaha, W. Premerlani, F. Eddy, and W. Lorensen. Object–Oriented

Modeling and Design. Prentice Hall, Englewood Cliffs, N. J. 07632, 1991.

[Res76] N. Rescher. Plausible reasoning - An introduction to the theory and practice of plausibilistic

inference. Van Gorcum, Assen/Amsterdam, 1976.

228 REFERENCES

[RH96] S. Ramanathan and J. Hodges. Reverse engineering relational schemas to object-oriented

schemas. Technical Report MSU-960701, Department of Computer Science, Mississippi

State University, USA, 1996.

[RH97] S. Ramanathan and J. Hodges. Extraction of object-oriented structures from existing

relational databases. ACM SIGMOD Record, 26(1), 1997.

[RHSR94] T. Reps, S. Horwitz, M. Sagiv, and G. Rosay. Speeding up slicing. In Proc. of ACM SIGSOFT,

New Orleans LA, USA, pages 11-20.ACM Press, 1994.

[RJB99] J. Rumbaugh, I. Jacobson, and G. Booch. The Unified Modeling Language Reference Manual.

Addison-Wesley, Reading, MA, USA, 1st edition, 1999.

[Rog71] R. Rogers. Mathematical Logic and Formalized Theories. North-Holland, Amsterdam, 1971.

[Roz97] G. Rozenberg, editor. Handbook of Graph Grammars and Computing by Graph

Transformation. World Scientific, Singapore, 1997.

[RRF90] K. Rosen, R. Rosinski, and J. Farber. UNIX System V Release 4: An Introduction for New and

Experienced Users. Mc-Graw-Hill, New York, NY, USA, 1990.

[RS97] J. Rekers and A. Schürr. Defining and parsing visual languages with layered graph grammars.

Journal of Visual Languages and Computing, London, Academic Press., 8(1), 1997.

[Rum98] C. Rummel. Ein Transformationsbasierter Ansatz zur Migration von relationalen zu

objektorientierten Datenbanken. Master’s Thesis, Univeristät-GH Paderborn, Mathematik-

Informatik, D-33095 Paderborn, Germany, 1998.

[SCC+93] Y.-P. Shan, T. Cargill, B. Cox, W. Cook, M. Loomis, and A. Snyder. Is multiple inheritance

essential to OOP? In Proc. of the 8th Annual Conference on Object-Oriented Programming

Systems, Languages and Applications, pages 363–363, Washington, DC, USA. ACM Press,

1993

[Sch91] A. Schürr. Operationales Spezifizieren mit programmierten Graphersetzungssystemen.

Deutscher Universitätsverlag, Wiesbaden, Germany, 1991.

[Sch92] J. C. Schryver. Object-oriented qualitative simulation of human mental models of complex

systems. IEEE Transactions on Systems, Man, and Cybernetics, 22(3):526–541, 1992.

[Sch93] B. Schiefer. Eine Umgebung zur Unterstützung von Schemaänderungen und Sichten in

objektorientierten Datenbanksystemen. Ph.D. Thesis, Universität Karlsruhe, Fakultät für

Informatik, FZI Forschungszentrum Informatik, Haid-und-Neu-Str. 10, D-76131 Karlsruhe,

Germany, 1993.

[Sch95a] K. Schick. The key to client/server - unlocking the power legacy systems. Gartner Group

Conference, February 1995.

[Sch95b] A. Schürr. Logic based structure rewriting systems. Fundamenta Informaticae, Special Issue

on Graph Transformation Systems, pages 363–386, 1995.

[Sch98] H. Schalldach. Integration von Java-Anwendungen mit relationalen Informationssystemen.

Master’s Thesis, University of Paderborn, Department of Mathematics and Computer

Science, D-33095 Paderborn, Germany, 1998.

[SdJPeA99] P. Sousa, L. Pedro de Jesus, G. Pereira, and F. Brito e Abreu. Clustering relations into

abstract er schemas for database reverse engineering. In Proc. of the 3rd European

Conference on Software Maintenance and Reengineering. Amsterdam, NL, pages 169–176.

IEEE Computer Society Press, 1999.

REFERENCES 229

[Sha76] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, Princeton, 1976.

[Sha90] G. Shafer. Belief functions. In Readings in Uncertain Reasoning. Morgan Kaufmann, San

Mateo, California, USA, 1990.

[Sho74] E. H. Shortliffe. A rule-based computer program for advising physicians regarding

antimicrobial therapy selection. Ph.D. Thesis, Stanford University, 1974.

[Sie98] Siemens AG - C-LAB, Fürstenallee 11, D-33102 Paderborn, Germany. OpenDM ODMG

User’s Guide, 1998.

[Sim94] D. Simpson. Are mainframes cool again. Datamation, pages 46–53, 1994.

[SK90] F. N. Springsteel and C. Kou. Reverse Data Engineering of E-R Designed Relational

Schemas. In Proc. of Databases, Parallel Architectures and their Applications, pages 438–

440. Springer Verlag, 1990.

[SLGC94] O. Signore, M. Loffredo, M. Gregori, and M. Cima. Reconstruction of er schema from

database applications: a cognitive approach. In Proc. of 13th Intl. Conference of ERA,

Manchester, pages 387–402, volume 881 of Lecture Notes in Computer Science. Springer

Verlag, 1994.

[Slo95] A. M. Sloane. An evaluation of an automatically generated compiler. ACM Transactions on

Programming Languages and Systems, 17(5):691–703, 1995.

[SM95] M.-A. D. Storey and H. A. Müller. Manipulating and documenting software structures using

SHriMP views. In Proc. of Intl. Conference in Software Maintenance, pages 275–285. IEEE

Computer Society Press, 1995.

[Sme88] P. Smets. Belief functions. In Non-Standard Logics for Automated Reasoning, pages 253–

286. Academic Press, London, 1988.

[Sne91] H. M. Sneed. Bank application reengineering and conversion at the union bank of switzerland.

In Proc. of the Intl. Conference on Software Maintenance 1991, pages 60–72. IEEE Computer

Society Press, 1991.

[Sne95] H. M. Sneed. Planning the reengineering of legacy systems. IEEE Software, 12(1):24–34,

1995.

[Sou98a] C. Soutou. Relational database reverse engineering: Extraction of cardinality constraints.

Data and Knowledge Engineering, Elsevier, North Holland, 28(2):161–207, 1998.

[Sou98b] C. Soutou. Inference of aggregate relationships through database reverse engineering. In Proc.

of Intl. Conf. on Conceptual Modeling, volume 1507 of Lecture Notes in Computer Science,

pages 135-145, Springer Verlag, 1998.

[SP98] P. Stevens and R. Pooley. Systems reengineering patterns. In Proc. of ACM Foundations of

Software Engineering, Lake Buena Vista, Florida, USA, pages 17-23. ACM Press, 1998.

[Sto98] M. A. D. Storey. A Cognitive Framework for Describing and Evaluating Software

Exploration Tools. Ph.D. Thesis, Simon Fraser University, Vancouver, B.C., Canada, 1998.

[Str97] B. Stroustrup. The C++ Programming Language: Third Edition. Addison Wesley, Reading,

MA, USA, 1997.

[Str99] C. Strebin. Adaption unsicheren Reverse-Engineering-Wissens auf Basis konnektionistischer

Methoden. Master’s Thesis, University of Paderborn, Department of Mathematics and

Computer Science, D-33095 Paderborn, Germany, 1999.

230 REFERENCES

[SWZ95] A. Schürr, A. J. Winter, and A. Zündorf. Graph Grammar Engineering with PROGRES. Proc.

of the European Software Engineering Conference, pages 219-234, volume 989 of Lecture

Notes in Computer Science, Springer Verlag, 1995.

[Tae96] G, Taentzer. Parallel and Distributed Graph Transformation: Formal Description and

Application to Communication-Based Systems. Ph.D. Thesis, Technische Universität Berlin,

Fachbereich 13, 1996.

[TCHH99] Ph. Thiran, A. Chougrani, J.-M. Hick, and J.-L. Hainaut. Generation of conceptual wrappers

for legacy database. In Proc. of 10th Intl. Conference and Workshop on Database and Expert

Systems Applications, Florence, Lecture Notes in Computer Science. Springer Verlag, 1999.

(to appear)

[Tea99] The Progres Developer Team. The Progres Language Manual Version 9.2. Lehrstuhl für

Informatik III, RWTH Aachen, Ahornstr. 55, 52074 Aachen, Germany, 1999.

[Ten98] J. M. Tenenbaum. WISs and electronic commerce. Communications of the ACM, 41(7):89–

90, 1998.

[TFAM96] P. Tonella, R. Fiutem, G. Antoniol, and E. Merlo. Augmenting pattern-based architectural

recovery with flow analysis: Mosaic - A case study. In Proc. of 3rd Working Conference on

Reverse Engineering. IEEE Computer Society, 1996.

[THB+98] P. Thiran, J.-L. Hainaut, S. Bodart, A. Deflorenne, and J.-M. Hick. Interoperation of

independent, heterogeneous and distributed databases. methodology and CASE support: the

InterDB approach. In Proc. of the 3rd Intl. Conf. on Cooperative Information Systems, New

York City, USA, pages 54–63. IEEE Computer Society Press, 1998.

[Tho99] Thought Inc., 657 Mission Street, Suite 202, San Francisco, CA 94105, USA. CocoBase

WhitePaper, 1999.

[Tre95] M. Tresch. Evolution in Objekt-Datenbanken. Teubner Verlag, Stuttgart, 1995.

[TWSM94] S. R. Tilley, K. Wong, M-A. D. Storey, and H. A. Müller. Programmable reverse engineering.

Intl. Journal of Software Engineering and Knowledge Engineering, 4(4):501–520, 1994.

[Uma97] A. Umar. Application (Re)Engineering - Building Web-Based Applications and Dealing with

Legacies. Prentice-Hall International, London, UK, 1997.

[UML97] UML Notation Guide vers. 1.1. Rational Software, Microsoft, Hewlett-Packard, Oracle,

Sterling Software, MCI Systemhouse, Unisys, ICON Computing, IntelliCorp, i-Logix, IBM,

ObjecTime, Platinum Technology, Ptech, Taskon, Reich Technologies, Softeam, 1997.

[vdBKV97] M. van den Brand, P. Klint, and C. Verhoef. Reverse engineering and system renovation: an

annotated bibliography. ACM Software Engineering Notes, 22(1), 1997.

[vDM98] A. van Deursen and L. Moonen. Type inference in cobol systems. In Proc. of the 5th Working

Conference on Reverse Engineering, pages 220–230, Hawaii, USA. IEEE Computer Society

Press, 1988.

[Vin97] S. Vinoski. Corba: Integrating diverse applications within distributed heterogeneous

environments. IEEE Communications Magazine, 14(2), 1997.

[vM19] R. von Mises. Grundlagen der Wahscheinlichkeitsrechnung. Mathematische Zeitung, 5, 1919.

[Voo89] F. Voorbraak. A computationally efficient approximation of Dempster-Shafer theory.

International Journal of Man-Machine Studies, 30(5):525–536, 1989.

[Wad98] J. P. Wadsack. Inkrementell Konsistenzerhaltung in der transformationsbasierten

Datenbankmigration. Master’s Thesis, University of Paderborn, Department of Mathematics

and Computer Science, D-33095 Paderborn, Germany, 1998.

REFERENCES 231

[War96] M. P. Ward. Program analysis by formal transformation. The Computer Journal, 39(7):598–

618, 1996.

[Wel97] B. B. Welch. Practical Programming in Tcl & Tk. Prentice Hall Press, Upper Saddle River,

2nd edition, 1997.

[Wil86] W. G. Wilson. Prolog for applications programming. IBM Systems Journal, 25(2):190–206,

1986.

[Wil94] L. M. Wills. Using attributed flow graph parsing to recognize programs. In Intl. Workshop on

Graph Grammars and Their Application to Computer Science,Williamsburg, Virginia, USA,

pages 170-184, volume 1073 in Lecture Notes in Computer Science. Springer Verlag, 1994.

[WM97] A. R. Williamson and C. L. Moran. Java Database Programming: Servlets & JDBC. Prentice

Hall, 1997.

[WS90] L. Wall and R. L. Schwartz. Programming Perl. O’Reilly Associates, Inc., Sebastopol, CA,

1990.

[WSK97] C. Welsch, A. Schalk, and S. Kramer. Integrating forward and reverse object-oriented

software engineering. In Proc of the 19th Intl. Conf. on Software Engineering, Boston, MA,

USA, pages 560-561.ACM Press, 1997.

[YB94] H. Yang and K. Bennett. Extension of A transformation system for maintenance - dealing with

data-intensive programs. In Proc. of the Intl. Conference on Software Maintenance,Victoria,

Canada, pages 344–353. IEEE Computer Society Press, 1994.

[YHC97] A. S. Yeh, D. R. Harris, and M. P. Chase. Manipulating recovered software architecture

views. In Proc of the 19th Intl. Conf. on Software Engineering, Boston, MA, USA, pages 184-

194. ACM Press, 1997.

[YLQ98] A. Yang, J. Linn, and D. Quadrato. Developing integrated Web and database applications

using JAVA applets and JDBC drivers. In Proc. of the 29th SIGCSE Technical Symposium on

Computer Science Education, volume 30,1 of SIGCSE Bulletin, pages 302–306, New York.

ACM Press, 1998.

[Zad65] L. A. Zadeh. Fuzzy sets. Information and Control, 8:338–353, 1965.

[Zad75] L. A. Zadeh. The concept of a linguistic variable and its application to approximate reasoning.

Information Sciences, 8:199–249, 1975.

[Zad78] L. A. Zadeh. Fuzzy sets as a basis for a theory of possibility. Fuzzy Sets and Systems, 1978.

[Zha98] W.R. Zhang. Bipolar Fuzzy Sets. In Proc. 7th Intl. Conf. on Fuzzy Systems, Anchorage, USA,

pages 835–840. IEEE, 1998.

[Zün95] A. Zündorf. Eine Entwicklungsumgebung für PROgrammierte GRaphErsetzungsSysteme.

Deutscher Universitätsverlag, Wiesbaden, 1995.

[Zün99] A. Zündorf. Skript zur Vorlesung Graphentechnik, Sommersemester 1999. University of

Paderborn, Department of Mathematics and Computer Science, D-33095 Paderborn,

Germany, April 1999.

INDEX

Numerics

1-context 147

abstract

class 119, 129

syntax graph 114

access path 120

α-cut 45, 50, 60, 105

aggregation 118

Analysis Front-End 100

analyzed logical schema 58

application condition 121, 124

architecture 99, 157

artificial key 21

association 133

attribute transfer clause 121, 123

automatic analysis operation 63, 72

axiom 80

-based marking 80

backward propagation 151

backwards reasoning 77

base table 168

attribute 170

relationship 172

basic probability

assignment 42

number 42

Bayesian inference 41

belief

function 43

revision 78

revision step 78

best model 51

best valuation 51

bipolar fuzzy set 49

canonical translation 178

card-operator 129

cardinality constraint 21

certainty factor 39

change propagation 145, 150, 163

classical projection 50

closed world assumption 36

code patterns 18

cold turkey 2

complex transformation 138, 144

complexity 97

compositional inference law 48

computer-aided reengineering 4

conceptual

abstraction 24

extension 24

migration 24

redesign 24, 161

schema 25, 118

schema migration 113

concrete class 119

conditional

boolean expression 132, 141

expression 141

probability 40

confidence factor 48

consistency management 145

context sensitive menu 159

continuous membership function 45

contradiction 7

contraposition transition 83

CORBA 1

COTS middleware 180

credibilistic reasoning 42

Customization Front-End 100

cyclic join pattern 18

dataanalysis 18

integration 113

reverse engineering 3

database 38

reengineering 3

reverse engineering 3

data-decomposable 30

data-driven analysis 82, 92

Index

234 INDEX

operation 63, 71

deduction problem 51

degree of consistency 51

Dempster-Shafer model 42

derivability 94

derived attribute 159

discrete membership function 45

domain analysis 56

edge type 116

enabled transition 79

encapsulation 28

Entity-Relationship model 116

equilibrium

state 80

time 80

error models 41

Euro-conversion 3

evaluation 84

expansion 83

expansion

of formulae 72

-evaluation cycle 84

expert system 33

extent of a possibilistic predicate 71

fact base 34

focal proposition 42

folding clause 131

forward

engineering 2

mapping 123

production 123

propagation 150

forwards reasoning 77

fuzzy

belief marking 78

composition 47

implication 47

inference 48

logic 44

logical operator 48

Petri net 77

predicate 58

reasoning 44

relation 47

rules 46

set 44

truth token 78

generic data model 178

Generic Fuzzy Reasoning Net 7, 55, 57, 66

goal-driven analysis 83

operation 63, 71

graded modus ponens 51

graph 115

constraint 118

grammar 121

production 121

test 118, 155

graphical path expression 133

GRAS 157

grounding 84, 87, 92

history graph 146, 149

human-awareness 5

ignorance 36

implication 58

rule 39

implies 129

inclusion dependency 22, 38, 133

incremental

reasoning 77

schema migration 114

inductive logic programming 111

inference 51

algorithm 90

engine 8, 34, 76, 99

loop 92

process 81

information capacity 137

inheritance 118

inner universal quantifier 61

instance mapping 137

iteration 161

Java 28

INDEX 235

database connectivity (JDBC) 14

join pattern 18

key dependency 38

knowledge

base 33

-based system 33

layered graph grammar 99

left-hand side 121

legacy

database 4

software system 1

Levenshtein distance 60

limitcycle 80

logical schema 57

main transition 83

mapping rule 122

mass change 3

match 121

maximum-likelihood 41

MAX-MIN composition 48

measure of belief 39

membership

degree 45

function 45

meta model 178

middleware 26, 113, 165

Migration Front-End 159

migration graph 114

model 114

monotonic reasoning 36

multiple inheritance 116

MYCIN 38, 39

naming convention 16

necessity 49, 59

-valued formula 49

-valued possibilistic logic 49

negative application condition 130, 148

node set 129

node type 116

non-monotonic reasoning 36

not-null constraint 38

NULL-value 57

Object

identifier 120

Management Group 117

Modeling Technique 109

ObjectDRIVER 165

occurrence of literals 72

ODMG standard 28, 116

Open Database Connectivity (ODBC) 100

open world assumption 36

optimization structure 21

optional graph element 129

ordered association 24

path expression 124

pattern library 99

periodic oscillation 80

Petri Net 77

place 78

plausibility function 44

possibilistic reasoning 49, 59

possibility 49

distribution 50

posterior probability 41

predecessor 80

primitive transformation 138

probabilistic logic 40

probability measure 40

process iterations 7

Progres 116, 158

qualitativ reasoning 35

quantitative reasoning 35

redesign transformation 137

reengineering 2

process 2

reevaluation 150

236 INDEX

relation schema 38

relational database 38

remote

attribute 170

relationship 172

restriction 130

reverse

engineering 2

mapping 123

production 123

right-hand side 121

scalability 97

schema

analysis 7

catalog 4

mapping graph 114

migration 7

redesign 137

transformation 137

select distinct pattern 18

selection problem 35

semantical enrichment 15

stability 80

start graph 121, 125

structural completion 15

structure transformation 137

subjective

evidence 42

probability 40

t-conorm 46

t-norm 46

threshold value 60

transformation

system 121

template 146

transition 78

transitive

inheritance 130

path expression 148

translation 150

triple graph grammar 114, 122

TXL 179

type-2 fuzzy logic 48

Unified Modeling Language (UML) 13

universe of discourse 38

unparser 159

variable aggregation 61

variant 57

records 20

Varlet

Analyst 99, 157

Migrator 157

view threshold 105

Year-2000 problem 3

ABBREVEATIONS

AI - Artificial Intelligence 33

API - Application Programming Interface 5

ASG - Abstract Syntax Graph 100, 114

BRS - Belief Revision Step 78

CARE - Computer-Aided ReEngineering 4

CF - Certainty Factor 39

C-IND Cardinality INclusion Dependency 58

COTS - Commercial Off-The-Shelf 9

CS - Client/Server 1

CT - Contraposition Transition 83

CV - Confidence Value 59

DB - DataBase 11

DBMS - Database Management System 3

DBRE - Database ReEngineering 3

DBRvE - DataBase Reverse Engineering 3

DDL - Data Definition Language 100

DRvE - Data Reverse Engineering 3

ER - Entity-Relationship 101

FBM - Fuzzy Belief Marking 78

FE - Forward Engineering 2

FPN - Fuzzy Petri Net 77

FTT - Fuzzy Truth Token 78

FUJABA - From Uml to Java And Back Again

184

GFRN - Generic Fuzzy Reasoning Net 7

GMP - Graded Modus Ponens 51

IA - Information Augmenting 137

IC - Information Changing 137

IE - Inference Engine 8, 81

iff - if and only if 50

I-IND - Isa-INclusion Dependency 58

ILP - Inductive Logic Programming 111

IND - INclusion Dependency 22, 38

IP - Information Preserving 137

IQ - Inner universal Quantifier 61

IR - Information Reducing 137

IS - Information System 185

IT - Information Technology 11

JDBC - Java DataBase Connectivity 14

KBS - Knowledge-Based System 33

L0 - propositional logic 38

L1 - first-order logic 38

LC - Limit Cycle 80

LDB - Legacy DataBase 4

LOC - Lines Of Code 45

LSS - Legacy Software System 1

LT - Learning Task 184

MB - Measure of Belief 39

MD - Measure of Disbelief 39

MIS - Marketing Information System 12

MT - Main Transition 83

NN - Neural Network 184

NPL1 - Necessity-valued Possibilistic Logic 49

ODBC - Open DataBase Connectivity 100

Abbreveations

238 ABBREVIATIONS

OID - Object IDentifier 120

OMG - Object Management Group 117

OMT - Object Modeling Technique 109

OO - Object-Orientation 1

PDIS - Product and Document Information Sys-

tem 11

PN - Petri Net 77

PO - Periodic Oscillation 80

Progres - PROgammed Graph REplacement Sys-

tems 116

RDB - Relational DataBase 38

RE - ReEngineering 2

R-IND - Reference INclusion Dependency 58

RS - Relation Schema 38

RvE - Reverse Engineering 2

SMG - Schema Mapping Graph 114

SQL - Structured Query Language 18

TV - Threshold Value 60

UML - Unified Modeling Language 13

w.r.t. - with respect to 65

Web - World Wide Web 1