Duplicate-based Schema Matching
by Diplom-Informatiker
Alexander Bilke
from Berlin
Dissertation approved by Faculty IV (Electrical Engineering and Computer Science)
of the Technische Universität Berlin
for the academic degree of
Doktor der Ingenieurwissenschaften
- Dr.-Ing. -
Doctoral committee:
Chair: Prof. Dr. Sabine Glesner
Reviewer: Prof. Dr. Bernd Mahr
Reviewer: Prof. Dr. Herbert Weber
Reviewer: Prof. Dr. Felix Naumann
Reviewer: Prof. Dr. Erhard Rahm
Date of the scientific defense: 20 December 2006
Berlin 2007
D 83
Summary
The integration of independently developed data sources confronts us with
many problems that result from different kinds of heterogeneity. One of the
greatest challenges is schema matching: the semi-automatic process in which
semantic relationships between attributes in heterogeneous schemata are detected.
Various solutions that exploit schema information or extract specific
properties from attribute values have been described in the literature. This
dissertation presents a novel schema matching algorithm that exploits 'fuzzy'
duplicates, i.e., different representations of the same real-world entity.
This thesis describes the DUMAS table matcher, which establishes attribute
correspondences between two tables. Finding the duplicates that can then be
used for schema matching is a challenging task, because the semantic
relationships between the tables are not known and thus established duplicate
detection methods cannot be applied. The new problem of duplicate detection
between unaligned tables is described, together with an algorithm that finds
the top-k duplicates. The attribute correspondences between the two tables
are extracted from the duplicates in a subsequent step.
The DUMAS schema matcher extends the duplicate-based matching approach
to complex schemata consisting of multiple tables. Finding correspondences
between complex schemata raises new problems that do not occur with single
tables. Thus, the DUMAS table matcher cannot be applied directly. Instead,
heuristics are used to decide whether a matching between two tables can be
trusted. Based on these heuristics, an algorithm is developed that finds
attribute correspondences between complex schemata.
The two algorithms described so far are limited to simple (1:1)
correspondences. Because complex (1:n or m:n) correspondences occur in
practice, the DUMAS complex matcher was developed. This matcher uses the
result of the DUMAS table matcher and improves it by combining individual
attributes. In this way, complex correspondences are formed. Because the
space of possible complex matchings is very large, heuristics were developed
that restrict the number of attribute combinations to be considered.
Abstract
The integration of independently developed data sources poses many problems,
which are the result of several types of heterogeneity. One of the most daunting
challenges is schema matching, which is the semi-automatic process of detecting
semantic relationships between attributes in heterogeneous schemata.
Various solutions that exploit schema information or extract specific fea-
tures from attribute values have been described. In this thesis we propose novel
schema matching algorithms that exploit fuzzy duplicates, i.e., different repre-
sentations of the same real-world entity.
We describe the DUMAS table matcher, whose goal is to establish attribute
correspondences between two tables. Finding the duplicates that can be used
for schema matching is a challenging task because the semantic relationships
between the tables are unknown, and thus, existing duplicate detection solutions
cannot be applied. We discuss the novel problem of duplicate detection in
unaligned relations and describe an algorithm that is able to detect the top-k
duplicates. The attribute correspondences between the two tables are extracted
from those duplicates in a subsequent step.
The DUMAS schema matcher extends the duplicate-based matching ap-
proach to complex schemata consisting of multiple tables. Finding attribute
correspondences between complex schemata poses several new challenges that
do not occur when single tables are to be matched, and thus, complicate the
application of the table matcher. We describe heuristics used to determine if a
table matching can be trusted, and develop an algorithm that exploits multi-
table duplicates to detect correspondences between complex schemata.
The previous two algorithms are restricted to simple (i.e., 1:1) correspondences.
Because complex (i.e., 1:n or m:n) correspondences do occur in practice, we developed
the DUMAS complex matcher. The matcher uses the result of the DUMAS
table matcher and improves the matching by merging certain attributes, and
thus, detecting complex correspondences. Because the space of possible complex
matchings is very large, we devised several heuristics to decrease the number of
attribute combinations that have to be considered.
Acknowledgements
First of all I thank my supervisors for their constant support and advice during
my dissertation years. Prof. Herbert Weber (TU Berlin & Fraunhofer ISST)
gave me the opportunity to carry out my PhD work in his department. Felix
Naumann (Hasso-Plattner-Institute Potsdam) has influenced my work through-
out the years by providing valuable ideas in many interesting discussions.
I would also like to thank the Computer-based Information Systems group
at Technical University Berlin. Dr. Ralf-Detlef Kutsche has introduced me to
the field of data integration. He also gave me the opportunity to hold a seminar
on that topic. I very much enjoyed the interesting discussions with Dr. Susanne
Busse, Thomas Kabisch, and Dr. Alexander Löser in our Journal Club. Martin
Konitzer and Dennis Dietrich supported my work by developing new ideas on
duplicate-based schema matching in their diploma theses.
The idea for our Journal Club I was happy to ‘import’ from the Informa-
tion Integration Group at Hasso-Plattner-Institute Potsdam. I thank Melanie
Weis, Jens Bleiholder, and their colleagues for many interesting ideas on data
integration and schema matching. I very much enjoyed our development of the
Humboldt Merger.
The Berlin-Brandenburg Graduate School on Distributed Information Sys-
tems has supported my work by providing the financial basis and valuable feed-
back in various seminars. I particularly thank those graduate students who
made those three years very interesting not only in research. I also thank Prof.
AnHai Doan for his invitation to his group at University of Illinois Urbana-
Champaign. Finally, I thank Prof. Erhard Rahm and Prof. Bernd Mahr for
their evaluation of my thesis.
Finally, I thank my parents, who have always helped me by all means. The
constant support of my family has given me the motivation to begin and finish
my PhD work.
Contents
I Mapping Overlapping Databases
1 Putting Together Pieces of Information
1.1 Introduction
1.2 Heterogeneity and Conflicts in Data Integration
1.3 Types of Integrated Information Systems
1.3.1 The Mediator-based System (MBS)
1.3.2 The Data Warehouse
1.3.3 Other Types of Integrated Information Systems
1.4 Building An Integrated Information System
1.4.1 Schema Mapping: The Top-down Approach
1.4.2 Schema Integration: The Bottom-Up Approach
1.5 Duplicate-based Schema Matching
1.5.1 Example Scenarios
1.5.2 The DUMAS Approach
2 The Schema Matching Problem
2.1 Basic Relational Concepts
2.2 Problem Description
2.2.1 Schema Matching and Attribute Correspondences
2.2.2 Formal Problem Description
2.3 An Overview of Schema Matching Approaches
2.3.1 Classification of Schema Matching Approaches
2.3.2 Schema-based Matchers
2.3.3 Instance-based Matchers
2.3.4 Duplicate-based Matching Approaches
2.3.5 Combining multiple matchers
2.4 Schema Mappings
2.4.1 What is a Schema Mapping?
2.4.2 Schema Mapping Generation
3 From Duplicates To Schema Matching
3.1 Why Duplicates Can Help in Schema Matching
3.2 The DUMAS Approach
3.3 Duplicates and Duplicate Detection
3.4 Related Work on Duplicate Detection
3.4.1 Record Linkage
3.4.2 The Sorted Neighborhood Method
3.4.3 Other Duplicate Detection Approaches
3.5 Finding Duplicates For Schema Matching
3.5.1 Single-Table Duplicates
3.5.2 Multi-Table Duplicates
3.5.3 Related Work
II The DUMAS Table Matcher
4 The Duplicate Detection Step
4.1 Duplicate Detection Without Known Correspondences
4.2 Duplicate Detection as Top-k Search
4.3 The Tuple Similarity Measure
4.3.1 Inherent Problems
4.3.2 String Similarity Measures
4.3.3 The Tuple Similarity Measure tupsim
4.4 Searching For Duplicates
4.5 The Effect of Sampling
4.6 Experimental Evaluation
4.6.1 Real-world Data: Real Estate Advertisements
4.6.2 Experiments on Generated Data
4.7 Discussion
5 The Matching Step
5.1 Establishing Correspondences By Aggregating Duplicate Votes
5.2 Comparing Attribute Values of Duplicates
5.2.1 The field similarity measure fieldsim
5.2.2 Creating the Similarity Matrix
5.3 Aggregating and Reasoning
5.3.1 Creating The Average Similarity Matrix By Aggregation
5.3.2 Reasoning: How To Extract Attribute Correspondences
5.4 Certainty of Attribute Correspondences
5.5 The Extended Tuple Similarity Measure
5.6 Searching For Duplicates with etupsim
5.6.1 The Duplicate Detection Algorithm
5.6.2 Finding Similar Terms
5.7 Experimental Evaluation
5.7.1 Experiments on Real-Estate Advertisements
5.7.2 Accuracy of Schema Matching
5.7.3 Duplicate Detection with Partial Alignment
5.7.4 The Certainty Check
5.8 Discussion
III Complex Matchings and Complex Schemata
6 Matching Complex Schemata
6.1 Iterative Schema Matching Using Duplicates
6.2 Creating an Initial Matching
6.2.1 Interpreting the Table Matching
6.2.2 Initial Matching with the DUMAS Table Matcher
6.3 Crawling Through Schemata
6.3.1 A Run Through The Example
6.3.2 The Derivation Tree
6.3.3 The Schema Matching Algorithm
6.3.4 The Sanity Check
6.3.5 Considering Complex Relationships by Deferred Deactivation
6.4 Table Extension vs. Duplicate Extension
6.5 Experimental Evaluation
6.5.1 Implementation and Data
6.5.2 Accuracy of Schema Matching
6.6 Discussion
7 Finding Complex Matchings
7.1 Problems Associated With Complex Matchings
7.2 Adapting the DUMAS Table Matcher
7.3 Searching For Complex Correspondences
7.3.1 The Matrix Structure
7.3.2 Searching for a Matrix Structure
7.4 Detecting 1:n Matchings
7.4.1 Discovering the Best Matrix
7.4.2 Creating Child Matrices
7.4.3 Assessing Match Improvement
7.5 The Complex Matching Algorithm
7.6 Matching and Mapping with Combination Functions
7.6.1 Query Discovery with Complex Correspondences
7.6.2 Matching and Mapping with Different Functions
7.7 Experimental Evaluation
7.7.1 Quality Measures for Complex Matching
7.7.2 Real-world data
7.7.3 Synthetic data
7.8 Discussion
IV Discussion
8 Conclusion
8.1 The DUMAS approach
8.2 Combining DUMAS with other matchers
8.3 Schema Matching and Data Integration
8.4 The Future of Schema Matching
List of Figures
1.1 Single application using many data sources.
1.2 An integrated information system.
1.3 Structural and semantic heterogeneity.
1.4 The mediator architecture.
1.5 Matching between sources and the mediator.
1.6 A peer data management system.
1.7 Integration of two schemata.
1.8 Relations R and S with intensional and extensional overlap.
1.9 DUMAS algorithms and related chapters.
2.1 A schema matching example.
2.2 A classification of schema matching approaches.
2.3 An example for columns with similar features but different semantics.
2.4 The problem of overlapping attribute groups.
2.5 Architecture of a composite matcher.
2.6 The query discovery process (Source: [MHH00]).
3.1 Relations R and S with intensional and extensional overlap.
3.2 Product details from www.amazon.com.
3.3 Misleading similarity of attribute values.
3.4 The duplicate-based schema matching process
3.5 Sliding a window through a sorted table (Source: [HS95]).
3.6 Multi-Table Duplicates
4.1 Duplicate tuples.
4.2 Edit distance computation for “dumas” and “rumors”.
4.3 Effect of sampling on the number of duplicates
4.4 The correct matching from DB1 to DB2
4.5 Influence of number of duplicates
4.6 Influence of degree of intensional overlap
4.7 Reduced intensional overlap: four and three matches.
5.1 The duplicate-based schema matching process
5.2 A graph matching.
5.3 Finding similar terms with tries.
5.4 Precision and recall of schema matching
5.5 Robustness of schema matching with 10 true duplicates in the presence of false positives
5.6 Influence of degree of partial alignment
6.1 An initial matching between source and target schema
6.2 Misleading tuple similarity: Tables with different semantics
6.3 A matching size matrix
6.4 Join tables and new correspondences
6.5 Table comparisons performed in the running example.
6.6 Derivation trees of the source and target databases
6.7 Comparing relevant tables.
6.8 The node assignment matrix for the running example.
6.9 Initial matching with one false correspondence
6.10 Correct matching after extending table Emp.
6.11 Extending duplicates: tuples with outdated information.
6.12 Cricket schemata.
6.13 Number of attributes in schemata and number of correspondences.
6.14 Matching quality.
6.15 Matching quality with false initial matching.
7.1 A complex matching between R and S.
7.2 Average similarity matrix for example tables.
7.3 Merging attributes to detect a complex matching.
7.4 The complete search space of the running example.
7.5 Increasing score average by overmerging.
7.6 Start matrix and correct grouping in a m:n scenario.
7.7 First step: Merging attribute groups {AB} and {C}.
7.8 Extract of start matrix and correct grouping in Fig. 7.6.
7.9 Tables containing information about proteins.
7.10 Schema configuration for complex matching experiments.
7.11 Experiments with 1:n correspondences.
7.12 Experiments with m:n correspondences.
Part I
Mapping Overlapping Databases
Chapter 1
Putting Together Pieces of Information
When autonomously developed data sources are to be integrated, one inevitably
has to deal with heterogeneity. While standard solutions for the resolution of
conflicts arising from technical and data model heterogeneity exist, structural
and semantic heterogeneity is still an open issue despite several decades of re-
search. One of the most daunting challenges in data integration is schema
matching, which is the semi-automatic process of detecting correspondences
between semantically related attributes. This thesis describes a novel schema
matching algorithm that exploits extensional overlap (i.e., duplicates) between
data sources.
Before discussing the schema matching problem and our duplicate-based so-
lution, this chapter provides a general introduction to data integration. We
show that schema mappings are an integral part of most integrated information
systems. The creation of such mappings is guided by attribute correspondences,
which are usually established manually. Schema matching algorithms extract
correspondences in a semi-automatic process and thus reduce the cost of
developing an integrated information system.
1.1 Introduction
Many companies gather a large amount of data about customers, products,
sales, suppliers, etc. Much of this data is stored in relational databases, which
provide a uniform interface (or schema) that is used by multiple software
applications to access the information. Unless standard off-the-shelf products are
used, the database schemata can be designed to suit the specific requirements
of the company: The schemata can closely reflect the view of the company on
the business domain and be optimized with respect to their applications.
While the liberty the database designers have may benefit each individual
company, it creates additional problems when databases need to be integrated:
Assume a merger of two companies with the goal of generating positive synergy
effects. Although some departments might be unaffected, the company can
benefit from the merging of many operative areas, e.g., customer relationship or
procurement. This process also involves the integration of existing databases,
which is known to be an expensive process due to various forms of heterogeneity,
which are the result of autonomous development of the data sources.
Figure 1.1: Single application using many data sources (DB1, DB2, DB3).
One way to bring together the information from the databases is to write
new software that uses the existing data sources (Fig. 1.1). That application
would query each data source individually and combine the retrieved results.
While this solution is technically feasible, it is not desirable because in most
IT infrastructures several software applications use the databases. If another
application that requires data from all databases has to be developed, a lot of
the integration work (i.e., the resolution of heterogeneity) has to be redone. In
addition, inconsistencies may arise if the applications are run concurrently. To
avoid those problems, an integrated information system (IIS) is placed between
the applications and existing data sources (Fig. 1.2). An integrated information
system presents a uniform interface, which can be used by several applications
to access the sources. The creation of such applications is not as complex as in
the previous case, because only the IIS and not all sources need to be understood
by the developers. In addition, the IIS can provide regular DBMS services, e.g.,
concurrency control.
In the following we discuss different forms of heterogeneity that can be ob-
served in such scenarios. We describe different ways of integrating information
sources, and show that schema matching is an important step in the develop-
ment of such systems.
Figure 1.2: An integrated information system (IIS) on top of DB1, DB2, and DB3.
1.2 Heterogeneity and Conflicts in Data Integration
Data integration has been a research area in computer science for several decades
under various names: multidatabase systems [LMR90], federated database sys-
tems [SL90], mediator-based systems [Wie92], data warehouses [CD97], and
(more recently) peer database management systems [HIST03]. While differ-
ent types of integrated information systems differ with respect to their level
of coupling, materialization, etc., all of them have to deal with heterogeneity.
Heterogeneity arises when data sources are autonomously developed. In order
to avoid heterogeneity, widely accepted standards must be followed. However,
in most domains such standards do not exist or are not sufficient, which requires
the developers to customize the standard to their specific needs or create
their own domain model. As a consequence, one inevitably has to deal with
heterogeneity in the creation of an integrated information system.
Various classifications of heterogeneity can be found in the literature [BKLW99,
KS91]. We distinguish four types of heterogeneity:
1. Technical heterogeneity,
2. Data model heterogeneity,
3. Structural heterogeneity,
4. Semantic heterogeneity.
Technical heterogeneity is concerned with issues of technical nature, e.g.
hardware, operating system, or networking infrastructure. Conflicts resulting
from technical heterogeneity can be solved with existing middleware technology.
Furthermore, data model heterogeneity has to be resolved by the integra-
tors when data is stored in various formats. An integrated information system
should be able to integrate not only structured information (e.g. relational or
hierarchical data model), but also semi-structured data (e.g. XML) or unstruc-
tured documents. Conflicts resulting from data model heterogeneity are usually
resolved by wrapping the data sources. Each source wrapper translates between
the data model of the IIS and the data model of the information source. While
wrapping of Web sources is an active research topic (e.g., [CMM01]), wrapping
of structured data can be considered solved. Note that this thesis only considers
structured data in the relational data model.
Structural heterogeneity derives from the fact that, given a data model, in-
formation can be structured in various ways. Problems arising from structural
heterogeneity include data-metadata conflicts (a piece of information that is an
attribute value in one data source is an attribute or table name in the other) or
partitioning (different groupings of attributes into tables). Various other sorts
of conflicts occur due to semantic heterogeneity1. Even when the data is struc-
tured in the same way, the schemata under consideration might be different,
e.g., attributes might have varying labels. Naming conflicts result from the use
of homonyms (same word having a different meaning) and synonyms (different
words having the same meaning) when defining attribute names. We subsume
structural heterogeneity and semantic heterogeneity that results in differences
between the schemata under schematic heterogeneity. Data conflicts, which are
also a result of semantic heterogeneity, occur when the same information is rep-
resented in different ways, e.g. different formats, abbreviations, or acronyms. In
addition, some tables might contain erroneous, contradictory, or missing data,
which increases the complexity of data integration.
Fig. 1.3 depicts an extract of an integration scenario with a single source
table R and an integrated schema S, which consists of two tables S1 and S2.
In the following, the integrated schema is called target schema. Source table
R contains data about employees, including the name, salary, age, department,
and the head of the department. Similar information can be found in the target
schema, but structured as two tables S1 and S2. Data about the departments,
including the name and head of each department, is shown in table S2. Employees
are represented in table S1 by their name, salary, and a foreign key to the
tuple representing the department they work for. The dashed arrow represents
that foreign key, while the solid arrows indicate semantic relationships between
attributes of the two schemata.
Several examples of structural heterogeneity can be found in this scenario:
Firstly, employee information is structured as a single table in the source, but
the target schema contains two tables. Secondly, the name of each employee
is represented in two attributes in R, while the target table S1 has a single
attribute containing both given name and surname. Thirdly, the source table
R has an attribute Age, which does not have a corresponding partner in the
target database.

R
GivenName  Surname  Salary     Age  Dept  DeptHead
Max        Mueller  EUR 5,000  42   IDE   Smith
Sam        Adams    EUR 7,600  38   CRM   Meyer
Katie      Adams    EUR 7,300  54   CRM   Meyer

S1
Name          Salary     Dept
Jane Simpson  USD 4,100  CRM
Sam Adams     USD 6,800  CRM

S2
Dept  DHead
CRM   Meyer
IDE   Smith

Figure 1.3: Structural and semantic heterogeneity.

1 The term “semantic heterogeneity” has been heavily used in the literature with varying
meaning. As stated by Özsu and Valduriez, “semantic heterogeneity is a fairly overloaded
term without a clear definition” [ÖV99].
Semantic heterogeneity occurs both on the schema and instance levels. The
attribute representing the head of the department is labelled DeptHead in R
and DHead in S2. In the process of developing the integrated information
system, DeptHead and DHead must be identified as synonyms. Both R and S1
contain an attribute Salary representing the salary of employees. However, the
two attributes use different currencies: The salary is shown in Euro in R, while
S1 uses US dollars. Finally, data conflicts in duplicate tuples are a result of
semantic heterogeneity: One can see that the employee Sam Adams is represented
both in the source and the target database. However, after converting the salary
in the target database into Euro, one notices that the salary of Sam Adams is
much lower than in the source database. There are several possible reasons for
this conflict, e.g., Sam Adams has received a major pay raise that is not shown
in the target database, or the salary in one of the databases has been entered
incorrectly.
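The salary conflict for Sam Adams can be made concrete with a small sketch. Only the two salary values are taken from Fig. 1.3; the exchange rate, the tolerance, and all function names are hypothetical and chosen purely for illustration, since the text does not state them.

```python
# Illustrative sketch only. EUR_PER_USD and TOLERANCE are assumed values,
# not figures taken from the text.
EUR_PER_USD = 0.80   # hypothetical exchange rate
TOLERANCE = 0.05     # hypothetical relative tolerance for "same salary"


def parse_salary(value):
    """Split a value such as 'EUR 7,600' into (currency, amount)."""
    currency, amount = value.split()
    return currency, float(amount.replace(",", ""))


def to_eur(value):
    """Normalize a salary string to Euro."""
    currency, amount = parse_salary(value)
    if currency == "EUR":
        return amount
    if currency == "USD":
        return amount * EUR_PER_USD
    raise ValueError(f"unknown currency: {currency}")


def salaries_conflict(source_value, target_value):
    """Report a data conflict if the normalized salaries differ noticeably."""
    a, b = to_eur(source_value), to_eur(target_value)
    return abs(a - b) > TOLERANCE * max(a, b)


# Sam Adams in R (source) and in S1 (target), values from Fig. 1.3.
print(salaries_conflict("EUR 7,600", "USD 6,800"))
```

Under any exchange rate in which one US dollar is worth less than one Euro, the converted target salary stays well below EUR 7,600, so the conflict is reported regardless of the exact rate assumed here.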
The resolution of heterogeneity is a labor-intensive process. Fortunately,
standard solutions exist for some types of heterogeneity. As stated above, mid-
dleware applications can be used to bridge different technologies. Data model
heterogeneity can be resolved by existing wrapper technology. Unfortunately,
there is no automatic solution for structural and semantic heterogeneity. How-
ever, two research areas are concerned with those forms of heterogeneity: schema
mapping and duplicate detection.
The goal of schema mapping is the semi-automatic construction of mappings,
which describe the structural and semantic relationships between two
schemata. A mapping is constructed in two steps: schema matching (i.e., the
detection of attribute correspondences) and query discovery (i.e., the genera-
tion of a mapping based on the detected correspondences). Attribute corre-
spondences (depicted as solid arrows in Fig. 1.3) connect semantically related
attributes. Schema matching solutions, whose goal is to find such attribute cor-
respondences, have to deal with structural and semantic heterogeneity. Based on
the detected attribute correspondences, the query discovery process determines
a schema mapping. The process is challenging, because several interpretations
of a set of correspondences are possible, and ambiguities have to be resolved.
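The role of correspondences in building a mapping can be sketched for the tables of Fig. 1.3. The correspondences below are written by hand for illustration and are not the output of any matching algorithm; the function name and the choice to join name parts with a space and to copy salaries unchanged are assumptions. They represent just one of several possible interpretations, which is exactly the ambiguity that query discovery has to resolve.

```python
# Source table R from Fig. 1.3 (Age is omitted: it has no partner in the target).
R = [
    {"GivenName": "Max", "Surname": "Mueller", "Salary": "EUR 5,000",
     "Dept": "IDE", "DeptHead": "Smith"},
    {"GivenName": "Sam", "Surname": "Adams", "Salary": "EUR 7,600",
     "Dept": "CRM", "DeptHead": "Meyer"},
    {"GivenName": "Katie", "Surname": "Adams", "Salary": "EUR 7,300",
     "Dept": "CRM", "DeptHead": "Meyer"},
]

# Hand-written attribute correspondences (assumed for this sketch):
# (source attributes) -> (target table, target attribute).
CORRESPONDENCES = [
    (("GivenName", "Surname"), ("S1", "Name")),
    (("Salary",), ("S1", "Salary")),
    (("Dept",), ("S1", "Dept")),
    (("Dept",), ("S2", "Dept")),
    (("DeptHead",), ("S2", "DHead")),
]


def apply_mapping(source_rows, correspondences):
    """Interpret the correspondences as one possible source-to-target mapping."""
    targets = {}
    for row in source_rows:
        new_tuples = {}
        for source_attrs, (table, attr) in correspondences:
            # Assumption: several source attributes are joined with a space.
            value = " ".join(row[a] for a in source_attrs)
            new_tuples.setdefault(table, {})[attr] = value
        for table, tup in new_tuples.items():
            rows = targets.setdefault(table, [])
            if tup not in rows:  # set semantics: keep each department once
                rows.append(tup)
    return targets


for table, rows in apply_mapping(R, CORRESPONDENCES).items():
    print(table, rows)
```

The sketch produces three employee tuples for S1 and one tuple per department for S2. A different interpretation of the same correspondences, for example one that converts the salary to US dollars, would yield a different mapping.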
Duplicate detection is the process of identifying different representations of
the same real-world entity. An example of such a duplicate is the pair of
representations of Sam Adams in Fig. 1.3. Duplicate detection is complicated by
data conflicts as described above. This thesis is concerned with the problem
of schema matching, because it is one of the most daunting challenges in data
integration. Our proposed solution exploits extensional overlap, and thus, we
also have to deal with duplicate detection.
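To illustrate how duplicates can be found without knowing which attributes correspond, the following sketch compares whole tuples as sets of tokens. A plain Jaccard overlap is used here as a simple stand-in; it is not the similarity measure developed in this thesis, and all function names are hypothetical. The data is taken from Fig. 1.3, with the Age attribute left out.

```python
# Tuples of R and S1 from Fig. 1.3, without attribute names or alignment.
R = [
    ("Max", "Mueller", "EUR 5,000", "IDE", "Smith"),
    ("Sam", "Adams", "EUR 7,600", "CRM", "Meyer"),
    ("Katie", "Adams", "EUR 7,300", "CRM", "Meyer"),
]
S1 = [
    ("Jane Simpson", "USD 4,100", "CRM"),
    ("Sam Adams", "USD 6,800", "CRM"),
]


def tokens(row):
    """Treat a tuple as a set of lower-cased tokens, ignoring attribute borders."""
    return {tok.lower() for value in row for tok in str(value).split()}


def jaccard(a, b):
    """Token overlap between two sets (stand-in similarity, an assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_duplicates(table_r, table_s, k=1):
    """Return the k most similar tuple pairs as (score, r_tuple, s_tuple)."""
    scored = [
        (jaccard(tokens(r), tokens(s)), r, s)
        for r in table_r
        for s in table_s
    ]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return scored[:k]


for score, r, s in top_k_duplicates(R, S1, k=1):
    print(round(score, 3), r, s)
```

On this data the highest-scoring pair is formed by the two Sam Adams tuples, although their salaries disagree and the name is split differently in the two tables. The pairwise comparison is quadratic in the table sizes, so it only serves to show the idea.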
1.3 Types of Integrated Information Systems
As stated above, different types of integrated information systems exist. In
the following, we describe mediator-based systems, data warehouses, peer data
management systems, and multi-database systems. We also show that schema
mappings, which resolve structural and semantic heterogeneity, are an important
part of most of those systems. This section informally describes what mappings
are and how they are used by the different systems. In the following section, we
discuss two general approaches to the creation of mappings.
1.3.1 The Mediator-based System (MBS)
When developing an integrated information system, one needs to decide if the
data is to be stored in the data integration layer, or if the integrated information
system uses the sources to answer a user query. The mediator architecture
(Fig. 1.4), which was originally described by Wiederhold [Wie92], is an example
of virtual integration: Instead of storing the data in a database at the data
integration layer, the mediator uses the data sources to answer queries. When
the user sends a query, the mediator translates the user query into several queries
that are sent to the sources. The retrieved results are merged and presented to
the user.
The sources can be of different types, e.g., relational or XML database sys-
tems, web sites, etc. As stated above, data model heterogeneity is resolved by
wrappers, which receive a query from the mediator and translate it into a query
that is understood by the source. Afterwards the wrapper retrieves the result
and translates it into the canonical data model, i.e., the data model that is used
by the mediator.
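The interplay of mediator and wrappers can be sketched in a few lines of Python. The class names (Wrapper, Mediator), the method names, and the sample rows are hypothetical and serve only to illustrate virtual integration; in particular, this simplified wrapper filters after translating rows, whereas a real wrapper would translate the query itself into the source's query language.

```python
# Hypothetical sketch of virtual integration: the mediator stores no data
# and consults every wrapper at query time.

class Wrapper:
    """Translates rows of one source into the canonical model (dicts)."""

    def __init__(self, rows, rename):
        self.rows = rows        # stands in for the actual data source
        self.rename = rename    # source attribute -> canonical attribute

    def query(self, predicate):
        # Translate each source row into the canonical model, then filter.
        for row in self.rows:
            canonical = {self.rename[k]: v for k, v in row.items()
                         if k in self.rename}
            if predicate(canonical):
                yield canonical


class Mediator:
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def answer(self, predicate):
        # Send the query to every source and merge the retrieved results.
        result = []
        for wrapper in self.wrappers:
            result.extend(wrapper.query(predicate))
        return result


db_source = Wrapper([{"SN": "Adams", "Age": 30}],
                    {"SN": "Surname", "Age": "Age"})
xml_source = Wrapper([{"name": "Klein", "age": 41}, {"name": "Wong", "age": 20}],
                     {"name": "Surname", "age": "Age"})
mediator = Mediator([db_source, xml_source])

older_than_25 = mediator.answer(lambda person: person["Age"] > 25)
```

Because no data is copied into the mediator, any change to the rows of a source is visible in the next call to answer.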
Figure 1.4: The mediator architecture. (Components shown: a mediator on top of a DB2 wrapper, an XML wrapper, and an HTML wrapper, each in front of its source.)
Like all integrated information systems, mediator-based systems have the
advantage that the user only needs to understand the schema and data model
of the mediator. The mediator layer completely hides the details of the sources
from the user. In addition, the sources are unaffected by the mediator: They
can be used in the same way as they have been used before the mediator was in
place. Essentially, the mediator acts as another database application, and the
autonomy of the data sources is retained. Unfortunately, this raises additional
challenges: Whenever a data source is modified, the mediator and its mappings
need to be adapted. This holds for modifications of the schema, data model, or
technical infrastructure of the data source. Automatic detection of such changes
is an active research area [LMK03, MAL+05]. In contrast, adding, removing,
or modifying data does not affect the mediator. Instead, the use of virtual
integration ensures that the user always receives an up-to-date result.
Query Processing in MBS
Query processing in mediator-based systems is driven by mappings, which de-
scribe the semantic relationship between the source schemata and the mediator
schema [Len02]. Those relationships are usually described as declarative spec-
ifications of the data transformation between a source and the mediator. The
specification language depends on the mediator and its data model: E.g., TSIM-
MIS is based on the semi-structured Object Exchange Model (OEM) and uses
the Mediator Specification Language (MSL) to specify mappings [GMPQ+97,
PAGM96], while Information Manifold integrates relational data and specifies
semantic relationships as conjunctive queries [LRO96a, LRO96b].
Another distinguishing feature is the direction of the mapping, which de-
termines the query processing strategy [Ull97]. Global-as-view (GaV) systems
describe mediator relations as views over the source schemata. To translate
the user query, which is formulated in terms of the mediator schema, into one
or several source queries, a technique called view expansion (or query unfolding)
is used. As the name suggests, the mediator relations in the user query are re-
placed by their definitions in the mappings, resulting in a query containing only
source relations. In local-as-view (LaV) systems, the sources are described as
views over the mediator. E.g., Information Manifold describes each source rela-
tion as a conjunctive query involving one or several mediator relations. Query
processing then reduces to the problem of answering queries using views [Hal01, PL00].
Figure 1.5: Matching between sources and the mediator. (Mediator relation M with attributes GivenName, Surname, Age, Salary; source relation S1 with GN, SN, Age; source relation S2 with GN, SN, Salary, Dept.)
In the following we illustrate GaV query processing using the example depicted
in Fig. 1.5. The figure shows a single mediator relation M and two source
relations S1 and S2. Arrows depict semantic relationships between attributes.
To simplify the description, we use conjunctive queries to describe a mapping.
In the above example, the GaV mapping would be:
M(GN, SN, A, S) :- S1(GN, SN, A), S2(GN, SN, S, D)
which represents a join of S1 and S2 on the respective first attribute (GN) and
the respective second attribute (SN). The schema of the mediator table does
not contain an attribute for the department Dept. User queries are formulated
in terms of the mediator relation M. E.g., if the user asked for the salary of
thirty-year-old people, the query would be:
Q(S) :- M(GN, SN, A, S), A = “30”.
The mediator would expand the view M with its definition, resulting in a
query
Q(S) :- S1(GN, SN, A), S2(GN, SN, S, D), A = “30”.
Based on that expanded query, the mediator produces a query plan, which
describes how and in what order data sources are accessed. A description of the
planning phase in TSIMMIS can be found in [PAGM96].
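The expansion step can be sketched as a small substitution procedure. The representation of atoms as tuples, the dictionary VIEWS, and the function unfold are assumptions made for this illustration; the sketch handles only the relational atoms of a conjunctive query and leaves selections such as the age condition aside.

```python
# Hypothetical sketch of GaV view expansion (query unfolding).
# An atom is a pair (relation_name, tuple_of_variables).

# GaV mapping from the text: M(GN, SN, A, S) :- S1(GN, SN, A), S2(GN, SN, S, D)
VIEWS = {
    "M": (("GN", "SN", "A", "S"),
          [("S1", ("GN", "SN", "A")), ("S2", ("GN", "SN", "S", "D"))]),
}


def unfold(body):
    """Replace every mediator atom in a query body by its view definition."""
    expanded = []
    for i, (relation, args) in enumerate(body):
        if relation not in VIEWS:
            expanded.append((relation, args))
            continue
        head_vars, definition = VIEWS[relation]
        # Map the head variables of the view to the variables of the query.
        substitution = dict(zip(head_vars, args))
        for source_relation, source_args in definition:
            # Variables occurring only in the view body (here D) get a
            # per-atom suffix so they cannot clash with query variables.
            expanded.append((
                source_relation,
                tuple(substitution.get(v, f"{v}_{i}") for v in source_args),
            ))
    return expanded


# Relational part of the user query: M(g, s, a, sal)
query_body = [("M", ("g", "s", "a", "sal"))]
expanded_body = unfold(query_body)
```

The result contains only source relations, which is the property the planning phase relies on.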
1.3.2 The Data Warehouse
The management of a large enterprise requires decision support systems to ob-
tain a global view on the company. To extract relevant information about the
performance of the enterprise, Online Analytical Processing (OLAP) applica-
tions are used [CD97]. Such applications are able to process large amounts of
data and extract hidden knowledge, e.g., purchasing patterns of customers. The
data used by OLAP tools comes from a data warehouse, which in turn receives
data from very many operative databases and applications within the company.
In contrast to mediator-based systems, data warehouses perform materialized
integration: Due to the huge amount of data (the largest data warehouses
store several terabytes of data), accessing the sources every time a user sends
a query clearly does not scale. Instead, new data is copied at regular intervals
from the sources to the warehouse, and user queries are processed on that local
database. In the extract-transform-load (ETL) process, data is extracted from
the operative databases and transformed into the warehouse schema before it
is loaded into the data warehouse. Because the ETL process is time-consuming
and requires a lot of resources, it is usually performed when the load on the
operative databases is very low, e.g., at night. Afterwards, the user can send
queries even when the operative databases are heavily used, because the data
warehouse uses its own copy of the data to produce a query result. Note that
the warehouse data is not always up-to-date. However, that does not affect the
quality of the query answers because of the nature of OLAP queries: The user is
usually interested in aggregates, e.g., sales per month, and the difference between
the last warehouse update and the current state of the operative databases has
only a negligible effect on the query result.
The ETL process usually involves complex schema transformations, because
the schema design of operative databases and warehouse database follows differ-
ent design principles. On the one hand, the schemata of the operative databases
are designed and optimized with respect to their local applications: They closely
reflect the application domain and follow normalization rules to avoid redun-
dancy. On the other hand, the warehouse schema usually follows a typical design
pattern, e.g., star schema or snowflake schema [CD97]. The ETL process also
involves schema mappings that describe the transformation of source data such
that the data can be used in later stages of the process.
1.3.3 Other Types of Integrated Information Systems
Peer-to-peer (P2P) systems have become a very active research area in the last
decade [DGMY03, ATS04]. While the predominant purpose of P2P networks
is file sharing, several efforts have been made to apply the peer-to-peer
concept in data management [HIST03, HHL+03, NOTZ03, RNHS06]. In a P2P
network, there is no distinction between client and server. Instead, all nodes in
the network act as both consumer and provider of information.
Figure 1.6: A peer data management system. (Nodes shown: Berlin, London, Paris, New York, Chicago.)
While research on P2P file sharing considers very large networks (i.e., hundreds
or thousands of nodes), a peer data management system (PDMS) is also conceivable
on a smaller scale. Fig. 1.6 depicts a schematic representation of a PDMS network
of a global company: To facilitate information exchange between different
locations, five databases are connected in a PDMS network. Each node acts both
as a client and a server, i.e., a user in Berlin can send a query based on its local
schema, and the result also contains data from the other four locations. The
arrows in the figure indicate schema matchings, which are used for query rewrit-
ing. Note that although the mappings are directed, queries can be forwarded in
any direction using either GaV or LaV query rewriting [HIST03, TH04].
Techniques developed in the context of mediator-based systems, e.g., GaV
or LaV query rewriting, can also be applied in PDMS networks. However,
additional problems arise due to the more general architecture. If a query has to
take several hops from the query node to the network node that contains relevant
data, then the mappings on the path need to be combined [MH03, FPKT04].
Some relevant attributes might be lost in the composed mapping, because the
schema of an intermediate node does not contain those attributes. This affects
query routing: In mediator-based systems, the path the query takes from the
user to the source is predefined. In a PDMS network, several distinct paths
are possible. Finding a good path that avoids information loss is a challenging
problem [ACMH03]. Besides mapping composition and query routing, other
problems arise when the network is assumed to be unstable, i.e., if nodes can
dynamically enter and leave the network. Such a system would require
(semi-)automatic creation of semantic mappings every time a node enters the network.
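The information loss caused by mapping composition can be made concrete with a small sketch. Mappings are simplified here to dictionaries from source attribute to target attribute; the attribute names and the function compose are hypothetical, and only the city names are taken from Fig. 1.6.

```python
# Hypothetical sketch: composing attribute mappings along a path of peers.

def compose(first, second):
    """Compose two attribute mappings given as dicts (source -> target)."""
    return {a: second[b] for a, b in first.items() if b in second}


# Assumed schemata: Berlin(GivenName, Surname, Salary),
# Paris(GN, SN), London(First, Last, Pay).
berlin_to_paris = {"GivenName": "GN", "Surname": "SN"}   # Paris has no salary
paris_to_london = {"GN": "First", "SN": "Last"}
berlin_to_london_direct = {"GivenName": "First", "Surname": "Last",
                           "Salary": "Pay"}

via_paris = compose(berlin_to_paris, paris_to_london)

# Attributes reachable over a direct mapping but lost on the path via Paris.
lost = set(berlin_to_london_direct) - set(via_paris)
```

A query about salaries that is routed from Berlin via Paris cannot reach the Pay attribute in London, although both end points store that information.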
Multi-database systems differ from the previously discussed architectures be-
cause they do not require predefined mappings [LMR90]. Instead, they merely
provide a query language that allows the user to send a single query involving
multiple sources [LSS01]. The architecture is similar to the scenario depicted
in Fig. 1.1: The user does not use a global schema, but interacts with each
source. In contrast to querying Internet sources individually and merging the
results manually, the multi-database system provides a query language that can
be used to describe this process in a single expression. This requires the user to
know the semantic relationships between the sources. In other words, the user
needs a mental model of the mappings in order to specify a reasonable query.
One might argue that multi-database systems are not a variant of integrated
information systems, because they do not provide an integration layer. However,
we have included them in this section because multi-database systems are part of
data integration research, and the research results are also applicable in other
types of IIS.
In the following we describe how mappings are created. Although the discus-
sion is focused on mediator-based systems, the general ideas are also applicable
to data warehouses and (to some extent) peer data management systems.
Multi-database systems are not of interest to us because those systems do not contain
schema mappings, but delegate the problem of resolving structural and semantic
heterogeneity to the user.
1.4 Building An Integrated Information System
Schema mappings, which describe the semantic relationship between heteroge-
neous schemata, are an integral part of many types of integrated information
system. The creation of mappings is closely bound to the principal approach in
which a database integration system is built. In this section, we describe two
general processes: schema mapping and schema integration.
1.4.1 Schema Mapping: The Top-down Approach
In the top-down approach, the mediator schema is designed independently of the
underlying sources. Instead, the schema should closely reflect the user’s view
on the domain and their information needs. After the mediator schema has been
defined, relevant data sources need to be added to the system. This process
involves the creation of wrappers, which translate the data model of the sources
into the data model of the mediator, and the definition of schema mappings,
which are used by the mediator for query rewriting. We assume that the former
task can be solved by existing standard solutions, and restrict our discussion to
the resolution of schematic conflicts in the mapping step.
Sec. 1.3.1 shows a simple example with two schemata and six correspon-
dences (Fig. 1.5). Based on those correspondences, the mapping between those
schemata can be specified in a very short period of time. In real-world scenarios,
where developers have to deal with dozens or hundreds of relations, the
specification of mappings is a time-consuming and error-prone process: Li and
Clifton mention a project at GTE, whose aim was the integration of 27,000 data
elements [LC00]. It required an average of 4 hours to extract and document one
matching element if that task was performed by a person other than the data
owner. Hence, it is absolutely essential to provide tools that aid the user in the
mapping process.
The creation of schema mappings is performed in two steps:
1. Schema Matching: Semi-automatic detection of attribute correspondences
2. Query Discovery: Mapping definition based on detected correspondences.
The first step is the detection of attribute correspondences, which connect
semantically related attributes (depicted as arrows in Fig. 1.5). To create a
matching, conflicts resulting from structural and semantic heterogeneity must
be resolved. This process is considered to be semi-automatic, i.e., interaction
with the user is always required, because available information is usually not
sufficient. Most algorithms rely on schema metadata or instance information.
Documentation rarely exists in a format that can be exploited by automatic
tools. Hence, the schema matching detected by automatic tools is usually im-
perfect and needs to be adapted by the user. Schemata and instances do not
provide enough information to determine the desired matching with 100% ac-
curacy, because the intention of the developer is not fully specified in them.
Related work on schema matching is described in Sec. 2.3.
Schema mappings are created in the second step based on the detected corre-
spondences. Because in most cases different interpretations of the same match-
ing are conceivable, the user might have to interact with the tool: E.g., the
correct mapping of the above example defines a join of S1 and S2. However,
based on the attribute correspondences depicted in Fig. 1.5, a union of the two
relations is also conceivable, which results in the following mapping:
M(GN, SN, A, null) :- S1(GN, SN, A)
M(GN, SN, null, S) :- S2(GN, SN, S, D)
where null is a special value stating that the value is unknown. In such a case,
the user needs to decide which of the mappings is correct. Query discovery
techniques are discussed in Sec. 2.4.2.
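The difference between the two interpretations becomes visible on a small instance. The sample tuples and the function names below are hypothetical; Python's None plays the role of the null value.

```python
# Hypothetical instances of S1(GN, SN, Age) and S2(GN, SN, Salary, Dept).
S1 = [("Sam", "Adams", 30), ("Suzy", "Klein", 41)]
S2 = [("Sam", "Adams", 5000, "Sales"), ("Joe", "Smith", 4000, "IT")]


def join_mapping(s1, s2):
    """M(GN, SN, A, S) :- S1(GN, SN, A), S2(GN, SN, S, D)"""
    return [(gn, sn, age, salary)
            for gn, sn, age in s1
            for gn2, sn2, salary, _dept in s2
            if (gn, sn) == (gn2, sn2)]


def union_mapping(s1, s2):
    """M(GN, SN, A, null) :- S1(GN, SN, A)
       M(GN, SN, null, S) :- S2(GN, SN, S, D)"""
    return ([(gn, sn, age, None) for gn, sn, age in s1] +
            [(gn, sn, None, salary) for gn, sn, salary, _dept in s2])


joined = join_mapping(S1, S2)     # only people present in both sources
unioned = union_mapping(S1, S2)   # all people, with unknown values as None
```

Both results are consistent with the same six attribute correspondences, which is why the correspondences alone do not determine the mapping.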
1.4.2 Schema Integration: The Bottom-Up Approach
In the example of Fig. 1.5, the mediator does not contain all source data because
the attribute Dept is missing. Some applications require all source data to be
part of the integrated information system. E.g., when two companies merge
and their databases are to be integrated, no data must be lost in the process.
In such a scenario, a top-down approach is not advisable. Instead, the system
should be built bottom-up using a schema integration process [BLN86, SPD92,
Bus02, PB03].
Figure 1.7: Integration of two schemata. (Source schemata S1 with GN, SN, Age and S2 with GN, SN, Salary, Dept are integrated into schema M with GN, SN, Age, Dept, Salary.)
Fig. 1.7 depicts two source schemata S1 and S2, and the schema M that
is the result of integrating the two source schemata. The schema integration
proceeds in two steps:
1. Description of Inter-schema Correspondences: Common schema elements
are identified and their relationships are specified.
2. Schema Integration: The schemata are merged based on the specified
inter-schema correspondences. The result is an integrated schema and
mappings that describe the semantic relationships between the source
schemata and the integrated schema.
In the first step, the user needs to determine the semantic overlap between
the databases, which is formally described as inter-schema correspondences.
Note that inter-schema correspondences in schema integration differ from
attribute correspondences in schema matching: While the former provide a very
specific description of the relationship between schemata both on the intensional
and the extensional level, the latter only represent a relationship between at-
tributes without well-defined semantics.
Although schema matching is usually discussed as a first step in schema
mapping, we believe that it can be a viable tool in schema integration as well.
Based on the detected attribute correspondences, the semantic relationships
between the source schemata can be declaratively described in a correspondence
language like MoCA [Bus02]. Using the generic language defined in [SPD92]
and assuming that the two tables overlap in their extension (i.e., some people
are represented in both tables), the inter-schema correspondences of the above
example are:
S1.FN ∩ S2.FN
S1.GN ∩ S2.GN.
The ∩ predicate shows that the extensions of those attributes intersect. In
the second step, the integrated schema is created based on the input schemata
and the inter-schema correspondence assertions. Because the FN attributes are
related, they are represented as a single attribute in the integrated schema M.
The GN attributes are handled likewise. The schema mappings between the
integrated schema and the source schema are created along the way [SPD92].
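The merging step can be sketched as follows. This is a strongly simplified illustration and not the integration method of [SPD92]: schemata are reduced to attribute lists, correspondences to attribute pairs, and the function name integrate is an assumption. The attribute lists follow the labels shown in Fig. 1.7.

```python
# Hypothetical sketch: merge two attribute lists based on correspondences
# and record the mappings to the integrated schema along the way.

def integrate(schema1, schema2, correspondences):
    """Return (integrated schema, mapping of schema1, mapping of schema2)."""
    partner = {a2: a1 for a1, a2 in correspondences}
    integrated = list(schema1)
    mapping1 = {a: a for a in schema1}
    mapping2 = {}
    for attribute in schema2:
        if attribute in partner:
            # Related attributes are represented by a single attribute.
            mapping2[attribute] = partner[attribute]
        else:
            # Unrelated attributes are carried over so that no data is lost.
            integrated.append(attribute)
            mapping2[attribute] = attribute
    return integrated, mapping1, mapping2


S1 = ["GN", "SN", "Age"]
S2 = ["GN", "SN", "Salary", "Dept"]
M, map_s1, map_s2 = integrate(S1, S2, [("GN", "GN"), ("SN", "SN")])
```

In contrast to the top-down example of Fig. 1.5, the attribute Dept is part of the integrated schema.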
1.5 Duplicate-based Schema Matching
We have shown that mappings play a major role in many integrated informa-
tion systems. The creation of mappings is a time-consuming process, but the
developers can be aided by semi-automatic tools. One of the most daunting
challenges in the generation of mappings is the detection of attribute correspon-
dences. To support the developers in that process, various schema matching
solutions have been proposed. Each of them uses different kinds of information
to detect correspondences, e.g., metadata contained in the database schema or
attribute-specific features extracted from attribute values. The schema matcher
described in this thesis exploits the fact that in many data integration scenarios
an extensional overlap exists, i.e., some real-world entities are represented in
both databases. The DUMAS (duplicate-based matching of schemata) algo-
rithm is able to detect such duplicates and extract attribute correspondences
from them.
1.5.1 Example Scenarios
While relying on duplicates for schema matching appears to be too restrictive
at first glance, we emphasize that in many data integration scenarios duplicates
exist:
Customer Relationship Management (CRM). In large enterprises, differ-
ent departments maintain their own data about customers. The customer
information can be available in databases, spreadsheets, or unstructured
files. Such an infrastructure is very inefficient when the management needs
all information about their customer base. Customer Relationship Man-
agement (CRM) systems are established to avoid searching all existing
data sources for the required information.
The main motivation for establishing a CRM system is the fact that infor-
mation about the same customer is spread over multiple sources. Hence,
we can assume that duplicates exist in such a scenario. Detecting those
duplicates is no trivial task, because customer data might change over
time (e.g., because the customer has moved or has received a new phone
number), and some departments have not updated their databases. In
addition, we have to deal with schematic heterogeneity, because some in-
formation about a customer might be represented in only a single source,
e.g., because that information is only relevant for one department. How-
ever, we can assume that every data source contains attributes that can
be used for identifying a customer (e.g., name, address, phone number,
etc.).
Comparison Shopping. Many retailers provide stores on the Internet, which
can be easily queried by the user to search for a given product. As in
traditional stores, the price for a given product differs from store to store.
Because Internet stores are easily accessible, comparison shopping agents
have been created that search for the best price of a given product.
Adding a store to a comparison shopping agent involves many tasks besides
schema matching, e.g., the creation of wrappers, which extract information
from Web pages and translate the data into relations. To find correspon-
dences between the store and the agent, one could exploit sample data
that has been extracted from the Web pages by crawling the store’s site.
Note that duplicates can be expected in a comparison shopping scenario:
If a given product were not available in several stores, searching for the
lowest price would not be an issue. In all domains, the description of a
product contains enough information for the user to identify a product:
E.g., books usually have a title, an author, a publisher, and an edition.
Books also have an ISBN number, which is globally unique. However,
such an identifying attribute cannot be expected in all domains.
Catalogue Integration. When two retailers merge, they want their customers
to have access to all available products. Unless the two companies are kept
separate from each other, the product databases need to be integrated.
The process of combining two catalogues is similar to finding a best price
for a given product: In both cases, one has to identify different represen-
tations of the same product. One also has to resolve schematic differences,
because some information can only be found in one of the catalogues. The
main difference is that catalogue integration can be performed once, and
new products can be added to the integrated catalogue. In contrast, com-
parison shopping agents cannot modify the source databases, and thus,
have to regularly query the sources to get the latest price.
The data integration problems described above show some characteristics of
duplicate-based schema matching scenarios: Two or more data sources exist,
and some real-world entities are represented in several sources. The attributes
used to describe the entities differ, because only information that is required by
local applications is kept in the databases. However, there is sufficient data to
identify an entity: While globally unique identifiers (e.g., ISBN number) cannot
be expected, the identity of a real-world object can be deduced with sufficient
accuracy from a few attributes that exist in all sources.
Detecting duplicates is no trivial task even when descriptive attributes ex-
ist, because the attribute values for the same entity might differ. One reason
for different values is the fact that the same information can be represented in
different ways: E.g., “Microsoft” and “Microsoft Inc.” both refer to the same
company. Data quality issues are another reason for differing values: Infor-
mation might be out-of-date as in the CRM scenario, or some attribute values
are erroneous. Due to such inconsistencies, finding duplicates with very high
accuracy is a challenging problem.
1.5.2 The DUMAS Approach
Relation R (FirstName, LastName, Sex, Phone, Fax):
r1 John Doe m (408) 7573339 (408) 7573338
r2 Joe Smith m (249) 3615616 (249) 2342366
r3 Suzy Klein f (358) 2436321 (358) 2436321
r4 Sam Adams m (541) 8127100 (541) 8121164
r5 Mark Spitz m (901) 8319311 (901) 8612382
r6 Jim Beam (782) 1238957 (781) 1883744
r7 Kate Moss f (124) 9654565
r8 Sam Wong f (124) 4955670 (999) 9999999
r9 John Dean m (369) 3663624 (367) 3663625
Relation S (LN, Acc, Tel, OS):
s1 Douglas jdouglas (408) 9182043 XP
s2 Dean jd (369) 3663624 XP
s3 Klein littlesue (358) 2436321 UNIX
s4 Adams sam (541) 8127100 W2000
s5 Wong kate (923) 6363443 Linux
s6 Kurz itsme UNIX
Figure 1.8: Relations R and S with intensional and extensional overlap.
To illustrate the general idea, consider the example depicted in Fig. 1.8.
Both relations R and S contain information about people, but are differently
structured: Relation R has attributes for the first name, last name, gender,
phone number, and fax number, while relation S contains attributes representing
last name, user account, phone number, and operating system. The goal
of schema matching is to detect correspondences between semantically related
attributes. In the example scenario only the following correspondences exist:
LastName ↔ LN
Phone ↔ Tel
Various semi-automatic solutions exploiting different kinds of ‘hints’ from
the available data have been proposed (see Sec. 2.3). Schema-based algorithms
use metadata extracted from the database schemata, e.g., attribute names or
data types. Those approaches suffer from the fact that autonomously developed
databases use different names for the same attributes. In the above example
one might guess that LN is an acronym for “last name”, but inferring that
Tel is a synonym for Phone requires more sophisticated means than string
comparison. Note that in some cases database designers chose to use names
that are not related to the semantics of the attributes, which renders schema-
based matchers useless.
Instance-based matchers do not suffer from the above drawbacks because
they use actual data from the tables. Most proposed algorithms extract char-
acteristic features from the values of each attribute and match attributes that
have similar features. This approach works well if a good set of distinguishing
features is chosen, which is a challenging problem itself. Unfortunately, some
attributes are indistinguishable even when a reasonable feature set is used: Relation
R in the above example contains attributes for phone number and fax
number. The values of those attributes have the same structure. In fact, even
a human observer is unlikely to be able to distinguish phone and fax numbers
without knowing the respective attribute labels. Thus, it seems impossible to
determine which of them matches with Tel in relation S.
The problems described above can be solved by following the DUMAS ap-
proach, which exploits duplicate tuples for schema matching. Different repre-
sentations of the same real-world object are called duplicates. In the example in
Fig. 1.8, tuples r3, r4, and r9 in R represent the same entities as tuples s3, s4,
and s2 in S, respectively. Those duplicates provide valuable information that
can be used for schema matching: E.g., the tuple pairs representing the same
entity always have the same value in LastName and LN, thus indicating that
the two attributes correspond. Two duplicates, namely (r4, s4) and (r9, s2), also
provide hints that help in distinguishing Phone and Fax: Both r4 and r9 have
different values for Phone and Fax, and the Phone values equal the Tel values
in the respective matching tuple. In this thesis we show how to extract attribute
correspondences from duplicate tuple pairs.
In principle, any duplicate-based schema matching must proceed in two
steps:
1. Duplicate detection: A few duplicate tuples are found in the databases,
and
2. Matching: Attribute correspondences are extracted from the duplicates.
The goal of the first step is to detect a few duplicates. As will be shown in
this thesis, the actual number of duplicates required for matching can be quite
small and is far from the total number of duplicates in the data set. Those
duplicates are used to discover attribute correspondences in the second step.
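The second step can be illustrated on the three duplicates of Fig. 1.8. The sketch below assumes that the duplicates are already known and uses plain equality of values as evidence; this is a deliberate simplification for illustration, and the function name agreement is hypothetical.

```python
# Duplicate tuples from Fig. 1.8 (the duplicate pairs are assumed as given).
R_ATTRS = ("FirstName", "LastName", "Sex", "Phone", "Fax")
S_ATTRS = ("LN", "Acc", "Tel", "OS")

R = {
    "r3": ("Suzy", "Klein", "f", "(358) 2436321", "(358) 2436321"),
    "r4": ("Sam", "Adams", "m", "(541) 8127100", "(541) 8121164"),
    "r9": ("John", "Dean", "m", "(369) 3663624", "(367) 3663625"),
}
S = {
    "s2": ("Dean", "jd", "(369) 3663624", "XP"),
    "s3": ("Klein", "littlesue", "(358) 2436321", "UNIX"),
    "s4": ("Adams", "sam", "(541) 8127100", "W2000"),
}
DUPLICATES = [("r3", "s3"), ("r4", "s4"), ("r9", "s2")]


def agreement(r_attr, s_attr):
    """Fraction of duplicates whose values agree on the two attributes."""
    i, j = R_ATTRS.index(r_attr), S_ATTRS.index(s_attr)
    hits = sum(R[r][i] == S[s][j] for r, s in DUPLICATES)
    return hits / len(DUPLICATES)


scores = {(a, b): agreement(a, b) for a in R_ATTRS for b in S_ATTRS}
```

LastName and LN agree in all three duplicates, as do Phone and Tel, whereas Fax agrees with Tel only in the pair (r3, s3), where phone and fax number happen to be identical.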
Contributions of this Thesis
Although this general process appears to be simple and has been considered in
related work (see Sec. 2.3.4), some important aspects (e.g., duplicate detection
in unaligned schemata or multi-table duplicates) have not been tackled at all,
while other problems are solved only to a small extent. With this thesis we
make the following contributions:
Duplicate detection in unaligned schemata. The problem of duplicate
detection has been considered under various names, e.g., entity identification,
deduplication, or record linkage. Several solutions have been proposed, but all of
them have one property in common: They require the attribute correspondences
to be known. Unfortunately, finding attribute correspondences is the goal of
schema matching, and thus, existing solutions cannot be used in the duplicate
detection step of our approach.
Hence, we have to solve the problem of finding duplicates when correspon-
dences are not known, which has not been considered before. We show the
challenges associated with that problem and describe an algorithm that is able
to detect at least a few duplicates even when the semantics of the attributes are
unknown. The experiments indicate that the proposed method works well even
in difficult cases where only few duplicates exist and when only few attributes
correspond.
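One simple way to compare tuples without knowing attribute correspondences is to ignore the attribute boundaries and to treat every tuple as a set of tokens. The following sketch ranks all tuple pairs of Fig. 1.8 by the Jaccard similarity of their token sets. It is meant only to illustrate the problem setting under that assumption and is not the algorithm developed in this thesis; the function names are hypothetical.

```python
from itertools import product

# Tuples of Fig. 1.8 as plain value lists; attribute semantics are not used.
R = {
    "r1": ["John", "Doe", "m", "(408) 7573339", "(408) 7573338"],
    "r2": ["Joe", "Smith", "m", "(249) 3615616", "(249) 2342366"],
    "r3": ["Suzy", "Klein", "f", "(358) 2436321", "(358) 2436321"],
    "r4": ["Sam", "Adams", "m", "(541) 8127100", "(541) 8121164"],
    "r5": ["Mark", "Spitz", "m", "(901) 8319311", "(901) 8612382"],
    "r6": ["Jim", "Beam", "(782) 1238957", "(781) 1883744"],
    "r7": ["Kate", "Moss", "f", "(124) 9654565"],
    "r8": ["Sam", "Wong", "f", "(124) 4955670", "(999) 9999999"],
    "r9": ["John", "Dean", "m", "(369) 3663624", "(367) 3663625"],
}
S = {
    "s1": ["Douglas", "jdouglas", "(408) 9182043", "XP"],
    "s2": ["Dean", "jd", "(369) 3663624", "XP"],
    "s3": ["Klein", "littlesue", "(358) 2436321", "UNIX"],
    "s4": ["Adams", "sam", "(541) 8127100", "W2000"],
    "s5": ["Wong", "kate", "(923) 6363443", "Linux"],
    "s6": ["Kurz", "itsme", "UNIX"],
}


def tokens(values):
    """Lower-cased whitespace tokens of all values of a tuple."""
    return {token.lower() for value in values for token in value.split()}


def jaccard(a, b):
    return len(a & b) / len(a | b)


def top_k_duplicates(r, s, k):
    """Return the k tuple pairs with the highest token overlap."""
    scored = [(jaccard(tokens(rv), tokens(sv)), rid, sid)
              for (rid, rv), (sid, sv) in product(r.items(), s.items())]
    scored.sort(key=lambda entry: entry[0], reverse=True)
    return [(rid, sid) for _, rid, sid in scored[:k]]


candidates = top_k_duplicates(R, S, 3)
```

On this small example the three best-ranked pairs are the duplicates named in the text. Pairs such as (r1, s1), which share only an area code, receive much lower scores.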
Matching single tables based on duplicates. The goal of the second step
is to use the detected duplicates to establish semantic correspondences. In
other words, intensional similarities have to be deduced from extensional ‘hints’,
which are provided by attribute values. For that purpose, we define a domain-
independent similarity function that is used to compare attribute values. Based
on the similarity scores of several duplicates, we extract a set of 1:1 correspon-
dences. The experimental evaluation indicates that it is possible to extract a
high-quality matching from only a few duplicates. It also shows that the algo-
rithm is robust with respect to false duplicates.
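The extraction of 1:1 correspondences from similarity scores can be sketched as follows. Two substitutions are made for the sake of a short, self-contained example and are assumptions of this sketch: the ratio of difflib.SequenceMatcher from the Python standard library stands in for the domain-independent similarity function, and a greedy selection with a threshold of 0.5 stands in for the extraction step.

```python
from difflib import SequenceMatcher

R_ATTRS = ("FirstName", "LastName", "Sex", "Phone", "Fax")
S_ATTRS = ("LN", "Acc", "Tel", "OS")

# The three duplicate pairs of Fig. 1.8 as (tuple of R, tuple of S).
DUPLICATES = [
    (("Suzy", "Klein", "f", "(358) 2436321", "(358) 2436321"),
     ("Klein", "littlesue", "(358) 2436321", "UNIX")),
    (("Sam", "Adams", "m", "(541) 8127100", "(541) 8121164"),
     ("Adams", "sam", "(541) 8127100", "W2000")),
    (("John", "Dean", "m", "(369) 3663624", "(367) 3663625"),
     ("Dean", "jd", "(369) 3663624", "XP")),
]


def similarity(a, b):
    """Stand-in string similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def average_scores():
    """Average value similarity per attribute pair over all duplicates."""
    n = len(DUPLICATES)
    return {
        (ra, sa): sum(similarity(r[i], s[j]) for r, s in DUPLICATES) / n
        for i, ra in enumerate(R_ATTRS)
        for j, sa in enumerate(S_ATTRS)
    }


def greedy_one_to_one(scores, threshold=0.5):
    """Pick attribute pairs by descending score; each attribute at most once."""
    matching, used_r, used_s = set(), set(), set()
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    for (ra, sa), score in ranked:
        if score < threshold:
            break
        if ra not in used_r and sa not in used_s:
            matching.add((ra, sa))
            used_r.add(ra)
            used_s.add(sa)
    return matching


matching = greedy_one_to_one(average_scores())
```

Because Phone agrees with Tel in every duplicate while Fax differs in two of them, Tel is assigned to Phone, and Fax remains unmatched.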
Matching complex schemata based on duplicates. Matching schemata
consisting of multiple tables raises several new challenges. The algorithm for
matching single tables assumes that the tables are semantically related and con-
tain duplicates. Those assumptions do not hold when comparing two arbitrary
tables in two schemata, and thus, might produce false correspondences. To solve
this problem, we have devised heuristics used to determine if a matching can
be trusted. Those heuristics are part of the schema matching algorithm, which
detects duplicates spanning multiple tables and extracts correspondences from
them. It has to be noted that the problem of such multi-table duplicates also
has not been considered before. The matching algorithm is designed to compare
only relevant tables to reduce computation cost.
Finding complex matchings. The problem of complex (i.e., m:n) matchings
has only recently been approached in research. The space of possible complex
matchings is much larger than the solution space of simple (i.e., 1:1)
matchings, and thus, efficiency is a major concern apart from the correctness
of the detected correspondences. We propose a novel solution to the problem
of complex matchings which is based on the simple matching detected by our
algorithm. The complex matching algorithm improves the matching by merging
certain attributes, and thus, producing complex correspondences. The perfor-
mance issue is tackled by only considering promising attribute combinations.
Our experimental evaluation shows that the proposed algorithm is able to de-
tect complex correspondences with high accuracy.
The DUMAS approach was first proposed in [Bil04]. The DUMAS table
matching algorithm, which includes duplicate detection in unaligned tables and
the extraction of simple correspondences from them, is described in [BN05].
The DUMAS table matcher has also been incorporated into the Humboldt
Merger (HumMer) [BBB+05]. HumMer is a semi-automatic data integration
and cleansing tool: Given two or several source tables, the tool first discovers
attribute correspondences. Based on those correspondences, the table schemata
are merged into a single integrated schema. After inserting the source data
into the integrated table, duplicates are detected and removed. The user can
interact within the different stages by altering the schema matching, guiding
the duplicate detection process, or defining conflict resolution rules.
Structure of this Thesis
[Figure: the DUMAS table matcher comprises duplicate detection (Chapter 4) and matching (Chapter 5); it is extended by the DUMAS schema matcher (Chapter 6) and the DUMAS complex matcher (Chapter 7).]
Figure 1.9: DUMAS algorithms and related chapters.
The DUMAS algorithms and the chapters that describe them are shown
in Fig. 1.9. Before specifying the DUMAS matchers, we discuss the problem of
schema matching in Chap. 2. We lay down the formal basis for the remainder of
the thesis and review related work on schema matching. In addition, we describe
how schema mappings are generated based on attribute correspondences.
Chap. 3 describes the DUMAS approach to schema matching. Because the
DUMAS matcher exploits duplicates, we discuss the duplicate detection problem
and review related work in that area.
Our solution to the problem of duplicate detection in unaligned tables is
presented in Chap. 4. We discuss the challenges associated with the problem,
describe an algorithm to detect the top-k duplicates, and we evaluate the quality
of the detected duplicates using both synthetic and real-world data. The process
to extract attribute correspondences from those duplicates is described in the
following Chap. 5. We show experimentally that the matching step produces a
good result even in the presence of errors.
The following two chapters are concerned with extensions of the DUMAS
table matcher. Chap. 6 describes the DUMAS schema matcher, which detects
multi-table duplicates and uses them to extract attribute correspondences be-
tween complex schemata consisting of several tables. We show that duplicate-
based schema matching in complex schemata cannot be done by mere applica-
tion of the DUMAS table matcher on each table combination, because several
underlying assumptions do not always hold. We describe under what conditions
a matching can be trusted and how known attribute correspondences can be
used to detect further attribute correspondences. The DUMAS schema matcher
uses these heuristics to detect correspondences in schemata involving multiple
tables.
The DUMAS complex matcher, which detects complex correspondences be-
tween two tables, is presented in Chap. 7. Finding complex matchings is a
challenging problem, which has not received much attention in the literature.
We propose an algorithm that is based on the DUMAS table matcher: The
simple matching is used as input, and attributes are merged if that improves
the matching. The experiments show that the resulting complex matching is of
high quality. We conclude with a discussion of the DUMAS schema matcher in
Chap. 8.
Chapter 2
The Schema Matching Problem
In the previous chapter, problems associated with translating data between heterogeneous
schemata were informally described. The goal of this chapter is to
lay down the formal foundation for the following chapters. We start by defining
the notation used throughout this thesis. As the proposed schema matching
solution is based on relational schemata, this chapter includes a description
of the necessary relational concepts. Afterwards, the problem of schema matching
is formally defined, and related work in that area is discussed. We also discuss
related theoretical and practical work on schema mapping.
2.1 Basic Relational Concepts
The schema matching approach described in this thesis is based on relational
schemata. Consequently, the notation for relational concepts used in this thesis
needs to be defined. Note that only basic relational concepts are required to
define schema matching; an in-depth examination of database theory can be
found in [AHV95].
A database schema or, simply, a schema, is a finite set
$\mathcal{R} = \{R_1, \ldots, R_k\}$ of relation schemata. A relation schema $R_i = \langle a_1, \ldots, a_l \rangle$
is a sequence of attributes. If the relation schema is clear from the context,
we omit the subscript $i$. Each attribute $a_j$ has an associated domain denoted by
$Dom(a_j)$ consisting of constants and the special null value, which we denote
by $\bot$. Furthermore, we define $att(\mathcal{R})$ and $att(R)$ to be the set of attributes in
the database schema $\mathcal{R}$ and the relation schema $R$, respectively.
A database instance $I(\mathcal{R}) = \{I(R_1), \ldots, I(R_k)\}$, i.e., an instance of a database
schema, is a set of instances of its constituent relation schemata. An instance
$I(R) = \{r_1, \ldots, r_m\}$ of a relation schema $R$, or relation for short, is a set of
tuples. A tuple $r_i \in Dom(a_1) \times \ldots \times Dom(a_l)$ over $R$ is a sequence of values
$\langle v_1, \ldots, v_l \rangle$, where the value $v_j$ for attribute $a_j$ is an element of $Dom(a_j)$.
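The relational notions above can be mirrored in a few lines of code. The following is only a minimal sketch under our own assumptions: the class names `RelationSchema` and `Relation`, the helper `att`, and the use of `None` for the null value are hypothetical and not part of the thesis.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class RelationSchema:
    """A relation schema: a named, ordered sequence of attributes."""
    name: str
    attributes: Tuple[str, ...]


@dataclass(frozen=True)
class Relation:
    """An instance of a relation schema: a set of value sequences."""
    schema: RelationSchema
    tuples: frozenset

    def __post_init__(self):
        # Every tuple must provide exactly one value per attribute.
        for t in self.tuples:
            if len(t) != len(self.schema.attributes):
                raise ValueError("tuple arity does not match the schema")


def att(database_schema):
    """Attributes of a database schema, qualified by relation name
    (an assumption made here so that equal labels stay distinct)."""
    return {(r.name, a) for r in database_schema for a in r.attributes}


customer = RelationSchema(
    "CUSTOMER", ("Id", "FirstName", "LastName", "BirthDay", "Address"))
client = RelationSchema(
    "CLIENT", ("Id", "Name", "Gender", "BirthYear", "Address"))

# None stands in for the null value in this sketch.
customers = Relation(
    customer, frozenset({("1", "John", "Doe", "1970-01-01", None)}))
```

Qualifying attributes by relation name is a design choice of this sketch; it keeps the two `Id` attributes apart when several relation schemata are collected in one database schema.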
2.2 Problem Description
2.2.1 Schema Matching and Attribute Correspondences
Although schema matching is applied in various application areas, the basic
problem can always be formulated intuitively as follows: Given a source schema
and a target schema, find the attribute correspondences that describe how both
schemata are related. One such attribute correspondence connects a set of source
attributes with a set of target attributes, and implies that data from the source
attributes is used to generate data in the target attributes.
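[Figure: source relation CUSTOMER (Id, FirstName, LastName, BirthDay, Address) and target relation CLIENT (Id, Name, Gender, BirthYear, Address), with attribute correspondences drawn as arrows.]
Figure 2.1: A schema matching example.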
An example is depicted in Fig. 2.1. Each of the two schemata consists of
a single relation containing information about customers. Attribute correspon-
dences are depicted as arrows between the source and the target schema. It
is important to note that the generation of target data can involve difficult
transformations. In the example, FirstName and LastName should be concatenated
to produce values for the attribute Name in the target schema, while
the year in the dates of birth in BirthDay needs to be extracted to form values
for BirthYear. The work described in this thesis is only interested in finding
correspondences between attributes and does not consider transformation
functions as part of the schema matching output.
Also note that the desired schema matching result is very subjective. In
the example, both relations contain an Id attribute, which represents an iden-
tifier that is unique within its relation. However, there is no correspondence
between the two attributes, i.e., the developer has chosen not to use identifiers
of the source relations to generate identifiers for the target table. There are
several possible reasons for this decision: The two databases could use different
numbering schemes for creating identifiers, or the source database uses simple
numbers while the target database uses ‘semantic’ identifiers based on other at-
tribute values. However, despite the differences, the developer could also have
chosen to use source identifiers in the target table as long as those values do not
violate constraints on the target.
This subjectivity of the desired schema matching result is one reason why a
schema matching can be established only semi-automatically. In all but the sim-
plest scenarios, the developer needs to step in, either by adjusting the matching
or by interacting with the schema matching program.
2.2.2 Formal Problem Description
As stated above, the desired schema matching is very subjective and might vary
in different scenarios. Thus, given two databases, it is impossible to define the
desired matching between them. However, we can define the structure of the
input and the output of a schema matching component.
The minimal input for the schema matching problem is a source schema
$\mathcal{R} = \{R_1, \ldots, R_k\}$ and a target schema $\mathcal{S} = \{S_1, \ldots, S_l\}$. The schemata are
always required because the output of the schema matching process is defined
in terms of the schemata’s attributes. Additional input, e.g., instances of the
schemata, metadata, or documentation, has been shown to improve the schema
matching result. As described in Sec. 2.3, different schema matching solutions
exploit different kinds of input.
Given a source schema $\mathcal{R}$ and a target schema $\mathcal{S}$, the output is a schema
matching between $\mathcal{R}$ and $\mathcal{S}$. A schema matching
$M = \{(a, b) \mid a \in 2^{att(\mathcal{R})} \wedge b \in 2^{att(\mathcal{S})}\}$, or matching, is a set of attribute correspondences $(a, b)$, such that $a$ is
a set of attributes in the source schema (i.e., it is a set of source attributes), $b$
is a set of attributes in the target schema (i.e., it is a set of target attributes),
and values of $a$ are used to generate values of $b$ in the desired mapping. We
say that a group of attributes $a$ matches attributes $b$ if a correspondence $(a, b)$
is part of the matching. We denote the number of attributes in $a$ by $|a|$. The
process of creating such a matching is called schema matching, too.
Note that other definitions of the schema matching output are also conceiv-
able. Structure-level matching considers correspondences both between atomic
schema elements (e.g., attributes in a relational schema) and between higher-
level elements (e.g., relations) [RB01]. This is of particular interest in the case
where the data model allows multiple levels of structuring. For instance, besides
matching leaf elements of two XML schemata, correspondences between inter-
mediate XML nodes can be of interest, too. However, as our approach works
on relational schemata, and because the mapping generation methods described
in Sec. 2.4.2 only use correspondences between atomic schema elements, we do
not consider structure-level matching.
Matchings can be classified based on their cardinality (Table 2.1). Simple
matchings are 1:1 matchings, i.e., each attribute has zero or one corresponding
partner. Thus, the attribute correspondences contained in the matching are
constrained to pairs of singleton attribute groups. This restriction does not
hold for complex matchings: In an m:n matching, each attribute can be part
of an attribute group that corresponds to any number of other attributes. The
group of 1:n matchings also falls into the category of complex matchings, but
has the constraint that a group of target attributes corresponds to at most one
source attribute. Note that n:1 matchings are defined accordingly, and we use
the term 1:n matching when direction is irrelevant. Analogous to the above
definitions, a correspondence $(a, b)$ is a simple correspondence if both $a$ and $b$
are singletons, or a complex correspondence if $a$ or $b$ contains more than one
attribute.
Matching class     Cardinality   Constraint on M
Simple matching    1:1           $\forall (a, b) \in M : |a| = 1 \wedge |b| = 1$
Complex matching   1:n           $\forall (a, b) \in M : |a| = 1$
                   n:1           $\forall (a, b) \in M : |b| = 1$
                   m:n           none
Table 2.1: Classification of schema matchings based on their cardinality
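The classification by cardinality can be checked mechanically. The sketch below is an illustration under our own assumptions: the function name `classify` and the representation of a correspondence as a pair of frozensets are hypothetical. The example matching follows the description of Fig. 2.1 in the text; the Address correspondence is assumed, since the arrows of the figure are not reproduced here.

```python
def classify(matching):
    """Return the cardinality class of a matching.

    matching: iterable of correspondences (a, b), where a is a frozenset
    of source attributes and b a frozenset of target attributes.
    """
    pairs = list(matching)
    source_singletons = all(len(a) == 1 for a, _ in pairs)
    target_singletons = all(len(b) == 1 for _, b in pairs)
    if source_singletons and target_singletons:
        return "1:1"
    if source_singletons:
        return "1:n"
    if target_singletons:
        return "n:1"
    return "m:n"


example = [
    (frozenset({"FirstName", "LastName"}), frozenset({"Name"})),
    (frozenset({"BirthDay"}), frozenset({"BirthYear"})),
    (frozenset({"Address"}), frozenset({"Address"})),  # assumed arrow
]
print(classify(example))
```

Because two source attributes feed a single target attribute, the example is an n:1 matching, i.e., a complex matching.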
2.3 An Overview of Schema Matching Approaches
Finding attribute correspondences for schema mapping or for schema
integration is a tedious and error-prone process that is mostly done manu-
ally [LNE89]. Semi-automatic tools that support the user can reduce devel-
opment time and cost. The schema matching problem has received increased
attention in the past decade, and various solutions have been proposed by the
research community. In the following, we classify different approaches and de-
scribe their relationship to our algorithms.
2.3.1 Classification of Schema Matching Approaches
This classification of schema matching approaches is based on the excellent
survey by Rahm and Bernstein [RB01]. They define several, largely orthogonal
criteria, which they use to describe schema matching algorithms. In this thesis,
we restrict ourselves to two criteria, which were used in [RB01], and add a third
classification criterion for instance-based matchers (Fig. 2.2):
1. Individual matcher vs. combining matchers,
2. Schema-based vs. instance-based (applicable to individual matchers), and
3. Vertical vs. horizontal (applicable to instance-based matchers).
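[Figure: classification tree. Schema matching approaches divide into individual matchers and combining matchers; individual matchers into schema-based and instance-based matchers; instance-based matchers into vertical and horizontal matchers.]
Figure 2.2: A classification of schema matching approaches.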
Individual matchers exploit only one type of information, e.g., schema or
instances, to detect attribute correspondences. One advantage of this work is
the opportunity to identify potentials and restrictions of using a specific type
of information for schema matching. Although they achieve reasonably good
results, it is generally agreed that the best results can be achieved by combining
matchers, which will be discussed in Sec. 2.3.5.
Individual matchers can be further organized into schema-based and instance-
based matchers. Schema-based matchers only work on available metadata, such
as attribute labels, data type, or schema structure (see Sec. 2.3.2). Instance-
based matchers exploit available data (Sec. 2.3.3). In addition to the criteria
defined in [RB01], we further distinguish instance-based matchers which perform
vertical matching from those which perform horizontal matching. Because our
duplicate-based approach falls into the category of horizontal matchers, related
work in that area is discussed in detail in Sec. 2.3.4.
Before describing schema matching algorithms, we point out that most
solutions have been tested with small data sets. However, scaling schema
matching to large schemata has received additional attention in the recent
past [BMPQ04, RDM04].
In the following, we give a brief overview of related work in schema match-
ing. We also include work in the area of web interface matching, which is a
challenging problem related to the integration of information sources on the
World Wide Web. The goal of web interface matching is to find correspon-
dences between fields of query interfaces. Such fields are usually text input
areas, but can also include a predefined list of possible values or radio buttons.
Thus, web interface matching is similar to schema matching, but instead
of attribute correspondences the goal is to find correspondences between web
interface elements.
2.3.2 Schema-based Matchers
As the name suggests, schema-based matchers exploit available schema informa-
tion, which can be extracted from a database’s metadata repository [MBR01,
MGMR02, MZ98, BCV99]. The general approach of schema-based matchers is
to compare schema elements (i.e., attributes) using a similarity measure, and
to regard attribute pairs whose similarity is above a given threshold as corre-
sponding. The similarity measure can take into consideration different kinds
of metadata: It could incorporate similarity of attribute labels, data type, etc.
More advanced approaches also apply external, ‘semantic’ metadata to improve
the schema matching. For instance, MOMIS uses the WordNet thesaurus, which
relates terms through synonym and broader term / narrower term relationships,
to identify correspondences between attributes whose names are lexicographically
different, but whose semantics are assumed to be similar [BCV99].
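The general approach of comparing schema elements with a similarity measure and a threshold can be sketched as follows. This is an illustration only, not the method of any of the cited systems: the similarity measure (the ratio of Python's `difflib.SequenceMatcher` on lower-cased labels), the function names, and the threshold of 0.8 are all assumptions.

```python
from difflib import SequenceMatcher


def label_similarity(a, b):
    """String similarity of two attribute labels in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_by_label(source, target, threshold=0.8):
    """Attribute pairs whose label similarity reaches the threshold."""
    return [(s, t) for s in source for t in target
            if label_similarity(s, t) >= threshold]


source = ["Id", "FirstName", "LastName", "BirthDay", "Address"]
target = ["Id", "Name", "Gender", "BirthYear", "Address"]
pairs = match_by_label(source, target)
print(pairs)
```

On the attributes of Fig. 2.1, identical labels such as Address are paired, but so are the two Id attributes, which the developer in that example chose not to relate. A label-only matcher cannot see that difference.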
For structure-level matching, as described in Sec. 2.2.2, where in addi-
tion to atomic schema elements higher-level elements are to be matched, tak-
ing the structure of the schema into consideration can greatly improve the
matching result. Matching algorithms like Cupid [MBR01] or Similarity Flood-
ing [MGMR02] are based on the following intuition: If two elements are con-
sidered similar, their neighbors (i.e., connected schema elements) are also likely
to be similar. In the example in Fig. 2.1, we can identify many corresponding
attributes. This indicates a correspondence on the upper level, i.e., between the
tables CUSTOMER and CLIENT , although their names are very dissimilar.
Most web interface matchers are also schema-based, because the data is hid-
den behind the web interface and can only be accessed by querying. Thus,
methods that apply the above ideas have also been proposed in the context
of web data integration [HC03, HCCH04]. In contrast to the work on schema
matching, which deals with only two or a few schemata, integration of web data
sources is targeted at hundreds or even thousands of sources. On the one hand,
this requires the matching algorithms to be much more scalable. But on the
other hand, the large number of interfaces facilitates the application of data
mining algorithms. For example, DCM matches schemata holistically: The algorithm
does not compare two schemata at a time, but all of them at once [WDM04]. Two
attributes or groups of attributes are assumed to be synonyms, and thus cor-
responding, if they never appear together in the same schema. Attributes that
co-occur in many schemata are assumed to form a group. The authors developed
a correlation measure that takes into consideration the specific characteristics
of query interfaces on the Web, and used it to mine complex matchings from a
corpus of web interfaces.
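The co-occurrence intuition behind holistic matching can be illustrated with simple counting. The sketch below is deliberately simplified and does not implement DCM's correlation measure: the function name, the `min_support` parameter, and the toy interfaces are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def synonym_candidates(interfaces, min_support=2):
    """Pairs of frequent attributes that never appear in the same interface."""
    support = Counter(a for schema in interfaces for a in set(schema))
    together = Counter()
    for schema in interfaces:
        for pair in combinations(sorted(set(schema)), 2):
            together[pair] += 1
    frequent = sorted(a for a, n in support.items() if n >= min_support)
    # Pairs are built from sorted names, so they line up with the keys above.
    return [pair for pair in combinations(frequent, 2) if together[pair] == 0]


interfaces = [
    {"author", "title"},
    {"writer", "title"},
    {"author", "title", "isbn"},
    {"writer", "isbn"},
]
print(synonym_candidates(interfaces))
```

In this toy corpus, author and writer each occur twice but never together, so they are reported as a synonym candidate. Plain counting of this kind is fragile on small corpora, which is one reason a dedicated correlation measure is needed in practice.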
Schema-based approaches work if attributes have names that unambiguously
define their semantics. Unfortunately, this is not always the case. In general,
the database developer has several options how to design the database schema.
Thus, if two schemata have been developed independently, most likely different
design decisions have been made. Besides different ways to structure information,
attributes can be given different names. A common problem that has to be overcome
by schema-based matchers is the occurrence of homonyms and synonyms:
Semantically different attributes can be given the same name (homonym), while
corresponding attributes may have different names (synonym). As discussed
above, external metadata like ontologies can be used to tackle this problem, but
in many cases attribute names do not follow any standards.
2.3.3 Instance-based Matchers
In cases where schema-based approaches fail, exploiting actual data instead of
or in addition to metadata can result in a better matching. Most instance-based
matchers perform vertical matching: They extract features from each attribute
in isolation. These features act as a description of the attribute’s semantics.
In other words, the description of an attribute is extracted automatically
rather than specified manually as an attribute name. Consequently, attributes
with similar features are matched. We discuss such matchers in the following.
Another group of matchers performs horizontal matching: They try to detect
duplicates, i.e., different tuples that represent the same real-world entity, and
match attributes using the detected duplicates. Such algorithms are discussed
in Sec. 2.3.4.
Examples of vertical matchers are Automatch [BM02], Clio [NHT+02],
SEMINT [LC94, LC00], and the content matcher of LSD [DDH01, DDH03],
which we call CM in the following. Those solutions consider schema matching
as a classification problem: The target attributes are considered as classes, to
which each source attribute is to be assigned. To classify source attributes, they
employ supervised learning methods. Thus, they start with a training phase,
in which a model for the classification problem is learned. To do so, they use data
from previously matched sources. In the following matching phase, additional
sources can be matched to the target schema.
SEMINT extracts 20 features from each attribute, e.g., data length, standard
deviation, and average value. These features are used to train a neural network.
After the training phase, the neural network can be used to match other sources
to the same target schema. CM employs a naïve Bayes learner that is trained
on the actual attribute values. After the training phase, CM gives a high
similarity score to attributes that have similar values.
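The idea of describing each attribute in isolation by features of its values can be sketched as follows. This is not SEMINT, which extracts 20 features and trains a neural network; the three features chosen here, the Euclidean distance, and all names and sample values are assumptions made for illustration.

```python
from statistics import mean, pstdev


def features(values):
    """Describe a column by (mean length, std. dev. of length, digit ratio)."""
    lengths = [len(v) for v in values]
    digits = sum(ch.isdigit() for v in values for ch in v)
    total = sum(lengths) or 1  # avoid division by zero for empty strings
    return (mean(lengths), pstdev(lengths), digits / total)


def feature_distance(f, g):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(f, g)) ** 0.5


zip_source = ["10115", "80331"]
zip_target = ["12047", "50667"]
names = ["Alice", "Bob"]

print(feature_distance(features(zip_source), features(zip_target)))
print(feature_distance(features(zip_source), features(names)))
```

The two postal code columns share no value, yet their feature vectors coincide, whereas the name column lies far away. Attributes are thus matched on what their values look like, independently of their labels.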
Instance-based schema matchers that employ vertical matching as described
above are not affected by different names of schema elements. However, a prob-
lem similar to homonyms or synonyms in attribute names can occur in the
attribute descriptions that are extracted from the data. An example of the
homonym case is depicted in Fig. 2.3. The figure contains two tables describing
persons by name, place of birth, and place of residence in R and by name and
place of birth in S. Birthplace and place of residence both contain names of
places. Thus, the features extracted from them will look alike, although their
semantics is different. The synonym problem occurs if the descriptions of two
attributes are different although they correspond. Such a case can occur if the
same information is represented in different ways: For example, the gender of a person
might be abbreviated as “m” or “f” in one database, while it is represented as
“0” and “1” in another database.
In addition to synonyms and homonyms, the instance-based matchers de-
scribed here suffer from a problem related to their underlying learning approach,
namely feature selection. Deciding which features to extract from the underlying
data is a problem that has been extensively studied in machine learning [GE03].
Each of the above-mentioned algorithms uses a different set of features, and
it seems highly unlikely that a ‘best’ set of features that produces an optimal
schema matching in varying scenarios will be devised in the future.
R
Name             Birthplace   Residence
John Doe         London       Berlin
Max Mustermann   Berlin       NYC
Sam Adams        NYC          London
...
S
Name         POB
John Doe     London
Suzy Klein   Berlin
Sam Adams    NYC
...
Figure 2.3: An example for columns with similar features but different semantics.
We emphasize that different instance-based matching approaches that do
not fit into our classification hierarchy are conceivable. For example, the matcher described
by Kang and Naughton [KN03] neither learns attribute characteristics nor
compares duplicate tuples. Instead, it determines fuzzy dependencies between
attributes within a table using various information-theoretic tools. The result
is a dependency graph for each table. Attribute correspondences are derived by
comparing the dependency graphs of the tables that are to be matched.
2.3.4 Duplicate-based Matching Approaches
As will be shown in Chap. 3, instance-based matchers that perform horizon-
tal matching are potentially able to distinguish semantically different attributes
that have similar features. The basic idea of duplicate-based matchers is to
search for fuzzy duplicates and use those duplicates to infer attribute corre-
spondences. We are aware of three schema matchers that exploit duplicates to
some extent.
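The difference between the column-wise and the tuple-wise view can be made concrete with the data of Fig. 2.3. The sketch below rests on strong simplifying assumptions that are not part of the approach described in this thesis: duplicates are taken to be tuples with exactly equal names, and values are compared by exact equality rather than by a similarity function. All function names are hypothetical.

```python
R = [
    {"Name": "John Doe", "Birthplace": "London", "Residence": "Berlin"},
    {"Name": "Max Mustermann", "Birthplace": "Berlin", "Residence": "NYC"},
    {"Name": "Sam Adams", "Birthplace": "NYC", "Residence": "London"},
]
S = [
    {"Name": "John Doe", "POB": "London"},
    {"Name": "Suzy Klein", "POB": "Berlin"},
    {"Name": "Sam Adams", "POB": "NYC"},
]


def vertical_score(rows_r, attr_r, rows_s, attr_s):
    """Column-wise view: Jaccard overlap of the two columns' value sets."""
    x = {r[attr_r] for r in rows_r}
    y = {s[attr_s] for s in rows_s}
    return len(x & y) / len(x | y)


def horizontal_score(duplicates, attr_r, attr_s):
    """Tuple-wise view: fraction of duplicate pairs with equal values."""
    return sum(r[attr_r] == s[attr_s] for r, s in duplicates) / len(duplicates)


# Assumption: duplicates are identified by exact name equality.
duplicates = [(r, s) for r in R for s in S if r["Name"] == s["Name"]]

for attr in ("Birthplace", "Residence"):
    print(attr,
          vertical_score(R, attr, S, "POB"),
          horizontal_score(duplicates, attr, "POB"))
```

Column-wise, both Birthplace and Residence overlap completely with POB. Within the two duplicate pairs, however, only Birthplace agrees with POB, so the tuple-wise comparison separates the two attributes.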
The Internet Learning Agent (ILA) can be considered a schema matcher
although it is targeted towards information sources on the Web [PE95]. Unlike
web interface matchers, which establish correspondences between query inter-
faces, ILA’s goal is to establish correspondences between its internal model of
the application domain and the model of an information source on the Internet.
The model of the information source includes all information that is contained
in a Web page that is returned as the result of a query. To establish a match-
ing, ILA chooses objects from its internal model and queries the source using a
keyword query interface. The returned results are used to learn the semantics of
the source’s attributes. This process is based on the correspondence heuristic,
which has two components: (i) if an attribute value of the internal object is
equal to a value of the result, the two attributes are assumed to be related, and
(ii) the meaning of an attribute is consistent across all individuals.
One drawback of this approach is related to the fact that it is a web-based
tool: Because the source’s data is hidden behind an interface, the detection of
duplicates becomes a matter of sending keyword queries. To achieve a reason-
ably good result, the internal model and the source’s model must have a large
overlap in the real-world objects that they represent. In addition, the algorithm
only considers equal values as a match, which is too restrictive in heterogeneous
environments, where data is dirty or differently represented.
The matcher iMap combines multiple individual matchers [DLD+04] (see
Sec. 2.3.5). In cases where an extensional overlap exists, it uses special ‘overlap
modules’. These modules assume that duplicate tuples are provided by the user,
and thus, completely ignore the problem of duplicate detection. In most cases,
the overlap module simply uses its non-overlap counterpart to find an initial
matching, then re-evaluates the matches using the duplicates.
Horizontal matching according to Chua et al.
Chua et al. propose a matching algorithm that is most closely related to our
solution, and thus, we discuss their approach in more detail [CCL03]. The
algorithm produces a set of complex correspondences between two tables in
three phases:
1. Classification of attributes and formation of attribute groups: Each at-
tribute is heuristically assigned to a predefined domain class. Based on
this classification, attributes are grouped together.
2. Measurement of correspondence score: Attribute groups are compared
with each other and a correspondence score, which reflects the similar-
ity of the attribute groups, is computed.
3. Matching attribute groups: Source attribute groups are uniquely assigned
to target attribute groups. Each assignment represents a complex corre-
spondence.
Domain classes are organized into three domain hierarchies: KEY, STRING,
and CODED. The KEY hierarchy includes the classes CANDIDATE KEY and FOREIGN
KEY. The STRING hierarchy is used to classify alphanumeric attributes.
It contains the classes ALPHABETIC, ZERONINE, and MIXED. The CODED
hierarchy organizes mathematical/statistical properties. Example classes are
ORDINAL (representing values that can be ordered), NOMINAL (values with
a nominal scale), and DATE.
In the first phase of the algorithm, attributes are classified by assigning
each attribute a domain class. This process is done automatically using various
techniques: E.g., key attributes are determined by accessing the data dictionary,
while coded domain classes require the application of heuristic rules and data
analysis. Afterwards, attributes are grouped such that a valid domain class can
be assigned to the group: E.g., an attribute with CODED domain class can
only be assigned to a group if all attributes of the group have CODED class.
Correspondence scores between source and target attribute groups that re-
flect their similarity are computed in the second phase. Note that only attribute
groups with a class from the same hierarchy are compared. The similarity func-
tion used to determine the score depends on the domain class: E.g., STRING
attributes are compared using normalized edit distance (see Sec. 4.3.2). CAN-
DIDATE KEY attributes are given a high score if their values in the duplicate
tuples match, while FOREIGN KEY attributes require the use of Goodman and
Kruskal’s Lambda to determine a score. Twelve statistical similarity functions
are defined for CODED values. Which function is applied depends on the con-
crete classes of the attribute groups. The result of the second phase is a matrix
for each class hierarchy that describes the similarity of source and target at-
tribute groups in that hierarchy.
In the third phase of the algorithm, attribute groups are assigned to each
other such that each attribute group has at most one matching attribute group.
The Hungarian algorithm is used to compute an assignment that maximizes the
sum of the correspondence scores. The resulting alignment of attribute groups
represents the matching between the two tables.
While the algorithm has been shown to be successful in the experimental evaluation,
it has several drawbacks. Firstly, the duplicate detection problem is
largely ignored. The authors assume that attributes that uniquely describe
the real-world identity of the entity represented by a tuple exist in both tables
and have been manually matched. Without the correspondence between identi-
fying attributes, the matcher is unable to detect duplicates. We point out that
in many scenarios, there is no key attribute that describes the identity across
databases. Secondly, the algorithm is restricted to single tables, which reduces
its applicability in real-world scenarios. We believe that matching single ta-
bles is a relevant problem, but the majority of matching tasks involves complex
schemata with several tables. Thirdly, the first phase of the algorithm creates
overlapping groups, which can lead to unintended results in the third phase.
To illustrate the third problem, assume a table with attributes for contact
(C1), which contains the name and address of a person, and signature (S), and
another table with an attribute for contact (C2) and an attribute representing
the degree of the person (D). Note that the respective contact attributes contain
strings that are much longer than the values in S and D. All of those attributes
have a domain class in the STRING hierarchy, and thus, can be combined. The
matrix in Fig. 2.4 shows the resulting groups.
          {C2}   {D}   {C2, D}
{C1}      0.9    0.0   0.8
{S}       0.0    0.0   0.0
{C1, S}   0.8    0.0   0.8
Figure 2.4: The problem of overlapping attribute groups.
The matrix values represent correspondence scores. The bold numbers indi-
cate the correspondences as determined by the Hungarian algorithm whose score
is above the threshold of 0.7. The actual correspondence scores are fictitious,
but closely reflect similarity scores determined by normalized edit distance: The
score for {C1} and {C2} is 0.9, because the values perfectly match in most duplicates.
In contrast, the correspondence scores involving S are 0 because S does
not have a matching partner. However, one notes that {C1, S} and {C2, D}
have a very high correspondence score that is above the threshold although S
and D are unrelated. These two attribute groups have a high score because the
contact values are much longer strings than the signature and degree, and thus,
the similarity of C1 and C2 has a larger effect on the score than the dissimilarity
of S and D.
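The dominance of the long contact strings can be reproduced with a small computation. The sketch assumes that normalized edit distance is the Levenshtein distance divided by the length of the longer string, and that the values of an attribute group are concatenated with a space; both are assumptions for illustration, as are the sample values.

```python
def edit_distance(s, t):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(s, t):
    """1 minus the edit distance normalized by the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


c1 = "John Doe, 12 Baker Street, London"  # contact in the first table
c2 = "John Doe, 12 Baker Street, London"  # contact in the second table
signature = "jd"
degree = "PhD"

print(edit_similarity(signature, degree))
print(edit_similarity(c1 + " " + signature, c2 + " " + degree))
```

The signature and the degree share no character and obtain similarity 0, but appending them to the long, identical contact values changes only a few characters of the combined strings, so the group similarity stays well above the threshold of 0.7.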
The underlying problem is that attribute groups overlap. The Hungarian al-
gorithm does not take that into consideration: Its goal is to find a best matching
between two sets of elements (i.e., attribute groups). The structure of those el-
ements is irrelevant. In Chap. 7 we present an alternative matching approach
that avoids the problem of overlapping attribute groups.
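The effect of the assignment step on the matrix of Fig. 2.4 can be replayed in a few lines. The sketch does not implement the Hungarian algorithm; it enumerates all one-to-one assignments, which yields the same optimum for such a small square matrix but does not scale. The function name is hypothetical.

```python
from itertools import permutations


def best_assignment(scores):
    """Column index per row that maximizes the sum of scores (brute force)."""
    n = len(scores)
    return max(permutations(range(n)),
               key=lambda cols: sum(scores[r][cols[r]] for r in range(n)))


source_groups = ["{C1}", "{S}", "{C1, S}"]
target_groups = ["{C2}", "{D}", "{C2, D}"]
scores = [
    [0.9, 0.0, 0.8],
    [0.0, 0.0, 0.0],
    [0.8, 0.0, 0.8],
]

assignment = best_assignment(scores)
accepted = [(source_groups[r], target_groups[c])
            for r, c in enumerate(assignment)
            if scores[r][c] > 0.7]  # threshold from the text
print(accepted)
```

The optimal assignment keeps both {C1} with {C2} and {C1, S} with {C2, D}. Since the assignment treats the groups as opaque elements, nothing prevents C1 from appearing in two accepted correspondences.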
2.3.5 Combining Multiple Matchers
The previous sections indicate that schema matching solutions have to overcome
two general problems:
1. Homonyms: Semantically different attributes are described similarly. Ho-
monyms can result in false correspondences.
2. Synonyms: Corresponding attributes are described differently. Synonyms
can lead to missed correspondences.
The “description” of an attribute can be produced manually as metadata or
be extracted from available data. Because the above two problems occur in both
types of input, it is generally agreed that no single indicator, data or metadata,
can produce an optimal result. Thus, several types of information or several
schema matchers should be used in the matching process. Hybrid matchers
exploit different matching approaches in a single algorithm. This approach has
the advantage that its constituent matchers can cooperate closely, and thus,
produce a result very quickly. A disadvantage is its inflexibility: One cannot
add another matching approach without rewriting the hybrid algorithm.
In contrast, composite matchers run each of their constituent matchers in
isolation and merge their results to produce a final matching. Thus, it is pos-
sible to add additional matchers if required. In the following, we describe the
general architecture of a composite matcher, which is similar to LSD [DDH01],
COMA [DR02], COMA++ [ADMR05], and the approach by Madhavan et al. [MBDH05].
The composite matcher architecture has three major components: a set of
base matchers, a match combiner, and a constraint handler (Fig. 2.5). The
matching process starts by extracting information from the source that is rel-
evant to the base matchers, e.g., schema information, instance statistics, etc.
The base matchers process the type of information they require to produce
their output, which is a similarity matrix containing similarity scores for each
attribute pair¹. The similarity matrices are then combined by the match combiner,
which produces a single similarity matrix. This matrix is then processed
by the constraint handler to produce a final matching. The constraints applied
in this step can be domain-independent (e.g., the cardinality of the matching
can be restricted to 1:1) or domain-dependent (e.g., each person only has one
name).
¹ If a given schema matcher does not produce a similarity matrix but a set of
correspondences, these correspondences can be used to create a matrix by setting
scores for corresponding attributes to 1 and other scores to 0.
[Figure content: three base matchers feed the match combiner, whose output is
passed to the constraint handler.]
Figure 2.5: Architecture of a composite matcher.
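The interplay of the three components can be sketched in Python as follows. The sketch makes several assumptions that are not prescribed by the architecture: similarity matrices are dictionaries keyed by attribute pairs, the match combiner averages the scores, and the constraint handler enforces a 1:1 cardinality greedily. The base matcher outputs and the threshold are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]
Matrix = Dict[Pair, float]


def combine(matrices: List[Matrix]) -> Matrix:
    """Match combiner: average the scores of all base matchers (one possible choice)."""
    pairs = set().union(*matrices)
    return {p: sum(m.get(p, 0.0) for m in matrices) / len(matrices) for p in pairs}


def enforce_one_to_one(matrix: Matrix, threshold: float) -> List[Pair]:
    """Constraint handler: greedily pick the best pairs so that each attribute is used once."""
    used_source, used_target, result = set(), set(), []
    # Sort by descending score; the pair itself breaks ties deterministically.
    for (src, tgt), score in sorted(matrix.items(), key=lambda kv: (-kv[1], kv[0])):
        if score < threshold:
            break
        if src not in used_source and tgt not in used_target:
            result.append((src, tgt))
            used_source.add(src)
            used_target.add(tgt)
    return result


# Hypothetical outputs of two base matchers.
name_matcher = {("Phone", "Tel"): 0.25, ("Fax", "Tel"): 0.25, ("LastName", "LN"): 0.5}
instance_matcher = {("Phone", "Tel"): 1.0, ("Fax", "Tel"): 0.5, ("LastName", "LN"): 1.0}

combined = combine([name_matcher, instance_matcher])
matching = enforce_one_to_one(combined, threshold=0.5)
print(matching)
```

Because each base matcher only contributes a matrix, a further matcher can be added by appending its matrix to the list passed to `combine`.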
2.4 Schema Mappings
The work described in this thesis is concerned with schema matching, i.e., the
detection of attribute correspondences. However, to overcome schematic het-
erogeneity, those attribute correspondences have to be used to create schema
mappings. In the following, we will discuss schema mappings and describe query
discovery.
2.4.1 What is a Schema Mapping?
As pointed out in the previous chapter, schema mappings are required in various
applications. Schema mappings have been examined both from a theoretical
and a practical perspective. A recent survey of theoretical work on schema
mappings can be found in [Kol05]. In short, schema mappings are a form of
tuple generating dependency between a source and a target schema. Languages
for specifying such constraints and their properties are an active research area.
However, in this thesis we resort to a more practical definition of mappings:
A schema mapping is a declarative specification of a data transformation. In
other words, a mapping describes how data conforming to the source schema
can be transformed such that it satisfies the constraints of the target schema.
Although a data transformation can also be achieved by a program, we only
consider declarative specifications here. Such specifications are mostly described
as a query, whose language depends on the data model being used, e.g., MSL
descriptions in TSIMMIS [GMPQ+97] or conjunctive queries in Information
Manifold [LRO96b].
Recall from Chap. 1 that schema mappings can be semi-automatically created
in a two-step process: (i) schema matching and (ii) query discovery. The
schema matching problem and related work in that area are described above. In
Sec. 2.4.2 we will examine query discovery solutions, which have been developed
in the Clio project [HMH01].
2.4.2 Schema Mapping Generation
Early work on query discovery by Miller et al. considered the detection of map-
pings between relational schemata [MHH00]. They deviated slightly from the
above two-step process by using value correspondences (as opposed to attribute
correspondences) as input. A value correspondence $v_i$ relates one or several
source attributes with one target attribute (i.e., only 1:1 and n:1 relationships
are considered). In contrast to attribute correspondences, it also contains a
function $f_i$, which describes how values are to be transformed, and a filter
describing which source values should be used. The query discovery process works
in four steps, which are depicted in Fig. 2.6.
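[Figure content: the input value correspondences f1, f2, f3 relate attributes of
the source relations S1 and S2 to the target. The process steps are Group Value
Correspondences (yielding potential sets), Select Candidate Sets (yielding
candidate sets), Rank all Covers (yielding a cover), and Generate Query
(yielding the SQL query).]
Figure 2.6: The query discovery process (Source: [MHH00]).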
The goal of the query discovery algorithm is to construct a query for each
target relation. If more than one target relation exists, the algorithm is invoked
for each target relation. In the first phase of the query discovery process,
potential candidate sets $c_i$ are created, each of which contains a subset of the
set of value correspondences V such that there is at most one correspondence for
each target attribute in the candidate set. Each potential candidate set represents
one possible way of mapping the attributes in target relation T. These potential
candidate sets do not need to be complete, i.e., a single set does not need to
contain a correspondence for every target attribute. The example in Fig. 2.6
has potential candidate sets {v1, v2}, {v2, v3}, {v1}, {v2}, and {v3}.
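The first phase can be illustrated with a short Python sketch that enumerates all non-empty subsets of the value correspondences containing at most one correspondence per target attribute. The assignment of target attributes (v1 and v3 map to C, v2 maps to D) is an assumption read off the query given at the end of this section; the function name is hypothetical, and the enumeration is exhaustive rather than the procedure of [MHH00].

```python
from itertools import combinations

# Assumption: target attribute of each value correspondence in the example.
target_of = {"v1": "C", "v2": "D", "v3": "C"}


def potential_candidate_sets(target_of):
    """Enumerate subsets with at most one correspondence per target attribute."""
    names = sorted(target_of)
    result = []
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            targets = [target_of[v] for v in subset]
            if len(targets) == len(set(targets)):   # no target attribute used twice
                result.append(set(subset))
    return result


sets_found = potential_candidate_sets(target_of)
print(sets_found)
```

The subset {v1, v3} is rejected because both correspondences populate the same target attribute, which leaves exactly the five sets listed above.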
In the second phase, potential candidate sets that cannot be mapped into
a good query are pruned. In particular, a potential candidate set cannot be
mapped into a query if its correspondences involve multiple source relations,
and no join path connecting those source relations exists. Various techniques are
used to determine meaningful join paths, e.g., exploiting the data dictionary or
schema discovery. The remaining sets are called candidate sets. In the example,
we assume that source tables S1 and S2 can be joined, and thus, all potential
candidate sets are candidate sets.
The goal of the third phase is to find a cover Γ, which is a set of candidate
sets that cover all value correspondences, i.e., each value correspondence must
appear in at least one candidate set. In the above example, possible covers
include Γ1={{v1},{v2, v3}} and Γ2={{v1, v2},{v2, v3}} since all defined
value correspondences appear at least once. If there are several possible covers,
Clio prefers the one which contains fewer candidate sets, i.e., one that produces
a simpler mapping. In the case of a tie, covers that produce fewer null values
in the target are preferred.
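The search for covers in the running example can be sketched in Python as follows. The sketch enumerates all covers exhaustively and applies only the first ranking criterion (fewest candidate sets); the tie-breaking by null values is not modeled because it depends on the data. The function name is hypothetical.

```python
from itertools import combinations

correspondences = {"v1", "v2", "v3"}
candidate_sets = [frozenset(s) for s in
                  ({"v1", "v2"}, {"v2", "v3"}, {"v1"}, {"v2"}, {"v3"})]


def covers(candidate_sets, correspondences):
    """Return every set of candidate sets whose union contains all correspondences."""
    found = []
    for size in range(1, len(candidate_sets) + 1):
        for combo in combinations(candidate_sets, size):
            if set().union(*combo) == correspondences:
                found.append(set(combo))
    return found


all_covers = covers(candidate_sets, correspondences)
smallest = min(len(c) for c in all_covers)
preferred = [c for c in all_covers if len(c) == smallest]

gamma_1 = {frozenset({"v1"}), frozenset({"v2", "v3"})}
gamma_2 = {frozenset({"v1", "v2"}), frozenset({"v2", "v3"})}
print(len(preferred), gamma_1 in preferred, gamma_2 in preferred)
```

In this example three covers consist of two candidate sets each, so the second criterion would be needed to choose among them.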
In the final step, a query is constructed based on the preferred cover. Because
every candidate set in the cover represents a possible mapping, the produced
query is the union of the queries described by the candidate sets in the cover.
Given that Γ2 is chosen as the best cover, the resulting query is:
CREATE VIEW T(C,D) AS
SELECT f1(S1.A), f2(S2.A)
FROM S1,S2
WHERE S1.K = S2.FK
UNION
SELECT f3(S2.B), f2(S2.A)
FROM S2
Miller et al. also present an incremental query discovery algorithm, which
is not shown here. As described above, query discovery is a semi-automatic
process: The user can interact by adding correspondences, running the algo-
rithm, and then adding or removing correspondences again. The incremental
algorithm takes user interaction into consideration and only considers those cor-
respondences that have been added or removed by the user after the previous
run of the algorithm. As a result, the performance of the algorithm increases
because previous decisions are reused.
In subsequent work, Popa et al. developed an algorithm for query discovery
based on the nested relational model, which is applicable both for relational and
XML data [PVM+02]. In contrast to Miller et al., they use attribute correspon-
dences as described in Sec. 2.2 as input. Consequently, they are able to include
any existing schema matcher into their framework as long as the matcher pro-
duces attribute correspondences. They also consider constraints on the target
schema, which were ignored in the work by Miller et al. Thus, they try to
interpret the attribute correspondences in a way that is consistent with the se-
mantics of both the source and the target schema. In a process called semantic
translation, the semantics encoded in the schemata is used to construct a logical
mapping. This step involves grouping of correspondences such that they do not
violate any constraints. The second phase is data translation, where issues of
generating data for target attributes that do not participate in a correspondence
and grouping of nested elements are addressed. For the sake of brevity, we omit
details of the algorithm and refer the interested reader to [PVM+02].
Chapter 3
From Duplicates To Schema Matching
Duplicates provide helpful information to find attribute correspondences where
schema-based or vertical instance-based matchers fail. In this chapter we de-
scribe the duplicate-based matching approach, describe the problem of detecting
duplicates, and review related work in that area.
3.1 Why Duplicates Can Help in Schema Matching
In Chap. 1 we claimed that extensional overlap, i.e., the existence of duplicates,
can help in the process of detecting a schema matching. Existing solutions,
which are reviewed in Sec. 2.3.4, suggest the same. In particular, we believe
that duplicate-based approaches succeed where others fail, e.g., when attribute
names are ‘cryptic’ and semantically different attributes have similar features.
To motivate the duplicate-based approach, we use the example in Fig. 3.1,
which is an adaptation of the example used in [BN05]. The source relation
R contains attributes FirstName, LastName, Sex, Phone, and Fax, whose
semantics can easily be determined by a human user. The schema of S contains
less readable attribute names: LN for last name, Acc for user account, Tel for
phone number, and OS for operating system. The correct matching contains
the correspondences ({LastName}, {LN}) and ({Phone}, {Tel}).
Schema-based schema matchers are very unlikely to produce good results
due to the use of acronyms (e.g., LN) and different abbreviations (Tel instead
of Phone) in relation S. Usually, string similarity measures are used in schema-
based approaches to determine the similarity of attribute names, and attributes
with highly similar names are assumed to correspond. In this scenario, names
of matching attributes do not have a higher similarity than names of unrelated
attributes. Consequently, schema-based matchers will fail.
R   FirstName LastName Sex Phone Fax
r1  John Doe m (408) 7573339 (408) 7573338
r2  Joe Smith m (249) 3615616 (249) 2342366
r3  Suzy Klein f (358) 2436321 (358) 2436321
r4  Sam Adams m (541) 8127100 (541) 8121164
r5  Mark Spitz m (901) 8319311 (901) 8612382
r6  Jim Beam (782) 1238957 (781) 1883744
r7  Kate Moss f (124) 9654565
r8  Sam Wong f (124) 4955670 (999) 9999999
r9  John Dean m (369) 3663624 (367) 3663625

S   LN Acc Tel OS
s1  Douglas jdouglas (408) 9182043 XP
s2  Dean jd (369) 3663624 XP
s3  Klein littlesue (358) 2436321 UNIX
s4  Adams sam (541) 8127100 W2000
s5  Wong kate (923) 6363443 Linux
s6  Kurz itsme UNIX
Figure 3.1: Relations R and S with intensional and extensional overlap.
Also note that there are cases in Web data integration where attribute names
are incorrect or unavailable. Assume that the web site of a retailer (e.g., Ama-
zon) has been crawled and a large collection of Web pages containing infor-
mation about books has been downloaded. Current wrapper generation tech-
niques are able to extract structural information from such a large corpus of
pages [LRNdST02]. However, the semantics of the structural elements cannot
be determined. To give the field of a web page a meaning, some approaches
try to extract labels from Web pages [ACMM03]. However, these approaches
do not always produce a good result. To complicate the issue, in several cases
labels are missing or even misleading. Fig. 3.2 shows product details of the
book “Readings in Database Systems” by Joseph M. Hellerstein and Michael
Stonebraker. One might assume that the text in front of the colon, which is
distinctively written in bold face, represents the attribute label, while the fol-
lowing text is the actual data. But that is only partially true. In the first line of
the details, ‘Paperback’ does not represent an attribute name, but the format
of the book. The value after the colon represents the number of pages. The fol-
lowing line contains information about the edition and publishing date without
a corresponding label.
Figure 3.2: Product details from www.amazon.com.
Instance-based schema matchers are not affected by the issues described
above, because they ignore schema information and exploit available data.
Nevertheless, instance-based matchers that perform vertical matching will have
difficulties with the scenario in Fig. 3.1. They might be able to determine the
correspondence ({LastName}, {LN}), depending on which features they extract.
Determining the second correspondence ({Phone}, {Tel}) is more difficult,
because Tel is also very similar to Fax. Even for a human user it is impossible to
distinguish phone numbers from fax numbers just by looking at each attribute
in isolation. In general, vertical matchers fail to distinguish attributes that have
similar features but are semantically different: They cannot tell phone numbers
from fax numbers (Fig. 3.1) or place of residence from birthplace (Fig. 2.3).
Close examination of the above scenario reveals an extensional overlap: Tu-
ple pairs (r3, s3), (r4, s4), and (r9, s2) are fuzzy duplicates, because both tu-
ples in each pair represent the same person. We claim that such duplicates
can be used to increase matching accuracy. Instance-based matchers which
perform horizontal matching follow that idea by detecting duplicates and
extracting attribute correspondences from them. Each of the duplicates indicates
a certain matching, which can be determined by comparing attribute values.
E.g., (r4, s4) has similar attribute values that indicate correspondences
({LastName}, {LN}) and ({Phone}, {Tel}). Some duplicates might suggest
false correspondences, but by aggregating the results of several duplicates, the
effect of a few false indications diminishes.
In contrast to vertical matchers, duplicate-based matchers are able to distin-
guish phone number from fax number in the above example: By comparing the
tuple pairs determined to be duplicates, we can see that values for Phone and
Tel are always equal, and thus, can deduce that these attributes correspond.
Similarly, when comparing duplicates ‘John Doe’ and ‘Sam Adams’ in Fig. 2.3
on page 30, we can see that attributes Birthplace and POB match.
3.2 The DUMAS Approach
As described in the previous section, duplicates provide information that can
be used to detect correspondences. We now sketch out an algorithm that ex-
ploits extensional overlap. Consistent with related work on horizontal matching
(Sec. 2.3.4), the general algorithm for duplicate matching of schemata (DUMAS)
proceeds in two steps:
1. Duplicate detection,
2. Matching.
The goal of the first step is to detect fuzzy duplicates, which will be used
in the second step. While it might not be necessary to detect all duplicates,
the tuple pairs identified as duplicates should in fact be duplicates, i.e., false
duplicates must be avoided. In the above example, the perfect result is the set
of tuple pairs {(r3, s3),(r4, s4),(r9, s2)}. The duplicate detection step will be
discussed in detail in the following sections. In particular, we define what a
duplicate is, and examine related work in duplicate detection. We show that
existing solutions are not applicable, and define the problem of detecting dupli-
cates for schema matching in Sec. 3.5.
Consistent with the definition in Sec. 2.2, the goal of the matching step is
to produce a set of attribute correspondences. The duplicates detected in the
previous step are to be used here as input. Other information, such as metadata
or documentation, will be ignored. Intuitively, each of the tuple pairs can be
seen as an indicator for the correct schema matching: If the value of an attribute
in one tuple is the same as or very similar to the value of an attribute in the
other tuple, we could deduce that those two attributes are related.
While this is usually true, two attribute values can coincidentally match, thus
misleading schema matching, or several attributes can be very similar, resulting
in uncertainty. Fig. 3.3 shows tuples r4 and s4, with highly similar attribute
values connected by arrows. The two solid arrows denote true correspondences,
while the dashed arrow shows that the duplicate has similar values in attributes
that do not match. Such a case can mislead schema matching: If only this
duplicate were used, the result would include a false correspondence. The case
of several possible matches occurs in tuple pair (r3, s3), because Suzy Klein’s
phone number and fax number are equal. Hence it is impossible to decide which
attribute corresponds to Tel when using only duplicate (r3, s3). The solution
to both problems is the same: Use several duplicates instead of just one, thus
reducing the effect of a few coincidental value similarities or uncertain matches.
Consequently, one must aggregate the different indications of several duplicates
to produce a final matching. A solution to this problem is presented in Chap. 5.
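The effect of aggregating several duplicates can be illustrated with the three duplicates of Fig. 3.1. The following Python sketch is deliberately simplistic and is not the aggregation method of Chap. 5: it assumes that two values are either equal (ignoring case) or not, and it averages these indications over all duplicates. The function names are hypothetical.

```python
R_ATTRS = ["FirstName", "LastName", "Sex", "Phone", "Fax"]
S_ATTRS = ["LN", "Acc", "Tel", "OS"]

# The duplicates (r3, s3), (r4, s4), and (r9, s2) from Fig. 3.1.
duplicates = [
    (("Suzy", "Klein", "f", "(358) 2436321", "(358) 2436321"),
     ("Klein", "littlesue", "(358) 2436321", "UNIX")),
    (("Sam", "Adams", "m", "(541) 8127100", "(541) 8121164"),
     ("Adams", "sam", "(541) 8127100", "W2000")),
    (("John", "Dean", "m", "(369) 3663624", "(367) 3663625"),
     ("Dean", "jd", "(369) 3663624", "XP")),
]


def value_similarity(a: str, b: str) -> float:
    """Placeholder measure (assumption): 1.0 for case-insensitive equality, else 0.0."""
    return 1.0 if a.lower() == b.lower() else 0.0


def aggregate(duplicates):
    """Average the per-duplicate indications for every attribute pair."""
    totals = {(ra, sa): 0.0 for ra in R_ATTRS for sa in S_ATTRS}
    for r, s in duplicates:
        for ra, rv in zip(R_ATTRS, r):
            for sa, sv in zip(S_ATTRS, s):
                totals[(ra, sa)] += value_similarity(rv, sv)
    return {pair: total / len(duplicates) for pair, total in totals.items()}


scores = aggregate(duplicates)
for pair in [("LastName", "LN"), ("Phone", "Tel"), ("Fax", "Tel"), ("FirstName", "Acc")]:
    print(pair, round(scores[pair], 2))
```

The misleading indications (Fax and Tel in (r3, s3), FirstName and Acc in (r4, s4)) each occur in only one of the three duplicates, so their averaged scores stay well below those of the true correspondences.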
[Figure content: tuple r4 = (Sam, Adams, m, (541) 8127100, (541) 8121164) and
tuple s4 = (Adams, sam, (541) 8127100, W2000), with arrows connecting highly
similar attribute values.]
Figure 3.3: Misleading similarity of attribute values.
Note that this two-step process can also be made iterative (Fig. 3.4). After
detecting a few correspondences in the matching step, it should be determined
if this matching can be trusted. If the correspondences are certain, they should
be presented to the user. If some correspondences are uncertain, the schema
matching process should resume. However, in the second iteration the
correspondences that are trusted can be used in the duplicate detection step.
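[Figure content: the source and target databases are the input of duplicate
detection; the detected duplicates are the input of matching; the resulting
correspondences are checked (Certain?). If yes, they form the result matching;
if no, the certain correspondences are fed back into duplicate detection.]
Figure 3.4: The duplicate-based schema matching process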
3.3 Duplicates and Duplicate Detection
The problem of detecting duplicates has been studied for several decades under
various names, including record linkage, data cleansing, and entity identifica-
tion [RD00]. In all but a few publications, the described goal is to find duplicates
in a single table or two matching tables. The latter implies that it has been
established that the two tables represent the same entity type. Fuzzy duplicates,
or duplicates for short, are defined to be different representations of the same
real-world entity.
Duplicates are called fuzzy because they are not exact copies of one an-
other. Even when schematic heterogeneity is not an issue, e.g., when looking
for duplicates in a single table, the same information can be represented in dif-
ferent ways. This can be done deliberately when no standard representations
exist: “Technische Universität Berlin” can be abbreviated “TU Berlin” or even
“TUB”, a second given name of a person can be fully written or be replaced
by a middle initial, and different scales for certain measures can be used in
different database entries. In addition to deliberately different representation,
misspellings and missing information make duplicate detection a hard problem.
Duplicate detection is the process of searching for fuzzy duplicates, which
can be represented as tuple pairs. More formally, the goal of duplicate detection
in a single relation R is to find tuple pairs $(r_i, r_j) \in R \times R$, where $r_i$ and $r_j$
represent the same real-world entity. We say that $r_j$ is a duplicate or duplicate
record of $r_i$. Because only a single table is considered, all tuples are equally
structured, and thus, there is an inherent matching relating each attribute to
itself.
The problem of duplicate detection in aligned relations is similarly defined:
Given two tables R and S and the schema matching M between R and S, find
all tuple pairs $(r_i, s_j) \in R \times S$ such that $r_i$ and $s_j$ represent the same
real-world entity. The correspondences are required, because a duplicate detection
algorithm needs to compare values of related attributes. In contrast to the
problem of duplicate detection in a single relation, schematic heterogeneity may
occur, i.e., the schemata of the relations can have a different structure, and
not all attributes must have a corresponding attribute in the other schema.
Note that both problem definitions do not contain cardinality restrictions: A
database entry can have zero, one, or several duplicate records.
The accuracy of the duplicate detection result is measured in terms of pre-
cision and recall, which are computed as follows:
$$\text{Precision} = \frac{|D \cap R|}{|R|} \qquad \text{Recall} = \frac{|D \cap R|}{|D|}$$
where D is the set of true duplicates and R is the set of retrieved duplicates. A
good precision is achieved if only few false duplicates are in the result set. On
the other hand, the user wants the duplicate detection algorithm to find all or
most of the duplicates, which is measured as recall.
In most cases, the user has to make a tradeoff decision: By allowing more
errors, one will retrieve more tuple pairs and increase recall, but also increase the
chance of detecting false duplicates. Consequently, recall is likely to increase,
but as more false duplicates enter the result set, precision will drop.
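Both measures follow directly from set operations, as the following Python sketch shows. The retrieved set is hypothetical, and returning 0.0 for an empty denominator is a convention chosen for this sketch.

```python
def precision_recall(true_duplicates: set, retrieved: set):
    """Compute precision and recall of a duplicate detection result."""
    hits = len(true_duplicates & retrieved)          # |D ∩ R|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(true_duplicates) if true_duplicates else 0.0
    return precision, recall


# True duplicates of Fig. 3.1 and a hypothetical detection result.
true_dups = {("r3", "s3"), ("r4", "s4"), ("r9", "s2")}
retrieved = {("r3", "s3"), ("r4", "s4"), ("r8", "s5"), ("r1", "s1")}

precision, recall = precision_recall(true_dups, retrieved)
print(precision, round(recall, 3))
```

Here two of the four retrieved pairs are true duplicates (precision 0.5), and two of the three true duplicates are found (recall 2/3).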
Besides the quality measures precision and recall, efficiency of duplicate
detection is a major issue. If a single table contains n tuples, then there are $n^2$
possible tuple pairs which have to be compared. Analogously, if a source table
contains m tuples and a target table contains n tuples, there are $m \cdot n$ tuple pairs
which have to be checked in the case of duplicate detection in aligned relations.
Because that many comparisons are infeasible in all but the smallest relations,
methods for reducing the number of tuple comparisons have to be applied.
3.4 Related Work on Duplicate Detection
As discussed above, duplicate detection algorithms must produce a result with
both (i) high accuracy and (ii) good efficiency. In the following, we review
related work on duplicate detection and describe their approach to achieve these
two goals.
3.4.1 Record Linkage
Record linkage is a statistical approach that uses user-provided samples of du-
plicate tuple pairs and non-duplicate tuple pairs to derive duplicate detection
rules [NKAJ59, FS69, Win95, EVE02]. The record linkage method classifies
tuple pairs into ‘duplicate’ (A1), ‘non-duplicate’ (A3), and ‘possible duplicate’
(A2) based on the comparison vector $\gamma = (\gamma_1, \ldots, \gamma_n)$ for each tuple pair. The
method to create the comparison vector is not defined by record linkage. Usually,
one compares values of the same attribute using an attribute-specific comparison
function to determine a score for each $\gamma_i$: E.g., in a table describing
books, one could determine a comparison score for the ISBN attribute with the
following function [NL00]:
$$f_1(\mathit{ISBN}_1, \mathit{ISBN}_2) := \begin{cases} 0 & : \mathit{ISBN}_1 = \mathit{ISBN}_2 \\ 1 & : \mathit{ISBN}_1, \mathit{ISBN}_2 \text{ are missing} \\ 2 & : \text{otherwise.} \end{cases}$$
To determine if two tuples match, the likelihood ratio $\lambda$ of the tuple pair
must be computed, which is defined as:
$$\lambda = \lambda(\gamma) = \frac{P(\gamma \mid D)}{P(\gamma \mid N)} \qquad (3.1)$$
where D is the set of duplicate tuple pairs and N is the set of non-duplicates.
The conditional probabilities P(γ|D) and P(γ|N) can be estimated from the
user-provided samples. We point out that in practice the computation of the
probabilities is simplified by assuming that the elements of γare statistically
independent.
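A lower bound $\lambda_l$ and an upper bound $\lambda_u$ are also derived from the
samples [FS69]. Using those bounds, one can determine the class of a given tuple
pair using the decision function $\delta(\lambda)$:
$$\delta(\lambda) := \begin{cases} A_1 & : \lambda > \lambda_u \\ A_2 & : \lambda_l \le \lambda \le \lambda_u \\ A_3 & : \lambda < \lambda_l. \end{cases} \qquad (3.2)$$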
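Equations 3.1 and 3.2 can be illustrated with a minimal Python sketch under the independence assumption mentioned above. The sample comparison vectors, the add-one smoothing of the estimated probabilities, and the two bounds are assumptions made for this sketch; in particular, the bounds are simply given here rather than derived from the samples as in [FS69].

```python
from collections import Counter
from math import prod


def estimate(samples, levels):
    """Estimate P(gamma_i = level) per component from sample vectors.

    Add-one smoothing (an assumption) avoids zero probabilities."""
    tables = []
    for i in range(len(samples[0])):
        counts = Counter(vector[i] for vector in samples)
        total = len(samples) + len(levels)
        tables.append({level: (counts[level] + 1) / total for level in levels})
    return tables


def likelihood_ratio(gamma, p_dup, p_non):
    """Eq. 3.1 with statistically independent components of gamma."""
    return prod(p_dup[i][g] / p_non[i][g] for i, g in enumerate(gamma))


def classify(lam, lower, upper):
    """Decision function of Eq. 3.2."""
    if lam > upper:
        return "A1"   # duplicate
    if lam < lower:
        return "A3"   # non-duplicate
    return "A2"       # possible duplicate


LEVELS = (0, 1, 2)    # 0: equal, 1: missing, 2: different
dup_samples = [(0, 0), (0, 2), (0, 0), (1, 0)]   # hypothetical user-provided samples
non_samples = [(2, 2), (2, 0), (2, 2), (1, 2)]

p_dup = estimate(dup_samples, LEVELS)
p_non = estimate(non_samples, LEVELS)
LOWER, UPPER = 0.5, 2.0                          # hypothetical bounds

for gamma in [(0, 0), (1, 1), (2, 2)]:
    lam = likelihood_ratio(gamma, p_dup, p_non)
    print(gamma, round(lam, 3), classify(lam, LOWER, UPPER))
```

A vector of two equal values yields a ratio far above the upper bound, two differing values a ratio far below the lower bound, and two missing values a ratio of about 1, which falls into the undecided class A2.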
To make record linkage scalable, the number of tuple comparisons is de-
creased in a process called blocking: Domain-specific criteria are used to par-
tition the set of tuples into blocks, and only tuples within a single block are
compared [FS69]. All other tuple pairs are implicitly considered non-duplicates.
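Blocking can be sketched in a few lines of Python. The blocking criterion used here, the first letter of the last name, is a hypothetical example of a domain-specific criterion, and the records are the names of relation R in Fig. 3.1.

```python
from collections import defaultdict
from itertools import combinations


def blocked_pairs(tuples, block_key):
    """Yield only those tuple pairs whose members fall into the same block."""
    blocks = defaultdict(list)
    for t in tuples:
        blocks[block_key(t)].append(t)
    for block in blocks.values():
        yield from combinations(block, 2)


people = [("John", "Doe"), ("Joe", "Smith"), ("Suzy", "Klein"), ("Sam", "Adams"),
          ("Mark", "Spitz"), ("Jim", "Beam"), ("Kate", "Moss"), ("Sam", "Wong"),
          ("John", "Dean")]

# Hypothetical blocking criterion: first letter of the last name.
pairs = list(blocked_pairs(people, block_key=lambda t: t[1][0]))
all_pairs = len(people) * (len(people) - 1) // 2
print(len(pairs), "of", all_pairs, "pairs are compared")
```

Only the pairs within the blocks for 'D' and 'S' remain; every other pair is never compared and is thus implicitly treated as a non-duplicate.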
3.4.2 The Sorted Neighborhood Method
The Sorted Neighborhood Method (SNM) is a duplicate detection approach that
has been developed in the database community [HS95, HS98]. It is based on the
process used by database management systems to remove exact duplicates: In-
stead of comparing all tuples, the content of a relation is sorted by tuple content.
This process brings equal tuples together, i.e., tuples with the same content are
neighbors. Consequently, only neighboring tuples need to be compared to find
exact duplicates.
Because of the previously discussed data quality issues, the procedure to
find exact duplicates is not applicable for fuzzy duplicates: If the same piece
of information is represented in different ways, sorting a table might place two
tuples representing the same real-world entity in different regions of the table.
The sorted neighborhood method tackles that problem in three steps:
1. Create key: Compute a key for each record in the relation by extracting
relevant attribute values or portions of attribute values.
2. Sort data: Sort the records in the table using the key of step 1.
3. Merge: Move a fixed size window through the ordered list of records limit-
ing the comparisons for matching records to those records in the window.
The goal of the first step is to define a key that places duplicate tuples close
to each other in the sorting phase even when the data suffers from data quality
issues. The key is usually comprised of several attribute values or portions
thereof: E.g., a good key for relation R in Fig. 3.1 could be constructed from
the first two letters of the FirstName value and the first two letters of the
LastName value. The definition of a reasonable key requires the user to have
knowledge of the domain and of typical errors contained in the data set. The
creation of the above key could be driven by the experience that errors in names
usually occur at the end of the string.
[Figure content: the current window of w records and the next window of w
records in the sorted table.]
Figure 3.5: Sliding a window through a sorted table (Source: [HS95]).
The key defined in step 1 is used in the second step to sort the table. The
sorted table is searched for duplicates using the ‘sliding window’ approach: A
window of size w (i.e., with a capacity of w records) is moved through the sorted
table. Every time the window has been moved one tuple ahead, the tuple that
has entered the window is compared with the other $w - 1$ tuples presently
inside the window. Obviously, that approach decreases the number of tuple
comparisons: Instead of comparing all tuple pairs, which has a complexity of
$O(N^2)$, comparing only tuples within the sliding window has a time complexity
of $O(wN)$. This justifies the additional effort for creating a key for each tuple
($O(N)$) and sorting the table ($O(N \log N)$).
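The three steps can be sketched in Python as follows. This is a minimal illustration, not the implementation of [HS95]: the records are hypothetical, the key follows the example given above (first two letters of the first and last name), and the tuple comparison is an assumed string similarity test based on `difflib.SequenceMatcher` with a hypothetical threshold of 0.8.

```python
from difflib import SequenceMatcher


def sorted_neighborhood(records, make_key, is_duplicate, window):
    """Sort by key, then compare each record with its w - 1 predecessors."""
    ordered = sorted(records, key=make_key)                 # steps 1 and 2
    pairs = []
    for i, current in enumerate(ordered):                   # step 3: sliding window
        for other in ordered[max(0, i - window + 1):i]:
            if is_duplicate(other, current):
                pairs.append((other, current))
    return pairs


def make_key(record):
    """Key from the first two letters of the first and the last name."""
    first, last = record
    return (first[:2] + last[:2]).upper()


def is_duplicate(a, b, threshold=0.8):
    """Assumed comparison: similarity of the full names above a threshold."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio() > threshold


# Hypothetical records containing two misspelled duplicates.
records = [("John", "Doe"), ("Suzy", "Klein"), ("Sam", "Adams"),
           ("Johnn", "Doe"), ("Suzy", "Kline"), ("Mark", "Spitz")]

found = sorted_neighborhood(records, make_key, is_duplicate, window=2)
print(found)
```

With a window of size 2 only neighboring records are compared, i.e., five comparisons instead of the fifteen needed for all pairs of the six records.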
Note that the process to find exact duplicates is a special case of the sorted
neighborhood method, where the sort key is the whole tuple content and the
window size is 2. Comparing tuples to find exact duplicates is straightforward:
The contents of the attributes must be equal in exact duplicates. When search-
ing for fuzzy duplicates, data quality issues must be taken into consideration:
Hernández and Stolfo propose to compare tuples using user-defined rules [HS95],
e.g.:
Given two records, r1 and r2
IF the last name of r1 equals the last name of r2,
AND the first names differ slightly,
AND the address of r1 equals the address of r2
THEN
r1 is equivalent to r2.
The implementation of “differ slightly” is based on a string similarity function,
i.e., two first names differ slightly if their string similarity is above a given
threshold. Many rules similar to the above one are usually defined to allow
for different kinds of errors. However, the sorted neighborhood method is not
limited to such rules: The primary purpose of the sliding window technique
is to reduce the number of tuple comparisons, and various tuple comparison
measures can be used to detect duplicates.
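The rule quoted above can be expressed as a Python predicate. The sketch assumes that “differ slightly” is implemented with the similarity ratio of `difflib.SequenceMatcher` and a hypothetical threshold of 0.8; the rule system of [HS95] may use a different string similarity function. The record layout and the sample records are hypothetical.

```python
from difflib import SequenceMatcher


def differ_slightly(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if the string similarity of a and b is above the threshold."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() > threshold


def equivalent(r1: dict, r2: dict) -> bool:
    """The quoted rule: equal last names, slightly differing first names, equal addresses."""
    return (r1["last"] == r2["last"]
            and differ_slightly(r1["first"], r2["first"])
            and r1["address"] == r2["address"])


# Hypothetical records.
a = {"first": "Jonathan", "last": "Doe", "address": "12 Main Street"}
b = {"first": "Jonathon", "last": "Doe", "address": "12 Main Street"}
c = {"first": "Mark", "last": "Doe", "address": "12 Main Street"}

print(equivalent(a, b), equivalent(a, c))
```

Such a predicate can be used as the tuple comparison inside the sliding window; further rules would be added as additional predicates combined by a logical OR.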
The duplicate detection rules should be defined such that only true dupli-
cates are considered as duplicates, i.e., false duplicates must be avoided. How-
ever, even with a good set of rules it is possible that duplicates are missed
because they are placed far apart in the sorting phase. A possible measure
to avoid missed duplicates is to increase the size of the window w, and thus,
increase the number of tuple comparisons. While this action improves the accuracy
of duplicate detection, it also degrades performance. Hernández and
Stolfo have shown that the multi-pass approach results in much better duplicate
detection accuracy: Instead of increasing w, the sorted neighborhood method
is applied several times using varying keys. Afterwards, the transitive closure
for the duplicates is computed, i.e., if (r1, r2) is detected as a duplicate in one
run and (r2, r3) is detected in another run, then (r1, r3) must be considered a
duplicate, too.
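The transitive closure over the results of several passes can be computed with a union-find structure, as in the following Python sketch. The choice of union-find and the function name are assumptions made for illustration; [HS95] is not claimed to use this particular data structure.

```python
def transitive_closure(pairs):
    """Group records into clusters of transitively connected duplicates."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)           # union of the two clusters

    clusters = {}
    for x in list(parent):
        clusters.setdefault(find(x), set()).add(x)
    return list(clusters.values())


# Hypothetical results of two passes with different keys.
first_pass = [("r1", "r2")]
second_pass = [("r2", "r3"), ("r4", "r5")]

clusters = transitive_closure(first_pass + second_pass)
print(sorted(sorted(c) for c in clusters))
```

Records r1 and r3 end up in the same cluster although no single pass compared them successfully, which is exactly the effect described above.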
3.4.3 Other Duplicate Detection Approaches
Various duplicate detection methods that use domain-independent string simi-
larity measures have been proposed. Monge and Elkan compare attribute values
using the Smith-Waterman algorithm, which is a variant of the Levenshtein edit
distance that applies different weights for different characters [ME96, ME97].
Chaudhuri et al. define a fuzzy similarity function, which determines the
similarity of two attribute values based on the operations needed to transform one
value into the other [CGGM03]. Three operations are defined: token replacement,
token insertion, and token deletion. The cost of token insertion and token
deletion depends on the inverse document frequency of the inserted or deleted
token, while token replacement also takes the edit distance into consideration¹.
Chaudhuri et al. also describe a probabilistic procedure to reduce the number
of tuple comparisons when searching for duplicates.
¹ See Sec. 4.3.2 for a description of edit distance and inverse document frequency.
The string similarity measures applied in the above duplicate detection sys-
tems are domain-independent to facilitate their use in different scenarios. Recent
research has shown that machine learning techniques can be used to automati-
cally adapt these measures to a given domain in order to improve their accuracy.
Bilenko and Mooney describe the MARLIN system, which exploits user-provided
samples to learn domain-specific costs for edit distance and TFIDF cosine sim-
ilarity [BM03]. Experiments show that the accuracy of TFIDF does not always
increase, but the edit distance costs learned by Expectation Maximization (EM)
always improve the result. In a similar fashion, Tejada et al. use decision trees
to learn weights for various string transformation functions [TKM02].
The duplicate detection methods described here only take corresponding
attributes into consideration when determining if tuples represent the same entity.
A notable exception is PROM, which also exploits information represented in
only a single source [DLLH03]. After possible duplicates have been detected on
the basis of shared attributes, PROM performs a sanity check. The sanity check
uses various application-specific constraints, which can also relate to unmatched
attributes: E.g., assume two tables describing people, where the income is only
represented in the first table, while the age of a person is only shown in the
second table. A reasonable rule for this scenario would be “If a person’s age is
less than 18, then the income cannot be larger than USD 10,000.”
We point out that duplicates are also an issue outside of the relational world.
Hence, duplicate detection algorithms have also been proposed for XML [WN05],
data warehouses [ACG02], and spatial data [BKSS04].
The algorithms described in this section are designed to detect duplicates
with high precision and high recall, i.e., the goal is to find all duplicates and
to not produce false duplicates. To achieve that goal, they require a lot of
information from the user: attribute correspondences, sample duplicates and
non-duplicates, duplicate detection rules, etc. As will be shown in the following,
finding duplicates for schema matching has relaxed quality goals, but cannot
expect as much user input as classic duplicate detection approaches.
3.5 Finding Duplicates For Schema Matching
The duplicate detection approaches described above use different techniques, but
all of them have one property in common: They require the complete schema
matching to be known, so they can restrict attribute comparisons to related
attributes. This is contrary to our problem of detecting duplicates for schema
matching, where the goal is to establish a matching after duplicate detection.
This section discusses the problem of detecting duplicates without known cor-
respondences.
3.5.1 Single-Table Duplicates
When matching two tables using our duplicate-based approach, the problem
definitions in Sec. 3.3 cannot be applied. Instead, the problem of detecting
duplicates in unaligned relations has to be solved, because attribute correspon-
dences do not exist in the first step. The definition of this problem is equivalent
to the definition of duplicate detection in aligned relations, except that no cor-
respondences are known. We call the detected duplicates single-table duplicates
because their tuples span only a single table.
Finding duplicates in unaligned relations is much harder than in aligned
tables. Fortunately, quality requirements are less strict. Recall that in the case
of duplicate detection in aligned relations, the goal is to detect duplicates with
high recall and high precision, i.e., all or most duplicates should be detected
and only few false duplicates should enter the result set, respectively. When
looking for duplicates in the DUMAS approach, high recall is irrelevant: It is
not necessary to detect all duplicates, but only as many as required for schema
matching. On the other hand, precision is still an issue, because false duplicates
may corrupt the schema matching constructed in the matching step.
3.5.2 Multi-Table Duplicates
Up to this point, the objects representing real-world entities were assumed to be
structured as a single table. However, the ultimate goal of schema matching is
to find correspondences between complex schemata consisting of multiple tables.
When trying to match complex schemata, one has to consider issues related to
schematic heterogeneity. In particular, one has to be aware that information can
be structured in different ways: E.g., an entity type represented as a single table
in the source schema can be structured as several tables in the target schema
(e.g., due to normalization). Fig. 3.6 depicts such a case: The source schema
contains table CourseByTerm describing a course with a course id (CID) and
a title, which is taught in a given term by a faculty member. In the target
schema, this table is normalized into three tables: One table for the courses,
one table for the faculty members (Fac), and a table By that has foreign keys to
the other two tables (CID and FID) and an attribute for the term in which a
course was taught. The semantics of the source table is similar to the semantics
of the table that can be created by joining the three target tables. In a similar
fashion, a complex entity type can be described in several tables in the source
schema and several tables in the target schema.
In the following, we assume that the tables representing an entity type can
be joined together in a meaningful way. This assumption holds in most cases,
because tables related to a single concept are usually connected in a schema,
e.g., Course, By, and Fac in Fig. 3.6. It is well known that the relational
algebra is closed, i.e., relational operators take relations as input and produce
relations. We exploit this property and reduce the problem of detecting multi-
table duplicates in unaligned schemata to the problem of detecting duplicates
in unaligned relations: Given that an entity type is represented as relations
$R_1, \ldots, R_m$ in the source schema and relations $S_1, \ldots, S_n$ in the target schema,
find duplicates in the unaligned tables $R_{join} = R_1 \bowtie \ldots \bowtie R_m$ and
$S_{join} = S_1 \bowtie \ldots \bowtie S_n$.

Source schema: CourseByTerm(CID, Title, Term, Faculty)
Target schema: Course(CID, Title), By(CID, FID, Term), Fac(FID, Name)
Figure 3.6: Multi-Table Duplicates
3.5.3 Related Work
A number of duplicate-based schema matchers were examined in Sec. 2.3.4, and
it was shown that they either ignore the problem of duplicate detection or work
only in very restricted scenarios: iMap requires the user to provide duplicates,
Chua et al. assume that global identifiers exist, and the Internet Learning
Agent reduces the duplicate detection problem to simple keyword search, which
is insufficient in most scenarios. In fact, we are not aware of any work that fully
considers the problem of duplicate detection in unaligned relations.
Finding duplicates that span multiple tables is also a novel problem. As
described in Sec. 2.3.4, classic duplicate detection procedures are based on sin-
gle tables. If correspondences are known and the semantics of the tables are
understood by the user, these duplicate detection algorithms can be applied
on existing tables or tables that are created by joining certain tables: E.g., if
the user knew that the join of tables Course, By, and Fac in Fig. 3.6 has the
same semantics as table CourseByTerm and the correspondences were known,
then one of the discussed duplicate detection procedures could be applied on
$Course \bowtie By \bowtie Fac$ and CourseByTerm. Unfortunately, in our scenario
neither the correspondences nor the semantics of the tables are known, and the
problem of multi-table duplicates in unaligned relations has not been considered
before.
In the following chapters we present solutions to the problems described here
and in the previous chapter. In Chap. 4 a solution to the duplicate detection
problem in unaligned relations is presented. In the following Chap. 5 we show
how to extract attribute correspondences from the detected duplicates. The
duplicate detection procedure in combination with the matching step constitutes
the DUMAS table matcher. Chap. 6 describes the DUMAS schema matcher,
which detects multi-table duplicates to find correspondences between two com-
plex schemata. The DUMAS complex matcher, which is able to extract complex
correspondences from duplicates, is discussed in Chap. 7.
Part II
The DUMAS Table
Matcher
Chapter 4
The Duplicate Detection
Step: Finding Duplicates in
Unaligned Relations
This chapter presents an algorithm that solves the problem of duplicate detec-
tion in unaligned relations, which is more difficult than the common duplicate
detection problem, where attribute correspondences are known. We begin by
examining problems that stem from the lack of match information. Our solution
to this problem is to consider each tuple as a single string, and to consider the k
most similar tuple pairs as duplicates. After defining a tuple similarity measure,
an algorithm that efficiently finds the most similar tuple pairs is described. The
duplicate detection procedure is experimentally evaluated using both real-world
and synthetic data. Note that the duplicate detection procedure presented in
this chapter, combined with the matching step described in the following chapter,
constitutes the DUMAS table matcher, which was originally described in [BN05].
4.1 Duplicate Detection Without Known Correspondences
Recall from Sec. 3.4 that duplicate detection methods essentially have to answer
two questions:
1. Given two tuples r and s, is (r, s) a duplicate?
2. What tuples need to be compared in order to find all duplicates?
The answer to the first question determines the accuracy of duplicate detection,
while the answer to the second question affects the performance.
Most work on duplicate detection requires existing correspondences to be
known. This is necessary to ensure that only related (i.e., corresponding) at-
tributes are compared. In addition, the user needs to have detailed knowledge
of the application domain and the semantics of the attributes. Such knowledge
is required to either manually define duplicate detection rules or to extract a few
duplicates and non-duplicates, which are used to train the duplicate detection
algorithm (see Sec. 3.4). Note that when the attribute semantics of both source
and target schema are understood, attribute correspondences are inherently
known, because attributes with the same semantics correspond. Knowledge
about the semantics of the attributes is also required when determining which
tuples need to be compared: E.g., the sorted neighborhood method requires the
user to specify criteria for sorting, which strongly depend on the application
domain.
Consider tuples $r_4$ and $s_4$ from the example in Fig. 3.1 on page 38, which
represent the same real-world entity. If the correct correspondences ({LastName},
{LN}) and ({Phone}, {Tel}) are given as input, a user who is familiar with the
application domain and attribute semantics can easily see that the two tuples
are probably duplicates: Given that the last names (“Adams”) and phone
numbers (“(541) 8127100”) are equal, it is very likely that the two tuples represent
the same real-world entity.
r4: Sam | Adams | m | (541) 8127100 | (541) 8121164
s4: sam | Adams | (541) 8127100 | W2000
Figure 4.1: Duplicate tuples.
Finding duplicates in non-aligned relations is more difficult. Fig. 4.1 depicts
those two tuples without attribute names and correspondences. This is
an example of the kind of input our duplicate detection procedure has to
handle. When deciding if those two tuples are duplicates without knowing the
actual correspondences, it is very hard to decide which attributes to compare.
Obviously, comparing values of attributes at the same position in the tuples is
unreasonable: Structural heterogeneity, which is to be resolved by establishing a
mapping between the two schemata, gives rise to the problem that semantically
related attributes are at different positions, and some attributes only appear in
one of the two schemata. Hence, when comparing two arbitrary tuples, it is
very hard, if not impossible, to determine if they represent the same entity.
However, in contrast to related work on duplicate detection in aligned rela-
tions, our goal is not to develop a general-purpose duplicate detection procedure
that finds all duplicates. Instead, only as many duplicates as required for schema
matching have to be found. In other words, duplicate detection does not need
to achieve high recall. This facilitates the modelling of the duplicate detection
problem as the search for the k most similar tuple pairs, where k is the number
of tuple pairs required for schema matching. Sec. 4.2 describes the assumptions
that are made when following this approach. To determine the similarity of two
tuples, a tuple similarity measure is defined in Sec. 4.3. The choice of similarity
measure also affects the answer to the second question. Sec. 4.4 describes an al-
gorithm that is able to find the k duplicates with a number of tuple comparisons
that is significantly smaller than the overall number of tuple pairs.
4.2 Duplicate Detection as Top-k Search
Before describing the DUMAS duplicate detection algorithm, two underlying
prerequisites have to be made explicit: The two relations that are given as
input to the duplicate detection procedure must
1. represent the same entity type,
2. contain duplicates.
The first prerequisite simply states that some correspondences exist between
the two tables. If the tables are not related, one would not need to look for
attribute correspondences. However, it is necessary to make this assumption
because in some cases some real-world entities appear in a database in different
roles: E.g., if tables R and S in Fig. 3.1 represented customers of a department
store and account holders in a computer network, respectively, detected
duplicates might indicate correspondences between R and S although they are not
related. However, note that this is a rare case, which strongly depends on the
intention of the user: If the goal was the integration of customer and account
data, the detected correspondences would be correct.
In order for any duplicate-based schema matcher to work, actual duplicates
must exist, which is stated in the second prerequisite. As we will see in the
following, the described duplicate detection procedure has no direct means to
determine if two tuples are duplicates. Instead, it picks some tuple pairs which
are more likely to be duplicates than other tuple pairs. In other words, the
algorithm always finds a few tuple pairs as duplicates, even when no duplicates
exist.
Given that duplicates exist and given the requirement to detect only a few
duplicates, the algorithm does not need to determine if two tuples represent the
same real-world entity. Instead, it only has to find some tuple pairs that are
most likely to be duplicates. This decision is based on the following duplicate
detection assumption: A tuple pair $(r_i, s_j)$ is more likely a duplicate than a tuple
pair $(r_k, s_l)$ if $r_i$ is more similar to $s_j$ than $r_k$ is to $s_l$. This assumption does
not have to hold in the entire space of tuple pairs in order to produce a good
result: Because only a few duplicates are required, the distinction between
duplicates and non-duplicates needs to be clear in only a small fraction of the
space of all tuple pairs. In the DUMAS duplicate detection algorithm, we assume that
the most similar tuple pairs are true duplicates, and return the k most similar
tuple pairs if k duplicates are required for schema matching. This assumption
is very intuitive: If there are any duplicates, then the most similar tuples most
likely are true duplicates. What is meant by ‘similar’ in the context of duplicate
detection in unaligned relations is discussed in the following section.
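The top-k formulation can be illustrated with a small brute-force sketch. It compares every tuple pair, so it is not the efficient search algorithm of Sec. 4.4. The names top_k_pairs and token_overlap, the placeholder similarity (overlap of lowercase tokens), and the sample tuples are assumptions made for this example only.

```python
import heapq
from itertools import product


def token_overlap(r, s):
    """Placeholder similarity (an assumption): Jaccard overlap of lowercase tokens."""
    a = {tok.lower() for value in r for tok in str(value).split()}
    b = {tok.lower() for value in s for tok in str(value).split()}
    return len(a & b) / len(a | b) if a | b else 0.0


def top_k_pairs(R, S, k, sim=token_overlap):
    """Return the k highest-scoring (score, i, j) triples over all pairs in R x S."""
    scored = ((sim(r, s), i, j)
              for (i, r), (j, s) in product(enumerate(R), enumerate(S)))
    return heapq.nlargest(k, scored)


# Made-up tuples; attribute order differs between the two tables.
R = [("Sam", "Adams", "m", "(541) 8127100"),
     ("Kate", "Miller", "f", "(541) 5550000")]
S = [("Jones", "kate", "(212) 5551111"),
     ("Adams", "sam", "(541) 8127100")]

best = top_k_pairs(R, S, k=1)   # the single most similar pair is (R[0], S[1])
```

Note that the function always returns k pairs, whether or not true duplicates exist, which is exactly why the second prerequisite above is needed.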
Note that the duplicate detection assumption can also be found in other
duplicate detection procedures. Recall from Sec. 3.4 that record linkage uses
two thresholds to classify tuple pairs into three classes: duplicates, possible
duplicates, and non-duplicates. Although there is a class of possible duplicates
where the distinction between duplicate and non-duplicate is not clear, and
thus, clerical review is required, tuple pairs which pass the higher threshold are
always considered duplicates.
4.3 The Tuple Similarity Measure
Before defining a tuple similarity measure, inherent problems of duplicate
detection in unaligned relations need to be identified. After a review of related work,
the tuple similarity measure is defined.
4.3.1 Inherent Problems
As stated above, the k most similar tuple pairs are to be returned by the DUMAS
duplicate detector. To do so, a reasonable tuple similarity measure tupsim has
to be defined, which produces high similarity scores for tuple pairs that are
duplicates and lower scores for non-duplicates. In addition, it has to overcome
several problems:
1. Unknown schema alignment: It is unclear which field in one tuple to
compare with which field in the other.
2. Partial schema overlap: Not all fields in one tuple necessarily have a
matching partner in the other. With only few corresponding attributes,
the similarity of two tuples is typically low.
3. Unknown attribute semantics: We cannot make use of domain knowledge
to formulate an effective comparison measure. Common duplicate detec-
tion methods use manually or statistically created rules that are based on
the similarity of certain corresponding attributes: E.g., the sorted
neighborhood method requires a domain expert to formulate duplicate detection
rules that are applied to determine if two tuples represent the same
real-world entity. When the semantics of the attributes are not known,
meaningful rules cannot be created. Instead, a comparison measure that
is independent of the fields’ semantics and the application domain must
be applied.
4. Misleading value similarities: In many cases, attribute values of tuples
that do not represent the same real-world entity are similar. We distin-
guish two cases:
(a) Corresponding attributes: Two non-duplicate tuples have the same
or highly similar value in two attributes which do correspond. If
those two tuples were considered duplicates, and thus, used in the
matching step, some correspondences would not be detected because
the tuples are false duplicates. E.g., if the birthplace of two persons
matches, it cannot be deduced that they are duplicates, because very
many people might have been born in a given place. Using such non-
duplicate tuples that have a similar birthplace would result in missed
matches between other attributes.
(b) Non-corresponding attributes: Two values can be similar even when
their attributes do not match, e.g., because the attributes have the
same domain. Consider tuples $r_7$ and $s_5$ in Figure 3.1: Both tuples
have an attribute value ‘kate’, but are not duplicates. Without knowledge
of the correct attribute matches, such a value match can mislead
duplicate detection. This problem is closely related to Problem 1.
Problem 3 is tackled by defining a domain-independent tuple similarity measure.
In addition, domain independence facilitates the application of the DUMAS
table matcher in a wide range of scenarios. String similarity measures
appear to be a good choice despite their drawbacks: Much data in databases
can be represented in some form of text, there is a wide variety of domain-
independent string similarity measures, and string similarity measures have been
successfully used in other deduplication work. Apart from duplicate detection,
string similarity measures have also been studied in other research areas, e.g.,
information retrieval [BYRN99], text classification [Seb02], and computational
biology [Gus97]. In the following, a number of string similarity measures with
their advantages and disadvantages are described based on the survey by Cohen
et al. [CRF03]. Afterwards, the tuple similarity measure tupsim is defined.
4.3.2 String Similarity Measures
In general, a similarity measure sim assigns a large score to objects that are
similar and a low score to objects that are different. In many cases, the similarity
score is normalized such that it is in the interval [0,1]. In contrast, a distance
measure dist assigns low scores to similar objects and high scores to dissimilar
objects. If a distance measure dist produces scores in the range [0,1], it can be
translated into a similarity measure sim using the formula
$$sim(a, b) = 1 - dist(a, b) \qquad (4.1)$$
where a and b are the objects to be compared. Because the distance scores are
in the range [0,1], the similarity scores are also in that interval. A similarity
measure can be translated into a distance measure in a similar fashion.
A distance measure dist is a metric if it fulfills the following properties:
1. Non-negativity: $dist(a, b) \geq 0$,
2. Identity of indiscernibles: $dist(a, b) = 0$ if and only if $a = b$,
3. Symmetry: $dist(a, b) = dist(b, a)$,
4. Triangle inequality: $dist(a, c) \leq dist(a, b) + dist(b, c)$.
Searching for nearest neighbors in a metric space can be efficiently performed
using metric indices [CNBYM01, HS03], an important consideration when
choosing a tuple similarity measure.
According to Cohen et al., string similarity or distance measures can be
classified into the following three categories: (i) edit-distance like functions, (ii)
token-based functions, and (iii) hybrid functions [CRF03].
Edit-distance Like Functions
Edit-distance like functions are derived from the classic edit distance, which is
the cost of transforming one string into another using three operations: insert a
character, remove a character, and substitute a character¹. Several variants of
edit distance assign different costs to edit operations. The Levenshtein distance
uses unit cost for each operation. Thus, the edit distance becomes the minimum
number of edit operations needed to transform one string into the other. It can
be shown that the Levenshtein distance is a metric because it fulfills the metric
properties defined above.
The edit distance between two strings $a = a_1 \ldots a_m$ and $b = b_1 \ldots b_n$ is
computed using the following recursive formula:
$$C(i, j) = \min \begin{cases} C(i-1, j) + 1 \\ C(i, j-1) + 1 \\ C(i-1, j-1) + c(a_i, b_j) \end{cases} \qquad (4.2)$$
where $C(i, j)$ is the cost of transforming string $a_1 \ldots a_i$ into string $b_1 \ldots b_j$ using
the three edit operations, and $c(a_i, b_j)$ is the cost of substituting character $a_i$
with $b_j$. The cost of substituting a character is 0 if the character is substituted
with itself or 1 if it is substituted with another character. Eq. 4.2 states that
the edit distance of two strings $a = a_1 \ldots a_m$ and $b = b_1 \ldots b_n$ is the minimum
of (i) the cost of transforming prefix $a_1 \ldots a_{m-1}$ to $b$ and removing character
$a_m$, (ii) the cost of transforming $a$ to prefix $b_1 \ldots b_{n-1}$ and adding character $b_n$,
and (iii) the cost of transforming $a_1 \ldots a_{m-1}$ to $b_1 \ldots b_{n-1}$ and substituting $a_m$
with $b_n$.
       r  u  m  o  r  s
    0  1  2  3  4  5  6
 d  1  1  2  3  4  5  6
 u  2  2  1  2  3  4  5
 m  3  3  2  1  2  3  4
 a  4  4  3  2  2  3  4
 s  5  5  4  3  3  3  3
Figure 4.2: Edit distance computation for “dumas” and “rumors”.
¹ Substituting a character with itself has zero cost in all edit-distance like functions.
Fig. 4.2 shows the matrix that is constructed in the process of computing the
edit distance between “dumas” and “rumors”. Note that the first row and the
first column represent the empty source string and target string, respectively. As
can be seen in the bottom right cell of the matrix, the distance between the two
strings is 3. In addition, the matrix also shows the edit distances between any
prefixes of the two strings. The cost of computing the edit distance is O(mn).
However, the space requirement is only O(n): Although Eq. 4.2 presents a
recursive definition, a dynamic programming algorithm can compute the edit
distance column by column (from left to right) or row by row (top-down), and
thus, only has to store the latest column or row, respectively.
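The row-by-row computation can be sketched as follows. This is a minimal illustration of Eq. 4.2 with unit costs that stores only the previous row of the matrix; the function name levenshtein is chosen for this example.

```python
def levenshtein(a, b):
    """Unit-cost edit distance, keeping only the previous row of the matrix."""
    prev = list(range(len(b) + 1))          # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # cost of removing the first i characters of a
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # remove ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute (cost 0 if equal)
        prev = curr
    return prev[-1]


distance = levenshtein("dumas", "rumors")   # bottom right cell of Fig. 4.2
```

Each pass of the outer loop produces one row of the matrix in Fig. 4.2, so the memory needed is proportional to the length of b, while the running time remains O(mn).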
It can be easily seen that the distance score is not in the range [0,1] but can
be normalized. Given the Levenshtein edit distance ed(a, b) between two strings
a and b, the normalized edit distance ned can be computed as follows:
$$ned(a, b) = \frac{ed(a, b)}{\max(|a|, |b|)} \qquad (4.3)$$
where $|a|$ and $|b|$ are the lengths of strings a and b, respectively. The distance
is divided by the length of the longer string because, as can easily be shown,
the Levenshtein distance cannot be larger than the length of the longer string.
Unfortunately, the triangle inequality does not hold for the normalized edit
distance. This can be demonstrated using the strings a= “ab”, b= “aba”, and
c= “ba”:
$$ed(a, b) = 1 \qquad ed(b, c) = 1 \qquad ed(a, c) = 2$$
$$ned(a, b) = \tfrac{1}{3} \qquad ned(b, c) = \tfrac{1}{3} \qquad ned(a, c) = 1 \qquad (4.4)$$
Thus, $ned(a, c) \not\leq ned(a, b) + ned(b, c)$. The loss of triangle inequality does
not affect the quality of similarity scores, but complicates efficient search for
similar strings: Metric indices and search space pruning methods that exploit
triangle inequality cannot be directly applied [HS03].
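The counterexample can be checked mechanically. The sketch below recomputes the values of Eq. 4.4 with exact fractions; the helper names ed and ned mirror the notation of the text, and returning 0 for two empty strings is an assumption, since Eq. 4.3 is undefined in that case.

```python
from fractions import Fraction


def ed(a, b):
    """Unit-cost Levenshtein distance (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def ned(a, b):
    """Normalized edit distance of Eq. 4.3 as an exact fraction."""
    longest = max(len(a), len(b))
    return Fraction(ed(a, b), longest) if longest else Fraction(0)


a, b, c = "ab", "aba", "ba"
violates_triangle = ned(a, c) > ned(a, b) + ned(b, c)   # 1 > 1/3 + 1/3
```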
While the Levenshtein measure is very simple and easy to compute, it does
not always reflect the ‘true’ distance between strings. To illustrate, assume a list
of street names. Parts of street names which appear frequently are sometimes
abbreviated: ‘Street’, ‘Avenue’, and ‘Court’ are sometimes written as ‘St’, ‘Ave’,
and ‘Ct’, respectively. A good distance measure should indicate that ‘Street’
is close to ‘St’. Several variants of the Levenshtein distance that reflect the
natural distance between strings more closely have been studied. E.g., the
Smith-Waterman distance assigns different costs to the three edit operations [SW81].
In addition, the cost for starting a gap (by inserting or removing characters) is
larger than for continuing a gap. Monge and Elkan have successfully used this
scheme in name matching tasks [ME96].
Token-based Functions
In contrast to edit distance, token-based functions require preprocessing of a
string: It has to be split into a set of tokens (or terms). Afterwards, the string
can be considered an unordered multiset (or bag) of tokens. Several token-based
functions exist. A simple example is the Jaccard similarity, which is computed
as
$$Jaccard(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (4.5)$$
where A and B are multiset representations of strings.
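A minimal sketch of Eq. 4.5 on multisets follows. Splitting on whitespace, lowercasing, and the value returned for two empty strings are assumptions of this example, as the text does not prescribe a tokenizer.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two strings viewed as bags of whitespace tokens."""
    A, B = Counter(a.lower().split()), Counter(b.lower().split())
    union = sum((A | B).values())           # multiset union: maximum of the counts
    if union == 0:
        return 1.0                          # assumption: two empty strings are identical
    return sum((A & B).values()) / union    # multiset intersection: minimum of the counts


score = jaccard("Main Street 12", "12 Main St")   # 2 shared tokens out of 4
```

The example also shows the limitation that motivates hybrid functions: ‘Street’ and ‘St’ contribute nothing to the score because only identical tokens count.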
A very well-known token-based measure is the cosine measure with TFIDF
weighting, or TFIDF measure for short, which has been heavily used in infor-
mation retrieval [BYRN99]. In principle, the cosine measure requires the strings
to be represented as term weight vectors with unit length. The similarity of two
strings is computed as the dot product of their vector representations, which is
equal to the cosine of the angle between the vectors. When TFIDF weighting is
applied, each term is given a weight depending on its term frequency (TF), i.e.,
the number of times it appears in a string, and its inverse document frequency
(IDF), which is the inverse of the number of strings in which the term appears.
In other words, a token is given a large weight if it appears often in a given
string or if it appears in very few strings. The unnormalized weight $w'(a, t)$ of
a term t in a string a is computed as
$$w'(a, t) = \log(tf_{a,t} + 1) \cdot \log\left(\frac{N}{df_t} + 1\right) \qquad (4.6)$$
where $tf_{a,t}$ is the number of times term t appears in a (i.e., its term frequency),
N is the overall number of strings, and $df_t$ is the number of strings that t appears
in (i.e., its document frequency). Note that the weight of a term that does not
appear in a string is zero because its term frequency is zero. These weights are
normalized such that the resulting weight vector has unit length. The TFIDF
similarity tfidf(a, b) of two strings a and b is defined as
$$tfidf(a, b) = \sum_{t \in a \cap b} w(a, t) \cdot w(b, t) \qquad (4.7)$$
where $w(a, t)$ is the normalized weight of term t in string a. Only terms that
appear in both strings have to be considered when comparing two strings: A
term that does not occur in a string has a weight of zero, and thus,
the product of the weights for that term is zero. It can be shown that, after
being translated into a distance measure, TFIDF similarity does not fulfill the
triangle inequality property, and thus, is not a metric.
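Eqs. 4.6 and 4.7 can be sketched in a few lines. The whitespace tokenizer, the function names tfidf_vectors and tfidf, and the three sample strings are assumptions of this example. The natural logarithm is used; since a change of base scales every weight by the same constant, the normalized weights are the same for any base.

```python
import math
from collections import Counter


def tfidf_vectors(strings):
    """Unit-length TFIDF weight vectors (Eq. 4.6) for a list of strings."""
    bags = [Counter(s.lower().split()) for s in strings]
    n = len(bags)
    df = Counter(t for bag in bags for t in bag)   # number of strings containing each token
    vectors = []
    for bag in bags:
        raw = {t: math.log(tf + 1) * math.log(n / df[t] + 1) for t, tf in bag.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        vectors.append({t: w / norm for t, w in raw.items()} if norm else {})
    return vectors


def tfidf(u, v):
    """Dot product over the tokens shared by two weight vectors (Eq. 4.7)."""
    if len(u) > len(v):
        u, v = v, u                                # iterate over the shorter vector
    return sum(w * v[t] for t, w in u.items() if t in v)


v0, v1, v2 = tfidf_vectors(["sam adams portland",
                            "adams sam 8127100",
                            "kate miller portland"])
closer = tfidf(v0, v1) > tfidf(v0, v2)             # two shared tokens beat one
```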
Hybrid Functions
Similarity measures that use other similarity functions are called hybrid
functions. The similarity measure used by the hybrid function is called the secondary
similarity function. Cohen et al. present two hybrid measures: the recursive
matching scheme and SoftTFIDF [CRF03].
The recursive matching scheme rms computes the similarity of two strings
a and b as
$$rms(a, b) = \frac{1}{K} \sum_{i=1}^{K} \max_{j=1}^{L} sim'(a_i, b_j) \qquad (4.8)$$
where $a_i$ and $b_j$ are the i-th and j-th token in a and b, respectively, K is the
number of tokens in a, and L is the number of tokens in b. Intuitively, each
token of a is assigned the token in b which is most similar according to the
secondary similarity function $sim'$. It can be easily shown that the recursive
matching scheme is not symmetric.
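The asymmetry is easy to see in a small sketch of Eq. 4.8. The secondary similarity used here (1 for equal tokens, 0 otherwise) is a deliberately trivial assumption; any function with values in [0,1] could be passed instead. Both token lists are assumed to be non-empty.

```python
def rms(a_tokens, b_tokens, sim):
    """Average, over the tokens of a, of the best secondary similarity to a token of b."""
    return sum(max(sim(x, y) for y in b_tokens) for x in a_tokens) / len(a_tokens)


def exact(x, y):
    """Trivial secondary similarity (an assumption): 1 for equal tokens, else 0."""
    return 1.0 if x == y else 0.0


a, b = ["adams"], ["adams", "sam"]
forward = rms(a, b, exact)    # every token of a finds a perfect partner in b
backward = rms(b, a, exact)   # "sam" has no partner in a
```

Here forward is 1.0 while backward is 0.5, because the normalization uses only the number of tokens of the first argument.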
SoftTFIDF is a “soft” version of TFIDF that allows for errors in terms.
Because only equal terms contribute to the final similarity score, small errors
unduly decrease similarity (see Eq. 4.7). To compensate, SoftTFIDF also
considers tokens that are similar according to the secondary measure $sim'$. Let
$CLOSE(\theta, a, b)$ be the set of terms $a_i \in a$ such that there is at least one term
$b_j \in b$ that is very similar (i.e., its similarity is above a given threshold $\theta$):
$$CLOSE(\theta, a, b) = \{a_i \in a \mid \exists b_j \in b,\ sim'(a_i, b_j) > \theta\}. \qquad (4.9)$$
The SoftTFIDF similarity softtfidf of two strings a and b is defined as:
$$softtfidf(a, b) = \sum_{t \in CLOSE(\theta, a, b)} w(a, t) \cdot w(b, t') \cdot sim'(t, t') \qquad (4.10)$$
where $t'$ is the token in b that is most similar to t according to the secondary
similarity function $sim'$, and $w(a, t)$ is the normalized TFIDF weight of term t
in a as defined above.
4.3.3 The Tuple Similarity Measure tupsim
To determine the similarity of two tuples, we consider each of them as a single
string. Such a string is created by concatenating a tuple’s attribute values. The
tuple similarity tupsim should create reasonable similarity scores as discussed
in Sec. 4.2. In particular, it has to consider the inherent problems described in
Sec. 4.3.1. At the same time it must also be efficient. The efficiency of duplicate
detection is not only determined by the cost of calculating the actual similarity
of two tuples, but also by the number of tuples which have to be compared.
In the following, we will discuss the similarity measures described above and
argue why the TFIDF measure is the best choice for tupsim.
Edit-distance like functions are order-dependent: Changing the order of a
tuple’s attributes greatly affects the edit distance to another tuple if a tuple
is considered a single string. Thus, edit distance is not a good choice with
respect to Problem 1. A solution to this problem would be to add another
operation for moving whole blocks of text. However, such variants are known to
be computationally expensive, and thus, are not further regarded as a possible
tuple similarity measure [LT97].
Of course, edit distance could be used as a secondary measure in one of
the hybrid functions. Instead of considering a tuple as a single string, one could
compare the attribute values and apply the recursive matching scheme to create
a combined score. Eq. 4.8 could be directly applied: $a_i$ and $b_j$ would represent
attribute values of tuples a and b, respectively. The recursive matching scheme
was not used as tuple similarity measure for reasons of efficiency.
In contrast to the above mentioned similarity measures, the TFIDF measure
has features that make it a reasonable choice for a tuple similarity measure:
1. Order independence: The TFIDF measure is a token-based measure, and
thus, order-independent because each string is represented as a bag of
tokens. By considering each tuple as a single string and applying an
order-independent string similarity measure we tackle Problem 1.
2. TFIDF weighting: The inverse document frequency, which is part of the
TFIDF weighting scheme, gives a large weight to terms which appear
infrequently. Thus, if a certain value appears in many tuples, it gets a
relatively low weight and has a smaller effect on the similarity of
two tuples. Hence, the case described in Problem 4a is tackled by applying
TFIDF weighting.
3. Efficient top-k search: Efficient algorithms for detecting the k most similar
strings exist and can be applied for duplicate detection.
Problems 4b and 2 cannot be directly solved. However, we experimentally
show that the TFIDF measure still performs very well even in difficult scenarios.
Because of its useful properties, the TFIDF measure is used to determine the
similarity of two tuples in the duplicate detection step. As discussed above, all
tuples are translated into strings by concatenating their attribute values. When
it is clear from the context, the string is given the same name as the tuple which
it represents. The tuple similarity tupsim of two tuples r and s is defined as:

    tupsim(r, s) = tfidf(r, s) = \sum_{t \in r \cap s} w(r, t) \cdot w(s, t)    (4.11)

where w is the normalized weight as used in Eq. 4.7.
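As an illustration of Eq. 4.11, the following Python sketch computes the tuple similarity for tuples that have been flattened into strings. Eq. 4.7 is not repeated here, so the sketch assumes a common TFIDF variant (log-scaled term frequency times a smoothed log inverse document frequency, normalized to unit length, with statistics taken per table). The helper names tokenize and tfidf_weights are hypothetical and not taken from the implementation.

```python
import math
import re
from collections import Counter


def tokenize(tuple_values):
    """Concatenate the attribute values of a tuple and split into lower-case tokens."""
    text = " ".join(str(v) for v in tuple_values if v is not None)
    return re.findall(r"\w+", text.lower())


def tfidf_weights(tuples):
    """Return one {term: normalized weight} dict per tuple of a table.

    Assumption: weight = log(tf + 1) * log(1 + n / df), scaled to unit length.
    """
    bags = [Counter(tokenize(t)) for t in tuples]
    n = len(bags)
    df = Counter(term for bag in bags for term in bag)  # document frequency
    weights = []
    for bag in bags:
        raw = {t: math.log(tf + 1) * math.log(1 + n / df[t]) for t, tf in bag.items()}
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        weights.append({t: v / norm for t, v in raw.items()})
    return weights


def tupsim(w_r, w_s):
    """Eq. 4.11: sum of weight products over the terms shared by r and s."""
    return sum(w * w_s[t] for t, w in w_r.items() if t in w_s)
```

Because each tuple is treated as a bag of tokens, two tuples that hold the same values in a different attribute order receive identical weight vectors, which is the order independence required by Problem 1.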
We point out that the tuple similarity measure can benefit from any normal-
ization procedure: As shown in Eq. 4.11, tupsim only considers equal terms, and
a slight variation might have a large impact on the tuple similarity. The nor-
malization of strings potentially improves duplicate detection accuracy: E.g.,
the value “86” in an attribute Year could be changed to “1986”, if all year
entries in the other database are four-digit numbers. However, that kind of
preprocessing requires a good understanding of the attributes’ semantics by
the user. Although there are application-independent normalization techniques
(e.g., stemming [BYRN99]), most normalization measures require reasonable
knowledge of the schemata, which we do not expect. We also stress that only
a few duplicates are required. Thus, if only some of the duplicate tuples have
value discrepancies as described above, our duplicate detection procedure still
produces a good result.
SoftTFIDF is also applicable for duplicate detection in unaligned relations
because it has the same features as TFIDF. However, it is more expensive to
compute, and searching for similar tuples is more complicated. Because very
good results can already be achieved with the simpler TFIDF measure (see
Sec 4.6), SoftTFIDF is not further considered.
4.4 Searching For Duplicates
Duplicate detection is inherently a problem with quadratic complexity: Each
tuple in the first table has to be compared with each tuple in the second table.
Such exhaustive search is clearly infeasible for larger data sets. Reducing the
number of tuple comparisons is very important to make duplicate detection
scalable. Existing solutions to this problem cannot be applied for the same
reason why the duplicate detection algorithms discussed in Sec. 3.4 cannot be
used: The semantics of the attributes and correspondences are unknown. E.g.,
an expert needs to define sort keys when the sorted neighborhood method is
applied, which requires extensive knowledge of the application domain and the
semantics of the attributes. Consequently, an algorithm that is only based on
the tuple similarity measure has to be devised. The goal of such an algorithm
is to detect the top-k duplicates, i.e., the k most similar tuple pairs, with a
minimum number of tuple comparisons.
A first improvement over exhaustive search would be to only consider tuples
which have at least one term in common, because only tuples that share at
least a single term have non-zero similarity (Eq. 4.11). A semi-naïve algorithm
would pick each tuple from the source table and look for tuples in the target
database that contain at least one of its tokens. This lookup can be efficiently
performed using an inverted index on the target table, which maps terms to
tuple identifiers [BYRN99]. The top-k duplicates can be easily extracted from
the resulting set of tuple pairs with non-zero similarity.
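The semi-naïve strategy can be sketched in a few lines of Python. The sketch assumes that the normalized term weights of every tuple are already available as dictionaries (one per tuple). The function names and the comparison counter are hypothetical and serve only to make the saving over exhaustive search visible.

```python
import heapq
from collections import defaultdict


def build_inverted_index(target_weights):
    """Map each term to the identifiers of the target tuples containing it."""
    index = defaultdict(set)
    for sid, w_s in enumerate(target_weights):
        for term in w_s:
            index[term].add(sid)
    return index


def semi_naive_top_k(source_weights, target_weights, k):
    """Score only tuple pairs that share at least one term; return the k best."""
    index = build_inverted_index(target_weights)
    scored = []
    comparisons = 0
    for rid, w_r in enumerate(source_weights):
        candidates = set()
        for term in w_r:
            candidates |= index.get(term, set())
        for sid in candidates:
            w_s = target_weights[sid]
            sim = sum(w * w_s[t] for t, w in w_r.items() if t in w_s)
            comparisons += 1
            scored.append((sim, rid, sid))
    return heapq.nlargest(k, scored), comparisons
```

With two source tuples and three target tuples, exhaustive search would score six pairs, whereas this sketch scores only the pairs that actually share a term.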
Compared to exhaustive search, the semi-naïve algorithm achieves a
major reduction in the number of tuple comparisons in most scenarios. However,
it ignores the TFIDF weighting of tokens, and thus, misses the chance for a larger
performance gain. The algorithm computes the similarity even of tuples that
share only low-weight terms although those terms have only a small effect on
the similarity score. Hence, the semi-naïve algorithm is considered suboptimal
because tuples which have only low-weight terms in common are less likely
to be contained in the set of top-k tuple pairs. A more intelligent algorithm
would search for the top-k duplicates by finding tuple pairs that have high-
weight tokens in common, and stop searching when no tuple pairs with a high
similarity score can be expected.
This idea is realized in the implementation of the duplicate detection al-
gorithm, which is an adaptation of the Whirl algorithm for similarity joins in
relational databases [Coh98]. Whirl performs A* search in the space of possible
tuple pairs. A* is a widely known best-first search algorithm that finds a path
from a given start state to a goal state with the smallest cost. In each iteration,
the algorithm picks the state n from a list of open states with the smallest
assigned cost. If the state is a goal state, then it is presented as the result. If it is
an intermediate state, the graph is traversed further, and new states are added
to the list of open states. The cost f(n) of a state n is calculated as

    f(n) = g(n) + h(n)    (4.12)

where g(n) is the actual cost of the path from the source to state n and h(n) is
the estimated cost of the path from n to the closest goal state. A* is optimal
if h(n) is an admissible heuristic, i.e., it never overestimates the cost to reach a
goal. In most cases h(n) is defined to be zero if n is a goal state.
In the duplicate detection implementation each state is a four-tuple ⟨r, s, b, e⟩,
where r represents a source tuple, s represents a target tuple, b is the current
bound, and e is the exclusion list. Both r and s can be either unbound (denoted
as ⊥) or bound to a tuple. Based on the values of r and s, three state types
are distinguished: (i) A state is a start state when both r and s are unbound,
(ii) a state is an intermediate state when only r is bound, and (iii) a state is a
goal state if both r and s are bound. The exclusion list e is a list of tokens
which may not be contained in target tuples; the intention of this list is made
clear in the description of the algorithm below. The bound b is the maximum
similarity of two tuples that can be reached from the given state. Note that the
goal is to maximize similarity as opposed to minimizing cost. Thus, b must be an
overestimate instead of an underestimate. The bound function B(r, s) is defined
as:

    B(r, s) = \begin{cases}
        \infty & \text{if } r = s = \bot \\
        B(r) & \text{if } r \neq \bot,\ s = \bot \\
        tupsim(r, s) & \text{if } r \neq \bot,\ s \neq \bot
    \end{cases}    (4.13)
The function B(r) determines the bound for intermediate states. It is computed as

    B(r) = \sum_{t \notin e} w(r, t) \cdot maxweight(t)    (4.14)

where t is a term that does not appear in the exclusion list e, and maxweight(t)
is the maximum weight of term t in the target relation. The maximum weight
of a term is stored as additional information in the inverted index, and thus,
can be efficiently retrieved.
Before describing the actual search procedure, it has to be noted that not
the entire search graph needs to be kept in main memory, but only a list of
open states, which is called OPEN. As mentioned in the description of A*,
the state with the smallest cost (largest bound) needs to be extracted in each
iteration. Hence, the OPEN list is implemented as a priority queue, which
allows insertion of a single state and removal of the state with the largest bound
in O(log n) [CLR01].
The duplicate detection algorithm is depicted in Alg. 1. Variables are initial-
ized at the beginning: The result set result is set to the empty set (line 1), while
the list of open states OPEN contains the start state s_0 (line 2). The following
loop is executed until (i) the list of open states is empty or (ii) k goal states
have been found. At the beginning of the loop, the current state s becomes the
state with the largest bound (line 4), which is also removed from the OPEN
list (line 5). If the extracted state is a goal state, then it is added to the result
set result (line 7). Otherwise, child states are created for state s and added to
the list of open states (line 9). The creation of child states is described below.
The result set is returned after the loop has terminated (line 12).
Algorithm 1: A* search for top-k duplicates (adapted from [Coh98])
Output: Set of states representing the k most similar tuple pairs
 1  result := {};
 2  OPEN := {s_0};
 3  while OPEN ≠ ∅ ∧ |result| < k do
 4      s := argmax_{s' ∈ OPEN} B(s');
 5      OPEN := OPEN \ {s};
 6      if goalState(s) then
 7          result := result ∪ {s};
 8      else
 9          OPEN := OPEN ∪ children(s);
10      end
11  end
12  return result
The duplicate detection procedure uses two operations to create child
states: explode and constrain. The explode operation is performed only on
the start state, while constrain is performed on intermediate states.
Explode: At the beginning of the duplicate detection procedure, the OPEN
list contains only the start state, in which r and s are unbound, e = {} (i.e., e is
empty), and B(r, s) is as defined above. In the first phase of the algorithm,
this state is extracted from the OPEN list and “exploded”: For each source
tuple r_1, ..., r_m, a new state is created where r is bound to the source tuple,
s is unbound, e is empty, and b is computed as defined in Eq. 4.14. Those
intermediate states are inserted into the OPEN list.
After the explode phase the iterative part of A* starts. In each iteration,
the state with the largest bound is extracted from the OPEN list. If it is a
goal state, it is added to the result set. The algorithm terminates if it is the
k-th element in the result, because only k duplicates need to be found. The
algorithm continues if more tuple pairs are required. If the extracted state is an
intermediate state, then it must be constrained.
Constrain: An intermediate state that has been extracted from the OPEN
list is constrained by “creating” its child states and adding them to the open
list. The child states have the same source tuple r, but either a bound target
tuple s or an extended exclusion list e. They are created as follows: A term t
that appears in the source tuple r, but not in the exclusion list of the state, is
picked, and target tuples containing t are extracted using the inverted index on
the target relation. From the extracted tuples only the l tuples which do not
contain any term of the exclusion list are used to create l + 1 new states: l goal
states in which the target tuple is bound, and an intermediate state in which
the target tuple remains unbound, but the term t is added to the exclusion list.
Because a new term has been added to the exclusion list, the bound of the
new intermediate state is lower than the bound of its parent. In order to reduce
the number of tuple comparisons, a term t should be chosen such that the bound
of intermediate states quickly decreases. Thus, we pick a term t that maximizes
w(r, t) · maxweight(t), because its insertion into the exclusion list has the largest
effect on the bound of the derived intermediate state (see Eq. 4.14).
Besides the computation of the bound of intermediate states, the intention of
the exclusion list is also to avoid creating the same tuple combination twice: If
a term t_i appears in the exclusion list of an intermediate state for source tuple r,
it implies that for all target tuples s containing t_i, there already is or has been a
goal state in the OPEN list with r as source tuple and s as target tuple. If such an
intermediate state is constrained using term t_j, then no goal states for target
tuples containing t_i are created.
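The search procedure described in this section (explode, constrain, and the priority queue for the open list) can be summarized in the following Python sketch. It is a minimal illustration under stated assumptions, not the actual implementation: the normalized term weights are assumed to be given as one dictionary per tuple, the start state is exploded directly instead of being placed in the queue, intermediate states whose bound has dropped to zero are discarded, and all identifiers are hypothetical.

```python
import heapq
from collections import defaultdict
from itertools import count


def top_k_duplicates(source_weights, target_weights, k):
    """Best-first search for the k most similar (source, target) tuple pairs."""
    # Inverted index on the target table plus the maximum weight of each term.
    postings = defaultdict(list)
    maxweight = defaultdict(float)
    for sid, w_s in enumerate(target_weights):
        for term, w in w_s.items():
            postings[term].append(sid)
            maxweight[term] = max(maxweight[term], w)

    def bound(w_r, excluded):
        # Eq. 4.14: upper bound on the similarity still reachable for r.
        return sum(w * maxweight.get(t, 0.0) for t, w in w_r.items() if t not in excluded)

    def tupsim(w_r, w_s):
        return sum(w * w_s[t] for t, w in w_r.items() if t in w_s)

    tie = count()   # tie-breaker so that the heap never compares states directly
    open_list = []  # max-heap emulated by storing negated bounds

    def push(b, rid, sid, excluded):
        heapq.heappush(open_list, (-b, next(tie), rid, sid, excluded))

    # Explode: one intermediate state per source tuple, empty exclusion list.
    for rid, w_r in enumerate(source_weights):
        push(bound(w_r, frozenset()), rid, None, frozenset())

    result = []
    while open_list and len(result) < k:
        neg_b, _, rid, sid, excluded = heapq.heappop(open_list)
        if sid is not None:  # goal state: its bound is the actual similarity
            result.append((rid, sid, -neg_b))
            continue
        # Constrain: pick the term with the largest contribution to the bound.
        w_r = source_weights[rid]
        free = [t for t in w_r if t not in excluded]
        if not free:
            continue
        t = max(free, key=lambda term: w_r[term] * maxweight.get(term, 0.0))
        for cand in postings.get(t, []):
            # Skip targets containing an excluded term: they were created before.
            if excluded.isdisjoint(target_weights[cand]):
                push(tupsim(w_r, target_weights[cand]), rid, cand, excluded)
        new_excluded = excluded | {t}
        b = bound(w_r, new_excluded)
        if b > 0:
            push(b, rid, None, new_excluded)
    return result
```

A goal state is only returned once no intermediate state with a larger bound remains, so the pairs are emitted in order of decreasing similarity. The exclusion list guarantees that a target tuple sharing several terms with the source tuple is paired with it only once.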
4.5 The Effect of Sampling on the Number of Duplicates
The search for duplicates described in Sec. 4.4 is performed in main memory.
Depending on the size of the data sets and the available RAM, one might run
out of memory if all existing data is searched for duplicates. To make duplicate
detection with the search algorithm feasible, the size of the tables needs to be
reduced: When the tables under consideration contain fewer tuples, the search
space, and thus the OPEN list, is smaller, and memory overflow can be avoided.
To reduce the table sizes, random Bernoulli sampling is applied in the
implementation of the duplicate detection step. In this scheme, each tuple of a
table R has an independent chance of p_R to enter the sample. We call p_R the
sampling rate of relation R. If a table R has n tuples, and a sampling rate of
p_R is used, the expected size s of the sample is E[s] = p_R · n.
The goal is to create a sample of size s such that duplicate detection is
possible. At the same time, as many tuples as possible should be contained in
the sample to ensure that enough duplicates can be found. Thus, if a table has
n tuples, and the size of the sample should be roughly s tuples, then the sampling
rate is set to p_R = s/n. The actual size of the sample is unlikely to be exactly s,
because the chance of a tuple to enter the sample is independent of other tuples. Note that
Bernoulli sampling has been implemented in major DBMS, e.g., DB2 [GGLZ04],
and thus, can be efficiently computed.
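Outside a DBMS, the sampling scheme can be sketched as follows. The function name, the seed parameter, and the capping of the rate at 1.0 are assumptions made for this illustration.

```python
import random


def bernoulli_sample(rows, target_size, seed=None):
    """Keep each row independently with probability p = target_size / n."""
    rng = random.Random(seed)
    n = len(rows)
    # Cap the rate at 1.0 for tables smaller than the requested sample.
    p = min(1.0, target_size / n) if n else 0.0
    sample = [row for row in rows if rng.random() < p]
    return sample, p
```

The returned sample has an expected size of p · n, but its actual size varies from run to run, as noted above.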
Unfortunately, random sampling has a negative effect on the number of
duplicates. Fig. 4.3 shows an example with two tables R and S containing four
duplicates (r1, s1), (r2, s4), (r3, s3), and (r4, s2) depicted by black bars. The
arrows indicate which tuples represent the same real-world entity. Both tables
are sampled with a rate of 50%, i.e., p_R = p_S = 0.5. The area above the dashed
lines represents the part of the relations that is in the sample. Note that in each
sample two tuples, which were considered duplicates in the original data set, can
be found: r1, r2, s1, and s2. However, when considering the sampled tuples,
only a single duplicate remains: (r1, s1). The number of duplicates decreases
more than the sampling rate indicates because the matching tuples of r2 and
s2, namely s4 and r4, respectively, are lost in the sampling process.

Figure 4.3: Effect of sampling on the number of duplicates
Given the number of duplicates d, the sampling rates p_R and p_S for relations
R and S, respectively, the number of duplicates in the sample d_sample is expected
to be:

    E[d_{sample}] = d \cdot p_R \cdot p_S    (4.15)

Each of the d tuples in R, which has a matching tuple in S, has a chance of p_R
to be in the sample. For each of those tuples that made it into the sample, its
matching tuple has a chance of p_S to be in the sample of S. Thus, the expected
value of d_sample is the product of the number of duplicates and the sampling
rates.
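As a quick check of Eq. 4.15, the first line below uses the values of the example in Fig. 4.3 (d = 4, p_R = p_S = 0.5). The result coincides with the single remaining duplicate (r1, s1). The second line uses hypothetical numbers chosen only for illustration.

```latex
\begin{align*}
E[d_{sample}] &= 4 \cdot 0.5 \cdot 0.5 = 1 \\
E[d_{sample}] &= 100 \cdot 0.1 \cdot 0.2 = 2
\end{align*}
```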
To summarize, sampling facilitates the detection of a few duplicates even
when the underlying data sets are large. However, sampling also reduces the
number of duplicates relative to the size of the relations. The experiments de-
scribed in Sec. 4.6.2 indicate that duplicate detection is feasible even when only
very few duplicates exist. Those experiments also show that the precision of
duplicate detection decreases at a certain recall level. This precision decrease
also affects the number of duplicates that can be used for table matching: If pre-
cision decreases at recall level r, the number of tuple pairs used in the matching
step should be at most r · E[d_sample] to ensure that no false duplicates are used.
4.6 Experimental Evaluation
To evaluate the performance of our duplicate detection procedure, both in terms
of effectiveness and efficiency, a number of experiments were performed. To
demonstrate the applicability in a real-world scenario, data extracted from real
estate advertisements were used (Sec. 4.6.1). Because the results were always
perfect, additional experiments on generated data were performed to see how
the duplicate detection procedure performs in critical configurations (Sec. 4.6.2).
In particular, the effect of Problems 4b and 2 described in Sec. 4.3.1 needs to
be studied.
The quality of duplicate detection in aligned relations is measured in terms
of precision and recall, which are calculated as

    Precision = \frac{|D \cap R|}{|R|}, \qquad Recall = \frac{|D \cap R|}{|D|}    (4.16)

where D is the set of true duplicates, and R is the set of retrieved duplicates.
Intuitively, precision is large if there are only few false duplicates (false positives)
in the result set, while a large recall is achieved when only few duplicates are
missed (false negatives).
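Eq. 4.16 translates directly into set operations. In the following sketch, duplicates are represented as pairs of tuple identifiers. The function name and the convention of returning 1.0 for an empty denominator are assumptions.

```python
def precision_recall(true_duplicates, retrieved):
    """Eq. 4.16 with D = true duplicates and R = retrieved duplicates."""
    d, r = set(true_duplicates), set(retrieved)
    hits = len(d & r)
    precision = hits / len(r) if r else 1.0  # assumed convention for empty R
    recall = hits / len(d) if d else 1.0     # assumed convention for empty D
    return precision, recall
```

For example, if one of two retrieved pairs is among four true duplicates, the sketch yields a precision of 0.5 and a recall of 0.25.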
Note that there is usually a tradeoff between the two measures: Being very
strict and allowing no or few errors benefits precision, but many duplicates
might be missed. On the other hand, allowing more errors increases the chance
that dissimilar duplicates are found, but also increases the possibility of false
duplicates entering the result set. Classic duplicate detection algorithms that
work on aligned tables aim at achieving both high recall and high precision. As
described above, maximizing both measures is next to impossible, and hence,
balancing a duplicate detection process is critical. In the case of duplicate
detection in unaligned relations, we are only interested in precision, because the
goal is only to detect a few duplicates that are most likely true duplicates.
4.6.1 Real-world Data: Real Estate Advertisements
For the experiments in a real-world scenario, data sets containing real estate
advertisements of two Berlin newspapers (“Morgenpost” and “Tagesspiegel”)
were used. The data was extracted from their web sites in two consecutive
weeks, so altogether four data sets were used. The number of tuples ranged
between 1509 and 3772. For each possible combination of data sets the top-10
duplicates were detected.
The precision of duplicate detection was always 100% (thus, no graphs are
provided). The reasons for this perfect result are hidden in the properties of the
used data sets:
1. Large extensional overlap: Some people or companies tend to place their
advertisements in both newspapers and in consecutive weeks to increase
their chance of success. Thus, there are several duplicates both when
comparing data from different newspapers and different weeks.
2. Large intensional overlap: Most attributes are present in both
schemata. In particular, the more relevant attributes (e.g., name of the
advertising person or company, address of the advertised apartment, con-
tact, etc.) can always be found in advertisements.
3. Similar representations: The information that is most relevant for dupli-
cate detection, e.g., name of the advertiser, address, or phone number,
is represented in the same fashion across advertisements. Even some of
the abbreviations and acronyms used to describe a flat are similar across
newspapers. The latter has only a small effect: If the same abbreviation
is used in many advertisements, its TFIDF weight will be low.
4.6.2 Experiments on Generated Data
The experiments on real-world data did not demonstrate the effectiveness of the
tuple similarity measure in critical configurations. To do so, experiments have
to be performed to answer the following three questions, which correspond to
the three properties of the real estate advertisements described in Sec. 4.6.1:
1. Decreased extensional overlap: How effective is the tuple similarity mea-
sure when there are only few duplicates?
2. Decreased intensional overlap: What is the precision of duplicate detection
when only few attributes correspond?
3. Erroneous information: How much is duplicate detection affected by errors
in the data (e.g., misspellings)?
All data sets for the following experiments were created using the dirty data
generator G2² used in [BSS03], which improves the database generator used
in [HS98] to create more realistic data. The tool allowed us to inject fuzzy
duplicates, where fuzziness is controlled for each attribute through error proba-
bility parameters for different types of errors, such as “replacement error” and
“deletion error”. Thus, the effect of errors is always reflected in the presented
results (Question 3).
For each experiment we generated two databases DB1 and DB2, each with
5,000 tuples. Figure 4.4 shows the baseline setup for the following experiments
(the schemata for DB1 and DB2) and the correct matching. Unbeknownst
to the duplicate detection algorithm, the two databases have six attributes in
common, and each has two or more additional attributes. Attribute values were
randomly chosen from long, predefined lists of values. Note that the attribute
pairs Birth-place and City as well as Birth-district and District draw values from
the same domains, making duplicate detection challenging. Finally, the columns
were randomly shuffled. All reported results are averages of five independent
runs with newly created databases for each experiment.
Reduced Extensional Overlap
To examine the effect of the number of duplicates on the precision of the du-
plicate detection result (Question 1), data sets for the baseline setup (Fig. 4.4)
containing 10, 50, and 100 duplicates were generated. The experimental results
depicted in Fig. 4.5 show the precision at certain recall levels, averaged over the
five different data sets used for each setup. In this experiment, the number of
duplicates detected by the algorithm was not fixed; instead, a list of tuple
pairs ranked by decreasing tuple similarity was produced. Hence, a recall level
of 20% is the point in the list where 20% of all true duplicates have been detected.
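The notion of precision at a recall level can be made concrete with a short sketch. It walks down the ranked list until the requested fraction of true duplicates has been seen and reports the precision at that position. The function name and the small tolerance used when rounding the required number of hits are assumptions.

```python
import math


def precision_at_recall(ranked_pairs, true_duplicates, level):
    """Precision at the list position where `level` (0 < level <= 1) recall is reached.

    Returns None if the ranked list never reaches the requested recall level.
    """
    d = set(true_duplicates)
    if not d:
        return None
    # Number of true duplicates needed; the tolerance guards against float noise.
    needed = max(1, math.ceil(level * len(d) - 1e-9))
    hits = 0
    for position, pair in enumerate(ranked_pairs, start=1):
        if pair in d:
            hits += 1
            if hits >= needed:
                return hits / position
    return None
```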
The result shows that the precision at early recall levels is always large:
Depending on the extensional overlap, precision drops below 100% between
40% and 60% recall. Because our goal is only to detect a few (i.e., the most
similar) tuple pairs, this is a very encouraging result. Only the top-ranked tuple
pairs are relevant for the DUMAS table matcher, and the experiment indicates
that the tuple similarity measure produces the largest similarity scores only for
true duplicates despite a small extensional overlap.

² Kindly provided to us by Luca De Santis of the DaQuinCIS team at Università di Roma
“La Sapienza”.

Attr. of DB1      Attr. of DB2
SSN               -
Profession        -
Surname           Surname
Name              Name
Birth-date        Birth-date
Birth-place       Birth-place
Birth-district    Birth-district
Sex               Sex
-                 City
-                 District
Figure 4.4: The correct matching from DB1 to DB2

Figure 4.5: Influence of number of duplicates (duplicate precision (%) over duplicate recall (%); curves for 10, 50, and 100 duplicates in the DBs)
Reduced Intensional Overlap
The next experiment is designed to assess the influence of intensional overlap,
i.e., the number of common attributes in both databases. The schema of DB1
remains the same (see Fig. 4.4). From the schema of DB2 the overlapping at-
tributes Sex,Birth-district, and Birth-place are successively removed, replacing
them by new attributes Street number,Address, and Postal code. Note that
by each of those changes, a single correspondence is removed. We use a fixed
number of 50 duplicates. The result of these experiments is depicted in Fig. 4.6.
As expected, intensional overlap does have an effect on the precision. When
fewer attributes correspond, the effect of similar values in those attributes on the
tuple similarity decreases. Fortunately, only a minor decrease in precision can
be observed in most cases. The decrease can only become severe if unrelated
attributes have the same value domain (e.g., Birth-place and City in Fig. 4.4).
In such a case, Problem 4b described in Sec. 4.3.1 arises.

Figure 4.6: Influence of degree of intensional overlap (duplicate precision (%) over duplicate recall (%); curves for 3, 4, 5, and 6 common attributes)
Attr. of DB1      Attr. of DB2
SSN               -
Profession        -
Sex               -
Birth-district    -
Surname           Surname
Name              Name
Birth-date        Birth-date
Birth-place       Birth-place
-                 City
-                 District
-                 Street no
-                 Address
(a) Four corresponding attributes

Attr. of DB1      Attr. of DB2
SSN               -
Profession        -
Sex               -
Birth-district    -
Birth-place       -
Surname           Surname
Name              Name
Birth-date        Birth-date
-                 City
-                 District
-                 Street no
-                 Address
-                 Postal code
(b) Three corresponding attributes

Figure 4.7: Reduced intensional overlap: four and three matches.
Note that this problem is already present in the baseline setup: Birth-place
and Birth-district in DB1 have the same domain as City and District in DB2,
respectively (Fig. 4.4). However, the two attributes in DB1 have a matching
partner in DB2. Thus, the values of City and District do not mislead duplicate
detection. In the step from five to four corresponding attributes, Birth-district
is removed from DB2, resulting in the configuration depicted in Fig. 4.7(a).
Despite the fact that there is no matching partner for attribute Birth-district
anymore, no significant decrease could be detected. The reason for this lies in
the TFIDF weighting scheme: There are only few possible districts, resulting
in a small inverse document frequency, and thus, the district has only a small
effect on the tuple similarity.
The problem becomes severe in the configuration with three correspondences
(Fig. 4.7(b)). In this experiment, attribute Birth-place is also removed. How-
ever, attribute City is still in DB2. In that configuration, precision of duplicate
detection is heavily decreased because (i) these attributes draw values from the
same domain, (ii) the impact of a given city on the tuple similarity is significant,
and (iii) the three remaining correspondences cannot sufficiently compensate.
4.7 Discussion
In this chapter we discuss the problem of duplicate detection in unaligned re-
lations, which is critical for the feasibility of duplicate-based schema matching,
but has not been considered before. We address several challenges that do not
occur in duplicate detection when correspondences are known. To find the most
similar tuple pairs, we define a tuple similarity measure tupsim and describe an
algorithm to efficiently find duplicates.
It has been shown that most of the inherent problems defined in Sec. 4.3.1
could be resolved:
• The problem of unknown schema alignment was tackled by ignoring the
  record structure and considering each tuple as a single string (Problem 1).
• By using a domain-independent string similarity measure, the duplicate
  detection process is not affected by the lack of information about the
  semantics of the attributes (Problem 3).
• The TFIDF weighting scheme, which is used in the tuple similarity
  measure, helps in avoiding the problem of misleading attribute similarities in
  corresponding attributes (Problem 4a).
The problem of misleading attribute similarities that appear in non-corres-
ponding attributes (Problem 4b) is more difficult to solve: If attribute values
match by chance, we still expect the duplicate detection to produce good results.
If this case appears frequently, e.g., when non-corresponding attributes have
the same value domain, then duplicate detection precision can drop if only few
attribute correspondences exist. We consider that problem and the problem of
a small intensional overlap (Problem 2) in our evaluation of the algorithm.
The experiments show that it is possible to detect a few duplicates in un-
aligned relations with very good precision. The properties of the real estate
advertisements described in Sec. 4.6.1 are favorable to our duplicate detection
procedure. However, this is not an unusual scenario, and similar properties can
be found in other application domains, too. In contrast, the generated data sets
allowed us to gauge how effective the tupsim measure is in critical scenarios. It
could be seen that the top-ranked duplicates can be trusted even when only few
duplicates exist and, to some extent, when only few attributes correspond.
The experiments also show that false duplicates can be produced if Prob-
lem 4b occurs in conjunction with a small intensional overlap (Problem 2). In
the following chapter, we evaluate how such false duplicates affect the result-
ing schema matching, and will examine a way to increase precision of duplicate
detection using known correspondences.
Chapter 5
The Matching Step: Extracting Correspondences From Duplicates
The matching step of the DUMAS table matcher uses the duplicates extracted
from the tables to establish a matching. Each of the duplicates indicates a
certain matching, which might disagree with other duplicates. To create a
single matching, the duplicates’ indications have to be merged. Afterwards, the
resulting correspondences need to be checked, and if some of them are uncertain,
the table matching process needs to iterate again to produce more duplicates
(see Fig. 5.1).
5.1 Establishing Correspondences By Aggregating Duplicate Votes
Each of the duplicates detected in the previous step indicates to what extent
attributes of the tables are related. These indications can be extracted by
pairwise comparison of attribute values. E.g., duplicate tuples r9 and s2 in Fig. 3.1
both have some attribute values in common: The string “Dean” can be found
in LastName and LN, while “(369) 3663624” is the value of Phone and Tel.
Hence, this duplicate indicates the matching {({LastName}, {LN}), ({Phone},
{Tel})}. This idea is stated in the matching heuristic: If two attribute values of
a pair of tuples representing the same real-world entity are equal or highly simi-
lar, then those attributes are likely to correspond. In contrast, attributes whose
values differ are probably semantically different. In other words, each duplicate
gives a “vote” that shows how closely attributes of the tables are related.
The indications inherent in the duplicates are not always correct or
unambiguous. The duplicate tuple pair (r3, s3) in Fig. 3.1 has the same value for
attributes Tel, Phone, and Fax. Thus, based on this duplicate, it cannot
be unambiguously deduced which of the two attributes Phone and Fax
corresponds to Tel. Duplicate (r4, s4), which is also depicted in Fig. 3.3, has the
same value “Sam”¹ for attributes FirstName and Acc, indicating a false match.
The effect of such false indications can be reduced by aggregating the votes of
several duplicates instead of just using a single one.
Thus, the matching step can be separated into two substeps:
1. Pairwise Attribute Value Comparison: The attribute values of each dupli-
cate need to be compared. The result of attribute value comparison is a
similarity matrix for each duplicate (Sec. 5.2).
2. Aggregation and Reasoning: The similarity matrices need to be aggre-
gated, and a schema matching needs to be extracted (Sec. 5.3).
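The two substeps can be illustrated with a small sketch. It is deliberately simplified and makes two loud assumptions: the character-based ratio from Python's difflib is only a stand-in for the field similarity measure of Sec. 5.2.1, and element-wise averaging is just one conceivable aggregation, not necessarily the procedure of Sec. 5.3. All function names are hypothetical.

```python
from difflib import SequenceMatcher


def field_similarity(a, b):
    """Stand-in field similarity: case-insensitive character-level ratio in [0, 1]."""
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def similarity_matrix(r, s):
    """Substep 1: compare every attribute value of r with every attribute value of s."""
    return [[field_similarity(a, b) for b in s] for a in r]


def average_matrices(matrices):
    """Substep 2 (assumed aggregation): element-wise mean of the duplicates' votes."""
    n = len(matrices)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / n for j in range(cols)]
            for i in range(rows)]
```

In the aggregated matrix, an entry close to 1 for a pair of attribute positions means that most duplicates carry equal or highly similar values there, while a chance agreement in a single duplicate is damped by the other votes.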
After establishing a matching, the algorithm has to decide if the correspon-
dences can be trusted (Sec. 5.4). If some of the correspondences are uncer-
tain, the process should resume with detecting more duplicates, as depicted
in Fig. 5.1. In contrast to the problem of duplicate detection in unaligned
schemata, the correspondences which are considered certain by the algorithm
can be used. To do so, an extended tuple similarity measure has to be defined
that takes certain correspondences into consideration (Sec. 5.5).
Figure 5.1: The duplicate-based schema matching process
5.2 Comparing Attribute Values of Duplicates
The first phase of the matching step is the pairwise comparison of attribute
values. Given a duplicate (r, s), one has to determine the similarity of each
attribute value in r with each attribute value in s. To compute a similarity
score, the field similarity measure fieldsim, which is defined in Sec. 5.2.1, will
be used. Based on the similarity scores, a similarity matrix for each duplicate
is constructed as described in Sec. 5.2.2.

¹ Recall that the similarity measure is case-insensitive because the tokenizer normalizes
strings.
5.2.1 The field similarity measure fieldsim
Similar to the detection of duplicates, a domain-independent string similarity
measure will be used to determine the similarity of attribute values. However,
there are two differences that have to be considered when designing the field
similarity measure:
1. Lengths of values: In contrast to tuples, which are comprised of several
attribute values, the strings the field similarity measure has to deal with
are usually short.
2. Accuracy: The field similarity measure should closely reflect the “true”
similarity of attribute values. While in duplicate detection it is tolerable
to miss duplicates due to too low similarity scores, a low score for similar
values might directly result in a missed correspondence. Analogously, a
similarity score that is too high can easily produce a false match.
The discussion of possible field similarity measures is based on the overview
of string similarity measures in Sec. 4.3.2. In general, the measure should con-
sider possible errors (e.g., misspellings) when determining the similarity of two
attribute values. Using the TFIDF measure for comparing attribute values is
unlikely to produce good results, because it only considers equal terms (Eq. 4.7).
While it has been a good choice as a tuple similarity measure, due to the limited
size of attribute values even small errors might cause a dramatic decrease in the
similarity score. As described above, this can lead to missed correspondences.
For similar reasons, other token-based measures are rejected.
Edit-distance-like functions are very useful for short strings, and thus a better
choice than token-based measures. However, they are also order-dependent.
This might not be disadvantageous for many attributes, but in some cases dif-
ferent representations with varying orders of terms are possible: E.g., if there is
a single attribute for the name of a person, Sam Adams can be represented as
“Sam Adams” or “Adams, Sam”. The field similarity measure should be able
to recognize both attribute values to be similar.
For the reasons described above, the SoftTFIDF measure (Eq. 4.10) is used
as the field similarity measure fieldsim. As discussed in Sec. 4.3.2, SoftTFIDF
is order independent and also considers highly similar terms in addition to equal
terms. To determine the similarity of terms, the normalized Levenshtein edit
distance ned (Eq. 4.3) is transformed into a similarity function. The term
similarity termsim of two terms t and t' is computed as follows:
$$\mathrm{termsim}(t, t') = 1 - \mathrm{ned}(t, t') = 1 - \frac{\mathrm{ed}(t, t')}{\max(|t|, |t'|)} \tag{5.1}$$
The field similarity fieldsim of two attribute values a and b is defined as:
$$\mathrm{fieldsim}(a, b) = \sum_{t \in \mathrm{CLOSE}(\theta, a, b)} w(a, t) \cdot w(b, t') \cdot \mathrm{termsim}(t, t') \tag{5.2}$$
where t' is the term in b that is most similar to t according to the term similarity
measure termsim, and CLOSE(θ, a, b) is a subset of all terms in a with
$$\mathrm{CLOSE}(\theta, a, b) = \{\, t \in a \mid \exists\, t' \in b,\ \mathrm{termsim}(t, t') > \theta \,\} \tag{5.3}$$
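Eqs. 5.1 to 5.3 can be illustrated with a minimal Python sketch. It assumes that the term weights w(a, t) and w(b, t') of Sec. 4.3.2 have already been computed and are passed in as dictionaries; the function names and the example weights are illustrative only and are not taken from the DUMAS implementation.

```python
def edit_distance(s, t):
    """Levenshtein edit distance, computed one matrix row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def termsim(t, u):
    """Eq. 5.1: one minus the normalized edit distance."""
    if not t and not u:
        return 1.0
    return 1.0 - edit_distance(t, u) / max(len(t), len(u))


def fieldsim(weights_a, weights_b, theta):
    """Eq. 5.2 for two attribute values given as {term: weight} dictionaries.

    The weights are assumed to be precomputed (hypothetical input format).
    """
    if not weights_b:
        return 0.0
    total = 0.0
    for t, w_a in weights_a.items():
        # t' is the term of b that is most similar to t.
        best = max(weights_b, key=lambda u: termsim(t, u))
        sim = termsim(t, best)
        if sim > theta:  # t is a member of CLOSE(theta, a, b), Eq. 5.3
            total += w_a * weights_b[best] * sim
    return total


# Hypothetical unit-length weights for "Sam Adams" and "Adams, Sam".
value_a = {"sam": 0.6, "adams": 0.8}
value_b = {"adams": 0.8, "sam": 0.6}
print(fieldsim(value_a, value_b, theta=0.5))
```

Because each term of a is paired with its best partner in b regardless of position, the two name variants in this toy example receive a score of (approximately) one, which is the order independence argued for above.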
One reason why SoftTFIDF was preferred over the recursive matching scheme
(Eq. 4.8), which is also order-independent and considers similar terms, is the
TFIDF weighting scheme: E.g., when comparing attribute values “Microsoft
Inc.” and “Micosoft” one notices that, besides a spelling error, one of two terms
is missing in the second value. Thus, one could deduce that both values are very
different. However, the term “Inc.” is a standard abbreviation used in many
company names; removing it leaves the company name still comprehensible.
Because the term is part of very many company names, its TFIDF weight is low,
and thus, its removal does not have a large effect on the SoftTFIDF similarity.
The SoftTFIDF similarity is more expensive to compute than TFIDF. How-
ever, as opposed to the duplicate detection step, this is not a big issue because
only the attribute values of a few tuple pairs have to be compared.
5.2.2 Creating the Similarity Matrix
                 John   Dean   m   (369) 3663624   (367) 3663625
Dean             0      1.0    0   0               0
jd               0      0      0   0               0
(369) 3663624    0      0      0   1               0.87
XP               0      0      0   0               0
Table 5.1: Similarity matrix for a duplicate pair
Given a single duplicate tuple pair, the field similarity measure is used to
compute the similarity of their attribute values, i.e., each attribute value of one
tuple is compared with each attribute value of the other tuple. The similarity
scores for a single tuple pair are represented as a similarity matrix. Tab. 5.1 shows
the similarity matrix for duplicate (r9, s2) of Fig. 3.1, where the threshold θ of
Eq. 5.3 was set to 0.5.
5.3 Aggregating and Reasoning
Because a single duplicate might coincidentally indicate a false correspondence
or miss a correspondence, several tuple pairs are used. In the following, we
describe how the similarity matrices are aggregated and how correspondences
are extracted.
5.3.1 Creating The Average Similarity Matrix By Aggregation
For each detected duplicate a similarity matrix is created. Those matrices are
merged into the average similarity matrix M by computing their average, i.e.,
$$M = \frac{1}{k} \sum_{i=1}^{k} M_i \tag{5.4}$$
where k is the number of duplicates and $M_i$ is the i-th similarity matrix. The
average similarity matrix describes the similarity of the tables’ attributes based
on the values of the duplicates. Tab. 5.2 depicts the average similarity matrix
for the running example, which has been created using the three duplicate tuple
pairs. By M(a, b) we denote the average similarity score of attributes a and b
in the matrix M.
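The aggregation of Eq. 5.4 amounts to an element-wise mean. The following sketch assumes that each similarity matrix is a list of rows with identical dimensions; the three 2×2 matrices are made-up toy values and are not the matrices of the running example.

```python
def average_similarity_matrix(matrices):
    """Eq. 5.4: element-wise average of k equally sized similarity matrices."""
    k = len(matrices)
    if k == 0:
        raise ValueError("at least one similarity matrix is required")
    n_rows, n_cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(m[i][j] for m in matrices) / k for j in range(n_cols)]
            for i in range(n_rows)]


# Hypothetical similarity matrices of three duplicates.
toy_matrices = [
    [[1.0, 0.0], [0.0, 0.9]],
    [[1.0, 0.0], [0.3, 0.6]],
    [[1.0, 0.0], [0.0, 0.6]],
]
average = average_similarity_matrix(toy_matrices)
print(average)
```

A score that appears in only one of the three matrices (0.3 in the toy data) is damped to a third of its value, which is how a coincidental similarity in a single duplicate loses influence.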
       FirstName  LastName  Sex  Phone  Fax
LN     0          1.0       0    0      0
Acc    0.33       0         0    0      0
Tel    0          0         0    1      0.73
OS     0          0         0    0      0
Table 5.2: Average similarity matrix
5.3.2 Reasoning: How To Extract Attribute Correspondences
Based on the average similarity matrix a simple matching has to be established.
The goal is to produce a set of correspondences such that each attribute cor-
responds to at most one other attribute. As a first step, a graph matching is
extracted from the matrix. In general, a matching on a graph is a subgraph with
the same nodes and a subset of the edges such that each node is incident with
at most one edge. The graph on which the matching is performed is constructed
from the average similarity matrix as follows: Each attribute is represented as
a node, and for each element M(a, b) in the average similarity matrix there is
an edge between the node representing attribute a and the node representing
attribute b with the weight M(a, b). It can be easily shown that the resulting
graph is bipartite.
Several criteria for choosing a graph matching are conceivable. We evalu-
ated both stable marriage [GI89] and maximum weight matching [Gal86] using
synthetic data. The stable marriage problem is formulated as follows: Given a
set of n men and n women, marry them off in pairs after each man has ranked
the women in order of preference from 1 to n, $\{w_1, \ldots, w_n\}$, and each woman has
done likewise, $\{m_1, \ldots, m_n\}$. If the resulting set of marriages contains no pairs
of the form $\{m_i, w_j\}$, $\{m_k, w_l\}$ such that $m_i$ prefers $w_l$ to $w_j$ and $w_l$ prefers
$m_i$ to $m_k$, the marriage is said to be stable. When applied to our problem,
attributes of the source table and the target table represent men and women,
respectively, and the average similarity scores are used to rank partners. In
contrast, a maximum weight matching is a matching that maximizes the sum of
the weights of the edges.
The experiments could not prove any strategy to be superior, but we believe
that the similarity scores closely reflect the similarity of attributes, and do not
only represent a means of ranking possible partners. Hence, in the DUMAS table
matcher, a maximum weight matching is computed. The Hungarian Method is
applied for that purpose [PS82].
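[Figure 5.2: A graph matching. Bipartite graph between the source attributes
FirstName, LastName, Sex, Phone, Fax and the target attributes LN, Acc, Tel, OS;
the edges of the matching carry the weights 1.0, 0.33, 1.0, and 0.0.]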
We denote by GM the set of attribute pairs that is indicated by the edges
of the produced graph matching. For the matching in Fig. 5.2, which depicts
the maximum weight matching of the bipartite graph that is constructed from
the matrix in Tab. 5.2, this set of attribute pairs is
$$GM = \{(\mathit{FirstName}, \mathit{Acc}), (\mathit{LastName}, \mathit{LN}), (\mathit{Sex}, \mathit{OS}), (\mathit{Phone}, \mathit{Tel})\}.$$
One can see that the correct correspondences are included, but also a few false
matches. To avoid the latter, edges with a weight below a given threshold $\theta_{prune}$
are removed in a final pruning step to produce the pruned graph matching GM'.
Given a pruning threshold $\theta_{prune} = 0.5$, the pruned graph matching is
$$GM' = \{(\mathit{LastName}, \mathit{LN}), (\mathit{Phone}, \mathit{Tel})\}.$$
The attribute pairs in GM' are translated into pairs of attribute groups,
which constitute the resulting simple matching M:
$$M = \{(\{\mathit{LastName}\}, \{\mathit{LN}\}), (\{\mathit{Phone}\}, \{\mathit{Tel}\})\}.$$
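The extraction and pruning steps can be retraced on the matrix of Tab. 5.2 with a short sketch. Note that it does not implement the Hungarian Method: for a matrix of this size, enumerating all assignments is sufficient and easier to follow. The function names are illustrative, and the brute-force search assumes that the matrix has no more rows than columns.

```python
from itertools import permutations


def max_weight_matching(matrix):
    """Brute-force maximum weight matching on a small bipartite graph.

    Rows and columns are the two node sets; requires len(rows) <= len(cols).
    Returns the matched (row, column) index pairs and their total weight.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    best_score, best_cols = -1.0, None
    for cols in permutations(range(n_cols), n_rows):
        score = sum(matrix[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best_cols = score, cols
    return list(enumerate(best_cols)), best_score


def prune(matrix, matching, theta_prune):
    """Remove edges whose weight is below the pruning threshold."""
    return [(i, j) for i, j in matching if matrix[i][j] >= theta_prune]


# Average similarity matrix of Tab. 5.2 (rows: target, columns: source).
targets = ["LN", "Acc", "Tel", "OS"]
sources = ["FirstName", "LastName", "Sex", "Phone", "Fax"]
M = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.33, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.73],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

matching, score = max_weight_matching(M)
pruned = [(sources[j], targets[i]) for i, j in prune(M, matching, 0.5)]
print(score, pruned)
```

The partner chosen for OS is arbitrary, because all of its edges have weight zero; the pruning step removes that edge in any case.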
5.4 Certainty of Attribute Correspondences
Given the average similarity matrix of Tab. 5.2, it can be seen that the
correspondence ({LastName}, {LN}) is very certain: There is no attribute that is
nearly as similar to LastName as LN, and vice versa. Attribute Tel is a
different case: In many cases, the value of Phone is very similar or even equal to
the value of Fax, resulting in a large similarity score of Tel with both attributes.
We call the match ({Phone}, {Tel}) uncertain because there is an alternative
matching, in which Fax matches with Tel, which is almost as good as
({Phone}, {Tel}). Because the correspondence is uncertain, we have to search for
additional duplicates to strengthen the correspondence of Tel either with Phone
or with Fax.
Correspondences are checked for certainty after a schema matching has been
established. If all correspondences are certain, the schema matching is presented
to the user. Otherwise, the certain correspondences are used to improve the
matching by detecting more duplicates with better precision (see Sec. 5.5).
The certainty check is performed as follows: First, a score for GM, from which
the schema matching has been derived, is computed. This score is the sum of
the average field similarities, which are taken from the average field similarity
matrix M:
$$\mathrm{score}(M) = \mathrm{score}(GM) = \sum_{(a,b) \in GM} M(a, b) \tag{5.5}$$
Afterwards, each correspondence is considered separately. For each
correspondence ({a}, {b}) in M, a matrix $M_{(a,b)}$ is constructed, which is identical
to the average similarity matrix M, except that the similarity score for (a, b) is
set to zero:
$$M_{(a,b)}(x, y) = \begin{cases} 0 & \text{if } x = a \wedge y = b \\ M(x, y) & \text{otherwise.} \end{cases}$$
Afterwards, the maximum weight matching for $M_{(a,b)}$ is computed. Let
$GM_{(a,b)}$ be the set of edges in that matching, and $\mathrm{score}(GM_{(a,b)})$ be the score
of the graph matching as defined above. The correspondence ({a}, {b}) is certain
if the score of the new matching is not close to the score of the original matching,
i.e., if the difference $\mathrm{score}(GM) - \mathrm{score}(GM_{(a,b)})$ is above a threshold $\theta_{certain}$.
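The certainty check can be sketched as follows. The sketch treats a correspondence as certain when removing it lowers the best matching score by more than $\theta_{certain}$, which is the reading consistent with the Phone/Tel example; the brute-force scoring function and the threshold value 0.5 are assumptions made for illustration.

```python
from itertools import permutations


def matching_score(matrix):
    """Score of a maximum weight matching (brute force, rows <= columns)."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    return max(sum(matrix[i][j] for i, j in enumerate(cols))
               for cols in permutations(range(n_cols), n_rows))


def score_drop(matrix, a, b):
    """score(GM) - score(GM_(a,b)): loss caused by forbidding the edge (a, b)."""
    masked = [row[:] for row in matrix]
    masked[a][b] = 0.0  # the matrix M_(a,b)
    return matching_score(matrix) - matching_score(masked)


def is_certain(matrix, a, b, theta_certain):
    """A correspondence is certain if no alternative matching comes close."""
    return score_drop(matrix, a, b) > theta_certain


# Tab. 5.2; rows: LN, Acc, Tel, OS; columns: FirstName, LastName, Sex, Phone, Fax.
M = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.33, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.73],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
print(score_drop(M, 0, 1))  # (LastName, LN): the score drops by about 1.0
print(score_drop(M, 2, 3))  # (Phone, Tel): the score drops by about 0.27
```

With a hypothetical $\theta_{certain} = 0.5$, (LastName, LN) is classified as certain and (Phone, Tel) as uncertain, because Fax can take the place of Phone at a cost of only 0.27.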
If not all correspondences are certain, the table matching process iterates
back to the duplicate detection step. In the following, we will describe how cer-
tain correspondences can be used to improve the precision of duplicate detection.
If in the subsequent schema matching step no additional certain
correspondences are detected, the matching M (including uncertain correspondences) is
presented to the user.
5.5 The Extended Tuple Similarity Measure
The tuple similarity measure tupsim considers each tuple as a single string,
and thus, ignores known correspondences. This can lead to suboptimal results.
When some correspondences are known, these should be exploited to achieve
better precision. Such correspondences can be produced by a previous run of
the DUMAS table matcher as described above or by another schema matcher.
In the following we consider the problem of duplicate detection in partially
aligned relations: Given two tables R and S, and a subset $M_{part}$ of the
correspondences between R and S, find tuples in R and S that represent the same
real world entities. On the one hand, this problem is different from the
problem of duplicate detection in unaligned tables, because the attributes which
are known to be in the matching can be compared with their corresponding
partners. Thus, at least for those attributes Problem 4b described in Sec. 4.3.1
can be solved. On the other hand, the problem is more difficult than duplicate
detection in aligned tables, because not all correspondences are known. Thus, it
has to be assumed that the unmatched part might contain valuable information
that must be used to find duplicates.
To perform duplicate detection in partially aligned relations, we define an
extended tuple similarity measure etupsim. This measure should reflect the
characteristics of the problem discussed above: The attributes that participate
in the known matching must be compared with their corresponding partners,
and the unmatched part has to be taken into consideration as well.
In the DUMAS table matcher, the extended tuple similarity etupsim of two
tuples r and s given a partial matching $M_{part}$ of simple correspondences is
computed as
$$\mathrm{etupsim}(r, s) = \frac{1}{|M_{part}| + 1} \left( \mathrm{tupsim}(r_u, s_u) + \sum_{(\{a\},\{b\}) \in M_{part}} \mathrm{fieldsim}(r[a], s[b]) \right) \tag{5.6}$$
where $r_u$ and $s_u$ are the unmatched parts of r and s, respectively (i.e., the
concatenated values of the attributes which do not correspond according to
$M_{part}$). Intuitively, the extended similarity of two tuples is the average of the
field similarity scores of their corresponding attributes and the similarity of the unmatched
part. Note that each correspondence has the same impact on the similarity
score: Because the semantics of the correspondences are unknown, it cannot be
determined which attributes are more important for duplicate detection than
others. If this were known, one could easily adapt the formula by weighting the
correspondences. Also note that the unmatched part has the same impact as
each single correspondence. Thus, the more correspondences are known, the
less influence the unmatched part has on the extended tuple similarity. If no
correspondences are known, then etupsim is equal to tupsim. Thus, etupsim is
a generalization of tupsim.
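The structure of Eq. 5.6 can be made concrete with a small sketch. Tuples are assumed to be dictionaries from attribute names to values, and the similarity functions are passed in as parameters. The token-based Jaccard function used here is only a stand-in for tupsim and fieldsim (it is not the TFIDF or SoftTFIDF measure of this thesis), and the two toy tuples are hypothetical.

```python
def etupsim(r, s, m_part, tupsim, fieldsim):
    """Eq. 5.6: average of the field similarities of the known correspondences
    and the tuple similarity of the unmatched parts.

    r, s:    dictionaries mapping attribute names to string values
    m_part:  list of (a, b) pairs, the known simple correspondences
    """
    matched_r = {a for a, _ in m_part}
    matched_s = {b for _, b in m_part}
    # Unmatched parts: concatenated values of attributes without a partner.
    r_u = " ".join(v for k, v in r.items() if k not in matched_r)
    s_u = " ".join(v for k, v in s.items() if k not in matched_s)
    total = tupsim(r_u, s_u) + sum(fieldsim(r[a], s[b]) for a, b in m_part)
    return total / (len(m_part) + 1)


def jaccard(x, y):
    """Stand-in similarity on lower-cased whitespace tokens (assumption)."""
    tx, ty = set(x.lower().split()), set(y.lower().split())
    if not tx or not ty:
        return 0.0
    return len(tx & ty) / len(tx | ty)


# Hypothetical tuples, loosely based on the values of Tab. 5.1.
r = {"FirstName": "John", "LastName": "Dean", "Phone": "(369) 3663624"}
s = {"LN": "Dean", "Acc": "jd", "Tel": "(369) 3663624"}

print(etupsim(r, s, [], jaccard, jaccard))                    # plain tupsim
print(etupsim(r, s, [("LastName", "LN")], jaccard, jaccard))  # one known pair
```

With an empty partial matching the function reduces to the tuple similarity of the complete tuples, which reflects the generalization property stated above.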
Another extension to the similarity measure that potentially increases the
accuracy of duplicate detection is to ignore attributes that are known not to
correspond to other attributes. This kind of knowledge can be easily included
in the definition of etupsim: Values of attributes that certainly do not match
are removed from the unmatched parts $r_u$ and $s_u$ in Eq. 5.6, i.e., only values of
attributes that potentially match are concatenated. The effect of this measure
can be demonstrated when considering the scenario of three matching attributes
in Sec. 4.6.2: The precision of duplicate detection is much lower than in other ex-
periments because the source contains attributes Birth-place and Birth-district,
while the target has attributes City and District in its schema. None of those
attributes has a corresponding partner, but Birth-place and Birth-district have
the same value domain as City and District, respectively. If the user could
determine that some of those attributes, e.g., City and District, are not related
to other attributes, and consequently remove their values from consideration,
Problem 4b would not occur as frequently, and thus, the precision of duplicate
detection at early recall levels would increase.
However, we believe that this kind of knowledge is hard to gain. Common
schema matching algorithms, including the DUMAS table matcher, only pro-
duce correspondences between attributes that are thought to be related, but
do not determine if two attributes are unrelated. Ruling out the possibility
that unmatched attributes might have corresponding partners contradicts the
assumption that the provided matching is only partial.
5.6 Searching For Duplicates with etupsim
To find the most similar tuple pairs based on etupsim in a reasonable amount
of time, an algorithm that finds the top-k duplicates with a minimum number
of tuple comparisons must be applied. Although a few correspondences are
known, existing algorithms are not an option: They always require user-defined
input, e.g., a sort key for the sorted neighborhood method or blocking criteria
in record linkage. Instead, we devise a new algorithm that is solely based on
the etupsim measure. This algorithm has been successfully implemented and
tested on synthetic data [Kon06].
5.6.1 The Duplicate Detection Algorithm
A* search has proven to be very efficient in the detection of similar tuples based
on tupsim both in terms of the number of tuple comparisons and actual runtime.
Thus, we decided to use the same principal algorithm for duplicate detection
with etupsim. The basic search procedure shown in Alg. 1 on page 63 can be
directly applied. The only point of change is the constraining phase, which
requires the adaptation of the bound function and the restructuring of search
space states.
The algorithm starts with a start state, in which both the source tuple and
target tuple are not known. In the exploding phase, one intermediate state is
created for each source tuple and inserted into the OPEN priority queue. In
each iteration of the constraining phase, the state with the largest bound is
picked from the queue. If it is a goal state (i.e., both source tuple and target
tuple are known), then it is presented as a result. Otherwise, child states are
created using the children function.
Recall from Sec. 4.4 that the bound function must define an upper limit on
the similarity for each tuple pair that can be derived from a given state. We
define the extended bound function EB as:
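$$EB(r, s) = \begin{cases} \infty & \text{if } r = s = \bot \\ EB(r) & \text{if } r \neq \bot,\ s = \bot \\ \mathrm{etupsim}(r, s) & \text{if } r \neq \bot,\ s \neq \bot \end{cases} \tag{5.7}$$
where r and s are the source and target tuples described by the state, and ⊥
stands for a tuple that is not yet known. The start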
state has a predefined bound of infinity, while the bound of a goal state is equal
to the etupsim similarity of the two tuples. The bound for intermediate states
EB(r) is more complex to compute: The source tuple is known, but there are
various possible target tuples. Similar to Whirl search, we constrain possible
target tuples by excluding certain terms. In contrast to the duplicate detection
algorithm for tupsim, the new algorithm requires more than one exclusion list:
The extended tuple similarity measure is computed based on the field similarity
of matching attributes and the tuple similarity of the unmatched part. Because
those components are considered independent of each other, we need to maintain
several exclusion lists for each state: one for each known correspondence and one
for the unmatched part. The semantics of the exclusion lists remains the same:
If the exclusion list e[b] of a correspondence ({a},{b}) contains a term t, then the
goal states derived from the given state cannot have a target tuple that contains
term t in the value of attribute b. The same holds for the unmatched part. These
lists of excluded terms are updated in each invocation of the children function.
The bound function for intermediate states EB(r) uses the exclusion lists to
compute an upper limit on the similarity of tuple pairs (r, s). It is defined as the
average of the upper bound of the corresponding attributes and the unmatched
part:
$$EB(r) = \frac{1}{|M_{part}| + 1} \left( B(r_u) + \sum_{(\{a\},\{b\}) \in M_{part}} FB(r, a, b) \right) \tag{5.8}$$
where r is the known source tuple and $B(r_u)$ is the bound on the unmatched
part as defined in Eq. 4.14. Note that when applied to EB(r), the exclusion
list e in Eq. 4.14 only refers to the unmatched part and not the content of the
whole tuple.
The upper bound for each corresponding attribute pair requires the con-
sideration of similar terms based on termsim. Recall from Sec. 5.2.1 that the
field similarity measure incorporates a set CLOSE which contains those source
terms that have a “matching” term in the target value, i.e., a term that is
equal or highly similar based on termsim (Eq. 5.3). If the target value is not
known, then we must assume all values of the matching attribute. Given a
source tuple r, a source attribute a and its corresponding attribute b, we define
CLOSE(θ, r[a], b) as the set of terms t in r[a] for which a term t' in any value of
attribute b with termsim(t, t') > θ exists. The terms in CLOSE(θ, r[a], b) are
considered by the bound function FB, which is defined as
$$FB(r, a, b) = \sum_{t \in \mathrm{CLOSE}(\theta, r[a], b)} w(r[a], t) \cdot \mathrm{maxsimweight}(b, t) \tag{5.9}$$
The function maxsimweight determines the maximum possible value for
w(b, t') · termsim(t, t') in Eq. 5.2. It only considers terms t' that are not in the
exclusion list e[b] for target attribute b. It is defined as
$$\mathrm{maxsimweight}(b, t) = \max_{t' \in \mathrm{SIM}(\theta, b, t) \setminus e[b]} \mathrm{maxweight}(b, t') \cdot \mathrm{termsim}(t, t') \tag{5.10}$$
where maxweight(b, t') is the maximum weight of term t' in attribute b and
SIM(θ, b, t) is the set of terms t' existing in any value of b with termsim(t, t') > θ
(i.e., the list of “matching” terms for t). If all similar terms are included in the
exclusion list, then the function returns zero.
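The interplay of Eqs. 5.9 and 5.10 can be sketched as follows. The input format is an assumption: the weights of the source value r[a] and the maximum weights of the terms of target attribute b are given as dictionaries, the exclusion list e[b] as a set, and termsim is passed in as a function. The exact-match term similarity in the example is a deliberately simple stand-in.

```python
def maxsimweight(t, maxweight_b, excluded_b, termsim, theta):
    """Eq. 5.10: best achievable w(b, t') * termsim(t, t') over all terms t'
    of attribute b that are similar to t and not yet excluded."""
    candidates = [weight * termsim(t, u)
                  for u, weight in maxweight_b.items()
                  if u not in excluded_b and termsim(t, u) > theta]
    return max(candidates, default=0.0)


def field_bound(weights_ra, maxweight_b, excluded_b, termsim, theta):
    """Eq. 5.9: upper bound FB(r, a, b) on fieldsim(r[a], s[b]) for any s.

    Terms outside CLOSE(theta, r[a], b) have no similar partner and therefore
    contribute zero, so the sum may run over all terms of r[a].
    """
    return sum(w * maxsimweight(t, maxweight_b, excluded_b, termsim, theta)
               for t, w in weights_ra.items())


def exact(t, u):
    """Stand-in term similarity (assumption): 1 for equal terms, else 0."""
    return 1.0 if t == u else 0.0


# Hypothetical weights.
weights_ra = {"dean": 0.8, "john": 0.6}    # terms of the source value r[a]
maxweight_b = {"dean": 0.9, "smith": 0.7}  # maximum term weights in attribute b

print(field_bound(weights_ra, maxweight_b, set(), exact, 0.5))
print(field_bound(weights_ra, maxweight_b, {"dean"}, exact, 0.5))
```

Once the only matching term has been placed on the exclusion list, the bound drops to zero; this is how the constraining phase tightens the bound of an intermediate state step by step.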
In each iteration of the adapted children function, which creates child states,
we pick the term from either a matched attribute or the unmatched part that
either maximizes Eq. 5.9 or Eq. 4.14, respectively. As in the algorithm described
in Sec. 4.4, that term is used to search for target tuples. Depending on where
the term originated, the algorithm detects tuples that contain the term in the
corresponding attribute or the unmatched part. The exclusion lists are used to
prevent tuple pairs from appearing twice in the result.
5.6.2 Finding Similar Terms
The computation of the bound of intermediate states requires the identification
of similar terms. When those terms are known, the tuples that contain similar
terms can easily be identified using the inverted index on the target table. To
the best of our knowledge, there exists no index structure based on termsim.
However, two means of finding similar terms based on edit distance have been
successfully used in the database field: tries and q-grams. In the following,
we present those techniques and show how they can be applied to search for
termsim-similar terms.
Indexing methods for edit distance
The extended tuple similarity measure requires the location of similar terms
based on termsim. Parsing the whole target table clearly does not scale, and
thus, we need an index structure to identify tuples containing similar terms. The
existing inverted index is able to detect tuples containing a given term. Hence,
to find tuples containing terms similar to a given term t, we use a structure on
top of the inverted index that determines similar terms t', which are used by the
inverted index. Unfortunately, there exists no such data structure for termsim.
However, tries or q-grams can be used to identify terms with an edit distance
below a given threshold [NBYST01].
Tries² have been developed in the area of information retrieval [BYRN99].
A trie is a simple tree structure to store strings: Each tree edge is labelled by a
character, and the path from the root of the trie to a leaf node represents the
string that is the concatenation of the edge labels. A trie is a very compact
representation of a set of strings because common prefixes need to be represented
only once: The terms “rumors” and “run” are represented in the trie of Fig. 5.3.
Both terms have the prefix “ru”, and they share the edges representing this
string in the trie. To produce a more compact representation of that tree, one
could merge several edges into one if there are no intermediate branches: E.g.,
edges “r” and “u” could be merged into a single edge representing prefix “ru”.
Although the algorithm described here is also applicable for such a compact trie
representation, for the sake of clarity we only use tries whose edges represent a
single character.
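[Figure 5.3 shows a trie containing the terms “rumors” and “run”, together with
the edit distance matrix for the search term “dumas” and the trie term “rumors”.
Each matrix column is attached to the trie node that is reached after traversing
the corresponding edge.]
       ε  r  u  m  o  r  s
  ε    0  1  2  3  4  5  6
  d    1  1  2  3  4  5  6
  u    2  2  1  2  3  4  5
  m    3  3  2  1  2  3  4
  a    4  4  3  2  2  3  4
  s    5  5  4  3  3  3  3
Figure 5.3: Finding similar terms with tries.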
Exact search for strings can be performed by a simple traversal of the trie.
Approximate search using edit distance is slightly more complicated, but can
be done in a backtracking search process: In each search step, the edit distance
between the search string and the string represented by the current node is cal-
culated, which only requires the computation of one column in the edit distance
matrix [NBYST01]. Fig. 5.3 depicts the edit distance matrix for the search
term “dumas” and the term “rumors” in the trie. The first column, which rep-
resents the empty string, is associated with the root node. The second column
is computed when edge “r” is traversed, the third column after edge “u”, etc.
The traversal of a given branch stops if it is guaranteed that the strings
represented by the leaf nodes of that branch cannot have an edit distance smaller
than $\theta_{ed}$. Recall from Eq. 4.2 that scores can only remain the same or increase
by one. This property is exploited by the break criterion: The traversal of a
given branch stops if the matrix column of the current trie node does not contain
any score below $\theta_{ed}$. In the example of Fig. 5.3, the traversal would stop at the
node representing prefix “rumor” for $\theta_{ed} = 3$, because the smallest value in the
corresponding matrix column is 3.
2 The word “trie” is derived from “retrieval”.
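The backtracking search with the break criterion can be sketched in a few lines. The trie is represented by nested dictionaries with one character per edge; the marker key "$end" and the function names are choices made for this illustration and are not part of the DUMAS implementation.

```python
def build_trie(terms):
    """Trie as nested dictionaries; one edge per character."""
    root = {}
    for term in terms:
        node = root
        for ch in term:
            node = node.setdefault(ch, {})
        node["$end"] = term  # marks the node that completes a stored term
    return root


def search(trie, query, theta_ed):
    """Return {term: edit distance} for all stored terms with ed < theta_ed."""
    results = {}

    def visit(node, column):
        # column[i] = edit distance between query[:i] and the node's prefix.
        if "$end" in node and column[-1] < theta_ed:
            results[node["$end"]] = column[-1]
        if min(column) >= theta_ed:
            return  # break criterion: no score below the threshold remains
        for ch, child in node.items():
            if ch == "$end":
                continue
            # Compute the next matrix column after traversing edge ch.
            new_column = [column[0] + 1]
            for i, qc in enumerate(query, start=1):
                cost = 0 if qc == ch else 1
                new_column.append(min(new_column[i - 1] + 1,
                                      column[i] + 1,
                                      column[i - 1] + cost))
            visit(child, new_column)

    visit(trie, list(range(len(query) + 1)))  # first column: empty prefix
    return results


trie = build_trie(["rumors", "run", "dumas", "dumb"])
print(search(trie, "dumas", 3))
```

For $\theta_{ed} = 3$ the branch of “rumors” is abandoned at the prefix “rumor”, as described above, so the final edge “s” is never traversed.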
Another indexing technique for edit distance involves q-grams [Ukk92]. A
q-gram (or n-gram) of a term is a substring of length q. Given a term t, its
q-grams are obtained by “sliding” a window over its characters. To ensure that
the q-grams at the beginning and the end of the string contain q characters,
the string is conceptually extended with q − 1 special characters at both ends.
In the following, we will use character ‘#’ for prefixing and character ‘$’ for
suffixing terms. E.g., the term “dumas” has 2-grams “#d”, “du”, “um”, “ma”,
“as”, and “s$”.
Q-grams have successfully been used in similarity measures. Ukkonen presents
a string distance measure that is based on the difference of the strings’ q-gram
sets [Ukk92]. However, q-grams can also be used to provide a lower bound on
the edit distance of two strings. The intuition is that strings with a small edit
distance share many q-grams [ST96]. E.g., if two strings have an edit distance of
1 due to an insertion, deletion, or substitution operation, then their q-gram sets
differ by at most q q-grams. In general, if two terms $t_1$ and $t_2$ are within an edit
distance of k, then their q-gram sets have at least $\max(|t_1|, |t_2|) - 1 - (k - 1) \cdot q$
q-grams in common.
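The q-gram construction and the resulting count filter can be sketched as follows. The sketch treats the q-grams of a term as a multiset; the function names are illustrative.

```python
from collections import Counter


def qgrams(term, q=2):
    """Padded q-grams of a term ('#' as prefix and '$' as suffix padding)."""
    padded = "#" * (q - 1) + term + "$" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]


def common_qgrams(t1, t2, q=2):
    """Number of q-grams shared by both terms (multiset intersection)."""
    shared = Counter(qgrams(t1, q)) & Counter(qgrams(t2, q))
    return sum(shared.values())


def min_common_qgrams(t1, t2, k, q=2):
    """Lower bound on shared q-grams if the edit distance is at most k."""
    return max(len(t1), len(t2)) - 1 - (k - 1) * q


def passes_count_filter(t1, t2, k, q=2):
    """False means that the terms cannot be within edit distance k."""
    return common_qgrams(t1, t2, q) >= min_common_qgrams(t1, t2, k, q)


print(qgrams("dumas"))
print(passes_count_filter("dumas", "rumors", k=1))
print(passes_count_filter("dumas", "rumors", k=3))
```

The filter is only a necessary condition: a pair that passes it still has to be verified by computing the actual edit distance.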
Gravano et al. have exploited this property to implement edit-distance based
similarity joins using standard database technology [GIJ+01]. Before a similar-
ity join can be performed, the involved tables need to be preprocessed: For each
tuple, the q-grams for the value of the attribute involved in the similarity join
are created. The q-grams and their position in the attribute value are stored
in a special table, which is used by an SQL query to find tuple pairs that po-
tentially have similar terms in the joined attributes using the bound described
above. Note that because q-grams only provide a bound on the edit distance,
the correct edit distance has to be computed for the tuple pairs produced by the
SQL query to remove false positives. However, the edit distance function has
to be invoked much less frequently in comparison to its application on all tuple
pairs, and thus, exploiting the q-gram bound results in a performance gain.
As we have shown, both tries and q-grams can be used for indexing. However,
a q-gram index creates a huge overhead, while tries are a very compact data
structure. Hence, we decided to apply trie-based indexing to search for terms
with a small edit distance. In the following we show how to search for
termsim-similar terms given an edit-distance-based index.
Translating between edit distance and term distance
The data structures described above can be used to answer the following
question: Given a term t, which terms t' exist with $\mathrm{ed}(t, t') < \theta_{ed}$? However, our
goal is to find terms t' which have a term similarity above a given threshold,
i.e., termsim(t, t') > θ. If we use a data structure based on edit distance, we
need to determine the maximum edit distance a term t' can have with respect
to a given term t if its term similarity should be above the threshold.
To determine the maximum possible edit distance, we first transform the
above inequality:
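$$\begin{aligned}
\mathrm{termsim}(t, t') &> \theta \\
1 - \mathrm{ned}(t, t') &> \theta \\
1 - \frac{\mathrm{ed}(t, t')}{\max(|t|, |t'|)} &> \theta \\
\frac{\mathrm{ed}(t, t')}{\max(|t|, |t'|)} &< 1 - \theta \\
\mathrm{ed}(t, t') &< (1 - \theta) \cdot \max(|t|, |t'|)
\end{aligned} \tag{5.11}$$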
As shown in Eq. 5.11, the bound on the edit distance also depends on the
length of term t', which is unknown. If the length of term t' increases, the bound
on the edit distance also increases, and thus, it seems impossible to determine
a maximum edit distance. However, there exists an additional constraint that
we can exploit: The edit distance between two terms t and t' is at least as large
as the difference in the lengths of the two terms:
$$\mathrm{ed}(t, t') \geq \bigl|\, |t'| - |t| \,\bigr| \tag{5.12}$$
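W.l.o.g. we assume that term t' is longer than the given term t. Eq. 5.11
and 5.12 then yield
$$\begin{aligned}
|t'| - |t| &< (1 - \theta) \cdot |t'| \\
|t'| - |t| &< |t'| - \theta |t'| \\
-|t| &< -\theta |t'| \\
|t'| &< \frac{|t|}{\theta}
\end{aligned} \tag{5.13}$$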
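We now know that the length of the yet unknown string t' is bounded. This
knowledge can be used in Eq. 5.11 to determine a bound on the edit distance:
$$\begin{aligned}
\mathrm{ed}(t, t') &< (1 - \theta) \cdot \frac{|t|}{\theta} \\
\mathrm{ed}(t, t') &< \frac{|t|}{\theta} - |t|
\end{aligned} \tag{5.14}$$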
Eq. 5.14 shows the maximum possible edit distance a term t' can have w.r.t.
term t if their term similarity must be above threshold θ. To illustrate, assume
the term “dumas”, which has a length of 5. Assuming a threshold θ = 0.7,
the above equation determines a maximum edit distance of 2.14. Terms t' with
an edit distance of 3 or larger are guaranteed to be dissimilar to t: If there
were a term t' with an edit distance of 3 and a termsim-similarity above θ, then
according to Eq. 5.11 that term must be at least 11 characters long. However,
according to Eq. 5.12 such a term has an edit distance of at least 6. Thus, we
have shown by contradiction that such a term does not exist.
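The translation of Eq. 5.14 into an integer search radius for the edit-distance index can be sketched as follows. The function name is illustrative; the threshold is converted to an exact fraction via its decimal representation so that the strict inequality is not disturbed by floating-point rounding.

```python
import math
from fractions import Fraction


def max_edit_distance(term_length, theta):
    """Eq. 5.14: ed(t, t') < |t| / theta - |t|.

    Returns the bound itself and the largest integer edit distance that is
    still strictly below it.
    """
    if not 0 < theta <= 1:
        raise ValueError("theta must lie in (0, 1]")
    theta_exact = Fraction(str(theta))  # e.g. 0.7 becomes 7/10
    bound = Fraction(term_length) / theta_exact - term_length
    largest_integer = math.ceil(bound) - 1  # strict inequality
    return float(bound), largest_integer


bound, radius = max_edit_distance(len("dumas"), 0.7)
print(bound, radius)  # bound of about 2.14, so edit distances up to 2 qualify
```

Terms returned by the index within this radius are candidates only; their term similarity still has to be checked against θ, because the bound is derived for the longest admissible t'.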
5.7 Experimental Evaluation
The matching step and duplicate detection procedure need to be experimentally
evaluated. We used the same real-estate data that was used in Sec. 4.6.1 to
determine the quality of the schema matching step in a real-world scenario
(Sec. 5.7.1). Because the results were perfect, several artificial data sets were
created using the same data generator as in Sec. 4.6.2 to assess the quality of
the detected matching in critical situations. We also performed experiments
to evaluate the increase in duplicate detection accuracy of the extended tuple
similarity measure with respect to tupsim. For each setup, five different pairs
of relations were created, and the average of five experiments is reported.
To evaluate the accuracy of schema matching, the precision and recall
measures are used:
$$\mathrm{Precision} = \frac{|C \cap R|}{|R|} \qquad \mathrm{Recall} = \frac{|C \cap R|}{|C|} \tag{5.15}$$
where C is the set of true correspondences and R is the set of retrieved
correspondences.
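Eq. 5.15 translates directly into set operations. In the following sketch, correspondences are represented as (source attribute, target attribute) pairs; returning zero for an empty set is a convention chosen here, and the example sets are hypothetical.

```python
def precision_recall(true_corr, retrieved):
    """Eq. 5.15 for correspondences given as (source, target) pairs."""
    true_corr, retrieved = set(true_corr), set(retrieved)
    hits = len(true_corr & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(true_corr) if true_corr else 0.0
    return precision, recall


# Hypothetical example: one correct and one false correspondence retrieved.
C = {("LastName", "LN"), ("Phone", "Tel")}
R = {("LastName", "LN"), ("Fax", "Tel")}
print(precision_recall(C, R))
```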
5.7.1 Experiments on Real-Estate Advertisements
Using the ten most similar tuple pairs, the matching between every combination
of the four data sets, which were used to assess the accuracy of duplicate detec-
tion (Sec. 4.6.1), was established. These matchings were always perfect, i.e., no
false correspondences were included and no correspondences were missed. This
is not surprising, because
1. the detected duplicates were true duplicates, and
2. the terms used to describe apartments are very similar.
The duplicate detection experiments on the real estate advertisements yielded
100% precision for the top-10 duplicates. Hence, no misleading false duplicates
enter the matching step. Using common acronyms and abbreviations to describe
an apartment is also beneficial, because the term similarity measure termsim is
able to handle small deviations or misspellings, but cannot identify synonyms.
In the following, experiments on synthetic data are described. The goal
of those experiments is to determine how well the matching step can
compensate for a few false duplicates. Note that common errors are inserted by the
data generator.
5.7.2 Accuracy of Schema Matching
When performing the matching step with the data sets that were used to eval-
uate the effect of limited extensional overlap (Fig. 4.5), the resulting matching
was always perfect. This is not surprising, because most of the top-ranked tuple
pairs are true duplicates. Similar results were achieved with the data sets used
to examine the effect of limited intensional overlap (Fig. 4.6). Consequently,
additional experiments to study the influence of false duplicates on the schema
matching result have been performed.
The schemata of the data sets generated for the following experiment have
four corresponding attributes. In fact, they are the same schemata as the
four-correspondence case in Sec. 4.6.2, which were used to study the effect of reduced
intensional overlap. Data containing three, five, and ten duplicates were used,
and K, the number of duplicates used in the matching step, varied accordingly.
[Bar chart: recall (%) and precision (%) of schema matching, y-axis 0 to 100, for the settings 3 Dup./K=2, 3 Dup./K=3, 5 Dup./K=3, 5 Dup./K=5, 10 Dup./K=5, and 10 Dup./K=10]
Figure 5.4: Precision and recall of schema matching
Fig. 5.4 depicts the results, which are averages over five data sets that were
independently generated for each number of duplicates. It shows that when the
number of duplicates used for matching approaches the number of duplicates in
the data set, both precision and recall decrease. This can be expected, because
the duplicate detection experiments showed that after 40 - 60% of all duplicates
have been detected, false duplicates appear among the top-ranked tuple pairs.
Such false duplicates have a negative effect on the schema matching and lead
to false and missed correspondences.
As discussed above, false duplicates have a negative effect on the detected
correspondences, but a schema matching with reasonable quality can still be
expected if only few non-duplicates are used. The next experiment is designed
to examine the behavior of the matching procedure with respect to such false
duplicates. From a database with 50 duplicates we handpicked the top ten
duplicates and performed schema matching based on this perfect choice. Next,
we incrementally added the top 20 false positives, i.e., very similar tuple pairs
that are known not to be duplicates. The results are presented in Figure 5.5,
which shows that recall and precision of schema matching degrade only after as
many false positives as true positives are used. The reason for this robustness is
that all duplicates support a similar matching, while each false positive might
support a different matching. Both curves level off after a certain amount of noise
(false positives) is introduced. Precision and recall do not drop to 0% because
non-duplicates can still provide information that helps in schema matching: If
two tuples representing different persons have the same value for Birth-place,
then they might miss other correspondences, but still indicate a match for Birth-
place. Recall that the most similar non-duplicates were used because they would
appear in the list of tuple pairs before all other non-duplicates. Since those tuple
pairs are very similar, they must have some values in common, and thus, provide
enough information for a few correspondences.
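The robustness argument can be illustrated with a small sketch: similarity matrices of true duplicates reinforce the same cells, while each false positive contributes to a different cell. All matrix values and the threshold are hypothetical, and the graph matching step is replaced by a plain threshold for brevity.

```python
def average_matrices(matrices):
    """Average a list of equally sized similarity matrices cell by cell."""
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(m[i][j] for m in matrices) / len(matrices) for j in range(cols)]
        for i in range(rows)
    ]


# Hypothetical field similarity matrices (2 source x 2 target attributes).
true_duplicate = [[0.9, 0.1], [0.0, 0.8]]    # supports (0, 0) and (1, 1)
false_positive_a = [[0.0, 0.7], [0.0, 0.0]]  # supports only (0, 1)
false_positive_b = [[0.0, 0.0], [0.6, 0.0]]  # supports only (1, 0)

matrices = [true_duplicate] * 3 + [false_positive_a, false_positive_b]
avg = average_matrices(matrices)

threshold = 0.4  # hypothetical pruning threshold
supported = {
    (i, j)
    for i, row in enumerate(avg)
    for j, value in enumerate(row)
    if value > threshold
}
```

With three duplicates and two false positives, only the cells supported by the duplicates stay above the threshold.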
[Line chart: match precision (%) and recall (%), y-axis 0 to 100, over the number of false positives, x-axis 0 to 20]
Figure 5.5: Robustness of schema matching with 10 true duplicates in the
presence of false positives
5.7.3 Duplicate Detection with Partial Alignment
The goal of exploiting known correspondences in duplicate detection is to in-
crease precision. To show the effectiveness of the extended tuple similarity
measure etupsim, we used data sets with four corresponding attributes and 50
duplicates.
Fig. 5.6 shows that the precision-recall curve shifts rightward with the number
of known correspondences. Intuitively, that means that true duplicates move
upward in the list of tuple pairs ranked by decreasing extended tuple similarity.
Consequently, more tuple pairs can be used for schema matching without in-
cluding too many non-duplicates. While the first false duplicate appeared after
approx. 45% of all duplicates had been detected with tupsim (i.e., no partial
match), this number could be raised to 85% by using all four correspondences
in etupsim. Thus, it can be concluded that the extended tuple similarity mea-
sure successfully exploits known correspondences to increase the precision of
duplicate detection.
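The quantity quoted above, the share of duplicates detected before the first false duplicate appears, can be computed from a ranked list as follows. This is a minimal sketch; the ranked list in the example is hypothetical and not taken from the experiments.

```python
def recall_before_first_false(ranked_is_duplicate, total_duplicates):
    """Fraction of all duplicates found before the first non-duplicate.

    ranked_is_duplicate: booleans for the tuple pairs, ordered by
    decreasing similarity (True = true duplicate).
    """
    found = 0
    for is_duplicate in ranked_is_duplicate:
        if not is_duplicate:
            break
        found += 1
    return found / total_duplicates


# Hypothetical ranking: nine duplicates, then the first false duplicate.
ranking = [True] * 9 + [False] + [True] * 3
share = recall_before_first_false(ranking, total_duplicates=20)
```

A better similarity measure moves true duplicates upward in the ranking and thereby increases this share.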
[Line chart: duplicate precision (%) over duplicate recall (%), both axes 0 to 100, with one curve each for no partial match and for 1, 2, 3, and 4 known matches]
Figure 5.6: Influence of degree of partial alignment
5.7.4 The Certainty Check
Various experiments showed that a second iteration of the algorithm is very rarely
needed. In most cases, the certainty check determined that all correspondences can be
trusted, making another run of the duplicate detection step unnecessary. In only
a single case did a second iteration result in an additional correspondence after
duplicates had been detected with the extended tuple similarity measure. It
can be concluded that duplicate detection with tupsim is sufficiently accurate,
which is also shown in the experimental evaluation.
Deciding whether to search for additional duplicates is a typical case of a
tradeoff between accuracy and performance: Despite the performance optimizations
described in Sec. 4.4, duplicate detection is still an expensive operation
and should only be performed if necessary. On the other hand, a second itera-
tion has the potential to improve the schema matching result. The experiments
indicate that the possible improvement does not justify the cost of detecting
more duplicates. However, there might be application domains where the dis-
tinction between match and non-match is not as clear as in the data used in
the experimental evaluation. In such a case, the user should be included in the
matching process. Uncertain matches can be highlighted, and the user must
decide whether she is happy with the matching result, or if further refinement
is necessary.
5.8 Discussion
This chapter describes the matching step of the DUMAS table matcher. Tuple
pairs are compared attribute-by-attribute, and the resulting similarity matri-
ces are aggregated. From such an aggregated matrix, a simple matching is
extracted: a set of correspondences such that each attribute corresponds to at
most one other attribute. This notion of simple matching is found in all related
schema matching work. However, it is more restrictive than the definition of
“simple matching” in Sec. 2.2.2, which states that each correspondence should
be a pair of singleton attribute groups. Thus, the DUMAS table matcher only
finds matchings in which each attribute is in at most one correspondence, i.e.,
a matching such as {({a1},{b}),({a2},{b})} (see footnote 3) would not be found. However, we
believe that in practical scenarios, this is not a major restriction. A straight-
forward adaptation of the matching step is to skip the graph matching and
immediately use all correspondences that pass the pruning threshold. However,
as demonstrated in the example, this would also lead to false correspondences,
which affect the user’s confidence in the resulting matching.
Furthermore, a method for determining which correspondences can be trusted
is described. A possible consequence of this certainty check is to detect addi-
tional duplicates, which are used to improve the schema matching. For that
purpose, an extended tuple similarity measure is applied. The extended mea-
sure takes a partial alignment into consideration, which can be produced by a
previous iteration of the DUMAS table matcher. Note that the extended tuple
similarity measure can also be applied in the first run of the table matcher if
certain matches are supplied by an expert or by another schema matcher. In
the experimental evaluation we showed that the new measure improves duplicate
detection. However, it was also established that the result of the first run
of the table matcher was certain in most cases and could not be improved by a
second iteration. Further experiments with real world data are necessary to de-
termine in which cases the extra cost of a second duplicate detection procedure
is justified.
Footnote 3: Note that {({a1},{b}),({a2},{b})} is a different matching than {({a1, a2},{b})}. The
latter is a complex matching, which frequently occurs in real world scenarios.
Part III
Complex Matchings and
Complex Schemata
Chapter 6
Matching Complex
Schemata
The DUMAS table matcher described in the two previous chapters is able to de-
tect correspondences between two tables if the assumptions described in Sec. 4.2
are fulfilled: The tables must represent the same entity type and must contain
duplicates. Obviously, these two assumptions do not hold when comparing two
arbitrary tables in complex schemata, which consist of multiple relations.
The DUMAS schema matcher described in this chapter is able to detect
correspondences in such complex schemata.
A number of problems have to be solved in the process. Firstly, it has to be
determined if the result of the DUMAS table matcher can be trusted, because
the two assumptions do not hold when comparing arbitrary tables. The heuris-
tics developed to answer this question affect the quality of the schema matching,
because false correspondences have to be avoided. The correspondences that we
trust constitute the initial matching. To find more correspondences, the initial
matching is used to join tables, which leads to the second problem: One has
to decide which table combinations to consider, and which tables must be com-
pared. This is mostly an efficiency issue, because there is an exponential number
of table combinations in each schema, and considering all of them clearly does
not scale.
6.1 Iterative Schema Matching Using Duplicates
Information can be structured in different ways. That is why we have to deal
with semantic heterogeneity when integrating independently developed data
sources. When matching single tables, the kinds of differences that have to be
considered are limited: Attributes can be given different names or domains,
some attributes can be missing, etc. The problem is more challenging in whole
schemata, because there is not always a 1:1 relationship between tables: E.g., the
entity type represented as a single table in the source schema can be normalized
into several tables in the target schema, or vice versa.
Following a duplicate-based approach for matching complex schemata im-
plies that multi-table duplicates, as described in Sec. 3.5.2, have to be detected.
That problem is tackled using the solution for single tables: The problem of
finding multi-table duplicates can be reduced to the problem of finding single-
table duplicates because the result of joining several relations is also a table,
which can be compared to any other relation using the DUMAS table matcher.
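The reduction rests on the fact that a join of two tables is again a table. A minimal sketch of an equi-join over tables represented as lists of dictionaries is given below; the representation, the table contents, and the `left.`/`right.` prefixes are assumptions made for illustration only.

```python
def equi_join(left, right, left_key, right_key):
    """Join two tables (lists of dicts); the result is again a list of dicts."""
    joined = []
    for lrow in left:
        for rrow in right:
            if lrow[left_key] == rrow[right_key]:
                # Prefix attribute names so that equal names do not collide.
                row = {f"left.{k}": v for k, v in lrow.items()}
                row.update({f"right.{k}": v for k, v in rrow.items()})
                joined.append(row)
    return joined


# Hypothetical base tables related by a foreign key address.person -> person.id.
person = [{"id": 1, "name": "John Doe"}, {"id": 2, "name": "Sam Adams"}]
address = [
    {"person": 1, "city": "Berlin"},
    {"person": 2, "city": "Potsdam"},
    {"person": 2, "city": "Leipzig"},
]
person_address = equi_join(person, address, "id", "person")
```

The resulting join table has the same shape as a base table and could therefore be handed to a table matcher unchanged.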
Unfortunately, there are very many possible combinations of tables in each
schema, and generating and matching all of them clearly does not scale. Fur-
thermore, the table matcher does not always produce a correct result when com-
paring two arbitrary tables because of the two assumptions made in Sec. 4.2:
Our duplicate detection method can be applied if the two tables
1. represent the same entity type and
2. contain duplicates.
Because it is not known if those two assumptions hold when matching two
arbitrary tables, it has to be determined how to interpret the result of comparing
two tables, or in other words, if the result can be trusted.
Apart from the effectiveness of duplicate detection and schema matching,
efficiency is a major concern. It is well known that the number of possible table
combinations in a schema is exponential in the size of the schema. This holds
even when only “reasonable” combinations of tables are considered, i.e., tables
that can be joined by some join path. We define a join path as a sequence
of related tables. The relationship between tables can be represented
in different ways, e.g., foreign key dependencies or join conditions in stored
procedures. Because it is not feasible to compare every reasonable table combi-
nation, one goal of the algorithm described in this chapter is to (i) compute and
(ii) compare only necessary combinations. A table combination is considered
necessary if it has the potential to contribute to the schema matching.
A straightforward solution is to immediately apply the DUMAS table matcher
on universal relations: For each schema, one could create its universal relation,
which represents all the information in the database in a single table [Ull89].
After the underlying data have been transformed into the universal relation,
the DUMAS table matcher can be applied to find attribute correspondences.
Although this solution is technically possible, it was rejected for several rea-
sons. Firstly, the universal relations will be very large, both in the number of
attributes and in the number of tuples. The large number of attributes would
require us to use a small sample in order to make duplicate detection feasible.
This can result in poor precision, because sampling reduces the number of dupli-
cates in the tables (Sec. 4.5). Secondly, experimental results with the DUMAS
table matcher presented in the previous chapter indicate that the matching be-
tween two tables can be of poor quality if the tables have only few attributes
in common. Hence, if the schemata are only slightly related (i.e., they only
have a small intensional overlap), the accuracy of the matching between their
respective universal relations is likely to be low.
To illustrate the second point, assume that the databases of two online shops,
a book shop and a music store, need to be merged. Because both shops
sell different types of items, the inventory data of the two databases can be
considered (both intensionally and extensionally) disjoint. In particular, the
products are described in different ways using different attributes. However,
both databases contain customer information, and duplicate entries are likely
if the two shops are well known. If the databases are transformed into universal
relations, the small intensional overlap (only customer information is
represented in both databases) might result in poor duplicate detection precision,
and thus, a low-quality schema matching, when using the DUMAS table
matcher. However, if the table matcher were applied on certain substructures,
namely the customer data, we would expect better matching quality.
Hence, the algorithm described here tries to match those schema substruc-
tures that produce a good matching. The DUMAS schema matching algorithm
involves two main phases:
1. Initial matching,
2. Match extension.
The first step is the creation of an initial matching. The goal of this step is to
establish a set of correspondences that we are very confident of. This matching
could be established using any available schema matcher. Sec. 6.2 describes the
challenges associated with using a duplicate-based schema matching approach
to create the initial matching, and an algorithm that applies the DUMAS table
matcher to create initial correspondences is formulated.
The second phase of the algorithm is run iteratively. In the first iteration, the
initial matching is used to create table combinations. Those table combinations
can lead to additional correspondences, which are used in the following iteration
to create even larger table combinations. Sec. 6.3 covers the match extension
step. We show how to reduce the number of table combinations and comparisons
based on the detected correspondences. Possible changes to the algorithm that
improve performance but potentially decrease accuracy are discussed in Sec. 6.4.
6.2 Creating an Initial Matching
In the first phase of the DUMAS schema matcher an initial matching must be
established, which will spark the first match extension in the following phase.
Although other definitions are possible, we restrict the initial matching to be a
simple matching such that there are no correspondences with the same source
table but different target table, and vice versa. In other words, the initial
matching consists of several table matchings such that each table participates
in at most one table matching.
Fig. 6.1 depicts the scenario used in this chapter in a simplified fashion.
The source schema contains five tables 1 through 5, while the target schema
has only four tables A through D (depicted as circles). The arrows within a
schema represent foreign key relationships, which are not of interest in the initial
matching phase. Note that although all tables in each schema are related in a
connected graph, this is not required by the matching algorithm. Each arrow
between the schemata indicates a set of correspondences. As stated above, the
correspondences in the initial matching relate a table to at most one other table:
In the example scenario, table 2 matches with B and 5 is related to D.
[Diagram: source schema with tables 1 to 5 and target schema with tables A to D, drawn as circles; arrows between the schemata mark the initial table matchings]
Figure 6.1: An initial matching between source and target schema
6.2.1 Interpreting the Table Matching
Creating an initial schema matching using the DUMAS table matcher is a chal-
lenging problem, because the underlying assumptions do not hold when com-
paring two arbitrary tables. Hence, it has to be determined if the result of
comparing two tables can be assumed to be correct. Various intermediate re-
sults created in the matching process can be used. In this section, we will
discuss the applicability of the tuple similarity, the field similarity, and the de-
tected matching to determine if the resulting matching between two tables can
be trusted.
The tuple similarity measure tupsim
The underlying assumption of duplicate-based schema matching is that dupli-
cates provide sufficient information to produce a good matching. A possible
approach to determine if a matching can be trusted is to ensure that the de-
tected tuples are, in fact, duplicates. Unfortunately, the tuple similarity measure
tupsim is not a good indicator. On the one hand, tuple similarity scores for
true duplicates can be relatively low, e.g., because the tables under considera-
tion have few corresponding attributes. On the other hand, false duplicates can
have high scores. Although it has been shown that tupsim is very reliable when
the tables represent the same entity type, false duplicates with high similarity
scores can be returned by the algorithm if this assumption does not hold.
An example for this problem is depicted in Fig. 6.2. The source R contains
information about products, whose name and producer are represented in table
Product. The different editions of the products, including the edition number
Ed and the year, can be found in table Edition. The arrow represents a foreign
key relationship between Edition and Product. In contrast, the target S contains
statistical data about a soccer tournament: Information about each player
is provided in table Player, while the number of goals scored by a given player
in a given year is shown in table Goals. Although tables Edition and Goals
have completely different semantics, one notices that a few tuples look very
much alike: E.g., the tuples “1 1999 1” and “2 2000 6” (represented as a single
string as required by the similarity measure) can be found in both tables. Using
those tuple pairs in the matching step would result in false correspondences.

R.Product
ID | Name       | Producer
1  | TV Set     | ACME
2  | Racing Car | MyToys

R.Edition
Prod | Year | Ed
1    | 1999 | 1
1    | 2004 | 2
2    | 1997 | 5
2    | 2000 | 6

S.Player
PID | Name            | Position
1   | John Doe        | Middle
2   | Sam Adams       | Striker
3   | Hans Mustermann | Goal

S.Goals
Player | Year | Goals
1      | 1998 | 2
1      | 1999 | 1
2      | 2000 | 6

Figure 6.2: Misleading tuple similarity: Tables with different semantics
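The effect can be reproduced with the values of Fig. 6.2. The sketch below only builds the string representation of each tuple and intersects the two tables; it does not compute tupsim, and the helper name `as_string` is chosen here for illustration.

```python
def as_string(row):
    """Represent a tuple as a single whitespace-separated string."""
    return " ".join(str(value) for value in row)


# Values of Fig. 6.2.
edition = [(1, 1999, 1), (1, 2004, 2), (2, 1997, 5), (2, 2000, 6)]  # Prod, Year, Ed
goals = [(1, 1998, 2), (1, 1999, 1), (2, 2000, 6)]                  # Player, Year, Goals

identical = {as_string(r) for r in edition} & {as_string(r) for r in goals}
```

Two of the four Edition tuples have an identical counterpart in Goals, although the tables describe unrelated entity types.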
Such high scores occur if two tuples have many terms in common, a situation
that is described as Problem 4b in Sec. 4.3.1. Several factors contribute
to high scores:
1. Same attribute domain: If two attributes have the same domain (e.g.,
Year in the example above), there are likely to be some unrelated tuples
which have the same value for those attributes. The probability that two
attributes have the same value is larger if the attribute domain has few
values (e.g., Ed and Goals).
2. Few values: If tuples have only few values, spuriously matching values
have a large effect on the tuple similarity. Few values occur in two cases:
(a) Few attributes: Assume that two tuples have one matching attribute
value. The effect of this value match on the tuple similarity is larger if
the respective relations have only few attributes, because in that case
there are fewer (non-matching) values that decrease the tuple similarity
score.
(b) Many null values: This case is similar to the case of few attributes.
The fewer values exist, the less information we have to distinguish
unrelated objects. There are two ways to interpret null values in the
duplicate detection step: One can ignore them, i.e., remove null val-
ues from the string representation, or consider null a special token.
Experiments have shown that the difference between both approaches
is negligible: The weight of a null token is close to zero if many tu-
ples have at least one null value, and thus, the null value has a
minimal effect on the tuple similarity.
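The two ways of handling null values, and the reason why a null token carries little weight, can be sketched as follows. The sketch assumes a plain inverse document frequency, log(N/df), as token weight; the exact weighting used by tupsim is defined in Chapter 4 and may differ. The token `<null>` and the data are hypothetical.

```python
import math


def tokenize(values, null_token=None):
    """Turn a tuple into tokens; nulls are dropped unless a null_token is given."""
    tokens = []
    for value in values:
        if value is None:
            if null_token is not None:
                tokens.append(null_token)
            continue
        tokens.extend(str(value).split())
    return tokens


def idf(token, tokenized_tuples):
    """Assumed weight: log(N / df), with df = number of tuples containing the token."""
    df = sum(1 for tokens in tokenized_tuples if token in tokens)
    return math.log(len(tokenized_tuples) / df) if df else 0.0


# Hypothetical table: nine of ten tuples contain a null value.
table = [(f"name{i}", None) for i in range(9)] + [("name9", "Berlin")]
with_null_token = [tokenize(row, null_token="<null>") for row in table]
without_nulls = [tokenize(row) for row in table]

null_weight = idf("<null>", with_null_token)  # log(10 / 9)
rare_weight = idf("Berlin", with_null_token)  # log(10 / 1)
```

Because the null token occurs in almost every tuple, its weight is small compared to that of a rare term, so both interpretations lead to nearly the same similarity.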
In general, the longer the strings resulting from tokenizing the tuples are, the
less likely high tuple similarity scores are to occur. However, because even duplicate
tuples can have low similarity scores, the tuple similarity measure is not a good
indicator.
The field similarity measure fieldsim
Apart from the tuple similarity measure, another intermediate result is the
similarity matrix for each duplicate, which contains field similarity scores for
each attribute pair. Assuming that the tuples are true duplicates, such a matrix
is a good indicator for existing correspondences.
However, a matrix is constructed only for tuple pairs that are very similar
with respect to tupsim. Hence, the problems described above also affect the
similarity matrices. Thus, a single similarity matrix can lead to false conclusions
when deciding if a matching can be trusted.
The detected attribute correspondences
In contrast to the similarity measures, which work on a single tuple pair, the
detected attribute correspondences are the result of several tuple pairs. Thus,
a correspondence that is part of the resulting schema matching indicates that
many tuple pairs have the same or a very similar value in the corresponding
attributes.
If spurious matches in attribute values occur randomly, it is unlikely that
the average similarity score for any attribute combination passes the matching
threshold, and thus, the resulting matching will be empty. A false correspon-
dence will be extracted only if many non-duplicates have similar values for
a given attribute combination. E.g., the most similar tuple pairs for tables
Product and Player in Fig. 6.2 are those with the same value for attributes
ID and PID, respectively. In this case, both attributes have the same value
domain, and the values of other attributes are unrelated. Consequently, all de-
tected tuple pairs have the same value for those two attributes, resulting in a
false correspondence.
As shown in the above example, it is possible that a small number of false cor-
respondences is extracted from the tuple pairs. However, many correspondences
are less likely to be established between unrelated tables, because they would
imply that the majority of the tuple pairs have highly similar values for all those
attribute pairs. Consequently, each table matching is judged by the number
of its constituent correspondences, which is stated in the following matching
trust heuristic: A table matching with many attribute correspondences is more
trustworthy than a table matching with few attribute correspondences. Note
that by “many” we refer to the absolute number of correspondences. Thus, we
inherently prefer larger tables.
The matching trust heuristic defined above does not tell whether a matching
can be trusted; it only says whether a matching can be trusted more than another
matching. However, it has been shown to be sufficient for removing
false matches, and thus, increasing the precision of our duplicate-based schema
matcher. It is directly applied in the initial matching procedure (Sec. 6.2.2),
but can also be found in the iterative part of the algorithm (Sec. 6.3).
6.2.2 Initial Matching with the DUMAS Table Matcher
The initial matching created by the DUMAS schema matcher is based on the
matchings for every table combination. Let Mi,j be the matching between
tables Ri and Sj, and |Mi,j| be the size of the matching, i.e., the number of
correspondences in the matching. The matching sizes for every table combination
can be represented in a matching size matrix. Fig. 6.3 shows the matrix for
two schemata with three tables each.

     S1  S2  S3
R1    7   0   1
R2    0   5   2
R3    1   6   0

Figure 6.3: A matching size matrix
As stated above, only matchings with a large number of correspondences
can be trusted. We define maxcor as the maximum number of correspondences
detected by comparing every source table with every target table:
\[ maxcor = \max_{R_i \in R,\; S_j \in S} |M_{i,j}| \qquad (6.1) \]
In the example matrix in Fig. 6.3 the maximum number of correspondences
is 7. In the following, we only consider the set of large table matchings LM
that contains matchings Mi,j whose size is close to maxcor, i.e.,
\[ LM = \{ M_{i,j} : |M_{i,j}| \ge \theta_{numcor} \cdot maxcor \} \]
where θnumcor is a threshold in the range [0,1] set by the user. The initial
matching is created from the table matchings in LM. With a large threshold,
only few table matchings are used in the initial matching. This could decrease
the recall of the overall algorithm. Decreasing the threshold allows more cor-
respondences to enter the initial matching, but could also decrease precision.
In the example a threshold θnumcor = 0.6 was used. All table matchings that
have more than the resulting minimum number of correspondences of 4.2 are
highlighted in bold.
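For the matching size matrix of the example, maxcor and the set LM can be computed as sketched below. The dictionary representation is an assumption made for illustration; the values and the threshold are those of the example.

```python
# Matching sizes |M_{i,j}| from the example matrix.
size = {
    ("R1", "S1"): 7, ("R1", "S2"): 0, ("R1", "S3"): 1,
    ("R2", "S1"): 0, ("R2", "S2"): 5, ("R2", "S3"): 2,
    ("R3", "S1"): 1, ("R3", "S2"): 6, ("R3", "S3"): 0,
}
theta_numcor = 0.6

maxcor = max(size.values())            # Eq. (6.1)
min_size = theta_numcor * maxcor       # minimum number of correspondences
LM = {pair for pair, n in size.items() if n >= min_size}
```

Three table matchings pass the threshold of 4.2: R1 with S1, R2 with S2, and R3 with S2.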
Afterwards, ambiguous table matchings are removed from LM: In very rare
cases a table can have large matchings with more than one other table. E.g.,
the matchings of table S2 with both R2 and R3 pass the threshold. This is in
contrast to the goal of creating a set of table matchings such that each table has
correspondences with at most one other table. One way to resolve this conflict
is to produce a maximum weight graph matching similar to the matching step in
Sec. 5.3.2 to relate tables. If that approach were followed, the initial matching
would assign R1 to S1 and R3 to S2.
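The graph matching alternative can be checked on the example matrix. The sketch below enumerates all assignments by brute force, which is only feasible for very small matrices and stands in for a proper maximum weight bipartite matching algorithm; it is not the implementation used in DUMAS.

```python
from itertools import permutations


def max_weight_assignment(weights):
    """Brute-force maximum weight assignment for a small square matrix.

    weights[i][j] is the matching size between source table i and target table j.
    """
    n = len(weights)
    best = max(
        permutations(range(n)),
        key=lambda perm: sum(weights[i][perm[i]] for i in range(n)),
    )
    return [(i, best[i]) for i in range(n)]


weights = [[7, 0, 1], [0, 5, 2], [1, 6, 0]]  # rows R1..R3, columns S1..S3
assignment = max_weight_assignment(weights)

threshold = 0.6 * 7  # theta_numcor * maxcor from the example
kept = [(i, j) for i, j in assignment if weights[i][j] >= threshold]
```

The optimal assignment pairs R2 with S3 as well, but that matching has only two correspondences and is dropped by the threshold, leaving R1 with S1 and R3 with S2.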
However, another goal of the initial matching step is to avoid false correspondences.
If a table X has many correspondences with more than one other
table, then it is unclear which table it should be assigned to. Choosing one
table matching based on a graph matching includes the risk of using false
correspondences in the following phase. To ensure that only true correspondences
are used, we decided to ignore all matches of table X. The set of remaining
table matches IM is the set of large table matchings that are certain:
\[ IM = \{ M_{i,j} \in LM : \neg\exists M_{k,l} \in LM : (i = k \wedge j \neq l) \vee (i \neq k \wedge j = l) \} \qquad (6.2) \]
In the above example, IM would only contain the table matching between
R1 and S1. If the set IM is empty as a result of this procedure, a maximum
weight matching is computed on the matching size matrix, and the
matches above the threshold that are part of that graph matching constitute
IM. The final initial matching Minitial between the source schema and the
target schema is the union of the table matchings in IM:
\[ M_{initial} = \bigcup_{M \in IM} M \qquad (6.3) \]
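Eqs. 6.2 and 6.3 translate into a filter and a set union. In the sketch below, table matchings are identified by index pairs, and the attribute correspondences are hypothetical placeholders; the fallback to a maximum weight matching for an empty IM is omitted.

```python
def certain_matchings(LM):
    """Eq. (6.2): drop every matching that shares exactly one table with another."""
    return {
        (i, j)
        for (i, j) in LM
        # (i == k) != (j == l) holds iff exactly one of the two tables coincides.
        if not any((i == k) != (j == l) for (k, l) in LM)
    }


LM = {("R1", "S1"), ("R2", "S2"), ("R3", "S2")}
IM = certain_matchings(LM)

# Hypothetical attribute correspondences of each table matching.
correspondences = {
    ("R1", "S1"): {("R1.a", "S1.x"), ("R1.b", "S1.y")},
    ("R2", "S2"): {("R2.c", "S2.z")},
    ("R3", "S2"): {("R3.d", "S2.z")},
}

# Eq. (6.3): the initial matching is the union of the table matchings in IM.
M_initial = set().union(*(correspondences[pair] for pair in IM))
```

Since S2 has large matchings with both R2 and R3, both are ignored and only the matching between R1 and S1 contributes to the initial matching.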
Note that although the initial matching is based on the matchings between all
tables, it is not necessary to consider every table combination. The size of a
matching can be at most the size of the smaller table, i.e.,
$|M_{i,j}| \le \min(|R_i|, |S_j|)$. Hence, only tables whose size is greater than or
equal to θnumcor · maxcor (where maxcor is the current maximum number of correspondences)
need to be considered. In the DUMAS implementation, the algorithm starts
with the largest tables and stops computation as soon as no more table pairs need
to be considered.
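The pruning idea can be sketched as a loop over table pairs ordered by their upper bound min(|Ri|, |Sj|). The callable `match_size` is a hypothetical stand-in for running the table matcher, and all table sizes are made up; the sketch is one possible reading of the description, not the DUMAS implementation.

```python
def pruned_matching_sizes(source_sizes, target_sizes, match_size, theta_numcor):
    """Compute matching sizes, skipping pairs that cannot reach the threshold.

    source_sizes, target_sizes: table name -> number of attributes.
    match_size(r, s): number of correspondences found between r and s.
    """
    pairs = sorted(
        ((r, s) for r in source_sizes for s in target_sizes),
        key=lambda p: min(source_sizes[p[0]], target_sizes[p[1]]),
        reverse=True,  # largest tables first
    )
    sizes, maxcor = {}, 0
    for r, s in pairs:
        bound = min(source_sizes[r], target_sizes[s])
        if bound < theta_numcor * maxcor:
            break  # bounds only shrink and maxcor only grows from here on
        sizes[(r, s)] = match_size(r, s)
        maxcor = max(maxcor, sizes[(r, s)])
    return sizes, maxcor


# Hypothetical schemata and a lookup table in place of the table matcher.
source = {"A1": 8, "A2": 3}
target = {"B1": 8, "B2": 2}
known = {("A1", "B1"): 7, ("A2", "B1"): 1, ("A1", "B2"): 0, ("A2", "B2"): 1}
calls = []


def lookup(r, s):
    calls.append((r, s))
    return known[(r, s)]


sizes, maxcor = pruned_matching_sizes(source, target, lookup, theta_numcor=0.6)
```

After the first comparison yields seven correspondences, every remaining pair has an upper bound below 4.2, so three of the four comparisons are skipped.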
6.3 Crawling Through Schemata
The basic idea of the second step of the algorithm is to use the initial matching to
“crawl” through each schema: The tables that participate in the initial matching
are joined with their neighboring tables to produce larger tables. These new
tables are matched with other tables to produce additional correspondences,
which are used to further extend tables, and thus, start the next iteration.
In the following, the tables that are present in the original schemata are
called base tables, while the tables that are the result of joining several tables
are called join tables. In cases where the algorithm does not distinguish between
the two types of tables, the term table is used.
6.3.1 A Run Through The Example
To demonstrate the intuition behind the DUMAS schema matching algorithm
we will go through the example depicted in Fig. 6.1. The figure shows initial
correspondences between tables 2 and B and between 5 and D. In the following,
we say that two tables match if there are correspondences between them, and
denote a matching by an arrow: 2 → B and 5 → D.
Based on this matching, tables 2, 5, B, and D are extended: Join tables
are created by merging those tables with their respective neighbors. A neighbor
of a table is another table in the same schema that is related to the given
table. Without loss of generality, it is assumed that it is known how two tables
are related. In the implementation of the DUMAS schema matcher, foreign key
dependencies are extracted from the metadata repository of the database. More
elaborate methods, such as analyzing queries or the database content [DJMS02],
are also conceivable.
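[Diagram: source schema with tables 1 to 5 and target schema with tables A to D; join tables are drawn as ellipses and new correspondences as dashed arrows]
Figure 6.4: Join tables and new correspondences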
Fig. 6.4 shows the created join tables as ellipses. In the following, we denote
the result of joining table X with table Y as XY. One can see that the initial
matching leads to the construction of six join tables: 12, 23, 45, AB, BC, and
CD. Each of those join tables should be compared with tables in the other database
to improve the matching. To facilitate efficient schema matching,
table comparisons must be restricted to tables that have a chance of producing
additional correspondences. The decision of which tables to compare is based on
the known matching: Because table 2 matches with table B, we compare tables
12 and 23, which are derived from 2, with B and tables that were derived from
B (i.e., AB and BC). Tables 12 and 23 are not compared to any other table in
the target schema, because at this point there is no known relationship to any
table other than B. The same reasoning applies to all join tables in the source
and target schemata. The left hand side of Fig. 6.5 shows the comparisons
performed in the first iteration.
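The selection of comparisons for the first iteration can be sketched as follows. The neighbor lists are read off the six join tables named above; the function names and the naming of join tables by sorted concatenation are assumptions made for this illustration.

```python
def join_name(a, b):
    """Name a join table by concatenating the sorted base table names."""
    return "".join(sorted(a + b))


def extensions(table, neighbors):
    """Join tables obtained by extending a table with each of its neighbors."""
    return {join_name(table, n) for n in neighbors[table]}


def first_iteration_comparisons(initial, src_neighbors, tgt_neighbors):
    comparisons = set()
    for s, t in initial:
        src_side = {s} | extensions(s, src_neighbors)
        tgt_side = {t} | extensions(t, tgt_neighbors)
        comparisons |= {(x, y) for x in src_side for y in tgt_side}
        comparisons.discard((s, t))  # already compared in the initial matching
    return comparisons


initial = [("2", "B"), ("5", "D")]
src_neighbors = {"2": ["1", "3"], "5": ["4"]}
tgt_neighbors = {"B": ["A", "C"], "D": ["C"]}
comparisons = first_iteration_comparisons(initial, src_neighbors, tgt_neighbors)
```

For the running example this yields eleven comparisons, the same pairs that Fig. 6.5 lists for the first iteration.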
Using the DUMAS table matcher on tables 23 and BC results in additional
correspondences between attributes of 3 and C, which are depicted as a dashed
arrow in Fig. 6.4. Their target attributes are in table C, but could only be
detected by creating join table BC. Hence, we say that table 23 matches with
BC. Comparing join table 45 with tables D and CD also resulted in new
correspondences. The target attributes of those new correspondences are part
of table D. Because there are no correspondences with C, we conclude that
table 45 matches with D and not CD.
The new correspondences are used in the next iteration to produce larger
join tables, which are also subject to table matching. The table comparisons
performed in the first and second iteration of the example are depicted in
Fig. 6.5. This process stops when no more correspondences can be expected.

First iteration: 12 - B, 12 - AB, 12 - BC, 23 - B, 23 - AB, 23 - BC, 45 - D, 45 - CD, 2 - AB, 2 - BC, 5 - CD
Second iteration: 123 - BC, 123 - ABC, 123 - BCD, 234 - BC, 234 - ABC, 234 - BCD, 345 - D, 345 - CD, 23 - ABC, 23 - BCD, 45 - BCD
Figure 6.5: Table comparisons performed in the running example.
The description of the example exhibits two main aspects of the extension
step: (i) creation of new join tables and (ii) assigning those join tables to other
tables. The latter means that each table (either base or join table) has at most
one matching partner table. This assignment aspect is similar to the initial
matching phase, where each base table is matched to at most one base table.
6.3.2 The Derivation Tree
The matching algorithm is based on derivation trees, which describe how a join
table was created, i.e., which table was extended to create a given join table.
Fig. 6.6 shows the derivation trees of the source and target schemata of the
running example. The name of a node is the same as the name of the table
represented by the node.
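(a) Source database: base tables 1, 2 [B], 3, 4, 5 [D]; join tables 12, 23 [BC],
45 [D]; join tables 123, 234, 345
(b) Target database: base tables A, B [2], C, D [45]; join tables AB, BC [23],
CD; join tables ABC, BCD
Figure 6.6: Derivation trees of the source and target databases (the matching
node of a node is given in brackets)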
The derivation tree of the source database is shown in Fig. 6.6(a). The
child of a node X represents a table that is created as an extension of table
X. E.g., node 12 is the child of node 2 because its table has been created by
extending table 2. The children of the root node represent the base tables, while
nodes further below in the derivation hierarchy represent join tables. Note that
a single join table can be created in more than one way, i.e., by extending more
than one table. Assume that the initial matching contained table matchings
2 ↔ B and 3 ↔ C (Fig. 6.4). In that case, extending either table 2 or table
3 leads to the construction of table 23. Although this join table is created only
once, it is represented by two nodes in the derivation tree (one that is the
child of 2 and one that is the child of 3), which are considered separately
by the algorithm. This ensures that the derivation tree is indeed a tree. As a
consequence of this rule, a single join table can have multiple states: Each node
of 23 represents a role of the table (e.g., “derived from 2” or “derived from 3”),
and the two nodes can have different activation statuses¹, which affects the table
comparisons performed by the schema matcher.
Some nodes in Fig. 6.6 are annotated with the name of their corresponding or
matching node, i.e., the node in the opposite tree that they are assigned to. Recall
from Sec. 6.3.1 that each table is assigned to at most one other table. Fig. 6.6(a)
shows that 2 is assigned to B (as in the initial matching) and 23 corresponds to BC.
Note that node D in the target tree matches with node 45, although in the
initial matching D was related to 5 (Fig. 6.6(b)). However, in the first iteration
of the match extension step, join table 45 is assigned to D, and the target tree
is adapted accordingly. Besides the matching node, the set of correspondences
(i.e., the matching) between the tables is also stored. This is required by the
algorithm described in Sec. 6.3.3 to evaluate the matching computed for child
nodes.
To summarize, each node contains:
1. A reference to the table, which is represented by the node (mandatory).
The function table(n) is used to reference the table of node n.
2. A reference to a partner node, i.e., the node in the opposite derivation
tree, which the node is assigned to (optional). The function partner(n) is
used to reference the partner node of node n.
3. A node matching, i.e., a set of correspondences between the node’s table
and the table of the partner node (optional). The function matching(n)
is used to reference the matching assigned to node n.
The partner node and the matching are optional because not all tables have
a matching partner.
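The node contents listed above map naturally onto a small record type. The following Python sketch is an assumption-laden illustration (the class `Node`, the helper `assign`, and the attribute names `x` and `y` are hypothetical, not taken from the DUMAS implementation); it adds explicit parent and child links so that the tree can be traversed.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)  # nodes are compared by identity, not by content
class Node:
    table: str                            # mandatory: table(n)
    partner: Optional["Node"] = None      # optional: partner(n)
    matching: Optional[frozenset] = None  # optional: matching(n)
    parent: Optional["Node"] = None
    children: list = field(default_factory=list)

    def add_child(self, table):
        """Create a child node for a join table derived from this node."""
        child = Node(table=table, parent=self)
        self.children.append(child)
        return child


def assign(source, target, correspondences):
    """Make two nodes partners and attach their shared node matching."""
    source.partner, target.partner = target, source
    source.matching = target.matching = frozenset(correspondences)


# Hypothetical fragment of the running example: 2 is assigned to B.
node_2, node_b = Node("2"), Node("B")
assign(node_2, node_b, {("2.x", "B.y")})
node_23 = node_2.add_child("23")  # join table 23 in its role "derived from 2"
```

A join table that can be derived in two ways would simply be represented by two `Node` objects with the same `table` value but different parents.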
¹ Active and inactive nodes are discussed below.
6.3.3 The Schema Matching Algorithm
Algorithm 2: DUMAS Schema Matching Algorithm
Input: Source schema s; Target schema t
Output: Schema matching between s and t
 1  match ← initialMatch(s, t);
 2  sourceTree ← new DerivationTree(s, match);
 3  targetTree ← new DerivationTree(t, match);
 4  repeat
 5      nodeMatches ← ∅;
 6      lastMatch ← match;
 7      extend(sourceTree); extend(targetTree);
 8      if numPending(sourceTree) + numPending(targetTree) == 0 then break;
 9      foreach pend in getPending(sourceTree) ∪ getPending(targetTree) do
10          parent ← parent(pend); pPartner ← partner(parent);
11          ppMatch ← match(table(pend), table(pPartner));
12          if sanityCheck(ppMatch) then
13              nodeMatches ← nodeMatches ∪ ppMatch;
14          end
15          nodeMatches ← nodeMatches ∪ childMatches(pend, pPartner, ppMatch);
16      end
17      nodeAssignment ← maximumWeightMatching(nodeMatches);
18      foreach match in nodeAssignment do
19          sNode ← getSource(match); tNode ← getTarget(match);
20          matching(sNode) = match; matching(tNode) = match;
21          if isLeaf(sNode) then activate(sNode);
22          if isLeaf(tNode) then activate(tNode);
23      end
24      match = globalMatching(sourceTree);
25  until |lastMatch| ≥ |match|;
26  return lastMatch;
The algorithm to compute a matching between complex schemata is summa-
rized in Alg. 2. At the beginning the initial matching is constructed as described
in Sec. 6.2.2 (line 1). Afterwards, the derivation trees for the source and target
schemata are constructed (lines 2 and 3).
Lines 4 through 25 describe the iterative part. It starts by initializing two
variables: nodeMatches is a set of matchings between nodes, which are con-
structed in the current iteration. At the beginning of the iteration it is empty
(line 5). The variable lastMatch contains the schema matching that has been
constructed in the previous iteration. In the first iteration, the initial matching
is used (line 6).
Afterwards, active nodes of the source tree and the target tree are extended
(line 7). A node in a derivation tree is an active node if it has produced new
correspondences in the previous iteration. Active nodes can easily be extracted
from a derivation tree: Active nodes are leaf nodes which have a partner node
and a node matching. A node is extended in two steps: First, new join tables
are created by joining the node’s table with neighboring base tables. Neighboring
tables are base tables that are not part of the node’s table but are connected to
one of its constituent base tables by a foreign key dependency. For each of the
neighboring tables, a new join table is created that is the result of joining the
node’s table with the neighboring table². Second, a child node of the active
node is created for each new join table. The newly created
nodes are called pending nodes. The iteration stops if the number of pending
nodes is zero (line 8). This can occur in small schemata, when the active
nodes represent the whole schema, and thus, no further base tables can be joined.
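The notion of neighboring tables can be illustrated with a short Python sketch. It is a simplified illustration rather than the author's code: the function name `neighboring_tables` is hypothetical, and the foreign key layout (a chain 1, 2, 3, 4, 5) is an assumption that is merely consistent with the join tables of the running example.

```python
def neighboring_tables(constituents, foreign_keys):
    """Base tables outside the node's table that share a foreign key
    dependency with one of its constituent base tables."""
    inside = set(constituents)
    neighbors = set()
    for left, right in foreign_keys:
        if left in inside and right not in inside:
            neighbors.add(right)
        elif right in inside and left not in inside:
            neighbors.add(left)
    return sorted(neighbors)


# Assumed foreign key dependencies of the source schema.
foreign_keys = [("1", "2"), ("2", "3"), ("3", "4"), ("4", "5")]

print(neighboring_tables(["2"], foreign_keys))       # extending base table 2
print(neighboring_tables(["2", "3"], foreign_keys))  # extending join table 23
print(neighboring_tables(["1", "2", "3", "4", "5"], foreign_keys))
```

With this layout, extending node 2 leads to join tables 12 and 23, extending node 23 leads to 123 and 234, and a node that already covers the whole schema has no neighbors, so no pending nodes are created.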
Matching Extended Tables
Figure 6.7: Comparing relevant tables (nodes: parent, pend, pPartner, node;
matchings: ppMatch, nodeMatch; legend: established node matching).
After the creation of pending nodes, each of their tables is compared with
some tables in the other database (lines 9 to 16). As stated in Sec. 6.3.1, only
reasonable comparisons should be performed. Given a pending table, we only
want to compare it with tables that were derived from the partner of its parent.
Thus, to determine which tables need to be compared, the parent node parent of
the pending node (which is the active node that has been extended to create the
current pending node) and its partner node pPartner are considered (Fig. 6.7).
First, a matching ppMatch between the table of the pending node and the
table of the parent’s partner is established using the DUMAS table matcher (line
11). A sanity check is performed to determine if the correspondences can be
trusted (line 12). The details of the sanity check are described in Sec. 6.3.4.
If the matching passes the sanity check, then the node matching is added to
nodeMatches (line 13).
Second, node matchings between the current pending node and the nodes
derived from pPartner are constructed (line 15). The function that computes the set
of node matches is shown as Function childMatches.

Function childMatches
Input: Pending node pend, parent’s partner pPartner, table matching
       ppMatch between pend and pPartner
Output: Set of table matchings between pend and tables derived from
        pPartner
    nodeMatches ← ∅;
    foreach node in getDescendants(pPartner) do
        nodeMatch ← match(table(pend), table(node));
        if sanityCheck(nodeMatch) ∧ |nodeMatch| > |ppMatch| then
            nodeMatches ← nodeMatches ∪ nodeMatch;
        end
    end
    return nodeMatches;

The process is similar to the
matching with the parent’s partner, except that the node matching nodeMatch
is added to nodeMatches only if it passes the sanity check and contains more
correspondences than the matching with the parent’s partner node ppMatch.
E.g., the node matching between tables 45 and CD does not fulfill this additional
constraint because the node matching between 45 and D (ppMatch) contains
the same correspondences. Consequently, the node matching 45 ↔ CD is not
further considered.
² If such a join table exists already, e.g., because it has been produced by
extending another node, then it is reused.
Creating Node Assignments
After all pending nodes have been processed, nodeMatches contains matchings
between nodes where at least one of them is pending. Now partner nodes have to
be assigned to pending nodes based on their node matchings (line 17). The goal
is to produce a 1:1 assignment, i.e., each pending node is assigned to at most
one node in the opposite derivation tree, which is not required to be pending.
This problem is related to finding a 1:1 matching of attributes based on field
similarity scores, and hence, a similar solution is applied: A node assignment
matrix with one row for each source node and a column for each target node
is constructed. The value of a matrix cell is based on the matching between
the respective derivation tree nodes. Similar to the initial matching step, the
scoring is based on the assumption that a matching can be trusted if it contains
many correspondences. Consequently, the value of a matrix cell is the sum of
the number of correspondences in the matching between the respective nodes,
which is contained in nodeMatches, and the average field similarity scores of
corresponding attributes, which is a value in the interval (0,1]. The latter is
added to the size of the matching to break ties. If there is no matching between
the nodes in nodeMatches, then the cell value is set to zero.
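As a small worked example of this scoring rule, the following Python sketch computes a cell value from the field similarity scores of a node matching. The function name `cell_score` and the representation of a matching as a plain list of scores are assumptions made for illustration.

```python
def cell_score(similarities):
    """Cell value of the node assignment matrix.

    similarities: field similarity scores (each in (0, 1]) of the
    correspondences in the matching between two nodes.
    """
    if not similarities:
        return 0.0  # no matching between the nodes
    average = sum(similarities) / len(similarities)
    # The size dominates; the average similarity only breaks ties.
    return len(similarities) + average


score_23_bc = cell_score([0.8] * 7)   # 7 correspondences, average 0.8
score_none = cell_score([])
```

Because the average lies in (0, 1], a matching with more correspondences always scores higher than a smaller one, whatever their similarity scores are. Seven correspondences with an average similarity of 0.8 give a cell value of about 7.8.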
Fig. 6.8 shows the node assignment matrix for the running example (zero
values are omitted for clarity). Table 12 does not create additional
correspondences, and thus, does not pass the sanity check. Table 23 is a successful
extension because, when matching it with table BC, correspondences between
attributes of base table 3 and base table C are detected. One can see that the
matching between table 23 and BC contains 7 correspondences with an
average similarity score of 0.8. Table 45 is also a successful extension. However,
its matching with table CD is not larger than its matching with table D, and
thus, the former is not further considered. Pending tables on the target side are
processed accordingly.

      A    B    C    D    AB   BC   CD
1
2
3
4
5
12
23                             7.8
45                   4.6
Figure 6.8: The node assignment matrix for the running example.
Interestingly, the example matrix contains entries only for the correct
assignment. This might not always be the case: Several alternative assignments could
be possible. To resolve this issue, a maximum weight matching is performed on
the node assignment matrix. The result is a set of (source node, target node)
pairs, where at least one of the two nodes is a pending node.
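To make the assignment step concrete, the sketch below computes a maximum weight 1:1 assignment by exhaustive search. This is an illustration only: the text does not state which matching algorithm DUMAS uses, the brute-force search is exponential and suitable only for tiny matrices (a polynomial method such as the Hungarian algorithm would be used in practice), and the entry for (45, BC) is a hypothetical conflict that does not occur in the running example.

```python
from itertools import permutations


def maximum_weight_assignment(weights):
    """Return the 1:1 set of (source, target) pairs with maximum total weight.

    weights: dict mapping (source node, target node) to a positive cell value;
    missing pairs correspond to zero cells and are never assigned.
    """
    sources = sorted({s for s, _ in weights})
    targets = sorted({t for _, t in weights})
    # Padding with None allows a source node to stay unassigned.
    slots = targets + [None] * len(sources)
    best, best_score = [], 0.0
    for perm in permutations(slots, len(sources)):
        pairs = [(s, t) for s, t in zip(sources, perm) if (s, t) in weights]
        score = sum(weights[p] for p in pairs)
        if score > best_score:
            best, best_score = pairs, score
    return best


weights = {
    ("23", "BC"): 7.8,
    ("45", "D"): 4.6,
    ("45", "BC"): 5.0,  # hypothetical competing entry
}
assignment = maximum_weight_assignment(weights)
```

Although 45 scores higher with BC than with D in this made-up matrix, the total weight is maximized by assigning 23 to BC and 45 to D.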
The set of node pairs in the graph matching represents the node assignments,
which must be reflected in the derivation trees. In addition, some nodes become
active while others become inactive (lines 18 to 23). Given a node pair (sn, tn)
of the graph matching, where sn is a source node and tn is a target node, the
partner node of sn is set to tn, and vice versa. This is also necessary for
non-pending nodes: E.g., node D is initially assigned to 5, but is reassigned to 45
in the following iteration. The matching of both nodes is set to the matching
between sn and tn.
After partner nodes and matchings have been determined, some nodes need
to be activated. If a node in pair (sn, tn) is a leaf node, then it becomes an
active, non-pending node, which will be extended in the following iteration.
The activation of nodes is restricted to leaf nodes because non-leaf nodes have
already been extended. All pending nodes that do not have a partner become
regular nodes, i.e., neither active nor pending.
Resolving Match Conflicts
Note that although the node assignment is a proper 1:1 matching of nodes, it
might not be a 1:1 matching of attributes. Some nodes overlap in their base
tables. E.g., nodes 12 and 23 overlap because both contain table 2. Assume that
both nodes have a partner. In that case, their matchings could assign different
target attributes to a given attribute of 2. To resolve this issue, an attribute
matching is produced (line 24).
The attribute matching step is similar to the matching step described in
Chap. 5. First, a matrix is constructed with a row for each source attribute
and a column for each target attribute. The score of a matrix cell should reflect
how much we believe that those attributes correspond. Again, the heuristic
that matchings with many correspondences are more reliable is used in the
scoring method. All node matchings which involve at least one active node
are considered. A node matching NM between nodes sn, whose table is
constructed from base tables R_1, ..., R_m, and tn, whose table is constructed from
base tables S_1, ..., S_n, is broken down into m · n base matchings between their
constituent base tables. E.g., the matching between 45 and D is split into two
base matchings between 4 and D and between 5 and D. A base matching M_ij
between tables R_i and S_j contains only the correspondences of NM whose
source attribute is in R_i and whose target attribute is in S_j. For each such
correspondence ({a}, {b}), the size of M_ij (i.e., the number of correspondences)
is added to the value of the cell in the matrix that represents the attribute pair
(a, b).
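The decomposition into base matchings and the resulting cell values can be sketched as follows. The function name `attribute_scores`, the encoding of an attribute as a (base table, attribute name) tuple, and the attribute names `a`, `b`, `c`, `x`, `y`, `z` are assumptions made for this example.

```python
from collections import defaultdict


def attribute_scores(node_matchings):
    """Accumulate attribute-pair scores from a list of node matchings.

    Each node matching is a list of correspondences; a correspondence is a
    pair ((source base table, attribute), (target base table, attribute)).
    """
    scores = defaultdict(int)
    for node_matching in node_matchings:
        # Split the node matching into base matchings M_ij.
        base_matchings = defaultdict(list)
        for source_attr, target_attr in node_matching:
            key = (source_attr[0], target_attr[0])
            base_matchings[key].append((source_attr, target_attr))
        # Each correspondence is credited with the size of its base matching.
        for correspondences in base_matchings.values():
            for pair in correspondences:
                scores[pair] += len(correspondences)
    return dict(scores)


# Hypothetical matching between join table 45 and table D.
matching_45_d = [
    (("4", "a"), ("D", "x")),
    (("5", "b"), ("D", "y")),
    (("5", "c"), ("D", "z")),
]
scores = attribute_scores([matching_45_d])
```

Here the base matching between 4 and D has size 1 and the one between 5 and D has size 2, so the pair (a, x) receives a score of 1 while (b, y) and (c, z) each receive 2.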
After all node matchings have been processed as described above, a max-
imum weight matching is performed. The resulting attribute pairs are trans-
formed into a matching match. If that matching contains more correspondences
than the matching of the previous iteration lastMatch, then the next iteration
is started. Otherwise, the matching of the previous iteration lastMatch is the
final result of the DUMAS schema matcher.
6.3.4 The Sanity Check
As pointed out above, the table matching algorithm can produce misleading
results when comparing unrelated tables. To reduce the likelihood of false cor-
respondences, the initial matching step produces only matchings with many
correspondences. In each iteration, a large number of table matchings are
produced, and it has to be determined which of those matchings can be trusted.
However, in contrast to the initial matching phase, there is additional infor-
mation which this decision can be based on, namely the matching produced in
the previous iteration. In the following, the heuristics used to determine if a
matching is correct are described.
The sanity check procedure has to decide if a matching match between a
pending node pend and another node, which is either the partner of the pending
node’s parent pPartner or a child node of pPartner, can be trusted (Alg. 2). It
uses the parent matching, which is the matching between parent and pPartner,
to evaluate match. The sanity check’s decision is based on two heuristics:
1. Correspondence Retention: All correspondences contained in the parent
matching must also be contained in match. In other words, no correspon-
dences must be lost by extending tables.
2. Match Extension: The matching match must be larger than the parent
matching. If match does not contain more correspondences than the
parent matching, the extension is not successful, and thus, the matching
should be dropped.
The first heuristic is based on the assumption that the parent matching
is correct. Hence, if some of its correspondences cannot be found in the child
matching, the match must be flawed. The match extension heuristic reflects the
idea that additional correspondences are to be found by extending some tables.
If an extension does not lead to additional correspondences, then the extension
was useless. If correspondences are lost, i.e., match has fewer correspondences
than the parent matching, then it is assumed that match is incorrect. Hence, it is
required that matchings have more correspondences than their parent matchings
in order to pass the sanity check.
As stated above, the correspondence retention heuristic is based on the as-
sumption that the detected correspondences are correct. However, it is not
uncommon that a few attributes are mismatched, even though the tables are
correctly assigned. In a special case, the first heuristic does not need to hold,
and we allow a child matching to ‘overrule’ its parent matching:
3. Attribute Reassignment: If more attributes of the parent’s partner have a
corresponding attribute in the child matching than in the parent matching,
then the correspondence retention heuristic does not need to hold and the
child matching is considered correct.
Source table Emp:
  ID  Comp  StartYear
  1   ACME  1969
  2   ACME  1988
  3   DAG   1990
  4   DAG   1975

Source table Pers:
  ID  Name         BirthYear
  1   John Doe     1949
  2   Jane Miller  1969
  3   Hans Meyer   1975
  4   Max Mueller  1957

Target table Person:
  ID   Name         BirthYear  Comp
  43   John Doe     1949       ACME
  98   Jane Miller  1969       ACME
  72   Hans Meyer   1975       DAG
  100  Max Mueller  1957       DAG

Figure 6.9: Initial matching with one false correspondence
To illustrate the attribute reassignment, consider the scenario in Fig. 6.9:
Both source and target contain information about persons. Source table Pers
contains the name and birth year of persons, while table Emp has employment
data about those persons, namely the name of the company they work for and
the year they started in that company. In the target database, this information
is contained in a single table Person, except that there is no data about the
start year. Assume that the initial matching Emp ↔ Person contains the
correspondences indicated by arrows in Fig. 6.9. Note that there is a false
correspondence between StartYear and BirthYear, which has been created
because the most similar tuple pairs (1,98) and (4,72) have equal values in
those attributes.
Source join table Emp+Pers:
  ID  Comp  StartYear  Name         BirthYear
  1   ACME  1969       John Doe     1949
  2   ACME  1988       Jane Miller  1969
  3   DAG   1990       Hans Meyer   1975
  4   DAG   1975       Max Mueller  1957

Target table Person:
  ID   Name         BirthYear  Comp
  43   John Doe     1949       ACME
  98   Jane Miller  1969       ACME
  72   Hans Meyer   1975       DAG
  100  Max Mueller  1957       DAG

Figure 6.10: Correct matching after extending table Emp.
After extending table Emp by joining it with table Pers, the correct matching
Emp+Pers ↔ Person is produced (Fig. 6.10). The matching contains
three correspondences, and thus, satisfies the second constraint. However, the
correspondence between StartYear and BirthYear, which is part of the
initial matching, was dropped. Instead, BirthYear in Emp+Pers is correctly
assigned to BirthYear in Person. Hence, the correspondence retention
requirement is not met.
However, when comparing the new matching with the parent matching
Emp ↔ Person, one notices that more attributes of the parent’s partner have
a corresponding attribute in the source³. Hence, we consider the new matching
a better matching, and allow it to overrule the parent matching.
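The interplay of the three heuristics in this example can be replayed with a short Python sketch. It reflects one reading of the rules (match extension is always required, and correspondence retention may be replaced by attribute reassignment), and it assumes that the initial matching consists of the Comp correspondence plus the false StartYear and BirthYear correspondence, since the arrows of Fig. 6.9 are not reproduced in the text. The function name `sanity_check` and the qualified attribute names are hypothetical.

```python
def sanity_check(match, parent_match, partner_attrs):
    """Decide whether a child matching can be trusted.

    match, parent_match: sets of (source attribute, target attribute) pairs
    partner_attrs: attributes of the table of the parent's partner
    """
    def covered(matching):
        return {t for _, t in matching if t in partner_attrs}

    extension = len(match) > len(parent_match)        # heuristic 2
    retention = parent_match <= match                 # heuristic 1
    reassignment = len(covered(match)) > len(covered(parent_match))  # heuristic 3
    return extension and (retention or reassignment)


person_attrs = {"Person.ID", "Person.Name", "Person.BirthYear", "Person.Comp"}
parent_match = {                       # assumed content of Emp <-> Person
    ("Emp.Comp", "Person.Comp"),
    ("Emp.StartYear", "Person.BirthYear"),   # the false correspondence
}
child_match = {                        # Emp+Pers <-> Person
    ("Emp.Comp", "Person.Comp"),
    ("Pers.Name", "Person.Name"),
    ("Pers.BirthYear", "Person.BirthYear"),
}
accepted = sanity_check(child_match, parent_match, person_attrs)
```

The child matching loses the StartYear correspondence, so retention fails, but it covers three attributes of Person instead of two and is therefore accepted under this reading.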
The initial matching related table Emp with table Person. Note that the
tuple pairs (1,98) and (4,72), which are used to construct the matching, are not
true duplicates. In fact, both tables represent different entity types. One might
even argue that the tuples in table Emp have no identity and are mere
annotations for the persons represented in Pers. However, as that is not known to the
DUMAS schema matcher, the algorithm has to allow for erroneous matchings.
By facilitating an overruling of previous matches, we take into account that the
table matcher can produce false correspondences when its assumptions do not
hold.
³ Note that in this case the parent’s partner is equivalent to the matching table
of Emp+Pers, because the target database contains only a single table.
6.3.5 Considering Complex Relationships by Deferred Deactivation
For the sake of exposition we have assumed in Sec. 6.3.3 that the algorithm
should stop extending a table if the previous extension has not been successful:
When a join table does not produce any additional correspondences, the sanity
check will fail, and hence, the matching will not be part of nodeMatches. Be-
cause of that, the pending node is not activated at the end of the iteration, and
thus, will not be further extended.
This decision is too restrictive, because certain schema designs require the
algorithm to take two or more steps to detect additional correspondences. A
very common example is an m:n relationship between two entity types, which
is usually represented as three tables in the database. On the other hand, one
might also create only a single table, if redundancy is not an issue. Assume two
databases containing employment information. The source database has a table
Person containing data about employees, a table Comp listing companies, and
an association table WorksFor connecting those two tables. The association
table only contains foreign keys to the primary keys of the other two tables.
In the target database, this m:n relationship is represented as a single table
Employment. Assuming that the initial matching involves table Person,
joining tables Person and WorksFor does not yield additional correspondences,
because the association table does not contain relevant information. If the
algorithm stopped at this point, it would miss the correspondences involving table
Comp.
Instead of discontinuing the extension of a table immediately after an
unsuccessful extension, the deactivation is deferred until the extension has been
unsuccessful in the last k steps. In the implementation k is set to 2, which allows
for the common representation of m:n relationships as three tables. The algorithm
is adapted accordingly: A pending node is activated if it has a matching or if one
of its predecessors within a distance of k − 1 has a matching. Thus, if k is set to
two, a leaf node is activated if the node itself or its parent has an attached node
matching. In addition, the parent matching, which is used in the sanity check, is the
matching of the closest predecessor that actually has a matching. Furthermore,
nodes are reactivated not only when they are part of a node assignment, but
also when a predecessor with a distance smaller than k is part of an assignment.
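The adapted activation rule can be sketched as a walk up the derivation tree. The following Python fragment is an illustration under assumptions: nodes are modelled as plain dictionaries with hypothetical keys `matching` and `parent`, and the table names follow the employment example above.

```python
def should_activate(node, k=2):
    """Activate a node if it, or a predecessor at distance at most k - 1,
    has an attached node matching."""
    current, distance = node, 0
    while current is not None and distance <= k - 1:
        if current.get("matching"):
            return True
        current, distance = current.get("parent"), distance + 1
    return False


# Person has a matching; joining in WorksFor adds no correspondences.
person = {"table": "Person", "matching": {("name", "name")}, "parent": None}
person_worksfor = {"table": "Person+WorksFor", "matching": None, "parent": person}
# Hypothetical second unsuccessful extension, two steps away from Person.
third_step = {"table": "Person+WorksFor+X", "matching": None, "parent": person_worksfor}

keep_extending = should_activate(person_worksfor, k=2)
give_up = should_activate(third_step, k=2)
```

With k = 2 the node for Person+WorksFor stays active, so Comp can still be joined in during the next iteration, whereas a node whose two closest levels carry no matching is deactivated.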
6.4 Table Extension vs. Duplicate Extension
Finding duplicates in two unaligned tables is a complex procedure: Its inherent
complexity is O(n^2), because each source tuple must be compared to each target
tuple. Using the Whirl algorithm to detect the most similar tuple pairs drasti-
cally reduces the actual number of tuple comparisons. Nevertheless, duplicate
detection is still an expensive operation, in particular when the tables contain
many tuples.
The schema matching algorithm described above uses the DUMAS table
matcher to compare two tables. Although only tables that can be expected to
contain duplicates are compared, the number of table comparisons can become
large. Besides table comparison, the actual creation of join tables is another
performance issue. Again, the algorithm aims at reducing the number of join
tables. However, the effort of constructing join tables is considerable.
One way to improve efficiency is to reduce the size of the tables. Table
extension, i.e., joining whole tables, is an expensive operation if tables are large. A
possible adaptation of the algorithm to improve performance is duplicate
extension: Assume that the initial matching between tables 2 and B was established
using k duplicates. In the extension step, the tuples of table 2 that are part
of the k duplicates are used to construct join tables 12 and 23 by joining those
tuples with tables 1 and 3, respectively. Join tables AB and BC are created
in the same way. Because the resulting join tables are much smaller than those
created by table extension, less data has to be written to the database, and thus,
the construction of join tables is much faster. The performance of duplicate
detection is also improved, because fewer tuples have to be compared. In general,
extending duplicate tuples as opposed to whole tables can be considered a major
efficiency improvement.
Unfortunately, similar to other problems in computer science, there is a
tradeoff involved. Extending the ‘best’ tuple pairs can result in join tables
whose most similar tuple pairs are not as similar as the most similar tuple pairs
that would have been detected when extending whole tables. Consequently,
some correspondences can be missed.
The underlying reason for the loss in accuracy is variable data. To illustrate
the problem, consider the example depicted in Fig. 6.11. Assume that the initial
matching (depicted by solid arrows) is created using duplicates (1,43) and (3,72)
of the Pers tables. Note that the birth year of a person never changes, and
the name does not change frequently. Joining those tuples with the WorksFor
tables results in tuples where the Comp attribute, which represents the company
the persons work for, has different values for duplicate tuples. This is not
unusual, because people change their employers, and the two data sources might
have created their data at different points in time. Because the Comp values
of duplicate tuples differ, the correspondence between the two
Comp attributes cannot be detected by the DUMAS table matcher.
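The effect can be reproduced with the data of Fig. 6.11. The Python sketch below is a simplified illustration, not the DUMAS table matcher: it assumes that WorksFor.ID references Pers.ID, uses hypothetical helper names (`join`, `agreeing_pairs`), and simply counts on how many duplicate pairs the Comp values agree.

```python
def join(rows, other_rows, key, only_keys=None):
    """Join two tables on a shared key; optionally extend only some tuples."""
    selected = [r for r in rows if only_keys is None or r[key] in only_keys]
    return [{**r, **o} for r in selected for o in other_rows if o[key] == r[key]]


def agreeing_pairs(src_rows, tgt_rows, pairs, attr, key="ID"):
    """Duplicate pairs whose values agree on the given attribute."""
    src = {r[key]: r for r in src_rows}
    tgt = {r[key]: r for r in tgt_rows}
    return [(s, t) for s, t in pairs
            if s in src and t in tgt and src[s][attr] == tgt[t][attr]]


names = ["John Doe", "Jane Miller", "Hans Meyer", "Max Mueller"]
src_pers = [{"ID": i, "Name": n} for i, n in zip([1, 2, 3, 4], names)]
tgt_pers = [{"ID": i, "Name": n} for i, n in zip([43, 98, 72, 100], names)]
src_works = [{"ID": 1, "Comp": "ACME"}, {"ID": 2, "Comp": "ACME"},
             {"ID": 3, "Comp": "DAG"}, {"ID": 4, "Comp": "DAG"}]
tgt_works = [{"ID": 43, "Comp": "ABC"}, {"ID": 98, "Comp": "ACME"},
             {"ID": 72, "Comp": "ABC"}, {"ID": 100, "Comp": "DAG"}]

# Duplicate extension: only the tuples of duplicates (1,43) and (3,72).
dup_src = join(src_pers, src_works, "ID", only_keys={1, 3})
dup_tgt = join(tgt_pers, tgt_works, "ID", only_keys={43, 72})
dup_hits = agreeing_pairs(dup_src, dup_tgt, [(1, 43), (3, 72)], "Comp")

# Table extension: whole tables are joined, so all persons are available.
all_pairs = [(1, 43), (2, 98), (3, 72), (4, 100)]
full_hits = agreeing_pairs(join(src_pers, src_works, "ID"),
                           join(tgt_pers, tgt_works, "ID"), all_pairs, "Comp")
```

With duplicate extension no pair agrees on Comp, whereas the whole-table join still contains the persons with IDs (2,98) and (4,100), whose Comp values coincide and could therefore support the correspondence.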
Another problem of duplicate extension is the possible lack of join partners
for duplicate tuple pairs: If a duplicate tuple is part of a referenced table, then
there does not need to be a tuple that has a foreign key to that tuple. As
a consequence, extending duplicates might require several database operations
to search for tuples that have join partners, which could easily become more
expensive than a single table join.

Source table WorksFor:
  ID  Comp
  1   ACME
  2   ACME
  3   DAG
  4   DAG

Source table Pers:
  ID  Name         BirthYear
  1   John Doe     1949
  2   Jane Miller  1969
  3   Hans Meyer   1975
  4   Max Mueller  1957

Target table WorksFor:
  ID   Comp
  43   ABC
  98   ACME
  72   ABC
  100  DAG

Target table Pers:
  ID   Name         BirthYear
  43   John Doe     1949
  98   Jane Miller  1969
  72   Hans Meyer   1975
  100  Max Mueller  1957

Figure 6.11: Extending duplicates: tuples with outdated information.
Because the accuracy of schema matching can decrease when duplicate extension
is applied, we chose to perform table extension. However, alternative solutions
are conceivable. When the size of the schemata under consideration is too
large, the cost of schema matching using the above algorithm might become
prohibitive. In that case, it is suggested to create join tables as described above
with the n · k most similar tuple pairs, where n can be any number larger than
1. Using a large n increases the size of the join tables, but also reduces the risk
of missing attribute correspondences. Unfortunately, this procedure does not
guarantee that all n · k duplicate tuples have a join partner.
6.5 Experimental Evaluation
The algorithm described above has been implemented and tested on real-world
data. The experimental results are reported in this section.
6.5.1 Implementation and Data
The DUMAS schema matching algorithm has been implemented in Java 1.4.
The data used in the experiments reside on an IBM DB2 UDB V8.1 database,
which is accessed by the implementation using JDBC. We store join tables
created by the algorithm as declared global temporary tables in the database.
Statistics for each table, which are required by the similarity measures of the
DUMAS table matcher, are collected only once and stored on the file system.
To assess the effectiveness of the DUMAS table matcher, two data sets
containing information about cricket players were used. These data sets have
previously been used in [DLD+04] to evaluate the iMap schema matcher (see
Sec. 2.3.4). The original data had been extracted from two web sites, Cricinfo⁴
and Cricketbase⁵, and transformed into XML format. Because the DUMAS
schema matcher works on relational schemata, the data had to be transformed
into the relational model such that the schemata closely reflect the original XML
structure.
(a) Cricinfo schema (labels: Player, Statistics, Bowling, Fielding, Batting, ODI, Test)
(b) Cricketbase schema (labels: Player, Statistics, Bowling, Batting, ODI, Test)
Figure 6.12: Cricket schemata
Fig. 6.12 depicts the schemata of the two cricket databases. Tables are rep-
resented as boxes and foreign keys as dashed arrows. Each player has (optional)
statistics about one-day internationals (ODI) and / or test games. Statistics
are comprised of bowling, batting, and (only in Cricinfo) fielding statistics. The
number of attributes in each table and the number of correspondences are shown
in Fig. 6.13. Note that some of the correspondences involve complex value trans-
formations, which are not detected by the DUMAS matcher. Thus, in addition
to the overall number of correspondences, the number of correspondences in-
volving transformations is provided in parentheses.
Table     Cricinfo  Cricketbase  Correspondences
Player    21        26           8 (1)
Batting   9         7            6
Fielding  2         N/A          N/A
Bowling   11        9            8 (1)
Sum       43        42           22 (2)
Figure 6.13: Number of attributes in schemata and number of correspondences.
⁴ http://www.cricinfo.com/
⁵ http://www.cricketbase.com/
6.5.2 Accuracy of Schema Matching
The goal of the experiment is to evaluate the result of schema matching using
the previously applied precision and recall measures. In addition, the F-Measure
is used to get a combined score for both precision and recall:
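\[
\text{Precision} = \frac{|C \cap R|}{|R|}, \qquad
\text{Recall} = \frac{|C \cap R|}{|C|}, \qquad
\text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
where C is the set of true correspondences and R is the set of correspondences
returned by the matching algorithm.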
Figure 6.14: Matching quality (bar chart for the cricket scenario; bars for
F-Measure, Precision, Recall, and Recall (Simple); vertical axis in percent).
Fig. 6.14 shows the schema matching accuracy of the DUMAS schema matcher
on the cricket data. The high precision of 90% suggests that no false duplicates
have contributed to the final matching. This assumption has been confirmed
by visual examination of intermediate results. Two recall measures are pro-
vided: one that is based on all true correspondences and one that considers
only simple correspondences, i.e., correspondences that do not require value
transformations. The numbers suggest that our matching algorithm performs
well with respect to simple correspondences (recall = 90% for simple correspon-
dences), but can be improved when value transformations are considered (recall
= 76% for all correspondences). The resulting F-Measure is 82%. Note that the
results show an improvement over previously reported results of less than 80%
accuracy for the same data set [DLD+04].
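As a quick arithmetic check of the reported numbers, the measures defined above can be written as small Python functions. The function names and the toy correspondence sets are assumptions made for illustration; only the precision of 90% and the recall of 76% are taken from the text.

```python
def precision_recall(true_corrs, returned):
    """Precision and recall of a returned matching against the true one."""
    hits = len(true_corrs & returned)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(true_corrs) if true_corrs else 0.0
    return precision, recall


def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the cricket data: precision 90%, recall 76%.
reported_f = f_measure(0.90, 0.76)
print(round(reported_f * 100))

# Toy example with hypothetical correspondences.
toy_true = {("a", "x"), ("b", "y"), ("c", "z"), ("d", "w")}
toy_returned = {("a", "x"), ("b", "y"), ("c", "q")}
toy_precision, toy_recall = precision_recall(toy_true, toy_returned)
```

Plugging in the reported precision and recall gives an F-Measure of about 0.824, which agrees with the 82% stated above.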
Note that the final outcome did not depend on the number of corresponding
tables in the initial matching: When the algorithm starts with a matching
between Player tables, it produces the same result as when it starts with
correspondences between Player, Batting, and Bowling tables. The small size of
the schemata certainly contributed to this positive outcome.
[Bar chart "Matching Quality"; vertical axis: Percent (0 to 100); bar groups: F-Measure, Precision, Recall, Recall (Simple); series: No fault injection, False initial matching.]
Figure 6.15: Matching quality with false initial matching.
The initial matching produced by the algorithm described in Sec. 6.2 never
contained false table pairs. To examine how the algorithm handles false dupli-
cates, we started the extension phase with an incorrect initial matching between
Batting in one schema and Bowling in the other. The result of this experiment
is depicted in Fig. 6.15 (see ‘False initial matching’). The quality of the matching
was surprisingly good. The reason for this is the increase in duplicate detection
accuracy when the Player tables are joined in. The matching extracted from
those duplicates contains more correspondences for the previously matched ta-
bles. As a result, the attribute reassignment rule described in Sec. 6.3.4 is used
to overrule the previous matchings. We point out that the use of false correspon-
dences comes at the price of additional iterations, and thus, increased running
time.
6.6 Discussion
The DUMAS schema matcher, which is capable of finding attribute correspon-
dences in complex schemata, is described in this chapter. To the best of our
knowledge, it is the first duplicate-based schema matcher that works on whole
relational schemata: Other matchers compare only two tables or, in the case of
iMap, XML data, for which duplicates have to be manually provided.
As a first step, an initial matching is created. Such an initial matching relates
only a subset of all tables. One goal of this phase is to avoid false correspon-
dences, which could mislead the algorithm. Hence, a very conservative strategy
is applied: Only table matchings with a large number of correspondences are
used. A global threshold, which depends on the maximum number of correspon-
dences, is applied for that purpose. Although the sanity check procedure has
been shown to be very effective in overruling false correspondences, we believe that
a conservative approach is more promising in order to achieve a quality schema
matching.
Based on the initial matching, tables are extended to produce additional
correspondences. Those correspondences are used to create more tables, thus,
starting the process again. It is obvious that a possibly large number of tables
has to be created and compared. The heuristics described in this chapter aim at
reducing both the number of join tables and the number of table comparisons.
Furthermore, a strategy to drastically reduce the size of join tables is described,
which makes the algorithm applicable in very large databases. Although this
strategy comes at the cost of a possible decrease in schema matching accuracy,
the user should be given the choice between efficiency and quality. However, it
has to be kept in mind that schema matching is an off-line process, i.e., it is
performed during the development of a system and not at runtime, and thus,
efficiency is a secondary goal.
Chapter 7
Finding Complex Matchings
The DUMAS table matching algorithm is designed to detect simple correspon-
dences, which align one source attribute with one target attribute. While 1:1
correspondences are very common, 1:n or m:n correspondences also occur in
practice: E.g., the name of a person can be represented as a single attribute
or several attributes for given name, middle initial, and surname. This chapter
describes the DUMAS complex matcher that facilitates the detection of such
complex correspondences. An analysis of the existing algorithm reveals that
some of its components can be reused, and only the matrix construction of the
matching step needs to be adapted. We describe a
general search procedure that creates all possible matrices. Because the num-
ber of matrices is far too large for real-world scenarios, various optimizations
that facilitate the applicability of the DUMAS complex matcher in practice are
discussed. Finally, the algorithm is experimentally evaluated.
7.1 Problems Associated With Complex Matchings
The problem of finding complex matchings has only recently been addressed by
the research community. Early schema matching solutions considered only 1:1
correspondences, because it is assumed that in most scenarios the majority of
existing correspondences are simple. However, complex matches do occur and
must be detected by a schema matcher if it is to be applied in practice. Recall
from Sec. 2.2.2 that a correspondence is of the form $(\{a_1, \ldots, a_m\}, \{b_1, \ldots, b_n\})$,
where each $a_i$ is a source attribute and each $b_j$ is a target attribute. Simple
correspondences align single attributes (i.e., $m = 1$ and $n = 1$), while complex
correspondences relate several attributes (i.e., $m \geq 1$ and $n \geq 1$). Fig. 7.1
depicts a very common example for a complex relationship: While the source
table R contains separate attributes for first name (FN) and last name (LN),
the target table S has only a single attribute for the full name (N). Hence, if
the data of the source table is to be moved to the target table, the values of
FN and LN need to be concatenated to create values for N. In addition to
names of persons, both tables contain address information (A1 and A2). Thus,
the correct matching is M = {({FN, LN}, {N}), ({A1}, {A2})}.
R
FN     LN      A1
Sam    Adams   321 Evergreen Ave,...
:      :       :

S
N            A2
Sam Adams    321 Evergreen Ave,...
:            :

Figure 7.1: A complex matching between R and S.
Although complex matchings exist in many scenarios, the majority of existing
algorithms only detect simple correspondences. One reason for the lack of
complex matchers is the added complexity when attribute sets with more than one
element need to be considered. The space of possible matchings between two tables
is drastically increased: If only simple correspondences (singleton attribute groups)
are considered, then $|R| \cdot |S|$ simple correspondences between tables R and S
are possible. If the bound on the size of the attribute groups is removed, then
the number of possible correspondences becomes exponential in the size of the
tables: Given a relation R with |R| attributes, there are $2^{|R|} - 1$ possible
combinations of attributes if order is not relevant (see footnote 1). Hence, there are
$(2^{|R|} - 1) \cdot (2^{|S|} - 1)$ possible complex correspondences between tables R and S.
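To make the growth concrete, the two counts can be computed directly. This is a small illustrative sketch; the function names are assumptions introduced here.

```python
def simple_correspondences(r: int, s: int) -> int:
    """Number of 1:1 correspondences between tables with r and s attributes."""
    return r * s


def complex_correspondences(r: int, s: int) -> int:
    """Number of pairs of non-empty attribute groups (order within a group ignored)."""
    return (2**r - 1) * (2**s - 1)


# Running example: R has three attributes (FN, LN, A1), S has two (N, A2).
print(simple_correspondences(3, 2), complex_correspondences(3, 2))

# The gap widens quickly for tables of moderate size.
print(simple_correspondences(20, 20), complex_correspondences(20, 20))
```

For the running example this gives 6 simple and 21 complex candidate correspondences; for two tables with 20 attributes each, the complex count already exceeds 10^12.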
While the search space is large but bounded, the number of possible functions
to combine attribute values is infinite. Although the schema matching output
is only comprised of attribute sets and not the functions to combine attribute
values, we note that applying different functions in the matching process can
improve the quality of the matching. We do not address this problem in this
thesis and only consider concatenation of strings. However, we point out that
the adapted matching step, which is described in the following, can also be
applied with other functions.
1 Given a set S with n elements, its power set contains $2^n$ elements. The power set also
includes the empty set, which we wish to exclude in our considerations.
7.2 Adapting the DUMAS Table Matcher
Table matching is comprised of two steps, which are subject to change in order
to facilitate the detection of complex correspondences. The duplicate detection
step uses the tuple similarity measure tupsim to find duplicates. Recall that this
measure ignores the record structure and is order-independent. Consequently,
it can also be used to detect duplicates for complex matching if concatenation
of string values is the only considered attribute combination function: In the
example in Fig. 7.1, concatenating the first name ‘Sam’ with the last name ‘Adams’
yields the string ‘Sam Adams’, which is equal to the value of column N in
the target table. The source attributes FN and LN can be combined in any
order, because ‘Sam Adams’ and ‘Adams Sam’ are considered equal by the tuple
similarity measure.
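The order-independence argument can be illustrated with plain token bags. The sketch below is not the tupsim measure itself; it only assumes, as stated above, that values are compared as bags of tokens, and the helper name token_bag is hypothetical.

```python
from collections import Counter


def token_bag(*values: str) -> Counter:
    """Bag (multiset) of lower-cased whitespace tokens of all given values."""
    bag = Counter()
    for value in values:
        bag.update(value.lower().split())
    return bag


# Concatenating FN and LN in either order yields the same bag as the value of N.
print(token_bag("Sam", "Adams") == token_bag("Adams", "Sam") == token_bag("Sam Adams"))
```

Because the bag does not record token positions, any measure defined on it cannot distinguish ‘Sam Adams’ from ‘Adams Sam’.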
The existing duplicate detection procedure can also be applied when the
extended tuple similarity measure etupsim is used. The unmatched part is
compared using tupsim, and thus, no changes are necessary. Corresponding
attributes are compared using the field similarity measure fieldsim. Because
this measure is also order-independent, etupsim can be used when the known
partial matching contains complex correspondences.
The matching step is a different issue: Because the algorithm described
in Chap. 5 is not designed to find complex matches, it will detect the corre-
spondence between the address attributes and probably a false correspondence
involving attribute N. Fig. 7.2 depicts the average similarity matrix for the
example scenario with the weights of edges that are the result of maximum
weight matching highlighted in bold. Depending on the pruning threshold used
in the matching step, the incomplete correspondence ({LN},{N}) could enter
the matching.
        N      A2
FN      0.6    0.1
LN      0.7    0.1
A1      0.1    1
Figure 7.2: Average similarity matrix for example tables.
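The effect on the start matrix can be reproduced with a brute-force maximum weight matching. This is a sketch for very small matrices only; the exhaustive search and the two pruning thresholds are assumptions chosen for illustration, not the matching procedure of Chap. 5.

```python
from itertools import permutations

# Average similarity matrix of Fig. 7.2 (rows: source, columns: target).
START = {
    "FN": {"N": 0.6, "A2": 0.1},
    "LN": {"N": 0.7, "A2": 0.1},
    "A1": {"N": 0.1, "A2": 1.0},
}


def max_weight_matching(sim):
    """Exhaustively find the one-to-one edge set with the largest total weight."""
    rows = list(sim)
    cols = list(next(iter(sim.values())))
    if len(rows) >= len(cols):
        candidates = (list(zip(chosen, cols)) for chosen in permutations(rows, len(cols)))
    else:
        candidates = (list(zip(rows, chosen)) for chosen in permutations(cols, len(rows)))
    best_edges, best_weight = [], float("-inf")
    for edges in candidates:
        weight = sum(sim[r][c] for r, c in edges)
        if weight > best_weight:
            best_edges, best_weight = edges, weight
    return best_edges


def prune(sim, edges, threshold):
    """Keep only edges whose weight reaches the pruning threshold."""
    return [(r, c) for r, c in edges if sim[r][c] >= threshold]


edges = max_weight_matching(START)
print(sorted(edges))
print(prune(START, edges, 0.5))  # hypothetical low threshold
print(prune(START, edges, 0.8))  # hypothetical high threshold
```

The matching selects the edges LN to N and A1 to A2. With the lower threshold the incomplete correspondence ({LN}, {N}) survives pruning; with the higher one only the address correspondence remains.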
Because the existing procedure cannot find complex correspondences, the
matching step needs to be adapted. Fortunately, some of the techniques can be
reused:
1. Field similarity measure: The field similarity measure fieldsim is order-
independent, and thus, can also be used to compare attributes when their
values are concatenated.
2. Similarity matrices: The construction of the similarity matrices for each
duplicate has to be adapted, because combinations of attributes have to
be considered. As in the case of simple matchings, an average similar-
ity matrix can be constructed based on the similarity matrices for each
duplicate.
3. Maximum weight matching: We will show how the same maximum weight
matching process can be applied to detect complex correspondences.
To summarize, the complex matching step is very similar to the matching
step for simple matchings. The main difference is the consideration of com-
binations of attributes in the construction of similarity matrices. Hence, in
addition to single attributes, the matrices for each duplicate must contain rows
and columns for attribute groups, which are combinations of attributes. Given a
tuple r of relation R, the value of an attribute group $A \in 2^{att(R)}$ is a single string
that contains r’s space-separated values of the attributes in A. Note that a sin-
gle attribute can be considered a singleton attribute group. In the following,
the terms ‘attribute’ and ‘singleton attribute group’ are used interchangeably.
Given that a tuple’s value of an attribute group is simply a concatenation of
attribute values, the construction of the similarity matrices for each duplicate
is straightforward. If all matrices have the same structure (i.e., each matrix
represents the same attribute groups for both tables), then the average similarity
matrix can be computed as described in Sec. 5.3.1.
7.3 Searching For Complex Correspondences
The complex matching algorithm is designed under the premise that most com-
ponents of the existing matching step can be reused. In particular, the field
similarity measure and the methods for creating, aggregating, and reasoning
on similarity matrices have been shown to be very effective, and thus, should
be applied in the search for complex correspondences.
As described above, the main difference is the construction of the similarity
matrices for each duplicate, and consequently, the average similarity matrix, on
which maximum weight matching is performed. The structure of those matrices
is different than in the simple case because attribute groups have to be consid-
ered. We discuss two ways of structuring similarity matrices and describe the
search space that has to be traversed by the algorithm in the following. Finally,
we identify several problems that have to be solved by the matching algorithm.
7.3.1 The Matrix Structure
In addition to single attributes, which essentially are singleton attribute groups,
similarity matrices need to represent attribute groups consisting of more than
one attribute. In the following we assume that the construction is based on the
start matrix (Fig. 7.2), which contains only singleton attribute groups and is
constructed as described in Chap. 5.
Sec. 2.3.4 describes the complex matching approach by Chua et al. [CCL03].
They extend tables by adding rows and columns for attribute groups consisting
of two or more attributes. As we have shown, this approach can lead to false
correspondences because attribute groups may overlap.
Because of the drawbacks of the extension procedure, we decided to follow
a merge strategy: Instead of adding additional rows and columns for larger
attribute groups, those new attribute groups replace their constituent attributes
in the merge strategy. Fig. 7.3(a) shows the matrix that is the result of merging
attributes FN and LN. The rows for those two attributes have been replaced
by a row for their union {FN, LN}. This strategy has the advantage that the
existing maximum weight matching procedure can be applied to detect complex
correspondences: Because there is no overlap between attribute groups, the
problems described in Sec. 2.3.4 do not occur.

(a) Correct grouping: {FN, LN}.
             {N}     {A2}
{FN, LN}     1       0.1
{A1}         0.1     1

(b) Incorrect grouping: {FN, A1}.
             {N}     {A2}
{FN, A1}     0.25    0.6
{LN}         0.7     0.1

Figure 7.3: Merging attributes to detect a complex matching.
7.3.2 Searching for a Matrix Structure
The general idea of the algorithm is to start with a simple matching (as depicted
in Fig. 7.2) and subsequently merge some attributes if the merging improves the
matching. Fig. 7.3(a) shows the matrix after attributes FN and LN have been
merged into an attribute group. Merging those two attributes was
a good choice because the only complex correspondence in this scenario
requires those two attributes to be concatenated. One can easily observe the
improvement in the table matching, which is represented as bold numbers in
Fig. 7.3(a): There are still two correspondences, but in contrast to the start
matrix, both have a score of 1. In contrast, Fig. 7.3(b) shows the matrix with
attributes FN and A1 being combined. The bold numbers indicate that the
matching has degraded.
Before describing the space of possible matrices, we define a merge function
merge(M, A, B) that takes a matrix M and merges attribute groups A and B,
resulting in a new matrix M'. Obviously, the origin of both A and B must be
either the source or the target table. If the attributes of A and B are source
attributes, then the structure of M' is the same as the structure of M except that
the rows for A and B have been replaced by a row for {A, B}. The computation
of the values for the new row is straightforward: Given a target attribute (or
attribute group) C, the field similarity fieldsim(A ∪ B, C) (see footnote 2) between
attribute group {A, B} and C is computed for each duplicate tuple pair. To do so, the
values of attributes A and B are concatenated and compared with the value of
C using the field similarity measure. The value of M'({A, B}, C) is the average of
the field similarities of the duplicates. The process is analogous if A and B are
target attribute groups.
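The merge function for source attribute groups can be sketched as follows. The plain token cosine used here is a simplified stand-in for fieldsim (an assumption, so the numbers differ from Fig. 7.2), and all function names and the single example duplicate are hypothetical.

```python
from collections import Counter
from math import sqrt


def token_bag(record, group):
    """Union of the token bags of all attributes in an attribute group."""
    bag = Counter()
    for attribute in group:
        bag.update(record[attribute].lower().split())
    return bag


def cosine(a, b):
    """Plain token cosine; a simplified stand-in for fieldsim (assumption)."""
    dot = sum(count * b[token] for token, count in a.items())
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def average_row(group, target_groups, duplicates):
    """Average similarity of one source group against every target group."""
    return {
        target: sum(cosine(token_bag(r, group), token_bag(s, target))
                    for r, s in duplicates) / len(duplicates)
        for target in target_groups
    }


def merge_source(matrix, a, b, duplicates):
    """Replace the rows of source groups a and b by one row for their union."""
    merged = a | b
    targets = list(matrix[a])
    child = {g: row for g, row in matrix.items() if g not in (a, b)}
    child[merged] = average_row(merged, targets, duplicates)
    return child


# One hypothetical duplicate pair, modelled on Fig. 7.1.
duplicates = [(
    {"FN": "Sam", "LN": "Adams", "A1": "321 Evergreen Ave"},
    {"N": "Sam Adams", "A2": "321 Evergreen Ave"},
)]
FN, LN, A1 = frozenset({"FN"}), frozenset({"LN"}), frozenset({"A1"})
N, A2 = frozenset({"N"}), frozenset({"A2"})

start = {g: average_row(g, [N, A2], duplicates) for g in (FN, LN, A1)}
child = merge_source(start, FN, LN, duplicates)
print(start[FN][N], child[FN | LN][N])
```

Merging FN and LN raises the similarity with N from about 0.71 to 1 in this toy setting, which mirrors the improvement from Fig. 7.2 to Fig. 7.3(a).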
2 Because fieldsim compares bags of tokens representing attribute values, combining two
attribute values essentially resolves to creating the union of their token bags. Hence, we denote
the combination of attribute values as A ∪ B in fieldsim.
The search through the space of possible matrices is based on the merge
function. Fig. 7.4 depicts all matrices that can be derived from the start matrix
in Fig. 7.2. The grey cells represent the edges that are part of the graph
matching and, after pruning low-weight edges, would constitute the complex
table matching. Each arrow indicates one application of the merge function.
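[Search-space diagram. Each node is a matrix structure, listed here as source attribute groups × target attribute groups.]
{FN}, {LN}, {A1} × {N}, {A2} (start matrix)
{FN, LN}, {A1} × {N}, {A2}
{FN, A1}, {LN} × {N}, {A2}
{LN, A1}, {FN} × {N}, {A2}
{FN}, {LN}, {A1} × {N, A2}
{FN, LN, A1} × {N}, {A2}
{FN, LN}, {A1} × {N, A2}
{FN, A1}, {LN} × {N, A2}
{LN, A1}, {FN} × {N, A2}
{FN, LN, A1} × {N, A2}
Figure 7.4: The complete search space of the running example.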
As stated above, the general idea of the match algorithm is to search through
the space of possible matrices and pick the ‘best’ matrix and its inherent graph
matching, which represents the resulting complex correspondences. Although
the idea sounds simple, two challenges have to be tackled by the algorithm:
1. Score function: The score of a matrix needs to reflect the ‘quality’ of the
matching with respect to other matrices, i.e., better matrices must have
the higher scores.
2. Number of attribute combinations: In each search step an exponential
number of possible merges exist. Assume that a matrix M has m source
attribute groups and n target attribute groups. There are $\binom{m}{2}$ matrices
that can be derived from M by merging source attributes and $\binom{n}{2}$
matrices that can be created by merging target attributes. Altogether,
there are $B(|R|) \cdot B(|S|)$ possible matrices for tables R and S, where $B(n)$
is the Bell number (i.e., the number of possible partitions) for a set with
n elements [Rot64]. This number increases exponentially with n: E.g.,
$B(10) = 115{,}975$, while $B(100)$ is a 116-digit number.
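The Bell numbers quoted above can be reproduced with the Bell triangle. This is an illustrative sketch; the function name bell is an assumption.

```python
def bell(n: int) -> int:
    """n-th Bell number, computed with the Bell triangle."""
    row = [1]
    for _ in range(n):
        new_row = [row[-1]]  # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


# Running example: three source and two target attributes.
print(bell(3) * bell(2))

print(bell(10))
print(len(str(bell(100))))
```

For the running example the product is 5 · 2 = 10, which matches the ten matrices of Fig. 7.4.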
Although only a few duplicates and not the whole tables are used to compute
the matrices, reducing the number of matrices that have to be considered by
the algorithm is crucial to facilitate its applicability in scenarios where tables
contain many attributes. The complex matching algorithm, which is described
in the next section, uses various criteria to prune the search space and only use
matrices that improve the matching.
7.4 Detecting 1:n Matchings
The complex matching algorithm described in Sec. 7.4.1 applies a breadth-first
search procedure to traverse the space of possible matrices. As pointed out
before, this space is very large, and thus, measures to reduce the number of
matrices that are considered have to be taken. Sec. 7.4.2 describes various
heuristics to prune the search space. Measures to assess the improvement in the
matching in each search step are discussed in Sec. 7.4.3.
7.4.1 Discovering the Best Matrix
The complex table matching algorithm, which is shown in Alg. 4, begins the
search for the best matrix with the start matrix, which is the start state for the
search process (line 1). This matrix is constructed as described in Chap. 5. The
variable newMatrices represents all matrices produced in the current search
step. Initially it contains the start matrix (line 2). The ‘best’ matrix discovered
until a given point is kept as bestMatrix, which is also the start matrix at the
beginning (line 3).
Lines 4 to 14 describe the search process. The search continues only if additional
matrices have been generated in the previous iteration (line 4). All of
those matrices are considered in the current search step (line 5). We also have
considered beam search as the underlying search procedure, which would only
consider the best k matrices in each search step, and thus, reduce the search
space. However, the heuristics described in the following section have proven
to be very effective, rendering additional pruning unnecessary. We also need to
compare the currently best matrix bestMatrix with the best matrix created in
the previous iteration. If there has been an improvement, we need to update
bestMatrix (line 6).
Each of the matrices is used to generate child matrices, which are constructed
by merging attribute groups (line 9). Recall that only child matrices that im-
prove the matching are created. Those matrices are added to newMatrices and
used in the subsequent search step to construct additional matrices (line 11).
Algorithm 4: DUMAS Complex Table Matching Algorithm
Input: Source table R; target table S
Output: Complex matching between R and S
 1  M ← startMatrix(R, S);
 2  newMatrices ← {M};
 3  bestMatrix ← M;
 4  while newMatrices ≠ ∅ do
 5      matrices ← newMatrices;
 6      bestMatrix ← best(bestMatrix, newMatrices);
 7      newMatrices ← empty;
 8      foreach M ∈ matrices do
 9          childMatrices ← childMatrices(M);
10          if childMatrices ≠ ∅ then
11              newMatrices ← newMatrices ∪ childMatrices;
12          end
13      end
14  end
15  GM = graphMatch(bestMatrix);
16  M = prune(GM);
17  return M
The search is discontinued if no more matrices that improve the matching
can be generated. The matrix with the highest match score is considered the
‘best’ matrix. Based on this matrix, a maximum weight matching is performed
(line 15). The result is a set of edges between nodes representing source and
target attribute groups such that each node is incident with at most one edge,
and each edge connects one source with one target attribute group. Edges with
a weight below a given threshold $\theta_{prune}$ are pruned (line 16). The remaining
edges represent the table matching, which is the output of the algorithm.
It has to be noted that other search procedures are conceivable. Beam search
can be applied if the heuristics described in Sec. 7.4.2 prove to be insufficient
for very large tables. The adaptation of the existing implementation would be
straightforward: Instead of using all matrices in each search step, only the best
k matrices are considered (line 5).
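The search loop of Alg. 4 (lines 1 to 14) can be sketched generically. In this sketch the child generation and the score function are passed in as callables, and the optional beam_width parameter models the beam search variant mentioned above; these names, and the toy integer states used to exercise the loop, are assumptions for illustration.

```python
def search_best_matrix(start, child_matrices, score, beam_width=None):
    """Level-wise search that returns the best-scoring state seen so far."""
    new_matrices = [start]
    best = start
    while new_matrices:
        matrices = new_matrices
        if beam_width is not None:
            # Beam search variant: keep only the k best matrices of this level.
            matrices = sorted(matrices, key=score, reverse=True)[:beam_width]
        # max() returns the first maximum, so ties keep the current best.
        best = max([best, *matrices], key=score)
        new_matrices = []
        for matrix in matrices:
            new_matrices.extend(child_matrices(matrix))
    return best


# Toy stand-ins: states are integers, each state below 3 has one better child.
def toy_children(state):
    return [state + 1] if state < 3 else []


def toy_score(state):
    return state


print(search_best_matrix(0, toy_children, toy_score))
print(search_best_matrix(0, toy_children, toy_score, beam_width=1))
```

The loop terminates as soon as a level produces no children, which corresponds to the condition in line 4. Graph matching and pruning (lines 15 to 17) would then be applied to the returned matrix.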
7.4.2 Creating Child Matrices
Child matrices are created by merging attribute groups. Because the number
of possible matrices is very large, we use three heuristics to reduce the number
of matrices that are considered by the algorithm:
1. Partner attribute groups: Given an attribute group A, we consider an
attribute group P a partner (or partner attribute group) of A if A is
highly similar to P (i.e., P potentially matches with A). When considering
attribute group A in the algorithm, new attribute groups involving P are
created.
2. Local criterion: Merging the partner P of A with another attribute group
B must increase the similarity with A. Otherwise, it is not considered an
improvement. This criterion is called ‘local’ because only A, P, and B are
considered.
3. Global criterion: The resulting matrix must improve its parent. As opposed
to the local criterion, the global criterion considers all attribute
groups.
Those heuristics can be found in the function childMatrices, which gener-
ates child matrices for a given matrix M. The structure of those matrices is the
same as the structure of M except that two attribute groups have been merged
(Fig. 7.4).
Function childMatrices
Input: Average similarity matrix M
Output: Matrices derived from M by merging attribute groups
 1  foreach attribute group X in M do
 2      partners ← simAttGroups(M, X);
 3      foreach P ∈ partners do
 4          foreach attribute group Y ≠ P do
 5              if avgSim(X, P ∪ Y) > avgSim(X, P) then
 6                  M' = merge(M, P, Y);
 7                  if score(M') > score(M) then
 8                      childMatrices = childMatrices ∪ M';
 9                  end
10              end
11          end
12      end
13  end
14  return childMatrices
A matrix M is given as input to childMatrices. The function considers
each attribute group, both from the source and from the target side (line 1).
For each attribute group X, it picks partner attribute groups (line 2). Those
partners are attribute groups in the opposite table that are potential matches
for X (i.e., their similarity passes a threshold $\theta_{partner}$). If more than m attribute
groups pass the threshold, only the m most similar attribute groups are used as
partners. If no attribute group passes the threshold, then partners is empty.
For each partner group P, the new attribute group {P, Y}, where Y is an
attribute group on P's side (except P itself), is considered a potential improvement.
To determine if the matrix involving attribute group {P, Y} needs to be
constructed, the local criterion is applied: The average similarity of X values
with the concatenation of P and Y values must be higher than with P values
(line 5). If the local criterion is satisfied, the matrix M' is created by merging
attribute groups P and Y (line 6). This child matrix must also satisfy the global
criterion: The overall match score of the child matrix M' must be higher than
the match score of its parent M (line 7). If that is the case, then the matrix is
added to the result set (line 8).
To illustrate the method, assume M to be the start matrix shown in Fig. 7.2
and X to be the target attribute group {A2}. To determine the set partners, the
similarity of A2 with all source attribute groups needs to be checked. As stated
above, the m most similar attribute groups with a similarity score above the
threshold $\theta_{partner}$ are taken as partners. In the following, we assume that m = 2
and $\theta_{partner} = 0.2$. Because only A1 is highly similar to A2, partners contains
the attribute group A1. This attribute group is merged with all other source
attribute groups, resulting in the following new attribute groups: {FN, A1} and
{LN, A1}. In contrast, if X is the target attribute N, both LN and FN are
taken as partners.
The average field similarity of those new attribute groups and X is computed
using the detected duplicates. One will see that the similarity of {FN, LN} with
N is larger than the similarity of FN with N and of LN with N. Hence, the
local criterion is satisfied. In contrast, attribute group {LN, A1} leads to a
decrease in the similarity score with respect to A2, because the last name LN
is not part of the address A2.
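The partner selection of this example can be replayed on the start matrix of Fig. 7.2. The sketch below is one possible reading of simAttGroups; the dictionary layout, the strict comparison against the threshold, and the parameter names are assumptions.

```python
# Average similarity matrix of Fig. 7.2 (rows: source, columns: target).
START = {
    "FN": {"N": 0.6, "A2": 0.1},
    "LN": {"N": 0.7, "A2": 0.1},
    "A1": {"N": 0.1, "A2": 1.0},
}


def sim_att_groups(matrix, x, m=2, theta_partner=0.2):
    """Return up to m groups of the opposite table whose similarity to x exceeds the threshold."""
    if x in matrix:
        # x is a source group: candidate partners are the target groups.
        sims = dict(matrix[x])
    else:
        # x is a target group: candidate partners are the source groups.
        sims = {source: row[x] for source, row in matrix.items()}
    ranked = sorted(sims.items(), key=lambda item: item[1], reverse=True)
    return [group for group, sim in ranked if sim > theta_partner][:m]


print(sim_att_groups(START, "A2"))  # only A1 passes the threshold
print(sim_att_groups(START, "N"))   # LN and FN, most similar first
```

With m = 2 and a threshold of 0.2, the partners of A2 are [A1] and the partners of N are [LN, FN], as described in the text.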
With only three source attributes, the effect of using only a limited set of
partner groups is marginal. When the tables under consideration become larger,
the factor m can be used to regulate the computation time of the algorithm.
In contrast, the introduction of the threshold $\theta_{partner}$ has increased both accuracy
and performance: The performance gain is obvious, because fewer attribute
combinations are considered for each X. Interestingly, accuracy is also positively
affected: If the similarity avgSim(X, P) is very low, then the similarity
avgSim(X, P ∪ Y) can coincidentally be higher even though P and Y are unrelated.
Thus, if no threshold were applied, the child matrix could pass the local
criterion. None of the examined score functions for the global criterion would
satisfactorily filter out the majority of such ‘false’ merges.
The Local Criterion
Although it is possible that the local criterion filters out correct merges (e.g.,
due to dirty data) or considers false merges as being correct (e.g., due to coincidental
use of the same terms), in most cases it makes a correct decision. In
the following, we show why the criterion works with the field similarity measure
fieldsim.
The local criterion is based on the assumption that merging unrelated attributes
or attribute groups decreases the average similarity: LN is very similar
to N, because the last name is part of the full name. Merging LN with A1
decreases its similarity to N, because a lot of terms from the address A1 that
are unrelated to the name N are added. On the other hand, merging related
attributes increases the similarity score: The concatenated FN and LN values
are much more similar to N values than the FN and LN values themselves.
Essentially, this behavior reflects the intuition behind any string similarity measure:
If a pair of strings is more similar than another pair of strings, then it
must get a higher similarity score. Consequently, the algorithm makes a greedy
decision if a merge step decreases a similarity score and drops the respective
matrix.
To illustrate the behavior of the field similarity measure, we start with the
following example: A name of a person n consists of m terms $t_1 \ldots t_m$. The
person's first name fn and last name ln are comprised of disjoint subsets of the
name terms: $t_1 \ldots t_k$ and $t_{k+1} \ldots t_l$, respectively. Assume that the remaining
terms $t_{l+1} \ldots t_m$ include the titles of the person. In addition, there is the address
a1 that is comprised of tokens that do not appear in any of the previous strings.
\[
\begin{aligned}
n &= t_1 \ldots t_m \\
\mathit{fn} &= t_1 \ldots t_k \\
\mathit{ln} &= t_{k+1} \ldots t_l \\
a1 &= t_{m+1} \ldots t_o
\end{aligned}
\]
Recall from Sec. 5.2.1 that SoftTFIDF is used as the field similarity measure.
The field similarity fieldsim of two attribute values a and b is defined as:
\[
\mathit{fieldsim}(a, b) = \sum_{t \in \mathit{CLOSE}(\theta, a, b)} w(a, t) \cdot w(b, t') \cdot \mathit{termsim}(t, t') \tag{7.1}
\]
where t' is the term in b that is most similar to t according to the term similarity
measure termsim, and CLOSE(θ, a, b) is a subset of all terms in a with
\[
\mathit{CLOSE}(\theta, a, b) = \{\, t \in a \mid \exists\, t' \in b,\ \mathit{termsim}(t, t') > \theta \,\}. \tag{7.2}
\]
The term similarity termsim is based on the normalized edit distance.
We start with the similarity of n and fn. Each term of fn also occurs in n,
and thus, the term similarity of matching terms is always 1. In the following,
we also assume that the unnormalized TFIDF weight w' for a given term t is
equal in all attributes. This is not a very restrictive assumption: The term
frequency is usually small because of the nature of database attributes, and
the distribution of terms should be similar across databases. We use w'(t) to
denote the unnormalized weight of term t. If these assumptions hold, the field
similarity of n and fn is computed as:
\[
\begin{aligned}
\mathit{fieldsim}(n, \mathit{fn})
&= \frac{\bigl(w'(t_1), \ldots, w'(t_k), w'(t_{k+1}), \ldots, w'(t_m)\bigr)^{\top}}{\sqrt{w'(t_1)^2 + \ldots + w'(t_m)^2}}
\cdot
\frac{\bigl(w'(t_1), \ldots, w'(t_k), 0, \ldots, 0\bigr)^{\top}}{\sqrt{w'(t_1)^2 + \ldots + w'(t_k)^2}} \\
&= \frac{w'(t_1)^2 + \ldots + w'(t_k)^2}{\sqrt{w'(t_1)^2 + \ldots + w'(t_m)^2}\,\sqrt{w'(t_1)^2 + \ldots + w'(t_k)^2}} \\
&= \frac{\sqrt{w'(t_1)^2 + \ldots + w'(t_k)^2}}{\sqrt{w'(t_1)^2 + \ldots + w'(t_m)^2}}
\end{aligned}
\]
Adding tokens to fn reduces the individual, normalized weights of its tokens
$t_1 \ldots t_k$, because the token vector becomes longer. This reduces the similarity
with n. However, if the new tokens also occur in n, their addition has an
increasing effect. Assume that ln is added to fn, which results in the following
field similarity:
\[
\begin{aligned}
\mathit{fieldsim}(n, \mathit{fn} \cup \mathit{ln})
&= \frac{\bigl(w'(t_1), \ldots, w'(t_l), w'(t_{l+1}), \ldots, w'(t_m)\bigr)^{\top}}{\sqrt{w'(t_1)^2 + \ldots + w'(t_m)^2}}
\cdot
\frac{\bigl(w'(t_1), \ldots, w'(t_l), 0, \ldots, 0\bigr)^{\top}}{\sqrt{w'(t_1)^2 + \ldots + w'(t_l)^2}} \\
&= \frac{\sqrt{w'(t_1)^2 + \ldots + w'(t_l)^2}}{\sqrt{w'(t_1)^2 + \ldots + w'(t_m)^2}}
\end{aligned}
\]
Obviously, the field similarity has increased, because the numerator is larger
while the denominator is the same as before. In contrast, adding terms that do
not appear in n (e.g., adding terms of a1) decreases the score, because there
are no tokens that compensate the decreasing normalized weight of matching
terms.
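Under the additional simplifying assumption that all unnormalized weights are equal, the two closed forms reduce to the square roots of k/m and l/m, which can be checked numerically. The sketch below uses sets of term labels and a plain cosine; the concrete sizes (m = 5, k = 2, l = 4) and all names are illustrative assumptions.

```python
from math import isclose, sqrt


def cosine_equal_weights(a: set, b: set) -> float:
    """Cosine of two term sets when every term has the same unnormalized weight."""
    return len(a & b) / (sqrt(len(a)) * sqrt(len(b)))


name = {"t1", "t2", "t3", "t4", "t5"}   # n, m = 5 terms
first = {"t1", "t2"}                    # fn, k = 2
last = {"t3", "t4"}                     # ln, so l = 4
address = {"t6", "t7"}                  # a1, disjoint from n

sim_fn = cosine_equal_weights(name, first)
sim_fn_ln = cosine_equal_weights(name, first | last)
sim_fn_a1 = cosine_equal_weights(name, first | address)
print(sim_fn, sim_fn_ln, sim_fn_a1)
```

Adding the related terms of ln raises the score, while adding the unrelated terms of a1 lowers it below the score of fn alone.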
Unfortunately, real-world data is not as perfect as described above. Dirty
data can cause a decrease in the field similarity even when related attributes
are added. Assume that a person has several last names, and n contains only
a single last name, while ln has all of them. Adding ln to fn can decrease the
field similarity as described above: The normalized weight decreases, and the
single new matching term cannot compensate. A similar effect occurs if the
term similarity of matching tokens is not 1. Assume three strings:
\[
s = t_1 t_2, \qquad s_1 = t_1, \qquad s_2 = t'_2
\]
where $\mathit{termsim}(t_2, t'_2) = 0.4$. Given that the unnormalized TFIDF weights of
all terms are equal, the field similarity of $s$ and $s_1$ is
$\mathit{fieldsim}(s, s_1) = \sqrt{0.5} = 0.707$. Adding $s_2$ to $s_1$ decreases the similarity
with $s$ independent of the threshold used: If $\theta \geq \mathit{termsim}(t_2, t'_2)$, then $t'_2$ is
not considered a match for $t_2$, and thus, $\mathit{fieldsim}(s, s_1 \cup s_2) = 0.5$. If the term
similarity passes the threshold, then the field similarity is
$\mathit{fieldsim}(s, s_1 \cup s_2) = 0.5 + \sqrt{0.5} \cdot \sqrt{0.5} \cdot 0.4 = 0.7 < 0.707$.
Thus, the field similarity decreases although attributes are grouped correctly
because $t'_2$ is too dissimilar to $t_2$.
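The three values of this example can be recomputed with a reduced version of Eq. 7.1. The sketch assumes equal unnormalized weights for all terms, so that each normalized weight is one over the square root of the number of terms; the term labels, the termsim lookup, and the two thresholds are illustrative assumptions.

```python
from math import isclose, sqrt


def termsim(t: str, u: str) -> float:
    """Toy term similarity: identical terms score 1, t2 and t2' score 0.4."""
    if t == u:
        return 1.0
    if {t, u} == {"t2", "t2'"}:
        return 0.4
    return 0.0


def fieldsim(a: list, b: list, theta: float) -> float:
    """Eq. 7.1 with equal unnormalized term weights (assumption)."""
    weight_a = 1 / sqrt(len(a))
    weight_b = 1 / sqrt(len(b))
    total = 0.0
    for t in a:
        best = max(termsim(t, u) for u in b)  # most similar term of b
        if best > theta:                      # membership in CLOSE(theta, a, b)
            total += weight_a * weight_b * best
    return total


s, s1, s1_s2 = ["t1", "t2"], ["t1"], ["t1", "t2'"]
print(fieldsim(s, s1, theta=0.3))      # before adding s2
print(fieldsim(s, s1_s2, theta=0.5))   # t2' fails the threshold
print(fieldsim(s, s1_s2, theta=0.3))   # t2' passes the threshold
```

In both threshold cases the merged value scores below the 0.707 obtained for $s_1$ alone, which is the decrease described above.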
On the other hand, there is also the chance that adding unrelated attributes
increases the similarity score. Prob. 4 in Sec. 4.3.1 states that unrelated at-
tributes might have highly similar values, which can mislead duplicate detec-
tion. Such a case can also mislead the complex matching step, because an
increase in the similarity score is considered an improvement. Both problems
are considered in the experimental evaluation in Sec. 7.7.
7.4.3 Assessing Match Improvement
In contrast to the local criterion, which determines whether a merge has
improved the matrix from the viewpoint of a single attribute group, the global
criterion uses the whole matrix. It is not unusual that, even though the local
criterion is satisfied, the merging of attribute groups has led to an
unintended matrix. If that happens, the global criterion in the function
childMatrices must reject the child matrix.
The creation of child matrices is not the only part of the algorithm where
matrices need to be assessed. The overall goal of the algorithm is to find the
best matrix, which represents the correct matching between two tables. Hence,
we need to define a function that assigns scores to the matrices created by the
algorithm such that the best matrix has the highest score.
Discussion of possible functions
In both tasks, the score function should assess matrices based on the matching
that they produce. There are two choices of input to the function:
1. Graph matching: The correspondences described by the maximum weight
matching of the similarity matrix.
2. Table matching: The correspondences of the graph matching that pass
the pruning threshold θprune.
Both choices have their advantages. The main difference lies in the
sensitivity: The table matching does not always change even when attribute
groups are correctly merged. To illustrate, assume a correspondence
({X1, . . . , X4}, {Y}). Because the values of four source attributes need to
be merged to create a value for Y, the similarity scores for each simple
correspondence ({Xi}, {Y}) will be low. Merging X1 and X2 is a correct merge,
because it leads toward the correspondence above. However, because the
attribute group {X1, X2} consists of only two of the required four attributes,
the similarity score is unlikely to pass the threshold, and thus, the actual
improvement cannot be seen in the table matching. In contrast, the improvement
is easily observable when the graph matching correspondences are used.
The score function needs to combine the similarity score of the correspon-
dences into a single match score. We considered two possible functions:
1. Average: The average of the correspondence scores.
2. Sum: The sum of the correspondence scores.
Using the average of correspondence scores has the advantage of always
being in the range [0,1], while the value range of the sum depends on the
number of correspondences, which in turn depends on the size of the tables:
The number of correspondences before pruning is equal to the smaller number of
attribute groups of the two tables. Hence, sum appears to be a bad choice for
determining matrix improvement: If two attribute groups from the smaller side
are merged, the sum of correspondence scores before pruning can decrease even
when the merge is correct.
While the function average has the advantage that matrices are comparable
even when the number of correspondences decreases, we have observed a
tendency to ‘overmerge’ attribute groups: E.g., {({LN, FN, A1}, {N, A2})} would
be considered an overmerge for the running example depicted in Fig. 7.1, because
the expected two correspondences are merged into a single correspondence. In
addition, correspondences can be missed, because the average score can increase
when the number of correspondences decreases.
Assume the average of graph matching score to be defined as
avgScore(GM) = \frac{\sum_{(A,B) \in GM} avgSim(A, B)}{|GM|}    (7.3)
where GM is a set of attribute group pairs (A, B) representing the maximum
weight matching in the matrix M. The value of avgScore increases if the sum
of field similarity scores avgSim increases or if the size of the matching GM
decreases. The latter is an unintended effect that occurs with correspondences
both before and after pruning: E.g., if only correspondences with a score above
the pruning threshold are used, then the algorithm aims at removing correspon-
dences that are just above the threshold.
        {A}     {BC}
{A}     1       0
{B}     0       0.75
{C}     0       0.63
{D}     0       0
(a) Start matrix.

        {A}     {BC}
{A}     1       0
{BC}    0       0.9
{D}     0       0
(b) Correct grouping.

        {A}     {BC}
{A}     1       0
{BD}    0       0.53
{C}     0       0.63
(c) Wrong grouping.

Figure 7.5: Increasing score average by overmerging.
Assume the start matrix shown in Fig. 7.5(a) and a pruning threshold of
0.7 (the correspondences above the threshold are highlighted in bold). Its
table matching has an average score of 0.875. The average table matching score
of the correct merging shown in Fig. 7.5(b) is 0.95, and thus, it would be
considered a better matrix. However, the wrong grouping depicted in Fig. 7.5(c)
has an even larger score of 1, because merging B and D caused the average field
similarity of the correspondence with BC to drop below the threshold.
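The numbers of this example can be checked with a short sketch. It takes the
scores of the maximum weight matching of each matrix in Fig. 7.5 as given
(they are read off the figure, not computed here); the function names are
illustrative assumptions.

```python
def table_matching_average(graph_matching_scores, theta_prune):
    """Average over the correspondences that pass the pruning threshold."""
    kept = [s for s in graph_matching_scores if s >= theta_prune]
    return sum(kept) / len(kept) if kept else 0.0

def graph_matching_average(graph_matching_scores):
    """Average over all correspondences of the graph matching (Eq. 7.3)."""
    return sum(graph_matching_scores) / len(graph_matching_scores)

THETA_PRUNE = 0.7
start = [1.0, 0.75]   # Fig. 7.5(a)
correct = [1.0, 0.9]  # Fig. 7.5(b)
wrong = [1.0, 0.63]   # Fig. 7.5(c): the second score falls below the threshold

table_scores = [table_matching_average(m, THETA_PRUNE)
                for m in (start, correct, wrong)]
graph_scores = [graph_matching_average(m) for m in (start, correct, wrong)]
```

With pruning, the wrong grouping obtains the highest average because its weak
correspondence is dropped from the denominator; without pruning, it scores
lower than the start matrix in this particular example.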
In contrast, correspondences cannot be removed from consideration when the
graph matching (i.e., the correspondences before pruning) is used. However, the
number of correspondences is reduced when attribute groups of the smaller side
are merged. We have observed cases where the numerator of Eq. 7.3 decreased
in a merge step, but the avgScore value increased because the denominator
decreased.
Function  Input           Advantages                      Disadvantages
Sum       Graph Matching  Overmerge avoided.              Major sensitivity to restructuring.
Sum       Table Matching  Overmerge avoided.              Minor sensitivity to restructuring.
Average   Graph Matching  Changes immediately reflected.  Overmerging.
                          Fixed value range.
Average   Table Matching  Fixed value range.              Overmerging.
Table 7.1: Comparison of scoring functions.
The discussion of possible scoring functions is summarized in Tab. 7.1. Sev-
eral experiments with synthetic data sets confirmed that the two functions have
antagonistic properties: While the sum might avoid (both correct and incorrect)
merging of attributes, the average can lead to an overmerge.
Determining matrix improvement
The main goal of the score function score in the function childMatrices is
to avoid matrices that do not lead to the correct matrix, and thus, reduce the
overall number of matrices that have to be considered. Positive and negative
changes should be immediately visible. As described above, the table match-
ing correspondences only change when correspondences with a score above the
threshold are affected by the merge. Hence, all correspondences of the graph
matching are used.
Because the number of graph matching correspondences always decreases
when attribute groups of the smaller side are merged, we chose to use the
average of the correspondence scores as the score function:
score(M) = avgScore(GM) = \frac{\sum_{(A,B) \in GM} avgSim(A, B)}{|GM|}    (7.4)
where GM is the maximum-weight graph matching of matrix M. The score is
always in the range [0,1], which facilitates the comparison of a parent matrix
with its child matrices.
Choosing a best matrix
The matrices that pass the local and global criteria are candidates for the best
matrix. Similar to the problem of determining matrix improvement, we assign
scores to matrices and choose the matrix with the highest score as the best
matrix. In contrast to the problem of determining matrix improvement, the
average of graph matching scores has proven to be suboptimal for choosing the
best matrix, because it can lead to overmerging. Hence, we define a new function
matchscore. The input to this function is the set of correspondences that pass
the pruning threshold, because they have proven to be a more reliable indicator.
Each correspondence (A, B) = ({a_1, . . . , a_i}, {b_1, . . . , b_j}) whose
field similarity is above the pruning threshold has a correspondence score,
which is weighted by the number of attributes involved:
corscore(A, B) = (|A| + |B|) \cdot avgSim(A, B)    (7.5)
where avgSim(A, B) is the average field similarity of attribute groups A and B.
The match score of a matrix M is the sum of the correspondence scores of its
table matching \mathcal{M}, i.e., the correspondences that pass the pruning
threshold:
matchscore(M) = matchscore(\mathcal{M}) = \sum_{(A,B) \in \mathcal{M}} corscore(A, B).    (7.6)
Recall from Tab. 7.1 that the sum does not have the problem of overmerging,
but avoids attribute group merges if the number of correspondences decreases.
Weighting attribute correspondences by the number of attributes involved
tackles this problem, which is particularly important in the presence of m:n
correspondences, which are discussed in the following section. Fig. 7.6 shows
the start matrix and the matrix representing the correct matching in an m:n
scenario (correspondences highlighted in bold). Note that the (unweighted) sum
of field similarity scores is smaller in the second matrix, because the number
of correspondences has decreased with respect to the first matrix. However, the
matchscore has increased, because the field similarity 0.97 of the merged
correspondence is larger than both 0.76 and 0.75, and it contains six
attributes, while each of the two correspondences from which it was derived
contains three attributes.
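The effect of the weighting in Eq. 7.5 and Eq. 7.6 can be made concrete with
the values of Fig. 7.6. The sketch below assumes that a group label such as
{AB} stands for the two attributes A and B, which matches the attribute counts
given in the text; the tuple representation and the variable names are
illustrative assumptions.

```python
def corscore(source_group, target_group, avg_sim):
    """Eq. 7.5: field similarity weighted by the number of attributes."""
    return (len(source_group) + len(target_group)) * avg_sim

def matchscore(table_matching):
    """Eq. 7.6: sum of corscore over the correspondences above the threshold."""
    return sum(corscore(a, b, sim) for a, b, sim in table_matching)

# Table matchings of Fig. 7.6 as (source group, target group, avgSim).
start = [
    ({"A", "B"}, {"A"}, 0.76),
    ({"C"}, {"B", "C"}, 0.75),
    ({"D"}, {"D"}, 0.93),
]
merged = [
    ({"A", "B", "C"}, {"A", "B", "C"}, 0.97),
    ({"D"}, {"D"}, 0.93),
]

unweighted_start = sum(sim for _, _, sim in start)    # plain sum of scores
unweighted_merged = sum(sim for _, _, sim in merged)
weighted_start = matchscore(start)
weighted_merged = matchscore(merged)
```

The unweighted sum drops from about 2.44 to 1.90, while matchscore rises from
about 6.39 to 7.68, so the correct m:n grouping is preferred only with the
weighting.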
Note that the matchscore function does not work well in the previous prob-
lem of determining matrix improvement, because its value range differs depend-
ing on the matrix structure.
7.5 The Complex Matching Algorithm
The greedy approach described in Sec. 7.4.2 only works as desired in the presence
of n:1-correspondences: If an m:n correspondence exists, a single merge will
not improve the matrix, and thus, that branch of the search space will not be
considered by the algorithm. This section describes the final complex matching
algorithm, which is an adaptation of the algorithm described in Sec. 7.4.
The function matchscore successfully detects the best matrix even in the
presence of m:n correspondences (Eq. 7.6). In contrast, the score of a matching
does not always increase in such a scenario even when the merge is intuitively
correct (Eq. 7.4).
        {A}     {BC}    {D}
{AB}    0.76    0.43    0
{C}     0.39    0.75    0
{D}     0       0       0.93
(a) Start matrix.

          {A, BC}   {D}
{AB, C}   0.97      0
{D}       0         0.93
(b) Correct grouping.

Figure 7.6: Start matrix and correct grouping in an m:n scenario.
Consider the matrices depicted in Fig. 7.6. Getting from the start matrix
(Fig. 7.6(a)) to the correct matrix (Fig. 7.6(b)) requires two steps, which can
be performed in any order: One can first merge source attribute groups {AB} and
{C} and target attribute groups {A} and {BC} afterwards, or vice versa. In
any case, the first merge step is likely to decrease the score of the matching.
Fig. 7.7 depicts the matrix after source attribute groups {AB} and {C}
have been merged. As a result, the score of the matrix has dropped from 0.81
to 0.76. This indicates a false merge, although the merge step leads to the
correct matrix. While the actual numbers in this example are fictitious, we
have frequently observed such behavior in our experiments. One might argue
that another score function should be used to catch such a scenario. However,
we believe that even for a human user who does not know the semantics of the
attributes, the ‘improvement’ in the step from Fig. 7.6(a) to Fig. 7.7 is very
hard to discover. This holds especially because false merges also lead to
matrices that look like the matrix depicted in Fig. 7.7.
         {A}     {BC}    {D}
{ABC}    0.49    0.59    0
{D}      0       0       0.93
Figure 7.7: First step: Merging attribute groups {AB} and {C}.
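The score values quoted for this example follow from Eq. 7.4 applied to the
maximum weight matching of each matrix. The sketch below finds that matching
by brute force, which is only feasible for matrices as small as those in
Fig. 7.6 and Fig. 7.7 and is not meant to reflect how DUMAS computes it; all
names are illustrative assumptions.

```python
from itertools import permutations

def graph_matching_scores(matrix):
    """Scores of a maximum weight matching, found by trying every assignment.
    Each group of the smaller side is matched to a distinct partner."""
    rows, cols = len(matrix), len(matrix[0])
    if rows <= cols:
        options = ([matrix[r][c] for r, c in enumerate(p)]
                   for p in permutations(range(cols), rows))
    else:
        options = ([matrix[r][c] for c, r in enumerate(p)]
                   for p in permutations(range(rows), cols))
    return max(options, key=sum)

def score(matrix):
    """Eq. 7.4: average field similarity over the graph matching."""
    scores = graph_matching_scores(matrix)
    return sum(scores) / len(scores)

fig_7_6a = [[0.76, 0.43, 0.0], [0.39, 0.75, 0.0], [0.0, 0.0, 0.93]]
fig_7_7 = [[0.49, 0.59, 0.0], [0.0, 0.0, 0.93]]
fig_7_6b = [[0.97, 0.0], [0.0, 0.93]]

print(round(score(fig_7_6a), 2), round(score(fig_7_7), 2),
      round(score(fig_7_6b), 2))  # 0.81 0.76 0.95
```

The intermediate matrix scores lower than the start matrix, although the
matrix reached one step later scores higher than both.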
Adapting the algorithm
To be able to detect m:n correspondences, we chose to use the function score as
defined in Eq. 7.4 and to adapt the original algorithm (Sec. 7.4). Instead of
immediately discarding matrices that do not improve their parent, we allow
the algorithm to make another search step: Assume that the child matrix M_1
does not improve its parent M_0, but M_1’s child M_2 has a higher score value
than M_0. In that case, we consider M_2 an improvement. In the example
scenario, the matrix in Fig. 7.7 would not be considered an improvement, but
since the (correct) matrix in Fig. 7.6(b) has a higher score than the start
matrix in Fig. 7.6(a), it is further considered by the algorithm.
Function childMatrices2
Input: Average Similarity Matrix M
Output: Matrices derived from M by merging attribute groups
 1  foreach attribute group X in M do
 2      partners ← simAttGroups(M, X);
 3      foreach P ∈ partners do
 4          foreach attribute group Y ≠ P do
 5              if avgSim(X, P ∪ Y) > avgSim(X, P) then
 6                  M' = merge(M, P, Y);
 7                  if isImprovement(M) then
 8                      childMatrices = childMatrices ∪ M';
 9                  else
10                      if improvesGrandParent(M') then
11                          childMatrices = childMatrices ∪ M';
12                      end
13                  end
14              end
15          end
16      end
17  end
18  return childMatrices
Function childMatrices2 shows the new method for creating child matrices,
which is the part of the algorithm mainly affected by the adaptation. Although
not visible in the pseudo code, the first change with respect to the original
function occurs in line 1, where variable X ranges over some attribute groups.
In the original function, the variable ranges over all attribute groups in the
matrix. This still holds if matrix M has been an improvement (i.e., if its score
is larger than its parent’s score). If that is not the case, then the function
has been called with M because it is assumed that an m:n correspondence exists.
Hence, we only want to merge attribute groups of the table that has not been
affected by the creation of M: E.g., if M has been created by merging source
attribute groups, then only target attribute groups must be merged to create
child matrices for M. Consequently, X must only range over source attribute
groups of M.
Lines 7-12 show the new global criterion. The treatment of the new matrix
M' depends on the status of its parent M: If M is an improvement over its
parent (i.e., the grandparent of M'), then M' is added to the list of child
matrices. Although not shown in the algorithm, M' is compared to M using the
function score to determine whether it is an improvement and is labelled
accordingly. If M has not been an improvement, then M' needs to be compared
with its grandparent. That procedure is described below. If it is an
improvement with respect to its grandparent, then it is added to the list of
child matrices.
Note that besides the function childMatrices, the only other part of the
complex table matching algorithm that is affected by the adaptation is the
definition of the function best in Alg. 4: Instead of considering all child
matrices, the best matrix is chosen only among the matrices that are considered
an improvement.
Comparing a matrix with its grandparent
The method childMatrices2 ensures that if a matrix M' is to be compared with
its grandparent M_gp, then both the source and target side of M' contain exactly
one attribute group that is the result of merging two attribute groups of M_gp.
In other words, M' has been created by first merging two source attribute groups
and then two target attribute groups, or vice versa. This knowledge is exploited
when determining whether a matrix improves its grandparent: Instead of
considering the whole matrices, we only look at the four fields in M_gp that
represent the similarity of the source and target attribute groups that are
merged to create M' and the single field in M' that is the result of merging
those attribute groups. We consider a matrix an improvement over its
grandparent if the field similarity of the merged attribute groups in M' is
larger than each of the four field similarities in the grandparent’s matrix
M_gp.
        {A}     {BC}
{AB}    0.76    0.43
{C}     0.39    0.75
(a) Extract of start matrix.

          {A, BC}
{AB, C}   0.97
(b) Extract of correct grouping.

Figure 7.8: Extract of start matrix and correct grouping in Fig. 7.6.
Fig. 7.8 depicts the relevant extract of the example in Fig. 7.6. Assume
that the matrix in Fig. 7.6(b) needs to be compared with its grandparent in
Fig. 7.6(a). Obviously, it is the result of merging source attribute groups AB
and C and merging target attribute groups A and BC. Fig. 7.8(a) shows the
matrix fields representing the field similarity of those attribute groups, while
Fig. 7.8(b) depicts the field that is the result of the merge. The field
similarity of the merged attribute groups (0.97) is larger than each of the
field similarities of its constituent attribute groups (0.76, 0.43, 0.39, 0.75).
Note that in practice we need to ensure that the difference between the matrix
values is relevant, because minor improvements can occur spuriously and lead to
overmerging. In the implementation, we used a small threshold of 0.1, i.e., the
similarity score of the merged matrix must exceed each of the four scores in
the grandparent’s matrix by at least that threshold.
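The grandparent comparison reduces to a check over five numbers. A minimal
sketch, assuming the four grandparent fields and the merged field have already
been extracted from the matrices (function and parameter names are
illustrative, not taken from the implementation):

```python
def improves_grandparent(merged_sim, grandparent_sims, min_gain=0.1):
    """True if the merged field exceeds each of the four grandparent fields
    by at least min_gain (the threshold of 0.1 mentioned in the text)."""
    return all(merged_sim - sim >= min_gain for sim in grandparent_sims)

# Values of Fig. 7.8: the merged field 0.97 against the four original fields.
accepted = improves_grandparent(0.97, [0.76, 0.43, 0.39, 0.75])

# A hypothetical merged field of 0.80 would exceed all four values, but not
# by the required margin over 0.76 and 0.75.
rejected = improves_grandparent(0.80, [0.76, 0.43, 0.39, 0.75])
```

The second call illustrates why the margin matters: without it, the marginal
gain over 0.76 would already count as an improvement.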
7.6 Matching and Mapping with Combination Functions
The DUMAS complex matching algorithm concatenates attribute values on the
source and target side to create complex correspondences. While that procedure
is sufficient to create a matching, it leaves two questions open: (i) What is
the mapping derived from the matching, and (ii) does the process work with
other functions as well? This section discusses these questions and sketches
possible solutions.
7.6.1 Query Discovery with Complex Correspondences
The matching algorithm combines attribute values using the concatenation
function concat to create 1:n, n:1, and m:n correspondences. Note that
attributes of both the source schema and the target schema can be combined. In
contrast, the mapping derived from those correspondences needs to determine a
value for each individual target attribute.
Simple correspondences and n:1 correspondences can be easily translated
into a mapping, because source values can be used immediately or have to be
processed by a combination function, respectively. The existence of 1:n or m:n
correspondences poses a problem in the query discovery process, because it is
only known that a combination of target attributes is related to one or several
source attributes, respectively. However, it is unclear how individual target val-
ues can be computed. Assume that source attribute Name is a concatenation
of target attributes GivenName and Surname. Transforming source tuples is a
difficult task, because values of Name have to be split into their given name and
surname. This can only be done heuristically, because the concatenation func-
tion has no inverse. However, it is conceivable that known duplicates provide
enough information to learn such heuristics in a supervised process.
7.6.2 Matching and Mapping with Different Functions
Although the complex matching algorithm has been designed and implemented
to work with the concatenation function concat as the only combination
function, in principle different functions can also be applied. The complex
matching procedure is based on the matching step of Chap. 5: The similarity
matrix is used as a starting point, and attributes are merged by combining their
values in the duplicate tuple pairs. When functions other than concat are to be
used, the only point of change is the merge procedure in Func. childMatrices2,
which creates child matrices.
Assume that a source database describes real estate with attributes Length
and Width, while the target database has a single attribute Area, which is
the product of the former two attributes. Combining Length and Width with
simple string concatenation leads to a low field similarity with Area. However,
if the matching system included an equation discovery system (e.g.,
LAGRAMGE [TD97]), it would be possible to find an equation that combines
Length and Width, namely the product of the two attributes, such that the
resulting value is very close to the Area value of the respective duplicate
target tuple³. We propose an optimistic use of different possible field
similarities: One should always use the better combination function to compute
field similarities. In the above example, the matrix value for the
correspondence ({Length, Width}, {Area}) would be fieldsim(Length · Width, Area)
instead of fieldsim(concat(Length, Width), Area).
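The optimistic choice between combination functions can be sketched as
follows. The duplicate values are invented for illustration, exact_sim is a
toy stand-in for fieldsim, and the product is simply supplied as a candidate
function rather than discovered by an equation discovery system.

```python
def exact_sim(candidate, target):
    """Toy stand-in for fieldsim: 1.0 if both values render to the same string."""
    return 1.0 if str(candidate) == str(target) else 0.0

def best_combination(duplicates, combiners, sim=exact_sim):
    """Average the similarity over all duplicate pairs for each combination
    function and return the (name, score) of the best one."""
    scores = {
        name: sum(sim(combine(length, width), area)
                  for length, width, area in duplicates) / len(duplicates)
        for name, combine in combiners.items()
    }
    return max(scores.items(), key=lambda item: item[1])

# Hypothetical duplicates: (Length, Width) from the source, Area from the target.
duplicates = [(10, 5, 50), (8, 12, 96), (20, 15, 300)]
combiners = {
    "concat": lambda length, width: f"{length}{width}",
    "product": lambda length, width: length * width,
}
best = best_combination(duplicates, combiners)
```

Averaging over all duplicate pairs reflects the requirement that a single
equation has to hold for every pair, not just for one of them.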
7.7 Experimental Evaluation
As in Chap. 5, we have evaluated the complex table matching algorithm using
both synthetic and real-world data. After defining quality measures for the
complex matching problem, we present the experimental results.
7.7.1 Quality Measures for Complex Matching
While the schema matching literature agrees on using precision, recall, and the
F-measure for assessing the quality of simple matchings, no such standard can
be found for complex matchings. This is partly due to the fact that there is no
large body of work on this topic yet.
In contrast to simple matchings, a detected complex correspondence can be
‘almost’ correct: E.g., the example in Fig. 7.1 contains a complex correspondence
({FN, LN},{N}). If the matching algorithm detected an incomplete correspon-
dence ({LN},{N}), this would not be considered a correct solution. However,
it would give the user enough information to rectify the correspondence after it
has been identified as being incomplete: Instead of searching through the whole
source schema for corresponding attributes of N, he could look for the missing
partner in the vicinity of LN, i.e., the same table or referenced tables. Hence,
the detected correspondence could be considered partially correct.
To evaluate the result of the DUMAS complex matching algorithm, we define
two versions of both recall and precision. Recall from Sec. 5.7 that these two
measures are defined as
Precision = \frac{|C \cap R|}{|R|}    Recall = \frac{|C \cap R|}{|C|}    (7.7)
where C is the correct result and R is the retrieved result. If precision and
recall are applied to complex correspondences, we call those functions
C-Precision and C-Recall. Those are very strict measures, because the
correspondence ({LN}, {N}) would be considered incorrect.
To capture the above intuition that correspondences can be partially correct,
we also define two functions S-Precision and S-Recall. They are calculated as
described above, except that the constituent simple correspondences of each
complex correspondence are considered: A complex correspondence
({a_1, . . . , a_m}, {b_1, . . . , b_n}) has m · n constituent correspondences
({a_i}, {b_j}) with 1 ≤ i ≤ m and 1 ≤ j ≤ n. E.g., the correspondence
({FN, LN}, {N}) has two constituent simple correspondences ({FN}, {N}) and
({LN}, {N}). If only the latter was found as described above, we would have
achieved an S-Precision of 1 and an S-Recall of 0.5.
³ Note that we need to find a single equation that holds for all duplicate tuple pairs.
It has to be noted that although the latter two functions reflect partial
correctness, they penalize overmerging in the precision measure: If the complex
matching result of Fig. 7.1 was ({FN, LN, A1}, {N, A2}), then it would be split
into six constituent simple correspondences, resulting in an S-Precision of
3/6 = 0.5.
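Both pairs of measures can be written down in a few lines. The sketch assumes
that the correct matching of the running example in Fig. 7.1 consists of
({FN, LN}, {N}) and ({A1}, {A2}), which agrees with the 3/6 stated above; the
function names are illustrative.

```python
def precision_recall(correct, retrieved):
    """Eq. 7.7 on two sets; returns (precision, recall)."""
    hits = len(correct & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(correct) if correct else 0.0
    return precision, recall

def c_measures(correct, retrieved):
    """C-Precision and C-Recall: complex correspondences must match exactly."""
    def freeze(matching):
        return {(frozenset(src), frozenset(tgt)) for src, tgt in matching}
    return precision_recall(freeze(correct), freeze(retrieved))

def s_measures(correct, retrieved):
    """S-Precision and S-Recall: compare the constituent simple correspondences."""
    def constituents(matching):
        return {(a, b) for src, tgt in matching for a in src for b in tgt}
    return precision_recall(constituents(correct), constituents(retrieved))

# Incomplete correspondence: only ({LN}, {N}) instead of ({FN, LN}, {N}).
incomplete_c = c_measures([({"FN", "LN"}, {"N"})], [({"LN"}, {"N"})])
incomplete_s = s_measures([({"FN", "LN"}, {"N"})], [({"LN"}, {"N"})])

# Overmerge of the assumed correct matching of Fig. 7.1.
correct = [({"FN", "LN"}, {"N"}), ({"A1"}, {"A2"})]
overmerged_s = s_measures(correct, [({"FN", "LN", "A1"}, {"N", "A2"})])
```

For the incomplete correspondence the strict measures are both 0, while
S-Precision is 1 and S-Recall is 0.5. For the overmerge, S-Recall stays at 1
and S-Precision drops to 0.5.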
7.7.2 Real-world data
To determine how well the complex matching algorithm works on real-world
data, we have performed experiments with data from the computational biology
and the movie domain.
Computational biology
CATH.DOMAIN_LIST
Domain_Name
Class_Nr
Architecture_Nr
Topology_Nr
Homologous_Superfamily_Nr
S35_Seq_Cluster_Nr
S95_Seq_Cluster_Nr
S100_Seq_Cluster_Nr
Domain_Length
Structure_Resolution
CATH.NAMES
Repr_Protein_Domain
Node_Nr
Node_Description
Figure 7.9: Tables containing information about proteins.
To evaluate the DUMAS complex matcher, we have used two tables containing
information about proteins. Fig. 7.9 depicts the schemata of DOMAIN_LIST
and NAMES containing 67,954 and 1,793 tuples, respectively. The arrows
indicate the correspondences as defined by the domain expert:
({Topology_Nr, Architecture_Nr, Class_Nr, Homologous_Superfamily_Nr},
{Node_Domain}) and ({Domain_Name}, {Repr_Protein_Domain}).
The DUMAS table matcher detected the following correspondences:
({Class_Nr, Architecture_Nr, Topology_Nr}, {Node_Domain}) and
({Domain_Name}, {Repr_Protein_Domain}), i.e., it did not include
Homologous_Superfamily_Nr in the matching. The reason for this can be
found in the definition of Node_Nr⁴: “Description of each node in the CATH
hierarchy for all class, architecture and topology levels. CATH homologous
superfamily names are also included if they have been defined.” Thus, the
Node_Nr does not always contain a value for the homologous superfamily. In
six of the ten detected duplicates, the Node_Nr value was composed of only
three source values, resulting in a missed constituent correspondence.
⁴ ftp://ftp.biochem.ucl.ac.uk/pub/cathdata/v2.6.0/CathNames.v2.6.0
Movie data
The movie data that was used in the experimental evaluation originated from
the Internet Movie Database (IMDB)⁵ and Filmdienst (FD)⁶. To assess the
effectiveness of the algorithm in detecting complex correspondences, we used the
single FD table that contains information about people. That table contains
two attributes representing the given name and the last name of a person. In
contrast, the IMDB schema contains several tables for actor, actress, director,
etc. In each of those tables, the name of a person is represented as a single
attribute.
The single FD table was compared with each of the IMDB tables representing
people involved in the movie business. The result was perfect in all matching
tasks, i.e., the DUMAS table matcher was able to determine that the name in
the IMDB tables is a concatenation of the name attributes in the FD table.
7.7.3 Synthetic data
In addition to real-world data, we have performed extensive experiments on the
generated data sets previously used to test the simple matcher. Fig. 7.10 shows
the two schema configurations that form the baseline for the experiments: one
with six corresponding attributes and one with three corresponding attributes.
We have used the same five database pairs for each configuration as in
Sec. 4.6.2, each containing 5,000 tuples and 100 duplicates.
The baseline configuration depicted in Fig. 7.10 contains only simple cor-
respondences. To create complex correspondences, attributes that have a cor-
responding partner were merged (i.e., their values were concatenated) and the
matching algorithm was applied to the restructured databases.
Detecting 1:n correspondences
To assess the matching quality in the presence of 1:n correspondences, we
have generated every possible 2:1 and 3:1 correspondence by merging source
attributes. Because every possible attribute combination was used, the num-
ber of actual table matchings was quite large: In the case of six corresponding
attributes, there are 15 groups of two source attributes and 20 groups of three
attributes. Given that five different data sets exist for each configuration,
the DUMAS complex matcher was applied 75 and 100 times to detect 2:1 and 3:1
correspondences for the baseline configuration of Fig. 7.10(a). Note that each
experiment only contained a single complex correspondence; the remaining
correspondences were kept simple.
⁵ Available from ftp://ftp.fu-berlin.de/pub/misc/movies/database/
⁶ http://film-dienst.kim-info.de/

Attr. of DB1      Attr. of DB2
SSN
Profession
Surname           Surname
Name              Name
Birth-date        Birth-date
Birth-place       Birth-place
Birth-district    Birth-district
Sex               Sex
                  City
                  District
(a) Six corresponding attributes

Attr. of DB1      Attr. of DB2
SSN
Profession
Sex
Birth-district
Birth-place
Surname           Surname
Name              Name
Birth-date        Birth-date
                  City
                  District
                  Street no
                  Address
                  Postal code
(b) Three corresponding attributes

Figure 7.10: Schema configuration for complex matching experiments.
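The run counts for the six-attribute configuration follow directly from
enumerating the attribute groups. A minimal check, using the corresponding
attributes of Fig. 7.10(a) and the five data sets per configuration mentioned
in the text (the variable names are illustrative):

```python
from itertools import combinations

CORRESPONDING = ["Surname", "Name", "Birth-date",
                 "Birth-place", "Birth-district", "Sex"]
DATA_SETS = 5

# Every group of two or three source attributes yields one restructured schema.
pairs = list(combinations(CORRESPONDING, 2))
triples = list(combinations(CORRESPONDING, 3))

runs_2_to_1 = len(pairs) * DATA_SETS
runs_3_to_1 = len(triples) * DATA_SETS
print(len(pairs), len(triples), runs_2_to_1, runs_3_to_1)  # 15 20 75 100
```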
[Plot “Quality of n:1 matchings”: values between 0 and 1 of the measures
C-Recall, C-Precision, S-Recall, and S-Precision for the series Six2, Six3,
Three2, and Three3.]
Figure 7.11: Experiments with 1:n correspondences.
Fig. 7.11 shows the experimental results for the baseline schemata with six
correspondences and 2:1 correspondences (Six2), 3:1 correspondences (Six3),
and for the baseline schemata with three correspondences and 2:1 correspon-
dences (Three2), and a single 3:1 correspondence (Three3). One notices that
Three2 and Three3 always produced perfect results.
The experiments with the databases derived from Fig. 7.10(a) produced
errors, but the result was still satisfactory. An analysis of the experiments in
Six2 showed that in only one of the 75 matching tasks a false (i.e.,
non-corresponding) attribute was added to the matching. The most prominent
error was overmerging as described in Sec. 7.7.1, which resulted in a decrease
in S-Precision. However, as can be seen from the perfect S-Recall, all simple
correspondences and the constituent correspondences were found. Of course,
overmerging also caused C-Precision and C-Recall to decrease. We point out that
the algorithm always produced a perfect result for four of the five data sets.
The analysis of the Six3 experiments resulted in similar insights. Only a single
data set produced false matchings, and in only a single case a non-corresponding
attribute was matched.
Detecting m:n correspondences
The experiments for m:n correspondences were created in a similar fashion. In
contrast to the n:1 experiments, attributes in both the source and the target
had to be merged to create m:n scenarios. We restricted ourselves to merging
two attributes in each schema, and required the source and target attribute
groups to have at least one (corresponding) attribute in common: If the source
attributes a_i and a_j are merged, then the corresponding partner of either
a_i or a_j must be merged with another target attribute that has a
corresponding source attribute other than a_i or a_j. E.g., if Surname and Name
in the source database of Fig. 7.10(a) were merged, then either Surname or Name
of DB2 must be merged with another target attribute. Assuming that we merged
Name and Birth-place, the correct matching would contain the correspondence
({Name+Surname, Birth-place}, {Name+Birth-place, Surname}). Note that we
performed experiments with every possible combination, resulting in 120
different configurations, and thus, 600 runs of the matching algorithm.
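The figure of 120 configurations can be reproduced by enumerating the merges
under the stated restriction. The sketch reads the restriction as follows,
which is an interpretation of the text rather than the original generator:
choose two source attributes, then merge the target partner of one of them
with a target attribute whose source partner is outside the chosen pair.

```python
from itertools import combinations

CORRESPONDING = ["Surname", "Name", "Birth-date",
                 "Birth-place", "Birth-district", "Sex"]
DATA_SETS = 5

def mn_configurations(attributes):
    """All (source pair, target pair) merges that share exactly one attribute."""
    configs = set()
    for source_pair in combinations(attributes, 2):
        for shared in source_pair:           # partner of a_i or of a_j
            for other in attributes:         # target attribute outside the pair
                if other not in source_pair:
                    configs.add((frozenset(source_pair),
                                 frozenset((shared, other))))
    return configs

configs = mn_configurations(CORRESPONDING)
runs = len(configs) * DATA_SETS
print(len(configs), runs)  # 120 600
```

The count decomposes as 15 source pairs times 2 shared attributes times 4
remaining target attributes.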
[Plot “Quality of m:n matchings”: values between 0 and 1 of the measures
C-Recall, C-Precision, S-Recall, and S-Precision.]
Figure 7.12: Experiments with m:n correspondences.
Fig. 7.12 shows the result of the experiments. With every quality measure
being 0.98 or above, the algorithm can be considered successful. The analysis
of the matching result showed that, in addition to overmerging, some errors are
the result of missed matches. Hence, the recall of simple correspondences is
below 1. In only four of the 600 runs of the algorithm, a false attribute was
added to the matching.
7.8 Discussion
While the majority of correspondences in matching tasks are 1:1, there are
many scenarios where complex (i.e., n:1, 1:n, or m:n) correspondences exist.
Detecting such correspondences is a very challenging task, because the number
of possible complex matchings is very large when compared to the number of
simple matchings. In fact, the majority of schema matching algorithms only
produce simple correspondences.
The DUMAS complex matcher described in this chapter is able to detect
complex correspondences between two tables. Based on the similarity matrix
constructed to detect simple correspondences, the algorithm merges attributes
to improve the matching. The matcher uses various heuristics to reduce the
number of matchings that have to be considered. We have experimentally shown
that the algorithm detects complex matchings with high accuracy.
The general approach implemented in the DUMAS complex matcher is
generic with respect to the combination function: Although the algorithm is
presented and implemented with concatenation as the only function, we have
also shown that other functions can be integrated into the matching system.
However, we believe that this possibility has to be further investigated in var-
ious experiments. It has to be noted that the existence of correspondences
involving complex data transformations also affects the quality of the duplicate
detection result: As our duplicate detection algorithm is solely based on the
tupsim measure, a complex relationship like Area = Length · Width might
have a negative effect on duplicate detection accuracy.
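
The effect can be made concrete with a small sketch. The code below uses a plain token-overlap (Jaccard) score as a stand-in for a string-based tuple similarity; it is not the tupsim measure itself, and the function name `token_overlap` and the sample tuples are assumptions for illustration only.

```python
def token_overlap(tuple_a, tuple_b) -> float:
    """Jaccard overlap of the tokens in two tuples (stand-in for a string-based tuple similarity)."""
    tokens_a = {tok for value in tuple_a for tok in str(value).lower().split()}
    tokens_b = {tok for value in tuple_b for tok in str(value).lower().split()}
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# The same real-world object in two hypothetical tables.
source = ("Plot 17", 12, 7)                 # id, length, width
target_arithmetic = ("Plot 17", 84)         # id, area = length * width
target_concatenated = ("Plot 17", "12 7")   # id, dimensions as concatenated text

arithmetic = token_overlap(source, target_arithmetic)
concatenated = token_overlap(source, target_concatenated)

print(arithmetic, concatenated)
```

The values 12 and 7 share no tokens with 84, so the arithmetic relationship contributes nothing to the similarity of the two tuples, whereas a concatenation-style relationship still does. With many such attributes, true duplicates could fall below other tuple pairs in the ranking.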
Part IV
Discussion
Chapter 8
Conclusion
In this chapter we summarize our results and contributions and discuss future
work on schema matching and data integration.
8.1 The DUMAS approach
The semi-automatic detection of attribute correspondences is a challenging task,
because in most scenarios available information is not sufficient to infer the se-
mantic relationships between schemata with 100% accuracy. Different ‘hints’
from data and metadata have to be exploited. Consequently, a variety of schema
matching algorithms have been proposed. We described related work on schema
matching and the drawbacks of the proposed solutions in Sec. 2.3. In the fol-
lowing, we argued that duplicates can help when schema-based and vertical
instance-based matchers fail.
We developed several algorithms, which show that duplicates can be used
for schema matching. The goal of the DUMAS table matcher is the detection of
simple correspondences between two tables. To do so, the matcher has to find
duplicates in unaligned relations, which is a challenging problem that has not
been discussed before. The experimental evaluation shows that our proposed
tuple similarity measure is able to detect top-k duplicates with very high pre-
cision even in critical configurations. We also presented a procedure to extract
attribute correspondences from the detected duplicates, which has been shown to be
accurate with both real-world and synthetic data.
The DUMAS schema matcher extends the duplicate-based approach to com-
plex schemata. Here we face an additional problem: When comparing two arbi-
trary tables, it is not known if the tables are related and contain duplicates. We
developed heuristics to interpret the result of the table matcher, and presented an
iterative matching algorithm that exploits multi-table duplicates. Finally, we
developed the DUMAS complex matcher to extract complex correspondences
from duplicates. The matcher produces various matrices by merging attributes
and chooses a ‘best’ matrix that represents the complex matching.
With this thesis we have shown that duplicates can be used to detect at-
tribute correspondences. The experimental evaluation indicates that the algo-
rithms produce a highly accurate matching. However, we believe that there
is still room for improvement. In the following, we discuss future research in
duplicate-based matching and schema matching in general.
8.2 Combining DUMAS with other matchers
It is commonly agreed that no single indicator leads to the best matching:
As discussed in Sec. 2.3, both schema- and instance-based matchers produce
suboptimal results in certain cases (e.g., non-descriptive attribute labels or se-
mantically different, but structurally similar attributes). The same holds for
duplicate-based matchers: E.g., the duplicate detection procedure can produce
false duplicates when few attributes correspond, and the value domains of some
attributes overlap (e.g., place of residence and birthplace). Using such false
duplicates in the matching step can degrade matching accuracy.
A possible way to improve matching quality is to combine the DUMAS
matcher with other matchers. Sec. 2.3.5 discusses two design approaches: hybrid
matchers and composite matchers. Hybrid matchers combine various matching
approaches into a single algorithm. In their current state, the DUMAS algo-
rithms already provide interfaces to interact with other matchers: E.g., the ex-
tended tuple similarity measure etupsim exploits known correspondences, which
can be detected using any available matcher. The DUMAS schema matcher re-
quires an initial matching, which could also be produced by another schema
matcher.
Composite matchers execute several matchers independently and combine
the matching results. In contrast to hybrid matchers, the constituent match-
ers do not benefit from the results of the other matchers. However, composite
matchers are very flexible because new matchers can be easily added. An initial
effort to integrate the DUMAS table matcher into COMA++ was discontin-
ued due to the differences in the matching strategy: The COMA++ framework
transforms a schema into a generic graph [ADMR05]. Descriptive features (e.g.,
attribute name, data type, statistics, child nodes) are extracted from each graph
node in a preprocessing step. This process does not consider other databases,
which might have to be matched with the current database in the future. The
matchers that are used by COMA++ must only use the extracted features, i.e.,
the matchers work on the preprocessed data and not on the original database.
Consequently, only schema-based and vertical instance-based matchers are ap-
plicable. Duplicate-based matchers need to access the databases for matching:
It is impossible to detect duplicates in the preprocessing step, because different
tuples are duplicates depending on which databases are to be matched.
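
The composite design discussed in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the COMA++ or DUMAS code: each matcher is assumed to deliver a similarity matrix as a dictionary keyed by (source attribute, target attribute), the matrices are combined by simple averaging, and the function names `combine_average` and `select_best`, the threshold, and the sample scores are hypothetical.

```python
def combine_average(matrices):
    """Average the similarity of each attribute pair over independently produced matrices."""
    pairs = set().union(*(m.keys() for m in matrices))
    # A pair that a matcher did not score counts as similarity 0.0 for that matcher.
    return {pair: sum(m.get(pair, 0.0) for m in matrices) / len(matrices) for pair in pairs}


def select_best(combined, threshold=0.5):
    """For each source attribute, keep the best-scoring target at or above the threshold."""
    best = {}
    for (src, tgt), score in combined.items():
        if score >= threshold and score > best.get(src, (None, 0.0))[1]:
            best[src] = (tgt, score)
    return {src: tgt for src, (tgt, _) in best.items()}


# Hypothetical output of a label-based matcher and a duplicate-based matcher.
name_based = {("addr", "address"): 0.9, ("addr", "city"): 0.1,
              ("nm", "name"): 0.3, ("nm", "city"): 0.2}
duplicate_based = {("addr", "address"): 0.8, ("nm", "name"): 0.9, ("nm", "city"): 0.4}

combined = combine_average([name_based, duplicate_based])
mapping = select_best(combined)
print(mapping)
```

Here the duplicate-based scores compensate for the non-descriptive label `nm`, which the label-based matcher alone would not have matched above the threshold. Averaging is only one possible aggregation; maximum or weighted combinations would fit the same interface.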
8.3 Schema Matching and Data Integration
Schema matching is an important task in many integration scenarios. However,
the result of the matching step, namely the attribute correspondences, is only
the input to subsequent steps. Consequently, schema matching algorithms are
unlikely to be distributed as separate tools. Instead, they are part of more
complex data integration products.
Data integration has received significant attention, mostly from the research
community but also from commercial vendors. While a few years ago
only research prototypes (e.g., Clio [HHH+05]) were available, several commercial
products can be found nowadays. Altova MapForce™ 2005 provides a visual
interface to declare mappings between relational database schemata, XML
schemata, electronic data interchange (EDI) interfaces, and flat files [Alt05].
The user only needs to draw correspondences between semantically related el-
ements and specify data transformation functions. MapForce automatically
generates the code in XSLT 1.0, XSLT 2.0, XQuery, Java, C++, or C#. The
IBM Rational Data Architect, which has been heavily influenced by the Clio
research project at IBM, goes one step further in the automation of the
mapping procedure [Gor05]. The tool exploits schema information as well as
instance samples to perform schema matching, and thus, provides additional
support for the specification of attribute correspondences.
The IBM Rational Data Architect is a step in the right direction, because
it supports the user in the definition of attribute correspondences. However, a
data integration tool should also help the user in the evaluation process, when the
expert removes false correspondences and adds missing correspondences. Yan
et al. have shown that sample data can aid the developer in the definition of
mappings [YMHF01]. We believe that duplicates are particularly useful for the
specification of correspondences: Instead of interpreting random tuples, the user
can see how the same real-world entity is represented in both the source and
the target. Duplicates are also very helpful for the evaluation of the detected
correspondences: If the provided tuple pairs cannot be confirmed to be true
duplicates, then the resulting matching is unlikely to be very accurate and must
be changed by the expert. In addition, the user can easily specify complex data
transformations based on the information provided by duplicates.
The schema mapping tools described above can be used to describe semantic
relationships between schemata, i.e., models of data. The goal of model man-
agement is the provision of a generic framework to handle any kind of models,
including those for which no instances exist (e.g., entity-relationship diagrams
or UML class diagrams) [Ber03, MRB03]. Model management tools transform
models into a generic format. Various operators are defined to manipulate those
models, e.g., Merge to integrate models or Diff to determine the difference be-
tween models. One of the more complex operators is Match, which is used to
detect correspondences between semantically related model elements. Although
existing implementations of a generic match operator exploit only metadata, it
is agreed that available data can improve the matching result [MGMR02].
Schema mapping and model management tools are restricted to the manipulation
of schemata. HumMer goes one step further by integrating the data
as well: After the schemata are integrated based on the matching established
by the DUMAS table matcher, the tool also detects duplicates and resolves
inconsistencies on the instance level [BBB+05]. The result is a duplicate-free
representation of the information contained in the source tables. We believe
that tools like HumMer are particularly useful for ad-hoc or one-time integra-
tion tasks, e.g., catalogue integration, where schema mapping tools provide only
limited support (see Sec. 1.5.1).
8.4 The Future of Schema Matching
Schema matching has been an active research area in the past years, and a
variety of algorithms have been developed. While the experimental results are
promising, new schema matching approaches that are currently not conceivable
are likely to be proposed in the future. As has been shown in the past, schema
matching research can greatly benefit from work in other areas, e.g., machine
learning, text classification, or information theory. Many techniques that have
emerged in those and other research areas are potentially applicable to detect
attribute correspondences, and thus, contribute to the field of schema matching.
The vast majority of schema matching work has been done in academic re-
search. Consequently, many experiments have used small data sets that have
been extracted from Internet sources. In the future, these algorithms will have to
prove themselves in commercial projects, including large-scale integration tasks.
While the goal of schema matching research has been to improve accuracy, per-
formance will be a major concern in industrial-strength schema matching. We
emphasize that schema matching is an off-line process, and thus, performance
is not as much an issue as in runtime systems. However, schema matching tools
are used by human experts. These users might want to sacrifice a little match-
ing accuracy for a large performance gain, e.g., because they can easily correct
the matching based on their expert knowledge.
Exploiting expert knowledge is another potential research direction. Schema
matching is always described as being a semi-automatic process. That notion
is rarely reflected in the matching algorithms: Most schema matchers produce
a schema matching without interacting with the user and expect the expert
to correct the matching afterwards. The user can only influence the matching
process by setting thresholds, providing sample data, or matching other sources.
It is conceivable that schema matchers can benefit from an increased amount of
user interaction both in terms of accuracy and performance, because the expert
can ‘guide’ the matcher based on their domain knowledge.
Bibliography
[ACG02] Rohit Ananthakrishna, Surajit Chaudhuri, and Venkatesh Ganti.
Eliminating fuzzy duplicates in data warehouses. In Proceedings of
the International Conference on Very Large Databases (VLDB),
pages 586–597, 2002.
[ACMH03] Karl Aberer, Philippe Cudré-Mauroux, and Manfred Hauswirth.
The Chatty Web: Emergent semantics through gossiping. In Pro-
ceedings of the International World Wide Web Conference, May
2003.
[ACMM03] Luigi Arlotta, Valter Crescenzi, Giansalvatore Mecca, and Paolo
Merialdo. Automatic annotation of data extracted from large web
sites. In Sixth Int. Workshop on the Web and Databases (WebDB
2003), 2003.
[ADMR05] David Aumueller, Hong Hai Do, Sabine Massmann, and Erhard
Rahm. Schema and ontology matching with COMA++. In Pro-
ceedings of the ACM International Conference on Management of
Data (SIGMOD), pages 906–908, 2005.
[AHV95] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of
Databases. Addison-Wesley, 1995.
[Alt05] Altova. Data integration: Opportunities, challenges, and
Altova MapForce™ 2005. White paper available at
http://www.altova.com/whitepapers/mapforce.pdf, 2005.
[ATS04] Stephanos Androutsellis-Theotokis and Diomidis Spinellis. A sur-
vey of peer-to-peer content distribution technologies. ACM Com-
puting Surveys, 36(4):335–371, 2004.
[BBB+05] Alexander Bilke, Jens Bleiholder, Christoph Böhm, Karsten
Draba, Felix Naumann, and Melanie Weis. Automatic data fusion
with HumMer. In Proceedings of the International Conference on
Very Large Databases (VLDB), pages 1251–1254, 2005.
[BCV99] Sonia Bergamaschi, Silvana Castano, and Maurizio Vincini. Se-
mantic integration of semistructured and structured data sources.
SIGMOD Record, 28(1):54–59, March 1999.
[Ber03] Philip A. Bernstein. Applying model management to classical
meta data problems. In First Biennial Conference on Innovative
Data Systems Research (CIDR 2003), 2003.
[Bil04] Alexander Bilke. Instance-based schema management. In Proceed-
ings of the 11th Doctoral Consortium on Advanced Information
Systems Engineering (CAiSE’04), pages 95–106, 2004.
[BKLW99] Susanne Busse, Ralf-Detlef Kutsche, Ulf Leser, and Herbert Weber.
Federated information systems: Concepts, terminology and
architectures. Forschungsberichte des Fachbereichs Informatik 99-9,
Technische Universität Berlin, 1999.
[BKSS04] Catriel Beeri, Yaron Kanza, Eliyahu Safra, and Yehoshua Sagiv.
Object fusion in geographic information systems. In Proceedings of
the International Conference on Very Large Databases (VLDB),
pages 816–827, 2004.
[BLN86] C. Batini, M. Lenzerini, and S. B. Navathe. A comparative analy-
sis of methodologies for database schema integration. ACM Com-
puting Surveys, 18(4):323–364, 1986.
[BM02] Jacob Berlin and Amihai Motro. Database schema matching using
machine learning with feature selection. In Proceedings of the Con-
ference in Advanced Information Systems Engineering (CAiSE),
pages 452–466, 2002.
[BM03] Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate de-
tection using learnable string similarity measures. In Proceedings
of the ACM International Conference on Knowledge discovery and
data mining (KDD), pages 39–48, 2003.
[BMPQ04] Philip A. Bernstein, Sergey Melnik, Michalis Petropoulos, and
Christoph Quix. Industrial-strength schema matching. SIGMOD
Record, 33(4):38–43, 2004.
[BN05] Alexander Bilke and Felix Naumann. Schema matching using du-
plicates. In Proceedings of the International Conference on Data
Engineering (ICDE), pages 69–80, 2005.
[BSS03] Paola Bertolazzi, Luca De Santis, and Monica Scannapieco. Au-
tomatic record matching in cooperative information systems. In
Proceedings of the International Workshop on Data Quality in Cooperative
Information Systems (DQCIS), Siena, Italy, 2003.
[Bus02] Susanne Busse. Modellkorrespondenzen für die kontinuierliche
Entwicklung mediatorbasierter Informationssysteme. Logos Verlag,
2002.
[BYRN99] Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Infor-
mation Retrieval. Addison Wesley, 1999.
[CCL03] Cecil Eng H. Chua, Roger H. L. Chiang, and Ee-Peng Lim.
Instance-based attribute identification in database integration.
VLDB Journal, 12(3):228–243, 2003.
[CD97] Surajit Chaudhuri and Umeshwar Dayal. An overview of data
warehousing and OLAP technology. ACM SIGMOD Record,
26(1):65–74, 1997.
[CGGM03] Surajit Chaudhuri, Kris Ganjam, Venkatesh Ganti, and Rajeev
Motwani. Robust and efficient fuzzy match for online data clean-
ing. In Proceedings of the ACM International Conference on Man-
agement of Data (SIGMOD), pages 313–324, 2003.
[CLR01] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest.
Introduction to Algorithms. The MIT Press, 2001.
[CMM01] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo.
ROADRUNNER: Towards automatic data extraction from large
web sites. In Proceedings of the International Conference on Very
Large Databases (VLDB), 2001.
[CNBYM01] Edgar Chávez, Gonzalo Navarro, Ricardo Baeza-Yates, and
José Luis Marroquín. Searching in metric spaces. ACM Computing
Surveys, 33(3):273–321, 2001.
[Coh98] William W. Cohen. Integration of heterogeneous databases with-
out common domains using queries based on textual similarity. In
Proceedings of the ACM International Conference on Management
of Data (SIGMOD), 1998.
[CRF03] William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg.
A comparison of string distance metrics for name-matching tasks.
In Proceedings of the IJCAI Workshop on Information Integration
on the Web (IIWeb), pages 73–78, 2003.
[DDH01] AnHai Doan, Pedro Domingos, and Alon Halevy. Reconciling
schemas of disparate data sources: A machine-learning approach.
In Proceedings of the ACM International Conference on Manage-
ment of Data (SIGMOD), pages 509–520, 2001.
[DDH03] AnHai Doan, Pedro Domingos, and Alon Halevy. Learning to
match the schemas of data sources: A multistrategy approach.
Machine Learning Journal, 50:279–301, March 2003.
[DGMY03] Neil Daswani, Hector Garcia-Molina, and Beverly Yang. Open
problems in data sharing peer-to-peer systems. In Proceedings of
the International Conference on Database Theory (ICDT), Lec-
ture Notes in Computer Science 2572, pages 1–15. Springer-Verlag,
2003.
[DJMS02] Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, and
Vladislav Shkapenyuk. Mining database structure; or, how to
build a data quality browser. In Proceedings of the ACM Inter-
national Conference on Management of Data (SIGMOD), pages
240–251, 2002.
[DLD+04] Robin Dhamankar, Yoonkyong Lee, AnHai Doan, Alon Halevy,
and Pedro Domingos. iMAP: Discovering complex semantic
matches between database schemas. In Proceedings of the ACM
International Conference on Management of Data (SIGMOD),
pages 383–394, 2004.
[DLLH03] AnHai Doan, Ying Lu, Yoonkyong Lee, and Jiawei Han. Profile-
based object matching for information integration. IEEE Intelli-
gent Systems, 18(5):54–59, 2003.
[DR02] Hong-Hai Do and Erhard Rahm. COMA - a system for flexible
combination of schema matching approaches. In Proceedings of
the International Conference on Very Large Databases (VLDB),
pages 610–621, 2002.
[EVE02] Mohamed G. Elfeky, Vassilios S. Verykios, and Ahmed K. Elma-
garmid. TAILOR: A record linkage toolbox. In Proceedings of
the International Conference on Data Engineering (ICDE), pages
17–28, 2002.
[FPKT04] Ronald Fagin, Lucian Popa, Phokion G. Kolaitis, and Wang-
Chiew Tan. Composing schema mappings: Second-order depen-
dencies to the rescue. In Proceedings of the Symposium on Prin-
ciples of Database Systems (PODS), pages 83–94, 2004.
[FS69] Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage.
Journal of the American Statistical Association, 64(328):1183–
1210, 1969.
[Gal86] Zvi Galil. Efficient algorithms for finding maximum matching in
graphs. ACM Computing Surveys, 18(1):23–38, 1986.
[GE03] Isabelle Guyon and André Elisseeff. An introduction to variable
and feature selection. Journal of Machine Learning Research,
3:1157–1182, 2003.
[GGLZ04] Jarek Gryz, Junjie Guo, Linqi Liu, and Calisto Zuzarte. Query
sampling in DB2 Universal Database. In Proceedings of the ACM
International Conference on Management of Data (SIGMOD),
pages 839–843, 2004.
[GI89] Dan Gusfield and Robert W. Irving. The Stable Marriage Problem:
Structure and Algorithms. MIT Press, 1989.
[GIJ+01] Luis Gravano, Panagiotis G. Ipeirotis, H. V. Jagadish, Nick
Koudas, S. Muthukrishnan, and Divesh Srivastava. Approximate
string joins in a database (almost) for free. In Proceedings of the
International Conference on Very Large Databases (VLDB), 2001.
[GMPQ+97] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass,
Anand Rajaraman, Yehoshua Sagiv, Jeffrey Ullman, Vasilis Vas-
salos, and Jennifer Widom. The TSIMMIS approach to mediation:
Data models and languages. Journal of Intelligent Information
Systems, 8(2):117–132, March 1997.
[Gor05] Davor Gornik. Use Rational Data Architect to integrate
data sources. Available at
http://www-128.ibm.com/developerworks/library/ar-rdaint/, March 2005.
[Gus97] Dan Gusfield. Algorithms on Strings, Trees, and Sequences. Cam-
bridge University Press, 1997.
[Hal01] Alon Y. Halevy. Answering queries using views: A survey. VLDB
Journal, 10(4):270–294, 2001.
[HC03] Bin He and Kevin Chen-Chuan Chang. Statistical schema match-
ing across web query interfaces. In Proceedings of the ACM Inter-
national Conference on Management of Data (SIGMOD), pages
217–228, 2003.
[HCCH04] Bin He, Kevin Chen-Chuan Chang, and Jiawei Han. Discovering complex
matchings across web query interfaces: A correlation mining
approach. In Proceedings of the ACM International Conference
on Knowledge discovery and data mining (KDD), pages 148–157,
2004.
[HHH+05] Laura M. Haas, Mauricio A. Hernández, Howard Ho, Lucian Popa,
and Mary Roth. Clio grows up: From research prototype to in-
dustrial tool. In Proceedings of the ACM International Conference
on Management of Data (SIGMOD), pages 805–810, 2005.
[HHL+03] Ryan Huebsch, Joseph M. Hellerstein, Nick Lanham, Boon Thau
Loo, Scott Shenker, and Ion Stoica. Querying the Internet with
PIER. In Proceedings of the International Conference on Very
Large Databases (VLDB), pages 321–332, 2003.
[HIST03] Alon Y. Halevy, Zachary G. Ives, Dan Suciu, and Igor Tatari-
nov. Schema mediation in peer data management systems. In
Proceedings of the International Conference on Data Engineering
(ICDE), March 2003.
[HMH01] Mauricio Hernández, Renée Miller, and Laura M. Haas. Clio: A
semi-automatic tool for schema mapping. In Proceedings of the
ACM International Conference on Management of Data (SIG-
MOD), page 607, 2001.
[HS95] Mauricio A. Hernández and Salvatore J. Stolfo. The merge/purge
problem for large databases. In Proceedings of the ACM Inter-
national Conference on Management of Data (SIGMOD), pages
127–138, 1995.
[HS98] Mauricio A. Hernández and Salvatore J. Stolfo. Real-world data is
dirty: Data cleansing and the merge/purge problem. Data Mining
and Knowledge Discovery, 2(1):9–37, 1998.
[HS03] Gisli R. Hjaltason and Hanan Samet. Index-driven similarity
search in metric spaces. ACM Transactions on Database Systems
(TODS), 28(4):517–580, 2003.
[KN03] Jaewoo Kang and Jeffrey F. Naughton. On schema matching with
opaque column names and data values. In Proceedings of the ACM
International Conference on Management of Data (SIGMOD),
pages 205–216, 2003.
[Kol05] Phokion G. Kolaitis. Schema mappings, data exchange, and meta-
data management. In Proceedings of the Symposium on Principles
of Database Systems (PODS), 2005.
[Kon06] Martin Konitzer. Effiziente Duplikaterkennung in heterogenen
relationalen Datenbanken. Diploma thesis, Technische Universität
Berlin, 2006.
[KS91] Won Kim and Jungyun Seo. Classifying schematic heterogeneity
and data heterogeneity in multidatabase systems. IEEE Com-
puter, 24(12):12–18, December 1991.
[LC94] Wen-Syan Li and Chris Clifton. Semantic integration in heteroge-
neous databases using neural networks. In Proceedings of the In-
ternational Conference on Very Large Databases (VLDB), pages
1–12, 1994.
[LC00] Wen-Syan Li and Chris Clifton. SEMINT: A tool for identifying
attribute correspondences in heterogeneous databases using neural
networks. Data and Knowledge Engineering, 33(1):49–84, April
2000.
[Len02] Maurizio Lenzerini. Data integration: A theoretical perspective.
In Proceedings of the Symposium on Principles of Database Sys-
tems (PODS), pages 233–246, 2002.
[LMK03] Kristina Lerman, Steven Minton, and Craig A. Knoblock. Wrap-
per maintenance: A machine learning approach. Journal of Arti-
ficial Intelligence Research, 18:149–181, 2003.
[LMR90] Witold Litwin, Leo Mark, and Nick Roussopoulos. Interoperabil-
ity of multiple autonomous databases. ACM Computing Surveys,
22(3):267–293, September 1990.
[LNE89] James A. Larson, Shamkant B. Navathe, and Ramez Elmasri. A
theory of attribute equivalence in databases with application to
schema integration. IEEE Trans. Software Eng., 15(4):449–463,
1989.
[LRNdST02] Alberto H. F. Laender, Berthier A. Ribeiro-Neto, Altigran Soares
da Silva, and Juliana S. Teixeira. A brief survey of web data
extraction tools. SIGMOD Record, 31(2):84–93, 2002.
[LRO96a] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Query-answering
algorithms for information agents. In Thirteenth National
Conference on Artificial Intelligence (AAAI-96), pages 40–47,
1996.
[LRO96b] Alon Y. Levy, Anand Rajaraman, and Joann J. Ordille. Query-
ing heterogeneous information sources using source descriptions.
In Proceedings of the International Conference on Very Large
Databases (VLDB), pages 251–262, 1996.
[LSS01] Laks V. S. Lakshmanan, Fereidoon Sadri, and Subbu N. Subramanian.
SchemaSQL: An extension to SQL for multidatabase interoperability.
ACM Transactions on Database Systems, 26(4):476–519,
December 2001.
[LT97] Daniel Lopresti and Andrew Tomkins. Block edit models for
approximate string matching. Theoretical Computer Science,
181:159–179, 1997.
[MAL+05] Robert McCann, Bedoor K. AlShebli, Quoc Le, Hoa Nguyen, Long
Vu, and AnHai Doan. Mapping maintenance for data integration
systems. In Proceedings of the International Conference on Very
Large Databases (VLDB), pages 1018–1030, 2005.
[MBDH05] Jayant Madhavan, Philip A. Bernstein, AnHai Doan, and Alon Y.
Halevy. Corpus-based schema matching. In Proceedings of the
International Conference on Data Engineering (ICDE), pages 57–
68, 2005.
[MBR01] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm.
Generic schema matching with Cupid. In Proceedings of the In-
ternational Conference on Very Large Databases (VLDB), pages
49–58, 2001.
[ME96] Alvaro E. Monge and Charles P. Elkan. The field matching prob-
lem: Algorithms and applications. In Proceedings of the ACM
International Conference on Knowledge discovery and data min-
ing (KDD), pages 267–270, 1996.
[ME97] Alvaro E. Monge and Charles P. Elkan. An efficient domain-
independent algorithm for detecting approximately duplicate
database records. In 2nd Workshop on Research Issues on Data
Mining and Knowledge Discovery (DMKD’97), 1997.
[MGMR02] Sergey Melnik, Hector Garcia-Molina, and Erhard Rahm. Sim-
ilarity Flooding: A versatile graph matching algorithm and its
application to schema matching. In Proceedings of the Interna-
tional Conference on Data Engineering (ICDE), pages 117–128,
2002.
[MH03] Jayant Madhavan and Alon Y. Halevy. Composing mappings
among data sources. In Proceedings of the International Con-
ference on Very Large Databases (VLDB), pages 572–583, 2003.
[MHH00] Renée J. Miller, Laura M. Haas, and Mauricio A. Hernández.
Schema mapping as query discovery. In Proceedings of the In-
ternational Conference on Very Large Databases (VLDB), pages
77–88, 2000.
[MRB03] Sergey Melnik, Erhard Rahm, and Philip A. Bernstein. Rondo:
A programming platform for generic model management. In Pro-
ceedings of the ACM International Conference on Management of
Data (SIGMOD), pages 193–204, 2003.
[MZ98] Tova Milo and Sagit Zohar. Using schema matching to simplify
heterogeneous data translation. In Proceedings of the Interna-
tional Conference on Very Large Databases (VLDB), pages 122–
133, 1998.
[NBYST01] Gonzalo Navarro, Ricardo Baeza-Yates, Erkki Sutinen, and Jorma
Tarhio. Indexing methods for approximate string matching. Bul-
letin of the IEEE Computer Society Technical Committee on Data
Engineering, 24(4):19–27, December 2001.
[NHT+02] Felix Naumann, Ching-Tien Ho, Xuqing Tian, Laura Haas, and
Nimrod Megiddo. Attribute classification using feature analysis.
In Proceedings of the International Conference on Data Engineer-
ing (ICDE), 2002.
[NKAJ59] H. B. Newcombe, J.M. Kennedy, S. J. Axford, and A. P. James.
Automatic linkage of vital records. Science, 130(3381):954–959,
1959.
[NL00] Mattis Neiling and Hans-Joachim Lenz. Data integration by
means of object identification in information systems. In Eighth
European Conference on Information Systems, 2000.
[NOTZ03] Wee Siong Ng, Beng Chin Ooi, Kian-Lee Tan, and Aoying Zhou.
PeerDB: A P2P-based system for distributed data sharing. In
Proceedings of the International Conference on Data Engineering
(ICDE), pages 633–644, 2003.
[ÖV99] M. Tamer Özsu and Patrick Valduriez. Principles of Distributed
Database Systems. Prentice Hall, 1999.
[PAGM96] Yannis Papakonstantinou, Serge Abiteboul, and Hector Garcia-
Molina. Object fusion in mediator systems. In Proceedings of
the International Conference on Very Large Databases (VLDB),
pages 413–424, 1996.
[PB03] Rachel Pottinger and Philip A. Bernstein. Merging models based
on given correspondences. In Proceedings of the International Conference
on Very Large Databases (VLDB), pages 862–873, 2003.
[PE95] Mike Perkowitz and Oren Etzioni. Category translation: Learning
to understand information on the internet. In Proceedings of the
International Joint Conference on Artificial Intelligence (IJCAI),
1995.
[PL00] Rachel Pottinger and Alon Y. Levy. A scalable algorithm for
answering queries using views. In Proceedings of the Interna-
tional Conference on Very Large Databases (VLDB), pages 484–
495, 2000.
[PS82] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization:
Algorithms and Complexity. Prentice-Hall, 1982.
[PVM+02] Lucian Popa, Yannis Velegrakis, Renée Miller, Mauricio A.
Hernández, and Ronald Fagin. Translating web data. In Proceedings
of the International Conference on Very Large Databases
(VLDB), pages 598–609, 2002.
[RB01] Erhard Rahm and Philip A. Bernstein. A survey of approaches
to automatic schema matching. VLDB Journal, 10(4):334–350,
2001.
[RD00] Erhard Rahm and Hong Hai Do. Data cleaning: Problems and
current approaches. IEEE Data Engineering Bulletin, 24(3):3–13,
2000.
[RDM04] Erhard Rahm, Hong Hai Do, and Sabine Massmann. Matching
large XML schemas. SIGMOD Record, 33(4):26–31, 2004.
[RNHS06] Armin Roth, Felix Naumann, Tobias Hübner, and Martin
Schweigert. System P: Query answering in PDMS under limited
resources. In IIWeb, 2006.
[Rot64] Gian-Carlo Rota. The number of partitions of a set. American
Mathematical Monthly, 71(5):498–504, 1964.
[Seb02] Fabrizio Sebastiani. Machine learning in automated text catego-
rization. ACM Computing Surveys, 34(1):1–47, 2002.
[SL90] Amit P. Sheth and James A. Larson. Federated database sys-
tems for managing distributed, heterogeneous, and autonomous
databases. ACM Computing Surveys, 22(3):183–236, September
1990.
[SPD92] Stefano Spaccapietra, Christine Parent, and Yann Dupont. Model
independent assertions for integration of heterogeneous schemas.
VLDB Journal, 1(1):81–126, 1992.
[ST96] Erkki Sutinen and Jorma Tarhio. Filtration with q-samples in
approximate string matching. In CPM, pages 50–63, 1996.
[SW81] T. F. Smith and M. S. Waterman. Identification of common molec-
ular subsequences. Journal of Molecular Biology, 147:195–197,
1981.
[TD97] Ljupco Todorovski and Saso Dzeroski. Declarative bias in equa-
tion discovery. In Proceedings of the International Conference on
Machine Learning (ICML), pages 376–384, 1997.
[TH04] Igor Tatarinov and Alon Halevy. Efficient query reformulation in
peer data management systems. In Proceedings of the ACM Inter-
national Conference on Management of Data (SIGMOD), pages
539–550, 2004.
[TKM02] Sheila Tejada, Craig A. Knoblock, and Steven Minton. Learn-
ing domain-independent string transformation weights for high
accuracy object identification. In Proceedings of the ACM In-
ternational Conference on Knowledge discovery and data mining
(KDD), pages 350–359, 2002.
[Ukk92] Esko Ukkonen. Approximate string-matching with q-grams and
maximal matches. Theoretical Computer Science, 92(1):191–211,
1992.
[Ull89] Jeffrey D. Ullman. Principles of Database and Knowledge-Base
Systems, Volume II. Computer Science Press, 1989.
[Ull97] Jeffrey D. Ullman. Information integration using logical views. In
Proceedings of the International Conference on Database Theory
(ICDT), pages 19–40, 1997.
[WDM04] Wensheng Wu, Clement Yu, AnHai Doan, and Weiyi Meng. An
interactive clustering-based approach to integrating source query
interfaces on the deep web. In Proceedings of the ACM Inter-
national Conference on Management of Data (SIGMOD), pages
95–106, 2004.
[Wie92] Gio Wiederhold. Mediators in the architecture of future informa-
tion systems. IEEE Computer, 25(3):38–49, 1992.
[Win95] William E. Winkler. Matching and record linkage. In Brenda G.
Cox et al., editors, Business Survey Methods. Wiley-Interscience,
1995.
[WN05] Melanie Weis and Felix Naumann. DogmatiX tracks down dupli-
cates in XML. In Proceedings of the ACM International Confer-
ence on Management of Data (SIGMOD), pages 431–442, 2005.
[YMHF01] Ling-Ling Yan, Renée J. Miller, Laura M. Haas, and Ronald Fagin.
Data-driven understanding and refinement of schema mappings.
In Proceedings of the ACM International Conference on Management
of Data (SIGMOD), pages 485–496, 2001.