Document [original]

On Information Need

and Categorizing Search

Faculty of Electrical Engineering, Computer Science, and Mathematics

Department of Computer Science

of the

University of Paderborn, Germany

The accepted dissertation of

Sven Meyer zu Eißen

in partial fulﬁllment of the requirements to obtain the academic degree of

Dr. rer. nat.

Reviewer:

Prof. Dr. Benno Stein

Prof. Dr. Norbert Fuhr

Prof. Dr. Hans Kleine B¨uning

Date of Oral Exam: February 23, 2007

Acknowledgements

I wish to thank Prof. Dr. Hans Kleine B¨uning for supporting

me and my work. He provided a stimulating environment and

chaired the reading committee. I want to express my gratitude

to Prof. Dr. Benno Stein for supervising this thesis, and for

numerous fruitful, serious, and enthusiastic discussions that

initiated or improved many of the presented ideas and results.

I am indepted to his guidance, patience, and mentorship.

Furthermore, I wish to thank Prof. Dr. Norbert Fuhr who eval-

uated this thesis and provided useful comments and sugges-

tions for improvement.

For many valuable discussions and enjoyable breaks I want

to thank all of my (former) colleagues, in particular Dr. An-

dreas G¨obels, Oliver Kramer, Dr. Theo Lettmann, Dr. Oliver

Niggemann, and Martin Potthast.

Finally I would like to thank all of my friends and my whole

family, especially my wife Lisa.

Sven Meyer zu Eißen

iii

Table of Contents

Notation vii

1 Introduction 1

1.1 ThesisContributions ......................... 4

1.2 ThesisOverview............................ 7

2 Document Representation and Retrieval 9

2.1 TheInformationRetrievalProcess.................. 10

2.2 DocumentRepresentation ...................... 11

2.2.1 TermVectorModels ..................... 12

2.2.2 DocumentSimilarity ..................... 16

2.2.3 IndexTermSetConstructionMethods ........... 20

2.3 TechniquesforInformationNeedSatisfaction............ 23

3 Information Need and Categorizing Search 27

3.1 Supervisedvs.UnsupervisedCategorization ............ 29

3.2 TheClusterHypothesis........................ 30

3.3 ClusteringDocuments ........................ 31

3.3.1 Challenges........................... 32

3.3.2 ClusteringAlgorithms .................... 34

3.4 OntheValidityofDocumentClusterings.............. 37

3.4.1 ExternalClusterValidityMeasures ............. 38

3.4.2 InternalClusterValidityMeasures ............. 40

3.4.3 StatisticalHypothesisTesting ................ 46

3.4.4 ExperimentalEvaluation................... 48

3.4.5 ConcludingRemarks ..................... 53

3.5 TheSuﬃxTreeDocumentRepresentation ............. 54

3.5.1 SuﬃxTrees .......................... 55

vi TABLE OF CONTENTS

3.5.2 ACloserLooktotheSuﬃxTreeDocumentModel .... 55

3.5.3 ExperimentalEvaluation................... 60

3.5.4 ConcludingRemarks ..................... 61

3.6 TopicIdentiﬁcation.......................... 62

3.6.1 FormalFramework ...................... 62

3.6.2 RelatedWork......................... 64

3.6.3 TheWCCAlgorithmforTopicIdentiﬁcation........ 66

3.6.4 ExperimentalEvaluation................... 66

3.6.5 ConcludingRemarks ..................... 70

3.7 GenreClassiﬁcation.......................... 71

3.7.1 WhatDoesGenreMean?................... 71

3.7.2 What Does Genre Mean in the WWW? ........... 72

3.7.3 ExistingWork......................... 72

3.7.4 UserStudyandGenreSelection............... 74

3.7.5 FeaturesforGenreClassiﬁcation .............. 77

3.7.6 ExperimentalEvaluation................... 83

3.7.7 ConcludingRemarks ..................... 85

4Tira: A Software Architecture for Personal IR 87

4.1 FromIRTheorytoIRSoftware ................... 88

4.2 SpeciﬁcationofIRProcesses..................... 89

4.2.1 PetriNets........................... 91

4.2.2 DataFlowGraphsandControl/DataFlowGraphs..... 93

4.2.3 UMLActivityDiagrams ................... 94

4.3 Operationalizing IR Processes with Tira .............. 96

4.3.1 FromPIMtoPSM ...................... 97

4.3.2 The Tira MiddlewarePlatform............... 98

4.3.3 Tira atWork......................... 99

4.4 ConcludingRemarks ......................... 100

5 Conclusion and Outlook 101

Notation

Symbol Meaning

||·|| norm

|·| cardinality or absolute value

·,· dot product

δKronecker symbol, cluster distance, probability density

Δ cluster diameter

ϕsimilarity function

λ, λicapacity of minimum cut

Λ weighted connectivity

θgraph density

τlabeling, labeling function

ρprobability density function

ρexpected density

Cclustering / categorization

C∗desired categorization / reference categorization

C,Cicategory or cluster, set of documents

d, didocument

d,divector or data structure for a document

[d]jj-th component of vector d

Dset of documents / document collection

D∗desired set of documents

eedge

E,Eiedge set

F,Fi,j F−Measure

fC(w) score for term wto be topic identiﬁer in cluster C

G=V,EGis a graph with node set Vand edge set E

vii

viii TABLE OF CONTENTS

G(Ci) subgraph induced by the node set Ci

H(Cj) entropy of Cj

I(C) cluster validity index value for clustering C

idf inverse document frequency

Nnatural numbers

prec precision

qquery

Qset of queries

Rreal numbers

rec recall

titerm

T,Tset of terms

tf (d, tj) term frequency of term tjwithin document d

tfD(tj) term frequency of term tjin collection D

P(A|B) conditional probability for event Agiven B

tf ·idf term frequency multiplied with inverse document frequency

V,Vinode set

w,wiweight or edge weight

Preface

Information retrieval (IR) is a discipline that deals with the task of satisfying

a person’s information need with the help of a computer. IR systems let a user

specify an information need, which is evaluated against large collections of digital

documents.

Techniques for satisfying information needs have meanwhile become ubiqui-

tous, be it in the form of search engines at home or at work, on mobile devices or

on workstations, or as retrieval components in ﬁle systems, document repositories,

databases, or knowledge management tools. The reason for this pervasiveness is

the growing information need, the diversity of retrieval tasks, and the desired

degree of personalization.

This thesis develops new concepts and new algorithms in various aspects

throughout the information retrieval process. A focus lies on the automatic

categorization of text and hypertext documents according to topic and genre.

Aside from proposing new data structures and algorithms for these tasks, a novel

IR process operationalization paradigm based on Model Driven Architecture is

suggested.

All concepts and algorithms have been operationalized, among others within

AIsearch, a search engine that has been awarded at the EASA-2004 (European

Academic Software Award) competition.

Chapter 1

Introduction

The growing importance of information retrieval systems is accounted for not

only by the Internet and digital libraries but also by the fact that vast numbers of

companies organize their workﬂows and correspondence with digital documents.

For bigger companies, the number of documents that have to be managed by

an IR system easily goes into the millions. The cost of not ﬁnding a document

within a company’s intranet may be measured by the amount of money that has

to be spent on recreating it. These costs can be signiﬁcant; consider for example

the cost to recreate a technical document that contains construction descriptions

for parts of a technical system.

The key challenge in the development of IR systems is the adaption of search

interfaces to human behaviour, limitations, and personal information needs. The

following points illustrate the adaption challenge; each point asks questions about

the design of an adequate retrieval process and also poses algorithmic problems.

•Behavior. Studies have shown that the majority of all keyword queries

against Web search engines are one or two term1queries [127]. Aside from

delivering a huge set of matching documents, short queries are often am-

biguous since terms may have more than one meaning (polysemy). As a

consequence, search results often comprise a large fraction of documents

that do not ﬁt the information needs of a user. On the other hand, relevant

documents can contain terms that are semantically related but lexicograph-

ically distinct to a query term, and they will not be judged relevant if terms

1A term in this context denotes an atomic lexicographical unit, e.g. a word, a number, or

an abbreviation.

2 Chapter 1. Introduction

are matched exactly (synonymy). Another problem with short queries is

that the query focus of a user is unclear; missing context information con-

tributes to long result lists containing many irrelevant hits.

•Limitations. The number of matching documents for typical queries against

very large document repositories goes easily into the thousands. Due to

human limitations in information processing, only some of the found doc-

uments or extracted document snippets2can be read. Human information

miners are hardly able to narrow down the search iteratively using diﬀerent

or more keywords since it is unclear in which context their query terms

are used, and which terms could be useful to focus the search, for example

to eliminate unwanted hits. Aside from being frustrating, a hand-crafted

query reformulation is time consuming.

•Information Need. Search interfaces that follow the keyword search paradigm

are broadly accepted by users; however, the challenge for the next decades

is the development of search solutions that reﬂect a user’s context. In other

words, solutions that are able to (a) organize search results better than in

the form of long lists, (b) ﬁlter beyond topic, i.e. to ﬁlter according to

document quality and type, (c) adapt to a user’s personal skills and ex-

perience concerning the underlying document collection, and (d) adapt to

the retrieval task a user is concerned with—in short: to adapt to a user’s

information need.

This thesis contributes right here; techniques are developed which address the

design and the operationalization of IR processes respecting the above points.

The research focus is automatic document categorization, a technique which has

been shown to improve retrieval performance.

2Document snippets are excerpts from documents that contain one or more query terms.

In most of the well-known search engines, document snippets serve as preview for a found

document.

(a)

(b)

(c)

(d)

(e)

(f)

(g)

Figure 1.1: Screenshot of AIsearch. The area at the top (left) shows controls for query

speciﬁcation. Below, a generated category graph is shown in which each node represents

a category. The edges indicate similarity relations: each category is connected with

an edge to its most similar category. At the bottom, checkboxes for genre ﬁltering are

shown. On the right, a list of document snippets that is ordered by category is shown.

The main text contains detailed explanations for the points (a) – (g).

AIsearch

The contributions of this thesis are manifest in our award-winning meta search

engine AIsearch, which categorizes search results according to topic and genre.

Figure 1.1 shows a screenshot of the AIsearch user interface: When query terms

are speciﬁed in ﬁeld (a), spelling corrections as well as alternative query terms are

proposed in drop-down box (b). A categorization of the search results is shown

in tab (c): Each node of the depicted graph represents a found category that

is labeled with a topic identiﬁer as well as with its size (d). Each category is

connected by an edge to its most similar category. The checkboxes (e) allow to

ﬁlter documents according to genre. On the right hand side, document snippets

are shown (g), which are sorted by category (f). A mouse click on a category

label selects the corresponding category: It centers the associated node within

the display, and updates the result list with document snippets from the selected

category.

4 Chapter 1. Introduction

Query

Presentation

Information

need

Query formulation

Retrieval task

operationalization

Result presentation

External

knowledge

Set of

documents

Retrieval

result

Clustering

Cluster Validity

Assessment

(Point 2)

Topic

Identification

(Point 3)

Genre

Classification

(Point 4)

Web search

engine Web

documents

TIRA (Point 5)

AIsearch

retrieval

Document

representation:

topic

(Point 1)

Document

representation:

genre

(Point 4)

Figure 1.2: The retrieval process underlying AIsearch. The numbers refer to the sum-

marized contributions described in the text below.

1.1 Thesis Contributions

Figure 1.2 shows the building blocks of AIsearch: An information need is speciﬁed

in the form of a query, which is processed by Web search engines. The delivered

Web documents are converted to data structures for topic and genre categoriza-

tion. The topic categorization step uncovers the underlying topic structure using

cluster analysis and cluster validity tools. Each of the found categories is labeled

with a topic identiﬁer by our new WCC algorithm. A genre categorization al-

lows to ﬁlter the resulting documents according to their form and function. The

following points summarize the main contributions of this thesis; the numbering

refers to the place within the AIsearch retrieval process in Figure 1.2.

(1) Document Representation. When regarding a document as a sequence

of terms, the similarity between two documents with respect to topic can be

quantiﬁed by the fraction of terms that both documents have in common.

Traditional document representations like the Boolean model or the vector

1.1. Thesis Contributions 5

space model are typical representatives that implement this view. However,

these models fail to quantify to what extent the term order matches between

documents. Since a term order match can reﬂect a concept match3,the

amount and the length of these matches constitute important similarity

information. We propose the suﬃx tree document model along with new

similarity measures that are able to quantify both aspects in parallel, term

matches as well as term order matches. The computational complexity

to implement the suﬃx tree document model is comparable to the Vector

Space Model.

Information Need. The suﬃx tree document representation addresses a

user’s information need directly when browsing categories: Our experiments

show that the suﬃx tree document representation substantially improves

document categorization performance compared to traditional vector space

representations. The performance gain is independent from the employed

clustering algorithm.

(2) Cluster Validity. Past research on document categorization according

to topic prevalently compares the maximum performance of clustering al-

gorithms on given document collections. Although this research question

is interesting, it is far away from realistic scenarios: The parameters of

clustering algorithms are unknown when clustering search results on an ad

hoc basis. In this connection we propose ρ, a new cluster validity index

for document clustering. The index ρallows us to identify among a set of

clusterings those which are generated using adequate parameters. An ex-

perimental evaluation shows that ρdelivers reliable results in comparison

to existing approaches in document categorization scenarios.

Information Need. Since ρallows to identify high-quality clusterings in an

unsupervised manner, it is especially suited for post-retrieval categorization,

allowing for user interfaces that display ad hoc categorizations of search

results.

3Take for example compound nouns like “artiﬁcial intelligence” or text blocks containing

domain speciﬁc speech such as typical phrases from ﬁnancial reports or expressions from legal

terminology.

6 Chapter 1. Introduction

(3) Topic Identiﬁcation. When a categorization according to topic is deter-

mined using an unsupervised approach, it has to be presented to a user.

In particular, the categories have to be labeled with characteristic terms

for browsing. The question about which terms should be chosen for label-

ing is very diﬃcult since a maximum of only ﬁve terms from the set of all

terms within a document cluster can be presented to a user4. The labeling

question is especially interesting since desired properties of category labels

are partly contradictory. We introduce and formalize desired properties for

category labels, and we propose the new WCC algorithm to compute them.

Information Need. In contrast to the few existing approaches, which have

until now been evaluated by human judgment, we develop an evaluation

methodology that delivers reproducible results and allows for an absolute

performance quantiﬁcation in terms of precision and recall. Our experi-

ments show that WCC identiﬁes up to ﬁve terms per cluster with a very

high precision when comparing them to human-assigned keywords. As an

aside, a document corpus for experimenting has been developed, which is

available upon request.

(4) Genre Categorization. Categorizations are grounded on meaningful cri-

teria. In recent IR research, the term “categorization” is associated with

an organization of documents according to topic. However, we will show

that the genre of a document is a very useful categorization criterion when

searching large document repositories. In particular, a ﬁrst user study in

the ﬁeld of genre categorization has been conducted, which shows the great

interest of human information miners to ﬁlter beyond topic.

In particular, we propose a genre categorization scheme for Web documents,

and we introduce a novel document representation that can be employed to

classify documents according to genre. A feasibility study shows that genre

categorization is possible, even in a large and heterogeneous collection like

the World Wide Web.

Information Need. To investigate the extent to which a genre categorization

according to human wishes is feasible, we compiled a test corpus of Web

4Given that a typical cluster comprises about 2500 diﬀerent term stems, there are 2500

k

possibilities to choose a subset of kterms.

1.2. Thesis Overview 7

pages, which originally included 1209 Web pages and was recently extended

to 1707 Web pages. The results for a multiclass classiﬁcation with eight

genre classes are very promising: the experiments show that about 80%

of the corpus pages can be classiﬁed correctly. Meanwhile, the corpus has

been used by other institutes for follow-up research.

(5) Software Engineering. The remaining contribution concerns software

engineering. In particular, a Model Driven Architecture (MDA) approach

for composing and executing IR processes is developed. In contrast to

the commonly used library-based IR process design, our approach clearly

separates the speciﬁcation of an IR process from its operationalization.

Our architecture called Tira allows to specify an IR process in the form of

a platform-independent UML activity diagram, from which an executable

platform-dependent model can be automatically generated.

Information Need. The proposed technology allows us to rapidly develop

new prototypes, to adapt IR processes to personal preferences, and to test

new ideas at the push of a button. In particular, Tira’s possibility to tailor

IR processes to ﬁt individual skills, personal wishes, and retrieval scenarios

contributes to satisfy a user’s information need.

1.2 Thesis Overview

This chapter motivates the presented research and summarizes our contributions.

Chapter 2 reviews the information retrieval process; in particular, data structures

and algorithms for constructing document representations and for computing

document similarity are presented.

Chapter 3 contains the main contributions related to categorization. After

document categorization is motivated and existing research respecting the cluster

hypothesis is summarized and compared, clustering algorithms for automatic ad-

hoc category formation are reviewed. The following section surveys existing work

in cluster validity and introduces our novel cluster validity index, ρ. The next

section introduces the suﬃx tree document model and presents theoretical as well

as experimental comparisons to other approaches. The topic identiﬁcation section

introduces a formal framework of desired properties for cluster labels. After

8 Chapter 1. Introduction

presenting the WCC algorithm for topic identiﬁcation, a novel evaluation strategy

is introduced and evaluation results are given. Finally, genre classiﬁcation of Web

pages is introduced as a novel instrument to satisfy a user’s information need.

Our ﬂexible software architecture for the composition and execution of IR

processes, Tira, is introduced in Chapter 4. The thesis concludes with Chapter 5.

Chapter 2

Technical Background:

Document Representation

and Retrieval

Within the last years, the number of digital documents in the Internet, corpo-

rate intranets, and digital libraries exploded. IR systems evolved, which aim to

provide structured access to large-scale document collections. From a user’s per-

spective, these IR systems follow various search paradigms that broadly make up

four categories [29]:

(1) Unassisted keyword search. Search terms are entered and the search engine

returns a ranked list of document snippets. Representative: Google

(www.google.com).

(2) Assisted keyword search. The search engine produces suggestions based on

the user’s initial query. Representative: AskJeeves (www.askjeeves.com).

(3) Directory-based search. Here, the information space is divided into a hier-

archy of human-maintained categories, where the user navigates from broad

to speciﬁc classes. Representative: Yahoo (www.yahoo.com).

(4) Query-by-example. The user selects an interesting document snippet, which

is then used as the basis for a new query. Representative: AltaVista

(www.altavista.com).

10 Chapter 2. Document Representation and Retrieval

Search interfaces that implement these paradigms form the “visible” part of IR

systems. In the following two chapters we will shed light on what happens in IR

systems behind the scenes. The focus of this chapter is document representation,

while the next chapter discusses document retrieval.

Table 2.1: Typical retrieval tasks.

Retrieval task Input Task description

(1) keyword search keywords Find documents that match the

given keywords best.

(2) local search keywords, Find documents that match the

geographic given keywords best and that relate

location to the given location.

(3) similarity search document Find documents that are similar

to the input document.

(4) question answering question Find documents that answer

in natural the given question best. Preferably,

language generate a tailored answer.

(5) plagiarism document Find passages within the document

detection that are copied from other documents.

(6) quality search various Find documents that comply with

given quality criteria.

2.1 The Information Retrieval Process

The purpose of IR systems is to satisfy a user’s information need. Typically, for a

universe Dof documents this task involves the identiﬁcation of a subset D∗⊂D

of documents that are considered useful for satisfying a user’s information need.

Since information needs are manifold, IR systems are specialized correspondingly,

i.e. they are designed to process speciﬁc retrieval tasks that arise from speciﬁc

information needs (cf. Table 2.1, which resumes some examples).

Retrieval tasks are operationalized within a retrieval process that includes

query formulation and result presentation (cf. Figure 2.1). The operationaliza-

tion of retrieval tasks happens in the form of a retrieval function, which is deﬁned

as follows.

2.2. Document Representation 11

Document

representation

(Chapter 2.2)

Query

representation

(Chapter 2.2)

Retrieval

function

(Chapter 3)

Query

Presentation

Information

need

Query formulation

Retrieval task

operationalization

Result presentation

External

knowledge

(Chapter 2.3)

Set of

documents

Retrieval

result

Figure 2.1: The information retrieval process, adapted from [129]. A user formulates

his/her information need in the form of a query (top), which is transformed into an

internal representation, typically a tailored data structure. Likewise, documents from a

collection are represented as data structures. A retrieval function operates on the inter-

nal representations to identify matching documents, probably using external knowledge

(middle). Finally, the retrieval results are presented to the user (bottom).

Let D={d1,...,d

n}be a document collection, let qbe a query from the set of

queries Q,andletKbe external knowledge1. A retrieval function ρD,K:Q→R

is a mapping that selects a subset of Dbeing relevant to a query q∈Qgiven

external knowledge K. The result in Rmay take several forms, e. g. , Rmay be a

ranked list, a categorization of the selected subset of D, or another representation

that can be generated from Dusing K. The operationalization of a retrieval

function requires a representation of the documents diand the query q,which

are discussed next.

2.2 Document Representation

A document representation disadatastructuretomodeladocumentwithin

computer memory.2In literature, dis almost always a vector whose components

1Kmay contain thesauri, past queries, click stream statistics, etc. A discussion of the

exploitable knowledge will follow at the end of this chapter.

2Bold latin lowercase letters, in general, denote vectors. We use the symbol dfor a data

structure here; the reason is that this data structure is almost always a vector in the IR

literature.

12 Chapter 2. Document Representation and Retrieval

quantify document properties like term frequencies. However, dmay be another

abstraction; in particular, we introduce the suﬃx tree document model in the

next chapter.

Document representations are usually tailored to both, the retrieval task and

the documents’ information content. The following list grades document types

according to information content.

(1) Plain documents. Documents that contain no more information than their

text.

(2) Structured documents. Documents in which structure information like

chapter boundaries or headlines is given. Example: Microsoft Word docu-

ments.

(3) Hypertext documents. Hypertext documents are structured documents

that may contain references (links) to other documents. Example: HTML

documents.

(4) Annotated documents. Documents that come along with meta-information

like category or author. Examples are Semantic Web documents, which are

enriched with ontological information [21, 20, 19].

The following rule of thumb applies: The more information a document con-

tains, the merrier are the retrieval results, and the lower is the retrieval eﬀort

[129].

2.2.1 Term Vector Models

Let D={d1,...,d

n}be a document collection, and let T={t1,...,t

m}be the

set of terms3each of which occurring in at least one d∈D.Moreover,letR

denote the set of real numbers, and let the jth component of an n-dimensional

vector d∈Rnbe denoted as [d]j.

3Tis sometimes called the dictionary of D. The index numbers of the terms in Tare

arbitrary; i.e., no speciﬁc order of the tiis assumed.

2.2. Document Representation 13

The Boolean Model. In the Boolean document model, the representation di

of a document di∈Dis a vector whose j-th component indicates if tjoccurs in di,

say, diis a boolean-valued vector with dimension mand [di]j=1iﬀtj∈di.This

kind of representation allows, among others, to quantify the similarity between

two documents using a geometric measure, e.g. in the form of angles or distances

between vectors in Rm. An equivalent of the Boolean model is the set model,

where a document is represented as a set in which the elements are the document’s

terms.

The Vector Space Model. A generalization of the Boolean model is the

Vector Space Model (VSM) [121], in which the components of the vector repre-

sentation are real values, the so-called term weights. Let tf (di,t

j)denotethe

function that speciﬁes how often the term tjoccurs in document di. Setting

[di]j=tf (di,t

j) includes term frequency information into the vector representa-

tion instead of boolean values only; the rationale is that terms that occur more

often are more important [126].

However, with respect to a document collection one might argue that terms

that are rather rare within the collection are important since they discriminate

between documents, especially when the documents in the collection are on sim-

ilar topics. Consider for example a document collection on feeding pets: two

documents on cats are likely to be considered more similar than two documents

on dogs and cats. Likewise, the term “feeding” loses its value within the collec-

tion since it will appear in most of the documents. In this connection tf is used

in combination with the inverse document frequency idf .Letdf (tj)denotethe

document frequency of a speciﬁc term tj, say the number of documents in which

tjoccurs. Then the inverse document frequency idf :T→R+,0with

idf (tj)=log|D|

df (tj)

is a function that measures how a term tjis distributed over the documents in

D, in other words, it measures the discriminative power of tj[126]. If only idf

was used to weight terms in di, then rare terms would dominate a geometric

similarity computation: a term that occurs only once in the document collection

has a maximum idf value. For this reason, [di]jis chosen to be the product

14 Chapter 2. Document Representation and Retrieval

tf (di,t

j)·idf (tj). This product ensures that both very rare and very frequent

terms do not inﬂuence a similarity computation too much4.

The deﬁnition of idf can also be justiﬁed in the context of information theory.

Let Etbe the probabilistic event that a term t∈Toccurs in a randomly drawn

document d∈D. Then the probability of Etis given as P(Et)=df (t)

|D|.The

amount of surprisal (also known as self-information [145, 154]) contained in this

event depends only on the probability of the event. More speciﬁcally: the smaller

this probability is, the larger is the surprisal associated with receiving information

that the event actually occurred. A measure Ithat reﬂects surprisal is

I(Et)=log1

P(Et)=log|D|

df (t)=idf (t).

A property of Iis that the surprisal of a composition of two mutually independent

events equals the surprisal sum of the events, i.e. I(Et1∩Et2)=I(Et1)+I(Et2).

If the events Etare regarded as atomic then they are independent by deﬁnition.

However, two terms do not necessarily appear independent in a document since

natural language comprises regular structures like ﬁgures of speech, compound

nouns, and proper names that are made up of more than one term.

Consequently, when using Ito measure surprisal, the underlying dictionary

should reﬂect the independence condition: Instead of building the document

representations based on single terms5from T, their construction based on a

tailored index term set Tin which one element can comprise more than one term

is advised. For example, in a collection on feeding pets it is a good idea to regard

the compound noun “cat food” as one element of Tinstead of two single terms.

Section 2.2.3 discusses methods to construct T.

VSM with Probabilistic Term Weights. A probabilistic method to deﬁne

term weights is based on the assumption that so-called “specialty words”, i.e.

highly informative terms, are concentrated in a few documents within a collection

[54, 55, 6]. Given this assumption, the value of a term can be quantiﬁed using

the “divergence from randomness” theory, which generalizes Ponte and Croft’s

language modeling approach [104] as follows.

4More recent term weighting approaches factor document lengths into term weights. An

approach that has shown superior performance for queries in TREC collecions is the BM25

term weighting scheme [113, 114].

5Single terms are also referred to as unigrams.

2.2. Document Representation 15

Let D={d1,...,d

n}be a document collection, let tdenote a term and let

tf D(t) denote the number of occurrences of tin D. For each of the tf D(t)copies

of t, a random process can be imagined that ﬁrst selects a d∈Dat random and

then assigns tto d. Since “nonspecialty words” are assumed to be distributed

uniformly at random over the ndocuments, the probability for koccurrences

from the tf D(t)copiesoftin a single document can be modeled using a Bernoulli

experiment:

P(Xt=k)=B(n, tf D(t),k)=tfD(t)

kpkqtf D(t)−k(2.1)

where p=1/n and q=(n−1)/n. Here, the random variable Xtdescribes

the term frequency of tin a document from D. Again, this model assumes term

independence.

The application of this model is as follows. Using the formula above6,we

can compute for a given document di∈Dthe probabilities P(Xt=k)foreach

term, using k=tf (di,t). The lower this probability, the less the term tis

distributed in accordance with the model, and the more informative it is. Con-

sequently, a function that increases monotonously as Pdecreases quantiﬁes the

informative value of a term. Since term probabilities also depend on a docu-

ment’s length, Amati and van Rijsbergen propose to “normalize” kaccording to

document length ﬁrst. In particular they propose to choose

k(di,t):=tf (di,t)·log21+avg(D)

l(di)

Here, l(di) denotes the length of document dimeasured by the number of terms

that dicontains, and avg(D)=1

ndi∈Dl(di) denoting the average document

length within D.

A weight function that increases monotonously as Pdecreases and, conse-

quently, grows with a term’s importance in diis again the surprisal, resulting

w1(di,t):=−log2(P(Xt=¯

k(di,t)))

Amati and van Rijsbergen propose to choose as term weight a product w(di,t)=

6In practice, the mentioned probabilities are approximated using a Poisson distribution.

16 Chapter 2. Document Representation and Retrieval

w1(di,t)·w2(di,t), where w2is a smoothing function that quantiﬁes the informa-

tion gain of a term within the so-called “elite set” St⊆D,say,thesetofdoc-

uments in which toccurs. The authors propose two diﬀerent weighting schemes

for w2(t), one based on Laplace’s law of succession (w2(di,t):= 1

k(di,t)+1 ), and

another as the ratio of two Bernoulli processes (w2(di,t):= tf St(t)+1

|St|·(¯

k(di,t)+1) ).

It should be noted that the model of randomness, which is implemented in

Formula (2.1) is interchangeable: Instead of using a Binomial distribution, models

based on Bose-Einstein-statistics, the geometric distribution, and tf ·idf can be

employed [6, 5].

2.2.2 Document Similarity

Document representations are usually designed such that a function ϕcan be

stated that maps from the representations d1and d2of two documents d1,d2

onto the interval [0; 1] and that has the following property: If ϕ(d1,d2)isclose

to one then the documents d1and d2are similar; likewise, a value close to zero

indicates a high dissimilarity. Formally, a function ϕis called similarity function

if it fulﬁlls at least point (1) of the following properties [152]. However, if not

stated otherwise, we assume a similarity function to satisfy points (1)-(4).

(1) ∀di∈D:ϕ(di,di)>0

(2) ∀di,d

j∈D:ϕ(di,dj)≥0 (non-negativity)

(3) ∀di,d

j∈D:ϕ(di,dj)=ϕ(dj,di) (symmetry)

(4) ∀di,d

j∈D:ϕ(di,dj)∈[0,1] (normalization)

(5) ∀di,d

j,d

k∈D:ϕ(di,dj)+ϕ(dj,dk)≥ϕ(di,dk) (triangle inequality)

The applications of a similarity function ϕare manifold. A query qin the

form of a keyword list can be regarded as a very small document; consequently, ϕ

can serve to deﬁne a retrieval function for similarity search. Moreover, ϕcan be

used to generate similarity graphs, which in turn are a basis for clustering. The

function ϕcan also be turned into a dissimilarity function by computing 1 −ϕif

ϕis normalized and ϕ(di,di) = 1 for all di. In particular, dissimilarity matrices

enable multidimensional scaling to visualize document collections [131, 18, 77].

2.2. Document Representation 17

Since ϕoperates on the abstractions of the documents diit must be tailored

to the representations di. The remainder of this subsection gives an overview of

similarity measures for set-based, geometric, and probabilistic document repre-

sentations.

Set-based Similarity Measures. Given that documents di,d

j∈Dare repre-

sented as sets Di,D

jof their terms, similarity can be quantiﬁed by set intersection

ratios. In particular, the Jaccard coeﬃcient ϕjacc, the cosine coeﬃcient ϕcoscoef ,

thedicecoeﬃcientϕdice and the overlap coeﬃcient ϕover have been employed in

the past to measure set similarity, and they are deﬁned as follows [111].

ϕjacc(Di,D

j)=|Di∩Dj|

|Di∪Dj|

ϕcoscoef (Di,D

j)= |Di∩Dj|

|Di|1/2·|Dj|1/2

ϕdice (Di,D

j)= |Di∩Dj|

|Di|+|Dj|

ϕover (Di,D

j)= |Di∩Dj|

max(|Di|,|Dj|)

It should be noted that generalizations of these coeﬃcients exist, which are de-

signed to incorporate term weights. Put another way, these generalizations apply

for geometric similarity measurement.

Geometric Similarity Measures. The most popular similarity measure for

term vector representations is the cosine similarity function. Let di,djdenote

the vector representations of the documents diand djrespectively. The cosine

similarity function ϕcos :Rm×Rm→[0,1], which measures the cosine of the

angle between diand djis deﬁned as

ϕcos (di,dj)= di,dj

di·dj

where ·,· :Rn×Rn→Rdenotes the dot product and ·:Rn→R

denotes the Euclidean norm. In comparison to a linear relationship between angle

and similarity, the cosine similarity values are ampliﬁed, especially for angles in

18 Chapter 2. Document Representation and Retrieval

[0,π/4] (cf. Figure 2.2). This eﬀect is considered useful when sparse vectors are

given and the similarity values tend to be low. Another property of the cosine

similarity measure is its scale invariance, i.e. ϕ(α·di,β·dj)=ϕ(di,dj) for all

scalars α, β. The interpretation of scale invariance in the document domain is

that documents are regarded as equal when they use the same terms in the same

proportion; document length is not important.

Another natural similarity measure derives from the Euclidean distance be-

tween diand dj.Withdmax =max

i,j di−dja similarity measure can be deﬁned

as 1−di−dj

dmax or, without the need to compute dmax ,as 1

1+di−dj. However, these

measures are not scale invariant.

A scale invariant Euclidean distance measure operates in the unit sphere, when

the term vectors are scaled to length 1. Let d

idenote these scaled vectors, say,

d

i=di

di. Since the maximum distance between two points in the ﬁrst quadrant

of the unit sphere in Rnis √2, a similarity measure based on the Euclidean

distance is

ϕeucl (di,dj)=1−1

√2·d

i−d

j

Note that ϕeucl directly relates to ϕcos since distances between points on the unit

sphere are determined by the angle between the points. In particular,

d

i−d

j=sin2(α)+(1−cos(α))2

=sin2(α)+1−2cos(α)+cos

2(α)

=2−2cos(α)

where αdenotes the angle between d

iand d

j.Consequently,ϕeucl (di,dj)=

1−1−cos(α).

Strehl and Ghosh propose the similarity measure ϕexp =e−di−dj2,whichis

also based on Euclidean distance and has useful properties when clustering with

k-Means [141]. Another measure that has not been used yet is 1 −sin(α), which

is scale invariant and has the salient property that it reduces noise since it scales

down low similarity values.

Figure 2.2 illustrates similarity function values for normalized vectors in the

ﬁrst quadrant of Rndepending on the angle between them. Like the curves show,

2.2. Document Representation 19

0.25

0.5

0.75

90450

Similarity

Angle in degrees

linear

cos

Euclidean

1-sin

Strehl

Figure 2.2: Plots of geometric similarity function values (y-axis) for normalized vectors

in the ﬁrst quadrant of Rndepending on the angle between them (x-axis).

each of them has diﬀerent characteristics concerning the dilution and ampliﬁca-

tion of similarity values.

Probabilistic Similarity Measures. Let P(R|di) denote the probability

that a user ﬁnds di∈Drelevant to a query q. Probabilistic retrieval models

employ estimations of P(R|di)tosortthedocumentsinDwith respect to

relevance to q[46]7. The estimations are based on Bayes’ theorem, namely

P(R|di)=P(di|R)·P(R)

P(di)

Here, P(di|R) denotes the probability for selecting difrom the set of relevant

documents; likewise, P(di) is the probability to randomly draw difrom the set

of all documents. Since the latter probability is constant for a collection, the

7In Fuhr and Buckley’s original work [47], odds in the form O(R, di)=P(R|di)

P(R|di)were used for

ranking instead of probabilities for similarity computation. If O(R, di)>1 then the probability

for dibeing relevant is greater than dinot being relevant. This rule, also known as Bayesian

decision rule, enables an IR system to decide how many results should be retrieved. Moreover,

the Bayesian decision rule minimizes the expected error.

20 Chapter 2. Document Representation and Retrieval

Index

construction

principle

Index term selection

Index term modification

Index term enrichment

Index transformation

Stemming

Information gain

Co-occurrence analysis

Singular value decomposition

Inclusion methods

Exclusion methods

Stopword removal

Figure 2.3: A taxonomy of index construction principles for vector space models.

former probability P(di|R) remains to be estimated. These estimations are

based on the Boolean vector representation and the assumption that terms occur

independently in documents. The estimated probabilities can be adapted to a

user’s information need, which must be given in the form of relevance feedback

for retrieved documents. Details can be found in [47, 46].

2.2.3 Index Term Set Construction Methods

Up to now Twas deﬁned to be the set of all terms that occur in at least one d∈D.

However, the composition of document models from a tailored index term set T

can signiﬁcantly enhance the performance of tasks like similarity computation

or classiﬁcation. From a linguistic point of view two documents are topically

equal when they share the same concepts rather than the same terms. From a

probabilistic point of view, the index terms in Tshould appear independently

(cf. Section 2.2.1). From an algebraic point of view this means that the index

terms in Tshould be used to create an orthogonal basis from which the document

vectors are composed as linear combinations.

Generally, the index term set construction problem is a specialization of the

well-known feature selection problem in machine learning [87, 66, 76, 15]. Since

the evaluation of each possible subset of Twith respect to the outlined goals is

computationally infeasible, the construction of Tfollows heuristics that exploit

speciﬁc text domain knowledge. In the following, methods to construct Tby

selection, modiﬁcation, enrichment, and transformation of terms are resumed;

each of them strives to reﬂect the mentioned goals. Figure 2.3 depicts a taxonomy

of index term set construction methods.

Index Term Selection. The methods that operationalize index term selection

subdivide into exclusion and inclusion methods. The goal of exclusion methods

2.2. Document Representation 21

is to remove terms from Tthat simply add noise when computing a similarity

value. Examples for these terms include common words like articles, prepositions

or conjunctions (e.g. “the”, “and”, “or”, “of”, “from” etc.) that certainly do not

contain topical information. The tf ·idf term weighting scheme would adjusts

these terms’ weights to a value close to zero automatically; however, these so-

called stopwords can already be identiﬁed and removed accurately at parsing time

according to stopword lists. This procedure speeds up similarity computation and

saves memory in the document representations.

In contrast to exclusion methods, which construct Tin a bottom-up manner

by removing terms from T, inclusion methods construct Tfrom scratch, say, in

a top-down manner. As outlined above, good index terms reﬂect concepts and

may be composed out of several terms. If the occurrence of terms correlates, e.g.

within a ﬁxed-size sliding window, the terms may be considered as concept and

added to T[89]. Another way to identify concepts in the form of related terms is

to measure their mutual information [147] or to use association rule mining [3, 4].

However, these methods’ time complexity is in general at least O(|T|2). The

suﬃx tree document model [96] contributes right here: it allows for non-trivial

index term set construction in linear time.

Index Term Modiﬁcation. Term modiﬁcation is necessary to map morpho-

logically diﬀerent words that embody the same concept to the same index term.

E.g., the terms connection,connecting,connectivity,connector are variants

of connect, and they should map to the same index term. Stemming algorithms

apply here; their goal is to ﬁnd canonical forms for inﬂected or derived words,

e.g. declined nouns or conjugated verbs. Since the adaption of words to gender,

number, time, and case are language-speciﬁc, it might be reasonable to compile

a set of rules that map an inﬂected word back to its stem. This is the way

how Porter’s stemmer for the English language works [106]. However, rule-based

stemming algorithms require the development of specialized rule sets for every

language.

Statistical stemming algorithms generate stemming dictionaries based on large

document collections. For this purpose, all terms from a document collection

are inserted character-wise into a tree whose edges are labeled with the words’

characters. When a term is inserted, the path emanating from the root is followed

22 Chapter 2. Document Representation and Retrieval

while its current edge label matches the word’s current character. In case of a

mismatch, a new edge is inserted and labeled with the actual character.

If the number of a node’s successors at some minimum depth is large enough

then the character concatenation on the path from the root to this node is a

candidate stem. Several techniques for identifying candidate stems have been

proposed in the past; variants are based on thresholds, relative successor vari-

ety, and entropy [51, 43, 44]. Besides their language independence, statistical

stemmers have the ability to adapt to document-speciﬁc vocabulary or language

changes. Note that these stemmers are not only able to identify stems that orig-

inate from suﬃx removal, but that they can also be used to remove preﬁxes. For

example, if the suﬃx connect from the term reconnected is inserted into the

suﬃx tree, a match along the path that is associated with connect will deliver

a high conﬁdence hint for connect being a correct stem. A recent study on

statistical stemming can be found in [140].

Index Term Enrichment. We classify a method as term enriching if it in-

troduces terms not found in T. By nature, meaningful index term enrichment

must be semantically motivated and exploit linguistic knowledge. A standard

approach is the possibly transitive extension of Tby synonyms, hyponyms, and

co-occurring terms [59, 137]. The extension shall alleviate the problem of diﬀer-

ent writing styles, or of vocabulary variations observed in very small document

snippets as they are returned from search engines.

Note that these methods are not employed to address the problem of polysemy,

since the required in-depth analysis of the term context is computationally too

expensive for many similarity search applications.

Index Transformation. Let D={d1,...,d

n}be a document collection over

the term set T={t1,...,t

m}.Anm×nmatrix Awhose columns consist of the

document models diis called term-document matrix for D. We understand index

transformation as a function fthat projects Ato a k×nmatrix Awith the goal

to enhance the document models with respect to a retrieval task. In contrast

to the above mentioned index construction methods, which operate on single

columns of the term-document matrix, index transformation methods operate on

the entire matrix.

2.3. Techniques for Information Need Satisfaction 23

Recall / = 0.26 Precision /( ∪ ) = 0.94 F-Measure = 0.40

Selected:

Relevant:

Documents:

Figure 2.4: Illustration of precision and recall. The ﬁgure shows documents of two true

classes: The solid points symbolize documents that are relevant to a query, and the

outlined points symbolize irrelevant documents. The boxed points stand for documents

that a retrieval system has selected for presentation. The set of boxed points shows a

selection of high precision, which consists to 94% of relevant documents. Its recall is

only about 0.26 since only 26% of all relevant documents are selected.

A popular representative index transformation method is latent semantic in-

dexing (LSI) [27, 11, 101], which uses a singular value decomposition of the

term-document matrix in order to improve query rankings and similarity com-

putations. For this purpose, the document vectors as well as the queries are

projected into a low-dimensional space that is spanned by the eigenvectors of

ATAthat correspond to the klargest eigenvalues σ1,...,σ

2.3 Techniques for Information Need Satisfac-

tion

An IR system is considered successful when a user can formulate his or her

information need easily and when the retrieval results point a user directly to

relevant documents. In our context, where a retrieval system selects a subset D⊂

Das being relevant, the performance is often measured in terms of precision and

recall with respect to the “true” set of relevant documents, D∗[111]: Precision

denotes the fraction of relevant documents within a retrieval result D, while

recall denotes the fraction of retrieved relevant documents with respect to all

relevant documents within a collection (cf. Figure 2.4). I. e.,

24 Chapter 2. Document Representation and Retrieval

categorization,

latent variable analysis

intrinsic information

external

information

exploitable

information user information

external knowledge

user-specific

information

information from

other users

personalization,

relevance feedback

recommendation,

relevance feedback

query reformulation,

query enrichment,

ontological reasoning

Figure 2.5: Taxonomy of information resources available to a retrieval system (left)

with example technologies that operationalize information exploitation (right).

prec =|D∩D∗|

|D|and rec =|D∩D∗|

|D∗|

An IR system’s main task is to maximize both, precision and recall of re-

trieval results8. Several information sources can be exploited to allow for good

performance: (a) Intrinsic information (the document collection itself) (b) exter-

nal information from one or more users (recorded queries or click streams etc.),

or external knowledge like ontologies or thesauri. Figure 2.5 shows an informa-

tion taxonomy along with example technologies that operationalize information

exploitation. In the following, some technologies are sketched to exemplify how

information is exploited.

•Personalization. A personalized IR system builds a user proﬁle based on

an application-speciﬁc user model throughout one or more search sessions.

These proﬁles are employed to rank and ﬁlter the results for upcoming

queries for the same user and associated context.

•Recommendation. Recommender systems collect information like relevance

judgments from a multitude of users in a speciﬁc context and use this

information to rank, ﬁlter, or augment query results [2].

•Relevance Feedback. Systems like Rocchio [120] let a user judge which of

the previously retrieved documents are relevant or irrelevant for her or his

information need. These judgments may be used to automatically gener-

8The simultaneous maximization of precision and recall is an ambitious goal: Recall can be

enlarged when delivering more documents. However, at the same time precision tends to get

smaller unless a perfect retrieval function is given.

2.3. Techniques for Information Need Satisfaction 25

ate enhanced queries or to better rank and ﬁlter retrieval results in future

queries.

•Categorization. Clustering algorithms aim to uncover the intrinsic category

scheme of a document collection, enabling a user to focus the search on

selected categories.

These technologies are useful to a greater or lesser extent depending on the

individual use case, which may be characterized by aspects like collection size,

user skill, and background knowledge. In this connection a user’s skill denotes

his or her ability to formulate or reformulate a query q; a user’s background

knowledge relates to the collection Dand the domain with which his or her query

is associated9. Table 2.2 summarizes the impact of the outlined approaches with

respect to the mentioned aspects.

huge small high low good bad

collection collection formul. formul. knowledge knowledge

skill skill of Dof D

Personalization ++ o + ++ o ++

Recommendation + + + + o ++

Relevance Feedback + ++ - ++ o +

Categorization ++ + - ++ o ++

Table 2.2: Technology impact vs. collection and user characteristics. The performance

impact estimations are tendencies oriented at current publications in the ﬁeld.

9These aspects are subsumed by the concept bounded rationality, which “is used to designate

rational choice that takes into account the cognitive limitations of both, knowledge and cognitive

capacity” [156].

Chapter 3

Information Need and

Categorizing Search

Search engine users know the surprising experience to learn from result lists

about the diversity of topics that are associated with particular query terms.

A reason for this surprisal is that a user’s mental model on topics and related

query terms does not match the real-world model: either, the query terms do

not reﬂect a user’s information need since they are too general, too narrow, or

simply inadequate, or some of the contexts in which the query terms are used

are unknown in advance. In each case, a summarizing overview of identiﬁed

document categories would be helpful to navigate within search results and to

ﬁgure out how the original query can be reﬁned.

Uncovering category structures of a document collection is the main contri-

bution of this chapter. First, two categorization paradigms are discussed, and a

rationale for document clustering is given [133]. We then review algorithms for

automatic category formation and for the validation of clusterings. In particular,

we introduce the expected density ρas a robust tool to assess document cluster

quality [138] and argue why our experiments are the ﬁrst that reﬂect a realistic

scenario in unsupervised document categorization [90]. Moreover, we contribute

the suﬃx tree document model, which will prove to better reﬂect document simi-

larity than traditional approaches, and consequently to be an excellent foundation

for cluster analysis [96].

Once a meaningful clustering is found, it has to be presented to a user. In

this connection, each cluster has to be labeled with characteristic terms that

28 Chapter 3. Information Need and Categorizing Search

ese

tion

eed

lat

Retri

al task

perat

onal

zat

esult presentat

nal

Set o

docu

tri

resu

Clustering

(Chapter 3.2, 3.3)

Cluster Validity

Assessment

(Chapter 3.4)

Topic

Identification

(Chapter 3.6)

Genre

Classification

(Chapter 3.7)

Web

sea

ngine

IRA

(

Cha

ter 4

)

AIsearch

retrieval

Document

representation:

topic

(Chapter 2.2, 3.5)

Document

representation:

genre

(Chapter 3.7)

Figure 3.1: Chapter organization.

characterize the cluster’s contents. Finding good cluster labels is a challenging

task since the labels should have properties like being summarizing and discrimi-

nating. These and other cluster label properties are discussed, and an algorithm

for cluster labeling called WCC is given [138]. The experiments, which are the

ﬁrst ones that are reproducible in this area, show that WCC outperforms other

approaches.

Finally, we propose a genre category scheme for Web documents. In contrast

to topic categories, which reﬂect a document’s thematic context, genre categories

relate to a document’s form and presentation; typical Web genres include online

shop, private home page, link list, discussion forum, and article. Web pages of

a speciﬁc genre can up to now not be speciﬁed to be included or excluded from

search results—however, this aspect constitutes an important part of a user’s

information need. In particular, we discuss diﬀerent kinds of genres that can be

found in the Web, we present a user study on genre usefulness, and we propose

a document model for genre analysis along with a categorization algorithm [92].

3.1. Supervised vs. Unsupervised Categorization 29

The aforementioned techniques were put to work in the form of our meta

search engine AIsearch: We used indices of popular search engines as a starting

point; a subsequent processing shall improve the retrieval quality employing the

aforementioned tools [91, 93]. Figure 3.1 depicts this chapter’s structure, which

is oriented at the AIsearch retrieval task organization described at the outset.

3.1 Supervised vs. Unsupervised Categoriza-

tion

A categorization C={C1,...,C

k}is a partition of a document collection D

into subsets C1,...,C

k⊆Dsuch that ∪k

i=1Ck=D. A categorization is called

exclusive if a document falls into exactly one of the categories, that is, Ci∩Cj=∅

for i=j.

Automatic categorization aims to discover the intrinsic category structure

C∗of D, which is optimal with respect to criteria like topics1from the human

perspective. The employed algorithms are either supervised or unsupervised2.

The former try to learn a function f:D→C

∗from a so-called training set of

examples (di,C

j)∈D×C∗. The latter work without having seen correct category

assignments in advance; their goal is to construct a categorization based only on

Dand a similarity measure ϕ.

The nature of supervised and unsupervised categorization algorithms deter-

mines their application area. For supervised categorization algorithms a category

scheme as well as a considerable number of correctly assigned documents must

be given in order to learn the function f. This fact limits the application area

to collections that possess a rather static category structure, since a modiﬁca-

tion of C∗involves laborious human intervention: Training examples have to be

reassigned to the new category scheme and the classiﬁer fhas to be relearned.

Examples of collections with static topic structure include digital libraries of sci-

entiﬁc articles, in which standardized category schemes like the ACM computer

science classiﬁcation scheme are used [8].

1In literature, a categorization C∗of Dis often based on topic considerations. However,

we generalize this view: the underlying category system may be organized by any meaningful

criteria.

2For an introduction to machine learning, see e.g. [97, 35].

30 Chapter 3. Information Need and Categorizing Search

The ability of unsupervised algorithms to uncover category structures with-

out prior knowledge is especially useful for post-retrieval categorization tasks in

large, heterogeneous collections like intranets or the WWW: Due to their size

and the user’s diﬀerent perspectives, these collections often lack commonly ac-

cepted category schemes. Moreover, unsupervised categorization in post-retrieval

scenarios can adapt to a query’s “zoom level”: Depending on whether a query is

general or speciﬁc, the range of relevant documents should be broad or focused,

and accordingly, a categorization’s granularity should adapt.

3.2 The Cluster Hypothesis

Browsing categories instead of result lists shall improve the retrieval performance,

e.g. measured by the ratio of relevant documents within a query response. Van

Rijsbergen formulated the underlying Cluster Hypothesis as follows: “Closely

associated documents tend to be relevant to the same requests” [111]. However,

the Cluster Hypothesis, if true, does not imply a concrete procedure for retrieval.

Former studies experimented with retrieval strategies that varied in the following

points.

(1) Document base. What is to be clustered? Alternatives are (a) the entire

collection [153, 103] or (b) top-ranked documents that were retrieved for a

query (“post-retrieval clustering”) [24, 56, 132, 151].

(2) Cluster selection. What is shown to a user? Possibilities include (a) the

documents that belong to the cluster that is closest to the query, e.g. mea-

sured by the query’s distance to cluster centroids [153] (b) the documents

that belong to the cluster whose centroid is closest to the top-ranked doc-

ument with respect to the query [22] (c) the documents that belong to

the top-kquery-closest clusters [153, 155] (d) documents from all clusters

grouped by cluster [132, 151].

(3) Cluster presentation. How are clusters presented to a user? Alternatives

are (a) a single ranked list of documents that were retrieved from one or

more clusters [153], (b) a ranked list grouped by cluster [56, 103] (c) a

cluster overview for navigation combined with a list view [132, 151].

3.3. Clustering Documents 31

The mentioned points are pairwise orthogonal and may inﬂuence the retrieval

performance to a great extent like the mentioned references show. E.g., cluster-

ing an entire document collection and selecting for presentation the documents

belonging to the cluster that is closest to an input query has proven not to im-

prove the retrieval quality in terms of precision [119, 153, 50]. As a consequence,

Hearst and Pedersen proposed to cluster retrieval results instead of the whole

collection (“Scatter/Gather”) [56]. Their results on a large, heterogeneous doc-

ument collection showed the great potential of the approach: retrieval from the

query-closest cluster increased the retrieval precision by up to 85% in compari-

son to the retrieval precision from unclustered results. This observation does not

support the Cluster Hypothesis directly, but a variant “with some assumptions

revised” [56]:

“[...] we do not assume that if two documents d1and d2are both

relevant or nonrelevant for query qA, they must also both be relevant or

nonrelevant for query qB. [...] In other words, because documents are

very high-dimensional, the deﬁnition of nearest neighbors will change

depending on which neighbors are on the street.”

The next section will discuss current approaches for document clustering. It

will present theoretical foundations, algorithms, and challenges of cluster analysis

in the text domain.

3.3 Clustering Documents

The deﬁnition of a clustering C={C1,...,C

k}of a set Dis—from a mathematical

point of view—the same like the categorization deﬁnition given above: the sets

Ciare (disjoint) subsets of D, and their union covers D. From a semantic point

of view, Cis called clustering if it was generated by a clustering algorithm. The

goal of clustering techniques is to ﬁnd clusterings that are close to the optimum

categorization3C∗. In general, cluster analysis concentrates on the following

questions.

3Optimum categorizations are in our case human-made categorizations of the document

set D. However, the deﬁnition of optimality depends on the underlying application scenario

and should be oriented at the user’s information need. In contrast to ranked retrieval, where

the probability ranking principle deﬁnes optimality [112, 45], a formal, human-independent

optimality deﬁnition for clusterings has not been stated so far.

32 Chapter 3. Information Need and Categorizing Search

(1) Are there clusters in D?

(2) How can the clusters be identiﬁed?

(3) How “good” are the found clusters?

Some statistics to answer question (1) have been proposed in the past [34]; a

recent article that discusses this question in context of information retrieval can

be found in [150]. However, our focus here is on the questions (2) and (3), which

we will elaborate in this and the next section respectively.

3.3.1 Challenges

A clustering is considered having a good quality with respect to a similarity

function ϕ:D×D→R, if the averaged similarity between objects of the same

cluster is “signiﬁcantly higher” than the similarity between items from diﬀerent

clusters. Other desired properties include that clusters shall be of a “compact

form” or resemble the “natural structure” of the item set [64, 71, 139].

A method to model the desired cluster properties is the speciﬁcation of an

objective function fothat maps a clustering to a real number that quantiﬁes the

goodness of a clustering. However, since the number of possibilities to partition a

set of nitems into kpartitions is roughly kn/k! [64], it is infeasible to exhaustively

search the space of possible partitions for elements that maximize/minimize the

value of fo, even when the “right” kis known. Finding the right kis a challenge

itself.

Apart from this general problem, document clustering using the standard

vector space model must cope with the high dimensionality of the vectors that

represent the documents. In particular, the curse of dimensionality and noise

phenomenons must be considered when clustering high-dimensional data.

Curse of Dimensionality. The term “curse of dimensionality” was coined

by Bellman [10] and circumscribes diﬃculties when analyzing high-dimensional

data; particularly it relates to the exponential growth of hyper-volume when

scaling dimensionality linearly. Consequently, clustering algorithms that rely on

analyses of spatial density4must be designed carefully [57, 49]. According to

Dasgupta, the curse of dimensionality expresses in two diﬀerent facets [25]:

4An example for a class of such algorithms are grid-based clustering algorithms.

3.3. Clustering Documents 33

(1) Visualization problems. High-dimensional data are hard to visualize. Al-

though there are dimension reduction techniques to embed high-dimensional

data into lower-dimensional spaces while preserving similarity relations be-

tween data points as good as possible, caution is advised when inspecting

the projected data visually or when analyzing the projected data with sta-

tistical tools.

The ﬁrst reason for caution is that a projection from a high-dimensional

space is subject to a projection error, the so-called stress [77], which mea-

sures similarity diﬀerences between point-pairs in the original and the pro-

jected space. Depending on the application, the stress may inﬂuence the

outcome to a greater or lesser extent. Our study in [131] illustrates this

eﬀect when clustering low-dimensional projections of high-dimensional doc-

ument vectors.

The second reason is that distribution properties of point sets may be al-

tered when projected to a lower-dimensional space. For example, simple

non-Gaussian distributions exist whose two-dimensional projections look

Gaussian [32, 25].

(2) Counter-intuitive pitfalls. An example for such a pitfall is the following.

When uniformly choosing points from an n-sphere at random, they almost

always lie close to the surface of the sphere. This observation relates to the

fact that uniformly drawn points from [0,1]ntend to be equidistant as n

rises. This and other related pitfalls that may be overlooked are described

in [25].

In general, much of the technology developed for lower dimensions does not

scale well to higher dimensions. Dasgupta mentions in this connection the parti-

tion of a vector space into Voronoi cells: mpoints in Rnwill produce cells with

Ω(mn) faces.

Noise. Similarity values that are computed between two documents d1and d2

are subject to noise. Clearly, even if d1and d2are not on the same topic there

will be terms that are shared between both documents. Although the idf part of

the tf ·idf term weighting scheme reduces the weight for common terms within a

34 Chapter 3. Information Need and Categorizing Search

100

1000

10000

100000

1e+06

0 0.2 0.4 0.6 0.8 1

Cosine similarity

Similarity value count

Figure 3.2: Distribution of similarity values for a document collection of 1000 news

documents drawn uniformly at random from 10 categories. Observe the logarithmic

scale.

collection, some random term matches are still possible that inﬂuence a similarity

measure’s value5.

Figure 3.2 depicts a distribution of similarity values for a document collection

of 1000 news documents drawn uniformly from 10 categories. The ﬁgure shows

that there are many low similarity values due to random term matches. Clustering

algorithms based on similarity graph density measures, for example, run the

risk to sum up many of these values and add a remarkable error when judging

cumulated similarity, e.g. between clusters.

3.3.2 Clustering Algorithms

Given a text corpus D, an optimum clustering C∗of Dcorresponds to its catego-

rization as accomplished by a human editor. Except for trivial cases C∗cannot be

found—the reasons for this are twofold: First, it is hard to quantify the properties

of the “correct” categories in terms of the similarity function ϕor the objective

function fo; secondly, the search space of possible clusterings is exponential in

|D|and a greedy strategy to ﬁnd a global optimum is not at hand. I. e., clus-

ter algorithms are highly developed heuristics to search a global optimum with

respect to the outlined structure identiﬁcation objectives.

5The probability can be bounded from below based on “Birthday Problem” estimates.

3.3. Clustering Documents 35

hierarchical

agglomerative

divisive

single-linkage, group average

min-cut-analysis

iterative

exemplar-based

commutation-based

k-means, k-medoid

Kerninghan-Lin

density-based

meta-search-

controlled

Cluster

approach descend-methods

point concentration

cumulative attraction

competitive

DBSCAN

MajorClust

simulated annealing

genetic algorithm

statistical

Gaussian mixtures

...

domain specific

concept factorization, suffix tree clustering

Figure 3.3: A taxonomy of clustering algorithms that are used for document catego-

rization (adapted from [129]).

Various clustering heuristics and strategies have been proposed (cf. Figure 3.3),

and it is hard to say which of the existing approaches is suited best for text cat-

egorization. The following paragraphs outline basic principles of the clustering

algorithms that are used for document categorization. In this connection it is

useful to consider the nelements in Das nodes of a weighted graph whose edge

weights correspond to the similarity ϕ(d1,d2) of the respective documents.

Iterative Algorithms. Iterative algorithms strive for a successive improve-

ment of an existing clustering and can be classiﬁed further into exemplar-based

and commutation-based approaches. The former assume for each cluster a repre-

sentative to which the objects become assigned according to their similarity. It-

erative algorithms need information with regard to the expected cluster number,

k. Well-known representatives are k-Means, k-Medoid, Kohonen, and Fuzzy-k-

Means. The runtime of these methods is O(nkl), where ldesignates the number

of iterations to achieve convergence [64, 88, 71, 74, 160, 52, 53, 75, 62].

Hierarchical Algorithms. Algorithms of this type create a tree of node sub-

sets by successively dividing or merging the graph’s nodes. In order to obtain a

unique clustering, a second step is necessary that prunes this tree at adequate

places. Agglomerative hierarchical algorithms start with each vertex being its

own cluster and union clusters iteratively. For divisive algorithms on the other

hand, the entire graph initially forms one single cluster which is successively

36 Chapter 3. Information Need and Categorizing Search

subdivided. Representatives are k-nearest-neighbor, linkage, Ward, minimum-

spanning-tree, or min-cut methods. Usually, these methods construct a complete

similarity graph, which results in O(n2) runtime [38, 40, 125, 67, 84, 158, 162, 69].

Density-based Algorithms. Density-based algorithms try to separate a sim-

ilarity graph into subgraphs of high connectivity values. In the ideal case they

can determine the cluster number kautomatically and detect clusters of arbitrary

shape and size. Representatives are DBscan, MajorClust, or Chameleon. The

runtime of these algorithms cannot be stated uniquely since it depends on di-

verse constraints. For non-geometrical data it is in the magnitude of hierarchical

algorithms, O(n2) [139, 37, 69, 130, 131].

Meta-Search Algorithms. Meta-search algorithms treat clustering as an op-

timization problem where a given goal criterion is to be minimized or maximized.

Though this approach oﬀers maximum ﬂexibility only less can be stated respect-

ing its runtime. Representatives for meta-search driven cluster detection may

be realized by genetic algorithms, simulated annealing, or a two-phase greedy

strategy [9, 118, 117, 108, 41, 73].

Domain Speciﬁc Algorithms. Various clustering algorithms have been pro-

posed for high-dimensional data or especially for the text domain. The goal of the

latter is often to identify concepts, which are in turn the basis for grouping docu-

ments. Techniques that are employed in this connection include projections of the

high-dimensional data to lower dimensions each of which representing a concept,

e.g. low-rank approximations of the term-document matrix. Other approaches

try to identify concepts directly with frequency or subsumption analyses. Some

recent algorithms are given in [163, 159, 70, 31, 49].

Statistical Parameter Estimation Algorithms. This class of algorithms

regards a population of points in Rnas a sample that has been generated by a

mixture model of kn-dimensional density functions δ1,...,δ

kwith diﬀerent pa-

rameters, where the density functions have been combined according to a weight-

ing scheme w1,...,w

kwith iwi= 1. The goal is to estimate the parameter

set of each density function δi, each of which being regarded as generator for a

3.4. On the Validity of Document Clusterings 37

particular cluster. The probability that a point is generated by δicanthenbecal-

culated based on the density estimates, and the cluster numbers can be assigned

to the data points according to this probability. An up-to-date compilation of

such algorithms can be found in [25, 1].

3.4 On the Validity of Document Clusterings

Even when a document model along with a similarity measure is determined

for a particular categorization task, a muchness of clusterings can be generated

by clustering algorithms. The quality of unsupervised categorizations varies ac-

cording to the employed cluster algorithms, especially in combination with their

parameter settings. Since hundreds of clusterings of a mid-size document collec-

tion can be generated within one second CPU time, the diﬃculty with clustering

is not the time factor but the selection of a valid clustering, say, a clustering that

reﬂects the human idea of categorization best. Jain and Dubes [64] comment this

diﬃculty as follows:

“The validation of clustering structures is the most diﬃcult and frus-

trating part of cluster analysis. Without a strong eﬀort in this direc-

tion, cluster analysis will remain a black art accessible only to those

true believers who have experience and great courage.”

This is where cluster validity measures come into play; their task is to decide

which of the generated clusterings is best, either closest to a reference catego-

rization that is regarded as optimum, or well developed with respect to some of

the clustering’s structural properties. The former are called external measures,

the latter internal measures (cf. Figure 3.4). Measures that are capable to judge

which clustering is best among a set of clusterings are sometimes referred to as

relative measures.

In the following, a more detailed taxonomy of validity indices is introduced,

and the usage of particular measures is discussed. In particular, we introduce

a new internal cluster validity measure, called expected density, that proved to

consistently outperform other traditional internal measures in the document clus-

tering domain [138]. Experimental analyses conclude the section.

38 Chapter 3. Information Need and Categorizing Search

covering analysis

information-theoretic

external

internal

relative

absolute

cluster

validity

index

static: overall structure analysis

dynamic: re-cluster stability

elbow criterium, GAP statistics

Kullback-Leibler, entropy

F-Measure, Purity, RAND statistics

dilution analysis

Figure 3.4: Taxonomy of cluster validity indices.

3.4.1 External Cluster Validity Measures

External measures are subject to quantify the congruence between a given clus-

tering and a reference categorization. In document clustering, the latter is usually

a document collection that has been categorized by a human editor. In particu-

lar, external measures can be subdivided into information theoretic and covering

analysis approaches (cf. Figure 3.4). External measures are widely used for

measuring the performance of a clustering or classiﬁcation algorithm against a

benchmark collection.

The F-Measure. The F-Measure combines the precision and recall measures

from information retrieval [111]. While precision quantiﬁes the “purity” of a

cluster with respect to a reference class, the recall value captures the percentage

of data objects from a reference class that are contained in a particular cluster.

Let Drepresent a set of documents and let C={C1,...,C

k}be a clustering

of D.Moreover,letC∗={C∗

1,...,C∗

l}designate the human reference categoriza-

tion.

Then the recall of cluster jwith respect to class i,rec(i, j), is deﬁned as

|Cj∩C∗

i|/|C∗

i|. The precision of cluster jwith respect to class i,prec(i, j), is

deﬁned as |Cj∩C∗

i|/|Cj|.TheF-Measure is the harmonic mean of precision and

recall:

Fi,j =1

21

prec(i,j)+1

rec(i,j)=2·prec(i, j)·rec(i, j)

prec(i, j)+rec(i, j)

Basedonthisformula,theoverallF-Measure of a clustering (micro-averaged) is:



i=1

|C∗

|D|·max

j=1,...,k{Fi,j}

3.4. On the Validity of Document Clusterings 39

An alternative way to compute Fuses macro-averaging: the maxima are not

weighted according to class size, but all classes are weighted equally, i.e.

F=1



i=1

max

j=1,...,k{Fi,j}

In other words, macro averaging treats classes equally, while micro averag-

ing treats documents equally. Unless stated otherwise, we will employ micro-

averaging as it considers class importance.

Note that parallel maximization of precision and recall is diﬃcult; the F-

Measure quantiﬁes to what extent this objective is met for a particular clustering.

A perfect ﬁt between a generated clustering and a human reference categorization

leads to an F-Measure score of 1, which is the maximum value for this measure.

Entropy. Entropy is an information theoretic measure that quantiﬁes the un-

certainty of an information source S, which emits symbols S1,...,S

kaccording

to the probabilities P(S1),...,P(Sk). The entropy of the information source is

deﬁned [124] as

H(S)=−



i=1

P(Si)·log2(P(Si))

The connection between entropy and validity measures is as follows. Imagine

each cluster of a given clustering acting as information source. As impure clusters

comprise data from diﬀerent “true” classes, one can imagine that a cluster emits

class numbers from the underlying reference categorization. Whenever a class

number is emitted from a cluster, its probability corresponds to the fraction of

the number of items from its class within the cluster, and the total number of

items within the emitting cluster. Obviously, an impure cluster, which contains

data from diﬀerent classes, represents an uncertain information source, and thus

results in a high entropy value.

Let Drepresent the set of documents, and let C={C1,...,C

k}be a clus-

tering of D.Moreover,letC∗={C∗

1,...,C∗

l}designate the human reference

categorization.Then the entropy of cluster Ciwith respect to C∗is deﬁned as

H(Cj)=−

|Cj∩C∗

i|=0

|Cj∩C∗

|Cj|·log2|Cj∩C∗

|Cj|

40 Chapter 3. Information Need and Categorizing Search

TheentropyoftheentireclusteringCwith respect to C∗is the sum of

the cluster-wise entropies, which are weighted according to cluster size (micro-

averaged):

H(C)= 

Cj∈C

|Cj|

|D|H(Cj)

Observe that a perfect clustering has an entropy of 0. However, the patholog-

ical case in which each data object is assigned to its own cluster also scores with

0. The reason for this behaviour is that entropy mathematically transforms the

precision values of the clusters.

Pair-Analysis-based Figures of Merit. Another class of external measures

analyzes to which extent document pairs share the same cluster and the same

underlying class, and quantiﬁes the congruence. Again, let D={d1,...,d

be the set of documents, let C={C1,...,C

k}be a clustering of D,andlet

C∗={C∗

1,...,C∗

l}designate the human reference categorization. Let tC:N×

N→{0,1}be an indicator function with tC(i, j)=1ifdiand djshare the

same cluster in C, and 0 otherwise. Likewise, tC∗(i, j)=1ifdiand djshare the

same class in C∗, and 0 otherwise. Based on these indicator functions, the Rand

statistic [109] of a clustering Cis deﬁned as

R(C)= 1

n(n−1)/2

i<=j

δtC(i,j),tC∗(i,j)

where δx,y is the Kronecker symbol, i.e. δx,y =1ifx=y,and0otherwise.

In other words, the Rand statistic counts the pairs of documents that lie

both in the same or both in diﬀerent clusters in Cand C∗. The resulting sum is

normalized with the total number of pairs. Consequently, the values stem from

the interval [0,1] with 1 indicating a perfect match between Cand C∗.

There are also other statistics that rely on the given indicator functions [64].

They include the Fowlkes and Mallows statistic [42], the Jaccard statistic [64],

and the Γ statistic [60].

3.4.2 Internal Cluster Validity Measures

In contrast to external measures, internal measures quantify the quality of a

clustering based on its structural properties only, i. e. , without comparing it

to the “right” categorization. They can further be split up into relative and

3.4. On the Validity of Document Clusterings 41

absolute measures (cf. Figure 3.4), whereas the task of relative measures is to

identify the best in a set of clusterings, and the task of absolute measures is to

judge the quality of a single clustering by means of a real number. Although

there are measures that can only be employed relatively since they require a set

of clusterings as input, it should be noted that absolute measures can also be

used relatively, when comparing their values for diﬀerent clusterings6.

The most popular internal measures are absolute measures that focus on static

structural properties like compactness and well-separateness, measured by within-

cluster-scatter, between-cluster-distance, etc. Alternatively, structural properties

may be assessed in a dynamic way employing re-clustering techniques, which aim

to assess the stability of a clustering. In the following, some well-known as well

as new validity measures will be introduced.

The Dunn Index Family. Dunn Indices [36] form one of the most popular

classes of internal validity measures. Dating back to 1973, Dunn’s measure was

one of the ﬁrst to be investigated, and Bezdek generalized Dunn’s family of indices

by abstracting cluster diameter and cluster distance measures as follows [13].

Let C={C1,...,C

k}be a clustering of a document collection D,δ:C×C→

Rbe a cluster to cluster distance measure, and Δ : C→Rbe a cluster diameter

measure. Then all measures Iof the form

I(C)=mini=j{δ(Ci,C

j)}

max1≤l≤k{Δ(Cl)}

are called Dunn indices. Originally, Dunn used

δ(Ci,C

j)= min

x∈Ci,y∈Cj

d(x, y)and

Δ(Ci)= max

x,y∈Ci

d(x, y)

where d:D×D→Ris a function that measures the distance between objects

of D. Using these functions, the measure assigns high values to clusterings with

compact and very well separated clusters like shown in Figure 3.5. Here, the

maximum diameter is relatively low and the minimum distance between two

clusters is relatively large. As a consequence, the clustering is ranked high by

6It is assumed that the values of absolute measures relate monotonously to cluster quality.

42 Chapter 3. Information Need and Categorizing Search

Clustering 1

Clustering 2

Figure 3.5: Clustering 1 contains a clustering with compact and well separated clusters.

The maximum diameter is relatively low and the distance of the two closest clusters

is relatively large. In this case the original Dunn index returns a high score for the

clustering. Clustering 2 shows two undesired properties from the perspective of the

Dunn Index: the left cluster’s large diameter as well as the relatively small distance

between C2and C3are responsible for the index returning a low value, even though

the clustering ﬁts the structure well. If C2and C3are merged to one cluster, the Dunn

Index would falsely return a better value.

the Dunn index. However, Bezdek recognized that the index is noise sensitive.

Moreover, clusterings with arbitrarily shaped clusters can cause problems for the

Dunn index (see Figure 3.5 bottom): Even though the clustering in the example

is good, the large diameter of C1along with the small distance between C2and C3

leads to a low value of the Dunn index. Bezdek experienced that the combination

δ(Ci,C

j)= 1

|Ci||Cj|

x∈Ci,y∈Cj

d(x, y)and

Δ(Ci)=2

x∈Cid(x, ci)

|Ci|

give reliable results for data with normal mixture distributions [12]. Here, ci

denotes the centroid of cluster Ci. Note that a maximization of Iis desired.

3.4. On the Validity of Document Clusterings 43

The Davies-Bouldin Index. Davies and Bouldin proposed the following mea-

sure that is known as Davies-Bouldin Index [26]. It is a function that combines

within-cluster scatter and between-cluster separation as follows. Let C={C1,

..., C

k}be a clustering of a document collection D. Then the Davies-Bouldin

Index DB :C→Ris deﬁned as

DB(C)=1

k·



i=1

Ri(C),with

Ri(C)= max

j=1,...,n,

i=j

Rij(C)andRij(C)=(s(Ci)+s(Cj))

δ(Ci,C

j),

where s:C→Rmeasures the scatter within a cluster, and δ:C×C→Ris a

cluster to cluster distance measure.

Given the centroids ciof the clusters Ci, a typical scatter measure is s(Ci)=

|Ci|x∈Ci||x−ci||, and a typical cluster to cluster distance measure is the distance

between the centroids, ||ci−cj||. Because a low scatter and a high distance

between clusters lead to low values of Rij, a minimization of DB is desired.

Cluster Validity by Resampling. The application of resampling for cluster

validity was introduced in 1987 by Jain and Moreau [63]. Resampling bases on the

idea to compare a given clustering Cto a clustering of a randomly chosen subset

from the underlying document set. If the subsets’ clusterings diﬀer signiﬁcantly

from C,thenCis instable. In this case, the clustering is bad or the underlying

data does not contain a structure. In general, the resampling procedure is carried

out in the following steps.

(1) Several resamples Dj⊂Dwith |Dj|<|D|are randomly chosen from the

original data set. The fraction |Dj|

|D|is called dilution factor [86].

(2) A clustering algorithm is applied to each Dj, yielding to re-clusterings Rj.

Here, the cluster algorithm’s parameters must be identical to those used for

generating the clustering C.

(3) Similarities ψ(Ri,C) between the original clustering and the re-clusterings

are calculated and averaged. For this task, external validity measures can

be employed, with Cacting as reference categorization.

44 Chapter 3. Information Need and Categorizing Search

Noise

Cluster 1

Cluster 2

Cluster 1

Cluster 2

Figure 3.6: Illustration of resampling. On the left hand side a clustering is shown,

which was generated by a cluster algorithm that was mislead by noise. On the right

hand side, a clustering of a resampling can be seen, which identiﬁes the regions in which

stable clusters reside.

The average diﬀerence between the original clustering and the resample clus-

terings is an indicator for the quality of C. An intuition for this interpretation is

that noise in the data set may loose inﬂuence on the clustering algorithm when

resamples are drawn. Probably, weak clusters that are inﬂuenced by noise will

break into parts during the re-clustering step. Consider Figure 3.6 in this con-

nection: After resampling, enough points of stable clusters remain so that the

clustering algorithm is able to detect them in several resamplings.

Note, however, that values of a clustering similarity measure ψshould only be

compared or averaged for re-clusterings that were generated using the same dilu-

tion factor, as upper/lower bounds for ψare determined by the number of points

in Di. For example, the F-Measure-values for re-clusterings that were generated

using a dilution factor of 0.5 will likely be smaller than those generated with a

dilution factor of 0.9 if the same clustering algorithm with the same parameters

is employed.

The Elbow Criterion. The elbow criterion is a relative validity measure,

i. e. , it decides which clustering is best within a set of clusterings. Let C=

{Cp1,...,Cpk}be a set of clusterings that has been generated with the same clus-

tering algorithm but with diﬀerent parameter values p1,...,p

k,andlete:C→R

be an error function on the clusterings such as the sum over all clusters of within-

cluster scatter with respect to the cluster centroid. Let {(pi,e(Cpi)) |i≤k}be a

set of points forming an error curve; without loss of generality, assume that the

(pi,e(Cpi)) are ordered by descending e(Cpi)values. Ifpoint(pi,e(Cpi)) experi-

ences the maximum error drop with respect to its predecessor, the parameter pi

3.4. On the Validity of Document Clusterings 45

123456

Figure 3.7: On the left, the ﬁgure shows some points in a 2-dimensional space, forming

two clusters. On the right, an error function eis depicted for k-Means generated

clusterings of the points, depending on the speciﬁed number of clusters, k. The error

drop for k= 2 is maximum, indicating that k= 2 is a good parameter for the underlying

data.

is considered as good choice for the clustering algorithm.

As an example, confer Figure 3.7. The clustering algorithm k-means tends to

reduce the overall variance of a generated clustering on the same data when k

increases, and when kreaches |D|, the variance disappears since each cluster tends

to contains only a single point. The depicted error function has its maximum

variance drop in the second point, indicating that k= 2 is a good parameter for

the underlying data.

An advanced approach is the GAP statistic, which is an error-tolerant and

normalized variant of the elbow criterion [144].

The Λ-Measure. A document collection can be considered as a weighted graph

G=V,E,wwith node set V,edgesetE, and weight function w:E→[0,1]

where Vrepresents the documents, and wdeﬁnes the similarities between two

documents. The Λ-measure computes the weighted partial connectivity of G=

V,E,w, which is deﬁned as follows [139].

Let C={C1,...,C

k}be a clustering of the nodes Vof a weighted graph

G=V,E,w.

Λ(C):=



i=1 |Ci|·λi,

where λidesignates the weighted edge connectivity of the induced subgraph

G(Ci)=Vi,E

i.λiis deﬁned as minE{u,v}∈Ew(u, v)whereE⊂Eand

G

i=Vi,E

i\Eis not connected. λiis also designated as the capacity of a

minimum cut of G(Ci).

46 Chapter 3. Information Need and Categorizing Search

Measure of Expected Density ρ.AgraphG=V,E,wis called sparse

if |E|=O(|V|); it is called dense if |E|=O(|V|2). Put another way, we can

compute the density θof a graph from the equation |E|=|V|θ.Withw(G):=

|V|+e∈Ew(e) denoting the weight of G7,this relation extends naturally to

weighted graphs:

w(G)=|V|θ⇔θ=ln (w(G))

ln (|V|)

Obviously, θcan be used to compare the density of each induced subgraph

G=V,E,wof Gto the density of the entire graph G:Gis sparse (dense)

compared to Gif the quotient w(G)/(|V|θ) is smaller (larger) than 1. This

consideration is the key idea behind the following deﬁnition of a clustering’s

expected density ρ.

Let C={C1,...,C

k}be a clustering of a weighted graph G=V,E,w,and

let Gi=Vi,E

i,w

ibe the induced subgraph of Gwith respect to cluster Ci.

Then the expected density of a clustering Cis deﬁned as follows.

ρ(C)=



i=1

|Vi|

|V|·w(Gi)

|Vi|θ,where |V|θ=w(G)

Since the edge weights resemble the similarity of the objects which are repre-

sented by V,ahighervalueofρindicates a better clustering.

Remarks. The Dunn index, the Davies-Bouldin index, and the elbow crite-

rion are related in that they have a geometric (typically centroidic) view on the

clustering. The measures work well if the underlying data contains clusters of

spherical form, but they are unreliable if this condition does not hold. Λ as well

as ρinterpret a data set as a weighted similarity graph; they analyze the graph’s

edge density distribution to judge the quality of a clustering.

3.4.3 Statistical Hypothesis Testing

An internal cluster validity index quantiﬁes the quality of a given clustering with

regard to certain structural properties. But even though it is obvious whether

7The corrective summand |V|in w(G) assures that w(G)≥|V|, which in turn assures that

θ≥1, and as a consequence, that the desired property |Vi|θ+|Vj|θ≤|Vi∪Vj|θholds. This

correction is necessary for graphs with small edge weights in [0,1].

3.4. On the Validity of Document Clusterings 47

Density of I

under

(

)

Figure 3.8: Estimated density function ρIof Iunder H0. The critical value tαof Iis

marked for signiﬁcance level α.

large or small values of a particular index indicate better or worse clusterings, it is

usually unknown which absolute values qualify a clustering being good or bad. In

some settings a value of 0.7 may be suﬃciently large to indicate a good clustering,

in other settings a value of 0.8 is not large enough to accept a clustering. This

dilemma can be addressed by statistical test theory [64].

A statistical test indicates in our context whether a clustering Cﬁts the data

unusually well. Here, “unusually well” means that Cﬁts the data better than

a randomly generated clustering. Given that each document of Dis labeled

according to cluster membership, a statistical test starts with the hypothesis

H0: All permutations of the labels on the ndocuments are equally likely.

Let Ibe a real-valued (internal) cluster validity index for which a greater value

indicates a better clustering quality. Assumed that the distribution or density

function ρIof the values of Iunder the null hypothesis H0is known (as depicted

in Figure 3.8), the probabilities P(E|H0) could be determined, where Eis the

event “I(C)≤t”forsomet. For a given signiﬁcance level 0 ≤α≤1, one is

interested in a threshold tαsuch that P(I(C)≤tα|H0)=α.Inthissense,tαis

a critical value. Formally, tαis determined by solving the equation:

∞

tα

ρI(t)dt =α

and the decision rule

“IF I(C)>t

αTHEN reject H0at signiﬁcance level α.”

applies, assuming that a greater value of Iindicates a better clustering.

If the null hypothesis is rejected, then the clustering Cis assumed to have

a non-random structure at signiﬁcance level α. However, even though the null

48 Chapter 3. Information Need and Categorizing Search

hypothesis is not supported by the test, it could still be true. But the probability

of mistakenly rejecting H0is bounded by α, which in turn is controlled by the

tester. αis usually set to 0.05 or 0.01 to ensure a small risk of making a mistake.

Another problem is that rejecting the hypothesis H0does not mean that the

found clusters in Care meaningful—the test can only give evidence with respect

to H0. A test against a stronger H1hypothesis like “there are two clusters in D”

would be desirable.

The main problem with this test is that the distribution of Iunder H0is

unknown and must be estimated. The following Monte Carlo method can be

applied to the problem [64]:

(1) Create a clustering Cby assigning cluster labels 1,...,k randomly to the

elements of D. This procedure reﬂects the random label hypothesis H0.

(2) Compute the cluster validity index I(C), and count it as a match for a

corresponding interval.

(3) Repeat steps one and two mtimes for a suﬃciently large m.

(4) Derive an approximation of the density function ρIfrom the collected data.

The disadvantage of a statistical test is its high computational cost. Note

that mmust grow adequately when αreaches 0 to ensure the validity of the test.

Moreover, the Monte Carlo estimation of I’s distribution does not only depend on

the null hypothesis H0and the generated clusterings, but also on the underlying

document set D.

3.4.4 Experimental Evaluation

The experiments have been conducted with samples of RCV1, short hand for

“Reuters Corpus Volume 1” [115]. RCV1 is a document collection that was pub-

lished by the Reuters Corporation for research purposes. It contains over 800,000

documents each of which consisting of a few hundred up to several thousands

words. The documents are enriched by meta information like category (also called

topic), geographic region, or industry sector. There are 103 diﬀerent categories,

which are arranged within a hierarchy of the four top level categories “Corpo-

rate/Industrial”, “Economics”, “Government/Social”, and “Markets”. Each of

3.4. On the Validity of Document Clusterings 49

RCV1

Corporate /

Industrial

Economics Government /

Social

Markets

Performance Insolvency /

Liquidity

Account /

Earnings

Comment /

Forecasts

Annual Results

... ...... ...... ......

Figure 3.9: A part of the topic structure of RCV1.

the top level categories deﬁnes the root of a tree of sub-categories, where every

child node ﬁne grains the information given by its parent (cf. Figure 3.9). Note

that a document dcan be assigned to several categories c1,...,c

p, and that all

ancestor categories of a category ciare assigned to das well.

For our experiments, we considered two documents d1,d

2as belonging to the

same category csif they share both the same top level category ctand the same

most speciﬁc category cs. Moreover, we constructed the test sets in such a way

that there is no document d1whose most speciﬁc category csis an ancestor of

the most speciﬁc category of some other document d2.

The number of categories in our test data varies from three to six. For each

category, between 100 and 300 documents were drawn randomly from the entire

category. The data sets had diﬀerent sizes and class numbers; we investigated

uniformly as well as non-uniformly distributed category sizes. Table 3.1 gives an

overview of the constructed data sets.

The preprocessing of the documents included parsing of text body and ti-

tle, stop word removal according to standard stop word lists, the application of

Porter’s stemming algorithm [106], and indexing according to term frequency. We

used the standard cosine similarity measure to capture the similarities between

documents.

Three analyses were conducted on each test data set to evaluate the perfor-

mance of the aforementioned indices. The three analyses along with their results

are described in the next three subsections.

Consistence Analysis. Since we know the reference categorization C∗which

was provided by a human editor, we can use it to generate artiﬁcial cluster-

ings C1,...,Cnthat are to a greater or lesser extent modiﬁcations of C∗.The

50 Chapter 3. Information Need and Categorizing Search

DS1 DS2 DS3 DS4 DS5 DS6 DS7

#categories 3343556

# documents 300 600 400 450 450 800 900

unif. distributed yes yes yes yes yes no no

Table 3.1: Overview of the constructed data sets.

F-Measure values for C1,...,Cnwill measure the degree of congruence for the

modiﬁed sets with respect to C∗. Assuming that the modiﬁed categorizations

represent erroneous clusterings, the value of a validation index for C1,...,Cn

should be worse than for C∗. Even more can be expected: For C1,...,Cn,theval-

ues of a subjective validation index should relate to the values of the F-Measure

monotonically.

To derive an artiﬁcial clustering Ciof C∗, we repeatedly chose two distinct

clusters of C∗and interchanged randomly chosen subsets of documents pairwise

between the clusters. Note that the size of the interchanged subsets controls

the degree of congruence between Ciand C∗. We varied the sizes of the subsets

between 1 document and 50% of the documents within a cluster.

Figure 3.10 shows the resulting scatter plots for the artiﬁcial clusterings that

we derived from the reference categorization of DS5 and evaluated with the data

set DS5. We measured the F-Measure value (y-axis) and the validity index value

(x-axis) for each clustering. For the sake of better readability, we changed the

sign of the Davies-Bouldin Index, which is the only one to be minimized—this

way, the plots are directly comparable. Assuming that a greater index value

constitutes a better clustering, an ideal index would show points on a curve that

starts in the lower left corner and grows monotonically up to the upper right

corner.

DS1 DS2 DS3 DS4 DS5 DS6 DS7 average

Dunn 0.45 0.89 0.42 0.66 0.59 0.77 0.75 0.65

D.-B. 0.95 0.98 0.93 0.84 0.91 0.86 0.92 0.91

ρ0.96 0.99 0.94 0.97 0.98 0.94 0.97 0.96

Table 3.2: The table shows for each index the Spearman rank correlation coeﬃcient of

the artificial clusterings for the investigated data sets. The number of the underlying

clusterings is 30.

3.4. On the Validity of Document Clusterings 51

0.4

0.5

0.6

0.7

0.8

0.9

1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22

F-Measure

Dunn Index

5 classes, 450 documents

0.4

0.5

0.6

0.7

0.8

0.9

-0.35 -0.3 -0.25 -0.2 -0.15 -0.1

F-Measure

Davies-Bouldin

5 classes, 450 documents

0.4

0.5

0.6

0.7

0.8

0.9

0.6 0.7 0.8 0.9 1 1.1 1.2

F-Measure

Expected Density

5 classes, 450 documents

Figure 3.10: Correlation of the F-Measure with (a) Dunn’s Index (top left), (b) Davies-

Bouldin-Index (top right) and (c) Expected Density (bottom) for the artificial cluster-

ings on DS5.

The extent to which this property is resembled by an index can be quantiﬁed

with Spearman’s rank correlation coeﬃcient. We determined for 30 clusterings

their rank according to both, F-Measure and index value, and we quantiﬁed the

correlation of these rankings as depicted in Table 3.2. Since the critical value

for Spearman’s rank correlation coeﬃcient for 30 items is 0.43 at the signiﬁcance

level α=0.01, it can be concluded that the indices perform well on this test on

all data sets (with one exception).

Analysis with Genuine Clusterings. We are normally faced with clusterings

which are not artiﬁcially constructed but stem from a document categorization

system that uses diﬀerent clustering algorithms. For the experiments reported be-

low we employed hierarchical, iterative, and density-based algorithms. Moreover,

for each of these algorithms diﬀerent thresholds, agglomeration levels, cluster

numbers, etc. were tried. We measured the F-Measure values and correspond-

ing validity index value for each clustering, and, in particular, for the reference

52 Chapter 3. Information Need and Categorizing Search

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0 0.2 0.4 0.6 0.8 1 1.2 1.4 1.6

F-Measure

Dunn Index

5 classes, 450 documents

0.3

0.4

0.5

0.6

0.7

0.8

0.9

-20 -18 -16 -14 -12 -10 -8 -6 -4 -2 0

F-Measure

Davies-Bouldin

5 classes, 450 documents

0.3

0.4

0.5

0.6

0.7

0.8

0.9

0.85 0.9 0.95 1 1.05 1.1 1.15 1.2 1.25 1.3

F-Measure

Expected Density

5 classes, 450 documents

Figure 3.11: Correlation of the F-Measure and the Dunn Index (top left), Davies-

Bouldin-Index (top right) and Expected Density (bottom) for genuine clusterings on

DS5.

categorization C∗that can be identiﬁed by its F-Measure value of 1.

Figure 3.11 shows representative scatter plots for the investigated validity

indices for the same data set that was used for the consistency analysis (DS5)

with the diﬀerence that the clusterings were generated using diﬀerent clustering

algorithms. Again, we changed the sign of the Davies-Bouldin Index. Like above,

we computed the rank correlations according to Spearman; these values can be

found in Table 3.3.

DS1 DS2 DS3 DS4 DS5 DS6 DS7 average

Dunn 0.69 0.64 0.78 0.64 0.71 0.68 0.76 0.70

D.-B. 0.58 0.53 0.49 0.53 0.38 0.44 0.40 0.48

ρ0.79 0.76 0.86 0.90 0.90 0.78 0.71 0.81

Table 3.3: The table shows for each index the Spearman rank correlation coeﬃcient of

the genuine clusterings for the investigated data sets. The number of the underlying

clusterings is 30.

3.4. On the Validity of Document Clusterings 53

Note that the Davies-Bouldin Index, which works well for the synthetic data

sets, gets mislead by the genuine clusterings generated by our clustering algo-

rithms: Many clusterings with low F-Measure values untruly obtain a high index

value.

Prediction Quality. One might argue that a validity index only has to ﬁnd

the best clustering among several candidates—a single outlier that has a very

good index value but a poor clustering can completely ruin the applicability

of the index. Therefore we measured the F-Measure value that corresponds to

the maximum index value for each validity index and each data set. Table 3.4

comprises the results.

DS1 DS2 DS3 DS4 DS5 DS6 DS7 average

Dunn 0.73 0.73 0.78 0.66 0.83 0.80 0.74 0.75

D.-B. 0.77 0.77 0.65 0.55 0.56 0.64 0.36 0.61

ρ0.73 0.84 0.78 0.98 0.83 0.68 0.66 0.79

Table 3.4: The table shows for each index the F-Measure values that belong to its

top-rated clusterings. Since the reference categorization C∗was among the evaluated

clusterings, a perfect prediction corresponds to the F-Measure value of 1.

3.4.5 Concluding Remarks

Clustering algorithms are considered as a technology that has the potential to

automatically categorize document sets. Diﬀerent clustering algorithms produce

diﬀerent clusterings, and cluster validity measures must be applied to identify

among a set of clusterings the most valuable one. Internal cluster validity mea-

sures assess structural properties of a clustering—they hence are in the role of

objective measures. The key question in this connection is whether or not an

objective measure can be used to capture a user’s information need.

In the ﬁeld of automatic document categorization the information need cor-

responds to the categorization quality of a clustering C. Given a reference cate-

gorization C∗the categorization quality of Ccan be quantiﬁed with the achieved

precision and recall values. I. e., in the ﬁeld of document categorization an answer

to the above question can be given by analyzing the correlation of cluster validity

measures with the F-Measure.

54 Chapter 3. Information Need and Categorizing Search

The experiments investigate classical cluster validity measures from Dunn and

Davies-Bouldin and present the new graph-based measures Λ and ρ. As reported

in the experiment section, the new ρ-Measure performed convincingly on both,

artiﬁcial and genuine clusterings of diﬀerent document sets, and it outperformed

the classical measures in this domain. The Dunn Index performed robust but

missed to discover the real interesting artiﬁcial clusterings. The Davies-Bouldin

index performed well on artiﬁcial data sets—however, it was not able to cor-

rectly select the best clustering among clusterings that stemmed from a genuine

document cluster application.

3.5 The Suﬃx Tree Document Representation

The last two sections discussed how categorizations for document collections can

be generated in an unsupervised way, i.e. they outlined how clustering algo-

rithms work and how validity measures can be used to assess the quality of

clusterings. However, both clustering algorithms and validity measures work

on abstract representations of the underlying documents and require the state-

ment of a meaningful document similarity or distance function. Regardless of a

clustering algorithm’s or validity measure’s superiority, these methods are only

as good as the underlying representations and similarity statements reﬂect the

corresponding semantic relation. In other words, the better two semantically

unrelated documents can be separated based on their representation and the as-

sociated similarity value, the easier is the clustering job, and the better is the

resulting clustering.

In this connection it should be noted that vector-based document models

encode no information about the order by which the words occur in a document8.

A more sophisticated document model that preserves the complete word order

information is the suﬃx tree document model that we introduce here; it deﬁnes

the similarity between two documents in terms of string overlaps in their common

suﬃx tree.

8Some kind of “weak” order information can be introduced by using phrases or just every

sequence of nconsecutive words, so-called n-grams, instead of single words.

3.5. The Suﬃx Tree Document Representation 55

"boy plays chess"

"boy plays bridge too"

chess

bridge

boy

boy plays

chess

bridge

too

chess

bridge too

too

plays

chess bridge

too

Figure 3.12: Illustration of the two document model types. The left-hand side shows

two documents under the vector-based paradigm; the underlying dictionary contains

the words “boy”, “chess”, and “bridge”. As similarity function ϕthe cosine similarity

is shown, which corresponds to the cosine of the angle between d1and d2.Theright-

hand side shows a suﬃx tree for the documents “boy plays chess” and “boy plays bridge

too”. Here, the similarity function ϕmust quantify the portion of the overlap, which

corresponds to the green (thick) edges in the graph.

3.5.1 Suﬃx Trees

The ith suﬃx of a document d=w1...w

mis the substring of dthat starts with

word wi. A suﬃx tree of dis a labeled tree that contains each suﬃx of dalong

a path whose edges are labeled with the respective words. The construction of a

suﬃx tree is straightforward: The ith suﬃx of dis inserted by checking whether

some edge emanating from the root node is labeled with wi. If so, this edge is

traversed and it is checked whether some edge of the successor node is labeled

with wi+1, and so on. If, in some depth k, a node nwithout a matching edge is

reached, a new node is created and linked to node nwith an edge labeled with

wi+k.

Remarks. Document models and similarity functions ϕdetermine each other:

Vector-based document models are amenable to the cosine similarity in ﬁrst place.

The suﬃx tree document model requires a measure that assesses the similarity

between two graphs. Figure 3.12 illustrates both paradigms.

3.5.2 A Closer Look to the Suﬃx Tree Document Model

In this section we introduce a generic similarity measure for the suﬃx tree doc-

ument model. Moreover, we argue that the well-known similarity concepts of

the classical document models have their counterpart in the suﬃx tree document

model. Our generic view enables us to seamlessly understand the famous suﬃx

56 Chapter 3. Information Need and Categorizing Search

boy + -

plays + -

bridge -

chess +

bridge -

too -

plays + -

chess +

too -

bridge -

chess +

too -

5 7

too -

1 boy plays chess

2 boy plays bridge too

3 plays chess

4 plays bridge too

5 chess

6 bridge too

7 too

Figure 3.13: A suﬃx tree for the documents d+=“boy plays chess” and d−=“boy plays

bridge too”. The seven paths from the root node to the leafs nodes represent the seven

suﬃxes of d+and d−. Edges which are traversed on insertion by suﬃxes of d+are

labeled with “+”; likewise, a “-” marks the edges that are traversed by d−.

tree clustering algorithm of Zamir and Etzioni as a heuristic to eﬃciently evaluate

the graph-based similarity measure for large document collections.

A Graph-Based Similarity Measure. As pointed out above, a document

model in the form of a suﬃx tree preserves full word order information. Here, we

introduce a measure that quantiﬁes the similarity of two documents under the

suﬃx tree document model.

Let d+,d

−designate two documents that are inserted into an initially empty

suﬃx tree T.Eachedgeein Tgets either labeled “+”, “–”, or “+–”, depending

on whether or not ehas been traversed while inserting a suﬃx from d+or d−

(cf. Figure 3.13). Moreover, let Edenote the edges in T,andletE+and E−

denote those edges in Ewhose label contain a “+” and a “–” respectively. Then

the suﬃx tree similarity ϕST is deﬁned as

ϕST =|E+∩E−|

|E+∪E−|

Obviously, ϕST fulﬁlls the following properties of a similarity measure:

(1) Normalization. From 0 ≤|E+∩E−|≤|E+∪E−|follows that |E+∩E−|

|E+∪E−|∈

[0,1].

(2) Reﬂexivity. If d+=d−holds, then E+=E−, and consequently |E+∩E−|=

|E+∪E−|and ϕST =1.

3.5. The Suﬃx Tree Document Representation 57

(3) Symmetry. The symmetry property follows directly from the fact that the

insertion order does not aﬀect edge labeling.

ϕST is the Jaccard coeﬃcient of the edge sets E+and E−. Other possibilities

to measure the match between the two sets include the Dice coeﬃcient, the

cosine coeﬃcient, or the overlap coeﬃcient [111]. Observe that ϕST quantiﬁes

the frequencies of suﬃxes in a Boolean sense, since it is not recorded how often

an edge is traversed while inserting suﬃxes of d+and d−. Put another way, ϕST

captures “word order matches” rather than term frequencies.

There are two ways to incorporate term frequencies in ϕST . One possibility is

to combine ϕST with a traditional vector space model similarity measure by means

of a weighted sum, say, ϕHYB =λ·ϕST +(1−λ)·ϕcos,withλ∈[0,1]. Alternatively,

frequency information for each edge can be recorded during the construction of

T. The latter approach has the advantage that frequency information for word

sequences that are longer than one (suﬃx frequencies) can be considered for

similarity computation.

We construct the suﬃx tree as described above; all suﬃxes of d+and d−are

inserted into an initially empty tree T. During insertion the functions n+(e)and

n−(e) are computed, which deﬁne for each edge ehow often it is traversed when

inserting suﬃxes from d+and d−respectively. Then the similarity value ϕSTF

that incorporates suﬃx frequencies is given as

ϕSTF =1

|E|

e∈E

min{n+(e),n

−(e)}

max{n+(e),n

−(e)}

The properties of normalization, reﬂexivity, and symmetry also hold for ϕSTF .

Note that n+(e)andn−(e) capture the term frequencies of d+and d−for all edges

ethat are incident with T’s root.

Research on vector space models has shown that term weighting schemes for

document collections that rely on both term frequency and inverse document

frequency outperform schemes that are based on only one of these concepts [126].

Note that under the suﬃx tree document model the inverse document frequency

can also be measured for a document collection D={d1,...,d

n}: We construct

a suﬃx tree Tfor all suﬃxes of the di∈Dand associate an initially empty

set Swith each edge ein T.Ifasuﬃxofdicreates or traverses e,weset

S(e):=S(e)∪{i}.Since|S(e)|captures the document frequency of the suﬃx

58 Chapter 3. Information Need and Categorizing Search

that is represented by the path that starts at the root and ends with e,theinverse

document frequency can be measured by IDF(e)=log(n/|S(e)|), leading to

ϕSTFIDF =1

|E|

e∈E

min{n+(e),n

−(e)}

max{n+(e),n

−(e)}·IDF(e)

STC: A Fusion Heuristic for the Suﬃx Tree Document Model. Given

a similarity measure like the cosine similarity for the vector space model or one

of the suﬃx tree similarity measures introduced above, the construction of a

similarity graph is in O(n2) for a document collection Dof size n. However,

[163] introduced the suﬃx tree clustering algorithm (STC), which runs in O(n)

without computing O(n2) similarity values. In detail, STC is made up of three

steps.

Step 1. A suﬃx tree for all suﬃxes of each document in D={d1,...,d

n}is

constructed, and each suﬃx is associated with the set of documents wherein it is

contained. In other words, using the notation given above, for each edge e(each

of which representing a certain suﬃx) the set S(e) is computed. The sets S(e)

with |S(e)|≥2 are called “base clusters” and identify the documents diwith

i∈S(e).

Step 2. Each base cluster is assigned a score f, which is a function of |S(e)|and

the length of the suﬃx that is represented by e. In [163] the authors propose f

as the product of |S(e)|and the length of the suﬃx that is represented by e.

Step 3. The kbase clusters S1,...,Skthat score best under fare selected. A

similarity graph in which the base clusters form the node set is generated, and

an edge between two nodes Siand Sjis added if the Jaccard coeﬃcient of Siand

Sjis larger than 0.5, say, when |Si∩Sj|

|Si∪Sj|>0.5. The connected components of this

graph form the ﬁnal clusters.

Analysis of the STC Heuristic. STC has proven to work well on document

snippets that are returned by search engines [163], but its properties have not

been analyzed yet. As pointed out above STC is a heuristic which is highly

eﬃcient, but which also has some drawbacks. The following observations will

provide a rationale for some of STC’s characteristics.

3.5. The Suﬃx Tree Document Representation 59

Non-Exclusiveness. Documents may be associated with several base clusters.

Consequently, the documents may appear in more than one of the found cate-

gories.

Incompleteness. A clustering that is generated by STC does not necessarily

contain all documents of the original collection. An incomplete categorization

happens for document collections which comprise documents that share only few

short word sequences with the remaining documents. The reason for this behavior

is that the emerging base clusters will not score high.

Document frequency based. A base cluster scores higher if the document fre-

quency of its associated suﬃx increases. This can lead to big clusters, because it

is likely that high-scoring base clusters that contain terms with a high document

frequency share more than half of their associated documents with other base

clusters and consequently are merged in Step 3 of the STC algorithm.

Drifting. Suppose that four base clusters, S1,S

2,S

3,S

4,are given, where S1={1,

2,3},S2={2,3,4},S3={3,4,5},andS4={4,5,6}. Then all documents

d1,...,d

6are merged into a single cluster within Step 3 because the Jaccard

coeﬃcient of Siand Si+1 is larger than 0.5. In particular, this single cluster

comprises the documents that are associated with S1and S4,whichmightbe

completely dissimilar.

Absoluteness. STC considers two documents as similar and assigns them to the

same base cluster if they share a rather long suﬃx or several short suﬃxes. Re-

gardless of whether their base cluster is merged with other base clusters in Step

3, their base cluster is part of the ﬁnal clustering. Note that no information

about document lengths or suﬃx mismatches of the rest of those documents is

computed, resulting in poor quality clusters. This point is relatively unimpor-

tant for short documents—a fact that explains the good performance of STC on

document snippets like those returned by search engines.

Topic Generating. Each base cluster is associated with a suﬃx, which can serve

as a label for this cluster. This method solves two basic problems in topic iden-

tiﬁcation for document clusters [134]: word order preservation and topic length

determination.

60 Chapter 3. Information Need and Categorizing Search

3.5.3 Experimental Evaluation

The purpose of our experiments is twofold. First, we want to gain evidence on

STC’s character as a heuristic, say, to measure how good the STC algorithm per-

forms on popular document collections compared to clustering algorithms that

rely on the new suﬃx tree similarity measures or the traditional vector space

similarity measures. Second, we want to answer the question to which extent

word order preservation improves clustering performance. For this purpose we

evaluated STC and the clustering algorithms MajorClust [139] and Group Av-

erage Link using the discussed similarity measures on several categories drawn

from RCV1 [115].

Document Sets. The number of categories in our test data varies from three

to six. For each category between 50 and 300 documents were drawn randomly

from the entire category. The data sets have diﬀerent sizes and class numbers,

and we investigate uniformly as well as non-uniformly distributed category sizes.

Table 3.5 gives an overview of the constructed data sets. The document prepro-

cessing involves parsing, stop word removal according to standard stop word lists,

and the application of Porter’s stemming algorithm [106].

DS1 DS2 DS3 DS4 DS5 DS6

#categories 343546

# documents 300 400 500 600 700 800

uniformly distributed no yes no yes yes no

Table 3.5: Overview of the constructed document sets.

Results. We employed STC and the graph-based clustering algorithms Major-

Clust and Group Average Link to cluster the constructed document sets. For

MajorClust and Group Average Link the underlying document models were var-

ied by computing the edge weights according to the cosine similarity measure

using the tf ·idf term weighting scheme, and the new suﬃx-tree-based similarity

measures. For the hybrid measure, we used ϕHYB =λ·ϕST +(1−λ)·ϕcos and

found that λ=0.2(λ=0.5) works well for MajorClust (Group Average Link).

Table 3.6 and 3.7 show the achieved F-Measure values [111] for MajorClust and

Group Average Link respectively. Note that performance improvements of up to

3.5. The Suﬃx Tree Document Representation 61

40% for the new hybrid similarity measure in comparison with the cosine simililar-

ity measure can be observed. The outlined disadvantages of STC are reﬂected in

STC’s F-Measure values.

DS1 DS2 DS3 DS4 DS5 DS6 average

STC 0.55 0.40 0.61 0.33 0.40 0.34 0.44

ϕcos 0.80 0.60 0.62 0.67 0.66 0.49 0.64

ϕST 0.55 0.46 0.61 0.38 0.45 0.55 0.50

ϕSTF 0.82 0.70 0.70 0.68 0.76 0.55 0.70

ϕSTFIDF 0.60 0.60 0.71 0.64 0.78 0.62 0.65

ϕHYB 0.84 0.83 0.72 0.74 0.93 0.64 0.78

Improvement in % 5% 38% 16% 10% 40% 31% 22%

Table 3.6: The table shows the achieved F-Measure values for STC (ﬁrst row), for

MajorClust with the traditional similarity measure ϕcos on tf ·idf vectors (second

row), and for MajorClust with the new suﬃx-tree-based similarity measures (remaining

rows). The improvement refers to ϕHYB with respect to ϕcos.

DS1 DS2 DS3 DS4 DS5 DS6 average

STC 0.55 0.40 0.61 0.33 0.40 0.34 0.44

ϕcos 0.82 0.63 0.69 0.55 0.78 0.51 0.64

ϕST 0.55 0.40 0.61 0.33 0.40 0.55 0.47

ϕSTF 0.83 0.64 0.71 0.57 0.85 0.63 0.71

ϕSTFIDF 0.84 0.72 0.71 0.64 0.80 0.60 0.72

ϕHYB 0.84 0.74 0.74 0.66 0.92 0.70 0.77

Improvement in % 2% 18% 7% 20% 18% 37% 17%

Table 3.7: The table shows the achieved F-Measure values for STC (ﬁrst row), for

Group Average Link with the traditional similarity measure ϕcos on tf ·idf vectors

(second row), and for Group Average Link with the new suﬃx-tree-based similarity

measures (remaining rows). The improvement refers to ϕHYB with respect to ϕcos.

3.5.4 Concluding Remarks

Both the classical vector space model and the suﬃx tree model play an impor-

tant role in text processing applications: the VSM predominates the information

retrieval domain, while suﬃx trees are employed as data structure for string al-

gorithms. Interestingly, these models are used in an isolated way: there has not

62 Chapter 3. Information Need and Categorizing Search

been an attempt to combine their diﬀerent properties for similarity measurement

in IR applications.

Our experiments clearly indicate that word order preservation in the docu-

ment model inﬂuences similarity computation to a greater extent. The combina-

tion of a vector-space-based similarity measure with a suﬃx-tree-based similarity

measure can lead to a signiﬁcant improvement of clustering performance, regard-

less of the chosen clustering algorithm. Moreover, we identiﬁed properties of the

STC heuristic that explain why this approach can not keep up with any other of

the investigated clustering settings on the RCV1.

3.6 Topic Identiﬁcation

Assume that a categorization Cof a document set Dis determined using an un-

supervised approach. To present this categorization to a user, it is convenient

to label the individual categories with characteristic terms. Amongst others,

these terms called category labels9should characterize the content of the asso-

ciated category with respect to the remaining categories. This property implies

that it should summarize a category’s content and that it should discriminate

a category from the other categories. This section resumes desired properties of

category labels and reviews algorithms to generate such labels. Moreover, a novel

evaluation methodology as well as experimental evaluations of topic identiﬁcation

approaches for ﬂat categorizations are contributed.

3.6.1 Formal Framework

Desired properties for category labels are expressed in the formal framework

from [134]; they are resumed in the following. For a categorization Clet Wd=

{wd1,...,w

dn}denote the word set for document d,andletW=d∈DWdde-

note the entire word set underlying D. Term frequency and inverse document

frequency of a term ware expressed by the functions tf (d, w)andidf (w), respec-

tively.

Several clustering algorithms deﬁne a hierarchy or can be applied in a recur-

sive manner, this way deﬁning a hierarchy HCon C.HCis a tree whose nodes

9Category labels are also called category names or topic identiﬁers.

3.6. Topic Identiﬁcation 63

correspond to the categories in Cfrom which one is marked as root node. Given

two categories, Ci,C

j,C

i=Cj, we write CiCjif the corresponding nodes in

HClie on a common path emanating at the root and if Ciis closer to the root

than Cj.

Topic identiﬁcation means the construction of a function τ:C→2Wthat

assigns to each element C∈CasetWC⊂W. The following properties are

generally desired for a labeling function τ:

(1) Unique. ∀Ci,Cj∈C

Ci=Cj

:τ(Ci)∩τ(Cj)=∅

(2) Summarizing. ∀C∈C ∀d∈C:τ(C)∩Wd=∅

(3) Expressive. ∀C∈C ∃w∈τ(C)∀d∈C∀w∈Wd

w∈τ(C)

:tf (d, w)≤tf (d, w),

where tf (d, w) designates the term frequency of term win document d.

(4) Discriminating. ∀Ci,Cj∈C

Ci=Cj∃w∈τ(Cj): 1

|Ci|tf Ci(w)1

|Cj|tf Cj(w),

where tf C(w) is the term frequency of win category C,say,

tf C(w)=d∈Ctf (d, w).

(5) Contiguous. ∀C∈C ∀w,w∈τ(C)

w=w ∀d∈C∃wi,wi+1∈Wd:wi=w∧wi+1 =w

(6) Hierarchically Consistent.

∀Ci,Cj∈C

Ci=Cj

:CiCj⇒P(wi|wj)=1∧P(wj|wi)<1,

where wi∈(Wdi∩τ(Ci)), wj∈(Wdj∩τ(Cj)), di∈Ci,dj∈Cj.

(7) Irredundant. ∀C∈C ∀w,w∈τ(C)

w=w :wand w are not synonymous.

The stated properties formalize ideal constraints which, in the real world, can

only be approximated. Note that merely Property 6 requires the existence of a

category tree HC10. Finally note that hierarchical consistency and irredundancy

are practically impossible to achieve if no external knowledge is provided.

10A discussion of desired properties for hierarchies can be found in [78].

64 Chapter 3. Information Need and Categorizing Search

3.6.2 Related Work

Like the stated properties show, topic identiﬁcation is related to keyword extrac-

tion and text summarization. The main diﬀerence for topic identiﬁcation is that

each identiﬁed topic refers to a set of documents instead of a single document.

In the following we resume existing topic identiﬁcation approaches, which can be

classiﬁed according to the underlying categorization type: ﬂat or hierarchical.

Topic Identiﬁcation in Flat Categorizations. In past work on keyword ex-

traction from English texts, several term features have been proposed to identify

meaningful keywords; they include the following: (1) ﬁrst occurrence measured

by a term’s oﬀset from the beginning of a document, (2) term frequency and, for

document collections, inverse document frequency, (3) co-occurrence information

[157, 89, 146]. It is unclear how some of these features can be generalized for

topic identiﬁcation; e. g. it is unclear how ﬁrst occurrence can be deﬁned and if

ﬁrst occurrence is a signiﬁcant feature with respect to a set of documents. In

general, it is questionable if features for keyword identiﬁcation are suited or can

be adapted for topic identiﬁcation.

Popescul and Ungar propose a labeling algorithm for ﬂat categorizations using

frequency and predictiveness information of terms [105, 161]. In particular, the

function

fC(w)=P(w|C)·P(w|C)

P(w)

measures the score for word wto be topic identiﬁer in cluster C. Here, P(w|C)

denotes the conditional probability to draw wfrom a document in C,andP(w)

is the probability to draw wfrom the whole collection. The ﬁrst factor in fC

makes sure that more frequent terms within a category are preferred, while the

second factor measures the predictiveness11 of wfor C, i. e. the discrimination

power. These factors mean to optimize points (2), (3), and (4) from the desired

properties. However, point (1) is not addressed by this approach.

Topic Identiﬁcation in Category Hierarchies. Popescul and Ungar also

propose a two-step algorithm to label hierarchical document clusterings as follows

[105]. Let the cluster hierarchy be modeled as a tree in which each node represents

a cluster, and let the nodes contain their associated documents12. In the ﬁrst

11Popescul and Ungar remark that the second factor is similar to a mutual information esti-

mator.

12Popescul and Ungar assume that only the leaf nodes hold documents.

3.6. Topic Identiﬁcation 65

step, bag-of-word representations of the documents are propagated bottom-up

starting at the leaf nodes: at each inner node, the bag-of-word representations of

its descendants are coalesced. In a second step, the tree is recursively traversed

starting at the root node. For each word that is associated with the current

node, a χ2independence test is conducted to gain evidence if the word appears

equally likely in all of the child nodes. If this is the case, the word is considered

being general for all children: It is kept as label in the current node and it is

deleted from all children. Else, if the test rejects the independence hypothesis,

the conclusion is that the term is speciﬁc for one or more child nodes. In this

case, it is deleted from the current node. The χ2test puts emphasis on points (3)

and (6) in the ﬁrst place, while the bottom-up as well as top-down propagation

strategy addresses points (1) and (2) from the formal framework.

The STC algorithm presented in the previous section has the advantage that

cluster labels are generated during the clustering procedure. As mentioned above,

these labels are path labels in suﬃx trees in which a set of documents has been

inserted. The nodes that score best in terms of frequent visits and depth within

the suﬃx tree are chosen as cluster labels. In contrast to other methods, this

labeling approach preserves word order, contributing to property (5) in the formal

framework. However, STC has problems fulﬁlling properties (6) and (7).

Subsumption analysis approaches, which are based on term distribution anal-

yses of document sets, contribute to address property (6) from the formal frame-

work [122, 80, 82]. The underlying consideration is that some terms occur fre-

quently among all documents within a document set, while other terms only

occur in some of the documents within the set. The hypothesis is that if both

of the mentioned term kinds co-occur with a certain probability, then the terms

may be related in that the more frequent term is more general than the other

term. In particular, Sanderson and Croft propose the following deﬁnition: Term

xsubsumes term yif

P(x|y)≥1−εand P(y|x)<P(x|y)

where P(x|y) denotes the conditional probability that term xis drawn when

term yhas already been drawn from a document [122, 81]. A value of ε=0.2

has been shown to work well in [122].

66 Chapter 3. Information Need and Categorizing Search

3.6.3 The WCC Algorithm for Topic Identiﬁcation

Like above, let Ddenote a set of documents, and let Cbe a clustering of D.Let

κ:W×{1,...,|C|} → C be the function with

κ(w,i)=C⇔Cis at rank iin a cluster ranking according to tf C(w).

E. g. κ(w,1) denotes the cluster in which wappears most frequently and

κ(w,|C|) is the cluster in which wappears least frequently.

The algorithm WCC consists of two main parts (cf. Algorithm 1). First,

a vector Tis constructed, containing tuples of the form w, tf κ(w,i)(w),i∈

{1,...,k}, which are sorted according to descending term frequency. Second, the

clusters are cycled in a round-robin manner: Each time a cluster is visited, it

is assigned one more word unless it holds lterms. According to the top-down

processing of tuples from T, the most frequent words from the cluster centroids

are covered.

The labeling τgenerated by WCC fulﬁlls the properties (1) and (2) according

to the parameters kand l(a value of k= 1 ensures the validity of property 1);

the cluster-wise Round-Robin-strategy aims at fulﬁlling property (3). A small

khelps fulﬁlling property (4). Computing the κ-values (including sorting) is

in O(k·|W|·log(k·|W|)); assigning the labels is in O(l·k·|W|). Since k

and lare typically bounded by a small constant, the overall complexity is in

O(|W|·log(|W|)).

3.6.4 Experimental Evaluation

Evaluating the quality of topic identiﬁcation approaches is a diﬃcult concern

since no benchmark collection is available, i. e. there are no document clusterings

that have been labeled by humans. In the past, diﬀerent labelings of the same

collection were presented to a couple of people who ranked the labeling quality

relatively (cf. e.g. [105, 78]). However, although such studies are useful to get

hints concerning relative performance, they are neither reproducible nor they give

evidence how good the approaches perform in terms of precision and recall. To

overcome these weaknesses, we propose statistics that measure to what extent

the desired properties from Section 3.6.1 are satisﬁed.

3.6. Topic Identiﬁcation 67

Algorithm 1 WCC, Weighted Centroid Covering.

Input: Cclustering

lnumber of terms per label

kmaximum occurrence of the same term in diﬀerent labels

Output: τlabeling

WCC(C,l,k)

(1) T=∅;

FOREACH CIN CDO

τ(C)=∅;

(2) FOREACH wIN WDO

FOR i=1 TO k

compute C=κ(w, i)from C;

add tuple w, tf C(w)to T;

ENDFOR

ENDDO

(3) SORT Taccording to descending term frequencies;

(4) FOR labelcount =1 TO l

assigned =0;

j=1;

WHILE assigned <|C| AND j≤|T|

let tj=w, tf C(w)be jth tuple of T;

IF |τ(C)|<labelcount THEN

τ(C)=τ(C)∪{w};

delete tjfrom T;

assigned =assigned +1;

ENDIF

j=j+1;

ENDWHILE

ENDFOR

(5) FOREACH CIN CDO SORT τ(C);

(6) RETURN τ;

68 Chapter 3. Information Need and Categorizing Search

In particular, the following statistics quantify the development of the proper-

ties (1)-(4):

1. Unique.

f1(τ):=1−2

k(k−1)

k−1



i=1



j=i+1

|τ(Ci)∩τ(Cj)|

|τ(Ci)∪τ(Cj)|

If the term sets of each pair of labels are disjoint, f1(τ) takes a value of one;

here, kdenotes the number of categories in C. The closer f1is to zero, the

more the terms of distinct labels overlap.

2. Summarizing.

f2(τ):=1

k

C∈C

|C|

d∈C

|τ(C)∩Wd|

|τ(C)|

The closer f2(τ) is to 1, the better the label terms cover the documents

of the associated category. A value close to 0 indicates that the category

labels appear in only a few documents of a category.

3. Expressive.

f3(τ):=1−1

k

C∈C

argmin

w∈τ(C)

|C|

d∈C

|Wd|

w∈Wd

w∈τ(C)

tf (d, w)

tf (d, w)

If for each cluster there exists an expressive term, f3(τ) reaches the maxi-

mum 1. Observe that f3can take negative values for poorly chosen labels.

4. Discriminating.

f4(τ):=1−2

k(k−1)

k−1



i=1



j=i+1

argmin

w∈τ(Cj)

|Cj|

|Ci|

tf Ci(w)

tf Cj(w)

The closer f4(τ) is to 1, the more discriminating is τ. Small values of f4

indicate weaker discriminative power. Like above, f4can be negative.

To evaluate the quality of WCC, Popescul and Ungar’s method, and the stan-

dard keyword extraction algorithm RSP [146], we downloaded 150 documents

from the digital library Citeseer13, each of which containing author-deﬁned key-

13http://citeseer.ist.psu.edu/

3.6. Topic Identiﬁcation 69

0.2

0.4

0.6

0.8

0 2 4 6 8 10 12 14 16

Number of extracted topics

80 documents in 16 clusters

WCC

Popescul

RSP 0

0.2

0.4

0.6

0.8

0 2 4 6 8 10 12 14 16

Number of extracted topics

80 documents in 16 clusters

WCC

Popescul

RSP

0.2

0.4

0.6

0.8

0 2 4 6 8 10 12 14 16

Number of extracted topics

80 documents in 16 clusters

WCC

Popescul

RSP 0

0.2

0.4

0.6

0.8

0 2 4 6 8 10 12 14 16

Number of extracted topics

80 documents in 16 clusters

WCC

Popescul

RSP

Figure 3.14: The statistics f1(top left), f2(top right), f3(bottom left), and f4(bottom

right) for the topic identiﬁcation approaches WCC, Popescul and Ungar’s method, and

RSP depending on the number of extracted words per label.

words. The evaluation idea is to cluster the documents, label the resulting clus-

tering, compute f1-f4, and measure to which extent the identiﬁed topic labels

appear in the keyword list of at least one document of the associated cluster. To

measure the extent, standard precision and recall values can be chosen. Note,

however, that the precision value is more important in our scenario since usually

only a few words (about one to ﬁve) are presented to a user.

The clustering was done on randomly drawn subsets D⊆Dwith |D|=

80 using k-means, with kvarying between 3 and 25. Since the Dunn index

has a centroidic view comparable to k-means, we let the Dunn index decide

which clustering was best among the generated clusterings. The labelings τwere

generated by each of the three mentioned algorithms. For the RSP approach, the

documents within a cluster were concatenated to form a single document. The

values of the proposed statistics f1-f4are averaged over 10 retries; their values

depending on the number of words per label are depicted in Figure 3.14.

70 Chapter 3. Information Need and Categorizing Search

Figure 3.15 shows precision and recall curves for the mentioned approaches,

again depending on the number of extracted words per label. The values are

averaged over 10 retries of the experiment, each time redrawing a new set of

documents. The precision curve shows that one to ﬁve cluster labels can be

identiﬁed at very high precision rates.

0.2

0.4

0.6

0.8

0 5 10 15 20 25 30 35 40 45 50

Precision

Number of extracted topics

80 documents in 16 clusters WCC

Popescul

RSP

0.2

0.4

0.6

0.8

0 5 10 15 20 25 30 35 40 45 50

Recall

Number of extracted topics

80 documents in 16 clusters WCC

Popescul

RSP

Figure 3.15: Precision (left) and recall (right) of topic identiﬁcation approaches de-

pending on the number of extracted words per label.

3.6.5 Concluding Remarks

The proposed formal framework of desired properties for category labels shows

that topic identiﬁcation is related to keyword extraction and summarization. The

key diﬀerence is that topic identiﬁcation relates to sets of documents—a fact that

introduces intra-category as well as inter-category constraints for labels.

In contrast to hierarchical topic identiﬁcation, not much work has been done

in labeling ﬂat categorizations. We proposed the WCC labeling algorithm, which

is designed to perform in accordance with the desired properties.

The conducted experiments are ambitious from a combinatoric point of view:

only about 15 terms from a cluster’s vocabulary of about 2300 distinct terms14

appear in the keyword list of a document. The results show that especially

the properties “summarizing” and “expressive” are very well developed for WCC

labelings. The achieved precision values of WCC as well as Popescul and Ungar’s

approach are surprisingly high.

14This number is an average over several generated clusters from our corpus of scientiﬁc

articles.

3.7. Genre Classiﬁcation 71

3.7 Genre Classiﬁcation

People who search the World Wide Web usually have a clear conception: They

know what they are searching for, and they know of which form or type the search

result ideally should be. The former aspect relates to the topic of a document,

the latter to the presentation of its content. For example, when searching for the

topic “Bayes” in the Internet, a human information miner might be interested in

either of the following: a technical article, a biography, or an online book shop.

It would be of much help if a search engine could deliver only documents of a

desired form, which is here called “genre”. This fact raises two questions:

(1) What are useful genres in the Web, especially when searching the Web, and

(2) How can a detection of these genres be operationalized?

This section will address these questions with the goal to automate genre

identiﬁcation in the context of Web search.

3.7.1 What Does Genre Mean?

As pointed out by Finn and Kushmerick, the term “genre” is used frequently in

our culture; e. g., in connection with music, with literature, or with entertainment

[39]. Roussinov et al. argue that genre can be deﬁned in terms of purpose or

function, in terms of the physical form, or in terms of the document form. And,

usually, a genre combines both purpose and form [116].

Here, we are interested in the genre of HTML documents. Several deﬁnitions

for document genre have been given and discussed in the past [14, 68, 72]. Com-

mon to all is that document genre and document content are orthogonal, say,

documents that address the same topic can be of a diﬀerent genre: “The genre

describes something about what kind of document it is rather than what the doc-

ument is about.” [39]. In this way, a genre classiﬁcation scheme can be oriented

at the style of writing, or at the presentation style. When analyzing newspa-

per articles for example, typical genres include “editorial”, “letter”, “reportage”,

“spot news”.

72 Chapter 3. Information Need and Categorizing Search

3.7.2 What Does Genre Mean in the WWW?

In the literature on the subject there is more or less agreement on what document

genre means and how diﬀerent genre classes can be characterized. And, at ﬁrst

sight, it seems to be canonical to apply this common understanding to the World

Wide Web: Certainly, “advertisement” seems to be a useful genre class, as well

as “private homepage”. On second sight, however, several diﬃculties become

apparent: Where does a presentation of a company’s mission end and where does

advertisement begin? Or, does a scientiﬁc article on a private homepage belong

to the same genre like a photo collection of mom’s lovely pet?

Our proposed deﬁnition of genre classes for the World Wide Web is governed

by two considerations:

•Usability from the standpoint of an information miner, which can be achieved

by a what we call “positive” and “negative” ﬁltering. With the former the

need for a focused search can be satisﬁed, while the latter simply extends

the idea of spam identiﬁcation to a diversiﬁed genre scheme.

•Feasibility with respect to runtime and classiﬁcation performance.

The ﬁrst point means that we want to support people who use the World Wide

Web as a huge database to which queries are formulated.15 The second point

states that automatic genre identiﬁcation shall happen on the ﬂy, in the form

of a post-processing of the results of a search engine. This aspect prevents the

computation of highly sophisticated features as well as the application of a ﬁne-

grained genre scheme.16 To get an idea which genre classes are considered useful

by search engine users, we conducted a user study that is described in detail in

Section 3.7.4.

3.7.3 Existing Work

We distinguish the existing work for computer-based genre classiﬁcation with

respect to the underlying corpus, say, whether it is targeted to a particular doc-

ument collection—like the Brown Corpus, for example—or to the World Wide

Web. In the following we outline selected papers.

15There are other groups of Internet users who use the Web for amusement, for example.

16Crowston and Williams identiﬁed about hundred genre classes on the World Wide Web

[23].

3.7. Genre Classiﬁcation 73

Corpus-speciﬁc genre classiﬁcation has been investigated among others in

[72, 128, 30, 39]. The existing work can further be distinguished with respect to

the interesting genre classes and the types of features that have been evaluated.

Kessler et al.’s work is based on the Brown Corpus. For the characterization of

genre classes they employ so-called genre facets, which are quantiﬁed by linguis-

tic and character-level features [72]. Stamatatos et al. use discriminant analysis

based on the term frequencies to identify the most discriminative terms with re-

spect to four newspaper genre classes [128]. Dewdney et al. concentrate on diﬀer-

ent learning approaches: Naive Bayes, C4.5, and support vector machines. They

employ about three hundred features including part of speech, closed-class word

sets, and stemmed document terms [30]. Rehm proposes a Web genre hierarchy

for academic homepages and a classiﬁer that relies on HTML metadata, pre-

sentation related tags and unspeciﬁed linguistic features. Finn and Kushmerick

distinguish between the two genres “objective” and “subjective”; they investigate

three types of features sets: the document vector containing the stemmed list of

a document’s terms without stop-words, features from a part of speech analysis,

and easily computable text statistics [39].

Genre classiﬁcation and navigation related to the World Wide Web is quite

new, and only very few papers have been published on this topic. Bretan et al.

propose a richer representation of retrieval results in the search interface. Their

approach combines content-based clustering and genre-based classiﬁcation that

employs simple part-of-speech information along with substantial text statis-

tics. The features are processed with the C4.5 algorithm; however, the au-

thors give no information about the achieved classiﬁcation performance [17].

Roussinov et al. present a preliminary study to automatic genre classiﬁcation:

Based on an explorative user study they develop a genre scheme that is in part

similar to ours and that comprises ﬁve genre groups. However, their work de-

scribes an ongoing study, and no recognition algorithm has been implemented

[116]. Dimitrova et al. describe how shallow text classiﬁcation techniques can be

used to sort the documents according to genre dimensions. Their work describes

an ongoing study, and experience with respect to the classiﬁcation performance

is not reported [33]. Lee and Myaeng deﬁne seven genre types for classifying doc-

uments from the World Wide Web. Aside from the genre “Q&A” and “Home-

page” Lee and Myaeng use also the newspaper-speciﬁc genres “Reportage” and

74 Chapter 3. Information Need and Categorizing Search

“Editorial”. The operationalized feature set is based on a list of about hundred

document terms tailored to each genre class [83].

3.7.4 User Study and Genre Selection

Although we have an idea of potentially useful genres, a user study should give

insights into the importance of dedicated genre classes. Moreover, it can be used

as a basis to select genres for building test collections. As a matter of course,

selected genres inﬂuence feature selection for automatic classiﬁcation.

User Study. To get an idea about the expected usefulness of diﬀerent Web

page genre classes were manifold, we decided to inquire a bigger number of search

engine users. We developed a questionnaire to shed light on search engine use,

usefulness of genre classiﬁcation, and usefulness of genre classes; in detail, we

were interested in the following points.

(1) Frequency of Search Engine Use. We expect that experienced search engine

users have a clearer idea whether genre classiﬁcation could be useful or not.

We asked the interviewees how often they use search engines. Possible

answers were “daily”, “once or twice a week”, “once or twice a month”,

and “never”.

(2) Typical Topics for Queries. As already pointed out, our target audience

should use the World Wide Web not only for entertainment, but also as

information source. To get an idea what the interviewees search for on the

Internet, we let them specify up to 3 typical search topics.

(3) Usefulness of Genre Classiﬁcation. With this question we wanted to ﬁgure

out if genre ﬁltering is considered as useful in general, i. e. if genre ﬁltering

helps to satisfy the user’s information need. Possible answers were “very

useful”, “sometimes useful”, “not useful”, and “don’t know”.

(4) Favored Genre Classes. We proposed ten genre classes that we found inter-

esting: publications/articles, scholar material, news, shops, link collections,

help and FAQ, private portrayals, commercial portrayals, discussion forums,

and product presentations. For each of these genres, the interviewees could

specify the usefulness in terms of “very useful”, “sometimes useful”, “not

useful”, and “don’t know”.

3.7. Genre Classiﬁcation 75

daily

1-2x per week

1-2x per month

never

73%

23%

Search Engine Use Frequency

Figure 3.16: Frequency of search engine use. About three quarters of the interrogated

students use a search engine on a daily basis.

Usefulness of Automated Genre Classification

64%

29%

1% 6%

very useful

sometimes useful

not useful

don't know

Figure 3.17: Usefulness of genre classiﬁcation.

(5) Additional Useful Genre Classes. We also wanted to ﬁnd out which addi-

tional genre classes could be interesting for the users. Therefore, a set of

up to three additional genre classes could be speciﬁed and classiﬁed into

“very useful” and “sometimes useful”.

(6) Comments. We also gave the interviewees the possibility to comment on

the idea of genre classiﬁcation.

To give the interviewees an idea of genre classiﬁcation, we gave them a short

introduction to genres and their use as positive and negative information ﬁlters.

As we expect students to frequently use search engines, we asked 286 of them in

our university to complete the proposed form. Figure 3.16 shows that we met

the right audience: about three quarters of the students use search engines on a

daily basis, and nearly the remaining quarter at least once a week.

The most frequently mentioned searches comprise scholar material, shopping

and product information, help (discussions and troubleshooting), entertainment

(music / games / ﬁlms / humorous material / news), downloads, health, and

programming (in this order). The fact that 64% of the students think that genre

classiﬁcation is very useful, and that another 29% ﬁnd it sometimes useful shows

that there is a strong need to post-process query results (cf. Figure 3.17).

76 Chapter 3. Information Need and Categorizing Search

Favored Genre Classes

1,72

1,53

1,45

1,44

1,37

1,36

1,23

1,19

1,00

0,93

0,00 0,50 1,00 1,50 2,00

private portrayal

link collection

news

commercial portrayal

product info

shop

discussion

article

help/FAQ

scholar

Figure 3.18: The favored genre classes. Higher values indicate a greater expected

usefulness.

To make up a ranked list of dedicated genre classes with respect to their

usability, we assigned scores on the usefulness of each genre class: “very useful”

scored 2 points, “sometimes useful” scored one point, “not useful” scored 0 points.

We added the scores for each proposed genre and divided it by the number of

interviewees that did not tick “don’t know” on that genre class. The results are

depicted in Figure 3.18: scholar material scores best, while private portrayals

were not judged as very useful by the interviewees.

Additional genres that were signiﬁcantly often proposed include Web page

spam and download sites. As the given comments and some given speciﬁcations

of spam let conclude, spam comprises in this context (a) paid links, (b) sites that

try to install dialers, and (c) sites that are only used to improve a site’s ranking in

search engines. Other propositions included topics (and not genres) like pornog-

raphy. The comments were encouraging and often asked for operationalization.

Genre Selection. An inherent problem of Web genre classiﬁcation is that even

humans are not able to consistently specify the genre of a given page. Take for

example a tutorial on machine learning that could be either classiﬁed as scholar

material or as article. In general, scholar material can be seen as a super-genre

that covers help, article, and discussion pages; therefore scholar material was not

chosen as a genre on its own. Another ﬁnding is that most product information

sites are combined with a shopping interface, which renders a discrimination of

the shops and products impossible.

3.7. Genre Classiﬁcation 77

To cut a long story short, we ﬁnally ended up with the following eight genre

classes:

(1) Help. All pages that provide assistance, e. g. Q&A or FAQ pages.

(2) Article. Documents with longer passages of text, such as research articles,

reviews, technical reports, or book chapters.

(3) Discussion. All pages that provide forums, mailing lists, or discussion

boards.

(4) Shop. All kinds of pages whose main purpose is product information and

sale.

(5) Portrayal (non-priv). Web appearances of companies, universities, and

other public institutions. I. e., home or entry or portal pages, descriptions of

organization and mission, annual reports, brochures, contact information,

etc.

(6) Portrayal (priv). Private self-portrayals, i. e., typical private homepages

with informal content.

(7) Link Collection. Documents which consist of link lists for the main part.

(8) Download. Pages on which freeware, shareware, demo versions of programs

etc. can be downloaded.

Although not every document can be rigorously assigned to a single class, our

scheme reﬂects the genre assessment of many human information miners: A sci-

entiﬁc article or a link collection, for instance, is still distinguished as such, in-

dependently of the domain holder’s form of organization where the document is

hosted.

3.7.5 Features for Genre Classiﬁcation

With respect to the investigated features the existing literature on genre classi-

ﬁcation falls into two groups: Classiﬁers that rely on a subset of a document’s

terms (sometimes called bag-of-words, BOW), and classiﬁers that employ linguis-

tic features along with additional features relating to text statistics. This section

78 Chapter 3. Information Need and Categorizing Search

gives an overview of these features. In particular, we introduce features that

are based on the frequency class of a word as well as concentration features for

closed-class word sets. Moreover, we suggest URL analyses with respect to the

closed-class word sets.

Word Frequency Class. The frequency class of a word is directly connected

to Zipf’s law and can be used as an indicator of a word’s customariness. Let D

be a document collection, and let tf D(w):=d∈Dtf (di) denote the frequency of

awordwin D,andletr(w) denote the rank of win a word list Tof D,which

is sorted by decreasing frequency.17

In accordance with [148] we deﬁne the word frequency class c(w)ofaword

w∈T as log2(f(w∗)/f(w)),wherew∗denotes the most frequently used word,

i. e. w=argmax

w∈T (tf D(w)). In the Sydney Morning Herald Corpus [28], w∗

denotes the word “the”, which corresponds to the word frequency class 0; the

most uncommonly used words within this corpus have a word frequency class of

19. The intuition to use the word frequency class as feature is the expectation

that articles use a more specialized speech than e.g. shops. The complexity of

speech is expected to be reﬂected in the average word class.

Dictionary of word frequency classes

(1) + Webster's unab. dictionary

(2) + misspelled words

(3) + unknown words

(1) (2) (3) (4)

Figure 3.19: The ﬁgure shows the inclusion relation of the used word sets. Note that

the sets (3) and (4) are only implicitly deﬁned, by means of the Levenshtein distance

and the “not-found” predicate respectively.

Based on the Sydney Morning Herald Corpus, which contains more than

38,000 articles, word frequency classes for about one hundred thousand words

have been computed. This dictionary is shown as set (1) in Figure 3.19. The

other sets in the ﬁgure evolve in a natural manner as supersets of (1): Webster’s

17Zipf’s law states that r(w)·f(w) is constant.

3.7. Genre Classiﬁcation 79

unabridged dictionary (2), the set of misspelled words (3), and the set of unknown

words (4). Observe that set (3) comprises all words from the sets (1) and (2) as

well as words found in the Levenshtein distance of one [85]. We use these sets to

deﬁne the following features:

•average word class

•average number of misspelled words

•average number of words not found in Webster’s unabridged dictionary

Syntactic Group Analysis. A syntactic group analysis yields linguistic fea-

tures that relate to several words of a sentence. Such analyses quantify the use

of tenses, relative clauses, main clauses, adverbial phrases, simplex noun phrases,

etc. Since the identiﬁcation of these features is computationally expensive, we

have omitted them in our analysis. Dewdney et al., however, also include the

transition in verb tense within a sentence in their analysis [30].

Part-of-Speech Analysis. Part-of-speech analysis groups the words of a sen-

tence according to their function or word class. Part-of-speech taggers analyze

a word’s morphology or its membership in a particular set. In this connection

one diﬀerentiates between so-called open-class word sets and closed-class word

sets, where the former do not consist of a ﬁnite number; examples are nouns,

verbs, adjectives, or adverbs. Examples for closed-class word sets are preposi-

tions and articles. For our analysis we have employed the part-of-speech tagger

of the University of Stuttgart [149]. Table 3.9 and 3.10 list the actually used

word classes.

Other Closed-Class Word Sets. Aside from word classes that relate to gram-

matical function, we have also constructed other closed-class word sets that may

be speciﬁc to a certain genre: currency symbols, help symbols, shop symbols,

link symbols, download symbols, homepage symbols, months, days, countries,

ﬁrst names, and surnames. Table 3.8 shows examples for some elements of some

of these sets.

In connection with closed-class word sets it is usual that feature values are

measured as the proportion of words from an individual closed-class word set with

80 Chapter 3. Information Need and Categorizing Search

respect to the total word count within a document. However, since characteristic

terms for a genre are often concentrated in one place within a document18,it

is reasonable to measure for a document the maximum concentration of terms

from these sets within a sliding term window [136]. Moreover, the appearance of

these terms in the URL of a Web document is also a good genre hint. Table 3.9

shows for which closed class word sets we also computed concentration and URL

containment features.

Table 3.8: Example terms in closed-class word sets.

Word set Example members

currency $, EUR, DLR, pound

discussion forum, subject, post, views, re:,next, thread

download Windows, Linux, zip, download

help FAQ, Q&A, support

homepage name, ˜, address, phone, homepage

shop buy, now, purchase, add, to, cart

Text Statistics. Under the label “text statistics” we comprise features that

relate to the frequency of easily accessible syntactic entities: clauses, paragraphs,

delimiters, question marks, exclamation marks, or numerals. Counts for these en-

tities are put in relation to the number of words of a document. Kessler et al. des-

ignate features of this type as “character-level cues” [72]; Finn and Kushmerick

designate such features as “hand-crafted” [39].

Presentation-Related Features. This type of features relate to the appear-

ance of a document. They include frequency counts as well as particular HTML-

speciﬁc concepts and stylistic concepts. To the former we count the number of

ﬁgures, tables, paragraphs, headlines, or captions. The latter comprises statis-

tics related to the usage of colors, hyperlinks (anchor links, site-internal links,

Internet links), URL speciﬁcations, mail addresses, etc.

18Examples include table headlines in discussion forums, in which terms like “subject”,

“poster”, “views” etc. can be found. Another example are download sites, in which version

information like operating system names or ﬁle extensions from diﬀerent binary formats can be

found.

3.7. Genre Classiﬁcation 81

Table 3.9: Feature set A consists of the listed features. The averages are taken with

respect to the total word count within a Web document. For closed word sets, the

number of appearances in the URL of a page as well as their maximum concentration

within the page were additionally measured.

Feature type Feature set A

Presentation related avg. # of <p>tags

avg. # of <ul>tags

avg. # of <br>tags

avg. # of anchor links

avg. # of links same domain

avg. # of links foreign domain

avg. # of mail links

avg. # of <img>tags

avg. # of <tr>tags

Closed word sets avg. word frequency class

conc. / avg. # of currency symbols

URL / conc. / avg. # of help symbols

URL / conc. / avg. # of shop symbols

URL / conc. / avg. # of link symbols

URL / conc. / avg. # of download symbols

URL / conc. / avg. # of date symbols

URL / conc. / avg. # of homepage symbols

URL / conc. / avg. # of ﬁrst names

URL / conc. / avg. # of surnames

avg. # of words that do not

appear in Webster’s dictionary

Text statistics avg. # of question marks

avg. # of letters

avg. # of digits

avg. # of dots

avg. # of semicolons

avg. # of colons

avg. # of commas

avg. # of exclamation marks

82 Chapter 3. Information Need and Categorizing Search

Table 3.10: Feature set B extends feature set A by ten additional features. The averages

are taken with respect to the total word count within a Web document.

Feature type Feature set B

Presentation-related

Closed word sets identical to feature set A

Text statistics

Part of speech avg. # of nouns

avg. # of verbs

avg. # of rel. pronouns

avg. # of prepositions

avg. # of adverbs

avg. # of articles

avg. # of pronouns

avg. # of modals

avg. # of adjectives

avg. # of alphanumeric words

Constructed Feature Sets. As our concern is genre classiﬁcation of search

results, the classiﬁcation should be done “on the ﬂy”, as a post-processing step.

Since a user usually waits actively for search results, the features must be com-

puted quickly. We propose a split of the mentioned features with respect to

computational eﬀort as follows.

(1) Features with Low Computational Eﬀort. These features comprise text

statistics, which can be acquired at parse time by means of counters.

(2) Features with Medium Computational Eﬀort. All features that are word-

related and that require dictionary lookups or superﬁcial parsing. These

are closed-class word sets, word frequency class and presentation related

features.

(3) Features with Higher Computational Eﬀort. This class comprises features

that rely on grammar analyses. Syntactic group analysis features and part-

of-speech related features fall in this category.

It should be clear that feature category (1) is not powerful enough to discrim-

inate between the genre classes solely. As a consequence, we built a feature set

that comprises features of (1) and (2), and a feature set that makes use of all

three feature classes. The Tables 3.9 and 3.10 show the details.

3.7. Genre Classiﬁcation 83

3.7.6 Experimental Evaluation

Since no benchmark corpus is available for our concern, we compiled a new corpus

with Web documents and analyzed statistical properties of the two feature sets

with respect to them. We employed classiﬁers in the form of discriminant func-

tions from discriminant analyses to test the achievable classiﬁcation performance.

Moreover, we analyzed the classiﬁcation performance for genre-speciﬁc searches

and typical user groups. The following subsections outline our experiments.

Corpus Compilation. The compiled corpus of Web documents is described

in Table 3.11. Each element in the corpus represents a single HTML document;

documents that are composed of frames and Flash elements were discarded. We

then generated two distinct representations of each corpus according to the fea-

ture sets (see Table 3.9 and Table 3.10).

Statistical Analyses. We conducted a discriminant analysis (linear model,

incremental variable selection according to Wilks Lambda, a-priori probability

uniformly distributed) to get an idea of the classiﬁcation performance of the

selected features. Table 3.12 shows a confusion matrix that belongs to feature

set B. The results range from acceptable to very good—articles, discussion, and

download pages are detected with a very high precision of about 85%, and the

remaining genres are detected with a good performance. However, about 80%

classiﬁcation performance for cross-validated data (ten-fold) on a huge corpus

with eight classes appears still very good to us.

The scatter plot in Figure 3.20 shows a clear separation of help, link list, and

discussion from the remaining genres using only the ﬁrst two discriminant func-

tions. Like the analysis has shown, the ﬁrst two discriminant functions capture

Genre # of Documents

article 181

discussion 242

download 201

help 198

link list 233

portrayal (non-priv) 213

portrayal (priv) 193

shop 246

sum 1707

Table 3.11: Composition of the Web document corpus.

84 Chapter 3. Information Need and Categorizing Search

Table 3.12: Ten-fold cross-validated confusion matrix. It shows the percentage of cor-

rectly classiﬁed documents on the diagonal and summarizes the percentage of misclas-

siﬁed documents with respect to other genres. The average classiﬁcation performance

is about 81%.

Article Discussion Download Help Link Portrayal Portrayal Shop total

Collection (non-priv) (priv)

Article 84.0% 0.6% 1.1% 3.3% 0.6% 8.3% 1.1% 1.1% 100.0%

Discussion 3.3% 86.0% 1.7% 1.7% 0.8% 3.7% 1.2% 1.7% 100.0%

Download 1.0% 1.0% 81.1% 2.0% 0.0% 10.0% 1.0% 4.0% 100.0%

Help 9.6% 2.0% 0.5% 78.3% 0.0% 5.6% 2.0% 2.0% 100.0%

Link

Collection 1.3% 0.9% 1.7% 0.0% 79.0% 12.0% 2.1% 3.0% 100.0%

Portrayal

(non-priv) 6.1% 0.9% 6.1% 0.5% 2.8% 76.5% 2.8% 4.2% 100.0%

Portrayal

(priv) 6.7% 0.0% 1.6% 0.0% 1.0% 4.7% 85.5% 0.5% 100.0%

Shop 0.8% 1.2% 2.0% 2.4% 0.8% 13.4% 0.4% 78.9% 100.0 %

about 46% of the variance.

Classiﬁcation Results. We conducted 10-fold cross-validated classiﬁcations

on the 1707 feature sets using discriminant analyses. Each experiment was con-

ducted twice, using feature set A and feature set B, respectively. The ﬁrst row of

Table 3.13 comprises the classiﬁcation results when classifying one genre against

all. The second row comprises the multiclass classiﬁcation results. To make our

results reproducible for other researchers, the corpus is available on request19.

Specialized Classiﬁers. Aside from general classiﬁcation performance, we are

interested in the question how eﬃcient classiﬁers for single genre classiﬁcations

and typical user proﬁles can be built. Assumed that a user starts several queries

on the same topic, an intelligent search assistant could ﬁgure out to which prede-

ﬁned user proﬁle a user probably belongs and apply the corresponding classiﬁer.

We conducted classiﬁcation experiments for the following three user proﬁles.

(1) Edu. This proﬁle is of educational nature and comprises articles, link col-

lections, and help sites.

(2) Geek. Geeks are mainly interested in downloads, discussions, articles, link

collections, and help sites.

19Meanwhile, our results from [92] have been conﬁrmed in subsequent work on our corpus;

corresponding values can be found in Boese and Howe [16] and in Santini [123].

3.7. Genre Classiﬁcation 85

discriminant function 2

discriminant function 1

article

discussion

download

help

link list

portrayal (non-priv)

portrayal (priv)

shop

marker for group

centroids

Figure 3.20: The ﬁgure shows a scatter plot of the genres according to the ﬁrst two

discriminant functions. The underlying feature set is B.

(3) Private. This group comprises individuals that surf the net for shopping

and for reading private portrayals.

The performance of the compiled classiﬁers with respect to both of the feature

sets is given in Table 3.13.

3.7.7 Concluding Remarks

We see genre classiﬁcation as a promising concept to improve the search eﬃciency

and to address the information need of many users that use the World Wide Web

as a database. While in the past an automatic detection of genre classes has

been demonstrated for newspaper corpora, there is the question whether genre

classiﬁcation can also be applied to the Internet.

A user study has shown the need for advanced page ﬁltering and gave hints

on the importance of dedicated Web genre classes. Taken the viewpoint of an

Internet information miner we propose the following eight genres: help, article,

discussion, shop, portrayals of companies and institutions, private portrayal, link

collection, and download. We show that with a small set of features, which cap-

86 Chapter 3. Information Need and Categorizing Search

Table 3.13: The table shows the classiﬁcation performance of specially crafted one-

against-all classiﬁers (ﬁrst row), multiclass classiﬁers (second row) and three proﬁle

classiﬁers (remaining rows). The bigger boxes symbolize an aggregation of the genre

classes that stand atop of them. Within a box, the upper value shows the classiﬁcation

performance for feature set A, while the lower value shows the performance for feature

set B.

Shop Portrayal Portrayal Download Discussion Article Link Help

(priv) (non-priv) Collection

One against

all 94.7%

94.7% 95.9%

96.4% 85.6%

86.2% 95.0%

95.1% 96.0%

96.2% 91.6%

92.0% 96.1%

96.6% 95.5%

96.0%

Multiclass

Classification 78.0%

78.9% 85.0%

85.5% 74.6%

76.5% 81.1%

81.1% 86.0%

86.0% 81.8%

84.0% 76.8%

79.0% 77.3%

78.3%

Profile “Edu” 91.8%

91.7% 82.9%

84.5% 78.1%

83.3% 77.3%

79.3%

Profile “Geek” 85.1%

85.3% 82.1%

85.6% 85.1%

85.1% 83.4%

84.5% 77.7%

80.3% 78.3%

78.3%

Profile “Private” 82.5%

82.9% 87.6%

91.2% 92.9%

93.1%

tures linguistic and presentation-related aspects, text statistics, word set concen-

tration measures, URL pervasion, and word frequency classes, acceptable clas-

siﬁcation results can be achieved: Our analysis reveals that about 80% of the

documents are assigned correctly with a multiclass classiﬁer.

Chapter 4

Tira: A Software Architecture

for Personal Information

Retrieval

Information retrieval is not a universal answer to a generic information need

problem but a collective term for myriad solutions to individual information need

problems. To become an eﬀective means, retrieval technology must be adapted to

personal information needs, which pertains among others to the following points:

(1) Personal Data. Document sources on which retrieval tasks are carried out

include local hard drives, the Web, or intranets.

(2) Personal Preferences. Typical preferences are language and local settings,

or an individual style for result preparation.

(3) Personal Skills. This characteristic comprises a user’s creativity to for-

mulate queries, his/her ability to improve queries iteratively upon search

engine feedback, or background knowledge about the retrieval strategies of

search engines.

(4) Personal Knowledge. Even when a personal query formulation skill is highly

developed, retrieval success still depends on a user’s knowledge of the query

domain and the underlying collection (e.g. technical terms). This observa-

tion applies especially to closed collections or topic-centered collections in

corporate intranets.

88 Chapter 4. Tira: A Software Architecture for Personal IR

(5) Personal IR Tasks. Advanced personal information needs cannot be suit-

ably addressed with a keyword query approach but require the statement

of a tailored IR process. Examples include plagiarism detection, opinion

extraction, and ﬁltering according to document quality.

We argue that the current generation of IR tools is not ﬂexible enough to

address the above points, especially Point 5. The “course of action” in current

IR tools is hard-wired, i. e., a user is restricted to specify a query along with a

few parameters and cannot adapt or even design the retrieval process itself. We

propose an IR software architecture that follows a service composition paradigm:

Given an advanced IR problem, a tailored tool that solves this problem shall be

constructed by simply selecting and connecting services from a set of “IR building

blocks”. Our architecture allows to specify and to store personal IR tasks on a

user’s personal device and to execute these tasks in a distributed environment.

The remainder of this chapter is organized as follows. Section 4.1 relates IR

theory to IR software and motivates the service-oriented approach, Section 4.2

discusses formalisms to specify IR processes, and Section 4.3 introduces architec-

tural concepts behind Tira.

4.1 From IR Theory to IR Software

The operationalization of an IR process is far oﬀ from being a standard software

engineering task. Current implementation practice is to maintain software li-

braries that provide generic IR functionality, and to reuse them in other projects.

Although this practice has approved in general settings, the IR process design

situation comes with properties that allow for a more powerful modeling perspec-

tive:

•IR processes are composed of rather autonomous software building blocks,

which are called modules here. Basically, each module provides a service

that transforms an input data structure into an output data structure.

Examples for such modules include import ﬁlters, clustering algorithms,

validity measures, ranking functions, classiﬁers, language taggers, and vi-

sualization algorithms.

4.2. Speciﬁcation of IR Processes 89

•Information retrieval theory yielded diﬀerent solutions for one and the

same task or for a class of related tasks.1Examples include the diﬀer-

ent approaches to stemming (statistical algorithms, rule-based algorithms

[106, 140]) and to keyword extraction (internal versus external methods,

corpus-based methods).

•Several tasks within an IR process are addressed with a parameterizable

base algorithm.2Examples include the language-speciﬁc stemming and

stopword ﬁltering [107], which take language-speciﬁc rules or word lists as

their input.

•IR processes are subject to frequent change: they are optimized, tested

with new ideas, and adapted to changing information needs.

•Typically, various parts of an IR process can be executed in parallel, es-

pecially when documents are analyzed with respect to diﬀerent objectives.

An example is the intrinsic similarity analysis of a document collection with

respect to topic, to genre, as well as to writing style [61, 128, 94, 135].

•A set of standard modules, which are useful for virtually any IR process,

can be identiﬁed. Examples include modules for stemming, modules for

stopword removal, and conversion modules for binary formats like Adobe

Acrobat (PDF) or Microsoft Word.

The outlined points exhibit the modular nature of IR processes and, in par-

ticular, the beneﬁts when this nature is actually exploited. For this we propose

a two-step procedure: In a ﬁrst step an IR process is speciﬁed in a diagrammed

form; in a second step, this speciﬁcation is automatically instantiated and de-

ployed as a distributed software system.

4.2 Speciﬁcation of IR Processes

Consider as an example an IR task where a document shall be categorized accord-

ing to both a given topic taxonomy and a given genre taxonomy [92]. Figure 4.1

1Observe the connection to the Strategy Design Pattern described in [48].

2Observe the connections to the Factory Design Pattern and the Decorator Pattern described

in [48].

90 Chapter 4. Tira: A Software Architecture for Personal IR

Input: URL u, dictionary dict, stopword list stl.

Output: genre and topic class for the document at URL u.

Text ht=download(u);

Text plainText=removeHTMLTags(ht);

Text filteredText=removeStopwords(plainText, stl);

FeatureVector topicModel=buildTopicModel(filteredText, dict);

Language lang=detectLanguage(plainText);

FeatureVector presentationFeatures=buildPresentationFeatures(ht);

FeatureVector posFeatures=buildPOSFeatures(plainText, language);

FeatureVector genreModel=union(presentationFeatures, posFeatures);

int topicClass=classifyTopic(topicModel);

int genreClass=classifyGenre(genreModel);

return(topicClass, genreClass);

Figure 4.1: IR process for the sample categorization task, speciﬁed in pseudo code.

depicts a speciﬁcation of the underlying IR process in pseudo code: The topic

model and the genre model that are constructed from the document found at

an URL uform the input for previously built classiﬁers. Note that several text

representations, including HTML text, plain text, and ﬁltered text, are necessary

to perform the outlined task.

This kind of speciﬁcation is current practice in a—what we call—library-based

modeling approach, but it does not take the nature of IR processes into account:

(1) the replacement of a module entails tedious and error-prone code and data

structure replacements, (2) it requires in-depth knowledge concerning the library,

(3) the exploitation of the concurrency between particular subtasks leads to an

inﬂexible design since such behavior must be hard-wired in the underlying execu-

tion model (in the form of threads or remote function calls), (4) the deployment

strategy must be hard-wired as well.

We propagate to specify an IR process at a conceptual level, by means of a

diagrammed modeling language. In the past, diﬀerent modeling tools have been

proposed for similar purposes; they can be classiﬁed according to the following

scheme [143]:

(1) control ﬂow dominant or state-oriented: ﬁnite state machines, UML state

charts

(2) data ﬂow dominant or activity-oriented: data ﬂow graphs, Petri nets,

marked graphs, UML activity diagrams

4.2. Speciﬁcation of IR Processes 91

(3) structure-oriented: component connectivity diagrams, UML class

diagrams, UML deployment diagrams

(4) time-oriented: UML time diagrams

(5) data-oriented: entity relationship diagrams

(6) hybrid, a combination of the above mentioned principles: control/data

ﬂow graphs

Most IR processes can be considered as data ﬂow dominant, i. e., they are

invoked by a user who asks to process a query, and none of the involved modules

can be executed until its preceding modules have delivered their data.

In addition to prescribing data dependencies, a modeling approach for IR

processes must allow for deﬁning concurrency (branching and synchronization)

since parts of an IR process may be executed in parallel. Moreover, a modeling

approach should support explicit typing in order to analyze module composition

constraints with respect to input and output parameters. Finally, depending on

the modeling granularity, it can be useful to deﬁne iterations on parts of an IR

process as well as conditions on the produced data.

In the following, selected modeling tools are discussed with respect to the

speciﬁcation of IR processes.

4.2.1 Petri Nets

A Petri net [102, 143] is a tuple N=P, T, F, c, w, m0,where

•P={p1,...,p

m}is a set of places,

•T={t1,...,t

n}is a set of transitions,

•P∩T=∅,

•F⊆(P×T)∪(T×P) is a ﬂow relation; the elements in Fare also called

edges,

•c:P→N∪{∞}is a function that deﬁnes the capacity restriction of the

places,

•w:F→Nis a function that deﬁnes the edge weights,

•m0:P→N0is an initial marking with ∀p∈P:m0(p)≤c(p).

92 Chapter 4. Tira: A Software Architecture for Personal IR

Download

Plain text extraction

Stopword removal

Topic model builder

Topic classifier

URL

Hypertext

Filtered text

Topic model

Topic class

Presentation

feature builder

POS feature

builder

Stopword

list

Plain text

Dictionary

Feature union

Genre classifier

Genre model

Genre class

POS features

Language

Presentation

features

Language

detection

Copy

Plain text

Figure 4.2: IR process for the sample categorization task, speciﬁed as Petri net.

A Petri net is a bipartite graph where each p∈Pcontains up to c(p)tokens.

Places are usually displayed as circles; tokens are depicted as bullets within the

respective circles. A transition is represented as a bar, and each directed edge

from Fis displayed as an arrow between the transitions and the places. The

marking m0deﬁnes an initial distribution of tokens, which will change depending

on the ﬁring of transitions; m:P→N0denotes an arbitrary marking. Figure 4.2

shows a Petri net of the IR process for our sample categorization task.

Petri Net Semantics For x∈P∪Tlet •x={y|(y,x)∈F}denote the set

of direct predecessors of x; likewise, let x•={y|(x, y)∈F}denote the set of

x’s direct successors. A transition t∈Tis called enabled under a given marking

mif

(1) ∀p∈•t\t•:m(p)≥w(p, t)

(2) ∀p∈t•\•t:m(p)≤c(p)−w(t, p)

(3) ∀p∈t•∩•t:m(p)≤c(p)−w(t, p)+w(p, t)

An enabled transition tcan ﬁre: for each input place p∈•tanumberof

w(p, t) tokens is consumed (deleted), and w(t, p) marks are produced (added) for

each p∈t•.

4.2. Speciﬁcation of IR Processes 93

Discussion. In our scenario the transitions correspond to modules and the

tokens represent the data passed between the modules. Petri nets can model

sequential as well as concurrent processing, see Figure 4.2: analysis tasks run in

parallel, they may be synchronized when displaying results to a user.

Petri nets are well researched: various tools for their analysis and simulation

have been developed in the last forty years, including algorithms that determine

reachability or deadlocks in a net. However, Petri net tokens are indistinguish-

able, and hence the handling of diﬀerent data types cannot be modeled—a fact,

which renders Petri nets unusable for our purposes. Another shortcoming relates

to the missing possibility to specify a processing order for the produced data: it

is questionable whether a FIFO strategy is always desirable.

To circumvent the data type restriction colored Petri nets could be employed

[65]. But even with this extension the modeling of control ﬂows such as iterations

remains fairly restricted. Since a Petri net cannot “look inside” a token, control

ﬂows that depend on the content of data (typical for many IR processes) cannot

be modeled.

4.2.2 Data Flow Graphs

and Control/Data Flow Graphs

A data ﬂow graph G=V,Eis a directed graph where each node represents a

task and where each directed edge deﬁnes a data ﬂow between its incident nodes.

Processing semantics: a task v∈Vcan only be executed if all u∈Vwith

(u, v)∈Ehave already been executed. Figure 4.3 shows a data ﬂow graph of the

IR process for our sample categorization task.

If Petri net transitions and Petri net places are substituted for a data ﬂow

graph’s nodes and edges respectively, one obtains a Petri net isomorphic to Fig-

ure 4.2. Hence, the shortcomings of Petri nets still apply to data ﬂow graphs.

However, the problem that control structures like iterations cannot be modeled

is tackled with the more powerful control/data ﬂow graphs, CDFGs. This hybrid

approach complements data ﬂow graphs with control ﬂow edges, which can be

used to model control ﬂow alternatives as well as iterations. Typically, control

ﬂow edges deﬁne alternative paths of which exactly one is followed according to

a condition.

94 Chapter 4. Tira: A Software Architecture for Personal IR

Download

Plain text extraction

Stopword removal

Topic model builder

Topic classifier

Hypertext

Filtered text

Topic model

out

Presentation

feature

builder POS feature

builder

Stopword

list

Plain text

Dictionary

Feature union

Genre classifier

Language

detection

out

Genre model

Figure 4.3: IR process for the sample categorization task, speciﬁed as data ﬂow graph.

Discussion. CDFGs are powerful enough to model complex IR processes that

include branching and iterations. However, the data ﬂow component in Figure 4.3

shows that data types as well as synchronization are only implicitly modeled, by

means of labeled edges. This deﬁcit is addressed within the following, more

intuitive UML modeling approach.

4.2.3 UML Activity Diagrams

The UML activity diagrams [100] combine novel ideas from Web service ﬂow

languages like BPEL [7] with traditional concepts like the token concept from

Petri nets in order to specify control ﬂow and data ﬂow between so-called actions.

In particular, action nodes, object nodes, and control nodes are connected with

directed edges that specify either a data ﬂow or a control ﬂow [58]. Figure 4.4

shows an activity diagram of the IR process for our sample categorization task.

Similar to CDFGs, actions nodes represent tasks, which are software modules

in our modeling scenario. Object nodes may be placed between action nodes,

symbolizing data objects that are transferred between action nodes. Alterna-

tively, connectors, called “pins”, which are attached to the action nodes, can

specify the data type that is accepted as input or produced as output by an

action node.

Control nodes further divide into decision nodes, merge nodes, fork nodes,

and join nodes. Decision nodes delegate control ﬂow exclusively to one of several

4.2. Speciﬁcation of IR Processes 95

Download

Plain text

extraction

Feature union

Stopword

removal

Topic

model builder

Topic

classifier

URL

u:URL

hyperText:Text

plainText:Text

stl:Text

filteredText:Text

dict:Dictionary

topicModel:

FeatureVector

Presentation

feature builder

Language

detection

POS

feature builder

Genre

classifier

genreModel:

FeatureVector

posFeatures:

FeatureVector

presentation

Features:

FeatureVector

lang:Language

hyperText:Text hyperText:Text

topic classgenre class

Figure 4.4: IR process for the sample categorization task, speciﬁed as UML activity

diagram.

possible branches, depending on a condition that is bound to the node; their coun-

terpart are the merge nodes. Concurrency is modeled with fork nodes and join

nodes; they indicate the concurrent execution and the subsequent synchroniza-

tion of control ﬂows and data ﬂows. Finally, buﬀer nodes, which are specialized

object nodes, can be used to deﬁne a buﬀering strategy for concurrent processing.

An activity diagram may be partitioned into so-called “swimlanes” in order to

group nodes and edges according to common properties. Such logical groups are

oriented at the user-deﬁned semantics (a closed sub-retrieval-task for example)

and allow for the structuring of complex IR processes.

Discussion. Apart from being intuitive, UML activity diagrams are widely ac-

cepted as modeling tool. Moreover, advanced concepts that allow for the modeling

of data streams, parameter sets, stereotypes, action and time events, exceptions,

and exception handlers render this diagram form ideal for our purposes. In the

upcoming UML 2.1 speciﬁcation, conditional nodes and iteration nodes, which re-

mind of block diagram elements, will probably be included, making UML activity

diagrams even more intuitive for modeling control ﬂows.

96 Chapter 4. Tira: A Software Architecture for Personal IR

4.3 Operationalizing IR Processes with Tira

In terms of the model driven architecture (MDA) paradigm, an IR process spec-

iﬁed with a UML activity diagram can be considered as a platform independent

model (PIM) [98]: UML activity diagrams are not bound to programming lan-

guages, operating systems, middleware, or system architectures. Consequently,

to make an IR process operable, a target platform has to be chosen, and the PIM

must be transformed into an executable platform speciﬁc model (PSM).

In this context a platform denotes the next (lower) abstraction layer on which

a particular model is represented in a more concrete form. For example, J2EE

and CORBA are possible platforms for a business process implementation, and,

the Java Development Kit in turn is a possible platform for a CORBA imple-

mentation. As the latter example shows, the transformations along descending

platform layers prescribe the path by which a PIM is rendered executable. The

OMG denotes a platform that is in-between the PIM and executable code as

middleware platform [98].

Operating system

Java Development Kit

XML

object

serialization

XML

object

visualization

PIM specification

PSM generation

Middleware

platform

Computing

platform

IR module library

Web service abstraction

UML activity diagrams:

compilation, deployment, processing

UML activity diagrams:

GUI, modeling, management

TIRA

Figure 4.5: The layer architecture of Tira on top of a computing platform. The IR

module library is not part of Tira but provides an open and extensible container for

IR-related algorithms and data structures.

Just as in other MDA-based application scenarios our objective is to deﬁne,

to develop, and to implement the transformation of the PIM towards a lower

platform layer. However, in contrast to many MDA-based application scenarios

we are not interested in the handling of a variety of middleware platforms and

their related transformations, but in the development of a particular middleware

platform that is suited to execute personal information retrieval tasks. Put an-

4.3. Operationalizing IR Processes with Tira 97

other way: Our focus is on rapid prototyping, reduced turn-around times, and

minimized eﬀort for implementation and test.

Figure 4.5 shows our implemented proposal for the layer architecture of Tira:

The input PIM is an IR process, modeled as UML activity diagram; a PIM can be

compiled and deployed, becoming an executable PSM this way. Modules with the

core IR functionality are comprised in an open IR library; the library encapsulates

the modules as Web services to make them transparently usable from a PSM via

remote function calls (see also [95]). Data objects that are required or produced

by the IR modules are materialized as XML objects.

4.3.1 From PIM to PSM

An activity diagram, either loaded from ﬁle or modeled interactively with the

Tira GUI, is represented as an object structure in computer memory. The

structure is oriented at the UML meta-model [99] and reﬂects the important

elements of activity diagrams, i. e., there are instances of action nodes, fork nodes,

etc., which are interconnected by data nodes. The action nodes are bound to IR

modules, which in turn are encapsulated as Web services; the data nodes are

bound to XML objects.

Figure 4.6 shows the part of Tira’s class design that models how action nodes

are simulated. Instead of modeling an IR module as subclass of the action node

class, action nodes are instantiated using a factory class and conﬁgured with a

symbolic action name. The factory class looks up the URL of the Web service

that is associated with the action name at Tira’s service registry and provides

the action node instance with this URL. Each IR module that is registered in the

service registry comes with a self-description in XML format: input as well as

output data types are speciﬁed in XML schema. This approach keeps Tira open,

since it allows for registering and executing new IR modules without recompiling

Tira’s source.

The object structure is provided with a Petri-net-like token semantics. Sim-

ulating the activity diagram means to check the availability of an action node’s

input data, to allocate processing resources, and to call the corresponding Web

service. On delivery of a Web service result, the associated data tokens are prop-

agated in the object structure.

98 Chapter 4. Tira: A Software Architecture for Personal IR

ForkNode JoinNode

Node

ActionNodeFactory

+ setName(Name)

+ newActionNode()

ActionNode

- input: Vector<DataNode>

- output: Vector<DataNode>

- next: Vector<Node>

- location: URL

+ setInput(Vector<DataNode>)

+ getOutput(): Vector<DataNode>

+ setNext(Vector<DataNode>)

+ isExecutable(): Boolean

+ execute()

ServiceRegistry

+ lookup(Name):URL

+ addService(Name, URL)

DataNode

- location: URL

- type: String

Figure 4.6: A part of Tira’s software design given as UML class diagram.

4.3.2 The Tira Middleware Platform

The functions in the IR module library take objects as input and return new

objects. Instead of supplying the Web service stubs with serializations of these

objects, parameter passing is realized with the call-by-name paradigm in its most

generic form: a parameter must be a URL, pointing to a serialized XML rep-

resentation of the respective object. This approach comes with the following

advantages.

(1) When executing an IR process, a client needs not to transfer the interme-

diate data between two Web service calls; instead, an invoked Web service

fetches the data directly from the given URLs, resulting in reduced data

transfer costs.

(2) When two consecutive modules are executed whose corresponding Web ser-

vices are located on the same server machine, data transfer costs are even

lower since the XML ﬁles can be directly accessed.

(3) The transfer of URL references instead of data objects enables low-bandwidth

machines to be fully functional clients. In particular, home users are able

to execute personalized IR processes.

(4) The use of URLs opens the entire World Wide Web as address space for

data hosting and data sharing.

Because of the broad acceptance of XML the serialization of data objects as

XML streams is the means of choice for data exchange. Within Tira powerful

4.3. Operationalizing IR Processes with Tira 99

Figure 4.7: The left screenshot shows our meta search engine, AIsearch, whose under-

lying IR process is deﬁned in Tira. The right screenshot shows the Tira editor for

the speciﬁcation of activity diagrams.

parser generation tools and an XML language binding (JAXB) are responsible

for the reading and writing of XML object streams [142].

Intermediate data that is produced during the execution of an IR process is

made available for visual inspection, which helps debugging IR processes and lets

a user reason about the underlying process. For this purpose the XSL transfor-

mation technology is intensively used in Tira; XSL stylesheets are easy to adapt

and to maintain, and they are suited to produce any desired format from the

data.

4.3.3 Tira at Work

Figure 4.7 (left) shows a screenshot of our meta search engine AIsearch, rebuilt

as a Tira application. AIsearch takes as input a keyword query, meta searches

commercial Web search engines, and categorizes the found documents according

to the identiﬁed topics [91]. The underlying IR process extracts the delivered

snippets, removes stop words, and stems the remaining words before compressed

term vectors are built. Based on the term vectors, a clustering with the Ma-

jorClust algorithm [139] is generated, and, for browsing purposes, meaningful

text labels for the clusters are constructed with with a statistical text covering

algorithm.

Figure 4.7 (right) shows a screenshot of the Tira activity diagram editor,

which is implemented as a Java applet. The left hand side of the applet shows

a selection of available IR modules. A double click instantiates a module, which

100 Chapter 4. Tira: A Software Architecture for Personal IR

is then displayed graphically on the right hand side of the Applet. The arrows

between the modules display the data ﬂow. A click on a data node invokes

the associated XSL transformation, which translates the corresponding data to

XHTML and displays the results in a browser.

4.4 Concluding Remarks

IR processes have become ubiquitous, be it in the form of search engines at home

or at work, on mobile devices or on workstations, or as retrieval components in

ﬁle systems, document repositories, databases, or knowledge management tools.

The reason for this pervasiveness is the growing information need, the diversity of

IR tasks, and the desired degree of personalization. Although specialized retrieval

algorithms have been developed in the past, less work has been done on modeling

and operationalizing IR processes from a software engineering point of view.

This chapter contributes to this aspect. Starting from a discussion of modeling

approaches for IR processes we introduced Tira, a ﬂexible MDA solution for the

rapid prototyping of tailored IR tools. Tira allows a user to model an IR process

as UML activity diagram, which can be transformed to a problem speciﬁc model

and, based on the Tira middleware platform, executed by the press of a button.

Currently, Tira is a research prototype and further extended and enhanced

in our working group. The Tira approach is rather independent of particular IR

algorithms and data structures; it is intended to be built around an existing IR

module library.

Chapter 5

Conclusion and Outlook

Today, the prevailing form in which search results are presented when querying

document collections is as a list of document snippets. Analyses of human search

behaviour have shown that most keyword queries are short queries—a fact that

leads to long result lists that comprise mainly irrelevant hits. Since human ca-

pabilities in information processing are rather limited, this form of search result

organization and presentation is questionable. A technique which contributes to

solve this problem is document categorization.

Categorization according to topic.

The performance of unsupervised document categorization algorithms is strongly

inﬂuenced by the underlying document representation. We introduced the suf-

ﬁx tree document model along with new similarity measures, which compile full

term order information into the similarity values. Experiments have shown that

this model is able to substantially improve the unsupervised categorization per-

formance.

Clustering algorithms require parameters such as the desired number of clus-

ters, density thresholds, and neighborhood speciﬁcations. The choice of these

parameters aﬀects the clustering quality to a great extent. We introduced ρ,

an internal cluster quality measure, which allows us to compare the quality of

clusterings in an unsupervised manner. In particular, it ﬁnds the best categoriza-

tions in a set of clusterings, each of which has been generated by trying diﬀerent

clustering algorithms or parameters. The experiments have shown that ρcorre-

lates better with the F-Measure with respect to reference categorizations of test

collections than traditional validity indices.

101

102 Chapter 5. Conclusion and Outlook

A categorization has to be presented to a user, i. e. the category structure

must be visualized, and topic labels for the found categories must be identiﬁed.

We formalize desirable properties for category labels and propose the Weighted

Centroid Covering algorithm for topic identiﬁcation. The results of our ambitious

experiments are convincing: precision rates between 50% and 75% for up to ﬁve

extracted topic labels have been achieved. Our new evaluation methodology

renders performance measurement reproducible and comparable.

Categorization according to genre.

Document categorization according to genre is considered useful by the vast ma-

jority of search engine users; in particular, over 90% of 286 interrogated students

considers genre categorization in connection with Web search very useful or some-

times useful.

We identiﬁed strong features that allow for building expressive document rep-

resentations for genre classiﬁcation. Feasibility studies have shown that a mul-

ticlass classiﬁcation with eight classes using a discriminant analysis can be done

at a convincing average performance of about 80% classiﬁed correctly. If a single

genre category shall be distinguished from the remainder, the classiﬁcation per-

formance is around 95%. Personalized classiﬁers achieve error rates in the range

between the two mentioned ﬁgures.

Software engineering for personalized IR.

Although IR processes exhibit a modular nature, little work has been done in the

intersecting ﬁelds of software engineering and information retrieval. We propose

Tira, a model driven architecture (MDA) approach for IR process speciﬁcation

and operationalization. This technology separates IR process speciﬁcation from

implementation, and it allows to rapidly develop new prototypes, to quickly adapt

IR processes to personal preferences, and to test new ideas at the push of a button.

What comes next?

The ﬁrst commercial meta search engine that clustered search results was Viv´ısimo 1.

Although there are other products that provide categorization functionality2,

1http://www.vivisimo.com

2e. g. http://www.turbo10.com

103

Viv´ısimo convinces with robust and mature clustering technology. However, like

the suﬃx tree document model, the new validity inices, and the analysis of the

STC heuristic has shown, there is still room for optimizing categorization perfor-

mance.

Meanwhile, two-class genre search engines have evolved; they include Froogle3,

a search engine for shops, and Google Scholar4, an article search engine. However,

there is still no Web search engine that lets a user specify which genres to include

or exclude from query results. We expect that a search engine providing such a

functionality would be widely appreciated. However, to put such an engine to

work, it is advantageous to classify documents according to genre at indexing

time since an on-the-ﬂy computation of the according document representation

may be too expensive in terms of runtime.

Tira performs well in our labs; however, it must stand the industry test,

and the modeling granularity must prove to be practical in a wider range of

application scenarios. Also, a deployment of the PIM that is given in the form of

a UML activity diagram to high-performance platforms would be interesting to

investigate; e. g. theoretical models for the automatic deployment of IR process

components to architectures like P2P nets, cluster computers, or grids would be

interesting. In particular, the associated tasks to estimate job lengths opens a

new ﬁeld in IR; likewise, performance guarantees would be highly appreciated,

especially in distributed commercial applications.

3http://www.froogle.com

4http://scholar.google.com

Bibliography

[1] Dimitris Achlioptas and Frank McSherry. On spectral learning of mixtures

of distributions. In COLT, pages 458–469, 2005.

[2] Gediminas Adomavicius and Alexander Tuzhilin. Toward the next genera-

tion of recommender systems: A survey of the state-of-the-art and possible

extensions. IEEE Trans. Knowl. Data Eng., 17(6):734–749, 2005.

[3] R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules between

Sets of Items in Large Databases. In Peter Buneman and Sushil Jajodia,

editors, Proceedings of the 1993 ACM SIGMOD International Conference

on Management of Data, Washington D. C., May 1993. ACM Press.

[4] R. Agrawal and R. Srikant. Fast Algorithms for Mining Association Rules.

In Proc. of the 20th Int’l Conference on Very Large Databases, Santiago,

Chile, September 1994, 1994.

[5] Giambattista Amati. Frequentist and bayesian approach to information

retrieval. In Lalmas et al. [79], pages 13–24. ISBN 3-540-33347-9.

[6] Gianni Amati and C.J. van Rijsbergen. Probabilistic models of informa-

tion retrieval based on measuring the divergence from randomness. ACM

Trancactions on Information Systems, 20(4):357–389, oct 2002.

[7] Tony Andrews, Francisco Curbera, Hitesh Dholakia, Yaron Goland,

Johannes Klein, Frank Leymann, Kevin Liu, Dieter Roller, Doug

Smith, Satish Thatte, Ivana Trickovic, and Sanjiva Weerawarana. Busi-

ness process execution language for web services (bpel4ws) version 1.1.

http://www-128.ibm.com/developerworks/library/specification/ws-bpel/,

May 2003.

105

106 BIBLIOGRAPHY

[8] Association for Computing Machinery. The ACM Computing Classiﬁcation

Systems. http://www.acm.org/class/1998, 1998.

[9] Thomas Bailey and John Cowles. Cluster Deﬁnition by the Optimization

of Simple Measures. IEEE Transactions on Pattern Analysis and Machine

Intelligence, September 1983.

[10] Richard E. Bellman. Adaptive Control Processes. Princeton University

Press, Princeton, NJ, 1961.

[11] Michael W. Berry, Susan T. Dumais, and Gavin W. O’Brien. Using Linear

Algebra for Intelligent Information Retrieval. Technical Report UT-CS-94-

270, Computer Science Department, dec 1994.

[12] J. C. Bezdek, W. Q. Li, Y. Attikiouzel, and M. Windham. A Geometric

Approach to Cluster Validity for Normal Mixtures. Soft Computing 1,

September 1997.

[13] J. C. Bezdek and N. R. Pal. Cluster Validation with Generalized Dunn’s

Indices. In N. Kasabov and G. Coghill, editors, Proceedings of the 2nd in-

ternational two-stream conference on ANNES, pages 190–193, Piscataway,

NJ, 1995. IEEE Press.

[14] Douglas Biber. The multidimensional approach to linguistic analyses of

genre variation: An overview of methodology and ﬁndings. In Computers

and the Humanities, volume 26, pages 331–345, 1992.

[15] Avrim L. Blum and Pat Langley. Selection of relevant features and examples

in machine learning. Artif. Intell., 97(1-2):245–271, 1997. ISSN 0004-3702.

doi: http://dx.doi.org/10.1016/S0004-3702(97)00063-5.

[16] Elisabeth Sugar Boese and Adele E. Howe. Eﬀects of Web Document Evo-

lution on Genre Classiﬁcation. In Proceedings of the CIKM’05. ACM Press,

November 2005.

[17] Ivan Bretan, Johan Dewe, Anders Hallberg, and Niklas Wolkert. Web-

speciﬁc genre visualization, 1999.

BIBLIOGRAPHY 107

[18] A. Buja, D. F. Swayne, M. Littman, N. Dean, and H. Hofmann. XGvis:

Interactive Data Visualization with Multidimensional Scaling. Journal of

Computational and Graphical Statistics, 2001.

[19] WWW Consortium. OWL Web Ontology Language Semantics and Ab-

stract Syntax. http://www.w3.org/TR/owl-semantics/, 2004.

[20] WWW Consortium. RDF Semantics. http://www.w3.org/TR/rdf-mt/,

2004.

[21] WWW Consortium. RDF/XML Syntax Speciﬁcation.

http://www.w3.org/TR/rdf-syntax-grammar/, 2004.

[22] W. Bruce Croft. A model of cluster searching based on classiﬁcation. In-

formation Systems, 5:189–195, 1980.

[23] Kevin Crowston and Marie Williams. The eﬀects of linking on genres of

web documents. In HICSS, 1999.

[24] Douglass R. Cutting, David R. Karger, and Jan O. Pedersen. Constant

interaction-time scatter/gather browsing of very large document collections.

In SIGIR’93: Proceedings of the 16th annual international ACM SIGIR

conference on Research and development in information retrieval, pages

126–134, New York, NY, USA, 1993. ACM Press. ISBN 0-89791-605-0.

doi: http://doi.acm.org/10.1145/160688.160706.

[25] Sanjoy Dasgupta. Learning Probability Distributions. PhD thesis, Univer-

sity of California at Berkeley, 2000.

[26] D.L. Davies and D.W. Bouldin. A Cluster Separation Measure. IEEE

Transactions on Pattern Analysis and Machine Learning, 1(2), 1979.

[27] Scott C. Deerwester, Susan T. Dumais, Thomas K. Landauer, George W.

Furnas, and Richard A. Harshman. Indexing by Latent Semantic Analysis.

Journal of the American Society of Information Science, 41(6):391–407,

1990.

[28] Simon Dennis. The sydney morning herald word

database. http://www2.psy.uq.edu.au/CogPsych/Noetica/

OpenForumIssue4/SMH.html, 1995.

108 BIBLIOGRAPHY

[29] Simon Dennis, Peter Bruza, and Robert McArthur. Web searching: A

process-oriented experimental study of three interactive search paradigms.

JASIST, 53(2):120–133, 2002.

[30] Nigel Dewdney, Carol VanEss-Dykema, and Richard MacMillan. The form

is the substance: Classiﬁcation of genres in text. In Proceedings of ACL

Workshop on HumanLanguage Technology and Knowledge Management,

2001.

[31] Inderjit S. Dhillon. Co-clustering documents and words using bipartite

spectral graph partitioning. In KDD’01: Proceedings of the seventh ACM

SIGKDD international conference on Knowledge discovery and data min-

ing, pages 269–274, New York, NY, USA, 2001. ACM Press. ISBN 1-58113-

391-X. doi: http://doi.acm.org/10.1145/502512.502550.

[32] P. Diaconis and D. Freedman. Asymptotics of graphical projection pursuit.

Annals of Statistics, 12:793–815, 1984.

[33] M. Dimitrova, A. Finn, N. Kushmerick, and B. Smyth. Web genre visual-

ization. In Proceedings of the Conference on Human Factors in Computing

Systems, 2002.

[34] Richard C. Dubes and Guangzhou Zeng. A test for spatial homogeneity in

cluster analysis. Journal of Classiﬁcation, 4:33–56, 1987.

[35] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classiﬁcation

(2nd Edition). Wiley-Interscience, 2000. ISBN 0471056693.

[36] J.C. Dunn. Well seperated clusters and optimal fuzzy partitions. Journal

of Cybernetics, 4(1974):95–104, 1974.

[37] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A Density-Based Algorithm

for Discovering Clusters in Large Spatial Databases with Noise. In Pro-

ceedings of the 2nd International Conference on Knowledge Discovery and

Data Mining (KDD96), 1996.

[38] B. S. Everitt. Cluster analysis. New York, Toronto, 1993.

BIBLIOGRAPHY 109

[39] Aidan Finn and Nicholas Kushmerick. Learning to Classify Documents

According to Genre. In IJCAI-03 Workshop on Computational Approaches

to Style Analysis and Synthesis, 2003.

[40] K. Florek, J. Lukaszewiez, J. Perkal, H. Steinhaus, and S. Zubrzchi. Sur

la liason et la division des points d’un ensemble ﬁni. Colloquium Methe-

maticum, 2, 1951.

[41] D. B. Fogel and L. J. Fogel. Special Issue on Evolutionary Computation.

IEEE Transaction of Neural Networks, 1994.

[42] E. B. Fowlkes and C. L. Mallows. A method for comparing two hierarchical

clusterings. Journal of the American Statistical Association, 78:553–569,

1983.

[43] W. B. Frakes. Term conﬂation for information retrieval. In SIGIR’84:

Proceedings of the 7th annual international ACM SIGIR conference on Re-

search and development in information retrieval, pages 383–389, Swinton,

UK, 1984. British Computer Society. ISBN 0-521-26865-6.

[44] William B. Frakes and Christopher J. Fox. Strength and Similarity of Aﬃx

Removal Stemming Algorithms. SIGIR Forum, 37(1):26–30, 2003. ISSN

0163-5840. doi: http://doi.acm.org/10.1145/945546.945548.

[45] Norbert Fuhr. Optimum polynomial retrieval functions based on the prob-

ability ranking principle. ACM Trans. Inf. Syst., 7(3):183–204, 1989. ISSN

1046-8188. doi: http://doi.acm.org/10.1145/65943.65944.

[46] Norbert Fuhr. Probabilistic models in information retrieval. Comput. J.,

35(3):243–255, 1992.

[47] Norbert Fuhr and Chris Buckley. A probabilistic learning approach for

document indexing. ACM Trans. Inf. Syst., 9(3):223–248, 1991.

[48] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. Design

Patterns: Elements of Reusable Object-oriented Software. Addison-Wesley

Longman Publishing Co., Inc., 1998. ISBN 0-201-63361-2.

110 BIBLIOGRAPHY

[49] Aristides Gionis, Alexander Hinneburg, Spiros Papadimitriou, and Panayi-

otis Tsaparas. Dimension induced clustering. In KDD ’05: Proceeding of

the eleventh ACM SIGKDD international conference on Knowledge discov-

ery in data mining, pages 51–60, New York, NY, USA, 2005. ACM Press.

ISBN 1-59593-135-X. doi: http://doi.acm.org/10.1145/1081870.1081880.

[50] A. Griﬃths, H.C. Luckhurst, and P. Willett. Using inter-document simi-

larity information in document retrieval systems. Jounal of the American

Society for Information Science, 37:3–11, 1986.

[51] M.A. Hafer and S.F. Weiss. Word segmentation by letter successor varieties.

Information Processing & Management, 10(11-12):371–386, 1974.

[52] Eui-Hong Han and George Karypis. Centroid-Based Document Classiﬁca-

tion: Analysis and Experimental Results. Technical Report 00-017, Uni-

vercity of Minnesota, Department of Computer Science / Army HPC Re-

search Center, March 2000.

[53] Eui-Hong Han, George Karypis, and Vipin Kumar. Text Categorization

Using Weight Adjusted k-Nearest Neighbor Classiﬁcation. In Paciﬁc-Asia

Conference on Knowledge Discovery and Data Mining, pages 53–65, Min-

neapolis, USA, 2001. Univercity of Minnesota, Department of Computer

Science / Army HPC Research Center.

[54] S.P. Harter. A probabilistic approach to automatic keyword indexing. part

i: On the distribution of specialty words in a technical literature. J.ASIS,

26, 1975.

[55] S.P. Harter. A probabilistic approach to automatic keyword indexing. part

ii: An algorithm for probabilistic indexing. J.ASIS, 26, 1975.

[56] Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis:

Scatter/gather on retrieval results. In SIGIR’96: Proceedings of the 19th

annual international ACM SIGIR conference on Research and development

in information retrieval, pages 76–84, New York, NY, USA, 1996. ACM

Press. ISBN 0-89791-792-8. doi: http://doi.acm.org/10.1145/243199.

243216.

BIBLIOGRAPHY 111

[57] Alexander Hinneburg and Daniel A. Keim. Optimal grid-clustering: To-

wards breaking the curse of dimensionality in high-dimensional clustering.

In VLDB ’99: Proceedings of the 25th International Conference on Very

Large Data Bases, pages 506–517, San Francisco, CA, USA, 1999. Morgan

Kaufmann Publishers Inc. ISBN 1-55860-615-7.

[58] Martin Hitz, Gerti Kappel, Elisabeth Kapsammer, and Werner Retschitzeg-

ger. UML @ Work. dpunkt.verlag, 2005. ISBN 3-89864-261-5.

[59] Andreas Hotho. Clustern mit Hintergrundwissen,vol-

ume 286 of Diski. Akademische Verlagsgesellschaft

Aka GmbH, Berlin, 2004. ISBN 3-89838-286-9. URL

http://www.kde.cs.uni-kassel.de/hotho/pub/2004/dissAho.pdf.

[60] L. J. Hubert and J. Schultz. Quadratic assignment as a general data-

analysis strategy. British Journal of Mathematical and Statistical Psychol-

ogy, 29:190–241, 1976.

[61] Georgiana Ifrim, Martin Theobald, and Gerhard Weikum. Learning Word-

to-Concept Mappings for Automatic Text Classiﬁcation. In Proceedings of

the ICML-Learning in Web Search Workshop, 2005.

[62] Makoto Iwayama and Takenobu Tokunaga. Cluster-based text categoriza-

tion: a comparison of category search strategies. In Edward A. Fox, Pe-

ter Ingwersen, and Raya Fidel, editors, SIGIR’95: Proceedings of the 18th

ACM International Conference on Research and Development in Informa-

tion Retrieval, pages 273–281, Seattle, USA, 1995. ACM Press, New York,

US.

[63] A. K. Jain and J. V. Moreau. Bootstrap technique in cluster analysis.

Pattern Recognition, 20(5):547–568, 1987.

[64] Anil K. Jain and Richard C. Dubes. Algorithms for Clustering in Data.

Prentice Hall, Englewood Cliﬀs, NJ, 1990. ISBN 0-13-022278-X.

[65] Kurt Jensen. Coloured Petri Nets. Basic Concepts, Analysis Methods and

Practical Use., volume 1 of Monographs in Theoretical Computer Science.

Springer-Verlag, 1997. ISBN 3-540-60943-1.

112 BIBLIOGRAPHY

[66] George H. John, Ron Kohavi, and Karl Pﬂeger. Irrelevant Features and the

Subset Selection Problem. In William W. Cohen and Haym Hirsh, editors,

Machine Learning: Proceedings of the Eleventh International Conference,

pages 121–129, San Francisco, CA, 1994. Morgan Kaufmann.

[67] S. C. Johnson. Hierarchical clustering schemes. Psychometrika, 32, 1967.

[68] Jussi Karlgren and Douglass Cutting. Recognizing text genres with simple

metrics using discriminant analysis. In Proceedings of the 15th. Interna-

tional Conference on Computational Linguistics (Coling 94),volumeII,

pages 1071 – 1075, Kyoto, Japan, 1994.

[69] G. Karypis, E.-H. Han, and V. Kumar. Chameleon: A hierarchical cluster-

ing algorithm using dynamic modeling. Technical Report Paper No. 432,

University of Minnesota, Minneapolis, 1999.

[70] George Karypis and Eui-Hong Han. Concept indexing: A fast dimensional-

ity reduction algorithm with applications to document retrieval and catego-

rization. Technical report tr-00-0016, University of Minnesota, 2000. URL

citeseer.ist.psu.edu/karypis00concept.html.

[71] Leonard Kaufman and Peter J. Rousseuw. Finding Groups in Data. Wiley,

1990.

[72] Brett Kessler, Geoﬀrey Nunberg, and Hinrich Sch¨utze. Automatic detec-

tion of text genre. In Philip R. Cohen and Wolfgang Wahlster, editors,

Proceedings of the Thirty-Fifth Annual Meeting of the Association for Com-

putational Linguistics and Eighth Conference of the European Chapter of

the Association for Computational Linguistics, pages 32–38, Somerset, New

Jersey, 1997. Association for Computational Linguistics.

[73] R. W. Klein and R. C. Dubes. Experiments in Projection and Clustering

by Simulated Annealing. Pattern Recognition, 22:213–220, 1989.

[74] T. Kohonen. Self Organization and Assoziative Memory. Springer, 1990.

[75] T. Kohonen, S. Kaski, K. Lagus, J. Salojrvi, J. Honkela, V. Paatero, and

A. Saarela. Self organization of a massive document collection. In IEEE

Transactions on Neural Networks, volume 11, may 2000.

BIBLIOGRAPHY 113

[76] Daphne Koller and Mehran Sahami. Toward optimal feature selection. In

International Conference on Machine Learning, pages 284–292, 1996. URL

citeseer.ist.psu.edu/koller96toward.html.

[77] J. B. Kruskal. Multidimensional Scaling by Optimizing Goodness of Fit to

a Nonmetric Hypothesis. Psychometrika, 29(1), March 1964.

[78] Krishna Kummamuru, Rohit Lotlikar, Shourya Roy, Karan Singal, and

Raghu Krishnapuram. A hierarchical monothetic document clustering al-

gorithm for summarization and browsing search results. In Proceedings of

the 13th international conference on World Wide Web, WWW 2004, New

York, NY, USA, May 17-20, 2004, pages 658–665, 2004.

[79] Mounia Lalmas, Andy MacFarlane, Stefan M. R¨uger, Anastasios Tombros,

Theodora Tsikrika, and Alexei Yavlinsky, editors. Advances in Information

Retrieval, 28th European Conference on IR Research, ECIR 2006, Lon-

don, UK, April 10-12, 2006, Proceedings, volume 3936 of Lecture Notes in

Computer Science, 2006. Springer. ISBN 3-540-33347-9.

[80] Dawn Lawrie, W. Bruce Croft, and Arnold L. Rosenberg. Finding topic

words for hierarchical summarization. In SIGIR 2001: Proceedings of the

24th Annual International ACM SIGIR Conference on Research and De-

velopment in Information Retrieval, September 9-13, 2001, New Orleans,

Louisiana, USA, pages 349–357, 2001.

[81] Dawn J. Lawrie. Language Models for Hierarchical Summarization.PhD

thesis, University of Massachusetts at Amherst, 2003b.

[82] Dawn J. Lawrie and W. Bruce Croft. Generating hierarchical summaries for

web searches. In SIGIR 2003: Proceedings of the 26th Annual International

ACM SIGIR Conference on Research and Development in Information Re-

trieval, July 28 - August 1, 2003, Toronto, Canada, pages 457–458, 2003.

[83] Yong-Bae Lee and Sung Hyon Myaeng. Text genre classiﬁcation with genre-

revealing and subject-revealing features. In Proc. 25th annual interna-

tional ACM SIGIR conference on Research and development in informa-

tion retrieval, pages 145–150. ACM Press, 2002. ISBN 1-58113-561-0. doi:

http://doi.acm.org/10.1145/564376.564403.

114 BIBLIOGRAPHY

[84] Thomas Lengauer. Combinatorial Algorithms for Integrated Circuit Layout.

John Wiley & Sons, New York, 1990.

[85] V. I. Levenshtein. Binary codes capable of correcting deletions insertions

and reversals. ISov Phys Dokl, 6:707–710, 1966.

[86] Erel Levine and Eytan Domany. Resampling method for unsupervised es-

timation of cluster validity. Neural Computation, 13:2573–2593, 2001.

[87] Nick Littlestone. Learning quickly when irrelevant attributes abound: A

new linear-threshold algorithm. Machine Learning, 2(4):285–318, 1987.

[88] J. B. MacQueen. Some Methods for Classiﬁcation and Analysis of Multi-

variate Observations. In Proceedings of the Fifth Berkeley Symposium on

Mathematical Statistics and Probability, pages 281–297, 1967.

[89] Y. Matsuo and M. Ishizuka. Keyword Extraction from a Single Document

using Word Co-ocurrence Statistical Information. International Journal on

Artiﬁcial Intelligence Tools, 13(1):157–169, 2004.

[90] Sven Meyer zu Eißen and Benno Stein. Analysis of Clustering Algorithms

for Web-based Search. In Dimitris Karagiannis and Ulrich Reimer, editors,

Practical Aspects of Knowledge Management, volume 2569 LNAI of Lecture

Notes in Artiﬁcial Intelligence, pages 168–178. Springer, December 2002.

ISBN 3-540-00314-2.

[91] Sven Meyer zu Eißen and Benno Stein. The AIsearch Meta Search Engine

Prototype. In Amit Basu and Soumitra Dutta, editors, Proceedings of

the 12th Workshop on Information Technology and Systems (WITS 02),

Barcelona Spain. Technical University of Barcelona, December 2002.

[92] Sven Meyer zu Eißen and Benno Stein. Genre Classiﬁcation of Web Pages:

User Study and Feasibility Analysis. In Susanne Biundo, Thom Fr¨uhwirth,

and G¨unther Palm, editors, KI 2004: Advances in Artiﬁcial Intelligence,

volume 3228 LNAI of Lecture Notes in Artiﬁcial Intelligence, pages 256–

269, Berlin Heidelberg New York, September 2004. Springer. ISBN 0302-

9743.

BIBLIOGRAPHY 115

[93] Sven Meyer zu Eißen and Benno Stein. Wrapper Generation with Patricia

Trees. In Benno Stein, Sven Meyer zu Eißen, and Andreas N¨urnberger,

editors, Machine Learning and Interaction for Text-Based Information Re-

trieval (TIR 04), Workshop Proceedings, pages 69–76. Universit¨at Ulm,

September 2004.

[94] Sven Meyer zu Eissen and Benno Stein. Intrinsic plagiarism detection. In

Mounia Lalmas, Andy MacFarlane, Stefan M. R¨uger, Anastasios Tombros,

Theodora Tsikrika, and Alexei Yavlinsky, editors, Proceedings of the Eu-

ropean Conference on Information Retrieval (ECIR 2006), volume 3936 of

Lecture Notes in Computer Science, pages 565–569. Springer, 2006. ISBN

3-540-33347-9.

[95] Sven Meyer zu Eißen and Benno Stein. Realization of web-based simulation

services. CiI – Computers in Industry: Special Issue on Advanced Computer

Support of Engineering and Service Processes of Virtual Enterprises,57

(3):261–271, 2006. ISSN 0166-3615. doi: http://dx.doi.org/10.1016/j.

compind.2005.12.007.

[96] Sven Meyer zu Eißen, Benno Stein, and Martin Potthast. The Suﬃx Tree

Document Model Revisited. In Klaus Tochtermann and Hermann Mau-

rer, editors, Proceedings of the 5th International Conference on Knowledge

Management (I-KNOW 05), Graz, Journal of Universal Computer Science,

pages 596–603. Know-Center, July 2005.

[97] Thomas M. Mitchell. Machine Learning. McGraw-Hill Higher Education,

1997. ISBN 0070428077.

[98] Object Management Group (OMG). Model driven architecture (mda)

guide. http://www.omg.org/docs/omg/03-06-01.pdf, 2003.

[99] Object Management Group (OMG). The uml metamodel.

http://www.omg.org/cgi-bin/doc?ptc/2004-10-05, 2003.

[100] Object Management Group (OMG). The uniﬁed modeling language (UML)

speciﬁcation, version 2. http://www.uml.org, 2005.

[101] Christos H. Papadimitriou, Hisao Tamaki, Prabhakar Raghavan, and San-

tosh Vempala. Latent semantic indexing: a probabilistic analysis. In

116 BIBLIOGRAPHY

PODS’98: Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART sym-

posium on Principles of database systems, pages 159–168, New York, NY,

USA, 1998. ACM Press. ISBN 0-89791-996-3. doi: http://doi.acm.org/

10.1145/275487.275505.

[102] Carl Adam Petri. Kommunikation mit Automaten. PhD thesis, University

of Bonn, 1962.

[103] Peter Pirolli, Patricia Schank, Marti Hearst, and Christine Diehl. Scat-

ter/gather browsing communicates the topic structure of a very large text

collection. In CHI’96: Proceedings of the SIGCHI conference on Human fac-

tors in computing systems, pages 213–220, New York, NY, USA, 1996. ACM

Press. ISBN 0-89791-777-4. doi: http://doi.acm.org/10.1145/238386.

238489.

[104] Jay M. Ponte and W. Bruce Croft. A language modeling approach to infor-

mation retrieval. In SIGIR’98: Proceedings of the 21st annual international

ACM SIGIR conference on Research and development in information re-

trieval, pages 275–281, New York, NY, USA, 1998. ACM Press. ISBN

1-58113-015-5. doi: http://doi.acm.org/10.1145/290941.291008.

[105] Alexandrin Popescul and Lyle H. Ungar. Automatic Labeling of Document

Clusters. http://citeseer.nj.nec.com/popescul00automatic.html,

2000.

[106] M. F. Porter. An Algorithm for Suﬃx Stripping. Program, 14(3):130–137,

1980.

[107] Martin Porter. Snowball. http://snowball.tartarus.org/, 2001.

[108] V. V. Raghavan and K. Birchand. A Clustering Strategy Based on a For-

malism of the Reproduction Process in a Natural System. In Proceedings of

the Second International Conference on Information Storage and Retrieval,

pages 10–22, 1979.

[109] W. M. Rand. Objective criteria for the evaluation of clustering methods.

Journal of the American Statistical Association, 66:846–850, 1971.

BIBLIOGRAPHY 117

[110] Georg Rehm. Towards Automatic Web Genre Identiﬁcation. In Pro-

ceedings of the 35th Hawaii International Conference on System Sciences

(HICSS’02). IEEE Computer Society, January 2002.

[111] C. J. van Rijsbergen. Information Retrieval. Buttersworth, London, 1979.

[112] S. E. Robertson. The probability ranking principle in IR. Morgan Kaufmann

Publishers Inc., San Francisco, CA, USA, 1997. ISBN 1-55860-454-5.

[113] S. E. Robertson and S. Walker. Some simple eﬀective approximations to the

2-poisson model for probabilistic weighted retrieval. In SIGIR’94: Proceed-

ings of the 17th annual international ACM SIGIR conference on Research

and development in information retrieval, pages 232–241, New York, NY,

USA, 1994. Springer-Verlag New York, Inc. ISBN 0-387-19889-X.

[114] Stephen Robertson, Hugo Zaragoza, and Michael Taylor. Simple bm25

extension to multiple weighted ﬁelds. In CIKM’04: Proceedings of the thir-

teenth ACM international conference on Information and knowledge man-

agement, pages 42–49, New York, NY, USA, 2004. ACM Press. ISBN

1-58113-874-1. doi: http://doi.acm.org/10.1145/1031171.1031181.

[115] T.G. Rose, M. Stevenson, and M. Whitehead. The Reuters Corpus Volume

1 - From Yesterday’s News to Tomorrow’s Language Resources. In Pro-

ceedings of the Third International Conference on Language Resources and

Evaluation, 2002.

[116] Dmitri Roussinov, Kevin Crowston, Mike Nilan, Barbara Kwasnik, Jin Cai,

and Xiaoyong Liu. Genre based navigation on the web. In Proceedings of

the 34th Hawaii International Conference on System Sciences, 2001.

[117] Tom Roxborough and Arunabha. Graph Clustering using Multiway Ratio

Cut. In Stephen North, editor, Graph Drawing, Lecture Notes in Computer

Science, Springer, 1996.

[118] Reinhard Sablowski and Arne Frick. Automatic Graph Clustering. In

Stephan North, editor, Graph Drawing, Lecture Notes in Computer Science,

Springer, 1996.

118 BIBLIOGRAPHY

[119] G. Salton. Cluster search strategies and the optimization of retrieval eﬀec-

tiveness. Prentice-Hall, Englewood Cliﬀs, N.J., 1971.

[120] G. Salton and M.J. Gill. Introduction to Modern Information Retrieval.

McGraw-Hill, New York, 1983.

[121] Gerard Salton, A. Wong, and C. S. Yang. A vector space model for auto-

matic indexing. Commun. ACM, 18(11):613–620, 1975.

[122] Mark Sanderson and Bruce Croft. Deriving concept hierarchies from text.

In SIGIR ’99: Proceedings of the 22nd annual international ACM SIGIR

conference on Research and development in information retrieval, pages

206–213, New York, NY, USA, 1999. ACM Press. ISBN 1-58113-096-1.

doi: http://doi.acm.org/10.1145/312624.312679.

[123] Marina Santini. Common criteria for genre classiﬁcation: Annotation and

granularity. In Proceedings of the ECAI-Workshop TIR-06,RivadelGarda,

Italy, 2006.

[124] Claude E. Shannon. A Mathematical Theory of Communication. Bell Sys.

Tech. J., 27:379–423, 623–656, 1948.

[125] P. H. A. Sneath. The application of computers to taxonomy. J. Gen.

Microbiol., 17, 1957.

[126] Karen Sparck-Jones. A statistical interpretation of term speciﬁcity and its

application in retrieval. Journal of Documentation, 28:11–21, 1972.

[127] Amanda Spink, Judy Bateman, and Bernard J. Jansen. Users’ Searching

Behavior On The Excite Web Search Engine. In Hermann A. Maurer and

Richard G. Olson, editors, Proceedings of WebNet 98 - World Conference

on the WWW and Internet & Intranet, Orlando, Florida, USA, nov 1998.

ISBN 1-880094-31-2.

[128] E. Stamatatos, N. Fakotakis, and G. Kokkinakis. Text genre detection

using common word frequencies. In Proceedings of the 18th Int. Conference

on Computational Linguistics, Saarbr¨ucken, Germany, 2000.

[129] Benno Stein. Readings in Advanced Web Technology. http://www.uni-

weimar.de/m/webis, 2006.

BIBLIOGRAPHY 119

[130] Benno Stein and Sven Meyer zu Eißen. Document Categorization with

MajorClust. In Amit Basu and Soumitra Dutta, editors, Proceedings

of the 12th Workshop on Information Technology and Systems (WITS 02),

Barcelona Spain, pages 91–96. Technical University of Barcelona, December

2002.

[131] Benno Stein and Sven Meyer zu Eißen. Automatic Document Categoriza-

tion: Interpreting the Perfomance of Clustering Algorithms. In Andreas

G¨unter, Rudolf Kruse, and Bernd Neumann, editors, KI 2003: Advances

in Artiﬁcial Intelligence, volume 2821 LNAI of Lecture Notes in Artiﬁcial

Intelligence, pages 254–266. Springer, September 2003. ISBN 3-540-20059-

[132] Benno Stein and Sven Meyer zu Eißen. The AIsearch Meta Search Engine

Homepage. http://www.aisearch.de, 2003-2007.

[133] Benno Stein and Sven Meyer zu Eißen. Automatische Kategorisierung

f¨ur Web-basierte Suche: Einf¨uhrung, Techniken und Projekte. KI –

K¨unstliche Intelligenz: Special Issue on Adaptive Multimedia Retrieval,4:

11–17, November 2004. ISSN 0933-1875.

[134] Benno Stein and Sven Meyer zu Eißen. Topic Identiﬁcation: Framework

and Application. In Klaus Tochtermann and Hermann Maurer, editors,

Proceedings of the 4th International Conference on Knowledge Manage-

ment (I-KNOW 04), Graz, Austria, Journal of Universal Computer Science,

pages 353–360, Graz, Austria, July 2004. Know-Center.

[135] Benno Stein and Sven Meyer zu Eißen. Near Similarity Search and Plagia-

rism Analysis. In M. Spiliopoulou, R. Kruse, C. Borgelt, A. N¨urnberger,

and W. Gaul, editors, From Data and Information Analysis to Knowledge

Engineering, pages 430–437. Springer, 2006. ISBN 1431-8814.

[136] Benno Stein, Sven Meyer zu Eißen, Gernot Gr¨afe, and Frank Wissbrock.

Automating Market Forecast Summarization from Internet Data. In Pedro

Isa´ıas and Miguel Baptista Nunes, editors, Fourth International Conference

on WWW / Internet, Lissabon, pages 395–402. IADIS Press, October 2005.

ISBN 972-8924-02-X.

120 BIBLIOGRAPHY

[137] Benno Stein, Sven Meyer zu Eißen, and Martin Potthast. Syntax versus

Semantics: Analysis of Enriched Vector Space Models. In Benno Stein and

Odej Kao, editors, Third International Workshop on Text-Based Informa-

tion Retrieval (TIR 06), pages 47–52. University of Trento, Italy, August

2006.

[138] Benno Stein, Sven Meyer zu Eißen, and Frank Wißbrock. On Cluster Va-

lidity and the Information Need of Users. In M. H. Hanza, editor, Proceed-

ings of the 3rd IASTED International Conference on Artiﬁcial Intelligence

and Applications (AIA 03), Benalm´adena, Spain, pages 216–221, Anaheim,

Calgary, Zurich, September 2003. ACTA Press. ISBN 0-88986-390-3.

[139] Benno Stein and Oliver Niggemann. On the Nature of Structure and its

Identiﬁcation. In Peter Widmayer, Gabriele Neyer, and Stefan Eiden-

benz, editors, Graph-Theoretic Concepts in Computer Science, volume 1665

LNCS of Lecture Notes in Computer Science, pages 122–134. Springer, June

1999. ISBN 3-540-66731-8.

[140] Benno Stein and Martin Potthast. Putting Successor Variety Stemming to

Work. In Advances in Data Analysis (to appear). Springer, 2007.

[141] Alexander Strehl. Relationship-based Clustering and Visualization for High-

dimensional Data Mining. PhD thesis, University of Texas at Austin, 2002.

[142] Sun Microsystems. Java Architecture for XML Binding (JAXB Speciﬁca-

tion). http://java.sun.com/xml/downloads/jaxb.html, 2003.

[143] J¨urgen Teich. Digitale Hardware/Software-Systeme. Springer, 1997.

[144] R. Tibshirani, G. Walther, and T. Hastie. Estimating the number of clusters

via the gap statistic. Journal of Royal Statistics and Social Behaviour,63

(2):411–423, 2001.

[145] Myron Tribus. Thermostatics and Thermodynamics. Princeton, N.J., Van

Nostrand, 1961.

[146] Yuen-Hsien Tseng. Multilingual keyword extraction for term suggestion.

In SIGIR ’98: Proceedings of the 21st Annual International ACM SIGIR

BIBLIOGRAPHY 121

Conference on Research and Development in Information Retrieval, August

24-28 1998, Melbourne, Australia, pages 377–378, 1998.

[147] P. Turney. Coherent keyphrase extraction via web mining. In Proceedings

of the Eighteenth International Joint Conference on Artiﬁcial Intelligence

(IJCAI-03), pages 434–439, aug 2003.

[148] University of Leipzig. Wortschatz. http://wortschatz.uni-leipzig.de,

1995.

[149] University of Stuttgart. The decision tree tagger.

http://www.ims.uni-stuttgart.de, 1996.

[150] Vishwa Vinay, Ingemar J. Cox, Natasa Milic-Frayling, and Ken Wood.

Measuring the complexity of a collection of documents. In Lalmas et al.

[79], pages 107–118. ISBN 3-540-33347-9.

[151] Vivisimo. The Vivisimo Meta Search Engine Homepage.

http://www.vivisimo.com, 2000-2006.

[152] Ulrike von Luxburg. Statistical Learning with Similarity and Dissimilarity

Functions. PhD thesis, Technische Universit¨at Berlin, 2004.

[153] Ellen M. Voorhees. The cluster hypothesis revisited. In SIGIR’85: Pro-

ceedings of the 8th annual international ACM SIGIR conference on Re-

search and development in information retrieval, pages 188–196, New

York, NY, USA, 1985. ACM Press. ISBN 0-89791-159-8. doi: http:

//doi.acm.org/10.1145/253495.253524.

[154] Wikipedia. Deﬁnition of Surprisal. http://en.wikipedia.org/wiki/Surprisal,

2006.

[155] Peter Willett. Recent trends in hierarchical document clustering. Informa-

tion Processing & management, 24(5):577–597, 1988.

[156] Joachim Winter. Deﬁnition of bounded rationality.

http://www.sfb504.uni-mannheim.de/glossary/bounded.htm, 1999.

122 BIBLIOGRAPHY

[157] Ian H. Witten, Gordon W. Paynter, Eibe Frank, Carl Gutwin, and Craig

G. Nevill-Manning. KEA: Practical automatic keyphrase extraction. In

ACM DL, pages 254–255, 1999.

[158] Zhenyu Wu and Richard Leahy. An optimal graph theoretic approach

to data clustering: Theory and its application to image segmentation.

IEEE Transactions on Pattern Analysis and Machine Intelligence,Novem-

ber 1993.

[159] Wei Xu and Yihong Gong. Document clustering by concept factorization.

In SIGIR’04: Proceedings of the 27th annual international ACM SIGIR

conference on Research and development in information retrieval, pages

202–209, New York, NY, USA, 2004. ACM Press. ISBN 1-58113-881-4.

doi: http://doi.acm.org/10.1145/1008992.1009029.

[160] J. T. Yan and P. Y. Hsiao. A fuzzy clustering algorithm for graph bisection.

Information Processing Letters, 52, 1994.

[161] David Yarowsky. Word-sense disambiguation using statistical models of

roget’s categories trained on large corpora. In COLING, pages 454–460,

1992.

[162] C. T. Zahn. Graph-Theoretical Methods for Detecting and Describing

Gestalt Clusters. IEEE Transactions on computers, C-20(1), 1971.

[163] Oren Zamir and Oren Etzioni. Web Document Clustering: A Feasibility

Demonstration. In SIGIR’98: Proceedings of the 21st annual international

ACM SIGIR conference on Research and development in information re-

trieval, pages 46–54, University of Washington, Seattle, USA, 1998.