Parameterized dynamic cluster editing [original]

Vol.:(0123456789)

Algorithmica (2021) 83:1–44

https://doi.org/10.1007/s00453-020-00746-y

1 3

Parameterized Dynamic Cluster Editing

JunjieLuo1,2· HendrikMolter3· AndréNichterlein3· RolfNiedermeier3

Received: 12 December 2018 / Accepted: 3 July 2020 / Published online: 25 July 2020

Abstract

We introduce a dynamic version of the NP-hard graph modification problem

Clus

ter editing

. The essential point here is to take into account dynamically evolving

input graphs: having a cluster graph (that is, a disjoint union of cliques) constituting

a solution for a first input graph, can we cost-efficiently transform it into a “simi-

lar” cluster graph that is a solution for a second (“subsequent”) input graph? This

model is motivated by several application scenarios, including incremental cluster-

ing, the search for compromise clusterings, or also local search in graph-based data

clustering. We thoroughly study six problem variants (three modification scenarios

edge editing, edge deletion, edge insertion; each combined with two distance meas-

ures between cluster graphs). We obtain both fixed-parameter tractability as well

as (parameterized) hardness results, thus (except for three open questions) provid-

ing a fairly complete picture of the parameterized computational complexity land-

scape under the two perhaps most natural parameterizations: the distances of the

new “similar” cluster graph to (1)the second input graph and to (2)the input cluster

graph.

Keywords Graph-based data clustering· Incremental clustering· Compromise

clustering· Correlation clustering· Local search· Goal-oriented clustering·

NP-hard problems· Fixed-parameter tractability· Parameterized complexity·

Kernelization· Multi-choice knapsack

An extended abstract of this work appears in the proceedings of the 38th IARCS Annual Conference

on Foundations of Software Technology and Theoretical Computer Science (FSTTCS’18)[35].

Unfortunately, the conference version contains a claim whose unpublished proof contained an error

(seeTable1). This full version contains all proof details.

* Hendrik Molter

[email protected]

Extended author information available on the last page of the article

Algorithmica (2021) 83:1–44

1 3

1 Introduction

The NP-hard

Cluster editing

 problem[6, 41], also known as

Correlation Cluster

ing

[5], is one of the most popular graph-based data clustering problems in algorithmics.

Given an undirected graph, the task is to transform it into a disjoint union of cliques

(also known as a cluster graph) by performing a minimum number of edge modifica-

tions (deletions or insertions). Being NP-hard,

Cluster editing

 gained high popular-

ity in studies concerning parameterized algorithmics, e.g.[1, 4, 9, 11, 14, 23, 26, 29,

33]. To the best of our knowledge, to date these parameterized studies mostly focus on

a “static scenario”. Chen etal.[14] are an exception by also looking at temporal and

multilayer graphs. In their work, the input is a set of graphs (multilayer) or an ordered

list of graphs (temporal), in both cases defined over the same vertex set. The goal is to

transform each input graph into a cluster graph such that, in the multilayer case, the

number of vertices in which any two cluster graphs may differ is upper-bounded, and in

the temporal case, the number of vertices in which any two consecutive (with respect to

their positions in the list) cluster graphs may differ is upper-bounded.

In contrast to the work of[14], we do not assume that all future changes are

known. We consider the scenario where, given an input graph, we only know

changes that lie immediately ahead, that is, we know the “new” graph that the input

graph changes to. Thus we seek to efficiently and effectively adapt an existing solu-

tion, namely a cluster graph. Motivated by the assumption that the “new” cluster

graph should only change moderately but still be a valid representation of the data,

we parameterize both on the number of edits necessary to obtain the “new” clus-

ter graph and on the difference between the “old” and the “new” cluster graph. We

finally remark that there have been previous parameterized studies of dynamic (or

incremental) graph problems, dealing e.g. with coloring[30], domination[3, 19], and

vertex deletion[2, 34] problems.

1.1 Mathematical Model

In principle, the input for a dynamic version of a static problemX are two instancesI

and

I′

ofX, a solutionS forI, and an integerd. The task is to find a solution

S′

for

I′

such

that the distance betweenS and

S′

is upper-bounded byd. Often, there is an additional

constraint on the size of

S′

. Moreover, some distance measure betweenI and

I′

has often

been considered as a parameter for the problem[2, 3, 19, 34]. We arrive at our following

“original dynamic version” of

Cluster editing

 (phrased as decision version).

1 3

Algorithmica (2021) 83:1–44

Herein,

⊕

denotes the symmetric difference between two sets and

dist (⋅,⋅)

is a

generic distance function for cluster graphs, which we discuss later. Moreover,

is supposed to be the “solution” given for the input graph

. However, since the

question in this problem formulation is independent from

, we can remove

from the input and arrive at the following simplified version of the problem. For the

remainder of this paper we focus on this simplified formulation of

dynamiC Cluster

editing

There are many different distance measures for cluster graphs[37, 38]. Indeed, we

will study two standard ways of measuring the distance between two cluster graphs.

One is called classification error distance, which measures the number of vertices

one needs to move between cliques to make two cluster graphs the same—we sub-

sequently refer to it as matching-based distance. The other is called disagreement

distance, which is the symmetric distance between two edge sets—we subsequently

refer to it as edge-based distance. Notably, the edge-based distance upper-bounds

the matching-based distance. We give formal definitions in Sect.2.

1.2 Motivation andRelated Work

Beyond parameterized algorithmics and static

Cluster editing

, dynamic clustering

in general has been subject to many studies, mostly in applied computer science[12,

16, 17, 42–44]. We mention in passing that there are also close ties to reoptimization

(e.g.,[7, 8, 40]) and parameterized local search (e.g.,[21, 25, 28, 30, 36]).

There are several natural application scenarios that motivate the study of

dynamiC

Cluster editing

. Next, we list four of them.

Dynamically updating an existing cluster graph

dynamiC Cluster editing

 can

be interpreted to model a smooth transition between cluster graphs, reflecting that

“customers” working with clustered data in a dynamic setting may only tolerate a

moderate change of the clustering from “one day to another” since “revolutionary”

transformations would require too dramatic changes in their work. In this spirit,

when employing small parameter values,

dynamiC Cluster editing

 has kind of

an evolutionary flavor with respect to the history of the various cluster graphs in a

dynamic setting.

Editing a graph into a target cluster graph For a given graphG, there may be

many cluster graphs which are at mostk edge modifications away. The goal then

is to find one of these which is close to the given target cluster graph

since in a

Algorithmica (2021) 83:1–44

1 3

corresponding application one is already “used to” work with

. Adapting a differ-

ent point of view, the editing into the target cluster graph

might be too expensive

(that is,

|E(G)⊕E(Gc)|

is too big), and one has to find a solution cluster graph with

small enough modification costs but being still close to the target

Local search for an improved cluster graph Here the scenario is that one may

have found an initial clustering expressed by

, and one searches for another solu-

tion

G′

forG within a certain local region around

(captured by our parameterd).

Editing into a compromise clustering When focusing on the edge-based distance,

one may generalize the definition of

dynamiC Cluster editing

 by allowing

be any graph (not necessarily a cluster graph). This may be used as a model for

“compromise cluster editing” in the sense that the goal cluster graph then is a com-

promise for a cluster graph suitable for both input graphs since it is close to both of

them.

1.3 Our Results

We investigate the (parameterized) computational complexity of

dynamiC Clus

ter editing

. We study

dynamiC Cluster editing

 as well as two restricted ver-

sions where only edge deletions (“Deletion”) or edge insertions (“Completion”) are

allowed. We show that all problem variants (notably also the completion variants,

whose static counterpart is trivially polynomial-time solvable) are NP-complete

even if the input graphG is already a cluster graph. Table1 surveys our main com-

plexity results.

The general versions of

dynamiC Cluster editing

 all turn out to be param-

eterized intractable (W[1]-hard) by the single natural parameter “budget k” or

Table 1 Result overview for

dynamiC Cluster editing

. We primarily categorize the problem variants by

the distance measure (Matching, Edge) they use and secondarily by the allowed modification operation

NP-completeness for all problem variants (last column) even holds if the input graph G is a cluster

graph. PK stands for polynomial-size problem kernel

In the conference version[35] of this paper we claimed that

dCCompletion (edge dist)

is in FPT when

parameterized by k. Unfortunately, the unpublished “proof” for this claim contained an error that we

could not fix

Parameter

Problem variant

k+d

k d

Match. Editing FPT(PK) Theorem3W[1]-h Theorem2W[1]-h

}

Theorem2NP-c Theorem1

Deletion FPT(PK) open W[1]-h NP-c

Completion FPT(PK) open FPT Theorem4NP-c

Edge Editing FPT(PK) Theorem3W[1]-h Theorem2W[1]-h

}

Theorem2NP-c Theorem1

Deletion FPT(PK) FPTTheorem4W[1]-h NP-c

Completion FPT(PK) open

FPT Theorem4NP-c

1 3

Algorithmica (2021) 83:1–44

“distanced”; however, when both parameters are combined, we achieve a polyno-

mial-size problem kernel, implying fixed-parameter tractability. We also derive a

generic approach, based on a reduction to

multi-ChoiCe KnapsaCK

, to derive fixed-

parameter tractability for several deletion and completion variants with respect to

the parameters budgetk as well as the distanced.

1.4 Organization ofthePaper

Our work, after introducing basic notations (Sect.2), consists of two main parts. In

Sect.3, we provide all our (parameterized) hardness results. In Sect.4, we develop

several positive algorithmic results, namely polynomial-size problem kernels

through polynomial-time data reduction, and fixed-parameter solving algorithms.

We conclude with a summary and directions for future work (Sect.5).

2 Preliminaries andProblems Variants

In this section we give a brief overview on concepts and notations of graph theory

and parameterized complexity theory that are used in this paper. We also give for-

mal definitions of the distance measures for cluster graphs we use and of our prob-

lem variants. We use

⊕

to denote the symmetric difference of two sets, that is, for

two setsA,B we have

⊕

B∶= (A⧵B)∪(B⧵A)

2.1 Graph‑Theoretic Concepts andNotations

Given an undirected graph

G=(V,E)

, we say that a vertex set

C⊆V

is a clique

inG if G[C] is a complete graph. We say that a vertex set

C⊆V

is isolated in G

if there is no edge

{u,v}∈E

with

u∈C

and

v∈V⧵C

. A

is a path with three

vertices. We say that vertices

u,v,w∈V

form an induced

in G if

G[{u,v,w}]

. We say that an edge

{u,v}∈E

is part of a

in G if there is a vertex

w∈V

such that

G[{u,v,w}]

is a

. Analogously, we say that a non-edge

{u,v}∉E

is part

of a

in G if there is a vertex

w∈V

such that

G[{u,v,w}]

is a

. A cluster graph

is simply a disjoint union of cliques. Equivalently, a graph

G=(V,E)

is a cluster

graph if for all

u,v,w∈V

we have that

G[{u,v,w}]

is not a

, or in other words,

is a forbidden induced subgraph for cluster graphs.

2.2 Distance Measures forCluster Graphs

A cluster graph is simply a disjoint union of cliques. We use two basic distance

measures for cluster graphs[37, 38]. The first one is called “matching-based

distance” and counts how many vertices have to be moved from one clique to

another to make two cluster graphs the same (see Fig.1 for an illustrating exam-

ple). It is formally defined as follows.

Loading more pages...