Service level agreement aware resource management / by Matthias Hovestadt

Service Level Agreement aware

Resource Management

Dissertation

by

Matthias Hovestadt

Thesis submitted for the degree of

Doctor of Natural Sciences

Faculty of Electrical Engineering, Computer Science and Mathematics

Universität Paderborn

Paderborn, October 2006

Date of the oral examination:

7 December 2006

Reviewers:

Prof. Dr. Odej Kao, Universität Paderborn

Prof. Dr. Franz Rammig, Universität Paderborn

Acknowledgements

The previous years at PC² have been the greatest time in my life. I am deeply

grateful for all the support I received and the opportunity to pursue my way.

They all contributed to making this thesis a reality. I would like to use this

opportunity to thank some special people.

First of all I would like to thank my doctoral advisor Odej Kao for all his

support and encouragement. He was always available when I was in need of

discussion or advice. Under his lead, PC² became the perfect place to do

research. Furthermore, I am grateful to Franz Rammig for agreeing to review this

thesis.

I would also like to thank Bernard Bauer. He not only guided me through

the jungle of proper financial reporting in EC-funded projects; for me he is also

the soul of PC².

At PC² I found colleagues working with great team spirit and commitment.

Felix Heine has not only been the world’s best office mate; I am particularly

grateful for all the fruitful discussions and the motivating artwork.

I am deeply grateful to Axel Keller, the father of CCS: first for his technical

expertise, but even more for his patience in listening and his commitment to

the project. Many things would look different without him. Thanks also to my

students Alexander Gretencord, Jan-Henrik Wiesner and Tobias Bettmann for

their work.

The HPC4U project has been funded by the European Commission. Thanks

to all European taxpayers for giving me the opportunity to realize my ideas in

this context. Discussions with research colleagues were stimulating and helpful

for my work. Further I would like to express my thanks to Simon Alexandre for

all the discussions we had.

Last but not least I would like to thank Dorit and my entire family. I am

only standing on the shoulders of giants. In particular thanks to my parents for

convincing me to continue to go to school when I decided to quit at the age of 7.

It was worth it.

Abstract

Next Generation Grids aim to attract commercial users to employ Grid

environments for their business-critical compute jobs. These customers demand

contractually fixed service quality levels, ensuring the availability of results in

time. In this context, a Service Level Agreement (SLA) is a powerful instrument

for defining a comprehensive requirement profile.

Numerous research projects worldwide already focus on integrating SLA

technology in Grid middleware components like broker services. However, solely

focusing on Grid middleware services is not sufficient. Grid middleware

services may accept compute jobs from customers, but they have to realize them

by means of local resource management systems (RMS). Current RMS offer

best-effort service only and thus limit the service quality level that the

Grid middleware service is able to provide.

This thesis describes the architecture and operation of an SLA-aware resource

management system that allows Grid middleware components to negotiate

SLAs. The system uses its internal mechanisms of

application-transparent fault tolerance to ensure the terms of these SLAs even in case of

resource outages. The main parts of this work focus on scheduling aspects and

strategies for ensuring SLA compliance, and on design aspects of the

implementation.

Scheduling strategies significantly determine the level of fault tolerance that

the system is able to provide. After presenting requirements of Grid middleware

components on service qualities and a description of operation phases of an

SLA-aware resource management system, intra-cluster scheduling strategies are

described. Here, the system solely uses its own resources and mechanisms for

coping with resource outages.

To further increase the level of fault tolerance, strategies for cross-border

migration are presented. Besides migration to other cluster systems in the same

administrative domain, the system also uses Grid resources as migration targets.

To ensure a successful restart, mechanisms for describing the compatibility

profile of a checkpointed job are presented.

The concept of the SLA-aware resource management system has been

implemented in the scope of the EC-funded project HPC4U. We will describe design

aspects of this realization and show results from system deployments at use-case

customers.
