Service Level Agreement aware
Resource Management
Dissertation
by
Matthias Hovestadt
Thesis submitted for the degree of
Doctor of Natural Sciences
Faculty of Electrical Engineering, Computer Science and Mathematics
Universität Paderborn
Paderborn, October 2006
Date of oral examination:
7 December 2006
Reviewers:
Prof. Dr. Odej Kao, Universität Paderborn
Prof. Dr. Franz Rammig, Universität Paderborn
Acknowledgements
The previous years at PC² have been the greatest time in my life. I am deeply
grateful for all the support I received and the opportunity to pursue my way.
They all contributed to making this thesis a reality. I would like to use this
opportunity to thank some special people.
First of all I would like to thank my doctoral advisor Odej Kao for all his
support and encouragement. He was always available when I was in need of
discussion or advice. Under his lead, PC² became the perfect place to do
research. Furthermore I am grateful to Franz Rammig for agreeing to review this
thesis.
I would also like to thank Bernard Bauer. He not only guided me through
the jungle of proper financial reporting in EC-funded projects; for me he is also
the soul of PC².
In PC² I found colleagues working with great team spirit and commitment.
Felix Heine has not only been the world’s best office mate; I am particularly
grateful for all the fruitful discussions and the motivating artwork.
I am deeply grateful to Axel Keller, the father of CCS: firstly for his technical
expertise, but even more for his patience in listening and his commitment to
the project. Many things would look different without him. Thanks also to my
students Alexander Gretencord, Jan-Henrik Wiesner and Tobias Bettmann for
their work.
The HPC4U project has been funded by the European Commission. Thanks
to all European taxpayers for giving me the opportunity to realize my ideas in
this context. Discussions with research colleagues were stimulating and helpful
for my work. Further I would like to express my thanks to Simon Alexandre for
all the discussions we had.
Last but not least I would like to thank Dorit and my entire family. I am
only standing on the shoulders of giants. In particular thanks to my parents for
convincing me to continue to go to school when I decided to quit at the age of 7.
It was worth it.
Abstract
Next Generation Grids aim to attract commercial users to employ Grid
environments for their business-critical compute jobs. These customers demand
contractually fixed service quality levels that ensure results are available on
time. In this context, a Service Level Agreement (SLA) is a powerful instrument
for defining a comprehensive requirement profile.
Numerous research projects worldwide already focus on integrating SLA
technology into Grid middleware components such as broker services. However,
focusing solely on Grid middleware services is not sufficient. Grid middleware
services may accept compute jobs from customers, but they have to realize them
by means of local resource management systems (RMS). Current RMS offer
best-effort service only, and thus limit the service quality level that the
Grid middleware service is able to provide.
This thesis describes the architecture and operation of an SLA-aware resource
management system that allows Grid middleware components to
negotiate SLAs. The system uses its internal mechanisms of
application-transparent fault tolerance to ensure the terms of these SLAs even in case of
resource outages. The main parts of this work focus on scheduling aspects and
strategies for ensuring SLA compliance, and on design aspects of the
implementation.
Scheduling strategies significantly determine the level of fault tolerance that
the system is able to provide. After presenting the service quality requirements
of Grid middleware components and describing the operation phases of an
SLA-aware resource management system, intra-cluster scheduling strategies are
described. Here, the system solely uses its own resources and mechanisms for
coping with resource outages.
To further increase the level of fault tolerance, strategies for cross-border
migration are presented. Besides migration to other cluster systems in the same
administrative domain, the system also uses Grid resources as migration targets.
To ensure a successful restart, mechanisms for describing the compatibility
profile of a checkpointed job are presented.
The concept of the SLA-aware resource management system has been
implemented in the scope of the EC-funded project HPC4U. We describe design
aspects of this realization and show results from system deployments at use-case
customers.