Document [original]

A FRAMEWORK FOR SLA-AWARE EXECUTION OF

GRID-BASED WORKFLOWS

DISSERTATION

Schriftliche Arbeit zur Erlangung des akademischen Grades eines

Doktors der Naturwissenschaften an der Fakultaet fuer

Elektrotechnik, Informatik und Mathematik der Universitaet

Paderborn.

Dang Minh Quan

November 2006

Copyright by Dang Minh Quan 2007

iii

Preface

Service Level Agreements (SLAs) are currently one of the major research topics in

Grid Computing, as they serve as a foundation for reliable and predictable Grids.

SLAs define an explicit statement of expectations and obligations in a business rela-

tionship between provider and customer. Thus, SLAs should guarantee the desired

and a-priori negotiated Quality of Service (QoS), which is a mandatory prerequisite

for the Next Generation Grids. This development is proved by a manifold research

work about SLAs and architectures for implementing SLAs in Grid environments.

However, this work is mostly related to SLAs for standard, monolithic Grid jobs and

neglects the dependencies between different steps of operation.

The complexity of an SLA-specification for workflows grows significantly, as char-

acteristics of correlated sub-jobs, the data transfer phases, the deadline constraints

and possible failures have to be considered. Thus, an architect for an SLA-aware

workflow implementation needs sophisticated mechanisms for specification and man-

agement, sub-job mapping, data transfer optimization and fault reaction.

Therefore, this dissertation presents a system for SLA-aware Grid workflows. The

main contributions include an improved specification language for SLA-aware work-

flows, three mapping and optimization algorithms for sub-job assignment to Grid

resources, an error recovery mechanism, and a prototype implementation using stan-

dard middleware. Experimental measurements prove the quality of the development.

Acknowledgements

This dissertation cannot be finished without the help of a number of people:

First of all, I would like to especially thank Professor Odej Kao, who is my great

supervisor. He first brought me to Germany and gave me the chance to do research

in a modern academic environment. I would like to acknowledge his valuable sup-

ports, comments and suggestions during my research process. My knowledge has been

enriched by his judgment and wisdom. I never forget all these helps.

I deeply thank my parent. Their support and understanding have helped me

to overcome the most difficult period, encouraging me to forget the bad things and

giving me a new hope when I am down. They are always supporting and believing in

whatever I do. Without their love and supports, I would never be able to finish this

dissertation.

I am grateful for the helps and co-operations of many researchers in the working

group of Professor Odej Kao, who have provided valuable opinions for my research

topic. Many thanks are to the secretary of Professor Odej Kao, namely Irene Roger,

for her kind helps and assistance.

Last but not least, I would like to thank system administrators in Computer

Science department, University of Paderborn for their help in establishing experiments

system.

Paderborn, 2th January 2006

Dang Minh Quan

Contents

Preface v

Acknowledgements vi

1 Introduction 1

2 Background and review of the state of the art 6

2.1 Introduction to Grid-based workflow . . . . . . . . . . . . . . . . . . 6

2.2 Introduction to Grid computing . . . . . . . . . . . . . . . . . . . . . 8

2.2.1 Grid computing definition . . . . . . . . . . . . . . . . . . . . 9

2.2.2 Grid architecture description . . . . . . . . . . . . . . . . . . . 10

2.3 Introduction to Service Level Agreement . . . . . . . . . . . . . . . . 13

2.3.1 Definition ............................. 14

2.3.2 Structure ............................. 14

2.3.3 Applying the Service Level Agreement . . . . . . . . . . . . . 15

2.4 Review of the state of the art . . . . . . . . . . . . . . . . . . . . . . 16

2.4.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4.2 Workflow language . . . . . . . . . . . . . . . . . . . . . . . . 19

2.4.3 Mapping mechanism . . . . . . . . . . . . . . . . . . . . . . . 20

3 Problem statement 22

3.1 Optimalcost ............................... 23

3.1.1 Workflow description . . . . . . . . . . . . . . . . . . . . . . . 24

3.1.2 Character of RMSs . . . . . . . . . . . . . . . . . . . . . . . . 25

vii

3.1.3 Formal requirement statement . . . . . . . . . . . . . . . . . . 27

3.2 Automated negotiation . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.3 Errorrecovery............................... 31

4 Mechanism of mapping workflow to Grid resources 33

4.1 Relatedworks .............................. 33

4.2 Mapping with standard metaheuristics . . . . . . . . . . . . . . . . . 41

4.2.1 General strategy . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.2.2 Tabu Search (TS) . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.2.3 Simulated Annealing (SA) . . . . . . . . . . . . . . . . . . . . 48

4.2.4 Iterated Local Search (ILS) . . . . . . . . . . . . . . . . . . . 50

4.2.5 Guided Local Search (GLS) . . . . . . . . . . . . . . . . . . . 53

4.2.6 Genetic Algorithm (GA) . . . . . . . . . . . . . . . . . . . . . 57

4.2.7 Estimation of Distribution Algorithm (EDA) . . . . . . . . . . 59

4.2.8 Evaluation of metaheuristics . . . . . . . . . . . . . . . . . . . 60

4.3 Proposed mapping mechanism . . . . . . . . . . . . . . . . . . . . . . 62

4.3.1 L-Tabu algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 64

4.3.2 H-Map algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.3.3 w-Tabu algorithm . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.4 Performance experiment . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.4.1 Optimizing cost for light communication workflows experiment 83

4.4.2 Optimizing cost for heavy communication workflows experiment 84

4.4.3 Optimizing the finished time for workflows experiment . . . . 85

4.5 Summary ................................. 89

5 SLA negotiation protocol for Grid-based workflow 91

5.1 Relatedworks............................... 91

5.2 SLAlanguage............................... 93

5.2.1 Business description . . . . . . . . . . . . . . . . . . . . . . . 93

5.2.2 Computational task description . . . . . . . . . . . . . . . . . 93

5.2.3 SLO description . . . . . . . . . . . . . . . . . . . . . . . . . . 94

5.2.4 Data transmission description . . . . . . . . . . . . . . . . . . 96

viii

5.3 SLA negotiation for workflow . . . . . . . . . . . . . . . . . . . . . . 97

5.3.1 Customer - Broker negotiation . . . . . . . . . . . . . . . . . . 101

5.3.2 Broker - Provider negotiation . . . . . . . . . . . . . . . . . . 103

5.3.3 Provider - Provider negotiation . . . . . . . . . . . . . . . . . 103

5.4 Summary ................................. 105

6 Error recovery mechanism for Grid-based workflow 106

6.1 Relatedworks............................... 106

6.2 Error recovery mechanism . . . . . . . . . . . . . . . . . . . . . . . . 107

6.2.1 Determining sub-jobs to be replanned . . . . . . . . . . . . . . 109

6.2.2 Determining re-mapping priority . . . . . . . . . . . . . . . . 111

6.2.3 Mapping algorithm . . . . . . . . . . . . . . . . . . . . . . . . 112

6.3 Performance experiment . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.4 Summary ................................. 115

7 System implementation 116

7.1 Introduction................................ 116

7.2 Relatedworks............................... 118

7.3 System implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 119

7.3.1 SLA-aware RMS . . . . . . . . . . . . . . . . . . . . . . . . . 120

7.3.2 SLA workflow broker . . . . . . . . . . . . . . . . . . . . . . . 121

7.3.3 Client ............................... 123

7.4 Adaptive execution engine . . . . . . . . . . . . . . . . . . . . . . . . 123

7.4.1 Adaptive runtime control mechanism . . . . . . . . . . . . . . 123

7.4.2 Adaptive SLA execution engine . . . . . . . . . . . . . . . . . 125

7.4.3 Performance measurements . . . . . . . . . . . . . . . . . . . 127

7.5 System deployment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

7.6 Summary ................................. 128

8 Conclusions and future work 130

8.1 Conclusions ................................ 130

8.2 Futurework................................ 131

A List of Acronyms 133

B SLA language for Grid-based workflow specification 137

B.1 Commontag ............................... 137

B.1.1 Identification . . . . . . . . . . . . . . . . . . . . . . . . . . . 137

B.1.2 TitleSLA ............................. 137

B.1.3 Description ............................ 138

B.1.4 Amount .............................. 138

B.1.5 Entity ............................... 138

B.1.6 CostSLA ............................. 139

B.2 General SLA description . . . . . . . . . . . . . . . . . . . . . . . . . 139

B.2.1 Starttime............................. 139

B.2.2 Endtime ............................. 140

B.2.3 Provider.............................. 140

B.2.4 Consumer ............................. 140

B.3 SLA for data transmission description . . . . . . . . . . . . . . . . . . 141

B.3.1 Data................................ 141

B.4 Computing task description . . . . . . . . . . . . . . . . . . . . . . . 142

B.4.1 Software request . . . . . . . . . . . . . . . . . . . . . . . . . 142

B.4.2 Resource request . . . . . . . . . . . . . . . . . . . . . . . . . 142

B.4.3 Job description . . . . . . . . . . . . . . . . . . . . . . . . . . 144

B.5 SLOdescription.............................. 146

B.5.1 Condition ............................. 146

B.5.2 Reason............................... 147

B.5.3 Respondsite............................ 147

B.5.4 Punish............................... 147

B.5.5 Action............................... 148

B.5.6 Monitor .............................. 148

B.6 Data transmission description . . . . . . . . . . . . . . . . . . . . . . 149

B.6.1 Filelist .............................. 149

B.6.2 gFTP ............................... 149

B.7 SLA workflow description . . . . . . . . . . . . . . . . . . . . . . . . 150

B.7.1 SLA sub-job . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

B.7.2 SLA data transmissions . . . . . . . . . . . . . . . . . . . . . 150

B.7.3 Signature ............................. 150

Bibliography 152

List of Tables

3.1 Sub-jobs’ resource requirements of the workflow in Figure 3.1 . . . . 24

3.2 RMSs resource reservation . . . . . . . . . . . . . . . . . . . . . . . . 26

4.1 RMSs candidate for each sub-job . . . . . . . . . . . . . . . . . . . . 43

4.2 Valid start time for sub-jobs of workflow in Figure 3.1 . . . . . . . . 46

4.3 RMSs candidate for each sub-job in cost order . . . . . . . . . . . . 74

4.4 Resource configuration used in the experiment . . . . . . . . . . . . 82

4.5 Experiment results of the L-Tabu algorithm . . . . . . . . . . . . . . 83

4.6 Experiment results of the H-Map algorithm . . . . . . . . . . . . . . 86

4.7 Experiment results of the w-Tabu algorithm . . . . . . . . . . . . . . 87

4.8 Experiment results with workload 21 sub-jobs . . . . . . . . . . . . . 89

4.9 Some experiment results in relative value . . . . . . . . . . . . . . . 90

6.1 Running timetable of the sample workflow in Figure 6.1 . . . . . . . . 108

6.2 Data transfer time table of the sample workflow in Figure 6.1 . . . . 109

6.3 Experiment results with 1 failing RMS . . . . . . . . . . . . . . . . . 114

6.4 Experiment results with 2 failing RMSs . . . . . . . . . . . . . . . . 114

6.5 Experiment results with 3 failing RMSs . . . . . . . . . . . . . . . . . 115

7.1 Initial running time table of the sample workflow . . . . . . . . . . . 117

7.2 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

7.3 Resource configuration of RMSs . . . . . . . . . . . . . . . . . . . . 128

xii

List of Figures

2.1 Weather forecast processing workflow . . . . . . . . . . . . . . . . . . 8

2.2 The layered Grid architecture and its relationship to the Internet pro-

tocol architecture [39] . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.3 The architecture of ICENI system [86] . . . . . . . . . . . . . . . . . 17

2.4 The architecture of AgFlow system [130] . . . . . . . . . . . . . . . . 18

2.5 The architecture of QoS-aware Grid Workflow system [16] . . . . . . 19

3.1 A sample Grid-based workflow . . . . . . . . . . . . . . . . . . . . . . 22

3.2 A sample CPU reservation profile of a RMS . . . . . . . . . . . . . . 26

3.3 A sample bandwidth reservation profile of a link between two RMSs . 27

3.4 Scenario of running a workflow in the grid environment . . . . . . . 30

4.1 Minmin algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4.2 Maxminalgorithm ............................ 38

4.3 Sufferalgorithm.............................. 39

4.4 GRASPalgorithm ............................ 40

4.5 w-DCPalgorithm............................. 41

4.6 Mapping mechanism overview . . . . . . . . . . . . . . . . . . . . . . 42

4.7 TS algorithm to find the minimal finished time of the workflow . . . . 44

4.8 Neighborhood structure of a configuration . . . . . . . . . . . . . . . 45

4.9 Sample neighborhood structure of a configuration . . . . . . . . . . . 45

4.10 Procedure to determine time table for workflow . . . . . . . . . . . . 47

4.11 TS algorithm to find the minimal cost . . . . . . . . . . . . . . . . . 48

4.12 SA algorithm to find the minimal finished time . . . . . . . . . . . . 49

xiii

4.13 SA algorithm to find the minimal cost . . . . . . . . . . . . . . . . . 51

4.14 ILS algorithm to find the minimal finished time . . . . . . . . . . . . 52

4.15 ILS algorithm to find the minimal cost . . . . . . . . . . . . . . . . . 54

4.16 GLS algorithm to find the minimal finished time . . . . . . . . . . . . 55

4.17 GLS algorithm to find the minimal cost . . . . . . . . . . . . . . . . . 56

4.18 GA algorithm to find the minimal finished time . . . . . . . . . . . . 57

4.19 GA algorithm to find the minimal cost . . . . . . . . . . . . . . . . . 58

4.20 EDA algorithm to find the minimal finished time . . . . . . . . . . . 59

4.21 EDA algorithm to find the minimal cost . . . . . . . . . . . . . . . . 61

4.22 Mapping mechanism overview . . . . . . . . . . . . . . . . . . . . . . 63

4.23 The frame work of L-Tabu algorithm . . . . . . . . . . . . . . . . . . 64

4.24 Relation between available and required CPUs . . . . . . . . . . . . 66

4.25 Rate profile of the sample . . . . . . . . . . . . . . . . . . . . . . . . 67

4.26Movingsubjobs.............................. 68

4.27 Adjusting subjobs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

4.28 Rate profile of the sample after doing adjustments . . . . . . . . . . 69

4.29 Improving solution quality procedure for light workflow . . . . . . . 71

4.30 H-Map algorithm overview . . . . . . . . . . . . . . . . . . . . . . . . 72

4.31 The configuration space according to cost distribution . . . . . . . . . 73

4.32 The first selection configuration of the sample . . . . . . . . . . . . . 74

4.33 Procedure to create the set of initial configurations . . . . . . . . . . 75

4.34 Algorithm to improve the solution quality . . . . . . . . . . . . . . . 76

4.35 w-Tabu algorithm overview . . . . . . . . . . . . . . . . . . . . . . . 77

4.36 Generating reference set algorithm . . . . . . . . . . . . . . . . . . . 78

4.37 Determining timetable algorithm for workflow in w-Tabu . . . . . . . 79

4.38 Determining critical path algorithm . . . . . . . . . . . . . . . . . . . 80

4.39 Sample critical path of the workflow in Figure 3.1 . . . . . . . . . . . 80

4.40 Configuration improvement algorithm in w-Tabu . . . . . . . . . . . . 81

4.41 Work flow with 21 sub-jobs . . . . . . . . . . . . . . . . . . . . . . . 88

4.42 Overall quality comparison . . . . . . . . . . . . . . . . . . . . . . . . 90

xiv

5.1 SLA specifications of HW/SW resources as well as the task description 95

5.2 A GGF sample SLO in SLA language . . . . . . . . . . . . . . . . . 96

5.3 SLOspecification............................. 97

5.4 Specification of data to be transferred . . . . . . . . . . . . . . . . . . 98

5.5 Interaction among participants in SLA negotiation process for workflow 99

5.6 Basic SLA negotiation procedure . . . . . . . . . . . . . . . . . . . . 99

5.7 SLA negotiation process for workflow . . . . . . . . . . . . . . . . . . 100

5.8 SLA workflow structure . . . . . . . . . . . . . . . . . . . . . . . . . 101

5.9 SLA data transmission structure . . . . . . . . . . . . . . . . . . . . 104

6.1 Sample running workflow scenario . . . . . . . . . . . . . . . . . . . 107

6.2 Abstract error recovery mechanism . . . . . . . . . . . . . . . . . . . 108

6.3 Determining affected sub-jobs algorithm . . . . . . . . . . . . . . . . 110

6.4 All determined affected sub-jobs in the sample workflow . . . . . . . . 111

6.5 Scenario of two workflows need to be re-mapped . . . . . . . . . . . 112

6.6 DAG form of the new workflow . . . . . . . . . . . . . . . . . . . . . 113

7.1 CPU usage profile in RMS 2 if have no adjustment . . . . . . . . . . 117

7.2 System implementation layers . . . . . . . . . . . . . . . . . . . . . . 119

7.3 SLA-aware local RMS architecture . . . . . . . . . . . . . . . . . . . 120

7.4 Time sequence of a sub-job . . . . . . . . . . . . . . . . . . . . . . . 123

7.5 CPU usage profile of RMS 2 after doing the adjustment . . . . . . . 124

7.6 Negotiation procedure to shift sub-job . . . . . . . . . . . . . . . . . 126

7.7 Experiment system deployment . . . . . . . . . . . . . . . . . . . . . 128

7.8 Initial web based client . . . . . . . . . . . . . . . . . . . . . . . . . 129

xvi

Chapter 1

Introduction

Grid computing is viewed as the next phase of distributed computing. Built on

Internet standards, grid computing enables organizations to share computing and

information resources across departments and organizational boundaries in a secure

and highly efficient manner. According to Ian Foster, a Grid is a system that ”co-

ordinates resources that are not subject to centralized control using standard, open,

general-purpose protocols and interfaces to deliver nontrivial qualities of services”

[38]. In general, from the view point of the application layer, a Grid can exist in

one of three types: Computational Grid, Scavenging Grid and Data Grid [63]. A

computational Grid is a Grid that focuses on setting aside resources particularly for

enhancing computing power. A scavenging Grid usually includes large numbers of

desktop machines which are scavenged for available CPU cycles and other resources.

A data Grid is responsible for housing and providing access to data across multiple

organizations.

There are a lot of Grid users who have high demand of computing power to solve

large scale problems such as material structure simulation, weather forecasting, fluid

dynamic simulation, etc. Beside vast number of single-program applications, which

has only one sequential or parallel program, there exist many applications, which

require the co-process of many programs following a strict processing order. Since

those applications are executed on the Grid, they are called Grid-based workflows.

The concept of workflow first arose in business environments, e.g. [34], and has drawn

2CHAPTER 1. INTRODUCTION

a lot of attention in the database and information system research and development

communities [35, 46, 60]. Although business applications play a significant role in

research and development, Grid-based workflows are also important as computing

expands into the routine activities of scientists. A Grid-based workflow is distin-

guished from a conventional single application in two main characteristics. Firstly,

it includes many sub-jobs, each of which can be a sequential or a parallel program;

secondly, data dependency exists among sub-jobs.

Traditionally, to run the application, users submit it to a Grid system and the

system will try to execute it as early as possible [36, 80, 5, 78, 30, 19, 109]. That

best-effort mechanism is not suitable when users need the running result at a specific

time, and hence require that the application must be run at a specific period. This

requirement must be agreed on by both users and the Grid system before the appli-

cation is executed. This task can be done by a Service Level Agreement (SLA) [113].

The purpose of the SLA is to identify the shared goals and objectives of the concerned

parties. A good SLA is important as it sets boundaries and expectations for the fol-

lowing aspects of a service provisioning. An SLA clearly defines what the user wants

and what the provider promises to supply, which helps to reduce the chances of dis-

appointing the customer. The provider’s promises also help the system stay focused

on customer requirements and assure that the internal processes move in the right

direction. An SLA describes a clear, measurable standard of performance. Based on

this description, internal objectives become clear and measurable. An SLA defines

penalties. This criterion makes the customer understand that the service provider

truly believes in its ability to achieve the set of performance levels. It makes the re-

lationship clear and positive. In this context, an SLA sets the expectations between

the consumer and the provider, and defines the relationship between the two parties.

Only computational Grid is suitable to run sub-jobs of the workflow within SLA

context as it brings many important advantages for this task.

•The computational Grid connects many High Performance Computing Centers

(HPCCs) all over the world. Only these centers can handle the high computing

demand of scientific applications.

•The cluster or super computer in a HPCC is relatively stable and well main-

tained. This is an important feature to ensure finishing the sub-job within a

specific period of time.

•The HPCCs usually connect to the worldwide network by high speed links,

whose broad bandwidth makes the data transfer among sub-jobs easier and

faster.

Scavenging Grid and data Grid cannot fully satisfy all of the above features. Scav-

enging Grid includes many desktop computers distributed over the internet. Those

computers are connected by heterogeneous network infrastructure and thus, cannot

ensure to provide a good performance to the parallel application especially the one

having great communication among tasks. Data Grid concentrates to manage and

store data over the network. Its primary function is not executing high performance

applications. Therefore, they are not considered in this work. In computational Grid,

each resource joining to the system supporting SLA for workflow is a HPCC, which

usually has a set of computing nodes, a mass storage and a number of experts for

technical support. The resources in each HPCC are managed by software called local

Resource Management System (RMS). In this dissertation, the acronym ”RMS” is

used to represent the cluster/super computer as well as the Grid services provided by

the HPCC.

Supporting SLAs for the Grid-based workflow that mainly aims at finishing the

workflow execution on the Grid within a pre-determined period of time faces several

problems.

The first problem is the lack of effective mapping mechanism to map each sub-job

of the workflow to resources in a manner that can satisfy two main criteria: being

able to finish workflow execution on time and being able to optimize the running cost.

The first criterion is quite clear because it is the main reason for an SLA system to

exist. The latter criterion is derived from the business aspect of an SLA. If a customer

wants to use a service, he must pay for the service usage and has the right to receive

it with an appropriate quality. An automated mapping is necessary as it frees users

from the tedious job of assigning sub-jobs to resources under many constraints such

4CHAPTER 1. INTRODUCTION

as workflow integrity, on time condition, optimal condition, etc. Additionally, a good

mapping mechanism will help users to save money and to increase the efficiency of

using Grid resources.

Secondly, to ensure the SLA for workflow requires the co-operation of many com-

ponents in the Grid. The co-operating procedure to reach the agreement about pro-

viding the service among those components does not exist in the literature. Without

such a procedure, users must negotiate SLA for each sub-job in the workflow. This

action also creates a significant problem for the user’s patience, especially when the

number of sub-jobs in the workflow increases to large values.

Another problem is related to errors which may occur during the execution of the

workflow. Randomly appearing errors may damage the workflow completion as well

as the negotiated SLA. Thus, this demands building an error recovery mechanism for

a workflow in order to eliminate the affection of error to users and to make the Grid

system more stable and reliable.

Finally, a system which can execute a real Grid-based workflow within the SLA

context needs to be built. The system must integrate all of the above features in

order to help users to minimize the works. It should do the task by itself.

Supporting SLAs for a workflow in the Grid environment is a new problem and

still in the initial phase of the exploiting process. The work in this dissertation will fill

the gap by building a skeleton architecture which can be the basis for deploying the

real service. Within the scope of this dissertation, I concentrate on the selected core

problems of a system, which supports SLA for a workflow in the Grid environment,

as stated above. Other important issues such as security, charging/accounting, risk

management and so on are not considered in this thesis. The main contributions of

this dissertation are:

•A new problem statement which supports executing a workflow on reserved

Grid resources within the scope of a business contract.

•A mapping mechanism, which includes several sub optimization algorithms,

to map sub-jobs of the workflow to the Grid resources within SLA context

to satisfy the specific user’s runtime requirement and to optimize the cost.

At present, the size of the Grid is still small. For example, the Distributed

European Infrastructure for Supercomputing Applications (DEISA) includes

only 11 sites. Based on that fact, a distributed mapping model for very large

size Grid is not an urgent requirement at the present and thus, it is not focused

in this work. The proposed central mapping mechanism is the heart of the

system. The goal of this main work is to provide fast response solution while

ensuring the QoS for customers. Thus, it reduces the overhead of the workflow

execution time and encourages users to utilize the services.

•An SLA negotiation protocol for workflows. This protocol is the core procedure

to co-operate many components of the Grid to ensure the SLA for the workflow.

•An error recovery mechanism for workflows within an SLA context. The mech-

anism takes into consideration the catastrophic failure when one or several Grid

resources are detached from the Grid system. Other mechanisms to increase the

reliability of the system such as risk assessment, slack-based scheduling, security

of the data connection are not included in this dissertation.

•A prototype system to realize all the proposed theories.

All the research results presented in this dissertation were also published in [97,

98, 99, 100, 101, 102].

Chapter 2

Background and review of the

state of the art

2.1 Introduction to Grid-based workflow

Workflows received enormous attention in the databases and information systems

research and development community [35], [46], [60]. According to the definition

from the Workflow Management Coalition (WfMC) [125], a workflow is ”The au-

tomation of a business process, in whole or parts, where documents, information or

tasks are passed from one participant to another to be processed, according to a set of

procedural rules.” Although business workflows have great influence on research and

development, another class of workflows emerges naturally in sophisticated scientific

problem-solving environments called Grid-based workflow [82, 11, 105]. A Grid-based

workflow differs slightly from the WfMC definition as it concentrates on intensive com-

putation and data analyzing but not the business process. A Grid-based workflow is

characterized by following features [112, 129].

•A Grid-based workflow usually includes many applications which perform data

analysis tasks. However, those applications, which are also called sub-jobs, are

not executed freely but in a strict sequence.

•A sub-job in the Grid-based workflow depends tightly on the output data from

2.1. INTRODUCTION TO GRID-BASED WORKFLOW 7

previous sub-job. With incorrect input data, the sub-job will produce wrong

result and damage the result of the whole workflow.

•Sub-jobs in the Grid-based workflow are usually computationally intensive tasks,

which can be sequential or parallel programs and require long runtime.

•Grid-based workflows usually require powerful computing facilities such as super

computers or cluster to run on.

It can be seen that the Grid-based workflow and the business workflow have the

same primary characteristic as they both have a procedure that applies a specific

computation into selected data according to certain rules. Each Grid-based workflow

is defined by three main factors.

•Tasks. A task in the Grid-based workflow is a sub-job which is a specific

program doing a specific function. Within a Grid-based workflow, a sub-job

can be a sequential program or a parallel program. The sub-job usually has

long running period and needs powerful computing resources. Each sub-job

requires specific resources for the running process such as operating system

(OS), amount of storage, CPU, memory, etc.

•Control aspect. The control aspect describes the structure and the sequence

in processing of sub-jobs in the workflow.

•Information aspect. The information aspect of the Grid-based workflow is

presented by data transmissions. The dependency among sub-jobs can also be

identified by the data transmission task. A sub-job is executed to produce a

number of output data, which are the input data for the next sub-job in the

sequence. These data must be transferred to the place where the next sub-job is

executed. Within a Grid-based workflow, the quantity of data to be transferred

between two sub-jobs varies from several KB to a hundred GB depending on

the type of application and its scope.

Figure 2.1 depicts a sample scenario by considering weather forecasting as an ex-

ample Grid-based workflow. The main requirement is that a three hours forecast

8CHAPTER 2. BACKGROUND AND REVIEW OF THE STATE OF THE ART

Data

collection

Cloudiness

Turbulence

Phase of

precipitation

Moisture

convergence

Visibility

Linear Dynamic

Modelling Visualization

Input Output

Figure 2.1: Weather forecast processing workflow

should be available within 20 minutes after all necessary data have been collected. In

the workflow, the data collection module collects the input data from various sources

such as radars, satellites, lightning detectors, etc. Then the data are processed by

several modules to get information about cloudiness, visibility, moisture convergence,

phase of precipitation, etc. Thereafter, the results are processed by the linear module

to interpolate between the field analysis and the forecast of the numerical weather

prediction models. This data is used in dynamic modeling to build high-resolution

models with special physical parameterization schemes for a precise prediction of

weather events. Finally, the weather information is visualized by the dedicated mod-

ule.

Because the data and computing resource may spread in a physically distributed

environment, running the workflow needs a mechanism to handle the data transfer

and to invoke the computational tools over a distributed and heterogeneous platform.

This mechanism is exactly the goal that Grid computing technology attempts to

achieve in scientific environments.

2.2 Introduction to Grid computing

Built on Internet infrastructure, Grid computing enables organizations to share com-

puting and information resources across departmental and organizational boundaries

in a secure, highly efficient manner. Organizations around the world are utilizing Grid

2.2. INTRODUCTION TO GRID COMPUTING 9

computing today in such diverse areas as collaborative scientific research, drug dis-

covery, financial risk analysis, and product design. Grid computing enables research-

oriented organizations to solve problems that were not feasible to solve due to com-

puting and data-integration constraints. Grids also reduce costs by means of automa-

tion and improving IT resource utilization. Finally, Grid computing can increase an

organization’s agility enabling more efficient business processes and greater respon-

siveness to change. Over time grid computing will enable a more flexible, efficient

and utility-like global computing infrastructure.

2.2.1 Grid computing definition

An exact and complete definition of Grid is still under discussion. Informally, Ian

Foster proposed three criteria which a Grid system should satisfy [38]. A Grid is a

system that:

•”coordinates resources that are not subject to centralized control - (A Grid in-

tegrates and coordinates resources and users that live within different control

domains for example, the user’s desktop vs. central computing; different admin-

istrative units of the same company; or different companies; and addresses the

issues of security, policy, payment, membership, and so forth that arise in these

settings. Otherwise, we are dealing with a local management system.)”

•”using standard, open, general-purpose protocols and interfaces - (A Grid is

built from multi-purpose protocols and interfaces that address such fundamental

issues as authentication, authorization, resource discovery, and resource access.

As I discuss further below, it is important that these protocols and interfaces

be standard and open. Otherwise, we are dealing with an application-specific

system.)”

•”to deliver nontrivial qualities of service - (A Grid allows its constituent re-

sources to be used in a coordinated fashion to deliver various qualities of service,

relating for example to response time, throughput, availability, and security,

and/or co-allocation of multiple resource types to meet complex user demands,

10CHAPTER 2. BACKGROUND AND REVIEW OF THE STATE OF THE ART

so that the utility of the combined system is significantly greater than that of the

sum of its parts.)”

2.2.2 Grid architecture description

To co-operate multiple distributed resources, it is necessary to have a common pro-

tocol that all components in the grid must follow for a uniform working mechanism.

A protocol definition specifies how distributed system elements interact with one an-

other in order to achieve a specified behavior, and the structure of the information

exchanged during this interaction. Standard protocols have emerged as important

and essential means of achieving the inter-operability that Grid systems depend on.

In [39], Foster et al present a Grid taxonomy which has a great influence on the re-

search and development of Grid systems. According to [39], the modern Grid system

is built on two main concepts: Virtual Organization (VO) and service.

VO is a set of individuals and/or institutions sharing resources under clearly

and carefully defined rules about what is shared, who is allowed to share, and the

conditions under which a share occurs. In reality, VOs vary tremendously in the size,

types of resource sharing, number of participants, sharing rules, etc. VOs enable

many individuals, groups, and institutions to share resources in a controlled fashion

so that members can co-operate to achieve a common goal.

In Computer Science, a service is a software serving a specific purpose. More

specifically, as stated in [39], ”A service is a network-enabled entity that provides a

specific capability, for example, the ability to move files, create processes, or verify

access rights. A service is defined in terms of the protocol one uses to interact with

it and the behavior expected in response to various protocol message exchanges (i.e.,

”service = protocol + behavior.”).”

An overall system architecture, which identifies fundamental system components,

specifies the purpose and function of these components, and indicates how these

components interact with one another, is depicted in Figure 2.2. Figure 2.2 also

presents the correlative functions between layers of the Grid protocol and the internet

protocol.

2.2. INTRODUCTION TO GRID COMPUTING 11

Fabric

Connectivity

Resource

Collective

Application

Link

Internet

Transport

Application

Grid Protocol Architecture

Internet Protocol Architecture

Figure 2.2: The layered Grid architecture and its relationship to the Internet

protocol architecture [39]

Fabric layer

The fabric layer contains physical resources that people want to share and access.

Those resources can be computational, storage or network resources, code repositories,

databases, etc. The physical resources have all characters of Grid hardware, great in

number, heterogeneous in configuration, distributed in location, and various in usage

policy.

In each site, the physical resources are managed by software, which handles the

management task locally. The management software does scheduling, makes alloca-

tion, controls the sate of resources, and provides access service for local users. Sample

of those managements software includes PBS [93], CCS [58], etc.

The layer also has some software components to provide functions supporting

higher layer features such as remote access operation, sharing operation, etc.

12CHAPTER 2. BACKGROUND AND REVIEW OF THE STATE OF THE ART

Connectivity layer

The connectivity layer is characterized by two main features: the core communication

and authentication protocols required for Grid-specific network transactions.

The communication protocol defines the way in which data are transferred be-

tween different resources in the Fabric layer and based on the existing internet data

communication technology, the TCP/IP protocol.

The authentication protocol provides secure mechanisms for verifying the identity

of users and resources. For a distributed resource environment like Grid, which pro-

vides virtual organization interface to user, the authentication protocol should have

the following features.

•Single sign on. Users log on to system once and can access many resources in

the Fabric layer without entering an additional username or password.

•Delegation. Users have the ability to assign their power to a program so that

it can also access the resources that they have the right to access. Beside that,

these programs can also assign the right to other programs.

•Integration with various local security solutions. The Grid security so-

lution must be able to co-operate many different security policies from different

organizations in the Grid.

•User-based trust relationships. Users should have the ability to use re-

sources from many sites together without the need of interaction among those

sites.

Resource layer

The Resource layer is an intermediate layer coping with the secure negotiation, ini-

tiation, monitoring, control, accounting, and payment of sharing operations on indi-

vidual resources. This layer does the task by calling functions provided by the Fabric

layer. Resource layer protocol can be divided into two classes.

Information protocols are used to obtain information about the structure and state

of a resource, for example, its configuration, current load, and usage policy.

2.3. INTRODUCTION TO SERVICE LEVEL AGREEMENT 13

Management protocols are used to negotiate access to a shared resource. Negoti-

ating information includes resource requirement, quality of service and the operation

to be performed.

Collective layer

While the Resource layer is focused on interactions with a single resource, the Col-

lective layer works on global scale and captures interactions across collections of re-

sources. The scope of function in this layer is wide and flexible such as directory

services, co-allocation services, data replication services, etc.

Applications layer

At the top of any Grid system are user applications which are constructed in terms

of, and call on, the components in any other layer. User applications range from

material simulation to image processing, etc.

The theory structure as described above was implemented following Open Grid

Services Architecture (OGSA) [40] and realized by Globus Toolkit [48].

2.3 Introduction to Service Level Agreement

The main purpose of an Information Technology organization is to provide a com-

puting service which satisfies the customers’ business requirements. To achieve this

goal, the organization needs to understand those requirements and to evaluate its own

capability of providing the service and measures the service delivered. To enable the

realization of the process, the service and level of delivery required must be identified

and agreed between the organization and its users. It is usually done by Service Level

Agreements (SLAs), which are contracts developed jointly by the organization and

the customers.

14CHAPTER 2. BACKGROUND AND REVIEW OF THE STATE OF THE ART

2.3.1 Definition

A Service Level Agreement (SLA) identifies the agreed upon services that will be

provided to a customer in order to ensure that they meet the customer’s requirement.

The SLA identifies customers’ expectations and defines the boundaries of the service,

stating agreed-upon service level goals, operating practices, and reporting policies.

Webopedia defines the SLA as ”Abbreviated SLA, a contract between an ASP (Appli-

cation Service Provider) and the end user which stipulates and commits the ASP to

a required level of service. An SLA should contain a specified level of service, support

options, enforcement or penalty provisions for services not provided, a guaranteed level

of system performance as relates to downtime or uptime, a specified level of customer

support and what software or hardware will be provided and for what fee.” [123]

2.3.2 Structure

A common SLA contains following components:

•Parties joining the agreement. The agreement is made between the service

provider and the service user. Two participants should exist as individuals,

either by name or by title. Both sides must sign the document.

•Type and the time window of the service to be provided. The SLA must state

clearly which service will be provided and the time window during which the

service is provided to the user. In fact, there are a lot of system components

contributing to the type definition of the service. They can be the number of

processors, processor speed, amount of memory, communication library, etc.

•The guaranty of the provider to provide the appropriate service and perfor-

mance. The SLA must state clearly how well the service will be provided to the

user as Quality of Service. Penalties must also be figured out if a certain QoS

cannot be satisfied.

•The cost of the service. Business users wishing to use any service have to pay

for the usage. The cost depends on the quantity of service usage and how long

2.3. INTRODUCTION TO SERVICE LEVEL AGREEMENT 15

the user uses it.

•The measurement method and reporting mechanism. The SLA defines which

parameter will be measured and how it will be measured. Data collected from

the monitoring procedure are important as they help user and provider check

the validity of the SLA.

2.3.3 Applying the Service Level Agreement

Applying the SLA requires the co-operation from both service provider and customers.

The provider is responsible for preparing the infrastructure and deploying the service.

The provider should train customers on SLA so they can understand and are willing

to use the new form of service. The following will describe some remarkable notes for

applying the SLA.

•The provider should deploy a high visibility, a well-understood application and

target only the things that can be measured.

•The provider must recognize the relationship between the architecture and what

the maximum levels of availability are. Thus, an SLA cannot be created in a

vacuum. An SLA must be defined with the infrastructure in mind. If not, the

provider cannot meet the user’s requirement and violates the SLA.

•A relationship exists between the levels of availability and the related cost.

Some customers need higher levels of availability and are willing to pay more.

Therefore, having different cost policies with different level of service quality is

a common approach.

•The provider commits to reporting measurements clearly and accurately, and

to take actions to avoid service degradation.

•The customer should forecast workload as accurate as possible. This is impor-

tant for both customer and provider. Accurately estimating workload helps the

provider reserve appropriate amount of resource for the task. If less resource

16CHAPTER 2. BACKGROUND AND REVIEW OF THE STATE OF THE ART

than required is reserved, the work cannot be done. If more resource than

required is reserved, the resource is wasted and user must pay more.

•The customer must represent accurately the service requirement.

•The customer should describe details on how and who receives the reports.

2.4 Review of the state of the art

The work in this dissertation is building a system supporting QoS for the Grid-

based workflow. Therefore, in this section, we will discuss other related projects

appeared in the literature. Nearly current with the work in this dissertation, there

are some other efforts related to supporting QoS for workflow [86, 130, 8]. They are

distinguished from each other in the architecture, the workflow description language

and the mapping algorithm.

2.4.1 Architecture

Although having different architectures, all of them build services based on the in-

frastructures which support resources reservation.

Imperial College e-Science Network Infrastructure (ICENI) is an end-to-end Grid

middleware system developed at the London e-Science Centre [86]. It supports users

of the Grid, and adds value to their Grid experience, by enabling a complete workflow

pipeline in a transparent manner. The main feature of ICENI is running a workflow

over reserved resources to decrease the variance of execution. The overall architecture

of ICENI is presented in Figure 2.3.

Abstract workflows enter the scheduler and are translated into concrete workflows

detailing the resources to be used. The concrete workflow is then distributed to the

relevant resources through the launching services. The scheduler, which is used to

select the resources, and the software implementations to execute, may interact with

multiple Launchers where each Launcher is used as a mechanism to deploy work

onto one or more resources. Each Launcher may be associated with one Reservation

2.4. REVIEW OF THE STATE OF THE ART 17

Scheduler Reservation

Service

Performance

Store Launcher Reservation

engine

Figure 2.3: The architecture of ICENI system [86]

Engine, if reservation is possible. The Reservation Engine provides an abstraction

of the underlying DRM’s reservation system. The Reservation Service provides the

ability to co-reserve multiple resources so that all the resources involved in a workflow

can be reserved at the appropriate time. The Scheduler interacts with the Reserva-

tion service, which in turn communicates with the Reservation Engines through the

Launchers. There may be multiple Performance Stores, each of which can be interro-

gated by the Scheduler. Once a workflow is instantiated, it will have an Application

Service, which exists until the workflow terminates.

AgFlow is a middleware platform that enables the quality-driven composition of

Web services [130]. In AgFlow, the QoS of Web services is evaluated by means of

an extendable multi-dimensional QoS model, and the selection of component services

is performed in such a way as to optimize the composite service’s QoS given a set

of user requirements (i.e., constraints on QoS) and a set of candidate component

services. Furthermore, AgFlow adapts to change that occurs during the execution of

a composite service, by revising the execution plan in order to conform to the user’s

constraints on QoS. The overall architecture of AgFlow is presented in Figure 2.4.

There are three distinct components in the AgFlow system, namely, Web ser-

vices, service broker, and service composition manager. The service broker allows

providers to register their service descriptions in an UDDI registry. A service descrip-

tion contains meta-data that describe, among others, the capabilities and QoS of a

Web service. The service composition manager is made up of an execution planner

and an execution engine. When an instance of a composite service is initiated, the

18CHAPTER 2. BACKGROUND AND REVIEW OF THE STATE OF THE ART

Communication Bus

Web service Web service Web service

UDDI Registry

Service

Broker

Execution Planner

Service Composition manager

Execution Engine

Execution Plan Replan

Figure 2.4: The architecture of AgFlow system [130]

execution planner contacts the service broker to search for the candidate component

services, and based on the candidate services retrieved, it generates an execution plan,

i.e., an assignment of component services to the tasks in the schema of the composite

service. Based on the execution plan, the adaptive execution engine then orchestrates

the component services to execute the instance of the composite service. At runtime,

the execution engine also monitors the component services. When the current status

of the execution violates the execution plan, the execution engine triggers the execu-

tion planner to revise the current plan and resumes the new plan to orchestrate the

execution.

QoS-aware Grid Workflow is a project, which aims at extending the basic QoS

support developed within VGE [8] and GEMSS [9, 44] to Grid workflow applications.

In order to enable QoS-aware service composition and workflow execution, the project

extends VGE environment with a QoS-aware Grid Workflow Language (QoWL) and a

QoS-aware Grid Workflow Engine (QWE) [16]. The overall architecture is presented

in Figure 2.5.

The XML parser and un-parser generates the intermediary representation of the

QoWL workflow. The QoS negotiator queries the registries, generates necessary QoS

requests and receives offers from services. The Workflow planning component calcu-

lates a workflow execution plan. The Service Deployer exposes a QoWL workflow as a

Web service and the Workflow Executer starts the execution of the QoWL workflow.

In case of dynamic planning strategy the Workflow Planner and Workflow Executer

2.4. REVIEW OF THE STATE OF THE ART 19

QoWL

XML

Parser/

Unparser

Intermediary

Representation

QoS

Negotiator

Service

Deployer

Workflow

Executer

Workflow

planner

Meta Data

Flow

Analyzer IP

GA

QWE engine

WS

WSDL WSLA

S1 SmS3S2R1 RnR2

Registries Services

Figure 2.5: The architecture of QoS-aware Grid Workflow system [16]

are invoked in an alternating way.

2.4.2 Workflow language

ICENI uses an XML based language to describe the workflows that are submitted for

execution. The workflow describes a collection of components and the links between

them. These workflows are called Execution Plans (EP). When EPs are submitted

to the ICENI environment they are abstract in nature. The detailed language uses

the workflow definitions from the NeSC e-Science Workflow Services Workshop [33].

The communication between Scheduler and Launcher uses Job Description Markup

Language (JDML) [85] as the means of transport.

AgFlow uses statecharts to represent these dependencies.

In QoS-aware Grid Workflow project, for the specification of QoS constraints for

Grid workflow applications, a QoS-aware Grid Workflow Language (QoWL) is used.

The language utilizes a subset of the Business Process Execution Language (BPEL)

[15] with extensions for expressing QoS constraints. QoS extensions are necessary to

express both the requested QoS constraints of a workflow before the QoS negotiation

20CHAPTER 2. BACKGROUND AND REVIEW OF THE STATE OF THE ART

and the offered QoS of a workflow after the negotiation with the services since the

offered QoS may differ from the requested one. Local constraints usually address QoS

constraints of single tasks invoking external services. Global constraints address the

QoS of the overall workflow or of composite activities. QoS constraints of a workflow

and of the underlying tasks are expressed in terms of qos −constraints elements,

which may contain several qos−constraint elements and several candidate−registry

elements. Each qos −constraint element defines a name/value pair and a weight of

a QoS constraint.

2.4.3 Mapping mechanism

The ICENI project supposes that each task of the workflow can be performed by

one resource and each resource of the Grid can perform one task at a time. The

scheduling algorithm is used to determine the resources to be used to perform the

task and the implementation of the software to be used on these resources. A number

of algorithms such as simulated annealing, complete information game theory and best

of N random [128] are used to do mapping with time optimization. The scheduling

algorithms produce several potential concrete workflows. To determine which of these

will be run, the Scheduler will send them to the Reservation Service, which will return

one of the workflow for which it was able to obtain reservation.

AgFlow also supposes one task can be performed by one web service and one web

service can handle one task at a time. AgFlow uses a global planning approach to

map tasks to web services. The scheduling method is based on Integer Program-

ming (IP) [67] for selecting an optimal execution plan without generating all possible

execution plans. There are three inputs in an IP problem: a set of variables, an

objective function, and a set of constraints where both the objective function and

the constraints must be linear. IP attempts to maximize or minimize the value of

the objective function by adjusting the values of the variables while enforcing the

constraints. The output of an IP problem is the maximum (or minimum) value of

the objective function and the values of variables at this maximum (minimum). To

apply IP, AgFlow defined several constraints for the problem.

2.4. REVIEW OF THE STATE OF THE ART 21

•Allocation constraint: For each task t j , there is a set of Web services S j that

can be assigned (allocated) to it. However, for each task t j, one Web service

should be selected to execute this task.

•Constraints on Execution Duration, Price, and Reputation: The execution du-

ration of a given task t j must be the execution duration of one of the Web

services. If task t k is a direct successor of task t j , then the execution of t k

must start after task t j has been completed. The execution of a composite ser-

vice plan is completed only when all the tasks in the plan are completed. The

execution price of the composite service should not be greater than user’s expec-

tation. The reputation of all selected services must equal to a pre-determined

value.

•Constraints on Success Rate and Availability: The success rate and the avail-

ability of the execution plan must equal to a pre-determined value.

•The expected execution duration is an optimization criteria.

When using the global planning approach, an execution plan is built at the be-

ginning of the execution of the composite service. Once the execution has started,

several contingencies may occur, e.g., a component service becomes unavailable or

the QoS of one of the component services changes significantly. In these situations,

a re-planning procedure may be triggered in order to ensure that the QoS of the

composite service execution remains optimal.

The mapping module in QoS-aware Grid Workflow project includes two main

methods: static scheduling and dynamic scheduling. The former employs the Integer

Programming method from the work of AgFlow. The latter is similar to the working

mechanism of Condor DAG Man [79] except based on resource reserved infrastructure.

Chapter 3

Problem statement

Most of existing Grid-based workflows [82, 11, 105] can be presented under Directed

Acyclic Graph (DAG) form so that only the DAG workflow is considered in this work.

Figure 3.1 presents a sample of a workflow as a material for presentation.

Subjob 0

Subjob 5

Subjob 4

Subjob 3

Subjob 6Subjob 2Subjob 1

8 7

7

5 3

2

3

9

2

Figure 3.1: A sample Grid-based workflow

Traditionally, users let the Grid system run the whole workflow in best effort

manner [36, 80, 5, 78, 30, 19, 109, 20]. The Grid system tries to find suitable Grid

resources to run sub-jobs of the workflow as soon as possible. In this way, the runtime

of the workflow varies depending on the state of Grid resources. When there are a

lot of free Grid resources at runtime, the finish time of the workflow can be very

short, and vice versa. With some users, this model works fine but with some other

users, this is not good enough. For example, with the case of the weather forecasting

3.1. OPTIMAL COST 23

workflow [105], users want to receive the final result within 20 minutes from the start

of execution. With those users, finishing the running of the workflow within a specific

period of time is a mandatory requirement. Thus, they want to have an agreement

with the Grid system to ensure that the system will provide the service meeting the

requirement. This task can be defined with a Service Level Agreement (SLA).

To perform the SLA, a user must give his requirements and the system will check if

it can satisfy them. Among many requirement parameters, a system supporting SLA

requires the estimation of the sub-job’s runtime to do resource reservation. Thus, the

sub-job can be run on dedicated resources within a reserved time frame to ensure the

QoS (this is the runtime period). The runtime period of a sub-job can be estimated

from statistical data. The user usually runs a sub-job many times with different

resource configurations and different amount of input data before integrating it to

the workflow. The data from those running is a dependable source for the estimation.

Before the work in this dissertation, there have only been some works aimed

at supporting SLA for single Grid jobs [18, 68]. The literature has recorded some

recently proposed solutions to support QoS for workflow such as [86, 130, 16]. Most

of the proposed mechanisms suppose that a workflow includes many sub-jobs, which

are sequential programs, and a Grid service having the ability to handle one sub-job

at a time. This is not sufficient as sub-jobs in many existing workflows [82, 11, 105]

are parallel programs and many HPCCs provide computing services under a single

Grid service [18]. It is obvious that an HPCC can handle many sub-jobs at a time.

With different workload models and resource models, the system supporting SLA for

Grid-based workflow must solve the following problems.

3.1 Optimal cost

The system must have an efficient mapping mechanism to map sub-jobs of the work-

flow to the Grid resources. The mapping should find a solution that meet the user’s

requirements and is as inexpensive as possible. This is a great challenge because of

various parameters of workflow and Grid resources. The workflow could have greatly

24 CHAPTER 3. PROBLEM STATEMENT

different sub-job’s resource requirements. The Grid resources also have various config-

urations and have different free resources over time. The following parts will describe

this problem.

3.1.1 Workflow description

To describe the workflow information, users describe specifications about required

resources, data transfer among sub-jobs, and the estimated runtime of sub-jobs as

well as the expected runtime of the whole workflow. In more detail, we will look

at a concrete example, a simple Grid workflow presented in Figure 3.1. The time is

computed in slot. Each slot equals to a specific period of real time, usually from 2 to

5 minutes. The main requirement of this workflow is described as follows.

•Each sub-job having different resource requirements about hardware and soft-

ware configurations. Important parameters such as the number of CPU, the size

of storage, the number of experts, and the estimated runtime for each sub-job

in the workflow are described in table 3.1.

•The number above each edge describes the number of data to be transferred

between sub-jobs.

Sj ID CPU Storage exp runtime

0 51 59 1 21

1 62 130 3 45

2 78 142 4 13

3 128 113 4 34

4 125 174 2 21

5 104 97 3 42

6 45 118 1 55

Table 3.1: Sub-jobs’ resource requirements of the workflow in Figure 3.1

In the resource requirements of a sub-job, there are two types of resources: ad-

justable and nonadjustable. The nonadjustable resources are type of RMS, OS, and

communication library. If a sub-job requires a supercomputer it cannot run on a

3.1. OPTIMAL COST 25

cluster. If a sub-job requires Linux OS it cannot run on Windows OS. Other types of

resources are adjustable. For example, a sub-job which requires a system with CPU

1Ghz can run on the system with CPU 2 Ghz; a sub-job requiring a system with 2GB

RAM can run on the system with 4GB RAM. In the common case, all sub-jobs in a

workflow have the same nonadjustable resources and different adjustable resources.

The distinguishing character of the workflow description within SLA context lies in

the time factor. Each sub-job must have its estimated runtime correlative with specific

resource configuration to run on. If these parameters exceed the pre-determined

limitation, the SLA will be violated. Within the SLA context, the resources are

reserved over time. If a sub-job runs out of an estimated time period, it will occupy

the resource of other reserved sub-job. This is not allowable in an SLA system.

3.1.2 Character of RMSs

In the Grid environment, there are a lot of RMSs. Each RMS has its own resource

configuration and this configuration is usually different from one RMS to another.

To ensure that the sub-job can be executed within a dedicated time period, the

required CPU, the storage, the expert in the RMS must be ready at that time. This

mechanism can be done only if the RMS supports advance resource reservation, for

example CCS [58]. Figure 3.2 depicts a sample CPU reservation profile in such a

RMS. Queuing-based RMSs are not suitable for our requirement, as no information

about the starting time is provided. In our system, we reserve three main types of

resource: CPUs, storages and experts.

For present purposes, suppose that we have three involved RMSs executing the

sub-jobs of the workflow, reservation information of the resources is presented in Table

3.2. Each RMS represented by an ID hpc value has different number of free CPU,

memory and expert during a specific period of time. The sample resource reservation

profiles of the RMSs are empty.

If two sequential sub-jobs are executed in the same RMS, it is not necessary to

do data transfer, and the time used for this work equals to 0. Otherwise, the data

transfer between them must be performed. As this is a common task in the Grid,

26 CHAPTER 3. PROBLEM STATEMENT

Number CPU available

1728

Number CPU

require

0 21 67 82

51

45

166

435

419

357

138

time

Figure 3.2: A sample CPU reservation profile of a RMS

ID ID hpc CPUs mem exp start end

31 2 128 256000 8 0 1000000

23 0 128 256000 9 0 1000000

30 1 128 256000 6 0 1000000

Table 3.2: RMSs resource reservation

the bandwidth of the link between two local RMSs also plays an important factor in

contributing to ensure SLA for the workflow. To make sure that a specific amount of

data will be transferred within a specific period of time, the bandwidth must also be

reserved. Unfortunately, up to now there has been no mechanism responsible for that

task in the worldwide network. Here, to overcome this elimination, we use a central

broker mechanism. The link bandwidth between two local RMSs is determined as

the average bandwidth between two sites in the network which has a different value

with each different couple of RMSs. Whenever having a data transfer task on a link,

the SLA broker will determine which time slot is available for the task. During the

specified period, the task can use the whole bandwidth while other tasks must wait.

Using this principle, the bandwidth reservation profile of a link will look similar to

the one depicted in Figure 3.3. A more correct model with bandwidth estimation

[121, 122] can be used to determine the bandwidth within a specific time period

instead of the average value. In both cases, the main mechanism is unchanged.

3.1. OPTIMAL COST 27

10MB/s

Bandwidth

0 21 50 65 138

time

100

Figure 3.3: A sample bandwidth reservation profile of a link between two

RMSs

3.1.3 Formal requirement statement

The formal specification of the described problem includes following elements:

•Let Rbe the set of Grid RMSs. This set includes a finite number of RMSs,

which provide static information about controlled resources and the current

reservations/assignments.

•Let Sbe the set of sub-jobs in a given workflow including all sub-jobs with the

current resource and deadline requirements.

•Let Ebe the set of edges in the workflow, which express the dependency between

the sub-jobs and the necessity for data transfers between the sub-jobs.

•Let Kibe the set of resource candidates of sub-job si. This set includes all

RMSs, which can run sub-job si,Ki⊂R.

Based on the given input, a feasible and possibly optimal solution is sought, which

allows the most efficient mapping of the workflow in a Grid environment with respect

to the given global deadline. The required solution is a set defined in Formula 3.1.

M={(si, rj, start slot)|si∈S, rj∈Ki}(3.1)

If the solution does not have start slot for each si, it becomes a configuration as

defined in Formula 3.2.

28 CHAPTER 3. PROBLEM STATEMENT

a={(si, rj|si∈S, rj∈Ki}(3.2)

A feasible solution must satisfy following conditions:

•Criteria 1: The finished time of the workflow must be smaller or equal with

the expected deadline of the user.

•Criteria 2: All Ki6=∅. There is at least one RMS in the candidate set of each

sub-job.

•Criteria 3: The dependencies of the sub-jobs are resolved and the execution

order remains unchanged.

•Criteria 4: The capacity of an RMS must equal or greater than the requirement

at any time slot. Each RMS provides a profile of currently available resources

and can run many sub-jobs of a single flow both sequentially and in parallel.

Those sub-jobs, which run on the same RMS, form a profile of resource require-

ment. With each RMS rjrunning sub-jobs of the Grid workflow, with each time

slot in the profile of available resources and profile of resource requirements, the

number of available resources must be larger than the resource requirement.

•Criteria 5: The data transmission task eki from sub-job skto sub-job simust

be taken place in a dedication time slots on the link between RMS running

sub-job skto RMS running sub-job si.eki ∈E.

In the next phase the feasible solution with the lowest cost is sought. The cost C

of running a Grid workflow is defined in Formula 3.3. It is the sum of four factors:

money for using CPU, money for using storage, cost of using experts knowledge and

finally money for transferring data between the involved resources.

i=1

si.rt∗(si.nc∗rj.pc+si.ns∗rj.ps+si.ne∗rj.pe) + Xeki.nd∗rj.pd(3.3)

3.1. OPTIMAL COST 29

with si.rt, si.nc, si.ns, si.nebeing the runtime, number CPU, number storage, num-

ber expert of sub-job sirespectively. rj.pc, rj.ps, rj.pe, rj.pdare the price of using CPU,

storage, expert, data transmission of RMS rjrespectively. eki.ndis the number of

data to be transferred from sub-job skto sub-job si.

If two sequential sub-jobs run on the same RMS, the cost of transferring data

from the previous sub-job to the later sub-job is neglected.

The ability to find a good solution depends mainly on the resource state at the

expected period when the workflow runs. During that period, if the number of free

resources in the profile is large, there are lots of feasible solutions and we can choose

the cheapest one. But if the number of free resources in the profile is small, simply

finding out a feasible solution is difficult. Thus, a good mapping mechanism should

find out a cheap solution when there are a lot of free resources and find out a feasible

solution when there are few free resources in the Grid.

Supposing the Grid system has mRMSs, which can satisfy the requirement of n

sub-jobs in a workflow. As a RMS can run several sub-jobs at a time, finding out the

optimal solution needs (mn) loops. It can be shown easily that the optimal mapping

of the workflow to Grid RMS as described above is an NP hard problem [12].

From the above description, we can see that this is a scheduling problem. However

the problem has many distinguished characteristics.

•An RMS can handle many sub-jobs of the workflow at a time. The RMS

supports resource reservation.

•A sub-job is a parallel application.

•The destination of the problem is optimizing the cost. User imposes some strict

requirements on the Grid system and pays for the appropriately received service.

It is obvious that the user prefers a good service with the cost as low as possible.

The cost of running a workflow includes the cost of using computation resources

and the cost of transferring data among sub-jobs.

As no previous work has similar context, a new strategy must be developed to

handle the new requirements. An efficient mapping mechanism will satisfy that pref-

erence and also increase the efficiency of using Grid resources.

30 CHAPTER 3. PROBLEM STATEMENT

3.2 Automated negotiation

The system must have an automated negotiation mechanism. ”Automated” means

that the user has little interference in the negotiation process. The complexity of

the SLA negotiation protocol grows significantly if a workflow consisting of multiple,

dependent sub-jobs is considered. In a normal case, the user needs only submit the

workflow and receives the negotiation result. The Grid service providers locate dis-

tributed RMSs over the worldwide network. The negotiation mechanism co-operates

them to reach an agreement of providing services to the user.

Figure 3.4 depicts a common scenario when running a workflow in the Grid en-

vironment. The difference in the running scenario leads to more challenges for the

protocol of SLA workflow negotiation comparing to the formal single job.

Subjob 0

Provider 1 Subjob 1

Provider 2

Subjob 2

Provider 4 Subjob 3

Provider 3

SLA workflow broker

Figure 3.4: Scenario of running a workflow in the grid environment

•Because of a more complex structure, the SLA content will be also more com-

plex.

•Besides two formal participants, the consumer and the provider, here appears

a new component. It is the SLA workflow broker. Without it, the consumer

needs a separate SLA for each sub-job, which is a tedious and time-intensive

task.

3.3. ERROR RECOVERY 31

•With the data dependency between sub-jobs in the workflow, inter-provider

SLA negotiation is necessary.

This issue has not been addressed before. The existing system works only with

each individual sub-job and it cannot handle the whole workflow. The workflow

usually has a complex structure with many sub-jobs working in a strict manner. To

negotiate SLA with the service provider, the user must trace the workflow following

the process sequence to do the negotiation of running each sub-job. This task takes a

lot of time. With a big workflow consisting of 20 sub-jobs, the negotiation process is a

tedious job with many repeated tasks, which easily leads to confusion. The patience

of the user will also be challenged if, at the end, the workflow’s runtime is not within

the desired time period. The work described in [130] also aims at automating the

negotiation process for workflow. However, their work is mainly for Web services and

the aspects of SLA is not fully considered.

Having a mechanism to automate the SLA negotiation process will help the user

save a lot of time. It also frees users from annoying stuff and provides them an easy

way to use the service. This is important as it encourages users to utilize the service

and the system itself has chance to develop.

3.3 Error recovery

The system must have an effective error recovery mechanism to deal with errors

that can happen at any time in a large and complex distributed system like the

Grid. We concentrate on the serious error that one or several Grid service providers

are detached from the system while executing sub-jobs of the workflow. The error

recovery mechanism has to move sub-jobs of the affected workflow to other healthy

RMSs in a way that minimizes the workflow runtime.

When one RMS is detached out of the Grid system, all running/waiting sub-

jobs from several workflows in the RMS are considered as having failure because the

system cannot control the state and collect the result from them. Checkpoint images

of all sub-jobs in the failing RMS are not available to restart them in other healthy

32 CHAPTER 3. PROBLEM STATEMENT

RMSs. Besides that, the output data from finished sub-jobs in the failing RMS are

also not available. Therefore, several waiting sub-jobs in healthy RMSs cannot be

run because of no input data. In case of having to cancel the workflow because of

error, the system will be fined seriously as stated in the SLA. Thus, the system has

no way but to try to finish executing the workflow by rerunning all failing sub-jobs.

However, this task faces two major problems.

•Mapping and re-executing only failed sub-jobs in other healthy RMSs are not

sufficient. Workflow requires a strict execution order to ensure integrity. Only

considering the failed sub-jobs and forgetting others will lead to the potential of

breaking integrity. This problem has not been fully considered in the literature.

The existing systems consider only the individual sub-jobs so that only the

affected sub-job is re-mapped to other healthy RMSs. Unfortunately, this action

breaks the integrity of the workflow. For example, with the running scenario

as presented in Figure 3.4, sub-job 1 is planned to run from time slot 10 to

15, while sub-job 3 from time slot 18 to 23. If the resource running sub-job

1 fails at times slot 14, the system will restart sub-job 1 at another healthy

resource from time slot 16 without considering the sub-job 3. When the time

to execute sub-job 3 comes, it does not have the input data from sub-job 1 to

run and consequently fails. Thus, the failure of sub-job 3 will lead to the failure

of the whole workflow. Thus, determining all sub-jobs which are needed for the

continued workflow execution is the mandatory requirement.

•When sub-jobs in the workflow must be re-executed, the ability to finish the

workflow execution on time as stated in the original SLA is very low and the

ability to be fined because of not fulfilling the SLA is nearly 100%. Within

the SLA context, which relates to business, the fine is usually very high and

increased with the lateness of the finished time. Thus, those sub-jobs must be

mapped to the healthy RMSs in a way that minimizes the finished time.

An effective error recovery makes the Grid system more stable and reliable, im-

portant characters of a system supporting SLA.

Chapter 4

Mechanism of mapping workflow

to Grid resources

This chapter presents the complex mechanism to map a workflow to the Grid re-

sources. The mechanism includes several sub-algorithms to handle different workload

and resource scenarios.

4.1 Related works

The mapping of jobs to suitable resources is one of the core tasks in Grid Computing.

However, the majority of the research is related to the mapping of singular jobs,

which do not exhibit dependencies to other jobs regarding input/output data. The

mapping of workflows, where a single job is divided into several sub-jobs, is the

next research step. In the literature, there are many attempts at this issue such as

[30, 29, 19, 114, 81]. However, all those mechanisms work to map a workflow to the

Grid resources in best effort manner.

Cao et al. presented an algorithm that maps each sub-job separately on an indi-

vidual Grid RMS [20]. The algorithm processes one sub-job at a time, schedules it

to a suitable RMS with a start time slot not conflicting with the dependency of the

flow. The selection of the destination resources is optimized with respect to a minimal

completion time. When applying this strategy to the specified problem, each sub-job

34CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

will be assigned separately to the cheapest feasible RMS. This strategy allows fast

computation of a feasible schedule, but it does not consider the entire workflow and

the dependencies among the sub-jobs.

The mapping of Grid workflows onto Grid resources based on existing planning

technology is presented in [30, 29]. This work focuses on coding the problem to be

compatible with the input format of specific planning systems and thus transfers the

mapping problem to a planning problem. Although this is a flexible way to gain differ-

ent goals, significant disadvantages regarding the huge resource usage, long response

times. For example, with a workflow including 20 sub-jobs and the Grid including 10

RMSs, a planning system cannot solve the problem on a desktop computer because of

not enough memory [100]. If the size of the problem is slightly smaller, the planning

system needs several hours to find out the high quality solution. The latter character

is the main cause that a planning system should not be used for a broker.

In two separated works [130, 16], Zeng et al and Iwona et al built systems to

support QoS features for Grid-based workflow. In their work, a workflow includes

many sub-jobs, which are sequential programs, and a Grid service has ability to

handle one sub-job at a time. To map the workflow on to the Grid services, they used

the Integer Programming method. Applying Integer Programming to our problem is

impossible. The first is the flexibility in runtime of the data transfer task. The time

to complete the data transfer task depends on the bandwidth and the reservation

profile of the link which varies from link to link. The second is that an RMS can

handle many parallel programs at a time. Thus, we do not know how many, which

and when sub-jobs will be run in an RMS and we cannot present the constraint of

available capacity as described in Criteria 4.

Our problem has a close relation to the classical job shop scheduling problem

(JSSP) [17]. The DAG form of our workflow is similar to the graph representing

the sequence processing of JSSP. Each sub-job in our problem is correlative to each

operation in JSSP. As job shop scheduling problem is a NP hard problem, two main

methods to solve this problem – complete and incomplete method – exist. A complete

method explores systematically, though very often implicitly, the whole search space.

To do this, most complete methods construct in a ”step by step” way a solution and

4.1. RELATED WORKS 35

backtrack in case of failure [74]. These methods usually use various heuristics [106, 74,

45] to guide the choice of the next variable to be instantiated and its value, and employ

powerful filtering techniques to achieve different levels of consistency. Similarly, exact

methods for constraint optimization are usually based on the branch and bound

principle [74] and try to eliminate heuristically as many as possible solutions not

leading to an optimum one. Complete and exact methods have in general exponential

time complexity. The solving time required by such a method may consequently

become prohibitive for large-sized problems.

An incomplete (non-exact) method does not explore systematically the whole

search space. Instead, it tries to examine as rapidly as possible a large number of

search points according to a selective or random strategy. Local search is one of

the most popular examples of this family of methods. In general, these methods do

not guarantee the completeness of the resolution, but require no exponential time

complexity. They constitute in fact a very interesting alternative for the practical

solving of many hard and large-sized problems. The famous method in this way

include Tabu Search [49, 50], Simulated Annealing [70, 32], GA [65, 66], etc.

In the literature, when applying local search for the problem, the moving neigh-

borhood is the most important factor in determining the speed of the algorithm and

the quality of the solution. Many effective neighborhood structures N1, N2, N3, N4

[17] and N5 [90, 91] were proposed. The primary activity of these neighborhoods is

changing the process sequence of two operations in the same machine. However, while

each operation in the JSSP can be mapped to only one machine and one machine can

process only one operation at a time, each sub-job in our problem can be mapped

to several RMSs and each RMS can process several sub-jobs at a time. Thus, such

process sequence changing does not exist in our solution.

The flexible job shop scheduling problem (FJSSP) extends the JSSP by assum-

ing that, for each given operation, there exist several instances of the machine type

necessary to perform it. The FJSSP is far more complicated than the classical JSSP

with two main problems, that of assigning each operation to an appropriate ma-

chine, and that of sequencing the operation in each machine. To solve the FJSSP,

the research community applied several local search methods and proposed several

36CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

techniques [84, 64] to decrease the very large search space. Those techniques, based

strictly on the character of each machine, can do one operation at a time. Our prob-

lem differs from the FJSSP in that while each machine in FJSSP can process only

one operation at a time, each RMS in our problem can process several sub-jobs at a

time. Thus, we cannot apply the proposed techniques to our problem because of the

differences in characteristic.

Related to mapping task graph to resources, there is also the multiprocessor

scheduling precedence-constrained task graph problem [43, 71]. As this is a well-

known problem, the literature recorded a lot of methods for this issue, which can

be classified into several groups [77]. The classical approach is based on the so-

called list scheduling technique [1, 24]. More recent approaches are UNC Schedul-

ing [47, 107, 69, 127, 126, 76], BNP Scheduling [1, 72, 110, 7], TDB Scheduling

[25, 73, 23, 22, 3], APN Scheduling [104, 111, 88, 75], genetic [108, 59]. Our problem

differs from the multiprocessor scheduling precedence-constrained task graph problem

in many factors. In the multiprocessor scheduling problem, all processors are similar,

but in our problem, RMSs are heterogeneous. Each task in our problem can be a

parallel program, while each task in this problem is a strictly sequential program.

Each node in this problem can process one task at a time while each RMS in our

problem can process several sub-jobs at a time. For those reasons, applying a local

search approach for this problem is very rare. In our solution, modified Tabu search

is used and has proven to be very efficient.

In the SLA context, the problem of mapping a workflow to Grid resources is

greatly different from the formal one both in character of sub-job and character of Grid

resource in RMSs as presented in section 3.1. Recently appeared algorithms such as x-

DCP [83], minmin, maxmin, suffer [10, 21], GRASP [13] concentrate on scheduling the

workflow with parameter sweep tasks on Grid resources. The common destination of

those algorithms is optimizing the finished time of the workflow. Subtasks in this kind

of workflow can be grouped into layers and there is no dependency among subtasks

in the same layer. All proposed algorithms assume each task as a sequential program

and each resource as a compute node. By using several heuristics, all those algorithms

do the mapping very fast but the solution quality is not sufficient. Our workflow with

4.1. RELATED WORKS 37

DAG form can also be transformed to the workflow with parameter sweep tasks type.

A detailed description of those approaches that apply for our problem can be found

in following sections.

Minmin algorithm

For each layer in sequence {

while set of sub-jobs in the layer not empty {

For each sub-job in the layer {

For each RMS in the candidate set {

compute end_time of sub-job

store (RMS,end_time) in a list}

Select the minimum end_time

Store (sub-job, RMS, end_time) in a list}

Pick (sub-job, RMS, end_time) with min

end_time

Drop the selected sub-job out of layer

}}

Figure 4.1: Minmin algorithm

Min-min uses the Minimum MCT (Minimum Completion Time) as measurement,

meaning that the task that can complete the earliest is given priority. The motivation

behind Min-min is that assigning tasks to hosts that will execute them the fastest will

lead to an overall reduced finished time [10, 21]. To adapt minmin algorithm to our

problem, we analyze the workflow into a set of sub-jobs in sequential layers. Sub-jobs

in the same layer do not depend on each other. With each sub-job in the sequential

layer, we find the RMS, which can finish sub-job the earliest. The sub-job in the

layer which has the earliest finish time, will be assigned to the determined RMS. The

overall algorithm is presented in Figure 4.1.

38CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Maxmin algorithm

Max-min’s metric is the Maximum MCT. The expectation is to overlap long-running

tasks with short-running ones [10, 21]. To adapt maxmin algorithm to our problem,

we analyze the workflow into a set of sub-jobs in sequence layers. Sub-jobs in the

same layer do not depend on each other. With each sub-job in the sequential layer,

we find the RMS, which can finish sub-job the earliest. The sub-job in the layer

which has the latest finish time, will be assigned to the determined RMS. The overall

algorithm is presented in Figure 4.2.

For each layer in sequence {

while set of sub-jobs in the layer not empty {

For each sub-job in the layer {

For each RMS in the candidate set {

compute end_time of sub-job

store (RMS, end_time) in a list}

Select the minimum end_time

Store (sub-job, RMS, end_time) in a list}

Pick (sub-job, RMS, end_time) with max

end_time

Drop the selected sub-job out of layer

}}

Figure 4.2: Maxmin algorithm

Suffer algorithm

The rationale behind sufferage is that a host should be assigned to the task that would

”suffer” the most if not assigned to that host. For each task, its sufferage value is

defined as the difference between its best MCT and its second-best MCT. Tasks with

higher sufferage value take precedence [10, 21]. To adapt a suffer algorithm to our

problem, we analyze the workflow into set of sub-jobs in sequence layers. Sub-jobs

in the same layer do not depend on each other. With each sub-job in the sequential

layer, we find the earliest and the second-earliest finish time of the sub-job. Sub-job

4.1. RELATED WORKS 39

in the layer, which has the highest difference between the earliest and the second-

earliest finish time, will be assigned to the determined RMS. The overall algorithm

is presented in Figure 4.3.

For each layer in sequence {

while set of sub-jobs in the layer not empty {

For each sub-job in the layer {

For each RMS in the candidate set {

compute end_time of sub-job

store (RMS,end_time) in a list}

suffrage value=the earliest end_time - the

second earliest end_time

Store (sub-job, RMS, suffrage value) in a list

}

Pick (sub-job, RMS, suffrage value) with min

suffrage value

Drop the selected sub-job out of layer

}}

Figure 4.3: Suffer algorithm

GRASP algorithm

In this approach a number of iterations are made to find the best possible mapping

of jobs to resources for a given workflow. In each iteration, an initial allocation is

constructed in a greedy phase. The initial allocation algorithm computes the tasks

whose parents have already been scheduled on each pass, and consider every possible

resource for each such task. The overall algorithm is presented in Figure 4.4

w-DCP algorithm

The DCP algorithm is based on the principle of continuously shortening the longest

path (also called critical path (CP)) in the task graph, by scheduling tasks in the cur-

rent CP to an earlier start time. This principal was applied for scheduling workflows

40CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Repeat until time limit is reached

concreteWF=CreateMapping(workflow)

If concreteWF has lower finished_time than bestConcreteWF

bestConcreteWF=concreteWF

procedure CreateMapping(workflow)

While all jobs in workflow are not mapped do

Find availJobs=unmapped jobs with every parent mapped;

Map (availJobs)

procedure Map(availJobs)

While all availJobs not mapped do

For each job j do

For each resource R do

Calc ECT(j,R)

I-min = min finished_time increase over all j and R

I-max = max finished_time increase over all j and R

s.t finished_time increase <= I-min + a*(I-max - I-min)

(J*, R*) = random choice from availPairs

map(j*, R*)

Update EAT(j*, R*)

Figure 4.4: GRASP algorithm

with parameter sweep tasks on global Grids by Tianchi Ma et al in [83]. Directly ap-

plying this algorithm to our problem faces many difficulties especially in determining

parameter AEFT (absolute earliest finished time), ALFT (absolute latest finished

time). For this reason we propose a new algorithm called w-DCP as presented in

Figure 4.5. Several procedures such as computing the time table and determining

the critical path were employed from H-Map algorithm, which is described in section

4.2.3.

4.2. MAPPING WITH STANDARD METAHEURISTICS 41

Initialize initial solution, compute time table,

Loop until the finished_time can not be decreased {

While all sub-jobs in the critical path is not

marked {

Determine the critical path

For each unmarked sub-job in the critical path{

For each RMS candidate of that sub-job {

if first sub-job then compute end_time of the

sequence data transfer task

else compute the end_time of the sub-job

Push tuple (sub-job, RMS, end_time) to list

}

Pick the result with min end_time

Marked the sub-job

}}}

Figure 4.5: w-DCP algorithm

4.2 Mapping with standard metaheuristics

Our problem is a specific case of combinatorial optimization (CO) problem and meta-

heuristics are usually used to solve CO problems especially in mapping and schedul-

ing. Therefore, in the first step, we directly apply many famous metaheuristics to

our problem to see the advantages and disadvantages of each method. The selected

metaheuristics include Tabu Search[49, 50], Simulated Annealing[70, 32], Iterated Lo-

cal Search[116, 117], Guided Local Search[119, 120], Genetic Algorithm[65, 66], and

Estimation of Distribution Algorithm[94, 95]. Other methods such as Variable Neigh-

borhood Search [53], Ant Colony Optimization [31], etc, face many difficulties when

applied to our problem. For example, Ant Colony Optimization is specific for Trav-

eling Sale Man problem, thus converting to our case requires major modifications,

which can decrease the performance of the original algorithm.

42CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

4.2.1 General strategy

The input of the mapping procedure includes information about workflow and about

RMSs. The user provides the information of workflow through the SLA text, which is

described in section 5.3. The parser will extract information about workflow to a file

describing sub-jobs and a file describing the dependence. The file describing sub-jobs

contains the various sub-jobs’ resource requirements and estimated runtime. The file

describing the dependence among sub-jobs contains the source sub-job, the destina-

tion sub-job and the number of data transfer. Information about RMSs is stored in

a relational database which is specifically designed for the system. They include the

description of the resource configuration in each RMS, the resource reservation profile

of each RMS, and the bandwidth reservation profile of each link. Data about RMS

is collected from RMSs by the monitoring module. Based on this information, the

system will do the mapping.

1. Determine candidate RMSs for each sub-job.

If resource in the Grid free {

2. Call algorithm to find optimal cost solution

}else {

3. Call algorithm to find optimal finished_time solution

then call algorithm to find optimal cost solution

}

Figure 4.6: Mapping mechanism overview

Figure 4.6 presents the basic strategy of the mapping mechanism. Each sub-job

has different resource requirements about the type of RMS, the type of CPU, etc.

There are a lot of RMSs with different resource configurations. The first action is

finding among those heterogeneous RMSs the suitable RMSs which can meet the

requirement of the sub-job. Data of each resource parameter of RMS is represented

by a number value and is stored in a separate column in the database table. The co-

relative resource requirement parameter of sub-job is also represented by number value

in the same manner. Thus, the matching between sub-job’s resource requirement

and RMS’s resource configuration is done by several logic checking conditions in the

4.2. MAPPING WITH STANDARD METAHEURISTICS 43

WHERE clause of the SQL SELECT command. This work will satisfy Criteria 2.

With the case of our example, each sub-job of the workflow depicted in Figure 3.1

has three candidate RMSs as presented in Table 4.1.

Sj ID RMS RMS RMS

sj0 R1 R2 R3

sj1 R1 R2 R3

sj2 R1 R2 R3

sj3 R1 R2 R3

sj4 R1 R2 R3

sj5 R1 R2 R3

sj6 R1 R2 R3

Table 4.1: RMSs candidate for each sub-job

The second and the third actions mean that we need two different algorithms.

The first is used to find cost optimal solution while the second is used to find the

finished time optimal solution. The following sections will describe the tailoring of

different metaheuristics to form two algorithms.

4.2.2 Tabu Search (TS)

TS [49, 50] is among the most cited and used metaheuristics for CO problems. The

standard TS uses a tabu list that keeps track of the most recently visited solutions

and forbids moving toward them. This allows Tabu Search both to escape from the

local minima and to implement an explorative strategy.

Finding the minimal finished time with TS

The TS algorithm to find the minimal finished time of a workflow within an SLA

context is presented in Figure 4.7. Starting from initial solution, the algorithm checks

for the best configuration in the neighborhood of the current configuration in each

iteration. The best neighbor that does not violate Tabu rule is sought to replace the

current configuration even if it is no better than the current configuration in terms of

the finished time of the workflow. If the best neighbor has the finished time smaller

44CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

than all configurations found so far, it will replace the current configuration even it

violates Tabu rule.

Pick randomly an initial configuration a

Compute finished_time p of a

a”=a // a” is the final configuration, p” is the finished_time of a”

while(num_mv<max) {

for each solution in the neighborhood set of a {

compute the finished_time

store tuple (sub-job, RMS, finished_time) to a list

}

Sort the list according to finished_time criteria

Pick the best solution, which has finished_time<p” or not affect Tabu rule

Assign tabu_number to the selected RMS

if the selected solution has finished_time<p” then store in a”

Store the selected solution in a

num_loop++

}

Figure 4.7: TS algorithm to find the minimal finished time of the workflow

Besides the standard components of Tabu Search, there are some components

specific to the workflow problems.

The neighborhood set structure: A configuration can also be presented as a

vector. The index of the vector represents the sub-job, value of the element represents

the RMS. With a configuration a,a=a1a2...an|with all ai∈Ki, we generate n*(m-

1) configurations a0as in Figure 4.8. We change the value of xito every value in

the candidate list which is different from the present value. With each change, we

have a new configuration. After that we have set A,|A|=n*(m-1). Ais the set

of neighborhoods of a configuration. A detail neighborhood set for the case of our

example is presented in Figure 4.9.

The assigning sequence of the workflow: When the RMS to execute each

sub-job, the bandwidth among sub-jobs was determined, the next task is determining

a time slot to run sub-job in the specified RMS. At this point, the assigning sequence

4.2. MAPPING WITH STANDARD METAHEURISTICS 45

a

m-1

n

a

2

a

1

x

n

a

2

x

1

a

n

x

2

a

1

a

Where i

x

and i

x != i

a

in [1,n]

Figure 4.8: Neighborhood structure of a configuration

1 32121 3

{2,3} {2,3} {1,3}{2,3} {1,3}{1,2} {1,2}

Figure 4.9: Sample neighborhood structure of a configuration

of the workflow becomes important. The sequence of determining runtime for sub-

jobs of the workflow in RMS can also affect the final finished time of the workflow

especially in the case of having many sub-jobs in the same RMS.

In general, to ensure the integrity of the workflow, sub-jobs in the workflow are

assigned based on the sequence of the data processing. However, that principal does

not cover the case of a set of sub-jobs, which have the same priority in data sequence

and do not depend on each other. To examine the problem, we determine the earliest

and the latest start time of each sub-jobs of the workflow in ideal condition. The

time period to do data transfer among sub-jobs is computed by dividing the amount

of data to a fixed bandwidth. The earliest and latest start and stop time for each

sub-job and data transfer depends only on the workflow topology and the runtime of

sub-jobs but not the resources context. These parameters can be determined using

conventional graph algorithms. A sample of these data for the workflow in Figure

3.1, in which the number above each link represents number of time slots to do data

transfer, is presented in Table 4.2.

The ability of finding a suitable resource slot to run a sub-job depends on the

46CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Sub-job Earliest start Latest start

0 0 0

1 28 28

2 78 78

3 28 58

4 30 71

5 23 49

6 94 94

Table 4.2: Valid start time for sub-jobs of workflow in Figure 3.1

number of resource free during the valid running period. From the graph, we can see

sub-job 1 and sub-job 3 having the same priority in data sequence. However, from the

data in table 4.2, sub-job 1 can start at max time slot 28 while sub-job 3 can start at

max time slot 58 without affecting the finished time of workflow. Suppose that two

sub-jobs are mapped to run in the same RMS and the RMS can run one sub-job at a

time. If sub-job 3 is assigned first and in the worst case at time slot 58, sub-job 1 will

be run from time slot 92, thus the workflow will be late minimum 64 time slots. If

sub-job 1 is assigned first at time slot 28, sub-job 3 can be run at time slot 73 and the

workflow will be late 15 time slots. Here we can see the latest time factor is the main

parameter for evaluating the full affection of the sequential assigning decision. It can

be seen through the affection, mapping sub-job having smaller latest start time first

will make the latency smaller. Thus, the latest start time value determined as above

can be used to determine the assigning sequence. The sub-job having the smaller

latest start time will be assigned earlier. This procedure will satisfy Criteria 3.

Determining the timetable of the workflow: To determine the finished time

of the present solution, we have to determine the specific runtime period for each

sub-job and each data transfer task. The start time of a data transfer task depends

on the finish time of the source sub-job and the state of the link’s reservation profile.

We use the min st tran variable to present the dependence on the finish time of the

source sub-job. The start time of a sub-job depends on the latest finish time of the

related data transfer tasks and the state of the RMS’s reservation profile. We use

the min sj tran variable to present the dependency on the latest finish time of the

4.2. MAPPING WITH STANDARD METAHEURISTICS 47

related data transfer tasks. The task to determine the timetable for the workflow is

done with the procedure in Figure 4.10.

For each sub-job k following the assign sequence {

For each link from determined sub-jobs to k{

min_st_tran=end_time of source sub-job

Search reservation profile of the link

start_tran > min_st_tran

end_tran = start_tran+num_data/bandwidth

Store end_tran in a list

}

min_st_sj=max (end_tran)

Search in reservation profile of RMS running

k the start_job > min_st_sj

end_job= start_job + runtime

}

Figure 4.10: Procedure to determine time table for workflow

For each sub-job of the workflow in the assigning sequence, firstly, we find all the

runtime periods of data transfer tasks from previous sub-jobs to the current sub-job.

This period must be later than the finish time of the source sub-job. Note that with

each different link the transfer time is different because of different bandwidth. Then,

we determine the runtime period of the sub-job itself. This period must be latter than

the latest finish time of previous related data transfer task. The whole procedure is

not so complicated but time consuming. The most time consuming steps are located

in the searching reservation profiles. This procedure will satisfy the Criteria 4 and 5.

Finding the minimal cost with TS

The TS algorithm to find minimal cost solution is presented in Figure 4.11.

The TS algorithm for finding minimal cost of the workflow is similar to the al-

gorithm presented in Figure 4.7 with two minor differences. The first difference is

that the finished time value is replaced by the cost of the configuration. Cost value

48CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Repeat {

Pick randomly an initial configuration a

Compute finished_time p of a

} until p meets Criteria 1

Compute cost c of a

a”=a // a” is the final configuration, c” is the cost of a”

while(num_mv<max) {

for each solution in the neighborhood set of a {

compute the finished_time and cost

if finished_time meets Criteria 1

store tuple (sub-job, RMS, cost) to a list

}

Sort the list according to cost criteria

Pick the best solution, which has cost<c” or not affect Tabu rule

Assign tabu_number to the selected RMS

if the selected solution has cost<c” then store in a”

Store the selected solution in a

num_loop++

}

Figure 4.11: TS algorithm to find the minimal cost

is computed as described in Part 3.1.3. The second difference is that after choosing

randomly or taking a configuration afrom the neighborhood structure, the module

computing timetable is called on to check if the finished time of the configuration is

still within the user’s limitation. If the finished time of aexceeds the user’s limitation,

new configuration will be selected. This action will ensure Criteria 1.

4.2.3 Simulated Annealing (SA)

SA is an advanced local search method which finds its inspiration from the physical

annealing process studied in statistical mechanics [70, 32]. An SA algorithm repeats

an iterative repairing procedure, which looks for better solutions while offering the

possibility of accepting in a controlled manner worse solutions. The second feature

4.2. MAPPING WITH STANDARD METAHEURISTICS 49

allows SA to escape from the local optima.

Finding the minimal finished time with SA

The SA algorithm to find the minimal finished time of a workflow within an SLA

context is presented in Figure 4.12.

Pick randomly an initial configuration a

Compute the finished_time p of a

a”=a // a” is the final configuration, p” is the finished_time of a”

Assign initial temperature t

Assign initial step length l

while(num_mv<max) {

while(l_count<l) {

Pick randomly a’ in the neighborhood set of a

Compute finished_time p’ of a’

if((p’-p<0)or((p’-p>0) and (Probability exp(-(p’-p)/t) is verified))){

a=a’

if(p’<p”)

a”=a’

num_mv++}

l_count++; n_ite++}

t=t*(1-A/n_ite)

l=l*(1-A/n_ite) }

Figure 4.12: SA algorithm to find the minimal finished time

The neighborhood structure, the assigning sequence, and the procedure to deter-

mine timetable are the same as those described in Tabu Search section.

Temperature t, step length l and A are the parameters of the SA algorithm.

Increasing t will lead to increasing the possibility of accepting a worse solution. De-

creasing l will make the number of iterations for a move decrease. To determine the

value of t, A and l, we run the algorithm starting from the existed recommendation

values [52]. Then, we change the value of them and see the difference in the quality

of the solution. After several studies, we found with the initial values t=2.5; A=100;

50CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

l=100 the algorithm can find high quality solution than others.

Starting from initial solution, the algorithm check for a better configuration in the

neighborhood of the current configuration in each iteration. When number of itera-

tion increases, the temperature increases, the possibility to accept bad configurations

increases and the number of checking a new configuration within a temperature de-

creases.

Finding the minimal cost for workflow with SA

The SA algorithm to find minimal cost solution is presented in Figure 4.13.

The SA algorithm for finding the minimal cost of a heavy workflow is similar to

the algorithm presented in Figure 4.12 with two minor differences. The first difference

is that the finished time value is replaced by the cost of the configuration. The cost

value is computed as described in Part 3.1.3. The second difference is that after

choosing randomly configuration a0, the module computing timetable is called on to

check if the finished time of a0is within the user’s limitation. If the finished time of

a0exceeds the user’s limitation, a new configuration will be selected.

4.2.4 Iterated Local Search (ILS)

ILS is a simple but powerful metaheuristic algorithm [?,?]. It applies a local search

to an initial solution until it finds a local optimum. Then it perturbs the solution and

restarts the local search. If perturbation is too small, the algorithm might not escape

from the local area. If perturbation is too strong, the algorithm will be similar to a

random restart local search.

Finding the minimal finished time with ILS

The ILS algorithm to find the minimal finished time of a workflow within SLA context

is presented in Figure 4.14.

The neighborhood structure, the assigning sequence and the procedure to deter-

mine timetable are the same as those described in the Tabu Search section.

4.2. MAPPING WITH STANDARD METAHEURISTICS 51

Repeat {

Pick randomly an initial configuration a

Compute the finished_time p of a

} until p meets Criteria 1

Compute cost c of a

a”=a // a” is the final configuration, c” is the cost of a”

Assign initial temperature t

Assign initial step length l

while(num_mv<max) {

while(l_count<l) {

Repeat {

Pick randomly a configuration a’ from the neighbor hood of a

Compute the finished_time p’ of a’

} until p’ meets Criteria 1

Compute cost c’ of a’

if((c’-c<0)or((c’-c>0) and (Probability exp(-(c’-c)/t) is verified))){

a=a’

if(c’<c”)

a”=a’

num_mv++}

l_count++; n_ite++}

t=t*(1-A/n_ite)

l=l*(1-A/n_ite) }

Figure 4.13: SA algorithm to find the minimal cost

Local search algorithm is used to find the configuration which is the local optimum

in finished time. The basic steps of this algorithm include:

•Starting from the current configuration, we compute the finished time of all

configuration in the set of neighborhood in order to choose the best one.

•If the best found configuration has a smaller finished time than the current

configuration, we then replace the current configuration with the new one and

repeat the process.

52CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Pick randomly an initial configuration a

a<- LocalSearch(a)

Compute the finished_time p of a

a”=a // a” is the final configuration, p” is the finished_time of a”

m_loop=0 // m_loop stores times the configuration not improve

while(num_mv<max) {

a’ <- Perturbation(a)

a’<- LocalSearch(a’)

Compute finished_time p’ of a’

if(p’<p){

a=a’

m_loop=0

if(p’<p”){

a”=a’; p”=p’ }

}

else{

m_loop++ }

if(m_loop<max_val)

a=a’

}

Figure 4.14: ILS algorithm to find the minimal finished time

•If not, we stop the search process.

The perturbation procedure creates a new configuration from the current con-

figuration by changing the assigned RMS of a number of sub-jobs. The candidate

sub-job is selected randomly and the new RMS is selected randomly in the candidate

set. After a number of studies, we have found that the number of changes equals 4

bringing a good solution and we use this value for every case in the experiment.

The ILS algorithm starts with an initial configuration. It does perturbation to

create the new one and call local search to improve it as much as possible. If the

finished time of the new configuration is smaller than the current one, it will replace

the older. If we cannot find a better one after a number of perturbations, the current

4.2. MAPPING WITH STANDARD METAHEURISTICS 53

configuration will also be replaced with the new bad one.

Finding the minimal cost for workflow with ILS

The ILS algorithm to find the minimal cost solution is presented in Figure 4.15.

The ILS algorithm for finding the minimal cost of the heavy workflow is similar to

the algorithm presented in Figure 4.14 with two minor differences. The first difference

is that the finished time value is replaced by the cost of the configuration. Cost value

is computed as described in Part 3.1.3. The second difference is that after having

the new configuration a0by choosing random configuration, doing perturbation or

selecting from the neighborhood set, the module compute timetable is called on to

check if the finished time of a0is within the user’s limitation. If the finished time of

a0exceeds the user’s limitation, the new configuration will be selected.

4.2.5 Guided Local Search (GLS)

The basic GLS principle [119, 120] is to help the search to gradually move away from

local minima by changing the search landscape. In GLS, the set of solutions and the

neighborhood structure is kept fixed, while the objective function f is dynamically

changed with the aim of making the current local optimum ”less desirable”.

Finding the minimal finished time with GLS

The GLS algorithm to find the minimal finished time of a workflow within SLA

context is presented in Figure 4.16.

The neighborhood structure, the assigning sequence, and the procedure to deter-

mine timetable are the same as those described in Tabu Search section. The local

search module used in GLS algorithm is similar to the one in the ILS algorithm. The

objective function f(a) is determined as follow:

f(a) = λ∗Σ(kij ∗Iij (a)) (4.1)

Where ais the candidate configuration, λis a parameter to the GLS algorithm.

54CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Repeat {

Pick randomly an initial configuration a

Compute finished_time p of a

} until p meets Criteria 1

a<- LocalSearch(a)

Compute cost c of a

a”=a // a” is the final configuration, c” is the cost of a”

m_loop=0 // m_loop stores times the configuration not improve

while(num_mv<max) {

Repeat {

a’ <- Perturbation(a)

Compute finished_time p’ of a’

} until p’ meets Criteria 1

a’<- LocalSearch(a’)

Compute cost c’ of a’

if(c’<c){

a=a’

m_loop=0

if(c’<c”){

a”=a’; c”=c’ }

}

else{

m_loop++ }

if(m_loop<max_val)

a=a’

}

Figure 4.15: ILS algorithm to find the minimal cost

4.2. MAPPING WITH STANDARD METAHEURISTICS 55

Pick randomly an initial configuration a

Initialize penalty kij=0 // each candidate RMS rj of sub-job i has a kij

a”=a // a” is the final configuration, p” is the finished_time of a”

while(num_mv<max) {

Compute finished_time p of a

f’=p + f(a)

LocalSearch(a) finds local optima a according to f’ function and

records local optima a’ according to p

Compute finished_time p’ of a’

if(p’<p”){

a”=a’

p”=p’ }

Compute Util(a,rj,kij)

kij++ if Util(a,rj,kij) max

}

Figure 4.16: GLS algorithm to find the minimal finished time

After a number of studies, we found that λ=20 brings good results and we used this

value for all experiments. kij is the penalty of candidate RMS rjof sub-job si.Iij (a)

is an indication of whether si:rjexists in m. Iij(a) = 1 if si:rjexists in a; 0

otherwise

Function Util(a,rj,kij) is determined as follow.

Util(a, rj, kij) = Iij (a)∗1

1 + kij

(4.2)

The GLS algorithm starts with an initial configuration. It calls the local search to

improve the configuration as far as possible according to the modified finished time

function. The more the rjof siis selected before, the less it will become attractive for

the next selection and the other candidate RMS has chance to be selected. During

the search process, the original finished time of the candidate configuration is also

computed and the best one is recorded.

56CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Repeat {

Pick randomly an initial configuration a

Compute finished_time p of a

} until p meets Criteria 1

Compute cost c of a

Initialize penalty kij=0 // each candidate RMS rj of sub-job i has a kij

a”=a // a” is the final configuration, c” is the cost of a”

while(num_mv<max) {

Compute cost c of a

f’=c + f(a)

LocalSearch(a) finds local optima a according to f’ function and

records local optima a’ according to c

Compute cost c’ of a’

if(c’<c”){

a”=a’

c”=c’ }

Compute Util(a,rj,kij)

kij++ if Util(a,rj,kij) max

}

Figure 4.17: GLS algorithm to find the minimal cost

Finding the minimal cost for workflow with GLS

The GLS algorithm to find minimal cost solution is presented in Figure 4.17.

The GLS algorithm for finding minimal cost of the heavy workflow is similar to the

algorithm presented in Figure 4.16 with two minor differences. The first difference

is that the finished time value is replaced by the cost of the configuration. Cost

value is computed as described in Part 3.1.3. The second difference is that after

having the new configuration a0by choosing random configuration or selecting from

the neighborhood set, the module computing timetable is called on to check if the

finished time of a0is within the user’s limitation. If the finished time of a0exceeds

the user’s limitation, a new configuration will be selected.

4.2. MAPPING WITH STANDARD METAHEURISTICS 57

4.2.6 Genetic Algorithm (GA)

GA [65, 66] is a part of Evolutionary Computing, which is inspired by Darwin’s theory

of evolution. GA begins with a set of solutions called population. Solutions from one

population are selected according to their fitness and used to form a new population.

This is motivated by a hope that the new population will be better than the old one.

This process is repeated to find out the best solution.

Finding the minimal finished time with GA

The GA algorithm to find the minimal finished time of a workflow within SLA context

is presented in Figure 4.18.

Generate random population includes n configurations

while(num_mv<max) {

Evaluate the finished_time of each configuration

a”= best configuration

Add a” to the new population

while the new population is not enough {

Select two parent configuration according to their finished_time

Crossover the parent with a probability to form new configuration

Mutate the new configuration with a probability

Put the new configuration to the new population }

}

return a”

Figure 4.18: GA algorithm to find the minimal finished time

Parents are selected according to the roulette wheel method. The fitness of each

configuration = 1/makespan. Firstly, the sum L of all configuration fitness is calcu-

lated. Then, a random number lfrom the interval (0,L) is generated. Finally, we go

through the population to sum the fitness p. When pis greater than l, we stop and

return to the configuration where we are.

The crossover point is chosen randomly. The child is formed by copying from two

part of the configurations. The mutation point is chosen randomly. At mutation

58CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Repeat{

Repeat {

Pick randomly an initial configuration a

Compute finished_time p of a

} until p meets Criteria 1

put a to the population set

}until the population includes n configurations

while(num_mv<max) {

Evaluate the cost of each configuration

a”= best configuration

Add a” to the new population

while the new population is not enough {

Repeat{

Select two parent configuration according to their cost

Crossover the parent with a probability to form new configuration

Mutate the new configuration with a probability

Compute finished_time p of the new configuration

} until p meets Criteria 1

Put the new configuration to the new population }

}

return a”

Figure 4.19: GA algorithm to find the minimal cost

point, rjof siis replaced by another RMS in the candidate RMS set. It is noted that

the probability to have mutation with a child is low, ranging approximately from

0.5% to 1%.

Finding the minimal cost for workflow with GA

The GA algorithm to find the minimal cost solution is presented in Figure 4.19.

The GA algorithm for finding the minimal cost of the heavy workflow is similar to

the algorithm presented in Figure 4.18 with two minor differences. The first difference

is that the finished time value is replaced by the cost of the configuration. Cost value

is computed as described in Part 3.1.3. The second difference is that after having

4.2. MAPPING WITH STANDARD METAHEURISTICS 59

Generate random population includes n configurations

Call LocalSearch() to improve the population

while(num_mv<max) {

Select the best half of the population

a”=the best configuration

Estimate joint probability distribution of the selected configurations

using UMDA with Laplace correction

Put a” to the new population

Generate (n-1) individuals by sampling joint probability distribution

Call LocalSearch() to improve the new population

Replace the old population with the new one

}

return a”

Figure 4.20: EDA algorithm to find the minimal finished time

the new configuration a0by choosing random configuration, doing crossover or doing

mutation, the module computing timetable is called on to check if the finished time

of a0is within the user’s limitation. If the finished time of a0exceeds the user’s

limitation, a new configuration will be selected.

4.2.7 Estimation of Distribution Algorithm (EDA)

EDA [94, 95] is a new area of Evolutionary Computation. In EDAs, there is nei-

ther a crossover nor a mutation operator. New population is generated by sampling

the probability distribution which is estimated from a database containing selected

individuals of the previous generation.

Finding the minimal finished time with EDA

The EDA algorithm to find the minimal finished time of a workflow within SLA

context is presented in Figure 4.20.

The neighborhood structure, the assigning sequence, and the procedure to deter-

mine timetable are the same as those described in the Tabu Search section. The local

60CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

search module used in EDA algorithm is similar to the one used in the ILS algorithm.

We choose Univariate Marginal Distribution Algorithm (UMDA) [96] because the

probability kij of assigning rjto sidoes not depend on other RMSs in the candidate

set. If the best half of the population includes N configurations, kij increases with

the increasing of number appearing si:rjin the selected individuals:

kij =Σ(rj−> si) + 1

N+ti

(4.3)

where tiis the size of the RMS candidate set of sub-job si.

The sampling process to create a new generation is based on probability kij. With

each sub-job si, the RMS rjis selected based on the roulette wheel mechanism as

described in part 4.4.4. RMS having a big probability will have more chance to be

selected. Thus, the EDA algorithm tends to choose the RMS that usually appears in

high quality configurations.

Finding the minimal cost for workflow with EDA

The EDA algorithm to find the minimal cost solution is presented in Figure 4.21.

The EDA algorithm for finding the minimal cost of the workflow is similar to the

algorithm presented in Figure 4.20 with two minor differences. The first difference

is that the finished time value is replaced by the cost of the configuration. Cost

value is computed as described in Part 3.1.3. The second difference is that after

having the new configuration a0by choosing random configuration or selecting from

the neighborhood set, the module computing timetable is called on to check if the

finished time of a0is within the user’s limitation. If the finished time of a0exceeds

the user’s limitation, a new configuration will be selected.

4.2.8 Evaluation of metaheuristics

We have done some experiments to evaluate the performance of standard metaheuris-

tics when applying to the problem of mapping Grid-based workflow within the SLA

context. Detail about the methods and results of the experiment are presented in

4.2. MAPPING WITH STANDARD METAHEURISTICS 61

Repeat{

Repeat {

Pick randomly an initial configuration a

Compute finished_time p of a

} until p meets Criteria 1

put a to the population set

}until the population includes n configurations

Call LocalSearch() to improve the population

while(num_mv<max) {

Select the best half of the population

a”=the best configuration

Estimate joint probability distribution of the selected configurations

using UMDA with Laplace correction

Put a” to the new population

Repeat{

Repeat {

Generate an individual a by sampling joint probability

distribution

Compute finished_time p of a

} until p meets Criteria 1

put a to the new population set

}until the new population includes n configurations

Call LocalSearch() to improve the new population

Replace the old population with the new one

}

return a”

Figure 4.21: EDA algorithm to find the minimal cost

62CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Section 4.4. Through the experiment results, especially the results in sections 4.4.2

and 4.4.3, we have some observations as follow:

•When the size of the problem is relatively small, all metaheuristics found nearly

quality-equal results. We can deduce that the found results convergence to a

value that is very near the optimal solution.

•When the size of the problem is huge, the result from each metaheuristic is

different from each other. This means that the found results do not converge

and may still be far from the optimal solution.

•When the size of the problem is huge, metaheuristics using local search such as

GLS, ILS, EDA, found better results than other metaheuristics, which do not

use local search.

•When the size of the problem is huge, the runtime of higher performance meta-

heuristics such as GLS, ILS, EDA is usually exponent. This factor is a drawback

when applying in the real system.

•Except for Tabu Search, most standard metaheuristics use a random mechanism

to rapidly explore many different areas in the search space. This mechanism

may bring good results but it is very difficult to control the searched area and

also the search process. In contrast, the standard Tabu Search searches very

carefully the area around the initial configuration and moves only to a nearby

search area.

Through these observations, directly applying standard metaheuristics to our

problem faces many disadvantages, especially the long runtime. Therefore, a new

mapping mechanism must be built.

4.3 Proposed mapping mechanism

We propose a mapping mechanism which can eliminate the disadvantages of standard

metaheuristics. As using local search could find better result than not using local

4.3. PROPOSED MAPPING MECHANISM 63

search, we also make use of local search as a fundamental component in our mapping

mechanism. The proposed mapping mechanism includes three sub-algorithms. L-

Tabu finds cost optimal mapping solution for light workflow in which amount of

data to be transferred among sub-jobs is little (L stands for light). H-Map finds

cost optimal mapping solution for heavy workflows in which amount of data to be

transferred among sub-jobs is large (H stands for heavy). w-Tabu finds the runtime

optimal solution for both cases of workflows (w stands for workflow).

1. Determine candidate RMSs for each sub-job.

If resource in the Grid free {

2. if workflow has little data transfer

Call L-Tabu algorithm

3. if workflow has a lot of data transfer

Call H-Map algorithm

}else {

4. Call w-Tabu algorithm

then call L-Tabu or H-Map

}

Figure 4.22: Mapping mechanism overview

Figure 4.22 presents the basic principle of the proposed mapping mechanism. If

there are a lot of Grid resources free at a specific time period, L-Tabu or H-Map

algorithm is called to find the cost-optimal solution. If there are few Grid resources

free, w-Tabu is called to find a feasible solution. Starting from this feasible solution,

L-Tabu or H-Map will find the optimal solution. In fact, the signature of having

many or few Grid resources free and the method to call on the w-Tabu algorithm are

integrated in the L-Tabu and H-Map algorithms. The following sections will describe

each algorithm in detail.

64CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Workflow

(DAG)

RMSs

Compute earliest

and latest

start_time of

each sub-job

Remove

requirement

bottlenecks

Determine

RMS

candidate for

each sub-job

Is candidate set

for each sub-job

empty ?

Create

initial

solution

Output: Cannot

find solution

Call module

increasing the

quality of

solution

Output:

Best found

solution Yes

No Is initial solution

available ?

Call w-Tabu

algorithm

Output: Cannot

find solution

No

Yes

No

Is initial solution

available ?

Yes

Figure 4.23: The frame work of L-Tabu algorithm

4.3.1 L-Tabu algorithm

High Performance Computing centers usually connect with the network through a

link with broad bandwidth which is greater than 100Mbps. In addition, one time

slot in our system is usually from 2 to 5 minutes. Thus, data transferred through the

link in one time slot can be from 1.2GB to 3GB. Data transferred on the link in one

second, which is approximately 10MB, is very small compared to the data transferred

in one time slot. If the data transferred between each related couple sub-jobs in the

workflow is smaller or equal to 10MB, then we can consider the workflow as light

workflow. With that small amount of data, the data transfer can happen within any

time slot in the bandwidth reservation profile without affecting other reserved data

transfer task. With such kind of workflow, it is not necessary to reserve bandwidth

on the link connecting two RMSs, which execute two co-related sub-jobs. In the

planning table for workflow, the data transfer task takes one time slot.

L-Tabu algorithm is applied for light workflow with the purpose of finding a fea-

sible solution with optimal cost. Figure 3.1 presents a sample light workflow with

the number above each link representing the amount of data transfer. Its resource

requirement is described in table 3.1. The frame work of the proposed algorithm is

depicted in Figure 4.23.

4.3. PROPOSED MAPPING MECHANISM 65

•Step 1: Computation of the earliest start time and the latest start time for

each sub-job by analyzing the input workflow with traditional graph techniques.

•Step 2: Removing requirement bottlenecks. This task aims at detecting bot-

tlenecks with a large number of resource requirements which can reduce the

possible start/end times of a sub-job. Based on this information, sub-jobs are

moved to other sites with more available resources in order to gain a longer time

span for the positioning of the sub-jobs in the workflow.

•Step 3: Definition of the solution space by grouping all RMS candidates which

have enough available resources to start the sub-jobs within the determined

slack time and to run the sub-job until it is completed. This task is performed

by analyzing the resource reservation profiles of the RMSs.

•Step 4: The gained search space is evaluated with respect to the contained

number of feasible solutions (large or small number of possible solutions). If

the number of candidate RMS for each sub-job is large, the number of possible

solutions is also large and vice versa. If a sub-job does not have any RMS

candidate, there is no possible solution and the algorithm stops. Subsequently,

an initial solution is created. If an initial solution is not found, we can deduce

that the Grid resource is busy and the w-Tabu algorithm is invoked. If w-Tabu

cannot find an initial solution, the algorithm will stop.

•Step 5: Starting with the initial solution, a specific module is called on in order

to increase the quality of the solution as far as possible. This is the function

of local search. However, instead of using the described local search procedure,

we use Tabu Search because it can also play the role of local search but with

wider search area.

In following sections, the individual algorithm steps are described in detail.

Resolving requirement bottlenecks

The generated requirement and the resource profiles show the time slots where a

large number of resources are required but not available. A sample of such situation

66CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

is shown in Figure 4.24 with an example of required and available CPU instances. At

each possible time slot, the total number of available CPUs over all involved RMS

and the total number of required CPU as sum of the requirements of all sub-jobs

possibly running in this time slot are computed. The contribution of each sub-job in

the profile is computed from the earliest start time to latest possible deadline. Figure

4.24 shows that at period 28-55 or 72-84 the number of required CPUs is larger than

those in other profile periods. This leads to a significantly reduced number of feasible

start times and thus reduces the probability of finding an optimal solution or even

a good solution. Therefore, the peak requirements have to be reduced by moving

selected sub-jobs to other time slots. This is possible because the actual start and

stop time of the sub-job can be shifted within a range. We can change the value of

that range to move a sub-job out of the peak period. Furthermore, by reducing the

number of parallel running sub-jobs on the same site, the probability for cost-effective

execution of two subsequent sub-jobs on the same site with a low communication effort

increases.

CPU required

5 27 84 99

52

45

166

542

497

471

time

55 71 155

384 CPU free

Figure 4.24: Relation between available and required CPUs

For resolving the requirement bottleneck, the profiles of the required resources

and the available resources are compared as shown in Figure 4.25. At each time slot,

we define Jas the set of msub-jobs running at that time slot and R as the set of

npossible resource candidates for J. Subsequently, the following measures can be

computed.

4.3. PROPOSED MAPPING MECHANISM 67

5 25 56 72 86 100

1

0.9

0.8

0.73

0.08

155

Rate

Time

Figure 4.25: Rate profile of the sample

T otalCP Urequire := Pi=1,m Ji.CP Ureq with Ji∈J(4.4)

T otalMEMrequire := Pi=1,m Ji.memreq with Ji∈J(4.5)

T otalEXPrequire := Pi=1,m Ji.EXPreq with Ji∈J(4.6)

T otalCP Uavail := Pj=1,n Rj.CP Uavail

mwith Ji∈J, mj≤m(4.7)

T otalmemavail := Pj=1,n Rj.memavail

mwith Ji∈J, mj≤m(4.8)

T otalEXPavail := Pj=1,n Rj.expavail

mwith Ji∈J, mj≤m(4.9)

Parameter Ji.CP Ureq,Ji.memreq, and Ji.EXPreq represent the number of required

CPUs, the size of needed memory (in MB), and the required experts for supervision

of the of subjob Ji, respectively. Finally, mjis the number of subjobs which rjcan

run simultaneously.

rate :=

T otalCP Urequire

T otalCP Uavail +T otalmemrequire

T otalmemavail +T otalEXPrequire

T otalEXPavail

3(4.10)

The removal of the requirement peak is performed by adjusting the start time

slot or the end time slot of the sub-jobs and thus the sub-job is moved out of the

bottleneck area. One possible method is shown in Figure 4.26, where either the latest

68CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

finish time of sub-job 1 is set to t1or the earliest start time of sub-job 2 is set to t2.

The second way is to adjust both sub-jobs simultaneously as depicted in Figure 4.27.

A necessary prerequisite here is that, after adjusting, the latest completion time -

earliest start time is larger than the total runtime.

rate

0 t1 t2

time

sj1

sj22

rate

0 t1 t2

time

sj1

sj2

Figure 4.26: Moving subjobs

rate

0 t1 t2

time

sj1

sj2

rate

0 t1 t2

time

sj1 sj2

Figure 4.27: Adjusting subjobs

After doing the resolving task, we have to re-compute the earliest start time and

latest finish time of each sub-job which relates to the adjusted sub-job to ensure the

dependency property. The consequence is that the distribution of resource require-

ment has changed and we have to re-compute the new relation profile and then do

the next resolving task. These steps are repeated until sub-jobs cannot be adjusted

to resolve the peak. The result of applying this procedure to our example case is a

compact profile as depicted in Figure 4.28.

After this phase, the variety in time factor is decreased a lot and contributes little

to the final result. Thus, we fix the start time of each sub-job to the earliest start

time and the end time=start time+execution time.

4.3. PROPOSED MAPPING MECHANISM 69

5 26 56 72 86 100 155

1

0.55

0.35

0.08

Rate

Time

Figure 4.28: Rate profile of the sample after doing adjustments

Determining the solution space

After determining the start time for each sub-job, we need to refine the solution can-

didate for each sub-job. Because of the resource reservation character, it is necessary

to check if an RMS candidate has enough free resource during the specific runtime

period of the sub-job. Once again, this work is done by executing the SQL statement.

With each RMS candidate, we query to find a period overlapping with the sub-job’s

runtime period that has a free resource less than the resource requirement. If a result

is found then remove the RMS out of the candidate list.

Creating the initial solution

The algorithm for determining the initial solution is based on a fail-first heuristic

and the forward checking algorithm [74]. According to the Greedy strategy for each

sub-job under investigation, an RMS with the minimal cost is selected and assigned

as described in the following four steps:

•Determine the sub-job in the workflow with the smallest number of RMS can-

didates, which can execute the sub-job according to the provided specification.

70CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

•If the set with RMS candidates for this sub-job is empty, then assign one ran-

domly selected RMS and mark the assignment as conflict. Otherwise, assign

the RMS with the minimal cost to that sub-job.

•Repeat the process with the next sub-job until all sub-jobs are assigned.

•Resolve the marked conflicts.

In case of conflicts, we can deduce that there is little free Grid resource and the

w-Tabu algorithm is invoked as the last effort to fulfill the user requirement. If this is

possible, the found feasible solution is declared as the initial solution and the mapping

process proceeds with the specific procedure to improve the quality of the solution

as far as possible. Otherwise, the owner of the workflow is notified that the workflow

cannot be executed according to the given conditions and will be rejected until a new

specification is submitted.

Improving solution quality algorithm

To improve the quality of a solution, we use a specific procedure based on Tabu search

for this problem. By this time, all the start times and stop times of each sub-job in

the workflow have been determined. We have to find the mapping sub-job:RMS in

a way that optimizes the cost. In a normal Tabu search, in each move iteration, we

will try assigning each sub-job sji∈Swith each RMS rjin the candidate set Ki

and compute the cost and then check for an overall improvement and pick the best

one. This method is not so efficient as it requires searching among all the candidates.

This will lead to long time running when the number of candidates is large. We will

improve that method by proposing a new neighborhood for the Tabu search.

As we are working with light workflow, the data to be transferred among sub-jobs

are very small and the total cost for transferring data is also small. If we have a

price 0.01USD/MB for data transmission and the workflow has total 100 MB data

to be transferred among sub-jobs, the total cost for this task is only 1USD. For that

reason, we can bypass the impact of data transfer cost in the overall cost of the

running workflow. Thus, in each moving iteration, we do not need to check for all

4.3. PROPOSED MAPPING MECHANISM 71

While not num_loop < max_loop{

For each sub-job in the workflow {

For each cheaper RMS in the candidate set {

Compute cost

Store tuple (sub-job, RMS, cost) to a list

} }

Sort the list

Pick the best solution which is cheaper or or not

affect tabu rule

Assign tabu_number for the selected RMS

If cheaper then store the solution

num_loop++

}

Figure 4.29: Improving solution quality procedure for light workflow

RMS in the candidate list but concentrate on RMS having a cheaper running cost

than the current RMS. Tracing through cheaper RMS can be made easier by using

a sorted list. With this new neighborhood, the algorithm has very short runtime

while still finding high quality solutions. The pseudo-code of the procedure is found

in Figure 4.29.

4.3.2 H-Map algorithm

H-Map algorithm maps heavy workflow to the Grid RMSs. As the data to be trans-

ferred among sub-jobs in the workflow are huge, to ensure the deadline of the workflow

it is necessary to make bandwidth reservation. In this case, the time to do a data

transmission task becomes unpredictable as it depends on the bandwidth and the

reservation profile of the link, which varies from link to link. The variety in the com-

pletion time of the data transmission task makes the total runtime of the workflow

also flexible. The goal of H-Map algorithm is finding out a solution, which ensures

Criteria 1, and is as cheap as possible. The overall H-Map algorithm is presented in

Figure 4.30.

72CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Workflows

DAG

RMSs

Construct set

of initial

configuration

Co Improve the

set of

configuration

Ci

C* Output the

best in C*

Is Co

empty?

No

Yes

Output: Cannot

find solution

Yes

Call w-Tabu

algorithm to

create Co

Is Co

empty?

No

Figure 4.30: H-Map algorithm overview

Firstly, a set of initial configurations Cois created. The configurations in Co

should be distributed widely over the search space and must satisfy Criteria 1. If

|Co|=∅, we can deduce that there is little resource free on the Grid and the w-Tabu

algorithm is invoked. If w-Tabu also cannot find out a feasible solution, the algorithm

stops. If |Co| 6=∅, the set will gradually be refined to have better quality solutions.

The refining process stops when the solutions in the set cannot be improved more

and we have the final set C*. The best solution in C* will be output as the result

of the algorithm. The following sections will describe detail each procedure in the

algorithm.

Constructing the set of initial configurations

The purpose of this algorithm is to create a set of initial configurations which will be

distributed widely over the search space.

Step 0: With each sub-job si, we sort the RMSs in the candidate set Kiaccording

to the cost they need to run si. The cost is computed according to Formula 4.3. The

configuration space of the sample now can be presented in Figure 4.31 and Table 4.3.

In Figure 4.31, the RMSs lying along the axis of each sub-job have cost increasing in

the direction from inside to outside. The line connecting each point in every sub-job

axis will form a configuration. Figure 4.31 presents 3 configurations with an increasing

index in the direction from inside to outside. Figure 4.31 also presents the cost

4.3. PROPOSED MAPPING MECHANISM 73

distribution of the configuration space according to Formula 3.3. The configuration

in outer layers has a greater cost than the inner layers. The cost of the configuration

lying between two layers is greater than the cost of the inner layer and smaller than

the cost of the outer layer.

1

3

2

1

2

3

2 1 3

3

1

2

3

2

1

2

3

1

3

2

0

sj

1

sj

2

sj

3

sj

4

sj

5

sj

6

sj 1

2

3

Figure 4.31: The configuration space according to cost distribution

Step 1: We pick the first configuration as the first layer in the configuration

space. The determined configuration can be presented as a vector. The index of

the vector represents the sub-job, and the value of the element represents the RMS.

The first configuration in our example is presented in Figure 4.32. Although the first

configuration has minimal cost according to Formula 4.3, we cannot be sure that this

is the optimal solution. The real cost of a configuration must consider the neglected

cost of data transmission when two sequential sub-jobs are in the same RMS.

Step 2: We construct the other configurations by doing a process similar to

the one described in Figure 4.33. The second solution is the second layer of the

configuration space. Then we create a solution having cost located between layer

74CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Sj ID RMS RMS RMS

sj0 R1 R3 R2

sj1 R1 R2 R3

sj2 R2 R1 R3

sj3 R3 R1 R2

sj4 R3 R2 R1

sj5 R2 R3 R1

sj6 R1 R3 R2

Table 4.3: RMSs candidate for each sub-job in cost order

1 23321 1

0 2 4 51 63

Figure 4.32: The first selection configuration of the sample

1 and layer 2 by combining the first and the second configuration. To do this, we

take the p first elements from the first vector configuration and then the p second

elements from the second vector configuration and repeat until having n elements to

form the third one. Thus, we get (n/2) elements from the first vector configuration

and (n/2) other elements from the second one. Combining in this way will ensure

the target configuration of having a greater difference in cost according to Formula

3.3 compared to the source configurations. The process continues until reaching the

final layer. Thus, we have in total 2*(m-1) configurations. With this method, we

can ensure that the set of initial configurations is distributed over the search space

according to cost criteria.

Step 3: We check Criteria 2 of all 2*m-1 configurations. To verify Criteria 2,

we have to determine the timetable for all sub-jobs of the workflow. The procedure

to determine the timetable of the workflow is similar to the one described in Figure

4.10. If some of them do not satisfy the Criteria 2 requirement, we construct more to

have enough 2*m-1 configurations. To do the construction, we change the value of p

parameter in the range from 1 to (n/2) in step 2 to create the new configuration.

After this phase we have set C0including maximum (2m-1) valid configurations.

4.3. PROPOSED MAPPING MECHANISM 75

1 23321 1

1 32121 3

3 32112 3

3 11212 2

2 11233 2

1

3

4

5

2

Figure 4.33: Procedure to create the set of initial configurations

Improving solution quality algorithm

To improve the quality of the solutions, we use the neighborhood structure as de-

scribed in Section 4.3.2. Call Athe set of neighborhood of a configuration. The

procedure to find the highest quality solution includes the following steps.

Step 1: ∀a∈A, calculate cost(a) and timetable(a), pick a∗with the smallest

cost(a∗) and satisfy Criteria 2, put a∗to set C1. The detailed technique of this step

is described in Figure 4.34.

We consider only the configuration having a smaller cost than the present configu-

ration. Therefore, instead of computing the cost and the timetable of all configuration

in the neighborhood set, we compute only the cost of them. All the cheaper config-

urations are stored in a sorted list. And then we compute the timetable of cheaper

configurations along the list to find the first feasible configuration. This technique

helps to decrease a lot of the algorithm’s runtime.

Step 2: Repeat step 1 with all a∈C0to form C1.

Step 3: Repeat step 1 to 2 until Ct=Ct−1.

Step 4: Ct ≡C∗. Pick the best configuration of C∗.

76CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

For each subjob in the workflow {

For each RMS in the candidate list {

If cheaper then put (sjid, RMS id, improve_value)

to a list }}

Sort the list according to improve_value

From the begin of the list{

Compute time table to get the finished time

If finished time < limit

break

}

Store the result

Figure 4.34: Algorithm to improve the solution quality

4.3.3 w-Tabu algorithm

The main purpose of the w-Tabu algorithm is finding out a feasible solution when

there are few free Grid resources. This destination is equal to finding a solution with

the minimal finished time. Within the SLA context as defined in section 4.1, the

finished time of the workflow depends on the reservation state of resources in RMSs,

bandwidth among RMSs, and bandwidth reservation state. It is easy to show that

this task is an NP hard problem. Although the problem has the same destination as

most of existing algorithm mapping a DAG to resources [83, 10, 21, 13], the defined

context is different from all other context appearing in the literature. Thus, the

problem needs a new approach to be solved. We propose a mapping strategy as

depicted in Figure 4.35.

Firstly, a set of referent configurations is created. Then we use a specific module to

improve the quality of each configuration as far as possible. The best configuration

will be selected. This strategy looks similar to an abstract of a long term local

search such as Tabu search, Grasp, SA, etc. However, detailed description makes our

algorithm distinguishable from them.

4.3. PROPOSED MAPPING MECHANISM 77

Workflows

DAG

RMSs

Construct set

of reference

configuration

Co

Improve the

quality of

configurations

Output the

best

configuration

Figure 4.35: w-Tabu algorithm overview

Generating reference solution set

Each configuration from the reference configurations set can be thought of as the

starting point for a local search so it should be spread as widely as possible in the

searching space. To satisfy the space spreading requirement, the number of the same

map sub-job:RMS between two configurations must be as small as possible. The

number of the member in the reference set depends on the number of available RMSs

and the number of sub-jobs. During the process of generating a reference solution

set, each candidate RMS of a sub-job has a co-relative assign number to count the

times that RMS is assigned to the sub-job. During the process of building a reference

configuration, we use a similar set to store all defined configurations having at least

a map sub-job:RMS similar to one in the creating configuration. The algorithm is

defined in Figure 4.36.

While building a configuration with each sub-job in the workflow, we select the

RMS in the set of candidate RMSs, which create a minimal number of similar sub-

job:RMS with other configurations in the similar set. After that, we increase the

assign number of the selected RMS. If this value is larger than 1, which means that

the RMS were assigned to the sub-job more than one time, there must exist config-

urations that contain the same sub-job:RMS and thus satisfy the similar condition.

We search these configurations in the reference set which have not been in the similar

set, and then add them to the similar set. When finished, the configuration is put

to the reference set. After all reference configurations are defined, we use a specific

78CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

assign_number of each candidate RMS =0

While m_size < max_size {

Clear similar set

For each sub-job in the workflow {

For each RMS in the candidate list {

For each solution in similar set {

If solution contains sub-job:RMS

num_sim++

Store tuple (sub-job, RMS, num_sim) in

a list }}

Sort the list

Pick the best result

assign_number++

If assign_number > 1

Find defined solution having the same

sub-job:RMS and put to similar set

}}

Figure 4.36: Generating reference set algorithm

procedure to refine each of the configuration as far as possible.

Solution improvement algorithm

To improve the quality of a configuration, we use a specific procedure based on short

term Tabu search for this problem. Just likes the L-Tabu algorithm, we use Tabu

Search because it can also play the role of a local search but with a wider search

area. Before improving the quality of the configuration, we have to determine the

specific runtime period for each sub-job as well as the finished time of the present

configuration. This task is done with the algorithm in Figure 4.37. This algorithm

is similar to the algorithm in Figure 4.10 with only a minor difference in determining

the end time of data transfer parameter. As the w-Tabu algorithm applies both for

light workflow and heavy workflow, determining the parameter for each case cannot

4.3. PROPOSED MAPPING MECHANISM 79

With each sub-job k following the assign sequence {

Determine set of assigned sub-jobs Q, which having output

data transfer to the sub-job k

With each sub-job i in Q {

min_st_tran=end_time of sub-job i +1

If heavy weight workflow {

Search in reservation profile of link between RMS running

sub-job k and RMS running sub-job i to determine start and

end time of data transfer task with the start time >

min_st_tran } else {

end time data transfer = min_st_tran }

}

min_st_sj=max end time of all above data transfer +1

Search in reservation profile of RMS running

sub-job k to determine its start and end time with

the start time > min_st_sj

}

Figure 4.37: Determining timetable algorithm for workflow in w-Tabu

be the same. With light workflow, the end time of data transfer equals the time slot

after the end of the correlative source sub-job. With a heavy workflow, the end time

of data transfer is determined by searching the bandwidth reservation profile.

In normal Tabu search, in each move iteration, we will try assigning each sub-job

sji∈Swith each RMS rjin the candidate set Kiand use the procedure in Figure

4.37 to compute the runtime and then check for overall improvement and pick the

best one. This method is not efficient as it requires a lot of time for computing the

runtime of the workflow which is not a simple procedure. We will improve the method

by proposing a new neighborhood with two comments.

Comment 1: The runtime of the workflow depends mainly on the execution time

of the critical path. In one iteration, we can move only one sub-job to one RMS. If

the sub-job does not belong to the critical path, after the movement, the old critical

path will have a very low probability of being shortened and the finished time of

the workflow has a low probability of improvement. Thus, we concentrate only on

sub-jobs in the critical path. With a defined solution and runtime table, the critical

path of a workflow is defined with the algorithm in Figure 4.38.

80CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Let C is the set of sub-jobs in the critical path

Put last sub-job into C

next_subjob=last sub-job

do{

prev_subjob is determined as the sub-job having

latest finished data output transfer to next_subjob

Put prev_subjob into C

next_sj=prev_subjob

} until prev_sj= first sub-job

Figure 4.38: Determining critical path algorithm

We start with the last sub-job determined. The next sub-job of the critical path

will have latest finish data transfer to the previously determined sub-job. The process

continues until next sub-job is equal to first sub-job. Figure 4.39 depicts a sample

critical path of the workflow in Figure 3.1.

Subjob

0

Subjob

1

Subjob

6

Subjob

2

Figure 4.39: Sample critical path of the workflow in Figure 3.1

Comment 2: In one move iteration, with only one change of one sub-job to one

RMS, if the finish time of the data transfer from this sub-job to the next sub-job

in the critical path is not decreased, the critical path cannot be shortened. For this

reason, we only consider the change which shortens the finish time of consequent data

transfer. It is easy to see that checking if we can improve the data transfer time is

much shorter than computing the runtime table for the whole workflow.

With two comments and other remaining procedures similar to the standard Tabu

search, we build the overall improvement procedure as presented in Figure 4.40.

4.4. PERFORMANCE EXPERIMENT 81

while (num_loop<max_loop){

Determine critical path

For each sub-job in the critical path {

For each RMS in the candidate set {

If can improve the finished time of the

sequence data transfer {

Compute timetable for new solution

Store tuple (sub-job, RMS, makespan) to

candidate list

} } }

Pick the solution having smaller makespan

or not affect tabu rule

Assign tabu_number for the selected RMS

If smaller makespan then store the solution

num_loop++

}

Figure 4.40: Configuration improvement algorithm in w-Tabu

4.4 Performance experiment

The performance experiment is done with simulation to check for the quality of the

mapping algorithms. The hardware and software used in the experiments is rather

standard and simple (Pentium 4 2,8Ghz, 2GB RAM, Linux Redhat 9.0, MySQL). The

whole simulation program is implemented in C/C++. We generated several scenarios

with different workflow configurations and different RMS configurations to suit with

the ability of the comparing algorithms. The goal of the experiment is to measure

the feasibility, the quality of the solution and the time needed for the computation.

Table 4.4 describes the resource configuration used in all experiments. In reality,

the number of sites in the Grid can be large but they are also heterogeneous in static

resource configurations for example CPU speed, CPU type, OS and so on. Therefore,

the number of sites having the resource configuration greater than the requirement

82CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

RMS CPU type CPU speed Num CPU Storage Expert Cost cpu Cost stor Cost exp Cost tran

1 0 1700 256 300000 4 0.0501 0.00701 0.15 0.06

2 0 1700 128 200000 3 0.0503 0.00802 0.133333 0.05

3 0 1700 256 300000 4 0.0502 0.00703 0.15 0.06

4 0 1700 256 300000 4 0.0501 0.00701 0.15 0.06

5 0 1700 256 300000 4 0.0504 0.00702 0.15 0.06

6 0 1700 64 100000 2 0.0502 0.00801 0.166667 0.06

7 0 1700 128 200000 3 0.0501 0.00802 0.133333 0.05

8 0 1700 256 300000 4 0.0503 0.00703 0.15 0.06

9 0 1700 64 100000 2 0.0502 0.00801 0.166667 0.06

10 0 1700 128 200000 3 0.0501 0.00801 0.133333 0.05

11 0 1700 128 200000 3 0.0502 0.00802 0.133333 0.05

12 0 1700 128 200000 3 0.0504 0.00803 0.133333 0.05

13 0 1700 128 200000 3 0.0502 0.00801 0.133333 0.05

14 0 1700 128 200000 3 0.0502 0.00803 0.133333 0.05

15 0 1700 256 300000 4 0.0501 0.00703 0.15 0.06

16 0 1700 256 300000 4 0.0503 0.00701 0.15 0.06

17 0 1700 128 200000 3 0.0501 0.00802 0.133333 0.05

18 0 1700 256 300000 4 0.0502 0.00704 0.15 0.06

19 0 1700 64 100000 2 0.0501 0.00801 0.166667 0.06

20 0 1700 128 200000 3 0.0503 0.00803 0.133333 0.05

Table 4.4: Resource configuration used in the experiment

and having the same static resource configuration is not so big. Beside that, the static

resource configuration has little impact on the efficiency of the algorithms as it can

be filtered easily with SQL command. The difference in the number of resources and

the price of resources in each RMS has a great influence on the quality of the solution

and the execution time of the algorithms. Thus, 20 RMSs used in the experiments

having the same static configuration but different in total resource and price policy

can simulate well the real Grid scenario.

The workflow configurations and the resource reservation information in each RMS

are changed to suite with the scenario of each algorithm in the experiments.

4.4. PERFORMANCE EXPERIMENT 83

4.4.1 Optimizing cost for light communication workflows ex-

periment

In this experiment, we use the comparing algorithms to map the light communication

workflows to the Grid resources. To do the experiment, 18 workflows with different

configurations were generated. The workflows are different in:

•Topology.

•Number of sub-jobs. Number of sub-jobs is in the range 7-35.

•Sub-job specifications.

•Amount of data transfer. Amount of data transfer is in the range 1MB - 10MB.

L-Tabu Tabu SA ILS GLS GA EDA

Wf Rt Cost Rt Cost Rt Cost Rt Cost Rt Cost Rt Cost Rt Cost

1 2 629.04 67 632.69 36 633.34 27 632.37 14 632.69 23 632.37 78 632.37

2 1 753.13 89 756.77 17 756.77 41 756.76 25 757.46 27 756.77 129 756.76

3 2 776.10 130 780.13 31 780.13 71 780.12 33 780.62 30 780.12 198 780.12

4 3 831.13 167 835.97 23 835.97 85 835.97 46 836.39 32 835.97 223 835.97

5 3 886.12 207 891.81 25 891.81 117 891.81 63 892.85 39 891.81 296 891.81

6 3 941.12 286 947.61 65 947.61 174 947.61 87 948.34 36 947.61 473 947.61

7 4 996.10 334 1003.47 96 1003.47 204 1003.47 107 1004.26 39 1003.47 497 1003.47

8 5 1051.19 432 1059.36 22 1059.36 257 1059.36 150 1059.98 44 1059.36 700 1059.36

9 5 1175.24 503 1183.45 34 1183.76 322 1183.76 187 1184.44 53 1183.77 628 1183.76

10 7 1299.20 594 1308.02 32 1308.03 367 1308.03 199 1308.81 62 1308.03 758 1308.02

11 7 1354.20 689 1363.86 146 1363.86 459 1363.86 243 1363.97 49 1363.87 1107 1363.86

12 8 1478.19 836 1482.25 49 1488.20 532 1488.20 272 1489.08 54 1488.21 1088 1488.20

13 9 1501.23 989 1511.62 58 1511.62 698 1511.62 339 1512.64 57 1511.64 1384 1511.62

14 9 1556.25 1306 1567.97 56 1567.50 805 1567.48 386 1567.84 57 1567.50 1491 1567.48

15 12 1579.29 1292 1590.69 94 1590.93 890 1590.91 445 1591.52 55 1590.92 2000 1590.90

16 15 1772.36 2480 1785.82 107 1785.36 1682 1785.35 779 1786.11 66 1785.38 3056 1785.35

17 18 1873.23 2914 1887.24 55 1887.92 2562 1887.91 1045 1888.50 76 1887.92 3788 1887.91

18 25 2199.35 4495 2215.56 80 2215.89 3276 2215.88 1572 2216.90 88 2215.92 6038 2215.89

Table 4.5: Experiment results of the L-Tabu algorithm

84CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

In the algorithms, number of sub-job is the most important factor to the execution

time of the algorithm. We stop at 35 sub-jobs for a workflow because as far as we

know, with our model of parallel task sub-job, most existing Grid-based workflows [82,

11, 105] just include 10-20 sub-jobs. Thus, we believe that our workload configuration

can simulate accurately the requirement of the real problems. Those workflows will

be mapped to 20 RMSs with different resource configurations and different resource

reservation contexts. The reserved resource in each RMS is not much and there are

a lot of free resources over the time axis. This character ensures of having a great

number of feasible solutions. The workflows are mapped by 7 algorithms L-Tabu,

TS, SA, GLS, ILS, GA, and EDA. The cost and the runtime of solutions generated

by each algorithm correlative with each workflow are recorded. The result of the

experiment for this case is presented in table 4.5. Column Wf (Workflow) presents

the id of workflows used in the experiment. The number of sub-jobs in the workflow

increases with the increasing of id. The cost and runtime of solutions generated by

each algorithm correlative with each workflow is recorded in column Rt (Runtime)

and Cost respectively.

From the results, we can see that the L-Tabu algorithm created higher quality

solutions than all comparing algorithms. Compares to other algorithms, faster com-

putation and slightly lower cost of the solution created by the proposed algorithm was

observed. The difference in the cost and runtime of the founded solutions between

L-Tabu and other algorithms increases when the size of the workflow increases. With

large-scale problems, algorithms such as ILS, GLS, and EDA, which use a local search

module, have an exponential runtime and find out just equal results with other ones,

except L-Tabu.

4.4.2 Optimizing cost for heavy communication workflows

experiment

In this experiment, we use the comparing algorithms to map the heavy communication

workflows to the Grid resources. To do the experiment, 18 workflows with different

configurations were generated. The workflows are different in:

4.4. PERFORMANCE EXPERIMENT 85

•Topology.

•Number of sub-jobs. Number of sub-jobs is in the range 7-35.

•Sub-job specifications.

•Amount of data transfer. Amount of data transfer is in the range 1GB - 6GB.

In the algorithms, the number of the sub-job is the most important factor to the

execution time of the algorithm. We also stop at 35 sub-jobs for a workflow because

as far as we know, with our model of parallel task sub-job, most existing Grid-

based workflows include only 10-20 sub-jobs. Thus, we believe that our workload

configuration can simulate accurately the requirement of the real problems. Those

workflows will be mapped to 20 RMSs with different resource configurations and

different resource reservation contexts. The workflows are mapped by 7 algorithms

H-Map, TS, SA, GLS, ILS, GA, and EDA. The cost and the runtime of solutions

generated by each algorithm correlative with each workflow are recorded. The final

result of the experiment is presented in table 4.6. Column Wf (Workflow) presents

the id of workflows used in the experiment. The number of sub-jobs in the workflow

increases with the increasing of id. The cost and runtime of solutions generated by

each algorithm correlative with each workflow is recorded in column Rt (Runtime)

and Cost respectively.

The experiment results show that the H-Map algorithm finds out equal or higher

quality solutions with a much shorter runtime than other algorithms in most cases.

With small problems, some metaheuristics using local search such as ILS, GLS, and

EDA find out equal results with H-Map and better than SA or GA. But with large-

scale problems, they have an exponential runtime and find out unsatisfactory results.

4.4.3 Optimizing the finished time for workflows experiment

Optimizing the finished time of the Grid-based workflow is a well known problem in

Grid Computing. Many attempts at this problem have resulted in a number of map-

ping methods such as [83, 10, 21, 13]. Although most of the previous work has been

86CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

H-Map Tabu SA ILS GLS GA EDA

Wf Rt Cost Rt Cost Rt Cost Rt Cost Rt Cost Rt Cost Rt Cost

1 1 632.58 58 632.98 105 632.58 21 632.58 14 632.58 25 632.58 62 632.58

2 0.5 756.96 83 757.42 84 756.90 37 756.96 19 756.90 22 756.96 88 756.90

3 1 780.42 157 780.60 114 780.42 54 780.42 29 780.34 27 780.34 126 780.34

4 1 836.18 212 836.63 78 836.18 81 836.18 46 836.18 28 836.18 178 836.18

5 1 892.02 259 926.00 105 892.02 114 892.02 59 892.02 29 892.21 241 892.02

6 2 948.10 235 948.64 90 948.10 147 948.10 80 948.10 36 1005.27 390 947.86

7 1 1003.7 279 1059.77 78 1003.99 201 1003.7 98 1003.99 36 1075.19 462 1003.7

8 2 1059.89 669 1114.5 121 1059.89 250 1059.89 127 1059.89 32 1059.89 558 1059.89

9 2 1184.21 453 1247.35 130 1184.21 307 1184.21 167 1184.21 44 1248.86 659 1183.92

10 2 1308.53 607 1352.28 146 1332.53 398 1308.53 187 1308.53 47 1383.53 680 1308.53

11 3 1364.14 696 1433.89 124 1376.42 502 1364.14 222 1377.63 52 1440.81 956 1364.42

12 2 1488.12 744 1575.41 184 1551.74 462 1521.74 303 1501.95 51 1569.39 854 1536.74

13 7 1512.26 1005 1512.04 174 1512.26 620 1512.26 354 1566.09 56 1620.17 1136 1512.26

14 3 1567.74 1306 1683.23 162 1631.15 815 1567.74 392 1568.15 56 1663.81 1255 1601.15

15 6 1591.12 1225 1713.00 161 1675.67 876 1591.12 524 1621.67 70 1764.18 1663 1621.67

16 5 1786.56 1912 1846.81 180 1871.81 1394 1840.31 763 1843.55 85 1914.91 2845 1830.87

17 7 1889.78 2567 2030.13 197 1960.87 1695 1892.27 1258 1936.83 93 2028.06 4170 1961.30

18 10 2217.34 4545 2350.36 272 2276.33 2046 2283.67 1623 2256.53 1953 2406.97 10976 2276.33

Table 4.6: Experiment results of the H-Map algorithm

for workflow without resource reservation, the principle ideas can also apply to our

problem. For that reason, beside applying and doing experiments with metaheuristic

algorithms, we also apply and do experiments with other proposed mechanisms.

Comparing with metaheuristic algorithms

To do the experiment, we used the workflow configurations and resource configu-

rations like in the case of H-Map experiment. The makespan and the runtime of

solutions generated by each algorithm correlative with each workflow are recorded.

The final result of the experiment is presented in table 4.7. Column Wf (Workflow)

presents the id of workflows used in the experiment. The number of sub-jobs in the

workflow increases with the increasing of id. The runtime and the value makespan

(makespan equals to the finished time subtracts the pre-determined start time) of the

solutions generated by each algorithm correlative with each workflow is recorded in

4.4. PERFORMANCE EXPERIMENT 87

column Rt (Runtime) and Mksp respectively.

w-Tabu Tabu SA ILS GLS GA EDA

Wf Rt Mksp Rt Mksp Rt Mksp Rt Mksp Rt Mksp Rt Mksp Rt Mksp

1 1 69 55 69 9 69 8 69 28 69 28 121 57 69

2 1 85 80 87 13 93 28 85 50 129 32 114 89 85

3 3 100 124 103 27 107 15 161 89 119 41 129 134 96

4 4 97 138 97 26 97 72 96 108 117 38 100 153 101

5 2 109 175 114 38 115 56 120 143 190 47 204 209 109

6 4 99 232 99 45 120 102 105 212 181 45 136 274 120

7 4 105 271 129 44 135 111 112 265 166 49 192 324 99

8 5 124 344 120 67 144 373 116 357 164 54 177 395 141

9 5 116 410 148 60 158 98 125 339 196 53 176 462 161

10 5 133 660 139 78 142 119 142 413 195 51 202 580 148

11 6 127 557 134 84 154 165 157 488 204 57 239 637 160

12 6 162 668 165 104 160 141 175 622 212 64 215 838 179

13 10 192 814 168 168 168 305 163 859 216 66 256 1047 173

14 11 160 941 161 136 184 172 256 990 197 63 197 966 200

15 9 164 1037 183 107 245 315 233 909 247 67 292 1122 186

16 13 172 1695 242 187 196 1678 172 1695 217 80 276 1786 208

17 16 173 2411 187 233 212 3056 179 2281 260 84 279 2258 212

18 16 216 2626 234 312 253 907 404 3323 281 94 330 2846 247

Table 4.7: Experiment results of the w-Tabu algorithm

The experiment results show that the w-Tabu algorithm finds out an equal or

a higher quality solution with much shorter runtime than other algorithms in most

cases. With small problems, some metaheuristics using local search such as ILS,

GLS, and EDA find out equal results with the w-Tabu and better than the SA and

the GA. But with large-scale problems, they have an exponential runtime and find

out unsatisfactory results.

Comparing with other existed mapping algorithms

We employed all the ideas in the recently appeared paper [83, 10, 21, 13] related to

mapping workflow to Grid resource with the same destination to minimize makespan

and adapt them to our problem. Those algorithms include w-DCP, Grasp, minmin,

88CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

maxmin, and suffer. To compare the quality of all the described algorithms above,

we generated 90 different workflows which:

•Have different topologies.

•Have different number of sub-jobs. The number of sub-jobs is in the range 7-35.

•Have different sub-job specifications.

•Have different amount of data transfer. The amount of data transfer is in the

range from several hundred MB to several GB.

Sj 0

Sj 4

Sj 2

Sj 6

Sj 3

Sj 1

Sj 5 Sj 8

Sj 7

Sj 10Sj 9 Sj 12Sj 11

Sj 14

Sj 13 Sj 16Sj 15

Sj 19

Sj 17 Sj 18

Sj 20

Figure 4.41: Work flow with 21 sub-jobs

With each workflow, we do mapping to 20 RMSs with different resource configu-

rations in 9 different start times that means with 9 different resource scenarios. The

total of running instances for each algorithm is 810. The quality of the solution is

determined by the makespan of the workflow. The experiment results show that our

algorithm needs a maximum of 13 seconds to map a workflow to the resource. Pre-

senting all experiment data in this dissertation is impossible because of the space

limitations, so we will describe here only some of the whole data. Figure 4.41 and

4.5. SUMMARY 89

St w-tb w-dcp grasp minmin maxmin suffer

10 127 132 145 132 132 132

15 132 136 150 139 137 137

20 137 137 155 144 142 143

25 142 146 162 149 146 142

30 147 155 161 151 151 151

35 152 152 170 156 156 156

40 157 159 172 161 161 161

45 162 169 173 172 172 162

50 167 175 181 177 177 167

Table 4.8: Experiment results with workload 21 sub-jobs

table 4.8 present a sample detailed experiment data when mapping workflow with 21

sub-jobs to 20 RMSs. Each row of the table 4.8 presents the start time slot and the

end time slot of each solution found by each algorithm respectively. Table 4.9 views

some of the experiment result in average relative value of the solution makespan

from other algorithms compared to our algorithm. Figure 4.42 depicts the overall

comparison among algorithms.

From the experiment data, it can be seen that our algorithm outperforms all other

algorithms in most cases. With runtime is just only few seconds, the algorithm is

effective enough to be used in real systems.

4.5 Summary

This chapter has presented a method which performs an efficient and precise assign-

ment of workflow sub-jobs to Grid resources with respect to SLAs defined deadlines

and sub-job dependencies. With each workflow characteristic and Grid resource state,

a different specific algorithm is used. The performance evaluation has shown that the

proposed algorithm creates solution of equal or better quality than existing methods

and needs an acceptable computation time. The latter is a decisive factor for the

applicability of the method in real environments because large-scale workflows can be

planned and assigned efficiently.

90CHAPTER 4. MECHANISM OF MAPPING WORKFLOW TO GRID RESOURCES

Id w-Tb w-dcp grasp minmin maxmin suffer

1 1.00 1.49 1.48 1.48 1.47 1.55

6 1.00 1.47 1.46 1.45 1.46 1.56

10 1.00 1.51 1.50 1.50 1.49 1.59

15 1.00 1.47 1.48 1.47 1.47 1.61

20 1.00 1.44 1.45 1.46 1.47 1.57

25 1.00 1.42 1.41 1.42 1.41 1.51

30 1.00 1.45 1.43 1.43 1.42 1.51

35 1.00 1.45 1.44 1.44 1.46 1.60

40 1.00 1.40 1.38 1.39 1.38 1.50

45 1.00 1.40 1.38 1.39 1.39 1.52

50 1.00 1.33 1.32 1.33 1.33 1.44

55 1.00 1.32 1.31 1.31 1.31 1.44

60 1.00 1.34 1.33 1.32 1.32 1.45

65 1.00 1.33 1.32 1.31 1.34 1.50

70 1.00 1.30 1.31 1.30 1.30 1.47

75 1.00 1.30 1.30 1.31 1.32 1.47

80 1.00 1.30 1.29 1.29 1.30 1.44

85 1.00 1.26 1.25 1.26 1.24 1.37

90 1.00 1.21 1.20 1.19 1.19 1.34

Table 4.9: Some experiment results in relative value

Figure 4.42: Overall quality comparison

Chapter 5

SLA negotiation protocol for

Grid-based workflow

An SLA negotiation protocol includes the SLA language, the components attending

to negotiation, and the negotiation procedure. This chapter will describe all these

issues.

5.1 Related works

The process of forming an SLA between consumer and provider is called SLA ne-

gotiation protocol. Up to now, although with some variations, most proposed SLA

negotiation protocols [27, 18, 26, 89] have the same model. To start an SLA nego-

tiation for a job, a consumer forms an SLA template including software, hardware

guarantees, computing task description as well as the expected runtime period. The

filled template is sent as an offer to the provider. The provider decides whether to

accept or reject the requested job. The decision depends on the ability of the provider

which can satisfy the requirement such as the number of resources free at the expected

period. The service provider answers the offer with a confirmation or a rejection. In

current proposals, Grid jobs are defined as monolithic entities, where the user sends

the input data to a service, computes the data - without dependencies – on this site

and receives the results.

92CHAPTER 5. SLA NEGOTIATION PROTOCOL FOR GRID-BASED WORKFLOW

In [26], Czajkowski et al introduces a general negotiation model called Service

Negotiation and Acquisition Protocol (SNAP), which proposes a resource manage-

ment model for negotiating resources in distributed systems such as Grids. SNAP

defines three types of SLAs to manage resources across different administrative do-

mains, which are Task SLA (TSLA), Resource SLA (RSLA), and Bind SLA (BSLA).

The TSLA describes the task that needs to be executed, and the RSLA describes the

resources needed to accomplish this task. The BSLA provides an association between

the resources from the RSLA and the application ’task’ in the TSLA. These types of

SLA can be used together to describe a complex service requirement in a distributed

environment. The SNAP protocol is general and true for most cases. When dealing

with a specific problem, a further extension must be done to realize these theories.

Besides that, the SNAP protocol only focuses on a weaker form of agreement and no

cost model or punish model is associated.

An effort to express detailed SLA negotiation and implementation is presented

in [27]. A group of authors at the GGF Grid Resource Agreement and Allocation

Protocol Working Group proposed a draft Agreement-based Grid Service Manage-

ment (OGSI-Agreement) model. This document defines a set of OGSI-compatible

portTypes through which management applications and services can negotiate. This

agreement defines the required behavior (QoS) of a delivered service with reference

to a particular service consumer. Further, the document contains an abstract man-

agement protocol to manage the agreement stages of the QoS lifecycle - from service

creation until termination.

Burchard et al also propose an SLA aware architecture in which SLA negotiation

for execution parameters is done among resource managers [18]. The SLA manage-

ment is achieved via a Virtual Resource Manager (VRM) - that enables interaction

among a number of local resource management on different clusters. The VRM acts

as a coordinator to aggregate SLAs negotiated with different sub-systems. This ar-

chitecture is intended to work with a group of co-located clusters, not grid wide and

does not support workflow execution.

In [89], Nassif et al present an Agent-based SLA Negotiation in Grid. In the

5.2. SLA LANGUAGE 93

negotiation module there are different forms of negotiation called bilateral, multi-

issue and chaining negotiation. A bilateral negotiation occurs when there are only

two parts involved. In a multi-issue negotiation, different aspects are negotiated

such as price, quality and time schedule. The negotiation between user agent and

network agent is also called chaining negotiation. A chaining negotiation occurs after

the negotiation between user agent and server agent has finished.

The initial work to design and implement a system supporting SLA for Grid job is

described in [4, 37]. However, all the above works only support a single Grid job, and

cost model as well as SLOs (Service Level Objective) were not sufficiently discussed.

5.2 SLA language

The SLA-aware workflow in Grid environments is defined on an abstract layer by

specifying the data and the control flow among software and data resources. The def-

inition of the Grid-based workflow is independent of the hardware infrastructure, and

it is up to the Grid architecture to map this workflow onto adequate Grid resources

using additional metadata and to ensure its integrity in execution. The proposed

language is based on the Common Job Description Markup Language [85] and can

describe most aspects required for an SLA-aware workflow such as business descrip-

tion, computational task description, SLO description and data transfer description.

5.2.1 Business description

A typical SLA starts with a business description, which describes general information

about the title of the SLA, the main aim, valid time period, customers, responsible

provider, costs, penalties, etc.

5.2.2 Computational task description

The computational task gives the requirements for software, hardware, input/output,

executables and all other resources needed for a successful running. The computation

task description can be divided into three categories.

94CHAPTER 5. SLA NEGOTIATION PROTOCOL FOR GRID-BASED WORKFLOW

1. Required software components including specification of the operating system,

the database system, the message passing library, and so on.

2. Required hardware components including specifications of the CPU architec-

ture, memory/storage capacity, CPU speed, network speed and other special

devices such as scanner, DVD Writer, etc.

3. Task description with specification of the input/output data, executable, envi-

ronment, start parameters, etc.

An example for modeling these aspects is given in Figure 5.1. For the implemen-

tation of such an example with other languages, e.g. language proposed in [6], the

integration of at least two additional language components is mandatory.

5.2.3 SLO description

SLOs focus on deadlines and the number/capacity of required resources. Traditional

SLOs solely impose the responsibility of failure to provider. However, within an SLA

context in the Grid environment, both provider and consumer can cause failures of a

Grid job. For example, if the time to finish the computation task passes the deadline,

the reason can be the failure of the resources in the provider site or it can be the

underestimated processing duration of the customer. Another scenario is related to

the number of requested and really used resources. If during the runtime period more

resources are used than allocated, other reservations can be affected and thus the

job has to be stopped. None of the available SLA languages states these issues. A

sample SLO extracted from the completed SLA example in the GGF document [6]

is presented in Figure 5.2. It is obvious that no reason, no responsible site, and no

action information are specified.

The proposed SLO states all these problems by implementing the following struc-

tures:

•Condition: What type of problem occurred, e.g. runtime exceeded the deadline

or over used resources.

5.2. SLA LANGUAGE 95

</StringLHS>

<StringValue>Linux1.2.4</StringValue>

</StringRHS>

</StringGreaterThanOrEquals>

</BooleanEquation>

</IntegerLHS>

</IntegerRHS>

</IntegerEqual>

</BooleanEquation>

</StringListValue>

</StringListEquation>

</StringEquation>

</SectionEquation>

Figure 5.1: SLA specifications of HW/SW resources as well as the task de-

scription

•Reason: Why the problem occurred, e.g. system failure or wrong estimation.

•Responsible site: Who is responsible for the problem, the provider or the con-

sumer.

•Action: How the system will treat the task, e.g. cancel or continue.

•Penalty: Which penalty is to be paid by the site responsible for the problem.

•Monitor information: Which job information has to be monitored so that the

consumer can track the progress or observe the circumstances of failure.

96CHAPTER 5. SLA NEGOTIATION PROTOCOL FOR GRID-BASED WORKFLOW

< wsag:GuaranteeTerm

wsag:Name = "MaxEndTime"

wsag:ServiceScope ="ComputerJob1 ComputerJob2" >

< wsag:Variables >

< wsag:Variable

wsag:Name = "endTime" wsag:Metric = "job:endTime" >

<wsag:Location > /wsag:AgreementOffer/

wsag:Terms /wsag: ALL>

</wsag:Location >

</wsag:Variable>

</wsag:Variables >

<wsag:ServiceLevelObject> endTime

IS_BEFORE 2004-05- 16T00.00.00

</wsag:ServiceLevelObject>

<wsag:BusinessValueList>

< wsag:Penalty>

<wsag:Assessmentinterval>

<wsag:ValueExpression>5

</wsag:ValueExpression>

</wsag:Penalty> </wsag:BusinessValueList>

</wsag:GuaranteeTerm>

Figure 5.2: A GGF sample SLO in SLA language

An example for the statement ”If runtime exceeds 3h because of system failure,

the provider will be fined 1000 EUR per hour overdue. The provider sends an exit

code to the customer when the job is finished.” is given in Figure 5.3.

In fact, not every SLO needs all the above information. The number of the data

fields depends on the current scenario under investigation.

5.2.4 Data transmission description

In the next step the data transmission between sub-jobs is described. The data trans-

mission between related sub-jobs also defines the rank order between them. Therefore,

each data transmission presents a directed arc in the workflow graph. The arc is de-

scribed in a pair (source sub-job ID, destination sub-job ID). Additional information

about data to be transferred is presented by a list of files. The corresponding ex-

ample is presented in Figure 5.4. As the workflow is under DAG format, the data

transmissions also describe the structure of the workflow.

The full specifications of the SLA language for Grid-based workflow are described

in Appendix B.

5.3. SLA NEGOTIATION FOR WORKFLOW 97

<StringValue>Runtime exceed</StringValue>

</StringEquation>

<StringValue>System failure</StringValue>

</StringEquation>

<StringValue>Provider</StringValue>

</StringEquation>

</RealEquation>

</StringEquation>

</SectionEquation>

</StringEquation>

</SectionEquation>

<StringValue>ExitCode</StringValue>

</StringEquation>

</StringEquation>

<StringValue>Job terminate</StringValue>

</StringEquation>

</SectionEquation>

Figure 5.3: SLO specification

5.3 SLA negotiation for workflow

As presented in Figure 3.4, there are three main components participating to perform

an SLA for the workflow. They are consumer, broker and providers. These compo-

nents also join the SLA workflow negotiation process. The interaction among them

in the negotiation process is depicted in Figure 5.5.

As can be seen in Figure 5.5, three different types of sub SLA negotiations using

three different types of SLA text exist.

•User - Broker. This negotiation takes place by using SLA for workflow text.

•Broker - Provider. This negotiation process uses SLA for sub-job text.

98CHAPTER 5. SLA NEGOTIATION PROTOCOL FOR GRID-BASED WORKFLOW

<StringEquationattribute="ID">

<StringValue>subjob0-subjob1</StringValue>

</StringEquation>

</RealEquation>

</StringEquation>

</SectionEquation>

<StringValue>PDB-PHYS-THEORY-0</StringValue>

</StringEquation>

</SectionEquation>

<StringValue>PDB-PHYS-THEORY-1</StringValue>

</StringEquation>

</SectionEquation>

</RealEquation>

</StringEquation>

</SectionEquation>

</SectionEquation>

</SectionEquation> </SectionEquation>

Figure 5.4: Specification of data to be transferred

•Provider - Provider. This negotiation negotiates the data transmission between

sub-jobs (and also between the providers), thus the SLA for data transmission

is used.

Although there are three types of SLA negotiations in the process of SLA workflow

negotiation, all negotiation procedures are similar to each other. Only the service

attribute is different. Figure 5.6 describes the basic procedure in client - server model.

•In the first step, the client creates a template SLA with some preliminary service

attribute values and sends it to the server.

5.3. SLA NEGOTIATION FOR WORKFLOW 99

Grid resource

broker

SLA workflow

Local RMS

SLA subjob

Local RMS

SLA subjob

SLA subjob Local RMS

SLA data transfer

Figure 5.5: Interaction among participants in SLA negotiation process for

workflow

C

l

i

e

n

t

S

e

r

v

e

r

1. Template SLA

2. SLA negotiation

3. SLA sign

Figure 5.6: Basic SLA negotiation procedure

•The server parses the text to get information and checks all client requirements.

If there is something inappropriate, the server reforms the SLA and sends it

back to the client. The difference of the new version compared to the old one

lies in two aspects: modified information and additional information. Modified

information can be start, stop time, cost, etc. Additional information can be

SLO, FTP address, path, etc. The client checks the new SLA version and

sends it back to the server. The process is repeated until there is nothing to be

changed or the SLA is cancelled. That is the negotiation phase.

•When both client and server accept the content of SLA, the signing phase begins.

The server signs the SLA and then sends it to the client. The client signs it

100CHAPTER 5. SLA NEGOTIATION PROTOCOL FOR GRID-BASED WORKFLOW

and sends it back to the server. After signing, the SLA is valid and cannot be

changed.

The SLA negotiation for workflow is a complex process with the participation of

many components in the system and happens in many phases. Figure 5.7 depicts the

process in a time sequence.

C

u

s

t

o

m

e

r

B

r

o

k

e

r

Template SLA workflow P

r

o

v

i

d

e

r

P

r

o

v

i

d

e

r

SLA subjob negotiation

SLA dat-tran negotiation

SLA workflow signing

SLA workflow negotiation Template SLA subjob

SLA subjob signing Template SLA dat-tran

SLA dat-tran signing

Phase 1 Phase 2 Phase 3

Figure 5.7: SLA negotiation process for workflow

The process starts with the customer sending a template SLA workflow to the

broker. The broker finds a solution and negotiates with customer to have an accep-

tance SLA. After that, broker will make SLA sub-job negotiations with the providers

for all sub-jobs in the workflow. The SLA data transfer negotiations can be done

only when all related SLA sub-job negotiations finish. If the SLA sub-job negotiation

phase and SLA data transmission negotiation phase are finished without problem,

the SLA workflow will be signed. Otherwise, the broker will find another solution

and phase 2 and phase 3 are repeated. If a solution cannot be found, the broker will

negotiate with the client to change something in the SLA workflow content. If the

client agrees, the whole process is started again. Otherwise, the SLA will be canceled.

In the three phases, creating a template SLA step and a signing SLA step are

quite straightforward. The following parts will describe detail each SLA text used

in the three phases and the negotiation with strong focus on the operation of each

participant and the modified and additional information in the SLA text.

5.3. SLA NEGOTIATION FOR WORKFLOW 101

5.3.1 Customer - Broker negotiation

SLA text for workflow

SLA workflow is used in the negotiation between the user and the broker. It includes

five main parts as depicted in Figure 5.8.

General SLA description

SLA workflow

Subjobs description

SLO description

Data transfer description

Signature

Figure 5.8: SLA workflow structure

•The general SLA part describes the start, stop time of the workflow, information

about consumer and broker, cost of the SLA, etc.

•The sub-jobs part describes specification of all sub-jobs in the workflow. A

more detailed description about the sub-job will be presented in the SLA sub-

job section.

•The SLOs part expresses complex conditions determining boundaries over many

service attributes. The SLOs focus on deadlines and the number/capacity of

required resources.

•The data transmission part describes data to be transferred between sub-jobs.

This information also describes the dependency structure of the workflow.

•The signature part describes the signature of two sites to ensure the legality of

the SLA.

102CHAPTER 5. SLA NEGOTIATION PROTOCOL FOR GRID-BASED WORKFLOW

It can be said that an SLA workflow is the template for SLA subjobs as well as

SLA data transfer.

SLA negotiation

When receiving an SLA from customer, the broker parses it to get all information

about the general SLA, sub-jobs, SLO, data transmission, the dependency among

sub-jobs and the structure of the workflow. From the information of sub-jobs and the

structure of workflow, the broker does mapping to determine the appropriate provider

and the time period to run each sub-job. Within the SLA context, if a user wants to

use a resource he must pay the cost. The essence is that every user wants to find a

feasible solution but as cheaply as possible. Therefore the global optimal destination

of the algorithm is finding a feasible solution with the lowest or near lowest cost. The

detail of mapping algorithm has been described in chapter 4.

The modified information can be:

•Cost of SLA. The cost of executing a Grid workflow is the sum of 4 primary

factors: cost of using nodes, cost of using expert, cost of using storage, cost of

transferring data. If two sequential sub-jobs run on the same RMS, the cost of

transferring data from the previous sub-job to the later sub-job is zero.

•Start, stop time of the workflow. Depending on the state of the Grid, a mapping

module can find a feasible solution in the expected time period or not. If not,

it will find the earliest solution and ask for the consumer’s approval.

The additional information can be:

•Start, stop time of each sub-job. When submitting to the broker, the consumer

determines only the runtime period of each sub-job but not exactly when. Af-

ter mapping, the starting time as well as the stop time are determined and

announced to the consumer.

•The case of the start and stop time of each data transfer SLA is also similar to

above.

5.3. SLA NEGOTIATION FOR WORKFLOW 103

•SLO for workflow. In our system, there are five common SLOs: SLO indicat-

ing that runtime is exceeding because of system failure, SLO indicating that

workflow cannot run because of system failure, SLO indicating that runtime is

exceeding because of a wrong estimation, SLO indicating that storage is exceed-

ing because of a wrong estimation, SLO indicating that memory is exceeding

because of a wrong estimation.

5.3.2 Broker - Provider negotiation

The SLA sub-job is used in the negotiation process between the broker and the

provider to execute a sub-job in the workflow. The structure of the SLA sub-job also

includes five main parts like the SLA workflow, which are general SLAs, compute

task description, SLO, sub-job related data transmission and signature. Each sub-job

is identified by a unique ID. The compute task description includes software request,

hardware request and the task itself, which lists the execution file name, parameter,

stdin, stdout, etc.

When receiving the SLA for a sub-job, the provider parses the document and

checks for following information.

•The availability of its software and hardware in the required time period.

•The ability to fulfill SLOs and monitor data requirement.

•The cost to execute the sub-job.

If every thing is OK, the provider will add information about the FTP address and

storage path to the part of data transmission SLA. This information can be applied

only for the SLA data transmission user - RMS, and RMS - user.

5.3.3 Provider - Provider negotiation

SLA for data transmission negotiation

Returning to Figure 3.4, there are three types of data transmission activities, which

are user to provider, provider to provider and provider to user. Because of this, there

104CHAPTER 5. SLA NEGOTIATION PROTOCOL FOR GRID-BASED WORKFLOW

are also three types of SLA data transmission. To distinguish among those SLA data

transmissions, we use a field in the SLA text as control data. The structure of an

SLA data transmission includes four main parts as depicted in Figure 5.9.

General SLA description

Grid FTP

File list

Signature

SLA data transfer

Figure 5.9: SLA data transmission structure

•Besides similar parameters in the SLA workflow, the general SLA of data trans-

fer describes the amount of data to be transferred.

•The Grid FTP part describes the address of the FTP server and the path to

the location where the data files are actually stored.

•The file list part describes a list of file names to be transferred. Those files

could have a relative path and a default directory where sub-job is executed.

SLA negotiation

The provider - provider negotiation process is a little more complex. It can be done

only when the two related SLA sub-job negotiations are finished. If not, the destina-

tion provider will find that the submitted SLA data transmission belongs to none of

the jobs it manages and discards it.

To solve this problem, the broker will play the role of service point to synchronize

the negotiation process. As the broker can control the state of all sub-job negotiations,

5.4. SUMMARY 105

it will transfer appropriate data transmission SLA text to the customer and the

provider. If every thing is OK, the destination provider will add information about

the FTP address and storage path to the part of data transmission SLA.

5.4 Summary

The SLA negotiation process for workflows in the Grid environment is more sophis-

ticated than the case of a single job with the participation of more components, the

appearance of many SLA forms, and the more complex SLA structure. This chapter

has intensively presented this process. The negotiation starts with workflow and then

with each sub-job and finally with data transmission between sub-jobs. In each phase,

there are different negotiating procedures, SLA text structures and negotiable data.

Chapter 6

Error recovery mechanism for

Grid-based workflow

Error recovery is an important issue with a system supporting SLA for Grid-based

workflow. In a large and complex system like the Grid, errors can happen at any

time and in any part of the system with high frequency. The source of errors varies

with network cable breakage, scratched software, hardware failure and so on. This

fact makes a big risk of breaking the negotiated SLA to a system supporting SLA for

Grid-based workflow. Therefore, the system must have some preparation for the bad

situation in order to avoid or eliminate the bad effect. In this chapter, we concentrate

on the case of catastrophic failure, when one or several RMSs are detached out of the

Grid system at a particular time.

6.1 Related works

The importance of fault tolerance in Grid computing has already been recognized by

the establishment of the Grid Checkpoint Recovery Working Group [115]. Its purpose

is to define user-level mechanisms and grid services for fault tolerance.

Up to now, we have not noticed other works exploiting the error recovery problem

for workflow, especially in the SLA context. There is, however, a considerable amount

of work in related areas, especially in finding recovery methods for single Grid job.

106

6.2. ERROR RECOVERY MECHANISM 107

Garbacki et al present a transparent fault tolerance for the Grid application based on

Java RMI [42]. They use a globally consistent checkpoint to avoid having to restart

long-running computations from scratch after a system crash.

In [61], Hwang et al present a flexible handling failure framework for the Grid.

Central to the framework is the flexibility in handling failure, which is achieved by

using the workflow structure as a high-level recovery policy specification.

Heine et al describe an SLA-aware job Migration in Grid environment in [55].

The checkpoint of the running job is migrated to the same cluster or another clus-

ter running HPC4U software [56]. An architecture called VRM (Virtual Resource

Management) manages and responds for the process to happen fluently.

6.2 Error recovery mechanism

Subjob 0

RMS 1

SLA workflow

broker

Subjob 4

RMS 1

Subjob 2

RMS 2 Subjob 6

RMS 2

Subjob 5

RMS 1

Subjob 3

RMS 2

Subjob 1

RMS 2

Figure 6.1: Sample running workflow scenario

In the SLA context, each sub-job of the workflow is planned to run on reserved

resources within a specific time period to ensure the QoS while still preserving the

integrity of the workflow. An example of such a scenario applied for the example

workflow in Figure 6.1 is presented in Table 6.1 and Table 6.2. In Table 6.2, we have

a note that with two dependent sub-jobs running in the same RMS, the data transfer

time between them is equal to 0.

108CHAPTER 6. ERROR RECOVERY MECHANISM FOR GRID-BASED WORKFLOW

1. If detect the error happening with one or

several RMSs

2. Determine all affected workflows with their

associated sub-jobs which need to be

remapped.

3. Form new workflows and determine

remapping priority for the new workflows.

4. With each workflow do mapping to the

healthy RMS.

5. Do negotiation with local RMSs to cancel

the old one and execute the new workflows.

Figure 6.2: Abstract error recovery mechanism

During the running process of the workflow, one or several RMSs can be detached

out of the system at a time. We propose a mechanism as described in Figure 6.2

to detect and recover the error when it happens. Steps 1 and 5, which detect the

happening of the error and realizes the error recovery procedure, are described in

chapter 7. In this part, we concentrate on describe the kernel of the error recovery

mechanism, which includes step 2, 3 and 4.

ID RMS Duration

0 RMS1 0-5

1 RMS2 7-14

2 RMS2 17-21

3 RMS2 23-30

4 RMS1 7-15

5 RMS1 17-30

6 RMS2 32-35

Table 6.1: Running timetable of the sample workflow in Figure 6.1

6.2. ERROR RECOVERY MECHANISM 109

Sj link RMS linh Duration

0-1 1-2 6-6

0-2 1-2 13-16

0-4 1-1 0

1-6 2-2 0

2-3 2-2 0

4-3 1-2 20-22

4-5 1-1 0

3-6 2-2 0

5-6 1-2 31-31

Table 6.2: Data transfer time table of the sample workflow in Figure 6.1

6.2.1 Determining sub-jobs to be replanned

First of all, we determine the set of affected workflows, which includes workflow having

its sub-jobs running/waiting or finished but the output data still not transferred in

the failed RMSs. After that, determining all affected sub-jobs in a workflow is done

with the three following observations.

•Sub-jobs, which are running in the failed RMS, are directly affected. Re-

mapping those sub-jobs will lead to the necessity of remapping all other sequen-

tial sub-jobs to ensure the integrity character. When the remapping proceeds,

there is no guarantee that it will ensure the execution order for other wait-

ing sub-jobs. Besides that, re-planning all the waiting sub-jobs in an healthy

RMSs does not affect the final destination of minimizing the workflow execution

makespan. Thus, we can say that all directly affected sub-jobs and all waiting

sub-jobs in the affected workflow belong to the set of affected sub-jobs and need

to be re-planned.

•With determined affected sub-jobs in the fail RMS, they will not have input

data to run if their directly finished previous sub-jobs are also in the failed

RMS. Thus, it is necessary to rerun those finished sub-jobs.

•With determined affected sub-jobs in the healthy RMSs, they will not have

input data to run if their directly finished previous sub-jobs are in the failed

110CHAPTER 6. ERROR RECOVERY MECHANISM FOR GRID-BASED WORKFLOW

RMS and the related data transfer task is not finished. Those finished sub-jobs

must also be rerun.

From the above observations we define an algorithm as presented in Figure 6.3.

foreach one in the set of affected workflow{

foreach sub-job of the workflow {

if have end_time > fail_slot

put to set of affected sub-job

if have end_time < fail_slot and next related sub-

job waiting/running in the failing RMS

put to set of affected sub-job

if have end_time < fail_slot and executed in

failing RMS and output data still not transferred

to waiting sub-jobs in the healthy RMSs

put to set of affected sub-job

}

Figure 6.3: Determining affected sub-jobs algorithm

For illustration purposes, we use the workflow sample in Figure 6.1 with the

runtime parameter as in Table 6.1 and Table 6.2. Supposing that the RMS 1 fails at

time slot 10, sub-job 4 is directly affected leading to sub-job 3, 5, and 6 will also be

affected. When sub-job 3 is re-planned, there is no guarantee that it will be run after

the finished slot of the waiting sub-job 2. Thus, we determined that all sub-jobs 2,

3, 4, 5, 6 in the affected workflow belong to the set of affected sub-jobs and need be

re-planned. Because of the failure of RMS1, the output from sub-job 0 is lost and

there is no data input for rerunning sub-job 4. It is necessary to add sub-job 0 to the

list of affected sub-jobs.

With our example, after this phase, all affected sub-jobs of the affected workflow

form a new workflow as depicted in Figure 6.4. These sub-jobs inherit all character-

istics about resource and runtime requirement of the original sub-jobs.

6.2. ERROR RECOVERY MECHANISM 111

Subjob 0 Subjob 4

Subjob 2 Subjob 6

Subjob 5

Subjob 3

Figure 6.4: All determined affected sub-jobs in the sample workflow

6.2.2 Determining re-mapping priority

When an error happens, many workflows can be simultaneously affected and we have

to re-plan many new workflows which are formed from sets of determined affected

sub-jobs. One raised problem is to determine which in those newly formed workflows

should be re-planned earlier. It is important because the workflow which is mapped

earlier will have a lower latency. Here, we use the policy Earliest Deadline First

(EDF), which is used broadly in a real time system. The workflow having an earlier

deadline will be given higher priority as it occupies resources shorter and the other

workflow need shorter time to wait for available resource. Thus, the total latency is

reduced and the fine amount is also reduced.

To clarify the problem, we look at an example as presented in Figure 6.5. Suppose

that we have 2 workflows that need to be re-mapped because of the error and the Grid

system can execute one workflow at a time. Suppose that the penalty for each hour

late is P. If workflow 2 is mapped first, workflow 1 has to wait until the workflow 2 is

finished. Thus the minimal fine will be P∗(t2−fail slot). If workflow 1 is mapped

first, the minimal fine will be P∗(t1−fail slot). Thus mapping workflow 1 first is

better than workflow 2. In real complex situations, mapping workflow 1 first gives

more chance to finish workflow 1 earlier, to release resources earlier and give more

chance for workflow 2 to be mapped with smaller latency.

Based on this priority, each workflow will be mapped in sequence to the healthy

RMSs. To do the mapping, we refine the new workflow under Directed Acyclic Graph

(DAG) format and then use the mapping module to map this new DAG workflow to

RMSs. When forming DAG for a workflow, it is necessary to consider the dependency

of affected sub-jobs with running sub-jobs in healthy RMS to ensure the integrity of

112CHAPTER 6. ERROR RECOVERY MECHANISM FOR GRID-BASED WORKFLOW

Time

Fail slot

Workflow 1

Workflow 2

t1 t2

Figure 6.5: Scenario of two workflows need to be re-mapped

the workflow. To present that dependency, in the new workflow, with each running

sub-job in the healthy RMSs, we create a pseudo corresponding sub-job, which is:

•Runtime equal to Deadline - fail slot - time overhead

•number of required CPU equal to 0

•number of required storage equal to 0

•number of required expert equal to 0

where time overhead value is the period to do the recovery process. Moreover, we

also need a new pseudo source sub-job for the workflow with a runtime and resource

requirement equal to 0.

In our example scenario, sub-job 6 must run after the finish of sub-job 1 but sub-

job 1 does not appear in the set of affected sub-jobs. Thus we add a pseudo sub-job

with a runtime equal to 4 and resource requirement equal to 0 to the workflow. After

this step the new formed example workflow is as in Figure 6.6.

6.2.3 Mapping algorithm

As the goal of the mapping algorithm in this phase is finding out a solution, which has

the makespan as small as possible. This task can be handled by the w-Tabu algorithm

without changing. The detail description of w-Tabu algorithm is in chapter 4.

6.3. PERFORMANCE EXPERIMENT 113

Subjob 0 Subjob 4

Subjob 2 Subjob 6

Subjob 5

Subjob 3

Pseudo sj

Source

Figure 6.6: DAG form of the new workflow

6.3 Performance experiment

The performance experiment is done with simulation to check for the quality of the

algorithm in part 6.3 and the overall performance of the mechanism. The hardware

and software used in the experiments are rather standard and simple (Pentium 4

2,8Ghz, 1GB RAM, Linux Redhat 9.0, MySQL).

The goal of the experiment is to measure the total reaction time of the recovery

mechanism in absolute value when the error happens. Determining total reaction time

is important because it helps defining the earliest start time of the remap workflow,

which is a necessary parameter for mapping algorithm. To do the experiment, we

use 20 RMSs with different resource configuration and then we fill all the RMSs with

randomly generated workflows having start time slot = 20. The number of failing

RMS increases from 1 to 3 and which RMS fails is selected randomly. With each

number of failing RMS, the fail slot is increased along the reservation axis. The

reason for this activity is that the error can happen at any random time slot along

the reservation axis. Therefore, the broader the range of experiment time is, the more

correctly the reaction time value is determined. At each time, we used the described

recovery mechanism to remap all affected workflow as well as all affected sub-jobs

and measured runtime. The runtime is computed in seconds. The experiment results

are described in Table 6.3, 6.4, 6.5.

From the result tables, it can be seen that the speed of the algorithm depends

on the number of the affected workflow, the number of the affected sub-jobs in the

workflow and the number of healthy RMSs. With the same number of RMSs, the

114CHAPTER 6. ERROR RECOVERY MECHANISM FOR GRID-BASED WORKFLOW

Fail-slot Total-wf Total-sj runtime

50 10 156 90

60 10 143 74

70 10 138 66

80 10 135 55

90 10 117 44

100 10 109 45

110 10 98 36

120 10 89 29

Table 6.3: Experiment results with 1 failing RMS

Fail-slot Total-wf Total-sj runtime

50 13 199 80

60 13 180 63

70 13 170 56

80 13 152 48

90 13 140 47

100 13 129 46

110 13 115 35

120 13 103 29

Table 6.4: Experiment results with 2 failing RMSs

total reaction time increases following the increase of total number of affected sub-

jobs. In the real world situation, the number of RMSs can be increased to be a very

large number but also to be very heterogeneous. Thus, the number of RMSs which

have the same resource parameter is not so large. When the number of failing RMSs

increases, the number of affected sub-jobs increases but the number of remaining

healthy RMSs also decreases. This leads to the fact that the final result does not have

considerable difference compared to the case of having 1 failing RMS. Furthermore,

the probability of more than 2 failing RMSs simultaneously at a time is very rare.

For these reasons, the simulation data can be dependable. With the total reaction

time only just less than 2 minute compared to the hourly running workflow, the

performance of the algorithm is well accepted in real situations. In the mapping

algorithm, time is computed in slot, which can have resolution from 2 to 5 minutes.

The reaction time of the mechanism will occupy 1 time slot and the time for system

6.4. SUMMARY 115

Fail-slot Total-wf Total-sj runtime

50 14 213 79

60 14 194 71

70 14 183 61

80 14 164 54

90 14 152 51

100 14 140 46

110 14 127 41

120 14 115 34

Table 6.5: Experiment results with 3 failing RMSs

to do negotiation takes about 1 time slot. Thus the start time slot of the re-mapping

workflow can be assigned to the value of present time slot plus 2.

6.4 Summary

This chapter has described an error recovery method for the workflow within SLA

context. We do not cover all cases of errors, which can happen to the system sup-

porting SLA for workflow but have concentrated on a catastrophic scenario where

one or several RMSs detached out of the system at a time. The main contribution

of the chapter locates in the newly stated problem and the proposed mechanism to

solve it. We have proposed the algorithm to detect all affected sub-jobs when the

error happen and apply w-Tabu algorithm to re-map those sub-jobs to the remaining

healthy RMSs with makespan optimization.

Chapter 7

System implementation

To provide the service of executing the workflow within SLA context, many described

functional modules must work together within a system. As the workload and the

resource in our problem have distinguishing characters, the core execution engine of

the system also requires a specific working mechanism. This chapter will describe the

architecture, the execution engine, and the initial deployment of the system.

7.1 Introduction

The theory will have little impact if it cannot be realized to solve a problem. All

described theory topics above will have little meaning if they cannot work in a real

system. The constraint in realizing the system is that all modules must be combined

in a way that they can co-operate with each other to perform the ultimate function:

executing the workflow within SLA context.

In the system, the execution module does the important work, which is executing

the workflow. A prerequisite for a planning-based scheduling needed for the SLA-

aware execution is that the maximal runtime of the job is estimated and a-priori

known. In case of an underestimated runtime, the sub-job will be stopped before

completion and thus will affect the entire workflow. Furthermore, the mapping algo-

rithm considers all RMSs with resource configuration at least equal to or better than

the requirements. If the computing task runs on more powerful resources, it will need

116

7.1. INTRODUCTION 117

shorter time to run than estimated. For example, the task requires 15 time slots to

run on 8 CPU 1GHz. If it was run on 8 CPU 1,7 GHz, it would require 13 time slots

to finished. This fact will make inefficient usage of resources in RMS. We can look

more concretely at a scenario with a workflow being mapped to 2 RMSs as in table

7.1. If the real runtime of each sub-job = 0.8 estimate time, the resource usage in

RMS 2 is fragmented as presented in Figure 7.1. With just several narrow slots, it is

difficult to insert a job and those slots are wasted.

0 5 10 15 20 25 30 t

CPU

18

20

Figure 7.1: CPU usage profile in RMS 2 if have no adjustment

ID Nr CPU RMS Duration

0 18 RMS1 0-5

1 16 RMS2 7-14

2 20 RMS2 17-21

3 22 RMS2 23-30

4 18 RMS1 7-15

5 20 RMS1 17-30

6 16 RMS2 32-35

Table 7.1: Initial running time table of the sample workflow

Moreover, a distributed engine for handling complex workflow models within the

SLA context imposes several constraints compared to execution of single jobs as well

as compared to workflows not bound to SLAs. In particular the communication and

cooperation as well as the synchronization between the components of the system have

to be engineered in a suitable way in order to ensure the integrity of the exchanged

data and the workflow itself.

118 CHAPTER 7. SYSTEM IMPLEMENTATION

The following parts will describe the implementation structure of the system in

which the interaction among many modules will be presented. The detail about

adaptive execution mechanism will be more concentrated.

7.2 Related works

In several research projects related to Grid-based workflow execution [30, 87], the

execution engine is based on Condor’s DAGman [79] or [36]. In those systems, the

user defines the sub-job as well as the dependency among them and then submits

the job to the execution engine. The engine executes the workflow step-by-step and

returns the results to the user. In the system architecture, Condor’s DAGman and

Unicore act as a meta-scheduler. However, both systems are based on a queuing

model of resource allocation so they do not support the reservation to ensure the

desired QoS levels and the fulfillment of the SLAs.

Wesner et al. [124] describe the system GRASP, which supports certain QoS

for the workflow execution at a broker level. The execution engine in GRASP is

based on the BizTalk Server 2004 working primarily with Web Services. Therefore,

it lacks the ability of supporting Grid services. To overcome that problem, GRASP

used the support of reliable Web Services which acts as wrapper for the real service

implementation. After mapping the so-called pre-SLA to the composite services with

the SLA of the single Grid services involved in the orchestration, the execution engine

discovers at runtime the best hosting environment for the single services with the

expected SLA. There is no resource reservation so the system is not able to predict

the QoS of the composite service.

Afzal et al. present a distributed execution architecture which can handle the

execution of a workflow [2]. The Launching Service is a framework service providing

a generic interface to different DRM systems. The Launching Service may represent a

single resource (in the case of a ShellScript-Launcher) or an entire cluster of computers

(SGE launcher). Jobs are submitted to the Launching Service as a JDML document

and a set of user credentials (through the launchJob port). The Launching Service

and the attached launcher process the JDML document and deploy the work as

7.3. SYSTEM IMPLEMENTATION 119

appropriate. When a Launching Service is started it advertises its existence to the

Grid. However, in this work the execution engine work primarily with the non-

reservation model and the support for resource reservation is still work in progress.

Recent works [86, 130, 16] also built systems to support QoS features for Grid-

based workflow. In their work, a workflow includes many sub-jobs, which are sequen-

tial programs, and a Grid service has ability to handle one sub-job at a time. The

resource is also reserved to ensure the QoS. However, in those works, an adaptive

execution engine does not appear.

7.3 System implementation

An implemented prototype fulfils the above constraints and provides basic features

which allow any client to negotiate SLAs, monitor running workflows and to re-

ceive the result. Based on standard components such as Globus Toolkit 3.2, MySQL

database, Maui ME for a local RMS, the system is compatible with the existing Grid

infrastructure. The overall architecture schema is depicted in Figure 7.2.

Maui ME

SLA aware layer

SLA local service

OGSI

Parser

Error recovery

Mapping Monitoring

Negotiation

SLA workflow broker service

OGSI

Client Web Interface

Client

Broker

Local

RMS

Database

Execution

Figure 7.2: System implementation layers

The system includes three main components: the client plays the consumer role,

the SLA-aware workflow broker, and the SLA-aware local RMS act as provider.

120 CHAPTER 7. SYSTEM IMPLEMENTATION

7.3.1 SLA-aware RMS

The local RMS provides SLA services for each individual site on the Grid. This

service follows the OGSA principles and is based on the latest version of Globus

Toolkit. When deployed, this service is one among many services that are provided

by the Globus Toolkit. The local RMS must be SLA aware, as resource reservations

are necessary. Furthermore, standard features such as resource monitoring and fault

recovery are mandatory. However, most of the existing RMSs do not support these

features, therefore we provide with MauiME an SLA-aware layer on top of the RMS.

The structure of the local RMS is depicted in Figure 7.3.

MauiME

SLA

parser Planner Monitoring

SLA

negotiation

Error

recovery

SLA local RMS

Execution

agent

Figure 7.3: SLA-aware local RMS architecture

•Planner. MauiME supports solely node reservations. So we did not use this

feature of MauiME but developed a planner which can support reservations for

nodes, storage, and experts.

•Monitoring. This module monitors the state of jobs/nodes/resources. This

information is used to detect failures or QoS violations.

•SLA parser. This module is written in Java to parse or create SLA text using

the SLA language for workflows.

•SLA negotiation. This module uses the SLA parser and the module planner to

check the feasibility of the SLA. The planner output is used to negotiate the

7.3. SYSTEM IMPLEMENTATION 121

SLA with other participants.

•Error recovery. This module is responsible for running jobs even if some nodes

have failed. The monitoring information delivers reports on possible failures.

The error recovery module cancels the specific job, uses the planner to allocate

new resources and re-executes the job from the checkpoint image.

•Execution agent. This module is responsible for executing sub-jobs and is de-

scribed in depth in section 7.4.

7.3.2 SLA workflow broker

The SLA workflow broker is the central system unit and responsible for processing the

client requirements and for dispatching sub-jobs as well as for the SLA negotiation

between clients and RMS. The SLA workflow broker provides services to clients and

uses services from the local RMS. The communication (client – broker and broker –

local RMS) is done by the OGSI platform. As can be seen in Figure 7.3, the SLA

workflow broker includes many modules, some of which were described in previous

chapters. For co-operation among these modules, the SLA broker uses a database to

manage all aspect of operation. The co-operation among modules are described as

follows.

Performing SLA negotiation

The SLA broker receives an SLA requirement from the client and parses it to get all

sub-jobs information. The information is stored in three files: General SLA descrip-

tion, sub-jobs description, and arcs of the workflow description. Based on these files

and the data of the local RMS (resource description, reservation in database), the

SLA broker invokes the mapping algorithm. If a feasible solution is found it returns

the solution which includes the sub-job ID with its associate RMS ID and the starting

time slot. The module SLA negotiation uses this information. If the client accepts

the solution from the previous step, the SLA broker will invoke an insertion module

to enter all necessary information into the database. Subsequently, the next module

122 CHAPTER 7. SYSTEM IMPLEMENTATION

the SLA local service client module negotiates with the local RMS, collects all GFTP

handle services and returns the SLA flow ID to the client together with all GFTP

handle services.

Performing error recovery

Error detection is done with the monitoring module. The monitoring module collects

information about the RMS state, the RMS resource, the RMS reservation, the sub-

jobs state and so on from all RMSs. This information will be analyzed and stored in

the central database to ensure that the broker module could have an overall image of

the system. With the push model, after a period of time, local RMSs will connect with

the broker and push monitor information. If the broker does not receive information

from a local RMS it will consider that RMS as failed and activate the error recovery

module. When error recovery module is activated, it will do the following actions in

a strict sequence:

•Accessing database to retrieve information about failure RMSs and determine

affected workflow as well as necessary sub-jobs of the workflow to be remapped.

•Based on determined information about affected workflows and sub-jobs, ac-

tivating the negotiation module to cancel all SLA sub-jobs with local RMSs

related to specific sub-jobs. All negotiation activities are done with the help of

the SLA text as the mean of communication.

•Activating the monitoring module to update the newest information about the

RMS, especially information about resource reservation.

•Calling the mapping module to determine where and when sub-jobs in the

affected workflow will be run.

•Based on mapping information, activating the negotiation module to sign a new

SLA for each sub-job with the specific local RSM.

•Updating workflow control information and sub-jobs information in the central

database.

7.4. ADAPTIVE EXECUTION ENGINE 123

7.3.3 Client

The client is implemented in Java and provides a Grid service interface. The SLA

text is compiled in a file and transferred to the SLA broker through the OGSI infras-

tructure. In the negotiation period, the difference between submitted and received

SLA text is detected and presented to the user. The client component allows the user

to supervise the performance of the system, periodically or randomly.

7.4 Adaptive execution engine

The adaptive SLA execution engine contains the adaptive runtime control mechanism

and is implemented/integrated in the framework for SLA-aware workflows.

7.4.1 Adaptive runtime control mechanism

The main idea is based on shifting forward the sequential sub-jobs of the workflow a

period equal or near equal to the spare time of the finished sub-job. The procedure

start when the finished state of a sub-job in the workflow is received.

In the first step a GO/No-GO decision is required. For this purpose the description

of each sub-job in the Grid workflow including the main components such as input

data, computing task and output data as well as the source, and destination for the

data transfer are considered. Thus, there is only one computing task, but many data

transfer tasks for each sub-job. Based on the estimated runtime, a start/stop time of

the workflow is computed serving as a basis for resource allocation. A typical time

sequence of sub-job tasks is depicted in Figure 7.4.

Data input Computing task Data output

t1 t2 t3 t4 t7t6

t_fin

Figure 7.4: Time sequence of a sub-job

The shifting occurs only when the finished time t fin of the sub-job is earlier

than the estimated ending time t4. In this case, the set of sub-jobs to be shifted is

124 CHAPTER 7. SYSTEM IMPLEMENTATION

determined. The sub-job candidate for shifting depends directly on the finished job.

For example, when sub-job 4 finishes early, sub-job 3 and 5 will be considered to be

shifted. However, not all of them can be moved. Only a sub-job without dependencies

to any running/waiting sub-job, is able to shift while preserving the workflow integrity.

In our example, at time slot 13, sub-job 3 still depends on the running of sub-job 2

and it cannot be adjusted at this point. Thus, the shifting candidate set contains

only sub-job 5. A further shifting of other indirect depended sub-job such as sub-job

6 will not be performed because the problem complexity increases and the success

chances drop rapidly.

After determining the set of sub-job to be adjusted, the shifting time-period has to

be computed for each job. Let sj s be the sub-job to be shifted and sj f the finished

sub-job. In Figure 3 we can see the maximum shifting period is sj f.t4−sj f.t fin.

The new start time of sj s as well as the shifting period is determined by checking the

bandwidth reservation profile and the resource reservation profile of the RMS. If the

new start time is earlier than the old value, the shifting procedure is started. After

determining the adjusted sub-job and time period, the modified runtime requirements

are sent to the RMS in order to adapt the planned runtime of the task. Subsequently,

the modified transfer time is sent to the RMS which finished the sub-job earlier. From

that control information, the RMS does the rest to finish the task. After applying

the adjustment mechanism, the resource usage profile is as in Figure 7.5.

0 5 10 15 20 25 30 t

CPU

18

20

Figure 7.5: CPU usage profile of RMS 2 after doing the adjustment

7.4. ADAPTIVE EXECUTION ENGINE 125

This procedure allows more efficient use of the RMS resources and increases the

reliability of the system in executing Grid workflows within an SLA context. In case

of resource failure, the finish time of computing task cannot meet the deadline and

the service provider will violate the SLA. Shifting sub-jobs to run earlier will increase

the spare time, which can be used for implementation fault tolerance measures such

as checkpointing and migration to other resources.

7.4.2 Adaptive SLA execution engine

The SLA execution engine handles the execution task. Because sub-jobs of the work-

flow are distributed over multiple RMSs, ensuring the complete execution of the

workflow requires a distributed model. The execution engine includes a management

module and several agents modules located in the involved RMSs. The module execu-

tion manager controls the work of all execution agents to ensure the integrity and the

completeness of the workflow. A module execution agent supervises the execution

of the sub-jobs directly. The following sections describe the functionality of those

modules in detail.

Execution agent

The execution agent is responsible for computing the task. All tasks are stored in a

queue which is analyzed by the execution agent in regular intervals in order to find

the task which can be run next and execute it. There are two types of tasks, each

with different execution procedures. The data transfer task moves the data output

from the completed computing task of this sub-job to another sub-job. The execution

engine checks for the completeness of the associated computing task and the existence

of necessary data file. Then it uses the FTP protocol to transfer the file to destination.

To execute a computing task, the execution agent checks if all necessary input

files are already in the appropriate directory and then defines the submission file and

submits the task to the RMS. After submission, the execution agent investigates the

state of the computing task until it is finished. During the runtime, the state of the

computing task is periodically sent to the execution manager. The computing task

126 CHAPTER 7. SYSTEM IMPLEMENTATION

can be canceled by the local RMS if it uses more time or more resource than allowed.

In this case, the execution agent will remove all related data transfer tasks from the

queue and inform the execution manager. Otherwise the computing task might be

finished before the estimated deadline because of overestimated runtime or because

of increased computing power compared to the specified resources in the SLA. In this

situation, the execution agent waits for instruction from execution manager.

Workflow execution management

Both states – canceled and finished earlier – affect the entire workflow. The work of

the execution management module reacts to the state in a suitable way to ensure the

integrity of the workflow. A canceled sub-job leads to the termination of the whole

workflow as the data dependencies cannot be resolved for the remaining workflow

part. In this case, the execution manager announces a cancel request to all running

and waiting sub-jobs.

M

a

n

a

g

er

A

g

e

nt

1. Modify subjob request

-

2. Modify response

3. Modify data transfer

4. Modify response

Figure 7.6: Negotiation procedure to shift sub-job

When one computing task finishes earlier than estimated, the execution agent uses

the shifting procedure as described above. The communication procedure to realize

the task is presented in Figure 7.6. The execution agent parses the request to get

necessary information about the sub-job and then backs up and removes all related

data of the sub-job. Subsequently, the agent plans the sub-job scheduling with the

new information; if the planning is successful, it inserts the sub-job into the queue.

The result is sent to the execution manager which computes the new data transfer

7.5. SYSTEM DEPLOYMENT 127

time of the sub-job which finished earlier. Finally, it sends the modified data transfer

request to the execution agent.

7.4.3 Performance measurements

The performance measurements are based on extensive simulations of Grid-based

workflows and aim at evaluating the quality and efficiency of the adaptive runtime

adjustment. The simulation is done in several Grid resource configurations involv-

ing varying number of RMSs and different resource configurations. The Grid-based

workflows are generated randomly and mapped by the mapping algorithm. There-

after, at each time slot in each RMS, we generate randomly a sub-job and compute

a scheduling for the RMS with and without activation of the runtime adjustment

method. The amount of successfully mapped sub-jobs and the achieved workload is

depicted in Table 7.2.

Total nr sub-jobs mapped

Nr RMS Adjust Non-adjust Workload inc rate

5 483 468 2.3%

8 695 672 4.6%

10 934 896 3.7%

15 1420 1253 5.1%

20 1876 1578 4.3%

Table 7.2: Simulation results

Within the SLA context, increasing workload rate 5% means that the income of

the RMS also increases 5%, a worth of considerable value in business.

7.5 System deployment

The prototype is deployed in three cluster systems running Linux, LAM-MPI 7.1,

MauiME with different resource configurations as described in Table 7.3. On the

front-node of each cluster, Globus Toolkit 3.2 and the module SLA-aware local RMS

service are installed. One separate machine is used to run the SLA workflow broker.

128 CHAPTER 7. SYSTEM IMPLEMENTATION

On this system, we have experimented with several state of the art workflows. The

common experiment scenario is presented in Figure 7.7 with a workflow is executed

over many RMSs.

Grid service

Client

SLA local

RMS

services

SLA local

RMS

services

SLA local

RMS

services

Grid service

Client

Grid service

Client

SLA Workflow

Broker

Figure 7.7: Experiment system deployment

ID Nodes Storage Expert

RMS1 7 200 2

RMS2 14 100 1

RMS3 9 300 2

Table 7.3: Resource configuration of RMSs

The online demonstration of the running process, which includes the mapping,

negotiation process, state monitoring as well as the execution, of a workflow consisting

of seven sub-jobs in cooperation of three local RMSs, can be found at http://pc-

kao3.upb.de:9035/manual/test.html. Figure 7.8 depicts the Web-based user interface

in the running process.

7.6 Summary

This chapter has presented an architect for definition and implementation of SLA-

aware workflows in Grid environments consisting of several building blocks. The

global architecture includes components for SLA definition and negotiation, task

7.6. SUMMARY 129

Figure 7.8: Initial web based client

mapping, monitoring, and fault reaction. The adaptive runtime control mechanism

is included in a full-operational engine within a framework for support of SLA-aware

Grid workflows. With adaptive runtime control mechanisms, Grid resources are used

more efficiently and the broker itself has more time to activate and use fault toler-

ance methods in case of failure. Performance measurements with the implemented

prototype and simulated workflows showed a performance increase of between 2% and

5%. All components are implemented using standard Grid and RMS components and

provided as online demonstration.

Chapter 8

Conclusions and future work

8.1 Conclusions

The problem stated in Chapter 3 has been solved: as shown in Chapter from 4

to 7, a basic system support SLA for scientific workflow has been developed. The

heart of the system is the mapping mechanism, which includes three sub-algorithms,

to map sub-jobs of the workflow to Grid resources in an efficient way. The SLA

negotiation protocol provides an SLA language and negotiation procedure to help

many components in the system to achieve common agreement in providing service.

The error recovery mechanism concentrates on the catastrophic scenario where one or

several RMSs are detached out of the system at a time. Finally, a prototype system,

which combines all theory mechanisms in a unified organization, provides primary

functions to execute scientific workflow within SLA context. The work in this thesis

contributes to the literature knowledge in several aspects as we have:

•Raised a new issue in Grid computing, supporting SLA for workflow on the Grid

environment. The workflow with several dependent parallel sub-jobs running

on reserved Grid resources in the scope of a business contract defines a new

problem which requires new techniques to be solved properly.

•Developed a new mapping mechanism to map sub-jobs of the workflow to Grid

resources. The mapping mechanism includes three effective sub-algorithms to

130

8.2. FUTURE WORK 131

cope with three circumstance of the mapping scenario. The L-Tabu algorithm

finds the low cost solution for workflows which have little data to be transferred

among sub-jobs. The H-Map finds the low cost solution for workflows having a

lot of data to be transferred among sub-jobs. The W-Tabu algorithm finds the

low makespan solution for workflows. The effectiveness of those algorithms was

proved by several simulation experiments.

•Proposed a new SLA negotiation protocol for workflow. The SLA language

clarifies the Service Level Objective and the workflow structure. The negoti-

ation procedure set off the role as well as the information to be negotiated of

participants in the system.

•Designed a new error recovery mechanism. The mechanism concentrates on

solving the case of unstated catastrophic scenario where one or several RMSs

are detached out of the system at a time.

•Demonstrated the first real system supporting SLA for workflow.

8.2 Future work

Future work can extend the results of this thesis in two main ways.

•The first is called deep extension approach, which adds more features to the ex-

isting skeleton system. For example, as running a workflow requires the complex

co-operation of many sites on the Grid, ensuring the truthful and safe commu-

nication among them is an important unconsidered issue. Another problem is

related to financial aspect of an SLA system. The relation between user and

provider in the system is business relation, which concerns with money pay-

ment. Thus, having a mechanism supporting this activity to happen fluently is

a considerable requirement.

•The second is called wide extension approach. As can be seen in the intro-

duction, supporting SLA for workflow in the scavenging Grid and data Grid

132 CHAPTER 8. CONCLUSIONS AND FUTURE WORK

are still open issues. With different resource characteristics compared to the

computation Grid, realizing the destination in scavenging Grid and data Grid

requires many unexploited adaptations and methods.

133

134 APPENDIX A. LIST OF ACRONYMS

Appendix A

List of Acronyms

AEFT Absolute Earliest Finished Time

ALFT Absolute Latest Finished Time

APN Arbitrary Processor Network

BNP Bounded Number of Processors

BPEL Business Process Execution Language

BSLA Bind Service Level Agreement

CCS Computer Center System

CP Critical Path

DAG Directed Acyclic Graph

DRM Distributed Resource Managers

EDA Estimation of Distribution Algorithm

EP Execution Plan

FJSSP Flexible Job Shop Scheduling Problem

FTP File Transfer Protocol

GA Genetic Algorithm

GB Giga Byte

GEMSS Grid-Enabled Medical Simulation Services

GGF Global Grid Forum

GLS Guided Local Search

GRASP Grid-based Application Service Provision

HPC4U High Predictable Clusters for Internet Grids

135

HPCC High Performance Computing Center

ICENI Imperial College e-Science Network Infrastructure

ILS Iterated Local Search

IP Integer Programming

JDML Job Description Markup Language

JSSP Job Shop Scheduling Problem

KB Kilobyte

LAM-MPI Local Area Multicomputer - Message Passing Interface

MB Mega Byte

Mbps Megabit per second

MCT Minimum Completion Time

OGSA Open Grid Services Architecture

OGSI Open Grid System Interconnection

PBS Portable Batch System

PDDL Planning Data Description Language

QoS Quality of Service

QoWL QoS-aware Grid Workflow Language

QWE QoS-aware Grid Workflow Engine

RAM Random Access Memory

RMI Remote Method Invocation

RMS Resource Management System

RSLA Resource Service Level Agreement

SA Simulated Annealing

SGE Sun Grid Engine

SLA Service Level Agreement

SLO Service Level Objective

SNAP Service Negotiation and Acquisition Protocol

SQL Structure Query Language

TCP/IP Transmission Control Protocol/Internet Protocol

TDB Task Duplication Based

136 APPENDIX A. LIST OF ACRONYMS

TSLA Task Service Level Agreement

WfMC Workflow Management Coalition

WSDL Web Service Definition Language

XML Extensible Markup Language

UDDI Universal Description, Discovery and Integration

UMDA Univariate Marginal Distribution Algorithm

UNC Unbounded Number of Cluster

VGE Vienna Grid Environment

VO Virtual Organization

VRM Virtual Resource Manager

Appendix B

SLA language for Grid-based

workflow specification

B.1 Common tag

B.1.1 Identification

It describes the specific ID string for each SLA workflow, SLA subjob, SLA data

transferring, etc.

Element Name : ID

Element Type : StringEquation

Example :

<StringValue>28-6-2004-PC2-PADERBORN-PHYSIC-THEORY</StringValue>

</StringEquation>

B.1.2 Title SLA

This tag describes the name of the SLA.

Element Name : TITLE SLA

Element Type : StringEquation

Example :

137

138APPENDIX B. SLA LANGUAGE FOR GRID-BASED WORKFLOW SPECIFICATION

<StringValue>SLA - Running CFD simulation job</StringValue>

</StringEquation>

B.1.3 Description

This tag describes details about SLA, computation task, etc.

Element Name : DESC

Element Type : StringEquation

Example :

<StringValue> Running CFD to find some information about the affection

of fluid to auto running condition </StringValue>

</StringEquation>

B.1.4 Amount

This tag describes the amount number.

Element Name : AMOUNT

Element Type : RealEquation

Example :

</RealEquation>

B.1.5 Entity

This tag describes the measurement entity.

Element Name : ENTITY

Element Type : StringEquation

Example :

B.2. GENERAL SLA DESCRIPTION 139

</StringEquation>

B.1.6 Cost SLA

This tag describes the cost.

Element Name : COST SLA

Element Type : SectionEquation

Required Element : Amount,Entity

Example :

</RealEquation>

</StringEquation>

</SectionEquation>

B.2 General SLA description

This tag contains a general SLA description.

Element Name : GENERAL SLA

Element Type : SectionEquation

Required Element : Title, Description, Start time, End time

: Provider, Consumer, Cost

B.2.1 Start time

This tag describes the starting time of an SLA. The time is described under format

YYYYMMDDHHMMSS.

140APPENDIX B. SLA LANGUAGE FOR GRID-BASED WORKFLOW SPECIFICATION

Element Name : STARTTIME SLA

Element Type : StringEquation

Example :

</StringEquation>

B.2.2 End time

This tag describes the ending time of an SLA. The time is described under format

YYYYMMDDHHMMSS.

Element Name : ENDTIME SLA

Element Type : StringEquation

Example :

</StringEquation>

B.2.3 Provider

This tag describes the name of the service provider.

Element Name : PROVIDER

Element Type : StringEquation

Example :

<StringValue>PC2 -Uni Paderborn</StringValue>

</StringEquation>

B.2.4 Consumer

This tag describes the name of the consumer.

B.3. SLA FOR DATA TRANSMISSION DESCRIPTION 141

Element Name : CONSUMMER SLA

Element Type : StringEquation

Example :

<StringValue>Physic department - Uni Paderborn</StringValue>

</StringEquation>

B.3 SLA for data transmission description

This tag contains a SLA description for data transmission.

Element Name : GENERAL SLA TRAN

Element Type : SectionEquation

Required Element : Title, Description, Start time, End time

: Provider, Consumer, Data, Cost

B.3.1 Data

This tag describes the amount of data to be transferred in SLA data transmission.

Element Name : DATA

Element Type : SectionEquation

Example :

</RealEquation>

</StringEquation>

</SectionEquation>

142APPENDIX B. SLA LANGUAGE FOR GRID-BASED WORKFLOW SPECIFICATION

B.4 Computing task description

This tag contains the computation task description of a sub-job.

Element Name : COMPUTE TASK

Element Type : SectionEquation

Required Element : Description, Software request, Resource request, Job description

Example :

B.4.1 Software request

This tag describes the software requirement.

Element Name : SOFTWARE REQUEST

Element Type : SectionEquation

Required Element : os, database, message lib, other

Example :

<StringValue>Linux</StringValue>

</StringGreaterThanOrEquals>

<StringValue>MySQL</StringValue>

</StringGreaterThanOrEquals>

</StringGreaterThanOrEquals>

</SectionEquation>

B.4.2 Resource request

This tag describe the resource requirement.

B.4. COMPUTING TASK DESCRIPTION 143

Element Name : RESOURCE REQUEST

Element Type : SectionEquation

Required Element : Number of cpu, CPU speed, Architecture, Memory, Storage, Expert

Example :

Number of cpu

</IntegerEquation>

CPU speed

SectionEquation attribute="CPUspeed">

</RealEquation>

</StringEquation>

</SectionEquation>

Architecture

</StringEquation>

Memory

</RealEquation>

144APPENDIX B. SLA LANGUAGE FOR GRID-BASED WORKFLOW SPECIFICATION

</StringEquation>

</SectionEquation>

Storage

</RealEquation>

</StringEquation>

</SectionEquation>

Expert

<StringValue>Person</StringValue>

</StringEquation>

</SectionEquation>

B.4.3 Job description

This tag contains the computing job description.

Element Name : TASK DESCRIPTION

Element Type : SectionEquation

Required Element : Worker, Arguments, Executable, StdinFile, StdoutFile

: StdlogFile, CheckpointDir

B.4. COMPUTING TASK DESCRIPTION 145

Worker

This tag describes which worker will execute the job, for example MPI, PVM, shell.

Element Name : Worker

Element Type : StringEquation

Arguments

This tag describes the arguments which need to be passed to the executable file when

it is started.

Element Name : Arguments

Element Type : StringListEquation

Example :

</StringListValue>

</StringListEquation>

Executable

This tag describes the executable file name.

Element Name : Executable

Element Type : StringEquation

StdinFile

This tag describes the Stdin file name.

Element Name : StdinFile

Element Type : StringEquation

146APPENDIX B. SLA LANGUAGE FOR GRID-BASED WORKFLOW SPECIFICATION

StdoutFile

This tag describes the Stdout file name.

Element Name : StdoutFile

Element Type : StringEquation

StdlogFile

This tag describes the Stdlog file name.

Element Name : StdlogFile

Element Type : StringEquation

CheckpointDir

This tag describes the checkpoint directory.

Element Name : CheckpointDir

Element Type : StringEquation

B.5 SLO description

This tag compounds information about SLOs.

Element Name : SLO SLA

Element Type : SectionEquation

Required Element : Condition, Reason, Respond site, Punish, Action

: Monitor

B.5.1 Condition

This tag describes the problem, which can happen within the SLA valid period.

Element Name : SLO CONDITION

Element Type : StringEquation

Example :

<StringValue>Runtime exceed</StringValue>

</StringEquation>

B.5. SLO DESCRIPTION 147

B.5.2 Reason

This tag describes the reason of the problem.

Element Name : SLO REASON

Element Type : StringEquation

Example :

<StringValue>System failure</StringValue>

</StringEquation>

B.5.3 Respond site

This tag describes the site, which responds for the problem.

Element Name : SLO RESPON SITE

Element Type : StringEquation

Example :

<StringValue>Provider</StringValue>

</StringEquation>

B.5.4 Punish

This tag describes the punishing activity to the responsible site.

Element Name : SLO PUNISH

Element Type : StringEquation

Example :

</RealEquation>

148APPENDIX B. SLA LANGUAGE FOR GRID-BASED WORKFLOW SPECIFICATION

</StringEquation>

</SectionEquation>

B.5.5 Action

This tag describes the activity when problem happens.

Element Name : SLO ACTION

Element Type : StringEquation

Example :

<StringValue>Terminate</StringValue>

</StringEquation>

B.5.6 Monitor

This tag describes the information that user wants to know during the process of

handling SLA.

Element Name : SLO MONITOR

Element Type : SectionEquation

Example :

<StringValue>Runtime</StringValue>

</StringEquation>

<StringValue>Each 5 minuten</StringValue>

</StringEquation>

</SectionEquation>

B.6. DATA TRANSMISSION DESCRIPTION 149

B.6 Data transmission description

This tag contains description about data transmission.

Element Name : DATA TRANSMISSION

Element Type : SectionEquation

Required Element : File list, gFTP

B.6.1 File list

This tag describes the list of file to be transferred.

Element Name : FILE NAMES

Element Type : StringListEquation

Example :

</StringListValue>

</StringListEquation>

B.6.2 gFTP

This tag describes the FTP server address and the path to the directory, which stores

the file.

Element Name : gFTP

Element Type : SectionEquation

Example :

<StringValue>gsiftp://server2.icenigrid.org</StringValue>

</StringEquation>

150APPENDIX B. SLA LANGUAGE FOR GRID-BASED WORKFLOW SPECIFICATION

</StringEquation>

</SectionEquation>

B.7 SLA workflow description

Element Name : SLA JOBFLOW

Element Type : SectionEquation

Required Element : Identification, General SLA, SLA sub-jobs,

: SLA data transmissions, SLOs, Signature

B.7.1 SLA sub-job

Element Name : SLA SUBJOB

Element Type : SectionEquation

Required Element : Identification, SLA data transmissions,General SLA

: job description, SLA data transmissions, SLOs, Signature

B.7.2 SLA data transmissions

Element Name : DATA TRAN SLA

Element Type : SectionEquation

Required Element : Identification, General data transmission SLA,

: data transmissions description, Signature

B.7.3 Signature

This tag describes the signature of provider and customer.

Element Name : Signature

Element Type : SectionEquation

Example :

B.7. SLA WORKFLOW DESCRIPTION 151

<StringValue>oieshfio23874sodr</StringValue>

</StringEquation>

<StringValue>cxgdf4577xkldflee</StringValue>

</StringEquation>

</SectionEquation>

Bibliography

[1] T. L. Adam, K. M. Chandy, and J. R. Dickson, ”A comparison of list scheduling

for parallel processing systems”, Communication ACM, v. 17 n. 12, pp. 685–690,

1974.

[2] A. Afzal, A. Mayer and L. Young, ”Predictable workow deployment services”,

Proceedings of Grid Services Workshop, GGF11. IEEE, June 2004.

[3] I. Ahmad , Y. K. Kwok, ”On Exploiting Task Duplication in Parallel Program

Scheduling”, IEEE Transactions on Parallel and Distributed Systems, v.9 n.9, pp.

872–892, September 1998.

[4] R. Al-Ali, O. Rana, D. Walker, S. Jha, and S. Sohail, ”G-QoSM: Grid Service Dis-

covery using QoS Properties”, Computing and Informatics Journal, Special Issue

on Grid Computing, v. 21 n. 4, pp. 363–382, 2002.

[5] I. Altintas, C. Berkley, E. Jaeger, M. Jones, B. Ludaescher, and S. Mock, ”Ke-

pler: Towards a Grid-Enabled System for Scientific Workflows”, Proceedings of the

Workflow in Grid Systems Workshop in GGF10, Berlin, Germany, March, 2004.

[6] A. Andrieux, ”Web services agreement specification (wsagreement)”, Global Grid

Forum, http://www.ggf.org, 2003.

[7] J. Baxter, and J. H. Patel, ”The LAST algorithm: A heuristic-based static task

allocation algorithm”, Proceedings of the International Conference on Parallel Pro-

cessing (ICPP ’89, Aug.), Pennsylvania State University, University Park, PA, pp.

217–222, 1989.

152

BIBLIOGRAPHY 153

[8] S. Benkner, I. Brandic, G. Engelbrecht and R. Schmidt, VGE - A Service-

Oriented Grid Environment for On-Demand Supercomputing, Proceedings of the

Fifth IEEE/ACM InternationalWorkshop on Grid Computing (Grid 2004), Pitts-

burgh, PA, USA, November 2004.

[9] S. Benkner, G. Berti, G. Engelbrecht, J. Fingberg, G. Kohring, S. E. Middleton

and R. Schmidt, GEMSS: Gridinfrastructure for Medical Service Provision, Meth-

ods of Information in Medicine 2005; 44: 177-181, Schattauer Publishers, 2005.

[10] F. Berman, H. Casanova, A. Chien, K. Cooper , H. Dail, A. Dasgupta, W. Deng,

J. Dongarra, L. Johnsson, K. Kennedy, C. Koelbel, B. Liu, X. Liu, A. Mandal,

G. Marin, M. Mazina, J. Mellor-Crummey, C. Mendes, A. Olugbile, M. Patel ,

D. Reed, Z. Shi, O. Sievert, H. Xia and A. YarKhan, ”New Grid Scheduling and

Rescheduling Methods in the GrADS Project”, International Journal of Parallel

Programming, v. 33, pp. 209–229, 2005.

[11] G. B. Berriman, J. C. Good, A. C. Laity, ”Montage: a Grid Enabled Image

Mosaic Service for the National Virtual Observatory” , ADASS, v. 13, 2003.

[12] P. E. Black, ”Algorithms and Theory of Computation Handbook”, CRC Press

LLC, 1999.

[13] J. Blythe, J. Jain, E. Deelman, Y. Vahi, A. Mandal, K. Kennedy, ”Task Schedul-

ing Strategies for Workflow-based Applications in Grids”, Proceedings of the IEEE

International Symposium on Cluster Computing and the Grid (CCGrid 2005),

IEEE Press, 2005.

[14] A. Blum and M. Furst, ”Fast Planning Through Planning Graph Analysis”,

Artificial Intelligence, v 90, pp. 281–300, 1997.

[15] BPEL Specification, http://www-106.ibm.com/developerworks/webservices/

library/ws-bpel/ 2003.

[16] I. Brandic and S. Benkner and G. Engelbrecht and R. Schmidt, QoS Support for

Time-Critical Grid Workflow Applications, Proceedings of e-Science 2005.

154 BIBLIOGRAPHY

[17] P. Brucker, ”Scheduling Algorithm”, Third edition, Springer Verlag, 2001.

[18] L. Burchard, M. Hovestadt, O. Kao, A. Keller, and B. Linnert, ”The Virtual

Resource Manager: An Architecture for SLA-aware Resource Management”, Pro-

ceedings of the IEEE CCGrid 2004, IEEE Press, pp. 126–133, 2004.

[19] J. Cao, S. A. Jarvis, S. Saini, and G. R. Nudd, ”GridFlow: Workflow Man-

agement for Grid Computing”, Proceedings of 3rd International Symposium on

Cluster Computing and the Grid, Tokyo, Japan, May 12-15, pp. 198, 2003.

[20] J. Cao, D. P. Spooner, S. A. Jarvis, and G. R. Nudd, ”Grid Load Balancing

Using Intelligent Agents”, Future Generation Computer Systems special issue on

Intelligent Grid Environments: Principles and Applications, v. 21 n. 1, pp. 135–149,

2005.

[21] H. Casanova, A. Legrand, D. Zagorodnov, and F. Berman, ”Heuristics for

Scheduling Parameter Sweep applications in Grid environments”, Proceedings of

the 9th HeterogeneousComputing workshop (HCW’2000), 2000.

[22] H. Chen, B. Shirazi, and J. Marquis, ”Performance evaluation of a novel schedul-

ing method: Linear clustering with task duplication”, Proceedings of the 2nd In-

ternational Conference on Parallel and Distributed Systerns, pp. 270–275, 1993.

[23] Y.-C. Chung , S. Ranka, ”Applications and performance analysis of a compile-

time optimization approach for list scheduling algorithms on distributed memory

multiprocessors”, Proceedings of the 1992 ACM/IEEE conference on Supercom-

puting, pp. 512–521, November 16-20, 1992.

[24] E. G. Coffman, ”Computer and Job-Shop Scheduling Theory”, John Wiley and

Sons, Inc., New York, NY, 1976.

[25] J. Y. Colin, and P. Chretienne, ”Scheduling with small computation delays and

task duplication”. Operation Research, v. 39 n. 4, pp. 680-684, 1991.

[26] K. Czajkowski and I. Foster and C. Kesselman and V. Sander and S. Tuecke,

”SNAP: A Protocol for Negotiating Service Level Agreements and Coordinating

BIBLIOGRAPHY 155

Resource Management in Distributed Systems”, Proceedings of the 8th Workshop

on Job Scheduling Strategies for Parallel Processing, 2002.

[27] K. Czajkowski, A. Dan, J. Rofrano, S. Tuecke, and M. Xu. ”Agreement- based

Grid service management (WS-Agreement)”, Global Grid Forum, 2003.

[28] K. Czajkowski, D. Ferguson, I. Foster, J. Frey, S. Graham, T.

Maguire, D. Snelling, S. Tuecke, ”From Open Grid Services Infrastruc-

ture to WS-Resource Framework: Refactoring & Evolution”, http://www-

106.ibm.com/developerworks/library/ws-resource/ogsi to wsrf 1.0.pdf , March 5,

2004.

[29] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M. Su, K.

Vahi, M. Livny, ”Mapping Abstract Complex Workflows onto Grid Environments”,

Journal of Grid Computing, v. 1 n. 1, pp. 25–39, 2003.

[30] E. Deelman, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, S. Patil, M. Su, K.

Vahi, and M. Livny, ”Pegasus : Mapping Scientific Workflows onto the Grid”, Pro-

ceedings of the 2nd European Across Grids Conference, Nicosia, Cyprus, January

28-30, 2004.

[31] M. Dorigo and G. Di Caro, The Ant Colony Optimization Meta-Heuristic, New

Ideas in Optimization, McGraw-Hill, pp.11-32, 1999.

[32] K.A. Dowsland, ”Simulated Annealing”, in C.R. Reeves (ed.) ModernHeuristic

Techniques for Combinatorial Problems , John Wiley & Sons, 1993.

[33] e-Science Workflow Services Workshop. http://www.nesc.ac.uk/action/esi/ con-

tribution.cfm?Title=303, December 2003.

[34] C. A. Ellis, ”Information Control Nets: A Mathematical Model of Office Infor-

mation Flow”, Proceedings of the Conference on Simulation, Measurement and

Modeling of Computer Systems, 1979.

[35] A.K. Elmagarmid, ”Database Transaction Models for Advanced Applications”,

Morgan Kaufmann, 1992.

156 BIBLIOGRAPHY

[36] D. W. Erwin, and D. F. Snelling, ”UNICORE: A Grid Computing Environment”,

Lecture Notes in Computer Science, v. 2150: pp. 825–834, 2001.

[37] I. Foster, C. Kesselman, C. Lee, R. Lindell, K. Nahrstedt, and A. Roy, ”A

distributed resource management architecture that supports advance reservation

and coallocation”, Proceedings of the International Workshop on Quality of Service,

pp. 27–36, 1999.

[38] I. Foster, ”What is the Grid? A Three Point Checklist”, Daily News and Infor-

mation for the Global Grid Community, July 22, 2002, [Last accessed May 31st,

2004], http://www.gridtoday.com/02/0722/100136.html

[39] I. Foster, C. Kesselman, S. Tuecke. ”The Anatomy of the Grid: Enabling Scalable

Virtual Organizations”, International Journal of Supercomputer Applications, v.

15 n. 3, 2001.

[40] I. Foster, C. Kesselman, J. Nick, S. Tuecke, ”The Physiology of the Grid: An

Open Grid Services Architecture for Distributed Systems Integration”, Open Grid

Service Infrastructure WG, Global Grid Forum, June 22, 2002.

[41] M. Fox, D. Long, ”PDDL2.1: An extension of PDDL for expressing temporal

planning domains”, Journal of AI Research, v. 20, pp. 61–124, 2003.

[42] P. Garbacki, B. Biskupski, and H. Bal, ”Transparent Fault Tolerance for Grid

Application”, Proceedings of the European Grid Conference,(EGC 2005), LNCS

3470, pp. 671–680, 2005.

[43] M. R. Gary and D. S. Johnson, ”Computers and Intractability: A Guide to the

theory of NP-Completeness”, W. H. Freeman and Co, 1979.

[44] The GEMSS Project: Grid-Enabled Medical Simulation Services, EU IST

Project, IST-2001-37153, http://www.gemss.de/

[45] I.P. Gent, E. Macintyre, P. Prosser, B.M. Smith, T. Walsh, ”An empirical study

of dynamic variable ordering heuristics for the constraint satisfaction problem”,

BIBLIOGRAPHY 157

Proceedings of Principles and Practices of Constraint Programming - CP96, LNCS

1118.

[46] D. Georgakopoulos, M. Hornick, and A. Sheth, ”An Overview of Workflow Man-

agement: From Process Modeling to Workflow Automation Infrastructure”, Dis-

tributed and Parallel Databases, v. 3 n. 2, April 1995.

[47] A. Gerasoulis, and T. Yang, ”A comparison of clustering heuristics for scheduling

DAG’s on multiprocessors”, J. Parallel Distributed Computing, v. 16, pp. 276–291,

1992.

[48] ”Globus Toolkit 4.0 Documentation Overview”,

http://www.globus.org/toolkit/docs/4.0/doc overview.html

[49] F. Glover, ”Tabu search Part I”, ORSA Journal on Computing, pp. 190–206,

1989.

[50] F. Glover, ”Tabu search Part II”. ORSAJournal on Computing, v.2, pp. 4–32,

1990.

[51] A. Grimshaw, ”WHAT IS A GRID?”, Grid today / DECEMBER 9, v. 1 n. 26,

2002.

[52] J.K. Hao, J. Pannier, ”Simulated annealing and Tabu search for constraint solv-

ing”, Artificial Intelligence and Mathematics IV, 1998

[53] P. Hansen, N. Mladenovic, ”Variable Neighborhood Search: Principles and Ap-

plications”, European Journal of Operational Research, v.130, 2001.

[54] D. Harel and A. Naamad, The STATEMATE Semantics of Statecharts, ACM

Transactions on Software Engineering and Methodology, 5(4):293.333, 1996

[55] F. Heine, M. Hovestadt, O. Kao, A. Keller, ”Provision of Fault Tolerance with

Grid-enabled and SLA-aware Resource Management Systems”, Proceedings of the

Parallel Computing Conference 2005, 2005.

158 BIBLIOGRAPHY

[56] F. Heine, M. Hovestadt, O. Kao, A. Keller, ”SLA-aware Job Migration in Grid

Environments”, in: Grandinetti, (Eds.), Grid Computing: New Frontiers of High

Performance Computing, Elsevier press, 2005.

[57] J. Hoffmann, ”Extending FF to Numerical State Variables”, Proceedings of the

15th European Conference on Artificial Intelligence, 2002.

[58] M. Hovestadt, ”Scheduling in HPC Resource Management Systems:Queuing vs.

Planning”, Proceedings of the 9th Workshop on JSSPP at GGF8, LNCS, pp. 1–20,

2003.

[59] E. S. H. Hou , N. Ansari , H. Ren, ”A Genetic Algorithm for Multiprocessor

Scheduling”, IEEE Transactions on Parallel and Distributed Systems, v.5 n.2, pp.

113–120, February 1994.

[60] M. Hsu (ed.), ”Special Issue on Workflow and Extended Transaction Systems”,

IEEE Data Engineering, v. 16 n. 2, June 1993.

[61] S. Hwang, C. Kesselman, ”GridWorkflow: A flexible failure handling framework

for the Grid”, Proceedings of the 12th IEEE International Symposium on High

Performance Distributed Computing, IEEE Press, pp. 126–131, 2003.

[62] D. Jackson, Q. Snell and M. Clement, ”Core Algorithms of the Maui Sched-

uler”, Proceedings of the 7th Workshop on Job Scheduling Strategies for Parallel

Processing, LNCS 2221, pp. 87–102, 2001.

[63] B. Jacob, ”Grid computing: What are the key components?”, http://www-

128.ibm.com/developerworks/grid/library/gr-overview/

[64] K. Jansen K, M. Mastrolilli, R. Solis-Oba, ”Approximation Algorithms for Flex-

ible Job Shop Problems”, Proceedings of Latin American Theoretical Informatics

(LATIN’2000), LNCS 1776, pp. 68–77, 2000.

[65] I. Kacem, S. Hammadi, P. Borne, ”Pareto-optimality Approach for Flexible Job-

shop Scheduling Problems: Hybridization of Evolutionary Algorithms and Fuzzy

Logic”, Journal of Mathematics and Computers in Simulation, Elsevier, 2002.

BIBLIOGRAPHY 159

[66] I. Kacem, S. Hammadi, P. Borne, ”Approach by Localization and Multi-objective

Evolutionary Optimization for Flexible Job-Shop Scheduling Problems”, IEEE

Transactions on Systems, Man, and Cybernetics. Part C, v. 32 n. 1, pp. 1-13,

2002.

[67] H. Karloff. Linear Programming. Birkhauser, 1991.

[68] Keahey and K. Motawi, ”The Taming of the Grid: Virtual Application Service”,

Argonne National Laboratory Technical Memorandum, No. 262, 2003.

[69] S. J. Kim, and J. C. Browne, ”A general approach to mapping of parallel compu-

tation upon multiprocessor architectures”, Proceedings of International Conference

on Parallel Processing, pp. 1–8, 1988.

[70] S. Kirkpatrick, C. Gelatt, and M. Vecchi, ”Optimization by Simulated Anneal-

ing”, Science, v. 220, pp. 671–680, 1983.

[71] W. H. Kohler and K. Steiglitz, ”Characterization and Theoretical Comparison

of Brand-and-Bound Algorithms for Permutation Problems”, Journal of ACM, v.

21 n. 1, pp. 140–156, 1974.

[72] B. Kruatrachue, and T. G. Lewis, ”Duplication Scheduling Heuristics (DSH):

A New Precedence Task Scheduler for Parallel Processor Systems”, Oregon State

University, Corvallis, OR, 1987.

[73] B. Kruatrachue , T. Lewis, ”Grain Size Determination for Parallel Processing”,

IEEE Software, v.5 n.1, pp. 23–32, January 1988.

[74] V. Kumar, ”Algorithms for constraint-satisfaction problems: a survey”, AI Mag-

azine, v.13 n.1, pp. 32–44, Spring 1992.

[75] Y. K. Kwok, I. Ahmad, ”Bubble scheduling: A quasi dynamic algorithm for

static allocation of tasks to parallel architectures”, Proceedings of the 7th IEEE

Symposium on Parallel and Distributeed Processing, pp. 36, October 25-28, 1995.

160 BIBLIOGRAPHY

[76] Y. K. Kwok , I. Ahmad, ”Dynamic Critical-Path Scheduling: An Effective Tech-

nique for Allocating Task Graphs to Multiprocessors”, IEEE Transactions on Par-

allel and Distributed Systems, v.7 n.5, pp. 506–521, 1996.

[77] Y. K. Kwok and I. Ahmad, ”Static scheduling algorithms for allocating directed

task graphs to multiprocessors”, ACM Computing Surveys (CSUR), v. 31 n. 4, pp.

406–471, 1999.

[78] G. Laszewski, K. Amin, M. Hategan, N. J. Zaluzec, S. Hampton, and A. Rossi,

”GridAnt: A Client-Controllable Grid Workflow System”, Proceedings of the 37th

Hawaii International Conference on System Science, Island of Hawaii, Big Island,

Jan. 5-8, 2004.

[79] M. Litzkow and M. Mutka, ”Condor a hunter of idle workstations”, Proceedings

of the 8th International Conference on Distributed Computing Systems (ICDCS

88), pp. 104–111, 1988.

[80] M. Lorch, and D. Kafura, ”Symphony - A Java-based Composition and Manip-

ulation Framework for Computational Grids”, Proceedings of the 2nd IEEE/ACM

International Symposium on Cluster Computing and the Grid (CCGrid 2002),

Berlin, Germany, May 21-24, 2002.

[81] R. Lovas, G. Dzsa, P. Kacsuk, N. Podhorszki, D. Drtos, ”Workflow Support for

Complex Grid Applications: Integrated and Portal Solutions”, Proceedings of 2nd

European Across Grids Conference, Nicosia, Cyprus, 2004.

[82] S. Ludtke, P. Baldwin, and W. Chiu, ”EMAN: Semiautomated Software for

High-Resolution Single-Particle Reconstructio” , Journal of Structure Biology, v.

128, 1999.

[83] T. Ma and R. Buyya, ”Critical-Path and Priority based Algorithms for Schedul-

ing Workflows with Parameter Sweep Tasks on Global Grids”, Proc. 17th Interna-

tional Symposium on Computer Architecture and High Performance Computing

(SBAC-PAD 2005), IEEE CS Press, 2005.

BIBLIOGRAPHY 161

[84] M. Mastrolilli, L.M. Gambardella, ”Effective Neighborhood Functions for the

Flexible Job Shop Problem”, Journal of Scheduling, v. 3 n. 1, pp. 3–20, 2000.

[85] S. McGough, ”A common job description markup language written in xml”,

Global Grid Forum, http://www.ggf.org, 2003.

[86] S. McGough, A. Afzal, J. Darlington, N. Furmento, A. Mayer, and L. Young

,Making the Grid Predictable through Reservations and Performance Modelling,

The Computer Journal, v.48 n.3, pp. 358–368, 2005.

[87] A. Morajko, E. Fernandez, ”Workfow management in the crossgrid project”,

Proceedings of the European Grid Conference,(EGC 2005), pp. 424–433. LNCS.

Springer Verlag, February 2005.

[88] N. Mehdiratta, and K. Ghose, ”A bottom-up approach to task scheduling on

distributed memory multiprocessor”, Proceedings of the 1994 International Con-

ference on Parallel Processing, CRC Press, Inc, pp. 151–154, 1994.

[89] L. Nassif, J. M. Nogueira, M. Ahmed, R. Impey, A. Karmouch, ”Agent-based

Negotiation for Resource Allocation in Grid”, Proceedings of the 3rd Workshop on

computational Grids and applications, Summer program LNCC, 2005.

[90] E. Nowicki, C. Smutnicki, ”A fast taboo search algorithm for the job shop prob-

lem”, Management Science, v. 42, pp. 797–813, 1996.

[91] E. Nowicki, C. Smutnicki, ”An advanced taboo search algorithm for the job shop

problem”, Journal of Scheduling, pp. 145–159, 2005.

[92] C. H. Papadimitriou , M. Yannakakis, ”Towards an architecture-independent

analysis of parallel algorithms”, SIAM Journal on Computing, v.19 n.2, pp. 322–

328, April 1990.

[93] ”OpenPBS Technical Overview”, http://www.openpbs.org/overview.html.

[94] M. Pelikan, D. Goldberg, and E. Cant-Paz, BOA: The Bayesian optimization

algorithm, Proceedings of the Genetic and Evolutionary Computation Conference

GECCO-99 (Orlando, Fla.), pp. 525-532, 1999.

162 BIBLIOGRAPHY

[95] M. Pelikan, D. Goldberg and F. Lobo, A survey of optimization by building and

using probabilistic models, Tech. Rep. No. 99018, IlliGAL, University of Illinois,

1999.

[96] M. Pelikan, and H. Mhlenbein, The bivariate marginal distribution algorithm,

Advances in Soft Computing-Engineering Design and Manufacturing, pp. 521-535,

1999.

[97] D.M. Quan, O. Kao, ”On Architecture for an SLA-aware Job Flows in Grid

Environments”, Journal of Interconnection Networks, World scientific computing,

pp. 245–264, 2005.

[98] D.M. Quan, O. Kao, ”SLA negotiation protocol for Grid-based workflows”, Pro-

ceedings of the International Conference on High Performance Computing and

Communications (HPPC-05), LNCS 3726, pp. 505–510, 2005.

[99] D.M. Quan, O. Kao, ”On Architecture for an SLA-aware Job Flows in Grid En-

vironments”, Proceedings of the 19th IEEE International Conference on Advanced

Information Networking and Applications (AINA 2005) , IEEE Press , pp. 287–292,

2005.

[100] D.M. Quan, O. Kao, ”Mapping Grid job flows to Grid resources within SLA

context”, Proceedings of the European Grid Conference,(EGC 2005), LNCS 3470,

pp. 1107–1116, 2005.

[101] D.M. Quan, ”Error recovery mechanism for Grid-based workflow within SLA

context”, To be published by International Journal of High Performance Computing

and Networking (IJHPCN), 2006.

[102] D.M. Quan, ”Mapping heavy communication Workflows onto Grid Resources

within SLA context”, To be published in proceedings of the International Confer-

ence of High Performance Computing and Communication (HPCC06), 2006.

[103] D.M. Quan, ”Adaptive SLA execution engine for Grid-based workflow within

SLA context”, Submitted to Middleware 2006 Conference, 2006.

BIBLIOGRAPHY 163

[104] H. E. Rewini , T. G. Lewis, ”Scheduling parallel program tasks onto arbitrary

target machines”, Journal of Parallel and Distributed Computing, v.9 n.2, pp.

138–153, June 1990.

[105] L. Richter, ”Workflow Support for Complex Grid Applications: Integrated and

Portal Solutions”, Proceedings of the 2nd European Across Grids Conference, 2004.

[106] N. Sadeh and M.S. Fox, ”Variable and Value Ordering Heuristics for the Job

Shop Constraint Satisfaction Problem”, Artificial Intelligence, v. 86, pp. 1–41, 1996.

[107] V. Sarkar, ”Partitioning and Scheduling Parallel Programs for Multiprocessors”,

MIT Press, Cambridge, MA, 1989.

[108] A. Shahid , S. T. Muhammed , M. Sadiq , ”GSA: scheduling and allocation using

genetic algorithm”, Proceedings of the conference on European design automation,

pp. 84–89, September 19-23, 1994.

[109] M. Shields, and I. Taylor, ”Programming Scientific and Distributed Workflow

with Triana Services”, Proceedings of the Workflow in Grid Systems Workshop in

GGF10, Berlin, Germany, March, 2004.

[110] G. C. Sih , E. A. Lee, ”A Compile-Time Scheduling Heuristic for

Interconnection-Constrained Heterogeneous Processor Architectures”, IEEE Trans-

actions on Parallel and Distributed Systems, v.4 n.2, pp. 175–187, 1993.

[111] G. C. Sih , E. A. Lee, ”Declustering: A New Multiprocessor Scheduling Tech-

nique”, IEEE Transactions on Parallel and Distributed Systems, v.4 n.6, pp. 625–

637, June 1993.

[112] M. P. Singh and M. A. Vouk , ”Scientific Work-

flows: Scientific Computing Meets Transactional Workflows”,

http://www.csc.ncsu.edu/faculty/mpsingh/papers/databases/workflows/sciworkflows.html

[113] ”What is Service Level Agreement?”, http://www.webopedia.com/TERM/S/Service Level Agreement.html

164 BIBLIOGRAPHY

[114] D. P. Spooner, S. A. Jarvis, J. Cao, S. Saini and G. R. Nudd, ”Local Grid

Scheduling Techniques Using Performance Prediction”, IEEE Proceedings - Com-

puters and Digital Techniques, 150(2), pp. 87–96, 2003.

[115] Na. Stone, D. Simmel, and T. Kielmann, ”GWD-I: An architec-

ture for grid checkpoint recovery services and a GridCPR API”, Grid

Checkpoint Recovery Working Group Draft 3.0, Global Grid Forum,

http://gridcpr.psc.edu/GGF/docs/draft-ggf-gridcpr-Architecture-2.0.pdf. 2004.

[116] T. Stuetzle, Iterated local search for the quadratic assignment problem, Tech.

rep. aida-99-03, FG Intellektik, TU Darmstadt, 1999.

[117] T. Stuetzle, Local Search Algorithms for Combinatorial Problems—Analysis,

Algorithms and New Applications, DISKI—Dissertationen zur Knstliken Intelli-

genz. infix, Sankt Augustin, Germany, 1999.

[118] S. Tuecke, K. Czajkowski, I. Foster, J. Frey, S. Graham, C. Kesselman, T.

Maguire, T. Sandholm, P. Vanderbilt, D. Snelling, ”Open Grid Services Infrastruc-

ture (OGSI) Version 1.0”, Global Grid Forum Draft Recommendation, 6/27/2003.

[119] C. Voudouris, Guided local search for combinatorial optimization problems,

PhD dissertation, Department of Computer Science, University of Essex, 1997.

[120] C. Voudouris, and E. Tsang, Guided local search, Europe Journal of Operation

Research, v. 113 n.2, pp. 469–499, 1999.

[121] R. Wolski, ”Experiences with Predicting Resource Performance On-line in Com-

putational Grid Settings”, ACM SIGMETRICS Performance Evaluation Review,

v. 30 n. 4, pp. 41–49, 2003.

[122] R. Wolski, N. Spring and J. Hayes, ”The Network Weather Service: A Dis-

tributed Resource Performance Forecasting Service for Metacomputing”, Journal

of Future Generation Computer System, v. 15 n. 5-6, pp. 757–768, 1999.

[123] http://www.webopedia.com/TERM/S/Service Level Agreement.html

BIBLIOGRAPHY 165

[124] S. Wesner, B. Serhan, ”Overview of an architecture enabling grid based appli-

cation service provision”, Proceedings of the European Across Grids Conference

2004, pp. 113–118, 2004.

[125] L. Fischer, ”Workflow Handbook 2004 ”, Future Strategies Inc., Lighthouse

Point, FL, USA, 2004.

[126] M. Y. Wu , D. D. Gajski, ”Hypertool: A Programming Aid for Message-Passing

Systems”, IEEE Transactions on Parallel and Distributed Systems, v.1 n.3, pp.

330–343, 1990.

[127] T. Yang , A. Gerasoulis, ”List scheduling with and without communication

delays”, Parallel Computing, v.19 n.12, pp. 1321–1344, 1993.

[128] L. Young, S. McGough, S. Newhouse and J. Darlington, Scheduling architecture

and algorithms within ICENI, Proceeding of UK e-Science All Hands Meeting,

EPSRC, 2003

[129] J. Yu, R. Buyya, ”A taxonomy of scientific workflow systems for grid comput-

ing”, SIGMOD Record 34(3), pp. 44–49, 2005.

[130] L. Zeng, B. Benatallah, A. Ngu, M. Dumas, J. Kalagnanam, H. Chang, QoS-

Aware Middleware for Web Services Composition, IEEE Transactions on Software

Engineering, v.30 n.5, pp. 311–327, may 2004.