Document [original]

Real-Time Multitasking in Embedded Systems

Based on Reconfigurable Hardware

Dissertation

A thesis submitted to the

Faculty of Computer Science, Electrical Engineering and Mathematics

of the

University of Paderborn

in partial fulfillment of the requirements for the

degree of Dr. rer. nat.

Klaus Danne

Paderborn, Germany

date of submission: 11.9.2006

Supervisors:

Prof. Dr. Franz J. Rammig

Prof. Dr. Marco Platzner

Reviewers:

Prof. Dr. Franz J. Rammig

Prof. Dr. Marco Platzner

Prof. Dr. Daniel D. Gajski

Additional members of the oral examination committee:

Prof. Dr. Ulrich R¨

uckert

Dr. Ulf Lorenz

Date of submission:

September 11, 2006

Date of public examination:

November 22, 2006

Acknowledgements

This PhD thesis was carried out during my work at the University of Paderborn in

Germany as a member of the Design of Parallel Systems group under Prof. Dr. Franz. J.

Rammig, from July 2002 to June 2005, and as a member of the Computer Engineering

group under Prof. Dr. Marco Platzner, from July 2005 to November 2006. The work was

partially supported by the German Research Foundation (Deutsche Forschungsgemein-

schaft) in the research training group 776 Automatic Configuration in Open Systems.

I would like to thank Prof. Rammig for supervising my thesis, supporting my work

as member of his group, and sharing in his great knowledge concerning embedded sys-

tems and computer science research in general. Likewise, I thank my co-supervisor

Prof. Platzner. He employed me as a researcher just shortly after he became professor

at Paderborn University and supported a great part of my publications as co-author.

In addition, my thanks go to Prof. Dr. Daniel D. Gajski, who was a third reviewer as

well as Prof. Dr. Ulrich R¨

uckert and Dr. Ulf Lorenz, who were additional members of

the oral examination commission.

Further, without mentioning everyone personally, I would like to thank many col-

leagues that helped me during my time at Paderborn University:

Prof. Dr. Christophe Bobda as a coauthor of several publications and for proofreading

my thesis; my teammates Dr. Stefan Ihmor and Florian Dittmann for constructive and

interesting discussions; Tales Heimfahrt and Marcelo G¨

otz for teaching me the Brazilian

thinking and fixing my Linux PC; my new teammates Paul Kaufmann, Enno L¨

ubbers

and Heiner Giefers for all the discussions. I am also grateful to Tarek Zeineldin and our

working group visitors Prof. Dr. David Andrews and Jason Agron for proofreading the

thesis. Further acknowledgments go to Dr. Heiko Kalte for co-authoring publications

and to Roland M¨

uhlenbernd and Martin Kreis for supporting my research as student

assistants and by their bachelor’s and master’s theses respectively.

Finally, I would like to thank my parents for their great support during my education,

my sister for being a good friend, and my lovely girlfriend Sonja for stepping into my

life.

Klaus Danne, December 2006

Contents

List of Figures vii

List of Tables ix

Abstract 1

1 Introduction 3

1.1 Motivation to Real-Time Multitasking ................... 4

1.2 Contribution of This Thesis ......................... 6

1.3 Chapter Outline ............................... 7

2 Background and Related Work 9

2.1 Reconfigurable Hardware Devices ...................... 10

2.1.1 Fine-Grain RHDs (FPGAs) ..................... 10

2.1.2 Coarse-Grain RHDs ......................... 12

2.1.3 Configuration Memory and Reconfiguration ............ 13

2.1.4 Architecture of RHD Based Computer Platforms ......... 14

2.2 Hardware Tasks ................................ 16

2.3 Multi-Tasking on Reconfigurable Hardware ................ 20

2.3.1 Resource Sharing Models ...................... 20

2.3.2 Task Scheduling and Placement Methods for RHDs ........ 26

2.3.3 Dynamic Reconfigurable Systems .................. 26

2.4 Real-Time Scheduling ............................ 30

2.4.1 A Background on RT Systems ................... 30

2.4.2 RT Scheduling Problems and Algorithms for Uniprocessor Systems 31

2.4.3 RT Scheduling Problems and Algorithms for Multiprocessor Systems 33

2.4.4 Real-Time Scheduling on RHDs and Similar Architectures . . . . 38

2.5 Chapter Conclusion ............................. 40

3 Problem Modeling and Metrics 43

3.1 Task and Resource Models .......................... 43

3.2 Feasible Schedule ............................... 47

3.3 Utilization Metrics .............................. 48

3.4 Chapter Conclusion ............................. 50

iii

Contents

4 Three Scheduling Algorithms 51

4.1 Global Scheduling .............................. 53

4.1.1 Earliest Deadline First on RHDs .................. 53

4.1.2 Relation to Multiprocessor EDF .................. 55

4.1.3 Schedulability Analysis for EDF-FkF ................ 56

4.1.4 Conclusion on global EDF ...................... 62

4.2 Partitioned Scheduling ............................ 63

4.2.1 Partitioned Scheduling on RHDs .................. 63

4.2.2 Relation to 2-Dimensional Packing Problems ........... 65

4.2.3 Optimal Partitioning by ILP .................... 66

4.2.4 Next-Fit-Decreasing-Area Partitioning ............... 69

4.2.5 Conclusion on Partitioned EDF ................... 74

4.3 Server-Based Scheduling ........................... 75

4.3.1 The Merge-Server Distribute Load (MSDL) Algorithm ...... 75

4.3.2 Exact Evaluation of Computation Time Reduction ........ 78

4.3.3 Properties of MSDL ......................... 80

4.3.4 Conclusion on Server-Based Scheduling .............. 81

4.4 Chapter Conclusion ............................. 81

5 Comparison of Algorithm Scheduling Performance 83

5.1 Analytical Comparison and Scheduling Anomalies ............ 83

5.2 Simulation Results .............................. 90

5.2.1 Creation of Benchmarks ....................... 90

5.2.2 Performance on Standard Benchmark BMstd ........... 92

5.2.3 Impact of Area and Time-utilization ................ 94

5.2.4 Performance of Combined Algorithms ............... 96

5.2.5 Impact of Number of Tasks ..................... 97

5.3 Conclusion on Algorithm Performance ................... 99

6 Execution Model and Overhead Analysis 101

6.1 Global EDF Overhead Model ........................ 101

6.1.1 Reconfiguration Modes and Task Placement ............ 101

6.1.2 Runtime System Requirements ................... 104

6.1.3 Reconfiguration Overheads ..................... 104

6.2 Partitioned EDF Overhead Model ..................... 106

6.2.1 Reconfiguration Modes and Task Placement ............ 106

6.2.2 Runtime System Requirements ................... 107

6.2.3 Reconfiguration Overheads ..................... 107

6.3 Server Based MSDL Overhead Model ................... 109

6.3.1 Reconfiguration Modes and Task Placement ............ 109

6.3.2 Runtime System Requirements ................... 109

6.3.3 Reconfiguration Overheads ..................... 109

6.4 Performance Evaluation Including Overhead ............... 110

6.4.1 Global EDF Overhead Evaluation ................. 110

6.4.2 Partitioned EDF Overhead Evaluation ............... 112

6.4.3 Server Based MSDL Overhead Evaluation ............. 112

Contents

6.5 Conclusion on Execution Models and Overhead .............. 113

7 Model Extensions 115

7.1 Periodic Tasks with Variants ........................ 115

7.1.1 Variant-Rich Tasks .......................... 116

7.1.2 Partitioned Scheduling of Variant-Rich Tasks ........... 119

7.1.3 Optimal Partitioning by ILP .................... 121

7.1.4 Performance Evaluation for Variant-Rich Tasks .......... 124

7.1.5 Approaches for Global- and Server Scheduling of Variant-Rich Tasks126

7.2 Periodic Tasks with Memory Access .................... 128

7.2.1 The Buffer Assignment Problem .................. 129

7.2.2 Task Schedule Aware Buffer Assignment .............. 132

7.2.3 Buffer Assignment Based on Integer Programming ........ 134

7.2.4 Simulation Results .......................... 135

7.2.5 Concluding Discussion on Memory Demanding RT Applications 136

7.3 Chapter Conclusion ............................. 137

8 Prototype: FPGA Based Real-Time Kernel 139

8.1 System Architecture ............................. 139

8.1.1 An All-Hardware Runtime System ................. 139

8.1.2 Scheduler Data Structures ...................... 141

8.1.3 The Scheduler Cycle ......................... 142

8.2 Synthesis Tool Flow ............................. 143

8.3 Synthesis Results ............................... 145

8.4 Chapter Conclusion ............................. 146

9 Conclusion and Outlook 149

9.1 Summary ................................... 149

9.2 Drawn Conclusions .............................. 150

9.3 Outlook and Future Work .......................... 152

Author’s Publications 155

Bibliography 159

Contents

List of Figures

2.1 Basic structure of an FPGA ......................... 11

2.2 Structure of FPGA logic block ....................... 11

2.3 Architecture of an RHD based platform .................. 15

2.4 Example of a DCT task on platform .................... 16

2.5 Resource sharing approaches based on full configuration ......... 21

2.6 RHD Resource sharing approaches based on partial RHD configuration . 23

3.1 Parameters of a single aperiodic job .................... 44

3.2 States of a job and a periodic task respectively .............. 44

3.3 Parameters of a periodic task ........................ 46

3.4 Example task set Γ∗and a feasible schedule. ................ 48

3.5 Scheduling Anomaly: An infeasible task set with infinitesimal system

utilization ................................... 50

4.1 A schedule for Γ∗by EDF-FkF vs. EDF-NF ............... 55

4.2 EDF multiprocessor schedule ........................ 56

4.3 EDF-FkF schedule of ˆ

Γ........................... 61

4.4 Example partitioning of Γ∗and the according schedule .......... 65

4.5 Example of partitioning χand the according decision variable configuration 69

4.6 Example partitioning of NFDA for (a) a task set with UT

max ≤0.5 and

(b) a task set with UT

max >0.5....................... 71

4.7 Schedule of server task set generated by MSDL .............. 78

4.8 Case analysis of computation time reduction ................ 79

5.1 Scheduling anomaly: A task set infeasible by any partitioned schedule,

but feasibly by global EDF-NF........................ 84

5.2 Scheduling anomaly: A task set infeasible by global EDF-NF but feasible

by partitioned EDF NFDH.......................... 85

5.3 Scheduling anomaly: A task set infeasible by global EDF-NF but feasible

by MSDL.................................... 88

5.4 The relation between feasibly scheduled task sets by the proposed algo-

rithms as VENN-diagram. .......................... 89

5.5 Characteristics of BMsdt tasks and task sets ................ 92

5.6 Performance of algorithms on BMsdt .................... 93

vii

List of Figures

5.7 Distribution of the tasks’ Aito UT(Ti) ratio of the BMsdt,BMsmallAbigUT

and BMbigAsmallUTbenchmarks. ...................... 94

5.8 Performance of algorithms on BMsmallAbigUTand BMbigAsmallUTbench-

marks. ..................................... 95

5.9 Performance of algorithms EDF-NF,opt.-part.-EDF and MSDL on the

BMstd,BMsmallAbigUTand BMbigAsmallUTbenchmarks. ......... 96

5.10 Performance of combined algorithms on BMsmallAbigUTand BMbigAsmallUT

benchmarks. .................................. 97

5.11 Performance of algorithms EDF-NF,NFDA-part.-EDF and MSDL on

task sets with 10, 20 and 50 tasks. ..................... 99

6.1 Partial reconfiguration models supporting arbitrary task areas ...... 102

6.2 A schedule including reconfiguration times ................. 103

6.3 Performance of the EDF-FkF scheduling test including reconfiguration

overhead .................................... 111

6.4 Performance of EDF-NF obtained by exact simulation, including recon-

figuration overheads (BMstd)........................ 111

6.5 Performance of opt.-partitioned-EDF including reconfiguration overheads

(BMstd).................................... 112

6.6 Performance of MSDL including reconfiguration overheads (BMstd). . . 113

7.1 Preemptive global schedule of variant-rich tasks .............. 118

7.2 Example of a partitioned EDF schedule of a variant-rich task set . . . . 120

7.3 Optimal partitioning of z∗and the according decision variable configuration124

7.4 Sub-optimal partitioning of z∗, considering only the first variant of each

task. ...................................... 124

7.5 Performance of opt.-part.-EDF on task sets with implementation variants 125

7.6 Application of Example 4 mapped to a reconfigurable platform . . . . . 129

7.7 Graph representation of the memory demanding RT application of Ex-

ample 4 .................................... 131

7.8 Memory demanding RT application from Example 5, illustrating different

running sets of a task schedule. ....................... 134

8.1 System architecture ............................. 140

8.2 Scheduler phases during a system tick period ............... 142

8.3 Synthesis tool flow .............................. 144

viii

List of Tables

2.1 Characteristics of a XILINX 8 point 1D-DCT IP core .......... 17

2.2 Comparison of resource sharing models ................... 25

4.1 Example task set ˆ

Γ.............................. 61

4.2 Example task set Γ∗............................. 64

4.3 Servers generated for the example task set Γ∗by the MSDL algorithm . 78

5.1 Example of a task set ΓAwithout feasible partitioning .......... 84

5.2 Example task set ΓB............................. 85

5.3 Example task set ΓDwithout feasible partitioning and feasible set of

servers ΩD................................... 86

5.4 Example task set ΓE............................. 87

5.5 Feasible MSDL servers for task set ΓE................... 87

5.6 Benchmark parameters ............................ 94

7.1 Example of a variant-rich task set z∗................... 116

7.2 Example of a variant-rich task set z∗................... 123

7.3 Characteristics of a XILINX 8 point 1D-DCT IP core .......... 129

7.4 Simulation results of 2C-BA and SA-BA problems. ............ 136

8.1 Runtime system size for various numbers of tasks and servers ...... 146

List of Tables

Abstract

This thesis presents fundamental work in the new area of multi-tasking on reconfigurable

hardware devices (RHDs) under real-time conditions.

RHDs can now execute several hardware-tasks (computations implemented as digital

circuits) in parallel due to the increasing logic capacity, as well as sequentially due to

runtime reconfiguration capabilities. To use RHDs for real-time workloads of embed-

ded system applications, scheduling techniques and execution environments that create

predictable task timings are required.

Specifically, we consider the problem of scheduling periodic real-time tasks for execution

on shared RHD resources, which makes the problem different from single- and multipro-

cessor scheduling. In our first model, tasks are modeled by their resource requirements

(area), their inter-arrival period and their computation time. Tasks can be executed

in parallel as long as their accumulated area does not exceed the device area and can

be preempted at any time. We develop three novel earliest deadline first (EDF) based

scheduling methods and evaluate their performance.

1. The global EDF scheduler assigns device resources to all active tasks globally. For

this algorithm we present a linear-time scheduling test which proves for a given

task set at design time that it will be scheduled without missing any deadlines.

2. The partitioned EDF scheduler divides a task set at design-time into several sub-

sets. Each subset is scheduled independently according to the EDF rule onto

separate device resources. The feasibility of the schedule is achieved by consider-

ing the single processor EDF bound during partitioning. Integer linear program-

ming (ILP) is used to compute the optimal partitioning while a next fit heuristic

computes close to optimal results for even large task sets.

3. MSDL is a server based scheduling method which groups tasks into periodic servers

for parallel execution. At runtime, servers are loaded onto the RHD and executed

according to EDF. Only one server is running at a time, making this algorithm

suitable for RHDs which do not support partial reconfiguration.

In our analytic performance comparison we show that there is no dominance among

the three approaches. However, our simulation studies show that, on average, global

EDF outperforms partitioned EDF, which outperforms MSDL. Further we showed that

the global EDF scheduling test is rather pessimistic. We analyze the reconfiguration

overheads for all three approaches and include them within the schedulability tests.

Simulation experiments show that global EDF suffers significantly more from this over-

head than the other schedulers.

Two extensions of our task model are presented in order to consider the specific char-

acteristics of reconfigurable applications: First we consider tasks for which not only one

but several alternative implementations (circuits) exist. We present an ILP model to se-

lect the optimal implementations for the partitioned EDF scheduler, which considerably

improves the scheduling performance. Second, we consider tasks memory usage under

real-time constraints. We present a method for efficiently sharing RAM-banks without

jeopardizing task deadlines, thus reducing the required memory resources.

Finally we describe our prototype FPGA based real-time kernel which, in contrast to

others, is a hardware-only system with the operating system functions entirely imple-

mented in hardware. Our prototype implements MSDL, uses full device reconfiguration,

and is thus suitable for most of today’s reconfigurable devices.

In summary, the thesis presents a foundation for real-time multitasking on RHDs, in-

cluding models, algorithms, analysis and prototypical implementation.

CHAPTER 1

Introduction

The aim of this thesis is to enable the use of runtime reconfigurable hardware for com-

puting time-critical tasks in embedded systems.

Besides circuit prototyping and glue-logic replacement there are two major application

domains where reconfigurable hardware devices (RHDs) are most useful: High Perfor-

mance Computing and Embedded Systems.

In high performance computing, RHDs can be used to speed up special applications. Ex-

ample systems which have achieved enormous speedups by using pure field programmable

gate array (FPGA) solutions or FPGAs as co-processors of host computers can be found

in [31][95]. Cray’s XD1 machine [34] is a commercial example for super-computing using

XILINX FPGAs. Also not the application domain targeted by this work, several of the

methods explored within this thesis may be equally applicable within high performance

computing systems.

In the application domain of embedded systems, RHD based platforms are gaining in

popularity and are sometimes preferred to microprocessor based systems because they

can:



respond to input values within very short latencies (shorter than software solu-

tions),



achieve high throughputs in data flow driven applications (parallelism within a

task),



process many computations in parallel without decrease in performance (multi-

tasking in space),



communicate to peripheral devices via high number of user definable I/O pins.

1 Introduction

Applications within the embedded systems domain include network- and communica-

tions systems such as Ethernet switches, data processing applications and industrial

measurement systems like digital logic analyzers, military and aerospace systems as for

example satellites, digital imaging and video systems such as digital cameras, consumer

electronics such as audio players, automotive applications such as Siemens car enter-

tainment system VDO Dayton and many more [45][55][4].

The applications mentioned above share some common key characteristics:



they contain computations which are mostly regular, data flow oriented and in-

clude digital signal processing, which can be efficiently computed on RHDs,



they can be complex, i.e. they may consist of several different computational tasks,



they have to react within precise time constraints to their environment, i.e., they

are real-time applications.

Supporting the design of such characterized embedded applications for execution on

RHD platforms is the aim of this thesis.

1.1 Motivation to Real-Time Multitasking

Multitasking is useful if the application is composed of several well separated com-

putational tasks that should be processed quasi simultaneously on limited hardware

resources. The concept of multitasking, in combination with the corresponding oper-

ating system (OS) support, enables efficient sharing of resources due to adequate task

scheduling, allows dynamic changes of the application due to introduction of new tasks

during runtime and greatly simplifies application development since tasks can be devel-

oped quite independent from each other.

As further motivation, we list several example tasks with real-time constraints which

appear in real-world applications. Most of the mentioned applications come from the

automotive domain. Some refer to only conceptual systems and some to existing imple-

mentations.

Engine control system: A task controlling the engine such as the gas injection and

ignition process. Indeed this task may include several sub-controller tasks. The

computation of these sub-tasks is based on the cyclic evaluation of the differential

equation systems of the corresponding mechatronic controllers. Kopetz describes

such example controller in [64] p. 22.

Vehicle cruise controller: In [8], Axelsson from Volvo Technological Development de-

scribes a case study of an advance cruise controller for cars. It keeps the car at

a desired speed, but also uses a forward looking radar to avoid collisions with

obstacles. The system is composed of several tasks, such as: wheel-speed measur-

ing;speed control which adapts the speed to the desired value; driver-interaction

1.1 Motivation to Real-Time Multitasking

which checks for break and throttle pedals; obstacle-detection which looks for ob-

stacles ahead; brake-control and throttle-control which perform deceleration and

acceleration.

Radar system: Several upper class cars use a radar system for measuring the distance

to the cars driving ahead [32]. Such long distance control is used by the cruise

control. The implementation of radar signal processing on FPGAs is issued in [6].

Camera driving assistance: The company MobilEye[1] develops several car driver as-

sistance applications based on camera image processing, among them pedestrian

detection,headway monitoring and warning,forward collision warning,night vi-

sion and many more [54]. For the detection of lane marks and vehicles, images are

processed through a chain of various processes: 1. selecting candidate regions of

interest,2. single frame classification,3. multi-frame approval and 4. range mea-

surement. To achieve the required real-time image processing performance of 30

frames per second, the company developed a system on chip (SoC) which hosts

two RISC processors and dedicated video processing units and runs at 120 MHz.

All of these tasks have a couple of common features. They are based on computations

which can be efficiently implemented on RHDs, have to be evaluated periodically, and

are subject to hard real-time constraints. Such systems are targeted within this thesis.

Challenges

When designing embedded systems, a key challenge is to implement all required func-

tionalities, meet the required performance and time constraints, but at modest cost.

Hence, the system resources have to be used efficiently. For processor based systems,

operating systems support resource sharing via multitasking, and appropriate scheduling

algorithms ensure the efficient utilization of shared resources while meeting all real-time

constraints.

The more reconfigurable architectures push into embedded systems and take over com-

putations formerly implemented on processors or in application specific integrated cir-

cuits (ASICs), the more important it becomes to apply the same techniques to RHDs

in order to use their resources efficiently. RHDs increase the existing challenges in the

following three ways:

1. Design of novel reconfigurable architectures: Most of today’s available commercial

reconfigurable architectures support dynamic reconfiguration only in a limited

fashion. The reconfiguration time of FPGAs ranges from some hundred microsec-

onds to tens of milliseconds, which makes runtime reconfiguration infeasible for ap-

plications requiring very short response times. Also partial reconfiguration, which

allows reconfiguring a part of the device resources for a new task while other parts

keep executing other tasks, has only limited support. Moreover, the architectures

do not provide native support for storing and restoring the state of a task before

/ after interruption, which is a requirement for preemptive multitasking.

1 Introduction

Much promising research is currently being performed to overcome these draw-

backs, including coarse-grain RHDs and other devices with multiple configuration-

context memories on chip [96][31][50] (see Section 2.1).

2. Runtime environments for reconfigurable devices: Multitasking on RHDs requires

some kind of support from a runtime or operating system, similar to an OS for

processor based platforms. Such a system needs to manage the RHD resources,

control the execution of hardware-tasks and control the device reconfiguration at

runtime. Moreover, such an OS has to provide programming interfaces for the

application developer to abstract details from the underlying hardware.

Several prototypes of operating systems for FPGAs have been developed by re-

search groups, but still are far behind their counterparts in the processor world

[102][57][105][85].

3. Task scheduling and placement: Scheduling and placement methods are required,

which allow the OS to efficiently utilize RHD resources for the application tasks

and, if needed, deal with real-time requirements. Since the task and resource mod-

els of single- and multiprocessor systems are different from those of reconfigurable

devices, their scheduling methods cannot be directly applied for RHDs. Moreover,

tasks scheduling for RHDs under real-time constraints has been rarely addressed

by researchers (e.g. in [80][90]). Scheduling problems for RHDs are much less un-

derstood and only a few proposed scheduling approaches exist compared to the

classical scheduling problems for processor and multiprocessor platforms.

Hence, there is a considerable need for research in this area in order to catch up

with the state of the art of scheduling methods for processors.

1.2 Contribution of This Thesis

The main contributions of this thesis are related to the third challenge mentioned above,

namely (real-time) scheduling methods for reconfigurable architectures. For the well

known problem of scheduling periodic real-time tasks, we present one of the first work

considering runtime reconfigurable architectures. Three basic scheduling methods are

developed and evaluated. Two are driven by the basic (multi)-processor scheduling

methods and the third is specially designed for devices without partial reconfiguration

support.

A second contribution is given in the area of operating systems for reconfigurable archi-

tectures. As a proof of concept, a prototype of an FPGA based real-time kernel (FBRK)

has been implemented. In contrast to other work, it is completely implemented in (re-

configurable) hardware and designed for full- rather than partial reconfigurable devices.

It implements the developed scheduling strategy for devices without partial reconfigu-

ration support and hence is suitable for a wide range of today’s devices and platforms.

The contributions in the field of real-time scheduling methods are described more pre-

cisely in the following list:

1.3 Chapter Outline



A new task and resource model for reconfigurable architectures as a generalization

of a simple multi-processor model, and definition of the periodic real-time tasks

scheduling problem within this context. For this problem, we introduce three in-

dependent scheduling strategies based on global scheduling,partitioned scheduling

and server based scheduling. We develop a linear-time scheduling test for the global

scheduling strategy, and introduce and analyze optimal and heuristic algorithms

for the partitioned scheduling and server based scheduling strategy.



The performance of all three strategies is compared analytically and by simula-

tion studies. We show that the three approaches do not dominate each other in

performance. Simulation results show, that for many cases the global scheduling

strategy achieves high device resource utilization of over 90% and outperforms the

partitioned scheduling strategy. In contrast, when a scheduling test is required to

predict that no deadline will be missed, significantly higher resource utilization is

achieved by the partitioned scheduling strategy compared to the global scheduling

strategy.

As the reconfiguration time cannot be neglected, the associated overhead is in-

cluded within the analysis of all three strategies. Main results state that the

overhead is bounded and in many cases stays reasonably small, if the device con-

figuration time is less than 10% of the runtime of the shortest task. Furthermore,

the partitioned scheduling and server based scheduling methods suffer much less

from this overhead than the global scheduler.



In order to deal with the specific characteristics of hardware tasks and reconfig-

urable platforms, we generalize and extend our model. The first generalization

allows considering tasks, for which not only one but several alternative circuit

implementations exist. For such applications, the optimal variant selection is

computed by integer linear programming, which improves considerably the per-

formance of the partitioned scheduler compared to tasks without variants.

Secondly, we extend our model to consider tasks’ memory usage under real-time

constraints. We present a method to efficiently share memory banks of the plat-

form among tasks without jeopardizing task deadlines.

In summary this thesis presents a foundation for real-time multitasking on RHDs, pre-

senting the models, algorithms, analysis and a prototypical implementation.

1.3 Chapter Outline

The thesis has the following organization:

Chapter 2Background and Related Work provides a background to this thesis and

summarizes the related work. It discusses the different kinds of reconfigurable hardware

architectures that we consider as processing devices throughout this work. The concept

of hardware-tasks is introduced and previously reported multitasking models for such

tasks are reviewed and classified. Finally, the chapter presents related work in real-time

1 Introduction

scheduling, including single and multiprocessor scheduling algorithms for periodic tasks

and the few research on real-time scheduling for RHDs.

Chapter 3Problem Modeling and Metrics introduces the task and resource models and

the basic notations used within this work. The general problem of scheduling periodic

real-time tasks onto an RHD is defined.

Chapter 4Three Scheduling Algorithms presents our main methods to schedule periodic

task onto RHDs, namely global EDF scheduling,partitioned EDF scheduling and server

based scheduling. The scheduling conditions and basic analysis are developed.

Chapter 5Comparison of Algorithm Scheduling Performance evaluates the quality of our

three proposed methods. Only the pure scheduling performance without considering any

reconfiguration overhead is compared. The first part presents an analytic comparison

and derives qualitative statements. The second part presents results from numerous

simulation experiments, showing how the algorithms perform in average.

Chapter 6Execution Model and Overhead Analysis goes away from the idealistic task

model and accounts for execution time overheads. For each scheduling method a suitable

reconfiguration model and its runtime system requirements are described. The time

overhead resulting from device reconfigurations is included in the scheduling conditions

of each algorithm. Finally, the results from simulation experiments on the algorithm

performance including overheads are shown.

Chapter 7Model Extensions takes special characteristics of hardware-tasks into account.

A first part presents methods for a generalized task model that allows specifying not

only one but several implementation variants per tasks - each with a different area and

computation time. The second part captures the memory requirements of hardware-

tasks. A method for sharing physical platform memories between parallel executed task

is shown which has the potential to considerable reduce the memory requirements.

Chapter 8Prototype: FPGA Based Real-Time Kernel describes our realization for the

server based scheduler. The system architecture and an automated synthesis flow are

described. Finally, results on the logic requirements and timings are reported.

Chapter 9Conclusion and Outlook finalizes the thesis with a summary, draws conclusions

and presents suggestions for future research.

CHAPTER 2

Background and Related Work

This chapter provides a background to this thesis and summarizes the related work.

The first section discusses the different kinds of reconfigurable hardware architectures

that we consider as processing devices throughout this work. It gives a background

to fine and coarse grain reconfigurable devices, the different reconfiguration modes and

discusses the architecture of typical embedded systems based on such devices. Since

reconfigurable devices and computing systems have been surveyed several times (e.g. in

[31] p. 11 ff.), we keep this section therefore small.

In Section 2.2, we briefly introduce the concept of hardware-tasks, which refers to com-

putational tasks implemented as digital circuits for execution on an RHD. By means

of an example of a hardware task, several characteristics such as resource requirements

and execution times are discussed.

The third section concentrates on multi-tasking on reconfigurable hardware. In order

to be able to develop a reasonable task and resource model for our work (Chapter 3),

six different device resource sharing models, that have been used in the literature, are

discussed and classified (Section 2.3). We clear out why only two of the six models

will be considered in our scheduling approaches. Afterwards, we present related work

that has specifically targeted the problem of scheduling tasks on RHD architectures.

While this documents the attraction of the research field, it shows that most work

does not consider tasks with real-time constraints. Finally the section closes with a

brief description of projects that have implemented reconfigurable systems supporting

hardware multitasking. It shows that various prototypes exist, which support execution

environments as they will be assumed in our task and resource model of Chapter 3.

The last part of this chapter presents related work in real-time scheduling (Section 2.4).

We introduce the periodic task model and clarify its importance, since it is used as a ba-

sis for our task model. Furthermore, we present the basics of single processor scheduling

algorithms, since they are also fundamental to all scheduling methods for parallel archi-

2 Background and Related Work

tectures. Scheduling methods for multiprocessors are discussed in more detail, since the

multiprocessor model is closely related to our RHD resource model and, furthermore,

the basic techniques of multiprocessor scheduling have been inspired two of our three

scheduling methods which will be introduced in Chapter 4. Finally, we close with the

few work considering real-time scheduling for RHDs or similar architectures. We con-

clude, that the problem considered in this work is actually novel, but as well a straight

forward extension of the basic single- and multiprocessor scheduling problems. Further-

more, the approaches we use are based on scheduling methods with strong foundation

in literature.

2.1 Reconfigurable Hardware Devices

This section gives a background on different types of reconfigurable hardware devices

(RHD), their inner architecture and how they are used to implement computations.

The most prominent type of RHD is the field-programmable gate array (FPGA). It

was invented in the eighties and is a further development of early programmable logic

devices. Even if the FPGAs where mainly invented for prototyping of digital circuits

and for implementing glue logic, researchers got attracted by the feature of runtime re-

configuration (or runtime reprogramming) of these devices, which allows general FPGA

based computations. Since then, the research field Reconfigurable Computing [31][46]

was born, which focuses on new RHD based computing paradigms beside the Von Neu-

mann paradigm as well as on the invention of new RHD architectures to overcome the

drawbacks of commercial available FPGAs.

First we will take a closer look to the architecture of RHDs which can be classified into

fine-grained architectures, which are basically FPGAs and coarse-grained architectures,

which are mostly academically proposed devices or prototypes from startup companies.

2.1.1 Fine-Grain RHDs (FPGAs)

The term fine-grain is used to denote devices, which can be configured on bit-level. We

use fine-grain RHD as a synonym for FPGA.

The basic architecture of today’s FPGAs is shown in Figure 2.1 and consists mainly of

an array of programmable logic blocks, programmable interconnect as well as general

purpose input output cells. The logic blocks can be configured to implement small

logic functions and registers, which can be composed to larger digital circuits by the

programmable interconnect. A typical structure of a logic block is shown in Figure 2.2.

Its main components are look-up tables (LUTs) to implement combinatorial logical and

flip flops to implement storage. An ninput LUT is an n-address memory, which can

store the 2npossible values of a Boolean function with ninput variables. To implement a

specific Boolean function, the LUT is programmed with the function’s truth table. The

input values are used to address the LUT and the data output represents the value of the

function. The output of the LUT is either fed into a flip flop of the logic block or directly

2.1 Reconfigurable Hardware Devices

Figure 2.1: Basic structure of an FPGA

connected to the logic block’s output. Typically, 4-input LUTs are used, since many

studies show that they present a good tradeoff for most digital designs [2]. Recently,

Xilinx switched from 4- to 6-input LUTs in their latest Virtex-V architecture [33]. Altera

uses in their Stratex-II architecture adaptive logic modules with a configurable LUT size,

which can be used e.g. as one 7-input LUT or as two 4-input LUTs [3].

The flip flops of the logic blocks are used to implement the storage registers inside a

digital design. They are connected to a global clock network and support features like

synchronous or asynchronous reset.

The programmable interconnect consists of configurable crossbar switches, which allow

establishment of connections between outputs and inputs of the logic blocks as well as

to the I/O cells.

Also these elements are all what is needed to implement a digital design, modern FPGAs

include many more types of resources to achieve better performance and efficiency.

Typical resources of the market leader FPGA vendors are briefly described below.

Block RAMs are dedicated blocks of static RAM of medium size (typical with a size of

18 kBit), which can be used to store larger amounts of data onto the device. In

contrast to flip flops which can be accessed independent of each other, the data

Figure 2.2: Structure of FPGA logic block (source: [106])

2 Background and Related Work

of each block RAM memory may only be read or written sequential on a word by

word basis. (Some devices provide dual ported block RAM.)

Carry logic is a special logic inside the logic blocks to forward carry bits to improve the

performance of arithmetic functions such as adders.

DSP cores are dedicated functional units like multiplier and multiply-accumulate struc-

tures to improve the performance of digital signal processing functions.

Processor cores are dedicated microprocessor cores hosted on the FPGA chip. They

are used for complex system-on-chip designs which include software and hardware

parts.

High speed I/O transceivers are dedicated transceiver circuits which implement sev-

eral high speed I/O standards to enable serial I/O of several GBit per second.

Clock managers are units, which allow generating and distributing clock signals with

minimal skew throughout the FPGA chip. Modern FPGAs support multiple global

clock networks enabling multi clock designs.

Tri state buffers are components, to support on chip communication bus systems with

multiple clients.

FPGAs are typically programmed using an ASIC-like design flow. Design entry is usu-

ally provided in a hardware description language like VHDL or VeriLog. The design

flow includes the synthesis of the design into a net-list, mapping the net-list to the

components provided by the FPGA and a place and route step, in which components

are allocated to specific FPGA resources and connected via the configurable intercon-

nect. Based on the result a programming file for the FPGA, also called a bit-stream, is

generated which can be directly downloaded into the FPGA configuration memory.

2.1.2 Coarse-Grain RHDs

Like FPGAs, coarse-grain RHDs are based on an array of programmable logic blocks

called processing elements and programmable interconnect. In contrast to the fine-grain

FPGAs, coarse grain RHDs provide programmability on word-width level of e.g. 8, 16,

24 or 32 bits [50][35][56][13]. The processing elements are not based on LUTs anymore,

but provide arithmetic operations like addition, multiplication and shift on word level

operands. Therefore, sometimes these devices are referred to as ALU arrays. The LUT

approach, which allows the implementation of general functions, is not feasible on word

width level: Imagine a processing element with two 16-bit inputs and one 16-bit output.

Enabling programmability of a general logic function would require 16 times a 32-bit

input LUT which results in 16×232 bit = 8 GB of memory for each processing element.

The operation on word-width level gives the coarse-grain RHDs several advantages com-

pared to FPGAs: The arithmetic functions are implemented as dedicated circuits, which

makes them much smaller in terms of chip-size and faster than functions implemented

on FPGA logic blocks. Also the crossbar switches become simpler, as not single signals

2.1 Reconfigurable Hardware Devices

but entire buses of signals are connected at once. Since a coarse-grain RHD consists of

much fewer processing elements than an FPGA of comparable chip size, the configura-

tion memory is dramatically reduced. E.g. to program the device to implement a 16

bit multiply-accumulate unit, only two processing elements and the crossbar in between

have to be configured, compared to tens or hundreds of logic blocks and crossbars that

have to be configured when the same function is implemented on an FPGA. This re-

duction of configuration memory is directly translated to higher reconfiguration speed.

Typically, coarse-grain devices can be reprogrammed several times as fast as FPGAs.

The coarse-grain RHDs gain these benefits at the cost of sacrificing some flexibility,

which makes them inefficient when the problem does not mainly consists of word-width

arithmetic operations. For example an add operation of 3-bits operands could occupy

an entire 32 bit processing element.

2.1.3 Configuration Memory and Reconfiguration

The current function and behavior of the RHD is determined by the content of its

configuration memory. This memory consists of static RAM memory cells, which are

distributed on the device array. For each logic block of an FPGA, there are configuration

memory cells storing the content of the LUT, the information about how the flip flop

is connected, its initial state and its reset mode. The configuration memory of the

processing elements of coarse-grain devices basically determines the function of the ALU.

The configuration memory of the crossbar switches determines the connections of the

logic blocks or processing elements inputs and outputs to routing channels as well as

which horizontal and vertical routing lines are connected.

Since the configuration memory consist of SRAM cells, it can be programmed arbitrarily

often, which does not only allows the definition of the function of the RHD after the

chip production, but also changing it online during runtime. Therefore, we distinguish

between static configuration and dynamic reconfiguration. Moreover, some RHDs allow

reprogramming some parts of the configuration memory and therefore to change the

functionality of only some logic blocks or processing elements respectively, while others

keep operating. Therefore, we distinguish between full- and partial device reconfigura-

tion.

Architecture of Configuration Memory

For physical reasons, the SRAM cells of the configuration memory have to be local

to the elements they are configuring.1A simple and area efficient method to make

the configuration memory accessible from outside the device is to connect all its cells

to a long shift register. The RHD can be reprogrammed by sequentially shifting new

configuration bits into the configuration memory until the entire device is reconfigured.

Obviously this addressing schema allows only full configuration and the device cannot

be used during the time of the reconfiguration process. Also the reconfiguration time is

1It is inefficient in terms of path delay and chip area to rout long connection lines between configuration

memory cells and the elements to configure.

2 Background and Related Work

very long, since only 1-bit per clock cycle can be written.

These drawbacks can be overcome by a more sophisticated configuration memory ar-

chitecture. The width of the configuration interface can be enhanced, which allows to

reprogram several bits in parallel. The configuration memory can be made partial ac-

cessible by allowing addressing certain portions of it at the cost of extra chip area for

address decoding. For example the FPGAs of the Xilinx Virtex and VirtexII families al-

low reprogramming entire columns of logic blocks, which allows a partial reconfiguration

of the FPGA at a rather coarse level.

Reconfiguration times of coarse-grain reconfigurable architectures are typically much

shorter than that of FPGAs, since they have less cells and hence a much smaller con-

figuration memory. A further approach to overcome the long reconfiguration time are

multi-context devices [31][35][96]. These devices have not only one but some few con-

figuration memories for all device resources on chip. While this adds considerable cost

in terms of chip area, it allows changing from one configuration to another within few

clock cycles.

Within our resource model of Chapter 3we consider only the homogeneous cells of an

RHD and neglect additional resources such as I/O pins, DSP functions and embedded

microprocessors. How such additional resource can be included within our model and our

scheduling methods will be discussed in the future work section of Chapter 9. Dedicated

memory blocks of the RHD and external SRAM banks are considered in our extended

model of Section 7.2.

2.1.4 Architecture of RHD Based Computer Platforms

By the term RCP (RHD based computer platform) we denote an embedded computer

system, whose main computational resources consist of RHDs (see Figure 2.3). The

main computational tasks of the applications are executed on the RHD, e.g. by pro-

gramming the device with digital circuits compiled as FPGA configuration bit-streams.

Microprocessors are optional and their usage is not given special consideration within

this thesis. However, microprocessors may be included within the system as separate

chips external to the RHD, or integrated as dedicated circuits within the RHD (e.g.

Xilinx VirtexII Pro [108]) or as soft core CPUs2, which are implemented out of RHD

logic cells (e.g. the Xilinx Microblaze or the Altera NIOS).

Beside the RHDs, the system includes memory to store the RHD programming data

as well as application data and peripheral I/O devices to communicate with the envi-

ronment. Additionally, a device controlling the reconfiguration process of the RHD is

included. Concrete examples for FPGA based reconfigurable computing platforms are

the RC100, RC200, RC300 from CELOXICA [27], the XF-board [98] from ETH Zurich,

the Erlangen Slot Machine [20], or development boards distributed by XILINX. Some

platforms are module based and can include several FPGA such as the RAPTOR 2000

system [62]. An example of a system including a coarse-grain RHD is the DAPDNA-EB4

2A soft core CPU is a CPU which is implemented as FPGA design.

2.1 Reconfigurable Hardware Devices

Config.

Controller

CF 1

CF 2

...

CF n

Config.Files

Config

BUS

Data

BUS

DAC

I/O

DAC

...

RHD

RAM

...

Figure 2.3: Architecture of an RHD based platform

board of the IPFlex company, which includes two DAPDNA-2 reconfigurable processors

consisting of 376 processing elements each [56].

As for the memories, the platforms typically host several SRAM (static RAM) banks

which are directly connected to the RHD. They are used to store processing data to

which relatively fast access with predictable timing is required, but which is too large to

be stored in on-chip memories. Additionally non-volatile memory such as flash memory

is included to store the configuration data (programming files) and other persistent

data. Larger but slower DRAMs (dynamic RAM) are included to store large blocks of

application data, which is accessed with subject to weaker timing constraints.

RCPs often have a wide set of peripheral I/O. This includes for example digital-analog

and analog-digital converters, video and audio interfaces such as S-VIDEO or VGA,

standard interfaces like PS2, USB, fire-wire, network interfaces such as Ethernet or

CAN, and user definable digital I/O.

In the models and methods used in this thesis we abstract from platform details and

consider as platform resources only the logic blocks or processing elements of the RHD,

the internal and external SRAM memory banks, the memory hosting the RHD configu-

ration data as well as the bus connections between these resources. We neglect several

additional resources and details for the following reasons:



Details have to be abstracted, to keep the model valid for wide class of RCPs from

different vendors as well as to keep the model complexity low.



Microprocessors are not included within the model, since we assume they are

used only for controlling the execution and reconfiguration of the RHD, not for

executing the application tasks themselves. Such hybrid systems, which couple

microprocessors with RHDs and use both for computation (e.g. [56]) open the issue

of hardware–software co-design[92][75]. That is another independent research field

and its topics are not addressed within this thesis.



An RCP according to our model has only one RHD. From a practical point of

view, multiple RHDs appear mostly in high-performance computing systems such

2 Background and Related Work

as Cray’s XD1 [34] with the aim to get highest computational power at any price.

In contrast, in embedded system design the challenge is rather to choose from the

wide range of different size RHDs the smallest one that satisfies the application

requirements.

From a theoretical point of view, multiple RHDs open issues which are stronger

related to parallel processor systems than to the special characteristics of RHDs.

Such issues are for example clustering and partitioning of applications and localiza-

tion of communication, which have been exhaustively studied for multiprocessors

but are not addressed within this thesis.

2.2 Hardware Tasks

In our application model of Chapter 3and later on in the enhanced application models

of Chapter 7, we will make several assumptions for the characteristics of application

tasks. To give a better idea of what kind of tasks we consider and to show that the

assumptions we make are reasonable, this section presents a short case study of an image

processing task based on Discrete Cosine Transformation (DCT) computations.

Example 1. We assume an image processing system which includes a task that computes

the DCT coefficients of the rows of an image. The data to be processed has a fix size of

Nsamples, each coded by 16 bits on which an 8 point one dimensional DCT should be

performed.

Task 1

8 pt. 1-D

DCT

Input

Data

Output

Data

in FIFO out FIFO

32 32

16 16

SRAM 1 SRAM 2

FPGA

Figure 2.4: Example of a DCT task on platform

2.2 Hardware Tasks

Figure 2.4 illustrates how this task is realized on an FPGA based platform. Since the

image data may be too large for FPGA internal block RAMs, the Ninput samples as

well as the output coefficients to be computed are stored in the external SRAM banks 1

and 2 respectively. For the actual DCT computation a XILINX 1-D DCT IP-core [107]

is used. Since the external RAM banks have a word width of 32 bits but the samples

are 16 bit values, input and output FIFOs3are introduced to buffer the input samples

and the produced output coefficients respectively and to convert between the 32 and 16

bit words.

variant FPGA area throughput ext. RAM port access exec. time

(slices) (sample/clk) period amount utiliz. N=1024

1 982 1 2 clks 1

32 bit word 50% 1028 clks

2 767 8 / 9 9 clks 4

32 bit words 45% 1185 clks

3 589 8 / 17 17 clks 4

32 bit words 24% 2232 clks

Table 2.1: Size, throughput and memory access pattern of several variants of a XILINX

8 point 1D-DCT IP core [107].

As reported in the DCT IP-core data-sheet [107], the designer can choose between several

implementation variants with a trade-off between FPGA area and throughput as shown

in Table 2.1. To guaranty that the DCT core can achieve these throughput values – which

is mandatory for real-time processing – the FIFOs have to be chosen of appropriate size

and a certain access rate to the external RAMs has to be guaranteed. The required access

rate to external memory is reported in the middle column of Table 2.1. E.g. variant 1

processes one 16-bit sample per clock cycle. Therefore, within a period of 2 clock cycles

the task reads one 32 bit word from SRAM1 into the input FIFO. In contrast variant 3

has a throughput of 8/17. Hence it reads within a period of 17 clock cycles 4 data words

from the external RAM. Note that since a FIFO buffers the data it does not matter in

which exact clock cycles the data is read from the RAM - as long as 4 words within a

period of 17 clocks are read.4Thus, the RAM access pattern of the task has the nature of

periodic real-time access: At the beginning of a period it requests to read/write a certain

amount of data from/to the external RAM bank into/out of the FIFO. The deadline for

this transaction is the beginning of the next period to prevent an underflow or overflow

of the FIFO buffers.

The total worst case execution time of the DCT example task is computed using Equa-

tion 2.1. Beside the internal latency of the DCT core, two times the RAM access period

is added to account the initial filling of the input FIFO and the final deflating of the

output FIFO.

exectime =N

throughput +DCT latency + 2 ×RAMaccessperiod (2.1)

For a data size of N= 1024 samples, the execution time according to Equation 2.1 for

3first in first out buffers

4The same rates apply for the write access to the SRAM2.

2 Background and Related Work

each of the three variants of the task is reported in Table 2.1 as well.

This case study of a DCT task showed that it is reasonable to assume application models

where tasks have the following characteristics:



Tasks may be realized as various implementation variants with different perfor-

mance vs. RHD-resource tradeoffs. Such various implementations can be achieved

by considering different circuit architectures. For example, a multiplication can

be realized by a bit-serial or a bit-parallel architecture, hence resulting in a small

but slow and in a fast but large circuit.



A task variant requires a specific amount of RHD resource. For FPGA devices,

the synthesis tool typically provides information on how many FPGA slices are

required by the design. Often, slice resources are further distinguished to the

number of required LUTs and flip flops. Additional resources are reported as well,

such as the required block RAMs, or multiplier cores. Hence, the basic resources

can be captured by a scalar value (e.g. number of FPGA slices) or by a resource

vector.



A task variant may have a known execution time which is usually given by the

number of required clock-cycles divided by the clock-frequency at which the task

is executed. Determining the worst case execution time of a hardware task is

discussed in the next paragraph.



A task may require external memory to store larger amounts of data and may

access it periodically within hard real-time constraints.

These assumptions are the motivation for our application models used within this thesis,

namely the basic periodic real-time tasks model, introduced in Chapter 3, and the model

extensions by task variants and periodic memory access, introduced in Chapter 7.

Worst Case Task Execution Time

Determining the execution time is often not as easy as it has been for the DCT task of

the example above and a deeper analysis is required. Without knowledge on the tasks’

execution times we cannot develop a scheduling strategy, which allows us to guaranty

that a task will finish before some given deadline. Moreover, in order to guaranty hard

deadlines of tasks we have to design the system for peak load situations; even if in

average the load of the system is low. Hence, we would like to know an upper bound on

the execution time for each task. This is done by a so called worst case execution time

(WCET) analysis, which is another research field within real-time systems and out of

the scope of this thesis (see e.g. [64] p.86ff. and [24]p.411ff.). Rather we briefly discuss

the issues of WCET analysis of hardware tasks executed on reconfigurable architectures

compared to software tasks running on CPUs.

The WCET is mainly influenced by the program flow of the task itself and the micro

architecture of the machine executing the task.

2.2 Hardware Tasks

Program flow: To determine the WCET, the first step is the analysis of the program

flow of the considered task. Usually, it is described by directed graph models

capturing the sequential basic blocks and control decisions. Once the computation

times of the basic blocks are known, the directed graph can be collapsed. For

example a branch is collapsed to one node by assigning the WCET of the longer

path. Loops are replaced by accounting ntimes the WCET of the body, where n

is the bound on the number of loop iterations (loops must be bounded to have a

bounded WCET of the task).

In general, the program flow analysis is independent of whether the task is imple-

mented as hardware or software, but the high amount of concurrency in hardware

tasks might make their analysis more challenging. However, many applications

considered for the execution on reconfigurable hardware are data flow dominated,

such as signal processing tasks, filters, coders and decoders. Due to few control

decisions, such tasks are often simpler to analyze than arbitrary tasks.

Micro architecture: The WCET of the task’s basic blocks can be derived by account-

ing the time required by the clock cycles in case of hardware tasks or by the

instructions in case of software tasks.

Unfortunately, determining the time for an instruction on modern CPUs can be

quite difficult. The data and instruction cache of processors provide enormous

speedups when fetching instructions and accessing data, but this is transparent to

the programming model. To be able to account these speedups when computing

the WCET requires a detailed analysis of the program code together with the

cache model, in order to know which items can be guaranteed to be inside the

cache and which not. This becomes even harder, when a task can be interrupted

by another task which might overwrite some cache values. Also the pipelined

execution of instructions does not provide guaranteed performance, since we have

to consider pipeline stalls.

In case of a hardware task, the programming model (i.e. VHDL) does not abstract

from platform details that influence the number of clock cycles required by a basic

block. Often this number can be directly achieved from the register transfer level

(RTL) description, or results from the scheduling phase of high level architecture

synthesis. In other cases, the vendor of an IP-core reports on the required clock

cycles in the according data-sheets. The maximal clock frequency is reported by

the place and route tools for a specific target device.

We can conclude that for many hardware tasks of our interest, obtaining the WCET

is not a major issue. Analyzing the WCET for arbitrary hardware tasks might be yet

another research topic.

2 Background and Related Work

2.3 Multi-Tasking on Reconfigurable Hardware

This section presents an overview of work in the field of RHD based multi tasking, which

is directly related to this thesis. Additional work, which includes models and methods

that are related to- or applied within the methods of this thesis, is introduced in the

following chapters.

We categorize different models of RHD based multitasking that have been proposed in

related work. The related systems can be classified according to the following properties:

RCP Architecture: Existing or proposed systems are either based on single RHDs,

based on multiple RHDs or are hybrid systems coupling microprocessors with

RHDs.

RHD Resource Sharing Model: Many approaches exist on how systems allow multiple

tasks to share the RHD resources. The most relevant are discussed in detail in

Section 2.3.1.

Task execution model: The system may or may not allow preemptive multitasking.

Application model: It can be distinguished between off-line and online-problems. In

the first case, the entire application parameters such as the tasks and their ar-

rival and execution times are known at design time. For example the application

to be scheduled for execution is modeled in the form of a directed acyclic task-

graph. Typical optimization goals are to minimize the overall execution time or

to minimize the required resources, i.e., the size of the RHD. In online-problems,

application parameters such as the tasks and their arrival times are determined

at runtime. Therefore the system has to determine at runtime, which resources

are assigned to the tasks without knowing which tasks have to be executed in the

future. A typical optimization goal in this scenario is to minimize the average

response time of tasks.

2.3.1 Resource Sharing Models

There exist numerous approaches how multiple tasks share the resources of an RHD.

These are sometimes closely related to the supported reconfiguration features of the

devices, i.e. if the RHD supports unrestricted partial, column-wise partial or only full

reconfiguration. We describe six models; models (a) to (c) are based on full device

reconfiguration and models (d) to (f) are based on partial reconfiguration.

(a) Static Partitioning: If multiple tasks have to be processed by one RHD, today’s

practice would be, using the static assignment partitioning approach, which is

illustrated in Figure 2.5-(a). All system tasks are integrated within one RHD

configuration, which is loaded to the RHD while the system is booting and stay

active until the system shuts down. In this approach, the RHD behaves similar to

an ASIC. The required size of the RHD is determined by the sum of the resource

2.3 Multi-Tasking on Reconfigurable Hardware

Figure 2.5: RHD Resource sharing approaches based on full RHD configuration.

requirements of all tasks. Since no timesharing of RHD area takes place, task

scheduling and sophisticated resource management methods are obsolete. This

greatly simplifies the system implementation. This approach is efficient when

most of the tasks need to be running most of the time. It becomes ineffective

when some tasks have to be executed just once or only sporadically from time to

time, since all existing tasks allocate resources no matter whether they are running

or not.

(b) Non-Space Partitioning: The simplest approach to realize multitasking with re-

source reuse is a time sharing approach without partitioning the area of the RHD

(see Figure 2.5-(b)). The entire RHD resources are assigned to only one task at a

time. To switch from one task to another, the entire device is fully reconfigured.

Either each task runs until completion before the RHD is configured for execution

of another task (non-preemptive multitasking), or tasks can get interrupted and

resumed later to free the RHD for other tasks (preemptive multitasking). Con-

cerning the resource management, this approach is equivalent to a single processor

model and all existing scheduling strategies of that field can be applied. Systems

following this approach are e.g. [58,83,69,84,85].

This model has the drawback, that only one task is executed at a time. Since no

parallelism between tasks can be exploited, the design options are limited. I.e., we

cannot speed up the application by choosing a larger RHD with more resources.

Also the device utilization becomes inefficient if the tasks vary much in their area

2 Background and Related Work

requirement. The required RHD size is determined by the area requirement of the

largest task. We call this internal fragmentation, since each task is enlarged to

the full device size, but leaving resources inside the task boundaries idle, which is

illustrated by black RHD cells in the figures. Furthermore, for the realization of

a quasi simultaneous task execution the RHD has to be reconfigured frequently,

which adds a great time overhead and hence presents another drawback.

is the free space partitioning - time sharing model, which is illustrated in Fig-

ure 2.5-(c). In a certain RHD configuration, the RHD resources are freely dis-

tributed among a set of running tasks. In praxis, several task source files are

synthesized into the same device configuration [58]5[104][49]. The only placement

constraint is that the sum over the area of all tasks from this running set is less

or below the area of the RHD. Timesharing of the RHD is realized by full device

configuration. The example in Figure 2.5-(c) shows task 1, 2 and 3 running. If

task 1 is preempted by task 4, the entire device is reconfigured by a configuration

containing tasks 2,3 and 4.

Using this approach, whenever a new task starts, the device gets reconfigured and

all running tasks have to be interrupted. Hence, the system needs to support func-

tionality to rescue the inner state of the interrupted task during the reconfiguration

process, no matter if preemptive or non-preemptive multitasking is considered.

The main drawback of this approach is that one configuration file has to be cre-

ated off-line for every combination of running tasks that can occur during the

execution of the application. This can result in large memory requirements for

storing the configuration files. In general, the number of configuration files grows

exponentially with the number of tasks, but can be reduced and bounded by a

proper scheduling strategy.

We can overcome this drawback, if the system is able to construct the configuration

data for the RHD out of a set of partial RHD configuration bit-streams during

runtime. This way, only the programming data of each task is stored only once as

a partial RHD configuration bit-stream. Composing a full configuration file out of

a set of partial configuration data requires address relocation such as it has been

implemented in systems described in [52][53] [60].

Another drawback is the high time overhead resulting from the full device recon-

figuration and the interruption of all running tasks whenever the RHD is recon-

figured.

The advantage of this model is the practicability: It is realizable on any full re-

configurable device omitting the problems involved with partial reconfiguration.

Moreover the model has the potential of high device utilization. No task internal

fragmentation occurs, since the area allocated to a task is variable. Also no ex-

ternal fragmentation occurs, in the sense that an additional task can always be

added to the executing tasks, if the sum of the area of all running tasks does not

exceed the device size. The reason is, that the task locations can be rearranged

whenever the device is fully reconfigured.

5uses full configuration, but system consists of nparallel FPGAs and each task requires a≤nFPGAs

2.3 Multi-Tasking on Reconfigurable Hardware

In Section 4.3 we present a scheduling algorithm, which uses this resource model

and keeps the number of configuration files bounded by the number of tasks.

Hence, it overcomes the main drawback of this model. Further, we will use this

resource model and the designed scheduling algorithm for our prototype of an

RHD based real time system in Section 8.

(d) Equal Slot Partitioning: Figure 2.6-(d) shows a widely used resource sharing ap-

proach, where the RHD area is partitioned into a fixed number of mslots of equal

size [101][100][97]. Every running task gets exactly the RHD resources of one slot

assigned, which means that at most mtasks can be in execution in parallel. If a

new task is selected for execution due to termination or preemption of a running

task, the RHD area of the according slot is reconfigured by the configuration data

of the new task. This is done via partial reconfiguration, which keeps the executing

tasks of the other slots unaffected during this process. Concerning the resource

management, this approach is equivalent to a homogeneous multi-processor model.

The major disadvantage of this model is the internal task fragmentation, which

results generally in low device utilization. Since each task has to fit into a slot, the

slot size has to be chosen according to the largest task of the application which

also determines the number of slots for a given device. All other tasks are enlarged

to this slot size.

There are several advantages of this model: Among the models using partial re-

Figure 2.6: RHD Resource sharing approaches based on partial RHD configuration

2 Background and Related Work

configuration, it is the one with the easiest technical implementation. The model

is consistent with the constraints of most of todays partial reconfigurable FPGAs

like XILINX Virtex and VirtexII devices which require the reconfiguration of en-

tire columns of the logic blocks. Systems using this approach are introduced in

[101][100][97]. Also, the management of allocated and free resources has low com-

plexity, since the slots are equivalent and only the number of allocated or free slots

has to be considered. Hence, no placement strategy is necessary. The equivalence

to a homogeneous multi-processor enables the reuse of many scheduling strategies

developed in the research field of parallel systems.

(e) One Dimensional Partitioning: An extension of the equal slot partitioning is the

one dimensional partitioning model, referred to as 1D model from now on. As

shown in Figure 2.6-(e), the tasks are stripes of full RHD array height, but with

variable width and are placed side by side on the device. If a running task ter-

minates or gets preempted, its occupied resources are deallocated. A new task

selected for execution is placed into a free continuous area of adequate width.

This model is called one dimensional partitioning, since each task has a footprint

with one fixed and one variable dimension. Proposed systems using this approach

are for example [59,22,43,99].

Compared to the slot approach (d), this model has the advantage of high potential

device utilization, since the size of tasks can scale according to their resource

requirements (almost no internal task fragmentation). This comes at the cost of

external fragmentation. Situations can occur, where there is enough free space

on the RHD for a new task but not as a continuous stripe. To minimize such

situations and to get good device utilization would require sophisticated placement

strategies as well as management of free continuous stripes. This resource model is

no longer equivalent to a homogeneous multiprocessor. The appearing scheduling

and optimization problems rather have relations to orthogonal packing problems.

Furthermore, the model still fits to the capabilities of today’s partial reconfigurable

FPGAs and design tools: The task footprints are constrained to be stripes of the

device array. However, if the actual task position onto the device is unknown

at design time, the system needs to support task relocation, which brings new

realization challenges but has been successfully implemented by the authors of

[60].

We use the 1D model for our global scheduling method in Section 4.1, which

requires task relocation, and for our partitioned scheduling method in Section 4.2,

which works without task relocation.

(f) Two Dimensional Partitioning: A further extension is the two dimensional parti-

tioning approach, where tasks are modeled as rectangles of variable width and

length which can be freely placed onto the RHD, as shown in Figure 2.6-(f). Run-

ning tasks are exchanged by partial device reconfiguration of the affected area

[38,40,14,41,93,94,42].

This approach enables more flexibility concerning the shape of a task’s footprint.

In the one dimensional approach of model (e), the resource requirements of a task

have to be distributed within a rectangle of full device height - which generally

2.3 Multi-Tasking on Reconfigurable Hardware

results in an unbalanced ratio of rectangle height to width. This can affect the

ability to place and route the task logic and the task’s performance (e.g. maximal

clock frequency) since the path length increases. Also the internal fragmentation

of model (f) is less than that of model (e), since in the latter the smallest unit

of resources is one entire RHD column. However, studies have shown that these

negative effects are relatively small [59].

One major practical drawback of this approach is that most of today’s commercial

available devices do not support this kind of free partial reconfiguration. Some

devices allowing free addressing of the element to reconfigure are the coarse grain

XPP architecture [13] and – to some extend – Xilinx Virtex 4 FPGAs [74]. Also

I/O communication issues occur, when tasks are placed in the middle of the RHD

area with no direct connection to I/O cells. Possible solutions are network on chip

approaches like [19]. Also the algorithmic problems involved to optimize the task

placement and scheduling become more complex. The dimension of the packing

problems increases and in scenarios where tasks arrive online, the management of

free and occupied device resources becomes much more complex than in the 1D

model [14].

The characteristics of the six introduced approaches of resource sharing are summarized

in Table 2.2.

issue \model a) static b) no space c) free space d) slot e) 1D f) 2D

reconfiguration static full full partial partial partial

supported

RHDs

all SRAM based column-wise reconf. free reconf.

model similar

to:

ASIC uniprocessor - multi-

processor

memory

allocation

utilization low low high medium high high

area fragmen-

tation

non high intern non medium

intern

extern;

low intern

extern

reconfiguration

overhead

non high high medium low low

complexity of

placement

non non non low medium high

complexity of

scheduling

non low high medium high high

task interrup-

tion

non for preemp-

tive MT

always for preemptive MT; for de-fragmentation

implementation

complexity

low low medium medium medium high

references [7] [58,83,69,

84,85]

[58,104,49] [101,100,

97]

[59,22,43,

99,60]

[38,40,14,

41,93,94,

42]

Table 2.2: Comparison of resource sharing models

2 Background and Related Work

2.3.2 Task Scheduling and Placement Methods for RHDs

Scheduling and placement of tasks on RHD based systems has recently attracted much

attention in research. The work differs in the considered resource models (e.g. slotted,

1D, 2D, etc.) and the considered scheduling problems (e.g. off-line vs. online task acti-

vation, preemptive vs. non-preemptive, independent vs. precedence constraints). More-

over, all approaches described here do not consider hard real-time tasks but rather mini-

mize cost functions like total make-span or average respond time. Scheduling approaches

for real-time tasks on processor machines as well as on RHD devices are described in

more detail in Section 2.4.

Fekete, Teich et al. studied off-line scheduling and placement of synchronous activated

task with precedence constraints onto an RHD, assuming the 1D and the 2D resource

model. For the problems of minimizing the total make-span for a given device size,

and for the problem of finding the smallest device for a given time-limit, they presented

an optimal method [44] [94]. The central idea is, to represent feasible solutions using

so called packing classes, which can be efficiently represented as interval graphs. A

packing class captures not the absolute task positions in the x, y and time dimension on

the device, but rather captures whether the placement of two tasks overlap in a certain

dimension or not. Hence, the search space is considerably reduced and a branch and

bound method allows finding optimal solutions in reasonable time.

Danne et al. kept up with an extended problem where task variants are considered

[DS05]. The off-line problem of temporal partitioning and temporal placement of data

flow graphs for reconfigurable systems was studied by Bobda in [18].

Bazargan et al. considered the placement of online arriving, independent and non-

preemptable tasks onto an RHD under the 2D model [14]. When a task arrives it is

either directly placed or rejected. The goal is to maximize the fraction of accepted tasks.

For the efficient management of the free space onto the device, two methods have been

proposed which have been evaluated together with several different placement heuristics

such as bottom-left, first-fit and best-fit.

Walder et al. kept up with a similar online problem and improved placement qualities

by allowing task footprint transformations [99]. K¨

oster et al. considered placement of

tasks onto inhomogeneous RHDs, where not all resources are arranged in an array [63].

2.3.3 Dynamic Reconfigurable Systems

Many research projects involve multitasking on RHD based platforms and several proto-

types have been implemented. Often, the systems are specialized for a certain applica-

tion domain and use individual models, making an exact classification and comparison

of these systems difficult. Therefore, we select some representative projects which have

a rather high relation to our work for a brief description. The first three systems fol-

low a fully FPGA reconfiguration approach, whereas the last three systems have been

designed for partial reconfiguration.

2.3 Multi-Tasking on Reconfigurable Hardware

Simmler’s OS: Simmler at el. developed an operating system for preemptive multi-

tasking on FPGA co-processors [83][84][69]. Their system couples a host CPU with an

FPGA co-processor board. Considered applications run on the host CPU and use the

FPGA to accelerate computation intensive tasks. A task running on the FPGA may be

preempted by a higher priority task. Restoring the state of a preempted task before its

resumption is realized by a configuration read-back, extract and write-back approach via

the FPGAs configuration interface.

An interesting feature of the system is the set of several SRAM banks, which are con-

nected via a switch to the FPGA as well as to the host CPU. RAM-banks are assigned

to tasks exclusively, such that their contents do not need to be swapped when a task

switch occurs. While the system presents a feasible prototype of preemptive hardware-

multitasking, its main drawback is that only one hardware-task runs at a time. I.e.,

it implements the resource sharing model (b) Non-Space Partitioning, which has the

disadvantages described above.

SONIC: This project aims to map real-time video applications to a specifically de-

signed reconfigurable computing platform called SONIC, (later called UltraSONIC)

[104]. The platform consists of multiple modules each combining an FPGA with one

local external RAM bank. A bus system connects the modules with a host computer.

The application is specified as a directed acyclic graph (DAG) of tasks where edges

represent data dependencies. The tasks are partitioned into hardware and software

tasks, therefore the system is a truly hybrid system, where the host processor is not

only used for control issues but to execute parts of the application. The hardware

tasks are clustered to FPGA configurations, such that the device size is taken into

consideration. Tasks in the DAG are processed from top to bottom and hardware tasks

are loaded on demand via full FPGA configuration. The aim of the schedule is to

minimize the make span. Although multiple tasks reside on a programmed FPGA,

only one is active at a time due to the limitation of the single port external memory.

(Our work explicitly solves this problem by scheduling RAM accesses of tasks to RAM

banks under real-time constraints as introduced in Section 7.2.) However, in the SONIC

system multiple tasks can run in parallel when mapped to separate hardware modules.

Task communication occurs only at the beginning and end of a task execution and is

done via the local memory. The SONIC system shares FPGA area using model (c) free

space – time sharing. Hence, it is related to our prototype of Chapter 8, where we use

the same resource model. In contrast to our work, no parallel execution of tasks onto an

RHD takes place and the RAM banks cannot be shared by simultaneous executing task.

Furthermore, task preemption is not allowed and deadlines of tasks are not considered.

SPARCS: SPARCS [49] stands for Synthesis and Partitioning for Adaptive Reconfig-

urable Computers. The aim of the project is to map a high level specification application

onto a reconfigurable architecture. In contrast to our RCP architecture, the authors as-

sume a platform hosting multiple FPGAs, each coupled with only one external memory

bank. Throughout an interconnection network, the FPGAs are connected to memory

storing bit-streams, to one (large) shared memory on board, and to I/O devices. The

2 Background and Related Work

application is specified as a directed graph, where nodes can be either tasks or logical

memory buffers. The edges represent either channels or dependencies. Dependencies

among the task represent control flow, whereas channels represent inter-task or task-to-

memory communication.

The high level synthesis involves two basic steps: In the temporal partitioning phase,

the application is partitioned into temporal segments such that these segments can be

executed sequentially considering the task dependencies. Moreover, it is ensured that

the logic cell and memory requirements of each segment do not exceed the platform

resources. Afterwards, in the spatial partitioning phase the tasks of each temporal

segment are divided into partitions, such that each fits within one FPGA and the logical

memories are mapped to the physical board memories. Using standard synthesis tools,

one FPGA programming file for each partition is created. At runtime, a controller

reconfigures the FPGAs according to a global schedule, using full device configuration.

In contrast to our work, in SPARCS each task is contained within one and only one

partition (configuration). Thus, a non-preemptive multitasking model is considered and

each task finishes its processing within one partition. Also, no real-time constraints

are associated with the tasks. Interestingly enough, the SPARCS system is one of the

few that considers several implementation variants for each task that allow to explore

various area-latency tradeoffs of a task. The variants are used to balance the latency of

tasks within a partition. Similarly we consider task variants in our extended task model

of Section 7.1, but in contrast to the SPARC system we consider periodic real-time task

rather than static task graphs.

ESM: The Erlangen Slot Machine (ESM) [20][21] is an FPGA based reconfigurable

platform. It has been specially designed, targeting applications which use partial FPGA

reconfiguration. One main advantage of the system is, that it allows each hardware-task

to access its peripheral resources independent of its location via a crossbar. Also several

types of inter task communication are supported by special communication lines. The

system allows placing tasks relocatable according to the 1D resource approach, and

hence provides an implementation platform for multi-tasking applications.

RAPTOR 2000, REPLICA Kalte et al. developed a rapid prototyping system called

RAPTOR 2000 [62], which is able to host several FPGA modules and supports partial

FPGA reconfiguration. Based on the RAPTOR platform, they developed a framework

for dynamically reconfigurable systems focusing on the 1D resource model. The reloca-

tion of tasks onto the FPGA is supported by the REPLICA [60] filter. RECPLICA is a

hardware circuit, which takes a partial FPGA bit-stream of a task and a target FPGA

location as inputs. The column addresses within the bit-stream are manipulated on

the fly, such that the output bit-stream can be used to configure the task at the desired

position onto the FPGA. Moreover, they implemented task context saving and restoring

by using the FPGA configuration interface [61]. A state extraction filter allows storing

the register content of a task that has just been stopped, by reading the current FPGA

configuration and extracting the state of the flip flops. Later on, when the task has

to be resumed, even at a different position, a state inclusion filter allows merging the

2.3 Multi-Tasking on Reconfigurable Hardware

state information into the configuration bit-stream of the task before it is loaded onto

the FPGA. Hence, the REPTOR 2000 system together with its tools presents a working

prototype and proof of concept for preemptive multitasking on FPGAs under the 1D

resource model. It supports all requirements for the execution model assumed in our

global- and partitioned scheduling methods of Chapter 4.

XFORCES: The Executives for Reconfigurable Embedded Systems (XFORCES) project

aims to investigate and develop an operating system kernel for reconfigurable hardware

devices [100][102][103]. The runtime system loads tasks on the FPGAs. For more

sophisticated dynamically reconfigurable systems, the runtime system also schedules

tasks at runtime and saves the context of preempted tasks. Furthermore, it establishes

communication interfaces between the tasks, no matter whether they are mapped to the

same or different FPGAs.

The XF-board [98] was developed as a platform for the XFORCES OS prototype. It

hosts two FPGAs where on the first one a soft core CPU is implemented that controls the

overall system. The second FPGA is used as a reconfigurable resource for the execution

of hardware-tasks. All I/O devices and memory banks are connected to the left and

right device edges of the second FPGA. Hence, many routing problems are avoided and

the design matches the requirements of a 1D partial reconfiguration model.

To avoid the manual creation of the bus infrastructure for the OS kernel and the tasks,

which is a hard an error prone process, the XFOSgen tool was developed. It generates the

OS logic and communication structure in conjunction with task templates of different

width. Hence, it considerably simplifies the design of a reconfigurable multitasking

application.

This section summarized related work on multitasking of hardware tasks. We introduced

several resource sharing models used in literature, presented work on different scheduling

problems and briefly described prototype systems implementing some kind of hardware

multitasking. In summary, hardware multitasking has obtained much attention by re-

searches, numerous related scheduling problems have been studied and several system

prototypes have been built, proving feasibility of the technique. However, real-time

constraints considerations have been omitted in most of the presented work.

2 Background and Related Work

2.4 Real-Time Scheduling

This section gives a background on those parts of real-time scheduling, which are directly

related to our work.

2.4.1 A Background on RT Systems

Real-time systems are embedded computer systems, that must react within precise time

constraints to events from their environment [25]. The correctness of the system depends

not only on the logical correct computation of results but also on temporal correctness,

i.e. the time at which the results are delivered. A computation is temporally correct if

it finishes within a specified time frame [23].

In a multitasking system, a task Tiis typically modeled by its arrival or release time ri,

its worst case computation time Ciand its timing constraint is modeled by a deadline

di, which represents the time until the task has to complete its execution. Depending

on the consequences of missing a deadline, the literature distinguishes at least between

two classes of real-time tasks:

If a hard real-time task misses its deadline, this can cause catastrophic consequences

on the entire system. For example, a task controlling the airbag system of a car. If it

decides to fire the airbag – and this result is produced too late – it is worth nothing. This

can be modeled by a value function of the task, which equals one before the deadline

of the task and is minus infinity after the deadline of the task. Hence, if only one task

misses its deadline there is no value of the entire system.

If a soft real-time task misses its deadline, the performance of the system decreases but

it can still work correctly. The computed result is not useless, but it is less valuable

than if it would be produced before the tasks deadline. Examples are video systems,

where a frame displayed a little too late may reduce the system quality. In this case the

value function of the task is one before its deadline and decreases monotonically after

its deadline until it reaches a value of zero.

Furthermore, a real-time system can have a static or a dynamic task set. In a static real-

time system, the task set is fixed and all task parameters as arrival times, computation

times and deadlines are known a priori at design time.6Therefore, we design the system

for the worse case scenario and a task schedule can be computed off-line and validated,

whether every deadline is met or not. In a dynamic real-time system, a new task can be

(unpredictable) activated at runtime. Therefore, without further assumptions we cannot

guarantee that all released tasks can be processed within there deadlines. A common

method to deal with unpredictable task activations is to introduce an acceptance test.

When a task is requested at runtime, a procedure checks whether the new task can be

processed within its deadline without jeopardizing the timing constraints of the other

tasks. When this test is negative the task activation is rejected, otherwise it is accepted

and planned for execution. Obviously, the application has to be designed to deal with

6At least worst case assumptions have to be known

2.4 Real-Time Scheduling

task rejections for example by specifying an alternative system behavior.

It is further distinguished between aperiodically released and periodically released tasks.

In the later case, a task is released multiple times in a periodic manner, i.e. the time

between two releases of the task is constant and denoted as the period of the task.

This is a reasonable assumption for many real-time applications. Examples are control

processes of mechatronic systems, where control outputs are computed based on period-

ically sampled sensor values. Often the assumption is made, that the relative deadline

of a periodic task is equal to its period, i.e. the task has to finish its computation before

its next release. Some relaxation of periodic tasks are sporadic tasks, where the period

presents the minimum inter-arrival time of two task instances. Most results for periodic

tasks do also apply to sporadic tasks.

2.4.2 RT Scheduling Problems and Algorithms for Uniprocessor Systems

A task schedule defines at any point in time, the allocation of system resources to the

application tasks. In the simplest model, only the processor of a single-processor system

is considered. In this case, a schedule is a step-function over time, where the value of the

function defines the ID of the task that is executed by the processor at a given point in

time. A schedule is said to be feasible, if all (hard real-time) tasks meet their deadlines.

The construction of a schedule is done by a scheduling algorithm. For static task sets,

this could be done off-line and the schedule could be stored in a dispatcher table. How-

ever, this is only useful in some cases. In most cases, the parameters of a static task set

present only worst-case parameters, such as the worst case computation time of a task.

In average, the computation time may be much smaller. Hence to benefit from early

terminating tasks, most systems use scheduling algorithms online and at runtime, but

verify the feasibility of the scheduling strategy off-line at design time against the worst

case parameters of a task set. Also the memory requirements by the dispatcher table

might be too high.

Another aspect of a scheduling algorithm is, whether it constructs a preemptive or a non-

preemptive schedule. A preemptive schedule is in general more flexible and a feasible

schedule can be computed with less effort than in the non-preemptive case. However,

the system has to support task preemption at a reasonable cost.

Aperiodic Task Sets on Uniprocessors

For aperiodic task sets the earliest deadline first (EDF) algorithm has been proposed

by Horn [51].7It follows a simple rule:

Among all released tasks always the task with the earliest absolute deadline is assigned

to the processor.

Therefore, whenever a task gets activated, it is inserted into a deadline sorted ready-

queue. The task at the head of the queue is selected for execution if currently no task is

7The results of the original reference have been summarized in [25], p. 50 ff.

2 Background and Related Work

in execution or if the deadline of the head task is less than that of the task in execution.

In the latter case, the task in execution is preempted by the head task – therefore EDF

constructs a preemptive task schedule. The EDF schedule can be computed within

O(n2) steps, where nis the number of tasks. EDF have been shown to be optimal in the

sense, that it finds a feasible schedule to a given task set if one exists [36]7. Moreover,

it can be modified to optimally schedule task sets with precedence constraints [28]7.

Finding a feasibly non-preemptive task schedule for tasks with arbitrary arrival times

has been proven to be NP-hard even for the single processor case [68]. Therefore, an

optimal algorithm for the non-preemptive case is computational complex such as the

tree search method of Bartley (see [25] p.58 ff.) which runs in O(n·n!) time. For large

task sets, such runtime is unacceptable and we have to fall back on heuristic approaches,

such as the Spring algorithm [87].

Periodic Task Sets on Uniprocessors

The periodic task model introduced by Liu and Layland in [72] is widely-accepted (e.g.

as a standard model in textbooks [25]p. 71 ff.,[24]p. 399 ff.) and has been probably

the most studied subject in real-time scheduling [82]. In this model, each task has a

known worst case execution time Ciand is requested periodically with a rate Pi. For

the problem of scheduling a task set to a uniprocessor system, two particular algorithms

have received the major attention.

Rate Monotonic (RM): RM is a fixed priority assignment schema. It assigns priorities

to the tasks according to increasing request rates and this assignment stays fix during

runtime. I.e. the task with the smallest period gets the highest priority. At runtime,

always the active task with the highest priority is selected for execution.

It has been shown, that RM is optimal among all fixed priority assignment schemas [72].

However, that is only for fixed priority assignment schemas and there are cases, where

RM does not find a feasibly schedule, even if one exists. In general, RM is not able

to utilize the processor up to 100%. Liu and Layland [72] provided a sufficient but not

necessary scheduling test by providing an upper bound to the processor utilization:

i=1

Pi≤n·21/n −1,and for arbitrary large n: lim

n→∞

i=1

Pi≤ln 2 (2.2)

According to Equation 2.2, all task sets that accumulate the processor by less than

about 69% will be feasibly scheduled by RM. The test is pessimistic, e.g. some task sets

do not satisfy Equation 2.2 but are still schedulable by RM. However, the test is tight

in the sense that there exist still cases where the task set utilizes the processor only by

ln 2 + which cannot be scheduled by RM.

Better scheduling tests for RM, that are less pessimistic have been proposed. In par-

ticular the so called hyperbolic bound of Equation 2.3 was shown in [77][17] (see [25] for

2.4 Real-Time Scheduling

details).8

i=1 Ci

+ 1≤2 (2.3)

Earliest Deadline First (EDF): The EDF rule as it has been introduced for aperiodic

tasks above can be applied to periodic tasks as well. It assigns the highest priority

always to the active task with the closest absolute deadline. Since the priority of a task

may change at runtime when another task is released or terminates, EDF is called a

dynamic priority assignment schema.

EDF has been shown to be optimal for periodic task sets scheduled on a uniprocessor

system. It is able to schedule all task sets that utilize the processor by not more than

100%. Hence, the exact scheduling condition is given by:

i=1

Pi≤1 (2.4)

Obviously the performance of EDF is much better than that of RM and it is easier

to check, if a particular task set will be feasibly scheduled. However, due to the fixed

priorities, an RM dispatcher is very easy to implement and hence, RM is still widely

used in industrial systems.

A detailed comparison study between EDF and RM for all important aspects has been

presented in [26]. Clearly, EDF is favored over RM, as it has a better performance, is

easier to analyze and usually produces less context switches than RM. For the same

reasons, all of our proposed three scheduling approaches presented in Section 4use EDF

as basis. Only when we schedule the memory read- and write accesses for the RAM

buses in Section 7.2, where scheduling decisions have to be made within one clock cycle,

we fall back to the RM schema. An implementation of an EDF scheduler, which requires

list data-structures, seems not to be adequate for that case.

2.4.3 RT Scheduling Problems and Algorithms for Multiprocessor Systems

The scheduling of tasks on parallel or distributed systems is much more challenging

than scheduling tasks on a single resource. As reconfigurable architectures belong to

the class of parallel systems, we present approaches on the multiprocessor scheduling

problem which are closely related to our work.

At this point, we are not able to survey the large amount of work that has been done on

this topic. We rather introduce the basic methods used for multiprocessor scheduling

8It appears that the bound has been first proven in [77] and successfully used for multiprocessor

scheduling in [78]. Independently, Buttazzo et. al. provided the hyperbolic bound of Equation 2.3 in

[17] and presented further analysis.

2 Background and Related Work

and go into detail, whenever a method is closely related to our proposed methods or has

been used within this thesis.

Definition 1 (periodic multiprocessor scheduling problem).Given a set of periodic tasks

T1to Tneach with a worst case computation time Ciand a period Piand a machine with

mprocessors of equal speed. Find a schedule that assigns the ntasks to the mprocessors

over time such that each task executes for Ciunits within each period. Assumptions:



Task preemption is permitted. A task can be preempted and later on resumed at

no cost.



Task migration is permitted. A preempted task can be resumed at a different pro-

cessor at no cost.



Task parallelism is forbidden. A task can only execute on one processor at a time.

The scheduling approaches to this problem can be categorized into two main classes,

namely partitioned and global (non-partitioned) scheduling [5].

Partitioned Scheduling

In a partitioned schedule, all instances of a task are executed on the same processor.

Hence, the set of ntasks is partitioned into msubsets, each associated with one pro-

cessor. At runtime, each processor executes only its associated task set using some

standard uniprocessor scheduling algorithm such as EDF or RM.

The partitioned scheduling method has attracted much attention for two reasons. The

first is a practical issue: In a partitioned schedule no task migration occurs, which might

be difficult to implement and might introduce much more time overhead than a normal

context switch. Actually, the system can be simply composed out of several unipro-

cessor systems, which are loosely coupled and all implement their own independent

uniprocessor scheduler.

The second reason concerns the theoretical analysis: The multiprocessor scheduling

problem is reduced to a partitioning problem, where each subset of tasks has to be feasi-

bly scheduled by a uniprocessor algorithm. Hence, all results achieved for uniprocessor

scheduling algorithms, such as the scheduling conditions for EDF or RM, can be reused.

We will first discuss the dynamic priority assignment under EDF and afterwards the

fixed priority assignment under RM, which is more complex to analyze.

Dynamic Priority Partitioned Schedulers: Consider that an independent EDF sched-

uler is used for each processor. Since EDF schedules all task sets up to a processor

utilization of 100% (see Equation 2.4), the problem evolves to a classical bin-packing

problem. Each task is an item of size equal to its processor utilization Ui=Ci/Piand

each processor is a bin of size 1. For solving the NP-hard bin-packing problem, we can

use one of the several bin-packing approximation algorithms [29]. Typical choices for

real-time schedulers have been first fit (FF),best fit (BF) and first fit decreasing (FFD)

and best fit decreasing (BFD) [88]. In the later two algorithms the tasks are sorted

2.4 Real-Time Scheduling

according to decreasing utilization values before they are packed to the bins (proces-

sors). It has been shown (see [29]), that the BFD and FFD assignment schemes have

asymptotic upper bounds of 11/9, meaning that the number of processors used by BFD

or FFD is only 1.222 . . . times the number required by an optimal assignment in the

asymptotic case. Hence, these heuristics offer a quite good performance.

We pick up this basic approach of partitioning the task sets and then scheduling each

subset individually by EDF and extend the approach to our task and resource model in

Section 4.2. We show that in that case our scheduling problem is no longer equivalent

to simple bin-packing, but equivalent to two-dimensional level strip packing [73].

Fixed Priority Partitioned Schedulers: The multiprocessor scheduling problem using

the partitioning approach becomes more challenging when the rate monotonic algorithm

is used for each processor. More complex schedulability conditions have to be used to

decide whether a processor (bin) is full or not. For example, the condition of Equation 2.2

can be used, when a first fit decreasing heuristic is used to assign tasks to processors

and each processor uses the RM scheduler. However, the quality of such assignment can

be quite bad: For the RM scheduler in conjunction with an FFD task assignment an

asymptotic bound of 2 has been shown (see [23]), meaning that it may happen that an

assignment requires twice the number of processors compared to the optimal assignment.

In [70][23], the authors proposed an improved assignment schema based on the fact, that

RM achieves better processor utilization than that of Equation 2.2, if the task periods

show some special relation (e.g. harmonic periods). Hence they group tasks with nicely

related periods together - with the result that for light-weight tasks all processors could

be almost fully utilized. Hence, surprisingly the drawback of the low RM performance

in the uniprocessor case can be overcome in the multiprocessor case.

Moreover, in [78] the authors considered the RM algorithm with an FFD assignment,

but used the improved RM scheduling condition of Equation 2.3. They showed, that

due to this improved RM scheduling condition the assignment uses in worst case 1.66 . . .

times the number of processors used by an optimal assignment.

Global (non-partitioned) Scheduling

In a global or non-partitioned schedule, a task is allowed to be preempted on one pro-

cessor and to be resumed on another processor. For architectures where the penalty

of task migration is low, such as multiprocessors with a shared memory, this can be a

reasonable alternative to the partitioned schedules. Due to the higher freedom of the

task to processor assignment, there exist task sets that can be scheduled by a global

scheduler but cannot be scheduled by a partitioned scheduler. The challenge is, to de-

sign good global schedulers that achieve a high scheduling performance and that can be

proven to schedule a particular task set, by means of an efficient computable scheduling

test. Again, we can distinguish between fixed and dynamic assigned task priorities.

2 Background and Related Work

Fixed Priority Global Schedulers: The rate monotonic priority assignment scheme can

be also used for a global scheduler. In global rate monotonic scheduling (GRMS) the

tasks get fixed priorities assigned in order of decreasing task periods. At runtime the

dispatcher selects the mactive tasks with the highest priority for execution on the m

processors.

In [66] an efficient computable scheduling condition for GRMS has been developed and

the performance has been compared to several partitioning schemes based on the RM

scheduler. It turned out that the scheduling condition is rather pessimistic and allows

only about 50% utilization of the processors in average. However, an exact scheduling

test based on simulation over the task sets hyper-period shows, that GRMS feasibly

schedules task sets up to 80% processor utilization. This performance is comparable to

– or only slightly worse than – that of the best partitioning schemes. However, it stays

more robust against variation of task set parameters. The authors showed that GRMS

can outperform partitioning schemes for soft real-time systems, where some deadline

misses can be afforded. Especially when the worst-case computation times of the tasks

are inaccurate, the GRMS profits from the global processor assignment since overloaded

processors take advantage from spare processing power of under-loaded processors.

In [5], the authors studied a different fixed priority assignment schema for global mul-

tiprocessor scheduling. Whereas RM assigns priorities only according to task periods,

their proposed algorithm TkC assigns priorities according to the difference between

task period and task computation time. More specific, the priority of a task Tiis set to

Pi−k·Ci, where higher values mean higher priorities. Based on simulation experiments,

they found out that for a value of k= 1.1, their algorithm achieved the best performance

in average and significant improvements over the GRMS. Moreover, T1.1Ccommonly

outperforms the best RM partitioning algorithms. However, the authors failed to present

an efficient scheduling test for T1.1C. Hence, the main fields of application are still soft

real-time systems.

As mentioned earlier, we use dynamic-priority assignment schemes (especially EDF) for

our scheduling methods, hence the results of the fixed-priority global schedulers are of

minor interest for our work. However, in principle we could replace the dynamic EDF

priority assignment in our global scheduler of Section 4.1 by a fixed assignment such as

RM. In that case, the results of the work mentioned above would become most relevant.

We leave this option for further work.

Dynamic Priority Global Schedulers: The dynamic-priority EDF assignment scheme,

which is optimal for uniprocessors (see Section 2.4.2) and has been successfully applied

in partitioned multiprocessor scheduling, can also be used for global multiprocessor

scheduling.

A global EDF scheduler always executes on the mprocessors the mactive tasks with

the shortest absolute deadlines. Unfortunately, global EDF is no longer optimal. Sev-

eral sufficient polynomial time scheduling tests for global EDF have been developed

[48][47][16][11][10] and surveyed and compared in [16] and [11]. There exist three basic

scheduling tests for global multiprocessor EDF:

2.4 Real-Time Scheduling

1. Goossens et al. [48] developed a utilization based test, using the resource augmen-

tation approach introduced in [81]. It is based on the following idea: A uniform

multiprocessor (each with individual speed) is considered, on which the task set

can be feasibly scheduled by an optimal scheduling algorithm. Then a homoge-

neous multiprocessor is considered (all processors have the same speed) and the

number of processors is increased until it can be guaranteed that the global EDF

scheduler does, at any time, at least the same amount of work on the task set, than

the optimal algorithm on the uniform multiprocessor. It follows that also EDF

feasibly schedules the task set and the following scheduling condition is derived:

Theorem 1 ([48]).A periodic task set Γ = {T1, . . . , Tn}can be feasibly scheduled

by EDF onto an identical multiprocessor of munit capacity processors, if:

i=1

Pi≤UT

max +m·(1 −UT

max) (2.5)

where UT

max is the highest time utilization of any task appearing in Γ.

In Section 4.1, we define global EDF schedulers for our RHD execution model

and develop a scheduling test based on the Goossens et al. [48] multiprocessor

scheduling test. Hence, we skip the details of the test of Theorem 1at this place

and go into more detail in Section 4.1.2, where we discuss the relation between

the multiprocessor scheduling test and our work.

2. Baker presented a sufficient scheduling condition in [10] and later on improved it

in [11]. It is based on the idea that if a task Timisses its deadline, there must be

a certain amount of other tasks that have been executed between the release time

and deadline of Ti. If it is possible to show that the task set cannot generate this

certain amount of load, Tiwill never miss its deadline.

3. Bertogna et al. kept up with a similar idea, that a task must suffer certain inter-

ference by other tasks between its release time and its deadline. Since computing

this interference turns out to be difficult, they present an upper bound to the

interference using the workload generated by the task set. Based on that, they

present a linear time scheduling condition.

The three scheduling tests are surveyed and compared in [16] and in [11]. It turns out,

that they do not dominate each other. I.e. each of the scheduling tests accepts some

task sets that the others do not accept. Hence the best known scheduling test is to

combine all three of them [11] – and each subtest is an important contribution to the

state of the art. Further, it turned out in simulation studies that the scheduling tests

are still very pessimistic – meaning that they reject many task sets even if they would be

successfully scheduled by global EDF [16]. Moreover, the performance achieved by the

global EDF scheduling tests is still short compared to the best partitioned scheduling

approaches [11]. As it has been the case for the fixed priority RM schedulers, it seems

that the global EDF schedulers have their main advantages in soft real time systems,

where they have to deal with inexact worst case task execution times.

2 Background and Related Work

PFAIR Scheduling: The proportionate fairness (PFAIR) scheduling is a global schedul-

ing method proposed by Baruah et al. [12] [86]. Theoretical, PFAIR scheduling is opti-

mal and has a linear-time exact scheduling test. The basic idea is, that tasks should be

executed at a steady rate. To achieve this, the tasks are divided into quantum-length

subtasks. In order to be optimal, the task execution times have to be multiples of the

quantum-length. This results in a unrealistic small quantum-length and unrealistic many

subtasks for general task systems. An option is to round the execution times to defined

quantum-length but that will bring a drawback of decreasing scheduling performance.

The major problem is that, due to the many subtasks, a PFAIR schedule generally in-

troduces much more context switches than other schedulers. As in our execution model

a context switch corresponds to a reconfiguration of the RHD, and the associated time-

overhead is generally higher than that of microprocessors, PFAIR scheduling seems not

to be adequate for reconfigurable architectures. However, studying PFAIR scheduling

in our context might be a subject for future work.

2.4.4 Real-Time Scheduling on RHDs and Similar Architectures

Even if there has been a large amount of research on hardware-multitasking on RHDs

and associated scheduling problems, up to now there has been only few research on

real-time scheduling for RHD architectures.

Steigers et al. Online Schedulers: In [90][89], the authors first describe the design of

an operating system for reconfigurable architectures that is able to execute hardware

tasks according to the 1D model. E.g. each task footprint spans an arbitrary width but

the entire device array height can be placed at an arbitrary horizontal array position.

Further, they assume non-preemptive execution of tasks. The authors consider the

problem of scheduling (and placing) online arriving real-time tasks onto the device.

Each task is characterized by its arrival time ai, its computation time Ci, its deadline di

and its required width wiof device area. Since tasks arrive online and arrival times are

not known in advance, the system implements an acceptance test which either accepts

a task – if and only if the scheduler could find a feasible schedule for this task, or rejects

it. The challenge is to design a scheduling and placement method, such that only a

minimum fraction of tasks get rejected.

Beside a reference method, they propose two scheduling algorithms for the described

problem: The reference method simply checks whenever a task arrives, if there is free

device area to place and to start it immediately. If not, the task is rejected. The

other two methods increase the acceptance ratio by planning the task execution into

future. Starting a task Timay be delayed until time di−Ciand it still will meet its

deadline. The horizon technique keeps a list of free available intervals of device area,

which is an area interval that is currently free or becomes free at some future time and

will not be occupied by any task already planned for future execution. This scheduling

horizon, defined by the currently executing tasks and the tasks already planned for future

execution, is used for searching a place for a newly arriving task. If one is found, the

task is planned and the area is marked as occupied by updating the scheduling horizon.

2.4 Real-Time Scheduling

The computational more complex stuffing method is an improvement, by considering

all free device areas, even if they will be occupied by a planned task at some time

in future. In simulation studies, the authors show that the horizon technique rejects

considerable less tasks than the reference model, but is outperformed by the stuffing

method. Moreover the methods have been extended to a 2D task model, where task

footprints are rectangles that can be arbitrary placed. Up to now, the authors did not

present bounds on the worst case behavior of their scheduling methods. Further studies

considering sophisticated methods to manage the free device area have been presented

in [91].

The work is strongly related to the topic of this thesis. However, it differs from our work

in the following aspects. In [90], online scheduling of aperiodic tasks with unknown

arrival times has been considered. Hence the scheduler has to reject arriving tasks

for which the required deadline cannot be guaranteed. In contrast, we consider the

execution of static task sets, where task have periodic or sporadic arrival times. Our

main goal is to decide at design-time, if a task set can be scheduled or not. Some of

our methods can deal with dynamic task sets as well, but in that case it is decided if a

new periodic task can be added to the system. Another difference is, that we consider

preemptive multitasking while in [90] non-preemptive multitasking has been considered.

RT Schedulers for Hypercubes: Hypercubes are similar to our RHD model in the

aspect, that the amount of occupied computational resources may be different for each

task. Where RHD tasks require a certain amount of device area, hypercube tasks require

a sub-cube of certain dimension. I.e. a task occupies 1,2,4, . . . processors of the total

machine, which have to be organized in a hypercube network.

Babbar et al. studied the problem of scheduling online arriving, independent, preempt-

able tasks onto a hypercube and proposed the Buddy/RT algorithm, which is a straight

forward extension of the Buddy method, and a Stacking algorithm, which is shown

to perform better [9]. Mohaparta kept up with the same problem and improved the

performance with his proposed Deferred Earliest Deadline First algorithm [76]. The

problem of synchronous, independent, non-preemptable hypercube task scheduling has

been studied by Kwon et al. [65]. While the hypercube resource model is quite similar

to our RHD models, the considered scheduling problems are different since we study

periodic real-time tasks within this work.

A real-time scheduling problem, considering tasks with implementation variants as we

do in Section 7.1.1, has been recently studied by Lee et al. [67]. In their model, each

scalable real-time task comes in certain implementations, each for a different number of

processors and hence, has a different computation time. While they have studied the

problem of online arriving aperiodic task scheduling, their resource model is different

from our extended resource model of Section 7.1.1 as well. We require a task to allocate a

continuous amount of RHD resources. In contrast, they assume that any free processors

of the system can be used to execute a parallel task.

2 Background and Related Work

Pellizzonis et al. Allocation of Real-Time Tasks to Hybrid Systems: In [80,79]

Pellizzoni et al. consider the execution of periodic real-time tasks by a system composed

of a CPU and an FPGA. Each task exists in a hardware and a software implementation

and can be migrated between the two devices. An allocation assigns some tasks to the

CPU, where they are scheduled according to EDF, and the other tasks to the FPGA,

where they run (occupy resources) during their entire period. At runtime, new arriving

tasks have to pass an acceptance test. If a task is accepted, it is always assigned to

the CPU. Hence, a reconfiguration of the FPGA, which might add a considerable time

overhead, is not necessary. Later on, the system tries to migrate as much load to the

FPGA as possible, in order to keep the CPU utilization low. During migration, a task

allocates resources on both devices. After a task bits-stream has been configured on

the FPGA, the task computation is handed over from CPU to the FPGA between two

task periods. Since the CPU utilization is always kept at a minimum, the probability

that a new task can be accepted is maximized. In their analysis, the authors present an

acceptance test which not only guaranties that the new task can be executed without

jeopardizing any deadline, but also that the system can later on perform task migrations

which will lead to the new optimal allocation. Further, they present a method to select

the tasks to migrate from the CPU to the FPGA and vice versa.

This work is strongly related to our work since it considers periodic hardware tasks.

However, it differs greatly due to the following aspects: Each task exist in a hardware

and software configuration, but only the software tasks are scheduled at runtime. All

tasks assigned to the FPGA run permanently and no device area is shared over time.

Hence, hardware tasks are non-preemptive and they allocate resources during their entire

period, even if their computation time is much less than their period. This may lead

to a low utilization of the FPGA. Rather than considering the scheduling of hardware

tasks onto a reconfigurable device, their work concentrates on the dynamic allocation

from tasks to a CPU and FPGA.

2.5 Chapter Conclusion

This chapter has provided the background and related work for this thesis. We have dis-

cussed the architecture and reconfiguration features of today’s RHDs and how they are

employed to process hardware-tasks. Related work on multi-tasking of hardware-tasks

on RHDs has been introduced in Section 2.3. We classified six resource sharing models

used in literature, three based on full- and three based on partial device reconfiguration.

Our resource model of Chapter 3will cover only two of them, the most advanced model

based on full device reconfiguration and the 1D partial reconfiguration model. The 2D

model will not be considered directly, since it has the discussed disadvantages in com-

plexity and practicability. Further, we have summarized research on scheduling and

placement of hardware-tasks and introduced several projects, which have implemented

dynamically reconfigurable systems providing multitasking environments. Hence, we

conclude that multitasking of hardware-tasks on reconfigurable devices is reasonable

and feasible.

2.5 Chapter Conclusion

Finally, in Section 2.4 we provided the background in real-time scheduling. We discussed

related work in multiprocessor scheduling and showed that the basic results and methods

known for processor based systems have not been developed for reconfigurable devices

so far. Closing this gap is a major aim of this thesis.

Based on the gathered background, the next chapter will define our basic task and

resource model and the problem of scheduling tasks with real-time constraints to recon-

figurable devices.

2 Background and Related Work

CHAPTER 3

Problem Modeling and Metrics

This chapter introduces the task and resource models and the basic notations used

within this work. The general scheduling problem is defined. Utilization metrics are

introduced that allow us to capture the load generated by a task set. Furthermore, some

basic necessary conditions for the schedulability of task sets are derived.

3.1 Task and Resource Models

We consider real-time applications that consist of a set of independent aperiodic or

periodic tasks which have to be scheduled for execution such that each task meets its

deadline. The processing device is an RHD which provides resources to execute several

tasks in parallel. Moreover, we consider preemptive multitasking, i.e. the execution of

a task may be preempted by a higher priority task and resumed later on.

Our particular interest is in the widely-accepted and well-studied periodic task model

introduced by Liu and Layland in [72]. This simple task model is not just one among

many task models in real-time systems. It is used as a basic model in well-established

textbooks such as [25](p. 71 ff.) and [24](p. 399 ff.). Moreover, the model has been used

widely by researchers, such that [72] became one of the most cited research papers in

computer science.1

However, before we come to the periodic tasks, we need to introduce the notations of

aperiodic tasks, called jobs from now on, first.

1With over 1500 citations, [72] is rated as second most cited document by the CiteSeer scientific

literature digital library of 2005.

3 Problem Modeling and Metrics

Aperiodic Real-Time Tasks: An aperiodic real-time task is called job and denoted by

Ji. Each job Jiis characterized by its release time ri, its deadline di, its (worst case)

computation time Ci, and by the amount of required reconfigurable logic resources -

referred to as the job’s area - Ai. These parameters are illustrated in the scheduling

diagram of Figure 3.1, where an upward arrowhead presents the task’s release time and

a downward arrowhead presents its deadline.

0 1 2 3 4 5 6 7 8 9 10 11 12

device area

time

Figure 3.1: Parameters of a single aperiodic job

A job can be in several states. We call a job active when it has been released but not

yet terminated. An active job is in one of two states, running or ready. A job is running

after it has been dispatched by the scheduler and executes on the RHD. A ready job is

waiting in a ready queue to be selected for execution (Figure 3.2). A job Jiterminates

at the latest after it has been running for Citime units. It may terminate earlier, since

Cipresents an upper bound on its execution time. The finishing time of a job Jiis

denoted by fi. A set of jobs Iis said to be feasibly scheduled by some algorithm onto

some execution platform, if and only if fi≤diholds for all jobs.















Figure 3.2: States of a job and a periodic task respectively

The notations for arbitrary (aperiodic) real-time tasks are summarized in the following

list:

Idenotes a set of jobs

Jidenotes a job and refers to some concerned computation to be done

Cidenotes the worst case computation time of Ji

Aidenotes the area requirement of Ji

ridenotes the absolute release or arrival time of Ji

3.1 Task and Resource Models

didenotes the absolute deadline of Jiuntil which the computation has to be done

sidenotes the start time, when job Jibegins its execution

fidenotes the finishing time of Ji

RHD Resources: In this work, we consider an RHD Has execution platform. Hoffers

a certain amount of computational resources A(H) which is also referred to as the area

of the device. A(H) may correspond to the number of configurable logic blocks of an

FPGA or to the processing elements of a coarse-grain reconfigurable architecture. At

any point in time the RHD Hcan be reconfigured to execute an arbitrary set of jobs

R⊆I, if all jobs in Rare active and fit together on the device:

Ji∈R

Ai≤A(H) (3.1)

As we consider preemptive multitasking, an executing job Jimay be preempted by a

higher priority job and later on resumed. The time overhead associated with preemption

will be neglected in our first scheduling analysis but will be analyzed in Chapter 6.

This job and resource model presents a generalization of the (simple) homogeneous

multiprocessor model. Each instance of a set of real-time jobs to be scheduled onto an

mprocessor machine can easily be expressed using our model: The area of each job is

set to 1 and the area of the device His set to A(H) = m.

Periodic Real-Time Tasks: As mentioned before, our main interest is in applications

with period real-time tasks. A periodic task, often called just task from now on, is

denoted by Tiand refers to some computation which has to be performed periodically.

As shown in Figure 3.2, a periodic task does not leave the system after termination but

goes in idle state and waits for its next release which is triggered by a periodic timer.

Considering a periodic task set Γ, each periodic task Ti∈Γ represents a sequence of

jobs, alternatively called task instances, which are the actual objects to be scheduled

for execution on the considered RHD. Ti,j denotes the jth instance of the task Tiand

has the same form as an aperiodic job:

Ti,j(Ci,j, Ai,j, ri,j, di,j) (3.2)

The according periodic task Tiis described by four parameters which are used to define

the parameters of all of its task instances Ti,j:

Ti(Pi, Di, Ci, Ai) (3.3)

The relation between task and task instance parameters is illustrated in Figure 3.3, which

shows an example scheduling diagram of a task Ti(Pi= 5, Di= 4, Ci= 3, Ai= 3).

The computation time Ciand area Aiof the periodic task directly define the computa-

tion time and area of all task instances. Hence, Ci,j =Ciand Ai,j =Ai.

3 Problem Modeling and Metrics

Ti,1

ri,1

0 1 2 3 4 5 6 7 8 9 10 11 12

device area

time

di,1

fi,1

si,1Ci

di,2

Ti,2

ri,2ri,3

fi,2

si,2

Ti= (Pi= 5, Di= 4, Ci= 3, Ai= 3)

Figure 3.3: Parameters of a periodic task

The task period Pidefines the release time of the task instances Ti,j. For simplification,

we assume that the first instance of each task is released at t= 0, i.e. ri,1= 0. We

differentiate between periodic and sporadic task activations.

Periodic task activation: The task instances Ti,j are requested at a fixed periodic rate.

I.e. the interval between two releases of the task is fixed and given by its period

Pi. Therefore, the release times are given by:

ri,j = (j−1) ·Pi:j≥2 (3.4)

Sporadic task activation: For a sporadic task, the parameter Pipresents a minimum

inter-arrival time. The distance between two task requests may be as short as Pi,

but can be longer as well. Therefore, the release time ri,j of Ti,j is at least:

ri,j ≥ri,j−1+Pi:j≥2 (3.5)

The relative deadline Diof a periodic task defines the absolute deadlines of the task

instances Ti,j. The deadline of a task instance is generally given by its release time plus

the relative deadline:

di,j =ri,j +Di(3.6)

In our scheduling analysis, we restrict this general form and distinguish two cases:

Implicit deadline: (Di=Pi) Most often, we assume that the relative deadline is equal to

the period of the task. I.e. the deadline of one task instance is equal to the release

time of the next instance. This simplifies our scheduling analysis. Whenever we

do not explicitly specify a relative deadline Difor a task, we assume that it is

equal to the task’s period.

3.2 Feasible Schedule

Constrained deadline: (Di≤Pi) In this more general form, the relative deadline is less

or equal to the task’s period. I.e. the deadline of one task instance can be at some

time before, but not later than the release time of the next task. Some of our

results can be easily extended to consider this more general form of task sets.

The notations used for periodic real-time tasks and their instances are summarized in

the following list:

Γ denotes a set of periodic or sporadic tasks

Tidenotes a generic periodic task of Γ

Ti,j denotes the jthe instance of task Ti.

Pidenotes the period of Ti,

Didenotes the relative deadline of Ti

Cidenotes the computation time of Tiand all its task instances

Aidenotes the area requirement of Tiand all its task instances

ri,j denotes the absolute release or arrival time of Ti,j

di,j denotes the absolute deadline of Ti,j

si,j denotes the start time, when Ti,j begins its execution

fi,j denotes the finishing time of Ti,j

3.2 Feasible Schedule

Generally speaking, we would like to schedule and execute a set of jobs I, such that

each job meets its deadline. In the case that we consider a periodic task set Γ the jobs

of Iare the task instances Ti,j of the periodic tasks.

Formally a schedule can be described by a function2R:R+→P(I). R(t) denotes the

set of jobs from Irunning at time t. A schedule Ris called feasible if each job Jiis

scheduled for execution for at least Citime units within the interval given by its release

time riand its deadline di. In order to be feasible on a particular device H, the schedule

R(t) must keep at any time t≥0 the bound on device resources given by Equation 3.1.

An example task set Γ∗of four tasks and a feasible schedule are shown in Figure 3.4.

The table specifies the period, computation time and area of each task. The Uivalues

will be explained in the following section. The scheduling diagram presents a feasible

schedule of Γ∗on a device with an area of A(H) = 8. While the time-line shows the

release times and deadlines of the tasks, the lower part of the figure illustrates which

tasks instances are in execution at a certain moment. The vertical dashed lines denote,

that the task instance T3,1is preempted at t= 4 and resumed at t= 6. The feasibility

of the schedule can be easily proven: It is obvious that for the time interval 0 ≤t≤12

each released task instance terminates before its deadline and that the device resources

are not exceeded. This interval of length 12 has a special meaning, since it presents

the hyper-period of the task set Γ∗. The hyper-period is the least common multiplier

of the periods of a task set. The pattern of task releases repeats with the hyper period

2P(I) denotes the power set of I

3 Problem Modeling and Metrics

of a task set. Hence, a feasible schedule defined over the hyper-period can be repeated

infinity times without any missed deadline.

TiPiCiAiUT

iUS

T14 2 4 1/2 2

T26 5 2 5/6 5/3

T312 3 6 1/4 3/2

T412 2 2 1/6 1/3

P1.75 5.5

0 1 2 3 4 5 6 7 8 9 10 11 12

T1,T2,

T3,T4 T2 T1,T2,

T3,T4T1 T1

1,1

2,1

3,1

1,2

3,1

2,2

1,3

4,1

Figure 3.4: Example task set Γ∗and a feasible schedule.

Scheduling challenges: We are interested in developing various scheduling algorithms

and finding answers for each algorithm to the following questions:

1. Given a task set Γ, can we execute it on a reconfigurable device Hwith area A(H),

such that all deadlines are met?

2. What is the size of the smallest device on which Γ can be feasibly executed?

3. How does this feasible schedule Rlook like?

3.3 Utilization Metrics

We define two utilization metrics to measure the computational load generated by a

periodic task Tior a set of tasks Γ, respectively. Similar to the processor utilization

factor defined in single processor real-time scheduling, we define the time-utilization

factor of a task Tito be:

UT(Ti) = Ci

(3.7)

It presents the fraction of time a certain task occupies the device in order to complete

its execution. Respectively we define the time-utilization factor for the complete task

set Γ to be:

UT(Γ) = X

Ti∈Γ

(3.8)

For the special case that all tasks are executed sequentially, UT(Γ) is the fraction of

time the RHD spends executing tasks whereas 1 −UT(Γ) is the idle time.

3.3 Utilization Metrics

Since a running task usually does not occupy the entire device, we define a system-

utilization metric that captures the degree by which the device is utilized by Ti:

US(Ti) = UT(Ti)·Ai(3.9)

In order to measure the load generated by a task set Γ, we now define the total system-

utilization of a task set by summing up the system utilization factors of all tasks:

US(Γ) = X

Ti∈Γ

US(Ti)·Ai(3.10)

Since in general we consider devices Hwith an arbitrary area A(H), we define the

relative system-utilization as the fraction a certain task set Γ utilizes a particular device

URS(Γ, H) = US(Γ)

A(H)(3.11)

Often we simply consider devices Hof size A(H) = 1. In this case, there will be no

difference between USand URS. Whenever it is clear from the context that we consider

the utilization of a task Ti, we may use the notation UT

iinstead of UT(Ti) and US

instead of US(Ti) respectively. The time and system utilization factors of the example

task set Γ∗are shown in the table of Figure 3.4.

Schedulability Conditions: The question whether a given task set Γ can be feasibly

scheduled onto a device Hby any algorithm, or if it will be feasibly scheduled by a

particular algorithm, is not trivial to answer. Obviously some necessary conditions are,

that:



Each task must have a time-utilization not greater than 100%:

UT(Ti)≤1 : ∀Ti∈Γ



Each task must fit onto the device:

Ai≤A(H) : ∀Ti∈Γ



The relative system utilization must not be greater than 100%:

USR(Γ, H)≤1

However, these conditions are only necessary and not sufficient. In particular there exist

task sets with an infinitesimal small system-utilization, which cannot be scheduled by

any algorithm. As example consider a task set Γ = {T1, T2}with P1=P2= 10 where

the first task has a computation time of C1=and an area of A1=A(H) while the

parameters of the second task are C2= 10 and A2=. The system utilization of

Γ computes to lim→0US(Γ) = 

10 A(H) + = 0. However, since both tasks cannot be

executed in parallel the task set cannot be feasibly scheduled, as illustrated in Figure 3.5.

3 Problem Modeling and Metrics

T1,1

T1, T2

0 1 2 3 4 5 6 7 8 9 10 11 12

device area

time

T2,1

T1, T2T1, T2

A(H)

T1,2

T2,2



Figure 3.5: Scheduling Anomaly: An infeasible task set with infinitesimal system uti-

lization

3.4 Chapter Conclusion

This chapter has defined our model of real-time applications for execution on reconfig-

urable devices. In particular we defined the notations for aperiodic jobs and periodic

task sets. The execution model allows task preemption. The resource model has been

kept rather simple: A task requires a certain (scalar) amount of logic resources and a

device can execute tasks in parallel as long as its resources are not exceeded. This simple

model generalizes the common multi-processor models, where each task requires exactly

one processor. As another simplification we neglected the time overhead associated with

the device reconfiguration. However, it will be included into our analysis in Chapter 6.

For our simple task model, we have defined the properties of a feasible task schedule.

Moreover, utilization metrics that capture the load of a periodic task set have been

defined. On this basis, the next chapter will propose various scheduling strategies. The

goal is to guaranty at design time, that a particular task set will be feasibly scheduled

by a certain algorithm.

CHAPTER 4

Three Scheduling Algorithms

Within this chapter, we introduce three main methods to schedule periodic task sets to

RHDs according to the model described in the preceding chapter, namely global EDF

scheduling,partitioned EDF scheduling and server based scheduling.

1. The idea of global scheduling is, to manage the resources globally, and dynamically

assign ready tasks to free resources whenever they become available. As a conse-

quence, a task might get assigned different resources over runtime. For example

in case of a multiprocessor machine a task may be started on one processor and

be resumed on another. While this method leads to very good resource utilization

in average, it is rather difficult to predict the schedulability of a task set.

In Section 4.1 the multiprocessor global EDF scheduling method is adapted to our

task model in a straight forward way and yields two global scheduling algorithms,

namely EDF-NextFit and EDF-First-k-Fit. We develop a scheduling test based

on the task set parameters that can be evaluated in linear-time. In our analysis we

prove, that both algorithms produce a feasible schedule when the test conditions

are satisfied.

We presented these algorithms in [DP05a,DP05c] and the scheduling analysis in

[DP06a].

2. The idea of partitioned scheduling is to partition a task set and to assign each

partition a fraction of the platform resources off-line. In case of a multiprocessor

machine this means that a particular task is always executed on the same processor.

In Section 4.2 we adapt the multiprocessor partitioned EDF scheduling method to

our resource model. We formulate the problem of finding a feasible partitioning of

a given task set. We discuss the relation to two-dimensional packing problems and

apply two methods to solve it. The first is called optimal-partitioned-EDF and

uses integer linear programming (ILP) to solve the problem to optimality. Since

4 Three Scheduling Algorithms

solving the ILP takes exponential time, the second method is a simple and fast

heuristic called Next-Fit-Deceasing-Area. For the heuristic we develop and prove

a scheduling condition which can be evaluated in linear time.

Most of our results on partitioned scheduling have been presented in [DP06b].

3. The idea of server based scheduling is to execute application tasks by (virtual)

server tasks. While periodic server tasks are usually used to serve the requests of

aperiodic application tasks, we present a totally different and novel approach.

In Section 4.3 we present a method to schedule periodic tasks onto an RHD by

means of periodic servers. Tasks are grouped to servers for parallel execution but

the computation of tasks may be distributed among many servers. We present

a heuristic called Merge-Servers-Distribute-Load (MSDL) that creates a set of

servers in such a way, that all application tasks are guaranteed to meet their

deadlines. Furthermore, the number of servers is kept small. As a result, each

server can be compiled to a separate RHD programming file (bit-stream) which

makes the approach feasible for all reconfigurable devices and not only for partial

reconfigurable RHDs.

We presented our results on server bases scheduling in [DP05a,DP05c].

Hence the chapter extends the two popular multiprocessor scheduling approaches to our

RHD model and presents a novel approach especially designed to avoid the restrictions

of partial device reconfiguration.

4.1 Global Scheduling

In a global EDF schedule for a multiprocessor platform, all active tasks are sorted

according to non-decreasing deadlines in a global ready-queue. If mis the number

of processors, the first mtasks of the ready-queue are selected for execution. As a

consequence, a task might be started on one processor, becomes preempted by a higher

priority task and, later on, be resumed on another processor. While this method can

lead to very good resource utilization, it is rather difficult to predict the scheduleability

of a task set.

However, within this section we adapted this method to our task and resource model

and define two global scheduling algorithms. We develop a scheduling test based on

task set utilization parameters that can be evaluated in linear-time. In our analysis we

prove, that both algorithms produce a feasible schedule when the test conditions are

satisfied.

4.1.1 Earliest Deadline First on RHDs

The Earliest Deadline First rule has been proven optimal for single processor systems

and has also been successfully used in global (non-partitioning) multiprocessor schedul-

ing [81,48]. In multiprocessor systems, the number of active jobs that can be executed

in parallel depends only on a platform parameter, i.e., the number of processors. Apply-

ing EDF to our RHD execution model, we find that the number of active jobs that can

be executed in parallel shows a more complex relation between the platform parameter

A(H) and parameters of the job set, Ai. Therefore, we see at least two possibilities

to define EDF for our RHD model, which we call EDF-First-k-Fit and EDF-Next-Fit

respectively. EDF-First-k-Fit executes jobs according to the following definition:

Definition 2 (EDF-First-k-Fit).Let Q= (J1, . . . , Jm)be the list of all active jobs, w.l.g.

sorted after non-decreasing deadlines. EDF-First-k-Fit selects at any time the first kjobs

R={J1. . . Jk}of Qfor execution, with the largest kfor which PJi∈RAi≤A(H)holds.

In addition to the formal definition, we present a pseudo code implementation of the

EDF-FkF dispatching procedure in Algorithm 1. It is called by the scheduler each time

a new job is activated or a running job terminates.

Note, that according to this definition EDF-First-k-Fit (EDF-FkF) creates a preemptive

schedule, since a newly activated job Jimay cause a preemption if it has a shorter

deadline than any of the running jobs. Moreover, EDF-FkF executes tasks strictly in

deadline order in the sense that for any pair of active jobs, Ji∈R, Jj/∈R⇒di≤dj.

It follows that EDF-FkF is priority driven according to the definition given in [48].

The EDF-FkF algorithm may leave some RHD resources idle, even if there are ready

jobs that would fit onto the device in addition to the first kjobs selected for execution.

In order to exploit these resources, we define an enhanced algorithm called EDF-Next-Fit

(EDF-NF):

4 Three Scheduling Algorithms

Algorithm 1 Earliest Deadline First - First k Fit

Require: list Qof active tasks, sorted by non-decreasing absolute deadlines

1: procedure EDF-NF(Q, H)

2: R← ∅

3: Arunning = 0

4: k= 1

5: while Arunning +Ak≤A(H) and |Q| ≥ kdo

6: R←R∪Jk

7: Arunning ←Arunning +Ak

8: k←k+ 1

9: end while

10: return R

11: end procedure

Definition 3 (EDF-Next-Fit).At any time, the set of running tasks Ris determined

according to the following algorithm: Start with R← ∅ and scan all active jobs Ji∈Qin

order of non-decreasing deadlines. Jiis added to Rif and only if PJk∈R∪JiAk≤A(H).

Again, we present a pseudo code implementation of the EDF-NF dispatching procedure

in Algorithm 2.

Algorithm 2 Earliest Deadline First - Next Fit

Require: list Qof active tasks, sorted by non-decreasing absolute deadlines

1: procedure EDF-NF(Q, H)

2: R← ∅

3: Arunning = 0

4: for i←1,|Q|do

5: if Arunning +Ai≤A(H)then

6: R←R∪Ji

7: Arunning =Arunning +Ai

8: end if

9: end for

10: return R

11: end procedure

A part of the EDF-FkF and EDF-NF schedules for an example task set is shown in

Figure 4.1. The difference between the two schedules lies in the execution of T4,1.EDF-

FkF schedules T4,1not before T3,1has terminated (hatched task), whereas EDF-NF

schedules T4,1as soon as there are idle resources (speckled task).

Lemma 1. EDF-NF is superior to EDF-FkF in the sense that if a set of jobs Iis

feasibly scheduled by EDF-FkF, the same is true for EDF-NF.

Proof Sketch. Let R(t) and Q(t) be the set of running and active jobs at time tin

the EDF-FkF schedule and ˜

R(t)˜

Q(t) in the EDF-NF schedule, respectively. We can

4.1 Global Scheduling

TiPiCiAiUT

iUS

T14 2 4 1/2 2

T26 5 2 5/6 5/3

T312 3 6 1/4 3/2

T412 2 2 1/6 1/3

P1.75 5.5

012345678910 11 12

T1,T2,

T3,T4 T2

T1,T2,

T3,T4T1 T1

T1,1

T2,1

T3,1

T1,2 T3,1

T2,2

T1,3

device area

T4,1 T4,1

Figure 4.1: A schedule for Γ∗by EDF-FkF vs. EDF-NF

observe that for an equal set of active jobs, EDF-NF will execute at least the same jobs

as EDF-FkF, i.e., if Q=˜

Qthen R⊆˜

R. Further, if the active jobs in the EDF-NF

schedule become a subset of those in the EDF-FkF schedule due to earlier terminations,

each job that is active in both schedules and executed by EDF-FkF will also be executed

by EDF-NF, i.e., if Jk∈˜

Q⊆Qthen Jk∈R⇒Jk∈˜

Since both schedulers start with identical sets of active jobs, Q(0) = ˜

Q(0), and experi-

ence the same job activations over time, it follows that ˜

Q(t)⊆Q(t) for all t > 0 and

every job Jkscheduled by EDF-NF will have a finishing time smaller or equal to the

finishing time when scheduled by EDF-FkF.

4.1.2 Relation to Multiprocessor EDF

The scheduling test we will present for EDF-NF is related to an earlier reported schedul-

ing test for EDF upon multiprocessors. In fact, when all RHD tasks have equal area

Ai=Aconst, our execution model merges into a multiprocessor with bA(H)

Aconst cproces-

sors. Therefore, we first introduce the utilization-based scheduling test for identical

multiprocessors presented by Goossens, Funk and Baruah [48]:

Theorem 2 ([48]).A periodic task set Γcan be feasibly scheduled by EDF onto a

multiprocessor of midentical unit capacity processors, if:

UT(Γ) ≤UT

max +m·(1 −UT

max) (4.1)

where UT

max is the task with the highest time utilization appearing in Γ.

The proofs for this theorem and its tightness can also be found in [48]. We provide an

example here that illustrates the scheduling test using a geometrical explanation and in

addition proves the tightness of the bound of (Equation 4.1). Assume five tasks to be

scheduled on an m= 4 processor machine. T1to T4have equal periods of Pi= 6 and

4 Three Scheduling Algorithms

equal computation times of Ci= 2, respectively. Task T5has also a period of P5= 6

and a computation time of C5= 4. As EDF may schedule tasks with equal deadlines

in arbitrary order, Figure 4.2 presents a valid EDF schedule. From time 2 on, T5is

the only active task which results in idle times for three processors. An infinitesimal

increase of the computation times of tasks T1to T4to Ci= 2 + 6

4εwould cause task T5

to miss its deadline. The theorem condition is tight, since:

∀ε > 0 : UT(Γ) = 2 + ε>UT

max +m(1 −UT

max) = 2 (4.2)

Theorem 2limits the maximum utilization a feasibly schedulable task set can have in

dependence of the maximal utilization UT

max and the number of processors m. This

limit can be derived by accumulating the utilization-rectangles shown in the utilization

diagram on the right side of Figure 4.2. The task with the highest time utilization

occupies a 1 ×UT

max rectangle, while all other tasks together occupy a m×(1 −UT

max)

rectangle. Adding both terms derives the upper bound on U(Γ) shown in Theorem 2.









  







  





























 









Figure 4.2: EDF multiprocessor schedule

4.1.3 Schedulability Analysis for EDF-FkF

This section presents a scheduling test for the EDF-FkF algorithm applied to a periodic

task set executing on an RHD. It is a sufficient but not necessary condition, i.e., it will

reject any task set that EDF-FkF fails to schedule on the considered RHD, but may also

reject task sets that could indeed be feasibly scheduled by EDF-FkF. Since EDF-NF is

superior to EDF-FkF, the theorem holds for both algorithms.

Theorem 3 (EDF-FkF scheduling condition).Any periodic task set Γcan be feasibly

scheduled by EDF-FkF onto an RHD Hwith area A(H)≥Amax, if:

∀Tk∈Γ : (4.3)

US(Γ) ≤(A(H)−Amax)·1−UT(Tk)+US(Tk)

4.1 Global Scheduling

where Amax is the largest area of all tasks in Γ.

The proof of Theorem 3follows a resource augmentation approach [81] and is closely

related to the multiprocessor scheduling test of [48] which we have just discussed in

Section 4.1.2. We will proceed in three steps:

1. We construct a theoretical multi-RHD machine π, on which a given task set Γ can

be feasibly scheduled by some algorithm OPT (Section 4.1.3.1).

2. We calculate the required size of a (unit-speed) single-RHD machine Hsuch that

any algorithm with a property called α-work-conserving, applied to the task set Γ

does at any time at least as much work on the task set Γ as the algorithm OPT

on the multi-RHD machine (Section 4.1.3.2).

3. We prove that EDF-FkF fulfills the α-work-conserving property and show, that

it produces a feasible schedule for Γ on H(Section 4.1.3.3).

4.1.3.1 Machine Definitions

First, we refine the RHD model by defining a speed factor for an RHD. While real

devices do have different speed grades, we use speed here purely as a means to construct

the proof for Theorem 3.

Definition 4 (RHD).An RHD His a processing device with area A(H)and speed

S(H). It can execute a set of jobs Rsimultaneously, iff PJi∈RA(Ji)≤A(H). If a job

Jiis in execution on Hfor tunits of time, it completes S(H)·tunits of its execution

requirement, i.e., of its computation time Ci. We define the computing capacity of H

as Cap(H) = A(H)·S(H).

Definition 5 (multi-RHD).A multi-RHD πis a set of RHDs {H1, H2, . . . }, each with

its own area A(Hj)and speed S(Hj). At each point of time, each RHD Hjcan execute

its individual set Rjof jobs, iff PJi∈RjA(Ji)≤A(Hj)for all Hj∈π. The capacity of

πis defined as Cap(π) = PHj∈πCap(Hj).

Now we are able to define a specific multi-RHD machine where for each task an RHD

is reserved that matches exactly the task’s area and has a speed which fully utilizes the

device:

Definition 6 (feasible-multi-RHD).For a given periodic task set Γ, we define a multi-

RHD πwith capacity Cap(π) = US(Γ) such that for any task Ti∈Γthere is an RHD

Hj∈πwith A(Hj) = Aiand S(Hj) = UT(Ti). Algorithm OP T assigns each task Tito

its corresponding RHD Hj.

By construction, all task instances will meet their deadlines under OPT on this multi-

RHD machine.

4 Three Scheduling Algorithms

4.1.3.2 Work Done by Algorithms

A uniprocessor scheduling algorithm is work-conserving, if it never idles the processor

while there are any active jobs left in the ready queue. In the same way, [48] extended

the definition to identical multiprocessors: a scheduling algorithm for identical multi-

processors is work-conserving, if it never idles any processor while there are any active

jobs in the ready queue. By these definitions, EDF is work-conserving for both the

uniprocessor and the identical multiprocessor model.

We now apply the concept of work-conservation to RHD machines. In an RHD machine

the utilization of the device depends on the areas of the jobs. Therefore, we base our

definition of work-conserving on the minimal utilized RHD area:

Definition 7 (α-work-conserving).An algorithm for scheduling jobs onto an RHD His

said to be α-work-conserving, iff it utilizes at least a fraction αof the device area A(H)

whenever there are active jobs waiting in the ready queue. In an overload situation

there are tasks in the ready queue waiting for execution. In contrast, in an underload

situation all active jobs are running.

As a next step, we define a function Wthat captures the amount of work done on a

given job or job set by some algorithm on some machine.

Definition 8 (Work-done function).A job with computation time Ciand area Airep-

resents Ci·Aiwork. If the job has been executed for ttime units on an S(H)speed

RHD, the work that has been done on this job is t·S(H)·Ai. Let Idenote any set of

jobs and πany multi-RHD platform. For any algorithm alg and time instance t≥0,

let W(alg, π, I, t)denote the amount of work done on jobs of Iover the interval [0, t),

when Iis scheduled by alg on π.

The following lemma states that on a specific single RHD H, an α-work-conserving

algorithm will never do less work than the algorithm OPT on the reference platform π

of Definition 6.

Lemma 2. Let Γbe a periodic task set with at least two tasks and UT

max <1. Let Ibe

the related set of jobs produced by Γ. Let πand OPT be the multi-RHD machine and

scheduling algorithm according to Definition 6. Further, let Hbe a single RHD with

speed S(H) = 1. If

∀Hk∈π:α·A(H)≥Cap(π)−Cap(Hk)

1−S(Hk)(4.4)

the work done on Iuntil any point in time t≥0by any α-work-conserving algorithm

αWC on RHD His never less than the work done on Iby algorithm OPT on the

multi-RHD π:

W(αWC, H, I, t)≥W(OPT, π, I, t) (4.5)

Proof. Suppose that while Equation 4.4 holds, in contradiction to Lemma 2there exists

a time instance t0such that

W(αWC, H, I, t0)< W(OPT, π, I, t0),(4.6)

4.1 Global Scheduling

i.e., algorithm αWC does less work on Hthan algorithm OPT on π. Then there must

exist at least one job Tk,j with release time rk,j < t0that has not finished execution at

time t0, and during the interval [rk,j, t0) the algorithm αWC did strictly less work than

algorithm OPT.

According to Definition 7the single RHD machine is either in overload or underload

situation. Note that a fully utilized RHD with an empty ready queue denotes an under-

load situation. Let xdenote the portion of time in interval [rk,j, t0) where the system is

overloaded, and ythe portion where it is underloaded.

Considering job Tk,j during time interval [rk,j, t0), we can observe that:



Algorithm αWC executes Tk,j for at least ytime units, i.e., it does at least y·Ak

work on Tk,j.



Algorithm OPT executes Tk,j for at most x+ytime units, i.e., OPT does at most

(x+y)·S(Hk)·Akwork on Tk,j.



By our contradiction assumption, OPT performs more work on Tk,j than αW C:

(x+y)·S(Hk)·Ak> y ·Ak(4.7)

Now we consider the work done on the entire set of jobs Iduring interval [rk,j, t0). We

can observe that:



The overall work done by algorithm αWC on His at least x·α·A(H)+y·Ak. During

overload the RHD is utilized by at least a fraction of αand during underload at

least Tk,j must be executing since it has not finished until t0.



The overall work done by algorithm OPT on πis at most (x+y)·Cap(π), which

corresponds to a full utilization of all RHDs.



By our contradiction assumption, OPT performs more work on Ithan αWC:

(x+y)·Cap(π)> x ·α·A(H) + y·Ak(4.8)

Now we show that Equations 4.7 and 4.8 cannot be true, if Equation 4.4 in Lemma 2

holds. To that end, we multiply Equation 4.7 by (α·A(H)

Ak−1) and add it to Equation 4.8:

(α·A(H)

Ak−1) ·(x+y)·S(Hk)·Ak+ (x+y)·Cap(π)>

(α·A(H)

Ak−1) ·y·Ak+x·α·A(H) + y·Ak(4.9)

[(α·A(H)

Ak−1) ·S(Hk)·Ak+Cap(π)](x+y)> α ·A(H)(x+y) (4.10)

4 Three Scheduling Algorithms

Cap(π)−Cap(Hk)> α ·A(H)·(1 −S(Hk)) (4.11)

Equation 4.11 contradicts the condition of Equation 4.4. Hence, the proof assumption

must be wrong and Equation 4.5 holds which proves the lemma.

4.1.3.3 EDF Feasibility

As the last step, we will show that EDF-FkF is an αWC algorihtm.

Lemma 3. EDF-FkF is an α-work-conserving algorithm, with

α= 1 −Amax

A(H),(4.12)

where Amax is the largest of all task areas.

Proof Sketch. The lemma states that in an overload situation, at most Amax area of the

RHD stays idle. Assume, in contradiction, that EDF-FkF selects the first kjobs out of

the list of active jobs Q=J1, . . . , Jmfor execution, such that the amount of free area

A(H)−Pk

i=1 Aiwould be greater than Amax. In this case, the next ready job Jk+1

could also be executed in addition to the first ktasks as its area is bound by Amax.

Then, the assumption must be wrong.

Proof of Theorem 3.We have made the following observations:

1. For the special case, that Γ contains a task Tkwith UT(Tk) = 1, the condition of

Theorem 3simplifies to US(Γ) ≤US(Tk) and can only be fulfilled for task sets

with only one task. Obviously, this task will always meet its deadlines.

2. For the general case we know from Definition 6, that for any task set Γ with

max <1 there exists a multi-RHD πwith Cap(Hk) = US(Tk) on which Γ can be

feasibly scheduled by OP T.

3. We know from Lemma 2that the work done until some time t≥0 by any α-work-

conserving algorithm on His at least as much as the work done by OP T on π.

The area of the single RHD His determined according to Equation 4.4.

4. We know from Lemma 3that EDF-FkF is α= (1−Amax

A(H)) -work-conserving. Hence

Equation 4.4 holds for EDF-FkF.

5. We know by definition that the work done by EDF-FkF is always on the most

urgent jobs, i.e., the ones with the closest deadlines. It remains to show, that as

the work done by EDF-FkF is i) never less than the work done by OPT on πand

ii) always done on the most urgent jobs, all jobs must meet their deadlines under

EDF-FkF on H. Let Ibe the set of jobs indexed by decreasing EDF-FkF priority

(non-decreasing deadlines). Considering only the scheduling of {J1},EDF-FkF

4.1 Global Scheduling

clearly meets the deadline since it does at least the same work until d1than

the algorithm OPT. Assume the EDF-FkF schedule of the set {J1, . . . , Jk} ⊂ I

is feasible. The EDF-FkF schedule of the increased set {J1, . . . , Jk+1}will be

equivalent to the former one, except for job Jk+1, thus J1to Jkwill still meet

their deadlines. Since by time dk+1,EDF-FkF has done at least the same work

than OPT, also Jk+1 meets its deadline. It follows by induction, that all jobs of I

meet their deadlines under EDF-FkF on H.

The condition of Theorem 3is derived from Equation 4.4 by replacing Cap(·) by US(·),

S(·) by UT(·), Hkby Tkand, finally, substituting αaccording to Equation 4.12.

4.1.3.4 Example Illustrating Tightness of the EDF-FkF Scheduling Condition

This subsection gives an example that illustrates an EDF-FkF schedule and shows that

the condition of Theorem 3is tight.

TiPiCiAiUT(Ti)US(Ti)

T120 11 + 20ε3 0.55 + ε1.65 + 3ε

T220 11 + 20ε2 + ε0.55 + ε1.1 + 2.55ε+ε2

T320 20ε3ε3ε

T420 9 1 0.45 0.45

3.2 + 8.55ε+ε2

Table 4.1: Example task set ˆ

Table 4.1 shows a periodic task set ˆ

Γ to be scheduled by EDF-FkF onto an RHD Hwith

A(H) = 8. Let ε << 1 be a small, but positive number. An example schedule is shown

in Figure 4.3. Note that job T4,1does not start execution before T3,1has started and,





 

















     



 

 

Figure 4.3: EDF-FkF schedule of ˆ

4 Three Scheduling Algorithms

therefore, will miss its deadline by εtime units. Evaluating the condition of Theorem 3

for all kshows that task T4is critical, i.e., for k= 4 any infinitesimal positive εbreaks

the scheduling condition.

∀k:US(Γ) ≤(A(H)−Amax)·(1 −UT(Tk))+US(Tk)

k= 1 : 3.2+8.55ε+ε2≤3.85 −2ε

k= 2 : 3.2+8.55ε+ε2≤3.35 −2.45ε+ε2

k= 3 : 3.2+8.55ε+ε2≤5−2ε

k= 4 : 3.2+8.55ε+ε2>3.2

4.1.4 Conclusion on global EDF

We have proposed two adaptations of the global EDF scheduling rule for our application

model, namely EDF-FkF and EDF-NF. The schedulers are called global, since a certain

job is allowed to execute at any location on the RHD and moreover, after preemption

it may resume its execution at a different location. We have been able to develop a

scheduling condition, which guaranties that all tasks of a periodic task set will hold

their deadlines. The condition can be evaluated in linear time and is valid for both

algorithms. How well the algorithm performs will be analyzed in Chapter 5.

The next section will introduce our second scheduling method, which is an adaptation

of partitioned multiprocessor scheduling approaches to our considered RHD execution

model.

4.2 Partitioned Scheduling

In the context of multiprocessors, a non-partitioned or global schedule is one, where

different instances of a periodic task can be executed on different processors and even

preempted task instances may migrate to another processor before resuming execution.

In contrast, in a partitioned schedule, all instances of a task are executed on the same

processor. In this section we apply the concept of partitioned scheduling to our RHD

execution model. We define a partitioned schedule and the according problem of finding

the least resource consuming partitioning, which still guarantees a feasible task sched-

ule. To solve this problem, we apply integer linear programming as well as a heuristic

approach.

4.2.1 Partitioned Scheduling on RHDs

The concept of partitioned scheduling can be applied to our RHD execution model. We

present the following definition:

Definition 9 (partitioned schedule).A schedule Rof a set of jobs Iis said to be

partitioned by χ, if the following statements hold:

1. χ={G1, G2, . . . , Gm}is a partitioning of I. That is, χis a set of disjoint subsets

of I, called partition-blocks, such that the union of all partition-blocks in χresults

in I.

2. At any point in time, at most one job of each partition-block is in execution on

the reconfigurable device. This can be formally expressed by:

∀t≥0 : R(t)∈G1×G2×···×Gm(4.13)

Each partition block is exclusively assigned a certain area of the reconfigurable device.

The area requirement of a partition block A(Gj) is determined by its largest task; the

required overall device area A(χ) is given by the accumulated area over all partitions:

A(Gi) = max

Jk∈Gi

(Ak), A(χ) = X

Gi∈χ

A(Gi) (4.14)

Most of the time we are interested in pure periodic task sets Γ. In order to apply results

from uniprocessor scheduling of periodic task sets to our model, it is most useful to

consider the case that all instances of a periodic task belong to the same partition-

block. Hence we use the following definition, when considering periodic task sets:

Definition 10 (periodic partitioned schedule).If the jobs of Iare instances of periodic

tasks, a schedule Rpartitioned by χis said to be periodic partitioned, if for each periodic

task Tiall of its instances Ti,j belong to the same partition-block.

4 Three Scheduling Algorithms

We use a simplified notation for the partitioning χof a period partitioned schedule R: If

for instance all task instances of T1and T2belong to the partition-block Gi, we simply

write Gi={T1, T2}rather than writing Gi={T1,1, T1,2, . . . , T2,1, T2,2, . . .}. Also we use

the notation T1∈Girather than T1,j ∈Gi. Hence, Gidenotes a set of periodic tasks,

rather than a set of instances of periodic tasks.

Since each partition-block presents an independent task system of its own, we can easily

formulate a test whether a given partitioned schedule is feasible by applying the basic

results from single-processor EDF scheduling theory:

Lemma 4 (EDF feasible periodic partitioned schedule).Let Rbe a schedule of a periodic

task set Γwith periodic partitioning χ, where the tasks of each partition Gi∈χare

scheduled separately by EDF. The schedule is feasible on Hif:



A(χ)≤A(H), i.e., all partitions fit onto the device and,



∀Gi∈χ:UT(Gi)≤1, i.e., all partitions have a time utilization of less then 100%.

Proof. The tasks of each partition-block can be separately scheduled by EDF, same as

in a uniprocessor system. Since the EDF uniprocessor utilization bound holds for all

partitions, all deadlines are met.

TiPiCiAiUT(Ti)US(Ti)

T14 2 1/2 1/2 1/4

T26 5 1/4 5/6 5/24

T312 3 3/4 1/4 3/16

T412 2 1/4 1/6 1/24

1.75 0.69

Table 4.2: Example task set Γ∗

To illustrate an example of an EDF periodic partitioned schedule, consider the example

task set Γ∗of Table 4.2. Let some algorithm partition the task set to two partition-

blocks, e.g. χ={G1={T1, T3, T4}, G2={T2}}. The left-hand side of Figure 4.4 shows

the partitioning χand illustrates the division of the device area. The according parti-

tioned EDF schedule Ris shown on the right-hand side of Figure 4.4. Both partition-

blocks are scheduled using a separate EDF scheduler. Since both partition-blocks have

a time utilization of less than 100% the schedule is feasibly:

(UT(G1) = 1/2+1/4+1/6≤1, UT(G2) = 5/6≤1) (4.15)

Note, that in the schedule, the task T4cannot execute in parallel to T1even if there

would be enough free area, since by Definition 9no more than one task of each partition

can execute simultaneously.

Based on Lemma 4we are able to define an optimization problem in order to find

optimal partitioned schedules. To this end, we formulate the problem of finding the

least resource-consuming EDF Feasibly Partitioned Schedule (EFPS):

4.2 Partitioned Scheduling

  



 



 





 









 







 







Figure 4.4: Example partitioning of Γ∗and the according schedule

Definition 11 (EFPS problem).Given a periodic task set Γ, find:



a partitioning χof Γ,



subject to ∀Gi∈χ:UT(Gi)≤1,



minimize A(χ)

Solving the EFPS problem answers directly the question of the smallest device on which

Γ can be executed. Γ can be feasibly scheduled on any device where A(H)≥A(χ)

holds. From the resulting partitioning χ, we can easily derive the scheduling function

Rby applying a separate EDF scheduler to the tasks of each partition-block.

4.2.2 Relation to 2-Dimensional Packing Problems

The EFPS problem (Definition 11) is equivalent to the Two-Dimensional Level Strip-

Packing Problem (2LSP)[73], which is a variant of the Two-Dimensional Strip Packing

(2SP) problem. In 2SP, a set of rectangular items is packed into a strip of given width

but infinite height such that the height of the required strip is minimized. The packing

must be non-overlapping, orthogonal, and without rotation of the rectangles. In the

2LSP variation of 2SP, the rectangular items have to be packed in rows forming levels.

The bottom of the first level is the bottom of the empty strip. The bottom of each other

level is determined by a horizontal cut line on top of the highest item in the previous

level. Within a level, items are not allowed to be packed on top of each other.

In Definition 11, the tasks Tican be modeled as rectangular items where the time utiliza-

tion factor UT(Ti) corresponds to the rectangle width and the area A(Ti) corresponds

to the rectangle height. The width of the strip is set to one, according to the maximal

allowed time utilization of an EDF schedule. Each slot of the reconfigurable device –

and the set of tasks Giassigned to it – corresponds to one level of the packing. Finally,

4 Three Scheduling Algorithms

the required device area corresponds to the height of the strip. See an example of tasks

drawn as rectangles and packed in levels in the utilization diagram on the left side of

Figure 4.4.

In the following we will study two approaches to solve the EFPS problem: The first

one is to solve EFPS optimally by Integer Linear Programming (ILP), which seems to

be reasonable since efficient ILP models which have been successfully applied to 2LSP

problems of interesting size have been proposed recently [73]. Since solving an ILP

cannot be done in polynomial time, our second approach is a simple heuristic which has

been proposed to solve two-dimensional packing problems. We analyze the algorithm

in the context of our general scheduling problem described in Chapter 3, and derive

a utilization-based scheduling condition which allows a qualitative comparison to our

other approaches.

4.2.3 Optimal Partitioning by ILP

We can solve our EFPS problem optimally by coding the problem as an integer linear

program and apply a standard solver tool. In an ILP model, we formulate the problem

by defining a set of integer decision variables, a set of linear constraint equations and a

linear cost function. Formally, an ILP can be formulated in the following form:

Given a rational matrix A, a rational constraint vector b, and a rational cost vector c,

determine

{max c·x|A·x≤b, x is integer}.(4.16)

Within this work, ILP-models are never written in the form of Equation 4.16, but use

a more readable form were the cost function and the multiple constraint equations are

separated according to their meaning.

On modeling the EFPS problem, a straight forward way would be to introduce binary

decision variables xl,i that indicate, weather task Tiis packed into partition-block Gl.

The constraint that time-utilization of a partition-block is bounded by 1 (see Defini-

tion 11) could be expressed by Pn

1UT(Ti)·xl,i ≤1 for all l∈ {1, . . . , n}. However, using

such simple coding of the decision variables leads to two problems:



There is no straight forward way to model the cost function.



The model contains much redundancy. E.g. it is irrelevant for the EFPS solution, if

some set of tasks is packed into partition-block 1 or into partition-block 2, but this

results in different configurations of the ILP decision variables. Such redundancy

in coding can cause a large increase in time required by the ILP solver tool.

Therefore, we adopt the slightly more complicated but more efficient ILP model for 2LSP

proposed by Lodi at el. [73]. According to the model of Lodi, we make the following

basic assumptions without loss of generality:

4.2 Partitioned Scheduling

Order of tasks: The tasks Ti∈Γ, i ∈ {1, . . . , n}are sorted and numbered by non-

increasing area, such that:

A(Ti)≤A(Ti+1), i = 1, . . . , n −1.(4.17)

Order of partition blocks: The task Tihaving the largest area within a partition-block

is said to initialize the partition-block. The index iis also used to index this

partition-block. I.e., if the largest task in a partition-block is Tl, the partition-

block is denoted by Gl.

It follows, that a partitioning χhas n=|Γ|potential partition-blocks G1. . . Gn

(some maybe empty), and that the partition-blocks are sorted by non-increasing

areas, i.e. A(Gl)≥A(Gl+1). Further, the area of a partition-block is equal to the

area of the task that initializes the partition-block, i.e., A(Gi) = A(Ti).

For any task Ti∈Γ we can therefore differentiate the following cases:



Tiinitializes a partition-block, i.e. Ti∈Gi, or



Tiis packed within a partition-block with greater area, forcing the partition-block

index to be smaller than i. I.e. Ti∈G1∪···∪Gi−1.

The assumptions made are without loss of generality but reduce the search space con-

siderably, since a task Ticannot be assigned to an arbitrary partition block Gl.

When modeling the ILP, the binary decision variables xl,i are used to indicate whether

task Tiis packed into partition block Glor not:

xl,i =(1 if Ti∈Gl

0 if Ti/∈Gl

l={1, . . . , n}, i ={1, . . . , n :i≥l}(4.18)

Index lranges from 1 to n, since there can be at most npartition-blocks. Index iranges

from 1 to n, but due to the observations made above, iis always greater or equal to l.

Therefore, the ILP model uses n·(n+ 1)/2 binary variables.

Note, that the variables xl,i with l=iindicate, whether Glis initialized or not. If

xl,l = 1 the partition-block Glis said to be initialized and its largest task is Tl. If

xl,l = 0, Glis uninitialized and empty:

A(Gl) = (A(Tl) if xl,l = 1

0 if xl,l = 0 l={1, . . . , n}(4.19)

The final binary ILP model is given in Equation 4.20 and has three main components:

1. The cost function accumulates the area required by all non-empty partition-blocks,

hence minimizing the overall area A(χ).

4 Three Scheduling Algorithms

2. The second equation ensures that each task is packed into one and only one

partition-block.

3. The third equation enforces EDF schedulability. Each non-empty partition-block

must have a time-utilization less than or equal to 1, i.e.,UT(Gl)≤1. Note,

that this equation also ensures, that no task can be packed into an uninitialized

partition-block, i.e., into a partition-block with xl,l = 0.

The final ILP can be written as:

min

l=1

A(Tl)·xl,l

l=1

xl,i = 1, i ∈ {1, . . . , n}

i=l

UT(Ti)·xl,i ≤1·xl,l, l ∈ {1, . . . , n}

(4.20)

Let nbe the number of tasks in Γ, then the complexity of the corresponding ILP model

accounts to n·(n+ 1)/2 binary variables and 2 ·nconstraints. Such model complexity

is practical. We were able to optimally solve most problem instances with a size of n

up to 30 in less than 15 seconds, to provide a rough appraisal. To solve the ILP model,

we employed the lp solve library [15] version 4.0 on a 2.8 GHz Pentium 4 machine.

Example 2. To illustrate the ILP model, Figure 4.5 visualizes an example partitioning

of a task set Γ = {T1, T2, T3, T4}, that has been partitioned into two partition-blocks,

G1={T1, T2, T4}and G2={T3}. The according variable configuration is shown in the

table. T1initializes partition-block G1and therefore x1,1= 1. The other tasks T2, T4in

G1imply, that x1,2= 1 and x1,4= 1.T3initializes partition-block G3, thus x3,3= 1.

No tasks have been assigned to partition-blocks G2and G4, therefore they do not exist

in χ. Note, that the variables marked with “-” are not needed, since by convention each

partition-block Glcan only contain tasks Tiwith index i≥l.

The presented ILP (Equation 4.20), which has been adopted from [73], allows us to solve

the EFPS problem and hence find the optimal partitioned schedule for a periodic task

set. We will analyze and compare the performance of this approach to other schedulers

in Chapter 5. Moreover, this ILP model will be the basis for an extended ILP, which

allows us to solve the enhanced scheduling problem, where numerous implementation

variants for each task are considered (see Section 7.1.2).

However, since the ILP model is not practicable for task sets with n >> 30, we analyze

a fast polynomial time heuristic in the next section.

4.2 Partitioned Scheduling





 









l xl,1xl,2xl,3xl,4

11101

2 - 0 0 0

3 - - 1 0

4 - - - 0

Figure 4.5: Example of partitioning χand the according decision variable configuration

4.2.4 Next-Fit-Decreasing-Area Partitioning

This section presents and analyzes a simple heuristic in the context of our EFPS prob-

lem. The goal is twofold: On one hand, we want to be able to apply the partitioned

scheduling approach to task sets of large size, which is impractical using the ILP model.

On the other hand, as we will see, the analysis of the heuristic will help us in the

quantitative comparison of the proposed scheduling methods.

To solve the two-dimensional packing problems 2SP and 2LSP, Coffman et al. [30]

studied the Next-Fit-Deceasing-Height algorithm. It first sorts all rectangles according

to non-increasing height. The first rectangle is packed left-justified on the bottom of the

container (first level). The next rectangle is packed on the right of the former one, if

the width of the container is not exceeded. Otherwise, the level is closed by a horizontal

line on top of the tallest rectangle. A new level is created on top of the last one, and

the rest of the items are packed in the same manner.

In [30] was shown, that for any instance of a 2SP problem, i.e. any set of rectangles X,

the packing height obtained by the Next-Fit-Deceasing-Height algorithm (denoted by

NFDH(X)) is bounded. The bound is given by

NFDH(X)≤2·OPT(X)+1,(4.21)

where the height of the rectangles is normalized to 1 and OPT (X) denotes the height

of an optimal packing of X.

Since a packing computed by the Next-Fit-Deceasing-Height algorithm is always done

in levels, we can use it to compute a periodic partitioned schedule that is feasible by

EDF. Since the height of the rectangles corresponds to the area of the periodic tasks,

the algorithm is called Next-Fit-Deceasing-Area (NFDA) and shown in Algorithm 3.

The NFDA algorithm has a time complexity of O(nlog n), since sorting the tasks is the

most time consuming part. Hence it can be used to compute the partitioning of very

4 Three Scheduling Algorithms

Algorithm 3 Next-Fit-Decreasing-Area

1: procedure NFDA(Γ)

2: sort and index after non-increasing area(Γ)

3: k←1

4: Gk← ∅

5: for i←1,|Γ|do

6: if UT(Gk) + UT(Ti)≤1then .EDF utilization condition

7: Gk←Gk∪Ti.add to current level

8: else

9: k←k+ 1 .open new level

10: Gk← ∅

11: end if

12: end for

13: return χ← {G1, . . . , Gk}

14: end procedure

large task sets. In Section 5.2 we will analyze the scheduling performance of NFDA on

a set of simulation benchmarks.

Since NFDA has such low time complexity, a utilization-based scheduling condition is

not really necessary for use in praxis. If we want to test whether NFDA can feasibly

schedule a certain task set Γ onto an device H, we simply compute the partitioning and

check whether it exceeds the device area or not.

However, a utilization based test condition is useful for the further analysis. It gives us

a more general comparison to other algorithms. In particular, our concern is the com-

parison of the performance guaranteed by NFDA-partitioned-EDF to the performance

guaranteed for the global EDF schedulers by Theorem 3. It turned out in our simulation

experiments, that all task sets accepted by the global EDF scheduling test of Theorem 3

where also feasibly scheduled by NFDA-partitioned-EDF. Proving this dominance, is the

main motivation for the development of a utilization based test condition for NFDA.

The previously reported analysis and the achieved bound of Equation 4.21 is not useful

to provide a scheduling test, since the packing achieved by OP T(X) is an arbitrary

packing and not a packing in levels. Hence, OPT (X) does not provide a valid solution

for our partitioning problem. Our utilization based scheduling condition for NFDA is

presented in the next theorem:

Theorem 4 (NFDA-partitioned-EDF scheduling condition).For a periodic task set Γ

and RHD H, an EDF feasible periodic partitioning is found by the NFDA heuristic if:

US(Γ) ≤(A(H)−Amax)·(1 −UT

max) + US

max (4.22)

where Amax,UT

max,US

max are the largest area, time-utilization and system-utilization of

any tasks in Γ.

Based on Theorem 4we will be able to to prove the following theorem, which says that

the NFDA-partitioned-EDF scheduler always outperforms the global EDF scheduling

test developed in Section 4.1.

4.2 Partitioned Scheduling

Theorem 5 (NFDA-partitioned-EDF vs. EDF-FkF).For every periodic task set Γ

accepted by the EDF-FkF scheduling test of Theorem 3an EDF feasibly partitioned

schedule is found by the NFDA heuristic.

Proof of Theorem 4.Consider an arbitrary partitioning χof a task set Γ, which has

been computed by NFDA. We derive a lower bound on the total system utilization

US(χ) of the tasks within partitioning χ, as a function of the required device area A(χ)

and some other task set parameters:

US(χ)≥f(A(χ), Amax, UT

max, US

max) (4.23)

Replacing A(χ) by an actual device area A(H), the function f(···) presents an upper

bound on the task set’s system utilization that is guaranteed to be feasibly packed:

US(Γ) ≤f(A(H), Amax, UT

max, US

max) (4.24)

To illustrate the system utilization packed in the various partition-blocks (also called

levels), Figure 4.6 shows an example partitioning by NFDA. Since tasks are sorted with

respect to non-increasing areas before being partitioned, it follows that A(Gl+1)≤A(Gl)

and that the smallest task in a level lhas at least area A(Gl+1). Moreover, let xldenote

T

1 T

3

T

4

1/2 1

A(H)

U

T

5

T

6 T

7

T

8

A(G

1

)

A(G

2

)

A(G

3

)

A(G

4

)

x

1

x

2

x

3

x

4

T

2

T

1 T

3

T

4

1/2 1

A(H)

U

T

5

T

6 T

7

T

8

A(G

1

)

A(G

2

)

A(G

3

)

A(G

4

)

x

1

x

2

x

3

x

4

T

2

x

0

=UT

max

x

0

=UT

max

x

5

(a) (b)

Figure 4.6: Example partitioning of NFDA for (a) a task set with UT

max ≤0.5 and (b) a

task set with UT

max >0.5

4 Three Scheduling Algorithms

the fraction of unused time-utilization in level a l. From all tasks in a level l, the first

one has area A(Gl) and the other tasks, if any, have at least an area of A(Gl+1). The

first task in a level lhas at least a time utilization equal or greater then xl−1, since

otherwise the task would have been packed into level l−1. The total time utilization

in level lis 1 −xl. Thus, the amount of system utilization packed into level Glgiven by

(area

time-util.) of the first task + (area

time-util.) of other tasks accounts to: 1

∀l≥2 : US(Gl)≥A(Gl)·xl−1+A(Gl+1)·max{(1 −xl−xl−1),0}

≥A(Gl)·xl−1+A(Gl+1)·(1 −xl−xl−1)(4.25)

The system utilization for level l= 1 is slightly different, since we cannot guaranty any

time-utilization for the first task, i.e. x0is undefined since there is no level below the

first level. Let x0be a value in ]0,1] and US(0, x0) be the cumulative system-utilization

of tasks packed in level 1 from the left boarder up to x0, illustrated by the dotted box

in Figure 4.6. The cumulative system-utilization of level 1 is given by the US(0, x0) +

(area

time-util.) of other tasks, if any. It accounts to:

US(G1)≥US(0, x0) + A(G2)·max{(1 −x1−x0),0}

=US(0, x0) + A(G2)·(1 −x1−x0) + A(G2)·max{(x1+x0−1),0}

| {z }

(4.26)

Note, that %is a correction term and is equal to zero, as long as x0+x1≤1.

Summing up the system utilization of the first level (Equation 4.26) and all other levels,

Equation 4.25 results in the lower bound on the overall system utilization of the entire

packing:

Gl∈χ

US(Gl)≥US(0, x0) + %+

l=2

(1 −xl−2)·A(Gl) (4.27)

The maximal time utilization of a task in Γ, denoted by UT

max, presents an upper bound

to the xlvariables in Equation 4.27. Therefore we can replace them in Equation 4.27

and get a new lower bound on the system utilization in Equation 4.28, which directly

leads to Equation 4.29.2Note, that %= 0 for UT

max ≤0.5, but becomes positive for

max >0.5.

US(χ)≥US(0, UT

max) + %+

l=2

(1 −UT

max)·A(Gl) (4.28)

⇔

US(χ)≥US(0, UT

max) + %+ (1 −UT

max)[

l=1

A(Gl)−A(G1)] (4.29)

1We define A(Gl) = 0 : l > m, i.e. the area of non existing levels above the top level is zero. Hence, the

system utilization of the top level is according to Equation 4.25 given by US(Gm)≥A(Gm)·xm−1.

2In our assumptions x0can take any value from ]0,1]. Hence we are allowed to choose x0=UT

max.

4.2 Partitioned Scheduling

Substituting Pm

l=1 A(Gl)→A(χ) and A(G1)→Amax leads to:

US(χ)≥US(0, UT

max) + %+ (1 −UT

max)(A(χ)−Amax) (4.30)

Now let Tkbe the task with greatest system utilization, i.e. US(Tk) = US

max. We

distinguish between two cases, where Tkcan be packed by NFDA:

1. Tkis packed in the first level within the interval [0, UT

max]. Obviously, it follows

US(0, UT

max)≥US(Tk).

2. Tkis placed only partially within the interval [0, UT

max] of the first level, or placed

in a different level. Then, all tasks packed within the interval [0, UT

max] have an

area greater than or equal to Ak, since tasks are packed by decreasing areas.

a) Level 1 is filled from left to right at least up to UT

max. Hence, the system

utilization is at least as much as that of Tk, i.e. US(0, UT

max)≥Ak·UT

max ≥

US(Tk). This case is illustrated in Figure 4.6-(a).

b) Level 1 is not filled up to UT

max, which can only be the case, if UT

max >0.5. In

this case, level 1 is at least filled up to (1 −UT

max) and hence US(0, UT

max)≥

Ak·(1 −UT

max). But still %=A(G2)·max{(2UT

max −1),0}has to be added,

and is greater than zero in this case. Since Tkis placed in level 2 or above,

A(G2)≥Ak. Adding it up, results in:

US(0, UT

max) + %≥Ak·(1 −UT

max) + Ak·(2UT

max −1) = Ak·UT

max ≥US(Tk)

(4.31)

This case is illustrated in Figure 4.6-(b).

Hence, in all cases we can substitute US(0, UT

max) + %by US

max in Equation 4.30, which

leads to the lower bound on US(χ) and presents the upper bound on US(Γ) of Theorem 4.

US(χ)≥US

max + (1 −UT

max)(A(χ)−Amax) (4.32)

⇔US(Γ) ≤US

max + (1 −UT

max)(A(H)−Amax) (4.33)

Now that we have proven the scheduling condition for NFDA-partitioned-EDF we are

ready to prove that this condition dominates our scheduling condition for the global

EDF schedulers (Theorem 5).

Proof of Theorem 5.The upper bound on the system utilization of the EDF-FkF schedul-

ing condition in Theorem 3is given by:

∀Tk∈Γ : US(Γ) ≤(A(H)−Amax)·1−UT(Tk)+US(Tk) (4.34)

4 Three Scheduling Algorithms

On the other hand, the bound given for the NFDA-partitioned-EDF algorithm in The-

orem 4is given by:

US(Γ) ≤(A(H)−Amax)·(1 −UT

max) + US

max (4.35)

Let w.l.g. T1be the task with the maximal time utilization, i.e. U(T1) = UT

max. Then

the EDF-FkF scheduling condition is bounded from above by:

US(Γ) ≤(A(H)−Amax)·1−UT

max+US(T1) (4.36)

Since US(T1)≤US

max, Equation 4.36 is bounded from above by the NFDA-partitioned-

EDF scheduling condition of Equation 4.35. Therefore, the bound given by the NFDA-

partitioned-EDF scheduling condition is always greater than that given by the EDF-FkF

scheduling condition.

4.2.5 Conclusion on Partitioned EDF

In this section we adapted the partitioned approach for multi-processor systems to our

RHD execution model. A task set is partitioned off-line. At runtime, the tasks of each

partition-block are executed by a separate EDF scheduler on a separated part of the

device area. Hence, the challenge is to find a feasible partitioning of the task set. We

proposed two methods:

1. The optimal partitioning, i.e. the partitioning that requires the least amount of

resources but still guarantees all deadlines, is computed via integer linear pro-

gramming. This method is feasible for task sets containing up to about 30 tasks.

2. In order to partition sets of >> 30 tasks, we studied the simple Next-Fit-Decreasing-

Area heuristic. We derived a utilization based scheduling test, which is important

for the analytic comparison of the global EDF vs. the partitioned scheduling ap-

proaches. In particular we showed that all task sets accepted by our global EDF

scheduling condition will also be scheduled by NFDA-partitioned-EDF.

It would be interesting, to apply further heuristics, such as the First-Fit-Decreasing-

Height (FFDH) which has been shown to perform always better than NFDH [30]. How-

ever, we omit a deeper analysis of FFDH, since for our analytic comparison to the global

EDF schedulers our results from the NFDA analysis have been sufficient. In particu-

lar we showed with Theorem 5that NFDA-partitioned-EDF schedules all the task sets

accepted by our global EDF scheduling test (and even more). A detailed analysis of

the performance achieved by the partitioned schedulers follows in Chapter 5. The next

section will introduce our third and last scheduling method.

4.3 Server-Based Scheduling

Periodic servers are a well-known concept in real-time scheduling [25](p. 111 ff.). They

are usually used to serve aperiodic requests of application tasks in a system mixed of

periodic and aperiodic tasks. Here we only keep the basic idea, that a periodic server

executes application tasks, and hence use the same term server. However, we do not

challenge the execution of mixed periodic / aperiodic task sets, but use periodic servers

to create a feasible schedule for the periodic hardware tasks of our execution model.

Within a server, several application tasks execute in parallel. The aim is to find servers

that can be sequentially executed on an RHD and hence no partial reconfiguration is

required. This makes the method applicable for a wide set of reconfigurable devices.

We present a scheduling technique called Merge Server Distribute Load (MSDL). To

construct a schedule, the MSDL algorithm uses the concept of server tasks, or briefly

servers. A server is a periodic task that reserves execution time and RHD area for other

tasks. We define a server as Si= (Vi, Pi, Ci, Ai), where Vi={Ta, Tb, . . . } ⊆ Γ is a set of

tasks for which execution time and area is reserved. Pi,Ci,Aidenote the period, the

computation time and the area of the server, respectively. The area of a server is set

equal to the sum of the areas of tasks represented by the server:

Ai=X

Tk∈Vi

Ak(4.37)

Consequently, whenever the server Siis running, all tasks it represents are running.

The rationale of the MSDL algorithm is to construct a set of servers Ω from the original

task set Γ, such that any feasible schedule for Ω implies a feasible schedule for Γ. More

specifically, MSDL constructs a set of servers Ω by properly merging tasks together for

parallel execution. The resulting servers are then scheduled for sequential execution on

the RHD with the single processor EDF algorithm. Feasibility of the resulting set of

servers is thus efficiently checked by the utilization test:

UT(Ω) ≤1 (4.38)

4.3.1 The Merge-Server Distribute Load (MSDL) Algorithm

Algorithm 4shows the pseudo code for the MSDL technique. First, each of the initial

tasks is turned into a server (line 3). Then the main loop is entered in which, iteratively,

a server pair is identified (line 8) and merged (line 12) if possible. If no valid server pair

can be found anymore, the algorithm exits and returns Ω as the final set of servers

(line 10). A feasible schedule is found by MSDL, if the final set of servers has a time

utilization less or equal to one, i.e. UT(Ω) ≤1.

The selection of the two servers Sxand Sythat should be merged is done by calling the

function SelectBestPairToMerge() (line 8).

A pair of servers Sxand Syis valid for a merge, if the following two predicates hold:

4 Three Scheduling Algorithms



The set of represented tasks is disjoint:

Vx∩Vy=∅(4.39)



The tasks of both servers fit together onto the RHD:

Ax+Ay≤A(H) (4.40)

If a valid pair of servers Sxand Syhas been selected, they are merged. This is done

by the MergeServers() function. It takes the current server set Ωold and the servers

Sx, Syto be merged as inputs and returns the new server set after the merge. Without

loss of generality, we assume that server Syhas a shorter or equal period than Sx. A

new server Szis created representing all tasks of the two original servers (line 16). The

period and the computation time for Szare set equal to those of Sy. Therefore, Szis a

full replacement of Sy, and Sycan be removed from Ω (line 19). The computation time

of Sxis reduced, since the new server Szreserves area and computation time for the

tasks of Sxas well. The actual reduction of the computation time Cxdepends on how

often the new server Szexecutes within the period of Sx. A pessimistic approximation

for the reduction of Cxis given by:

takeOverT ime(Sx, Sz) = Cz(bPx/Pzc−1) (4.41)

For the implementation of the SelectBestPairToMerge() function, several heuristics are

conceivable. In our current version we employ a greedy strategy that selects the pair

of servers giving the greatest reduction in time utilization UT(Ωold)−UT(Ωnew) per

increase of system utilization US(Ωnew)−US(Ωold).3That is, SelectBestPairToMerge()

scans each pair of servers Si, Sjthat is valid for a merge (line 28). The temporary new

server set Ωnew is computed, by merging Si, Sj. Afterwards, the profit of this merge

is computed (line 30). Finally, the SelectBestPairToMerge() function returns the pair

Si, Sj, that gains the highest profit.

For example, we apply the MSDL algorithm to an example task set of three tasks. Then,

in Section 4.3.2, we provide a more involved analysis to compute the exact computation

time reduction.

Example 3. Table 4.3 shows the set of servers Ω∗

kgenerated in each iteration kof

the MSDL algorithm. Initially, the servers Ω∗

0={S1, S2, S3}are created. In the first

iteration, S1and S2are selected and merged into S4.S2receives the new computation

time C2←C2−2 = 3. The server with the shorter period, S1, is removed. In the second

iteration, the residual S2and S3are merged into S5. Not only the server with the shorter

period is removed, but also S3since its computation time is reduced to zero. Ω∗

2is the

final server set, since neither V4, V5are disjunct nor A4+A5≤1. As shown in Table

4.3, the time utilization factor UT(Ω∗

2) = 1. Consequently, Ω∗

2can be feasibly scheduled

by EDF. The resulting schedule is shown in Figure 4.7. The figure also indicates the

original tasks of Γ∗executed inside the servers. Compared to a global EDF-NF schedule

3Ωold denotes the set of servers before, whereas Ωnew denotes the server set after merging the selected

pair of servers.

4.3 Server-Based Scheduling

Algorithm 4 Merge Server - Distribute Load

1: procedure MSDL(Γ, H)

2: Ω← ∅

3: for all Ti∈Γdo .init

4: Si←({Ti}, Pi, Ci, Ai)

5: Ω←Ω∪Si

6: end for

7: loop

8: Sx, Sy←SelectBestPairToMerge(Ω, H)

9: if no pair found then

10: return Ω.exit

11: end if

12: Ω←MergeServers(Ω, Sx, Sy)

13: end loop

14: end procedure

15: function MergeServers(Ωold, Sx, Sy)

16: Sz←(Vx∪Vy, Py, Cy, Ax+Ay). Py≤Px

17: Cx←Cx−takeOverTime(Sx, Sz)

18: Ωnew ←Ωold ∪Sz.add server

19: Ωnew ←Ωold \Sy

20: if Cx≤0then

21: Ωnew ←Ωold \Sx

22: end if

23: return Ωnew

24: end function

25: function SeclectBestPairToMerge(Ωold, H)

26: bestProfit ←0

27: for all Si6=Sj∈Ωold do

28: if Pi< Pjand Vi∩Vj=∅and Ai+Aj≤A(H)then .for all valid pairs

29: Ωnew ←MergeServers(Ωold, Sj, Si).do temporary merge

30: profit ←UT(Ωold )−UT(Ωnew )

US(Ωnew )+US(Ωold).compute profit

31: if profit > bestProfit then

32: bestProfit ←profit

33: Sx←Sj

34: Sy←Si

35: end if

36: end if

37: end for

38: return Sx, Sy

39: end function

4 Three Scheduling Algorithms

of the task set, MSDL requires only two RHD programming files. Table 4.3 also lists

the system utilization factor US

iwhich increases over the iterations, since larger servers

will reveal more idle areas and times inside their reservations. In essence, MSDL trades

system utilization for time utilization to allow for an efficient schedulability test and to

reduce the number of RHD configurations.

SiViPiCiAiUT

iUS

S1T14 2 1/2 1/2 1/4

S2T26 5 1/4 5/6 5/24

S3T312 3 3/4 1/4 3/16

P1.58 0.65



S1T14 0 1/2 0 0

S2T26 3 1/4 1/2 1/8

S3T312 3 3/4 1/4 3/16

S4T1, T24 2 3/4 1/2 3/8

P1.25 0.69



S2T26 0 1/4 0 0



S3T312 0 3/4 0 0

S4T1, T24 2 3/4 1/2 3/8

S5T2, T36 3 1 1/2 1/2

P1 0.88

Table 4.3: Servers generated for the example task set Γ∗by the MSDL algorithm

S4

S5

0 1 2 3 4 5 6 7 8 9 10 11 12

S4,S5

S1

S2

S1

S5

S1

S5T1 T1 T1

T2 T2 T2 T2 T2 T2

T3

S4 S4

S4 S4,S5

S5 S5 S4 S5

FPGA area

Figure 4.7: Schedule of server task set generated by MSDL

4.3.2 Exact Evaluation of Computation Time Reduction

In Equation 4.41, we made the pessimistic assumption that a server Szwith Pz≤Px

executes only for m=bPx/Pzc−1 times between the release time and deadline of server

Sx. Therefore, the computation time of Sxwas reduced by mCz. A further reduction

of Cxis possible, if we take into account that the server instances of Szwhich are not

fully contained between the release time and deadline of server Sxcan still be useful.

4.3 Server-Based Scheduling

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

e l

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Sx:

Sz:

Sx:

Sz:

case B:

case A:

Figure 4.8: Case analysis of computation time reduction

The precise amount of this reduction depends on the actual phase between Sxand Sz.

Figure 4.8 illustrates the two cases that have to be distinguished:

Case A shows an example with Px= 9 and Pz= 4. Pessimistically, the server Sz

is guaranteed to execute m=b9

4c − 1 = 1 times between the release time and

deadline of Sx. As Figure 4.8-A illustrates, there is one instance of server Sz

being etime units too early to be included in the considered period of Sx, and

one instance of server Szbeing ltime units too late. In a worst-case schedule,

the server Szexecutes at the beginning of its early instance and at the end of

its late instance, resulting in some wasted computation time (denoted by black

boxes) for Sx. However, some amount of the computation time of Szmay still

be useful to execute Sx, as denoted by the gray box of the first instance of Sz.

Let δ=e+ldenote the amount of time of the early and late server instances

which are outside the considered period of Sx. The time δcan be computed by

δ= (m+ 2) ∗Pz−Px. Let Cel denote the computation time of the early and

late server instances, which is guaranteed to be within the considered period of

Sx. Then, Cel can be computed by Cel = max(2 ×Cz−δ, 0). Therefore, in case

Athe time Cel is exactly the amount by which Cxcan be reduced in addition to

Equation 4.41.

Case B of Figure 4.8 illustrates the second case, where the server Szis executed m+ 1

times within the considered period of Sx. In this case, δchanges to ˜

δ= (m+ 3) ∗

Pz−Pxand Cel changes to ˜

Cel = max(2 ×Cz−˜

δ, 0), respectively. It follows that

in case B the time by which Cxcan be reduced in addition to Equation 4.41 is

given by Cz+˜

Cel.

4 Three Scheduling Algorithms

Since we have to consider the worst case (out of case A and case B), the precise reduction

of the computation time is determined by:

takeOverT ime(Sx, Sz) = min

Cz(bPx/Pzc−1) + max2Cz−((bPx/Pzc+ 1)Pz−Px)

| {z }

,0,

CzbPx/Pzc+ max2Cz−((bPx/Pzc+ 2)Pz−Px)

| {z }

,0 (4.42)

4.3.3 Properties of MSDL

We designed MSDL especially for the full reconfiguration approach. Hence the number

of the required bit-streams for a schedule should be small. Here we show, that it is

bounded by the number of tasks.

Lemma 5 (Number of generated Servers by MSDL).Let Γbe a periodic task set and

Ωthe set of servers generated by the MSDL algorithm. Then, the number of servers is

bounded by the number of tasks, i.e.

|Ω| ≤ |Γ|(4.43)

Proof. Initially, MSDL starts with as many servers as tasks in the task set. In every

loop iteration, one server is added to Ω but one or two servers are removed from Ω (line

19). Hence, |Ω| ≤ |Γ|will hold in each iteration.

The bounded number of generated servers is one major advantage of MSDL when it

comes to system realization. Each server can be implemented as one RHD configuration

file (bit-stream), and the configurations are scheduled and executed sequentially onto

the RHD employing a full reconfiguration model. We will discuss these realization issues

in detail in Chapter 6.

Time complexity of MSDL: The time complexity of the MSDL algorithm is polyno-

mial. Without giving a formal proof we describe how MSDL can be computed in O(n4)

time, where nis the number of tasks in Γ.

The MergeServers() function can be evaluated in O(1) time, if we assume that the union

of the task sets Vxand Vyis done by copying references.

The loop of the SelectBestPair() function iterates through all server pairs. Since the

number of servers are bounded by n, the loop has at most n2iterations. Within the

loop body, comparing the periods, checking the area constraint, calling MergeServers()

and computing the profit would take O(1) time. Checking, whether the represented

task sets Vx, Vyare disjoint, would take O(1) as well, if we store for each server pair a

disjoint-flag. These flags can be updated, after each MergeServers() call in the MSDL

procedure. Hence, the runtime of SelectBestPair() has the order O(n2).

4.4 Chapter Conclusion

The MSDL procedure itself first initializes one server for each task, which takes time of

the order O(n). After initialization, there are (n2−n

2) server pairs with disjunct task sets

(as many as edges in a complete graph). The main loop iterates, until no server pair

can be merged anymore. Each merge reduces the number of disjunct server pairs by at

least one. Hence the main loop runs at most (n2−n

2) times. The time for the loop body

is dominated by the time for the SelectBestPair() function. Hence, the total runtime of

MSDL has the order O(n4).

The MSDL worst case runtime complexity of order O(n4) is acceptable, since the servers

for a task set are constructed at design time. In practice, we have been able to evaluate

MSDL on task sets of 200 tasks and more within some minutes. If needed, the worst

case complexity can be reduced, if less time consuming heuristics for the SelectBestPair()

function are used.

4.3.4 Conclusion on Server-Based Scheduling

The main idea of server based scheduling is, to execute the original tasks within server

tasks. With the MSDL algorithm, we have presented a heuristic that computes at design

time a set of servers for a given task set. At runtime, the servers will be scheduled by

sequential EDF. Hence, the simple scheduling condition of EDF can be applied to the

set of servers. The main advantage of MSDL is, that it creates at most as many servers

as tasks existing in the original task set. Hence, we can use this approach with a simple

full reconfiguration model, for which we compile one RHD configuration for each server.

For selecting the servers to be merged, as well as for the assignment of computation

time and period to the server created in a merge, several heuristics can be developed.

However, we omit studying further heuristics for the MergeServers() function, since our

current implementation guaranties that only nservers are generated. Furthermore, for

such a greedy server generation method, our SelectBestPair() function promises the

highest scheduling performance. It may be possible to develop scheduling heuristics

that achieve a higher scheduling performance and/or create fewer servers than MSDL,

by creating the server set using meta-heuristics such as genetic algorithms or simulating

annealing. We leave this option for future work.

4.4 Chapter Conclusion

Within this chapter we have presented our three main approaches to schedule periodic

real-time tasks to RHDs. We successfully adopted the idea of global scheduling and

presented the two global EDF scheduling algorithms EDF-NF and EDF-FkF. Based on

a resource augmentation approach, we developed a linear-time scheduling condition that

holds for both algorithms.

Also the second popular multiprocessor scheduling approach, partitioned scheduling, has

been successfully adapted for our RHD execution model. We formulated the partitioning

problem and presented the ILP based optimal-partitioned-EDF and the heuristic NFDA-

4 Three Scheduling Algorithms

partitioned-EDF to solve it.

As third method, we developed the server based approach for the scheduling of periodic

tasks and presented the MSDL heuristic. This novel approach is specially suited for

systems which do not support partial reconfiguration of the RHD.

While within this chapter we introduced, defined and provided basic analysis for each of

our three main scheduling methods, we said few about how good the methods perform.

Hence the next chapter will analyze and compare the performance of all approaches.

This is done in two ways, analytically as well as by simulation studies.

CHAPTER 5

Comparison of Algorithm Scheduling Performance

Within this chapter we will compare the scheduling performance of the proposed algo-

rithms using two methods. First, we will present an analytic comparison in Section 5.1

and derive qualitative statements that tell us whether one algorithm dominates another

or not. Together with scheduling conditions achieved in earlier analysis within this work

we will be able to present a VENN-diagram which shows the relations among the sets

of task sets, scheduled by the various algorithms.

Second, we will present the quantitative results from numerous simulation experiments

in Section 5.2. They show which algorithm performs better in average, dependent on

various parameters of the task sets. These results support a designer in choosing the

appropriate algorithm for a particular system.

5.1 Analytical Comparison and Scheduling Anomalies

In an analytical comparison of the algorithms we are interested in whether one dominates

the other or if the algorithms have unequal scheduling power in the sense, that there

are task sets that can be feasibly scheduled by the first approach but not by the second

and vice versa.

In our further analysis, we use zalgo to denote the set of all task sets Γ feasibly scheduled

by some algorithm algo.

Partitioned EDF vs. Global EDF

Theorem 6. Neither the optimal-partitioned-EDF algorithm is dominant to EDF-NF

nor vice versa.

5 Comparison of Algorithm Scheduling Performance

Proof. We first show, that we can construct example task sets Γ, which cannot be sched-

uled by any partitioned EDF algorithm but are feasibly scheduled by global EDF-NF

and even by EDF-FkF. Hence, we prove with our example that zEDF-FkF * zopt.-part.-EDF.

We construct the example task set ΓAof three tasks, each with the same period, e.g.

Pi= 5. The first task has an arbitrary small computation time, e.g. C1=but requires

100 % of the device area, i.e. A1=A(H) = 5. The two other tasks both require an

arbitrary small amount of RHD area, e.g. Ai=but have a computation time of half

their period, i.e. Ci= 2.5. The resulting task set including utilization values is shown

in Table 5.1.

TiPiCiAiUT(Ti)US(Ti)

T15 A(H) = 5 /5

T25 2.5 1/2 0.5

T35 2.5 1/2 0.5

1 + 

52

Table 5.1: Example of a task set ΓAwithout feasible partitioning

Clearly, no feasible partitioning of ΓAexists. We cannot execute all tasks within one

partition block, e.g. G1={T1, T2, T3}, since the time utilization UT(G1) exceeds 1 and

one task would miss its deadline as shown in Figure 5.1-(a) by the speckled task. Also a

partitioning with more partition-blocks such as χ={G1={T1, T2}, G2={T3}} is not

feasible, since the partition-blocks do not fit together onto the device, i.e. A(χ)> A(H),

as shown in Figure 5.1-(a) by the hashed task.

On the other hand, ΓAwill be feasibly scheduled by the global EDF-NF scheduling

algorithm as shown in Figure 5.1-(b). The reason is that once all tasks are active, EDF-

T1,1

0 1 2 3 4 5

time

T2,1



T1,1

T1, T2, T3

0 1 2 3 4 5

device area

time

T2,1

A(H)



T3,1

device area

A(H)

T3,1

T1, T2, T3T1, T2, T3T1, T2, T3

(a) (b)

Figure 5.1: Scheduling anomaly: A task set infeasible by any partitioned schedule, but

feasibly by global EDF-NF.

5.1 Analytical Comparison and Scheduling Anomalies

NF will select either T1or T2and T3for execution. If we assume that EDF-FkF selects

tasks instances with equal deadlines in order of their indices, ΓAwill also be feasibly

scheduled by EDF-FkF.

Our observations hold for any between 0 <  ≤2.5. Hence, we can create task sets of

any relative system utilization US(Γ)

A(H)∈(0,1] which are feasibly scheduled by EDF-FkF

but infeasible by opt.-part.-EDF.

Now we prove that zopt.-part.-EDF * zEDF-NF by constructing task sets ΓBwhich are

feasible by the partitioning algorithm opt.-part.-EDF but infeasible by the global EDF-

NF scheduler. ΓBconsists of three tasks, where T1and T2have an arbitrary small

computation time of e.g. C1=C2=but each requires 50% of the device area, e.g.

A1=A2=A(H)/2. The third task requires an arbitrary small area of A3=but has

a computation time equal to its period, e.g. C3=P3= 5. Let the period of the tasks

T1, T2be smaller then that of T3, e.g. P1=P2= 4. The resulting task set considering

a device of area A(H) = 4 is shown in Table 5.2.

TiPiCiAiUT(Ti)US(Ti)

T140.5A(H) = 2 0.250.5

T240.5A(H) = 22 0.250.5

T35 5 1

1+0.52

Table 5.2: Example task set ΓB

The NFDA heuristic produces a partitioning of G1={T1, T2}and G2={T3}which

obviously can be feasibly scheduled as shown in Figure 5.2-(a). On the other hand,

the schedule created by global EDF-NF is infeasible. As illustrated in Figure 5.2-(b),

T1,1

T1, T2, T3

0 1 2 3 4 5

device area

time

T2,1

A(H)



T1, T2

(a)

T3,1

T1,1

T1, T2, T3

0 1 2 3 4 5

device area

time

T2,1

A(H)



T1, T2

(b)

T3,1

Figure 5.2: Scheduling anomaly: A task set infeasible by global EDF-NF but feasible

by partitioned EDF NFDH.

5 Comparison of Algorithm Scheduling Performance

EDF-NF will execute the tasks T1and T2first, which leads to the deadline miss of T3.

Our observation holds for all within 0 <  ≤2. Hence, we can create task sets of any

relative system utilization US(Γ)

A(H)∈(0,1] which can be scheduled by NFDA-part.-EDF

but not by EDF-NF.

Partitioned EDF and Global EDF vs. MSDL

Theorem 7. Neither the optimal-partitioned-EDF algorithm nor EDF-NF algorithm

dominate the MSDL algorithm nor vice versa.

Proof. We first show that there exists very simple task sets, that are feasibly scheduled

by optimal-partitioned-EDF and EDF-NF, but our proposed MSDL algorithm fails.

Afterwards we provide two example task sets. Both are feasibly scheduled by our MSDL

heuristic, but the first one is infeasible by opt.-part.-EDF and the second by EDF-NF.

Consider the task set ΓC={T1, T2, T3}, where all three task have the same period,

e.g. Pi= 10 and same computation time e.g. Ci= 4 and an arbitrary small area

Ai=. Clearly the tasks cannot be scheduled entirely sequential, since the total time

utilization is given by UT(ΓC) = 1.2. However, merging tasks into servers as used

by the MSDL approach is not feasible. Let w.l.g. T1and T2be the tasks selected for

merging into a server S3with period PS= 10 and computation time CS= 4. S3fully

replaces T1, but the takeovertime(T1, S3) of Equation 4.42 evaluates to zero, i.e. the

server S3cannot take over any amount of the computation time of task T2. Hence,

the overall time utilization UT(ΓC) cannot be reduced to ≤1 by server merging. On

the other hand it is obvious, that the global EDF-NF and EDF-FkF methods as well

as the optimal-partitioned-EDF and also the NFDA-partitioned-EDF method feasibly

schedule ΓC, since all three tasks can be executed in parallel. Hence, we showed that

zEDF-FkF * zMSDL and that zopt.-part.-EDF * zMSDL.

Next we consider the task set ΓDof Table 5.3. which is infeasible by any partitioned

TiPiCiAiUT(Ti)US(Ti)

T15 1 A(H) 0.2 0.2A(H)

T25 3 0.6 0.6

T310 3 0.3 0.3

1.1

SiPiCiAiUTUS

S1={T1}5 1 A(H) 0.2 0.2A(H)

S4={T2, T3}5 3 20.6 1.2

0.8

Table 5.3: Example task set ΓDwithout feasible partitioning and feasible set of servers

ΩD

5.1 Analytical Comparison and Scheduling Anomalies

scheduler for the same reason1as it is for task set ΓA. The MSDL scheduler will merge

task T2and T3into a server S4with period P4= 5 and computation time C4= 3, hence

T2is fully replaced by the server. The reduced computation time of T3can be computed

according to Equation 4.41 and evaluates to zero.

C3←C3−C4·(bP4

P3c−1) = 0 (5.1)

Since the final set of servers ΩD={S1, S4}has a time utilization below one, it can be

feasibly scheduled. Hence we showed that zMSDL * zopt.-part.-EDF.

Next we consider the example task set ΓEgiven in Table 5.4. As shown in the EDF-

NF schedule in Figure 5.3-a, the first instance of task T3will miss its deadline. The

reason is, that T3has a large period and hence a far deadline for the first instance T3,1.

Therefore, it gets preempted 10 times by the tasks T1and T2which have shorter periods

and hence closer deadlines.

TiPiCiAiUT(Ti)US(Ti)

T1100 12 0.5A(H) 0.12 0.12

T2100 12 0.5A(H) 0.12 0.12

T31000 900 0.90 0.9

Table 5.4: Example task set ΓE

When applying the MSDL algorithm on ΓE, it merges task T1and T3into a server

S4with P4= 100, C4= 12. T1gets fully replaced and the computation time of T3is

reduced by 9 times the computation time of the server.

C3←C3−C4·(bP4

P3c−1) = 900 −12 ·9 = 792 (5.2)

In the next merge, T2and the remaining part of T3are merged into S5, and C3is

reduced further by 108 units to 684. The final set of servers shown in Table 5.5 has a

time utilization of 0.924. The schedule of the servers and the tasks inside the servers is

shown in Figure 5.3-b. Hence we showed by the example, that zMSDL * zEDF-NF.

SiPiCiAiUT

S3={T3}1000 684 0.684

S4={T1, T3}100 12 0.5A(H) 0.120

S5={T2, T3}100 12 0.5A(H) 0.120

P0.924

Table 5.5: Feasible set of servers created by MSDL for task set ΓE

5 Comparison of Algorithm Scheduling Performance

T1,1

T1, T2, T3

0 100 200 300 400 500

device

area

T2,1

A(H)

T1, T2

(a)

T3,1

time

600 700 800 900 1000

T1, T2T1, T2T1, T2T1, T2T1, T2

T1, T2T1, T2T1, T2T1, T2, T3

device

area

A(H)

(b)

0 100 200 300 400 500 time

600 700 800 900 1000

S3↔ {T3}S4↔ {T3, T1}S5↔ {T3, T2}

Legend of Servers:

T2,2T2,10

T1,2T1,10

...

T2,1

T1,1

T3,1

Figure 5.3: Scheduling anomaly: A task set infeasible by global EDF-NF but feasible

by MSDL.

Relation Between all Algorithms

At the end of our analysis we are able to present the relations between all proposed

scheduling methods as a big picture. Figure 5.4 shows the sets of feasibly scheduled task

sets by the different algorithms as VENN-diagram. The rectangle presents the set of all

task sets Γ, for which a feasible schedule exists. Of course none of our proposed methods

schedules all these task sets, since they all are suboptimal algorithms. The region in

the center represents all task sets accepted by the EDF-FkF scheduling condition of

Theorem 3. It is a true subset of the task sets de facto scheduled by EDF-FkF which

again is a true subset of the task sets de facto scheduled by EDF-NF:

zEDF-FkF-condition ⊂zEDF-Fkf ⊂zEDF-NF (5.3)

On the other hand we showed by Theorem 5that all task sets accepted by the EDF-

1One partition-block is infeasible, since the time utilization is greater than one. Two or three partition-

blocks are infeasible, since the device area is exceeded.

5.1 Analytical Comparison and Scheduling Anomalies

EDF-FkF

Condition

EDF-NF

EDF-FkF NFDA-part.

EDF

opt.part.EDF

MSDL

all schedulable task sets

Figure 5.4: The relation between feasibly scheduled task sets by the proposed algorithms

as VENN-diagram.

FkF scheduling condition are also feasibly scheduled by the partitioned EDF scheduler

using the NFDA-partitioning heuristic. Obviously, the optimal-partitioned-EDF sched-

uler dominates the NFDA-partitioned-EDF algorithm. These relations are shown in

Equation 5.4 and correspond to the right side of the VENN-diagram:

zEDF-FkF-condition ⊂zNFDA-partitioned-EDF ⊂zoptimal-partitioned-EDF (5.4)

Finally we showed by constructing counter examples, that the task sets scheduled by

three basic approaches EDF-NF,optimal-partitioned-EDF and MSDL are not subsets of

each other. In particular we showed, that EDF-FkF is no subset of optimal-partitioned-

EDF or MSDL, that NFDA-partitioned-EDF is no subset of EDF-NF or MSDL and

that MSDL is no subset of EDF-NF or optimal-partitioned-EDF.2

2An complete proof of the set-relations shown in the VENN-diagram would require the construction

of a task set for each of the 19 intersections. Beside the presented task sets for the major relations

we abstain from presenting further task sets at this place. However, we found task sets for each

intersection during our intensive simulations on random generated benchmarks.

5 Comparison of Algorithm Scheduling Performance

5.2 Simulation Results

Due to our analytic comparison and analysis we are able to tell whether one algorithm

dominates another one or not, and furthermore know scheduling conditions for some

of the proposed algorithms. However, it tells us little about how good the algorithms

perform on task sets appearing in practice, nor does it tells us anything about which

method performs better for task sets of certain characteristics.

In order to answer those questions, an experimental evaluation on application bench-

marks becomes necessary. The best one could do is using benchmarks which are widely

accepted in the community and consist of applications appearing in practice. Not sur-

prisingly, those benchmarks do not exist, since we consider novel methods for a novel

and rarely studied problem. Hence, we have to rely on synthetically created bench-

marks. In order to construct meaningful task sets one idea is to look at the parameters

of typical FPGA tasks, such as the area of Discrete Wavelet Transform design or an

MPEG 2 Video Decoder. On the other hand we can consider the amount of resources

provided by today’s modern FPGAs as proper values for the device area. We concluded

that this strategy is of little use, since today’s device and design sizes are very variable.

For example, the Discrete Wavelet Transform design requires only 20% of an XILINX

Virtex II XC2V3000 FPGA, but will exceed the size of the smaller but still modern

XILINX Spartan III FPGAs. Also we know little on typical task computation times,

since they depend considerably on the amount of data to be processed. As conclusion

we decided to create benchmarks of pure synthetic task sets, but tried to cover wide

ranges of the task set parameters.

The next paragraph precisely describes our benchmark generation methods and discusses

some characteristics of the created task sets. This should enable the reader to reproduce

the benchmarks, validate results and compare our algorithms to other methods.

Afterwards, we present results from simulations that show the performance of the algo-

rithms and the impact of the tasks’ area and time-utilization values, the impact of the

number of tasks in a task set and how the performance increases when various algorithms

are combined.

5.2.1 Creation of Benchmarks

Recall that a task set Γ consist of periodic tasks Tieach specified by a triple parameter

Ti(Pi, Ci, Ai). The main characteristics of the tasks are UT(Ti) = Ci/Pias well as

US(Ti) = Ci/Pi·Aiwhereas our main interest in the task set’s characteristics are the

number of task n=|Γ|and the overall system utilization US(Γ) = Pn

i=1 US(Ti).

A benchmark creation method creates a given number of task sets {Γ1,Γ2, . . .}by choos-

ing the task parameters and task set sizes according to some rules. To define meaningful

rules is rather difficult. For example we cannot choose all task parameters from intervals

with a uniform (or normal) distribution, since the parameters depend on each other. In

our first task set generation method we decided to keep the Aiand UT

ivalues of the

tasks uniformly distributed.

5.2 Simulation Results

Our first task set generation method works as follows: Each task set is composed of ran-

domly generated tasks Ti. For the creation of tasks a benchmark BM specifies an interval

[cmin, cmax] of natural numbers for the task’s computation times, an interval [amin, amax]

for the task’s areas, and an interval [uT

min, uT

max] for the task’s time-utilization factor.

A task is created by randomly choosing its parameters from the intervals with a uni-

form distribution. Its period is computed according to its computation time and time-

utilization. In order to create the task sets Γ ∈BM with system-utilization distributed

from 0% to 100%, randomly generated tasks are added to a task set until a specified

task set system-utilization US(Γ) is reached. Furthermore, a task set is dismissed when

its hyper period exceeds a given limit. This is necessary to keep the simulation time of

the global EDF algorithms within reasonable bounds. The pseudo code of the algorithm

creating one task set is shown in Algorithm 5.

Algorithm 5 Benchmark Creation Method 1

1: procedure createTaskSet(US

bound, HPbound)

2: repeat

3: Γ← ∅

4: repeat

5: T←createTask

6: Γ←Γ∪T

7: until US(Γ) > US

bound

8: Γ←Γ\T

9: until hyperperiod (Γ) ≤HPbound

10: end procedure

11: procedure createTask

12: C←rand([cmin, cmax] )

13: A←rand([amin, amax] )

14: tempUT←rand([uT

min, uT

max] )

15: P←round( C

tempUT)

16: return T(P, C, A)

17: end procedure

While the values for the tasks’ area, computation time and time-utilization are uniformly

distributed within some interval, the periods, the system-utilization and the number of

tasks is a result of the creation method. To get a deeper idea of the characteristics

of the benchmarks created by the described method we study the distribution of all

parameters of interest. We create our standard benchmark BMstd using the following

parameters:

param(BMstd) = {cmin = 1, cmax = 30, amin = 0.1, amax = 0.5, uT

min = 0.1, uT

max = 0.5}

(5.5)

The characteristics of the created tasks are illustrated in Figure 5.5. The histograms

show the various task parameters, their values and their relative frequency. As specified

in the benchmark parameters, the task computation times Ciand area values Aiare

5 Comparison of Algorithm Scheduling Performance

uniformly distributed in [1,30] and [0.1,0.5] respectively. Since we required a uniform

distribution of the time-utilization UT(Ti)∈[0.1,0.5], the periods Piwill result with

values between 2 and 300 with a distribution as shown in the second histogram of

Figure 5.5. Moreover, since the Aiand UT(Ti) are uniformly distributed in [0.1,0.5],

the system utilization of the tasks US(Ti) is between 0.1 and 0.25 and distributed as

shown in the fifth histogram. The last plot in the figure shows the average number of

tasks nin a task set Γ, when creating task sets of various system-utilization US(Γ).

0 100 200 300

computation time

frequency

0 100 200 300

period

frequency

0 0.5 1

area

frequency

0 0.5 1

time utilization

frequency

0 0.5 1

system utilization

frequency

0 0.5 1

system util. of Γ

n=|Γ|

Figure 5.5: Characteristics of BMsdt tasks and task sets

5.2.2 Performance on Standard Benchmark BMstd

In our first simulation experiment, we created our standard benchmark BMstd with

ten thousand task sets with various system-utilization values. The hyper-period was

bounded by 100000, to keep the simulation time reasonably small.

Figure 5.6 compares the scheduling performance of all proposed algorithms on the bench-

mark BMstd. The graphs where created by partitioning the 10000 task sets into 20

classes according to their US(Γ) value. For each class and algorithm one data point was

determined. The y-values of the data points present the success-rate, i.e. the fraction of

feasibly scheduled task sets of a class by the particular algorithm. The x-values present

the mean system-utilization of the task sets of a class.

Global EDF Scheduling: The global EDF-NF approach is the algorithm with the best

performance on our standard benchmark. It is able to schedule about 50% of the task

sets of the class with a mean system utilization factor around 81% and accepts almost

all task sets with USless than 70%. The performance drop of the global EDF-FkF

5.2 Simulation Results

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

NFDA part. EDF

opt. part. EDF

EDF−FkF Cond.

EDF−NF

EDF−FkF

MSDL

Figure 5.6: Performance of algorithms on BMsdt

scheduler compared to EDF-NF seems to be rather small. However, the difference

is significant for task sets with a system utilization within [0.75,0.9], where EDF-NF

generates feasibly schedules for four times as many task sets than EDF-FkF. In contrast

to the relative good performance of the global schedulers EDF-FkF and EDF-NF it

is shown that our scheduling test (EDF-FkF-condition) developed in Section 4.1.3 is

rather pessimistic. It accepts almost all task sets with US(Γ) <0.5 but rejects almost

all task sets with US(Γ) >0.5. However, this large discrepancy between the amount of

task sets actually feasibly scheduled and the amount accepted by the scheduling test is

in conformance to the results reported for the related multiprocessor scheduling tests

[16].

Partitioned EDF Scheduling: On our standard benchmark the partitioning schedulers

show a significant worse performance than the global schedulers. From task sets with

US(Γ) around 0.78, only 20% could be scheduled by the optimal-partitioned-EDF algo-

rithm, while EDF-NF schedules up to 80%. However, opt.-partitioned-EDF still sched-

ules 75% of task sets with US(Γ) around 0.7 and almost every task set with US(Γ) <0.6.

When using the simple Next-Fit-Decreasing-Area heuristic to partition the task sets,

the performance drops only slightly compared to the optimal partitioning achieved by

solving an ILP model. It has to be mentioned, that the graphs for the partitioning

approaches represent the task sets which could be feasibly partitioned and hence can be

guaranteed to be feasibly scheduled, while the global EDF graphs represent the results

achieved by simulating the schedule over the entire hyper-period. While the partitioning

can be done in reasonable time, the simulation over the hyper-period may often be too

time consuming.

5 Comparison of Algorithm Scheduling Performance

Server Based Scheduling: The approach based on executing tasks within periodic

servers presented in Section 4.3 shows the worst performance on our standard bench-

mark.3MSDL is able to schedule only few task sets with a USexceeding 0.7, and

achieves an acceptance rate of 50% for task sets with a USaround 0.6. However, with

this relatively poor performance MSDL may still be reasonable and the method of choice,

since it has various advantages when it comes to system realization (See Chapter 6).

5.2.3 Impact of Area and Time-utilization

In the next experiment we will study the impact of the tasks area and time-utilization

values on the performance of the various scheduling algorithms. Hence we created two

more benchmarks. In the BMsmallAbigUTbenchmark, the average task area has been

halved while time-utilization has been doubled compared to the BMstd benchmark.

For the BMbigAsmallUTbenchmark we did the same vice versa. Table 5.6 shows the

parameters used for the benchmark generation in detail.

benchmark cmin cmax amin amax uT

min uT

max

BMstd 1 30 0.10 0.50 0.10 0.50

BMsmallAbigUT1 30 0.05 0.25 0.20 1.00

BMbigAsmallUT1 30 0.20 1.00 0.05 0.25

Table 5.6: Benchmark parameters

Using this parameters we achieved benchmarks comparable to BMstd in terms of tasks’

system-utilization and task sets size, but with different distribution of the area and time-

utilization values. Especially the ratio of Ai/UT(Ti) is different. In the BMstd some

few tasks have an extreme Ai/UT(Ti) ratio of 5 to 1 and 1 to 5 respectively, most tasks

have a 1 to 1 ratio as shown in the first histogram of Figure 5.7. In contrast, the tasks

3except the EDF scheduling condition which we do not treat as scheduling algorithm

1/20 1/5 5/1 20/1

A/UT ratio

frequency

BMstd

1/20 1/5 5/1 20/1

A/UT ratio

frequency

BMsmallAbigUT

1/20 1/5 5/1 20/1

A/UT ratio

frequency

BMbigAsmallUT

Figure 5.7: Distribution of the tasks’ Aito UT(Ti) ratio of the BMsdt,BMsmallAbigUT

and BMbigAsmallUTbenchmarks.

5.2 Simulation Results

of BMsmallAbigUTcan have a Ai/UT(Ti) ratio as small as 1 to 20. The second histogram

of Figure 5.7 shows, that for this benchmark the most frequent ratio is about 1 to 5.

The tasks of the BMbigAsmallUTbenchmark have exactly the converse distribution of

the Ai/UT(Ti) ratio, which is shown in the last plot of Figure 5.7.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

NFDA part. EDF

opt. part. EDF

EDF−FkF Cond.

EDF−NF

MSDL

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

NFDA part. EDF

opt. part. EDF

EDF−FkF Cond.

EDF−NF

MSDL

Figure 5.8: Performance of algorithms on BMsmallAbigUTand BMbigAsmallUTbench-

marks.

The performance evaluation of the scheduling algorithms on the BMsmallAbigUTand

BMbigAsmallUTis shown in Figure 5.8.4

The opt.-partitioned-EDF and the global EDF-NF show almost identical performance on

the BMsmallAbigUTbenchmark. Compared to the results of the BMstd benchmark, the

performance of opt.-partitioned-EDF has considerably improved while the performance

of global EDF-NF stays almost constant. One possible reason for the improvement

of the opt.-partitioned-EDF is that the benchmark contains no tasks with large area

and hence situations as shown by the anomaly in Figure 5.1 cannot occur. The opposite

effect can be seen in the performance evaluation of the BMbigAsmallUTbenchmark: Here,

the performance gap between the opt.-partitioned-EDF and the EDF-NF has slightly

grown compared to the BMstd benchmark.

Figure 5.9 compares the performance on the three benchmarks for the algorithms EDF-

NF,opt.-part.-EDF and MSDL separately. It can be seen that while the performance

of EDF-NF is almost constant on the three benchmarks, the performance of opt.-part.-

EDF is better on tasks with small area and gets worse on tasks with large area. However,

the greatest sensitivity on the ratio of task area to time-utilization is shown by MSDL.

The performance on the BMsmallAbigUTbenchmark drops dramatically compared to the

standard benchmark. The reason is, that in order to find a feasible set of servers, MSDL

has to fulfill a high number of server merges since the tasks have rather small area values.

Each merge adds some overhead into the set of servers. The opposite effect takes place

4From the global schedulers we present only the better EDF-NF and leave out the evaluation of the

EDF-FkF, since our simulations did not show interesting differences among the two global schedulers

in the sensitivity on the Ai/UT(Γi) ratio.

5 Comparison of Algorithm Scheduling Performance

on task sets with large area values. On the BMbigAsmallUTbenchmark MSDL performs

as well as EDF-NF and significantly better than opt.-part.-EDF (see Figure 5.8 right

side). Here, only few server merges are required to find the final set of servers and thus

little overhead is introduced. However, the experiment also indicates that MSDL may

perform poor on task sets with a large number of tasks, since these may require a large

number of server merges as well. This assumption is confirmed by the next experiment.

0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1EDF−NF

average system utilization US(Γ)

relative success rate

0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

opt. part. EDF

relative success rate

average system utilization US(Γ)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

MSDL

average system utilization US(Γ)

relative success rate

BMstd

BMsmallAbigUT

BMbigAsmallUT

Figure 5.9: Performance of algorithms EDF-NF,opt.-part.-EDF and MSDL on the

BMstd,BMsmallAbigUTand BMbigAsmallUTbenchmarks.

5.2.4 Performance of Combined Algorithms

In Section 5.1 we showed that the major scheduling approaches do not dominate each

other. Hence, combining two or more of the scheduling algorithms might increase the

scheduling performance compared to the single algorithms. We analyzed the perfor-

mance of all promising algorithm combinations as they are: EDF-NF + opt.-part.-EDF,

EDF-NF + MSDL,opt.-part.-EDF + MSDL and EDF-NF + opt.-part.-EDF + MSDL.

Combinations including the algorithms EDF-FkF and NFDH-part.-EDF have a lower

performance and thus have been omitted in our analysis.

The performance evaluation has been done on all three benchmarks BMstd,BMsmallAbigUT

and BMbigAsmallUT. However, only in two cases the combination of algorithms yielded

a non-negligible increase in performance:

When applying EDF-NF + opt.-part.-EDF to the BMsmallAbigUTbenchmark, much

more task sets could be feasibly scheduled in comparison to the task sets scheduled by

only EDF-NF or only opt.-part.-EDF. Especially when the task set’s system utilization

is in the range of 0.75 and 0.85 this algorithm combination increases the performance by

about 10% as shown in Figure 5.9-left. However on the other two benchmarks, where

5.2 Simulation Results

the performance of EDF-NF is considerably higher than that of opt.-part.-EDF, the

combination of both algorithm did only show negligible performance increase compared

to EDF-NF.

The combination of EDF-NF + MSDL applied to the BMbigAsmallUTbenchmark showed

a small increase in performance compared to EDF-NF. Figure 5.9-right shows, that this

improvement stays always less then about 3%. On the other benchmarks, we got no

benefit from combining EDF-NF + MSDL.

The combinations opt.-part.-EDF + MSDL did not show performance improvement on

any of the three benchmarks. On BMstd and BMsmallAbigUTthe performance was not

better than that of opt.-part.-EDF and on BMbigAsmallUTwas not better than that

of MSDL. Also considering all three algorithms did not show an advantage compared

to the performance achieved by EDF-NF + opt.-part.-EDF and EDF-NF + MSDL

respectively.

0.7 0.75 0.8 0.85 0.9

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

BMsmallABigUT

opt. part. EDF

EDF−NF

EDF−NF + opt part.

0.7 0.75 0.8 0.85 0.9

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

BMbigAsmallUT

EDF−NF

MSDL

EDF−NF + MSDL

Figure 5.10: Performance of combined algorithms on BMsmallAbigUTand BMbigAsmallUT

benchmarks.

5.2.5 Impact of Number of Tasks

In order to analyze how well the proposed algorithms scale we created task sets with

various numbers of tasks. Our benchmark creation method of Section 5.2.1 is not able

to create task sets with a large number of tasks but a small hyper-period, since the task

periods are arbitrary. Moreover, the number of tasks in a task set is not constant within

a benchmark. Hence we choose a different method that creates task sets with a given

number of tasks and keeps the hyperperiod small by choosing periods from a given set

of harmonic related values.

The entire creation method is shown as pseudo code in Algorithm 6. First we choose for

all ntasks a period from a given set and a temporary area and time-utilization value.

We compute a scaling factor scale by dividing the desired task set’s system utilization by

5 Comparison of Algorithm Scheduling Performance

the actual task set’s system utilization. Afterwards, we scale all area values and select

the computation time, such that the final task set Γ has the desired system utilization.

Algorithm 6 Benchmark Creation Method 2

1: procedure createTaskSet(n, US

nominal, PeriodSet)

2: US

actual ←0

3: for all i∈ {1, n}do .Choose period and temp. area and time utiliz.

4: Pi←chooseFrom PeriodSet

5: ai←rand([0,1])

6: ui←rand([0,1])

7: US

actual ←US

actual +ai·ui

8: end for

9: scale ←US

nominal/US

actual

10: for all i∈ {1, n}do .Scale area values and select comp times

11: Ai←ai·√scale

12: Ci← dPi·ui·√scalee

13: Γ←Γ∪Ti(Pi, Ci, Ai)

14: end for

15: return Γ

16: end procedure

Using this, three benchmarks of 10000 task sets each have been created, where all task

sets of BM10T asks contain 10 tasks, all task sets of BM20Tasks contain 20 tasks and the

task sets of BM50T asks contain 50 tasks respectively. The periods have been chosen

from PeriodSet ={100, 200, 400, 600, 800, 1000, 2000, 4000, 6000, 8000, 10000}, such

that the hyper period is bounded by 120000 and the simulation time for each task set

is reasonably small.

Figure 5.11 compares the performance of the three benchmarks for the algorithms EDF-

NF,NFDA-part.-EDF and MSDL separately. 5It can be seen that the performance of

EDF-NF and NFDA-part.-EDF improves on the task sets with greater number of tasks.

The reason is, that the tasks of BM20T asks and BM50Tasks have in average smaller area

and time-utilization values as the tasks of BM10T asks. When EDF-NF selects the tasks

for execution or NFDA-part.-EDF packs tasks into levels, the amount of wasted device

resources by fragmentation is small compared to task sets with ”larger” tasks.

In contrast, the performance of MSDL methods gets worse, as the number of tasks in

the task sets increases (Figure 5.11, right side). As mentioned earlier, each merge of

two tasks or servers introduces an overhead into the task set, since one of the tasks

is afterwards executed with a period smaller than its original period. The greater the

number of tasks of the task sets, the greater the overall time-utilization UT(Γ) of the

task set (in average) is. Hence, it requires more merges to reduce the UT(Γ) to less than

or equal to one and it becomes more likely that the method terminates before a feasible

set of servers is found.

5From the partitioned EDF methods we present only the NFDA heuristic since the optimal partitioning

for task sets of 50 tasks could not be computed in reasonable time.

5.3 Conclusion on Algorithm Performance

0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1EDF−NF

average system utilization US(Γ)

relative success rate

0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

NFDA part. EDF

relative success rate

average system utilization US(Γ)

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

MSDL

average system utilization US(Γ)

relative success rate

BM10Tasks

BM20Tasks

BM50Tasks

Figure 5.11: Performance of algorithms EDF-NF,NFDA-part.-EDF and MSDL on task

sets with 10, 20 and 50 tasks.

5.3 Conclusion on Algorithm Performance

We have compared the performance of the proposed algorithm analytically as well as

by simulations on synthetic benchmarks. The main conclusions concerning the pure

scheduling performance of the algorithms are summarized below:

dominance: The three basic approaches, global EDF scheduling, partitioned EDF schedul-

ing and server based EDF scheduling do not dominate each other. However, the

simulation study showed that combining the algorithms yields a rather low per-

formance improvement.

robustness: For each of the basic scheduling algorithms there exist task sets, that are in-

feasible by the particular algorithm, have an infinitesimal small system-utilization,

but can be feasibly scheduled by some algorithm.

pessimistic analysis of global EDF: The scheduling test for the global EDF algorithms

is pessimistic in the sense, that each accepted task set will also be scheduled by

the NFDA-partitioned-EDF method.

average performance: In most cases, the global EDF approach shows the best perfor-

mance followed by the partitioned EDF approaches. The performance of MSDL

and the global EDF scheduling test is much worse.

sensitivity: The performance of global EDF seems to be insensitive on the ratio of

the task’s area to the task’s time-utilization values, while the partitioned EDF

improves with tasks with small area and MSDL improves with tasks with large

area.

5 Comparison of Algorithm Scheduling Performance

scaling: The performance of the global and partitioned EDF methods significantly im-

prove with task sets with many but ”small” tasks, while the performance of MSDL

does not.

While this performance analysis of this chapter totally neglected any overheads, the

next chapter describes the execution models and associated overheads in detail.

100

CHAPTER 6

Execution Model and Overhead Analysis

Up to now, we have neglected any system overheads due to reconfiguration, context

switching, and running the scheduling routines in our analysis. Neglecting such over-

heads is a common practice in microprocessor-based real-time systems, but only justified

when the time overhead is small compared to the task computation times. For recon-

figurable hardware, the time overheads can only be discussed in conjunction with the

actually used reconfiguration mode. In this chapter we outline the reconfiguration mod-

els, list the requirements on the runtime system and model the resulting reconfiguration

overheads for each of our three scheduling methods. Hence we will be able to pro-

vide off-line scheduling tests which consider the reconfiguration overhead. Finally we

will repeat some of our simulation experiments on the algorithm performance including

reconfiguration overheads.

We presented the overhead analysis for global EDF in [DP06a], for partitoned EDF in

[DP06b] and for MSDL in [DMP06].

6.1 Global EDF Overhead Model

6.1.1 Reconfiguration Modes and Task Placement

In this section, we outline an execution model for a reconfigurable hardware platform

that matches the global EDF scheduling techniques developed in Section 4.1. We further

pose some requirements on the runtime system managing the hardware platform. Then

we become in a position that enables us to model and analyze the overheads involved.

Recall that generally reconfigurable hardware devices can be reconfigured using either

a full or a partial reconfiguration mode. During a full reconfiguration, the complete

device undergoes the reprogramming process. We surveyed different full- and partial

101

6 Execution Model and Overhead Analysis

reconfiguration models in Section 2.3.1 of the background chapter. From the three full

reconfiguration models, only model cfree space - time sharing matches our task model

requirements. However, using our global EDF methods with the full reconfiguration

mode shows some severe drawbacks that will restrict EDF-NF and EDF-FkF to very

small task sets:



According to the global EDF schedule, we have to compile a separate full config-

uration bit-stream for each set of running tasks before runtime. This can easily

be done using standard device vendor tools, and full reconfiguration is supported

by all SRAM-based reconfigurable devices. Compiling several tasks into one con-

figuration bit-stream justifies our scalar area model, i.e., the tasks fit onto the

device if PJi∈RAi≤A(H). Unfortunately, we have to simulate the schedule for

the complete hyper period to determine the sets of running tasks. For larger task

sets, this cannot be done in reasonable time. Further, we have to store all the

full configuration bit-streams during runtime. This is unacceptable especially for

embedded systems with stringent memory constraints.



In the full reconfiguration model c, the preemption of a specific task affects all

tasks currently running, independent of their actual priority. This not only adds

huge overheads to the actual tasks’ execution time, but makes the analysis of

the reconfiguration overheads overly pessimistic. The resulting schedulability test

would be rendered useless for all except the smallest task sets.

A partial reconfiguration model allows us to reprogram only a fraction of the device

during a reconfiguration step, while the other parts of the device remain computing.

From the three resource models of Section 2.3.1 which use partial reconfiguration, only

the 1D partitioning model (e) and the 2D partitioning model (f) allow tasks with

arbitrary area values.

1D area model

A(H)A1

2D area model

Figure 6.1: Partial reconfiguration models supporting arbitrary task areas

In the 2D area model, tasks have rectangular footprints and can be placed anywhere on

the device. An example is illustrated in the right-hand side of Figure 6.1. Placing such

rectangular footprints leads to fragmentation problems, i.e., a new task cannot be placed

onto the device although there is sufficient free area. While 2D defragmentation tech-

niques have been proposed [37][39], they are hard to implement and cannot completely

102

6.1 Global EDF Overhead Model

overcome the fragmentation problem. Hence, our area constraint, PJi∈RAi≤A(H), is

not sufficient to guarantee feasible task placements.

In contrast, the 1D area model requires tasks to have rectangular footprints which span

the entire device width, but may have arbitrary heights. An example is illustrated in the

left-hand side of Figure 6.1. The task height is then proportional to the task area Ai.

In design practice, such footprints can be obtained by applying appropriate constraints

during circuit synthesis. To justify our scalar area model, we still have to ensure that

the device is unfragmented at any time. For this we propose the following placement

and reconfiguration scheme:



Tasks are stacked on the device from bottom to top. The order of tasks corresponds

to their scheduling priority, such that tasks with closer deadlines are placed below

tasks with deadlines further away.



Whenever a task terminates, all tasks above are shifted downwards by a relocation

process. Tasks below, i.e., the ones with closer deadlines, remain unaffected.



Whenever a task starts or resumes from preemption, it is placed according to its

deadline. Specifically, all tasks with greater deadlines are shifted upwards. Again,

tasks with closer deadlines remain unaffected.

Figure 6.2 shows an example of a schedule using the proposed placement and reconfig-

uration scheme. Whenever a task is being preempted or resumed, only those tasks with

later deadlines are preempted or relocated and receive an overhead. In Figure 6.2, solid

lines at the end of task executions denote terminations; dashed lines denote preemptions

due to scheduling decisions. Note that task T4is preempted by the scheduler at time 7,

while task T3is just relocated for two times.





















     







 

    

Figure 6.2: A schedule including reconfiguration times

The proposed partial reconfiguration scheme offers the following key advantages: First,

we do not have to preempt all running tasks on a task preemption, which results in

103

6 Execution Model and Overhead Analysis

an improved scheduling performance. Second, the time required to reconfigure a part

of the device is substantially smaller than the time required for a full reconfiguration.

Thus, the reconfiguration time overhead is lowered. Third, the system has to store only

one partial configuration bit-stream for each task, greatly reducing the required memory

capacity. The fact that the tasks’ bit-streams are independent opens up the possibility

of online scheduling scenarios, where new periodic tasks enter the system at runtime

and have to pass an acceptance test.

6.1.2 Runtime System Requirements

For the global EDF scheduling algorithms, we require a reconfigurable hardware execu-

tion model and runtime environment featuring the following characteristics:



the device is partially reconfigurable using the 1D area model



the tasks are relocatable



the runtime system supports task preemption

While some of these features are research issues for themselves, their feasibility has

already been demonstrated in a number of projects [102][21][97]. Among these charac-

teristics, hardware task preemption is perhaps the least-investigated one. The context

of a hardware task is given by all of its state-holding elements, i.e. registers and memory

elements. Depending on the actual task instance, the state-holding elements might be

distributed over its complete logic area. Related work has demonstrated the feasibility

of preemptive multitasking by two different approaches:

The first approach realizes task preemptions by stopping the clock and reading back

the device configuration. Appropriate tools can then be used to extract the task’s state

from the read-back bit-stream. The state information is used to initialize the storage

elements of the task before it is resumed [61].

The second approach shifts the responsibility for saving and restoring the context to the

hardware task itself. To that end, the task must be extended with an interface which

allows stopping and starting the task and signaling the task to save or restore its state

[Dan04b,Dan04c]. This interface together with some kind of logic that is necessary

for task communication and I/O is indicated by the right-most logic cell columns in

Figure 6.1.

6.1.3 Reconfiguration Overheads

For the following analysis, we lump together all overheads due to reconfiguration, context

save and restore, and running the scheduler. This is sensible as the reconfiguration

overhead will be by far the dominating part. The basic approach to account for the

reconfiguration overhead is to increase the tasks’ computation times by the amount of

104

6.1 Global EDF Overhead Model

time spent in the reconfiguration process. We denote treconfig as the time required to

reconfigure the entire device with a new task set. In an EDF schedule, a task preemption

can only occur when other tasks are released. For a periodic task set, we know in advance

how many task releases can appear within a certain period of time and thus we can derive

the worst-case number of preemptions Nifor a task instance Ti. While a task instance

of Tiis in execution, a task Tkmay be released dPi

Pketimes at maximum. However,

only bPi

Pkcreleases of Tkcan have a higher priority and hence cause a preemption of Ti.

Summing up, the maximal number of preemptions Nia task instance Timay experience

accounts to:

Ni=X

Tk∈ΓPi

Pk−1 (6.1)

Analyzing the proposed 1D placement and reconfiguration scheme, we obtain the fol-

lowing observations:



An instance of a task Tiis shifted upwards or preempted only when higher priority

task instances are released. In the worst case, this may happen Nitimes.



An instance Ti,j of a task Tiis shifted downwards or resumes from preemption only

when higher priority task instances terminate. On one hand, a number of higher

priority tasks will already be running when Ti,j starts execution. These tasks

might terminate and cause Ti,j to shift downwards. In the worst-case, there might

be as much as n−1 such tasks, nbeing the overall number of tasks. On the other

hand, we can have Nihigher-priority task instances released and terminated while

Ti,j is running. Again, the terminations cause Ti,j to shift downwards. The total

number of times a task may be shifted downwards or resumes from preemption is

then bounded by:

Mi=Ni+ (n−1) (6.2)

Generally, the number of tasks already running when Ti,j starts will be much

smaller than n−1. We can derive a better bound employing an area argument:

Obviously, the tasks running at the same time as Ticannot occupy more than

A(H)−Aiof the device’s area resource.

Oi= max

∀Rl⊆Γ\Ti|Rl|:X

Tk∈Rl

Ak≤A(H)−Ai(6.3)

Thus, we can bound the number of tasks causing a shift downward or a resume

by:

Mi=Ni+Oi(6.4)

The total number of times a task instance may have to be reconfigured is bounded by

its upward shifts plus its downward shifts plus one for the first start. The time overhead

can be accounted for by increasing the computation time of each task to:

105

6 Execution Model and Overhead Analysis

Ci=Ci+(1+2·Ni+Oi)·treconf (6.5)

Note that we are using the full device reconfiguration time to derive this bound although

the underlying model is based on partial reconfiguration. This is necessary as one task

start or termination event can potentially lead to several task preemptions, resumes or

shifts. As practical reconfigurable devices have only one reconfiguration port, the single

tasks’ reconfigurations have to be serialized. In the worst case, we have to pessimistically

assume that the whole device undergoes reconfiguration. A more detailed analysis could

try to exploit a more realistic area-proportional partial reconfiguration time.

Based on the modified task computation times of Equation 6.5, we can compute new

time- and system-utilization values for the tasks:

i=˜

(6.6)

i=˜

i·Ai(6.7)

(6.8)

Using these modified utilization values in our EDF-FkF scheduling test of Theorem 3

results in the scheduling test which considers the reconfiguration overhead:

∀Tk∈Γ : (6.9)

US(Γ) ≤(A(H)−Amax)·1−˜

UT(Tk)+˜

US(Tk)

We will statistically evaluate the performance of the global EDF scheduler considering

reconfiguration overheads later within this chapter.

6.2 Partitioned EDF Overhead Model

6.2.1 Reconfiguration Modes and Task Placement

The partitioned scheduling approach matches perfectly the 1D partial reconfiguration

model and even task relocation is not required. Like in the 1D area partial reconfig-

uration model used for global EDF, the tasks have rectangular footprints which span

the entire device width, but may have arbitrary height. In our model, the height is

proportional to the tasks area.

The device resources are statically partitioned into slots - one for each partition-block

Gk∈χdetermined by the partitioning algorithm. The task with the largest area re-

quirement Aiamong all tasks of a partition-block defines the size of its slot. At runtime,

106

6.2 Partitioned EDF Overhead Model

for each partition-block a separate EDF scheduler decides which task is executed. Re-

call, that only one task per partition-block can be in execution at a time. For each

partition-block the task selected by EDF is configured to the device area of the accord-

ing slot. It is executed until it terminates or gets preempted by a higher priority task

of the same partition-block. Using such an execution model, a task is always executed

in the same position on the device area and this position is determined at design time.

Hence, task relocation is not an issue.

Tasks are loaded to the slots using partial reconfiguration. However, since it may happen

that more then one slot has to be reconfigured at the same time, the reconfigurations

have to be serialized which increases the delay.

6.2.2 Runtime System Requirements

The partitioned EDF scheduling approach requires a runtime environment featuring the

following characteristics:



the device is partially reconfigurable using the 1D model



the tasks do not have to be relocatable



the runtime system supports task preemption

Such a reconfiguration model is one of the simplest, among the partial reconfiguration

models (see 2.3.1). Numerous prototypes implement the partial 1D reconfiguration

based on slots of static size. Since the task position is determined at design time, the

approach may also deal with devices, which have a not fully homogeneous resources

array.

6.2.3 Reconfiguration Overheads

Accounting to the reconfiguration overhead for the partitioned schedule is much easier,

than it has been for the global schedule. The reason is that each partition-block repre-

sents a task system on its own. Whenever preemption occurs, only the according slot is

reconfigured and the tasks executing in other slots do not get interrupted.

Let Glbe the partition block which contains Ti. The maximum number of times an

instance of task Timay become preempted is given by:

Ni=X

Tj∈GlPi

Pj−1 (6.10)

Equation 6.10 has the same form, as Equation 6.1 but considers only the tasks within

partition-block Glrather than the tasks of the entire task set Γ. Hence, the number

107

6 Execution Model and Overhead Analysis

of preemptions Nifor each task will generally be smaller than in the global scheduling

case.

Now we are ready to account for the reconfiguration overhead by increasing the compu-

tation time of all tasks in Γ. We have to add (1 + Ni) times the required time for the

device reconfiguration1, once for reconfiguring the task before it starts execution and

Nitimes when it needs to resume its execution after preemption (Equation 6.11). The

time utilization of the task increases according to Equation 6.12.

i=Ck

i+ (1 + Ni)·treconf (6.11)

UT(Tk

i) = UT(Tk

i) + (1 + Ni)·treconf

(6.12)

Obviously, a partitioning χof a task set Γ might be feasible to schedule on neglecting

the reconfiguration overhead, but may become infeasible on considering the increased

time utilization of Equation 6.12. To decide whether a feasible partitioning χincluding

reconfiguration overhead exists, is a difficult problem:

The optimal partitioning χof tasks depends on the tasks’ time utilization ˜

UT, but

the actual time utilization of a task depends on the partitioning, i.e., on the tasks in

the same partition-block. One might think about packing rectangles that change their

widths during the packing process, depending on the other rectangles currently packed

to the same level. We cannot reasonably account for this cross dependency in our ILP

model.

Making the simplifying assumption that the overheads due to preemption are similar

for all partition blocks, the following two steps seem to be a reasonable heuristic:

1. Given the task set Γ and device area A(H), compute the partitioning χthat

minimizes the maximal time utilization over all partition blocks:

min max

Gl∈χUT(Gl)(6.13)

2. Account for the reconfiguration overhead of each task according to Equation 6.12.

If the updated time-utilization factors for each partition block ˜

UT(Gl) are still less

or equal to one, the partitioned schedule is still feasible.

In Section 6.4 we will statistically evaluate the performance of the partitioned EDF

scheduler considering reconfiguration overheads, where the partitioning is computed

according to the described approach.

1Note, that we have to account the time treconf for a full device reconfiguration, even if we assume

partial reconfiguration. Although the time to reconfigure a small portion of the device will be shorter

than treconf , in the worst case all partition blocks may request a reconfiguration simultaneously.

108

6.3 Server Based MSDL Overhead Model

6.3.1 Reconfiguration Modes and Task Placement

The server based MSDL scheduler has been developed targeting systems, which support

no partial but only full device reconfiguration. Particularly, the resource model (c) of

Section 2.3.1 which we called free space partitioning - time sharing is used. For a set Γ

of ntasks, the MSDL algorithm creates at maximum nservers S1, S2, . . .. Each server Si

represents a subset Vi⊆Γ of the original tasks, and whenever the server executes, all of

its represented tasks can execute. Hence, each server becomes one RHD configuration,

i.e. one device bit-stream. The configurations are loaded sequentially onto the RHD.

Which configuration is loaded is determined by a sequential EDF schedule of the accord-

ing servers. Whenever a server Sigets preempted by a higher priority server Sjall tasks

of Vibecome interrupted. Some tasks of Simay also be included in server Sj, and hence

they are only interrupted for the time the device is being reconfigured. The other tasks

are preempted and will be resumed later on when Si, or another server including these

tasks, is scheduled for execution again. However, even if a task Tkis included in Siand

Sjits state has to be stored before and restored after the reconfiguration process, since

the circuit-elements of Tkmight be placed and routed at completely different locations

on the device within the bit-streams of Siand Sj(see Figure 2.5).

6.3.2 Runtime System Requirements

The server based MSDL scheduling approach requires a runtime environment featuring

the following characteristics:



the device supports full reconfiguration during runtime (partial reconfiguration is

not required)



no task relocation during runtime is required



the runtime system supports task preemption

6.3.3 Reconfiguration Overheads

Reconfiguration overhead in an MSDL schedule occurs due to the start or resumption

of the servers. Since the periodic servers of Ω are scheduled by the EDF exactly like

periodic tasks, we know that the maximum number of times a server Sigets preempted

by a higher priority server is given by Equation 6.14, where Pi, Pjdenotes the periods

of the servers Si, Sjand not the periods of the original tasks Ti, Tj.

Ni=X

Sj∈ΩPi

Pj−1 (6.14)

109

6 Execution Model and Overhead Analysis

Hence, we can account for the time overhead due to device reconfiguration by increasing

the computation time of each server. Each server gets reconfigured one time when it

starts execution and may be reconfigured Nitimes, when it resumes from preemption.

Ci=Ci+ (1 + Ni)·treconf (6.15)

After increasing the server computation time, we can update the time-utilization of the

server set Ω.

UT(Ω) = X

Si∈Ω

(6.16)

The time-utilization ˜

UT(Ω) includes the worst case time-overhead that can occur due

to device reconfiguration. Hence, a scheduling test for an MSDL generated server set

which accounts for the reconfiguration overhead is given by:

UT(Ω) ≤1 (6.17)

6.4 Performance Evaluation Including Overhead

In this section, we evaluate the influence of the reconfiguration overhead on the schedul-

ing performance of the proposed algorithms. Therefore, we simulated all algorithms

on our standard benchmark BMstd, but this time including the reconfiguration over-

head.2The reconfiguration time has been varied between 0.1,1,10 time-units, which

corresponds to 1%, 10% and 100% of the computation time of the shortest task.

6.4.1 Global EDF Overhead Evaluation

First, we have applied our new EDF schedulability test including reconfiguration over-

heads of Equation 6.9 to our standard benchmark BMstd. Figure 6.3-left shows the

success rates for the original scheduling test without reconfiguration overheads and with

device reconfiguration time varied between 0.1,1,10 time-units, which corresponds to

1%, 10% and 100% of the computation time of the shortest task. The results show that

the effect of device reconfiguration can almost be neglected, when the reconfiguration

time is about 1% of the runtime of the shortest task. Even with a reconfiguration time in

the range of 10% of the runtime of the shortest task, a considerable number of task sets

could be guaranteed. Only when the reconfiguration time increases further, a stronger

performance loss can be observed.

The same evaluation has been done on a second benchmark called BMstd2. It has

been generated using the same method than for BMstd, but the time-utilization factors

2The computation times and periods have been scaled by a factor of 10 to provide a higher resolution

for scheduling simulations.

110

6.4 Performance Evaluation Including Overhead

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

treconf = 0

treconf = 0.10

treconf = 1.00

treconf = 10.00

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

treconf = 0

treconf = 0.10

treconf = 1.00

treconf = 10.00

Figure 6.3: Performance of the EDF-FkF scheduling test including reconfiguration over-

heads on the benchmark BMstd (left) and BMstd2(right)

UT(Ti) and areas Aihave been chosen from the interval [0.01,0.2]. Hence, the tasks

in BMstd2could be denoted light-weight. On the other hand, the number of tasks per

task set increased to about 100. Figure 6.3-right shows the results. For small tasks with

small time-utilization, our scheduling condition is less pessimistic and accepts task sets

of 65% system utilization and more. However, the more tasks we have in the task set,

the more preemptions can occur and the higher the reconfiguration overheads are. For

benchmark BMstd2, a device reconfiguration time of 1% of the shortest task runtime

already reduces the amount of accepted task sets considerably.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

treconf = 0

treconf = 0.10

treconf = 1.00

treconf = 10.00

Figure 6.4: Performance of EDF-NF obtained by exact simulation, including reconfigu-

ration overheads (BMstd)

Figure 6.4 shows the scheduling performance for the simulated EDF-NF scheduler on

benchmark BMstd including reconfiguration overheads. Again, the device reconfigura-

tion time treconfig has been varied between 1%, 10% and 100% of the computation time

of the shortest task. The simulated execution of the EDF-NF scheduler does not suffer

111

6 Execution Model and Overhead Analysis

from the pessimism of our scheduling test. Therefore, we get reasonably small losses

in performance for reconfiguration times of values one and more orders of magnitude

smaller than the smallest task computation time. For reconfiguration times in the same

range as the task computation times, the performance drop is again quite dramatic.

6.4.2 Partitioned EDF Overhead Evaluation

To evaluate the overhead in case of partitioned scheduling methods, we applied the opt.-

part.-EDF algorithm again on our standard benchmark (BMstd). The feasibility of a

task set including reconfiguration overheads was determined according to the heuristic

method introduced in 6.2. The results are shown in Figure 6.5 with a device reconfig-

uration time varied between 0.1,1,10 time-units, which corresponds to 1%, 10% and

100% of the computation time of the shortest task.

The results show that the effect of device reconfiguration can almost be neglected,

when the runtime of the shortest task is about one magnitude higher than the device

reconfiguration time. Even at reconfiguration times in the range of the runtime of the

shortest task, a considerable number of task sets could be feasibly scheduled.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

treconf = 0

treconf = 0.10

treconf = 1.00

treconf = 10.00

Figure 6.5: Performance of opt.-partitioned-EDF including reconfiguration overheads

(BMstd)

6.4.3 Server Based MSDL Overhead Evaluation

Also the impact of reconfiguration overheads on the MSDL scheduling method has been

evaluated on the BMstd benchmark. The results in Figure 6.6 show that a reconfigura-

tion time of 0.1, which corresponds to 1% of the computation time of the shortest task,

decreases the performance of MSDL almost negligibly. For a reconfiguration time of 1.0,

the performance drop becomes considerable, but even for a reconfiguration time of 10,

a considerable amount of task sets can be feasibly scheduled.

112

6.5 Conclusion on Execution Models and Overhead

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

average system utilization US(Γ)

relative success rate

treconf = 0

treconf = 0.10

treconf = 1.00

treconf = 10.00

Figure 6.6: Performance of MSDL including reconfiguration overheads (BMstd)

6.5 Conclusion on Execution Models and Overhead

For the three proposed scheduling approaches, this chapter discussed the runtime re-

quirements, modeled the reconfiguration overhead, included it within the schedulability

tests and finally evaluated the impact of the overhead in simulation experiments.

Runtime Requirements: We discussed the execution models and the according over-

head for all of our proposed scheduling algorithms. We showed how the schedule ob-

tained by a certain algorithm can actually be executed on a reconfigurable device and

what are the resulting requirements to the runtime-system. While all scheduling ap-

proaches create preemptive schedules and hence require a mechanism to store and recover

the state of tasks, the requirements concerning the reconfiguration model are different.

The global EDF schedulers have the hardest requirements - the runtime system has to

support 1D partial reconfiguration and task relocation. This makes the implementation

rather difficult, and requires target architectures with a homogeneous arrangement of

resources. The partitioned EDF schedulers require 1D partial reconfiguration as well,

but no task relocation, since a task will always have the same place on the device area.

A static partitioning of the device area into slots is suitable - which matches perfectly

the XILINX supported partial design flow for Virtex FPGAs [71]. Moreover, the sim-

plicity of the model suggests that it can be implemented on most devices that support

partial reconfiguration in any form. The server based MSDL scheduler has the least

requirements to the reconfiguration model. When switching from one server to another,

the entire device area is reconfigured. Hence, this scheduler can be used with any device

that supports runtime reconfiguration.

Reconfiguration Overheads: Concerning the time overheads due to the device re-

configurations at runtime, we have derived extended scheduling conditions for all our

scheduling approaches. The basic approach was, to derive an upper bound on the num-

ber of preemptions or relocations a task or server may undergo in the worst case. Doing

113

6 Execution Model and Overhead Analysis

so, we could account for the overhead by increasing the computation time with a time

penalty for each preemption or relocation. The scheduling condition for each algorithm

could be used with the modified computation time, and hence take the configuration

overheads into consideration. Partial device reconfiguration is usually faster then con-

figuring the entire device. Up to now we failed to take advantage from this fact, since in

the worst case all tasks have to be configured onto the device at the same time. However,

we can expect that the overall reconfiguration time penalty can be reduced by a deeper

analysis.

Our performance evaluation shows that the reconfiguration time has considerably more

impact on the global EDF schedulers, than on the partitioned EDF or the server based

MSDL schedulers. One reason is, that in a global EDF schedule each preemption may

not only effect the preempted task but also cause a relocation of other running tasks.

In a partitioned EDF schedule, only tasks within the same partition can preempt each

other. Moreover, a preemption within one partition does not disturb the execution of

the tasks in other partitions. In a MSDL schedule, the maximum number of preemptions

depends on the number of servers, which is usually smaller then the number of tasks.

However, for all schedulers the reconfiguration overheads increase more or less with the

number of tasks. Partly, a drop in performance can be compensated. Larger task sets

are easier to schedule, at least for global EDF and partitioned EDF, since they contain

smaller tasks which leads to less resource fragmentation.

In conclusion, the evaluation on our standard benchmark shows that the performance

drop due to the reconfiguration overhead is acceptable, when the task computation time

is at least one order of magnitude higher than the device reconfiguration time.

114

CHAPTER 7

Model Extensions

Within this chapter we will present two important extensions to our simple task model

of Chapter 3. By capturing more of the typical characteristics of hardware-tasks, we

aim to improve the resource management of the reconfigurable system.

The first model extension allows specifying not only one but several implementation

variants per task – each with different area and computation time. These variants can

be considered in a task schedule and our results show a significant improvement in

scheduling performance. These results have been also presented in [DP06b].

The second model extension captures the memory requirements of application tasks.

When tasks require a considerable amount of data memory, they will typically use

dedicated RAM which can be either block RAM inside the reconfigurable device or

external RAM banks. We will show that sharing such RAM banks among tasks without

jeopardizing timing constraints is feasible, reduces the overall memory requirements and

can even be used for inter task communication. The main results of this section have

been also presented in [DP05b].

7.1 Periodic Tasks with Variants

One, if not the fundamental characteristic of reconfigurable hardware, is that a certain

computational job can be implemented in many alternative ways, exploring the tradeoffs

between fast but resource-consuming and slower but resource-saving versions. Even

third party IP-cores can often be obtained as several variants, differing in area and

speed, as Example 1of the DCT computation in Section 2.2 has shown. Assume a

system consisting of several periodic real-time tasks has to be realized, and for all or

some of the tasks we have alternative implementations to choose from. This raises a

question, which implementation variant to choose for which task. We may gain an

115

7 Model Extensions

advantage by selecting particular variants, since we may be able to greatly reduce the

resource fragmentation and hence increase the scheduling performance.

First we extend our task model in such a way that makes us able to deal with implemen-

tation variants. Afterwards, we present for the partitioned EDF scheduler a detailed

method to obtain the optimal task variant selection. It follows a performance evaluation

which demonstrates the performance increase compared to task sets scheduled without

considering variants (Section 7.1.4). For the global EDF and the server bases scheduler,

we have not yet developed methods to deal with task variants. Hence, we will discuss

only rough approaches at the end of this section.

7.1.1 Variant-Rich Tasks

We consider the scheduling of a set of nvariant-rich periodic tasks z={Γ1,Γ2, . . . Γn}

onto a reconfigurable device H. Each variant-rich periodic task Γi∈zrefers to some

computation which has to be performed periodically. The instances Γi,j of task Γiare

released with period Pi. Each task instance is associated with an absolute deadline di,j.

We assume that the relative deadline of Γiequals its period.

Each variant-rich task Γiexists in one or more implementation variants Γi={T1

i, T2

i, . . .}.

Each variant Tk

i∈Γiis specified by a worst-case computation time Ck

iand the amount

of required reconfigurable logic resources Ak

i. Table 7.1 shows an example task set with

four periodic tasks z∗={Γ1,Γ2,Γ3,Γ4}. The first two tasks have two implementation

variants.

ΓiTk

iPiCk

iAk

iUT(Tk

i)US(Tk

Γ1T1

112 3 6 1/4 3/2

16 3 1/2 3/2

Γ2T1

24 2 4 1/2 2

21 8 1/4 2

Γ3T1

36 5 3 5/6 5/2

Γ4T1

412 2 2 1/6 1/3

Table 7.1: Example of a variant-rich task set z∗

The notations used for variant-rich periodic real-time tasks and their instances are

summarized in the following list:

zdenotes a set of periodic or sporadic variant-rich tasks

Γidenotes a generic variant-rich periodic task

Γi,j denotes the jthe instance of variant-rich task Γi

Pidenotes the period of Γi

Didenotes the relative deadline of Γi

idenotes the kth implementation variant of Γi

idenotes the computation time of variant Tk

idenotes the area requirement of variant Tk

116

7.1 Periodic Tasks with Variants

ri,j denotes the absolute release or arrival time of Γi,j

di,j denotes the absolute deadline of Γi,j

si,j denotes the start time, when Γi,j begins its execution

fi,j denotes the finishing time of Γi,j

Feasible Schedule of Variant-Rich Tasks

In order to perform the requested execution of a task instance Γi,j, one of the task’s

implementation variants Tk

i∈Γihas to be loaded (configured) onto the device and

executed for Ck

itime units. During this execution, an amount Ak

iof the device area is

occupied.

Generally speaking, we would like to schedule and execute a set of periodic tasks on the

reconfigurable device such that each task instance meets its deadline. Let Γ denote the

unified set of all implementation variants of all variant-rich tasks.

Γ = {Γ1∪Γ2∪···∪Γn}(7.1)

A schedule for a set of periodic variant-rich tasks zis given by a function R:R+→P(Γ).

R(t) denotes the set of task implementations from Γ running at time t. A schedule is

called feasible, if for each request ri,j of task Γione of its implementations Tk

i∈Γiis

scheduled for execution for at least Ck

itime units within the interval given by the release

time and deadline of Γi.

We consider only schedules R, where the selection of variants is static. Hence, the same

variant Tk

iof a task Γiis used for the entire schedule. We express a selection by the

function f(i), that selects exactly one implementation variant Tf(i)

ifor each task Γi.

The set of all selected implementations is denoted by Γ|f={Tf(1)

1, Tf(2)

2, . . .}.

For example, Figure 7.1 displays a possible schedule for the variant-rich task set shown in

Table 7.1 on a device with an area of 10 units. The upper part of Figure 7.1 indicates the

release times and deadlines of the tasks. The lower part illustrates which implementation

variants of the tasks are executed, and how they share the device area over time. In

this example schedule always the first variant of each task is chosen. It occurs one task

preemption at time 4, which is denoted by a dotted line. The schedule shown can easily

be proven feasible, because every task instance meets its deadline for the entire hyper-

period of the task set (which amounts to 12 time units). The hyper-period is the least

common multiple of all task periods. Recall that a feasible schedule defined over the

hyperperiod can be repeated an infinite number of times.

Utilization Metrics for Variant-Rich Tasks

We have already introduced the time-utilization UTand the system-utilization USfor

tasks and task sets of our basic model (without variants). In our extended model (with

variants), the metrics are defined for implementation variants Tk

iof a variant-rich task

Γiand not for the task itself.

117

7 Model Extensions

2,1T1

1,1

3,1

4,1

1,1

2,2T1

2,3

3,2

Γ1,Γ2,Γ3,Γ4Γ2Γ3Γ2Γ1,Γ2,Γ3,Γ4

0 1 2 3 4 5 6 7 8 9 10 11 12

device area

time

Figure 7.1: Preemptive global schedule of variant-rich tasks

The time-utilization given by

UT(Tk

i) = Ck

(7.2)

is the fraction of time a certain task implementation Tk

ioccupies the device in order to

complete its execution. The cumulative time-utilization for a set of task-variants ˜

Γ⊆Γ

is defined as

UT(˜

Γ) = X

i∈˜

UT(Tk

i).(7.3)

Obviously, a set of task implementations ˜

Γ cannot be executed sequentially if UT(˜

Γ) >1.

In the same way we define the system-utilization metric, which captures the degree by

which the device is utilized by Tk

US(Tk

i) = UT(Tk

i)·Ak

i(7.4)

In order to measure the load generated by a variant-rich task set z, we now define

the total system-utilization of zby summing up the system utilization factors of all

variant-rich tasks. If a task has several alternative implementations, we account for the

one with the minimal system-utilization:

US(z) = X

Γi∈z

min

i∈ΓiUS(Tk

i)(7.5)

Thus, US(z) is the minimum amount of combined area and time required by the variant-

rich task set. If US(z)> A(H)·1, no feasible schedule exists since the system is utilized

118

7.1 Periodic Tasks with Variants

by more than 100%. We use the USmetric in our simulation study, in order to rate the

performance of the proposed scheduling algorithm.

Differences between the model with- and without implementation variants

What has been a task Tiin our old model, is now represented by a variant-rich task

Γi={T1

i, T2

i, . . . , Tm

i}. Each implementation variant of the new model Tk

ihas the same

structure as a task of the old model (period, deadline, computation time, area) - that

is why we use the same symbol T.

What has been a task set Γ in our old model, is now represented by a set of variant-rich

tasks z={Γ1,Γ2, . . . , Γn}. Since a variant-rich task Γiof the new model has the same

structure as a periodic task set in our old model, we use the same symbol Γ.

7.1.2 Partitioned Scheduling of Variant-Rich Tasks

We can apply the concept of partitioned scheduling straight forward to our generalized

task model with implementation variants.

Definition 12 (partitioned schedule of variant-rich task sets).A schedule Rfor a pe-

riodic variant-rich task set zis said to be partitioned by χwith selection f, if the

following three statements hold:

1. f(i)is a function that selects exactly one implementation variant Tf(i)

ifor each

task Γi. The set of all selected implementations is denoted by Γ|f={Tf(1)

1, Tf(2)

2, . . .}.

2. χ={G1, G2, . . . , Gm}is a partitioning of Γ|f. That is, χis a set of disjoint

subsets of Γ|f, called partition blocks, such that the union of all partition blocks

results in Γ|f.

3. At any point in time, at most one task implementation of each partition block is

executed on the reconfigurable device.

Each partition block is exclusively assigned a certain area of the reconfigurable device.

The area requirement of a partition block A(Gj) is determined by its largest task; the

required overall device area A(χ) is given by the accumulated area over all partitions:

A(Gj) = max

i∈Gj

(Ak

i), A(χ) = X

Gj∈χ

A(Gj) (7.6)

Since each partition block presents an independent task system of its own, we can easily

formulate a test to figure out whether a given partitioned schedule is feasible or not by

applying the basic results from single-processor EDF theory:

Lemma 6 (EDF feasibly partitioned schedule of variant-rich tasks).Given a periodic

variant-rich task set z, let fbe a selection function and Rbe a schedule of the selected

task variants Γ|fwith partitioning χ, such that the tasks of each partition Gi∈χare

scheduled separately by EDF. The schedule is feasible on the reconfigurable device Hif:

119

7 Model Extensions



A(χ)≤A(H), i.e., all partitions fit onto the device, and



∀Gi∈χ:UT(Gi)≤1, i.e., all partitions have a time utilization of less than 100%.

Proof. The tasks implementations within each partition-block can be separately sched-

uled by EDF, such as in a uniprocessor system. Since the EDF uniprocessor utilization

bound holds for all partitions, all deadlines are met.

2,1

1,1

3,1

4,1

1,1

2,2

2,3

3,2

Γ1,Γ2,Γ3,Γ4Γ2Γ3Γ2Γ1,Γ2,Γ3,Γ4

0 1 2 3 4 5 6 7 8 9 10 11 12

device area

time

4,1

0 1/4 1/2 3/4 1

Figure 7.2: Example of a partitioned EDF schedule of a variant-rich task set

Figure 7.2 illustrates an EDF partitioned schedule on the example of the task set Γ∗

shown in Table 7.1. In this example, the first variant for each task has been selected, i.e.,

Γ∗

|f={T1

1, T1

2, T1

3, T1

4}, and two partition-blocks have been generated, e.g., χ={G1=

{T1

1, T1

3, T1

4}, G2={T1

2}}. The left-hand side of Figure 7.2 presents the partitioning χ

and illustrates the division of the device area. The according partitioned EDF schedule

Ris shown on the right-hand side of Figure 7.2. Since both partition blocks have a time

utilization of less than 100%, the schedule is feasible:

UT(G1) = 1/2+1/4+1/6≤1, UT(G2) = 5/6≤1

Based on Lemma 6we are able to define an optimization problem in order to answer

our scheduling questions from Section 3.2. To this end, we formulate the problem of

finding the least resource-consuming EDF Feasibly Partitioned Schedule (EFPS) for a

variant-rich task set:

Definition 13 (EFPS of variant-rich tasks problem).Given a periodic variant-rich task

set z, find:



a function fthat selects a variant for each task and



a partitioning χof the selected variants Γ|f,

120

7.1 Periodic Tasks with Variants



subject to UT(Gi)≤1 : ∀Gi∈χ,



minimize A(χ).

Solving the EFPS of variant-rich tasks problem answers directly our formulated schedul-

ing questions: The smallest device that allows to feasibly execute zis of size A(H) :=

A(χ). The according decision problem is also solved. Given zand a device H,zcan

be feasibly scheduled if A(H)≥A(χ). From the resulting partitioning χ, we can easily

derive the scheduling function R.

This problem can be optimally solved using an integer linear programming approach,

which we will present in the next section. The development and evaluation of further

heuristics is omitted at this place and left for future work.

7.1.3 Optimal Partitioning by ILP

In this section we develop an extension of the binary ILP model for the 2-dimensional

level strip-packing problem [73], which we used to compute partitioning for tasks without

variants in Section 4.2. Our extension is able to deal with implementation variants for

tasks. Instead of modeling each task by a single rectangle, we apply the ILP model

on the set of rectangles corresponding to all task variants. The ILP constraints are

modified, such that one and only one variant of each task is packed.

According to the model of [73], we make the following basic assumptions:

Order of task variants: We sort and renumber the set of all variants Γ, according to

non-increasing area of the task variants. Let ¯

Γ denote the sorted set:

Γ = {T1

1, T2

1, T3

1, . . . , T1

2, T2

2, . . . , T1

n, . . . }(7.7)

Γ = {¯

T1,¯

T2,¯

T3, . . .},(7.8)

A(¯

Tj)≥A(¯

Tj+1), j = 1, . . . , |¯

Γ|−1

The index function δ(i, k) maps the index of a variant Tk

iin Γ to the according

task variant in ¯

Γ, i.e., ¯

Tδ(i,k)=Tk

Order of partition blocks: The task implementation ¯

Tlhaving the largest area within

a partition block is said to initialize the partition block. The index lis also used

to index the partition block, i.e., if the largest variant in a partition block is ¯

the partition block is denoted as Gl.

It follows, that a partitioning χhas m=|Γ|potential partition blocks G1. . . Gm

(some may be empty), and that the partition blocks are sorted by non-increasing

areas, i.e., A(Gl)≥A(Gl+1). Further, the area of a partition block is equal to the

area of the task variant that initializes the partition block, i.e., A(Gl) = A(¯

Tl).

121

7 Model Extensions

Considering an arbitrary solution to our formulated EFPS of variant-rich tasks problem,

we can now, based on the assumptions made above, differentiate three cases for each

task implementation variant ¯

Tj∈¯

Γ:



Tjis not packed in any Glsince another implementation variant of the task is

selected and packed.



Tjinitializes a partition block, i.e., Tj∈Gj.



Tjis packed into a partition block with greater area, forcing the partition block

index to be smaller than j.

The assumptions made are without loss of generality but reduce the search space con-

siderably, since a task implementation Tk

icannot be assigned to an arbitrary partition

block Gl. When modeling the ILP, we use the binary decision variable xl,j to indicate

that the task variant ¯

Tjis packed into partition block Gl:

xl,j =(1 if ¯

Tj∈Gl

0 if ¯

Tj/∈Gl

with l={1, . . . , |Γ|}, j ={1, . . . , |Γ|:j≥l}(7.9)

If xl,j = 1 and j=l, the partition block Glis said to be initialized and its largest task

variant is ¯

Tl. If xl,l = 0, Glis uninitialized and empty. The resulting ILP has three

main components:

1. The cost function accumulates the area required by all non-empty partition blocks,

hence minimizing the overall area A(χ).

2. A set of constraints ensures that for each variant-rich task Γiexactly one variant

iis packed into one of the partition blocks. For the variant-rich task Γiwith m

variants this can be expressed by the term:





δ(i,1)

l=1

xl,δ(i,1)

+



δ(i,2)

l=1

xl,δ(i,2)

+···

+



δ(i,k)

l=1

xl,δ(i,k)

+···+



δ(i,m)

l=1

xl,δ(i,m)

= 1 (7.10)

The first term of Equation 7.10 equals one, if T1

iis packed into some partition

block, the second term checks whether variant T2

iis packed into some partition

block, and so on. Note, that by convention a variant Tk

ican only be packed into

G1, . . . , Gδ(i,k). Therefore, the upper bounds of the sum indices are given by δ(i, k).

3. Another set of constraints enforces EDF schedulability. Each non-empty partition

block must have a time utilization less or equal than 1, i.e., UT(Gl)≤1.

122

7.1 Periodic Tasks with Variants

The final binary ILP can be written as:

min

|Γ|

l=1

A(¯

Tl)·xl,l

|Γi|

k=1 



δ(i,k)

l=1

xl,δ(i,k)

= 1, i ∈ {1, . . . , n}

j=l

UT(¯

Tj)·xl,j ≤1·xl,l, l ∈ {1, . . . , |Γ|}

(7.11)

Let nbe the number of tasks in zand |Γ|be the cumulative number of variants over

all tasks. The corresponding ILP contains |Γ|·(|Γ|+ 1)/2 binary variables and n+|Γ|

constraints.

Example illustrating the binary ILP with task variants

To illustrate the ILP model, we consider again the example task set in Table 7.2.

ΓiTk

iPiCk

iAk

iUT(Tk

i)US(Tk

Γ1T1

112 3 6 1/4 3/2

16 3 1/2 3/2

Γ2T1

24 2 4 1/2 2

21 8 1/4 2

Γ3T1

36 5 3 5/6 5/2

Γ4T1

412 2 2 1/6 1/3

Table 7.2: Example of a variant-rich task set z∗

The four tasks of z∗have a total of six task implementations (the set Γ) which get

sorted and re-numbered according to non-increasing areas. The resulting set ¯

Γ is given

by:

Γ = {¯

T1:= T2

2,¯

T2:= T1

1,¯

T3:= T1

2,¯

T4:= T2

1,¯

T5:= T1

3,¯

T6:= T1

4}(7.12)

Figure 7.3 shows the optimal partitioning χ, which consists of the two partition blocks

G3={T2

1, T1

2}and G5={T1

3, T1

4}. Since |Γ|= 6, the ILP model contains 6·(6+1)/2 =

21 binary variables. The configuration of the decision variables xl,j is shown in the table

beside the figure. The required overall device area is 7 units. For comparison: If each

task comes only in one variant (the first one), the optimal partitioning would require

9 units of device area. Figure 7.4 shows the optimal partitioning and the according

decision variable configuration, in case only the first variant of each task is considered.

123

7 Model Extensions

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

3=¯

area of partitioning

time utilization of partition−blocks

1=¯

2=¯

=¯

l xl,1xl,2xl,3xl,4xl,5xl,6

1 0 0 0 0 0 0

2 - 0 0 0 0 0

3 - - 1 1 0 0

4 - - - 0 0 0

5 - - - - 1 1

6 - - - - - 0

Figure 7.3: Optimal partitioning of z∗and the according decision variable configuration

0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1

3=¯

area of partitioning

time utilization of partition−blocks

1=¯

2=¯

T2T1

=¯

l xl,1xl,2xl,3xl,4xl,5xl,6

1 1 1 0 1 0 0

2 - 0 0 0 0 0

3 - - 1 0 0 0

4 - - - 0 0 0

5 - - - - 0 0

6 - - - - - 0

Figure 7.4: Sub-optimal partitioning of z∗, considering only the first variant of each

task.

7.1.4 Performance Evaluation for Variant-Rich Tasks

Introducing implementation variants for tasks makes it more likely, that an EDF feasible

partitioned schedule exists for a reconfigurable device of given size. However, finding such

a schedule becomes computationally more complex since the design space grows rapidly

with the number of implementation variants. In order to evaluate the benefit we gain

by introducing implementation variants we present a simulation study.

To create a benchmark of task sets without and with variants, we used the following

procedure: First, we generated periodic task sets z(with only one variant per task)

with various system utilization by composing randomly generated task implementations

according to our BMstd parameters. For each task Γ1={T1

i}, the area A1

iwas chosen

according to a uniform distribution in [0.1,0.5] and the computation time and period

124

7.1 Periodic Tasks with Variants

were chosen1such that UT(T1

i) is uniformly distributed in [0.1,0.5]. Such a setup creates

task sets of up to 15 tasks. Based on these task sets, we created task sets with variants

using the following methods:

3 variants per task: To the existing variant T1

iof each task, we added one variant

iwith half the computation time but double the area requirement, i.e., T2

(P1

i,1

2·C1

i,2·A1

i), and another variant with doubled computation time but only

half the area requirement, i.e., T3

i= (P1

i,2·C1

i,1

2·A1

i).

1 to 5 variants per task: For each task the number of variants was randomly chosen

from 1 to 5. Each variant Tk

iwas created choosing a form factor qfrom the

continuous interval [1,4] and a sign sfrom {−1,+1}. The computation time and

area were set to Ck

i=C1

i·qsand Ak

i=A1

i·q−s, respectively. Note, that each

variant still has the same system utilization as the initial variant since US(Tk

i) =

i/Pi·Ak

i=C1

i·qs/Pi·A1

iq−s=US(T1

i).

0 0.2 0.4 0.6 0.8 1

0.2

0.4

0.6

0.8

avarage system utilization US(Γ)

relative success rate

opt. part. EDF without variants

opt. part. EDF, 3 varinats per task

opt. part EDF, 1 to 5 varinats per task

Figure 7.5: Performance of opt.-part.-EDF on task sets with implementation variants

Figure 7.5 shows the relative scheduling success rates, depending on the task sets system

utilization, when scheduled onto a device Hof size A(H) = 1. For an average system

utilization of 0.7, the optimally partitioned EDF schedule produced feasible solutions

for about 80% of the task sets in the reference case (with only one variant per task).

Introducing two additional implementation variants for each task yields an enormous

advantage - even task sets with average system utilization of 0.87 could be scheduled

with an 80% success rate. If some tasks have more and some tasks have less variants,

1First UT(T1

i) was chosen from [0.1,0.5], then C1

iwas chosen from {1,...,30}and the period was set to

i= round (C1

i/UT(T1

i)).

125

7 Model Extensions

we can also observe a great improvement over the reference case. 80% of the task sets

with average US≃0.83 could be successfully scheduled.

To solve the ILP model, we employed the lp solve library version 4.0 on a 2.8 GHz

Pentium 4 machine. We were able to optimally solve most problem instances with a size

of |Γ|up to 30 in less than 15 seconds. That is, we could compute the schedule for 30

tasks with only one implementation variant, or for 10 tasks with three variants each. In

particular, from the 1000 task sets generated for the evaluation in Figure 7.5, only 28

exceeded the 15 second time limit.

7.1.5 Approaches for Global- and Server Scheduling of Variant-Rich Tasks

Selecting the right task variants to improve the scheduling performance of our global

EDF schedulers or our server based scheduler is rather difficult. Up to now, we have

not explored the possibility of choosing variants for these schedulers deeply. We rather

present some rough approaches and leave their evaluation for future work.

Global EDF for Variant-Rich Tasks

The simulation results of our performance study in Section 5.2 have shown, that the

global EDF-NF scheduler performance is almost independent of the ratio between area

and task computation time. Hence, we cannot expect an improvement of the schedul-

ing performance by following simple rules like select for each task the variant with the

smallest computation time (respectively smallest area).

A more constructive method can be developed, by considering the global EDF schedu-

lability test of Theorem 4which we have derived in Chapter 4. The schedulability test

allows us to rate whether a given selection of implementation variants is guaranteed to

be feasible by EDF-FkF or not. Hence, we can formulate the following optimization

problem:

Definition 14 (EDF-FkF scheduling condition variant selection problem).Given a

periodic variant-rich task set z, find:



a function f(i)that selects one variant for each task Γi



such that for the set of selected variants Γ|fthe EDF-FkF scheduling test holds,

i.e.,

∀Tk

i∈Γ|f:US(Γ|f)≤(A(H)−Amax)·1−UT(Tk

i)+US(Tk

where Amax is the largest area of all implementation variants in Γ|f.



minimize A(H).

Solving this problem yields the best selection fof implementation variants of our variant-

rich tasks in the sense, that it is guaranteed by our scheduling condition that EDF-FkF

126

7.1 Periodic Tasks with Variants

finds a feasible schedule on a device of size A(H). For any other selection ˜

fwe would

require a device of equal or larger size.

One idea to solve such problem is, to apply a suitable meta heuristic such as simulating

annealing,tabu search or a genetic algorithm. In this case, the heuristic would propose

new selections fand then evaluate the scheduling condition which can be done in linear

time. Evaluation of such an approach is subject to future work.

Server Based MSDL of Variant-Rich Tasks

Similar to the global scheduler, it is rather difficult to find a good variant selection for

the server based MSDL scheduling method. However, we will formalize the optimization

problem and present a simple heuristic as rough draft.

The problem of selecting the right variants for an MSDL schedule can be formulated as

follows:

Definition 15 (MSDL scheduling variant selection problem).Given a variant-rich task

set z, find a selection function fand a device Hsuch that when MSDL is applied

to the selected variants it creates a feasible set of servers and the device size A(H)is

minimized, i.e.



Ω = MSDL(Γ|f, H)with UT(Ω) ≤1and



minimize A(H).

Also meta-heuristics can be considered for this problem, their runtime may be unac-

ceptably long due to the high time complexity of the MSDL procedure which has the

order O(n4). Instead, we find it more reasonable to develop problem specific heuristics.

In contrast to the performance of the global EDF schedulers, the server based MSDL

scheduler has appeared to be very sensitive to the ratio of the task area to the task

time-utilization (Section 5.2). Figure 5.9 has shown, that MSDL performs better on

tasks with large area but small time-utilization. Hence a heuristic rule such as select

for all tasks the variant with the smallest computation time seems to be promising. Once

again a detailed evaluation remains subject of future work.

In summary, we have shown that considering task variants considerably improves the

scheduling performance. We evaluated this for the partitioned EDF scheduler in detail

and also presented rough approaches how the selection of variants could be utilized in

the global and server based schedulers.

The next part of this chapter deals with an extension of our basis model that captures

the memory access of tasks under real time constraints.

127

7 Model Extensions

7.2 Periodic Tasks with Memory Access

Typical hardware tasks process data in a streaming fashion; hence they demand large

memory buffers and access them periodically. Up to now we have neglected the issue

of the task’s data access under real-time constraints. In this section, we extend our

application model by logical data buffers, which are used by the tasks. The problem

of assigning data buffers to physical memories for a given application is addressed. We

show that sharing physical memory banks between tasks is feasible and reduces the

required memory resources. Moreover, knowledge of the actual task schedule helps to

reduce the memory requirement even further.

Usually, a reconfigurable computing platform provides several relatively large external

banks of SRAM to store data processed by hardware tasks. With today’s fast-growing

FPGA capacities the number of tasks that can run simultaneously will in general be

much higher than the number of external memories. Furthermore, hardware tasks op-

timized for throughput may access several memories at the same time. Hence, a simple

one-to-one assignment between tasks and memories is unrealistic. Memories will be

shared among tasks, and tasks will access several memories.

The required versatile memory assignment opens up two issues. First, a proper ad-

dressing mechanism has to be introduced to ensure that data of one task is stored and

addressed separately from the data of other tasks – unless the intention is implementing

inter-task communication via shared memory access. We have solved this issue by a

page table mechanism that maps logical address to physical addresses [Dan04b].

The second and more challenging issue is resolving conflicts when simultaneously exe-

cuting tasks attempt to write or read from the same memory bank at the same time.

Since the memories are usually single-port memories, only one access can be performed

per clock cycle. A simple mutual exclusion mechanism protecting the memory port is

insufficient, since this would lead to an unpredictable timing and could enlarge the worst

case execution time of a task.

Our approach to solve the problem of sharing physical RAMs but guarantee timing is

to apply a fix-priority scheduling mechanism to determine the order of memory accesses

in case of access conflicts. To this end we assume that the tasks access memory in a

deterministic and known pattern. Particularly, we assume periodic RAM accesses such

as demonstrated by the DCT-task in Example 1of Section 2.2.

Before introducing our model, we continue the example to illustrate the problem:

Example 4. Assume two simultaneously executing tasks, each computing an 8-point 1D

discrete cosine transform (DCT) on its own data set. Figure 7.6-(a) shows the mapping

of the two tasks onto an FPGA. For this example, we use two different variants of the

DCT core [107] described in Section 2.2. We repeat the characteristics of the DCT core

in Table 7.3.

Task 1 has to process a set of 500 words and is implemented using variant 2 while Task 2

has to process a set of 2000 words and is implemented using variant 3. As denoted in

the scheduling diagram of Figure 7.6-(b), Task 1 reads from and writes to its RAM banks

128

7.2 Periodic Tasks with Memory Access

variant FPGA area throughput ext. RAM port access

(slices) (sample/clk) period amount utiliz.

1 982 1 2 clks 1

32 bit word 50%

2 767 8 / 9 9 clks 4

32 bit words 45%

3 589 8 / 17 17 clks 4

32 bit words 24%

Table 7.3: Size, throughput and memory access pattern of several variants of a XILINX

8 point 1D-DCT IP core [107].

4 words every 9 clock cycles, while Task 2 makes 4 accesses every 17 clock cycles. Due

to the FIFO buffers residing within the hardware tasks, the requested RAM accesses do

not have to take place exactly in the clock cycles shown in Figure 7.6-(b), but can be

scheduled any time within their period of 9 and 17 clock cycles, respectively. The question

remains, whether four separate RAM banks are required as shown in Figure 7.6-(a), or

if fewer RAM banks can be shared between the tasks. If we assume that all RAMs have a

size of 4000 words, the input data of Task 1 and Task 2 could be stored in SRAM1 and

the output data of both tasks in SRAM2. If we assign the accesses of Task 1 a higher

priority than the accesses of Task 2, all memory accesses can be feasibly scheduled, as

shown in Figure 7.6-(c).







 

 

 

 

 









 

 

 

 

 





!"

"

#$

#%$

#&$

'( ' ( '





!"

"

' ( ' ( '







 

 

 

 

 









 

 

 

 

 







!"

"



#$







!"

"



#%$

#&$

Figure 7.6: (a) Application of Example 4mapped to an reconfigurable platform; (b)

schedule using four memory banks; (c) schedule using two memory banks

7.2.1 The Buffer Assignment Problem

The example has shown that physical RAM banks can be shared among simultaneously

executing tasks. We now formalize the problem of minimizing the number of required

memories. The following assumptions form the basis of our model:

129

7 Model Extensions



Tasks use memory in the form of logical buffers, which are accessed periodically.

The memory accesses must be completed within the given period, i.e., deadlines

equal periods.



Each task can use many logical buffers.



A logical buffer can be used by several tasks.



Each logical buffer is assigned to one and only one physical memory.



Write and read accesses of tasks to physical memories are scheduled using the

fixed-priority Rate Monotonic (RM) algorithm.

Definition 16 (memory-demanding RT application).Amemory-demanding RT appli-

cation Υ is a tuple Υ = (Γ, B, s, L, ω, α).Γis a set of periodic tasks. B={B1, B2, . . . }

is a set of logical buffers representing the memory requirements of the application. The

size of a buffer is given by s:B→N, where s(Bi)denotes the size of Biin terms of

words. L⊆Γ×Bis a relation from Γto B, expressing which task is using which buffer,

where (Ti, Bj)∈Ldenotes that Tiis using buffer Bj. We assume that a task is using

a buffer by a pattern of periodic write or read accesses. ω:L→Nspecifies the access

periods in clock cycles, where ω(Ti, Bj)represents the period of accesses of task Tito

buffer Bj.α:L→Nis a function, where α(Ti, Bj)denotes the number of requested

read or write accesses of task Tito buffer Bi, which have to be performed within one

period ω(Ti, Bj).

Note, that the relation Lbetween tasks Γ and logical buffers Bis a many–to–many

relationship, which allows a task to use several logical buffers, but also one logical buffer

to be used by several tasks.

The mapping of the logical buffers to the physical memories of a platform is modeled

by a buffer assignment function:

Definition 17 (buffer assignment:).Abuffer assignment of a given application Υis

an injective function a:B→M, which assigns each logical buffer Bi∈Brequired

by the application to one physical memory bank Mk∈Mof the platform. For simpler

notation, we define the binary variables ˆai,k ∈ {0,1}such that ˆai,k = 1 if a(Bi) = Mk

and 0otherwise. I.e., ai,k = 1, iff buffer Biis assigned to memory Mk. A buffer

assignment ais called feasible, if

1. the logical buffers assigned to each memory Mkdo not exceed its size, and

2. all resulting access requests to the physical memories are performed within their

deadlines, when scheduled by RM.

Note, that the buffer assignment of Definition 17 is a function, which requires that a

logical buffer is mapped to one and only one physical memory. Distributing a logical

buffer to many physical memories is not considered. Also the function is injective,

requiring that every buffer is assigned to a memory.

130

7.2 Periodic Tasks with Memory Access

As an illustration, an instance of a memory demanding RT application Υ and a particular

mapping of buffers to physical memories can be drawn as a graph. Figure 7.7 shows

the graph according to Example 4. Tasks Γ, logical buffers Band physical memories

Mare the graph nodes plotted as circles, squares and rectangles respectively, the latter

two labeled with their according size s. The edges between tasks and buffers represent

the usage relation L, labeled with the number of accesses αand periods ω. The buffer

assignment function a, which has to be computed, is represented by edges from Bto

According to Definition 17, a feasible buffer assignment is subject to two conditions.

Obviously, the memory size condition is fulfilled, if and only if Equation 7.13 holds

where s(Mk) denotes the size of the memory bank Mk.

|B|

i=1

ˆai,ks(Bi)≤s(Mk),∀Mk∈M(7.13)

The second condition concerning the schedulability of access to memory under the RM

algorithm is more involved. Since each buffer Bihas been uniquely assigned to a memory

Mk, and this assignment does not change during runtime, the scheduling condition can

be checked for each involved physical memory separately. This problem is equivalent to

testing the schedulability of a periodic RT task sets scheduling onto a uniprocessor by

RM. A sufficient but not necessary condition can be derived by computing the utilization

factor of the processor utilization, which is in our case the utilization factor of the

physical memory port. Recall, that any periodic task set can be scheduled by RM,

given the processor utilization is less then ln2, i.e. less than ≈69 % (see [25]p. 78 ff.

and Section 2.4.2).















































Figure 7.7: Graph representation of the memory demanding RT application Υ of Ex-

ample 4, with tasks Γ, logical buffers Band their assignment to physical

memories M.

131

7 Model Extensions

Following the same concept, we define two access utilization factors modeling the RAM

port utilization: µB(Bi) expressing the utilization of the virtual RAM port of a logical

buffer Bi, and µM(Mi) expressing the utilization of the port of a physical memory Mi

as a consequence of the assignment of buffers to that memory.

Let L(Γ, Bi)⊆Ldenote all edges from tasks incidence to buffer Bi, than the virtual

port utilization µB(Bi) is defined as:

µB(Bi) = X

l∈L(Γ,Bi)

α(l)

ω(l)(7.14)

The utilization factor of a physical memory port µM(Mi) results from summing up all

logical utilization factors of buffers assigned to that memory. It is calculated through:

µM(Mk) =

|B|

i=1

ˆai,kµB(Bi) (7.15)

Applying the test condition from RM scheduling, we can derive a sufficient (but not

necessary) condition to guaranty a feasible schedule of all memory accesses:

µM(Mk)≤ln 2,∀Mk∈M(7.16)

Based on the two conditions for a feasible buffer assignment, we formulate the following

optimization problem:

Definition 18 (2C-BA (two constraint buffer assignment)).Given a memory demand-

ing application Υand a set of platform memories Meach with a certain cost c(Mk),

find a feasible buffer assignment a:B→Msubject to:



P|B|

i=1 ˆai,ks(Bi)≤s(Mk),∀Mk∈Mand



µM(Mk)≤ln 2,∀Mk∈M,



min Pc(Mk)of used memories, i.e. minimize the total cost of used memories.

7.2.2 Task Schedule Aware Buffer Assignment

We can weaken the constraints of feasible buffer assignments and therefore extend the

set of feasible buffer assignments by taking into account the actual schedule of hardware

tasks, i.e. the possible sets of parallel running tasks. Hence, an optimization strategy

may find better assignments that require less physical memory banks.

In a given task schedule R(t) not all tasks Ti∈Γ are executed in parallel, but only a

certain subset ζj⊆Γ is executed at any given time. The number of different running

sets appearing in the schedule is finite, and is denoted by ζ={ζ1, ζ2, . . . }. Therefore,

132

7.2 Periodic Tasks with Memory Access

we can compute the utilization of each virtual buffer port µBand each physical memory

port utilization µMseparately for each running set ζj. Thus, only the usage relations

in Lfrom the tasks of a specific running set ζjto a specific buffer Bi– denoted by

L(ζj, Bi)⊆L– are considered. The virtual memory port utilization and physical port

utilization are reduced and change to:

µB(Bi, ζj) = X

l∈L(ζj,Bi)

α(l)

ω(l)≤µB(Bi) (7.17)

µM(Mk, ζj) =

|B|

i=1

ˆai,kµB(Bi, ζj)≤µM(Mk) (7.18)

As a result, the access scheduling condition changes from Equation 7.16 to Equation 7.19,

which has to be checked for all combinations of memories and running sets.

µM(Mk, ζj)≤ln 2,∀(Mk, ζj)∈M×ζ(7.19)

We illustrate the advantage of the task schedule aware buffer assignment by extending

Example 4.

Example 5. Assume the application Υof Example 4extended by two additional tasks

T3, T4that share another buffer B5of 1000 words. Both tasks access this buffer once

within a period of 5 clock cycles. Figure 7.8 illustrates this application as a graph, where

tasks Γ, logical buffers Band physical memories Mare the graph nodes plotted as circles,

squares and rectangles, respectively. Assume further that in a given task schedule, either

the tasks ζ1={T1, T2}or ζ2={T2, T3, T4}are running simultaneously. Buffer B5is

assigned to memory M2, in addition to B2and B4. Then the utilization of the RAM

port of M2is calculated to be µM(M2, ζ1) = 4/9 + 4/17 ≈0.68 for the first running set

and µM(M2, ζ2) = 4/17 + 2/5≈0.64 for the second running set. Therefore, the buffer

assignment is feasible according to Equation 7.19. In contrast to that, the utilization

without considering the running sets amounts to µM(M2)≈1.08, making the buffer

assignment infeasible according to Equation 7.16.

The example showed, that the memory banks required by an application can be reduced,

when the task schedule is taken into account. This motivates us, to formulate the

problem of a schedule aware buffer assignment.

Definition 19 (schedule aware buffer assignment (SA-BA)).Given a memory demand-

ing application Υand the finite set ζ={ζ1, ζ2, . . . }of the different running sets of its

task schedule, find a feasible buffer assignment a:B→Msubject to



P|B|

i=1 ˆai,ks(Bi)≤s(Mk),∀Mk∈Mand



µM(Mk, ζj)≤ln 2,∀Mk∈M, ∀ζj∈ζ



min Pc(Mk)of used memories, i.e. minimize the total cost of used memories.

133

7 Model Extensions



































































Figure 7.8: Memory demanding RT application from Example 5, illustrating different

running sets of a task schedule.

7.2.3 Buffer Assignment Based on Integer Programming

Solving the 2C-BA and SA-BA problems to optimality is rather difficult. Both are

generalizations of the classical bin-packing problem and hence cannot be solved in poly-

nomial time. For solving instances of large size, suboptimal methods like approximation

algorithms and heuristics are needed.

However, in this work we are rather interested in comparing the cost of a non-schedule

aware buffer assignment to that of a schedule aware buffer assignment. Moreover, many

applications of interest may be of rather medium complexity, e.g. up to some tens of

tasks and buffers.

Hence, we present an integer linear programming approach to solve simplified versions

of the 2C-BA and SA-BA problems. The assumptions we make are:



All platform memories are of equal size and equal cost. Hence we minimize the

number of used memories.



For the RM scheduling condition we use the fix utilization bound of ln2, rather

than the more precise bound of n(21/n −1). This makes our port utilization

pessimistic, but enables us to use an ILP formulation.

For this modified versions of 2C-BA and SA-BA problem we present an efficient binary

integer program:

134

7.2 Periodic Tasks with Memory Access

Equations 7.20 to 7.23 specify the non schedule-aware buffer assignment problem, while

replacing Equation 7.23 by Equation 7.24 specifies the schedule-aware buffer assignment

problem.

The binary variables are ˆai,k ∈ {0,1}, where ˆai,k = 1 iff buffer Biis assigned to memory

Mk. The index range for iand kis {1. . . |B|} with i≥k, which exploits the symmetry

of the problem and leads to a complexity of |B|2/2 variables.

min

|B|

i=1

ˆai,i (7.20)

k=1

ˆai,k = 1, i ∈ {1. . . |B|} (7.21)

|B|

i=k+1

ˆak,k[sM −s(Bk)] −ˆai,ks(Bi)≥0, k ∈ {1. . . |B|} (7.22)

|B|

i=k+1

ˆak,k[ln 2 −µB(Bk)] −ˆai,kµB(Bi)≥0, k ∈ {1. . . |B|} (7.23)

|B|

i=k+1

ˆak,k[ln 2 −µB(Bk, ζj)] −ˆai,kµB(Bi)≥0,(k, j)∈ {1. . . |B|}×{1. . . |ζ|} (7.24)

The objective function (Equation 7.20) minimizes the number of memories where buffers

have been assigned to, since ai,i is set to 1 iff any buffer is assigned to Mi. Equation 7.21

ensures that each buffer is assigned exactly once. Equation 7.22 and Equation 7.23/7.24

consider respectively the maximal size and maximal port utilization of each memory.

7.2.4 Simulation Results

In our experiments, we could solve both variants of the ILP for problem sizes of up

to |B|= 15 in reasonable time. Solving one instance took less than one minute using

lp solve version 4.0 on a 2.8 GHz Pentium IV machine. Table 7.4 presents the results

for randomly generated problem instances, each with 15 tasks and 15 buffers. Each

buffer is used by 1 or 2 tasks. In the various test cases, the virtual buffer port utiliza-

tion µB(Bi) and the buffer size s(Bi) (in percent of the memory bank size) have been

chosen equally distributed from the reported intervals. For the schedule-aware buffer

assignment problem, a task schedule with 15 running sets, each containing 1 to 3 tasks,

has been assumed. For each test case the table reports on the number of required phys-

ical memory banks of: the buffer assignment without sharing memory banks; the non

schedule aware assignment; and the schedule aware buffer assignment.

135

7 Model Extensions

test parameter used memories (in average)

no equally distributed non mem non sched. schedule

port util. µB(Bi) buffer size s(Bi) sharing aware aware

1 [0,0.69] [0,100%] 15 10.42 8.85

2 [0,0.69] [0,50%] 15 8.90 4.49

3 [0,0.50] [0,50%] 15 5.94 4.26

4 [0,0.25] [0,50%] 15 4.29 4.29

5 [0,0.69] [0,25%] 15 8.90 3.47

6 [0,0.50] [0,25%] 15 5.97 2.67

7 [0,0.25] [0,25%] 15 3.23 2.33

Table 7.4: Simulation results of 2C-BA and SA-BA problems.

Without sharing memory banks each buffer is assigned to one physical memory exclu-

sively, thus 15 memories are used in all test cases. In the first test case, the average

buffer port utilization and size are chosen relatively large. Hence, only in few cases mul-

tiple buffers can be assigned to one memory bank which results in about 10-11 memory

banks for the non schedule-aware assignment. A schedule aware assignment results in 8-

9 memories which is an further reduction of about 15%. As the port utilization and size

of the buffers decreases, the number of used memories is reduced considerably. For all

test cases except test no 4, the schedule aware assignment gives a further cost reduction

compared to the non schedule aware assignment. In test case no 4, the virtual port uti-

lization is small compared to the relative buffer size. Hence, the RAM size and not the

port utilization is the major limitation. Therefore, weaken the RAM port constraints

by a schedule aware assignment brings no further improvement. Test case no 5, where

the port utilization is choose large and the buffer size is choose small, shows the greatest

difference (about 60% further improvement) between the cost of a schedule-aware and

a non schedule aware assignment.

7.2.5 Concluding Discussion on Memory Demanding RT Applications

In this section we extended our application model by the memory requirements of the

tasks. The goal was, to share physical memories among the tasks without jeopardizing

timing constraints. We have presented a formal model of a buffer assignment prob-

lem and exploited the fact that the conditions for a successful scheduling of memory

accesses under RM can be weakened by considering the actual combinations of tasks

running simultaneously. This schedule aware buffer assignment is particularly interest-

ing in combination with the presented server based task scheduling algorithm MSDL of

Section 4.3. As MSDL produces a rather small number of running sets, the resulting

ILP is likely to stay within a realistic problem size. The simulation results indicate, that

sharing memories between buffers reduces the total memory cost considerably and that

further reductions can be achieved by a schedule aware buffer assignment.

For the memory port utilization we have used the simple RM scheduling bound of ln 2.

Using better RM scheduling conditions (see Section 2.4.2) would potentially improve our

136

7.3 Chapter Conclusion

results. However, using the RM hyperbolic bound would bring non-linear conditions into

the assignment problem. Hence using an ILP approach would not be straightforward.

Using the RM bound of Equation 2.2 for a fix nwould be an option for an ILP approach,

but even for small nthe bound is close to ln 2 and hence we cannot expect considerable

improvements. We leave the investigating of non ILP based buffer assignment methods

in combination with improved RM scheduling conditions as a subject of future work.

Sharing RAM banks among tasks can further be used to implement inter-task communi-

cation. A particular mechanism for the communication among periodic tasks is the use

of Cyclical Asynchronous Buffers (CABs) (see [25] p.319 ff.). Hard real-time tasks can

access CABs without blocking the sender or receiver, for example to exchange messages

containing current sensor values. A CAB provides a one-to-many communication chan-

nel. Hence CABs can be implemented using our many-to-many logical buffers, extended

by some simple handshaking functionality.

7.3 Chapter Conclusion

Within this chapter we generalized and extended our task model in two directions in

order to respect the special characteristics of hardware-tasks.

When implementing a computational task as a digital circuit on an RHD, the designer

can choose several architectures resulting in different tradeoffs between area and com-

putation time. This is captured, by modeling a task as a set of implementation variants.

We have shown that considering these variants within the task schedule can considerably

increase the performance.

Hardware-tasks typically process high amounts of data which cannot be kept within

the task circuit but has to be stored in dedicated RAM resources. In our approach,

we extended the model to capture the memory requirements of the tasks. Both, the

required RAM size and the utilization of the RAM port have been modeled. Task may

share a physical RAM bank, iff its size is not exceeded and the access to the RAM-port

is feasible under the RM priority schema. We presented a method to assign task buffers

to RAM-banks, which allows the minimization of the memory requirements and makes

inter-task communication feasible.

137

7 Model Extensions

138

CHAPTER 8

Prototype: FPGA Based Real-Time Kernel

This chapter presents our prototype of an FPGA based real-time system. It implements

the server based scheduler (MSDL) and hence uses the free space - time sharing resource

model which is based on full FPGA reconfiguration. In contrast to others, the kernel

is a hardware-only system with the OS functions entirely implemented in hardware.

We first describe the architecture of the system and the kernel and the phases of the

scheduler cycle. Afterwards, we present an automated synthesis flow. It takes the task

set parameters and the source files of the kernel and the tasks as input and generates

the required FPGA bit-streams of the entire system. Finally, results on the system’s

logic requirements and timings are reported.

Up to now we have build no prototype systems for the global EDF and partitioned EDF

algorithms. These methods are both based on the 1D resource model. Since several

prototypes that follow the 1D model and moreover implement task relocation and state

saving have been reported previously [60][20][21][61], we found that no additional proof

of concept is necessary within this work.

We previously reported on our prototype in [DMP06].

8.1 System Architecture

8.1.1 An All-Hardware Runtime System

We have implemented an all-hardware runtime system for the execution of MSDL sched-

ules. The decision for implementing the complete runtime system in hardware was

driven by the fact that a hardware scheduler can control the user tasks in a very effi-

cient manner and, thus, minimize the runtime overhead and deliver maximum scheduling

performance. In a processor-based real-time runtime system, a periodic timer triggers

139

8 Prototype: FPGA Based Real-Time Kernel

a kernel function that maintains time in terms of system ticks. The smaller the timer

period is, the finer the system’s time resolution is but the higher the runtime overhead

is. In our all-hardware runtime system, the scheduler runs in parallel to the user tasks

and, thus, no execution time is wasted for updating the internal time reference.

As the runtime system itself undergoes full device reconfiguration, we have to include

it in each single configuration bit-stream. On one hand, this opens up the opportunity

for compile-time optimization of parts of the runtime system to specific servers. On the

other hand, we also have to save and restore the context of the runtime system.

Our prototype is based on a CELOXICA RC203 board. Figure 8.1 shows the system

architecture consisting of an FPGA, a CPLD, a flash memory card and two banks of

SRAM. The CPLD is connected to FPGA user I/Os, to the FPGA configuration port,

as well as to the flash memory card. This way, the CPLD acts as a reconfiguration

controller that reads configuration bit-streams from the flash memory card on request

of the logic implemented in the FPGA.

The main runtime system components shown in Figure 8.1 comprise the timer, the

scheduler, the task management and an optional memory management unit (MMU).

Timer: The timer provides the time reference for the system by scaling down the clock

to a system tick signal with a user-defined period. All execution times and periods

are expressed in terms of system ticks.

Scheduler: The scheduler is the central part of the runtime system as it controls which





 



























  !" #



$

# %







Figure 8.1: System architecture

140

8.1 System Architecture

user tasks are executed at any given time. The scheduler further decides whether

the FPGA has to be reconfigured and, if so, what the proper configuration is. To

control the execution of servers and tasks, three states are introduced: READY,

RUN, and IDLE. A server (task) is in the RUN state, when it is currently being

executed on the FPGA. A server (task) in the READY state has been released

and has not yet terminated, but it is not currently being executed on the FPGA.

This implies that another server is currently assigned to the FPGA. Finally, a

server (task) enters the IDLE state, when its current instance has terminated. In

this case, the server (task) has to wait for its next release.

Task Management: The task manager controls the task operations and the context

switching. Whenever a reconfiguration is required, the scheduler triggers the task

manager to save the entire system context. Similarly, the task manager is called

to restore the entire system context after the FPGA reconfiguration. The task

manager is comprised of a server-independent part and a server-dependent part.

The server-independent part is identical in all configuration bit-streams and im-

plements a logic to save and restore the context of the runtime system. The

server-dependent part (speckled objects in Figure 8.1) implements a logic to start,

stop and resume the user tasks and to save and restore their contexts. Since each

server contains a different set of user tasks, and each user task has a different

context size, we apply server-specific optimizations in the compilation process.

MMU: The memory management unit is an optional component. In a runtime system

without MMU, the task manager is connected directly to an SRAM bank. This

SRAM is then exclusively used as intermediate context storage. The second SRAM

bank is devoted to implement data storage for the user tasks. Using an MMU, such

as the one described in [Dan04c,Dan04b,DP05b], the SRAM banks are divided

into pages which can be shared among the runtime system and the user tasks.

8.1.2 Scheduler Data Structures

Executing an MSDL schedule means applying the sequential EDF scheduling policy

to a precomputed set of servers. To this end, the scheduler implements several data

structures. The server control block array (SCBarray) holds for each server the following

constants: ID, period, computation time, the address of the corresponding bit-stream

file in the flash memory, and the set of tasks which constitute the server. The variable

part of the data structure stores the server’s current server state, its absolute deadline,

and its remaining computation time. The register RunningServer contains the ID of

the server in execution.

In general, the task periods may differ from the server periods. Tasks can start and

terminate at arbitrary times within their servers. Hence, also the task states can differ

from those of the servers, requiring the scheduler to implement another data structure

for tasks. The task control block array (TCBarray) holds for each task the following

constants: ID, period, and the size of the context. The variable fields are the task’s

current state and its absolute deadline.

141

8 Prototype: FPGA Based Real-Time Kernel

The ready queue is used to store a list of servers in the ready state, sorted according

to non-decreasing absolute deadlines. We do not implement the ready queue directly,

but provide a single register, ReadyQueueHead, that points to the entry in SCBarray

holding the server with the currently earliest deadline. Additionally, each element of

SCBarray contains a field which points to the next server in the ready queue.

8.1.3 The Scheduler Cycle

Time proceeds in system ticks. The period of the system tick corresponds to a user-

defined number of clock cycles which is constant during the application’s runtime. A

scheduling cycle denotes all the operations of the scheduler during a system tick period.

There are basically two different sequences of scheduler operations during a cycle, de-

pending on whether a device reconfiguration is needed or not. These two cases are

outlined in Figure 8.2.

In any case, the scheduler cycle starts with the Run Tasks (RT) phase. Here, all tasks

that are included in the current server and are in the READY state are set to the RUN

state. Consequently, these tasks receive a signal to start or resume their execution. In

the example of Figure 8.2,S1starts execution. Then, the scheduler idles until almost

the end of the system tick period. During this time interval, the user tasks are executed

and some may terminate.

The first scheduler phase after the delay is called end cycle. Here, the scheduler decreases

the remaining computation time of the currently running server. If the remaining com-

putation time reaches zero, the server has finished its execution for its current period

and its state is set to IDLE.

 





     

 

 





  





















 

Figure 8.2: Scheduler phases during a system tick period

In the following scheduler phase wake up, the scheduler checks if any server or task has

reached the end of its period. This is done by comparing the absolute deadlines to the

system tick. If a server or task is at the end of its period, it is set to the READY state

and its deadline is set to di=di+Pi. In case of a server, the remaining computation

142

8.2 Synthesis Tool Flow

time is set to the server’s initial computation time.

The Dispatch Server (DPS) phase increments the system tick and determines the server

that should run in the next system tick period following the EDF rule. Since the queue

of ready servers is sorted according to non-decreasing deadlines, DPS only compares

the deadline of the running server to that of the first server in the ready queue. If

the currently running server terminates or has to be preempted, the first server in the

ready queue becomes the next running server. In this case, shown in Figure 8.2-(a),

the Save System Context (SSC) phase is entered next. If the currently running server

is determined to continue execution, shown in Figure 8.2-(b), a delay phase is entered

next.

If a device reconfiguration is intended, the SSC phase requests all currently running

tasks to stop their execution. Then, the contexts of all user tasks are stored in the

external SRAM, followed by the context of the runtime system. Upon completion of

the context saves, the scheduler triggers the FPGA reconfiguration and the subsequent

Reconfiguration (RC) phase is entered. After reconfiguration, the scheduler executes the

Load System Context (LSC) phase and restores first the context of the runtime system

and, afterwards, the contexts of all tasks in the new server running. The following RT

phase starts the next scheduler cycle (system tick period).

The total time required by the SSC, RC, LCS, and RT phases is denoted as runtime

system overhead as it cannot be used for user task execution. This overhead actually

combines overheads due to device reconfiguration (RC), task and runtime system context

saves and restores (SSC, LCS), and runtime system management overhead (RT).

In case no device reconfiguration is intended, the scheduler enters another delay phase

after the DPS phase. The duration of this delay matches exactly the sum of the SSC, RC

and LSC phases. By this, we ensure a constant time interval for each system tick. As

can be observed from Figure 8.2, the runtime system overhead only occurs if the system

switches from one configuration to another. If a server executes for several system ticks,

which can be expected to be the regular case, there is no runtime overhead at all.

The presented scheduler cycle assumes that a system tick period is larger than the

reconfiguration time. While this constraint on the system tick period simplifies the

time keeping and scheduling processes, it also limits the system’s time resolution. An

alternative implementation could make use of an external, non-reconfigured real-time

clock as a time reference and reduce the system tick period considerably.

8.2 Synthesis Tool Flow

Figure 8.3 shows the basic steps of the synthesis tool flow that generates the configura-

tion bit-streams for a particular application. We use CELOXICA’s parallel programming

language Handel-C to specify the runtime system and the user tasks.

The application, i.e., the task set Γ, is specified in textual form (appl.txt) and contains

the period, worst-case computation time, and area requirement for each user task. The

143

8 Prototype: FPGA Based Real-Time Kernel

task_n.hcc

task_2.hcc

RTOS FilesTask Source Files

MSDL

Algorithm

Appl. Meta Data

HandelC preproc.

HandelC compiler

Backend Tools

(map, place & route,

bitfile generation)

task_1.hcc

...hcc

schedul.hcc

kernel.hcc

server_m.bit

server_1.bit

server_0.bit

FPGA Bitfiles

task_serv_data.hch

...

#define T1_PERIOID 1000

#define T1_contextsize 16

...

#define S1_period 500

#define S1_compTime 50

#define S1_taskset 0x010F

#define S1_flashAdr 0x7F

...

appl.txt

P1 = 1000,

C1 = 300,

A1 = 30 %,

P2 = 3500

C2 = 450

A2 = 22 %

...

Figure 8.3: Synthesis tool flow

144

8.3 Synthesis Results

first step of the tool flow calls the MSDL algorithm to compute a feasible set of servers

Ω (see Section 4.3). If such a feasible set of servers is found, the relevant task and

server parameters are written into the Handel-C header file task serv data.hch. This

file contains mainly preprocessor macros, defining several properties of the servers and

tasks such as the IDs and periods. Furthermore, for each server the computation time,

the set of included tasks, and the address of the configuration bit-stream in the flash

memory card is defined.

The second step of the tool flow takes this header file together with the Handel-C source

files for the user tasks and the runtime system and creates the individual configuration

bit-streams for the server tasks. We use the Handel-C compiler (CELOXICA DK4) and

standard FPGA vendor tools (XILINX ISE 7.2) for this step.

Besides the mservers generated by MSDL, we create another bit-stream for an idle-

server (server 0). The idle server contains only the runtime system but no user tasks.

This server is run as initial configuration after system startup and when all other servers

are idle. When the synthesis tool flow is completed, all m+ 1 server bit-streams are

stored onto the flash memory card at their respective addresses. After system power-

on, the CPLD automatically configures the idle server into the FPGA. The scheduler

takes over control and decides which server to reconfigure next according to the MSDL

schedule.

User tasks are free to use logic, memory and I/O resources of the FPGA, as long as

they implement the pre-defined interface to the runtime system. This interface allows

the runtime system to start and stop a task and to initiate context save and restore

operations. Although our current prototype expects user tasks specified in Handel-C,

integrating other HDL components into the synthesis flow is still straight-forward.

8.3 Synthesis Results

We have implemented the entire system on CELOXICA’s RC203 board, which hosts a

XILINX Virtex II 3000 FPGA, two 2-MB SRAM banks and a 64 MB SmartMedia flash

memory card. An FPGA configuration bit-stream has a size of approximately 1.25MB,

which allows us to store up to 50 server bit-streams on the flash memory card.

To determine the area requirement for the runtime system, we have conducted synthe-

sis experiments with different numbers of tasks and servers. These experiments use a

runtime system without MMU and minimal user task implementations in order to re-

duce the effect of the user task logic on our measurements. The results are presented in

Table 8.1. The number of required LUTs grows with the number of servers and ranges

from 16% to 45% of the overall LUT resources. The required number of Flip Flops (FFs)

grows slower and ranges from 5% to 14% of the overall device FFs. Assigning almost

50% of the device resources to the runtime system might look unrealistic at the first

glance. However, this number corresponds to the rather high number of 15 servers and

15 tasks, respectively. Then, these figures have been derived from a first, unoptimized

version of the runtime system. For example, all scheduler data structures are currently

145

8 Prototype: FPGA Based Real-Time Kernel

implemented as registers. Mapping these structures to Block-RAM will reduce the re-

quired resources substantially. Finally, in the future we expect that much larger devices

will become available, reducing the relative overheads for such a runtime system.

# tasks # servers LUTs (% of device) FFs (% of device)

15 3 7786 (27%) 2703 (9%)

15 6 9802 (33%) 3028 (10%)

15 9 10297 (35%) 3390 (11%)

15 12 11655 (40%) 3725 (12%)

15 15 13109 (45%) 4030 (14%)

3 3 4470 (16%) 1336 (5%)

6 3 5297 (18%) 1677 (6%)

9 3 6123 (21%) 2019 (7%)

12 3 6948 (24%) 2361 (8%)

15 3 7787 (27%) 2703 (9%)

Table 8.1: Runtime system size for various numbers of tasks and servers

The time overheads for the runtime system are fairly small. With sas the number of

servers and tas the number of overall tasks, we measured a runtime for the SSC phase

of 3s+7tclock cycles and for the LSC phase of 3s+8tcycles. The wake up phase takes

3s+tcycles and, finally, the end cycle and DPS phases take just one clock cycle each.

For a runtime system controlling 3 servers and 9 tasks we achieved a maximal clock

frequency of about 50 MHz. Although the Virtex II 3000 could be reconfigured within

some 20 ms via the SelectMap interface, the speed of the SmartMedia flash memory

card limits the reconfiguration time on the RC203 board to 180 ms.

We further measured the overhead of the logic inside the user tasks needed for storing

and loading the tasks’ contexts by synthesis experiments. We organized the context

registers in form of a shift register chain and measured the number of additional LUTs

required to implement the load and store functions. The results reveal an overhead of 2

LUTs per bit for a context register of 32 bit. This overhead decreases to 1.2 LUTs per

bit for context register sizes of 256 and more.

8.4 Chapter Conclusion

In this chapter, we presented a prototype system that executes periodic real-time tasks

on dynamically reconfigurable hardware. We used the MSDL scheduling technique which

offers an efficient off-line schedulability test and requires a minimal number of device

configurations. We showed an implementation of this technique on an FPGA board,

including an all-hardware runtime system and the corresponding synthesis tool flow.

For the future, we plan to port the system to an FPGA platform with shorter reconfig-

uration time. Although we consider it a benefit of our approach that it can work with

full device reconfiguration, and thus with all SRAM-based devices on the market, we

also intend to investigate partial device reconfiguration. A natural way to utilize partial

reconfigurability in our prototype would be to keep the runtime system permanently in

146

8.4 Chapter Conclusion

the device and only reconfigure between different servers. Alternatively, an embedded

CPU could be used to implement the runtime system. Prototypes for the global EDF

and partitioned EDF schedulers may also be subject of future work.

147

8 Prototype: FPGA Based Real-Time Kernel

148

CHAPTER 9

Conclusion and Outlook

9.1 Summary

In this thesis we presented our investigation on real-time multitasking on reconfigurable

architectures. Application tasks were implemented as digital circuits and executed on

a runtime reconfigurable hardware device such as an FPGA. While the general concept

of executing hardware-tasks with RHDs had been proven feasible in several prototypes

[102][57][105], and scheduling and placement of hardware-tasks had been given much

attention by the research community (e.g. in [14][94]), only minor results on real-time

multitasking on RHDs had been developed so far. We presented several key contributions

to this field within our work: We defined the periodic task model for RHDs, where each

task requires some amount of logic resources to be executed. An arbitrary set of tasks

can run in parallel as long as the cumulative task area does not exceeded the device

area. For such a model we presented three major scheduling methods:

1. The global EDF algorithms were inspired from global EDF scheduling methods for

multiprocessors. The new EDF algorithms were denoted as EDF-NF and EDF-

FkF. They manage resources globally, meaning that any resource can be assigned

to any task. Following a resource augmentation approach [81], we presented the

first known scheduling test for these algorithms. The test can be computed in

O(n) time, nbeing the number of tasks. Simulations showed that, in conformance

to results derived for multiprocessor scheduling, the test is pessimistic for task sets

with high system utilization. Further, the simulations showed that the EDF-NF

algorithm gives a non-negligible improvement over EDF-FkF.

2. The partitioned EDF algorithms were also inspired by traditional multiprocessor

scheduling algorithms. The task set is partitioned into partition-blocks at de-

sign time and exclusive device resources are assigned to each partition-block. At

149

9 Conclusion and Outlook

runtime, the tasks of each partition-block are scheduled by a separate EDF sched-

uler. We showed the relation of this partitioning problem to the two-dimensional

level-strip packing problem. We employed a previously reported ILP model to

compute the optimal partitioning and a next-fit heuristic to achieve a close to

optimal partitioning for large task sets.

3. In the MSDL method the computation of the user tasks is distributed into peri-

odic servers tasks. At design time, the tasks are grouped to servers for parallel

execution. At runtime, the servers are executed sequentially according to EDF.

Since only one server executes at a time, this approach is suitable for systems and

devices which do not support partial reconfiguration.

While in our analytic performance comparison we showed that there is no dominance

among the three approaches, our simulation studies showed that on average, global

EDF outperforms partitioned EDF, which outperforms MSDL. Further it was shown

that the global EDF scheduling test is rather pessimistic. Concerning execution model

overhead, we analyzed the reconfiguration overhead for all three approaches and included

it within the schedulability tests. Simulation experiments showed that global EDF

suffers significantly more from overhead when compared to the two other schedulers.

Two extensions of our task model were presented in order to consider the specific charac-

teristics of reconfigurable applications: First we considered tasks for which not only one

but several alternative implementations (circuits) exist. We presented an ILP model to

select the optimal implementations for the partitioned EDF scheduler, which consider-

ably improved the scheduling performance. Second, we considered tasks memory usage

under real-time constraints. We presented a method efficiently sharing RAM-banks

without jeopardizing task deadlines, hence reducing the required memory resources.

Finally, we described our prototype FPGA based real-time kernel which, in contrast to

others, is a hardware-only system with the OS functions entirely implemented in hard-

ware. The prototype implements the MSDL approach, uses full device reconfiguration,

and is hence suitable for most of today’s reconfigurable devices.

9.2 Drawn Conclusions

From our work and results we draw the following conclusions:

Practicability of real-time multitasking on RHDs: Multitasking on RHDs is feasible,

even under real-time conditions. However, several technical issues such as reconfig-

uration overheads, partial runtime reconfiguration support and state saving today

still limit the practicability.

The overheads associated with the device reconfiguration are generally not negligi-

ble and present the strongest limitation. However, it can be successfully included

within the scheduling conditions of all approaches. Simulation results indicate,

that the overheads can almost be neglected, if the reconfiguration time is one

150

9.2 Drawn Conclusions

order of magnitude (two orders in case of global EDF) smaller than the compu-

tation time of the shortest task. Commercial FPGAs have, depending on their

size, reconfiguration times from some hundreds of micro seconds up to some tens

of milliseconds. Considering such devices, our results suggest that hardware tasks

should have computation times of at least about 5 milliseconds. Moreover, a great

deal of research is on its way aiming to a reduction of the reconfiguration time.

For example, multi-context architectures [31] store several configurations on-chip

and allow rapid switching between them. Yet another approach are coarse-grained

reconfigurable devices, based on ALU arrays or ALU/look-up table hybrids and

bus-wide interconnects. These devices have reconfiguration times in the order of

microseconds. Applying our scheduling techniques to such platforms renders hard-

ware tasks feasibly with computation times as low as some tens of microseconds.

Partial reconfiguration is, if at all, only weakly supported by today’s RHD archi-

tectures. Often, the tasks footprints have to be rectangularly shaped, spanning

the entire device size in one dimension. Inhomogeneous resources make task re-

location difficult. However, our scheduling algorithms deal with these limitations:

The one-dimensional area model, used by global EDF and partitioned EDF, can be

realized with today’s devices. Partitioned EDF avoids task relocation and MSDL

does not rely on partial reconfiguration at all.

Several reported prototypes of runtime environments prove the concept of hardware-

multitasking, including task preemption. A working prototype for the free space

partitioned - time sharing resource model used by MSDL, which realizes state

saving via dedicated task interfaces, has been presented in this work.

The algorithms: All three approaches generate schedules with predictable task timings

and achieve an acceptable high device utilization.

Global EDF achieves the highest average performance. While this makes it a

promising approach, only a fraction of this performance can be guaranteed by our

schedulability test. Moreover, the approach requires task relocation and suffers

greatly from reconfiguration overheads. Hence, global EDF may only be useful

in several special cases where the task sets are small such that the hyper-period

can be simulated or cases where online acceptance tests are required. This may

change, if further research comes up with less pessimistic schedulability tests.

Partitioned EDF combines good performance with low reconfiguration overheads

and few requirements to the execution model. The optimal partitioning can be

computed in reasonable time for task sets up to 30 tasks and even more. Hence,

it should be the chosen method if the system supports partial reconfiguration.

MSDL has the lowest performance, but enables multitasking for system that do

not support partial reconfiguration.

Problem relations to multiprocessor systems: The execution models used for RHDs

share many relations to multiprocessor models. In particular, our resource model

presents a generalization of a simple multiprocessor model. Our work showed

that several scheduling approaches used in multiprocessor environments can be

extended for the RHD model. This suggests, that potentially even more ideas

151

9 Conclusion and Outlook

from multiprocessor scheduling methods could be used to develop further RHD

scheduling algorithms.

Importance of alternative task implementations: We can conclude from our results

that considering alternative task implementations generally yields a better schedul-

ing performance. We proved this in particular for the partitioned scheduler, where

we computed the optimal selection of alternatives via an ILP.

9.3 Outlook and Future Work

This work presented key methods for periodic real-time scheduling on RHDs. Since the

topic has been only rarely addressed before, we developed only basic methods for simple

task models. Hence, the work opens up a great field for future research. At this point we

suggest four concrete lines of future work. (More details can be found in the according

chapters.)

Improvement of presented schedulers: One direction of future work should concen-

trate on the improvement of the presented methods. For global EDF, better

schedulability conditions could be developed, based on their multiprocessor coun-

terparts [11]. For the server based scheduler, the presented MSDL algorithm

involves greedy heuristic choices for the selection of servers to merge. Future work

should evaluate other approaches in order to improve the scheduling performance.

For the extended model which considers alternative task implementations, only

the optimal-partitioned-EDF method has been considered. How task variants can

be considered in global EDF and MSDL has been roughly outlined but needs to

be evaluated in future work. Another idea is the development of approximation

algorithms that compute partitioned EDF schedules even for large task sets.

Non-preemptive multitasking: Scheduling periodic real-time onto RHDs in the non-

preemptive case has not been addressed yet. It is useful since the reconfigura-

tion overheads are reduced and all implementation issues of task interruption are

avoided. Non-preemptive schedulers generally achieve less device utilization than

preemptive schedulers, but the effect of reduced reconfiguration overhead could

compensate that drawback.

Generalizations of the model: Throughout this work, we assumed that the relative

task deadlines Diare equal to periods Pi. However, the basic idea of our ap-

proaches should also work with deadlines less than periods. The multiprocessor

EDF schedulability test of [48] has been successfully extended to this case in [16],

hence we believe we can do the same for our global EDF schedulability test. For

partitioned EDF, we cannot use the simple time utilization EDF bound UT≤1

for each partition anymore. Rather, we should replace it by a processor demand

criteria [25], p.101. This can be easily done for the next fit heuristic, but since this

leads to non-linear scheduling conditions the ILP approach will not work anymore.

152

9.3 Outlook and Future Work

Moreover, in our model the required task resources are simply modelled by a scalar

area value. Future work can extend this to resource vectors, e.g. to cover FPGA

resources such as the number of LUTs, FFs, block-RAMs and multiplier-blocks.

Implementation of prototypes: This work presented a prototype implementing the MSDL

scheduler. For global EDF and partitioned EDF, prototypes could be built on any

system supporting the 1D partial reconfiguration model.

In summary, this thesis presented new fundamental work in real-time multitasking on

RHDs. While we focused on the development and analysis of scheduling algorithms for

periodic tasks, we moreover presented a prototypical system realization. Perhaps just

as important this work also showed that much more research is required to utilize the

promising runtime reconfiguration capabilities of RHDs in next generation embedded

systems. We hope that our work motivates more research is this exciting new area.

153

9 Conclusion and Outlook

154

Author’s Publications

[BDAT03] Christophe Bobda, Klaus Danne, Ali Ahmadinia, and J¨

urgen Teich. A new

approach for reconfigurable massively parallel computers. In Proceedings of

the International Conference on Field-Programmable Technology (FPT’03).

IEEE, 15 - 17 December 2003.

[BDAT04] Christophe Bobda, Klaus Danne, Ali Ahmadinia, and J¨

urgen Teich. Gen-

eration of distributed arithmetic designs for reconfigurable applications. In

ARCS 2004 Organic and Pervasive Computing, Workshop Proceedings, vol-

ume P-41, Bonn, 26 March 2004. GI LNI, K¨

ollen Verlag.

[BDD+03] M. Bednara, Klaus Danne, Markus Deppe, Oliver Oberschelp, Frank Slomka,

and J¨

urgen Teich. Design and implementation of digital linear control sys-

tems on reconfigurable hardware. EURASIP Journal on Applied Signal Pro-

cessing, 6(6):594–602, May 2003.

[BDL03] Christophe Bobda, Klaus Danne, and Andr´e Linarth. Efficient implementa-

tion of the singular value decomposition on a reconfigurable system. In Proc.

of the International Conference on Field Programmable Logic and Applica-

tions (FPL 2003), Lisbon, Portugal, September 2003.

[Dan04a] Klaus Danne. Distributed arithmetic FPGA design with online scalable size

and performance. In Proceedings of 17th SYMPOSIUM ON INTEGRATED

CIRCUITS AND SYSTEMS DESIGN (SBCCI04), pages 135–140. ACM

Press, New York, NY, USA, 7 - 11 September 2004.

[Dan04b] Klaus Danne. Memory management to support multitasking on FPGA based

systems. In Proceedings of the International Conference on Reconfigurable

Computing and FPGAs (ReConFig04) ISBN 970-692-169-9. Mexican Soci-

ety of Computer Science, SMCC, 20 - 21 September 2004.

[Dan04c] Klaus Danne. Operating systems for FPGA based computers and their mem-

ory management. In ARCS 2004 Organic and Pervasive Computing, Work-

shop Proceedings, volume P-41 of GI-Edition Lecture Notes in Informatics

(LNI), Bonn, 26 March 2004. K¨

ollen Verlag.

[DB04] Klaus Danne and Christophe Bobda. Dynamic reconfiguration of distributed

arithmetic controllers: Design space exploration and trade-off analysis. In

155

Author’s Publications

Proceedings of the 11th Reconfigurable Architectures Workshop (RAW’04).

IEEE Computer Society, IEEE Computer Society Press, 26 - 27 April 2004.

[DB06] Klaus Danne and Christophe Bobda. Dynamic reconfiguration of distributed

arithmetic designs. International Journal of Embedded Systems 2006 - Vol.

2, No.1/2 pp. 51 - 61, 2(1/2):51–61, 2006.

[DBK03a] Klaus Danne, Christophe Bobda, and Heiko Kalte. Increasing efficiency by

partial hardware reconfiguration: Case study of a multi-controller system.

In Toomas Plaks, editor, Proceedings of the International Conference on

Engineering of Reconfigurable Systems and Algorithms, Las Vegas, Nevada,

USA, 23 - 26 June 2003.

[DBK03b] Klaus Danne, Christophe Bobda, and Heiko Kalte. Run-time exchange of

mechatronic controllers using partial hardware reconfiguration. In Proc. of

the International Conference on Field Programmable Logic and Applications

(FPL2003), Lisbon, Portugal, September 2003.

[DMP06] Klaus Danne, Roland Muehlenbernd, and Marco Platzner. Executing hard-

ware tasks on dynamically reconfigurable devices under real-time conditions.

In Proceedings of the FPL06, 2006.

[DP05a] Klaus Danne and Marco Platzner. A heuristic approach to schedule periodic

real-time tasks on reconfigurable hardware. In Proceedings of the Interna-

tional Conference on Field Programmable Logic and Applications (FPL05),

Tampere, Finland, 24 - 26 August 2005. Piscateway, NJ: IEEE.

[DP05b] Klaus Danne and Marco Platzner. Memory-demanding periodic real-time

applications on FPGA computers. In Work-in-Progress proceedings of the

17th EUROMICRO CONFERENCE ON REAL-TIME SYSTEMS (ECRTS

05), 6 - 8 July 2005.

[DP05c] Klaus Danne and Marco Platzner. Periodic real-time scheduling for FPGA

computers. In Volker Turau and Christophe Weyer, editors, The Third

IEEE International Workshop on Intelligent Solutions in Embedded Systems

(WISES’05) at Hamburg University of Technology. Piscateway, NJ: IEEE,

20 May 2005.

[DP06a] Klaus Danne and Marco Platzner. An edf schedulability test for periodic

tasks on reconfigurable hardware devices. In Proceedings of LCTES 2006,

Ottawa, 2006.

[DP06b] Klaus Danne and Marco Platzner. Partitioned scheduling of periodic real-

time tasks onto reconfigurable hardware. In International Parallel and

Distributed Processing Symposium (IPDPS’06), Reconfigurable Architecture

Workshop (RAW’06), 2006.

156

Author’s Publications

[DS05] Klaus Danne and Sven St¨

uhmeier. Off-line placement of tasks onto recon-

figurable hardware considering geometrical task variants. In Proceedings of

International Embedded Systems Symposium 2005 (IESS05), Manaus, Brazil,

15 - 17 August 2005.

[GDBS04] Jens Gerling, Klaus Danne, Christophe Bobda, and J. Schrage. Distributed

arithmetics for recursive convolution of optical intercannects. In EOS Topical

Meeting, Optics in Computing (OIC), pages 65–66, Engelberg (Switzerland),

April 2004.

157

Author’s Publications

158

Bibliography

[1] Mobileye vision technologies ltd. http://www.mobileye.com.

[2] E. Ahmed and J. Rose. The effect of LUT and cluster size on deep-submicron

FPGA performance and density. In FPGA, pages 3–12, 2000.

[3] Altera. Stratix II performance and logic efficency analysis. altera white paper

v1.0, February 2004.

[4] Altera. Altera end markets. http://www.altera.com/end-markets/, 2006. This

is an electronicdocument. Date retrieved: July 13, 2006.

[5] B. Andersson and J. Jonsson. Fixed-priority preemptive multiprocessor schedul-

ing: to partition or not to partition. In Seventh International Conference on Real-

Time Computing Systems and Applications (RTCSA’00), pages 337–346, 2000.

[6] R. Andraka. A survey of cordic algorithms for fpgas. In FPGA ’98. Proceedings

of the 1998 ACM/SIGDA sixth international symposium on Field programmable

gate arrays, pages 191–200, 1998.

[7] D. Andrews, D. Niehaus, R. Jidin, M. Finley, W. Peck, M. Frisbie, J. Ortiz,

E. Komp, and P. Ashenden. Programming models for hybrid fpga-cpu computa-

tional components: A missing link. IEEE Micro, 24(4):42–53, 2004.

[8] J. Axelsson. A case study in heterogeneous implementation of automotive real-

time systems. In 6th International Workshop on Hardware/Software Co-Design

(CODES). Volvo Technological Development, 1998.

[9] D. Babbar and P. Krueger. On-line hard real-time scheduling of parallel tasks on

partitionable multiprocessors. In ICPP, pages 29–38, 1994.

[10] T. P. Baker. Multiprocessor EDF and deadline monotonic schedulability analysis.

In RTSS, pages 120–129, 2003.

[11] T. P. Baker. A comparison of global and partitioned edf schedulability tests for

multiprocessors. Technical Report TR-051101, Florida State University Dept. of

Computer Science Tallahassee, FL 32306 USA, 2005.

159

Bibliography

[12] S. K. Baruah, N. K. Cohen, C. G. Plaxton, and D. A. Varvel. Proportionate

progress: A notion of fairness in resource allocation. Algorithmica, 15(6):600–625,

1996.

[13] V. Baumgarte, F. May, A. N¨

uckel, M. Vorbach, and M. Weinhardt. Pact xpp -

a self-reconfigurable data processing architecture. In ERSA, Las Vegas, Nevada,

June 2001.

[14] K. Bazargan, R. Kastner, and M. Sarrafzadeh. Fast template placement for recon-

figurable computing systems. IEEE Design and Test of Computers, pages 68–83,

Mar. 2000.

[15] M. Berkelaar. lp solve - solve (mixed integer) linear programming problems. Eind-

hoven University of Technology, 4.0 edition, 2002.

[16] M. Bertogna, M. Cirinei, and G. Lipari. Improved schedulability analysis of EDF

on multiprocessor platforms. In proceedings of the 17th ERTCS, Palma de Mal-

lorca, July 5-8 2005, Palma de Mallorca, Spain, July 2005. IEEE CS.

[17] E. Bini, G. C. Buttazzo, and G. Buttazzo. A hyperbolic bound for the rate

monotonic algorithm. In IEEE Proceedings of the Euromicro Conference on Real-

Time Systems, pages 59–66, Delft, The Nederlands, June 2001.

[18] C. Bobda. Synthesis of Dataflow Graphs for Reconfigurable Systems using Tem-

poral Partitioning and Temporal Placement. Dissertation, Universit¨

at Paderborn,

Heinz Nixdorf Institut, Entwurf Paralleler Systeme, 2003. Euro 35,-, ISBN 3-

935433-37-9.

[19] C. Bobda, A. Ahmadinia, M. Majer, J. Teich, S. Fekete, and J. van der

Veen. Dynoc: A dynamic infrastructure for communication in dynamically re-

configurable devices. In Proceedings of the International Conference on Field-

Programmable Logic and Applications (FPL), pages 153–158, Tampere, Finland,

Aug. 2005.

[20] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, and J. Teich. Increasing

the flexibility in fpga-based reconfigurable platforms: The erlangen slot machine.

In IEEE 2005 Conference on Field-Programmable Technology (FPT05), pages 37–

42, December 11-14 2005.

[21] C. Bobda, M. Majer, A. Ahmadinia, T. Haller, A. Linarth, J. Teich, S. P. Fekete,

and J. van der Veen. The erlangen slot machine: A highly flexible FPGA-

based reconfigurable platform. In Proceeding 2005 IEEE Symposium on Field-

Programmable Custom Computing Machines April 17 - April 20, 2005, 2005.

[22] G. Brebner and O. Diessel. Chip-based Reconfigurable Task Management. In

Field-Programmable Logic and Applications (FPL’01), pages 182–191. Springer-

Verlag, Berlin, Germany, 2001.

160

Bibliography

[23] A. Burchard, J. Liebeherr, Y. Oh, and S. H. Son. Assigning real-time tasks to

homogeneous multiprocessor systems. Technical report, University of Virginia,

Charlottesville, VA, USA, 1994.

[24] A. Burns and A. Wellings. Real-Time Systems and Programming Languages (sec-

ond editon). Addison-Wesley, second edition edition, 1997.

[25] G. C. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling

Algorithms and Applications (second edition). Kluwer Academic Publishers, 2005.

[26] G. C. Buttazzo. Rate monotonic vs. edf: judgment day. Real-Time Syst., 29(1):5–

26, 2005.

[27] Celoxica Inc., www.celoxica.com.Celoxica RC 300 Datasheet, v 1.2 edition, 2005.

[28] H. Chetto, M. Silly, and T. Bouchentouf. Dynamic scheduling of real-time tasks

under precedence constraints. Real-Time Systems, 2:181 – 194, 1990.

[29] E. G. Coffman, M. R. Garey, and D. S. Johnson. Approximation Algorithms for

NP-Hard Problems, chapter Approximation Algorithms for Bin Packing: a Survey,

pages 46–93. PWS Publishing, 1996.

[30] E. G. Coffman, Jr., M. R. Garey, D. S. Johnson, and R. E. Tarjan. Performance

bounds for level-oriented two-dimensional packing algorithms. SIAM Journal of

Computing, 9(4):808–826, November 1980.

[31] K. Compton and S. Hauck. Reconfigurable computing: a survey of systems and

software. ACM Comput. Surv., 34(2):171–210, 2002.

[32] Continental. The new generation of adaptive cruise control acc from wabco and

continental temic for trucks and busses. Continental Press Release, September

2004.

[33] A. Cosoroaba and F. Rivoallon. Achieving Higher System Performance with the

Virtex-5 Family of FPGAs: White Paper 245. XILINX, v1.1.1 edition, July 2006.

[34] CRAY Inc. Cray XD1 Datasheet, 1.3 edition, 2005.

[35] A. DeHon. Reconfigurable Architectures for General-Purpose Computing. PhD

thesis, Massachusetts Institute of Technology, 1996.

[36] M. L. Dertouzos. Control robotics: the procedural control of physical processes.

Information Processing, 1974.

[37] O. Diessel and H. ElGindy. Run-time compaction of FPGA designs. In 7th Interna-

tional Conference on Field Programmable Logic and Applications, pages 131–140,

Berlin, Germany, 1997.

[38] O. Diessel and H. ElGindy. On scheduling dynamic FPGA reconfigurations. Tech-

nical report, University of South Australia, 1998.

161

Bibliography

[39] O. Diessel, H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt. Dynamic

scheduling of tasks on partially reconfigurable FPGAs. IEE Proceedings – Com-

puters and Digital Techniques, 147(3):181–188, May 2000.

[40] O. F. Diessel. On Scheduling Dynamic FPGA Reconfiguration, A Partial Rear-

rangement Approach. PhD thesis, The University of Newcastle, Australia, 1998.

[41] H. ElGindy, M. Middendorf, B. Scheuermann, and H. Schmeck. An evolutionary

approach to dynamic task scheduling on FPGAs. In Hartenstein and Gr?nbacher,

editors, Field-Programmable Logic and Applications (FPL’00). Springer-Verlag,

Berlin, September 2000.

[42] S. P. Fekete, E. K¨

ohler, and J. Teich. Extending partial suborders. In Proc.

CTW2001, 1st CologneTwente Workshop on Graphs and Combinatorial Optimiza-

tion. published as Electronical Notes in Discrete Mathematics (ENDM), June 6-8

2001.

[43] S. P. Fekete, E. K¨

ohler, and J. Teich. Higher-dimensional packing with order con-

straints. In Proc. WADS2001, 7th Workshop on Algorithms and Data Structures,

Providence, Rhode Island, U.S.A., August 8-10 2001.

[44] S. P. Fekete, E. K¨

ohler, and J. Teich. Optimale FPGA module placement with

temporal precedence constraints. In Proc. DATE 2001, Design, Automation and

Test in Europe, pages 658–665, Munich, Germany, March 13-16 2001. IEEE Com-

puter Society Press.

[45] P. Garcia, K. Compton, M. Schulte, E. Blem, and W. Fu. An overview of reconfig-

urable hardware in embedded systems. EURASIP Journal on Embedded Systems,

pages 1–19, 2006.

[46] M. B. Gokhale and P. S. Graham. Reconfigurable Computing: Accelerating Com-

putation with Field-Programmable Gate Arrays. Springer, 2005.

[47] J. Goossens, S. Baruah, and S. Funk. Real-time scheduling on multiprocessor. In

Proceedings of the 10th International Conference on Real-Time System, 2002.

[48] J. Goossens, S. Funk, and S. Baruah. Priority-driven scheduling of periodic task

systems on multiprocessors. Real-Time Syst., 25(2-3):187–205, 2003.

[49] S. Govindarajan and R. Vemuri. Tightly integrated design space exploration with

spatial and temporal partitioning in sparcs. In FPL, pages 7–18, 2000.

[50] R. Hartenstein. A decade of reconfigurable computing: a visionary retrospective.

In DATE ’01: Proceedings of the conference on Design, automation and test in

Europe, pages 642–649. IEEE Press, 2001.

[51] W. A. Horn. Some simple scheduling algorithms. Naval Res. Logist. Quart.,

21:177–185, 1974.

162

Bibliography

[52] E. Horta and J. W. Lockwood. PARBIT: a tool to transform bitfiles to implement

partial reconfiguration of field programmable gate arrays (FPGAs). Technical

Report WUCS-01-13, Washington University in Saint Louis, Department of Com-

puter Science, July 6, 2001.

[53] E. L. Horta, J. W. Lockwood, D. E. Taylor, and D. Parlour. Dynamic hardware

plugins in an FPGA with partial run-time reconfiguration. In Design Automation

Conference (DAC), New Orleans, LA, June 2002.

[54] M. B. I. Gat and A. Shashua. Monocular vision advance warning system for the

automotive. In Intelligent Vehicle Initiative (VI) Technology 2005 - Advanced

Controls and Navigation Systems Intelligent Vehicle Initiative (VI) Technology

2005 - Advanced Controls and Navigation Systems. SAE, 2005.

[55] X. Inc. Xilinx marketing solutions. http://www.xilinx.com/esp/, year unknown.

This is an electronic document. Date retrieved: July 13, 2006.

[56] IPFlex. DAPDNA-2 Dynamically Reconfigurable Processor (0412-DD2E). IPFlex,

http://www.ipflex.com, 2004.

[57] S. N. Jason Scott and T. Bapty. Runtime environment for dynamically reconfig-

urable embedded systems. Technical report, Vanderbilt University Nashville, TN

37325, U.S.A., 1999.

[58] J. S. N. Jean, K. Tomko, V. Yavagal, J. Shah, and R. Cook. Dynamic reconfigura-

tion to support concurrent applications. IEEE Trans. on Computers, 48(6):591–

602, June 1999.

[59] H. Kalte, M. K¨

oster, B. Kettelhoit, M. Porrmann, and U. R¨

uckert. A comparative

study on system approaches for partially reconfigurable architectures. In T. Plaks,

editor, Proceedings of the International Conference on Engineering of Reconfig-

urable Systems and Algorithms (ERSA ’04), pages 70–76, Las Vegas, Nevada,

USA, 21 - 24 June 2004.

[60] H. Kalte, G. Lee, M. Porrmann, and U. R¨

uckert. REPLICA: a bitstream manipu-

lation filter for module relocation in partial reconfigurable systems. In Proceedings

of the 19th International Parallel and Distributed Processing Symposium - Recon-

figurable Architectures Workshop, 2005.

[61] H. Kalte and M. Porrmann. Context saving and restoring for multitasking in

reconfigurable systems. In 15th International Conference on Field Programmable

Logic and Applications, pages 223–228, 24 - 28 Aug. 2005.

[62] H. Kalte, M. Porrmann, and U. R¨

uckert. A prototyping platform for dynamically

reconfigurable system on chip designs. In Proceedings of the IEEE Workshop

Heterogeneous reconfigurable Systems on Chip (SoC), Hamburg, Germany, 2002.

163

Bibliography

[63] M. Koester, H. Kalte, and M. Porrmann. Task placement for heterogeneous re-

configurable architectures. In Proceedings of the IEEE 2005 Conference on Field-

Programmable Technology (FPT’05), pages 43–50. IEEE Computer Society, De-

cember 11-14 2005.

[64] H. Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Ap-

plications. Kluewer, 1997.

[65] O.-H. Kwon, J. Kim, S. J. Hong, and S. Lee. Real-time job scheduling in hypercube

systems. In Proceedings of ICPP07, 1997.

[66] S. Lauzac, R. Melhem, and D. Mosse. Comparison of global and partitioning

schemes for scheduling rate monotonic tasks on a multiprocessor. In Proceedings

of the 10th Euromicro Workshop on Real Time Systems, page 188, 1998.

[67] W. Y. Lee, S. J. Hong, and J. Kim. On-line scheduling of scalable real-time tasks

on multiprocessor systems. J. Parallel Distrib. Comput., 63(12):1315–1324, 2003.

[68] J. Lenstra, A. R. Kan, and . P. Brucker. Ann. of Discrete Math., 1:343-362.

Complexity of machine scheduling problems. Annals of Discrete Mathematics,

7(1):343–362, 1977.

[69] L. Levinson, R. M¨

anner, M. Sessler, and H. Simmler. Preemptive multitasking

on FPGAs. In FCCM ’00: Proceedings of the 2000 IEEE Symposium on Field-

Programmable Custom Computing Machines, page 301. IEEE Computer Society,

2000.

[70] J. Liebeherr, A. Burchard, Y. Oh, and S. H. Son. New strategies for assigning real-

time tasks to multiprocessor systems. IEEE Trans. Comput., 44(12):1429–1442,

1995.

[71] D. Lim and M. Peattie. Xilinx Application Note XAPP290: Two Flows for Partial

Reconfiguration: Module Based or Small Bit Manipulations, May 2002.

[72] C. L. Liu and J. W. Layland. Scheduling algorithms for multiprogramming in a

hard-real-time environment. Journal of the ACM, 20(1):46–61, 1973.

[73] A. Lodi, S. Martello, and M. Monaci. Two-dimensional packing problems: A

survey. European Journal of Operational Research, 141:241–252, 202.

[74] P. Lysaght, B. Blodget, J. Mason, J. Young, and B. Bridgeford. Enhanced ar-

chitectures, design methodologies and cad tools for dynamic reconfiguration on

xilinx fpgas. In In proceedings of the 2006 International Conference on Field Pro-

grammable Logic and Applications (FPL), pages 12–17, Madrid, Spain, August

2006. IEEE.

[75] G. D. Micheli and M. Sami. Hardware-software codesign. In Proceedings of the

IEEE Volume 85, Issue 3, pages 349 – 365, 1997.

164

Bibliography

[76] P. Mohapatra. Dynamic real-time task scheduling on hypercubes. Journal of

Parallel and Distributed Computing, pages 91–100, 1997.

[77] Y. Oh. The Design and Analysis of Scheduling Algorithms for Real-Time and

Fault-Tolerant Computer Systems. PhD thesis, University of Verginia, 1994.

[78] Y. Oh and S. H. Son. Fixed-priority scheduling of periodic tasks on multiprocessor

systems. Technical report, University of Virginia, 1995.

[79] R. Pellizzoni and M. Caccamo. Adaptive real-time management of relocatable

tasks for fpga-based embedded systems technical report, university of illinois

at urbana-champaign, 20. Technical report, University of Illinois at Urbana-

Champaign, 2005.

[80] R. Pellizzoni and M. Caccamo. Adaptive allocation of software and hardware

real-time tasks for fpga-based embedded systems. In 12th IEEE Real-Time and

Embedded Technology and Applications Symposium (RTAS’06), pages pp. 208–220,

2006.

[81] C. Phillips, C. Stein, E. Torng, and J. Wein. Optimal time-critical scheduling via

resource augmentation in 29 th ann. In In 29 th Ann. ACM Symp. on Theory of

Computing, pages 140–149”, El Paso, Texas, United States, 1997.

[82] L. Sha, T. F. Abdelzaher, K.-E. ˚

Arz´en, A. Cervin, T. P. Baker, A. Burns, G. C.

Buttazzo, M. Caccamo, J. P. Lehoczky, and A. K. Mok. Real time scheduling

theory: A historical perspective. Real-Time Systems, 28(2-3):101–155, 2004.

[83] H. Simmler, L. Levinson, and R. Manner. Multitasking on FPGA coprocessors.

In FPL, pages 121–130, 2000.

[84] H. C. Simmler. Preemptive Multitasking auf FPGA-Prozessoren. PhD thesis,

Inauguraldissertation zur Erlangung des akademischen Grades eines Doktors der

Naturwissenschaften der Universit¨

at Mannheim, 2001.

[85] H. K.-H. So and R. W. Brodersen. Improving usability of fpga-based reconfigurable

computers through operating system support. In 16th International Conference

on Field Programmable Logic and Applications (FPL ’06), 2006.

[86] A. Srinivasan, P. Holman, J. Anderson, and S. Baruah. The case for fair mul-

tiprocessor scheduling. In Proceedings of International Parallel and Distributed

Processing Symposium (IPDPS), 2003.

[87] J. A. Stankovic and K. Ramamritham. The spring kernel: A new paradigm for

real-time systems. IEEE Softw., 8(3):62–72, 1991. See Buttazzo Book.

[88] J. A. Stankovic, M. Spuri, M. D. Natale, and G. C. Buttazzo. Implications of

classical scheduling results for real-time systems. Computer, 28(6):16–25, 1995.

165

Bibliography

[89] C. Steiger, H. Walder, and M. Platzner. Heuristics for online scheduling real-time

tasks to partially reconfigurable devices. In In Proceedings of the 13rd Interna-

tional Conference on Field Programmable Logic and Applications (FPL03), pages

575–584. September, 2003.

[90] C. Steiger, H. Walder, and M. Platzner. Operating systems for reconfigurable

embedded platforms: Online scheduling of real-time tasks. IEEE Transactions on

Computers, 53(11):1392–1407, November 2004.

[91] C. Steiger, H. Walder, M. Platzner, and L. Thiele. Online scheduling and place-

ment of real-time tasks to partially reconfigurable devices. In Proceedings of 24th

IEEE International Real-Time Systems Symposium (RTSS’03), 2003.

[92] J. Teich. Digitale Hardware/ Software-Systeme. Synthese und Optimierung.

Springer, 1997.

[93] J. Teich, S. Fekete, and J. Schepers. Compile-time optimization of dynamic hard-

ware reconfigurations. In Proc. Int. Conf. on Parallel and Distributed Processing

Techniques and Applications (PDPTA’99), pages 1097–1103, Las Vegas, U.S.A.,

June 1999.

[94] J. Teich, S. Fekete, and J. Schepers. Optimization of dynamic hardware reconfig-

urations. The J. of Supercomputing, 19(1):57–75, May 2000.

[95] R. Tessier and W. Burleson. Reconfigurable computing for digital signal process-

ing: A survey. The Journal of VLSI Signal Processing, 28(1-2):7–27, May 2001.

[96] S. Trimberger, D. Carberry, A. Johnson, and J. Wong. A time-multiplexed FPGA.

In IEEE Symp. on FPGAs and Custom Computing Machines (FCCM), pages 22–

28, 1997.

[97] M. Ullmann, M. Hubner, B. Grimm, and J. Becker. An fpga run-time system

for dynamical on-demand reconfiguration. In 18th International Parallel and Dis-

tributed Processing Symposium (IPDPS’04) - Reconfigurable Architectures Work-

shop, 2004.

[98] H. Walder, S. Nobs, and M. Platzner. Xf-board: A prototyping platform for recon-

figurable hardware operating systems. In Proc. Int’l Conf. Eng. of Reconfigurable

Systems and Algorithms (ERSA), 2004.

[99] H. Walder and M. Platzner. Non-preemptive multitasking on FPGAs:task place-

ment and footprint transform. Technical report, Swiss Federal Institute of Tech-

nology, ETH, 2002.

[100] H. Walder and M. Platzner. Reconfigurable Hardware Operating Systems: From

Design Concepts to Realizations. In Proceedings of the 3rd International Confer-

ence on Engineering of Reconfigurable Systems and Architectures (ERSA), pages

284–287. CSREA Press, June 2003.

166

Bibliography

[101] H. Walder and M. Platzner. Reconfigurable hardware os prototype. Technical Re-

port Technical Report TIK Nr. 168, Swiss Federal Institute of Technology (ETH),

April 2003.

[102] H. Walder and M. Platzner. A runtime environment for reconfigurable operating

systems. In In Proceeding 14th International Conference on Field-Programmable

Logic and Applications (FPL), 2004.

[103] H. H. Walder. Operating system design for partially reconfigurable logic devices.

(tik-schriftenreihe ; nr. 68/2, diss nr. 15955), ETH Zuerich, 2005.

[104] T. Wiangtong, P. Cheung, and W. Luk. A unified codesign run-time environ-

ment for the ultrasonic reconfigurable computer. In Field-Programmable Logic

and Applications (FPL’03). p-springer, september 2003.

[105] G. Wigley and D. Kearney. The Development of an Operating System for Recon-

figurable Computing. In Proceedings IEEE Symposium on Field-Programmable

Custom Computing Machines (FCCM’01). IEEE CS Press, Los Alamitos, CA,

USA, 2001.

[106] Xilinx. Virtex 2.5 V Field Programmable Gate Arrays, April 2001. http://www.

xilinx.com.

[107] Xilinx. XILINX LogicCore 1-D Discrete Cosine Transformation V2.1, 2002.

[108] Xilinx. Virtex-II Pro and Virtex-II Pro X Platform FPGA: Functional Descrip-

tion, 2005. http://www.xilinx.com.

167

Bibliography

168