Document [original]

Run-time Reconfigurable RTOS for

Reconfigurable Systems-on-Chip

Dissertation

A thesis submited to the

Faculty of Computer Science, Electrical Engineering and Mathematics

of the

University of Paderborn

and to the

Graduate Program in Electrical Engineering (PPGEE)

of the

Federal University of Rio Grande do Sul

in partial fulfillment of the requirements for the degree of

Dr. rer. nat.

and

Dr. in Electrical Engineering

Marcelo G¨otz

November, 2007

Supervisors:

Prof. Dr. rer. nat. Franz J. Rammig, University of Paderborn, Germany

Prof. Dr.-Ing. Carlos E. Pereira, Federal University of Rio Grande do Sul, Brazil

Public examination in Paderborn, Germany

Additional members of examination committee:

Prof. Dr. Marco Platzner

Prof. Dr. Urlich R¨uckert

Dr. Mario Porrmann

Date: April 23, 2007

Public examination in Porto Alegre, Brazil

Additional members of examination committee:

Prof. Dr. Fl´avio Rech Wagner

Prof. Dr. Luigi Carro

Prof. Dr. Altamiro Amadeu Susin

Prof. Dr. Leandro Buss Becker

Prof. Dr. Fernando Gehm Moraes

Date: October 11, 2007

Acknowledgements

This PhD thesis was carried out, in his great majority, at the Department of Computer

Science, Electrical Engineering and Mathematics of the University of Paderborn, during

my time as a fellow of the working group of Prof. Dr. Franz J. Rammig. Due to a formal

agreement between Federal University of Rio Grande do Sul (UFRGS) and University of

Paderborn, I had the opportunity to receive a bi-national doctoral degree. Therefore, I

had to accomplished, in addition, the PhD Graduate Program in Electrical Engineering

of UFRGS. So, I hereby acknowledge, without mention everyone personally, all persons

involved in making this agreement possible.

I wish to, first, express my deep gratitude to my supervisor Prof. Dr. Franz Josef

Rammig for supervising, advising and tutoring me during my whole PhD journey. His

understanding, patience, knowhow and constant encouragements were essential to bring

me to the success of finishing this work. In the same way, I wish to thank specially my co-

supervisor, Prof. Dr. Carlos Eduardo Pereira, for his equal support, even before starting

this PhD thesis. The many fruitful talks, in which he shared with me his knowledge and

experience, had strong influence in my decision of being a researcher and starting this

PhD.

I am also glad of having being part of Prof. Rammig’s working group, of having the

opportunity to work on different research projects, to advise Bachelor and Master Stu-

dents, and to supervise students during labs of his lectures. Furthermore, I am glad of

having many chances for fruitful talks, discussions, brainstorming, that I had with my

colleagues in the great atmosphere of the working group. All these facts together, have

positively contributed to the success of this work.

My thanks go also to both examination committees: in Germany composed by Prof. Dr.

Marco Platzner, Prof. Dr. Urlich R¨uckert, Dr. Mario Porrmann; and in Brazil composed

by Prof. Dr. Fl´avio Rech Wagner, Prof. Dr. Luigi Carro, Prof. Dr. Altamiro Amadeu

Susin, Prof. Dr. Leandro Buss Becker, Prof. Dr. Fernando Gehm Moraes. They,

together with my both supervisors, spent their times reading this thesis and provided

me a positive feedback during the PhD-defense.

I wish also to thank all my former colleagues, which helped be somehow during my PhD

research. Their different points of view, wide range of interests and knowledge, were

very important to improve the level of my work. My particular thanks go to Dr. Achim

Rettberg, for spending his time in listening and discussing my research topic, and for

offering his partnership and friendship, which was always pushing me forward in several

moments. In the same way, I wish to thank Florian Dittmann, who shared with me his

great knowledge in Reconfigurable Computing. With these two colleagues, I had several

publications, which were extremely important in my research. Thank you both for your

cooperation.

Further, I whish to extend my thankfulness to all my other colleagues: Dr. Sabina Rips,

Tales Heimfarth for several and fruitful discussions, Yuhong Zhao, Katharina Stahl, Peter

Janacik, Timo Kersten, Simon Oberthuer, Norma Montealegre, Dr. Carsten B¨oke, Dr.

Martin Kardos, Dr. Klaus Danne, Dr. Stefan Ihmor, and Prof. Dr. Christophe Bobda.

I wish to thank specially Dalimir Orfanus, which with his friendship gave me strong

support in my last part of my PhD journey.

I must not forget my working students, Stefan Finke and MSc. Tao Xie, which helped

me in implementing an important part of my concepts. Without them, this thesis would

take longer to be finished.

I will never forget my friend Arne Rolf Spiller, who had crucial importance in my first

years in Germany. With his unconditional friendship and support, I surpassed my self

on several circumstances. He and his parents, Erwin and Ursula, are my second family

now. Thank you for being so receptive and lovely.

Finally, I wish to thank my sister Fabiana and my loving mother Antˆonia Elvira, which

pushed me forward in my studies, since I started in the school until today. Last, but not

least, my lovely thanks go to Adriana Aparecida Paz, my girlfriend, who never abandoned

me, even having an ocean between us. She was very important for the success of this

work, due to his patience and support, and especially for her comprehensiveness and

loveliness.

November, 2007.

Abstract

Embedded systems are massively present in our lives and they are becoming omnipresent.

This has demanded strong efforts in research for providing new solutions for the chal-

lenges faced in the design of such systems. For instance, the requirements of high compu-

tational performance and flexibility of the contemporary embedded systems are continu-

ously increasing. A single architecture must be able to support, in certain cases, different

kind of applications with different requirements which can start asynchronously and dy-

namically (changing environments). Reconfigurable computing seems to be a potential

paradigm for these scenarios as it can provide flexibility and high computational perfor-

mance for modern embedded systems. Of especial interest are those architectures where

a microprocessor is tightly connected with a reconfigurable hardware (hybrid platform),

constituting a so called reconfigurable System-on-Chip (RSoC). However, the complexity

in designing such systems rises. Therefore, the usage of an Operating System (OS) is

essential to provide the necessary abstraction of the computational resources in reconfig-

urable computing. Moreover, due to the intrinsic overhead caused by the reconfiguration

activities and the potential sharing of computational resources the necessity for support

provided by an OS is unquestionable. Nevertheless, embedded system platforms lack in

computational resources. This fact requires a careful design of an OS for such a system,

since it also consumes its resources.

Along with the application tasks, the OS can profit from a RSoC based architecture

by reconfiguring itself over this hybrid platform. Thereby, the OS can make use of the

remaining resources that are not currently required by the application for its execution.

Within this context, this work presents the design of proper methodologies, strategies,

hardware and design support for a proper management of dynamic reconfiguration ac-

tivities of a Real-Time Operating System (RTOS) running on a RSoC based platform.

The intention thereby, is to promote the self-reconfiguration of the RTOS services on

this hybrid platform, so that the computational resources of this execution platform are

used in an efficient way.

Zusammenfassung

Eingebettete Systeme haben eine starke Pr¨asenz in unseren allt¨aglichen Leben bekom-

men, in vielen Bereichen sind sie allgegenw¨artig geworden. Dieses ist eine Heraus-

forderung f¨ur die Forschung im Bereich solcher Systeme. St¨andig m¨ussen neue ad¨aquate

L¨osungen gefunden werden. Durch die zunehmenden Anforderungen nimmt die Leistung

und Flexibilit¨at bei eingebetteten Systemen st¨andig zu. Zum Beispiel, eine einzelne

Architektur muss in der Lage sein, in bestimmten F¨allen, mehrere Applikationen mit

verschiedenen Anforderungen zu unterst¨utzen, die asynchron und dynamisch ablaufen

k¨onnen (dynamische Umgebungen). Rekonfigurierbare Rechensysteme scheinen ein po-

tentielles Paradigma f¨ur diese Szenarien zu sein, weil sie Flexibilit¨at und hohe Rechen-

leistung f¨ur moderne eingebettete Systeme liefern k¨onnen. Von besonderem Interesse

sind jene Architekturen, wo ein Mikroprozessor mit rekonfigurierbarer Hardware fest

verbunden ist (hybride Plattform). Eine solche hybride Plattform nennt man rekon-

figurierbares System-on-Chip (RSoC). Jedoch nimmt die Komplexit¨at in solchen Sys-

temen st¨andig zu. Deshalb ist die Anwendung eines Betriebssystems (OSs) wesentlich,

um eine notwendige Abstraktion von den vorhandenen Ressourcen in rekonfigurierbaren

Rechensystemen zu erm¨oglichen. Weiterhin, ist durch die gemeinsame Nutzung von

Ressourcen einer solchen Architektur und deren Verwaltung in Bezug auf die Rekon-

figuration, der Einsatz eines OS zwingend notwendig. Dennoch sind die Ressourcen in

eingebetteten Systemen begrenzt. Deshalb muss beim Entwurf eines OS f¨ur ein solches

System sorgf¨altig vorgegangen werden, da das OS an sich schon Ressourcen verbraucht.

Zusammen mit den Applikationen kann das OS auch von den RSoC Architekturen prof-

itieren dadurch, dass das OS sich selbst auf der hybriden Plattform rekonfigurieren

kann. Somit kann das OS die ¨ubrigen Ressourcen nutzen, die nicht gegenw¨artig von

der Applikation benutzt werden. In diesen Rahmen pr¨asentiert die vorliegende Ar-

beit den Entwurf von geeigneten Methodologien, Strategien, Hardware und Entwurf-

sunterst¨utzungen f¨ur eine geeignete Verwaltung von dynamischen Rekonfigurierungsak-

tivit¨aten eines Echtzeitbetriebssystems (RTOSs), das auf einer RSoC basierten Plattform

l¨auft. Die Intention dabei ist es die Selbst-Rekonfiguration der RTOS Dienste auf einer

hybriden Plattform zu erm¨oglichen, wodurch die vorhandenen Ressourcen der Plattform

effektiv ausgenutzt werden k¨onnen.

Resumo

Sistemas embarcados est˜ao cada vez mais presentes em nossas vidas e est˜ao se tornando

onipresentes. Este fato tem demandado grandes esfor¸cos em pesquisa para cria¸c˜ao de

propostas e solu¸c˜oes para os desafios gerados no desenvolvimento destes sistemas. Por

exemplo, uma arquitetura moderna de sistemas embarcados requer alto poder de com-

puta¸c˜ao e tamb´em grande flexibilidade, e a demanda por estes requisitos tem crescido

constantemente. Uma ´unica arquitetura deve executar, em certos casos, diferentes

aplica¸c˜oes com diferentes requisitos e com inicio de execu¸c˜ao indeterminado, caracteri-

zando desta maneira um ambiente dinˆamico. A computa¸c˜ao reconfigur´avel aparece como

um paradigma promissor para estes casos pois consegue prover alto poder de computa¸c˜ao

juntamente com flexibilidade requeridas pelos sistemas embarcados modernos. Espe-

cialmente interessantes s˜ao arquiteturas baseadas em System-on-Chip reconfigur´aveis

(RSoC), nas quais um microprocessador est´a fortemente conectado a um hardware re-

configur´avel (plataforma h´ıbrida). Por´em a complexidade no desenvolvimento destes

tipos de sistemas cresce, tornando o uso de um sistema operacional (SO) indispens´avel.

Entretanto, uma plataforma de execu¸c˜ao de um sistema embarcado sofre pela escassez

de recursos. Este fato exige um cuidado especial no desenvolvimento de um SO uma vez

que este tamb´em usa os recursos desta plataforma.

Juntamente com as tarefas da aplica¸c˜ao, o SO tamb´em pode tirar proveito de uma

plataforma baseada em RSoC onde este ´e capaz de se auto reconfigurar sobre esta

plataforma h´ıbrida. Deste modo, o SO pode usar os recursos computacionais, corrente-

mente n˜ao requeridos pelas aplica¸c˜oes, para a sua execu¸c˜ao. Dentro deste contexto,

este trabalho apresenta o design de metodologias, estrat´egias e suporte em hardware e

software para o gerenciamento apropriado das atividades de reconfigura¸c˜oes dinˆamicas

de um sistema operacional de tempo-real (RTOS), que ´e executado em uma plataforma

baseada em RSoC. A inten¸c˜ao com isto ´e a de proporcionar ao RTOS meios com os quais

este ´e capaz de se auto reconfigurar nesta arquitetura h´ıbrida com a inten¸c˜ao de atingir

um uso mais eficiente dos recursos computacionais desta plataforma de execu¸c˜ao.

Contents

List of Figures xvii

List of Tables xix

List of Algorithms xxi

List of Symbols xxiii

1. Introduction 1

1.1. Motivation ................................... 2

1.2. Thesis Goals .................................. 3

1.3. Thesis Contributions .............................. 4

1.4. Thesis Outline ................................. 4

2. Background 7

2.1. Embedded System Design ........................... 7

2.2. Reconfigurable Computing Overview ..................... 9

2.2.1. Coupling CPU and Reconfigurable Hardware ............ 11

2.2.2. Reconfigurable System Design ..................... 11

2.3. Reconfigurable Hardware Technology ..................... 13

2.3.1. Hybrid Architecture .......................... 14

2.3.2. Configuration Techniques ....................... 15

2.3.3. Partial Reconfiguration Feature .................... 16

2.4. Chapter Conclusions .............................. 17

3. Related Work Survey 19

xii Contents

3.1. (Re)Configurable Operating Systems ..................... 19

3.1.1. Statically Reconfigurable OS ..................... 20

3.1.2. Dynamically Reconfigurable OS: Application Triggered ...... 21

3.1.3. Dynamically Reconfigurable OS: System Triggered ......... 23

3.1.4. Towards Online Reconfigurable DREAMS .............. 24

3.1.5. Further Comments ........................... 24

3.2. Operating System for Reconfigurable Computing .............. 25

3.2.1. Low-level OS Support for Reconfigurable Hardware ......... 26

3.2.2. Application Model ........................... 29

3.2.3. OS Services for Reconfigurable Hardware .............. 29

3.2.4. RTOS issues in High Level Design .................. 30

3.2.5. Offline Approaches ........................... 31

3.2.6. Run-time Support ........................... 32

3.2.7. Multitasking Issues ........................... 35

3.2.8. Dynamically Hybrid Architectures .................. 36

3.3. Further Approaches .............................. 37

3.3.1. Hardware Accelerator for RTOS ................... 37

3.3.2. Multithreading on Hybrid Architectures ............... 38

3.4. Chapter Conclusions .............................. 40

3.4.1. Correlation With This Thesis ..................... 40

3.4.2. Additional Comments ......................... 41

4. Run-time Reconfigurable RTOS 43

4.1. System Overview ................................ 43

4.1.1. Target Applications .......................... 44

4.1.2. Target RTOS .............................. 45

4.1.3. Instrumented OS API ......................... 46

4.1.4. Run-time Reconfiguration Manager - RRM ............. 49

4.2. Hardware Architecture ............................. 49

4.3. Design Support ................................. 50

4.4. Chapter Conclusions .............................. 51

5. Modeling & Problem Formulation 53

5.1. Component Assignment ............................ 53

5.1.1. Constraints Definition ......................... 54

5.1.2. Objective Function Definition ..................... 55

5.1.3. Allocation Example .......................... 56

5.2. Reconfiguration Costs ............................. 56

5.2.1. Temporal Specification ......................... 58

5.3. Communication Costs ............................. 59

5.4. Chapter Conclusions .............................. 61

6. Run-Time Methods 63

6.1. OS Service Allocation ............................. 63

Contents xiii

6.1.1. OS Service Assignment Phase ..................... 64

6.1.2. OS Service Assignment Example ................... 64

6.1.3. Balance BImprovement Phase .................... 65

6.1.4. Balance BImprovement Example .................. 67

6.1.5. Reconfiguration Cost Reduction ................... 68

6.2. Communication-aware Allocation Algorithm ................. 69

6.3. Handling Reconfiguration Activities ..................... 71

6.4. OS Component Reconfiguration ........................ 72

6.4.1. Applying Total Bandwidth Server .................. 74

6.4.2. Deriving Migration Conditions .................... 75

6.4.3. Software to Hardware Migration ................... 75

6.4.4. Hardware to Software Migration ................... 77

6.4.5. Software Service Reconfiguration ................... 77

6.4.6. Hardware Service Reconfiguration .................. 78

6.4.7. Migrating by Preempting ....................... 78

6.5. Schedulability Analysis ............................. 81

6.6. OS Components Scheduling .......................... 82

6.6.1. Partial Schedule ............................ 83

6.6.2. Complete Schedule ........................... 84

6.7. Chapter Conclusions .............................. 85

7. Methods Evaluation 87

7.1. OS Components Allocation .......................... 87

7.1.1. OS Components Assignment ..................... 87

7.2. Balancing Heuristic ............................... 88

7.2.1. Reconfiguration Cost Reduction ................... 89

7.3. Components Reconfiguration Scheduling ................... 91

7.4. Communication costs reduction ........................ 92

7.5. Chapter Conclusions .............................. 95

8. Design Support 97

8.1. Hardware-Software Interface Synthesis .................... 98

8.1.1. OS Driver Extension .......................... 99

8.1.2. Software Interface for Reconfigurable IPs .............. 99

8.1.3. Integration into IFS Tool .......................100

8.1.4. Further Extension for DREAMS ...................101

8.2. Relocatable Tasks Design ...........................101

8.2.1. Unified Task Representation ......................102

8.2.2. A Framework for Relocatable Task Design ..............104

8.3. Chapter Conclusions ..............................107

9. Case Study 111

9.1. Target OS Service ...............................111

9.2. Relocatable Triple-DES ............................113

xiv Contents

9.3. Testbed Set Up .................................115

9.4. Quantitative Results ..............................116

9.5. Chapter Conclusions ..............................118

10.Conclusion & Outlook 121

10.1. Summary ....................................121

10.2. Outlook .....................................123

A. Further Evaluation Results 125

B. HW/SW Interface Generation 129

C. Hardware/Software Task Design 131

C.1. Hardware Task Controller Template .....................131

C.2. Sequence Graphs for Two Migration Cases ..................131

D. TBS Server Bandwidth Estimation 135

List of Figures

1.1. Reconfigurable OS for RSoC: Overview. ................... 3

2.1. HW/SW partitioning scheme. ......................... 9

2.2. Performance versus Flexibility tradeoff. ................... 10

2.3. Example of dynamic reconfiguration usage [33]. ............... 11

2.4. CPU and RH: Coupling types. ........................ 12

2.5. Typical SoC architecture (a) and a RSoC architecture (b) [38]. ...... 13

2.6. FPGA comprises a set of CLBs. ........................ 13

2.7. Viertex-II Pro architecture overview [40]. .................. 14

2.8. CoreConnect block diagram [40]. ....................... 15

2.9. Example of device partition for 1D (a) and 2D (b) partition models. . . . 17

3.1. Execution environment envisioned by the Reconfigurable Computing re-

search community. ............................... 26

3.2. Self-programming hardware architecture [87]. ................ 27

3.3. Virtualized interface [90]. ........................... 28

3.4. Software and hardware threads connected through a virtualization layer

[93]. ....................................... 28

3.5. A task graph on FPGA [98]. .......................... 33

3.6. OS frame and task communication block [114]. ............... 34

3.7. Hybrid thread abstraction layer [148]. .................... 39

4.1. System overview. ................................ 44

4.2. Proposed microkernel based architecture. .................. 45

4.3. Call patterns experienced by an OS service. ................. 47

4.4. System architecture. .............................. 51

xvi List of Figures

5.1. System architecture highlighting the communication channels. ....... 60

5.2. Sample of an OS component graph. ...................... 60

6.1. Example of two components being clustered. ................ 71

6.2. Optimal arrival time ˆaifor Jb

i......................... 76

6.3. Worst-case arrival time ˆaifor Jb

i........................ 76

6.4. Migrating by preempting: software to hardware. .............. 79

6.5. Migrating by preempting: hardware to software. .............. 81

7.1. Unbalance average for different δconstraints. ................ 88

7.2. Total cost assignment average for different δconstraints. .......... 89

7.3. Number of components being reconfigured for different δconstraints.

Heuristic-2a: original balancing algorithm. Heuristic-2b: modified bal-

ancing algorithm. ................................ 90

7.4. Unbalance average for different δconstraints. ................ 91

7.5. Payoff in U+Adue to the balancing algorithm modified for reconfiguration

costs reduction. ................................. 92

7.6. Component reconfiguration scheduling: heuristic algorithm evaluation. . . 93

7.7. Evaluation results comparison. ........................ 94

7.8. Payoff in overall resource usage due to clustering process. ......... 95

8.1. Virtex-II Pro hardware/software interface. .................. 98

8.2. Mapping between physical and virtual registers. ............... 99

8.3. A method foo1 HW build upon library calls. ................101

8.4. State transition graph representation of a relocatable task. .........103

8.5. Design flow supported by the framework. ..................104

8.6. Informal description of TSD. .........................105

8.7. Hardware task overview. ............................107

9.1. Triple-DES and its basic DES cipher block. .................112

9.2. CBC mode for block ciphers. .........................112

9.3. State transition graph corresponding to Algorithm 6. ............114

9.4. Basic hardware platform. ...........................115

10.1. System overview. ................................122

A.1. Comparison among communication costs reduction for three different sit-

uations. .....................................126

A.2. Comparison among payoffs in overall resource utilization for three differ-

ent situations. ..................................127

B.1. Class Diagram of possible interface registers. ................130

C.1. The controller template for a hardware task. .................132

C.2. Sequence graph specifying the task migration from software to hardware. 132

List of Figures xvii

C.3. Sequence graph specifying the task migration from hardware to software. 133

D.1. Relocation from software to hardware: An Example. ............135

xviii List of Figures

List of Tables

3.1. OS classification according to the reconfiguration Time and Initiator.. . 24

5.1. Different allocation possibilities for two services, and the respective overall

cost and balance values. ............................ 57

5.2. Time costs related to each migration case. .................. 58

5.3. Service definition related to its periodic execution. ............. 59

6.1. Example of three components and their respective costs. .......... 65

6.2. Assignment algorithm applied in the example presented in Table 6.1. . . . 66

6.3. Example of a complete OS service allocation. ................ 68

9.1. FPGA and memory utilization. ........................116

9.2. Context data transfer: average time measured. ...............118

xix

xx List of Tables

List of Algorithms

1. Service assignment heuristic. ......................... 65

2. Heuristic for balancing Bimprovement. ................... 67

3. Improved heuristic for balancing B....................... 69

4. Partial schedule. ................................ 84

5. Whole schedule. ................................. 85

6. Triple-DES pseudo algorithm. .........................113

xxi

xxii List of Algorithms

List of Symbols

SSet of OS services

siA generic OS service

S?Subset of Srepresenting the OS services that will undergo a reconfiguration

iA generic OS service that will undergo a reconfiguration

SsSet of software services that will be changed by another software service (Ss⊂ S?)

ShSet of hardware services that will be changed by another hardware service (Sh⊂

S?)

SwSet of services that will change the execution domain (Sw⊂ S?)

TsSet of services located in software

ThSet of services located in hardware

C1Cost set of software services

C2Cost set of hardware services

CCost set of all OS services

ci,j Cost of a service siassigned to execution domain j(j= 1 : CPU; j= 2 : FPGA)

AiCost used by service siwhen assigned to FPGA (ci,2=Ai)

xxiii

xxiv List of Symbols

UiCost used by service siwhen assigned to CPU (ci,1=Ui)

ciHardware and software costs of a service si(ci={Ui, Ai})

liSum of Uiand Ai(li=ci,1+ci,2)

ATotal cost of OS services assigned to FPGA

UTotal cost of OS services assigned to CPU

BBalance between FPGA and CPU costs

w1Tendency of application resource utilization in CPU

w2Tendency of application resource utilization in FPGA

Amax Maximum FPGA area cost available for OS services

Umax Maximum CPU workload cost available for OS services

δBalancing constraint: maximum allowed unbalance

X1Assignment solution for software services

X2Assignment solution for hardware services

XAssignment solution for all OS services

xi,j Assignment solution of a service siat execution domain j

xiAssignment solution for a service si(xi={xi,1, xi,2})

RComplete system reconfiguration cost

RiRelocation matrix cost (3 x 3) of a component si

ri,j Cost to relocate a service from domain ito domain j

ziAssignment difference of a service sibetween two different assignment solutions

ZAssignment difference set of services between two different assignment set solu-

tions

CαCommunication cost between two OS services, both located in software

CβCommunication cost between two OS services, both located in hardware

List of Symbols xxv

CγCommunication cost between two OS services, each one located in a different

execution domain

κ(u, v) Communication costs between two OS services, uand v(κ={Cα, Cβ, Cγ})

(λ1, λ2) Maximum resulted component cost that is allowed when clustering two OS

services

JSet of jobs needed to perform a system reconfiguration

JiA generic reconfiguration job associated to service si

iConfiguration phase of a reconfiguration job Ji

iMigration phase of a reconfiguration job Ji

Ms“Migration” time of a component when changed by another component version

at CPU

Mh“Migration” time of a component when changed by another component version

at FPGA

MwMigration time of a component when relocated between CPU and FPGA

QsConfiguration (programming) time of a component at CPU

QhConfiguration (programming) time of a component at FPGA

EsExecution time of a component at CPU

EhExecution time of a component at FPGA

PiPeriod of service si

DiRelative deadline of service si

di,k Absolute deadline of the kth instance of service si

diAbsolute deadline of migration job Jb

ai,k Arrival time of the kth instance of service si

ˆaiArrival time of migration job Jb

bi,k Starting (beginning) time of the kth instance of service si

biStarting (beginning) time of migration job Jb

xxvi List of Symbols

fi,k Finishing time of the kth instance of service si

fiFinishing time of migration job Jb

ηComputation ratio (0 ≤η≤1) of a migration job

σTime distance between arrival times of a service and its related migration job Jb

CHAPTER 1

Introduction

The presence of embedded systems is massive in our lives. It is possible to identify

its usage in a great variety of products: cellular phones, Personal Digital Assistance

devices (PDAs), household appliances (e.g., washing machines, microwave ovens, DVD

players), vehicles, airplanes, missiles, medical equipments, etc.; only to cite some of them.

Some numbers show that more than 99% of the microprocessors produced nowadays are

devoted to embedded system platforms [1].

The design of embedded systems is complex and involves many interdisciplinary research

areas, from high level modeling and simulation, through software, hardware and platform

design to the hardware and software synthesis and testing [2]. Unlike a general-purpose

computer, such as a Personal Computer (PC), the design of an embedded system is

considerably more complex. Usually, when designing such system one needs to take into

account requirements such as memory and power consumption, real-time behavior, short

time to market, etc.

On the other hand, with technology advances, most of embedded system components

can be incorporated into a single chip, leading to the so called System-on-Chip (SoC) [3].

One SoC may contain one or more CPUs, memory, peripherals and dedicated hardware

components (e.g., coprocessors) specifically developed for a target application in order to

meat the performance required by the application. These solutions, however, are usually

static, meaning that adaptations and/or modifications due to application changes are

not adequately supported.

A side effect of this integration is that contemporary embedded systems are increas-

2CHAPTER 1. INTRODUCTION

ingly incorporating more and more functionalities, requiring thereby higher computation

performance. Moreover, these systems are becoming multipurpose, since they need to

support more than one application which may be different in their nature. For instance,

modern PDA or mobile phones are capable to play movies, connect to the internet

and make telephone calls among other activities. In addition, the algorithm complex-

ity tends to increase in many application domains such as signal, image and processing

control. Therefore, modern embedded systems are increasingly requiring more compu-

tational performance and flexibility due to the growing complexity and dynamics of the

application domain.

In order to be able to handle this tradeoff between flexibility and high performance,

an execution platform combining reconfigurable hardware and SoC becomes an attrac-

tive solution. This new platform is usually referred as Reconfigurable System-on-Chip

(RSoC). A single RSoC based platform may provide different facilities. For instance, a

product can change or upgrade its functionality by e.g., at initialization phase, loading

different configuration data. This enables, further, to upgrade a product that has been

already assembled, or even to correct some error identified in this product after delivering

it to the market. Additionally, a RSoC based system may support run-time reconfigu-

ration. This may even allow the creation of more complex systems, with capability to

dynamically adapt to system changes.

1.1. Motivation

The duty of an Operating System (OS) is to provide the necessary abstraction of the

hardware platform and provide services to the application, like for instance, message

passing, shared resources management, etc. Given that the computational power and

size (in terms of components integration) of nowadays SoC are rapidly increasing, the

utilization of an OS is also gaining in importance. An OS easies programming activities

by providing well defined interfaces to the underlying execution platform, hiding thereby

low level details from programmers. Furthermore, it enables the portability, reusability

and protection among applications. Actually, the reasons of using an OS for a SoC are

not different from those for running an OS on any system [4].

The demand of an OS is more emphasized in the case of platforms based on RSoC. Due

to the aggregated overhead caused by the reconfiguration activities, and management

of the shared reconfigurable hardware (dynamic reconfiguration), the support given by

an OS is highly desired. The OS executing on such platforms needs to provide suitable

methods and infrastructure for managing and using the resources in an efficient manner.

The design of a SoC architecture is application dependent. Due to the necessity to reduce

costs, power consumption, etc, it is required to provide an architecture only with those

components that a specific application will require. Since an OS also uses resources from

the execution platform, it is also desirable to have a modular OS. Thus, the software

1.2. THESIS GOALS 3

designer may tailor the OS to provide only the required services that an application

needs. These solutions, however, are usually static.

Since applications of modern embedded systems do present dynamic behavior (e.g., the

PDA example mentioned above), imposing thereby dynamic requirements to the OS,

those static solutions are no longer efficient. For such scenarios, the presence of an

OS capable to be dynamically reconfigured is highly desirable, in order to provide only

the current facilities required by the application and, thereby, using the resources in an

efficient manner.

1.2. Thesis Goals

A typical RSoC architecture includes a CPU and reconfigurable hardware as main com-

putational resources, which are shared between application and OS activities. Towards

an optimal usage of the available resources, the RTOS should be able to reconfigure itself

over the underlying hybrid architecture.

In this direction, this thesis aims to provide methodologies, strategies, mechanisms, as

well as hardware and design support, which aggregates self-reconfiguration capabilities

to an embedded operating system. By this means, the used operating system can config-

ure itself over the hybrid architecture in order to use the computational resources that

are currently not being used by application programs, and similarly, freeing resources

currently demanded.

Figure 1.1 summarizes the idea of this thesis. An extra component in the system, called

RRM (Run-time Reconfiguration Manager), monitors application requirements along

with the current occupation of the execution platform. So, by continuously analyzing

the overall resource utilization it coordinates the configuration of the OS services over

the hybrid architecture.

CPU

FPGA

RRM OPERATING SYSTEM

APP1 APPnAPP2

M O N I T O R I N G

USES

Figure 1.1.: Reconfigurable OS for RSoC: Overview.

4CHAPTER 1. INTRODUCTION

1.3. Thesis Contributions

The main contribution of this thesis is the development of strategies and methodologies

comprising an OS extension that can be aggregated to a real-time operating system

running under a hybrid execution platform enabling, thereby, an effective usage of the

computational resources by application tasks as well as by OS services. Such OS is well

adequate for given support to the modern and future embedded systems, relying on

execution platforms with high flexibility and performance.

All main parts of this thesis have been published in several conferences, which demon-

strate the recognition of this work by the research community. Moreover, it indicates

the relevance of the investigations carried out in the scope of this thesis research.

The raw proposal and concept of a reconfigurable OS for RSoC has been published in

[5]. Then, in [6] a more sound proposal for an execution platform is presented along

with heuristic algorithms for allocation of OS services over this hybrid platform. Later

on, in [7,8] those algorithms were improved in order to decrease the reconfiguration

overhead. A further extension to the allocation algorithms, in order to take into account

the communication costs of OS components depending on their allocations, has been

presented in [9].

Appropriated model and strategies used to assure a deterministic reconfiguration of OS

services on the hybrid platform were firstly presented in [10] and then extended to cover

all aspects of this thesis in [11]. In addition, a framework and additional infrastructure

to support the design of relocatable components was presented in [12], [13] and [14].

The use of the proposed OS adaptation mechanisms and strategies in self-optimizing

systems is presented in [15,16]. Furthermore, an overview of the overall proposal was

published in [17] and recently accepted for journal publication [18].

1.4. Thesis Outline

This thesis is organized as follows:

Chapter 2provides theoretical background information, necessary for a clear under-

standing of the topics discussed in the subsequent chapters.

Chapter 3summarizes relevant related work. It includes a survey on (re)configurable

operating systems based solely in software, filtering those ones intended for embed-

ded systems domain. Then, several aspects related to reconfigurable computing

and the proper and adequate support of an operating system for such systems is

largely discussed. Comparisons between those works and the subject of this thesis

are left to the end of the chapter.

Chapter 4discusses in more details the ideas briefly introduced in Section 1.2. Herein,

1.4. THESIS OUTLINE 5

target applications are carefully specified. In addition, the target OS, which will

be used in the validation phase of this work, as well as the reasons for choosing it,

are presented. Furthermore, the execution platform is introduced in this chapter

along with the related design assistance requirements.

Chapter 5presents the models adopted for the execution platform, the OS components,

and the reconfiguration activities. Additionally, the main problems that the Run-

time Reconfiguration Management (RRM) needs to solve are here formulated.

Chapter 6introduces the methodologies and strategies adopted by the RRM to deter-

ministically manage the allocation and reconfiguration of the OS services over the

hybrid architecture. These strategies are mainly based on low complexity heuris-

tic algorithms proposed to solve NP-Hard problems faced by RRM, which further

need to be executed concurrently with the normal system operation.

Chapter 7evaluates each heuristic algorithm presented in the previous chapter. For

that purpose, the MATLAB tool was used. Results are analyzed according to

algorithm efficiency, pointing out where further research need to be spent to solve

some open questions.

Chapter 8describes the support made available to the programmer to enable the us-

age of the proposed system, covering thereby two aspects. First, the automatic

generation of hardware-software interface using the IFS [19] (Interface Synthesis)

tool is explained. Second, a framework for generation of hybrid services/tasks that

enables their run-time relocation between CPU and FPGA is presented.

Chapter 9shows a case study used to validate the proposed reconfiguration strategies,

the design support proposed, and to analyze the effectiveness of the underlying

architecture in carrying out the reconfiguration activities. As a target OS service,

an encryption algorithm was selected, which is an OS service (for both, embed-

ded and non embedded systems) increasingly demanded by safety and security

applications.

Chapter 10 gives a synopsis of the work presented in this thesis. Furthermore, it signals

directions on which further work can be conducted.

6CHAPTER 1. INTRODUCTION

CHAPTER 2

Background

This chapter gives the reader the necessary background information for a complete un-

derstanding of the technical discussions in the following chapters. Fundamentals of

Reconfigurable Computing and the mainly used hardware technologies are presented.

2.1. Embedded System Design

Traditional embedded systems, those existing until middle of nineties, could be specified

as a system comprising a microcontroller, memory, analog devices and some I/O signals.

Nowadays, due to technology advances and also with the increasingly requirements from

applications supported by contemporary embedded systems, they became more complex,

which have a direct impact on the design of such systems.

Unlike a general-purpose computer, such as a Personal Computer (PC), the design of

an embedded system is considerably more complex. It is an interdisciplinary activity,

involving many research areas. It goes from abstract level modeling and simulation,

through software, hardware and platform design to the hardware and software synthesis

and testing [2].

Besides low cost and tight time to market, other constraints, like for instance the limited

amount of memory available, low power consumption requirement, etc, make the design

of such a system a challenge. Furthermore, the fact that a single “standard” execution

platform for an embedded system does not exists (differently, for instance, from the

situation of PC market) increases even more the degrees of freedom by searching complete

8CHAPTER 2. BACKGROUND

solution for an embedded system.

Contemporary embedded systems are further integrating more and more functionalities

and requiring, therefore, higher computation performance. With technological advances

most of system components can be incorporated into a single silicon die leading to a so

called System-on-Chip (SoC) [3]. One SoC may contain one or more CPUs, memory, pe-

ripherals and dedicated hardware components (e.g., coprocessors) specifically developed

for a target application. A SoC offers more advantages and benefits to system designers

such as for instance, higher performance, lower power consumption, and higher reliabil-

ity, if compared with the case where a system is build by assembling various chips and

components on a circuit board.

Another trend in the development of execution platforms for embedded system is towards

multiprocessor architectures. Due to constraints in power consumption and increasing

demand in performance, some solutions in integrating multiple processors in a single

die are being currently investigated. Examples are Multi-Processors System-on-Chip

(MPSoC) and Network-on-Chip (NOC) systems [3].

Even for normal PCs this trend is observed (for instance, the Dual Core processor from

Intel [20]), or even in graphic cards used inside such architectures. Recently, NVIDIA

company released the G8 graphic chips where multithreading is used as based environ-

ment for the application design (for more details see [21] and [22]).

Another example is the Cell Multiprocessor [23], which was developed to attend the

demands of the game/multimedia industry. The Cell multiprocessor combines in a sin-

gle die one 64bit based microprocessor with eight cooperative processors, all connected

together through a high bandwidth on-chip coherent bus.

Hardware/Software Codesign

A typical SoC architecture is based on hardware and software components, and the

decision of which parts will be developed in hardware and which in software is one of

the most important parts by designing an embedded system execution platform [24].

Hardware/Software codesign defines some methodologies and strategies for designing

heterogeneous (hardware and software) systems. The goal is to find an efficient solution

for a system which meets the specified requirements and constraints. In this process, it

is important to have a unified model environment, which allows the co-simulation and

co-verification of hardware and software parts together. A considerable amount of work

has been done in this area: [25], [26], [27] and [28].

One key part of the hardware-software codesing, shown in Figure 2.1, is the hardware-

software partition and the correspondent interface synthesis. The partition phase is not

a trivial task. Even though there are some tools which do this in an automatic way, this

activity is done also manually [29]. Nowadays, the partition of the system in hardware

and software is influenced mainly by the designer experience and also by the availability

2.2. RECONFIGURABLE COMPUTING OVERVIEW 9

of previously designed and used architectures (also known as architecture templates).

Moreover, due to the market pressure, there is a need to deliver a product as soon as

possible, which also decreases the development costs. This short time-to-market also

emphasizes the reusability of hardware and software components in the design of new

embedded system architectures.

Software

HW/SW

Partitioning

HW/SW

Interface Hardware

System

Model Architecture

Templates

Figure 2.1.: HW/SW partitioning scheme.

Another important aspect of hardware/software codesign is the generation of an interface

between hardware and software components. This phase is commonly left to the end part

of the design flow, mostly due to the lack of properly abstraction of HW/SW interfaces,

which makes difficult the dialog between software and hardware design teams. This

problem is highlighted even more in new architectures that are becoming multiple and

heterogeneous multiprocessing. Therefore, a hardware/software interface codesing [30]

is necessary in order to tackle the complexity of designing SoC architectures in the near

feature.

2.2. Reconfigurable Computing Overview

As previously mentioned, the required application performance is achieved by implement-

ing the computational intensive functionalities in a dedicated device, usually referenced

as a hardware accelerator. If, for instance, a hard-wired technology is used then an

Application Specific Integrated Circuit (ASIC) is designed and fabricated specially for

the purpose of this specific application.

Although an ASIC could perform very well when executing the exact computation for

which it was designed, it does not provide flexibility. In order to change the computation

performed inside an ASIC, re-design and re-fabrication are necessary. A microprocessor,

in an opposite way, offers great flexibility, since its functionalities are determined by the

software instructions stored in the memory. In terms of efficiency, however, a micropro-

cessor is by far inferior to an ASIC, since the computation is executed sequentially.

Reconfigurable systems may fill the gap between application-specific platforms based on

custom hardware functions, and software programmable systems based on traditional

10 CHAPTER 2. BACKGROUND

microprocessors [31]. The resulting system is one which can provide higher performance

by implementing custom hardware functions in reconfigurable units, and still be flex-

ible by reprogramming the hardware and/or a microprocessor (hybrid architectures).

Figure 2.2 shows the trade off between performance and flexibility related to software,

hardware and reconfigurable computing.

Flexibility

Performance

Dedicated HW

Instruction set

Processor

Reconfigurable

Hardware

Reconfigurable

HW/SW

Figure 2.2.: Performance versus Flexibility tradeoff.

A reconfigurable hardware (RH) is very attractive for designing execution platforms

for modern embedded systems. One motivation is the reduction of the Non-Recurring

Engineering (NRE) costs, since the same hardware architecture may be used for more

than one product. Even the same product can be provided by a company with sev-

eral variations by reprogramming the hardware and software. Furthermore, the post-

manufacturing programmability allows upgrades or corrections of problems when the

product has been already finished.

Execution platforms based on RH may also be dynamically reconfigured, allowing the

dynamic adaptation of the system to the run-time environments [32]. Flexibility can be

added to the architecture by dynamically allocating different and dedicated processing

operators within the reconfigurable device. It also allows a configuration larger than

the available RH to be used (virtual hardware concept). This improves the resource

utilization by sharing the hardware between various applications.

The Figure 2.3 illustrates an example scenario where computations A, B, C and D are

accelerated by the RH [33]. After finishing the execution of A and B, the computation D

can replace them in the RH. Depending on the hardware support, computation C may

be kept running while the reconfiguration happens. Such platforms require, however,

more attention on the management and control.

There are two main challenges by following this approach. First, the kernel candi-

dates to be implemented in hardware need to be mutual exclusive in their execution

(no concurrency). This is also true for hardware devices that comport more than one

kernel. However, in this cases the mutual exclusivity need to be assured among the

kernel sets. To find a solution for these problems, techniques known as temporal and

2.2. RECONFIGURABLE COMPUTING OVERVIEW 11

WWWWWWWWWWWWWWWWW

WWWWWWWWWW

WWWWWWWWWWWWWWWWW

WWWWWWWWWWWWWWWWWWWWWWWWW

WWWWWWWWWWWWWWWW

WWWWWWWWWWWW

WWW

WWWWWWWWWWWWWWWWWWWWWWWWWWWW

WWWWWWWWWWWWWWWWWWWWW

WWWWWWWWWWWWWWWWWW

WWWWWWWWW

WWWWWWWWWWWWWWWW

WWWWWWWWWWW

WWW

WWWWWWWWWWWWW

WWWWWWWWWWWWWWWWWWWWWWW

WWWWWWWWWW

WWWWWWWWWWWWWWWWW

WWWWWWWWWWWWWW

WWWWWWWWWWWWWWWWWWWWWW

WWWWWWWWWWW

WWWWWWWWWWWWWWWWWW

WWWWWWWWWWWWWWWWWWW

WWWWWWWWWWWWW

WWWWWWWW

WWW

WWWWWWWWWWWW

WWWWW

WWWWWWWWWWWWWWWWWWWWWWWWW

WWWWWWWWWWW

WWWWWWWWWWWWWWW

WWWWWWWWWWWWWWWWWWW

WWWWWWWWWWWW

WWWWWW

WWWWWWWWWWWWW

WWWWWWWWWWWWWWWW

WWWWWWWWWWWW

WWW

WWWWWWW

WWWWWWWWWWWWWWWWWWWWW

WWWWWWWWWWWWWWWWWWWWWWWWWW

WWWWWWWWWWWWWWWW

Software

application

Hardware kernel

implementations

(a)

CPU

Reconfigurable

hardware

Memory system

(b)

CPU

Reconfigurable

hardware

Memory system

(c)

Figure 2.3.: Example of dynamic reconfiguration usage [33].

spatial partitioning [34,35] are used. The second challenge is to provide a well designed

infrastructure, which needs to provide efficient methods to, e.g., swap in and out the

configurations on the RH, keep track of data exchanged between configurations, schedule

the configurations, etc.

2.2.1. Coupling CPU and Reconfigurable Hardware

A reconfigurable architecture appears normally in form of a CPU connected with a RH.

The coupling type has a strong influence in the communication costs between CPU and

RH. Furthermore, certain connection types require that the CPU is specially designed

to be connected with the RH (non standardization).

Figure 2.4 shows four different types of connections. In a really tightly coupled archi-

tecture (Figure 2.4b) the RH is deeply integrated with the CPU, so that it belongs to

the processor data-path. The RH may also be connected with the CPU in the form of

a classical coprocessor as presented in Figure 2.4c. In a middle connected architecture,

the RH is plugged to the CPU through a local bus, as in Figure 2.4d. This can be

seen also as an extended version of a coprocessor, since the RH may be accessed by the

CPU using memory mapped technique. However, such architecture may also be used

as an execution platform for a multiprocessing system. In loosely coupled architectures

(Figure 2.4a) the RH is attached to the CPU through an external bus (like PCI) which

is also an usual solution for a system architecture.

2.2.2. Reconfigurable System Design

Event though the usage of reconfigurable hardware in embedded systems is increasing

[33], there is no standard method to design such system yet. Instead, there are proposals

which are based on the extension of the traditional hardware/software codesign methods

12 CHAPTER 2. BACKGROUND

CPU Main

Memory

Reconfigurable

Hardware

Local Bus

PCI Bus

(a) Loosely coupled.

CPU Main

Memory

Reconfigurable

Hardware

Local Bus

(b) RH integrated into the processor data-

path.

CPU Main

Memory

Reconfigurable

Hardware

Local Bus

CPU Main

Memory

Reconfigurable

Hardware

Local Bus

(d) RH attached at the local bus.

Figure 2.4.: CPU and RH: Coupling types.

[36,37], and their differences concentrate on the hardware-software partition phase.

However, the design of such system is even more complex than the traditional one due

to the necessity to evaluate the temporal and spatial utilization of the reconfigurable

hardware (which is considered as a shared resource), configuration overhead, communi-

cation infrastructure, etc. Moreover, the new approaches rely on the usage of modeling

languages, which allows the consideration of the reconfigurability aspects since the be-

ginning of the design phase.

Figure 2.5 gives an idea about the differences at execution platform between SoC and

RSoC. In Figure 2.5a HW accelerators are static components and in Figure 2.5b they

are placed in the reconfigurable fabric according to its usage.

2.3. RECONFIGURABLE HARDWARE TECHNOLOGY 13

)b()a(

functions

MEM HW

Accelerator

functions

CPU DMA

Accelerator

functionality

functions

CPU DMA

Accelerator

functionality

Reconfigurable

fabric

MEM

(a)

)b()a(

functions

MEM HW

Accelerator

functions

CPU DMA

Accelerator

functionality

functions

CPU DMA

Accelerator

functionality

Reconfigurable

fabric

MEM

(b)

Figure 2.5.: Typical SoC architecture (a) and a RSoC architecture (b) [38].

2.3. Reconfigurable Hardware Technology

An execution platform for a reconfigurable computing application is based typically on

a Field Programmable Gate Array (FPGA), which is a two-dimensional grid of config-

urable logic cells, called Configuration Logic Blocks (CLBs). These blocks are embed-

ded in a general routing structure (also configurable) which allows their interconnections

(inputs and outputs of each CLB). This architecture, shown in Figure 2.6a, supports a

construction of a relatively arbitrary interconnection scheme between the logic blocks in

the system.

CLB

CLBCLBCLB

CLB

CLB CLBCLB

CLBCLB

IO area

Logic Blocks

Interconnect

Resources

(a) Basic FPGA architecture.

4-input

LUT

flip-flop

mux y

(b) Simplified Configurable Logic Block

(CLB) [39].

Figure 2.6.: FPGA comprises a set of CLBs.

14 CHAPTER 2. BACKGROUND

Each CLB can implement a distinct and limited logic function. A very simplified struc-

ture of a CLB is shown in Figure 2.6b which contains a 4-input Look-Up Table (LUT)1,

a Multiplexer (MUX) and a storage element (flip-flop). A real CLB available in the

modern FPGAs usually comprises more than on of such logic blocks and more signals

for controlling, allowing the implementation of complex logic circuits.

FPGAs currently represent the most popular and mature segment of RH technologies.

Of special interest are the SRAM-based FPGAs, which are the state-of-the-art in FPGA

technology. The configuration (CLBs and interconnect resources) are stored internally

on a static RAM. Changing the content of the configuration RAM will also change the

resulting circuit running on the FPGA.

2.3.1. Hybrid Architecture

Meanwhile there are FPGA fabrics which incorporate a processor core. Figure 2.7 shows

the structure of a Virtex-II Pro FPGA used in this work. This FPGA (as well as Vir-

tex4, from the same company) is equipped with up to four hardcores PPC405 processors,

signifying also that RH and CPU are located in the same device. With these devices,

it is possible to design sophisticated architectures, where even more than one RH com-

ponent may be connected with the CPU. This provides more flexibility when designing

Reconfigurable SoC (RSoC) execution platforms.

CLB

Multipliers and

Block SelectRAM

Processor Block

Configurable

Logic

SelectIO-Ultra

DCM

RocketIO or RocketIO X

Multi-Gigabit Transceiver

CLB

Figure 2.7.: Viertex-II Pro architecture overview [40].

If a processor core is needed and not available in a FPGA, like the Virtex and Spartan3

FPGA families, one can use a softcore processor. An example available from Xilinx

1A LUT implements a combinatorial logic by storing a function truth table.

2.3. RECONFIGURABLE HARDWARE TECHNOLOGY 15

company is the so called MicroBlaze, which is a softcore processor based on RISC archi-

tecture. This processor even allows the integration of accelerators, designed inside the

FPGA, to the processor data-path [41].

In order to help the design of embedded systems based on such hybrid devices, a design

tool called EDK (Embedded Development Kit) is made available by the FPGA Company.

The architecture advised and supported by this tool, when using the hardcore PPC405, is

shown in Figure 2.8. The processor block is connected to the other components through

the CoreConnectTMbus architecture, composed by two main buses: Processor Local

Bus (PLB) and On-chip Peripheral Bus (OPB). Slow peripherals should be attached

to the OPB bus in order to offload the PLB bus, which provides low latency access to

peripherals requiring high performance.

System

Core

System

Core

System

Core

Processor

Block

Peripheral

Core

Peripheral

Core

Processor Local Bus On-Chip Peripheral Bus

Bus

Bridge

C oreC onnect B us Architecture

Arbiter

Instruction Data

Figure 2.8.: CoreConnect block diagram [40].

Other FPGA manufactures provide similar solutions in their products as well. For in-

stance, Altera offers the Stratix II2and Cyclone II3devices, which allow the usage

of NIOS softcore processors. Also ATMEL have an architecture for dynamically re-

configurable SoC, called FPSLIC (AVR processor and FPGA) [42]. Nonetheless, the

background information given in this work is concentrated on FPGAs manufactured

by Xilinx Company. This device was chosen for implementation purpose, since not all

FPGAs support a specific feature called dynamic partial reconfiguration (discussed in

Section 2.3.3).

2.3.2. Configuration Techniques

A FPGA configuration is performed by downloading a bitstream4through a specific

configuration port. This can be done offchip assisted by an extra component connected

to the FPGA configuration port (for details please refer to [40]). Alternatively, the

configuration data can be downloaded by an internal FPGA entity (not available in all

FPGAs), called ICAP (Internal Configuration Access Port), which is mainly used for

partial reconfiguration approaches (Section 2.3.3).

2Available at: http://www.altera.com/products/devices/stratix2/st2-index.jsp

3Available at: http://www.altera.com/products/devices/cyclone2/cy2-index.jsp

4A bitstream is a configuration data used to program the FPGA in a serial manner.

16 CHAPTER 2. BACKGROUND

Furthermore, through the same port the present configuration on the FPGA can be

read. This technique, called Readback operation, is used to verify the configuration

on the FPGA and it is also used as a main technique to provide basic support for

multitasking on FPGA (more details about this subject is given in Section 3.2.7.

2.3.3. Partial Reconfiguration Feature

Some FPGAs support partial reconfiguration. In these fabrics, part of the device can

be reconfigured whilst the remaining part keeps its execution normally. This feature

allows the implementation of multitasking systems based on reconfigurable devices. The

FPGA is partially configured by downloading the related partial bitstream on the FPGA

through the same port used for the case of a complete configuration. If the system wants

to configure itself, the internal ICAP entity must be used.

Additionally, partial reconfiguration is a device dependent feature. Xilinx FPGAs are

the few ones on the market with support to partial reconfiguration. Xilinx FPGAs with

this feature goes from Virtex-4 devices to Spartan-3/E family. However, usually on these

devices the partial reconfiguration is possible in a column-wise manner [43]. In these

cases the device is divided into a number of columns. Each column spans vertically the

chip (1D partition model) and it can be independently reconfigured without affecting the

other columns (Figure 2.9a). Meanwhile, the release of new guidelines and design flows,

like Early Access Partial Reconfiguration [44,45] and other techniques from Xilinx shall

allow a more flexible rectangle sizing (2D partition model) to be partial reconfigured on

a device (Figure 2.9b).

An architecture designed to support partial reconfiguration is similar to the one presented

in Figure 2.9. Usually, the FPGA device is divided in different parts. A static one

related to the management of configuration data and the input/output data, and one

or more parts which are reconfigured on-the-fly. The connections between the static

and dynamic parts of the FPGA must be implemented by specific interconnect device

resources. A known technique used is the so called Busmacros [43]. Therefore, all

circuits implemented inside the dynamically reconfigurable parts of the FPGA must use

the Busmacros wires to connect with other entities in the system.

The technical realization of such architectures are difficult due to the poor tool support

and strong restrictiveness of the design guidelines [43]. Therefore, the partial recon-

figuration feature is mostly used in academic research works. However, by recently

improvement in the design support (like the [44] and [45] guidelines) and recently in-

troduced design tools, which explicitly support a modular design approach (PlanAhead

[46]), this feature is going to be used commercially as well. One evidence for that is the

recently usage of partially reconfigurable FPGA based platforms for Software Defined

Radio (SDR) SoC modems [47].

2.4. CHAPTER CONCLUSIONS 17

Module

AModule

BModule

Static

Part

Configuration

Port

Bitstream A

Bitstream B

Bitstream C

(a)

Module

Static

Part

Configuration

Port

Bitstream A

Bitstream B

Bitstream C

(b)

Figure 2.9.: Example of device partition for 1D (a) and 2D (b) partition models.

2.4. Chapter Conclusions

This chapter highlights the main features of reconfigurable computing and how it can

contribute in improving the flexibility and performance for modern embedded system

execution platforms. Additionally, the FPGA device has been introduced as being the

mainstream technology in reconfigurable computing systems, keeping the focus on those

FPGAs with support for partial reconfiguration. The investigations carried out in the

scope of this work rely on this specific feature, which is still being improved by the

FPGA vendors. Evidence for that are the availability of new tools and design flow

guidelines (introduced in Section 2.3.3), which are intend to decrease the current difficult

practicability of partial reconfigurable based platforms.

18 CHAPTER 2. BACKGROUND

CHAPTER 3

Related Work Survey

This chapter starts giving a survey on reconfigurable operating systems conceived for

software-only based architectures, concentrating on those ones envisioned for embedded

systems domain. The intention thereby, is to figure out what are the trends and results

already achieved in this research field.

In the sequence, important and relevant results related to operating system support for

reconfigurable computing architectures, operating systems running on this architectures,

and some new further approaches are presented. Along that, different aspects related

to reconfigurable computing and the support required by an OS are covered, which con-

centrate on the interaction between CPU and reconfigurable hardware. Such a support

may represent a simple driver as well as more sophisticated services that are required

for managing the reconfigurable hardware in an abstract manner.

3.1. (Re)Configurable Operating Systems

An operating system with the ability to be reconfigured is not new and most OSs support

it in one form or in another. The motivation for reconfiguration of an operating system

is that by configuring it one can achieve customization of the OS to the application

requirements, allowing system upgrade, etc. Moreover, dynamic reconfiguration can

allow run-time adaptation of the OS to the current application requirements, and so

achieving a better execution environment for this application.

Such operating systems can be in a first level classified into two categories, static and

20 CHAPTER 3. RELATED WORK SURVEY

dynamic. Beside that, reconfigurable OS can be classified in respect to the initiator of

the reconfiguration [48], which could be triggered by:

Human This is often the case when the designer decides which components of an OS is

going to be used for the target application and this feature is supported by most

operating systems. Dynamic reconfiguration is also usual for those systems, even

at boot time (by passing some parameters to the kernel) or at run-time by loading

or unloading modules (e.g., Linux OS).

Application A reconfiguration that is triggered by an application appears only in dy-

namic case and this has received strong attention in the OS research community.

Several proposals therefore have been proposed and in the following subsection

some representative OSs of this category will be presented and discussed.

Operating System Also called Automatic or Self-Adaptation, OSs belonging to this cat-

egory are those which the reconfiguration is initiated by the OS itself. Portability

of the OS to different platforms at compile time may be seen as a feature of this

kind of OS which does not represent a significant innovation, since most of the

OSs provide this feature. The interesting approach is in dynamic scenarios, which

require more intelligence from the OS.

Most of the investigations done for dynamic reconfigurable OS are concentrated on the

cases where reconfiguration is triggered by the application, and in the majority, those

approaches accept the idea that applications know better than the OS what their needs

and requirements are. Therefore, usually the application makes requests to the OS to

change its policies on its behalf. A self-reconfiguring OS, on the other hand, needs to have

enough intelligence in order to gather sufficient and appropriated information from the

running applications. Such information is needed to make a decision regarding what and

when to reconfigure itself, in such a way to provide an optimal execution environment

for each of the applications.

Following, some OSs representative of the categories cited above are listed and briefly

discussed focusing on those ones intended for embedded system application domain.

3.1.1. Statically Reconfigurable OS

The eCos (Embedded Cygnus Operating System) [49,50] is an open-source embeddedeCos

real-time configurable operating system currently supported by the RedHat company

(since 1999). The eCos OS can be customized at source code level, thus offline recon-

figuration. To help the designer by the reconfiguration phase a tool is made available

to select the components which will compose the kernel source and compile them to

generate the final OS footprint.

DREAMS1[51,52,53,54] is a library-based OS for embedded systems, which wasDREAMS

1DREAMS: Distributed Real-Time Extensible Application Management System.

3.1. (RE)CONFIGURABLE OPERATING SYSTEMS 21

primarily designed for offline configuration to allow customization of the OS to the

target application. Currently, DREAMS also provides some dynamic reconfiguration

which will be discussed in Section 3.1.4. DREAMS have similarities with the eCos OS

in the sense that the operating system code is composed of components selected by the

designer through a tool and finally compiled. The complete system has been designed

following the object-oriented paradigm (using C++ for coding). DREAMS differs from

eCos in the way that OS services are built as a run-time library, instead of a kernel (as

in eCos). This is achieved by allowing the customization of the objects and selecting

those objects that are really needed by the application. The customization features are

based on pre-processor techniques and are incorporated into the Skeleton Customization

Language (SCL). To help the designer in this process, a tool called TEReCS [55,56] is

provided which further supports the design of a communication system for distributed

embedded applications.

Choices [57] is a customizable object-oriented operating systems developed at the Uni- Choices

versity of Illinois and started in 1987. It is one of the first operating systems which

configure the application and OS specific functionalities together. Choices allows a cus-

tomization of the operating system by specializing classes in the various hierarchies, and

by instantiating a specific set of objects. The application interface is a collection of

kernel objects exported through the application/kernel protection layer.

3.1.2. Dynamically Reconfigurable OS: Application Triggered

EXOKERNEL [58,59] project is developed at the Massachusetts Institute of Technology EXOKERNEL

and basically consists of a driver for the processor which supports context management,

interrupt handling and functionalities for status information and functionality change.

It follows the microkernel2concept where all other operating system extensions can be

loaded as pre-compiled code into the kernel during run-time by dynamic linking (in a

form of a library) at user level.

Kea [61,62] is a kernel that has been designed for application extensibility and dynamic Kea

reconfiguration. The central unit of reconfiguration and extensibility is a service (en-

capsulated as an object). Kea provides means through which kernel services can be

reconfigured, by using the notion of portals. A portal is represented as a proxy in a

client’s domain (application) and maps onto a system data structure, which links the

proxy to a service method entry point in the kernel. In this way, portals act like agents

of indirection for a service. By altering the destination of a portal, calls to a OS service

can be transparently remapped to replacement services. Furthermore, by making this

process available during application execution time, it is possible to implement extensible

services.

2Microkernel is a small operating system core which provides only the basic functionalities (e.g., schedul-

ing, message passing and timing), and all other functionalities are provided by means of modular

extensions [60].

22 CHAPTER 3. RELATED WORK SURVEY

MMLite [63] is an object-based, modular system architecture that provides a menu ofMMLite

components for use at compile-time, link-time, or run-time. The main goal of MMLite

is to allow a system to be dynamically assembled into a full application system. The

dynamic configuration is achieved through a unique mechanism called mutation which

allows a transparent replacement of components whilst they are in use.

Mutation is the act of automically changing an ordinarily constant part of an object,

e.g. a method implementation. Each mutation is performed by a thread which is called

mutator. A mutator must translate the state of the object from the representation ex-

pected by the old implementation to the one expected by the new implementation. It

must also coordinate with worker threads and other mutators through suitable synchro-

nization mechanisms. Transition functions capture the translations that are applied to

the object state and to the worker threads execution state. In order to limit the amount

of metadata, execution transitions only happen between corresponding clean points in

the old and new implementations.

Pebble [64,65] is a component based operating system designed to support also component-Pebble

based applications for embedded systems domain. Pebble is based on three main con-

cepts, which are (1) a minimal privileged mode nucleus, (2) a set of replaceable system

services and application components running in distinct protection domain and (3) a

code generator specialized for each possible cross-domain transfer.

As an operating system it adopts a microkernel architecture with a minimal privileged

mode nucleus that is only responsible for switching between protection domains. The

programming model is client/server where client components (applications) request ser-

vices from system components (servers). Examples of system components are the inter-

rupt dispatcher, scheduler, portal manager, device driver, file system, virtual memory,

and so on. In Pebble, it is possible to dynamically load and to replace servers to fulfill

applications requirements.

SPIN [66,67] is an operating system designed to allow applications to dynamically spe-SPIN

cialize the kernel in order to achieve a particular level of performance and functionality.

A specialization can add new kernel services (by linking new code), as well as replace

default policies or migrate applications function to kernel address space. In SPIN a

specialization is called an extension and its behavior is defined through the execution

model.

K42 [68,69,70] is an open-source and scalable operating system kernel under develop-K42

ment in IBM research center. It uses the object-oriented paradigm. Even though sup-

porting extensibility, K42 is strong focusing on online upgrading of existing components

(objects). For this purpose, well defined mechanisms and methods for reconfiguration of

these objects have been defined [70]. One of those is the a hot-swaping mechanism based

upon C++ virtual function tables. This includes the definition of safety points for an

object in order to assure a safely upgrade of the object. The upgrade is even assisted by

a dynamically created object which is interposed between the caller of the object and

the object itself.

3.1. (RE)CONFIGURABLE OPERATING SYSTEMS 23

Think [71] is more a component-based framework which is used to build an operating THINK

system than an operating system itself. It provides the minimal abstraction of the

underlying hardware and services in order to allow the designer to create extra operating

system services following the component-based paradigm. One of the main goals of

the THINK project is to make an operating system as flexible as possible in order to

build dedicated and fully configurable operating system. Last investigations on THINK

operating system [72] incorporate dynamic reconfiguration as an extra feature. Here

again there is no definition about the initiator of the reconfiguration, either application

or operating system is able to start a configuration. However, the last one is possible only

with extra activities like self-monitoring which is not available in the THINK operating

system.

3.1.3. Dynamically Reconfigurable OS: System Triggered

Synthetix [73,74,75] operating system, and its ancestor Synthesis [76] enhances the Synthetix

application performance by providing specialized implementations of operating system

services which are generated on-the-fly. This is done through partial evaluation of ap-

plication code and its actual input data in order to recompile at run-time condition

statements part of the code. The entire procedure is executed transparently to the ap-

plication. A commercial operating system [75] has used the techniques investigated in

Synthetix. Synthetix only makes extensions at the top layers of the operating system

services, since the services existing in the low layer are required to support a system

reconfiguration. Moreover, it does not provide any means by which applications may

control their own resources, because they are entirely controlled by the operating system.

VINO is an object-oriented operating system [77,78] and it is one of the few oper- VINO

ating systems which provides self-x capabilities, which comprises self-monitoring and

self-adaptation. By using a specific framework, the operating system is able to gather

information of the application in order to decide which extensions or modifications should

be made in the operating system itself. This automatic adaptation has not been im-

plemented in VINO. However, it has been investigated and concepts and methods have

been proposed.

The information used to automatically adapt the system is provided by static and dy-

namic sources:

•The compiler generate profiles from the system as output, which is used to build

a database.

•This database is augmented with online information gathered by periodically re-

trieving statistics maintained by each VINO module.

•Additionally, the system collects traces and logs from incoming requests and pro-

duced results.

24 CHAPTER 3. RELATED WORK SURVEY

All dynamic information is captured on-the-fly under regular system execution and used

for analysis purposes. An adaptation is performed when the system detects that the

resources required by an application is rising bigger than specified previously. Further-

more, the system shall be capable to detect some pattern activities which will guide

the system in the decision of the most appropriated service that can satisfy the current

application needs.

3.1.4. Towards Online Reconfigurable DREAMS

DREAMS operating system is being further designed towards online reconfiguration [79]

and going towards microkernel architecture (which supports properly the modularization

of the system). Currently, DREAMS provide means for the designer to give information

from the application (e.g., resource required, time interval for resource usage, etc.) in

a form of a profile. For each application more than one profile may be specified. This

allows the specification of different QoS for each application depending on the resources

required. At run-time, the system is able to choose the best combination of application

profiles depending on current applications status and an overall system quality measure-

ment. This mechanism is carried out by a Framework Resource Manager (FRM) [80]

designed for this purpose.

3.1.5. Further Comments

Other well known OSs, like for instance, QNX [81], VxWorks [82], Linux [83], and

uCLinux [84] (a lightweight version of Linux designed for processors without MMU),

also present offline and online configurability. On those OSs, for instance, device drivers

are able to be linked to the system at run-time. However, in these cases a configuration is

decided and started by the system designer. There is no direct support from the system

to help the application to decide when, why or which OS service should be loaded or

unloaded.

Initiator

Human Application Operating System

Time

Static

DREAMS(1st),

Choices, eCos,

et.al.

(not applied) —

Dynamic

uCLinux, OSE,

VxWorks, et.al.

EXOKERNEL,

Kea, MMLite,

Pebble, SPIN,

K42, THINK,

et.al.

Synthetix, VINO,

DREAMS(2nd)

Table 3.1.: OS classification according to the reconfiguration Time and Initiator.

3.2. OPERATING SYSTEM FOR RECONFIGURABLE COMPUTING 25

The OSs listed above are summarized, in Table 3.1, according to the initiator of the

reconfiguration activity and to the reconfiguration time. This list is only a sample of

the numerous OSs found in the literature. Further details as well as further references

can be found in the reviews given, for instance, in [85,48,86]. Nevertheless, the number

of operating systems with self-x properties is not big at all. The main reason for this

fact is the assumption made by the most designers, that the application have the right

knowledge of its own needs, and therefore it is capable to decide what to change and when

this change takes place. However, a system supporting different applications requires

that each of them would have to be aware of the concurrently applications present on

the system. In this scenario, the OS is the most appropriated entity in the system which

can manage the resources being shared by the applications. Furthermore, this is the

duty of an OS by its basic definition.

3.2. Operating System for Reconfigurable Computing

As it has been shown in previous chapter, Reconfigurable System-on-Chips are inter-

esting for contemporary embedded system platforms. An increasing number of RTOSs

developers provide support for such architectures. Well known commercial RTOSs (e.g.,

VxWorks, QNX, RTAI, eCos, etc) have being ported to the PowerPC-405 CPU avail-

able in FPGAs like Virtex-II Pro and Virtex-4 FX3. The majority of these commercial

embedded systems do not support dynamic reconfiguration (particularly partial recon-

figuration), which have its main attention in the academy research. One reason for this

fact is that the configuration overhead is sometimes prohibiting by using the currently

technology.

However, as this technology becomes mature, dynamic reconfiguration is starting to be

used also commercially. One evidence for that is the recent usage of partially recon-

figurable platform FPGAs for Software Defined Radio (SDR) modems SoC [47]. Fur-

thermore, in the field of High Performance Computing (HPC), RC is starting to be

incorporated in commercial systems. Enterprises like Cray, SGI and SRC Computers

Inc. are examples of this trend. The Cray Inc. already sells Cray XD1 supercomputers

including dual core AMD processors and Virtex-II Pro FPGAs.

An optimal scenario for which an operating system for RC can provide support is the one

shown in Figure 3.1. In this case, a task pool comprising hardware tasks, software tasks

and hybrid tasks are allocated by the operating system over the hybrid architecture.

The programmer should not be aware about the location in which each task is going to

be executed. Within this section, the research efforts toward the achievement of such an

environment are presented.

In the next sections, the aggregation of services in order to support a RH in some

available OSs (as well as novel ones) will be presented and discussed. It will begin,

3An updated list of OS ported for platforms based on Xilinx FPGA can be seen in www.xilinx.com.

26 CHAPTER 3. RELATED WORK SURVEY

OS for RC

RH CPU

ABSTRACTION LAYER

TASKS

HW/SW

TASKS

Figure 3.1.: Execution environment envisioned by the Reconfigurable Computing re-

search community.

in the first subsection, focusing on low level support. Then, in the sequence, it will

continue with proposals for more high level services that allow a proper abstraction of

the underlying heterogeneous execution platform.

3.2.1. Low-level OS Support for Reconfigurable Hardware

Within this subsection, approaches dealing with the abstraction of RH resources and

properly support by the OS in exchanging data between CPU and RH coprocessor are

presented.

FPGA Reconfigurable Resources Abstraction

In the work presented in [84], dynamic reconfiguration capabilities are smoothly inte-

grated into the embedded Linux version, uCLinux. In this system, the user is able to

drive a reconfiguration by using normal Unix-like shell commands. Additionally, the

configuration memory of the reconfigurable device is made accessible through this inter-

face.

In the related platform, the uCLinux is running on a soft-core CPU (MicroBlaze). A

reconfiguration is possible due to the usage of an internal entity present in the Virtex-II

and Virtex-4 FPGA device series, called Internal Configuration Access Port (ICAP). By

using a driver that wraps the ICAP entity, the OS provides to the user the capability

to read/write the configuration memory of the FPGA in an abstract manner shielding,

thus, the complexity of the underlying executing platform.

The execution platform used in this approach can be seen in Figure 3.2. The whole

platform is implemented using a single device. Because the reconfiguration triggered by

3.2. OPERATING SYSTEM FOR RECONFIGURABLE COMPUTING 27

the system is able to change the functionality of the device in which the system itself is

running, it can be categorized as a self-reconfiguration system.

Figure 3.2.: Self-programming hardware architecture [87].

A similar work is also provided by [88], [87] and [89]. In these approaches the internal

resources of the FPGA (e.g., LUTs, BRAMs, etc) are made accessible to the programmer

through the Linux File System. Thus, by simply making use of standard Linux command

shells, one can read, write or modify the FPGA configuration in a small grain manner.

Furthermore, in [87] a toolkit called XPART (Xilinx Partial Reconfiguration Toolkit)

has been presented. This tool was built on top of the ICAP API (explained above) and

provides functionalities that able the user to relocate entire modules (partial bitstreams)

presented inside the FPGA.

Those works, however, do not directly focus on the technical difficulties related to the

relocation of modules inside a FPGA considering its heterogeneity (some resources, like

BRAM and multipliers are hardcore available in the FPGA). Furthermore, to shift a

module inside a FPGA without generating it by the design tools, the bitstream read

from the device (related to the module) need to be first manipulated. This requires

deeply technical knowledge of the underlying reconfigurable hardware device.

Hardware-Software Interface

Another important aspect of an architecture based on a CPU and a coprocessor is the

communication schemes between these two components. Moreover, if the operating sys-

tem does not provide a proper abstraction of the hardware accelerator, the programmer

needs to be aware of the details about the architecture, like for instance, memory point-

ers, sending/receiving data chunks, etc. Typically, for such platforms, the programmer

is responsible for mapping the physical address space of the external device (hardware

accelerator) into the virtual space of the software task, and copying the data to and

from this device. This makes the design less portable requiring details of the underlying

architecture.

28 CHAPTER 3. RELATED WORK SURVEY

In order to increase the abstraction, a virtualization layer is proposed in [90], which

shifts the responsibility of exchanging data, between coprocessor and CPU, from the

programmer to the operating system. The authors of this work propose the extension

of the Virtual Memory Management (VMU) of the operating system by adding a small

layer in software and in hardware (adding, therefore, an extra logic to the coprocessor).

Figure 3.4 illustrates the idea, which is based on the utilization of a Dual Port Memory

for its technical realization.

Processor

and Memory

VIRTUAL

DP RAM

INTERFACE MANAGEMENT UNIT

PHYSICAL

DP RAM

FPGA

Coprocessor

OS (Virtual Interface Manager)

Figure 3.3.: Virtualized interface [90].

The software layer called Virtual Interface Manager (VIM) extends the VMU and it is

responsible to translate the virtual address (from the software program to the coproces-

sor) to a physical one. This layer has the support of the Interface Management Unit

(IMU), similar to the Memory Management Unit (MMU), which is able to generate in-

terrupts to the OS when a data request from VIM is being processed. Again, the ideas

proposed by the authors were also implemented in a system using embedded Linux as

the target OS.

In their further research, presented in [91], the authors improve the IMU hardware

support (calling it Virtual Memory Window - VMW) in order to allow the usage of

virtual addresses also for the coprocessor. By this means, portability is also provided

for hardware designs. Moreover, these new OS add-ons may provide support towards a

hybrid multithreading system [92,93]. In such a scenario, either a software or a hardware

thread may initiate a communication. Thus, it differs from the traditional paradigm,

where RH is used only as a coprocessor.

Hardware HardwareHardware

Software and hardware thread library

System software support

Communication assistants

Software Software

Hardware

Software

Figure 3.4.: Software and hardware threads connected through a virtualization layer [93].

Although the focus of these proposals are not to decrease the overhead caused by the

communication costs between processor and coprocessor, same authors shown in [94] a

concept which may hide the communication latency by prefetching memory accesses of

software threads to the hardware accelerator.

3.2. OPERATING SYSTEM FOR RECONFIGURABLE COMPUTING 29

3.2.2. Application Model

The majority of the authors model the input problem using a directed graph, where the

nodes may represent either simple operations (e.g., ADD, MUL, etc) or more complex

functions (like FFT, IDT, etc). In each way the nodes abstractly represent tasks. The

edges connecting the nodes represent the task dependency and the correspondent weights

(if used) denote the data transferred or data transfer costs, between them.

A task graph model has its similarity when modeling the application for a software only

environment. A single task model used in hardware (hardware task), however, has more

degrees of freedom than a software task. Aside from usual parameters, like (worst case)

execution time, arrival time, deadline (for real-time scenarios), dependency, etc., a task

in hardware also occupy some area of the FPGA, which implies in dealing with additional

constraints such as shape and area size. Furthermore, in more accurate models, the time

needed to configure a hardware task in FPGA may also be considered.

3.2.3. OS Services for Reconfigurable Hardware

One of the first proposals towards the development of an operating system for FPGA

based reconfigurable computers was presented by Brebner in [95]. With the availability

of the XC6200 FPGA from Xilinx, which allows configuration and data registers to

be memory mapped, fast and partial reconfiguration were possible. Brebner highlights

that these flexible reconfigurable hardware, seen as an additional computational resource

(together with the CPU), need to be managed by the OS.

First introduced by [96], Brebner also presents the concept of hardware virtualization

in order to produce a run-time resource allocation environment, which hides from the

programmer the complexity and details of the underlying reconfigurable hardware.

Brebner proposes the division of the application into so called Swappable Logic Units

(SLU). These units are defined as logic circuits that implement some functionality and

are position-independent tasks. Thus, the duty of the operating system is to swap in

and out these units on the FPGA. To accomplish this job, the OS uses some techniques

already known from the OS theory. For instance, by placing a task (SLU for this case)

into the FPGA it needs to first find free place for the incoming SLU. If the OS does not

find enough space it uses the Least Recently Used (LRU) strategy to choose a SLU from

the ones placed on the FPGA and remove it.

A more embracing and deeply study about the services that an OS should provide for

a FPGA-based reconfigurable computer is presented in [97] and [98]. Although there is

not a clear consensus about the terminology definition when referring to each OS service

(as in software only case), it is possible to find close similarity for these definitions.

Following, the services that were most investigated in the referenced works are listed.

The service definition used here may slightly differ depending on the author.

30 CHAPTER 3. RELATED WORK SURVEY

Partitioning This is the activity of partitioning the given task graph into sub-graphs,

which will be placed later on the FPGA. Although this activity is usually done

offline, due to the related overhead, there are some approaches where this is ex-

ecuted at run-time. The partitioning of a task graph into sub-graphs takes into

consideration the physical partition of the reconfigurable hardware which may, in

its turn, be also dynamically partitioned (for certain cases);

Placement This OS service is responsible to manage the reconfigurable hardware area.

At the arrival of a new incoming task, it needs to allocate the necessary free

resources to place this task on the device. For this purpose, it uses some strategies

to decide where to allocate the task in order to, for instance, minimize the area

fragmentation. Furthermore, depending on the capabilities of the reconfigurable

hardware, some tasks may be swapped out from the device or even relocated (in

order to free some space). This would require, however, techniques to preempt

and resume a hardware task. Additionally, issues related to partitioning, like

fragmentation of the device area (leading to not full utilization of its area) is also

a concern;

Scheduler The decision about the order in which tasks must be executed is made by

this service. Depending on some approaches, this service operates in close relation

with the Placement service. In order to avoid a non feasible scheduling, in certain

cases it is necessary to relocate other running tasks in the reconfigurable hardware.

The policy used by the scheduler depends on the application area. For instance,

the overall execution time could be the parameter to be minimized. In other cases,

however, task deadlines need to be met (real-time systems). Moreover, depending

on the device capabilities, preemption may or may not be allowed;

Communication Each task needs to read/write some amount of data. It is the duty of

the OS to provide means for data exchange among tasks, as normal for a common

OS. This has to be available for tasks being executed in the reconfigurable hardware

and/or tasks running in software environment, as well as across hardware/software

boundary.

Loader The low level support, responsible to actually configure the hardware, is carried

out by this service. Here, the knowledge and support presented in the previous

chapter are integrated;

3.2.4. RTOS issues in High Level Design

RTOS issues may also be considered at high level design phase of a system. In the project

OVERSOC (see [99]) the application is modeled as a task graph and each task may be

considered to be a software, hardware or hybrid task (able to run on both environments).

By further modeling the architecture of a RSoC platform, this proposal provides a global

methodology for a design space exploration of such system. In this environment the allo-

3.2. OPERATING SYSTEM FOR RECONFIGURABLE COMPUTING 31

cation, placement and scheduling of the tasks over the platform are easily explored and

evaluated. By this means, designers are able to tailor the RTOS towards the resources

needed as well as estimate its performance. Furthermore, RTOS services, like hardware

task preemption, task migration, message passing, reconfigurations activities etc, are

smoothly integrated into the system model (see also [100]). Concerning system level

design of a RSoC, the proposal of [36] is more embracing. From system specification

to system implementation, the proposal is able to analyze and evaluate different system

partitioning (in hardware and software) and also evaluate the run-time behavior of the

system. In this work, however, the OS is not the focus of the system model.

In [101] a complete design flow for implementing an application (represented as a directed

acyclic graph) is proposed. The flow is divided into three main stages: Application,

Static and Dynamic stages. In the first stage, the application and the design constraints

are specified. The static stage is responsible to perform a cost estimation for each task,

which is used to decide the hardware/software partitioning and, finally, synthesize these

tasks. The final stage (Dynamic Stage) is responsible for, at run-time, scheduling the

previously synthesized tasks on the RH, supported by a dedicated microarchitecture.

3.2.5. Offline Approaches

In offline approaches, scheduling and placement algorithms are executed offline, imposing

therefore a much lower overhead. Usually, offline strategies may use sophisticated models

of tasks (e.g., [102,103]). The authors in [102] use a three dimensional (3D) model for

a hardware task: horizontal and vertical position on the FPGA area as two dimensions

and time as a third dimension. This problem approximates the 3D packing problem

and an optimal branch and bound algorithm is used to optimize the task placement

over the FPGA device respecting temporal precedence constraints. In their work, the

authors provide optimized solutions for this problem, providing as a result either a

minimal FPGA size to solve a given problem or the minimal execution time of this given

problem.

Some other approaches aim to minimize the reconfiguration overhead by allowing the

exploration of task scheduling observing the configuration overhead associated. In [104]

a framework to generate the scheduling of tasks that results on a minimized reconfig-

uration time is presented. It enables the exploration of different strategies in order to

achieve different optimization goals concerning task scheduling and configuration (i.e.,

minimization of configuration overhead).

For scenarios where hybrid architectures (comprising CPU and FPGA) are considered

and the system has not being yet partitioned, some authors prefer to use a procedure

where the partitioning, placement and scheduling are executed together. This is the case

of, e.g., [105]. In this work the application is specified as a task graph, where the edges

inform the dependency between tasks. Additionally, for each task it is known its resource

used if placed either in hardware or in software. A modified version of the Kernighan-

32 CHAPTER 3. RELATED WORK SURVEY

Lin/Fiduccia-Matheyes (KLFM) is used to solve the problem. The solution provides a

near optimal minimal execution time for the given problem taken into consideration the

heterogeneity of the FPGA device.

For a similar scenario, the work presented in [106] proposes the usage of Genetic Al-

gorithms (GA) for modeling and solving the problem. In this approach CPUs, FPGAs

and Buses (used to build the execution platform) are modeled and included into the

problem. The schedule and mapping of the task graph on the underlying architecture is

obtained by solving the GA algorithm.

In [107] a task set with asynchronous arrival and no dependencies are scheduled and

placed on a FPGA device. In order to ease the placement, task shapes are manipulated

by means of footprint transformation. For this case, tasks are assumed to be coarse grain.

Complete IP (Intellectual Property) are examples for these tasks: FFT (Fast Fourier

Transformation), Ethernet Sender, Ethernet Receiver etc. With these assumptions, this

problem is close related to 2D bin-packing problem (e.g., best-fit, first-fit and bottom-

left). In this work, heuristics algorithms used to minimize the overall execution time are

presented.

These authors further incorporate these concepts on an OS prototype, called WURM -

OS [108] which is an extension based on a standard real-time kernel for microprocessors.

It provides support for loading, executing and removing tasks and services for inter task

communication. Furthermore, hints for online scheduling and placement of tasks are

also given.

3.2.6. Run-time Support

The work presented in [109] and [110] identify and discuss the fundamental services that

an OS need to provide in order to properly manage reconfigurable resources focusing on

those required for run-time execution. These, however, do not differ so much from those

defined in Section 3.2.3. Further, in [98] an OS prototype for RC called OS4RC (Op-

erating System for Reconfigurable Computing) is described. Additionally, an extended

version of the OS has been incorporated into a complete system called ReConfigME

(described in [111]).

In the works described above, the authors define a task as being implemented in hard-

ware, which was previously synthesized having a square shape [98]. Additionally, all

tasks have the same shape and are position independent on the RH. The RH, in its turn,

can hold several of these tasks and provide means for connection among the tasks placed

on the RH device.

In order to execute a given task graph on the RH device, the OS4RC provides the

following basic services: Partitioning, Allocation, Placement and Routing. Depending

on the task dependencies and the available free space on the RH, a group of tasks may

be grouped together forming a partition (application module). This will require, further,

3.2. OPERATING SYSTEM FOR RECONFIGURABLE COMPUTING 33

the Routing (for communication) service from the OS to enable the connection of the

tasks that are inside an application module. The Allocation service searches for free

space on the RH device for a given partition. It may happen that a replacement of some

application modules will be required in order to provide the necessary space, or even a

re-partition will be required. Figure 3.5 shows an example of a task graph allocated on

a FPGA.

Application

Partition

Application

Module

(when placed &

routed at design

time becomes a

logic module)

Communication

Channel

FPGA Chip No. 1

Or Real Surface

FPGA Chip No. 2

Or Real Surface

Virtual FPGA

Surface

gure 1 –

Figure 3.5.: A task graph on FPGA [98].

A rather different approach is presented in [112] and [113]. Here, a reconfigurable hard-

ware (FPGA) is also used as the main computational resource. The reconfiguration of

the FPGA and the execution of the high level services of the OS are executed by one

CPU. The FPGA surface is partitioned in two basic parts: OS frame and User Area.

The User Area is divided into slots with homogeneous size. Each slot have the same

width and spans the device height, as it can be seen in Figure 3.6. On each slot one user

task can be placed. In [114] the same authors propose a relaxation of these dimensions

constraints by allowing a user task to occupy a multiple number of consecutive slots.

The user tasks are designed and synthesized in advance, and it is assumed that user

tasks can be allocated to any slot. In addition, the design of each task must fit a well

defined template (denoted as Task Communication Block - TCB), which specifies an

interface that a task need to provide for communication purposes. A TCB also defines

communication channels that allow the OS frame to have access to all slots on the FPGA.

The OS frame is the static part of the FPGA and provides communication services

among slots and also establishes the connection with CPU, where task management

services are executed. The work gives further hints that preemption and resumption of

hardware tasks could be done via config/readback FPGA port.

34 CHAPTER 3. RELATED WORK SURVEY

elemarf

thgiremarfS

LRAB

ksatymmud

atymmu

saty

mmu

ksatymmud

RRAB

BAC

cigolksatr

esu

partial reconfiguration

FPGA

static static

Task Communication

Bus (TCB)

bus wires bus macros

reconfiguration slots

w=1

cigolks

atresu

BAC

w=3

w=1

Figure 3.6.: OS frame and task communication block [114].

In [115] the approach is extended in order to allow the partitioning of the RH in a

fixed number of slots, where each slot may have different widths. The focus of this

work was the evaluation of certain scheduling policies well known from single processor

scheduling. Both non-preemptive schedulers (First Come First Serve and Shortest Job

First) and preemptive schedulers (Shortest Remaining Processing Time and Earliest

Deadline First) were evaluated using different FPGA partitions (varying the number

and the widths of the FPGA columns). The evaluation of the results provide some

guidance to the designers with respect to the best way to execute FPGA partition based

on the characteristics of the input tasks.

The same authors have extended the previous system by including real-time constraints

for run-time scheduling and placement of 1D and 2D tasks. By considering tasks having

arbitrary and synchronous arrival times some non-preemptive as well as preemptive

scheduling and placement algorithms were proposed in [116] and [117], respectively.

Additionally, an acceptance test is derived in order to meet deadlines of executing tasks.

Some other authors focus on the efficiency of the scheduler and placer OS services. For

instance, in [118] the Scheduler and Placer were combined together into a single method.

The work shows some improvements due to the reason that these two services have a

strong influence in each other concerning efficiency. Also, in [119] and [120] these two

methods are proposed to be implemented completely in hardware, thus, decreasing the

overhead they introduce in the system.

Different strategies to execute hardware tasks on a slotted FPGA have also been pro-

posed. For instance, the authors in [121] configure and execute tasks in slots based on

priorities. The decision of which task will be configured in each time is made based upon

priorities, which change dynamically during system execution. Higher priority tasks have

preferred occupancy of the FPGA. The priority assignment policy follows an adaptive

rule based on the reconfiguration-request-ratio and communication rate. Additionally,

3.2. OPERATING SYSTEM FOR RECONFIGURABLE COMPUTING 35

in order to avoid hardware task starvation some design rules are specified in [121].

Also in [122] and [123] a proposal is presented for a completely reconfigurable computer

running on FPGA only. The work focuses on scheduling and memory management

for time constrained hardware tasks. The system provides means for running a set of

periodic tasks over a FPGA and derives several feasibility analyses based on the resources

shared among the tasks considering partial or full reconfiguration of the reconfigurable

hardware.

3.2.7. Multitasking Issues

By allowing multitasking on reconfigurable devices one may expect to find the same issues

present in multitasking scenario of software only environments. However, some of them

may appear in a different manner or may represent a challenge for the implementation

phase of the system.

Context switching

Task switching in a preemptive scenario is discussed in [124] and [125]. The authors

specify the necessary hardware requirements which enable preemption and resumption

of a hardware task. They further define the state extraction and reconstruction based

on bitstream manipulation. However, they do not deal with partial reconfiguration.

Bitstream manipulation is also proposed in [126], where detailed methods for context

saving and restoring are explained for partial reconfigured devices.

Further approaches, requiring some extra circuit added to the user tasks, are proposed

in [127]. In this work, context data transformation was also mentioned for allowing task

state migration (e.g., between different execution environments).

Reconfiguration port exclusiveness

Most of the approaches used for multitasking on FPGA assume partial reconfigurable

devices. However, they seldom respect the sequentially reconfiguration of the FPGA

slots due to the mutual exclusive usage of the reconfiguration port. Generally this

constraint is neglected because the execution time of a task is assumed to be much

higher than the reconfiguration time. Nevertheless, for comparable reconfiguration and

execution times of a single hardware task, this may decrease the system performance for

time critical scenarios facing high reconfiguration rates. Therefore, the authors of [128]

and [129] propose the application of techniques from single processor schedule theory for

controlling the access of the FPGA reconfiguration port.

Reconfiguration latency Reconfiguration overhead has also been pointed out by some

authors. For instance, in [130] the authors present an algorithm for run-time scheduling

of partial FPGA reconfigurations in order to minimize the reconfiguration delay. By

36 CHAPTER 3. RELATED WORK SURVEY

noticing that an user task may appear more than one time in the input task graph, the

reconfiguration overhead may be reduced by keeping it into the reconfigurable device

for the next execution. Further approaches based on software techniques have being

used to reduce or even to hide the reconfiguration latency, by configuring and executing

tasks concurrently. Examples of these techniques are based on prefetching, caching and

compression of configuration data [31].

Fragmentation

Memory fragmentation appears in software environment when pages are swapped in

and out between main and secondary memories. The same concept is present in a

device supporting partial reconfiguration. When a hardware task does not completely

occupy the FPGA slot, some resources are not used, which characterizes an internal

fragmentation. The external fragmentation turns up when the remaining free device

area is split into several unconnected vertical stripes (for 1D task model) or rectangles

(for 2D task model).

In [131] specific metrics for modeling this situation is proposed, which may improve the

results of certain placement strategies. Task relocation and transformation were further

approaches presented by some other authors (e.g., in [132] and [133]).

3.2.8. Dynamically Hybrid Architectures

In the work presented in [127] the allocation and scheduling of a dynamically changing set

of tasks on a reconfigurable system is investigated. The execution platform comprises

one processor and one FPGA. Each task is able to be executed in the reconfigurable

device or in the processor. A Scheduler, which decides where a task should be placed

(software or hardware), and a Placer (to manage the reconfigurable hardware resource)

are the two main components of this system.

The work presented in [134] deals with run-time migration of a task across FPGA and

CPU boundaries. The proposal introduces the concept of a tool used to design hard-

ware and software tasks. Such a tool, should guarantee the same behavior of it in a

later implementation, regardless of its execution environment (hardware or software).

Furthermore, this feature easies the analysis of the system in high level, since it allows

the preemption and resumption of tasks across different execution environments. How-

ever, neither proper analysis in identification of context data required by a preemption,

nor further analyses using the proposed unified representation of a task was provided.

In addition, a communication infrastructure is provided, which supports an uniform

communication scheme among tasks placed on this hybrid architectures.

The same authors provide further details of the system in [135] and [136], and present an

Operating System for Reconfigurable Systems (OS4RS).As a case study a multimedia

application was chosen. A video decoder task was designed to be relocatable across

3.3. FURTHER APPROACHES 37

FPGA and CPU domains depending on the available resources and the performance

required. Nevertheless, the design of such task was so that no context data transfer was

required by the unique relocation point assumed.

Further conceptual analysis for dynamically hybrid systems is done in [137]. The authors

consider the problem of allocation of dynamically arriving tasks over an architecture

comprising CPU and FPGA. The tasks are capable to be executed in every one of these

two environments. Additionally, tasks are able to be relocatable (relocation of tasks over

the hybrid architecture by means of reconfiguration). To solve the problem, the authors

propose some algorithms for scheduling, relocation management and derive an on-line

admission check for the arriving tasks.

3.3. Further Approaches

This section will focus on other approaches where OS may appear in reconfigurable

computing. Starting with the idea about the implementation of OS services in hardware

by using static hardware (e.g., ASIC), the discussion will then cover some approaches

where reconfigurable hardware (e.g., FPGA) is used to execute some OS activities. In

addition, this section will discuss a different approach, where FPGA is not used as a

coprocessor.

3.3.1. Hardware Accelerator for RTOS

Static approach

For time critical applications where hard real-time constraints are imposed, even the

overhead caused by RTOS activities need to be considered, in order to guarantee the

feasibility of the overall system. Solutions for this problem were already proposed,

for instance, in [138] and [139] where a scheduler was proposed to be implemented in

hardware (VLSI circuit) to offload the CPU workload. Thus, a deterministic timing

behavior is achieved, which is a mandatory characteristic in hard real-time systems.

The FASTCHART project [140] goes into the same direction. The work proposes a

custom CPU that could perform a context switch in one cycle with a kernel coprocessor

called the RTU (Real Time Unit). Later on, the RTU was extracted from FASTCHART

and implemented as an ASIC [141]. This allows its appliance to a broader range of

applications by attaching this ASIC with a general purpose CPU using standard real-

time buses.

Comparisons among RTOS-kernel implemented in ASIC, hardware/software co-designed

and purely in software have being done in [142] and [143], which shows the improvement

on the determinism, caused by shifting the RTOS services execution from CPU to hard-

ware.

38 CHAPTER 3. RELATED WORK SURVEY

Configurable approach

The authors in [144] provide a hardware scheduler, which is able to dynamically change

the schedule policy among three different ones (EDF, RM and Priority Based) in a de-

terministic and fast manner. On their work, dynamically reconfiguration is not used. In-

stead, three different policies are all implemented in a FPGA and the scheduler switches

from one policy to another on-the-fly. This configurable scheduler is provided as an IP,

which may be easily integrated in a SoC design.

In [145] and [146] the co-design of a RTOS is proposed. By using a specified framework

the user can choose and evaluate the partition of the RTOS modules in hardware and

software. The main goal of the work is to help the programmer to explore and find the

most appropriated configuration of his system, taken into consideration RTOS activities.

By using the framework, the programmer may achieve a higher efficiency in resource

usage (compared to a non tailored RTOS) and a speed up in the application execution

time. The components provided by the configurable RTOS include services like Deadlock

Detection Unit, Dynamic Memory Management Unit among others.

3.3.2. Multithreading on Hybrid Architectures

An interesting approach which skips the common view of a FPGA as a coprocessor

has first been presented in [147]. In this work the authors present a framework, which

allows a unified model of programming for developers of Massively Parallel Processing

(MPP) systems when using hybrid CPU/FPGA-based execution platforms. Assisted by

a hardware/software co-designed RTOS, threads running in software and threads running

in hardware have means of synchronization and control between each other when sharing

the available resources.

The central idea of the proposal lies on the extension of the multithreading programming

paradigm for those two different execution environments. A well designed RTOS for such

environment shall provide usual services like mutex, semaphore, read-write locks etc, for

all threads, independently on their location (software or hardware).

The former authors go further in their investigations and provide more details of their

concepts and execution platform in [148] and [149]. The architecture is based on a hybrid

FPGA (in this case the Virtex-II Pro from Xilinx), which enables the implementation

of a complete system on a single chip. The abstract view of the system can be seen in

Figure 3.7.

The architecture provides the standardization and transparency for the programmer

either by designing a task for software or for hardware environment. For this purpose, the

same API interface is made available for both kinds of threads, which actually compose

the hardware and software thread interfaces. These interfaces make the connection to

the hardware/software co-designed RTOS and threads. In addition, a proposal for a

3.3. FURTHER APPROACHES 39

CPU

Software

thread 1

Software

thread 1

Software thread interface component

Hardware thread

interface

component

Hardware thread

interface

component

Hardware thread

interface

component

Hardware thread

interface

component

System bus

Hardware threads

Hybrid

thread

abstraction

layer

Figure 3.7.: Hybrid thread abstraction layer [148].

common programming language for hybrid threads is proposed in [150], which is based

on a translation from C to VHDL language.

The authors propose further the (static) migration of OS functionalities from software

to hardware with the intention to improve determinism and offload the CPU from the

RTOS execution overhead (same intention of the works related in previous section).

At the current status, thread management, thread scheduler, semaphore and thread

interrupt controller are some RTOS services already available in hardware ([151] and

[150]).

The proposed paradigm, where hardware threads are no longer simple coprocessors,

changes the paradigm usually adopted in the reconfigurable computing research area

[152]. In this new case, a hardware thread is able to initiate by itself a transaction or

request services from the underlying OS and thus interact with the software threads.

Although this new computational model may provide advantages for the contemporary

embedded systems platforms, features already investigated by the reconfigurable com-

puting area could improve this scenario. For instance, threads could be dynamically

reconfigured to allow dynamically arrival of hardware threads, achieving an equivalent

scenario of software threads. Moreover, in a more flexible scenario, threads would be

allowed to be dynamically preempted and resumed across execution environment bound-

ary. This would allow the system for dynamically adaptation in order to satisfy the

requirements of changing applications, and further use the available resources in a more

efficient manner.

Another challenge faced for this paradigm during design phase, is how to partition

the application in software and hardware when following such programming model, as

noticed in [148]. For this purpose, a first design flow concept is presented in [152].

40 CHAPTER 3. RELATED WORK SURVEY

3.4. Chapter Conclusions

This chapter summarizes relevant works related to (re)configurable operating systems

comprising those ones based solely on software, and those ones having support from

hardware. Configurable OS can help the designer to provide only those OS services

required by the application. Nevertheless, modern embedded systems are becoming multi

purpose (one single device needs to execute several and different kinds of applications).

With this increasing system complexity a static solution for an operating system is no

longer enough. Therefore, the operating system must be (dynamically) reconfigured.

Nevertheless, for embedded systems the usage of an operating system introduces an

overhead which need to be considered.

3.4.1. Correlation With This Thesis

The architecture investigated in this thesis will rely on the capabilities investigated in

[84], namely the ability to partially reconfigure the system using (in the case of a Virtex-

II Pro device) the ICAP entity. However, it is not intended to freely relocate the modules

on to the device surface. Instead, it will be previously partitioned into a number of slots,

in which circuits may be placed by means of partial reconfiguration. Furthermore, the

hardcore microcontroller is going to be used, available inside this device and further

analysis in higher level will additionally be investigated.

In [127] the authors remain on the conceptual level and do not present further imple-

mentation solutions. Similarly, the results presented in [134] and [137] cannot be used

in the present thesis, since neither strategies nor methodologies for a deterministic re-

configuration and relocation of components over the hybrid architecture are provided.

Additionally, the authors do not consider the reconfiguration overhead of the software

and hardware tasks. Moreover, the challenge related to the relocation of tasks across

hybrid execution domains are not tackled in their analyses.

The work presented in [148,152] follows a paradigm where an OS service, regardless its

location (either software or hardware), is able to start a transaction with another com-

ponent (application task or OS service) of the system. Furthermore, application tasks

and OS services are statically allocated over the hybrid architecture (offline decision).

The work presented in this thesis differs from the previous work essentially in two as-

pects. First, the system is able to relocate OS services at run-time. Thus, the design of

an infrastructure for dynamic relocatable OS services is necessary. Second, the relocat-

able OS services are not able to start a communication transaction. Instead, they are

activated only when an application task requires its service, and consequently, do not

communicate with other OS services.

3.4. CHAPTER CONCLUSIONS 41

3.4.2. Additional Comments

The works related to reconfigurable computing area were spent mostly in the investi-

gation on how an operating system can provide efficient support and abstraction of a

heterogeneous execution platform to the executing application. However, making the

operating system also profit from the reconfigurability and from the high computational

performance of such platforms is a promising combination.

The next chapters will introduce a proposal of a run-time reconfigurable RTOS intended

to support modern and future embedded systems. This thesis has thereby, the intention

to contribute to DREAMS OS, towards aggregation of enough abstraction and seamless

integration of reconfigurable hardware in the execution platform. In addition, further

methodologies and strategies, which enable the operating system to reconfigure itself

in a stand-alone manner (without using DREAMS‘s FRM), are also investigated and

proposed.

42 CHAPTER 3. RELATED WORK SURVEY

CHAPTER 4

Run-time Reconfigurable RTOS

This chapter presents conceptual ideas and gives an overview of the proposed real-time

reconfigurable operating system devoted to RSoC systems. The OS is depicted in its

main parts and briefly explained accompanying the delineation of target applications

and target RTOS. Additionally, the hardware architecture is presented.

4.1. System Overview

Figure 4.1 gives an overview of the proposed system highlighting the main components

that were investigated inside this work. Both, application and OS, use computational

resources from the execution platform, which are provided by the presence of FPGA and

CPU in the hardware platform. The application uses the OS services through a specific

API, which is monitored by the Runtime Reconfiguration Manager (RRM). Based on the

analyses made by the RRM, the reconfiguration of the OS components over the hybrid

execution platform may be necessary in order to achieve better resource utilization.

This reconfiguration is represented by a run-time relocation of OS services over this

architecture. For a better understanding of the whole system, its main components are

explained in the following subsections.

44 CHAPTER 4. RUN-TIME RECONFIGURABLE RTOS

APPLICATION

OS API

HYBRID

OPERATING SYSTEM

ANALYSES

RECONFIGURATION

MANAGEMENT

Monitoring

Reconfiguration

Requirements

HYBRID

COMPUTATIONAL

RESOURCES

USES

Execution

Platform

Application

Operating System

Run-time

Reconfiguration

Manager

(RRM)

Figure 4.1.: System overview.

4.1.1. Target Applications

The research work presented here focuses on embedded systems whose computational

resources requirements cannot be precisely specified, for each instant of time in its life-

time, at design phase. In these cases, each application may be started and/or finished

asynchronously. Additionally, one single application may be designed to attend different

Quality-of-Services (QoSs), which leads to a varying demand of system resources in time.

These situations characterize a dynamic changing environment.

In such systems, due to the changing environment, the requirement demanded by the

applications may change during system execution. Therefore, it is not possible, at design

phase, to specify an optimal set of computational resources that the system will require

during its lifetime. Examples of such embedded systems are modern mobile phones and

PDAs with support for different kind of applications: telephone call, music player, movie

players, camera, video recorder, text editor, gaming, etc.

In these scenarios, generally soft real-time requirements are necessary to be considered

(e.g., multimedia applications), which is also the type of time constraints considered

inside this thesis work. Furthermore, applications are represented by a set of tasks that

are considered to have periodic behavior. This is a model suitable and generally used

for real-time embedded systems.

Tasks may use the CPU as well as the RH (Reconfigurable Hardware) as computational

4.1. SYSTEM OVERVIEW 45

resources. However, they are mainly software based and only computation intensive

software tasks may use the RH. An application specific task that usually profit from a

hardware implementation is, for instance, a FFT (Fast Fourier Transformation) compu-

tation, which is usually needed for multimedia applications.

4.1.2. Target RTOS

The target RTOS architecture follows the microkernel concept, where application and

operating system services are seen as components running on top of a small layer that

provides basic functionalities. Figure 4.2 shows abstractly this architecture, where each

application task, as well as each OS service, is considered as a component.

OS/

APP/

Interface

OS/

Interface

Figure 4.2.: Proposed microkernel based architecture.

The target RTOS DREAMS has been chosen for many reasons. First, because DREAMS

supports the concept of components (as presented in the Section 3.1.1). Second, DREAMS

also follows the microkernel concept and, if necessary, provides memory protection among

the components [153]. Third, the availability of detailed information concerning OS in-

ternals, due to the fact that this OS is a result from an in-house research, makes devel-

opment using DREAMS easier. Last, but not least, DREAMS has a port to the PPC405

processor available in the Virtex-II Pro FPGA, which is the target device used within

this work.

The OS reconfiguration is achieved by adding/removing, at run-time, its services and/or

relocating them during system execution. For each one of these relocatable services,

two different implementation versions are made available (technically synthesized before

system starts): one in software and another in hardware. At run-time, these services can

be attached and/or detached from the system. For hardware components this relies on

the FPGA partial reconfiguration capabilities. For software component versions, known

techniques for dynamic linking of software objects are used.

The communication layer, shown in Figure 4.2, provides the necessary support to allow

46 CHAPTER 4. RUN-TIME RECONFIGURABLE RTOS

the communication among components running over the hybrid architecture in an effi-

cient manner. Additionally, this layer provides enough transparency to the application

which should not be aware about the location (either FPGA or CPU) of OS services.

At this point a question arrives concerning which OS service is considered to be relo-

catable and which one shall belong to the application, since both use computational

resources of the system. The definition and separation between OS services and appli-

cation components is usually not clear and difficult to be made (as usual for microkernel

architectures). In this work high level OS services are considered for relocation. Ex-

amples for this classification are encryption algorithms (which will be used in the case

study presented in Chapter 9) and communication protocol stack (e.g., TCP/IP stack

protocol [154]). In summary, relocatable OS services, considered for this research, are

those services that can be used by more than one application in the system. The re-

maining OS services may belong to the microkernel layer (e.g., internal communication,

synchronization) or stay as fixed components outside this layer.

At the beginning of this thesis research, the dynamic reconfiguration of basic OS services

(like scheduling, synchronization mechanisms, communication, etc.) was also considered

as possible. However, the mechanisms developed to dynamically reconfigure the high

level OS services require the continuously availability of these basic services. Due to this

necessity, basic OS services are static and not relocatable.

4.1.3. Instrumented OS API

The observation of the OS services usage is an important and necessary feature for a dy-

namically adaptable OS. In this work, observation is done by instrumenting the OS API

in order to provide enough information about OS services (concerning computational

resource usage), which is used by the analysis and allocation system parts. With an

instrumented OS API it is possible to measure the execution time and frequency of ser-

vice, either software or hardware services, which are used to estimate the computational

resource usage.

Since applications are modeled as a periodic task set, it can be assumed that OS services

will also present a periodic behavior, when called by these application tasks. Neverthe-

less, each OS service may experience different call patterns, depending on the number

of application tasks and their temporal definitions. Different scenarios can be seen in

Figure 4.3.

For a service sycalled by only one single periodic task Tx, the period of this service can

be precisely determined as being the same of the caller task: Px=Py. This situation is

illustrated in Figure 4.3a. Please note that ax,k (arrival time of instance kof Task Tx)

is not necessary equal to ay,k (instant of time when service syis called by task Txin its

instance k).

For a service shared among several periodic tasks, its behavior will be still periodic.

4.1. SYSTEM OVERVIEW 47

TASK X

SERVICE Y

Communication

Infrastructure

Application

Operating System

ax,k ax,k+1 Pax,k+2

ay,k ay,k+1 Pay,k+2

Service y

Task x

OS call

(a) OS service sybeing called by a single task Tx.

TASK 1

SERVICE Y

Communication

Infrastructure

Application

Operating System

TASK n

a1,k a2,k+1 P1a3,k+2

a2,k a2,k+1 a2,k+2 a2,k+2

an,k an,k+1 an,k+2

Task 1

Task 2

Task n

OS call

(b) OS service sybeing called by several tasks.

Figure 4.3.: Call patterns experienced by an OS service.

However, its execution pattern will be determined by the combination of the periods

of the tasks using this service. In Figure 4.3b a service syis called by a task set T=

{T1, T2, . . . , Tn}.

Generally, if for any interval I= [t, t +I), the arrival times set Aiof Task Tican be

determined by Equation 4.1, and so the correspondent arrival times set Ay,i of Service sy

(due to caller task Ti) determined by equation 4.2 (where ϕidenotes the time distance

between aiand ay, i) , then the arrival times experienced by service Syin the case of

multiple tasks can be determined by Equation 4.3.

Ai(I) = {ai,k |ai,k =kPi, k ≥0, ai,k ∈[t, t +I)}(4.1)

Ay,i(I) = {ai,k |ai,k =kPi+ϕi, k ≥0, ai,k ∈[t, t +I)}(4.2)

Ay(I) =

[

i=1

Ay,i(I) (4.3)

Each task Ti∈ T is periodic, determined by a specific period Pi. The behavior of sy,

however, is not periodic in a first glance. This means, that depending on the execution

rates of each task Ti, and the interval Iselected, no period Pyfor service sycan be

48 CHAPTER 4. RUN-TIME RECONFIGURABLE RTOS

explicitly detected. However, at every hyperperiod Hof the task set T, the pattern

observed for service syin the interval I?(where I?=H) is repeated.

Since the number of OS services is assumed to change dynamically and asynchronously

during system execution, the run-time identification of the hyperperiod of the current

task set is not a trivial job. Moreover, due to the different combinations of the tasks

periods, the hyperperiod can be very large.

In the case of hard real-time systems, worst case scenarios must be considered in the

analyses to assure system feasibility. Hence, Equation 4.3 would need to be precisely

determined, and the worst case execution time would be considered as the nominal

execution time. If a periodic behavior of an event would be required for simplification of

the analyses, then the minimal interarrival time would need to be used. In the literature,

some other approaches may be used to model such scenarios, like for instance in [155]

where a sophisticated arrival function is used to describe a bursty arrival of a method

invocation.

For a soft real-time case, one may relax this requirement and use, for instance, averages

values as a simplified metric to estimate the arrival rate and execution time. Note

that more sophisticated models for this scenarios are available depending on the target

application [156]. However, within this work each OS service is considered to present a

periodic behavior and these metrics for soft real-time events are used. Nevertheless, a

system improvement in this topic is foreseen as a future work, which is further commented

in Chapter 10.

Thus, the arrival rate (period) Pyof a service syis estimated in the following manner:

Py(I) = ¯py=I

Ny(I)(4.4)

Where Iis a predetermined interval (I= [t, t +I)) and Ny(I) is the amount of calls to

service syinside the interval I.

The execution time of the service syis assumed as being the average of the execution

times measured by the monitoring system. Thus, the execution time Eyof a service sy

is estimated in the following manner:

Ey(I) = ¯ey=Pt+I

tLy

Ny(I)(4.5)

Where Lyis the accumulated execution time of all service calls made inside an interval

Iand completely finished.

4.2. HARDWARE ARCHITECTURE 49

4.1.4. Run-time Reconfiguration Manager - RRM

OS Service Assignment

In this stage the system decides to which execution environment (CPU or FPGA) each

OS service will be assigned. This decision is made based upon the computational re-

sources used by each of the OS service in the system. For a software component this is

the percentage of the CPU required and for a hardware component this is the amount of

FPGA area that it requires (static information gathered after its synthesis process). Ad-

ditionally, by an arrival of an application into the system, its requirements (OS services

needed) are also known by the assignment algorithms.

The assignment decision is driven mainly by the computational resources required by

the application. As usual for an embedded execution platform, resources are rare. The

OS (which is wanted by the benefits it provides), introduces an overhead in the system.

Therefore, the intention is always to direct the remaining resources from the platform to

the OS and leave the resources wanted by the application available to it. In summary,

the RTOS tries to adapt itself to the remaining computational resources of the hybrid

execution platform.

Reconfiguration Management

This module is responsible for the reconfiguration management of each OS service over

the hybrid architecture, and it incorporates the low level support necessary to partial

reconfigure the FPGA and to dynamically link the software components.

A system reconfiguration is necessary always when the OS service assignment algo-

rithms decides to change the allocation of the components over the platform. This

can be caused by the dynamics of the running applications or when an application ar-

rives/leaves the system. Furthermore, since it is assumed that applications are time

constrained (not necessarily all of them) the system should carry out the reconfiguration

activities in a deterministic manner. In an ideal case, the blackout1time should be zero

or minimized [157]. Therefore, in this approach, deterministic mechanisms known from

real-time scheduling theory are adapted and used for this purpose.

4.2. Hardware Architecture

For validation purposes a target architecture based on Virtex-II Pro FPGA, from Xilinx,

has been selected. By using its partial reconfiguration capability, hardware components

may be reconfigured on the FPGA at run-time whilst the remaining part of the device

keeps on executing. Additionally, the availability of an embedded hardcore general

1Blackout is the time interval during which the system do not react due to the reconfiguration.

50 CHAPTER 4. RUN-TIME RECONFIGURABLE RTOS

processor embedded in this FPGA device, features it with the necessary heterogeneity

for accomplishment of the requirements of a hybrid reconfigurable architecture.

As noticed before, DREAMS has a port to the PPC405 processor available in the Virtex-

II Pro FPGA. Nonetheless, support for dynamically reconfiguration was not supported

by this RTOS, which actually is one of the contributions of this thesis.

The execution platform proposed can be seen in the Figure 4.4a, where details were

omitted for a better understanding. The FPGA surface is spitted into static and re-

configurable parts. The static part embraces the CPU system, where the RRM and the

OS services in software are executed. The reconfigurable part is divided into nslots

and each slot provides an OS service framework (Figure 4.4b). Additionally, each slot

can be dynamic reconfigured by means of partial reconfiguration. In this particular, the

proposed approach is similar to those found in the literature, which were discussed in

Chapters 2and 3.

Each subsystem (software and hardware) has its own local bus for communication. These

buses are connected through a dedicated bridge to provide communication across CPU

and FPGA components. Further details of the execution platform are given in Chapter 9.

The local memory is used to support the communication between local components, and

the global shared memory is used to perform the communication with components run-

ning on the CPU. The local controller is used to manage the access to the local memory

and the global controller, which together with its counterpart in software, performs the

communication infrastructure layer mentioned previously in Section 4.1.2.

The slots are connected using Busmacros. In order to program the FPGA slots, the

reconfiguration port is used, which may be local (by using the ICAP Xilinx entity) or

an external Run-Time Reconfiguration (RTR) controller.

4.3. Design Support

In order to make the system accessible for further developments, a proper support in

the design phase needs to be made available. Related to this topic, the following points

are handled in this thesis work:

HW/SW Interface The generation of the interface between hardware and software is an

important aspect in such reconfigurable architectures. Since hardware components

may be relocated, removed or inserted dynamically into the FPGA slots, there is

a need to define in which way these reconfigurable components can establish a

communication channel with the static part of the system.

Relocatable Tasks Since a dynamic migration of components between hardware and

software domains are necessary for this approach, appropriate mechanisms are

required for its achievement. If preemption and resumption of a component across

4.4. CHAPTER CONCLUSIONS 51

S1 SnS1 Shared

Memory Global

CTRL

LOCAL BUS

(Busmacros)

MEMORY

CPU

RTR

Virtex-II Pro FPGA

Dynamic Part

(a) Architecture overview.

Local

Mem. Comp.

CTRL

Local Bus

Slot n

(b) OS service slot template.

Figure 4.4.: System architecture.

CPU-FPGA boundaries are required, there is a need to assure a safe context

translation and migration of this component. Therefore, a suitable environment for

designing of such relocatable tasks is necessary, along with a run-time environment

to carry out these activities on-the-fly.

4.4. Chapter Conclusions

In this chapter a proposal for a real-time reconfigurable operating system was presented,

which is based upon reconfiguration of its services over the hybrid architecture. The in-

tention is to promote a flexible utilization of the computational resources, both CPU and

FPGA, by the operating system services and application tasks. Thus, an efficient usage

of the execution platform can be achieved, even if the system requirements dynamically

changes .

For its achievement, the system needs to monitor the resource utilization and, by using

appropriated analyses, decide which OS services shall be relocated. Moreover, recon-

figuration activities need to be performed in a deterministic manner, since the system

must guarantee soft-real time constraints.

A preliminary investigation towards pattern identification of operating system service

usage has been presented. Nonetheless, as already noticed, these analyses need to be

improved for a better identification of their usage profiles. This investigation is planned

as a future work.

52 CHAPTER 4. RUN-TIME RECONFIGURABLE RTOS

CHAPTER 5

Modeling & Problem Formulation

This chapter introduces the model costs for OS services, communication among services

and reconfiguration activities. These models allow a suitable formulation of the main

problems investigated in this thesis.

5.1. Component Assignment

The problem of assigning OS components over two execution environments can be seen

as a typical assignment problem. Indeed, this problem can be formulated as an Integer

Linear Program (ILP), or more specifically, as a Binary Integer Program (BIP), which

characterizes the component assignment problem as a NP-Hard one.

Let Sbe a set which denotes the relocatable OS services (hereafter, component and

service will be used as interchangeable terms):

S={si, i = 1, ..., n}.(5.1)

Since the computational resources are intended to be managed between operating sys-

tem and application, they have to be measured. Therefore, every component ihas an

estimated cost ci,j, which represents the percentage of computational resource from the

execution environment used by this component. On the FPGA (j= 2) it represents the

area needed by the component: Ai. On the CPU (j= 1) it represents the processor

load: Ui. Thus, for each component ithat is relocatable, there are two different costs

54 CHAPTER 5. MODELING & PROBLEM FORMULATION

associated:

ci,j :(ci,1=Uidenotes the processor work load required by component i

ci,2=Aidenotes the FPGA area required by component i(5.2)

Within this thesis, the component costs of a component imay also be denoted in a

shorter form:

ci={Ui, Ai}(5.3)

The assignment of a component, to either CPU or FPGA, is represented by the variable

xi,j. It is said that xi,j = 1 if the component iis assigned to the execution environment

jand xi,j = 0 otherwise:

xi,j :(xi,j = 1 if component iis located on environment j

xi,j = 0 otherwise (5.4)

For legibility reasons, the notation above may also appear in the following form:

xi=({1,0}if component iis assigned to the CPU, or

{0,1}if component iis assigned to the FPGA. (5.5)

Since only one distinct version of a specific component (either j= 1 or j= 2) is allowed

to be used at the same time, the component ican be assigned only to one of the execution

environments. Thus, the following equation must hold:

j=1

xi,j = 1 for every i= 1, ..., n. (5.6)

5.1.1. Constraints Definition

The computational resources that the operating system can use are bounded and depend

on the resources currently required by the application. These limits imposed to the

operating system are denoted by:

Umax Maximum CPU workload;

Amax Maximum FPGA area.

5.1. COMPONENT ASSIGNMENT 55

Thus, the total FPGA area (A) and the total CPU workload (U) used by the hardware

and software components, respectively, can not exceed their maximums. These derive

two constraints to the BIP problem:

i=1

xi,1ci,1≤Umax (5.7)

i=1

xi,2ci,2≤Amax (5.8)

Besides these constrains, an additional one is defined in order to maintain a balanced

utilization of the CPU and FPGA resources. This is necessary for two main reasons.

First, it has to be avoided that one of the execution environments would have its usage

near to the maximum. This condition improves the performance of the reconfiguration

algorithms which require enough CPU workload and FPGA area for their feasibility

(discussed further in Section 6.3).

A second reason is the need to influence the migration of operating system components

to the execution environment which are going to be less used by the application. For

instance, if the application is going to require more FPGA resources, due to the necessity

to execute high performance tasks, the operating system shall migrate to the CPU. In

an opposite way, when the CPU is mostly used by the application, the operating system

may take advantage of the FPGA for its execution.

This balancing is denoted by Band is constrained by a value δ:

B=|w1U−w2A| ≤ δ(5.9)

Where δis the maximum allowed unbalanced resource utilization and the weights w1and

w2are used to proper direct the migration of the components of the operating system

as described above. Hence, the weights w1and w2are used to reflect the tendency of

resource utilization by the application. So, it is possible to guide the assignment of the

OS services, which shall use the resources currently discarded by the application.

5.1.2. Objective Function Definition

An operating system should use computational resources from an embedded system

execution platform as less as possible. In this way, more resources are available to

execute application tasks. Therefore, the objective function used to minimize the overall

56 CHAPTER 5. MODELING & PROBLEM FORMULATION

resource utilization is defined as:

min{

j=1

i=1

ci,jxi,j}(5.10)

The solution for Equation 5.10, subjected to the constraints defined in 5.7,5.8 and 5.9,

are the assignment variables xi,j which are defined as being a specific system configura-

tion: Γ = {xi,j}.

5.1.3. Allocation Example

For a better understanding of the OS service allocation problem, a simple problem in-

volving two relocatable services is shown in Table 5.1. Assuming a situation where two

OS services present the estimated costs given in Table 5.1a, there are four different as-

signments of these two services to the two execution domains. If the goal is to achieve the

minimal overall computational resources usage, the solution presented in Table 5.1e is the

best one. Note that, in this example, the constraints defined in Equations 5.7,5.8 and 5.9

are not considered.

5.2. Reconfiguration Costs

Due to the application dynamics OS requirements may change. This is represented

by changes in ci,j and the size of S(|S =n|). Moreover, the available computational

resources that are available to allocate the demanded OS components may also change.

Therefore, the assignment decision needs to be continuously checked. This implies that,

depending on the result of the new OS service assignment, some of the OS components

may need to be relocated (reconfigured) by means of migration. In other words, a

service may migrate from software to hardware or vice-versa. Additionally, services

may be replaced by new ones (in order to use less/more resources). In this case, a

reconfiguration of a service in the same execution environment occurs (hereafter also

called migration).

The reconfiguration cost of every OS component represents the time required to com-

pletely migrate a component from one execution environment to the other one. There-

fore, to every possible component migration, a correspondent cost is specified. If ois the

source environment of a component and pis its destination, the value ro,p denotes the

migration cost. So, for each component ia migration cost matrix Riis defined. Each

entry in this matrix denotes one possible migration cost of the component.

Since a new component may arrive into the system, or a service that is no longer required

may leave, one can incorporate these services inclusions and exclusions costs in order to

5.2. RECONFIGURATION COSTS 57

CPU FPGA

c1,j 5 10

c2,j 15 10

(a) Component costs specifi-

cation.

CPU FPGA

x1,j 1 0

x2,j 1 0

P20

B20

(b) All components in soft-

ware.

CPU FPGA

x1,j 0 1

x2,j 0 1

P20

B20

ware.

CPU FPGA

x1,j 0 1

x2,j 1 0

P25

(d) s1in hardware and s2in

software.

CPU FPGA

x1,j 1 0

x2,j 0 1

P15

(e) s1in software and s2in

hardware.

Table 5.1.: Different allocation possibilities for two services, and the respective overall

cost and balance values.

get a more precise information about the overall reconfiguration costs. This situation

can be modeled by defining a third virtual environment (which can be seen as a memory

pool of OS components). For instance, ri

3,p represents the cost for aggregating a new

service iinto the system. Thus, the migration cost matrix has a size of 3 ×3.

The cost to migrate a component iin the system can be easily calculated using the

current and new allocation of this component. Let Ri={ri

j,j0}, where jand j0are the

current and new execution environment of component i. If xi={xi,1, xi,2, xi,3}and

i={x0

i,1, x0

i,2, x0

i,3}are the current and new assignment of the component i, then its

reconfiguration cost is calculated by the equation below:

Ri=xi,1xi,2xi,3



0r0

1,2r0

1,3

2,10r0

2,3

3,1r0

3,20





i,1

i,2

i,3



=xT

iRix0

i(5.11)

Where ri,i = 0, {xi,1+xi,2+xi,3}= 1 and {x0

i,1+x0

i,2+x0

i,3}= 1.

The complete reconfiguration cost R(overall reconfiguration time) of the system is de-

58 CHAPTER 5. MODELING & PROBLEM FORMULATION

fined as:

i=1

iRix0

i(5.12)

5.2.1. Temporal Specification

As usual for a dynamic reconfigurable system, reconfiguration activities represent an

extra overhead for the system. Depending on the system specifications, this overhead

may be prohibitive if not properly tackled. For real-time systems these activities must

be specially taken into account in order to guarantee temporal determinism.

Therefore, the activity rito reconfigure a component is divided into two distinct phases,

called Programming and Migration phases. During Programming, the component is

programmed on the FPGA (by downloading the corresponding partial bitstream), or

the object code is placed in the main memory (if its software version is going to be

used).

Once having the FPGA programmed with the service’s bitstream, or the CPU software

linked with the service’s object code, the effective migration can start (Migration phase).

This phase comprises the activity of transferring the internal data of a service between

two execution environments, and afterwards its activation.

Table 5.2 presents the notations used to represent different migration costs. Based on the

implementation results, the time to execute Migration can be considered to be the same

for a task migrating from software to hardware or vice-versa (parameter Mw). More

details on technical realization concerning migration of a running component across

hardware and software will be given in Chapter 8.

Table 5.2.: Time costs related to each migration case.

Parameter Description

iProgramming time in software

iProgramming time in hardware

i“Migration” time from software to software

i“Migration” time from hardware to hardware

iMigration time between hardware/software

Furthermore, to proper tackle reconfiguration activities together with the executing OS

services, an extension of the component cost model is required. Actually, their temporal

parameters need to be specified. As mentioned in Section 4.1.1, in a typical embedded

real-time system the application may be modeled as a set of periodic activities. Even

5.3. COMMUNICATION COSTS 59

sporadic tasks can also be modeled as periodic tasks through the assumption that their

minimum interarrival time being the period. Hence, it is assumed that the applications

are represented by a set of periodic tasks and that OS services present also a periodic

behavior. Thus, each OS service is characterized by temporal parameters (e.g., period,

deadline, etc.). The temporal specification of a service, related to its periodic behavior

is presented in Table 5.3.

Table 5.3.: Service definition related to its periodic execution.

Parameter Description

iExecution time of service iin software

iExecution time of service iin hardware

PiPeriod of service i

DiRelative deadline of service i

di,k Absolute deadline of the kth instance of service i

ai,k Arrival time of the kth instance of service i

bi,k Starting time of the kth instance of service i

fi,k Finishing time of the kth instance of service i

5.3. Communication Costs

Additionally to the computational resources that each component requires, it is also

desirable to take into account costs related to communication among the components.

Observing the underlying execution platform used in this work, these costs depend on the

allocation of the components over the architecture. For instance, if two communicating

components are located in the same environment, the related cost may be lower than

the case where each would be placed on a different execution environment.

For the purpose of communication cost modeling, the proposed architecture shown in

Figure 5.1, already introduced in the previous chapter, provides enough abstraction

for a proper communication cost modeling. There, communication channels available

are highlighted. It can be seen that software components can exchange data among

themselves using the SW BUS. Similarly, hardware components can use HW BUS for this

purpose. To exchange data across the hardware/software boundary, a bridge connecting

SW BUS and HW BUS need to be used, aggregating more cost for this communication.

This would be more in evidence if the two architecture parts would be implemented in

different devices.

In order to model each possible communication between each pair of components, the

system has been modeled using a undirected weighted graph G= (V,E). Each edge in E

connecting two components uand vis weighted by κ(u, v), which represents all possible

60 CHAPTER 5. MODELING & PROBLEM FORMULATION

Bridge

External

Memory

Interface

RTR

Manager

SW BUS

Slot

HW BUS

RAM

ServiceX.o

ServiceX.bit

ServiceY.o

ServiceY.bit

ServiceN.o

ServiceN.bit

Static Part

Slot

BSlot

Reconfigurable Part

Memory Pool

CFG

Port

Figure 5.1.: System architecture highlighting the communication channels.

communication costs between uand v. Note that κdepends on two main factors: a static

one, related to the architecture (time to deliver a message), and a factor related to the

amount of data exchanged between two components, which is dynamic and application

dependent. As each component may be located in one of two different execution environ-

ments, κ(u, v) represents three different communication costs (κ(u, v) = {Cα, Cβ, Cγ}):

•Cα, when both are in SW domain;

•Cβ, when both are in HW domain;

•Cγ, when each component is located in a different domain.

Figure 5.2 shows a sample of such a graph. Grey nodes in the graph may be seen as OS

services primitives (API) that are made available for application tasks. Note that such

nodes do not have allocation costs.

SW,HW

Figure 5.2.: Sample of an OS component graph.

5.4. CHAPTER CONCLUSIONS 61

5.4. Chapter Conclusions

The main problems to be dealt by RRM were discussed in this chapter. It can be seen

that those problems are NP-Hard ones, which requires the development of heuristics

for their solutions. Moreover, these heuristics need to be executed at run-time, which

demands the development of algorithms that present low time complexity.

Furthermore, a model of the reconfiguration activities was introduced, which consider

the proposed underlying execution platform. Only by having all activities modeled by

its temporal characteristics, it is possible to tackle them in a deterministic manner. Such

a complete and embracing model of reconfiguration activity is not usually adopted, as

noticed from the works investigated in Chapter 3.

62 CHAPTER 5. MODELING & PROBLEM FORMULATION

CHAPTER 6

Run-Time Methods

Up to now, the main problems have being properly introduced and modeled, which

allows this chapter to present the solutions proposed to solve them.

First, an allocation algorithm, considering only the computational resource required

by each component, is introduced. Then, an attempt to indirectly reduce the overall

reconfiguration cost is presented. Furthermore, an algorithm extension is proposed, with

the aim to reduce the communication costs among OS components. By the development

of these algorithms, efforts were spent in order to achieve a low complexity due to the

fact that they need to be applied at run-time, during system execution.

Afterwards, the methods designed to handle the reconfiguration activities are presented,

considering thereby time constraints. Here, the reconfiguration activity model is further

extended enabling an efficient management. For the reconfiguration of a single OS

component, methods from real-time scheduling theory are adapted and applied for this

specific problem. Furthermore, since in a system reconfiguration situation, more than

one component may suffer a reconfiguration, it is also required to decide the order in

which these components will be reconfigured.

6.1. OS Service Allocation

The assignment problem is solved by using a greedy based heuristic algorithm. It decides

at run-time where to place each OS component taking into consideration its current

cost and the remaining available computational resources. As it has been explained

64 CHAPTER 6. RUN-TIME METHODS

in the previous chapter, the system has to allocate the OS components to a limited

FPGA area (Amax) and limited CPU processor workload (Umax). The heuristic tries

to minimize the objective cost function (Equation 5.10) subjected to system constraints

(Equations 5.7,5.8 and 5.9).

The allocation algorithm is composed of two phases. The first one creates two clusters

(FPGA and CPU component sets), representing the assignment of components to ei-

ther CPU or FPGA. The second phase improves the first solution towards the balance

constraint δ.

6.1.1. OS Service Assignment Phase

In the first phase, shown in Algorithm 1, the algorithm starts selecting the component

that has the smallest cost among those currently needed by the application, and assigns

it to the corresponding execution environment. It then selects the component that has

the smallest cost among the remaining (unassigned) ones so that it tries to keep the usage

of CPU resource Uequal to the FPGA resource A. This selection process is repeated

until all components have been assigned. The algorithm terminates by checking if the

CPU and FPGA resources usage constraint are fulfilled: U≤Umax and A≤Amax. If

so, the algorithm returns a valid assignment solution, or an error otherwise. It can be

seen that this algorithm has a polynomial complexity of O(n2), since there is only one

for loop (line 7) which produces nsearches in a list of (maximum) nelements.

6.1.2. OS Service Assignment Example

To better understand how the proposed algorithm works, an example is presented here.

Assuming three components, which are able to be located either in CPU or in FPGA,

and having respective estimated costs as shown in Table 6.1, the algorithm will take

three steps to provide the assignment solution. At the beginning, the component having

smallest cost among them is selected, which in this case is s2in its CPU version (j= 1).

Thus, in the first step (Table 6.2a), the assignment decision for this component is made:

x2={1,0}. For the second step (Table 6.2b), the component having the smallest cost,

from those available for FPGA, is selected (since the result from first step implies in

U > A). Hence, component s1in its FPGA version (j= 2) is selected: x1={0,1}.

This leads to U= 3 and A= 5, which determines the search for the smallest cost in the

remaining unselected components of CPU versions. As, in this case only one component

remains unselected, the choose lies on service s3:x3={1,0}(Table 6.2c). The resulted

overall resource utilization is P=U+A= 23, with a Balance B=|U−A|= 13.

6.1. OS SERVICE ALLOCATION 65

Algorithm 1 Service assignment heuristic.

1: C1← {ci,1}Set of components available for CPU (j= 1)

2: C2← {ci,2}Set of components available for FPGA (j= 2)

3: X1← {xi,1}Assignment of CPU components

4: X2← {xi,2}Assignment of FPGA components

5: C←C1∪C2;X1←X1∪X2;U←0; A←0

6: n←number of components

7: for k←1 to ndo

8: if U≤Athen

9: Find an unassigned component iamong {ci,1}so that it has the smallest cost.

10: Assign this component to CPU: xi← {1,0}

11: else

12: Find an unassigned component iamong {ci,2}so that it has the smallest cost.

13: Assign this component to FPGA: xi← {0,1}

14: end if

15: U←C1XT

16: A←C2XT

17: if U > Umax or A > Amax then

18: Exit with error: ”Not feasible”

19: end if

20: end for

21: return X1

c1,j c2,j c3,j

CPU 10 3 15

FPGA 5 10 12

Table 6.1.: Example of three components and their respective costs.

6.1.3. Balance BImprovement Phase

In the second phase the balancing Bis improved in order to meet the δconstraint which

was not considered in the first phase. It is based on Kernighan-Lin algorithm [158] and it

aims to obtain a better balancing Bthan the first one by swapping pairs of components

between CPU and FPGA. The algorithm receives as input the first assignment solu-

tion X1 which has nc1=Pn

i=1 xi,1components assigned to CPU and nc2=Pn

i=1 xi,2

components assigned to FPGA. The maximum number of pairs that are possible to be

swapped is defined as: max pairs =min(nc1, nc2).

By moving a component ithat was, for instance, previously assigned to CPU (xi=

{1,0}), to FPGA (x0

i={0,1}) a new balancing Bis achieved: Bnew =|Bcurrent −li|,

where li={ci,1+ci,2}. Similarly, by moving a component ifrom FPGA to CPU, the

new balancing Bwill be: Bnew =|Bcurrent +li|. Thus, swapping a pair of components

66 CHAPTER 6. RUN-TIME METHODS

x1,j x2,j x3,j P

CPU 1 3

FPGA 0 0

(a) Step 1.

x1,j x2,j x3,j P

CPU 0 1 3

FPGA 1 0 5

(b) Step 2.

x1,j x2,j x3,j P

CPU 0 1 1 18

FPGA 1 0 0 5

Table 6.2.: Assignment algorithm applied in the example presented in Table 6.1.

o, p the new balancing Bis defined as:

Bnew =(|Bcurrent −lo+lp|,if xo={1,0}and xp={0,1}

|Bcurrent +lo−lp|,if xo={0,1}and xp={1,0}(6.1)

Additionally, Gop is defined as the gain obtained in the balancing Bby swapping a pair

oand pof components: Gop =Bcurrent −Bnew. A gain above 0 means an improvement

obtained in the balancing B. Observing Equation 6.1, it follows that:

Gop =(lo−lp,if xo={1,0}and xp={0,1}

lp−lo,if xo={0,1}and xp={1,0}(6.2)

The balance improvement algorithms is shown in Algorithm 2. Before starting, it first

build a set Mof component pairs, respecting the following: For each pair {o, p} ∈ M

it must hold either xo={1,0}and xp={0,1}; or xo={0,1}and xp={1,0}.

Afterwards, it starts trying to swap all possible pairs and storing the obtained gain by

every try. It then chooses the one that provides the greatest gain. If this gain is lower

than or equal to zero, no swapping is able to provide an improvement in the balancing B

and the algorithm stops. Otherwise, the pair that provides the greatest gain is swapped

and locked (no longer a candidate to be swapped).

The above process is repeated until all pairs have been locked or no improvement can

be obtained by any interchange or if δconstraint has been fulfilled. The algorithm

terminates by returning the new assignment solution that provides a better (or at least an

equal) balancing Bthan the solution provided by the first phase. The overall complexity

of the balancing improvement algorithm is (worst case) O(m2), where mis the maximum

number of pairs. The complexity of building the set Mis O(m2), in its worst case, and

the search (line 10) made inside the loop (line 9) has also this complexity.

6.1. OS SERVICE ALLOCATION 67

Algorithm 2 Heuristic for balancing Bimprovement.

1: Xinit

1← {xi,1}Initial assignment of CPU components

2: Xinit

2← {xi,2}Initial assignment of FPGA components

3: Xinit ←Xinit

1∪Xinit

4: Xnew ←Xinit

5: Binit ← |Uinit −Ainit|

6: Bnew ←Binit

7: Build the set M

8: m← |M| maximum number of pairs

9: for k←1 to mdo

10: Find the pair {o, p}∈Mso that oand pare unlocked and Gop is maximal

11: if Gop >0then

12: Swap oand pchanging the current assignment: Xnew ←(Xnew with oand p

swapped)

13: Bnew ←Bnew −Gop

14: Lock oand p

15: end if

16: if Gop ≤0 OR Bnew < δ OR all pairs are locked then

17: break

18: end if

19: end for

20: return Xnew

6.1.4. Balance BImprovement Example

Again, the algorithm can be better understood by using an example. For this case let

the input be the result achieved by the assignment phase when applied on the given

example. Table 6.3c shows a complete example of an OS service set allocation. In

Table 6.3a the OS service set input is presented, along with their respective costs and l

parameter. Table 6.3b shows the assignment result when applying Algorithm 1on this

input example. Finally, Table 6.3c presents the balancing B improvement when applying

Algorithm 2on the previous result.

For this case, Mcomprises two services pairs, which are {o, p}={s1, s2}and {o, p}=

{s1, s3}. The second pair delivers the best gain in comparison to the first one. Hence,

changing the assignment of s1and s3from x1;x3={0,1};{1,0}to x1;x3={1,0};{0,1}

a better balancing B is achieved. However, this swap implies in a small increase in the

overall resource utilization. This situation is shown in Table 6.3c.

68 CHAPTER 6. RUN-TIME METHODS

c1,j c2,j c3,j

CPU 10 3 15

FPGA 5 10 12

li15 13 27

(a) Example of three compo-

nents and their respective

costs.

x1,j x2,j x3,j P

CPU 0 1 1 U= 18

FPGA 1 0 0 A= 5

s1s2s3

(b) Assignment result after Algorithm 1.

x1,j x2,j x3,j P

CPU 1 1 0 U= 13

FPGA 0 0 1 A= 12

s1s2s3

Table 6.3.: Example of a complete OS service allocation.

6.1.5. Reconfiguration Cost Reduction

Since reconfiguration activities represent an overhead for the system, it is desired to

reduce it. For this purpose the assignment algorithm, in its second phase, was slightly

modified in order to decrease the number of components to be reconfigured. This action

indirectly decreases the overall reconfiguration cost.

Assuming that Γ and Γ0are the current and new system configuration, let the assignment

difference ziof a component ibe denoted by diff(xi, x0

i). Where xi∈Γ and x0

i∈Γ0.

Thus, diff is defined as follows:

zi=1 : if {xi} 6={x0

0 : otherwise

The new algorithm phase is shown in Algorithm 3. The original algorithm is modified

in the following manner. Instead of immediately changing and locking the position of

the pair oand pafter a gain below 0 was found (line 12 of Algorithm 2), this is done

based on some rules. If, at least, one of the components from the pair keeps its position

in relation to the current system configuration (line 12 of Algorithm 3), the pair swap

is allowed.

In addition, the component that preservers its position (or both) is locked (no longer

a candidate to be swapped). However, if both components of the pair change their

positions in relation to the current system configuration, no swap occurs. Moreover, just

one component (which provides the smaller reconfiguration cost) is locked. This lock is

necessary, otherwise the algorithm would not terminate. This process is then repeated

until all pairs have been locked or no improvement can be obtained by any interchange,

similarly as in the original algorithm phase.

6.2. COMMUNICATION-AWARE ALLOCATION ALGORITHM 69

By applying those rules, the algorithm tries to reduce the number of components needed

to be reconfigured. Also note that the algorithm does not terminate if the δconstraint is

fulfilled. This enforces the search for more components (pairs) that could be kept in its

current allocation solution. In respect to its complexity, it is the same than the original

algorithm.

Algorithm 3 Improved heuristic for balancing B.

1: Xinit

1← {xi,1}Initial assignment of CPU components

2: Xinit

2← {xi,2}Initial assignment of FPGA components

3: Xinit ←Xinit

1∪Xinit

2;Xnew =Xinit

4: Binit ← |Uinit −Ainit|;Bnew =Binit

5: Xorig ←Current System Configuration Γ

6: m←max pairs maximum number of pairs

7: for k←1 to mdo

8: Find the pair o, p (xo={1,0};xp={0,1}or xo={0,1};xp={1,0}) so that o

and pare unlocked and Gop is maximal

9: if Gop >0then

10: Swap oand pand test it ⇒Xtry = (Xnew with oand pswapped)

11: Z=diff(Xorig, Xtry)

12: if zo= 0 OR zp= 0 then

13: Update the new configuration ⇒Xnew =Xtry

14: Bnew =Bnew −Gop

15: if zo= 0 then Lock oend if

16: if zp= 0 then Lock pend if

17: else

18: if xT

oRox0

o< xT

pRpx0

pthen Lock oelse Lock pend if

19: end if

20: end if

21: if Gop ≤0 OR all pairs are locked then

22: break

23: end if

24: end for

25: return Xnew

6.2. Communication-aware Allocation Algorithm

In order to take into account the communication costs when deciding the assignment of

each OS component, an additional phase is applied, preceding the appliance of the former

algorithms explained above. The idea of this phase is to group together those components

that present lower communication costs when located at same execution domain. This

idea is based on the observation of the architecture presented in Section 5.3, where two

70 CHAPTER 6. RUN-TIME METHODS

communicating components may use the local bus (when both are placed on the same

environment), or use the bus across the environmentes.

The algorithm executed in this phase is based on a clustering process and delivers new

components set, which are then used as input to the assignment algorithm. In this case

not only single components are assigned to CPU or FPGA, but also meta-components

(cluster of components) instead. The previous algorithm for balancing Bimprovement

(Algorithms 2or 3) is slightly modified, in which a pair of components is allowed to

change their location only if this change will improve the balancing and concurrently

reduce the communication costs inside the execution environment. This is necessary,

because the balancing Bimprovement algorithm do not handle components clusters.

Previous chapter introduced the communication cost model for two components uand

vassigned to the proposed architecture: κ(u, v) = {Cα, Cβ, Cγ}. Where Cα,Cβand

Cγdenote the communication cost between two components located both in software,

both in hardware and each one in different environment, respectively. To measure the

connection degree between these two communicating components, two further metrics

were specified. The first one is the pl (local preference):

pl =2Cγ

2Cγ+Cα+Cβ(6.3)

This is calculated using CM (pl =f(Cα, Cβ, Cγ)). The metric pl compares the com-

munication cost between two components when both are placed in the same execution

domain in comparison with the case where each of them are placed in different execution

domains.

The second metric is defined as pg (global preference) which is pl multiplied by a global

factor (see Equation 6.4). This metric enables the comparison of all local preferences

with each other when doing the clustering process.

pg = ( Cγ

maxV{Cγ})pl (6.4)

Clustering components

The clustering procedure is iteratively executed in kpasses. In each pass, the algorithm

starts searching for the largest global preference value, pg0, among all edges and tries to

cluster two components, oand p, respecting two conditions:

•Components oand phave not been clustered yet (considering the current pass

only).

•(co,1+cp,1, co,2+cp,2)≤(λ1, λ2), where λ1and λ2are the maximum component

costs allowed when performing the combination of oand p. This criterium is used

6.3. HANDLING RECONFIGURATION ACTIVITIES 71

to avoid the efficiency deprecation of the assignment algorithm when the allocation

costs of the formed components increase.

If two components form a cluster, then they are combined and the search is executed

again. This method is repeated until no additional components are free for clustering,

which characterizes the end of a pass. The resulting clustered graph is then used as

input for the subsequent pass. This is repeated until all kpasses have being executed.

The number kof passes can be adjusted previously, before system starts, which may be

dependable on the granularity and homogeneity of the component costs. Nevertheless,

in Chapter 7, evaluations made for this algorithm indicates how to correctly choose the

number kof passes.

Figure 6.1 shows an example of two components, T1 and T2, being clustered, which gen-

erates a new component T∗. For this case, κ(T∗, T 3) is generated as follows: κ(T∗, T3) =

κ(T1, T3) + κ(T2, T3). Thus, pg∗is calculated as: pg∗=f(κ(T∗, T3)). Note that com-

munication costs, Cαand Cβ, between T1 and T2 (from the example), are no longer

considered for pg evaluation. Nevertheless, they are stored and used during the balance

improvement algorithm.

T3 T*

Higher pg

pg=0.8

0.2

0.1

0.3 0.7

0.5

0.3

0.1

0.2

pg*

Figure 6.1.: Example of two components being clustered.

6.3. Handling Reconfiguration Activities

As it has been introduced in Section 4.1.4, whenever the current allocation of OS compo-

nents does not fulfill the specified constrains, which is caused by application dynamics,

a reconfiguration needs to take place. In the proposed system, this action is triggered

when an unbalanced situation is detected: |w1U−w2A|> δ. Furthermore, if new OS

services are going to be included into the system (e.g., if a new application have been

started), a new set S0={s0

i, i = 1, ..., n0}of services is identified. In each case, a new

assignment solution needs to be computed again, resulting in a new system configuration

Γ0.

To bring the system from the original Γ configuration to a new Γ0, a system reconfigura-

tion needs to take place. This means that a certain number of services must be relocated

72 CHAPTER 6. RUN-TIME METHODS

on the hybrid architecture, new services may be inserted, or even some services may be

deleted if no longer required.

In Section 5.2 it has been shown that reconfiguration costs may be relatively high, which

may require a large blackout time to reconfigure all components at once. This fact may

be prohibited for some systems, especially those having temporal constraints. Therefore,

the strategies developed in this thesis aims to achieve a near zero blackout time.

In order to find a reasonable solution for this problem, the reconfiguration activities

carried out over the proposed platform were further specified. In the following analysis,

at the first moment only current active services, that will undergo a reconfiguration, are

going to be considered. Afterwards, new components inserted into the system (e.g., due

to the arrival of new applications) are taken into account. So, let a subset S∗of the

current active service set Sbe defined, to denote the OS components that will undergo

a reconfiguration (hereafter also called migration):

S∗={s?

i, i = 1, ..., m}(6.5)

Where mis the number of components to be reconfigured. Please note that S∗⊆ S and

m≤n, where nis the number of currently active services present in the system.

Now, the system reconfiguration problem can be formulated in two sub-problems:

P1 The reconfiguration activities of a single component s∗

ineed to be performed in a

deterministic way, concurrently with the system activities. By this means, a com-

ponent reconfiguration can be accomplished without stopping system execution.

P2 Since mcomponents are going to be reconfigured, the order in which these com-

ponents are going to be reconfigured need to be carefully determined. If after each

component reconfiguration constraints Amax and Umax are not violated, no system

stop is required.

Hence, solving P1 and P2, the system can be reconfigured concurrently with system

execution. In the subsequent sections, the solution for both P1 and P2 sub-problems are

presented.

6.4. OS Component Reconfiguration

The reconfiguration of a component implies in the execution of respective Programming

and Migration phases (introduced in Section 5.2.1). Since the arrival times of these

activities are triggered by asynchronous events (system reconfiguration), they can be

modeled as aperiodic jobs. Furthermore, if these jobs can be executed concurrently with

the running services, without causing the violation of any deadline, a deterministic and

zero blackout time reconfiguration can be accomplished.

6.4. OS COMPONENT RECONFIGURATION 73

For a more precisely model of these reconfiguration activities, let them be denoted by a

set J:

J={Ji(Ja

i, Jb

i), i = 1, ..., m}.(6.6)

Where Jaand Jbrepresents the phases Programming and Migration, respectively. The

execution times of these phases have been already introduced in Table 5.2. For a job Ja

this time may be either Qsor Qh, depending on which direction a component is being

migrated. Similarly, Jbrepresents one of three possible migration cases: Ms,Mhor

Mw.

This scenario characterizes a situation where aperiodic jobs need to be carried out to-

gether with real-time periodic tasks. To cope with such scenarios, very efficient and

well known strategies from real-time scheduling theory can be here applied. The ba-

sic idea of these approaches is the inclusion of a new periodic task (called server) into

the system to reserve some bandwidth (CPU workload) to serve the aperiodic activities

(jobs). Thereby, the response time of these jobs can be improved with the guarantee

that deadlines of periodic tasks will not be violated. A more comprehensive and detailed

explanation of this technique is given in [159,160].

Among different types of servers, the analysis relies on Total Bandwidth Server (TBS)

[159] due to the following reasons:

•The Earliest Deadline First (EDF) is assumed as the policy adopted by the system

scheduler;

•Among other servers, it is one of the most efficient service mechanism in terms of

performance/cost ratio [160].

According to the literature, TBS assigns a deadline dkfor an aperiodic job karriving in

the system at time akin the following manner:

dk= max(ak, dk−1) + Ck

(6.7)

Where dk−1represents the deadline of the aperiodic job that has arrived before job k;

Usis the server bandwidth and Ckis the execution time requested by the aperiodic job.

Deadline dk−1is 0 if job kis the first one, or if all pending aperiodic jobs that arrived

before job khave already been finished.

74 CHAPTER 6. RUN-TIME METHODS

6.4.1. Applying Total Bandwidth Server

Preliminary Statements

The TBS server is used here as a technique to perform the migration of a component

being executed in one execution environment to the other one. Depending on the system

support, appropriated preemption/resumption facilities may or may not be available for

migration purposes. Thus, in some cases the migration need to occur only when the

service is not active or, it may be performed in the middle of the service computation.

Nonetheless, in both cases some amount of context data may be required to be transferred

across domains.

Even though there are some methods that allow preemption of hardware tasks as read-

back and scan-path (e.g., [127]), they can not be used for a preemption and posterior

resumption of a running component across hardware and software domains. Without

proper support during design phase, which needs to assure safe points of preemption/re-

sumption of a component and enough information about the reconfiguration costs, its

accomplishment would be rather difficult. Therefore, a framework for designing relo-

catable tasks across hardware and software execution domains was developed and will

be presented in Chapter 8. The proposed framework, together with a specific run-time

environment, allows the preemption of a task being executed in software and a posterior

resumption in hardware (and vice-versa).

Furthermore, it is worth to mention that due to the usage of a server, which is allocated to

CPU, the reconfiguration activities are going to be performed sequentially. However, this

fact does not represent a degradation of the hardware parallelism, since the availability

of a single configuration port on today’s FPGA serializes its programming operation.

Total Bandwidth Server Applied

The first phase Jaof a component reconfiguration is only a preparation of the execution

environment to the second phase, where the migration is actually performed. Hence, no

extra synchronization with the running tasks or services is required for its accomplish-

ment. Therefore, the appliance of the deadline assignment rule from TBS to schedule

Jais straightforward. Differently, the mapping of Jbto the TBS rule is not that trivial

and require thereby more attention.

In the following, the deadline assignment rule from TBS is applied on Jbunder different

situations. As a result, the minimal bandwidth Usfor TBS server is derived so that a

migration phase can be safely performed. These following analyses will start with the

assumption that no preemption is used for migration purposes. Then, in the sequence,

the case where preemption is allowed will be derived from previous results.

6.4. OS COMPONENT RECONFIGURATION 75

6.4.2. Deriving Migration Conditions

The execution of a job Jb, related to service i, can only start in time intervals where

no instance of siis active. Thus, considering that the service has a periodic behavior,

the execution of Jbhas to happen between two consecutive instances of this service.

Moreover, this has to be performed respecting the deadlines of this service.

If bi,k and fi,k represent the starting and finishing execution times of a kth instance of

service si, let ˆ

biand ˆ

fibe the starting and finishing execution times of Jb. The conditions

above can be formulated in C1 and C2:

C1 ˆ

bi3[bi,k;fi,k] : Job Jb

icannot start if an instance of a service sihas been started or

if this instance has not been finished yet.

C2 bi,k 3[ˆ

bi;ˆ

fi] : Once job Jb

ihas been started, it cannot be preempted by service si.

Condition C1 arises due to the absence of preemption support for migration. It avoids

that Jb

istarts to transfer the internal state of an instance kof service siif this instance

khas been started and still not finished. Condition C2, in its turn, guarantees that a

service siwill be started only after the finishing the execution of Jb

i. Note that a job Jb

may still be preempted by a service other than si.

In the following subsections every migration case is analyzed observing the conditions

stated above.

6.4.3. Software to Hardware Migration

In order to facilitate the following analyses, let ˆaiand ˆ

didenote the arrival and deadline

times of a job Jb

i, respectively. To ensure that Jb

iwill not start before instance si,k,Jb

is released only when si,k has been finished: ˆai≥fi,k.

In order to provide as much time as possible to execute Jb

i, the maximum time interval

between two consecutive instances of a service sicould be chosen. This happens in

a specific instance k∗where the lateness of siis minimal: ˆa∗

i=fi,k∗|(di,k∗−fi,k∗) =

mink(di,k −fi,k). Figure 6.2 shows such a situation.

However, to generally find the specific instance k∗when a periodic task set is scheduled

under EDF is not trivial, and its solution would incur in the usage of more complex

algorithms, which is not desired. Furthermore, a non wanted delay in the complete

reconfiguration time would be included due to the shifting of ˆaifrom fi,k to fi,k∗, which

actually depends on the current hyperperiod of the service set.

Nevertheless, it is possible to control the arrival time ˆaito be the same as the finishing

time fi,k, which can be, in a worst case, equal to the deadline di,k. Therefore, in the

analysis this case is considered in order to provide a result that can be applied to every

instance k. Hence, under this condition a minimal server bandwidth can be derived,

76 CHAPTER 6. RUN-TIME METHODS

in SW

TBS Jb,i

in HW

,ki

1,,

kiki

+ki

ˆi

,ki

+ki

Figure 6.2.: Optimal arrival time ˆaifor Jb

which will guarantee the non violation of conditions C1 and C2.

Figure 6.3 shows this situation, where the arrival of Jb

ioccurs at the arrival time of the

next service instance ai,k+1. Note that the relative deadline Diis assumed to be equal

to the period Pi.

in SW

TBS

in HW

1,, +

kiki af

1, +ki

Figure 6.3.: Worst-case arrival time ˆaifor Jb

Under EDF scheduling, the running task is always the one with the smallest absolute

deadline among the ready ones. Thus, it can be ensured that Jb

iwill not be preempted

by si,k+1 if ˆ

di≤di,k+1.

As the software component siwill be moved to hardware, its CPU utilization factor Ui

is released and can be added to the server bandwidth. Moreover, Jb

iphase needs to be

finished, in the worst-case, at di,k+1 −Eh

ito allow the next instance of the service i

(running afterwards in hardware) to finish before or at its deadline. Hence, the deadline

db,i assigned to Jb

iusing TBS is:

di= ˆai+Mw

Us+Ui

≤di,k+1 −Eh

i(6.8)

6.4. OS COMPONENT RECONFIGURATION 77

Noting that di,k+1 =ai,k+1 +Piand ai,k+1 = ˆai, the Equation 6.8 can be rewritten in

order to derive the minimal server bandwidth:

Us≥Mw,i

Pi−Eh

−Ui(6.9)

Equation 6.9 shows that the bandwidth required to migrate service siis equal to Mw

Pi−Eh

Hence, as the server uses the workload Uireleased by si,Usneeds to be at least equal to

the difference necessary to achieve the requested bandwidth. Although the consideration

of Uias being part of the server bandwidth, already in the instance k, seams to be

intuitively correct, a formal prove that it can be securely made is given in Appendix D.

If the condition expressed in Equation 6.9 is fulfilled under all services that will undergo

a migration from software to hardware, it can be guaranteed that conditions C1 and C2

will be satisfied when EDF and TBS are used.

6.4.4. Hardware to Software Migration

For this migration case, the arrival time of job Jb

ican be defined more precisely since

a service executing in hardware profits from the true parallelism of this environment.

Additionally, the server needs to provide additionally the workload that will be reclaimed

by siwhen running in software. Hence, the remaining bandwidth (Us−Ui) needs to be

big enough to support the migration of Jb

i. The deadline assigned to Jb

iby TBS is then

defined as:

di=ai,k +Eh

i+Mw

Us−Ui

≤di,k+1 (6.10)

Noting that di,k+1 =ai,k+1 +Piand that ai,k+1 =ai,k +Pi, the equation above can be

rewritten in order to derive the minimal server bandwidth required:

Us≥Mw

2Pi−Eh

+Ui(6.11)

6.4.5. Software Service Reconfiguration

Similar analysis as done above can be made for the case where a service is going to be

replaced by another service, but only in software domain. In this case, using the deadline

78 CHAPTER 6. RUN-TIME METHODS

assignment rule from TBS:

di= ˆai+Ms

Us+Ui−Unew

≤di,k+1 (6.12)

Where Unew

iis the new processor load used by service siafter migration.

Rearranging Equation 6.12 and solving it for Us, the minimal bandwidth required is:

Us≥Ms

+Unew

i−Ui(6.13)

Note that the period Piis not changed, only the execution time: Es,new

i, which refers to

the new service that replaces the older one. It this case, the bandwidth requested by Jb

is Ms

Piand the processor load released is Ui−Unew

6.4.6. Hardware Service Reconfiguration

Similarly, it may be necessary that a hardware service need to be replaced by another

one, also in hardware domain. In this case, it can be concluded that ˆai=ai,k +Eh

since no preemption occurs in a hardware service execution. In addition, the deadline

to finish Jb

ican be precisely defined as: ˆ

di=di,k+1 −Eh,new

i. Note that here also the

period Piis the same and only the execution time of the service siis different. Thus,

the assigned deadline to job Jb

iis:

ai,k +Eh

i+Mh

≤di,k+1 −Enew

hw,i (6.14)

As the temporal distance between ai,k and di,k+ais 2Pi, the minimal server bandwidth

for this case is:

Us≥Mh

2Pi−Eh,new

i−Eh

(6.15)

6.4.7. Migrating by Preempting

If a migration may be performed by preempting and resuming a service sifrom one

execution environment to the other one, a speed up in the whole system reconfiguration

may be achieved. This happens due to the fact that a migration is completed inside the

same period of a service execution.

6.4. OS COMPONENT RECONFIGURATION 79

Software to Hardware

Differently from the first scenario, the condition C1 is no longer necessary, since now

the arrival of job Jb

iis allowed to be even in the middle of the execution of the service

si. The C2 however, is still necessary to avoid that a service siwould start when the

related job Jb

iis under execution.

For this new scenario analyses, additional notations are introduced. In Figure 6.4 these

are further illustrated. Lets assume that job Jb

iarrives shifted from ai,k in a time distance

of σso that service siis stopped. In this moment, certain amount of computation of si

would have been executed already. Let ηEs

ibe this amount, so that 0 ≤η≤1. Thus,

it follows that the remaining computation that will be executed in hardware afterwards

is equal to (1 −η)Eh

in SW

TBS

in HW

E⋅

E⋅− )1(

1, +ki

Figure 6.4.: Migrating by preempting: software to hardware.

The value ηdepends on σand it is, therefore, defined for each instance kof a service si:

ηk=









0 : if ai,k +σk< bi,k

1 : if ai,k +σk> fi,k

ai,k−si,k+σk

i: otherwise

(6.16)

For this case, deadline di,k assigned to job Jb

i, arriving at ai,k +σusing TBS need to be

smaller than Pi−(1 −η)Eh

iin order to guarantee the achievement of the deadline di,k

of service si. Hence, applying TBS rule for Jb

i, similarly as in the scenarios explained in

above sections, the deadline ˆ

diis defined by:

di= ˆai+σ+Mw

Us+ (1 −η)Ui

≤di,k+1 −(1 −η)Eh

i(6.17)

80 CHAPTER 6. RUN-TIME METHODS

Hence, as already noticed in the previous analyses, di,k+1 =ai,k+1 +Piand ai,k+1 = ˆai,

the Equation 6.17 can be rewritten to solve Usas follows:

Us(η, σ)≥Mw

Pi−(1 −η)Eh

−(1 −η)Es

(6.18)

It has to be noticed that Equation 6.18 is defined only for 0 ≤η < 1, since when

η= 1, service sihas already finished its execution and the migration will be performed

in the next instance. Additionally, if job Jb

iarrives in the beginning of a instance so that

σ= 0, which also implies in η= 0, the scenario will be identical to the case analyzed in

Section 6.4.3. Indeed, making σ= 0 and η= 0 in Equation 6.18, this becames identical

to Equation 6.9.

For this migration case, the minimal bandwidth required for TBS is no longer possible

to be determined alone, but it depends on values σand ηinstead. Thus, it is not trivial

to specify a minimal Us, even for a worst-case scenario. Nevertheless, it is noticeable

that the bigger σis the smaller the temporal distance available for a migration will be.

As a consequence, depending on the arrival time ˆai, the minimal server bandwidth Us

required may not be feasible. However, as the work presented in this thesis envisioned

soft real-time systems, a reasonable procedure would be to shift the arrival ˆaito the

finishing time fi,k.

Hardware to Software

Based on the analyses made above, and on the scenarios where preemption were not

allowed, the examination of a migration case of a service going from hardware to software

is deductive. For clarification, Figure 6.5 shows such a case.

Hence, by similar analyses, it follows that the deadline di, k assigned for job Jb

i, arriving

at ai,k +σis equal to:

di=ai,k +ηEh

i+Mw

Us−(1 −η)Ui

≤di,k+1 (6.19)

Noting that di,k+1 =ai,k+1 +Piand that ai,k+1 =ai,k +Pi, the equation above can be

rewritten to solve Us:

Us(η, σ)≥Mw

2Pi−ηEh

+ (1 −η)Ui(6.20)

Also for this case, the arrival time ˆaiof job Jb

ican be shifted to fi,k, which derives the

same results where preemption is not used.

6.5. SCHEDULABILITY ANALYSIS 81

in SW

TBS

in HW

1, +ki

E⋅− )1(

E⋅

Figure 6.5.: Migrating by preempting: hardware to software.

6.5. Schedulability Analysis

All analyses made in the above sections were based on the proper assignment of arrival

time and deadline of aperiodic reconfiguration activities, and the establishment of con-

ditions for definition of the server bandwidth. These analyses were made in order to

properly represent the conditions imposed in different explained scenarios.

As a typical scenario is assumed, where services are periodic and scheduled using the

EDF policy, a standard schedulability analysis can be applied. Additionally, in the

approach here presented no modification on deadlines or arrival times of any periodic

service was performed. As a consequence, if after every migration step the sum of all

processor utilizations (used by every software component) plus the server bandwidth

does not exceed a maximum Umax, the feasibility of the schedule is guaranteed. Beside

that, there is also a hardware constraint Amax (maximum FPGA area available). Thus, it

follows that the FPGA needs to have available the area requested by each service being

transferred to it. These conditions implies that before every service migration either

enough FPGA area is available, or enough CPU workload. Furthermore, after each

service migration, the new resources available need to be enough for further migrations.

Only in this way, it can be guaranteed that the system will have be valid in terms of

schedulability.

Defining Tsand Thas the services running in software and hardware, respectively,

after one migration a different configuration is achieved: Ts∗and Th∗. Thus, if for

every component migration the following conditions are fulfilled, the schedulability are

82 CHAPTER 6. RUN-TIME METHODS

guaranteed:

i∈T s∗

Ui≤Umax ;X

i∈T s∗

Ai≤Amax (6.21)

In order to find a feasible schedule for every task migration, the service subset S?(defined

in Section 6.3) needs to be previously sorted in a proper order. In other words, the

sorting of services that will suffer a reconfiguration characterizes a schedule problem

under resources constraints.

6.6. OS Components Scheduling

Within this section, a heuristic algorithm is proposed for scheduling the order in which

components are going to be reconfigured. The components need to be sorted in such a

way that after each component migration, the constraints specified in Equation 6.21 are

not violated.

In order to tackle this problem in a proper manner, the components that will undergo

a reconfiguration in their own execution environment are separated from the ones that

will change it. Given the set S?, three new subsets are defined:

SsServices that will be reconfigured within software;

ShServices that will be reconfigured within hardware;

SwServices that will be relocated between hardware and software.

Please note that Ss∪ Sh∪ Sw=S?.

From Equation 6.21 it can be concluded that the lower PUiand PAiare, the higher

are the chances in finding a feasible schedule. Therefore, the services from subset Ss

will be scheduled first only if this will leads to a reduction in the final CPU workload,

otherwise, they will be scheduled at the end. The same rule is applied on the components

from Sh, in respect to FPGA area. Hence, the problem is reduced by finding a schedule

for the service set Sw.

Due to the same reasons, new OS services that may arrive, are inserted (configured) into

the system only after accomplishing the reconfiguration of the complete set S?.

If |Sw|=x, the feasible solutions Sfis a subset of x! possible scheduling solutions

(permutations of components in Sw). To solve this problem, Bratley’s algorithm [161]

could be applied, in which the search space for a valid schedule is reduced. Nevertheless,

the worst-case complexity of this algorithm is still O(x·x!) (NP-Hard problem), as

x! paths of length xhave to be analyzed. Because of that, a heuristic algorithm was

developed to schedule the order of components that will undergo a reconfiguration.

6.6. OS COMPONENTS SCHEDULING 83

The scheduling problem was broken down in two different problems. First, only compo-

nents that must migrate from CPU to FPGA are properly sorted, generating in this way

a so called Partial Schedule. The same procedure is applied on components that must

migrate from FPGA to CPU. Second, the complete schedule is generated by properly

merging both Partial Schedules.

6.6.1. Partial Schedule

The basic idea of this heuristic is the use of component costs (ci,j) as a criteria for

searching a solution in the tree of all possible schedules. Looking at the components

that need to leave the CPU, the strategy is the following: try to relocate the component

with the highest software cost and with the smallest hardware cost first. In this way,

after this component reconfiguration, the CPU workload will be decreased as much as

possible, and in the same way, the FPGA area used will be increased as little as possible.

Similarly, the same strategy is applied to the components that need to leave the FPGA.

Consequently, two partial schedules are generated using the strategy explained above.

Let Sa ={sa1, ..., sap}and Sb ={sb1, ..., sbq}be the components that need to leave

the CPU and FPGA, respectively, so that Sa ∪Sb =Swand Sa ∩Sb =∅. Let Ia =

{i1, ..., ip}be the index array that represents Sa sorted by decreasing software costs, so

that {ci1,1≥ci2,1≥... ≥cip,1}. Similarly, Ja ={j1, ..., jp}is defined as the index array

that represents Sa sorted by increasing hardware costs: {cj1,2≤cj2,2≤... ≤cjp,2}.

The partial schedule, PS, shown in Algorithm 4starts comparing the first two indexes

of Ia and Ja (k= 1). If no match (same index in both arrays) is found, it expands the

search (k= 2) on the first two components of Ia and Ja (total of four components).

Hence, the search is done gradually (line 5) until a match is found. If this is the case,

the index is removed from both arrays, the schedule is updated and the search restarts

on the remaining arrays. Note that a match is always found, since the same elements

from Ia are also presented in Jb. Hence, the algorithm will always terminate.

It can be seen that for every kvalue the algorithm calculates, in worst-case, 2k−1

comparisons. Thus, for a worst-case scenario when searching for a match, where the

search is done over the whole array (k=p=|Ia|=|Ja|=|Sa|), the total number of

comparisons will be 1+3+5+...+(2p−1) = Pp

i=1(2i−1) = p2(which is the maximum

number of combinations that can be done between two arrays of size p). Therefore, the

complete partial schedule algorithm has a complexity of O((n−1)n2), since for every

index match found, the search will be applied again on a reduced index array.

If Ib and Jb are defined as the index arrays that represent Sb sorted by decreasing

hardware costs and sorted by increasing software costs, respectively, the same partial al-

gorithm is applied on Sb using these two index arrays. Both partial schedules generated,

PS(Ia, Ja) and P S(Ib, Jb), are then used to produce the complete schedule.

84 CHAPTER 6. RUN-TIME METHODS

Algorithm 4 Partial schedule.

1: Icur ←Ia;Jcur ←Ja

2: PSa ← ∅

3: while |Icur| 6= 0 do

4: match ←0

5: for k←1,|Ia|do

6: for i←1,(k−1) do

7: if Icur[i] = Jcur[k]then

8: match ←1

9: break

10: end if

11: end for

12: if match = 1 then

13: break

14: end if

15: for j←1,(k)do

16: if Icur[k] = Jcur[j]then

17: match ←1

18: break

19: end if

20: end for

21: if match = 1 then

22: break

23: end if

24: end for

25: PSa ←PSa +i . Update the schedule

26: Icur ←Icur −i;Jcur ←Jcur −j

27: end while

6.6.2. Complete Schedule

The whole schedule WS is generated selecting the components from the partial schedules,

similar to an interleaving manner. The algorithm, presented in Algorithm 5, starts

selecting one component from every partial schedule. However, it select first from the

larger (in size) partial schedule. The selection of the remaining components is done

respecting the following rule: The relation between the number of components that have

been selected from P S(Ia, Ja) over the ones that have been selected from P S(Ib, Jb)

should be similar to the relation between the lengths of |PS(Ia, Ja)|and |PS(Ib, Jb)|.

The complexity of this algorithm is O(n), since it has only one while loop with simple

operations.

After obtaining the complete scheduling, a check is needed to verify if this solution is

a feasible one. This is done by evaluating the PUiand PAiafter every component

6.7. CHAPTER CONCLUSIONS 85

Algorithm 5 Whole schedule.

1: WS ← ∅

2: if |PSa|≥|PSb|then

3: PSatmp ←PSa

4: PSbtmp ←PSb

5: else

6: PSatmp ←PSb

7: PSbtmp ←PSa

8: end if

9: R←PSatmp/PSbtmp

10: WS ←WS +PSatmp[1]

11: PSatmp ←PSatmp −PSatmp[1]

12: WS ←WS +PSbtmp[1]

13: PSbtmp ←PSbtmp −PSbtmp[1]

14: r←1

15: s←1

16: while (|PSatmp| 6= 0 and |PSbtmp| 6= 0) do

17: if r/s ≤Rthen

18: WS ←WS +PSatmp[1]

19: PSatmp ←PSatmp −PSatmp[1]

20: r←r+ 1

21: else

22: WS ←WS +PSbtmp[1]

23: PSbtmp ←PSbtmp −PSbtmp[1]

24: s←s+ 1

25: end if

26: end while

27: WS ←WS +PSatmp +PSbtmp .If one array gets empty

reconfiguration, obeying thereby the scheduled obtained previously. If one system con-

straint is not fulfilled during one of the steps of evaluation, the solution is said to be not

feasible and the heuristic does not find a feasible solution.

6.7. Chapter Conclusions

This chapter presented the main methodologies and strategies developed and available

inside the RRM system. These comprise heuristic algorithms for allocation of OS services

on the hybrid architecture (observing thereby different goals), a well defined strategy to

reconfigure each single OS service by means of relocation between CPU and FPGA, and

heuristics for sorting the components to be reconfigure.

86 CHAPTER 6. RUN-TIME METHODS

The strategies developed to carry out the reconfiguration activities allow the achievement

of a zero blackout time. This is accomplished by several meanings. First, the FPGA

reconfiguration time is hidden completely, which is usually the most time consuming

operation in a reconfigurable computing system based on this technology. Second, by

using techniques from real-time scheduling theory, the effective relocation of the service

across CPU and FPGA is performed in a deterministic way. Last, but not least, a

heuristic was developed to assure the correct order in which the components should

undergo a reconfiguration.

Additionally, it has to be noticed that the communication-aware allocation algorithm

here developed is more appropriated for a scenario were OS services are able to call fur-

ther services, as modeled in Figure 5.2. For its application in the system here proposed,

different scheduling strategies would need to be investigated, since in this case, for in-

stance, dependencies among services would need to be considered. Hence, this algorithm

is considered as an additional feature of RRM, and further work in this direction will be

discussed in Chapter 10.

All heuristic algorithms were developed observing the requirement to present a low

computation complexity and good efficiency, since they all need be executed at run-

time. Therefore, in the next chapter, simulations will be performed with each of these

heuristics for evaluation purposes.

CHAPTER 7

Methods Evaluation

This chapter presents evaluations performed with the proposed heuristic algorithms by

means of simulations. Thereby, the performance and efficiency of proposed methods and

strategies in the previous chapter are evaluated.

The MATLAB tool was chosen for this purposes since it has the necessary features and

facilities for the required analyses as well as an environment, which promotes system

modeling. For instance, due to the availability of solvers for Linear Integer Programming

and Binary Integer Programming, results from the proposed heuristics algorithms could

be compared with optimal solutions.

7.1. OS Components Allocation

In this section, the proposed heuristic for allocation of OS components is evaluated. First,

the assignment and balancing heuristics are tested. Then, in the sequence, the modified

balancing algorithm is evaluated concerning its efficiency in reducing the number of

components being reconfigured.

7.1.1. OS Components Assignment

The testbed created for evaluating the OS component assignment algorithm has the

following configuration. A set S, having size of n= 20 was randomly generated. So,

88 CHAPTER 7. METHODS EVALUATION

each component has varying random costs in the range: 1% ≤ci,1≤15% and 5% ≤

ci,2≤25%. The maximum FPGA resource was defined to be 100% (Amax = 100), as

well as for CPU (Umax = 100). The components assignment were calculated for every

system using the Binary Integer Programming solver from MATLAB (which delivers an

optimal solution), and the heuristic algorithms.The average value of total cost (U+A)

and the absolute difference cost (|w1U−w2A|) were compared for different values of δ

(the resource usage balancing restriction): 0.5,1,2,3,4,5,10,20,30,40,50 and 60. These

average values were achieved by running the experiment 100 times.

The solutions provided by the assignment heuristic were very similar to the optimal one,

if δconstraint has values around 10% (δ≈10%). Concerning fulfillment of δconstraint

(see Figure 7.1, Heuristic-1), it can be said that, the smaller the δthe poor the results

given by this heuristic. This was expected to be so, since the assignment heuristic does

not consider the balancing restriction.

0 10 20 30 40 50 60

|U−A|

Unbalance

Optimal

Heuristic−1

Heuristic−2

Figure 7.1.: Unbalance average for different δconstraints.

7.2. Balancing Heuristic

The application of the balancing heuristic over the solution provided by the previous one

deliver a better balancing B. This can be seen in the Figure 7.1, Heuristic-2. However, an

increase in the total assignment cost was verified for cases where the balancing heuristic

has achieved an improvement in the balancing B(δ/10%). Nevertheless, the total

7.2. BALANCING HEURISTIC 89

cost assignment achieved by this heuristic algorithm were quite satisfactory: maximum

of 15% bigger if compared with the optimal case (see Figure 7.2, Heuristic-2).

0 10 20 30 40 50 60

100

110

120

130

140

150

160

170

180

190

200

U+A

Total Cost Estimation

Optimal

Heuristic−1

Heuristic−2

Figure 7.2.: Total cost assignment average for different δconstraints.

7.2.1. Reconfiguration Cost Reduction

Similarly to the evaluation made above, randomly systems were generated here. For

each case Algorithm 1was applied and the results were used as input for two balancing

heuristics: Algorithm 2and Algorithm 3, generating therefore two different assignment

solutions: Γ1and Γ2. The assignment solutions Γ0

1and Γ0

2given by the same algorithms

applied to the subsequent system generated, simulates the next system configuration,

and therefore the need for system reconfiguration. The amount of components having

different allocations, which requires reconfiguration, between two subsequent assignment

generated was counted.

By comparing the results between those two kinds of balancing heuristics, it can be eval-

uated the gain obtained by the Algorithm 3towards reduction of reconfiguration costs.

As already mentioned in previous chapter, the reconfiguration costs reduction is achieved

indirectly by trying to reduce the number of components being reconfigured. The results

achieved by the heuristics in these scenarios are presented in Figure 7.3. The number of

components required to be reconfigured when using the original balancing algorithm is

represented by the curve Heuristic-2a, and by Heuristic-2b the results provided by the

modified balancing heuristic.

90 CHAPTER 7. METHODS EVALUATION

010 20 30 40 50 60

# components

Heuristic−2b

Heuristic−2a

Figure 7.3.: Number of components being reconfigured for different δconstraints.

Heuristic-2a: original balancing algorithm. Heuristic-2b: modified balanc-

ing algorithm.

In addition, the balancing achieved using the modified algorithm was also evaluated.

Figure 7.4 shows the results achieved by the original algorithm (Heuristic-2a) and the

modified one (Heuristic-2b). The improvement made in both case were satisfactory.

Note that the results achieved by Heuristic-2b, concerning balance improvement, are

quite under the constraint δ. This is due to the fact that in Heuristic-2b it still searches

for more pairs to be swapped (even with δconstraint being fulfilled) in order to reduce

the number of reconfigurations.

In order to analyze the influence of the modified algorithm in the overall resource uti-

lization, the results between these two algorithms were drawn together and presented in

Figure 7.5. It can be seen that the modified balancing algorithm presents a pay-off in

the overall resource utilization, especially for values of δbigger than 5%.

The modified balancing algorithm did provide reduction in the number of components

being reconfigured with an acceptable incremental in the total resource utilization cost.

Additionally, depending on the individual component costs, a significant reduction of

the reconfiguration costs may be achieved. This fact should guide the designer by the

decision of its usage or not in the target system.

7.3. COMPONENTS RECONFIGURATION SCHEDULING 91

0 10 20 30 40 50 60

|U−A|

Heuristic−2a

Heuristic−2b

Figure 7.4.: Unbalance average for different δconstraints.

7.3. Components Reconfiguration Scheduling

Since for each scheduling problem, more than one feasible solution may exist, the ef-

ficiency of the proposed heuristic was measured by means of its capacity in providing

one feasible solution among the existing ones. Therefore, all possible feasible solutions

were generated calculating all possible components permutations. Due to computation

complexity in this case, the system size was limited in 8 components, which may produce

a maximum of 8! possible solutions.

For this experiment, random system pairs (representing the current and new system

configuration) were randomly generated and both, Algorithm 4and Algorithm 5for

partial and complete scheduling, respectively, were applied on them. Every system pair

generated was tested under different system constraints: U=A= [100%,90%,80%,

70%,60% and 50%]. For each pair, the component costs were generated observing the

following range: 1% ≤ci,1≤15% and 5% ≤ci,2≤25% for the first system, and

5% ≤ci,1≤25% and 1% ≤ci,2≤15% for the second one. The intention thereby, is to

cause a considerable number of components to be reconfigured. For each system pair,

the experiment was repeated 100 times.

The dashed line at Figure 7.6 shows the percentage of cases where at least on feasible

scheduling exists, by probing all scheduling combinations. The solid line shows the per-

92 CHAPTER 7. METHODS EVALUATION

0 10 20 30 40 50 60

130

140

150

160

170

180

190

200

U+A

Total Cost Estimation

First Phase

Heuristic−2a

Heuristic−2b

Figure 7.5.: Payoff in U+Adue to the balancing algorithm modified for reconfiguration

costs reduction.

centage of cases when the heuristic did find at least one feasible scheduling, considering

only the cases where at least one feasible solution exists. Based on the results shown

in Figure 7.6, it can be concluded that the smaller Umax and Amax constraints are, the

poorer the efficiency of the heuristic algorithm is. Nevertheless, the efficiency of the

heuristic algorithm decreases slower than the chance of existing at least one solution,

by decreasing Umax and Amax constraints values. For instance, when U=A= 70%, in

40% of the cases, at least one feasible solution exists, and in around 80% of this cases,

the heuristic did provide a feasible solution.

7.4. Communication costs reduction

To evaluate the proposed communication-aware allocation algorithm, the same environ-

ment used in the evaluation of the allocation algorithm was created. Additionally, it was

communication graphs of size n= 20 were randomly generated respecting the following

relation: Cα< Cβ< Cγ, which reflects the characteristics of the proposed architecture

(communication costs across execution domains are the most expensive ones).

The results achieved by the allocation algorithm were compared considering in the input,

first the pure generated systems, and second using the clustered system (generated as

7.4. COMMUNICATION COSTS REDUCTION 93

5060708090100

100

U and A constraints

Percentage

Chance to find at least one solution using Heuristic

Percentage of solutions found over all possibilities

Figure 7.6.: Component reconfiguration scheduling: heuristic algorithm evaluation.

described in Section 6.2).

Figure 7.7 presents the communication cost increment obtained in each execution domain

(HW: Hardware, SW: Software) and also between them (SW-HW: across CPU-FPGA).

The lines in the graphic indicate, in percentage, the amount of communication cost in-

creased when comparing the assignment results when clustering process was used against

the results when it was not used. Hence, a negative percentage value indicates that a

reduction of the communication costs was achieved when clustering process was used.

The evaluations were performed for different number of passes in the cluster process

(denoted by #folding in Figure 7.7).

The results obtained show that the amount of communication costs among the hardware

and among hardware-software components were reduced when the proposed clustering

heuristic were applied. On the other hand, the communication cost among software

components suffers from a little increase. This tradeoff is caused due to the fact that the

clustering heuristic promotes a balance among communication costs inside and across

execution domains.

The results show further that the maximum gain obtained in reducing the communication

costs stays around 10% in average. This limit comes from the fact that the clustering

algorithm avoids to merge two components that overcomes the constraints λ1,and λ2

(introduced in Section 6.2). These constraints avoid the efficiency deprecation of the

94 CHAPTER 7. METHODS EVALUATION

0 1 2 3 4 5 6 7 8 9 10

−15

−10

−5

# foldings

Percentage of increase

Comm. costs reduction compared to original solution

SW−HW

Figure 7.7.: Evaluation results comparison.

assignment algorithm due a non homogeneity distribution of the component cost values.

Additionally, it can be observed in Figure 7.8 that there exists an increment in the

overall resource utilization (payoff) when clustering process is applied. This increment,

however, was always under the value of 10%, which is acceptable.

In order to examine the influence of clustering heuristic for different relations of Cα,Cβ

and Cγ, three additional evaluations were conducted, using the same testbed explained

above. So, three different relations among the component costs were respected for κ

(where κis the average of the generated communication costs):

Case 1 κ={Cα,2Cα,3Cα}

Case 2 κ={Cα,2Cα,8Cα}

Case 3 κ={Cα,2Cα,15Cα}

The results obtained in these three evaluations provided the same patterns shown in

Figure 7.7 and Figure 7.8. For further details, the graphics obtained in these experiments

can be seen in Appendix A.

7.5. CHAPTER CONCLUSIONS 95

0 1 2 3 4 5 6 7 8 9 10

# foldings

Payoff (%)

Payoff in component costs utilization

Total component cost utilization (U+A)

Figure 7.8.: Payoff in overall resource usage due to clustering process.

7.5. Chapter Conclusions

Within this chapter heuristics algorithms proposed in the previous chapter were eval-

uated by means of simulation and comparison with the optimal solutions provided by

MATLAB tool.

OS services allocation algorithms, comprising heuristics for assignment and different

versions of balancing improvement, did show satisfactory performances under different

conditions.

The heuristic for sorting the components under reconfiguration was tested. The results

show that its efficiency is quite satisfactory, taking into consideration the hardness of the

problem, which increases rapidly (O(n·n!)) in the number nof components undergoing

a reconfiguration.

The communication-aware allocation algorithm was also evaluated, even though it can-

not be directly applied in the proposed RRM system. Therefore, its efficiency was limited

for the gains observed in the reduction of the communication costs among the commu-

nicating components being located in the hybrid architecture. It is worth to mention

that this specific strategy has not the intention to minimize the communication costs.

If so, a multi-objective optimization problem would be characterized and, therefore, an

appropriated heuristic would be required.

96 CHAPTER 7. METHODS EVALUATION

CHAPTER 8

Design Support

An important issue in choosing a RTOS, especially in the development of embedded

systems, is the available support for designing, monitoring, debugging, etc. This depends

further on the hardware architecture used as execution platform. For instance, depending

on which CPU is going to be used, a specific cross-compiler may be required. For the

execution platform used in this work, which is based on the Virtex-II Pro FPGA device,

the main implementation support is already provided. This comprises the EDK and ISE

tools from Xilinx Company (see Section 2.3.1). Additionally, VHDL code for hardware

parts and C/C++ for the software parts are used.

With those tools it is possible to generate a complete embedded system, comprising CPU

plus the reconfigurable hardware, and the correspondent executables. Nevertheless, the

system proposed in this work is based on dynamic relocation of components both intra-

execution domain as well as inter-execution domain. This fact imposes some challenges

that cannot be easily solved only by making use of compilation tools. In this work the

solution for this problem is based in providing support in higher level, when designing

relocatable OS components (which can be used also for relocatable application tasks

design). Therefore, within this chapter two main approaches covering these aspects are

proposed.

First, an automatic interface generation strategy, for connecting software components to

dynamically reconfigurable hardware components, is introduced. Second, a framework

that allows the design of dynamically relocatable components, across CPU and FPGA,

is presented.

98 CHAPTER 8. DESIGN SUPPORT

Since the design support presented here is not limited for generation of OS services,

but also for application tasks, hereafter the term task is used to generically denote a

relocatable component.

8.1. Hardware-Software Interface Synthesis

Among different communication schemes for coupling hardware and software, the tra-

ditional method, Memory Mapped I/O, was selected. It is suggested and supported by

Xilinx tools along with the CoreConnect bus architecture, from IBM. This means that

each IP attached to this bus can be accessible from software side, by read/write actions

in specific memory regions. In the case of a design based on a Virtex-II Pro FPGA, such

an interface is shown in Figure 8.1, where the OPB bus is assumed.

FPGAPower PC

DREAMS

OPB

Driver

TASK

(SW)

File

OPB IPIF

TASK

(HW)

OPB (On-Chip Peripheral) BUS

Virtex-II Pro

Figure 8.1.: Virtex-II Pro hardware/software interface.

The EDK tool support for HW-SW interface generation, such as in Figure 8.1, is only

basic. In software side, it restricts itself in providing low level read/write functionalities

to access a certain number of software registers (a generic OPB Driver), previously

informed in the design entry. For hardware side, an address decoder is generated, called

OPB IPIF (IP Interface) [162], which is responsible to solve the write/read commands

between OPB bus and a set of hardware registers (Register File). Hence, the generation

of a basic infrastructure to interconnect OPB Driver with Register File is automated

and supported by EDK tool.

Since no further support is provided for HW-SW interface generation, it is responsibility

of the designer to integrate the Register File with the hardware task, and the OPB

Driver with the remaining software part (e.g., DREAMS OS). This limitation is due to

the fact that EDK interface generation tool do have neither appropriated knowledge of

hardware task internals, nor sufficient knowledge of software side (target OS)internals.

Thus, in this thesis, an extension of this basic interface support, both in HW and SW

sides, was created.

8.1. HARDWARE-SOFTWARE INTERFACE SYNTHESIS 99

8.1.1. OS Driver Extension

A feature included is the automatic generation of a mapping between registers present

in hardware, the Register File (from now on called physical registers), and registers

present in software (called virtual registers). Figure 8.2 shows a typical interface design

situation, where a single physical register may concatenate bit fields, each one having

different meaning. Usually, these bits need to be accessed individually in the software

side. In the example, a physical register Reg1 contain the following distinct bit fields:

ACK, RDY and DATA1. Likewise, Reg2 contain two other data: DATA2 and DATA3.

On the software side, these five bit fields are split into five virtual registers.

OPB BUS

ACK RDY DATA1

-------- Reg1

Reg2

DATA3DATA2

Reg1

Reg2

Reg3

Reg4 DATA2

Reg5 DATA3

DATA1

RDY

ACK

SOFTWARE HARDWARE

Figure 8.2.: Mapping between physical and virtual registers.

The interface generation creates, in addition, the necessary functionalities in software

side in order to directly access those virtual registers. Thereby, the direction (read/write)

of each register is observed and solved automatically by these functions. Moreover, these

functionalities are all automatically generated for DREAMS OS. Thus, the software

component, running on this OS, can make direct access to these virtual registers.

With this feature incorporated into the interface generation tool, the designer can gen-

erate an interface for any IP, which automatically maps nphysical registers to a set of

mvirtual registers. Furthermore, these registers can be directly accessed by the SW

threads running in DREAMS.

8.1.2. Software Interface for Reconfigurable IPs

The proposal of this thesis is based on dynamic reconfiguration of components running

in hardware. Since memory mapped I/O is the technique used to connect IPs to software

domain, one may think on using a flexible memory mapping management, which needs

to support a dynamic number of IPs. Moreover, each IP may require a different address

range in the memory system address.

Alternatively, it was decided to specify a fixed address range for each IP connected to the

system. Reason for this choice is the limitation of the modular design flow [43] technique,

which advises to statically partition the FPGA surface in fixed number of slots. In each

100 CHAPTER 8. DESIGN SUPPORT

slot an IP can be reconfigured. Following these guidelines, for each physical slot a

physical base address is assigned in the CPU memory address. This assignment is static

and is made in the hardware architecture design phase. In a first glance, one may think

that this fact imposes a limit in the number of physical register available for an IP (slot).

However, in the following it is shown how this limitation is overcome.

In order to allow an arbitrary amount of data to be exchanged using a fixed number

of physical registers, a communication protocol was defined. This protocol defines five

virtual registers, REG ID, DATA, READY, CMD, and START, which enables to read-

/write in any of the registers of Register File. REG ID is used to identify the register

in the Register File. DATA contains the data that will be copied to, or from, the tar-

get register, depending on the content of the register CMD. This register specifies the

action that will be performed with the register identified by REG ID. Up to now, only

WRITE and READ commands have been implemented. The remaining registers are

used to start/stop and to provide the handshake during data exchanging. These five

virtual registers are mapped into three physical registers, which do not belong to the IP

The adoption of the above strategy has some further advantages. First, the data ex-

change scheme based upon a fixed number of physical registers overcomes the problem

of a fixed and limited IP memory range. This can become a problem depending on the

number of IPs attached to the system, or if the address range of the I/O bus system is

not big enough to embrace all IPs. Second, the overhead caused by the protocol could

be considered as a drawback in this strategy. However, due to the relative small number

of busmacros wires, which is a resource typically used in system based on partial recon-

figuration (as discussed in Section 2.3.3), the IP address decoding capability is already

limited.

8.1.3. Integration into IFS Tool

The Interface Syntheses (IFS) Design Flow is the result of a research conducted at the

University of Paderborn, in the scope of a Phd work [19]. It is intended for an auto-

matic generation of interfaces, enabling the connection of different type of components.

Furthermore, it has an adequate level of abstraction that allows it to cover different

media types and different communication protocols, in both connected endpoints. In

cooperation with the authors, IFS was extended in order to automatically generate an

interface, which connects an arbitrary reconfigurable IP to DREAMS OS, following the

strategies explained above.

To allow the creation of the data structures expected from OS driver, which need to be

DREAMS conform, C++ classes templates give the structure of the possible registers

that can be formed. For further information, in Appendix Bthe class diagram of the

registers templates is presented. This information, along with adequate and sufficient

knowledge of the underlying hardware architecture, was included into IFS tool. The IP

8.2. RELOCATABLE TASKS DESIGN 101

Based on these inputs, and using proper mechanisms, IFS generates an appropriate

interface for the specified IP, which is in accordance with the strategies specified above.

8.1.4. Further Extension for DREAMS

After interface generation, the designer starts the integration of the virtual register

access functions with the remaining software code in DREAMS. Since a protocol is

used to communication with the IP, access patterns when using those functions can be

observed. For instance, to write a data in one register of the Register File, according to

the specified protocol, following steps are necessary: (1) wait until READY is set to OK

(channel is idle), (2) write the ID of the related register into REG ID, (3) write the data

into DATA, (4) write the command for writing into CMD, and (5) set up the START. If

more registers are sent in sequence, then this process needs to be repeated accordingly.

Therefore, it would be adequate to provide those kinds of patterns as functionalities,

increasing the abstraction of this low-level register access. A further advantage of this

approach is that, further changes in low level access (e.g., the protocol improvement by

including burst transfers support) are hidden.

In the current implementation, this layer is incorporated into DREAMS in the form

of a library, which allows a seamless integration of the IP functionalities in a more

abstract manner. Figure 8.3 depicts the required interaction and data exchange as

previously explained, using as an example a sample code of Relocation IP Manager,

which is responsible to manage the location of the task, either in FPGA or in CPU.

...

…

_send_data (...args...);

_receive_data (...args..);

…

...

while(!READ(RDY);

Loop:

_write(REG_ID,R1);

_write(DATA..arg.);

_write(CMD,WRITE)

_write(START,1);

End Loop;

Loop:

_write (REG_ID,R13);

...

If (_in_hw_ == 1) {

…

_foo1_HW (..args...);

…

}

else

…

_foo1_SW (...args..);

…

}

Relocatable IP

Manager IP WRAPPER

LIBRARY LOW LEVEL

REG ACCESS

Figure 8.3.: A method foo1 HW build upon library calls.

8.2. Relocatable Tasks Design

Task migration, in the software only execution domain, has been analyzed by several

researchers in the past and techniques are almost consolidated. The requirement for

task migration appears in distributed system scenarios. In these cases, a task may be

102 CHAPTER 8. DESIGN SUPPORT

dynamically relocated over the several computation units in order to, for instance, allow

an adequate adaptation of the computational load.

Inside the same execution unit, the relocation of a complete task was required in the

early times, when fixed or dynamic partitioning of the physical memory were the current

techniques applied for the memory management. Nowadays, virtual memory schemes

based on paging and segmentation are superior. These new methods allow the division

of the task code in several pages, which are then freely placed on the memory frames

[60].

For tasks running on FPGA this technique cannot be used, and therefore several authors

propose different approaches to handle the relocation of tasks into the RH in order to

overcome the fragmentation of the RH area. Some examples have been already analyzed

in Section 3.2.7.

When multitasking is the paradigm adopted, then an additional feature that may be

required is the support for task preemption. Standard and fast techniques are available

for software tasks and are widely used. For hardware tasks, there existing approaches

based on different techniques. Hence, as long as the preemption and resumption re-

mains inside the same execution domain (intra-domain migration), it is not a problem.

Nevertheless, the migration type demanded here needs to happen across different exe-

cution domains (inter-domain migration), whose computation paradigms are completely

different: sequential versus spatial computation. This by far imposes a big challenge to

accomplish.

As it has been already noticed in Section 3.4.1, this problem has not been completely

addressed in literature. Even though that some researchers have also identified that

task relocation/migration across domains are necessary in order to provide the required

flexibility for an execution platform, they are limited in the system level analysis. Hence,

no solutions for designing such relocatable tasks have been yet proposed.

8.2.1. Unified Task Representation

When stopping a computation being performed, for instance, by the FPGA, which will

be resumed later on in the CPU, context data has to be transferred across execution

domains. So, a mapping between these two different contexts needs to be identified. In

addition, it has to be assured that the preemption/resumption point of a computation

needs to be equivalent in both CPU and FPGA. In other words, a match between two

different implementations of a computation (CPU and FPGA) is necessary.

If both FPGA and CPU tasks are designed independently, the identification of these

match points is not trivial and sometimes impossible, since the used computation paradigms

are completely different. Therefore, the approach used in this work is based on the iden-

tification of these match points in early stages of a task design. For this purpose, the task

needs to be represented in a unified manner, where those match points can be specified.

8.2. RELOCATABLE TASKS DESIGN 103

Then afterwards, this task can be translated to hardware and software implementations

in a proper way so that these match points are kept.

In this thesis a state transition graph, similar to the one introduced in [134], is used as

a unified task representation. Figure 8.4 shows an example of such a graph, where each

state represents a computation block performed inside a task. The number of states and

their granularity is defined by the user.

State_1

check_point_1

START

FINISH

migration_point_1:

[var1, var2, ...]

Figure 8.4.: State transition graph representation of a relocatable task.

The major goal by using this representation is the ability to establish the matching points

between compiled and synthesized task versions, for CPU and FPGA respectively. Thus,

each state transition specifies a switching point that represents a possible relocation

point. Nevertheless, the state transition possibilities showed in the graph help the user

in the specification of the context data, which is associated to each matching point only

(hereafter called migration point). Thus, migration points and states are paired one to

one.

The arcs in the graph (state transition) represent the possible Switching Points for a

task. Thus, each arc carries the context information, which needs to be transferred into

the related migration, allowing the correct execution of the remaining computation in

the target domain.

In order to allow an automated code generation, the information given by the user,

represented using the proposed unified form, should be expressed in a formal manner.

By the time of writing this thesis, the specification of a proved language for a formal

description of switching points and context data was in its first steps and has not been

finished. Thus, the tool for automatic code generation in this case was not completed.

Even though, an informal specification exists, along with templates for code generation,

strategies for a hybrid task generation and a run-time infrastructure to perform task

relocation. These all together belongs to the framework for relocatable task design and

will be presented in the following sections.

104 CHAPTER 8. DESIGN SUPPORT

8.2.2. A Framework for Relocatable Task Design

The complete framework design flow consists of three steps, which are applied after

having the task represented using a state transition graph as explained above. The end

result of the framework is the hardware and software versions of the task, along with

a migration manager component. The latter is incorporated into the target operating

system as a driver, offering the necessary interface to make the usage of services related

to a task migration (e.g., preempt/resume, save/restore context, etc.). The complete

design flow of a relocatable task embraced by the framework is depicted in Figure 8.5

and will be explained in the following.

Task

Structure

Description Code

Generation User

Coding

Task

(HW)

Migration Manager

Xilinx

gcc

Task

(SW)

Task_

Conroller_SW Task_

Conroller_HW

Task HW

Code

Patterns

Migration Manager

Task SW

Code

Patterns

Task_

Conroller_SW Task_

Conroller_HW

Task.bit

Task.obj

Mig_Mng.obj

Step 3

Step 2

Step 1

Figure 8.5.: Design flow supported by the framework.

Context Data Representation

In the first step switching points and context data are identified for each migration

point using the unified task representation, which are afterwards specified using an

informal description, called Task Structure Description (TSD). Figure 8.6 shows an TSD

example, which comprises two segments: Context Data Definition and Migration Points

Definition.

Instead of defining the context for each migration point individually, it was chosen to

define a set of variables comprising all context data candidates. Then, for each migration

8.2. RELOCATABLE TASKS DESIGN 105

var_1: integer;

var_2: integer;

var_3: bit_vector(64);

var_i: bit_vector(32);

(a) Context Data Definition.

migration_point_1: {0,1,1,…,0};

migration_point_2: {1,1,0,…,0};

migration_point_j: {0,0,1,…,1};

(b) Migration Points Definition

Figure 8.6.: Informal description of TSD.

point a sub set from this context set is specified, since the migration points may probably

share some common context data.

Each entry in the Context Data Definition set (Figure 8.6a) has a form of var i:variable ;,

where var i is a variable specified for the context data and variable is a unified type spec-

ification. Some commonly used are, for instance, integer, bit, bit-vector, etc. But when

they are translated by the code generation tool, to C and VHDL, the type of one variable

may have two different versions. For example, the bit-vector type has straightforward

representation in VHDL. However, in C a possible representation may be the use of

integer arrays.

In the second segment (Figure 8.6b), each migration point is specified by its context data

set, which is a subset of the previous one. The syntax of each entry in this segment has a

form of migration point j:{...x i...};, where x i is 1 if variable var i (specified in

Context Data Definition) needs to be transferred in this migration point, and 0 otherwise.

Code Generation

Based on TSD, the framework generates: code patterns for task coding; two Task Con-

trollers (one for each execution domain) that directly interfaces the task with the run-

time system; and the Migration Manager entity that is responsible to manage the mi-

gration of a task and, at the same time, provide an appropriated interface to DREAMS

OS.

The Task Controller is an interface between Migration Manager and the task to be mi-

grated. It is responsible for extraction and/or restoration of related context data, and to

drive specific signals to the task (e.g., when task preemption is required). The code gen-

erated for Task Controller is synthesized and compiled together with the hardware and

software task code, respectively. Hence, Task Controller belongs to the task component.

Furthermore, the Task Controller keeps a Register File, where context data extracted

is stored, or from which the context data are taken to be restored back. In the case

of a hardware task, the methodology explained previously in Section 8.1 is applied to

106 CHAPTER 8. DESIGN SUPPORT

generate the appropriate interface to Migration Manager.

The Migration Manager entity is responsible to coordinate the context data transfer

when a migration is being performed, communicating therefore with both Task Con-

trollers. Moreover, this component offers proper functionalities (a driver) to DREAMS

OS in order to manage a task migration. Examples of these functionalities are get context,

set context,task preemt, and task resume.

Since the behavior of both Task Controller and Migration Manager are well defined, de-

pending only on the specification given by TSD, they are stored into the code generation

repository as parametrizable templates.

Task Coding

In order to assure a correct communication between Task Controller and the task itself,

the latter must be coded following predefined code patterns and coding rules. In the

following, the procedure for hardware task coding is explained, while the application of

the same procedure for a software task is straightforward.

Figure 8.7 shows the architecture skeleton of a hardware task generated by the Frame-

work. The VHDL code generated comprises a entity -architecture pair, along with

signals for context data representation. Each Task Process is one VHDL process used

for hardware task coding. It has to be noticed that the proper mapping of states (defined

in the state transition graph) to VHDL processes is the responsibility of the program-

mer. For instance, one single state, specified in the graph, may be implemented using

more than one process. Therefore, State Synchronization Controller is used to resolve

the Task Out State signal, which indicates which state the task execution is. Task Con-

troller needs this information in order to assure a correct context data saving and/or

restoring.

A hardware task may be implemented using several Task Processes. However, each

process that make use of context data signals must be created upon a pattern shown

in Listing 8.1. It can be seen that signal enable, driven by Task Controller, is used to

indicate when a task is under normal execution (not under migration or suspension).

All meaningful task computation needs to be done only when this signal is asserted.

In order to support the programmer in solving Task State Out signal, another VHDL

process pattern, presented in Listing 8.2, is provided by the framework. It allows the

correct integration of State Synchronization Controller with the hardware task.

Additionally, the programmer needs to assure that when a task is enabled for computa-

tion, it can use only those signals specified for the migration point right before this state.

Only they have been maintained at the migration point and have thus valid values.

For information completeness, further details of Task Controller and Migration Manager

are given in Appendix C.

8.3. CHAPTER CONCLUSIONS 107

Task Out

Context Task Out

State

Context

Task

Hardware

State Enable

State

synchronization

signals

State

Synchronization

Contoller

(Code

pattern)

(Space for

user coding)

Task Process

Figure 8.7.: Hardware task overview.

Binary Files Generation

In Step 3 of the the hardware and software task versions are generated, together with

the Migration Mananger. For a hardware task, a partial bitstream is generated following

the guidelines provided by Xilinx in [43]. For the software task and for the Migration

Mananger the gcc compiles is used.

8.3. Chapter Conclusions

In this chapter, the necessary design support for the proposed architecture was presented.

This support comprises two aspects: the provision of an automatic interface generation

between a hardware component and DREAMS OS; and a framework that supports

programmers in designing relocatable components for the hybrid architecture.

It has to be noticed that the framework is proposed with the intention to support the

designer in the development phase of a relocatable task. Additionally, it generates the

necessary infrastructure to enable the migration at run-time. By this means, the strate-

gies provided by RRM (explained in the last chapters) can make use of this infrastructure

to decide at run-time about the allocation of each relocatable component.

The number of switching points defined by the designer of a hardware task influences

directly the migration flexibility. Indeed, as higher the number of switching points a

higher degree of flexibility is achieved. Since each switching point represents a matching

point between the computation in software and hardware, one may think in a first glance

that this may represent a serialization of the hardware execution. This is true for an

108 CHAPTER 8. DESIGN SUPPORT

Listing 8.1: VHDL code sample for Task Process template.

1...

2-- << USER CODE/

3-- define your signals and variables here

4-- USER CODE >>

6-----------------------------------------------------

7-- Code pattern for Task Process --

8-----------------------------------------------------

9process(CLK)

10 -- << USER CODE/

11 -- define your variables here

12 -- USER CODE >>

13 begin

14 if( CLK = ’1 ’ and CLK ’ event ) then

15 if( enable = ’0 ’) then

16 -- << USER CODE/

17 -- reset your own variables and signals

18 -- USER CODE >>

19 else

20 -- << USER CODE/

21 -- All the computations go here

22 -- USER CODE >>

23 end if;

24 end if;

25 end process ;

26 -----------------------------------------------------

28 -- NEXT PROCESSES GO IN THE FOLLOWING ---------------

29 ...

algorithm which presents a high degree of parallelism. However, there is always a limit

in which an algorithm can be parallelized, either because of its nature or because of the

available hardware area available for its implementation. Moreover, there is no limit

of parallelism that can be used when implementing a single computation block (state),

since a task computation can be preempted only at switching points. Hence, the designer

needs to find a proper compromise between flexibility and efficiency when searching for

a optimal mapping of the algorithm in a state transition graph.

It is worth to mention each switching points should also be selected so that the amount

of related context data would be relatively low. This will result in the generation of

a small Task Controller (i.e., less FPGA area) and will decrease the amount of time

necessary for migration (i.e., faster switching time).

Furthermore, knowing the amount of migration data required in each state transition,

8.3. CHAPTER CONCLUSIONS 109

Listing 8.2: VHDL code sample for State Synchronization Controller template.

1...

2-------------------------------------------------------

3-- Code pattern for State Synchronization Controller --

4-------------------------------------------------------

5process(clk)

6-- << USER CODE/

7-- define your variables here

8-- USER CODE >>

9begin

10 if( clk = ’1 ’ and clk ’ event ) then

11 if( enable = ’0 ’) then

12 state_out <=0;

13 -- << USER CODE/

14 -- reset your own variables and signals

15 -- USER CODE >>

16 else

17 -- << USER CODE/

18 -- logic for determining Task_Out_State

19 -- USER CODE >>

20 end if;

21 end if;

22 end process ;

23 -------------------------------------------------------

24 ...

and additionally the characteristics of the execution platform, it is possible to estimate

the time necessary for task relocation. This is important since it gives the necessary

information for the reconfiguration management strategies, as explained in Section 6.4.

110 CHAPTER 8. DESIGN SUPPORT

CHAPTER 9

Case Study

This chapter presents a case study that has been carried out to validate the core meth-

ods developed in this thesis. Nevertheless, the focus is given in the evaluation of the

framework in designing a relocatable OS service, and the practicability in relocating a

component between FPGA and CPU along with related strategies.

9.1. Target OS Service

An encryption algorithm has been selected as target OS service, since it is being increas-

ingly demanded by safety and security applications (e.g., when using a smart phone

to access a bank account). Although a hardware implementation of such an algorithm

may provide better performance compared to its software counterpart, the latter is still

commonly deployed in computer systems. Furthermore, encryption algorithms are con-

sidered an OS service, for both desktop and embedded scenarios [163].

Triple-DES

Among different kind of encryption algorithms, a Triple-DES (shortly TDES) was cho-

sen, since it is one of the most known and widely used one. The TDES, depicted in

Figure 9.1, is based upon a basic encryption-decryption block, called DES (Data En-

cryption Standard), which is an algorithm for a block cipher operating with a 64-bit key

on 64-bit plaintext blocks (in this case).

111

112 CHAPTER 9. CASE STUDY

DES Encryption

DES Decryption

DES Encryption

Plaintext

Ciphertext

Key 1

Key 2

Key 3

(64-bit)

(a) Triple-DES encryption process.

DES

Encryption

Plaintext block

Ciphertext block

Key DES

Decryption

Plaintext block

Ciphertext block

Key

(b) DES cipher block for encryption and decryp-

tion.

Figure 9.1.: Triple-DES and its basic DES cipher block.

DES Block Cipher and its Operation Mode

A single block cipher maps n-bit plaintext blocks to n-bit ciphertext blocks, where nis

equal to 64 in this case. Since each block cipher operates on blocks of fixed length, but

plaintext messages to be encrypted can be of any length, the input data (plaintext) need

to be partitioned into 64-bit blocks. From now on, there are different ways in which

the block ciphers are employed on these blocks, and they are called operation modes.

Among these modes, the CBC mode (Cipher-Block Chaining) has been chosen, which is

the most commonly mode used.

Block Cipher

Encryption

Plaintext block: pb1

Initialization Vector (IV)

Ciphertext block: cb1

Block Cipher

Encryption

Plaintext block: pb2

Ciphertext block: cb2

Block Cipher

Encryption

Plaintext block: pbt

Ciphertext block: cbt

XOR XOR XOR

…...

Key KeyKey

Figure 9.2.: CBC mode for block ciphers.

Figure 9.2 presents the DES configured for the CBC mode used to encrypt the input, a

plaintext. If the plaintext pb is divided into tblocks, so that pb =pb1. . . pbt, after the

encryption a number of tciphertext blocks, so that cb =cb1. . . cbt, will be generated as

the output (the cb ciphertext).

In the CBC mode the encryption works as follows. Every incoming plaintext block pbjis

9.2. RELOCATABLE TRIPLE-DES 113

XORed with the previous encrypted ciphertext block cbj−1before being applied to the

block cipher algorithm. For the first block pb1an Initialization Vector (IV) is used. In

this way, since all encryptions are chained, each ciphertext depends on all the plaintext

blocks before it. Hence, an intrinsic sequential operation is required.

In Figure 9.1a it can be seen that the final output cipher block cbjis not achieved by

applying only one time the DES encryption. Instead, each input pbjis streamed through

a chain of DES encryption/decryption blocks, using thereby three different keys (one for

each DES phase). Hence, the encryption procedure of an input plaintext using TDES

can be described as stated in Algorithm 6. Please note that for a plaintext with a length

bigger than the input specified (64-bit for this case), it needs to be partitioned in a

certain number of blocks (controlled by block count in line 17) and each of them needs

to be sequentially fed when requested (line 11).

Algorithm 6 Triple-DES pseudo algorithm.

1: define variables:

key 1, key 2, key 3, plain block,cipher block,nblock,Init V ector and block count;

2: Begin

3: read in(key 1); .State 1

4: read in(key 2); .State 1

5: read in(key 3); .State 1

6: read in(nblock); .State 1

7: read in(Init V ector); \\IV .State 1

8: block count ←nblock;.State 1

9: cipher block ←Init V ector;.State 1

10: while block count > 0do

11: read in(plain block); .State 2

12: plain block ←plain block XOR cipher block;.State 2

13: cipher block ←DES Encryption(key 1,plain block); .State 2

14: cipher block ←DES Decryption(key 2,cipher block); .State 3

15: cipher block ←DES Encryption(key 3,cipher block); .State 4

16: send out(cipher block); .State 5

17: block count ←block count −1; .State 5

18: end while

19: End

9.2. Relocatable Triple-DES

A possible representation of the TDES algorithm using the unified task representation,

is shown in Figure 9.3. Each DES block is mapped to a single state (States 2,3 and

4), since there is only one hardware instantiation of the DES encryption-decryption

algorithm. Moreover, two additional states were included: State 1 and State 5. The

114 CHAPTER 9. CASE STUDY

fist is responsible for gathering the initial data (three keys, Initial Vector, and block

numbers),and the latter is responsible to send out the ciphertext block and to decrement

block count. The initialization phase of the algorithm is not represented in this figure.

The variable nblock is used to indicate how many blocks of plaintext will be applied into

the TDES.

state_1

mp_1 : [key_1, key_2, key_3, init_vector, nblock]

state_2

mp_2 :[key_1, key_2, key_3, plain_block, block_count]

state_3

state_4

state_5 mp_i: migration_point i

[key_1, key_2, ...]: context to be transfered

mp_3 :[key_1, key_2, key_3, cipher_block, block_count]

mp_4 :[key_1, key_2, key_3, cipher_block, block_count]

mp_5 : [key_1, key_2, key_3, cipher_block, block_count]

START

FINISH

Figure 9.3.: State transition graph corresponding to Algorithm 6.

Figure 9.3 shows further the context data identified, which later was specified in the

TSD format, in order to allow the system to be generated using the framework. The

plaintext is read in State 1, thus plain block data should be transferred in the case of a

migration between States 1 and 2. The variable block count is used to count how many

blocks of plaintext are still waiting in the input. Moreover, due to the fact that a state

transition exists, from State 5 to State 1, all keys used by TDES must be transferred in

every migration case. If this transition would not exist (the case of a single TDES pass),

after each DES block one key less would be necessary to be transferred.

It is worth to mention that the chosen state graph presented chosen for this case study is

only one possibility among several others. One looking for a more coarse grain approach

may combine all three DES encryption-decryption states in a single one, so that a pipeline

approach in hardware may be used. However, this will certainly leads in an increase of

the FPGA area needed for its implementation. Another possibility is to go towards

a more fine grained approach, so that the DES decryption-decryption block could be

depicted allowing the specification of further states.

9.3. TESTBED SET UP 115

9.3. Testbed Set Up

Determinism is an important aspect of a real-time system. Therefore, it is not a good

practice to allow an application task to perform its activity with an arbitrary amount

of input data at once, since this may lead to an unbounded execution time. This may

complicate the scheduling analysis. For same cases, an approach used is to split the entire

computation in time, using therefore a periodic task. Hence, the whole computation is

accomplished after executing a bunch of instances of this task. So, by using this strategy,

it is possible to consider this task into the scheduling analyses.

This strategy was also followed in this case study. An application task was created so

that in its initialization phase it divides the data to be encrypted in several blocks, and

in each of the subsequent instances of this periodic task, one block is encrypted.

By adopting such a strategy, the TDES OS service will also present a periodic behavior,

which perfectly fits in the migration cases analyzed in the Section 6.4. Hence, the RRM

is able to relocate an OS service from software to hardware (or vice-versa), either in

between two consecutive instances of a service execution (before State 1), or preempting

it. Furthermore, this strategy allows the usage of this service by other application tasks,

which characterizes it even more as an OS service.

Basic Execution Platform

The execution platform built for the purpose of this case study is presented in Figure 9.4.

Some further details are omitted for legibility reasons. The architecture was constructed

using the Virtex-II Pro device XC2VP7, on which one PowerPC 405 hardcore micropro-

cessor was used. The system was set up so that the microprocessor was clocked internally

at 200MHz and the busses at 100MHz.

processor local bus (plb) plb2opb

bridge on-chip peripheral bus (opb)

PPC 405

arbiter

ddr-sdram rs232

Triple-DES

in hardware

Figure 9.4.: Basic hardware platform.

The Triple-DES OS service was designed and implemented using the framework and

the interface generation as explained in previous chapters. The hardware version was

attached to the OPB bus of the system and the software version placed into the external

memory, along with the remaining software system that is executed by the PPC405

116 CHAPTER 9. CASE STUDY

microprocessor.

For this case study a static instantiation of the architecture was used, instead of partially

configuring the FPGA. Alternatively, the partial reconfiguration was emulated by provid-

ing inside the system two dummy stubs load sw task(); and load hw task(). Nonetheless,

this fact does not compromise the results achieved by this case study, whose time cost

can be estimated by using the results presented in related work.

An additional hardware component was added to the platform, a serial interface. This

is merely used to enable an interface to the user so that the system can be started,

stopped, and debugged.

Along with the software version of Triple-DES, three other threads were created. One is a

periodic application thread that, in each instance, calls the Triple-DES OS service getting

back the results. A second thread was responsible to manage the migration/relocation of

the OS service, communication therefore with the Migration Manager. The third thread

was responsible for controlling the UART in order to provide the interface with the user.

9.4. Quantitative Results

FPGA and Memory Costs

Table 9.1 presents the FPGA area (in LUT units) and the memory (in bytes) consumed

by the main system parts, according to the report given by the EDK tool. The DES-

block HW, implemented following the templates given by the framework, consumes 2320

LUTs (23% of the total FPGA resources) and Task Controller HW uses 2108 LUTs

(21% of the total amount).

HARDWARE LUT Utilization

DES-Block HW 2320 23%

Task Controller HW 2108 21%

Entire FPGA 9856 100%

SOFTWARE Memory Utilization (bytes)

DES-Block SW + 9794

Task Controller SW

Migration Manager 18236

Table 9.1.: FPGA and memory utilization.

The majority of the FPGA resources required to implement Task Controller HW was

related to the migration controller, rather then the task itself. Thus, the resources used

by Task Controller HW is less related with the resources used by the task itself.

The values presented in Table 9.1 are resulted from a first prototype of the framework.

9.4. QUANTITATIVE RESULTS 117

Neither the system design nor its implementation has received special care in the code

optimization. Therefore, less FPGA resource utilization for the migration system can

be foreseen. Moreover, taking into consideration that a great flexibility is achieved from

the run-time relocation capability, this resource requirement increase is acceptable.

Reconfiguration Time

The FPGA reconfiguration time related to the hardware component generated (DES-

block HW plus Task Controller HW ) was estimated using the results provided in [164].

This work was chosen due to its close similarity with the architecture used in this case

study. According to it, the reconfiguration time Tdof a partial bitstream is given ap-

proximately by:

Td≈fd×1

t+1

wd(9.1)

Where fdis the equivalent number of frames in the bitstream, tis the rate at which a

frame is transferred from memory to the OPB peripheral, and wdis the rate at which a

frame is written into the ICAP entity.

Using the results shown in Table 9.1, the partial bitstream corresponds to 2320+2108 =

4428 LUTs. From the official information given by Xilinx Company in [40], the XC2VP7

FPGA device (the same used in this case study and in the related work) has a total of

1320 frames and 9856 LUTs. Hence, it can be concluded that 4428 LUTs represent

roughly 593 frames in this device. Using t= 7.86 frames/ms and wd= 117 frames/ms,

since the buses frequency are the same as in [164], the reconfiguration time of the OS

service component would be:

Td≈593 ×(1

7.86 +1

117.0) = 81ms (9.2)

Following the model proposed in this thesis for a reconfiguration activity, (see Chap-

ters 5and 6), this is the execution time Qhof job Ja(Programming phase).

Migration Time

The time spent to transfer the context date during one component relocation was not

estimated, but measured using specific physical outputs available in the development

board used for this testbed. The value measured corresponds to the complete time spent

since the moment in which the component recognizes the signal sent from Migration

Manager for preemption, until finishing the complete reconstruction of the context data

in the target execution domain. The average values measured are presented in Table 9.2.

118 CHAPTER 9. CASE STUDY

SW ⇒HW HW ⇒SW

10µs 8µs

Table 9.2.: Context data transfer: average time measured.

The migration times observed are rather the same for each one of the possible preemption

points. This is due to the fact that the amount of context data specified in Figure 9.3

is the same, even though the different variables in each case. These values even incor-

porate the overhead caused by the usage of the communication protocol (specified in

Section 8.1.2).

9.5. Chapter Conclusions

In this chapter a case study was performed, in which the main mechanisms for enabling

the incorporation of flexibility together with efficiency could be analyzed. As a target

OS service, a specific encryption algorithm was chosen, which fits well in the scope of

the work proposed in this thesis.

Based on the quantitative results, especially on those related to the resource utilization,

it could be identified that a relative big amount of resources are spent only for the

management of the relocation activities. This is more evident when looking to the

FPGA area utilization. Nonetheless, one can argue that this resource is traded in favor

of the great flexibility obtained from the capability to relocate hybrid components at

run-time.

Reconfiguration and migration times need to be considered separately for a reasonable

analysis. In certain systems the FPGA reconfiguration time, which stays in the range of

dozens of milliseconds, may represent an obstacle for some target applications, especially

when it needs to happen in a relatively short time. In the system proposed in this thesis,

this time is hidden by executing it concurrently with the application.

These results are caused by, first, properly choosing the migration points, so that in each

migration case, a low amount of data is required to be transferred. A second reason is

that the context data do not contain any data processed by the component. In this

case study, plaintext and ciphertext are not included into the context data. Instead, the

application task transmits and receives the plaintext and ciphertext, respectively, using

the same communication channel, regardless the allocation of the OS service (hardware

or software).

This approach is beneficial, since transparency is achieved for the application tasks

that do not need to be aware of the execution environment of the required OS service.

However, this solution may impose an obstacle for some applications requiring very

high performance. Alternatively, the hardware architecture could be incremented with

9.5. CHAPTER CONCLUSIONS 119

another bus, interconnecting the hardware components and the main memory (where

the data processed would be stored), so that both software and hardware components

may have access to them, without loosing communication efficiency.

120 CHAPTER 9. CASE STUDY

CHAPTER 10

Conclusion & Outlook

10.1. Summary

This thesis work presents a set of methodologies, strategies, and mechanisms that, when

incorporated into a reconfigurable RTOS, promotes its self-reconfiguration over the ex-

ecution platform. The main goal, thereby, is to enable an efficient utilization of the

hybrid computational resources, shared among application tasks and OS services. By

allowing a flexible management of the OS services, the proposed system is capable to

assign those operating system services to the computational resources not currently used

by application tasks.

The focus of this thesis is on embedded systems requiring flexibility, high computation

capability, and soft-real time constraints. Some contemporary products in the marked

already provide such characteristics, like modern mobile phones or PDAs. In such de-

vices, applications may enter or leave the system dynamically, characterizing in this way

a changing environment.

In order to achieve the proposed goals, two research fields were combined: reconfigurable

computing (RC) and self-reconfigurable operating systems (self-x OS). Therefore, an

extensively survey of relevant works from both areas was made. Thereby, it was identified

that an OS can also profit from a reconfigurable architecture along with the application,

instead of only providing support to the application.

Following the tendency towards fine-grained RC architectures, this work bases its exe-

cution platform on a FPGA device. Its reconfigurable capabilities, especially the partial

121

122 CHAPTER 10. CONCLUSION & OUTLOOK

reconfiguration, make it well suited for the purposes of this work. Based on the results

presented in Chapter 2, it can be concluded that even though the technical realization of

the partial reconfiguration of a FPGA is still a problem, some tendencies were identified

towards improvement of tools and support from the side of the FPGA suppliers.

System Overview

The main system parts can be summarized in Figure 10.1, which provides an overview

of the whole scenario. In this figure, it is shown that both OS and application use the

computational resources of the hybrid architecture for their execution.

APPLICATION

OS API

HYBRID

OPERATING SYSTEM

ANALYSES

RECONFIGURATION

MANAGEMENT

Monitoring

Reconfiguration

Requirements

HYBRID

COMPUTATIONAL

RESOURCES

USES

Execution

Platform

Application

Operating System

Run-time

Reconfiguration

Manager

(RRM)

Figure 10.1.: System overview.

The main contributions of this thesis concentrate in the RRM part that provides several

heuristic algorithms necessary to tackle NP-Hard problems identified in the proposal,

methodologies to conduct the reconfiguration activities in deterministic manner, and a

preliminary proposal towards run-time profiling of OS services. Besides that, proper

models of the reconfigurable components along with an extended model for enabling an

efficient management of reconfiguration activities were proposed.

The Analyses subsystem acquires dynamic information about the usage of OS services

by the application through API calls monitoring. Here, a preliminary analysis regarding

10.2. OUTLOOK 123

the recognition of usage profiles of the OS services was performed. However, for the

analyses made in the thesis, a simplified version of those metrics was used. Addition-

ally, this subsystem receives off-line information from the possible applications of the

system, regarding their requirements. These are used when starting the application in

the system, giving the information about new OS services that may be required. Using

this information, the Analyses decides where each OS service needs to be placed, either

on FPGA or on CPU.

The Reconfiguration Management administrates the reconfiguration activities, compris-

ing the relocation/reconfiguration of each single component and the ordering of compo-

nents undergoing a reconfiguration, respecting thereby time constraints. To accomplish

this administration, a proper model of these activities was provided that allows the

appliance of specific techniques from real-time scheduling theory.

Besides, the relocation of hybrid components across FPGA/CPU boundaries was deeply

investigated in this thesis. As a result, a novel framework, comprising a run-time execu-

tion environment and a design flow that guides the programmer by designing relocatable

components, was developed and implemented. Furthermore, the interface between soft-

ware and dynamically reconfigurable hardware components was automatically generated

by taking advantage of a previous developed Interface Synthesis (IFS) tool in [19]. For

this purpose, the IFS tool was properly augmented with the information of the platform

(target OS adopted, FPGA internals information, etc.), so that desired interfaces could

be automatically generated for the underlying architecture.

In order to promote a verification of main contributions in the practice, a study was

conducted by setting up a basic architecture and choosing therefore an appropriated

target OS service, which was an encryption algorithm. Furthermore, this component

was implemented using the proposed framework allowing, hence, its dynamic relocation

across CPU/FPGA. The tests performed have enabled the identification of some topics

that may be further explored in a future work.

10.2. Outlook

As pointed out in Chapter 4, run-time profiling of OS service usage by the application

may improve the quality of the decision related to the placement of OS services. By

identifying patterns in which a service may be called, and furthermore, by identifying

relation among these services, the allocation result may be more efficient. For instance,

using the near past information concerning the usage of a certain service, the system

would be able to infer, for the near future, the pattern expected by other services.

Furthermore, these results may even avoid unnecessary reconfiguration activities that

cannot be foreseen by a system that uses only current information.

Another aspect that may have direct impact, in the system presented by this thesis, is

the frequency in which the system is required to reconfigure. This depends on which

124 CHAPTER 10. CONCLUSION & OUTLOOK

specific target embedded system the proposed techniques are going to be used. In the

case of a system where the reconfiguration is directly influenced by human inputs (mobile

phone where the user randomly starts/stops applications), the reconfiguration frequency

cannot be precisely determined. So, as a future work, different case studies should be

investigated in order to estimate the system reconfiguration frequency. This information

may be used to avoid unnecessary reconfigurations and also to avoid the system to

become inoperable (a big reconfiguration frequency).

Concerning a hardware service of an operating system, some comments are worth to be

made. First, the area vs. speed trade-off, intrinsically related to a hardware implementa-

tion of an algorithm, could be further exploited in order to make available more versions

of the same service to the allocation algorithms, and so possibly achieving better results.

This would have nevertheless direct impact in the framework proposed in this thesis.

Additionally, as long as enough space is available in the FPGA, replication of OS services

may also be considered. This could increase the reliability of the system, or even allow

heavy loaded services to be supported by the system.

The flexibility incorporated into the operating system by means of reconfiguration, as

proposed in this thesis, could be expanded in order to promote the relocation and re-

configuration of applications tasks. Considering that the system architecture is based

on microkernel concept, this approach should not be a problem, since both application

tasks and operating system services are seen as components connected to the microkernel

level. Nevertheless, most of proposed heuristic algorithms must be adapted in order to

consider the dynamic relocation of application tasks.

It is worth to mention that more focus could be incorporated into the strategies and

methods presented here concerning power consumption, since this is one of the most im-

portant factors that influence the design of an embedded system, along with performance

and flexibility.

In spite of these future research directions, a question may also be opened concerning

the appropriateness of using a FPGA as a mainstream technology for the execution

platform envisioned within this thesis. Further efforts should be spent in investigations

towards evaluation of, for instance, platforms based on coarse-grained reconfigurable

architectures.

APPENDIX A

Further Evaluation Results

Additional evaluations were performed with the communication-aware allocation algo-

rithm. For this experiment, three different relations of Cα, Cβand Cγwere used to

generate the input graphs.

Figures A.1a,A.1b and A.1c show the gain in the communication cost reduction in

each case.The related payoff in the overall resource usage utilization is shown in Fig-

ures A.2a,A.2b and A.2c.

It can be seen that there is no significant difference in the gains obtained in Case 2 and

Case 3 when compared to the gain obtained in Case 1. Thus, for larger differences in

communication costs between inter and across domains, the algorithm do not achieve

better results, even when the number of passes is increased.

125

126 APPENDIX A. FURTHER EVALUATION RESULTS

0 1 2 3 4 5 6 7 8 9 10

−15

−10

−5

# foldings

Percentage of increase

Comm. costs reduction compared to original solution (Case 1)

SH−HW

Total

(a) Case 1: κ={Cα,2Cα,3Cα}

0 1 2 3 4 5 6 7 8 9 10

−15

−10

−5

# foldings

Percentage of increase

Comm. costs reduction compared to original solution (Case 2)

SH−HW

Total

(b) Case 2: κ={Cα,2Cα,9Cα}

0 1 2 3 4 5 6 7 8 9 10

−15

−10

−5

# foldings

Percentage of increase

Comm. costs reduction compared to original solution (Case 3)

SH−HW

Total

Figure A.1.: Comparison among communication costs reduction for three different situ-

ations.

127

0 1 2 3 4 5 6 7 8 9 10

# foldings

Payoff (%)

Payoff in component costs utilization (Case 1)

Total component cost utilization (U+A)

(a) Case 1: κ={Cα,2Cα,3Cα}

0 1 2 3 4 5 6 7 8 9 10

# foldings

Payoff (%)

Payoff in component costs utilization (Case 2)

Total component cost utilization (U+A)

(b) Case 2: κ={Cα,2Cα,9Cα}

0 1 2 3 4 5 6 7 8 9 10

# foldings

Payoff (%)

Payoff in component costs utilization (Case 3)

Total component cost utilization (U+A)

Figure A.2.: Comparison among payoffs in overall resource utilization for three different

situations.

128 APPENDIX A. FURTHER EVALUATION RESULTS

APPENDIX B

HW/SW Interface Generation

Figure B.1 illustrates the class diagram of possible interface registers used in the auto-

matic interface generation process for DREAMS OS.

129

130 APPENDIX B. HW/SW INTERFACE GENERATION

<< Singleton >>

MemoryMap

−noOfPhysRegs:short

−noOfRegs:short

−physRegArray[noOfPhysRegs]:PhysicalRegister<int>*

−registerArray[noOfReg]:Register<class Data>*

<< virtual >>+~MemoryMap():void

+Instance():MemoryMap*

+init():void

<< template >>+getRegister(name:char*):Register<Data>*

+doWrite(reg:Register<int>*):void

+doRead(reg:Register<int>*):Register<int>*

#MemoryMap():void

#MemoryMap( :const MemoryMap&):void

#MemoryMap& operator = ( :const MemoryMap&):void

+instance():MemoryMap*

<< template >>

RegisterOut

+RegisterOut(name:char*,addr:RegisterT):void

#read():Data*

+write(data:Data*):void

<< template >>

PhysicalRegister

+PhysicalRegister(name:char*,addr:RegisterT):void

<< template >>+read(reg:Register<D>*,offset:int):Register<D>*

<< template >>+write(reg:Register<D>*,offset:int):void

+read():Data*

+write(data:Data*):void

<< template >>

RegisterInOut

+RegisterInOut(name:char*,addr:RegisterT):void

+read():Data*

+write(data:Data*):void

<< template >>

RegisterIn

+RegisterIn(name:char*,addr:RegisterT):void

+read():Data*

#write(data:Data*):void

<< template >>

#regName:char*

#regAddr:unword

#data:Data*

+Register(name:char*,addr:RegisterT):Register<Data>

+getName():char*

+getRegAddr():RegisterT

+getRegSize():short

+setData(data:Data*):void

+getData():Data*

<< virtual >>#read():Data*

<< virtual >>#write(data:Data*):void

1..*

1..*1..*1..*

Figure B.1.: Class Diagram of possible interface registers.

APPENDIX C

Hardware/Software Task Design

C.1. Hardware Task Controller Template

Figure C.1 shows the template of a hardware task controller. The signals located at the

upper part of this figure are connected to the hardware task, and the signals at the lower

part are connect with Migration Manager.

C.2. Sequence Graphs for Two Migration Cases

The following two sequence graphs present the actions performed in two different migra-

tion cases, specifying concrete interface-callings that are performed across them. Fig-

ure C.2 shows a migration case from CPU to FPGA, and Figure C.3 the opposite case.

131

132 APPENDIX C. HARDWARE/SOFTWARE TASK DESIGN

Crfc_command

Crfc_reg_id

Crfc_status

Crfc_data_in

Crfc_data_out

Task Out

Context Task Out

State

Task

Controller

Hardware

Task Context Manager

State Enable

Task Status

Contoller

Task

Synchronization

Control

Task

Status

Context Ready

Task

Control

Commnad

Task

Control

Status

Context Suspended

Suspending

Task In

Context

Figure C.1.: The controller template for a hardware task.

Migration Manager Hardware task Drivers

send_context()

resume_task()

Software Task Thread

set_command_suspend()

task_suspended()

Task Control Interface

get_command()

save_context()

execute_task()

(partial-reconfigure-hardware-task)

Figure C.2.: Sequence graph specifying the task migration from software to hardware.

C.2. SEQUENCE GRAPHS FOR TWO MIGRATION CASES 133

Migration Manager Hardware task Drivers

suspend_task()

retrieve_context()

Software Task Thread

set_command_resume()

task_started()

Task Control Interface

get_command()

execute_task()

restore_context()

(load-software-task)

Figure C.3.: Sequence graph specifying the task migration from hardware to software.

134 APPENDIX C. HARDWARE/SOFTWARE TASK DESIGN

APPENDIX D

TBS Server Bandwidth Estimation

This appendix presents the prove that the bandwidth Uireleased by a component si,

which is undergoing a migration from software to hardware, can be added to the TBS

server bandwidth Us, to schedule this migration.

Figure D.1 shows the scenario that is considered, similar to the one faced in Section 6.4.3,

highliting the scheduling of the service that undergo a migration and the TBS server.

in SW

TBS

in HW

1,, +

kiki

1, +ki

Figure D.1.: Relocation from software to hardware: An Example.

The whole system, considering all nOS services together with TBS, is schedulable under

135

136 APPENDIX D. TBS SERVER BANDWIDTH ESTIMATION

EDF if:

Us+

i=1

≤Kwere K≤1,(D.1)

where Es

iis the execution time of a service siin software, Piis the period of this service,

and Kis the CPU workload available for both OS service set and TBS. Please note

further that Ui=Es

i/Pi.

Without loss of generality, lets assume that s1is the service undergoing a relocation,

and let Equation D.1 shown above be rewritten as:

Us+Es

i=2

≤K. (D.2)

Looking to the equation above, it can be concluded that the workload Uican be seen

as the bandwidth of a second TBS server (hereafter called second server). Moreover,

assuming that at time a1,k+1 no job is waiting to be executed by second server, since

service s1from that time on is no longer scheduled in software (see Figure D.1), its

related job queue is free. Thus, its bandwidth Uican be used to carry out further

aperiodic jobs. In this scenario, let the aperiodic job Jb

1represent the migration activity,

and M1its correspondent execution time.

Applying TBS

Due to conditions already introduced in Section 6.4.2, the deadline ˆ

d1assigned to the

incoming job Jb

1at time a1,k+1 need to be smaller than the deadline d1,k+1 of service si.

Furthermore, the maximum value assigned to this deadline must be d1,k+1 −Eh

1, which

assures that the deadline of the (k+ 1)th instance of service s1will be respected.

Instead of letting job Jb

1be completely executed by TBS, let M1be split into two parts,

ρM1and (1 −ρ)M1, where 0 ≤ρ≤1, so that each part is carried out by a different

server. Letting ˆ

ds1be the deadline assigned by TBS when scheduling (1 −ρ)M1, and

respecting the conditions explained above, it follows that

ds1=a1,k+1 +(1 −ρ)M1

≤d1,k+1 −Eh

1,(D.3)

which can be rewriting to solve Us. Thus, knowing additionally that d1,k+1 =a1,k+1 +P1:

Us≥(1 −ρ)M1

P1−Eh

.(D.4)

137

Similarly, let ˆ

ds2be the deadline assigned by second server when scheduling the remain-

ing part ρM1:

ds2=a1,k+1 +ρM1

≤d1,k+1 −Eh

1.(D.5)

Solving Equation D.5 to ρ, it follows that

ρ≤U1

M1P1−Eh

1.(D.6)

Now, lets second server carry out the maximum amount of M1. For this case ρmust be

the maximum:

ρ=U1

M1P1−Eh

1.(D.7)

By making now a simple substitution of Equation D.7 in Equation D.8, it follows that

Us≥M1

P1−Eh

−U1,(D.8)

which is essentially the same as the Equation 6.9 derived in Section 6.4.3.

138 APPENDIX D. TBS SERVER BANDWIDTH ESTIMATION

References

[1] MASSA, A.; BARR, M. Programming Embedded Systems. 1.ed. Sebastopol,

CA, USA: O’Reilly, 2006.

[2] MARWEDEL, P. Embedded System Design. 1.ed. Boston, Dordrecht, London:

Kluwer Academic Publishers, 2003.

[3] JERRAYA, A. A. Long Term Trends for Embedded System Design. In: EUROMI-

CRO SYSTEMS ON DIGITAL SYSTEM DESIGN (DSD), 7., 2004, Rennes, France.

Proceedings. . . Washington: IEEE Computer Society, 2004. p.20–26.

[4] ENGEL, F. et al. Operating Systems on SoCs: a good idea? In: EMBEDDED

REAL-TIME SYSTEMS IMPLEMENTATION (ERTSI) WORKSHOP, 2004, Lisbon,

Porgutal. Proceedings. . . [S.l.: s.n.], 2004.

[5] G¨

OTZ, M. Dynamic Hardware-Software Codesign of a Reconfigurable Real-Time

Operating System. In: CONFERENCE ON RECONFIGURABLE COMPUTING

AND FPGAS (RECONFIG), 1., 2004, Colima, Mexico. Proceedings. . . Mexico:

Mexican Society of Computer Science, 2004. p.330–339.

[6] G¨

OTZ, M.; RETTBERG, A.; PEREIRA, C. E. Towards Run-time Partitioning of

a Real Time Operating System for Reconfigurable Systems on Chip. In: INTERNA-

TIONAL EMBEDDED SYSTEMS SYMPOSIUM (IESS), 1., 2005, Manaus, Brazil.

Proceedings. . . [S.l.: s.n.], 2005. p.255–266.

[7] G¨

OTZ, M.; RETTBERG, A.; PEREIRA, C. E. A Run-time Partitioning Algorithm

for RTOS on Reconfigurable Hardware. In: INTERNATIONAL CONFERENCE IN

EMBEDDED AND UBIQUITOUS COMPUTING (EUC), 2005, Nagasaki, Japan.

Proceedings. . . Berlin: Springer, 2005. p.469–478.

139

140 References

[8] G¨

OTZ, M.; RETTBERG, A.; PEREIRA, C. E. Run-Time Reconfigurable Real-

Time Operting System For Hybrid Execution Platforms. In: IFAC SYMPOSIUM ON

INFORMATION CONTROL PROBLEMS IN MANUFACTURING (INCOM), 12.,

2006, Saint-Etienne, France. Proceedings. . . Oxford: Elsevier, 2006. v.I - Informa-

tion Systems, p.81–86.

[9] G¨

OTZ, M.; RETTBERG, A.; PEREIRA, C. E. Communication-aware Component

Allocation Algorithm for a Hybrid Architecture. In: IFIP WORKING CONFER-

ENCE ON DISTRIBUTED AND PARALLEL EMBEDDED SYSTEMS (DIPES),

5., 2006, Braga, Porgutal. Proceedings. . . Boston: Springer, 2006. p.175–184.

[10] G¨

OTZ, M.; DITTMANN, F. Scheduling Reconfiguration Activities of Run-time Re-

configurable RTOS Using an Aperiodic Task Server. In: WORKSHOP ON APPLIED

RECONFIGURABLE COMPUTING (ARC), 2., 2006, Delft, The Netherlands. Pro-

ceedings. . . Berlin: Springer, 2006. p.255–261.

[11] G¨

OTZ, M.; DITTMANN, F.; PEREIRA, C. E. Deterministic Mechanism for Run-

Time Reconfiguration Activities in an RTOS. In: INTERNATIONAL IEEE CON-

FERENCE ON INDUSTRIAL INFORMATICS (INDIN), 4., 2006, Singapore. Pro-

ceedings. . . Washington: IEEE Computer Society, 2006. p.693–698.

[12] G¨

OTZ, M.; XIE, T.; DITTMANN, F. Dynamic Relocation of Hybrid Tasks: a com-

plete design flow. In: INTERNATIONAL WORKSHOP ON RECONFIGURABLE

COMMUNICATION-CENTRIC SYSTEMS-ON-CHIP (RECOSOC), 3., 2007, Mont-

pellier, France. Proceedings. . . Montpellier: University of Montpellier II, 2007.

p.31–38.

[13] XIE, T. A Programming Framework Enabling Runtime Task Migrations

Between CPU and FPGA. 2007. Master Thesis (Master of Computer Science) —

Heinz Nixdorf Institute, University of Paderborn, Paderborn, Germany.

[14] FINKE, S. Schnittstellensynthese f¨ur HW/SW-migrierbare Dienste eines

Realzeitbetriebssystems. 2006. Bachelor Work (Bachelor of Computer Science) —

Heinz Nixdorf Institut, Universit¨at Paderborn, Paderborn, Germany.

[15] RAMMIG, F. J. et al. Real-time Operating Systems for Self-coordinating Em-

bedded Systems. In: IEEE INTERNATIONAL SYMPOSIUM ON OBJECT AND

COMPONENT-ORIENTED REAL-TIME DISTRIBUTED COMPUTING (ISORC),

9., 2006, Gyeongju, Korea. Proceedings. . . Washington: IEEE Computer Society,

2006. p.382–392.

[16] RAMMIG, F. J. et al. Real-time Operating Systems for Self-coordinating Embedded

Systems. In: DAGSTUHL-WORKSHOP MBEES: MODELLBASIERTE ENTWICK-

LUNG EINGEBETTETER SYSTEME II, 2006, Wadern, Germany. Tagungs-

band. . . Germany: TU Braunschweig, 2006. p.95–104.

References 141

[17] G¨

OTZ, M.; DITTMANN, F. Reconfigurable Microkernel-based RTOS: mechanisms

and methods for run-time reconfiguration. In: INTERNATIONAL CONFERENCE

ON RECONFIGURABLE COMPUTING AND FPGAS (RECONFIG), 3., 2006, San

Luis Potosi. Proceedings. . . Washington: IEEE Computer Society, 2006. p.12–19.

[18] G¨

OTZ, M. et al. Run-time Reconfigurable RTOS for Reconfigurable Systems-on-

Chip. Journal of Embedded Computing, [S.l.], 2007. To Appear.

[19] IHMOR, S. Modelling and Automated Synthesis of Reconfigurable Inter-

faces. 2006. PhD Thesis (Dr. rer. nat.) — Heinz Nixdorf Institute, University of

Paderborn, Paderborn, Germany.

[20] GOCHMAN, S. et al. Introduction to Intel Core Duo Processor Architecture. Intel

Technology, [S.l.], v.10, n.2, p.89–97, May 2006.

[21] GPGPU.ORG. General-Purpose computation on GPUs (GPGPU). Avail-

able at <http://www.gpgpu.org>. Accessed in: jan 2007.

[22] NVIDIA. G8 Graphics Devices. Avaliable at <http://www.nvidia.com>. Ac-

cessed in: nov 2006.

[23] KAHLE, J. A. et al. Introduction to the Cell Multiprocessor. IBM Journal of

Research and Development, Riverton, NJ, USA, v.49, n.4/5, p.589–604, 2005.

[24] ERNST, R. Codesign of Embedded Systems: status and trends. IEEE Design and

Test of Computers, [S.l.].

[25] BUCHENRIEDER, K. Hardware/Software Codesign an Annotated Bibli-

ography. 1.ed. Chicago, USA: IT Press Hartenstein, 1994.

[26] HARDT, W. HW/SW-Codesign auf Basis von C-Programmen unter

Performanz-Gesichtspukten. 1996. PhD Thesis (Dr. rer. nat.) — Heinz Nixdorf

Institute, University of Paderborn, Paderborn, Germany.

[27] HARDT, W.; CAMPOSANO, R. Specification Analysis for HW/SW-Partitioning.

In: GI/ITG WORKSHOP IN ANWENDUNG FORMALER METHODEN F¨uR DEN

HARDWARE-ENTWURF, 3., 1994, Passau, Germany. Proceedings. . . Aachen:

Shaker Verlag, 1994. p.1–10.

[28] KAMDEM, R.; NJIWOUA, P. Galois Lattice Approach to Hardware/Software

Partitioning. In: INTERNATIONAL CONFERENCE ON PARALLEL AND DIS-

TRIBUTED PROCESSING TECHNIQUES AND APPLICATIONS (PDPTA), 1999,

Las Vegas, USA. Proceedings. . . Nevada: CSREA Press, 1999. p.3029–3036.

[29] BALARIN, F. et al.Hardware-software co-design of embedded systems:

the polis approach. 1.ed. Norwell, MA, USA: Kluwer Academic Publishers, 1997.

[30] JERRAYA, A. A.; WOLF, W. Hardware/Software Interface Codesign for Embedded

Systems. Computer, Los Alamitos, CA, USA, v.38, n.2, p.63–69, 2005.

142 References

[31] COMPTON, K.; HAUCK, S. Reconfigurable Computing: a survey of systems and

software. ACM Computing Surveys, New York, NY, USA, v.34, n.2, p.171–210,

2002.

[32] LACH, J.; BAZARGAN, K. Editorial: special issue on dynamically adaptable em-

bedded systems. ACM Transactions on Embedded Computing Systems, New

York, NY, USA, v.3, n.2, p.233–236, 2004.

[33] GARCIA, P. et al. An Overview of Reconfigurable Hardware in Embedded Sys-

tems. EURASIP Journal on Embedded Systems, New York, NY, United States,

v.2006, p.1–19, 2006.

[34] PLESSL, C.; PLATZNER, M. Virtualization of Hardware - Introduction and Sur-

vey. In: INTERNATIONAL CONFERENCE ON ENGINEERING OF RECONFIG-

URABLE SYSTEMS AND ALGORITHMS (ERSA), 4., 2004, Las Vegas, USA. Pro-

ceedings. . . Nevada: CSREA Press, 2004. p.63–69.

[35] BOBDA, C. Synthesis of Dataflow Graphs for Reconfigurable Systems us-

ing Temporal Partitioning and Temporal Placement. 2003. PhD Thesis (Dr.

rer. nat.) — Heinz Nixdorf Institute, University of Paderborn, Paderborn, Germany.

[36] VOROS, N. S.; MASSELOS, K. System Level Design of Reconfigurable

Systems-on-Chip. 1.ed. Secaucus, NJ, USA: Springer-Verlag New York, Inc., 2005.

[37] ADRIATIC Project IST-2000-30049 Deliverable D2.2. Definition of ADRI-

ATIC High-Level Hardware/Software Co-Design Methodology for Re-

configurable SoCs. Available at: <http://www.imec.be/adriatic>. Accessed in:

aug 2006.

[38] QU, Y.; SOININEN, J.-P. SystemC-based Design Methodology for Reconfigurable

System-on-Chip. In: EUROMICRO CONFERENCE ON DIGITAL SYSTEM DE-

SIGN (DSD), 8., 2005, Porto, Portugal. Proceedings. . . Washington: IEEE Com-

puter Society, 2005. p.364–371.

[39] MAXFIELD, C. The Design Warrior’s Guide to FPGAs. 1.ed. Burlington,

MA, USA: Newnes, 2004.

[40] Virtex-II Pro and Virtex-II Pro X Platform FPGAs: complete data sheet (v.4.5).

San Jose, CA, USA: Xilinx Inc., 2005.

[41] XAPP529 (v1.3): connecting customized ip to the microblaze soft processor using

the fast simplex link (fsl) channel. San Jose, CA, USA: Xilinx Inc., 2004.

[42] ATMEL. FPSLIC (AVR with FPGA). Available at: <http://

www.atmel.com/products/FPSLIC/>. Accessed in: dec 2006.

[43] XAPP290 (v1.2): two flows for partial reconfiguration: module based or difference

based. San Jose, CA, USA: Xilinx Inc., 2004.

References 143

[44] Early Access Partial Reconfiguration User Guide (v1.1). San Jose, CA, USA: Xilinx

Inc., 2006.

[45] LYSAGHT, P. et al. Enhanced Architectures, Design Methodologies and CAD Tools

for Dynamic Reconfiguration on XILINX FPGAS. In: INTERNATIONAL CONFER-

ENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), 16.,

2006, Madrid, Spain. Proceedings. . . Washington: IEEE Computer Society, 2006.

p.1–6.

[46] DORAIRAJ, N.; SHIFLET, E.; GOOSMAN, M. PlanAhead Software as a Platform

for Partial Reconfiguration. Xcell Journal, San Jose, CA, USA, p.68–71, Febru-

ary 2005.

[47] UHM, M. Software-Defined Radio: the new architectural paradigm - reduce system

power and cost with a shared resources soc. DSP magazine, San Jose, CA, USA,

v.1, n.1, p.40–42, October 2005.

[48] DENYS, G.; PIESSENS, F.; MATTHIJS, F. A Survey of Customizability in Op-

erating Systems Research. ACM Computing Surveys, New York, NY, USA, v.34,

n.4, p.450–468, 2002.

[49] MASSA, A. J. Embedded Software Development with eCos. 1.ed. Upper

Saddle River, NJ, USA: Prentice Hall, 2002.

[50] REDHAT. eCos Operating System. Available at: <http://

sources.redhat.com/ecos/>. Accessed in: jun 2006.

[51] DITZE, C. DReaMS - Concepts of a Distributed Real-Time Management System.

In: IFIP / IFAC WORKSHOP ON REAL-TIME PROGRAMMING (WRTP), 20.,

1995, Fort Lauderdale, Florida. Proceedings. . . The Netherlands: Elsevier, 1995.

[52] DITZE, C. A Customizable Library to Support Software Synthesis for Embedded

Applications and Micro-kernel Systems. In: ACM SIGOPS EUROPEAN WORK-

SHOP ON SUPPORT FOR COMPOSING DISTRIBUTED APPLICATIONS, 8.,

1998, Sintra, Portugal. Proceedings. . . New York: ACM Press, 1998. p.88–95.

[53] DITZE, C. A Step towards Operating System Synthesis. In: ANNUAL AUS-

TRALIAN CONFERENCE ON PARALLEL AND REAL-TIME SYSTEMS (PART),

5., 1998, Adelaide, Australia. Proceedings. . . USA: IEEE, 1998.

[54] DITZE, C. Towards Operating System Synthesis. 1999. PhD Thesis (Dr. rer.

nat.) — Heinz Nixdorf Institute, University of Paderborn, Paderborn, Germany.

[55] B¨

OKE, C. Combining Two Customization Approaches: extending the customiza-

tion tool terecs for software synthesis of real-time execution platforms. In: WORK-

SHOP ON ARCHITECTURES OF EMBEDDED SYSTEMS (AES), 2000, Karlsruhe,

Germany. Proceedings. . . [S.l.: s.n.], 2000.

144 References

[56] B¨

OKE, C. Automatic Configuration of Real-Time Operating Systems and

Real-Time Communication Systems for Distributed Embedded Applica-

tions. 2003. PhD Thesis (Dr. rer. nat.) — Heinz Nixdorf Institute, University of

Paderborn, Paderborn, Germany.

[57] CAMPBELL, R. H. et al. Designing and Implementing Choices: an object-oriented

system in c++. Communications of the ACM, New York, NY, USA, v.36, n.9,

p.117–126, 1993.

[58] ENGLER, D. R.; KAASHOEK, M. F.; O’TOOLE, J. Exokernel: an operating sys-

tem architecture for application-level resource management. In: ACM SYMPOSIUM

ON OPERATING SYSTEMS PRINCIPLES (SOSP), 15., 1995, Copper Mountain

Resort, USA. Proceedings. . . New York: ACM Press, 1995. p.251–266.

[59] KAASHOEK, M. F. et al. Application Performance and Flexibility on Exoker-

nel Systems. In: ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES

(SOSP), 15., 1997, Saint Malo, France. Proceedings. . . New York: ACM Press,

1997. p.52–65.

[60] STALLINGS, W. Operating Systems: internals and design principles. 4.ed.

Upper Saddle River, NJ, USA: Prentice-Hall Int., 2001.

[61] VEITCH, A. C.; HUTCHINSON, N. C. Kea - A Dynamically Extensible and Config-

urable Operating System Kernel. In: INTERNATIONAL CONFERENCE ON CON-

FIGURABLE DISTRIBUTED SYSTEMS - ICCDS, 3., 1996, Annapolis, USA. Pro-

ceedings. . . Washington: IEEE Computer Society, 1996. p.236–242.

[62] VEITCH, A. C. A Dynamically Reconfigurable and Extensible Operating

System. 1998. PhD Thesis (Doctor of Philosophy) — University of British Columbia,

Vancouver, Canada.

[63] HELANDER, J.; FORIN, A. MMLite: a highly componentized system architecture.

In: ACM SIGOPS EUROPEAN WORKSHOP ON SUPPORT FOR COMPOSING

DISTRIBUTED APPLICATIONS, 8., 1998, Sintra, Portugal. Proceedings. . . New

York: ACM Press, 1998. p.96–103.

[64] GABBER, E. et al. The Pebble Component-Based Operating System. In: USENIX

ANNUAL TECHNICAL CONFERENCE, 24., 1999, Monterey, USA. Proceed-

ings. . . Berkley: USENIX Association, 1999. p.267–282.

[65] MAGOUTIS, K. et al. Building Appliances Out of Components Using Pebble. In:

ACM SIGOPS EUROPEAN WORKSHOP, 9., 2000, Kolding, Denmark. Proceed-

ings. . . New York: ACM Press, 2000. p.211–216.

[66] BERSHAD, B. N. et al. Extensibility Safety and Performance in the SPIN Oper-

ating System. In: ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES

(SOSP), 15., 1995, Copper Mountain, USA. Proceedings. . . New York: ACM Press,

1995. p.267–283.

References 145

[67] PARDYAK, P.; BERSHAD, B. N. Dynamic Binding for an Extensible System. In:

USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND IMPLEMEN-

TATION (OSDI), 2., 1996, Seattle, USA. Proceedings. . . New York: ACM Press,

1996. p.201–212.

[68] SOULES, C. A. N. et al. System Support for Online Reconfiguration. In: USENIX

ANNUAL TECHNICAL CONFERENCE, 28., 2003, San Antonio, USA. Proceed-

ings. . . Berkley: USENIX Association, 2003. p.141–154.

[69] APPAVOO, J. et al. Experience with K42, an Open-Source, Linux-Compatible,

Scalable Operating-System Kernel. IBM Systems Journal, Riverton, NJ, USA,

v.44, n.2, p.427–440, 2005.

[70] BAUMANN, A.; APPAVOO, J. Improving Dynamic Update for Operating Systems.

In: ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES (SOSP), 20.,

2005, Brighton, UK. Proceedings. . . New York: ACM Press, 2005. p.1–11.

[71] FASSINO, J.-P. et al. THINK: a software framework for component-based operating

system kernels. In: USENIX ANNUAL TECHNICAL CONFERENCE, 27., 2002,

Berkeley, USA. Proceedings. . . Berkley: USENIX Association, 2002. p.73–86.

[72] POLAKOVIC, J. Dynamic Reconfiguration in THINK: design and imple-

mentation. 2004. Diplomarbeit (Diplom-Informatik) — Universit¨at Karlsruhe, Karl-

sruhe, Germany.

[73] COWAN, C. et al. Fast Concurrent Dynamic Linking for an Adaptive Operat-

ing System. In: INTERNATIONAL CONFERENCE ON CONFIGURABLE DIS-

TRIBUTED SYSTEMS (ICCDS), 3., 1996, Annapolis, USA. Proceedings. . . Wash-

ington: IEEE Computer Society, 1996. p.108–115.

[74] COWAN, C. et al. Specialization Classes: an object framework for specialization. In:

INTERNATIONAL WORKSHOP ON OBJECT ORIENTATION IN OPERATING

SYSTEMS (IWOOOS), 5., 1996, Seattle, USA. Proceedings. . . Washington: IEEE

Computer Society, 1996. p.72–77.

[75] PU, C. et al. Optimistic Incremental Specialization: streamlining a commercial op-

erating system. In: ACM SYMPOSIUM ON OPERATING SYSTEMS PRINCIPLES,

15., 1995, Copper Mountain Resort, USA. Proceedings. . . New York: ACM Press,

1995. p.314–321.

[76] MASSALIN, H. Synthesis: an efficient implementation of fundamental op-

erating system services. 1992. PhD Thesis (Doctor of Philosophy) — Columbia

University, New York, NY, USA.

[77] SELTZER, M. I. et al. Dealing with Disaster: surviving misbehaved kernel ex-

tensions. In: USENIX SYMPOSIUM ON OPERATING SYSTEMS DESIGN AND

IMPLEMENTATION (OSDI), 2., 1996, Seattle, Washington, United States. Pro-

ceedings. . . New York: ACM Press, 1996. p.213–227.

146 References

[78] SELTZER, M.; SMALL, C. Self-Monitoring and Self-Adapting Operating Systems.

In: WORKSHOP ON HOT TOPICS IN OPERATING SYSTEMS (HOTOS-VI), 6.,

1997, Cape Cod, USA. Proceedings. . . Washington: IEEE Computer Society, 1997.

p.124–129.

[79] OBERTH¨

UR, S.; B ¨

OKE, C.; GRIESE, B. Dynamic Online Reconfiguration for

Customizable and Self-Optimizing Operating Systems. In: ACM INTERNATIONAL

CONFERENCE ON EMBEDDED SOFTWARE (EMSOFT), 5., 2005, Jersey City,

USA. Proceedings. . . New York: ACM Press, 2005. p.335–338.

[80] BOEKE, C.; OBERTHUER, S. Flexible Resource Management - A Framework

for Self-Optimizing Real-Time Systems. In: CONFERENCE ON DISTRIBUTED

AND PARALLEL EMBEDDED SYSTEMS (DIPES), 18., 2004, Toulouse, France.

Proceedings. . . The Netherlands: Kluwer Academic Publishers, 2004. p.177–186.

[81] HILDEBRAND, D. An Architectural Overview of QNX. In: WORKSHOP ON

MICRO-KERNELS AND OTHER KERNEL ARCHITECTURES, 1., 1992, Berkeley,

USA. Proceedings. . . Berkley: USENIX Association, 1992. p.113–126.

[82] WIND RIVER. VxWorks. Available at <http://www.windriver.com>. Accessed

in: jun 2006.

[83] CARD, R.; DUMAS, `

E.; M`

EVEL, F. The Linux Kernel Book. 1.ed. New York,

NY, USA: John Wiley & Sons, Inc., 1998.

[84] WILLIAMS, J. W.; BERGMANN, N. Embedded Linux as a Platform for Dynami-

cally Self-Reconfiguring Systems-on-Chip. In: INTERNATIONAL CONFERENCE

ON ENGINEERING OF RECONFIGURABLE SYSTEMS AND ALGORITHMS

(ERSA), 4., 2004, Las Vegas, USA. Proceedings. . . USA: CSREA Press, 2004.

p.163–169.

[85] FRIEDRICH, L. F. et al. A Survey of Configurable, Component-Based Operating

Systems for Embedded Applications. IEEE Micro, Los Alamitos, CA, USA, v.21,

n.3, p.54–68, 2001.

[86] TOURNIER, J.-C. A Survey of Configurable Operating Systems. Albu-

querque, NM, USA: University of New Mexico, 2005.

[87] BLODGET, B. et al. A Self-Reconfiguring Platform. In: INTERNATIONAL CON-

FERENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL),

13., 2003, Lisbon, Portugal. Proceedings. . . Berlin: Springer, 2003. p.565–574.

[88] DONLIN, A. et al. A Virtual File System for Dynamically Reconfigurable FPGAs.

In: INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE LOGIC

AND APPLICATIONS (FPL), 14., 2004, Leuven, Belgium. Proceedings. . . Berlin:

Springer, 2004. p.1127–1129.

References 147

[89] BLODGET, B.; MCMILLAN, S.; LYSAGHT, P. A Lightweight Approach for Em-

bedded Reconfiguration of FPGAs. In: DESIGN, AUTOMATION AND TEST IN

EUROPE (DATE), 2003, Munich, Germany. Proceedings. . . Washington: IEEE

Computer Society, 2003. p.10399–10401.

[90] VULETI´

C, M. et al. Operating System Support for Interface Virtualisation of Re-

configurable Coprocessors. In: CONFERENCE ON DESIGN, AUTOMATION AND

TEST IN EUROPE (DATE), 2004, Paris, France. Proceedings. . . Washington:

IEEE Computer Society, 2004. p.748–749.

[91] VULETI´

C, M.; POZZI, L.; IENNE, P. Virtual Memory Window for Application-

Specific Reconfigurable Coprocessors. In: DESIGN AUTOMATION CONFERENCE

(DAC), 41., 2004, San Diego, USA. Proceedings. . . New York: ACM Press, 2004.

p.948–953.

[92] VULETI´

C, M.; POZZI, L.; IENNE, P. Programming Transparency and Portable

Hardware Interfacing: towards general-purpose reconfigurable computing. In: IEEE

INTERNATIONAL CONFERENCE ON APPLICATION-SPECIFIC SYSTEMS,

ARCHITECTURES AND PROCESSORS (ASAP), 15., 2004, Galveston, USA. Pro-

ceedings. . . Washington: IEEE Computer Society, 2004. p.339–351.

[93] VULETI´

C, M.; POZZI, L.; IENNE, P. Seamless Hardware-Software Integration

in Reconfigurable Computing Systems. IEEE Design & Test of Computers, Los

Alamitos, CA, USA, v.22, n.2, p.102–113, 2005.

[94] VULETI´

C, M.; POZZI, L.; IENNE, P. Dynamic Prefetching in the Virtual Memory

Window of Portable Reconfigurable Coprocessors. In: INTERNATIONAL CONFER-

ENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), 14.,

2004, Leuven, Belgium. Proceedings. . . Berlin: Springer, 2004. p.596–605.

[95] BREBNER, G. J. A Virtual Hardware Operating System for the Xilinx XC6200. In:

INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE LOGIC AND

APPLICATIONS (FPL), 6., 1996, London, UK. Proceedings. . . Berlin: Springer,

1996. p.327–336.

[96] LING, X.-P.; AMANO, H. WASMII: a data driven computer on a virtual hardware.

In: IEEE WORKSHOP ON FPGAS FOR CUSTOM COMPUTING MACHINES, 1.,

1993, Napa Valley, USA. Proceedings. . . Washington: IEEE Computer Society,

1993. p.33–42.

[97] DIESSEL, O.; WIGLEY, G. Opportunities for Operating Systems Research

in Reconfigurable Computing. Mawson Lakes, Australia: Advanced Computing

Research Centre, School of Computer and Information Science, University of South

Australia, 1999.

[98] WIGLEY, G.; KEARNEY, D. The Development of an Operating System for Re-

configurable Computing. In: IEEE SYMPOSIUM ON FIELD-PROGRAMMABLE

148 References

CUSTOM COMPUTING MACHINES (FCCM), 9., 2001, Napa Valley, USA. Pro-

ceedings. . . Washington: IEEE Computer Society, 2001. p.249–250.

[99] VERDIER, F. et al. Exploring RTOS Issues with a High-Level Model of a Re-

configurable SoC Platform. In: INTERNATIONAL WORKSHOP ON RECONFIG-

URABLE COMMUNICATION-CENTRIC SYSTEMS-ON-CHIP (RECOSOC), 1.,

2005, Montpellier, France. Proceedings. . . Montpellier: University of Montpellier

II, 2005. p.71–78.

[100] SEGARD, A.; VERDIER, F. SOC and RTOS: managing ips and tasks commu-

nications. In: INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE

LOGIC AND APPLICATIONS (FPL), 14., 2004, Leuven, Belgium. Proceedings. . .

Berlin: Springer, 2004. p.710–718.

[101] NOGUERA, J.; BADIA, R. M. Multitasking on Reconfigurable Architectures:

microarchitecture support and dynamic scheduling. ACM Transactions on Em-

bedded Computing Systems, New York, NY, USA, v.3, n.2, p.385–406, 2004.

[102] FEKETE, S.; K ¨

OHLER, E.; TEICH, J. Optimal FPGA Module Placement with

Temporal Precedence Constraints. In: CONFERENCE ON DESIGN, AUTOMA-

TION AND TEST IN EUROPE (DATE), 2001, Munich, Germany. Proceedings. . .

Washington: IEEE Computer Society, 2001. p.658–667.

[103] TABERO, J. et al. Task Placement Heuristic Based on 3D-Adjacency and Look-

Ahead in Reconfigurable Systems. In: CONFERENCE ON ASIA SOUTH PACIFIC

DESIGN AUTOMATION (ASP-DAC), 2006., 2006, Yokohama, Japan. Proceed-

ings. . . New York: ACM Press, 2006. p.396–401.

[104] MAESTRE, R. et al. A Framework for Reconfigurable Computing: task scheduling

and context management. IEEE Transactions on Very Large Scale Integration

(VLSI) Systems, Piscataway, NJ, USA, v.9, n.6, p.858–873, 2001.

[105] BANERJEE, S.; BOZORGZADEH, E.; DUTT, N. Physically-Aware HW-SW Par-

titioning for Reconfigurable Architectures With Partial Dynamic Reconfiguration. In:

DESIGN AUTOMATION CONFERENCE (DAC), 42., 2005, San Diego, USA. Pro-

ceedings. . . New York: ACM Press, 2005. p.335–340.

[106] MEI, B.; SCHAUMONT, P.; VERNALDE, S. A Hardware-Software Partitioning

and Scheduling Algorithm for Dynamically Reconfigurable Embedded Systems. In:

PRORISC WORKSHOP ON CIRCUITS, SYSTEMS AND SIGNAL PROCESSING,

11., 2000, Veldhoven, Netherlands. Proceedings. . . [S.l.: s.n.], 2000.

[107] WALDER, H.; PLATZNER, M. Non-preemptive Multitasking on FPGAs: task

placement and footprint transform. In: INTERNATIONAL CONFERENCE ON EN-

GINEERING OF RECONFIGURABLE SYSTEMS AND ALGORITHMS (ERSA),

2., 2002, Las Vegas, USA. Proceedings. . . Nevada: CSREA Press, 2002. p.24–30.

References 149

[108] PLESSL, C. et al. Reconfigurable Hardware in Wearable Computing Nodes. In:

INTERNATIONAL SYMPOSIUM ON WEARABLE COMPUTERS (ISWC), 6.,

2002, Seattle, Washington. Proceedings. . . Washington: IEEE Computer Society,

2002. p.215–222.

[109] WIGLEY, G.; KEARNEY, D. Research Issues in Operating Systems for Recon-

figurable Computing. In: INTERNATIONAL CONFERENCE ON ENGINEERING

OF RECONFIGURABLE SYSTEMS AND ALGORITHMS (ERSA), 2., 2002, Las

Vegas, USA. Proceedings. . . Nevada: CSREA Press, 2002. p.10–16.

[110] WIGLEY, G.; KEARNEY, D. The First Real Operating System for Reconfig-

urable Computers. In: AUSTRALASIAN CONFERENCE ON COMPUTER SYS-

TEMS ARCHITECTURE (ACSAC), 6., 2001, Queensland, Australia. Proceed-

ings. . . Washington: IEEE Computer Society, 2001. p.130–137.

[111] WIGLEY, G. B.; KEARNEY, D. A.; WARREN, D. Introducing ReConfigME: an

operating system for reconfigurable computing. In: INTERNATIONAL CONFER-

ENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), 12.,

2002, Montpellier, France. Proceedings. . . Berlin: Springer, 2002. p.687–697.

[112] WALDER, H.; PLATZNER, M. Reconfigurable Hardware OS Prototype.

Zurich: Computer Engineering and Networks Laboratory, ETH, 2003.

[113] WALDER, H.; PLATZNER, M. Reconfigurable Hardware Operating Systems:

from design concepts to realizations. In: INTERNATIONAL CONFERENCE ON EN-

GINEERING OF RECONFIGURABLE SYSTEMS AND ALGORITHMS (ERSA),

3., 2003, Las Vegas, USA. Proceedings. . . USA: CSREA Press, 2003. p.284–287.

[114] WALDER, H.; PLATZNER, M. A Runtime Environment for Reconfigurable

Hardware Operating Systems. In: INTERNATIONAL CONFERENCE ON FIELD-

PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), 14., 2004, Leuven, Bel-

gium. Proceedings. . . Berlin: Springer, 2004. p.831–835.

[115] WALDER, H.; PLATZNER, M. Online Scheduling for Block-Partitioned Recon-

figurable Devices. In: DESIGN, AUTOMATION AND TEST IN EUROPE (DATE),

2003, Munich, Germany. Proceedings. . . Washington: IEEE Computer Society,

2003. p.10290–10295.

[116] STEIGER, C. et al. Online Scheduling and Placement of Real-time Tasks to Pari-

tally Reconfigurable Devices. In: INTERNATIONAL REAL-TIME SYSTEMS SYM-

POSIUM (RTSS), 24., 2003, Cancun, Mexico. Proceedings. . . Washington: IEEE

Computer Society, 2003. p.224–235.

[117] STEIGER, C.; WALDER, H.; PLATZNER, M. Operating Systems for Reconfig-

urable Embedded Platforms: online scheduling of real-time tasks. IEEE Transac-

tions on Computers, Los Alamitos, CA, USA, v.53, n.11, p.1393–1407, Novem-

ber 2004.

150 References

[118] HANDA, M.; VEMURI, R. An Integrated Online Scheduling and Place-

ment Methodology. In: INTERNATIONAL CONFERENCE ON FIELD-

PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), 14., 2004, Leuven, Bel-

gium. Proceedings. . . Berlin: Springer, 2004. p.444–453.

[119] MERINO, P.; JACOME, M.; L´

OPEZ, J. C. A Methodology for Task Based Par-

titioning and Scheduling of Dynamically Reconfigurable Systems. In: IEEE SYM-

POSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES

(FCCM), 6., 1998, Napa Valley, USA. Proceedings. . . Washington: IEEE Com-

puter Society, 1998. p.324–325.

[120] MERINO, P.; L´

OPEZ, J. C.; JACOME, M. F. A Hardware Operating System

for Dynamic Reconfiguration of FPGAs. In: INTERNATIONAL CONFERENCE ON

FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), 8., 1998, Tallinn,

Estonia. Proceedings. . . Berlin: Springer, 1998. p.431–435.

[121] ULLMANN, M. et al. On-Demand FPGA Run-Time System for Dynamical Re-

configuration with Adaptive Priorities. In: INTERNATIONAL CONFERENCE ON

FIELD-PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), 14., 2004, Leu-

ven, Belgium. Proceedings. . . Berlin: Springer, 2004. p.454–463.

[122] DANNE, K.; MUEHLENBERND, R.; PLATZNER, M. Executing Hardware Tasks

on Dynamically Reconfigurable Devices under Real-Time Conditions. In: INTERNA-

TIONAL CONFERENCE ON FIELD-PROGRAMMABLE LOGIC AND APPLICA-

TIONS (FPL), 16., 2006, Madrid, Spain. Proceedings. . . Washington: IEEE Com-

puter Society, 2006. p.1–6.

[123] DANNE, K.; PLATZNER, M. An EDF Schedulability Test for Periodic Tasks

on Reconfigurable Hardware Devices. In: ACM SIGPLAN/SIGBED CONFERENCE

ON LANGUAGES, COMPILERS, AND TOOLS FOR EMBEDDED SYSTEMS

(LCTES), 2006, Ottawa, Canada. Proceedings. . . New York: ACM Press, 2006.

p.93–102.

[124] SIMMLER, H.; LEVINSON, L.; M¨

ANNER, R. Multitasking on FPGA Copro-

cessors. In: INTERNATIONAL CONFERENCE ON FIELD-PROGRAMMABLE

LOGIC AND APPLICATIONS (FPL), 10., 2000, Villach, Austria. Proceedings. . .

London: Springer, 2000. p.121–130.

[125] LEVINSON, L. et al. Preemptive Multitasking on FPGAs. In: IEEE SYM-

POSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES

(FCCM), 8., 2000, Napa Valley, CA, USA. Proceedings. . . Washington: IEEE Com-

puter Society, 2000. p.301–302.

[126] KALTE, H.; PORRMANN, M. Context Saving and Restoring for Multitasking

in Reconfigurable Systems. In: INTERNATIONAL CONFERENCE ON FIELD-

PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), 15., 2005, Tampere, Fin-

land. Proceedings. . . Los Alamitos: IEEE, 2005. p.223–228.

References 151

[127] BOBDA, C. et al. Task Scheduling for Heterogeneous Reconfigurable Comput-

ers. In: SYMPOSIUM ON INTEGRATED CIRCUITS AND SYSTEMS DESIGN

(SBCCI), 17., 2004, Pernambuco, Brazil. Proceedings. . . New York: ACM Press,

2004. p.22–27.

[128] DITTMANN, F.; G¨

OTZ, M. Applying Single Processor Algorithms to Schedule

Tasks on Reconfigurable Devices Respecting Reconfiguration Times. In: RECON-

FIGURABLE ARCHITECTURES WORKSHOP (RAW), 13., 2006, Rhodes Island,

Greece. Proceedings. . . Washington: IEEE Computer Society, 2006.

[129] DITTMANN, F. Reconfiguration Time Aware Processing on FPGAs. In: DY-

NAMICALLY RECONFIGURABLE ARCHITECTURES, 2006, Dagstuhl, Germany.

Proceedings. . . Schloss Dagstuhl: Internationales Begegnungs- und Forschungszen-

trum f¨ur Informatik (IBFI), 2006. n.06141. (Dagstuhl Seminar Proceedings).

[130] GHIASI, S.; NAHAPETIAN, A.; SARRAFZADEH, M. An Optimal Algorithm for

Minimizing Run-Time Reconfiguration Delay. ACM Transactions on Embedded

Computing Systems, New York, NY, USA, v.3, n.2, p.237–256, 2004.

[131] HANDA, M.; VEMURI, R. Area Fragmentation in Reconfigurable Operating Sys-

tems. In: INTERNATIONAL CONFERENCE ON ENGINEERING OF RECON-

FIGURABLE SYSTEMS AND ALGORITHMS (ERSA), 4., 2004, Las Vegas, USA.

Proceedings. . . Nevada: CSREA Press, 2004. p.77–83.

[132] COMPTON, K. et al. Configuration Relocation and Defragmentation for Run-time

Reconfigurable Computing. IEEE Transactions on Very Large Scale Integra-

tion (VLSI) Systems, Piscataway, NJ, USA, v.10, n.3, p.209–220, 2002.

[133] BURNS, J. et al. A Dynamic Reconfiguration Run-time System. In: IEEE SYM-

POSIUM ON FIELD-PROGRAMMABLE CUSTOM COMPUTING MACHINES

(FCCM), 5., 1997, Napa Valley, USA. Proceedings. . . Washington: IEEE Com-

puter Society, 1997. p.66–75.

[134] MIGNOLET, J.-Y. et al. Infrastructure for Design and Management of Relocatable

Tasks in a Heterogeneous Reconfigurable System-on-Chip. In: DESIGN, AUTOMA-

TION AND TEST IN EUROPE (DATE), 2003, Munich, Germany. Proceedings. . .

Washington: IEEE Computer Society, 2003. p.10986–10993.

[135] NOLLET, V. et al. Hierarchical Run-Time Reconfiguration Managed by an Oper-

ating System for Reconfigurable Systems. In: INTERNATIONAL CONFERENCE

ON ENGINEERING OF RECONFIGURABLE SYSTEMS AND ALGORITHMS

(ERSA), 3., 2003, Las Vegas, USA. Proceedings. . . Nevada: CSREA Press, 2003.

p.81–87.

[136] NOLLET, V. et al. Designing an Operating System for a Heterogeneous Recon-

figurable SoC. In: INTERNATIONAL SYMPOSIUM ON PARALLEL AND DIS-

TRIBUTED PROCESSING (IPDPS), 17., 2003, Nice, France. Proceedings. . . Was-

ington: IEEE Computer Society, 2003. p.174–180.

152 References

[137] PELLIZZONI, R.; CACCAMO, M. Adaptive Allocation of Software and Hardware

Real-Time Tasks for FPGA-based Embedded Systems. In: IEEE REAL-TIME AND

EMBEDDED TECHNOLOGY AND APPLICATIONS SYMPOSIUM (RTAS), 12.,

2006, San Jose, USA. Proceedings. . . Washington: IEEE Computer Society, 2006.

p.208–220.

[138] STANKOVIC, J. A.; RAMAMRITHAM, K. The Spring Kernel: a new paradigm

for real-time operating systems. ACM SIGOPS Operating Systems Review,

New York, NY, USA, v.23, n.3, p.54–71, 1989.

[139] BURLESON, W. et al. The Spring Scheduling Co-Processor: a scheduling accel-

erator. In: IEEE INTERNATIONAL CONFERENCE ON COMPUTER DESIGN

(ICCD), 11., 1993, Cambridge, USA. Proceedings. . . Washington: IEEE Computer

Society, 1993. p.140–144.

[140] LINDH, L.; STANISCHEWSKI, F. FASTCHART - A Fast Time Determinis-

tic CPU and Hardware Based Real-Time-Kernel. In: EUROMICRO WORKSHOP

ON REAL-TIME SYSTEMS, 3., 1991, Paris, France. Proceedings. . . Washington:

IEEE Computer Society, 1991. p.36–40.

[141] LINDH, L. FASTHARD - A Fast Time Deterministic HARDware Based Real-

Time Kernel. In: EUROMICRO WORKSHOP ON REAL-TIME SYSTEMS, 4., 1992,

Athens, Greece. Proceedings. . . Washington: IEEE Computer Society, 1992. p.21–

25.

[142] LEE, J. et al. A Comparison of the RTU Hardware RTOS with a Hardware/Soft-

ware RTOS. In: ASIA AND SOUTH PACIFIC DESIGN AUTOMATION CONFER-

ENCE (ASP-DAC), 8., 2003, Kitakyushu, Japan. Proceedings. . . New York: ACM

Press, 2003. p.683–688.

[143] KOHOUT, P.; GANESH, B.; JACOB, B. Hardware Support for Real-Time Op-

erating Systems. In: IEEE/ACM/IFIP INTERNATIONAL CONFERENCE ON

HARDWARE/SOFTWARE CODESIGN AND SYSTEM SYNTHESIS, 1., 2003,

Newport Beach, USA. Proceedings. . . New York: ACM Press, 2003. p.45–51.

[144] KUACHAROEN, P.; SHALAN, M.; III, V. J. M. A Configurable Hardware Sched-

uler for Real-Time Systems. In: INTERNATIONAL CONFERENCE ON ENGI-

NEERING OF RECONFIGURABLE SYSTEMS AND ALGORITHMS (ERSA), 3.,

2003, Las Vegas, USA. Proceedings. . . USA: CSREA Press, 2003. p.95–101.

[145] LEE, J.; RYU, K.; III, V. J. M. A Framework for Automatic Generation of Config-

uration Files for a Custom Hardware/Software RTOS. In: INTERNATIONAL CON-

FERENCE ON ENGINEERING OF RECONFIGURABLE SYSTEMS AND ALGO-

RITHMS (ERSA), 2., 2002, Las Vegas, USA. Proceedings. . . USA: CSREA Press,

2002. p.31–37.

References 153

[146] III, V. J. M.; BLOUGH, D. M. A Hardware-Software Real-Time Operating System

Framework for SoCs. IEEE Design and Test of Computers, Los Alamitos, CA,

USA, v.19, n.6, p.44–51, 2002.

[147] ANDREWS, D.; NIEHAUS, D. Architectural Frameworks for MPP Systems on a

Chip. In: INTERNATIONAL SYMPOSIUM ON PARALLEL AND DISTRIBUTED

PROCESSING (IPDPS), 17., 2003, Nice, France. Proceedings. . . Washington:

IEEE Computer Society, 2003. p.265.2.

[148] ANDREWS, D. L. et al. Programming Models for Hybrid FPGA-CPU Computa-

tional Components: a missing link. IEEE Micro, Los Alamitos, CA, USA, v.24, n.4,

p.42–53, 2004.

[149] JIDIN, R.; ANDREWS, D. L.; NIEHAUS, D. Implementing Multi Threaded

System Support for Hybrid FPGA/CPU Computational Components. In: INTER-

NATIONAL CONFERENCE ON ENGINEERING OF RECONFIGURABLE SYS-

TEMS AND ALGORITHMS (ERSA), 4., 2004, Las Vegas, USA. Proceedings. . .

USA: CSREA Press, 2004. p.116–122.

[150] ANDERSON, E. et al. Enabling a Uniform Programming Model Across the Soft-

ware/Hardware Boundary. In: IEEE SYMPOSIUM ON FIELD-PROGRAMMABLE

CUSTOM COMPUTING MACHINES (FCCM), 14., 2006, Napa Valley, CA, USA.

Proceedings. . . Washington: IEEE Computer Society, 2006. p.89–98.

[151] ANDREWS, D. et al. hthreads: a hardware/software co-designed multithreaded

rtos kernel. In: IEEE INTERNATIONAL CONFERENCE ON EMERGING TECH-

NOLOGIES AND FACTORY AUTOMATION, 10., 2005, Catania, Italy. Proceed-

ings. . . Los Alamitos: IEEE, 2005. v.2, p.19–22.

[152] ANDREWS, D. et al. The Case for High Level Programming Models for Recon-

figurable Computers. In: INTERNATIONAL CONFERENCE ON ENGINEERING

OF RECONFIGURABLE SYSTEMS AND ALGORITHMS (ERSA), 6., 2006, Las

Vegas, USA. Proceedings. . . USA: CSREA Press, 2006. p.21–32.

[153] KERSTAN, T. Konzeption und Entwicklung einer konfigurierbaren

Mikrokernarchitektur f¨ur ein komponentenbasiertes Echtzeitbetriebssys-

tem. 2006. Diplomarbeit (Diplom-Informatik) — Heinz Nixdorf Institut, Universit¨at

Paderborn, Paderborn, Germany.

[154] GRIESE, B.; OBERTH¨

UR, S.; PORRMANN, M. Component Case Study of a Self-

Optimizing RCOS/RTOS System: a reconfigurable network service. In: INTERNA-

TIONAL EMBEDDED SYSTEMS SYMPOSIUM (IESS), 1., 2005, Manaos, Brazil.

Proceedings. . . [S.l.: s.n.], 2005. p.267–277.

[155] WANG, S. et al. Real-Time Component-Based Systems. In: IEEE REAL-TIME

AND EMBEDDED TECHNOLOGY AND APPLICATIONS SYMPOSIUM (RTAS),

154 References

11., 2005, San Francisco, CA, USA. Proceedings. . . Washington: IEEE Computer

Society, 2005. p.428–437.

[156] BUTTAZZO, G. et al.Soft Real-Time Systems: predictability vs. effi-

ciency. 1.ed. New York, NY, USA: Springer US, 2005.

[157] BRINKSCHULTE, U.; SCHNEIDER, E.; PICIOROAGA, F. Dynamic Real-time

Reconfiguration in Distributed Systems: timing issues and solutions. In: IEEE

INTERNATIONAL SYMPOSIUM ON OBJECT-ORIENTED REAL-TIME DIS-

TRIBUTED COMPUTING (ISORC), 8., 2005, Seattle, USA. Proceedings. . .

Washington: IEEE Computer Society, 2005. p.174–181.

[158] ELES, P.; KUCHCINSKI, K.; PENG, Z. System Synthesis with VHDL: a

transformational approach. 1.ed. Norwell, MA, USA: Kluwer Academic Publishers,

1998. p.114–119.

[159] BUTTAZZO, G. C. Hard Real-Time Computing Systems: predictable

scheduling algorithms and applications. 1.ed. Norwell, MA, USA: Kluwer Aca-

demic Publishers, 1997.

[160] BUTTAZZO, G. C. Rate monotonic vs. EDF: judgment day. Real-Time Sys-

tems, Norwell, MA, USA, v.29, n.1, p.5–26, 2005.

[161] BRATLEY, P.; FLORIAN, M.; ROBILLARD, P. Scheduling with Earliest Start

and Due Date Constraints. Naval Research Logistics Quarterly, New York,

p.511–519, December 1971.

[162] User Core Templates Reference Guide (v1.4). San Jose, CA, USA: Xilinx Inc.,

2003.

[163] KEROMYTIS, A. D. et al. Cryptography as an Operating System Service: a case

study. ACM Transactions on Computer Systems, New York, NY, USA, v.24,

n.1, p.1–38, 2006.

[164] SEDCOLE, N. P. Reconfigurable Platform-Based Design in FPGAs for

Video Image Processing. 2006. PhD Thesis (Doctor of Philosophy) — Department

of Electrical and Electronic Engineering, University of London, London, UK.