Adaptable OS Services for Distributed
Reconfigurable Systems on Chip
Sufyan L. M. Samara
Design of distributed embedded systems
University of Paderborn
A thesis submitted to the
Faculty of Computer Science, Electrical Engineering, and Mathematics
of the
University of Paderborn
in partial fulfillment of the requirements for the degree of
Dr. rer. nat.
November 2010
Abstract
The ever guest for more computational capabilities leads to embedded sys-
tems which consist of multiple computational elements integrated on a sin-
gle chip. An example is the integration of a reconfigurable fabric (FPGA)
with a number of general purpose processors to form what so called a re-
configurable system on chip. These embedded systems are common to be
distributed. This creates a flexible high performance distributed system.
However, it is very complex when it comes to management.
Applications running on such systems are expected to be dynamic in re-
gard of arriving and leaving the system. This increases the complexity as
the resources and the demands would change unpredictably.
In this work, an OS service model, which efficiently adapts to the various
changes in these systems, is presented. In addition, the algorithms and the
methodologies, developed to allow this novel OS service model to interact
with the application demands and the environment unpredictable dynamic
variations, are discussed. Furthermore, the extensive evaluations of these
algorithms are presented. Finally, a case study, which introduces the triple
data encryption standard as prove of concept, is provided.
Zusammenfassung
Das ständige Streben nach immer größeren Rechenkapazitäten führt zu eingebetteten
Systemen, die aus mehreren Verarbeitungselementen bestehen, die auf einem Chip in-
tegriert sind. Ein Beispiel dafür ist die Integration eines rekonfigurierbaren Gewebes
(FPGA) mit mehreren Universalprozessoren, um ein rekonfigurierbares System auf
einem Chip zu bilden. Typischerweise werden diese Systeme verteilt. Dieses schafft
flexible, verteilte Hochleistungssysteme. Allerdings sind diese Systeme aus Verwal-
tungssicht hochgradig komplex.
Es wird erwartet, dass Anwendungen, die auf diesen Systemen laufen, dynamisch
in das System hineinkommen und es verlassen. Dieses erhöht die Komplexität, da sich
Ressourcen und Anforderungen unvorhersehbar verändern können.
In dieser Arbeit wird ein Betriebssystemdienstmodell präsentiert, welches effizient
eingesetzt werden kann und sich den verschiedenartigen Veränderungen in diesen Sys-
temen anpasst. Darüber hinaus werden Algorithmen und Methodologien diskutiert, die
es diesem neuartigen Betriebssystemmodell erlauben mit den unvorhersehbaren Vari-
ationen der Anforderungen der Anwendungen sowie der Umgebung zu interagieren.
Ferner werden extensive Evaluationen dieser Algorithmen präsentiert.
Das Dokument schließt mit einer Fallstudie des Triple Data Encryption Standards
als Konzeptnachweis ab.
Acknowledgements
In the Name of ALLAH, Most Gracious, Most Merciful.
All praise and thanks are due to ALLAH (the one and only GOD, the cre-
ator of all and everything) and peace and blessings be upon his messenger
Muhammad, he who said: "One who does not thank people does not give
thanks to ALLAH, either." 1, I hereby thank my father, my mother, my
aunt Fatimah, my family, the DAAD, Prof. Franz Rammig, my friends,
and all who supported me in this work.
Muharram 1432 (December 2010)
Sufyan Lutfi Samara
1Tirmidhi, Birr 35, 1955; Abu Dawud, Adab 12, 4811
Contents
List of Figures ix
List of Tables xi
Glossary xv
1 Introduction 1
1.1 Motivation................................ 2
1.2 Thesiscontribution ........................... 3
1.3 Thesisstructure............................. 4
2 Background and related work 7
2.1 Reconfigurable systems . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.1.1 FPGA and reconfiguration . . . . . . . . . . . . . . . . . . . 10
2.1.2 Co-design and partitioning . . . . . . . . . . . . . . . . . . . 13
2.2 Operating system for embedded systems . . . . . . . . . . . . . . . . 14
2.2.1 Configurability in embedded operating systems . . . . . . . . 15
2.2.2 Operating systems and reconfigurable elements . . . . . . . . 17
2.3 Chapterconclusion ........................... 19
3 System design and architecture 21
3.1 Systemdesign.............................. 21
3.1.1 Centralized distribution . . . . . . . . . . . . . . . . . . . . . 22
3.1.2 Fully distributed topology . . . . . . . . . . . . . . . . . . . 24
3.1.3 Hybrid distribution topology . . . . . . . . . . . . . . . . . . 25
3.2 RSoCarchitecture............................ 27
3.2.1 Middleware........................... 27
3.2.2 VirtualMachine ........................ 33
3.3 Chapterconclusion ........................... 33
v
CONTENTS
4 OS Service structure 35
4.1 OS service design objectives . . . . . . . . . . . . . . . . . . . . . . 36
4.2 OSservicedesign............................ 38
4.2.1 SESs execution and applications . . . . . . . . . . . . . . . 41
4.2.2 Example: Realtime adaptation . . . . . . . . . . . . . . . . . 42
4.3 ChapterConclusions .......................... 45
5 OS service configurations and adaptation 47
5.1 OS service model formal definition . . . . . . . . . . . . . . . . . . . 49
5.2 Optimal configurations and constraints . . . . . . . . . . . . . . . . . 51
5.2.1 Single criterion optimization algorithm . . . . . . . . . . . . 52
5.2.2 Evaluating for a suitable OS service configuration . . . . . . . 55
5.2.3 Evaluating for Pareto optimal OS service configuration . . . . 57
5.2.4 Wrapupexample........................ 62
5.3 SESs granularity and pipelining . . . . . . . . . . . . . . . . . . . . 66
5.3.1 Partitioning granularity . . . . . . . . . . . . . . . . . . . . . 66
5.3.2 Pipelining execution and communication time . . . . . . . . . 71
5.4 ChapterConclusions .......................... 74
6 OS services distribution 75
6.1 SESs distribution............................ 76
6.1.1 Heterogeneity and Distributed Environments . . . . . . . . . 76
6.1.2 Fault Tolerance or Availability . . . . . . . . . . . . . . . . . 77
6.1.3 Distribution stages . . . . . . . . . . . . . . . . . . . . . . . 78
6.2 The Initialization stage . . . . . . . . . . . . . . . . . . . . . . . . . 78
6.2.1 Distribution algorithm . . . . . . . . . . . . . . . . . . . . . 80
6.3 The discovery and execution routing stage . . . . . . . . . . . . . . . 83
6.3.1 The building of the execution routing graphs . . . . . . . . . 84
6.3.2 Discovery and execution using self-x routing . . . . . . . . . 86
6.4 ChapterConclusion........................... 91
7 Methods Evaluation 93
7.1 Evaluating OS service configurations . . . . . . . . . . . . . . . . . . 94
7.1.1 OS service configurations under limited resources . . . . . . . 94
7.1.2 OS service Pareto optimal configuration . . . . . . . . . . . . 98
7.2 OS service distribution . . . . . . . . . . . . . . . . . . . . . . . . . 100
7.2.1 Load balancing over fork distribution patterns . . . . . . . . . 101
7.2.2 Load balancing over RSoCs . . . . . . . . . . . . . . . . . . 102
7.3 Chapterconclusion ........................... 106
vi
CONTENTS
8 Case study 107
8.1 Reconos................................. 107
8.1.1 SESs support ......................... 108
8.1.2 SESs implementation..................... 109
8.2 TripleDES ............................... 113
8.3 Results.................................. 115
8.4 Chapterconclusion ........................... 119
9 Conclusion and future directions 121
9.1 Abriefsummary ............................ 121
9.2 Futuredirections ............................ 123
9.2.1 Distribution and optimization . . . . . . . . . . . . . . . . . 123
9.2.2 Online model checking and recovery . . . . . . . . . . . . . . 125
Author’s Publications 129
References 131
vii
CONTENTS
viii
List of Figures
2.1 Embedded Systems: flexibility vs. performance . . . . . . . . . . . . 9
2.2 Different RH and GPP connection strategies . . . . . . . . . . . . . . 9
2.3 Reconfigurable computing as accelerator . . . . . . . . . . . . . . . . 10
2.4 FPGA basic inner components . . . . . . . . . . . . . . . . . . . . . 11
2.5 Virtex II pro CLB and slice basic elements . . . . . . . . . . . . . . . 12
2.6 Xilinx Virtex-II pro structure . . . . . . . . . . . . . . . . . . . . . . 12
3.1 Centralizedtopology .......................... 23
3.2 Hybridtopology............................. 26
3.3 RSoC and middleware architecture . . . . . . . . . . . . . . . . . . . 28
3.4 GPP area representation . . . . . . . . . . . . . . . . . . . . . . . . . 30
4.1 OS service with two partitioned implementations . . . . . . . . . . . 39
4.2 SESgeneralstructure.......................... 40
4.3 OS service realtime adaptation Example 1 . . . . . . . . . . . . . . . 43
4.4 OS service realtime adaptation Example 2 . . . . . . . . . . . . . . . 44
5.1 OS service configurations exhaustive exploration search . . . . . . . . 48
5.2 General OS service model with edge representation . . . . . . . . . . 50
5.3 Maximum and minimum thresholds representation . . . . . . . . . . 60
5.4 Example of OS service with random meta-data . . . . . . . . . . . . 63
5.5 Warp up example optimal configurations . . . . . . . . . . . . . . . . 65
5.6 OS service and maximum inter SESs communication overhead . . . 68
5.7 Fine granularity partitioned OS service with and without inter commu-
nication ................................. 70
5.8 A comparison of the minimum allowable SES granularity . . . . . . . 71
5.9 A comparison of partitioned and non-partitioned OS service WCET .72
5.10 A comparison of a pipelined partitioned and non-partitioned OS ser-
vice WCET .............................. 73
ix
LIST OF FIGURES
5.11 A comparison of a pipelined partitioned, non-pipelined, and non-partitioned
processingtime............................. 73
6.1 Random SESs distribution....................... 79
6.2 The general fork distribution . . . . . . . . . . . . . . . . . . . . . . 81
6.3 An OS service with its execution routing graph . . . . . . . . . . . . 87
6.4 Execution configuration using self-x routing . . . . . . . . . . . . . . 88
7.1 Evaluation of OS service suitable found configurations . . . . . . . . 96
7.2 Evaluating OS service number of configurations for variable power
andresponsetimeratios......................... 97
7.3 Evaluation of the algorithm that finds the Pareto optimal OS service
configuration .............................. 99
7.4 Normalized representation of an OS service configurations resources
requirements .............................. 100
7.5 Distribution load balancing over fork distribution pattern set . . . . . 101
7.6 Load balance distribution based on one prioritized resource . . . . . . 103
7.7 Load balance distribution of all resources . . . . . . . . . . . . . . . 105
8.1 ReconOS unified hardware/software thread modeling . . . . . . . . . 108
8.2 ReconOS OSIF and synchronization state machine . . . . . . . . . . 111
8.3 TheDESblockdiagram......................... 113
8.4 Partitioned 3DES OS service . . . . . . . . . . . . . . . . . . . . . . 114
8.5 Case study: Pipelined vs non-pipelined execution . . . . . . . . . . . 116
8.6 Case study: partitioned and non-partitioned execution . . . . . . . . . 118
8.7 Case study: pipelined and non-partitioned execution . . . . . . . . . . 118
9.1 A brief summary of OS service adaptation . . . . . . . . . . . . . . . 122
9.2 Executionnetwork ........................... 124
9.3 Recovering a possible fault in OS service execution . . . . . . . . . . 126
9.4 SESs, middleware, and model checker integration . . . . . . . . . . 127
x
List of Tables
5.1 Time optimized Ck[l]with T∗=42.................... 63
5.2 Time optimized Mk[l]with M∗=h ................... 63
5.3 Power optimized Ck[l]with P∗=19................... 64
5.4 Power optimized Mk[l]with M∗=s................... 64
5.5 Area optimized Ck[l]with A∗=19.................... 64
5.6 Area optimized Mk[l]with M∗=s.................... 64
6.1 Fields of FDAnt message . . . . . . . . . . . . . . . . . . . . . . . . 90
6.2 Fields of BDAnt message . . . . . . . . . . . . . . . . . . . . . . . . 90
7.1 Arbitrary example of OS service optimal resource requirements . . . 95
8.1 Inter communication overhead . . . . . . . . . . . . . . . . . . . . . 115
8.2 SESs FPGA and memory utilization . . . . . . . . . . . . . . . . . 116
xi
LIST OF TABLES
xii
List of Algorithms
5.1 Single criterion optimal SESs configuration . . . . . . . . . . . . . . 54
5.2 GetSuitableSESsConfigurations . . . . . . . . . . . . . . . . . . . . 56
5.3 Pareto optimization for OS services . . . . . . . . . . . . . . . . . . 61
6.1 General Fork Distribution Algorithm . . . . . . . . . . . . . . . . . . 81
6.2 Make Execution Routing Graph . . . . . . . . . . . . . . . . . . . . 86
xiii
LIST OF ALGORITHMS
xiv
Glossary
3DES Triple Data Encryption Standered, read (triple D. E. S.)
API Applications Programming Interface, read (A. P. I.)
ASIC Application Programmable Integrated Circuits, read (asic)
CLB Configurable Logic Block, read (C. L. B.)
CPLD Complex Programmable Logic Device, read (C. P. L. D.)
CPU Central Processor Unit, read (C. P. U.)
ComEl computational element
FDP Fork Distribution Pattern set, read (F. D. P.)
FPGA Field Programmable Gate Array, read (F. P. G. A.)
GFD The General Fork Distribution Algorithm
GPP General Purpose Processor, read (G. P. P.)
IM A set of all the implementations of an OS service.
ISA Instruction Set Architecture, read (I. S. A.)
LUT Look Up Table, read (L. U. T.)
MC Model Checking, read (M. C.)
MDE Metadata Descriptor Entity used in execution graphs, read (M. D. E.)
OSIF Operating System Interface, read (Osif)
OSR OS Services Repository.
OS Operating System, read (OS)
PAL Programmable Array Logic, read (P. A. L.)
PLA Programmable Logic Array , read (P. L. A.)
PLD Programmable logic device, read (P. L. D.)
QoS Quality of Service, read (Q. O. S.)
RC Reconfigurable Computing, read (R. C.)
RH Reconfigurable Hardware, read (R. H.)
RSoC Reconfigurable System on Chip, read (R. soc)
SES Small Execution Segment, read (ses)
SPLD Simple Programmable Logic Device, read (S. P. L. D.)
V HDL V HSIC Hardware Description Language, read (V. H. D. L.)
xv
Glossary
V HSIC Very High Speed Integrated Circuit, read (V. H. S. I. C.)
WCET Worst Case Execution Time
WSN Wireless Sensor Network, read (W. S. N.)
mnumber of implementations in each OS service.
nNumber of SESs in each implementation of an OS service.
xvi
CHAPTER 1
Introduction
An embedded system can be defined as a system whose principal function is not com-
putational, but which is controlled by a computer embedded within it [147]. Such sys-
tems are with an ever increasing uprising as many applications are moving into their
realm. Examples include, but not limited to, home devices (e.g. burglar alarm, toys,
and games), multimedia, industrial automated systems, and automotive applications
(e.g. climate control, breaks, and engine control).
The demand of more computational capabilities and the evolution of electronic sys-
tems pushed traditional silicon designers into combining more than one computational
element in one chip. This led to the design of complex hardware known as System on
Chip (SoC).
Currently a SoC can contain one or many types of microprocessors, reconfigurable
fabrics, application-specific hardware, and memories, all communicating via an on
chip interconnection network.
Although the integration of more than one component (e.g. microprocessor) increases
the complexity of software development, the integration of reconfigurable hardware
on fabric, to form a Reconfigurable SoC (RSoC), would increase the complexity to a
totally new level. This is because, unlike other components, these fabrics interconnec-
tion and functionality can be dynamically programmed and reprogrammed at runtime.
Nevertheless, this also introduces the benefits of adaptation, flexibility, and high per-
formance which made such systems very attractive [141].
1
1. INTRODUCTION
Furthermore, the current trends are in the direction of networked SoCs. In most em-
bedded applications, SoCs tend to exist in some sort of a distributed system. This
introduces an additional constraint on the design of this kind of embedded systems:
systems comprising a collection of embedded nodes communicating over a network
and requiring, in most cases, a high level of dependability [158].
In order to utilize such systems efficiently, management software development, namely
operating systems and middlewares, emerged as the most critical challenge. Such soft-
ware would enable many useful facilities. For instance, two products could share the
use of the same reconfigurable fabric for high performance execution, a product could
utilize the reconfigurable fabric along with other computational elements simultane-
ously for maximum performance and adaptation, a product implemented on recon-
figurable fabric can update itself, or a product already assembled with errors can be
corrected dynamically at runtime and without changing the hardware platform.
An operating system or a middleware for embedded systems also has to monitor, co-
ordinate, and optimize the usage of the different resources. This is very important
because embedded systems are limited in resources. The general objective is then to
utilize the available resources efficiently without violating any constraints. This ob-
jective, in addition to adaptation and reconfiguration, is the general direction of this
work.
1.1 Motivation
In distributed RSoCs system, applications are expected to dynamically arrive and leave.
With multiple processors and reconfigurable fabrics on one chip, application multitask-
ing is another common property. This creates a dynamic and unpredictable change in
demands, requirements, and environment resources.
Current management software, either middleware or operating systems, tend to run
only on general purpose processors. Furthermore, their offered services usually are
implemented with static resource requirements. This means that for every time an OS
service is required to execute, for the same amount of data to be processed, the same
amount of resources is needed. This fixed amount of resource requirements does not
meet the dynamic application demands and the environment unpredictable change in
resources.
2
1.2 Thesis contribution
In other words, the use of old management software for newly emerging embedded
systems, namely RSoCs, does not efficiently utilize the computational capabilities of-
fered by such systems.
An environment with dynamic resources availability and unpredicted application de-
mands requires flexible and adaptable management. It requires an operating system
service with the ability to provide a quality of service with dynamic and adaptable be-
havior in terms of execution time and resources requirements. An operating system
service that could efficiently utilize the computational capabilities offered by a dis-
tributed multi-processor on chip system. An operating system service that could use
many powerful techniques (e.g. hardware/software partitioning) offered by the integra-
tion of reconfigurable and general purpose computing. Such operating system service
would enable many significant aspects such as resource sharing, fault tolerance, recov-
ery, and dynamic adaptation to the variable applications demands and the changing in
resources availability.
1.2 Thesis contribution
In this thesis, a novel OS service design/model is investigated. This design is intended
for distributed RSoCs system. It enables an OS service to adapt its resources require-
ment to accommodate the variable and unpredicted applications demands. This in-
cludes the adaptation to provide for real-time applications, power optimization, perfor-
mance, resource sharing, and load balancing. Furthermore, algorithms, strategies, and
methodologies to support the adaptation in either single RSoC or distributed RSoCs
were developed.
The work done in this thesis is well recognized by the research community as the
majority of this work has been published in many highly ranked conferences. This
also demonstrates the relevance of the investigations carried out in the scope of this
thesis to the state-of-the-art research.
The OS service design was first presented in [7]. This includes the algorithms that in-
vestigate the adaptation of the OS service under limited or scarce RSoC resources. In
[5] the model was generalized with formal description and algorithms to find pareto op-
timal solution to run an OS service. The challenges and the algorithms for distributing
the OS services with load balancing in mind was presented in [6]. The communication
overhead introduced from the new design and a workaround solution with a case study
are presented in [2]
3
1. INTRODUCTION
The developed service model allows for more than adaptation. It provides for error and
fault tolerance and it has a built in support for recovery [8,6,3].
Other possibilities and methodologies involving biologically inspired algorithms for
distribution, resource sharing, and cooperative execution of OS services are presented
in [4].
1.3 Thesis structure
This thesis is structured as follows
Chapter 2outlines some background knowledge necessary to the context of the sub-
sequent chapters. It also summarizes relevant related work which includes soft-
ware based (re)configurable embedded operating systems, reconfigurable hard-
ware support in operating systems, and resource management support in embed-
ded operating systems.
Chapter 3presents the different distribution topologies, their weaknesses, their strengths,
and the reasons for choosing the topology used in this work. Furthermore, it
presents the structure of the middleware that exists on every RSoC to support the
new design of the adaptable OS service.
Chapter 4discusses the objectives required in an adaptable OS service running on a
system of distributed RSoCs. It then introduces the novel design of the adaptable
OS service which provides for the aimed objectives. An example is also shown
to demonstrate the realtime adaptation ability of such OS service design.
Chapter 5describes the OS service model formally and introduces the constraints,
algorithms, and the methodology followed to obtain the required OS service
configuration under different demands and criterions. It also discusses the par-
titioning problem and the use of pipelining technique as a way to minimize the
inter partitions communication overhead.
Chapter 6shows different developed algorithms for distributing the OS services over
the RSoCs taking into mind the load balancing and routing. It also points out the
use of encoding as a way to provide for error and fault tolerance.
Chapter 7evaluates the methods and algorithms developed in this thesis. This in-
cludes the runtime algorithms used in finding an OS service configuration under
4
1.3 Thesis structure
different criterion and the ones used for distribution and load balancing. The
evaluation is carried out using the ShoX simulator.
Chapter 8introduces a case study which uses the triple DES encryption standard as
an OS service. The case study is used to compare the time required to exe-
cute a non-partitioned OS service and a partitioned OS service with and without
pipelining. The case study shows the inter partitioning communication overhead
and the different configurations obtained from the partitioned OS service. The
ReconOS was used to manage the underlying platform.
Chapter 9summarizes and concludes the work done in this thesis. It also presents the
future directions which include developing tools for automated partitioning and
for integrating an online model checker into the system.
5
1. INTRODUCTION
6
CHAPTER 2
Background and related work
Early embedded systems consisted of microcontroller, memory, analog devices, and
some I/O signals. The complexity of embedded systems has increased as more and
more applications are utilizing them. Examples include digital cameras, mobile phones,
handheld gaming consoles, and many electronic devices used in medical, multimedia,
communication, and automotive industries.
The continuing quest for maximum performance and flexibility resulted in the emerge
of heterogeneous architectures being combined on a single chip. An example of such
architectures is the embedding of one or more General Purpose Processors (GPPs)
into a reconfigurable fabric (e.g. Field Programmable Gate Array (FPGA)). This
forms what so-called a Reconfigurable System on Chip (RSoC), see Section 2.1. In
addition, the distribution of such heterogeneous embedded systems offers more com-
putational power and capabilities [71,72,37]. However, this does not come without
penalty. The development stages to design an embedded system is considerably com-
plex [131,127,91]. It is an interdisciplinary activity involving many research areas.
It goes from abstract level modeling and simulation, through software, hardware and
platform design to hardware and software synthesis and testing [105].
As the complexity of embedded systems increases, the requirement for management
applications or Operating Systems (OS) becomes a necessity. The role of an OS is to
provide a transparency layer to applications in order to make the development cycle
a little bit faster. An OS would provide for resource management, synchronization,
7
2. BACKGROUND AND RELATED WORK
communication, multitasking, and any specific requirement such as reconfiguration
for FPGA.
In this chapter, a background and related work is presented. In Section 2.1, we go
through a brief overview of the reconfigurable systems and the flexibility offered by
FPGA and the co-design. After that we outline some operating systems for embedded
systems offering adaptation or reconfiguration support, see Section 2.2. Finally, a
conclusion is given in Section 2.3.
2.1 Reconfigurable systems
Embedded systems can be classified into four categories. These are the Application
Specific Integrated Circuit (ASIC), the Instruction Set Architectures (ISA) or the
General Purpose Processor (GPP), the Reconfigurable Computing (RC), and the hy-
brid architecture which combines different architectures such as the ISA with the Re-
configurable Hardware (RH).
ASIC is characterized with its high performance. A hard-wired technology is used
to customize it for the purpose of a specific application. As a result, an ASIC lacks
flexibility and it suffers from a long and very expensive development cycle.
On the other hand, ISA or GPP offer much flexibility. Applications running on a
GPP are stored as a sequence of instructions in memory. Any functionality has to be
represented by a set of well-defined instructions, hence the name ISA. This however
comes with the price of low performance but a relatively fast development cycle.
RC are custom hardware functions with reconfigurability characteristics. They com-
bine flexibility, performance, and relatively low development cost and cycle. This
brings them somewhere in between ASIC and ISA, see Figure 2.1.
RH or RC can be accompanied with GPP using different strategies, examples are the
loose attachment using a standard bus such as PCI express as in Figure 2.2(a), connect-
ing both of them to the local bus as in Figure 2.2(b), connecting both of them using
a special dedicated bus as in Figure 2.2(c), or integrating both of them on one chip
to form an RSoC as in Figure 2.2(d). This combination of heterogeneous computa-
tional elements in the same architecture allows drawing on strength from each other
and increasing the overall system performance and efficiency.
A RH was used as a special configurable high performance accelerator to perform in-
tensive computations [66,63,59]. Recently, many research activities and trends are
8
2.1 Reconfigurable systems
Figure 2.1: Flexibility and performance of embedded systems
(a) RH is loosely connected to GPP
using standard buses like PCI express.
(b) RH is attached to the local bus like
the GPP
(c) RH is attached directly to GPP us-
ing special bus
(d) GPP is integrated into RH,
(RSoC)
Figure 2.2: Different RH and GPP connection strategies
9
2. BACKGROUND AND RELATED WORK
toward using RH as stand-alone processing unit [98,9,74,67]. By utilizing the recon-
figurability characteristics offered by the RH, hardware high performance resources
can be shared and reused by many applications at runtime, see Figure 2.3. Re-
Figure 2.3: Reconfigurable computing implements compute-intensive application kernels
(a) as hardware in Reconfigurable Hardware (RH) and the remaining code in software on a
CPU (b). Run-time reconfiguration allows RH to implement circuits that would otherwise
not fit simultaneously (c) [63]
configurable Hardware (RH) or Programmable Logic Devices (PLDs) are a class of
integrated circuits that can be reprogrammed multiple times. They include many sub-
classes such as Simple PLD (SPLD), Complex PLD (CPLD), and FPGA. SPLD
is normally composed of two logic planes, an AND- and an OR-plane, with one or
both of them being programmable. Examples include the Programmable Logic Array
(PLA)and the Programmable Array Logic (PAL). These are good for implement-
ing simple sum of product functions. CPLD is a collection of SPLDs connected by
a programmable interconnection network. FPGA is the related subclass of this work
as it provides the required structure for runtime and partial reconfiguration techniques
[73,22].
2.1.1 FPGA and reconfiguration
FPGA is a programmable device consisting of a set of Configurable Logic Blocks
(CLB), a programmable interconnection network, and a set of programmable input
and output ports, see Figure 2.4. Other custom function units such as multipliers may
also be included (e.g. Xilinx VirtexTM family[149]).
Nearly all the elements of FPGA can be programmed and reprogrammed by the user.
Depending on the complexity of a logical function, it can be implemented in FPGA
10
2.1 Reconfigurable systems
Figure 2.4: FPGA basic structure
using one or a combination of CLB’s. The components in each CLB depend on the
manufacturer, for instance a Xilinx Virtex II pro CLB consists of four similar slices
split into two columns with two independent carry logic chains and one common shift
chain, see Figure 2.5(a). A simplified slice contains a register, a multiplexer, and a
Look Up Table (LUT) which can also be used as RAM or Shift Register (SR), see
Figure 2.5(b). Many FPGA fabrics, such as Xilinx VirtexTM family [149] and Atmel
FPSLICTM family [15], are mounted with one ore more processor cores (e.g. PowerPC
or AVR ) on a single chip. Others[149,12] use softcore processors (e.g. NIOS or
MicroBlaze) to emulate an RSoC.
In this work, the Xilinx VirtexTM-II Pro (XP2V30) is used in the case study presented
in Chapter 8. The basic structure of this RSoC is shown in Figure 2.6. It incorporates
two PowerPC405 processors on a single chip within an FPGA fabric. Each processor
block is connected to the other components through the CoreConnectTMbus architec-
ture, composed of the Processor Local Bus (PLB), which provides low latency access
to peripherals requiring high performance, and the On-chip Peripheral Bus (OPB) for
Slow peripherals. Xilinx VirtexTM-II Pro FPGA and its related software development
tools provides for dynamic partial reconfiguration [151,152,122] which is a key ele-
ment in resource sharing and runtime reusability.
11
2.1 Reconfigurable systems
2.1.2 Co-design and partitioning
The term hardware/software co-design appeared after the emergence of platforms com-
bining Central Processor Unit (CPU) with several hardware accelerators (e.g. ASIC)
[45]. Hardware/software partitioning determines what application functions should be
run on hardware accelerators and which to be run on GPP or CPU. The emerging of
FPGAs with high-performance GPPs have proven to be an important implementation
medium for hardware/software co-design [17,96,101]. Since FPGAs are preexist-
ing reprogrammable chips, they allow co-design to be used in both low-volume and
higher-volume applications.
Many algorithms and systems have been developed to help determining which tasks to
run on hardware (e.g. FPGA) and which to run on software (e.g. GPP).
The COSYMA[51] is an early hardware/software partitioning system. It was designed
to implement applications with computationally intensive nested loops. Algorithms are
used to determine how many of these nested loops should be implemented in hardware
accelerator.
The global criticality/local phase (GCLP) algorithm[86] performs two sweeps to de-
termine the partitioning. The global criticality sweep performed between software and
hardware improves the performance of the application’s critical path. The local sweep
is then performed to reduce the hardware cost.
Another algorithm [148] starts with a general platform template. Using a separate
processing element for each task, it tries to allocate each task on the fastest possible
processing element. The algorithm then reduces the system cost by reallocating tasks
and eliminating components.
Other algorithms involving hardware/software co-design include the LYCOS [102] al-
gorithm, the SpecSyn [62] specify-explore-refine methodology, the COSYN [36] algo-
rithm, the MOGAC [39] genetic algorithms, The Eles et al. [46] simulated annealing
and tabu search comparison, and other related research like in [90,38,35].
All the presented approaches deal with static tasks placement at design time. They all
try to find the fastest way to execute an application. Some of these algorithms take into
account the minimization of one resource which is the hardware area or load. However,
there is no optimal solution because the hardware/software partitioning problems are
well-known to be NP-problems [14,85,61].
13
2. BACKGROUND AND RELATED WORK
Code variation and resources
The hardware/software partitioning algorithms aim to improve the execution efficiency
and performance of an application. However, these can also be improved by the coding
style [32,29]. Having a partition chosen to be executed on a specific computational
element is not the end of story. One can choose different coding styles for different
resources (e.g. power, hardware area) utilization. This is true even for code state-
ments with similar functionality. For example, the synthesis of an IF-clause statement
produces a different logic at the structure level from the CASE-statement [110]. The
number of FPGA LUTs to represent a logic function as well as its estimated power
consumption depends on the structural representation of that function [126].
The variations obtained by different partitioning, implementations, and heterogeneous
designs allow for more flexibility. However, this introduces the problem of manage-
ability and platform complexity which demands the existence of an OS. An OS acts
as a transparent layer by providing a clear interface to an application while hiding the
complexity of the low level platform details.
2.2 Operating system for embedded systems
Embedded systems are customized to meet different functional and non-functional re-
quirements. Therefore, they are heterogeneous and limited in resources. In addition,
many embedded systems are equipped with reconfigurable hardware and/or multiple
general purpose processor units. Although this enables many powerful features such as
flexibility and parallel processing, it also increases the complexity and requires special
OS that provides lower cost and higher performance.
One solution is to develop an OS from scratch for each specific embedded system.
However, this wastes a lot of development time, has poor flexibility, and makes it
error-prone. Another approach is to develop an adaptable reconfigurable OS that can
be customized to meet different needs and variations. This would lead to an underlying
matured operating system and it would be an appropriate solution [106]. In this section
we point out some of the existing OS systems and their support to adaptation and to
hardware re-/configuration.
14
2.2 Operating system for embedded systems
2.2.1 Configurability in embedded operating systems
Many embedded systems require an operating system with low cost, fault tolerable,
quick time to market, and real-time and reconfiguration support. As a result, much
research was conducted aiming to find such an operating system.
As there are many OS’s designed for embedded systems, surveys became necessary.
These surveys identify and classify the existed embedded OS’s based on many criteria
such as configurability (dynamic or static), architecture (monolithic or modular), and
scheduling (realtime or non-realtime).
In [57], the author classified embedded OSs based on their ability for configuration.
The survey classified Exokernel [49] and SPIN [18] with some form of configurabil-
ity/extendability which is limited to a fixed predefined amount of functionality. In
addition, the survey identifies a number of OSs with some form of reconfiguration
capabilities. This includes Choices [30] with dynamic loading of its OS classes, Coy-
ote [20] with dynamic change of event handlers, 2K [92,93] with dynamic loading of
user components, JavaOS [128] with dynamic detection of drivers, Jbed [82] with user
components downloading ability, MMLite [75] with OS components replacement ca-
pability, and Pebble [60] with dynamic OS services loading. The VxWorks [143,136]
and QNX [76] provide optional modules that can be statically or dynamically linked to
the operating system. Other OSs such as OS-Kit [55], PURE [19], and eCos [44,106]
were classified to have no support for dynamic reconfiguration. These OSs provide
components and functionalities which can be linked or executed to OS, hence config-
ured, at design time. A similar survey based on OS configurability is also made in
[137]
Another survey which concentrates on OSs for Wireless Sensor Networks (WSN) is
done in [124]. In WSN, each wireless sensor node can be considered as a System on
Chip (SoC) node, hence it has limited resources such as power, memory, and communi-
cation bandwidth. The author in [124] classified OSs based on architecture, execution
model, reprogramming ability, realtime support, power management, portability and
simulation support. The survey identifies the SOS [130] to support reconfigurability
by runtime loading and removing of system modules. However, it has some prob-
lems like the lack of memory management and improper message handling between
modules. MantisOS [21] support reconfiguration by providing a library built into the
kernel which can be used by applications to write a new code. However, this requires
a software reset and it is limited to remote login and changing of variables/parameters.
Contiki [42] supports dynamic loading and unloading of services with some memory
management problems. CORMOS [153] maintaining a static size table for allocating
15
2. BACKGROUND AND RELATED WORK
and de-allocating of modules such as particular routing protocols. Modules related to
user applications and core system extensions can only be loaded during system initial-
ization. DCOS is a data centric OS with dynamic loadable modules. The loading of
modules is not considered as a kernel task and is done separately.
Another operating system is the Organic Reconfigurable Operating System (ORCOS).
It aims to use biologically inspired methods [77] to create a highly customizable but
easily configurable operating system suitable for any kinds of embedded hardware.
The ORCOS OS is based on the work done in [41].
Other surveys were conducted concentrating on realtime operating systems and general
embedded operating systems in [16,142].
Resource management
Because embedded systems are limited in resources, it is an important challenge to
manage the resource wisely. As a consequence, many operating systems are equipped
with some kind of resource management units. These units are either in the core of the
operating system or co-existing as a middleware layer.
Examples of operating systems with some resource management like energy include
TinyOS [123], MantisOS [21], SenOS [132], and Nano-RK [52].
In the realm of resource managers or middlewares we mention the Adaptive Resource
Management (ARM) [43], the Dynamic QoS Manager (DQM) [23], the Quality-based
Adaptive Resource Management Architecture (QARMA) [54], and the Flexible Re-
source Manager (FRM) [113].
The resource management in such systems can be summarized by assuring the realtime
or high priority applications the resources they need to complete there tasks. However,
none of these solutions can be applied straightforwardly for managing reconfigurable
hardware resources. The main reason is that the reconfigurable hardware is changing
its behavior per application, unlike the GPPs, which have a fixed hardware organization
regardless the programs running on them.
In all of the above, OS or middleware is implemented to run on GPP. There is no
support for any type of reconfigurable hardware. The reconfigurability/adaptability is
merely related to loading or unloading of some system components or functionality
either statically at design or dynamically at runtime.
One of the first to discuss the support of reconfigurable hardware in an operating sys-
tem is Brebner [24,25]. He proposed swappable logic units that can be switched in and
16
2.2 Operating system for embedded systems
out of a partially reconfigurable device which is driven by an operating system. After
that, many research parties were discussing the support of reconfigurable hardware in
OS as outlined in the following section.
2.2.2 Operating systems and reconfigurable elements
Much research was carried out in the direction of supporting reconfigurable elements
(e.g. FPGA) within operating systems. In [125,95,40], research is done for hiding re-
configuration latencies by prefetching, context switching and resource reusage among
hardware tasks. Migration between software and hardware is discussed in [87,10]
In [64,140,133,139] an OS system model, architecture, and several algorithms to
support and manage the sharing of resources in a reconfigurable computational element
are proposed. OS includes services for a partial reconfiguration and scheduling of
dynamically incoming threads.
In [144,145,146], they discuss an implementation of an operating system, ReCon-
figME, that has the ability to share its FPGA dynamically among multiple executing
applications. Other computational elements were not considered.
In [104,111], they present the design of an OS, OS4RS, that supports the reallocation
and mapping of tasks to the heterogeneous computational elements of RSoC. The re-
configurable hardware is integrated into a multiprocessor system which is completely
managed by the OS. OS4RS employs a dynamic two-level scheduling technique, top-
and local-level scheduler. The top-level scheduler, implemented in software (GPP),
stores the running tasks as a linked list. It is to map these tasks to the various heteroge-
neous computational elements. The local-level scheduler, can be implemented in either
software or hardware (FPGA), manages the scheduling of its related mapped tasks on
its local computational element. For task context-switching and migration between
heterogeneous computational elements, they propose the saving of task states. How-
ever, the question remains of how they intend to translate between GPP register set and
reconfigurable logic.
The SHUM-uCOS [156,112] proposes a real-time OS for reconfigurable systems em-
ploying a uniform multi-task model. The OS manages the utilization of reconfigurable
resources. Using static hardware task preconfiguration, it improves the parallelism of
the tasks.
In [74], an OS, BORPH, is introduced. It is specifically designed for reconfigurable
computers. The design process is improved by sharing the same UNIX interface among
17
2. BACKGROUND AND RELATED WORK
hardware and software tasks. BORPH treats the reconfigurable computational ele-
ments as first-class computational resources instead of coprocessors. It contains three
basic components: a conceptual hardware process and two sets of universal interfaces
(input/output registers and hardware file input/output interface). The hardware tasks
executed on reconfigurable computational elements do not share reconfigurable re-
sources.
In [98,100], the ReconOS is discussed. It offers a uniform hardware (FPGA) and soft-
ware (GPP) thread model. This is done by modifying an RTOS, (e.g. eCos [44,106]),
to integrate hardware interface components called the OS Interface. Both the hardware
and software threads are treated equally at scheduling time with no regard to constraints
or resources management. ReconOS offers methods for threads communications and
synchronization. Another related research with some similarities to ReconOS is pre-
sented in [116,13]. The ReconOS is used in the case study presented in Section 8.1.
All the above research is concentrating on issues related to applications/tasks being
implemented on reconfigurable computational elements. Yet, current state-of-the-art
architectures do not provide efficient holistic solutions for accelerating multithreaded
applications by reconfigurable hardware [154].
On the other hand, implementing OS services on FPGA would improve the perfor-
mance and efficiency of OS [48]. One close related work tackling such area is the
work done in [67]. OS services were implemented on both hardware and software with
migration ability between the two implementations. Heuristic scheduling algorithms
based on Binary Integer Programming (BIP) were developed for services assignment
issues to either hardware or software. These algorithms have the complexity of θ(n2).
The reconfigurability of the system happen when services migrate from hardware to
software or vice versa based on resources and balancing criterions. At the time of mi-
gration, both of the OS service implementations have to co-exist in the system. The
presented OS has no support to distributed embedded systems, resource sharing, and
provides a very limited adaptability to applications needs.
Our proposed system design provides a solution to such limitations by combining the
multiple implementations with dynamic hardware/software partitioning to support an
OS service with the objectives presented in Section 4.1.
18
2.3 Chapter conclusion
2.3 Chapter conclusion
The combination of one or more GPP with an FPGA on a single chip provides an em-
bedded system, namely RSoC, that provides both the flexibility and high performance.
Much research has been carried out to find an OS suitable for managing embedded
systems. The main aim was maximum performance and dynamic reconfigurability.
These OSs were either dealing with the GPP part of the embedded system or providing
support for applications to run on reconfigurable hardware. None except the work done
in [67] considered runtime configuration and utilization of FPGA by the OS services
themselves. Furthermore, many of the component-based reconfiguration methodolo-
gies utilize large components and do not address size, real-time performance, power,
and cost issues. Another main problem with component-based systems is the con-
figuration process itself. Issues such as selection, parameterizations of components,
analysis, and choice of proper components are not fully addressed [57].
In a limited resource RSoC embedded system there are three main constraints to be
considered. These are, execution time, power, and FPGA area or GPP load/utilization.
The objectives for optimization and utilization of resources for an OS service is differ-
ent from that of an application. An OS service is required to provide its service under
variable constraints which are specific to each application and to the availability of re-
sources at the time of request, see Section 4.1. Thus a new design is required to enable
OS services to utilize the RSoCs platform in order to provide their services efficiently
and dynamically at runtime.
19
2. BACKGROUND AND RELATED WORK
20
CHAPTER 3
System design and architecture
3.1 System design
Distributing a number of RSoCs can be done based on many topologies. The cen-
tralized topology is one where an RSoC or a capable node in the system is used as a
centralized coordinator. All communications, control, and management are performed
through or with the acknowledgement of that RSoC. The benefit of such a topology is
the relative easiness of management, data routing, and load balancing. On the other
hand, if this coordinator RSoC fails, the whole system will fail as well. Such single
point of failure is considered a non-tolerable risk for critical and real time applications.
Another topology is the fully distributed system, in which every RSoC has its own
responsibility of coordination, data routing, load balancing, and resource sharing. In
case of one RSoC failing, the system may not be affected and the system continues
to operate normally. Such a system is more reliable in terms of fault tolerance and
error recovery. The management of the fully distributed system is a work shared by
all the RSoCs in the system. This requires effective algorithms for cooperation and
coordination.
A better topology would be the one which combines the benefits of both; the central-
ized and the fully distributed topologies. A hybrid topology can work as centralized
when it is beneficial, but without losing the fully distribution capabilities. This system
could be constructed with a centralized RSoC which is important in the initialization
21
3. SYSTEM DESIGN AND ARCHITECTURE
stage. The RSoCs do not need this centralized RSoC afterwards. However, the exis-
tence of such an RSoC is beneficial.
In this work, the hybrid topology is chosen. In the subsequent sections we discuss
the reasons behind our choice. This is presented by pointing out the strengths and
weaknesses of each one of the three topologies.
Having one topology in mind, we describe the architecture assumed in each RSoC to
support the distributed cooperation to run an adaptable OS.
3.1.1 Centralized distribution
In this topology the distributed RSoCs are directly or indirectly connected to a cen-
tralized RSoC. It is also possible to logically divide the RSoCs into sets (colonies or
clusters), as shown in Figure 3.1, which may or may not be connected. In each colony
there is at least one node or RSoC operating as OS Services Repository(OSR) node.
Every RSoC in a colony is connected to OSR. The OSR provides OS services needed
for the RSoCs in the colony. It contains enough resources (memory, power etc.) to
keep and manage the OS services and requesters.
In a colony, RSoCs can only exchange OS services via the OSR. Communication bet-
ween colonies can be done through the OSRs. Every RSoC in a colony, consists of at
least one FPGA and one GPP. RSoCs are to keep track of basic information about their
resources such as the used area, and the current power consumption. This information
are sent to OSR upon request.
When an RSoC joins a colony, it sends all of its resource information to the OSR
belonging to that colony. The OSR keeps a record of every RSoC in the colony. The
RSoC record is updated when a change happens in the RSoC resources or a service is
being requested and it is executed on that RSoC.
The change in an RSoC resource can be due to a newly arriving task or due to a change
in the RSoC’s current tasks resource requirements. This depends on the applications
dynamics expected in a distributed system, the scheduler, and whether a manager and
a profiler are used. An example of the influence imposed by a manager/profiler combi-
nation is presented in [113]. The author, in [113], uses the so-called Flexible Resource
Manager (FRM) to control the share of resources each application/task is using.
RSoCs other than the OSR hold no OS services. This is to save as much resources as
possible for applications/tasks. Another reason is that not all of the OS services are
needed every time. In case an application/task needs a service, a request is sent to the
22
3.1 System design
Figure 3.1: RSoCs centralized distribution topology
OSR with information about the current RSoC resource status. The OSR finds a suit-
able service configuration, see Chapter 5, to execute on the requested RSoC and sends
it back. Upon receiving the service, the RSoC uses it to process the application/task
request. When the service is no longer needed, the RSoC can discard it to free more
resources for applications/tasks.
In the centralized topology, all the OS services, services scheduling, recovery algo-
rithms, and adaptation control are done in the OSR. The RSoCs are merely providing
for services execution and resource status reporting. So OSR plays an important role
in this topology which makes it, in case it fails, a non-tolerable hazard.
The failing of OSR in a colony will lead to a whole colony crashing. This single
point of failure can be worked-around by providing some replica of the OSR, however,
this still is not acceptable, especially if the RSoCs are working in uncontrolled critical
and risky environments [114]. Moreover, the RSoC which to hold a complete OS
and to act as an OSR is required to have enough resources to manage both the OS
23
3. SYSTEM DESIGN AND ARCHITECTURE
and the requesters. Replicating such a capable RSoC may not be a feasible solution.
Furthermore, the OSR may become a bottleneck of the whole system in case of many
requesters.
3.1.2 Fully distributed topology
Another solution is to eliminate the OSR. This requires the distribution of all the OS
services over the RSoCs in the distributed system. Doing so we have no longer any
single point of failure. One service can be replicated or encoded to exist on more
than one RSoC. A failing RSoC may have a little influence on the running system.
Some encoding techniques can be adopted to recover lost data. This remains with
some limitations as it depends on the number of the RSoCs in the system and on
the type of the used encoding. More discussion on encoding is presented in Section
6.1.2. However, in case of a fully distribution topology, many problems arise like load
balancing, service localization, routing, and resource sharing.
RSoCs are limited and heterogeneous in resource amount and availability. Some may
have a lot of tasks/applications running while others have a few. The sharing of re-
sources which each RSoC offers to keep and maintain an OS service, has to be rela-
tively fair with relevance to its application/task load and the amount of its resources.
On the other hand, distributing OS services over RSoCs require an effective way of
finding these services when needed. Locating services can be performed using many
techniques. One technique can be the requester RSoC to send a search message to
allocate a service when needed. An announcement message is another technique. A
broadcast message can be sent from each RSoC about its services. A record of where
each OS service is located, is maintained in each RSoC. Or, some RSoCs can be chosen
to maintain records about each service and where it can be located. These RSoCs are
supposed to be well-known to all the others.
Concerning the above techniques, each one has its benefits and drawbacks. For in-
stance, the first technique is not deterministic. Though it saves resources by not storing
records about services locations, it has no guarantee of finding the location of an OS
service in deterministic time. The last technique suffers the single point of failure as in
the centralized distribution. It is faster in finding where a service is located, but in case
the RSoC containing the records fails, the whole system may fail as well. The second
technique however is more tolerable to failure. It is faster in OS service allocation but
it requires storing a record of the services locations in each RSoC. Moreover, if not
carefully managed it may cause communication flooding.
24
3.1 System design
The routing of data between the requester RSoC and the RSoC maintaining and run-
ning the requested service is another important issue. A routing path is required to
be maintained to assure the requested QoS. However, allowing all RSoCs to use the
same path simultaneously may cause a non-tolerable delay. Moreover, requesting a
service from the same RSoC every time, even when there are alternatives, may drain
the resources (e.g. power) of that RSoC. So the routing and the allocation has to be
managed carefully to avoid such non-preferable situations.
Having found a service location and having established a routing path are not the end
of story. The way the services are distributed and if the location of the service is
chosen wisely are key elements in minimizing communication time. The number of
copies of each service in the system also play an important role. It remains to decide
whether it is better to migrate data to the RSoC holding the service and process data
off site or to migrate the service and process data locally. All of these routing and
migration decisions are relatively easy to manage in the centralized topology. In the
fully distribution topology however, this can occasionally lead to drastically complex
and sometimes unresolvable matters.
3.1.3 Hybrid distribution topology
Distributing the services over RSoCs increases the management complexity. On the
other hand, it remains more tolerable to RSoCs failure. However, if the benefits of
both the centralized and the fully distributed topologies were to be combined in a new
topology, this might be a better choice. This can be done by adding a capable RSoC
with enough resources to act as a centralized node. In the meanwhile, allow the other
RSoCs to act and work as in fully distributed topology, see Figure 3.2.
The existence of centralized node in the initialization stage enables fast OS service
discovery and localization at execution time. The centralized node is used to distribute
the OS services at the initialization stage. Hence, it is able to create the necessary
discovery and routing records to find and execute an OS service. Moreover, the distri-
bution of OS services can be balanced as the information about the RSoCs resources
could be collected in advance. However, the balancing is not trivial. The distribu-
tion, although it might be balanced, it is not optimal. The distributed OS services are
heterogeneous in both of their requirements and execution resources. Moreover, the
balancing is performed taking into account three main conflicted RSoC resources. In
its substructure, this balancing distribution is very similar to the knapsack problem
which is a well-known NP complex problem.
25
3. SYSTEM DESIGN AND ARCHITECTURE
Figure 3.2: RSoCs hybrid distribution topology
In terms of routing, a service may be located away from where it is frequently used,
or two dependable services may be distributed apart from each other. These issues
are very difficult to resolve at the initializing stage due to the dynamic behavior of the
system. However, at runtime, and after collecting some profiling information about the
execution of services and the application/task requirements, the system can migrate
services and reallocate them to enhance the communication and minimize delay time.
After the initialization stage, the system continues to operate as in fully distributed
topology. The existence of a centralized node, although beneficial, is not required.
The centralized node would provide a replica of the OS. It also would contain complete
routing and distribution information about all the services and the RSoCs in the system.
Therefore, such node could provide faster and efficient assistance. Moreover, in case of
an RSoC failure, it would provide a faster recovery. Nevertheless, the system provides
for recovery and for an OS replica even when the centralized node ceases to exist. This
is done by using OS services encoding during the distribution.
26
3.2 RSoC architecture
The records created during the initialization stage are very important for service ex-
ecution, adaptation, and recovery. They contain information which helps finding an
execution profile suitable to the requester application/task requirement. When these
requirements are changing due to the dynamic behavior of the system and of the appli-
cations/tasks, these records are used to change the execution profile of the service to
adapt to the current variations. In case of an execution failure during service execution,
the records are also used to recover the service and if possible to continue execution
using other non-faulty profiles without loosing much of the processed data.
Every service record is encoded and saved in more than two distinct RSoCs. Depend-
ing on the encoding used, the service record can be rebuilt even if one or two related
RSoCs failed. The service record can still be rebuilt even so all the RSoCs holding the
records fail. This however may take more time than rebuilding from RSoCs holding
some encoded parts of a record. Having done so, it exports a location pointers record
to all the RSoCs in the system. After this step, the RSoCs can operate with no regard
to a centralized node.
To support such operations, each RSoC is assumed to have the following architecture.
3.2 RSoC architecture
In the RSoC distributed system, every RSoC is consisting of at least one FPGA and one
GPP. The RSoCs are assumed to be connected to each other either directly or indirectly
through other RSoCs. Although the means of communication and the routing of data
packets is irrelevant to the topic of this thesis, an RSoC can be assumed to be using
wireless communication. Having its own limited power supply, limited FPGA area,
and other limited resources, an RSoC is required to carry out its obligations involving
communications, resource management, and supporting various types of applications.
To provide for these obligations and for the adaptable OS services discussed in Section
4.2, every RSoC has an architecture composed of a middleware and an optional virtual
machine.
3.2.1 Middleware
Traditionally, an application runs on top of an operating system, using its services
to fulfill its goal. Besides providing these services, the operating system acts as an
indirection to the hardware by managing the available resources to meet the application
27
3. SYSTEM DESIGN AND ARCHITECTURE
requirements. In our work, since the OS services have been partitioned into small
blocks, see Section 4.2, another layer, namely a middleware, is required to provide
necessary services to these blocks.
The middleware separates the OS services from the applications/tasks. Hence, it is the
one that translates the requests/responses between RSoC hardware, OS services, and
applications/tasks. To carry out its duties, the middleware intended to run on RSoC is
composed of five main logical units. These are: The reconfiguration unit, the resource
monitoring unit, the control unit, the link unit, and the schedular unit. See Figure 3.3.
Figure 3.3: RSoC and middleware architecture
The reconfiguration unit
The reconfiguration unit can be further divided into two main units. The FPGA/GPP
reconfiguration unit and the communication unit. The FPGA/GPP reconfiguration unit
contains the proper routines and drivers to run an entity on either FPGA or GPP de-
pending on its implementation. This includes the runtime partial reconfiguration of the
FPGA.
The communication unit has the job of obtaining new OS services. This may require
communication with other RSoCs or the OSR. The communication unit contains the
necessary routing procedures and drivers to manage the communication and its under-
lying related hardware. Additionally, it can find the information about each OS service.
These include where to find the service and whether or not alternatives exist.
28
3.2 RSoC architecture
The process of finding an OS service depends on the control unit. This is because
the decision of which OS service/configuration to request and when to request it, is
done by the scheduler and the control unit respectively. For that, this unit is directly
connected to the control unit.
Another direct connection exists between the reconfiguration unit and the resource
monitor unit. This is required to keep track of the resources consumed during the re-
configuration unit activities. Examples include communication power, memory usage,
and FPGA area.
Although the reconfiguration unit is very important to maintain a reliable system, many
parts of this unit (e.g.: network routing) is out of this thesis scope. However, when
necessary, a brief discussion will be brought up which is related to this unit operations.
The resource monitoring unit
To keep the system up-to-date with all the occurred changes, an active resource monitor
is integrated into the middleware of every RSoC. The role of the monitor is to keep
track of and inform the control unit about the status of the RSoC resources. Any
request to resource allocation has to be through this monitor, or at least, to notify it if
such an allocation occurs. The monitor maintains a set of descriptors used to identify
the available amount of resources. In RSoCs the most important descriptors are time,
power, and area. These descriptors involve complex analyzes and in many cases may
affect each other.
The area descriptor representation is architecture dependant. In RSoC, the area de-
scriptor is divided into two parts. The hardware (FPGA) area descriptor and the soft-
ware (GPP) area descriptor. The hardware area descriptor represents the number of
look-up-tables (LUT) or the number of configurable-logic-blocks which can be used
for implementation. An implementation of a service or an application intended to
run on FPGA is normally represented using hardware disruption languages such as
V HDL. These representations are compiled to use a number of LUTs and other hard-
ware specific units (e.g.: multipliers). Obviously, an FPGA has a limited number of
LUTs and other specific units. Hence, the descriptor would hold information about the
number of used and available LUTs and the other specific units. On the other hand,
implementations intended to run on a GPP are compiled into a number of instructions
sequences which are stored in memory. So a software area descriptor may represent
the available and used memory. However, this is not accurate, especially if we are
dealing with real time applications/tasks. A better representation is the GPP utilization
or GPP bandwidth.
29
3. SYSTEM DESIGN AND ARCHITECTURE
Figure 3.4: GPP area representation
To explain such a representation see Figure 3.4. Figure 3.4-(A) shows a GPP workload
with three processes. Each process can be a normal application/task or a service. For
simplicity, let us assume that a scheduler has already scheduled the processes shown
and this schedule satisfies the time requirements of these three processes. In order
to complete the operation of each process, the GPP will cycle every T1period with
the same schedule. Now, let a new process enter the system, see Figure 3.4-(B). To
schedule the new process, the GPP has to cycle every T2interval. If the new schedule
also satisfies the timing requirements of all the processes, we could say that in Figure
3.4-(A) the cycle interval is T2with a free slot equals to 4T. A more clear description
is shown in Figure 3.4-(C). Where Tmax represents the maximum time interval the GPP
can cycle without violating the timing requirements of the scheduled processes. Tload
represents the time consumed by the scheduled processes and 4Tthe available slot.
The bandwidth can be described as 1/T. A GPP would have a maximum bandwidth
of 1/Tmax and each process would use some of this bandwidth. The GPP utilization
of a process is the workload percentage of that process. This is the percentage of the
required time for a process in one cycle to the total cycle period. GPP bandwidth or
utilization can be referred to as GPP area. This kind of software area representation
makes the amount of the total GPP area depending on the current workload and the
scheduler algorithm.
The power descriptor holds information about the amount of power an RSoC has.
The representation of available power can be measured by program instructions/units
or as power units. Each instruction of a program (application/task or service) can
be estimated by a number representing the amount of power consumed by running
30
3.2 RSoC architecture
that instruction. The total power can then be measured as the maximum amount of
instructions an RSoC can run. If a program is implemented to run on FPGA, the power
analysis would depend on the structural design of the program. In this analysis, each
structural unit (e.g.: an AND gate) is related to some amount of power consumption per
clock cycle. On RSoCs, the amount of available power can be obtained from a special
circuit designed to measure power. This can then be represented as power units, where
a power unit is the smallest amount of power consumption that can be measured in the
system. Other power consumption measurements can also be estimated using models,
simulation, or explicit analysis [78,150,31,121,89].
In case of separate power for FPGA and GPP, the descriptor can be divided into two
descriptors, the hardware power descriptor and the software power descriptor. The
amount of power available can affect the time descriptor. This is true, because the
amount of power will limit the number of clock cycles or the number of instructions
which can be executed.
The time descriptor is important for scheduling. It is more related to the process ex-
ecuting on an RSoC and it holds information which estimates the process Worst Case
Execution Time WCET. This descriptor should not be confused with the GPP uti-
lization which represents the percentage of the time reserved for a process in each
GPP cycle. The WCET can be obtained by measurements [134] or by static analysis
[135,33,70,50]
The applications link unit
The adaptation of OS services presented in Chapter 5, depends on the information
provided on both the RSoC resources and the applications/tasks. The applications link
unit provides the necessary means of communication to transfer the applicaitons/tasks
demands to the middleware. These demands are then represented as constraints or
demanded Quality of Service (QoS) an application/task requires in order to complete
its operation correctly. For instance, a realtime task may require a service to response
in a define period of time.
As there may be more than one application/task executing OS services in one RSoC, it
is the responsibility of this unit to provide the necessary channel support to link each
application/task with its corresponding OS service. The link provides the necessary
support to perform the communication transparently. This is not a trivial task as ser-
vices may exist on the same RSoC which the application resides on, on another RSoC,
or even partitioned on more than one RSoC.
31
3. SYSTEM DESIGN AND ARCHITECTURE
In order to perform such duties, this unit works closely with the control unit. The
cooperation is realized by establishing the appropriate communication of information
to accommodate the applications/tasks requirements.
The scheduler unit
A distributed RSoCs system with adaptable partitioned services requires special algo-
rithms and support. This unit holds the algorithms needed to schedule new services,
reconfigure a service, or recover a service from expected fault. To realize this, the
scheduler unit has to account for many factors. These include efficient resource usage,
meeting multi criterion constraints, accounting for communication and FPGA recon-
figuration time, and whether the execution is made on the same RSoC, on a different
RSoC, or distributed on more than one RSoC.
To carry out its obligations, the scheduler unit needs to communicate with the other
middleware units to obtain the required information. These communications are done
through the control unit which analyzes and provides the necessary data needed by this
unit.
The control unit
One important aim of this unit is to make the partitioned OS services appear mono-
lithic to themselves and to the applications using them. It provides basic services such
as memory allocation and resource management needed to run the OS services. The
control unit treats the partitions of an OS service as components with defined inter-
faces. This is achieved by encapsulating the arriving partitions (e.g: from OSR) in
component containers, which provide the required interface to manage the life-cycle
of a partition on an RSoC.
The other aim of this unit is to glue the other units of the middleware so they work
consistently with each other. All important decisions are taken by the control unit
based on the analysis of the data provided by the resource monitor unit and the link
unit.
In relevance to the work
Every unit of the middleware is very important to the whole system. However, the
majority of the work done in this thesis is related to some parts of the control unit and
32
3.3 Chapter conclusion
the scheduling unit. The full investigation of the other units, though consequential, is
out of this thesis scope. Nevertheless, some related matters will be briefly discussed as
the context requires it.
3.2.2 Virtual Machine
The virtual machine is an optional layer which may be needed if too many computa-
tional elements are within an RSoC. A virtual machine could be used as a method to
reduce the number of the considered computational elements, but may also slow down
execution. A virtual machine would provide the possibility to map one service/process
to many computational elements transparently.
Nevertheless, even without a virtual machine, the OS services work and adapt quickly,
see Chapter 5. However, it will always be a matter of storage and area versus speed
and performance. This is because the OS service design requires a number of service
implementations equivalent to the number of computational elements. Although these
will adapt and execute faster, more resources are needed to confine such implemen-
tations. On the other hand, the virtual machine would also require some resources to
operate. It will also slow down the execution of an OS service in comparison to a na-
tive platform execution. In the end, it remains to decide which used method consumes
less resources and whether more performance is needed or not.
3.3 Chapter conclusion
Each distribution topology has its strengths and weaknesses. A preferable option is a
hybrid topology which combines the benefits of the different topologies. The chosen
hybrid distribution topology has a centralized node which is required at initialization
stages for load balance distribution and the building of execution routing tables.
To enable the distributed RSoCs to support the different intended characteristics and
objectives, each RSoC is assumed to have a middleware that provides for the novel OS
service model, see Chapter 4. The middleware contains the proper units to manage and
coordinate the RSoC resources to communicate, configure, and run an OS service.
33
3. SYSTEM DESIGN AND ARCHITECTURE
34
CHAPTER 4
OS Service structure
An operating system is normally consisting of many services. These services are selec-
tively executed upon an application demand. A normal OS service, even in embedded
OS, is developed as a unit which is implemented to run on a GPP. Such an OS service
has the properties of fixed resources requirement and process time. In other words,
if provided the same input stream, such an OS service will always require the same
resources and execution time to provide the processed output. This however does not
utilize any flexibility implied by the multiple heterogeneous computational elements
which may exist in embedded systems such as RSoCs. Moreover, if resources are dy-
namically changing, which is highly expected in distributed embedded systems, some
computational elements may be heavily loaded while others may be idle. Such a sce-
nario may prevent a fixed resources OS service from executing. However, if the OS
service has flexible resources requirements, it could adapt itself to execute on other
available resources.
Currently, the adaptation of embedded OS systems is merely related to whether to in-
clude, exclude, or extend an OS functionality at design time. The whole OS always
has to reside on every embedded system. This means that every RSoC in a distributed
embedded system has to have its own OS. In addition, if applications on one RSoC
are using the resources required to execute an OS service, the only option for that OS
service is to wait until the resources are available again. These limitations, among
many others, urge the need for a new OS service design which utilizes the current
35
4. OS SERVICE STRUCTURE
multi computational heterogeneous platforms and provides for the increasing applica-
tion demands.
4.1 OS service design objectives
As aforesaid, each RSoC in the distributed system is consisting of at least two different
computational elements (e.g.: FPGA and GPP). To utilize these embedded systems,
each OS service exists in a number mof implementations which equals the number of
the different Computational Elements (ComEl). In other words, for each OS service,
there exists an implementation Im ∈IM which can run on a computational element
comel ∈ComEl, where IM is the set of all the implementations for an OS service.
This enables an OS service to run on a computational element that has relatively less
computational load cl than the others.
Now lets consider that each computational element comelihas a maximum computa-
tional load of mcli, a current computational load of ccli, and an available computational
load of acli, where mcli=ccli+acli. If an OS service implementation Imihas an
estimated computational load scli, then this implementation can run on the computa-
tional element comeliif and only if scli≤acli. In general, an OS service can run
on an RSoC if one of its implementations has a computational load less than or equal
to its corresponding computational element’s available load. However, if every com-
putational element comeliof one RSoC have acli< scli, then the executing of the
service is not possible. But what if the summation of all the computational load in
all the computational elements is bigger than any service computational load, that is,
what if Pacli> sclj, where i, j ≥0. Such an option encourages the objective of
designing an OS service that has the means to distribute its computational load to run
on wherever computational load is available.
The adaptation to applications variable demands is another important objective. For
instance, two applications may require different responses time from an OS service.
Enabling OS service to reconfigure itself, allows the support for such variable de-
mands. However, assuming that an OS service has the reconfiguration ability, there
would be a configuration with a minimum response time or resource usage. So, why
not just use that configuration every time?
The OS service reconfiguration refers to the ability of an OS service to run using a
different resource combination. This yields to different outcomes in terms of response
time and resources usage amount. Confining OS service to its minimum response
time configuration would again raise the issue of fixed resource usage. Normally, a
36
4.1 OS service design objectives
minimum response time means faster execution which, in most cases, means more
resource usage. So again, what if the required amount of resources to run an OS
service is not available at the time the OS service is invoked? To avoid this problem,
it is important that an OS service has the ability to reconfigure itself to provide for
different resource usage, hence, different application demands.
Another essential objective is the fault tolerance and recovery. A reliable system im-
poses the possibility of some fault percentage. Although this percentage varies from
one system to another, a distributed system is most likely to be exposed to a non-
negligible fault percentage. The handling of these faults varies as their corresponding
problems. These include the missing of data through communication, the partial or
total failing of one or more RSoCs, and the computational errors. The recovery from
these faults has to be fast and efficient. For instance, in computational and communi-
cation errors, the recovery process has to support resuming from the last known correct
point and not just start from the beginning.
Obviously an OS service intended for a distributed RSoCs system has to provide for
distribution and for resource sharing. As RSoCs are limited in resources, there will
be a case where one RSoC can not run an OS service and the only remaining option
is to use resources from other RSoCs. This sharing of resources has to be done fairly
and without harming the RSoC that its resources being used. Hence, load balancing
techniques and algorithms are required to support the introduced novel OS service
design and to allow a fair resource sharing and distribution.
The support for OS services localization and routing is an additional requirement for
distributed OS services. It should be noted that the routing here does not refer to
network messages and packets routing, but to the process of finding the convenient
RSoCs to execute a distributed OS service. This has to account for fair resource sharing
among the RSoCs without violating any constraints imposed by the applications/tasks.
Moreover, services may have more than one copy or version among the system. Hence
locating a service copy or version and the operation to transfer the data through the
related RSoCs is what we refer to by OS services localization and routing, see Section
6.3.
To enable these objectives and requirements, a new design model for an OS service
has been developed. This novel design enables OS services to adapt themselves to the
variable resource availability, application/task demands variations, and the distribution
requirements. Moreover, the model provides for fault tolerance, recovery, resource
sharing, and load balancing.
37
4. OS SERVICE STRUCTURE
4.2 OS service design
Every RSoC in the distributed system is consisting of at least two different computa-
tional elements (e.g: GPP and FPGA). All the RSoCs are assumed to be connected to
each other either directly or indirectly, see Section 3.1. Every RSoC keeps track of the
information about its resources (e.g. used area, current power consumption) which can
be retrieved at any time.
To provide for every computational element on an RSoC, each OS service exists with
a number mof implementations that equals the number of different computational
elements. For instance, if the distributed system has RSoCs, each consisting of one
GPP and one FPGA, every OS service would have two implementations; one that can
run on GPP and another which can run on FPGA.
To fulfill our objectives, see Section 4.1, each implementation of an OS service is
partitioned into the same number nof execution blocks, where each block is called a
Small Execution Segment (SES).
In each implementation, the SESs are indexed from 1to n, where 1means the first one
in the execution sequence and nmeans the last one to be executed. In addition, each
SES is marked with a notation that indicates the implementation to which it belongs.
For instance, in an OS service with a hardware implementation (intended to run on an
FPGA) and a software implementation (intended to run on a GPP), the SESh,i has an
execution sequence index iand belongs to the hardware implementation.
In one OS service, all the SESs with the same index in all the implementations have
the same functionality. This means, if they are provided with the same input stream or
sequence, they will all produce the same output. Because of this, and due to the fact
that n SESs are needed to have a complete OS service, at each execution sequence
index iany SES with the index ican be chosen regardless of its implementation.
This allows the execution of one OS service in a variety of mnconfigurations, where
mdenotes the total number of implementations for an OS service. This implies the
same number nof SESs in every implementation belonging to the same OS service.
However, in case some of these implementations have fewer SESs, dummy SESs
can be added to reach the nnumber of SES.
This novel design can be modeled as shown in Figure 4.1. The model consists of en-
tities and directed data paths. An entity has three types: START,FINISH, and
SES. The START entity is used to hold meta-data information about the OS ser-
vice, its SESs, and its implementations. These meta-data are very important to find
a suitable OS service configuration under some specific constraints. It is also required
38
4.2 OS service design
Figure 4.1: OS service with two partitioned implementations
39
4. OS SERVICE STRUCTURE
when SESs are distributed over the RSoCs in the distributed system. The FINISH
entity holds information to check, refine, and finalize the processed data obtained from
aSES with the execution sequence index n.
The directed data paths represent the direction of data flow between two entities and the
cost of such data flow. The cost denotes communication time and any non-negligible
resource, if any. In contrast to the direction of data flow, the directed data paths can be
classified into four types: Initialization data paths, finalization data paths, in platform
data paths, and cross platforms data paths.
The initialization data paths connect the START entity with the SESs that have an
execution sequence index 1. An initialization data path represents setup resource (e.g.
time) required to start the OS service execution from the related SESx,1, where xde-
notes an implementation type. The finalization data path connects a SES with execu-
tion sequence index nwith the FINISH entity. It accounts for any required resource
(e.g. time) to transfer processed data to its final stage. The in platform data path con-
nects any two successive SESs in the same implementation. On the other hand, the
cross platforms data path connects any two successive SESs, where the first SES is
in one implementation xand the other SES is in a different implementation y, and the
two implementations belong to the same OS service.
The SES entity is encapsulated as shown in Figure 4.2. The identification field
Figure 4.2: SES general structure
defines the SES implementation, its execution sequence index, and to which OS ser-
vice it belongs. The meta-data field holds information about the SES execution re-
quirements. These include the SES worst case execution time (WCET), the SES
estimated power consumption, and the SES area allocation, see the resource monitor-
ing unit in Section 3.2.1. In addition, the meta-data field contains two records: The
alternatives record, and the next successive record. These two records are required in
the distributed execution for routing, fast fault recovery, and migration. The structure
and the usage of these two records are discussed in Section 6.3.
40
4.2 OS service design
To allow a generalized SES initiation, a unified interface is required. This however
has to be managed without causing the SES to loose its individuality. The middle-
ware uses this interface to communicate signals for SES execution management (e.g.
initiation and interruption). On the other hand, a SES uses the interface to notify the
middleware about the status of the execution. The SES specific parameters can be
passed to the SES as a pointer to a record in memory.
4.2.1 SESs execution and applications
The operations of these novel OS services have to be transparent to applications. This
is done with the help of the middleware. The link unit of the middleware, see Section
3.2.1, provides the applications with the proper standard application programming in-
terface (API) to access OS services.
Once an application with the potential to use an OS service is introduced into the
system, the middleware begins the process of the OS service configuration. The mid-
dleware first checks if there exists a previous OS service configuration in the system
which suits the current demands and requirements. In such a case nothing has to be
done but to wait until the OS service is required to begin execution.
If there was no suitable OS service configuration on the RSoC, the middleware has to
obtain one either from OSR or from the other RSoCs in the distributed system. With
regard to the RSoC resource status, an OS service configuration is to be found such that
it does not violate any constraint. In case the RSoC does not have enough resources
for any OS service configuration, other solutions may be explored. Such solutions may
include running the OS service on another RSoC or using more than one RSoC to run
the OS service.
Having an OS configuration ready, the application can request the OS service by call-
ing the API offered by the middleware. Once a request is received, the middleware
prepares the parameters and any other needed information inside a record in the mem-
ory. Using the unified interface, the middleware sends a signal to the first SES with a
pointer to the memory record in order to start execution. When the first SES finishes
execution, it sends a notification signal to the middleware with a pointer to the record
of the processed data in memory. The middleware transfers these data to the next SES
as an initiation signal and a memory pointer. This process continues until the OS ser-
vice finishes processing the data after the last SES being executed. The processed
data is then finalized and transferred by the middleware to the requester application.
41
4. OS SERVICE STRUCTURE
If, during the OS service SESs execution, the RSoC resources changed due to a new
coming application/task, the middleware can request another configuration which suits
the new resource status. In case a fault or an error occurred, the OS service can be
reconfigured to avoid the faulty SES. The recovery process may resume from the last
SES which provided a correct processed data.
To carry out these operations, efficient algorithms that evaluate and find OS service
configurations are required. Additionally, because these OS services are designed to
operate on distributed RSoCs, it is required to distribute them taking into mind the
issues of load balancing, localization, and execution routing. These along with other
related matters are discussed in the subsequent chapters.
4.2.2 Example: Realtime adaptation
To understand the benefits gained from different OS service implementations and par-
titioning, let’s assume that we have an OS service Xwith two implementations. Each
implementation has just two SESs as follows:
Ims={SESs,1, SESs,2},
Imh={SESh,1, SESh,2}.
Where sdenotes software implementation which executes on GPP and hdenotes hard-
ware implementation which executes on FPGA. For the sake of discussion we define
each SES as a tuple of {release time, Worst Case Execution Time (WCET)}.
Let’s assume that the former SESs are defined as follows:
SESs,1={r1,3},
SESs,2={r2,5},
SESh,1={r3,1},
SESh,2={r4,2}.
Now we want to schedule the real-time task defined below.
t3={8,2,3,24}.
Here a task is defined by the tuple {release time, WCET before OS service call,
WCET after OS service call, and absolute deadline}. This task is to execute on
GPP and call the service X. Let the target RSoC have one GPP and one FPGA with 6
blocks of available area. For the sake of simplicity, let’s assume that each time slot will
42
4.2 OS service design
Figure 4.3: Scheduling configuration 1. OS_S: OS service time slot on GPP, H1: FPGA
area-time slot scheduled for t1, H2: FPGA area-time slot scheduled for t2, and H3: FPGA
area-time slot scheduled for t3.
need to allocate one block of the FPGA area if executed on the FPGA. Considering the
scheduling configuration seen in Figure 4.3. The RSoC is already having the tasks,
t1and t2which are using both the FPGA and the GPP as following:
t1={0,2,8,13},
t2={2,1,3,16}.
Here each task is defined as the tuple {release time, WCET before FPGA usage,
WCET after FPGA usage, and absolute deadline}. In this schedule we see that
task t1is scheduled with FPGA usage of one slot and task t2with three slots. The
remaining available FPGA area is just two slots. Hence, task t3can only be scheduled
with the OS service configuration SESs,1and SESh,2with r1= 16 and r4= 20.
On the other hand, scheduling configuration 2, see Figure 4.4, has task t1and t2defined
using the above task tuple definition as:
t1={0,2,3,8},
t2={2,1,3,13}.
Here task t1is scheduled with three FPGA slots and task t2with also three slots of
FPGA usage. At this point FPGA has no remaining available area. Hence task t3can
43
4. OS SERVICE STRUCTURE
only be scheduled with the OS service configuration SESs,1and SESs,2with r1= 12
and r2= 16.
As seen we have been able to schedule t3without missing the deadline in both sched-
Figure 4.4: Scheduling configuration 2. OS_S: OS service time slot on GPP, H1: FPGA
area-time slot scheduled for t1, H2: FPGA area-time slot scheduled for t2, and H3: FPGA
area-time slot scheduled for t3.
ules using different service configurations. The above two examples assume nothing
about the RSoC power, SESs communication time, and OS service release and fin-
ish time. However, if the power is limited, it has to be considered as well, as different
configurations have different power requirements. In scheduling configuration 1 and 2,
more flexibility could be achieved if dynamic reconfiguration of the FPGA is allowed.
However, the time to reconfigure, that is the time to replace an already used area of the
FPGA with new data, should not affect the deadline of one task or another.
Another benefit gained from such partitioning is fault tolerance and adaptation. If at
one point in the execution, a failure in one SES is discovered, another configuration
from the point of failure which does not have the faulty SES could be set, provided
that constraints, like timing, do allow this.
44
4.3 Chapter Conclusions
4.3 Chapter Conclusions
An OS service designed for a distributed RSoC system has many objectives. These
include multiple architectures support, reconfiguration, hardware utilization, adapta-
tion to applications variable demands, adaptation to resources dynamic change, fault
tolerance, and recovery.
These objectives can be achieved if an OS service exists in multiple implementations,
with each implementation partitioned in the same number of functionally equivalent
blocks, namely SESs, as the other implementations. This enables the execution of
one OS service using many different configurations, with each configuration having a
different response time and resource requirements.
The variations in configurations resource requirements and response times, allow an
OS service to adapt to various application demands and environments. Such an adap-
tation can take place with regarding to one constraint or to multiple constraints such as
area and realtime deadlines.
45
4. OS SERVICE STRUCTURE
46
CHAPTER 5
OS service configurations and adaptation
The partitioning of OS services into SESs enables an OS service to be executed in
many different ways or configurations. This is because for any OS service, the SESs
with the same execution indices have the same functionality behavior, see Section 4.2.
Hence, for each execution index, any SES with the same index can be chosen from
any implementation belonging to that OS service.
The aim is then to find, at runtime, a configuration that enables an OS service to run
under specific resources combination without violating any constraint enforced by the
requester application.
The search for a configuration can be done exhaustively by exploring all the possi-
ble SESs combinations to run an OS service, see Figure 5.1. Although this will al-
ways lead to an optimal solution, it is not practical. For nnumber of SESs in each
implementation Im ∈IM, the time spent on exhaustive search would be in the or-
der of O(mn), where m=|IM|. If the calculations are to be stored, an amount of
m∗(mn−1) storage elements are needed.
The exhaustive exploring method may be preferable for OS services with small number
of SESs and implementations, however, it is out of the question for a large number of
SESs.
47
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
(a) A simple model of
an OS service with each
implementation partitioned
into three SESs.
(b) The exhaustive exploration of all the possible configurations to run the OS
service in 5.1(a)
Figure 5.1: An example of an exhaustive exploration of the configurations of a simple OS
service model
48
5.1 OS service model formal definition
To search the configurations at runtime, efficient methods are required. This can be
achieved using general algorithm designing techniques such as dynamic programming
[34].
5.1 OS service model formal definition
In a distributed RSoCs system, an OS exists as a collection of services. Each service
has at least two implementations, a hardware and a software implementation. This is
due to the fact that each RSoC is assumed to have an FPGA and a GPP.
To generalize the problem lets consider an OS with Xnumber of services. Each service
Siin the OS has mimplementations, see Definition (5.1).
∃X∈
N
;X≥1; ∃m∈
N
;m≥2|
∀i; 1 ≤i≤X;∃Si∈OS and Si= [Im]m(5.1)
All the implementations of an OS service are partitioned into the same nnumber of
SESs, see Figure 5.2(a). Hence, an OS service implementation Imkcan be defined in
(5.2) as a sequence of n SESs.
∃n, k ∈
N
;n≥1; k≥1|Imk=
n
[
l=1
SESk, l (5.2)
Every SES in an implementation Imkis connected to all its successive SESs using
directed communication edges CE, defined in Definitions (5.3) and (5.4), see also
Figure 5.2(b).
E⊆[SES]2(5.3)
∃k0∈
N
; 1 ≤k0≤m;∀k0;∃e∈E;
∃SESk,l ∈Imk;∃SESk0,l+1 ∈Imk0;Imk, Imk0∈Si|
e= (SESk,l, SESk0,l+1)and
CEk,l,k0= (e, SESk,l, SESk0,l+1)(5.4)
ASES can be abstracted by its resource descriptors, Definition (5.5), where each SES
contains three main descriptors which are: its WCET Time (T), its estimated Power
consumption (P) and Area (A), see Section 4.2.
SESk,l ≡ {Ak,l, Tk,l, Pk,l}(5.5)
49
5.2 Optimal configurations and constraints
The directed communication edges can also be defined by the descriptors representing
the resources used in communication. These descriptors also include communication
time (dct), communication power (dcp), and communication area (dca). However, in
many cases, the most important descriptor for these edges is the communication time
descriptor (dct).
CEk,l,k0≡ {dcak,l,k0, dctk,l,k0, dcpk,l,k0}, where (5.6)
k, k0denote implementations indices, and
l denotes execution sequence index, see Figure 5.2
To every OS service, a START entity is added. The START entity has all of its
descriptors set to zero. This entity is connected to every SES with execution se-
quence index 1in each implementation Imkof the OS service by an initialization
edge (Ik; 1 ≤k≤m). These edges represent any resources (area (Ik.a), time (Ik.t),
and power (Ik.p)) required in order to start the corresponding SESk,1of the Imkim-
plementation. The FINISH entity is added to the end of every OS service. The
FINISH entity also has no valuable descriptors. A finalization edge (Fk; 1 ≤k≤m)
connects every SESk,n in every OS service implementation k; 1 ≤k≤mto the
FINISH entity. The descriptors of the finalization edge represent the resources (area
(Fk.a), time (Fk.t), and power (Fk.p)) required to get the processed data to its final
destination. Again, the most probably important descriptor in both the initialization
and the finalization edges is time. Nevertheless, in case of network communication,
the power descriptor may also be of a non-negligible value.
5.2 Optimal configurations and constraints
The configuration of an OS service depends on the constraints enforced by the re-
quester application and the available resources at the time of request. However, in
order to minimize the evaluation time, that is, the time required to take a decision and
to find a configuration, boundaries of the maximum/minimum required resources are
needed to be defined. In other words, it is required to define which OS service config-
uration is considered to be optimal in using a resource and which is considered to have
the maximum resource usage.
Finding the minimum resource requirement allows an early decision of whether an OS
service configuration can be found for a specific resource combination. In case of no
configuration can be found, other means, such as finding another RSoC or using more
than one RSoC to run an OS service configuration, can be carried out.
51
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
On the other hand, defining the maximum resources boundaries allows limiting the
exploration of the solution space. This is done by rejecting any OS configuration that
may have resources requirements which would exceed the defined maximum bound-
aries. Additionally, in case of much resources availability, these limits help the quick
finding of a Pareto optimum configuration.
To find the minimum boundary for each resource usage, we evaluate to find the OS
service configuration which is optimized towards using as low amount of that resource
as possible. In RSoC, three main resources are considered. These are: Time, power,
and area. So as a first step, an OS service is evaluated to find the optimized configura-
tion to run in minimum time, then to run in minimum power, and finally in minimum
area. These evaluations can be performed using the dynamic programming designing
technique [34].
5.2.1 Single criterion optimization algorithm
Let’s consider the minimum possible amount of resource required for a chunk of data
to be processed by passing it from the START entity all the way through SESk,l. If
l= 1, for any implementation Imk, there is only one way that the data could have
gone, and so it is easy to determine how much resources have been consumed to get
through SESk,l. For l= 2,3, ..., n, however, there are mchoices: the data could have
been processed by any SESk0,l−1, where 1≤k0≤mand then through SESk,l. The
resources consumed by passing the data to SESk,l is then defined by CEk0,l−1,k.
Now, we will consider a criterion C∈ {time, power, area}to identify the one
resource descriptor we want to minimize. Throughout this discussion, we will use
C, SESk,l, and CEk,l,k0as a general representation to denote a single criterion (de-
scriptor) representation instead of using subscriptions to represent every descriptor.
For instance, instead of using C.t, SESk,l.T, and CEk,l,k0.dct to denote time. The
same also applies for Ikand Fk.
Let us assume that the minimum usage of resource Cto get to SESk,l is through
SESk0,l−1. The key observation is that the data must have been processed with the
SESs combination that yields a minimum Cfrom the START entity through SESk0,l−1.
Why? If there is another SESs combination to get through SESk0,l−1, we could substi-
tute this SESs combination to yield a minimum Cthrough SESk,l, which contradicts
our assumption.
In general, a minimum resource usage with n SESs combination contains within it a
smaller number of SESs combinations with minimum resource usage (optimal sub-
52
5.2 Optimal configurations and constraints
structure). This means that we can construct an n SESs combination with minimum
resource usage by constructing the minimum resource usage combination of n−1
SESs.
The construction of n SESs minimum resource usage combination can be done re-
cursively in terms of finding sub-combinations with minimum resource usage for l=
1,2, ..., n −1. Let Ck[l]denote the minimum possible resource usage to process data
from the START entity through SESk,l. Our goal is to determine the minimum re-
source usage to process data all the way through an OS service, which we denote by C∗.
The data has to be processed all the way through SESnon any implementation kand
then to the FINISH entity. Since the minimum resource usage of these combinations
(configurations) is the minimum resource usage of an OS service, we have
C∗=min1≤k≤m(Ck[n] + Fk)
It is also easy to find Ck[1]. To get data processed through SESk,1on any implemen-
tation Imk, a data just goes directly to that SES through the initialization edge Ik.
Thus,
Ck[1] = Ik+SESk,1
Now let us consider how to compute Ck[l]for l= 2,3, ..., n (and 1≤k≤m).
Recalling that the minimum resource usage through SESk,l is the minimum resource
usage through SESk0,l−1, where 1≤k0≤m, and then directly to SESk,l through the
correspondent edge CEk0,l−1,k. Hence we have:
Ck[l] = min1≤k0≤m(Ck0[l−1] + CEk0,l−1,k +SESk,l)
The Ck[l]values give the minimum resource usage of sub-combinations of SESs. To
keep track of how to construct a minimum resource usage SESs combination, we
define Mk[l]to be the implementation number whose SESk,l−1is used in an optimal
minimum resource SESs combination through SESk,l. Here, 1≤k≤mand l=
2,3, ..., n.Mk[1] is not defined because no SES precedes SESk,1. We also define M∗
to be the implementation kwhose SESk,n is used in the optimal minimum resource
usage configuration through the entire OS service execution. The Mk[l]values are used
to trace an optimal resource usage configuration. Based on this analysis, the evaluation
for minimum resource usage can be simply found recursively. However, such recursive
algorithm would have a run time which is exponential in n. Nevertheless, by reversing
the top down method of the recursive algorithm, a better method could be used in
computing the Ck[l]values. This can be done because for l≥2, each value of Ck[l]
depends only on the values of C0
k[l−1], where 1≤k0≤m. By computing the Ck[l]
values in the order of the increasing SES execution index number l, we can compute
the minimum resource usage combination for an optimal OS service execution. With
this bottom up algorithm the evaluation can be done in Θ(n)time.
53
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
Algorithm 5.1 Single criterion optimal SESs configuration
Require: x, OS service :{SESs, CE, I, F }//x is the criterion we want to optimize
1: Cs[1].t ←Is.t +SESs,1.T
2: Cs[1].a ←Is.a +SESs,1.A
3: Cs[1].p ←Is.p +SESs,1.P
4: //... {The same (lines 1-3) done for Ch.y[1] where y is substituted with t, a, and p.}
5: for l←2to ndo
6: //... {DC: denote decision criterion. s: denote software. h: denote hardware.}
7: DCs←getDCs(SESs, l, x)
8: DCh←getDCh(SESs, l, x)
9: //... {First, evaluate assuming software SES is taken at execution index l.}
10: DC1←DCs+CEs,l−1,s.x
11: DC2←DCs+CEh,l−1,s.x
12: if Cs[l−1].x +DC1≤Ch[l−1].x +DC2then
13: Cs[l].x ←Cs[l−1].x +DC1
14: Ms[l]←s
15: else
16: Cs[l].x ←Ch[l−1].x +DC2
17: Ms[l]←h
18: end if
19: //... {Next, evaluate assuming hardware SES is taken at execution index l.}
20: DC1←DCh+CEh,l−1,h.x
21: DC2←DCh+CEs,l−1,h.x
22: if Ch[l−1].x +DC1≤Cs[l−1].x +DC2then
23: Ch[l].x ←Ch[l−1].x +DC1
24: Mh[l]←h
25: else
26: Ch[l].x ←Cs[l−1].x +DC2
27: Mh[l]←s
28: end if
29: if x=tthen
30: Cs[l].p ←Cs[l−1].p +CEMs[l],l−1,s.p +SESs,l.P
31: ... {updating the rest but not the time criterion}
32: else if x=athen
33: Cs[l].t ←Cs[l−1].t +CEMs[l],l−1,s.t +SESs,l.T
34: //... {updating the rest but not the area criterion}
35: else
36: Cs[l].t ←Cs[l−1].t +CEMs[l],l−1,s.t +SESs,l.T
37: //... {updating the rest but not the power criterion}
38: end if
39: end for
40: return C,M
54
5.2 Optimal configurations and constraints
5.2.2 Evaluating for a suitable OS service configuration
For an RSoC with just one FPGA and one GPP we have OS services with each of
them having a software and hardware implementations. Algorithm 5.1 shows how
to find a single criterion optimal SESs combination (or configuration) for such OS
services. In the algorithm we substitute kby either sto denote software or hto denote
hardware. The algorithm requires the Worst Case Execution Time (WCET) (T), the
required implementation area (A), and the estimated power consumption (P), for every
SES in both of its software and hardware implementations. It also requires the setup
time (I), to start the execution from either of the first SESs, the finish time (F), to
get results after executing one of the last SESs, and the Communication Edges (CE)
connecting the successor SESs which are either in different implementations or in the
same implementation.
The algorithm computes the best SESs configuration that is optimized with regard to
the resource criterion given in x. Where xcan have the value tfor WCET,afor area,
or pfor power. It also calculates and stores the values of all the other non-criterion
required resources. The area criterion can be further optimized for hardware area hx
or software area sx. However, in this algorithm a weighing function is assumed which
combines both the hardware and the software areas into a single criterion a. After
running the algorithm, it returns two lists, the first one is the Clist, which holds the
amount of resources needed to optimally execute from the START entity all the way
through every SESk,l. The amount of resources and the bath chosen are evaluated
taking into account the optimization of the resource criterion provided by x.
The second list is M. It contains the information about the SESs implementations
which were chosen to reach every SESk,l with optimal xresource usage. The getDCs
and getDCh are select statement functions which take as arguments the SESs, the
execution sequence index l, and the decision criterion x. As result, they return the
value of the decision criterion xin SESs,l and SESh,l respectively. The Algorithm 5.1
runs in Θ(n)time. This is because it contains only one for-loop which runs for about
ntimes. The other elements in the algorithm can be estimated with constant values.
In a limited RSoC resources environment, where applications are demanding an OS
service, the configurations have to be chosen to use the resources which are less de-
manded by applications. Normally, one resource is more demanded by applications
than the others. In such a case, the targeted OS service configuration is not the optimal
one. Nevertheless, we target a suitable OS service configuration which runs on the
limited RSoC resources without violating any constraint and with minimum usage of
55
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
the relatively heavily demanded application resource. In order to obtain such an OS
configuration, we use a heuristic algorithm, see Algorithm 5.2.
Initially, we use Algorithm 5.1 to find the timing optimal SESs configuration (TOSC),
the area optimal SESs configuration (AOSC), and the power consumption optimal
SESs configuration (POSC). From these configurations, a suitable configuration is
chosen. A suitable configuration is optimized to use less amount from the relatively
heavily demanded application resource without violating any constraint.
Algorithm 5.2 GetSuitableSESsConfigurations
Require: OS service, RSoCConstraints, optimizationRequirements
1: TOSC ←SingleCriterionOptimalSESsConfiguration(OS service, t)
2: AOSC←SingleCriterionOptimalSESsConfiguration(OS service, a)
3: POSC ←SingleCriterionOptimalSESsConfiguration(OS service, p)
4: if One constraint is less than required optimal resource in any optimal SESs configuration then
5: No possible configuration can be found, notify for alternative solutions, exit algorithm
6: else
7: begin with the optimal requested as a main optimization requirement such as TOSC
8: end if
9: for all SESs in the chosen configuration that are with different implantations in the second or the
third optimal configuration and from the last SES to the first do
10: try exchange functionally equivalent SES between implementations if that would satisfy the
constraints and minimize the other required resources.
11: if exchange success then
12: mark as possible configuration
13: end if
14: if no configuration was marked then
15: Notify for alternative solutions.
16: end if
17: end for
18: return SuitableConfigurations
For instance, suppose an application requests an OS service Xwith an expected worst
case response time given by Treq. For the sake of discussion, lets assume that the RSoC
has a limited hardware area given by Aava and a plenty of the other resources. The main
algorithm, Algorithm 5.2, starts by looking at TOSC. If the total execution time of
the TOSC configuration is the same as the constraint Treq, and the hardware area fits
in the available area Aava, the algorithm sends the configuration as a critical one with
no alternatives. On the other hand, if Treq is greater than TOSC, the algorithm looks
at the SESs that are implemented using hardware in the TOSC, and tries, if possible
(no constraints violation), to exchange each one of them with the SESs from the other
implementations. The algorithm keeps doing that as long as there is a hardware SES
which has not been tried. Other switching between SESs implementations may also
be carried out if it leads to a resource usage minimization. The algorithm marks every
56
5.2 Optimal configurations and constraints
suitable configuration found as another possible alternative configuration. When done,
it returns the configuration with the lowest hardware area. The marking of alternative
configurations makes it faster to find another configuration in case of some faults or
errors. In case of no configuration found, other solutions or exploration techniques
may be examined. These may include finding another RSoC with enough resources to
run the OS service, or using more than one RSoC to share resources in order to run the
OS service. These solutions however depend totally on the constraints and especially
on the execution and the communication timing constraints.
Configurations limits
The ability to find a configuration depends on the availability of resources and on the
enforced constraints from the requesting applications. However, there are situations
where no configuration can be found.
Section 7.1 investigates the limits after which the finding of an OS configuration is
not possible. The evaluation compares an RSoC with scarce resources to the resource
required by an optimal OS service configuration. For instance, there is a very low
probability to find a configuration for an OS service with optimal area Aopt on an RSoC
with available hardware area= 0.5∗Aopt and available software area= 0.5∗Aopt.
To know such facts is very important. This would save a lot of time which would be
wasted by trying to find a configuration. Instead, such time could be used in finding
other solutions. These include using or sharing other RSoC(s) resources in the dis-
tributed system. However, efficient and fast distribution processing implies an efficient
OS services distribution and discovery procedures. This is discussed in Chapter 6.
5.2.3 Evaluating for Pareto optimal OS service configuration
A Pareto optimum can be a targeted OS service configuration. This could be in a case
where an RSoC has plenty of resources to run any OS service configuration. A Pareto
optimum configuration helps avoiding preemption which is caused by the dynamic
change in resources.
To find a Pareto optimum configuration, we first define the boundaries or the threshold
values to confine our search. Two main threshold values are used to confine the solution
space for the targeted configuration. These are the minimum and the maximum amount
of resources required to run an OS service configuration. For every resource, that is for
57
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
area, time, and power, we will define the threshold of the minimum and the maximum
value for a valid OS service configuration.
The minimum resource threshold value can be found by obtaining the optimal OS
service configuration for each one of the three descriptors. This is done by using
the single criterion optimal SESs configuration algorithm, see Algorithm 5.1. For
instance, to find the minimum time threshold value for an OS service, we evaluate to
find the time optimal service configuration TOSC. As this configuration is optimized
to run as fast as possible, its running time is considered the minimum threshold value
for time.
In order to find the minimum resource value which can be used by a configuration,
we need to evaluate for an optimal resource value from the START entity all the way
through SESk0,n and finally the FINISH entity. To achieve that, we find the optimal
resource value through each SESk0,l, where 1≤l≤n, and 1≤(k or k0)≤m.
Along the evaluation for optimal resource value through each SESk0,l, we can also
evaluate to find the usage amount of the other non-optimized resources. These will be
used later to find the maximum threshold values.
Equation (5.7) defines how to evaluate to find the optimal value for the area descriptor
regarding SESk0,l. Along with the minimum area, we also calculate the values of
the other non-optimized resources, see Equations (5.8) and (5.9) for time and power
respectively. From these equations, only the area value is optimal or has the minimum
resource usage.
∃k0∈[1, m]|Ak0,l =min1≤k≤m(Ak,l−1+SESk0,l.A +CEk,l−1,k0.dca)(5.7)
Tk0,l =Tk0,l−1+SESk0,l.T +CEk,l−1,k0.dct (5.8)
Pk0,l =Pk0,l−1+SESk0,l.P +CEk,l−1,k0.dcp (5.9)
The optimal value for the time descriptor through SESk0,l can be obtained using Equa-
tion (5.10). The other non-optimized resource usage values for area and power can be
found using Equation (5.11) and Equation (5.12) respectively.
∃k0∈[1, m]|Tk0,l =min1≤k≤m(Tk,l−1+SESk0,l.T +CEk,l−1,k0.dct)(5.10)
Ak0,l =Ak0,l−1+SESk0,l.A +CEk,l−1,k0.dca (5.11)
Pk0,l =Pk0,l−1+SESk0,l.P +CEk,l−1,k0.dcp (5.12)
58
5.2 Optimal configurations and constraints
Finally, Equation (5.13) is used to find the optimal value for the power descriptor.
The non-optimized usage resource values regarding area and time can be found using
Equation (5.14) and Equation (5.15) respectively.
∃k0∈[1, m]|Pk0,l =min1≤k≤m(Ak,l−1+SESk0,l.P +CEk,l−1,k0.dcp)(5.13)
Ak0,l =Ak0,l−1+SESk0,l.A +CEk,l−1,k0.dca (5.14)
Tk0,l =Tk0,l−1+SESk0,l.T +CEk,l−1,k0.dct (5.15)
The initial optimal value through SESk,1which is used in evaluating the optimal area,
time, and power configurations can be found using the Equations (5.16), (5.17), and
(5.18) respectively.
Ak,1=SESk,1.A +Ik.a (5.16)
Tk,1=SESk,1.T +Ik.t (5.17)
Pk,1=SESk,1.P +Ik.p (5.18)
The optimal value for a configuration is known after finding the optimal SESs com-
bination all the way till SESk,n. Using the Equations (5.7), (5.10), and (5.13) we
can evaluate the required optimal value for an optimal OS service area configuration,
optimal OS service time configuration, and optimal OS service power configuration
by adding any resources needed for finalization at index nto the calculated optimal
resource value. See Equations (5.19), (5.20), and (5.21).
Aopt =min1≤k≤m(Ak,n +Fk.a)(5.19)
Topt =min1≤k≤m(Tk,n +Fk.t)(5.20)
Popt =min1≤k≤m(Pk,n +Fk.p)(5.21)
The optimal resource value for an OS service is the amount of that resource used by
the OS service configuration which is optimized to use the minimum amount of that
resource. In our discussion we will refer to these optimal values as the minimum
threshold limits, Amin for area, Tmin for time, and Pmin for power, to obtain a Pareto
optimal OS service configuration.
While evaluating for the area, time, and power optimal configurations, we also cal-
culated the amount of resources required by the other non-optimized descriptors. For
example, while evaluating to find the configuration with the optimal area, Equation
(5.7), we have also calculated the amount of estimated power, Equation (5.9) and time
(5.8), consumed by the configuration. The found configuration would have an opti-
mal area usage, however, the other resources are most probably not optimal. These
non-optimal values are used for defining the maximum threshold values.
59
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
A maximum resource threshold value is defined as the maximum value of that resource
found in any of the three single-criterion optimal configurations. For instance, the
maximum power threshold is the maximum value of the Equations (5.9) and (5.12).
These equations are used for evaluating the required non-optimized power require-
ments, while evaluating for the optimal configurations for area and time respectively.
The value of Equation (5.13) is not considered because it always has the minimum
amount of power that a configuration could consume. The three maximum thresholds
are represented as Amax for area, Tmax for time, and Pmax for power. Another possibil-
ity would be the maximum amount of that resource available on the RSoC. However,
this may lead to unnecessary expansion of the search space.
Figure 5.3: Maximum and minimum thresholds to find Pareto optimal OS service config-
uration
The three single criterion optimal configurations for time, area, and power with the
minimum and the maximum threshold values can be represented as coaxial-triangles in
a three axis plane as shown in Figure 5.3. The minimum threshold is represented as the
triangle which has its three corners defined by the minimum value on each axis. The
maximum threshold triangle has its three corners defined by the maximum value on
each axis. The Pareto optimal configurations can be represented as the triangles which
60
5.2 Optimal configurations and constraints
are defined to have corners in between the corners of the maximum and the minimum
threshold triangles, with as close as possible to the minimum threshold triangle.
Deducing our discussion, we can now evaluate to find a Pareto optimal configuration
with the resources requirement Ap−opt,Tp−opt, and Pp−opt as defined in the Equations
(5.22), (5.23), and (5.24) respectively.
Amin ≤Ap−opt ≤Amax (5.22)
Tmin ≤Tp−opt ≤Tmax (5.23)
Pmin ≤Pp−opt ≤Pmax (5.24)
Algorithm 5.3 Pareto optimization for OS services
Require: OS service SESs and CE
1: Acfopt =evaluate for the configuration optimized for area
2: Tcfopt =evaluate for the configuration optimized for time
3: Pcfopt =evaluate for the configuration optimized for power
4: Evaluate Amax, Tmax, Pmax, Amin, Tmin, and Pmin
5: First −conf =Acfopt
6: Second −conf =Tcfopt
7: Third −conf =Pcfopt
8: Opt −conf =Acfopt
9: repeat
10: for all SESlin First-conf from last to the beginning do
11: if SESlis not the same as the SESlin the Second-conf or the Third-conf then
12: Try changing SESlto the one in the Second-conf and then to the one in the Third-conf
13: if changing would give a configuration that is better than Opt-conf and there is no resource
greater than its maximum allowed value then
14: Opt-conf=the new configuration after SESlchanged
15: end if
16: end if
17: end for
18: if First repeat then
19: First −conf =Tcfopt
20: Second −conf =Acfopt
21: Third −conf =Pcfopt
22: else
23: First −conf =Pcfopt
24: Second −conf =Acfopt
25: Second −conf =Tcfopt
26: end if
27: until Three times repeated
28: return Opt-conf
Algorithm 5.3 uses the above analysis to search for a Pareto optimal configuration for
an OS service. It requires the OS service SESs and the communication edges CE.
61
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
Using Algorithm 5.1, Algorithm 5.3 finds the three optimal configurations for time,
power, and area. From these configurations, the algorithm extracts the minimum and
the maximum threshold values. The search for a Pareto optimal OS service configura-
tion begins by first choosing a configuration from the three optimal configurations as
the Pareto optimal configuration. It then starts comparing the SESs in the different
configurations and tries to exchange them. Every exchange is asserted to be within
the limits of the maximum and the minimum threshold values. A resource weighting
function (e.g.: euclidian function) is then used to compare the configuration obtained
after the exchange with the current Pareto optimal configuration. If the resources re-
quired by the new configuration are closer to the minimum threshold resources, it will
be chosen as the Pareto optimal configuration. The process continues until we have
tried the exchanging of all SESs in the three single-criterion optimal configurations.
This algorithm has a polynomial running time of Θ(n).
5.2.4 Wrap up example
To demonstrate some of what we have presented, let us consider the OS service shown
in Figure 5.4. This OS service has two implementations: a hardware and a software im-
plementation. Each implementation is partitioned into four SESs. For the sake of this
example, the resources requirements of each SES was chosen randomly. The commu-
nication edges between these SESs are merely representing the communication time
which has been also randomly chosen.
Using Algorithm 5.1, we find the Ck[l]for time, power, and area, see Tables 5.1,5.3,
and 5.5 respectively. These tables show the resources used taking one resource as an
optimal criterion from the START entity up until SESk[l]. In addition, they show
the amount required by the other non-optimized resources. For instance, the minimum
time required for running the OS service from the START entity up until SESs[2]
can be obtained from Table 5.1, which is Ts[2] = 23. In addition, we require 11 units
of power and 14 units of area. The value denoted by C∗represents the total amount of
the optimized resources needed to run a complete OS service optimized configuration
from the START to the FINISH entity, where C∗∈ {T∗,P∗,A∗}.
Although the optimized required resources are shown in Tables 5.1,5.3, and 5.5, we
still need to know how to obtain such optimal configurations. This however can be
acquired from the Tables 5.2,5.4, and 5.6 respectively.
The Mholds information about which implementation kof SESk[l−1] was chosen
to reach SES0
k[l]with optimal Cresource usage. For instance, looking at Table 5.2 we
62
5.2 Optimal configurations and constraints
Figure 5.4: An OS service example with random generated meta-data. The communica-
tion edges represent only the communication time.
l1 2 3 4
Th[l]
T= 8 20 30 39
P= 5 9 13 20
A= 2 14 21 26
Ts[l]
T= 8 23 31 44
P= 4 11 17 23
A= 8 14 22 26
Table 5.1: Time optimized Ck[l]
with T∗=42
l2 3 4
Mh[l]s h h
Ms[l]s s s
Table 5.2: Time optimized Mk[l]
with M∗=h
know that SESh[2] can be reached with minimum execution time from the previous
SES whose implementation is Mh[2] = s, that is from SESs[1].
63
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
l1 2 3 4
Ph[l]
T= 8 20 30 39
P= 5 9 13 20
A= 2 14 21 26
Ps[l]
T= 8 23 32 44
P= 4 11 15 19
A= 8 14 22 25
Table 5.3: Power optimized
Ck[l]with P∗=19
l2 3 4
Mh[l]s h h
Ms[l]s h h
Table 5.4: Power optimized
Mk[l]with M∗=s
l1 2 3 4
Ah[l]
T= 8 20 38 47
P= 5 10 16 23
A= 2 8 15 20
As[l]
T= 8 25 33 52
P= 4 12 18 22
A= 8 8 16 19
Table 5.5: Area optimized Ck[l]
with A∗=19
l2 3 4
Mh[l]h s h
Ms[l]h s h
Table 5.6: Area optimized Mk[l]
with M∗=s
The M∗holds the type of the implementation we chose to obtain the optimal resource
usage C∗. To get the optimal configuration, we trace back the SESs chosen in order
to obtain the optimal resource usage configuration. This tracing is done in backward
manner from the FINISH entity to the START entity. The tracing is possible using
Mk[l]. We begin by setting Im ←M∗and choosing SESIm[n]to be the last SES in
the optimal configuration. Then from l=n, ..., 3,2, we chose SESIm[l]as another
SES in the optimal configuration by setting Im ←MIm[l].
The optimal OS service configurations for time, power, and area can be obtained from
Tables 5.2,5.4, and 5.6 respectively. These configurations are shown in Figure 5.5.
The Figures 5.5(a),5.5(b), and 5.5(c) show the single-criterion optimal configuration
for time, power, and area respectively. The configuration which uses as much mini-
mum harea as possible but executes within a 46 time can be found at run time using
Algorithm 5.2. This configuration is shown in Figure 5.5(d).
From the three single-criterion optimal configurations we can define the corners of the
minimum threshold triangle, see Figure 5.3. The triangle has its corners defined as 42
on the time axis, 19 on the power axis, and 19 on the area axis. The corners of the
maximum threshold triangle would have the values of 52 for time, 22 for power, and
64
5.2 Optimal configurations and constraints
(a) OS service time optimal
configuration.
(b) OS service power optimal
configuration.
(c) OS service area optimal
configuration.
(d) OS service configuration
optimized to run in 46 time and
have minimum husage.
(e) An OS service
Pareto optimal con-
figuration.
Figure 5.5: The OS service different criterion optimized configurations
65
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
26 for area. The chosen Pareto optimal configuration for the OS service has to have
its resource requirements between the maximum and the minimum threshold values,
with as close as possible to the minimum threshold values, see Section 5.2.3. Using
Algorithm 5.3, we get the Pareto optimum OS service configuration, shown in Figure
5.5(e), with the values of 42 for time, 21 for power, and 20 for area.
5.3 SESs granularity and pipelining
So far we have assumed the OS services to be in different implementations, with each
implementation partitioned into the same number of SESs. We know that the number
of implementations is related to the number of the different computational elements
in the system. We also know that the level of the flexibility and adaptability intro-
duced in an OS service is proportional to the number of its implementations and to the
granularity of which these implementations are partitioned.
To increase the adaptation of an OS service we have to increase the number of SESs
by minimizing the partitioning granularity. However, this minimization is limited or
governed by the SESs inter-communication overhead. The problem now is to define
a lower bound for SESs granularity in terms of execution time and the SESs inter-
communication overhead.
On the other hand, the structure of this novel OS service design allows the use of
pipelined execution. This powerful technique can be utilized to reduce or provide
some tolerance to the SESs inter-communication overhead.
5.3.1 Partitioning granularity
Each OS service consists of a number of implementations that is equal to the number of
different computational elements in the distributed system. These implementations are
partitioned to the same number of SESs. This process enables many useful charac-
teristics such as adaptation and self-healing. It also introduces an undesired overhead
of the inter SESs communication overhead. The key element is to find a trade-off
between the desired and the undesired characteristics by finding a proper partitioning
granularity.
In general, the granularity of which to partition the implementations depends on two
factors: The architectures of the computational elements in the system and the inter-
communication overhead.
66
5.3 SESs granularity and pipelining
The developed OS service design imposes the partitioning of all the implementations
of an OS service into the same number of SESs. All the SESs with the same execution
index in all implementations belonging to one OS service have the same functional-
ity behavior. This introduces the first partitioning limitation. The partitioning of one
implementation may not be as flexible as the other OS service implementations. For in-
stance, one can argue that the partitioning of an implementation for a Reduced Instruc-
tion Set Computing (RISC) architecture can be as small as one instruction, however, in
an implementation for a Complex Instruction Set Computing (CISC) architecture, one
instruction may map to many RISC instructions. Hence, the minimum partitioning in
this case is limited by the CISC implementation.
Nevertheless, architectures are not the major limiting factor of OS service partitioning.
In order to run a partitioned OS service, a successive sequence of n SESs, regardless
of implementations, is required. Obviously, the more fine-granular the partitioning, the
higher the inter-communication overhead, but also the better the adaptability.
Given a non-partitioned OS service with WCET T, we want to find the maximum
number nof SESs with granularity as fine as possible, bearing in mind the commu-
nication overhead. Let an OS service Xhave two non-partitioned implementations,
Im1and Im2, with WCET TIm1and TIm2respectively. Now let each implementa-
tion be partitioned into SESs with the SESs ∈Im1having a WCET tIm1and the
SESs ∈Im2having a WCET tIm2. Here, tIm1, tIm2are the WCET for the smallest
partition granularity possible in Im1, Im2respectively (see Figure 5.6(a)).
Based on this, we can define TIm1=n∗tIm1+c1, and TIm2=n∗tIm2+c2, where c1, c2
are constants ≥0. Note that we still did not consider any inter SESs communication
overhead.
For a service to complete its execution, nsuccessive SESs are required to execute and
communicate regardless from which implementation each SES is. The communication
between SESs occurs either between two successive SESs in the same implementa-
tion, denoted tc1and tc2in Figure 5.6(a), or between two successive SESs in different
implementations, denoted tc12 and tc21 in Figure 5.6(a). The time needed for two suc-
cessive SESs in different implementations to communicate is usually greater than the
time needed if they were in the same implementation.
Recalling our assumed OS service X, we can define the Worst Case Service Execution
(WCSE), in terms of inter SESs communication overhead, as follows:
WCSE =
n−1
[
i=1
{(SESImx,i , SESImy,i+1 )}, where
67
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
(a) OS service with two implementations partitioned
to the minimum possible fine granularity allowed by
both architectures.
(b) The configuration with maximum inter SESs
communication overhead (the service W CSE).
Figure 5.6: An OS service example with the configuration of maximum inter SESs com-
munication overhead
68
5.3 SESs granularity and pipelining
x, y ∈ {1,2}, and x 6=y.
This means that the WCSE is the combination of SESs in which each SES in a set
of two successive SESs {SESi, SESi+1}belongs to a different implementation than
the other, see Figure 5.6(b). The time to execute the WCSE of Service Xcan be
computed using Equation (5.25).
TW CSE =K+n
2tIm1+n
2tIm2+C. (5.25)
Where,
K=max(Ix) + max(Fx), x ∈ {1,2}.
C=n−1
2tc12 + n−1
2tc21.
Assuming that each implementation is partitioned into SESs of the minimum size
possible, then tIm1will be the minimum WCET for SESs ∈Im1, and tIm2will be the
minimum WCET for SESs ∈Im2. From this we can define the minimum WCET
in WCSE as tmin =min(tIm1, tIm2). Using the last definition we can define the
WCET for service Xwithout considering the communication overhead as TW CET =
n∗tmin, see Figure 5.7(a).
Let βbe defined in Equation (5.26) as the time ratio of an OS service WCET with
the inter SESs communication overhead to the OS service WCET without the inter
SESs communication overhead, see Figure 5.7(b).
β=TW CSE
n∗tmin
.(5.26)
Let TW CSE be limited to an upper bound of β∗T=β∗n∗tmin where βis a heuristic
value and 2≥β > 1. When beta is 1, it means that there is no inter SESs communi-
cation overhead, when it is 2 it means that each inter SESs communication overhead
requires the same time as the WCET of a SES. The value of β= 2 is chosen as an
upper bound in reference to Section 5.3.2.
Substituting Equation (5.26) in (5.25) we get
tmin ≥K+n−1
2(tc21 + tc12)
n∗(β−1) ,2≥β > 1(5.27)
69
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
(a) Partitioned OS service with each partition has as minimum WCET tmin as
possible. No communication overhead is considered
(b) The partitioned OS service with tcinter communication overhead.
Figure 5.7: A fine granularity partitioned OS service example with and without inter
SESs communication overhead
The communication time tc21 and tc12 are equal to the maximum time between any
two successive SESs communicating with each other. Equation (5.27) gives an es-
timation of the minimum allowable SES granularity in terms of the underling inter-
communication time. Figure 5.8 shows some values for a minimum granularity limit
of a SES when tc21 = 75µs and tc12 = 65µs.
The communication between SESs can either be control messages or data. Examples
of control messages between an RSoC middleware and the SESs include: a START
signal to initiate the processing procedure, a DONE signal to indicate the end of data
processing, an ADDRESS signal as a pointer to the data to be processed, and other
messages intended for fault recovery, distribution, and obtaining metadata about the
SES itself. To gain access to the data to be processed, the SESs use shared memory.
This requires no extra complex synchronization as there is no competing between any
two SESs to access such data at the same time. The middleware controls when a SES
70
5.3 SESs granularity and pipelining
Figure 5.8: A comparison of the minimum allowable SES granularity with regards to the
underlying inter-communication time. All values in µs
can access the data. This accessing is commenced sequentially depending on specific
conditions being satisfied.
The data to be processed is better suited if it is divided into chunks. This will enable
the use of pipelining as discussed in Section 5.3.2. Moreover, it will allow controlling
the WCET of each SES as this is normally proportional to the size of the processed
data chunk.
5.3.2 Pipelining execution and communication time
Although the partitioning of an OS service into SESs improves execution flexibility,
adaptation resource variations, optimization, and fault tolerance, it has its drawbacks
as well. The added inter SESs communication overhead may increase the WCET
of a service significantly. This is especially true when the chosen configuration of
an OS service involves much inter SESs communication between SESs in different
implementations.
Although this non-negligible communication overhead is unavoidable, the SESs of
an OS service provide a workaround solution. The partitioning into SESs enables the
execution of a service using pipelining. This, if used, can reduce the WCET to be
even less than that of the non-partitioned OS service.
To explain this, let us assume a non-partitioned OS service Xwhich can process a data
chunk di∈Dwith a WCET T, where D=Sω
i=1 di, and ωis the total number of data
chunks. Let the OS service Xbe partitioned into n SESs of minimum size and equal
execution time tmin, where, tmin is the WCET time required by a SES to process a
71
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
data chunk di. Normally it is preferable to have a communication time which is less
than processing time. However, to see the benefits gained by using pipelining we will
consider an OS service execution configuration in which all inter SESs communica-
tion times being the worst case communication time tc =tmin. Hence, the total com-
munication time in the chosen complete OS service configuration will be (n−1) ∗tc.
This sums the total WCET of the OS service configuration to (n−1) ∗tc +n∗tc =
2*n∗tmin −tmin ≤2∗T.
Following what was assumed, a non-partitioned OS service requires a total time of
2∗Tto process two data chunks. For ωdata chunks we need a time of ω∗T. The
partitioned OS services, if no pipelining used, requires a total time of 2∗ω∗Tto
process ωdata chunks (see Figure 5.9).
Figure 5.9: The time needed to process one data chunk difor non-partitioned OS service
Xand the partitioned OS service Xwith worst case inter SESs communication time tc
Using pipelining (see Figure 5.10), we could have the first SES in a service execution
begin processing the next data chunk as soon as it finishes processing the first data
chunk. In the meanwhile the processed data will be transferred to the successor SES
in the service execution. Following this pattern, every SES will begin executing a
new data chunk at the same time the processed data is being transferred to a successor
SES. Doing so, the total WCET would be minimized to have the value defined in
Equation (5.28).
Tpipeline = 2 ∗T+m∗T
µ= 2 ∗T+m∗tmin (5.28)
For a service processing one or two data chunks, the non-partitioned version of a
service will be faster. However, for processing a large number of data chunks, the
72
5.3 SESs granularity and pipelining
Figure 5.10: The time needed to process three data chunks d1−d3for non-partitioned
OS service Xand the pipelined partitioned OS service Xwith worst case inter SESs
communication time tc
pipelined partitioned service will have the upper hand in a lower execution time (see
Figure 5.11).
Figure 5.11: TA comparative representation of the time needed to process a number
of data chunks using non-pipelined partitioned service, non-partitioned service, and a
pipelined partitioned service
73
5. OS SERVICE CONFIGURATIONS AND ADAPTATION
Although this discussion shows a possible solution to the communication overhead
problem, it remains adequate to mention that the main objective of partitioning is not
to have a faster execution, but to provide for adaptation to resource variations and
applications demands expected in a distributed RSoCs system.
5.4 Chapter Conclusions
The multiple implementations and partitioning allow for many possibilities to execute
an OS service. Each possibility has different resource requirements, hence the adap-
tation. However, finding the proper configuration for a specific resource environment
and a set of constraints requires efficient and fast algorithms.
This chapter presented the formal description of the OS service model and the consid-
ered constraints. The developed exploration algorithms explore an OS service model
to find a proper configuration under some specific criterions. These may be finding
such a configuration which allows the OS service to run under scarce resources or a
configuration which has a Pareto optimal usage of resources. An example was pre-
sented which uses these algorithms to find different configurations of an arbitrary OS
service under some chosen constraints.
As this OS service model deals with partitioning, some discussion on the granular-
ity of these partitions was presented. Furthermore, the use of pipelining technique is
introduced as a way to overcome the inter partitions communication overhead.
74
CHAPTER 6
OS services distribution
Distributed embedded (RSoCs) systems are in general limited in resources. However,
they are desired to deal with an unpredictable spectrum of applications. An OS de-
signed for distributed embedded system is faced with two contradicting requirements.
An OS mostly is preferred to provide a wide range of services to support a broad range
of applications. On the other hand, it is expected to use as little resources as possible
to allow enough resources to be utilized by applications. All of this has to be done
without violating any constraint imposed by the requested QoS.
There is no doubt that the introduction of another computational element enables the
execution in parallel. However, if these are not of the same kind (e.g.: one is an
FPGA and the other one is GPP), we would be given the advantage of heterogeneous
execution. This is because a task executed on GPP may have a response time and
resource usage which is significantly different from its response time and resource
usage if implemented on FPGA.
The OS service design, introduced in Section 4.2, utilizes this heterogeneity to allow
for flexibility and adaptability in scheduling real-time applications/tasks. Moreover, it
allows for an OS service to be distributed by distributing its SESs across the RSoCs in
the system. Doing that allows resource sharing and fault tolerance. In addition, RSoCs
can share to hold a wide variety of OS services. Nevertheless, basic middleware is
still required to exist on each RSoC in the system, see Section 3.2.1. This middleware
75
6. OS SERVICES DISTRIBUTION
provides the necessary means and units to manage and realize the intended objectives,
see Section 4.1.
6.1 SESs distribution
In a distributed system, it is highly expected for applications/tasks to arrive (execute)
and leave dynamically. In other words, a static set of applications/tasks for one RSoC
may not be known in advance. As a result some RSoCs may need to execute OS
services or have some functionalities that are previously not present or anticipated. One
solution to overcome such a problem is to have a complete OS copy on each RSoC.
However, this means the reservation of resources for some OS services on each RSoC
which may never be used. Additionally, it would prevent applications from having the
luxury of using more resources, if needed, to speed up their computations which may
be necessary sometimes. Moreover, this contradicts our assumption of RSoCs having
limited resources. Nevertheless, we still want an RSoC to have access to any service
that can be offered by the OS to support a wide range of applications. We also want
the OS service to have the attributes of adaptability, fault tolerance, and recovery.
6.1.1 Heterogeneity and Distributed Environments
The RSoCs in the distributed system are assumed to be heterogeneous. Although
heterogeneity often denotes the diversity of hardware, we refer here to heterogene-
ity which occurs in resources with equal functionality. For instance the size of FPGAs,
the power consumption or storage capacity. Both heterogeneity areas meet different
needs in distributed environments. Mainly, there are two tasks to be solved. First, the
identification of resource entities offering the required functionality. Second, allocat-
ing the resources according to an objective function, e.g. even percental usage. Here,
we concentrate on the second one. Thus, we denote with heterogeneity different sizes
of the same resource type.
Heterogeneity can be resolved in different ways. If dynamical changes in a distributed
environment like failing, inserting or removing resources are allowed, simple static
solutions are inefficient [157,103]. A classical dynamical approach is Consistent
Hashing aka. Distributed Hash Tables(DHT) [88]. It can be used to balance uni-
form resources and handles dynamical changes efficiently. In case of heterogeneity it
is combinable with capacity normalization, which typically increases the model com-
plexity. This is because the number of entity representors depends on the normalization
76
6.1 SESs distribution
factor of each entity. This factor is determined for each entity and either described by
the ratio of its capacity to the smallest entity or to the smallest capacity difference.
This unpredictable behavior makes it hard to use such an approach in restricted envi-
ronments like embedded systems. So, models resolving heterogeneity in a dynamical
environment directly with less complexity and being independent of the heterogene-
ity gap are needed. To mention some models we refer to SHARE, SIEVE, DHHT,
SPREAD [27,129,108,26]. Those models are nearly independent of this gap, but
intensively use random values and inherit side effects from balls into bins problems,
namely deviation and logarithmic complexity overhead. The deviation problem can be
handled by multiple choice [109], but the complexity overhead remains as well as the
necessity to embed fault tolerant codecs, see Section 6.1.2. This embedding is possible
if distinct entity selection is provided.
6.1.2 Fault Tolerance or Availability
Fault tolerance concerning the availability of data, can either be solved by redundant
data segments or by using a specific encoding scheme. However, reliability requires
the segment placement on distinct devices. The method decision depends on the stor-
age or performance available to apply the decoding and needed to fulfill additional
constraints. In Storage Area Networks a quiet common scheme is used by RAID IV
or V [115]. Here, the advantage of XOR parity encoding is given by using less stor-
age compared to duplication. Furthermore, the encoding is easy to calculate on con-
trollers. However, this compensates only one failing device. A later published scheme
RAID VI includes a two dimensional parity scheme, tolerating any of two devices to
fail. The code is based upon spacial Reed-Solomon-Codes [119], which actually is
a k-dimensional MDS Code [53]. However, even by using k= 2 the code is still
computation intensive, whereas the Liberation Code [120] offers a two dimensional
parity information still determinable by XORs. So, the application domain decides
which code is applicable and eventually the evolution of the domain and its future
perspective. Here, we will simply rely on the idea that we can use an η=m+λ
codec providing us m+λ
λpossibilities to reconstruct the original information. For this
we have to determine access patterns of m+λdistinct entities capable to allocate all
available resources in the system.
On the other hand, at one point, there may be no suitable OS service configuration for
one RSoC, see Section 5.2.2. Or the OS services repository, which may be a key role,
fail to exist. So although the partitioning of services implementations enables the OS
services to adapt to many tasks/applications requirements, the limitation introduced
from the resources such as FPGA area and power consumption may hinder it. More-
77
6. OS SERVICES DISTRIBUTION
over, in a distributed system this may happen to be on one RSoC but not on the other.
This means resources may exist on other RSoCs which allows faster configuration to
be chosen.
So in order to benefit from the heterogeneity introduced in RSoC resources and to
avoid single point of failure introduced in centralizing the OS, see Section 3.1.1, we
distribute OS services SESs across all RSoCs, see Section 3.1.3. Moreover, we want
to be able to share the resources to execute any configuration using the collaboration
of many RSoCs.
6.1.3 Distribution stages
Due to the heterogeneity of resources, some RSoCs may be able to execute a complete
service or even a mini OS using its own resources while others may not have the
chance to execute even a part of an OS service. To overcome such a variation, we want
to view transparently all the RSoCs as if they are one when it comes to resource usage,
and as many when it is related to fault tolerance and error recovery. To realize this,
we distribute the OS services SESs across all the RSoCs in a way that insures load
balancing, fast SESs discovery, error tolerance, and fault recovery. These objectives
can be archived in two stages, the initialization stage and the discovery and routing
stage.
6.2 The Initialization stage
As mentioned before, we consider a heterogeneous distribution topology which com-
bines the attributes of the fully decentralized and the centralized systems, see Section
3.1.3. This is because we have an OSR which has all the OS services and can act as a
centralized node. However, this OSR is only necessary in initialization and can act as
a backup in the other stages.
In the initialization stage we want to distribute all the SESs of the OS services across
the RSoCs in the system. As both the SESs and the RSoCs are heterogeneous, a
fair distribution is required to balance the percentage utilization of the resources in
the distributed RSoCs. Hence, the ability of an RSoC to provide for any requests to
execute these SESs without violating any constraints.
One solution is to use random distribution. A repository is assumed in the initial stage
to hold the OS services SESs. These SESs are distributed randomly on RSoCs.
78
6.2 The Initialization stage
However, if an RSoC receives a SES which exceeds its SESs resource reservoir, the
SES is rejected and it is propagated to another RSoC. The RSoC SESs reservoir is a
resource percentage reserved for the sake of storing and running a number of SESs.
To allow a kind of balancing, a simple algorithm which involves an updated polarity
number and a simple counter is used.
Figure 6.1: Random OS services SESs distribution with load balancing
The counter in the repository is initialized to the number of RSoCs and the polarity
of each RSoC is set to be positive. All the SESs are with negative polarity. Each
time a SES is sent out from the repository, the counter is decreased. If an RSoC with
positive polarity receives a SES, it attracts the SES, stores the SES in its resource
reservoir, and changes its polarity to negative. On the other hand, if an RSoC with
negative polarity receives a SES, it repels the SES to other RSoCs, see Figure 6.1. A
SES is propagated through the RSoCs until it reaches an RSoC with positive polarity.
Once the counter in the repository reaches zero, a message is sent to reset the polarities
of the RSoCs which still have some free resource reservoir to positive, the counter is
set to the number of these RSoCs, and then the operation repeats.
Although this may lead to some load balancing, it is still not fair. Due to RSoCs
resource heterogeneity, in some cases one RSoC may have its reservoir totally used
while others are barely touched. Moreover, there is no information about the locations
of SESs. This requires a discovery algorithm, prior to service execution or recovery
79
6. OS SERVICES DISTRIBUTION
process, in order to find the SESs belonging to the desired OS service configuration,
see Section 6.3.2.
Another solution is to use a deterministic algorithm. The General Fork Distribu-
tion (GFD) algorithm is developed to insure the load balancing of the heterogeneous
SESs distribution over the heterogeneous RSoCs. It also provides important informa-
tion which is used in secondary stages like execution and fault recovery, see Section
6.2.1.
6.2.1 Distribution algorithm
The deterministic GFD distribution algorithm works by determining patterns (each
called a fork distribution pattern) with the purpose described in Sections 6.1.1 and
6.1.2. Let Vbe the set of all RSoCs in the network offering the needed resource and
Dthe set of SESs aka. items of the OS. Further, c(x)is an arbitrary capacity function
which describes the context depending on the correspondent entity. The c(x)function
denotes the required allocation resource when corespondent to a SES entity, where
c:D7→ R. In case of an RSoC entity, it represents the available resource capacity,
where c:V7→ R. In case of a fork distribution pattern fdp, it defines the used resource
size, where c:Vm+k7→ R. Furthermore, we assume that c(d)<< c(v); d∈D, v ∈V
so SESs are much smaller than the OSR or RSoC capacity.
The General Fork Distribution (GFD)
The aim of this algorithm is to partition and to combine a set of limited and fixed
resource types of heterogeneous size. The combination of partitions is made to obtain
a fork distribution pattern set FDP ⊆Vm+λcapable, and to allocate the resources
such that Pv∈Vc(v) = Pfdp∈F DP c(fdp). Further, FDP has to be defined such that
for each fdp ∈FDP, the number of distinct RSoCs is |fdp|:= |{v, v ∈fdp}| =
m+λ. To fulfill this is essential because of the encoding schemes we want to apply,
see Section 6.1.2.
The fork distribution pattern set FDP construction is described in two parts. The
first part is the construction of m+λframes, Algorithm 6.1, Line 3. This is done
by sorting RSoCs decreasingly according to a capacity function c(x)in a list L. This
list is then normalized to list M, such that, for each vi:= L.elementAt(i)|i∈
{1,...,|V|}, then v1owns [0, rv1],v|V|owns [1 −rv|V|,1], and vi|1<i<|V|owns
[Pi−1
j=1 rvj,Pi
j=1 rvj], Algorithm 6.1, Lines 1and 2. This part describes the order of
80
6.2 The Initialization stage
subsequent resources within m+λframes in M. Thus, each x∈Mbelongs to a
frame Fjand identifies an RSoC vcovering the specific area. Note that a resource can
possibly overlap from Fjto Fj+1.
The second part describes the construction of FDP, based upon the frames by control
variables, Algorithm 6.1, Lines 4to 15.
Figure 6.2: An illustration of the General fork distribution algorithm distinct pattern de-
termination
Algorithm 6.1 General Fork Distribution Algorithm
Require: set of RSoCs V, set of SESs V
1: sort ∀v∈Vamong c(v)decreasingly ordered into a list L
2: define M:= [0,1] to be the ordered capacity in L, where each v∈Lowns a range of length
rv:= c(v)·(Pu∈Vc(u))−1
3: define Fj|j∈ {1, . . . , m +λ}to be the subsequent frame partitioning in Mof size 1
m+λ, where
Fjcovers [(j−1) ·(m+λ)−1, j ·(m+λ)−1]
4: define a fork with m+λfingers each labeled by fj, where 1≤j≤m+λ
5: set an offset δ:= 0
6: set fj:= (j−1) ·(m+λ)−1, identify the RSoCs fdp := {v1, . . . , vm+λ}covering the finger
positions in M
7: set FDP := {fdp}.
8: while δ≤(m+λ)−1do
9: set fdp0:= fdp
10: increase δ
11: if any fj+δidentifies a new RSoC, means |fdp ∩fdp0|>0then
12: identify the RSoCs fdp := {v1, . . . , vm+λ}covering the new finger positions in M
13: set FDP := FDP ∪ {fdp}
14: end if
15: end while
16: At this point we can either return the fork distribution pattern set FDP to be used with an external
distribution algorithm or
17: using a proper resource weighing function c, distribute SESs over the the fork distribution patterns
in FDP
81
6. OS SERVICES DISTRIBUTION
The described algorithm behaves like a fork with (m+λ)fingers of distance (m+λ)−1
shifted among the range. Each time a finger touches a new RSoC area, the identified
RSoCs are added as a new pattern to FDP. Both described construction parts to obtain
FDP can be reconstructed by the illustration in Figure 6.2.
Clearly, the number of frames should be defined by the chosen encoding scheme, see
Section 6.1.2. This is identified by its fault tolerance λto be m+λframes. In such
a case, the fork would have a number of m+λfingers, and the architecture defines
patterns with each of them consisting of distinct m+λRSoCs. Thus, any encoding
scheme producing m+λsegments can be mapped directly to a pattern and fulfill its
computation constraints. However, it must be noted that this can only be guaranteed, if
the constraint max({c(v)})≤Pu∈Vc(u)holds. This means, ∀rv, rv≤(m+λ)−1is
true. If violated by any v∈V, the distinct pattern construction is impossible. This can
be fixed by including vwith less c(v). Eventually, it remains to spread the allocation
request evenly among the patterns so that they satisfy the objective function.
One possibility that may occur is to have an RSoC with an overcapacity. This however
could be combined with other similar RSoCs to define further patterns. Nevertheless,
it may not be possible to combine at least m+λRSoCs at a time. But it may remain
adequate to fulfill the objective function by spreading data among the new patterns,
which disburdens the others. On the other hand, such overcapacity could be utilized
for other duties.
A similar effect occurs if RSoCs are failing or are being removed. If no more than λ
RSoCs per pattern are failing, the decoding is not endangered. If patterns should pro-
vide the fully specified decoding functionality they must be reorganized. This healing
can be done by identifying those patterns with failing RSoCs. Then, with the remain-
ing working RSoCs, we construct a set of new patterns as described. Again, this is
only possible if the number of RSoCs is equal or larger than m+λ. If combining
was possible, the data of the new patterns must be reorganized to repair the encoding.
Again, overcapacity might remain. In any case, the new additionally defined patterns
have in total less capacity than before. This capacity loss and the constraint of the ob-
jective function implies data reassignment. The data flow will be from the new patterns
among the old ones. So it is similar to the previous one, but in the other direction.
It must be noted that, if in any of both described cases the m+λconstraint is not
satisfiable and untouched patterns are existing, due to a pattern reservation scheme or
a small amount of data distribution, one can involve some of them too. Eventually,
it depends on the costs for reorganizing data, because of repairing the data encoding.
The more patterns are involved, the less load percentage per RSoC, hence the less
82
6.3 The discovery and execution routing stage
rebalancing is needed. On the other hand, the more the patterns, the more the RSoCs
involved in the encoding, hence, the more the costs to repair the encoding.
6.3 The discovery and execution routing stage
After the SESs distribution, the next stage is defining the methods of finding the
combination of SESs for a desired OS service configuration. This stage is required to
execute an OS service either on one RSoC by SESs migration or on multiple RSoCs
by data transferring. In either ways, we need to know where are the SESs belonging
to one OS service.
This process, however, is depending on the distribution stage. In case of random distri-
bution, see Section 6.2, there would be no information on where each SES is located.
This requires adequate methods to discover and locate SESs.
Approaches to discover distributed entities have been proposed in different contexts.
To find resources or entities situated on different nodes in a network, the classical
link-state protocol [117] maintains a global view of the entire network at each node.
To keep this information up-to-date, each node periodically initiates a flooding of the
network. Naturally, this method does not scale well. To improve scalability, Perkins
and Bhagwat [117] proposed DSDV, which is to a certain extent similar to the link-
state approach but only maintains the information about the destination, the next hop
towards it, the corresponding distance, and the sequence number. Motivated by the
fact that most of the data exchanged by DSDV is not needed (e.g. since a certain
entity is of no interest) and changes in the topology cause a network flooding, Perkins
and Royer proposed AODV [118]. Being similar to DSDV, it only discovers remote
entities and repairs routes when needed, i.e. in an ad hoc manner. As DSDV and
AODV discover only a single entity and a single path to such an entity that matches
a certain requirement (e.g. a certain node ID), both are not suited for inter-service
communication on distributed RSoCs. Protocols that aim to enable this functionality,
uniting multicast and multipath methods, such as ADMR [83], however fail to exhibit
a behavior that adequately reflects the underlying network properties, e.g., the distance
to, the quality of, or the quality of the path towards the remote entity. Several further
approaches have been proposed that ignore the network topology and instead focus
on the semantic level of service discovery: One of the most prominent examples is
Sun Jini [84], which relies on a central service directory, where providers register and
clients send queries to, which unfortunately represents a single point of failure. SLP
83
6. OS SERVICES DISTRIBUTION
[138] is in many ways similar to Jini, however, it supports multiple directories (i.e.
directory agents).
In our case we have developed two approaches. Our biologically inspired approach,
presented in Section 6.3.2, is based on a hop–by–hop reactive routing algorithm [81].
This approach was inspired by the swarm intelligence behavior of ants communicating
by changes of their environment [69]. However, this approach is not deterministic, in
a scenes that the time needed to discover an entity is hard to be determined prior to the
discovery process.
On the other hand, using the deterministic approach, which is presented in Section
6.2.1, much useful information could be obtained seamlessly. These information are
utilized to build the "OS service execution routing graph" which is stored in the START
entity of the related OS service. An "execution routing graph" contains information
about the OS service SESs, the communication cost, the execution data, and the lo-
cation of each related SES in the distributed system. Using these graphs and the
algorithms presented in Chapter 5, a desired OS service configuration can be obtained
at runtime with minimum effort.
To cope with the deterministic approach, two important records are added to each
metadata SES at the distribution stage. The first record contains information about
the RSoCs holding alternative SESs. These are the SESs with the same behavior
as this SES. The size of the alternatives record is m+λ−1and it is used for fast
recovery and fault handling procedures.
The second record contains information about the RSoCs holding the direct successors
SESs of this SES. The size of this record is m+λand it is used for rebuilding the
"execution routing graph" in case the originals were lost. In addition, this record can
be used to execute OS services using a greedy algorithm by choosing every time the
next SES with minimum resources. As this will provide no prior information on the
total execution resources of an OS service, it can be done when no constraints exist.
6.3.1 The building of the execution routing graphs
The seamless building of OS services execution routing graphs can be done using the
GFD algorithm. These graphs are used at runtime for routing data, SESs migration,
recovery, and execution adaptation in regard to resources availability and requirements.
An execution routing graph of an OS service is stored in the START entity of the ser-
vice. Because of the used encoding, see Section 6.1.2, the START entity exists in
m+λcopies for each OS service. These, unlike the SESs entities, are merely copies,
84
6.3 The discovery and execution routing stage
with no encoding, to allow quick accessing and processing. An execution routing graph
consists of Metadata Distributor Entities (MDE) and directed edges. An MDE holds
information about one SES. The directed edges, between two MDEs, represent com-
munication time between successive SESs. This is defined as the communication time
between RSoCiand RSoCi+1 such that RSoCiholds SESiand RSoCi+1 holds any
immediate successive SESi+1 of SESi, where 1≥i≥n−1. The mapping between
SESs and MDEs is one to one. The information stored in an MDE includes the
address of the RSoC where the correspondent SES is located, the correspondent SES
ID, and the execution information about the correspondent SES. The SES execution
information includes power estimation, WCET, and area/utilization requirements.
For instance, let us assume an OS service Xwith two implementations Im1and Im2.
For the sake of discussion, we will assume no special encoding, but a one copy redun-
dancy for each SES. This produces an m+λ= 4 functionally equivalent SESs; two
with Im1and another two with Im2. To distribute these SESs, the GFD algorithm
finds a set of fork distribution patters FDP with each pattern having a set of four
distinct RSoCs.
The building of the execution routing graph is done step by step each time a set of four
functionally equivalent SESs are chosen for distribution over the RSoCs in a fork
distribution pattern. This is to allow on-the-fly-building of the distribution records
and the execution routing graph. Otherwise, the communication costs and the lo-
cations of the alternative SESs and the successor SESs for each SES have to be
collected again. For instance, assume that we have distributed the SESs set Si−1=
{SESIm1,i−1, SESIm1,i−1, SESIm2,i−1, SESIm2,i−1}of the OS service Xto the four
RSoCs in fdpi−1and we want to distribute the SESs set Si={SESIm1,i, SESIm1,i, SESIm2,i, SESIm2,i}.
From the fork distribution patterns found using Algorithm 6.1, Algorithm 6.2 selects a
fork distribution pattern fdpi. This pattern is best suited to distribute the SESs in Si
and keep the system balanced in regard to the balancing criterion, see Section 7.2.1.
Algorithm 6.2 then adds four MDEs to the execution routing graph. Each added
MDE is a metadata about one SES in the set Si. Directed edges are added to each
MDE representing a SES in the set Si−1to connect it to every MDE representing a
SES in the set Si.
The algorithm then updates the alternative execution record in each SES in the set
Siwith the addresses of the RSoCs in fdpi. It also updates the next execution record
in each SES in the set Si−1with the addresses of the RSoCs in fdpi−1. To update
the communication edges, the algorithm gathers the communication time between the
related RSoCs. This is done by sending the RSoCs in fdpi1a request to probe each
85
6. OS SERVICES DISTRIBUTION
RSoC in fdpifor communication time. These communication times are then used to
update the edges in the execution routing graph, see Figure 6.3.
Having the execution routing graph built, the algorithm stores it in each START entity
of the related OS service and distributes them by choosing another fdp.
Algorithm 6.2 Make Execution Routing Graph
Require: SESs of the OS service X, distribution criterion
1: FDP =get the fork distribution patterns set using Algorithm 6.1.
2: COMM_TIME_SET=NULL
3: Start=getStartSES(X)
4: repeat
5: Si= getEncodedSet(X,i) // {get the m+λ SESs}
6: fdpi=FDP.getDistributionPattern(Si, criterion) {get a distribution pattern best suited for SESs
in Siwith regard to the distribution criterion}
7: Si.updateAER(fdpi) // {Update the "Alternative Execution Record" in the SESs in Si}
8: if inot equal 1then
9: Si−1.updateNER(fdpi) // {Update the "Next Execution Record" in the SESs in Si−1}
10: COMM_TIME_SET=fdpi−1.GetCommTimeSetOf(fdpi)
11: end if
12: Start.makeMDNsAndAddToExecutionRoutingGraph(Si,fdpi,i)
13: until iis incremented from 1to n
14: make m+λcopies of the Start entity and distribute them using another chosen fpd
The locations of the START entities can be broadcasted to all the RSoCs. When an
RSoC requires an OS service, it acquires the closest START entity of that OS ser-
vice. The RSoC then uses the algorithms presented in Chapter 5to identify a suitable
configuration for its current resources and constraints.
The recovery of execution routing graphs is possible even if all the START entities of
one OS service were somehow lost. This is because each SES is accompanied with
the Alternative and Next execution records. By tracing these records, the execution
routing graph of one OS service can be rebuilt.
6.3.2 Discovery and execution using self-x routing
In case of random distribution, see Section 6.2, there would be no information on
where each SES is located. In such a case only local information can be used. The
approach followed is based on the emergent self–organization metaphor.
An RSoC which requires the execution of an OS service first needs to find the SESs
belonging to that OS service, see Figure 6.4. This is done using an algorithm which was
86
6.3 The discovery and execution routing stage
(a) On the left is a simple OS service with two implementations, each partitioned into
three SESs. On the right is a network of distributed heterogeneous RSoCs with arbitrary
edges denoting communication time.
(b) The execution routing graph of the OS service in 6.3(a) with m+λ= 4 encoding. Note that
each MDE shows only the SES ID and its located RSoC ID, other information, such as the SES
execution data, are not shown.
Figure 6.3: An example of an OS service execution routing graph with distribution over
heterogeneous RSoCs with arbitrary communication time.
87
6. OS SERVICES DISTRIBUTION
Figure 6.4: An example of finding a desired OS service configuration using only local
information
developed with the goal of finding in the network the closest suitable implementation
for a given SES. To achieve this goal only local information from the neighbors can be
used. This algorithm is based on a hop–by–hop reactive routing algorithm [81] which
was inspired by the swarm intelligence behavior of ants communicating by changing
their environment [69]. When a food source is found, a chemical substance called
pheromone is deposited by ants on the paths towards this source. Pheromone can be
smelled by other ants and may attract them to the food source. More concretely, during
the probabilistic route selection, a route with a higher pheromone value is chosen with a
higher probability. Given its physical properties, pheromone evaporates exponentially
with time unless a new portion of pheromone is deposited. If a path becomes favored,
there will be a higher amount of pheromone deposited contrary to a path which is less
favored. Paths which are obsolete vanish as soon as the last amount of pheromone
88
6.3 The discovery and execution routing stage
has evaporated. This particular behavior of ants is useful in terms of two problems
which we face in our distributed RSoCs: (i) Finding the shortest path to SES, and (ii)
finding a good implementation of the given SES. This is achieved without any global
knowledge of the topology or central coordination.
In the distributed system, each RSoC stores digital pheromones in a structure called
segment routing table which has the following format:
destination segment SESid | next hop ID | implementation im | pheromone
ψSESid
Every segment has assigned its identifier SESid which is identical for all instances of
the segment, independently of the implementation. With every segment being associ-
ated ID of the next hop RSoC and also an amount of pheromone on that path to ψSESid
which has been deposited so far, as well as, the type of implementation which may ei-
ther be Im1or Im2if an OS service with two implementations is assumed. Regularly,
every time interval τ, values of pheromones in the table are decreased by a pheromone
decay coefficient ψτ∈[0,1):
ψSESid (t+τ) = ψSESid (t)·ψτ(6.1)
On a given path, if there is no pheromone deposited for a longer period of time, this
path becomes less favorable. In order to avoid the creation of loops, each RSoC has
to maintain a recent message table saving all identifiers (which are unique) of all mes-
sages received according to a least-recently-used strategy. The segment discovery al-
gorithm consists of two steps: (i) Availability check of a segment and (ii) migration of
data to the found SES.
Availability check
The goal of this step is to check and ensure that there exists at least one path to a
requested segment. This is done in two phases: a forward and a backward phase.
In the forward phase, links towards the requester are strengthened, or established re-
spectively. The backward phase does the same towards the RSoC with a requested
segment. Once an RSoC finishes one SES computation, the RSoC migrates data to an-
other RSoC which is hosting the next subsequent segment. For this reason, the RSoC
creates a message called FDAnt (forward discovery ant) (Table 6.1).
When an RSoC receives the FDAnt message it proceeds as follows:
89
6. OS SERVICES DISTRIBUTION
Field Value
Last hop Requestor ID
Hops 0
Type FDant
Requestor ID
Requested SES SESID
History Cleared
Implementation Im1or Im2
Table 6.1: Fields of FDAnt mes-
sage
Field Value
Last hop SESID provider
Hops 0
Type BDant
Requestor ID
Requested SES SESID
History Cleared
Implementation Im1or Im2
Table 6.2: Fields of BDAnt mes-
sage
1. RSoC checks: (i) whether the message exceeds its maximum lifetime, or (ii) the
RSoC is already in the History field. In those cases, the message is discarded. If
the message is contained in the recent messages table, before it is discarded, the
message’s last hop is recorded in the routing table.
2. If the message is not discarded in the first step, then the following happens:
•The message is registered in all the relevant tables.
•If the RSoC contains the required SES, it answers with a BDAnt.
•The message is propagated using broadcast. Unicast is used if relevant
pheromone values are exceptionally high. Moreover, it is used in general
for all subsequent messages which transfer payload data between the two
endpoints.
A RSoC which can provide the required segment, i.e. SESid, creates a BDAnt (back-
ward discovery ant) message based on the copy of the received FDAnt. In the backward
phase, the BDAnt message (Table 6.2) is routed back using links that have been set
up by the forward phase. Unicast routing works as follows: A message m, from origin
Dorig comes from a node with the implementation Im1, arriving at node i. Before
forwarding it further, ialters the corresponding routing table entries:
ψDorig,Im1=ψinit (6.2)
if ψDorig,Im1< ψinit, with ψinit being a value used for initialization. Basically, when
there is no pheromone or it is below the threshold of ψinit, it is reinitialized to this
level. Else, if ψDorig,Im1≥ψinit,
ψDorig,Im1=ψDorig,Im1+ψδ(6.3)
90
6.4 Chapter Conclusion
so that a constant amount (ψδ) per used link is added to the already existing pheromone
level (ψDorig,Im1), resembling the natural model. Next-hop selection is realized using a
probabilistic method. A message heading towards the destination Ddest, arriving from
a node with the implementation Im1at node iis sent to node j,j∈NDdest (i.e. jis in
the next-hop table for the destination Ddest), with the probability
pDdest,j =ψDdest,j
Pk∈NDdest ψDdest,k
(6.4)
After forwarding a message, the fields Last hop,Hops, and the History are updated.
Decay of pheromone is regularly done as described in equation (6.1).
Data migration
After a successful discovery of the required SESid segment, the data is sent using the
routing mechanism described above. When data arrives, the hosting RSoC sends back
a confirmation to the requesting RSoC about receiving data. The requester waits for
confirmation for a certain amount of time. When the timeout expires, the requesting
RSoC repeats the segment discovery process in order to find another SESid provider.
6.4 Chapter Conclusion
Situations may exist where there are not enough resources to run an OS service on one
RSoC. A solution would be to run the service on another RSoC or to share resources
with more than one RSoC in a distributed system. To enable such an option, this
chapter has presented heuristic algorithms which allow load balanced distribution of
the OS services over the RSoCs.
The distribution algorithms deal with multiple resources with heterogeneous capac-
ities. Two main algorithms were introduced, one deterministic algorithm, namely
GFD, and one biologically inspired algorithm. The deterministic algorithm has the
advantage of knowing the time cost needed prior to the discovery process. Moreover,
it allows the building of records which are used at runtime for fast localization, explo-
ration, and discovery of the suitable configuration for an OS service.
Although the biologically inspired algorithm seams to be not that good, it has a lot of
potential, especially because it uses no records and it has more dynamics. However,
in order for such an algorithm to work, it has to be initiated before an OS service
is being requested. This requires some prediction of the future resources availability
91
6. OS SERVICES DISTRIBUTION
and constraints which may not be always correct. Such an algorithm can be used in a
non-realtime distribution system.
92
CHAPTER 7
Methods Evaluation
This chapter presents the evaluations of the heuristic algorithms introduced in Chapters
5and 6.
The evaluation is done using the ShoX simulator [94]. ShoX is a special simulator for
wireless networks. It can be used for evaluating new network protocols, wireless nodes
mobility, signals propagation, nodes energy, and traffic models. In addition, the ShoX
can be easily modified to accommodate any special requirement such as modifying the
simulated nodes to emulate RSoCs.
Using ShoX as a communication environment, a system of randomly distributed RSoCs
is built. The system has a random number, between 60 and 300, of heterogeneous
RSoCs. An RSoC has a random amount of limited power, FPGA area, and GPP capa-
bility. To support the adaptable OS service requirements, a middleware is developed
on every RSoC. The middleware has a control unit, a scheduler unit, and a resource
monitoring unit, see Section 3.2.1. These units are required to provide for the novel
OS service design which was introduced in Section 4.2.
When necessary, an OSR is made available to the distributed system. An OS is chosen
randomly to have between 50 and 150 services. Each OS service has two implemen-
tations. These implementations are partitioned into a number nof SESs as described
in Section 4.2. The number nof SESs is randomly chosen between 4and 50 SESs
for each OS service. Implementations of different OS services have a different number
93
7. METHODS EVALUATION
of SESs. Each SES contains descriptors about its own randomly chosen resource
requirements, see Section 8.1.1.
To ensure the evaluations reliability, each evaluation is repeatedly performed with dif-
ferent random systems. This means a new random number of RSoCs with each one
of them having its new random resource capabilities, a new random number of OS
services with each one of them being partitioned into a new random number of SESs,
and new demands and constraints.
7.1 Evaluating OS service configurations
In Chapter 5, we presented two main algorithms. The first one, Algorithm 5.2, is used
to find a suitable OS service configuration to run on an RSoC with limited resources
without violating any constraint. The second one, Algorithm 5.3, is used to find a
Pareto optimal OS service configuration. It assumes a requestor RSoC with abundant
resources to run any configuration. In this section, these two algorithms are evaluated.
7.1.1 OS service configurations under limited resources
Algorithm 5.2 finds a suitable configuration to execute an OS service on an RSoC that
has scarce resources without violating any constraint. For instance, let us assume that
a realtime application, residing on an RSoC, requests an OS service X. Suppose that
the application requires the OS service to process some data and provides a response in
Tres time. In such a case, Algorithm 5.2 has to find a suitable OS service configuration
that can deliver the processed data without missing the deadline Tres. In addition, the
algorithm has to take into account the RSoC’s current resource status.
In case of abundant RSoC resources, the algorithm has no problem in finding a suitable
OS service configuration. However, we want to find the limits where this algorithm
fails to provide a suitable OS service configuration. In order to do that, we enforce
each RSoC in the distributed system to have resources which are proportional to the
correspondent requested OS service optimal resource requirements.
For instance, let us consider the OS service which has its optimal configurations re-
source requirements presented in Table 7.1. This OS service has configurations which
can be executed with an optimal power consumption of 620 power units, an optimal
area usage of 555 area units, and an optimal time of 965 time units.
94
7.1 Evaluating OS service configurations
OS service resource requirements
OS service configuration Resource requirements
Area needed to execute the service totally in
software area
725
Area needed to execute the service totally in
hardware area
734
Configuration optimized for area
Area 555
Power 796
Execution Time 1294
Configuration optimized for power
Area 743
Power 620
Execution Time 1290
Configuration optimized for execution time
Area 725
Power 728
Execution Time 965
Table 7.1: Arbitrary example of OS service optimal resource requirements
To evaluate our algorithm, we assume that an application, residing on RSoCi, requests
this OS service. Then, the power resource value of RSoCiis forced to be proportional
to the power requirement of the requested OS service optimal power configuration
which is 620 power units. For example, let us assume that RSoCihas a power ratio
of 1.2in comparison to the requested OS service optimal power requirement. This
makes the RSoC power equal 744 power units. Similarly, the RSoC area is forced to
have an amount proportional to the optimal OS service area requirement which is 555
area units. In case of time, we assume that the application demanded the OS service
to respond in time Tres which is proportional to the OS service optimal execution time
which is 965 time units.
Having these constraints set, Algorithm 5.2 tries to find a suitable OS service con-
figuration which does not violate these constraints. The algorithm is evaluated with
different sets of constraints.
To obtain reliable evaluation results, each set of constraints is evaluated with 100 sim-
ulation run. Each simulation run has a new randomly generated distributed system and
new OS services.
Figure 7.1(a) shows the average number of the obtained OS service suitable config-
urations for different sets of constraints. The obtained results are averaged over 100
simulation runs. The different enforced power and time ratios are shown by the x-axis.
The y-axis shows the exponential part of the number of suitable configurations, where
95
7. METHODS EVALUATION
(a) Average number of found suitable OS service configurations for different limited RSoCs resource
ratios
(b) The probability of finding the number of suitable OS service configurations in Figure 7.1(a)
Figure 7.1: The evaluation of the found OS service suitable configurations under limited
RSoCs resources which are ratio comparable to the required resources of each optimal OS
service configuration
96
7.1 Evaluating OS service configurations
the number of suitable configurations equals 2y−axis. The evaluation is carried out with
different hardware and software areas for each similar power and time ratios. Figure
7.1(b) shows the probability of getting the number of configurations shown in Figure
7.1(a) for the different sets of constraints. In both charts in Figure 7.1, the time ratio
and the power ratio are set to be equal. Figure 7.2 shows the probability of getting a
number of suitable configurations for different response time and power ratios. How-
ever, the hardware area ratio and the software area ratio are set to be equal at each set
of constraints.
Figure 7.2: The probability of getting a number of configurations for variable power and
response time ratios. The RSoC hardware and software areas are set to to be equal to the
optimal OS service required area
Evaluation results
Having 100 different simulation runs for each set of constraints, the algorithm was able
to find suitable configurations to execute an OS service with scarce RSoC resources.
However, from Figure 7.1 we notice that to get an adequate number of suitable OS
service configurations with probability ≥0.5, the RSoC has to have an amount of
power with a ratio of at least 1.7of the requested OS service optimal estimated power,
0.8ratio of hardware and software area to that of the requested OS service optimal area,
and 1.7ratio of demanded response time to that of the requested OS service optimal
execution time.
Knowing this, the middleware can decide when to execute an OS service on one RSoC
and when to take other measures. These can be sharing resources with other RSoCs
or to execute the OS service totally on another RSoC. This result is very important, it
helps saving a lot of trial and error calculations. Using the results in this evaluation, a
decision can be made depending only on the comparison of the requested OS service
optimal values to that of what is currently available of the requester RSoC resources.
97
7. METHODS EVALUATION
7.1.2 OS service Pareto optimal configuration
When a requester RSoC has plenty of resources to run any of the requested OS service
configurations, our objective becomes finding a Pareto optimal OS service configura-
tion. A configuration which uses as minimal resources as possible, but runs as fast
as possible. This is different to what we have in a limited resource environment, as
our aim was to find a suitable OS configuration that does not violate any constraint.
However, in case of an environment with abundant resources, most of the OS ser-
vice configurations may happen to be suitable. In such a case, our objective would
be finding a configuration which avoids preemption as much as possible. As we do
not know which resource may cause preemption, due to application/task demands, we
try to minimize the usage of all resources, hence a Pareto optimal OS service con-
figuration. Such a configuration would allocate as much resources as possible to any
newly arriving application, therefore, reduces the chance of an application acquiring
resources used by the OS service configuration. Hence, minimizing the need of OS
service reconfiguration (preemption).
Algorithm 5.3 finds a Pareto optimal OS service configuration. To evaluate this algo-
rithm, we compare the chosen OS service configuration with all the configurations in
the search space. Because the search space may be very large, the algorithm limits
it by setting maximum and minimum threshold values for each resource, see Section
5.2.3.
There are three resources to be minimized. These are the OS service execution time,
the OS service estimated power consumption, and the OS service estimated area us-
age. The three resources, required by each explored OS service configuration, can
be represented as a point in a three dimensional space. Figure 7.3(a), shows a three
dimensional chart of the explored OS service configurations. The minimum and maxi-
mum values for each resource are shown in Figure 7.3(b). Each explored configuration
could be compared with the optimal resources by the minimum threshold as discussed
in Section 5.2.3. For example, the three optimal resource values could be considered as
an origin point. Each point representing an OS service configuration resource usage is
comparable to the other configurations by the distance from that configuration point to
the origin point. Figure 7.4 shows the normalized distances of the explored OS service
configurations to the OS service original point.
98
7.1 Evaluating OS service configurations
(a) A three dimensional representation of the required resources needed by an OS service ex-
plored configurations
(b) The minimum and maximum thresholds which limit the exploration
space to find a Pareto optimal OS service configuration
Figure 7.3: An OS service explored configurations and its Pareto optimal configuration
99
7. METHODS EVALUATION
Figure 7.4: The normalized distance between each explored OS service configuration
resource requirements and the minimum threshold resource values
Evaluation results
Algorithm 6.1 is able to find a Pareto optimal configuration of an OS service. The eval-
uation is done repeatedly with many simulation runs. The algorithm always returned a
Pareto optimal configuration in comparison with the other configurations. The chosen
result shown in this evaluation is randomly selected from these runs. The evaluation is
done by comparing each possible configuration resource to the optimal threshold val-
ues. A Pareto optimal configuration is chosen from the explored configurations with
the property of having its resources as close as possible to the optimal threshold values.
Because the algorithm explores all the configurations, the chosen configuration is not
a local optimal but a Pareto optimal global configuration.
7.2 OS service distribution
In this section we evaluate the distribution of the heterogeneous SESs over the dis-
tributed heterogeneous RSoCs as discussed in Section 6.2.1. During the evaluation, we
will assume an encoding that produces m+λ= 4 SESs. For that we use a fork with
four fingers to distribute each set of four SESs, see Section 6.2.1.
100
7.2 OS service distribution
7.2.1 Load balancing over fork distribution patterns
Algorithm 6.1 defines a set of fork distribution patterns FDP which are used for dis-
tribution. In our case, each pattern fdp ∈FDP consists of four RSoCs. In one
pattern, each RSoC has a virtual resource capacity. The virtual capacity determines
the maximum amount of SESs which can be allocated to one RSoC resources. All
RSoCs belonging to one pattern have the same virtual capacity which is defined by δ.
However, RSoCs in different patterns have different virtual capacities. The capacity
of a pattern c(fpd)is the sum of its four RSoCs virtual capacities. This means that
different patterns have different capacities.
Each fdsihas a resource load indicator <`idefined as the ratio between the used
resource by the allocated SESs and the total capacity of the resource. Every time,
four SESs are distributed to the fdp with the maximum positive value ϕwhich is
defined by Equation (7.1). Thereby, <`iis the resource load of fdpior RSoCiand
<`avg is the average resource load across the all fdp ∈FDP or all RSoCs ∈fdpias
given by Equation (7.2).
ϕ=<`avg − <`i(7.1)
<`avg =PN
i=1 <`i
N(7.2)
Figure 7.5 shows the load balancing of distributing SESs over a fork distribution
Figure 7.5: Five randomly selected simulation runs showing the distribution load balanc-
ing over fork distribution pattern set
pattern set FDP. Although the evaluation shows five randomly simulation runs, the
results are similar to all the repeatedly done simulation runs. Each simulation run
distributes a set of heterogeneous SESs over heterogeneous RSoCs. These RSoCs are
grouped in fork distribution patterns defined by FDPj. The load balancing criterion
was made based on Equation (7.2).
101
7. METHODS EVALUATION
Evaluation results
For each fork distribution pattern fdpi∈FDPj, denoted by the x-axis, the load per-
centage, denoted by the y-axis, represents the allocated resources capacity for the
loaded SESs to the total pattern capacity. As shown in Figure 7.5, the majority of
patterns have a similar load percentage. Nevertheless, some patterns have a slightly
higher load percentage than the others. This is expected because the distributed SESs
are heterogeneous.
The evaluated algorithm results in a load which is well balanced across the majority of
patterns with some patterns having an average divination of 6%.
7.2.2 Load balancing over RSoCs
Using Algorithm 6.1, a set of RSoCs may be shared by two or more fork distribution
patterns. Each fork distribution pattern containing RSoCiuses only a partial part of
its resources defined by the virtual resource capacity. The distribution is performed by
sending a set of SESs, containing m+λ= 4 SESs in our case, to the four RSoCs
fork distribution pattern. The choosing of a SESs set, a fork distribution pattern, and
the way the SESs are distributed inside a fork distribution pattern depends on our
balancing objective. The balancing objective can be concentrated on one resource, like
power or hardware area, or generalized to balance with regard to all the resources.
Depending on our balancing objectives, many parameters can be tuned to achieve that
goal.
Figure 7.6(a) shows the load balance with a load balancing objective concentrated on
the power consumption. The load percentage shown by the y-axis is the ratio between
the allocated resource amount and the RSoC total resource amount. In this case, the
distribution is balanced with regard to power as the highest priority balancing resource.
The other resources are also considered for balancing but with low priority. The SESs
are distributed over all the RSoCs in a way that makes the power consumption per-
centage similar on every RSoC. By power percentage we mean the ratio between the
allocated power requirements of the SESs on an RSoC and the total power on that
RSoC. On the other hand, the hardware and software area usage percentages are not
balanced. This can clearly be seen from Figure 7.6(b). The figure shows the deviation
of each resource, increasingly sorted, from the average load balance percentage of that
resource.
102
7.2 OS service distribution
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
1.6"
1" 4" 7" 10" 13" 16" 19" 22" 25" 28" 31" 34" 37" 40" 43" 46" 49" 52" 55" 58" 61" 64"
Resource(usage(percentage(
RSoC(ID(
Power"
Hardware"area"
So6ware"area"
Series4"
(a) Distributed SESs over RSoCs with power consumption as the priority resource in load balancing
0"
0.1"
0.2"
0.3"
0.4"
0.5"
0.6"
1" 3" 5" 7" 9" 11"13"15"17"19"21"23"25"27"29"31"33"35"37"39"41"43"45"47"49"51"53"55"57"59"61"63"65"67"69"
Devia&on)from)average)load)
balanceing)percentage)
RSoCs)
Power"
Hardware"area"
So5ware"area"
(b) The deviation of resources from the average load percentage when distributing with power resource
is chosen as the priority resource for load balancing in Figure 7.6(a)
Figure 7.6: SESs load balancing distribution with one prioritized resource (e.g. Power)
and the deviation of each RSoCs load from the average load balance percentage
103
7. METHODS EVALUATION
If we are considering the three resources to be balanced, a weighting function, such as
the one shown in (7.3), can be used.
W=pρ∗(Pmax −px)2+β∗(Hxmax −hx)2+α∗(Sxmax −sx)2(7.3)
Here Pmax is the maximum power consumption allowed across all entities of the same
type, pxis the power consumption for one entity, Hmax and Smax are the maximum
total Hardware and Software areas across all entities of the same type, hxand sxare
the total hardware and software areas for one entity, {ρ, β, and α}are heuristic real
numbers with each to be between 0and 1. These numbers reflect the priority of one
resource in comparison with the others. An entity type can be a SES, an RSoC, or a
fork distribution pattern.
Figure 7.7(a) shows the resource load percentage in every RSoC when load balancing is
performed using the above weighting function as a cost function. In Figure 7.7(b), the
deviation of each RSoC load percentage from the average load percentage is presented.
The deviations are sorted increasingly.
Evaluation results
As seen in Figure 7.6(a), when considering a single priority resource (e.g. power
consumption) for balancing, the algorithm successfully balances the resource load al-
located for SESs in all the RSoCs. This single priority resource load balancing is very
close to optimal. It has a low average deviation of 1% from optimal, see Figure 7.6(b).
Nevertheless, the deviation of the other resources differ very much from optimal, with
an average deviation from 10% to 30%.
On the other hand, the capacity function (7.3) can be used as a weighting method
to balance the three resources over all RSoCs, see Figure 7.7(a). This results in an
average deviation from optimal load balancing which varies from 3% to 9% for all the
resources.
The choosing of the balancing criterion depends on the system application domain. In
any way, the evaluated algorithm can be used to efficiently balance the SESs with
little deviation from optimal. At this point, it is important to keep in mind that we
are distributing SESs with heterogeneous resource requirements on heterogeneous
RSoCs. In many ways, this is very similar to the Knapsack problem which is known
to be an NP-hard problem.
The shown results are chosen randomly from many simulation runs. They reflect an
average case of load balancing. All the other simulation runs have produced similar, if
not better, evaluation results.
104
7.2 OS service distribution
0"
0.2"
0.4"
0.6"
0.8"
1"
1.2"
1.4"
1.6"
1" 4" 7" 10" 13" 16" 19" 22" 25" 28" 31" 34" 37" 40" 43" 46" 49" 52" 55" 58" 61" 64"
Resource(usage(percentage(
RSoC(ID(
Power"
Hardware"area"
So6ware"area"
Series4"
(a) Distributed SESs over RSoCs taking into mind the load balancing of all resources
0"
0.1"
0.2"
0.3"
0.4"
0.5"
0.6"
1" 3" 5" 7" 9" 11"13"15"17"19"21"23"25"27"29"31"33"35"37"39"41"43"45"47"49"51"53"55"57"59"61"63"65"67"69"
Devia&on)from)average)load)
balanceing)percentage)
RSoCs)
Power"
Hardware"area"
So5ware"area"
(b) The deviation of each RSoC load from the average load balancing percentage after load balancing
as in Figure 7.7(a)
Figure 7.7: SESs load balancing distribution considering all resource and the deviation
of each RSoCs load from the average load balance percentage
105
7. METHODS EVALUATION
7.3 Chapter conclusion
The algorithms presented in the previous chapter have been evaluated. The evaluation
was carried out using the ShoX simulator. Interesting results have been obtained. Such
results help deciding, in early stages, whether to execute an OS service on the requester
RSoC, on another RSoC, or on a collaborative RSoCs.
The algorithms used in distributing the heterogeneous OS services over the heteroge-
neous capacity RSoCs indeed did show very interesting evaluation results. Depending
on whether one criterion or multiple criterions are chosen, the algorithms efficiently
balance the load of the distributed OS service over the RSoCs.
106
CHAPTER 8
Case study
In this chapter, we describe the implementation of an encryption/decryption OS ser-
vice, see Section 8.2. This is provided as a proof of concept to what previously has been
presented. For the underlying support, a middleware was developed to synchronize the
SESs execution, providing for inter SESs communications, and running the devel-
oped algorithms. For managing the underlying computational elements (FPGA/GPP)
we have selected the ReconOS run-time environment.
8.1 Reconos
ReconOS [97,99,100] supports hardware and software co-design by introducing a uni-
form thread model for software and hardware. This is done by modifying an RTOS,
(e.g. eCos [44,106]), to integrate hardware interface components called the OS Inter-
face (OSIF). OSIF allows hardware modules to interact with the operating system
in a way similar to software threads. By doing this, ReconOS provides a layer where
threads communication can be synchronized across the hardware/software boundary.
Whenever a hardware thread is constructed, its user logic is connected to one OSIF.
This OSIF manages the interaction with the operating system.
The uniform thread model provides a relatively easy hardware/software co-design de-
velopment. It allows a transparent interaction between software and hardware. The
107
8. CASE STUDY
transparency is done to a level where a thread has no information of whether the other
correspondent thread is in hardware or in software. To achieve this transparency, every
hardware module has a delegate thread in software. This delegate shows the hardware
module as if it would be a software thread, see Figure 8.1. In this way, neither hard-
ware threads nor software threads are preferred. Moreover, this allows the usage of any
single processor schedular such as Earliest Deadline First (EDF), or Rate-monotonic
scheduling.
Although this seems a good choice, the benefits obtained from hardware are ignored at
scheduling time. Moreover, there is no possibility for resource management in terms of
choosing types of threads for best performance, minimum resource usage, or efficient
parallelism utilization.
Figure 8.1: The eCos kernel can not differentiate between software and hardware threads.
ReconOS uses delegates to allow for unified kernel thread handling.
8.1.1 SESs support
The triple DES case study, see Section 8.2, has been implemented based on the Re-
conOS library. To allow for the desired SESs execution management, see Section 4, a
108
8.1 Reconos
simple middleware has been developed. The role of the middleware is to manage and
to schedule the SESs and the inter SESs communication. The ReconOS execution
behavior has been modified to allow the middleware to take control of scheduling and
communication management.
8.1.2 SESs implementation
In this case study, SESs are implemented either in software or in hardware. To en-
able the inter SESs communication, a uniform interface using the eCos and ReconOS
application programming interface (API) has been adopted. The other specific param-
eters are passed to each SES using memory.
The software implemented SESs are created as thread functions using the eCos API.
cyg_thread_create( 16, // priority, not important
SES_s1_entry, // entry point
( cyg_addrword_t ) 1, // entry data
"SES_s1_ENC", // SES name
SES_s1_stack, // stack
STACK_SIZE, // stack size
&SES_s1_encryptor_handle, // SES handle
&SES_s1_encryptor // SES object
);
The hardware implemented SESs are created as ReconOS hardware delegate using
the ReconOS API.
reconos_hwthread_create( 16, // priority, not important
&SES_h1_encryptor_attr, // hardware SES attributes
0, // entry data (not needed)
"SES_h1_ENC", // SES name
SES_h1_encryptor_stack, // stack
STACK_SIZE, // stack size
&SES_h1_encryptor_handle, // SES handle
&SES_h1_encryptor // SES object
);
109
8. CASE STUDY
Each SES waits the START flag, which is triggered by the middleware, to begin
processing a memory data chunk which has its pointer sent along the START flag.
All the specific parameters are passed as a record at the beginning of the data chunk to
be processed. When finishing the execution, a SES sends a DONE done flag. The
processed data is saved in the memory data chunk. Note that the priority parameter
used in both APIs is not important. This is because the initiation and scheduling of
each SES is managed by the middleware.
In case of hardware implemented SESs, when a ReconOS delegate receives the START
flag, it sends a message to the corresponding OSIF to unblock the hardware module.
The OSIF consists of several modules such as the OSIF core, a master bus controller
and a FIFO manager, see Figure 8.2(a). On the other hand, if the hardware module
needs to communicate with the delegate, it will send a command to the OSIF. This
command is dealt with by the command decoder. The command decoder is a large
state machine, with one state for each available command.
In order for hardware implemented SESs to communicate with OSIF, each SES
hardware module has to be encapsulated within a synchronization state machine, see
Figure 8.2(b). The state machine defines five states as shown in the code below. These
states are used by OSIF to control the SES hardware module.
--other code
encryptor_j: SES_h1_hardware_model
generic map(
--generic map
)
Port map(
clk => clk,
reset=>reset,
o_RAMAddr =>RAMAddr, --memory address
o_RAMData =>o_RAMData, --memory write data
i_RAMData =>i_RAMData, --memory read data
o_RAMWE => o_RAMWE, --memory Write/Read
start =>SES_start, --start SES processing
done => SES_done
};
-- other code
--OS synchronization state machine
state_proc: process(clk, reset)
-- define any required variables
begin
110
8.1 Reconos
(a) The OSIF and its modules. The command decoder manages
all hardware thread interactions and relays data to the individual
bus attachments, external FIFOs, or the CPU.
(b) Example of an OS synchronization state machine. The state
transitions are executed under OSIF control, allowing individual
OS calls (e.g., sem_wait()) to block. The actual processing of
the thread occurs in the user logic
Figure 8.2: The schematic view of OSIF and the synchronization state machine in Re-
conOS [98]
111
8. CASE STUDY
if reset=’1’ then
-- reset reconos and any other signals
elsif rising_edge(clk) then
reconos_begin(o_osif, i_osif);
if reconos_ready(i_osif) then
case state is
when STATE_GET=>
--wait for any message from the delegate
--if yes go to STATE_READ
when STATE_READ=>
--read the memory data chunk to be processed
--then go to STATE_BEGIN_SES_PROCESSING
when STATE_BEGIN_SES_PROCESSING=>
--start SES processing then go to STATE_WAIT
SES_start<=’1’;
when STATE_WAIT=>
--wait until the SES sets the SES_done flag
--then go to STATE_WRITE
when STATE_WRITE=>
--write the processed data back to memory
--then go to STATE_PUT;
when STATE_PUT=>
--send the DONE message to the delegate
--then go to STATE_GET;
when others=>
-- go to STATE_GET;
end case;
end if;
end if;
end process;
The communication between the hardware module and the delegate can be done either
by shared memory or mailboxes. Both communication ways need the bus for com-
munication, see Figure 8.2(a). This may become a bottleneck as all OSIFs share the
same bus. Another offered possibility is communication via FIFO. This connects every
hardware thread to its "right" neighbor. This, however, is only useful if the SES on
the right is the immediate successor of the current SES.
112
8.2 Triple DES
8.2 Triple DES
The triple Data Encryption Standard (3DES) [80] is an encryption algorithm which
is used in many applications such as the electronic payment industry [47]. As a proof
of concept, we will assume an OS service which provides the 3DES encryption. The
triple DES can be developed as three DES successive blocks [11], each with a different
64-bit key. The data is first encrypted using the first key, decrypted with a second
key, and finally encrypted using a third key. The encryption algorithm is cipher =
EK3(DK2(EK1(data))).
Figure 8.3: The block diagram of the DES encryption algorithm
The DES block diagram is shown in 8.3. It consists of an initial permutation, sixteen
main rounds, and a final permutation. The 64-bit key is permutated and transformed to
give a new 48-bit key to each of the sixteen main DES rounds. The DES block can be
113
8. CASE STUDY
used for encryption and decryption. The only difference between the two operations is
the scheduling of the 48-bit keys. In encryption operation, the first transformed 48-bit
key is given to the first round, the second key to the second round, and so on. In the
decryption operation, this scheduling is reversed. The first key is given to the sixteenth
round, the second key to the fifteenth round, and so on.
As a proof of concept, the 3DES was implemented in both software (runs on PowerPC
GPP) and hardware (runs on FPGA). Each implementation is partitioned into six SESs
as shown in Figure 8.4. The partitioning is made as described in Section 4.2. This gives
us a total number of 26= 64 configurations. As an underlying platform, the Xilinx
VirtexTMII Pro kit was used. It consists of one FPGA containing inside it one dedicated
GPP core (PowerPC).
Figure 8.4: The 3DES OS service with two implementations, each partitioned into six
SESs
114
8.3 Results
Communication type Time
SESHW →Middleware 187 µs
SESSW →Middleware 68 µs
Middleware →SESHW 193 µs
Middleware →SESSW 76 µs
SESHW →SESSW 304 µs
SESSW →SESHW 302 µs
SESHW →SESHW 422 µs
SESSW →SESSW 184 µs
Table 8.1: Inter communication overhead
8.3 Results
As shown in Figure 8.4, each DES block is partitioned into two SESs. This means
that one DES block can be executed in 22= 4 ways or configurations, two successive
DES blocks in 24= 16 ways, and three successive DES blocks ( the 3DES) in
26= 64 ways. These configurations can be explored at run time using the algorithms
of Chapter 5. The chosen configuration depends on the constraints and criteria of the
requester, see Section 5.2.4. Concentrating on one DES block, we have four execution
configurations. These are: First and second SESs in hardware (h-h), first and second
SESs in software (s-s), first SES in software and second SES in hardware (s-h), or
first SES in hardware and second SES in software (h-s).
Table 8.1 shows the average inter-communication overhead for these configurations.
The inter-communication overhead is maximal when two SESs are implemented in
different computational elements (GPP or FPGA). However, in Table 8.1 we notice that
the inter communication overhead between two SESs in hardware is more than that of
two SESs in different architectures. This is no surprise because the hardware SESs
are communicating through the middleware which is implemented in software. The
SESHW →SESHW is actually SESHW →Software(middleware)→SESHW .
This can be improved by either allowing two SESs in one implementation to commu-
nicate directly but with a notification to the middleware or by implementing parts of
the middleware in different computational elements (e.g. in hardware and software) to
allow for fast communication.
The area allocated for each block is shown in Table 8.2. The allocated areas shown
include the area allocated for interfaces as well as the SES itself, (e.g. the OSIF
interface for hardware implemented SESs).
115
8. CASE STUDY
Hardware LUT Utilization
DES_SES_1_HW 4008 10%
DES_SES_2_HW 2815 14%
Software Memory utilization (bytes)
DES_SES_1_SW 8127
DES_SES_2_SW 6417
Table 8.2: SESs FPGA and memory utilization
This case study is further used to evaluate the use of pipelining as explained in Section
5.3.2. The middleware is used to explore the four execution configurations of the DES
block with and without pipelining. Figure 8.5 shows the execution time for the four
DES execution configurations when processing nine memory data chunks. Each data
chunk consists of seventeen 64-bits words. Both configurations are executed with and
without the use of pipelining.
Figure 8.5: A comparison between a non-pipelined execution of the four DES execution
configurations and a pipelined execution.
Using pipelining we have been able to improve the execution time of the configurations
with a maximum improvement percentage of 37% in the (h-h) configuration and a
minimum improvement percentage of 6% in the (s-h) configuration.
The improvement percentage gained by using the pipelining depends on two factors:
the level of the parallelism and the difference in execution time between SESs. In
architectures such as an FPGA, all the hardware implemented SESs can run in par-
allel without any delay imposed by other applications/tasks running on FPGA. In a
116
8.3 Results
pipelined execution of the (h-h) configuration, the second SES will process the first
data chunk while the first SES is processing the second data chunk. The execution of
the two SESs is completely parallel. The only delay comes from the fact that the sec-
ond SES can only process data that has already been processed by the first SES. This
delay depends on the execution time of first SES and the communication time bet-
ween the two SESs. From this we can obtain a maximum improvement in execution
time using pipelining if the first SES execution time plus communication time equals
the second SES execution time. In general, a maximum reduction in a partitioned OS
service’s execution time can be obtained using pipelining if the SESs of the chosen
execution configuration have the ability to execute in parallel on different data chunks
and
∀SESs ∈configuration,
T(SESi) + CT(SESi, SESj) = T(SESj), where
T(SESx)is the execution time of SESx, CT(SESx, SESy)is the communication
time between the two successive SESs,SESxand SESy, and j=i+ 1, j < n.
Due to the heterogeneity in execution time between SESs of the DES block we have
obtained a maximum reduction in execution time using pipelining in the (h-h) con-
figuration. On the other hand we find that the (h-s) configuration has the maximum
reduction percentage in execution time. This is because of both SESs can execute in
parallel, and because the execution time of the first SES (hardware implemented) is
less than half that of the software one. This allows the processing of two data chunks
in the first SES before the second SES finishes one data chunk.
Looking at the (s-h) configuration we notice that it suffers from the minimum reduction
in execution time when using pipelining. Although both SESs can execute in parallel,
the SES execution time limits any improvements. The first SES executing in software
requires time which nearly triples that one of the hardware SES. Because we can not
process any data chunk in the second SES unless it has already been processed by the
first SES, the pipelining technique shows little improvement.
Comparing the non-partitioned DES with the partitioned DES that uses no pipelining,
see Figure 8.6, we get the expected result of a non-partitioned DES being faster in both
hardware and software implementations than the partitioned non-pipelined DES due to
the added communication overhead. However, the pipelined partitioned DES requires
less execution time to process the data chunks compared to the non-partitioned DES in
both software and hardware implementations, see Figure 8.7.
Whether pipelining is used or not, or whether we have improvements in execution time
or not, the fact remains that partitioning gives us the flexibility in both execution time
117
8. CASE STUDY
Figure 8.6: A comparison between a partitioned non-pipelined execution of the four DES
execution configurations and a DES non-partitioned execution.
Figure 8.7: A comparison between a partitioned pipelined execution of the four DES
execution configurations and a DES non-partitioned execution.
118
8.4 Chapter conclusion
and resource usage. Combined with pipelining, this will give a worst case scenario that
has an execution time which is better than that of a non-partitioned service.
8.4 Chapter conclusion
OS services designed for distributed embedded systems can be partitioned and imple-
mented to run on different computational elements. This enables many useful charac-
teristics such as runtime adaptation to the dynamic change in resources and demands.
It also introduces an undesired inter communication overhead.
This chapter has presented the usage of the pipelining technique to utilize the paral-
lelism offered by RSoCs embedded systems in favor of reducing the effect of the inter
communication overhead. Depending on the level of parallelism and the heterogeneity
of the partitions, this may improve the WCET of the service by roughly 40%.
The 3DES encryption standard was introduced as a proof of concept. It is implemented
in hardware and software with each implementation partitioned into six SESs. This
case study shows the different possible configurations to run the 3DES and the im-
provements gained by using the pipelining technique.
The middleware used in the case study was implemented completely in software and
on top of ReconOS [100]. This increased the communication overhead as it has been
carried out completely through the middleware. To overcome this problem, parts of the
middleware could be implemented in hardware. Moreover, successive SESs imple-
mented in the same computational elements could be allowed to communicate directly
with notifications sent to the middleware.
119
8. CASE STUDY
120
CHAPTER 9
Conclusion and future directions
In this chapter, we summarize and conclude the work presented in this thesis. In addi-
tion, we outline the directions and the future intended work.
9.1 A brief summary
In a distributed RSoCs system, resources are limited and applications vary. In such
an environment, OS services can be designed to adapt themselves by changing their
resource requirements at run time. This is achieved by multiple implementations and
partitioning into SESs, see Section 4.2. This produces a number of possible config-
urations to run each OS service. An application residing on one RSoC can request an
OS service. Depending on the requester’s current resources and constraints, an OS ser-
vice configuration can be chosen. The process is performed using specialized heuristic
algorithms, see Section 5.2. Using these algorithms, a suitable OS service configura-
tion can be found even with relatively scarce resources, see Section 7.1. Based on the
evaluation and the presented analysis, it is possible to know, even before running the
algorithms, whether we can find a configuration for executing on one RSoC or whether
a sharing of resources between RSoCs is required.
In case of being able to execute on the requester RSoC, SESs of the chosen config-
uration will be migrated to that RSoC and the OS service execution is commenced.
121
9. CONCLUSION AND FUTURE DIRECTIONS
In case of distributed execution, the algorithms find the collection of RSoCs to exe-
cute within constraints, see Section 6.3. Having an RSoCs collaboration with suitable
configuration found, data will be migrated to begin the execution of the selected OS
service configuration. This process can briefly be summarized by Figure 9.1.
Figure 9.1: A brief summary of finding an adequate OS service configuration
To allow RSoCs collaboration, load balancing, and efficient execution routing, heuris-
tic algorithms were developed to distribute OS services over RSoCs. These algorithms
are used for efficient distribution of heterogeneous OS services partitions over the het-
erogeneous RSoCs. Moreover, they allow building of discovery and execution routing
records to efficiently find at runtime an adequate RSoC collaboration with the desired
OS service configuration.
To support this novel design of runtime adaptable OS services, each RSoC contains a
middleware. This middleware contains the algorithms and the proper units to manage
and maintain the RSoC resources, communications, and the OS services adaptation,
see Section 3.2.1.
The presented adaptation of OS services enables the full utilization of platforms with
heterogeneous computational elements. In addition to the contribution to a new OS ser-
122
9.2 Future directions
vice design, many of the presented methods and modules are adequate for applications
as well. This may even improve other related work such as in [113,98]. Moreover, this
work opens many interesting research directions as discussed in the following section.
9.2 Future directions
The introduced novel design of adaptable OS service depends on multiple implemen-
tations which are partitioned into SESs as described in Section 4.2. One approach to
automate this partitioning is to modify existing open source (C-to-FPGA) compilers
as in [65,58,56]. Another approach would be the generation of code using abstract
modules as in [28]. This automation of partitioning is with no doubt a key element to
ease the transfer of existing OS services to the adaptable design introduced in Section
4.2.
On the other hand, the distribution of heterogeneous SESs over heterogeneous RSoCs,
see Section 6.3, introduces another research direction regarding optimization and or-
ganization, see Section 9.2.1.
Because RSoCs consist of multiple computational elements, and the new design of
OS service provides for recovery techniques, online model checking can be integrated
with the middleware as discussed in Section 9.2.2.
9.2.1 Distribution and optimization
After distributing the SESs, another stage can be introduced. This stage is performed
at run time using profiling information obtained during the execution of services. This
profiling information includes statistics about the locations where each OS service
is being requested more frequently, and whether there exists a network congestion
involving draining resources of one RSoC more than the others. The optimization
aims to improve the performance and the efficiency of communication and execution.
Communication optimization
After performing the GFD, see Section 6.2.1. we end up with some kind of permu-
tation network for executing a service. Assuming we have chosen an encoding which
provides us with four alternatives for each SES, after executing one SES, the data
can be transferred to any of the four successors alternatives, see Figure 9.2.
123
9. CONCLUSION AND FUTURE DIRECTIONS
Figure 9.2: Example of the permutation execution network of one OS service
Although one suitable path may have been chosen in the routing and execution stage,
see Section 6.3, many other suitable paths may also exist. Following the same path
every time has many disadvantages, for example, it may cause the RSoCs across that
path to loose a lot of power. The communication optimization stage tries to avoid such
problems without violating any constraint.
Execution configurations optimization
After executions of a service, it may be found that one SES is located apart from other
SESs belonging to that service, or the requests to a service are coming from one place
and the service is located on a different one. This may be minimized by migrating
SESs of a service to the place where they are frequently requested. This is to be done
taking into account load balancing and updating the alternative and successor records
in each related SES, see Section 6.3.
124
9.2 Future directions
9.2.2 Online model checking and recovery
Due to the expected dynamic changes in applications/tasks, the resources or the con-
straints may change accordingly at runtime. This may lead to change/adjustment in
OS service configurations. Due to the sensitivity of the process, as this may be a
service requested by a real time application/task, dependability and fault tolerance of
the underlying operating system is highly expected. Due to the huge time and space
complexity, it is impossible to verify the safety properties of all the mndifferent im-
plementations of an OS service at the systems development phase. Therefore, online
model checking [1], [155] might be a good choice to improve the dependability and
fault tolerance of our distributed RSoC with adaptable OS services.
This online model checking can be done by generating a sufficiently precise abstract
model from safety-critical SESs code using approaches like in [68]. The abstract
model is an over-approximation of the concrete system. For every concrete state se-
quence, there exists a corresponding abstract state sequence. The relations between
concrete states and abstract states are defined by means of two functions: an abstrac-
tion function αmaps every set of concrete states to a corresponding abstract state; a
concretization function γmaps every abstract state to a set of concrete states that it
represents.
In this way, for each safety-critical SESiof a service, we can get the corresponding
abstract model [
SESias well as the abstraction function αiand concretization function
γi.
Model Checking Paradigm
Online Model Checking (MC) runs on an RSoC in parallel with the SESs to be
checked. The abstract model of the checked SES is explored with respect to the given
safety property. Since model checking is done online, the current (concrete) states of
the SESs execution can be monitored and reported to the model checker from time
to time. By mapping these concrete states to the corresponding abstract ones at a
model level through the abstraction function, the online model checker needs only to
explore such a partial state space reachable from these abstract (current) states. If this
partial abstract model is checked safe against the given property, then we have more
confidence to the safety of the actual execution trace. If an error is detected within the
partial state space, a recovery process will be triggered.
125
9. CONCLUSION AND FUTURE DIRECTIONS
Recovery Process
A recovery process is initiated by sending a recovery message from the model checker
to the RSoC middleware which coordinates processing of OS service execution, see
Figure 9.3. This can be the same RSoC where the model checker is being executed or
a different RSoC in a distributed system.
Figure 9.3: Recovering a possible fault in OS service execution
The recovery message contains information about the possible incorrect SES and the
current state of this SES. The middleware then evaluates a new suitable configuration
and if necessary acquires any missing SES from the OSR. Meanwhile, it raises a
stop signal to the incorrect service. The middleware afterwards initializes the new
configuration and redirects the further execution of the service to it. The middleware
also sends to the model checker a message to set a new model associated with the new
SES to be executed if necessary.
SESs, middleware, and model checking integration
For mapping abstract states to concrete states, the online model checker needs to com-
municate with SESs running on FPGA as well as on GPP. However, SESs with
the same behavior in one OS service have different implementations, hence they are
distinct in terms of execution time and resource usage. This makes it unrealistic to
synchronously schedule the communication between the model checker and a SES
to be checked, e.g., every Ttime. Fortunately, an event driven approach is a suitable
solution.
126
9.2 Future directions
At the design phase, some triggering point can be defined at which the communication
is initiated. The triggering points depend on the functional behavior and the states
of each OS service. This can be obtained by analyzing the source code. When the
communication is triggered, a decoded message with the required information is sent
to the model checker. These normally contain global data and conditional values.
Usually, the model checker is running on GPP. This eases data transferring between
SESs running on GPP and the model checker. Such communication can be achieved
by coping/accessing the address space of the SES or using shared memory. Where
SESs running on FPGA are to be considered for checking, another component (the
X-manager) could be introduced. The X-manager works as a complementary part to
the model checker. It consists of two parts: the H-manager which works on FPGA and
the S-manager which runs with the model checker (MC) on GPP, see Figure 9.4(a).
All the SESs communications involving reading or writing memory are done through
the X-manager. In doing so, the X-manager can monitor SESs and report back to the
model checker.
(a) An overview of an RSoC with one FPGA and GPP and
the communications between SESs, MC, and memory.
(b) An insight of the SES stages
and the communication with the X-
manager.
Figure 9.4: SESs, middleware, and model checker integration
The communication between a SES and the X-manager can be triggered by defining
check points after specified execution stages, see Figure 9.4(b). A stage is a logical
grouping of a SES code after which the MC is triggered. Any access to the mem-
ory from every stage is monitored by the X-manager. At the end of each stage, an
event will be sent to the X-manager. On receiving the event, the X-manager will pro-
vide the model checker with a snap shot of the memory just modified by the stage,
so far, executed of the SES to be checked. This minimizes the time needed to trans-
127
9. CONCLUSION AND FUTURE DIRECTIONS
fer data to the model checker. Thus, the model checker might have more chance to
run in pre-checking mode, i.e., looking ahead in the near future at the model level.
This is important as this allows the recovery process to happen without violating any
constraints.
128
Author’s Publications
[1] Franz J. Rammig, Yuhong Zhao, and Sufyan Samara. On-line model checking as
operating system service. In SEUS ’09: Proceedings of the 7th IFIP WG 10.2
International Workshop on Software Technologies for Embedded and Ubiquitous
Systems, pages 131–143, Berlin, Heidelberg, 2009. Springer-Verlag.
[2] Sufyan Samara. Partitioning granularity, communication overhead, and adaptation
in os services for distributed reconfigurable systems on chip. In The 13th IEEE
International Conferences on Computational Science and Engineering (CSE 2010
) / The 8th IEEE/IFIP International Conferences on Embedded and Ubiquitous
Computing (EUC 2010). IEEE Computer Society, 2010.
[3] Sufyan Samara and Fahad Bin Tariq. Os service optimization in a heterogeneous
distributed system on chip ( soc). In Workshop on Distributed Computing in Am-
bient Environments ( DiComAe) within the KI2009 Conference, 2009.
[4] Sufyan Samara, Dalimir Orfanus, and Peter Janacik. Towards biologically inspired
decentralized self-adaptive os services for distributed reconfigurable system on
chip (rsoc). ACM SIGBED Rev., 6(3):1–4, 2009.
[5] Sufyan Samara and Gunnar Schomaker. Self-adaptive os service model in relaxed
resource distributed reconfigurable system on chip (rsoc). Future Computing, Ser-
vice Computation, Cognitive, Adaptive, Content, Patterns, Computation World,
IEEE, 0:1–8, 2009.
[6] Sufyan Samara and Gunnar Schomaker. Real-time adaptation and load balancing
aware os services for distributed reconfigurable system on chip. Computer and
Information Technology, IEEE, 0:1743–1750, 2010.
[7] Sufyan Samara, Fahad Bin Tariq, Timo Kerstan, and Katharina Stahl. Applications
adaptable execution path for operating system services on a distributed reconfigur-
129
AUTHOR’S PUBLICATIONS
able system on chip. In ICESS ’09: Proceedings of the 2009 International Confer-
ence on Embedded Software and Systems, pages 461–466, Washington, DC, USA,
2009. IEEE Computer Society.
[8] Sufyan Samara, Yuhong Zhao, and Franz Rammig. Integrate online model check-
ing into distributed reconfigurable system on chip with adaptable os services. In
Distributed, Parallel and Biologically Inspired Systems, volume 329 of IFIP Ad-
vances in Information and Communication Technology, pages 102–113. Springer
Boston, 2010.
130
References
[9] Jason Agron and David Andrews. Building heterogeneous reconfigurable sys-
tems with a hardware microkernel. In CODES+ISSS ’09: Proceedings of the 7th
IEEE/ACM international conference on Hardware/software codesign and system
synthesis, pages 393–402, New York, NY, USA, 2009. ACM.
[10] Ali Ahmadinia, Christophe Bobda, Dirk Koch, Mateusz Majer, and Jürgen Teich.
Task scheduling for heterogeneous reconfigurable computers. In SBCCI ’04: Pro-
ceedings of the 17th symposium on Integrated circuits and system design, pages
22–27, New York, NY, USA, 2004. ACM.
[11] L. Ali, N.A.M Yunus, H. Jaafar, R. Wagiran, and E. Low. Implementation of
triple data encryption algorithm using vhdl. In IEEE International Conference on
Semiconductor Electronics (ICSE), 2004.
[12] Altera. www.altera.com.
[13] David Andrews, Douglas Niehaus, Razali Jidin, Michael Finley, Wesley Peck,
Michael Frisbie, Jorge Ortiz, Ed Komp, and Peter Ashenden. Programming models
for hybrid fpga-cpu computational components: A missing link. IEEE Micro,
24(4):42–53, 2004.
[14] Péter Arató, Zoltán Ádám Mann, and András Orbán. Algorithmic aspects of hard-
ware/software partitioning. ACM Trans. Des. Autom. Electron. Syst., 10(1):136–
156, 2005.
[15] Atmel. www.atmel.com.
[16] S. Baskiyar, Ph. D, and N. Meghanathan. A survey of contemporary real-time
operating systems. informatica 29:233 ˝
U240, 2005.
131
REFERENCES
[17] Th. Benner, Rolf Ernst, Ingo Könenkamp, Ulrich Holtmann, P. Schüler, H.-C.
Schaub, and N. Serafimov. Fpga based prototyping for verification and evaluation
in hardware-software cosynthesis. In FPL ’94: Proceedings of the 4th Interna-
tional Workshop on Field-Programmable Logic and Applications, pages 251–258,
London, UK, 1994. Springer-Verlag.
[18] Brian N. Bershad, Craig Chambers, Susan Eggers, Chris Maeda, Dylan Mc-
Namee, Przemys law Pardyak, Stefan Savage, and Emin Gün Sirer. Spin—an
extensible microkernel for application-specific operating system services. SIGOPS
Oper. Syst. Rev., 29(1):74–77, 1995.
[19] Danilo Beuche, Abdelaziz Guerrouat, Holger Papajewski, Wolfgang Schröder-
Preikschat, Olaf Spinczyk, and Ute Spinczyk. The pure family of object-oriented
operating systems for deeply embedded systems. In ISORC ’99: Proceedings of
the 2nd IEEE International Symposium on Object-Oriented Real-Time Distributed
Computing, page 45, Washington, DC, USA, 1999. IEEE Computer Society.
[20] Nina T. Bhatti, Matti A. Hiltunen, Richard D. Schlichting, and Wanda Chiu. Coy-
ote: a system for constructing fine-grain configurable communication services.
ACM Trans. Comput. Syst., 16(4):321–366, 1998.
[21] Shah Bhatti, James Carlson, Hui Dai, Jing Deng, Jeff Rose, Anmol Sheth, Brian
Shucker, Charles Gruenwald, Adam Torgerson, and Richard Han. Mantis os: an
embedded multithreaded operating system for wireless micro sensor platforms.
Mob. Netw. Appl., 10(4):563–579, 2005.
[22] Christophe Bobda. Introduction to Reconfigurable Computing: Architectures,
Algorithms, and Applications. Springer Publishing Company, Incorporated, 2007.
[23] Scott A. Brandt and Gary J. Nutt. Flexible soft real-time processing in middle-
ware. Real-Time Syst., 22(1/2):77–118, 2002.
[24] G. Brebner. The swappable logic unit: a paradigm for virtual hardware. In FCCM
97: Proceedings of the 5th IEEE Symposium on FPGA-Based Custom Computing
Machines, page 77, Washington, DC, USA, 1997. IEEE Computer Society.
[25] Gordon Brebner. A virtual hardware operating system for the xilinx xc6200.
In Reiner Hartenstein and Manfred Glesner, editors, Field-Programmable Logic
Smart Applications, New Paradigms and Compilers, volume 1142 of Lecture Notes
in Computer Science, pages 327–336. Springer Berlin / Heidelberg, 1996.
132
REFERENCES
[26] André Brinkmann, Sascha Effert, Friedhelm Meyer auf der Heide, and Christian
Scheideler. Dynamic and redundant data placement. In 27th IEEE International
Conference on Distributed Computing Systems (ICDCS 2007), Toronto, Canada,
25 - 29 June 2007.
[27] André Brinkmann, Kay Salzwedel, and Christian Scheideler. Compact, adaptive
placement schemes for non-uniform distribution requirements. In Proc. of the 14th
ACM Symposium on Parallel Algorithms and Architectures (SPAA), pages 53–62,
Winnipeg, Manitoba, Canada, 11 - 13 August 2002.
[28] Lisane B. Brisolara, Marcio F. S. Oliveira, Ricardo Redin, Luis C. Lamb, Luigi
Carro, and Flavio Wagner. Using uml as front-end for heterogeneous software
code generation strategies. In DATE ’08: Proceedings of the conference on Design,
automation and test in Europe, pages 504–509, New York, NY, USA, 2008. ACM.
[29] Dov Bulka and David Mayhew. Efficient C++: performance programming tech-
niques. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2000.
[30] Roy H. Campbell, Nayeem Islam, David Raila, and Peter Madany. Designing
and implementing choices: an object-oriented system in c++. Commun. ACM,
36(9):117–126, 1993.
[31] Rita Yu Chen, Robert M. Owens, Mary Jane Irwin, R. S. Bajwa, and Raminder S.
Bajwa. Validation of an architectural level power analysis technique. In DAC ’98:
Proceedings of the 35th annual Design Automation Conference, pages 242–245,
New York, NY, USA, 1998. ACM.
[32] Pong P. Chu. RTL Hardware Design Using VHDL: Coding for Efficiency, Porta-
bility, and Scalability. Wiley-IEEE Press, 2006.
[33] Antoine Colin and Guillem Bernat. Scope-tree: A program representation for
symbolic worst-case execution time analysis. In ECRTS ’02: Proceedings of
the 14th Euromicro Conference on Real-Time Systems, page 50, Washington, DC,
USA, 2002. IEEE Computer Society.
[34] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein.
Introduction to Algorithms, Third Edition. The MIT Press, 2009.
[35] Klaus Danne and Sven Stuehmeier. Off-line placement of tasks onto reconfigur-
able hardware considering geometrical task variants, 2005.
133
REFERENCES
[36] B.P. Dave, G. Lakshminarayana, and N.K. Jha. Cosyn: Hardware-software co-
synthesis of heterogeneous distributed embedded systems. Very Large Scale Inte-
gration (VLSI) Systems, IEEE Transactions on, 7(1):92 –104, mar. 1999.
[37] Robert P. Dick and Niraj K. Jha. Cords: hardware-software co-synthesis of re-
configurable real-time distributed embedded systems. In ICCAD ’98: Proceedings
of the 1998 IEEE/ACM international conference on Computer-aided design, pages
62–67, New York, NY, USA, 1998. ACM.
[38] Robert P. Dick and Niraj K. Jha. Cowls: Hardware-software co-synthesis of dis-
tributed wireless low-power embedded client-server systems. In VLSID ’00: Pro-
ceedings of the 13th International Conference on VLSI Design, page 114, Wash-
ington, DC, USA, 2000. IEEE Computer Society.
[39] R.P. Dick and N.K. Jha. Mogac: a multiobjective genetic algorithm for hardware-
software cosynthesis of distributed embedded systems. Computer-Aided Design
of Integrated Circuits and Systems, IEEE Transactions on, 17(10):920 –935, oct.
1998.
[40] Florian Dittmann. Methods to Exploit Reconfigurable Fabrics. PhD thesis, Uni-
versity of Paderborn, 2007.
[41] Carsten Ditze. Towards Operating System Synthesis. PhD thesis, Faculty of
Computer Science, Electrical Engineering, and Mathematics, Paderborn Univer-
sity, 1999.
[42] Adam Dunkels, Bjorn Gronvall, and Thiemo Voigt. Contiki - a lightweight and
flexible operating system for tiny networked sensors. In LCN ’04: Proceedings
of the 29th Annual IEEE International Conference on Local Computer Networks,
pages 455–462, Washington, DC, USA, 2004. IEEE Computer Society.
[43] Klaus Ecker, David Juedes, Lonnie Welch, David Chelberg, Carl Bruggeman,
Frank Drews, David Fleeman, David Parrott, and Barbara Pfarr. An optimization
framework for dynamic, distributed real-time systems. In IPDPS ’03: Proceedings
of the 17th International Symposium on Parallel and Distributed Processing, page
111.2, Washington, DC, USA, 2003. IEEE Computer Society.
[44] eCos. http://ecos.sourceware.org/. visited on, August 2010.
[45] Giovanni De Micheli (Editor), Rolf Ernst (Editor), and Wayne Wolf (Editor).
Readings in hardware/software co-design. Kluwer Academic Publishers, Norwell,
MA, USA, 2002.
134
REFERENCES
[46] Petru Eles, Zebo Peng, Krzysztof Kuchcinski, and Alexa Doboli. Sys-
tem level hardware/software partitioning based on simulated annealing and
tabu search. Design Automation for Embedded Systems, 2:5–32, 1997.
10.1023/A:1008857008151.
[47] EMV. EMV 4.2 Specifications, volume Book 2 of version 4.2. EMV, security and
key management edition, June 2008.
[48] Frank Engel, Ihor Kuz, Stefan M. Petters, and Sergio Ruocco. Operating systems
on socs: A good idea?, 2004.
[49] D. R. Engler, M. F. Kaashoek, and J. O’Toole, Jr. Exokernel: an operating sys-
tem architecture for application-level resource management. In SOSP ’95: Pro-
ceedings of the fifteenth ACM symposium on Operating systems principles, pages
251–266, New York, NY, USA, 1995. ACM.
[50] Andreas Ermedahl. A Modular Tool Architecture for Worst-Case Execution Time
Analysis. VDM Verlag, Saarbrücken, Germany, Germany, 2008.
[51] Rolf Ernst, Jorg Henkel, and Thomas Benner. Hardware-software cosynthesis for
microcontrollers. IEEE Des. Test, 10(4):64–75, 1993.
[52] Anand Eswaran, Anthony Rowe, and Raj Rajkumar. Nano-rk: An energy-aware
resource-centric rtos for sensor networks. In RTSS ’05: Proceedings of the 26th
IEEE International Real-Time Systems Symposium, pages 256–265, Washington,
DC, USA, 2005. IEEE Computer Society.
[53] Gui-Liang Feng, Robert H. Deng, Feng Bao, and Jia-Chen Shen. New efficient
mds array codes for raid part i: Reed-solomon-like codes for tolerating three disk
failures. IEEE Trans. Comput., 54(9):1071–1080, 2005.
[54] David Fleeman, M. Gillen, A. Lenharth, M. Delaney, L. Welch, D. Juedes, and
C. Liu. Quality-based adaptive resource management architecture (qarma): A
corba resource management service. Parallel and Distributed Processing Sympo-
sium, International, 3:116b, 2004.
[55] Bryan Ford, Godmar Back, Greg Benson, Jay Lepreau, Albert Lin, and Olin
Shivers. The flux oskit: a substrate for kernel and language research. In SOSP ’97:
Proceedings of the sixteenth ACM symposium on Operating systems principles,
pages 38–51, New York, NY, USA, 1997. ACM.
[56] FPGAC. http://sourceforge.net/projects/fpgac/, visited September 2010.
135
REFERENCES
[57] L. Fernando Friedrich, John Stankovic, Marty Humphrey, Michael Marley, and
John Haskins. A survey of configurable, component-based operating systems for
embedded applications. IEEE Micro, 21(3):54–68, 2001.
[58] Jan Frigo, Maya Gokhale, and Dominique Lavenier. Evaluation of the streams-
c c-to-fpga compiler: an applications perspective. In FPGA ’01: Proceedings of
the 2001 ACM/SIGDA ninth international symposium on Field programmable gate
arrays, pages 134–140, New York, NY, USA, 2001. ACM.
[59] Wenyin Fu and Katherine Compton. An execution environment for reconfigur-
able computing. In FCCM ’05: Proceedings of the 13th Annual IEEE Symposium
on Field-Programmable Custom Computing Machines, pages 149–158, Washing-
ton, DC, USA, 2005. IEEE Computer Society.
[60] Eran Gabber, Christopher Small, John Bruno, José Brustoloni, and Avi Silber-
schatz. The pebble component-based operating system. In ATEC ’99: Proceedings
of the annual conference on USENIX Annual Technical Conference, pages 20–20,
Berkeley, CA, USA, 1999. USENIX Association.
[61] Daniel D. Gajski, Frank Vahid, Sanjiv Narayan, and Jie Gong. Specification and
design of embedded systems. Prentice-Hall, Inc., Upper Saddle River, NJ, USA,
1994.
[62] D.D. Gajski, F. Vahid, S. Narayan, and Jie Gong. Specsyn: an environment sup-
porting the specify-explore-refine paradigm for hardware/software system design.
Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, 6(1):84 –100,
mar. 1998.
[63] Philip Garcia, Katherine Compton, Michael Schulte, Emily Blem, and Wenyin
Fu. An overview of reconfigurable hardware in embedded systems. EURASIP J.
Embedded Syst., 2006(1):13–13, 2006.
[64] Samuel Garcia and Bertrand Granado. Ollaf: A fine grained dynamically recon-
figurable architecture for os support. EURASIP Journal on Embedded Systems,
2009, 2009.
[65] Maya B. Gokhale, Janice M. Stone, Jeff Arnold, and Mirek Kalinowski. Stream-
oriented fpga computing in the streams-c high level language. In FCCM ’00:
Proceedings of the 2000 IEEE Symposium on Field-Programmable Custom Com-
puting Machines, page 49, Washington, DC, USA, 2000. IEEE Computer Society.
136
REFERENCES
[66] Michael I. Gordon, William Thies, and Saman Amarasinghe. Exploiting coarse-
grained task, data, and pipeline parallelism in stream programs. In ASPLOS-XII:
Proceedings of the 12th international conference on Architectural support for pro-
gramming languages and operating systems, pages 151–162, New York, NY, USA,
2006. ACM.
[67] Marcelo Götz. Run-time Reconfigurable RTOS for Reconfigurable Systems-on-
Chip. PhD thesis, University of Paderborn, 2007.
[68] Susanne Graf and Hassen Saïdi. Construction of abstract state graphs with pvs.
In CAV ’97: Proceedings of the 9th International Conference on Computer Aided
Verification, pages 72–83, London, UK, 1997. Springer-Verlag.
[69] Plerre-P Grassé. La reconstruction du nid et les coordinations interindividuelles
chezbellicositermes natalensis etcubitermes sp. la théorie de la stigmergie: Essai
d’interprétation du comportement des termites constructeurs. Insectes Sociaux,
6(1):41–80, March 1959.
[70] Jan Gustafsson and Andreas Ermedahl. Experiences from applying wcet analysis
in industrial settings. In ISORC ’07: Proceedings of the 10th IEEE International
Symposium on Object and Component-Oriented Real-Time Distributed Comput-
ing, pages 382–392, Washington, DC, USA, 2007. IEEE Computer Society.
[71] Christian Haubelt, Dirk Koch, and Jürgen Teich. Basic os support for distributed
reconfigurable hardware, 2003.
[72] Christian Haubelt, Dirk Koch, and Jürgen Teich. Reconet: Modeling and imple-
mentation of fault tolerant distributed reconfigurable hardware, 2003.
[73] Scott Hauck and Andre DeHon. Reconfigurable Computing: The Theory and
Practice of FPGA-Based Computation. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 2007.
[74] Hayden Kwok hay So, Borph An, Operating System, Fpga based Reconfigurable,
and Hayden Kwok hay So. Borph: An operating system for fpgabased reconfig-
urable computers. Technical report, UNIVERSITY OF CALIFORNIA, BERKE-
LEY, 2007.
[75] Johannes Helander and Alessandro Forin. Mmlite: a highly componentized sys-
tem architecture. In EW 8: Proceedings of the 8th ACM SIGOPS European work-
shop on Support for composing distributed applications, pages 96–103, New York,
NY, USA, 1998. ACM.
137
REFERENCES
[76] Dan Hildebrand. An architectural overview of qnx. In Proceedings of the Work-
shop on Micro-kernels and Other Kernel Architectures, pages 113–126, Berkeley,
CA, USA, 1992. USENIX Association.
[77] Mike Hinchey, Bernd Kleinjohann, Lisa Kleinjohann, Peter Lindsay, Franz J.
Rammig, Jon Timmis, and Marilyn Wolf. Distributed, Parallel and Biologically
Inspired Systems. Springer, 2010.
[78] George J. Y. Hsu, Pao long Chang, and Tser yieth Chen. Various methods for
estimating power outage costs : Some implications and results in taiwan. Energy
Policy, 22(1):69 – 74, 1994.
[79] Xilinx Inc. Virtex-II Pro / Virtex-II Pro X Complete Data Sheet (All four modules).
San Jose, CA, USA, 4.7 edition, 2007.
[80] ISO/IEC. 18033-3:2005 information technology ˚
U security techniques ˚
U encryp-
tion algorithms. Part 3: Block ciphers, 2008.
[81] Peter Janacik, Odej Kao, and Ulf Rerrer. An approach combining routing and
resource sharing in wireless ad hoc networks using swarm-intelligence. In Pro-
ceedings of the 7th ACM/IEEE International Symposium on Modeling, Analysis
and Simulation of Wireless and Mobile Systems (MSWiM 2004), 2004. Poster ses-
sion.
[82] Jbed. Jbed whitepaper: Component software and real-time computing. Technical
report, Oberon Microsystems, 1998.
[83] Jorjeta G. Jetcheva and David B. Johnson. Adaptive demand-driven multicast
routing in multi-hop wireless ad hoc networks. In MobiHoc ’01: Proceedings of
the 2nd ACM international symposium on Mobile ad hoc networking & computing,
pages 33–44, New York, NY, USA, 2001. ACM.
[84] Jini.org. online, accessed September, 2010.
[85] A. Kalavade and P.A. Subrahmanyam. Hardware/software partitioning for mul-
tifunction systems. Computer-Aided Design of Integrated Circuits and Systems,
IEEE Transactions on, 17(9):819 –837, sep. 1998.
[86] Asawaree Kalavade and Edward A. Lee. A hardware-software codesign method-
ology for dsp applications. IEEE Des. Test, 10(3):16–28, 1993.
[87] Heiko Kalte and Mario Porrmann. Context saving and restoring for multitasking
in reconfigurable systems. In 15th International Conference on Field Programma-
ble Logic and Applications, pages 223–228, August 2005.
138
REFERENCES
[88] David Karger, Eric Lehman, Tom Leighton, Matthew Levine, Daniel Lewin, and
Rina Panigrahy. Consistent hashing and random trees: Distributed caching pro-
tocols for relieving hot spots on the World Wide Web. In Proceedings of the
Twenty-Ninth Annual ACM Symposium on Theory of Computing, pages 654–663,
El Paso, Texas, 4–6 May 1997.
[89] Felipe Klein, Guido Araujo, Rodolfo Azevedo, Roberto Leao, and Luiz C. V. Dos
Santos. On the limitations of power macromodeling techniques. VLSI, IEEE Com-
puter Society Annual Symposium on, 0:395–400, 2007.
[90] P.V. Knudsen and J. Madsen. Pace: a dynamic programming algorithm for hard-
ware/software partitioning. In International Workshop on Hardware-Software Co-
Design, pages 85 –92, mar. 1996.
[91] Tim Kogel and Heinrich Meyr. Heterogeneous mp-soc: the solution to energy-
efficient signal processing. In DAC ’04: Proceedings of the 41st annual Design
Automation Conference, pages 686–691, New York, NY, USA, 2004. ACM.
[92] Fabio Kon, Roy H. Campbell, M. D Mickunas, and Klara Nahrstedt. 2k: A
distributed operating system for dynamic heterogeneous environments. Technical
report, University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1999.
[93] Fabio Kon, Ashish Singhai, Roy H. Campbell, Dulcineia Carvalho, and Robert
Moore. 2k: A dynamic, component-based operating system for rapidly chang-
ing environments. Technical report, University of Illinois at Urbana-Champaign,
Champaign, IL, USA, 1998.
[94] Johannes Lessmann, Tales Heimfarth, and Peter Janacik. Shox: An easy to
use simulation platform for wireless networks. In UKSIM ’08: Proceedings of
the Tenth International Conference on Computer Modeling and Simulation, pages
410–415, Washington, DC, USA, 2008. IEEE Computer Society.
[95] Zhiyuan Li and Scott Hauck. Configuration prefetching techniques for partial
reconfigurable coprocessor with relocation and defragmentation. In FPGA ’02:
Proceedings of the 2002 ACM/SIGDA tenth international symposium on Field-
programmable gate arrays, pages 187–195, New York, NY, USA, 2002. ACM.
[96] S. M. Loo, B. E. Wells, and R. K. Gaede. Exploring the hardware/software con-
tinuum in a computer engineering capstone design class using fpga-based pro-
grammable logic. In MSE ’01: Proceedings of the 2001 International Conference
on Microelectronic Systems Education (MSE’01), page 36, Washington, DC, USA,
2001. IEEE Computer Society.
139
REFERENCES
[97] E. Lubbers and M. Planner. Reconos: An rtos supporting hard-and soft-
ware threads. In In Proceedings of the 17th International Conference on Field-
Programmable Logic and Applications, pages 441 –446, aug. 2007.
[98] Enno Lübbers. Multithreaded Programming and Execution Models for Reconfig-
urable Hardware. Phd thesis, University of Paderborn, 2010.
[99] Enno Lübbers and Marco Platzner. Communication and Synchronization in Mul-
tithreaded Reconfigurable Computing Systems. In Proceedings of the 8th Inter-
national Conference on Engineering of Reconfigurable Systems and Algorithms
(ERSA’08). CSREA Press, July 2008.
[100] Enno Lübbers and Marco Platzner. Reconos: Multithreaded programming for
reconfigurable computers. ACM Trans. Embed. Comput. Syst., 9(1):1–33, 2009.
[101] Roman Lysecky and Frank Vahid. A study of the speedups and competitive-
ness of fpga soft processor cores using dynamic hardware/software partitioning.
In DATE ’05: Proceedings of the conference on Design, Automation and Test in
Europe, pages 18–23, Washington, DC, USA, 2005. IEEE Computer Society.
[102] J. Madsen, J. Grode, P.V. Knudsen, M.E. Petersen, and A. Haxthausen. Ly-
cos: the lyngby co-synthesis system. Design Automation for Embedded Systems,
2:195–235, 1997. 10.1023/A:1008884219274.
[103] Alessandro Di Marco, Giovanni Chiola, and Giuseppe Ciaccio. Using a giga-
bit ethernet cluster as a distributed disk array with multiple fault tolerance. In
LCN ’03: Proceedings of the 28th Annual IEEE International Conference on Lo-
cal Computer Networks, page 605, Washington, DC, USA, 2003. IEEE Computer
Society.
[104] T. Marescaux, V. Nollet, J.-Y. Mignolet, A. Bartic, W. Moffat, P. Avasare, P. Co-
ene, D. Verkest, S. Vernalde, and R. Lauwereins. Run-time support for heteroge-
neous multitasking on reconfigurable socs. Integr. VLSI J., 38(1):107–130, 2004.
[105] Peter Marwedel. Embedded System Design. Kluwer Academic Publishers,
2003.
[106] Anthony Massa. Embedded Software Development with eCos. Prentice Hall
Professional Technical Reference, 2002.
[107] Clive Maxfield. The Design Warrior’s Guide to FPGAs. Academic Press, Inc.,
Orlando, FL, USA, 2004.
140
REFERENCES
[108] Mario Mense and Christian Scheideler. Spread: An adaptive scheme for re-
dundant and fair storage in dynamic heterogeneous storage systems. In 19th
ACM-SIAM Symposium on Discrete Algorithms (SODA), San Francisco, Califor-
nia, USA, 20.-22. Febr.„ January 2008.
[109] Michael Mitzenmacher. The power of two choices in randomized load balanc-
ing. IEEE Transactions on Parallel and Distributed Systems, 12(10):1094–1104,
2001.
[110] David Naylor and S. Jones. VHDL: A Logic Synthesis Approach. Chapman &
Hall, Ltd., London, UK, UK, 1997.
[111] V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauwereins. Designing an
operating system for a heterogeneous reconfigurable soc. In Proceedings of the
RAW’03 workshop, 2003.
[112] null Bo Zhou, null Weidong Qiu, null Yan Chen, and null Chenglian Peng.
Shum-ucos: A rtos using multi-task model to reduce migration cost between
sw/hw tasks. International Conference on Computer Supported Cooperative Work
in Design, 2:984–989 Vol. 2, 2005.
[113] Simon Oberthür. Towards an RTOS for Self-ptimizing Mechatronic Systems.
PhD thesis, University of Paderborn, 2009.
[114] L. Parker. Current research in multirobot systems. Artificial Life and Robotics,
7:1–5, 2003. 10.1007/BF02480877.
[115] David A. Patterson, Garth Gibson, and Randy H. Katz. A case for redundant
arrays of inexpensive disks (raid). In SIGMOD ’88: Proceedings of the 1988 ACM
SIGMOD international conference on Management of data, pages 109–116, New
York, NY, USA, 1988. ACM Press.
[116] Wesley Peck, Erik Anderson, Jason Agron, Jim Stevens, Fabrice Baijot, and
David L. Andrews. Hthreads: A computational model for reconfigurable devices.
In FPL, pages 1–4, 2006.
[117] Charles E. Perkins and Pravin Bhagwat. Highly dynamic destination-sequenced
distance-vector routing (dsdv) for mobile computers. SIGCOMM Comput. Com-
mun. Rev., 24(4):234–244, 1994.
[118] Charles E. Perkins and Elizabeth M. Royer. Ad-hoc on-demand distance vector
routing. Mobile Computing Systems and Applications, IEEE Workshop on, 0:90,
1999.
141
REFERENCES
[119] J. S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like
systems. Software – Practice & Experience, 27(9):995–1012, September 1997.
[120] James S. Plank. The raid-6 liberation codes. In FAST’08: Proceedings of the
6th USENIX Conference on File and Storage Technologies, pages 1–14, Berkeley,
CA, USA, 2008. USENIX Association.
[121] Nachiketh R. Potlapally, Michael S. Hsiao, Anand Raghunathan, Ganesh Lak-
shminarayana, and Srimat T. Chakradhar. Accurate power macro-modeling tech-
niques for complex rtl circuits. VLSI Design, International Conference on, 0:235,
2001.
[122] S. Raaijmakers and S. Wong. Run-time partial reconfiguration for removal,
placement and routing on the virtex-ii pro. In International Conference on Field
Programmable Logic Applications, pages 679 –683, aug. 2007.
[123] R. K. Raval, C. H. Fernandez, and C. J. Bleakley. Low-power tinyos tuned
processor platform for wireless sensor network motes. ACM Trans. Des. Autom.
Electron. Syst., 15(3):1–17, 2010.
[124] Adi Mallikarjuna V. Reddy, A.V.U. Phani Kumar, D. Janakiram, and G. Ashok
Kumar. Wireless sensor network operating systems: a survey. Int. J. Sen.
Netw., 5(4):236–255, 2009.
[125] Javier Resano, Daniel Mozos, Diederik Verkest, and Francky Catthoor. A re-
configuration manager for dynamically reconfigurable hardware. IEEE Des. Test,
22(5):452–460, 2005.
[126] Achim Rettberg. Low-power Driven High Level Synthesis for Dedicated Archi-
tectures. PhD thesis, Paderborn University, January 2007.
[127] M. Saldana, D. Nunes, E. Ramalho, and P. Chow. Configuration and program-
ming of heterogeneous multiprocessors on a multi-fpga system using tmd-mpi. Re-
configurable Computing and FPGAs, International Conference on, 0:1–10, 2006.
[128] Tom Saulpaugh and Charles Mirho. The Java OS Design and Architecture.
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1999.
[129] Christian Schindelhauer and Gunnar Schomaker. Weighted distributed hash ta-
bles. In Proc. of the 17th ACM Symposium on Parallelism in Algorithms and
Architectures (SPAA), to appear, 2005.
142
REFERENCES
[130] Roy Shea, Simon Han, and Ram Rengaswamy. Motivations behind sos. Tech-
nical report, University of California Los Angeles, 2004.
[131] Seng Lin Shee and Sri Parameswaran. Design methodology for pipelined het-
erogeneous multiprocessor system. In DAC ’07: Proceedings of the 44th an-
nual Design Automation Conference, pages 811–816, New York, NY, USA, 2007.
ACM.
[132] Amit Sinha and Anantha Chandrakasan. Dynamic power management in wire-
less sensor networks. IEEE Des. Test, 18(2):62–74, 2001.
[133] Christoph Steiger, Herbert Walder, and Marco Platzner. Operating systems for
reconfigurable embedded platforms: Online scheduling of real-time tasks. IEEE
Trans. Computers, 53(11):1393–1407, 2004.
[134] David B. Stewart. Measuring execution time and real-time performance. In In:
Proceedings of the Embedded Systems Conference (ESC SF, pages 1–15, 2002.
[135] David B. Stewart and Gaurav Arora. A tool for analyzing and fine tuning the
real-time properties of an embedded system. IEEE Trans. Softw. Eng., 29(4):311–
326, 2003.
[136] Wind River Systems. VxWorks ProgrammerŠs Guide. Alameda, CA 94501-
1153, USA, 5.5 edition, 2002.
[137] Jean-Charles Tournier. A survey of configurable operating systems. Technical
report, University of New Mexico, 2005.
[138] J. Veizades, E. Guttman, C. Perkins, and S. Kaplan. Service location protocol.
[139] Herbert Walder and Marco Platzner. Reconfigurable hardware operating sys-
tems: From design concepts to realizations. In In Proceedings of the 3rd Inter-
national Conference on Engineering of Reconfigurable Systems and Architectures
(ERSA, pages 284–287. CSREA Press, 2003.
[140] Herbert Walder and Marco Platzner. A runtime environment for reconfigurable
hardware operating systems. In in Proceedings of the 14th Field Programmable
Logic and Applications (FPLŠ04, pages 831–835. Springer, 2004.
[141] Colin Walls. Embedded Software: The Works. Newnes, 2005.
[142] Catherine Lingxia Wang, Bo Yao, Yang Yang, and Zhengyong Zhu. A survey
of embedded operating systems. Technical report, UCSD, Computer Scince and
Engineering department, 2001.
143
REFERENCES
[143] Christof Wehner. Tornado and VxWorks. BoD, 2006.
[144] G. Wigley, D. Kearney, and M. Jasiunas. Reconfigme: a detailed implementa-
tion of an operating system for reconfigurable computing. Parallel and Distributed
Processing Symposium, International, 0:218, 2006.
[145] Grant B. Wigley and David A. Kearney. Research issues in operating systems
for reconfigurable computing. In In Proceedings of the International Conference
on Engineering of Reconfigurable System and Algorithms(ERSA, pages 10–16.
CSREA Press, 2002.
[146] Grant B. Wigley, David A. Kearney, and David Warren. Introducing reconfigme:
An operating system for reconfigurable computing. In FPL ’02: Proceedings of the
Reconfigurable Computing Is Going Mainstream, 12th International Conference
on Field-Programmable Logic and Applications, pages 687–697, London, UK,
2002. Springer-Verlag.
[147] Tim Wilmshurst. An Introduction to the Design of Small-Scale Embedded Sys-
tems. Palgrave Macmillan, 2001.
[148] Wayne H. Wolf. An architectural co-synthesis algorithm for distributed, embed-
ded computing systems. IEEE Trans. Very Large Scale Integr. Syst., 5(2):218–229,
1997.
[149] Xilinx. www.xilinx.com.
[150] Xilinx. Xbrf 014: A simple method of estimating power in xc4000xl/ex/e fpgas.
Technical report, Xilinx, 1997.
[151] Xilinx. Differencing Method for Partial Reconfiguration, xapp290 (v2.0) edi-
tion, December 3 2007.
[152] Xilinx. Partial Reconfiguration User Guide, ug702 (v 12.2) edition, July 23
2010.
[153] J. Yannakopoulos and A. Bilas. Cormos: a communication-oriented runtime
system for sensor networks. pages 342 – 353, jan. 2005.
[154] Pavel G. Zaykov, Georgi K. Kuzmanov, and Georgi N. Gaydadjiev. Reconfig-
urable multithreading architectures: A survey. In SAMOS ’09: Proceedings of the
9th International Workshop on Embedded Computer Systems: Architectures, Mod-
eling, and Simulation, pages 263–274, Berlin, Heidelberg, 2009. Springer-Verlag.
144
REFERENCES
[155] Yuhong Zhao and Franz Rammig. Model-based runtime verification framework.
Electron. Notes Theor. Comput. Sci., 253(1):179–193, 2009.
[156] Bo Zhou, Weidong Qiu, and Chenlian Peng. An operating system framework
for reconfigurable systems. In CIT ’05: Proceedings of the The Fifth International
Conference on Computer and Information Technology, pages 788–792, Washing-
ton, DC, USA, 2005. IEEE Computer Society.
[157] R. Zimmermann and S. Ghandeharizadeh. Hera: Heterogeneous extension of
raid, 1998.
[158] Richard Zurawski, editor. Embedded Systems Handbook, Second Edition: Net-
worked Embedded Systems. CRC Press, 2009.
145