scieee Science in your language
[en] (orig)
Dissertation
Methods to Exploit Reconfigurable
Fabrics
Making Reconfigurable Systems Mature
A thesis submitted to the
Faculty of Computer Science, Electrical Engineering and Mathematics
of the
University of Paderborn
in partial fulfillment of the requirements for the
degree of Dr. rer. nat.
Florian Dittmann
Paderborn 2007
Supervisors:
Prof. Dr. Franz J. Rammig, Heinz Nixdorf Institute, University of Paderborn
Prof. Dr. Marco Platzner, University of Paderborn
Prof. Dr.-Ing. J¨
urgen Teich, University of Erlangen-Nuremberg
Date of public examination: December 14, 2007
Acknowledgements
This work was carried out in the course of exploring reconfigurable fabrics during my
work as a research assistant in the working group Distributed Embedded Systems at the
Heinz Nixdorf Institute, an interdisciplinary center of research and technology of the
University of Paderborn.
First of all, I wish to express my deep gratitude to my supervisor Prof. Dr. Franz
J. Rammig. He gave me the possibility to work in the field of reconfigurable computing
and excellently guided me through my PhD journey. I could always count on his opti-
mism, which more than once helped me to search for the way out when I was sure to be
stuck in a dead-end road.
I also like to thank my two vice-supervisors Prof. Dr. Marco Platzner and Prof. Dr.-
Ing. J¨
urgen Teich. I enjoyed having those two experts of the field of reconfigurable
computing supervising my thesis.
I like to thank all my former colleagues, who helped me in different ways to finalize my
thesis, among them Achim Rettberg, Marcelo G¨
otz, Stefan Ihmor, Klaus Danne, J¨
org
St¨
ocklein, Norma Montealegre, Simon Oberth¨
ur, and last but not least Vera K¨
uhne.
In particular, I want to thank Dr. Achim Rettberg and Dr. Marcelo G¨
otz for the
fruitful discussions on the benefits and challenges of reconfigurable computing. In both
I found two inspirational researchers for discussions that substantially contributed to
the quality of this thesis.
This work was also done within the SPP 1148 Reconfigurable Computing Systems of
the DFG, the German Research Foundation. Besides all fellows I met during colloquia,
I particularly want to thank Prof. Dr. Christophe Bobda, who heavily influenced the
first application for this position.
Also, I want to thank all helping students: Fabian Schulte, Alexander Warkentin,
Stefan Frank, Elmar Weber, Paulin Francis Ngaleu, Simon Sch¨
utte, and Helene Schilke.
Finally, I want to thank my parents. Their open-minded style allowed me to find
my own way with the knowledge of their unlimited support. Thanks also to my sister
Christine and my grandmother. Both always found the right words to help me believe
in myself. Last but not least my lovely thanks go to my girl-friend Carmen and her
patience. She tirelessly gave me the motivation to finalize this project.
Paderborn, December 2007
i
ii
Abstract
The demands for flexibility and performance presumably are the two main driving forces
of modern computing systems. Both are ever increasing requirements of users, de-
vices, machines, processes, etc. Computer scientists and engineers constantly search for
methods and architectures to solve these demands. Recently, reconfigurable devices—
adaptable and fairly powerful computing resources—have approached this scene.
Reconfigurable devices appear to be most appropriate for high flexibility and high
performance. They process in parallel and can change their behavior by loading dif-
ferent configurations on their fabrics, also during run-time. However, comprehensive
exploitation of the paradigms of reconfigurable computing is rarely found in modern
systems. In particular, dynamic reconfiguration—the adaptation of the behavior during
run-time—combined with partial reconfiguration capabilities of modern FPGAs remains
unused in most modern designs.
If we search for reasons, the answers are twofold. On one hand, we have to generally
argue on what benefits partial run-time reconfiguration yields, as we have to overcome
challenging technical hurdles. On the other hand, also few high-level methods exist
that target the partial reconfiguration capabilities of modern FPGAs and would allow
a system designer to exploit partial run-time reconfiguration.
This thesis approaches the situation described along the second train of thoughts—the
need for methods. We discuss new methods to exploit run-time reconfiguration, which
explore given reconfiguration constraints and often target at specific application areas.
In order to evaluate the methods, we integrate them into an abstracting layered view
on reconfigurable computing. We also discuss the technical challenges, which have to be
overcome to implement the methods. Altogether and maybe most notably, the methods
themselves explain where they can be useful, thus making a case for the use of run-time
reconfiguration in everyday systems—making reconfigurable systems mature.
iii
iv
Zusammenfassung
Flexibilit¨
at und Leistungsf¨
ahigkeit (Performance) geh¨
oren wohl zu den am st¨
arksten
nachgefragten Anforderungen an moderne Rechensysteme. So werden beide in st¨
andig
steigendem Maße von Benutzern, aber auch Maschinen, Prozessen, usw. eingefordert.
Informatikern und Ingenieuren obliegt die Aufgabe stets nach neuen Methoden und Ar-
chitekturen zu suchen, die dieser Herausforderung gerecht werden. In j¨
ungster Zeit sind
nun die adaptierbaren und leistungsf¨
ahigen rekonfigurierbaren Rechensysteme als weite-
re Architekturen hinzugekommen. Ihre Eigenschaften weisen sie als besonders geeignet
f¨
ur das Erreichen von Performance und Flexibilit¨
at aus. So findet eine Berechnung
auf rekonfigurierbaren Systemen im Raum, d. h. parallel, statt und kann nach Bedarf
w¨
ahrend der Laufzeit durch Rekonfigurierung angepasst werden.
Trotz der offensichtlichen Vorteile sind unter den modernen Rechensystemen wenige
vorzufinden, welche die M¨
oglichkeiten der Rekonfigurierung auch tats¨
achlich ausnutzen.
Insbesondere bleibt die dynamische Rekonfigurierung, d. h. die Anpassung des Verhaltens
w¨
ahrend der Laufzeit, kombiniert mit der M¨
oglichkeit zur partiellen Rekonfigurierung
oftmals ungenutzt. Es lasse sich zwei Begr¨
undungen hierzu ermitteln. Einerseits muss
grunds¨
atzlich gekl¨
art sein, wo die tats¨
achlichen Vorteile einer partiellen Rekonfigurierung
liegen, da eine solche nicht kostenneutral zu erreichen ist. Andererseits existieren aber
auch nur wenige Methoden, insbesondere auf h¨
oherer Abstraktionsebene, die es erlauben
die Leistungsdaten von Systemen bei Ausnutzung der partiellen Rekonfigurierung zu
evaluieren und letztere dann sp¨
ater auch gewinnbringend einzusetzen.
In dieser Arbeit wird dem Dilemma anhand der zweiten Begr¨
undung Rechnung ge-
tragen. Es werden vier neue Methoden diskutiert, die es erlauben Vorteile der partiellen
Rekonfigurierung zur Laufzeit auszusch¨
opfen. Um die teilweise applikationsspezifischen
Methoden zu explorieren, wird ein abstrahierendes Schichtenmodell verwendet. Zudem
werden die f¨
ur eine Implementierung notwendigen technischen Herausforderungen n¨
aher
beleuchtet. Schlussendlich sind die vorgestellten Methoden nicht nur Entwurfsmittel,
sondern stellen selbst neue M¨
oglichkeiten zur Anwendung partiell rekonfigurierbarer
Rechensysteme dar. Sie zeigen auf, wo der Einsatz m¨
oglich und praxisgerecht ist.
v
vi
Contents
1 Introduction 1
1.1 Reconfigurable Computing . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Contribution of the Thesis . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Abstracting Layered Approach . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Outlineofthework.............................. 5
2 Reconfigurable Computing 7
2.1 Introduction.................................. 7
2.1.1 Evolution ............................... 7
2.1.2 MakimotosWave........................... 11
2.1.3 Remainder of the Chapter . . . . . . . . . . . . . . . . . . . . . . 12
2.2 TechnicalAspects............................... 12
2.2.1 Reconfigurable versus Programmable . . . . . . . . . . . . . . . . 13
2.2.2 Granularity .............................. 14
2.2.3 Field Programmable Gate Arrays . . . . . . . . . . . . . . . . . . 15
2.2.4 Programming FPGAs . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.5 Run-Time Reconfiguration . . . . . . . . . . . . . . . . . . . . . . 22
2.2.6 Coupling................................ 28
2.3 FieldsofApplication ............................. 29
2.3.1 ASICDesign ............................. 29
2.3.2 Replacement of Dedicated Circuits . . . . . . . . . . . . . . . . . 30
2.3.3 Adaptive Processing . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.4 Design Approaches for Reconfigurable Systems . . . . . . . . . . . . . . . 34
2.4.1 Execution Environments/Architectures . . . . . . . . . . . . . . . 34
2.4.2 Placement/Scheduling Methods . . . . . . . . . . . . . . . . . . . 37
2.4.3 Comprehensive Design Systems/Design Methods . . . . . . . . . . 39
2.4.4 Miscellaneous Concepts . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 LessonLearned ................................ 40
2.6 Summary ................................... 42
3 Two-Slot Framework 43
3.1 Introduction.................................. 43
3.1.1 LayeredApproach .......................... 45
3.1.2 Organization of the Chapter . . . . . . . . . . . . . . . . . . . . . 45
3.2 Concept .................................... 45
3.2.1 Problem Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . 46
vii
Contents
3.2.2 Problem Solution Strategy . . . . . . . . . . . . . . . . . . . . . . 48
3.3 Run-TimeArchitecture............................ 49
3.3.1 Fundamental Design Constraints . . . . . . . . . . . . . . . . . . 50
3.3.2 Intermodule Communication . . . . . . . . . . . . . . . . . . . . . 51
3.4 Partitioning.................................. 53
3.4.1 Simple Temporal Partitioning . . . . . . . . . . . . . . . . . . . . 53
3.4.2 Spectral Based Partitioning . . . . . . . . . . . . . . . . . . . . . 54
3.5 Scheduler ................................... 59
3.6 Experiment .................................. 61
3.6.1 Proof of Concept Implementation . . . . . . . . . . . . . . . . . . 61
3.6.2 Cryptography Example . . . . . . . . . . . . . . . . . . . . . . . . 63
3.7 Extensions................................... 65
3.7.1 Low Power Considerations . . . . . . . . . . . . . . . . . . . . . . 65
3.7.2 The Two Slot Framework as an IP Core . . . . . . . . . . . . . . 69
3.8 LessonLearned ................................ 70
3.9 RelatedWork................................. 71
3.10Summary ................................... 73
4 Specification Graph Approach for Reconfigurable Fabrics 75
4.1 Introduction.................................. 75
4.1.1 LayeredApproach .......................... 76
4.1.2 Organization of the Chapter . . . . . . . . . . . . . . . . . . . . . 76
4.2 Concept .................................... 77
4.2.1 Problem Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . 77
4.2.2 Background.............................. 78
4.2.3 ProblemSolution........................... 79
4.3 ProblemGraph ................................ 80
4.4 ArchitectureGraph.............................. 82
4.5 Mapping.................................... 83
4.5.1 Basic Mapping Edges . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.5.2 Extensions............................... 84
4.6 Design Space Exploration . . . . . . . . . . . . . . . . . . . . . . . . . . 85
4.7 Experiment .................................. 91
4.8 LessonLearned ................................ 94
4.9 RelatedWork................................. 95
4.10Summary ................................... 96
5 Reconfiguration Port Scheduling 97
5.1 Introduction.................................. 97
5.1.1 LayeredApproach .......................... 98
5.1.2 Remainder of the Chapter . . . . . . . . . . . . . . . . . . . . . . 99
5.2 Concept .................................... 99
5.2.1 Execution Environment . . . . . . . . . . . . . . . . . . . . . . . 99
5.2.2 Problem Abstraction . . . . . . . . . . . . . . . . . . . . . . . . . 100
viii
Contents
5.2.3 Reconfiguration Port Scheduling . . . . . . . . . . . . . . . . . . . 101
5.2.4 Parallel Machine Problems with a Single Server . . . . . . . . . . 104
5.3 Aperiodic Task Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.1 Synchronous Arrival . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.3.2 Asynchronous Arrival . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4 Fixed Priority Periodic Task Scheduling . . . . . . . . . . . . . . . . . . 111
5.4.1 Schedulability Analysis . . . . . . . . . . . . . . . . . . . . . . . . 111
5.4.2 A Server for Full Load of Slots Sections...............112
5.4.3 Resource Access Protocol for Full Reconfiguration Capacity Sections114
5.4.4 DM + SS + PIP Schedulability Test . . . . . . . . . . . . . . . . 116
5.4.5 Experiment ..............................116
5.5 Caching ....................................117
5.5.1 Offline caching methods . . . . . . . . . . . . . . . . . . . . . . . 119
5.5.2 Dynamic/On-line Caching Methods . . . . . . . . . . . . . . . . . 122
5.5.3 Combination of the Methods . . . . . . . . . . . . . . . . . . . . . 124
5.5.4 Implementation............................125
5.6 Experiment ..................................125
5.7 LessonLearned ................................129
5.8 RelatedWork.................................130
5.9 Summary ...................................132
6 Algorithmic Skeletons for Dynamic Reconfiguration 133
6.1 Introduction..................................133
6.1.1 LayeredApproach ..........................136
6.1.2 Remainder of the Chapter . . . . . . . . . . . . . . . . . . . . . . 136
6.2 Concept ....................................136
6.2.1 Algorithmic Skeletons . . . . . . . . . . . . . . . . . . . . . . . . 137
6.2.2 Application for Reconfigurable Computing . . . . . . . . . . . . . 139
6.3 Run-time Execution Environment . . . . . . . . . . . . . . . . . . . . . . 140
6.4 StreamParallelism ..............................144
6.4.1 FarmParadigm............................144
6.4.2 Pipeline Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . 145
6.5 Dynamic Reconfiguration . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.5.1 Dynamic Reconfiguration on a Tile-Based Execution Environment 147
6.5.2 Dynamic Reconfiguration on a Skeleton-centric Execution Envi-
ronment................................148
6.6 Experiment ..................................151
6.6.1 Cryptography Experiment . . . . . . . . . . . . . . . . . . . . . . 151
6.6.2 Application-centric . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.7 LessonLearned ................................157
6.8 RelatedWork.................................158
6.9 Summary ...................................159
ix
Contents
7 Conclusion and Outlook 161
7.1 Conclusion...................................161
7.2 Outlook ....................................164
A The Design Tool Part-E 165
Author’s Own Publications 169
Bibliography 173
x
List of Figures
1.1 Performance vs. Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 LayeredModel ................................ 5
2.1 MakimotosWave............................... 11
2.2 Typical layout of a modern FPGA . . . . . . . . . . . . . . . . . . . . . 16
2.3 Schematic view of a configurable logic block of an FPGA . . . . . . . . . 17
2.4 Slice of a Xilinx Virtex 4 FPGA . . . . . . . . . . . . . . . . . . . . . . . 18
2.5 Busmacro of the Xilinx Application Note 290 . . . . . . . . . . . . . . . 26
2.6 Narrow busmacro of the Xilinx early access design flow . . . . . . . . . . 27
2.7 Coupling of reconfigurable devices . . . . . . . . . . . . . . . . . . . . . . 28
2.8 The Erlangen Slot Machine . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.1 Layered approach for the two-slot framework . . . . . . . . . . . . . . . . 45
3.2 Reconfiguration (RT) and execution (EX ) phase, simple example. . . . . 46
3.3 Benefit of partial reconfiguration . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Scheduling example of EX and RT phases ................. 47
3.5 Draft of the two slot architecture. . . . . . . . . . . . . . . . . . . . . . . 49
3.6 Communication focused floorplan of the architecture . . . . . . . . . . . 51
3.7 Simple partitioning, encapsulation of cycles . . . . . . . . . . . . . . . . . 54
3.8 Partitioning continued . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.9 IncludingaNOP-node ............................ 55
3.10 ASAP schedule of a data flow graph . . . . . . . . . . . . . . . . . . . . . 56
3.11 2D spectral placement of the data flow graph of Fig. 3.10 . . . . . . . . . 56
3.12 Nodes of ASAP level 1 are assigned to processing elements . . . . . . . . 57
3.13 Result of the ASAP and spectral based node assignment . . . . . . . . . 57
3.14 Complete schedule for a coarse grained device . . . . . . . . . . . . . . . 57
3.15 Clustering of a task graph . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.16 Sequence diagram of the slots and the controller entities . . . . . . . . . 59
3.17 Layout of the two slot architecture on a Virtex-II Pro FPGA . . . . . . . 62
3.18 Layout of the two slot environment on a Virtex-4 FPGA . . . . . . . . . 64
3.19 Example of produced cluster formation . . . . . . . . . . . . . . . . . . . 68
4.1 Layeredmodel................................. 76
4.2 Schematic of an exemplary runtime environment . . . . . . . . . . . . . . 77
4.3 Task dependence graph G.......................... 80
4.4 Task dependence graph and Gantt chart . . . . . . . . . . . . . . . . . . 81
4.5 Problem graph with intervals/life cycles. . . . . . . . . . . . . . . . . . . 82
xi
List of Figures
4.6 Architecturegraph .............................. 83
4.7 MappingEdges ................................ 84
4.8 MappingEdgesII............................... 85
4.9 Specificationgraph .............................. 86
4.10 Gantt chart of a schedule of Fig. 4.9 . . . . . . . . . . . . . . . . . . . . . 86
4.11 Task graph with communication and reconfiguration phases in ASAP and
ALAPordering ................................ 88
4.12 Image filtering using limited resources . . . . . . . . . . . . . . . . . . . . 92
4.13 Gantt chart of the image filtering example . . . . . . . . . . . . . . . . . 93
5.1 LayeredApproach............................... 98
5.2 Execution environment having homogeneous slots . . . . . . . . . . . . . 100
5.3 Example occupation of three slots. . . . . . . . . . . . . . . . . . . . . . 101
5.4 Scheduling according to d(left) and d(right), synchronous arrival times. 102
5.5 Scheduling two tasks according to d....................103
5.6 Scheduling task sets according to EDD . . . . . . . . . . . . . . . . . . . 104
5.7 Fullloadofslots ...............................105
5.8 EDD can fail to produce a feasible schedule. . . . . . . . . . . . . . . . . 108
5.9 EDF schedule on a three slot machine using preemption. . . . . . . . . . 109
5.10 Full reconfiguration capacity of an EDF schedule . . . . . . . . . . . . . 109
5.11 Fixed priority example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
5.12 Server for fls:WorstCase...........................113
5.13 Blocking Time for frc:WorstCase. .....................115
5.14Simulator ...................................117
5.15 Caching: Priority Based Slot Reservation . . . . . . . . . . . . . . . . . . 119
5.16 Consecutive Task Combination with modification of the release times of
tasks τ1and τ2. ................................120
5.17SchedulingLookBack ............................122
5.18 Scheduling Look Ahead . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.19 Applying different scheduling algorithms . . . . . . . . . . . . . . . . . . 124
5.20Conveyerbelt.................................126
5.21 Set of conveyer belts connected to a central processing unit. . . . . . . . 126
5.22 Experimental set-up on the Erlangen Slot Machine . . . . . . . . . . . . 127
5.23 The execution environment implemented on the ESM . . . . . . . . . . . 128
6.1 Layeredmodel.................................136
6.2 Two tile-based run-time execution environments . . . . . . . . . . . . . . 141
6.3 Skeleton-centric execution environment . . . . . . . . . . . . . . . . . . . 142
6.4 Farmparallelism ...............................144
6.5 Thepipelineparadigm ............................144
6.6 Two possible execution schemes of applications using the farm skeleton. . 145
6.7 Two implementations of a pipe skeleton with different area requirements. 146
6.8 Run-time reconfiguration of different skeletons . . . . . . . . . . . . . . . 148
6.9 Dynamic reconfiguration on a skeleton-centric environment . . . . . . . . 149
xii
List of Figures
6.10 Scheduling example of a farm and pipe skeleton . . . . . . . . . . . . . . 150
6.11 Triple DES given as a pipe and farm skeleton . . . . . . . . . . . . . . . 151
6.12 Skeleton-centric platform on a Xilinx Virtex-4 FPGA . . . . . . . . . . . 153
6.13 Mapping the triple DES application . . . . . . . . . . . . . . . . . . . . . 154
6.14 Xilinx FPGA editor output of the triple DES application . . . . . . . . . 155
6.15 Detailed view of the farm circuit . . . . . . . . . . . . . . . . . . . . . . . 156
A.1 ScreenshotofPart-Y .............................166
A.2 Reduced class diagram of the model. . . . . . . . . . . . . . . . . . . . . 167
A.3 ScreenshotofPart-E .............................168
xiii
List of Figures
xiv
1 Introduction
Along with the increasing quantity of intelligent devices such as personal digital assis-
tants (PDAs), mobile phones, driving assistants, industrial robots, etc., surrounding us,
also the demand for the quality of these devices rises. Besides mechanical issues, design
aspects, usability, etc., thereby the processing performance of such devices is of particu-
lar interest. Devices should be constantly faster in terms of processing speed to provide
enough performance for the latest applications, while also consuming little power and
being in-field adaptable to new emerging standards. The size of the processing units of
these devices thereby should be as small as possible, so that designs can be customized
and mechanical issues can be solved. Moreover, sufficient processing capacity can also
help to decrease response time, which often improves usability. Eventually, the over-
all behavior of the processing unit in large part influences the product itself and often
becomes the final sales argument.
Considering these challenges from the view of an embedded system engineer, we face
a tremendous market pressure, combined with today’s short time-to-market, when de-
veloping such modern products. During an ever decreasing design phase, engineers have
to solve the market requirements of the device under development. In the extreme ex-
ample, the specifications for the processing unit change virtually before market entry
or even after the device was released. To serve the market, provide performance, meet
low power requirements, and be still flexible enough for in-field adaptation can easily be
drudgery under these circumstances.
The manufacturers of processing resources adapt to this situation by offering a va-
riety of processing units to the embedded system engineer. These units follow several
paradigms of processing and therefore offer specific computation characteristics that
must be thoroughly evaluated for each design under development. Subject to suitable
methods that support the evaluation of the design under development, the appropriately
chosen devices can result in an advantage over the competitor.
Classes of Computing Media
What kind of processing resources are available? Basically, they can be divided into
three classes. First, the GPP (General Purpose Processor), which usually bases on
the instruction cycle based von-Neumann paradigm, is found in several versions within
modern systems, e. g., as PowerPC, ARM processor, x86 type, or mutated as DSP
(Digital Signal Processor). Usually if the performance of a GPP is not enough, dedicated
circuits—ASICs (Application Specific Interconnects)—are used. They form the second
category and offer high performance yet without the possibility to be programmed. The
third category—reconfigurable devices, mainly FPGAs—fill this gap, as they can execute
in parallel offering high performance and still being adaptable through reconfigurability.
1
1 Introduction
performance
flexibility
GPP
Reconfig.
devices
ASIC
Figure 1.1: Performance vs. Flexibility
Worth mentioning, there exist multiple devices that are in between the three classes
and fill in special niches, e. g., ASIPs (Application Specific Instruction Processors) or
MPGAs (Mask Programmable Gate Arrays) [DW99].
When evaluating the three classes, we often encompass a graph as depicted in Fig.1.1.
The graph assigns high flexibility to GPPs, as GPPs can compute any function by exe-
cuting different operations every instruction cycle. Specialized for general purpose, how-
ever, they offer low performance on average. The highest performance can be expected
by ASICs. However, once fabricated the functionality of ASICs cannot be altered.
Reconfigurable devices are in between both classes. Thanks to their adaptability and
reasonable performance, they combine characteristics of GPPs and ASICs. Obviously,
they will never be as flexible as GPPs or well-performing as ASICs, however, reconfig-
urable devices in many aspects fill the gap between GPPs and ASICs, and thus serve
the requirements of our modern systems in an extraordinary manner.
1.1 Reconfigurable Computing
Basically, reconfigurable devices thus combine performance and flexibility. Exploiting
this mixture of the characteristics of the two standard classes of computation often is
challenging. However, focusing on the important requirements of processing speed and
application flexibility of modern computing systems, reconfigurable systems seem to be
an appropriate solution and worth to be investigated further.
Processing in Space and Time
The fundamental difference of reconfigurable devices compared to GPPs and ASICs also
is often formulated as the capability to compute in space and time. Space, as the devices
offer computing area that is best explored by configurations, which make use of the par-
allelism of hardware. And time, as the device can be re-used—new configurations can be
loaded on the same area, which was used by an even completely different application just
before. To exploit reconfigurable devices, in particular FPGAs, as computing resources
of modern systems, these specific attributes should be taken into account.
2
1.1 Reconfigurable Computing
Run-Time Reconfiguration
Run-time reconfiguration is the most important technique for facilitating the reuse of
hardware during the execution of an application. It contrasts with the so-called com-
pile time reconfiguration, where the substrate is programmed once at the set-up time,
without further alterations.
By run-time reconfiguration, we can execute applications that require more processing
area than available and save physical devices, both often termed as hardware virtual-
ization. Moreover, we can monitor the current requirements of a system and react
accordingly by loading more suitable configurations. Thereby, in-field upgrades or bug
fixes of hardware become possible. In summary, run-time reconfiguration permits time
multiplexing of hardware resources and thus facilitates the processing in time and space.
Reconfiguration Overhead
Reconfiguration, however, comes with additional costs. This so-called reconfiguration
overhead demands some more words, as it is the major constraint of reconfigurable com-
puting in general and the main driving force for the methods of this thesis in particular.
Foremost, the time spent for reconfiguration, also termed the reconfiguration latency,
cannot be neglected. Often it slows down a system significantly. To give a number, the
reconfiguration phase of modern FPGAs is in the range of milliseconds.
Secondly, the reconfiguration is often done via a mutually exclusive reconfiguration
port that must be accessed sequentially. As a result, the reconfiguration of more than
one region at the same time (see partial reconfiguration below) is impossible in general.
Furthermore, the configuration information, generally speaking the bitstream, must
be stored and therefore consumes storage area. The size of a bitstream for a modern
FPGA can sum up to one megabyte or even above. Together with the storage, the whole
reconfiguration process also consumes energy and thus adds to the overall demand of
the system. In particular embedded systems often are power aware systems and any
additional energy consumption must be seen critically.
A simple reason for the reconfiguration overhead is the fact that reconfigurable devices
are primarily designed for fast processing, not for run-time reconfiguration. Thus most
likely, the long reconfiguration phase of FPGAs will barely change in the future. Here,
we can note two fundamental objectives that should drive any work in the area of
reconfigurable systems. (1) If ever possible, reconfiguration should be avoided. (2) If we
cannot avoid it, reconfiguration should be transparent and thrifty as possible.
Partial Reconfiguration
Nevertheless, some techniques exist to reduce the reconfiguration overhead. The funda-
mental one is partial reconfiguration. It forms a sophisticated possibility how run-time
reconfiguration can take place. In contrast to a complete reconfiguration—programming
the whole device—during partial reconfiguration only parts of the substrate are altered.
While the former requires a complete stop of the device, the latter facilitates to continue
processing as others regions are not influenced during the reconfiguration.
Partial reconfiguration basically allows us to hide the reconfiguration time. If we can
pre-fetch the next reconfiguration context, we can overlap reconfiguration and execution.
3
1 Introduction
Additionally, as smaller area under reconfiguration means a shorter reconfiguration time,
partial reconfiguration also consumes less time. Still, partially reconfigurable systems
cannot eliminate the costs associated with reconfiguration. However, they can minimize
the reconfiguration overhead.
Exploiting Reconfiguration
In general, by (partial) run-time reconfiguration, we can obtain the demanded adaptabil-
ity of modern systems as a means to cope with the market pressure and the decreased
lifecycle of products. Moreover, the ability to change parts of the hardware on the fly
during execution has proved a practical matter in many fields [GG05]. Partial reconfig-
urability in particular offers significant increase in flexibility for the execution of highly
parallel algorithms retaining variability of hardware structural parameters.
However, to exploit the capabilities of partially reconfigurable systems and make them
usable for system designers, appropriate methods are required. These methods thereby
must respect constraints of reconfigurable computing, like area consumption of tasks
loaded on reconfigurable fabrics, fragmentation, priority of tasks, etc. Moreover, there
also should be little or even no increase of the design complexity. The paradigm re-
configuration should be well supported or even transparent for the designer. Hence,
solutions are needed for the burden of reconfiguration effort, so that the concentration
of the designer can be fully focused on the intended application. Thereby, appropri-
ate methods may also help reconfigurable systems to penetrate the main-stream and
completely unveil their benefits.
1.2 Contribution of the Thesis
This thesis targets on exploiting reconfigurable computing and therefore discusses four
methods for partially and dynamically reconfigurable system design. We provide the
methods based on a layered approach (see below) and evaluate them. Each method
serves a specific scenario—a range of applications that can be implemented using the
method applied. The methods are presented in increasing complexity—the first one is
fairly straight forward targeting hardware virtualization. To allow for exploiting hetero-
geneous fabrics, the second method describes how reconfigurable fabrics may be included
in a synthesis technique. The third one considers real-time constraints and beneficially
uses the reconfiguration delay. Finally, the forth method introduces a completely new
programming concept to reconfigurable systems on a very high level of abstraction.
All in all, the contribution of this thesis is the development and evaluation of design
methods that consider the specific characteristics of reconfigurable systems. By virtue of
our methods, we hope to open the field of (partial) run-time reconfiguration to a broader
community, increasing the attraction and conquering new application fields, which have
been so far under-represented due to a lack of appropriate methods. As auxiliary means,
a tool that eases the design process and generation of partial bitstreams was developed.
In order to concentrate on the reconfigurable paradigm, we rely on abstraction. In
particular, we mainly focus on the reconfigurable fabric solely. The classical task of
4
1.3 Abstracting Layered Approach
FPGA
Tasks
Reconfigurable device
Applications
Layers of
Abstraction
Figure 1.2: Layered Model
hardware/software partitioning, which often is an integral step during embedded systems
design, may be considered as already done. Along with this, we do not consider processor
(CPU) reconfigurable fabric (FPGA) interaction, the so-called coupling.
The major driving force of this thesis is to raise the level of abstraction, which is needed
to tame the design complexity. In particular, such a step is inevitable if we argue—
as is our opinion—that FPGAs have started to become mainstream. Extraordinary
effort is needed to exhaust their potential and exploit the capability for partial run-time
reconfiguration. Reconfigurable fabrics then may become more mature.
1.3 Abstracting Layered Approach
The surrounding lead for the methods shall be an abstracting layered model. It summa-
rizes the overall motivation and structures what was done and where we rely on work of
the literature. As layers allow us to separate or constrain the domains space and time,
the layered approach also helps us to manage the two design parameters of reconfig-
urable systems, as can be seen throughout the thesis. The general layout of the model
is depicted in Fig. 1.2.
1.4 Outline of the work
The thesis is organized as follows: We first summarize general characteristics of reconfig-
urable systems, in particular FPGAs. Then, we detail four design method or frameworks,
also including application examples. A lesson learned at the end of each chapter allows
us to gradually extend the characteristics considered. In detail, the chapters resemble
the following:
Chapter 2 discusses reconfigurable computing in general. We start with an historical
retrospection of the field. Then, we present techniques of reconfiguration and
summarize design approaches of the field. Guiding principle of this chapter is the
awareness of the potentials, benefits, and drawbacks of reconfigurable computing.
Chapter 3 introduces the first method, the so-called two slot framework. The framework
is a direct approach how to hide the reconfiguration latency. It also focuses on the
long reconfiguration times compared to relatively short time spent in execution.
5
1 Introduction
In particular, intermediate results often must be hold back until reconfiguration
has finished and processing can continue in the newly configured regions. To cope
with these drawbacks, fundamental concepts are investigated under the constraint
of not increasing the overall response time. The chapter also serves as a mean to
detail partial reconfiguration. Moreover, some extensions to this basically known
concept could be developed within this thesis.
Chapter 4 introduces partial reconfiguration to system synthesis. Particularly, the het-
erogeneity of modern FPGAs has been the driving force for developing the method
of this chapter—a synthesis method respecting run-time reconfigurable devices. By
virtue of the concepts employed in this chapter, we can also react on devices host-
ing multiple reconfiguration ports or systems that consist of more than one FPGA.
Additionally, a whole system under development—including GPPs, ASICs, com-
munication, etc.—can be modeled and explored. This can be achieved by the core
concept of this method, which is the straight forward integration of the reconfig-
uration phase into a task/problem graph.
Chapter 5 introduces the new and within the course of this work developed concept of
reconfiguration port scheduling. The approach allows us to target real-time appli-
cations. Aperiodic and periodic task sets may be executed—loaded dynamically—
on modern FPGAs under real-time constraints using this approach. Therefore, the
deadline dto denote the latest end of the reconfiguration process is introduced.
The underlying execution environment resembles a slot-based approach.
Chapter 6 sketches another approach to exploit reconfigurable computing systems. Here,
we apply concepts of algorithmic skeletons to reconfigurable system design. Algo-
rithmic skeletons are implementation guidelines of the parallel computing domain
that separate structure from the algorithm under development. By virtue of such
a separation, evaluation of design in a very early phase of the development pro-
cess are possible. Moreover, algorithmic skeletons enable a sophisticated concept
of thread level parallelism on partial reconfigurable FPGAs without unsolvable
fragmentation or communication demands.
Chapter 7 finally summarizes the thesis, draws a conclusion, and gives an outlook for
future work.
6
2 Reconfigurable Computing
Reconfigurable computing dates back to the 1960s. Since then, several concepts have
emerged—not all of them have survived. When asking for reasons, we list multiple an-
swers. Most notably, along with the numerous benefits, we also enumerate a presumably
similar number of obstacles that we have to overcome. The purpose of this chapter is
to give a short and comprehensive overview of the core concepts and challenges. We
therefore throw a glance at the fascination of reconfigurable computing, while not mask-
ing the problems, which are sometimes still unsolved. After a brief retrospective on the
evolution in the field, we classify reconfigurable systems and the techniques proposed
and in use nowadays. We discuss typical requirements for the use of reconfigurable sys-
tems as well as their application areas. Partial reconfiguration thereby is an important
aspect of this chapter. Finally, such an overview would not be complete without some
prominent examples of the literature, which are driven by the fascinating possibility to
adapt hardware during run-time.
2.1 Introduction
Who has invented reconfigurable computing? This question may be unsolvable, yet,
there exist two influencing approaches in the literature that discuss the topic a long
time before the current research on (re)configurable computing has commenced.
2.1.1 Evolution
The first roots of reconfigurable systems can be traced back to Estrin’s fixed plus variable
structure computer developed at UCLA in the 1960s [EBTB63, EV62]. Estrin’s machine
consists of a standard processor surrounded by an array of reconfigurable hardware, with
the main processor controlling the behavior of the adaptable part. The latter would be
tailored to perform a specific task in hardware with the speed of hardware. Estrin’s idea
was well ahead of the technology at that time, so he only made a crude approximation
of the idea. Nevertheless, the first conceptual approach for reconfigurable computing
was born, and implemented in parts later [Est02].
While Estrin however stayed rather unspecific about the final implementation of the
reconfiguration in his initial work, Franz J. Rammig—the second precursor—became
very concrete on the implementation a decade after Estrin’s idea. In [Ram77], Rammig
describes his invention, which is a concept for editing of hardware, already presenting a
similar architecture to today’s FPGAs. Not only interested in the physical implemen-
tation, Rammig also discusses the requirements to obtain a hardware editor that allows
7
2 Reconfigurable Computing
him to abstractly describe and finally generate hardware. As his reconfigurable sub-
strate intends to host arbitrary circuits, he particularly focuses on the timing-behavior
of signals. Additionally to the theoretical discussion on how to edit hardware, Rammig
implemented his approach in the META-64 GOLDLAC using standard TTL-technology.
Besides these two approaches, no other works are known that discuss reconfigurable
computing before the introduction of FPGAs in the 1980s. However, simpler pro-
grammable logic devices—without the ability to implement arbitrary circuitries—were
already introduced under the term programmable logic device (PLD) in the 1970s.
Programmable Logic Devices
PLDs initially only described a class of simple configurable devices, however the term
nowadays is increasingly used to summarize all devices that offer a (re)configurable
substrate. The first concrete programmable devices in the context of the initial meaning
have been the PLA and the PAL, both briefly discussed next.
Programmable logic arrays (PLA) consist of a plane of AND gates fed into a plane
of OR gates. Their structure offers to implement any Ninput/Moutput function that
match the size of the PLA. Thus, they resemble a basic and reasonably efficient approach
towards reconfiguration as it is known today.
The programmable array logic (PAL) is a special case of the PLA with the AND array
being the only programmable array; the OR array of the PAL is fixed. The fixed OR
array increases the performance compared to a PLA; however, a PAL is not capable of
implementing any arbitrary Ninput/Moutput function.
Multiple instances of both devices also have been combined to make up larger devices
that offer more programmability. Such complex programmable devices (CPLD) often
consist of macro cells that host disjunctive normal form expressions, facilitating deeper
logic than simple PLDs. Furthermore, mainly implemented as non-volatile devices, they
are a common choice to host and control the set-up configuration of field programmable
gate arrays, among others.
Field Programmable Gate Arrays
The first arbitrarily reconfigurable devices have been the field programmable gate arrays
(FPGA), which emerged in the mid 1980s. They were developed and commercialized by
Xilinx, today still the largest company in the field, with the competitor Altera close to
catch up. Basically, they outperform simple programmable logic like PLAs and PALs,
because FPGAs can implement multilevel logic functions. The basic parts of an FPGA
are a collection of programmable gates embedded in a flexible interconnect network, as
well as programmable I/Os for the embedding and communication to peripherals. As
being the main target architecture for the methods presented in this thesis, a detailed
and comprehensive discussion is given in Sect. 2.2.3.
To illustrate the fascination of FPGAs a brief overview of the various functionalities
offered by modern FPGAs shall be already given here: The original look-up-tables
have emerged to become increasingly flexible by hosting carry-logic, flip-flops (to save
states), etc. Additionally, dedicated hard cores such as multiply, multiply/accumulate,
RAM, even complete processor cores can be found within the reconfigurable substrate
8
2.1 Introduction
of modern devices. Moreover, some FPGAs offer partial reconfigurability, which means
that regions are altered at run-time while the remainder stays untouched and active.
Splash and PAM
The capability for adaptability as given by FPGAs motivated several groups to investi-
gate reconfigurable computing systems as new way to exploit high performance. Worth
mentioning within this area are two early examples of the supercomputing domain that
also can be termed the first reconfigurable computers. They were built by the IDA
Supercomputing Research Center (SRC) in the USA and the DEC Paris Research Lab
(PRL) in the late 1980s and early 1990s.
SRC built a systolic array called Splash [GHK+91] containing 32 Xilinx 3090 series
FPGAs connected in a linear array. Some applications of Splash proved its power [Hoa93,
Jai95] and thus Splash was followed by Splash II [BAK96]. Splash II offered a crossbar
on top of the pure linear connection of the original Splash.
DEC PRL built a so-called Programmable Active Memory (PAM) named PeRLe-0
[BRV89] targeted towards image processing applications. The initial system contained
the impressive number of 25 Xilinx 3020 FPGAs. Soon, DECPeRLe-1 followed as suc-
cessor [VBR+96] and was effectively used in several areas, for example in the domain of
neural networks [LLM95].
Both systems were driven by the goal to develop a new kind of supercomputer, com-
posed of hardware-re-programmable components. Impressive benchmarks were shown,
where the two approaches outperform software-only programmable computers of those
days by one or two orders of magnitude. One of the biggest challenge for reconfigurable
supercomputing however is the programmability, which still is subject of active research
nowadays [BEGGK07]. For further details refer also to [GG05].
Berkeley Emulation Engine
For investigating high-end reconfigurable computing the BEE (Berkeley Emulation En-
gine) project—an attempt at creating a universal reconfigurable computing system—was
started at the turn of the millennium. Currently, the BEE2 system [CWB05], a successor
of the BEE1 [CKRB03], is in operation. BEE2 was designed to be a modular, scalable
FPGA-based computing platform with a software design methodology that targets a
wide range of high-performance applications such as real-time radio signal processing,
simulation of large-scale, ad-hoc and traditional networks, or scientific computing.
The BEE2 system uses five large Xilinx Virtex-II Pro FPGAs as the primary and
only processing elements. In addition to the reconfigurable fabrics, the BEE2 provides
up to 20 GB of high-speed memory, including independent access of each FPGA to the
memory using channels. Finally, the FPGAs on the BEE2 are highly connected with
both high-speed, serial and parallel links.
To tackle the application at hand, users can choose the appropriate number of compu-
tational modules needed, including reconfiguration to switch to a different application.
The most impressive application example that uses a set of BEE2 engines is the RAMP
(Research Accelerator for Multiple Processors) project [WOK+06, KSW+07], which tar-
gets on investigating multi processor systems consisting of up to hundreds of cores.
9
2 Reconfigurable Computing
Coarse Grain Devices
In the field, also another class of reconfigurable devices has been invented besides the
dominating FPGAs. These so-called coarse grain devices derive their name from the
fact that they can be programmed on a coarser level of granularity, mainly to reduce the
costs of configuration logic. Examples are the PACT XPP device [PAC06], the Chimaera
[HFHK97, YMHB00], or NEC’s DRP (Dynamically Reconfigurable Processor) [Mot02].
For more detail, an up-to-date survey may be found in [Ama06].
Although a lot of research activity has been put into coarse grain devices, they are
of little commercial success so far. Hartenstein concludes in his overview of different
coarse grain devices [Har01a, Har01b] that each application must be well matched to
the specific coarse grain reconfigurable device to gain reasonable benefits as we lack
a universal form that is efficient for all applications. He thereby gives an agreeable
argument for the bare market presence of coarse grain devices.
Systolic Arrays
Systolic arrays [KL78]—a pipe network arrangement of identical data processing units
first mentioned in the late 1970s—shall be included into this section as another (ar-
guable) influence to the emergence of reconfigurable computing. Basically, systolic arrays
are two-dimensional arrangements of processors connected in a mesh-like topology, which
can be found in modern reconfigurable devices as well. In both concepts, the processing
elements are usually triggered by data-streams without local instruction counters.
The underlying principle of systolic arrays is to achieve massive parallelism with a
minimum communication overhead. As systolic arrays are easy to implement because
of their regularity, they have been discussed as a way to ease VLSI design, a novel
technique at that time. Therefore, the concept of reconfiguration basically is unknown
to them. However, within the realm of the XPuter research at the Technical University
of Kaiserslautern, Germany [ABH+94], the KressArray [HK95, Kre96, HHHN00] was
developed on the basis of systolic arrays as a reconfigurable version of systolic arrays.
Furthermore, Vaidyanathan and Trahan investigate a reconfigurable mesh (R-Mesh) as
a theoretical model to study reconfiguration [VT03].
Next Trend: Multicore?
Modern CPUs increasingly are multi-threading or offer a multicore design. Some of
them just double the original CPU on the same die. Others offer a complete new design.
Obviously, there is some relationship between multicore chips and reconfigurable devices,
which are also executing in parallel and adapting the behavior.
The fundamental difference in most cases however is the program counter, which is
optional in the reconfigurable computing domain and which multicore architectures do
host in all of their cores. Still, the difference arguably is blurred. For example the
Raw chip from MIT [TKM+02] is a two-dimensional array of programming tiles, each
having a 32-Bit MIPS-like microprocessor, local instruction and data caches, and a
32-Bit pipelined floating point unit. It is a 2-D mesh network, not like a traditional
bus-based multiprocessor system, and was developed under the motivation of exploiting
reconfigurable computing.
10
2.1 Introduction
Figure 2.1: Makimoto’s Wave [Mak00]
In the meantime, Tilera corporation launched a first chip—the Tilera Tile64—based on
the MIT Raw project [Cla07]. The chip is called an embedded multicore device hosting
64 cores and targeting at low power applications in the embedded domain. It can run at
600 to 900 MHz with the cores arranged in a mesh style and equipped with three layer
cache on each site. Furthermore, the design is claimed to scale beyond hundreds or even
thousands of cores. However, the programming model for Tilera devices and all similar
approaches is still in its infancies and demands for extensive research.
2.1.2 Makimoto’s Wave
We also want to consider reconfigurable computing from another perspective, motivated
by the question whether it is a surprise that reconfigurable computing now seems to be-
come more and more widespread. Looking back, there have been some prognosticators,
among them Dr. Tsugio Makimoto. He observed in 1986 that mainstream microchip
application changes every ten years [Mak00]. This cycle—called Makimoto’s Wave,
see Fig. 2.1—describes a regularity in semiconductor trends, whose main feature lies in
large-scale repetitive cycles oscillating between standardization and customization.
The background of the wave is the following: When large numbers of new technolo-
gies such as devices, architectures and software appear, the semiconductor industry as a
whole moves towards standardization. Due to the need for product differentiation, added
value, and the imbalance between supply and demand, the wave will reach a maximum
and inverse its gradient. Next, progress in design automation and advances in technolo-
gies occur, shifting the semiconductor industry to customization. Then, reverse trends
toward early market entry, cost reduction, and more efficient operation appear and the
wave shows the longing for standardization.
From a macro viewpoint, the semiconductor industry can be said to repeat alternate
phases of standardization and customization. Makimoto’s wave thus is a concept of
explaining the history of computing in general and the evolution of reconfiguration in
particular. The wave predicts that the third wave will bring reconfigurable hardware
into mainstream, as these devices can meet the requirement of standardizing ASICs.
11
2 Reconfigurable Computing
Hartenstein proposed an extension of Makimoto’s wave representing today’s develop-
ment [Har03]. In his opinion, coarse grain reconfigurable devices will be the technology
for the next period of customization as they are more customized than classical FPGAs.
Considering the modern trend of adding hard cores to FPGAs—making them hybrid
devices—such a shift seems to be reasonable. Moreover, the trend to customize FP-
GAs towards application domains by offering multiple platforms that balance between
dedicated circuits and freely programmable logic supports this prognostication.
Another abstraction to describe the evolution of processing resources was done by
Tredennick—using the two categories algorithm and resource [Tre95]: After a period
of both being fixed, we reached a time when the algorithms became variable—software
could be and still is used to describe the algorithms, which are mapped onto the fixed
resources (hardware) afterwards. Today, reconfigurable computing brings us to the point
of both categories being variable.
Recently, Rammig formulated this combination of flexible algorithms and variable
resources in the context of platform-based design [Ram07]. He particularly discussed
the consequences of allowing dynamic reconfiguration at run-time—blurring the former
clear distinction between application and platform. The latter may change its services
offered, etc., making new modeling, analysis, and synthesis methods evident.
2.1.3 Remainder of the Chapter
As research on reconfigurable computing currently is very popular, and a complete
summary would go beyond the scope of this thesis, we focus on important works with a
good coverage of the ideas of the field in the remainder of the chapter. In particular, we
first present a comprehensive view of the technical aspects in the next section, before
we discuss application fields and design approaches. Being the main technique of this
thesis, we thereby emphasize run-time reconfiguration. Finally, we conclude the chapter
by a lesson learned and give a brief summary.
For further reading, we like to refer to some excellent surveying works of the re-
configurable system domain [DW99, TB01, CH02, SS05]. Also some books have been
published recently [GG05, KKS04, VM05, Bob07, VS07].
2.2 Technical Aspects
After the historic-based introduction of reconfigurable devices, we below discuss remark-
able techniques and details of theses processing units to understand why they claim to
unite high performance and high adaptability. As the thesis is on exploiting run-time
reconfiguration, we thereby emphasize techniques that facilitate this kind of reconfig-
uration, in particular partially reconfigurable FPGAs. An interesting question to be
followed before eventually proceeding with the techniques of reconfigurable computing
is the one on what reconfigurable actually stands for.
12
2.2 Technical Aspects
2.2.1 Reconfigurable versus Programmable
The ubiquitous and most successful device of reconfigurable computing—the FPGA,
which stands for Field Programmable Gate Array—does not bear the word reconfigurable
in it. So why speak of reconfigurable computing, and if so, where is the difference to
programmable computing? It is not easy to give a satisfying enough answer. We try
to approach the arguable still vague distinction by discussing answers or works from
influencing people of the field.
When asked for the difference, Rolf Esser from Xilinx responded at the DATE 2007
workshop on reconfigurable computing:
A reconfigurable architecture can implement a programmable system—hence
“programmable” is a more general and higher level term.
Extending this abstract definition and looking conceptually at the difference, pro-
grammable is more linked to execution instructions, while reconfigurable means that
the structure is changed. This adaptation however does not occur on the level of the
physical hardware. The specific hardware fabric of a reconfigurable device is designed
in such ways to be able to change the behavior—gates are re-used and their external
connection is changed.
DeHon compares microprocessors and programmable logic using the notions of in-
struction depth and datapath width [DeH96b]. Microprocessors have an instruction depth
of megabytes (on-chip cache) and a fixed datapath width of 8, 16, 32, or 64 bits, while
classical reconfigurable devices like FPGAs and PLDs only store a small number of con-
figurations (instructions), but offer a datapath width of 1, which makes them highly
adaptable. Moreover, a software programmable microprocessor binds functionality at
every cycle, which is not the case for FPGAs. However, considering also coarse grain
reconfigurable systems, this clear distinction gets blurred. These devices comprise of
datapath widths mainly in-between FPGAs and CPUs and also may rely on frequent
run-time reconfiguration.
Nevertheless, the frequency of reconfiguration acts as main indicator for a distinc-
tion: While programming usually means that every cycle a new instruction is loaded or
executed, respectively, reconfiguration does not occur as frequent. Both domains also
support their intended reconfiguration/programming frequency by their construction.
Eventually, the distinction becomes manifested in the three categories of processing
units of the introduction: ASICs are only once configurable, GPPs are programmable
(reconfigurable) every cycle, while reconfigurable devices are in-between. In particu-
lar the difference of the reconfiguration frequency of GPPs, ASICs, and reconfigurable
devices describes the situation along arguments that can be found all over this the-
sis: As configuration stands for a cost-sensitive process, the multiple occurrence of this
step—resulting in one or more re-configurations—should be well thought off and soundly
integrated in the overall design process and characteristics of reconfigurable systems.
Nevertheless, if we forbid reconfiguration, we would most likely waste capabilities
of reconfigurable devices, as the adaptability drops down close to or even meets the
one of ASICs. We thus have to derive a good strategy that conducts well-planned
13
2 Reconfigurable Computing
reconfigurations based on a deepened knowledge of reconfigurable devices. We start by
discussing the granularity of reconfigurable devices in the following.
2.2.2 Granularity
As already mentioned in the retrospective of reconfigurable computing, two categories of
reconfigurable devices are distinguished nowadays: fine grain and coarse grain devices.
The core difference between both categories is the granularity at which the functionality
of the devices can be adapted. Fine grain means, we can adapt the functionality on a
very low level—we may even add or remove an inverter or a gate of a circuitry. Coarse
grain devices, in contrast, reconfigure on a much coarser granularity—most often their
mesh of processing elements (ALUs, etc.) is adaptable in a VLIW-like (very long instruc-
tion word) style—complete instructions make up the reconfiguration information for a
processing element. As coarse grain devices thus may fail to implement any hardware
circuit, they are also sometimes termed pseudo reconfigurable devices [Bob07].
Granularity is also proportional to the configuration time and therefore can determine
the frequency of reconfiguration. Coarse grain devices usually require a far lesser amount
of reconfiguration information than fine grain devices like FPGAs, however accompanied
by the drawback of a limited adaptability. Nevertheless, both exploit parallelism at mul-
tiple levels of granularity, from instruction through task level parallelism. Additionally
FPGAs can make use of bit-level/boolean parallelism.
By referring to the granularity of reconfigurable devices as the abstraction level used
to program or configure the device, we can facilitate a more detailed classification of
reconfigurable devices concerning their level of granularity. During a detailed discus-
sion of multiple academic research projects on coarse grain devices, this classification
was also done in [KKS04]. The core concept thereby is to distinguish between devices
reconfiguring on boolean level granularity, instruction level granularity, functional level
granularity, and process level adaptability. While basic FPGAs are among the boolean
level adaptable devices, several mainly theoretical research projects on (coarse grain)
reconfigurable computing of the 1990s (mainly funded by the DARPA) are ranged in
the three remaining categories. Noteworthy on top of that classification are hybrid re-
configurable devices, which intend to combine the benefits of the several approaches.
According to [KKS04], modern FPGAs may all be classified as hybrid devices, as their
embedded hard cores (DSP units, RAMs, CPUs, etc.) enable them to offer dedicated
logic that is reconfigurable on a coarse level.
Concerning the market presence, coarse grain devices have little success so far. They
are hardly commercially available and often still in their infants. Already mentioned in
the retrospective, a beneficial matching of applications to coarse grain devices is a chal-
lenging task and must be performed mainly individually for each device. Nevertheless,
the ideas of coarse grain reconfigurable devices have influenced our work. Some of our
methods can also be mapped onto coarse grain devices [DB05].
Furthermore, in academia, several concepts exist to improve the acceptance of coarse
grain devices. For example, a proposal of design steps for efficient usage of coarse grain
reconfigurable devices enumerates technology mapping,placement algorithm, and routing
14
2.2 Technical Aspects
algorithm [Nag01]. Other approaches sort and exploit coarse grain devices according to
their topologies [BGD+04, LCD03] and thus open them for a more general design space
exploration using standard methods.
Nevertheless, fine grain systems are the dominating reconfigurable devices nowadays—
most of all as FPGAs from the two global players Altera and Xilinx. They combine
some excellent benefits: Fine grain reconfigurable FPGAs allow for efficient parallel
processing, thus approximating the speed of ASICs. They can be programmed during
run-time, which makes them adaptable devices. Some of them even support partial
reconfiguration. To be able to soundly discuss methods for the beneficial exploitation of
these modern capabilities, we put our immediate attention to FPGAs in the following.
2.2.3 Field Programmable Gate Arrays
The first commercially available FPGA—the Xilinx XC2000 [CDF+86]—already con-
sisted of three core elements that make up the fundamentals of all modern FPGAs:
Besides an array of programmable logic cells that can implement combinational as well
as sequential logic, there are programmable interconnects that surround the basic logic
cells, and programmable I/O cells that surround the two former parts.
Concerning the layout of these three core elements, multiple different concepts exist,
which all go beyond the capability of only nearest neighbor communication of mesh-style
systolic arrays or most coarse grain reconfigurable devices. In general, the communi-
cation infrastructure is closely related to the overall layout—the physical organization
of the resources—of an FPGA. In particular, we distinguish a hierarchical, row-based,
symmetrical array, or a sea of gates structure. The hierarchical arrangement is similar
to a tree, where communication is routed to the parent nodes until the destination is
among the children. Logic resources are mainly situated at the leafs of the tree. Row-
based systems dedicate area for logic and communication in alternating stripes, therefore
facilitating direct communication between logic resources only horizontally. For vertical
communication, logic resources must be traversed. Eventually, the symmetrical array
and the sea of gates structure comprise of equally distributed logic and communication
in a two dimensional style. The difference between the latter two is that there is no space
left for routing between the macro cells of the sea of gates arrangement. Interconnection
of the logic resources thus takes place on top of the cells. The main benefit of this style
is that arbitrary communication between the configuration cells is enabled.
The optimal arrangement has been thoroughly discussed, and the majority of the
modern FPGA architectures now consist of grid style arrays of configurable logic blocks
(CLBs) arranged as a sea of gates—also termed the generic island style FPGA model.
Figure 2.2 depicts the typical layout of the three core elements of such an FPGA. The
relatively smooth organization of the routing resources allows fast and efficient com-
munication along the rows and columns of logic blocks. However, the flexibility of the
island-style routing comes with a cost. Most current FPGAs use less than 10% of their
chip area for logic, devoting the majority of the silicon real estate for routing resources.
The other communication topologies of configurable devices—hierarchical, mesh-style
or row-based—can be found in coarse grain devices and some rare FPGAs. They may of-
15
2 Reconfigurable Computing
IO
IO
IO
IO
IO
IO
IO
IO
IOIO IOIO
IOIO IOIOIO
IO
IO
IO IO
IO
IO
IO
CLB
CLB
CLB
CLBCLB
CLB
CLB
CLB CLB
Figure 2.2: Typical layout of a modern FPGA
fer more sophisticated communication for specific applications, however, are less flexible
on the average case.
Reconfigurable cells
Reconfiguration cells spread all over the three core elements—configurable logic blocks,
communication infrastructure, and IOs—permit a very flexible customization and in-field
adaptation. Concerning the physical implementation of the cells, we either find antifuse-
based—special antifuses at every customization point—or memory-based—SRAM, Flash,
EEPROM—systems. Antifuses require less area and have a lower resistance than mem-
ory-cell based systems. However, antifuse-based systems suffer from their one-time pro-
grammability and therefore are not suitable for spatial and temporal execution of ap-
plications, which is essential for (run-time) reconfigurable computing.
Nowadays, SRAM based configuration is the dominating technology—SRAM cells
are connected to the configuration points in the FPGA, and setting the correspond-
ing SRAM bits configures the FPGA. Their advantage of providing an indefinitely re-
programmability (configurability) eventually facilitates a reconfiguration on-the-fly dur-
ing run-time. However, the chip area required for SRAM based configuration technique
is relatively large. Moreover, due to the volatility of SRAM cells, the configuration must
be reloaded at every new set-up.
Additionally, we have to bear in mind that referring to an SRAM-based FPGA as
being completely arbitrarily reconfigurable as the term SRAM (Static Random Access
Memory) configurable suggests is technically incorrect for the majority of FPGA archi-
tectures. If at all, we can only select a column or tile to be reconfigured. However, given
the fact that modern memory also is accessed word-wise or even block-wise, partially
reconfigurable FPGAs again are close to random reconfiguration.
16
2.2 Technical Aspects
4 input
look up
table D Flip
Flop
CLK
output
inputs
Figure 2.3: Schematic view of a configurable logic block of an FPGA
In general and independently from the configuration technology used, multiple cus-
tomization points make up a larger configurable element in all modern FPGAs. These
configurable logic blocks (CLBs)—sometimes also referred to as slices—consist of func-
tion generators like look-up-tables (LUT) and multiplexer (MUX), which are configured
by multiple bits. For example, a kinput and 1 output LUT can implement up to 22k
different functions relying on 2kSRAM locations.
The typical FPGA has a logic block with one or more 4-input LUT(s), optional D flip-
flops, and some form of fast carry logic. Figure 2.3 depicts the schematic core parts of
such a configurable logic block. The LUTs enable any function to be implemented,
providing generic logic. The flip-flops can be used for pipelining, registers, stateholding
functions for finite state machines, or any other situation where clocking is required.
Moreover, most modern cells provide fast carry logic as a special resource to speed up
carry-based computations, such as addition, parity, wide AND operations, and other
functions. A typical example of a modern configurable logic block—also termed slice by
Xilinx—including carry logic and other enhancements is depicted in Fig. 2.4.
To configure the routing on an FPGA, typically a passgate structure is employed.
Here, programming bits turn on a routing connection when the corresponding Bit is
configured with a true value, allowing a signal to flow from one wire to another, and will
disconnect these resources when the bit is set to false.
To finally discuss the third core part of an FPGA, programmable I/O basically means
that we can select the I/O to be either input, output or bidirectional. This choice is
usually provided by a circuitry that resembles the functionality of a tri-state.
Selected Additional Features
Partial reconfigurability can be increasingly found in modern FPGAs. Also described in
more detail below, partial reconfigurability basically denotes the possibility to change
parts of the current configuration of an FPGA without disturbing the other area not
under reconfiguration. A helpful nature to support partial reconfigurability are the
SRAM-style accessible configuration bits. Temporal processing thus becomes possible
in a very flexible way, however challenging to explore, as can be seen in the remainder
of this thesis.
Nowadays FPGAs not only provide a number of logical resources homogenously orga-
nized in an island style, they increasingly host multiple hard cores like multipliers, DSP
(digital signal processing) units, MAC (multiply accumulate) units, high-speed serial
I/O, distributed RAM, or even whole processors. Some of these hard core units are
17
2 Reconfigurable Computing
184 www.xilinx.com Virtex-4 User Guide
UG070 (v2.2) April 10, 2007
Chapter 5: Configurable Logic Blocks (CLBs)
R
Figure 5-2: Diagram of SLICEM
S0
01
CYMUXF
FXINB
FXINA
G4
G3
G2
G1
ALTDIG
SHIFTIN
ALTDIG
BY
DIG_MUX
BY
BYINV
BYINV
SLICEWE0USED
SLICEWE1USED
BY
SLICEWE1
F4
F3
F2
F1
WF 4USED
WF 3USED
WF 2USED
WF 1USED
BX
CE
CLK
SR
BX
BX_B BXINV
CE
CE_B CEINV
CLK
CLK_B CLKINV
SR
SR_B SRINV SRFFMUX
SHIFTOUTUSED
SHIFTOUT
CIN
FAND
SHIFTIN
SHIFTIN
ALTDIF
BX
DIF MUX DI
D
A4 LUT
RAM
ROM
A3
A2
A1
WF4 MC15
WF3
WF2
WF1
DUAL_PORT
SHIFT_REG
F2
PROD
CY0F BXCIN
CYINIT
F1
1
0
RESET TYPE
SYNC
ASYNC
-SYNC_ATTR
DFF Q
LATCH
INIT1
INIT0
SRHIGH
SRLOW
CE
CK
XUSED
XMUXUSED
F5USED
BUSED
DIGUSED
REVUSED
XBMUX
F5
XB
DIG
YQ
Y
YUSED
YMUXUSED
FXUSED
BYINVOUTUSED
BYOUTUSED
YBUSED YB
YMUX
FX
BYINVOUT
BYOUT
XMUX
X
XQ
FFX
FFY
DFF
YB
YMUX
Y
BY
DYMUX
Q
LATCH
INIT1
INIT0
SRHIGH
SRLOW
SR REV
CE
CK
1
0
S0
F5MUX
F5
FXOR
FXMUX
XORF
DXMUX
XB
XMUX
X
BX
WSG
WE1
WE0
WE
CK WSE
WSGEN
GAND
CY0G
1
0
G1
PROD
G2
BY
DUAL_PORT
SHIFT_REG
A4
A3
A2
A1
WG4
MC15
WG4
WG3
WG2
WG1
WS DI
G
F
LUT
RAM
ROM
D
1
0
COUT
COUTUSED
CYMUXG YBMUX
S0
01
1
0
0
1S0
FSMUX
FX
GXOR
GYMUX
XORG
UG070_5_02_071504
REV
SR
0
BX
From
Fabric
To/From Slice on Top
To/From Slice on Bottom
To Fabric
WG4USED
WG3USED
WG2USED
WG1USED
Figure 2.4: Slice of a Xilinx Virtex 4 FPGA
18
2.2 Technical Aspects
completely customizable, while others only facilitate a reconfiguration to a small extent.
Hard cores increase the performance of FPGAs and can open them to new application
fields bridging the gap between fine and coarse-grain devices.
To conclude the introduction of FPGAs, we summarize the advantages and disadvan-
tages of FPGAs in the following. As ASICs are the natural competitors of FPGAs, the
elaboration is dominated by the comparison of these two classes of computing media.
An excellent resources for additional characteristics of FPGAs provides [Tri94].
Advantages of FPGAs
The main advantages of FPGAs over ASICs are the radically reduced development cost
and turnaround time for implementing thousands of gates of logic. An FPGA design
is done with relatively low risk, as a design iteration due to a design error neither
exposes a large expense or a long delay. In particular, no expensive masks have to be
manufactured, as would be the case for ASIC design.
FPGAs are also very well tested. Their high volumes combined with their regularity
makes them easy exploitable for quality control. On top of that, we also can conduct
efficient verification of designs aiming for execution on FPGAs—designs immediately
serve as prototypes—which overall results in low testing costs. Also, as reconfiguration
is a cheap and time efficient task compared to the design and manufacturing times of
ASICs, iteration steps are relatively cheap.
Moreover, FPGAs are often available in the newest technology generations. For exam-
ple, the current Xilinx Virtex-5 series are manufactured in 65nm technology. Basically,
their homogeneous layout makes it easy to exploit the latest generation. A fact that can
be similar found in the area of memory elements (RAM), which usually are among the
first to change to new generations. By being in the front line of technology advances,
FPGAs can exploit the benefits that come with every new generation, e. g., lower power
consumption. The in-field customization also increases the life cycle advantages of FP-
GAs, as updates or in-field error correction becomes possible.
Comparing FPGAs to the other side of the scale—general purpose processors, DSPs,
etc.—FPGAs offer the possibility to deeply pipeline and parallelize a design not given on
von Neumann style alternatives. Their fine-grain customization and adaptability even
allows for boolean level adaptability, which is impossible for GPPs or DSPs.
Disadvantages of FPGAs
The most severe disadvantage of FPGAs compared to an ASIC implementation of the
same algorithm presumably is the size of a chip. As mentioned above, about 90 % of the
area is used for routing resources in modern FPGAs. Therefore, the size of a circuitry
implemented on an FPGA will never reach the area consumption of the same circuitry
on an ASIC. This characteristic, which is due to the universality of an FPGA, also
results in a reduced speed of circuitries when they are implemented on FPGAs.
The fine customization also yields another drawback. The large amount of reconfigu-
ration points basically hampers low-power design, although modern FPGAs are increas-
ingly designed towards optimizing the power consumption. Additionally, signal wiring
and logic placement is relatively arbitrary, exact calculation of the signal delays is com-
19
2 Reconfigurable Computing
plicate. FPGAs also come with a relatively large amount of pins, which hampers their
use for very small embedded systems that operate on reduced bit-width or even bit-serial
due to the costs of pinning or wiring.
While the heterogeneity of modern hybrid FPGAs is welcome concerning the perfor-
mance, however, it must be treated correctly—the heterogeneity must be included into
the design space exploration, etc. In general, the design methodology for FPGAs can be
criticized. As FPGAs allow for fast testing, designers tend to implement designs more
on a try-and-error basis than doing proper function testing, verification, etc.
Moreover, the volatility of FPGAs—which is true for the majority of modern devices
as these are nearly all SRAM programmable—requires additional components to per-
manently hold the configuration, such as EEPROMS, etc. Subsequently, they also suffer
a delay at start-up, waiting for the configuration to be loaded.
Nevertheless, much work has been done that proofs the benefit of fine grain granularity
and high adaptability of FPGAs, e. g. in [GV00, DEM+00, HLTP02, LH02, SWP04,
DBK03]. The possibility to partially reconfigure an FPGA discussed in many of these
works thereby seems to be very promising. Before detailing this sophisticated technique
for run-time reconfiguration, we first give an overview of programming techniques for
FPGAs in the next section.
2.2.4 Programming FPGAs
Having discussed the typical technical characteristics of modern FPGAs, which allow
such a device to execute arbitrary logic, we are also interested in how circuits for FPGAs
are designed and how these are loaded onto the devices. Basically, the eventual config-
uration of an FPGA is done by a bitstream, which contains the value for each single
SRAM configuration cell and is shifted bit or word-wise into the FPGA. The generation
of such a bitstream however is not as easy as compiling a software programm.
The initial purpose of FPGAs was to enable hardware prototyping or to serve as
(static) glue logic. Both domains target circuits that are used throughout the lifetime of
the product—similar to ASICs—which is why bitstream generation is close to the ASIC
front-end design flow. Creating such a design often requires costly optimization. As this
process is usually only done once, we may accept these costs. Using FPGAs however as
reconfigurable processing resources requires to frequently map different algorithms onto
FPGAs. A trade-off between costs and time would be required.
Following the ASIC design flow, circuits should be described in a hardware description
language (HDL) like Verilog, VHDL or by schematic entry tools. Then, by using special
CAD (computer aided design) tools the bitstreams are obtained and can be downloaded
to the FPGA. The time required to generate such a bitstream is in the range of min-
utes, hours, sometimes even days, heavily depending on the size and complexity of the
circuitry and the target device. On-line generation of bitstreams thus is hardly possible.
The accepted steps involved when generating bitstreams from HDLs are synthesis,
mapping,place & route, as well as the actual bitstream generation [Wan98]. These are
mainly automated nowadays. In addition to the core HDL files that are capable of
20
2.2 Technical Aspects
describing the behavior and structure of the circuit, we require the user to give physical
information of the system under development such as the timing characteristics of the
device, mapping of specific hardware, or the connection to the platform/board the FPGA
is mounted on (pinning). Such information often is collected in vendor specific files.
Due to the complexity and costs of the classical ASIC style design flow, several im-
provements have been proposed. For example, Xilinx released the PlanAhead suite,
which is a tool to facilitate a structured design flow of FPGAs that resembles the cur-
rent ideas of structured ASIC design. The focus of PlanAhead is to enable hierarchical
floorplanning by streamlining the design steps between synthesis and place & route.
PlanAhead offers rich functionality, however hardly raises the level of abstraction.
Concepts to completely move to a higher level of abstraction are also investigated.
These are mainly influenced by current high level languages (HLLs) or reasonable ex-
tensions of them. Motivated by easing the design, the PRISM compiler was the first
to investigate hardware compilation of sequential C kernels [AS93]. Nowadays most
of the current approaches still rely on C-like languages like Stream-C [GSAK00], Han-
delC [BW06], or ImpulseC [PT05]. SystemVerilog in contrast enriches standard Ver-
ilog with features from object oriented design [SMR+04]. The major benefit of these
approaches relying on HLLs is that they facilitate immediate simulation and can also
include software parts. Often, the mapping onto reconfigurable devices takes part within
a hardware/software codesign environment, e. g. [STB06].
However, when HLLs are used, the quality of compiler-generated circuits still lacks
the equivalent manual designs, often by factor of 2–4 in area [Bob07]. Some of these lan-
guages even make an immediate synthesis to hardware impossible, requiring a complete
new implementation of the final hardware part in the worst case. The low performance
of these C-like languages thereby stems from their closeness to iterative programming,
which originally targets the instruction code execution of von Neumann machines.
Downloading/Programming
Eventually generated, the bitstreams can be loaded via configuration access ports into
the configuration cells (SRAM, Flash, antifuse, etc.) of an FPGA. In addition to the
JTAG port, which originally was designed for debugging, SelectMap (8/32-Bit parallel)
and Slave Serial (Single Bit Serial) exist for this purpose. The parallelism of SelectMap
(8 or even 32 bits) speeds up the configuration process. Modern FPGAs like the Xilinx
Virtex devices also offer self-reconfiguration through a so called ICAP (internal config-
uration access port). The ICAP is connected on chip to the SelectMap port, sharing
the configuration resources. Finally, some of the ports also enable a readback of the
configuration for debugging or re-engineering purposes.
For volatile FPGAs, the bitstreams are often stored in non-volatile devices next to an
FPGA and loaded on start-up. To overcome this drawback of an additional chip and area
on the board required, in the meantime, some devices exist that offer in-built non-volatile
storage such as the Xilinx Spartan 3AN device that comes with the reconfigurable fabric
and a flash combined into a single package.
21
2 Reconfigurable Computing
Frequency of Reconfiguration
To basically classify the frequency of reconfiguration for further use in this work, we
follow the definitions given in [Bob07]. For a non-frequent reconfiguration, where an
FPGA becomes a static device and usually serves as ASIC replacement or for rapid
prototyping, we use the term compile-time reconfiguration. Here, the loading of the bit-
stream is done before the FPGA is in use, holding the configuration during the operation
time of the application. However, if FPGAs shall adapt their behavior multiple times
during their execution—changing their configuration more frequently—we refer to their
reconfiguration as run-time reconfiguration. In both cases partial reconfiguration would
be technically possible. However, for a rarely occurring compile-time reconfiguration,
the additional overhead to generate a partial bitstream often does not make sense.
Concluding the state-of-the-art programming of reconfigurable devices, true high-
performance techniques to design and eventually download circuits for FPGAs are hardly
available. This fact holds for both reconfiguration techniques—compile time and run-
time. The latter more sophisticated and also more challenging reconfiguration technique,
which however eventually allows us to exploit the entire capabilities of reconfigurable
fabrics—processing in space and time—is discussed next.
2.2.5 Run-Time Reconfiguration
Basically, by run-time reconfiguration the configuration of a reconfigurable fabric is
altered during its operating time, in the best case without halting the current processing.
The immediate benefits are the reduction of the hardware resources: timesharing the
available hardware and also hardware specialization. Moreover, in-field customization
or power reduction by holding only those circuits that currently are required becomes
possible. Section 2.3 details application fields for run-time reconfiguration.
While benefits of the possibility to reconfigure FPGAs during run-time quickly become
obvious, the question of the cost for reconfiguring a device during run-time arises. Along
with several design challenges discussed all over this thesis, also technical constraints
exist that must be considered for achieving a benefit. Among them is the configuration
overhead discussed next.
Configuration Overhead
The sheer size of the bitstream offers a challenge for run-time reconfiguration, as bit-
streams are usually shifted in small units or even bit-wise into the reconfigurable device
consuming a non-negligible time period. Naturally, the question how to handle this
notable reconfiguration latency arises. Different concepts have been proposed.
A very promising idea is to overlap reconfiguration and processing phases. For example
in a coupled system, while the processor is active, we can reconfigure the substrate in
the meantime. If we can load a new configuration as soon as possible and prior to its
execution, we increase the chance to finish exactly at or even before the algorithm under
processing requires this configuration. Then, we have achieved a so-called configuration
prefetching [LH02, GV00].
22
2.2 Technical Aspects
A comprehensive example is given in [Hau98], where the author focuses on reconfig-
urable co-processors. In order to load configurations in advance, special instructions are
added to the code. Thereby, the maximum number of saved cycles is limited by the
structure of the code. Moreover, the compiler re-arrangement of operations must be
considered to avoid collision with the already inserted pre-fetch operations.
In addition, complete systems like PipeRench [CWG+98] have been designed that
inherently target the reconfiguration overhead by proposing an architecture that allows
reconfiguration and processing at the same time. As such pipeline reconfigurable en-
vironments require dedicated design processes and their specific architecture may even
exclude some applications from practical execution on them, they are little present.
Mainly academically hosted attempts also propose multi context reconfigurable de-
vices, e. g. [DeH96a, TCJW97]. Multi context reconfigurable devices include multiple
memory bits for each programming bit location—thought off as multiple planes of con-
figuration information. These devices can perform a switch between configurations in
few clock cycles. Their inherent architecture makes them advantageous to reconfigura-
tion time hiding, while the additional hardware required inevitably increases the costs.
Most likely therefore, they are not commercially available.
Another approach is to target the number of reconfigurations required and reduce
them as much as possible either by configuration caching or by avoiding reconfiguration
at all based on decisions on higher abstraction levels. Some approaches also discussing
the drawbacks can be found in [DST99, LCH00].
Moreover, a combination of the methods obviously increases the benefit gained. For
example, the combination of configuration caching and configuration pre-fetching is pro-
posed in [LH02]. Swapping configurations with respect to caching possibilities is dis-
cussed in [SNG01].
As the reconfiguration time can be often considered as proportional to the area under
reconfiguration, there also exist concepts that target at minimizing the configuration
data. These concepts often compress the bitstream, some of them already at the time of
generating the bitstreams like the combitgen project [CMS06], which however is located
in the realm of partial reconfiguration (see below). In the combitgen approach, we
reduce the size of the bitstream by only configuring those bits that change from the
current to the next configuration. Such a difference based reconfiguration particularly
is of success if the amount of bits can be reduced significantly. As shown in [RM07], we
can also influence the size of (partial) bitstreams at earlier stages—during the synthesis,
etc. Circuits that shall share the same area therefore are generated in parallel, so that
similar constructs can be mapped on the same resources, avoiding a reconfiguration of
these resources.
Partial Reconfiguration
Partial reconfiguration provides another and extraordinary powerful method to ame-
liorate large configuration times for FPGAs. The important characteristic of partial
reconfiguration summarized, it is a technique to change only parts of a reconfigurable
fabric, requiring only a fraction of the complete bitstream, while also leaving other
regions untouched—these regions can remain fully active.
23
2 Reconfigurable Computing
By using partial reconfiguration, designers can dramatically increase the functionality
of a single FPGA, allowing a system to be implemented with fewer and smaller devices
than otherwise required. The idea of partially reconfiguring FPGAs can already be found
in some works of the mid 1990s, e. g. [LD93]. One of the first comprehensive academical
attempts at partial reconfiguration then was the DISC project [WH95], which offered
demand-driven modification of instruction sets. During that time, only the FPGAs
produced by Algotronix and Actel enabled partial reconfiguration.
Xilinx then took over Algotronix and launched the XC6200 series [Wan98]. These
devices offered a RAM-style configuration interface, where the CLBs could be accessed
in a matrix-like style. However, the device had little commercial success and was quickly
abandoned. Nevertheless, it inspired a lot of research work, e. g. [CCKH00].
Today, the research on partial reconfiguration concentrates on the Xilinx platforms.
Currently, Xilinx Spartan, Virtex-II, Virtex-II Pro, Virtex-4, and Virtex-5 FPGAs sup-
port partial reconfiguration. Additionally, Atmels FPSLIC series offers partial reconfig-
urability, however the capacity of these devices is limited. Therefore, Atmel fabrics are
seldom the devices of choice [Atm02].
Although fascinating and increasingly supported by modern FPGAs, partial reconfig-
uration still is challenging to be explored and complicate to be implemented. For exam-
ple, partial reconfiguration can result into fragmentation on the reconfigurable substrate.
Relocation and defragmentation can therefore become necessary [CCKH00, CLC+02],
and must be integrated in the above discussed methods like configuration prefetching,
etc., which otherwise harmonize perfectly with partial reconfiguration [LH02].
Task reallocation can also increase the flexibility of a reconfigurable fabric by freeing
area for dynamically arriving tasks [KKP05a]. Some works also consider run-time de-
fragmentation algorithms for heterogeneous devices [KKP06]. However, the relocation
of tasks on FPGAs is complex and cost intensive, in the worst case a complete readback
of the current status of the logic cells of the reconfigurable substrate might be necessary.
Some works therefore also consider preventative defragmentation [KKP05b] using best
fit methods to allocate area for dynamically arriving tasks.
Run-time Routing
Besides the challenge to allocate dynamically arriving tasks, the communication of these
tasks with the rest of the system must be considered. In an early example, we already
find a discussion on how to perform such run-time routing of data between blocks of
circuity on reconfigurable systems [BD98]. The authors present three models to evaluate
trade-offs between flexibility, speed and cell count. Depending on the used fabrics,
their routing supports highly parallel or intrinsically serialized communication. Similar
research is still done nowadays [PAK+07].
Thereby, also bus networks or even Networks on Chip (NoC) [NTI04] are investigated
as solutions for run-time routing. For example [ESS+96] introduces a reconfigurable
multiple bus network, which is intended to support circuit switching as means of com-
munication between processing elements. The model comprises nprocessing elements
and ksegmented busses. Several switches are used to set the connection at run-time—
making up a modular communication infrastructure on a reconfigurable device.
24
2.2 Technical Aspects
Basically, NoCs heavily have influenced multiple research projects on run-time rout-
ing [BAM+05, MAV+02]. The general idea is appealing: Processing elements are con-
nected to network elements (routers), achieving a packet-based communication. We
avoid the bottleneck of a simple bus, because the routing resources are shared by all
connected blocks. However, NoCs consume significant area on an FPGA.
There also exist scenarios, when buses or NoCs are not applicable, for example if inter-
module communication has to be guaranteed in hard real time [GSC04] or the amount
of data to be communicated is too large for bus or NoC based systems (e. g. in the area
of video/audio data streaming). Direct communication must be used then [DH05].
Self Reconfiguration
Self reconfiguration entitles an approach of reconfiguration in a self-managed fashion
with in-situ reconfiguration. A necessary requirement is the access of the reconfiguration
port of a device from the device itself. The Xilinx ICAP (Internal Configuration Access
Port) facilitates this requirement and allows us to achieve self reconfiguration for Virtex
devices. As already mentioned above, concerning the physical implementation, the ICAP
port connects to the SelectMap configuration port.
Self reconfiguration can become a desirable feature that can increase the performance
of a reconfigurable system, see [BJRK+03] among others. Basically, self reconfiguration
means the triggering of the reconfiguration process from within the device itself. Obvi-
ously, it is closely connected to partial reconfiguration, as the controlling instance should
remain static and active during the reconfiguration process.
Challenges of (Partial) Reconfiguration
Nevertheless, all approaches have to cope with the constraint of a single reconfiguration
port as limiting factor that will hardly change and been overcome in the future. In
particular, the single reconfiguration port means that reconfigurations must take place
mutually exclusive. Moreover, another constraint exists concerning the physical access
to the reconfigurable substrate. In order to keep the number of pins required for the
reconfiguration hardware low, the reconfiguration—submitting of the bitstream—takes
place word wise, byte wise or even bit serially, consuming much time for transmission
as modern bitstreams may sum up to 1 MByte or even above.
When considering the subject of reconfigurable computing from an abstract point,
these challenges however are obvious and harmonize with the general fact that FPGAs
are designed for fast processing and not for fast reconfiguration. Considering the history
of FPGAs as ASIC replacements and the general idea of increasing the performance—
mainly the computational performance—this focus of reconfigurable devices on fast pro-
cessing and not on fast reconfiguration completely makes sense.
This situation of why we have and presumably will always have a reconfiguration
overhead further supports the fact that reconfiguring an FPGA during run-time is not
trivial on many aspects, particularly concerning the design process of reconfigurable sys-
tems themselves and the mapping of applications onto such adaptable devices in detail.
All the issues mentioned—reconfiguration overhead, operation during reconfiguration,
etc.—must be considered comprehensively.
25
2 Reconfigurable Computing
8www.xilinx.com XAPP290 (v1.1) November 25, 2003
1-800-255-7778
Two Flows for Partial Reconfiguration: Module Based or Difference Based
R
The bus macro must be physically locked in such a way as to straddle the boundary line
between A and B, and it must be locked in exactly the same position for all compilations. The
process of locking the bus macros to proper locations is described in the Initial Budgeting
Phase Details section. The bus macro can be wired so that signals can go in either direction
(left-to-right or right-to-left). It is strongly recommended that once direction is defined, it should
not change for that particular FPGA design, i.e., bus macro signals should neither be
bidirectional nor reconfigurable.
The number of bus macro communication channels is limited by the number of horizontal
longline routing resources available in each CLB tile.
Implementation Using Modular Design
As defined by the modular design flow, the partial reconfiguration implementation process is
broken down into three main phases:
1. Initial Budgeting Phase - Creating the floorplan and constraints for the overall design.
2. Active Module Phase - Implementing each module through the place and route process.
3. Final Assembly Phase - Assembling individual modules together into a complete design.
Initial Budgeting Phase Details
Initial budgeting operations should be done in the top or initial folder of the recommended
project directory structure.
The initial budgeting phase has the following main steps:
1. The floorplanning of module areas:
a. Have a four-slice minimum width.
b. Have a set width that is always a multiple of four slices (e.g., 4, 8, 12, …)
c. Are always the full height of the device.
d. Are always placed on an even four-slice boundary.
e. Attach partial reconfiguration flow-specific properties to the area groups in the .ucf file
(See Appendix A).
Figure 5:
Physical Implementation of Bus Macro
CENTER
(Boundary) between B and C
X290_05_032502
RI [3:0]
RO [3:0]
RT [3:0]
LI [3:0]
LO [3:0]
LT [3:0]
Figure 2.5: Busmacro of the Xilinx Application Note 290
Xilinx Partial Bitstream Generation
As influencing the methods of this work, we also discuss the current state-of-the-art
on partial bitstream generation for Xilinx FPGAs. Such partial bitstreams basically
are downloaded to the FPGA using the same procedure as for complete bitstreams.
However, to assign the configuration information to the intended area, the configuration
port (Select Map, JTAG, etc.) evaluates the header information of the bitstreams, which
states beginning and length, always denoting a consecutive area. To generate such
special bitstreams, an appropriate design method is required.
In general, the new configuration information of a dynamically loaded bitstream (par-
tial or complete) is loaded glitch-free and overrides the current condition of the SRAM
cells. Given this characteristic of the Xilinx FPGAs, basically two possibilities arise to
dynamically alter the configuration of an executing circuitry on an FPGA. Firstly, we
can adapt the whole configuration of an FPGA and merge it into a new one by calculat-
ing the difference of the two configurations and if possible only configure the different
SRAM cells. This way is called the difference based reconfiguration. Secondly, we can
determine special regions of the FPGA as being partially reconfigurable. Such a design
flow is termed module based reconfiguration.
The Xilinx Application Note 290 [Xil04] was the first document given by Xilinx to
achieve difference-based as well as modular partial reconfiguration. For the generation
of partial bitstreams, it mainly follows the design flow for a modular design, where a
project leader partitions the area of an FPGA into smaller regions, defines the com-
munication between the regions and assigns each region to a specific design team. The
tasks to be designed in the regions add up to the whole functionality of the system under
development and therefore must be integrated afterwards.
For the communication between the regions, specific static communication resources
were defined. These so-called busmacros serve as anchor for the dynamically loaded
modules. They are inevitable as the non-existence of static communication and con-
nection points would eliminate the guarantee of a proper communication. The design
flow XAPP 290 therefore includes tri-state based busmacros, see Fig. 2.5. These are
hard-coded macros that use tri-states and specific lines of the Xilinx Virtex-II series.
26
2.2 Technical Aspects
40 www.xilinx.com Early Access Partial Reconfiguration User Guide
UG208 (v1.1) March 6, 2006
Chapter 3: Bus Macros R
Synchronous versus Asynchronous Bus Macros
In addition to being unidirectional left-to-right or unidirectional right-to-left, bus macros
can be either synchronous or asynchronous. Synchronous bus macros provide superior
timing performance and are recommended for designs that can accommodate the added
latency.
Wide versus Narrow Bus Macros
Both wide and narrow bus macros are available (these terms refer to the physical width of
the bus macro, not the data bandwidth). Narrow bus macros are two CLBs wide
(Figure 3-3), and wide bus macros are four CLBs wide (Figure 3-4).
Figure 3-3: Narrow Right-to-Left Bus Macro
Figure 3-4: Wide Right-to-Left Bus Macro
UG208_03_03_091405
UG208_03_04_091405
Figure 2.6: Narrow busmacro of the Xilinx early access design flow
Concerning the geometrical requirements in XAPP 290, reconfigurable regions must
always span the whole height of an FPGAs, as XAPP 290 allows only column-wise
reconfiguration. Several drawbacks arise, among them the difficulty to route signals
crossing a reconfigurable region and the inclusion of IO pads into the partial bitstream
of the column under reconfiguration. Yet, the column-wise reconfiguration is in line with
the capabilities of the Xilinx FPGAs available at that time.
Along with the launching of the Virtex-4 device, Xilinx came up with a new flow to
generate partial bitstreams. This so-called early access partial design flow is the up-to-
date approach to perform the generation of partial bitstreams [LBM+06]. Basically, it
follows the same idea as presented for the modular based reconfiguration of the Xilinx
application note 290. However, several improvements have been added. Most important,
the restriction of only column-wise partial reconfiguration has been dropped. Here, the
architecture of the Virtex-4 FPGAs is of particular benefit, as tiles on top of each other
make up the layout of the FPGA—similar to the standard cell layout of ASICs.
Furthermore, lookup table based busmacros have been introduced, see Fig. 2.6. This
step became inevitable, as the Xilinx Virtex-4 do not offer tri-states within their re-
configurable substrate any more. Lookup table based busmacros also are more flexible
and enable horizontal and vertical connections. While the hitherto presented busmacros
needed to be placed on a boarder of the region rectangle, the latest ones can be placed
anywhere within the region [Jac07]. These so-called single slice macros also eliminate the
need to define a direction for the busmacros. However, they are currently only available
for the devices of the Xilinx Virtex-5 series.
Partly in close cooperation to Xilinx, several works have been and still are carried out
to improve the routing and communication issues of partially reconfigurable designs. For
example, several very sophisticated concepts are discussed in the PhD thesis of H¨
ub-
ner [H¨
ub07]. The author investigates the communication capabilities of Xilinx FPGAs
towards on-demand routing exploiting multiple physical characteristics of the devices.
Presumably, he was also the first to propose LUT based busmacros [HBB04].
27
2 Reconfigurable Computing
Co-
processor attached
processing
unit
standalone
processing
unit
FU
Memory
Cache
CPU I/O
inter-
face
Figure 2.7: Coupling of reconfigurable devices
Both solutions presented above for the generation of partial bitstreams offer a sound
theory, however their practical implementation quickly becomes cumbersome and time-
consuming. A discussion on the main aspects of partial reconfiguration on Xilinx Virtex
FPGAs can also be found in [BBHN04] and [BAR+05]. The authors discuss challenges
like the signal integrity, global logic and inter-module communication. Also Chap. 3
discusses the challenges in some more detail.
Moreover, the PlanAhead tool of Xilinx has recently been enriched with the possibility
to generate partial bitstreams [DSG05]. However, only the latest version (9.2) [Jac07]
gradually enables to adapt to specific user needs, while the former version enforced to
strictly follow the early access design flow as commands are invoked automatically in
the background. In particular, for each different combination of reconfigurable modules,
a single project had to be established.
2.2.6 Coupling
When using reconfigurable fabrics as processing devices, the question arises how they
are integrated in the overall system, and in particular if necessary how a run-time recon-
figuration is triggered. Most often, the substrate therefore is coupled with a controlling
unit—implying that at least two devices become part of the system. However, recon-
figurable systems can also comprise of stand-alone devices, for example as given by the
ml310 board of Xilinx or several of the processing platforms of Celoxica Inc. In such
systems, an internal control unit such as the hard core PowerPC found in Virtex-II Pro
FPGAs of the ml310 board or user generated logic of the Celoxica boards is used for
software parts like the operating system [Dan06], among others.
The interaction of the reconfigurable substrate with a controlling/supervising unit can
range from loosely coupled to closely coupled. Figure 2.7 depicts four different coupling
techniques as described in [CH02]. A similar definition can be found in [KKS04], addi-
tional referring to the levels of the Y-chart [GK83]. There, the loosest coupling comes
at the system level and is characterized by slow transfer rates. At the architecture level,
the reconfigurable devices act more like a coprocessor. Finally, the micro-architecture
28
2.3 Fields of Application
level resembles the tightest coupling. Here, the fabric becomes a functional unit within
the data path. A similar yet more detailed discussion is given in [Kal04].
The sort of coupling chosen may also affect the times of data transfer from and to the
reconfigurable fabric, in particular if the data is transmitted via the supervising unit.
If the reconfigurable device has immediate access to its data, however the coupling also
may become negligible. As shown by the ml310 board discussed above, an FPGA may
smoothly represent the sole processing device of a system.
When run-time reconfiguration is used, the coupling also may determine the appropri-
ate frequency of reconfiguration. Loosely coupled devices may accept a longer reconfigu-
ration phase than closely coupled devices. However, it is hardly possible to agree on the
best coupling technique for run-time reconfiguration. Arguably, coupling and run-time
reconfiguration may also be seen orthogonal to each other, as run-time reconfiguration
can yield benefits for every coupling method.
Moreover coupling still is evolving. For example, Intel recently has opened the specifi-
cation of their Front Side Bus (FSB). Xilinx took over this opportunity and implemented
an Intel FSB-FPGA accelerator module that features an FSB-capable Virtex-5 FPGA
module as a plug-in to an Intel Xeon CPU socket. Xilinx already demonstrated to
achieve bus speeds of 800 MHz capable of supporting 6.4 Gbytes/s data transfers.
2.3 Fields of Application
After discussing the technical aspects of reconfigurable systems—in particular FPGAs—
this section focuses on the past, current, and future fields of application. Basically,
the majority of these fields still ignores the capabilities of partial or even run-time
reconfiguration. When it thus comes to examples of reconfigurable computing, the
greater number of the applications rely on compile time reconfiguration only. However,
many application areas can be extended to also benefit from run-time reconfiguration.
In this section, we thus do not separate the fields of application into their reconfiguration
frequency. We rather sketch the general application areas of reconfigurable computing
devices according to their characteristic of combining benefits from GPPs and ASICs,
and detail them by considering compile time as well as run-time reconfiguration based
options.
2.3.1 ASIC Design
A quiet old application field of FPGAs is the domain of ASIC prototyping, where FPGAs
are used for the so-called rapid prototyping [KPC94]. The idea is to emulate a hardware
design and test it towards its correctness. Rapid prototyping allows a device to be
tested before the final production, particularly errors can be corrected before the design
is manufactured. The fine granularity of FPGAs makes them the best choice for rapid
prototyping, with coarse grain devices hardly applicable to the task.
As shown in [Bob03] and other works, the reconfigurability of FPGAs can also be
exploited for rapid prototyping. By performing a so-called temporal partitioning of the
29
2 Reconfigurable Computing
circuit under development, we can emulate parts of the overall design in sequence on
the same FPGA. We can thereby renounce the acquisition of an otherwise costly system
hosting a larger FPGA or multiple FPGAs. Rapid prototyping thus can be extended for
run-time reconfiguration. It helps to save FPGA resources if the ASIC to be simulated
is too large to fit on one device.
Furthermore, if (at least partially) asynchronous circuits have to be prototyped, timing
becomes a first class issue. Janzen and Rammig therefore reanimated Rammig’s early
work and did show how the timing behavior can be prototyped using today’s FPGA
technology [JR97a, JR97b].
2.3.2 Replacement of Dedicated Circuits
As already pointed out in Sect. 2.2.3, reconfigurable devices—particularly FPGAs—come
with multiple benefits that make them superior to ASICs in many cases. Therefore, FP-
GAs are used as ASIC replacements in niches since their beginnings, particularly as
glue logic (see below). Nowadays increasingly and mainly because ASIC fabrication
costs have grown enormously using technologies below 100nm, FPGAs have matured to
become complete ASIC replacements, sometimes even if high volumes of a design are re-
quired. Obviously, there always is a trade off—such as the higher energy consumption—if
reconfigurable devices are used instead of ASICs. We therefore need a comprehensive
investigation of all relevant constraints for each specific application.
Domains
Using a reconfigurable device as communication bridge between two or more devices
is often called glue logic. The reconfigurable substrate then holds the custom elec-
tronic circuitry needed to achieve compatible interfaces between the different off-the-
shelf integrated circuits. Thanks to their adaptability, reconfigurable devices—FPGAs
preferred—perfectly serve the requirements of such an interface [Wan98]. Depending on
the characteristics of the incompatible devices, also coarse grain devices can be used.
CPLDs, which have been the former dominators of the field, are little employed nowa-
days as they fail to offer enough adaptability for the increasingly complex protocols.
Particularly if circuits are added dynamically to a system, the (run-time) adaptability
of reconfigurable devices proves to be of benefit. By virtue of reconfiguration, we can
offer a customized interface for the new circuit. Such a behavior is of particular interest in
the domain of IP (intellectual property) design for Systems on a Chip or suchlike. Ihmor
introduces in [Ihm06] a concept for interface synthesis focusing on the communication
gap of different IP cores. This gap must be particularly considered if dynamic plugging
of IP cores is of interest. How partial reconfiguration capabilities help to improve the
behavior of such an interface synthesis approach is shown in [ID05].
When reconfigurable devices are used as low-volume alternatives to ASICs, they usu-
ally serve markets where fast processing is essential. Among these are domains where
extremely few volumes of a circuit are required, such as space applications, which easily
may be unique designs. Moreover, the domain of digital signal processing, image process-
ing, network security, cryptography or bioinformatics applications among others [GG05]
30
2.3 Fields of Application
may require fast processing devices to solve the tasks in a reasonable time. FPGAs have
also been proven valuable for the implementation of neural networks [EH94].
In all these domains, where reconfigurable devices replace ASICs, run-time reconfig-
uration allows us to perform in-field upgrades. Newer versions of algorithms may be
loaded into the device, which from there on may increase the performance, reduce the
power consumption, etc. Particularly correction of bugs through run-time reconfigura-
tion can be of high interest. Furthermore, such an in-field customization may also take
place remotely if access to the configuration ports is provided.
Another very interesting application area for run-time reconfiguration can be found in
domains formerly dominated by antifuse-based FPGAs. Such FPGAs, whose configura-
tion is not volatile, are usually found in radiation critical environments such as space.
To overcome the drawback of one-time reconfigurability of antifuse-based FPGAs, the
group of Kebschull proposes to use permanently configuration refresh techniques [Keb06],
which make SRAM-based FPGAs radiation tolerant. Partial reconfiguration is thus used
to ensure determined behavior of the device by constantly reloading the configurations.
Concerning the radiation tolerance, this method facilitates to resemble an equivalent
quality as antifuse-based FPGA, however including the benefit of in-field adaptability.
To complete the idea of replacing dedicated circuits, we also have to mention that
the direct moving from applications formerly executed in GPPs to FPGAs is possible.
As reconfigurable devices are in several aspects superior to GPPs—energy consumption,
etc.—reconfigurable fabrics might also replace GPPs by providing the same functionality.
If the algorithm then is fixed, the reconfigurability is of no further concern. At the end
of the day, the FPGA again simply takes over the job of an ASIC, combining all the
benefits—reduced design costs at the front.
2.3.3 Adaptive Processing
We refer to adaptive processing if the adaptability—run-time reconfigurability—of re-
configurable devices is of primary concern. Here, reconfigurable devices are an efficient
alternative for GPPs, as pure software-based solutions often would require unacceptable
effort (power, etc.) to provide the same performance. Adaptability of the processing
device thereby is required to react on changing parameters of the environments, etc.
Adaptive processing still is in its infancy. Besides a missing of sound technical solutions
how reconfigurable devices could seamlessly replace GPPs, also the purely supported
non-functional requirements of portability and programmability hamper the widespread
use. Nevertheless, there are some approaches both in industry and academia, which
provide solutions that show first promising results.
In general, we can divide adaptive processing in three categories following the ideas
of [VS07]. In algorithmic reconfiguration, we hold the functionality, however adapt the
performance, accuracy, power, or resource requirements. Reasons are a change of the
environment. Functional reconfiguration means the execution of different functions in
hardware, however on a shared resource. The immediate focus of functional reconfig-
uration within the idea of adaptive processing thus is time-sharing of resources. Also
orthogonal to the previous two types of reconfiguration, architectural reconfiguration
31
2 Reconfigurable Computing
means the reallocation of resources to computations. The need often arises when re-
sources become unavailable due to an algorithmic or functional reconfiguration. Other
reasons are a shut down, high priority interruption, fault tolerance, etc. In general, a
clear distinction of the three types is difficult and arguable mixtures exist (see below).
In general, while ASICs mainly focus data stream oriented applications, reconfigurable
devices are also open for applications that require control information. By virtue of run-
time reconfiguration, the data processing can be changed; using dynamic and partial
reconfiguration, this adaptation can also occur online.
Algorithmic Reconfiguration
If the circuit can be reconfigured during run-time there is no need to design the circuit
such that it can operate on all possible inputs. Instead, we can reconfigure the circuit
by the optimum configuration based on the run-time data and gain higher speed and
performance. For example in the area of cryptography, where keys must be hard-coded
improving performance, the algorithmic adaptability allows for changing the key over
time. Another example for employment of adaptable hardware is image processing and
feature extraction in computer vision [GCL02]. Here, adaptability may be used to modify
filters or to specify on shapes to be extracted, see also [Bob07].
Thereby, we basically facilitate to offer specific circuits rather than generic circuits,
which may result in a performance increase by using optimum circuits. For example,
we can use constant multipliers instead of generic multipliers, which improve area con-
sumption and response time [DKR03].
The idea of an adaptable pattern matching—searching for characters being part of a
larger block of data or stream—where the search pattern is dynamically provided as ded-
icated circuit on the reconfigurable substrate, can be of benefit for multiple application
domains like bio computing, genome searching, etc. [GG05].
We also find some works in the domain of adaptive controllers, for example investi-
gating run-time exchange of mechatronic controllers [DBK03]. Here, reconfiguration is
directly used to react on changing requirements of the environment.
Another example is the domain of distributed arithmetic—a bit level rearrangement
of a multiply accumulate to hide the multiplications. Distributed arithmetic allows
for reducing the size of hardware. Again, run-time reconfiguration helps to target the
application even more exact [DB04, Dan04].
Finally, the presumably most recent application field for algorithmic reconfiguration
is the domain of software defined radio [Mit95]. Mainly for military usage and therefore
hardly available in the literature, run-time reconfigurability is used to adapt to dynamic
requirements of communication.
Functional Reconfiguration
Functional reconfiguration denotes the idea of hardware virtualization in the sense of
sharing the same hardware for different applications (functions). The reconfigurable
substrate thereby changes its behavior. We thus rely on run-time reconfiguration.
A promising example was recently proposed in the automotive domain [BHH+07]. To
reduce costs, functionality formerly provided each in single ECUs (electronic control
32
2.3 Fields of Application
unit), is not only merged into one ECU—as already is a current trend—but loaded
dynamically into the reconfigurable processing resource of this ECU.
In System on a Chip (SoC) devices, run-time reconfiguration is also used to dynam-
ically swap applications into a reconfigurable substrate [GRP06]. Additionally, FPGAs
themselves are equipped with resources such as CPUs to provide complete SoCs. The
research platform then often is classified as Reconfigurable-System-on-a-Chip, where the
reconfigurable substrate plays an integrated role.
In the domain of high performance computing, reconfigurable fabrics are used as hard-
ware accelerators for compute intensive functions. Therefore, they become additional
resources in nodes of computer clusters. For example on the Cray XD1, [SH07] show ap-
plication specific acceleration by FPGAs, where the reconfigurable hardware is employed
for particular operations in the finite fields for Reed/Salomon coding. Furthermore, the
BEE project introduced in Sect. 2.1 offers a strong platform to target high-end reconfig-
urable computing. Besides the already mentioned RAMP project, BEE2 for example is
also used as the heart of a high-bandwidth spectrum analyzer used by the Deep Space
Network [Waw07].
In general, to increase programmability and portability in the domain of high perfor-
mance reconfigurable computing, the non-profit consortium OpenFPGA has emerged.
With varying activity, the organization aims to foster and accelerate the incorporation
of reconfigurable computing technology in high-performance and enterprise applications.
Architectural Reconfiguration
Pure architectural reconfiguration would mean that we neither change the algorithmic
characteristics nor the executing function itself currently residing on a reconfigurable
substrate. According to [VS07], some examples are a shut down, high priority interrupt,
fault tolerance, etc. However, architectural reconfiguration can also be seen as a require-
ment to fulfill algorithmic or functional reconfiguration. Examples of such a mixture are
discussed below after a brief focus on pure architectural reconfiguration.
Fault tolerance is a well researched application field for architectural reconfiguration.
In contrast to the above mentioned idea where a device is made radiation tolerant and
thereby un-vulnerable to temporary failures, permanent failures are targeted. If some of
the resources show faulty behavior, we can switch to pre-implemented alternative paths
[dLKNH+04, KCR06], or apply partial reconfiguration capabilities [ESSA00].
In our opinion, architectural reconfiguration also covers the idea of hardware virtual-
ization. In contrast to functional reconfiguration, the algorithms themselves are divided
into partitions that must be executed completely to derive the result of the computation
[Bob03]. We can thus implement larger systems than physical hardware available.
Mixture
By increasing the level of abstraction, as also done within this work, a clear diversification
into the above presented reconfiguration types becomes difficult. Often, higher-level
design methods rely on two or more of the types. Some examples are presented below.
Basically, we can reduce power consumption through reconfiguration. When power
optimization already is considered on a high level of abstraction [Ret07], we can define
33
2 Reconfigurable Computing
parts of an algorithm that can be gated, re-arranged, etc. under the constraint of opti-
mizing power. By virtue of run-time reconfiguration, we can select different and more
power-optimal path variations or eliminate those parts of the design that are not needed
according to the current input. We can also move tasks onto other resources that are
more energy-efficient.
A recent example of high-level exploration of partial run-time reconfiguration capa-
bilities in the automotive domain is given in [CZMS07]. The authors describe how to
use partial-run-time reconfigurable hardware to accelerate video processing in driver
assistance systems. They combine functional with algorithmic reconfiguration by an
integrated exchange of whole pixel-level functions to adapt the behavior of the overall
system, reacting on environmental conditions.
Also completely new areas are influenced by run-time reconfiguration capabilities.
For example, wearable computing [PEW+03], where the use of FPGAs is motivated by
the better energy-efficiency of FPGAs compared to GPPs, providing enough flexibility
to react on the environment. Another example is the domain of organic computing.
Several approaches rely on the reconfigurability of FPGAs, e. g. [GP06]. Moreover, the
idea of self-reconfigurability is part of the self-x paradigms of organic computing. Self-
reconfiguration thereby is seen independent of functional, algorithmic, or architectural
reconfiguration. Often, reconfiguration shall serve as technique for emergence—a con-
troversial way how complex systems and patterns arise out of a multiplicity of relatively
simple interactions.
2.4 Design Approaches for Reconfigurable Systems
As could be seen in the previous section, reconfigurability can be of benefit in various
fields. However, to explore these benefits, appropriate design methods are required,
which must go beyond classical methods as reconfigurable computing means the adap-
tation of the hardware resources over time, something not covered by classical methods.
A simple mapping of existing design approaches for computing systems to reconfigurable
systems therefore may struggle to yield optimal solutions.
In the literature, we find multiple proposals how to facilitate the design of reconfig-
urable systems. In this section, we discuss the most promising current approaches in
the scene. As the research area is quiet active, we therefore briefly overview the most
important ones, also by sorting and categorizing them. Worthwhile to mention also is
the funding of the research area, as it emphasizes the interest in the topic. All the well-
known organizations such as the DARPA in the US, the EU, or the DFG in Germany
are funding or have funded research programs in the domain of reconfigurable systems.
2.4.1 Execution Environments/Architectures
We start by considering execution environments that facilitate run-time reconfiguration.
Also being design approaches themselves, execution environments are first hand fun-
damentally important for experimental research or abstraction. The majority of the
34
2.4 Design Approaches for Reconfigurable Systems
approaches rely on FPGAs, which are execution environments themselves and shall be
included in this listing. Concerning the numerous academical approaches, we focus on
those that have emerged to become a comprehensive platform.
Commercial Platforms
Already selectively mentioned in the previous sections, the Xilinx flagship Virtex series
are the most widespread commercial devices that offer capabilities for research on recon-
figurable computing. All of them can be reconfigured partially, with the latest improving
the flexibility immensely. While the devices up to the Virtex-II Pro series require always
whole columns to be reconfigured, Virtex-4 and the most recent Virtex-5 can be recon-
figured tile-wise. According to Xilinx, the reconfigurable parts of the FPGA are similar
to the division of the clock network. For the low-cost Xilinx Spartan FPGA, which does
not fully support partial reconfiguration, some research teams still have shown working
designs [PHA+07].
The devices of the second big player in the FPGA world—Altera—lack the possibility
to be reconfigured partially and are therefore little present in the realm of research on
reconfigurable computing. The same holds for most of the other companies, except for
Atmel. However, their FPSLIC, which offers partial reconfigurability, is very small, and
therefore also is not widespread in the community.
All FPGAs are basically available as single dies, however, they are preferably used
embedded into evaluation platforms of companies like Avnet, AlphaData, or Xilinx it-
self. On these platforms, peripherals are connected to the FPGA in order to exploit
different functionalities. Moreover, within the supercomputing domain, Cray released
in 2004/2005 the XD1 machine that basically hosts AMD Opteron 64-Bit CPUs. Addi-
tionally, the device incorporates Xilinx Virtex-II Pro FPGAs for application acceleration,
composing another platform for exploiting run-time reconfiguration.
Furthermore, platforms that already couple reconfigurable fabric and main processing
unit on one die are the Tensilica and Stretch approaches. While Tensilica’s Xtensa VLIW
processor architecture embodies a hardware accelerator that is fixed at design time
without run-time reconfigurability, Stretch Inc. claims to embed programmable logic
entirely inside a software-programmable processor architecture. Stretch’s S5000 and
S6000 software-configurable processors are optimized specifically for high-performance
video and wireless signal processing.
Coarse Grain Devices
In the area of coarse grain devices, there are some rare commercial approaches, that
also initiated research on design method. For example the PACT XPP devices exploit
the principles of coarse grain reconfigurable systems towards commercial applications.
The XPP-III architecture expands the horizon of DSP processing by programmability.
Several applications have been shown and prove the efficiency of the approach [KGS+07].
Targeting their own abstract device as well as the NEC-DRP architecture, [OSF+07]
discusses run-time reconfiguration on coarse grain devices. To maximize performance,
they propose the concept of processor-like reconfiguration, which allows for new con-
figuration every clock cycle. Besides requirements for the physical layout of such an
35
2 Reconfigurable Computing
Figure 2.8: The Erlangen Slot Machine [MTAB07]
approach, they also discuss the trade-off between the number of costly contexts offered
and maximum throughput possible.
Academic Approaches
Many academical works focus on the level of execution environments. Some of them
use commercial FPGA platforms and map their specific experimental environments like
a layer on top of them. Others provide complete environments, including the physi-
cal aspect by manufacturing their own PCB (printed circuit board). There are also
some works that propose completely new FPGA technologies as a work-around of the
drawbacks of the devices that are commercially available.
To raise the level of abstraction and facilitate dynamic task loading, [KPR02, WP04,
FC05, UHGB04b, UHGB04a] among others have investigated the concept of a run-
time execution environment implemented on top of the pure FPGA substrate. These
approaches are called slot-based and either are arranged in a 1D or 2D style. Such envi-
ronments dramatically help to open the field of reconfigurable computing to a broader
audience, as low-level details can be abstracted.
An example of complete environments is the Raptor system designed at the Heinz
Nixdorf Institute at the University of Paderborn. Raptor is a PCI card that can host
up to six FPGA modules. Different combinations are possible and thereby lead to
multiple sophisticated designs [KPR02]. Again, also the Berkeley Emulation Engine
(BEE) project shall be mentioned. As introduced in Sect. 2.1, the current platform
hosts five FPGAs and is used for example in the RAMP project [WOK+06], exploiting
multiple levels of parallelism.
36
2.4 Design Approaches for Reconfigurable Systems
To address several drawback of commercial platforms, the codesign group at the Uni-
versity of Erlangen has designed the Erlangen Slot Machine [BMA+05, MTAB07]. As
displayed in Fig. 2.8, the platform consists of an FPGA mounted on a baby board that
can be solely used for the dynamic allocation of hardware tasks. Bitstreams are stored
locally on this baby board and can be loaded on demand. For supervising purposes and
to connect to peripherals, the baby board itself is mounted on a mother board. Partic-
ularly, the IOs of the main FPGA are connected via a memory mapped crossbar to the
peripherals of the main board. We thus can establish connections very flexibly.
The capability to dynamically map IOs to peripherals on the Erlangen Slot Machine
stems from the wish to overcome a major drawback of the Xilinx-II and previous series
FPGAs in general and the modular design flow of the Xilinx application note 290 in
particular. As partial reconfiguration on those devices requires to reload the configura-
tion information of a complete column including the IOs of that column, the pinning of
the board can easily constrain the applications. By making the link between peripheral
and IO pin adaptable, this drawback can be overcome. Modern devices like the Xilinx
Virtex-4 and Virtex-5 offer similar capabilities as the IOs are separated in the bitstreams
as they now occupy columns on their own. Nevertheless, the Erlangen Slot Machine is
a very convenient approach to exploit partial run-time reconfiguration.
In the realm of designing completely new architectures for run-time reconfiguration,
we locate the Honeycomb architecture of the University of Karlsruhe [TB05, TB07].
It comprises of processing cells arranged in a hexagonal style resembling the layout
of honeycombs. Furthermore, the Strategically Programmable System (SPS) architec-
ture, which combines memory blocks and programmable blocks into a LUT-based fabric
[MBKS01], or the Dynamically Programmable Gate Array [DeH96a] shall be mentioned.
Moreover, proposals for alternative FPGA architectures exist. For example low power
devices like LP PGA [GR01], or the LEGO [CSR+99] device for high-speed designs.
To conclude, execution environments are an active area of research, most often in-
cluding supporting design methods. Many sophisticated approaches exist, however few
are present beyond their own group. Only strong commercial projects seem to have suc-
cess, most likely as they offer comprehensive frameworks on top of the pure execution
environments. However, expert knowledge not only limited to execution environments
can be found all-round, which thus may be used to exploit so-far unused capabilities of
reconfigurable devices. Some reasonable concepts can be found below.
2.4.2 Placement/Scheduling Methods
If tasks shall be executed on reconfigurable devices, there arises the question of area
allocation, particularly if tasks are to be loaded dynamically onto the substrate. Of-
ten supported by a reasonable run-time execution environment, still we have to map,
dispatch and schedule algorithms or applications onto (run-time) reconfigurable fabrics.
Methods either take place offline or online, as separated below.
37
2 Reconfigurable Computing
Offline Placement/Scheduling
In [FKT01], Fekete, K¨
ohler, and Teich propose formalisms and algorithms towards mod-
ule placement. They solve OPPs (Orthogonal Packing Problems) and give good insight
by presenting this approach. They project boxes onto the coordinate axes, which define
interval graphs. Furthermore, we can find a strong mathematical background in the
paper. However, they neglect communication between the modules.
An influencing work on temporal partitioning and temporal placement again in the
offline scenario was conducted by Bobda [Bob03]. By virtue of spectral placement, he
introduced a new approach to derive so-called generations that are dynamically loaded
on FPGAs. Moreover, he considered clustering for 1D slot-based task execution.
Online Mapping/Scheduling
For online scheduling, activation times and frequencies of the different tasks, which are
only known at runtime, must be considered. [WP03] therefore introduce a technique
that uses a number of queues and the two functions fSP LIT and fSELECT . Additionally,
they consider preemption of tasks executing on FPGAs [WP04].
[SWPT03] proposes an on-the-fly partitioner towards better/less fragmentation and
manages the free spaces therefore in trees. The authors also discuss heuristics for on-
line scheduling of real-time tasks to partially reconfigurable devices [SWP03]. The two
heuristics are called the horizon and the stuffing technique. Furthermore, they consider
the limitations of a practical implementation. These considerations include the device
homogeneity, task communication and timing, and the partial reconfigurability.
[ABI06] present another scheduling approach referring to the strip packing algorithm.
They target column-wise reconfigurable FPGAs like the Xilinx Virtex-II series. The
usage of the strip packing algorithm allows them to also consider tasks with precedence
constraints and release times.
Another interest manifests in the empty space search. For example [HV04] search for
maximal empty rectangles by using a staircase idea. Therefore, they reduce the problem
to find a maximal staircase area. A matrix stores the status of the FPGA.
Moreover, Diessel and ElGindy show a scheduling algorithm [DH98], which comprises
of two steps. In the first step, they identify a rearrangement of the tasks executing on
the FPGA. This step’s job is to free sufficient space for the waiting task. The second
step then schedules the movements of tasks to minimize the delays to executing tasks.
In order to identify the feasible rearrangements, they investigate local repacking and
ordered compaction. The first part of the second step comprises arbitrary rearrange-
ments and ordered compaction. They use ordered compaction only for the second part
(moving tasks on-chip). In another work, Diessel et al. introduce also concepts for dy-
namic scheduling of tasks on partially reconfigurable FPGAs [DEM+00]. Therefore they
propose the rearranging of a subset of the tasks executing on the FPGA.
Scheduling Tasks onto Coarse Grain Devices
In a paper concerning network topology exploration of mesh-based coarse-grain recon-
figurable architectures [BGD+04], the authors explore the effects of varying the network
topologies, the topology traversal strategies, and the delay models for the interconnects.
38
2.4 Design Approaches for Reconfigurable Systems
The mapping they use is trimmed to interconnection awareness. They construct the
total run time by a combination of execution time and routing delay.
For exact and heuristic solutions, [BLPG01] gives a very brief overview of efficient IP
based mapping techniques for coarse-grained dynamically reconfigurable array architec-
tures. They consider genetic algorithms, hadlock’s algorithm, etc.
Most of the works on placement and scheduling methods presented above are driven
by the fascination to arrange tasks like 3D boxes in space and time. Multiple sophisti-
cated heuristics have been shown. However, few works consider the reconfiguration time
and the communication requirements of such a scenario. Moreover, heterogeneity ham-
pers the idea of mapping the packing problems to reconfigurable fabrics. Nevertheless,
comprehensive methods can benefit from these results and include appropriate methods.
2.4.3 Comprehensive Design Systems/Design Methods
In the literature, there exist several research proposals that target on comprehensive
frameworks to design reconfigurable systems. Below, we list some important examples,
roughly sorting them into increasing abstraction level.
There are some approaches that discuss compilation from classical C-style languages
to reconfigurable devices. Based on the SUIF infrastructure [HAA+96], several works
have investigated how to compile to reconfigurable systems [BDD+99] or how to create
coarse grain parallelism [KKS04]. These approaches heavily rely on the data/control
flow based intermediate graph of SUIF. Also the DEFACTO system [Bon02] combines
parallelizing compiler technology with synthesis to automate the effective mapping of
applications to reconfigurable computing platforms.
Starting from the Java bytecode level, [JN03] discuss a Java-based compiler for compil-
ing software programs into reconfigurable hardware. The compiler can exploit multiple
control flows and high instruction-level parallelism (ILP) degrees (beyond the basic-block
level), and diminish the memory bottleneck. Other approaches are starting from Matlab
with the MATCH (MATlab Compiler for distributed Heterogeneous computing systems)
compiler [BSC+00] or use specific algorithms to profiles graphs towards reconfigurable
computing [KKMB02].
Some approaches also rely on specific APIs that are used to describe temporal and
spatial execution. For example, Koch describes in [Koc02] a concept how to generate
parameterized hardware-modules based on an API and a primitives catalogue. His API
allows for evaluation and creation of hardware objects targeting configurable computing
machines. To also enhance the SystemC language for modeling reconfigurable comput-
ing, the OSSS+R library was developed [SON06]. It extends OSSS [GTF+02], which is a
SystemC subset that can be synthesized, by specific primitives to model reconfiguration,
including simulation at a high level of abstraction.
Without referring to HLL input languages, some works directly consider input graph
mapping to reconfigurable systems. For example in [KV99], the authors consider inte-
grated block-processing and design-space exploration in temporal partitioning for recon-
figurable devices. They automatically partition behavioral specifications. As the recon-
39
2 Reconfigurable Computing
figuration overhead is an issue, they propose block-processing. Moreover, in [KVGO99]
an automated temporal partitioning and loop transformation approach for developing
dynamically reconfigurable designs starting from behavioral level specifications is dis-
cussed.
Also works that propose complete frameworks are presented in the literature. Eisen-
ring and Platzner [EP02] defined such a framework for run-time reconfigurable systems.
The aim of their work is to provide a methodology and a design representation which
allows them to plug in different design and implementation tools. They base their idea
on two graphs, one for the architecture and one for the problem.
In [OGS+98], we find the integrated design system SPARCS (Synthesis and Partition-
ing for Adaptive Reconfigurable Computing Systems), which comprises an automatical
partitioning and scheduling. However, run-time reconfiguration is not in the primary
focus of SPARCS.
A comprehensive approach is SCORE [CCH+00], which allows for modeling stream-
oriented computations, including the system architecture and several execution patterns.
However, SCORE is a relatively heavyweight framework, in particular as it uses a pro-
prietary modeling technique consisting of FSM (final state machine) and TM (Turing
machine) nodes.
There also exist proposals for frameworks on the generation of (partial) bitstreams
for run-time reconfiguration. Among others, [CCEBM04] present such a framework.
Recently, we could watch the emergence of several extensive research projects un-
der support of the EU, which focus on integration of several design constraints. For
example, the Multi-purpOse dynamically Reconfigurable Platform for intensive HEt-
erogeneous processing (MORPHEUS project) is an integrated project which addresses
solutions for embedded computing based on dynamically reconfigurable platforms and
tools [TKB+07]. The AETHER project aims to tackle the issues related to the perfor-
mance and technological scalability, increased complexity and programmability of future
embedded computing architectures by introducing self-adaptive technologies in comput-
ing resources [PHB+07]. Similarly, the ANDRES Project targets at analysis and design
of run-time reconfigurable, also including heterogeneous systems [HOH+07].
2.4.4 Miscellaneous Concepts
There are also approaches that bear reconfigurability in their structure by nature or
can be easily extended to make powerful approaches for reconfigurable computing. For
example the MACT architecture, invented at the University of Paderborn, Germany
and already patented [RLZB04], can be extended by routers that allow for run-time
exchange of parts of the data flow graph [DRW06].
2.5 Lesson Learned
Reconfigurable computing is a promising, however also quite challenging field. As could
be seen throughout the whole chapter, there are many pros and cons. The core benefit
40
2.5 Lesson Learned
is the capability to process in space and time—we can resemble the speed of hardware as
we process in parallel and still be flexible based on the reconfigurability of the devices.
Reconfiguration however incurs costs, such as the reconfiguration time, the increased
design complexity because of the interdependency of the two domains space and time,
the need for suitable communication, etc.
For a comprehensive fruitful and beneficial exploitation of reconfigurable fabrics, we
thus have to act sophisticatedly and overarching. Therefore, we have learned details
about the technical characteristics, the programmability, etc., which we consider as
evident if a comprehensive exploitation of reconfigurable devices shall take place. Ab-
straction, which is targeted by this work, can thus built a solid foundation.
Concerning the overcoming of the overhead of the reconfiguration, there always must
be a balance between the possibility for fast reconfiguration and the costs for the tech-
nique to achieve these improvements. Obviously, the purpose of reconfigurable de-
vices should not be to conduct high-performance reconfiguration, but to facilitate high-
performance and application-oriented processing. Only if the reconfigurability improves
the performance of the overall system, the additional costs for the reconfiguration—time
overhead and infrastructure—are acceptable.
This train of thoughts must be considered at all abstraction levels. For example, an
unsystematic placement of tasks onto a reconfigurable device can result in fragmentation
of the substrate, and prevent further tasks to be executed as no connected area to host
the tasks is available. A relocation may solve the problem, however it may also degrade
the overall performance due to further costs (readback). Another example and big
challenge in this context is to provide an efficient intermodule communication. Again a
trade-off between heavyweight networks on a chip and lightweight but dedicated wires
has to be found.
Moreover, we particulary require appropriate design methods, if reconfigurable devices
shall be seen as adaptable devices. Here, an ASIC style design flow quickly becomes
too cost intensive. For example, an estimation of the performance of an application
that shall be executed on a reconfigurable fabric should be possible on a high level of
abstraction, without the need to completely synthesize the design. Very challenging, the
quality of this high level evaluation should be also very good.
The design approaches presented in the previous section discuss several sophisticated
concepts and give reasonable hints for a beneficial exploitation. However, they often
require improvements to yield good results, for example, many of them neglect com-
munication. They thereby show that it is challenging to program for reconfigurable
systems. On one hand, the overall complexity of computing in space and time has to
be taken into account. On the other hand, the generation of circuits is complex and
difficult in many aspects. Even once done (having a good design), the synthesis down
to the final bitstream generation takes a long time, which makes it hardly comparable
to the quickly responding compilers of software design.
Besides these design-oriented challenges, a general aspect for beneficial exploiting run-
time reconfiguration still is the long reconfiguration time resulting in the reconfiguration
latency. This latency must be considered for the granularity of the adaptability. Thereby,
the latency is significantly reduced when partial reconfiguration can be applied, as others
41
2 Reconfigurable Computing
parts can continue operation during a reconfiguration. Furthermore, as reconfigured area
is usually proportional to reconfiguration time needed, the reconfiguration of smaller
parts will result in an earlier availability of configurations.
Furthermore, we experience a lack of suitable tools that soundly support dynamic
run-time reconfiguration. Partial reconfiguration—in particular the generation of par-
tial bitstreams—still is hardly supported on the software side of the vendor products.
Nevertheless, promising concepts and methods exist that are worthwhile to be inves-
tigated further. For example in the context of reconfiguration time hiding, we can
summarize that configuration overlapping and caching yield good results.
In general, it is also difficult to agree on one common language (terminology) in
the domain of reconfigurable computing. As already the discussion on reconfigurable
vs. programmable showed, the definitions of the field often are vague. Moreover, the
different approaches on how to design reconfigurable systems—high level languages,
graph-based approaches, hardware description languages, etc.—make an agreement on
one single model of computation for reconfigurable computing very challenging. There-
fore, it maybe is difficult to conclude whether we already know what reconfigurable
computing is and what its true challenges are.
Nevertheless, FPGAs as the precursor of reconfigurable computing have started to
become mainstream. To eventually exhaust their complete potential (of which we are
aware so far), comprehensive design methods are required.
In the subsequent chapters, we will built on the results of the literature and derive
methods that target on applications not covered up to now. We also show where our
methods extend the design approaches presented so far. Thereby, our methods gradually
raise the level of abstraction concerning the design of reconfigurable systems.
2.6 Summary
In this introductory chapter, we took a glance at reconfigurable computing. Therefore,
we have considered the history of this field, before detailing the technical aspects of
reconfigurable devices in general. As this thesis mainly focuses on partially run-time
reconfigurable FPGAs, we have spent some more lines on the capabilities and drawbacks
of the fine granular FPGAs, including programming, bitstream generation, and coupling.
In general, we have detailed these parts, which are important for the further work
presented in this thesis. Concerning the fields of application, we have listed several
domains of configurable systems including already investigated or future extensions for
run-time reconfigurability. Finally, we have given a brief overview of important design
approaches of the literature.
42
3 Two-Slot Framework
In this chapter, we introduce a two slot framework as our primary approach for ex-
ploiting partial run-time reconfiguration. The core part of the framework is a platform
for reconfigurable computing that comprises of exactly two regions in which user tasks
can be executed. The two regions are closely connected and compose a unity. Tasks
or partitions of a task are loaded alternatingly into these regions allowing for hardware
virtualization and configuration time hiding. The two slot architecture therefore most
likely is the most direct approach to exploit partial run-time reconfiguration capabilities.
Despite its simplicity, the architecture explores the major obstacles of partial run-time
reconfiguration, e. g., static communication, application mapping, or partial bit-stream
generation. We thus consider the architecture as ideal to introduce reconfigurable com-
puting conceptually as well as from an implementation point of view.
3.1 Introduction
The previous chapter on reconfigurable systems already has shown that we have to take
care of multiple often low-level issues such as the inter-module communication or the
task allocation for a beneficial exploitation of reconfigurable systems. In particular, the
execution of tasks on the substrate including both the temporal and spatial dimension
is challenging. Only a comprehensive consideration of both dimensions will result in
valuable benefits and therewith justify the usage of a reconfigurable substrate.
One of the main challenges undoubtedly is the reconfiguration overhead—the duration
of the reconfiguration itself. As the configurations are usually shifted byte- or even bit-
wise into FPGAs, and multiple thousand bits are needed to configure the logic, even
relatively fast reconfiguration ports like the SelectMAP port with up to 66 MHz and up
to 32 Bit width will require milliseconds to reconfigure the several million SRAM cells
of modern devices. Such a delay cannot be accepted by any application, in particular if
the whole reconfigurable substrate comes to a stop during the reconfiguration process.
Here partial reconfiguration comes in and improves the issue on two sides. Firstly,
partially reconfigurable systems allow us to reconfigure only parts of an FPGA, while
the other parts remain untouched and can keep on processing. We thus can interweave
reconfiguration and execution, and hide the reconfiguration time. Secondly, a smaller
area to be reconfigured directly results in a linearly shorter reconfiguration time.
However, by employing partial reconfiguration, the design complexity increases. When
placing tasks to be executed on the FPGA at arbitrary positions, the FPGA will become
fragmented, especially if multiple tasks share the same reconfigurable fabric. Thus,
fragmentation must either be maintained by a reconfiguration controller or prevented
43
3 Two-Slot Framework
by fixed regions—so-called slots. We apply the latter in this chapter, which also eases
another challenge of partial reconfiguration, the intermodule communication:
Tasks require access to input variables, might have to store intermediate results, and
have to write back the final result; even communication between multiple tasks may
occur. Moreover, for arbitrary task placement, dynamic routing would be necessary. If
the location of the tasks is reduced to a fixed set of regions (slots), we can implement a
dedicated and static communication infrastructure or a bus topology for communication.
In such a case, we can also provide access to additional communication resources (e. g.
fast rocket IOs) by routing them to some specific slots.
Another important aspect of partial reconfiguration is the configuration scheduling.
As the substrate is time-shared among several tasks, a control unit—also termed reconfig-
uration manager—should supervise and trigger the loading of new (partial) bitstreams.
Foremost, such a manager must consider the difference in duration of the reconfiguration
times and the times spent in execution. Intermediate results thereby often must be hold
back until the reconfiguration has finished and processing can continue.
Two Slot Architecture
To consider all the constraints mentioned above, a lightweight approach like the two slot
framework is a perfect start. Here, two deeply connected slots constrain the substrate
and offer a direct access to partial reconfigurability. Despite only two regions, the core
characteristics of partial reconfiguration are present in such a framework. Thereby, the
architecture of the framework is of major concern as it takes care of the low-level recon-
figuration issues and enables the reconfiguration time hiding. By providing two fixed-size
slots, the simple yet effective two slot run-time environment brings reconfiguration to
its basics. In addition to an in-depth discussion on the challenges during the design, we
particulary show how we can integrate alternating execution of the slots into the overall
framework. The two-slot framework thereby facilitates multiple application scenarios.
Intended Application: Hardware Virtualization
Basically, the two slot framework focuses on hardware virtualization. Hardware virtual-
ization has been discussed in several works (e. g. [Bob03]) and can be characterized by
reconfigurable fabrics being used to make up a larger application than would fit on the
device. This feature is of particular interest in the area of embedded systems or for the
prototyping of very large VLSI circuits where often only a limited amount of resources
is available. These resources are then time-shared by the same application.
To gain the benefits of such a hardware reuse, the applications themselves also have
to serve the execution scheme and the architectural constraints like maximum area and
communication of the reconfigurable regions. Therefore, we can either restrict the input
to fittingly partitioned tasks or may have to partition the input algorithms appropriately.
Despite its simplicity, the two slot approach can be extended to open for larger ap-
plication fields. Notably is the possibility to focus on low power scenarios. Therefore,
we investigate the usage of the multiple clock domains of modern FPGAs as means to
adapt the execution times. Another natural extension of the two slot environment is to
include the architecture as an IP (intellectual property) core into a larger system.
44
3.2 Concept
FPGA
Tasks
Two Slot Architecture
Partitioning
Scheduler
Figure 3.1: Layered approach for the two-slot framework
3.1.1 Layered Approach
Our two slot framework solves the demand for abstraction to handle the complexity
of hardware resources in reconfigurable devices in a very simple way. Nevertheless,
we raise the level of abstraction and thereby slightly open the field of reconfigurable
fabrics—particularly FPGAs—to the world of computer scientists.
Figure 3.1 shows the layered approach for the two slot framework. On top of the
FPGA we employ the two slot architecture layer. Tasks for execution on the FPGA
must be constrained in order to be executed by this layer. We therefore might have
to partition the tasks, which we add as a dedicated layer. Furthermore, a scheduling
or dispatching layer must exist that takes care of the reconfiguration process. Such a
service can be used by an operating system, among others.
3.1.2 Organization of the Chapter
The rest of this chapter is organized as follows. First, we abstract the problem and define
a solution strategy. Thereby, we introduce the concept of reconfiguration phase RT and
execution phase EX. This fundamental concept accompanies us in the remainder of the
thesis. Therefore, we spent some more lines on the concept.
Then, we follow the layered approach. We describe the architecture and its implemen-
tation, including the intermodule communication between the two slots. We solve the
problem of simple yet flexible data flow between two slots. Thereafter, we introduce our
two partitioning methodologies and evaluate them. Then, we consider the scheduling
for the two slot framework. In the experiment section, we show an application example
and evaluate its behavior. We also discuss extensions of the two slot framework. We
show how to maintain the overall response time, while reducing power consumption,
and discuss the two slot architecture as a powerful IP core for use in (embedded) system
design. Very important is the lesson learned of this chapter as it drives the further work
of this thesis. Before we summarize, we evaluate related work.
3.2 Concept
The concept of the two slot environment aims at simplicity. Nevertheless, it is embedded
in a comprehensive abstraction of partial run-time reconfiguration, which is basically
45
3 Two-Slot Framework
RT EX RT EX RT EX
RT EX
RT EX
time
RT EX
Slot 1
Slot 2
Figure 3.2: Reconfiguration (RT) and execution (EX ) phase, simple example.
motivated by resource reuse and reduction of costs. The hereby necessary underlying
abstracting model of execution on partially reconfigurable FPGAs is presented first.
3.2.1 Problem Abstraction
Our model of execution can be described abstractly as displayed in Fig. 3.2. The core idea
is to make the two relevant phases of tasks executing on reconfigurable systems explicit:
reconfiguration and execution phase. The reconfiguration phase (RT) represents the
configuration of the hardware itself. It has to occur before the second phase, which is
the execution phase (EX ). In the lower part of Fig. 3.2, we display the available slots
vertically and their occupation over time horizontally. The slots are active alternatingly.
RT means that a slot is in reconfiguration, while EX denotes the execution of a task.
As FPGAs facilitate to execute more than one task on their substrate at the same
time, multiple EX phases may reside on the FPGA in parallel. Thanks to partial recon-
figuration, EX and RT phases of different tasks can also be conducted in parallel. We
thus can introduce configuration pre-fetching within the FPGA, see Fig. 3.3. However,
the commonly given constraint of only one reconfiguration port leads to the exclusiveness
of RT phases. In Fig. 3.3, we also display the general motivation for partial reconfigu-
ration capabilities of such a system. If we interweave EX and RT phases of different
tasks, we can hide the reconfiguration time. Partial reconfiguration thus results in an
improved overall response time of the task set.
It is worth to mention that both phases are inescapable parts for the processing in
space and time. Yet, during the RT phase no immediate benefit for the processing
is gained. Moreover, as up-to-date FPGAs comprise one reconfiguration port only, we
cannot reconfigure multiple areas at the same instance of time, a fact that must be
considered in all approaches of partial run-time reconfiguration.
RT EX
EXRT
RT EX RT EX
EXRT
RT EX
using partial
reconfiguration
Slot 1
Slot 2
overall response time overall response time
Figure 3.3: Scheduling of reconfiguration (RT) and execution phase (EX) showing the
benefit of partial reconfiguration.
46
3.2 Concept
FPGA
X-Achse
RT
EX
RT
EX
RT
EX
RT
EX
time
EXRT
RT EX
RT EX
time
RT EX
reconfigurable area
max
...
a) b)
Figure 3.4: Scheduling example of EX and RT phases on a one-dimensional (a) and
two-dimensional (b) reconfigurable fabric.
Excursion
Basically, the execution model of RT and EX phase facilitates different combinations of
these two phases, however constrained by RT and EX of one single task always forming
a unity. If we consider the reconfigurable substrate as a freely programmable region, a
schedule as depicted in Fig. 3.4 a) could arise. There, four pairs of RT and EX phase
are dispatched to the area (y-axis) over time (x-axis). As there exist no constraints to
the area, the tasks are distributed uncontrolled on the fabric. We thus waste space and
suffer fragmentation. Finding an optimal schedule becomes challenging.
Furthermore, if the area of the reconfigurable fabric can also be configured in a two-
dimensional style as given by the Xilinx Virtex-4 and Virtex-5 FPGAs, the complexity
increases once more. Such an example is displayed in Fig. 3.4 b). Here, tasks compete
for connected area to serve their spatial requirement. Poorly placed tasks may force
successors to wait until a region large enough is available, despite the accumulated free
space already would be enough.
The optimal schedule heavily depends on the area under reconfiguration, the device
which hosts the tasks, the task and its execution time itself, the dependencies between
tasks, and so on. As we can map the problem on a multiprocessor scheduling problem
with task placement constraints, possibly also targeting heterogeneous processors, in-
cluding strong dependencies between the EX and RT phase of tasks, etc., deriving the
solution of the optimal schedule becomes NP-hard [GJ79, BBD05]. Additional require-
ments like resource conflicts during task execution (e. g. sharing of the same bus) have
to be considered also. Nevertheless, the phase model is open for all variations.
We formalize the general problem as follows:
Problem 1 For a problem Pgiven as a set of tasks Γto be executed on a reconfigurable
substrate S, we want to derive a placement βand a schedule γwith respect to one or
multiple optimizing goals.
47
3 Two-Slot Framework
This multi-objective optimization problem desires special care as several optimization
criteria are possible: minimization of the overall execution time, resource usage, power
consumption, etc. Particularly, if a top-level designer exploits all degrees of freedom,
the design process quickly becomes tedious and may require numerous iterations. It
is therefore advisable to constrain the design space and guide the designer through the
process, for example, by reducing the solutions to a Pareto-front from which the designer
can choose. By holding some parameters fixed, e. g., relying on good results as is done
in platform-based design, the design process becomes manageable and more abstract.
An Example
Considering the problem of optimizing the scheduling of the RT and EX phases as a
classical pipeline problem, we would argue for both phases to be as homogeneous as
possible in order to avoid idling times and maximize the performance. Such an opti-
mization problem can be found in the design of pipelines for the classical von Neumann
machine, where the stages should be packed as densely as possible to yield the maximum
performance. However, this reduction to a known problem must not be optimal if we
consider execution in space and time.
In reconfigurable systems, RT phases, which are part of such a pipeline, should occur
as rarely as possible. Moreover, if they occur, they should be as short as possible.
Simply speaking, the RT phase is of no direct benefit for the processing. It only can
improve the overall behavior of our system—and thereby secondly being beneficial for
the processing—if we can accept the additional costs. Therefore, a balanced pipeline
may be unwished in the case of reconfigurable computing.
The methods how to derive a good scheduling thus must be selected carefully. More-
over, the reconfiguration overhead very often is high, which means that the RT phases
are very long and often dominate the EX phases. In addition, relying on more than two
slots for the execution of tasks increases the complexity of e. g. the inter-module com-
munication between the slots, etc. Nevertheless, there exist good reasons for variations.
We give answers in every chapter of this thesis. Here, we focus on the maybe simplest
shape of this pipeline, where we have two slots that are alternatingly in reconfiguration
and execution phase, offering to hide the reconfiguration overhead.
3.2.2 Problem Solution Strategy
The problem to solve thus is to design a two-slot architecture and integrate it in an
overall framework for comprehensive design. A first draft of the architecture is depicted
in Fig. 3.5. During the processing in one area, another region can be reconfigured,
using the concept of configuration pre-fetching. Besides the core capability to hide
reconfiguration time, we can sequentially host parts of an algorithm otherwise too large
for complete implementation on the device available. Assuming the termination of the
reconfiguration in time, seamless processing is possible, while intermediate results are
buffered in the control area. The temporal existence of parts of the application on the
device is transparent to the user.
48
3.3 Run-Time Architecture
FPGA
buffer and control
reconfigurable area 1
reconfigurable area 2
Host
Figure 3.5: Draft of the two slot architecture.
Several other constraints are met by this architecture. For example it is in line with
modern FPGAs, which comprise one reconfiguration port only. Moreover, static inter-
module communication can be achieved on basis of having fixed reconfigurable regions.
For the sequential execution of partitions of tasks on the two slot architecture, we
still need an appropriate scheduling method as part of the overall framework. Such a
method performs best, if a sound input in the form of suitably partitioned applications
can be provided. Therefore, we require partitioning methods, which derive partitions
comprising similar area requirements. The generation of such partitions may also focus
on equal execution times, which often conflicts with precedence constraints and the aim
to optimize communication.
To summarize the framework, tasks—once loaded—can execute with the speed of
hardware. The reconfigurability adds the adaptability to the system or enables the
execution of algorithms too large for the device, using the concept of virtual hardware.
The reconfiguration is an integrated part of the whole execution, however, it should
not become the bottleneck of the system. The framework must consider that during
reconfiguration the area under reconfiguration is blocked, as well as time is consumed.
3.3 Run-Time Architecture
We discuss the implementation of the two-slot architecture below, including the chal-
lenges that had to be overcome. Basically, similar approaches of reserving some chip
area for the controlling overhead and dedicating some other to user logic can be found in
the literature [BMA+05, KPR02, WP04, FC05]. These approaches—in general termed
as slot-based—served as valuable inspiration. As few of the examples of the literature
comprise two slots only, we still had to design multiple issues from scratch. We start by
discussing the behavioral, structural and physical constraints of our approach.
49
3 Two-Slot Framework
3.3.1 Fundamental Design Constraints
The behavior, which was already defined above, briefly summarized should provide the
following: Execution of partitions of one or more applications on two run-time recon-
figurable regions that are occupied alternatingly and thus provide reconfiguration time
hiding. The regions can be reconfigured independently, but only one after the other,
as the reconfiguration is mutually exclusive. The data processing is done inside these
reconfigurable slots. The alternating reconfiguration facilitates complex functions to be
calculated by means of successive computations, while the intermediate results are kept.
The intermediate results thus must be stored within the architecture. Moreover, a
specific connection point for the external communication must be provided. As these
communication requirements remain fixed during the execution of applications, we im-
plement them in a static module, see below for more details. The static module may
also serve as control unit of the run-time reconfiguration. Therefore, it supervises the
modules and communicates with an external host to initialize the reconfiguration. Using
a self-reconfiguration port like Xilinx ICAP would make the architecture a stand-alone
system without requiring a host processor.
The structure of the architecture thus becomes obvious and comprises of the two
reconfigurable and one static region, as well as the communication structure in between
them. The reconfigurable modules are termed slot A and slot B. There is no direct
connection between the two reconfigurable modules as the information exchange between
them is always routed through the static part. Figure 3.5 depicts the three elements of
the architecture, also including the communication to an external system.
The physical arrangement of the modules depends on the constraints of the reconfig-
urable substrate used. Multiple alternatives are possible. When implementing all three
modules on one fabric, however, only a few layouts become reasonable. As we have
two slots only, the arrangement always is in 1D style, meaning that the two run-time
reconfigurable slots can be lined up in one direction. However, only devices which allow
a two dimensional reconfiguration—such as the Xilinx Virtex-4 and 5 series—can host
the slots on top of each other. Other devices require the slots to be in their own vertical
column each—facilitating a horizontal arrangement. To keep the signal delay small, the
slots should be put close to each other. As the static area provides the intermodule
communication between the sections, we should reserve a static area next to or between
the slots. This spacial arrangement of the modules allows the controller to directly com-
municate to both of the slots. Any data flow that is to be passed from one slot to the
other inevitably crosses the controller.
The actual physical arrangement of the controlling section again depends on the char-
acteristics and capabilities of the devices used. On Virtex 4 and 5 devices, it may be
located above, below or in between the reconfigurable slots. In contrast, older devices
constrain the location due to their architectural characteristics despite of modern design
flows as the Xilinx Early Access design flow described in Sect. 2.2.5. On such devices, the
actual reconfiguration still takes place in a column-wise fashion—even if only a part of a
column is declared to be partially reconfigurable, the whole column will be reconfigured.
In such columns, the identical configuration information is written to the SRAM cells
50
3.3 Run-Time Architecture
Slot A Slot BController
RAM
data
ctrl
ctrl ctrl
ctrl
data
data in data out
Figure 3.6: Floorplan of the architecture with the memory-based communication. Each
slot is connected to the controlling unit by a data bus and control signals in both
directions. The controller comprises RAM for holding the intermediate results.
which are beyond the reconfigurable region. As we may suffer unwished re-initialization
of constant values particularly concerning the flip-flops of the configurable logic blocks,
sensible logic should be avoided in columns that undergo run-time reconfiguration (in
fact it is unofficially claimed to be suppressed by the Xilinx design tools).
3.3.2 Intermodule Communication
A proper design of the intermodule communication is a necessary part of a sound im-
plementation of the architecture. We have implemented several alternatives, also to be
able to evaluate benefits and drawbacks. Basically, the communication facilitates the
data exchange between the two reconfigurable regions. In general, to establish a proper
connection between a reconfigurable module and a static unit, dedicated signals have to
cross the borderline. The points of crossings thereby are assumed to preserve between
the reconfigurations, as given by using Xilinx busmacros.
The first connection strategy presented below is a lightweight buffer-based approach,
followed by a proprietary bus-based solution that focus on providing simple yet effective
communication protocols. After that, we also consider the example of using standard
bus connections and sketch an approach based on the message passing paradigm.
Buffer-based Solution
As the major goal of our first approach, we focus on a lightweight approach reducing
communication and area overhead. Therefore, the intermediate area hosts buffers that
are directly connected to the static inter-module communication resources. Each buffer
is accessed by one of these static communication resources of one slot, holds one single
bit only, and transmits this bit to the next slot. The benefit of this solution is a very
small area footprint. In the best case, we only have to provide the same amount of
buffers as the communication bit width between reconfigurable slot and control unit.
The same buffers then are shared for the data flow in both directions—from slot A to
slot B via the control unit and vice versa. However, as we so far cannot overcome the
constraint of single-directed busmacros, we currently have to set up twice the amount.
51
3 Two-Slot Framework
The drawback of this lightweight approach obviously is the necessity to either establish
the worst case amount of busmacros (all applications must be known at design time of the
architecture), or to limit the data flow between subsequent partitions of an application to
a previously set maximum number of busmacros. Both solutions constrain the maximum
bandwidth and thus may hamper the flexibility of the overall two slot framework.
Another approach would be to adapt the wiring during run-time. However, it is very
challenging to dynamically establish communication lines. In fact, using the standard
tools, the run-time reconfiguration of communication structures between reconfigurable
and static parts so far is not possible, as it causes problems to the signal integrity
(possible open signals during reconfiguration). Furthermore, a maximum amount of
busmacros may be given by the technical characteristics of the FPGA.
Proprietary Shared Memory Solution
In this solution, a memory can store large intermediate results, while the access to the
memory is resource optimized by a specific bit-width, see Fig. 3.6. The solution thus
resembles the idea of shared memory, as both regions communicate via a single memory
space. Access protocols therefore have been designed that harmonize with alternatively
execution of tasks in slots, having the other slot reconfigured [War06], [WD06].
For the protocols, we again focused on a lightweight approach reducing communication
and area overhead. To keep the communication between a slot and the controlling unit
flexible, we offer two data transfer modes: sequential mode and random access mode.
We implemented the communication protocols by means of finite state machines.
The underlying protocols define the way the data is transferred over time, specifically
which component performs a write or read operation on the communication bus. The
user has the responsibility of building the functional module instances placed inside the
slots, and therefore is responsible for supporting these protocols.
In the sequential transfer mode protocol, the transmitting of the data comprises three
parts: load all data to the slot, process the data inside a slot, and store the result to
the controller again. We do not transfer addresses, as the data is transmitted word-wise
sequentially over the bus. The data wires or the bus can thus be used exclusively for
the data. Another advantage of this approach is that it grants bit-parallel access to the
entire data set after it has been transferred from the control unit into the slot. If an
algorithm needs the whole data at once, it is advisable to use this particular mode.
For the random access transfer mode, we derive more flexibility of loading and storing
data, however, having higher costs. Here, a slot can request data as and when the input
data is required for the computation during the entire execution time. This is done by
transmitting an address of this particular data, thereby wasting a precious transfer cycle
when a combined address and data bus is used. In contrast, having a dedicated address
bus, the number of busmacros required increases.
A block-wise data transfer can also be implemented in the random access transfer
mode, thus reducing the overhead of transmitting addresses for each single data word.
Moreover, as a load or store operation can be performed in parallel to data manipulation,
we can shorten the overall execution time in the random access mode by parallelizing
computation and read/write operations.
52
3.4 Partitioning
Standard Bus-based Shared Memory Solution
Besides setting up a protocol on our own, we can also achieve the intermodule communi-
cation by a standard and usually heavyweight bus-based solution. Selecting a widespread
bus would ease the acceptance and portability of the architecture, as designers do not
have to implement the dedicated protocols of the solution above.
In a master thesis supervised by us [Sch08], therefore the Wishbone bus was used.
Wishbone is an open source hardware bus that operates synchronously to a single clock.
It serves as an interface to hardware cores written in different languages and thus offers
a sound possibility to integrate portable cores. A short glance at the technical details
allows us to rate the bus for our implementation purpose. The data width can be selected
to be 8, 16, 32, and 64-bit. Thereby, address bus and data bus are separated. The bus
facilitates a master slave system, with an arbiter to regulate bus access.
As main advantage, the user logic can rely on the well-documented wishbone character-
istics and must not follow a proprietary format. This solution thus is more user-friendly
when porting the design to another FPGA. The disadvantage of the bus is that it com-
prises of a high amount of wires and connections, which must travel through busmacros
when crossing module boundaries.
Message Passing Solution
The two previous solutions target at the concept of shared memory, as the data transfer
between the slots is done solely by referring to the same memory block. On top of this
model, we could also install a message passing system. As a complete implementation
of a standard message passing system, like the MPI (message passing interface) would
require much resources, a lightweight approach as proposed and implemented in [SC06,
SNRC06] would be appropriate. Moreover, the idea of message passing can be even
more effectively applied to environments comprising of more than two slots.
3.4 Partitioning
The main application of the two-slot architecture is hardware virtualization—executing
tasks that require more area than available. If not given a priori, we therefore have
to derive partitions of the initial algorithm that can be executed sequentially. Such a
method is called temporal partitioning. It differs from spatial partitioning in that the
sequences of partitions must strictly preserve the precedence order.
In principle, there exist multiple partitioning methodologies. In this work, we first
show a simple concept, before we extend a more sophisticated approach that was de-
veloped by Christophe Bobda at the Heinz Nixdorf Institute. As a prerequisite for our
specific problem, each partition can maximally consume the area of a slot.
3.4.1 Simple Temporal Partitioning
We rely on data flow graphs comprising vertices and directed edges. The vertices repre-
sent basic arithmetic operations such as +, ·,<, etc. The derived partitions from this
53
3 Two-Slot Framework
a) Data Flow Graph b) Loop Node Generation c) ASAP
Figure 3.7: Simple partitioning, encapsulation of cycles
graph should each represent a block of closely connected processing operations, thereby
also reducing communication requirements between blocks.
As we have to ensure a sequential chain of partitions, we group loops first and compose
them into extended vertices, see Fig. 3.7. These vertices prevent having data/control
dependencies of loops exiting the partition boundaries, cf. basic blocks of compiler design.
Obviously, if such an extended node exceeds the available area of a slot, the partitioning
is not possible and a redesign on a higher level of abstraction must be done.
Next, we order the data flow graph temporally, using the scheduling techniques ASAP
(as soon as possible) or ALAP (as late as possible). Basically, these methods preserve
the temporal precedence constraints. Additionally, ASAP/ALAP give each node a start
and an end point on a global time-scale. In order to obtain the partitions finally, we
enqueue the vertices in a list following the given order of the ASAP/ALAP schedule.
Similar to list scheduling, we accumulate vertices to a partition until we meet the area
constraint (refer to Fig. 3.8).
If we face communication between two vertices, which are not adjacent to each other,
we include a NOP-vertex (no-operation-vertex) to all partitions that the communication
spans. Figure 3.9 shows an example. Thus, we ease inter-module communication, as
communication requirements can be traced.
As the result, we gain partitions having similar area requirements, all under the thresh-
old of the slot. However, the execution times of the modules may be different, see Fig. 3.8.
3.4.2 Spectral Based Partitioning
Bobda [Bob03] came up with a new idea of temporal partitioning, particularly focus-
ing on the communication. In reconfigurable devices, the communication between the
partitions is of high costs, and we often have only a limited amount of communication
resources passing the partition boundaries. Additionally, fewer communication results
in a beneficial reduction of intermediate data storage.
54
3.4 Partitioning
B
B
d) Partitioning
e) Including
of Buffers
Execution Time 1Execution Time 2
Figure 3.8: Partitioning continued
sequence
NOP
Figure 3.9: Including a NOP-node
The so-called spectral method—based on the spectral analysis of the input graphs
as proposed in [Hal70] for VLSI design—re-arranges graphs in space respecting the
communication by a quadratic objective function of the distances between nodes (wire
length model). Its output is a communication optimized placement suggestion.
For spectral placement, the input graph must be given as the Laplacian matrix B,
derived from the connection matrix Cand the degree matrix D(B=DC). The goal
is to minimize the sum of squared distances between the nodes. Therefore, we apply
the Lagrange multiplier method with the kLagrange multipliers λ1, λ2, . . . , λk. The
solution are B’s Eigenvectors associated to the ksmallest non zero Eigenvalues. These
Eigenvectors place the vertices in space, whose locations now refers to communication
requirements optimized concerning the wire length.
In detail, Bobda uses the first two non-zero Eigenvectors to derive the xand ycoordi-
nates of a geometrical arrangement of the nodes. Then, he relies on the third Eigenvector
as reference for the temporal partitioning. As the spectral method ignores the direction
of edges, the immediate partitioning along the third Eigenvector in most cases does not
respect the precedence constraints of the tasks. Therefore, Bobda iteratively derives
bisections of the graph based on the ordering given by this Eigenvector. The data flow
between each newly generated pair of subgraphs is homogenized by means of applying
a modified version of the Kernighan Lin algorithm, which prefers to move tasks that
reduce the amount of wrongly directed edges. Finally, two adjacent partitions will have
edges in one direction only and a schedule can be derived.
The weakness of Bobda’s approach is that during the Kernighan Lin bi-partitioning,
valuable information of the proximity of nodes as initially given by the third Eigenvector
can be lost over time. Depending on the size and geometry of the graph, nodes that
initially are in complete different locations can end up sharing the same partition and
55
3 Two-Slot Framework
2
4
14
9
3
10
6
17
8
12
15
16
19
1
5
7
11
13 18
level
1
2
3
4
5
6
Figure 3.10: ASAP schedule of a data flow
graph
−0.6 −0.5 −0.4 −0.3 −0.2 −0.1 0 0.1 0.2 0.3 0.4
−0.8
−0.6
−0.4
−0.2
0
0.2
0.4
0.6
1
2
3
4
5
6
7
8
9 10
11
12
13
14
15
16
17 18
19
Figure 3.11: 2D spectral placement of the
data flow graph of Fig. 3.10
thereby distorting the result of the spectral analysis. Moreover, the inherent parallelism
given by the data flow graph is blurred in the spectral placement. The temporal ordering,
which allows for sorting of vertices into levels that denote parallelism among nodes, is
hardly respected during Bobda’s method.
Therefore, we use a combination of the spectral method and temporal scheduling in
order to derive mutual exclusive configurations that can be loaded in sequence onto
FPGAs. As published by us in [DB05] for the domain of coarse grain reconfigurable
systems, we combine the results of the ASAP scheduling (see Fig. 3.10) and the spectral
placement (see Fig. 3.11). We derive an efficient placement for processing elements (PEs)
of a coarse grained reconfigurable device. As displayed in Fig. 3.12, we therefore select
the vertices and their coordinates of level 1 of the ASAP scheduling in the spectral
placement of the data flow graph. As the distances of the vertices respect their closeness
in terms of input for later nodes of the graph, we allocate the vertices to the PEs by a
consecutive assignment. We start with the node comprising the smallest xand largest y
value (top-left corner), which becomes the node for the PE in the top-left corner of the
reconfigurable device. We continue until the node comprising the largest xand smallest
yvalue is reached.
When scheduling the next level, we have to take care of vertices that have already
started with their execution in the previous level, as these nodes should stay at their
occupied PE. Additionally, their location acts as an indicator for placing close nodes
of the ASAP scheduling in proximity, as far distributed results would foil the intended
reduction of communication achieved by the spectral method.
We thus extract the corresponding vertices of the next (micro) level from the commu-
nication optimized spectral placement of the data flow graph and lock the already placed
vertices. Then, we fill the gaps (speaking in terms of free PEs between locked PEs) by
assigning the remaining vertices of this level. We thereby achieve a PE assignment that
56
3.4 Partitioning
Figure 3.12: Nodes of ASAP level 1 are as-
signed to processing elements
1
2
4
5
14
9
310
6
11
13
17
7
8
12
15
16
18
19
PE (1,1)
PE (3,3)
Figure 3.13: Result of the ASAP and spec-
tral based node assignment
7
81
32
76
81
32
76
x
y12
1
10
5
7
x
y
++…+ =
x
y5
14 19
1817
13
t
t
t
Figure 3.14: Complete schedule of the graph of Fig 3.10 for a coarse grain device com-
prising nine processing elements
respects the assignment of the previous level and serves as solid basis for the next level
due to the PE assignment referring to the spectral placement of the whole graph.
In the end, we obtain direct neighbor connections or connections to the same pro-
cessing element in the next micro level in majority. For the exemplary data flow graph,
Fig. 3.13 shows the result, where we have mapped all levels into one figure—all vertices
allocated to the same PE are displayed in the corresponding square box. The complete
schedule including the temporal axis is displayed in Fig. 3.14, where the micro levels,
which comprise a valid placement each, are combined. Thereby, different execution times
of the nodes result in different heights of the boxes (nodes) in the final schedule.
Application for the Two Slot Framework
Our approach for the coarse grain architecture does not seamlessly scale to the require-
ments of the temporal partitioning for the two slot framework. For example the location
information gained from the spectral arrangement and used to bind nodes to processing
elements is only of secondary interest for the two slot framework. We might use this
57
3 Two-Slot Framework
3
7
5
2
8
4
9
6
A
B C
1
3
7
5
2
8
4
9
6
A
BC
1
level
1
2
3
4
temporal ordering
spectral placement
cluster 1
cluster 1
2
2
3
3
Figure 3.15: Clustering of a task graph: the clustering steps are denoted by the shaded
areas, going from dark shaded to light shaded ones
information for easing the placement of registers, etc. within the reconfigurable regions.
Nevertheless, from our approach targeting coarse grain devices, we inherit the idea of
combining the two objectives parallelism and communication that basically are contrary
against each other. To recall, if we strictly follow the ASAP schedule, we can achieve
a very high parallelism, as all nodes of one level become member of the same partition
and thus can be executed in parallel. However, cuts of the initial graph do not respect
communication constraints. In contrast, by referring to the spectral based clustering
(partitioning), we derive clusters according to communication costs, however omitting
precedence constraints and potential parallelism of nodes in the graph. We have designed
an approach that combines both, the temporal schedule and the spectral placement.
The core idea is to obtain a clustering that derives partitions according to the spectral
distance between nodes, refer to the example depicted in Fig. 3.15. To derive closely
connected clusters, we gradually find the cluster members only by following the edges
between the vertices. The nodes to start with are the nodes that are assigned to the first
level according to the temporal ordering of the task set. Similar to depth first search, we
start from these nodes and add the particular closest nodes first, referring to the spectral
placement. If the closest node belongs to a cluster and the combined area requirements
of both clusters do not extend the area available for a reconfigurable region, we combine
these two clusters building a new one. Eventually, we close a cluster the latest before
the maximum area of a reconfigurable region is met.
For finding new starting nodes for subsequent clustering, we refer to the temporal
ordering. In detail, among those nodes who are not clustered, we select the nodes whose
predecessors are clustered to become new starting nodes. Moreover, we prefer those
nodes who are assigned to a lower level according to the temporal ordering. Thus, we
gradually proceed along the time line (level) of the graph, and thus achieve to respect
the task level parallelism inherent of the temporal ordering.
58
3.5 Scheduler
B : Slot
: ControllerA : Slot
configure
compute control
reconfigure
reconfigure
compute
compute data flow
bit-stream
Figure 3.16: Sequence diagram of the slots and the controller entities. The data flow is
indicated by fat arrows. [War06]
In the end, we derive clusters that respect the precedence constraints, bear task level
parallelism of their members, and exhibit low external communication requirements
thanks to the cluster building according to the spectral arrangement.
Two improvements have been developed. First, our concept improves its results, if
the finding of the clusters can be done in parallel. If so, all clusters are filled in parallel,
improving the probability to find the most closest node among all nodes. Conflicts
then are resolved randomly. Second, if initial clusters only consume a fraction of the
reconfigurable area and are combined to final clusters even if no direct communication
exists, we can improve the overall response time. In detail, we then increase the amount
of tasks executing in parallel, however most often suffering higher communication costs.
To find an optimal solution that meets the architectural characteristics of the two slot
architecture, usually multiple iterations become necessary.
To summarize, our procedure is a mixture of a clustering heuristic on the basis of the
information given by the spectral graph and the precedence constraints of the temporal
ordering of the nodes. To derive optimal results, several iterations may become necessary.
3.5 Scheduler
Having now a well-formed set of tasks or partitions, we can dispatch these partitions
on the two slot architecture. Therefore, we first look at the scenario when hardware
virtualization is applied, before we also discuss other scheduling techniques that may be
used for independent task sets.
59
3 Two-Slot Framework
Scheduling for Hardware Virtualization
Including set-up time, the overall behavior in case of hardware virtualization on the
basis of a chained set of partitions is as follows (refer to Fig. 3.16):
After the FPGA is powered up it has to be configured with the bitstream containing
the initial top design. This top design includes the implementation of the controlling
unit, and the first (left) slot filled with the first bitstream. Once the initial reconfigura-
tion is finished the device can be fed with input data, which is stored in the controller.
After the input data transmission has been accomplished, the schedule can be started.
The controller therefore activates the first (left) slot (A) by setting a control signal.
When finished, the second (right) slot (B) is activated and simultaneously the reconfig-
uration of slot A is triggered. This behavior then is repeated with swapping A and B
for each new configuration until the last partition of the algorithm could be processed.
Afterwards, the control unit provides the final result to the requesting host. Thus, the
computation in slot A and the reconfiguration of slot B take place simultaneously and
vice versa, having the intermediate result stored in the controlling unit.
Note that due to the technical constraints of Xilinx FPGAs, we can also load the
second (right) slot with the initial bitstream. Such an initial bitstream must always
configure the whole FPGA. Subsequently, the configuration time stays the same as
otherwise this area would be configured with blank bits.
To summarize, the input undergoes several steps of transformation, for each of which
the corresponding task is loaded into the left or the right slot. Which slot is chosen
depends on the tasks’ serial number in the sequence to be even or odd. The intermediate
results are always transferred to the controlling unit and are saved there temporarily.
Scheduling Independent Tasks
If the two-slot architecture shall be used for executing nindependent tasks in the two
slots, the question of minimize makespan—the overall response time or schedule length—
arises. For a comprehensive consideration, the RT phases must be taken into account,
which occur mutually exclusive. Basically, the duration of all RT phases is equal, while
EX phases differ in execution time. Therefore, we have to consider the EX phases for
optimizing scheduling.
Obviously, if all execution phases are shorter than the reconfiguration time (EXi
RT 1in), we simple schedule the tasks in an arbitrary order having the task
with the shortest EX time sequenced last. Thus, makespan can be minimized.
Based on ideas of parallel machine scheduling problems with a single server [AWG06],
we generally have to schedule a set of independent jobs having different execution times
onto two identical machines, see also Chap. 5. In general, the problem is shown to be
NP-hard [HPS00], however some heuristics for special cases could be derived. Some
interesting ones for our problem are presented below.
For the special case of having tasks with equal set-up (RT) times, [KW97] describes the
solution to first map the set of tasks to the problem of scheduling two parallel executing
machines without set-up time. Thus, the set-up time gets included into the execution
time of a task. Then, they transform the solution of this problem to our problem of
having two machines (slots) and a single mutually exclusive server (reconfiguration port).
60
3.6 Experiment
If we release the assumption of always having the same reconfiguration time, some
more scheduling strategies are possible. Different reconfiguration times for the same area
can be achieved, if we reconfigure only those SRAM cells that are changed between the
configurations. Some approaches for this situation exist in the literature [CMS06, RM07]
(ref. also to Sect. 2.2.5). However, these approaches are still in their infancy and therefore
hardly applicable. Nevertheless, for future application, we still discuss such scenarios.
Again, if all EX phases are shorter than the RT phases of the other tasks (RTi
EXj1i6=jn), [AW02] proves that any alternating sequence with the task with
shortest processing time sequenced last is optimal with respect to minimizing makespan.
In the same work, they also consider the case of all tasks having the same execution
length: EXj=EX j. Here, minimal makespan can be achieved, if we first sort the
tasks according to the duration of their RT phases. Then, if the number of tasks is
even, we schedule the tasks according to the sequence: RT2k1RT2k3. . . RT5
RT3RT1RT2RT4. . . RT2k, where n= 2kfor some k. Similar, for an odd
number of tasks, we sequence the tasks according to RT2kRT2k2. . . RT2
RT1RT3. . . RT2k+1, where n= 2k+ 1. Finally, the authors give two heuristics
for the general case, where there are no constraints to EX and RT.
Note that also a subset of the partitions of a hardware virtualization may be indepen-
dent and be applied to these scheduling algorithms.
3.6 Experiment
For evaluating the concept, we have implemented several prototyping environments on
multiple platforms hosting different FPGAs. As the vendor tools only support the
design of such partially reconfigurable systems to a limited amount, we had to overcome
multiple challenges and accept compromises for the final implementations. Particularly
the design of partial bitstreams is a daunting task, which eventually resulted in the
design of our own tool Part-E, described in Appendix A. In the following, we discuss the
general proof of concept implementation on a Xilinx Virtex-II Pro FPGA, as well as an
application example that exploits the capabilities of a Xilinx Virtex-4 device.
3.6.1 Proof of Concept Implementation
We have implemented a prototype of a design with the two slots and a controlling unit.
Both of the reconfiguration slots can be loaded with different tasks as intended by the
behavior of the two slot framework. We also provide the user a way to define the tasks
by filling out prepared entities with functional description. A notable amount of this
work was done within a bachelor thesis [War06] and has been published in [WD06].
In detail, we have implemented this prototype on a Xilinx Virtex-II Pro FPGA em-
bedded in the Avnet Virtex-II Pro Evaluation Board. The FPGA holds the controller
and the two partially reconfigurable slots. For external communication, we use the pe-
riphery of the board. In detail, we use dip-switches to provide input data and a seven
segment display to display the output values.
61
3 Two-Slot Framework
Figure 3.17: Layout of the two slot architecture on a Virtex-II Pro FPGA [War06].
Due to the architectural constraints of the Xilinx Virtex-II Pro FPGA, each run-time
reconfiguration section spans a number of whole columns of the FPGA. For the inter-
module communication, we use the tristate based busmacros. Apart from the I/O area
defined by the development board, the Virtex-II Pro FPGA served our needs sufficiently
for this proof-of-concept architecture. In particular, as the IO pinning for the peripherals
is distributed all around the FPGA area, we required connection wires crossing the whole
fabric including the crossing of reconfigurable regions (see upper part of Fig. 3.17).
Figure 3.17 shows the layout of the FPGA. We use a bit-width of 8 Bit for trans-
ferring the data or address, respectively. The controller including the communication
infrastructure is very small and consumes less than 300 4 input LUTs (look-up-table).
Although, the bit-width of the data and address transfer can be parameterized, it can
become the bottleneck of the design, as the maximum amount of bus macros using the
tristate version can be limited due to architectural constraints of the FPGA.
62
3.6 Experiment
In general, the throughput of our data transfer protocols depends on the frequency of
the whole system. Furthermore, if data transfer from and to the slots must take place in
blocks before and after executing the task, the data transmission time has to be added
to the task’s execution time. However, if some results are earlier available than other
ones, we can start data transmission already during the execution of the task, decreasing
therefore the processing time of the task.
Portability
We have designed the sources of the two-slot referring to generic language constructs
to facilitate portability of the design. For example, we can change the bit-width and
other parameters by generic information. Furthermore, inferred structures, such as
Block RAM and finite state machine, were preferred as device primitive instantiations.
Moreover, we can change the number of bus macros used for the data bus, or define the
size of the memory in the controller.
Nevertheless, porting the design to another FPGA generation proved to be complex,
partly as new functional properties and design constraints arose, and partly as the
physical assignment of the reconfigurable regions and static communication resources
proved to be complex. For example, tri-state based busmacros could not be used any
more (see below). Moreover, tool or documentation support for suitable area assignment
on the reconfigurable substrate is hardly given and therefore makes porting to different
FPGAs a daunting task. Efficient physical layout proved to require expert knowledge.
3.6.2 Cryptography Example
For a more complex experiment, we have used a cryptography scenario, relying on the
symmetric cipher triple DES. The algorithm serves the requirements of the two slot
framework very well as it is naturally split up into three equally sized single DES cores
that are invoked one after another. We use a fixed key, which allows us to reduce the
area requirement of each single DES compared to one generic DES core. Run-time
reconfiguration still facilitates a regular key exchange for maintaining security.
The sequence on the architecture looks as follows: We load DES1 in slot A and start
processing in this section. In parallel, we reconfigure the other region (slot B) by loading
DES 2. Once DES 2 has started computation on basis of the results produced by DES 1
and stored in the control section, we do not need DES 1 anymore. Thus, we replace
DES 1 by DES 3. When DES 3 finally has deciphered the message, the result is stored
in the control unit region and is marked as ready for further usage.
The input data for triple DES sums up to a blocksize of 64 Bits, which is transformed
in the different DES stages without altering the length. We thus have to transmit
64 Bits between the stages of the DES algorithm. Considering the requirement of single
directional busmacros, this sums up to 128 wires between each slot and the controlling
unit. To reduce this high amount of wires, we implemented the shared memory solution.
We selected the Virtex-4 FPGA for the final design. As it represents a reasonably
changed generation of Xilinx FPGAs, we could not simply merge the Virtex-II Pro im-
plementation to this device. Most of all, Virtex-4 FPGAs lack tri-states and therefore
63
3 Two-Slot Framework
Figure 3.18: Layout of the two slot environment on a Virtex-4 FPGA. Left: the upper
slot, right: the lower slot, in the middle: the controller logic.
require lookup table based busmacros. However, the Virtex-4 series allows for horizontal
and vertical arrangement of reconfigurable regions, which eases the layout. Figure 3.18
shows the Xilinx FPGA editor display of the layout.
Moreover, we implemented a blockram of 8 Bit word-width and an address depth of
28as shared memory. As input signals, this implementation then has the 8 signals for
the 8 Bit data transfer, as well as some additional signals for control logic, particularly
reset logic. The output signals sum up to 8 signals for the data out, 8 Bit for the address
selection of the blockram, as well as again some control signals like a done signal. As
each busmacro combines 8 signals, we thus require two busmacros for the input signals
and three busmacros for the output signals, summing up to five busmacros for each slot.
For the hardware implementation, we used code provided by opencores.org. The
reference implementation basically is generic—it may serve as decryption and encryption
unit. Such a single DES core consumes about 300 slices on a Virtex-4 FPGA. If we use
a fixed key, we can reduce this size by one third now accounting for only 200 slices.
Moreover, the additional logic in each of the slots required for data transfer sums
up to about 100 slices. In detail, we require a state machine for gathering the data
in 8 Bit words from the shared memory of the controller, as well as for sending the
result back to the control area. As the DES accepts and provides the data as one block,
64
3.7 Extensions
we have additionally implemented local latches for the input and output of the DES
cipher. Furthermore, the reset logic to start the execution must be user-implemented
for partially reconfigurable regions.
In summary, we thus require more than 300 slices for one module. To derive the best
size of a slot that can host this module, we have to take the physical characteristics of
the FPGA into account. Foremost, partially reconfigurable regions on a Virtex-4 FPGA
always span a multiple of 16 CLBs (1 CLB equals 4 slices) in height. Moreover, the
width of a reconfigurable region must also equal complete CLBs, having 1 CLB as a
theoretical minimum. When using exactly 16 CLBs as the height of our slot, we thus
require a width of 6 CLBs to host the approx. 75 CLBs of our module. This closest
ceiling size of a reasonable partial bitstream then sums up to 384 slices.
The size of a partial bitstream on our Virtex-4 FPGA for reconfiguring these 384
slices sums up to about 24 kByte. If we use the Select MAP reconfiguration port of
the FPGA, we can transmit the partial bitstream using a bit-width of 8 Bit and a
reconfiguration speed of approx. 60 MHz. These parameters result in a duration of the
partial reconfiguration of approx. 70 µs.
If we set the execution frequency of the circuitry to about 50 MHz, which is a reason-
able number for the Virtex-4 FPGA, however, the reconfiguration time contrast with
the duration of the execution phase of a few µs. In detail, each single DES requires 17
clock cycles for computation. For the data transfer, we have to add another 20–50 clock
cycles. Therefore, we will result in less than 100 clock cycles for each single DES.
Moreover, the implementation of the two slot example proved to be challenging, as
much expert knowledge is required to derive the placement of the reconfigurable ar-
eas, the busmacro placement, and the generation of the partial bitstreams. Even if a
prototypical implementation is available, the porting to a specific application quickly
can become complex. Nevertheless, to summarize the cryptography example, besides
the problems of portability, which we could ease by virtue of our Part-E tool (see
Appendix A), the remarkable difference of execution and reconfiguration time must be
treated appropriately. A possible solution therefore is shown in the next section.
3.7 Extensions
The two slot framework can be extended in several ways to make up a more powerful con-
cept. To show two meaningful examples, we first introduce concepts of low power design
to the framework in the following. Thereafter, we describe how to use the framework as
an IP core within a larger system.
3.7.1 Low Power Considerations
In most cases, the duration of a RT and an EX phase started at the same time will not
be of equal length. Particularly the EX phases differ in their execution times, while the
time required for a RT phase stays the same among all modules, as we always reconfigure
65
3 Two-Slot Framework
the same area. Most often the RT phases will be longer than the EX phases on average.
In any case, the slower of the two phases has to wait for the other one to finish.
We employ this inevitable idling time due to different durations to introduce power
saving. Therefore, we scale down the processing/reconfiguration frequency of those
phases that would finish before the processing or reconfiguration in the other slot has
finished. Frequency reduction is a well-known concept in low power design to reduce the
energy consumption, thereby increasing the power efficiency.
The overall response time stays the same, if only idle times are avoided and no ad-
ditional delay is added. Basically, both slots either performing a reconfiguration or an
execution always start at the same time. To not harm the overall response time and
achieve a maximum reduction of power consumption, we have to ensure that both phases
finish contemporaneously not exceeding the initial maximum duration of the two phases.
By selecting a suitable frequency, we can ensure this constraint. Then it holds that the
pair EXiand RTi1shows the identical maximum finishing time max(tEX,i, tRT,i1).
Background on Low Power
In general, the formula for the power consumption Pof a design, which has Vdd as
working voltage and fclk as operation frequency is P1
2CL·V2
dd ·fclk. In order to
compare designs, the result then must be multiplied by the time dan operation needs
on this device, and is termed as the Power Delay: PowerDelay =P·d. In the literature,
we find two basic possibilities to save power:
clock gating Once a calculation is finished, the results are stored and the circuit is
switched off by clock gating. In absence of power leakage, no further power is
consumed after switching off the clock.
frequency reduction Higher frequencies consume more power. Thus, the clock is re-
duced and less energy is needed. In such systems, the working voltage can be
reduced additionally.
Both can produce reasonable results. Strictly following these formulas, there would
be no difference between clock gating and frequency reduction. Yet, the activation and
de-activation of clocks consumes energy and additional clock-cycles. Thus, the reduced
clock frequencies approach is often preferred due to this advantage. Furthermore, a
smaller frequency often also allows for reduction of the working voltage. As this voltage
is a squared value in the power calculation formula, results can be very effective then.
Still, frequency adaptation can consume time and therefore power as well. Sophisti-
cated methods refer to a number of fixed clock frequencies only.
Frequency Adaptation in the Two Slot Environment
As the RT phase most often will be longer than the EX phase, we have investigated
concepts of delaying the execution times of intermediate parts of algorithms. Basically,
we reduce the clock frequency of the EX phase until the duration of the two phases are
equal. To reduce the number of frequencies required, we cluster tasks into a set of clock
domains. We thus introduce a multi-clock approach known from low-power design to
partially reconfigurable systems.
66
3.7 Extensions
Basically, modern FPGAs support different clock frequencies within the same design.
Multiple clock signals are therefore fed into the clock network, which is often divided in
multiple domains, also sharing the same regions. The improvement on power consump-
tion by using different frequencies can hardly be measured on modern FPGAs. However,
the approach can be of use in the near future. For instance, Xilinx’s Virtex-4 FPGA
was also designed with respect to power efficiency [Tel05].
The physical implementation of a clock scaling on modern FPGAs bases on the fact
that Digital Clock Managers (DCMs) on such FPGAs can be used to not only stabilize
the high and low phases of clocks but also to provide different frequencies. This tech-
nique often is more accurate than generating clock frequencies by virtue of user circuit.
Alternatively, we could also use different external frequencies.
Concerning the two slot architecture, the intermediate control unit becomes respon-
sible to generate, encapsulate and bridge the different clock frequencies. Therefore, it
needs to provide different frequencies by either using the DCMs, external frequencies,
or multiplying the input clock. These frequencies are fed into the reconfigurable regions
using the different clock domains given by most modern FPGAs. Modules loaded to the
slots then connect to their required frequency, leaving the others untouched.
For the data transmission to and from the slots via the buffer or memory in the control
unit, we can rely on dedicated cores that offer a sophisticated bridge between different
clock regions. Particularly dual-ported block RAMs are capable of operating on two
different clock frequencies for each port. Alternatively, if the clocks are all a fraction
of a maximum master clock, this maximum frequency becomes the reference frequency
for the RAM. Data transmission from slots operating on a slower frequency will then
consume a multiple of the time, however, not harming the behavior of the RAM.
For the solution of using a common bus for the data transmission between slots and
control unit, we require the bus to accept clients operating on different frequencies.
Otherwise, the user would have to adapt the interface of each single module to the bus
depending on the operation frequency. We thereby would increase the design complexity
and presumably contradict the idea of using a standard bus for convenient reasons.
To finally conduct frequency scaling with respect to power consumption, the open
question of the assessment of the frequencies remains.
Clustering of Tasks with Different Execution Times
We want to constrain our design to a low number of clock frequencies, avoiding a dedi-
cated frequency for every task. Therefore, we group those tasks to partitions that have
similar execution times—EX phases—if executed by a common frequency f. Hence, the
number of partitions equals the number of different frequencies required. We assign a
dedicated frequency to each of these partitions, so that the execution times of the tasks
approximately become the same.
We assign the maximum frequency of the system fmax to the group of tasks that
have the longest EX phases. Moreover, we declare fmax to be the reference frequency
fref . All other frequencies depend on fref and thus change relatively if fref changes. In
particular if the duration of the RT phase (tRT ) is longer than the maximum of all EX
phases (tEX,max), we reduce fref until the two durations meet: tEX,max tRT . All other
67
3 Two-Slot Framework
Clusterhead
Cluster 3
Figure 3.19: Example of produced cluster formation
frequencies then change also and hence the idle time due to waiting for a reconfiguration
to be finished is reduced to a minimum.
To cluster the tasks into partitions according to the duration of their EX phases, we
have developed a heuristic that has a very low computational complexity [DH06]. The
algorithm is based on division of labor principle encountered in social insects [Hei08].
As a result of our clustering algorithm, each task will be assigned to a cluster φ.
First, we map the tasks on a line. The position of a task on the line corresponds
to the duration of its EX phase. Closely placed tasks should become members of the
same cluster, as they show similar response times if executed by the same frequency. To
derive the clusters, we use the model of division of labor of ant to delegating the role of
clusterhead to the tasks with high fitness [BDT99]. A task has a response threshold for
every other task based on the distance of the tasks on the line. Tasks engage in becoming
clusterhead when the level of the task-associated stimuli exceeds their thresholds. In
our heuristic, the probability TΘ(s) of a task to become clusterhead given a stimulus s
is:
TΘ(s) = sn
sn+θn,(3.1)
where n > 1 determines the steepness of the threshold Θ (we use n= 2). The threshold
models the fitness of the task to the role—a small threshold means that the task is a
good candidate to be the clusterhead. The fitness of a task is calculated by
θ=averneig dist
neighcount
,(3.2)
where averneig dist gives the average distance to the neighbors on the line and neighcount
gives the number of neighbors (that can be 0 (no neighbors if we have only one task),
1 (task with shortest and longest EX phases), or 2 (all other tasks)). Tasks with more
neighbors and small distances are preferred to be the clusterhead. Moreover, the stimulus
sis proportional to the round rof the algorithm (s=k·r), with kconst 0 k1.
When a clusterhead arises, it starts to select its members. The clusterhead (and each
member) helps to select the other members. The selection is done based on the distance
to the cluster. We include the (not clustered) tasks having minimum distance to the
cluster. The algorithm shows the best behavior if all clusters are built in parallel, which
we can approximate by repeatedly iterating over the set of tasks.
In general, the process repeats until the number of clusters matches the architectural
constraints of the FPGA—the number of clock domains available. Each cluster finally
68
3.7 Extensions
denotes a set of tasks sharing the same clock frequency. Figure 3.19 shows an example
output after execution of the clustering algorithm. In three rounds, it was possible to
cluster the tasks of the system. As the tasks of cluster φ2have the longest EX phases
(rightmost positions on the line), we assign to them the reference frequency fref , as no
other cluster needs to be executed by a higher frequency to obtain the same execution
time. We finally fix fref by referring to the duration of the RT phase. The frequencies
for the other clusters then become a fraction of fref .
3.7.2 The Two Slot Framework as an IP Core
Besides using the two slot architecture as stand-alone execution environment, it can be
integrated as an IP core within a System-on-a-Chip (SoC). The two slot architecture
thereby offers a transparent and sound possibility to establish hardware acceleration. In
particular, the size of the hardware circuits need not to be limited to the actual slot size,
if the circuitry can be divided into partitions that can be loaded subsequently. Hardware
virtualization thus is offered to SoCs.
The controller of the two slot environment acts as link to the remaining parts of the
SoC. The overall behavior then is as follows: An application requests the two slot ma-
chine by denoting the algorithm to be executed on it, as well as providing the data
input for the processing. Then a control thread takes care of the proper dispatching.
Therefore, the appropriate bitstreams are loaded to the slots and the data is fetched,
processed and finally written back to a specific place. Several implementation alterna-
tives are possible, particularly concerning the location of the control thread as well as
the grade of transparency of such a system. In the course of a diploma thesis under our
supervision, we investigated the practicability of the two slot environment as an IP core
within a larger system, also discussing the alternatives [Sch08].
Two gross architectural alternatives were developed. In the first one, the slot is
integrated in the bus structure of a processor system. Here, the processor itself takes
over the responsibility for supervision. In particular, the processor also triggers the
reconfiguration. Therefore, the internal configuration access port (ICAP) is included as
a core attached to the bus into the system. By virtue of the ICAP, the partial bitstreams
can be reconfigured internally from the system. Therefore, the partial bitstreams are
hold in a database, accessed by the controller and forwarded like standard data transfers
to the ICAP. The ICAP core then triggers the partial run-time reconfiguration.
In a more heavyweight architecture, the emphasis was on transparency. Therefore the
two slot architecture was enriched by an own controlling system consisting of controller
(MicroBlaze CPU), memory, timer, interrupt controller, and the ICAP. These devices are
attached to their own bus (on chip peripheral bus), which communicates by a bridge with
the main processor. The benefit of this implementation is that the control thread is part
of the two slot machine and does not load the main CPU with the task of reconfiguring
the slots as well as with the data transfer from and to the slot. Moreover, the additional
memory, which is part of the two slot subsystem, can be used for intermediate data
storage. Thus, the memory in the controller of the two slot machine can be avoided and
larger data can be transmitted.
69
3 Two-Slot Framework
In both examples, the communication between the two slots and the controlling unit
was implemented using the wishbone bus system. Moreover, for the implementation, the
Xilinx EDK (Embedded Development Kit) was used, which allows for designing SoCs
comprising of one or more CPUs and their access to peripherals via IP cores attached
to buses. Nevertheless, the capability for run-time reconfigurability is not supported by
the EDK tool and had to be implemented manually.
3.8 Lesson Learned
The two slot framework served as introductory investigation of the task of designing re-
configurable systems. Despite its simplicity, multiple challenges had to be solved, which
foremost allows us to draw the conclusion that the design of beneficial reconfigurable
systems should be guided appropriately.
Most of all, if runtime reconfiguration shall be exploited, the reconfiguration overhead
must he handled appropriately. The clear distinction of reconfiguration and execution
phase—as introduced in this chapter—as well as their strict consideration overall within
the design method helps to guide the design process and enables reasonable results
despite the overhead. Furthermore, the benefits of hiding the reconfiguration time could
be demonstrated in a systematic manner and is worth to be investigated further.
Moreover, we have shown that a clearly structured architecture eases the execution of
tasks. The simplicity of the two slot architecture brings this demand to its basics. Still,
it enables a sound hardware virtualization, as tasks are loaded into the two regions of the
architecture on demand and are executed following the ordering given. Thanks to the
alternating execution in the regions, we can interweave reconfiguration and execution,
thus hiding the reconfiguration latency. As intended by hardware virtualization, tasks
may be larger than the area available.
We have also shown that the physical implementation of an architecture for partial
run-time reconfiguration proved to be a challenge in itself. Multiple hurdles like in-
termodule communication, area assignment, fragmentation prevention, etc. have to be
considered. While the aforementioned capabilities enable sophisticated designs, the de-
velopment process itself is characterized by a mainly too high level of freedom. The
two slot architecture shows how numerous of these degrees of freedom can be con-
strained while still maintaining a reasonable performance. We thereby abstract the
hardly controllable flexibility of an FPGA and offer guided access to the fabric through
an architectural layer.
In general, we have shown that the envisioned layered approach is reasonable. In
the two slot framework example, the architecture layer gathers all the implementation
details, while the scheduling/dispatching and partitioning layer focus on the applica-
tions to be mapped onto the reconfigurable system. The clear distinction between the
architecture, the scheduling/dispatching and the partitioning layer allows for separated
optimization on the basis of well-designed interfaces between the layers. Even by virtue
of the simple two slot framework, we could see that abstraction is extremely helpful to
capture the challenges of reconfigurable computing systems design.
70
3.9 Related Work
Furthermore, the clear distinction of the responsibilities as given by the layered ap-
proach allowed us to extend the framework quite easily. We have shown two examples,
the investigation of the low power possibilities and using the two slot architecture as an
IP core. Both proved to be applicable in a straight forward manner. Concerning the
dispatching of tasks onto the two slots, we have shown the similarity to the single server
scheduling concept, whose results are valuable for several task sets.
The two slot framework definitely is part of the general idea of reducing the complexity
for the application developer, enabling her/him to concentrate on the designing of the
actual computational structure. The framework hides the problems concerning runtime
reconfiguration—communication and data transmission, reconfiguration scheduling, fre-
quency adaptation, etc. Our contributions to the approach—the scheduling by virtue
of concepts of the single server domain, the intermodule communication for transparent
data transmission, partitioning concepts to map applications to the requirements of two
sequentially executing slots—as well as the extensions—low power, IP core—expand the
abilities and the fields of application for the simple two slot architecture.
To conclude, the two slot architecture approach serves a basic factum of reconfigurable
computing—it hides the reconfiguration latency. Moreover, a high amount of extensions,
while still leaving the core architecture intact, can be thought of. On this basis, we are
motivate to further explore reconfigurable fabrics in the subsequent chapters. In partic-
ular, we are motivated to overcome the major drawback of the two slot architecture—the
strict limitation to two slots. Although powerful, reconfigurable computing can mean
more than simply alternating execution in two slots.
3.9 Related Work
In the literature, there are only a few contributions that consider architectures com-
prising precisely two reconfigurable slots. However, there are several approaches that
focus on hiding the reconfiguration latency by interweaving computation and reconfig-
uration and thereby come across architectural concepts similar to ours. We present the
architectural works first, before considering also the other aspects of this chapter like
intermodule communication and low power processing on FPGAs.
Architectures
In [GV00] a conceptual approach for a lightweight reconfiguration platform is presented
that minimizes the overall time of an application execution by means of using only two
slots that are alternately reconfigured, performing configuration prefetching. The model
used is similar to ours, and consists of two partitions on one reconfigurable device. The
authors present a partitioning methodology improving the design latency. Moreover,
they assume the reconfiguration time is comparable to the execution time, achieving a
maximal overlap. Therefore, the authors discuss so-called block processing to amortize
the reconfiguration overhead. Block processing, which is based on loop restructuring
also found in [KVGO99], executes gradually on multiple sets of input data. Yet, block
processing is applicable only for applications that process large streams of input data.
71
3 Two-Slot Framework
PipeRench [GSB+00, SWT+02]—also mentioned in Sect. 2.2.5—is a well-researched
project of pipelined reconfigurable computing systems that investigates the problem of
interweaving RT and EX phases from an abstract position. The authors call PipeRench
a programmable datapath and target virtual hardware design through self-managed
dynamic reconfiguration. Stripes are the basis of the system. PipeRench proves the
benefits of using pipeline concepts for the execution of algorithms on reconfigurable
systems. However, it is close to coarse grain systems and therefore only conceptually
related to our work.
The same conceptual relationship holds for the works on configuration prefetching,
which already were discussed in Sect. 2.2.5. We want to emphasize the early work
[Hau98]. The author basically investigates configuration prefetching on a reconfigurable
system model. He also sketches the requirements for optimal prefetch, which must con-
sider the whole architectural constraints as is done in our work.
Communication
Several reconfigurable concepts, environments, or fabrics found in the literature address
the problem of intermodule communication—the data transfer between dynamically
reconfigured parts—in various depth. Few works consider the physical implementation
problems when an algorithm is split into several partitions that are executed sequentially
as it is done in our two slot framework. However, the problem of data transfer between
temporally loaded modules whose communication requirements are also temporally, can
be found in multiple reconfigurable execution environments. All the approaches that
address fine grain reconfigurable systems like FPGAs share the general challenge of
dynamically establishing communication between modules.
Xilinx itself proposes to use so-called bus-macros as fundamental concept for commu-
nication with and between reconfigurable modules [Xil04]. Bus-macros being tristate
or look-up-table-based [HBB04, H¨
ub07] define a static communication structure that
serves most communication requirements. Thus, bus-macros or similar concepts can be
found in the majority of reconfigurable execution environments. However, the concep-
tional strategies for the communication itself are different (see below). Some of them
have additionally been designed to solve on-line connection strategies.
Several authors (e. g. [WP04, UHGB04b]) use a bus system to enable data exchange
between modules in their run time environments. The modules, which are placed on
the reconfigurable area, must plug into this bus. Therefore, each component implements
the bus transaction and an arbiter manages the bus-assignment. The bus enables de-
terministic behavior concerning different modules at the same location. Similar to our
bus-based solution, there is always some overhead for the transaction: Bus access has
to be granted by an arbiter and possible collisions lead to delay in data communication.
Recently, multiple improvements for the communication concepts of reconfigurable
designs as offered by Xilinx have been introduced. In [H¨
ub07], a communication module
called bus com module is discussed, which offers generic connection of modules in a slot-
based system. For the hardware implementation, the author thereby introduces several
sophisticated extension to the basic Xilinx busmacros, which allow him to overcome the
problem of varying location of wires when signals cross reconfigurable regions. For free
72
3.10 Summary
2D placement of hardware modules at run-time, [HKKP07] introduces a communica-
tion macro for Xilinx FPGAs. The macro comprises a homogeneous communication
infrastructure for any FPGA of the Xilinx Virtex family. Moreover, the Erlangen Slot
Machine [BMA+05] comprises of sophisticated communication concepts including bus
or shared memory based solutions.
Low Power
To approach the problem of power reduction on reconfigurable fabrics, several works
present concepts to estimate the power consumption on FPGAs [SKB02, WOK+00].
Their techniques mainly are architecture dependent, however, they often can isolate the
main power consumers. In general, the SRAM-based reconfiguration technique itself
emerges as the main challenge.
Moreover, power consumption of reconfigurable devices has been addressed in several
research efforts [MFK+00, SJ02] and most recently in [Tel05]. While the former works
describe techniques how to achieve power reduction on existing architectures, the latter
one also mentions that FPGA vendors increasingly focus on the problem of power con-
sumption. Furthermore, new architectural concepts for the design of power-optimized
reconfigurable fabrics are investigated in the literature [GZR99].
In [NB04], the authors describe dynamic scheduling for SoC platforms and investigate
scheduling algorithms regarding power-performance trade-offs. Therefore, they use clock
gating and frequency scaling to minimize the power consumption. The authors rely on
two different clock frequencies, describing the potential benefits only rudimentarily. Our
approach extends their concept by detailing the problem of frequency variation.
Finally, an interesting discussion on power saving by using a fine-grain real-time re-
configurable pipeline gives [KZP03]. Pipeline stages are disabled and bypassed whenever
data-rates are low. Basically, we inherit from this work the potential of power saving by
adapting the resources to the computational requirements.
3.10 Summary
In this chapter, we have introduced reconfigurable computing by virtue of a simple
yet flexible example, the two slot framework. The framework bases on the introduced
concept of execution (EX ) and reconfiguration (RT) phase, both composing a unity for
tasks to be computed on reconfigurable fabrics. Having two slots that can be reconfigured
independently, we can apply the concept of reconfiguration time hiding by starting the
EX and RT phases of two different jobs in parallel. We have discussed the architectural
requirements to support such an alternating execution of task in two slots, including
the important aspect of data transmission between reconfigurable regions. Moreover,
we have implemented a reference architecture, which operates with two reconfigurable
regions and one intermediate communication and control part.
As the framework targets on hardware virtualization, we have investigated partitioning
strategies that preserve the precedence constraints for proper gradual execution. The
partitions, each describing a sub-problem of the overall algorithm, then can be executed
73
3 Two-Slot Framework
in sequence. We have also discussed scheduling algorithms for a set of independent tasks
that make use of the two slot architecture. We could thereby show the proximity of our
problem to the domain of single server scheduling.
Moreover, several extensions to the architecture have been discussed, most notably the
low power considerations. Here, we have presented an example how algorithms can be
executed on reconfigurable devices using the concept of frequency scaling. Furthermore,
we have discussed the described architecture as a reconfigurable fabric IP that can be
parameterized and used in different designs such as Systems on a Chip.
74
4 Specification Graph Approach for
Reconfigurable Fabrics
The previous chapter has shown that a sound design of reconfigurable systems often is
challenging in many aspects. In particular the implementation of an execution environ-
ment that facilitates partial runtime reconfiguration demands tremendous effort and can
hardly be argued to be done anew for every single design. This chapter therefore investi-
gates a method how to map applications onto already proven runtime environments—as
typical for the so-called platform-based design. Moreover, we include characteristics of
modern FPGAs like heterogeneity and extend the design space towards multiple FPGAs.
We thereby introduce a modeling and synthesis methodology that respects the specific
requirements of run-time reconfiguration. The method is based on profound concepts,
and expands known notations and model techniques.
4.1 Introduction
The main motivation of the method of this chapter are the requirements of modern
embedded systems. These systems increasingly are multi-functional: In addition to a
fixed core functionality, they must offer application dependent behavior. For example,
mobile phones basically serve as communication devices. Therefore they can connect to
multiple networks (GSM, WLAN, etc.), which require different algorithms for transmis-
sion/reception that furthermore depend on the currently active application. Addition-
ally, mobile phones offer organizer functionality, entertainment, navigation, etc.
Basically, the computational hardware of such systems consists of a variety of process-
ing units, including CPUs, DSPs, ASICs, and increasingly reconfigurable fabrics (most
often FPGAs). Depending on the quality of service required, applications are executed
on one of these devices. Reconfigurable devices thereby are the number one choice, if
high performance and adaptability has to be granted. Furthermore, modern embedded
systems often require new updates to be loaded into the system, which can again be
offered by reconfigurable devices through a post-fabric reconfiguration.
To allow for beneficial exploitation of reconfigurable devices within such modern em-
bedded systems, appropriate modeling and design methods are required. The so-called
platform-based design concept thereby helps to consolidate reconfigurable logic as core
parts within such systems. Subsequently, comprehensive synthesis and design space
exploration methods are required. Such methods should respect the whole range of
functionality of the reconfigurable fabrics, in particular the heterogeneity and the recon-
figuration delay of modern FPGAs.
75
4 Specification Graph Approach for Reconfigurable Fabrics
FPGA
Mapping
Tasks
CPU FPGA
Architecture Graph
Problem Graph
Figure 4.1: Layered model
As comprehensive methods to exploit these characteristics within the integrated de-
sign of embedded systems are not known, we introduce a synthesis methodology for
reconfigurable systems that respects the specific requirements of run-time reconfigura-
tion in this chapter. In particular, we shape and improve approaches and methods of
embedded system design for the design of partially and run-time reconfigurable fabrics
that themselves can be part of a complex system.
4.1.1 Layered Approach
Again, the method of this chapter shall be described using the layered approach. Fig-
ure 4.1 depicts how the method can be partitioned into three layers that lay in-between
the tasks and the processing devices (one or more FPGAs, optional CPUs, etc.). Thereby,
the tasks are expressed using a problem graph, which is a description of the structure
and behavior of an application including its need for reconfiguration. An architecture
graph abstracts the run-time environment consisting of FPGAs, CPUs, etc. The map-
ping in-between both layers connects tasks to processing resources. A sound description
of problem, architecture and mapping given, we can derive Pareto-optimal solutions for
the design under development.
4.1.2 Organization of the Chapter
In detail, this chapter is organized as follows: In the next section, we introduce the
problem conceptually, including the characteristics of execution environments and the
application areas. We also focus on an approach from the literature that has influenced
our work. Then, we follow the layered model, where we first start from top, describing
the problem graph layer. Next, we discuss the architecture graph before we meet in the
middle. The three layers together also form the so-called specification graph. Further-
more, we discuss synthesizing strategies for this specification. We apply the concept on
an example, take a look at the lesson learned, and review related work, before we finally
close the chapter with a summary.
76
4.2 Concept
CPU
FPGA
ASIC
...
Slot 1
Slot 0
CPU
RAM
FPGA
IO
I
O
Slot
6
Slot
5
MAC
I
O
Slot 3
Slot 4
Slot 2
MAC
RAMRAM
Slot 7
MAC
Figure 4.2: Schematic draft of an exemplary system focusing on the execution environ-
ment based on an FPGA.
4.2 Concept
As could be seen in the previous chapters, the reconfiguration and the diversity of
FPGAs must be considered comprehensively to avoid them becoming the bottleneck
of the system. Basically, run-time reconfiguration on one hand adapts a system, on
the other hand consumes time. Moreover, module placement for heterogeneous devices
as focused here demands special care. To include partial run-time reconfiguration and
derive comprehensive methods for the design flow, we firstly state the problem abstractly,
assign it to a subject matter of the literature, and sketch the solution procedure.
4.2.1 Problem Abstraction
The problem to be mapped on our system must be given as universal directed task
dependence graph G(V, E). The execution of such a graph is similar to an homogeneous
dataflow graph, where processes are only activated for execution if all their predecessors
have finished their execution.
The system to execute this graph is a combination of computation resources such as
processors, ASICs, and reconfigurable devices (ref. to Fig.4.2). We focus on the latter
relying on an execution environment that helps to abstract the challenging details of
dynamic resource allocation on FPGAs. Several groups have proposed such runtime
environments for task execution on FPGAs, e. g. [BMA+05, FC05, KPR02, WP04].
In addition to these approaches, our environment may include all the modern FPGA
capabilities such as embedded hard cores, etc. Thus, we will have a number of hetero-
geneous slots arranged on the FPGA. Sometimes, such a structure is also referred to
as tiled. However, in our target environment the tile size and functionality must not
be homogeneous. Regardless of the layout, the slots/tiles can be reconfigured individu-
ally, however using the single and mutually exclusive reconfiguration port. For message
and data exchange, a static intermodule communication of the slots is defined by the
environment and can range from direct wiring to a common bus or NoC for all slots.
77
4 Specification Graph Approach for Reconfigurable Fabrics
Due to the heterogeneity, application mapping becomes complex, since tasks to be
executed on slots may require different resources and their execution time may change
depending on the resources available. Nevertheless, dedicated resources such as multipli-
ers or DSP units are welcome concerning performance as their hard core implementation
may notably speed up the processing.
Moreover, the reconfiguration overhead—duration of the reconfiguration phase—must
be respected. Despite the fact that heterogeneous FPGAs might decrease the reconfigu-
ration time, as more functionality is provided by fewer reconfiguration bits, the duration
usually still cannot be neglected. Moreover, the single reconfiguration port of each re-
configurable device constrains the reconfigurability and requires a sequential ordering
of the reconfiguration phases. If however our system hosts more than one FPGA, we
could also reconfigure two regions at the same time if those regions are not located on
the same FPGA, using, therefore, two different reconfiguration ports.
To summarize, for our scenario, the reconfigurable devices are abstracted by well-
designed execution environments that respect the specific characteristics of the recon-
figurable devices including runtime reconfiguration constraints such as suitable slot ar-
rangement and communication structure. Such execution environments can be designed
by FPGA experts that explore optimal solutions concerning placement, routing, etc.
The method discussed in this chapter allows for reuse of these execution environments.
4.2.2 Background
As the method spans multiple research areas, the relevant background on each of them
shall be briefly given next.
Platform-Based Design
A known methodology of the literature that argues for pre-defined and well-designed
runtime environments is the so-called platform-based design approach. Platform-based
design [KMN+00, SVM01] focuses on optimizing the system towards a specific appli-
cation, while also offering freedom for extensions. A platform may be best defined as
a library of components together with their composition rules, however not only con-
taining computational blocks but also communication components [SVCBS04]. Using
a platform avoids a design from the scratch, as it is predefined for a specific type of
application. Thereby, a platform may also be extended by adding and integrating some
more devices, growing over time.
Using a platform limits choices, thereby providing faster time-to-market through ex-
tensive design reuse. This reuse is moved beyond the individual IP block level to a reuse
of architectures of hardware and software blocks. However, the limitation of choices also
reduces flexibility and performance compared with a traditional ASIC or full-custom
design methodology [Goe02]. For the design of complex products, we can easily accept
this drawback, as we dramatically reduce development risks, costs and time-to-market.
Platform-based design is also termed meeting-in-the-middle process, where succes-
sive refinements of specifications meet with abstractions of potential implementations
[SVCBS04]. Moreover, it harmonizes with the ideas of modern embedded systems, where
78
4.2 Concept
the behavior is composed of some core functionality (e. g., determined by the applica-
tion environment) and additional dynamically selected functionalities (e. g., triggered
by user interaction). Dynamically arriving applications occupy the left over, added, or
temporarily unused resources. On CPUs, the free processor utilization can be used.
On reconfigurable devices, we would load new configurations into unused areas. Our
method presented in this chapter shows how platform-based design of runtime environ-
ments means an open-minded focus, in order to allow new tasks to enter the reconfig-
urable fabric, or to allow costly designed and reliable existing run-time environments to
be exploited for additional applications. We thereby comprehensively show a way how
to introduce partial run-time reconfiguration to the design of embedded systems.
In particular, we map our tasks onto runtime environments on FPGAs implemented by
specialists. Such execution environments then include the heterogeneity of modern FP-
GAs and their tremendous design effort can pay off. For example, such platforms allow
for providing sophisticated communication between the partitions within the fabric and
to off-chip devices. In particular for runtime reconfiguration, execution environments
are a very convenient way as they also hide design challenges.
Specification Graph Method
To avoid re-inventing the wheel, we are interested in a proven design concept as a mod-
eling and synthesis approach that accepts our problem (graph G) as input and maps
it on a runtime environment. Known approaches hardly consider partial run-time re-
configurability. We thus have to add the characteristics of reconfigurable systems like
the reconfiguration time, dedicated computation resources, heterogeneity, communica-
tion requirements, I/Os, etc. We also target on transparently adding the reconfigura-
tion latency to the scheduling. Furthermore, we want to allow multiple reconfiguration
fabrics—thus having more than one reconfiguration port in the overall system.
Our approach bases on the methodology of Blickle et al. [BTT98]. They propose
system design based on a specification graph, which consists of a problem graph and
an architecture graph. Then, the combination of all phases of the system synthesis is
an important means: the specification graph GS, the allocation α, the binding β, and
the scheduling γ. Having all these four elements in a valid combination results in a
valid overall system. The fundamental characteristic of the specification graph method
thereby is the inclusion of communication into the design as discrete nodes.
Haubelt improves this concept significantly in [Hau05, HOGT05, HTRE02]. He intro-
duces hierarchy for raising the level of abstraction and for exploring reconfigurable hard-
ware. The aim is architecture design for multiple purposes. Furthermore, he proposes
the concept of hierarchical mapping edges, which significantly improves the specification
graph model. We rely on several improvements of Haubelt’s work. Still, he does not
consider partial runtime reconfiguration in his approach.
4.2.3 Problem Solution
We model our design using two graphs: the problem graph and the architecture graph.
The former describes the tasks and their precedence constraints and the latter the pro-
79
4 Specification Graph Approach for Reconfigurable Fabrics
T1
T2
T3
RT1
RT3
RT2
C1
C2
C3
C0
T1
T2
T3
task
graph problem
graph GP
G
T1
T2
T3 G*
RT1
RT1
RT1
Figure 4.3: Task dependence graph G, extended by RT nodes to G, and corresponding
problem graph GPwith communication nodes and reconfiguration phases.
cessing elements of the execution platform. Additionally, we rely on mapping edges
between the problem graph and the architecture graph, which describe feasible bind-
ings of tasks to resources. These edges are annotated with weights such as the time or
power needed. We thus get the specification of the system as introduced by [BTT98].
Specification graphs and, in particular, their mapping edges are the underlying model
for a design space exploration. As a consequence, proper mapping edges are crucial for
reasonable results.
The final aim is to derive a binding and scheduling that is optimal in a multi-
dimensional design space. Such Pareto-optimal points are computed by a design space
exploration, based on the specification graph. For example, with our approach we are
able to improve the response time and to reduce the power consumption. The benefit is
that we can rely on optimally designed environments and thus focus on the mapping of
problems and applications onto such platforms.
4.3 Problem Graph
The two constraints reconfiguration duration and mutually exclusive reconfiguration port
demand for suitable integration into the modeling so that a comprehensive exploration
of the reconfigurability is possible. In order to be able to respect these specific needs of
(partially) reconfigurable devices, we extend our initial task dependence graph G(V, E)
via several steps into a problem graph GP(VP, EP), ref. to Fig. 4.3.
In detail, the initial graph G(V, E) only consists of task vertices V=VTand com-
munication edges E=ET. As core step extending the method of Blickle et al., we add
the reconfiguration phases VRT to derive the intermediate graph G. To connect the
new vertices VRT with the task vertices VT, every VTgets an additional input from its
reconfiguration phase VRT , depicted by additional directed edges ERT . The vertices VT
80
4.3 Problem Graph
T1RT1
RT2
RT3
RT4
Slot 1
Slot 2
Slot 3
T1
T2 T3
T4
T2
time
T3
Figure 4.4: Left the task dependence graph, right a Gantt chart of a possible schedule:
Conflict as Task 4 was reconfigured before Task 2 was executed.
and VRT , as well as the edges ETand ERT then make up the important intermediate
graph G(V, E) with V=VTVRT and E=ETERT .
By adding the reconfiguration phases to become integrated vertices of the task graph,
we basically do not extend the modeling set of the synthesis and can rely on already
existing methods. Furthermore, the specific RT nodes VRT for each task node VTim-
plicitly introduce reconfiguration prefetching to the model, as the RT nodes can be
scheduled independently. However, as this fundamental step is new to the domain of
system synthesis, it requires special care in the mapping phase, see below.
Next, as proposed by Blickle et al. [BTT98], we add the communication vertices VC
and derive the problem graph GP(VP, EP) from G. Therefore, we break all edges ET
between task vertices VTand add communication vertices VCin between. The edges ET
therefore also are split into edges from tasks VT:ET0and to tasks ET1, both denoting
the precedence constraints. By adding these new edges between task and communication
nodes that respect the previous data flow, the model is completed. Figure 4.3 displays
the result. To summarize, the problem graph GP(VP, EP) finally consist of the vertices
VP=VTVRT VCand the edges EP=ET0ET1ERT . Note that we do not have
to add communication vertices between RT phases and task nodes, as there is no data
transmitted between these two vertices.
In addition to the reconfiguration phase vertices VRT , the integrated explicit nodes for
communication form the basis for comprehensive synthesis of partially reconfigurable
hardware, as communication requirements are now part of the system specification.
Thus, we introduce dedicated communication needs of reconfigurable devices to the
modeling phase of the design process.
Nevertheless, if we directly schedule this problem graph, we might get into a conflict.
As reconfiguration phase and execution phase are independent nodes of the graph GP,
subsequent tasks could be loaded—reconfigured—on the same area of an FPGA, without
intermediate execution phase. Figure 4.4 depicts an example. In particular, scheduling
algorithms dispatching tasks in the order of ASAP (as soon as possible) would schedule
the reconfiguration phases subsequently without respecting whether the corresponding
tasks have been executed. This scenario must be prevented.
We therefore define intervals during which the area of a task must be preserved on an
FPGA—the slot that has just been reconfigured must stay in this configuration until the
81
4 Specification Graph Approach for Reconfigurable Fabrics
T1
T2
T3
RT1
RT3
RT2
C1
C2
C3
C2
Interval
Interval
Interval
Figure 4.5: Problem graph with intervals/life cycles.
execution of the corresponding task has been taken place. We apply so-called life periods
as displayed in Fig. 4.5. Life periods are often used in the domain of register allocation
and initially help to constrain the amount of registers needed. Such intervals can be
easily considered during the scheduling phase and are often solved using graph coloring,
e. g., by the left-edge algorithm or similar approaches [KP87, PKG86, VKKR02].
Note that by attaching life periods, there is no fundamental change in the formal
model or a reduction to a specific subset necessary. Thus, we can still apply known
concepts from system synthesis. The information of the life periods can be easily used
by the scheduling algorithm to prevent the conflict described above.
4.4 Architecture Graph
In order to achieve separation of behavior and structure, as proposed in the Y-chart
approach by Gajski and Kuhn [GK83], we add a separate graph for the architecture. This
architecture graph GA(VA, EA) becomes the second part of our formal model. Similar to
Blickle et al. [BTT98], GAbasically abstracts the architecture and seamlessly includes
the architecture into the design method. Moreover, compared to platform-based design,
GAalso represents the platform.
Basically, the platform/architecture of our system is an execution environment that
offers slots/tiles, which also can be combined, and an appropriate communication infras-
tructure. The graph GAtherefore basically consist of the slots and the communication.
As the reconfigurability shall become an integrated part of the execution of tasks, we also
add the reconfiguration port to the architecture graph. Figure 4.6 shows an example.
We note the following about the characteristics of the new node: While providing no
additional processing resources, the reconfiguration port is mandatory for the reconfig-
uration phase. Moreover, the reconfiguration port usually does not have connections to
the other resources of the FPGA (ICAP and special architectures being the exception).
All in all, the architecture graph allows for abstracting an execution environment.
Referring to the layered approach, GAemploys a layer of abstraction for the efficient
82
4.5 Mapping
RT Port
Slot_type
B
Slot_type
A
GP I/O
Bus
GA
Figure 4.6: Exemplary architecture graph with two different slots, reconfiguration port,
bus, and I/O.
and otherwise often time consuming usage of reconfigurable devices. The architecture
graph thereby is capable to describe the heterogeneity of modern FPGAs, as nodes must
not be homogeneous. The environments also may be the result of previous design space
explorations, which have been carried out by a design method that resides below the
one we target in this chapter.
By using such a detailed model of a run-time environment, we can take into con-
sideration scenarios, where heterogeneous resources compose the execution platform of
modern embedded systems. Moreover, we thereby are not forced to limit ourselves to
one single FPGA only. Additionally, other resources like CPUs, etc. may become part
of the architecture graph. By virtue of this flexibility, we thus resemble a comprehensive
platform-based design, as we can consider heterogeneous platforms for a set of behaviors.
4.5 Mapping
To finally combine the two graphs, we derive a combination of the problem graph GP
and the architecture graph GA. We therefore map vertices of the problem graph to
resources of the architecture graph and derive the specification graph GS=GS(VS, ES),
cf. [BTT98]. In the end, the vertices of the specification graph consist of the vertices of
the problem and the architecture graph VS=VPVA, while the edges are the edges of the
problem and the architecture graph, as well as mapping edges EM:ES=EPEAEM.
4.5.1 Basic Mapping Edges
To derive the specification graph, we have to map nodes of GPto appropriate nodes of
GAby mapping edges EM. The three elements of the problem graph require different
consideration and therefore must be investigated in more detail.
The computational vertices VTare assigned to slots, on which they can be executed.
The tasks might fit into different slots, resulting in multiple mapping edges. All these
possibilities are drawn. We attach the mapping edges with information about executing
this task in the selected slot. In particular, we annotate the execution time, which is
essential to derive a reasonable schedule. Furthermore, additional weights like power
consumption are possible and can be used to improve the results of the binding phase.
Similar, the communication nodes are mapped to communication resources. Again,
we attach specific characteristics to the edges, like duration of the communication or the
load of a bus. This information again constrains the scheduling and binding.
83
4 Specification Graph Approach for Reconfigurable Fabrics
T1
RT1 RT Port
Slot_type
A
reconfig. time
execution time
Figure 4.7: Specification graph with mapping edges of task and reconfiguration phase to
slot and slot + reconfiguration port, respectively.
Finally, the reconfiguration vertices VRT must be added to the specification graph.
Thereby the following has to be considered: A reconfiguration phase can only take place
if two constraints are fulfilled. First, the reconfiguration port must be free; second,
the slot to be reconfigured must be available. A valid mapping edge must respect
these interdependent requirements. Therefore, we rely on hierarchical mapping edges as
proposed by Haubelt [Hau05]. These edges originally were introduced to increase the
modeling power of the specification graph approach. In our case, they perfectly serve
the requirements without the need to introduce a new modeling technique. By virtue
of the hierarchical edges, we can denote the need for both reconfiguration port and slot.
Figure 4.7 shows the concept where such an edge originates from the reconfiguration
phase as source and splits to the two targets reconfiguration port and slot.
In general, a mapping edge for a reconfiguration phase is an edge EMthat has one
vertex of VPas source and at least two vertices of VAas sink. One of the sink vertices
always is the reconfiguration port. We map the reconfiguration phases on the basis of
the already placed mapping edges—considering allocated slots for the associated tasks
only. Thereby, we attach the specific reconfiguration time to the edges. Depending on
the slot we chose, we derive the reconfiguration time based on the area consumed by this
slot. This time usually stays the same, as always the whole slot has to be reconfigured.
4.5.2 Extensions
In some cases, a task might require more than one slot for execution. We can model this
requirement by applying the concept of hierarchical mapping edges. Figure 4.8 displays
an example. Obviously, the mapping edge of the reconfiguration phase must be adapted
as well, including the addition of the reconfiguration times of the two slots.
Moreover, our target system is not limited to one single reconfigurable device. In
such platforms with multiple reconfigurable fabrics—especially in the case of FPGAs—
each fabric usually provides an own reconfiguration port, resulting in the possibility
to perform reconfiguration on different devices in parallel. Such an improvement of the
flexibility of the architecture can result in an increased overall performance of the system.
However, architecture graphs for two or more FPGAs can quickly become complex or
even unmanageable. Blickle et al. [BTT98] propose a concept of hierarchy concerning
the architecture graph. We can apply this concept in order to abstract the architecture
and introduce simplicity to the specification graph.
84
4.6 Design Space Exploration
T1
RT1
RT Port
Slot_type
B
Slot_type
A
reconfig. time
execution time
Figure 4.8: Part of a specification graph with task that requires two slots for execution.
Figure 4.9 shows a fictional scenario with two FPGAs. The first architecture graph
GA1(in the middle of the specification graph of Fig. 4.9) shows the architecture fine-
granular, while the second one GA2abstracts the architecture using a coarser granularity
on a higher level only consisting of the two FPGAs. Mapping edges between GA1and
GA2build the membership of the slots and reconfiguration ports to the specific FPGAs.
In the particular example of Fig. 4.9, we have displayed a scenario where two recon-
figuration ports—thanks to two FPGAs—can be of benefit. If we select the emphasized
mapping edges of Fig. 4.9, we can increase the overall response time, as the parallel ex-
ecutable task T2and T3are placed on different FPGAs and their reconfiguration phase
can take place simultaneously. Figure 4.10 shows the corresponding Gantt chart.
To summarize the mapping, we take into account the heterogeneity of modern FPGAs,
the reconfiguration overhead (phase and time), and the mutual exclusiveness of the
reconfiguration port. Thus, we approach a realistic model for a partially reconfigurable
run-time execution environment within modern embedded systems.
4.6 Design Space Exploration
The implementation—deriving a valid combination of allocation α, binding β, and
scheduling γ—finally is done on the basis of the annotations to the mapping edges.
As this synthesis task usually is NP-complete, we need suitable heuristic methods to
gain solutions. Most often, we also have multiple conflicting design goals, such as power
efficiency and the overall response time. To solve such a multi-objective problem, a
design space exploration is recommended. In general, we focus on the Pareto-optimal
points of the design space. The synthesis to derive these points can be done by applying
evolutionary or genetic algorithms. During this synthesis, we have to select (β) a set of
valid mapping edges from the specification graph GSand schedule (γ) the correspond-
ing problem vertices onto the resources referred to by the mapping edges. Blickle et al.
propose a similar solution strategy based on genetic algorithms.
If we use a genetic algorithm to explore the solutions, the overall design flow is as
follows: On the basis of an initial population, where we arbitrarily select a number of
85
4 Specification Graph Approach for Reconfigurable Fabrics
T2
T4
RT2
RT4
RT Port
Slot_C
Slot_D
C4 FIFO
RT Port
Slot_A
Slot_B FPGA1
FPGA2
Gigabit
Gigabit
T1
T3
RT1
RT3
C3C2
C1
C5 GA1
GA2
+
further
edges
GP
Figure 4.9: Specification graph with three graphs: one problem graph GPand two archi-
tecture graphs. The system consist of two FPGAs with two individual reconfiguration
ports.
RT1
RT1 T1
RT Port
RT Port
Slot A
Slot B
Slot C
Slot D
Gigabit
FPGA 1
FPGA 2
RT2
RT2 T2
C2
RT3
RT3 T3
...
Figure 4.10: Gantt chart of a schedule of Fig. 4.9 where two reconfiguration ports are of
benefit (communication resources are displayed shortened)
86
4.6 Design Space Exploration
mapping edges for each individual of the population, we gradually improve this popu-
lation in terms of evolutionary selection (survival of the fittest). Therefore, we require
a fitness function to rate the individuals. The fitness can be derived from the charac-
teristics of the selected mapping edges, based on their annotations. We also include the
scheduling into the fitness function, similar to [BTT98].
Next, crossover and mutation will be applied to a number of good individuals, which
have been selected according to their fitness value. We thus get the next population
and can re-start the evolution. After a pre-defined number of iterations or if no further
reasonable improvements occur, we stop the algorithm. Due to the randomness of the
generation of the individuals, we also gain individuals, whose selected mapping edges do
not denote a valid solution. By degrading their fitness value, we can temporarily accept
such individuals, see below. Among the fittest individuals of each population, we always
select those which form the Pareto front, gradually improving this front.
An individual thus is a selection of mapping edges, referred to as chromosomes. The
width of a chromosome is the same as the number kof mapping edges EMof the
specification graph GS. The binary entries are set to true, if this mapping edges is
selected and set to false otherwise. Comparing the genetic algorithm to its natural
counterpart, we refer to the entries of the individuals also as genomes.
Thereby, we will also generate individuals that cannot be synthesized feasibly due to
several conditions. (1) There might be vertices VPof the problem graph which have no
binding to any resource, as no mapping edge leaving them is selected. (2) Some vertices
VPmight have more than one mapping edge linking VPto several different resources,
resulting in multiple bindings. (A sophisticated scheduler might still be able to schedule
such a situation by selecting a suitable binding itself. However, we should degrade
the fitness of such an individual.) (3) The selected mapping edges might also harm the
necessary precedence relations of RT phase and task execution concerning the resources.
They might link to different slots and therefore a valid configuration cannot occur. The
task will suffer starvation. (4) A similar problem occurs for the communication between
slots. If for adjacent vertices of the problem graph the selected mapping edges result in
slot and communication resource selections that have no links in the architecture graph,
data transmission is impossible.
We decided to cover all these problems by the scheduler, which becomes a core part
of the fitness function. Therefore, it either rejects a schedule completely (resulting in an
infinite overall scheduling time) or degrades the fitness function significantly.
Additionally to the scheduling, the fitness function must rate the solutions in terms
of other parameters attached to the mapping edges. Most often, the power consumption
will be given. We can add these values from all selected mapping edges within the fitness
function. Thus, the fitness function rates the binding and the scheduling. It returns as
fitness value a vector of the parameters of this individual. For our example, we get the
overall response time and the accumulated power consumption. We can arrange these
results in space and select the fittest among them for deriving the next population. The
selection is finally done randomly, however preferring those with a high fitness value.
For the two important steps crossover and mutation, which follow the evaluation of
the fitness, we base on the standard procedure of genetic algorithms [Hol92].
87
4 Specification Graph Approach for Reconfigurable Fabrics
ASAP ALAP
level
1
2
3
4
5
6
7
8
9
Figure 4.11: Task graph with communication and reconfiguration phases ordered accord-
ing to ASAP (left) and ALAP (right).
Scheduling
The scheduling, which is a fundamental part of the fitness function, demands special care.
In particular, we have to investigate how the added reconfiguration phases constrain the
scheduling. In the following, we thus give an overview of the scheduling, including
a prototype example of an algorithm. Basically, there are good heuristic scheduling
algorithms for a given allocation and binding available, so it is not necessary to load the
evolutionary algorithm with this task.
The input for the scheduling comprises the following: First and foremost, we base
on the selected mapping edges EMof the specification graph GSas generated by the
generation of individuals (binding). Moreover, we rely on the life periods denoted within
the problem graph GPand the adjacency information of nodes given by the architecture
graph GAand the problem graph GP.
We use a list scheduling—the nodes are sorted into a list, which basically denotes the
scheduling sequence. The list is derived by referring to the possibility to arrange the
nodes over time. The simplest way is to order the nodes according to ASAP (as soon
as possible) and first add all nodes of level 1 to the list, then the nodes of level 2, and
so on. However, considering an ASAP ordering, we would insert all RT phases at the
beginning of the list, as they are all in level 1, see Fig. 4.11.
In contrast, ordering the problem graph according to ALAP (as late as possible), we
insert only those RT phases into the first level, whose corresponding tasks are assigned
to level 2. This ordering matches the requirement of a chronological order of the RT
phases according to their corresponding task much better than ASAP. Therefore, we
prefer ALAP to fill our list for the scheduler.
88
4.6 Design Space Exploration
Algorithm 1 Scheduling Algorithm
1: LGP
2: while L6=do
3: vPnext(L)
4: if vPhas only scheduled predecessors then
5: if vPVTthen
6: if vAVAselected by vP: activated by suitable interval then
7: schedule vP
8: LL\vP
9: else
10: degrade the fitness and exit scheduling
11: end if
12: else if vPVRT then
13: if vAVAselected by vP: no intervals pending then
14: schedule vP
15: LL\vP
16: end if
17: else if vPVCthen
18: if vAVAselected by vP:˜vP(predecessors and successors of vP):
adjacent ˜vAVAselected by ˜vPthen
19: schedule vP
20: LL\vP
21: else
22: degrade the fitness and exit scheduling
23: end if
24: end if
25: end if
26: end while
As the fitness value bases on the overall response time of a schedule, we have to
schedule the task set according to the selected mapping edges. In detail, to compute the
overall response time, we deploy a vector that holds the resource occupation, including
an additional entry for the life periods. The width of the vector resembles the amount
of resources VAof the architecture graph GA. Each basic entry denotes the added
computation time of this resource (including unavoidable idling times). In the additional
entry for the intervals, we can set and unset the interval’s number to denote whether or
not an interval is active on that resource. We then always refer to the interval entry prior
to scheduling a reconfiguration phase. By gradually filling and updating this vector, we
eventually derive the schedule. In the end, the maximum time over all vector entries
denotes the overall response time of the schedule.
In particular, each time a vertex vPcan be scheduled (see below for the detailed
conditions to permit scheduling), we add the time (reconfiguration time or execution
time) of the mapping edge to the selected resources of vP. If more than one resource is
89
4 Specification Graph Approach for Reconfigurable Fabrics
occupied by vP, we first equal the times of the selected resources in the vector to become
the maximum time of the selected resources, before adding the time given by the mapping
edge. Particularly RT phases (VRT ) require both, the reconfiguration port and the slots
to be free as they start the reconfiguration in all selected resources simultaneously. We
thereby unavoidably add idling times to some resources.
Concerning the idea of prefetching reconfigurations, we always scan the whole set
of unscheduled tasks for nodes that are ready for scheduling. A vertex is ready for
scheduling if all its predecessors have been scheduled. We thus can schedule RT phases
out-of-order, as these vertices are always ready for scheduling, because they do not have
predecessors. However, the ALAP order guarantees that we still prefer those RT phases,
whose corresponding tasks have an early position in the list.
Algorithm 1 shows the pseudo-code of the algorithm implemented. The algorithm
operates on a list Lthat is initially filled with all nodes vPof the problem graph GP.
The scheduler takes vertex vPout of the list L, if vP’s predecessors have been scheduled.
We decide based on the type of vPwhat to do next.
If vPis an ordinary task (EX phase) VT, we check whether all slots required have been
reconfigured. This verification can be done by considering the interval (life period) that
are assigned to this task. Only if all resource nodes (slots) are blocked by the correct
interval, a valid scheduling can follow. The resource vector is updated and the node
removed from the list L. If the interval constraints cannot be fulfilled, we degrade the
quality of this individual and exit scheduling.
In case vPis a RT phase (VRT ), we first check whether the selected slots to be
reconfigured are unoccupied—no intervals have been activated for these slots. If they
are occupied, we skip this vPand schedule it later by not removing this node from the
list L. Otherwise, we occupy the reconfiguration port as well as the necessary slots
and schedule the node. We also set the interval entry of the resources to the interval
number assigned to this RT phase. As explained above, the finishing time of this
reconfiguration procedure equals the reconfiguration time added to the maximum time
of the corresponding entries in the resource vector.
Please note that intervals could also be substituted by a search based on the current
condition of the nodes of the problem graph as well as the nodes of the architecture
graph. However, only in the simplest case the reconfiguration that already resides in
the region to be reconfigured (waiting for a execution that could not be scheduled yet,
refer to Fig. 4.4) belongs to a task that is adjacent to the task, whose reconfiguration
shall be performed, making a complete traversal of the problem graph necessary in the
worst case. Moreover, if a reconfiguration phase shall take place on several regions, we
have to start the search for each resource the reconfiguration phase maps to.
Finally, for communication vertices, we verify whether data flow between predecessor
and successor nodes can take place according to the selected resources. At least one
of the requested resources must offer a connection to the resources occupied by the
previous or succeeding node. Again, we exit the scheduling and degrade the fitness, if
communication constraints cannot be met.
90
4.7 Experiment
Extensions
To also integrate ideas of reducing the reconfiguration times by considering the dif-
ferences between configurations (refer to Sect. 2.2.5 and 3.5), the scheduling could be
extended. Prior to adding a new task to the reconfiguration port, the scheduler there-
fore would have to derive the actual reconfiguration time of this task considering the
already loaded bitstreams in the slots. However, the difference between configurations
is hardly to calculate on the high level of abstraction our method targets at.
Using the same strategy, the scheduler may also take care of loops—traveling along
the same vertices of the problem graph more than once. In the case of loops, caching
of reconfigurations can be of benefit. If slots already hold the correct configuration, we
get a configuration time of 0, and the reconfiguration phase can be dropped—termed as
caching configurations. We still have to set the life period to the corresponding slots for
a proper mapping of the task that belongs to this reconfiguration phase. Furthermore,
a life line as presented by us in [ID05] may help to properly implement the loops. Such
a line takes care of the causality of the tasks of a loop over the iterations. In detail, each
new iteration stretches a new line over the initial node(s) of this loop. After mapping
one node, the line gets pushed to the successors of this node. As no two life lines may
overtake each other, we will always start a vertex VP,n of iteration nafter VP,n1of
iteration n1 has finished its execution.
To summarize the scheduling, each type of vertex is treated slightly differently. How-
ever, as the RT phases are included as ordinary nodes, the scheduler can intrinsically
include the reconfiguration into the overall schedule, enabling reconfiguration time hid-
ing. Moreover, we directly rely on the information of the life periods that was specified
during the problem graph generation. The scheduler thereby might delay the RT phase
of a task ready to schedule and prefer another RT phases whose slots are not occupied
by a life period.
4.7 Experiment
To demonstrate the practicability of the method, we sketch an example application in
the following. We therefore assume an execution environment implemented on a mobile
device onto which we map a scenario of the image processing domain. In the scenario,
we consider resource constraints of mobile devices and show a possibility to maintain
usability. Based on the specification, we derive solutions and explore them.
Image Processing on Mobile Devices
Most of modern portable devices comprise of an attached or built-in camera that is
capable to take pictures. Immediately after taking a picture, users often want to manip-
ulate their pictures by applying filters, etc. As such algorithms often require relatively
high computation power, they may be executed in hardware. To offer a variety of filters,
reconfigurable fabrics can help, as they can temporarily host circuitries of the filters.
Moreover, typical mobile devices often comprise of relatively large storage but low
CPU computing performance. Image filtering applications often are an add-on to the
91
4 Specification Graph Approach for Reconfigurable Fabrics
take
picture
filter2filter1
RT3 RT2
C1 C2
C4
problem
graph
GP
RT Port
Slot_type
B
Slot_type
A
RAM
GA
CPU
filter3
filter4 filter5
RT4
RT5 RT6
Slot_type
C
C5 C6
C8 C9
C7
C3
FPGA
G’A
Figure 4.12: Specification graph of image filtering using limited resources, for the map-
ping edges please refer to Table 4.1.
basic behavior and therefore must be matched to existing processing resources installed
in the device. Our method targets on exploiting such a given platform.
In detail, the application examples is as follows: The picture taken by a camera is
displayed and additionally stored in the RAM. The user can now select among a set of
filters, which are applied to the picture taken. To increase the response time—providing
the filtered images as fast as possible—we pre-compute the most likely set of filters and
store the image results in the RAM for fast access.
In Fig. 4.12, we depict a specification graph of such a scenario. The problem graph
on the left side consists of the input node for taking the picture, as well as five filter
nodes that partly also built on each other. The original problem graph is enriched by
communication as well as reconfiguration nodes. Furthermore, we assign intervals (life
periods) to the problem graph, which are not depicted due to complexity reasons.
For the architecture graph, we assume a heterogeneous environment consisting of a
small CPU, a memory, and three reconfigurable regions on one FPGA. The slots are
heterogeneous—they are of varying sizes and host different dedicated resources. As the
method mainly focuses on reconfigurable fabrics, we leave the CPU abstract and do not
detail it further. Basically, we require the CPU for accessing the input image. Data
transfer between the slots and between slots and CPU takes place via the memory.
The mapping is simply done by first estimating (e. g., using algorithmic synthesis) the
resource requirements of the tasks and communication nodes. Depending on the require-
ments, we can map the computational tasks to one or more slots. During drawing the
mapping edges, we also attach the execution time estimation. Then, the reconfiguration
nodes can be mapped to the nodes of the architecture graph, referring to the previously
drawn mapping edges of the computational nodes and including the reconfiguration
port as additional resource for each mapping. We thereby annotate the reconfiguration
times to the edges. Finally, the communication nodes are assigned to resources. In
92
4.7 Experiment
A B C A + B B + C A + C A + B + C
filter 1 200 250 300 150 100 250 100
filter 2 175 100 125 n/a 100 125 75
filter 3 100 250 n/a 150 125 n/a 100
filter 4 200 150 150 150 150 100 50
filter 5 75 150 n/a 75 100 n/a 75
Table 4.1: Execution times of the filters in one or more slots given in µs
take
picture
filter1 filter2
filter3
filter4
filter5
RT2RT3
RT2
RT4
RT4
RT5
RT5
RT6
RT6
C
4C
5C
6
RT3
filter1
C
7C
8
C
9
C
3C
2
C
1
CPU
Slot A
Slot B
Slot C
RT Port
RAM
RT3
Figure 4.13: Example Gantt chart of the image filtering example derived from a Pareto-
optimal solution
the scenario, the only architectural node that can handle communication is the mem-
ory node. Moreover, we quantitatively add assumptions about the power consumption
to the mapping edges. For each reconfigurable slot, which is used or reconfigured, the
power consumption thus is assumed to be 1 unit. Two slots require 2 units, and so on.
Figure 4.12 displays the specification graph. The mapping edges area omitted because
of complexity reasons an can be easily derived by referring to Table 4.1.
Next, we have to derive a Pareto-optimal binding and scheduling out of this specifi-
cation graph based on the mapping and the annotated weights of the edges. This step
is done by using the evolutionary algorithm as described in the previous section.
Results
We assume a comparable duration of reconfiguration phase and execution time of the
filters. Referring to the cryptography example of the previous chapter (Sect. 3.6), the
reconfiguration phase of a reasonable sized slot sums up to approx. 100 µs. We therefore
have used the reconfiguration times for slotA: 100 µs, slot B: 75 µs, and slot C: 75 µs.
Furthermore, for each communication vertex VC, we assume a duration of 15 µs to
transfer a complete image.
The number of mapping edges, which equals the width of a chromosome, sums up to
70. In detail, we have one mapping edge for the take picture task, which maps to the
CPU, nine mapping edges for the nine communication vertices VC, which map to the
RAM, 30 mapping edges (refer to Table 4.1) from the filters to one, two, or three slots,
93
4 Specification Graph Approach for Reconfigurable Fabrics
and the same number (30) mapping edges from the reconfiguration vertices VRTto the
reconfiguration port and the slots, which are to be reconfigured.
We used 30 individuals for each population. As the example thus is of low complexity,
we could derive valid results in less than 200 rounds of the genetic algorithm. Moreover,
the binding of the take picture task as well as of the communication nodes already is
determined by the mapping edges, as there are no alternatives for these vertices. We
thus have set these entries of the chromosome to valid (1) as the default value.
The Gantt chart of one of the Pareto-optimal solutions is displayed in Fig. 4.13. In the
example, several filters are executed in parallel. Moreover, the concept of reconfiguration
phase prefetching can be seen in the figure. For example filter 4 can start immediately
after the data transfer C 5 is done, as its configuration could be loaded during the
execution phase of filter2.
4.8 Lesson Learned
To conclude, influenced by the platform-based design concept, we have defined a model
and methodology to explore heterogeneous and partially reconfigurable systems. By
virtue of the model given as a specification graph, we can synthesize applications onto
run-time execution environments that have been designed by experts. Moreover, we do
not rely on vendor tools and their adaption to modern FPGAs, but integrate FPGAs
into standard design techniques, focusing on the ability to partially reconfigure FPGAs
during run-time. The general result of this chapter thus is an approach for synthesizing
reconfigurable systems based on the paradigm of platform-based design.
The method contributes to open reconfigurable systems to a broader number of en-
gineers or system designers. A main reason is the abstraction from the difficulties of
reconfiguration by providing an integrated means how to explore the resources of re-
configurable devices. These resources are treated as ordinary computational nodes onto
which tasks can be mapped. The additional reconfiguration phases then introduce the
reconfigurability to the system, for which the reconfiguration port also has been added.
The method is in line with the lesson learned of the two slot framework, where we
could see that the design of environments that allow for partial runtime reconfiguration
is challenging. The synthesis approach of this chapter enables to exploit runtime envi-
ronments implemented on modern FPGAs, including their heterogeneity. We can rely
on a given basic environment and map our architecture on this environment. Moreover,
we improved the concept of equally sized slots towards heterogeneity.
The focus on execution environments eases the final (partial) bit-stream generation
for the FPGAs. Moreover, having specific areas, there is no external fragmentation.
However, the predefined execution environment in general and the fixed sizes of the
reconfigurable regions in particular can also become a drawback of the method. If
algorithms do not match the execution environment given, we have to initiate a complete
new design, under the focus of generating a different task dependence graph.
Due to this constraint of having a fixed execution environment, the approach may fail
to yield optimal results in terms of user requirements. On one hand, the capability of
94
4.9 Related Work
reconfigurable devices to not only offer partial reconfigurable areas but also to adapt
the whole execution environment is sacrificed towards an effective mapping method.
On the other hand, the mapping as presented hardly can cover all information about
an application. For example, priorities of tasks so far are not part of the approach.
Moreover, the genetic algorithm may take a long time to find a solution, which in the
end is obvious if the original intention of the algorithm would have been considered.
For example, the computation of the bottleneck inevitably should be done on the most
well-performing architecture node.
Nevertheless, the method of this chapter brings us towards true and realistic model-
ing of reconfigurable FPGAs, respecting all constraints—the communication (routing),
placement, heterogeneity (hard core multipliers, etc.), reconfiguration times, reconfig-
uration constraints, I/O restrictions. Deriving a suitable environment on an FPGA is
a difficult task, as all these challenging problems have to be considered. Relying on a
fixed environment makes the system also better concerning fault tolerance, reliability,
etc. We can thereby leave this tasks to experts of the FPGA domain.
Moreover, the concept of our method can also be used for large solution spaces, as
the specification graph can be used as solid basis for heuristic algorithms. In particular
the modeling and the derivable synthesis presented in this work will deploy their entire
capabilities in larger systems. However, already the small example of an image filtering
problem has shown the value of the method.
Finally, the method of this chapter basically is an extension of the specification graph
approach to cover also partial reconfiguration. No new method had to be introduced,
only improvements to the scheduling algorithm are necessary, which are transparent to
an application designer. Thus, for designers used to the specification graph approach, the
extensions of this chapter allow her/him to also consider partial run-time reconfigurable
devices during the design space exploration.
4.9 Related Work
Besides the extension of the work of Blickle et al. [BTT98], which we have done in this
chapter, we find also some other works related to synthesis for reconfigurable devices.
An interesting approach targeting reconfigurable devices with partial dynamic recon-
figuration capability shows a physically aware hardware-software partitioning scheme
for minimizing application execution time [BBD05]. The authors consider the exclusive-
ness of the reconfiguration port and the need for adjacent free columns to place tasks.
Similar to the Blickle/Haubelt approach, they start with task dependence graph. On ba-
sis of the Kernighan-Lin/Fiduccia-Matheyes (KLFM) algorithm they have developed a
heuristic for synthesizing such task dependence graphs on partially reconfigurable archi-
tectures. Results of this approach support the method of this chapter. Additionally, we
consider communication as an integral part and therefore integrate it into the synthesis
and design space exploration. We also improve the concept of heterogeneity and allow
differently shaped slots. Furthermore, our method supports environments composed of
multiple reconfigurable fabrics.
95
4 Specification Graph Approach for Reconfigurable Fabrics
Eisenring and Platzner [EP02] use within their framework for design and implemen-
tation of reconfigurable systems also the approach of problem graph, architecture graph
and mapping. As they do not add reconfiguration phases and reconfiguration port to
the graphs, they provide a different concept how reconfiguration can be achieved, not
primarily focusing on the reconfiguration time prefetching.
Despite little consideration on nowadays heterogeneous FPGAs, [HMM04] presents a
sophisticated concept of modeling and optimizing run-time reconfiguration using evolu-
tionary computation. The approach focuses on optimally reducing the run-time recon-
figuration overhead during the HW-SW partitioning stage, which is done by detecting
functional commonality.
In [MSV00a, MSV00b], the authors describe a hardware-software partitioning and
scheduling approach for dynamically reconfigurable systems. Their method targets at
multi-rate, real-time, periodic systems, based on genetic and list scheduling algorithms.
In [JYLC00] a HW/SW co-synthesis for run-time reconfigurable systems is presented,
relying on an exact algorithm (ILP) and a KLFM-based approach. The ILP algorithm
considers the single reconfiguration controller bottleneck and reconfiguration time hid-
ing. However, while scheduling, the authors of the algorithm presented do not consider
physical task placement constraints.
4.10 Summary
In this chapter, we have presented the integration of partially reconfigurable resources
(FPGAs) into the specification graph synthesis, respecting the specific requirements of
the reconfigurable devices, such as reconfiguration port, intermodule communication,
etc. Therefore, we combined and expanded the specification graph approach, so that we
can easily include partial reconfiguration and the corresponding reconfiguration times to
our system. In general, our approach targets on the mapping of problems onto runtime
execution environments (platform-based design). Relying on already proven concept
of system modeling, known solutions for system implementation or exploration can be
applied. We extend the known methods in order to handle specific requirements of recon-
figurable systems, in particular heterogeneity and reconfiguration time. Furthermore,
our work can handle multiple FPGAs having multiple reconfiguration ports. Such sys-
tems bear more parallelism. The final implementation is derived by a genetic algorithm,
including a sophisticated scheduler for the fitness function of such an algorithm.
96
5 Reconfiguration Port Scheduling
The last chapter showed that fixed execution environments gain benefits for comprehen-
sive exploitation of reconfigurable computing in several application areas. One of these
application areas is the domain of reactive systems, as shall be shown in this chapter.
Such systems constantly interact with their environment and therefore often demand for
real-time behavior. They thus comprise of tasks having hard deadlines, for which appro-
priate scheduling algorithms must be offered. If these tasks are loaded dynamically onto
reconfigurable fabrics, the scheduling algorithms must respect the characteristics of re-
configurable computing—reconfiguration overhead, mutually exclusive reconfiguration
port, etc. In this chapter, we introduce a new approach for executing tasks in time-
shared slots of a pre-defined execution environment under real-time constraints. The
core concept of the approach bases on the loading of the bitstreams through the single
reconfiguration port and thereby enables the application of well-known mono-processor
scheduling algorithms.
5.1 Introduction
When partially run-time reconfigurable FPGAs are to be used as resources in hard real-
time systems, the two dimensions area and time have to be considered in the focus of
availability and deadlines. In particular, area requirements must be guaranteed for the
duration of the tasks, so that the tasks can meet their deadlines. Thereby, reconfigurable
devices can execute several tasks in parallel, which requires additional management.
To conduct proper task allocation—preventing fragmentation, offering communica-
tion, etc.—execution environments that abstract the space demand by dedicated pre-
defined reconfigurable regions are a great help. Several of such execution environments
have been proposed and implemented by different research groups. The environments
thereby present a promising platform for real-time task execution.
For real-time scheduling tasks onto such platforms, however few approaches exist in
the literature. Moreover, these approaches mainly neglect a fundamental bottleneck:
the reconfiguration port. As all resource requests are served by this mutually exclusive
device, profound concepts for scheduling the port access are vital requirements for FPGA
real-time scheduling. The exclusive port is occupied for some reasonable time during
reconfiguration, which cannot be neglected.
Thereby, a specific characteristic of the reconfiguration port, which looks restrictively
in the first place, helps us to reduce the problem to a well-researched area. As the
port must be accessed sequentially—the reconfiguration requests are served mutually
exclusive—we can inherit and apply mono-processor scheduling concepts for the schedul-
97
5 Reconfiguration Port Scheduling
FPGA
Execution Environment
Real-Time Scheduling
Task Sets
LUTs, etc.
Slots
Reconfiguration Port
OS Applications
Figure 5.1: Layered Approach
ing of the reconfiguration phases of tasks to be executed on the execution platform.
Mono-processor scheduling algorithms have been thoroughly discussed in several works.
Both scenarios—mono-processor and reconfiguration port—must derive a sequential
schedule for an exclusive resource. However, a sequentially scheduled reconfiguration
phase only prepares the actual processing of a task, which still shall be executed in
parallel to other tasks residing in neighboring slots. Reducing mono-processor scheduling
algorithms to a slotted architecture that allows for parallel execution of tasks therefore
seems limiting at the first view. However, we show in this chapter that we still can meet
the overall requirement of parallel executing tasks meeting their deadlines, in some cases
even superior to other mapping or scheduling approaches, for example, as the concepts
and algorithms of mono-processor scheduling are of low complexity.
Thus, we investigate the application of single processor scheduling algorithms to task
reconfiguration on reconfigurable systems in this chapter. We determine necessary adap-
tations and propose methods to evaluate the scheduling algorithms for this novel concept.
In particular, the reconfiguration process is implicitly integrated in a framework for real
time execution of tasks on FPGAs, where the sequential access of the reconfiguration
port is beneficially used for meeting deadlines.
In detail, we investigate several scheduling strategies known from single processor
real-time systems, where tasks can arrive at the same or arbitrary times to the system.
We investigate independent task sets and propose a novel approach where a task may
be preempted in its reconfiguration phase, in order to achieve a feasible schedule. In
general, the interesting problem arise, if we have less slots than tasks available. We
thereby target on resource sharing of tasks.
5.1.1 Layered Approach
As displayed in Fig. 5.1, the real-time reconfiguration port scheduling again is based
on an abstracting layered approach. The input given to the scheduling layer are task
sets, which are executed by a real-time scheduling algorithm residing in the layer. A
vital characteristic of the reconfiguration port scheduling approach is the reliance on a
well-designed abstracting execution environment. In particular, we use a homogeneous
slot-based execution environment to derive well-designed reconfigurable regions and to
abstract from the low-level reconfiguration issues of FPGAs. On this abstraction, we
apply the reconfiguration port scheduling.
98
5.2 Concept
As an extension, the scheduling algorithm can be activated through an operating
system (OS). Then, the operating system dispatches the tasks by referring to the real-
time scheduling layer. Basically, an operating system is optional but still recommended.
The focus of this chapter thereby is on the scheduling layer.
5.1.2 Remainder of the Chapter
The rest of the chapter is organized as follows. First, we discuss fundamental considera-
tions of the reconfiguration port scheduling approach. Then, we explain the reconfigura-
tion port scheduling in more detail relying on aperiodic task sets. In Sect. 5.4, we discuss
periodic task scheduling and how to guarantee feasibility for this most common scenario
of real-time scheduling. Then, we investigate caching techniques for reconfiguration
port scheduling. Before a discussion of the lesson learned, we report on experiments
conducted in the context of this work. Finally, we list related work and summarize the
chapter.
5.2 Concept
Our real-time scheduling layer accepts task sets and dispatches them on an execution
environment. As this environment is an essential part, we first discuss the required
architectural characteristics, before we define the problem of real-time task scheduling
and finally introduce the reconfiguration port scheduling.
5.2.1 Execution Environment
A sound execution environment allows us to concentrate completely on the specific prob-
lem of reconfiguration port scheduling. Basically, the execution environment therefore
must comprise a number mof slots to accept tasks for arbitrary execution in one of the
slots. Hence, several constraints for the execution environment arise.
Foremost, to allow for arbitrary slot allocation, we require homogeneous slots—slots
having similar characteristics, particularly concerning the area. Thus, each task that
meets the area constraints of one slot can be executed in any of the slots. We thereby
avoid external fragmentation by accepting internal fragmentation. Some architectural
concepts additionally allow for dynamic width assignments of the tasks, while others
hold the size fixed. For our method, the latter one primarily serves our needs.
Furthermore, the environments should abstract communication by offering a suit-
able communication structure, e. g. such as in the Erlangen Slot Machine approach
[BMA+05]. Here, the communication to the peripherals is established through a run-
time reconfigurable switch, allowing for any peripheral to be connected to any IO pin of
the reconfigurable fabric. Thus, distinct communication channels are made up for each
executing task. Alternatively, the communication with peripherals can also be achieved
via a shared bus large enough to not impose the bottleneck of the system. Note that
the tasks do not require inter-task communication, as they are independent.
99
5 Reconfiguration Port Scheduling
Slot A
Slot B
FPGA
Single recon-
figuration port
Set of n partial bitstreams Slot C
Slot D
Figure 5.2: Execution environment having homogeneous slots
Moreover, we require a reconfiguration controller for the dispatching of the arriving
tasks. This controller, which can also be located externally, provides the access of our
real-time scheduling requests to the execution slots. Often, a CPU is included to the
system as hard- or soft-core. This CPU then handles the reconfiguration.
A suitable run-time execution environment may look like the prototype displayed in
Fig. 5.2. Concerning the communication, we assume one bus that runs on the highest
possible frequency and allows each slot to demand the bus only a fraction of the highest
speed (time sharing). As tasks on FPGAs generally are clocked relatively low, this bus is
assumed to not become the bottleneck of the system. Partial reconfiguration capabilities
enable a single slot to be reconfigured keeping remaining ones in execution.
Recently, several authors have proposed appropriate architectural concepts for fine-
grained run-time reconfigurable systems [UHGB04b, WP04]. The architectures usually
base on Xilinx Virtex FPGAs and comprise a specific number of slots, in which tasks are
dynamically allocated and executed. The majority of these environments additionally
implements a bus for task communication and a CPU for supervision.
Efficient executing of tasks on such environments still is not a trivial problem. Apart
from area assignment, de-fragmentation and communication problems, which are ex-
tensively studied and abstracted on the above mentioned platforms, the reconfiguration
itself demands further investigation. The duration of the RT phases and the sequential
access of the reconfiguration port require special care.
5.2.2 Problem Abstraction
In order to formulate our aim to guarantee real-time constraints, we first abstract the
dispatching of tasks onto the above mentioned execution environments. If we have n
tasks to be executed, each in one of the mslots, and m < n—the number of slots
is smaller than the number of tasks to be executed—we have to reuse the same slot
for multiple tasks. Moreover, all tasks are loaded (by means of slot reconfiguration)
through one single port. In order to handle this limitation of resources, we need a
suitable mechanism to schedule the tasks.
Again, we model every task of our system with two different phases. The reconfigu-
ration phase (RT) represents the configuration of the hardware itself. The RT phase
needs to occur before the second phase, which is the execution phase (EX ). Figure 5.3
100
5.2 Concept
RT
RT RT
RT
EX
EX
EX RT
EX ...
Slot 1
Slot 2
Slot 3 EX
Figure 5.3: Example occupation of three slots.
shows these two phases of tasks scheduled onto three slots. Horizontally, we display the
available slots and their occupation over time. As all tasks have the same size, the RT
phases are of the same duration.
The task sets accepted by our real-time scheduling layer must follow standard require-
ments of real-time scheduling theory [But04]. We schedule a set Γ of nindependent
tasks τi. The tasks τihave different execution times tEX,i and can have periods Ti. All
tasks have relative deadlines Di. This deadline of the execution time is equal to the
period, and can be computed for the kth instance by di,k =Ti·k, or di,k =Ti·k+φi,
if an offsetting phase must be considered. The meeting of this deadlines is essential for
a correct behavior of the system. Furthermore, every task has a reconfiguration time
tRT,i, which is in our case constant for all tasks due to the execution environment that
harmonizes area requirements.
As mentioned in the previous chapters, reconfiguration times can be reduced by apply-
ing a difference based reconfiguration and only changing those cells that alter between
two subsequent configurations. However, in the worst case still the whole configuration
information of a slot must be reconfigured. As we want to give hard real-time guarantee,
we have to consider this worst case and therefore have to use the reconfiguration time
given for a complete reconfiguration of one slot.
5.2.3 Reconfiguration Port Scheduling
The two characteristics of the reconfiguration process—sequential reconfiguration port
and duration of the reconfiguration—enable the appliance of methods of the single pro-
cessor domain: We assign reconfiguration phases sequentially to the exclusive reconfigu-
ration port. If the reconfiguration phase for each instance of a task finishes early enough,
the task can meet its deadline.
We thus schedule the reconfiguration phases on the reconfiguration port and therefore
are interested in when the RT phase must have finished. Therefore, we introduce the
relative (absolute) deadline D(d
i). This deadline is computed by
D=DitEX,i (d
i=ditEX,i),(5.1)
denoting the latest relative (absolute) finishing time for an RT phase. If the reconfigu-
ration for a task can be completed before D(d
i), the task will meet its overall deadline.
We thus schedule the reconfiguration phases tRT,i with regard to their deadline D(d
i).
101
5 Reconfiguration Port Scheduling
RT1EX1
EX2
12
2
1,2
1
RT2
Slot 1
Slot 2
release time
d*
d
EX2
12
2
1,2
1
RT2
d*
d
EX1RT1
Figure 5.4: Scheduling according to d(left) and d(right), synchronous arrival times.
If we can guarantee the finishing of RTibefore its absolute deadline d
i, we also guarantee
the completion of EXibefore di.
Figure 5.4 shows an example of the reconfiguration port scheduling concept. In the
figure, we display a scenario of an execution environment comprising of two slots and
two tasks to be executed on the slots. If we use the deadlines dof the tasks to make
up the scheduling order—τ1is scheduled prior to τ2τ2will miss its deadline. However,
using das deadline for scheduling, τ2gets preferred and is scheduled first due to its
earlier d
2. Subsequently, both tasks can meet their deadlines.
Moreover, for scheduling arbitrary task sets on mono-processors, also preemption is
required. Preemption is a fundamental concept of many real-time scheduling algorithms,
as it allows us to serve arriving tasks with higher priorities immediately. Thereby,
preemption often increases the schedulability. More details on the characteristics of
such task sets are discussed in the subsequent sections.
Concerning the technical implementation, we do not allow the preemption of EX
phases due to several reasons: Basically, a preemption of EX results in the need to
reconfigure the whole area twice. As the currently executing configuration of a slot is
replaced by a new configuration, we have to reconfigure the slot once more to resume
execution after the preempting higher priority task has finished execution.
Moreover, the question of saving states arises. Some concepts exist [G¨
ot07], however
they require additional implementation effort. Alternatively, we can perform a readback
of the configuration of the task under execution. However, as a readback basically means
to read in the state of the whole configuration bits of a slot using the same hardware
resource as the reconfiguration port, a readback consumes approximately the same time,
also occupying the port for the duration of the readback. We thus would double the
additional time—time for the readback plus time for the subsequent reconfiguration after
the preemption—in the worst case.
We therefore introduce the concept of preempting the RT phase—tasks are preempted
during their reconfiguration. Although technically complex for the implementation, the
preemption of RT phases does not substantially increase the whole reconfiguration time,
as preempted reconfigurations are directly resumed in the future. The preemption thus
is achieved by stopping the reconfiguration and resuming it at a later point in time.
Such a behavior allows us to broaden the reconfiguration port scheduling by meeting
the requirement of offering preemption.
102
5.2 Concept
RT1EX1
R
EX2
12
2
12
1
RT2
Slot 1
Slot 2
release time
d*
d
T1
EX2
12
2
12
1
RT2
d*
d
EX1
Figure 5.5: Scheduling two tasks according to dwithout preemption (left) and using
preemption (right).
In Fig. 5.5, we display an example where such a preemption is of benefit. Comparing
this scenario with the one depicted in Fig. 5.4, the tasks now have different arrival
times. Without preemption—as displayed in the left part of the figure—τ2would again
miss its deadline, as it cannot preempt τ1despite having a shorter deadline d. If
preemption of the reconfiguration phase is allowed, τ2having a higher priority interrupts
the reconfiguration of τ1and forces the reconfiguration manager to start its RT phase
resulting in an earlier start of its EX phase. The reconfiguration phase of τ1is resumed
immediately after the preemption. Eventually, both tasks can meet their deadlines.
Moreover, as the partial reconfiguration techniques of current FPGAs do not offer to
interrupt the reconfiguration phase at any instance of time, but only after frames, we
have to consider this minimum reconfiguration unit RTmin = within our calculations
below. This time interval then also becomes the smallest time-step for tasks arriving
to be scheduled by the scheduling layer.
In order to derive a schedule for task reconfiguration, we can now completely specify
the parameters of the task τiin Table 5.1. The definitions are close to the ones in [But97].
In general, a schedule should satisfy an optimization criteria like the minimization of the
overall response time or the maximum lateness, etc. We want to emphasize the maximum
lateness, which is a known metric for the performance evaluation Lmax = maxi(fidi).
In the following, we detail the reconfiguration port scheduling by applying don a set
of aperiodic tasks.
Table 5.1: Definitions
sistart time
rirelease time
fifinishing time
diexecution deadline
tEX,i computation or execution time
tRT reconfiguration time
Lmax maximum lateness
d
ireconfiguration deadline
minimum reconfiguration unit
103
5 Reconfiguration Port Scheduling
RT1
EX1
EX2
12
2
1
RT2
Slot 1
Slot 2
d*
d
EX2
RT2
EX1
Slot 3
RT3
RT4EX4
EX3
4
3
43
RT1
RT3
RT4
12
2
14
3
43
EX4
EX3
a) b)
Figure 5.6: Scheduling task sets according to EDD. a) two slots, b) three slots
5.2.4 Parallel Machine Problems with a Single Server
We can compare the theory underlying our approach to the parallel machine problems
with a single server [BDFK+02], see also Sect. 3.5. In these problems, a set of identical
parallel machines is used for executing independent jobs of a task set. Prior to the
execution of tasks, a set-up time of the job occurs, which is performed by a single server
that dispatches the jobs to the machines. The parallel machines of this approach are
comparable to the reconfigurable regions of our execution environment, while the server
resembles the reconfiguration port.
However, in contrast, we allow preempting the reconfiguration phase, which is not
allowed in the single server case. A direct mapping of our problems to the single server
domain therefore is hardly applicable. We thus focus on the results of the works of the
mono-processor scheduling domain.
5.3 Aperiodic Task Scheduling
In the case of aperiodic tasks that have no dependencies, we distinguish the two cases of
synchronous and asynchronous task arrival. Basically, the former allows us to perform
a schedulability analysis before executing the tasks, while the latter requires dynamic
adaptation of the schedule.
We investigate the two known aperiodic task scheduling strategies from single pro-
cessor design to solve these two cases: EDD and EDF. Motivated by the similarity of
the single processor scheduling and the behavior of the reconfiguration port of the in-
troduced execution platforms, we show how we can use these algorithms from the single
machine environment in the domain of reconfigurable task scheduling [DF07b].
5.3.1 Synchronous Arrival
The scenario is as follows: A set of naperiodic tasks has to be loaded into mslots
(m<n), using the mutual exclusive reconfiguration port. The tasks have synchronous
arrival time, but can have different execution times tEX,i and deadlines di. Note that
schedules for this scenario do not need preemption as no new tasks will enter the system
at run-time and all tasks arrive at the same time.
104
5.3 Aperiodic Task Scheduling
RT
RT RT
RT
EX
EX
EX
RT
EX ...
Slot 1
Slot 2
Slot 3
Figure 5.7: The full load of slots condition if three slots and ntasks having different
execution times are scheduled.
The similar sequential scheduling problem is solved in the single processor environ-
ment with respect to minimizing the maximum lateness using Jackson’s algorithm,
also called earliest due date (EDD). The algorithm executes the tasks in order of non-
decreasing deadlines. We apply EDD to our scenario and schedule the RT phases, using
d
ias deadlines. Figure 5.6 shows some examples.
However, as each RT phase is followed by an EX phase that occupies the slots for the
duration tEX,i of the task, the occupation of the slots also has an impact on the start of
the next reconfiguration phase. In particular, both the port and at least one slot have
to be available. Otherwise, when a reconfiguration request is pending and all slots are
in EX phase, we suffer a waiting time and have to block the reconfiguration request, as
displayed in Fig. 5.7. Due to the wanted avoidance of EX phase preemption, we suffer
a specific condition, which we call from here on full load of slots: fls. We denote the
duration of such a fls as δi. Only those tasks τiafter the first mtasks to be scheduled
can suffer this condition, with mas the number of slots.
Such a scenario, where the seamless scheduling of RT phases cannot be guaranteed
is not covered by the original EDD algorithm. We thus extend the EDD algorithm as
displayed in Algorithm 2. In the algorithm, we use a vector zto denote the occupancy
of the slots. Below, we also show that in this case, the optimality of EDD cannot be
guaranteed. However, every slot is executing—the FPGA is fully utilized and does not
waste free space.
Algorithm 2 Earliest Due Date for Reconfigurable Slot Architectures
1: zoccupancy of the slots
2: if reconfiguration port is inactive (i. e., no RT phase is active) then
3: Find slot where no EX is active
4: if all slots are in EX phase then
5: Wait until at least one slot is available
6: end if
7: Reconfigure slot r, (r= ind (min {z1, z2, . . . , zm}))
8: Update (z)
9: end if
The scheduling for EDD can be improved by noting that the potential fls can be
reduced in one single case. If for two subsequent tasks τiand τi+1:tEX,i > tEX,i+1,
tEX,i tRT , and tEX,i < tRT +tEX,i+1, we can swap τiand τi+1. Thus, the starting times
of the next two RT phases will be improved. This holds for m2.
105
5 Reconfiguration Port Scheduling
Moreover, if we can guarantee at least one free slot at the beginning of each RT
phase, all results of EDD of the single machine environment hold and EDD is optimal
in our scenario with respect to minimizing the maximum lateness. A sufficient but not
necessary condition to guarantee at least one free slot of the mslots available is
i:tEX,i < tRT ·(m1),for m2.(5.2)
If Eq. 5.2 holds, every EX phase started in a slot mkfinishes before (m1) RT phases
in the other slots ml,l6=kcould take place. Thus, it is guaranteed that slot mkbecomes
free the latest before it is required for hosting the next RT phase in turn. This holds for
all mslots (1 < k < m). A seamless reconfiguration thus can be guaranteed thanks to
offering a free slot for any new reconfiguration request. However, the condition (Eq. 5.2)
is only sufficient, not necessary. For example, Fig. 5.6 b) depicts a scenario where Eq. 5.2
does not hold (tEX,3> tRT ·2, m= 3) and we still have at least one free slot available
for any new reconfiguration request.
Furthermore, by virtue of Eq. 5.2 we can derive the maximum number of slots required
to guarantee seamless reconfiguration. We therefore use the longest EX phase of all tasks
tEX,max and the reconfiguration time of a slot tRT to derive m:
m=tEX,max
tRT + 1. (5.3)
The estimation only provides an upper bound, as less slots may also result in a seamless
schedule, still minimizing the maximum lateness. Referring again to Fig. 5.6 b), mwould
equal 5 using Eq. 5.3, as tEX,max =tEX,3/4·tRT . However, the task set can be
scheduled optimal (minimizing the maximum lateness) if only three slots are available.
Using Eq. 5.3, we thus can only estimate the number of slots needed by an upper bound.
Guarantee
For the schedulability analysis, we construct the schedule referring to a vector (z) that
displays the current slot occupancy. Despite the a priori knowledge of the tasks, we have
to dynamically react on the fls phases. These phases can be compared to dynamically
arriving jobs, which however are ordered according to Jackson’s rule and do not cause
preemption, only additional delay.
When we want to perform a guarantee test—to guarantee that a set of tasks can be
feasibly scheduled—we need to show that, in the worst case, all tasks can complete before
their deadlines. The guarantee test for EDD in the single processor case is i= 1, . . . , n :
Pi
k=1 tEX,k di. In our scenario, due to the possible delays when all slots are occupied
and the next RT phase is postponed (fls), we must extend every scheduled task by a
possible additional δi. Thus, it must hold
i= 1, . . . , n :
i
X
k=1
(tRT,k +δk) + tEX,i di. (5.4)
The δkdepend on the current occupation of all slots of the system and are difficult
to compute. Therefore, our guarantee test avoids the explicit calculation of the δkby
106
5.3 Aperiodic Task Scheduling
computing the slot occupancies iteratively. We use the vector zthat holds the current
status of each slot. In the vector fields, we sum up the individual occupancy with
respect to the global availability of the reconfiguration port. Therefore, we sequentially
run through the schedule produced by EDD of Algorithm 2 filling this vector z, whose
entries zlrepresent the slots of the reconfigurable fabric. The vector is updated each time
a new RT phase starts. After the update, the vector’s entries display when (time) their
corresponding slots can be reconfigured next, also concerning the global condition—the
occupancy of the reconfiguration port. Thus, by extracting the field with the smallest
value, we can determine the next slot rfor reconfiguration of the next task tj. We apply
r= ind (min {z1, z2, . . . , zm}) , (5.5)
while ind is the index function (see also Line 7 of Algorithm 2). If multiple zlare
minimal, the selection is done randomly.
In detail, the entries of the vector are updated (zr,old zr,new) as follows: We add tRT
and tEX,j to the field of the selected slot (zr): zr,new =zr,old +tRT +tEX,j. In order to
update all other fields zl, l 6=r, the following equation holds:
zl,new = max {zl,old,(zr,old +tRT )}. (5.6)
Thus, if the finishing time of the RT phase of slot ris larger than zl,old, slot lmay be
reconfigured, when the currently started reconfiguration has finished (zl,new =zr,old +
tRT ). Otherwise, if slot lwill still be in EX phase when slot rhas finished reconfiguration,
we must not select slot lfor reconfiguration. Therefore, zlkeeps its value (zl,new =zl,old),
which is larger than zr,new indicating its next availability for reconfiguration.
Hence, we can answer the question of feasibility of a task tj—whether the deadline
djof task tjcan be met. After each update of the vector due to the dispatching of a
task tj, it must hold zr,new dj. After scheduling all tasks, we can calculate the overall
finishing time toverall as
toverall = max {z1, z2, . . . , zm}.(5.7)
Example
An exemplary sequence of the vector during a guarantee test for the scenario of Fig. 5.7
will look like the following. We start with an empty vector and schedule the first two
tasks:
z1
z2
z3
:
0
0
0
tRT,1+tEX,1
tRT,1
tRT,1
tRT,1+tEX,1
tRT,1+tRT,2+tEX,2
tRT,1+tRT,2
The update of the vector after dispatching the third task will result in keeping of the
value of z2, according to Equation 5.6.
tRT,1+tEX,1
tRT,1+tRT,2+tEX,2
tRT,1+tRT,2+tRT,3+tEX,3
tRT,1+tRT,2+tEX,2+tRT,4
tRT,1+tRT,2+tEX,2+tRT,4+tEX,4
tRT,1+tRT,2+tEX,2+tRT,4
. . .
107
5 Reconfiguration Port Scheduling
RT1EX1
RT2EX2RT3EX3
112233
RT1EX1
RT2EX2
RT3EX3
112233
EDD
feasible
solution d
d*
Slot 1
Slot 2
Slot 1
Slot 2
Figure 5.8: EDD can fail to produce a feasible schedule.
After each update of the vector, we can prove the feasibility (zr,new dj) and deter-
mine the next slot (r= ind (min {z1, z2, . . . , zm})) and the time instance for reconfigu-
ration (min {z1, z2, . . . , zm}).
Limitations
As stated above, when using EDD for the reconfigurable slot scheduling of reconfigurable
architectures, we cannot guarantee optimality. In fact, EDD can miss to produce a
feasible schedule. Figure 5.8 shows the problem, which is due to the possible additional
δiof each task. We can also see that we have to dissociate from the statement that EDD
also reduces the maximum lateness in our reconfigurable environment.
To summarize, using EDD, we can guarantee the minimization of the maximum late-
ness only if no reconfiguration phase is delayed.
5.3.2 Asynchronous Arrival
We now release the restriction of synchronous arrival of all tasks—tasks may dynam-
ically enter the system. If we have such arbitrary arrival times, preemption becomes
an important factor. In the literature, we find that when preemption is not allowed,
the problem of minimizing the maximum lateness and the problem of finding a feasible
schedule become NP-hard [LRKB77, LRK77, KIM78]. If preemption is allowed, Horn
[Hor74] found an algorithm, called Earliest Deadline First (EDF), that minimizes the
maximum lateness. The algorithm dispatches at any instance the task with the earliest
absolute deadline. In particular if a newly arriving task has a shorter deadline than the
currently executing task, the new task must be preferred and the currently executing
must be interrupted. Thus, preemption is required to schedule asynchronously arriving
jobs.
As stated above, we preempt tasks during their RT phase, when the calculation has
not started and no context saving, etc. is necessary. Figure 5.9 depicts an example of
three tasks executing in three slots. In order to implement the preemption, we divide the
area reconfigured during a RT phase into columns cj. These columns are of equal size
and comprise the equal reconfiguration time ∆, which sums up to the reconfiguration
time of the whole slot: P = tRT . The reconfiguration process then looks as follows:
108
5.3 Aperiodic Task Scheduling
RT1EX1
REX2
RT3EX3
12
2
3
3
1 2
1
3
T2
Slot 1
Slot 2
Slot 3
release time
d*
d
Figure 5.9: EDF schedule on a three slot machine using preemption.
RT1EX1
R
EX2
RT3EX3
12
2
3
3
12
1
3
T
2RT4
RT2
EX4
4
4
Slot 1
Slot 2
Slot 3
release time
d
d*
4
Figure 5.10: Full reconfiguration capacity of an EDF schedule
gradually all the cjof task τvare loaded in slot svof the reconfigurable fabric. If a new
task τwenters the system and has an earlier deadline d
w, we preempt task τv—task τv
frees the reconfiguration port and task τwstarts to reconfigure.
Depending on the current occupation of the fabric, different scenarios for the slot
assignment of task τware possible. (1) If we have another free slot available (sfree 6=sv),
we use this slot. After the RT phase of τwwe can resume the RT phase of τvat the
interrupted point. (2) If no free slot is available, we can use slot svof the interrupted task.
svbecomes the slot for τw(swsv) and RTwoverwrites all already configured parts of
τv. Thus, after finishing the reconfiguration of τw, we cannot resume the reconfiguration
phase (RTv) of the preempted task. Instead, we have to restart RTvcompletely, as
already loaded parts of the bitstream are lost.
When we thus apply EDF to our scenario relying on d, in addition to the fls condition,
which occurs here as well, we experience the full reconfiguration capacity (frc) condition
(ref. to Fig. 5.10). Here, at least one slot is in RT phase, while all slots are occupied
(by either RT or EX ). As explained above, we have two possibilities when a new high
priority job arrives and the frc scenario holds. We either can delay the job similar
to the fls condition, or we can force a preemption of one of the slots currently in RT
mode. Such a preemption will result in the killing of the preempted job, as all already
reconfigured parts of this task are abandoned in favor of the high priority task. The
mono-processor EDF does not include such a situation.
109
5 Reconfiguration Port Scheduling
Algorithm 3 Earliest Deadline First for Reconfigurable Slot Architectures
1: if d
new < d
current then
2: if all slots are in EX then
3: wait for next free slot
4: else if all other slots si6=slot(τcurrent) are in EX then
5: add τcurrent completely to Q
6: reconfigure now free slot
7: else
8: add rest of τcurrent to Q
9: Reconfigure next free slot
10: end if
11: else
12: Insert τnew in queue Q
13: end if
Implementation of EDF
The implementation of EDF (refer to Algorithm 3) for our scenario bases on a queue Q,
which orders all tasks according to their deadlines d. As mentioned above, if a new task
dynamically arrives to the system and its deadline is smaller than the task currently in
RT phase (τcurrent), we start the preemption process. Note that we put τcurrent at the
head of the queue. Depending on the slot we use to reconfigure, we either mark τcurrent
as partly loaded and assign the rest of its reconfiguration time to the queue, or, in the
case of swsv, we put τcurrent and its complete tRT to the head of the queue.
Limitations
EDF in the mono-processor domain minimizes the maximum lateness. Similar to EDD,
applying EDF for the reconfigurable port scheduling of reconfigurable environments, we
cannot guarantee this minimization. Again, if all slots are in EX phase, EDF cannot
load a dynamically arriving task as executing tasks are assumed to be non-preemptive.
As stated above, we deal with an NP-hard scenario in such a case.
Furthermore, Fig. 5.10 displays the major limitation where a RT phase might have
to be restarted completely. This killing of a RT phase most often increases the overall
response time and enforces a complex scheduling test to be done online after each new
task has entered the system.
Often, for an optimal schedule, we would require the reconfiguration port to stay
idle despite a task pending. Obviously, no scheduling algorithm can predict whether to
schedule this task or to delay its reconfiguration phase. However, for the case of non-
idle scheduling algorithms (the algorithm does not permit the mono-processor to become
idle if there are active jobs), [JSM93] proves that EDF is still optimal (minimizing the
maximal lateness) in a non-preemptive task model. If we thus schedule a task set that
only suffers the frc conditions, and do not allow killing of tasks (hence not preempting
the task), we have similar conditions to those of [JSM93].
110
5.4 Fixed Priority Periodic Task Scheduling
5.3.3 Experimental Results
We have conducted several test sets in order to rate the performance of EDD and EDF
in our scenario. We have set up a reconfiguration port simulator that allows us to
randomly generate task sets, schedule them according to various algorithms, respecting
different priorities, and display the schedules for visual control. We have randomly
generated 1000 task sets with varying parameters. For the duration of the execution
and reconfiguration phases, we have used values depending on the ratio l, with tEX
tRT l,
and l=.25, .5,1,2,3,4. The number of slots has been in the range of 3 to 5, while the
number of tasks always was significantly higher, so that tasks have to share slots. We
have scheduled both EDD and EDF with dand das deadlines.
Concerning the schedulability of EDD, we find a schedulability of approx. 90 % for d
and 70 % for damong the schedulable task sets. However, the schedulability concerning
dsignificantly increases with l1, and decreases with l1, while using das ordering
deadline of EDD behaves oppositional. Obviously, this is due to a dominance of the EX
phase. The schedulable task sets not found are due to the fls condition. When EDF
is applied, both conditions (fls and frc) can occur. Thus, the performance is slightly
worse, again with doutperforming d. Finally, for both EDD and EDF the approach
performs best if tEX tRT ·(m1), refer to Eq. 5.4.
5.4 Fixed Priority Periodic Task Scheduling
Periodic activities (sensory data acquisition, control loops, etc.) often represent the
major computational demand of embedded systems. In real-time scheduling theory
priorities are principally applied to such jobs. Contention for resources is resolved in
favor of the job with the higher priority that is ready to run.
Our tasks have the relative deadlines Diequal to their periods Ti. However, as we
schedule the RT phases, we derive the priorities of our jobs by referring to the relative
deadline D
i. The task with the shortest D
igets the highest priority. We thus have
to schedule a set of periodic tasks with deadlines less than periods. A similar mono-
processor scheduling algorithm is denoted deadline monotonic (DM) [LW82]. It is an
extension of the more common rate monotonic scheduling scheme. According to the DM
algorithm, each task is assigned a priority inversely proportional to its relative deadline.
Thus, at any instant, the task with the shortest relative deadline is executed. Figure 5.11
shows an example.
5.4.1 Schedulability Analysis
The sufficient and necessary schedulability test of a DM algorithm can be done by the
response time analysis [ABR+93, ABRW91], with the longest response time Ricomputed
at the critical instance as the sum of its computation time and the interference Iidue
111
5 Reconfiguration Port Scheduling
RT3
RT2
RT1EX3
EX2
slot 1
slot 2
slot 3
E
EX4
EX1
RT1EX1RT2EX2
1, 2, 4, 5
period 31 2
RT5
RT4
d* 12 1
1
3 4
Figure 5.11: Fixed priority example.
to preemption by higher-priority tasks:
Ri=tRT,i +Ii,where Ii=
i1
X
j=1 Ri
TjtRT,j.(5.8)
If Ri< D
ifor all tasks, the set is schedulable. We can derive step-wise solutions for
this problem. The critical instance occurs when all tasks are released simultaneously.
When scheduling the reconfiguration port, however, we face several circumstances
where the response time analysis would not produce the correct result. Both conditions
presented above, the fls and the frc scenarios demand special care, as they both can
impose additional delays to the scheduling of the jobs—the increase of Ii. Both, fls
and frc can hardly be calculated in advance. Thus, we propose alternative strategies to
guarantee the schedulability.
One possibility is to schedule the hyper-period of a task set. The hyper-period equals
the least common multiplier of all periods. Schedulability of one hyper-period guarantees
that subsequent hyper-periods can also be scheduled—the whole task set is schedulable.
Thereby, we can solve the question of schedulability, as the fls and frc conditions will
occur equally in all hyper-periods. However, there might exist cases, were a hyper period
does only exist asymptotically and therefore is too large to be constructed in advance.
As second solution, we handle the fls and frc conditions using known and appropriate
techniques of real-time scheduling that harmonize with the standard DM response time
analysis. We derive the two suitable techniques in the following.
5.4.2 A Server for Full Load of Slots Sections
During the fls condition, all slots are in EX phase while a new instance of a task τi
should be scheduled. τi, independent of its priority, demands for the reconfiguration
port. Despite the availability of the port, but due to the occupancy of the slots, we
cannot schedule this task. It is delayed.
In order to respect such a delay, we rely on the notions of aperiodic job scheduling.
Aperiodic jobs are scheduled by virtue of servers in the mono-processor scheduling do-
main. In our scenario, we assume an aperiodic job τaarriving at the same time as the
112
5.4 Fixed Priority Periodic Task Scheduling
RT3RT2RT1
EX3EX2
RT4
period
server active
slot 1
slot 2
slot 3
capacity
RT5E
EX4
EX1
Figure 5.12: Server for fls: Worst Case.
new instance of job τiarrives. τastands for the delay of τiand has the correspond-
ing computation demand (ta= delay of τi). For scheduling τa, we use a server, which
must have both, enough capacity to serve τacompletely and a higher priority than τito
become active immediately. If we find such a server, and can integrate it into the schedu-
lability analysis of our scenario, we can guarantee the handling of fls conditions. Please
note that the tasks to be scheduled by the server do not become physically implemented
parts of the system. In contrast, the tasks are only virtually executed by the server.
The server therefore may also be considered to execute virtual, however inevitable for
proper scheduling of the whole task set.
To derive a suitable server, which guarantees the immediate scheduling of τa, we have
to consider several characteristics discussed in the following. Basically, the computation
demand of τadepends on the current allocation of the slots, which is difficult to be
computed in advance, as we would have to investigate all possible fls conditions for a task
set. For our scenario, it is however not important to know the particular computation
demand. It is enough, if we know the maximum computation demand, which then
becomes the capacity of the server (see below). Every time the server becomes active,
we only have to guarantee that there is enough capacity to activate the server for the
possible longest delay of τi, so that the starting time of τican be bridged.
As the server must always execute in favor of any other job (forced by fls), we assign
the highest priority to the server. The minimal relative deadline of the server therefore
can be calculated by
DS,min = min(D
i)∆. (5.9)
Thus, its deadline will be shorter (by the smallest possible value ∆) than all other
deadlines D
iof the tasks τi. Hence, the priority of the server is maximized. As the
server does not have an EX phase, its deadline is equal to its period and D
S=DS. In
order to calculate the exact deadline of the server, we have to know the capacity.
In Fig. 5.12, we display the worst case fls scenario. It occurs when all slots have started
with the EX phases as close as possible. Note that a simultaneous activation of EX
phases in all slots is not possible, as for such an activation all slots would have to finish
reconfiguration simultaneously. However, the mutually exclusive reconfiguration port
prevents simultaneous finishing of RT phases. We thus only have a nearly simultaneous
activation of EX phases, which occurs when the RT phases of mtasks end nearly
113
5 Reconfiguration Port Scheduling
synchronously, as depicted in Fig. 5.12. The smallest interval Iof starting times of tasks
nearly simultaneously activated sums up to (m1) ·∆.
Based on the worst case scenario given by the nearly simultaneous activation depicted
in Fig. 5.12, we can compute the capacity of the server. In detail, the server must have
enough capacity CSto schedule an aperiodic job that has the computation requirement
of the m-th longest tEX , thus
CS=m-th max(tEX ). (5.10)
Only if this task τmhaving the m-th max(tEX ) is scheduled in the latest activated slot
mlatest, the complete capacity equaling tEX of this task will be required. No larger
capacity is required, as firstly all tasks having a shorter tEX than τmresult in a shorter
capacity required if scheduled in mlatest. Secondly, if a task having a longer tEX than
τmis scheduled in mlatest, we assign τm(or a task with shorter tEX ) to one of the other
slots, degrading CSrequired at least by ∆.
Moreover, we can improve the server capacity, if we have a large compared to the
EX phases. If the largest tEX is smaller than (m1)∆, the server capacity becomes
0, as the task can be executed completely during the interval I= (m1) ·∆, thus at
least one slot is free when the simultaneous activation phase ends. This holds for the
k-th largest tEX and tEX,k <(mk)∆, while k < m.
As the capacity must always be available, we require a server that preserves its capac-
ity. Furthermore, as two fls conditions could occur successively, only separated by one
RT phase, the capacity of the server must always be replenished as soon as possible.
The Sporadic Server (SS) [SSL89] solves our requirements. The SS algorithm creates a
high-priority task for servicing aperiodic requests and preserves the server capacity at
its high-priority level until an aperiodic request occurs. SS replenishes its capacity only
after it has been consumed by aperiodic task execution.
Finally, the period (= relative deadline) of the server is the minimum of the above
introduced DS,min, and the capacity + RT phase, i. e., CS+tRT . Thus, the server will
have the highest priority among all other tasks. Furthermore, the feature of the SS
allows us to have always enough capacity available, as even partially consumed capacity
is replenished after the server’s period. This is always early enough, as between two
consecutive occurrences of fls conditions always a complete RT phase will be scheduled.
To conclude, we use the Sporadic Server based on the parameters discussed above to
serve the delay of tasks that request the reconfiguration port, but cannot be reconfigured
due to the fls condition. The server—referring to its parameters—is included in the
schedulability test to guarantee a proper handling of fls scenarios. Combined with
the resource access protocol for frc sections discussed next, we then finally derive a
schedulability test to guarantee real-time execution of the task set.
5.4.3 Resource Access Protocol for Full Reconfiguration Capacity
Sections
When the frc condition occurs, all slots are occupied and at least one slot is in RT phase.
If the newly arriving instance of a task τihas lower priority than the just reconfiguring
114
5.4 Fixed Priority Periodic Task Scheduling
RT5
RT4
RT3
RT2
EX5
EX4
EX3
EX2
blocking time
handled by server
RT1EX1
Figure 5.13: Blocking Time for frc: Worst Case.
task τj, nothing will happen and τiwill be sorted into the list of ready tasks. However,
if the priority is higher, τicould either be scheduled and τjwould be killed, or delayed
until the reconfiguration port is free again.
For our schedulability analysis, we disallow killing as it harms the assumptions of
DM and would complicate the computation of the interference time Ij. Thus, we delay
τi—higher priority tasks will suffer a blocking due to lower priority tasks, as the recon-
figuration port is occupied. This is similar to a critical section of resource sharing. In
order to avoid (unbounded) priority inversion, a resource access protocol is necessary.
As we only face direct blocking and will not suffer chained blocking or deadlocks [But04],
we can apply the Priority Inheritance Protocol PIP [SRL90]. The protocol modifies the
priorities of those tasks that cause blocking. In our case, τjwould temporarily inherit
the priority of τi.
The schedulability analysis of the PIP is based on the response time analysis. There-
fore, the blocking time Biis added to the recurrent equation:
Ri=tRT,i +Bi+
i1
X
j=1 Ri
TjtRT,j.(5.11)
Note that this test becomes only sufficient, as tasks could actually never experience
blocking. In order to calculate the blocking time Bi, we rely on the worst case blocking
time as displayed in Fig. 5.13. The worst case occurs when a high priority task suffers a
blocking due to mlower priority tasks in their RT phase. Thus, the blocking time will
not be longer than Bmax = (tRT ∆)(mp), while p=mfor the lowest priority task
and decreases by 1 with every priority increase until p= 0.
We notice that the blocking time will be large for a large m. However, Biwill never
be longer than the largest tEX,j +tRT among all tasks with lower priority than τi. In
fact, the k-th longest EX phase among the tasks τjadded to a non-avoidable fraction of
k(tRT ∆) will denote Bi, if it is smaller than Bmax. We calculate Biby Algorithm 4.
115
5 Reconfiguration Port Scheduling
Algorithm 4 Computation of the Blocking Time
1: LΓ
2: ¯
L
3: pm
4: while L6=do
5: τiremove lowest priority task(L)
6: Bi(tRT ∆)(mp)
7: if p > 0then pp1
8: for (k1; k < (m1) k < number of τjin ¯
L;kk+ 1) do
9: Btmp k-th longest tEX,j +k(tRT ∆)
10: if Btmp < Bithen BiBtmp
11: end for
12: ¯
L¯
Lτi
13: end while
5.4.4 DM + SS + PIP Schedulability Test
For the schedulability test, we thus have to combine the response time analysis for DM
with the PIP and the server. From a scheduling point of view, SS can be replaced by
a periodic task having the same utilization factor. As our server will never come active
during a frc condition, the server does not have a blocking time (BS= 0). Thus, we
have to solve the recurrent equation
Ri=tRT +Bi+
i1
X
j=1 Ri
TjtRT (5.12)
for the task set Γ0:= Γ τS.
5.4.5 Experiment
We have randomly generated periodic task sets and tested their schedulability using our
simulator. Fig. 5.14 shows an example of seven tasks executed in three slots. The server
becomes active, when the fls condition arises, depicted in the additional and virtual
server slot. By explicitly displaying the server, we can evaluate the performance of
the scheduling. Based on the parameters of our randomly generated task sets, we first
defined the capacity and period of the server, in order to add the server to our task set
0:= Γ τS). Then, we performed the feasibility test of the previous subsection. By
means of our simulator, we executed task sets that passed the feasibility test, as well as
task sets that failed, and displayed the schedules for further analysis.
In general, our approach using the PIP access protocol and the SS server on one hand
correctly guarantees the schedulability of the task sets. On the other hand, however, the
approach is pessimistic, as neither the server capacity, nor the blocking time are con-
sumed completely among the majority of our task sets. Thus, when scheduling task sets
that are not feasible according to our test, but according to the standard DM response
116
5.5 Caching
Figure 5.14: Simulator
time analysis, we still often could derive feasible results within the reconfiguration port
scenario. The same empirical results from the aperiodic task sets hold: our approach
performs best if tEX tRT ·(m1). Furthermore, there exist few cases, where killing
of tasks can increase the performance of the schedule (reducing the max. lateness) or
even result in feasibility.
For thus increasing the number of periodic task sets that can be accepted for the re-
configuration port scheduling, the pre-calculation of the hyperperiod is advisable. Nev-
ertheless, our approach of using the SS server and the PIP access protocol provides a
comfortable method for providing a schedulability test that can be completely calculated
offline having also a low computational complexity. In particular, the calculation of the
hyperperiod can be avoided, which is beneficial, if the number of tasks is high and/or
the hyperperiod is very long.
5.5 Caching
In this section, we introduce caching concepts to the scheduling of the reconfigura-
tion port of partially reconfigurable systems. Being a powerful concept for accelerating
processing in multiple execution environments, caching can also help to increase the per-
formance of the reconfiguration port scheduling. Basically, by caching configurations,
we reduce the reconfiguration overhead. We present several caching concepts in the
following, including the evaluation and test of the algorithms [Fra06], [DF07a].
117
5 Reconfiguration Port Scheduling
Caching can be applied in case of periodic systems—the RT phase between two con-
secutive instances of a task τiis omitted. We thereby have to take care of no other task
starting a reconfiguration in this slot between the occurrence of the two instances of τi.
If however the slot cannot be used otherwise, we might affect the performance of the
other tasks, as their reconfiguration request could not be granted.
Therefore, we apply caching based on the current load of the execution environment, as
well as on global statements like the frequency of a task. For example, two consecutive
instances that are executed as late as possible (τi,k) and as soon as possible (τi,k+1)
most often can share the same slot without idling time of this slot between the two
instances. We then can avoid the intermediate reconfiguration, which belongs to τi,k+1
and is located seamlessly in between the two EX phases. We start the EX phase of
τi,k+1 earlier, improving the response time of the task. By re-using an already loaded
configuration (caching), we thus can improve the overall performance. Thereby, if a
task is more often activated—in fixed priority scheduling systems it will have a higher
priority—we indicate it as preferable for caching.
Problem Abstraction
To summarize our problem: We want to reduce the amount of reconfigurations when
scheduling a task set Γ of nperiodic tasks τionto our execution environment of m
equally sized slots with m < n in general. Thereby, however, caching means no RT
phase. We have to bear in mind that by avoiding the RT phase, we undermine the
concept of reconfiguration port scheduling, as the instance when the caching pays off, is
an instance without tRT :tRT = 0.
Often caching helps to improve the schedulability and/or the maximum lateness. We
are interested in which methods provide the most promising solutions. Caching can even
increase the feasibility of task sets. Thereby, we will also improve additional design goals
like energy consumption, area consumption, resources needed, etc. For example, having
fewer RT phases basically reduces the energy consumption.
Note that caching in our scenario is different to caching in von Neumann architectures,
as we do not cache data, but we cache configurations. Thus, while caching is often
disabled in real-time scenarios, as it complicates the predictability, caching can be of
benefit for our algorithms of real-time scheduling the reconfiguration port.
Our caching algorithms are based on the priority of the tasks. We define the priority
based on different characteristics of a task:
pa=tRT
T(5.13)
pb=tEX
T(5.14)
pc=tEX +TRT
T(5.15)
pd=d(5.16)
The first definition pais the same as in the classical mono-processor scheduling do-
main, where the calculation time ciof a task divided by the period Tistands for the
118
5.5 Caching
RT1
RT2RT4
RT3
EX1
EX3
Slot 1
Slot 2
Slot 3
EX1EX1
RT2
EX2EX4EX2
RT55
E
1, 2, 3, 4, 5 period
1 12
Figure 5.15: Caching: Priority Based Slot Reservation where one slot is constantly used
for task τ1.
priority. In contrast, pband pc, along with paare defined for the purpose of static priority
reconfiguration port scheduling, while pdis suitable for dynamic priority assignment.
5.5.1 Offline caching methods
We will introduce offline caching methods first, before considering on-line methods in
Sect. 5.5.2. The algorithms apply static (fixed priority) as well as dynamic scheduling.
Priority Based Slot Reservation
If we can completely avoid the reconfiguration of a slot, we will have the maximum gain
concerning the saved configurations, as only the initial configuration must be loaded.
Our first concept presented—the priority based slot reservation—follows such a goal by
statically analyzing the task set prior to the dispatching of the tasks to the slots.
The applicability of the method depends on the number of slots available and on
the characteristic of the task set. We prefer tasks with high priority, as those tasks
get activated frequently. We reserve one or more slots exclusively for a corresponding
amount of high priority tasks. The reserved slots are configured at set-up time and are
from there on exclusively used each for their task only. The other tasks have to share
the remaining slots. Figure 5.15 shows an example.
The selection of the tasks that exclusively occupy a slot is primarily based on the
priority of the task. We can use either of the priorities pa,pb, or pc, as all show good
results. However, pcparticularly helps to indicate tasks that are worth to be cached. If
the priority pcis close to 1, this task would require nearly a slot on its own anyway, as
the reconfiguration phase followed by the execution phase of this task would reside in a
slot during the period of the task.
Among the tasks having a high pc, we prefer the tasks with a high pb, as such a task
will exploit the exclusively reserved slot longer than a task with a lower pb. Furthermore,
such a task would block one of the slots for the reconfiguration port scheduling for a
disproportional long time, as the ratio of the reconfiguration phase tRT within a period
of such a task is very high.
To schedule the whole task set, we always require enough slots to be able to serve all
tasks. In particular, if not all tasks are assigned to their exclusive slot (as is the case
with m<n), there must always remain at least one slot that is not exclusively reserved
for a task. Having only one slot left for reconfiguration inevitably means that hiding the
119
5 Reconfiguration Port Scheduling
RT1
RT2
RT3
EX1
EX4
Slot 1
Slot 2
Slot 3
EX1
EX2
EX3EX2
RT44
1, 2, 3, 4, 5 period
1 12
RT1
RT5EX5
Figure 5.16: Consecutive Task Combination with modification of the release times of
tasks τ1and τ2.
reconfiguration latency cannot be applied any more. Therefore, we try to have at least
two slots remaining.
Furthermore, concerning the set-up of the system, we have two possibilities to consider
the configuration requirements of the fixed assigned tasks. Either, we denote the sum
of the configuration times for these tasks as additional requirement that precedes the
scheduling, or we include the configuration time into the calculation of the schedulability.
The former eases to find a feasible schedule and will serve practical conditions in most
cases, while the latter also allows us to consider systems without set-up time. However,
the latter could extremely deform the system, if a large number of slots is exclusively
reserved, as the reservation—the configuration—will only occur once.
If we consider the initial configuration of the exclusive slots within our feasibility
analysis, we apply the earliest due date algorithm (see Sect. 5.3.1), as we first have to
schedule a set of aperiodic tasks that have simultaneous arrival times. We then use a
periodic scheduling algorithm for the remaining tasks.
Concerning the feasibility analysis of the priority based slot reservation, we thus have
to first guarantee that the initialization phase can be completed without missing dead-
lines. Thereafter the remaining tasks have to be scheduled. Here, we can apply the
method of Sect. 5.4 or schedule the task set for the hyper period. For the latter case,
we schedule the task set until the first hyper-period of the whole task set is reached, or
until the hyper-period of the new task set is reached, depending on whether the initial
configuration of the exclusive tasks is part of the scheduling analysis or whether a set-up
time is used. In both cases, the hyper-period is easier to schedule, as only the free tasks
require reconfiguration, thus reducing the occupancy of the reconfiguration port.
The advantage of the priority based slot reservation scheduling algorithm can be
summarized as follows: As selected tasks and/or tasks with high priorities are assigned
to exclusive slots, these tasks will always meet their deadline, if their initial configuration
can be guaranteed.
Consecutive Task Combination
The second offline caching method—the consecutive task combination—bases on the
manipulation of the characteristics of the task set. If we combine two instance of a task
and execute the EX phase of the first instance as late as possible, we can avoid the RT
phase for the next instance. The finishing time of the first instance thereby will be at
the end of the task’s period, see Fig. 5.16.
120
5.5 Caching
If we want to combine two instances of a task τi, we will have to ensure that no other
task τjwill reconfigure the slot in the meantime. We therefore adapt the release time
of the first instance and the deadline of the second instance. The new release time ˜rof
task iand instance kcan be calculated by ˜ri,k =k·Ti(tRT +tEX,i). Similar, the new
absolute deadline ˜
dfor the task iand instance k+ 1 is ˜
di,k+1 =ri,k+1 +tEX,i. Moreover,
for instance k+ 1, we do not have to consider the task τifor the reconfiguration port
scheduling, as it does not require an RT phase. Thus, the relevant deadline is the
deadline d
i,k of the instance τi,k.
Here, we denote the drawback of the approach: As the release time is pushed to the
furthest possible point in time, while the deadline remains the same, there is no slack
available for the modified tasks. Thus, these tasks must be scheduled as soon as they
are released in order to enable the matching of the deadline. We can guarantee an
immediate reconfiguration only for the task with the highest priority and if at least one
slot is free when this task arrives (no fls condition holds).
Alternatively, we can allow the tasks some slack by releasing the task prior to the new
release time ˜r. Then, we have to block the slot after finishing the execution phase of
instance kuntil the beginning of the instance k+1. Such a modification can be accepted
if the blocking time is smaller than the tRT , as due to the avoidance of the RT phase for
the instance k+ 1, a complete tRT is saved. In general, we can approach the problem by
using a resource access protocol, which prevents other tasks also having higher priority
from using the blocked slot for reconfiguration.
In Fig. 5.16, we depict an example where only some of the tasks are combined in order
to facilitate consecutive task combination. Not modifying all tasks makes particularly
sense, if a scenario arises where two or more combined tasks compete for the reconfigura-
tion port at the same time. If we allow no slack for combined tasks, only one task could
meet its deadline. We therefore select some tasks for the consecutive task combination,
while leaving other tasks untouched.
The selection can be done on the basis of the absolute deadlines dof the instances
of all tasks during a hyperperiod. If two or more of such deadlines fall into an interval
of tRT length, their corresponding tasks might not be schedulable. This becomes
clear, if we take a closer look at the modified release times of two tasks τvand τw, whose
instances are combined pairwise. If say the 3rd deadline d
v,3of task τvand the 5th
deadline d
w,5of task τware located within an interval tRT ∆, the modified release
times also fall into such an interval. The modified deadline for τvcan be calculated by
˜rv,3= 3 ·Tv(tRT +tEX,v) = d
v,3tRT , and for τwby ˜rw,5= 5 ·Tw(tRT +tEX,w) =
d
w,5tRT . If d
v,3d
w,5< tRT ∆, then it holds that |˜rv,3˜rw,5|< tRT ∆, and
thus one of the task misses its deadline, as they compete for the reconfiguration port at
the same time interval.
For deriving the critical deadlines, we have to consider only every second deadline
D=d
i,1, d
i,3, d
i,5, . . . of each task within a hyperperiod, as only those deadlines
are relevant for scheduling tasks whose instances shall be combined pairwise. We then
compare these deadlines Dof all tasks within a hyperperiod and derive a selection
of tasks by building a graph. If two deadlines of two different tasks fall in the range
121
5 Reconfiguration Port Scheduling
RT1
RT2RT4
RT3
EX1
EX3
Slot 1
Slot 2
Slot 3
EX1
EX2EX4
RT5
RT1E
RT2
RT2
1, 2, 3, 4, 5 period
121
Figure 5.17: Scheduling Look Back
tRT ∆, we draw a connection between the two tasks. We repeat this for all critical pairs
of deadlines from D. From the final graph we then can derive the tasks whose instances
can be combined pairwise, by selecting those tasks that have no direct connection. The
selection is done on the basis of graph coloring. The number of colors tells us how many
tasks can be scheduled according to the consecutive tasks combination. To derive the
maximum number of tasks, we select the color that occurs most often within the graph.
Please note that an uneven number of instances of a task within a hyperperiod must be
handled also. We can either exclude the last instance of these tasks from the consecutive
task combination, or double the hyperperiod, as then all tasks have an even number of
instances. Otherwise the last instance of a task within a hyperperiod would be combined
with the first instance of the next hyperperiod, making this hyperperiod different to the
former one.
In summary, the advantage of the approach lies in the reduction of the number of
reconfiguration phases required by two for those tasks whose instances are combined.
Tasks with short periods may be preferred, as their overall fraction of the reconfigura-
tion time is larger than of those tasks with longer periods. Furthermore, the approach
can improve the scheduling of tasks when for the priority pcholds pc1
2, without
permanently reserving slots for specific tasks as in the previous approach.
5.5.2 Dynamic/On-line Caching Methods
While the first two algorithms manipulated the task set offline prior to starting the
schedule, the remaining algorithms will focus on online conditions.
Scheduling Look Back
The first approach, scheduling look back (SLB), considers the condition of the slots before
starting an instance of an RT phase. If any of the slots already holds the configuration
of the newly arriving task, the RT phase is omitted in favor of immediately starting
the EX phase. Figure 5.17 depicts an example, where the second reconfiguration of the
frequently occurring task τ1can be avoided.
Additionally to tasks with short periods, we improve the schedulability of tasks that
are started relatively late, as their next invocation will be not too far away from the
finishing time. These tasks often have low priorities. However, despite that the next
release of the task will be soon, the reconfiguration of the task is often postponed, as
the priority is too low. Here, as an improvement of the algorithm, we can temporarily
122
5.5 Caching
RT1
RT2
RT4
RT5
EX1
EX3
Slot 1
EX1
RT1
EX5
EX4
EX2
RT33
1, 3, 4 period
1
Slot 3
Slot 2
2 5
Figure 5.18: Scheduling Look Ahead
inherit the priority of a higher priority task. Such a behavior can be controlled by a
resource access protocol.
Moreover, we do not have to pre-compute the task set. If we have a task set that is
feasible without scheduling look back, the performance of the task set will be improved
by applying SLB.
Scheduling Look Ahead
A different approach is scheduling look ahead (SLA). As implied by the name, it looks
into the future, opposed to SLB, which only refers to the current status and former
allocation of the slots. SLA becomes active each time the reconfiguration port idles. If
we find a task τiwhose next instance kwill be activated in the near future, we start the
reconfiguration phase for instance kat the moment the reconfiguration port idles. If we
can finish the reconfiguration phase before the original release time, the next instance
of task τistarts to execute the moment its period begins, without having to wait for its
reconfiguration phase to finish. Even if the reconfiguration is not completely finished
when the period begins, the actual start of the execution time still will be earlier. We
thus improve the response time of instance kof this task.
Moreover, the requests of the reconfiguration port are spread over the time line, there-
fore reducing the cases of two tasks competing for the reconfiguration port at the same
time, which in turn reduces the number of frc conditions. Thus, the performance of the
scheduling can be improved. Figure 5.18 depicts an example.
SLB relies on several characteristics of the task set. As we know the whole task set
from the beginning of the scheduling, we particularly know when a new instance of a
task is released. Moreover, as the EX phase of a task can start at the beginning of the
period, we can start with the RT phase before the final release of the task.
The pre-loading can only take place if (1) the reconfiguration port is free and (2)
the time until the release time of the instance of the task selected is less or equal the
reconfiguration time of this task. Otherwise, we might reconfigure a task τi, mark it as
already reconfigured (so that it can be executed immediately when the next instance is
released), and another task τjuses this slot for reconfiguration before we have executed
task τi.
123
5 Reconfiguration Port Scheduling
RT1
RT2
RT3
EX1EX3
Slot 1
Slot 2
1, 2, 3 1,3
EX2RT1
1
EX1
2
RT2
RT1
RT3
EX2
1,2 1, 3
EX1
EX3
RT1EX1
RT2EX2
2
RT2EX2RT3EX3
RT1EX1RT1EX1
11, 2, 3
deadline miss deadline miss
RT1
RT2
RT3
EX1EX3
Slot 1
Slot 2
1, 2, 3 1,3
EX2RT1
1
EX1
2
RT2
RT3
EX2
1,2 1, 3
EX1EX3
RT1EX1
RT2EX2
2
EX2
RT3EX3
EX1
RT1EX1
11, 2, 3
RT1
RT2
RT3
EX1EX3
Slot 1
Slot 2
1, 2, 3 1,3
EX2RT1
1
EX1
2
RT2EX2
1,2 1, 3
EX1EX3
RT1EX1RT2EX2
2
RT2EX2
RT3
EX3
EX1RT1EX1
11, 2, 3
RT1
RT3
deadline miss
a)
b)
c)
Figure 5.19: Example of the same task set on a two-slotted architecture applying different
scheduling algorithms. a) DM (deadline monotonic), b) DM + SLB (scheduling look
back), c) DM + SLB + SLA (scheduling look ahead)
5.5.3 Combination of the Methods
Again, we have used our reconfiguration port scheduling simulator to evaluate the al-
gorithms. Thereby, the simulator helps in many aspects, as it visualizes the schedules.
In particular, non-feasible tasks can be found. Additionally, it allows us to rate task
sets by performing the scheduling algorithms on a set of task sets. Finally, we can also
evaluate task sets by looking at the graphically displayed hyper-periods.
Figure 5.19 depicts an interesting example of applying caching in the realm of recon-
figuration port scheduling. In Fig. 5.19 a) we have scheduled the task set displayed in
Table 5.2 without any configuration caching, using deadline monotonic scheduling. In
contrast, if we use scheduling look back, as depicted in Fig. 5.19 b), we get a feasible
schedule, while a combination of scheduling look back and scheduling look ahead re-
sults in an unfeasible schedule. Figure 5.19 c) displays the latter scenario, where the
scheduling is still improved comparing to the scheduling without configuration caching.
Table 5.2: Task set for Fig. 5.19
Task tRT tEX period Ti
1 4 4 12
2 4 8 18
3 4 8 24
124
5.6 Experiment
5.5.4 Implementation
The first configuration caching algorithm presented in 5.5.1 (priority based slot reserva-
tion) demands for a preprocessing of the task set. This preprocessing is done offline. The
same holds for initializing the fixed tasks. After everything is set, we have to schedule
the remaining tasks.
Similarly, the second algorithm in section 5.5.1 requires a pre-processing that is done
in the sense of design space exploration. Afterwards, the dispatching of the tasks can
be done on these off-line results.
Scheduling look back requires madditional steps when a new task arrives, with mthe
number of slots. We have to check all slots if the configuration of the arriving task is
already loaded in one of the mslots.
In contrast, scheduling look ahead is more sophisticated. The algorithm is activated
each time the reconfiguration port is idle. Then, we have to check for all slots if there is
at least one slot without an EX phase being active. If we have found one, we consider
the whole task set and search for the task with the next release time. If the distance
between this release time and the current time is equal to or less than the reconfiguration
time, we start to reconfigure this task.
SLB and SLA can be combined. Each time the reconfiguration port idles and we look
for an RT phase of a task τithat could be pre-loaded (executed in advance), we also
look whether the configuration of τialready is loaded. If so, then we do not have to
reconfigure it once more. Therefore, we look for the next task and might prefer this one.
However, we will have to block the firstly selected slot in order to not destroy the already
loaded configuration resting in this slot. Nevertheless, the implementation complexity
increases when SLB and SLA are combined.
Finally, due to the re-use of the same configuration, we might have to re-initialize
memory (constant values, etc.) or even larger parts of the circuitry. Thus, we might have
to reconfigure parts of the slot. In the average case, the reconfiguration is still shorter
than tRT , in the worst case it is equal to tRT . Thus, we will never decrease the behavior
of our algorithm, and most likely will always reduce the maximum lateness. In general,
the re-initialization is particularly important for memory elements like BlockRAM that
can be found on modern FPGAs, as wrong initial values can falsify the computation.
In many cases and for the majority of the reconfigurable area of a slot, we can simply
reset the configuration using standard reset methods. Note that the reset mechanism
should be user implemented when generating partial bitstreams on Xilinx Virtex devices.
5.6 Experiment
After discussing the capabilities of the reconfiguration port scheduling algorithms, we
also want to discuss an example application, where the real-time aware execution of
run-time reconfigurable tasks can be of benefit. We selected the scenario of an online
recognition system that is required to identify and sort objects placed on several conveyer
belts. On each conveyer belt, objects arrive periodically. We sort the objects into
125
5 Reconfiguration Port Scheduling
conveyer belt
cat 1 cat 2 waste
...
D
cam
distribution
production
Figure 5.20: Conveyer belt
DD
DD
D
central
control
Figure 5.21: Set of conveyer belts connected to a central processing unit.
different categories (quality levels) meeting the timing constraints of each of the conveyer
belts to fulfill the overall requirements of a production system.
As displayed in Fig. 5.20, the recognition shall take place at the end of a production line
that turns out goods in a well-defined frequency. Therefore, several sensors are installed
that derive data, which must be interpreted correctly to be able to sort the product into
boxes for different quality levels. It quickly becomes obvious that for economical reasons,
the production should not be halted and therefore the sorting must meet the frequency
of the arriving products. As object recognition requires high processing resources, a
hardware implementation often is preferred. Using reconfigurable logic therefore, we
can react on changes of the production line by adapting the recognition circuit. An
FPGA as processing device thus is a recommended alternative.
In addition, plants usually not only comprise of one but several production lines re-
sulting in a set of conveyer belts, whose objects must be sorted in parallel. We thus
face a set of recognition systems that all must work correctly to prevent single produc-
tion lines from disruption. Moreover, we assume a centralized control for the set of
recognition systems, which could be necessary due to harsh environmental conditions in
the immediate vicinity of the conveyer belts. Figure 5.21 depicts an example consisting
of five conveyer belts each having their own sensors for recognition and own actuators
for sorting. The sensor data of each conveyer belt is forwarded to the centralized con-
trol system, which evaluates the data and sends a command to the conveyer belt. The
command denotes the quality of the product and allows for correct sorting.
126
5.6 Experiment
bitstreams
sensor
input/
output
FPGA
slot 1 slot 2 slot 3
scheduling
algorithm
control CPU
control logic
peripherals
Figure 5.22: Experimental set-up on the Erlangen Slot Machine
We can keep costs low, if we equip the central control system with a reasonably small
FPGA only. The area of the FPGA then must be shared by the different recognition
circuits of the conveyer belts. Run-time reconfiguration solves the problem technically.
However, the whole system also must be classified as a hard real-time system, as a timely
processing is vital for un-interruptible production. In a worst case scenario, the so-called
domino effect [But04] may even disrupt all processing lines initially only suffering one
single fault. Proper scheduling thus is essential. For such a scenario, the reconfiguration
port scheduling method of this chapter comes in.
Implementation
For the practical implementation, we assume an execution environment, which consists
of three slots. The slots operate in parallel and can be used by any of the recognition
circuits in a time-shared fashion. A simple control unit on the FPGA takes care of
the correct data transmission. The recognition sensor provides a defined block of data,
which must be accessible by the slot holding the corresponding processing circuit. The
result, which is a simple command, is sent back directly to the conveyer belt.
For experimental investigations, we have implemented a prototyping environment on
the Erlangen Slot Machine. The layout is depicted in Fig. 5.22. To also consider the
preemption of the reconfiguration phase, we have divided each slot into two sub-slots
that form a unity. Such an explicit partitioning of the slots helps us to overcome some
technical problems, as it is tremendously difficult to implement a stop of the reconfigu-
ration process. We store two partial bitstreams for each sub-slot and reconfigure these
sub-slots in sequence, unless a preemption occurs. Also, this implementation only can
offer a single preemption of each reconfiguration phase, it nevertheless allows us to ex-
plore the fundamental feasibility of the approach. As both sub-slots are located directly
side by side without any space in between, we only require a single bus macro to route
the signals between them. Moreover, note that in contrast to the two slot example,
the sub-slots cannot be executed on their own. Only the complete reconfiguration of
slot—including both sub-slots—facilitates to start execution.
127
5 Reconfiguration Port Scheduling
Figure 5.23: FPGA editor output of the execution environment implemented on the
ESM. Left: only the basic logic, right: two slots are loaded.
For the communication, we assign dedicated wires to each of the slots. The input
data are distributed by the controller to the slots. The data distributer takes care of
providing the correct data to the slots. Unfortunately, the Xilinx FPGAs do not generate
a done signal after finishing a partial reconfiguration. We thus cannot measure the exact
reconfiguration time of one of the sub-slots, however the development team of the ESM
provided us with the necessary information.
Moreover, to overcome any problems of signal integrity due to partial reconfiguration,
we initiate a well-defined bitsequence that must be received by the task, prior to start
execution and also by the control unit before sending out a result of a task. This
mechanism to send and detect a sequence can be considered as a gate.
Finally, the ESM main board hosts a PowerPC that is used to boot a minimum
Linux environment. The operating system allows for the execution of the scheduling
algorithms using the ESM API. This API exposes the capabilities of the ESM through
libraries that can be used by user space programs. Most notably are the ability to
(re)wire the crossbar, upload the bitstreams and communicate with the FPGA through
the so-called hwswcom modules.
Results
In Fig. 5.23, we display the layout of the execution environment on the FPGA using
the Xilinx FPGA editor. The control logic is implemented in the top left area of the
FPGA. The peripheral control logic for the VGA system is located in the left below the
control logic. The slots are placed side by side starting in the second third of the device.
Each slot is 12 CLBs wide with a height of 144 CLBs which allows a task to use about
12 ·144 = 1728 CLBs in total. Moreover, sub-slots span 6 CLBs in width.
For the control logic we require only 679 slices which reflects a utilization of 2 %.
About half of this space is used by the communication interface with the PowerPC.
When we also add the control logic for the camera peripheral to our design, we require
128
5.7 Lesson Learned
an additional 290 slices, however the amount of routing logic required increases, as the
SRAM is accessed by the camera module.
The estimated reconfiguration time for one slot (12 CLBs width) of our execution
environment sums up to approx. 127 milliseconds. We added a rough 10 % which lead
us to a total reconfiguration time of 140 milliseconds for one slot. The reconfiguration
time for a sub-slot was estimated the same way and accounts for 65 milliseconds. Adding
again roughly 10 %, the minimum time interval of the reconfiguration port scheduling
comprises = 70 ms. The relative long reconfiguration times can be accounted due
to the slow flash memory used to store the partial bitstreams. The transmission times
between the PowerPC and the FPGA can be neglected in this setup because only 4
bytes are necessary to upload and received data.
We used the above estimated reconfiguration times in a scheduling algorithm and were
able to confirm them by verifying that the actual schedule gave the same results as the
simulated one. We executed the scheduling algorithm on the PowerPC by referring to a
pre-computed table of the tasks parameters.
5.7 Lesson Learned
The chapter has shown how we can exploit reconfigurable fabrics for the domain of
reactive systems. To establish real-time behavior as demanded by such systems, we
introduced the novel concept of reconfiguration port scheduling.
In particular, the drawback of the reconfiguration overhead can be used for establishing
real-time scheduling. We therefore focus on the whole reconfiguration process, partic-
ularly including the single reconfiguration port of FPGAs, which circumvents parallel
reconfiguration of areas. Both, the long duration and the sequentiality of the recon-
figuration, however, allow us to apply mono-processor scheduling algorithms, which are
well-researched and help to ease scheduling in particular for real-time scenarios. The ap-
pliance of the scheduling algorithms thereby is possible and valuable for such scenarios,
even though some limitations have to be taken into account.
We thus can map mono-processor scheduling algorithms to a new domain. With-
out reinventing the wheel, we can achieve real-time behavior for tasks executing on
reconfigurable fabrics. Particulary, the abstracting layered approach helps to assign re-
sponsibilities within the real-time scheduling to different experts. We rely on slotted
execution environments that already exist in the literature and extend their usage.
Subsequently, we could derive that the concept of a run-time platform layer is valuable
in general. We can pose exact requirements to such a platform, including the need for
a sound communication concept for the execution environment, which must be given
to avoid the data transfer becoming the bottleneck of the system. In the realm of the
work on making reconfigurable systems mature, the reconfiguration port scheduling thus
facilitates to beneficially make use of reconfigurable fabrics.
Nevertheless, for a comprehensive exploitation, some extension to the current recon-
figuration techniques are necessary. Foremost, the preemption of the reconfiguration
phase, as demanded by our scheduling algorithms, is not fully supported on modern
129
5 Reconfiguration Port Scheduling
FPGAs. We solved the idea of preemption by dividing the reconfigurable slot area
into pre-defined sub-slots, which are reconfigured subsequently. If arbitrary preemption
would be possible, the flexibility could be increased and the work-around of sub-slots
would become unnecessary.
Although posing a challenge for the implementation, the preemption of the reconfigu-
ration phase still opens new perspectives to the appliance of run-time reconfiguration on
FPGAs in the real-time domain. In general, reconfigurable devices execute tasks in par-
allel, which intentionally collides with the single machine principle and seems to require
new methods and evaluation strategies for scheduling. By virtue of the reconfiguration
port scheduling including task preemption, we can achieve real-time behavior relying on
algorithms having a low computation complexity.
We have also seen that the pure reconfigurable port scheduling technique can be easily
extended by caching concepts. Caching, in general, is driven by the overall design goal
that reconfiguration should be avoided as often as possible. In our scenario of scheduling
the reconfiguration phases, caching means that tasks will skip some instances. Thus,
it is more likely that more tasks will meet their deadlines when applying configuration
caching compared to the schedulability of the task set without configuration caching.
Despite that configuration caching inevitably increases the complexity of the feasibility
analysis of a task set, some interested scheduling strategies of the algorithms are possible.
We can include the results of the investigations of this approach into a system that
heuristically tries to derive feasible schedules even if the reconfiguration port occupancy
is larger than 1. In such a case, schedulability would be impossible without caching.
The drawbacks of the reconfiguration port scheduling are the need for homogeneous
slots. We thereby might fail to exploit the whole capabilities of reconfigurable systems.
Nevertheless, by virtue of the method shown in this chapter, reconfigurable fabrics are
exploited in a sophisticated way targeting a specific application area. However, the
method does not completely combine the three ideas of run-time reconfiguration: algo-
rithmic reconfiguration, functional reconfiguration, and architectural reconfiguration.
To finally combine these three reconfiguration types, we require an approach that
starts on an even higher level of abstraction as the approaches of the last three chap-
ters. The designer should not be aware of whether its applications are executed in a
time-shared fashion on a reconfiguration fabric, adapted by run-time reconfiguration on
changing environmental input, or even a complete reconfiguration of the run-time en-
vironment happens. Thus, we finally want to achieve a complete transparency of the
run-time reconfiguration.
5.8 Related Work
Some work has already been done in online scheduling of real-time tasks on reconfig-
urable architectures. Most of them divides the problem into two main problems: task
scheduling and task placement. In general, scheduling tasks on partially reconfigurable
FPGAs is targeted differently in the literature.
130
5.8 Related Work
The chapter also considers caching configurations under real-time scheduling. A com-
bination of both to the best of our knowledge was not considered so far.
Task Scheduling
Some works schedule area and time together [FKT01], while others focus on task schedul-
ing with fragmentation handled by a specific manager [RMVC05], or on the area as an
execution environment [WP04].
In the area of scheduling tasks on partially reconfigurable FPGAs, the authors of
[ABT04] present an approach in the realm of our work. In their work, they optimize
the area occupied, respecting the task time constraints. Tasks are not allowed to be
preempted.
In the same scenario, the authors of [WP02] and [SWP04] analyze the effect of overall
response time and guarantee-base scheduling when tasks comprise different shapes.
In some works, even task preemption is considered. If task preemption is allowed, the
task acceptance rate is improved [ABK+04, WP03]. However, hardware task preemption
represents additional costs due to still non-efficient techniques and methods available.
Generally, we seldom find concepts that respect the reconfiguration time or the sequen-
tiality of the reconfiguration phases. Usually, both are neglected due to the assumption
that the execution time is much higher than the reconfiguration time [WP03].
In [DP06], we find an approach that also considers real-time scheduling on partially re-
configurable FPGAs. In contrast to our approach, the scheduling is solved by algorithms
of the parallel scheduling domain. Additionally, the authors present adapted and new
algorithms for the special requirements of scheduling in space and time. Furthermore,
we focus on the reconfiguration phase as main scheduling problem.
In general, in our approach, the placement problem must not be considered since we
assume that every task comprises the same size.
Caching
Caching in the domain of reconfigurable computing is considered in e. g. [DeH96a]. In
general, if applications to be executed are similar, caching is desirable. Furthermore,
if tasks are periodic, the reuse of already present stages improves a system. Successful
configuration caching, presented in [LCH00], is also effective if the reconfigurable device
is attached as a co-processor. [LH02] proposes to combine configuration prefetching
[Hau98] and configuration caching, thereby detailing the subject in some more extend.
Besides these conceptual works, authors also investigate caching in coupled FPGA-
processor systems [SNG01]. History-based algorithm or specific constructs to denote
stages worth to be cached are introduced. In all the works on configuration caching,
however, real-time requirements are not considered.
In all the works on configuration caching, the reconfigurable fabric operates as co-
processer and real-time requirements are not considered.
131
5 Reconfiguration Port Scheduling
5.9 Summary
In this chapter, we have investigated scheduling strategies known from the single machine
environment and applied them on reconfigurable devices. Therefore, we introduced a
real-time scheduling layer for partially reconfigurable FPGAs. The layer accepts task sets
and schedules the reconfiguration port of execution environments. Due to the mutual
exclusiveness and the sequentiality of the port, we can apply scheduling algorithms from
the monoprocessor domain. We discussed assets and drawbacks of this approach on the
basis of aperiodic task sets. With this background, we derived scheduling algorithms for
aperiodic and periodic tasks.
As main result for both algorithms applied (EDD and EDF), we note that the approach
performs best if tEX tRT ·(m1). On the gained results of aperiodic tasks, we have
derived a scheduling algorithm for periodic tasks. The algorithm bases on the deadline
monotonic (DM) scheduling concept and is extended by a server to handle full load of
slots conditions, and a resource access protocol to handle full reconfiguration capacity
conditions. For the schedulability test, we combine the response time analysis for DM
with the priority inheritance protocol and a server. We have to solve the recurrent
equations Ri=tRT +Bi+Pi1
j=1 lRi
TjmtRT for the task set Γ0:= Γ τS.
Furthermore, we have presented caching concepts for reconfiguration port scheduling.
The concepts improve the performance of real-time scheduling of tasks onto reconfig-
urable devices. We have presented two offline algorithms, as well as on-line algorithms.
To summarize, we have shown the benefits of applying mono-processor scheduling
algorithms, as well as considered drawbacks, which can harm the predictability and
must be considered by slight modifications of the scheduling algorithms.
132
6 Algorithmic Skeletons for Dynamic
Reconfiguration
In the last three chapters, we have shown how reconfigurable hardware can be exploited
for different application domains. Most notably, challenges like intermodule communica-
tion or device fragmentation thereby can be met if an underlying run-time environment
serves as a hardware wrapping layer of abstraction. On top of such a layer, scheduling
or mapping algorithms can show their strength as they can rely on a well-designed plat-
form for their specific needs. As the approaches proved to soundly serve their application
range, we are interested in how to generalize this concept towards providing a suitable
abstracting framework for arbitrary applications. In this chapter, we therefore describe
a new solution, which gathers information about the structure of an application on a
high level by providing so-called algorithmic skeletons as implementation templates. The
functional and non-functional information, which is inherent in each of this skeletons,
is used to exploit the capabilities of reconfigurable devices, particularly in the dynamic
case. Furthermore, the approach facilitates to offer a comprehensive bridge between a
typical application/software engineer and a partially run-time reconfigurable FPGA as
execution platform.
6.1 Introduction
All over the domain of computer science, the programming level of computing devices is
raised if the technologies start to become mature. The basic von Neumann single instruc-
tion stream based processor may serve as immediate example: nowadays object-oriented
programming languages like Java or C# are the number one choice for implementing
applications on such machines, with template- or model-driven software development
as presumably next evaluation step. The written code thereby is way above the final
machine operations that are eventually executed on the CPU.
Despite a performance drawback that is always discussed when abstraction is ap-
plied, the overall benefits usually outperform this loss of speed. Programming on a high
level of abstraction facilitates faster development times, efficient code generation, error-
resistance, etc. Moreover, the applications themselves can improve their performance,
if the complexity of the application under development becomes too large to be over-
looked. Therefore, depending on the requirements of an application under development,
a problem formulation (modeling and implementation) on a high level of abstraction
can result in high-performance code, which will outperform manual implementations, as
meta-information like the structure of the application can be exploited beneficially.
133
6 Algorithmic Skeletons for Dynamic Reconfiguration
We gain similar results for this trend in all domains of processing devices. For ex-
ample, ASICs are often designed using standard cells, whereby a low-level VLSI-layout
is encapsulated into an abstract logic representation. Standard cells allow designers to
scale IC design beyond simple single-functions to complex multi-million gate devices.
For example, Systems on a Chip are a result of this technique.
Whatever brings areas of computer science to their specific matureness, always ad-
ditional concepts are introduced. Object-oriented design, for example, significantly ex-
tends imperative and procedural programming. As reconfigurable computing increases
its presence, the question how to proceed in this area arises. When considering the com-
bination of time and space as given for reconfigurable computing, a simple extension
of the classical design flow for FPGAs quickly could become tedious and lack optimal
results. We thus need a powerful concept that stands above traditional steps, maybe
even orthogonal to them, and supports processing in space and time. So far, an all-
encompassing and satisfactory solution is missing, as can be drawn as conclusion from
the discussion of a model of computation for reconfiguration computing conducted in
[KKS04]. Among others [Sin07] discusses the challenges of designing and programming
reconfigurable systems under the scope of future programming models for parallel sys-
tems, owing a concrete answer.
Parallel Computing Domain
Based on the closeness to temporal and spatial execution on reconfigurable devices, the
domain of parallel programming—executing applications on cluster computers, etc.—is
of particular interest. Foremost, parallelism is associated with a boost in performance,
as many processing units operate on the same problem in parallel. The core requirement
thereby is the organization of the parallelism—the distribution of tasks/threads to the
processing nodes must be managed. This task requires special care and even today is
an active subject of research [MMadH06]. Several challenges arise, among them the
distribution and synchronization of data. Basically, two main concepts were introduced
and are used therefore: the shared memory model and the message passing model.
Shared memory basically means that all processing nodes comprise one single mem-
ory. This memory is used for the communication and synchronization between the
tasks/threads. Also, a distributer can put values to the shared memory and collect the
results afterwards. The second approach—MPI (message passing interface)—basically
facilitates to abstract from the network. It offers a language-independent communi-
cations protocol and targets performance, scalability, and portability. To exploit the
parallelism of the underlying machines, both concepts still require the programmer to
constantly bear the parallelism of the implementation on his/her mind. They are also
often criticized as being fairly low-level [GGKK03]. Applying shared memory or MPI
to the domain of reconfigurable computing thus most likely would yield similar results.
The domain of parallel computing however also provides higher level concepts to ex-
ploit parallelism. Thereby, the concepts integrate the specific requirement of spatial
computation, which seem to be desirable also for reconfigurable fabrics: Basically, the
awareness of the parallelism must be present in the programming framework used, if
reasonable code offering high performance shall be the result. Thereby application engi-
134
6.1 Introduction
neers however should be supported continuously, including the abstraction of low-level
details of the parallel execution. Moreover, to eventually industrialize the design and
therefore utilize the concept of abstraction, sophisticated concepts must be provided.
Several of such concepts exist in the literature and have been evaluated. Among cur-
rent research are patterns that consider facets of concurrency and parallelism [Dan01,
MMS00]. However, the fairly old concept of Algorithmic Skeletons introduced in the
1980s by Cole [Col89] gets several characteristics for a mature design of parallel pro-
grams to the point. It is close to the application under development, provides implemen-
tation guidelines for efficient code and thereby enables programmability and portability.
Recently, skeletons are also captured in object-oriented languages [GSP02, MSSB00].
Moreover, authors conduct discussions on the closeness of algorithmic skeletons and
design patterns [RG02].
Partially Reconfigurable FPGAs
Undoubtedly, modern partially reconfigurable FPGAs can be compared to parallel ma-
chines. Foremost, their processing in space resembles the processing on multiple nodes
of parallel machines. If we also consider partial run-time reconfiguration capabilities of
FPGAs, the adaptability given for parallel machines can be found in FPGAs as well.
Thus, concerning the temporal and spatial capabilities of reconfigurable devices, it is
worth to be investigated whether the ideas and concepts of the parallel machine domain
can be assigned to partially run-time reconfigurable FPGAs.
Similar to the parallel computing domain, benefits are most likely if the design is
done in an architecture aware manner—close to the technical (hardware) characteris-
tics of the FPGAs. As the latter is challenging for the application oriented designer,
we show how we can raise the level of abstraction by using the above mentioned al-
gorithmic skeletons. Basically, they are programming templates that guide designers
to efficiently implement algorithms by separating the structure from the computation
itself, which—appropriately applied to reconfigurable fabrics—can make the reconfig-
urability transparent for the algorithm under execution. The programmer can thereby
evaluate the system under design on a high level, as well-defined algorithmic skeletons
allow for high-level estimation. For each skeleton (not for the application), experts have
set the implementation on the reconfigurable fabric. Additionally, dynamic reconfigura-
tion, which—when brought to its basics—allows for dynamic acceptance of previously
unknown tasks, gets supported by algorithmic skeletons in a very sophisticated way.
The reason to focus on algorithmic skeletons can also be found along the wish for a
lightweight approach. We therefore do not want to introduce the complex shared mem-
ory model or the heavyweight MPI model to reconfigurable computing. Algorithmic
skeletons seem to be a suitable and promising alternative. As will be shown in the re-
mainder of this chapter, algorithmic skeletons particularly allow FPGA design specialists
to valuably add to the challenging task of industrializing reconfigurable computing.
135
6 Algorithmic Skeletons for Dynamic Reconfiguration
FPGA
Algorithmic Skeletons
Tasks
Runtime Environment
Skeleton Dispatcher
Applications
Figure 6.1: Layered model
6.1.1 Layered Approach
We again describe the approach by virtue of a layered model, see Fig. 6.1. Applications,
which built the entrance layer, are described by a set of tasks. These tasks must be
implemented using algorithmic skeletons, in particular describing their interaction. An
execution environment that executes the tasks on an FPGA accepts the tasks described
by a set of skeleton instances only. The set of skeletons is processed by a dispatcher that
is deeply connected to its execution environment.
6.1.2 Remainder of the Chapter
This chapter is organized as follows: We first formulate the problem and conceptually
describe the proposed solution. The approach bears numerous potential for detailed
research. In this work, we refine the concept proposed by detailing three skeletons of
the stream parallel computing paradigm. Moreover, dynamic reconfiguration by virtue
of algorithmic skeletons is discussed in Sect. 6.5.
We consider applications of algorithmic skeletons for the design of reconfigurable sys-
tems. In general, as the approach of this chapter is a rather large research, we mainly
introduce the major ideas theoretically with some supporting examples. Finally, we look
at the lessons learned, review related work, and summarize the chapter.
6.2 Concept
To appreciate the elegance of using algorithmic skeletons for reconfigurable systems, we
want to consider the design of reconfigurable systems from two contrary points of view:
platform design and application design. In retrospect, both points get manifested in the
previous chapters.
Concerning the platform design—the design of the execution environment—FPGAs
fundamentally are hardware that allows for executing arbitrary circuits in space. There-
fore, a firm background in hardware design is desired, including communication and I/O
requirements. Moreover, we have to respect the critical path information of circuits,
clock skew, etc. Partially reconfigurable FPGAs increase this complexity as they allow
136
6.2 Concept
the modification of hardware over time—including all the obstacles mentioned in the
previous chapters.
In contrast, the application design—the design of applications to be executed basically
on any computational device—is driven by achieving high performance and short time
to market. Application designers therefore explore the theory behind applications and
search for algorithms that serve the problems best. Moreover, they try to abstract
from the executing devices, mostly due to reasons of programmability and portability.
Application engineers are used to describe the problems and solutions on a high level of
abstraction. In particular, the details of hardware and FPGAs thereby are of secondary
focus, as development takes place more in the terms of the software world, even if
special requirements of embedded systems are respected in some design methodologies.
Generally, application engineers do not know how to program reconfigurable systems or
even explore run time reconfiguration.
Although FPGAs themselves bear the capability of fast turnaround time in their in-
herent nature of being in-field adaptable, only a rarely found combination of application
expert and hardware engineer seems to exhibit the knowledge to satisfactory produce
valuable results. Considering the rising complexity of modern FPGAs—particularly the
challenging run-time reconfigurability—new approaches are inevitably. If these features
shall be exploited beneficially, expert knowledge encompassing temporal and spacial be-
havior of algorithms executing on hardware is required. In the previous chapters, we
have shown sound concepts for exploiting reconfigurable fabrics in several application
domains, however owing a general solution.
Basically, synthesis from behavioral problem description to reconfigurable hardware
targets this issue. In the domain of partial run-time reconfigurable hardware however,
automatic synthesis still lacks good results [Bob07]. Most of all, the input given as pure
algorithm only focusing on the algorithmic behavior can hardly produce reasonable
results. For example, partitioning (as presented in Chap. 3) or Pareto-optimal mapping
(Chap. 4) may be required, both struggling to detect application inherent parallelism.
Furthermore, if iterative design, which is very costly for hardware, is required due to
performance evaluation, or portability is an issue, we require a more suitable design
methodology that supports designers on a high level of abstraction.
Algorithmic skeletons, which inherently provide parallelism by making structure and
computation (behavior) explicit, are a technique that facilitates to overcome several of
the constraints mentioned above. Moreover, algorithmic skeletons can serve as bridge
between circuit design and application development for FPGAs.
6.2.1 Algorithmic Skeletons
Already briefly discussed in the introduction of this chapter, we want to detail algorith-
mic skeletons in the following, before proposing them for reconfigurable system design.
Algorithmic skeletons were introduced by Cole in the 1980s [Col89] under the mo-
tivation of providing a means for the design and implementation of software systems,
which allow for coordination of concurrent activities of large-scale parallel systems. The
major aims of algorithmic skeletons therefore are to simplify programming, enhance
137
6 Algorithmic Skeletons for Dynamic Reconfiguration
portability, improve performance, and offer a scope for static and dynamic optimiza-
tion [Col04]. Thereby, the approach addresses many of the traditional issues within the
parallel software engineering process.
The basic idea of algorithmic skeletons is to separate the structure of a computation
from the computation itself. Algorithmic skeletons free the programmer from the im-
plementation details of the structure, such as how to map it to the available processors,
as the inherent structure of each algorithmic skeleton constraints the physical distribu-
tion of the processing so that sound performance is possible. By providing a structured
management of parallel computation, they can be used to write architecture independent
programs, shielding application developers from the details of a parallel implementation.
The idea of parallel programming with skeletons thus is to separate two basic concerns
of parallelism—application and implementation. The user can specify the potentially
parallel parts of an application using predefined programming patterns (skeletons) and
leaving the actual organization of parallelism to the skeleton implementation, for ex-
ample provided by a compiler or a library [RG02]. Algorithmic skeletons also can be
nested, thereby increasing their expressiveness [Pel98].
The closest analogy of algorithmic skeletons are higher-order functions of functional
languages. More precisely, algorithmic skeletons are similar to polymorphic higher-
order functions representing common parallelization patterns. The higher-order func-
tions therefore are implemented in parallel. However, unlike the concept of higher-order
functions might suggest, skeleton programming is not functional programming. In con-
trast, algorithmic skeletons can be used as the building blocks of parallel and distributed
applications by integrating them into a sequential language [BK96].
The purpose of every skeleton is to abstract a pattern of activities and their interac-
tions. Applications are expressed as an instance of one or more skeletons. They thereby
provide a precise means of implementation, which also basically separates them from
design patterns. The latter are mostly used during the design phase and offer only ori-
entation for the final implementation. Initially, patterns are the structuring concepts of
sequential computing/programming. In contrast, skeletons are a concrete implementa-
tion aid.
Concerning design space exploration, skeletons and their level of abstraction enable
to explore a variety of parallel structurings for a given application. Thereby, a clean
separation between structural aspects and the application specific details can be achieved
by virtue of the algorithmic skeletons. Thanks to the structural information provided,
static and dynamic optimization of implementations is possible.
Consequently, there has to be a balance between generality (allowing re-use for dif-
ferent architectures and user kernels) and specificity (for efficient implementation and
interfaces to the user kernels). Moreover, we denote the so-called trap of universality
providing a skeleton that is generic in itself and can be used if no other skeleton might
fit. Such a skeleton would increase the complexity of any run-time environment. In
order to avoid this trap, there is usually the restriction of the acceptable input for a
system to a set of valid algorithmic skeletons only, see also [Col04, RG02].
Finally, reliability is increased because skeleton programs are stripped of irrelevant
details, often making them easier to construct.
138
6.2 Concept
6.2.2 Application for Reconfigurable Computing
As sketched in the introduction, reconfigurable computing on FPGAs basically bears
similarity to processing on parallel systems, as execution of algorithms on hardware like
FPGAs also means processing in parallel. When reconfiguring FPGAs, we usually define
exchangeable regions and apply different modules to these regions. Several such regions
can be marked on the same FPGA, comprising an execution environment. The regions
then are comparable to the nodes of a computing cluster. The intermodule communica-
tion, so still a challenging research area, enables various ways of data exchange. We can
distribute applications into the regions as it is done in the parallel computing domain.
For efficient execution and beneficial exploitation of the capabilities, both systems need
structure, which is provided by algorithmic skeletons.
Having applications described by virtue of algorithmic skeletons, we can extract struc-
tural information of the application that is viable for executing the application in time
and space. For example, parts of an application that can be processed in parallel, may
reside spatially distributed on a reconfigurable fabric, thereby improving the overall re-
sponse time of the application. However, if space is limited, we may also execute these
parts in sequence applying run-time reconfiguration.
Algorithmic skeletons therefore become deeply connected to the execution environ-
ments. Applications then use a well-defined set of skeletons that allows for compre-
hensive exploitation of the processing capabilities of the hardware encapsulated by
the environment. In fact, as the implementation of skeletons is tied to the execution
environment—each execution environment, which was specifically designed for its re-
configurable fabric, offers a set of skeletons having pre-defined characteristics—the ap-
plication designer can evaluate the application under development on a high level of
abstraction, by mapping the application to a (set of) skeleton(s).
Moreover, algorithmic skeletons facilitate to react on dynamic requirements. For ex-
ample, if several applications share the same reconfigurable fabric, we can offer different
quality of service to the application. Referring to the structural information of the appli-
cations, we may execute parts of an application spatially or temporarily, depending on
a negotiation of the applications. Also, bottleneck stages of a pipeline-like application
might be executed in parallel on a reconfigurable fabric to improve the performance.
We thus use algorithmic skeletons as means of abstraction for partial run-time recon-
figuration. The skeletons foremost provide a sound method to abstract reconfigurable
computing on a high level. The core idea of the skeletons thereby is to force users to add
valuable information for reconfiguration to the system description/implementation by
referring to algorithmic skeletons. We then can extract the inherent information about
parallelism from the algorithmic skeletons and use it to generate optimized results.
In a straight forward approach, algorithmic skeletons therefore are offered as a library
that is used by the algorithms of the application under development. Tasks of these
applications described by virtue of skeletons can be executed on a run-time environment
that accepts the skeletons. The usage of algorithmic skeletons constrains the design of
algorithms to a set of templates understood by the environment.
139
6 Algorithmic Skeletons for Dynamic Reconfiguration
Trap of Universality for FPGAs
Algorithmic skeletons intentionally constraint the design space of application developers
and force him/her to express the algorithms by referring to a given set of skeletons solely.
A common question thereby is the one on problems that might fail to be expressed by
the set of skeletons given. Cole coined this problem the trap of universality [Col89],
see above. He raised the question whether it is advisable to offer a skeleton for all
requirements of the application designer or whether the processing platform should be
optimized for a set of skeletons that must be used but provide high performance.
In case of applying algorithmic skeletons to FPGAs, the trap of universality can be
approached in a different manner. Besides the possibility to dedicate some area for ar-
bitrary algorithms (a kind of generic skeleton), we can also make use of a soft or hard
core CPU mounted on the fabric. Applications, which are complex to be covered by a
skeleton, then can be executed in software on the CPU. For example control flow inten-
sive parts of an application, which are furthermore generally costly when implemented
in hardware, directly benefit from the additional possibility to execute software within
our otherwise algorithmic skeleton based system.
6.3 Run-time Execution Environment
Being the fundament of the whole concept, the execution environment must be consid-
ered in sufficient depth. The design of the environment heavily influences the acceptance
and performance of applications given by virtue of skeletons. Acceptance, as each en-
vironment will only accept a defined variety of skeletons including a maximal nesting.
Performance, as execution environments can be optimized to prefer selected skeletons
including a guarantee to execute them in a best-way.
In addition to the pure physical and structural layout of an execution environment,
a reconfiguration manager (dispatcher or scheduler) must be designed. This manager
resembles the entity that is responsible for accepting skeletons and allocating them onto
the reconfigurable fabric. The manger therefore must hold track of the current status
of the substrate. As the skeletons heavily constrain the variety of the applications, the
reconfiguration manager must only know how to handle its specific subset of skeletons
combined with the specific run-time environment. Again, the design of the reconfigura-
tion handler will be a trade-off between performance and flexibility.
Together, the environment and the reconfiguration manager thus make up a pair that
is finally responsible for dispatching applications. Thereby, we inevitably will have to
solve a trade-off between generality and specificity. On one hand, the execution environ-
ment may be optimized for executing algorithms given as a specific type of algorithmic
skeleton that have to fulfill area constraints. Applications not meeting these constraints
would have to be rejected. On the other hand an execution environment may offer a
broad rang of skeletons that themselves have few area or communication constraints.
However, the performance of such a general system quickly might drop down.
To face the trade-off between generality and specificity, we present and evaluate three
variations of execution environments in the following. The first—tile-based—marginally
140
6.3 Run-time Execution Environment
tile tile
Figure 6.2: Two tile-based run-time execution environments
constraints the reconfigurable device by only dividing the substrate into tiles compris-
ing mainly direct neighbor communication only. The architecture accepts basically all
skeletons, however, the design of an appropriate reconfiguration manager quickly can
become a sophisticated task. A nice combination of generality and specificity depicts
the second—skeleton-centric—example, that accepts a well-defined subset of algorithmic
skeletons. These skeletons themselves may be implemented quite freely in the range of
the environment. The idea of having a very tight environment including a nearly static
reconfiguration manager is presented last, called application-centric. Here, the platform
is designed offline with a task set given, therefore allowing to execute the task set with
a high performance.
Moreover, in-field updates to the environment or the reconfiguration manager that
affect the behavior of the pair can be possible and welcome.
Tile-based Execution Environment
For this very general execution environment, we rely on architectural ideas of reconfig-
urable computing, like the tile-based structure of Xilinx Virtex-4 and Virtex-5 FPGAs.
Basically, the substrate is divided into tiles, which can be of but do not have to be
necessarily restricted to equal size, see Fig. 6.2. We therefore consider two different tile
arrangements: the first being a purely quadratic organization, while the second one of-
fers more direct communication possibilities due to an underlying hexagonal structure.
For the IO, each boarder tile can accept new values.
A run-time reconfiguration manager takes care of the execution. The manager al-
locates area dynamically for the different problems. As tasks are given by virtue of
skeletons, we know the structural requirements of the tasks. They thus are not loaded
arbitrarily into the tiles, but by referring to the skeletons. The reconfiguration manager
can also initiate an online relocation. As already stated in previous chapters, however,
this relocation is hardly cost efficient. In general, as the freedom of the allocation and
scheduling also depends on the FPGA architecture used, both the environment and the
manager form a close unity.
The approach offers high flexibility on homogenous FPGAs, as the reconfiguration
manager can exploit the whole area. Each tile comprises the same logic resources, there-
fore completely harmonizing the execution environment. However, also heterogeneous
141
6 Algorithmic Skeletons for Dynamic Reconfiguration
Bus
I/O slot I/O slot
slot slot
FPGA
direct
direct
direct
direct
direct
direct
Figure 6.3: Skeleton-centric execution environment
execution environments are possible. Here, not all tiles will accept all types of skeletons,
however the performance of skeletons can be better.
Xilinx Virtex-4 devices support the proposed execution environments as they sup-
port 2D style partial reconfiguration. Furthermore, on these devices, the external
communication—the I/O pads—is separated and not part of a slice. Therefore, the
reconfiguration manager must respect the IO pinning. Only applications that can feed
in their data at free tiles may be accepted. We can also implement the tile-based exe-
cution environment on the Erlangen Slot Machine.
In Sec. 6.4, we show how skeleton dispatching can look like on such an environment.
Skeleton-centric Execution Environment
We can also design the execution environments towards accepting only a specific set of
skeletons, denoted as skeleton-centric execution environment. By focusing on a selection
of skeletons, we can offer very well-performing execution of these skeletons on an FPGA.
We may also design the execution environment with respect to hardware cores of the
FPGA like multipliers, etc. Well-placed dedicated cores then add to the performance of
the skeletons. The reconfiguration manager is optimized to accept and schedule a set of
skeletons onto the execution environment including the specific computational resources.
The exemplary execution environment depicted in Fig. 6.3 shows how a skeleton-
centric platform may look like. The platform was designed to host both pipe and farm
skeletons. Each stage of a pipe skeleton can be loaded into one of the slots. The first
stage should be placed in the rightmost or leftmost slot and immediate successors should
be placed in neighboring slots. For the data transfer between the stages, the direct com-
munication links can be used. The communication for farm skeletons in contrast is given
by the bus that spans the whole width of the slots. Therefore, the placement of workers
of a farm skeleton can be done more freely.
The design of a reconfiguration manager directly follows the implementation con-
straints of the skeletons. For the example given in Fig. 6.3, the reconfiguration manager
142
6.3 Run-time Execution Environment
may even rely on predefined scenarios of having pipe, farm or a combination of both
hosted on its substrate. For example, the scheduling concepts of the previous chapters
could be used, for example deriving a schedule for the pipe skeleton. In Sect. 6.5 we
show how dynamic reconfiguration can be achieved on such an environment, thereby
also detailing the two skeletons pipe and farm.
In general, there always may be an overlap of areas that are shared by different skele-
tons. However, compared to the tile-based approach, not all areas of this execution
environment are suitable for all algorithmic skeletons. Furthermore, the communica-
tion infrastructure of the skeleton-centric approach is optimized for the intended set
of skeletons. Therefore, it often provides superior performance compared to the tile-
based approach, however excluding applications that do not meet the communication
requirements.
Moreover, we can also couple the skeleton-centric approach with a specific board the
FPGA is mounted on. In general, on such boards the pinning is fixed—having dedicated
areas for different types of input. By taking these physical constraints into account, the
skeletons can be optimized concerning their physical arrangement on the FPGA.
To conclude, skeleton-centric execution environments nicely resemble the idea of plat-
form-based design. However, they not necessarily must stay fixed over time. In-field
updates of both the partitioning and the dispatching may extend the idea of platform-
based design as observed in [Ram07].
Application-centric Execution Environment
In the case of an application-centric execution environment, the platform and the recon-
figuration manager will be fixed for a given task set. The task set must be structured by
virtue of algorithmic skeletons. Algorithmic skeletons then are used to ease the design
process of reconfigurable systems. The structuring given also eases portability of the
applications, as the non-functional information given by the skeletons helps to find a
suitable placement on multiple execution environments.
The performance of such a system thereby can be estimated and evaluated on a high
level of abstraction, as the algorithmic skeletons facilitate to estimate the behavior.
For example, applications that are too large to fit on the reconfigurable substrate
available can share the resources (exploiting hardware virtualization). If they have been
designed using algorithmic skeletons, even an automatic generation of both, execution
environment and reconfiguration manager can be performed. This idea under the pri-
mary focus of developing a framework for the design of partial dynamic reconfigurable
circuits by virtue of algorithmic skeletons was carried out in a diploma thesis supervised
by us [Fra07].
Summary
Despite the application area for the execution environment, the final decision also de-
pends on the FPGA designer and the FPGA itself, including the selection of the com-
munication structure. In this work, we use the introduced run-time environments to
abstractly exploit the feasibility of our approach. It is most likely that FPGA spe-
cialists would implement different and better solutions. In fact, thanks to the clear
143
6 Algorithmic Skeletons for Dynamic Reconfiguration
E
C
W1Wp
W2
in
Figure 6.4: Farm parallelism
W1Wp
W2
in
out
Figure 6.5: The pipeline paradigm
requirements of providing an optimal run-time environment for a set of fixed skeletons
that will be executed on the FPGA, the FPGA specialist can focus on this limited set.
Nevertheless, in order to approach a truly optimal solution, both the algorithmic and
the FPGA specialist should work together as close as possible.
6.4 Stream Parallelism
To detail the idea of algorithmic skeletons for reconfigurable computing, we discuss the
domain of stream parallelism in this section. We link the problem to the idea of having
a very flexible tile-based execution environment.
Basically, stream parallelism is a very close idea of parallel computing that matches
the ideas of execution on FPGAs [Pel98]. Stream computation can be described as
applying f:αβon a stream of input values α1, α2, . . . . The idea is to exploit the
parallelism within the computation of fon different (and unrelated) elements of the
input stream. As an example, we can consider a vision system that explores images.
The images enter the system abstracted as a stream and must be handled differently.
6.4.1 Farm Paradigm
An algorithm that computes the same fon all of the elements of a stream α1, α2, . . .
exploits the farm paradigm. The computations f(α1), f(α2), . . . can be executed in
parallel using a pool of parallel processing modules. Figure 6.4 depicts the concept.
The major characteristic to observe is that the workers W1, W2, . . . , Wpcan be executed
independently of each other as they are all operating on a different data set.
As an application example, we can assume a stream of two video channels that
should be output alternatingly to one single channel. However, the switch between
the two channels should not be abrupt but smoothly—a fading between the two chan-
nels. We thus have a function fthat is applied on a stream of three input values
hx1, y1, c1i,hx2, y2, c2i, . . . , while x1and y1denote the two video streams and c1the
144
6.4 Stream Parallelism
W1
E,C
W3
W2 E
W4
b
W1 W2
u
W5
W3
s C
W6
inin out
Figure 6.6: Two possible execution schemes of applications using the farm skeleton.
dominance of the one stream over the other (an increasing/decreasing number of e. g.,
8 Bit width). Both streams arrive at the node Ewhich distributes the single images
to the worker processes W1, W2, . . . , WPadding the number ci. These functions can be
computed independently of each other in different worker nodes Wi. The results then
are propagated to the combining node Cthat forwards the stream to the video screen.
The more workers we have, the higher the frequency of the input values can be.
If we describe our algorithm using a skeleton for the farm paradigm, the structure of
the application is given. Thus, we know how to execute the algorithm on a tile-based
execution device. Figure 6.6 shows two examples how the skeleton can be mapped. In
the left approach, we use the same tile for the input and output of the nodes. However,
as we rely on direct communication links, the number of possible worker tiles is limited.
Therefore, the right approach of Fig. 6.6 spans the farm skeleton over the whole width
of the FPGA.
Dynamic run-time reconfiguration is needed if the amount of worker modules should
be adapted during run-time. External stimuli therefore could be a requirement to adapt
the quality of service, etc. Further details are discussed in Sect. 6.5.
To summarize the farm skeleton, a structural concept is given that facilitates to dis-
tribute workers of an application on different tiles of a partially and run-time reconfig-
urable FPGA. The execution of the workers including their reconfiguration take part
under the supervision of the run-time environment and its dispatcher. By describing
an application on basis of the farm skeleton, the number of workers is not set. De-
pending on the resources available, a different quality of service can be achieved. The
optimal solution—a solution that avoids the blocking of workers, etc. due to overload
conditions—must be derived carefully by evaluating the execution times of the function
fand the distribution time of the initial node E.
6.4.2 Pipeline Paradigm
The pipeline paradigm comprises a composition on nfunctions f1. . . fnsuch that
f1:αγ1, . . . , fi:γi1γi, . . . , fn:γn1β(6.1)
Figure 6.5 shows the concept as a graph.
145
6 Algorithmic Skeletons for Dynamic Reconfiguration
f
f
g
g
g
g
g
f
out
in
f
f
g g
g
f
g
g
g
g
in
Figure 6.7: Two implementations of a pipe skeleton with different area requirements.
As an example within our image processing environment, we consider a scenario of
a stereo vision system. We receive the input of two cameras hx1, y1i,hx2, y2i, . . . and
want to extract valuable information out of the system. We therefore compute the
composition of two functions fand g. Function fwill result in a combination of the
two images, having the pixel combined into the means (g(hxi, yii) = zi), while fwill
produce the histogram on the resulting image zi. The functions fand gcan be executed
in parallel each on a subsequent data set, thus exploiting pipeline parallelism.
In Fig. 6.7, we depict two possible implementation assuming the second function gto
consume more area than function f. Here we can see that different stages can consume
more area than available on one single tile by simply combining tiles. The dispatcher
of the run-time environment may react on the different requirements of the functions
within the pipeline. If enough area is available, the dispatcher may also built up a second
pipeline in parallel in order to increase the throughput of the systems.
Describing a problem using the pipeline skeleton, we can further exploit the charac-
teristics of stream processing. As the stages of the pipeline get activated in sequence,
we can decrease the reconfiguration latency of the overall system. We successively load
the bitstreams of the pipeline stages in their order given. After reconfiguring the first
stage, this stage may start its execution before the complete pipeline is loaded. The
same holds for the subsequent stages. Thus, the fastest possible response time can be
guaranteed. Additionally, if less area than required by the stages is available, we may
apply hardware virtualization. Therefore, only parts of the overall pipeline are loaded
on the FPGA at the same time. These parts may also perform block processing of a
block of input sets in order to hide the reconfiguration overhead.
In general, the adaptability given by FPGAs can be explored best if the streams do
not exist indefinitely but for a given period of time. Then, we can share the same
reconfigurable fabric to host these streams in a time-shared fashion, also including the
possibility to host multiple applications at the same time and to interweave two or more
applications. Moreover, algorithmic skeletons can be composed to built more complex
parallel structures. For example, a farm skeleton may be nested in a pipe skeleton
denoting farm parallelism of one of the stages.
146
6.5 Dynamic Reconfiguration
6.5 Dynamic Reconfiguration
Basically, different applications each using their own instances of skeletons can be ex-
ecuted on the same FPGA, exploiting a multi task environment. Depending on the
specific needs (quality of service, etc.) of the applications behind the skeletons, we can
react and dynamically adapt the organization of the skeletons on our execution environ-
ment by virtue of (partial) run-time reconfiguration.
In particular, it becomes possible to exploit true dynamic reconfiguration—the host-
ing of applications not known at the design time of the platform. If these applications
are given by virtue of algorithmic skeletons and the execution environment accepts the
skeletons used, a successful mapping can take place. Thereby, not only the structural
information but also the resource requirements of the application have to be taken into
account. A reconfiguration manager that conducts the dynamic reconfiguration of its ex-
ecution environment therefore must react very flexibly and verify arriving applications.
In the worst case, the manager might also have to reject the application. Nevertheless,
if applications for dynamic execution were given completely without any structural in-
formation, we would have to cope with fragmentation and on-line routing issues that
can be tremendously challenging.
In the following, we present two examples of dynamic reconfiguration by virtue of
algorithmic skeletons. The first one extends the tile-based execution environment exam-
ple from above, discussing the general procedure for dynamically hosting applications
not known at the design time of the platform. The second example discusses dynamic
reconfiguration on a skeleton-centric execution environment.
6.5.1 Dynamic Reconfiguration on a Tile-Based Execution
Environment
On the homogeneous tile-based execution environment dynamic reconfiguration is both
flexible and challenging. Basically, tasks given by virtue of skeletons can be loaded to
multiple possible locations. Nevertheless, the skeleton describes the general requirements
of the behavior of the application in space and time including the physical requirements
like communication, etc. We thus get a reasonable guideline for the dispatching of the
applications, which can be used by the reconfiguration manager.
In Fig. 6.8, we display an example of three applications dynamically arriving and
sharing a reconfigurable fabric. We assume a scenario where one application using a
farm skeleton is executed in the left side of the FPGA and a second application also
using a farm skeleton that occupies the right side of the FPGA. The former one can use
four worker tiles, while the latter has six tiles on its dispose. The workers of the second
farm skeleton however occupy two physical tiles, thus only three workers are executed
in parallel using six tiles. Moreover, the emitter and collector of the second application
are each placed in their own tiles. Note also that this arrangement additionally requires
two tiles for the data transfer of worker W 1 and W 3 to the collector, as we support
direct neighbor communication only.
147
6 Algorithmic Skeletons for Dynamic Reconfiguration
W0
E,C
W3
W2
W1
W1
W1
W2
W3
W2
W3
C
E
in in
in
out
out out
W0
E,C
W3
W2
W1 W2
W3
W2
W3
C
E
in
in
out
out
P2
P1
P3
Figure 6.8: Run-time reconfiguration of different skeletons
At some point in time, a new application requests to enter the system. The application
is given as an instance of a pipe skeleton, comprising of three stages. As there is no free
space on the device to host the application, we dynamically alter the current allocation
of the fabric. We decrease the amount of workers of the second (right) farm skeleton,
thus freeing area in the middle of the FPGA. This area is then used to execute the new
application that is implemented referring to the pipeline skeleton.
The high degree of freedom given by the tile-style execution environment makes ap-
plication dispatching a challenging task. As can be imagined by the example of this
section, the reconfiguration manager has to consider multiple contradictory require-
ments like I/O, response time, resource requirements, etc. In particular, the scheduling
of the reconfiguration phases for dynamic reconfiguration on the tile-based execution
environment becomes complex, as many alternatives are possible. For example the dis-
patcher can operate on a first come first serve basis using a queue for the reconfiguration
of the tiles of every application. Having a single queue for each application allows for
interweaving the reconfiguration phases of different applications based on the priorities
or on the current state of the applications. Nevertheless, the generation of an optimal
schedule that minimizes the response times of all applications quickly can become com-
plex, in particular if execution times of tasks are not known in advance. The subsequent
example therefore focuses on reducing the complexity of the reconfiguration manager,
however sacrificing the flexibility of the tile-style platform.
6.5.2 Dynamic Reconfiguration on a Skeleton-centric Execution
Environment
Based on the exemplary skeleton-centric execution environment presented in Sect. 6.3,
we discuss dynamic reconfiguration characteristics for a second example. Dynamically
arriving tasks shall be executed on the given environment. The execution environment
was designed to host two types of skeletons: pipeline skeleton and farm skeleton.
148
6.5 Dynamic Reconfiguration
P1 E
&
C
P1 P2 E
&
C
W1
P1 P2 E
&
C
W1P1
P3 P2 E
&
C
W1 P3 E
&
C
W1W2 E
&
C
W1W2W3
time
Figure 6.9: Two applications given as farm and pipeline skeleton executing on a skeleton-
centric execution environment
For the example discussed, two independent applications arrive for execution on the
platform. Both applications have a block of input data to be processed—they thus will
have to reside on the reconfigurable fabric until the complete data could be transformed.
The two applications are given by virtue of algorithmic skeletons. The first application
is given referring to the pipeline skeleton. The block of data has to undergo three
transformation steps (P1, P2, and P3), making up the stages of the pipeline. In contrast,
the second application is given using the farm paradigm, which denotes that each value
of the input data stream can be executed independently. Therefore, a worker (W) has
been designed, which gets input data from an emitter (E) and sends the result to a
collector (C) node. To maximize the response time, as much workers as possible should
be executing in parallel.
Figure 6.9 displays the scenario on the skeleton-centric execution environment, assum-
ing the pipeline skeleton arriving before the farm skeleton. Shortly after reconfiguring
the first stage of the pipeline (P1), the rightmost slot gets reconfigured with the emit-
ter/collector logic of the second application (E/C). Before loading the first worker (W1)
of the farm application, we start to reconfigure the second stage of the pipeline (P2).
As now all four slots are occupied, we reuse the slot of the first stage of the pipeline
to host the third stage (P3). Therefore, the block of input data must have completely
passed the first stage of the pipeline. The execution environment support direct data
forwarding in both directions so that the execution direction of the stages of the pipe
skeleton can change. After the data has also passed the second stage of the pipeline, we
can reuse the slot of the second stage, to add a second worker to the execution environ-
ment (W2). The same holds for the slot formerly used for (P3), once this stage also
has finished. Then, we gain a well-performing second application exploiting task level
parallelism. The emitter/collector tile will have to take care of balancing arriving data
to the three possible workers.
149
6 Algorithmic Skeletons for Dynamic Reconfiguration
RT1P1
RT2
RT4
RT3
Slot 1
Slot 2
Slot 3
Slot 4
P2
P3
E/C
W1
RT5
RT6W2
W3
RT7
Figure 6.10: Scheduling example of a farm and pipe skeleton on the skeleton-centric
execution environment
Concerning the intermodule communication of the different parts of one skeleton, we
can apply a physical separation on this execution environment. In detail, for the pipeline
skeleton, direct communication is used. Without any intermediate buffer or control
logic, the data is transmitted from one slot to its direct neighboring slot. Obviously, the
constraints of such a direct communication arise that have been discussed in Chap. 3,
for example concerning the limited data width. Moreover, note that the number of
communication resources has been fixed at design time of the execution environment.
Therefore, the pipeline stages must be adapted to meet the constraints.
For the data transfer of the farm skeleton, we use the bus of the execution environment.
The emitter sends data to the worker tiles and also collects the results. Obviously, the
benefit of using a bus is that communication also between non-neighboring slots can take
place. However, for both communication concepts the applications must be constrained.
Appropriate wrappers that may become part of the execution environment can help to
reduce the user requirements. For example, the concept of using the wishbone bus as
discussed in Chap. 3 can become valuable.
The schedule of such an example looks like the one depicted in Fig. 6.10. As the slot
size of this exemplary execution environment is homogeneous, we derive equal recon-
figuration times. Moreover, we have to consider the single reconfiguration port, which
requires a ordering of the reconfiguration requests. However, as the skeletons allow us
to rate the requirements very easily, the complexity of the scheduling can be reduced.
For example, the three stages of the pipeline must be scheduled in sequence without
any optional stages. In contrast, the farm skeleton only requires the emitter/collector
slot and one worker tile to be reconfigured to guarantee correct behavior. Additional
workers are optional, however increase the performance on average.
To summarize, the implementation of applications by virtue of algorithmic skeletons
enables a sophisticated and dynamic execution of applications on FPGAs. The run-time
environment allows us to load tasks not known at the design time of the environment. As
the usage of algorithmic skeletons enforces the applications to be well-formed, we thereby
can prevent fragmentation of the devices and guarantee communication requirements.
Nevertheless, the design of an appropriate reconfiguration manager is required, which is a
sophisticated task, even if the execution environment has been designed skeleton-centric.
150
6.6 Experiment
E
C
DES 1
DES 2
DES 3
W1
DES 1
DES 2
DES 3
W2
DES 1
DES 2
DES 3
Wn
...
Figure 6.11: Triple DES given as a pipe and farm skeleton
6.6 Experiment
To finally approach the concept practically, as well as to consider yet another example
where algorithmic skeletons show advantages, we report on conducted experiments in
this section. We first show an example of the cryptography domain that resembles the
triple DES example of Chap. 3. Here, we can make use of the pipeline paradigm as
well as the farm paradigm. Subsequently, we report on a master thesis that exploits
algorithmic skeletons for application-centric design of reconfigurable computing.
6.6.1 Cryptography Experiment
For the first experiment a similar scenario to the triple DES example as presented in
Sect. 3.6 shall be used. Triple DES—three basic DES cores that are processed in sequence
to raise the level of security—basically resembles a pipeline. Moreover, the algorithm
usually operates on a stream of data, by encrypting (decrypting) blocks. These blocks
are independent, which allows us to embrace the pipe skeleton by a farm skeleton.
If we thus describe the cryptography example triple DES by virtue of algorithmic
skeletons, we get a combination of pipe and farm skeleton as depicted in Fig. 6.11. In
the figure we show the graphical description of the cryptography scenario as it might
also serve as input for our method. In detail, the three steps DES 1, DES 2, and DES 3
make up a pipe skeleton. This skeleton is embraced by a farm skeleton consisting of the
worker nodes W 1, W 2, . .. , W n, adding additional emitter Eand collector Cnodes.
Behind each of the nodes DES 1, DES 2, and DES 3 a description of the behavior must
be given. In our case, a VHDL code is added. The code simply denotes the behavior of
the steps and offers appropriate communication as the external interface—data in,data
out, as well as start and stop (data ready).
For implementing this application to be executed on a reconfigurable run-time environ-
ment, a pre-processing must take place that wraps the pure VHDL description and adds
151
6 Algorithmic Skeletons for Dynamic Reconfiguration
surrounding control logic. Of particular interest are the variations for the intermodule
communication. Here, we can basically rely on the research carried out in the Chap. 3,
where we have investigated direct, buffer-based, memory-based, etc. communication al-
ternatives. The appropriate wrapping should be offered by the execution environment,
as it is platform dependent. Based on the task description and the communication
between the tasks given, we can generate the wrappers.
Two slot framework
Basically, we can also consider the two slot framework as an execution environment that
can exploit applications given as an instance of the pipe skeleton. The difference now
is that we do not constrain the two slot framework to exactly match the triple DES
application, but to accept a multiple of applications described as a pipe skeleton. If we
thus map the triple DES application as given in Fig. 6.11 onto the two slot environment,
we wrap the VHDL code behind the stages DES 1, DES 2, and DES 3 so that the data
communication with the intermediate memory in the controller slot can take place.
Moreover, in difference to the example of Chap. 3, we now consider a stream of blocks
of data. Due to the long reconfiguration time it is therefore advisable to host the stages
as long as possible, as this can maximize the throughput. In detail, we execute DES 1
until the memory in the controller slot is full, having accepted a number of blocks. In
the meantime, we have reconfigured DES 2, so that now DES 2 can process the blocks
residing in the memory. Finally, DES 3 will process the data, so that DES 1 can be
reconfigured and the procedure can start over.
The example of the two slot framework already shows that in addition to the struc-
tural information as given by the skeletons, communication information is required. In
particular, the frequency of arriving blocks of the data stream must be known to adapt
the two slot framework. Here, the algorithms for real-time scheduling of tasks on recon-
figurable fabrics discussed in Chap. 5 can help. By performing a response time analysis,
we can accept or reject an application.
Skeleton-centric execution environment
We have also designed a completely new skeleton-centric prototypical execution envi-
ronment on a Xilinx Virtex-4 FPGA. The environment, depicted in Fig. 6.12, consists of
eight partially reconfigurable regions. The regions are physically arranged to match the
architectural constraints of the FPGA. For example, the height of each of the regions
sums up to 16 CLBs or a multiple thereof. Basically, the environment was designed as
an exemplary skeleton-centric platform that is capable of accepting tasks given as pipe
and farm skeletons and some specific combinations thereof.
In the left side of the FPGA, we have placed six regions that match the requirements
of a pipe skeleton. The direct communication going from top to down facilitates an
immediate forwarding of data to be processed following the pipeline paradigm. The
E/C regions on the right side can help to also reduce the pipeline to less than six stages
as described by mapping the triple DES application onto the platform in the following.
Moreover, we have implemented the infrastructure for two farm skeletons on the ex-
ecution environment. By dividing the substrate into an upper and a lower half, the
152
6.6 Experiment
IO
FIFO +
BlockRAM
DSP slices
16 CLBs
P/W
P/W
P/W
P/W
E/C
E/C
P/W
P/W
FPGA
clock
logic
Figure 6.12: Skeleton-centric platform on a Xilinx Virtex-4 FPGA
arrangement becomes clear. Basically, the E/C regions can host emitter and collector
logic to access the three regions on their left side. We can also combine the two E/C
regions to make up one large emitter and collector, which can access six worker regions.
For mapping the triple DES application given by virtue of a farm and pipe skeleton
as depicted in Fig. 6.13 onto the skeleton-centric execution environment, we exploit the
entire regions. The two E/C regions are combined to distribute the data to the two
workers. Each worker again consists of three stages, thus embracing the pipe skeleton by
a farm skeleton exploiting parallelism by implementing multiple pipelines. In Fig. 6.13,
we have marked the communication infrastructure used. Again the pure VHDL code was
wrapped to match the requirements of the execution environment. The result is given in
Fig. 6.14, which depicts an Xilinx FPGA editor snapshot of the triple DES application.
6.6.2 Application-centric
In a master thesis under our supervision [Fra07], the practicability of using algorithmic
skeletons for hardware virtualization on partially reconfigurable FPGAs was investi-
gated. The author targeted on how to efficiently calculate offline the placement and
scheduling for temporally allocated modules of partially reconfigurable FPGAs. He was
motivated by the fact that mapping applications to partially run-time reconfigurable
devices is a complex and challenging task.
The work shows that algorithmic skeletons dramatically ease the design process and
allow for high-level exploration of the design under development. Basically, applications,
which are described by HDL modules and their connections that both together resemble
algorithmic skeletons, use patterns designed by experts to built up their specific run-time
execution environment. For the scheduling, the arrangement, etc., always the inherent
information of the underlying algorithmic skeleton is used. Therefore algorithms for the
automatic scheduling and placement of the modules on the FPGA are investigated.
153
6 Algorithmic Skeletons for Dynamic Reconfiguration
P2
P3
P1
P2
E/C
E/C
P1
P3
clock
logic
W1
W2
Figure 6.13: Mapping the triple DES application onto the skeleton-centric platform
Foremost, the farm and pipe skeleton get a slightly different interpretation: The stages
of a pipe skeleton primarily denote tasks of a task dependence graph, which are executed
in sequence, always activating the next stage after the current stage has finished its com-
putation. For the farm skeleton, the amount of workers is given. Moreover, each worker
comprises a different function requesting its data from the emitter. An implementation
of a farm circuit is given in Fig. 6.15. In addition to the already known emitter, col-
lector, and worker tiles, also a dedicated memory for intermediate data storage is used.
Furthermore, Fig. 6.15 depicts the communication infrastructure including data busses
and control signals.
In the design flow proposed, the design is given using standard VHDL and a descriptive
language designed in the master thesis influenced by the work on algorithmic skeletons
on massively parallel systems of [Kal02]. The descriptive language basically allows for
instantiating skeletons:
FARM(emitter, worker1, worker2, ..., workerN, collector);
PIPE(stage1, stage2,..., stageN);
Using the language, we do not have to specify the signals between the slots. In
contrast, an automatic generation takes place. The VHDL files therefore have to fulfill
certain specifications including well-declared data in, data out, reset, start, and done
signals so that a wrapper can be built automatically around the original files.
Within the design flow, always a preliminary wrapper version will be generated. These
wrapper files are not the final ones because for generating the final ones a placement of
the modules is required. However, they approach the final version very well and help
to evaluate the area requirements of the modules. Hence, it is possible to estimate the
154
6.6 Experiment
Figure 6.14: Two Xilinx FPGA editor outputs of the triple DES application mapped to
the skeleton-centric platform of Fig. 6.12. Left: only P 2 of W 1 is loaded, right: P 1,
P 2, and P 4 are loaded.
155
6 Algorithmic Skeletons for Dynamic Reconfiguration
input
Slot 1
Emitter/
Collector
Slot 2
Worker
Slot 3
Worker
Storage
output
all_data_
bus
id_bus
data_bus store_bus
data_request
store_request store
workers_
finished
send_data
data_receive
Figure 6.15: Detailed view of the farm circuit [Fra07]
number of slices (CLBs) required of every module using the Xilinx XST tool or similar
synthesis tools.
Based on these estimations, a scheduling and mapping can take place relying on so-
called patterns. These are skeletal execution environments designed by experts, which
give placement suggestions for reconfigurable regions and communication. They thereby
only give communication guarantee—wires can be routed. The design environment
finally decides on the communication.
In a pattern database designed within the thesis, a couple of patterns are stored
for several FPGA types. An algorithm selects the best pattern for the design of the
application designer and maps the tasks. The optimization aim is the overall completion
time of all tasks. Based on the used skeletons it is possible to extract a good placement
of the jobs on the FPGA.
For the eventual generation of the bitstreams, firstly final wrappers based on the
location of the tasks are generated. Then, the communication infrastructure (Xilinx
busmacros), the placement of the reconfigurable regions (according to the Xilinx Early
Access Design Flow), and the top file a generated.
Moreover, a tool—PaReToFAS (Partial Reconfiguration Tool for FPGAs using Algo-
rithmic Skeletons)—was implemented, which provides a design environment that com-
bines the possibility to design patterns and to define the applications as a well-described
combination of VHDL modules. To select the best pattern for the application, Frank
has developed a genetic and a simulated annealing algorithm.
156
6.7 Lesson Learned
6.7 Lesson Learned
Algorithmic skeletons are a sophisticated concept to push reconfigurable computing into
a new dimension. Basically, they allow us to raise the level of abstraction to meet the re-
quirements of application engineers, while also constraining the physical implementation
complexity of execution environments for reconfigurable fabrics. We can thus achieve
sophisticated results concerning both, ease of development and well-performance. The
key factor is the inherent separating of the structure from the computation itself. The
structure given can be exploited for distributing applications into the dimensions time
and space in a systematic manner.
In this chapter, we have shown the fundamental applicability of algorithmic skeletons
for reconfigurable computing. We basically could see that temporarily and partially
adaptable—run-time reconfigurable—FPGAs are close to the concept of parallel ma-
chines, as the nodes—reconfigurable regions—can change their behavior and thereby
their contribution to the overall application over time. Hence, we investigated several
concepts how algorithmic skeletons can match the reconfigurable computing domain.
We considered stream parallelism in more detail, which is close to many application
fields of reconfigurable systems.
To implement the concept of algorithmic skeletons, we have also realized that suit-
able execution environments are of great help. Such environments should be combined
with a reconfiguration manager forming a close unity. The reconfiguration manager
then is responsible for dispatching and rejecting applications. On one hand, as these
applications are given by virtue of algorithmic skeletons, the selection and scheduling is
reduced. On the other hand, however, the reconfiguration manager becomes a key part
of the concept as its competence decides on the acceptance and if so also performance
of the applications on the execution environment. In this thesis, we have classified the
execution environments into tile-based,skeleton-centric, and application-centric, which
each facilitate to explore a different degree of flexibility.
Within the chapter, we could identify two major benefits of using algorithmic skele-
tons for the design of reconfigurable systems, which are hardly covered by other design
methods so far. First, the evaluation of the design on a very high level of abstraction can
be achieved by virtue of algorithmic skeletons, as they allow for exploration of designs
in a structured manner. In particular, if the performance of an execution environment is
known for a set of skeletons, we can estimate the performance of applications that match
one or more of these skeletons if they are executed on the environment. Moreover, for
the application-centric execution environment, we have shown how an offline mapping
technique can be achieved, to derive and evaluate designs on a high level of abstraction.
Secondly, the concept of algorithmic skeletons for reconfigurable systems facilitates
to achieve a true dynamic reconfiguration, in the sense that applications, which are
not know at the design time of the execution environment, can be executed on the
fabric. Applications are given in structured manner through instances of skeletons, thus
also dynamic executable on a fabric. Here, the skeleton approach is very sophisticated
and helps to manage and deal with a lot of challenges of reconfigurable computing
design. For example, the execution environment provides generic communication that
157
6 Algorithmic Skeletons for Dynamic Reconfiguration
is addressed by instances of skeletons. Moreover, the given reconfigurable regions of the
environments refer to skeletons and therefore constrain the mapping of applications. We
thereby avoid fragmentation and still achieve a good performance, as the environment
has been designed for applications that come described as skeletons in mind.
In general, the research on using algorithmic skeletons for reconfigurable computing
still is in its infancy. In this work, we therefore have shown the approach of applying
algorithmic skeletons to reconfigurable computing design conceptually. Among others,
the best way how to formally describe a design using algorithmic skeletons remains open.
Despite first steps using a descriptive language, the complete integration of algorithmic
skeletons into a design language has to be solved. SystemC might be a natural choice,
as algorithmic skeletons can be introduced to this language by creating a specific library.
Additionally, we can implement the idea of algorithmic skeletons on top of our Part-E
design environment (see Appendix A).
To conclude, algorithmic skeletons are in line with the requirement of raising the level
of abstraction for the design of reconfigurable systems. The core benefits of the approach
are the main motivation points of algorithmic skeletons: programmability, portability,
and performance.
6.8 Related Work
In the literature, we find some works on designing reconfigurable systems on a higher
level of abstraction than hardware description languages (HDLs). Most of these works
do not target partial run-time reconfigurable systems. Additionally, the models proposed
assume the designer to have reasonable knowledge of the system under development.
The work of DeHon et al. [DAD+04] on design patterns for reconfigurable systems is a
sophisticated approach on providing canonical solutions to common and recurring design
challenges of reconfigurable systems and applications. The authors intend on providing
a mean to crystallize out common challenges of reconfigurable system design and the
typical solutions. However, their work focuses more on providing a layer of abstraction
to the reconfigurable systems community than to application engineers.
Some years earlier, SCORE (Stream Computations Organized for Reconfigurable Ex-
ecution) was proposed in [CCH+00]. The approach focusses on providing a unifying
computation model to abstract away the fixed resource limits of devices. Therefore,
the resources are virtualized, which can ease the development and deployment of re-
configurable applications. Again, the addressees of the SCORE approach are mainly
reconfigurable computing engineers.
Modern languages for embedded systems like SystemVerilog or SystemC also aim at
raising the level of abstraction. These approaches can be seen as extended HDLs that
introduce design principles to the hardware world, such as re-use, polymorphism, etc.
For example, SystemC as language to model dynamic reconfigurable hardware is used in
[APC03]. However, the languages are often used for simulation only and the generation
of executable code is challenging. Nevertheless, these languages make up a good starting
point for future research.
158
6.9 Summary
[BSC+00] describes a Matlab compiler that targets heterogeneous systems, including
FPGAs. User functions become a process in VHDL. The system also targets multiple
FPGAs, however assuming the SPMD style of programming.
Finally, in low-level hardware design, [BC04] focus on a high-level hardware descrip-
tion called hardware skeletons. Considering the idea of separation of structure from the
algorithm, this approach is closest to our work. Moreover, the authors motivate their
work similar to us—an increase of abstraction in order to open the field of hardware
design to a broader audience. Basically, they derive their ideas from FPGA based image
processing. They present an approach for developing a general framework for FPGA
based image processing. This framework is based on a library of hardware skeletons.
However, the amount of skeletons is very limited and they are still very low-level and will
often be too far away from algorithm designers. Moreover, we do not find the paradigm
of reconfigurability in their work.
6.9 Summary
In this chapter, we have introduced algorithmic skeletons for dynamic reconfigurable
computing. Algorithmic skeletons separate structure from the behavior of an algorithm.
By providing a library of skeletons to implement applications for reconfigurable systems,
we can beneficially explore partial run-time reconfiguration of reconfigurable fabrics.
Therefore, solutions—hardware implementations—of the skeletons are applied to various
applications. We have discussed the important aspect execution environment, which
should come with an appropriate reconfiguration manager specifically implemented for
each environment. As a first introduction to this approach, we have considered stream
parallelism paradigms including their composition in some more detail. We have also
discussed concepts how dynamic reconfiguration can be achieved by virtue of algorithmic
skeletons. In general, the approach is a hopeful mean to provide an interface between
the hardware platform (FPGA) and applications, raising the level of abstraction.
159
6 Algorithmic Skeletons for Dynamic Reconfiguration
160
7 Conclusion and Outlook
7.1 Conclusion
In this thesis, we have presented four design methods for exploiting reconfigurable fabrics
under the motivation of making reconfigurable systems mature.
The first design method investigated in detail was termed the two slot framework,
which basically offers two reconfigurable slots that can be executed and reconfigured al-
ternatingly. The architecture brings reconfigurable computing to its basics and therefore
served very well as introduction to the general requirements and challenges of exploiting
reconfigurable systems. Foremost, we derived the requirement to explicitly consider the
reconfiguration overhead within the design of reconfigurable computing systems. There-
fore, we introduced the separation of a task to be processed by a reconfigurable device
into reconfiguration (RT) and execution (EX ) phase, which form a close unity and both
must be included into any design method.
Despite their simplicity, two slot architectures by no means have so far become vis-
ible processing resources in modern designs. When integrating the architecture into a
framework, as was done within this thesis, we could identify additional challenges that
must be solved for a beneficial exploitation of the two slot architecture. Among them,
we discussed the intermodule communication, the partitioning of applications to match
the sequential execution requirement of the two slots, and the scheduling algorithm.
The two slot framework also is open for extensions. We introduced concepts of low
power design to the framework, exploiting multiple frequency domains as given on mod-
ern FPGAs. Furthermore, we discussed the possibility to include the two slot framework
as an IP core into a processor-based system, offering hardware virtualization for SoCs.
In the second method, we broadened the reconfigurable architectures covered to in-
clude multi-slot environments also comprising heterogeneous slots. We therefore ex-
tended a high level design method for embedded systems, which so far excluded partially
run-time reconfigurable fabrics from its processing resources. We showed how partial
reconfigurability can be included into this so-called specification graph approach. Ba-
sically, we added the reconfiguration phases to the task graph and the reconfiguration
port to the architecture graph. To prevent overwriting of reconfigurations whose corre-
sponding tasks are delayed for execution due to precedence constraints, we introduced
life periods. Life periods can be seamlessly integrated into the scheduling algorithm of
the design space exploration of the specification graph approach.
The approach is of value for the so-called platform-based design, where applications
and execution environment meet in the middle. The seamless integration of partially run-
161
7 Conclusion and Outlook
time reconfigurable systems introduced by us helps to extend the processing resources
and exploit adaptability. Moreover, the specification graph approach offers design space
exploration on a high level of abstraction, therefore attracting application designers of
multiple domains. By virtue of our extensions, these designers can make use of partial
reconfigurability without having to learn a new design method.
The third method we discussed was the new and within this thesis developed concept
of real-time reconfiguration port scheduling. Based on a fixed execution environment,
we investigated real-time scheduling of tasks into homogeneous slots particularly includ-
ing the single and mutually exclusive reconfiguration port. As the characteristics of the
port resemble the conditions of mono-processor scheduling, we applied well-researched
mono-processor scheduling algorithms to the domain of real-time execution of tasks on
reconfigurable systems based on the fact that these tasks all have to pass the single recon-
figuration port. We showed several scheduling algorithms for the reconfiguration phase,
including the new concept of preempting reconfiguration phases in order to improve
schedulability. The algorithms base on the newly introduced deadline d, which denotes
the latest finishing time of a reconfiguration phase in order to be able to guarantee a
task to meet its deadline d.
To derive the basic requirements of the reconfiguration port scheduling, we first
investigated aperiodic task sets having synchronous as well as asynchronous arrival
times of tasks. We thereby showed the fundamental applicability of the reconfigu-
ration port scheduling concept to real-time task execution on reconfigurable fabrics.
Two conditions—the full load of slots (fls) and the full reconfiguration capacity (frc)
situation—additionally arise. We showed how to include them into the scheduling algo-
rithms EDF and EDD, and derived guarantee tests respecting fls and frc.
We then introduced a solution for periodic task sets that allows us to rate task sets
concerning their schedulability only based on the parameters of the tasks. To cover
the two conditions fls and frc, which also occur during periodic task scheduling, we
introduced a virtually executing server to include fls conditions and a resource access
protocol for frc scenarios. Both can be included into the schedulability test.
Moreover, we investigated caching techniques for the reconfiguration port scheduling.
Caching means the avoidance of the reconfiguration phase and therefore is particulary
beneficial for the overall performance of a task set, as it helps to reduce the response
times. We derived four caching methods, two for offline manipulation of the task set
and two for online optimization of the reconfiguration port occupancy. By caching, we
can reduce the reconfiguration time required for an instance of a task on average. We
therefore can also achieve schedulability of task sets, whose scheduling demand of the
reconfiguration port exceeds 1.
In general, real-time scheduling of tasks on reconfigurable fabrics is of value for the
domain of reactive systems. Here tasks constantly interact with the environment, often
requiring real-time behavior. Thanks to its closeness to the well-researched domain of
mono-processor scheduling, our approach offers a sophisticated and lightweight method
to exploit real-time scheduling on slot-based execution environments implemented on
reconfigurable fabrics.
162
7.1 Conclusion
Finally, we introduced algorithmic skeletons as a new and very sophisticated means
to design reconfigurable systems. Based on the conceptual closeness of reconfigurable
devices to parallel clusters, algorithmic skeletons can be applied to partially run-time
reconfigurable devices. Algorithmic skeletons basically separate the structure from the
computation itself, and thereby offer a implementation guideline to design parallelism.
We then can employ the given parallelism to execute applications in time and space on
reconfigurable devices in a structured manner.
On the reconfigurable fabrics we provide execution environments that accept appli-
cations given by virtue of algorithmic skeletons. The design of the environments is
constraints by a well-selected set of skeletons that is supported by the environment. We
discussed how different levels of granularity of the design environments can be achieved,
thereby denoting that the environment and a reconfiguration manager, which supervises
the environment, should be designed in close unity.
The approach still is in its infancy. Nevertheless, we already could show two major
benefits of employing algorithmic skeletons for the design of reconfigurable systems.
Firstly, applications given by virtue of algorithmic skeletons facilitate to explore the
design on a very high level of abstraction in a structured manner. Secondly, we can
achieve dynamic reconfiguration, as algorithmic skeletons offer to become the integrating
means of applications and reconfigurable fabrics. Reconfigurable fabrics can accept
applications, which are not known at the design time of the execution environment, if the
applications are provided using the supported skeletons of the execution environment.
The four methods presented in this work all raise the level of abstraction for the
design of reconfigurable computing systems. Moreover, we embedded them into a generic
layered approach that allows us to target the different design challenges separately. We
consider several layers in between applications and the reconfigurable fabric. In common
for all methods, a reconfiguration environment constraints the reconfigurable fabric and
offers reconfigurable regions, which dynamically accept tasks. The last method based
on algorithmic skeletons thereby exploits execution environments in a very flexible way,
also allowing dynamic reconfiguration.
The temporal domain given for reconfigurable computing can be conveniently cov-
ered by a scheduling layer. Scheduling of applications to execution environments then
means execution in time and space. By virtue of reconfiguration, hardware is re-used.
Within the chapter on the two slot framework, we could see that even such a simple
reconfigurable architecture demands for scheduling, at least a proper temporal parti-
tioning where partitions respect precedence constraints. The newly introduced concept
of reconfiguration port scheduling then showed how to exploit real-time scheduling on
run-time reconfigurable execution environments hosted on reconfigurable fabrics.
Besides these considerations on time and space, we also covered the important problem
of external and internal communication on reconfigurable fabrics that dynamically host
tasks in different locations. For example, we selected the specification graph approach, as
communication is seamlessly integrated by dedicated communication vertices using this
method. Moreover, our methods all consider the reconfiguration time, which is hardly
done by others. As could be derived by several experiments, the reconfiguration phase
163
7 Conclusion and Outlook
accounts for a significant impact on the temporal behavior of run-time reconfiguration.
For the two-slot framework, we therefore investigated concepts for frequency scaling
balancing reconfiguration and execution phase. The reconfiguration port scheduling
even particularly exploits the reconfiguration phase to guarantee real-time behavior.
However, developing reconfigurable systems still remains a challenging task in many
aspects. Among others, it is difficult because of the need for both software and hardware
expertise to determine how best map applications onto reconfigurable fabrics. Never-
theless, we consider our work as a building block that particularly focused on raising
the level of abstraction, towards making reconfigurable computing mature.
7.2 Outlook
The four methods presented bear various room for extensions. First of all, the two slot
framework should be finally industrialized along the idea of providing it as an IP core.
Moreover, additional extensions, for example, the presented concepts on reducing the size
of the partial bitstreams by comparing the difference between subsequent configurations,
could be particularly investigated for the simple concept of only having two slots.
Concerning the specification graph approach, we have fundamentally added the re-
quirements for partially reconfigurable devices. Thereby, the scheduling algorithm,
which we have implemented to prove the applicability, offers potential for improve-
ment. Moreover, the specification graph approach initially targets at hardware/software
codesign. A more detailed evaluation of the method including our extensions therefore
would be of high interest.
For the reconfiguration port scheduling approach, we have introduced algorithms for
aperiodic as well as periodic task sets. To complete the domain covered, we should inves-
tigate additional task sets including the applicability of the mono-processor scheduling
algorithms. Among others, periodic tasks having dynamic priorities or task sets com-
prising precedence constraints are of interest. Moreover, scheduling concepts for task
sets having periodic as well as aperiodic task requests can be investigated.
Finally, algorithmic skeletons, which we mainly introduced conceptually, bear the
most potential for future research. The design of applications by virtue of algorithmic
skeletons allows us to achieve dynamic reconfiguration in a systematic manner, which
should be applied to multiple experiments. Moreover, the idea of also adapting the
execution environment during run-time, reacting on requests of applications that are
given by virtue of skeletons, should be investigated in more detail. Concerning the
usage of the method for high performance computing, the application engineers, who
usually restrain from hardware, must be convinced to employ reconfigurable devices.
Algorithmic skeletons therefore must be offered in a manner the application engineers
trust. Using SystemC and a library for algorithmic skeletons may be a good start. To
eventually improve the data transmission from and to FPGAs, which is particularly
required for the algorithmic skeleton approach, for example the Xilinx Virtex-5 front
side bus approach offers a sophisticated possibility.
164
A The Design Tool Part-E
Due to the lack of an appropriate design tool for the design of reconfigurable systems that
exploits partial run-time reconfiguration, we developed a tool that helps to abstract the
reconfigurable systems under development. Among others, the tool also was motivated
by providing a vendor independent design environment.
However, as Xilinx FPGAs are the only reasonable devices for exploiting partial run-
time reconfiguration, the tool mainly targets on these devices. In particular, the tool
automates the generation of partial bitstreams of Xilinx FPGAs, easing the otherwise
often cumbersome invocation of synthesize commands.
Part-Y
The precursor of the current tool Part-E thereby was Part-Y (see Fig. A.1 for a screen-
shot) implemented by us in Java. The design of Part-Y was mainly motivated for
providing a user-friendly means to ease the design of partial bitstreams of Xilinx FP-
GAs. The tool based on the modular design flow of the Xilinx Application Note 290
[Xil04] and was particularly developed to combine all the necessary information for a
final bitstream generation of such a design. This core idea of combining all information
in one model still is in use in the current Part-E tool. Figure A.2 displays the reduced
class diagram of this model. The user builds an instance of this model by gradually pro-
viding necessary information that is required for the design of partial bitstreams. Once
complete, automatic generation of partial bitstreams according to the Xilinx Application
Note 290 could be invoked.
As partially reconfigurable designs require a combined consideration of time and space
characteristics, the initial tool offered various views to the same model, including the
possibility for consistent modification of the design from all views. The overall guideline
for such structuring the different views was the Y-chart by Gajski and Kuhn [GK83].
Following the core ideas of the Y-chart, we sorted characteristics of the design into
behavior, structure and geometry. For example, area/slot assignment was part of a
geometrical view, dependency of and between modules could be defined in a structural
oriented view, while the loading sequence of modules was part of the behavioral view.
Thereby, also the design entry points were left open to the designer. More details on
the tool can be found in [DRS05, DR06], [Bob07].
Part-E
To increase the maintainability of Part-Y, we decided to merge the tool into the Eclipse
framework and offer it as a plug-in [Sch07]. In particular, Eclipse allowed us to abstract
from fundamental details like loading and storing of designs. Moreover, to also exploit
modern programming and modeling techniques of software engineering, we employed the
Eclipse Graphical Modeling Framework (GMF) as framework for Part-E. GMF provides
165
A The Design Tool Part-E
Figure A.1: Screenshot of Part-Y
design and code generation facilities for the creation of graphical editors based on an
Eclipse Modeling Framework (EMF) model. Therefore, the programmer generates basic
tool definitions and graphical representations of classes. Both are mapped together in a
model, which then allows for generation of Java code of an editor.
For Part-E, we used an extension of the model depicted in Fig. A.2 to become the
input for the GMF. Based on this model, we then fundamentally created an editor, see
Fig. A.3. The editor facilitates to model reconfigurable designs. Modules, a top entity
and their interconnections are described graphically. Also the reconfigurable device and
its partitioning is provided in the same way. For describing the temporal domain, we
added the concept of sequences. A sequence denotes the binding of a set of modules
to a partitioning at a point in time. The graphical representation and design front-
end thereby offers a convenient way of manipulating and evaluating the design under
development. Thanks to the openness of GMF, we can easily customize this view, also
accepting textual input for designs too large for a graphical representation.
166
-x_dimension
-y_dimension
-isReconfigurable
Module
-x_coordinate
-y_coordinate
ModuleInstance
1
*
Top
10..n
UCF_PhysicalConstraints
1
1
TopHDL
ModuleHDL 11
11
-direction
IO
1
n
-orientation
-isExternal
IOInstance
1*
1n
Figure A.2: Reduced class diagram of the model.
Again the single model that represents the whole characteristics of the reconfigurable
system under development abstractly proved valuable. By accumulating the entire
knowledge of a design in one single instance facilitates to model reconfigurable systems
in a comprehensive manner on a high level of abstraction.
The tool enables to combine HDL-modules, which each represent a task to be executed
dynamically on the FPGA, to so-called top entities, which assemble and structure the
design including global logic, etc. Additionally, physical constraints like the position of
the tasks on the FPGA are necessary. The tool eases the generation of those physical
constraints, including the placement of inter-module connections. If necessary, wrapper
VHDL files are generated that facilitate to meet the specific requirements of partial
designs. The generation can easily be extended to meet future reconfigurable devices,
also of different vendors than Xilinx. Finally, the integrated call of automatically gen-
erated commands following the Xilinx EarlyAccess Design Flow enables the convenient
generation and downloading of the partial and full bitstreams.
A screenshot of Part-E is given in Fig. A.3. We depict a simple design that illustrates
most of the modeling features of Part-E. In detail, the model consists of one recon-
figurable and one static area. The reconfigurable area comprises of two modules onto
which tasks can be loaded alternatively. Moreover, we model the entire design, including
the connections between the modules and to the external connection via the top design
entity. The default Xilinx Bitfile Generator generates the complete VHDL for the design
including the wiring on the top level, synthesis and implements it and finally generates
the corresponding bitfiles based on the Xilinx EarlyAccess design flow [LBM+06].
In the meantime, several improvements could be added to the design environment.
Among others, the extension points of the Eclipse framework allow for virtually any
design flow and HDL parser, which is given as a plug-in. For example, we use a VHDL
parser to derive the information for external communication given in VHDL files. We
also use a plug-in to highlight the syntax of the input language. Moreover, the possibility
to execute the synthesize, mapping, place & route, and the final bitstream generation
remotely on several machines in parallel was implemented. Thereby, the partial design
flow can easily be parallelized, as the steps towards bitstream generation must be done
individually for every partially reconfigurable module.
167
A The Design Tool Part-E
Figure A.3: Screenshot of Part-E
To summarize, Part-E basically focuses on automated generation of partial run-time
reconfigurable designs, particularly including the generation of (partial) bitstreams.
Therefore, Part-E takes raw HDL input files and generates wrappers based on the graph-
ically given information of the structure and behavior of the design under development.
Primarily, it wraps the obstacles of partial bitstream generation. Part-E thus was de-
veloped to provide an integrated means of designing partially reconfigurable systems.
The current version of Part-E can be accessed on Sourceforge: parte.sf.net.
168
Author’s Own Publications
[DB05] Florian Dittmann and Christophe Bobda. Temporal Placement on Mesh-
Based Coarse Grain Reconfigurable Systems Using the Spectral Method. In
Achim Rettberg, Mauro C. Zanella, and Franz Josef Rammig, editors, From
Specification to Embedded Systems Application, Proceedings of the IESS,
pages 301–310. Kluwer Academic Publishers, 15 - 17 August 2005.
[DF07a] Florian Dittmann and Stefan Frank. Caching in Real-time Reconfiguration
Port Scheduling. In Proceedings of the FPL 2007. IEEE, 27 - 29 August
2007.
[DF07b] Florian Dittmann and Stefan Frank. Hard real-time reconfiguration port
scheduling. In Proceedings of the Design, Automation and Test in Europe,
San Jose, CA, USA, 2007. EDA Consortium.
[DG06a] Florian Dittmann and Marcelo G¨
otz. Applying Single Processor Algorithms
to Schedule Tasks on Reconfigurable Devices Respecting Reconfiguration
Times. In 13th Reconfigurable Architectures Workshop (RAW 2006), 25 -
26 April 2006.
[DG06b] Florian Dittmann and Marcelo G¨
otz. Reconfiguration time aware processing
on fpgas. In In Proceedings of the Dagstuhl Seminar No06141 on Dynami-
cally Reconfigurable Architectures, 2006.
[DGR07] Florian Dittmann, Marcelo G¨
otz, and Achim Rettberg. Model and Method-
ology for the Synthesis of Heterogeneous and Partially Reconfigurable Sys-
tems. In 14th Reconfigurable Architectures Workshop (RAW 2007). IEEE,
25 - 26 March 2007.
[DH05] Florian Dittmann and Markus Heberling. Placement of Intermodule Con-
nections on Partially Reconfigurable Devices. In Proceedings of the 18th
Symposium on Integrated Circuits and Systems Design (SBCCI), pages 236–
241, 4 - 7 September 2005.
[DH06] Florian Dittmann and Tales Heimfarth. Clock Frequency Variation of Par-
tially Reconfigurable Systems. In 19th International Conference on Archi-
tecture of Computing Systems: Workshop Proceedings, 16 March 2006.
[Dit03] Florian Dittmann. Eine High Level Synthese f¨
ur eine neue synchrone bit-
serielle Pipeline-Datenfluss-Architektur. In Informatiktage 2003 - Fachwis-
169
Author’s Own Publications
senschaftlicher Informatik - Kongress. Gesellschaft f¨
ur Informatik in Zusam-
menarbeit mit der Computer Zeitung, Konradin-Verlagsgruppe, 7 - 8 Novem-
ber 2003.
[Dit05] Florian Dittmann. Efficient execution on reconfigurable devices using con-
cepts of pipelining. In Proceedings of the International Conference on Field
Programmable Logic and Applications (FPL), pages 717–718, 24 - 26 August
2005.
[Dit07] Florian Dittmann. Algorithmic skeletons for the programming of reconfig-
urable systems. In Proceedings of the SEUS 2007, 7 - 8 May 2007.
[DKR03] Florian Dittmann, Bernd Kleinjohann, and Achim Rettberg. Efficient Bit-
Serial Constant Multiplication for FPGAs. In Proceedings of the 11th NASA
Symposium VLSI Design, May 2003.
[DR04] Florian Dittmann and Achim Rettberg. A Self-Controlled And Dynamically
Reconfigurable Architecture. In Bernd Kleinjohann, Guang R. Gao, Her-
mann Kopetz, Lisa Kleinjohann, and Achim Rettberg, editors, Proceedings
of IFIP Working Conference on Distributed and Parallel Embedded Systems
(DIPES’04). Kluwer Academic Publishers, 23 - 26 August 2004.
[DR06] Florian Dittmann and Achim Rettberg. Design of Partially Reconfigurable
Systems: From Abstract Modeling to Practical Realization. In 1st Inter-
national Workshop on Reconfigurable Computing Education, 1 March 2006.
available at: http://isvlsi06.itiv.uni-karlsruhe.de/RCe-22.pdf.
[DRLZ04] Florian Dittmann, Achim Rettberg, Thomas Lehmann, and Mauro C.
Zanella. Invariants for Distributed Local Control Elements of a New Syn-
chronous Bit-Serial Architecture. In Proceedings of the Delta, 28 - 30 January
2004.
[DRS05] Florian Dittmann, Achim Rettberg, and Fabian Schulte. A Y-Chart Based
Tool for Reconfigurable System Design. In Workshop on Dynamically Re-
configurable Systems (DRS) 2005, 17 March 2005.
[DRS+07] Florian Dittmann, Franz J. Rammig, Martin Streub¨
uhr, Christian Haubelt,
Andreas Schallenberg, and Wolfgang Nebel. Exploration, partitioning and
simulation of reconfigurable systems. it - Information Technology, 7(3):149–
156, June 2007.
[DRW05] Florian Dittmann, Achim Rettberg, and Raphael Weber. Path concepts for
a reconfigurable bit-serial synchronous architecture. In Proceedings of the
2005 IFIP International Conference on Embedded And Ubiquitous Comput-
ing (EUC’2005), 6 - 9 December 2005.
170
Author’s Own Publications
[DRW06] Florian Dittmann, Achim Rettberg, and Raphael Weber. Towards the im-
plementation of path concepts for a reconfigurable bit–serial synchronous
architecture. In Proceedings of the 3rd International Conference on ReCon-
Figurable Computing and FPGA’s, pages 262–269. IEEE, September 2006.
[DRW07a] Florian Dittmann, Achim Rettberg, and Raphael Weber. Latency optimiza-
tion for a reconfigurable, self-timed and bit-serial architecture. In Proceedings
of the ERSA 2007, 25 - 28 June 2007.
[DRW07b] Florian Dittmann, Achim Rettberg, and Raphael Weber. Optimization tech-
niques for a reconfigurable self-timed and bit-serial architecture. In Proceed-
ings of the SBCCI 2007, 3 - 6 September 2007.
[GD06a] Marcelo G¨
otz and Florian Dittmann. Reconfigurable microkernel-based
RTOS: Mechanisms and methods for run-time reconfiguration. In Proceed-
ings of the 3rd International Conference on ReConFigurable Computing and
FPGAs 2006 (ReConFig’06), pages 12–19. IEEE, 2006.
[GD06b] Marcelo G¨
otz and Florian Dittmann. Scheduling reconfiguration activities of
run-time reconfigurable RTOS using an aperiodic task server. In Proceedings
of the ARC 2006, 1 - 3 March 2006.
[GDP06] Marcelo G¨
otz, Florian Dittmann, and Carlos E. Pereira. Deterministic mech-
anism for run-time reconfiguration activities in an RTOS. In Proceedings of
the 4th International IEEE Conference on Industrial Informatics (INDIN
2006). IEEE, 2006.
[GDX07] Marcelo G¨
otz, Florian Dittmann, and Tao Xie. Dynamic relocation of
hybrid tasks: A complete design flow. In Proceedings of Reconfigurable
Communication-centric SoCs (ReCoSoc’07), 18 - 20 June 2007.
[ID05] Stefan Ihmor and Florian Dittmann. Optimizing interface implementation
costs using runtime reconfigurable systems. In Toomas Plaks, editor, Pro-
ceedings of the 2005 International Conference on Engineering of Reconfig-
urable Systems and Algorithms (ERSA’05), pages 85–91, Monte Carlo Re-
sort, Las Vegas, Nevada, USA, 26 - 30 June 2005.
[RDLZ04] Achim Rettberg, Florian Dittmann, Thomas Lehmann, and Mauro C.
Zanella. A New High-Level Synthesis Approach of a Synchronous Bit-Serial
Architecture. In Dominik Stoffel and Wolfgang Kunz, editors, Methoden und
Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen
und Systemen, pages 34 43. Shaker Verlag, 2004.
[RDZL03] Achim Rettberg, Florian Dittmann, Mauro C. Zanella, and Thomas
Lehmann. Towards a High-Level Synthesis of Reconfigurable Bit-Serial Ar-
chitectures. In Proceedings of the 16th Symposium on Integrated Circuits and
System Design (SBCCI), 8 - 11 September 2003.
171
Author’s Own Publications
[RDZL04] Achim Rettberg, Florian Dittmann, Mauro C. Zanella, and Thomas
Lehmann. MACT A Reconfigurable Pipeline Architecture. In Technology
Siemens ICM MP CTO TI and Inovation, editors, Technologies-to-Watch,
No. 21, pages 15 17, M¨
unchen, Germany, August 2004. For internal use
only.
[WD06] Alexander Warkentin and Florian Dittmann. Data Transfer Protocols for
a Two Slot Based Reconfigurable Platform. In Proceedings of the Reconfig-
urable Communication–centric SoCs (ReCoSoC), 2006.
172
Bibliography
[ABH+94] Andreas Ast, J¨
urgen Becker, Reiner W. Hartenstein, Rainer Kress, Helmut
Reinig, and Karin Schmidt. Data-Procedural Languages for FPL-based
Machines. In Proceedings of the 4th International Workshop on Field-
Programmable Logic and Applications, pages 183–195. Springer-Verlag,
1994.
[ABI06] John Augustine, Sudarshan Banerjee, and Sandy Irani. Strip packing with
precedence constraints and strip packing with release times. In SPAA ’06:
Proceedings of the eighteenth annual ACM symposium on Parallelism in
algorithms and architectures, pages 180–189, New York, NY, USA, 2006.
ACM Press.
[ABK+04] Ali Ahmadinia, Christophe Bobda, Dirk Koch, Mateusz Majer, and J¨
ur-
gen Teich. Task scheduling for heterogeneous reconfigurable computers.
In SBCCI ’04: Proceedings of the 17th symposium on Integrated circuits
and system design, pages 22–27, Pernambuco, Brazil, 2004. ACM Press.
[ABR+93] A. N. Audsley, A. Burns, M. Richardson, K. Tindell, and A. J. Wellings.
Applying new scheduling theory to static priority pre-emptive scheduling.
Software Engineering Journal, 8:284–292, 1993.
[ABRW91] N. C. Audsley, A. Burns, M. F. Richardson, and A. J. Wellings. Hard Real-
Time Scheduling: The Deadline Monotonic Approach. In Proceedings 8th
IEEE Workshop on Real-Time Operating Systems and Software, Atlanta,
1991.
[ABT04] Ali Ahmadinia, Christophe Bobda, and J¨
urgen Teich. A Dynamic Schedul-
ing and Placement Algorithm for Reconfigurable Hardware. In Werner
Grass, Bernhard Sick, and Klaus Waldschmidt, editors, ARCS, volume
3894 of Lecture Notes in Computer Science, pages 125–139. Springer, 2004.
[Ama06] Hideharu Amano. A survey on dynamically reconfigurable processors.
IEICE Transactions on Communications, E89-B:3179–3187, 2006.
[APC03] Kostas Masselos Antti Pelkonen and Miroslav Cup´ak. System-Level Mod-
eling of Dynamically Reconfigurable Hardware with SystemC. Proceedings
of International Symposium on Parallel and Distributed Processing (Re-
configurable Architectures Workshop), pages 174–181, April 2003.
173
Bibliography
[AS93] Peter M. Athanas and Harvey F. Silverman. Processor reconfiguration
through instruction-set metamorphosis. Computer, 26(3):11–18, 1993.
[Atm02] Atmel. FPSLIC on-chip partial reconfiguration of the embedded AT40K
FPGA. Technical report, San Jose, CA, USA, 2002.
[AW02] Amir H. Abdekhodaee and Andrew Wirth. Scheduling parallel machines
with a single server: some solvable cases and heuristics. Comput. Oper.
Res., 29(3):295–315, 2002.
[AWG06] Amir H. Abdekhodaee, Andrew Wirth, and Heng-Soon Gan. Scheduling
two parallel machines with a single server: the general case. Comput.
Oper. Res., 33(4):994–1009, 2006.
[BAK96] Duncan A. Buell, Jeffrey M. Arnold, and Walter J. Kleinfelder. Splash 2:
FPGAs in a custom computing machine. IEEE Computer Society Press,
Los Alamitos, California, USA, 1996.
[BAM+05] Christophe Bobda, Ali Ahmadinia, Mateusz Majer, J¨
urgen Teich, an-
dor P. Fekete, and Jan van der Veen. Dynoc: A dynamic infrastructure
for communication in dynamically reconfigurable devices. In Proceedings
of the FPL, pages 153–158, 2005.
[BAR+05] Christophe Bobda, Ali Ahmadinia, Kurapati Rajesham, Mateusz Majer,
and Adronis Niyonkuru. Partial Configuration Design and Implementation
Challenges on Xilinx Virtex FPGAs. In Uwe Brinkschulte, J¨
urgen Becker,
Dietmar Fey, Christian Hochberger, Thomas Martinetz, Christian M¨
uller-
Schloer, Hartmut Schmeck, Theo Ungerer, and Rolf P. W¨
urtz, editors,
ARCS Workshops, pages 61–66. VDE Verlag, 2005.
[BBD05] Sudarshan Banerjee, Elaheh Bozorgzadeh, and Nikil Dutt. Physically-
aware HW-SW partitioning for reconfigurable architectures with partial
dynamic reconfiguration. In DAC ’05: Proceedings of the 42nd annual
conference on Design automation, pages 335–340, San Diego, California,
USA, 2005. ACM Press.
[BBHN04] Brandon Blodget, Christophe Bobda, Michael H¨
ubner, and Adronis Niy-
onkuru. Partial and Dynamically Reconfiguration of Xilinx Virtex-II FP-
GAs. In J¨
urgen Becker, Marco Platzner, and Serge Vernalde, editors,
Proceedings of the FPL 2004, volume 3203 of Lecture Notes in Computer
Science, pages 801–810. Springer, 2004.
[BC04] Khaled Benkrid and Danny Crookes. From application descriptions to
hardware in seconds: a logic-based approach to bridging the gap. IEEE
Trans. VLSI Syst., 12(4):420–436, 2004.
174
Bibliography
[BD98] Gordon J. Brebner and Adam Donlin. Runtime Reconfigurable Routing.
In IPPS/SPDP Workshops, pages 25–30, Orlando, Florida, USA, 1998.
[BDD+99] Kiran Bondalapati, Pedro C. Diniz, Phillip Duncan, John Granacki, Mary
Hall, Rajeev Jain, and Heidi Ziegler. DEFACTO: A Design Environment
for Adaptive Computing Technology. In Parallel and Distributed Process-
ing. 11th IPPS/SPDP Workshops held in conjunction with the 13th Inter-
national Parallel Processing Symposium and 10th Symposium on Parallel
and Distributed Processing., pages 570–578. Springer Verlag, 1999.
[BDFK+02] Peter Brucker, Clarisse Dhaenens-Flipo, Sigrid Knust, Svetlana A.
Kravchenko, and Frank Werner. Complexity results for parallel machine
problems with a single server. Journal of Scheduling, 5:429–457, 2002.
[BDT99] Eric Bonabeau, Marco Dorigo, and Guy Theraulaz. Swarm Intelligence:
From Natural to Artificial Systems. Oxford University Press, New York,
NY, 1999.
[BEGGK07] Duncan Buell, Tarek El-Ghazawi, Kris Gaj, and Volodymyr Kindratenko.
High-performance reconfigurable computing. IEEE Computer, 40(3),
March 2007. Guest Editors’ Introduction.
[BGD+04] Nikhil Bansal, Sumit Gupta, Nikil Dutt, Alex Nicolau, and Rajesh Gupta.
Network Topology Exploration of Mesh-Based Coarse-Grain Reconfig-
urable Architectures. In Proceedings of the DATE, February 2004.
[BHH+07] J¨
urgen Becker, Michael H¨
ubner, Gerhard Hettich, Rainer Constapel,
Joachim Eisenmann, and J¨
urgen Luka. Dynamic and partial fpga ex-
ploitation. In Proceedings of the IEEE, volume 95, pages 438–452, Febru-
ary 2007.
[BJRK+03] Brandon Blodget, Philip James-Roxby, Eric Keller, Scott McMillian, and
Prasanna Sundararajan. A Self-reconfiguring Platform. In Proceedings
of the International Conference on Field Programmable Logic, September
2003.
[BK96] George Horatiu Botorog and Herbert Kuchen. Efficient Parallel Program-
ming with Algorithmic Skeletons. In Euro-Par ’96: Proceedings of the
Second International Euro-Par Conference on Parallel Processing, pages
718–731, London, UK, 1996. Springer-Verlag.
[BLPG01] J¨
urgen Becker, Nicolas Liebau, Thilo Pionteck, and Manfred Glesner. Ef-
ficient Mapping of Pre-synthesized IP-Cores onto Dynamically Reconfig-
urable Array Architectures. In Proceedings of the 11th International Con-
ference on Field-Programmable Logic and Applications, pages 584–589.
Springer-Verlag, 2001.
175
Bibliography
[BMA+05] Christophe Bobda, Mateusz Majer, Ali Ahmadinia, Thomas Haller, An-
dr´e Linarth, J¨
urgen Teich, and Jan van der Veen. The Erlangen Slot
Machine: A Highly Flexible FPGA-Based Reconfigurable Platform. In
Proceedings of the 13th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines, pages 319–320. IEEE, 2005.
[Bob03] Christophe Bobda. Synthesis of Dataflow Graphs for Reconfigurable Sys-
tems using Temporal Partitioning and Temporal Placement. PhD thesis,
University Paderborn, Heinz Nixdorf Institute, 2003.
[Bob07] Christophe Bobda. Introduction to Reconfigurable Computing: Architec-
tures, Algorithms and Applications. Springer, 2007.
[Bon02] Kiran Bondalapati. Modeling and Mapping for Dynamically Reconfigurable
Architectures. PhD thesis, University of Southern California, 2002.
[BRV89] P. Bertin, D. Roncin, and J. Vuillemin. Introduction to programmable
active memories. pages 301–309. Prentice-Hall, Inc., Upper Saddle River,
NJ, USA, 1989.
[BSC+00] P. Banerjee, N. Shenoy, A. Choudhary, S. Hauck, C. Bachmann, M. Hal-
dar, P. Joisha, A. Jones, A. Kanhare, A. Nayak, S. Periyacheri, M. Walk-
den, and D. Zaretsky. A matlab compiler for distributed, heterogeneous,
reconfigurable computing systems. In FCCM ’00: Proceedings of the 2000
IEEE Symposium on Field-Programmable Custom Computing Machines,
page 39, Washington, DC, USA, 2000. IEEE Computer Society.
[BTT98] T. Blickle, J. Teich, and L. Thiele. System-level synthesis using Evolution-
ary Algorithms. J. Design Automation for Embedded Systems, 3(1):23–58,
January 1998.
[But97] Girogio C. Buttazzo. Hard Real Time Computing Systems: Predictable
Scheduling Algorithms and Applications. Kluwer Academic Publishers,
1997.
[But04] Girogio C. Buttazzo. Hard Real Time Computing Systems: Predictable
Scheduling Algorithms and Applications. Springer, 2004.
[BW06] Andrew Butterfield and Jim Woodcock. A hardware compiler” semantics
for handel-c. Electr. Notes Theor. Comput. Sci., 161:73–90, 2006.
[CCEBM04] Ewerson Carvalho, Ney Calazans, ao Eduardo Bri and Fernando Moraes.
PaDReH - A Framework for the Design and Implementation of Dynami-
cally and Partially Reconfigurable Systems. In SBCCI ’04: Proceedings of
the 17th symposium on Integrated circuits and system design, pages 10–15,
New York, NY, USA, 2004. ACM Press.
176
Bibliography
[CCH+00] Eylon Caspi, Michael Chu, Randy Huang, Joseph Yeh, John Wawrzynek,
and Andr´e DeHon. Stream computations organized for reconfigurable ex-
ecution (score). In FPL ’00: Proceedings of the The Roadmap to Reconfig-
urable Computing, 10th International Workshop on Field-Programmable
Logic and Applications, pages 605–614, London, UK, 2000. Springer-
Verlag.
[CCKH00] Kathrin Compton, James Cooley, Stephen Knol, and Scott Hauck. Con-
figuration Relocation and Defragmentation for FPGAs. In IEEE Sym-
posium on Field-Programmable Custom Computing Machines, pages 279–
280, Napa Valley, CA, USA, April 2000.
[CDF+86] W. S. Carter, K. Duong, R. H. Freeman, H. C. Hsieh, J. Y. Ja, J. E.
Mahoney, L. T. Ngo, and S. L. Sze. A User Programmable Reconfigurable
Gate Array. In IEEE Customs Circuits Conference, pages 233–235, 1986.
[CH02] Katherine Compton and Scott Hauck. Reconfigurable Computing: A Sur-
vey of Systems and Software. ACM Computing Surveys, 34(2):171–210,
June 2002.
[CKRB03] Chen Chang, Kimmo Kuusilinna, Brian Richards, and Robert W. Broder-
sen. Implementation of bee: a real-time large-scale hardware emulation
engine. In FPGA ’03: Proceedings of the 2003 ACM/SIGDA eleventh in-
ternational symposium on Field programmable gate arrays, pages 91–99,
New York, NY, USA, 2003. ACM Press.
[Cla07] Peter Clarke. Tilera starts shipping 64-way multiprocessor. EE Times,
August 2007.
[CLC+02] Katherine Compton, Zhiyuan Li, James Cooley, Stephen Knol, and Scott
Hauck. Configuration Relocation and Defragmentation for Run-Time Re-
configurable Computing. IEEE Trans. Very Large Scale Integr. Syst.,
10(3):209–220, 2002.
[CMS06] Christopher Claus, Florian Helmut M¨
uller, and Walter Stechele. Combit-
gen: A new approach for creating partial bitstreams in Virtex-II Pro.
In Wolfgang Karl, J¨
urgen Becker, Karl-Erwin Großpietsch, Christian
Hochberger, and Erik Maehle, editors, ARCS Workshops, volume 81 of
LNI, pages 122–131. GI, 2006.
[Col89] Murray I. Cole. Algorithmic Skeletons: Structured Management of Paral-
lel Computation. Pitman/The MIT Press, London, UK/Cambridge, Mas-
sachusetts, USA, 1989.
[Col04] Murray Cole. Bringing skeletons out of the closet: a pragmatic manifesto
for skeletal parallel programming. Parallel Comput., 30(3):389–406, 2004.
177
Bibliography
[CSR+99] Paul Chow, Soon Ong Seo, Jonathan Rose, Kevin Chung, Gerard aez-
Monz´on, and Immanuel Rahardja. The design of a sram-based field-
programmable gate array - part II: circuit design and layout. IEEE Trans.
Very Large Scale Integr. Syst., 7(3):321–330, 1999.
[CWB05] Chen Chang, John Wawrzynek, and Robert W. Brodersen. Bee2: A high-
end reconfigurable computing system. IEEE Des. Test, 22(2):114–125,
2005.
[CWG+98] Srihari Cadambi, Jeffrey Weener, Seth Copen Goldstein, Herman Schmit,
and Donald E. Thomas. Managing Pipeline-Reconfigurable FPGAs. In
Proceedings of the 1998 ACM/SIGDA sixth international symposium on
Field programmable gate arrays, pages 55–64. ACM Press, 1998.
[CZMS07] Christopher Claus, Johannes Zeppenfeld, Florian M¨
uller, and Walter
Stechele. Using partial-run-time reconfigurable hardware to accelerate
video processing in driver assistance system. In DATE ’07: Proceedings of
the conference on Design, automation and test in Europe, pages 498–503,
San Jose, CA, USA, 2007. EDA Consortium.
[DAD+04] Andr´e DeHon, Joshua Adams, Michael DeLorimier, Nachiket Kapre, Yuki
Matsuda, Helia Naeimi, Michael C. Vanier, and Michael G. Wrighton.
Design patterns for reconfigurable computing. In FCCM, pages 13–23,
2004.
[Dan01] Marco Danelutto. On skeletons and design patterns. In Proceedings of the
International Conference PARCO 2001, 2001.
[Dan04] Klaus Danne. Distributed Arithmetic FPGA Design with Online Scalable
Size and Performance. In Proceedings of 17th Symposium on Integrated
Circuits and Systems Design (SBCCI 2004), pages 135–140. ACM Press,
7 - 11 September 2004.
[Dan06] Klaus Danne. Real-Time Multitasking in Embedded Systems Based on Re-
configurable Hardware. PhD thesis, University Paderborn, Heinz Nixdorf
Institute, 2006.
[DB04] Klaus Danne and Christophe Bobda. Dynamic Reconfiguration of Dis-
tributed Arithmetic Controllers: Design Space Exploration and Trade-off
Analysis. In Proceedings of the 11th Reconfigurable Architectures Work-
shop (RAW’04), Santa Fe, USA, 26 - 27 April 2004. IEEE Computer
Society.
[DBK03] Klaus Danne, Christophe Bobda, and Heiko Kalte. Run-time Exchange
of Mechatronic Controllers Using Partial Hardware Reconfiguration. In
Proc. of the International Conference on Field Programmable Logic and
Applications (FPL 2003), September 2003.
178
Bibliography
[DeH96a] Andr´e; DeHon. DPGA Utilization and Application. In Proceedings of the
1996 ACM fourth international symposium on Field-Programmable Gate
Arrays, pages 115–121. ACM Press, 1996.
[DeH96b] Andr´e DeHon. Reconfigurable architectures for general-purpose comput-
ing. Technical report, Massachusetts Institute of Technology, Cambridge,
MA, USA, 1996.
[DEM+00] Oliver Diessel, Hossam ElGindy, Martin Middendorf, Hartmut Schmeck,
and Bernd Schmidt. Dynamic Scheduling of Tasks on Partially Recon-
figurable FPGAs. IEE Proceedings Computer and Digital Techniques
(Special Issue on Reconfigurable Systems), 147(3):181–188, 2000.
[DH98] Oliver Diessel and ElGindy Hossam. On Scheduling Dynamic FPGA Re-
configurations. In Kenneth A. Hawick and Heath A. James, editors, Pro-
ceedings of the Fifth Australasian Conference on Parallel and Real-Time
Systems (PART’98), pages 191–200. Springer-Verlag, 1998.
[dLKNH+04] Fernanda Gusm˜ao de Lima Kastensmidt, Gustavo Neuberger, Renato Fer-
nandes Hentschke, Luigi Carro, and Ricardo Reis. Designing and testing
fault-tolerant techniques for sram-based fpgas. In CF ’04: Proceedings of
the 1st conference on Computing frontiers, pages 419–432, New York, NY,
USA, 2004. ACM Press.
[DP06] Klaus Danne and Marco Platzner. Executing hardware tasks on dynami-
cally reconfigurable devices under real-time conditions. In Proceedings of
the FPL 2006, 2006.
[DSG05] Nij Dorairaj, Eric Shiflet, and Mark Goosman. Planahead software as a
platform for partial reconfiguration. Xcell Journal, (4), 2005.
[DST99] Deepali Deshpande, Arun K. Somani, and Akhilish Tyagi. Configuration
Caching Vs data caching for Striped FPGAs. In Proceedings of the 1999
ACM/SIGDA seventh international symposium on Field programmable
gate arrays, pages 206–214. ACM Press, 1999.
[DW99] Andr´e; DeHon and John Wawrzynek. Reconfigurable computing: what,
why, and implications for design automation. In DAC ’99: Proceedings
of the 36th ACM/IEEE conference on Design automation, pages 610–615,
New York, NY, USA, 1999. ACM Press.
[EBTB63] G. Estrin, B. Bussell, R. Turn, and J. Bibb. Paralelle processing in a
restructurable computer system. Transactions on Electronic Computers,
pages 747–755, 1963.
[EH94] James G. Eldredge and Brad L. Hutchings. RRANN: A hardware imple-
mentation of the backpropagation algorithm using reconfigurable FPGAs.
179
Bibliography
In IEEE World Conference on Computational Intelligence, Orlando, FL,
1994.
[EP02] Michael Eisenring and Marco Platzner. A Framework for Run-time Re-
configurable Systems. The Journal of Supercomputing, 21:145–159, 2002.
[ESS+96] H. ElGindy, A. K. Somani, H. Schroder, H. Schmeck, and A. Spray. RMB
A Reconfigurable Multiple Bus Network. In HPCA ’96: Proceedings of
the 2nd IEEE Symposium on High-Performance Computer Architecture
(HPCA ’96), page 108, Washington, DC, USA, 1996. IEEE Computer
Society.
[ESSA00] John M. Emmert, Charles E. Stroud, Brandon Skaggs, and Miron
Abramovici. Dynamic fault tolerance in fpgas via partial reconfiguration.
In FCCM, pages 165–174. IEEE Computer Society, 2000.
[Est02] Gerald Estrin. Reconfigurable computer origins: the UCLA fixed-plus-
variable (F+V) structure computer. IEEE Ann. Hist. Comput., 24(4):3–9,
2002.
[EV62] G. Estrin and C. R. Viswanathan. Organization of a “fixed-plus-variable”
structure computer for computation of eigenvalues and eigenvectors of real
symmetric matrices. J. ACM, 9(1):41–60, 1962.
[FC05] W. Fu Fu and Katherine Compton. An execution environment for recon-
figurable computing (abstract only). In 13th Annual IEEE Symposium
on Field-Programmable Custom Computing Machines (FCCM’05), pages
149–158. IEEE, 2005.
[FKT01] Sandor P. Fekete, E. K¨
ohler, and J¨
urgen Teich. Optimal FPGA Mod-
ule Placement with Temporal Precedence Constraints. In Proc. DATE
2001, Design, Automation and Test in Europe, pages 658–665, Munich,
Germany, March 13-16 2001. IEEE Computer Society Press.
[Fra06] Stefan Frank. Evaluierung preemtiver Scheduling-Algorithmen f¨
ur die
Rekonfigurierung von FPGAs. Studienarbeit, Heinz Nixdorf Insitute, Uni-
versity of Paderborn, Germany, June 2006.
[Fra07] Stefan Frank. Einsatz algorithmischer Skelette zur Generierung partiell
dynamisch rekonfigurierbarer Schaltungen. Diplomarbeit, Heinz Nixdorf
Insitute, University of Paderborn, Germany, 2007.
[GCL02] J. Gause, P. Y. K. Cheung, and W. Luk. Reconfigurable Shape-Adaptive
Template Matching Architectures. In Proceedings 10th Annual IEEE
Symposium on Field-Programmable Custom Computing Machines. FCCM
2002, pages 98–107. IEEE Comput. Soc., 2002.
180
Bibliography
[GG05] Maya Gokhale and Paul S. Graham. Reconfigurable Computing: Acceler-
ating Computation with Field-Programmable Gate Arrays. Springer, 2005.
[GGKK03] Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. In-
troduction to Parallel Computing. Addison Wesley, 2003.
[GHK+91] Maya Gokhale, William Holmes, Andrew Kopser, Sara Lucas, Ronald
Minnich, Douglas Sweely, and Daniel Lopresti. Building and using a highly
parallel programmable logic array. Computer, 24(1):81–89, 1991.
[GJ79] Michael R. Garey and David S. Johnson. Computers and Interactability: A
Guide to the Theroy of NP-Completeness. W. H. Freeman and Company,
San Francisco, 1979.
[GK83] Daniel D. Gajski and R. H. Kuhn. Guest Editor’s Introduction: New
VLSI Tools. IEEE Computer, December 1983.
[Goe02] Richard Goering. Platform-based Design: A Choice, Not a Panacea. EE
Times, September 2002.
[G¨
ot07] Marcelo G¨
otz. Run-time Reconfigurable RTOS for Reconfigurable Systems-
on-Chip. PhD thesis, University Paderborn, Heinz Nixdorf Institute, 2007.
[GP06] Bj¨
orn Griese and Mario Porrmann. A reconfigurable ethernet switch for
self-optimizing communication systems. In Proceedings of the 1st IFIP
International Conference on Biologically Inspired Cooperative Computing
(BICC 2006), volume 216 of IFIP International Federation for Informa-
tion Processing, pages 115–124, Boston, MA, USA, August 2006. Springer.
[GR01] Varghese George and Jan M. Rabaey. Low-energy FPGAs: architecture
and design. Kluwer Academic Publishers, Norwell, MA, USA, 2001.
[GRP06] Marcelo G¨
otz, Achim Rettberg, and Carlos E. Pereira. Run-Time Recon-
figurable Real-Time Operting System For Hybrid Execution Platforms. In
Proceedings of the 12th IFAC Symposium on Information Control Prob-
lems in Manufacturing, 2006.
[GSAK00] Maya B. Gokhale, Janice M. Stone, Jeff Arnold, and Mirek Kalinowski.
Stream-Oriented FPGA Computing in the Streams-C High Level Lan-
guage. In FCCM ’00: Proceedings of the 2000 IEEE Symposium on Field-
Programmable Custom Computing Machines, page 49, Washington, DC,
USA, 2000. IEEE Computer Society.
[GSB+00] Seth Copen Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi,
Matt Moe, and R. Reed Taylor. PipeRench: A Reconfigurable Architec-
ture and Compiler. IEEE Computer, 33:70–77, April 2000.
181
Bibliography
[GSC04] Sathish Gopalakrishnan, Lui Sha, and Marco Caccamo. Hard Real-Time
Communication in Bus-Based Networks. In RTSS, pages 405–414, 2004.
[GSP02] Dhrubajyoti Goswami, Ajit Singh, and Bruno Richard Preiss. From de-
sign patterns to parallel architecture skeletons. Journal of Parallel and
Distributed Computing, 62(4):669–695, April 2002.
[GTF+02] Eike Grimpe, Bernd Timmermann, Tiemo Fandrey, Ramon Biniasch, and
Frank Oppenheimer. SystemC Object-Oriented Extensions and Synthesis
Features. In Forum on Design Languages FDL ’02, Sept 2002.
[GV00] Satish Ganesan and Ranga Vemuri. An Integrated Temporal Partition-
ing and Partial Reconfiguration Technique for Design Latency Improve-
ment. In Proceedings of the IEEE Design, Automation and Test in Europe
(DATE ’00), Paris, France, 2000.
[GZR99] Varghese George, Hui Zhang, and Jan Rabaey. The design of a low energy
FPGA. In ISLPED ’99: Proceedings of the 1999 international symposium
on Low power electronics and design, pages 188–193, New York, NY, USA,
1999. ACM Press.
[HAA+96] Mary W. Hall, Jennifer M. Anderson, Saman P. Amarasinghe, Brian R.
Murphy, Shih-Wei Liao, Edouard Bugnion, and Monica S. Lam. Max-
imizing multiprocessor performance with the suif compiler. Computer,
29(12):84–89, 1996.
[Hal70] Kenneth M. Hall. An r-dimensional Quadratic Placement Algorithm.
Managment Science, 17(3):219–229, November 1970.
[Har01a] Reiner Hartenstein. A Decade of Reconfigurable Computing: a Visionary
Retrospective. In Proceedings of the Conference on Design, Automation
and Test in Europe (DATE 01), March 2001.
[Har01b] Reiner Hartenstein. Coarse Grain Reconfigurable Architectures. In Pro-
ceedings of the Asia and South Pacific Design Automation Conference,
ASP-DAC 2001, pages 564–569, January 2001.
[Har03] Reiner Hartenstein. Toward Reconfigurable Computing via Concussive
Paradigm Shifts, 29 August 2003. invited presentation.
[Hau98] Scott Hauck. Configuration Prefetch for Single Context Reconfigurable
Coprocessors. In Proceedings of the 1998 ACM/SIGDA sixth international
symposium on Field programmable gate arrays, pages 65–74. ACM Press,
1998.
[Hau05] Christian Haubelt. Automatic Model-Based Design Space Exploration for
Embedded Systems A System Level Approach. PhD thesis, University of
Erlangen-Nuremberg, Germany, July 2005.
182
Bibliography
[HBB04] Michael Huebner, T. Becker, and J¨
urgen Becker. Real-time LUT-based
Network Topologies for dynamic and partial FPGA Self-Reconfiguration.
In Proceedings of 17th Symposium on Integrated Circuits and Systems De-
sign (SBCCI 2004). ACM Press, 7 - 11 September 2004.
[Hei08] Tales Heimfarth. NanoOS: a Distributed Operating System for Sensor
Networks Based on Biologic Inspired Methods. PhD thesis, University
Paderborn, Heinz Nixdorf Institute, 2008.
[HFHK97] Scott Hauck, Thomas W. Fry, Matthew M. Hosler, and Jeffrey P. Kao.
The Chimaera Reconfigurable Functional Unit. In Proceedings of the 1997
IEEE Symposium on FPGAs for Custom Computing Machines, 1997.
[HHHN00] Reiner Hartenstein, Michael Herz, Thomas Hoffmann, and Ulrich
Nageldinger. Kressarray xplorer: a new cad environment to optimize re-
configurable datapath array. In ASP-DAC ’00: Proceedings of the 2000
conference on Asia South Pacific design automation, pages 163–168, New
York, NY, USA, 2000. ACM Press.
[HK95] Reiner W. Hartenstein and Rainer Kress. A datapath synthesis system
for the reconfigurable datapath architecture. In Proceedings of the 1995
conference on Asia Pacific design automation (ASP-DAC’95), page 77,
Makuhari, Chiba, Japan, September 1995. ACM Press.
[HKKP07] Jens Hagemeyer, Boris Kettelhoit, Markus K¨
oster, and Mario Porrmann.
Design of Homogeneous Communication Infrastructures for Partially Re-
configurable FPGAs. In Proc. of the Int. Conf. on Engineering of Reconfig-
urable Systems and Algorithms (ERSA ’07), Las Vegas, USA, 25 - 28 June
2007.
[HLTP02] Edson L. Horta, John W. Lockwood, David E. Taylor, and David Par-
lour. Dynamic hardware plugins in an FPGA with partial run-time re-
configuration. In DAC ’02: Proceedings of the 39th conference on Design
automation, pages 343–348, New York, NY, USA, 2002. ACM Press.
[HMM04] J. Harkin, T. M. McGinnity, and L. P. Maguire. Modeling and optimizing
run-time reconfiguration using evolutionary computation. ACM Transac-
tions on Embedded Computing Systems, 3(4):661–685, November 2004.
[Hoa93] D. T. Hoang. Seachring genetic databases on Splash 2. In Proceedings
IEEE Workshop on FPGAs for Custom Computing Machines (Cat. No.
93TH0535-5), pages 185–91. IEEE Comput. Soc. Press., 1993.
[HOGT05] Christian Haubelt, Stephan Otto, Cornelia Grabbe, and J¨
urgen Teich.
A System-Level Approach to Hardware Reconfigurable Systems. In Pro-
ceedings of Asia and South Pacific Design Automation Conference, pages
298–301, Shanghai, China, January 2005. IEEE.
183
Bibliography
[HOH+07] A. Herrholz, F. Oppenheimer, P.A. Hartmann, A. Schallenberg, W. Nebel,
C. Grimm, M. Damm, J. Haase, F. Herrera F. Brame, E. Villar, I. Sander,
A. Jantsch, A.-M. Fouilliart, and M. Martinez. The ANDRES Project:
Analysis and Design of Run-Time Reconfigurable, Heterogeneous Systems.
In 17th International Conference on Field Programmable Logic and Ap-
plications. IEEE, 27 - 29 August 2007.
[Hol92] John H. Holland. Adaptation in natural and artificial systems. MIT Press,
Cambridge, MA, USA, 1992.
[Hor74] W. Horn. Some simple scheduling algorithms. Naval Reasearch Logistics
Quarterly, 21, 1974.
[HPS00] Nicholas G. Hall, Chris N. Potts, and Chelliah Sriskandarajah. Paral-
lel machine scheduling with a common server. Discrete Appl. Math.,
102(3):223–243, 2000.
[HTRE02] Christian Haubelt, J”urgen Teich, Kai Richter, and Rolf Ernst. Mod-
ellierung rekonfigurierbarer Systemarchitekturen. In J¨
urgen Ruf, editor,
GIITGGMM-Workshop 2002 - Methoden und Beschreibungssprachen zur
Modellierung und Verifikation von Schaltungen und Systemen, pages 163–
171, T¨
ubingen, Germany, February 2002. Shaker Verlag.
[H¨
ub07] Michael H¨
ubner. Dynamisch und partiell rekonfigurierbare Hardwaresys-
temarchitektur mit echzeitf¨
ahiger On-Demand Funktionalit¨
at. PhD thesis,
Universit¨
at Karlsruhe (TH), 2007.
[HV04] Manish Handa and Ranga Vemuri. An efficient algorithm for finding empty
space for online FPGA placement. In Proceedings of the 41st annual con-
ference on Design automation, pages 960–965. ACM Press, 2004.
[Ihm06] Stefan Ihmor. Modeling an Automated Synthesis of Reconfigurable Inter-
faces. PhD thesis, University Paderborn, Heinz Nixdorf Institute, 2006.
[Jac07] Brian Jackson. Partial Reconfiguration Design with PlanAhead 9.2. Tech-
nical report, Xilinx, 2 August 2007.
[Jai95] A. K. Jain. Convolution on Splash 2. In FCCM ’95: Proceedings of the
IEEE Symposium on FPGA’s for Custom Computing Machines, page 204,
Washington, DC, USA, 1995. IEEE Computer Society.
[JN03] ao M. P. Cardoso Jo and Hor´acio C. Neto. Compilation for fpga-based
reconfigurable hardware. IEEE Des. Test, 20(2):65–75, 2003.
[JR97a] Nikolaj Janzen and Franz Josef Rammig. Timing Analysis with FPGA-
based Emulation Systems. In Proceedings of the 9th European Simulation
Symposium, 1997.
184
Bibliography
[JR97b] Nikolaj Janzen and Franz Josef Rammig. Zeittreue prototypenanalyse
in fpga-basierten emulationssystemen. In 5. GI/ITG/GMM Workshop.
Methoden des Entwurfs und der Verifikation digitaler Systeme, 1997.
[JSM93] K. Jeffay, D. F. Stanat, and C. U. Martel. On non-preemtive scheduling of
periodic tasks with varying execution priority. In Proceedings of the IEEE
Real-Time Systems Symposium, pages 129–139, December 1993.
[JYLC00] Byungil Jeong, Sungjoo Yoo, Sunghyun Lee, and Kiyoung Choi.
Hardware-software cosynthesis for run-time incrementally reconfigurable
FPGAs. In ASP-DAC ’00: Proceedings of the 2000 conference on Asia
South Pacific design automation, pages 169–174, New York, NY, USA,
2000. ACM Press.
[Kal02] Bodo Kalthoff. Einsatz von algorithmischen Skeletten im Scheduling mas-
siv paralleler Systeme. Dissertation, University Paderborn, 2002.
[Kal04] Heiko Kalte. Einbettung dynamisch rekonfigurierbarer Hardwarearchitek-
turen in eine Universalprozessorumgebung. Dissertation, Universit¨
at
Paderborn, Heinz Nixdorf Institut, Schaltungstechnik, 2004. EUR 38,-,
ISBN 3-935433-48-4.
[KCR06] Fernanda Lima Kastensmidt, Luigi Carro, and Ricardo Reis. Fault-
Tolerance Techniques for SRAM-Based FPGAs, volume 32 of Frontiers
in Electronic Testing. Springer, 2006.
[Keb06] Udo Kebschull. Applications of FPGA reconfiguration for experiments
in high energy physics. In Wolfgang Karl, J¨
urgen Becker, Karl-Erwin
Grosspietsch, Christian Hochberger, and Erik Maehle, editors, ARCS
Workshops, LNI, page 172. GI, 2006.
[KGS+07] Mahendra Kumar, Angamuthu Ganesan, Sundeep Singh, Frank May, and
J¨
urgen Becker. H.264 Decoder at HD Resolution on a Coarse Grain Dy-
namically Reconfigurable Architecture. In Proceedings of the FPL 2007.
IEEE, 27 - 29 August 2007.
[KIM78] H. Kise, T. Ibaraki, and H. Mine. A solvable case of the one machine
scheduling problem with ready and due times. Operations Research,
26(1):121–126, 1978.
[KKMB02] R. Kastner, A. Kaplan, S. Ogrenci Memik, and E. Bozorgzadeh. Instruc-
tion generation for hybrid reconfigurable systems. ACM Trans. Des. Au-
tom. Electron. Syst., 7(4):605–627, 2002.
[KKP05a] Markus K¨
oster, Heiko Kalte, and Mario Porrmann. Run-Time Defragmen-
tation for Partially Reconfigurable Systems. In Proceedings of the IFIP
International Conference on Very Large Scale Integration (VLSI-SOC),
pages 109–115, 2005.
185
Bibliography
[KKP05b] Markus K¨
oster, Heiko Kalte, and Mario Porrmann. Task placement for
heterogeneous reconfigurable architectures. In Proceedings of the IEEE
2005 Conference on Field-Programmable Technology, pages 43–50. IEEE
Computer Society Press, 11 - 14 December 2005.
[KKP06] Markus K¨
oster, Heiko Kalte, and Mario Porrmann. Relocation and de-
fragmentation for heterogeneous reconfigurable systems. In Proceedings
of the International Conference on Engineering of Reconfigurable Systems
and Algorithms, pages 70–76. CSREA Press, 26 - 29 June 2006.
[KKS04] Ryan Kastner, Adam Kaplan, and Majid Sarrafzadeh. Synthesis Tech-
niques and Optimizations for Reconfigurable Systems. Kluwer Academic
Publishers, 2004.
[KL78] H. T. Kung and C. E. Leiserson. Systolic Arrays for VLSI. In Sparse
Matrix Proceedings, pages 245–82, Philidelphia, PA, 1978. Society of In-
dustrial and Applied Mathematicians.
[KMN+00] K. Keutzer, S. Malik, R. Newton, J. Rabaey, and A. Sangiovanni-
Vincentelli. System level design: Orthogonalization of concerns and
platform-based design. In IEEE Trans on Computer-Aided Design,
19(12):1523–1543, 2000.
[Koc02] Andreas Koch. Compilation for Adaptive Computing Systems Using Com-
plex Parameterized Hardware Objects. The Journal of Supercomputing,
21:179–190, 2002.
[KP87] F. J. Kurdahi and A. C. Parker. Real: a program for register allocation.
In DAC ’87: Proceedings of the 24th ACM/IEEE conference on Design
automation, pages 210–215, New York, NY, USA, 1987. ACM Press.
[KPC94] P. H. Kelly, K. J. Page, and P. M. Chau. Rapid prototyping of asic based
systems. In DAC ’94: Proceedings of the 31st annual conference on Design
automation, pages 460–465, New York, NY, USA, 1994. ACM Press.
[KPR02] Heiko Kalte, Mario Porrmann, and Ulrich R¨
uckert. A Prototyping Plat-
form for Dynamically Reconfigurable System on Chip Designs. In Pro-
ceedings of the IEEE Workshop Heterogeneous reconfigurable Systems on
Chip (SoC), 2002.
[Kre96] Rainer Kress. A Fast Reconfigurable ALU for Xputers. Ph.D. Thesis.,
University of Kaiserslautern, 1996.
[KSW+07] Alex Krasnov, Andrew Schultz, John Wawrzynek, Greg Gibeling, and
Pierre-Yves Droz. Ramp blue: A message-passing manycore system in
fpgas. In 17th International Conference on Field Programmable Logic and
Applications. IEEE, 27 - 29 August 2007.
186
Bibliography
[KV99] Meenakshi Kaul and Ranga Vemuri. Integrated Block-Processing and
Design-Space Exploration in Temporal Partitioning for RTR Architec-
tures. In Proceedings of the Reconfigurable Architecture Workshop, RAW
’99. Springer, 1999.
[KVGO99] Meenakshi Kaul, Ranga Vemuri, Sriram Govindarajan, and Iyad Ouaiss.
An Automated Temporal Partitioning and Loop Fission Approach for
FPGA Based Reconfigurable Synthesis of DSP Applications. In Design
Automation Conference, pages 616–622, 1999.
[KW97] S. A. Kravchenko and Frank Werner. Parallel Machine scheduling Prob-
lems with a Singel Server. Mathematical and Computer Modelling,
26(12):1–11, December 1997.
[KZP03] S. Kim, C. H. Ziesler, and M. C. Papaefthymiou. Fine-grain real-
time reconfigurable pipelining. IBM Journal of Research and Develop-
ment: Power-efficient computer technologies, 47(5/6):599–609, Septem-
ber/November 2003.
[LBM+06] Patrick Lysaght, Brandon Blodget, Jeff Mason, Jay Young, and Brendan
Bridgeford. Enhanced architectures, design methodologies and CAD tools
for dynamic reconfiguration on XILINX FPGAs. In Proceedings of the
FPL 2006, 2006.
[LCD03] Jong-eun Lee, Kiyoung Choi, and Nikil D. Dutt. Compilation Approach
for Coarse-Grained Reconfigurable Architectures. IEEE Design & Test of
Computers, 20(1):26–33, 2003.
[LCH00] Zhiyuan Li, Katherine Compton, and Scott Hauck. Configuration Caching
Management Techniques for Reconfigurable Computing. In IEEE Sympo-
sium on Field-Programmable Custom Computing Machines, pages 22–38,
Napa Valley, CA, USA, April 2000.
[LD93] P. Lysaght and J. Dunlop. Dynamic reconfiguration of FPGAs. In
W. Moore and W. Luk, editors, More FPGAs: Proceedings of the 1993 In-
ternational workshop on field-programmable logic and applications, pages
82–94, Oxford, England, 1993.
[LH02] Zhiyuan Li and Scott Hauck. Configuration prefetching techniques for
partial reconfigurable coprocessor with relocation and defragmentation.
In FPGA ’02: Proceedings of the 2002 ACM/SIGDA tenth international
symposium on Field-programmable gate arrays, pages 187–195. ACM
Press, 2002.
[LLM95] L. Lundheim, I. Legrand, and L. Moll. A Programmable Active Memory
Implementation of a Neural Network for Second Level Triggering in Atlas.
International Journal of Modern Physics C, 6:561–566, 1995.
187
Bibliography
[LRK77] J. K. Lenstra and A. H. G. Rnnooy Kan. Optimization and approximation
in deterministic sequencing and scheduling: A survey. Annals of Discrete
Mathematics, 5:287–326, 1977.
[LRKB77] J. K. Lenstra, A. H. G. Rnnooy Kan, and P. Brucker. Complexity of
machine scheduling problems. Annals of Discrete Mathematics, 1:343–
362, 1977.
[LW82] J. Leung and J. W. Whitehead. On the complexity of fixed priority
scheduling of periodic real-time tasks. Performance Evaluation, 2(4):237–
250, 1982.
[Mak00] Tsugio Makimoto. The Rising Wave of Field Programmability. In Proceed-
ings of the The Roadmap to Reconfigurable Computing, 10th International
Workshop on Field-Programmable Logic and Applications, pages 1–6, Vil-
lach, Austria, 2000. Springer-Verlag.
[MAV+02] T. Marescaux, Bartic A., D. Verkest, S. Vernalde, and R. Lauwereins.
Interconnection Networks Enable Fine-Grain Dynamic Multi-Tasking on
FPGAs. In Proceedings of the International Conference on Field Pro-
grammable Logic, Montpellier, France, September 2002.
[MBKS01] S. Memik, E. Bozorgzadeh, R. Kastner, and M. Sarrafzadeh. Sps: A
strategically programmable system. In Proceedings of Reconfigurable Ar-
chitecture Workshop, 2001.
[MFK+00] Rafael Maestre, Milagros Fernandez, Fadi J. Kurdahi, Nader Bagherzadeh,
and Hartej Singh. Configuration management in multi-context reconfig-
urable systems for simultaneous performance and power optimizations.
In ISSS ’00: Proceedings of the 13th international symposium on System
synthesis, pages 107–113, Washington, DC, USA, 2000. IEEE Computer
Society.
[Mit95] J. Mitola. The software radio architecture. Communications Magazine,
IEEE, 33(5):26–38, 1995.
[MMadH06] Burkhard Monien and Friedhelm Meyer auf der Heide. New Trends in
Parallel & Distributed Computing, volume 181. HNI-Verlagsschriftenreihe,
2006.
[MMS00] Berna L. Massingill, Timothy G. Mattson, and Beverly A. Sanders. A pat-
tern language for parallel application programs (research note). In Euro-
Par ’00: Proceedings from the 6th International Euro-Par Conference on
Parallel Processing, pages 678–681, London, UK, 2000. Springer-Verlag.
[Mot02] M. Motomura. A dynamically reconfigurable processor architecture. In
Microprocessor Forum, October 2002.
188
Bibliography
[MSSB00] Steve MacDonald, Duane Szafron, Jonathan Schaeffer, and Steven Brom-
ling. Generating parallel program frameworks from parallel design pat-
terns. In Euro-Par ’00: Proceedings from the 6th International Euro-
Par Conference on Parallel Processing, pages 95–104, London, UK, 2000.
Springer-Verlag.
[MSV00a] Bingfeng Mei, Patrick Schaumont, and Serge Vernalde. A Hardware-
Software Partitioning and Scheduling Algorithm for Dynamically Recon-
figurable Embedded Systems. In 4. GI/ITG/GMM Workshop: Methoden
und Beschreibungssprachen zur Modellierung und Verifikation von Schal-
tungen und Systemen, Meißen, Germany, November 2000.
[MSV00b] Bingfeng Mei, Patrick Schaumont, and Serge Vernalde. A Hardware-
Software Partitioning and Scheduling Algorithm for Dynamically Recon-
figurable Embedded Systems. In ProRISC workshop on Circuits, Systems
and Signal Processing, November 2000.
[MTAB07] Mateusz Majer, J¨
urgen Teich, Ali Ahmadinia, and Christophe Bobda.
The Erlangen Slot Machine: A Dynamically Reconfigurable FPGA-Based
Computer. Journal of VLSI Signal Processing Systems, 46(2):15–31,
March 2007.
[Nag01] Ulrich Nageldinger. Coarse-Grained Reconfigurable Architecture Design
Space Exploration. PhD thesis, University of Kaiserslautern, CS depart-
ment (Informatik), 2001.
[NB04] Juanjo Noguera and Rosa M. Badia. Multitasking on Reconfigurable Ar-
chitectures: Microarchitecture Support and Dynamic Scheduling. Trans.
on Embedded Computing Sys., 3(2):385–406, 2004.
[NTI04] Jari Nurmi, Hannu Tenhunen, and Jouni Isoaho. Interconnect-Centric
Design for Advanced SOC and NOC. Kluwer Academic Publishers, 2004.
[OGS+98] Iyad Ouaiss, Sriram Govindarajan, Vinoo Srinivasan, Meenakshi Kaul,
and Ranga Vemuri. An Integrated Partitioning and Synthesis System for
Dynamically Reconfigurable Multi-FPGA Architectures. In IPPS/SPDP
Workshops, pages 31–36, 1998.
[OSF+07] Tobias Oppold, Thomas Schweizer, Julio Oliveira Filho, Sven Eisenhardt,
and Wolfgang Rosenstiel. CRC concepts and evaluation of processor-like
reconfigurable architectures. it - Information Technology, 7(3):157–164,
June 2007.
[PAC06] PACT. XPP-III Processor Overview. White paper, PACT Information-
stechnologie GmbH, 13 July 2006.
189
Bibliography
[PAK+07] Thilo Pionteck, Carsten Albrecht, Roman Koch, Erik Maehle, Michael
H¨
ubner, and J¨
urgen Becker. Communication Architectures for Dynami-
cally Reconfigurable FPGA Designs. In IPDPS, Reconfigurable Architec-
tures Workshop, pages 1–8. IEEE, 2007.
[Pel98] Susanna Pelagatti. Structured development of parallel programs. Taylor &
Francis, Inc., Bristol, PA, USA, 1998.
[PEW+03] Christian Plessl, Rolf Enzler, Herbert Walder, Jan Beutel, Marco Platzner,
Lothar Thiele, and Gerhard Tr¨
oster. The Case for Reconfigurable Hard-
ware in Wearable Computing. Personal and Ubiquitous Computing,
7(5):299–308, October 2003.
[PHA+07] Katarina Paulsson, Michael H¨
ubner, G¨
unther Auer, Michael Dreschmann,
and J¨
urgen Becker. Implementation of a Virtual Internal Configuration
Access Port (JCAP) for Enabling Partial Self- Reconfiguration on Xilinx
Spartan III FPGAs. In Proceedings of the FPL 2007. IEEE, 27 - 29 August
2007.
[PHB+07] K. Paulsson, M. H¨
ubner, J. Becker, J.-M. Philippe, and C. Gamrat.
On-line Routing of Reconfigurable Functions for Future Self-Adaptive
Systems—Investigations within the AETHER Project. In 17th Interna-
tional Conference on Field Programmable Logic and Applications. IEEE,
27 - 29 August 2007.
[PKG86] P. G. Paulin, J. P. Knight, and E. F. Gircycy. HAL: A Mulit-Paradigm
Apporach to Automatic Data Path Synthesis. In Proceedings of the 23rd
ACM/IEEE Design Automation Conference, pages 263–270, Piscataway,
NJ, USA, 1986. IEEE Press.
[PT05] D. Pellerin and S. Thibault. Practical FPGA Programming in C. Prentice
Hall, 2005.
[Ram77] Franz J. Rammig. A Concept for the Editing of Hardware. In Proceedings
of the 14th Design Automation Conference (DAC’77), New Orleans, USA,
June 1977.
[Ram07] Franz Josef Rammig. Software-hardware complexes: Towards flexible bor-
ders. In Proceedings of the IFIP TC 10 Working Conference: International
Embedded System Symposium (IESS), pages 433–436, 30 May - 1 June
2007.
[Ret07] Achim Rettberg. Low Power Driven High-Level Synthesis for Dedicated
Architectures. PhD thesis, University Paderborn, 2007.
[RG02] Fethi A. Rabhi and Sergei Gorlatch. Patterns and Skeletons for Parallel
and Distributed Computing. Springer, 2002.
190
Bibliography
[RLZB04] Achim Rettberg, Thomas Lehmann, Mauro Zanella, and Christophe
Bobda. Selbststeuernde rekonfigurierbare bit-serielle Pipelinearchitektur.
Deutsches Patent- und Markenamt, December 2004. Patent-No. 10308510.
[RM07] Markus Rullmann and Renate Merker. A Reconfiguration Aware Circuit
Mapper for FPGAs. In 14th Reconfigurable Architectures Workshop (RAW
2007), pages 1–8. IEEE, 25 - 26 March 2007.
[RMVC05] Javier Resano, Daniel Mozos, Diederik Verkest, and Francky Catthoor. A
reconfiguration manager for dynamically reconfigurable hardware. IEEE
Design and Test of Computers, 22(5):452–460, 2005.
[SC06] Manuel Salda˜na and Paul Chow. TMD-MPI: An MPI Implementation for
Multiple Processors Across Multiple FPGAs. In Proceedings of the FPL
2006, pages 1–6. IEEE, 2006.
[Sch07] Simon Sch¨
utte. Eclipse Plug-In f¨
ur den Entwurf Rekonfigurierbarer
Rechensysteme. Diplomarbeit, Heinz Nixdorf Insitute, University of
Paderborn, Germany, 2007.
[Sch08] Fabian Schulte. Entwicklung einer Two-Slot-Machine als rekonfig-
urierbarer, dynamischer Hardwarebeschleuniger integriert in ein FPGA
basiertes System on Chip. Diplomarbeit, Heinz Nixdorf Insitute, Univer-
sity of Paderborn, Germany, 2008. to appear.
[SH07] Peter Sobe and Volker Hampel. FPGA-Accelerated Deletion-Tolerant
Coding for Reliable Distributed Storage. In Paul Lukowicz, Lothar Thiele,
and Gerhard Tr¨
oster, editors, ARCS, volume 4415 of Lecture Notes in
Computer Science, pages 14–27. Springer, 2007.
[Sin07] Satnam Singh. Integrating FPGAs in high-performance computing: pro-
gramming models for parallel systems—the programmer’s perspective. In
FPGA ’07: Proceedings of the 2007 ACM/SIGDA 15th international sym-
posium on Field programmable gate arrays, pages 133–135, New York, NY,
USA, 2007. ACM Press.
[SJ02] Li Shang and Niraj K. Jha. Hardware-Software Co-Synthesis of Low Power
Real-Time Distributed Embedded Systems with Dynamically Reconfig-
urable FPGAs. In ASP-DAC ’02: Proceedings of the 2002 conference on
Asia South Pacific design automation/VLSI Design, page 345, Washing-
ton, DC, USA, 2002. IEEE Computer Society.
[SKB02] Li Shang, Alireza S. Kaviani, and Kusuma Bathala. Dynamic power con-
sumption in Virtex-II FPGA family. In FPGA ’02: Proceedings of the 2002
ACM/SIGDA tenth international symposium on Field-programmable gate
arrays, pages 157–164, New York, NY, USA, 2002. ACM Press.
191
Bibliography
[SMR+04] Donatella Sciuto, Grant Martin, Wolfgang Rosenstiel, Stuart Swan, Frank
Ghenassia, Peter Flake, and Johny Srouji. Systemc and systemverilog:
Where do they fit? where are they going? In DATE ’04: Proceedings
of the conference on Design, automation and test in Europe, page 10122,
Washington, DC, USA, 2004. IEEE Computer Society.
[SNG01] Suraj Sudhir, Suman Nath, and Seth Copen Goldstein. Configuration
caching and swapping. In FPL ’01: Proceedings of the 11th International
Conference on Field-Programmable Logic and Applications, pages 192–
202, London, UK, 2001. Springer-Verlag.
[SNRC06] Manuel Salda˜na, Daniel Nunes, Emanuel Ramalho, and Paul Chow. Con-
figuration and Programming of Heterogeneous Multiprocessors on a Multi-
FPGA System Using TMD-MPI. In Proceedings of the 3rd International
Conference on ReConFigurable Computing and FPGA’s, pages 270–279.
IEEE, September 2006.
[SON06] Andreas Schallenberg, Frank Oppenheimer, and Wolfgang Nebel.
OSSS+R: Modelling and Simulating Self-Reconfigurable Systems. In Pro-
ceedings of International Conference on Field Programmable Logic and
Applications, pages 177–182. IEEE, August 2006.
[SRL90] L. Sha, R. Rajkumar, and J.P. Lehoczky. Priority inheritance protocols:
An approach to real-time synchronization. IEEE Transactions on Com-
puters, 39(9):1175–1185, 1990.
[SS05] Alireza Shoa and Shahram Shirani. Run-time reconfigurable systems for
digital signal processing applications: A survey. J. VLSI Signal Process.
Syst., 39(3):213–235, 2005.
[SSL89] B. Sprunt, L. Sha, and J. Lehoczky. Aperiodic task scheduling for hard-
real-time systems. Journal of Real-Time Systems, 1(1):27–60, June 1989.
[STB06] Hayden Kwok-Hay So, Artem Tkachenko, and Robert Brodersen. A uni-
fied hardware/software runtime environment for FPGA-based reconfig-
urable computers using BORPH. In CODES+ISSS ’06: Proceedings of
the 4th international conference on Hardware/software codesign and sys-
tem synthesis, pages 259–264, New York, NY, USA, 2006. ACM Press.
[SVCBS04] Alberto Sangiovanni-Vincentelli, Luca Carloni, Fernando De Bernardinis,
and Marco Sgroi. Benefits and challenges for platform-based design. In
DAC ’04: Proceedings of the 41st annual conference on Design automa-
tion, pages 409–414, New York, NY, USA, 2004. ACM Press.
[SVM01] Alberto Sangiovanni-Vincentelli and Grant Martin. Platform-based design
and software design methodology for embedded systems. IEEE Des. Test,
18(6):23–33, 2001.
192
Bibliography
[SWP03] Christoph Steiger, Herbert Walder, and Marco Platzner. Heuristics for
Online Scheduling Real-time Tasks to Partially Reconfigurable Devices. In
Proceedings of the 13rd International Conference on Field Programmable
Logic and Application (FPL’03), pages 575–584. Springer, September
2003.
[SWP04] Christoph Steiger, Herbert Walder, and Marco Platzner. Operating sys-
tems for reconfigurable embedded platforms: Online scheduling of real-
time tasks. IEEE Trans. Comput., 53(11):1393–1407, 2004.
[SWPT03] Christoph Steiger, Herbert Walder, Marco Platzner, and Lothar Thiele.
Online Scheduling and Placement of Real-time Tasks to Partially Recon-
figurable Devices. In Proceedings of the 24th IEEE International Real-
Time Systems Symposium, page 224. IEEE Computer Society, 2003.
[SWT+02] Herman Schmit, David Whelihan, Andrew Tsai, Matthew Moe, Benjamin
Levine, and R. Reed Taylor. PipeRench: A Virtualized Programmable
Datapath in 0.18 Micron Technology. In Proceedings of the IEEE Custom
Integrated Circuits Conference, pages 63–66, 2002.
[TB01] Russell Tessier and Wayne Burleson. Reconfigurable computing for digital
signal processing: A survey. J. VLSI Signal Process. Syst., 28(1-2):7–27,
2001.
[TB05] Alexander Thomas and J¨
urgen Becker. Multi-grained reconfigurable data-
path structures for online-adaptive reconfigurable hardware architectures.
In ISVLSI ’05: Proceedings of the IEEE Computer Society Annual Sym-
posium on VLSI: New Frontiers in VLSI Design (ISVLSI’05), pages 118–
123, Washington, DC, USA, 2005. IEEE Computer Society.
[TB07] Alexander Thomas and J¨
urgen Becker. New adaptive multi-grained hard-
ware architecture for processing of dynamic function patterns. it - Infor-
mation Technology, 7(3):165–173, June 2007.
[TCJW97] Steve Trimberger, D. Carberry, A. Johnson, and J. Wong. A time-
multiplexed fpga. In FCCM ’97: Proceedings of the 5th IEEE Symposium
on FPGA-Based Custom Computing Machines, page 22, Washington, DC,
USA, 1997. IEEE Computer Society.
[Tel05] Anil Telikepalli. Performance vs. Power: Getting the Best of Both Worlds.
Xcelljounral, 54, 2005.
[TKB+07] F. Thoma, M. K¨
uhnle, P. Bonnot, E. Moscu Panainte, K. Bertels,
S. Goller, A. Schneider, S. Guyetant, E. Sch¨
uler, K.D. M¨
uller-Glaser, and
J. Becker. MORPHEUS: Heterogeneous Reconfigurable Computing. In
17th International Conference on Field Programmable Logic and Applica-
tions. IEEE, 27 - 29 August 2007.
193
Bibliography
[TKM+02] Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae
Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee,
Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman,
Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal.
The Raw Microprocessor: A Computational Fabric for Software Circuits
and General-Purpose Programs. IEEE Micro, 22(2):25–35, 2002.
[Tre95] Nick Tredennick. Technology and Business: Forces Driving Microprocessor
Evolution. Proceedings of the IEEE, 83(12), December 1995.
[Tri94] Stephen M. Trimberger. Field-Programmable Gate Array Technology.
Kluwer Academic Publishers, Norwell, MA, USA, 1994.
[UHGB04a] Michael Ullmann, Michael H¨
ubner, Bj¨
orn Grimm, and J¨
urgen Becker. An
FPGA Run-Time System for Dynamical On-Demand Reconfiguration. In
IPDPS. IEEE Computer Society, 2004.
[UHGB04b] Michael Ullmann, Michael H¨
ubner, Bj¨
orn Grimm, and J¨
urgen Becker. On-
Demand FPGA Run-Time System for Dynamical Reconfiguration with
Adaptive Priorities. In Proceedings of the International Conference on
Field Programmable Logic and its Applications (FPL2004), pages 454–
463. Springer, 30 August - 1 September 2004.
[VBR+96] J. Vuillemin, P. Bertin, D. Roncin, M. Shand, H. Touati, and P. Bou-
card. Programmable active memories: Reconfigurable systems come of
age. IEEE Transactions on VLSI Systems, 4(1):56–69, 1996.
[VKKR02] Ranga Vemuri, Srinivas Katkoori, Meenakshi Kaul, and Jay Roy. An
efficient register optimization algorithm for high-level synthesis from hi-
erarchical behavioral specifications. ACM Trans. Des. Autom. Electron.
Syst., 7(1):189–216, 2002.
[VM05] Nikolaos S. Voros and Konstantinos Masselos. System Level Design of Re-
configurable Systems-on-Chip. Springer-Verlag New York, Inc., Secaucus,
NJ, USA, 2005.
[VS07] Stamatis Vassiliadis and Dimitrios Soudris, editors. Fine- and Coarse-
Grain Reconfigurable Computing. Springer, 2007.
[VT03] Ramachandran Vaidyanathan and Jerry L. Trahan. Dynamic Reconfigura-
tion: Architectures and Algorithms (Series in Computer Science). Kluwer
Academic/Plenum Publishers, 2003.
[Wan98] Markus Wannemacher. Das FPGA-Kochbuch. MITP-Verlag, 1998.
[War06] Alexander Warkentin. Two-slot based reconfiguration environment. Stu-
dienarbeit, Heinz Nixdorf Insitute, University of Paderborn, Germany,
2006.
194
Bibliography
[Waw07] John Wawrzynek. Adventures with a reconfigurable research platform. In
17th International Conference on Field Programmable Logic and Applica-
tions. IEEE, 27 - 29 August 2007.
[WH95] M. Wirthlin and B. Hutchings. A dynamic instruction set computer. In
Peter Athanas and Kenneth L. Pocek, editors, IEEE Symposium on FP-
GAs for Custom Computing Machines, pages 99–107, Los Alamitos, CA,
1995. IEEE Computer Society Press.
[WOK+00] Karlheinz Weiß, Carsten Oetker, Igor Katchan, Thorsten Steckstor, and
Wolfgang Rosenstiel. Power estimation approach for SRAM-based FP-
GAs. In FPGA ’00: Proceedings of the 2000 ACM/SIGDA eighth inter-
national symposium on Field programmable gate arrays, pages 195–202,
New York, NY, USA, 2000. ACM Press.
[WOK+06] John Wawrzynek, Mark Oskin, Christoforos Kozyrakis, Derek Chiou,
David A. Patterson, Shih-Lien Lu, James C. Hoe, and Krste Asanovic.
Ramp: A research accelerator for multiple processors. Technical Re-
port UCB/EECS-2006-158, EECS Department, University of California,
Berkeley, Nov 2006.
[WP02] Herbert Walder and Marco Platzner. Non-preemptive Multitasking on
FPGAs: Task Placement and Footprint Transform. In International Con-
ference on Engineering of Reconfigurable Systems and Algorithms (ERSA),
pages 24–30, June 2002.
[WP03] Herbert Walder and Marco Platzner. Online Scheduling for Block-
partitioned Reconfigurable Devices. In Proceedings of the International
Conference on Design, Automation and Test in Europe (DATE), pages
290–295. IEEE Computer Society, March 2003.
[WP04] Herbert Walder and Marco Platzner. A Runtime Environment for Re-
configurable Hardware Operating Systems. In Proceedings of the 14th
International Conference on Field Programmable Logic and Application
(FPL’04), pages 831–835. Springer, August 2004.
[Xil04] Xilinx. Two Flows for Partial Reconfiguration: Module Based or Differ-
ence Based. Application Note, Xilinx, September 2004.
[YMHB00] Zhi Alex Ye, Andreas Moshovos, Scott Hauck, and Prithviraj Banerjee.
CHIMAERA: a high-performance architecture with a tightly-coupled re-
configurable functional unit. In ISCA, pages 225–235, 2000.
195