Document [original]

Dynamic Compilation for Functional Programs

vorgelegt von Diplom-Informatiker

Martin Grabm¨uller

aus M¨unchen

von der Fakult¨at IV – Elektrotechnik und Informatik

der Technischen Universit¨at Berlin

zur Erlangung des akademischen Grades

Doktor der Ingenieurwissenschaften

– Dr.-Ing. –

genehmigte Dissertation

Promotionsausschuss:

Vorsitzender: Prof. Dr.-Ing. Sebastian M¨oller

Berichter: Prof. Dr. Peter Pepper

Berichter: Prof. Dr. Sabine Glesner

Tag der wissenschaftlichen Aussprache: 21. April 2009

Berlin 2009

D 83

Zusammenfassung

Diese Arbeit behandelt die dynamische, zur Laufzeit stattfindende ¨

Ubersetzung und

Optimierung funktionaler Programme. Ziel der Optimierung ist die erh¨

ohte Laufzeit-

effizient der Programme, die durch die compilergesteuerte Eliminierung von Abstrak-

tionen der Programmiersprache erreicht wird.

Bei der Implementierung objekt-orientierter Programmiersprachen werden bereits seit

mehreren Jahrzehnten Compiler-Techniken zur Laufzeit eingesetzt, um objekt-orien-

tierte Programme effizient ausf¨

uhren zu k¨

onnen. Sp¨

atestens seit der Einf¨

uhrung der

Programmiersprache Java und ihres auf einer abstrakten Maschine basierenden Aus-

f¨

uhrungsmodells hat sich die Praktikabilit¨

at dieser Implementierungstechnik gezeigt.

Viele Eigenschaften moderner Programmiersprachen konnten erst durch den Einsatz

dynamischer Transformationstechniken effizient realisiert werden, wie zum Beispiel

das dynamische Nachladen von Programmteilen (auch ¨

uber Netzwerke), Reflection

sowie verschiedene Sicherheitsl¨

osungen (z.B. Sandboxing).

Ziel dieser Arbeit ist zu zeigen, dass rein funktionale Programmiersprachen auf ¨

ahn-

liche Weise effizient implementiert werden k¨

onnen, und sogar Vorteile gegen¨

uber den

allgemein eingesetzten objekt-orientierten Sprachen bieten, was die Effizienz, Sicher-

heit und Korrektheit von Programmen angeht.

Um dieses Ziel zu erreichen, werden in dieser Arbeit Implementierungstechniken

entworfen bzw. aus bestehenden L¨

osungen weiterentwickelt, welche die dynamis-

che Kompilierung und Optimierung funktionaler Programme erlauben: zum einen

pr¨

asentieren wir eine Programmzwischendarstellung (getypte dynamische Continua-

tion-Passing-Style-Darstellung), welche sich zur dynamischen Kompilierung und Op-

timierung eignet. Basierend auf dieser Darstellung haben wir eine Erweiterung zur

verz¨

ogerten und selektiven Codeerzeugung von Programmteilen entwickelt. Der wich-

tigste Beitrag dieser Arbeit ist die dynamische Spezialisierung zur Eliminierung poly-

morpher Funktionen und Datenstrukturen, welche die Effizienz funktionaler Pro-

gramme deutlich steigern kann. Die pr¨

asentierten Ergebnisse experimenteller Messun-

gen eines prototypischen Ausf¨

uhrungssystems belegen, dass funktionale Programme

effizient dynamisch kompiliert werden k¨

onnen.

Summary

This thesis is about dynamic translation and optimization of functional programs.

The goal of the optimization is increased run-time efficiency, which is obtained by

compiler-directed elimination of programming language abstractions.

Object-oriented programming languages have been implemented for several decades

using run-time compilation techniques. With the introduction of the Java program-

ming language and its virtual machine-based execution model, the practicability of

this implementation method for real-world applications has been proved. Many as-

pects of modern programming languages, such as dynamic loading and linking of

code (even across networks), reflection and security solutions (e.g., sandboxing) can

be realized efficiently only by using dynamic transformation techniques.

The goal of this work is to show that functional programming languages can be effi-

ciently implemented in a similar way, and that these languages even offer advantages

when compared to more common object-oriented languages. Efficiency, security and

correctness of programs is easier to ensure in the functional setting.

Towards this goal, we design and develop implementation techniques to enable dy-

namic compilation and optimization of functional programming languages: we de-

scribe an intermediate representation for functional programs (typed dynamic con-

tinuation-passing style), which is well suited for dynamic compilation. Based on this

representation, we have developed an extension for incremental and selective code

generation. The main contribution of this work shows how dynamic specialization

of polymorphic functions and data structures can increase the run-time efficiency of

functional programs considerably. We present the results of experimental measure-

ments for a prototypical implementation, which prove that functional programs can

efficiently be dynamically compiled.

Contents

1 Introduction 11

1.1 Motivation................................. 11

1.1.1 Problem.............................. 12

1.1.2 Thesis ............................... 13

1.2 Contributions ............................... 13

1.3 Outline of this Dissertation . . . . . . . . . . . . . . . . . . . . . . . 14

2 Background 15

2.1 Functional Programming . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.2 DynamicCompilation .......................... 16

2.2.1 VirtualMachines ......................... 17

2.2.2 History of Dynamic Compilation . . . . . . . . . . . . . . . . 19

2.2.3 Dynamic Compilation Techniques . . . . . . . . . . . . . . . . 20

2.2.4 Dynamic Compilation and Functional Programming . . . . . . 22

2.2.5 Dynamic Compilation in Other Languages . . . . . . . . . . . 24

2.2.6 Formal Treatment of Dynamic Optimization . . . . . . . . . . 26

2.2.7 Applications............................ 27

2.3 Analysis and Transformation . . . . . . . . . . . . . . . . . . . . . . . 28

2.4 TypedCompilation............................ 30

2.4.1 Use of Type Systems . . . . . . . . . . . . . . . . . . . . . . . 30

2.4.2 Continuation-passing Style . . . . . . . . . . . . . . . . . . . . 30

2.4.3 Closure Conversion . . . . . . . . . . . . . . . . . . . . . . . . 31

2.4.4 Data Representation . . . . . . . . . . . . . . . . . . . . . . . 32

3 Dynamic Compilation of Functional Programs 35

3.1 Architecture................................ 35

3.2 CompilationProcess ........................... 36

4 Typed Dynamic Continuation-passing Style 39

4.1 Notation.................................. 40

4.2 RunningExample............................. 40

4.3 SourceLanguage ............................. 42

4.3.1 Syntax............................... 42

4.3.2 Overloaded Numeric Literals and Primitive Operators . . . . . 45

8 CONTENTS

4.3.3 Static Semantics . . . . . . . . . . . . . . . . . . . . . . . . . 45

4.3.4 Dynamic Semantics . . . . . . . . . . . . . . . . . . . . . . . . 48

4.4 Continuation Language . . . . . . . . . . . . . . . . . . . . . . . . . . 48

4.4.1 Abbreviations........................... 49

4.4.2 Static Semantics . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.4.3 Dynamic Semantics . . . . . . . . . . . . . . . . . . . . . . . . 53

4.5 CPSTransformation ........................... 53

4.6 ClosureConversion............................ 59

4.6.1 Type checking Closure-converted Terms . . . . . . . . . . . . 59

4.6.2 Closure Conversion Algorithm . . . . . . . . . . . . . . . . . . 61

4.6.3 Notes on Closure Conversion . . . . . . . . . . . . . . . . . . . 62

4.7 Generation of Machine Code . . . . . . . . . . . . . . . . . . . . . . . 65

5 Incremental Compilation 69

5.1 LanguageExtension ........................... 69

5.2 Code Generation for Delay Expressions . . . . . . . . . . . . . . . . . 70

5.3 Placing of Delay Expressions . . . . . . . . . . . . . . . . . . . . . . . 72

5.4 Discussion................................. 73

6 Run-time Monomorphization 75

6.1 Specialization of Polymorphic Functions . . . . . . . . . . . . . . . . 76

6.1.1 Polymorphic Functions . . . . . . . . . . . . . . . . . . . . . . 76

6.1.2 Polymorphic Recursion . . . . . . . . . . . . . . . . . . . . . . 81

6.1.3 TypeClasses ........................... 83

6.2 Data Type Specialization . . . . . . . . . . . . . . . . . . . . . . . . . 84

6.2.1 Type-directed Representation Selection . . . . . . . . . . . . . 85

6.2.2 Data Layout Algorithm . . . . . . . . . . . . . . . . . . . . . . 87

7 Implementation 89

7.1 Implementation Outline . . . . . . . . . . . . . . . . . . . . . . . . . 89

7.1.1 FrontEnd............................. 89

7.1.2 Conversion and Optimization . . . . . . . . . . . . . . . . . . 90

7.1.3 Code Generation . . . . . . . . . . . . . . . . . . . . . . . . . 90

7.1.4 Run-timeSystem......................... 90

7.2 Optimizations............................... 91

8 Experimental Results 95

8.1 TestMethodology............................. 95

8.2 BenchmarkPrograms........................... 96

8.3 Effect of Implemented Optimizations . . . . . . . . . . . . . . . . . . 98

8.4 Comparison to other Implementations . . . . . . . . . . . . . . . . . . 102

8.5 The Pseudoknot Benchmark . . . . . . . . . . . . . . . . . . . . . . . 107

8.6 Effects of Incremental Compilation . . . . . . . . . . . . . . . . . . . 112

8.7 Discussion................................. 114

CONTENTS 9

9 Conclusions and Future Work 115

9.1 Conclusions ................................ 115

9.2 FutureWork................................ 116

9.2.1 TheoreticModel ......................... 117

9.2.2 Open Data Types and Open Functions . . . . . . . . . . . . . 117

9.2.3 TypeClasses ........................... 118

9.2.4 BetterCompiler.......................... 119

9.2.5 Instrumentation and Recompilation . . . . . . . . . . . . . . . 120

9.2.6 Value Specialization . . . . . . . . . . . . . . . . . . . . . . . 121

10 CONTENTS

Chapter 1

Introduction

1.1 Motivation

Modern applications and software systems are no longer monolithic pieces of software.

Most systems are designed as a combination of a static core component, and a set of

extensions. The core implements the basic functionality and extension infrastructure

and the extensions provide additional features. Together, the core and the extensions

solve the problems for which the complete system was designed. Examples can be

found in operating systems, where file systems, device drivers and user interfaces are

built on top of an operating system kernel which manages basic hardware devices

like the processor(s) or memory. Text processing systems come with basic editing

functionality and can be extended with formula editors, bibliographical databases or

multiple output and export formats. Web browsers use plug-ins in order to present

multimedia and interactive contents.

There are several advantages in using such an open, extensible system design. The

systems are more configurable, because it is easy to integrate customized components

into an otherwise off-the-shelf system. As only components needed for a particular

task need to be loaded into the system, resource consumption is reduced. Dynamic

loading of portable program representations such as bytecode opens the possibility of

distributed applications without complicated setup procedures on each node of a net-

work. For long-running systems (such as telecommunication switches or databases),

dynamic software updating offers system maintenance without expensive down-times.

The disadvantages of dynamic and open systems lie mainly in their vulnerability (can

components from the network be trusted?) and the lower performance due to the

necessary abstractions, such as encapsulation and the need to conform to standard

interfaces when composing modules. The work presented here is concerned with

the reduction of these overheads. Security issues are not a topic of this thesis and

are not discussed any further except that our work is in the setting of strongly typed

functional languages, which offer some security advantages independently of dynamic

compilation.

Object-oriented languages such as Java or C# do support the characteristic features

of dynamic languages: dynamic loading and linking, portable representations of pro-

12 Chapter 1. Introduction

grams and load-time verification for security. The efficiency problem is solved by

run-time compilation (also called just-in-time or JIT compilation). Thus run-time

compilation seems to be a useful technique for realizing dynamic and flexible systems.

In the world of functional languages, only few attempts have been made to create

implementations which are as dynamic and open as their object-oriented counterparts

(see Section 2.2.4 for references). Dynamic compilation has been used, for example,

for saving memory, but the potential of dynamic optimization in functional languages

has not been systematically explored. We plan to go further in that direction.

In the approach we present in this thesis, dynamic compilation is a central implemen-

tation characteristic, and all language features we wish to integrate can rely on its

presence. All aspects of dynamic systems as outlined above are in principle supported

by it.

We advocate an implementation style for functional languages which is heavily influ-

enced by object-oriented implementations. Programs are stored in a portable format.

This format can be an efficient encoding of annotated abstract syntax trees where

security is enforced by type-checking any intermediate code that is loaded into the

system. Programs are therefore easily transferred over networks. The combination

of program parts from different sources is handled by the fast dynamic compiler, as

well as any necessary specialization due to parametric polymorphism or overloading.

Data structures are optimized for efficient memory layout and access, depending on

the cache architecture and instruction set of the executing machine.

Functional languages offer an advantage over established object-oriented languages

because their powerful type systems can be used to ensure that they do not perform

any malicious operations.

Towards our goal, we present in this thesis several techniques for increasing the ef-

ficiency of functional programming languages. The first one is used for eliminating

language abstractions at run-time by applying an aggressive specialization strategy,

which eliminates most of the costs normally associated with these abstractions. Sec-

ondly, incremental compilation reducesthe cost of dynamic compilation by limiting

it to active program parts and spreading it over the life time of the program. Ad-

ditionally, we investigate how established optimizations for functional programming

languages have to be adapted to fit with our specialization approach.

Future work may build on these foundations in order to implement a complete dy-

namic execution environment for functional programs.

1.1.1 Problem

The technical possibilities and problems of implementations of dynamic specialization

methods have been investigated for several functional programming languages (see

Chapter 2 for the relevant literature). Nevertheless, several questions remain open:

−The literature mentions dynamic specialization as a solution to implementa-

tion problems, but it has not been widely used, mostly because it was deemed

impractical. The technological advances over the last decade let us view this

1.2 Contributions 13

problem in a new light, especially since many dynamic compilation problems

have already been solved for object-oriented languages.

−Since there has not been much work on dynamic compilation of functional

programs in general, it is not clear which implementation techniques provide a

good balance between increased execution efficiency and compilation overhead.

−The techniques which have been shown to be useful for static compilation of

functional programs need to be reviewed for their suitability for dynamic com-

pilation.

−Techniques developed for the dynamic compilation of other language paradigms

(imperative, object-oriented) should be examined whether they fit the func-

tional paradigm.

−Implementation techniques have always been influenced by language features

provided by the implemented languages (e.g., laziness, higher-order functions).

We also want to investigate whether programming language design should be

influenced by the implementation possibilities provided by dynamic compila-

tion.

In all of these areas, we provide answers, backed up by experimental evidence.

1.1.2 Thesis

The main thesis of this work is the following:

Functional languages can be efficiently implemented in a dynamically

compiling system.

We lay the foundations for this dynamic approach by describing several techniques,

on which dynamically optimizing implementations of functional languages can be

based. Experiments suggest that our proposed implementation strategy is feasible in

practice.

1.2 Contributions

−The main contribution of this work is the presentation of dynamic optimiza-

tion techniques for functional languages, namely incremental compilation,run-

time specialization of polymorphic functions and data types and typed dynamic

continuation-passing style as a technique for implementing specialization as

well as traditional optimizations. Specialization of polymorphic functions is

the most effective technique developed here.

−Additionally, the ideas developed in the theoretical part of this thesis have been

implemented in a prototypical system for run-time compilation of functional

languages.

−We have measured a performance advantage due to dynamic specialization when

compared with other functional-language implementations.

14 Chapter 1. Introduction

1.3 Outline of this Dissertation

The goal of this dissertation is to present the design and implementation of several

techniques for dynamic compilation of functional programs. We have structured the

thesis as follows: First, Chapter 2 gives the background about functional languages

and their compilation, dynamic (or run-time) compilation, virtual machines, vari-

ous compilation techniques such as analysis and transformation, continuation-passing

style and type-directed compilation. Chapter 3 gives an overview of the design and

implementation of the dynamically optimizing system we have implemented as a

prototype. The following three chapters go into detail with respect to the new imple-

mentation techniques: Chapter 4 presents our approach to dynamic compilation and

optimization of functional programs. We present our intermediate language design

and the transformation into this language. The use of type information in this pro-

gram representation and the transformation algorithms is important for the further

development. Chapter 5 presents our approach to incremental dynamic compilation,

which is based on a small extension to our intermediate representation. Chapter 6

describes the answer to the most important question of this thesis: how can the

specialization of polymorphic functions and polymorphic data types be performed

efficiently at run-time. Chapter 7 describes our prototype system in detail and shows

how classic optimizations have to be adapted for a dynamic system. Chapter 8 gives

experimental results of our prototype implementation of dynamic compilation and

compares it to other language implementations. Chapter 9 summarizes the work,

highlights the main points and describes topics for future work in the area of dynam-

ically compiling functional language implementations.

Chapter 2

Background

In this chapter, we review the basic concepts of dynamic compilation and optimiza-

tion. We first describe the concepts of functional programming and what dynamic

compilation is, give an introduction to virtual machines and give references to related

work in this area. We will investigate the application of dynamic compilation tech-

niques to the implementation of programming languages in general and functional

languages in particular.

2.1 Functional Programming

The roots of functional programming lie in the 1950s in the development of the

programming language Lisp (McCarthy, 1960), which already shared many aspects

with today’s functional languages: Lisp promotes an expression-oriented style of

programming in contrast to the statement-oriented style in imperative languages,

in general, Lisp programs have a more mathematical flavor than other languages of

its era. Early Lisp implementations already included garbage collectors. Building

on these foundations, functional languages like FP (Backus, 1978), Hope (Burstall

et al., 1980), ML (Milner et al., 1997), Opal (Pepper, 2003), Clean (Brus et al., 1987),

Erlang (Armstrong et al., 1996), Gofer (Jones, 1994) and Haskell (Peyton Jones, 2003)

evolved, each developing different and overlapping language features.

The key advantages of functional programming languages are:

Strong typing Many functional languages are strongly typed and provide sophisti-

cated type systems with type inference, polymorphism, uniqueness typing, type

classes, parameterized modules, functors, signatures, etc. These make possible

the encoding of many program invariants in the source code, increasing the

reliability, robustness and maintainability of programs

Modularity Functional programs consist of small, maintainable and isolated func-

tions. Powerful abstraction facilities like functors, parameterized modules or

type classes increase modularity further.

16 Chapter 2. Background

Controlled side-effects Purely functional languages strictly separate effect-free ex-

pressions from side-effecting actions, enforcing a disciplined programming style.

Even impure languages, such as Lisp, Standard ML or Erlang encourage an

expression-oriented style by providing syntactic features for the convenient com-

position and application of functions.

Inherent parallelism Functional programs are in theory well suited for parallelism.

The early work in the 1980s on parallel graph reductions has not worked out,

but the recent rise of multi-core architectures has renewed interest in this area.

For a detailed introduction to functional programming, many books and tutorials on

functional programming are available (Bird and Wadler, 1988; Hudak et al., 1999;

Pepper, 2003).

2.2 Dynamic Compilation

A survey on the field of dynamic compilation (or just-in-time compilation, JIT) has

been written by Aycock (2003). Aycock’s article concentrates on dynamic compila-

tion in software for object-oriented languages. The book by Smith and Nair (2005)

on virtual machines also covers many aspects of dynamic compilation as an imple-

mentation technique of these machines, both in hardware and software.

The main characteristic of dynamic compilation is that some translation of a program

takes place while it is executing. The translation process requires some entity which

actually translates the source to the target program. This can be a component

of the executing program (for example, in the program’s run-time system), part

of the operating system or even some program implemented partly or completely in

hardware. Modern complex instruction set computers (CISC) translate machine code

into a reduced instruction set (RISC) prior to execution. These instructions, called

µ-ops in Intel terminology, are held in a special first-level instruction cache (trace

buffer).

The motivation of dynamic compilation is increased performance over interpreted sys-

tems, while providing the dynamic properties of such systems. Additionally, dynamic

compilation may even offer superior performance to statically compiled programs:

precompiled binaries often use the least common denominator of the instruction sets

of the targeted machines. Newer machines in a processor series often have new in-

structions with better performance, but programs are normally compiled for older

version in order to be compatible. Intel’s IA-32 Architecture Optimization Reference

Manual (Intel Corporation, 2005) even recommends to use the cpuid instruction to

determine the processor model and to select code optimized for this model. A dy-

namically compiling system can use this method without the need to deliver several

versions of an algorithm, each optimized for another processor. Other advantages

of dynamic compilation include reduction of memory requirements and energy (Wu

et al., 2005; Unnikrishnan et al., 2002).

2.2 Dynamic Compilation 17

Virtual Machine

Source Program

Run-time System

Tables

Compiler Code Buffer

Operating System

source

code

byte-

code

object

code

Preprocessing

Figure 2.1: Virtual machine

2.2.1 Virtual Machines

A virtual machine, also called an abstract machine, is an idealized model of a ma-

chine. It provides a machine architecture consisting of registers, stacks and memory,

and an instruction set which manipulates the machine state. On the one hand, vir-

tual machines are used to model language semantics and program evaluation, on the

other for the implementation of programming languages on real machines. In some

systems, a virtual machine only exists conceptually, for example in the form of an

intermediate language of a compiler, in others the virtual machine is a piece of soft-

ware or hardware, which is capable of executing programs written in its language,

thus emulating the virtual machine on real hardware. A prominent example of a

software virtual machine is the Java Virtual Machine (JVM) (Lindholm and Yellin,

1999). An example of a hardware virtual machine is the Transmeta Crusoe processor

(Klaiber, 2000). The Crusoe processor is a very long instruction word (VLIW) pro-

cessor capable of executing standard x86 binary code files by translating them into

its VLIW instruction set on the fly.

Virtual machines are used to abstract from certain details of real machines. Software

virtual machines provide a uniform interface to operating system services (indepen-

dent of the operating system the machine runs on) and several other services, such as

garbage collection or concurrent programming primitives. Hardware virtual machines

abstract from the actual hardware implementation by providing an instruction set of

another machine.

In the following, we will concentrate on dynamically compiling virtual machines,

which have the general structure shown in Figure 2.1.

Dynamically compiling virtual machines possess storage areas for an input program

18 Chapter 2. Background

as well as for the actual machine code. The input to such a machine may be source

code, some form of compiled portable bytecode or even actual machine code: either

for the same architecture (for binary optimizers) or some other architecture (for

binary translators). The central component is a dynamic compiler, which translates

fragments of the input program into fragments of the target program. The size of

the fragments varies from system to system, many operate on the size of individual

procedures, others on basic blocks or so-called extended basic blocks (instruction

sequences without in-going control-flow edges, but with one or more out-going edges).

The compiler needs various data structures to support its operation, and often some

dynamic profiling mechanism to keep track of the occurrence of certain events. The

collected information is stored in data structures (called tables), which are also part

of the system. Finally, a run-time system is needed which provides an interface to

the operating system and abstracts from the actual operating system and machine

characteristics. When a program is loaded into the virtual machine, often some form

of non-trivial preprocessing is necessary, such as initialization of data structures,

verification or type-checking of the loaded code. Many virtual machines perform

some kind of translation on the code, for example translation into threaded code

(Bell, 1973; Dewar, 1975).

In contrast to classic compiler design, the frontier between translation time and run

time is not clearly visible in dynamically compiling systems. A dynamically compiling

system behaves similar to an interpreted system, where the program to be executed

is loaded into a running system which initiates the execution of the loaded program

image. In an interpreter, the loaded program is executed one instruction at a time,

whereas a dynamically compiling system translates it into machine language before

executing it, so that it runs without any interpretative overhead directly on the

hardware. This gives a significant performance advantage when compared to an

interpretive system, but it is necessary to take the time required for the translation

into account. Adaptive systems only use expensive optimizing techniques for those

parts of the program which are heavily used (so-called hot spots). The goal is to

invest much translation time and space only into parts which use up the majority

of the program’s actual run time. In order to reach that goal, various heuristics are

used, for example the dynamic measurement of program behavior (profiling). The

program’s execution may be monitored to determine the functions invoked most, or

those taking most of the execution time.

Dynamic systems which allow run-time loading of program parts, possibly from un-

trusted sources, need ways to protect themselves from malicious code. Several tech-

niques have been developed for this task: bytecode verification (as used in Java)

performs type checks and some additional integrity checks on the intermediate code

when it is loaded. This technique requires an intermediate language which is rea-

sonably easy to check. In the typed assembly language (TAL) approach (Morrisett

et al., 1999), concrete machine code is generated and stored in the object code files

together with additional type information, which enables the loader to perform the

verification. Verification of bytecodes or typed assembly language programs requires

building proofs of program safety, whereas in the proof-carrying code approach (Nec-

2.2 Dynamic Compilation 19

ula, 1997), programs are annotated with proofs of their correctness, thereby only

requiring proof checking. Since proof checking is much easier than finding proofs,

this technique simplifies the run-time system considerably. The above safety ap-

proaches are orthogonal to the techniques described in this thesis and we will not

discuss them in the remainder of the thesis.

One aspect not found in statically compiling or pure interpretive systems is the man-

agement of translated code. Normally, only some limited amount of memory is avail-

able for storing compiled code fragments in the so-called code cache. When it fills

up, some compiled fragments must be removed from the cache so that its space can

be re-used. Hazelwood Cettei (2004) investigates several code cache management

strategies and measures their effects. Code cache management is also orthogonal to

the topics discussed in this thesis and is not treated in the remainder of the text.

2.2.2 History of Dynamic Compilation

Dynamic compilation has been used successfully in virtual machines for object-

oriented languages for more than 20 years now.

Smalltalk implementations have used dynamic compilation for the efficient execution

of dynamically typed object-oriented languages (Deutsch and Schiffman, 1984; Ogata

and Doi, 1994). Since in Smalltalk every operation involves a method invocation,

optimization of common operations is very important, and correct implementation

is only possible when it is possible to undo optimizations when the compiled sys-

tem is changed. Implementations of the object-oriented programming language Self

(Chambers and Ungar, 1989; H¨olzle, 1994) have continued that tradition and added

many very sophisticated dynamic optimizations, mostly necessary because Self is even

more dynamic than Smalltalk. Java also builds on the techniques developed for Self

(Paleczny et al., 2001) and has become a very successful language, mainly due to the

portability and efficiency which is made possible by using dynamic compilation tech-

niques. Franz (1994) describes another interesting implementation for the Oberon

language (Wirth and Gutknecht, 1998), which uses portable program representations

and load-time compilation.

Dynamic compilation techniques in general have been developed even earlier. Brown

(1976) describes a Basic system which uses so-called “throw-away compilation”, where

each Basic statement is compiled prior to execution and the compiled statements are

cached. When the memory buffer allocated for compiled code gets full, all compiled

code is thrown away, hence the name.

An important – though not widely known or cited – work in the field of dynamic

compilation is the Synthesis operating system kernel, developed by Pu et al. (1988).

All operating system services in Synthesis rely on dynamic code generation. For

example, when a file is opened and this file belongs to a FIFO pipe, special code for

writing to and for reading from pipes is generated, when the file is a disk file, other

device-specific code is generated. Massalin (1992) showed that the performance of

such as system can be an order of magnitude better than other operating systems,

thus enabling new features such as very fine-grained threading.

20 Chapter 2. Background

2.2.3 Dynamic Compilation Techniques

Dynamic compilation can be divided into program transformations performed by soft-

ware, as part of the run-time system of some language implementation, and transfor-

mations performed by hardware, as part of the microcode or support code of some

hardware system. The basic principles of both approaches are similar, but the former

normally employs more powerful transformations, targeted at program optimizations,

whereas the latter concentrates on translation from one instruction set to another,

focusing on binary program portability.

Both the software and hardware approaches constitute virtual machines, on which

the programs are executed, and both include some kind of translation and mapping of

resources from the underlying system to the ones expected by the running programs.

Smith and Nair (2005) describe in their book both virtual machines in software

(called high-level virtual machines) and in hardware (low-level virtual machines).

High-level virtual machines have the advantage that they can be ported more easily

to very different underlying hardware architectures, but low-level virtual machines

offer better performance, because they are implemented directly in hardware. A

survey of adaptive optimization in virtual machines has been written by Arnold et al.

(2005).

We will first discuss the various software systems because they directly relate to the

work presented in this thesis, and then present hardware solutions for completing the

discussion.

Dynamic Compilation in Software

One field in which dynamic code generation has been used since at least the 1960s

is executable data structures. Instead of interpreting a data structure in order to

control the steps of an algorithm, the data structure is translated to code which

performs these steps. This has been done by Thompson (1968) for efficient string

pattern matching with regular expressions. Another example is the translation of

binary search in sorted arrays into nested conditional statements.

Brown (1976) introduced the name throw-away-compiling. He used it to describe his

compilation technique, which allowed both compiled and interpreted code to coexist in

a single program. This mixed code approach (Dakin and Poole, 1973; Dawson, 1973)

was primarily intended to save memory by compiling only those parts of the programs

which are most frequently used. In this early system, the programmer was responsible

for annotating which parts of a program are to be compiled or interpreted. The

compiled parts could be either pre-compiled before the program is run, or compiled

at run-time, where one statement is compiled at a time and placed into a special

memory region. When this region is filled, all compiled code is removed at once

(thrown away), but Brown also suggests that more intelligent decisions should be

made about which code is deleted and which is kept.

Software virtual machines operate on very different input languages. On the one

extreme, in the case of binary translators, the input language is a machine language,

so that the virtual machine has to do the same decoding as a processor. At the other

2.2 Dynamic Compilation 21

end of the language spectrum, the machine translates a source language directly, as

in Brown’s approach mentioned above.

Binary translators work directly on machine language binaries. These are used espe-

cially for porting applications compiled for older architectures to new processor in-

struction sets. The Aries system (Zheng and Thompson, 2000), developed at Hewlett-

Packard, translates PA-RISC code to IA-64 code and is used to execute native PA-

RISC applications on IA-64 processors. Digital FX!32 (Chernoff and Hookway, 1997)

is a software emulator for running IA-32 applications on the Alpha processor. The

IA-32 Execution Layer (Baraz et al., 2003) is a software dynamic translator for run-

ning IA-32 software on Itanium-based (IA-64) systems.

Binary optimizers, where the input language is the same as the output language, are

mostly used for the optimization of already compiled applications. The University of

Queensland Binary Translation Project (UQBT) (Ung and Cifuentes, 2000, 2006) has

developed a machine adaptable binary dynamic translation system. Other successful

dynamic optimization systems are Dynamo (Bala et al., 2000) and its successor Dy-

namoRIO (Bruening et al., 2003; Bruening, 2004), developed in cooperation between

Hewlett-Packard Laboratories and the MIT Laboratory for Computer Science.

Dynamic Compilation in Hardware

Dynamic translation has been used in several hardware designs, mainly to maintain

binary compatibility and to enable the use of new and more efficient hardware imple-

mentations. The idea is that the binary machine code, which has been compiled for

some specific instruction set architecture (ISA) is transparently translated to another

ISA prior to execution. Programs are held in memory in the old binary format and

the translated code is stored in some fast cache memory, so that it can be reused when

the code is executed several times. The granularity of translation is normally in size

of some hardware-dependent measure, for example a memory page, in contrast to

software-based dynamic compilation systems which generally translate one function,

method or module at a time.

The DAISY system (Ebcioglu and Altman, 1996), implemented at IBM, translates

machine code from other architectures to a VLIW architecture. The design supports

different source languages, but the described implementation translates PowerPC

code. Translation is performed by a Virtual Machine Monitor, which is a software

that is stored in read-only memory.

The Transmeta Crusoe processors (Klaiber, 2000) use a dynamic compilation tech-

nique called code morphing, which translates IA-32 code to the machine language of

a VLIW processor. The code morphing software is responsible for translating source

instructions into target instructions (called atoms) and packing them as effectively

as possible into long instruction words (called molecules). Atoms in one molecule are

executed in parallel by different functional units of the processor.

The advantage of these automatic translation systems is that the processor imple-

mentor is free to change the underlying VLIW processor, as long as he adapts the

translator software accordingly.

22 Chapter 2. Background

Hardware dynamic compilation has not yet been used in the particular context of

functional programming, but of course the implementation of parallel graph reducers,

which have been used for functional programs could be combined with dynamic code

generation techniques.

Code Generation Techniques

Dynamic code generation techniques fall mainly into two groups: template-based

techniqes and rule-based techniques.

In template-based approaches, code templates are created prior to execution. These

templates contain holes into which values are inserted when on code generation time

different templates are combined to construct the final machine code. This approach

has been used, for example in DyC (Grant et al., 2000).

Rule-based techniques are closer to common code generation in static compilers.

Starting with an abstract syntax tree of the program to be compiled, rules are ap-

plied in order to tile the syntax tree with instructions. These rules are either specified

in some special declarative language for a code-generator generator (e.g. a bottom-

up rewrite system (BURS)-based system, e.g. (Fraser et al., 1992)) or they are pro-

grammed ad-hoc in the compiler.

2.2.4 Dynamic Compilation and Functional Programming

There are fewer uses of dynamic compilation for functional languages than for object-

oriented ones.

Wakeling (1998a) used dynamic compilation for the lazy functional language Haskell

(Peyton Jones, 2003) in order to reduce the memory requirements of compiled code.

In his system, Haskell source code is compiled to a compact intermediate format prior

to execution and a dynamic compiler translates this code to machine code while the

program is running. When the storage reserved for compiled code is exhausted, all

compiled code is discarded and required code is re-generated (similar to Brown’s

throw-away compilation).

A remark by Spinellis (1990) hints at the possibility to use dynamic compilation for

executing Haskell programs, but does not go into detail. Dockins and Guyer (2007)

describe how to perform verification on the bytecode output of their Haskell compiler

by adding type information to the bytecode. The Alice ML dialect is implemented

on a virtual machine which also supports just-in-time compilation and distributed

applications (Rossberg et al., 2006).

Burger (1997) has developed an infrastructure for profile-driven dynamic recompila-

tion of the Scheme programming language (Burger and Dybvig, 1998). This work

differs from the other work on functional programming languages because Scheme is

not purely functional.

Dynamic loading and linking in functional languages has been considered for imple-

menting functional shells (McDonald, 1983; van Weelden and Plasmeijer, 2004), and

more general for dynamic applications which contain a small static core and load

2.2 Dynamic Compilation 23

all other code dynamically (Stewart and Chakravarty, 2005). One major advantage

of using functional languages for such systems is their modularity, especially in lan-

guages which do not allow uncontrolled use of side effects. Rossberg (2006) describes

a dynamic, type-safe linking approach for Alice ML.

One possible route for implementing dynamic functional programs is by compiling

them for the Java Virtual Machine (JVM) (Wakeling, 1998b; Meehan and Joy, 1999).

Unfortunately, the JVM architecture poses severe difficulties on the implementation of

functional languages. Several features, which are key to the efficient implementation

of functional languages (such as efficient memory allocation or cheap tail calls) are

missing on this architecture (Schinz and Odersky, 2001; Shivers, 1996; Perry and

Meijer). Therefore, we will target compilation to real machine code in this thesis.

Among the programming language abstractions which are targeted by our special-

ization techniques are type classes, which have mainly been implemented by passing

additional parameters to overloaded functions (Hall et al., 1996; Augustsson, 1993),

but where also specializing implementations exist (Jones, 1995; Meacham, 2007). An-

other promising language extension are open data types and open functions (L¨oh and

Hinze, 2006), which allow the modular extension of algebraic data types and functions

which operate on such extensible types. A similar development, but presented for

object-oriented languages are extensible algebraic data types (Zenger and Odersky,

2001). Extensible programming with open case (Blume et al., 2006) solves similar

problems, but with the addition of a case statement which is a first-class language ele-

ment. The applicability of our techniques to these language features will be discussed

as possible future work in Section 9.2.2.

The Fabius system (Leone and Lee, 1998) supports specialization, which is controlled

through the use of curried functions. Earlier parameters to functions are considered

to be more static than later arguments. Leone and Lee give as an example matrix

multiplication using a curried function which takes two matrices. When applied to

the first matrix, specialized code is created which is optimized for this input matrix.

When the first matrix is sparse, execution speed is improved (Leone and Lee, 1994,

1996).

MetaML (Taha and Sheard, 1991; Taha, 1999) is a multi-stage programming language

with explicit stage annotations. MetaML programs can explicitly construct programs

and run them at a later time using language constructs for delaying and running

computations. Currently, MetaML does not have a native implementation which

could use its potential in a dynamic setting.

Dynamic Caml (Lomov and Moscal, 2002) is a library for Objective Caml that imple-

ments dynamic code generation for programs written in Objective Caml. The library

provides a high-level interface which represents dynamically generated code in a form

so that the run-time code generation process can be statically type-checked. It does

not produce real machine code, but interprets the dynamically created Caml code.

Erlang (Armstrong et al., 1996) is a concurrent, dynamically typed, mostly functional

programming language which was designed for programming telecommunication de-

vices. One of its distinguishing features is support for code replacement, which means

that it is possible to replace code modules without stopping the program. In order to

24 Chapter 2. Background

both support code replacement and efficient execution, Erlang distinguishes calls to

fully qualified functions (including a module and a function name) and unqualified

functions (including only a function name). The former are dynamically looked up

whereas references to the latter are resolved at compile time. Dynamically looked

up calls always call the newest version of a compiled module, whereas statically

resolved calls are never changed. It is therefore a recommendation to program long-

running loops as tail calls to fully qualified modules. This is not enforced by the

language definition or the compiler. For Erlang, a native code compiler called HiPE

(High-performance Erlang) (Luna et al., 2005) exists, which also supports just-in-time

compilation.

The Haskell package hs-plugins (Stewart, 2006) provides dynamic loading of Has-

kell modules. It also contains a compilation manager which can be used to compile

Haskell source code into loadable modules prior to dynamically loading them. Since

the compilation manager compiles the source code on a per-module basis, it is different

from other dynamic code-generation libraries.

Leijen (2003) presents the implementation of a foreign function interface which allows

Haskell functions to call functions written in other languages, and also calls from for-

eign functions back into Haskell code. The latter requires small amounts of run-time

code generation for implementing closures which can be called like normal functions

from foreign languages (call-back functions). The Glasgow Haskell compiler (GHC

Developers, 2008) implements its foreign function interface in the same way. In both

implementations, the use of dynamic code generation is hidden from the user.

Dub´e (2002) presents an interesting framework for demand-driven type analysis. This

is a version of type analysis for dynamically-typed functional languages, where type

information is inferred during runtime based on estimates of how important it is at

various program points. The idea is to avoid wasting time with costly analysis on

program parts which are rarely or never used.

In all these previous works, several aspects of dynamic compilation and compilation

techniques such as specialization and data type representations have been investi-

gated. What is missing, is the integration of these ideas in a single framework. The

goal of this thesis is to provide the foundations of such a framework.

2.2.5 Dynamic Compilation in Other Languages

Object-oriented languages were the first popular and widely used languages which

supported dynamic features such as dynamic method dispatch that incur perfor-

mance overheads when compared to languages such as Fortran, Pascal or C. Even

though some of the older imperative languages supported a certain degree of dynamic

procedure calls (through function pointers), most of the code was quite static. Newer

object-oriented languages made the use of dynamic methods natural, so means were

sought to improve their performance.

Smalltalk (Goldberg and Robson, 1983) is a class-based object-oriented programming

language which uses an intermediate code. This code is generated from the source

code by a compiler and in some implementations (Deutsch and Schiffman, 1984) it

2.2 Dynamic Compilation 25

is translated further to machine code immediately before execution, eliminating the

interpretation overhead of interpreter implementations. MultithreadSmalltalk (Ogata

and Doi, 1994) also uses dynamic compilation.

Self is a dynamically typed object-oriented, prototype-based programming language

(Ungar and Smith, 1991). The implementation (Chambers and Ungar, 1989) compiles

Self source code to bytecodes which are translated to native machine code on demand

and on a per-method basis. The machine code is held in a cache to be reused when a

method is invoked in the future, and entries from the cache are discarded when the

cache fills up.

The Java programming language (Gosling et al., 2000) was designed to support the

distribution of compiled code and therefore includes the specification of a portable

bytecode format, which is interpreted by the JVM (Lindholm and Yellin, 1999) in

order to be executed. Most versions of the JVM include an option to translate the

bytecode to native code, either on class loading (the dynamic loading method of

Java) or on demand, when a certain method is to be executed which has not yet been

compiled to machine code.

Various dynamically compiling Java virtual machines have been implemented: see for

example Adl-Tabatabai et al. (1998), Adl-Tabatabai et al. (2003), Burke et al. (1999)

and Agesen (1997). These implementations support a wide variety of optimizations,

both conventional optimizations known from static compilation, and new ones which

rely on the presence of a dynamic optimizer and/or instrumentation for profiling.

Detlefs and Agesen (1999) describe inlining of virtual methods

Whaley (1999) has implemented an automatic runtime specialization algorithm for

the use in Java virtual machines, which adapts Java programs to the actual run-time

environment. Whaley’s work is concerned with specialization of methods: either by

inlining the most common method calls and thus specializing the method body to

the context in which it was called, or by creating specialized versions of methods for

commonly occurring parameters. Calls to these methods are then guarded by code

which checks the preconditions, and if they fail, the more general method is called.

The .NET CLI (ECMA International, 2005) (Common Language Interface) specifies

not only a bytecode format like Java (along with a human-readable notation) for .NET

programs, but also explicitly specifies that bytecode modules are to be translated to

machine code before execution. There is no information available on Microsoft’s

implementation of .NET, but several free implementations are available (DotGNU

Project, 2004; Mono Project, 2007). They seem to perform roughly the same set of

optimizations as the Java Virtual Machines mentioned above.

DyC (Grant et al., 2000) is a dynamic compilation system for the programming

language C. DyC requires that the programmer annotates the source code so that the

dynamic optimizer can efficiently specialize the source code for given fixed arguments.

The Calpa system (Mock, 2002) uses value profiling to automatically generate these

annotations, resulting in a fully automatic system.

Dynamo1(Leone and Dybvig, 1997) is a compiler architecture for dynamic program

1Not to be confused with the Dynamo system of Bala et al. (2000), described in Section 2.2.3.

26 Chapter 2. Background

optimization, based on the notion of staged compilation. Staged compilation divides

the compilation process into several stages: the first stage corresponds basically to

static compilation and performs classic program analysis and transformation tech-

niques. Additionally, it collects information used in the second stage during run-time

optimization. This setup allows to reduce the time and space requirements of all

transformations happening at run-time while enabling beneficial optimization.

2.2.6 Formal Treatment of Dynamic Optimization

Even though the focus of this thesis is on practical matters, we give a short survey

on the formal work on dynamic optimization.

There are two directions of research in this area. The first is mainly theoretic and

builds on very abstract models of what computation means. The second aims at

formalizing aspects of real dynamic compilation and optimization systems. While

the former is very ambitious, it is very far away from practice, and the latter lacks

theoretical sophistication, but is much more closer to what is implementable today.

In the first category, dynamic compilation has a close relationship to partial eval-

uation (Consel and Danvy, 1993), which dates back to the 1970s at least (see, for

example, Ershov (1977), who called it a “partial computation principle”). Partial

evaluation aims at separating static from dynamic evaluations, in order to perform

static evaluation as early as possible, hopefully at compile time. Partial evaluation

employs binding-time analysis to determine when all operands to an operation are

available. The latest operand determines when evaluation can take place. Up to

now, partial evaluation was used in static compilers, but its use in dynamic systems

gives the possibility to perform certain evaluations at load-time or even later during

run-time, when some execution parameters are fixed (for example, the machine archi-

tecture or the instruction set). Sperber and Thiemann (1997) use partial evaluation

to improve the output of a Scheme compiler.

Balat and Danvy (1998) have used run-time code generation to produce bytecode

for the Ocaml virtual machine. Their goal was to perform type-directed partial

evaluation of ML programs.

The author has been developing a framework for generic dynamic optimization, which

captures aspects such as profiling and incremental, adaptive dynamic compilation and

optimization (Grabm¨uller, 2007). The work presented in this thesis is not directly

connected to the earlier paper, and a connection of the theoretical and practical work

is a topic for future work.

Modal-ML (Wickline et al., 1998) is a variant of the ML language which supports

explicit code generation during run-time, allowing programmers to specify the stages

of computation in a program. The different stages are reflected in the types assigned

to program expressions.

In the more practical direction, formal treatment mostly involved the development

of heuristics and cost models for tuning dynamic optimization.

Arnold et al. (2004) propose a model-driven policy for detecting recompilation op-

portunities which is used in the Jikes Research Virtual Machine (Burke et al., 1999;

2.2 Dynamic Compilation 27

Suganuma et al., 2000). They model both expected recompilation costs and the

expected benefits of running optimized code, basing their heuristics on the compile

times and profile data collected up to that point in execution. This seems to be the

only published attempt to capture aspects of dynamic optimization systems formally.

Not directly related to dynamic compilation techniques, but possibly applicable to it

are efforts to model the run-time behavior (such as time or space requirements) of

programs. Using so-called cost semantics, Sands (1990) investigated the behavior of

first-order and higher-order functional programs. Hope and Hutton (2005) derive step

counting functions from function definitions. The techniques used in both approaches

could be used to estimate the effects of optimizing functions.

In the TIL (typed intermediate language) project, type passing is used as an im-

plementation technique (Morrisett, 1995; Tarditi, 1996; Tarditi et al., 1996). This

implementation is based on the theory of intensional type analysis (Harper and Mor-

risett, 1995).

Gries and Gehani (1977) give an early discussion on type passing. They propose a

limited form of polymorphic procedure declarations, where types in the parameter

list can be marked so that they depend on the type at the call site. The body of a

procedure can use the name of such a type, which essentially is a type parameter.

2.2.7 Applications

Modern programming language features such as reflection and aspect-oriented pro-

gramming make high demands on the techniques used for implementation. The dy-

namic nature of such language features requires dynamic compilation in order to

avoid excessive slowdowns. Ogel et al. (2005) present a system which dynamically

weaves and compiles code in order to avoid the overheads of static monolithic aspect

weaving.

Security is another important topic. When using networked machines, often con-

nected over the Internet, it is important to make sure that programs received over

the network have not been compromised. Several approaches, such as proof-carrying

code or typed assembly language aim at supporting efficient verification of code be-

fore it is executed. Hornof and Jim (1999) examined how program certification and

run-time code generation could be combined.

An interesting application of dynamic loading is dynamic software updating (Hicks,

2001; Stoyle et al., 2005; Bierman et al., 2003), which deals with the modification of

running software systems. Dynamic software updating allows parts of a program to

be replaced by corrected or extended code. Besides the modification and addition

of code this involves the conversion of data in the running system when data type

definitions are changed. This feature is crucial in systems which cannot tolerate any

downtime. Hicks (2001) explores this application and describes how he developed and

maintained a web server. He updated the web server’s code several times without

shutting it down. He did not use dynamic compilation techniques but worked with

precompiled binaries instead.

Another application of dynamic compilation techniques is reduced power consump-

28 Chapter 2. Background

tion in embedded systems. Wu et al. (2005) use dynamic compilation for managing

dynamic voltage and frequency scaling of the CPU, which results in significant re-

duction in power while maintaining high performance. Unnikrishnan et al. (2002)

present a dynamic recompilation and linking framework which optimizes the energy

behavior of a given application at run time.

A recurring pattern in all these applications is that they could be implemented using

simple interpretation. The only reason for using dynamic compilation is to improve

performance, even more so for dynamic optimization.

2.3 Analysis and Transformation

The process of compilation always consists of analysis and transformation. Analysis

in this context is the calculation of program properties which are implicit in the

program examined. Transformation is some meaning-preserving modification of the

program which normally has the goal to improve it. Transformation normally makes

use of the information gathered in the analysis phase in order to determine whether

certain transformations are valid and will probably increase the program’s efficiency.

Some transformations are strictly local and can be based on the syntax of certain

programming constructs, others require non-local or even global knowledge about

the program. In the context of dynamic compilation, it is important to use very

fast analysis and transformation algorithms, because their run time contributes to

the total run time of the program under translation. This requires that sometimes

algorithms must be used which are sub-optimal when compared to static compilers.

Examples include linear register allocation (Poletto and Sarkar, 1999) in contrast to

graph-coloring based algorithms (Chaitin, 1982) or first-order control flow analysis

where a higher-order analysis could be used in static systems.

In summary, we can assume that static compilation has more resources at its disposal

and is thus able to employ more complicated analysis and to use more aggressive

transformations. On the other hand, a static scheme has less knowledge about the

actual use of the program and must therefore either heuristically guess, or make

limiting assumptions, such as closed-world compilation, where the complete source

code of a program is available for analysis.

Dynamic compilers operate in more restricted environments, both time- and space-

wise, but can make use of actual input values of programs and the actual machine

configuration on which it runs.

We can see that dynamic compilation offers advantages in the following cases:

Selective elimination of abstractions Dynamic analyses and transformations do

not need to take the complete source program into account. Compared to static

specializers, dynamic specialization or cloning does not need the same amount

of code space, because it can work selectively.

Exploiting dynamic information Since the dynamic compiler has access to all

information dynamically available, such as input data or machine configuration,

2.3 Analysis and Transformation 29

it can make better decisions when optimizing code.

Enabling dynamic language constructs Run-time compilation makes the whole

system more dynamic, because it eases programming of distributed systems

sharing programs by dynamic loading.

Among the disadvantages of the dynamic approach are the management overhead,

resulting from code buffer management, compiler data structure management, and

profiling. Additionally, it may be necessary to maintain information for selectively

“undoing” certain optimization when assumptions on which these optimizations are

based are invalidated – for example by loading new or updated code into the program.

The main difficulty in implementing dynamic compilers is in the tradeoff between

fast execution times of the translated program and compilation time. In this thesis

we aim at providing some particular effective and efficient optimization techniques.

Program Analysis

The analysis of programs has a long history, dating back to the invention of the first

higher-level programming languages in the 1950’s and 1960’s. Even the first commer-

cial Fortran compiler performed a limited program analysis in order to optimize the

compilation of indexing instructions in loops (Backus et al., 1957).

Program analysis consists of static (mostly compile-time) and dynamic analysis. The

goal of static program analysis is to determine an approximation of the possible be-

havior of a program, without actually running it. Dynamic analysis tries to achieve

more precise information by actually running a program for typical input data, gath-

ering information for tailoring the compilation results to these data (profiling and

feedback-directed compilation).

A detailed introduction to the field of program analysis is given by Nielson et al.

(1999). Two important instances of program analysis are data-flow analysis and

control-flow analysis. The first aims at determining the set of values which can

possibly be held by the program variables during any program run, whereas the

second tries to find out which functions, procedures or methods might be called

in any call expression in the program. The two analyses are closely interrelated,

because for example, the value of a boolean variable may control which alternative of

a conditional expression is evaluated, thus resulting in different sets of functions to be

called. In higher-order languages, where function parameters and local variables may

be bound to functions and functions may be included in arbitrary data structures,

the set of values contained in a variable determines the set of called functions when

the variable is applied to an argument.

The advantage of static (or off-line) analysis is that it is allowed to spend more

resources on the analysis, thus allowing better but more expensive algorithms to

be used. On the other hand, the lack of actual input data restricts this kind of

analysis because it must be overly conservative by considering any possible program

run. Dynamic (or on-line) analysis is restricted to more efficient algorithms (which

30 Chapter 2. Background

normally have to be of linear time and space complexity), but has access to the input

and profiling data of real program runs.

A combination of static and dynamic analysis has several advantages: it is possible

to combine costly algorithms (run at compile time) with less costly, but also less

powerful algorithms (run at run-time). The static analysis can remove overly general

constructs from a program which are provably not needed for the specific application

at hand, and the dynamic analysis can refine the compile-time optimizations by

tailoring the program to the actual input data.

2.4 Typed Compilation

Typed compilation is an approach to programming language implementation which

emphasizes the importance of strong typing at all levels of abstraction during the

compilation process. The goal is to maintain accurate type information during all

compilation phases. Types are transformed together with expressions when trans-

lating from one level of intermediate language to the next one. Finally, during code

generation, type information can be used to support register allocation, data struc-

ture optimizations and tagless garbage collection (Tarditi et al., 1996; Tarditi, 1996;

Morrisett, 1995) or type-based operations (Crary et al., 1998), such as polymorphic

equality.

We will now summarize work in the area of typed compilation, and show the appli-

cations of types and type systems in compilers.

2.4.1 Use of Type Systems

Type systems have been used in a number of ways besides just checking the type-

correctness of input programs. Hannan (1995) developed a type system for performing

closure conversion. Hannan and Hicks (2000) give a type system for higher-order

uncurrying and for arity raising (flattening argument tuples) (Hannan and Hicks,

1998). All these system use the type system to identify places for applying the

respective transformation, where syntactic criteria do not suffice. The elimination of

useless variables – for example, variables which are passed around recursive functions

without contributing to the final result – using type systems has been investigated

by Kobayashi (2001) and Fischbach and Hannan (2001).

Agat (1997) uses types for the register-assignment phase of a compiler. By annotat-

ing function types with the registers where input and output values are stored, the

consistency of calling protocols can be ensured.

2.4.2 Continuation-passing Style

Many compilers for functional languages use some form of intermediate program

representation which is based on the λ-calculus. These representations are often

restricted versions, which require the programs to be in some normal form: A-

normal form (ANF) (Flanagan et al., 1993), monadic normal form (Boquist, 1999) or

2.4 Typed Compilation 31

continuation-passing style (CPS) (Appel, 1992). Kennedy compares these languages

for various properties (Kennedy, 2007).

Early work on CPS based program transformation can be found in Wand (1980).

In continuation-passing style, “every aspect of control flow and data flow [is made]

explicit” (Appel, 1992, p. 2). By transforming the input program into a form where

each function call is a tail-call, and each function is passed a function which performs

the “rest of the computation” (the continuation function), the resulting program has

made order of evaluation explicit and each intermediate result is named. Also, all

call and return points are made explicit – calls to user-defined functions represent

calls and calls to continuations represent returns. Most importantly, certain program

transformations such as full β-reduction (function inlining), which are not generally

sound in a call-by-value setting, are sound in CPS (Kennedy, 2007).

Our work is mainly inspired by the work of Appel (1992) and Kennedy (2007). Ap-

pel’s algorithm is untyped, whereas Kennedy supports types, but does not handle

polymorphic types.

Harper and Lillibridge (1993) describe a typed CPS conversion which supports poly-

morphic types.

Flanagan et al. (1993) show that CPS and A-normal form (ANF) are equivalent. In

our view (and others, see Kennedy (2007)), CPS is better suited to optimization from

a practical point of view, though.

Reppy (2001) uses CPS conversion in a limited way for optimizing loops in an oth-

erwise direct-style compiler. He shows that one must not necessarily commit to one

style or the other in one compiler.

Kelsey (1993) notes that CPS and static single-assignment (SSA) form as used in

most modern optimizing compilers for imperative and object-oriented compilers are

closely related and can be converted from one to the other quite easily. Appel (1998)

has also investigated this topic.

2.4.3 Closure Conversion

General-purpose computers do not natively support higher-order functions, which

access free variables (except for top-level bindings) and may be returned as results

or stored in data structures. Therefore, one important part in the compilation of

functional languages is the conversion of all functions to closed forms. This is normally

performed by representing functions as heap-allocated records, with one reference to

the code implementing the function, and several slots for storing the values of, or

references to, its free variables.

Untyped closure conversion Shao and Appel (2000) and Appel (1992) describe

untyped closure representations and conversion algorithms. We have extended the

basic versions of these algorithms to our typed setting.

Typed closure conversion Minamide et al. (1996) describe a typed closure con-

version algorithm. It uses existential and recursive types to type closures. We have

32 Chapter 2. Background

decided to stick to a simpler type system which is described in Section 4.4.2. As

already mentioned above, Hannan (1995) uses types for closure conversion. Mor-

risett et al. (1999) describe the complete translation of terms in the second-order

lambda-calculus (System F) to typed assembler language. The translation consists

of several typed intermediate languages. Guillemette and Monnier (2007) present

a type-preserving closure-conversion algorithm written in Haskell. Tolmach (1997)

combines closure conversion with a type closure analysis algorithm. His goal is to

represent closure-converted terms in the source language, so he uses algebraic data

types to model closures and introduces the necessary creation of data records and

selections.

The compiler described by Shao (1994) also uses a typed intermediate language and

incorporates typed closure conversion. Morrisett (1995) has similar goals, but uses an

intensional type analysis approach (Harper and Lillibridge, 1993). In this approach,

the language supports a typecase construct, which allows the program to analyze the

types of values and to make decisions based on types.

Run-time closure conversion Feeley and Lapalme (1992) use run-time compi-

lation for the implementation of closures. A closure is represented as a short code

fragment which pushes the values of free variables onto the stack and then invokes

the code of the closure. On closure creation time, this code fragment is written to a

freshly allocated region of memory.

Grabm¨uller (2006) has extended this approach so that the complete code of a function

is generated at closure creation time, instead of generating a stub which calls a

precompiled function.

General Other approaches use data and control flow analysis in order to optimize

the creation and representation of closures. Cejtin et al. (2000) present a closure-

conversion algorithm which makes use of data and control flow analyses in order to

determine closure layout

Steckler and Wand (1997) introduces lightweight closure conversion, where several

analyses are used to avoid redundant saving of variables to closures or passing unused

closures to functions, thereby allowing multiple procedure-calling conventions in the

same program. Siskind (1999) presents a similar closure conversion algorithm for a

Scheme compiler in considerable detail.

2.4.4 Data Representation

Data representation and specializing representations for specific machine-supported

data types have also been investigated. Type-directed unboxing (Leroy, 1997, 1992)

and unboxed values as first-class citizens (Peyton Jones and Launchbury, 1991) have

been proposed for efficiently implementing operations on native machine data types,

for example in numerically intensive computations. When specializing all polymor-

phic functions, each value can be represented in its natural form, avoiding any over-

head associated with repeated boxing and unboxing. A similar approach, called

2.4 Typed Compilation 33

record unboxing, has been proposed by Nguyen and Ohori (2007). Morrisett (1995)

uses the type information passed at run time to polymorphic functions to perform

type-dependent operations on different representations.

Morrison et al. (1991) have suggested to pass type information together with uni-

formly represented parameters, and to perform type-specific operations by inspecting

the type information parameters. They also suggested the use of dynamic special-

ization on each call to a polymorphic function, or to optimize this by caching spe-

cializations. They state that “Neither of the above two implementation techniques is

practical for reasons of space and time overhead, respectively” (Morrison et al., 1991,

p. 349). We think that nowadays a typical personal computer or server has enough

space and time at its disposal to make it practical.

Static specialization, where type-specific versions of functions are generated by the

compiler at static compile time, have been successfully used in the Ada and C++

programming languages (Bray, 1983; Stroustrup, 1986). Ada supports generic mod-

ules, which are instantiated by importing them with concrete types. C++ supports

templates, which are used to implement parametrized classes and functions.

34 Chapter 2. Background

Chapter 3

Dynamic Compilation of

Functional Programs

This chapter proposes a design of a dynamic compilation system for functional pro-

gramming languages. Based on this design, we have implemented a dynamically

compiling virtual machine called Kafka1. The Kafka system executes programs writ-

ten in a strict purely functional language (called the Kafka language) by incrementally

compiling them to machine code. This implementation has been used for testing the

design and for performing experiments in language design and execution performance.

Even though we describe a specific system here, the general concepts described in

this and later chapters are applicable to other systems as well.

We will first discuss the general architecture. Then we will have a look at the various

phases of the compiler and the intermediate languages it uses. Details of the interme-

diate languages on which the compiler is based, the incremental compilation technique

and the implementation are then discussed in detail in the following chapters.

3.1 Architecture

The architecture of the Kafka system is illustrated in Figure 3.1, which is a more

detailed instance of the virtual machine schema described in Chapter 2. The Kafka

system can read both source code written in its input language and compressed

abstract syntax trees produced by a separate compiler.

The data space of the run-time compiler now has a distinguished component, called

the specialization cache. This cache records for which types functions in the code

buffer have been specialized, so that at most one compiled version exists for each

instantiated function. In our prototype, this mapping between source function and

specialized code is maintained only for polymorphic functions, but it could be used

in general for all kinds of specialization.

The preprocessing applied to code loaded into the system depends on the kind of in-

put. For source code the complete compiler front end (see the next section for details)

1For no particular reason.

36 Chapter 3. Dynamic Compilation of Functional Programs

Virtual Machine

Source Program

Run-time System

Specialization

Cache

Tables

Compiler Code Buffer

Operating System

src

ast

Preprocessing

Figure 3.1: Dynamic specialization system architecture

is invoked, whereas for compressed abstract syntax input files only decompression and

conversion to the internal representation is needed. The result of the preprocessing

step is put into the memory area labelled Source Program. When the program is run,

the run-time compiler takes the abstract syntax from the source buffer and generates

machine code.

The code buffer holds machine code and is allocated as a fixed sized memory region.

More sophisticated code buffer management is possible, but has not been implemented

in the current version.

The run-time system provides services to the compiled program, such as garbage col-

lection and an interface to operating-system services, for example input and output.

Additionally, the run-time system writes profiling data to disk files when the program

is terminating.

The tables are used for recording profiling information and statistics. Profiling can

be selectively enabled and leads to the generation of instrumented machine code.

Statistics are collected for several events: the number of code generations and garbage

collections, the time spent in the code generator and the garbage collector as well as

the actual running time of the code. Currently, profiling data is used for controlling

incremental compilation. For more information on possible uses, the reader is referred

to Section 9.2.5.

3.2 Compilation Process

The compilation process of our dynamically optimizing execution environment is

illustrated in Figure 3.2.

3.2 Compilation Process 37

Kafka Language

Abstract Syntax

Typed Abstract Syntax

CPS Language

Closure-converted CPS

Machine Language

parsing

type inference

CPS conversion

high-level optimization

closure conversion

high-level optimization

code generation, specialization,

low-level optimization

Figure 3.2: Compiler phases and intermediate languages

The Kafka language is semantically close to Standard ML (being a strict language),

but has a syntax similar to Haskell. Kafka is a purely functional language which has

no implicit side effects. Input and output are handled with continuation-passing func-

tions and exceptions via an exception-handling mechanism similar to Standard ML.

The front end and the middle end of the Kafka system are conventional. A parser

reads in the source code, checks its basic context-free syntax and converts it into an

abstract syntax tree. This abstract syntax tree corresponds very closely to the source

code, so that the type checker can generate helpful error messages.

The type checker implements a Damas-Milner type inferencing algorithm (Damas

and Milner, 1982) for a let-polymorphic type system with modest extensions, such as

overloaded arithmetic operators, type annotations and recursive algebraic data types.

The output of the type checker is a type-annotated abstract syntax tree where some

language constructs are desugared into more basic constructs. For example, equation-

style function definitions are converted to let expressions which bind variables to λ-

abstractions. The type checker also converts type abstractions and instantiations,

which are implicit in the source code, into explicit type abstractions (λ-abstractions

whose bound variables have the special type ?) and type applications (applications

38 Chapter 3. Dynamic Compilation of Functional Programs

to values of type ?).

In the rest of this thesis, the parser, the type checker and the untyped abstract syntax

will not be discussed any further. They are quite conventional and are not affected by

the dynamic code generation and optimization techniques applied in the other parts

of the compiler.

The typed abstract syntax is fed into the continuation-passing style (CPS) transfor-

mation, which will be discussed in detail in Chapter 4. The resulting program in

continuation-passing style is optimized and then converted into closure-passing style

by the closure conversion. The closure-passing language (also discussed in Chapter 4)

is a subset of the CPS language where functions and continuations are not allowed

to have free variables. This representation is then again optimized.

The last step is the generation of machine code from the optimized, closure-converted

representation. Since this representation is already fairly low-level, code generation

is relatively easy and mainly consists of instruction selection, register and stack slot

assignment and instruction generation.

Control can loop back from the machine language to the code generator when in-

cremental compilation or run-time monomorphization occurs. In this case, closure-

converted code (which is embedded in the machine code) is passed to machine code

generation, and then control is passed to the newly generated code. This is indicated

by the dashed arrow in the figure.

The dynamic compilation framework allows to delay code generation until a particular

function or other part of the program is actually executed. This is used in two

ways: functions (or parts of functions) can be compiled when first executed, thereby

reducing the amount of code generation to those places where it is needed during

an execution. Additionally, specialization of polymorphic functions is delayed until

they are called with specific type parameters. Incremental compilation is discussed

in Chapter 5, specialization of polymorphic functions in Chapter 6.

The compiler phases up to and including the optimization of closure-converted code

are independent of our dynamic specialization technique. It can therefore be done

ahead-of-time in a separate compiler, similar to the Java system, where the compiler

produces bytecode off-line and the virtual machine generates machine code on-line.

We have implemented this in our prototype system: a separate compiler can produce

compressed abstract syntax files of closure-converted code which can be read into the

virtual machine. This has the same effect as directly loading source code files, but

avoids the overhead of parsing, type inference, CPS and closure conversion as well as

high-level optimization.

More details on the implementation of our dynamically compiling architecture are

given in Chapter 7.

Chapter 4

Typed Dynamic

Continuation-passing Style

One goal of this work is to explore the possibilities of using continuation-passing style

intermediate representations in the context of dynamic compilation. CPS is known

to be well suited for data and control-flow analysis (Shivers, 1991), optimizations

(Appel, 1992) and data representation optimizations (Shao, 1994; Shao and Appel,

1995).

The following question, which has not been addressed until now, is:

How are CPS representations suited for performing dynamic optimiza-

tions, and especially, how can dynamic optimizations benefit from type-

directed compilation in the setting of pure functional languages?

A novel compilation technique, called typed dynamic continuation-passing style, is

described in this chapter. It offers the following features:

−Type-directed specialization of polymorphic functions and compile-time resolu-

tion of overloaded operators

−Type-directed data representation and layout optimizations.

When designing an intermediate compiler language, it is necessary to consider the

level of abstraction that should be supported by the language. One possibility is

to use a very high-level language, which is suited to analyses and transformations

which rely on the high-level structure of the source program. At the other end of

the spectrum, a low-level representation (such as quads or three-address instructions)

allows machine-specific modifications of the program.

As an example, consider the combination of data representation optimization and the

closure conversion phase of the compiler. The data representation optimization may

decide (guided by type information) to put certain values into unboxed floating-point

variables, which the code generator should eventually put into floating-point registers.

Other values, unboxed integers or pointers will end up in general-purpose registers.

During the closure-conversion phase, non-escaping functions are normally modified to

40 Chapter 4. Typed Dynamic Continuation-passing Style

take their free variables as additional parameters, and all call sites of such functions

are changed to pass these variables. Unfortunately, unless closure conversion knows

about the representation of all variables, it is not possible to efficiently pass free

variables to functions.

Therefore, we require that the uncompiled program fragments are stored in a fairly

abstract form until they are compiled to machine code. It is one goal of this the-

sis to identify which transformations can be performed before type information is

exploited (e.g., before type specialization and representation selection), and which

transformations must be deferred until such decisions are made.

The rest of this chapter is organized as follows: we first introduce some notation

and then present both the source and the target language of our typed continuation-

passing transformation. The transformation between the explicitly typed source lan-

guage and the CPS language requires both transformations on expressions and on

types. Following that, we describe the closure conversion transformation which re-

moves higher-order functions from the program. Finally we will sketch how machine-

code can be generated from the closure-converted CPS code, and we describe the

conversions performed on a small example program.

4.1 Notation

The following notational conventions are used in the following description of the

source and CPS languages.

A superscripted type xτdeclares the xto be of type τ.

Overline notation is used for sequences. For example, xdenotes a sequence of zero

or more xentities.

4.2 Running Example

The transformations described in this chapter are numerous and complicated. In

order to illustrate the effect of the transformations, we have prepared an example

program which will be transformed step by step along with the descriptions and

definitions of the transformation.

Figure 4.1 shows a complete program in the syntax of our prototype implementation.

The program defines a list data type and a function for computing the sum of all

elements of a list. The program also defines a two-element list of floating-point

numbers and prints the sum of the list elements, calculated by the sum function,

to the terminal. (The type annotation on list l1 is necessary, because polymorphic

values are not allowed.)

The fully typed program shown in Figure 4.2 is the result of type inference. This

typed version of the language is the actual source language of the transformations

described in this chapter, and standard type inference methods can be used to convert

the input program from Figure 4.1 into the typed source program in Figure 4.2. The

function sum is now a type function which expects the type of list elements and returns

4.2 Running Example 41

data List a = Nil

| Cons a (List a)

let

sum l = case l of

Nil -> 0

Cons x xs -> x + sum xs

let val l1 = Cons 2.3 (Cons 1.0 Nil) :: List Prelude.Float

in Prelude.putf (sum l1) (\ v -> v)

Figure 4.1: Running example: input program

let fun sum:

forall $15: *. ((List $15) -> $15) =

(\ $15:

* ->

(\ l:

(List $15) ->

case l of

Nil -> 0

Cons x : $15 xs : (List $15) -> (((add $15) x)

((sum $15) xs)))) in

let val l1: (List Float) = (Cons Float

2.3

(Cons Float 1.0 (Nil Float))) in

((putf ((sum Float) l1))

(\ v: () -> v))

Figure 4.2: Running example: type-checked

42 Chapter 4. Typed Dynamic Continuation-passing Style

a function for calculating the sum of lists of that type. Note that all applications

of polymorphic functions and operators (add is the primitive polymorphic addition

operator) are converted to explicit type applications (e.g., sum Float or add $15,

where $15 is a type variable).

The versions of the example program in the following sections are slightly edited for

readability, but otherwise correspond directly to the transformation results of the

compiler. Note that the examples use the concrete syntax of the compiler, whereas

the formal definitions below use a more mathematical style, to improve readability.

4.3 Source Language

Our CPS transformation converts a typed language based on the λ-calculus into a

continuation-passing language.

The source language of the transformation described here is not intended to be writ-

ten by programmers, but is the output of some front-end, which annotates the original

program text with type annotations. In our prototype implementation, the type in-

ference mechanism transforms a Haskell-like language with optional type annotations

into the language described here.

The source language provides polymorphic algebraic datatype definitions, polymor-

phic function definitions, value definitions, tuples and projections, value constructor

applications and case expressions, λabstractions, function application and numeric

literals and primitive operations. Since the source language is a variant of the second-

order λcalculus (also called System F (Barendregt, 1992)), it provides means for ex-

plicit type abstraction and application. Type and value abstractions or applications

can be distinguished by the types of the formal or actual parameters, respectively.

Type variables and type constants have type ?(pronounced “type”).

Type constants are the names of predefined types, for example integer,float, or user-

defined types, which are declared with data definitions.

4.3.1 Syntax

The source language syntax is given in Figures 4.3 (types) and 4.4 (expressions).

Source language types can be type variables, universally quantified types, the type

of types (?), tuples, type constructors applied to types and function types. Note

that we require type constructors to be saturated: we do not support types of higher

kinds. While in principle this should pose no problems, we have decided for a simpler

system. Type constructors are derived from data type declarations and several type

constructors are predefined: the unit type, booleans, exceptions, integer and floating-

point types.

Source expressions can be variables, integer or floating-point constants, value con-

structor applications (which are required to be saturated), type or value function ap-

plication, type or value abstraction, tupling, projections and the unit value. Integer

literals are annotated with types, because they can be implicitly cast to floating-point

4.3 Source Language 43

(value types) Type 3τ::= αtype variable

| ∀α. τ universal quantification

|?type of types

|(τ,...,τ) tuple type

|TC τ . . . τ applied type constructor

|τ→τfunction type

(type constants) TC ::= unit unit type

|bool boolean type

|exn exception type

|integer integer type

|float floating-point type

|Tuser-defined data type

Figure 4.3: Source language of CPS transformation (types)

values (see Section 4.3.2 for details). A projection #i#je selects the ith field from

the j-tuple e. The more complex expressions are non-recursive value let bindings, re-

cursive function let bindings, case expressions and the exception handling constructs

raise (which raises an expression) and handle, which evaluates expression e1and re-

turns its value if no exception was raised, and otherwise evaluates e2with variable x

bound to the raised exception object and returns the resulting value.

Case alternatives consist of value constructors applied to variables and an expression

on the right-hand side. When an alternative is matched against a value, the variables

are bound to the respective fields of the constructed value and the right-hand side is

evaluated with these bindings in scope.

Primitive operators are predefined constants, which are also mentioned in Figure 4.4.

These appear simply as variables in the source program. Primitives are numeric

arithmetic and comparison operators. Predefined constants also include the boolean

constants false and true. More on primitive operators and numeric literals below in

Section 4.3.2.

The source language is slightly unusual because most language constructs have type

annotations. The reason is that the continuation-passing transformation discussed

below needs to assign types to continuations and continuation parameters. Without

explicit types in the source language, the continuation transformation would need to

perform limited type inference. As we require a compiler front-end which produces

the source language to do some kind of type checking or inference anyway, we simply

require the semantic analysis to produce fully annotated programs. In the source

syntax presented here, some type information is omitted when it can be inferred

from the immediate context. This reduces notational clutter and makes programs

much easier to read.

44 Chapter 4. Typed Dynamic Continuation-passing Style

(expressions) Exp 3e::= xvariable

|nτinteger literal

|gfloating-point literal

|() unit value

|(C eτi

i)τconstructor application

|(e1e2)τapplication

|λxτ. eτabstraction

|(e1, . . . , en)τtuple

|#i#j eτprojection

|let xτ=e1in e2let (non-recursive)

|let Fun in e2function definition (recursive)

|(case eτof Alt)τcase expression

|raise eraise exception

|eτ

1handle x→e2handle exception

(alternative) Alt ::= C xτ→eτ

(integer literals) n::= 0 |1|. . .

(float literals) g::= 0 | −1.1|. . .

(function def.) FunDef 3Fun ::= fτ=e

(datatype def.) DataDef 3Data ::= data Tα=C τ

(program) Prog 3P::= Data e

(constants) CS ::= false falsity

|true truth

|add addition

|sub subtraction

|mul multiplication

|div division

|eq =

|neq 6=

|less <

|greater >

Figure 4.4: Source language of CPS transformation (expressions)

4.3 Source Language 45

4.3.2 Overloaded Numeric Literals and Primitive Operators

We want to exploit the opportunities which specialization gives for different numeric

types, but we also try to keep the type system of the source and intermediate lan-

guages as simple as possible. Therefore, we have decided against implementing a

full-blown type class system like that used in Haskell, and instead used a simpler

alternative. Numeric integer literals are assigned the polymorphic type ∀α.α by the

type checker, so that they can be used both in integer and floating-point contexts.

Similarly, primitive numeric operators have polymorphic types like ∀α.α →α→α

or ∀α.α →α→bool. Of course, these types are not precise enough, as the opera-

tions should only be allowed in numeric contexts: addition of two non-numeric types

should not be allowed. Typing problems of this kind are found and reported during

the specialization phase.

When specialization of polymorphic functions is performed, all type variables are

replaced by concrete types before code is generated, so that the code generator has to

deal with monomorphic code only. The code generator can then detect any wrongly-

typed uses of numeric operations. The obvious drawback of this solution is that

type errors are possibly found late in the compilation, at code generation time. Since

code generation is interleaved with program execution, this may be really late. We

emphasize that programs are type safe in the sense that no type incorrect operations

are performed by the generated code.

In a production system, a type class or similar system should be used instead, but

for our experiments, this solution proved to be sufficient. See Section 9.2.3.

One typing problem remains with this method: free type variables outside of any

polymorphic functions are never replaced by types, because no specialization occurs

on the top-level term. Therefore, before a program is evaluated, all free type variables

of the program must be replaced by the type integer. It must be the integer type

because these free variables can only stand for numeric types1, and because integer

literals are the only constructs which are assigned type ∀α.α .

4.3.3 Static Semantics

The type system of the source language given in Figure 4.5 and Figure 4.6 is an

adaptation of the type system for System F in Pierce (2002).

Term variable contexts Γ keep track of which type and term variables are in scope and

give the types of term variables. Datatype contexts ∆ record the type constructors

of the predefined type constants and the user-defined data types. The type variables

are included in a datatype context as well in order to check whether type constructor

applications are saturated during the well-formedness check on types. Initially, Γ

contains the bindings of all built-in constants (see Figure 4.7) and all data construc-

tors from data type declarations with their declared types. Similarly, ∆ contains all

1More precisely, an exception-raising program is assigned type α. This special treatment when

the type of the complete program is polymorphic, is handled by our implementation properly.

46 Chapter 4. Typed Dynamic Continuation-passing Style

Ctxt 3Γ ::= ∅empty context

|Γ, x:τterm variable binding

|Γ, α type variable binding

DCtxt 3∆ ::= ∅empty context

|∆, T α1. . . αndatatype binding

|∆, α type variable binding

Γ; ∆ `e:τ

(var) x:τ∈Γ

Γ; ∆ `x:τ(liti) α∈∆

Γ; ∆ `nα:α(litf) Γ; ∆ `g:float

(unit) Γ; ∆ `() : unit (capp) C:τ1→ · · · → τn→τ∈Γ (Γ; ∆ `ei:τi)n

i=1

Γ; ∆ `(C eτi

i)τ:τ

(tuple) (Γ; ∆ `ei:τi)n

i=1

Γ; ∆ `(e1,...en)(τ1,...,τn): (τ1, . . . , τn)(proj) Γ; ∆ `e: (τ1, . . . , τj)

Γ; ∆ `#i#j e(τ1,...,τj):τi

(abs) Γ, x:τ1; ∆ `e:τ2

Γ; ∆ `λxτ1.eτ2:τ1→τ2

(app) Γ; ∆ `e1:τ1→τ2Γ; ∆ `e2:τ1

Γ; ∆ `(e1e2)τ2:τ2

(tabs) Γ, α; ∆, α `e:τ2

Γ; ∆ `λα?.eτ2:∀α.τ2

(tapp) Γ; ∆ `e1:∀α.τ2∆`e2:?

Γ; ∆ `(e1e2)τ2:τ2

(let) Γ; ∆ `e1:τ1Γ, x:τ1; ∆ `e2:τ2

Γ; ∆ `let xτ1=e1in e2:τ2

(letfun) Γ; ∆ `Fun: Γ0Γ∪Γ0; ∆ `e2:τ2

Γ; ∆ `let Fun in e2:τ2

(case) Γ; ∆ `e:τsΓ; ∆ `Alt:τ

Γ; ∆ `case eτsof Alt:τ

(raise) Γ; ∆ `e:exn

Γ; ∆ `raise e:τ(handle) Γ; ∆ `e1:τΓ, x:exn; ∆ `e2:τ

Γ; ∆ `eτ

1handle x→e2:

Figure 4.5: Source language type system (part 1)

4.3 Source Language 47

Γ; ∆ `Fun: Γ0

(funs) Γ0= (fi:τi)n

i=1 Γ∪Γ0; ∆ `(ei:τi)n

i=1

Γ; ∆ `(fτi

i=ei)n

i=1 : Γ0

Γ; ∆ `Alt:τ

(alts) (Ci:τi1→ · · · → τini→τs∈Γ Γ, xi1:τi1, . . . , xini:τini:ei`τ)n

i=1

Γ; ∆ `(Cixτi1

i1. . . xτini

ini→eτ

i)n

i=1 :τ

∆`τ:?

(tcon) T α1. . . , αn∈∆ (∆ `τi:?)n

i=1

∆`T τ1. . . , τn:?(tvar) α∈∆

∆`α:?

(ttuple) (∆ `τi:?)n

i=1

∆`(τ1, . . . , τn): ?(tforall) ∆, α `τ:?

∆` ∀α.τ :?

(tfun) ∆`τ1:?∆`τ2:?

∆`τ1→τ2:?

Figure 4.6: Source language type system (part 2)

pre-defined type constructors (all constants of type ?from Figure 4.7) as well as the

user-defined data types with their type parameters.

The judgement Γ; ∆ `e:τspecifies that in term variable context Γ and datatype

context ∆, expression ehas type τ. In order to keep the rules readable, we have not

explicitly mentioned that all types appearing in the rules must be well-formed, as

defined by the rules for judgement ∆ `τ:?in Figure 4.6.

The rules (var), (litf) and (unit) are conventional, rule (liti) assigns a polymorphic

type to integer constants, as already mentioned in Section 4.3.2. Rule (capp) applies

to constructor application, rules (abs) and (app) to value abstraction and application,

respectively; similar for (tabs) and (tapp) for type abstraction and application. Note

that for (tapp), the type parameter must be well-formed, which we explicitly check

for. The let expressions are conventional, rule (letfun) uses the auxiliary judgement

Γ; ∆ `Fun: Γ0from Figure 4.6 which checks whether a sequence of mutually recursive

functions is well-typed and returns the bindings of the defined functions to their types

in Γ0. Case expressions are checked using rule (case), which again uses an auxiliary

judgement Γ; ∆ `Alt:τfrom Figure 4.6, which checks the typing of case alternatives.

The exception raising statement in rule (raise) matches any type τ, because it aborts

execution. Rule (handle) checks exception handling.

48 Chapter 4. Typed Dynamic Continuation-passing Style

add :∀α.α →α→α

sub :∀α.α →α→α

mul :∀α.α →α→α

div :∀α.α →α→α

eq :∀α.α →α→bool

neq :∀α.α →α→bool

less :∀α.α →α→bool

greater :∀α.α →α→bool

false :bool

true :bool

unit :?

exn :?

integer :?

float :?

bool :?

Figure 4.7: Types of built-in constants

4.3.4 Dynamic Semantics

The dynamic semantics of the source language are standard (see, for example Pierce

(2002)). Note that when using a semantics which uses substitution for type applica-

tions, all type variables disappear from subterms before they are evaluated. There-

fore, a primitive operation will never encounter a type variable as an argument, but

only proper types. It is still possible, though, that a type error happens because of

our handling of polymorphic numeric literals. See Section 4.3.2 above for details.

4.4 Continuation Language

We use an adapted and extended version of the continuation-passing language de-

scribed by Kennedy (2007). In contrast to the CPS language used in Appel’s book

(Appel, 1992), this version is typed and represents exception handling by explicitly

passing a failure continuation to each user function. The translation of the source

language handle and raise constructs makes use of these failure continuations. In

all other cases, the failure continuation is simply passed through to all called func-

tions. In an implementation, the parameter which holds the failure continuation can

be targeted to a particular register or memory location, so that passing it to called

functions will avoid any copying overhead.

The syntax of the CPS language is given in Figures 4.8 (types) and 4.9 (terms). The

language has been extended from Kennedy’s with the following language constructs:

−n-ary tuples (n > 1) and projections

−parametric polymorphism

−support for the definition of polymorphic, recursive algebraic data types

−means for constructing and taking apart values of algebraic data types by case-

expressions

−multi-argument continuations and functions (continuations also may take no

arguments)

4.4 Continuation Language 49

(value types) CType 3τ::= αtype variable

| ∀α.τ universal quantification

| ⊥ bottom type

|?type of types

|(τ,...,τ) tuple type

|Tτ . . . τ data type

|τ→τfunction type

|(#τ,...,τ#) unboxed tuple type

(type constants) ::= unit unit type

|bool boolean type

|exn exception type

|integer integer type

|float floating-point type

Figure 4.8: CPS language (types)

−support for integer and floating-point literals

−primitive operations for performing integer and floating-point arithmetic and

comparison

The types of the CPS language are an extension of source language types. The

bottom type ⊥is used as the return type of functions and continuations in the CPS

language, because they never return. Instead of returning to its caller, each function

calls one of its continuations parameters or other functions. The main program is

initially passed a special continuation which terminates the whole program.

Unboxed tuples are used for argument types of functions and are conceptional tuples.

They model that functions can receive and return (as arguments to their continuation)

multiple values, but unlike normal tuples, they cannot be stored into data structures.

The idea of using (as well as the name and the syntax with (# . . . #) brackets of)

unboxed tuples for this application is taken from Peyton Jones and Launchbury

(1991).

The type of types ?is used for type parameters. These need to be represented in the

type system because polymorphic functions are represented as functions from types

to functions.

4.4.1 Abbreviations

Types of CPS converted terms become very unwieldy because of explicit represen-

tation of success and failure continuations, and because unboxed tuples are used to

represent multiple arguments. Therefore, the following abbreviations are defined for

the sake of readability:

50 Chapter 4. Typed Dynamic Continuation-passing Style

(variables) FVar ::= ffunction name

CVar ::= kcontinuation name

VVar ::= xvalue name

(values) Val ::= () unit value

|nτinteger literal

|gfloating-point literal

|(x1, . . . , xn) tuple value

|Cτx1. . . xnconstructor application

(terms) Term 3K::= let xτ=Vin Kvalue binding

|let xτ=πixin Kprojection

|let xτ=x1⊕x2in Kprimitive operation

|letcont Cont in Kcontinuation binding

|letfun Fun in Kfunction binding

|k x continuation call

|f x function call

|case xof Alt case expression

(alternatives) Alt ::= C→k

(primitives) Prim 3 ⊕ ::= + |−|×|/

|=| 6=|<|>

(integer literals) n::= 0 |1|. . .

(float literals) g::= 0 | −1.1|. . .

(function def.) FDef 3Fun ::= fτx=K

(contin. def.) CDef 3Cont ::= kτx=K

(datatype def.) DDef 3Data ::= data Tα=C τ

(program) Prog 3P::= Data K

Figure 4.9: CPS language (expressions)

4.4 Continuation Language 51

Γ`V:τ

(val-unit) Γ`(): unit (val-int) Γ`nτ:τ(val-float) Γ`g:float

(val-tuple) (Γ `xi:τi)∀n

i=1

Γ`(x1, . . . , xn): (τ1, . . . , τn)

(val-con) (Γ `xi:τi)∀n

i=1 C:τ1→ · · · → τn→τ∈Γ

Γ`Cτx1. . . xn:τ

Figure 4.10: CPS language type system (values)

τ1⇒τ2≡(#τ1,(#τ2#) → ⊥,(#exn#) → ⊥#) → ⊥ (function)

¬τ≡(#τ#) → ⊥ (continuation)

The first abbreviation stands for functions with exactly one parameter and one return

value. All functions produced by the CPS transformation have this form, but later

processing (such as optimizations like uncurrying or closure conversion) may later

add arguments to functions and continuations. The second abbreviation uses the

common notation of “not” for functions which do not return.

4.4.2 Static Semantics

The type system of the typed CPS language is shown in Figure 4.11. Rules of the

form Γ `Kok mean that term Kis well-typed in context Γ. The type system is

also based on the rules in Figure 4.10, which specify the well-typedness of values.

The type rules make use of the operator τ≤τ0, which means that type τis an

instantiation of type τ0.

The typing rules for values are simple. The only uncommon feature, overloaded

literals, require that the type of an integer literal is taken from its annotation. Again,

we assume that all data constructor types are included in the environment Γ.

The value binding construct simply checks that the value is well-typed and that the

type annotation on the bound variable is correct. The projection binding checks that

the variable from which a value is projected is a tuple type of valid arity. Primitive op-

erator bindings are checked for correct typing of arguments and result. Function and

continuation applications require that the actual argument types match the function

or continuation parameter types. For case expressions, the types of the mentioned

continuations are checked against the fields of the corresponding constructor field

types. The right-hand sides of function and continuation bindings are checked with

the types of the defined functions/continuations in scope, to allow mutually recursive

definitions.

52 Chapter 4. Typed Dynamic Continuation-passing Style

Γ`Kok

(letval) Γ`V:τΓ, x :τ`Kok

Γ`let xτ=Vin Kok

(letproj) x: (τ1, . . . , τn)∈Γ Γ, x :τi`Kok

Γ`let xτ=πixin Kok

(letprim)

x1:τ1∈Γx2:τ2∈Γ

⊕:τ⊕∈Γτ1→τ2→τ≤τ⊕Γ, x :τ`Kok

Γ`let xτ=x1⊕x2in Kok

(funapp) f:τf∈Γ (#τ1, . . . , τn#) →⊥≤τf(Γ `xi:τi)∀n

i=1

Γ`f x1. . . xnok

(contapp) k:τk∈Γ (#τ1, . . . , τn#) → ⊥ ≤ τk(Γ `xi:τi)∀n

i=1

Γ`k x1. . . xnok

(case)

x:τ∈Γ

(ki:τki∈Γ (#τi1, . . . , τini#) →⊥≤τki)∀n

i=1

(Ci:τCi∈Γτi1→. . . τini→τ≤τCi)∀n

i=1

Γ`case xof C1→k1. . . Cn→knok

(letfun)

Γ, f1:τ1, . . . , fn:τn`Kok

(Γ, f1:τ1, . . . , fn:τn, xi1:τi1, . . . , xini:τini`Kiok

τi= (#τi1, . . . , τini#) → ⊥)∀n

i=1

Γ`let fτ1

1x11 . . . x1n1=K1. . . fτn

nxn1. . . x1nn=Knin Kok

(letcont)

Γ, k1:τ1, . . . , kn:τn`Kok

(Γ, k1:τ1, . . . , kn:τn, xi1:τi1, . . . , xini:τini`Kiok

τi= (#τi1, . . . , τini#) → ⊥)∀n

i=1

Γ`let kτ1

1x11 . . . x1n1=K1. . . kτn

nxn1. . . x1nn=Knin Kok

Figure 4.11: CPS language type system (terms)

4.5 CPS Transformation 53

4.4.3 Dynamic Semantics

The dynamic semantics of the CPS language is defined by a standard, call-by-value

evaluation strategy. The only extension to textbook semantics is that on application

of type functions to type values, these type values must be substituted both at the

term level and at the type level. Type variables may appear as term variables, for

example in other type applications, and as type variables in type annotations.

4.5 CPS Transformation

The CPS transformation for translating source programs into CPS programs is shown

in Figures 4.12 and 4.13. It is a variant of the higher-order one-pass call-by-value

transformation of Danvy and Filinski (1992) which we have adapted from the pre-

sentation of Kennedy (2007, Fig. 8).

The transformation of expressions to CPS terms is performed mainly by two mutu-

ally recursive functions called [[·]] and (| · |). Both are very similar, and translate an

expression of the source language (first argument) into a term of the continuation

language. The second argument is the name of the current exception continuation.

The third argument is either a meta-level continuation which is called directly, or

the name of an object-level continuation, for which an object-level continuation call

has to be constructed. The reason for duplicating the transformation in two differ-

ent contexts is that the algorithm avoids the generation of so-called administrative

reducible expressions (redexes), that is, statically reducible βredexes. Older trans-

formations such as Appel (1992) rely on optimization transformations to eliminate

these redexes. The transformation of Danvy and Filinski (1992) avoids their creation

during construction of the CPS term and therefore does not need these optimizations

(although other opportunities for optimizations remain, of course, depending on the

input program).

In order to distinguish the meta and object levels, the following conventions are used

(Kennedy, 2007):

−Bold lambdas λare used to denote meta-level abstractions, normal lambdas λ

for object-level abstractions.

−Meta-level applications are written with parenthesis κ(x), object-level applica-

tions with juxtaposition k x.

There are three helper transformations: the first transforms source function defini-

tions to CPS function definitions (Figure 4.15), the second transforms source types

to CPS types (Figure 4.14) and the third transforms names of primitive operations

to their operators (Figure 4.16). The transformation for functions simply invokes the

(| · |) transformation on the function right-hand side, passing the two new continua-

tion arguments of the transformed function. Conversion of source to CPS types only

affects the types of functions, which are converted to multi-argument function types

54 Chapter 4. Typed Dynamic Continuation-passing Style

[[·]] : Exp →Var →(Var →Term) →Term

[[x]] h κ =κ(x)

[[(prim τ1e1e2)τ]h κ = [[e1]] h

(λx1.[[e2]] h

(λx2.let x[[τ]] =x1[[prim]] x2in κ(x)))

[[(e1e2)τ]h κ = [[e1]] h

(λx1.[[e2]] h

(λx2.letcont k¬[[τ]] x=κ(x)in x1x2k h))

[[λxτ1. eτ2]] h κ =letfun f[[τ1→τ2]] x k h0= (|e|)h0kin κ(f)

[[(e1, . . . , en)(τ1,...,τn)]] h κ = [[e1, . . . , en]] h

(λx1, . . . , xn.

let x[[(τ1,...,τn)]] = (x1, . . . , xn)in

κ(x))

[[()]] h κ =let xunit = () in κ(x)

[[nτ]] h κ =let x[[τ]] =nin κ(x)

[[g]] h κ =let xfloat =gin κ(x)

[[(C e1. . . en)τ]] h κ = [[e1, . . . , en]] h

(λx1, . . . , xn.let x[[τ]] =C x1. . . xnin κ(x))

[[#i#j e(τ1,...,τj)]] h κ = [[e]] h(λz. let x[[τi]] =πizin κ(x))

[[let xτ=e1in e2end]] h κ =letcont j¬[[τ]] x= [[e2]] h κ in (|e1|)h j

[[let din eend]] h κ =letfun [[d]] in [[e]] h κ

[[raise e]] h κ = [[e]] h(λz. h z)

[[eτ

1handle x→e2]] h κ =letcont j¬[[τ]] x=κ(x)in

letcont h0¬exn x= (|e2|)h j in (|e1|)h0j

[[(case eτsof

C1xτ11→e1

. . .

Cnxτnn→en)τr]] h κ = [[e]] h

(λz.

letcont j¬[[τr]] x=κ(x)in

letcont k¬(#[[τ1]]#)

1x1= (|e1|)h j

. . .

k¬(#[[τn]]#)

nxn= (|en|)h j in

case zof C1→k1|. . . |Cn→kn

Figure 4.12: CPS transformation (part 1)

4.5 CPS Transformation 55

(|·|) : Exp →Var →Var →Term

(|x|)h k =k x

(|(prim τ1e1e2)τ|)h k = [[e1]] h

(λx1.[[e2]] h

(λx2.let x[[τ]] =x1[[prim]] x2in k x))

(|(e1e2)τ|)h k = [[e1]] h

(λx1.[[e2]] h

(λx2. x1x2k h))

(|λxτ1. eτ2|)h k =letfun f[[τ1→τ2]] x k0h0= (|e|)h0k0in k f

(|(e1, . . . , en)(τ1,...,τn)|)h k = [[e1, . . . , en]] h

(λx1, . . . , xn.

let x[[(τ1,...,τn)]] = (x1, . . . , xn)in k x). . . )

(|()|)h k =let xunit = () in k x

(|nτ|)h k =let x[[τ]] =nin k x

(|g|)h k =let xfloat =gin k x

(|(C e1. . . en)τ|)h k = [[e1, . . . , en]] h

(λx1, . . . , xn.

let x[[τ]] =C x1. . . xnin k x)

(|#i#j e(τ1,...,τj)|)h k = [[e]] h(λz. let x[[τi]] =πizin k x)

(|let xτ=e1in e2end|)h k =letcont j¬[[τ]] x= (|e2|)h k in (|e1|)h j

(|let fun din eend|)h k =letfun [[d]] in (|e|)h k

(|raise e|)h k = [[e]] h(λz. h z)

(|eτ

1handle x⇒e2|)h k =letcont h0¬exn x= (|e2|)h k in (|e1|)h0k

[[(case eτsof

C1xτ11→e1

. . .

Cnxτnn→en)τr]] h k = [[e]] h

(λz.

letcont k¬(#[[τ1]]#)

1x1= (|e1|)h k

. . .

k¬(#[[τn]]#)

nxn= (|en|)h k in

case zof C1→k1|. . . |Cn→kn

Figure 4.13: CPS transformation (part 2)

56 Chapter 4. Typed Dynamic Continuation-passing Style

[[·]] : Type →CType

[[α]] = α

[[?]] = ?

[[(τ1, . . . , τn)]] = ([[τ1]],...,[[τn]])

[[T τ1. . . τn]] = T[[e1]] . . . [[τn]]

[[(τ1→τ2)]] = (#[[τ1]],¬[[τ2]],¬exn#) → ⊥

Figure 4.14: CPS transformation of types

which receive the (transformed) argument type and the success and failure contin-

uation as arguments. The success continuation takes the original return type as its

argument.

We will now discuss in detail the transformation [[·]] on source expressions. (The

rules for (|·|) are analogous.) Variables are simply passed to the continuation of the

transformations. Applications of primitive functions are transformed into the spe-

cial binding for primitive operators, which applies an operator to two variables (the

arguments) and binds the result to a variable. See Figure 4.16 for the translation

of primitive operators. Applications are translated by translating both the function

and the argument and by creating a return continuation to be passed to the called

function. Abstractions are transformed to a function binding for one (newly named)

function. Tuples use the natural extension of the [[·]] transformation to lists of ex-

pressions, which lets us bind several variables at once. The arguments of the tuple

constructor are then bound using the value binding construct. The translation of the

unit value, floating-point literals and integer literals is straightforward, except for the

fact that the type annotation for integer literals must be included in the target value

binding.

Constructor applications are handled similarly to tuples. Projections are translated

into projection bindings. Source let-bindings are translated by first translating the

bound expression, which is passed into a fresh continuation which then evaluates

the let body. Function bindings are translated using the auxiliary transformation in

Figure 4.15. The raise and handle constructs are the only expressions which affect the

failure continuations. Both constructs do not have a corresponding term in the target

language; instead, the translation of raise simply invokes the failure continuation,

whereas handle constructs a new continuation which is the translation of the handler

clause and passes this continuation as the failure continuation to the translation of

the first expression.

The definition of the non-tail-recursive transformation (|·|) is very similar to the

tail-recursive one, except that some continuation creations can be avoided because

the passed-in continuation is on the object level and can therefore be used directly

in translated terms.

The treatment of user functions requires some more explanation, because these func-

tions are possibly type functions as introduced by the type checker. It is necessary

not to separate type parameters from each other. To see why, consider the function

4.5 CPS Transformation 57

[[·]] : FunDef →FDef

[[f=λ x :τ.e]] = f x k h = (|e|)h k

where τ6=?

[[f=λ x1:?. . . . λ xn:?. λ x :τx. e]] = f x1. . . xnk h = (|λ x :τx. e|)h k

where τx6=?

Figure 4.15: CPS transformation of functions

[[·]] : Var →Prim

[[add]] = +

[[sub]] = −

[[mul]] = ×

[[div]] = /

[[eq]] = =

[[neq]] = 6=

[[less]] = <

[[greater]] = >

Figure 4.16: CPS transformation of primitive functions

map with the following type:

∀a.∀b.(a→b)→List a→List b

If this function were translated into curried type functions, the outer function applied

to the first type awould have the type

∀b.(a→b)→List a→List b

but it could not be compiled because the specialized version for type adoes not know

anything about the representation of type b. Therefore, our CPS transformation

always translates type functions so that all type parameters (whose representation is

fixed) are passed together, and separately from value parameters. This is reflected in

the rules for the transformation of functions in Figure 4.15.

Running Example

The result of CPS conversion on the running example program is shown in Figure 4.17.

To keep the example small, we have deleted type annotations on functions and con-

tinuations. Also, note that the alternatives of the case expression have expressions

on the right-hand side of the arrow, not only continuation names. For readability,

the continuations have been inlined in these cases, which can be enabled as an op-

timization in the CPS converter of our implementation (independently of the other

optimizations described in Chapter 7).

58 Chapter 4. Typed Dynamic Continuation-passing Style

letfun

sum($15: *, k0: ~((List $15) => $15), h1: ~Exn) =

letfun

lam2(l: (List $15), k3: ~$15, h4: ~Exn) =

case l of

Nil ->

letval lit7: $15 = 0 in

k3 lit7

Cons x xs ->

letval ty8: * = $15 in

letcont k9(x10: ((List $15) => $15)) =

letcont k11 (x12: $15) =

let x13: $15 = x p+ x12 in

k3 x13 in

x10 xs k11 h4 in

sum ty8 k9 h4

k0 lam2

letcont k14 (l1: (List Float)) =

letval ty15: * = Float in

letcont k16 (x17: ((List Float) => Float)) =

letcont k18 (x19: Float) =

letfun

lam20 (v: (), k21: ~(), h22: ~Exn) =

k21 v

let x23: () = Prelude.putf (x19) in

lam20 x23 :.halt :.fail in

x17 l1 k18 :.fail in

sum ty15 k16 :.fail in

letval ty24: * = Float in

letval lit25: Float = 2.3 in

letval ty26: * = Float in

letval lit27: Float = 1.0 in

letval ty28: * = Float in

letval Nil29: (List Float) = Nil ty28 in

letval Cons30: (List Float) = Cons ty26

lit27

Nil29 in

letval Cons31: (List Float) = Cons ty24

lit25

Cons30 in k14 Cons31

Figure 4.17: Running example: CPS converted

4.6 Closure Conversion 59

(value types) CType 3τ::= . . .

| {τ, τ, . . . }closure type

|pτqclosure-converted function

(terms) Term 3K, L ::= . . .

|letclos Clos in Kclosure binding

(closure def.) ClDef 3Clos ::= xτ= [x1, . . . , xn]

CC Context CCCtxt 3Γ ::= ∅empty context

|Γ, x:τterm variable binding

Figure 4.18: Extensions for closure-converted CPS code

4.6 Closure Conversion

The CPS language which results from the transformation in the previous section is

already a low-level representation of the source program where evaluation order is

made explicit, and where all intermediate values and control flow points are named.

In order to compile the program to machine code, functions and continuations must

be closed, by making accesses to free variables explicit. Therefore, we introduce

closures for all defined functions, which contain the name of the function (a code

pointer in the implementation) and the values of its free parameters.

This transformation is made by the closure conversion phase which is the subject of

this section.

Figure 4.18 contains the extensions of the CPS language required for closure-convert-

ed code. Closure types are assigned to closures. The type contained in the closure

type is the type of the function which is represented by that closure.

The second additional type is the type of closure-converted functions pτq. The type

τmust be a function type, and the markers over the type indicate that it is the

type of functions of type τ, but in closure-converted form. That means that the

functions expect an additional first argument, which is a closure for that function.

Using this syntactic construct for marking closure-converted functions, we are able to

avoid including recursive and/or existential types in our language, but still are able

to type check closure-converted code.

4.6.1 Type checking Closure-converted Terms

The type system of closure-converted terms (Figure 4.19) is similar to the CPS type

system shown above (Figure 4.11). New rules are required for the closure creation

form letclos, for projection, for function and continuation application and for case

expressions. All rules not mentioned here are the same as for the former type system.

Type checking of closure-converted code is more difficult than the original CPS code,

60 Chapter 4. Typed Dynamic Continuation-passing Style

Γ`Kok

(letproj) x:{τ1, . . . , τn} ∈ Γ Γ, x :τi`Kok

Γ`let xτ=πixin Kok

(funapp)

f:τf∈Γp(#τ1, . . . , τn#) → ⊥q≤τf(Γ `xi:τi)∀n

i=1

f0:τf0∈Γ{p(#τ1, . . . , τn#) → ⊥q} ≤ τf0

Γ`f f0x1. . . xnok

(contapp)

k:τk∈Γp(#τ1, . . . , τn#) → ⊥q≤τk(Γ `xi:τi)∀n

i=1

k0:τk0∈Γ{p(#τ1, . . . , τn#) → ⊥q} ≤ τk0

Γ`k k0x1. . . xnok

(case)

x:τ∈Γ

(ki:τki∈Γ{p(#τi1, . . . , τini#) → ⊥q} ≤ τki)∀n

i=1 ∈Γ

(Ci:τCi∈Γτi1→. . . τini→τ≤τCi)∀n

i=1

Γ`case xof C1→k1. . . Cn→knok

(letfun)

Γ, f1:τ1, . . . , fn:τn`Kok

(Γ, f1:τ1, . . . , fn:τn, xi1:τi1, . . . , xini:τini`Kiok

τi= (#τi1, . . . , τini#) →τ⊥)∀n

i=1

Γ`letfun fτ1

1x11 . . . x1n1=K1. . . fτn

nxn1. . . x1nn=Knin Kok

(letcont)

Γ, k1:τ1, . . . , kn:τn`Kok

(Γ, k1:τ1, . . . , kn:τn, xi1:τi1, . . . , xini:τini`Kiok

τi= (#τi1, . . . , τini#) → ⊥)∀n

i=1

Γ`letcont kτ1

1x11 . . . x1n1=K1. . . kτn

nxn1. . . x1nn=Knin Kok

(letclos)

Γ, c1:τ1, . . . , cn:τn`Kok

((xij :τij ∈Γ)∀ki

j=1 τi={τi1, . . . , τiki})∀n

i=1

Γ`letclos cτ1

1= [x11, . . . , x1k1]. . . cτn

n= [xn1, . . . , xnkn]in Kok

Figure 4.19: Type system extensions for closure-converted terms

4.6 Closure Conversion 61

because functions of the same type, but with different free variables get assigned

different types when closures are implemented as simple tuples. Therefore, we have

introduced a special closure type, denoted by curly brackets: {τ→τ0}is the type

of a closure for a function from τto τ0. More elaborate type systems for closure

conversion (Minamide et al., 1996; Hannan, 1995) have to rely on existential types

to hide the types of free variables, or even have to add recursive types in order to

express that a closure contains a pointer to a function which expects a closure of

itself as an argument. We have side-stepped these issues in order to have a simpler

type system. Closure types can have two forms: either only the function type of the

function closed over is given between the curly brackets, or the function type and the

type of all free variables are given. The first form is introduced at places where the

types of the free variables are not known (for parameters of higher-order function, for

example). At other occurrences, for example when a closure is created or when fields

other than the first are projected from the closure, all types are given. This allows

us to type check all uses of closures without the need for existential types.

A second problem is that closure-converted functions receive their closure as an ad-

ditional parameter. This leads to recursive types, as the closure type again contains

the type of the function. Using iso- or equi-recursive types, this can be reflected in

the type system, but unfortunately this either complicates the type checker (which

must unroll and roll types as required) or the intermediate language (which must

contain constructs for unrolling and rolling types, to be inserted by the closure con-

version phase). We have again decided to use a simpler type system: we introduce

the special type pτ1→τ2q, which represents a closure-converted function type. A

closure of this type is then written {pτ1→τ2q}. When type-checking applications

of closure-converted functions to their parameters, the first parameter is interpreted

specially: it must be a closure of the function type, and the other parameters must

match the argument type of the function. Thus we can achieve type-safety without a

more complicated type system and with an even simpler implementation for the type

checker. The drawback of our representation is that closures always must be the first

parameters of their respective closure-converted functions.

4.6.2 Closure Conversion Algorithm

The closure conversion code is shown in Figures 4.20 to 4.22. This conversion algo-

rithm requires types, so the algorithm has an extra parameter, the variable context,

which records the type of each bound variable while processing nested terms.

Most rules simply recurse on their nested term, binding variables to their annotated

types. The rules for function and continuation bindings are converted to introduce

closures, whereas function and continuation calls are modified so that they extract

the code pointer from a closure before applying the respective function or closure.

On closure construction, all functions and continuations are renamed and closures

with the original names are introduced. Each closure has its corresponding function

pointer as the first element, and the free variables of the function form the remain-

ing elements. In the function and continuation bodies, the free variables are then

62 Chapter 4. Typed Dynamic Continuation-passing Style

{| · |} : Term →CCCtxt →Term

{|let xτ=Vin K|} Γ = let xτ=Vin {|K|} (Γ, x:τ)

{|let xτ=πiyin K|} Γ = let xτ=πiyin {|K|} (Γ, x:τi)

where y: (τ1, . . . , τn)∈Γ

{|let xτ=x1⊕x2in K|} Γ = let xτ=x1⊕x2in {|K|} (Γ, x:τ)

{|f x1. . . xn|} Γ = let gτ=π1fin g f x1. . . xn

where f:{τ} ∈ Γ, g fresh

{|k x1. . . xn|} Γ = let lτ=π1kin l k x1. . . xn

where k:{τ} ∈ Γ, l fresh

{|case xof C1→k1. . . Cn→kn|} Γ = let l1=π1k1. . . ln=π1knin

case xof

C1→l1

. . .

Cn→ln

Figure 4.20: Closure conversion (part 1)

extracted from the closure argument and bound to their original names, so that the

rest of the body can refer to the names.

The function fvs, which is used for function and continuation bindings, returns a

sequence of free variables of the argument. This function is applied to the bodies of

the functions and continuations and the result (after removing the parameters with

the \operator) are the variables to close over.

Similar to the continuation-passing transformation, closure conversion requires con-

version of the types involved in the converted terms. The rules for converting types

are shown in Figure 4.23. The types are simply converted recursively, except for the

function type: converted function types are marked as closures over closure-converted

functions. The {| · |}→transformation on types returns the closure-converted type of

a function, without the closure braces.

The only operations which are allowed on closure types are the creation of closures

(via letclos terms) and projections, where the projection of the first element is always

allowed (it yields the function pointer) and projection of the other fields is only

allowed in the body of closure-converted functions, where the closure type contains

the types of the function and of its free variables.

The closure conversion algorithm is very simple and there exist many variants to this

so-called basic algorithm. Refer to Chapter 2 for relevant literature.

4.6.3 Notes on Closure Conversion

There is a big difference between our closure conversion algorithm and the algorithms

found in literature. Other approaches normally consist of two phases: the first intro-

duces closures by modifying function definitions and applications and the creation of

explicit closure structures (introduction phase). The second phase lifts all functions

4.6 Closure Conversion 63

{|letcont

kτ1

1x11 . . . x1k1=K1

. . .

kτn

nxn1. . . xnkn=Knin K|} Γ =

letcont l{|τ1|}→

1k1x11 . . . x1k1=K0

. . .

l{|τn|}→

nknxn1. . . xnkn=K0

nin

letclos kτk1

1= [l1, y11, . . . , y1m1]

. . .

kτkn

n= [ln, yn1, . . . , ynm1]

{|K|} Γ2

where Γ2= Γ, k1:{|τ1|}, . . . , kn:{|τn|},

l1:{|τ1|}→, . . . , ln:{|τn|}→

1={|let y11 =π2k1in . . .

let y1m1=πm1k1in K0

1|}

Γ2, x11 :{|τ11|}, . . . , x1k1:{|τ1k1|}

. . .

n={|let y11 =π2knin . . .

let ynmn=πmn+1knin K0

n|}

Γ2, xn1:{|τn1|}, . . . , xnkn:{|τnkn|}

y11, . . . , y1m1= fvs(K1)\{x11, . . . , x1k1}

. . .

yn1, . . . , ynmn= fvs(Kn)\{xn1, . . . , xnkn}

τk1={{|τ1|}→,{|Γ(y11)|},...,{|Γ(y1m1)|}}

. . .

τkn={{|τn|}→,{|Γ(yn1)|},...,{|Γ(ynm1)|}}

Figure 4.21: Closure conversion (part 2)

64 Chapter 4. Typed Dynamic Continuation-passing Style

{|letfun

fτ1

1x11 . . . x1k1=K1

. . .

fτn

nxn1. . . xnkn=Knin K|} Γ =

letfun g{|τ1|}→

1f1x11 . . . x1k1=K0

. . .

g{|τn|}→

nfnxn1. . . xnkn=K0

nin

letclos fτf1

1= [l1, y11, . . . , y1m1]

. . .

fτfn

n= [ln, yn1, . . . , ynm1]

{|K|} Γ2

where Γ2= Γ, f1:{|τ1|}, . . . , fn:{|τn|},

g1:{|τ1|}→, . . . , gn:{|τn|}→

1={|let y11 =π2f1in . . .

let y1m1=πm1f1in K0

1|}

Γ2, x11 :{|τ11|}, . . . , x1k1:{|τ1k1|}

. . .

n={|let y11 =π2fnin . . .

let ynmn=πmn+1fnin K0

n|}

Γ2, xn1:{|τn1|}, . . . , xnkn:{|τnkn|}

y11, . . . , y1m1= fvs(K1)\{x11, . . . , x1k1}

. . .

yn1, . . . , ynmn= fvs(Kn)\{xn1, . . . , xnkn}

τf1={{|τ1|}→,{|Γ(y11)|},...,{|Γ(y1m1)|}}

. . .

τfn={{|τn|}→,{|Γ(yn1)|},...,{|Γ(ynm1)|}}

Figure 4.22: Closure conversion (part 3)

{| · |} : CType →CType

{|α|} =α

{|⊥|} =⊥

{| ?|} =?

{|(τ1, . . . , τn)|} = ({|τ1|},...,{|τn|})

{|(#τ1, . . . , τn#)|} = (#{|τ1|},...,{|τn|}#)

{|T τ1. . . τn|} =T{|e1|} . . . {|τn|}

{|τ1→τ2|} ={{|τ1|} → {|τ2|}}

{| · |}→: CType →CType

{|τ|}→=τ0where {τ0}={|τ|}

Figure 4.23: Closure conversion of types

4.7 Generation of Machine Code 65

(which are now closed and therefore independent of lexical scoping) to the top level

(hoisting phase). We only perform the first phase, for the following reason: we cannot

perform hoisting, because we do not close over free type variables. The reason is that

type variables are not present in the code which is passed to the code generator, they

are eliminated by the specialization algorithm described in Section 6. It is possible

(and normal) for function and continuation definitions to have free type variables.

The practical advantage when avoiding the hoisting phase, which brings all functions

to the same lexical level, is that the scope of a type function (which is specialized

en bloc) contains all internal function and continuation definitions, which need to

be specialized anyway. So a single invocation of the specializer generates more code,

thus reducing the overhead of the specializer as it needs to be invoked more rarely.

Running Example

Figure 4.24 shows the closure-converted running example. All functions are extended

to receive their closures as the first arguments. Function applications are preceded by

projections from closures, and closures are explicitly constructed using letclos terms.

In order to save space, we only included the definition of the sum function and its

closure.

When this resulting program is executed, the specialization mechanism for polymor-

phic functions presented in Chapter 6 will compile the summation loop into code

which passes results in floating-point registers to continuations and uses floating-

point instructions built into the processor. The list data type is also specialized to

floating-point values, making efficient use of storage and aligning floating-point val-

ues so that fast access is possible. This monomorphization process of functions and

polymorphic data types is discussed in detail in Chapter 6.

4.7 Generation of Machine Code

The generation of machine code from closure-converted CPS terms is straightforward,

since the program is already in a low-level form. Many constructs of the CPS language

correspond to just one machine-level concept. We will only briefly describe our code

generator, because our focus is on the intermediate representation here.

The code generation function receives several parameters: a mapping from variable

names to locations, a stack depth, a heap pointer (which records how many heap

allocations have taken place since the start of a function) and the term to be trans-

lated. Locations are for example registers, stack slots, addresses relative to the heap

pointer, absolute addresses or integer or floating-point literals. The mapping is called

compile-time environment.

The various language constructs are translated as follows:

let expressions Value bindings either add a binding to the compile-time environ-

ment (for simple values), or are translated to the construction of a heap record

(for tuples and constructed values). Bound variables are then represented as

66 Chapter 4. Typed Dynamic Continuation-passing Style

letfun

sum.8’ (sum.0: {* ==> {(List $15) ==> $15}}, $15: *,

k0.1: {~{(List $15) ==> $15}}, h1.2: {~Exn}) =

letfun

lam2.6’ (lam2.3: {(List $15) ==> $15}, l.4: (List $15),

k3.5: {~$15}, h4.6: {~Exn}) =

let sum.0 = proj2,2 lam2.3 in

case l.4 of

Nil -> letval lit7.8 = 0 in

let k3.0’ = proj1, k3.5 in k3.0’ k3.5 lit7.8

Cons x.9 xs.10 ->

letval ty8.12 = $15 in

letcont k9.4’ (k9.13: {~{(List $15) ==> $15}},

x10.14: {(List $15) ==> $15}) =

let h4.6 = proj2,5 k9.13 in

let k3.5 = proj3,5 k9.13 in

let x.9 = proj4,5 k9.13 in

let xs.10 = proj5,5 k9.13 in

letcont k11.2’ (k11.15: {~$15}, x12.16: $15) =

let k3.5 = proj2,3 k11.15 in

let x.9 = proj3,3 k11.15 in

let x13.17 = x.9 p+ x12.16 in

let k3.1’ = proj1, k3.5 in

k3.1’ k3.5 x13.17 in

letclos k11.15 = [k11.2’, k3.5, x.9] in

let x10.3’ = proj1, x10.14 in

x10.3’ x10.14 xs.10 k11.15 h4.6 in

letclos k9.13 = [k9.4’, h4.6, k3.5, x.9, xs.10] in

let sum.5’ = proj1, sum.0 in

sum.5’ sum.0 ty8.12 k9.13 h4.6

in letclos lam2.3 = [lam2.6’, sum.0] in

let k0.7’ = proj1, k0.1 in k0.7’ k0.1 lam2.3

in letclos sum.0 = [sum.8’] in ...

Figure 4.24: Running example: closure converted

4.7 Generation of Machine Code 67

pointers on the stack, and the stack slot binding is added to the compile-time

environment.

projections Projections are translated to loads from memory.

function and continuation bindings The bodies of functions and continuations

are translated and the addresses of the resulting code sequences are added to

the compile-time environment.

function and continuation calls Calls are translated into a code sequence which

loads the parameters into their correct argument positions, followed by a jump

instruction.

case expressions These are translated into a load of the scrutinized variable and

a sequence of tests, which determine the constructor of the value, followed by

conditional tests.

closures, tuples and constructed values These are created on the heap.

primitive operations These are translated to their respective machine instructions,

depending on their actual type (e.g., integer or floating-point addition).

Our extensive use of the compile-time environment allows us to load constants lazily,

and even to delay operations such as projections: when a projection of a value, which

is currently held in a register, is compiled, it can be translated to a register-indirect

reference with constant offset. Using this technique, we make extensive use of the

processor’s addressing modes in order to avoid superfluous instructions and wasted

registers.

Since arguments are passed in registers and stack slots, the algorithm for placing

arguments in their correct locations must build an interference graph before emit-

ting any move instructions. Otherwise, parameters which are needed later could be

overwritten by earlier parameters.

We also use a special parameter assignment scheme to avoid move instructions on

function/continuation calls. For example, continuation closures are passed to user

functions in register edx. Continuations expect their closure argument in the same

This optimization is possible, because continuation parameters can be recognized by

their type. Other types of parameters with special argument locations are integers,

floating-point registers or failure continuations. For integers, the first argument is

passed in register ecx and others are passed on the stack, for floating-point values,

six parameters are passed in registers xmm2 to xmm7 and the rest on the stack. Failure

continuations are passed in memory, because they are infrequently accessed.

Running Example

We do not present the complete machine code for the running example, because it is

very long. Instead, we have picked out a few small fragments of machine code and

68 Chapter 4. Typed Dynamic Continuation-passing Style

relate it to the closure-converted version of the program in Figure 4.24.

The function sum.0, which is a type function in the closure-converted program, is

translated to the following code which calls the polymorphic specializer. Details of

this process are described in Chapter 6.

sum.0:

pushad ; save registers

push dword ptr [esp+32] ; push type parameter

push esp ; push additional parameters

push 1H ; ...

push 1aH

push 1bH

push 0f0f0f0f0H

push 11H

push 1H

call [0b7b7101cH] ; call code generator

add esp,20H

popad ; restore variables

mov eax,dword ptr [esi+24] ; fetch address of

jmpn eax ; compiled code and jump

When the specializer has created the floating-point version of the sum function, the

core of the summation loop looks as follows:

k11.24:

movsd xmm7,dqword ptr [edx+8] ; fetch operand from closure

addsd xmm7,xmm2 ; add parameter

mov eax,dword ptr [edx+16] ; load continuation

push dword ptr [eax+4] ; from closure

mov edx,dword ptr [edx+16] ; move parameters

movsd xmm2,xmm7 ; to correct locations

mov eax,dword ptr [esp]

add esp,4H

jmpn eax ; jump to continuation

We can see that the compiler has created machine code with floating-point instruc-

tions, because the sum function has been used in a floating-point context.

Chapter 5

Incremental Compilation

One property of dynamic compilers is that it is not necessary to compile the complete

program on each run. Only the parts of the program which are reached during

a particular program run need to be translated. For large programs with a small

working set (that is, a small set of functions which are evaluated on a “typical” run),

this feature can save a lot of time, since no time is spent on generating code for

unused functions. Additionally, code buffer space is saved.

The idea is to delay the generation of code until it is certain that the code will be

executed—this is similar to lazy evaluation, where computation is delayed until a

result is required, for example for output. This process of delayed code generation is

called incremental compilation (Johnston, 1979).

In this chapter we describe the necessary extension to the intermediate CPS language

from Chapter 4, we show how code is generated for delayed expressions, and the

heuristics used to decide which expressions to delay selectively.

5.1 Language Extension

We have realized incremental compilation by adding a special syntactic construct

to the intermediate language: delay expressions. A delay expression wraps a CPS

term. The extension to the CPS language is modest and is shown in Figure 5.1. The

syntax is extended with delay expressions, and the extension to the type system is

also shown.

The dynamic semantics would not be changed when represented as a big-step se-

mantics, because the meaning of delay expressions is the same as the meaning of

their subexpressions. Delay expressions only have an effect on code generation, so a

low-level operational semantics which realizes the dynamic translation would have to

model the delayed generation of target code.

70 Chapter 5. Incremental Compilation

Syntax extension:

(terms) Term 3K, L ::= . . .

|delay Kdelay expression

Type system extension:

(delay) Γ; ∆ `Kok

Γ; ∆ `delay Kok

Figure 5.1: CPS language extension for incremental compilation

5.2 Code Generation for Delay Expressions

When the code generator encounters a delay expression, it does not generate the code

for the wrapped term, but instead generates a call to the run-time code generator.

In order to access the free variables of the delayed expression, the call is preceded by

a sequence of instructions which pushes the values of all free variables to the stack

and by extending the compile-time environment to record these new mappings for

the variables. The parameters for the call to the code generator are:

−The number of words on the run-time stack.

−The current compile-time environment, which records the mapping from vari-

able names to locations.

−The term which is wrapped in the delay expression.

−A pointer to the current state of the run-time compiler.

−A machine code label for the first parameter-pushing instruction in this sequence

(needed for patching later).

After this code sequence, instructions are generated for loading the result value of

the call into a register and for jumping to the returned address.

The instruction sequence generated for the delay expression is quite short (about a

dozen instructions), so code generation is finished quickly.

When later during the execution of the program the call to the code generator is

reached, control is transferred to the run-time compiler. It proceeds by (a) placing

new delay expressions in the body term (if any, see below), (b) generating code for

that body, (c) patching the calling instruction stream with a jump to the freshly

generated code, and (d) returning the address to the calling code.

The following pseudo-machine code demonstrates the compilation of delay expres-

sions. Suppose we have a term

t=delay (k x)

which represents a delayed continuation call. The following code will be generated:

5.2 Code Generation for Delay Expressions 71

push <k> ; push the values of free variables

push <x>

L0: ; address to patch later

push <stack-depth>

push <compile-time-env>

push <term>

push <execution-state>

push L0

call <generator> ; initiate code generation

pop-arguments ; remove all parameters from the stack

jmp <return-value> ; invoke newly-generated code

When the code is executed, the free variables kand xwill be pushed to the stack and

the code generator is called with its arguments. It will then generate code similar

to the following (ignoring closure pointers for clarity), which loads xfrom its stack

location to its correct parameter location and kinto a register and then jumping to

the continuation through the eax register.

L1:

mov eax, [esp+4] ; load continuation address

mov ebx, [esp] ; move parameter to correct location

mov [esp+4], ebx

add esp,4 ; adjust stack pointer

jmp eax ; invoke continuation

After this code is generated, we need to update the original instruction stream to

avoid another code generation for the delayed expression when it is executed again.

So the code sequence for the delay expression above will be modified in memory to

read:

push <k> ; old code

push <x>

L0:

jmp L1 ; <--- new instruction

<garbage> ; old code, partly overwritten

push <term>

push <compile-time-env>

push <execution-state>

push L0

call <generator>

pop-arguments

jmp <return-value>

The instructions after the jmp L1 are now unreachable and could be re-used by

a sophisticated code buffer manager. Note that writing the jump instruction into

72 Chapter 5. Incremental Compilation

memory will probably partly overwrite old instructions, which are no longer well-

formed instructions. Since they will never be executed, this is no problem, though.

On the second execution of the delayed expression, the only inefficiency left is the

pushing of the free variables and the patched jump instruction.

5.3 Placing of Delay Expressions

Of course, wrapping a delay expression around every subterm of a program is inef-

ficient, because the call of the code generator and code generation take some time.

We need to selectively add delay expressions to subterms, and this decision can be

based on measurements, static heuristics, or both.

In our implementation we have experimented with the following heuristics: delay

expressions are only wrapped around bodies of user functions, because these tend to

be the places where code generation for large expressions can be avoided by inserting a

single delay expression. Other expressions, which represent straight-line code (value

bindings, primitive operations, projections, etc.) do not gain from being delayed,

because code can be generated quickly for these. The other language construct where

we add delay expressions is in the branches of case expressions, because some branches

of case expressions will never be executed on a successful program run, for example

tests for error conditions and their handling. Other branches will only seldomly be

executed, so we hope to save time for the majority of program runs. The drawback of

this scheme is that some branches will be executed on all program runs, so delaying

them wastes time.

Therefore, in addition to static heuristic measures, we have added some profiling

support. For all functions, continuations and branches of case expressions, the system

can be instructed to generate instrumentation code. The instrumentation instructions

count how many times during a program run a particular function, continuation or

case alternative is executed. When the program halts, these counters are written

to disk, and on the next run, the counters are read in again and made available to

the code generator. The algorithm for placing delay expressions can now consult the

counters to decide whether to delay the body of a particular function, continuation

or branch, or not. In the simplest case, only code which has not been executed in the

last run will be delayed. More static or dynamic measures could of course be taken

into account, but we have not yet investigated this topic any further.

Additionally, we can apply the simple rule that bodies of polymorphic functions are

never delayed, since their code is generated on demand when they are applied anyway

(see Chapter 6 on Run-time Monomorphization). Code generation for polymorphic

functions is thus always incremental, so delaying them would not save on code gen-

eration time or code size, but would instead incur overheads because two calls to the

code generator would be made in sequence.

Incremental compilation is initiated by wrapping a delay expression around the com-

plete program. Since code generation for delay expressions triggers additional place-

ment of delay expressions on subterms, the complete program will eventually be

5.4 Discussion 73

selectively delayed.

5.4 Discussion

Note the close relationship of the implementation of delay expressions to lazy evalu-

ation and closure representations. Similar to lazy evaluation, work is deferred until

it is needed. In our case, code generation is avoided, in lazy evaluation computation

is avoided. When the code is eventually needed, we generate it and update the in-

struction stream to jump to the newly created code. This is similar to updates of

heap nodes in implementations of lazy evaluation, where indirection nodes are used in

some implementations (for example the Spineless tagless G-Machine, (Jones, 1992)).

Similar to closure representation, the code generated for a delay expression closes

the abstract syntax tree of the wrapped expression with respect to its free variables

(by pushing them on the stack and recording their locations) and the compile-time

environment (which is stored in the instruction stream and used later for code gen-

eration).

In order to gather some experimental results on the effectiveness of delayed compila-

tion, we have performed some experiments which are presented in Chapter 8.

74 Chapter 5. Incremental Compilation

Chapter 6

Run-time Monomorphization

Based on the intermediate program representation defined in Chapter 4, we now

develop a dynamic specialization strategy which eliminates polymorphism from func-

tional programs at run-time. Specialization of polymorphic programs with respect

to types is called monomorphization, because it results in monomorphic programs

(Boquist, 1999; MLton Developers, 2006).

Monomorphization is a simple but effective method for reducing the run-time cost

of language abstractions. In functional languages, abstractions such as functions,

polymorphism, type classes and parameterized modules are used very frequently, and

their efficient implementation is the key to the usability of such languages.

Except for the treatment of type classes in Section 6.1.3, the techniques in this thesis

should be applicable to any polymorphic functional languages with algebraic data

types and higher-order functions.

The main goal of our dynamic compilation system is the reduction of unnecessary

indirections due to abstractions. Polymorphic data types and functions require either

a uniform representation of data and functions, or specialized versions of all used

functions for the data types actually occurring in a program. In an open system,

such specialization at static compile time is not possible, because the compiler cannot

know which data types may be used in any possible program run. Therefore, in our

approach, code and data specialization is deferred until run-time, where all needed

data types are known and monomorphic code for all polymorphic functions can be

produced.

The basic idea is that polymorphic functions are translated to code generators, and

that calls to polymorphic functions pass types for which the target function should

be specialized. Each call to a polymorphic function therefore generates new machine

code, specialized to the given types. Caching techniques are used to avoid too many

specialized variants of a polymorphic function.

Decompilation and deoptimization techniques (Kotzmann and M¨ossenb¨ock, 2005)

can be used once a violation of the assumptions of the specializer is detected and the

needed invariants can be reestablished by recompiling the code.

Our specialization implements several language features, which are all useful in their

own right, but can be combined for best effect: run-time monomorphization creates

76 Chapter 6. Run-time Monomorphization

monomorphic instances of polymorphic functions for all usages. Monomorphization

can also be used for specializing polymorphic data types. Dynamic type class res-

olution is an implementation technique for type classes which is different from the

dictionary-based implementations normally used (Hall et al., 1996; Augustsson, 1993).

This kind of type class resolution can be compared to the work of Jones (1995), but

since we do monomorphization of (parametric) polymorphic functions anyway, dis-

covering method invocations and the identification of overloaded functions is much

simpler and can be done in parallel to the monomorphization.

In earlier work, all these abstractions required different implementation techniques.

In the following sections, we will show that our approach subsumes them in a fairly

simple specialization model.

In our prototype implementation, we have implemented specialization of polymorphic

functions, resolution of overloaded literals and arithmetic operations and specializa-

tion of polymorphic data types, including data representation and layout optimiza-

tions.

Other possible applications of dynamic specialization can be found in Section 9.2.2.

6.1 Specialization of Polymorphic Functions

Polymorphism in functional languages means one of two language features: paramet-

ric polymorphism, which allows the definition of functions over unknown types, and

ad-hoc polymorphism or overloading, which refers to the possibility of giving the same

name to different entities. The run-time specialization approach in this section can

eliminate all overhead which may result from both of these abstraction constructs.

6.1.1 Polymorphic Functions

Polymorphic functions are functions defined over values of unknown types. Consider

for example the classic map function, which applies a function to all elements of a

list. This function has the type:

map :∀αβ.(α→β)→List α→List β

The universally quantified type variables αand βare arbitrary, but fixed in the type.

Polymorphism is a very powerful abstraction mechanism. It allows the programmer

to write very general functions, which can be partly oblivious to the types of the

values they operate on. The disadvantage is that it is difficult to generate a machine

code sequence which realizes the algorithm of such a function. Since the type of

the function does not have any information on the machine representation of the

parameters, an implementation must compromise: either a uniform representation

is chosen for all data types, or separate versions of the function must be created,

each for a different type. In the former case, the representation of values cannot be

exploited by using special instructions or calling conventions which may be provided

6.1 Specialization of Polymorphic Functions 77

by the machine for particular data types, in the latter, many nearly identical versions

of code must be generated.

Our first use of specialization is used to remove parametric polymorphism from the

program code actually executed. The idea is simple: when a polymorphic function

is instantiated (that is, applied to a type parameter), the actual type parameter is

substituted for the formal parameter in the body of the function, and this body is

then passed to the code generator. Since the code generation is based on the special-

ized instances of the function definitions, the code generator can take advantage of

the fact that all type information is present. It can therefore use the most efficient

machine representations and calling conventions for a given data type for the ma-

chine executing the program. For example, integer values can be passed in different

registers than floating-point values, and different machine instructions can be used

for the addition, comparison, etc. of machine representations. These specializations

are already done in many static whole-program compilers, but in dynamic compilers,

the approach is much more flexible.

The Specialization Mechanism

The specialization mechanism is invoked when a type function is called. This can

only take place after some code has already been executed, because programs are

required to be monomorphic at the top level, and only embedded function definitions

can be polymorphic. This is the mechanism for function monomorphization:

1. All functions which receive any type parameters are compiled specially by the

code generator. They are always functions which return functions (this is guar-

anteed by the CPS conversion), so their body always consists of local function

definitions, closure creations for the defined function and a call to the success

continuation which passes one of the closures. Instead of directly generating

code for the function body, a call to the run-time specializer is generated. This

call sequence consists of pushing the arguments for the specializer onto the

stack, calling the code generator which generates specialized code and then

jumping to the freshly generated code. The parameters for the code genera-

tor include: all actual type parameters, the current compile time environment

(mapping variable names to locations), a pointer to the AST of the function

itself, the current execution state of the virtual machine, and the current stack

depth. The compile time environment and the stack depth are required so that

the code generation for the specialized function body can access the parameters.

This is important, because the function body will call its success continuation.

2. The code generator takes the list of actual type parameters from the run-time

stack and the list of formal type parameters from the function AST, and then

substitutes the former for the latter in the AST. The resulting AST is monomor-

phic,1and the code generator is called on it, returning the start address of the

1Actually, the resulting code may contain polymorphic functions, but before any of them could

be executed, it would be monomorphized recursively.

78 Chapter 6. Run-time Monomorphization

let id x = x

in id id ()

Figure 6.1: Identity function example

newly generated code block. This address is passed back to the code generated

in step 1, which jumps to it.

The specializer avoids the generation of duplicate code by maintaining a mapping

from function definition/type list-pairs to addresses of compiled code. When it is

applied to a concrete list of types and a function definition to specialize, it looks up

the pair in the mapping and returns the address of already generated code if possible;

otherwise, the code is generated and its address is stored in the mapping.

Another optimization, described in more detail in Section 7.2, is inline caching. De-

pending on a command line option, the code generator generates a series of com-

parisons before the call to the specializer. These comparisons compare the actual

type parameters against values stored in the machine code, which are written to the

instruction stream after each specialization. In the common case where a function

is repeatedly called with the same type parameters, these checks succeed and jump

around the specializer call, loading a constant address to jump to instead, which

also has been placed by the specializer. This sequence of checks is much faster than

a call to the specializer. For similar reasons, inline caching has been used for the

implementation of object-oriented languages (Detlefs and Agesen, 1999).

Specialization Example

Consider the small example program in Figure 6.1. The polymorphic identity function

id is defined, and it is first applied to itself, and then the result is applied to the

value () (which denotes the single element of the unit type).

Figure 6.2 lists the code which is produced for this example by the code generator.

The register edi is used as the heap pointer, registers edx and ecx are used for

passing parameters.

Lines 1–8 of the machine code create closures for the identity function and a continu-

ation, which is not of interest here. The continuation definitions have been omitted.

Heap overflow checks and calls to the garbage collector have also been deleted to save

space, and the code was compiled with inline caching switched off. The first few la-

bels refer to continuation code which will be ignored here, but note the code starting

at label id.0 (line 22). The push instructions and the call invoke the specializer, and

after it returns, the address of the generated code is loaded into register eax and then

jumped to.

The code which is generated on the first call to the identity function in the example

program (id id) results in the code shown in Figure 6.3. The lines up to the first

label construct a closure for the monomorphized identity function which starts at

label lam2.4 (line 8). This closure is then passed to the success continuation (lines

4–7).

6.1 Specialization of Polymorphic Functions 79

1add edi,8H ; allocate heap memory

2mov dword ptr [edi-4],0b7a0215cH; initialize closure

3mov dword ptr [edi-8],13H

4add edi,10H

5lea ebx,[edi-24] ; allocate heap memory

6mov dword ptr [edi-8],ebx ; initialize closure

7mov dword ptr [edi-12],0b7a02088H

8mov dword ptr [edi-16],15H

9sub esp,4H ; set up arguments

10 mov ebx,dword ptr [esi+12]

11 mov dword ptr [esi+16],ebx

12 lea edx,[edi-16] ; load success continuation

13 mov dword ptr [esp],14H

14 lea ecx,[edi-24]

15 jmp [0b7a0215cH] ; jump to function

16 k6.14: [b7a02088]

17 ...

18 k9.19: [b7a020ec]

19 ...

20 k11.23: [b7a02144]

21 ...

22 id.0: [b7a0215c]

23 pushad ; save registers

24 push 0b7a0215fH ; push arguments

25 push dword ptr [esp+36]

26 push esp

27 push 1H

28 push 19H

29 push 1aH

30 push 0b7a0216aH

31 push 12H

32 push 1H

33 call [0b7c6701cH] ; call specializer

34 add esp,24H

35 popad ; restore registers

36 mov eax,dword ptr [esi+24] ; load address of generated

37 jmpn eax ; code and jump

Figure 6.2: Assembler code for identity example

80 Chapter 6. Run-time Monomorphization

1add edi,8H

2mov dword ptr [edi-4],0b7a0226bH

3mov dword ptr [edi-8],13H

4lea ecx,[edi-8]

5mov eax,dword ptr [edx+4]

6add esp,4H

7jmpn eax

8lam2.4: [b7a0226b] ; code for identity function

9mov ecx,dword ptr [esp] ; load argument for

10 mov eax,dword ptr [edx+4] ; success continuation

11 add esp,4H

12 jmpn eax ; jump to continuation

Figure 6.3: Assembler code for identity example: integer version

1add edi,8H

2mov dword ptr [edi-4],0b7a021c6H

3mov dword ptr [edi-8],13H

4lea ecx,[edi-8]

5mov eax,dword ptr [edx+4]

6add esp,4H

7jmpn eax

8lam2.1’: [b7a021c6] ; floating-point version

9mov eax,dword ptr [edx+4] ; no argument moving necessary

10 jmpn eax

Figure 6.4: Assembler code for identity example: floating-point version

In our example, the second invocation of the identity function specializes the function

for the unit type, which results in exactly the same code because the unit type and

closures are both represented as 32-bit machine words.

Figure 6.4 shows the code which would result if the identity function was specialized

for the type float. The code for building the closure and returning remains unchanged,

because it is independent of the type parameter in this case. The monomorphized

identity function (which starts at label lam2.1’ in the example) is different from the

integer version: it does not even need to copy its input, because the first floating

point parameter for both functions and continuations is always passed in the same

We can see in these examples how all generated code is independent of the types

insofar as no tag handling or type passing is required for monomorphized code.

6.1 Specialization of Polymorphic Functions 81

1data Tree a = Leaf a

2| Node (Tree (a, a)) (Tree (a, a))

3let size t :: Tree a -> Integer =

4case t of

5Leaf x -> 1

6Node l r -> size l * 2 + size r * 2

7in

8let val t1 = Node (Leaf (1, 2))

9(Node

10 (Leaf ((3, 4), (5, 6)))

11 (Leaf ((7, 8), (9, 0)))) :: Tree Integer

12 in

13 puti (size t1) (\ v -> v)

Figure 6.5: Polymorphic recursive example program

6.1.2 Polymorphic Recursion

One complication of our dynamic monomorphization approach arises in the presence

of polymorphic recursion. Polymorphic recursion means that a recursive function calls

itself (or other recursive functions in the same group of mutually recursive functions)

with different types than those with which it was called. As an example, see the

program in Figure 6.5, which defines both a polymorphic recursive data type Tree a

(lines 1–2) and a polymorphic recursive function size (lines 3–6), which calculates

the number of data elements in the tree. The program defines a tree (lines 8–11) and

writes the tree’s size (10 in the example) to the output (line 13).

The function size has the type annotation Tree a -> Integer, because our pro-

totype implementation can only infer polymorphic recursive types in the presence

of type annotations. This is similar to Haskell, which also supports polymorphic

recursion for type-annotated functions.

The data type definition means that a value of type Tree a is either a leaf, which

carries a value of type a, or an internal node with two children of type Tree (a, a).

That means that the leafs on each level have the same number of data elements, and

that a leaf at level n+ 1 has twice as many data elements as a leaf at level n. The

tree constructed in the example program in Figure 6.5 is shown in Figure 6.6.

Polymorphic recursive data types similar to the one presented here can be used to

type balanced and/or complete binary trees, for example.

When our monomorphization technique from the previous section is used for spe-

cializing polymorphic recursive functions, the following problem occurs: for each

recursive call, which is a type application of a new type, a new specialized version

of the function is created. The example program above would lead to three spe-

cializations for the size function: for the types Integer,(Integer, Integer) and

((Integer, Integer), (Integer, Integer)). Of these types, the latter two have

the same machine representation, which is a pointer to a heap allocated structure (a

82 Chapter 6. Run-time Monomorphization

(1,2)

((3,4),(5,6)) ((7,8),(9,0))

Figure 6.6: Polymorphic recursive example tree

tuple). Same machine representations lead to the same machine code, so code buffer

space is wasted. The first type is represented as an unboxed integer value. Therefore,

it should suffice to generate two specialized versions, one for unboxed integers and

one for tuples.

We have solved this problem by grouping types with the same machine representation

into equivalence classes. The specializer only creates new versions of polymorphic

functions for new types when no specialization for the equivalence class of that type

has been generated. The example program in Figure 6.5 then leads to the desired

two specializations.

We currently distinguish three equivalence classes:

32-bit word Unboxed 32-bit value. Members of this class are not interpreted by the

garbage collector. The built-in integer type and function pointers are in this

class.

64-bit floating-point Floating-point values have special register passing conven-

tions. Otherwise, they are also not interpreted by the garbage collector.

32-bit pointer Pointer to a heap-allocated record. These are traced by the garbage

collector and must point to a header which defines the record layout.

When needed, this set of classes can easily extended, e.g. with 32-bit floating-point

or 64-bit or 128-bit integer values.

Polymorphic recursive data types like the type Tree a above are handled in the same

way by distinguishing members of the given equivalence classes (see Section 6.2).

Note that because of our support for polymorphic recursion, call sites of polymorphic

functions can not be patched to jump to the monomorphic code like we have done

for incremental compilation in the previous chapter. The problem is that one call

site can possibly refer to calls with different concrete type parameters, so that the

code which is generated for each call may expect different calling conventions and

data layout. Therefore, no patching is done for monomorphized code and we rely on

inline-caching for efficiency.

6.1 Specialization of Polymorphic Functions 83

6.1.3 Type Classes

Type classes in Haskell (Hall et al., 1996) have originally been developed in order to

tackle the problem of ad-hoc polymorphism (also called overloading) in a statically-

typed language with type inference in a systematic way. The idea is to define classes

of types by specifying a set of functions on these types, and then to add types to the

classes by giving specific instances of these functions (also called methods). As an

example, consider the class Eq a, which is the class of all types athat define equality.

By giving instances, such as the instance Eq Integer for the integers, the type class

gets populated. Note that in order to define an instance, a type-specific equality

function (in the example for integer numbers) needs to be written.

In Haskell, the (simplified) definition of the Eq class looks similar to the following:2

class Eq a where

eq :: a -> a -> Bool

neq :: a -> a -> Bool

neq x y = not (eq x y)

The type signatures for eq and neq declare them as methods, and the last line provides

a default method for neq. Because of this default method, it is not necessary for all

instances of Eq to provide a definition of this function, as it can be taken from the

class. An instance for type Bool is given next:

instance Eq Bool where

eq True True = True

eq False False = True

eq _ _ = False

Here, the method eq for type Bool is defined by pattern matching over the construc-

tors False and True. When the function neq is used with boolean arguments, it will

use the definition from the type class above, wich in turns calls the eq function for

Bool and negates its result.

In our language, user-defined type classes are not supported. Some operators which

can be found in the Haskell type classes Num (+, −,×,/, negation) and Ord (<,>,

=, 6=, ≥,≤) are instead built into the compiler, as described in Section 4.3.2.

Most of the machinery for overload resolution is already provided by function special-

ization: when type values are substituted for type variables, the type annotations on

overloaded integer literals and polymorphic primitive operators are instantiated with

concrete types. The code generator is therefore given fully type-annotated monomor-

phic code. Integer literals which are annotated with the type float, for example can

then be coerced to float literals during code generation, without any costly run-time

type conversions.

The only drawback of this approach is that error messages are delayed: some type

errors cannot be found at type inferencing time, but only later, when a polymorphic

function is called at a particular type and its code is generated for the first time.

2Haskell uses infix operators instead of eq and neq.

84 Chapter 6. Run-time Monomorphization

Extension of our intermediate language and type system to support type classes

is a topic of future work (Section 9.2.3). Combining type passing, cases on type

constructors, optimization and dynamic specialization would lead to elimination of

all overhead related to type classes.

6.2 Data Type Specialization

Similar to the specialization of polymorphic functions in the previous section, poly-

morphic algebraic data types can also be made monomorphic at run-time. The spe-

cialization algorithm from Section 6.1.1 only substitutes types for type variables, it

does not state anything about the representation of constructor terms. When compil-

ing for a real machine, these representation decisions have to be made. In traditional

implementations of polymorphism, polymorphic values are represented in a standard

form, normally as pointers to memory blocks which have the correct size to hold

values of the used type. When functions are monomorphized, there is no need for a

common representation, instead the representation can be tailored to the data type

under consideration. The advantages are similar to the case for functions: individual

data values can make better usage of machine-specific representations and access time

to data elements can be reduced.

One disadvantage for using this technique in lazy functional languages is that it must

be combined with strictness analysis or programmer-supplied strictness annotations

in order to discover opportunities for optimization. Without strictness information,

the optimizer has to assume that each element of an algebraic type has the additional

value ⊥, so that unboxing cannot be performed without changing the behavior of the

program.

Kafka, as a strict language, does not suffer from this problem, but further investiga-

tion of dynamic specialization and laziness is also a topic for future work.

Modern processors are very sensitive to the layout of compound data in memory and

to access patterns to these data elements. Large, but slow memories and multiple

layers of cache memory of limited size require optimization of storage layout for the

cache architecture of the machine on which a program is running. With a dynamically

compiling system, data structures can be optimized for the executing machine when

a program is run.

Of course, not all such optimizations can be expressed in our high-level description,

but only in a low-level language for expressing data layouts. Our implementation

handles the ordering of data fields on the level of the CPS language, but alignment

is done by the code generator on the fly.

Since performance of programs on modern processors depends greatly on the ability

to make use of data and instruction caches, we focus on this aspect of data represen-

tation. While most optimizations on programs target instruction cache locality, data

representations affect the locality of the data caches.

When specializing data structures, we need to take several aspects into account:

−Data structure elements should use their native encoding, to avoid any boxing

6.2 Data Type Specialization 85

and unboxing or tagging and untagging overhead.

−Values should be aligned for optimal access times. On some platforms, it may

even be necessary to align them, e.g. on SPARC (Weaver and Germond, 1994),

which raises a hardware exception for unaligned accesses.

−Data structure elements should be grouped together so that cache locality can

be exploited. Often used elements should be put together, less frequently used

elements can be put out-of-line. This is an optimization for future work (see

Section 9.2.4).

6.2.1 Type-directed Representation Selection

First of all, we have to decide how to represent the different types of the source

language.

−Values of the predefined integer type are represented as 32-bit signed words.

−Values of the predefined floating-point type are represented as 64-bit IEEE

floating-point numbers.

−The predefined Boolean type is represented as if it were declared as the algebraic

data type

data Bool = False | True

in the source program.

−The predefined unit type is represented as the integer 0. Theoretically, values of

this type would never need to be explicitly represented, but doing so simplified

our implementation.

−Closures are represented as pointers to heap records, where the first field is a

header, the second a function pointer and the free variables are represented as

if they were members of a user-defined constructor (see below).

−User-defined algebraic types are in principle represented as pointers to heap-

allocated records (for exceptions, see below). The first field is a header, the

second a small integer representing the variant (also called the tag). All other

fields are represented as described in this section (recursively) and layed out as

described in the next section.

−Tuples are also represented like user-defined product types, but without the

need for a variant tag.

We have additionally implemented the following type-directed data representation

optimization for algebraic types:

−Data types where no variants contains fields (enumerations) are represented as

consecutive small integer values. This allows efficient dispatch on the construc-

tors.

86 Chapter 6. Run-time Monomorphization

−Data types which have both constructors with and without fields use both

unboxed and boxed representations: constructors without fields use unboxed

small integer values, constructors with fields use pointers to allocated heap

records. When a case distinction is made on values of this type, the unboxed

variants are tested first, when the value does not correspond to an unboxed

variant, it is safe to load the first word from the heap object the value points

to. Dispatch then proceeds on the constructor tag stored in the heap-allocated

record.

Heap pointers can always be distinguished from unboxed constructors because the

latter are small integers which are different from valid heap addresses on most ma-

chines. On machines without unused ranges in the address space, data types with at

least one variant with a field, all values of these types would need to be boxed.

The propagation of type information to the code generator allows to represent in-

teger values as full machine words, without any need for tagging (as was used in

the SML/NJ compiler, for example (Appel, 1992)). Using the type information, the

code generator can construct precise tables describing which registers and stack slots

contain possible pointers to heap cells and which do not.

The second step in choosing a representation is the layout of individual fields, which

is described below. Fields of different sizes should be grouped such that there are no

unnecessary gaps due to alignment. This requirement can be fulfilled using only size

information of individual fields.

Examples

The boolean type

data Bool = False | True

is represented by the integers 0and 1.

The list type

data ListInt = Nil | Cons Integer ListInt

(which results from specializing the type List a for type Integer) is represented by

the integer 0for the Nil variant and a tagged pointer to a heap-allocated record of

3 words: a tag, the integer field and a pointer to the rest of the list. Note that the

tag for this data type is not really necessary, because there is only one boxed variant,

but our implementation does not currently do this optimization. The type Maybe a

= Nothing | Just a can be represented similarly.

Figure 6.7 shows possible data representations for values of type List Integer and

List Float. In both cases, the constructor Nil can be represented by the integer

0. The Cons constructors should have different layouts for the Integer and Float

instantiations. The List Integer object (arbitrarily) has the data value in the

second slot (after the constructor tag), whereas for the List Float object, the data

6.2 Data Type Specialization 87

(a)

(b)

hdr cons 2

hdr cons 3.1459

Figure 6.7: Possible data representations for (a) List Integer and (b) List Float

cons

double 3.1459

Figure 6.8: Common data representation for List Float

field occupies two words. Since the header and the tag are eight bytes long, the

floating-point field follows directly, as it is 8-byte-aligned.

Consider the representation which is used in most existing functional language im-

plementations, shown in Figure 6.8. (The header field is omitted.) Here, the double

value is boxed to achieve a uniform layout of heap objects. The overheads are in-

creased memory requirements (thus increased allocation and garbage collection time)

and additional indirections to access the data.

6.2.2 Data Layout Algorithm

We have designed and implemented a simple data layout algorithm for structures

allocated on the heap. The goal of the algorithm is to determine good layouts for

algebraic data types, tuples and closures. The results should enable the following:

−Fields should be aligned at the boundaries required for their machine types.

−Cache locality should be improved by reducing memory requirements.

The algorithm proceeds in three steps:

1. For each field, its original index in the record, its size and alignment is calcu-

lated.

2. The fields are sorted by decreasing size (using a stable sort).

3. When a field is not aligned now at its natural alignment boundary, dummy

fields of word size are inserted before the field until it is aligned properly. Rhis

step is performed for all fields from first to last.

88 Chapter 6. Run-time Monomorphization

This information is used for the following tasks. The sizes of the fields and their

alignment must be known, because when creating a constructed value, a tuple or a

closure, the size to be allocated is the sum of the aligned sizes of all fields. The original

field indices are then required to map the fields in the intermediate representation to

the memory locations in the optimized layout. Similarly, on projections from tuples

and closures, the projection index must be mapped to an offset in the heap record.

Case expressions, which are used to extract fields from algebraic data types, also have

to fetch the fields in order to move them to the stack or to registers.

The resulting code is very efficient, because all references to fields are mapped to

constant offsets in the machine codes. Due to the specialization of polymorphic data

types, the implementation is free to choose any memory layout it wants. In our

prototype implementation, the optimized memory layout can be switched off using a

command line option, in order to measure its performance impact and to experiment

with alternative layout algorithms.

In Section 9.2.4, we propose some improvements to this basic layout strategy, for

example grouping of fields which are often used together.

Chapter 7

Implementation

The concepts and techniques presented in the previous chapters have been imple-

mented as a prototype system called Kafka. In this chapter, we give details on

several aspects of the implementation and discuss the optimizations used for improv-

ing the intermediate code. Most of the optimizations are adaptations of classical

transformations for functional languages adapted to our intermediate language and

specialization techniques. We also give a concrete description of the implementation,

about its size, its implementation language and used libraries.

7.1 Implementation Outline

The prototype system was implemented in Haskell (Peyton Jones, 2003) and de-

veloped using the Glasgow Haskell Compiler (GHC) (GHC Developers, 2008). The

project consists of about 10 300 lines of Haskell code, including the parser, type

checker, optimizer, code generator, virtual machine and stand-alone compiler.

The system runs on IA-32 (or x86 architecture) computers as described in Intel’s

developer manual (Intel Corporation, 2006). It has so far only been tested on

GNU/Linux systems.

Some of the timing results in Chapter 8 have been gathered using the rdtsc instruc-

tion of the Pentium processor. This instruction was accessed using the small Haskell

library rdtsc (Grabm¨uller, 2008), which is a small wrapper around a C function which

invokes the rdtsc instruction using inline assembler.

Except for the rdtsc library, the complete prototype system is written in pure Haskell.

We emphasize this fact because it shows that Haskell is well-suited for both writing

system-level programs (such as the code generator or garbage collector) and express-

ing high-level algorithms, such as Milner-Damas-style type inference and complicated

program transformations.

7.1.1 Front End

The parser was implemented using the Parsec parser combinator library (Leijen and

Meijer, 2001).

90 Chapter 7. Implementation

The type checker implements Milner-Damas-style type inference and uses an exten-

sion of algorithm M (Lee and Yi, 1998), which includes a dependency resolution al-

gorithm for identifying mutually recursive functions. The output of the type checker

is the fully typed language which is described in Chapter 4 as the “source language”.

7.1.2 Conversion and Optimization

The result of typechecking is the input to the CPS conversion described in Chap-

ter 4. CPS terms are optimized using several optimizations described below, then

closure-converted (see also Chapter 4), optimized again and fed to the code generator.

The reason for two rounds of optimization is that the closure conversion introduces

opportunities for optimization which are not present before this phase, for example

projections from known closures.

7.1.3 Code Generation

Machine code generation is performed using the Haskell library Harpy (Grabm¨uller

and Kleeblatt, 2007). This library allows the generation of machine code directly into

a memory buffer and provides label management, patching of forward references,

checking for code buffer overflow and a disassembler. It also defines a convenient

domain-specific language for assembler programming in Haskell, using type classes

for supporting various addressing modes. The Harpy library has about 7 600 lines of

Haskell source code (including the disassembler, which alone amounts to 2 700 lines).

7.1.4 Run-time System

The run-time code generator is part of the run-time system. Whenever a delay

expression is executed or a polymorphic function is called, the code generator is

invoked. Its main purpose is to generate monomorphic code and to place new delay

expressions, but in theory it could do any useful transformation on the program it is

to process. We have not exploited this possibility yet.

The garbage collector is a simple two-space copying collector (Cheney, 1970). At

the beginning of each function, a check for heap overflow is generated. When the

check fails, the values of all registers are saved to the stack and two bitmaps are

passed to the collector: one specifies which registers hold live pointers into the heap,

and the other specifies the live pointers on the stack. The collector than copies the

objects pointed to by these pointers (and some globally known pointers) into an

empty memory area. Then all live pointers recursively contained in these objects are

copied, until the complete live object graph is in the new memory area. The old area

can then be reused in the next collection cycle. In order to correctly interpret the

fields of heap objects, the first word of each object is a pointer to a descriptor, which

also contains a bitmap identifying pointer fields.

7.2 Optimizations 91

7.2 Optimizations

Our prototype system implements a number of standard optimizations, because oth-

erwise the results of measuring specialization effects cannot be compared to other

language implementations, which all contain optimizations similar to the ones we

describe here.

We will only briefly describe the optimizations, but when specialization or dynamic

compilation has an effect on some optimization, we mention it below in their descrip-

tions.

In this section, we will use example programs in the source language instead of the

CPS language, in order to maintain readability.

The following optimizations are performed during compilation:

Function inlining User functions used only once are inlined at their call sites.

Inlining of user functions is also important in implementing other optimizations, for

example uncurrying (see below). Therefore, functions introduced by other optimiza-

tions can be marked for inlining, so that they are forced to be inlined even when

called more than once. Inlining is a standard (but important) optimization for all

optimizing compilers, but especially for functional languages, where most functions

are small and function calls are frequent.

Inlining is affected by our specialization technique. Polymorphic functions are trans-

lated to functions which receive types as parameters. When such a function is inlined,

the actual parameters (which are type variables) must be substituted in terms as well

as in types. This is necessary because all intermediate terms are annotated with types,

and types can appear at the term level for applications of type functions.

Continuation inlining Continuations used only once are inlined at their call sites.

This is especially important for the continuations which are generated for the alterna-

tives of a case expression, because these are always called once. Continuation inlining

must includespecial handling of type parameters for the same reason as function in-

lining.

Case idiom recognition Case expressions often scrutinize boolean variables which

are the result of some primitive comparison operator. These are converted to a special

form of primitive conditional expressions. The reason is that most instruction sets

allow conditional jumps only on the result of some comparison instruction – this does

not work well with comparisons on algebraic data types representing boolean values,

because the boolean results must be represented in order to perform a case analysis

on them. This optimization has also been described by Appel (1992, p. 75).

Uncurrying Curried functions (that is, nested λabstractions in the source lan-

guage) are inefficient when applied to several arguments at once, because for each

abstraction, an intermediate closure is created which is immediately called. There-

fore, the uncurrying optimization extends nested function definitions into functions

92 Chapter 7. Implementation

which take several arguments at once, and adjusts the call sites to provide all ac-

tual arguments together. We use the uncurrying method described by Appel (1992,

p. 76), which converts curried functions to uncurried ones and adds curried wrapper

functions. These wrappers are then inlined at their call sites, effectively uncurrying

the function call.

As mentioned by Tarditi (1996), Appel’s transformation is underspecified and does

not necessarily result in uncurrying of functions with more than two arguments. Like

Tarditi, we have fixed this problem by applying the uncurrying optimization in the

order from outer to inner expressions.

Uncurrying is affected by the specialization on polymorphic functions, because we

cannot merge a type abstraction with a value abstraction when the type of the latter

depends on the former parameter. Consider the following example:

let id = \ a: *. \ x: a. x

in ...

If we uncurry this definition, we get the following function:

let id = \ (a: *, x: a). x

in ...

The problem is that when generating machine code for such a definition, we do not

know the representation of the type a, so cannot correctly calculate whether the value

of xwill be placed in an integer register, a floating-point register or on the stack.

Therefore, our implementation avoids uncurrying in this manner and all functions

receive either type or value parameters. Independent type and value parameters

could be uncurried together, but in our test cases these functions were extremely

rare, so this is probably not worth the effort.

Common subexpression elimination (CSE) The common subexpression elim-

ination phase replaces values which are calculated several times by the variables to

which the values have already bound. Our CSE is flow-insensitive, as it does not take

control flow around loops into account. Thus, it only eliminates common subexpres-

sions when one lexically dominates the other.

Dead variable elimination Unused bindings of values, projections, continuations,

functions and closures are removed from the program. The results of primitive op-

erations are only removed when the operation cannot have any side effect (such as

raising an exception).

Constant folding Primitive operations on constants are calculated at compile

time, except when the operation would cause an exception at run time (for example,

division by zero). Constant folding often enables dead variable elimination for the

operands.

7.2 Optimizations 93

Projection elimination Projections from known tuples or closures are performed

at compile time. In some cases, when the projection is the only use of a tuple

or closure value, the construction of the respective object can be removed by dead

variable elimination (see above).

Case elimination Cases on known values are reduced at compile time. This opti-

mization is important because inlining often creates case expressions which scrutinize

known values, so this optimization not only saves run-time (by eliminating a test and

a branch instruction), but also removes unreachable code. This can improve compile

times significantly, which is always a concern in dynamically compiling systems.

Floating-point register assignment Since the IA-32 architecture provides so few

integer registers, where most of them are used to hold the virtual machine state in our

implementation, and others have fixed uses for parameter passing, there is not much

point in sophisticated register allocation for these registers. Floating-point registers,

on the other hand, are plenty, so we have implemented a simple register assignment

scheme which tries to hold as many floating-point intermediate results as possible in

registers.

Inline caching Specialization of polymorphic functions requires that each call to a

polymorphic function checks whether a specialization for the actual type parameters

already exists. Since this check is rather expensive (in our implementation, it is writ-

ten in Haskell, and the transition from JIT-compiled code to Haskell code is costly1),

we have implemented an optimization called inline caching. In object-oriented sys-

tems, this optimization is used to avoid method lookups (Detlefs and Agesen, 1999).

The idea is to generate a short instruction sequence which tests whether the actual

type parameters are equal to the ones given the last time the function has been

called. If they match, the code from the last specialization can be used, otherwise,

the specializer is invoked. After specialization, the values tested in the code sequence

are overwritten by the specializer by modifying the testing code: constants in the

instruction stream which refer to types are overwritten with the new types, and the

jump instruction at the end of the inline code is overwritten to jump to the newly

generated code. Since the short in-line sequence of code is very fast, the overall

execution time is reduced significantly.

Recursion optimization Without any precautions, self-recursive calls of polymor-

phic functions always go through the specializer (or at least through the inline caching

mechanism). In addition to the overhead caused by checking whether a specialized

1Transitions from Haskell code to compiled code and vice versa go through Haskell’s Foreign

Function Interface (FFI), which involves checking for interrupts and whether garbage collection is

necessary, marshalling and unmarshalling of arguments and results, etc. This has non-negligible

overheads.

94 Chapter 7. Implementation

version exists, the type parameters have to be passed and this results in curried func-

tions (see above in the description of uncurrying on page 91). Recursion optimization

modifies polymorphic recursive functions so that the function body is changed into a

let expression which binds a new local function to the old function body. The body

of the let calls the local function. Since the local function is monomorphic, no spe-

cialization overhead occurs. As an example, consider the following recursive function,

which calculates the length of a list:

let len = \ a:* -> \ l: List a -> case l of

Nil -> 0

Cons x xs -> len a xs + 1

It is converted to the following:

let len = \ a:* -> \ l: List a ->

let len’ = \ l: List a -> case l of

Nil -> 0

Cons x xs -> len’ xs + 1

in len’ l

The local function can be compiled to a loop without any specialization overhead.

This optimization has been called “Loop headers in λ-calculus or CPS” by Appel

(1994), because it is related to loop header introduction in the compilation of loops

for imperative languages.

For polymorphic recursive functions, this optimization is not valid, because it assumes

that data representations for recursive calls at one call site are always the same.

Therefore, we do not apply this optimization for polymorphic recursive functions.

Data layout optimization The fields of tuples, closures and constructed values

of algebraic data types are rearranged in order to align them properly. For example,

64-bit floating-point fields are always aligned at an 8-byte boundary. Our data layout

algorithm is defined in Section 6.2.

Data alignment optimization When the virtual machine is run, the alignment

of allocated heap data (tuples, closures and constructor applications) can be specified

on the command line. The decision of an optimal alignment could also been based

on the executing processor and its memory organization, but this has not been inves-

tigated for our prototype. In contrast to the previous optimization, this one refers to

alignment of heap objects in memory, not to the alignment of individual fields in one

object.

Code alignment optimization When the virtual machine is run, the alignment

of jump targets in the generated machine code can be specified on the command line.

Intel Corporation (2005) recommends to align jump targets at 16-byte boundaries,

but for the programs we tested, this option did not affect run-time at all, probably

because our benchmark programs can be kept completely in the first-level cache.

Chapter 8

Experimental Results

In order to establish the practicability of our approach, we have conducted a num-

ber of experiments for measuring and comparing the performance of our prototype

implementation. The experiments investigate the impact of compiler optimizations,

the performance of the produced code relative to other implementations of functional

programming languages and the compile times of our dynamically compiling system.

This chapter describes the test methods, the hardware used, the benchmark programs

and presents the measured results.

The results show that our prototype has competitive performance on several small

benchmarks, when compared to mature statically compiling implementations. For

one program of realistic size, performance is within a factor of ten when compared to

an optimizing Haskell implementation, but only requiring half of the time to compile

the program.

8.1 Test Methodology

We have first measured the performance of our prototype implementation in order

to test the effect of the implemented optimizations. All benchmarks were performed

multiple times: first with all optimizations switched on, then with all optimizations

switched off. After that, we have tested the effect of twelve optimizations by individ-

ually switching them off.

In a second set of measurements, we have written the benchmark programs in vari-

ous other languages (functional, imperative and object-oriented) and compared the

performance of our prototype to other language implementations.

The third experiment consisted of porting a larger benchmark program from the

literature to our prototype and comparing it to several other languages.

All benchmarks were done on a Intel Pentium 4 CPU, running at 2.4 GHz with

512 MB of RAM and a second-level cache of 512 KB. The operating system used was

Debian GNU/Linux 4.0 with a Linux kernel version 2.6.18.

Each benchmark program was run three times, and the best result was used. Timings

were obtained using the time built-in command of the bash shell, and user times are

reported.

96 Chapter 8. Experimental Results

Name Description

nfib nfib function (integer version)

rfib nfib function (floating-point version)

pfib nfib function (both integer and floating-point version)

nlenrev construct a list of integers, reverse it, calculate length

rlenrev construct a list of floating-point values, reverse it, calculate length

Table 8.1: Common benchmark programs

The benchmarks in the different languages have been written in a way to maximize

their performance. For example, when special operators are available for small in-

tegers, these have been used, and we use built-in list functions when available. In

several cases, we have also measured other program variants, so that the comparison

between different languages is easier.

8.2 Benchmark Programs

In order to compare the performance of our prototype implementation to other lan-

guage implementations, we have written a suite of small benchmark programs. The

benchmarks can be put into three different groups:

−Function-call intensive programs with a high amount of basic arithmetic oper-

ations.

−Data-structure intensive programs which create large dynamic data structures

and traverse them.

−A realistic benchmark program.

In the first category, called fib, we have implemented several versions of the nfib pro-

gram from the nofib benchmark suite for Haskell (Partain, 1993). The nfib function

counts the number of function calls required to calculate the Fibonacci number of

the input n. In the second group we have implemented the lists benchmarks, which

work on linked lists. And finally, the trees benchmark constructs and searches binary

search trees. The last group contains only one program: the floating-point intensive,

realistic, Pseudoknot benchmark.

Table 8.1 summarizes the common benchmarks, which have been implemented in all

languages, whereas the special benchmarks in Table 8.2 have only been written for

languages which support them, or in order to get additional insight into the actual

benchmark results.

fib The programs in this group are aimed at testing function call efficiency and the

implementation of basic arithmetic operations. The first version of the fib bench-

mark is called nfib and works with native machine integers. We have made sure (by

8.2 Benchmark Programs 97

Name Description

ffib 32-bit floating-point version of rfib

nfibmod modular version of nfib

pfibmod modular version of pfib

nlenrevnogc nlenrev without garbage collection

rlenrevnogc rlenrev without garbage collection

tree search tree construction and searching

pseudoknot float-intensive realistic benchmark

Table 8.2: Special benchmark programs

using type annotations or special arithmetic instructions when necessary), that all

implementations work on integers not larger than a machine word, so that we can ex-

pect the best possible performance for each implementation. The rfib version works

with floating-point numbers, which are implemented as 64-bit IEEE floating-point

numbers in all implementations. The pfib benchmark calculates both with integer

and with floating-point numbers, and uses the same, polymorphic implementation for

both uses. This version tests the efficiency of polymorphic functions and overloaded

arithmetic.

Additionally, in order to evaluate several aspects of some implementations, we have

written three special versions of the fib theme: pfibmod is the same as pfib, but with

the polymorphic function in a separate compilation unit. Nfibmod is the same for the

integer version. The ffib program works on 32-bit floating-point numbers instead of

64-bit values.

lists The nlenrev and rlenrev benchmarks are data structure intensive programs

which each construct a list of one million elements, reverse the list and then take the

length of the result. Similar to the fib benchmarks, nlenrev works on integer lists and

rlenrev on lists containing floating-point values.

Since garbage collection is a weak spot in our implementation, we have made addi-

tional experiments where the heap size allocated on startup is large enough to that no

garbage collection has to take place during the benchmark run. These benchmarks

are called nlenrevnogc and rlenrevnogc.

trees The tree benchmark creates two unbalanced binary search trees, one small

tree with mixed contents and one created by repeatedly inserting elements of a sorted

10 000-element list. The trees are searched for values which are in the trees and for

one value which is not. The tree test program also belongs to the data structure

intensive benchmarks.

pseudoknot This is a floating-point intensive realistic program which solves a prob-

lem from molecular biology. It is described in detail in Section 8.5 below.

98 Chapter 8. Experimental Results

Code Optimization Description

ci Case idioms Recognize cases on comparison results

if Inline funs Inline user functions

ic Inline conts Inline continuations

il Inline Inline both functions and continuations

uc Uncurrying Uncurry functions and continuations

cs CSE Common subexpression elimination

dv Dead Vars Remove unused bindings

cf Const Folding Folding of operators, projections etc.

ly Layout Data structure layout

fp FP Regs Optimize floating-point register assignment

rc Recursion Create local versions for recursive functions

ca Inline Cache Inline caching for polymorphic code

Table 8.3: Measured optimizations

8.3 Effect of Implemented Optimizations

The effect of optimizations is quite difficult to predict, mainly because different op-

timizations interact in complicated ways. Additionally, optimizing transformations

are in general not commutative, so the order in which they are carried out can affect

the performance of the resulting program significantly.

So in order to measure the effects of our implemented optimizations, we have com-

piled and run the benchmark programs with various optimizations switched off, and

compared the running times with the times for a run with all optimizations switched

on and without any optimizations, respectively. In Table 8.3, the optimizations mea-

sured and their abbreviations (to be used in the result tables) are summarized. All

optimizations are described in detail in Section 7.2.

Results

Tables 8.4 and 8.5 summarize the results. The “O0” lines reports the run times when

all optimizations are switched off (command line option -O0), whereas the “O2” lines

denote the results for fully optimized programs (command line option -O2). Both

lines are highlighted in the tables for better readability. All other lines are showing

the results for the benchmarks when one particular optimization is turned off (the

abbreviations for the optimizations are given in Table 8.3). The second column gives

the absolute run time in seconds and the third column gives the run time relative to

the “O2” line. The names of the benchmarks are the names from Tables 8.1 and 8.2,

but with the suffix “opt”.

Longer run times for a disabled optimization mean that the optimization is good,

because switching it on reduces run times.

The run times include parsing, type checking and various transformations and opti-

8.3 Effect of Implemented Optimizations 99

mizations, such as CPS conversion, closure conversion, all optimizations not switched

off, and machine code generation. Since the benchmark programs measured in this

section are small, most of the run time is used up by running the code, not by

compiling it.

For the benchmark pfibopt, we cannot report results for optimization level -O0, be-

cause the benchmark allocates too much memory to successfully terminate. For some

other benchmarks, the heap size had to be increased in order to run them successfully.

In the case of nlenrev and rlenrev (O0), 400 MB of heap were allocated, for the other

list benchmarks 70 MB. All tree benchmarks were performed with 200 MB of heap.

The nfib benchmarks ran with the default heap size of 1 MB.

General Remarks

The completely unoptimized programs are 2.7–12 times slower than the fully opti-

mized programs.

nfib For the floating-point version, the difference between the unoptimized and the

fully optimized program is less than for the integer version. This is probably due to

more memory traffic in the floating-point code. Switching off individual optimizations

can improve run times by up to 3%.

lists The difference between integer and floating-point code is similar to the nfib

benchmarks.

trees For this benchmark, we have set the input size to 4000, because the unopti-

mized programs use more memory than is available in the test machine.

The tree benchmark seems to contain functions for which uncurrying has a negative

effect. Therefore, switching off uncurrying or function inlining (which is required by

uncurrying) makes the program run much faster, up to 40%.

Remarks on Individual Optimizations

The case idiom optimization (ci) has an effect of about 1–2%, and sometimes it slows

down the program.

Switching off inlining functions (if) changes run times by a few percent (both up and

down), and has more significant effects for the lists (slower by about 100%) and tree

benchmark (faster by about 40%). This makes it very difficult to recommend it.

Inlining continuations (ic) is very important for our implementation, because many

continuations are introduced which can be removed when e.g. functions are inlined.

Therefore, switching this optimization off gravely affects performance.

Switching off inlining (il) in general (both user functions and continuations) has

similar effects than inlining of continuations, which stresses that continuation inlining

is more important than function inlining.

100 Chapter 8. Experimental Results

nfibopt 40

impl time rel

cs 8.405 0.98

cf 8.437 0.99

if 8.465 0.99

ca 8.477 0.99

fp 8.489 0.99

uc 8.505 1.00

ly 8.529 1.00

O2 8.545 1.00

rc 8.565 1.00

ci 8.705 1.02

dv 8.705 1.02

ic 25.418 2.97

il 25.434 2.98

O0 25.806 3.02

rfibopt 40

impl time rel

cf 11.725 0.97

cs 11.773 0.98

if 11.865 0.98

ca 11.901 0.99

ly 11.937 0.99

ci 11.953 0.99

rc 12.025 1.00

dv 12.033 1.00

O2 12.073 1.00

fp 12.085 1.00

uc 12.089 1.00

ic 32.162 2.66

il 32.878 2.72

O0 32.898 2.72

pfibopt 40 40

impl time rel

O2 20.229 1.00

cs 20.253 1.00

ly 20.325 1.00

cf 20.425 1.01

if 20.541 1.02

fp 20.561 1.02

ca 20.593 1.02

ci 20.665 1.02

uc 20.801 1.03

dv 20.817 1.03

il 58.504 2.89

rc 58.504 2.89

ic 58.536 2.89

Table 8.4: Benchmark results: effects of various optimizations (part 1), benchmark

pfibopt: measurements for “O0” missing (see text)

8.3 Effect of Implemented Optimizations 101

nlenrevopt 1000000

impl time rel

cf 1.372 0.99

fp 1.380 0.99

O2 1.392 1.00

ly 1.396 1.00

cs 1.404 1.01

ca 1.440 1.03

ci 1.444 1.04

dv 2.084 1.50

uc 2.660 1.91

if 2.708 1.95

il 7.052 5.07

ic 7.156 5.14

rc 8.129 5.84

O0 10.025 7.20

rlenrevopt 1000000

impl time rel

O2 3.576 1.00

ci 3.672 1.03

cf 3.676 1.03

cs 3.692 1.03

ca 3.720 1.04

ly 3.720 1.04

fp 3.744 1.05

dv 4.424 1.24

uc 7.324 2.05

if 7.440 2.08

O0 9.521 2.66

ic 14.025 3.92

il 14.161 3.96

rc 15.653 4.38

treeopt 4000

impl time rel

if 1.164 0.60

uc 1.304 0.68

O2 1.924 1.00

dv 1.928 1.00

ca 1.932 1.00

ly 1.940 1.01

ci 1.948 1.01

cf 1.956 1.02

cs 1.968 1.02

fp 1.984 1.03

ic 2.656 1.38

il 2.668 1.39

rc 2.788 1.45

O0 23.233 12.08

Table 8.5: Benchmark results: of effects various optimizations (part 2)

102 Chapter 8. Experimental Results

The uncurrying optimization (uc) is sometimes an advantage, sometimes a disadvan-

tage. For the fib benchmarks, which only have functions of one value argument, there

is no opportunity for uncurrying, so the compile-time effort is wasted. As mentioned

above, uncurrying gives very bad results for the tree benchmark.

Common subexpression elimination (cs) affects run time by up to 3%, but for most

benchmark there is no effect at all or even a slow down due to compile time costs.

Dead variable elimination (dv) does not give problems for these benchmarks and can

affect run time by up to 50% for the nlenrevopt benchmark.

Constant folding (cf) makes a difference of a few percent, in both directions.

Layout optimization (ly) has mostly an effect of 4% for rlenrevopt, since this bench-

marks has data structures (lists of floating-point values) which benefit most from this

optimization which aligns these values on 8-byte boundaries.

Optimization of floating-point registers (fp) has a small impact of 1–2% for most

benchmarks. For some benchmarks, run times improve when this optimization is

switched off. This is probably due to reduced compilation times, since several of the

benchmarks do not use floating-point values at all.

Recursion optimization (rc) is the most effective single optimization. The reason is

that, for polymorphic functions, it avoids a lookup in the specialization cache on each

recursive invocation.

The effect of switching off inline caching (ca) does not have much effects for the

benchmarks except for nlenrevopt and treeopt. The reason is that for the nfib bench-

mark, the recursive calls are monomorphic because of recursion optimization, so that

the inline cache is not used much.

8.4 Comparison to other Implementations

The common benchmark programs have been written in Kafka, Haskell (Peyton

Jones, 2003), Standard ML (Milner et al., 1997), C/C++ (Kernighan and Ritchie,

1988; Stroustrup, 1986), Scheme (Kelsey et al., 1998), Java (Gosling et al., 2000) and

Opal (Pepper, 2003).

The special benchmarks have not been implemented in all languages, see below for

details.

We have used the Glasgow Haskell Compiler (GHC Developers, 2008) for the Haskell

programs, MLton (MLton Developers, 2006) for the Standard ML programs, Bigloo

(Serrano and Weis, 1995; Bigloo Developers, 2008) for the Scheme programs, the

Opal Compilation System (Opal Group, 2004) for the Opal programs, the GNU

compiler collection C and C++ compilers (GCC Developers, 2008) and the Sun Java

development Kit (Sun, Inc., 2008) compiler and virtual machine for Java. The Kafka

programs are of course run by our prototype implementation.

Table 8.6 lists the versions of the language implementations.

All compilers have been used with optimizations enabled, the actual command line

arguments are listed in Table 8.7.

One important fact to keep in mind when interpreting the results in the next section

8.4 Comparison to other Implementations 103

Language Implementation Version

Kafka Kafka prototype system 0.1

Haskell Glasgow Haskell Compiler (GHC) 6.8.2

Scheme Bigloo 2.8c

Standard ML MLton 20061107

Opal Opal Compilation System (ocs) 2.3k

C/C++ GNU Compiler Collection (gcc) 4.1.2

Java Sun Java Development Kit 1.5.0

Table 8.6: Implementation versions

Implementation Optimization options

Kafka —

GHC -O2

Bigloo -O6

MLton —

Opal opt=full

C/C++ -O4

Java —

Table 8.7: Command line options for optimization

is that the various language implementations are not completely equivalent. For

example, some languages (such as Standard ML and Opal) perform overflow checking

on arithmetic, whereas the others do not. This is a disadvantage for the former

implementations. Also note that the Haskell implementations are lazy, and all other

implementations are call-by-value. Nevertheless, we think that even if it is a kind

of apple-to-oranges comparison, the numbers presented next are an indication of the

implementations’ performance in “real life”.

Results

Tables 8.8 to 8.10 give the results of our experiments. In each table, the first line

gives the name of the benchmark together with the input size. For example, the pfib

benchmark in Table 8.8 has been run with an input of 40 for the integer function,

and an input of 40 for the floating-point function as well.

Of course, since all benchmarks described here are small benchmarks which at best

correspond to small kernels of real applications, the results do not generalize to real

programs, they merely give an intuition of the performance under varying circum-

stances.

We will now discuss the results in detail.

104 Chapter 8. Experimental Results

nfib 40

impl time rel

bigloo 1.024 1.00

gcc 1.036 1.01

ghc 2.772 2.71

java 3.120 3.05

mlton 3.192 3.12

kafka 8.429 8.23

opal 10.349 10.11

rfib 40

impl time rel

gcc 2.756 1.00

java 4.676 1.70

mlton 8.329 3.02

kafka 11.805 4.28

ghc 18.653 6.77

bigloo 30.378 11.02

opal 55.071 19.98

pfib 40 40

impl time rel

gcc 3.848 1.00

mlton 11.057 2.87

kafka 20.337 5.29

ghc 22.325 5.80

bigloo 31.666 8.23

opal 122.944 31.95

pfibmod 40 40

impl time rel

ghc 140.817 1.00

ffib 40

impl time rel

opal 18.441 1.00

nfibmod 40

impl time rel

gcc 1.024 1.00

Table 8.8: Benchmark results: nfib

8.4 Comparison to other Implementations 105

nlenrev 1000000

impl time rel

mlton 0.040 1.00

opal 0.104 2.60

ghc 0.200 5.00

bigloo 0.208 5.20

java 1.184 29.60

kafka 1.452 36.30

nlenrev 1000000 nogc

impl time rel

bigloo 0.052 1.00

ghc 0.068 1.31

kafka 0.096 1.85

gcc 0.156 3.00

java 0.460 8.85

rlenrev 1000000

impl time rel

mlton 0.056 1.00

ghc 0.188 3.36

bigloo 0.208 3.71

opal 0.208 3.71

java 2.548 45.50

kafka 3.644 65.07

rlenrev 1000000 nogc

impl time rel

bigloo 0.060 1.00

kafka 0.092 1.53

ghc 0.100 1.67

gcc 0.144 2.40

java 1.040 17.33

Table 8.9: Benchmark results: lists

tree 10000

impl time rel

mlton 1.764 1.00

kafka 3.204 1.82

ghc 4.764 2.70

bigloo 9.369 5.31

Table 8.10: Benchmark results: trees

106 Chapter 8. Experimental Results

Results for nfib

nfib For the integer version of the nfib function, it is remarkable that the Scheme

version compiled by Bigloo is fastest. The main reason is that for the Bigloo version,

we used explicit fixnum arithmetic. Nevertheless, the fact that Bigloo is faster than

the C version is quite impressive, as Bigloo compiles to C and then invokes the C

compiler. The GHC version also comes out quite fast, because the strictness analyzer

detects that all computations can be performed strictly. GHC also performs special-

ization for the nfib function. Kafka is slower than most of the other implementations,

but note that parsing, type checking, optimization and code generation are included

in the timings for Kafka. Opal does not optimize function calls and simple arithmetic

well, apparently.

rfib For rfib, the results are quire different than for the last benchmark. GHC and

Bigloo apparently generate worse code for floating-point intensive programs. Even

though GHC again performs specialization, its floating-point code performs worse

then the integer code. MLton is a specializing whole-program compiler anyway, and

seems to generate good floating-point code, too. Bigloo takes a performance hit

because the implementation cannot use the faster fixnum arithmetic here. Kafka’s

handling of floating-point code by specialization pays off here, so we get good results.

The Java results show how good just-in-time compilation can perform.

pfib For the polymorphic version, Kafka performs especially well because of the

run-time specialization. The floating-point version of the polymorphic nfib function

runs just as fast as for the rfib benchmarks. For most implementations, the times of

the pfib benchmarks are approximately the sum of the nfib and rfib runs; only Opal

is much slower, because it lacks function polymorphism: all overloaded functions and

constants must be given as module parameters. For Java, we have no pfib imple-

mentation, because it lacks both polymorphic functions over primitive types and an

efficient means for simulating them.

pfibmod We have made another version of the pfib benchmark for Haskell, in order

to see how strictness analysis and type classes affect performance. In the pfibmod

program, the nfib function is placed in a separately compiled module. The effect is

dramatic: the modular version is about six times slower than the monolithic, because

no automatic specialization is done. Without specialization and strictness analysis,

laziness and the overhead of type classes cannot be avoided.

ffib Because of the bad performance of Opal for rfib, we have additionally measured

its performance when using 32-bit floating-point values instead of 64-bit values. The

run-time for ffib is roughly one-third of that for rfib, because 32-bit floating-point

values are represented more efficiently in the Opal run-time system.

8.5 The Pseudoknot Benchmark 107

nfibmod As a small additional check, we have made a modular version of the

nfib program for the C compiler, in order to see what effect modularity has for this

implementation. Interestingly, the modular variant is faster than the non-modular.

We think the reason is somehow improved memory layout for the separately compiled

version, which helps the cache system.

Results for lists

nlenrev For the nlenrev and the rlenrev benchmarks, Kafka is the slowest imple-

mentation. The reason is the slow garbage collector, and we have run these bench-

marks again with enough heap so that the programs do not have to perform any

garbage collection during the program run.1The C version has been added to the

nogc results, because it does not free the memory allocated. The bad performance

of the C program is probably due to an inefficient memory manager: we used the C

library function malloc() for allocating list cells. For Java, we have used the class

LinkedList from the standard library and used automatic boxing conversions to put

integers and floating-point values into lists. As can be seen with these results, boxing

and unboxing in Java affects performance very much. The functional implementa-

tions, for which boxing is a very common problem, optimize it much better. Opal

performs very well on the list benchmarks, because the lists are used in a single-

threaded way, and Opal contains special code to dynamically reuse single-threaded

heap objects. This reduces memory allocation significantly and thus results in fast

execution. Without garbage collection, Kafka performs quite well.

rlenrev The results for rlenrev are similar to those for nlenrev, with a few ex-

ceptions. All implementations are slower, because more memory is allocated. Opal

performs worse because of inefficient handling of floating-point values. For the nogc

version, Kafka is better than in the integer list benchmark, because the specialized

representation of list cells containing floating-point values. Better floating-point code

when compared to other implementations could also have a small effect.

Results for trees

tree For the tree benchmark, we had to increase the heap size for the Kafka mea-

surements, because the created tree is larger than the default heap size. Given enough

room, Kafka performs quite well. We can draw the conclusion that data structure

handling also performs quite well in our implementation.

8.5 The Pseudoknot Benchmark

The Pseudoknot benchmark is a realistic program from the field of molecular biology.

It has been used in an experiment where 25 language implementations have been

1The machine on which the tests were performed has not enough physical memory to avoid

garbage collection for Java, but we have been able to reduce garbage collection to two invocations.

108 Chapter 8. Experimental Results

tested on the same problem (Hartel et al., 1996). The focus of this experiment was

on compile and execution times for all implementations. The program calculates the

three-dimensional structure of a molecule, based on several constraints on its atoms.

It requires a lot of operations on three-dimensional points and uses backtracking

search to find all solutions. The result of the program is the distance of the most

distant atom of all solutions, which has to be printed with six digits accuracy.

We have ported the Haskell version of the Pseudoknot benchmark to Kafka, which

resulted in a program of 3404 lines of code. This is by far the largest program which

has been written in Kafka and been tested using our prototype.

For this benchmark, we had to adapt the measuring techniques, because some parts

of our prototype system are not mature enough. In particular, we run the benchmark

in our prototype without optimization, because the optimizer is too slow and the run

time of the actual program too short to give meaningful results. Additionally, we

ran the benchmark with 100 MB of heap space in order to reduce invocations of our

rather slow garbage collector. We think this is fair, because the C version has no

garbage collector at all and the garbage collectors of the other implementations are

highly tuned.

We have compared our system to the C, Haskell, ML and Opal versions of the Pseu-

doknot benchmark. The C and ML versions work with 64-bit floating point numbers,

the original Haskell and Opal versions with 32-bit floating-point numbers (this was

allowed in the original benchmark, too). We have additionally measured the perfor-

mance of 64-bit floating-point Haskell and Opal versions in order to compare them

to the Kafka version. The C, Haskell and Opal version worked without modification.

The ML version had to be slightly modified because the original was written for the

Standard ML of New Jersey (SML/NJ) (Appel and MacQueen, 1987) compiler and

we were using the MLton compiler. Only changes caused by different ML libraries

were required.

Since one reason for a dynamically compiling implementation is to achieve the kind

of flexibility offered by an interpreter, we have also measured the run and compile

times for running the Pseudoknot program in the GHC interpreter ghci. This is a

non-optimizing bytecode interpreter, which otherwise shares its implementation with

GHC.

We measured the following setups:

−The run time of the C, Haskell, ML and Opal versions, with full optimization

switched on.

−The run time of the C and Haskell version, without optimization.

−The run time of the Kafka version (which includes parsing, type checking, con-

versions and code generation).

−The run time of the interpreted Haskell version.

−The compile times for the C, Haskell, ML and Opal version with optimization.

−The compile times for the C and Haskell version without optimization.

8.5 The Pseudoknot Benchmark 109

Impl Run (s) Comp (s) Run+Comp (s) Size (bytes)

C opt 0.035 6.649 6.684 154424

C 0.047 1.197 1.244 126296

Hs opt 0.282 7.837 8.119 460028

Hs 0.691 9.744 10.435 926396

Hs int ?5.912 ?4.536 10.448 —

Hs opt 64 0.294 7.824 8.118 476860

Hs 64 0.719 9.119 9.838 953180

Hs int 64 ?6.113 ?4.559 10.672 —

ML 0.061 8.410 8.471 276268

Opal 0.174 113.554 113.728 782400

Opal 64 0.760 39.616 40.376 491392

Kafka ?3.586 ?3.366 6.952 490474

?derived from Run+Comp (see text)

Table 8.11: Pseudoknot results

−The amount of generated code for each configuration (except for the interpreted

Haskell version).

−The Haskell and Opal versions were tested both with 32 and 64-bit floating-

point numbers

The optimized C version was compiled with optimization option -O4, the Haskell

version with option -O2. The ML version was compiled without compiler options.

The Opal versions were compiled with the options opt=full debug=no. The Kafka

version was run with options -O0 and -H100M.

Table 8.11 gives the results, some of which are graphically shown in Figures 8.1

and 8.2. In the first column, the name of the measured configuration is shown. “C

opt” is optimized C code, “C” is unoptimized C code, similar for Haskell (code “Hs”).

“ML” is the MLton version. “Opal” is the 32-bit floating-point version. All 64-bit

floating-point versions for Haskell and Opal are indicated with “64”. “Run” is the

run time in seconds reported by the time command, “Comp” is the compile time

in seconds for compiling the configuration. “Run+Comp” is simply the sum of run

and compile times. In the last column, the amount of generated code is shown. For

the C, Haskell, Opal and ML version, this is the size of the executable file without

any debugging information and symbols (as produced by strip), for Kafka, it is

the precise amount of machine code. For interpreted Haskell, no machine code is

generated, and the size of the generated byte code cannot be determined, therefore,

the last column is empty in these cases.

Since compile and run time are difficult to separate in the Kafka system, where

compilation and execution is interleaved, we have made the following approximation:

We have measured the complete run time of the program using the time stamp

counter register of the Pentium processor. This gives the precise number of machine

110 Chapter 8. Experimental Results

seconds

7run time

C opt

Hs opt

Hs int

Hs opt 64

Hs 64

Hs int 64

Opal

Opal 64

Kafka

Figure 8.1: Pseudoknot: run times

cycles elapsed. Additionally, we measured the time spent in the code generator using

the same technique (excluding the time spent in the context switch to the Haskell

environment, but this is hard to measure). The difference of complete time and

code generation time is called “useful time”, and includes both the running program

and the garbage collector. The percentage of useful time has then been applied to

the run time of the program as measured with the time command, resulting in the

values in Table 8.11. For interpreted Haskell, we have a similar problem: we cannot

measure the precise compile and run times. The numbers in the table result from

loading the Haskell program in the interpreter and immediately running it (command

line time ghci -XCPP Nuc.hs -e "main") and loading it and immediately quitting

the interpreter (command line time ghci -XCPP Nuc.hs -e ":q"). The run time

reported is then the difference between both times.

We have visualized the run times in Figure 8.1 and the combined run and compile

times are shown in figure 8.2. The bars for Opal in Figure 8.2 have been clipped and

annotated with the combined run and compile time because they are much larger

than for the other implementations.

Even though the Kafka implementation is much slower than the other compiled ver-

sions, this benchmark proves that efficient code generation is possible for high-level

languages at run time. Code is generated at a rate of approximately 142 KB per

second. Since our code generator is written in Haskell and not tuned for speed, we

think that is a satisfactory result. Our implementation produces code which is an

order of magnitude slower, but the run time and the compile time together are lower

8.5 The Pseudoknot Benchmark 111

seconds

run time

compile time

C opt

Hs opt

Hs int

Hs opt 64

Hs 64

Hs int 64

113.728

Opal

40.376

Opal 64

Kafka

Figure 8.2: Pseudoknot: run and compile times combined

than the compile times of the other implementations, except for the unoptimized C

version. The amount of code generated is similar to that produced by the Glasgow

Haskell and Opal compilers.

The Opal versions are very fast, where the 64-bit floating-point version is slower

than the 32-bit version. This is similar to the results for the rfib/ffib benchmarks.

For Haskell, there is not much difference for both floating-point sizes. The Opal

compilation times are very large, but note that much of the compilation time is spent

in the C compiler, since Opal compiles to C. Interestingly, the compile time for the

64-bit floating-point version and the code size are lower than the 32-bit version, even

though it is slower. The reasons for this anomaly are unclear, maybe there is some

difference in inlining opportunities which could explain this. The compile times for

MLton are not much larger than for GHC. This is probably due to the fact that the

Pseudoknot program consist of very large static data structures, which MLton seems

to handle efficiently.

The Kafka version runs by a factor of about 10 slower than the compiled Haskell or

Opal version. We think that careful tuning of the optimizer, so that it can run on

programs of this size, will reduce this factor so that results similar to the ones of the

small benchmarks presented in the previous section can be achieved. Compared to

interpreted Haskell, our implementation is much faster, even though the incremental

dynamic compilation method provides in principle the same degree of flexibility.

112 Chapter 8. Experimental Results

Line Incremental Input Tot.Run (MCy) Codegen (MCy) # Gen Size

1 No 0 7868 3642 1 495082

2 No 1 10321 4079 11 508764

3 Yes 0 5192 ?0.66 868 ?0.23 3 811

4 Yes 1 9530 ??0.92 3158 ??0.77 134 391034

with profiling

5 Yes (1st) 0 5104 ?0.65 930 ?0.26 3 881

6 Yes (2nd) 0 6122 ?0.78 713 ?0.20 2 780

7 Yes (1st) 1 11217 ??1.09 3293 ??0.81 131 413485

8 Yes (2nd) 1 12164 ??1.18 4456 ??1.09 12 534867

?relative to line 1, ?? relative to line 2

Table 8.12: Results for incremental compilation

8.6 Effects of Incremental Compilation

The incremental compilation feature described in Chapter 5 affects run time in two

directions: first, compilation time is reduced because less code needs to be generated.

Second, when all code in the program is executed, compilation time is increased be-

cause the placement of delay expressions, the generation of calls to the code generator,

the invocation of the generation, patching code and accessing free variables from the

stack induce run time overheads.

In order to quantify these effects, and to see whether the technique of incremen-

tal compilation we have developed has any advantages, we have performed another

experiment.

We have taken the Pseudoknot benchmark described in the previous section and have

modified it so that the complete program is contained in a function. Additionally,

the main program asks the user for a number: when a 1 is given, the benchmark is

run, otherwise it is not run.

Because loading, type checking and converting takes so long with our prototype

implementation, we report different numbers in this section than the ones in the

previous section. For all program runs, we report the time spent in execution and code

generation as measured by the Kafka virtual machine using the rdtsc instruction. All

times reported in this section are measured in million machine cycles (megacycles ≡

MCy). The time needed for parsing, type checking, CPS- and closure-converting are

not included in the reported numbers. Optimizations were turned off again, as in the

previous section.

Results for Incremental Compilation

The results are shown in Table 8.12. The first column contains the line number to

be referred to later, column two tells whether the Pseudoknot program was executed

with incremental compilation switched on (“Yes”) or off (“No”). The third column

8.6 Effects of Incremental Compilation 113

gives the input to the program (1: run the benchmark, 0: do not run the benchmark).

The fourth column gives the number of megacycles spent executing the program (this

includes code generation and garbage collection time). The fifth column report the

number of megacycles spent in the code generator, and the sixth column tells how

often the code generator was invoked. The last column gives the amount of generated

code. The second numbers in the columns labelled “Tot.Run” and “Codegen” for the

incrementally compiled versions are the total run times or code generation times,

respectively, relative to the numbers in line 1 (for input 0) and 2 (for input 1).

Lines 1 and 2 are the results when incremental compilation is switched off. Lines 3

and 4 show them when incremental compilation is enabled, but when profiling feed-

back is switched off, and lines 5 to 8 give the numbers when both features are enabled.

The lines labelled “1st” correspond to a run where no previous profiling information

is available, the lines labelled “2nd” give the numbers when a previous profiled run

has already completed and profiling data is available.

When the input to the benchmark program is 0, that is, when the expensive back-

tracking search is not invoked, execution time is of course lower than when the search

is performed. In the case that incremental compilation is switched off (lines 1 and 2),

running the complete program (input 1) causes 11 code generations and about 10 KB

more code to be generated than for input 0. This is due to the fact that some func-

tions are polymorphic and are thus compiled when they are first called even when

incremental compilation is not enabled.

Line 3 shows that incremental compilation can reduce the amount of code which is

generated dramatically (of course, our benchmark is artificial, for real programs the

differences are probably less pronounced). Even though the number of code generator

invocation increases, the time spent in code generation and in total run time is also

reduced: code generation time is reduced by 77%, run time by 34%.

When the complete program is executed and incremental compilation is enabled

(line 4), we also see that the amount of code, the code generation time and the total

run time is less than for the non-incrementally compiled version. Even though code

generation is performed more than 10 times as often, total run time is reduced by 8%.

This effect is due to the fact that code generation is performed more quickly. Our

investigation of this astonishing fact revealed that this is probably an artifact of our

implementation: the placement of delay expression in the syntax tree speeds up some

analysis in the code generator. For example, the code which calculates how much heap

space is allocated in a function stops as soon as a delay expression is encountered,

because the deferred compilation of the body expressions will have their own heap

check instructions emitted when they are generated later. This means that for more

efficient or different analysis and code generation techniques, the incremental version

will probably not be faster than the non-incremental, because the additional code

which is generated for invoking the code generator will slow down the programs.

In lines 5–8 we can see that our heuristic of not delaying functions which have been

used in the last program run indeed reduces the number of delay expressions in the

syntax tree and thus reduces the invocations of the code generator. The amount

of code generated increases because instrumentation code is generated to count how

114 Chapter 8. Experimental Results

often functions/continuations and branch alternatives are executed. Additionally, the

algorithm for placing delay expressions needs to consult the tables of the previous run

in order to decide where to place them. This also costs time. But note that in line 6,

we can see that code generation time is reduced when compared to line 3, where no

profiling information is used. This encourages more research of the practicability of

profiling-based feedback in our setting.

8.7 Discussion

There are several interesting facts to be discussed:

−The Kafka implementation is able to reach a factor of two of the run times of

very mature and efficient language implementations for Standard ML, Haskell

and Scheme. For some benchmarks, Kafka is even faster than other implemen-

tations.

−The garbage collector currently implemented in our prototype is very slow. So

it is to be expected that the performance of our prototype underestimates the

performance of our approach in general.

−Modular programming in Haskell in the form of laziness, type classes and sep-

arate compilation has its price. Our approach could be used to overcome the

performance problems due to abstraction and type classes. Whether it is also

suitable for lazy languages remains an open research problem.

−The run times for Java prove that dynamically compiling system can offer very

good performance.

−The very good performance of the programs compiled by MLton can be seen as

an upper bound of our approach: MLton performs specialization and defunc-

tionalization, so the code actually executed is monomorphic as in our case. The

drawback of MLton is the restriction to whole-program compilation and slower

compilation. Our approach could again be a solution, but a more complete

implementation is needed to prove that expectation.

−When comparing the Pseudoknot results for interpreted Haskell and Kafka, we

draw the conclusion that it is possible to implement a functional language so

that it is as flexible as an interpreter, but with performance close to a statically

compiled implementation.

−Incremental compilation has the ability to reduce both code generation time and

overall run time for dynamically compiled functional programs. The usefulness

of profiling-based feedback needs further investigation.

In conclusion, the results presented in this chapter prove that it is possible to effi-

ciently produce high-quality machine code for a high-level language at run time.

Chapter 9

Conclusions and Future Work

9.1 Conclusions

In this thesis, we have presented several techniques for the implementation of a run-

time compiler for functional programming languages. Our specialization methods

have the potential to eliminate the overheads of many language constructs heavily

used in modern functional programming languages: polymorphic functions, polymor-

phic data types and type classes, as well as proposed extensions: open functions and

data types, and parameterized modules. The ideas have been implemented in a pro-

totype system, which can perform monomorphization of polymorphic functions and

data types and additionally carries out a number of standard optimizations, of which

some had to be modified to play nicely with the dynamic compilation approach. The

basic idea of using run-time specialization for the implementation of functional lan-

guages looks promising: our experiments show that both generation of efficient code

and fast compile times are possible for functional programming languages.

With respect to the problems identified in Section 1.1.1, we give the following answers:

Feasibility of dynamic compilation for functional programs We have shown

that dynamic compilation of functional languages can be performed efficiently. The

compilation times reported in Chapter 8 are already encouraging, and we expect that

with some engineering effort they can be reduced further.

Balance of compilation and execution time The typed dynamic CPS repre-

sentation we have developed enables fast code generation as well as sophisticated

optimizations, both prior to execution and at run-time. The low-level nature of

closure-converted CPS terms allows efficient code generation, while also providing a

well-studied semantics and a wide variety of proven optimization techniques.

Suitability of classic optimizations We have implemented several optimizations

known from static functional compilers and have measured their effect on compile and

run time.

115

116 Chapter 9. Conclusions and Future Work

Techniques from other dynamic language implementations We have taken

the inline caching optimization from the implementation of object-oriented programs

and have shown its usefulness for functional languages. The incremental compila-

tion and delayed compilation of polymorphic functions is reminiscent of lazy method

compilation on object-oriented virtual machines.

Influence on dynamic language constructs We have shown how our specializa-

tion technique can eliminate the overheads of parametric polymorphic functions and

data types. We have also shown that ad-hoc polymorphism in the form overloaded

operators and literals is supported as well. The extension to other, more dynamic lan-

guage features should be relatively straightforward (see also below in Section 9.2.2).

This success encourages experimentation with other dynamic language features.

Summary of Contributions

We have made contributions in the following fields:

Compiler Design We have designed a typed continuation-passing style intermedi-

ate representation, which is well suited for type-based dynamic optimizations.

The language supports many language features required for a modern func-

tional language, such as type inference, polymorphic functions and algebraic

data types.

Optimization Techniques We have designed two techniques for dynamic special-

ization: first, a monomorphization algorithm for polymorphic functions and

overloaded operators. Second, a data type representation and layout analysis,

which is also based on dynamic specialization.

Language Implementation The techniques have been implemented in a working

prototype, which is able to run programs of realistic size. The implementation

is highly configurable: all optimizations can be switched on or off independently

and various parameters of the optimizer and code generator can be dynamically

modified. This gives a good test bed for experimenting with optimizations and

heuristics.

Experimental Results The performance of the generated code as well as the time

consumed by the dynamic compiler have been quantified. We have also com-

pared the performance against other functional language implementations, none

of which supports dynamic code generation comparable to ours.

9.2 Future Work

We see possibilities for future work both in theoretical and in practical directions.

9.2 Future Work 117

9.2.1 Theoretic Model

A complete theoretical model, possibly in some kind of specialization calculus, could

improve the understanding of the dynamic interaction in a dynamically specializing

system, especially in the presence of dynamic optimization. The field of partial

evaluation could be a good starting point.

9.2.2 Open Data Types and Open Functions

Functional programming languages normally impose a closed-world assumption: all

parts of a program are available at compile time, at least in the form of precompiled

libraries or interface files. This view precludes usage of modern techniques like plug-

ins or extensions, which can be loaded into a program when requested. Object-

oriented languages feature inheritance and subtyping for compiling and type checking

separately developed program parts (classes and interfaces), and therefore support

the development of such plug-ins. Algebraic data types, on the other hand, do not

support extension, and are therefore less suited for developing dynamic applications.

Type classes as in Haskell are open, but are less suited for modeling many problems

for which algebraic data types are normally used. As a solution, open data types

and open functions have been proposed. L¨oh and Hinze (2006) describe a language

with open data types and open functions and give a translation into normal Haskell,

but under the above-mentioned closed-world assumption. We could make use of

our existing machinery for run-time compilation in order to allow extension of both

algebraic data types and functions at run-time.

As an example, consider an algebraic data type in Haskell, which represents the

abstract syntax of lambda calculus terms:

data E = Var String | App E E | Abs String E

and the fragments of an evaluator for these terms (the syntax open eval declares

eval as an open function):

open eval

eval (Var x) = ...

eval (App e1 e2) = ...

eval (Abs s e) = ...

In the open data type and open function approach, it is possible to define additional

constructors for the data type and additional cases for the function, for example:

data E = ... | Const Int

eval (Const i) = ...

Open data types and open functions fit nicely into our framework: open data types

dynamically extend the defined constants (both at type and term level), and open

functions simply extend given function definitions. For this to work, imagine an input

118 Chapter 9. Conclusions and Future Work

language which provides the definition of functions by equations. This language can

be translated to a language with case statements easily (Wadler, 1987).

For the evaluator example above, the first data declaration and the first set of function

definition gives the following program:

data E = Var String

| App E E

| Abs String E

eval (Var x) = ...

eval (App e1 e2) = ...

eval (Abs s e) = ...

When the second set of definitions is loaded, the program is extended to the following:

data E = Var String

| App E E

| Abs String E

| Const Int

eval (Var x) = ...

eval (App e1 e2) = ...

eval (Abs s e) = ...

eval (Const i) = ...

The dynamic specialization mechanism will automatically use the new set of defini-

tions for the rest of the program evaluation.

When dynamically extending a set of functions, it is of course necessary to ensure

the correct ordering of functions. Again, a sorting algorithm which takes a certain

order relation on function definitions into account can be used.

A necessary extension of our dynamic compiler would be the possibility to invalidate

compiled code when new program fragments are loaded, as these may render earlier

compilations incorrect.

9.2.3 Type Classes

In our framework, type classes and instances could be implemented by adding type-

specific case expressions (or in the extension from the previous section: type-specific

function equations) to the syntax. Returning to our example of the type class Eq

from Chapter 6, we would simply add specific equality functions for all types which

should be instances:

eq (Int: *) (x: Int) (y: Int) = intEq x y

eq (Bool: *) (True: Bool) (True: Bool) = True

eq (Bool: *) (False: Bool) (False: Bool) = True

eq (Bool: *) (x: Bool) (y: Bool) = False

9.2 Future Work 119

Note how the more specific equations for the equality function for Bool come first,

and the last equation (using variables as parameters) denotes the default case.

The default method declarations in Haskell type classes, which allow to define meth-

ods in the body of a class declaration, can also be defined in our system. We can add

the following definition to our functions:

neq (a: *) (x: a) (y: a) = not (eq a x y)

Since this function will be automatically specialized for all types for which it is used,

the effect is the same as for default methods.

Jones (1995) has also investigated the effect of specialization for implementing type

classes. In contrast to the usual dictionary-passing implementations, Jones specializes

programs at compile-time and resolves all applications of overloaded functions. This

technique also relies on a closed-world assumption, and our system results in the

same degree of specialization, but in an open dynamic framework. Interestingly,

Jones reports that the specialization of type classes actually reduces the amount of

binary code, instead of the expected code growth. This is a very encouraging result.

9.2.4 Better Compiler

The current system design has been implemented as a working prototype which can

execute small to medium-sized programs efficiently. The logical next step in the

development is to complete the implementation and to make it more robust and

portable. The addition of more optimizations should be possible, but might require

more information about the program to be derived by additional analyzers. Several

important issues for a practical system have not yet been implemented. Proper code

buffer management is not yet realized in the prototype.

The applicability of the technique presented in this thesis to language abstractions

present in the ML family of languages, e.g. functors, has also not yet been under

investigation.

The design and implementation of specialization strategies, as well as an empirical

study of their effectiveness should be carried out. Another interesting avenue for

future work is the automatic (for example profiling-based) generation of application-

specific strategies.

The runtime compiler and the runtime system could and should be improved in

several ways. It needs a better register allocator, a better garbage collector etc.

Several specific topics for improvement are the following:

Overflow Checking A reliable programming language should support checks on

overflow for arithmetic operations. Standard ML and Opal, for example, check all

arithmetic operations and raise exceptions (or abort the program) if an overflow

happens. The MLton compiler shows that overflow checking does not necessarily

cause a significant slowdown for arithmetic-intensive programs, as can be seen from

the results in Chapter 8.

120 Chapter 9. Conclusions and Future Work

Type System The type system of the current implementation is very basic. It

should be extended to type classes and probably other modern type system features

which have proven useful in practice.

Improved Data Representation Our data representation does not yet exploit

all possibilities of type based representation analysis. For example, algebraic data

types which have only one variant with one field are “transparent”, and directly use

the representation of the field. When a data type has only one variant with fields

(which means, only one boxed variant), a constructor tag could be omitted. This

saves memory and speeds up case distinctions because it removes a redundant test.

Pointers to heap cells could also be tagged with special values in the low bits in order

to speed up case expressions. For data types with few boxed variants, the constructor

tag could be stored in unused bits in the pointer instead of bits in the header word of

the heap-allocated record. The use of tagged pointers has a performance benefit on

modern machines which normally have relatively precise branch prediction and high

clock rates (Marlow et al., 2007).

Additionally record fields which are used together should be placed near to each

other. This arrangement either needs profiling data to determine the “hottest” fields

or some heuristic which captures the real behavior of the system. In addition, it may

be useful to pack fields in such a way that unboxed values are separate from point-

ers, because this simplifies the descriptors needed for garbage collection. Currently,

our implementation uses bitmaps to indicate which words of a record may contain

pointers.

Other useful data representation optimizations include splitting arrays of pairs into

pairs of arrays. This also increases locality and avoids gaps due to alignment restric-

tions.

Graph-based Intermediate Language The prototype is based on tree-shaped

intermediate languages, which naturally arise from the use of algebraic data types

in functional languages. Unfortunately, it turns out that several transformations are

quite slow (for example dead variable elimination), because most of the intermediate

program must be copied even if relatively few modifications to the tree happen. An

alternative is to use a graph-based intermediate representation, such as the one used

by Kennedy (2007). The program graph contains links between variable definitions

and occurrences, so that substituting a variable for another can be performed in

constant time, and operations such as finding all uses of a variable is very efficient.

Another alternative would be to use a zipper (Huet, 1997) data structure, in the form

of control-flow graphs based on a zipper-like structure (Ramsey and Dias, 2005).

9.2.5 Instrumentation and Recompilation

Just-in-time compiling virtual machines for Java and other object-oriented program-

ming languages support instrumentation and recompilation. When a program is

executed, the bytecode is first either interpreted or compiled using a simple, but

9.2 Future Work 121

fast compiler. The interpreter or the code generated by the simple compiler counts

events, such as method invocations. When these counters indicate (by some heuristic

measure) that an event has occurred often enough, the code for the method is either

compiled (for the interpreter) or recompiled at a higher optimization level (for the

compiled version). The goal is to adapt the generated code to the dynamic behavior

of the program under execution.

Integration of a recompilation mechanism into our prototype would be desirable, as

it is necessary for advanced research in dynamic compilation and optimization. We

have implemented a basic instrumentation infrastructure, but it is currently only used

for controlling the placement of delay expressions when incremental compilation is

enabled.

9.2.6 Value Specialization

It would be possible to reuse the specialization mechanism for value specialization by

changing the heuristics which inserts calls to the run-time specializer. Currently, the

specializer is only called for type parameters in order to remove polymorphism, but

changing that would allow the code to be specialized for run-time parameters such

as the underlying system or tuning parameters. Similarly to the approach taken in

the Fabius system by Leone and Lee (1998) one could implement a sort of staging for

functions, based on currying; or we could request annotations from the programmer,

as in staged languages such as MetaML (Taha and Sheard, 2000). Whaley (1999)

has investigated value specialization for Java.

122 Chapter 9. Conclusions and Future Work

List of Figures

2.1 Virtualmachine.............................. 17

3.1 Dynamic specialization system architecture . . . . . . . . . . . . . . . 36

3.2 Compiler phases and intermediate languages . . . . . . . . . . . . . . 37

4.1 Running example: input program . . . . . . . . . . . . . . . . . . . . 41

4.2 Running example: type-checked . . . . . . . . . . . . . . . . . . . . . 41

4.3 Source language of CPS transformation (types) . . . . . . . . . . . . 43

4.4 Source language of CPS transformation (expressions) . . . . . . . . . 44

4.5 Source language type system (part 1) . . . . . . . . . . . . . . . . . . 46

4.6 Source language type system (part 2) . . . . . . . . . . . . . . . . . . 47

4.7 Types of built-in constants . . . . . . . . . . . . . . . . . . . . . . . . 48

4.8 CPS language (types) . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

4.9 CPS language (expressions) . . . . . . . . . . . . . . . . . . . . . . . 50

4.10 CPS language type system (values) . . . . . . . . . . . . . . . . . . . 51

4.11 CPS language type system (terms) . . . . . . . . . . . . . . . . . . . 52

4.12 CPS transformation (part 1) . . . . . . . . . . . . . . . . . . . . . . . 54

4.13 CPS transformation (part 2) . . . . . . . . . . . . . . . . . . . . . . . 55

4.14 CPS transformation of types . . . . . . . . . . . . . . . . . . . . . . . 56

4.15 CPS transformation of functions . . . . . . . . . . . . . . . . . . . . . 57

4.16 CPS transformation of primitive functions . . . . . . . . . . . . . . . 57

4.17 Running example: CPS converted . . . . . . . . . . . . . . . . . . . . 58

4.18 Extensions for closure-converted CPS code . . . . . . . . . . . . . . . 59

4.19 Type system extensions for closure-converted terms . . . . . . . . . . 60

4.20 Closure conversion (part 1) . . . . . . . . . . . . . . . . . . . . . . . . 62

4.21 Closure conversion (part 2) . . . . . . . . . . . . . . . . . . . . . . . . 63

4.22 Closure conversion (part 3) . . . . . . . . . . . . . . . . . . . . . . . . 64

4.23 Closure conversion of types . . . . . . . . . . . . . . . . . . . . . . . . 64

4.24 Running example: closure converted . . . . . . . . . . . . . . . . . . . 66

5.1 CPS language extension for incremental compilation . . . . . . . . . . 70

6.1 Identity function example . . . . . . . . . . . . . . . . . . . . . . . . 78

6.2 Assembler code for identity example . . . . . . . . . . . . . . . . . . . 79

6.3 Assembler code for identity example: integer version . . . . . . . . . . 80

123

124 LIST OF FIGURES

6.4 Assembler code for identity example: floating-point version . . . . . . 80

6.5 Polymorphic recursive example program . . . . . . . . . . . . . . . . 81

6.6 Polymorphic recursive example tree . . . . . . . . . . . . . . . . . . . 82

6.7 Possible data representations for (a) List Integer and (b) List Float 87

6.8 Common data representation for List Float ............. 87

8.1 Pseudoknot: run times . . . . . . . . . . . . . . . . . . . . . . . . . . 110

8.2 Pseudoknot: run and compile times combined . . . . . . . . . . . . . 111

List of Tables

8.1 Common benchmark programs . . . . . . . . . . . . . . . . . . . . . . 96

8.2 Special benchmark programs . . . . . . . . . . . . . . . . . . . . . . . 97

8.3 Measured optimizations . . . . . . . . . . . . . . . . . . . . . . . . . 98

8.4 Benchmark results: effects of various optimizations (part 1), bench-

mark pfibopt: measurements for “O0” missing (see text) . . . . . . . 100

8.5 Benchmark results: of effects various optimizations (part 2) . . . . . . 101

8.6 Implementation versions . . . . . . . . . . . . . . . . . . . . . . . . . 103

8.7 Command line options for optimization . . . . . . . . . . . . . . . . . 103

8.8 Benchmark results: nfib . . . . . . . . . . . . . . . . . . . . . . . . . 104

8.9 Benchmark results: lists . . . . . . . . . . . . . . . . . . . . . . . . . 105

8.10 Benchmark results: trees . . . . . . . . . . . . . . . . . . . . . . . . . 105

8.11Pseudoknotresults ............................ 109

8.12 Results for incremental compilation . . . . . . . . . . . . . . . . . . . 112

125

126 LIST OF TABLES

Bibliography

A.-R. Adl-Tabatabai, M. Cierniak, G.-Y. Lueh, V. M. Parikh, and J. M. Stichnoth.

Fast, effective code generation in a just-in-time Java compiler. In PLDI ’98: Pro-

ceedings of the ACM SIGPLAN 1998 Conference on Programming Language Design

and Implementation, pages 280–290. ACM Press, 1998.

A.-R. Adl-Tabatabai, J. Bharadwaj, D.-Y. Chen, A. Ghuloum, V. Menon, B. Murphy,

M. Serrano, and T. Shpeisman. The StarJIT compiler: A dynamic compiler for

managed runtime environments. Intel Technology Journal, 7(1), Feb. 2003.

J. Agat. Types for register allocation. In Proceedings of IFL’97, volume 1467 of

Lecture Notes in Computer Science, 1997.

O. Agesen. The Design and Implementation of Pep, A JavaTM Just-In-Time Trans-

lator. Theory and Practice of Object Systems, 3(2):127–155, 1997.

A. W. Appel. Compiling with Continuations. Cambridge University Press, 1992.

A. W. Appel. Loop Headers in Lambda-Calculus or CPS. Technical Report TR-460-

94, Princeton University, 1994.

A. W. Appel. SSA is Functional Programming. ACM SIGPLAN Notices, 33(4):

17–20, 1998.

A. W. Appel and D. B. MacQueen. A Standard ML Compiler. In Proceedings of

the Conference on Functional Programming Languages and Computer Architec-

ture, volume 274, pages 301–324, Portland, Oregon, USA, September 14–16, 1987.

Springer, Berlin.

J. Armstrong, R. Virding, C. Wikstr¨om, and M. Williams. Concurrent Programming

in Erlang. Prentice-Hall, second edition, 1996.

M. Arnold, S. Fink, D. Grove, M. Hind, and P. F. Sweeney. Architecture and policy for

adaptive optimization in virtual machines. Research Report 23429, IBM Research,

Nov. 2004.

M. Arnold, S. J. Fink, D. Grove, M. Hind, and P. F. Sweeney. A survey of adaptive

optimization in virtual machines. Proceedings of the IEEE, 93(2):449–466, Feb.

2005.

127

128 BIBLIOGRAPHY

L. Augustsson. Implementing Haskell overloading. In FPCA ’93: Proceedings of

the Conference on Functional Programming Languages and Computer Architecture,

pages 65–73, New York, NY, USA, 1993. ACM Press.

J. Aycock. A brief history of just-in-time. ACM Computing Surveys, 35(2):97–113,

2003.

J. Backus. Can programming be liberated from the von Neumann style? A functional

style and its algebra of programs. Communications of the ACM, 21(8):613–641,

Aug. 1978.

J. Backus, R. Beeber, S. Best, R. Goldberg, L. Haibt, H. Herrick, R. Nelson, D. Sayre,

P. Sheridan, H. Stern, I. Ziller, R. Hughes, and R. Nutt. The FORTRAN automatic

coding system. In Proceedings Western Joint Computer Conference, pages 188–198,

Los Angeles, California, Feb. 1957.

V. Bala, E. Duesterwald, and S. Banerjia. Dynamo: a transparent dynamic optimiza-

tion system. In PLDI ’00: Proceedings of the ACM SIGPLAN 2000 Conference on

Programming Language Design and Implementation, pages 1–12, New York, NY,

USA, 2000. ACM.

V. Balat and O. Danvy. Strong normalization by type-directed partial evaluation

and run-time code generation. In TIC ’98: Proceedings of the Second International

Workshop on Types in Compilation, pages 240–252, London, UK, 1998. Springer-

Verlag.

L. Baraz, T. Devor, O. Etzion, S. Goldenberg, A. Skaletsky, Y. Wang, and Y. Zemach.

IA-32 Execution Layer: a two-phase dynamic translator designed to support IA-

32 applications on ItaniumTM-based systems. In MICRO 36: Proceedings of the

36th annual IEEE/ACM International Symposium on Microarchitecture, page 191,

Washington, DC, USA, 2003. IEEE Computer Society.

H. Barendregt. Lambda calculi with types. In Abramsky, Gabbay, and Maibaum, ed-

itors, Handbook of Logic in Computer Science, (Background: Computational Struc-

tures), volume 2. Oxford University Press, 1992.

J. R. Bell. Threaded code. Communications of the ACM, 16(6):370–372, 1973.

G. Bierman, M. Hicks, P. Sewell, and G. Stoyle. Formalizing dynamic software

updating (extended abstract). In Proceedings of Workshop on Unexpected Software

Evolution (USE’03), Apr. 2003.

Bigloo Developers. Bigloo Homepage. Available from: http://www-sop.inria.fr/

mimosa/fp/Bigloo/, 2008. Last visited: 2008-10-07.

R. Bird and P. Wadler. Introduction to Functional Programming. Prentice-Hall, 1988.

BIBLIOGRAPHY 129

M. Blume, U. A. Acar, and W. Chae. Extensible programming with first-class cases.

In ICFP ’06: Proceedings of the eleventh ACM SIGPLAN International Conference

on Functional Programming, pages 239–250, New York, NY, USA, 2006. ACM

Press.

U. Boquist. Code Optimisation Techniques for Lazy Functional Languages. PhD

thesis, Chalmers University of Technology, Gothenburg, Apr. 1999.

G. Bray. Implementation implications of Ada generics. Ada Letters, III(2):62–71,

1983.

P. J. Brown. Throw-away compiling. Software Practice and Experience, 6(4):423–434,

1976.

D. Bruening, T. Garnett, and S. Amarasinghe. An infrastructure for adaptive dy-

namic optimization. In CGO ’03: Proceedings of the International Symposium on

Code Generation and Optimization, pages 265–275. IEEE Computer Society, 2003.

D. L. Bruening. Efficient, Transparent, and Comprehensive Runtime Code Manipu-

lation. PhD thesis, Massachusetts Institute of Technology, Sept. 2004.

T. Brus, M. C. J. D. van Eekelen, M. van Leer, M. J. Plasmeijer, and H. Barendregt.

Clean - a language for functional graph rewriting. In Kahn, editor, Proc. of Con-

ference on Functional Programming Languages and Computer Architecture (FPCA

’87), Portland, Oregon, USA, number 274 in Lecture Notes in Computer Science,

pages 364–384. Springer-Verlag, 1987.

R. G. Burger. Efficient Compilation and Profile-Driven Dynamic Recompilation in

Scheme. PhD thesis, Indiana University Computer Science Department, Mar. 1997.

R. G. Burger and R. K. Dybvig. An infrastructure for profile-driven dynamic recom-

pilation. In IEEE Computer Society 1998 International Conference on Computer

Languages, May 1998.

M. Burke, J. Choi, S. Fink, D. Grove, M. Hind, V. Sarkar, M. Serrano, V. Sreedhar,

H. Srinivasan, and J. Whaley. The Jalape˜no dynamic optimizing compiler for Java.

In Proceedings ACM 1999 Java Grande Conference, pages 129–141, San Francisco,

CA, United States, June 1999. ACM Press.

R. M. Burstall, D. B. MacQueen, and D. T. Sannella. Hope: An Experimental

Applicative Language. In Conference Record of the 1980 LISP Conference, pages

136–143, Stanford University, Stanford, California, Aug. 1980. ACM Press.

H. Cejtin, S. Jagannathan, and S. Weeks. Flow-directed closure conversion for typed

languages. In European Symposium on Programming, pages 56–71, Mar. 2000.

G. J. Chaitin. Register allocation & spilling via graph coloring. In SIGPLAN ’82:

Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, pages

98–105, New York, NY, USA, 1982. ACM Press.

130 BIBLIOGRAPHY

C. Chambers and D. Ungar. Customization: optimizing compiler technology for

Self, a dynamically-typed object-oriented programming language. In Proceedings

of the ACM SIGPLAN 1989 Conference on Programming Language Design and

Implementation, pages 146–160. ACM Press, 1989.

C. J. Cheney. A non-recursive list compaction algorithm. Communications of the

ACM, 13(11):677–678, Nov. 1970.

A. Chernoff and R. Hookway. Digital FX!32 – Running 32-Bit x86 Applications

on Alpha NT. In Proceedings of the USENIX Windows NT Workshop, Seattle,

Washington, Aug. 1997.

C. Consel and O. Danvy. Partial evaluation: Principles and perspectives. In Proceed-

ings of the Twentieth Annual ACM SIGPLAN-SIGACT Symposium on Principles

of Programming Languages. ACM Press, Jan. 1993.

K. Crary, S. Weirich, and G. Morrisett. Intensional polymorphism in type-erasure

semantics. In ICFP ’98: Proceedings of the third ACM SIGPLAN International

Conference on Functional Programming, pages 301–312, New York, NY, USA, 1998.

ACM Press.

R. J. Dakin and P. C. Poole. A mixed code approach. Computer Journal, 16(3):

219–222, 1973.

L. Damas and R. Milner. Principal type schemes for functional programs. In Pro-

ceedings 9’th ACM Symposium on Principles of Programming Languages, pages

207–212, Albuquerque, N.M, Jan. 1982. ACM Press.

O. Danvy and A. Filinski. Representing control: A study of the CPS transformation.

Mathematical Structures in Computer Science, 2(4):361–391, 1992.

J. L. Dawson. Combining interpretive code with machine code. Computer Journal,

16(3):216–219, 1973.

D. Detlefs and O. Agesen. Inlining of virtual methods. In ECOOP ’99: Proceedings

of the 13th European Conference on Object-Oriented Programming, pages 258–278.

Springer-Verlag, 1999.

L. P. Deutsch and A. M. Schiffman. Efficient implementation of the Smalltalk-80

system. In Proceedings of the 11th ACM SIGACT-SIGPLAN Symposium on Prin-

ciples of Programming Languages, pages 297–302. ACM Press, 1984.

R. B. K. Dewar. Indirect threaded code. Communications of the ACM, 18(6):330–331,

1975.

R. Dockins and S. Z. Guyer. Bytecode verification for Haskell. Technical report,

Tufts University, Feb. 2007.

BIBLIOGRAPHY 131

DotGNU Project. DotGNU Project/Portable.NET. Available from: http://dotgnu.

org, 2004. Last visited: 2008-10-07.

D. Dub´e. Demand-Driven Type Analysis for Dynamically-Typed Functional Lan-

guages. PhD thesis, Universit´e de Montr´eal, Aug. 2002.

K. Ebcioglu and E. R. Altman. DAISY: Dynamic Compilation for 100% Architectural

Compatibility. Technical Report 8502, IBM, 1996.

ECMA International. Standard ECMA-334: Common Language Infrastructure

(CLI). 3rd edition, June 2005.

A. Ershov. On the partial computation principle. Information Processing Letters, 6

(2):38–41, 1977.

M. Feeley and G. Lapalme. Closure generation based on viewing lambda as epsilon

plus compile. Journal of Computer Languages, 17(4), 1992.

A. Fischbach and J. Hannan. Type systems for useless-variable elimination. In PADO

’01: Proceedings of the Second Symposium on Programs as Data Objects, pages 25–

38, London, UK, 2001. Springer-Verlag.

C. Flanagan, A. Sabry, B. F. Duba, and M. Felleisen. The essence of compiling with

continuations. In Proceedings ACM SIGPLAN 1993 Conference on Programming

Language Design and Implementation, PLDI’93, Albuquerque, NM, USA, 23–25

June 1993, volume 28(6), pages 237–247. ACM Press, New York, 1993.

M. S. O. Franz. Code-Generation On-the-fly: A Key to Portable Software. PhD

thesis, ETH Z¨urich, 1994.

C. W. Fraser, R. R. Henry, and T. A. Proebsting. Burg: fast optimal instruction

selection and tree parsing. SIGPLAN Notices, 27(4):68–76, 1992.

GCC Developers. GNU Compiler Collection Homepage. Available from: http:

//gcc.gnu.org, 2008. Last visited: 2008-10-07.

GHC Developers. Glasgow Haskell Compiler Homepage. Available from: http:

//www.haskell.org/ghc, 2008. Last visited: 2008-10-07.

A. Goldberg and D. Robson. Smalltalk-80: The Language and its Implementation.

Addison-Wesley, 1983.

J. Gosling, B. Joy, G. Steele, and G. Bracha. JavaTM Language Specification. Addison-

Wesley, 2nd edition, June 2000.

M. Grabm¨uller. Implementing Closures using Run-time Code Generation. Research

report 2006-02 in Forschungsberichte Fakult¨at IV – Elektrotechnik und Informatik,

Technische Universit¨at Berlin, Feb. 2006.

132 BIBLIOGRAPHY

M. Grabm¨uller. A model of functional programming with dynamic compilation and

optimization. In H. Nilsson, editor, Trends in Functional Programming, volume 7.

Intellect, Apr. 2007.

M. Grabm¨uller. rdtsc – binding for the rdtsc machine instruction. Available on

the World Wide Web: http://uebb.cs.tu-berlin.de/~magr/projects/rdtsc/

doc/, June 2008. Last visited: 2008-10-07.

M. Grabm¨uller and D. Kleeblatt. Harpy: Run-time code generation in Haskell. In

Haskell Workshop 2007. ACM Press, Sept. 2007.

B. Grant, M. Mock, M. Philipose, C. Chambers, and S. J. Eggers. DyC: An expressive

annotation-directed dynamic compiler for C. Theoretical Computer Science, 248

(1–2):147–199, Oct. 2000.

D. Gries and N. Gehani. Some ideas on data types in high-level languages. Commu-

nications of the ACM, 20(6):414–420, 1977.

L.-J. Guillemette and S. Monnier. A type-preserving closure conversion in Haskell.

In Haskell Workshop 2007, 2007.

C. V. Hall, K. Hammond, S. L. P. Jones, and P. L. Wadler. Type classes in Haskell.

ACM Transactions on Programming Languages and Systems, 18(2):109–138, Mar.

1996.

J. Hannan. Type systems for closure conversions. In H. R. Nielson and K. L. Solberg,

editors, Proceedings of Workshop on Types for Program Analysis, number PB-493

in Daimi Reports, pages 48–62, 1995.

J. Hannan and P. Hicks. Higher-order arity raising. In Proceedings of 3rd ACM SIG-

PLAN International Conference on Functional Programming, pages 27–38, Balti-

more, MD, Sept. 1998.

J. Hannan and P. Hicks. Higher-order uncurrying. Journal of Higher Order and

Symbolic Computation, 13(3):179–216, 2000.

R. Harper and M. Lillibridge. Polymorphic type assignment and CPS conversion.

Lisp and Symbolic Computation, 6(3–4):361–379, Nov. 1993.

R. Harper and G. Morrisett. Compiling polymorphism using intensional type analysis.

In POPL ’95: Proceedings of the 22nd ACM SIGPLAN-SIGACT symposium on

Principles of programming languages, pages 130–141, New York, NY, USA, 1995.

ACM Press.

P. H. Hartel, M. Feeley, M. Alt, L. Augustson, P. Baumann, M. Beemster, E. Chail-

loux, C. H. Flood, W. Grieskamp, J. H. G. van Groningen, K. Hammond, B. Haus-

man, M. Y. Ivory, R. E. Jones, J. Kamperman, P. Lee, X. Leroy, R. D. Lins,

S. Loosemore, N. R¨ojemo, M. Serrano, J.-P. Talpin, J. Thackray, S. Thomas,

BIBLIOGRAPHY 133

P. Walters, P. Weis, and P. Wentworth. Benchmarking implementations of func-

tional languages with “pseudoknot”, a float-intensive benchmark. Journal of Func-

tional Programming, 6(4):621–655, 1996.

K. Hazelwood Cettei. Code Cache Management in Dynamic Optimization Systems.

PhD thesis, Harvard University, May 2004.

M. Hicks. Dynamic Software Updating. PhD thesis, Computer and Information

Science Department, the University of Pennsylvania, Aug. 2001.

U. H¨olzle. Adaptive optimization for Self: Reconciling High Performance with Ex-

ploratory Programming. PhD thesis, Computer Science Department, Stanford Uni-

versity, 1994.

C. Hope and G. Hutton. Accurate step counting. In Proceedings of the 17th Inter-

national Workshop on Implementation and Application of Functional Languages,

Dublin, Ireland, 2005.

L. Hornof and T. Jim. Certifying compilation and run-time code generation. Higher-

Order and Symbolic Computation, 12(4):337–375, Dec. 1999.

P. Hudak, J. Peterson, and J. H. Fasel. A gentle introduction to Haskell. World Wide

Web, 1999.

G. Huet. Function Pearl: The Zipper. Journal of Functional Programming, 7(5):

549–554, Sept. 1997.

Intel Corporation. IA-32 Intel Architecture Optimization Reference Man-

ual, June 2005. Available from: http://www.intel.com/design/Pentium4/

documentation.htm.

Intel Corporation. IA-32 Intel Architecture Software Developer’s Manual Volumes

1, 2A, 2B, 3A, 3B, Jan. 2006. Available from: http://www.intel.com/design/

Pentium4/documentation.htm.

R. L. Johnston. The dynamic incremental compiler of APL\3000. In APL ’79:

Proceedings of the International Conference on APL: part 1, pages 82–87, New

York, NY, USA, 1979. ACM Press.

M. P. Jones. The implementation of the Gofer functional programming system. Re-

search report YALEU/DCS/RR-1030, Yale University, New Haven, Connecticut,

USA, May 1994.

M. P. Jones. Dictionary-free overloading by partial evaluation. In Proceedings of

PEPM’95, 1995.

S. L. P. Jones. Implementing lazy functional languages on stock hardware: The

spineless tagless g-machine. Journal of Functional Programming, 2(2):127–202,

1992.

134 BIBLIOGRAPHY

R. Kelsey, W. Clinger, J. Rees, et al. Revised5report on the algorithmic language

Scheme. ACM SIGPLAN Notices, 33(6):26–76, Sept. 1998.

R. A. Kelsey. A correspondence between continuation passing style and static single

assignment form. SIGPLAN Notices, 30(3):13–22, 1993.

A. Kennedy. Compiling with continuations, continued. In ICFP ’07: Proceedings of

the 12th ACM SIGPLAN International Conference on Functional Programming,

pages 177–190, New York, NY, USA, 2007. ACM.

B. W. Kernighan and D. M. Ritchie. The C Programming Language. Prentice-Hall,

2nd edition, 1988.

A. Klaiber. The technology behind Crusoe processors. Available from: http://

www.charmed.com/PDF/CrusoeTechnologyWhitePaper_1-19-00.pdf, 2000. Last

visited: 2008-10-07.

N. Kobayashi. Type-based useless-variable elimination. Higher-Order and Symbolic

Computation, 14(2-3):221–260, 2001.

T. Kotzmann and H. M¨ossenb¨ock. Escape analysis in the context of dynamic com-

pilation and deoptimization. In VEE ’05: Proceedings of the 1st ACM/USENIX

International Conference on Virtual Execution Environments, pages 111–120, New

York, NY, USA, 2005. ACM Press.

O. Lee and K. Yi. Proofs about a folklore let-polymorphic type inference algorithm.

ACM Transactions on Programming Languages and Systems, 20(4):707–723, July

1998.

D. Leijen. The λAbroad – A Functional Approach to Software Components. PhD

thesis, Utrecht University, Nov. 2003.

D. Leijen and E. Meijer. Parsec: Direct style monadic parser combinators for the

real world. Technical Report UU-CS-2001-27, Department of Computer Science,

Universiteit Utrecht, 2001.

M. Leone and R. K. Dybvig. Dynamo: A staged compiler architecture for dynamic

program optimization. Technical Report 490, Indiana University Computer Science

Department, Sept. 1997.

M. Leone and P. Lee. Lightweight run-time code generation. In Proceedings of the

ACM SIGPLAN Workshop on Partial Evaluation and Semantics-Based Program

Manipulation, pages 97–106, June 1994.

M. Leone and P. Lee. A declarative appoach to run-time code generation. In Work-

shop on Compiler Support for System Software (WCSSS), 1996.

M. Leone and P. Lee. Dynamic specialization in the fabius system. ACM Computing

Surveys, 30(3es):23, 1998.

BIBLIOGRAPHY 135

X. Leroy. Unboxed objects and polymorphic typing. In POPL ’92: Proceedings

of the 19th ACM SIGPLAN-SIGACT Symposium on Principles of Programming

Languages, pages 177–188, New York, NY, USA, 1992. ACM Press.

X. Leroy. The effectiveness of type-based unboxing. In Workshop on Types in Com-

pilation, Amsterdam, The Netherlands, June 1997. ACM SIGPLAN. Published as

Boston College Computer Science Dept. Technical Report BCCS-97-03.

T. Lindholm and F. Yellin. The JavaTM Virtual Machine Specification. Sun Microsys-

tems, 2nd edition, 1999.

A. L¨oh and R. Hinze. Open data types and open functions. In M. Maher, editor,

Eighth ACM-SIGPLAN International Symposium on Principles and Practice of

Declarative Programming (PPDP’06), Venice, Italy, July 2006. ACM Press.

D. Lomov and A. Moscal. Dynamic Caml – Run-Time Code Generation Library for

Objective Caml. Available on the World Wide Web at http://oops.tepkom.ru/

dml/, last visited: 2008-10–7, May 2002.

D. Luna, M. Pettersson, and K. Sagonas. Efficiently compiling a functional language

on amd64: The hipe experience. In 7th ACM SIGPLAN International Symposium

on Principles and Practice of Declarative Programming (PPDP 2005), 2005.

S. Marlow, A. R. Yakushev, and S. Peyton Jones. Faster laziness using dynamic

pointer tagging. In ICFP 2007, 2007.

H. Massalin. Synthesis: An Efficient Implementation of Fundamental Operating Sys-

tem Services. PhD thesis, Columbia University, 1992.

J. McCarthy. Recursive functions of symbolic expressions and their computation by

machine, Part I. Communications of the ACM, 3(4):184–195, 1960.

C. S. McDonald. fsh – a functional unix command interpreter. Software Practice and

Experience, 17(10):685–700, Oct. 1983.

J. Meacham. jhc. Available from: http://repetae.net/john/computer/jhc/, 2007.

Last visited: 2008-10-07.

G. Meehan and M. Joy. Compiling lazy functional programs to Java bytecode. Soft-

ware Practice and Experience, 29(7):617–645, 1999.

R. Milner, M. Tofte, R. Harper, and D. MacQueen. The Definition of Standard ML.

MIT Press, 1997. Revised edition.

Y. Minamide, G. Morrisett, and R. Harper. Typed closure conversion. In In Proc. 23rd

Annual ACM SIGPLAN-SIGACT Symp. on Principles of Programming Languages,

pages 271–283. ACM Press, 1996.

136 BIBLIOGRAPHY

MLton Developers. MLton Standard ML Compiler Homepage. Available from: http:

//mlton.org, 2006. Last visited: 2008-10-07.

M. Mock. Automating Selective Dynamic Compilation. PhD thesis, University of

Washington, Aug. 2002.

Mono Project. Mono. Available from: http://www.mono-project.com, 2007. Last

visited: 2008-10-07.

G. Morrisett. Compiling with Types. PhD thesis, School of Computer Science,

Carnegie Mellon University, Dec. 1995. Published as CMU Technical Report CMU-

CS-95-226.

G. Morrisett, D. Walker, K. Crary, and N. Glew. From System F to typed assembly

language. ACM Transactions on Programming Languages and Systems, 21(3):528–

569, May 1999.

R. Morrison, A. Dearle, R. C. H. Connor, and A. L. Brown. An ad hoc approach to the

implementation of polymorphism. ACM Transactions on Programming Languages

and Systems, 13(3):342–371, 1991.

G. C. Necula. Proof-carrying code. In Proceedings of the 24th ACM SIGPLAN-

SIGACT Symposium on Principles of Programming Languages, pages 106–119.

ACM Press, 1997.

H.-D. Nguyen and A. Ohori. Record unboxing. World Wide Web: http://www.

pllab.riec.tohoku.ac.jp/~ohori/research/RecordUnboxing.pdf, May 2007.

Last visited: 2008-10-07.

F. Nielson, H. R. Nielson, and C. Hankin. Principles of Program Analysis. Springer-

Verlag, 1999.

K. Ogata and N. Doi. Object allocation and dynamic compilation in MultithreadS-

malltalk. In SAC ’94: Proceedings of the 1994 ACM Symposium on Applied Com-

puting, pages 452–456, New York, NY, USA, 1994. ACM Press.

F. Ogel, G. Thomas, and B. Folliot. Supporting efficient dynamic aspects through

reflection and dynamic compilation. In SAC ’05: Proceedings of the 2005 ACM

Symposium on Applied Computing, pages 1351–1356, New York, NY, USA, 2005.

ACM Press.

Opal Group. Opal Project Homepage. Available from: http://uebb.cs.tu-berlin.

de/~opal, 2004. Last visited: 2008-10-07.

M. Paleczny, C. Vick, and C. Click. The Java HotSpotTM Server Compiler. In

Proceedings of the JavaTM Virtual Machine Research and Technology Symposium

(JVM ’01). USENIX Association, Apr. 2001.

BIBLIOGRAPHY 137

W. Partain. The nofib benchmark suite of haskell programs. In Proceedings of the

1992 Glasgow Workshop on Functional Programming, pages 195–202, London, UK,

1993. Springer-Verlag.

P. Pepper. Funktionale Programmierung in OPAL, ML, HASKELL und GOFER.

Springer-Verlag, 2nd edition, 2003.

N. Perry and E. Meijer. Implementing functional languages on object-oriented virtual

machines. Microsoft White Paper.

S. Peyton Jones, editor. Haskell 98 Language and Libraries—The Revised Re-

port. Cambridge University Press, Apr. 2003. Also available from: http:

//www.haskell.org/haskellwiki/Definition, last visited: 2008-10-07.

S. L. Peyton Jones and J. Launchbury. Unboxed values as first class citizens in a non-

strict functional language. In J. Hughes, editor, Proceedings of the Conference on

Functional Programming and Computer Architecture, pages 636–666, Cambridge,

Massachussets, USA, 26–28 August 1991. Springer-Verlag LNCS523.

B. C. Pierce. Types and Programming Languages. MIT Press, Cambridge, Mas-

sachusettes, 2002.

M. Poletto and V. Sarkar. Linear scan register allocation. ACM Trans. Program.

Lang. Syst., 21(5):895–913, 1999.

C. Pu, H. Massalin, and J. Ioannidis. The Synthesis kernel. Computing Systems, 1

(1):11–32, Winter 1988.

N. Ramsey and J. Dias. An applicative control-flow graph based on huet’s zipper. In

ACM SIGPLAN Workshop on ML, pages 101–122, Sept. 2005.

J. Reppy. Local CPS conversion in a direct-style compiler. In Proceedings of the

Third ACM SIGPLAN Workshop on Continuations (CW’01), pages 13–22, 2001.

A. Rossberg. The missing link: dynamic components for ML. In ICFP ’06: Pro-

ceedings of the eleventh ACM SIGPLAN International Conference on Functional

Programming, pages 99–110, New York, NY, USA, 2006. ACM Press.

A. Rossberg, D. L. Botlan, G. Tack, T. Brunklaus, and G. Smolka. Alice through

the looking glass. In Trends in Functional Programming, volume 5, pages 79–96.

Intellect, 2006.

D. Sands. Calculi for Time Analysis of Functional Programs. PhD thesis, Department

of Computing, Imperial College, University of London, September 1990.

M. Schinz and M. Odersky. Tail call elimination on the Java virtual machine. In

N. Benton and A. Kennedy, editors, Electronic Notes in Theoretical Computer

Science, volume 59. Elsevier, 2001.

138 BIBLIOGRAPHY

M. Serrano and P. Weis. Bigloo: A portable and optimizing compiler for strict

functional languages. In Static Analysis Symposium, pages 366–381, 1995.

Z. Shao. Compiling Standard ML for Efficient Execution on Modern Machines. PhD

thesis, Princeton University, Nov. 1994.

Z. Shao and A. W. Appel. A type-based compiler for standard ml. In PLDI ’95:

Proceedings of the ACM SIGPLAN 1995 Conference on Programming Language

Design and Implementation, pages 116–129, New York, NY, USA, 1995. ACM

Press.

Z. Shao and A. W. Appel. Efficient and safe-for-space closure conversion. ACM

Transactions on Programming Languages and Systems, 22(1):129–161, Jan. 2000.

O. Shivers. Control-Flow Analysis of Higher-Order Languages. PhD thesis, School of

Computer Science, Carnegie Mellon University, 1991.

O. Shivers. Supporting dynamic languages on the Java virtual machine. Technical

Report AIM-1576, MIT AI Laboratory, 1996.

J. M. Siskind. Flow-directed lightweight closure conversion. Technical Report 99-

190R, NEC Research Institute, Dec. 1999.

J. E. Smith and R. Nair. Virtual Machines – Versatile Platforms for Systems and

Processes. Morgan Kaufman, 2005.

M. Sperber and P. Thiemann. Two for the price of one: Composing partial evaluation

and compilation. In SIGPLAN Conference on Programming Language Design and

Implementation, pages 215–225, 1997.

D. Spinellis. An implementation of the Haskell language. Master’s thesis, Imperial

College, June 1990.

P. A. Steckler and M. Wand. Lightweight closure conversion. ACM Trans. Program.

Lang. Syst., 19(1):48–86, 1997.

D. Stewart. hs-plugins – Dynamically Loaded Haskell Modules. World Wide

Web: http://www.cse.unsw.edu.au/~dons/hs-plugins/, Feb. 2006. Last vis-

ited: 2008-10-07.

D. Stewart and M. M. T. Chakravarty. Dynamic applications from the ground up. In

Proceedings of the ACM SIGPLAN Workshop on Haskell. ACM Press, Sept. 2005.

G. Stoyle, M. Hicks, G. Bierman, P. Sewell, and I. Neamtiu. Mutatis mutandis: safe

and predictable dynamic software updating. In POPL ’05: Proceedings of the 32nd

ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages,

pages 183–194, New York, NY, USA, 2005. ACM Press.

B. Stroustrup. The C++ Programming Language. Addison-Wesley, New York, 1986.

BIBLIOGRAPHY 139

T. Suganuma, T. Ogasawara, M. Takeuchi, T. Yasue, M. Kawahito, K. Ishizaki,

H. Komatsu, and T. Nakatani. Overview of the IBM Java Just-in-Time Compiler.

IBM Systems Journal, 39(1):175–193, 2000.

Sun, Inc. Java SE Downloads - Previous Release - J2SE 5.0. Available from: http:

//java.sun.com/javase/downloads/index_jdk5.jsp, 2008. Last visited: 2008-

10-07.

W. Taha. Multi-Stage Programming: Its Theory and Applications. PhD thesis, Ore-

gon Graduate Institute of Science and Technology, Nov. 1999.

W. Taha and T. Sheard. Multi-stage programming with explicit annotations. In

PEPM, 1991.

W. Taha and T. Sheard. MetaML and multi-stage programming with explicit anno-

tations. Theoretical Computer Science, 248(1–2):211–242, Oct. 2000.

D. Tarditi. Design and Implementation of Code Optimiziations for a Type-Directed

Compiler for Standard ML. PhD thesis, School of Computer Science, Carnegie

Mellon University, Dec. 1996. Available as Technical Report CMU-CS-97-108.

D. Tarditi, G. Morrisett, P. Cheng, C. Stone, R. Harper, and P. Lee. Til: A type-

directed compiler for ml. In ACM SIGPLAN Conf. Programming Language Design

and Implementation, pages 181–192, New York, 1996. ACM Press.

K. Thompson. Programming techniques: Regular expression search algorithm. Com-

munications of the ACM, 11(6):419–422, 1968.

A. Tolmach. Combining closure conversion with closure analysis using algebraic types.

In Types in Compilation workshop, June 1997.

D. Ung and C. Cifuentes. Machine-adaptable dynamic binary translation. In DY-

NAMO ’00: Proceedings of the ACM SIGPLAN Workshop on Dynamic and Adap-

tive Compilation and Optimization, pages 41–51, New York, NY, USA, 2000. ACM

Press.

D. Ung and C. Cifuentes. Dynamic binary translation using run-time feedbacks.

Science of Computer Programming, 60(2):189–204, Apr. 2006.

D. Ungar and R. B. Smith. Self: The power of simplicity. Lisp and Symbolic Com-

putation, 4(3), 1991.

P. Unnikrishnan, G. Chen, M. Kandemir, and D. R. Mudgett. Dynamic compilation

for energy adaptation. In ICCAD ’02: Proceedings of the 2002 IEEE/ACM In-

ternational Conference on Computer-aided Design, pages 158–163, New York, NY,

USA, 2002. ACM Press.

140 BIBLIOGRAPHY

A. van Weelden and R. Plasmeijer. A functional shell that dynamically combines

compiled code. In Proceedings of the 15th International Workshop on the Imple-

mentation of Functional Languages, IFL 2003, volume 3145 of Lecture Notes in

Computer Science, Scotland, 2004. Springer-Verlag.

P. Wadler. The Implementation of Functional Programming Languages, chapter Ef-

ficient compilation of pattern matching, pages 78–103. Prentice-Hall, 1987.

D. Wakeling. The dynamic compilation of lazy functional programs. Journal of

Functional Programming, 8(1):61–81, Jan. 1998a.

D. Wakeling. Mobile Haskell: Compiling lazy functional programs for the Java vir-

tual machine. In Proceedings of the 1998 Conference on Programming Languages,

Implementations, Logics and Programs (PLILP’98), volume 1490 of Lecture Notes

in Computer Science, pages 335–352, Sept. 1998b.

M. Wand. Continuation-based program transformation strategies. Journal of the

ACM, 27(1):164–180, 1980.

D. L. Weaver and T. Germond, editors. The SPARC Architecture Manual. PTR

Prentice Hall, 1994.

J. Whaley. Dynamic optimization through the use of automatic runtime specializa-

tion. Master’s thesis, Massachusetts Institute of Technology, May 1999.

P. Wickline, P. Lee, and F. Pfenning. Run-time code generation and modal-ml. In

K. D. Cooper, editor, Proceedings of the Conference on Programming Language

Design and Implementation (PLDI’98), pages 224–235, Montreal, Canada, June

1998. ACM Press.

N. Wirth and J. Gutknecht. Project Oberon – The Design of an Operating System

and Compiler. Addison-Wesley, 1998.

Q. Wu, M. Martonosi, D. W. Clark, V. J. Reddi, D. Connors, Y. Wu, J. Lee, and

D. Brooks. A dynamic compilation framework for controlling microprocessor energy

and performance. In MICRO 38: Proceedings of the 38th annual IEEE/ACM

International Symposium on Microarchitecture, pages 271–282, Washington, DC,

USA, 2005. IEEE Computer Society.

M. Zenger and M. Odersky. Extensible algebraic datatypes with defaults. In ICFP

’01: Proceedings of the sixth ACM SIGPLAN International Conference on Func-

tional Programming, pages 241–252, New York, NY, USA, 2001. ACM Press.

C. Zheng and C. Thompson. PA-RISC to IA-64: Transparent Execution, no Recom-

pilation. IEEE Computer, 33(3):47–52, Mar. 2000.