© © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all
other uses, in any current or future media, including reprinting/republishing this material for advertising or
promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse
of any copyrighted component of this work in other works.
Terms of Use
Mammeri, N., & Juurlink, B. (2018). VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile
and Embedded GPUs. In 2018 IEEE International Symposium on Workload Characterization (IISWC).
IEEE. https://doi.org/10.1109/iiswc.2018.8573477
Nadjib Mammeri, Ben Juurlink
VComputeBench: A Vulkan Benchmark
Suite for GPGPU on Mobile and Embedded
GPUs
Conference paper | Accepted manuscript (Postprint)
This version is available at https://doi.org/10.14279/depositonce-7346.2
VComputeBench: A Vulkan Benchmark Suite for
GPGPU on Mobile and Embedded GPUs
Nadjib Mammeri
Technische Universit¨
at Berlin
Ben Juurlink
Technische Universit¨
at Berlin
Abstract—GPUs have become immensely important computa-
tional units on embedded and mobile devices. However, GPGPU
developers are often not able to exploit the compute power
offered by GPUs on these devices mainly due to the lack of
support of traditional programming models such as CUDA and
OpenCL. The recent introduction of the Vulkan API provides
a new programming model that could be explored for GPGPU
computing on these devices, as it supports compute and promises
to be portable across different architectures.
In this paper we propose VComputeBench, a set of bench-
marks that help developers understand the differences in perfor-
mance and portability of Vulkan. We also evaluate the suitability
of Vulkan as an emerging cross-platform GPGPU framework by
conducting a thorough analysis of its performance compared to
CUDA and OpenCL on mobile as well as on desktop platforms.
Our experiments show that Vulkan provides better platform
support on mobile devices and can be regarded as a good cross-
platform GPGPU framework. It offers comparable performance
and with some low-level optimizations it can offer average
speedups of 1.53x and 1.66x compared to CUDA and OpenCL
respectively on desktop platforms and 1.59x average speedup
compared to OpenCL on mobile platforms. However, while
Vulkan’s low-level control can enhance performance, it requires
a significantly higher programming effort.
Index Terms—VComputeBench, Vulkan, SPIR-V, GPGPU,
CUDA, OpenCL, Rodinia, Mobile
I. INTRODUCTION
Graphics Processing Units (GPUs) have become a dominant
platform for parallel computing thanks to their massively
parallel architecture, energy efficiency and availability to the
masses. Several programming models have emerged enabling
developers to harness the massive compute power offered by
GPUs, while exploiting parallelism for different application
domains. This is often referred to as GPGPU (General Pur-
pose computing on the GPU) [1]. The most popular GPGPU
programming models are CUDA [2] and OpenCL [3]. CUDA
is a proprietary standard introduced by NVIDIA and targets
only NVIDIA specific hardware, while OpenCL is an open
standard maintained by the Khronos group and targets addi-
tional hardware devices including FPGAs, CPUs and DSPs. In
this work we focus on the two most predominant programming
models CUDA and OpenCL, but it is worth mentioning other
frameworks such as OpenMP [4] and OpenACC [5]. OpenMP
mainly targets shared memory multiprocessors and recently
OpenMP 4.5 introduced the target directive enabling support
for GPUs and other devices. OpenACC is mainly designed to
program accelerators in heterogeneous systems with OpenMP-
like directives.
To add to this mix of programming models, the Khronos
group recently released the Vulkan API [6] along with SPIR-
V [7]. Vulkan is a low level API with an abstraction closer
to the behavior of the actual hardware. It promises cross-
platform support, high-efficiency and better performance of
GPU applications. Unlike CUDA, which is only supported on
NVIDIA GPUs, and OpenCL, which has no official support on
mobile GPUs, Vulkan is supported by all major GPU vendors1
and considers non-desktop GPUs as first class citizens. Vulkan
is officially supported on Android 7.0 [8] and on the new
Tizen OS 3.0 [9] covering a full spectrum of mobile devices
from phones and wearables to TVs and in-vehicle infotainment
systems. This good platform support and the fact that it also
supports compute, motivated us to examine it from the GPGPU
perspective even though it was mainly designed to improve
graphics performance. In this paper, we introduce Vulkan as a
cross-platform GPGPU route that could open new perspectives
for pertinent GPGPU computing on mobile devices and can
be explored along with other more established frameworks
on desktop architectures. However, there are some important
questions yet to be answered:
•What kind of performance can we get out of Vulkan?
•Is there a viable study comparing Vulkan compute to
established frameworks such as CUDA and OpenCL?
•If there are any performance gains, are these portable
across different GPU architectures?
•Can Vulkan enable pertinent GPGPU computing on mo-
bile and embedded GPUs?
Selecting which GPGPU framework to choose is a critical
task for developers. Differences in performance, portability,
programmability and platform support are all very important
factors that need to be considered. Benchmarks play an
important role in exposing these kind of differences between
hardware architectures, compilers and more importantly across
competing programming models. There are several bench-
marks available to evaluate CUDA and OpenCL [10], [11]
[12] [13] but currently none for Vulkan. To fill this gap
and enable our study we propose VComputeBench, a set of
Vulkan compute benchmarks that help developers understand
1Supported by major desktop GPU vendors: AMD, NVIDIA and Intel and
mobile GPU vendors: Qualcomm, ARM, Imagination and VeriSilicon
the differences in performance and portability of Vulkan
and provide guidance to GPU architects in the design and
optimization of their drivers and runtime. VComputeBench
was developed by extending the popular Rodinia benchmark
suite [10], covering a diverse range of application domains
with different computation patterns. The reason for selecting
the Rodinia suite is that it provides OpenCL and CUDA im-
plementations and with our VComputeBench implementations
we can make fair comparisons and adequately evaluate Vulkan
against other programming models.
In essence, the main contributions of this paper are:
•Illustrate the viability of Vulkan as a GPGPU framework
notably on mobile devices.
•Propose a set of Vulkan compute benchmarks named
VComputeBench and ported them onto mobile platforms.
•Perform a thorough analysis of performance, comparing
Vulkan to CUDA and OpenCL on desktop and mobile
GPUs and highlight a set of Vulkan specific optimization
techniques.
II. RELATED WORK
In recent years, GPGPU frameworks have received a great
amount of attention from the research community. Although,
several works studied and compared different programming
models [14] [15] [16] [17] [18] [19] [20] [21], none of them
studied Vulkan. To the best of our knowledge, our work
is the first to investigate Vulkan from the compute not the
graphics perspective and propose it as a viable cross-platform
GPGPU programming model. One of the earliest and well
cited works is those of Fang et al. [15] and Karimi et al.
[14]. The authors compare CUDA to OpenCL in terms of
performance on old desktop GPU architectures. Our work, on
the other hand, was carried out on recent architectures and
analyses performance on desktop as well as mobile GPUs.
Du et al. [17] studies OpenCL performance portability and
Wang et al. [21] examines OpenCL on FPGAs. The authors of
these papers demonstrate that performance is not necessarily
portable across architectures. Their findings instigated us to
study and port our benchmarks onto mobile GPUs in order
to evaluate Vulkan’s portability and examine its performance
implications.
Such research works heavily rely on benchmarks for their
evaluations. Several GPGPU benchmarks were proposed by
researchers such as Rodinia [10], Parboil [11] SHOC [12] and
the recent Hetero-Mark [13]. Most of these benchmark suites
include CUDA, OpenCL or OpenMP implementations but
none include Vulkan implementations. This can be a limitation
especially for researchers and developers wanting to target this
new emerging programming model. In this work, we aim to
enrich the GPGPU community with such Vulkan benchmarks
by extending the popular Rodinia suite, enabling researchers
and developers to evaluate Vulkan along with other GPGPU
programming models. Likewise, most of these benchmark
suites mainly target desktop GPUs or multicore systems with
their CUDA and OpenCL implementations. Our benchmarks,
on the other hand, target both mobile and desktop GPUs. We
chose Vulkan because of its cross-platform capabilities and
good support on mobile devices.
III. VULKAN A COMPUTE PERSPECTIVE
In this section we present an overview of the Vulkan
programming model illustrating why it is a promising GPGPU
framework especially for mobile and embedded GPUs.
A. Vulkan Overview
Vulkan is often referred to as the next generation graphics
and compute API for modern GPUs. It is an open standard that
aims to address the inefficiencies of traditional APIs such as
OpenGL, which were designed for single-core processors and
lag to map well to modern hardware [22]. Vulkan on the other
hand, was designed from the ground-up with multi-threading
support in mind. Better parallelization can be achieved by
asynchronously generating work across multiple threads feed-
ing the GPU in an efficient manner. This is attained in Vulkan
by having no global state, no synchronizations in the driver
and separating work generation from work submission. All
state is localized in command buffers, which can be generated
on multiple threads and only start executing on the GPU after
submission.
The other key characteristic of Vulkan is that it provides a
much lower-level fine-grained control over the GPU enabling
developers to maximize performance across many platforms.
It achieves this by being explicit in nature rather than re-
lying on hidden heuristics in the driver. Operations such as
resource tracking, synchronization, memory allocation, and
work submission are all pushed into application space resulting
in higher predictability and better control of when and where
work happens. Likewise, unnecessary background tasks such
as error checking, hazard tracking, state validation and shader
compilation are delegated to the tooling layers, which are
present during development and removed at runtime, resulting
in low driver overhead and less CPU usage [23].
B. The Programming Model
Vulkan can be viewed as a pipeline with some pro-
grammable stages that are invoked by a set of operations. To
the programmer, it is simply an API with a set of routines
allowing for the specification of shaders or kernels, state
controlling aspects as well as data used by those kernels. From
the compute perspective though, the pipeline has only one
programmable stage represented in the kernel program to be
executed [6].
a) Execution Model: A Vulkan-capable system exposes
one or more devices, each of theses physical devices ex-
poses one or more queues. These queues are partitioned
into queue families and can process work asynchronously
to one another. Each queue family supports a number of
functionalities and may contain multiple queues with similar
characteristics. There are four types of queue functionalities
defined in Vulkan: graphics, compute, transfer, and sparse
memory management. The reason for having queue families is
that queues within a single family are considered compatible
with one another, and work produced for one queue family
can be executed on any queue within that family.
Aqueue is considered as the interface between the appli-
cation and the execution engines of a device. Commands for
these execution engines are recorded into command buffers
ahead of execution time. Once recorded, a command buffer
can be cached and submitted to a queue for execution as many
times as required. Command buffer construction is expensive
and the application may employ multiple threads to construct
multiple command buffers in parallel. These command buffers
are then submitted to queues for execution in a number of
batches. Once submitted to a queue, the commands within a
command buffer begin and complete execution without further
application intervention. The order in which these commands
are executed is dependent on a number of implicit and explicit
ordering constraints.
In addition, command buffers submitted to different queues
may execute in parallel or even out of order with respect to one
another. Command buffers submitted to a single queue though
respect submission order. Host execution is also asynchronous
to command buffer execution on the device. Control may
return to the application as soon as the command buffer is
submitted and the application should take responsibility for
any synchronizations between different queues as well as
between the device and host.
b) Compute Model: In Vulkan, compute workloads are
initiated by recording dispatching commands vkCmdDis-
patch* in a command buffer. Once a command buffer is
submitted to a queue, execution starts according to the cur-
rently bound compute pipeline. Compute pipelines consist
of a single compute shader stage, describing the kernel to
be executed and a pipeline layout, describing the input and
output resources to that kernel. Dispatching commands take
three input parameters: groupCountX,groupCountY and
groupCountZ defining the total number of workgroups or the
so called global workgroup size in the X, Y and Z directions
respectively. A workgroup is the smallest amount of compute
operations that an application can execute. Within a single
workgroup, there may be many workitems or compute shader
invocations. This is called the local workgroup size and is
defined by the compute shader itself using SPIR-V built-in
decorations [7].
c) SPIR-V: All shaders and compute kernels in Vulkan
are defined using the Standard Portable Intermediate Represen-
tation (SPIR-V), which is a platform-independent intermediate
language for describing graphical shaders and compute kernels
[24]. SPIR-V is a self-contained binary format. Logically, it
is a header and a linear stream of instructions and physically
it is just a stream of 32-bit words, encoding a collection of
annotations and decorations as well as functions, which in turn
encode control-flow graphs (CFG) of blocks. Variables are
accessed using load store instructions and any intermediate
results bypassing the load store are represented in a single
static-assignment form (SSA). Hierarchical type information
of data objects is preserved to not lose information needed for
further optimizations on the target device.
C. Why Vulkan for Mobile and Embedded GPUs?
Considering that Vulkan was mainly designed to achieve
higher graphics performance, we can make several interesting
observations: (i) its enhancements and low-level nature can
also be utilized to achieve higher performance for GPGPU
applications. (ii) Vulkan’s main focus on graphics allowed
it to have better support among GPU vendors than other
open frameworks such as OpenCL, which for instance is
not fully supported by NVIDIA because it considered as a
competitor to its propriety CUDA framework 2.(iii) Vulkan
is considered as the first framework to have official support
on mobile platforms [8] [9] and the API was designed with
mobile GPU features in mind such as tiled rendering. Hence,
it has the potential of being the framework of choice for
GPGPU on mobile devices, which is the quest of many recent
research works [25] [26] [27]. This leads us to our final
observation: (iv) that Vulkan can be the appropriate framework
for achieving true cross-platform GPGPU without sacrificing
on performance.
IV. BENCHMARKS
Benchmarks play an important role in exposing differ-
ences in performance, portability and programmability across
competing programming models. Since Vulkan was recently
released and its main focus is on graphics not GPGPU, there
are currently few graphics but no compute benchmarks that can
be of use to our study. In order to enable our work as well as
to enrich the research community with such benchmarks, we
extended the popular Rodinia benchmark suite [10] by devel-
oping Vulkan equivalents of most of its workloads, referred to
as VComputeBench, and made them publicly available to the
wider GPGPU community.
Before describing our VComputeBench benchmarks, we
first present one of the microbenchmarks that we used in our
study to better illustrate this new programming model and give
an overview of what is required to write a Vulkan compute
application.
A. Vector Addition Microbenchmark
This microbenchmark is a simple application adding two
vectors Xand Yof size nsaving the output in vector Z. The
kernel code, or the compute shader in Vulkan terminology,
is a SPIR-V binary that was compiled offline from a 10-line
GLSL source implementing:
Z[i] = X[i] + Y[i]∀i∈[0,1, . . . , n]
The index space is one dimensional and iis defined using
the SPIR-V decoration GlobalInvocationId, which returns the
global ID of the workitem executing the kernel. The vectors
X, Y and Zare bounded in to the kernel as storage buffers.
The host code, on the other hand, is more complicated.
Listing 1 shows a pseudo-code listing of the host program
highlighting only the important API calls.
2Current OpenCL version is 2.2 but NVIDIA only supports version 1.2
int main ()
std::size˙t N=1000000;// Number of elements in a vector
int numWorkGroups =N/256;// Workgroup size is 256
// Enumerate devices then create instance, queues and device
VkInstance instance; VkInstanceCreateInfo instanceInfo =–˝ ...
vkCreateInstance(&instanceInfo, nullptr,&instance);
vkEnumeratePhysicalDevices(instance, ..., &gpuList);
vkGetPhysicalDeviceQueueFamilyProperties(gpuList[0], ...);
...
VkDeviceQueueCreateInfo queueCreateInfo–˝ ...
VkDevice device; VkDeviceCreateInfo deviceInfo =–˝ ...
vkCreateDevice(gpuList[0], &deviceInfo, ..., &device);
VkQueue computeQueue;
vkGetDeviceQueue(device, queueFamilyIndex, 0,&computeQueue);
...
// Create buffer then bind the buffer to the allocated memory
VkBuffer bufferX; VkBufferCreateInfo bufferCreateInfo–˝ ...
bufCreateInfo.size =N*sizeof(float);
bufCreateInfo.usage =
VK˙BUFFER˙USAGE˙STORAGE˙BUFFER˙BIT —,→
VK˙BUFFER˙USAGE˙TRANSFER˙DST˙BIT;
vkCreateBuffer(device, &bufferCreateInfo, nullptr,&bufferX);
VkMemoryRequirements xBuffMemReqs;
vkGetBufferMemoryRequirements(device, bufferX, &xBuffMemReqs);
int xMemIndex =findMemType(xBuffMemReqs.memoryTypeBits,
VK˙MEMORY˙PROPERTY˙DEVICE˙LOCAL˙BIT);
VkDeviceMemory memory; VkMemoryAllocateInfo memAllocInfo–˝ ...
memAllocInfo.allocationSize =xBuffMemReqs.size;
memAllocInfo.memoryTypeIndex =xMemIndex;
vkAllocateMemory(device, &memAllocInfo, nullptr,&memory);
vkBindBufferMemory(device, bufferX, memory, 0);
...
// Create the compute shader and the compute pipeline
VkShaderModule module; VkShaderModuleCreateInfo
shadCreatInfo–˝ ...,→
shadCreatInfo.pCode =readSpirvBinary(”vectorAdd.spv”);
vkCreateShaderModule(device, &shadCreatInfo, NULL,&module);
VkPipelineShaderStageCreateInfo shaderStageCreateInfo–˝ ...
shaderStageCreateInfo.module =module;
shaderStageCreateInfo.stage =
VK˙SHADER˙STAGE˙COMPUTE˙BIT;,→
VkPipelineLayout pipelineLayout;
...
vkCreatePipelineLayout(device, ..., &pipelineLayout);
VkPipeline ppline; VkComputePipelineCreateInfo ppCreateInfo–˝ ...
ppCreateInfo.stage =shaderStageCreateInfo;
ppCreateInfo.layout =pipelineLayout;
vkCreateComputePipelines(device, &ppCreateInfo, &ppline ...);
...
// Bind buffers to compute pipeline
VkWriteDescriptorSet writeDescripSet–˝ ...
writeDescripSet.descriptorType =
VK˙DESCRIPTOR˙TYPE˙STORAGE˙BUFFER;,→
writeDescripSet.dstBinding =0;// Same as SPIRV Binding
decoration,→
writeDescripSet.pBufferInfo =xBufferDescriptor;
vkUpdateDescriptorSets(device, 1,&writeDescripSet, 0,NULL);
...
// Create command pool and allocate a command buffer
VkCommandPool cmdPool; VkCommandPoolCreateInfo
cmdPoolInfo–˝ ...,→
vkCreateCommandPool(device, &cmdPoolInfo, nullptr,&cmdPool);
VkCommandBuffer cmdBuffer; VkCommandBufferAllocateInfo
allcInfo–˝..,→
allcInfo.commandPool =cmdPool;
vkAllocateCommandBuffers(device, &allcInfo, &cmdBuffer);
...
// Bind the pipeline and record commands to the command buffer
vkCmdBindPipeline(cmdBuffer,
VK˙PIPELINE˙BIND˙POINT˙COMPUTE,ppline);
vkCmdDispatch(commandBuffer, numWorkGroups, 1,1);
vkEndCommandBuffer(commandBuffer);
...
// Submit to queue
VkSubmitInfo submitInfo –VK˙STRUCTURE˙TYPE˙SUBMIT˙INFO˝;
submitInfo.commandBufferCount =1;
submitInfo.pCommandBuffers = &cmdBuffer;
vkQueueSubmit(computeQueue, 1,&submitInfo ...);
... // Clean up and free all resources
Listing 1: VectorAdd host code using low-level Vulkan API
TABLE I: VComputeBench benchmarks
Name Application Dwarf Domain
backprop Back Propagation Unstructured Grid Deep Learning
bfs Breadth-First Search Graph Traversal Graph Theory
cfd CFD Solver Unstructured Grid Fluid Dynamics
gaussian Gaussian Elimination Dense Linear Algebra Linear Algebra
hotspot Hotspot Simulation Structured Grid Physics
lud LU Decomposition Dense Linear Algebra Linear Algebra
nn K-Nearest Neighbors Dense Linear Algebra Data Mining
nw Needleman-Wunsch Dynamic Programming Bioinformatics
pathfinder Path Finder Dynamic Programming Grid Traversal
Vulkan applications are linked against a common library
referred to as the loader, which gets initialized at the time
of VkInstance creation. The loader loads any enabled tooling
layers and initializes the low-level driver provided by the GPU
vendor. Accordingly, the example program depicted in Listing
1, starts initializing Vulkan by creating a VkInstance and
querying the system for any available devices with all their
properties including all available queue families.
Then a logical VkDevice is created and a queue is acquired.
The next step is to create storage buffers for the vectors.
VkBuffer objects are created, the system is queried for suitable
heaps according to the buffer memory requirements, then
memory is allocated on that heap and buffers are bounded
to their allocated memory. Next, a compute VkPipeline is
created by specifying the kernel’s SPIR-V binary as its shader
stage and creating a VkPipelineLayout describing all the re-
sources used by that kernel. Then, the buffers are bound to the
pipeline by specifying the kernel’s binding value of each buffer
as the destination binding of the write descriptor set. This is
similar to specifying the kernel arguments in OpenCL using
clSetKernelArg. Now that the compute pipeline is set up, the
kernel can be launched by creating a VkCommandBuffer,
binding the pipeline to that command buffer and recording
the dispatch command with the number of workgroups to
be launched. The command buffer is then submitted to the
acquired queue for execution. Finally, the application waits for
execution to finish then cleans up and frees all used resources
and objects.
B. VComputeBench Benchmarks
The Rodinia suite includes both CUDA and OpenCL versions
for each of its benchmarks. While developing their Vulkan
equivalents, we made sure not to introduce any algorithmic
changes to the kernel codes. In this way, we will be able to
make fair comparisons in the sense that any differences in
performance can be related to the programming model and
not to the algorithm. By using the latest Rodinia version 3.1,
we assume that we are already starting from a decent baseline
since these benchmarks were optimized many times in several
research works [28] [29].
The kernels were developed in GLSL and their correspond-
ing SPIR-V binaries were automatically generated using the
glslangvalidator compiler [30] provided by Khronos. We have
chosen GLSL as our kernel language because it has the best
support. We provide both the SPIR-V binaries and the GLSL
sources as part of our VComputeBench benchmarks. The host
code translation on the other hand, was challenging because
the Rodinia source code was collected from different sources
resulting in a hard-to-read code with different styles, very little
comments and hardly any documentation. We made sure this
is not the case with our benchmarks, which we implemented
using C++11 features with unified style and appropriate com-
ments. As far as functional testing is concerned, we validated
our developed VCompute benchmarks against both CUDA and
OpenCL outputs for different input sets.
Our VComputeBench benchmarks cover a diverse range of
application domains with different computation patterns. The
benchmarks were selected so that they also cover different
sets of dwarves [31]. Table I shows a list of the developed
benchmarks including their dwarf and application domains.
Here, we just include brief descriptions of these benchmarks,
but full descriptions and characterizations of these workloads
can be found at [10]:
Back Propagation (bp):is an algorithm that is commonly
used in training deep neural networks to adjust the network’s
weights. It is composed of two phases a forward pass, where
the activations are propagated from the input to the output
layer, and a backward pass, where the error is propagated
backwards from the output to the input layer to adjust the
weights and bias values.
Breadth-First Search (bfs):is a graph algorithm that traverses
or searches a graph of connected nodes, which could include
millions of nodes. It starts at a root node and explores neigh-
boring nodes first, before moving to the next level neighbors.
Computational Fluid Dynamics (cfd):is a fluid dynamics
solver of three-dimensional Euler equations representing an
unstructured grid, finite volume of compressible flow.
Gaussian Elimination (gaussian):is a linear algebra al-
gorithm for solving a set of linear equations. It works by
performing a sequence of row reduction operations on a matrix
until the lower left-hand corner of the matrix is filled with
zeros, as much as possible.
Hotspot Simulation (hotspot):is a thermal simulation tool
that tries to estimate processor temperature based on an
architectural floor plan and simulated power measurements.
LU Decomposition (lud):is an a linear algebra algorithm that
tries to calculate the solution of a set of linear equations. It
works by decomposing a matrix into a product of a lower
triangular matrix and upper triangular matrix.
K-Nearest Neighbors (nn):is a dense linear algebra algorithm
used to find the closest K neighbors in a set of reference data
points in an n-dimensional space to query point q. The data
in our case is latitude and longitude data and the calculated
distances are euclidean distances.
Needleman-Wunsch (nw):is a dynamic programming algo-
rithm that is used for DNA sequence alignment. The algorithm
tries to fill a matrix of potential pairs of DNA sequences with
scores, representing the value of the maximum weighted path
ending at that cell. Then a trace-back process is used to search
for an optimal alignment.
Pathfinder (pfinder):is another dynamic programming algo-
rithm that computes the path on a 2-dimensional grid with the
smallest total cost. The grid is represented as a matrix, and
the path is computed in blocks of rows.
C. Vulkan-specific optimizations
As shown in in the example code in Listing 1, Vulkan uses
completely different abstractions from CUDA and OpenCL.
Effectively, in Vulkan, the programmer is not dealing with
kernels, kernel arguments and kernel launches but they are
dealing with low level command buffers, recording commands
in these buffers such as binding compute pipelines, setting
descriptor sets and binding buffers to descriptor sets. One of
key synchronization mechanisms of Vulkan that we used when
writing our benchmarks and produced performance improve-
ments, as shown in section V-A2, is memory barriers. Memory
barrier commands can be recorded in a command buffer,
ensuring that commands recorded prior to it are executed
before the commands recorded after it. This allowed us to
reduce the kernel launch overhead compared to CUDA and
OpenCL implementations, resulting in better performance as
shown in sections V-A2 and V-B2.
Most of our benchmarks use iterative algorithms. The CUDA
and OpenCL implementations invoke the kernel multiple times
for every iteration, whereas in our Vulkan implementations
we record the work of all iterations in one command buffer
and synchronize using memory barriers between iterations,
instead of naively creating a command buffer for every it-
eration. Effectively, we incur only a single communication
overhead when the command buffer is submitted compared to
the CUDA and OpenCL implementations which incur kernel
launch overheads on every iteration.
One can argue that the CUDA and OpenCL implementations
can be changed to enqueue iterations ahead of time without
blocking. The problem with this solution is that it does not
honor the data dependencies between iterations. Subsequent
iterations depend on the data generated in previous iterations.
Both CUDA and OpenCL do not offer any inter-workgroup
synchronization mechanism that can be used to honor these
dependency requirements. This is a well known limitation of
these programming models and the safest portable solution to
achieve such synchronization is to use what’s called multi-
kernel method. In this method the application is split into
multiple kernels. Whenever a inter-workgroup synchronization
is required, a transition from one kernel to another is made
or in the case of having only one kernel this kernel is
launched again. The transfer of control from the GPU to the
CPU implicitly provides the required barrier semantics. The
Rodinia CUDA and OpenCL implementations use this method
to achieve such inter-workgroup synchronization and satisfy
the data dependencies between iterations.
D. Porting to mobile devices
One of the major strengths of Vulkan is its portability. How-
ever, performance improvements are not necessarily portable
TABLE II: Desktop GPUs Experimental Setup
NVIDIA GTX105Ti AMD RX560
Operating System Ubuntu 16.04 64-bit
CPU Intel(R) Core(TM) i5-2500K CPU 3.30GHz x4
Memory CPU Memory=16 GB, GPU Memory=4GB
Driver Linux Display Driver 381.22 AMDGPU-Pro Driver 17.10
OpenCL OpenCL 1.2 OpenCL 2.0
CUDA CUDA 8.0 -
Vulkan API Version 1.0.42 API Version 1.0.37
and often developers have to adapt and re-write their applica-
tions with respect to the targeted architecture. In fact, it has
been shown that performance is not portable when running
OpenCL applications targeting GPUs on CPU or FPGA like
architectures [17] [21]. To address this concern and assess
whether Vulkan is a good candidate for GPGPU computing
on mobile devices, we ported our benchmarks plus their
corresponding Rodinia OpenCL implementations onto mobile
GPUs. We chose Android 7.0 as our OS because it supports
Vulkan out of the box, allowing us to target many mobile
GPUs. We cross-compiled all of our benchmarks for x86, x86-
64, armeabi-v7a, arm64-v8a binary targets and developed an
Android application that bundles these benchmarks with their
required data sets. We set a requirement when developing the
VComputeBench Android application of not requiring root
access so that it can be released on the Android application
store allowing millions of users to check and compare the
performance of the GPUs and Vulkan implementations inside
their devices. This was challenging and we had to resort
to bundling the benchmarks as libraries in order to satisfy
Android security restrictions on binary executables.
V. EXPERIMENTAL RESULTS
In this section we report the results of our empirical evalua-
tion of Vulkan performed on several GPU architectures. We
use two types of benchmarks self-written micro benchmarks
to highlight and assess specific attributes and our VCom-
puteBench plus Rodinia benchmarks to assess performance us-
ing representative real world applications. We compare Vulkan
results to those of CUDA and OpenCL on two desktop GPUs
and two mobile GPU platforms. For consistency, we measure
the execution times on the CPU using C++11 std::chrono. To
minimize measurement errors, we execute several times and
report the average of the obtained execution times.
A. Evaluations on Desktop Platforms
We chose two recent desktop GPUs employing latest and
advanced GPU architectures: NVIDIA GTX1050Ti employing
NVIDIA’s Pascal architecture and AMD RX560 employing
AMD’s Polaris architecture. Table II shows the configuration
details of these platforms.
1) Memory Bandwidth Evaluation: To evaluate how the pro-
gramming model affects memory bandwidth and asses whether
we can achieve high memory bandwidth when using Vulkan,
we developed a strided memory access micro-benchmark in
1 4 8 12 16 20 24 28 32
Stride (4 bytes per element)
4
8
16
32
64
Bandwidth (GB/s)
Vulkan
CUDA
(a) NVIDIA GTX1050Ti
1 4 8 12 16 20 24 28 32
Stride (4 bytes per element)
4
8
16
32
64
Bandwidth (GB/s)
Vulkan
OpenCL
(b) AMD RX560
Fig. 1: Vulkan memory bandwidth vs CUDA and OpenCL
Vulkan, CUDA and OpenCL. We vary the stride when reading
array elements and measure the achieved bandwidth. For
reference, both of our platforms use GDDR5 memory with an
effective memory clock of 7GHz and 128 bit memory interface
width, resulting in theoretical bandwidth of 112 GB/s, which
can be calculated using:
BWpeak =Freq ·(BusW idth/8) ·10−9
The obtained results are shown in Figure 1. On both platforms,
Vulkan provides comparable performance to CUDA and
OpenCL for strides less than 64 bytes and slightly better
performance for strides larger than 64 bytes. As expected,
unit stride provides maximum achieved bandwidth of 84%
and 79.6% of the peak bandwidth for CUDA and Vulkan
respectively on the GTX1050. Likewise, on the RX560,
Vulkan achieves 71.6% of the peak bandwidth compared to
71.5% for OpenCL. Overall, this test shows that high memory
bandwidth can be attained using Vulkan and data layout in
memory is more important than the used programming model.
2) Benchmarks Evaluations: Figure 2 shows the speedup
results of the selected benchmarks comparing Vulkan, CUDA
and OpenCL for different workloads. We chose OpenCL as
our baseline for speedup calculations because it is supported
on both platforms. To make a fair comparison, we only report
kernel execution times not total benchmark times because a
high overhead is generally exhibited by OpenCL JIT compila-
tion and explicit context management resulting in longer total
times [32] [17].
Overall, for most benchmarks Vulkan provides better perfor-
mance than CUDA and OpenCL resulting in geometric mean
speedups of 1.53xwith respect to CUDA on the GTX1050
and 1.26xwith respect to OpenCL on the RX560. However,
4K
64K
1M
4K
64K
256K
97K
193K
232K
208
1024
2048
512-08
512-16
512-32
256
512
2048
256K
8M
16M
4K
8K
16K
10K
50K
100K
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
Speedup
bfs backprop cfd gaussian hotspot lud nn nw pathfinder
OpenCL Vulkan CUDA
(a) NVIDIA GTX1050Ti
4K
64K
1M
4K
64K
256K
97K
193K
232K
208
1024
2048
512-08
512-16
512-32
256
512
2048
256K
8M
16M
4K
8K
16K
10K
50K
100K
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
Speedup
bfs backprop cfd gaussian hotspot lud nn nw pathfinder
OpenCL Vulkan
(b) AMD RX560
Fig. 2: Vulkan speedup vs CUDA and OpenCL for the Rodinia benchmarks
since the benchmarks exhibit different computation patterns,
there are variations on their individual results.
The best speedups are attained with pathfinder,hotspot,lud
and gaussian benchmarks. The reason for this is that these
benchmarks use iterative algorithms, invoking the kernel mul-
tiple times. Subsequent invocations utilize data generated in
previous iterations, requiring control to return back to the
CPU and incurring kernel launch overhead on every iteration.
Vulkan enable us to eliminate these kernel launches and com-
munication overheads altogether by recording the work of all
iterations in one command buffer and adding memory barriers
between iterations to satisfy the dependency requirements.
Effectively, we incur a single communication overhead when
the command buffer is submitted. Our results commensurate
with the kernel launch overhead findings of [15]. Figure 2
also shows that, for most of these workloads, the speedup
increases as we increase the input size. Larger input means
more iterations and less overhead compared to CUDA and
OpenCL, thus better Vulkan performance.
An interesting result is that of cfd. Although it uses an iterative
algorithm, we do not get similar speedups. This benchmark
has 3 compute intensive kernels and for every iteration we
have to bind 3 different compute pipelines, representing these
kernels, to our single command buffer. This overhead of
binding compute pipelines plus the longer kernel computation
times make the launch overhead savings not that significant.
It also does not scale well with input size because the number
of iterations is fixed and not dependent on input size. Vulkan
cfd achieves 1.38xspeedup vs CUDA and 1.04xspeedup vs
OpenCL averaged on both platforms.
On the contrary, we get a slowdown for bfs on both platforms.
To investigate this, we disassembled the Vulkan and OpenCL
kernels using the AMD CodeXL tool [33]. We discovered that
the OpenCL generated ISA code is optimized to use work-
group local memory compared to the Vulkan generated ISA,
which uses plain buffer loads from global memory. This opti-
mization of memory accesses significantly affects performance
because bfs is memory-bound [34]; it predominately performs
loads and stores with very few ALU operations. Although we
use the same driver, the generated ISA is different for Vulkan.
We can therefore deduce that the Vulkan SPIR-V compiler
inside the driver is not as mature as the OpenCL one. This
is expected as Vulkan was recently released and support will
improve in the future.
The remaining benchmarks backprop,nn and nw do not
involve any dependencies between kernel invocations. The
Vulkan implementations record these kernels onto different
command buffers and submits them simultaneously to the GPU
resulting in pretty much similar performance to CUDA and
OpenCL with slight variations between the platforms.
TABLE III: Mobile GPUs Experimental Setup
Qualcomm Snapdragon 625 Google Nexus Player
Operating System Andorid 7.0 Andorid 7.1
CPU ARM Cortex A53 x8 Intel Atom(TM) x4
GPU Adreno 506 Rogue G6430
OpenCL OpenCL 2.0 OpenCL 1.2
Vulkan API Version 1.0.20 API Version 1.0.30
B. Evaluations on Mobile Platforms
We used two platforms: Google’s Nexus Player and
Qualcomm’s Snapdragon 625 employing the Imagination
G6430 and the Adreno 506 GPUs respectively. The platforms
were chosen because both GPU vendors provide unofficial
OpenCL support3. Table III summarizes the configuration
details of these two platforms.
1) Memory Bandwidth Evaluation: We run the same strided
memory access micro benchmark, described in section V-A1,
on our selected mobile platforms. The obtained results are
shown in Figure 3. On the Nexus platform OpenCL achieves
a bandwidth of 2.85 GB/s at unit stride, whereas Vulkan
only achieves 2.69 GB/s, resulting in about 89% and 84%
of peak bandwidth respectively. Then for strides larger than
4 bytes, Vulkan surprisingly performs slightly better than
OpenCL. However, on the Snapdragon platform, Vulkan
performs worst than OpenCL at strides less than 16 bytes
but we get pretty much the same bandwidth for strides above
16 bytes. We suspect that the Snapdragon driver doesn’t
properly support Vulkan’s push constants, that we use to set
the stride constant inside the command buffer when varying
the stride number, and treating them as normal storage
buffers instead. This can result in worst performance because
binding these buffers is required for every iteration. For
larger strides this effect becomes negligible due to the fact
that the exhibited execution times are longer. Overall, the
main observation we can make here is that on these mobile
platforms, Vulkan can provide comparable performance to
OpenCL but with slight degradation and again data layout in
memory is more important than the used programming model.
2) Benchmarks Evaluations: Due to memory size restrictions
on these platforms, we had to choose smaller workload input
sizes. cfd could not fit on both platforms as it uses larger data
sets describing flux flow data. Also the backprop OpenCL
and Vulkan implementations failed to run on Nexus and on
Snapdragon only the lud OpenCL failed because of driver
issues. The results are shown in Figure 4.
Figure 4 shows that Vulkan does well on Nexus compared
to Snapdragon, achieving geometric mean speedups of 1.59x
on Nexus and 0.83xon Snapdragon. On the Nexus plat-
form, Vulkan shows speedups across most benchmarks except
hotspot, which pretty much commensurate with the results
obtained on desktop GPUs. The best speedups are again at-
3The OpenCL library on the Nexus player is not even called li-
bOpenCL.so. It is provided as libpvrcpt.so.
10 2 4 6 8 10 12 14 16
Stride (4 bytes per element)
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
Bandwidth (GB/s)
Vulkan
OpenCL
(a) Nexus Player
10 2 4 6 8 10 12 14 16
Stride (4 bytes per element)
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
Bandwidth (GB/s)
Vulkan
OpenCL
(b) Snapdragon 625
Fig. 3: Vulkan memory bandwidth vs CUDA and OpenCL
tained with pathfinder,gaussian and lud benchmarks because
of minimizing the kernel launch overhead. On the snapdragon
platform, further investigations are required to explain the
exhibited slowdown. However, since all benchmarks exhibited
slowdowns except pathfinder, we think this can be related
to the immaturity of the Vulkan drivers on this platform
compared to the OpenCL ones. We expect this will improve
in the future as better Vulkan support is rolled out.
Overall these results are very interesting in the sense that they
demonstrate that performance portability is not necessarily
guaranteed, even though the programming model is portable.
We can conclude that Vulkan performance improvements can
be portable to mobile GPUs as long as there is good driver
support from vendors.
VI. DISCUSSION
A. Vulkan Limitations
As you may have observed from the example application
described in Listing 1, the key limitation of Vulkan is its
verbosity. Vulkan’s low-level nature makes it very verbose
with a high programming effort. For example, to create a
simple buffer one has to:
•Create a buffer object
•Get the memory requirements for that object
•Decide which memory heap to use
•Allocate memory on the chosen heap
•Bind the buffer object to the memory allocation
This simple buffer creation requires about 40 lines of code
in Vulkan compared to just one line in CUDA or OpenCL,
where cudaMalloc and clCreateBuffer are used respectively.
In addition, Vulkan’s principle of explicit control pushes a lot
of responsibility onto the programmer. The application layer
is proportionally more complex. Programmers have to deal
4k
16k
64K
256K
208
416
128-8
128-16
64
256
256K
8M
1K
2K
512
1024
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
Speedup
bfs bprop gauss hotspot lud nn nw pfinder
OpenCL Vulkan
(a) Nexus: Imagination PowerVR G6430 GPU
4k
16k
64K
256K
208
416
128-8
128-16
64
256
256K
8M
1K
2K
512
1024
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
Speedup
bfs bprop gauss hotspot lud nn nw pfinder
OpenCL Vulkan
(b) Snapdragon: Qualcomm Adreno 506 GPU
Fig. 4: Vulkan speedup vs OpenCL on mobile devices
with issues such as memory allocation, resources tracking,
object creation and destruction and so on. Experience shows
that programming in such style can be error-prone and less
productive. Vulkan’s verbosity and the additional responsi-
bility it imposes on the programmer introduce issues with
productivity and hence can be a burden to adopting it as a
GPGPU programming model.
B. Recommended Vulkan Optimizations
Vulkan introduce some low-level controls that can be utilized
for extra performance. As a takeaway from our experience
writing the VComputeBench benchmarks, we recommend the
following for better Vulkan performance :
•For iterative algorithms, use one single command buffer
and synchronize using memory barriers. This proved to
be effective in our evaluations.
•For parameter changes of small data types, it is better
to use PushConstants rather than binding a whole pa-
rameters buffer. Push constants are specific to a pipeline.
For instance on GTX1050 and RX560 you get maximum
sizes of 256B and 128B respectively. On both Nexus and
Snapdragon platforms you get a maximum of 128 bytes.
•Try to minimize going back to the CPU for control and
leverage Vulkan’s synchronization primitives to stay as
much as possible on the GPU.
•For large memory transfers use transfer queues. These
specific transfer queues should be used for large copy
commands as they are usually tied to DMAs inside the
hardware.
•For better workload balancing, make use of multiple
compute queues whenever possible. This will give the
GPU’s scheduler more room for manoeuvre resulting in
better utilization.
VII. CONCLUSION
This paper presented Vulkan as new programming model
for cross-platform GPGPU computing notably on mobile and
embedded GPUs. We developed a set of compute benchmarks
by extending the Rodinia suite with Vulkan benchmarks and
used them to evaluate this emerging programming model.
Indeed, Vulkan’s low-level control over the underlying hard-
ware offers opportunities for better performance. Our results
show that, by exploiting Vulkan’s synchronization mecha-
nisms, average speedups of 1.53xand 1.66xversus CUDA
and OpenCL were attained across the selected benchmarks.
We also, show that similar performance improvements can
be seen on some mobile GPU architectures but performance
portability is not necessarily guaranteed. Issues such as driver
support and implementation quality come into play.
Finally, we illustrate that these performance improvements
come at a cost manifested in a high programming effort. These
programmability issues can be a burden to adopting Vulkan
as a GPGPU programming model. Directions for future work
could include improving the programmability of this emerging
programming model.
ACKNOWLEDGMENT
This material is based upon work supported by the European
Union Horizon 2020 research and innovation programme
under Grant No.688759, Project LPGPU2.
REFERENCES
[1] J. D. Owens, D. Luebke, N. Govindraju, M. Harris,
J. Kruger, A. E. Lefohn, and T. J. Purcell, “A Survey of
General Purpose Computation on Graphics Hardware,”
Computer Graphics Forum, vol. 26, pp. 80–113, 2006.
[2] Nvidia Corporation, “CUDA Toolkit Documentation,”
2017. [Online]. Available: http://docs.nvidia.com/cuda/
[3] The Khronos OpenCL Working Group, “The OpenCL
Specification,” 2017. [Online]. Available: https://www.
khronos.org/registry/OpenCL/specs/opencl-2.2.html
[4] OpenMP Architecture Review Board, “OpenMP
Application Programming Interface,” 2015. [Online].
Available: http://www.openmp.org/wp-content/uploads/
openmp-4.5.pdf
[5] The OpenACC Standard.org, “The OpenACC
Application Programming Interface,” 2015. [On-
line]. Available: https://www.openacc.org/sites/default/
files/inline-files/OpenACC.2.6.final-changes.pdf
[6] The Khronos Vulkan Working Group, “The Vulkan
Specification,” 2017. [Online]. Available: https://www.
khronos.org/registry/vulkan/specs/1.0/html/vkspec.html
[7] J. Kessenich, B. Ouriel, and R. Krisch, “SPIR-
V Specification,” 2017. [Online]. Available: https:
//www.khronos.org/registry/spir-v/specs/1.2/SPIRV.html
[8] N. M. Dongre, “A Research On Android Technology
With New Version Naugat(7.0,7.1),” IOSR Journal of
Computer Engineering, vol. 19, no. 02, pp. 65–77, 2017.
[9] T. Linux Foundation Project, “Tizen 3.0 Public M2 Re-
lease Notes,” 2017. [Online]. Available: https://developer.
tizen.org/tizen/tizen/release-notes/tizen-3.0-public-m2
[10] S. Che, M. Boyer, J. Meng, D. Tarjan, S. Lee, J. W.
Sheaffer, and K. Skadron, “A Benchmark Suite for Het-
erogeneous Computing,” IEEE International Symposium
on Workload Characterization, pp. 44–54, 2009.
[11] J. a. Stratton, C. Rodrigues, I.-j. Sung, N. Obeid, L.-
w. Chang, N. Anssari, G. D. Liu, and W.-m. W. Hwu,
“Parboil: A Revised Benchmark Suite for Scientific and
Commercial Throughput Computing,” IMPACT Techni-
cal Report, 2012.
[12] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith,
P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter,
“The Scalable HeterOgeneous Computing ( SHOC )
Benchmark Suite Categories and Subject Descriptors,”
Proceedings of the 3rd Workshop on General-Purpose
Computation on Graphics Processing Units, pp. 63–74,
2010.
[13] Y. Sun, X. Gong, A. K. Ziabari, L. Yu, X. Li, S. Mukher-
jee, C. McCardwell, A. Villegas, and D. Kaeli, “Hetero-
mark, a benchmark suite for CPU-GPU collaborative
computing,” in Proceedings of the 2016 IEEE Interna-
tional Symposium on Workload Characterization, IISWC
2016, 2016, pp. 13–22.
[14] K. Karimi, N. G. Dickson, and F. Hamze, “A Perfor-
mance Comparison of CUDA and OpenCL,” ArXiv e-
prints, vol. arXiv, no. 1, p. 1005.2581, 2010.
[15] J. Fang, A. L. Varbanescu, and H. Sips, “A comprehen-
sive performance comparison of CUDA and OpenCL,”
Proceedings of the International Conference on Parallel
Processing, pp. 216–225, 2011.
[16] R. Sachetto Oliveira, B. M. Rocha, R. M. Amorim, F. O.
Campos, W. Meira, E. M. Toledo, and R. W. dos Santos,
“Comparing CUDA, OpenCL and OpenGL Implementa-
tions of the Cardiac Monodomain Equations.” Springer,
Berlin, Heidelberg, 2012, pp. 111–120.
[17] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson,
and J. Dongarra, “From CUDA to OpenCL: Towards
a performance-portable solution for multi-platform GPU
programming,” Parallel Computing, vol. 38, no. 8, pp.
391–407, 2012.
[18] C.-L. Su, P.-Y. Chen, C.-C. Lan, L.-S. Huang, and K.-H.
Wu, “Overview and comparison of OpenCL and CUDA
technology for GPGPU,” in 2012 IEEE Asia Pacific
Conference on Circuits and Systems. IEEE, 12 2012,
pp. 448–451.
[19] J. Kim, T. T. Dao, J. Jung, J. Joo, and J. Lee, “Bridging
OpenCL and CUDA,” Proceedings of the International
Conference for High Performance Computing, Network-
ing, Storage and Analysis on - SC ’15, no. November,
pp. 1–12, 2015.
[20] H. C. D. Silva, F. Pisani, and E. Borin, “A Comparative
Study of SYCL, OpenCL, and OpenMP,” 2016 Interna-
tional Symposium on Computer Architecture and High
Performance Computing Workshops (SBAC-PADW), pp.
61–66, 2016.
[21] Z. Wang, B. He, W. Zhang, and S. Jiang, “A performance
analysis framework for optimizing OpenCL applications
on FPGAs,” in Proceedings - International Symposium
on High-Performance Computer Architecture, vol. 2016-
April, 2016, pp. 114–125.
[22] A. Sampson, “Let’s Fix OpenGL,” 2nd Summit on
Advances in Programming Languages (SNAPL 2017),
vol. 71, pp. –, 2017.
[23] A. Blackert, Evaluation of Multi-Threading in Vulkan.
Link¨
oping University, 2016.
[24] J. Kessenich, “SPIR-V A Khronos-Defined Inter-
mediate Language for Native Representation of
Graphical Shaders and Compute Kernels,” 2015.
[Online]. Available: https://www.khronos.org/registry/
spir-v/papers/WhitePaper.pdf
[25] G. Wang and Y. Xiong, “Accelerating computer vision
algorithms using OpenCL framework on the mobile
GPU-a case study,” IEEE International Conference on
Acoustics, Speech and Signal Processing, 2013.
[26] M. M. Trompouki, L. Kosmidis, and U. Polit, “Optimi-
sation Opportunities and Evaluation for GPGPU appli-
cations on Low-End Mobile GPUs,” Date, pp. 950–953,
2017.
[27] L. Tobias, A. Ducournau, F. Rousseau, G. Mercier, and
R. Fablet, “Convolutional Neural Networks for object
recognition on mobile devices: A case study,” 2016 23rd
International Conference on Pattern Recognition (ICPR),
pp. 3530–3535, 2016.
[28] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn,
L. Wang, and K. Skadron, “A characterization of the Ro-
dinia benchmark suite with comparison to contemporary
CMP workloads,” in IEEE International Symposium on
Workload Characterization, IISWC’10, 2010.
[29] G. Misra, N. Kurkure, A. Das, M. Valmiki, S. Das,
and A. Gupta, “Evaluation of rodinia codes on Intel
Xeon Phi,” in Proceedings - International Conference
on Intelligent Systems, Modelling and Simulation, ISMS,
2013, pp. 415–419.
[30] The Khronos Group, “Glslang Reference Com-
piler,” 2017. [Online]. Available: https://github.com/
KhronosGroup/glslang
[31] K. Asanovic, B. C. Catanzaro, D. Patterson, and
K. Yelick, “The Landscape of Parallel Computing Re-
search : A View from Berkeley,” Tech. Rep., 2006.
[32] J. H. Lee, N. Nigania, H. Kim, K. Patel, and H. Kim,
“OpenCL Performance Evaluation on Modern Multicore
CPUs,” Scientific Programming, vol. 2015, pp. 1–20, 10
2015.
[33] GPUOpen AMD, “CodeXL Tool Suite,” 2017. [Online].
Available: https://github.com/GPUOpen-Tools/CodeXL
[34] S. Lal, J. Lucas, and B. Juurlink, “Eˆ2MC: Entropy En-
coding Based Memory Compression for GPUs,” in 2017
IEEE International Parallel and Distributed Processing
Symposium (IPDPS). IEEE, 5 2017, pp. 1119–1128.