
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all
other uses, in any current or future media, including reprinting/republishing this material for advertising or
promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse
of any copyrighted component of this work in other works.
Terms of Use
Mammeri, N., & Juurlink, B. (2018). VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile
and Embedded GPUs. In 2018 IEEE International Symposium on Workload Characterization (IISWC).
IEEE. https://doi.org/10.1109/iiswc.2018.8573477
Nadjib Mammeri, Ben Juurlink
VComputeBench: A Vulkan Benchmark
Suite for GPGPU on Mobile and Embedded
GPUs
Conference paper | Accepted manuscript (Postprint)
This version is available at https://doi.org/10.14279/depositonce-7346.2

VComputeBench: A Vulkan Benchmark Suite for
GPGPU on Mobile and Embedded GPUs
Nadjib Mammeri
Technische Universität Berlin
Ben Juurlink
Technische Universität Berlin
Abstract—GPUs have become immensely important computa-
tional units on embedded and mobile devices. However, GPGPU
developers are often not able to exploit the compute power
offered by GPUs on these devices mainly due to the lack of
support of traditional programming models such as CUDA and
OpenCL. The recent introduction of the Vulkan API provides
a new programming model that could be explored for GPGPU
computing on these devices, as it supports compute and promises
to be portable across different architectures.
In this paper we propose VComputeBench, a set of bench-
marks that help developers understand the differences in perfor-
mance and portability of Vulkan. We also evaluate the suitability
of Vulkan as an emerging cross-platform GPGPU framework by
conducting a thorough analysis of its performance compared to
CUDA and OpenCL on mobile as well as on desktop platforms.
Our experiments show that Vulkan provides better platform
support on mobile devices and can be regarded as a good cross-
platform GPGPU framework. It offers comparable performance
and with some low-level optimizations it can offer average
speedups of 1.53x and 1.66x compared to CUDA and OpenCL
respectively on desktop platforms and 1.59x average speedup
compared to OpenCL on mobile platforms. However, while
Vulkan’s low-level control can enhance performance, it requires
a significantly higher programming effort.
Index Terms—VComputeBench, Vulkan, SPIR-V, GPGPU,
CUDA, OpenCL, Rodinia, Mobile
I. INTRODUCTION
Graphics Processing Units (GPUs) have become a dominant
platform for parallel computing thanks to their massively
parallel architecture, energy efficiency and availability to the
masses. Several programming models have emerged enabling
developers to harness the massive compute power offered by
GPUs, while exploiting parallelism for different application
domains. This is often referred to as GPGPU (General Pur-
pose computing on the GPU) [1]. The most popular GPGPU
programming models are CUDA [2] and OpenCL [3]. CUDA
is a proprietary standard introduced by NVIDIA and targets
only NVIDIA specific hardware, while OpenCL is an open
standard maintained by the Khronos group and targets addi-
tional hardware devices including FPGAs, CPUs and DSPs. In
this work we focus on the two most predominant programming
models CUDA and OpenCL, but it is worth mentioning other
frameworks such as OpenMP [4] and OpenACC [5]. OpenMP
mainly targets shared memory multiprocessors and recently
OpenMP 4.5 introduced the target directive enabling support
for GPUs and other devices. OpenACC is mainly designed to
program accelerators in heterogeneous systems with OpenMP-
like directives.
To add to this mix of programming models, the Khronos
group recently released the Vulkan API [6] along with SPIR-
V [7]. Vulkan is a low level API with an abstraction closer
to the behavior of the actual hardware. It promises cross-
platform support, high-efficiency and better performance of
GPU applications. Unlike CUDA, which is only supported on
NVIDIA GPUs, and OpenCL, which has no official support on
mobile GPUs, Vulkan is supported by all major GPU vendors¹
and considers non-desktop GPUs as first class citizens. Vulkan
is officially supported on Android 7.0 [8] and on the new
Tizen OS 3.0 [9] covering a full spectrum of mobile devices
from phones and wearables to TVs and in-vehicle infotainment
systems. This good platform support and the fact that it also
supports compute, motivated us to examine it from the GPGPU
perspective even though it was mainly designed to improve
graphics performance. In this paper, we introduce Vulkan as a
cross-platform GPGPU route that could open new perspectives
for pertinent GPGPU computing on mobile devices and can
be explored along with other more established frameworks
on desktop architectures. However, there are some important
questions yet to be answered:
• What kind of performance can we get out of Vulkan?
• Is there a viable study comparing Vulkan compute to
established frameworks such as CUDA and OpenCL?
• If there are any performance gains, are these portable
across different GPU architectures?
• Can Vulkan enable pertinent GPGPU computing on mobile
and embedded GPUs?
Selecting which GPGPU framework to choose is a critical
task for developers. Differences in performance, portability,
programmability and platform support are all very important
factors that need to be considered. Benchmarks play an
important role in exposing these kinds of differences between
hardware architectures, compilers and, more importantly, across
competing programming models. There are several benchmarks
available to evaluate CUDA and OpenCL [10], [11], [12], [13],
but currently none for Vulkan. To fill this gap
and enable our study we propose VComputeBench, a set of
Vulkan compute benchmarks that help developers understand
the differences in performance and portability of Vulkan
and provide guidance to GPU architects in the design and
optimization of their drivers and runtime. VComputeBench
was developed by extending the popular Rodinia benchmark
suite [10], covering a diverse range of application domains
with different computation patterns. The reason for selecting
the Rodinia suite is that it provides OpenCL and CUDA
implementations, and with our VComputeBench implementations
we can make fair comparisons and adequately evaluate Vulkan
against other programming models.
¹ Supported by major desktop GPU vendors (AMD, NVIDIA and Intel) and
mobile GPU vendors (Qualcomm, ARM, Imagination and VeriSilicon).
In essence, the main contributions of this paper are:
• Illustrate the viability of Vulkan as a GPGPU framework,
notably on mobile devices.
• Propose a set of Vulkan compute benchmarks named
VComputeBench and port them onto mobile platforms.
• Perform a thorough analysis of performance, comparing
Vulkan to CUDA and OpenCL on desktop and mobile
GPUs, and highlight a set of Vulkan-specific optimization
techniques.
II. RELATED WORK
In recent years, GPGPU frameworks have received a great
amount of attention from the research community. Although
several works studied and compared different programming
models [14] [15] [16] [17] [18] [19] [20] [21], none of them
studied Vulkan. To the best of our knowledge, our work
is the first to investigate Vulkan from the compute rather than
the graphics perspective and to propose it as a viable cross-platform
GPGPU programming model. Among the earliest and well-cited
works are those of Fang et al. [15] and Karimi et al.
[14]. The authors compare CUDA to OpenCL in terms of
performance on old desktop GPU architectures. Our work, on
the other hand, was carried out on recent architectures and
analyzes performance on desktop as well as mobile GPUs.
Du et al. [17] study OpenCL performance portability and
Wang et al. [21] examine OpenCL on FPGAs. The authors of
these papers demonstrate that performance is not necessarily
portable across architectures. Their findings motivated us to
study and port our benchmarks onto mobile GPUs in order
to evaluate Vulkan's portability and examine its performance
implications.
Such research works rely heavily on benchmarks for their
evaluations. Researchers have proposed several GPGPU
benchmark suites, such as Rodinia [10], Parboil [11], SHOC [12] and
the recent Hetero-Mark [13]. Most of these benchmark suites
include CUDA, OpenCL or OpenMP implementations but
none include Vulkan implementations. This can be a limitation,
especially for researchers and developers wanting to target this
emerging programming model. In this work, we aim to
enrich the GPGPU community with such Vulkan benchmarks
by extending the popular Rodinia suite, enabling researchers
and developers to evaluate Vulkan along with other GPGPU
programming models. Likewise, most of these benchmark
suites mainly target desktop GPUs or multicore systems with
their CUDA and OpenCL implementations. Our benchmarks,
on the other hand, target both mobile and desktop GPUs. We
chose Vulkan because of its cross-platform capabilities and
good support on mobile devices.
III. VULKAN: A COMPUTE PERSPECTIVE
In this section we present an overview of the Vulkan
programming model illustrating why it is a promising GPGPU
framework especially for mobile and embedded GPUs.
A. Vulkan Overview
Vulkan is often referred to as the next generation graphics
and compute API for modern GPUs. It is an open standard that
aims to address the inefficiencies of traditional APIs such as
OpenGL, which were designed for single-core processors and
fail to map well to modern hardware [22]. Vulkan, on the other
hand, was designed from the ground up with multi-threading
support in mind. Better parallelization can be achieved by
asynchronously generating work across multiple threads feed-
ing the GPU in an efficient manner. This is attained in Vulkan
by having no global state, no synchronizations in the driver
and separating work generation from work submission. All
state is localized in command buffers, which can be generated
on multiple threads and only start executing on the GPU after
submission.
The other key characteristic of Vulkan is that it provides a
much lower-level fine-grained control over the GPU enabling
developers to maximize performance across many platforms.
It achieves this by being explicit in nature rather than re-
lying on hidden heuristics in the driver. Operations such as
resource tracking, synchronization, memory allocation, and
work submission are all pushed into application space resulting
in higher predictability and better control of when and where
work happens. Likewise, unnecessary background tasks such
as error checking, hazard tracking, state validation and shader
compilation are delegated to the tooling layers, which are
present during development and removed at runtime, resulting
in low driver overhead and less CPU usage [23].
B. The Programming Model
Vulkan can be viewed as a pipeline with some pro-
grammable stages that are invoked by a set of operations. To
the programmer, it is simply an API with a set of routines
allowing for the specification of shaders or kernels, state
controlling aspects as well as data used by those kernels. From
the compute perspective though, the pipeline has only one
programmable stage represented in the kernel program to be
executed [6].
a) Execution Model: A Vulkan-capable system exposes
one or more devices, and each of these physical devices
exposes one or more queues. These queues are partitioned
into queue families and can process work asynchronously
to one another. Each queue family supports a number of
functionalities and may contain multiple queues with similar
characteristics. There are four types of queue functionalities
defined in Vulkan: graphics, compute, transfer, and sparse
memory management. The reason for having queue families is
that queues within a single family are considered compatible

with one another, and work produced for one queue family
can be executed on any queue within that family.
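To make the role of queue families concrete, the following is a minimal sketch of how an application might choose a family for compute work. It is not taken from VComputeBench: `QueueFamily` is a hypothetical stand-in for `VkQueueFamilyProperties`, `pickComputeFamily` is an illustrative name, and preferring a compute-only family over a combined graphics and compute family is just one common heuristic. The flag values mirror `VkQueueFlagBits` from the Vulkan specification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit values as defined for VkQueueFlagBits in the Vulkan specification.
constexpr std::uint32_t kQueueGraphicsBit = 0x1;
constexpr std::uint32_t kQueueComputeBit = 0x2;
constexpr std::uint32_t kQueueTransferBit = 0x4;
constexpr std::uint32_t kQueueSparseBindingBit = 0x8;

// Hypothetical stand-in for VkQueueFamilyProperties (only the fields used here).
struct QueueFamily {
    std::uint32_t queueFlags;  // Bitmask of supported functionalities
    std::uint32_t queueCount;  // Number of queues in this family
};

// Returns the index of a queue family that supports compute, or -1 if none does.
// A family without graphics support is preferred (assumed heuristic); otherwise
// the first compute-capable family is used as a fallback.
int pickComputeFamily(const std::vector<QueueFamily>& families) {
    int fallback = -1;
    for (std::size_t i = 0; i < families.size(); ++i) {
        const QueueFamily& family = families[i];
        if (family.queueCount == 0 || (family.queueFlags & kQueueComputeBit) == 0) {
            continue;  // No usable compute queue in this family
        }
        if ((family.queueFlags & kQueueGraphicsBit) == 0) {
            return static_cast<int>(i);  // Compute without graphics
        }
        if (fallback < 0) {
            fallback = static_cast<int>(i);
        }
    }
    return fallback;
}
```

In a real application the family list would come from `vkGetPhysicalDeviceQueueFamilyProperties`, and the returned index would be passed on when creating the logical device and acquiring the queue.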
A queue is considered the interface between the application
and the execution engines of a device. Commands for
these execution engines are recorded into command buffers
ahead of execution time. Once recorded, a command buffer
can be cached and submitted to a queue for execution as many
times as required. Command buffer construction is expensive
and the application may employ multiple threads to construct
multiple command buffers in parallel. These command buffers
are then submitted to queues for execution in a number of
batches. Once submitted to a queue, the commands within a
command buffer begin and complete execution without further
application intervention. The order in which these commands
are executed is dependent on a number of implicit and explicit
ordering constraints.
In addition, command buffers submitted to different queues
may execute in parallel or even out of order with respect to one
another. Command buffers submitted to a single queue though
respect submission order. Host execution is also asynchronous
to command buffer execution on the device. Control may
return to the application as soon as the command buffer is
submitted and the application should take responsibility for
any synchronizations between different queues as well as
between the device and host.
b) Compute Model: In Vulkan, compute workloads are
initiated by recording dispatch commands (vkCmdDispatch*)
in a command buffer. Once a command buffer is
submitted to a queue, execution starts according to the currently
bound compute pipeline. Compute pipelines consist
of a single compute shader stage, describing the kernel to
be executed, and a pipeline layout, describing the input and
output resources of that kernel. Dispatch commands take
three input parameters, groupCountX, groupCountY and
groupCountZ, defining the total number of workgroups, or the
so-called global workgroup size, in the X, Y and Z directions
respectively. A workgroup is the smallest amount of compute
operations that an application can execute. Within a single
workgroup, there may be many workitems or compute shader
invocations. This is called the local workgroup size and is
defined by the compute shader itself using SPIR-V built-in
decorations [7].
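The relationship between the problem size, the local workgroup size and the group counts passed to a dispatch command can be sketched with two small helpers. The names `workgroupCount` and `globalInvocationId` are illustrative and not part of the Vulkan API. The sketch assumes the common convention of rounding the group count up so that a problem size that is not a multiple of the local size is still fully covered, which in turn requires the kernel to ignore out-of-range indices.

```cpp
#include <cassert>
#include <cstdint>

// Number of workgroups needed along one dimension so that `elements` workitems
// are covered when each workgroup holds `localSize` invocations (rounds up).
std::uint32_t workgroupCount(std::uint64_t elements, std::uint32_t localSize) {
    assert(localSize > 0);
    return static_cast<std::uint32_t>((elements + localSize - 1) / localSize);
}

// Global index of an invocation along one dimension, following the GLSL rule
// gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID.
std::uint64_t globalInvocationId(std::uint32_t workgroupId,
                                 std::uint32_t localSize,
                                 std::uint32_t localId) {
    assert(localId < localSize);
    return static_cast<std::uint64_t>(workgroupId) * localSize + localId;
}
```

For example, one million elements with a local size of 256 need 3907 workgroups, because 3906 groups cover only 999,936 elements. The last group then contains invocations whose global index exceeds the problem size.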
c) SPIR-V: All shaders and compute kernels in Vulkan
are defined using the Standard Portable Intermediate Representation
(SPIR-V), which is a platform-independent intermediate
language for describing graphical shaders and compute kernels
[24]. SPIR-V is a self-contained binary format. Logically, it
is a header followed by a linear stream of instructions; physically,
it is just a stream of 32-bit words, encoding a collection of
annotations and decorations as well as functions, which in turn
encode control-flow graphs (CFGs) of blocks. Variables are
accessed using load/store instructions, and any intermediate
results bypassing load/store are represented in static
single-assignment (SSA) form. Hierarchical type information
of data objects is preserved so as not to lose information needed for
further optimizations on the target device.
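The physical layout described above can be illustrated with a small reader that checks the module header and walks the instruction stream. This is a sketch only: the struct and function names are hypothetical, the layout (a five-word header starting with the magic number 0x07230203, then instructions whose first word carries the word count in its upper 16 bits) follows the SPIR-V specification, and the input is assumed to be already in host byte order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kSpirvMagic = 0x07230203;
constexpr std::size_t kHeaderWords = 5;  // magic, version, generator, bound, schema

// Hypothetical container for the decoded header fields.
struct SpirvHeader {
    std::uint32_t versionMajor;
    std::uint32_t versionMinor;
    std::uint32_t generator;
    std::uint32_t idBound;          // All result ids are smaller than this value
    std::size_t instructionWords;   // Words that follow the header
};

// Decodes the header; returns false if the stream is too short or the magic is wrong.
bool parseSpirvHeader(const std::vector<std::uint32_t>& words, SpirvHeader& out) {
    if (words.size() < kHeaderWords || words[0] != kSpirvMagic) {
        return false;
    }
    out.versionMajor = (words[1] >> 16) & 0xFF;  // Version word: 0 | major | minor | 0
    out.versionMinor = (words[1] >> 8) & 0xFF;
    out.generator = words[2];
    out.idBound = words[3];
    // words[4] is reserved for an instruction schema.
    out.instructionWords = words.size() - kHeaderWords;
    return true;
}

// Counts the instructions after the header, or returns -1 if the stream is malformed.
// Precondition: the header has already been validated with parseSpirvHeader.
long countInstructions(const std::vector<std::uint32_t>& words) {
    assert(words.size() >= kHeaderWords);
    long count = 0;
    std::size_t pos = kHeaderWords;
    while (pos < words.size()) {
        const std::uint32_t wordCount = words[pos] >> 16;  // Low 16 bits hold the opcode
        if (wordCount == 0 || pos + wordCount > words.size()) {
            return -1;
        }
        pos += wordCount;
        ++count;
    }
    return count;
}
```

A check of this kind can be useful when loading a precompiled kernel from disk, since a file that is truncated or not a SPIR-V module at all is rejected before it reaches the driver.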
C. Why Vulkan for Mobile and Embedded GPUs?
Considering that Vulkan was mainly designed to achieve
higher graphics performance, we can make several interesting
observations: (i) Its enhancements and low-level nature can
also be utilized to achieve higher performance for GPGPU
applications. (ii) Vulkan's main focus on graphics allowed
it to gain better support among GPU vendors than other
open frameworks such as OpenCL, which for instance is
not fully supported by NVIDIA because it is considered a
competitor to its proprietary CUDA framework². (iii) Vulkan
is considered the first framework to have official support
on mobile platforms [8] [9], and the API was designed with
mobile GPU features in mind, such as tiled rendering. Hence,
it has the potential of being the framework of choice for
GPGPU on mobile devices, which is the goal of many recent
research works [25] [26] [27]. This leads us to our final
observation: (iv) Vulkan can be the appropriate framework
for achieving true cross-platform GPGPU without sacrificing
performance.
IV. BENCHMARKS
Benchmarks play an important role in exposing differ-
ences in performance, portability and programmability across
competing programming models. Since Vulkan was recently
released and its main focus is on graphics rather than GPGPU, there
are currently a few graphics benchmarks but no compute benchmarks that can
be of use to our study. In order to enable our work as well as
to enrich the research community with such benchmarks, we
extended the popular Rodinia benchmark suite [10] by devel-
oping Vulkan equivalents of most of its workloads, referred to
as VComputeBench, and made them publicly available to the
wider GPGPU community.
Before describing our VComputeBench benchmarks, we
first present one of the microbenchmarks that we used in our
study to better illustrate this new programming model and give
an overview of what is required to write a Vulkan compute
application.
A. Vector Addition Microbenchmark
This microbenchmark is a simple application adding two
vectors X and Y of size n, saving the output in vector Z. The
kernel code, or the compute shader in Vulkan terminology,
is a SPIR-V binary that was compiled offline from a 10-line
GLSL source implementing:
$Z[i] = X[i] + Y[i] \quad \forall i \in \{0, 1, \ldots, n-1\}$
The index space is one-dimensional and i is defined using
the SPIR-V decoration GlobalInvocationId, which returns the
global ID of the workitem executing the kernel. The vectors
X, Y and Z are bound to the kernel as storage buffers.
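As a point of reference for what the kernel computes, the following CPU sketch emulates a one-dimensional dispatch of the vector addition. It is not the GLSL shader used in the paper: `vectorAddReference` is a hypothetical helper, and the bounds guard on the index is an assumption about how a kernel would handle a vector length that is not a multiple of the local workgroup size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for Z[i] = X[i] + Y[i]. The two nested loops mimic a 1D dispatch:
// the outer loop iterates over workgroups, the inner loop over the invocations
// inside one workgroup.
std::vector<float> vectorAddReference(const std::vector<float>& x,
                                      const std::vector<float>& y,
                                      std::size_t localSize) {
    assert(x.size() == y.size());
    assert(localSize > 0);
    const std::size_t n = x.size();
    const std::size_t groups = (n + localSize - 1) / localSize;  // Round up
    std::vector<float> z(n);
    for (std::size_t group = 0; group < groups; ++group) {
        for (std::size_t local = 0; local < localSize; ++local) {
            const std::size_t i = group * localSize + local;  // Global invocation ID
            if (i >= n) {
                continue;  // Invocation falls outside the vectors
            }
            z[i] = x[i] + y[i];
        }
    }
    return z;
}
```

A reference of this kind is also a simple way to validate the output buffer read back from the GPU.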
The host code, on the other hand, is more complicated.
Listing 1 shows a pseudo-code listing of the host program
highlighting only the important API calls.
² The current OpenCL version is 2.2, but NVIDIA only supports version 1.2.

int main()
{
  std::size_t N = 1000000;       // Number of elements in a vector
  int numWorkGroups = N / 256;   // Workgroup size is 256
  // Enumerate devices then create instance, queues and device
  VkInstance instance; VkInstanceCreateInfo instanceInfo = {}; ...
  vkCreateInstance(&instanceInfo, nullptr, &instance);
  vkEnumeratePhysicalDevices(instance, ..., &gpuList);
  vkGetPhysicalDeviceQueueFamilyProperties(gpuList[0], ...);
  ...
  VkDeviceQueueCreateInfo queueCreateInfo{}; ...
  VkDevice device; VkDeviceCreateInfo deviceInfo = {}; ...
  vkCreateDevice(gpuList[0], &deviceInfo, ..., &device);
  VkQueue computeQueue;
  vkGetDeviceQueue(device, queueFamilyIndex, 0, &computeQueue);
  ...
  // Create buffer then bind the buffer to the allocated memory
  VkBuffer bufferX; VkBufferCreateInfo bufferCreateInfo{}; ...
  bufferCreateInfo.size = N * sizeof(float);
  bufferCreateInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
                           VK_BUFFER_USAGE_TRANSFER_DST_BIT;
  vkCreateBuffer(device, &bufferCreateInfo, nullptr, &bufferX);
  VkMemoryRequirements xBuffMemReqs;
  vkGetBufferMemoryRequirements(device, bufferX, &xBuffMemReqs);
  int xMemIndex = findMemType(xBuffMemReqs.memoryTypeBits,
                              VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
  VkDeviceMemory memory; VkMemoryAllocateInfo memAllocInfo{}; ...
  memAllocInfo.allocationSize = xBuffMemReqs.size;
  memAllocInfo.memoryTypeIndex = xMemIndex;
  vkAllocateMemory(device, &memAllocInfo, nullptr, &memory);
  vkBindBufferMemory(device, bufferX, memory, 0);
  ...
  // Create the compute shader and the compute pipeline
  VkShaderModule module; VkShaderModuleCreateInfo shadCreatInfo{}; ...
  shadCreatInfo.pCode = readSpirvBinary("vectorAdd.spv");
  vkCreateShaderModule(device, &shadCreatInfo, NULL, &module);
  VkPipelineShaderStageCreateInfo shaderStageCreateInfo{}; ...
  shaderStageCreateInfo.module = module;
  shaderStageCreateInfo.stage = VK_SHADER_STAGE_COMPUTE_BIT;
  VkPipelineLayout pipelineLayout;
  ...
  vkCreatePipelineLayout(device, ..., &pipelineLayout);
  VkPipeline ppline; VkComputePipelineCreateInfo ppCreateInfo{}; ...
  ppCreateInfo.stage = shaderStageCreateInfo;
  ppCreateInfo.layout = pipelineLayout;
  vkCreateComputePipelines(device, &ppCreateInfo, &ppline ...);
  ...
  // Bind buffers to compute pipeline
  VkWriteDescriptorSet writeDescripSet{}; ...
  writeDescripSet.descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
  writeDescripSet.dstBinding = 0;  // Same as SPIR-V Binding decoration
  writeDescripSet.pBufferInfo = xBufferDescriptor;
  vkUpdateDescriptorSets(device, 1, &writeDescripSet, 0, NULL);
  ...
  // Create command pool and allocate a command buffer
  VkCommandPool cmdPool; VkCommandPoolCreateInfo cmdPoolInfo{}; ...
  vkCreateCommandPool(device, &cmdPoolInfo, nullptr, &cmdPool);
  VkCommandBuffer cmdBuffer; VkCommandBufferAllocateInfo allcInfo{}; ...
  allcInfo.commandPool = cmdPool;
  vkAllocateCommandBuffers(device, &allcInfo, &cmdBuffer);
  ...
  // Bind the pipeline and record commands to the command buffer
  vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, ppline);
  vkCmdDispatch(cmdBuffer, numWorkGroups, 1, 1);
  vkEndCommandBuffer(cmdBuffer);
  ...
  // Submit to queue
  VkSubmitInfo submitInfo {VK_STRUCTURE_TYPE_SUBMIT_INFO};
  submitInfo.commandBufferCount = 1;
  submitInfo.pCommandBuffers = &cmdBuffer;
  vkQueueSubmit(computeQueue, 1, &submitInfo ...);
  ... // Clean up and free all resources
}
Listing 1: VectorAdd host code using low-level Vulkan API
TABLE I: VComputeBench benchmarks
Name       | Application          | Dwarf                | Domain
backprop   | Back Propagation     | Unstructured Grid    | Deep Learning
bfs        | Breadth-First Search | Graph Traversal      | Graph Theory
cfd        | CFD Solver           | Unstructured Grid    | Fluid Dynamics
gaussian   | Gaussian Elimination | Dense Linear Algebra | Linear Algebra
hotspot    | Hotspot Simulation   | Structured Grid      | Physics
lud        | LU Decomposition     | Dense Linear Algebra | Linear Algebra
nn         | K-Nearest Neighbors  | Dense Linear Algebra | Data Mining
nw         | Needleman-Wunsch     | Dynamic Programming  | Bioinformatics
pathfinder | Path Finder          | Dynamic Programming  | Grid Traversal
Vulkan applications are linked against a common library
referred to as the loader, which gets initialized at the time
of VkInstance creation. The loader loads any enabled tooling
layers and initializes the low-level driver provided by the GPU
vendor. Accordingly, the example program depicted in Listing
1 starts by initializing Vulkan: it creates a VkInstance and
queries the system for any available devices with all their
properties, including all available queue families.
Then a logical VkDevice is created and a queue is acquired.
The next step is to create storage buffers for the vectors.
VkBuffer objects are created, the system is queried for suitable
heaps according to the buffer memory requirements, then
memory is allocated on that heap and the buffers are bound
to their allocated memory. Next, a compute VkPipeline is
created by specifying the kernel’s SPIR-V binary as its shader
stage and creating a VkPipelineLayout describing all the re-
sources used by that kernel. Then, the buffers are bound to the
pipeline by specifying the kernel’s binding value of each buffer
as the destination binding of the write descriptor set. This is
similar to specifying the kernel arguments in OpenCL using
clSetKernelArg. Now that the compute pipeline is set up, the
kernel can be launched by creating a VkCommandBuffer,
binding the pipeline to that command buffer and recording
the dispatch command with the number of workgroups to
be launched. The command buffer is then submitted to the
acquired queue for execution. Finally, the application waits for
execution to finish then cleans up and frees all used resources
and objects.
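Listing 1 calls a helper named findMemType without showing its body. The sketch below illustrates how such a helper is commonly written; it is an assumption, not the authors' implementation. To stay self-contained it takes the per-type property flags as a third, hypothetical parameter, whereas a real application would read them from the structure filled in by `vkGetPhysicalDeviceMemoryProperties`. The flag values mirror `VkMemoryPropertyFlagBits` from the Vulkan specification.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit values as defined for VkMemoryPropertyFlagBits in the Vulkan specification.
constexpr std::uint32_t kDeviceLocalBit = 0x1;
constexpr std::uint32_t kHostVisibleBit = 0x2;
constexpr std::uint32_t kHostCoherentBit = 0x4;

// Returns the index of the first memory type that (a) is allowed for the resource,
// i.e. its bit is set in `memoryTypeBits`, and (b) has all `required` property flags.
// Returns -1 if no memory type qualifies.
// `typePropertyFlags[i]` holds the property flags of memory type i (hypothetical
// parameter standing in for the device's memory properties).
int findMemType(std::uint32_t memoryTypeBits,
                std::uint32_t required,
                const std::vector<std::uint32_t>& typePropertyFlags) {
    assert(typePropertyFlags.size() <= 32);  // memoryTypeBits has one bit per type
    for (std::size_t i = 0; i < typePropertyFlags.size(); ++i) {
        const bool allowed = ((memoryTypeBits >> i) & 1u) != 0;
        const bool hasProperties = (typePropertyFlags[i] & required) == required;
        if (allowed && hasProperties) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

Requesting `kDeviceLocalBit` corresponds to the device-local allocation in Listing 1, while a staging buffer that the host fills directly would typically request the host-visible and host-coherent flags instead.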
B. VComputeBench Benchmarks
The Rodinia suite includes both CUDA and OpenCL versions
for each of its benchmarks. While developing their Vulkan
equivalents, we made sure not to introduce any algorithmic
changes to the kernel codes. In this way, we will be able to
make fair comparisons in the sense that any differences in
performance can be related to the programming model and
not to the algorithm. By using the latest Rodinia version 3.1,
we assume that we are already starting from a decent baseline
since these benchmarks were optimized many times in several
research works [28] [29].
The kernels were developed in GLSL and their corresponding
SPIR-V binaries were automatically generated using the
glslangValidator compiler [30] provided by Khronos. We have
chosen GLSL as our kernel language because it has the best
support. We provide both the SPIR-V binaries and the GLSL