VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs


Mammeri, N., & Juurlink, B. (2018). VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs. In 2018 IEEE International Symposium on Workload Characterization (IISWC). IEEE. https://doi.org/10.1109/iiswc.2018.8573477

Nadjib Mammeri, Ben Juurlink


Conference paper | Accepted manuscript (Postprint)

This version is available at https://doi.org/10.14279/depositonce-7346.2


Nadjib Mammeri

Technische Universität Berlin

Ben Juurlink

Technische Universität Berlin

Abstract: GPUs have become immensely important computational units on embedded and mobile devices. However, GPGPU developers are often not able to exploit the compute power offered by GPUs on these devices, mainly due to the lack of support for traditional programming models such as CUDA and OpenCL. The recent introduction of the Vulkan API provides a new programming model that could be explored for GPGPU computing on these devices, as it supports compute and promises to be portable across different architectures.

In this paper we propose VComputeBench, a set of benchmarks that help developers understand the differences in performance and portability of Vulkan. We also evaluate the suitability of Vulkan as an emerging cross-platform GPGPU framework by conducting a thorough analysis of its performance compared to CUDA and OpenCL on mobile as well as on desktop platforms. Our experiments show that Vulkan provides better platform support on mobile devices and can be regarded as a good cross-platform GPGPU framework. It offers comparable performance, and with some low-level optimizations it can offer average speedups of 1.53x and 1.66x compared to CUDA and OpenCL respectively on desktop platforms, and a 1.59x average speedup compared to OpenCL on mobile platforms. However, while Vulkan’s low-level control can enhance performance, it requires a significantly higher programming effort.

Index Terms: VComputeBench, Vulkan, SPIR-V, GPGPU, CUDA, OpenCL, Rodinia, Mobile

I. INTRODUCTION

Graphics Processing Units (GPUs) have become a dominant platform for parallel computing thanks to their massively parallel architecture, energy efficiency and availability to the masses. Several programming models have emerged enabling developers to harness the massive compute power offered by GPUs, while exploiting parallelism for different application domains. This is often referred to as GPGPU (General Purpose computing on the GPU) [1]. The most popular GPGPU programming models are CUDA [2] and OpenCL [3]. CUDA is a proprietary standard introduced by NVIDIA and targets only NVIDIA-specific hardware, while OpenCL is an open standard maintained by the Khronos Group and targets additional hardware devices including FPGAs, CPUs and DSPs. In this work we focus on the two predominant programming models, CUDA and OpenCL, but it is worth mentioning other frameworks such as OpenMP [4] and OpenACC [5]. OpenMP mainly targets shared-memory multiprocessors, and recently OpenMP 4.5 introduced the target directive enabling support for GPUs and other devices. OpenACC is mainly designed to program accelerators in heterogeneous systems with OpenMP-like directives.

To add to this mix of programming models, the Khronos Group recently released the Vulkan API [6] along with SPIR-V [7]. Vulkan is a low-level API with an abstraction closer to the behavior of the actual hardware. It promises cross-platform support, high efficiency and better performance of GPU applications. Unlike CUDA, which is only supported on NVIDIA GPUs, and OpenCL, which has no official support on mobile GPUs, Vulkan is supported by all major GPU vendors (see footnote 1) and considers non-desktop GPUs as first-class citizens. Vulkan is officially supported on Android 7.0 [8] and on the new Tizen OS 3.0 [9], covering a full spectrum of mobile devices from phones and wearables to TVs and in-vehicle infotainment systems. This good platform support, and the fact that it also supports compute, motivated us to examine it from the GPGPU perspective even though it was mainly designed to improve graphics performance. In this paper, we introduce Vulkan as a cross-platform GPGPU route that could open new perspectives for pertinent GPGPU computing on mobile devices and can be explored along with other more established frameworks on desktop architectures. However, there are some important questions yet to be answered:

• What kind of performance can we get out of Vulkan?
• Is there a viable study comparing Vulkan compute to established frameworks such as CUDA and OpenCL?
• If there are any performance gains, are these portable across different GPU architectures?
• Can Vulkan enable pertinent GPGPU computing on mobile and embedded GPUs?

Selecting which GPGPU framework to use is a critical task for developers. Differences in performance, portability, programmability and platform support are all very important factors that need to be considered. Benchmarks play an important role in exposing these kinds of differences between hardware architectures, compilers and, more importantly, across competing programming models. There are several benchmarks available to evaluate CUDA and OpenCL [10] [11] [12] [13] but currently none for Vulkan. To fill this gap and enable our study we propose VComputeBench, a set of Vulkan compute benchmarks that help developers understand the differences in performance and portability of Vulkan and provide guidance to GPU architects in the design and optimization of their drivers and runtime. VComputeBench was developed by extending the popular Rodinia benchmark suite [10], covering a diverse range of application domains with different computation patterns. The reason for selecting the Rodinia suite is that it provides OpenCL and CUDA implementations, so with our VComputeBench implementations we can make fair comparisons and adequately evaluate Vulkan against other programming models.

Footnote 1: Supported by major desktop GPU vendors (AMD, NVIDIA and Intel) and mobile GPU vendors (Qualcomm, ARM, Imagination and VeriSilicon).

In essence, the main contributions of this paper are:

• Illustrate the viability of Vulkan as a GPGPU framework, notably on mobile devices.
• Propose a set of Vulkan compute benchmarks named VComputeBench and port them onto mobile platforms.
• Perform a thorough analysis of performance, comparing Vulkan to CUDA and OpenCL on desktop and mobile GPUs, and highlight a set of Vulkan-specific optimization techniques.

II. RELATED WORK

In recent years, GPGPU frameworks have received a great amount of attention from the research community. Although several works studied and compared different programming models [14] [15] [16] [17] [18] [19] [20] [21], none of them studied Vulkan. To the best of our knowledge, our work is the first to investigate Vulkan from the compute rather than the graphics perspective and to propose it as a viable cross-platform GPGPU programming model. Among the earliest and most widely cited works are those of Fang et al. [15] and Karimi et al. [14]. The authors compare CUDA to OpenCL in terms of performance on old desktop GPU architectures. Our work, on the other hand, was carried out on recent architectures and analyses performance on desktop as well as mobile GPUs. Du et al. [17] study OpenCL performance portability and Wang et al. [21] examine OpenCL on FPGAs. The authors of these papers demonstrate that performance is not necessarily portable across architectures. Their findings motivated us to study and port our benchmarks onto mobile GPUs in order to evaluate Vulkan’s portability and examine its performance implications.

Such research works rely heavily on benchmarks for their evaluations. Several GPGPU benchmarks were proposed by researchers, such as Rodinia [10], Parboil [11], SHOC [12] and the recent Hetero-Mark [13]. Most of these benchmark suites include CUDA, OpenCL or OpenMP implementations but none include Vulkan implementations. This can be a limitation, especially for researchers and developers wanting to target this emerging programming model. In this work, we aim to enrich the GPGPU community with such Vulkan benchmarks by extending the popular Rodinia suite, enabling researchers and developers to evaluate Vulkan along with other GPGPU programming models. Likewise, most of these benchmark suites mainly target desktop GPUs or multicore systems with their CUDA and OpenCL implementations. Our benchmarks, on the other hand, target both mobile and desktop GPUs. We chose Vulkan because of its cross-platform capabilities and good support on mobile devices.

III. VULKAN: A COMPUTE PERSPECTIVE

In this section we present an overview of the Vulkan

programming model illustrating why it is a promising GPGPU

framework especially for mobile and embedded GPUs.

A. Vulkan Overview

Vulkan is often referred to as the next generation graphics and compute API for modern GPUs. It is an open standard that aims to address the inefficiencies of traditional APIs such as OpenGL, which were designed for single-core processors and fail to map well to modern hardware [22]. Vulkan, on the other hand, was designed from the ground up with multi-threading support in mind. Better parallelization can be achieved by asynchronously generating work across multiple threads, feeding the GPU in an efficient manner. This is attained in Vulkan by having no global state, no synchronization in the driver and by separating work generation from work submission. All state is localized in command buffers, which can be generated on multiple threads and only start executing on the GPU after submission.

The other key characteristic of Vulkan is that it provides much lower-level, fine-grained control over the GPU, enabling developers to maximize performance across many platforms. It achieves this by being explicit in nature rather than relying on hidden heuristics in the driver. Operations such as resource tracking, synchronization, memory allocation, and work submission are all pushed into application space, resulting in higher predictability and better control of when and where work happens. Likewise, unnecessary background tasks such as error checking, hazard tracking, state validation and shader compilation are delegated to the tooling layers, which are present during development and removed at runtime, resulting in low driver overhead and less CPU usage [23].

B. The Programming Model

Vulkan can be viewed as a pipeline with some programmable stages that are invoked by a set of operations. To the programmer, it is simply an API with a set of routines allowing for the specification of shaders or kernels, state controlling aspects as well as data used by those kernels. From the compute perspective though, the pipeline has only one programmable stage represented in the kernel program to be executed [6].

a) Execution Model: A Vulkan-capable system exposes one or more devices, and each of these physical devices exposes one or more queues. These queues are partitioned into queue families and can process work asynchronously to one another. Each queue family supports a number of functionalities and may contain multiple queues with similar characteristics. There are four types of queue functionalities defined in Vulkan: graphics, compute, transfer, and sparse memory management. The reason for having queue families is that queues within a single family are considered compatible with one another, and work produced for one queue family can be executed on any queue within that family.
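As a minimal sketch of how a compute application might choose a queue family, the function below scans a list of family descriptions and returns the index of one that supports compute. The `QueueFamily` struct and `findComputeQueueFamily` are hypothetical stand-ins, not Vulkan types; only the capability bit values are taken from Vulkan's VkQueueFlagBits enumeration. Preferring a family without graphics support is one possible policy assumed here, not something prescribed by the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the per-family properties a device reports.
struct QueueFamily {
    std::uint32_t flags;       // bitmask of supported functionalities
    std::uint32_t queueCount;  // number of queues in this family
};

// Bit values mirror VkQueueFlagBits.
constexpr std::uint32_t kGraphicsBit = 0x1;
constexpr std::uint32_t kComputeBit  = 0x2;
constexpr std::uint32_t kTransferBit = 0x4;
constexpr std::uint32_t kSparseBit   = 0x8;

// Returns the index of a family that supports compute, preferring a
// family without graphics support; returns -1 if no family qualifies.
int findComputeQueueFamily(const std::vector<QueueFamily>& families) {
    int fallback = -1;
    for (std::size_t i = 0; i < families.size(); ++i) {
        const QueueFamily& f = families[i];
        if (f.queueCount == 0 || (f.flags & kComputeBit) == 0) continue;
        if ((f.flags & kGraphicsBit) == 0) return static_cast<int>(i);
        if (fallback < 0) fallback = static_cast<int>(i);
    }
    return fallback;
}
```

In a real application the list would be filled from vkGetPhysicalDeviceQueueFamilyProperties, and the returned index would be used when creating the logical device and acquiring the queue.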

A queue is considered as the interface between the application and the execution engines of a device. Commands for these execution engines are recorded into command buffers ahead of execution time. Once recorded, a command buffer can be cached and submitted to a queue for execution as many times as required. Command buffer construction is expensive, so the application may employ multiple threads to construct multiple command buffers in parallel. These command buffers are then submitted to queues for execution in a number of batches. Once submitted to a queue, the commands within a command buffer begin and complete execution without further application intervention. The order in which these commands are executed is dependent on a number of implicit and explicit ordering constraints.

In addition, command buffers submitted to different queues

may execute in parallel or even out of order with respect to one

another. Command buffers submitted to a single queue though

respect submission order. Host execution is also asynchronous

to command buffer execution on the device. Control may

return to the application as soon as the command buffer is

submitted and the application should take responsibility for

any synchronizations between different queues as well as

between the device and host.
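The record-then-submit pattern described above can be illustrated with a toy model. Everything in this sketch (`ToyCommandBuffer`, `ToyQueue`, `drain`) is hypothetical and not part of the Vulkan API; it only mirrors three properties stated in the text: commands are recorded ahead of time, a recorded buffer can be submitted repeatedly, and a single queue respects submission order. A real device executes asynchronously to the host, which this serial model does not attempt to capture.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Toy command buffer: a list of command names recorded ahead of execution.
struct ToyCommandBuffer {
    std::vector<std::string> commands;
    bool recording = true;

    void record(const std::string& name) {
        assert(recording);  // recording is only legal before end()
        commands.push_back(name);
    }
    void end() { recording = false; }
};

// Toy queue: submission only enqueues work; drain() stands in for the device.
struct ToyQueue {
    std::vector<const ToyCommandBuffer*> pending;

    void submit(const ToyCommandBuffer& cb) {
        assert(!cb.recording);  // only finished buffers may be submitted
        pending.push_back(&cb);
    }

    // Executes pending buffers in submission order and returns the trace.
    std::vector<std::string> drain() {
        std::vector<std::string> trace;
        for (std::size_t i = 0; i < pending.size(); ++i)
            for (std::size_t j = 0; j < pending[i]->commands.size(); ++j)
                trace.push_back(pending[i]->commands[j]);
        pending.clear();
        return trace;
    }
};
```

Recording "bind" and "dispatch" once and submitting the buffer twice yields the trace bind, dispatch, bind, dispatch, which is the reuse that makes caching command buffers worthwhile.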

b) Compute Model: In Vulkan, compute workloads are initiated by recording dispatch commands (vkCmdDispatch*) in a command buffer. Once a command buffer is submitted to a queue, execution starts according to the currently bound compute pipeline. Compute pipelines consist of a single compute shader stage, describing the kernel to be executed, and a pipeline layout, describing the input and output resources of that kernel. Dispatch commands take three input parameters, groupCountX, groupCountY and groupCountZ, defining the total number of workgroups, or the so-called global workgroup size, in the X, Y and Z directions respectively. A workgroup is the smallest amount of compute operations that an application can execute. Within a single workgroup, there may be many workitems or compute shader invocations. This is called the local workgroup size and is defined by the compute shader itself using SPIR-V built-in decorations [7].
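Because the dispatch parameters count workgroups rather than workitems, the host has to convert a problem size into a group count. The helper below is a small sketch of that arithmetic; the name `groupCount` is an assumption made for illustration. It rounds up, since a truncating division would leave the tail of the data unprocessed whenever the problem size is not a multiple of the local workgroup size.

```cpp
#include <cassert>
#include <cstdint>

// Number of workgroups needed so that groups * localSize >= elements.
std::uint32_t groupCount(std::uint64_t elements, std::uint32_t localSize) {
    assert(localSize > 0);
    return static_cast<std::uint32_t>((elements + localSize - 1) / localSize);
}
```

For 1,000,000 elements and a local size of 256 this gives 3907 groups, whereas 1000000 / 256 truncates to 3906 and would skip the last 64 elements. When rounding up, the kernel is assumed to skip invocations whose global index falls beyond the data.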

c) SPIR-V: All shaders and compute kernels in Vulkan are defined using the Standard Portable Intermediate Representation (SPIR-V), which is a platform-independent intermediate language for describing graphical shaders and compute kernels [24]. SPIR-V is a self-contained binary format. Logically, it is a header followed by a linear stream of instructions; physically, it is just a stream of 32-bit words, encoding a collection of annotations and decorations as well as functions, which in turn encode control-flow graphs (CFGs) of blocks. Variables are accessed using load/store instructions, and any intermediate results bypassing load/store are represented in static single assignment (SSA) form. Hierarchical type information of data objects is preserved so as not to lose information needed for further optimizations on the target device.
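To make the "header plus stream of 32-bit words" layout concrete, the sketch below decodes the five-word module header defined by the SPIR-V specification: magic number, version, generator, ID bound and a reserved word. The struct and function names are assumptions chosen for illustration, and the sketch assumes the words are already in host byte order; it does not handle a byte-swapped module.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Decoded fields of a SPIR-V module header (illustrative struct).
struct SpirvHeader {
    bool valid;
    unsigned versionMajor;
    unsigned versionMinor;
    std::uint32_t generator;  // tool that produced the module
    std::uint32_t bound;      // all result IDs are smaller than this value
};

// Parses the first five words of a module; valid is false on a bad header.
SpirvHeader parseSpirvHeader(const std::vector<std::uint32_t>& words) {
    SpirvHeader h = {false, 0, 0, 0, 0};
    const std::uint32_t kMagic = 0x07230203u;  // SPIR-V magic number
    if (words.size() < 5 || words[0] != kMagic) return h;
    h.valid = true;
    h.versionMajor = (words[1] >> 16) & 0xffu;  // version word: 0 | major | minor | 0
    h.versionMinor = (words[1] >> 8) & 0xffu;
    h.generator = words[2];
    h.bound = words[3];
    return h;
}
```

A check like this is a cheap sanity test before handing a binary read from disk to vkCreateShaderModule.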

C. Why Vulkan for Mobile and Embedded GPUs?

Considering that Vulkan was mainly designed to achieve higher graphics performance, we can make several interesting observations: (i) Its enhancements and low-level nature can also be utilized to achieve higher performance for GPGPU applications. (ii) Vulkan’s main focus on graphics allowed it to have better support among GPU vendors than other open frameworks such as OpenCL, which for instance is not fully supported by NVIDIA because it is considered a competitor to its proprietary CUDA framework (see footnote 2). (iii) Vulkan is considered the first framework to have official support on mobile platforms [8] [9], and the API was designed with mobile GPU features in mind, such as tiled rendering. Hence, it has the potential of being the framework of choice for GPGPU on mobile devices, which is the quest of many recent research works [25] [26] [27]. This leads us to our final observation: (iv) Vulkan can be the appropriate framework for achieving true cross-platform GPGPU without sacrificing performance.

IV. BENCHMARKS

Benchmarks play an important role in exposing differences in performance, portability and programmability across competing programming models. Since Vulkan was recently released and its main focus is on graphics not GPGPU, there are currently few graphics but no compute benchmarks that can be of use to our study. In order to enable our work as well as to enrich the research community with such benchmarks, we extended the popular Rodinia benchmark suite [10] by developing Vulkan equivalents of most of its workloads, referred to as VComputeBench, and made them publicly available to the wider GPGPU community.

Before describing our VComputeBench benchmarks, we

first present one of the microbenchmarks that we used in our

study to better illustrate this new programming model and give

an overview of what is required to write a Vulkan compute

application.

A. Vector Addition Microbenchmark

This microbenchmark is a simple application adding two vectors X and Y of size n, saving the output in vector Z. The kernel code, or the compute shader in Vulkan terminology, is a SPIR-V binary that was compiled offline from a 10-line GLSL source implementing:

$$Z[i] = X[i] + Y[i] \quad \forall i \in \{0, 1, \ldots, n-1\}$$

The index space is one-dimensional and i is defined using the SPIR-V decoration GlobalInvocationId, which returns the global ID of the workitem executing the kernel. The vectors X, Y and Z are bound to the kernel as storage buffers.
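The indexing performed by the kernel can be mimicked on the CPU. The function below is a reference sketch, not the benchmark's GLSL source: it walks over workgroups and local invocations, derives the global index the way a one-dimensional GlobalInvocationId would be formed (group index times local size plus local index), and guards against indices beyond the vector length. The name `vectorAddReference` and the guard are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU emulation of a 1-D dispatch computing z[i] = x[i] + y[i].
std::vector<float> vectorAddReference(const std::vector<float>& x,
                                      const std::vector<float>& y,
                                      std::size_t localSize) {
    assert(x.size() == y.size() && localSize > 0);
    const std::size_t n = x.size();
    std::vector<float> z(n);
    const std::size_t groups = (n + localSize - 1) / localSize;  // rounded up
    for (std::size_t group = 0; group < groups; ++group) {
        for (std::size_t local = 0; local < localSize; ++local) {
            const std::size_t i = group * localSize + local;  // global ID
            if (i < n) z[i] = x[i] + y[i];  // skip padding invocations
        }
    }
    return z;
}
```

A reference of this kind is also handy for validating the output buffer read back from the device.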

The host code, on the other hand, is more complicated.

Listing 1 shows a pseudo-code listing of the host program

highlighting only the important API calls.

Footnote 2: The current OpenCL version is 2.2, but NVIDIA only supports version 1.2.

int main()
{
    std::size_t N = 1000000; // Number of elements in a vector
    // Workgroup size is 256; round up so that all N elements are covered
    // (the kernel must then skip invocations with index >= N)
    int numWorkGroups = (N + 255) / 256;
    // Create instance, enumerate devices, then create the logical device and get a queue
    VkInstance instance; VkInstanceCreateInfo instanceInfo = {}; ...
    vkCreateInstance(&instanceInfo, nullptr, &instance);
    vkEnumeratePhysicalDevices(instance, ..., &gpuList);
    vkGetPhysicalDeviceQueueFamilyProperties(gpuList[0], ...);
    ...
    VkDeviceQueueCreateInfo queueCreateInfo{}; ...
    VkDevice device; VkDeviceCreateInfo deviceInfo = {}; ...
    vkCreateDevice(gpuList[0], &deviceInfo, ..., &device);
    VkQueue computeQueue;
    vkGetDeviceQueue(device, queueFamilyIndex, 0, &computeQueue);
    ...
    // Create buffer then bind the buffer to the allocated memory
    VkBuffer bufferX; VkBufferCreateInfo bufCreateInfo{}; ...
    bufCreateInfo.size = N * sizeof(float);
    bufCreateInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
                          VK_BUFFER_USAGE_TRANSFER_DST_BIT;
    vkCreateBuffer(device, &bufCreateInfo, nullptr, &bufferX);
    VkMemoryRequirements xBuffMemReqs;
    vkGetBufferMemoryRequirements(device, bufferX, &xBuffMemReqs);
    int xMemIndex = findMemType(xBuffMemReqs.memoryTypeBits,
                                VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
    VkDeviceMemory memory; VkMemoryAllocateInfo memAllocInfo{}; ...
    memAllocInfo.allocationSize = xBuffMemReqs.size;
    memAllocInfo.memoryTypeIndex = xMemIndex;
    vkAllocateMemory(device, &memAllocInfo, nullptr, &memory);
    vkBindBufferMemory(device, bufferX, memory, 0);
    ...
    // Create the compute shader and the compute pipeline
    VkShaderModule module; VkShaderModuleCreateInfo shadCreatInfo{}; ...
    shadCreatInfo.pCode = readSpirvBinary("vectorAdd.spv");
    vkCreateShaderModule(device, &shadCreatInfo, nullptr, &module);
    VkPipelineShaderStageCreateInfo shaderStageCreateInfo{}; ...
    shaderStageCreateInfo.module = module;
    shaderStageCreateInfo.stage = VK_SHADER_STAGE_COMPUTE_BIT;
    VkPipelineLayout pipelineLayout;
    ...
    vkCreatePipelineLayout(device, ..., &pipelineLayout);
    VkPipeline ppline; VkComputePipelineCreateInfo ppCreateInfo{}; ...
    ppCreateInfo.stage = shaderStageCreateInfo;
    ppCreateInfo.layout = pipelineLayout;
    vkCreateComputePipelines(device, &ppCreateInfo, &ppline ...);
    ...
    // Bind buffers to compute pipeline
    VkWriteDescriptorSet writeDescripSet{}; ...
    writeDescripSet.descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
    writeDescripSet.dstBinding = 0; // Same as SPIR-V Binding decoration
    writeDescripSet.pBufferInfo = xBufferDescriptor;
    vkUpdateDescriptorSets(device, 1, &writeDescripSet, 0, nullptr);
    ...
    // Create command pool and allocate a command buffer
    VkCommandPool cmdPool; VkCommandPoolCreateInfo cmdPoolInfo{}; ...
    vkCreateCommandPool(device, &cmdPoolInfo, nullptr, &cmdPool);
    VkCommandBuffer cmdBuffer; VkCommandBufferAllocateInfo allcInfo{}; ...
    allcInfo.commandPool = cmdPool;
    vkAllocateCommandBuffers(device, &allcInfo, &cmdBuffer);
    ...
    // Bind the pipeline and record commands to the command buffer
    vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, ppline);
    vkCmdDispatch(cmdBuffer, numWorkGroups, 1, 1);
    vkEndCommandBuffer(cmdBuffer);
    ...
    // Submit to queue
    VkSubmitInfo submitInfo{VK_STRUCTURE_TYPE_SUBMIT_INFO};
    submitInfo.commandBufferCount = 1;
    submitInfo.pCommandBuffers = &cmdBuffer;
    vkQueueSubmit(computeQueue, 1, &submitInfo ...);
    ... // Clean up and free all resources
}

Listing 1: VectorAdd host code using low-level Vulkan API

TABLE I: VComputeBench benchmarks

Name        Application           Dwarf                 Domain
backprop    Back Propagation      Unstructured Grid     Deep Learning
bfs         Breadth-First Search  Graph Traversal       Graph Theory
cfd         CFD Solver            Unstructured Grid     Fluid Dynamics
gaussian    Gaussian Elimination  Dense Linear Algebra  Linear Algebra
hotspot     Hotspot Simulation    Structured Grid       Physics
lud         LU Decomposition      Dense Linear Algebra  Linear Algebra
nn          K-Nearest Neighbors   Dense Linear Algebra  Data Mining
nw          Needleman-Wunsch      Dynamic Programming   Bioinformatics
pathfinder  Path Finder           Dynamic Programming   Grid Traversal

Vulkan applications are linked against a common library referred to as the loader, which gets initialized at the time of VkInstance creation. The loader loads any enabled tooling layers and initializes the low-level driver provided by the GPU vendor. Accordingly, the example program depicted in Listing 1 starts initializing Vulkan by creating a VkInstance and querying the system for any available devices with all their properties, including all available queue families.

Then a logical VkDevice is created and a queue is acquired. The next step is to create storage buffers for the vectors. VkBuffer objects are created, the system is queried for suitable heaps according to the buffer memory requirements, then memory is allocated on that heap and buffers are bound to their allocated memory. Next, a compute VkPipeline is created by specifying the kernel’s SPIR-V binary as its shader stage and creating a VkPipelineLayout describing all the resources used by that kernel. Then, the buffers are bound to the pipeline by specifying the kernel’s binding value of each buffer as the destination binding of the write descriptor set. This is similar to specifying the kernel arguments in OpenCL using clSetKernelArg. Now that the compute pipeline is set up, the kernel can be launched by creating a VkCommandBuffer, binding the pipeline to that command buffer and recording the dispatch command with the number of workgroups to be launched. The command buffer is then submitted to the acquired queue for execution. Finally, the application waits for execution to finish, then cleans up and frees all used resources and objects.
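The step of querying for a suitable heap is hidden behind the findMemType helper in Listing 1, whose body is not shown. One plausible shape for such a helper is sketched below. It is an assumption, not the authors' implementation: the memory types are passed in as a plain vector of property flags instead of being read from the physical device, which is why this version takes an extra parameter. The property bit values mirror Vulkan's VkMemoryPropertyFlagBits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bit values mirror VkMemoryPropertyFlagBits.
constexpr std::uint32_t kDeviceLocalBit  = 0x1;
constexpr std::uint32_t kHostVisibleBit  = 0x2;
constexpr std::uint32_t kHostCoherentBit = 0x4;

// memoryTypes[i] holds the property flags of memory type i.
// memoryTypeBits has bit i set if the buffer may live in type i.
// Returns the first allowed type with all required properties, or -1.
int findMemType(const std::vector<std::uint32_t>& memoryTypes,
                std::uint32_t memoryTypeBits,
                std::uint32_t required) {
    assert(memoryTypes.size() <= 32);  // one bit per memory type
    for (std::size_t i = 0; i < memoryTypes.size(); ++i) {
        const bool allowed = ((memoryTypeBits >> i) & 1u) != 0;
        const bool suitable = (memoryTypes[i] & required) == required;
        if (allowed && suitable) return static_cast<int>(i);
    }
    return -1;
}
```

In an application the flags would come from vkGetPhysicalDeviceMemoryProperties and the bitmask from the memoryTypeBits field returned by vkGetBufferMemoryRequirements, as in the listing.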

B. VComputeBench Benchmarks

The Rodinia suite includes both CUDA and OpenCL versions

for each of its benchmarks. While developing their Vulkan

equivalents, we made sure not to introduce any algorithmic

changes to the kernel codes. In this way, we will be able to

make fair comparisons in the sense that any differences in

performance can be related to the programming model and

not to the algorithm. By using the latest Rodinia version 3.1,

we assume that we are already starting from a decent baseline

since these benchmarks were optimized many times in several

research works [28] [29].

The kernels were developed in GLSL and their corresponding SPIR-V binaries were automatically generated using the glslangValidator compiler [30] provided by Khronos. We have chosen GLSL as our kernel language because it has the best support. We provide both the SPIR-V binaries and the GLSL
