This version is available at https://doi.org/10.14279/depositonce-7346
© 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all
other uses, in any current or future media, including reprinting/republishing this material for advertising or
promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse
of any copyrighted component of this work in other works.
Terms of Use
Nadjib Mammeri, Ben Juurlink (2018): VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile
and Embedded GPUs. In: 2018 IEEE International Symposium on Workload Characterization
Accepted manuscript (Postprint), conference paper

VComputeBench: A Vulkan Benchmark Suite for
GPGPU on Mobile and Embedded GPUs
Nadjib Mammeri
Technische Universität Berlin
[email protected]
Ben Juurlink
Technische Universität Berlin
b [email protected]
Abstract: GPUs have become immensely important computational units on embedded and mobile
devices. However, GPGPU developers are often not able to exploit the compute power offered
by GPUs on these devices, mainly due to the lack of support for traditional programming
models such as CUDA and OpenCL. The recent introduction of the Vulkan API provides a new
programming model that could be explored for GPGPU computing on these devices, as it
supports compute and promises to be portable across different architectures.
In this paper we propose VComputeBench, a set of benchmarks that help developers understand
the differences in performance and portability of Vulkan. We also evaluate the suitability
of Vulkan as an emerging cross-platform GPGPU framework by conducting a thorough analysis
of its performance compared to CUDA and OpenCL on mobile as well as on desktop platforms.
Our experiments show that Vulkan provides better platform support on mobile devices and can
be regarded as a good cross-platform GPGPU framework. It offers comparable performance, and
with some low-level optimizations it can offer average speedups of 1.53x and 1.66x compared
to CUDA and OpenCL respectively on desktop platforms and a 1.59x average speedup compared
to OpenCL on mobile platforms. However, while Vulkan's low-level control can enhance
performance, it requires a significantly higher programming effort.
Index Terms: VComputeBench, Vulkan, SPIR-V, GPGPU, CUDA, OpenCL, Rodinia, Mobile
I. INTRODUCTION
Graphics Processing Units (GPUs) have become a dominant platform for parallel computing
thanks to their massively parallel architecture, energy efficiency and availability to the
masses. Several programming models have emerged enabling developers to harness the massive
compute power offered by GPUs, while exploiting parallelism for different application
domains. This is often referred to as GPGPU (General Purpose computing on the GPU) [1]. The
most popular GPGPU programming models are CUDA [2] and OpenCL [3]. CUDA is a proprietary
standard introduced by NVIDIA and targets only NVIDIA-specific hardware, while OpenCL is an
open standard maintained by the Khronos Group and targets additional hardware devices
including FPGAs, CPUs and DSPs. In this work we focus on the two predominant programming
models, CUDA and OpenCL, but it is worth mentioning other frameworks such as OpenMP [4] and
OpenACC [5]. OpenMP mainly targets shared memory multiprocessors, and OpenMP 4.5 recently
introduced the target directive, enabling support for GPUs and other devices. OpenACC is
mainly designed to program accelerators in heterogeneous systems with OpenMP-like
directives.
To add to this mix of programming models, the Khronos Group recently released the Vulkan
API [6] along with SPIR-V [7]. Vulkan is a low-level API with an abstraction closer to the
behavior of the actual hardware. It promises cross-platform support, high efficiency and
better performance of GPU applications. Unlike CUDA, which is only supported on NVIDIA
GPUs, and OpenCL, which has no official support on mobile GPUs, Vulkan is supported by all
major GPU vendors¹ and considers non-desktop GPUs as first-class citizens. Vulkan is
officially supported on Android 7.0 [8] and on the new Tizen OS 3.0 [9], covering a full
spectrum of mobile devices from phones and wearables to TVs and in-vehicle infotainment
systems. This good platform support, and the fact that it also supports compute, motivated
us to examine it from the GPGPU perspective even though it was mainly designed to improve
graphics performance. In this paper, we introduce Vulkan as a cross-platform GPGPU route
that could open new perspectives for pertinent GPGPU computing on mobile devices and can be
explored along with other more established frameworks on desktop architectures. However,
there are some important questions yet to be answered:
• What kind of performance can we get out of Vulkan?
• Is there a viable study comparing Vulkan compute to established frameworks such as CUDA
  and OpenCL?
• If there are any performance gains, are these portable across different GPU
  architectures?
• Can Vulkan enable pertinent GPGPU computing on mobile and embedded GPUs?
Selecting which GPGPU framework to use is a critical task for developers. Differences in
performance, portability, programmability and platform support are all very important
factors that need to be considered. Benchmarks play an important role in exposing these
kinds of differences between hardware architectures, compilers and, more importantly,
competing programming models. There are several benchmarks available to evaluate CUDA and
OpenCL [10], [11], [12], [13] but currently none for Vulkan. To fill this gap and enable
our study we propose VComputeBench, a set of Vulkan compute benchmarks that help developers
understand the differences in performance and portability of Vulkan and provide guidance to
GPU architects in the design and optimization of their drivers and runtime. VComputeBench
was developed by extending the popular Rodinia benchmark suite [10], covering a diverse
range of application domains with different computation patterns. The reason for selecting
the Rodinia suite is that it provides OpenCL and CUDA implementations, so that with our
VComputeBench implementations we can make fair comparisons and adequately evaluate Vulkan
against other programming models.

¹ Supported by the major desktop GPU vendors (AMD, NVIDIA and Intel) and mobile GPU vendors
(Qualcomm, ARM, Imagination and VeriSilicon).

In essence, the main contributions of this paper are:
• Illustrate the viability of Vulkan as a GPGPU framework, notably on mobile devices.
• Propose a set of Vulkan compute benchmarks named VComputeBench and port them onto mobile
  platforms.
• Perform a thorough analysis of performance, comparing Vulkan to CUDA and OpenCL on
  desktop and mobile GPUs, and highlight a set of Vulkan-specific optimization techniques.
II. RELATED WORK
In recent years, GPGPU frameworks have received a great amount of attention from the
research community. Although several works studied and compared different programming
models [14] [15] [16] [17] [18] [19] [20] [21], none of them studied Vulkan. To the best of
our knowledge, our work is the first to investigate Vulkan from the compute rather than the
graphics perspective and propose it as a viable cross-platform GPGPU programming model.
Among the earliest and most cited works are those of Fang et al. [15] and Karimi et al.
[14]. The authors compare CUDA to OpenCL in terms of performance on old desktop GPU
architectures. Our work, on the other hand, was carried out on recent architectures and
analyses performance on desktop as well as mobile GPUs. Du et al. [17] study OpenCL
performance portability and Wang et al. [21] examine OpenCL on FPGAs. The authors of these
papers demonstrate that performance is not necessarily portable across architectures. Their
findings prompted us to port our benchmarks onto mobile GPUs in order to evaluate Vulkan's
portability and examine its performance implications.
Such research works heavily rely on benchmarks for their evaluations. Several GPGPU
benchmarks were proposed by researchers, such as Rodinia [10], Parboil [11], SHOC [12] and
the recent Hetero-Mark [13]. Most of these benchmark suites include CUDA, OpenCL or OpenMP
implementations but none include Vulkan implementations. This can be a limitation,
especially for researchers and developers wanting to target this new emerging programming
model. In this work, we aim to enrich the GPGPU community with such Vulkan benchmarks by
extending the popular Rodinia suite, enabling researchers and developers to evaluate Vulkan
along with other GPGPU programming models. Likewise, most of these benchmark suites mainly
target desktop GPUs or multicore systems with their CUDA and OpenCL implementations. Our
benchmarks, on the other hand, target both mobile and desktop GPUs. We chose Vulkan because
of its cross-platform capabilities and good support on mobile devices.
III. VULKAN: A COMPUTE PERSPECTIVE
In this section we present an overview of the Vulkan programming model, illustrating why it
is a promising GPGPU framework, especially for mobile and embedded GPUs.
A. Vulkan Overview
Vulkan is often referred to as the next-generation graphics and compute API for modern
GPUs. It is an open standard that aims to address the inefficiencies of traditional APIs
such as OpenGL, which were designed for single-core processors and fail to map well to
modern hardware [22]. Vulkan, on the other hand, was designed from the ground up with
multi-threading support in mind. Better parallelization can be achieved by asynchronously
generating work across multiple threads, feeding the GPU in an efficient manner. This is
attained in Vulkan by having no global state, no synchronization in the driver, and by
separating work generation from work submission. All state is localized in command buffers,
which can be generated on multiple threads and only start executing on the GPU after
submission.
The other key characteristic of Vulkan is that it provides much lower-level, fine-grained
control over the GPU, enabling developers to maximize performance across many platforms. It
achieves this by being explicit in nature rather than relying on hidden heuristics in the
driver. Operations such as resource tracking, synchronization, memory allocation and work
submission are all pushed into application space, resulting in higher predictability and
better control of when and where work happens. Likewise, unnecessary background tasks such
as error checking, hazard tracking, state validation and shader compilation are delegated
to the tooling layers, which are present during development and removed at runtime,
resulting in low driver overhead and less CPU usage [23].
B. The Programming Model
Vulkan can be viewed as a pipeline with some programmable stages that are invoked by a set
of operations. To the programmer, it is simply an API with a set of routines allowing for
the specification of shaders or kernels, state-controlling aspects, as well as the data
used by those kernels. From the compute perspective though, the pipeline has only one
programmable stage, represented by the kernel program to be executed [6].
a) Execution Model: A Vulkan-capable system exposes one or more devices, and each of these
physical devices exposes one or more queues. These queues are partitioned into queue
families and can process work asynchronously to one another. Each queue family supports a
number of functionalities and may contain multiple queues with similar characteristics.
There are four types of queue functionality defined in Vulkan: graphics, compute, transfer,
and sparse memory management. The reason for having queue families is that queues within a
single family are considered compatible with one another, and work produced for one queue
family can be executed on any queue within that family.
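To make the queue-family notion concrete, the following is a minimal sketch of how an application might pick a family for compute work. It deliberately avoids the Vulkan headers: `QueueFamily` is a hypothetical stand-in for the `queueFlags` and `queueCount` fields of `VkQueueFamilyProperties`, the bit constants mirror the values of `VkQueueFlagBits`, and the preference for a compute-only family is an illustrative policy, not something the paper prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the two VkQueueFamilyProperties fields used here.
struct QueueFamily {
    std::uint32_t queueFlags;  // bitmask of supported functionalities
    std::uint32_t queueCount;  // number of queues exposed by this family
};

// Same numeric values as VkQueueFlagBits in the Vulkan headers.
constexpr std::uint32_t kGraphicsBit = 0x1;
constexpr std::uint32_t kComputeBit  = 0x2;
constexpr std::uint32_t kTransferBit = 0x4;

// Returns the index of a family that supports compute, or -1 if none does.
// Assumed policy: prefer a family without graphics support, otherwise fall
// back to the first family that advertises compute.
int pickComputeFamily(const std::vector<QueueFamily>& families) {
    int fallback = -1;
    for (std::size_t i = 0; i < families.size(); ++i) {
        const QueueFamily& f = families[i];
        if (f.queueCount == 0 || (f.queueFlags & kComputeBit) == 0) continue;
        if ((f.queueFlags & kGraphicsBit) == 0) return static_cast<int>(i);
        if (fallback < 0) fallback = static_cast<int>(i);
    }
    return fallback;
}
```

In a real application the returned index would be the one passed as the queue family index when creating the device and retrieving the queue.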
A queue is considered the interface between the application and the execution engines of a
device. Commands for these execution engines are recorded into command buffers ahead of
execution time. Once recorded, a command buffer can be cached and submitted to a queue for
execution as many times as required. Command buffer construction is expensive, and the
application may employ multiple threads to construct multiple command buffers in parallel.
These command buffers are then submitted to queues for execution in a number of batches.
Once submitted to a queue, the commands within a command buffer begin and complete
execution without further application intervention. The order in which these commands are
executed depends on a number of implicit and explicit ordering constraints.
In addition, command buffers submitted to different queues may execute in parallel or even
out of order with respect to one another. Command buffers submitted to a single queue,
though, respect submission order. Host execution is also asynchronous to command buffer
execution on the device. Control may return to the application as soon as the command
buffer is submitted, and the application must take responsibility for any synchronization
between different queues as well as between the device and the host.
b) Compute Model: In Vulkan, compute workloads are initiated by recording dispatch commands
(vkCmdDispatch*) in a command buffer. Once a command buffer is submitted to a queue,
execution starts according to the currently bound compute pipeline. Compute pipelines
consist of a single compute shader stage, describing the kernel to be executed, and a
pipeline layout, describing the input and output resources of that kernel. Dispatch
commands take three input parameters, groupCountX, groupCountY and groupCountZ, defining
the total number of workgroups, the so-called global workgroup size, in the X, Y and Z
directions respectively. A workgroup is the smallest amount of compute operations that an
application can execute. Within a single workgroup, there may be many workitems or compute
shader invocations. This is called the local workgroup size and is defined by the compute
shader itself using SPIR-V built-in decorations [7].
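Because the dispatch parameters count workgroups rather than workitems, the host has to convert a problem size into a group count. A minimal sketch of that arithmetic follows; the helper name `groupCountFor` is hypothetical, and it assumes that the shader guards against indices beyond the problem size when the last workgroup is only partially filled.

```cpp
#include <cassert>
#include <cstdint>

// Number of workgroups needed to cover `items` workitems when each workgroup
// contains `localSize` invocations. Uses ceiling division so that a trailing
// partial workgroup is still launched.
std::uint32_t groupCountFor(std::uint64_t items, std::uint32_t localSize) {
    assert(localSize > 0);
    return static_cast<std::uint32_t>((items + localSize - 1) / localSize);
}
```

For one million elements and a local size of 256, this yields 3907 workgroups, whereas a plain integer division would yield 3906 and leave the last 64 elements unprocessed.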
c) SPIR-V: All shaders and compute kernels in Vulkan are defined using the Standard
Portable Intermediate Representation (SPIR-V), which is a platform-independent intermediate
language for describing graphical shaders and compute kernels [24]. SPIR-V is a
self-contained binary format. Logically, it is a header followed by a linear stream of
instructions; physically, it is just a stream of 32-bit words encoding a collection of
annotations and decorations as well as functions, which in turn encode control-flow graphs
(CFGs) of blocks. Variables are accessed using load/store instructions, and any
intermediate results bypassing load/store are represented in static single assignment (SSA)
form. Hierarchical type information of data objects is preserved so as not to lose
information needed for further optimizations on the target device.
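The "header plus stream of 32-bit words" layout can be illustrated with a small reader. This is a sketch, not a validator: the struct and function names are hypothetical, while the layout it relies on (magic number 0x07230203, a five-word header, and the word count stored in the high 16 bits of each instruction's first word) follows the SPIR-V specification. Byte-swapped modules from a host of different endianness are simply rejected here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint32_t kSpirvMagic = 0x07230203u;
constexpr std::size_t kHeaderWords = 5;  // magic, version, generator, bound, schema

// Hypothetical holder for the header fields of interest.
struct SpirvHeader {
    bool valid;
    unsigned versionMajor;
    unsigned versionMinor;
    std::uint32_t generator;
    std::uint32_t idBound;
};

SpirvHeader parseSpirvHeader(const std::vector<std::uint32_t>& words) {
    SpirvHeader h = {false, 0, 0, 0, 0};
    if (words.size() < kHeaderWords || words[0] != kSpirvMagic) return h;
    h.valid = true;
    h.versionMajor = (words[1] >> 16) & 0xFFu;  // version word is 0|major|minor|0
    h.versionMinor = (words[1] >> 8) & 0xFFu;
    h.generator = words[2];
    h.idBound = words[3];
    return h;
}

// Walks the instruction stream and counts instructions. Stops early if an
// instruction claims a length of zero or runs past the end of the module.
std::size_t countInstructions(const std::vector<std::uint32_t>& words) {
    std::size_t count = 0;
    std::size_t pos = kHeaderWords;
    while (pos < words.size()) {
        const std::size_t wordCount = words[pos] >> 16;
        if (wordCount == 0 || pos + wordCount > words.size()) break;
        pos += wordCount;
        ++count;
    }
    return count;
}
```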
C. Why Vulkan for Mobile and Embedded GPUs?
Considering that Vulkan was mainly designed to achieve higher graphics performance, we can
make several interesting observations: (i) Its enhancements and low-level nature can also
be utilized to achieve higher performance for GPGPU applications. (ii) Vulkan's main focus
on graphics allowed it to gain better support among GPU vendors than other open frameworks
such as OpenCL, which for instance is not fully supported by NVIDIA because it is
considered a competitor to its proprietary CUDA framework². (iii) Vulkan is considered the
first framework to have official support on mobile platforms [8] [9], and the API was
designed with mobile GPU features such as tiled rendering in mind. Hence, it has the
potential of being the framework of choice for GPGPU on mobile devices, which is the quest
of many recent research works [25] [26] [27]. This leads us to our final observation: (iv)
Vulkan can be the appropriate framework for achieving true cross-platform GPGPU without
sacrificing performance.
IV. BENCHMARKS
Benchmarks play an important role in exposing differences in performance, portability and
programmability across competing programming models. Since Vulkan was recently released and
its main focus is on graphics rather than GPGPU, there are currently a few graphics
benchmarks but no compute benchmarks that can be of use to our study. In order to enable
our work as well as to enrich the research community with such benchmarks, we extended the
popular Rodinia benchmark suite [10] by developing Vulkan equivalents of most of its
workloads, referred to as VComputeBench, and made them publicly available to the wider
GPGPU community.
Before describing our VComputeBench benchmarks, we first present one of the microbenchmarks
that we used in our study to better illustrate this new programming model and give an
overview of what is required to write a Vulkan compute application.
A. Vector Addition Microbenchmark
This microbenchmark is a simple application adding two vectors X and Y of size n and saving
the output in vector Z. The kernel code, or the compute shader in Vulkan terminology, is a
SPIR-V binary that was compiled offline from a 10-line GLSL source implementing:
Z[i] = X[i] + Y[i]  ∀ i ∈ {0, 1, ..., n − 1}
The index space is one-dimensional and i is defined using the SPIR-V decoration
GlobalInvocationId, which returns the global ID of the workitem executing the kernel. The
vectors X, Y and Z are bound to the kernel as storage buffers.
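The role of the global invocation ID can be illustrated on the CPU. The sketch below is not the paper's GLSL kernel; it emulates a one-dimensional dispatch in plain C++, where every (workgroup, local index) pair plays the part of one shader invocation. The local size of 256 and the bounds guard are assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kLocalSize = 256;  // assumed local workgroup size

// CPU emulation of a 1-D compute dispatch for Z[i] = X[i] + Y[i].
std::vector<float> vectorAddEmulated(const std::vector<float>& x,
                                     const std::vector<float>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();
    std::vector<float> z(n, 0.0f);
    const std::size_t groups = (n + kLocalSize - 1) / kLocalSize;
    for (std::size_t group = 0; group < groups; ++group) {
        for (std::size_t local = 0; local < kLocalSize; ++local) {
            // Equivalent of the global invocation ID in the X dimension.
            const std::size_t i = group * kLocalSize + local;
            if (i < n) {  // guard the partially filled last workgroup
                z[i] = x[i] + y[i];
            }
        }
    }
    return z;
}
```

On a GPU the two loops disappear: the hardware schedules the invocations, and each one computes its own index from the built-in ID.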
The host code, on the other hand, is more complicated. Listing 1 shows a pseudo-code
listing of the host program highlighting only the important API calls.

² The current OpenCL version is 2.2, but NVIDIA only supports version 1.2.

int main() {
  std::size_t N = 1000000;  // Number of elements in a vector
  // Workgroup size is 256; round up so the last, partially filled workgroup
  // is also dispatched (the shader must then skip indices >= N)
  int numWorkGroups = (N + 255) / 256;

  // Create instance, enumerate devices, then create device and queue
  VkInstance instance; VkInstanceCreateInfo instanceInfo = {}; ...
  vkCreateInstance(&instanceInfo, nullptr, &instance);
  vkEnumeratePhysicalDevices(instance, ..., &gpuList);
  vkGetPhysicalDeviceQueueFamilyProperties(gpuList[0], ...);
  ...
  VkDeviceQueueCreateInfo queueCreateInfo{}; ...
  VkDevice device; VkDeviceCreateInfo deviceInfo = {}; ...
  vkCreateDevice(gpuList[0], &deviceInfo, ..., &device);
  VkQueue computeQueue;
  vkGetDeviceQueue(device, queueFamilyIndex, 0, &computeQueue);
  ...

  // Create buffer then bind the buffer to the allocated memory
  VkBuffer bufferX; VkBufferCreateInfo bufferCreateInfo{}; ...
  bufferCreateInfo.size = N * sizeof(float);
  bufferCreateInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
                           VK_BUFFER_USAGE_TRANSFER_DST_BIT;
  vkCreateBuffer(device, &bufferCreateInfo, nullptr, &bufferX);
  VkMemoryRequirements xBuffMemReqs;
  vkGetBufferMemoryRequirements(device, bufferX, &xBuffMemReqs);
  int xMemIndex = findMemType(xBuffMemReqs.memoryTypeBits,
                              VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
  VkDeviceMemory memory; VkMemoryAllocateInfo memAllocInfo{}; ...
  memAllocInfo.allocationSize = xBuffMemReqs.size;
  memAllocInfo.memoryTypeIndex = xMemIndex;
  vkAllocateMemory(device, &memAllocInfo, nullptr, &memory);
  vkBindBufferMemory(device, bufferX, memory, 0);
  ...

  // Create the compute shader and the compute pipeline
  VkShaderModule module; VkShaderModuleCreateInfo shadCreatInfo{}; ...
  shadCreatInfo.pCode = readSpirvBinary("vectorAdd.spv");
  vkCreateShaderModule(device, &shadCreatInfo, NULL, &module);
  VkPipelineShaderStageCreateInfo shaderStageCreateInfo{}; ...
  shaderStageCreateInfo.module = module;
  shaderStageCreateInfo.stage = VK_SHADER_STAGE_COMPUTE_BIT;
  VkPipelineLayout pipelineLayout;
  ...
  vkCreatePipelineLayout(device, ..., &pipelineLayout);
  VkPipeline ppline; VkComputePipelineCreateInfo ppCreateInfo{}; ...
  ppCreateInfo.stage = shaderStageCreateInfo;
  ppCreateInfo.layout = pipelineLayout;
  vkCreateComputePipelines(device, &ppCreateInfo, &ppline ...);
  ...

  // Bind buffers to compute pipeline
  VkWriteDescriptorSet writeDescripSet{}; ...
  writeDescripSet.descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
  writeDescripSet.dstBinding = 0;  // Same as SPIR-V Binding decoration
  writeDescripSet.pBufferInfo = xBufferDescriptor;
  vkUpdateDescriptorSets(device, 1, &writeDescripSet, 0, NULL);
  ...

  // Create command pool and allocate a command buffer
  VkCommandPool cmdPool; VkCommandPoolCreateInfo cmdPoolInfo{}; ...
  vkCreateCommandPool(device, &cmdPoolInfo, nullptr, &cmdPool);
  VkCommandBuffer cmdBuffer; VkCommandBufferAllocateInfo allcInfo{}; ...
  allcInfo.commandPool = cmdPool;
  vkAllocateCommandBuffers(device, &allcInfo, &cmdBuffer);
  ...

  // Bind the pipeline and record commands to the command buffer
  vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, ppline);
  vkCmdDispatch(cmdBuffer, numWorkGroups, 1, 1);
  vkEndCommandBuffer(cmdBuffer);
  ...

  // Submit to queue
  VkSubmitInfo submitInfo{VK_STRUCTURE_TYPE_SUBMIT_INFO};
  submitInfo.commandBufferCount = 1;
  submitInfo.pCommandBuffers = &cmdBuffer;
  vkQueueSubmit(computeQueue, 1, &submitInfo ...);
  ...  // Clean up and free all resources
}
Listing 1: VectorAdd host code using the low-level Vulkan API
TABLE I: VComputeBench benchmarks
Name       | Application          | Dwarf                | Domain
backprop   | Back Propagation     | Unstructured Grid    | Deep Learning
bfs        | Breadth-First Search | Graph Traversal      | Graph Theory
cfd        | CFD Solver           | Unstructured Grid    | Fluid Dynamics
gaussian   | Gaussian Elimination | Dense Linear Algebra | Linear Algebra
hotspot    | Hotspot Simulation   | Structured Grid      | Physics
lud        | LU Decomposition     | Dense Linear Algebra | Linear Algebra
nn         | K-Nearest Neighbors  | Dense Linear Algebra | Data Mining
nw         | Needleman-Wunsch     | Dynamic Programming  | Bioinformatics
pathfinder | Path Finder          | Dynamic Programming  | Grid Traversal
Vulkan applications are linked against a common library referred to as the loader, which
gets initialized at the time of VkInstance creation. The loader loads any enabled tooling
layers and initializes the low-level driver provided by the GPU vendor. Accordingly, the
example program depicted in Listing 1 starts initializing Vulkan by creating a VkInstance
and querying the system for any available devices with all their properties, including all
available queue families.
Then a logical VkDevice is created and a queue is acquired. The next step is to create
storage buffers for the vectors. VkBuffer objects are created, the system is queried for
suitable heaps according to the buffer memory requirements, then memory is allocated on
that heap and the buffers are bound to their allocated memory. Next, a compute VkPipeline
is created by specifying the kernel's SPIR-V binary as its shader stage and creating a
VkPipelineLayout describing all the resources used by that kernel. Then, the buffers are
bound to the pipeline by specifying the kernel's binding value of each buffer as the
destination binding of the write descriptor set. This is similar to specifying the kernel
arguments in OpenCL using clSetKernelArg. Now that the compute pipeline is set up, the
kernel can be launched by creating a VkCommandBuffer, binding the pipeline to that command
buffer and recording the dispatch command with the number of workgroups to be launched. The
command buffer is then submitted to the acquired queue for execution. Finally, the
application waits for execution to finish, then cleans up and frees all used resources and
objects.
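The step of querying the system for a suitable heap is commonly implemented as a scan over the device's memory types. The following is a minimal sketch of such a helper. It does not include the Vulkan headers: `MemoryType` and `MemoryProperties` are hypothetical stand-ins mirroring the relevant fields of `VkMemoryType` and `VkPhysicalDeviceMemoryProperties`, the flag constants mirror `VkMemoryPropertyFlagBits`, and the three-argument signature is an assumption rather than the paper's actual `findMemType`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the Vulkan structures consulted by the search.
struct MemoryType {
    std::uint32_t propertyFlags;
    std::uint32_t heapIndex;
};
struct MemoryProperties {
    std::uint32_t memoryTypeCount;
    MemoryType memoryTypes[32];  // 32 matches VK_MAX_MEMORY_TYPES
};

// Same numeric values as VkMemoryPropertyFlagBits.
constexpr std::uint32_t kDeviceLocalBit  = 0x1;
constexpr std::uint32_t kHostVisibleBit  = 0x2;
constexpr std::uint32_t kHostCoherentBit = 0x4;

// Returns the first memory type index that is allowed by `memoryTypeBits`
// (taken from the buffer's memory requirements) and offers every flag in
// `required`, or -1 if no such type exists.
int findMemType(const MemoryProperties& props, std::uint32_t memoryTypeBits,
                std::uint32_t required) {
    for (std::uint32_t i = 0; i < props.memoryTypeCount; ++i) {
        const bool allowed = (memoryTypeBits & (1u << i)) != 0;
        const bool hasFlags =
            (props.memoryTypes[i].propertyFlags & required) == required;
        if (allowed && hasFlags) return static_cast<int>(i);
    }
    return -1;
}
```

The returned index is what would be stored in the allocation info's memory type index before allocating device memory.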
B. VComputeBench Benchmarks
The Rodinia suite includes both CUDA and OpenCL versions of each of its benchmarks. While
developing their Vulkan equivalents, we made sure not to introduce any algorithmic changes
to the kernel codes. In this way, we are able to make fair comparisons in the sense that
any differences in performance can be attributed to the programming model and not to the
algorithm. By using the latest Rodinia version 3.1, we assume that we are already starting
from a decent baseline, since these benchmarks were optimized many times in several
research works [28] [29].
The kernels were developed in GLSL and their corresponding SPIR-V binaries were
automatically generated using the glslangValidator compiler [30] provided by Khronos. We
chose GLSL as our kernel language because it has the best support. We provide both the
SPIR-V binaries and the GLSL sources as part of our VComputeBench benchmarks. The host code
translation, on the other hand, was challenging because the Rodinia source code was
collected from different sources, resulting in hard-to-read code with different styles,
very few comments and hardly any documentation. We made sure this is not the case with our
benchmarks, which we implemented using C++11 features with a unified style and appropriate
comments. As far as functional testing is concerned, we validated our VComputeBench
benchmarks against both CUDA and OpenCL outputs for different input sets.
Our VComputeBench benchmarks cover a diverse range of application domains with different
computation patterns. The benchmarks were selected so that they also cover different sets
of dwarves [31]. Table I shows a list of the developed benchmarks including their dwarf and
application domains. Here, we only include brief descriptions of these benchmarks; full
descriptions and characterizations of these workloads can be found in [10]:
Back Propagation (bp) is an algorithm that is commonly used in training deep neural
networks to adjust the network's weights. It is composed of two phases: a forward pass,
where the activations are propagated from the input to the output layer, and a backward
pass, where the error is propagated backwards from the output to the input layer to adjust
the weights and bias values.
Breadth-First Search (bfs) is a graph algorithm that traverses or searches a graph of
connected nodes, which could include millions of nodes. It starts at a root node and
explores neighboring nodes first, before moving to the next-level neighbors.
Computational Fluid Dynamics (cfd) is a fluid dynamics solver of three-dimensional Euler
equations representing an unstructured-grid, finite-volume model of compressible flow.
Gaussian Elimination (gaussian) is a linear algebra algorithm for solving a set of linear
equations. It works by performing a sequence of row reduction operations on a matrix until
the lower left-hand corner of the matrix is filled with zeros, as much as possible.
Hotspot Simulation (hotspot) is a thermal simulation tool that estimates processor
temperature based on an architectural floor plan and simulated power measurements.
LU Decomposition (lud) is a linear algebra algorithm used to calculate the solution of a
set of linear equations. It works by decomposing a matrix into a product of a lower
triangular matrix and an upper triangular matrix.
K-Nearest Neighbors (nn) is a dense linear algebra algorithm used to find the K closest
neighbors of a query point q in a set of reference data points in an n-dimensional space.
The data in our case is latitude and longitude data and the calculated distances are
Euclidean distances.
Needleman-Wunsch (nw) is a dynamic programming algorithm that is used for DNA sequence
alignment. The algorithm fills a matrix of potential pairs of DNA sequences with scores,
each representing the value of the maximum weighted path ending at that cell. Then a
trace-back process is used to search for an optimal alignment.
Pathfinder (pfinder) is another dynamic programming algorithm that computes the path on a
2-dimensional grid with the smallest total cost. The grid is represented as a matrix, and
the path is computed in blocks of rows.
C. Vulkan-specific optimizations
As shown in the example code in Listing 1, Vulkan uses completely different abstractions
from CUDA and OpenCL. Effectively, in Vulkan, the programmer is not dealing with kernels,
kernel arguments and kernel launches but with low-level command buffers, recording commands
in these buffers such as binding compute pipelines, setting descriptor sets and binding
buffers to descriptor sets. One of the key synchronization mechanisms of Vulkan that we
used when writing our benchmarks, and that produced performance improvements as shown in
Section V-A2, is memory barriers. A memory barrier command can be recorded in a command
buffer, ensuring that commands recorded prior to it are executed before the commands
recorded after it. This allowed us to reduce the kernel launch overhead compared to the
CUDA and OpenCL implementations, resulting in better performance as shown in Sections V-A2
and V-B2.
Most of our benchmarks use iterative algorithms. The CUDA and OpenCL implementations launch
the kernel separately for every iteration, whereas in our Vulkan implementations we record
the work of all iterations in one command buffer and synchronize using memory barriers
between iterations, instead of naively creating a command buffer for every iteration.
Effectively, we incur only a single communication overhead when the command buffer is
submitted, compared to the CUDA and OpenCL implementations, which incur kernel launch
overheads on every iteration.
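The structure of such a single command buffer can be sketched without the Vulkan API. The recorder below is purely hypothetical: it only stores the order of commands, with `Dispatch` standing for a dispatch command and `Barrier` for a memory barrier that a real implementation would likely record with `vkCmdPipelineBarrier`. It is meant to show the shape of the recorded stream and the difference in submission counts, not the paper's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical command kinds; a real VkCommandBuffer is opaque.
enum class Cmd { BindPipeline, Dispatch, Barrier };

// Records all iterations of an iterative kernel into one command stream,
// separating consecutive iterations with a barrier so that iteration k+1
// observes the data written by iteration k.
std::vector<Cmd> recordIterations(std::size_t iterations) {
    std::vector<Cmd> buffer;
    if (iterations == 0) return buffer;
    buffer.push_back(Cmd::BindPipeline);
    for (std::size_t it = 0; it < iterations; ++it) {
        buffer.push_back(Cmd::Dispatch);
        if (it + 1 < iterations) buffer.push_back(Cmd::Barrier);
    }
    return buffer;
}

// Host-to-device submissions for the single-buffer approach.
std::size_t submissionsSingleBuffer(std::size_t iterations) {
    return iterations > 0 ? 1 : 0;
}

// Host-side launches when every iteration is launched separately.
std::size_t launchesPerIteration(std::size_t iterations) {
    return iterations;
}
```

For four iterations the recorded stream is one pipeline bind, four dispatches and three barriers, submitted once, whereas the per-iteration approach crosses the host-device boundary four times.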
One can argue that the CUDA and OpenCL implementations can be changed to enqueue iterations
ahead of time without blocking. The problem with this solution is that it does not honor
the data dependencies between iterations. Subsequent iterations depend on the data
generated in previous iterations. Neither CUDA nor OpenCL offers any inter-workgroup
synchronization mechanism that can be used to honor these dependency requirements. This is
a well-known limitation of these programming models, and the safest portable solution to
achieve such synchronization is to use the so-called multi-kernel method. In this method
the application is split into multiple kernels. Whenever an inter-workgroup synchronization
is required, a transition from one kernel to another is made or, in the case of having only
one kernel, this kernel is launched again. The transfer of control from the GPU to the CPU
implicitly provides the required barrier semantics. The Rodinia CUDA and OpenCL
implementations use this method to achieve such inter-workgroup synchronization and satisfy
the data dependencies between iterations.
D. Porting to mobile devices
One of the major strengths of Vulkan is its portability. However, performance improvements
are not necessarily portable, and often developers have to adapt and rewrite their
applications with respect to the targeted architecture. In fact, it has been shown that
performance is not portable when running OpenCL applications targeting GPUs on CPU or
FPGA-like architectures [17] [21]. To address this concern and assess whether Vulkan is a
good candidate for GPGPU computing on mobile devices, we ported our benchmarks plus their
corresponding Rodinia OpenCL implementations onto mobile GPUs. We chose Android 7.0 as our
OS because it supports Vulkan out of the box, allowing us to target many mobile GPUs. We
cross-compiled all of our benchmarks for the x86, x86-64, armeabi-v7a and arm64-v8a binary
targets and developed an Android application that bundles these benchmarks with their
required data sets. When developing the VComputeBench Android application, we set a
requirement of not needing root access, so that it can be released on the Android
application store, allowing millions of users to check and compare the performance of the
GPUs and Vulkan implementations inside their devices. This was challenging, and we had to
resort to bundling the benchmarks as libraries in order to satisfy Android security
restrictions on binary executables.

TABLE II: Desktop GPUs Experimental Setup
                 | NVIDIA GTX1050Ti                          | AMD RX560
Operating System | Ubuntu 16.04 64-bit                       | Ubuntu 16.04 64-bit
CPU              | Intel(R) Core(TM) i5-2500K CPU 3.30GHz x4 | Intel(R) Core(TM) i5-2500K CPU 3.30GHz x4
Memory           | CPU Memory=16 GB, GPU Memory=4 GB         | CPU Memory=16 GB, GPU Memory=4 GB
Driver           | Linux Display Driver 381.22               | AMDGPU-Pro Driver 17.10
OpenCL           | OpenCL 1.2                                | OpenCL 2.0
CUDA             | CUDA 8.0                                  | -
Vulkan           | API Version 1.0.42                        | API Version 1.0.37
V. EXPERIMENTAL RESULTS
In this section we report the results of our empirical evaluation
of Vulkan performed on several GPU architectures. We
use two types of benchmarks: self-written micro-benchmarks
to highlight and assess specific attributes, and our VComputeBench
plus Rodinia benchmarks to assess performance using
representative real-world applications. We compare Vulkan
results to those of CUDA and OpenCL on two desktop GPUs
and two mobile GPU platforms. For consistency, we measure
the execution times on the CPU using C++11 std::chrono. To
minimize measurement errors, we execute each benchmark several
times and report the average of the obtained execution times.
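The measurement procedure described above can be sketched as follows. This is a minimal illustration rather than the benchmark suite's actual timing code: `average_ms` is a hypothetical helper, and the choice of `std::chrono::steady_clock` is an assumption, since the text only states that C++11 std::chrono is used.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `work` `runs` times and returns the mean
// wall-clock time per run in milliseconds, measured on the CPU.
double average_ms(const std::function<void()>& work, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;  // assumed clock; monotonic
    double total_ms = 0.0;
    for (int i = 0; i < runs; ++i) {
        const auto start = clock::now();
        work();  // e.g. submit the GPU work and block until it has finished
        const auto stop = clock::now();
        total_ms += std::chrono::duration<double, std::milli>(stop - start).count();
    }
    return total_ms / runs;
}
```

For GPU work, `work` has to wait for completion before returning; otherwise only the submission cost would be timed.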
A. Evaluations on Desktop Platforms
We chose two recent desktop GPUs employing the latest
GPU architectures: the NVIDIA GTX1050Ti, based on
NVIDIA's Pascal architecture, and the AMD RX560, based on
AMD's Polaris architecture. Table II shows the configuration
details of these platforms.
Fig. 1: Vulkan memory bandwidth vs CUDA and OpenCL. Panels: (a) NVIDIA GTX1050Ti (Vulkan, CUDA); (b) AMD RX560 (Vulkan, OpenCL). Axes: stride (4 bytes per element) vs bandwidth (GB/s).
1) Memory Bandwidth Evaluation: To evaluate how the programming
model affects memory bandwidth and assess whether
we can achieve high memory bandwidth when using Vulkan,
we developed a strided memory access micro-benchmark in
Vulkan, CUDA and OpenCL. We vary the stride when reading
array elements and measure the achieved bandwidth. For
reference, both of our platforms use GDDR5 memory with an
effective memory clock of 7 GHz and a 128-bit memory interface,
resulting in a theoretical bandwidth of 112 GB/s, which
can be calculated using:
$BW_{peak} = Freq \cdot (BusWidth / 8) \cdot 10^{-9}$
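As a worked check of the formula, the short function below evaluates it for the stated configuration, taking the effective memory clock in Hz and the bus width in bits so that the result is in GB/s. The function name is illustrative only.

```cpp
#include <cassert>

// Peak memory bandwidth in GB/s from the effective memory clock (Hz) and
// the memory interface width (bits): Freq * (BusWidth / 8) * 1e-9.
double peak_bandwidth_gbs(double effective_clock_hz, double bus_width_bits) {
    assert(effective_clock_hz > 0.0 && bus_width_bits > 0.0);
    const double bytes_per_transfer = bus_width_bits / 8.0;  // bits -> bytes
    return effective_clock_hz * bytes_per_transfer * 1e-9;   // B/s -> GB/s
}
```

With a 7 GHz effective clock and a 128-bit interface, `peak_bandwidth_gbs(7e9, 128)` gives 7e9 · 16 · 1e-9, which is the 112 GB/s quoted in the text (up to floating-point rounding).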
The obtained results are shown in Figure 1. On both platforms,
Vulkan provides performance comparable to CUDA and
OpenCL for strides smaller than 64 bytes and slightly better
performance for strides larger than 64 bytes. As expected,
unit stride provides the maximum achieved bandwidth: 84%
and 79.6% of the peak bandwidth for CUDA and Vulkan
respectively on the GTX1050Ti. Likewise, on the RX560,
Vulkan achieves 71.6% of the peak bandwidth compared to
71.5% for OpenCL. Overall, this test shows that high memory
bandwidth can be attained using Vulkan and that data layout in
memory is more important than the programming model used.
2) Benchmarks Evaluations: Figure 2 shows the speedup
results of the selected benchmarks comparing Vulkan, CUDA
and OpenCL for different workloads. We chose OpenCL as
our baseline for speedup calculations because it is supported
on both platforms. To make a fair comparison, we only report
kernel execution times, not total benchmark times, because
OpenCL generally exhibits a high overhead from JIT compilation
and explicit context management, resulting in longer total
times [32] [17].
Fig. 2: Vulkan speedup vs CUDA and OpenCL for the Rodinia benchmarks. Panels: (a) NVIDIA GTX1050Ti (OpenCL, Vulkan, CUDA); (b) AMD RX560 (OpenCL, Vulkan). Axes: benchmark and input size vs speedup (0.0 to 4.0). Input sizes: bfs 4K, 64K, 1M; backprop 4K, 64K, 256K; cfd 97K, 193K, 232K; gaussian 208, 1024, 2048; hotspot 512-08, 512-16, 512-32; lud 256, 512, 2048; nn 256K, 8M, 16M; nw 4K, 8K, 16K; pathfinder 10K, 50K, 100K.
Overall, for most benchmarks Vulkan provides better performance
than CUDA and OpenCL, resulting in geometric mean
speedups of 1.53x with respect to CUDA on the GTX1050Ti
and 1.26x with respect to OpenCL on the RX560. However,
since the benchmarks exhibit different computation patterns,
there are variations in their individual results.
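For readers unfamiliar with the summary statistic used here, the geometric mean of per-benchmark speedups can be computed as in the sketch below. The function name is illustrative, and the evaluation via logarithms is one common formulation, not necessarily the one used to produce the reported numbers.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive speedup values: exp(mean(log(s_i))).
// Summing logarithms avoids overflow when many values are multiplied.
double geometric_mean(const std::vector<double>& speedups) {
    assert(!speedups.empty());
    double log_sum = 0.0;
    for (double s : speedups) {
        assert(s > 0.0);  // speedups are ratios of execution times
        log_sum += std::log(s);
    }
    return std::exp(log_sum / static_cast<double>(speedups.size()));
}
```

Unlike the arithmetic mean, the geometric mean treats a 2x speedup and a 2x slowdown symmetrically: for the made-up inputs {0.5, 2.0} it evaluates to 1.0 up to rounding.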
The best speedups are attained with the pathfinder, hotspot, lud
and gaussian benchmarks. The reason for this is that these
benchmarks use iterative algorithms, invoking the kernel multiple
times. Subsequent invocations utilize data generated in
previous iterations, requiring control to return to the
CPU and incurring kernel launch overhead on every iteration.
Vulkan enables us to eliminate these kernel launch and
communication overheads altogether by recording the work of all
iterations in one command buffer and adding memory barriers
between iterations to satisfy the dependency requirements.
Effectively, we incur a single communication overhead when
the command buffer is submitted. Our results are commensurate
with the kernel launch overhead findings of [15]. Figure 2
also shows that, for most of these workloads, the speedup
increases as we increase the input size. Larger input means
more iterations and less overhead compared to CUDA and
OpenCL, thus better Vulkan performance.
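The argument above can be made concrete with a toy cost model. Everything in the sketch below is an assumption introduced for illustration: the `OverheadModel` struct, its fields, and the linear cost formulas are not measurements or definitions from the evaluation. The model charges the multi-kernel method one launch per iteration, and the single-command-buffer approach one submission plus one barrier per iteration.

```cpp
#include <cassert>

// Hypothetical per-iteration costs in microseconds (illustrative only).
struct OverheadModel {
    double launch_us;   // host-side cost of one kernel launch
    double kernel_us;   // GPU execution time of one iteration
    double barrier_us;  // cost of one memory barrier in a command buffer
    double submit_us;   // one-off cost of submitting the command buffer
};

// Multi-kernel method: every iteration pays a launch.
double multi_kernel_time(const OverheadModel& m, int iterations) {
    return iterations * (m.launch_us + m.kernel_us);
}

// Single command buffer: one submission, then a barrier per iteration.
double single_submit_time(const OverheadModel& m, int iterations) {
    return m.submit_us + iterations * (m.kernel_us + m.barrier_us);
}

// Modelled speedup of the single-command-buffer approach.
double modelled_speedup(const OverheadModel& m, int iterations) {
    assert(iterations > 0);
    return multi_kernel_time(m, iterations) / single_submit_time(m, iterations);
}
```

Under these assumptions the one-off submission cost is amortized over more iterations, so the modelled speedup grows with the iteration count and approaches (launch + kernel) / (kernel + barrier). The model also suggests why long-running kernels benefit less: as `kernel_us` grows, that ratio tends to 1.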
An interesting result is that of cfd. Although it uses an iterative
algorithm, we do not get similar speedups. This benchmark
has 3 compute-intensive kernels and for every iteration we
have to bind 3 different compute pipelines, representing these
kernels, to our single command buffer. This overhead of
binding compute pipelines plus the longer kernel computation
times make the launch overhead savings less significant.
It also does not scale well with input size because the number
of iterations is fixed and does not depend on input size. Vulkan
cfd achieves a 1.38x speedup vs CUDA and a 1.04x speedup vs
OpenCL, averaged over both platforms.
In contrast, we get a slowdown for bfs on both platforms.
To investigate this, we disassembled the Vulkan and OpenCL
kernels using the AMD CodeXL tool [33]. We discovered that
the OpenCL-generated ISA code is optimized to use workgroup
local memory, whereas the Vulkan-generated ISA
uses plain buffer loads from global memory. This optimization
of memory accesses significantly affects performance
because bfs is memory-bound [34]; it predominantly performs
loads and stores with very few ALU operations. Although we
use the same driver, the generated ISA is different for Vulkan.
We can therefore deduce that the Vulkan SPIR-V compiler
inside the driver is not as mature as the OpenCL one. This
is expected, as Vulkan was only recently released, and we expect
support to improve in the future.
The remaining benchmarks, backprop, nn and nw, do not
involve any dependencies between kernel invocations. The
Vulkan implementations record these kernels into different
command buffers and submit them simultaneously to the GPU,
resulting in performance similar to CUDA and
OpenCL, with slight variations between the platforms.

TABLE III: Mobile GPUs Experimental Setup
                  Qualcomm Snapdragon 625   Google Nexus Player
Operating System  Android 7.0               Android 7.1
CPU               ARM Cortex A53 x8         Intel Atom(TM) x4
GPU               Adreno 506                Rogue G6430
OpenCL            OpenCL 2.0                OpenCL 1.2
Vulkan            API Version 1.0.20        API Version 1.0.30
B. Evaluations on Mobile Platforms
We used two platforms: Google's Nexus Player and
Qualcomm's Snapdragon 625, employing the Imagination
G6430 and the Adreno 506 GPUs respectively. The platforms
were chosen because both GPU vendors provide unofficial
OpenCL support (see footnote 3). Table III summarizes the
configuration details of these two platforms.
1) Memory Bandwidth Evaluation: We run the same strided
memory access micro-benchmark, described in Section V-A1,
on our selected mobile platforms. The obtained results are
shown in Figure 3. On the Nexus platform OpenCL achieves
a bandwidth of 2.85 GB/s at unit stride, whereas Vulkan
only achieves 2.69 GB/s, corresponding to about 89% and 84%
of peak bandwidth respectively. For strides larger than
4 bytes, Vulkan surprisingly performs slightly better than
OpenCL. However, on the Snapdragon platform, Vulkan
performs worse than OpenCL at strides less than 16 bytes,
but we get nearly the same bandwidth for strides above
16 bytes. We suspect that the Snapdragon driver does not
properly support Vulkan's push constants, which we use to set
the stride constant inside the command buffer when varying
the stride, and treats them as normal storage
buffers instead. This can result in worse performance because
binding these buffers is required for every iteration. For
larger strides this effect becomes negligible because
the execution times are longer. Overall, the
main observation is that on these mobile
platforms Vulkan can provide performance comparable to
OpenCL, with slight degradation, and again data layout in
memory is more important than the programming model used.
2) Benchmarks Evaluations: Due to memory size restrictions
on these platforms, we had to choose smaller workload input
sizes. cfd could not fit on either platform as it uses larger data
sets describing flux flow data. Also, the backprop OpenCL
and Vulkan implementations failed to run on Nexus, and on
Snapdragon only the lud OpenCL implementation failed, because
of driver issues. The results are shown in Figure 4.
Fig. 3: Vulkan memory bandwidth vs CUDA and OpenCL. Panels: (a) Nexus Player (Vulkan, OpenCL); (b) Snapdragon 625 (Vulkan, OpenCL). Axes: stride (4 bytes per element) vs bandwidth (GB/s, 0.0 to 4.0).
Footnote 3: The OpenCL library on the Nexus Player is not even called libOpenCL.so. It is provided as libpvrcpt.so.
Figure 4 shows that Vulkan does well on Nexus compared
to Snapdragon, achieving geometric mean speedups of 1.59x
on Nexus and 0.83x on Snapdragon. On the Nexus platform,
Vulkan shows speedups across most benchmarks except
hotspot, which is largely commensurate with the results
obtained on desktop GPUs. The best speedups are again attained
with the pathfinder, gaussian and lud benchmarks because
the kernel launch overhead is minimized. On the Snapdragon
platform, further investigations are required to explain the
exhibited slowdown. However, since all benchmarks except
pathfinder exhibited slowdowns, we think this can be related
to the immaturity of the Vulkan drivers on this platform
compared to the OpenCL ones. We expect this will improve
in the future as better Vulkan support is rolled out.
Overall, these results are interesting in that they
demonstrate that performance portability is not necessarily
guaranteed, even though the programming model is portable.
We can conclude that Vulkan performance improvements can
be portable to mobile GPUs as long as there is good driver
support from vendors.
VI. DISCUSSION
A. Vulkan Limitations
As you may have observed from the example application
described in Listing 1, the key limitation of Vulkan is its
verbosity. Vulkan's low-level nature makes it very verbose
and demands a high programming effort. For example, to create a
simple buffer one has to:
• Create a buffer object
• Get the memory requirements for that object
• Decide which memory heap to use
• Allocate memory on the chosen heap
• Bind the buffer object to the memory allocation
This simple buffer creation requires about 40 lines of code
in Vulkan compared to just one line in CUDA or OpenCL,
where cudaMalloc and clCreateBuffer are used respectively.
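In the Vulkan API the five steps above correspond to vkCreateBuffer, vkGetBufferMemoryRequirements, a search over the memory types reported by vkGetPhysicalDeviceMemoryProperties, vkAllocateMemory and vkBindBufferMemory. The sketch below illustrates only the third step, which is the one with no direct counterpart in CUDA or OpenCL. To stay self-contained it does not include the Vulkan headers: `MemoryType` and `find_memory_type` are hypothetical stand-ins, and a real application would read the same fields from VkMemoryRequirements and VkPhysicalDeviceMemoryProperties.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified stand-in for one entry of the device's memory type table.
struct MemoryType {
    std::uint32_t property_flags;  // bitmask of memory properties
    std::uint32_t heap_index;      // heap this memory type allocates from
};

// Returns the index of the first memory type that (a) is permitted for the
// resource, i.e. bit i is set in `allowed_type_bits`, and (b) has every bit
// of `required_flags` set. Returns -1 if no memory type qualifies.
int find_memory_type(const std::vector<MemoryType>& types,
                     std::uint32_t allowed_type_bits,
                     std::uint32_t required_flags) {
    for (std::size_t i = 0; i < types.size() && i < 32; ++i) {
        const bool allowed = (allowed_type_bits & (1u << i)) != 0;
        const bool has_flags =
            (types[i].property_flags & required_flags) == required_flags;
        if (allowed && has_flags) {
            return static_cast<int>(i);
        }
    }
    return -1;  // the caller must handle the absence of a suitable type
}
```

The returned index is what the allocation step needs; the heap is implied by the chosen memory type. That this decision is left to the application is a large part of why buffer creation is so much longer than a single cudaMalloc or clCreateBuffer call.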
In addition, Vulkan's principle of explicit control pushes a lot
of responsibility onto the programmer. The application layer
is proportionally more complex. Programmers have to deal
with issues such as memory allocation, resource tracking,
object creation and destruction, and so on. Experience shows
that programming in such a style can be error-prone and less
productive. Vulkan's verbosity and the additional responsibility
it imposes on the programmer introduce issues with
productivity and hence can be a burden to adopting it as a
GPGPU programming model.

Fig. 4: Vulkan speedup vs OpenCL on mobile devices. Panels: (a) Nexus: Imagination PowerVR G6430 GPU; (b) Snapdragon: Qualcomm Adreno 506 GPU. Series: OpenCL, Vulkan. Axes: benchmark and input size vs speedup (0.0 to 4.0). Input sizes: bfs 4K, 16K; bprop 64K, 256K; gauss 208, 416; hotspot 128-8, 128-16; lud 64, 256; nn 256K, 8M; nw 1K, 2K; pfinder 512, 1024.
B. Recommended Vulkan Optimizations
Vulkan introduces some low-level controls that can be utilized
for extra performance. As a takeaway from our experience
writing the VComputeBench benchmarks, we recommend the
following for better Vulkan performance:
• For iterative algorithms, use a single command buffer
and synchronize using memory barriers. This proved to
be effective in our evaluations.
• For parameter changes of small data types, it is better
to use PushConstants rather than binding a whole
parameter buffer. Push constants are specific to a pipeline.
For instance, on the GTX1050Ti and RX560 the maximum
sizes are 256 bytes and 128 bytes respectively. On both the
Nexus and Snapdragon platforms the maximum is 128 bytes.
• Try to minimize going back to the CPU for control and
leverage Vulkan's synchronization primitives to stay on
the GPU as much as possible.
• For large memory transfers use transfer queues. These
dedicated transfer queues should be used for large copy
commands as they are usually tied to DMA engines inside the
hardware.
• For better workload balancing, make use of multiple
compute queues whenever possible. This will give the
GPU's scheduler more room for manoeuvre, resulting in
better utilization.
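The push-constant recommendation in the list above can be guarded at compile time. The sketch below is an illustration under stated assumptions: `StrideParams` is a hypothetical parameter block, and `fits_in_push_constants` is a helper introduced here. It checks a block against the smallest limit reported above, 128 bytes, and against the Vulkan requirement that a push-constant range has a size that is a multiple of 4. At run time the actual device limit is reported in VkPhysicalDeviceLimits::maxPushConstantsSize, and the values are written with vkCmdPushConstants.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical parameter block for a strided-read kernel.
struct StrideParams {
    std::uint32_t stride;         // distance between accessed elements
    std::uint32_t element_count;  // number of elements to read
};

// Smallest push-constant limit among the four platforms discussed above.
constexpr std::size_t kPortablePushConstantBytes = 128;

// True if a block of `param_bytes` can be passed as push constants on a
// device whose limit is `device_limit_bytes`.
constexpr bool fits_in_push_constants(std::size_t param_bytes,
                                      std::size_t device_limit_bytes) {
    return param_bytes <= device_limit_bytes && param_bytes % 4 == 0;
}

static_assert(fits_in_push_constants(sizeof(StrideParams),
                                     kPortablePushConstantBytes),
              "parameter block too large for portable push constants");
```

Parameter blocks that fail this check would need to fall back to a uniform or storage buffer, which is the more expensive path that the recommendation tries to avoid.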
VII. CONCLUSION
This paper presented Vulkan as a new programming model
for cross-platform GPGPU computing, notably on mobile and
embedded GPUs. We developed a set of compute benchmarks
by extending the Rodinia suite with Vulkan benchmarks and
used them to evaluate this emerging programming model.
Indeed, Vulkan's low-level control over the underlying hardware
offers opportunities for better performance. Our results
show that, by exploiting Vulkan's synchronization mechanisms,
average speedups of 1.53x and 1.66x versus CUDA
and OpenCL were attained across the selected benchmarks.
We also show that similar performance improvements can
be seen on some mobile GPU architectures, but performance
portability is not necessarily guaranteed. Issues such as driver
support and implementation quality come into play.
Finally, we illustrate that these performance improvements
come at a cost manifested in a high programming effort. These
programmability issues can be a burden to adopting Vulkan
as a GPGPU programming model. Directions for future work
could include improving the programmability of this emerging
programming model.
ACKNOWLEDGMENT
This material is based upon work supported by the European
Union Horizon 2020 research and innovation programme
under Grant No. 688759, Project LPGPU2.
REFERENCES
[1] J. D. Owens, D. Luebke, N. Govindraju, M. Harris, J. Kruger, A. E. Lefohn, and T. J. Purcell, “A Survey of General Purpose Computation on Graphics Hardware,” Computer Graphics Forum, vol. 26, pp. 80–113, 2006.
[2] Nvidia Corporation, “CUDA Toolkit Documentation,” 2017. [Online]. Available: http://docs.nvidia.com/cuda/
[3] The Khronos OpenCL Working Group, “The OpenCL Specification,” 2017. [Online]. Available: https://www.khronos.org/registry/OpenCL/specs/opencl-2.2.html
[4] OpenMP Architecture Review Board, “OpenMP Application Programming Interface,” 2015. [Online]. Available: http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
[5] The OpenACC Standard.org, “The OpenACC Application Programming Interface,” 2015. [Online]. Available: https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final-changes.pdf
[6] The Khronos Vulkan Working Group, “The Vulkan Specification,” 2017. [Online]. Available: https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html
[7] J. Kessenich, B. Ouriel, and R. Krisch, “SPIR-V Specification,” 2017. [Online]. Available: https://www.khronos.org/registry/spir-v/specs/1.2/SPIRV.html
[8] N. M. Dongre, “A Research On Android Technology With New Version Naugat(7.0,7.1),” IOSR Journal of Computer Engineering, vol. 19, no. 02, pp. 65–77, 2017.
[9] T. Linux Foundation Project, “Tizen 3.0 Public M2 Release Notes,” 2017. [Online]. Available: https://developer.tizen.org/tizen/tizen/release-notes/tizen-3.0-public-m2
[10] S. Che, M. Boyer, J. Meng, D. Tarjan, S. Lee, J. W. Sheaffer, and K. Skadron, “A Benchmark Suite for Heterogeneous Computing,” IEEE International Symposium on Workload Characterization, pp. 44–54, 2009.
[11] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-M. W. Hwu, “Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing,” IMPACT Technical Report, 2012.
[12] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, “The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite Categories and Subject Descriptors,” Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pp. 63–74, 2010.
[13] Y. Sun, X. Gong, A. K. Ziabari, L. Yu, X. Li, S. Mukherjee, C. McCardwell, A. Villegas, and D. Kaeli, “Hetero-mark, a benchmark suite for CPU-GPU collaborative computing,” in Proceedings of the 2016 IEEE International Symposium on Workload Characterization, IISWC 2016, 2016, pp. 13–22.
[14] K. Karimi, N. G. Dickson, and F. Hamze, “A Performance Comparison of CUDA and OpenCL,” ArXiv e-prints, vol. arXiv, no. 1, p. 1005.2581, 2010.
[15] J. Fang, A. L. Varbanescu, and H. Sips, “A comprehensive performance comparison of CUDA and OpenCL,” Proceedings of the International Conference on Parallel Processing, pp. 216–225, 2011.
[16] R. Sachetto Oliveira, B. M. Rocha, R. M. Amorim, F. O. Campos, W. Meira, E. M. Toledo, and R. W. dos Santos, “Comparing CUDA, OpenCL and OpenGL Implementations of the Cardiac Monodomain Equations.” Springer, Berlin, Heidelberg, 2012, pp. 111–120.
[17] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, “From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming,” Parallel Computing, vol. 38, no. 8, pp. 391–407, 2012.
[18] C.-L. Su, P.-Y. Chen, C.-C. Lan, L.-S. Huang, and K.-H. Wu, “Overview and comparison of OpenCL and CUDA technology for GPGPU,” in 2012 IEEE Asia Pacific Conference on Circuits and Systems. IEEE, 12 2012, pp. 448–451.
[19] J. Kim, T. T. Dao, J. Jung, J. Joo, and J. Lee, “Bridging OpenCL and CUDA,” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15, no. November, pp. 1–12, 2015.
[20] H. C. D. Silva, F. Pisani, and E. Borin, “A Comparative Study of SYCL, OpenCL, and OpenMP,” 2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pp. 61–66, 2016.
[21] Z. Wang, B. He, W. Zhang, and S. Jiang, “A performance analysis framework for optimizing OpenCL applications on FPGAs,” in Proceedings - International Symposium on High-Performance Computer Architecture, vol. 2016-April, 2016, pp. 114–125.
[22] A. Sampson, “Let's Fix OpenGL,” 2nd Summit on Advances in Programming Languages (SNAPL 2017), vol. 71, pp. –, 2017.
[23] A. Blackert, Evaluation of Multi-Threading in Vulkan. Linköping University, 2016.
[24] J. Kessenich, “SPIR-V A Khronos-Defined Intermediate Language for Native Representation of Graphical Shaders and Compute Kernels,” 2015. [Online]. Available: https://www.khronos.org/registry/spir-v/papers/WhitePaper.pdf
[25] G. Wang and Y. Xiong, “Accelerating computer vision algorithms using OpenCL framework on the mobile GPU-a case study,” IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[26] M. M. Trompouki, L. Kosmidis, and U. Polit, “Optimisation Opportunities and Evaluation for GPGPU applications on Low-End Mobile GPUs,” Date, pp. 950–953, 2017.
[27] L. Tobias, A. Ducournau, F. Rousseau, G. Mercier, and R. Fablet, “Convolutional Neural Networks for object recognition on mobile devices: A case study,” 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 3530–3535, 2016.
[28] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, “A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads,” in IEEE International Symposium on Workload Characterization, IISWC'10, 2010.
[29] G. Misra, N. Kurkure, A. Das, M. Valmiki, S. Das, and A. Gupta, “Evaluation of rodinia codes on Intel Xeon Phi,” in Proceedings - International Conference on Intelligent Systems, Modelling and Simulation, ISMS, 2013, pp. 415–419.
[30] The Khronos Group, “Glslang Reference Compiler,” 2017. [Online]. Available: https://github.com/KhronosGroup/glslang
[31] K. Asanovic, B. C. Catanzaro, D. Patterson, and K. Yelick, “The Landscape of Parallel Computing Research: A View from Berkeley,” Tech. Rep., 2006.
[32] J. H. Lee, N. Nigania, H. Kim, K. Patel, and H. Kim, “OpenCL Performance Evaluation on Modern Multicore CPUs,” Scientific Programming, vol. 2015, pp. 1–20, 10 2015.
[33] GPUOpen AMD, “CodeXL Tool Suite,” 2017. [Online]. Available: https://github.com/GPUOpen-Tools/CodeXL
[34] S. Lal, J. Lucas, and B. Juurlink, “E^2MC: Entropy Encoding Based Memory Compression for GPUs,” in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 5 2017, pp. 1119–1128.
