Nadjib Mammeri, Ben Juurlink (2018): VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs. In: 2018 IEEE International Symposium on Workload Characterization. Accepted manuscript (Postprint), conference paper.
This version is available at https://doi.org/10.14279/depositonce-7346
Terms of Use: © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

VComputeBench: A Vulkan Benchmark Suite for GPGPU on Mobile and Embedded GPUs
Nadjib Mammeri, Technische Universität Berlin
Ben Juurlink, Technische Universität Berlin

Abstract: GPUs have become immensely important computational units on embedded and mobile devices. However, GPGPU developers are often unable to exploit the compute power offered by GPUs on these devices, mainly due to the lack of support for traditional programming models such as CUDA and OpenCL. The recent introduction of the Vulkan API provides a new programming model that could be explored for GPGPU computing on these devices, as it supports compute and promises to be portable across different architectures.
In this paper we propose VComputeBench, a set of benchmarks that help developers understand the differences in performance and portability of Vulkan. We also evaluate the suitability of Vulkan as an emerging cross-platform GPGPU framework by conducting a thorough analysis of its performance compared to CUDA and OpenCL on mobile as well as on desktop platforms.
Our experiments show that Vulkan provides better platform support on mobile devices and can be regarded as a good cross-platform GPGPU framework. It offers comparable performance and, with some low-level optimizations, it can offer average speedups of 1.53x and 1.66x compared to CUDA and OpenCL respectively on desktop platforms and a 1.59x average speedup compared to OpenCL on mobile platforms. However, while Vulkan's low-level control can enhance performance, it requires a significantly higher programming effort.

Index Terms: VComputeBench, Vulkan, SPIR-V, GPGPU, CUDA, OpenCL, Rodinia, Mobile

I. INTRODUCTION

Graphics Processing Units (GPUs) have become a dominant platform for parallel computing thanks to their massively parallel architecture, energy efficiency and availability to the masses. Several programming models have emerged enabling developers to harness the massive compute power offered by GPUs, while exploiting parallelism for different application domains. This is often referred to as GPGPU (General Purpose computing on the GPU) [1]. The most popular GPGPU programming models are CUDA [2] and OpenCL [3]. CUDA is a proprietary standard introduced by NVIDIA and targets only NVIDIA hardware, while OpenCL is an open standard maintained by the Khronos group and targets additional hardware devices including FPGAs, CPUs and DSPs. In this work we focus on the two predominant programming models, CUDA and OpenCL, but it is worth mentioning other frameworks such as OpenMP [4] and OpenACC [5]. OpenMP mainly targets shared memory multiprocessors; the target directive, introduced in OpenMP 4.0 and extended in OpenMP 4.5, enables support for GPUs and other devices. OpenACC is mainly designed to program accelerators in heterogeneous systems with OpenMP-like directives.

To add to this mix of programming models, the Khronos group recently released the Vulkan API [6] along with SPIR-V [7].
Vulkan is a low-level API with an abstraction closer to the behavior of the actual hardware. It promises cross-platform support, high efficiency and better performance of GPU applications. Unlike CUDA, which is only supported on NVIDIA GPUs, and OpenCL, which has no official support on mobile GPUs, Vulkan is supported by all major GPU vendors¹ and considers non-desktop GPUs as first class citizens. Vulkan is officially supported on Android 7.0 [8] and on the new Tizen OS 3.0 [9], covering a full spectrum of mobile devices from phones and wearables to TVs and in-vehicle infotainment systems. This good platform support, and the fact that it also supports compute, motivated us to examine it from the GPGPU perspective even though it was mainly designed to improve graphics performance.

In this paper, we introduce Vulkan as a cross-platform GPGPU route that could open new perspectives for pertinent GPGPU computing on mobile devices and can be explored along with other more established frameworks on desktop architectures. However, there are some important questions yet to be answered:
• What kind of performance can we get out of Vulkan?
• Is there a viable study comparing Vulkan compute to established frameworks such as CUDA and OpenCL?
• If there are any performance gains, are these portable across different GPU architectures?
• Can Vulkan enable pertinent GPGPU computing on mobile and embedded GPUs?

Selecting a GPGPU framework is a critical task for developers. Differences in performance, portability, programmability and platform support are all very important factors that need to be considered. Benchmarks play an important role in exposing these kinds of differences between hardware architectures, compilers and, more importantly, across competing programming models. There are several benchmarks available to evaluate CUDA and OpenCL [10] [11] [12] [13] but currently none for Vulkan.
To fill this gap and enable our study we propose VComputeBench, a set of Vulkan compute benchmarks that help developers understand the differences in performance and portability of Vulkan and provide guidance to GPU architects in the design and optimization of their drivers and runtime. VComputeBench was developed by extending the popular Rodinia benchmark suite [10], covering a diverse range of application domains with different computation patterns. The reason for selecting the Rodinia suite is that it provides OpenCL and CUDA implementations, and with our VComputeBench implementations we can make fair comparisons and adequately evaluate Vulkan against other programming models.

¹ Supported by major desktop GPU vendors (AMD, NVIDIA and Intel) and mobile GPU vendors (Qualcomm, ARM, Imagination and VeriSilicon).

In essence, the main contributions of this paper are:
• Illustrate the viability of Vulkan as a GPGPU framework, notably on mobile devices.
• Propose a set of Vulkan compute benchmarks named VComputeBench and port them onto mobile platforms.
• Perform a thorough analysis of performance, comparing Vulkan to CUDA and OpenCL on desktop and mobile GPUs, and highlight a set of Vulkan-specific optimization techniques.

II. RELATED WORK

In recent years, GPGPU frameworks have received a great amount of attention from the research community. Although several works have studied and compared different programming models [14] [15] [16] [17] [18] [19] [20] [21], none of them studied Vulkan. To the best of our knowledge, our work is the first to investigate Vulkan from the compute rather than the graphics perspective and propose it as a viable cross-platform GPGPU programming model.

Among the earliest and most cited works are those of Fang et al. [15] and Karimi et al. [14]. The authors compare CUDA to OpenCL in terms of performance on old desktop GPU architectures.
Our work, on the other hand, was carried out on recent architectures and analyses performance on desktop as well as mobile GPUs. Du et al. [17] study OpenCL performance portability and Wang et al. [21] examine OpenCL on FPGAs. The authors of these papers demonstrate that performance is not necessarily portable across architectures. Their findings prompted us to study and port our benchmarks onto mobile GPUs in order to evaluate Vulkan's portability and examine its performance implications.

Such research works rely heavily on benchmarks for their evaluations. Several GPGPU benchmark suites have been proposed, such as Rodinia [10], Parboil [11], SHOC [12] and the recent Hetero-Mark [13]. Most of these benchmark suites include CUDA, OpenCL or OpenMP implementations but none include Vulkan implementations. This can be a limitation, especially for researchers and developers wanting to target this new emerging programming model. In this work, we aim to enrich the GPGPU community with such Vulkan benchmarks by extending the popular Rodinia suite, enabling researchers and developers to evaluate Vulkan along with other GPGPU programming models. Likewise, most of these benchmark suites mainly target desktop GPUs or multicore systems with their CUDA and OpenCL implementations. Our benchmarks, on the other hand, target both mobile and desktop GPUs. We chose Vulkan because of its cross-platform capabilities and good support on mobile devices.

III. VULKAN: A COMPUTE PERSPECTIVE

In this section we present an overview of the Vulkan programming model, illustrating why it is a promising GPGPU framework especially for mobile and embedded GPUs.

A. Vulkan Overview

Vulkan is often referred to as the next generation graphics and compute API for modern GPUs.
It is an open standard that aims to address the inefficiencies of traditional APIs such as OpenGL, which were designed for single-core processors and fail to map well to modern hardware [22]. Vulkan, on the other hand, was designed from the ground up with multi-threading support in mind. Better parallelization can be achieved by asynchronously generating work across multiple threads, feeding the GPU in an efficient manner. This is attained in Vulkan by having no global state, no synchronization in the driver, and by separating work generation from work submission. All state is localized in command buffers, which can be generated on multiple threads and only start executing on the GPU after submission.

The other key characteristic of Vulkan is that it provides much lower-level, fine-grained control over the GPU, enabling developers to maximize performance across many platforms. It achieves this by being explicit in nature rather than relying on hidden heuristics in the driver. Operations such as resource tracking, synchronization, memory allocation, and work submission are all pushed into application space, resulting in higher predictability and better control of when and where work happens. Likewise, unnecessary background tasks such as error checking, hazard tracking, state validation and shader compilation are delegated to the tooling layers, which are present during development and removed at runtime, resulting in low driver overhead and less CPU usage [23].

B. The Programming Model

Vulkan can be viewed as a pipeline with some programmable stages that are invoked by a set of operations. To the programmer, it is simply an API with a set of routines allowing for the specification of shaders or kernels, state controlling aspects, and the data used by those kernels. From the compute perspective though, the pipeline has only one programmable stage, represented by the kernel program to be executed [6].
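The separation of work generation from work submission described in the overview can be pictured with a small toy model. The sketch below is not Vulkan code: `ToyCommandBuffer`, `recordInParallel` and `submit` are hypothetical names, and the "commands" are plain strings. It only shows the shape of the idea, assuming each thread owns its command buffer, so recording needs no locks, and nothing reaches the queue until a single submission step.

```cpp
#include <string>
#include <thread>
#include <vector>

// Toy stand-in for a command buffer: an ordered list of recorded commands.
// This is NOT a Vulkan type; every name in this sketch is hypothetical.
struct ToyCommandBuffer {
    std::vector<std::string> commands;
};

// Work generation: each thread records into its own buffer. No state is
// shared between recorders, so no locking is required.
std::vector<ToyCommandBuffer> recordInParallel(int numThreads, int commandsPerBuffer) {
    std::vector<ToyCommandBuffer> buffers(numThreads);
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&buffers, t, commandsPerBuffer] {
            for (int c = 0; c < commandsPerBuffer; ++c) {
                buffers[t].commands.push_back(
                    "dispatch t" + std::to_string(t) + " #" + std::to_string(c));
            }
        });
    }
    for (auto& worker : workers) worker.join();
    return buffers;
}

// Work submission: one thread hands the recorded buffers to the "queue".
// Nothing is considered executed before this point.
std::vector<std::string> submit(const std::vector<ToyCommandBuffer>& buffers) {
    std::vector<std::string> queue;
    for (const auto& buffer : buffers) {
        queue.insert(queue.end(), buffer.commands.begin(), buffer.commands.end());
    }
    return queue;
}
```

Calling `submit(recordInParallel(4, 3))` yields a queue of twelve commands in buffer order. In real Vulkan host code the recording side would use one command pool per thread, which this model leaves out.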
a) Execution Model: A Vulkan-capable system exposes one or more devices, each of which exposes one or more queues. These queues are partitioned into queue families and can process work asynchronously to one another. Each queue family supports a number of functionalities and may contain multiple queues with similar characteristics. There are four types of queue functionality defined in Vulkan: graphics, compute, transfer, and sparse memory management. The reason for having queue families is that queues within a single family are considered compatible with one another, and work produced for one queue family can be executed on any queue within that family.

A queue is the interface between the application and the execution engines of a device. Commands for these execution engines are recorded into command buffers ahead of execution time. Once recorded, a command buffer can be cached and submitted to a queue for execution as many times as required. Command buffer construction is expensive, so the application may employ multiple threads to construct multiple command buffers in parallel. These command buffers are then submitted to queues for execution in a number of batches. Once submitted to a queue, the commands within a command buffer begin and complete execution without further application intervention. The order in which these commands are executed depends on a number of implicit and explicit ordering constraints. In addition, command buffers submitted to different queues may execute in parallel or even out of order with respect to one another. Command buffers submitted to a single queue, though, respect submission order. Host execution is also asynchronous to command buffer execution on the device.
Control may return to the application as soon as the command buffer is submitted, and the application must take responsibility for any synchronization between different queues as well as between the device and host.

b) Compute Model: In Vulkan, compute workloads are initiated by recording dispatching commands (vkCmdDispatch*) in a command buffer. Once a command buffer is submitted to a queue, execution starts according to the currently bound compute pipeline. Compute pipelines consist of a single compute shader stage, describing the kernel to be executed, and a pipeline layout, describing the input and output resources of that kernel. Dispatching commands take three input parameters, groupCountX, groupCountY and groupCountZ, defining the total number of workgroups, or the so-called global workgroup size, in the X, Y and Z directions respectively. A workgroup is the smallest unit of compute work that an application can dispatch. Within a single workgroup, there may be many workitems or compute shader invocations. This is called the local workgroup size and is defined by the compute shader itself using SPIR-V built-in decorations [7].

c) SPIR-V: All shaders and compute kernels in Vulkan are defined using the Standard Portable Intermediate Representation (SPIR-V), which is a platform-independent intermediate language for describing graphical shaders and compute kernels [24]. SPIR-V is a self-contained binary format. Logically, it is a header and a linear stream of instructions; physically, it is just a stream of 32-bit words, encoding a collection of annotations and decorations as well as functions, which in turn encode control-flow graphs (CFG) of blocks. Variables are accessed using load/store instructions, and any intermediate results that bypass load/store are represented in static single assignment (SSA) form.
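To make the "header plus linear stream of 32-bit words" layout concrete, here is a minimal sketch that walks such a stream. It relies on two facts from the SPIR-V specification: a module starts with a five-word header whose first word is the magic number 0x07230203, and the first word of every instruction stores the instruction's total word count in its high 16 bits and its opcode in its low 16 bits. The helper names `hasSpirvHeader` and `countInstructions` are hypothetical, and the sketch assumes the words are already in host byte order.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Magic number that opens every SPIR-V module (from the SPIR-V specification).
constexpr std::uint32_t kSpirvMagic = 0x07230203u;
// Header layout: magic, version, generator, id bound, reserved schema word.
constexpr std::size_t kHeaderWords = 5;

// Returns true if the word stream starts with a plausible SPIR-V header.
bool hasSpirvHeader(const std::vector<std::uint32_t>& words) {
    return words.size() >= kHeaderWords && words[0] == kSpirvMagic;
}

// Counts the instructions that follow the header. Each instruction's first
// word carries its word count (high 16 bits) and opcode (low 16 bits).
// Returns -1 if the header is missing or an instruction overruns the stream.
int countInstructions(const std::vector<std::uint32_t>& words) {
    if (!hasSpirvHeader(words)) return -1;
    int count = 0;
    std::size_t pos = kHeaderWords;
    while (pos < words.size()) {
        const std::uint32_t wordCount = words[pos] >> 16;
        if (wordCount == 0 || wordCount > words.size() - pos) return -1;
        pos += wordCount;
        ++count;
    }
    return count;
}
```

A host program could run a check like this on the file it loads before passing the words to vkCreateShaderModule. It is a sanity check only and does not replace a real validator.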
Hierarchical type information of data objects is preserved so that information needed for further optimizations on the target device is not lost.

C. Why Vulkan for Mobile and Embedded GPUs?

Considering that Vulkan was mainly designed to achieve higher graphics performance, we can make several interesting observations: (i) its enhancements and low-level nature can also be utilized to achieve higher performance for GPGPU applications. (ii) Vulkan's main focus on graphics allowed it to have better support among GPU vendors than other open frameworks such as OpenCL, which for instance is not fully supported by NVIDIA because it is considered a competitor to its proprietary CUDA framework². (iii) Vulkan is considered the first framework to have official support on mobile platforms [8] [9], and the API was designed with mobile GPU features in mind, such as tiled rendering. Hence, it has the potential of being the framework of choice for GPGPU on mobile devices, which is the quest of many recent research works [25] [26] [27]. This leads us to our final observation: (iv) Vulkan can be the appropriate framework for achieving true cross-platform GPGPU without sacrificing performance.

IV. BENCHMARKS

Benchmarks play an important role in exposing differences in performance, portability and programmability across competing programming models. Since Vulkan was recently released and its main focus is on graphics rather than GPGPU, there are currently a few graphics benchmarks but no compute benchmarks that can be of use to our study. In order to enable our work as well as to enrich the research community with such benchmarks, we extended the popular Rodinia benchmark suite [10] by developing Vulkan equivalents of most of its workloads, referred to as VComputeBench, and made them publicly available to the wider GPGPU community.
Before describing our VComputeBench benchmarks, we first present one of the microbenchmarks that we used in our study to better illustrate this new programming model and give an overview of what is required to write a Vulkan compute application.

² Current OpenCL version is 2.2 but NVIDIA only supports version 1.2.

A. Vector Addition Microbenchmark

This microbenchmark is a simple application adding two vectors X and Y of size n and saving the output in vector Z. The kernel code, or the compute shader in Vulkan terminology, is a SPIR-V binary that was compiled offline from a 10-line GLSL source implementing:

$$Z[i] = X[i] + Y[i] \quad \forall i \in \{0, 1, \ldots, n-1\}$$

The index space is one dimensional and i is defined using the SPIR-V built-in decoration GlobalInvocationId, which returns the global ID of the workitem executing the kernel. The vectors X, Y and Z are bound to the kernel as storage buffers.

The host code, on the other hand, is more complicated. Listing 1 shows a pseudo-code listing of the host program highlighting only the important API calls.

    int main() {
      std::size_t N = 1000000;              // Number of elements in a vector
      // Workgroup size is 256. Rounded up so the last, partial workgroup is
      // covered; this assumes the kernel guards against i >= N.
      int numWorkGroups = (N + 255) / 256;

      // Enumerate devices then create instance, queues and device
      VkInstance instance;
      VkInstanceCreateInfo instanceInfo = {}; ...
      vkCreateInstance(&instanceInfo, nullptr, &instance);
      vkEnumeratePhysicalDevices(instance, ..., &gpuList);
      vkGetPhysicalDeviceQueueFamilyProperties(gpuList[0], ...);
      ...
      VkDeviceQueueCreateInfo queueCreateInfo{}; ...
      VkDevice device;
      VkDeviceCreateInfo deviceInfo = {}; ...
      vkCreateDevice(gpuList[0], &deviceInfo, ..., &device);
      VkQueue computeQueue;
      vkGetDeviceQueue(device, queueFamilyIndex, 0, &computeQueue);
      ...
      // Create buffer then bind the buffer to the allocated memory
      VkBuffer bufferX;
      VkBufferCreateInfo bufferCreateInfo{}; ...
      bufferCreateInfo.size = N * sizeof(float);
      bufferCreateInfo.usage = VK_BUFFER_USAGE_STORAGE_BUFFER_BIT |
                               VK_BUFFER_USAGE_TRANSFER_DST_BIT;
      vkCreateBuffer(device, &bufferCreateInfo, nullptr, &bufferX);
      VkMemoryRequirements xBuffMemReqs;
      vkGetBufferMemoryRequirements(device, bufferX, &xBuffMemReqs);
      int xMemIndex = findMemType(xBuffMemReqs.memoryTypeBits,
                                  VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT);
      VkDeviceMemory memory;
      VkMemoryAllocateInfo memAllocInfo{}; ...
      memAllocInfo.allocationSize = xBuffMemReqs.size;
      memAllocInfo.memoryTypeIndex = xMemIndex;
      vkAllocateMemory(device, &memAllocInfo, nullptr, &memory);
      vkBindBufferMemory(device, bufferX, memory, 0);
      ...
      // Create the compute shader and the compute pipeline
      VkShaderModule module;
      VkShaderModuleCreateInfo shadCreatInfo{}; ...
      shadCreatInfo.pCode = readSpirvBinary("vectorAdd.spv");
      vkCreateShaderModule(device, &shadCreatInfo, NULL, &module);
      VkPipelineShaderStageCreateInfo shaderStageCreateInfo{}; ...
      shaderStageCreateInfo.module = module;
      shaderStageCreateInfo.stage = VK_SHADER_STAGE_COMPUTE_BIT;
      VkPipelineLayout pipelineLayout; ...
      vkCreatePipelineLayout(device, ..., &pipelineLayout);
      VkPipeline ppline;
      VkComputePipelineCreateInfo ppCreateInfo{}; ...
      ppCreateInfo.stage = shaderStageCreateInfo;
      ppCreateInfo.layout = pipelineLayout;
      vkCreateComputePipelines(device, ..., &ppCreateInfo, ..., &ppline);
      ...
      // Bind buffers to compute pipeline
      VkWriteDescriptorSet writeDescripSet{}; ...
      writeDescripSet.descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
      writeDescripSet.dstBinding = 0;       // Same as SPIR-V Binding decoration
      writeDescripSet.pBufferInfo = xBufferDescriptor;
      vkUpdateDescriptorSets(device, 1, &writeDescripSet, 0, NULL);
      ...
      // Create command pool and allocate a command buffer
      VkCommandPool cmdPool;
      VkCommandPoolCreateInfo cmdPoolInfo{}; ...
      vkCreateCommandPool(device, &cmdPoolInfo, nullptr, &cmdPool);
      VkCommandBuffer cmdBuffer;
      VkCommandBufferAllocateInfo allcInfo{}; ...
      allcInfo.commandPool = cmdPool;
      vkAllocateCommandBuffers(device, &allcInfo, &cmdBuffer);
      ...
      // Bind the pipeline and record commands to the command buffer
      vkCmdBindPipeline(cmdBuffer, VK_PIPELINE_BIND_POINT_COMPUTE, ppline);
      vkCmdDispatch(cmdBuffer, numWorkGroups, 1, 1);
      vkEndCommandBuffer(cmdBuffer);
      ...
      // Submit to queue
      VkSubmitInfo submitInfo{VK_STRUCTURE_TYPE_SUBMIT_INFO};
      submitInfo.commandBufferCount = 1;
      submitInfo.pCommandBuffers = &cmdBuffer;
      vkQueueSubmit(computeQueue, 1, &submitInfo, ...);
      ...
      // Clean up and free all resources
    }

Listing 1: VectorAdd host code using low-level Vulkan API

TABLE I: VComputeBench benchmarks

| Name | Application | Dwarf | Domain |
|---|---|---|---|
| backprop | Back Propagation | Unstructured Grid | Deep Learning |
| bfs | Breadth-First Search | Graph Traversal | Graph Theory |
| cfd | CFD Solver | Unstructured Grid | Fluid Dynamics |
| gaussian | Gaussian Elimination | Dense Linear Algebra | Linear Algebra |
| hotspot | Hotspot Simulation | Structured Grid | Physics |
| lud | LU Decomposition | Dense Linear Algebra | Linear Algebra |
| nn | K-Nearest Neighbors | Dense Linear Algebra | Data Mining |
| nw | Needleman-Wunsch | Dynamic Programming | Bioinformatics |
| pathfinder | Path Finder | Dynamic Programming | Grid Traversal |

Vulkan applications are linked against a common library referred to as the loader, which gets initialized at the time of VkInstance creation. The loader loads any enabled tooling layers and initializes the low-level driver provided by the GPU vendor. Accordingly, the example program depicted in Listing 1 starts initializing Vulkan by creating a VkInstance and querying the system for any available devices with all their properties, including all available queue families. Then a logical VkDevice is created and a queue is acquired. The next step is to create storage buffers for the vectors.
VkBuffer objects are created, the system is queried for suitable heaps according to the buffer memory requirements, then memory is allocated on that heap and the buffers are bound to their allocated memory. Next, a compute VkPipeline is created by specifying the kernel's SPIR-V binary as its shader stage and creating a VkPipelineLayout describing all the resources used by that kernel. Then, the buffers are bound to the pipeline by specifying the kernel's binding value of each buffer as the destination binding of the write descriptor set. This is similar to specifying the kernel arguments in OpenCL using clSetKernelArg. Now that the compute pipeline is set up, the kernel can be launched by creating a VkCommandBuffer, binding the pipeline to that command buffer and recording the dispatch command with the number of workgroups to be launched. The command buffer is then submitted to the acquired queue for execution. Finally, the application waits for execution to finish, then cleans up and frees all used resources and objects.

B. VComputeBench Benchmarks

The Rodinia suite includes both CUDA and OpenCL versions for each of its benchmarks. While developing their Vulkan equivalents, we made sure not to introduce any algorithmic changes to the kernel codes. In this way, we are able to make fair comparisons in the sense that any differences in performance can be attributed to the programming model and not to the algorithm. By using the latest Rodinia version 3.1, we assume that we are already starting from a decent baseline, since these benchmarks were optimized many times in several research works [28] [29].

The kernels were developed in GLSL and their corresponding SPIR-V binaries were automatically generated using the glslangValidator compiler [30] provided by Khronos. We have chosen GLSL as our kernel language because it has the best support.
We provide both the SPIR-V binaries and the GLSL sources as part of our VComputeBench benchmarks. The host code translation, on the other hand, was challenging because the Rodinia source code was collected from different sources, resulting in hard-to-read code with different styles, very few comments and hardly any documentation. We made sure this is not the case with our benchmarks, which we implemented using C++11 features with a unified style and appropriate comments. As far as functional testing is concerned, we validated our VComputeBench benchmarks against both CUDA and OpenCL outputs for different input sets.

Our VComputeBench benchmarks cover a diverse range of application domains with different computation patterns. The benchmarks were selected so that they also cover different sets of dwarves [31]. Table I shows a list of the developed benchmarks including their dwarf and application domains. Here, we only include brief descriptions of these benchmarks; full descriptions and characterizations of these workloads can be found in [10]:

Back Propagation (bp): an algorithm that is commonly used in training deep neural networks to adjust the network's weights. It is composed of two phases: a forward pass, where the activations are propagated from the input to the output layer, and a backward pass, where the error is propagated backwards from the output to the input layer to adjust the weights and bias values.

Breadth-First Search (bfs): a graph algorithm that traverses or searches a graph of connected nodes, which could include millions of nodes. It starts at a root node and explores neighboring nodes first, before moving to the next-level neighbors.

Computational Fluid Dynamics (cfd): an unstructured-grid, finite-volume solver for the three-dimensional Euler equations of compressible flow.
Gaussian Elimination (gaussian): a linear algebra algorithm for solving a set of linear equations. It works by performing a sequence of row reduction operations on a matrix until the lower left-hand corner of the matrix is filled with zeros, as much as possible.

Hotspot Simulation (hotspot): a thermal simulation tool that estimates processor temperature based on an architectural floor plan and simulated power measurements.

LU Decomposition (lud): a linear algebra algorithm used to calculate the solution of a set of linear equations. It works by decomposing a matrix into a product of a lower triangular matrix and an upper triangular matrix.

K-Nearest Neighbors (nn): a dense linear algebra algorithm used to find the K closest neighbors to a query point q in a set of reference data points in an n-dimensional space. The data in our case is latitude and longitude data and the calculated distances are Euclidean distances.

Needleman-Wunsch (nw): a dynamic programming algorithm that is used for DNA sequence alignment. The algorithm fills a matrix of potential pairs of DNA sequences with scores, each representing the value of the maximum weighted path ending at that cell. Then a trace-back process is used to search for an optimal alignment.

Pathfinder (pfinder): another dynamic programming algorithm that computes the path on a 2-dimensional grid with the smallest total cost. The grid is represented as a matrix, and the path is computed in blocks of rows.

C. Vulkan-specific optimizations

As shown in the example code in Listing 1, Vulkan uses completely different abstractions from CUDA and OpenCL. Effectively, in Vulkan, the programmer is not dealing with kernels, kernel arguments and kernel launches but with low-level command buffers and with recording commands in these buffers, such as binding compute pipelines, setting descriptor sets and binding buffers to descriptor sets.
One of the key synchronization mechanisms of Vulkan that we used when writing our benchmarks, and which produced performance improvements as shown in Section V-A2, is memory barriers. Memory barrier commands can be recorded in a command buffer, ensuring that commands recorded before the barrier are executed before the commands recorded after it. This allowed us to reduce the kernel launch overhead compared to the CUDA and OpenCL implementations, resulting in better performance as shown in Sections V-A2 and V-B2.

Most of our benchmarks use iterative algorithms. The CUDA and OpenCL implementations invoke the kernel multiple times, once for every iteration, whereas in our Vulkan implementations we record the work of all iterations in one command buffer and synchronize using memory barriers between iterations, instead of naively creating a command buffer for every iteration. Effectively, we incur only a single communication overhead when the command buffer is submitted, compared to the CUDA and OpenCL implementations, which incur kernel launch overheads on every iteration.

One can argue that the CUDA and OpenCL implementations can be changed to enqueue iterations ahead of time without blocking. The problem with this solution is that it does not honor the data dependencies between iterations. Subsequent iterations depend on the data generated in previous iterations. Neither CUDA nor OpenCL offers an inter-workgroup synchronization mechanism that can be used to honor these dependency requirements. This is a well-known limitation of these programming models, and the safest portable solution to achieve such synchronization is to use the so-called multi-kernel method. In this method the application is split into multiple kernels. Whenever an inter-workgroup synchronization is required, a transition from one kernel to another is made or, in the case of having only one kernel, this kernel is launched again.
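The difference between the two strategies discussed in this subsection can be sketched with a toy recorder. This is a model, not Vulkan code: `ToyRecorder`, `Plan`, `multiKernel` and `singleCommandBuffer` are hypothetical names. In real host code `dispatch()` would correspond to vkCmdDispatch and `barrier()` to vkCmdPipelineBarrier with a memory barrier between compute shader stages. The sketch only counts how many host submissions each strategy needs for the same number of iterations.

```cpp
#include <string>
#include <vector>

// Hypothetical recorder standing in for a command buffer. "dispatch" models
// vkCmdDispatch; "barrier" models vkCmdPipelineBarrier with a memory barrier.
struct ToyRecorder {
    std::vector<std::string> commands;
    void dispatch() { commands.push_back("dispatch"); }
    void barrier() { commands.push_back("barrier"); }
};

// Result of a strategy: how many submissions the host makes, and the
// commands the device sees overall.
struct Plan {
    int submissions;
    std::vector<std::string> commands;
};

// Multi-kernel method: one launch per iteration. Returning to the host
// between launches is what provides the barrier semantics.
Plan multiKernel(int iterations) {
    Plan plan{0, {}};
    for (int i = 0; i < iterations; ++i) {
        ToyRecorder recorder;
        recorder.dispatch();
        plan.commands.insert(plan.commands.end(),
                             recorder.commands.begin(), recorder.commands.end());
        ++plan.submissions;  // launch overhead is paid on every iteration
    }
    return plan;
}

// Single command buffer: all iterations are recorded up front, separated by
// barriers that preserve the data dependencies, then submitted once.
Plan singleCommandBuffer(int iterations) {
    ToyRecorder recorder;
    for (int i = 0; i < iterations; ++i) {
        recorder.dispatch();
        if (i + 1 < iterations) recorder.barrier();
    }
    return Plan{iterations > 0 ? 1 : 0, recorder.commands};
}
```

For four iterations, `multiKernel(4)` reports four submissions while `singleCommandBuffer(4)` reports one submission carrying four dispatches and three barriers. The model says nothing about how large the saved overhead is; that depends on the driver and platform.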
The transfer of control from the GPU to the CPU implicitly provides the required barrier semantics. The Rodinia CUDA and OpenCL implementations use this method to achieve such inter-workgroup synchronization and satisfy the data dependencies between iterations.

TABLE II: Desktop GPUs Experimental Setup

| | NVIDIA GTX1050Ti | AMD RX560 |
|---|---|---|
| Operating System | Ubuntu 16.04 64-bit | Ubuntu 16.04 64-bit |
| CPU | Intel(R) Core(TM) i5-2500K CPU 3.30GHz x4 | Intel(R) Core(TM) i5-2500K CPU 3.30GHz x4 |
| Memory | CPU Memory = 16 GB, GPU Memory = 4 GB | CPU Memory = 16 GB, GPU Memory = 4 GB |
| Driver | Linux Display Driver 381.22 | AMDGPU-Pro Driver 17.10 |
| OpenCL | OpenCL 1.2 | OpenCL 2.0 |
| CUDA | CUDA 8.0 | - |
| Vulkan | API Version 1.0.42 | API Version 1.0.37 |

D. Porting to mobile devices

One of the major strengths of Vulkan is its portability. However, performance improvements are not necessarily portable, and developers often have to adapt and rewrite their applications with respect to the targeted architecture. In fact, it has been shown that performance is not portable when running OpenCL applications targeting GPUs on CPU- or FPGA-like architectures [17] [21]. To address this concern and assess whether Vulkan is a good candidate for GPGPU computing on mobile devices, we ported our benchmarks plus their corresponding Rodinia OpenCL implementations onto mobile GPUs. We chose Android 7.0 as our OS because it supports Vulkan out of the box, allowing us to target many mobile GPUs. We cross-compiled all of our benchmarks for the x86, x86-64, armeabi-v7a and arm64-v8a binary targets and developed an Android application that bundles these benchmarks with their required data sets. When developing the VComputeBench Android application, we set a requirement of not needing root access, so that it can be released on the Android application store, allowing millions of users to check and compare the performance of the GPUs and Vulkan implementations inside their devices.
This was challenging, and we had to resort to bundling the benchmarks as libraries in order to satisfy Android security restrictions on binary executables.

V. EXPERIMENTAL RESULTS

In this section we report the results of our empirical evaluation of Vulkan performed on several GPU architectures. We use two types of benchmarks: self-written micro-benchmarks to highlight and assess specific attributes, and our VComputeBench plus Rodinia benchmarks to assess performance using representative real-world applications. We compare Vulkan results to those of CUDA and OpenCL on two desktop GPUs and two mobile GPU platforms. For consistency, we measure the execution times on the CPU using C++11 std::chrono. To minimize measurement errors, we execute each benchmark several times and report the average of the obtained execution times.

A. Evaluations on Desktop Platforms

We chose two recent desktop GPUs employing the latest GPU architectures: the NVIDIA GTX1050Ti, employing NVIDIA's Pascal architecture, and the AMD RX560, employing AMD's Polaris architecture. Table II shows the configuration details of these platforms.

[Fig. 1: Vulkan memory bandwidth vs CUDA and OpenCL. Panels: (a) NVIDIA GTX1050Ti (Vulkan, CUDA); (b) AMD RX560 (Vulkan, OpenCL). x-axis: Stride (4 bytes per element); y-axis: Bandwidth (GB/s).]

1) Memory Bandwidth Evaluation: To evaluate how the programming model affects memory bandwidth and assess whether we can achieve high memory bandwidth when using Vulkan, we developed a strided memory access micro-benchmark in Vulkan, CUDA and OpenCL. We vary the stride when reading array elements and measure the achieved bandwidth.
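The access pattern and timing scheme of such a micro-benchmark can be sketched on the CPU. The code below is only an illustration: it is not the Vulkan, CUDA or OpenCL kernel used in the evaluation, all function names are hypothetical, and it assumes that bandwidth is computed from the bytes actually read divided by the average time over several runs, measured with std::chrono as described above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read every `stride`-th 4-byte element. The sum is returned so that the
// compiler cannot discard the loads.
std::uint64_t strided_sum(const std::vector<std::uint32_t>& data, std::size_t stride) {
    assert(stride > 0);
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < data.size(); i += stride) sum += data[i];
    return sum;
}

// Number of elements visited by strided_sum for an array of n elements.
std::size_t elements_touched(std::size_t n, std::size_t stride) {
    assert(stride > 0);
    return (n + stride - 1) / stride;
}

// Average bandwidth in GB/s over `runs` repetitions (assumed definition:
// bytes read divided by average wall-clock time).
double measure_bandwidth_gbs(const std::vector<std::uint32_t>& data,
                             std::size_t stride, int runs) {
    assert(runs > 0);
    using clock = std::chrono::steady_clock;
    volatile std::uint64_t sink = 0;  // keeps the result observable
    double total_seconds = 0.0;
    for (int r = 0; r < runs; ++r) {
        const auto start = clock::now();
        sink = strided_sum(data, stride);
        const auto stop = clock::now();
        total_seconds += std::chrono::duration<double>(stop - start).count();
    }
    const double avg_seconds = total_seconds / runs;
    const double bytes = static_cast<double>(
        elements_touched(data.size(), stride) * sizeof(std::uint32_t));
    return bytes / avg_seconds * 1e-9;
}
```

Calling `measure_bandwidth_gbs` for a range of strides on a large array gives one bandwidth value per stride. On a CPU the absolute numbers reflect the cache hierarchy rather than GPU memory coalescing, so only the structure of the measurement carries over.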
For reference, both of our platforms use GDDR5 memory with an effective memory clock of 7 GHz and a 128-bit memory interface, resulting in a theoretical peak bandwidth of 112 GB/s, which can be calculated using:

$$BW_{peak} = Freq \cdot (BusWidth / 8) \cdot 10^{-9}$$

With $Freq = 7 \cdot 10^{9}$ Hz and $BusWidth = 128$ bits this gives $7 \cdot 10^{9} \cdot 16 \cdot 10^{-9} = 112$ GB/s.

The obtained results are shown in Figure 1. On both platforms, Vulkan provides performance comparable to CUDA and OpenCL for strides less than 64 bytes and slightly better performance for strides larger than 64 bytes. As expected, unit stride provides the maximum achieved bandwidth: 84% and 79.6% of the peak bandwidth for CUDA and Vulkan respectively on the GTX1050Ti. Likewise, on the RX560, Vulkan achieves 71.6% of the peak bandwidth compared to 71.5% for OpenCL. Overall, this test shows that high memory bandwidth can be attained using Vulkan and that data layout in memory is more important than the programming model used.

2) Benchmarks Evaluations: Figure 2 shows the speedup results of the selected benchmarks comparing Vulkan, CUDA and OpenCL for different workloads. We chose OpenCL as our baseline for speedup calculations because it is supported on both platforms. To make a fair comparison, we report only kernel execution times, not total benchmark times, because OpenCL JIT compilation and explicit context management generally exhibit a high overhead, resulting in longer total times [32] [17].

[Fig. 2: Vulkan speedup vs CUDA and OpenCL for the Rodinia benchmarks. Panels: (a) NVIDIA GTX1050Ti (OpenCL, Vulkan, CUDA); (b) AMD RX560 (OpenCL, Vulkan). y-axis: Speedup. x-axis: input sizes per benchmark: bfs (4K, 64K, 1M), backprop (4K, 64K, 256K), cfd (97K, 193K, 232K), gaussian (208, 1024, 2048), hotspot (512-08, 512-16, 512-32), lud (256, 512, 2048), nn (256K, 8M, 16M), nw (4K, 8K, 16K), pathfinder (10K, 50K, 100K).]

Overall, for most benchmarks Vulkan provides better performance than CUDA and OpenCL, resulting in geometric mean speedups of 1.53x with respect to CUDA on the GTX1050Ti and 1.26x with respect to OpenCL on the RX560. However, since the benchmarks exhibit different computation patterns, their individual results vary. The best speedups are attained with the pathfinder, hotspot, lud and gaussian benchmarks. The reason is that these benchmarks use iterative algorithms, invoking the kernel multiple times. Subsequent invocations use data generated in previous iterations, requiring control to return to the CPU and incurring kernel launch overhead on every iteration. Vulkan enables us to eliminate these repeated kernel launches and their communication overheads by recording the work of all iterations in one command buffer and adding memory barriers between iterations to satisfy the dependency requirements. Effectively, we incur a single communication overhead when the command buffer is submitted. Our results are consistent with the kernel launch overhead findings of [15]. Figure 2 also shows that, for most of these workloads, the speedup increases as we increase the input size. Larger input means more iterations and less overhead compared to CUDA and OpenCL, and thus better Vulkan performance.

An interesting result is that of cfd. Although it uses an iterative algorithm, we do not get similar speedups. This benchmark has 3 compute-intensive kernels, and for every iteration we have to bind 3 different compute pipelines, representing these kernels, to our single command buffer.
This overhead of binding compute pipelines, plus the longer kernel computation times, makes the launch overhead savings less significant. cfd also does not scale well with input size because the number of iterations is fixed and does not depend on input size. Vulkan cfd achieves a 1.38x speedup vs CUDA and a 1.04x speedup vs OpenCL, averaged over both platforms.

In contrast, we get a slowdown for bfs on both platforms. To investigate this, we disassembled the Vulkan and OpenCL kernels using the AMD CodeXL tool [33]. We discovered that the OpenCL-generated ISA code is optimized to use workgroup local memory, whereas the Vulkan-generated ISA uses plain buffer loads from global memory. This optimization of memory accesses significantly affects performance because bfs is memory-bound [34]; it predominantly performs loads and stores with very few ALU operations. Although we use the same driver, the generated ISA is different for Vulkan. We can therefore deduce that the Vulkan SPIR-V compiler inside the driver is not as mature as the OpenCL one. This is expected, as Vulkan was released only recently, and we expect support to improve in the future.

The remaining benchmarks, backprop, nn and nw, do not involve any dependencies between kernel invocations. The Vulkan implementations record these kernels into different command buffers and submit them simultaneously to the GPU, resulting in performance similar to CUDA and OpenCL, with slight variations between the platforms.

TABLE III: Mobile GPUs Experimental Setup
                 | Qualcomm Snapdragon 625 | Google Nexus Player
Operating System | Android 7.0             | Android 7.1
CPU              | ARM Cortex A53 x8       | Intel Atom(TM) x4
GPU              | Adreno 506              | Rogue G6430
OpenCL           | OpenCL 2.0              | OpenCL 1.2
Vulkan           | API Version 1.0.20      | API Version 1.0.30

B. Evaluations on Mobile Platforms

We used two platforms: Google's Nexus Player and Qualcomm's Snapdragon 625, employing the Imagination G6430 and the Adreno 506 GPUs respectively.
The platforms were chosen because both GPU vendors provide unofficial OpenCL support (see footnote 3). Table III summarizes the configuration details of these two platforms.

1) Memory Bandwidth Evaluation: We ran the same strided memory access micro-benchmark, described in Section V-A1, on our selected mobile platforms. The obtained results are shown in Figure 3. On the Nexus platform OpenCL achieves a bandwidth of 2.85 GB/s at unit stride, whereas Vulkan achieves only 2.69 GB/s, corresponding to about 89% and 84% of peak bandwidth respectively. For strides larger than 4 bytes, Vulkan surprisingly performs slightly better than OpenCL. However, on the Snapdragon platform, Vulkan performs worse than OpenCL at strides less than 16 bytes, but we get nearly the same bandwidth for strides above 16 bytes. We suspect that the Snapdragon driver does not properly support Vulkan's push constants, which we use to set the stride constant inside the command buffer when varying the stride, and treats them as normal storage buffers instead. This can result in worse performance because binding these buffers is required for every iteration. For larger strides this effect becomes negligible because the execution times are longer. Overall, the main observation here is that on these mobile platforms Vulkan can provide performance comparable to OpenCL, but with slight degradation, and again data layout in memory is more important than the programming model used.

2) Benchmarks Evaluations: Due to memory size restrictions on these platforms, we had to choose smaller workload input sizes. cfd could not fit on either platform as it uses larger data sets describing flux flow data. Also, the backprop OpenCL and Vulkan implementations failed to run on Nexus, and on Snapdragon only the lud OpenCL implementation failed, because of driver issues. The results are shown in Figure 4.
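Speedups in this evaluation are computed relative to the OpenCL baseline and summarized as geometric means. A minimal sketch of that summary step is given below; the function names are hypothetical and any timings passed to them would be the reader's own measurements, not values from this paper.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Speedup of a candidate over a baseline: values above 1 mean the
// candidate is faster, values below 1 mean it is slower.
double speedup(double baseline_time, double candidate_time) {
    assert(baseline_time > 0.0 && candidate_time > 0.0);
    return baseline_time / candidate_time;
}

// Geometric mean, computed in log space to avoid overflow when many
// values are multiplied together.
double geometric_mean(const std::vector<double>& values) {
    assert(!values.empty());
    double log_sum = 0.0;
    for (double v : values) {
        assert(v > 0.0);
        log_sum += std::log(v);
    }
    return std::exp(log_sum / static_cast<double>(values.size()));
}
```

The geometric mean is a common choice for ratios such as speedups because a 2x speedup and a 2x slowdown cancel out to 1, which an arithmetic mean would not reflect.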
Figure 4 shows that Vulkan does better on Nexus than on Snapdragon, achieving geometric mean speedups of 1.59x on Nexus and 0.83x on Snapdragon. On the Nexus platform, Vulkan shows speedups across most benchmarks except hotspot, which is largely consistent with the results obtained on desktop GPUs. The best speedups are again attained with the pathfinder, gaussian and lud benchmarks because the kernel launch overhead is minimized. On the Snapdragon platform, further investigations are required to explain the exhibited slowdown. However, since all benchmarks except pathfinder exhibited slowdowns, we think this can be related to the immaturity of the Vulkan drivers on this platform compared to the OpenCL ones. We expect this to improve in the future as better Vulkan support is rolled out. Overall, these results are very interesting in that they demonstrate that performance portability is not necessarily guaranteed, even though the programming model is portable. We can conclude that Vulkan performance improvements can be portable to mobile GPUs as long as there is good driver support from vendors.

Footnote 3: The OpenCL library on the Nexus Player is not even called libOpenCL.so. It is provided as libpvrcpt.so.

[Fig. 3: Vulkan memory bandwidth vs CUDA and OpenCL. Panels: (a) Nexus Player (Vulkan, OpenCL); (b) Snapdragon 625 (Vulkan, OpenCL). x-axis: Stride (4 bytes per element); y-axis: Bandwidth (GB/s).]

VI. DISCUSSION

A. Vulkan Limitations

As you may have observed from the example application described in Listing 1, the key limitation of Vulkan is its verbosity. Vulkan's low-level nature makes it very verbose, requiring a high programming effort.
For example, to create a simple buffer one has to:
• Create a buffer object
• Get the memory requirements for that object
• Decide which memory heap to use
• Allocate memory on the chosen heap
• Bind the buffer object to the memory allocation
This simple buffer creation requires about 40 lines of code in Vulkan compared to just one line in CUDA or OpenCL, where cudaMalloc and clCreateBuffer are used respectively.

In addition, Vulkan's principle of explicit control pushes a lot of responsibility onto the programmer. The application layer is proportionally more complex. Programmers have to deal with issues such as memory allocation, resource tracking, object creation and destruction, and so on. Experience shows that programming in such a style can be error-prone and less productive. Vulkan's verbosity and the additional responsibility it imposes on the programmer introduce productivity issues and hence can be an obstacle to adopting it as a GPGPU programming model.

[Fig. 4: Vulkan speedup vs OpenCL on mobile devices. Panels: (a) Nexus: Imagination PowerVR G6430 GPU; (b) Snapdragon: Qualcomm Adreno 506 GPU. Series: OpenCL, Vulkan. y-axis: Speedup. x-axis: input sizes per benchmark: bfs (4K, 16K), bprop (64K, 256K), gauss (208, 416), hotspot (128-8, 128-16), lud (64, 256), nn (256K, 8M), nw (1K, 2K), pfinder (512, 1024).]

B. Recommended Vulkan Optimizations

Vulkan introduces some low-level controls that can be utilized for extra performance. As a takeaway from our experience writing the VComputeBench benchmarks, we recommend the following for better Vulkan performance:
• For iterative algorithms, use a single command buffer and synchronize using memory barriers. This proved to be effective in our evaluations.
• For parameter changes of small data types, it is better to use push constants (PushConstants) rather than binding a whole parameters buffer. Push constants are specific to a pipeline, and their maximum size varies by platform: on the GTX1050Ti and RX560 you get maximum sizes of 256 bytes and 128 bytes respectively, and on both the Nexus and Snapdragon platforms you get a maximum of 128 bytes.
• Try to minimize going back to the CPU for control and leverage Vulkan's synchronization primitives to stay on the GPU as much as possible.
• For large memory transfers use transfer queues. These dedicated transfer queues should be used for large copy commands as they are usually tied to DMA engines inside the hardware.
• For better workload balancing, make use of multiple compute queues whenever possible. This gives the GPU's scheduler more room for manoeuvre, resulting in better utilization.

VII. CONCLUSION

This paper presented Vulkan as a new programming model for cross-platform GPGPU computing, notably on mobile and embedded GPUs. We developed a set of compute benchmarks by extending the Rodinia suite with Vulkan benchmarks and used them to evaluate this emerging programming model. Indeed, Vulkan's low-level control over the underlying hardware offers opportunities for better performance. Our results show that, by exploiting Vulkan's synchronization mechanisms, average speedups of 1.53x and 1.66x versus CUDA and OpenCL were attained across the selected benchmarks. We also show that similar performance improvements can be seen on some mobile GPU architectures, but performance portability is not necessarily guaranteed. Issues such as driver support and implementation quality come into play. Finally, we illustrate that these performance improvements come at a cost, manifested in a high programming effort. These programmability issues can be an obstacle to adopting Vulkan as a GPGPU programming model.
Directions for future work could include improving the programmability of this emerging programming model.

ACKNOWLEDGMENT

This material is based upon work supported by the European Union Horizon 2020 research and innovation programme under Grant No. 688759, Project LPGPU2.

REFERENCES

[1] J. D. Owens, D. Luebke, N. Govindraju, M. Harris, J. Kruger, A. E. Lefohn, and T. J. Purcell, "A Survey of General Purpose Computation on Graphics Hardware," Computer Graphics Forum, vol. 26, pp. 80–113, 2006.
[2] Nvidia Corporation, "CUDA Toolkit Documentation," 2017. [Online]. Available: http://docs.nvidia.com/cuda/
[3] The Khronos OpenCL Working Group, "The OpenCL Specification," 2017. [Online]. Available: https://www.khronos.org/registry/OpenCL/specs/opencl-2.2.html
[4] OpenMP Architecture Review Board, "OpenMP Application Programming Interface," 2015. [Online]. Available: http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf
[5] The OpenACC Standard.org, "The OpenACC Application Programming Interface," 2015. [Online]. Available: https://www.openacc.org/sites/default/files/inline-files/OpenACC.2.6.final-changes.pdf
[6] The Khronos Vulkan Working Group, "The Vulkan Specification," 2017. [Online]. Available: https://www.khronos.org/registry/vulkan/specs/1.0/html/vkspec.html
[7] J. Kessenich, B. Ouriel, and R. Krisch, "SPIR-V Specification," 2017. [Online]. Available: https://www.khronos.org/registry/spir-v/specs/1.2/SPIRV.html
[8] N. M. Dongre, "A Research On Android Technology With New Version Naugat(7.0,7.1)," IOSR Journal of Computer Engineering, vol. 19, no. 02, pp. 65–77, 2017.
[9] T. Linux Foundation Project, "Tizen 3.0 Public M2 Release Notes," 2017. [Online]. Available: https://developer.tizen.org/tizen/tizen/release-notes/tizen-3.0-public-m2
[10] S. Che, M. Boyer, J. Meng, D. Tarjan, S. Lee, J. W. Sheaffer, and K. Skadron, "A Benchmark Suite for Heterogeneous Computing," IEEE International Symposium on Workload Characterization, pp. 44–54, 2009.
[11] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, G. D. Liu, and W.-M. W. Hwu, "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," IMPACT Technical Report, 2012.
[12] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The Scalable HeterOgeneous Computing (SHOC) Benchmark Suite Categories and Subject Descriptors," Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units, pp. 63–74, 2010.
[13] Y. Sun, X. Gong, A. K. Ziabari, L. Yu, X. Li, S. Mukherjee, C. McCardwell, A. Villegas, and D. Kaeli, "Hetero-mark, a benchmark suite for CPU-GPU collaborative computing," in Proceedings of the 2016 IEEE International Symposium on Workload Characterization, IISWC 2016, 2016, pp. 13–22.
[14] K. Karimi, N. G. Dickson, and F. Hamze, "A Performance Comparison of CUDA and OpenCL," ArXiv e-prints, vol. arXiv, no. 1, p. 1005.2581, 2010.
[15] J. Fang, A. L. Varbanescu, and H. Sips, "A comprehensive performance comparison of CUDA and OpenCL," Proceedings of the International Conference on Parallel Processing, pp. 216–225, 2011.
[16] R. Sachetto Oliveira, B. M. Rocha, R. M. Amorim, F. O. Campos, W. Meira, E. M. Toledo, and R. W. dos Santos, "Comparing CUDA, OpenCL and OpenGL Implementations of the Cardiac Monodomain Equations." Springer, Berlin, Heidelberg, 2012, pp. 111–120.
[17] P. Du, R. Weber, P. Luszczek, S. Tomov, G. Peterson, and J. Dongarra, "From CUDA to OpenCL: Towards a performance-portable solution for multi-platform GPU programming," Parallel Computing, vol. 38, no. 8, pp. 391–407, 2012.
[18] C.-L. Su, P.-Y. Chen, C.-C. Lan, L.-S. Huang, and K.-H. Wu, "Overview and comparison of OpenCL and CUDA technology for GPGPU," in 2012 IEEE Asia Pacific Conference on Circuits and Systems. IEEE, 12 2012, pp. 448–451.
[19] J. Kim, T. T. Dao, J. Jung, J. Joo, and J. Lee, "Bridging OpenCL and CUDA," Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15, no. November, pp. 1–12, 2015.
[20] H. C. D. Silva, F. Pisani, and E. Borin, "A Comparative Study of SYCL, OpenCL, and OpenMP," 2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW), pp. 61–66, 2016.
[21] Z. Wang, B. He, W. Zhang, and S. Jiang, "A performance analysis framework for optimizing OpenCL applications on FPGAs," in Proceedings - International Symposium on High-Performance Computer Architecture, vol. 2016-April, 2016, pp. 114–125.
[22] A. Sampson, "Let's Fix OpenGL," 2nd Summit on Advances in Programming Languages (SNAPL 2017), vol. 71, pp. –, 2017.
[23] A. Blackert, Evaluation of Multi-Threading in Vulkan. Linköping University, 2016.
[24] J. Kessenich, "SPIR-V A Khronos-Defined Intermediate Language for Native Representation of Graphical Shaders and Compute Kernels," 2015. [Online]. Available: https://www.khronos.org/registry/spir-v/papers/WhitePaper.pdf
[25] G. Wang and Y. Xiong, "Accelerating computer vision algorithms using OpenCL framework on the mobile GPU-a case study," IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
[26] M. M. Trompouki, L. Kosmidis, and U. Polit, "Optimisation Opportunities and Evaluation for GPGPU applications on Low-End Mobile GPUs," Date, pp. 950–953, 2017.
[27] L. Tobias, A. Ducournau, F. Rousseau, G. Mercier, and R. Fablet, "Convolutional Neural Networks for object recognition on mobile devices: A case study," 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 3530–3535, 2016.
[28] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, "A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads," in IEEE International Symposium on Workload Characterization, IISWC'10, 2010.
[29] G. Misra, N. Kurkure, A. Das, M. Valmiki, S. Das, and A. Gupta, "Evaluation of rodinia codes on Intel Xeon Phi," in Proceedings - International Conference on Intelligent Systems, Modelling and Simulation, ISMS, 2013, pp. 415–419.
[30] The Khronos Group, "Glslang Reference Compiler," 2017. [Online]. Available: https://github.com/KhronosGroup/glslang
[31] K. Asanovic, B. C. Catanzaro, D. Patterson, and K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley," Tech. Rep., 2006.
[32] J. H. Lee, N. Nigania, H. Kim, K. Patel, and H. Kim, "OpenCL Performance Evaluation on Modern Multicore CPUs," Scientific Programming, vol. 2015, pp. 1–20, 10 2015.
[33] GPUOpen AMD, "CodeXL Tool Suite," 2017. [Online]. Available: https://github.com/GPUOpen-Tools/CodeXL
[34] S. Lal, J. Lucas, and B. Juurlink, "E^2MC: Entropy Encoding Based Memory Compression for GPUs," in 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 5 2017, pp. 1119–1128.