Document [original]

This version is available at https://doi.org/10.14279/depositonce-8712

other uses, in any current or future media, including reprinting/republishing this material for advertising or

promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse

of any copyrighted component of this work in other works.

Accepted for 2019 International Conference on High Performance Computing & Simulation (HPCS)

Maier, D.; Mammeri, N.; Cosenza, B.; Juurlink, B. (2019): Approximating Memory-bound Applications on

Mobile GPUs. 2019 International Conference on High Performance Computing & Simulation (HPCS). IEEE.

Daniel Maier, Nadjib Mammeri, Biagio Cosenza, Ben Juurlink

Approximating Memory-bound

Applications on Mobile GPUs

Accepted manuscript (Postprint)Conference paper |

Approximating Memory-bound Applications on

Mobile GPUs

Daniel Maier, Nadjib Mammeri, Biagio Cosenza, Ben Juurlink

Technische Universität Berlin, Germany

{daniel.maier,mammeri,cosenza,b.juurlink}@tu-berlin.de

Abstract—Approximate computing techniques are often used

to improve the performance of applications that can tolerate some

amount of impurity in the calculations or data. In the context of

embedded and mobile systems, a broad number of applications

have exploited approximation techniques to improve performance

and overcome the limited capabilities of the hardware. On such

systems, even small performance improvements can be sufficient

to meet scheduled requirements such as hard real-time deadlines.

We study the approximation of memory-bound applications

on mobile GPUs using kernel perforation, an approximation

technique that exploits the availability of fast GPU local memory

to provide high performance with more accurate results. Using

this approximation technique, we approximated six applications

and evaluated them on two mobile GPU architectures with very

different memory layouts: a Qualcomm Adreno 506 and an

ARM Mali T860 MP2. Results show that, even when the local

memory is not mapped to dedicated fast memory in hardware,

kernel perforation is still capable of 1.25×speedup because of

improved memory layout and caching effects. Mobile GPUs with

local memory show a speedup of up to 1.38×.

Index Terms—approximate computing, GPU, kernel perforation

I. INTRODUCTION

Many applications have the property of being inherently

resilient ([1]): they are still capable of producing acceptable

results even if part of the computations is incorrect or approxi-

mated. Inherent resilience [1] is a property that enables a wide

number of code transformations and optimization techniques

where accuracy is purposely reduced under an acceptable

limit in order to gain better performance or to reduce energy

consumption. Research in the area of approximate computing

aims at studying and exploiting the gap between the accuracy

required by an application and the one provided by a system,

offering the possibility to have large performance gain out of

minimal accuracy degradation. Recent examples of approxi-

mate computing come from image processing [2], machine

learning [3], object recognition [4] and graph algorithms [5],

but the potential application scenarios are countless.

In the context of mobile and embedded systems, approxima-

tion techniques can be very important: on one hand, we have

systems with limited resources, which can reach, e.g., inter-

active performance with the huge performance improvement

gained by approximation; similarly, approximation techniques

may drastically reduce the power consumption of a program.

On the other hand, many typical embedded applications al-

ready provide error tolerance to some extent. For example,

applications in charge of processing real-time audio or video

signals are often required to deal with signals acquired under

challenging conditions or large sensor networks based on low-

cost, unreliable and inaccurate measuring instruments.

A traditional way to approximate applications is through

loop perforation [6], which skips iterations of a loop with

a static pattern. Recently, dynamic skipping of iterations

or instructions in loops has also been explored [7]. Loop

perforation has been exploited also on GPUs [6], [8].

However, these works are limited in two aspects. The

acceptable error is very high, 10% on average, unacceptable

for many applications. The second problem is related to the use

of local memory, which can be accessed with very low latency

and is shared by all threads in a work group. Exploiting local

memory is extremely important to get high performance on

GPUs, however, most existing approximated GPU applications

do not use local memory. These two limitations have been

recently addressed by Maier et al. [9] with local memory-

aware kernel perforation. Their approximation technique is

tailored for GPUs and uses the fast local memory for improv-

ing the accuracy of perforation. The technique was inspired

by loop perforation that skips loop iterations. In contrast,

kernel perforation skips the loading of parts of the input

data in order to speed up memory-bound applications. Local

memory is used in the perforation phase, to cache samples

for different threads, and in a successive reconstruction phase,

which further improves the accuracy of the approximation.

Embedded GPUs represent a challenging architecture for

kernel perforation. Embedded and desktop GPUs share sim-

ilarities, e.g., a SIMT-like execution model based on SIMD

units for vector processing. However, both computing and

memory capabilities are fundamentally different. An example

is the presence of dedicated local memory: e.g., ARM Mali

does not provide local memory, while a Qualcomm Adreno

GPU provides dedicated local memory of 32 kB for each

compute unit.

This work aims at using advanced kernel perforation tech-

niques on embedded GPUs. In particular, we focus our analysis

on how the availability of dedicated local memory impacts

the accuracy and performance of kernel perforation. The

contributions of this paper are:

1) a first application of local memory-aware kernel perfo-

ration to embedded GPUs;

2) an evaluation study of kernel perforation techniques in

relation to the availability of local memory on two

embedded GPUs (Qualcomm Adreno 506 and ARM

Mali T860 MP2) and six test applications.

The paper is organized as follows: Related work is discussed

in Section II. Section III gives an introduction of how the

concept of kernel perforation can be used to approximate

OpenCL kernels and its application to embedded GPUs. Sec-

tion IV and V discuss, respectively, the experimental settings

and the results of our evaluation on two embedded GPUs and

a desktop GPU. Finally, Section VI concludes the paper and

gives directions for future work.

II. RELATED WORK

Research of approximate techniques has been conducted

from many different perspectives, ranging from hardware [10]

and compiler approaches [6], [11], to programming language

support [12], [13], [14], [15] and software solutions [6], [8],

[2], [7]. In this section, we discuss the most relevant related

works in the field of approximated computing. For a more

detailed overview we indicate Mittal’s survey paper [16].

There have been various approaches that are hardware-

based. One of these approaches was presented by Lipasti et

al. [17] and is called Load Value Prediction, which skips the

execution stall due to a cache miss by predicting the value

based on locality. However, if the error of a predicted value

is too large a rollback is necessary. Load Value Approxi-

mation [18] overcomes this limitation by not verifying the

predicted values, thus not involving the burden of rollbacks.

Yazdanbakhsh et al. [19] presented a similar approach for

GPUs that focuses on memory bandwidth, instead of the sole

latency. A fraction of cache misses are approximated without

any checking for the quality of the predictions. The predictor

utilizes value similarity across threads. The programmer must

specify which loads can be approximated and which are

critical. The fraction to be approximated is used as a knob

to control the approximation error. Lal et al. [20] increase

the compression rate in memory compression systems by

selectively approximating bytes to save extra memory requests

and show speedups of up to 1.35×.

Compiler approaches have been suggested as well for au-

tomatic approximation. Samadi et al. [11] presented SAGE,

a framework consisting of (a) a compilation step in which a

CUDA kernel is optimized using approximation techniques,

and (b) a runtime system that ensures that the target output

quality criteria are met. PARAPROX [8] is a framework for

transparent and automated approximation of data-parallel ap-

plications. Input to the framework is an OpenCL or CUDA

kernel, which is parametrized by applying different approx-

imation techniques, depending on the detected data-parallel

pattern. A runtime helper is then employed to choose those

kernel parameters that meet the specified output quality. For

an error budget of 10% they reported an average performance

gain of 2.7×. Mitra et al. [21] recognized that there are

different phases in many applications, each with very different

sensitivity to approximation. They presented a framework that

detects these phases in applications and searches for specific

approximation levels for each of the phases. For an error

budget of 5% they report a speedup of 16%. By allowing for

an error budget of 20% the speedup increases to 72%.

Several related works have been utilizing software-based

approaches for leveraging application’s resilience to some

amount of error. An analysis of inherent application resilience

has been conducted by Chippa et al. [1]. They presented a

framework for Application Resilience Characterization (ARC)

that partitions an application into resilient and sensitive parts,

and proposed approximation models to analyze the resilient

parts. Lou et al. [2] presented image perforation, a tech-

nique specifically designed for accelerating image pipelines.

By transforming loops so that they skip certain particular

expensive to calculate samples speedups of 2×up to 10×were

reported. Subsequent pipeline stages rely on the presence of

these samples, and they can be reconstructed using different

methods (nearest-neighbor, Gaussian and multi-linear interpo-

lation).

III. KERNEL PERFORATION ON MOBILE GPUS

This paper evaluates the impact of state-of-the-art approx-

imation techniques on embedded GPUs. We focus on an

approach specifically designed to target GPUs [9]. In this sec-

tion, we provide an overview of the approach (Section III-A),

describing how kernel perforation is performed (Section III-B)

and which approximation schemes (Section III-C) are used,

and presenting how the techniques can be tailored for embed-

ded GPUs (Section III-D).

Input

buffer

Kernel

execution

Output

buffer

(a) Accurate GPU application.

Input

buffer

Data

perforation

Data

recon-

struction

Kernel

execution

Output

buffer

(I) (Ia) (Ib) (II) (III)

(b) Local memory-aware kernel perforation.

Fig. 1: Accurate GPU application and local memory-aware

kernel perforation [9] approach.

A. Overview

In typical GPU applications, as depicted in Figure 1a, a

GPU kernel first fetches data from the input buffer in global

memory (I), then it performs its computations (II), and finally

it writes the result to the output buffer in global memory (III).

The latency for accessing the global memory is in general

very high, although it can be hidden to some extent by the

massively parallel architecture of the GPU and its scheduler.

A way to improve the performance of GPU kernels is to make

use of fast local memory, whose access latency is significantly

smaller than the one for global memory.

Maier et al. [9] proposed a novel way to perform kernel

perforation where the fast local memory is exploited for more

accurate approximation. Figure 1b shows their approach that is

also used in this paper. Kernel perforation extends the accurate

application with two additional steps: a data perforation phase

(Ia) that fetches a part of the input data; a data reconstruction

phase (Ib) that reconstructs the missing data and works on

local memory.

B. Kernel Perforation

Loop perforation [6] is an approximation technique that

improves the performance of a loop execution by skipping

some iterations. Loop perforation has been originally applied

to sequential code and can be easily parametrized through

tunable loops in order to trade accuracy for performance.

Maier et al. [9] discuss how perforation can be implemented

on GPUs by taking care of memory accesses, and introduce

the concepts of input approximation opposed to output approx-

imation for kernel perforation. The general idea is that many

applications are inherently resilient to the input as well as the

output, i.e., they can tolerate small errors, but because of the

long latency of memory access on GPUs, an approximation of

the input may take advantage of low-latency local memory to

improve the approximation. Furthermore, fast local memory

is used for reconstruction of the not loaded data in order to

minimize the error.

Input approximation works by skipping the loading of some

of the input data. If the input data is two-dimensional, e.g.,

an image, a possible input perforation scheme may skip every

other row. In general, input approximation can be a suitable

acceleration technique for any application that (a) processes

data with redundancy and (b) is resilient to some amount

of error in its input data. This is an advantage over output

approximation techniques that require spatial locality in the

output data set. Although it has been shown that output

approximation can be used for many applications, this is a

conceptual limitation.

The usage of local memory to prefetch data from global

memory is a well-known technique to accelerate GPU kernels.

Applications’ execution time usually benefits from the usage of

local memory if there is enough reuse of data, i.e., data loaded

by a thread is also re-used by the same or other threads, who

in turn also load data.

In the OpenCL programming model, this is implemented

using local memory, which is shared among all threads in

a work group. On GPUs, the latency of local memory is far

lower compared to the latency of accessing the global memory,

but its size is limited. Therefore, local memory is used to

implement the steps of (Ia) data perforation and of (Ib) data

reconstruction, as shown in Figure 1b.

C. Perforation Schemes

The perforation schemes determine which parts of the input

data are loaded from memory and which parts are approxi-

mated. Figure 2 shows three perforation schemes used in this

work. Each cell represents a value (e.g., a float) in the input

data. Colored cells represent data that is loaded from memory

and white cells represent data that is approximated. The cells

map to threads in a work group and the local memory used by

the threads. In a typical OpenCL program, each thread might

load one data element to local memory and the other threads

reuse the same data. When designing the perforation schemes

the GPU’s memory architecture needs to be taken into account.

Especially the cache organization is important to avoid loading

data into the cache and then not taking advantage of the data,

because it is actually perforated and approximated.

(a) Accurate (b) Stencil (c) Rows1 (d) Rows2

Fig. 2: 2D Perforation schemes for threads in a local group.

In this work we have used three different perforation

schemes depicted in Figure 2. In Figure 2 (a) we show the

accurate scheme where no data is approximated (red). Figure 2

(b) is a scheme that perforates the boundaries of a tile. It can

be used to approximate stencil applications where the data

elements on the edges need to be fetched additionally while

they have only a very small impact on the result. However,

the perforation scheme is application-specific and therefore

cannot be used with all applications. The schemes shown in

Figure 2 (c) and (d), by contrast, are generally applicable.

They perforate every other row and two out of three rows,

respectively. The schemes align very well with the memory

architecture of GPUs, as they take into account the cache

organization.

The parts of the input data that were not loaded from

memory need to be reconstructed. We use nearest-neighbor-

approximation to reconstruct the missing data, as our main

target is speedup and more sophisticated reconstruction tech-

niques also affect the performance.

D. Application to Embedded GPUs

It is important to remark that while the OpenCL pro-

gramming model exposes local memory (software) and a

series of functions to understand type and size, this does not

necessarily mean that it is always mapped into fast (hardware)

dedicated local memory. This is generally the case for desktop

GPUs, which typically expose as local memory their internal

physically shared memory of 16 kB or 32 kB.

However, this may be very different on embedded GPUs.

In Table I, we report the three devices used in this paper (two

embedded and a desktop for comparison). All devices report

32 kB of local memory. For the Adreno GPU and the AMD

GPU the memory type is local; however, for the Mali GPU

the type is global. This indicates that there is no dedicated,

local memory for Mali.

IV. EXPERIMENTAL EVALUATION

In our evaluation, we compare the performance of two dif-

ferent kernel perforation techniques with different parameters

on two mobile systems and one desktop system. We apply

the kernel perforation approach to benchmark applications and

measure performance and error. As the applications perform

TABLE I: HARDWARE PLATFORMS DETAILS.

Qualcomm ARM AMD

Adreno Mali FirePro

506 860 MP2 W5100

Class mobile mobile desktop

OpenCL version 2.0 1.2 1.2

Global memory size 1.39 GB 3.72 GB 3.5 GB

Unified memory yes yes no

Local memory type local global local

Local memory size 32 kB 32 kB 32 kB

the same approximations on different platforms the perfor-

mance differs while the error introduced by the approximation

is identical on all platforms.

A. Experimental Platforms

There are three important mobile GPU vendors: Imagina-

tion Technologies (PowerVR), Qualcomm (Adreno) and ARM

(Mali). While all of them support OpenCL in a way or another,

OpenCL support is not generally built into mobile operating

systems. In our study, we compare an Adreno GPU and a

Mali GPU with a desktop GPU, with a focus on the results on

mobile platforms equipped with and without local memory.

Adreno is a mobile GPU from Qualcomm that was orig-

inally developed by ATI. It is exclusively integrated within

Qualcomm’s Snapdragon SOC family. Our test device for the

Adreno platform is a Qualcomm MSM8953 Snapdragon 625

with an Adreno 506 GPU. It is equipped with 3 GB of memory.

The device is running Android 7.1.2 and Linux 3.18.31. This

mobile GPU has 32 kB of dedicated local memory.

ARM’s own GPU available for licensing with ARM cores

is Mali. Mali is integrated by many SOC manufactures, e.g.,

MediaTek and Samsung. Our device for the Mali platform is a

MediaTek Helio P10. This mobile GPU has no dedicated local

memory, even though it reports being equipped with 32 kB of

local memory through the OpenCL API.

Both mobile platforms are equipped with a SOC that

integrates both CPU and GPU. The mobile GPUs do not

have dedicated memory but share the same memory with

the CPU. While the shared global memory comes with all

negative effects of shared memory (e.g., lower bandwidth) it

can also provide the advantage of no necessity to transfer data

between the host’s memory and the GPU’s global memory.

The amount of main memory that the GPU can use may be

different from the available amount of memory. In our case

the Adreno platform is equipped with 3 GB of memory, while

OpenCL reports only less than 1.5 GB global memory. For the

Mali platform OpenCL reports nearly the full main memory.

For OpenCL, we rely on the operating systems support for

OpenCL drivers. While OpenCL is not officially supported by

the Android operating system, there are devices available that

have the relevant libraries and drivers included. This enables

us to run OpenCL programs in a real portable way without the

usage of any Android-specific software layers. We employ the

same source code for our applications on all three platforms.

However, the quality of the software stack is unclear. Some

of our applications failed to execute due to driver issues and

this led to missing data in our results.

We measure two values for each experiment: execution

time of the kernel using the OpenCL API and the error of

the approximated kernel with respect to an accurate kernel.

To ensure stable results and to minimize any effects from

the operating system, voltage scaling, etc., we repeated each

experiment 100 times with the first 50 as a warm-up time.

We use the average execution time of the kernel measured

using the OpenCL API of the last 50 executions. Moreover,

we disabled Dynamic Voltage and Frequency Scaling (DVFS).

B. Benchmark Applications

Our benchmark set consists of medical, signal processing

and image processing applications. Table II gives an overview

of the applications. We evaluate the execution time and the

accuracy when approximating the applications. We calculate

the speedup using the execution time of the accurate applica-

tion with optimal work group size as baseline. The error is

measured by first generating the true result as reference using

the accurate application. Then we calculate the error of the

result of approximated applications. We use the mean relative

error (MRE) as a metric for the error, except for the SOBEL

application. In this case the mean error (ME) is a more suitable

metric, as the MRE is undefined when the true value is zero.

TABLE II: APPLICATIONS USED IN THE EVALUATION.

Application Domain Error Metric

GAUSSIAN Signal processing Mean relative error

MEDIAN Medical imaging Mean relative error

INVERSION Image processing Mean relative error

SOBEL Image processing Mean error

The GAUSSIAN filter is a well-known low-pass filter, and

has applications in electronics and signal processing. The

GAUSSIAN has data-reuse across threads and, therefore, bene-

fits from the use of local memory in general. The filter kernel

width is either 3or 5. The MEDIAN filter is a nonlinear

spatial filter with applications in medical imaging and image

processing. The filter replaces each sample by the median of

adjacent samples. Our optimized implementation uses private

memory, i.e., registers, to load all samples in the current filter

kernel and compute the median. The filter kernel size is 3×3.

The INVERSION filter is an application that maps each input

value to its inverse. The filter has a kernel size width of

1and, therefore, there is no data reuse across threads. The

SOBEL operator is used in edge detection applications. We

use two versions: SOBEL3 with a filter kernel size of 3×3

and SOBEL5 using a 5×5kernel size. Previous work [9] has

shown that the error depends on the input data and can vary

a lot depending on both application and input data. In all our

tests we use input data that yields an approximate average error

for the application. Input data dimensions are 512 ×512. All

applications operate on single precision floating-point data.

Gauss3

Gauss5

Sob3

Sob5

Inv

Med

0.0

0.5

1.0

1.5

Speedup

(a) Qualcomm Adreno 506.

Gauss3

Gauss5

Sob3

Sob5

Inv

Med

0.0

0.5

1.0

1.5

Speedup

Baseline Stencil Rows1 Rows2

(b) ARM Mali T860 MP2.

Gauss3

Gauss5

Sob3

Sob5

Inv

Med

0.0

0.5

1.0

1.5

Speedup

Fig. 3: Comparison of speedup with respect to applications.

V. RESULTS

In our experimental evaluation we show the performance

and the accuracy on three platforms. We analyze in partic-

ular how the perforation schemes perform on the different

platforms with a consideration of the availability of dedicated

local memory. First, we examine the optimal execution times

on each platform. Then, we have a detailed look on the

performance of the applications on each platform. We compare

the performance of each approximated application across the

three platforms. Finally, we show the error introduced by the

different perforation schemes.

A. Performance on Three Platforms

In general, it can be noted that the technique can speed up

applications on all three platforms. In Figure 3 we compare the

three perforation schemes for six applications. Furthermore,

we show the geometric mean of the speedups (GM).

1) Qualcomm Adreno 506: The speedup for Qualcomm’s

GPU is between 7% and 37%, except for the INVERSION

application where there is actually a slowdown of 20%.

Applications with a larger filter kernel like GAUSSIAN5 and

SOBEL5 benefit more from approximation. The stencil perfora-

tion scheme (light blue) always yields the largest speedup. The

Rows2 perforation scheme performs the second best, except

for the MEDIAN application. The approximated kernels of the

INVERSION application are slowed down. The geometric mean

(GM) of all perforation schemes is positive.

2) ARM Mali T860 MP2: The ARM GPU has no dedicated

local memory, still kernel perforation is able to accelerate the

execution on four out of six applications between 15% and

24%. The highest speedup can be always observed for the

Stencil scheme, followed by the Rows1 scheme and Rows2

scheme. The Rows2 scheme is not beneficial for GAUSSIAN3

and SOBEL3 application while it is for GAUSSIAN5 and

SOBEL5 which have a larger filter kernel size. The geometric

mean of the speedup for the stencil scheme is positive while

it is negative for the Rows1 and Rows2 scheme.

For the INVERSION application—that has no actual algo-

rithmic data reuse—we can see the effect of the missing

dedicated local memory, as all accesses to local memory

are in fact not local but global and therefore the overall

latency is increased. A similar situation can be observed for

the MEDIAN application that is implemented using a highly

optimized algorithm using private memory.

3) AMD FirePro W5100: The acceleration achieved with

the desktop AMD GPU is between 3% and 36%. None of the

applications is actually slowed down by the approximation.

GAUSSIAN3 and SOBEL3 show the largest speedups of 25%

to 36% while the applications with a larger kernel size

(GAUSSIAN5/SOBEL5) only show a speedup of 3% to 7%.

For most applications, the Rows1 scheme leads to the largest

speedup, except for applications with larger filter kernel size.

The Rows2 scheme is less beneficial.

To conclude this section we would like to point out that

the Stencil perforation scheme, while yielding always the

lowest speedup on the desktop GPU, is able to always pro-

duce the highest speedup on both mobile applications. This

observation is contrary to the results shown by the desktop

GPU. In our experiments we learned that synchronization

of threads (in a work group) is more expensive on mobile

platforms. This explains in part these results. Furthermore,

applications with a larger filter kernel size benefit more from

approximation on mobile GPUs than on the desktop GPU. The

difference in between GAUSSIAN3 and GAUSSIAN5 on AMD

is 16%, while it is 227% on the Qualcomm GPU and 182%

for the ARM GPU. Applications with smaller filter kernel

size (GAUSSIAN3/SOBEL3) benefit more from approximation

when there is no dedicated local memory.

B. Application Performance on Different Platforms

In Figure 4 we compare the performance per application on

the different platforms in order to highlight application-specific

differences of the perforation schemes. For the GAUSSIAN3

and SOBEL3 applications, all perforation schemes on all

platforms, except for Rows2 on the Mali GPU, are able to

accelerate the execution of the kernel. The speedup on the

desktop GPU is larger than on the mobile platforms: Adreno

is accelerated by 7% to 12% and Mali is 19% to 22% faster.

The higher speedup of Mali can be explained by the improved

data locality.

The situation for the GAUSSIAN5 and SOBEL5 is quite

different. All perforation schemes are able to yield a speedup

on both desktop and mobile GPUs. However, the speedup on

mobile platforms is larger compared to the desktop GPU. The

speedups on the desktop GPU are between 3% and 7%. The

speedups on the mobile GPUs are larger and between 16%

and 35% for the Adreno GPU and between 15% and 22%

for the Mali GPU. This observation can be explained by the

Adreno Mali AMD

0.0

0.5

1.0

1.5

Speedup

(a) GAUSSIAN3.

Adreno Mali AMD

0.0

0.5

1.0

1.5

Speedup

(b) SOBEL3.

Adreno Mali AMD

0.0

0.5

1.0

1.5

Speedup

Baseline Stencil Rows1 Rows2

Adreno Mali AMD

0.0

0.5

1.0

1.5

Speedup

(d) GAUSSIAN5.

Adreno Mali AMD

0.0

0.5

1.0

1.5

Speedup

(e) SOBEL5.

Adreno Mali AMD

0.0

0.5

1.0

1.5

Speedup

(f) MEDIAN.

Fig. 4: Comparison of speedup with respect to platform.

much larger memory bandwidth of desktop GPUs. Therefore,

mobile applications benefit more from a reduced number of

memory accesses.

The INVERSION application is accelerated by the AMD

GPU and slowed down by both mobile GPUs. The stencil

scheme is not applicable here. The execution failed due to a

driver issue for the Rows2 perforation scheme on Mali. The

slowdown can be attributed to a comparable large number of

synchronization operations which are in particular expensive.

The MEDIAN application is already highly optimized using

private memory. Still, the application can be accelerated by

approximation on Adreno and AMD. However, for Mali, a

slowdown can be observed, that can be explained by the

absence of dedicated local memory.

C. Execution Times

We list the kernel execution times for the optimal work

group configuration for accurate and approximated applica-

tions in Table III on the different platforms. As the three

platforms are different, executions times are not directly com-

parable. While the desktop GPU’s execution time is around

30-60 µs, the mobile GPU’s execution time is much larger:

the Adreno GPU takes up to ∼6 ms and the Mali GPU takes

up to ∼14 ms to execute the kernel.

D. Analysis of the Error

While the speedup of the different perforation schemes is

different across the different architectures the error introduced

by the approximation is—for the same application and input

data—almost platform agnostic. We show a representative

error for all applications and perforation schemes in Figure 5.

The approximation techniques are in general sensitive to the

input data. Different types of input data lead to a different

accuracy. Maier et al. [9] provide more detailed insights.

Gauss3 Gauss5 Sob3 Sob5 Inv Med

0.00

0.05

0.10

0.15

0.20

0.25

Error

Stencil

Rows1

Rows2

Fig. 5: Error with respect to application and approximation.

The stencil scheme always yields the best accuracy. How-

ever, the stencil scheme is not applicable to the INVER-

SION application. The error is less than 1% for GAUSSIAN3,

GAUSSIAN5 and MEDIAN. For SOBEL3 and SOBEL5 it is

around 5%, whereat these two applications are very sensitive

to approximation. The Rows1 scheme introduces a larger error

between 4% and more than 5% for SOBEL3. With the Rows2

scheme the accuracy degrades further for all applications: Less

than 10% error for most applications except an error of more

than 20% for SOBEL3/SOBEL5.

E. Summary

For the mobile applications, the best performance is always

achieved using the Stencil scheme. This is advantageous as this

perforation scheme also introduces the smallest error (almost

always 5% or less). In cases where the Stencil scheme is not

applicable, an alternative scheme can be employed. However,

for the INVERSION application, other schemes have not been

proven useful in terms of speedup.

The achieved speedup on mobile platforms is generally

smaller than the speedup on the desktop platform. However,

TABLE III: RUNTIME (1/100 S)FOR OPTIMAL CONFIGURATION ON DIFFERENT PLATFORMS.

Qualcomm Adreno 506 ARM Mali T860 MP2 AMD FirePro W5100

Baseline Stencil Rows1 Rows2 Baseline Stencil Rows1 Rows2 Baseline Stencil Rows1 Rows2

GAUSSIAN3 2.707 2.415 2.515 2.471 7.556 6.133 6.300 9.802 0.049 0.039 0.036 0.038

GAUSSIAN5 6.165 4.560 5.288 4.703 13.801 11.250 11.659 11.963 0.057 0.055 0.054 0.053

SOBEL3 2.711 2.419 2.522 2.471 7.594 6.113 6.256 9.616 0.050 0.039 0.037 0.039

SOBEL5 6.177 4.488 5.275 4.670 13.751 11.468 11.621 11.832 0.057 0.055 0.054 0.053

INVERSION 0.487 —10.605 0.721 0.445 —11.460 —20.027 —10.021 0.022

MEDIAN 3.027 2.668 2.821 2.953 7.013 9.427 9.395 —20.062 0.054 0.049 0.050

1Perforation scheme not applicable. 2Kernel execution not successful.

the speedup achieved by our approach can make the difference

between meeting a real-time deadline and missing it.

VI. CONCLUSION

Local memory-aware kernel perforation is a novel approxi-

mation technique tailored for GPUs that uses the fast local

memory for improving the accuracy of the approximated

OpenCL kernels. The technique exploits local memory in two

ways: in a perforation phase, to cache perforated data for

different threads; in a successive reconstruction phase, which

further improves the accuracy of the approximation.

We present the first implementation and evaluation of local

memory-aware kernel perforation on embedded GPUs. While

the technique makes explicit use of OpenCL local memory,

some embedded GPUs (e.g., ARM Mali and Imagination Tech-

nologies PowerVR) do not map local memory to dedicated

hardware memory. To analyze the impact of dedicated local

memory, we study the aforementioned approximation tech-

niques on two embedded GPUs with very different memory

layouts: Qualcomm Adreno 506 (32 kB of local memory) and

ARM Mali T860 (no dedicated local memory).

Results show that, even when the OpenCL local memory

is not mapped on a dedicated fast memory in hardware,

kernel perforation is still capable of accelerating memory-

bound applications. In many cases even a small speedup is

enough to ensure real-time applications to meet their deadlines.

Future work can investigate the impact of kernel perforation

on energy and power consumption. Furthermore, detailed

insights in the latency introduced by reconstruction can be

guiding the design of new perforation schemes and reconstruc-

tion techniques. Compiler-based approximations can enable a

wide application of the approach with less application-specific

knowledge.

REFERENCES

[1] V. K. Chippa, S. T. Chakradhar, K. Roy, and A. Raghunathan, “Anal-

ysis and Characterization of Inherent Application Resilience for Ap-

proximate Computing,” in Design Automation Conference, ser. DAC.

ACM/EDAC/IEEE, 2013.

[2] L. Lou, P. Nguyen, J. Lawrence, and C. Barnes, “Image Perforation:

Automatically Accelerating Image Pipelines by Intelligently Skipping

Samples,” ACM Transactions on Graphics (TOG), vol. 35, no. 5, 2016.

[3] S. Venkataramani, A. Raghunathan, J. Liu, and M. Shoaib, “Scalable-

effort Classifiers for Energy-efficient Machine Learning,” in Proc. of the

52nd Annual Design Automation Conf., ser. DAC. ACM, 2015.

[4] T. W. Chin, C. L. Yu, M. Halpern, H. Genc, S. L. Tsao, and V. J.

Reddi, “Domain Specific Approximation for Object Detection,” IEEE

Micro, vol. PP, no. 99, 2018.

[5] H. Omar, M. Ahmad, and O. Khan, “GraphTuner: An Input Dependence

Aware Loop Perforation Scheme for Efficient Execution of Approxi-

mated Graph Algorithms,” in 2017 IEEE 35th International Conference

on Computer Design (ICCD). IEEE, 2017.

[6] S. Sidiroglou-Douskos, S. Misailovic, H. Hoffmann, and M. Rinard,

“Managing Performance vs. Accuracy Trade-offs With Loop Perfora-

tion,” in Proceedings of the 19th ACM SIGSOFT symposium and the

13th European conference on Foundations of software engineering, ser.

ESEC/FSE. ACM, 2011.

[7] S. Li, S. Park, and S. Mahlke, “Sculptor: Flexible Approximation

with Selective Dynamic Loop Perforation,” in Proceedings of the 2018

International Conference on Supercomputing. ACM, 2018.

[8] M. Samadi, D. A. Jamshidi, J. Lee, and S. Mahlke, “Paraprox: Pattern-

based approximation for data parallel applications,” in Proceedings of

the 19th Int. Conf. on Architectural Support for Programming Languages

and Operating Systems, ser. ASPLOS. ACM, 2014.

[9] D. Maier, B. Cosenza, and B. Juurlink, “Local Memory-Aware Kernel

Perforation,” in International Symposium on Code Generation and

Optimization (CGO), ser. CGO. ACM, 2018.

[10] M. Shafique, W. Ahmad, R. Hafiz, and J. Henkel, “A Low Latency

Generic Accuracy Configurable Adder,” in Design Automation Confer-

ence, ser. DAC. ACM/EDAC/IEEE, 2015.

[11] M. Samadi, J. Lee, D. A. Jamshidi, A. Hormati, and S. Mahlke, “SAGE:

Self-tuning Approximation for Graphics Engines,” in Proceedings of the

46th Annual IEEE/ACM International Symposium on Microarchitecture,

ser. MICRO. ACM, 2013.

[12] A. Sampson, W. Dietl, E. Fortuna, D. Gnanapragasam, L. Ceze, and

D. Grossman, “EnerJ: Approximate Data Types for Safe and General

Low-power Computation,” in ACM SIGPLAN Notices, vol. 46, no. 6.

ACM, 2011.

[13] A. Yazdanbakhsh, D. Mahajan, B. Thwaites, J. Park, A. Nagendrakumar,

S. Sethuraman, K. Ramkrishnan, N. Ravindran, R. Jariwala, A. Rahimi

et al., “Axilog: Language Support for Approximate Hardware Design,”

in Design, Automation & Test in Europe Conference & Exhibition

(DATE), 2015. IEEE, 2015.

[14] S. Misailovic, M. Carbin, S. Achour, Z. Qi, and M. C. Rinard, “Chisel:

Reliability- and Accuracy-aware Optimization of Approximate Compu-

tational Kernels,” in ACM SIGPLAN Notices, vol. 49, no. 10. ACM,

2014.

[15] M. Kambadur and M. A. Kim, “NRG-loops: Adjusting Power from

Within Applications,” in Proceedings of the 2016 International Sym-

posium on Code Generation and Optimization, ser. CGO. ACM, 2016.

[16] S. Mittal, “A Survey of Techniques for Approximate Computing,” ACM

Computing Surveys (CSYR), vol. 48, no. 4, 2016.

[17] M. H. Lipasti, C. B. Wilkerson, and J. P. Shen, “Value Locality and

Load Value Prediction,” ACM SIGPLAN Notices, vol. 31, no. 9, 1996.

[18] J. S. Miguel, M. Badr, and N. E. Jerger, “Load Value Approximation,”

in Proceedings of the 47th Annual IEEE/ACM International Symposium

on Microarchitecture, ser. MICRO. IEEE, 2014.

[19] A. Yazdanbakhsh, G. Pekhimenko, B. Thwaites, H. Esmaeilzadeh,

O. Mutlu, and T. C. Mowry, “Rfvp: Rollback-free value prediction with

safe-to-approximate loads,” ACM Transactions on Architecture and Code

Optimization (TACO), vol. 12, no. 4, 2016.

[20] S. Lal, J. Lucas, and B. Juurlink, “Slc: Memory access granularity aware

selective lossy compression for gpus,” 2019.

[21] S. Mitra, M. K. Gupta, S. Misailovic, and S. Bagchi, “Phase-aware

Optimization in Approximate Computing,” in Proceedings of the 2017

International Symposium on Code Generation and Optimization, ser.

CGO. IEEE, 2017.