Andersch, Michael Andersch, Chi Chi Ching, Ben Juurlink
Programming parallel embedded and
consumer applications in OpenMP
superscalar
Article, Postprint version
This version is available at http://dx.doi.org/10.14279/depositonce-6265
Suggested Citation
Andersch, M.; Chi, C. C.; Juurlink, B.: Programming parallel embedded and consumer applications in OpenMP
superscalar. - In: ACM SIGPLAN Notices. - ISSN: 0362-1340 (online), 1558-1160 (print). - 47 (2012), 8. -
pp.281-282. - DOI: 10.1145/2145816.2145854. (Postprint version is cited.)
Terms of Use
© ACM, 2012. This is the author's version of the work. It is posted here by permission of ACM for
your personal use. Not for redistribution. The definitive version was published in ACM SIGPLAN
Notices, {VOL 47, ISS 8, (2012)} https://dl.acm.org/citation.cfm?doid=2145816.2145854
Powered by TCPDF (www.tcpdf.org)
Programming Parallel Embedded and Consumer
Applications in OpenMP Superscalar
Michael Andersch
TU Berlin
Einsteinufer 17
10587 Berlin
[email protected]erlin.de
Chi Ching Chi
TU Berlin
Einsteinufer 17
10587 Berlin
[email protected]erlin.de
Ben Juurlink
TU Berlin
Einsteinufer 17
10587 Berlin
[email protected]erlin.de
Abstract
In this paper, we evaluate the performance and usability of the
parallel programming model OpenMP Superscalar (OmpSs), apply
it to 10 different benchmarks and compare its performance with
corresponding POSIX threads implementations.
Categories and Subject Descriptors D.1.3 [Software]: Program-
ming Techniques—Concurrent Programming
General Terms Algorithms, Design, Measurement, Performance
Keywords OpenMP Superscalar, OmpSs, Embedded, Consumer
1. Introduction
OpenMP Superscalar (OmpSs) is a novel task-based parallel pro-
gramming model which extends the OpenMP programming model
with the StarSs [3] task directives. In OmpSs, programs are par-
allelized by annotating functions as tasks using the omp task
input output inout pragmas. When these functions are called,
they are added to a task graph instead of directly being executed.
The task dependencies are resolved at runtime, using the input/out-
put specification of the function arguments. Once all input depen-
dencies of a task are resolved, it is ready to be executed.
In the past, OmpSs has mainly been used to parallelize HPC ap-
plications [2]. In this paper we will investigate and summarize the
usability and performance of OmpSs for embedded and consumer
applications.
This paper is organized as follows: Section 2 discusses our
methodology. In Section 3 we investigate the expressiveness of
OmpSs using pipelining in H.264 as a case study. In Section 4 the
performance results of the ten benchmarks are presented. Finally,
in Section 5, conclusions are drawn.
2. Methodology
Evaluating parallel programming models is different from evaluat-
ing processor architectures. Parallel programming models not only
target good performance, but also must offer the right abstraction
to the programmer. Therefore, it is necessary to investigate both
the usability and performance of a parallel programming model to
evaluate its overall quality.
The usability of a programming model is a subjective measure
that differs from programmer to programmer. To provide the pro-
grammer with the necessary information to be able to form his/her
own opinion about the usability, we conducted studies on numerous
qualitative aspects, such as general expressiveness or the toolchain,
of programming in OmpSs, of which we show H.264 pipeline par-
allelization as an example.
To evaluate the performance of OmpSs, we have created a
benchmark suite to evaluate parallel programming models. The
suite contains 10 C/C++ benchmarks (shown in the first column of
Table 1) that cover a wide range of embedded and consumer appli-
cation domains. The benchmarks are classified as kernels, work-
loads, or applications, based on their code size and paralleliza-
tion complexity. For each benchmark a sequential, Pthreads, and
OmpSs implementation have been developed. For comparability
the Pthreads and OmpSs variants exploit the same parallelism.
To provide meaningful results not only for contemporary, but
also for future multi-core systems, it is necessary to extend the
benchmarking process beyond the core counts of what current off-
the-shelve CMPs can offer. To achieve this, we use a 4-socket
cc-NUMA machine with 32 cores in total for the performance
evaluation.
3. Case Study: Parallelizing H.264 Decoding in
OmpSs
The H.264 decoder pipeline in our design consists of 5 pipeline
stages. In the read stage the bitstream is read from the disk and
parsed into separated frames. In the parse stage the headers of
the frame are parsed and a Picture Info entry in the Picture Info
Buffer is allocated. The entropy decode (ED) stage performs a
lossless decompression by extracting the syntax elements for each
macroblock in the frame. The macroblock reconstruction stage
allocates a picture in the Decoded Picture Buffer and reconstructs
the picture using the syntax elements and motion vectors. The
output stage reorders and outputs the decoded pictures either to an
output file or the display.
In contrast to other task-based programming models, such as
Cilk++ and OpenMP, pipeline parallelism can be easily expressed
in OmpSs, because in OmpSs tasks can be spawned before their
dependencies have been resolved [4, 5]. Listing 1 presents slightly
simplified code of the pipelined main decoder loop using OmpSs
pragmas. A task is created for each pipeline stage in each loop
iteration. For correct pipelining of the tasks, it is required that all
tasks in iteration iare executed in-order. To accomplish this, each
task in the same iteration is linked to the previous task in the same
iteration via one or more input and output/inout pairs. Additionally,
task Tof iteration imust be completed before the instance of the
same task Tin iteration i+1 is started. To accomplish this, each
task has a context structure parameter that is annoted as inout, e.g.,
ReadContext *rc, NalContext *nc, EntropyContext *ec, etc.
EncFrame frm[N]; Slice slice[N];
H264Mb *ed_bufs[N]; Picture pic[N];
int k=0;
while(!EOF){
#pragma omp task inout(*rc) output(*frm)
read_frame_task(rc, &frm[k%N]);
#pragma omp task inout(*nc,*frm) output(*s)
parse_header_task(nc,&slice[k%N], &frm[k%N]);
#pragma omp task inout(*nc,*frm) output(*s)
entropy_decode_task(ec,&slice[k%N], &frm[k%N],
ed_bufs[k%N]);
#pragma omp task inout(*rc) input(*s,*mbs) output(*pic)
reconstruct_task (rc,&slice[k%N],ed_bufs[k%N],
&pic[k%N]);
#pragma omp task inout(*oc) input(*pic)
output_task(oc,&pic[k%N]);
k++;
#pragma omp taskwait on (*rc)
}
Listing 1. Pipelining the main decoder loop using OmpSs pragmas
Three additional important observations regarding the pipelin-
ing implementation can be made. First, the taskwait on pragma
ensures that the read task has been performed before evaluating the
while loop condition. This is necessary to prevent tasks from being
added after the EOF has been reached.
Second, more importantly, some inputs and outputs of each task
are read from/written to an entry of a circular buffer of size N. This
eliminates the WAR and WAW hazards that would have occurred if
the same entry is used in each iteration, which would eliminate all
the parallelism. OmpSs does not support automatic renaming and,
therefore, this manual renaming method is required.
Third, the Picture Info Buffer (PIB) and Decoded Picture Buffer
(DPB) structures are not passed in any argument and, thus, are not
considered for dependence checking. The dependencies to these
buffer entries are purposely hidden from the OmpSs task specifi-
cations, because we cannot predict which buffers entries will be
available at the time the task is spawned. This can only be deter-
mined when the task is executed.
To fetch and release the buffer entries in a thread-safe way, omp
critical pragmas are used in the task bodies around the fetch and
release statements to protect accesses to the PIB and DPB.
4. Quantitative Evaluation
In Table 1 the speedups of the OmpSs variants over the Pthreads
variants are shown for each benchmark and core count. Overall,
five benchmarks are faster with OmpSs and four with Pthreads.
The largest gains are observed for the c-ray, rgbcmy, and ray-rot
benchmarks. The largest loss is observed for h264dec.
In the rgbcmy benchmark multiple iterations are performed to
stabilize the execution time, with a task/thread barrier separating
each iteration. The absolute time for one iteration, however, is
short with less than 20ms on 16 cores. For this benchmark, the
OmpSs variant is able to scale better at higher core counts because
it employs a polling task barrier instead of the more expensive
blocking thread barrier.
In the ray-rot benchmark the output of the c-ray kernel is the
input of the rotate kernel. For this benchmark OmpSs performs bet-
ter than Pthreads, because the runtime scheduler places dependent
tasks on the same core. Scheduling tasks that have an input out-
put relation back-to-back on the same core improves cache local-
ity. Interestingly, due to this locality advantage, the speedups for
the combined ray-rot workload exceed the product of the speedups
of the individual c-ray and the rotate kernel.
The largest performance difference between Pthreads and OmpSs
occurs for the h264dec benchmark. Increasing the task granularity
Benchmark 1 8 16 24 32 Mean
c-ray 1.03 1.11 1.12 1.11 1.14 1.10
rotate 1.06 1.04 1.09 1.02 0.86 1.01
rgbcmy 1.02 0.98 1.14 1.40 1.53 1.19
md5 1.00 1.02 1.10 1.14 1.05 1.06
kmeans 0.91 0.87 1.30 0.95 0.88 0.97
ray-rot 1.02 1.10 1.65 1.46 1.20 1.27
rot-cc 1.00 1.06 1.17 1.14 1.04 1.08
streamcluster 0.93 0.84 0.91 0.99 0.99 0.93
bodytrack 0.98 0.99 1.05 0.97 1.00 1.00
h264dec 0.94 1.07 0.87 0.57 0.42 0.73
Mean 0.99 1.00 1.12 1.05 0.97 1.02
Table 1. Speedup factors and geometric means of OmpSs imple-
mentations over Pthreads implementations for each benchmark and
core count.
is necessary to improve the overall performance of OmpSs. Group-
ing the tasks, however, reduces the parallelism, which in turn limits
the performance at higher core counts. In the Pthreads version
of h264dec the synchronization is highly optimized using a line
decoding strategy [1] and, therefore, grouping of tasks is not nec-
essary.
Over the entire benchmark suite, OmpSs performs 2% better
than Pthreads. At 1 and 8 cores the performance is very close, while
at 16 and 24 cores OmpSs is slightly faster. At 32 cores OmpSs is
slightly slower mainly due to the lower performance in the h264dec
benchmark. Thus, we argue that performance wise OmpSs can
compete with manual threaded solutions for the embedded and
consumer benchmarks considered in this paper.
To be a true alternative for manual threading, however, OmpSs
processes must be able to dynamically share resources with other
processes. Currently, OmpSs programs use a static number of cores
controlled by an environmental variable. Furthermore, because the
runtime implements core communication/synchronization, e.g. task
barriers, in a polling fashion for performance reasons, all used cores
are always fully loaded even if there is insufficient work. This
reduces overall system responsiveness and power efficiency when
too many cores are used.
5. Conclusions
Our studies have shown that OmpSs is, while not yet production-
ready, a viable alternative to established parallel programming
models such as Pthreads. The expressiveness is sufficiently power-
ful to program common parallelism patterns such as pipelining and
the performance is comparable to Pthreads implementations.
References
[1] C. C. Chi and B. Juurlink. A QHD-Capable Parallel H.264 Decoder. In
Proc. 25th Int. Conf. on Supercomputing, 2011.
[2] A. Duran, X. Teruel, R. Ferrer, X. Martorell, and E. Ayguadé. Barcelona
OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation
of Task Parallelism in OpenMP. In Proc. Int. Conf. on Parallel Process-
ing, 2009.
[3] J. M. Perez, R. M. Badia, and J. Labarta. A Flexible and Portable
Programming Model for SMP and Multi-cores. Technical report, BSC-
UPC, 2007.
[4] A. Pop and A. Cohen. A Stream-Computing Extension to OpenMP. In
Proc. 6th Int. Conf. on High Performance and Embedded Architectures
and Compilers, 2011.
[5] H. Vandierendonck, P. Pratikakis, and D. Nikolopoulos. Parallel Pro-
gramming of General-Purpose Programs Using Task-Based Program-
ming Models. In Proc. 3rd USENIX Workshop on Hot Topics in Paral-
lelism, 2011.