This version is available at https://doi.org/10.14279/depositonce-6779
Göbel, Matthias (2014). A High-Performance Hardware Accelerator for HEVC Motion Compensation. In:
Informatiktage 2014 / Fachwissenschaftlicher Informatik-Kongress, 27. und 28. März 2014, Hasso-
Plattner-Institut der Universität Potsdam. (GI-Edition : lecture notes in informatics ; series of the
Gesellschaft für Informatik). Bonn : Gesellschaft für Informatik. (ISBN: 978-3-88579-447-9, ISSN:
1614-3213). pp. 209–212.
A High-Performance Hardware Accelerator for HEVC
Motion Compensation
Matthias Göbel
Embedded Systems Architecture Group
Dept. of Computer Engineering and Microelectronics
Technische Universität Berlin
Abstract: The presented master's thesis focused on the design and implementation
of a motion compensation hardware accelerator for use in HEVC hybrid decoders, i.e.
decoders that contain hardware as well as software parts. As motion compensation
is the most time-consuming step in the decoding process, it is crucial to implement it in
a fast and efficient way. This paper outlines the theoretical background and motivation
and highlights the main design choices. The subsequent evaluation compares the
hybrid decoder with a pure software decoder. The results show that the design is
capable of increasing the decoding frame rate by about 60% for 1080p video streams
when running at 100 MHz.
1 Introduction
High Efficiency Video Coding (HEVC) [1] is the latest video coding standard by the Joint
Collaborative Team on Video Coding (JCT-VC) and was ratified as H.265 in April
2013. It is the direct successor to the well-known H.264/Advanced Video Coding (AVC)
standard and reduces the bit rate by about 50% for the same video quality when compared
to H.264. For cost and power reasons, it is common practice in video decoding to use
dedicated hardware accelerators. Dedicated hardware blocks can perform the most expensive
parts of the decoding process in a fast and efficient way, thereby offering more performance
while consuming less power than a pure CPU-based solution. This paper focuses in particular
on designing and implementing a hardware accelerator for motion compensation, i.e. an
interpolation filter that replaces the corresponding part of an existing software decoder,
as this is the most time-consuming part of software decoders.
Similar to its predecessors, HEVC exploits temporal and spatial redundancy in
video streams by referring to similar regions in previous frames instead of storing all the
data explicitly. This technique, known as inter-frame prediction, is implemented in
HEVC by using so-called motion vectors that point to such regions in previously decoded
frames. These motion vectors can also have a horizontal or vertical shift relative to the
target region with an accuracy of 1/4 of a sample for the luma plane and 1/8 of a sample
for the chroma planes. In order to successfully decode an inter-frame predicted HEVC
video stream, these fractional samples must be derived from the adjacent full samples by
using an interpolation filter. This process is called motion compensation and has been the
main task of the discussed master's thesis.
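The fractional accuracy can be illustrated with a short Python sketch. It assumes the usual HEVC convention that each motion vector component is stored as an integer in units of a quarter luma sample (or an eighth chroma sample for 4:2:0 content); the function name `split_motion_vector` is hypothetical and not taken from the thesis.

```python
def split_motion_vector(mv: int, frac_bits: int) -> tuple[int, int]:
    """Split one motion vector component into (full-sample offset, fractional phase).

    frac_bits = 2 for quarter-sample luma vectors, 3 for eighth-sample chroma vectors.
    """
    full = mv >> frac_bits               # arithmetic shift floors, also for negative values
    frac = mv & ((1 << frac_bits) - 1)   # phase in 0 .. 2**frac_bits - 1
    return full, frac


# A luma component of -5 quarter samples is -2 full samples plus a 3/4 phase.
print(split_motion_vector(-5, 2))   # (-2, 3)
# The same value read in eighth-sample chroma units is -1 full sample plus a 3/8 phase.
print(split_motion_vector(-5, 3))   # (-1, 3)
```

Only a non-zero fractional phase requires the interpolation filter; a phase of zero refers directly to a full sample.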
This paper is organized as follows. Section 2 lists related work that focuses on hardware
solutions for motion compensation in general as well as for HEVC in particular. In Section
3 the design process is highlighted, followed by a discussion of the evaluation in Section 4.
Finally, in Section 5 a conclusion is given regarding the results of the thesis.
2 Related Work
As HEVC was only standardized in April 2013, the amount of related work in general
and on motion compensation in particular is still very limited. Guo et al. [2]
address motion compensation interpolation and propose a resource-efficient ASIC
implementation of the FIR interpolation filter as well as an efficient filter engine that is
based on splitting a frame into blocks of 4x4 luma samples. An HEVC video decoder
chip for 4K applications has been presented by Tikekar et al. [3]. This chip is capable of
processing 249 MPixel/s, which is sufficient for real-time decoding of 4K video streams
at 30 FPS. In contrast, a large body of related work exists for AVC motion compensation.
An efficient memory access solution is discussed by Tsai et al. [4]. By
reusing previously fetched pixels via a cache, they can decode a 2048x1024 video stream
at 30 FPS in real time with less than 200 MB/s of memory bandwidth.
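The quoted pixel rates can be checked with a few lines of Python. The sketch assumes that "4K" refers to the Ultra-HD resolution of 3840x2160 samples and counts luma samples only; the helper name `pixel_rate` is chosen here for illustration.

```python
def pixel_rate(width: int, height: int, fps: int) -> int:
    """Luma samples per second that must be produced for real-time decoding."""
    return width * height * fps


uhd = pixel_rate(3840, 2160, 30)
print(uhd / 1e6)    # 248.832, i.e. just below the 249 MPixel/s reported in [3]

hdtv = pixel_rate(2048, 1024, 30)
print(hdtv / 1e6)   # 62.91456 MPixel/s for the stream size used in [4]
```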
While these approaches focused mostly on pure hardware implementations, this work follows
a hardware/software codesign approach. By partitioning the task accordingly, the
advantages of software and hardware can be combined, thus maximizing performance.
3 Design
For the design decision, several approaches were analyzed. While processing multiple
samples in parallel has theoretical advantages regarding throughput, such solutions
tend to occupy many logic resources. Furthermore, memory would probably be a
bottleneck for them. Therefore, a solution that filters one sample per cycle
has been chosen, with luma and chroma planes processed in parallel.
The final design consists of two similar, independent datapaths: one for luma and
one for chroma. An overview that is valid for both datapaths can be seen in Figure 1. As
the interpolation process involves a two-dimensional FIR filter, a two-step procedure has
been selected that first performs a one-dimensional horizontal interpolation and
afterwards a one-dimensional vertical one. Between these steps, a buffer stores the
results of the first filter until they can be processed by the second filter. This is
required because almost the complete horizontal filter process must have finished
before the vertical filter process can start. As a result, the theoretical throughput is reduced
to 0.5 samples per cycle. For each of luma and chroma, two reference blocks can be processed
in parallel to support biprediction, i.e. interpolating a region in a frame by using two
different regions as reference or input. If biprediction has been selected, the results of the
two vertical filters are averaged; otherwise only the result of the first vertical filter is
used. Finally, the result can also be weighted, i.e. multiplied by a certain factor.
This feature is called weighted prediction and is implemented by an additional multiplier
at the end of each sub-datapath.
Figure 1: One of the two independent datapaths. Each of them again consists of two sub-datapaths
that are required for biprediction. Note that one chroma datapath is sufficient for a subsampling ratio
of 4:2:0.
(Block labels in the diagram: Input Samples, Horizontal Filtering, Buffer and Vertical Filtering
for each sub-datapath; Biprediction and Weight prediction; Output Samples.)
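The two-step procedure can be sketched in Python as follows. The tap values are the 8-tap luma coefficients defined by the HEVC standard; everything else is an assumption made for illustration and not the datapath of the thesis: the function names are hypothetical, only 8-bit samples are handled, the gain of both passes is removed by a single rounding at the end, biprediction averages the final samples, and weighted prediction is omitted.

```python
# HEVC luma interpolation taps for the fractional phases 0, 1/4, 1/2 and 3/4.
# Each row sums to 64, so one filter pass has a gain of 2**6.
LUMA_TAPS = [
    [0, 0, 0, 64, 0, 0, 0, 0],
    [-1, 4, -10, 58, 17, -5, 1, 0],
    [-1, 4, -11, 40, 40, -11, 4, -1],
    [0, 1, -5, 17, 58, -10, 4, -1],
]


def fir(samples, taps):
    """One filter output: dot product of 8 neighbouring samples with the taps."""
    return sum(s * t for s, t in zip(samples, taps))


def interpolate_block(ref, width, height, frac_x, frac_y):
    """Interpolate a width x height block from a (height + 7) x (width + 7) reference."""
    # Step 1: horizontal filtering of every reference row, stored in the buffer.
    buffer = [
        [fir(row[x:x + 8], LUMA_TAPS[frac_x]) for x in range(width)]
        for row in ref
    ]
    # Step 2: vertical filtering reads columns of the buffer.
    out = []
    for y in range(height):
        line = []
        for x in range(width):
            column = [buffer[y + k][x] for k in range(8)]
            value = fir(column, LUMA_TAPS[frac_y])
            # Remove the combined gain of 2**12 with rounding and clip to 8 bit.
            line.append(min(255, max(0, (value + (1 << 11)) >> 12)))
        out.append(line)
    return out


def bipredict(block_a, block_b):
    """Average two interpolated blocks with rounding (simplified biprediction)."""
    return [
        [(a + b + 1) >> 1 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(block_a, block_b)
    ]


flat = [[100] * 11 for _ in range(11)]          # a 4x4 block needs an 11x11 reference
print(interpolate_block(flat, 4, 4, 1, 2)[0])   # [100, 100, 100, 100]
```

The sketch also shows why the buffer is needed: the first output row already depends on eight horizontally filtered rows, so the vertical pass cannot start until a large part of the horizontal pass is available.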
For the level of granularity, i.e. the partitioning of the overall work into software and
hardware parts of the decoder, the prediction unit (PU) has been selected. A PU is a rectangular
block of between 8x4/4x8 and 64x64 luma samples, plus the corresponding number of chroma
samples, that shares a fixed set of parameters. This choice makes it possible to perform most
of the complex tasks, such as parameter evaluation, in software while offering the advantage of
massive parallelism that hardware solutions provide for the actual interpolation process.
4 Evaluation
The discussed design has been implemented for the Zynq-7020 SoC from Xilinx. A theoretical
analysis of the accelerator itself (i.e. of the motion compensation only) yielded an
upper bound for the throughput of 50.5 FPS for 1080p video streams when running at 100
MHz. For the software part of the hybrid decoder, a scalar software decoder developed at
TU Berlin has been modified to use the hardware accelerator for the interpolation process.
The interface between the hardware and software parts is implemented as a register-based
solution in which the CPU handles all memory accesses. As the memory overhead was
expected to be high, an additional DMA-based interface has been implemented as well in
order to derive the speed-up of such a solution. To compare all three implementations
(pure software, register-based implementation, DMA-based implementation), the
Kimono video stream of the JCT-VC test sequences [5] has been used in different 1080p
encodings. The results at a frequency of 100 MHz can be seen in Figure 2.
While the frame rate for the register-based interface is reduced significantly by the large
memory overhead, the DMA-based interface is capable of delivering a significant speed-up
of about 60% compared to the pure software decoder. However, memory access
still poses the main bottleneck. Figure 2 also shows the luma throughput of the accelerator
for PUs of different sizes. For large PUs it converges to the theoretical maximum of 0.5
samples per cycle as the relative interpolation overhead decreases.
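One plausible reading of this convergence can be written down as a small cycle model. The model is an assumption made here for illustration and does not reproduce the measured values of the thesis: it supposes an 8-tap filter, one sample per cycle in each pass, no overlap between the horizontal and the vertical pass as described in Section 3, and no memory or setup cycles. The function name `modelled_throughput` is hypothetical.

```python
def modelled_throughput(width: int, height: int, taps: int = 8) -> float:
    """Output samples per cycle under a simple, non-overlapping two-pass model."""
    rows_in = height + taps - 1           # extra reference rows needed by the vertical filter
    horizontal_cycles = width * rows_in   # one horizontally filtered sample per cycle
    vertical_cycles = width * height      # one output sample per cycle
    return (width * height) / (horizontal_cycles + vertical_cycles)


for size in (8, 16, 32, 64):
    print(f"{size}x{size}: {modelled_throughput(size, size):.3f} samples per cycle")
```

In this model the overhead consists of the seven additional rows that the horizontal pass has to filter. Their share shrinks as the PU height grows, so the throughput approaches but never reaches 0.5 samples per cycle.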
Figure 2: An evaluation of the accelerator. The left diagram shows the achieved frame rates using
the evaluation setup. On the right side the luma throughput for PUs of different sizes can be seen.
(Left panel: FPS for 8 bit and 10 bit encodings; series: Register-based Accelerator, Software
Decoder, DMA-based Accelerator (derived). Right panel: samples per cycle over the PU size in
luma samples, from 8x8 to 64x64.)
5 Conclusion
This paper described the design of a hardware accelerator for HEVC motion compensation.
Based on the idea of a hybrid decoder, such an accelerator has been implemented.
The evaluation showed that the design is feasible and worthwhile, as it offers a speed-up
of about 60% compared to a pure software solution. Based on the results of this thesis,
additional work is currently in progress. In particular, further optimizations regarding
memory access will be performed, as this turned out to be the major limiting factor in the
implementation.
References
[1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the High Efficiency Video
Coding (HEVC) Standard. IEEE Transactions on Circuits and Systems for Video Technology,
Volume 22, No. 12:1649-1668, 2012.
[2] Z. Guo, D. Zhou, and S. Goto. An Optimized MC Interpolation Architecture for HEVC. IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.
[3] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A. P. Chandrakasan. A 249-Mpixel/s HEVC
Video-Decoder Chip for 4K Ultra-HD Applications. IEEE Journal of Solid-State Circuits,
Volume 49, Issue 1, 2014.
[4] C.-Y. Tsai, T.-C. Chen, T.-W. Chen, and L.-G. Chen. Bandwidth Optimized Motion
Compensation Hardware Design for H.264/AVC HDTV Decoder. 48th Midwest Symposium on
Circuits and Systems, 2005.
[5] F. Bossen. Common test conditions and software reference configurations. ITU-T/ISO/IEC
Joint Collaborative Team on Video Coding (JCT-VC) Document JCTVC-K1100, 2012.