
This version is available at https://doi.org/10.14279/depositonce-6779


Göbel, Matthias (2014). A High-Performance Hardware Accelerator for HEVC Motion Compensation. In: Informatiktage 2014 / Fachwissenschaftlicher Informatik-Kongress, 27. und 28. März 2014, Hasso-Plattner-Institut der Universität Potsdam. (GI-Edition : lecture notes in informatics ; series of the Gesellschaft für Informatik). Bonn : Gesellschaft für Informatik. (ISBN: 978-3-88579-447-9, ISSN: 1614-3213). pp. 209–212.


A High-Performance Hardware Accelerator for HEVC Motion Compensation

Matthias Göbel

Embedded Systems Architecture Group

Dept. of Computer Engineering and Microelectronics

Technische Universität Berlin

[email protected]

Abstract: The presented master’s thesis focused on the design and implementation of a motion compensation hardware accelerator for use in HEVC hybrid decoders, i.e. decoders that contain hardware as well as software parts. As motion compensation is the most time-consuming step in the decoding process, it is crucial to implement it in a fast and efficient way. This paper elaborates on the theoretical background and motivation and highlights the main design choices. The subsequent evaluation compares the hybrid decoder with a pure software decoder. The results show that the design is capable of increasing the decoding frame rate by about 60% for 1080p video streams when running at 100 MHz.

1 Introduction

High Efficiency Video Coding (HEVC) [1] is the latest video coding standard by the Joint Collaborative Team on Video Coding (JCT-VC) and was ratified as H.265 in April 2013. It is the direct successor to the H.264/Advanced Video Coding (AVC) standard and reduces the bit rate by 50% for the same video quality when compared to H.264. For cost and power reasons it is common practice in video decoding to use dedicated hardware accelerators. Dedicated hardware blocks can perform the most expensive parts of the decoding process in a fast and efficient way, thereby offering more performance while consuming less power than a pure CPU-based solution. This paper focuses on designing and implementing a hardware accelerator for motion compensation, i.e. an interpolation filter that replaces the corresponding part of an existing software decoder, as this is the most time-consuming part of software decoders.

Similar to its predecessors, HEVC exploits temporal and spatial redundancy in video streams by referring to similar regions in previous frames instead of storing all the data explicitly. This technique, known as inter-frame prediction, is implemented in HEVC using so-called motion vectors that point to such regions in previously decoded frames. These motion vectors can also have a fractional horizontal or vertical shift relative to the target region, with an accuracy of 1/4 of a sample for the luma plane and 1/8 of a sample for the chroma planes. In order to decode an inter-frame predicted HEVC video stream, these fractional samples must be derived from the adjacent full samples using an interpolation filter. This process is called motion compensation and has been the main task of the discussed master’s thesis.
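
To make the derivation of fractional samples concrete, the following sketch filters one row of full luma samples at a quarter-sample offset. The 8-tap coefficients are those defined for luma in the HEVC standard; the function name and the direct rounding and clipping after a single pass are simplifications chosen for illustration (the standard keeps a higher intermediate precision), and the code is not taken from the thesis.

```python
# HEVC luma interpolation taps, indexed by the fractional part of the motion
# vector in quarter-sample units (0 means a full-sample position).
# Every tap set sums to 64, so results are normalized by a 6-bit shift.
LUMA_TAPS = {
    0: (0, 0, 0, 64, 0, 0, 0, 0),
    1: (-1, 4, -10, 58, 17, -5, 1, 0),
    2: (-1, 4, -11, 40, 40, -11, 4, -1),
    3: (0, 1, -5, 17, 58, -10, 4, -1),
}


def interpolate_row(samples, frac, bit_depth=8):
    """Filter one row of full samples at quarter-sample offset `frac`.

    `samples` must include 3 extra samples to the left and 4 to the right
    of the region of interest, so len(samples) - 7 outputs are produced.
    Simplification: the result is rounded and clipped right away.
    """
    taps = LUMA_TAPS[frac]
    max_value = (1 << bit_depth) - 1
    out = []
    for i in range(len(samples) - 7):
        acc = sum(t * s for t, s in zip(taps, samples[i:i + 8]))
        value = (acc + 32) >> 6                      # round, divide by 64
        out.append(min(max(value, 0), max_value))    # clip to sample range
    return out


ramp = list(range(0, 160, 10))        # 16 full samples: 0, 10, ..., 150
full = interpolate_row(ramp, 0)       # full-sample position: plain copy
half = interpolate_row(ramp, 2)       # half-sample position
print(full[0], half[0])               # 30 35
```

On the linear ramp, the half-sample output of 35 lies midway between the neighbouring full samples 30 and 40, which is the behaviour one expects from an interpolation filter.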

This paper is organized as follows. Section 2 lists related work that focuses on hardware solutions for motion compensation in general as well as for HEVC in particular. In Section 3 the design process is highlighted, followed by a discussion of the evaluation in Section 4. Finally, Section 5 gives a conclusion regarding the results of the thesis.

2 Related Work

As HEVC was only standardized in April 2013, the amount of related work in general, and regarding motion compensation in particular, has been very limited. Guo et al. [2] deal with the motion compensation interpolation and propose a resource-efficient ASIC implementation for the FIR interpolation filter as well as an efficient filter engine that is based on splitting a frame into blocks of 4x4 luma samples. An HEVC video-decoder chip for 4K applications has been presented by Tikekar et al. [3]. This chip is capable of processing 249 MPixel/s, which is sufficient for real-time decoding of 4K video streams at 30 FPS. In contrast, a large amount of related work is available for AVC motion compensation. An efficient memory access solution is discussed by Tsai et al. [4]. By reusing previous pixels via a cache, they can decode a 2048x1024 video stream running at 30 FPS in real time with less than 200 MB/s of memory bandwidth.
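
As a quick plausibility check of the figures quoted above, the sample rate of a video stream is width times height times frame rate. The sketch assumes that "4K" means 3840x2160 samples per frame and that 1 MB is 10^6 bytes; neither is stated explicitly in the text, and the function name is illustrative.

```python
def sample_rate(width, height, fps):
    """Luma samples that must be produced per second for real-time playback."""
    return width * height * fps


# Assumption: "4K" is taken as 3840x2160 (Ultra-HD).
uhd_rate = sample_rate(3840, 2160, 30)
print(uhd_rate / 1e6)       # 248.832 MPixel/s, just below the 249 MPixel/s of [3]

# Stream decoded by Tsai et al. [4]: 2048x1024 at 30 FPS.
tsai_rate = sample_rate(2048, 1024, 30)
print(tsai_rate / 1e6)      # 62.91456 MPixel/s

# Assumption: 1 MB = 10**6 bytes. A 200 MB/s budget then corresponds to
# roughly 3.2 bytes of memory traffic per decoded luma sample.
print(200e6 / tsai_rate)
```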

While these approaches focused mostly on pure hardware implementations, this work follows a hardware/software codesign approach. By partitioning the task accordingly, the advantages of software and hardware can be combined to maximize performance.

3 Design

For the design decision several approaches have been analyzed. While the parallel processing of multiple samples has theoretical advantages regarding the throughput, such solutions tend to occupy many logic resources. Furthermore, the memory will probably be a bottleneck for them. Therefore a solution that is capable of filtering one sample per cycle has been chosen, with parallel processing of luma and chroma planes.

The final design consists of two similar, independent datapaths: one for luma and one for chroma. An overview that is valid for both datapaths can be seen in Figure 1. As the interpolation process involves a two-dimensional FIR filter, a two-step procedure has been selected that first performs a one-dimensional horizontal interpolation filtering and afterwards another one-dimensional vertical one. Between these steps a buffer is implemented that stores the results of the first filter before they can be processed by the second filter. This is required as almost the complete horizontal filter process must have finished before the vertical filter process can start. As a result the theoretical throughput is reduced to 0.5 samples per cycle. For both luma and chroma, two reference blocks can be processed in parallel to support biprediction, i.e. interpolating a region in a frame by using two different regions as a reference or input. If biprediction has been selected, the results of the two vertical filters will be averaged; otherwise only the result of the first vertical filter will be used. Finally, the result can also be weighted, i.e. be multiplied by a certain factor. This feature is called weighted prediction and is implemented by an additional multiplier at the end of each sub-datapath.

[Figure 1 diagram: two sub-datapaths, each consisting of Input Samples, Horizontal Filtering, Buffer and Vertical Filtering, feed a common Biprediction and Weighted Prediction stage that produces the Output Samples.]

Figure 1: One of the two independent datapaths. Each of them again consists of two sub-datapaths that are required for biprediction. Note that one chroma datapath is sufficient for a subsampling ratio of 4:2:0.
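
The two-step filtering with an intermediate buffer, followed by biprediction and weighting, can be sketched in software as follows. This is a behavioural illustration under stated assumptions, not the hardware description from the thesis: the luma taps are those of the HEVC standard, while the function names, the integer weights `w0`/`w1` with a power-of-two denominator, and the omission of the offsets that full weighted prediction also allows are simplifications introduced here.

```python
LUMA_TAPS = {
    0: (0, 0, 0, 64, 0, 0, 0, 0),
    1: (-1, 4, -10, 58, 17, -5, 1, 0),
    2: (-1, 4, -11, 40, 40, -11, 4, -1),
    3: (0, 1, -5, 17, 58, -10, 4, -1),
}


def fir(window, taps):
    """One output of the 8-tap FIR filter."""
    return sum(t * s for t, s in zip(taps, window))


def filter_block(ref, frac_x, frac_y):
    """One sub-datapath: horizontal pass, buffer, vertical pass.

    `ref` holds (h + 7) rows of (w + 7) full samples. The result has h rows
    of w values, each still scaled by 64 * 64 = 4096.
    """
    # Step 1: horizontal filtering of every input row into the buffer.
    buffer = [[fir(row[x:x + 8], LUMA_TAPS[frac_x])
               for x in range(len(row) - 7)] for row in ref]
    # Step 2: vertical filtering reads columns of the buffer.
    h, w = len(buffer) - 7, len(buffer[0])
    return [[fir([buffer[y + k][x] for k in range(8)], LUMA_TAPS[frac_y])
             for x in range(w)] for y in range(h)]


def predict(ref0, frac0, ref1=None, frac1=(0, 0),
            w0=1, w1=1, log2_denom=0, bit_depth=8):
    """Combine one or two filtered reference blocks into the prediction."""
    p0 = filter_block(ref0, *frac0)
    p1 = filter_block(ref1, *frac1) if ref1 is not None else None
    # 12 bits undo the two filter gains; one more bit averages two blocks.
    shift = 12 + log2_denom + (1 if p1 is not None else 0)
    rounding = 1 << (shift - 1)
    max_value = (1 << bit_depth) - 1
    out = []
    for y, row in enumerate(p0):
        out_row = []
        for x, a in enumerate(row):
            acc = a * w0                      # multiplier of sub-datapath 0
            if p1 is not None:
                acc += p1[y][x] * w1          # multiplier of sub-datapath 1
            out_row.append(min(max((acc + rounding) >> shift, 0), max_value))
        out.append(out_row)
    return out


# An 8x4 block needs (8 + 7) x (4 + 7) reference samples per reference block.
ref_a = [[100] * 15 for _ in range(11)]
ref_b = [[50] * 15 for _ in range(11)]
print(predict(ref_a, (1, 3))[0][0])                   # 100
print(predict(ref_a, (2, 2), ref_b, (0, 1))[0][0])    # 75
```

Because the vertical pass reads whole columns of the buffer, it cannot begin until nearly all rows have passed the horizontal filter, which is the dependency that limits the throughput to 0.5 samples per cycle.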

For the level of granularity, i.e. the partitioning of the overall work into software and hardware parts of the decoder, the prediction unit (PU) has been selected. This is a rectangular block of between 8x4/4x8 and 64x64 luma samples, together with the corresponding number of chroma samples, that has a fixed set of parameters. This choice allows most of the complex tasks, such as parameter evaluation, to be performed in software while offering the advantage of massive parallelism that hardware solutions provide for the actual interpolation process.
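
The remark in the caption of Figure 1 that one chroma datapath suffices for 4:2:0 follows from the sample counts per PU. A minimal sketch, assuming the usual 4:2:0 layout in which each of the two chroma planes has half the luma resolution in both directions; the class and method names are illustrative and not taken from the thesis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionUnit:
    """Hypothetical PU descriptor: size in luma samples."""
    width: int
    height: int

    def luma_samples(self):
        return self.width * self.height

    def chroma_samples(self):
        # 4:2:0: two chroma planes (Cb, Cr), each subsampled by 2 in x and y.
        return 2 * (self.width // 2) * (self.height // 2)


for pu in (PredictionUnit(8, 4), PredictionUnit(64, 64)):
    print(pu.luma_samples(), pu.chroma_samples())
# 32 16
# 4096 2048
```

Both chroma planes together contain half as many samples as the luma plane, so a single chroma datapath running alongside the luma datapath is done with a PU no later than the luma datapath, as long as both process samples at the same rate.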

4 Evaluation

The discussed design has been implemented for the Zynq-7020 SoC from Xilinx. A theoretical analysis of the accelerator itself (i.e. only of the motion compensation) yielded an upper bound for the throughput of 50.5 FPS for 1080p video streams when running at 100 MHz. For the software part of the hybrid decoder, a scalar software decoder developed at TU Berlin has been modified to use the hardware accelerator for the interpolation process. The interface between hardware and software parts is implemented using a register-based solution in which the CPU handles all the memory access. As the memory overhead is expected to be high, an additional DMA-based interface has been implemented as well in order to derive the speed-up of such a solution. To compare all three implementations (pure software, register-based implementation, DMA-based implementation), the Kimono video stream of the JCT-VC test sequences [5] has been used in different 1080p encodings. The results at a frequency of 100 MHz can be seen in Figure 2.

While the frame rate for the register-based interface is reduced significantly by the huge memory overhead, the DMA-based interface is capable of delivering a significant speed-up of about 60% compared to the pure software decoder. However, the memory access still poses the main bottleneck. Figure 2 also shows the luma throughput of the accelerator for PUs of different sizes. For large PUs it converges to the theoretical maximum of 0.5 samples per cycle as the interpolation overhead decreases.
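
The convergence towards 0.5 samples per cycle can be illustrated with a simple cycle-count model. The model is an assumption made for illustration, not the analysis from the thesis: it supposes an 8-tap luma filter, one filtered sample per cycle in each pass, and a vertical pass that starts only after the horizontal pass has processed all height + 7 input rows. It ignores memory access and control overhead, so it does not reproduce the measured values.

```python
def luma_throughput(width, height, taps=8):
    """Output samples per cycle for one PU under the simplified model."""
    extra_rows = taps - 1                       # rows needed beyond the PU
    horizontal_cycles = width * (height + extra_rows)
    vertical_cycles = width * height
    return (width * height) / (horizontal_cycles + vertical_cycles)


for size in (8, 16, 32, 64):
    print(f"{size}x{size}: {luma_throughput(size, size):.3f}")
# 8x8: 0.348
# 16x16: 0.410
# 32x32: 0.451
# 64x64: 0.474
```

In this model the throughput equals height / (2 * height + 7): the 7 extra rows are a fixed cost per PU, so their share shrinks as the PU grows and the value approaches 0.5 from below.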

[Figure 2 plots. Left: frame rate in FPS for the 8 bit and 10 bit encodings, comparing the Software Decoder with the DMA-based Accelerator (derived). Right: throughput in samples per cycle over PU size in luma samples (8x8, 16x16, 32x32, 64x64).]

Figure 2: An evaluation of the accelerator. The left diagram shows the achieved frame rates using the evaluation setup. On the right side the luma throughput for PUs of different sizes can be seen.

5 Conclusion

This paper described the design of a hardware accelerator for HEVC motion compensation. Based on the idea of a hybrid decoder, such an accelerator has been implemented. The evaluation demonstrated the feasibility of the design, as it offers a speed-up of about 60% compared to a pure software solution. Based on the results of this thesis, additional work is currently in progress. In particular, further optimizations regarding the memory access will be performed, as this turned out to be the major limiting factor in the implementation.

References

[1] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand. Overview of the High Efficiency Video Coding (HEVC) Standard. IEEE Transactions on Circuits and Systems for Video Technology, Volume 22, No. 12:1649-1668, 2012.

[2] Z. Guo, D. Zhou, and S. Goto. An Optimized MC Interpolation Architecture for HEVC. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2012.

[3] M. Tikekar, C.-T. Huang, C. Juvekar, V. Sze, and A. P. Chandrakasan. A 249-Mpixel/s HEVC Video-Decoder Chip for 4K Ultra-HD Applications. IEEE Journal of Solid-State Circuits, Volume 49, Issue 1, 2014.

[4] C.-Y. Tsai, T.-C. Chen, T.-W. Chen, and L.-G. Chen. Bandwidth Optimized Motion Compensation Hardware Design for H.264/AVC HDTV Decoder. 48th Midwest Symposium on Circuits and Systems, 2005.

[5] F. Bossen. Common test conditions and software reference configurations. ITU-T/ISO/IEC Joint Collaborative Team on Video Coding (JCT-VC) Document JCTVC-K1100, 2012.
