Ana Belén Mejía-Ocaña, Manuel de-Frutos-López, Sergio Sanz-Rodríguez,
Óscar del-Ama-Esteban, Carmen Peláez-Moreno, Fernando Díaz-de-María
Low-complexity motion-based saliency map
estimation for perceptual video coding
Conference object, Postprint version
This version is available at http://dx.doi.org/10.14279/depositonce-5783.
Suggested Citation
Mejía-Ocaña, Ana Belén; de-Frutos-López, Manuel; Sanz-Rodríguez, Sergio; del-Ama-Esteban, Óscar;
Peláez-Moreno, Carmen; Díaz-de-María, Fernando: Low-complexity motion-based saliency map
estimation for perceptual video coding. - In: 2011 National Conference on Telecommunications :
CONATEL. - New York, NY [u.a.] : IEEE, 2011. - ISBN: 978-1-4577-1046-9. - pp. 1-6. - DOI:
10.1109/CONATEL.2011.5958666. (Postprint version is cited.)
Terms of Use
© © 2011 IEEE. Personal use of this material is permitted. Permission from IEEE must be
obtained for all other uses, in any current or future media, including reprinting/republishing
this material for advertising or promotional purposes, creating new collective works, for
resale or redistribution to servers or lists, or reuse of any copyrighted component of this
work in other works.
Powered by TCPDF (www.tcpdf.org)
Low-complexity Motion-based Saliency Map
Estimation for Perceptual Video Coding
Ana Belén Mejía-Ocaña, Manuel de-Frutos-López, Sergio Sanz-Rodríguez, Óscar del-Ama-Esteban,
Carmen Peláez-Moreno, Fernando Díaz-de-María
Department of Signal Theory and Communications
University Carlos III Madrid
Leganés, Spain
Abstract—In this paper, a low-complexity motion-based saliency
map estimation method for perceptual video coding is proposed.
The method employs a camera motion compensated vector map
computed by means of a hierarchical motion estimation (HME)
procedure and a Restricted Affine Transformation (RAT)-based
modeling of the camera motion. To allow for a computationally
efficient solution, the number of layers of the HME has been
restricted and the potential unreliable motion vectors due to
homogeneous regions have been detected and specially managed
by means of a smooth block detector. Special care has been taken
of the smoothness of the resulting compensated camera motion
vector map to avoid unpleasant artifacts in the perceptually-
coded sequence, by including a final post-processing based on
morphological filtering. The proposed saliency map has been
both visually and subjectively assessed showing quality
improvements when used as a part of the H.264/AVC standard
codec at medium-to-low bitrates.
Keywords-region of interest; perceptual video coding; visual
saliency; visual attention; hierarchical motion estimation; camera
motion estimation; mathematical morphology.
I. INTRODUCTION
Hybrid video coding has been the core technique in
virtually every standard recommendation for the last two
decades. This algorithmic structure aims to reduce temporal
and spatial redundancy by optimizing the trade-off between the
resulting bitrate and an objectively measured distortion.
Perceptual video coding, on the other hand, introduces
knowledge about the Human Visual System (HVS) for
allocating larger amounts of bits in the regions where visual
attention is focused. This new paradigm is expected to provide
a step forward in the achievement of better quality coding,
especially for medium to low bit rates. The automatic
estimation of the locations of the most prominent regions from
a subjective point of view is, however, a challenging issue [1].
An optimal solution can be obtained from direct
observation of the target viewer’s gaze by using eye-tracking
devices. Unfortunately, these devices are not usually available
for video coding purposes and therefore suboptimal solutions
must be sought.
Several authors concentrate their efforts in the detection of
faces since, for some specific applications such as
teleconferencing or broadcast news, they are likely to attract
the viewers’ attention [2]. Obviously, the main drawback of
this approach is the lack of generalization ability to other kinds
of video contents. In [3] an alternative approach proposes the
use of learnt-by-example attention functions that could be
adapted to a particular application.
Yet another alternative relies on the observation of an
uneven perception of distortions by the HVS depending on the
source of the distortion and the type of the region being
encoded, allowing higher distortion in those regions of the
image where it is likely to be less noticeable. This formulation
does not take into account the saliency of these regions, that
clearly affects the overall quality perception and, therefore,
they are normally employed jointly (for example in [4]).
In this paper we describe a content-independent motion-
based algorithm for the generation of a continuous-valued
saliency map that is based on the displacement of the objects
relative to the camera. Special attention has been given to the
temporal consistency of the motion of the different objects,
which has been inferred by combining three main modules:
1) A hierarchical motion estimation procedure that
overcomes the well-known shortcomings of block-
matching algorithms, while maintaining the
computational complexity considerably lower than that
of optical flow –based methods. A smooth block
detector allows us to overcome the potential errors in
large homogeneous regions of the images where the
computed motion vectors become unreliable.
2) A camera motion estimation model. A preprocessing
step is used for removing unreliable blocks that result
from the previous hierarchical motion estimation.
3) A morphological filtering postprocessing that allows
for smoothing temporal and spatial transitions.
This paper is organized as follows: Section II provides an
overview of the proposed and describes in detail each one of its
subsystems. Section III focuses on the evaluation of the method
proposed, and Section IV draws some conclusions and outlines
further work.
1
II. ALGORITHM DESCRIPTION
Motion is a highly salient feature which grabs one's
attention and keeps it locked on important features and objects.
Interest in motion perception has a long history and it can
currently be considered a relatively well established discipline.
However, the understanding of the contribution of top-down
influences, like attention, on neural activity is still a research
subject [5].
Our proposal focuses on detecting those regions in each
frame of the video sequence that could potentially attract the
attention of most viewers due to motion. In [4] and [6], a
method to estimate motion attention based on the motion
vectors of the video codec (derived using a common block-
matching procedure) is employed.
On the other hand, the proposed algorithm (summarized in
Figure 1) carries out a precise Hierarchical Motion Estimation
(HME) for obtaining the real motion of the objects in the scene,
followed by a Camera Motion Estimation process (CME).
Then, a map of camera motion compensated vectors is obtained
by subtracting the latter from the former.
However, due to computational constraints, the HME
procedure exhibits a high sensitivity to smooth regions
producing erroneous results in those frames where the amount
of blocks of such type is significant. For this reason we have
developed a ‘smoothness detection’ process that detects and
removes those vectors from the CME. Then, the vector field is
further filtered using a ‘closing’ operation to prevent isolated
spurious blocks and, as will be described later, obtaining a
more coherent and realistic motion map. Finally, the resulting
saliency map is a normalized version of this compensated and
filtered motion vector map. Detailed descriptions of each stage
follow in the next subsections.
Figure 1. Block diagram of proposed saliency classification algorithm.
A. Hierarchical Motion Estimation
For our purpose of detecting the regions of interest (ROI)
we must obtain an estimation of the real motion of the video
sequence as it would have been perceived by a human
observer. For this reason, we cannot employ common block-
matching algorithms such as the ones usually employed in
standard video codecs, where practical implementations usually
employ suboptimal heuristic search algorithms. On the other
hand, we are also concerned with the computational complexity
of the overall process and therefore we rule out the use of
optical flow motion estimation algorithms [7] that, in spite of
offering a closer prediction to motion perception, they
substantially increment the computational cost. Therefore, we
propose an approach that provides a trade-off between
efficiency and computational complexity: a Hierarchical
Block-Matching Algorithm.
The hierarchical motion estimation computes local motion
vectors following a sequence of progressive refinement stages
that starts from an initial coarse estimation. Specifically, it
involves the use of N levels where each one employs a target
block size (BS) that decreases.
Bearing in mind that state-of-the-art video codecs employ
16x16 pixel macroblocks (MB) as the basic encoding units, we
have decided to use the same size in the last layer (the finest) of
the estimation process. However, after having tested the
algorithm for several values of N, we detected that wrong
estimations tend to ocurr in large homogeneous regions. The
use of more levels with larger BS reduces the incidence of
these errors, but in exchange for an additional computational
cost. Therefore, a trade-off value of 3 has been adopted for N,
resorting to a smoothness detector to avoid wrong estimations,
as will be described in next subsection.
Figure 2. Hierarchical motion estimation layers
Consequently, blocks sizes for each layer are 64x64, 32x32
and 16x16 pixels respectively. An example of MV for these
layers is depicted in Figure 2. In order to complete the
algorithm configuration, the Search Area (SA), the region of
the reference frame where the search of the best match takes
place, is established, regardless of the level, to a square of
64x64 pixels centered in the left-top pixel of each analyzed
block. Finally, it is worth noting that the abovementioned
configuration parameters have proved to be suitable for both
CIF (352x288) and SD (704x576) video resolutions.
Motion vectors (mvl) for each layer l are calculated by
minimizing a cost function over all the possible search vectors
(sv) in the SA:
(){}
svmv
sv
,minarg lCost
l=, (1)
where the specific cost function, based on the Mean Absolute
Differences (MAD), is described as follows:
⎩
⎨
⎧
>++
=+
=−
−0,
32
)(
0,
1
)(
),( 1lMAD
lMAD
lCost lsvsv
sv
sv mvsv
sv
αα
α
(2)
Constants α1, α2 and α3 are regularization parameters: α2
penalizes high deviations of the candidate, sv, from estimations
in previous levels, represented by the difference with respect to
mvl-1, whereas α1 and α3 prevent large motion vectors, namely
2
outliers, in the understanding that they do not represent real
movement.
The final motion vector map (MV) consists of those motion
vectors chosen for each block in the last layer, namely mv2.
Figure 3. Motion vector map calculated by HME.
B. Smoothness Detection
Even though the use of HME avoids some of the common
errors of standard block-matching in finding the actual motion,
still some vectors fail to follow it. As can be observed in Figure
3, where the camera performs an horizontal panning to the left
following the wandering couple with the dog (and, therefore,
the background of the image should be showing an
homogeneous field of horizontal vectors of the same modulus),
that large smooth regions (like the facades of the buildings)
tend to produce small MAD terms no matter the search position
and, therefore, for these kind of blocks, the HME produces a
‘zero’ output.
In order to solve this drawback, without increasing L, a
smooth block detector is proposed. Given that these
homogeneous regions are typically encoded with very few bits
regardless of the encoding mode, they will be excluded from
the process in order to avoid any negative influence of the
unreliability of the estimated MVs.
In particular, the first step in the detection of smooth blocks
is the calculation of the portion of the total energy of the block
associated with the DC coefficient. Therefore, for each block
with coordinates (x, y) we compute:
()
∑
∑
=
ji
ij
ji
ij
DC p
p
yxE
,
2
2
,
)(
,, (3)
as an indicator of its smoothness, where pij are the luminance
values of every pixel within the block.
Given that we are only interested in large homogeneous
regions, before deciding on the smoothness of a block, the EDC
map is further refined by means of a morphological opening.
The erosion process entails the substitution of each
component of EDC by the minimum value in a 2x2 square
neighborhood, removing isolated high values and generating an
eroded map e
DC
E of DC energy values:
(
)( ){}
yyxxEyxE DC
e
DC Δ+Δ+= ,min, , (4)
with Δx=0..1 and Δy=0..1. Next, a dilation process completes
the morphological opening and generates the filtered DC
energy map o
DC
Eby means of the substitution of each value in
e
DC
E by the maximum value in a 2x2 square neighborhood,
recovering some high DC energy values previously eroded in
large homogeneous regions:
(
)( )
{
}
yyxxEyxE e
DC
o
DC Δ−Δ−= ,max, (5)
After the morphological opening, map o
DC
Eis compared to
an empirical threshold, th1. The result is a binary Smoothness
Map (SM), with ‘zero’ value for non-homogeneous blocks and
‘one’ for those considered smooth:
() ()
⎪
⎩
⎪
⎨
⎧>
=Otherwise
thyxE if
yxSM
o
DC
,0
,,1
,1 (6)
The example in Figure 4 illustrates how the opening
operation over the original map improves the final SM (light
gray meaning smooth regions).
C. Camera Motion Estimation
Similarly to [8], modeling of camera motion is based on a
Restricted Affine Transformation (RAT) as follows:
⎥
⎥
⎥
⎦
⎤
⎢
⎢
⎢
⎣
⎡
⋅⋅−
⋅⋅
=
100
cossin
sincos
y
x
tss
tss
θθ
θθ
RAT , (7)
where s is the scale, θ is the rotation angle, and the vector (tx,
ty) represents the motion translation. In order to obtain these
parameters, we resort to the LMS algorithm (Least Median
Squares) and a robust technique called Random Sample
Consensus (RANSAC), as described in [9].
First, the camera motion is obtained from the MV field
generated by the HME module. Considering that the camera
usually focuses on a central object in the scene, and either
keeps static or moves along with it, the central area of each
frame does not provide significant information about camera
motion, so we choose to exclude this region from the
parameterization process. Besides, the SM obtained in the
previous stage indicates which of the blocks are homogeneous
and, consequently, may involve unreliable MV estimation.
These blocks will also be removed from the camera motion
calculation, completing the exclusion mask.
3
Figure 4. a) A sample frame of the sequence “bohemia”. b) SM without
morphological filtering. c) SM with morphological filtering.
Once the appropriate vectors to estimate the camera motion
are known, the RAT model needs to be initialized for each
frame. In the first frame, the rotation is initialized to zero, as
well as the scale parameter, and the translation parameters are
initialized to the average of MV map components. In
subsequent frames the initial transformation parameters are the
RAT values of the previous frame.
An LMS algorithm is trained with each one of M subsets of
random valid vectors in order to achieve M RAT models,
optimal from the MSE (Mean Square Error) point of view.
According to RANSAC, the optimum number of subsets to
represent the global camera motion is calculated as follows:
()
()
k
P
M
ρ
−
−
=1log
1log , (8)
where P, fixed to 0.99, is the probability to find at least one
subset containing only background vectors; k is the number of
elements per group, and it is set to 4 due to complexity reasons;
and ρ is a lower threshold on the proportion between the
background blocks and the total blocks in the frame. Due to
the use of the exclusion mask, we consider that at least half of
the blocks in each subset belong to background, so ρ = 0.5.
The training of RAT parameters for each subset stops when
the difference between the MSE measured in the current
iteration and the previous is less than a threshold. In order to
reduce potential abrupt variations from one frame to the next,
the RAT parameters of each subset are also exponentially
averaged with those of previous frame. Finally, the best RAT
model among all subsets is the one that minimizes the
prediction error over the MVs outside the aforementioned
exclusion mask. Though in [9] the median square error is
recommended as a robust measure, the use of the mean is
simpler and barely damages the performance.
Once the camera motion map or MVcam is built, the
compensated motion vector map is computed by subtracting it
from the original motion vector field as follows:
(
)() ()
yxyxyx camcomp ,,, MVMVMV −
=
(9)
D. Post-processing
This stage aims to obtain the final saliency map, which
could be used by a perceptual video encoder to allocate more
resources to those blocks considered as belonging to the
moving ROI than to those associated with regions belonging to
the background or non-relevant (static) objects, as will be
shown by the second experiment described in Section III.
First of all, given that local abrupt variations in the resource
allocation within the frame could result in noticeable and
undesirable artifacts, the motion vector map obtained in the
previous stage is smoothed by filtering each of the Cartesian
components of the vectors in order to integrate those blocks
that are coherent in both motion direction and magnitude.
This filtering entails the use of a 3x3 square mask, depicted
in Figure 5 (a), in which the center value contributes a 40% to
the output value and the remaining 60% is equally contributed
by the eight nearest neighbors. Additional constraints need to
be imposed to the filtering process in order to ignore those
values belonging to the excluded smooth blocks. Figure 5 (b)
illustrates the modified filtering mask, in which the energy of
the ignored homogeneous blocks (shaded in the figure) is
redistributed through the rest of the blocks according to:
() () ()()
∑∑
+
−=
+
−=
−⋅⋅= 1
1
1
1
',1,, x
xu
y
yv
compuvcomp yxSMvuCyx MVMV , (10)
with Cuv calculated as:
⎪
⎩
⎪
⎨
⎧==
=otherwise
N
c
yv and xu if
Cuv ,
,4.0
1, (11)
where N is the number of non-smooth blocks within the 3x3
square mask, and c1=0.6 for masks centered in non-smooth
blocks and c1=1 for masks centered in smooth blocks.
4
Figure 5. a) Standard 3x3 filtering mask, with C0=0.4 and C1=0.075. b)
Mask with C0=0.4 and C1=0.1 due to the exclusion of smooth blocks (shaded).
Finally, the next step is the calculation of the saliency map,
S(x, y), which has been designed to be zero for irrelevant
(static) regions and one for regions of interest (moving
objects), and is calculated as follows:
() ()
{
}
bound
compbound
MV
yxMV
yxS
,,min
,
'
MV
=, (12)
where |·| denotes the magnitude of the motion vector and
MVbound is proportional to the video resolution, with values of 5
and 10 pixels respectively for CIF and SD resolutions. The use
of this upper bound is motivated by the fact that, exceeding a
certain amount of motion, faster objects do not attract more
intensely the observer’s attention due to the limitations of the
HVS (an object with a 20 pixel displacement from one frame
to the next is not more salient than another moving 10 pixels).
III. EXPERIMENTS AND RESULTS
The evaluation of our proposal has been performed in two
different ways: first, a visual inspection of the resulting
saliency maps for different video typologies; and second, an
attention-based rate control has been implemented for the
H.264/AVC coding standard and evaluated by means of a
subjective test.
First, the saliency map for various CIF and SD video
sequences has been depicted in grayscale, with black
representing minimum saliency and white representing
maximum saliency. Sample frames can be observed in Figures
6 to 9. As can be seen, for sequences with no camera motion
such as those in Figure 6 and Figure 8, moving objects are
considered as the ROI and obtain higher values of S. On the
other hand, for video sequences with camera motion, such as
those in Figure 7 and Figure 9, the higher values of saliency
correspond to those objects being followed by the camera and,
therefore, static with respect to the observer. Interestingly, in
the case of Figure 7, a small region around the logo bug is
classified as a high interest region given that its static situation
appears as if it was perfectly followed by the camera.
Additionally, the figures confirm the smoothness of the
saliency map, which is a key aspect for the integration of our
proposal into a perceptual rate control. It is also worth noting
that even in sequences with large homogeneous regions, such
as that in Figure 9, the ROI is properly detected.
Figure 6. Saliency map for a frame of CIF sequence “football”.
Figure 7. Saliency map for a frame of CIF sequence “bus”.
Figure 8. Saliency map for a frame of SD sequence “paris”.
Figure 9. Saliency map for a frame of SD sequence “bohemia”.
The second part of the experiments performed involves the
integration of our proposal in a perceptual video encoder. From
the three main techniques for perceptual resource allocation
described in [1], we have chosen the selective application of
blur filtering, in the form of a 3x3 Gaussian mask, which
removes higher frequencies and, therefore, reduces the bit-rate
in those regions less relevant to the HVS. The standard
deviation for the Gaussian filter is thus determined according to
the inverse of our S map. However, the procedure for
selectively introducing distortion in low attention regions is a
complex mechanism out of the scope of this paper.
5
Figure 10. Probability of detecting the low bit-rate encoded version
A subjective test for assessing the performance of this
perceptual encoder has been designed following the ITU
recommendation for subjective quality assessment in video
coding [10]. Specifically, a simplified version of the Pair
Comparison (PC) method, recommended for its high
discriminatory ability when the test items are almost identical
in quality, has been carried out in order to determine whether a
bit-rate reduction is less noticeable with the perceptual video
encoder than with the standard video encoder,. To this end, a
database of 45 video clips with different resolutions and motion
contents has been selected and two sets of sequences have been
generated: the first set consist of H.264 encodings at different
bit-rates (512, 460, 410, 360, 256 and 102 Kbps) of the original
video clips; the second set is composed of encodings at the
same bit-rates of the corresponding blurred versions of the
sequences, with stronger blur filtering applied to regions with
lower S values. The H.264 encoder has been configured in
Main profile at 25 frames per second, with rate distortion
optimization and CABAC. IPPP… has been selected as
encoding pattern and the full search motion estimation has been
performed with 5 references.
The test subjects were instructed to focus on those regions
in the video clips they consider more important from the video
content point of view. Taking the highest rate as reference,
video sequences were shown in pairs for every video clip and
bit-rate reduction; the subjects were asked to decide what
sequence was better. This task was repeated twice, the first
time with the set of non-perceptually encoded sequences and
the second time with the set of perceptually-encoded
sequences. The results of the subjective test are summarized in
terms of average probability of detecting the reduced-rate
version of each pair of encoded sequences. Given that the
original high-rate encoded versions of the sequences are
perceptually indistinguishable, the comparison of detection
probabilities, which measure the sensitivity to distortion, able
us to establish also a comparison between the performance of
both encoders and, therefore, can be employed in this context
as a subjective quality score.
As can be seen in Figure 10, for higher rates, subjective
score (probability of detecting the low bit-rate version) is very
similar for both encoder versions, since the distortion is barely
noticeable. On the other hand, for lower rates, the use of our
proposed saliency map embedded in a perceptual encoder
improves the subjective score of the encoded sequences,
making more difficult the detection of bit-rate reductions with
respect to the basic non-perceptual coding model. Nonetheless,
if the bit-rate decreases significantly, the distortion detection is
easier regardless of the encoding strategy.
IV. CONCLUSSIONS AND FURTHER WORK
In this paper, a low-complexity motion-based saliency
classifier has been proposed for its use in a perceptual video
encoder. The algorithm aims to detect those regions of the
image that are susceptible of attracting the attention of a
potential viewer due to their motion characteristics. The
complexity of the algorithm is kept low by means of the use of
HME combined with specific solutions for homogeneous
region management and an efficient algorithm for camera
motion estimation. The results show that the saliency
characterization properly responds to the motion as perceived
by the observer. Furthermore, a preliminary version of a
perceptual encoder guided by the proposed saliency map has
been successfully evaluated by mean of a subjective test.
Although the algorithm has been designed with low-
complexity constraints, additional techniques could be explored
in order to further reduce its computational cost.
It is also worth mentioning that detecting motion is not the
only way of producing saliency maps and our proposal could
be combined with several other existing methods (see for
example [3]) for providing indications of the locations of the
regions where bits can be spared with minimum visual impact.
The exploration of alternative methods for this attention-
guided bit allocation is also a promising line of research that
could provide further improvements in the subjective scores.
REFERENCES
[1] Z. Li, S. Qin and L. Itti, “Visual attention guided bit allocation in video
compresion”, Image and Vision Computing. vol. 29, pp. 1-14, 2010.
[2] A. Bhat, I.E. Richardson, and L.J. Muir, “Perceptually Optimised
Variable Bilateral Filter for Low Bit-rate Video Coding,” PGNet
Conference, Liverpool, 27-28 June 2007.
[3] Ç. Dikici and H. Bozma, “Attention-based video streaming,” Signal
Processing : Image Communication, vol. 25 (10), pp. 745-760. 2010.
[4] C.-W. Tang, C.-H. Chen, Y.-H. Yu; C.-J. Tsai, "Visual sensitivity
guided bit allocation for video coding," Multimedia, IEEE Transactions
on , vol.8, no.1, pp. 11- 18, Feb. 2006.
[5] J. Culham, S. He, S. Dukelow, F.A.J. Verstraten, “Visual motion and the
human brain: what has neuroimaging told us?,” Acta Psychologica, vol.
107, Issues 1-3, pp. 69-94, April 2001.
[6] Y.-F. Ma and H.-J. Zhang, “A model of motion attention for video
skimming,” in Proc. ICIP, vol. 1, Sept. 2002, pp. I-129–I-132.
[7] D.J. Fleet and Y. Weiss. "Optical Flow Estimation". In Paragios et al..
Handbook of Mathematical Models in Computer Vision. Springer.2006.
[8] C.R. del Blanco, F. Jaureguizar, L. Salgado, and N. García, “Target
detection through robust motion segmentation and tracking restrictions
in aerial flir images,” IEEE International Conference on Image
Processing, ICIP 2007. vol. 5, pp. V-445-V-448, 2007.
[9] C.V. Stewart, “Robust parameter estimation in computer vision”, SIAM
Reviews, vol. 41, no.3, pp. 513-537, 1999.
[10] ITU-T Recommendation P-910, “Subjective video quality assessment
methods for multimedia applications,” Tech. Rep., International
Telecommunication Union, April 2008.
6