Assessing human depth perception for 2D and 3D
stereoscopic images and video and its relation with the
overall 3D QoE
submitted by
Master of Science
Pierre Lebreton
from Le Mans
to Faculty IV - Electrical Engineering and Computer Science
of the Technische Universität Berlin
Assessment of IP-Based Application
in fulfillment of the requirements for the academic degree of
Doktor der Ingenieurwissenschaften
Dr.-Ing.
approved dissertation
Doctoral committee:
Chair: Prof. Dr.-Ing. Sebastian Möller
First reviewer: Prof. Dr.-Ing. Alexander Raake
Second reviewer: Prof. Dr. Ingrid E.J. Heynderickx
Third reviewer: Dr.-Ing. Marcus Barkowsky
Date of the scientific defense: 11.12.2015
Berlin 2016
D 83
Acknowledgements
My journey to the PhD would not have been possible without the guidance of great people who showed me the way
of research. There have been some major “pushes” which drove me to science and then to the PhD. First, I would like
to thank Patrick Le Callet, to whom I owe my career as a researcher. He was the first to give me the opportunity
to do research; he guided me during my first experiences in research and supported my applications for my master’s thesis
at NTT and then for my PhD at the Technical University of Berlin. I would not be where I am now without your
kindness and great support.
I would also like to thank Alexander Raake, first for agreeing to supervise my thesis, but mainly for all the fruitful
discussions, his kindness, and his continuous support throughout the thesis. Each meeting gave me a new burst of energy, enabling me
to go further and to see new ideas and aspects that I might have missed. This was always done constructively and with respect
for my research interests. It really has been a great pleasure to work with you.
Marcus Barkowsky also made a very important contribution to my thesis and to my career as a scientist. I would
like to thank him for the time he took throughout my entire thesis to advise me on my research, my experiments, and the
analysis of data. I really learned a lot: how to be critical of my work and how to see the limits of our results and analyses.
Even if it was sometimes difficult, it was a process I needed to go through in order to do better work. There is still much
to learn, but I know the direction.
I would also like to thank Akira Takahishi and Kazuhisa Yamagishi, who also laid their stone in the construction of
my thesis. Although not directly involved, they were very kind to accept me at their laboratory and guided me through my first subjective
experiments and my master’s thesis. This gave me the will to pursue a PhD.
And of course, I would like to thank my father for supporting me in doing a PhD. Even though Germany is not that far
from France, I have been away for a very long time. So thanks again for supporting me. Thanks also to my sisters, my
family, and my friends for always being there (and thanks again to those who made it all the way to Berlin!).
Contents
1 Introduction
2 State of the art
2.1 Introduction
2.1.1 Evaluating Quality of Experience
2.1.2 Image quality
2.1.3 Depth
2.1.4 Visual discomfort
2.1.5 Models of QoE
2.1.6 Interaction between depth and other factors
2.1.7 Summary
2.2 Human perception of depth
2.2.1 Different factors
2.2.2 Depth cues
2.2.3 Models for depth cue fusion
2.3 Results on depth modeling
2.3.1 Subjective depth evaluation methods
2.3.2 Evidences to support models
2.3.3 Modeling
2.3.4 Conclusion
2.4 Visual comfort
2.5 Technical implementation
2.5.1 Capture
2.5.2 Rendering
2.5.3 Transmission
2.6 Conclusion
3 Evaluating 3D added value
3.1 Differences and similarities between 2D and 3D QoE for streamed videos
3.1.1 Subjective evaluation of 2D and 3D QoE
3.1.2 Further analysis on subjective ratings
3.1.3 Conclusion
3.1.4 Performance of instrumental measurement for 3D QoE prediction
3.2 Revealing the added value of 3D over 2D
3.2.1 Definition of test conditions
3.2.2 Evaluation of 3D QoE using paired comparison
3.2.3 Preference of 3D over 2D and pictorial quality
3.2.4 Quantitative analysis of the “3D added value”
3.2.5 Relation with previous studies
3.2.6 Conclusion
3.3 Overall results
3.4 Key contributions
4 Subjective evaluation of depth
4.1 Introduction
4.2 Evaluation of depth cues
4.3 Definition of scales
4.3.1 Perceived depth
4.3.2 The linear perspective
4.3.3 The relative size
4.3.4 The texture gradient
4.3.5 The interposition
4.3.6 The light and shades
4.3.7 The aerial perspective
4.3.8 The defocus blur
4.4 Evaluation of perceived depth
4.4.1 Experiment
4.4.2 Analysis of results
4.4.3 Conclusion
4.5 Evaluation of monocular depth cues
4.5.1 Image selection
4.5.2 Evaluation of binocular depth
4.5.3 Evaluation of monocular depth
4.5.4 Result
4.5.5 Limitations
4.5.6 Conclusion
4.6 Alternative methodology: ranking
4.6.1 Description of the proposed methodology
4.6.2 Description of the studied scale
4.6.3 Statistical analysis
4.6.4 Limits
4.6.5 Discussion
4.6.6 Analysis per depth cue
4.6.7 Conclusion
4.7 Key contributions
5 Algorithms for depth evaluation
5.1 Introduction
5.2 Instrumental characterization
5.3 Background
5.3.1 Depth maps from stereoscopic videos
5.3.2 Depth estimation from monocular depth cues
5.3.3 Image segmentation
5.4 Binocular depth cues
5.4.1 Disparity module
5.4.2 Region of depth relevance module
5.4.3 Frame-based feature extraction module
5.4.4 Temporal pooling
5.5 Model performance
5.6 Applications
5.6.1 Comparison with other methods
5.6.2 Depth perception and its relation with monocular and binocular depth cues
5.6.3 Conclusion
5.7 Monocular depth cues
5.7.1 Linear perspective
5.7.2 Defocus blur
5.7.3 Motion parallax
5.7.4 Texture gradient
5.8 Depth cues pooling and reliability
5.8.1 Reliability and temporal consistency
5.8.2 Identification of cases of failure
5.8.3 Outcomes on reliability measurements
5.9 Conclusion on depth characterization
5.10 Key contributions
6 Conclusion
7 Further work
References
Acronyms
ANOVA Analysis of variance
ACR Absolute category rating
BT Bradley-Terry
DCT Discrete cosine transform
DERS Depth estimation reference software
DOF Depth of field
EC Evaluation concept
FLMP Fuzzy logical model of perception
GLP Global layout properties
GOP Group of pictures
HDMI High-definition multimedia interface
HRC Hypothetical reference circuit
IQ Image quality
ITU International Telecommunication Union
JND Just noticeable difference
LCD Liquid crystal display
LDV Layered depth video
LSD Line segment detector
MAP Maximum a posteriori
MANOVA Multivariate analysis of variance
MVC Multi view coding
MVD Multi view plus depth
MLE Maximum-Likelihood Estimation
MWF Modified weak fusion
NANOVA N-Way analysis of variance
OR Outlier ratio
OT Object thickness
PC Paired comparison
PCA Principal component analysis
PVS Processed Video Sequence
QoE Quality of Experience
QP Quantization parameter
RANSAC Random sample consensus
RMSE Root mean square error
RODR Region of depth relevance
RTP Real-time transport protocol
SAMVIQ Subjective assessment methodology for video quality (ITU-R BT.1788)
SI Spatial information
SRC Source reference circuit (ANSI/ATIS adopted by VQEG)
TS Transport stream
TI Temporal information
3DAV 3D added value
Chapter 1
Introduction
3D was expected to be the next step for television. However, it has apparently not convinced a large number of
end users to equip their homes with it. In movie theaters, 3D still receives attention from spectators
and movie producers. One important step towards improving the acceptance of 3DTV is to demonstrate the added
value of 3D to the user. The industry claimed that the contribution of 3D would be at the same level as the transition
from monochrome to color.
This thesis describes how 3D can improve the user experience compared to 2D videos, and how the
added value of 3D, the perceived depth information, can be characterized. It analyzes how 3D videos are
perceived and how factors such as the texture quality and the perceived depth affect the Quality of Experience
(QoE).
Chapter 2 begins with the state of the art on the definition of QoE. It addresses previous results on the
interaction between QoE and the different factors affecting it: image quality, depth quality and quantity, and visual
discomfort. However, since evaluating 3D QoE itself is not straightforward, the chapter also addresses how QoE can be
measured using different evaluation concepts. Low-level factors such as image
quality, depth, and visual comfort are put into relation with these high-level evaluation concepts.
Since the perceived depth may represent the added value compared to 2D videos, one major goal of the chapter is
to provide an in-depth discussion of depth perception. Different depth cues are addressed, together with an analysis of how
they relate to each other. As will be discussed in more detail, the different depth cues are not necessarily
orthogonal; hence the possible interactions between depth cues are discussed as well. In addition, different
models of depth cue pooling, which target the prediction of an overall depth judgment, are addressed.
Finally, since most of the reported studies address human visual perception
in a natural environment, technology factors are also considered, analyzing their impact on what is seen
by the participants.
Chapter 3 addresses the subjective experiments performed in this thesis
to investigate how the added value of 3D over 2D can be revealed. Different 3D video streaming scenarios are
addressed, involving error-free or error-prone transmission chains. It is illustrated that in the
particular case of the evaluation of 3D videos encoded at different bitrates, it is not easy to evaluate the added value of
3D compared to 2D. Based on this observation, the issue of how participants understand the rating scales
is discussed. Moreover, considering the similarities between the 2D and 3D ratings, the chapter also describes studies
evaluating the performance of 2D video quality prediction algorithms for 3D video quality prediction. Finally, it
is shown that the added value of 3D can be measured by means of paired comparison. The preference of 3D
over 2D is found to be content-dependent, but also linearly dependent on the image quality resulting from coding.
Considering the performance of 2D video quality prediction algorithms for predicting 3D image quality, a key remaining
aspect for measuring the added value of 3D compared to 2D is the depth-specific characterization of 3D video
sequences.
Chapter 4 describes the work performed by the author on characterizing the depth in 3D video sequences by means
of subjective testing. As described in Chapter 2, there are two different kinds of depth cues: monocular and binocular
ones. Both kinds of depth cues are addressed in this chapter. However, the chapter shows that evaluating depth
cues may not be an easy task for the participants. Therefore work has been carried out on how to properly define the
scales on which the participants rate the depth in images and videos. Particular attention has been paid to the selection
of natural images in order to cover different amounts of monocular and binocular depth cues in the tests. Moreover, in
addition to providing a definition for the scales and selected images, research was focused on defining subjective
assessment methods for the evaluation of depth cues. The presented results show that the proposed methodology enables
test participants to better understand the task by means of a more intensive training phase, which gives them
more examples of what they should do during the test. Finally, the relationship between monocular depth cues
and overall depth is studied in the case of natural images.
Based on the subjective ratings collected from test participants using the approach developed in this thesis, Chapter
5 describes the work performed on developing prediction algorithms for evaluating depth cues in natural images.
Different algorithms are described to evaluate binocular and monocular depth cues. These algorithms address binocular
depth cues, linear perspective, texture gradient, defocus blur, and, in the case of video, motion parallax. Similar to the
work carried out on subjective assessment methods, the question of the reliability of the prediction provided by these
algorithms is addressed. It is described how temporal consistency analysis, image classification, and feature analysis on the
different algorithms make it possible to estimate the confidence of the metrics.
Finally, Chapter 6 reviews the main contributions of this thesis and addresses the perspectives for a continuation of
this work.
Figure 1.1: Different items studied
Chapter 2
State of the art
2.1 Introduction
The purpose of this chapter is to introduce the reader to the different relevant aspects of 3D Quality of Experience
(QoE). It provides information on the existing results on the different factors involved in the notion of 3D Quality of
Experience.
A widely accepted definition of Quality of Experience is provided in the Qualinet white paper [1]:
Listing 2.1: Definition of Quality of Experience
Quality of Experience (QoE) is the degree of delight or annoyance of the user of an
application or service. It results from the fulfillment of his or her expectations
with respect to the utility and / or enjoyment of the application or service in the
light of the user’s personality and current state.
Several aspects of this definition are important: QoE relates to the user’s perception and expectations, and also
depends on a particular context. As stated by Seuntiëns [2], in the context of 3D video presentation different factors
drive the overall QoE, e.g. the picture quality, the visual discomfort and the perceived depth. All three factors are
assessed based on the user’s expectations and the context of use.
2.1.1 Evaluating Quality of Experience
Evaluating the overall QoE is challenging. First, different factors affect how 3D image and video material is perceived.
These factors include the pictorial quality, which relates to the quality of the two stereoscopic pictures seen by the user
at a specific instant. Compared to 2D video, 3D video adds another factor: the perceived
depth, which is the added value of 3D images and videos. It depends both on how the image was created and
on how it is viewed: the size of the display used to render the 3D images/videos, the viewing distance, the display
resolution, etc. The third main factor is the visual comfort. Depending on how the content was created and
rendered, observers can feel different degrees of stress and, in the long term, suffer from fatigue when watching 3D
material. This affects their overall assessment of the 3D movie/image viewing experience. These three
factors together make up the overall 3D experience. Studies have been conducted to evaluate this overall experience
and how 2D and 3D quality of experience differ. Results show only small, statistically non-significant differences
between quality ratings for 2D and 3D uncompressed video sequences in the context of subjective tests with different
coding conditions with hidden reference [3, 4]. However, these small differences in quality ratings do not mean that the
user’s experience does not differ between the 2D and 3D presentation, but rather that the reported quality values may
be too strongly influenced by the context of the subjective experiment, where many different degradations of pictorial
quality are presented and drive the overall quality ratings. This problem is aggravated by the single stimulus
rating paradigm. Hence, users provide ratings of pictorial quality rather than of QoE. To tackle this issue, Seuntiëns
[2] considered new evaluation concepts such as:
• Presence, related to the feeling of “being there and reacting to”, as defined by IJsselsteijn [5].
• Naturalness, related to “what observers perceive as a truthful representation of reality” (IJsselsteijn [6]).
• Viewing experience, as defined by Seuntiëns, which considers a higher degree of imagination, similar to Presence.
However, it considers that the persons “know they are not in the movie but react in a physical and emotional sense
to the story”.
These evaluation concepts were put into relation with lower-level factors such as the quality of the representation, the
visual discomfort and the depth (Figure 2.1). It was shown that viewing experience and naturalness were rated
relatively similarly to image quality, while presence ratings were more closely related to depth scores. In addition, it was
observed that the type of content, still images or video sequences, can have an effect on the subjective scores:
viewing experience and naturalness were not affected by the type of content, but presence was. Additionally,
IJsselsteijn et al. [7] observed that motion had a much higher impact on presence scores than depth, which makes
presence a less appropriate evaluation concept for 3D videos. Further studies by Seuntiëns comparing naturalness
and viewing experience showed that naturalness is the most sensitive measure of “the 3D added value”.
Once an appropriate evaluation concept has been identified, the effect of the different “low level
factors” (level 2 in Figure 2.1) on the overall 3D QoE can be investigated. In the following subsections, the relationship
between these factors and QoE will be discussed.
Figure 2.1: Multidimensional aspects of 3D QoE (Figure adapted from [8])
Different studies have evaluated the individual factors and their relations. In the following, a list of
studies with their respective evaluation criteria is provided to give an overview of what has been evaluated and
how. The results themselves are discussed in the following sections; this overview serves as
an index for them. Table 2.1 lists related work on the relationship between different 3D
evaluation concepts which will be further discussed in the thesis.
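Several of the studies in Table 2.1 ([3], [4]) collect ratings with ACR with hidden reference (ACR-HR). As a minimal sketch of how such ratings are commonly summarized, the following Python example computes a mean opinion score and a differential score per condition. It assumes the differential-score convention of ITU-T P.910, DV = V(PVS) - V(REF) + 5 on a 5-grade scale; the function names and all ratings are hypothetical and are not data from the cited studies.

```python
from statistics import mean


def mos(ratings):
    """Mean opinion score of one stimulus over all observers."""
    return mean(ratings)


def acr_hr_dmos(pvs_ratings, ref_ratings, scale_max=5):
    """Differential MOS for ACR with hidden reference (assumed P.910 convention).

    Per observer: DV = V(PVS) - V(REF) + scale_max, so DV == scale_max means
    the processed sequence was rated as high as its hidden reference.
    """
    if len(pvs_ratings) != len(ref_ratings):
        raise ValueError("ratings must be paired per observer")
    diffs = [p - r + scale_max for p, r in zip(pvs_ratings, ref_ratings)]
    return mean(diffs)


# Hypothetical ratings from five observers on a 5-grade scale.
ref_2d = [5, 4, 5, 4, 5]  # hidden 2D reference
pvs_2d = [3, 3, 4, 2, 4]  # same content, coded, 2D
ref_3d = [5, 4, 4, 4, 5]  # hidden 3D reference
pvs_3d = [3, 3, 3, 2, 4]  # same content, coded, 3D

print(acr_hr_dmos(pvs_2d, ref_2d))  # 3.6
print(acr_hr_dmos(pvs_3d, ref_3d))  # 3.6
```

In this invented example the 2D and 3D conditions end up with the same differential score although their reference ratings differ slightly. Because each score is expressed relative to a reference of the same presentation mode, any benefit of 3D over 2D is not visible in such scores.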
Columns: Ref. / What is addressed / Definitions / Test method / What varied

[9]
  What is addressed: Naturalness and visual experience as a function of blur and white noise.
  Definitions:
    Image quality: excellence of the image
    Naturalness: realistic or truthful reproduction of reality
    Depth percept: amount of depth
    Viewing experience: total experience related to the display
  Test method: ACR (5 grade scale)
  What varied: White noise and Gaussian blur level (4 levels each).

[3]
  What is addressed: 3D Quality of Experience, Visual comfort as a function of coding and transmission conditions.
  Definitions:
    Quality of experience: the overall experience
    Visual discomfort: comfort compared to 2D viewing
  Test method: ACR-HR (5 grade scale)
  What varied: Different coding schemes: Simulcast, MVC, Frame Packing, 2D (4 levels each). Packet loss (2 levels for 2D and 3D Simulcast).

[4]
  What is addressed: 3D Quality of Experience, Visual comfort as a function of coding and transmission conditions.
  Definitions:
    Quality of experience: the overall experience
    Visual discomfort: comfort compared to 2D viewing
  Test method: ACR-HR (5 grade scale)
  What varied: Simulcast, MVC (4 levels each), Frame rate reduction, downscaling (2 levels each), 2D. Packet loss short and long duration (on one view), different error concealments.

[5]
  What is addressed: Evaluation of Presence.
  Definitions:
    Presence: the sense of being there
    Presence: forgetting about the “real world” outside
    Presence: something they had seen or visited
  Test method: ACR, continuous evaluation, ruler-based method, verbal scaling, pair comparison
  What varied: Review of methods. No tests.

[2]
  What is addressed: Still images. Symmetric and asymmetric JPEG compression.
  Definitions:
    3D Image quality: bad, poor, fair, good, excellent
    Perceived depth: numeric scale from 1 to 5
    Perceived sharpness: numeric scale from 1 to 5
    Perceived eye-strain: rated on an annoyance scale
  Test method: ACR (5 grade scale)
  What varied: JPEG compression (4 levels on each view, full factorial design). 3 different depth levels.

[10]
  What is addressed: 3D videos. Image quality and effect of spatial and temporal resolution. Asymmetric conditions.
  Definitions:
    Perceived quality: bad, poor, fair, good, excellent. Rated by group 1.
    Perceived sharpness: (same scale). Rated by group 2.
    Perceived depth: (same scale). Rated by group 2.
  Test method: DSCQS (continuous scale 0-100)
  What varied: Horizontal, vertical downscaling. Frame rate reduction. 2D and 3D.

[2]
  What is addressed: 3D images. Viewing experience as a function of noise.
  Definitions:
    Viewing experience: scale: bad, poor, fair, good, excellent
    Naturalness: scale: bad, poor, fair, good, excellent
  Test method: ACR (5 grade scale)
  What varied: 6 levels of white noise.

[11]
  What is addressed: 3D videos. Naturalness, Depth, Image Quality as a function of different video coding algorithms.
  Definitions:
    Naturalness: scale: bad, poor, fair, good, excellent
    Depth: scale: bad, poor, fair, good, excellent
    Image quality: scale: bad, poor, fair, good, excellent
  Test method: ACR (5 grade scale)
  What varied: JM H.264 encoder, Simulcast, MVC (JMVC), JM and Side-by-side frame packing.

[12]
  What is addressed: 3D videos. Visual experience as a function of Depth, Image Quality and Visual comfort. Content with no coding degradation.
  Definitions:
    Visual comfort: visual discomfort related to multisymptoms, e.g. eye strain, dry eyes, double vision
    Depth: amount of the perceived depth
    Image quality: the quality of texture rendering, the level of visibility of visual artifacts and rendering details
    Depth rendering: the quality of the depth rendering depending on the subject’s preference on the basic criteria related to stretching or compression of the reality and the shape of the objects
    Naturalness: focuses on the evaluation of the natural appearance of images, i.e. whether the scene is more or less representative of reality
    Visual experience: the overall quality of experience of the images in terms of immersion and the overall perceived quality
  Test method: SAMVIQ
  What varied: Different shooting conditions. Image quality being only affected by geometrical distortions due to shooting condition. The different contents provide different ranges of comfort and depth quality as well.
Table 2.1: Selected set of publications on 3D video QoE research and its relation with lower level factors. (ACR:
Absolute Category Rating, ACR-HR: Absolute Category Rating with Hidden Reference, DSCQS: Double Stimulus
Continuous Quality Scale, SAMVIQ: Subjective Assessment Methodology for Video Quality)
2.1.2 Image quality
Image quality in the context of 3D images and videos has two different connotations: the image quality itself and
the texture quality. The distinction between the two terms is clear in the special case of depth-based image coding
and rendering, where each of the stereoscopic images presented to the viewer is produced by a decoder using 2D
images and depth information. In this specific case, the texture quality is the quality of the 2D picture used by
the decoder, in addition to the depth information, to synthesize a new image with a different viewing position. The
quality of this synthesized picture is characterized by the term image quality. Similarly, while capturing 3D video
content using 3D cameras, distortions such as barrel distortion, pincushion distortion or color bleeding can appear
on the 2D images, e.g. due to the lens quality. Finally, coding may also affect the image quality. A complete review
of distortions of 2D images in the context of 3D videos was performed by Boev et al. [13]. In the general case, the
image quality refers to the quality of the picture seen by observers, which can be the result of depth-based image
rendering, while the texture quality refers to the quality of the pictures provided to the depth-based image
rendering algorithms. In the case of 3D stereoscopic video, where only two stereoscopic views are considered and no
depth-based image rendering algorithm is involved, the two terms can be used interchangeably.
Regarding image quality, the use of stereoscopic videos opens new issues. In particular, the relation between the
quality differences between the two stereoscopic views and the perceived overall quality has been intensively
studied. This relates to the binocular suppression theory [14]. Seuntiëns [2] studied this aspect for still images
encoded using JPEG compression. First, the picture quality was found to be almost independent of the inter-camera
base distance; therefore, depth did not affect the image quality itself. Different quantization parameters were
applied to the left and right views following a full-factorial design. Results showed that high asymmetry in picture
quality strongly affects the overall quality: a pair of lower but symmetric quality, such as a JPEG quantization
parameter of 20 for each view, can be rated higher than a strongly asymmetric pair, such as the original for one view
and a JPEG quantization parameter of 10 for the other view. In the case of non-extreme asymmetric coding conditions,
the overall perceived quality was found to be approximately equal to the average perceived quality of the individual
views. However, these results were found to be degradation dependent: as reported by Stelmach [10], in the case of a
degradation such as blurring, or equivalently downscaling, the overall quality of two stereoscopic views is driven by
the higher-quality view. Moreover, Stelmach [10] found that downscaling one stereoscopic view by a factor of two
along both axes, while the other is kept at full resolution, is not visible to the observers.
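The two regimes reported above can be summarized in a minimal sketch. It is only a rule of thumb under stated assumptions: per-view quality scores lie on a common scale where higher means better, the asymmetry is not extreme, and the function name and the labels "jpeg" and "blur" are hypothetical and not taken from the cited studies.

```python
def combined_stereo_quality(q_left: float, q_right: float, degradation: str) -> float:
    """Approximate overall quality of a stereo pair from per-view quality scores.

    Scores are assumed to lie on a common scale where higher means better
    (e.g. a 5-grade ACR scale). `degradation` is a hypothetical label:
    "jpeg" for blockiness-type coding artifacts, "blur" for blur or downscaling.
    """
    if degradation == "jpeg":
        # Non-extreme asymmetric JPEG coding: roughly the average of both views.
        return (q_left + q_right) / 2.0
    if degradation == "blur":
        # Blur or downscaling: driven by the higher-quality view.
        return max(q_left, q_right)
    raise ValueError(f"unknown degradation type: {degradation!r}")


print(combined_stereo_quality(4.0, 3.0, "jpeg"))
print(combined_stereo_quality(4.0, 3.0, "blur"))
```

The sketch deliberately ignores the strongly asymmetric JPEG case, for which the text reports that the overall quality drops below what averaging would suggest.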
However, one issue that is still weakly studied in this kind of work is the long-term effect of such approaches and
the resulting fatigue. The limited work on this topic is not due to a lack of interest, but rather to the difficulty
of addressing it, since it requires long tests per condition, which makes it difficult to test many different
conditions. Moreover, evaluating fatigue itself is a difficult task.
2.1.3 Depth
The added value of 3D is to bring the perception of binocular depth to the overall user experience. The overall
perceived depth results from the information of many different sources, referred to as “depth cues”. Depth within
the context of 3D images and video has two different connotations: the depth quantity and the depth quality. The
depth quantity relates to the amount of depth perceived in the 3D material, considering a specific setup. The depth
quality is related to the plausibility of the depth rendering considering a specific setup, including how the content
was captured and rendered (see Figure 2.2). These two aspects will be addressed more extensively in Section 2.2 on
foundations of depth perception and in Section 2.5 on technical implementations, respectively.
Seuntiëns [2] and Lambooij et al. [9] studied the added value of 3D as compared to 2D. Test participants were asked
to rate the naturalness of 2D and 3D images. White noise or blur was added to the images, and using the naturalness
scores the added value of the 3D version of the image was compared to the 2D images by finding which amount of
white noise added to the 3D presentation led to similar ratings of the 3D and 2D images. Results show that the 3D
effect was found to provide an improvement of naturalness equivalent to 2 dB of white noise.
Figure 2.2: Different concepts related to depth perception: depth quality, quantity, layout. The figure on the left
side represents a lower depth quality than the figure on the right side.
Yamagishi et al. [11] used another type of distortion, namely video encoding using H.264. They showed that
naturalness does not increase due to the presence of depth: 2D and 3D were rated similarly. Such results may be
explained by the context of the experiment, where test participants may have been too focused on the image quality
degradation due to the video encoding, and by the fact that naturalness is generally degraded by the artificial look
of coding artifacts.
From these results it can be observed that evaluating the 3D added value and its relation with texture quality is
challenging, since it appears to depend strongly on the context and on the employed method of subjective evaluation.
Distortions such as blur, white noise or degradation due to improper shooting conditions do not appear to draw the
attention of the test participants away from the 3D effect as much as coding does. A possible interpretation of this
result is the range of image quality degradations produced by these types of distortions: if highly distorted images
are presented to the test participants, it appears that they will focus mainly on image quality rather than on the
depth effect. As stated above, this effect may be increased or caused by the single stimulus test method applied in
the reviewed studies.
2.1.4 Visual discomfort
A major issue regarding 3D Quality of Experience is visual discomfort. As described by Lambooij et al. [15], visual
discomfort is usually related to visual fatigue. Visual fatigue is related to “the decrease of performance of the
human visual system”, and visual discomfort is “the subjective counterpart of visual fatigue”. Hence, visual fatigue
relates to the long-term effect of visual discomfort. The sources of visual discomfort are multiple and will be
addressed in Section 2.4. These include:
• Anomalies in binocular vision
• Geometrical distortions between the left and right images, crosstalk, binocular rivalry, color and brightness
mismatch between views, suboptimal synchronization between views, etc.
• Excessive binocular parallax
• Conflict between vergence and accommodation
The effect of visual discomfort on the overall QoE when watching 3D image and video material is strong and has
been addressed in many different ways. For example, in [16, 17], the authors investigated the effect of inter-camera
distance on the overall QoE rating. Since the videos were presented without coding and transmission degradation,
only perceived depth and comfort affected the scores. Results show that too high disparity values result in a strong
drop in Quality of Experience scores, which can be explained by the effect of high disparity values on visual
discomfort.
Although visual discomfort is one of the key aspects of 3D, this thesis primarily addresses the perceived depth. The
choice is motivated by the fact that the added value provided by the depth to the overall experience may be what would
justify taking the “risk” of having discomfort. As a consequence, it is an aspect of strong interest.
2.1.5 Models of QoE
Different models of QoE have been proposed in the literature, see e.g. Seuntiëns [2]. In this section, some of these
models are briefly reviewed in relation to 3D video QoE. In the model from Seuntiëns, the overall 3D Quality of
Experience (level 4 in Figure 2.1) was expressed as a function of naturalness (level 3 in Figure 2.1) and visual
comfort (level 4 in Figure 2.1). The naturalness was expressed as depending on the image quality and the depth (see
Figure 2.3). Similarly, the model from Chen [12] expressed the overall QoE as a linear combination of 2D image
quality, depth quantity and visual comfort (see Figure 2.4). This model differs from the model by Seuntiëns in the
aspects it addresses. Indeed, in the model from Chen, the 2D image quality does not refer to quality degradations
such as coding or noise, but to the quality of the stereoscopic images due to the shooting conditions. The second
difference concerns the depth, which is decomposed into two aspects: depth quality and depth quantity. As a
consequence, image quality coupled with the depth quantity makes it possible to address issues such as the quality of
the depth rendering. Similarly to Seuntiëns, naturalness is a function of the image quality and visual comfort, while
depth, instead of being expressed as a single notion, is decomposed into its two aspects: depth quality and depth
quantity. The overall visual experience is modeled by a combination of the different factors.
Figure 2.3: Model of QoE from Seuntiëns [2]
Figure 2.4: Model of QoE from Chen [12]
The models described up to now relate depth, image quality and visual comfort to the overall 3D experience. However,
the perception of these aspects does not only depend on the 3D video itself, but also on the equipment, such as the
display and its rendering abilities. To address this, Engeldrum [18] defined a model for image quality which includes
this technology variable directly. Figure 2.5 illustrates the proposed model. It can be seen that there is a
relationship between the technology variable (how the image is rendered), how it is evaluated, and the final
customer’s rating. The image quality circle is closed by linking the rating to the technology variable. Indeed,
technology affects the perception of the image quality due to the context of use and to expectations based on
previous experiences with the device. The model proposed by Engeldrum is restricted to image quality; therefore,
Lambooij [9] extended it to the other factors involved in 3D QoE (see Figure 2.6). The other aspects, depth and
visual discomfort, are also involved in a circle similar to the image quality circle described by Engeldrum. Then,
similarly to Seuntiëns, the overall 3D experience can be expressed as a combination of the three factors image
quality, depth quality and visual comfort. To further study the relationship between image quality and depth and
their integration into a 3D quality model, Lambooij [9] designed two different models to explain the evaluation
concepts naturalness and visual experience as a linear combination of image and depth quality. A generic schema of
the integration model is shown in Figure 2.7. Each evaluation concept (EC) is expressed as a linear combination of
image quality (IQ) and perceived depth (D) (Eq. 2.1).
\[ EC = \alpha \cdot IQ + \beta \cdot D \tag{2.1} \]
To determine the relationship between these factors and the concepts naturalness and visual experience, different
experiments were conducted with different conditions of white noise and blur, and different kinds of displays and
contents. Curve fitting led to weights of 0.74 and 0.26 for α and β, respectively, when naturalness is modeled.
Similarly, weights of 0.82 and 0.18 were found for α and β when the visual experience is targeted.
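As an illustration, Eq. 2.1 with the reported weights can be written as a minimal sketch. The weights are those quoted above; the dictionary keys, the function name and the assumption that IQ and D are given on the same rating scale are illustrative choices only.

```python
# Weights (alpha, beta) reported for the two evaluation concepts.
WEIGHTS = {
    "naturalness": (0.74, 0.26),
    "visual_experience": (0.82, 0.18),
}


def evaluation_concept(image_quality: float, depth: float, concept: str) -> float:
    """Eq. 2.1: EC = alpha * IQ + beta * D, with IQ and D on the same scale (assumption)."""
    alpha, beta = WEIGHTS[concept]
    return alpha * image_quality + beta * depth


# Example: image quality rated 4.0 and perceived depth rated 3.0.
print(round(evaluation_concept(4.0, 3.0, "naturalness"), 2))        # 3.74
print(round(evaluation_concept(4.0, 3.0, "visual_experience"), 2))  # 3.82
```

In both cases the weight on image quality dominates, which is consistent with the observation that image quality contributes more than depth to both evaluation concepts.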
Figure 2.5: Model of image quality proposed by Engeldrum [18]
Figure 2.6: Model of QoE from Lambooij (Figure from [19])
Figure 2.7: Model of 3D Quality for evaluating the Naturalness and the Visual Experience (Figure from [9])
2.1.6 Interaction between depth and other factors
In this work, the perceived depth is addressed, since it provides the added value of 3D compared to 2D videos. Depth
is not independent of the other factors presented in the previous sections, e.g. image quality and visual comfort.
In this section, results from the literature on interactions between depth and the other factors mentioned are
presented.
2.1.6.1 Perceived depth and image quality
The amount of perceived depth and its relation with image quality have been investigated in a number of studies.
Seuntiëns [2] studied the effect of JPEG compression of still images on the perceived depth, and no clear effect
could be seen on the depth scores (Figure 2.8). Two kinds of degradation were considered by Lambooij et al. [9]:
Gaussian noise and Gaussian blur. The effect of Gaussian noise on the perceived depth was rather small but still
significant, whereas Gaussian blur was found to have a strong effect on the depth scores. These results are explained
by the fact that edges contribute considerably to depth perception, as reported by Bülthoff [20]. Considering that
Gaussian noise has a limited effect on the edges while Gaussian blur strongly affects them, such a result was to be
expected. Yamagishi et al. [11] also found that, for video encoded at different bitrates, image quality affects
depth, and that at too low a quality the depth effect is no longer perceivable. Figure 2.9 depicts their results. In
their experiment, different source sequences were encoded at different bitrates and with different coding schemes.
The coding schemes were side-by-side using the full resolution of the source sequences and encoded with H.264,
side-by-side with half of the horizontal resolution and encoded with H.264, and multi-view coding (MVC), which takes
into account the redundancy between the views. The different processed videos were then evaluated on two evaluation
concepts: perceived depth and image quality. It can be seen that the relation between bitrate and perceived depth was
found to be content-dependent, which may be due to the complexity of the contents themselves, resulting in different
image quality for a given bitrate, so that the depth was affected differently.
Figure 2.8: Relation between depth quantity and image quality, results from Seuntiëns [2]. B: the baseline between
the two cameras. On the horizontal axis, different combinations of quantization for the left and right views are
provided. For example, 10 20 means a quantization parameter of 10 for the left view and 20 for the right view.
Figure 2.9: Relation between depth, image quality, and video compression using different coding schemes. The left
figure illustrates the relation between image quality and bitrate for the different coding schemes. The right figure
addresses the relation between bitrate and perceived depth quantity. Figure reproduced from Yamagishi [11].
2.1.7 Summary
In this section, the problem of modeling 3D Quality of Experience (QoE) based on different factors such as image
quality, depth and visual comfort was addressed. Different general models from the literature for predicting QoE
were presented. The modeling of high-level evaluation concepts such as naturalness and visual experience has been
discussed for very specific degradations: blur or white noise. As shown in this section, depth was found to be one
factor involved in the overall QoE formation, and thus has to be investigated in detail. A more detailed description
of depth perception will be provided in the next section.
2.2 Human perception of depth
In this section an overview of the literature on depth perception in 3D images and video sequences will be provided.
The goal is to enable an overall understanding of how depth is perceived based on the different kinds of information
available to the eyes of a person watching the world around her. There has been a large body of research performed to
analyze how the different sources of information contribute to the perception of depth.
Different factors underlie the notion of depth perception. At first, it is necessary to identify and define these, and to
specify which ones are targeted.
2.2.1 Different factors
Regarding the perception of depth, one can talk about the “depth quantity”, the “depth quality”, or the “scene
layout” (see Figure 2.2). Each of these factors describes a different aspect of depth perception. The first one, the
depth quantity, describes how much depth can be perceived in the scene based on all the depth cues available. The
“depth quality” is a factor involved in the case of 3D rendering on any kind of display. In this case, two steps are
involved: capture or production of a scene, and its rendering. Depending on production and display conditions, the
geometry of the rendered 3D objects may be affected, and distortions of their shape can appear [21, 22, 23, 24]. One
extreme case of such distortion is the “cardboard effect”, where the objects appear flat in different depth
planes [13]. Here, depth quality addresses how well depth is presented to viewers, in terms of their “depth quality”
perception. The last factor is the scene layout. It characterizes the ability to order the objects in depth [25]. In
the remainder of this thesis, except where explicitly stated, it is the depth quantity which is targeted.
2.2.2 Depth cues
There are two different categories of depth cues: binocular and monocular depth cues. Binocular depth cues result
from the retinal images in the two eyes of the observer. Figure 2.10 depicts two examples of binocular depth cues.
The schema on the left side represents the retinal binocular disparity, or stereopsis: the two eyes see two distinct
objects, and the projections of these objects fall at different locations on the retina of each eye. If one point is
the point of fixation, the difference between the positions of these two projections is the retinal disparity. This
information is processed by the brain to estimate the relative position in depth between the two objects, and is the
most important binocular depth cue. The second binocular depth cue depicted in Figure 2.10 is the vergence: the two
eyes converge on the object under study. The orientation of each eye provides the brain with an indication of the
absolute position in depth.
The second category of depth cues is the monocular depth cues. Figure 2.11 depicts some of them. More detailed
explanations follow in Section 2.2.2.2. Monocular and binocular depth cues differ in their ability to characterize
depth. Some can provide the absolute position in depth of an object, such as vergence, motion parallax and relative
size. Others can only provide a relative position in depth, such as defocus blur, binocular disparity and
interposition. In addition, Cutting and Vishton showed that the distance of observation is also of high importance
and studied the threshold of difference of depth depending on the viewing distance [25] (see Figure 2.12). Results
show that some depth cues are always discriminatory, such as occlusion or relative size, and others, like binocular
disparities or motion parallax, are only informative within a certain depth range. For example, binocular disparities
can be used between 0 and 17 m and motion parallax between 0 and 1000 m. All of these depth cues have different
reliability and different discriminative power depending on the viewing distance, and can interact with each other in
the construction of the perception of depth.
Figure 2.10: Binocular depth cues.
Figure 2.11: Monocular depth cues.
Figure 2.12: Depth contrast perception (e.g. ability to perceive differences of depth) as a function of the viewing
distance. Results and Figure from Cutting & Vishton [25].
Figure 2.13: Horopter: all the different points belonging to the Horopter will appear at the same location on the
retina.
2.2.2.1 Binocular depth cues and depth perception
This subsection focuses on the particular case of binocular depth cues. Figure 2.13 depicts the horopter: the points
located on it are projected onto corresponding locations on the two retinas, i.e. with zero retinal disparity.
Relative to the horopter (Figure 2.10, left), objects in front of it have negative retinal disparities, also called
crossed disparities. Conversely, points located beyond the horopter have positive, or uncrossed, retinal disparities.
It is possible to fuse two stereoscopic images, and thus to perceive a single image from the two retinal images seen
by each eye, if the disparities belong to a limited area. Outside of this area, persons suffer from diplopia, which
corresponds to seeing double. The area where stereo vision is possible is called Panum’s fusional area, and depends
on the eccentricity from the fovea: at the fovea, the retinal disparities should be limited to 0.1° to ensure
binocular fusion, but at higher eccentricities such as 6°, the range of binocular disparities can increase to 0.33°,
and at an eccentricity of 12° it can reach 0.66° [15].
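To make these orders of magnitude concrete, the following minimal sketch computes the retinal disparity of an object relative to the fixation point as the difference of two vergence angles, and compares its magnitude with the foveal fusion limit of 0.1° quoted above. It assumes symmetric fixation straight ahead, an inter-pupillary distance of 63 mm (the mean value reported in [26]), and that the limit applies to the magnitude of the disparity; the function names are illustrative only.

```python
import math

IPD_M = 0.063                   # inter-pupillary distance in metres (mean value from [26])
FOVEAL_FUSION_LIMIT_DEG = 0.1   # foveal limit of Panum's fusional area quoted in the text


def vergence_angle_deg(distance_m: float, ipd_m: float = IPD_M) -> float:
    """Angle between the two lines of sight when fixating a point straight ahead."""
    return math.degrees(2.0 * math.atan(ipd_m / (2.0 * distance_m)))


def retinal_disparity_deg(fixation_m: float, object_m: float, ipd_m: float = IPD_M) -> float:
    """Disparity of an object relative to fixation.

    Sign convention of this section: negative (crossed) when the object is
    nearer than the fixation point, positive (uncrossed) when it is farther.
    """
    return vergence_angle_deg(fixation_m, ipd_m) - vergence_angle_deg(object_m, ipd_m)


def is_fusible_at_fovea(fixation_m: float, object_m: float) -> bool:
    """Simplified check against the foveal fusion limit (magnitude only)."""
    return abs(retinal_disparity_deg(fixation_m, object_m)) <= FOVEAL_FUSION_LIMIT_DEG


# Fixating at 2 m: an object at 1.95 m stays within the limit, one at 1.5 m does not.
print(retinal_disparity_deg(2.0, 1.95), is_fusible_at_fovea(2.0, 1.95))
print(retinal_disparity_deg(2.0, 1.5), is_fusible_at_fovea(2.0, 1.5))
```

The check ignores eccentricity, stimulus duration and vergence movements, all of which are reported to change the fusion limits.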
One of the factors which contribute to depth perception, illustrated in Figure 2.10, is the distance between the two
eyes, the inter-pupillary distance, which strongly affects the amount of retinal disparity. Studies have shown that
the majority of adults have an inter-pupillary distance in the range from 50 to 70 mm, with a mean and median of
63 mm [26]. These values are used as an average setting for both depth perception and visual comfort studies.
As explained previously, the area where fusion is possible is particularly small. Without eye movements (vergence
movements) and for short-duration stimuli, fusion limits of 27 arcmin for crossed disparities and 24 arcmin for
uncrossed disparities were found. As reported by Lambooij [15], “many factors have been found having an effect on
fusion. These include eye movements, stimulus properties, temporal modulation of retinal disparity, exposure duration,
amount of illuminance, and individual differences”. The limit of fusion was found to decrease with small, detailed
and stationary objects, and to increase with larger, moving objects and in the case of objects existing in the
periphery of fixed objects [27, 28, 29, 30]. “With longer duration and vergence movement, retinal disparities can be
as high as 4.93° for crossed disparities and 1.57° for uncrossed disparities before producing diplopia” (as reported
by Lambooij [15]).
In addition to retinal binocular fusion, or stereopsis, another cue from stereoscopic vision is convergence. Indeed,
when looking at an object, the two eyes converge on the object under observation, as depicted in Figure 2.10 (right).
The angle of convergence provides an absolute measurement of the distance between the object and the observer. As
reported by Cutting & Vishton [25] and depicted in Figure 2.12, this depth cue is mainly effective for distances of
less than 10 meters.
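The limited range of this cue can be illustrated with a simple geometric sketch. It assumes symmetric convergence on a point straight ahead and an inter-pupillary distance of 63 mm; the function names are hypothetical. The distance is recovered exactly from the angle, but the angle changes very little once the object is several meters away.

```python
import math


def vergence_angle_rad(distance_m: float, ipd_m: float = 0.063) -> float:
    """Convergence angle for a point straight ahead at the given distance."""
    return 2.0 * math.atan(ipd_m / (2.0 * distance_m))


def distance_from_vergence(angle_rad: float, ipd_m: float = 0.063) -> float:
    """Invert the geometry: absolute distance from the convergence angle."""
    return ipd_m / (2.0 * math.tan(angle_rad / 2.0))


# The angular change when doubling the distance shrinks quickly with distance.
near_change = vergence_angle_rad(1.0) - vergence_angle_rad(2.0)
far_change = vergence_angle_rad(10.0) - vergence_angle_rad(20.0)
print(math.degrees(near_change), math.degrees(far_change))
```

Under these assumptions, doubling the distance from 10 m to 20 m changes the convergence angle roughly ten times less than doubling it from 1 m to 2 m, which is in line with the limited useful range reported above.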
2.2.2.2 Monocular depth cues and depth perception
In addition to binocular depth cues, monocular cues also contribute to the perception of the depth in images. The
monocular depth cues include:
Motion Parallax: “when an observer moves, the apparent relative motion of several stationary objects against a
background gives hints about their relative distance. If information about the direction and velocity of movement is
known, motion parallax can provide absolute depth information”, as reported by Ferris [31, 32] (See Figure 2.14).
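A minimal sketch of how absolute depth could follow from motion parallax is given below. It is a simplification under explicit assumptions: the observer translates at a known constant speed, and the stationary object is considered at the instant when it lies perpendicular to the direction of motion, so that its angular velocity is the speed divided by the distance. The function names are illustrative.

```python
def parallax_angular_velocity(observer_speed_m_s: float, distance_m: float) -> float:
    """Angular velocity (rad/s) of a stationary object abeam of a moving observer."""
    return observer_speed_m_s / distance_m


def distance_from_parallax(observer_speed_m_s: float, angular_velocity_rad_s: float) -> float:
    """Absolute distance recovered from the known speed and the observed angular velocity."""
    return observer_speed_m_s / angular_velocity_rad_s


# Walking at 1 m/s: a near object (2 m) sweeps across the view faster than a far one (10 m).
print(parallax_angular_velocity(1.0, 2.0))   # 0.5
print(parallax_angular_velocity(1.0, 10.0))  # 0.1
print(distance_from_parallax(1.0, 0.5))      # 2.0
```

Without knowledge of the observer speed, only the ratio of the two angular velocities is available, which yields relative rather than absolute distances.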
Figure 2.14: Motion parallax
Figure 2.15: Depth from motion
Figure 2.16: Kinetic depth effect
Depth from motion: “When an object moves towards the observer, the retinal projection of an object expands over
a period of time, which leads to the perception of movement in a line towards the observer. Another name for this
phenomenon is depth from optical expansion”, as reported by Swanston [33, 32]. (See Figure 2.15)
Kinetic depth effect: “If a stationary rigid figure (for example, a wire cube) is placed in front of a point source
of light so that its shadow falls on a translucent screen, an observer on the other side of the screen will see a two-
dimensional pattern of lines. But if the cube rotates, the visual system will extract the necessary information for per-
ception of the third dimension from the movements of the lines, and a cube is seen”, as reported by William [34, 32].
(See Figure 2.16)
Linear perspective: The term linear perspective has been used since the 14th century. A definition can be found in
Webster’s dictionary as “the technique or process of representing on a plane or curved surface the spatial relation
of objects as they might appear to the eye; specifically: representation in a drawing or painting of parallel lines
as converging in order to give the illusion of depth and distance” [35]. (See Figure 2.17)
Figure 2.17: Linear perspective
Figure 2.18: Aerial perspective
Figure 2.19: Interposition
Relative size: Cutting defines the Relative Size as “a measure of the angular extent of the retinal projection of two
or more similar objects or textures” [36]. It “arises from the differences in the projected angular sizes of two objects
that have identical sizes and are located at different distances. If the assumption that the two objects have identical
physical sizes is met, then from the ratio of their angular sizes, it is possible to determine the inverse ratio of their
distances to the observer. In this way, metrically scaled relative-depth information can be specified” as reported by
Weiner [37]. (See Figure 2.20)
Familiar size: As described by Weiner [37], if the size of the object is known, the angular size of the object can be
used to determine the absolute distance information.
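The geometry behind the relative size and familiar size cues can be sketched as follows. This is only an illustration assuming a frontal object centred on the line of sight; the function names are hypothetical. For two objects of identical physical size, the ratio of their distances follows from their angular sizes, and for an object of known size the absolute distance follows from a single angular size.

```python
import math


def angular_size_rad(physical_size_m: float, distance_m: float) -> float:
    """Angle subtended by a frontal object of the given size at the given distance."""
    return 2.0 * math.atan(physical_size_m / (2.0 * distance_m))


def distance_ratio_from_relative_size(angle_a_rad: float, angle_b_rad: float) -> float:
    """Ratio d_a / d_b for two objects assumed to have identical physical size.

    For small angles this reduces to the inverse ratio of the angular sizes.
    """
    return math.tan(angle_b_rad / 2.0) / math.tan(angle_a_rad / 2.0)


def distance_from_familiar_size(known_size_m: float, angle_rad: float) -> float:
    """Absolute distance of an object whose physical size is known."""
    return known_size_m / (2.0 * math.tan(angle_rad / 2.0))


# Two identical 0.5 m objects at 2 m and 4 m.
angle_near = angular_size_rad(0.5, 2.0)
angle_far = angular_size_rad(0.5, 4.0)
print(distance_ratio_from_relative_size(angle_near, angle_far))  # close to 0.5
print(distance_from_familiar_size(0.5, angle_far))               # close to 4.0
```

The first function pair only gives metrically scaled relative depth, whereas the last one requires prior knowledge of the object size, which mirrors the distinction between the two cues.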
Absolute size: Even if the actual size of the object is unknown and only one object is visible, a smaller object
seems farther away than a larger object presented at the same location, cf. Sousa et al. [38, 32].
Aerial perspective: Cutting defines aerial perspective as follows: it “refers to the increasing indistinctness of
objects with distance, determined by moisture and/or pollutants in the atmosphere between the observer and these
objects. Its perceptual effect is a decrease in contrast with distance, converging to the color of the atmosphere” [36].
(See Figure 2.18)
Accommodation: This is an oculomotor cue for depth perception. Weiner defines it as referring “to the change in
shape of the lens that the eye performs to keep objects at different distances in focus. Changes in accommodation occur
between the nearest and the farthest points that can be placed in focus by the thickening and thinning of the lens” [37].
This can be used as information about the depth of the object under observation.
Interposition: Occlusion (also referred to as the interposition) “occurs when an object partially hides another object
from view, thus providing ordinal information: The occluding object is perceived as closer and the occluded object as
farther” (Weiner [37]). This information only allows the observer to create a “ranking” of the object positions. (See
Figure 2.19)
Figure 2.20: Relative size
Figure 2.21: Texture gradient
Figure 2.22: Light and shade
Texture gradient: Texture can provide information about the location of the objects. Cutting and Millard divided
texture gradient into three categories: perspective, compression and density [39]. Perspective: due to linear
perspective, the size of objects decreases with the distance to the observer; therefore, the size of the individual
texture element (texel) is affected by this phenomenon. Compression: this relates to the ratio between the width and
height of the texels. The aspect ratio of the texels is affected by their position. Finally, density refers to the
spatial distribution of the texels in the image. These different aspects are illustrated in Figure 2.21.
Light and shade: Weiner defines this factor as referring “to the smooth variation in image luminance determined
by a combination of three variables: the illuminant direction, the surface’s orientation, and the surface’s reflective
properties”. Depth from shading is complex, since different conditions can result in similar shadings. However, if
the illuminant is known, it is possible to derive information about the structure. (See Figure 2.22)
Defocus blur: Related to accommodation, objects in focus are perceived as sharp and objects located far from the
objects in focus appear blurred. The blur provides only the unsigned magnitude of the relative position in depth of
the objects: from this cue alone, it is not possible to distinguish between objects located in front of or behind the
object in focus.
Figure 2.23: Object height. The position in height of the objects compared to the horizon line can be used to derive the
distance of the objects from the observer.
Height in the Visual Field and the Horizon Ratio: Considering an object and an observer located on the same
ground, the farther the object is from the observer, the closer the object is to the horizon line. Therefore, the
vertical height of the object in the visual field can be used as information about its position in depth (see
Figure 2.23, left). Sedgwick defined the geometry behind this phenomenon [40], which is given in Equation 2.2 (see
Figure 2.23, right).
\[ h = \frac{\tan b + \tan a}{\tan b} \tag{2.2} \]
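A possible reading of Equation 2.2 is sketched below. The symbols are not defined in the text, so the sketch assumes the usual horizon-ratio geometry: b is the angle between the horizon and the base of the object (below the horizon), a is the angle between the horizon and the top of the object (above the horizon), both seen from the observer's eye, and h is the object height expressed in units of eye height. The ground-distance function and all names are additional assumptions for illustration.

```python
import math


def height_in_eye_heights(angle_above_rad: float, angle_below_rad: float) -> float:
    """Equation 2.2 under the stated assumptions: h = (tan b + tan a) / tan b."""
    return (math.tan(angle_below_rad) + math.tan(angle_above_rad)) / math.tan(angle_below_rad)


def ground_distance_m(eye_height_m: float, angle_below_rad: float) -> float:
    """Distance to the base of an object standing on the same flat ground as the observer."""
    return eye_height_m / math.tan(angle_below_rad)


# Observer with eyes at 1.6 m looking at a 2.4 m object standing 8 m away on flat ground.
b = math.atan(1.6 / 8.0)          # base of the object below the horizon
a = math.atan((2.4 - 1.6) / 8.0)  # top of the object above the horizon
print(height_in_eye_heights(a, b))  # close to 1.5, i.e. 2.4 m / 1.6 m
print(ground_distance_m(1.6, b))    # close to 8.0
```

As the object moves farther away, b decreases and its base approaches the horizon line, which is the qualitative effect described above.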
2.2.2.3 Discussion about monocular depth cues
The different monocular depth cues are not necessarily orthogonal, and some depth cues can be perceived as a
combination of other depth cues. In their work on depth threshold perception, Cutting and Vishton [25] discarded
several depth cues from their study due to these strong interdependencies (see Figure 2.12). The depth cues omitted
are:
•Texture gradient: As stated in the previous section, texture gradient can be characterized along three different axes:
gradient size, density, and compression. The gradient size depends on the difference between the maximum and
minimum size of the texture element. Therefore this component relates to the relative size cue. The density can be
considered as a depth cue itself. And finally, compression was found by Cutting and Millard to have a limited effect
on the visual system to reveal depth [39].
•Linear perspective: linear perspective was analyzed by Elkins [41] as a systematic combination of several sources:
texture gradient (size, density, and compression) and occlusion. The convergence of the parallel lines could be
expressed by means of object size, density and compression as reported by Taylor [42].
•Brightness, light and shade: Cutting and Vishton discussed that shadings are the main source of depth and not
brightness. Shadows provide information about the shape of the object which can be considered as an application
of transparency. Therefore, it can be related to the aerial perspective. A second motivation of Cutting and Vishton
for not considering it is that shading will provide information on the shape of the object rather than its position
compared to other objects.
•Kinetic occlusion and disocclusion: kinetic occlusion and disocclusion can also be considered as a specific case of
occlusion where no luminance contrast at the occluding edge is required. This can also be related to motion parallax
(Cutting [36]).
•Gravity: Gravity can also be seen as a source of depth information by considering the acceleration of dropped
objects or thrown objects which will vary with the viewing distance, as studied by Watson [43]. This effect is
related to motion parallax which results from the perception of speed of objects at different distances.
This shows that the different cues can interact with each other, either because of the very definition of the depth cues
or because of their underlying physical or geometrical foundations. Further types of interaction between the depth cues
will be addressed in the next subsection.
2.2.2.4 Interactions between depth cues
The possible types of interaction between, and combinations of, depth cues have been categorized by Bülthoff and Mallot
into five categories [44]. These categories are:
•Accumulation: This corresponds to the pooling of depth cues in the final stage of the depth estimation process.
The depth cues are considered after taking into account any type of interaction they could have between each other.
These interactions may have resulted in enhancing or decreasing their contribution to the overall depth perception.
The concept behind accumulation is that a judgment based on cues which agree will be more reliable than a judg-
ment based on one cue. Moreover, when depth cues conflict with each other, the resulting depth perception may
correspond to an intermediate value between the different depth cues. Based on these considerations, the most effi-
cient way to combine depth cues is a weighted mean with weights depending on the reliability and efficiency of each
cue [45]. However, in some cases an average may not be an appropriate pooling strategy due to underestimation
of the perceived depth from some cues. In this case, an additive strategy may be a better approach (examples will
follow in the next section). Therefore, there is the need to define a function which will perform a tradeoff between
the two pooling strategies: summation or averaging.
•Veto: The perception will be based only on one cue, which forces its value on the other depth cues.
•Cooperation: The cooperation, in contrast to accumulation, appears in the early stage of the depth cue pooling and
relates to the interactions between depth cues: the fact that one depth cue will enhance or decrease the contribution
of another depth cue to the overall perception.
•Disambiguation: This is additional information provided by one cue about another depth cue to solve ambiguous
cases. For example, defocus blur can only provide an absolute value of difference in depth: it cannot distinguish
between positive and negative changes of depth around the focus plane. Other depth cues, such as binocular
disparity or other monocular depth cues, can then help to solve this ambiguity.
•Hierarchy: The concept of the hierarchy is based on the fact that information from several depth cues can be con-
sidered as raw data for another depth cue.
Beyond the categorization of Bülthoff and Mallot, two further types of interaction between depth cues were defined:
•Cue dominance [46]: This phenomenon appears when two cues contradict each other; in this case, one
cue may dominate over the other. This is similar to Veto, except that the term Veto is used for small contradictions
between the depth cues.
•Cue promotion [47]: Cue promotion is a scaling which is performed to align different depth cues using parameters
extracted from other depth cues.
2.2.3 Models for depth cue fusion
Based on the different sensory input, several models have been defined in the literature. These models have been
classified into two categories by Clark and Yuille: the weak fusions and the strong fusions [48].
2.2.3.1 Weak fusion
Most of the studies have considered weak fusion algorithms. These models assume no interaction between the different
depth cues, and the depth cue pooling can be limited to the “accumulation”. Depth from individual depth cues is
computed separately, then a weighted linear summation of each estimated depth value from each cue is performed
by weighting the contribution of each cue based on its reliability. This type of model has the advantage of being simple
and modular since every depth cue can be analyzed individually. However, there are several limits to these models,
which were explained by Landy et al. [49]: in addition to the fact that any interaction between depth cues is neglected,
a strong limitation is the need to put every individual depth cue on the same scale, since some may be expressed in
physical units such as meters, and others in other units. Moreover, some depth cues such as the motion cues may not
always be available, and leaving these cues out of the pooling equation is different from setting their contributions to
zero in a linear equation. Besides, the weights based on the confidence in each cue are strongly dependent on the
characteristics of the considered signal. This requires the weights to be dynamic, which makes them difficult to define.
All these aspects make the weak fusion model applicable only in certain cases.
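As an illustration of the weak fusion scheme described above, the following Python sketch pools per-cue depth
estimates with a reliability-weighted mean. The function name weak_fusion, the cue names and the numerical values are
hypothetical and chosen only for illustration; they are not taken from Landy et al. [49]. The sketch also makes the
remark about unavailable cues concrete: a cue marked as missing is removed from the pooling and the remaining weights
are renormalized, which gives a different result than keeping the cue with a depth of zero.

```python
from typing import Mapping, Optional


def weak_fusion(depths: Mapping[str, Optional[float]],
                reliabilities: Mapping[str, float]) -> float:
    """Reliability-weighted mean of independent per-cue depth estimates.

    A depth of None marks a cue as unavailable: it is left out of the
    pooling and the remaining weights are renormalized.
    """
    available = {cue: d for cue, d in depths.items() if d is not None}
    if not available:
        raise ValueError("no depth cue available")
    total = sum(reliabilities[cue] for cue in available)
    if total <= 0:
        raise ValueError("reliabilities must sum to a positive value")
    return sum(reliabilities[cue] * d for cue, d in available.items()) / total


# Hypothetical reliabilities and depth estimates (arbitrary common unit).
weights = {"disparity": 3.0, "texture": 1.0, "motion": 2.0}

# Motion cue unavailable: (3*2 + 1*3) / (3 + 1) = 2.25
fused_missing = weak_fusion({"disparity": 2.0, "texture": 3.0, "motion": None}, weights)

# Motion cue present but reporting zero depth: (3*2 + 1*3 + 2*0) / 6 = 1.5
fused_zero = weak_fusion({"disparity": 2.0, "texture": 3.0, "motion": 0.0}, weights)
```

The two results differ (2.25 versus 1.5), which is why a missing cue cannot simply be encoded as a zero contribution
in a fixed linear equation.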
2.2.3.2 Strong fusion
Alternatively, the second main category of fusions is that of strong fusion. These models assume strong interactions
between the different depth cues. The depth cues cannot be computed separately and depend on each other for defining
how they contribute to the overall depth perception. One implementation of this paradigm as reported by Landy et al.
[49] is provided by Nakayama and Shimojo [50], where the perceived depth in a scene is based only on one of the
different depth cues. The selection of the cue used for the perception of depth is made by choosing the most plausible
depth prediction between the different predicted depth values obtained from each cue. This selection is performed
using information on the scene under study.
2.2.3.3 Hybrid approaches
Intermediate models between the strict weak and strong fusion have been proposed. For example, the model defined
by Landy et al. [49], called “Modified Weak Fusion” (MWF), is restricted to linear combinations of independent depth
cues. This is a weak fusion approach, but allows only one particular case of interaction between depth cues: the cue
promotion. This interaction is part of the strong fusion approach making the model an intermediate approach between
the two strict definitions of weak and strong fusion.
Depth cues analyzed
Author Year Ref. B L T D R I S M K H Other Fusion
Dosher et al 1986 [51] X X X Linear
Bruno and Cutting 1988 [52] X X X X familiar size Additive
Johnston et al 1993 [53] X X Linear / based on viewing distance
Buckley and Frisby 1993 [54] X∼X Outline cues
Landy et al 1991 [55] X X Linear / based on cue reliability
Rogers and Collett 1989 [56] X X Linear / based on reliability
Wang et al 2011 [57] X X Linear / based on viewing distance
Held et al 2012 [58] X X Linear / based on viewing distance
Ernst and Banks 2002 [59] X Haptic MLE
Hillis et al. 2004 [60] X X MLE
Lovell et al. 2012 [61] X X MLE
Massaro 1988 [62] X X X X familiar size Multiplicative (FLMP)
Landy et al 1995 [49] X X X X Linear with cue promotion
Bülthoff and Mallot 1988 [20] X X edges Cue veto
Ogle 1938 [63] X X Cue veto
Braunstein et al 1982 [64] X X Disambiguation
Blake and Bülthoff 1991 [65] X Specularities Disambiguation
Yuille and Bülthoff 1995 [66] X X X X Bayesian decision model
Girshick and Banks 2009 [67] X X Bayesian model
Nakayama and Shimojo 1992 [50] X Bayesian model
van Ee et al 2003 [68] X X Bayesian model
Table 2.2: Subjective experiments on depth cue combinations. B: Binocular depth, L: Linear perspective, T: Texture
gradient, D: Defocus blur, R: Relative size, I: Interposition, S: Light and shades, M: Motion parallax, K: Kinetic depth,
H: Height
Figure 2.24: Evaluation based on parallax
Figure 2.25: Evaluation using test spots
2.3 Results on depth modeling
In this section, results supporting the different models and how depth cues combine with each other will be presented.
The experiments, the depth cues studied, and types of models used are categorized in Table 2.2 and will be discussed
in Subsection 2.3.2.
2.3.1 Subjective depth evaluation methods
In order to be able to study the combination of depth cues into a prediction of the overall depth perception, accurate
methods for subjective evaluation are required. This subsection will present different measurement methods reported
in the literature. A first method illustrated in Figure 2.24 is the one proposed by Gogel [69]. The idea behind the
method is to evaluate the perceived position in depth in an indirect manner using parallax cues. Indeed, if the observer
changes position laterally, the objects will appear to shift laterally as well. The shift is proportional to their distance
in depth to the screen, and the direction of the shift depends on their position relative to the display:
the shift will be in the same direction as the observer’s motion if the object pops out of the display and in
the opposite direction if the object is in the zone within the display. From the angles O1 and O2 reported by the
participant, the displacement, and the viewing distance, it is possible to measure the position in depth of
the objects.
An alternative illustrated in Figure 2.25 is proposed by Bülthoff and Mallot [20]. The idea of this methodology is
to ask the test participants to align several dots on a 3D image. This 3D image is the result of the combination of
different monocular and binocular depth cues. Through the dots, the test participant can only adapt the binocular
disparity and therefore, define an equivalency between binocular depth cues and the combination of several depth
cues contained in the image (binocular disparity and shadings in [20]). A derived approach consists in adapting an
ellipsoid defined using binocular depth cues to make it correspond to another ellipsoid defined by both monocular
and binocular depth cues. Another alternative illustrated in Figure 2.26 is described by Johnston [70]. Similarly to the
quality ruler [71], it consists in asking test participants to evaluate the test signal on a well-controlled scale of other
stimuli. Participants should then select one of the reference stimuli, which best corresponds to the stimulus under
Figure 2.26: Evaluation using controlled stimuli
Figure 2.27: Evaluation by defining surface normals
evaluation. In the particular case of [70], the reference stimuli are several cylinders having continuous curved surfaces
and different ranges of depth. The test stimuli are different patterns of random-dot stereograms.
Other alternatives consist in comparing two different stimuli and selecting which presentation shows the expected
property. This property can be the largest expansion in depth as in Landy [49], which stimulus is the tallest as in
Ernst et al. [59], which has the largest slope as in Girshick et al. [67], or the direction of the rotation (clockwise or
counterclockwise) as in Braunstein et al. [64]. Alternatively, to decrease the number of pairwise comparisons, it is also
possible to compare two pairs of two stimuli and ask participants to report a property regarding the pairs themselves,
for example which pair shows the largest expansion in depth.
Another approach is to let the test participants describe the orientation of a surface by its normal vectors. This method
is illustrated in Figure 2.27, and was studied by Stevens and Brook [72]. The concept behind this methodology is to
ask test participants to define the normal vectors of the surface under study. Alternatively, Van Ee [68] let the test
participants adjust the orientation of lines to describe the slant of the stimulus (a 3D plane).
In the work of Dosher et al. [51] the task was different, and observers were confronted with a forced choice between
two alternative options to explain their understanding of the scene. Participants were asked to report if what they saw
is a cube or a truncated pyramid, or what the first direction of the rotation of the stimulus was: left or right.
Another alternative giving more freedom to the test participant is proposed by Tittle and Braunstein [73], who asked
the observers to report the depth-to-height ratio of the stimuli under study.
Rogers and Graham [74] chose to gradually increase the amount of binocular disparity and let the test participants
indicate from which amount of disparity they could perceive depth in the proposed stimulus. In this case, the stimulus
under study is a random-dot pattern describing an oscillation in depth.
Finally, another approach can be found in the literature as presented by Bruno and Cutting [52] which consists of
directly asking the test participants to evaluate the perceived depth between objects on a rating scale. In the particular
case of [52] a scale from 0 to 100 was used, 0 meaning no distance between objects and 100 being the “maximal
exocentric separation”.
2.3.2 Evidence to support models
After having presented the evaluation methods in the previous section, the following section describes studies which
have been conducted and reveals the different types of interactions possible between depth cues and the respective
models.
2.3.2.1 Weak fusion
Weak fusion models have been widely used, reflecting the fact that this kind of model performs well in many cases.
Dosher et al. [51] used a linear model combining binocular depth, perspective and luminance to model the ability of an
observer to perceive kinetic depth, based on viewing tests assessing the observer’s ability to see the direction of a
rotation. In this study, stereopsis was found to be the dominant cue in static presentations, and it was dominant in most
of the dynamic presentations as well. A recency effect was also observed: when a static presentation preceded
a dynamic one, the static presentation strongly influenced the dynamic one.
To model the interaction between other depth cues, a linear model was used by Bruno and Cutting [52], where
motion parallax, occlusion, height, and familiar size were considered. In the underlying test, participants were asked
to rate the magnitude of exocentric distances using a forced choice between two stimuli. The outcome
suggests that depth cues are additive and independent, and each cue was referred to as a “minimodule”.
Johnston et al. [53] considered stereopsis and texture. The data were modeled using a weighted linear combination.
Weights were found to be dependent on the viewing distance: in the case of a short viewing distance, the binocular depth
was found to be dominant and only a small weight was given to the texture. In the case of a farther viewing distance, the
weight of the texture gradient was found to be greater. This was explained by the fact that stereopsis is less reliable
at larger viewing distances [25].
Buckley and Frisby [54] also studied binocular disparities and texture, and additionally analyzed the interaction with
“outline cues”. The term “outline cue” was defined by Clarke et al [75] and describes the following property: “if a rectangular
is drawn in the plane of the display, rotating it along the vertical axis makes it appear as a trapeze”. This is related
to the linear perspective depth cue. In this study, truncated textured cylinders were considered. These cylinders had
different amplitudes of depth and two distinct orientations: vertical or horizontal. Results have shown that the
orientation has an impact on how the depth is perceived. In the case of a horizontal cylinder, binocular cues dominate the
overall depth perception. In the case of a vertical cylinder, outline and texture dominated the perceived depth for low
depth amplitudes (3-6 cm), but not for high amplitudes (9 cm). In the latter case, binocular cues were again dominant.
Landy et al [55] considered kinetic depth and texture integration. A linear model was used with variable weights. These
weights were adjusted based on the reliability of the depth cues. The reliability of the depth cues was determined by
previous test results and was found to be dependent on the scenes considered.
Defocus blur and binocular disparity were studied by Wang et al. [57]. A paired comparison experiment was conducted,
and test participants had to report which of the two stimuli, composed of two natural images, provided the larger depth
interval between the sharp plane and the blurred plane. In this study, it was found that depth perception was not
affected by the viewing distance between the observer and the blurred plane, but was affected by the distance between
the sharp plane and the blurred plane, and then by the disparity gradient. One type of Gaussian blur was considered.
The blur provided an increase of perceived depth, but not enough data are available to study the combination of blur
and binocular disparity.
Held et al. [58] considered different combinations of binocular disparities and defocus blur to determine depth
discrimination thresholds between two distinct objects. The approach is similar to that of Cutting and Vishton [25] and
determines the perceived depth JNDs, but provides additional results by analyzing the interaction between depth cues.
Results show that the contribution of each cue depended on the viewing distance: when the viewing distance is small,
binocular depth defines the depth discrimination threshold. On average, in the conditions of the experiment, when the
viewing distance is higher than 32 cm, the defocus blur defines the depth discrimination threshold. This contradicts
the results provided by Wang et al. [57], where the contribution of blur was found to be independent of the viewing
distance. This is probably explained by the different viewing distances which were considered. Indeed, in [57],
the viewing distance was less than 32 cm in 147 conditions out of 155; therefore, according to [58], binocular disparity
was dominant, which explains why in [57] the disparity gradient was found to be the only factor.
A popular method for linear depth cue pooling is the use of the Maximum-Likelihood Estimate (MLE). The motivation
behind this approach is to take into account the reliability of each depth cue in the pooling, as already mentioned in
the case of Landy et al.’s work [55]. The method uses the following rule: $S_i^*$ is an estimate of the depth $S$ due to
the depth cue $f_i$. If the estimation error of $S_i^*$ is a Gaussian noise with variance $\sigma_i$, then the
combination of the $N$ depth cues can be performed by Equation 2.3. This equation gives a higher weight to the reliable
depth cues, the ones having a Gaussian noise with a low variance, than to the unreliable depth cues, the ones having a
Gaussian noise with a high variance.
\[ S^* = \sum_{i=1}^{N} w_i \cdot S_i^*, \quad \text{where } w_i = \frac{1/\sigma_i}{\sum_{j=1}^{N} 1/\sigma_j} \tag{2.3} \]
Ernst and Banks [59, 76] studied the combination of binocular and haptic cues and found that MLE could very well
predict the integration of the two considered depth cues by the visual system. Hillis et al. [60] extended this previous
study by showing the validity of the MLE approach for the combination of binocular and texture gradient cues. This
was successfully extended further by Lovell et al. [61] to the combination of shading and binocular cues.
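The pooling rule of Equation 2.3 can be sketched in a few lines of Python. This is a minimal illustration and not the
implementation used in the cited studies: the function name mle_combine and the numerical values are hypothetical, and
$\sigma_i$ is taken to denote the variance of each cue, as in the text. The variance returned for the combined estimate
relies on the additional assumption that the cue errors are independent.

```python
def mle_combine(estimates, variances):
    """Inverse-variance weighted pooling of per-cue depth estimates (Equation 2.3).

    estimates: depth estimates S_i*, one per cue.
    variances: variances sigma_i of the Gaussian estimation errors.
    Returns (combined estimate, weights, variance of the combined estimate).
    """
    if len(estimates) != len(variances) or not estimates:
        raise ValueError("need one positive variance per estimate")
    inverse = [1.0 / v for v in variances]
    total = sum(inverse)
    weights = [x / total for x in inverse]          # w_i, summing to 1
    combined = sum(w * s for w, s in zip(weights, estimates))
    # Assumption: independent cue errors, so the pooled variance is 1 / sum(1/sigma_i).
    combined_variance = 1.0 / total
    return combined, weights, combined_variance


# Hypothetical example: a reliable cue (variance 1) and a less reliable one (variance 4).
depth, w, var = mle_combine([10.0, 14.0], [1.0, 4.0])
# Weights are 0.8 and 0.2, so the pooled depth is 0.8*10 + 0.2*14 = 10.8.
```

Under these assumptions the pooled variance (0.8 in the example) is smaller than the variance of either cue alone,
which is the usual argument for reliability-based pooling.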
2.3.2.2 Strong fusion
In the following, successful applications of strong fusion models to depth prediction are summarized. Evidence of
strong fusion can also be found in the literature showing the different types of interaction as listed in subsection
2.2.2.4.
Cue promotion
Rogers and Collett [56] studied the relationship between motion parallax, binocular disparity and the overall depth.
Based on an experiment involving motion parallax and zero binocular disparities, it was found that the perceived depth
revealed by subjective data corresponded to half of the actual depth which was defined by design. Different combinations
of monocular and binocular depth cues showed that parallax affected the perceived depth only when the disparity gradient
was small. Similarly, in Johnston et al. [53], binocular depth was found to be dominant. A linear model was proposed,
and the overall depth is defined as the sum of the binocular depth and a weighted contribution of the motion
parallax. The weight of the motion parallax is defined as inversely proportional to the binocular depth. This reveals an
interaction of the type “cooperation” as defined in subsection 2.2.2.4.
Based on the results presented by Bruno and Cutting [52] showing a linear combination of independent depth cues,
Massaro [62] suggested another approach called “fuzzy logical model of perception” (FLMP). The motivation behind
this new model is the lack of evidence for additivity and individual processing of each depth cue in their results, and
therefore, of the weak fusion approach. The strong fusion model proposed is a multiplicative one with a normalization
of the different depth cues. The normalization is defined such that the most reliable depth cue has the strongest effect
on the overall depth.
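To make the multiplicative combination concrete, the sketch below uses the form in which the FLMP is commonly stated for
two response alternatives: each cue provides a degree of support between 0 and 1, the supports are multiplied, and the
product is normalized by the sum of the products for both alternatives. Whether [62] applies exactly this normalization
to the depth data of [52] is an assumption, and the function name flmp and the support values are hypothetical.

```python
from math import prod


def flmp(supports):
    """Two-alternative FLMP: multiplicative pooling of fuzzy supports in [0, 1].

    Returns the normalized support for the alternative the cues speak for.
    """
    if any(not 0.0 <= s <= 1.0 for s in supports):
        raise ValueError("supports must lie in [0, 1]")
    in_favor = prod(supports)
    against = prod(1.0 - s for s in supports)
    if in_favor + against == 0.0:
        raise ValueError("cues give fully contradictory support")
    return in_favor / (in_favor + against)


# Two cues supporting "depth is present" with strength 0.8 and 0.6:
# 0.48 / (0.48 + 0.08), i.e. about 0.857.
both = flmp([0.8, 0.6])

# A completely ambiguous cue (0.5) leaves the other cue's support unchanged.
with_neutral = flmp([0.9, 0.5])
```

In this form an ambiguous cue has no influence on the outcome, whereas an unambiguous cue (support close to 0 or 1)
drives the result, which matches the description that the most reliable cue has the strongest effect.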
Landy et al. [49] defined an intermediate model between weak and strong fusion and called it “Modified Weak Fusion
(MWF)”. The motivation behind this algorithm is that weak fusion already performs well in modeling subjective
data and has a low complexity. The MWF is linear and considers the different depth cues to be independent. Due
to this assumption, it may be categorized among the weak fusion models; however, one type of strong interaction is
allowed: the cue promotion, which makes it part of the strong fusion schemes.
Cue vetoing
Several studies also report examples of “cue vetoing”. Bülthoff and Mallot [20] analyzed the interactions between
binocular depth and shading. First, regarding binocular depth cues, it was found that the presence of edges is
important, and depth perception is considerably reduced in the case of smooth disparate images. The contribution of
binocular depth to the overall depth is much stronger than that of shading, but shading was still found to affect depth
perception. It was observed that edge-based stereo overrides both shape-from-shading and shape-from-disparate-shading. A
conflict between the information from shading and disparity does not result in a veto of the depth information from
shading, but results in a reduction of the perceived depth of approximately 25%. Ogle [63] demonstrated an example of cue
veto using vertical magnifier lenses in front of one eye, which altered vertical but not horizontal disparities. This
made the plane surface under study appear slanted. The slant increases monotonically with the magnification. However,
when the conflict between vertical and horizontal disparities becomes too strong, the slant appears to
be null again, which is therefore an example of cue vetoing.
Cue disambiguation
An example of “cue disambiguation” can be found in the work of Braunstein et al. [64], who addressed
kinetic depth and occlusion. In this study, they found that occlusion can be used to disambiguate the perceived motion
information and the overall 3D geometry of the surface under study. Andelson [77]
reports similar properties, explaining that motion parallax can be ambiguous; indeed, a change of speed can be
induced by two possible causes: a change of shape or a change of velocity. As a result, there is an infinite number of
potential positions in depth due to different combinations of motion and shape. Besides, motion parallax by itself cannot
provide information about depth, and other cues such as binocular depth cues are needed. This kind of analysis can
also be extended to the relative size depth cue, where it is not always easy to differentiate a change of size due to the
size of the objects themselves from one due to their position in depth. Another example of disambiguation, between
specularity and shading, can be found in the work of Blake and Bülthoff [65].
2.3.3 Modeling
Previous sections have dealt with aspects of depth cue integration. In the following, concrete depth models are
presented. To perform the depth cue pooling while taking into account all the different possible kinds of interaction,
Bayesian models have become the reference for sensory input pooling [67]. Yuille and Bülthoff [66] have described the
overall framework and used it on two separate examples, one combining texture gradients and shape from shading, and the
other one pooling binocular depth cues and motion parallax. Due to the importance of these types of models, they
are described here, as well as how they work and how they can be applied in practice. The notation provided by Yuille
in [66] is used. The basic Bayesian formula is given by Equation 2.4.
\[ P(S|I) = \frac{P(I|S)\,P(S)}{P(I)} \tag{2.4} \]
$S$ is the characteristics of a scene, such as how different objects are perceived to have distinct positions in depth
and/or shape. $I$ is the retinal image. $P(I|S)$ is the likelihood function for a scene, and is the probability of
seeing the image $I$ given the characteristics $S$. $P(S)$ is the prior distribution, which is the probability of seeing
the property $S$ in the world in general. $P(I)$ can be seen as a normalization constant. $P(S|I)$ is the posterior
distribution and describes the probability of the depth characteristic $S$ given the retinal image $I$. In order to get
the depth estimate $S^*(I)$, one needs to find the value of $S$ which maximizes the posterior distribution $P(S|I)$, as
expressed by Equation 2.5. This value is called the maximum a posteriori (MAP) estimate.
\[ S^* = \arg\max_S P(S|I) \tag{2.5} \]
Based on this formalization, the process of estimating depth from two depth cues $f$ and $g$ can be performed in a
weak or a strong manner. In the case of the weak fusion model, the depth estimates $S_1^*$ and $S_2^*$ are first
determined separately: $S_1^* = \arg\max_S P(S|f)$ and $S_2^* = \arg\max_S P(S|g)$. Then, a weak fusion model combines
$S_1^*$ and $S_2^*$ using, for example, a weighted mean.
The approach relevant to this section is the possibility to perform a strong fusion: the depth estimated from two
cues, $S_{1,2}^*$, is determined as defined in Equation 2.6 and is based on the analysis of both cues in the same module,
and not in two separate modules as presented previously.
\[ S_{1,2}^* = \arg\max_S P(S|f,g) \tag{2.6} \]
Equation 2.7 is the application of Equation 2.4 to the two-depth-cue case.
\[ P(S|f,g) = \frac{P(f,g|S)\,P(S)}{P(f,g)} \tag{2.7} \]
An intermediate step between weak and strong fusion is to consider $P(f,g|S) = P(f|S) \cdot P(g|S)$. This is possible
if the two priors, i.e. the probabilities of seeing a certain amount of depth $S_1$ and $S_2$ from the cues $f$ and $g$
respectively, are the same. This is a halfway step because the two cues are decoupled and therefore do not follow the
definition of a strong fusion algorithm. However, this is not a weak fusion either, since the depth estimate
$S_{1,2}^*$ does not result from a linear combination of $S_1^*$ and $S_2^*$. This result is given in Equation 2.8, which
was employed by Girshick and Banks [67] for combining binocular and texture gradient depth cues. A simplified form of
such a model can also be found in Nakayama and Shimojo [50], where the prior probabilities $P(S)$ are neglected and the
overall probability of the scene is based on the depth cue which shows the highest probability of explaining the scene.
\[ P(S|f,g) \propto P(f|S) \cdot P(g|S) \cdot P(S) \tag{2.8} \]
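The following Python sketch shows one way Equation 2.8 can be evaluated numerically: the product of the cue likelihoods
and the prior is computed on a grid of candidate depths, and the MAP estimate is the grid value with the largest
product. The Gaussian likelihoods, the grid, the helper names and all numerical values are illustrative assumptions and
are not taken from [66] or [67].

```python
import math


def gaussian_likelihood(measurement, variance):
    """Return s -> P(measurement | S = s) for Gaussian noise of the given variance."""
    def likelihood(s):
        return math.exp(-(measurement - s) ** 2 / (2.0 * variance)) / math.sqrt(
            2.0 * math.pi * variance)
    return likelihood


def map_estimate(grid, likelihoods, prior):
    """Grid value maximizing prior(S) times the product of the cue likelihoods."""
    def unnormalized_posterior(s):
        p = prior(s)
        for likelihood in likelihoods:
            p *= likelihood(s)
        return p
    return max(grid, key=unnormalized_posterior)


def flat_prior(s):
    return 1.0


grid = [i / 100.0 for i in range(2001)]        # candidate depths 0.00 to 20.00
cue_f = gaussian_likelihood(10.0, 1.0)         # hypothetical cue f: measured 10, variance 1
cue_g = gaussian_likelihood(14.0, 4.0)         # hypothetical cue g: measured 14, variance 4

s_flat = map_estimate(grid, [cue_f, cue_g], flat_prior)
# Hypothetical prior favoring depths around 5 (Gaussian in S, variance 4).
s_near = map_estimate(grid, [cue_f, cue_g], gaussian_likelihood(5.0, 4.0))
```

In this special case of Gaussian likelihoods and a flat prior, the maximum (10.8) coincides with the inverse-variance
weighted mean of Equation 2.3. With an informative prior the estimate is pulled toward the prior (about 9.83 here), and
with non-Gaussian likelihoods the result is in general no longer a linear combination of the single-cue estimates.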
The computation of $P(f|S)$ can be achieved by knowing the properties of the depth cues under study. In the example
of Yuille and Bülthoff [66], the probability of light and shade for a retinal image $I$ can be computed based on the
imaging model in Equation 2.9.
\[ I = s \cdot n + N \tag{2.9} \]
Here $s$ is the light source, $n$ is the normal to the surface, and $N$ is a Gaussian noise. The likelihood function is
$P(I|S) = (1/Z)\, e^{-(1/2\sigma^2)(I - s \cdot n)^2}$, where $Z$ is a normalization factor and $\sigma^2$ is the
variance of the Gaussian noise [49]. Such a model needs to be defined for each depth cue and makes it possible to find
the most appropriate solution. However, as mentioned by Yuille and Bülthoff [66], the hypotheses made for defining the
likelihood function have a strong influence on the results and should be chosen carefully, considering that hypotheses
for different depth cues can be contradictory, as in the case of shading and texture gradient.
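As a small worked illustration of the imaging model of Equation 2.9, the sketch below evaluates the logarithm of the
likelihood for a single image intensity. It is a simplified, hypothetical setting: one pixel, a light source and a
surface normal given as 3D vectors, and a normalization factor $Z = \sqrt{2\pi}\,\sigma$ for a one-dimensional Gaussian.
The function name and the values are not taken from [66].

```python
import math


def shading_log_likelihood(intensity, light, normal, sigma):
    """log P(I | S) for the model I = s . n + N with Gaussian noise N.

    light, normal: 3D vectors s and n; sigma: standard deviation of the noise.
    Assumption: a single pixel, so Z = sqrt(2*pi) * sigma.
    """
    predicted = sum(l * n for l, n in zip(light, normal))   # s . n
    z = math.sqrt(2.0 * math.pi) * sigma
    return -math.log(z) - (intensity - predicted) ** 2 / (2.0 * sigma ** 2)


light = (0.0, 0.0, 1.0)                       # hypothetical light along the viewing axis
observed = 0.9                                # hypothetical observed intensity

frontal = shading_log_likelihood(observed, light, (0.0, 0.0, 1.0), 0.05)
tilted_up = shading_log_likelihood(observed, light, (0.0, 0.5, math.sqrt(0.75)), 0.05)
tilted_side = shading_log_likelihood(observed, light, (0.5, 0.0, math.sqrt(0.75)), 0.05)
```

With these values the tilted surfaces explain the observed intensity better than the frontal one, and the two tilted
normals obtain exactly the same likelihood: shading alone cannot tell them apart, which illustrates why different
conditions can produce similar shadings and why additional cues or priors are needed.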
Another important aspect is the use of Bayes’ decision model in the optimization process for finding the best solution.
This is done through the definition of a loss function. This function defines the cost of making a prediction error for
each depth cue. Let $L(S,d)$ be the cost of estimating the value $d$ instead of $S$. The risk function is defined in
Equation 2.10.
\[ R(d) = \int L(S,d)\, P(S|I)\, dS \tag{2.10} \]
Finding the best solution results in finding the value $d^*$ which minimizes the risk. The advantage of such a practice
is the ability to provide a different weighting for each depth cue. This makes it possible to take into account the
reliability of a depth cue in the process of finding the best solution, by defining the loss function per depth cue. Such
an application can be found in Girshick and Banks [67], who found that texture gradient was not as reliable as binocular
depth cues and that the contribution of the two cues depended on how well they agreed: for small disagreements, both cues
are considered, and for strong disagreements the texture gradient is discarded. These results can be explained by the
Bayes’ decision model with an appropriate risk function.
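The role of the loss function in Equation 2.10 can be illustrated with a discretized version of the risk, in which the
integral is replaced by a sum over a grid of candidate values. The sketch below is a generic illustration and not the
model of [67]; the posterior values and the two loss functions are hypothetical.

```python
def bayes_decision(grid, posterior, loss):
    """Return the grid value d minimizing R(d) = sum over S of L(S, d) * P(S | I)."""
    total = sum(posterior)
    probabilities = [p / total for p in posterior]      # normalize the posterior

    def risk(d):
        return sum(loss(s, d) * p for s, p in zip(grid, probabilities))

    return min(grid, key=risk)


def quadratic_loss(s, d):
    return (s - d) ** 2


def zero_one_loss(s, d):
    return 0.0 if s == d else 1.0


# Hypothetical bimodal posterior: two conflicting interpretations at S = 2 and S = 8.
grid = list(range(11))
posterior = [0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0, 0.4, 0.0, 0.0]

d_quadratic = bayes_decision(grid, posterior, quadratic_loss)   # compromise near the mean
d_zero_one = bayes_decision(grid, posterior, zero_one_loss)     # most probable value
```

With the quadratic loss the decision (4) is a compromise between the two interpretations, whereas the zero-one loss
selects the most probable one (2) and ignores the other. This shows how the choice of the loss function alone can move
the outcome between an averaging behavior and a selection behavior.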
2.3.4 Conclusion
In this section, a comprehensive review of the work performed on depth perception in psychophysics was provided,
addressing the questions: what are the different depth cues which will be further studied in the thesis, how do they
relate to each other, and what are the different models for predicting depth perception? The next section will briefly
address a second important aspect: visual comfort.
2.4 Visual comfort
Visual comfort is one of the biggest issues with regard to 3D Quality of Experience, as stated in the introduction.
Visual discomfort is usually related to visual fatigue. Visual fatigue is related to “the decrease of performance of the
human visual system” and visual discomfort is “the subjective counter-part of visual fatigue” [15]. Urvoy defined
visual discomfort as a factor that can be evaluated subjectively, while visual fatigue is rather a symptom that can
be evaluated through objective measurement in clinics by doctors [78]. In both definitions, visual fatigue relates to
the long-term effect of visual discomfort.
Figure 2.28: Vergence accommodation conflict.
A large amount of research has been carried out to evaluate the effect of the two last-mentioned factors on visual
fatigue. The cause of visual discomfort most commonly put forward is the vergence accommodation conflict [79, 80]. Figure 2.28
illustrates this problem. Eye convergence and accommodation are usually updated
in a synchronized manner. However, in the case of 3DTV, the focus has to be kept on the screen while the vergence follows
the 3D rendered object in depth. This results in a disagreement between accommodation and vergence which is
unnatural for the human visual system. However, there are contradicting results with regard to this explanation:
studies have found that the accommodation does not always remain focused on the screen but can
also move to the 3D object [81]. It is not clear, however, whether the observed shift of accommodation away from the display plane is
due to the change of vergence or to the natural underaccommodation which occurs in the case of near objects [15, 82].
Moreover, the accommodation does not always need to be updated: there is an area called the depth of focus (DOF)
which corresponds to an area where vergence could change while keeping the object sharp without changes of ac-
commodation. The DOF ranges from 0.04 to 3.50 diopter, and has typical values from 0.2 to 0.5 diopter [15]. Within
this area, it is then possible to have an update of the vergence while the accommodation stays the same without being
perceived as unnatural by the human visual system. Based on this area, a comfort zone was defined, which states that
retinal disparities should be less than 1° [83].
Alternatively, a zone of comfort was proposed by Percival. It is called the Percival area and can be seen as an
alternative to the 1° rule [15]. It is based on the middle third of the range of binocular vergence with almost no
change in accommodation [15, 84]. However, there is a lack of agreement on how to define this area: some studies used
break points whereas others used blur points. The representation of the different zones depending on how they were
determined are illustrated in Figure 2.29.
Figure 2.29: Different comfortable viewing zones. The zones describe where single binocular
vision and comfort can be achieved as a function of retinal disparities and distance to the stimulus. These areas depend
on how the evaluation was performed (break points or blur points). (Figure from [15])
Within the comfortable viewing zone, visual discomfort might still occur in case of too much variation of disparities
[85]. This was found to be the case for scenes having large amounts of disparity and motion. Moreover, it was
observed that discrete changes of motion in the depth direction in stereoscopic sequences result in a decrease of the
accommodation response and a significant decrease of visual comfort [86], as reported in [15]. In particular, the
alternation between positive and negative disparity was also found to have a strong effect on visual comfort [87].
As presented in this section, there are many different sources of visual discomfort. To limit discomfort issues, different
recommendations were provided in the literature. In [83] an upper limit of 70 arcmin is recommended for retinal
disparity; in ITU-R Recommendation BT.1438 [88], 0.3 diopter, corresponding to 60 arcmin, is suggested. This value
was also supported in [15], since up to this limit sharp binocular vision is preserved, and blurred images are
expected to be a first step before seeing double and suffering from discomfort. Another rule of 0.2 diopter was also proposed
in [86, 89], considering that discomfort is clearly perceivable outside of the 60 arcmin (0.3 diopter) area. In parallel,
in professional shooting, the 1/30th rule of thumb for 3D production is usually used. It states that the inter-camera
distance should be 1/30th of the distance from the camera to the first foreground object [90]. However, this method is
empirical since it only roughly considers camera and display configuration.
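The relation between a limit expressed in diopters and a limit expressed in arcmin, as well as the 1/30th rule, can be sketched as follows. The sketch assumes simple symmetric viewing geometry and an interpupillary distance of 65 mm; both are assumptions made here for illustration, and the exact arcmin value obtained for a given diopter offset depends on the interpupillary distance that is chosen. All function names are hypothetical.

```python
import math


def vergence_angle_deg(distance_m, ipd_m=0.065):
    """Angle between the two lines of sight when fixating a point at distance_m."""
    return math.degrees(2.0 * math.atan(ipd_m / (2.0 * distance_m)))


def disparity_arcmin(screen_m, object_m, ipd_m=0.065):
    """Angular disparity (arcmin) of an object relative to the screen plane.

    Positive values correspond to objects in front of the screen.
    """
    delta_deg = vergence_angle_deg(object_m, ipd_m) - vergence_angle_deg(screen_m, ipd_m)
    return 60.0 * delta_deg


def diopter_offset(screen_m, object_m):
    """Mismatch between vergence distance and screen distance in diopters (1/m)."""
    return 1.0 / object_m - 1.0 / screen_m


def max_baseline_m(nearest_object_m):
    """Inter-camera distance suggested by the 1/30th rule of thumb."""
    return nearest_object_m / 30.0


# Hypothetical example: screen at 2 m, object rendered 0.3 diopter closer (1.25 m).
screen, obj = 2.0, 1.25
print(diopter_offset(screen, obj), disparity_arcmin(screen, obj))
print(max_baseline_m(3.0))  # nearest foreground object 3 m from the camera
```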
2.5 Technical implementation
After presenting a review of depth perception, this section will address how 3D contents are captured and rendered
on a stereoscopic display. This will show the limit of 3D rendering technologies and will be put into relation with
the perceptual factors presented in the last section. This section will show that both capture and rendering are closely
related and should both be taken into account to ensure an appropriate depth effect.
2.5.1 Capture
When considering the case of shooting a stereoscopic sequence (which can be extended to N views), two choices for
the setup of the cameras are possible to achieve high depth quality: adjust the optical axes to be parallel, or converging
to the object under focus (see Figure 2.30). Parallel camera axes during shooting require setting the convergence dur-
ing post-production, which can be time consuming. Having the camera axes converge during shooting requires time
during production, but less time in post-production. However it can create keystoning [91] issues which need to be
corrected in post-production. Both approaches are equally valid and used in the context of stereoscopic 3D production.
Figure 2.30: Different types of camera configuration
In the following, the case of converging cameras is considered; similar properties can be found in the case
of a parallel camera setup. Figure 2.31 depicts a configuration where two cameras record one object. Following optical
geometry, the image of the object is projected on the sensor of each camera at a different position. The
difference in the position of the projected image on each sensor depends on several factors: the inter-camera distance,
the position of the object relative to the cameras, and the focal length. Additionally, to convert the distance in meters
between the projected images on the sensors to a difference in pixels, two other parameters need to be considered: the sensor
size and resolution. In this figure, one particular issue can be observed: the discretization of the depth into the limited
number of pixels of the sensor. Depending on the setup of the two cameras and the focal length of the lenses,
a position in depth results in a given difference of position of the projected image on the cameras’ sensors. However,
since the camera sensor has a limited resolution, the continuous image projected on the sensor by the lenses is
discretized, and so is the difference of position that encodes depth. Therefore, due to the
limited resolution, there will always be a point where a difference of depth cannot be recorded because the difference
of position of the projected image on the sensor is too small.
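The discretization issue can be made concrete with a short sketch. It assumes the simpler parallel camera setup and a pinhole model, in which the disparity on the sensor is f·b/z; the camera parameters used in the example are hypothetical and do not correspond to any setup used in this thesis.

```python
import math


def disparity_pixels(z_m, baseline_m, focal_m, sensor_width_m, h_resolution):
    """Disparity in pixels of a point at depth z_m (parallel pinhole cameras)."""
    pixel_pitch_m = sensor_width_m / h_resolution
    disparity_m = focal_m * baseline_m / z_m  # disparity on the sensor, in meters
    return disparity_m / pixel_pitch_m


def depth_step_m(z_m, baseline_m, focal_m, sensor_width_m, h_resolution):
    """Depth increase behind z_m needed to reduce the disparity by one pixel."""
    d = disparity_pixels(z_m, baseline_m, focal_m, sensor_width_m, h_resolution)
    if d <= 1.0:
        return math.inf  # everything farther away falls within the last pixel
    k = d * z_m  # since d = k / z, with k = f * b / pixel pitch
    return k / (d - 1.0) - z_m


# Hypothetical setup: 65 mm baseline, 35 mm lens, 36 mm wide sensor, 1920 columns.
setup = dict(baseline_m=0.065, focal_m=0.035, sensor_width_m=0.036, h_resolution=1920)
for z in (2.0, 20.0, 200.0):
    print(z, disparity_pixels(z, **setup), depth_step_m(z, **setup))
```

The depth step grows quickly with distance: beyond a certain distance the whole remaining depth range maps to less than one pixel of disparity and can no longer be recorded.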
In addition, a major issue about the quality of the depth is the choice of camera settings with respect to the display
settings. Due to optical properties, the choice of a specific focal length of a camera to shoot an object at a specific
distance has an effect on the lines inside the picture and affects its geometry. This phenomenon is illustrated in Figure
2.32, where the same scene is recorded with different focal lengths. The process of rendering a 3D image is the result
of capturing and rendering the scene. Hence it is the result of three successive conversions: World in the camera’s
space ⇒ pixels on the camera sensor ⇒ pixels on the display ⇒ presentation in 3D in the user’s space.
Figure 2.31: Conversions of depth recorded by the cameras to pixel differences and link with sensor resolution
Figure 2.32: Relation between object geometry and focal length (images from Micaël Reynaud)
The relation between individual capture settings and the resulting rendered depth on a 3D display was studied by
Woods et al. [91]. Figure 2.33 depicts how changing one parameter while keeping the other parameters
constant affects the geometry of the 3D rendering. What has been captured by the camera is a rectilinear grid, which is
significantly distorted by the end-to-end system.
Figure 2.33: Relation between camera settings and 3D rendering (Figure redrawn from [91])
Figure 2.34 depicts how the depth is rendered as a function of the distance from the camera when the parameters
regarding the rendering are fixed. Ideally, a linear relationship between the depth in the camera space and the depth in
the display space should exist. Due to the previously mentioned limitations, this is not the case, and it can be seen that
the distortion is minimal only within a limited area of the camera space. The relevant
part of the movie or image should then be located in this particular area to ensure a good depth quality rendering.
As described in [91, 24], the depth in the visualization space can be expressed by Equation (2.11), with:
•V: the viewing distance, i.e. the position of the viewer in front of the display
•B: the interpupillary distance
•z: the position of an object in depth in the camera space
•M: the magnification factor, i.e. the ratio between the size of the camera sensor and the display size
•f: the focal length of the camera
•b: the inter-camera distance
•dcov: the distance from the camera to the convergence point, i.e. the zero depth plane
•Z: the depth in the visualization space
\[ Z = \frac{V \cdot B \cdot z}{B \cdot z + M \cdot f \cdot b \left(1 - \frac{z}{d_{cov}}\right)} \tag{2.11} \]
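Equation (2.11) can be evaluated numerically to see the non-linear mapping between camera space and visualization space. In the sketch below, b is taken to be the inter-camera distance and M the display width divided by the sensor width; both readings, as well as all numerical values, are assumptions made for illustration.

```python
def displayed_depth(z, V, B, M, f, b, d_cov):
    """Depth Z in the visualization space for an object at depth z (Equation 2.11)."""
    return (V * B * z) / (B * z + M * f * b * (1.0 - z / d_cov))


# Hypothetical configuration (all lengths in meters).
params = dict(
    V=3.0,      # viewing distance
    B=0.065,    # interpupillary distance
    M=50.0,     # magnification, assumed display width / sensor width
    f=0.035,    # focal length
    b=0.065,    # inter-camera distance (assumed meaning of b)
    d_cov=3.0,  # convergence distance
)

for z in (1.5, 3.0, 6.0):
    print(z, displayed_depth(z, **params))
```

An object located at the convergence distance is rendered on the screen plane (Z = V); nearer objects appear in front of the screen and farther objects behind it, with a spacing that is not proportional to the spacing in the camera space.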
Figure 2.34: Relationship between camera and display space as a function of camera focal length
2.5.2 Rendering
The rendering capabilities are also of high importance to ensure a high-quality depth rendering. In the previous section,
Figure 2.34 was used to illustrate the effect of camera settings on the quality of the depth rendering. However, one
important aspect was only briefly mentioned: these distortions also depend on the resolution of the display, the viewing
distance and the interpupillary distance. These parameters and their effect on the 3D rendering were studied by Woods et al.
[91], and the effect of these different settings on the depth rendering is illustrated in Figure 2.35. Moreover, displays
have a limited resolution, which results in a quantization of the 3D rendered depth into 3D pixels called voxels. Figure
2.36 illustrates this quantization of the depth on a display.
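The quantization of the rendered depth can be sketched from the viewing geometry: a screen parallax of p pixels is perceived at distance Z = V·e / (e − p·pitch), where V is the viewing distance, e the interpupillary distance and pitch the pixel pitch. The display values below are hypothetical and only roughly correspond to a 23-inch Full HD panel.

```python
import math


def depth_for_pixel_parallax(p_pixels, viewing_m, ipd_m, pixel_pitch_m):
    """Perceived distance for a screen parallax of p_pixels.

    Positive parallax is uncrossed (behind the screen), negative is crossed.
    """
    parallax_m = p_pixels * pixel_pitch_m
    if parallax_m >= ipd_m:
        return math.inf  # lines of sight are parallel or diverging
    return viewing_m * ipd_m / (ipd_m - parallax_m)


# Hypothetical display: 0.265 mm pixel pitch, viewed from 0.86 m, 65 mm eye distance.
levels = [depth_for_pixel_parallax(p, 0.86, 0.065, 0.000265) for p in range(-3, 4)]
for p, z in zip(range(-3, 4), levels):
    print(p, round(z, 4))
```

Only these discrete depth planes can be rendered, and the spacing between consecutive planes increases with the distance from the viewer.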
2.5.3 Transmission
In between capture and rendering, one important aspect is the transmission of the 3D material. This includes encoding
and transmitting the bitstream over an IP network. Due to the amount of data contained in a 3D video and the limited
bandwidth of networks, it is necessary to encode the 3D videos. Different alternatives are available depending on the
needs in terms of the number of views and backward compatibility. The current encoding strategies,
which will be explained in more detail in the following, are:
•Frame packing and a 2D-video encoding
•Simulcast
•Multi view coding (MVC)
•Depth-based video encoding
The first approach, frame packing and 2D-video encoding, was chosen for early-days IPTV broadcast since it ensures
compatibility with the legacy transmission chain. In this particular case, the two stereoscopic views are packed into a
single frame. Different ways to perform the packing are possible: the two views can be placed one over the other,
next to each other, or interlaced (see Figure 2.37). The frame packing can also be done jointly with a downsampling of
the video resolution in the direction of the packing in order to keep the same frame size as one of the individual views,
but at the cost of image sharpness. In return, this makes it possible to bring the 3D video to the end users without
additional costs in terms of bandwidth. Once packed, the videos are then encoded using a traditional 2D video encoder
such as H.264.
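The packing step itself can be sketched on small frames represented as lists of rows. The sketch below simply drops every second column or row; this is a simplification made here for readability, since a real processing chain would apply an anti-aliasing filter before downsampling. The function names are hypothetical.

```python
def pack_side_by_side(left, right):
    """Halve the horizontal resolution of both views and place them next to each other."""
    return [l_row[::2] + r_row[::2] for l_row, r_row in zip(left, right)]


def pack_top_bottom(left, right):
    """Halve the vertical resolution of both views and place one over the other."""
    return left[::2] + right[::2]


# Two hypothetical 4x4 views whose pixels are labeled by view, row and column.
left = [[f"L{r}{c}" for c in range(4)] for r in range(4)]
right = [[f"R{r}{c}" for c in range(4)] for r in range(4)]

sbs = pack_side_by_side(left, right)
tab = pack_top_bottom(left, right)
print(sbs[0])  # first row of the side-by-side frame
print([row[0] for row in tab])  # first column of the top-bottom frame
```

In both cases the packed frame has the same size as one original view, which is what keeps the bandwidth requirement close to that of a 2D service.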
A second approach is called “simulcast”. It consists of encoding both stereoscopic views separately using a 2D video
encoder. The two bitstreams are transmitted separately, resulting in a doubling of the required bitrate.
The third approach was specifically designed for 3D videos and uses the redundancy between the views. The multi-view
coding is composed of one stream for one view, and the other view is encoded relative to the first view. During
the encoding, the encoder can choose to encode the P- and B-frames relative to the previous I- or P-frames or use a
frame from the other view as reference (Figure 2.38). With such an approach, the compatibility with legacy transmission
chains is maintained. The information from the second stereoscopic view is dropped by the decoders which do
not handle the MVC bitstream, which makes it possible to decode the sequence as 2D video. Such an approach usually saves 20%
of the overall bitrate compared to simulcast. MVC is currently used in the industry in 3D Blu-rays.
Figure 2.35: Relation between visualization settings and 3D rendering (Figure redrawn from [91])
Figure 2.36: Depth rendering and 3D voxels (Figure redrawn from [92]). e is the inter-pupillary distance, d is the
viewing distance.
Figure 2.37: Different methods for packing two views into a single frame.
Figure 2.38: Relationship between frames in an MVC bitstream.
A fourth alternative is depth-based video coding. These methods enable much higher coding performance
than the other presented alternatives. They require one view, the “texture”, a depth map, and information about the
camera setup. This information is transmitted to the client, which “synthesizes” the missing view. There are different
sub-categories within this category. Video plus depth (V+D) uses one view and a depth map to synthesize a
missing stereoscopic view (Figure 2.39). However, since the newly reconstructed frame has a different point of view
than the existing frame, it contains areas which were occluded and need to be filled. Filling these holes can be challenging,
which is why alternative depth-based coding methods have been proposed. Multi-view plus depth (MVD)
uses several texture maps which come from different cameras having different points of view.
These textures, in addition to one or more depth maps, can then be used to solve the problem of filling holes (“inpainting”).
To further address the limitations of inpainting performance, an alternative format is the layered depth video (LDV), which,
similarly to MVD, uses different texture maps. Redundancy is reduced by specifically storing the background
which was hidden by the foreground object (Figure 2.40).
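The origin of the holes in V+D synthesis can be illustrated on a single image row. The sketch below assumes that the depth map has already been converted to an integer disparity per pixel (larger means closer) and that the virtual view shifts pixels to the left; the hole filling is a deliberately naive stand-in for real inpainting. All names and values are hypothetical.

```python
HOLE = None


def synthesize_row(texture, disparity):
    """Forward-warp one row: each pixel moves left by its disparity.

    When two pixels land on the same position, the closer one (larger
    disparity) wins, which models occlusion.
    """
    out = [HOLE] * len(texture)
    out_disp = [-1] * len(texture)
    for x, (value, d) in enumerate(zip(texture, disparity)):
        target = x - d
        if 0 <= target < len(out) and d > out_disp[target]:
            out[target] = value
            out_disp[target] = d
    return out


def fill_holes(row):
    """Naive inpainting: copy the nearest valid pixel found to the right of each hole."""
    filled = list(row)
    last_valid = HOLE
    for x in range(len(filled) - 1, -1, -1):
        if filled[x] is HOLE:
            filled[x] = last_valid
        else:
            last_valid = filled[x]
    return filled


# A foreground object (F2, F3) in front of a background (b0 .. b7).
texture = ["b0", "b1", "F2", "F3", "b4", "b5", "b6", "b7"]
disparity = [0, 0, 2, 2, 0, 0, 0, 0]

warped = synthesize_row(texture, disparity)
print(warped)
print(fill_holes(warped))
```

The two positions uncovered by the foreground object have no source pixel in the transmitted view; MVD and LDV address this by transmitting additional texture that contains the hidden background.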
2.6 Conclusion
This chapter presented the background of this thesis. It showed different alternatives from the literature on how 3D
quality of experience can be evaluated based on different evaluation concepts. The relations between these concepts
were shown across different models of QoE. In the work conducted on 3D QoE, a number of studies have been
performed on the evaluation of 3D image quality, visual discomfort, and depth perception. But the last one, depth from
different depth cues and its relation with QoE, received less attention than what was presented from the literature in this
chapter. This chapter has presented an extensive review of depth perception based on binocular and also monocular
cues. The monocular cues were until now less considered in 3D QoE studies, which is why work from this area
has been included in more detail in this thesis. Since our tests have been performed using 3D displays, it was also
necessary to present the effect of technical factors such as capture and rendering settings on the perception of depth.
Figure 2.39: Video plus depth (V+D) coding. Figure copied from [93]
Figure 2.40: Layered depth video (LDV) coding. Figure copied from [93]
The next three chapters of this thesis will describe the contribution of this thesis: to study the relationship between
QoE and depth from different perspectives. Depth quantity will be studied specifically, as well as the question of how to
evaluate the contribution of different depth cues, using both subjective tests and algorithms for depth cue prediction.
Chapter 3
Evaluating 3D added value
Chapter 2.1 described the different notions of Quality of Experience (QoE), perceived depth, and image quality. This
chapter targets the evaluation of the added value of 3D video as compared to 2D in case of different transmission sce-
narios. The goal is to show how the depth, the image quality and QoE relate to each other. As explained in the previous
chapter on the state of the art, measuring the differences between 2D and 3D in terms of QoE can be challenging and
depends on the context. Hence, this chapter describes how different evaluation methods have been considered
in this thesis, and their limitations are analyzed based on the obtained results. Finally, to reveal the
differences in terms of QoE between 2D and 3D, another evaluation paradigm will be used. Using this other evaluation
method, it will be possible to relate the added value of 3D to the depth effect, revealing the need to characterize the
properties of the source material. The structure of this chapter, together with the results of the individual studies, is
illustrated in Figure 3.1, and Table 3.1 provides a list of the experiments described in this chapter.
Figure 3.1: Structure of the studies described in the chapter.
Section | What is evaluated | Type of degradations | Methodology | Published in
3.1.1 | 3D QoE and visual comfort | Video encoded with different coding schemes, bitrate, and transmission impairments | ACR | [3]
3.1.4 | 3D QoE and 2D video quality prediction model accuracy on predicting 3D quality | Video encoded at different bitrate | SAMVIQ | [94]
3.2 | 3D QoE and preference of 3D over 2D | 2D and 3D videos encoded at different bitrate | Pairwise comparison | [95]
Table 3.1: List of experiments conducted to address the evaluation of 3D QoE.
3.1 Differences and similarities between 2D and 3D QoE for streamed videos
The first two studies which will be presented target the case of an IPTV service. The main goal is to evaluate the QoE
of a user faced with the proposed video streaming service. In such a case, different aspects are to be considered: the
contents, how contents are encoded and how transmission errors and their concealment affect the QoE. To this end,
two experiments were conducted with different contents, coding algorithms and under different network conditions. In
every case, test participants were involved and had to report QoE scores for each stimulus providing insight into their
experience of the service.
3.1.1 Subjective evaluation of 2D and 3D QoE
In this first experiment, which involved test participants, the QoE scores of 2D and 3D videos were compared.
The research questions were the following:
Listing 3.1: Research questions
1. Compare 2D and 3D QoE on the reference video, where no degradation was applied.
2. Compare how 2D and 3D video sequences are rated when encoded in the same manner, and
attempt to measure the added value of 3D compared to 2D video sequences.
3. Compare different coding schemes: Simulcast, MVC, and Side by Side representation
with H.264 encoding.
4. Compare 2D and 3D QoE in case of an error-prone transmission chain impacted by packet
losses. Such errors result in “slicing distortion”.
5. Study visual comfort and its link with QoE ratings.
3.1.1.1 Source material
Seven contents have been used as source material (SRCs). All of them were 10s long full-HD progressive sources of
25 frames/second (1080p25). The contents have different spatial, temporal and depth characteristics as summarized in
Table 3.2.
Content Name | Description
Horse | Horse standing in a field, scene change, car approaching, scene change, the horse starts to walk. This content has complex texture and a slow pan motion.
Car Race Prep. | Preparation of a race; several scene changes, colorful, high spatial complexity, slow motion.
Car Race | Scene with cars racing; several scene changes, high motion and large depth range.
Piano | Man playing the piano; slow pan motion, low spatial complexity.
Ski | Skier skiing; low on texture, high motion, large depth range.
SkullRock | 3D generated sequence, low spatial complexity, low motion, and high depth range.
Boxe | Two men boxing. There is only the boxers’ movement.
Table 3.2: Characteristics of the source sequences used in the study.
3.1.1.2 Processing of test sequences
To generate all the Processed Video Signals (PVS), several Hypothetical Reference Circuits (HRCs) were considered.
These HRCs can be divided into two distinct groups: coding-only and coding under transmission errors. The general
processing procedure according to the first part of the HRCs is described in Figures 3.2 - 3.4, where encoding is done
according to one of three different coding schemes:
1. Simulcast (Figure 3.2): The two views are encoded independently using an H.264 encoder (x264 [96]).
2. MVC (Figure 3.3): The two views are encoded exploiting the redundancy between views, here using JMVC 8.2.
3. Side by Side and H.264 (Figure 3.4): The two views are each downscaled and encapsulated in an HD frame, then
encoded using H.264 (x264).
Figure 3.2: Processing chain simulcast.
Figure 3.3: Processing chain MVC.
Figure 3.4: Processing chain Side by side.
Figure 3.5: Simulation of transmission errors
HRC | Coding Scheme | QP | Packet loss rate [%]
1 | Simulcast | - | -
2 | Simulcast | 26 | 0.0
3 | Simulcast | 26 | 0.4
4 | Simulcast | 26 | 0.9
5 | Simulcast | 32 | 0.0
6 | Simulcast | 38 | 0.0
7 | Simulcast | 38 | 0.4
8 | Simulcast | 38 | 0.9
9 | Simulcast | 40 | 0.0
10 | MVC | 26 | 0.0
11 | MVC | 32 | 0.0
12 | MVC | 38 | 0.0
13 | MVC | 40 | 0.0
14 | Frame Packing (SbS) | 26 | 0.0
15 | Frame Packing (SbS) | 32 | 0.0
16 | Frame Packing (SbS) | 38 | 0.0
17 | Frame Packing (SbS) | 40 | 0.0
18 | 2D | - | -
19 | 2D | 26 | 0.4
20 | 2D | 26 | 0.9
21 | 2D | 38 | 0.0
22 | 2D | 38 | 0.4
23 | 2D | 38 | 0.9
Table 3.3: Hypothetical Reference Circuits (HRCs)
For each coding scheme, different values of the Quantization Parameter (QP) have been chosen. Defining the QP instead
of the bitrate makes it possible to reach a more constant quality over all SRCs and thus avoids having contents with always low
quality (because the maximum bitrate is too low to achieve high quality) or contents with always high quality
(because the selected range of bitrates always leads to high quality). Details on the different HRCs are listed in Table
3.3.
The second part of the HRCs covers different conditions of transmission errors. The process of simulating the
transmission errors is depicted in Figure 3.5. Each view of the SRC is encoded independently using an H.264 encoder
(x264). The frames were decomposed into 68 slices, which corresponds to one slice per macroblock line. The GOP
structure was (M,N) with M=3 and keyframe rate N=1/s. The software “sirannon” [97] was used to encapsulate the
bitstream into MPEG2-TS packets, and the resulting TS packets into RTP packets. The software tcpdump was then used
to capture the RTP packets and save them in a packet capture file (“.pcap”). The simulation of the lossy channel was
performed as follows: a random number generator indicated the numbers of the packets which should be
dropped. This random number generator followed a uniform law, and no content-dependent distinction between RTP
packets was made (e.g. in terms of whether an I-, P- or B-frame was hit by the loss). We only took care that the first
I-frame of the PVS was not affected by packet loss. Finally, we used a decoder implemented by Deutsche Telekom
Laboratories to decode the video. This decoder (used by ITU-T SG.12 in the context of the P.NAMS and P.NBAMS
standardization competitions, now ITU-T Recommendations P.1201 and P.1202) has the particularity of being able to take
pcap files as input. It implements an intra-error concealment algorithm. The 2D sequences have been realized
by presenting the same (left) view to the two eyes.
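The selection of the packets to drop can be sketched as follows. This is only an illustration of a uniform, content-independent loss pattern that protects the first I-frame; it is not the tool used in the experiment. In particular, drawing a fixed number of packets without replacement, the packet indices and the function names are assumptions made here.

```python
import random


def choose_dropped_packets(n_packets, loss_rate, protected, seed=None):
    """Pick packet indices to drop uniformly at random.

    `protected` holds the indices that must never be dropped, e.g. the
    packets carrying the first I-frame.
    """
    rng = random.Random(seed)
    candidates = [i for i in range(n_packets) if i not in protected]
    n_drop = min(round(loss_rate * n_packets), len(candidates))
    return sorted(rng.sample(candidates, n_drop))


def apply_loss(packets, dropped):
    """Return the packet sequence with the dropped indices removed."""
    dropped = set(dropped)
    return [p for i, p in enumerate(packets) if i not in dropped]


# Hypothetical stream of 1000 RTP packets, the first 20 carrying the first I-frame.
packets = list(range(1000))
dropped = choose_dropped_packets(len(packets), 0.004, protected=set(range(20)), seed=1)
received = apply_loss(packets, dropped)
print(dropped, len(received))
```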
3.1.1.3 Subjective experiment
For the subjective experiment, the laboratory test environment was set up as defined in ITU-R BT.500-12 [98]. A 23”
Alienware OptX 3D Full HD display was used. This display has a native resolution of 1920x1080 pixels and a refresh
rate of 120 Hz. The display was used in combination with NVidia 3D Vision shutter glasses. The viewing distance was
set to three times the picture height (3H). The maximum values of crossed and uncrossed disparities were checked on
every SRC to ensure that the disparity values stay in the comfortable viewing zone. A motion estimation-based algorithm
was used to estimate the stereo disparities [99]; this will be further detailed in Chapter 5. The luminance of the background
was set to 50 cd/m².
The test methodology was Absolute Category Rating with hidden reference (ACR-HR). 21 observers took part in the
test, and were asked to rate the general Quality of Experience and the visual discomfort, each on a five-grade discrete
scale with the typical labels “Excellent”, “Good”, “Fair”, “Poor” and “Bad”. Only after rating the PVS on these
two scales were the observers allowed to watch the next PVS. After screening using the methodology described
in the VQEG 3DTV Test Plan, one observer was rejected.
The general procedure of a test was as follows: the test started with a training session composed of seven sequences.
This training was designed to illustrate the rating task and to introduce the ranges of contents and of quality. In the
main session, the observer rated the 161 sequences in two sessions, one of 81 and one of 80 PVSs (with a 15 min
break between the two parts). The whole test (including a vision test and break) took 1.25 h.
3.1.1.4 Relation between coding and 3D QoE
The first objective of our test was to compare the Quality of Experience and consequently the bit rate requirements of
video encoded with different coding schemes. Here, the Side by Side (SbS) representation currently used for 3D IPTV
broadcasting was to be compared with other available algorithms (simulcast and MVC). Figures 3.6 and 3.7 depict the
mean quality rating per content and per coding scheme as a function of the logarithm of the bit rate. Also shown are
the 95% confidence intervals (CIs). As can be seen from the graphs, the CIs are rather wide. A MANOVA (Multivariate
ANalysis Of VAriance) was used to explain quality with the fixed factors QP, coding scheme and content. No interaction
terms were considered, which left 147 degrees of freedom. This analysis reveals that there
is a significant impact of the coding scheme (F=10.673, p < 0.01), a significant impact of the content (F=3.153, p <
0.01), and a significant impact of the QP (F=27.33, p < 0.01). A non-parametric Kruskal-Wallis test applied to explain
the MOS values as a function of the coding scheme shows a significant effect of the coding scheme on the MOS
values (chi-squared = 99.032, df = 3, p-value < 2.2e-16). Following up on these results, a post hoc test relating the coding
scheme and the MOS values shows that SbS is significantly different from the 2D conditions, and similarly MVC is
significantly different from the 2D conditions. However, the 3D conditions were not found to be statistically different from
each other.
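The mean opinion score and its 95% confidence interval for one PVS can be computed as sketched below. The ratings are hypothetical, and the sketch uses the normal approximation (factor 1.96); with about 20 observers a Student t quantile would give a slightly wider interval. The exact procedure used for the figures is not restated here.

```python
import math
import statistics


def mos_with_ci(scores, z=1.96):
    """Return the mean opinion score and the half-width of its confidence interval."""
    mean = statistics.mean(scores)
    half_width = z * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Hypothetical ratings of one PVS by 20 observers on the five-grade scale.
ratings = [5, 4, 4, 3, 4, 5, 4, 3, 4, 4, 5, 4, 3, 4, 4, 5, 4, 4, 3, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

Two conditions whose intervals overlap largely cannot be separated on the basis of these ratings alone, which is consistent with the non-significant differences between the 3D coding schemes reported above.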
From Figure 3.8 we can observe that at a given bitrate level, in most cases SbS provides a higher perceived quality
than Simulcast and MVC. However, as shown by the previous post hoc analysis, the difference was not found to
be significant. We did not observe a strong gain in bit rate for the MVC coding scheme compared to Simulcast.
However, an advantage of MVC which is not taken into account in this study is its backward compatibility (i.e.
an MVC bitstream can be decoded by an H.264/AVC-compatible decoder simply by dropping the data it does not
understand, which relates to the other views), which is an important feature for 3DTV broadcasting. For evaluating the
difference in required bitrate between coding schemes, the approach proposed by Wang et al. [100] was used: for every
PVS in SbS representation, the bitrate required to achieve the same perceptual quality using Simulcast
or MVC is determined. The estimation of the equivalent bitrate is done using a linear regression between known values in
the log(bitrate) vs. Mean Opinion Score (MOS) space. Then, the ratio of the required bitrate for SbS divided by the bitrate
required for Simulcast or MVC is calculated. This provides a measure of the relative bitrate gain for each coding
scheme. Table 3.4 provides the results in terms of equivalent bitrate. On average, a 50% gain in bitrate can be reached
using the SbS representation, without reducing perceptual quality. These results are in accordance with previous tests
from the literature [100, 10]. We can also see that MVC did not provide a significant quality improvement in our
experiment, and that the results were highly content-dependent. These test results seem to indicate that for a given
bitrate the current implementation of 3D HDTV broadcast services achieves a higher quality than the other available
standards (using full resolution), when a specific limited value of bit rate is required. We can also observe that in
most cases the quality level that can be achieved with SbS is almost as high as the quality achieved with the simulcast
reference. The most likely reason for this result is that it is difficult for the observer to differentiate between high
quality contents when the contents are presented sequentially, as is done in a single stimulus test.
Figure 3.6: Quality per content and per coding scheme (first part of SRCs).
Figure 3.7: Quality per content and per coding scheme (second part of SRCs).
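The equivalent bitrate estimation described in the text can be sketched as follows: a line is fitted in the log(bitrate) vs. MOS space for the reference coding scheme, and then inverted at the MOS reached by the SbS condition. The data points are hypothetical, and expressing the gain as one minus the bitrate ratio is an assumption made here about how the percentages are derived from that ratio.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def equivalent_bitrate(bitrates, mos, target_mos):
    """Bitrate at which the fitted MOS vs. log10(bitrate) line reaches target_mos."""
    slope, intercept = fit_line([math.log10(b) for b in bitrates], mos)
    return 10.0 ** ((target_mos - intercept) / slope)


def bitrate_gain(sbs_bitrate, ref_bitrates, ref_mos, sbs_mos):
    """1 - (SbS bitrate / equivalent reference bitrate) at the SbS quality level."""
    return 1.0 - sbs_bitrate / equivalent_bitrate(ref_bitrates, ref_mos, sbs_mos)


# Hypothetical simulcast rate-quality points (kbit/s, MOS) for one content.
simulcast_kbps = [1000.0, 2000.0, 4000.0, 8000.0]
simulcast_mos = [2.0, 3.0, 4.0, 5.0]

# Hypothetical SbS condition reaching MOS 4.0 at 2000 kbit/s.
gain = bitrate_gain(2000.0, simulcast_kbps, simulcast_mos, sbs_mos=4.0)
print(f"{gain:.0%}")
```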
Figure 3.8: QoE rating as a function of the HRC. The notation 2D QP26 PL0.4 indicates a 2D condition encoded
with a quantization parameter of 26 and with a packet loss rate of 0.4%.
SRC | MOS | Gain compared to Simulcast | Gain compared to MVC
1 | 3.94 | 31% | 21%
2 | 4.00 | 31% | 28%
3 | 4.12 | 40% | 56%
4 | 4.06 | 57% | 57%
5 | 3.82 | 55% | 48%
6 | 3.65 | 52% | 58%
7 | 3.59 | 55% | 55%
Table 3.4: Gain in bitrate of using the SbS representation compared to Simulcast or MVC for a fixed quality level
3.1.1.5 Relation between 3D QoE scores and Visual comfort scores
During the test, test participants were asked to report on visual comfort for each test condition. Figure 3.9 depicts
the relationship between the two scales: QoE and comfort. It can be seen that the test participants used both scales very
similarly, with a Pearson correlation of 0.88 between them. This very high correlation between the scales can be
explained by the fact that test participants must have associated the term visual comfort with the annoyance they
felt with the coding and packet loss impairments, and have then related the two scales. Indeed, it was not expected that
visual comfort scores would become as low, since the test set-up was designed such that the viewing distance
enabled a comfortable viewing experience. Only transmission impairments were expected to induce visual discomfort.
A linear regression between the two factors was performed according to Equation (3.1), and yielded values of
0.77 for α and 0.70 for β. The value of β, higher than 0, shows that QoE could be rated lower than comfort; therefore,
in the term QoE test participants took more factors into account than the visual comfort. This most likely includes the
pictorial quality. The slope α, lower than 1, shows that test participants could report video sequences providing a high
quality of experience even though the visual comfort was not optimal. Therefore, they may have taken other
factors into account such as the high quality of the picture and perceived depth.
\[ \mathrm{Comfort} = \alpha \cdot \mathrm{QoE} + \beta \tag{3.1} \]
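The analysis behind Equation (3.1) can be sketched with a plain least-squares fit and a Pearson correlation. The paired scores below are hypothetical and only illustrate the computation; the last function evaluates Equation (3.1) with the coefficients reported in the text (α = 0.77, β = 0.70).

```python
import math


def linear_fit(xs, ys):
    """Least-squares estimates (alpha, beta) of ys = alpha * xs + beta."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    return alpha, mean_y - alpha * mean_x


def pearson(xs, ys):
    """Pearson correlation coefficient between two score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def predicted_comfort(qoe, alpha=0.77, beta=0.70):
    """Equation (3.1) with the coefficients reported in the text."""
    return alpha * qoe + beta


# Hypothetical mean QoE and comfort scores of five conditions.
qoe_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
comfort_scores = [1.5, 2.2, 3.0, 3.8, 4.5]

alpha, beta = linear_fit(qoe_scores, comfort_scores)
print(alpha, beta, pearson(qoe_scores, comfort_scores))
print(predicted_comfort(1.0), predicted_comfort(5.0))
```

With the reported coefficients, the fitted comfort score is above the QoE score at the low end of the scale and below it at the high end; the two are equal at QoE = β / (1 − α), i.e. close to 3.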
3.1.1.6 Study of 3D QoE in case of a lossy transmission chain
Another important aspect of the experiment was to evaluate the effect of packet loss on the perceived Quality of
Experience of 3D videos. The goals were to evaluate how common 2D error concealment strategies perform in the case of
3D video, and to compare the quality of 2D and 3D video under packet loss. The evaluation of 3D vs 2D is particularly
interesting, since in the 3D case two contradicting factors are involved:
Figure 3.9: Relationship between QoE scores and visual discomfort ratings. The fit between the two factors,
Comfort and QoE rating, is shown in red.
1. The binocular suppression theory, which says that if one of the two stereoscopic views has distortions, then the
resulting quality can be high, since the quality may mainly depend on the best of the two views, or at least on the
average of the quality levels related to each individual view.
2. Binocular rivalry (when one of the two eyes perceives strong artefacts), which induces visual discomfort and
affects the general quality of experience.
Figure 3.10 depicts, for every SRC, the Quality of Experience as a function of the error rate. No significant difference
could be found between the 2D and 3D conditions; participants rated the quality of all these conditions low. A
hypothetical reason is that, in the experiment, test participants rated within the same test videos with only coding
distortions and videos with transmission impairments. Since transmission impairments produced much stronger
distortions than coding, this may have compressed the scale and led test participants to rate the video sequences
with transmission impairments more severely than the videos with only coding impairments. This resulted in highly
critical ratings for the PVSs with transmission impairments.
3.1.1.7 Conclusion
In this experiment, the perceived quality of a current implementation of 3DTV broadcasting was evaluated. Its
performance was compared with some of the state-of-the-art algorithms. Among the implementations compared, the
side-by-side representation seems to be the most efficient way to transmit HD stereoscopic 3D videos, with lower
bandwidth requirements than Simulcast and MVC using full resolution. These results are in accordance with recent
results [101]. The relationship between visual comfort scores and QoE scores was also studied. The relation
between these two factors was closer than expected, with a very high correlation (Pearson correlation of 0.88).
Another objective of the experiment was to compare the quality of 2D and 3D videos in case of packet loss. In the test,
no significant quality difference between 2D and 3D at a given packet loss rate was observed.
A further result concerns the comparison of the scores given to the 2D and 3D video material: the 3D reference was
not found to be rated significantly higher than the 2D video material. The results even show cases where 3D is rated
lower than the 2D reference. This may be explained by the context of the experiment: rating videos with different
coding conditions may have focused the attention of the observers on image quality aspects.
Figure 3.10: Quality per content as a function of the percentage of dropped packets
Listing 3.2: Conclusion on the research questions
1. & 2. Evaluating differences between 2D and 3D QoE on the reference video is
challenging. By directly asking test participants to rate 3D QoE, the differences
between 2D and 3D were not observable.
3. For a given bitrate, the Side by Side representation with H.264 encoding appeared to
be the most efficient approach. MVC resulted, as expected, in a saving of about 20%
of the bitrate for the same subjective score when compared to simulcast.
4. In case of transmission errors, no significant difference between the 3D and 2D
ratings was found.
5. The visual discomfort scores were found to be closely correlated to Quality of
Experience ratings (Pearson correlation of 0.88). The relation between the two
factors was found to be Comfort = 0.77 · QoE + 0.70.
3.1.2 Further analysis on subjective ratings
The main issue is to evaluate, in a subjective manner, the complete 3D experience. Directly asking participants to
rate the quality of experience apparently fails to capture all the different dimensions involved in 3D videos. As
explained in the state-of-the-art section, different ways have been proposed to evaluate 3D QoE using alternative
evaluation concepts such as Naturalness or Viewing experience [2, 6, 7]. Another side of this problem is the
reliability of the subjective methodologies. Existing standards such as ITU-R Recommendation BT.1438 [88] specify
methodological settings that target high reliability, but, as listed by Chen [92], several issues that must be
addressed to ensure stable results across laboratories are not covered. In the meantime, standards have been updated
to tackle the question of how 3D QoE should be evaluated and in which environment; this can be found in ITU-R
Recommendation BT.2021 [102]. To investigate these issues, two main analyses were performed in this thesis based on
the previously described subjective experiment.
Listing 3.3: Research questions
1. How do subjective scores compare from one laboratory to another? Hence, how stable is
a subjective test?
2. How do test participants understand and rate on the different scales they are asked
to use?
3.1.2.1 Inter-laboratory comparison
The first analysis considered here is the comparison of our test results with results obtained by other laboratories.
Two other laboratories (L1: ACREO, L2: IRCCyN) conducted experiments under conditions similar to those of our
experiment described in section 3.1.1. The comparison of their experimental results was part of the analysis
presented in [100]. In our experiment, we used different SRCs. Only 9 HRCs were common between the tests
(1, 2, 5, 6, 10, 11, 12, 18, 21). Another difference was that participants saw videos with transmission impairments
in our test, so that the ranges of degradation types and quality were clearly different. In addition, the methodology
used for evaluating visual discomfort differed: in our test we used ACR with a 5-grade scale; the other tests also
used ACR-HR, but the vocabulary used indicated a comparison with 2D viewing (e.g. the 3D presentation was: much more
comfortable than 2D, more comfortable than 2D, as comfortable as 2D...).
Figure 3.11 depicts a direct comparison between the labs' quality test results on identical HRCs. As the SRCs were
different, a direct comparison of the scores is not possible and only an overall MOS per condition can be compared.
Due to this averaging, it is not possible to compare the absolute MOS values corresponding to each HRC; however, it
can be checked whether a linear relationship, rather than a more complex non-linear relationship, exists between the
results of our experiment and results from the literature. With this goal in mind, a linear model was applied to
compare the results
Figure 3.11: Comparison of quality evaluation between laboratories
between the labs. The Pearson correlation is 0.84 between our test and L1, 0.71 with L2 and 0.97 between L1 and L2.
The correlation is high considering the different assumptions and the averaging across the different SRCs.
The number of similar test conditions between the different laboratories is too small to draw strong conclusions.
However, these results point in the same direction as the study performed by Wang et al. [100] and more recently
Barkowsky et al. [103] on inter-laboratory result stability when the evaluation of video sequences with different
coding conditions is considered.
3.1.2.2 Study of scale usage
Another interesting point was found when addressing the visual discomfort evaluation. Figure 3.12 depicts a direct
comparison of the quality and discomfort ratings. From this figure it becomes apparent that observers have answered
differently in case of the visual discomfort scale: there are subjects who have rated quality and discomfort in a similar
fashion and others who did not. To further analyze these variations, the observers were classified into different classes.
It can be stated that this problem was not specific to our test and was also visible in the test results of L1 and L2
described in the state-of-the-art section. It appears that visual discomfort is a difficult concept and not all observers
understand the scale in an identical way.
The different patterns of answers observed lead to the following classification:
Figure 3.12: Different types of answers between observers: each scatter plot represents a type of observer in terms of
their answers. The scatter plots are grouped to show the different patterns of scores.
1. Observers who answered with a clear linear relation between discomfort and the quality scale. These participants
either have considered a direct relation between discomfort and quality, or were simply not able to distinguish
between the two concepts.
2. Observers who completely covered the space with different quality–discomfort rating value pairs for different
HRCs. This group apparently considered quality and discomfort to not necessarily be related, for example, when
the discomfort is mainly due to the content.
3. Observers showing an answer pattern in the shape of a triangular matrix: the value of discomfort is between [1,
CMax] with CMax being a function linearly dependent on the quality. These observers apparently have considered
a relation of implication between comfort and quality: a high discomfort leads to a low-quality video, but the reverse
was not necessarily true: low quality video could be due to degradations that are not related to discomfort.
Figure 3.13 depicts the distribution of the rating correlations across the different participants. Using k-means, the
participants were clustered into two groups based on the Pearson correlation between the two rating scales, quality
and comfort. Although three classes were described above, the Pearson correlation may not be a good indicator to
characterize a triangular-matrix-shaped pattern of replies. For the sake of simplicity, only two classes are therefore
considered in the following analysis. The k-means analysis yielded two clusters with means of 0.327 and 0.742, and
respective within-cluster sums of squares of 0.118 and 0.136. The within-cluster sums of squares being relatively
small indicates that two clusters could be formed according to the Pearson correlation indicator.
Based on these clusters, a non-parametric Wilcoxon rank sum test was applied and shows that the two groups rated
significantly differently on the comfort scale (W=1338600, p=0.025), but not on the quality scale (W=1426700,
p=0.323).
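The clustering step can be sketched as follows. This is a minimal one-dimensional k-means (Lloyd's algorithm) with a deterministic initialization, which is an assumption and not necessarily the implementation used for the analysis above. The per-observer correlation values are invented for illustration, and the rank sum test is not reproduced here.

```python
def kmeans_1d(values, k=2, iters=100):
    """Lloyd's algorithm on scalar values; returns (centers, labels)."""
    srt = sorted(values)
    # Deterministic start: spread the initial centers over the sorted values.
    centers = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels

def within_cluster_ss(values, centers, labels):
    """Within-cluster sum of squares, one value per cluster."""
    ss = [0.0] * len(centers)
    for v, lab in zip(values, labels):
        ss[lab] += (v - centers[lab]) ** 2
    return ss

# Hypothetical quality/comfort Pearson correlation, one value per observer.
r_values = [0.15, 0.30, 0.41, 0.50, 0.65, 0.72, 0.80, 0.90]
centers, labels = kmeans_1d(r_values, k=2)
print(centers, labels, within_cluster_ss(r_values, centers, labels))
```

Clustering on a single scalar per observer keeps the grouping easy to interpret, at the cost of not separating the triangular answer pattern from the other two.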
Figure 3.15 depicts the average ratings of visual discomfort as a function of the HRCs. Since the observers rated
discomfort differently, the following analysis is performed by groups. To create these classes, the correlation
between each observer's quality and visual discomfort ratings was determined. Then, a k-means analysis of these
correlation values was performed and used to divide the observers into those who clearly related visual discomfort
and quality (class with R between 0.63 and 0.92, with an average of 0.74) and those who did not (R between 0.13
and 0.51, with an average of 0.33).
Figure 3.13: Distribution of the Pearson correlation values between the Quality and Comfort scales for the different
participants.
From the two curves, we can see that observers who did not necessarily link quality and discomfort gave more
constant ratings of visual discomfort than the other class of observers. However, some of the 2D sequences were also
rated as uncomfortable by these observers, which is unexpected.
When comparing the results with the L1 and L2 tests, it can be stated that there, too, a high variation of discomfort
ratings could be observed, although discomfort was rated relative to the 2D version. Hence, the tests underline the
difficulty of judging discomfort of 3D video.
Engelke [104] performed a similar study and concluded that test participants do not necessarily understand all the
different scales they have to use in the same manner. It may then be difficult for test participants to rate complex
notions such as Naturalness or Visual experience as proposed by Seuntiëns [2], since these tests show that test
participants already have difficulties understanding evaluation concepts such as Quality of Experience and Visual
discomfort. Therefore, the next section of this thesis will focus on subjective evaluation methods which are simple
for test participants and make it possible to quantify the added value of 3D.
3.1.3 Conclusion
In this subsection, the question of subjective test reliability was addressed. The results show that tests from one
laboratory to another appear consistent when test participants are asked to rate the QoE of videos encoded with
different conditions. However, the results also show that test participants may have difficulties clearly
understanding the scales they have to use, and variation from one test participant to another can be seen. This
shows, as also reported by Engelke [104], that overly high-level evaluation concepts may be difficult for test
participants, and therefore there is a need to design easier experiments for the evaluation of 3D video sequences
when many dimensions are involved.
Listing 3.4: Conclusion on the research questions
1. Test participants use the scales differently, therefore there is a need to simplify
the task of the test participants when lots of dimensions are involved in the
evaluation.
Figure 3.14: Quality ratings as a function of the HRCs for the two classes of observers: those with a low correlation
between quality and discomfort and those with a high correlation between quality and discomfort
Figure 3.15: Visual discomfort as a function of the HRCs for the two classes of observers: those with a low correlation
between quality and discomfort and those with a high correlation between quality and discomfort
3.1.4 Performance of instrumental measurement for 3D QoE prediction
Considering the similarity between the QoE scores obtained from test participants when asked to rate 3D and 2D QoE,
it appears that test participants do not fully take all the factors covered by the concept of 3D QoE into account. Based
on this result, different questions were raised:
Listing 3.5: Research questions
1. If the rating obtained for 3D does not strongly differ from 2D, how do 2D quality
prediction algorithms perform when predicting 3D quality?
2. How do two standardized algorithms perform when applied to 3D content?
3.1.4.1 Selection of the conditions
The idea is to emulate the real signal chain of a 3DTV broadcasting solution and to study how 2D video quality
prediction algorithms perform for the evaluation of 3D QoE. The test design therefore consisted of a live hardware
encoder fed by a hardware playback server for uncompressed playback. The encoder’s output was sent to an IPTV server,
and the signals were finally streamed to a test set-top box. The HDMI output of that set-top box was captured and
recorded on a Mac Pro equipped with a video acquisition interface card. The sequences were then stored using the
Apple ProRes 422 (HQ) codec at a bit rate of around 180 Mbit/s. The setup of the recording is depicted in Figure 3.16.
Figure 3.16: Processing chain for the creation of PVSs
In the next step, the sequences were edited with Final Cut Pro, without changing the format of the recorded
clips, to extract the video sequences selected for evaluation after stabilization of the encoder. The experimental
conditions consisted of the hardware encoder at ten different bit rate values (5, 7.5, 10, 12, 14, 16, 18, 20, 22,
24 Mbps) and a software encoder at one bit rate value (7.5 Mbps) used for comparison. Seven different source signals
were chosen; the sequences had different spatial, temporal and depth complexity. A short description of the sequences
is provided in Table 3.5.
3.1.4.2 Subjective evaluation method
The subjective test methodology SAMVIQ was chosen [105]. This methodology consists in presenting several sets
of video sequences to the observers. In each set, several sequences are presented. These sequences contain the same
source signal but with different processing. The observers can choose a video from the proposed sequences within
the set, watch it and rate it. One of the sequences is clearly identified as the reference, and one is a hidden reference.
The observers can repeatedly watch each sequence and adjust the respective rating. After having watched and rated
Content Name   Description
Bear           Sequence from an animation movie. Complex motion: lots of particles and strong movement. Lots of high-frequency texture. 3D with pop-out effect.
Fans           Soccer fans with many small details. Complex motion: fans are moving, shaking flags.
Horse          Sequence with strong texture and limited motion: horse standing and starting to run.
Interview      Sequence with two persons being interviewed. The background is composed of trees moving in the wind. Limited motion. Some pop-out effect is visible: the arm of one of the persons comes out of the screen.
Match          Football match, lots of high-frequency texture on the grass. Fast motion.
Piano          Sequence with low spatial and temporal complexity. Piano player sitting in front of the piano and standing up.
Sea            Sequence with sea water during a storm. Lots of high-frequency textures. Complex but slow motion.
Table 3.5: 3D video content characteristics
Figure 3.17: Subjective experiment interface used for the
evaluation of the video sequences
Figure 3.18: Setup of the laboratory environment
all videos of one set, the observer can continue to the next one. The choice of SAMVIQ was motivated by the fact that
this methodology makes it possible to compare different video signals to an explicit reference, which helps the
observers to evaluate the quality of a specific processed sequence. The possible repetitions provide the ability to
adjust the rating, which is particularly useful in the present study since many conditions had similarly high quality.
Providing an explicit reference and a way to readjust a given score can help the subject to evaluate the different
sequences. This is confirmed by previous studies, which show that SAMVIQ can be more stable than ACR if the
observer uses the replay feature [106].
The test conditions were set in accordance with ITU-R Recommendation BT.500-12 [98]. The viewing distance was 3
times the height of the screen (3H). The playback computer was a Pentium Core i7 PC with a graphics card which had
an HDMI output. The Stereoscopic Player [107], which was used for playback of all videos, was running in full-screen
mode on the secondary display. The 3D sequences were displayed on a commercial Sony 52” TV screen using shutter
glasses; the interface for the subjective testing was presented on another PC display connected to the same computer
(see Figure 3.18). The test subjects were persons involved in research and development, but not professionals working
on topics such as TV editing or production on a daily basis. 19 subjects participated. The task was demanding:
finding small differences in steps of 2 Mbit/s between 10 and 24 Mbit/s.
3.1.4.3 Subjective evaluation results
The subjective scores for each source sequence are depicted in Figure 3.19. As a first outcome, it is visible that with
the same set of parameters and at the same bitrate, the hardware encoder performs better than the software encoder.
The differences are statistically significant at a 95% confidence level using Student's t-test for three out of the
seven contents (Fans, Match, Sea).
Figure 3.19: Subjective quality score per content as a function of the bitrate in kbps
As depicted in Figure 3.19, the confidence intervals are quite large. This is most likely due to the difficulty of the
task asked of the observers: many conditions had high quality, hence it was difficult for the observers to give
accurate absolute quality ratings. However, since the SAMVIQ methodology was employed, observers had the
opportunity to compare each sequence to the other ones. Comparing the sequences enabled them to express their
preference for one sequence over another in terms of compression artifacts. Even though it was hard for them to give
absolute subjective scores, in most cases test subjects were able to provide relative ratings. The Spearman rank
order correlation of each individual observer is depicted in Table 3.6. To build this matrix, the coding conditions
with the hardware encoder were considered, under the assumption that increasing the bitrate decreases the value of
the quantization parameter and therefore increases the quality. Subjective scores should then follow this evolution.
If there had always been a clear improvement of the quality with increasing bitrate, the observers would have
obtained a Spearman rank order correlation of 1. But since the task was demanding, the observers did not provide that
accuracy. Based on this analysis, four observers appear to be outliers (They are
Table 3.6: Spearman rank correlation of each individual observer
observers 5, 8, 11 and 18, visible in Table 3.6). Averaging the Spearman correlation across the video sequences is
debatable, as there are large variations of correlation values between the different source sequences. This is why a
second analysis was performed to cross-check the inter-participant agreement and the participant removal. The scores
provided by each participant were directly compared to the scores of the other participants in terms of Pearson
correlation. Figure 3.20 depicts the inter-participant agreement using this other criterion. As in the previous
analysis, participants 11 and 8 appear as clear outliers. Participant 5 also does not show high agreement with the
other participants. However, participant 18 shows a stronger agreement with other participants such as 4, 7, 9 and
14, which is why it was decided not to reject this participant. In this second analysis, participant 19 also shows a
lower agreement than the other participants. However, in the ranking analysis, this participant performs among the
most consistent ones and was therefore not rejected either. This participant screening resulted in the rejection of
only three participants: 5, 8 and 11.
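The ranking analysis can be illustrated with a short Python sketch that computes the Spearman rank order correlation between the encoding bitrates and the scores of one observer for one content. Giving tied scores their average rank is an assumption, and the scores are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)  # undefined if one sequence is constant

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hardware encoder bitrates (Mbps) and hypothetical scores of one observer.
bitrates = [5, 7.5, 10, 12, 14, 16, 18, 20, 22, 24]
scores = [42, 50, 55, 53, 60, 64, 63, 66, 70, 69]
print(round(spearman(bitrates, scores), 3))
```

An observer whose scores always increase with the bitrate obtains a value of 1; each inversion between neighbouring bitrates lowers the value.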
A result of this test is a method to determine the bitrate value above which an increase of bitrate no longer provides
an increase of quality perceivable by the observers. Considering the size of the confidence intervals, it is proposed
to exploit the fact that with SAMVIQ, even though observers had difficulties agreeing on an absolute quality value for
a sequence, they were at least able to order the sequences. It is then possible to check the monotonicity of the
quality scores, which should follow the increase of bitrate. The point at which this agreement is broken is assumed to
be the point where observers were no longer able to see the difference between the quality of the sequences. The
bitrate threshold is then obtained at this specific value. Table 3.7 provides, for each observer and for each content,
the bitrate threshold determined in this way. It is then proposed, for each content, to take the average of the
bitrate values obtained for each observer as the expected threshold.
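The threshold procedure described above could be sketched as follows. Two details are assumptions that the description leaves open: a break is taken to be a score strictly lower than the score at the previous bitrate, and the threshold is reported as the bitrate at which that drop is observed. Observers whose scores never decrease are left out of the average, which is also an assumption. All scores are hypothetical.

```python
def bitrate_threshold(bitrates, scores):
    """First bitrate whose score is lower than the score at the previous bitrate.

    Returns None when the scores never decrease with increasing bitrate.
    """
    pairs = sorted(zip(bitrates, scores))
    for (_, prev_score), (rate, score) in zip(pairs, pairs[1:]):
        if score < prev_score:
            return rate
    return None

def mean_threshold(thresholds):
    """Average over the observers for whom a threshold was found."""
    found = [t for t in thresholds if t is not None]
    return sum(found) / len(found) if found else None

# Hypothetical scores of three observers for one content (bitrates in Mbps).
bitrates = [5, 7.5, 10, 12, 14, 16]
observers = [
    [40, 52, 60, 58, 61, 65],   # first inversion at 12 Mbps
    [45, 50, 49, 55, 60, 62],   # first inversion at 10 Mbps
    [38, 44, 51, 57, 63, 66],   # monotonic, no threshold found
]
per_observer = [bitrate_threshold(bitrates, s) for s in observers]
print(per_observer, mean_threshold(per_observer))
```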
3.1.4.4 Prediction of 3D QoE
To evaluate the quality of broadcast IPTV, another typical approach is the use of instrumental models. It is
proposed to evaluate the accuracy of two standardized models in evaluating the quality of 3D video sequences: VQM
and VQuad. The models were run on video sequences in the side-by-side representation. Figure 3.22 depicts the
performance of VQM on the previously presented database. The model achieves good performance with a Pearson
correlation of 0.8947 and an RMSE of 5.4 (after a linear mapping to a 0-100 scale: MOSe = -119.6 * VQM + 86.92).
It should be noted that the subjective scores of the video sequences mainly lie between 50 and 80, which may result
in a high value of Pearson correlation. Figure 3.21 depicts the performance of VQuad on the proposed database.
Figure 3.20: Matrix of cross Pearson-correlation between participants ratings
Table 3.7: Bitrate threshold for perceived quality difference in Mbps
This second model achieves lower performance on the studied database: it shows a Pearson correlation of 0.7586 and
an RMSE of 8.2 (after a linear mapping to a 0-100 scale: MOSe = 17.49 * VQuad + 4.628). It should be taken into
account that the VQuad model is able to handle video sequences with packet losses, which VQM is not, and this may
have an influence on the performance. When only high-quality sequences are considered, VQM is more appropriate for
evaluating the quality of encoded sequences before transmission, whereas VQuad would be more suited to the evaluation
of video sequences at the end of the transmission chain. Since transmission impairment is a dominant artefact
compared to coding, the development of VQuad may have been less focused on transmission-error-free sequences, so its
accuracy is lower for the specific scope of our study. VQM here seems to be more suited to such
transmission-error-free tests.
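To make the reported figures concrete, the sketch below applies the linear mapping given above for VQM and computes the RMSE between the mapped values and subjective scores. Only the two mapping coefficients come from the text; the VQM outputs and MOS values are hypothetical, and the function names are illustrative.

```python
from math import sqrt

def map_vqm_to_mos(vqm, slope=-119.6, intercept=86.92):
    """Linear mapping of a VQM output to the 0-100 scale (MOSe)."""
    return slope * vqm + intercept

def rmse(predicted, reference):
    """Root mean square error between predicted and subjective scores."""
    n = len(reference)
    return sqrt(sum((p - r) ** 2 for p, r in zip(predicted, reference)) / n)

# Hypothetical VQM outputs and subjective scores for five PVSs.
vqm_values = [0.28, 0.22, 0.17, 0.12, 0.08]
mos_values = [55.0, 59.0, 68.0, 71.0, 79.0]

mos_estimates = [map_vqm_to_mos(v) for v in vqm_values]
print([round(m, 1) for m in mos_estimates], round(rmse(mos_estimates, mos_values), 2))
```

The negative slope reflects that a larger VQM output indicates a stronger impairment, so the mapped score decreases as the VQM value grows.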
Considering the performance of the VQM model, a second aspect of this study is to attempt to determine the
bitrate corresponding to the quality saturation using an objective method such as presented for the subjective test in
Figure 3.21: Results of the VQuad model (The first letter
is used as a marker to identify the SRC)
Figure 3.22: Results of the VQM general model (The first
letter is used as a marker to identify the SRC)
the previous section. Figure 3.23 depicts, for each content, the subjective and objective quality evaluation as a
function of the logarithm of the bitrate. It can be noticed that the quality of some sequences may still increase
beyond the evaluation interval (strong effect for Bear, Fans; less for Interview and Sea, and only a small effect for
Horse, Match and Piano).
In the following evaluation, a different method is proposed to identify the quality saturation. The VQM algorithm
is only used as an example of an objective metric. The idea is based on the observation that at very high bitrates
the quality of the video tends to converge: once a certain quality level is reached, an increase of bitrate does not
provide a significant increase of visual quality. In the specific instantiation of this study, according to the
fitting, the maximum visual quality is reached at 89.5 MOS (but could be different in another experiment). It may be
anticipated that subjects are not capable of appreciating the quality gain of a video above a certain threshold,
for example 95% of this maximum quality. In that case, bitrate can be saved by identifying, using the VQM
algorithm, which bitrate corresponds to 95% of the maximum quality. In this evaluation, the quality predicted at
24 Mbps has been used as the maximum quality. A linear fit has been performed on the log-bitrate/quality scale, and
the 95% as well as the 90% quality points have been extracted. The results are presented in Table 3.8. The equivalent
of these results in terms of subjective score is given in Table 3.9.
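A minimal sketch of this saturation estimate is given below, assuming the quality is modelled as q = a · ln(bitrate) + c and that the fitted quality at 24 Mbps serves as the maximum. The function names and the quality values are hypothetical, and the natural logarithm is an arbitrary choice of base.

```python
from math import exp, log

def fit_log_bitrate(bitrates, quality):
    """Least squares fit of quality = a * ln(bitrate) + c."""
    x = [log(b) for b in bitrates]
    n = len(x)
    mx, mq = sum(x) / n, sum(quality) / n
    sxq = sum((xi - mx) * (qi - mq) for xi, qi in zip(x, quality))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxq / sxx
    c = mq - a * mx
    return a, c

def bitrate_at_fraction(a, c, fraction, max_bitrate=24.0):
    """Bitrate at which the fitted quality reaches fraction * quality(max_bitrate)."""
    q_max = a * log(max_bitrate) + c
    return exp((fraction * q_max - c) / a)

# Hypothetical objective quality (0-100 scale) for one content.
bitrates = [5, 7.5, 10, 12, 14, 16, 18, 20, 22, 24]
quality = [52.0, 58.5, 62.0, 64.5, 66.0, 68.0, 69.5, 70.5, 72.0, 73.0]

a, c = fit_log_bitrate(bitrates, quality)
print(round(bitrate_at_fraction(a, c, 0.95), 1), round(bitrate_at_fraction(a, c, 0.90), 1))
```

Because the fit is linear in the logarithm of the bitrate, a small relative loss of quality can correspond to a large saving of bitrate near the top of the range.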
These results provide a range of bitrates which matches the subjective bitrate threshold determined in the previous
section. This may provide an instrumental method to estimate a range of bitrates around the saturation point.
It should be noted that for the Piano sequence, most observers inverted their preference already at a very low
bitrate, mostly at the second or third bitrate step. In this case, the objective method provides a value which is
even lower than the smallest possible value obtainable from the subjective experiment (7.5 Mbps). Considering the
subjective experiment method, the objective metric might even provide an estimate in this particular case that is
close to a more generally valid threshold level.
3.1.4.5 Sum up
There are two main outcomes of this study: first, a method was presented for determining the saturation point of 3D
QoE when increasing the bitrate. Due to the difficulty of performing a subjective experiment requiring the comparison
Figure 3.23: Objective and subjective video quality as a function of the logarithm of the bitrate
of many similar high-quality sequences, it was proposed to use the ranking obtained by the SAMVIQ methodology to
determine this threshold. The second and main result was the comparison of two standardized objective models (VQM
and VQuad) for estimating the quality of the 3D video sequences. The VQM model has shown good performance
Content Name   95% Max Quality   90% Max Quality
Bear           17.3 Mbps         16.3 Mbps
Fans           18 Mbps           14.4 Mbps
Horse          14.6 Mbps         8.6 Mbps
Interview      16.5 Mbps         11.7 Mbps
Match          12.3 Mbps         8.8 Mbps
Piano          5.9 Mbps          5.2 Mbps
Sea            16.7 Mbps         12.3 Mbps
Table 3.8: Bitrate value from which 90% and 95% of the maximum objective quality is achieved
Content Name   95% Max Quality   90% Max Quality   Subj. threshold
Bear           66.96             63.43             58.20
Fans           72.49             68.67             73.74
Horse          70.92             67.19             67.45
Interview      73.49             69.61             72.69
Match          73.48             69.61             71.55
Piano          76.48             72.45             75.45
Sea            65.95             62.47             69.71
Table 3.9: Subjective scores corresponding to 90% and 95% of the maximum objective quality, and subjective score
corresponding to the subjectively defined bitrate threshold
on the proposed database and seems to be appropriate for tuning the settings of an encoder. As a last result, a way
to determine an interval of bitrates around the quality saturation point using an objective measurement method was
described. These results are consistent with the work of Benoit [108].
The ability of instrumental algorithms to predict 3D QoE confirms that evaluating the overall 3D QoE can be
challenging: without taking into account any information about depth or visual comfort, but only texture quality,
these algorithms are able to predict the scores obtained via subjective ratings. This extends the results of Benoit
[108], obtained on still images, and similarly leads to the conclusion that a good 2D prediction algorithm can be
applied to evaluate 3D quality in the case of video encoded with traditional coding algorithms such as H.264.
A last contribution of this experiment was to provide a method to estimate the quality saturation using a prediction
algorithm such as VQM.
Listing 3.6: Conclusion on the research questions
1. By applying a linear fitting on 2D quality prediction algorithms, it is possible to
predict 3D video quality for coding degradation using 2D video coding algorithms.
2. The comparison of VQM and VQuad shows that in the particular context of this study
VQM performed better than VQuad.
3.2 Revealing the added value of 3D over 2D
In order to reveal the differences of user experience of test participants while watching 3D video sequences compared
to 2D video sequences, alternative evaluation schemes have been considered by Seunti¨
ens. They address other di-
mensions like naturalness or immersion, or investigate specific factors like depth perception or eye-strain [2], which
provides some insight into isolated factors of the general QoE, but not the overall QoE. In the present study, as global
measure of QoE, the subjective preference has been considered. It is believed that when subjects are asked for prefer-
ence between two videos, they may consider all factors (picture quality, both depth quantity and depth quality, visual
discomfort and probably other factors) to take the decision which of two versions of a sequence they prefer. This way,
the entire multidimensionality of 3D QoE is considered. Missing factors of 2D video QoE when evaluating it using
ACR were shown by Belmudez [109], where another multidimensional question was studied. Here, image size and
image resolution were compared in terms of quality ratings, one using ACR, one using paired comparison (PC). Re-
sults showed that the two test methods do not provide the same results: using ACR, observers give higher QoE ratings
for images at their native resolution; using PC, observers prefer larger images obtained after upscaling. The results are
different, and show that using the ACR methodology observers only judge image quality, but with paired comparison
they extend their rating to other dimensions, including the image size. PC however has an important drawback: its cost
and time consumption. To obtain scale-value quality scores from PC data, two models exist: the Bradley-Terry model
and the Thurstone-Mosteller model [110]. Both need a full PC matrix: each condition has to be compared with every other condition.
However, several efficient approaches have been developed in the literature to reduce the number of required comparisons
[111], [16]. In [16], six video sequences were recorded. Each of these videos was captured at six inter-camera
distances (10 cm to 60 cm). The 36 video sequences were then compared through paired comparison, and the Bradley-
Terry scores of each condition were determined. Results show that the Bradley-Terry scores reveal quality fluctuations
due to the different depth and comfort. The relation between inter-camera distance and QoE was found to be highly content
dependent. In [112], 3D was compared to 2D using a PC approach on an auto-stereoscopic display. 3D was produced
internally by the display based on a texture and a depth map. The texture was used at four different quality levels (three
encodings and a reference). Results show that 3D was rejected in 70% of the cases and for the lowest quality rejected
at 56%. However, the results may be influenced by the technology used at the time of the experiment and the quality
of 3D rendering of the 3D display as mentioned by the authors [112].
Considering that PC provides an easy question to the test participant, it will be used to evaluate 3D and 2D video
sequences, to show the quality improvement/decrease due to 3D. The research questions are the following:
Listing 3.7: Research questions
1. How much is 3D preferred (or less preferred) compared to 2D?
2. How does the preference of 3D over 2D evolve as a function of the image quality?
3. How much do the contents’ characteristics affect the preference of 3D over 2D?
3.2.1 Definition of test conditions
This subsection provides details on the definition of the conditions and test design to answer the previously stated
research questions.
3.2.1.1 Selection of Source sequences
The selection of the source contents (SRC) is based on three databases. All sequences were full HD stereoscopic
videos; each view had a resolution of 1920x1080 and a frame rate of 25 images per second, and each sequence was 10 s long.
Seven SRCs come from a first database composed of 64 source reference signals (SRCs) which were evaluated in
[113]. This database of SRCs will be further detailed at a later stage of the thesis, in the content characterization chapter.
The SRCs were used at the highest quality available, and contained various types of scenes. They were rated on three
different scales: overall quality of experience, depth and visual comfort. The methodology used was Absolute Category
Rating (ACR). Perceived depth was rated on a five-point scale with labels: “very high”, “high”, “medium”, “low” or
“very low”. Using this general depth scale, the observers rated their general impression of the depth, which was thought
to take into account both depth layout perception and depth quality. The comfort was evaluated in an absolute manner
by asking subjects if the 3D sequence is “much more”, “more”, “as”, “less”, “much less” - “comfortable than watching
2D video”. Based on this data, the seven SRCs were chosen to cover the entire range of depth ratings. To ensure the
reproducibility of our results, it was decided to include five open video materials [114]. These sequences include “Tree
Branches”, “Hall”, “Umbrella” and two new sequences designed to reach our depth effect requirements: “Timelaps”, with
low depth quantity, and “Drone”, with high depth quantity. A third source of SRCs was Blu-ray disc, from which three
other non-open sequences named “Alice” were added to the test. These last sequences were not available at the time of
the previous test on content characterization described in [113]. Therefore, the depth perception model [113] developed
in the context of this thesis, which will be detailed in Chapter 5, was used to obtain a predicted value of the perceived
depth in these sequences. The top scatter plot of Figure 3.24 shows how the selected sources cover the depth scale,
based on subjective data [113]. The second scatter plot shows the depth scores estimated from the depth
model described in [113]. The third scatter plot shows the available subjective data regarding visual discomfort. These
data were not used for content selection in this study; they are shown to give the reader a view
of the principal characteristics of the 3D sequences.
3.2.1.2 Selection of coding conditions
Coding was performed using a Harmonic Electra 8000 H.264 encoder at constant bitrate. This encoder was the same as
the one used in the previous study on video quality prediction models. Since it was planned to use a polarized display
with horizontal interlacing, the 3D sequences were in the Top/Bottom frame compatible format, this choice limiting
the loss of resolution. Each full-HD view was downscaled to half the vertical resolution using a Lanczos filter. No
further interpolation was done for optimizing resolution, resulting in half a line of vertical parallax. Each sequence was
then encoded at four different “quality levels” by using four different bitrate values. The four bitrate values were
individually defined for each source sequence, since each of them had different spatial and temporal complexity. In the
previous section, it was revealed that VQM (ITU-T Rec. J.144) performs sufficiently well in estimating video quality
of 3D sequences [94] (Pearson correlation of 0.89, and RMSE of 5.4). As a consequence, the procedure for deter-
mining the bitrate values corresponding to the appropriate “quality levels” is based on quality estimations obtained
from the VQM general model. The four quality levels finally used were set to 0.1, 0.2, 0.3,
and 0.4 on the VQM scale. These values correspond to the quality score of the most complex sequence of the test,
“TreeBranches”, encoded respectively at a quantization parameter QP of 26, 32, 38, and 44 using the reference H.264
encoder JM 18.2. The range of bitrates used in the test is illustrated in Figure 3.24. Conditions are denoted, for example, 2DQ1 for
2D at quality level 1. Quality level 0 indicates the reference.
At the end of the selection process, the 15 SRCs were encoded at four individually chosen bitrates leading to VQM
scores close to 0.1, 0.2, 0.3 and 0.4 as described above. These sequences are used in two versions: 2D and 3D. The 2D
sequences were encoded at the same bitrate as their respective counterpart sequences in 3D. In addition, a 3D reference
with no compression was added to the test. This results in 15·(2·4+1) = 135 video sequences to be evaluated.
3.2.2 Evaluation of 3D QoE using paired comparison
The global QoE of the video sequences was evaluated in a paired comparison experiment. Thirty-five observers participated
in this test. The laboratory environment was in accordance with ITU-R Recommendation BT.500. The observers’
vision was screened in terms of acuity, color vision (Ishihara test), and stereo vision (Randot stereo test). For the test,
two polarized 23” Hyundai displays (ViewSonic V3D231) with horizontal interlacing were used. The displays were
Figure 3.24: Depth, visual comfort, bitrate of source sequences
calibrated using a display calibration device (X-Rite i1 Pro) to make the rendering as similar as possible between the
two displays. The observers were facing two distinct displays, and were instructed to give their preference between the
two presentations they could see on the displays. Considering the number of possible presentations (3D or 2D, 4 quality
levels in 2D and 3D each, and a 3D reference), a full PC matrix approach would have required 9 × (9 − 1)/2 = 36
comparisons per SRC, hence 540 comparisons for evaluating all video sequences. This high number of comparisons
is impracticable for a subjective experiment [111]. For more efficient testing, the square design matrix was employed.
Based on this approach, it was possible to reduce the number of comparisons to 18 comparisons per SRC, hence
15 × 18 = 270 comparisons overall. The comparisons made in the test can be found in Listing 3.8. The sequence pairs
were randomized such that in case of comparison A vs. B, both orders A vs. B and B vs. A were seen by the observers.
This avoids any dependency of the preference ratings on the display and possible default answers by observers (right
vs. left). The test was split into two sessions of 45 min. The same observers participated twice, with a minimum time
between the series of one week.
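The comparison counts above can be checked with a short sketch. It assumes that the square design compares only conditions sharing a row or a column of a square arrangement; the 3 × 3 grid below is a hypothetical arrangement chosen for illustration, since the arrangement actually used is not stated in the text.

```python
from itertools import combinations


def full_design(conditions):
    """Every condition compared with every other one: n * (n - 1) / 2 pairs."""
    return list(combinations(conditions, 2))


def square_design(grid):
    """Only conditions sharing a row or a column of the square grid are compared."""
    pairs = []
    for row in grid:
        pairs.extend(combinations(row, 2))
    for column in zip(*grid):
        pairs.extend(combinations(column, 2))
    return pairs


# Hypothetical 3 x 3 arrangement of the nine conditions (an assumption).
GRID = [
    ["3DQ4", "3DQ0", "2DQ4"],
    ["3DQ3", "3DQ1", "2DQ2"],
    ["2DQ3", "2DQ1", "3DQ2"],
]
N_SRC = 15  # number of source sequences in the test

conditions = [c for row in GRID for c in row]
n_full = len(full_design(conditions))
n_square = len(square_design(GRID))

print(len(conditions) * N_SRC)     # 135 processed sequences
print(n_full, n_full * N_SRC)      # 36 540
print(n_square, n_square * N_SRC)  # 18 270
```

With nine conditions, the square design halves the number of comparisons per SRC while every condition is still compared with four others.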
Listing 3.8: List of sequence pairs compared by observers
| 3DQ4 vs 3DQ0 | 2DQ3 vs 2DQ1 | 3DQ4 vs 2DQ4 | 2DQ3 vs 3DQ2 |
| 3DQ0 vs 2DQ4 | 2DQ1 vs 3DQ2 | 3DQ3 vs 3DQ1 | 3DQ4 vs 3DQ3 |
| 3DQ3 vs 2DQ2 | 3DQ4 vs 2DQ3 | 3DQ1 vs 2DQ2 | 3DQ3 vs 2DQ3 |
| 3DQ0 vs 3DQ1 | 3DQ0 vs 2DQ1 | 3DQ1 vs 2DQ1 | 2DQ4 vs 2DQ2 |
| 2DQ4 vs 3DQ2 | 2DQ4 vs 2DQ4 |
3.2.3 Preference of 3D over 2D and pictorial quality
The main goal of the paired comparison test was to analyze how the preference of 3D over 2D would depend on the
respective “pictorial quality”. As outlined in Section 3.2.1.2, pictorial quality was varied at four different bitrates, i.e.
quality levels Q1-Q4. Figure 3.25a illustrates, for one SRC, how observers answered. It is visible that when the bitrate
increases, the preference of the 3D presentation over the 2D version increases. This behaviour is, however, different for
one of the contents, “SkydiversInsideGroup”. This content was found to be less preferred when the quality increased.
In Figure 3.24, it can be seen that this content is the least comfortable sequence of the database. The blurring added
by coding may have contributed in such a way that this content was perceived as more comfortable when coding
was stronger. This would be in agreement with [115], where binocular fusion was found to be dependent on the
retinal disparities and spatial frequencies within images. In the paired comparison tests, some video sequences were
always found to be preferred in 2D, namely “SkydiversAlignment”, “SkydiversInsideGroup”, and “Waterfall”. These
sequences correspond to the least comfortable sequences of the test. In turn, one sequence was always preferred in
3D: “CarRace3”. Based on the test results, the bitrate at which 2D and 3D were equally preferred can be determined,
as well as the respective VQM scores. In the following, these points are referred to in terms of isopreference. The
VQM scores at isopreference have been estimated using linear regression between two known points in the 2D domain
spanned by “preference percentage” and “VQM scores”. On average, it was found that the isopreference at the same
bitrate between 2D and 3D is reached when the picture quality of the 3D sequence measured by VQM is at most equal
to 0.24. The relation between the VQM scores at isopreference and the depth score rating was considered (subjective
depth score when available, and objective if not, see Section 5.4). However, no simple relation can be found between
these two factors, and other factors have to be taken into account. These other factors may include monocular depth
cues such as blur from defocus, linear perspective, texture gradient, motion parallax and also visual discomfort.
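The estimation of the isopreference point can be sketched as follows. The sketch assumes that the "two known points" are the two measured conditions that bracket a 3D preference of 50%, so that the regression reduces to the straight line through them; the function name and the numerical values are illustrative only and are not data from the experiment.

```python
def isopreference_vqm(vqm_scores, pref_3d_percent, target=50.0):
    """Return the VQM score at which 3D is preferred in `target` % of the comparisons.

    A straight line is drawn through the two neighbouring measured points
    that bracket the target percentage. Returns None if the target is never crossed.
    """
    points = sorted(zip(vqm_scores, pref_3d_percent))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        crosses = (y0 - target) * (y1 - target) <= 0
        if crosses and y0 != y1:
            return x0 + (target - y0) * (x1 - x0) / (y1 - y0)
    return None


# Made-up example: preference of 3D over 2D (in %) at the four quality levels.
vqm = [0.1, 0.2, 0.3, 0.4]
preference = [70.0, 55.0, 45.0, 30.0]
print(isopreference_vqm(vqm, preference))  # approximately 0.25
```

A content that is always preferred in 2D, or always in 3D, never crosses 50% and therefore has no isopreference point in this sketch.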
Figure 3.25: Illustration of preference results for one source content (Hall): (a) observer preferences, (b) Bradley-Terry scores.
3.2.4 Quantitative analysis of the “3D added value”
Thanks to the test design it was possible to use the Bradley-Terry (BT) model for the analysis. In Figure 3.25b, results of
the model are depicted for one example SRC. The BT-scores provide a continuous perceptual scale which quantifies
the difference between 2D and 3D QoE. It is then possible to evaluate the “added value of 3D” by measuring the
difference between the BT-scores at conditions where the bitrate is the same. As only pairs for the same source content
were evaluated in our test, the BT-scores cannot be used to compare preferences between contents. For example, it is
not possible to compare the content “Alice1” in 3D at quality level 3 to “SkydiversInsideGroup” in 2D at quality level
2. This inter-content comparison was not targeted, and instead the goal was to determine 3D preference thresholds
as a function of pictorial quality for different degrees of depth information. Making inter-content comparisons would
have added individual judgments of the observer regarding his preference of one type of scene compared to another,
which would have made the data noisy and hard to interpret. As a consequence, it is not possible to compare one BT-
score from one SRC to another score from another SRC since there is an unknown offset between these two scores.
However, since the scale remains the same between SRCs, it is then possible to compare inter-SRC differences of
BT-score. Let the “3D added value” be the difference of BT-score between two similar coding conditions reflecting the
Figure 3.26: 3D added value as a function of VQM scores.
score fluctuation due to the presence of depth (see Figure 3.25b). Then, it can be seen that at least two factors
influence the “3D added value” (3DAV) scores: one covers the 3D characteristics of the video sequences, including
depth, comfort, naturalness, immersion, etc., and the other covers the pictorial quality of the video. Figure 3.26
depicts the relation between the quality factor measured through VQM scores of the 3D video sequences and the
3DAV. These two factors show a Pearson correlation of -0.65 and a Spearman correlation of -0.67. An N-way
Analysis of Variance (NANOVA) of the 3DAV based on the factors “QualityLevels”, as defined in Section 3.2.1.2,
and “DepthLevels” (grouping the SRCs in five classes of depth effect) shows that there is a strong influence of quality
on the 3DAV (F = 13.5, p < 0.001) and that there is also a significant influence of the “DepthLevels” on the 3DAV
Figure 3.27: 3D added value as a function of ∆BT3D.
(F = 3.98, p = 0.0069). Considering the rather small amount of data (four 3DAV values per SRC), no significant
influence can be observed on a per-content analysis. Let BT3D(k) be the BT-score of the condition 3DQk as listed in
Listing 3.8. Then, the 3D-QoE impact due to coding can be obtained by: ∀k ∈ [1, 4], ∆BT3D(k) = BT3D(k) − BT3D(0).
∆BT3D(k) provides a subjective value of how coding affects 3D QoE. This includes loss in pictorial quality, loss
in depth [11] and the impact on comfort. The relation between ∆BT3D(k) and 3DAV is depicted in Figure 3.27. For
the latter scatter plot, it should be remembered that both 3DAV and ∆BT3D(k) are expressed on the same scale. To
compare the specificities of the contents, it is proposed to perform a regression of the 3D added value (3DAV) as a function of
∆BT3D(k) (see Table 3.10).
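The quantities used in this analysis can be illustrated with a minimal sketch. It fits Bradley-Terry scores with a standard minorization-maximization iteration, which is an assumed choice because the estimation procedure used for the thesis is not specified here, and then derives the 3DAV and ∆BT3D(k) from the scores of one SRC. The function names and all numerical values are hypothetical.

```python
import math


def fit_bradley_terry(wins, n_iter=1000, tol=1e-12):
    """Estimate Bradley-Terry scores (log-strengths, centred on zero).

    wins[i][j] is the number of times condition i was preferred over j.
    Assumes every condition was compared and preferred at least once,
    and that the comparison graph is connected.
    """
    n = len(wins)
    strength = [1.0] * n
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            n_wins = sum(wins[i])
            denom = sum((wins[i][j] + wins[j][i]) / (strength[i] + strength[j])
                        for j in range(n) if j != i)
            updated.append(n_wins / denom)
        scale = n / sum(updated)  # fix the arbitrary scale of the strengths
        updated = [s * scale for s in updated]
        change = max(abs(a - b) for a, b in zip(updated, strength))
        strength = updated
        if change < tol:
            break
    logs = [math.log(s) for s in strength]
    offset = sum(logs) / n
    return [v - offset for v in logs]


def added_value_and_coding_impact(bt):
    """bt maps condition labels ('3DQ0', '2DQ1', ...) to BT scores of one SRC."""
    levels = range(1, 5)
    added_value = {k: bt[f"3DQ{k}"] - bt[f"2DQ{k}"] for k in levels}  # 3DAV
    delta_bt3d = {k: bt[f"3DQ{k}"] - bt["3DQ0"] for k in levels}      # coding impact
    return added_value, delta_bt3d


# Toy check: A preferred 3 times, B once, so the score difference is ln(3).
scores = fit_bradley_terry([[0, 3], [1, 0]])
print(scores[0] - scores[1])

# Made-up BT scores for one SRC, only to show the bookkeeping.
bt = {"3DQ0": 1.2, "3DQ1": 1.0, "3DQ2": 0.6, "3DQ3": 0.1, "3DQ4": -0.5,
      "2DQ1": 0.3, "2DQ2": 0.1, "2DQ3": -0.2, "2DQ4": -0.6}
print(added_value_and_coding_impact(bt))
```

Because both derived quantities are differences of scores within one SRC, the unknown per-SRC offset of the BT scale cancels out, which is what makes them comparable across contents.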
The slope values (α) between the two factors range from 0.02 to 0.76. The regression over the overall data leads to a slope of 0.71.
This shows that, on average, a quality variation of X changes the added value of 3D over 2D by 0.71 · X. However,
there are high fluctuations due to content specificities (depth and comfort) which need to be studied further. The values
of β provide information on the added value of the content when available at the highest quality possible, and hence on
its suitability to be presented in 3D. Considering the high inter-content variation of α and β, a characterization of the
scenes appears to be needed for the development of 3D-QoE models. This will be addressed further in chapter 5.2 of
the thesis.
3DAV = α · ∆BT3D + β

Content          α      β       Content          α      β
Timelaps         0.54   -0.10   Alice7           0.33   -0.07
Sky.Alignment    0.021  -0.94   Alice4           0.58   0.38
Waterfall        0.08   -0.67   Hall             0.68   1.05
Alice1           0.40   0.77    Sky.InsideGroup  0.38   -0.72
TreeBranche      0.57   0.23    PauseOnARock     0.76   0.13
Umbrella         0.76   0.90    Firework         0.38   0.65
LampBlowUp       0.41   0.53    CarRace3         0.51   1.33
Drone            0.71   1.15    overall          0.71   1.15

Table 3.10: Relationship between added value of 3D and difference of BT-score between coding and reference condition.
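The per-content coefficients of Table 3.10 correspond to a straight-line fit of 3DAV against ∆BT3D. A minimal ordinary least-squares sketch is given below; the fitting method is assumed to be ordinary least squares, and the four data points are made up so that they lie exactly on a line.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = alpha * x + beta."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    alpha = sxy / sxx
    beta = mean_y - alpha * mean_x
    return alpha, beta


# Made-up values for one SRC: coding impact and 3D added value at Q1..Q4.
delta_bt3d = [-0.5, -1.0, -1.5, -2.0]
added_value = [0.80, 0.45, 0.10, -0.25]

alpha, beta = fit_line(delta_bt3d, added_value)
print(round(alpha, 3), round(beta, 3))  # 0.7 1.15
```

Since ∆BT3D is zero at the reference, β is the fitted added value of 3D at the highest available quality, and α tells how quickly that added value is lost when coding degrades the 3D presentation.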
3.2.5 Relation with previous studies
The relationship found between the “3D added value” and the 3D quality of the video is particularly interesting, since
this experiment found, in a different context, coefficients between these factors similar to those in the work of Lambooij [9].
In Lambooij’s work, test participants were asked to report their visual experience, the 3D image quality, and the depth
for video sequences having only blur and Gaussian noise distortions. This resulted in the equation:
EC = a · IQ + b · D    (3.2)
with a = 0.74 and b = 0.26 when the evaluation concept (EC) naturalness is modeled, and a = 0.82 and
b = 0.18 when the visual experience is targeted. In this experiment, a coefficient of 0.71 was
found for the 3D image quality factor when preference of 3D over 2D is modeled. However, the term β from the
study conducted in this thesis most likely takes into account more factors than depth alone, since it includes all factors
describing how appropriate it is to render the content in 3D.
3.2.6 Conclusion
3D QoE was evaluated by using paired comparison. This way, preference of 3D could be investigated as a function
of picture quality. Results show that increasing picture quality increases the probability of preference of 3D over 2D.
On average, a VQM score of at most 0.24 was found to be required to ensure preference of 3D over 2D. Bradley-Terry scores
were estimated, and the “3D added value” was determined. The results show that, on average, there is a factor of 0.71
between variation of pictorial quality and “3D added value”. However, there is considerable variation between contents,
which needs to be studied.
Listing 3.9: Conclusion on the research questions
1. & 3. In this experiment, it was possible to evaluate the preference of 3D over 2D.
The “appropriateness” of contents to be represented in 3D was evaluated with the
term β in Table 3.10.
2. & 3. It was found that the preference of 3D over 2D increases when the pictorial
quality increases.
3.3 Overall results
The last experiment presented has shown the relationship between preference of 3D over 2D as a function of image
quality and content characteristics. A linear model has been found providing a good fit between these different factors
(see Eq. 3.3). The parameters α and β are content dependent and relate to factors such as the perceived depth and
visual comfort. This characterizes the contents along two axes:
•The added value of the content compared to 2D: β
•The criticality of coding for the content, e.g. the importance of texture quality to preserve the added value of 3D: α
The quality factor was demonstrated in the first two studies to be closely related between 3D and 2D presentation,
and it was even shown that prediction algorithms designed to predict 2D quality work well for predicting 3D quality.
Therefore, there is a strong research interest in the characterization of the properties of 3D video sequences. This will be
the topic of the next chapters of the thesis.
Preference3Dvs2D = α · 3DImageQuality + β    (3.3)
3.4 Key contributions
In this chapter, the evaluation of 3D QoE has been addressed and methodologies have been compared to study how
the added value of 3D compared to 2D can be revealed. The main contributions are listed below:
•It was shown, in the case of a study using absolute category rating for evaluating difficult concepts such as visual comfort, that test
participants do not use the scales in the same manner. This is due to the fact that they may not all understand the
scales equally for such complex notions.
•Results have shown that a good prediction algorithm for 2D video quality can be applied to 3D quality after a linear
transformation.
•Using paired comparison, it was possible to ask the test participants a simple question, enabling them to fully
understand their task. This made it possible to establish the relationship between texture quality and preference of 3D over 2D.
•The preference of 3D over 2D increases when 3D image quality increases.
Chapter 4
Subjective evaluation of depth
4.1 Introduction
The previous chapter showed that it is possible to reveal the added value of 3D videos compared to 2D using pairwise
comparison. Even if the display conditions were set to minimize visual discomfort issues [89], it is clear that the
relation between preference of 3D over 2D and texture quality is highly content dependent and therefore depends on the
depth properties of the content. Therefore, there is a need to characterize the 3D properties of the 3D videos. For the
characterization of 2D properties such as the temporal and spatial complexity, indicators have been defined
by the ITU in ITU-T Recommendation P.910 [116], namely the spatial information (SI) and the temporal information
(TI). No such indicators have been defined for 3D videos. In this chapter, the characterization of the 3D properties will
be addressed using evaluation involving test participants. As previously described in the chapter on the state of the art and
its related section on depth perception, many factors are involved in the depth perception process, regarding both the
2D and 3D properties of the contents. In this chapter, different methods involving test participants will be presented for
the evaluation of the 3D properties of 3D video sequences.
Figure 4.1: Structure of the studies described in the chapter.
4.2 Evaluation of depth cues
The evaluation of depth from monocular and binocular depth cues will be performed on natural images. The use of
natural images makes the task particularly difficult, since it becomes difficult to define the amount of each depth cue
as is usually done in psychophysical studies. It is therefore necessary to evaluate how strong each considered depth cue is to
enable the study of how they affect the overall perceived depth. This comes with a second difficulty: the fact that we
may omit to evaluate depth cues which are in the pictures and used by the test participants to evaluate the perceived
depth. To limit this, all the different depth cues described by Cutting and Vishton [25] will be considered.
The evaluation of the monocular depth cues is challenging; most of the methodologies previously described in the
state of the art section (see Section 2.3.1) were designed to evaluate the overall depth perception and not the individual
depth cues. Indeed, in those studies the individual depth cues were known from the design of the experiment, since the stimuli were
mostly artificial scenes, and there was no need to evaluate these depth cues individually. Among the methodologies
presented in the last subsection, those which enable the evaluation of the monocular depth cues are then the forced
choice between a certain number of options and the evaluation on a numerical scale.
Unfortunately, obtaining a quantitative evaluation on a particular scale through a forced choice approach requires a
high number of comparisons. Indeed, both the Bradley-Terry model and the Thurstone-Mosteller model
[110] require a full pair comparison matrix and therefore N × (N − 1)/2 comparisons per scale, with N the number of stimuli.
Optimized approaches are possible using the square design approach [117] and a well chosen square matrix [111].
However, since natural images were chosen and the quantitative amount of monocular depth cues in the stimuli is therefore
difficult to define precisely, a high number of images was selected: 200. Even using the
square design approach, the number of required comparisons is then still too high for a subjective test.
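As a worked instance of the full-matrix count given above (plain arithmetic on the stated number of images, with no additional data):

```latex
\frac{N(N-1)}{2} \;=\; \frac{200 \times 199}{2} \;=\; 19\,900
\quad \text{comparisons per scale for } N = 200 .
```

By comparison, the nine conditions per SRC of the previous chapter required only 36 comparisons in a full design.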
Different approaches will be used to address the evaluation of the different depth cues and overall perceived depth.
The research questions are the following, and were addressed through the different experiments listed in Table
4.1.
Listing 4.1: Overall research questions
1. How can 2D depth cues be characterized in natural images?
2. How, by means of evaluation involving test participants, is it possible to measure
the contribution of each individual depth cue?
3. How does the overall perceived depth relate to the different depth cues?
Section  Depth cues            Type of content  Number of SRCs  Methodology  Published in
4.4      Binocular depth       3D Videos        64              ACR          [113]
4.5      Monocular depth cues  2D Images        200             ACR          [118]
4.5.2    Binocular depth       3D Images        200             ACR          [118]
4.6      Monocular depth cues  2D Images        150             Ranking      [119]

Table 4.1: List of experiments conducted with addressed depth cues and addressed type of source sequences.
4.3 Definition of scales
A first important aspect is to provide a definition of the different scales, and define how the different depth cues should
be rated by the test participants. A contribution described in this section is the definition of the different scales which
will be used by the test participants to evaluate the monocular depth cues of the 3D materials. Seven different depth
cues were considered. Each of them was defined by a schema, several examples, and a definition. Each of the scales
will be described in this section.
4.3.1 Perceived depth
To evaluate the overall perceived depth, test participants were asked if the depth sensation was: “very high”, “high”,
“medium”, “low” or “very low”. It was chosen not to ask participants to evaluate binocular depth alone: since natural images
are considered, it was not possible for them to disambiguate the contribution of the monocular depth cues from that of the
binocular cues. Therefore, only the overall depth sensation could be evaluated.
4.3.2 The linear perspective
Test participants were asked to evaluate the linear perspective (Figure 4.2) by taking into account whether there are clearly
visible vanishing lines within the image and whether these vanishing lines contribute to the perception of the different depth
layers in the scene. This depth cue is supposed to be stronger when clear vanishing lines are visible.
4.3.3 The relative size
Test participants were asked to evaluate the relative size (Figure 4.3) by considering whether there are repeating objects in
the scene which appear at different sizes. They were instructed not to use their knowledge about the size of the
individual objects for the rating. The rating was to depend on the number of times an object appears at
different sizes. This depth cue is supposed to increase when objects are repeated several times at different sizes.
Figure 4.2: Linear perspective
Figure 4.3: Relative size
4.3.4 The texture gradient
Test participants were asked to evaluate the texture gradient (Figure 4.4) based on whether there is a texture within
the image (more generally, a repetition of patterns) which becomes finer when the distance to the camera
increases. This depth cue is supposed to be stronger when there is a strong variation of the granularity of the texture
or pattern.
4.3.5 The interposition
Test participants were asked to evaluate the interposition (Figure 4.5) based on the number of overlapping objects in the
scene. They were told that the overlap of one object over another makes it possible to order the objects in depth,
and that they should evaluate the interposition by considering how much the overlapping objects help them to
be aware of the absolute position in depth of the objects, using all the interpositions. This depth cue is supposed to
increase when there are many objects overlapping at different absolute positions in depth.
Figure 4.4: The texture gradient
Figure 4.5: The interposition
4.3.6 The light and shades
Test participants were asked to evaluate the light and shades (Figure 4.6) based on the presence of a light source and
the resulting shades which help to apprehend the shape of the objects. This depth cue is supposed to be stronger when
there is a light source which reveals the real shape of an object which would have appeared flat otherwise.
4.3.7 The aerial perspective
Test participants were asked to evaluate the aerial perspective (Figure 4.7) based on the effect of the atmosphere in
the image. For example, objects which are far away will have a color close to the color of the sky. This depth cue
is supposed to be proportional to the presence of a smooth transition from the color of the sky to the elements in the
background, which usually do not have this particular color.
4.3.8 The defocus blur
Test participants were asked to evaluate the defocus blur (Figure 4.8) based on the variation of the sharpness at different
locations of the image, reflecting variations in the distance of the objects to the focal point of the camera. This depth
cue is supposed to be proportional to the variations between sharp and blurred areas in the images.
Figure 4.6: Light and shades
Figure 4.7: Aerial perspective
Figure 4.8: Defocus blur
4.4 Evaluation of perceived depth
In order to evaluate the monocular and binocular properties of the 3D video sequences and 3D images, different
experiments have been conducted. These aim to characterize the content properties, but also to study how the different
cues can be evaluated in natural images. The first experiment described aims to evaluate how the depth
quantity can be evaluated in natural images and how consistent test participants are in the way they provide their
ratings. The research questions are then the following:
Listing 4.2: Research questions
1. How can depth in 3D images be evaluated?
2. How well do test participants agree when asked to rate depth?
3. How do test participants relate binocular depth, quality of experience and content
characteristics?
4.4.1 Experiment
Once the different scales were defined, different experiments were conducted to evaluate images and videos along the
different scales. To evaluate the overall perceived depth, a database composed of 64 source reference signals (SRCs)
has been designed [113]. A description of each source sequence can be found in Table 4.2. These SRCs were used at
the highest quality available, and contained various types of scenes: indoor, outdoor, natural, or computer generated
sequences, and containing slow or fast motion. The objective was to diversify the source material as much as possible. All these
sequences were full HD stereoscopic videos; each view had a resolution of 1920x1080, with a frame rate of 25 images
per second. Each of the sequences was of 10s length. They were presented on a 23” LCD display (Alienware Optx,
120Hz, 1920x1080p). It was used in combination with the active shutter glasses from Nvidia (NVidia 3D vision
system). The viewing distance was set to 3H, and the test lab environment was according to the ITU-R BT.500-12
recommendation [98]. Twenty-four observers attended the experiment; their vision was checked, and it was assured
that they passed the color blindness test (Ishihara test) and the depth perception test (Randot stereo test). After
passing all the vision tests, the observers were trained using five sequences with different values of image quality,
depth quality and visual discomfort. During the training phase the observers had the opportunity to ask questions. After
the training had finished, the observers were asked to rate the 64 sequences on three different scales: overall quality of
experience, depth and visual comfort. The methodology used was Absolute Category Rating (ACR). QoE was rated
on the standardized five grade scale: “Excellent”, “Good”, “Fair”, “Poor”, “Bad”. Perceived depth was rated on a
five-point scale with labels: “very high”, “high”, “medium”, “low” or “very low”. Using this general depth scale, the
observers have rated their general impression about the depth, which takes into account both depth layout perception
and depth quality. The comfort was evaluated by asking subjects if the 3D sequence is “much more”, “more”, “as”,
“less”, “much less” - “comfortable than watching 2D video”. The test subjects were not presented with 2D versions
of the video sequences, therefore they had to compare the 3D comfort with their internal references of 2D sequences.
One test run took approximately 50 minutes, including the training session and a 3-minute break in the middle of the
test.
4.4.2 Analysis of results
4.4.2.1 Agreement between observers
The coherence of individual ratings of each observer with those of the other observers was checked by following the
β2 test as described in Section 2.3.2 of ITU-R BT.500 [98]. The screening was done for each of the three scales
individually. Observers could be kept for a specific scale but rejected for another. This was motivated by the fact that
observers may have misunderstood one scale, but may still correctly evaluate for the other scales. After screening,
four observers of the 24 were rejected on each scale: two observers showed strong variation compared to the rest of
the group on the quality and depth scales (according to the β2 test), two on the comfort and quality scales, one on the
depth and comfort scales, one on only the comfort scale and one on only the depth scale. None of the subjects showed
inconsistent behavior for all three scales.
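The screening step can be illustrated with a minimal sketch of the kurtosis-based (β2) procedure, applied to one scale at a time. The function name, the data layout (`scores[observer][stimulus]`) and the default thresholds of 5% and 0.3 are assumptions made for this illustration, following the usual reading of ITU-R BT.500; this is not the code used in the study.

```python
import math

def screen_observers(scores, ratio_threshold=0.05, balance_threshold=0.3):
    """Return the indices of observers rejected by a beta2-style screening.

    scores[i][j] is the rating of observer i for stimulus j (assumed layout).
    """
    n_obs = len(scores)
    n_stim = len(scores[0])
    above = [0] * n_obs  # P_i: ratings far above the group mean
    below = [0] * n_obs  # Q_i: ratings far below the group mean
    for j in range(n_stim):
        column = [scores[i][j] for i in range(n_obs)]
        mean = sum(column) / n_obs
        m2 = sum((x - mean) ** 2 for x in column) / n_obs
        if m2 == 0:
            continue  # all observers agree on this stimulus
        m4 = sum((x - mean) ** 4 for x in column) / n_obs
        beta2 = m4 / m2 ** 2  # kurtosis coefficient
        std = math.sqrt(sum((x - mean) ** 2 for x in column) / (n_obs - 1))
        # Roughly normal votes: 2 standard deviations, otherwise sqrt(20).
        k = 2.0 if 2.0 <= beta2 <= 4.0 else math.sqrt(20.0)
        for i in range(n_obs):
            if scores[i][j] >= mean + k * std:
                above[i] += 1
            elif scores[i][j] <= mean - k * std:
                below[i] += 1
    rejected = []
    for i in range(n_obs):
        total = above[i] + below[i]
        if total == 0:
            continue
        often_outlying = total / n_stim > ratio_threshold
        not_systematic = abs((above[i] - below[i]) / total) < balance_threshold
        if often_outlying and not_systematic:
            rejected.append(i)
    return rejected
```

An observer is only rejected when the outlying votes are both frequent and spread on both sides of the group mean; a systematic shift in one direction is tolerated by this rule.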
To further study the agreement between observers, the correlation between each pair of test participants was
considered; the results are depicted in Figure 4.9. The correlation between participants appears to be rather low.
This appears to be due to the difficulty test participants had in evaluating the different factors on
the proposed categorical scales. To analyze the differences in agreement between the ratings on the different scales, a
Kruskal-Wallis one-way analysis of variance was applied to compare the Spearman correlations between test participants
as a function of the scale under evaluation; a significant difference between the agreement of test participants on
the three scales was observed (Chi-sq=10.54, p<0.01). Fleiss' kappa was computed to compare the
agreement between observers when rating on the different scales. It shows a higher agreement of the test participants
on the evaluation of the binocular depth (0.084), followed by the evaluation of visual comfort (0.081) and the evaluation
of quality (0.08). Although all kappa values are low, the lower agreement for the quality scale may
be explained by the fact that the task may have been hard for the test participants: the videos were presented
at the highest quality possible, and therefore the concept of quality under evaluation may have been unclear, resulting in
more variation between observers.
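As an illustration of the agreement measure used above, the following sketch computes Fleiss' kappa from a table of category counts. The function names and the data layout are assumptions made for this example; the original analysis tool is not specified in the text.

```python
def ratings_to_counts(ratings, categories):
    """ratings[item] is the list of categories given by all raters to that item.

    Returns counts[item][k]: number of raters who chose categories[k].
    """
    return [[item.count(c) for c in categories] for item in ratings]

def fleiss_kappa(counts):
    """Fleiss' kappa for counts[item][category]; every row must sum to the same
    number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement: mean proportion of agreeing rater pairs per item.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[k] for row in counts) / (n_items * n_raters)
        for k in range(n_categories)
    ]
    p_chance = sum(s * s for s in shares)
    return (p_observed - p_chance) / (1 - p_chance)
```

Values close to 0, such as those reported above, mean that the observers agree on the exact category hardly more often than expected by chance, even if their ratings may still be ordered similarly.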
Sequence     Description
Alignment    NAT, skydivers building a formation together, low texture
BalloonDrop  NAT, balloon of water hit by a dart, closeup
Bike         NAT, cyclists, slow motion, lots of linear perspective
BloomSnail   NAT, closeup on flowers and snail, high depth effect
Building     CG, circular movement around towers
CarEngine    CG, car engine, many moving objects, high disparities
CarMesh      CG, car mesh rotating, low spatial complexity
CarNight     NAT, dark, many scene cuts (5), fire blast popping out
CarPresent   CG, circular movement around car
CarRace1     NAT, race, rain, fast motion, several scene cuts (7 in 10s)
CarRace2     NAT, race car, fast motion, several scene cuts (7 in 10s)
CarRace3     NAT, race car, dust slowly flying towards the camera
Castle       NAT, highly textured, temporal depth effect changes
CristalCell  CG, many particles, different objects in depth
FarClose     NAT, skydivers, complex motion, increasing depth effect
FightSkull   CG, fast motion, low spatial complexity, high depth effect
FightText    CG, slow motion, objects popping out
Figure1      NAT, skydivers, complex and circular motion, closeup
Figure2      NAT, skydivers, complex motion, closeup, persons in circle
Figure3      NAT, skydivers, complex motion, closeup group persons
Fireworks    NAT, dark, lots of particles, good depth effect
FlowerBloom  NAT, closeup on flowers, high depth effect
FlowerDrop   NAT, closeup on flowers and raindrop
Grapefruit   NAT, trees, highly textured, pan motion, high depth effect
Helico1      NAT, low texture, circular motion, low depth effect
Helico2      NAT, medium texture, circular motion, low depth effect
HeliText     NAT, medium textured, text popping out of the screen
Hiker        NAT, highly textured, person walking in depth
Hiker2       NAT, highly textured, slow motion, closeup on persons
InsideBoat   CG, indoor, walk through the interior of a ship cabin
IntoGroup    NAT, pan motion, colorful, lots of objects in depth
Juggler      NAT, high spatial complexity, closeup on juggler
JumpPlane    NAT, skydivers, fast motion in depth (far from camera)
JumpPlane2   NAT, skydivers, fast motion in depth
LampFlower   NAT, light bulb blowing up, flower blooming, closeup
Landing      NAT, fast motion, high texture, depth effect increasing
Landscape1   NAT, depth effect limited to one region of the image
Landscape2   NAT, depth effect limited to one region of the image
MapCaptain   CG, captain, map, slow motion, low spatial complexity
NightBoat    NAT, dark, low texture, camera moving around boat
Paddock      NAT, race setup, high spatial complexity, lots of objects
PauseRock    NAT, bright, closeup on persons sitting
PedesStreet  NAT, street, linear perspective, lots of motion in depth
PlantGrass   NAT, closeup on plant growing, grasshopper
River        NAT, slow motion, medium texture, boats moving
SkyLand      NAT, skydivers, high texture, person moving closer, closeup
SpiderBee    NAT, slow motion, closeup on spider eating a bee
SpiderFly    NAT, closeup on fly, spider and caterpillar
SpinCar      CG, car spinning, half of the car in front of screen
StartGrid    NAT, separate windows showing different race scenarios
StatueBush   NAT, closeup on statue with moving flag
StreamCar1   NAT, high spatial complexity, car moving in depth
StreamCar2   NAT, high spatial complexity, closeup on a car
StrTrain1    NAT, train coming in, motion in depth, high textures
StrTrain2    NAT, train coming in, motion in depth, many objects
SwordFight   NAT, sword fight, movement limited to one area of image
Terrace      NAT, persons chatting, camera moving backward
TextPodium   NAT, rain, fast motion, champagne and text popping out
TrainBoat    NAT, train and boat, fast motion in depth, medium texture
Violonist    NAT, closeup on violinist and her instrument
WalkerNat    NAT, persons walking between trees
Waterfall    NAT, closeup on water falling, highly textured
WineCellar   NAT, low spatial complexity, indoor, closeup on persons
WineFire     NAT, closeup on a glass and fire, complex motion
Table 4.2: Description of the source sequences. CG: Computer generated, NAT: Natural scene
4.4.2.2 Correlation between scales
The results show a high correlation between the different scales (Figure 4.10). The three scales are closely related:
a Pearson correlation of 0.74 is observed between QoE and depth, 0.97 between QoE and visual comfort, and 0.71
between depth and visual comfort. The very high correlation between QoE and visual comfort could be explained as
follows:
• It is worth pointing out that the videos do not contain coding artifacts, so it is likely that people rated the QoE
of the sequences according to the source of disturbance they perceived: the visual discomfort. Indeed, in the presence
of high disparity values, as may happen for sequences with a lot of depth, it may become more difficult for the
observers to fuse the stereoscopic views [120] [121]. This results in seeing duplicate image portions in distinct areas
of the videos and is likely to be transferred to the quality rating.
• An alternative explanation is that observers did not really understand the visual discomfort scale. This aspect
has been addressed previously in Section 3.1.2.2 and by Engelke [104]. It has been observed that different classes
of observers exist that differ in their understanding and thus use of the comfort scale. In this study, it is possible that
observers decided to use the comfort scale based on their QoE ratings.
Figure 4.9: Spearman correlation between test participants
It may be observed that there is a high variance between the source sequences in the degradation-free case considered
here. The observed difference may be due to the content properties related to shooting and display conditions.
The lower correlation between depth and visual discomfort shows that there is no straightforward link between binoc-
ular depth and visual comfort, although both of them depend on retinal disparities.
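The correlations between scales reported in this subsection can be reproduced along the following lines: a mean opinion score is computed per sequence and per scale, and the Pearson correlation is then taken between two such score vectors. This is a minimal sketch; the function names and the `ratings[observer][sequence]` layout are assumptions made for the example.

```python
import math

def mean_opinion_scores(ratings):
    """ratings[observer][sequence] -> one mean score per sequence."""
    n_obs = len(ratings)
    n_seq = len(ratings[0])
    return [sum(r[j] for r in ratings) / n_obs for j in range(n_seq)]

def pearson(x, y):
    """Pearson correlation between two equally long score vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)
```

For example, `pearson(mean_opinion_scores(qoe), mean_opinion_scores(comfort))` would give the QoE/comfort value, where `qoe` and `comfort` are hypothetical rating tables kept after observer screening.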
4.4.3 Conclusion
The first conclusion of this study is that binocular depth is not necessarily easy for test participants to evaluate. Results
have shown that asking participants to evaluate depth results in a lot of variation between participants, and therefore
the evaluation on a categorical scale may not be the easiest approach for participants to evaluate difficult concepts such
as depth or visual comfort.
This result was also confirmed by Engelke [104], who found that observers use the scales in different manners,
which shows that they can have different interpretations of the evaluation concepts under study.
Finally, the last result of this study is a high correlation between the scales. The QoE scores are found to be highly
related to the comfort scale. This may be due to the fact that no degradations were applied to the videos, so that discomfort
appeared to be the main source of decrease of QoE. The correlation between visual comfort and depth was found to be much
lower, showing that these two factors are related but that other factors are involved.
Listing 4.3: Conclusion on the research questions
1. & 2. The evaluation of 3D factors such as depth is challenging: asking test
participants to rate it on a scale results in variation amongst participants.
3. The results have shown a strong correlation between the different scales under
evaluation. QoE and visual comfort have shown a Pearson correlation as high as
0.97, depth and visual comfort were less correlated with a Pearson correlation of
0.71, and QoE and depth show a Pearson correlation of 0.74.
Figure 4.10: Scatterplots with regression lines showing the relation between the different evaluated scales
4.5 Evaluation of monocular depth cues
Previously, the overall perceived depth was evaluated. This section addresses the evaluation of the different depth
cues which contribute to the overall depth perception. There are different goals to this task: the characterization of the
properties of 3D videos taking into account perceptual factors, the study of how different depth cues contribute to
the overall depth perception, and the relation between the different depth cues. This first study addresses
the extent to which Absolute Category Rating (ACR) makes it possible to evaluate monocular depth cues in natural images. The
research questions of this first study are:
Listing 4.4: Research questions
1. How image selection can be performed in order to study the relationship between depth
cues and perceived depth in natural images.
2. Study the relationship between depth cues.
3. Study how depth cues relate with the overall depth perception.
4.5.1 Image selection
To perform the image selection process, 409 images with a large variety of content were evaluated by two expert
observers on the 7 different scales described in Section 4.3, using the five-grade category scales depicted in Figures
4.2-4.8, together with an evaluation of the binocular depth quantity on a five-grade scale. The images were taken from
different open source image and video databases [122, 123, 124, 125, 126, 127, 114, 128], from newly
shot video sequences recorded with a Panasonic AG-3DA1E twin-lens camera, and from new images shot with a Fujifilm FinePix
Real 3D. After the evaluation on the different scales, it was decided to select four depth cues to be studied
as independently as possible: the linear perspective, the relative size, the texture gradient and the defocus blur. For
each of these four depth cues, the images were selected such that the scores for the six other monocular depth cues were
as small as possible and the values of the monocular and binocular depth cues were uniformly distributed from 1 to
5. This results in a matrix of five by five images as shown in Figure 4.11. Such a matrix is defined for each of the four
previously mentioned depth cues, which results in selecting 100 different images. Considering that the pre-test phase
used for the selection of images was only carried out with two expert observers, it was decided to add a repetition of each
combination of monocular and binocular depth cue to increase the robustness of the image selection process and the
likelihood of getting the expected combination of monocular and binocular depth cues. This finally results in 200 images.
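The selection rule described above can be sketched as follows. The image records (an `id`, one integer expert score per monocular cue under `mono`, and a binocular score under `bino`) are a hypothetical data layout chosen for this illustration, and the way the two expert ratings were merged into one score is not specified in the text.

```python
CUES = ("LP", "RS", "TG", "I", "LS", "AP", "DB")  # the seven monocular cues
LEVELS = range(1, 6)                               # five-grade scales, 1 to 5

def select_grid(images, target, repeats=2):
    """Fill the 5x5 (monocular level, binocular level) matrix for one cue.

    For every cell, the candidates are the images whose target cue and
    binocular depth match the cell; the ones with the lowest summed score
    on the six other cues are kept.
    """
    grid = {}
    for mono_level in LEVELS:
        for bino_level in LEVELS:
            candidates = [
                img for img in images
                if img["mono"][target] == mono_level and img["bino"] == bino_level
            ]
            candidates.sort(
                key=lambda img: sum(img["mono"][c] for c in CUES if c != target)
            )
            grid[(mono_level, bino_level)] = [img["id"] for img in candidates[:repeats]]
    return grid

# 5 monocular levels x 5 binocular levels x 4 studied cues x 2 repetitions
n_selected = len(LEVELS) * len(LEVELS) * 4 * 2
```

With two images per cell and four studied cues, `n_selected` gives the 200 images mentioned in the text. In practice some cells may have fewer candidates than requested, which this sketch simply reports as a shorter list.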
Figure 4.11: Example of matrix defocus blur / binocular depth
4.5.2 Evaluation of binocular depth
The evaluation of the binocular depth quantity was performed in a first subjective experiment. The 200 still images
were used at the highest quality available and did not have any visible coding artefact. Multiple open image and video
databases, and newly recorded content from a video camera or compact cameras were used. The images have different
formats: the images coming from the camera with an aspect ratio of 4:3 were downscaled from a resolution of
3648x2736 to 1440x1080 and inserted in a uniform gray frame of 1920x1080. The images coming from the
3DIQA database [122, 123] were slightly smaller than 1920x1080 and were centered in a uniform gray frame
of 1920x1080. The other images, natively at a resolution of 1920x1080, were kept in their original format. The
images were presented on a stereoscopic 3D display: a Samsung UE46F6500, 46” smart TV with active glasses and a
native resolution of 1920x1080. The viewing distance was set to 3H, and the test lab environment was according to the
ITU-R BT.500-12 recommendation [98]. Twenty observers attended the experiment; their vision was checked, and it
was assured that they passed the color blindness test (Ishihara test) and the depth perception test (Randot stereo test).
The observers were trained using five different images with different amounts of depth. During the training
phase the observers had the opportunity to ask questions. After the training had finished, the observers were asked to
rate the 200 images on the amount of perceived depth. The methodology used was Absolute Category Rating (ACR).
Perceived depth was rated on a discrete eleven-grade scale from 0 to 10, with the labels “very high”, “high”, “medium”,
“low” and “very low” perceived depth at the positions 9, 7, 5, 3 and 1 on the scale, respectively.
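The image formatting described at the beginning of this subsection (downscaling the 4:3 camera images and centering smaller images in a 1920x1080 gray frame) amounts to the following arithmetic. This is a sketch under the assumption that images are never upscaled; the function name is hypothetical.

```python
def fit_in_frame(src_w, src_h, frame_w=1920, frame_h=1080):
    """Return (new_w, new_h, offset_x, offset_y) for an image centered in a frame.

    The image is scaled down, keeping its aspect ratio, only if it does not fit;
    smaller images keep their size and are simply centered.
    """
    scale = min(frame_w / src_w, frame_h / src_h, 1.0)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    offset_x = (frame_w - new_w) // 2
    offset_y = (frame_h - new_h) // 2
    return new_w, new_h, offset_x, offset_y
```

For a 3648x2736 camera image, this gives a 1440x1080 picture with gray borders of 240 pixels on the left and right, which matches the resolution stated in the text.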
4.5.3 Evaluation of monocular depth
The evaluation of the monocular depth cues was performed using the Absolute Category Rating method on
five-point scales, one for each of the seven individual depth cues (linear perspective, texture gradient,
interposition, relative size, light and shade, aerial perspective, defocus blur). The test participants were given the
instructions described in Section 4.3; these include the text describing the depth cues, the pictograms showing the
different amounts of depth cues and the example images. The 200 images had the same HD resolution as described
in Subsection 4.5.2 and were displayed on a 9.6-inch iPad 4 with a native resolution of 2048x1536. The images
were presented at the top of the interface, as large as possible, and the different scales were represented as
pictograms below the image under evaluation. Figure 4.12 shows the test interface of the application. Test participants
evaluated each individual depth cue by selecting the pictograms. The application then switched automatically to the
next scale. Test participants were able to edit a previous rating. Once all seven scales were evaluated, a button
appeared allowing the test participant to switch to the next image. For this test, no time constraint was given. The
instructions were printed, allowing the test participants to refer to them whenever they wanted. Eight experts
in video or audio quality assessment participated in the test. The devices were given to the test participants and the
test was performed in a non-controlled environment. On average, the test required 3 to 4 hours and was completed in
several sessions within a week at the convenience of the test participants.
Figure 4.12: Subjective test interface
Figure 4.13: Histogram of depth scores from the 3D viewing session
4.5.4 Result
4.5.4.1 Analysis of vote distribution
First, regarding the scores from the 3D viewing session for the 200 images, a histogram built by categorizing the mean
score over the observers per image can be found in Figure 4.13. The purpose of this histogram is to give a view
of the distribution of the scores. There are 11 bins in this histogram, corresponding to the 11 categories
available to the observers. 50% of the images were rated with a score greater than or equal to 5. A Jarque-Bera test shows that
the distribution of the subjective scores is not normal at the 95% confidence level. Similarly, the histogram of the monocular
depth cue scores is depicted in Figure 4.14. The selection of the images was done such that the scores for the
monocular depth cues linear perspective, relative size, texture gradient and defocus blur would cover the entire
range. The minimum score for each depth cue was not frequently used, and the average scores for each depth cue span from
2 to 5. For a particular depth cue, it was expected by design that 160 images would only show a small amount of this
particular depth cue, corresponding to the first bin, and that 10 images each would be voted as showing a slight, medium,
advanced and strong depth cue, filling the bins 2, 3, 4 and 5 with 10 samples each. This would have been one of the
conditions for a good separation between the depth cues: one depth cue which changes value in a controlled manner
from 1 to 5 while all the other depth cues are kept at a minimum value. This kind of result has been achieved for
the defocus blur. The aerial perspective depth cue also shows a similar pattern. The texture gradient follows this rule
to a lesser extent. Regarding the linear perspective and the relative size, the distribution is more uniform. This shows
that the image selection did not succeed in decorrelating the increase of other depth cues from the increase of relative
size or linear perspective. For example, the selection of images with an increasing amount of texture gradient may have
resulted in images having a higher amount of relative size. Figure 4.15 depicts the Spearman correlation between the
monocular depth cue scores. The correlation values are low, and indicate that the depth cue scores are only weakly
related to each other. The correlation values however support the discussion about the unexpected distribution
of the linear perspective and relative size scores, which have a higher correlation with each other. This suggests that these
two cues, together with the texture gradient and interposition, may have shared some images, even though the correlation
values are too low to be conclusive.
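The normality check mentioned in this subsection can be illustrated with a minimal sketch of the Jarque-Bera statistic, which combines the sample skewness S and kurtosis K as JB = n/6 (S² + (K-3)²/4). The function name is hypothetical, and the p-value uses the asymptotic chi-square distribution with two degrees of freedom, which is only an approximation for small samples.

```python
import math

def jarque_bera(values):
    """Return (JB statistic, asymptotic p-value) for a list of scores."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    jb = n / 6.0 * (skewness ** 2 + (kurtosis - 3.0) ** 2 / 4.0)
    # Survival function of a chi-square variable with 2 degrees of freedom.
    p_value = math.exp(-jb / 2.0)
    return jb, p_value
```

Normality is rejected at the 95% confidence level when the returned p-value is below 0.05.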
4.5.4.2 Relation between monocular depth cues and overall depth scores
One of the objectives of the study is to evaluate whether, when natural images are used, it is possible to show
the effect of monocular depth cues on the overall depth scores. One strong limitation of the study is due to the
use of natural images, which results in the absence of ground truth for the binocular depth cue. Indeed, the subjective
data coming from the test in stereo mode described in Subsection 4.5.2 provide depth scores
resulting from the combination of monocular and binocular depth cues. The subjective test
described in Subsection 4.5.3 only provides monocular depth cue scores. The binocular depth cues themselves could
not be controlled as is usually done in psychophysics studies, since natural image content was used. It is then only
possible to use statistics about depth maps and content characteristics, as described in previous studies [113], to retrieve
information about the binocular depth cues.
To analyse the contribution of the monocular depth cue, for each depth cue, two categories are created: one with a low
value of the particular depth cue and one with a high value of the depth cue. Let $\Omega$ be the set of all images.
$DC_{c,\mathrm{high}}$ is the set of images such that the depth cue $c$ is high, and $\forall I \in \Omega$, $DC_c(I)$ is the
value of the depth cue $c$ for the image $I$.
\[ DC_{c,\mathrm{high}} = \{ I \in \Omega \mid DC_c(I) > 3 \} \qquad (4.1) \]
\[ DC_{c,\mathrm{low}} = \{ I \in \Omega \mid DC_c(I) \leq 3 \} \qquad (4.2) \]
To study the effect of a depth cue $c$ on the overall depth, the differences between the overall depth scores of the sets of
images $IS_1(c)$ and $IS_2$ are studied. $IS_1(c)$ is the set of images where the depth cue $c$ is high and all the other depth
cues are low. $IS_2$ is the set of images where all the depth cues are low.
Figure 4.14: Histogram of monocular depth cue scores for the different depth cues. This represents the number of
images which have a specific depth quantity for each depth cue.
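\[ IS_1(c) = DC_{c,\mathrm{high}} \cap \Big\{ \bigcap_{i \in \{LP, RS, TG, I, LS, AP, DB\} \setminus \{c\}} DC_{i,\mathrm{low}} \Big\} \qquad (4.3) \]
\[ IS_2 = \bigcap_{i \in \{LP, RS, TG, I, LS, AP, DB\}} DC_{i,\mathrm{low}} \qquad (4.4) \]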
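The set definitions (4.1) to (4.4) translate directly into set operations. The sketch below assumes a hypothetical dictionary `scores` mapping each image identifier to its score per depth cue; the function names are chosen for this illustration only.

```python
CUES = ("LP", "RS", "TG", "I", "LS", "AP", "DB")

def high_set(scores, cue, threshold=3):
    """Images whose score for `cue` is strictly above the threshold (Eq. 4.1)."""
    return {img for img, s in scores.items() if s[cue] > threshold}

def low_set(scores, cue, threshold=3):
    """Images whose score for `cue` is at most the threshold (Eq. 4.2)."""
    return {img for img, s in scores.items() if s[cue] <= threshold}

def is1(scores, cue):
    """Images where `cue` is high and every other cue is low (Eq. 4.3)."""
    result = high_set(scores, cue)
    for other in CUES:
        if other != cue:
            result &= low_set(scores, other)
    return result

def is2(scores):
    """Images where all cues are low (Eq. 4.4)."""
    result = set(scores)
    for cue in CUES:
        result &= low_set(scores, cue)
    return result
```

Comparing the overall depth scores of `is1(scores, c)` against those of `is2(scores)` then isolates, as far as the data allow, the contribution of the cue `c`.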
Unfortunately, an ANOVA cannot be performed for each depth cue c in order to compare the depth scores between the
two sets of images IS1(c) and IS2, because the residuals of the linear model using only one factor do not fulfill the
normality requirement. As shown in Figure 4.15, the correlation between the scales is low. A PCA applied to the data
confirms this result and shows that the explained variance increases linearly with the number of principal components. It is
therefore difficult to decrease the dimensionality. A linear model using all the different variables with no interaction term is
then suggested: as discussed in the state-of-the-art chapter, weak fusion models are a popular approach for depth cue fusion.
A Jarque-Bera test indicates that the residual error of such a model is compatible with a normal distribution. Table 4.3 lists
the coefficients of the model. It is then possible to apply an N-way ANOVA to explain the overall depth scores as a function
of the monocular depth cue scores. Only the interposition (F=20.75, p<0.01) and the defocus blur (F=3.93, p=0.049), which is
on the borderline, were found to have a significant effect. They are followed by the texture gradient (F=2.24, p=0.13)
and “light and shade” (F=1.6, p=0.20), which were not significant at the 95% confidence level.
Figure 4.15: Spearman correlation between monocular depth cue scales; LP: linear perspective, RS: relative size,
TG: texture gradient, I: interposition, LS: light and shade, AP: aerial perspective, DB: defocus blur
      LP   RS   TG    I   LS   AP   DB
LP   100   42   26    5    7  -13  -17
RS    42  100   43   54    1   27  -18
TG    26   48  100   22    2   24  -12
I      5   54   22  100    0   24   -1
LS     7    1    2    0  100   -8   11
AP   -13   27   24   24   -8  100   10
DB   -17  -18  -12   -1   11   10  100
Figure 4.16: Spearman correlation between monocular depth cues; LP: linear perspective, RS: relative size,
TG: texture gradient, I: interposition, LS: light and shade, AP: aerial perspective, DB: defocus blur
\[ \mathrm{model} = a \times LP + b \times RS + c \times TG + d \times I + e \times LS + f \times AP + g \times DB \]
              a (LP)  b (RS)  c (TG)  d (I)   e (LS)  f (AP)  g (DB)
Coefficient   2.00    0.44    0.97    2.86    1.41    1.48    1.21
F             0.03    0.67    2.24    20.75   1.6     0.05    3.93
p             0.85    0.41    0.13    <0.01   0.20    0.83    0.049
Table 4.3: Linear model between depth cues, and (F, p) values of the N-way ANOVA.
4.5.5 Limitations
As mentioned previously, one limitation of the study is its dependency on the binocular depth cues, which can hardly
be evaluated individually in natural images. During the design of the experiments, statistical analyses of the depth
map characteristics were performed [113] to obtain a high variety of stereoscopic properties in the content. This issue
is aggravated by the fact that the study is limited to one particular instantiation of the problem, even though
the design aimed to cover the different monocular depth cue scales as much as possible. These are limitations
which could not be avoided in the targeted challenge due to the choice of natural images.
4.5.6 Conclusion
The objective of this study was to reproduce studies performed in the field of psychophysics, but in the particular case of
the evaluation of monocular and binocular depth cues in natural images. Methodological questions have been addressed
to tackle this challenge. A definition of different scales for the evaluation of monocular depth cues in images was
proposed. Various analyses were performed to check the influence of monocular depth cues on the overall depth scores
and to see whether the methodology used could perform such a task. Statistical differences in overall depth ratings between
images showing low and high values of monocular depth cues could be seen for the particular case of the interposition
and defocus blur depth cues, but not for the other depth cues. The image database (Depth Cue 3D Images, DC3DImg)
including subjective scores has been made available. It can be used, for example, to study depth in 3D images, but can
also be used for the investigation of other aspects such as the effect of coding on depth perception, the acceptance of
3D, the relation between monocular and binocular depth cues and depth quality issues, visual comfort and any other
topics related to 3D quality of experience.
Listing 4.5: Conclusion on the research questions
1. The proposed image selection procedure made it possible to select images with the
expected properties. Subjective results were found consistent with the design of the
experiment.
2. The Spearman correlation between ratings for the different depth cues was low.
Relative size and interposition were found to be better correlated.
3. It was not possible to draw a weak fusion model from the data. A classification of
the scores into two categories, ‘‘low’’ and ‘‘high’’, only shows an effect of the
category on the overall depth ratings for the cues ‘‘defocus blur’’ and
‘‘interposition’’.
4.6 Alternative methodology: ranking
As described in the previous section, directly asking test participants to evaluate the different depth cues may not be the
most appropriate approach, since the task is particularly demanding and the scales are difficult to evaluate. In order to tackle
these issues, the previously described experiment involved expert observers in order to reach higher accuracy,
since they would show higher dedication to the task and are able to understand the complex notions to evaluate.
However, even for expert observers the task is not easy to perform. Therefore, an alternative methodology has been
proposed in order to improve the test participants' understanding of the task. The proposed method is based on
ranking. The research questions are the following:
Listing 4.6: Research questions
1. How does this methodology compare with traditional absolute category rating (ACR)?
2. Does seeing the result of another test participant give test participants a better
understanding of the task they have to perform?
3. Does reordering the result of another test participant improve the
consistency between test participants?
4.6.1 Description of the proposed methodology
4.6.1.1 Description of the task
The new methodology proposed to tackle the previously described issues is depicted in Figure 4.17. The proposed task
is divided into two temporally successive parts:
1) First, the ranking of the different images according to the feature under investigation. To help the test participants
to perform the ranking, different indications were provided. They were told to sort the images by performing pairwise
comparisons: they were instructed to look at the images already on the table and to compare each of them with the
picture they have in hand. If the picture they have in hand has a higher value in terms of the property than one of the
images on the table, then it should be located on its right side, and if not it should be located on the left side. All the
images were put next to each other as a continuous depth cue line from low depth cue to high depth cue. Another
indication was provided to the test participant: to perform the ranking, they were also told that they could first pre-sort
the images by grouping them into different stacks of images corresponding to a group of images they think to have
similar properties.
Figure 4.17: Ranking of printed images
Figure 4.18: Evaluation process sum-up
Figure 4.19: Sequential evaluation of the two image sets
2) Secondly, they had to add markers to group the pictures into four categories corresponding to the amount of
perceived depth cues.
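The two parts of the task can be sketched as follows: an insertion of each picture into the line by pairwise comparisons, followed by a split of the ordered line into groups using markers. This is only an illustration of the instructions given to participants; the comparison function `is_higher` stands for the participant's judgement and, like the function names, is an assumption of this example.

```python
def rank_by_pairwise_comparison(items, is_higher):
    """Build an ordered line from low to high values of the property.

    is_higher(a, b) returns True when picture `a` shows more of the property
    than picture `b`.
    """
    line = []
    for item in items:
        position = 0
        # Move right past every picture on the table that shows less of the property.
        while position < len(line) and is_higher(item, line[position]):
            position += 1
        line.insert(position, item)
    return line

def split_into_categories(line, markers):
    """Split an ordered line at the given marker positions.

    Three markers produce the four categories used in the test.
    """
    bounds = [0] + sorted(markers) + [len(line)]
    return [line[start:end] for start, end in zip(bounds, bounds[1:])]
```

Pre-sorting the pictures into stacks, as suggested to the participants, only changes the order in which `items` are taken in hand, not the final line, provided the judgements are consistent.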
To train the participants in the evaluation of the depth cues, the evaluation of each depth cue was divided into
three parts as depicted in Figure 4.18. The first part consists in letting the test participant read the written instructions,
which include a written description of the depth cue, five different pictograms which illustrate the depth cue, and six
different examples (see Subsection 4.6.2). After the instructions had been read, a set of 25 images was presented to the
test participant; this set of images had a specific order and was the result of the ordering and marking of another test
participant. The participant was informed about this, and was asked to (a) look at the provided order, which provides
him/her with additional examples of how to order the different images according to the considered depth cue, and (b) adjust
the order of the images according to what he/she thinks is the most appropriate order. It is only after these steps that
the test participant was asked to order a new set of 25 images, which was provided in a random order. Once this task
was completed, the participant could start the evaluation of the next depth cue.
Figure 4.18 summarizes the different parts of the task. It corresponds to one iteration of the loop in Figure 4.18 and
relates to the evaluation of one depth cue. The received picture order is, as described previously and depicted in Figure
4.19, the result obtained from another test participant. As there are two consecutive actions, reordering and ordering a set
of images, and each set has 25 different images, some participants only saw one set of images during the reordering task
and the other set in the ordering task. As the result of the ordering of one set of images was provided to the next
participant, only half of the participants saw one set in the reordering task, and the other half of the participants saw this
same set in the ordering task. Therefore, later in the analysis, two groups of participants will be considered depending
on which set of images they had to reorder and order.
No time constraint was given to the test participants. On average, each participant needed 17 min to order the
50 images for one scale.
The motivation behind the idea of ranking is to explicitly ask the test participants to compare the images to each other
rather than choosing a quantitative score, which can be a difficult task. A second advantage of the ranking approach is
that it always shows the entire range of image properties to the test participants. This can help them to define the order of
two images, since they can see, at the same time, examples of extreme values of the considered property. The motivation
behind grouping the images is to provide a way for test participants to report difficulties in ranking
images which they found to have too similar properties: they can report this by
grouping such images into the same category. This categorization is different from traditional category rating since test
participants have to carry out this task on a set of ordered images and only need to provide separations between groups of
pictures. Moreover, they can see the entire range of the property when making this decision.
4.6.1.2 Hypothesis and groups of test participants
Different questions on the methodology have been addressed and correspond to different groups of test participants.
In total, 23 observers participated in the test.
To study the overall performance in terms of confidence intervals as compared to absolute category rating (ACR), all
23 test participants can be used. However, results can only be based on the data of the ranking and categorization of
the images provided in a random order.
To study whether the reordering task helps test participants to understand the scale under evaluation,
the test participants were split into two groups: the first group received, as a first set of images to reorder, a set well
defined by the test organizer. The second group received a first set of images to be reordered which was previously
ordered by another test participant (as depicted in Figure 4.18). The first and second groups are composed of 16 and
7 test participants respectively. In the following, these two groups will be referred to as group 1 and group 2. The
differences between the two groups will be studied.
To study whether reordering pictures results in more consistent ranking than ordering pictures, only test participants
from group 2 can be used. This avoids the artificially higher consistency between test participants that results from
the fact that all participants from group 1 had to reorder the same image order.
Independently of who provided the image order to be reordered, 18 test participants had to order the first set of images,
5 had to reorder it. 19 test participants had to reorder the second set of images, 4 had to order it.
4.6.2 Description of the studied scale
The description of the monocular depth cue scale was provided exactly according to Section 4.3 on the scale “the
relative size”. This depth cue was chosen because it is difficult to assess and thus challenges the proposed
method. Moreover, this depth cue was addressed in the previous experiment, which makes it possible to compare the proposed
method to the traditional absolute category rating (ACR) methodology.
Figure 4.20: Relation between possible interpretations of the results.
Figure 4.21: Confidence interval and test participant number.
Figure 4.22: Confidence interval and test participant number.
4.6.3 Statistical analysis
4.6.3.1 Interpretation of the data
Quantitative scores
Considering the different tasks within the test, alternative analyses can be performed, namely of:
1) Average rank provided by the test participants
2) Average of the category number to which the picture belongs.
3) The order provided by the test participants can be used to set up a pairwise comparison matrix describing for each
position (i,j), the number of test participants who ranked an image i higher than an image j. Then, a model such as
Bradley-Terry can be applied to obtain continuous scores [110].
This last type of analysis 3) is highly constrained since the ranking ensures transitivity: if three images A, B,
and C are ranked A < B < C, it implies that A < C. However, in a pair comparison experiment, A < B and B < C do
not necessarily imply A < C. Therefore, in the construction of the pairwise matrix, different options can be considered.
It can be chosen to only use the comparisons between neighboring images (option 3a). Then, if A < B < C, only
the relations A < B and B < C are used to set up the matrix, since test participants were explicitly asked to compare
neighboring images in the instructions of the test. Alternatively, it can be considered that test participants are only sure
of how they ordered distant images (option 3b). In a ranking A < B < C, participants may not be sure if A < B or B < C
but are at least sure of A < C. The number of images in between these two images A and C is a parameter which needs
to be defined.
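The construction of the pairwise comparison matrix under the different options can be sketched as follows. This is a minimal illustration and not the implementation used in the test: the function name `pairwise_matrix` and the parameters `min_gap` and `max_gap` are hypothetical, and the distance between two images is assumed to be the difference of their positions in the ranking.

```python
from itertools import combinations

def pairwise_matrix(rankings, n_images, min_gap=1, max_gap=None):
    """Count, for each pair (i, j), how many participants ranked image i higher than image j.

    rankings: list of lists, each giving image indices from lowest to highest rank.
    Only pairs whose positions differ by a value in [min_gap, max_gap] are counted
    (max_gap=None means no upper limit). Both parameters are illustrative assumptions.
    """
    m = [[0] * n_images for _ in range(n_images)]
    for order in rankings:
        for lo, hi in combinations(range(len(order)), 2):
            gap = hi - lo
            if gap < min_gap or (max_gap is not None and gap > max_gap):
                continue
            m[order[hi]][order[lo]] += 1  # the image at position hi is ranked higher
    return m

# Three hypothetical participants ranking four images
rankings = [[0, 1, 2, 3], [0, 2, 1, 3], [1, 0, 2, 3]]
full = pairwise_matrix(rankings, 4)                  # all pairs, transitivity assumed
neighbors = pairwise_matrix(rankings, 4, max_gap=1)  # option 3a: neighboring images only
distant = pairwise_matrix(rankings, 4, min_gap=2)    # option 3b: distant images only
```

The resulting matrix can then be passed to a model such as Bradley-Terry; the choice of `min_gap` corresponds to the threshold parameter mentioned above.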
Comparison of the approaches
To compare the alternative approaches of interpreting the data, the Spearman correlation between the different scales
is computed. It shows a correlation of 0.91 (see Figure 4.20), 0.99 and 0.89 respectively between options 1 and 3a,
options 1 and 2, and options 2 and 3a. The high correlation between the different scales was expected since all
three proposed methods to analyze the data are highly related to each other: there is a clear relation between how
the test participants ranked the images, how pairs of images are ordered compared to each other, and the category
to which each picture belongs.
The approach for setting up the pairwise comparison matrix considering only pairs of images having a distance higher
than a specific threshold was also applied (option 3b). When a threshold of 10 images between two images is defined,
the analysis shows a Spearman correlation of 0.93 with options 1 and 2. When a threshold of 5 images is chosen, the
correlation is as high as 0.97 with options 1 and 2.
The different ways to interpret the data thus appear to be similar in terms of the resulting scores.
4.6.3.2 Inter-methodology analysis: ACR vs. Ranking
One hypothesis of the proposed approach is that ranking is an easier task for the test participants than rating the images
on an absolute category scale. To check this hypothesis, the confidence intervals of the
category ratings from the second task, in which the test participants had to order images provided to them
in a random order, can be compared to those of the ACR scores of the study previously described in Section 4.5.3. The respective values are
depicted in Figure 4.21. The confidence interval value for N test participants is computed as the average
of the confidence intervals over all possible selections of N test participants among the 23 test participants. This was
done to prevent the confidence interval values from depending too much on a particular selection of N test participants.
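The averaging over all possible selections of N test participants can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the confidence interval is assumed here to be the normal-approximation 95% interval of the mean (z = 1.96), which may differ from the exact formula used in the analysis.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

def ci_half_width(scores, z=1.96):
    """Half-width of a normal-approximation 95% confidence interval of the mean (assumed form)."""
    return z * stdev(scores) / sqrt(len(scores))

def mean_ci_for_n(scores_per_image, n):
    """Average CI half-width over all subsets of n participants (n >= 2) and over all images.

    scores_per_image[k][p] is the score given by participant p to image k.
    """
    n_participants = len(scores_per_image[0])
    widths = []
    for subset in combinations(range(n_participants), n):
        per_image = [ci_half_width([scores[p] for p in subset])
                     for scores in scores_per_image]
        widths.append(mean(per_image))
    return mean(widths)

# Hypothetical scores of 4 participants for 2 images
scores = [[1, 2, 3, 4], [2, 2, 2, 2]]
ci_n2 = mean_ci_for_n(scores, 2)
ci_n4 = mean_ci_for_n(scores, 4)
```

Note that the number of subsets grows quickly with the number of participants, so an exhaustive enumeration can become slow for intermediate values of N.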
As explained in Section 4.6.1.1, two sets of images were studied. On the first set of images, the ordering of a new
set of pictures which was not previously ordered by a test participant shows no improvement in terms of size of the
confidence interval compared to the ACR method. On the second set, a reduction of the size of the confidence
interval of 0.18 can be seen and is depicted in Figure 4.22. The confidence interval using the ACR methodology for the
second set of images appears to be larger than for the first set of images. It appears that this set was more difficult to
evaluate, and the ranking methodology may have simplified the task for the test participants.
4.6.3.3 Intra-methodology analysis
A first hypothesis H1 is that reordering the first set of images provided more stable results since the test participants
were able to check and correct mistakes made by the other test participants. A second hypothesis H2 is that a solution
from another test participant can help the test participants to better understand the scale under evaluation. To test
the latter hypothesis, the test participants were divided into two groups (see Subsection 4.6.1.2): The test participants
belonging to the first group had to reorder the second set of images. This explains the much smaller confidence interval
of the reordering task in Figure 4.22: all participants received the same order and image categorization.
Effect of choosing the set to reorder on user consistency
Here, the aim was to study whether providing a well-controlled image order to be reordered has a positive effect on
how test participants understand the target scale; the inter-correlation between the test participants from group 1 and
2 is compared regarding their ability to provide consistent results in ordering a new set of images. For each observer
of group 1 or 2, the Spearman correlation between his/her ordering and the ordering of the other test participants of
the same group was computed. The distribution of the inter-correlation values between test participants of group 1
and 2 is then compared using a Kruskal-Wallis one-way analysis of variance, showing a p-value of 0.068. Therefore,
no statistically significant difference (at 95% confidence) can be found between the agreements of the test participants during
the second task, regardless of whether they received a set of images to be reordered defined by us or by another test
participant during the first task. To quantify the differences of agreement between the two groups of test participants,
Fleiss' kappa [129] was computed and found to be 0.31 for group 1 and 0.11 for group 2.
Both values are low, but show a higher agreement for group 1, which is found to be “fair” whereas it
is “slight” for group 2. Even if these agreements are low, in both cases the test shows that the agreement between the
observers is not accidental at 95% confidence: (p < 0.01, z = 27) for the first group and (p = 0.01, z = 2.5) for the second
group.
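The inter-correlation between the test participants of one group can be sketched as follows. This is a minimal illustration under two assumptions that are not stated in the text: rankings contain no ties (so the closed-form Spearman formula applies), and the correlations of one observer with the others are summarized by their mean. The function names are hypothetical.

```python
from statistics import mean

def spearman_from_ranks(rank_a, rank_b):
    """Spearman correlation of two rankings without ties (rank_x[k] = rank of image k)."""
    n = len(rank_a)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d2 / (n * (n * n - 1))

def inter_correlations(group):
    """For each observer, mean Spearman correlation with all other observers of the group."""
    return [
        mean(spearman_from_ranks(r, other) for j, other in enumerate(group) if j != i)
        for i, r in enumerate(group)
    ]

# Three hypothetical observers ranking four images; the third one inverts the order
group = [[1, 2, 3, 4], [1, 2, 3, 4], [4, 3, 2, 1]]
values = inter_correlations(group)
```

The two distributions of such values, one per group, are the inputs of the Kruskal-Wallis comparison described above.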
Unfortunately, this does not enable a strong conclusion on the hypothesis H2 to be drawn. Moreover, it should be
mentioned that distinguishing only between an image order provided by another test participant and one provided by the author is not
enough to characterize the “quality” of the image order to be reordered.
These results show, however, the expected tendency.
User consistency between reordering and ordering tasks
To study whether reordering images leads to a higher agreement between test participants than ordering images,
statistical tests were carried out. The agreement between test participants of group 2, as defined
in the previous subsection, is compared to the agreement of these same test participants when doing the ordering of
a set of images provided in a random order. Similarly to the analysis in the previous subsection, the distribution of
the Spearman inter-correlation between the rankings of each test participant compared to the other test participants
when doing the reordering task is compared to the distribution of the Spearman correlation values between the test
participants when doing the ordering task. Due to the permutation between image sets received by each test participant
(one test participant reorders a set of images and orders the second set of images, the second set being provided to the
next test participant as a set to reorder, see Figure 4.18), three test participants were asked to reorder the first set of
images and order the second set, and four test participants were asked to do the opposite. To study the differences
of consistency between test participants during the ordering and reordering tasks, a Kruskal-Wallis one-way analysis
of variance is used to compare the distribution of the observer inter-correlation depending on the task to perform:
ordering or reordering. It shows a p-value of 0.44 and 0.13 respectively for the task of reordering image set 1 and
ordering image set 2, and reordering image set 2 and ordering image set 1. In both cases, no statistical differences at
95% confidence can be found between the agreement of the test participants, depending on whether they had to order
or reorder the images.
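For the two-sample case considered here, the Kruskal-Wallis test can be sketched with the standard library only. This is an illustrative sketch, not the tool used for the analysis: the function name is hypothetical, ties receive average ranks with the usual tie correction, and the p-value relies on the chi-square approximation with one degree of freedom, whose survival function equals erfc(sqrt(H/2)).

```python
from math import erfc, sqrt

def kruskal_wallis_two_groups(a, b):
    """Kruskal-Wallis H statistic and p-value for two independent samples (lists of numbers).

    Assumes the pooled values are not all identical (otherwise the tie correction is undefined).
    """
    pooled = sorted(a + b)
    n = len(pooled)
    rank_of = {}   # average rank of each distinct value, ranks starting at 1
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2
        t = j - i
        tie_term += t ** 3 - t
        i = j
    h = 12 / (n * (n + 1)) * sum(
        sum(rank_of[x] for x in g) ** 2 / len(g) for g in (a, b)
    ) - 3 * (n + 1)
    h /= 1 - tie_term / (n ** 3 - n)   # correction for ties
    h = max(h, 0.0)                    # guard against tiny negative rounding errors
    return h, erfc(sqrt(h / 2))

# Hypothetical inter-correlation values for the two tasks
h_stat, p_value = kruskal_wallis_two_groups([1, 2, 3], [4, 5, 6])
```

With samples as small as the ones available here (three and four participants), the chi-square approximation is coarse, which is one more reason to interpret the p-values with caution.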
To further analyze the differences between tasks (ordering or reordering) in terms of user consistency, Fleiss'
kappa was computed and the resulting values can be found in Table 4.4. In every case the agreement values between
test participants are low and are all classified as “slight” agreement. A small increase of agreement between the
test participants can however be seen. This slight increase of consistency is also visible in Figure 4.21 in terms of
confidence interval size.
Based on these results, the hypothesis H1 has to be rejected since no statistically significant differences in the agreements between
test participants after ordering or reordering can be observed. However, only a small number of test participants could
be included in the statistical analysis of this hypothesis. A small difference in the expected direction of increasing
the consistency between test participants is visible. Increasing the number of test participants may help to better
reveal the increase of agreement.
                   Reorder set 1   Reorder set 2
ordering task      0.091           0.091
reordering task    0.10            0.12
Table 4.4: Fleiss' kappa depending on task and image group
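As a reminder of how such agreement values are obtained, Fleiss' kappa can be sketched as follows. This is a generic illustration of the statistic: the function name is hypothetical, and the input is assumed to be a table counting, for each image, how many test participants assigned it to each category, which may differ from the exact data layout used to produce Table 4.4.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from counts[item][category] = number of raters choosing that category.

    Every item must be rated by the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    total = n_items * n_raters
    # proportion of all assignments falling into each category
    p_cat = [sum(row[c] for row in counts) / total for c in range(len(counts[0]))]
    # observed agreement for each item
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_obs = sum(p_item) / n_items          # mean observed agreement
    p_exp = sum(p * p for p in p_cat)      # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical examples with 3 raters, 2 items and 2 categories
kappa_perfect = fleiss_kappa([[3, 0], [0, 3]])
kappa_mixed = fleiss_kappa([[2, 1], [1, 2]])
```

A value of 1 corresponds to perfect agreement, while values close to 0 correspond to an agreement similar to what would be expected by chance.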
4.6.4 Limits
The proposed method was designed to make the task of evaluating the relative size depth cue easier for the test
participants. A clear limitation of this method is that it can mainly be applied to evaluate characteristics of still images. It
is believed that using printed images enabled the test participants to freely examine all the images and quickly change
from one image to another while always keeping a view of the overall set of images. Such an approach is rather impractical to
implement with videos.
A second limitation of the test is that the current method does not impose any time constraint on the test participants.
This does not allow the overall length of the test to be controlled. The time taken by each test participant was rather
constant, but this issue needs to be considered in future tests.
4.6.5 Discussion
On a side note, it can be observed in Figures 4.21 and 4.22 that image set 1 was used by most participants
(18/23) during the ordering task while image set 2 was used by most participants during the reordering task
(18/23). This is not balanced: there is not an equal number of participants who saw the first set of images during the
ordering and reordering tasks. However, even with the limited number of participants who used the first set of images
in the reordering task, it is possible to observe, as analyzed in the previous sections, an increase of consistency between
participant rankings. Therefore, whether the first or the second set of images was used during the reordering task does not
appear to have too strong an effect on the goal of increasing inter-participant agreement on image ordering.
4.6.6 Analysis per depth cue
As described previously in Section 4.6, test participants were requested, for each depth cue, to
reorder a set of images before being asked to order a new set of images. Figure 4.23 depicts different cases of reordering
between the provided rank and the rank returned by the test participant. It can be seen that some test participants only
slightly altered the provided order while other test participants made big changes. Figure 4.23.d depicts the particular case of
one test participant who completely altered the order of the pictures. In Figure 4.23.e, the order provided
by this test participant is reordered by another test participant, who also strongly disagrees with the provided order and
provides a new ranking which is in better agreement with the other test participants (Figure 4.23.f). Such an analysis may
be used to identify observers who rate differently. To determine how well two observers agree, it is proposed to count
the number of times images are ordered differently from each other, compared to a specific threshold (see Figure
4.23.f). Figure 4.24 depicts the average over the test participants of the differences between the rank they received and
the rank they returned after reordering, for different threshold values. An ANOVA shows that test participants made
significantly more changes for the linear perspective and relative size than for the interposition (F = 41.96, p < 0.01).
There are two potential explanations for the difference of agreement between the test participants for each depth cue:
1) the test participants have understood the scale well and converged to the final solution, resulting in a high agreement
between participants; 2) conversely, the test participants may have found the scale
too difficult to evaluate and chose not to change the provided image rank since they did not know how to order the
images. To study which hypothesis is the most likely, the agreement between the test participants was measured on
the second task of the experiment. In this second task, the test participants were asked to order images from a random
order. If the test participants agree on the scale, there should be only small differences between the orders provided by
the different participants. Figure 4.25 depicts the average difference between observers per depth cue scale
on the two sets of images. The evaluation of the two image sets differs in the first part of the evaluation: whether test
participants had to reorder a set of images defined by the author (image set 1), or by another test participant (image
set 2). In the case of the second image set, no differences can be observed between the scales. However, in the first image
set, both scales “Interposition” and “Relative Size” appear to provide a higher agreement between the test participants
than “Linear Perspective”. A Kruskal-Wallis one-way analysis of variance shows that the agreements for Interposition
and Relative Size are significantly different from the agreement on Linear Perspective (p-values are respectively equal
to 0.0378 and 0.0113). However, the agreements for Interposition and Relative Size are not statistically
different (p = 0.238). These results show that after reordering a set of images provided by the author, test participants
were more consistent on the Interposition scale than on the Linear Perspective scale. This result suggests
that test participants understood the Interposition scale better than the Linear Perspective scale, and explains the higher
agreement between participants in the first reordering task. This is visible in Figure 4.24. However, this may not be
the only factor involved, since 1) for the Relative Size scale, the agreement between the received order and the
result of the reordering is similar to the one observed for the Linear Perspective scale, and 2) in the ordering task, the
agreement between test participants is similar between Interposition and Relative Size. One possible hypothesis could
be a variation of difficulty between image sets as a new factor to take into account in the analysis.
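The threshold-based count of differences between a received order and a returned order can be sketched as follows. The exact definition used for Figure 4.24 is not detailed in the text; this sketch assumes that an image counts as "ordered differently" when its position changed by more than the threshold. The function name is hypothetical.

```python
def count_rank_changes(received, returned, threshold):
    """Number of images whose position moved by more than `threshold` places.

    received, returned: lists of image identifiers, from first to last position.
    """
    pos_received = {img: k for k, img in enumerate(received)}
    return sum(
        1 for k, img in enumerate(returned)
        if abs(k - pos_received[img]) > threshold
    )

# Hypothetical orders of five images
received = ["a", "b", "c", "d", "e"]
small_changes = ["b", "a", "c", "e", "d"]   # only neighboring images swapped
large_changes = ["e", "b", "c", "d", "a"]   # the two extreme images swapped
```

For instance, `count_rank_changes(received, small_changes, 1)` ignores swaps between neighboring images, whereas a threshold of 0 counts every displaced image.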
Figure 4.23: Example of relation between the provided ranking and the rank returned by a test participant (panels a to f).
Figure 4.24: Average differences between the received order and the order provided by the participants after
reordering, considering different error thresholds. The average is performed across all participants.
Figure 4.25: Average differences between the rankings of different test participants in the task of ordering from a random
order, on the two sets of images.
4.6.7 Conclusion
In this section, a new method based on rankings was presented to evaluate complex dimensions such as the relative
size depth cue in natural images. Results have shown that the proposed method provides results that are either as consistent
as those of the ACR methodology, or even more stable in the case of a difficult image set. The proposed methodology is
divided into two parts: first reordering a set of images from another test participant and then ordering a new set of
images. Such an approach has been shown to help the test participants to better understand the scale, and an improvement
of the consistency of the results has been observed. This improvement was, however, only close to being significant.
The difference of consistency between reordering and ordering was also considered: a small improvement was observed
but was not significant and will require more extensive tests to be proven.
While not all test design hypotheses could be statistically confirmed, the approach based on rankings
appeared to improve stability of results when a difficult feature has to be evaluated. It is proposed to be applied to the
study of monocular depth cues in natural images.
Listing 4.7: Conclusion on the research questions
1. Showing the result of another test participant as an example was
found to improve the agreement between test participants, but the improvement was only close to reaching
significance at 95% confidence.
2. Using ranking, it was possible to decrease the size of the confidence interval
compared to ACR when the same number of participants is considered.
3. An improvement of the agreement between test participants was observed when each
test participant is asked to adjust the work of another test participant. However,
the improvement was not significant.
4.7 Key contributions
This chapter addressed how depth can be evaluated. Moreover, it considered how natural images can be
characterized on different scales described by the different monocular depth cues. The main contributions are:
• First, it was shown that the evaluation of perceived depth is difficult, and that there is a high variance between test
participants.
• The second and main contribution of the chapter is the proposal of a method to perform the evaluation of monocular
depth cues on natural images. Most state-of-the-art methods have focused on specifically designed signals, but
have not considered natural images. The use of these kinds of images has brought new challenges: first, the need
to characterize them along different axes by providing definitions and instructions on how to quantify the different
monocular depth cue scales. Secondly, it requires the development of new subjective methods for the evaluation of these
depth cues. The newly proposed method was compared with the traditional absolute category rating approach, and
made it possible to reach a higher consistency between test participants.
• Finally, the image datasets and scores were distributed as an open-source database including newly created images
evaluated across different axes corresponding to different depth cues.
Chapter 5
Algorithms for depth evaluation
5.1 Introduction
The previous chapter has addressed the question of evaluating monocular and binocular depth cues in natural images.
In this chapter, several algorithms are presented which were designed for the prediction of 3D image properties
considering both monocular and binocular depth cues. First, binocular depth cues will be addressed, then several
monocular indicators will be described. Finally, considering that these algorithms are error-prone, the question of
indicator reliability will be addressed and different methods will be presented on how to determine cases of failure and
take this information into account in the process of depth cue pooling.
Figure 5.1: Different studied items
5.2 Instrumental characterization
The characterization of video characteristics has been addressed in the context of 2D video sequences. In this case, the
spatial and temporal complexity indicators SI and TI were defined in the ITU-T Recommendation P.910 [116]. These
indicators aim to provide information on the complexity of the image taken individually, and the temporal variation
of the video content. These indicators are defined as follows. Let $Y_n(i,j)$ be the value of the luminance of the pixel at
the position (i, j) in the frame n. A Sobel filter is applied to the frame to determine the gradient along the vertical and
horizontal directions (Eq. 5.1 and Eq. 5.2).
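\[
\begin{aligned}
Gv_n(i,j) ={}& -1\cdot Y_n(i-1,j-1) - 2\cdot Y_n(i-1,j) - 1\cdot Y_n(i-1,j+1) \\
&+ 0\cdot Y_n(i,j-1) + 0\cdot Y_n(i,j) + 0\cdot Y_n(i,j+1) \\
&+ 1\cdot Y_n(i+1,j-1) + 2\cdot Y_n(i+1,j) + 1\cdot Y_n(i+1,j+1)
\end{aligned}
\tag{5.1}
\]
\[
\begin{aligned}
Gh_n(i,j) ={}& -1\cdot Y_n(i-1,j-1) + 0\cdot Y_n(i-1,j) + 1\cdot Y_n(i-1,j+1) \\
&- 2\cdot Y_n(i,j-1) + 0\cdot Y_n(i,j) + 2\cdot Y_n(i,j+1) \\
&- 1\cdot Y_n(i+1,j-1) + 0\cdot Y_n(i+1,j) + 1\cdot Y_n(i+1,j+1)
\end{aligned}
\tag{5.2}
\]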
Then, the norm of the gradient is computed (Eq. 5.3):
\[ G_n(i,j) = \sqrt{Gv_n(i,j)^2 + Gh_n(i,j)^2} \tag{5.3} \]
The value of SI is defined as the maximum over time of the spatial standard deviation of the Sobel-filtered frames
(Eq. 5.4):
\[ SI = \max_{time}\left(\mathrm{StdDev}_{space}\left(G_n(i,j)\right)\right) \tag{5.4} \]
The value of TI is defined as the maximum over time of the spatial standard deviation of the difference of pixel luminance
values between two consecutive frames in a specific window (Eq. 5.5).
Let $TI_n(i,j) = Y_n(i,j) - Y_{n-1}(i,j)$, with $Y_n(i,j)$ the value of the luminance of the pixel at the position (i, j) in
the frame n:
\[ TI = \max_{time}\left(\mathrm{StdDev}_{space}\left(TI_n(i,j)\right)\right) \tag{5.5} \]
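The computation of the two indicators can be sketched as follows for small frames stored as nested lists of luminance values. This is a minimal illustration with the standard Sobel kernels and is not a reference implementation of the Recommendation: border pixels are simply skipped, the population standard deviation is assumed, and the function names are hypothetical.

```python
from math import sqrt
from statistics import pstdev

def sobel_magnitude(frame):
    """Gradient norm (Eq. 5.3) at every interior pixel of a 2D luminance frame."""
    out = []
    for i in range(1, len(frame) - 1):
        for j in range(1, len(frame[0]) - 1):
            gv = (-frame[i-1][j-1] - 2 * frame[i-1][j] - frame[i-1][j+1]
                  + frame[i+1][j-1] + 2 * frame[i+1][j] + frame[i+1][j+1])
            gh = (-frame[i-1][j-1] + frame[i-1][j+1]
                  - 2 * frame[i][j-1] + 2 * frame[i][j+1]
                  - frame[i+1][j-1] + frame[i+1][j+1])
            out.append(sqrt(gv * gv + gh * gh))
    return out

def spatial_information(frames):
    """SI: maximum over time of the spatial standard deviation of the Sobel-filtered frames."""
    return max(pstdev(sobel_magnitude(f)) for f in frames)

def temporal_information(frames):
    """TI: maximum over time of the spatial standard deviation of frame differences.

    Requires at least two frames.
    """
    return max(
        pstdev([c - p for row_c, row_p in zip(cur, prev) for c, p in zip(row_c, row_p)])
        for prev, cur in zip(frames, frames[1:])
    )

# Two hypothetical 4x4 frames: a uniform one and one with a bright right column
flat = [[0] * 4 for _ in range(4)]
edge = [[0, 0, 0, 10] for _ in range(4)]
si = spatial_information([flat, edge])
ti = temporal_information([flat, edge])
```

In this toy example the appearance of the bright column raises both SI and TI at the same time, which illustrates why the two indicators are not independent.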
The information about content properties can then be used to perform content selection, as described in ITU-T
Recommendation P.910. Figure 5.2 illustrates the selection process of 2D video sequences for a subjective test. Video
sequences must have different properties in order to avoid results that only hold for too narrow a set of videos. Such a characterization,
without removing the need for subjective inspection, makes it possible to identify sources with different spatial and temporal
complexities from a large database.
However, these measurements have their limits: considering these definitions, one can observe that if a video
sequence is highly textured then the value of SI will be high. But one can also see that the high amount of texture will
have a strong impact on the value of TI: if there is a high amount of texture, even little motion will create large
variations of luminance between consecutive frames, and thus high values of TI. This is illustrated in Figure 5.2,
where the SI and TI values appear correlated. Nevertheless, as long as their limits are known,
such algorithms can still be useful for image selection from large databases, or for image and video quality
prediction, to obtain a characterization of the 2D videos.
In the context of 3D video sequences, such indicators need to be defined for describing the “3D effect”. The depth
effect depends on different factors which are related to both the binocular and monocular vision. This chapter will
detail the work conducted on the characterization of 3D video properties through different depth cues: the monocular
and the binocular ones.
5.3 Background
Before describing the contributions made in the domain of content characterization and perceived depth estimation, it
is useful to describe the different kinds of algorithms on which the proposed algorithms rely. Different types of
features were required, such as depth maps from different kinds of depth cues, segmentation of images, vanishing line
extraction, etc.
Figure 5.2: Selection of source sequences based on spatial and temporal complexity. Each point corresponds to a
different video having a different spatial and temporal complexity. The videos depicted in this figure are the ones listed
in Table 4.2 in Section 4.4 and will be evaluated regarding their 3D properties in this chapter.
One of the key features required to estimate depth perception in stereoscopic images is the binocular depth map.
These depth maps can either be obtained during the capture process using depth cameras, or estimated in a
post-processing step. The second case, depth estimation, is a particularly challenging task but is frequently required since currently
most video recording is performed without depth cameras. In this section, methods for estimating dense depth maps
from stereoscopic images will be described.
5.3.1 Depth maps from stereoscopic videos
The estimation of depth from stereoscopic images is no easy task. It consists of estimating the correspondences
between pixels of the two stereoscopic images, and is similar to the problem of estimating dense optical flow
(motion). In this subsection, different methods will be presented to solve the problem of stereo correspondence.
5.3.1.1 The Horn & Schunck algorithm
Many of the current best-performing approaches are based on the work conducted by Horn & Schunck
in 1981 [130], who replaced the traditional block-based and feature-based matching algorithms by the minimization
of a general energy function:
\[ \arg\min_{u,v} E = \iint \left(|\nabla u|^2 + |\nabla v|^2\right) dx\,dy + \lambda \iint \rho(u,v)^2\,dx\,dy \tag{5.6} \]
These types of approaches became popular since they are highly parallelizable and thus suitable for GPU
programming. The equation is divided into two parts. The second part of the equation is called the “data term” and characterizes
how close the solution (u, v) for horizontal and vertical motion is to the ideal solution. However, considering
that the problem is ill-posed, it is not possible to find a solution considering only the data term. Assumptions about the
motion must be made; therefore a second part is added to the equation and is called the “regularization term”. This is
the first part of the energy function and states that the motion variations should be smooth and therefore that the motion
gradient must be as small as possible. A coefficient λ is introduced between the regularization and data terms to weigh
the relative importance of both parts of the equation. Estimating motion is then reduced to finding the motion vectors
(u, v) such that the difference between the two images is as small as possible while the variation
of the motion vectors is as smooth as possible.
The next step is to find a solution. The luminance at the position (x, y) in the frame t is denoted I(x, y, t).
The corresponding pixel in the second image is I(x+u, y+v, t+1). Using a Taylor expansion around 0 and assuming
that the motion is small:
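\[ I(x+u,y+v,t+1) \approx I(x,y,t) + \frac{\partial I}{\partial x}(x,y,t)\times u + \frac{\partial I}{\partial y}(x,y,t)\times v + \frac{\partial I}{\partial t}(x,y,t)\times dt \tag{5.7} \]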
With dt = 1. The data-term can then be rewritten:
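\[
\begin{aligned}
\iint \rho(u,v)^2\,dx\,dy &= \iint \left(I(x+u,y+v,t+1) - I(x,y,t)\right)^2 dx\,dy \\
&= \iint \left(\frac{\partial I}{\partial x}(x,y,t)\times u + \frac{\partial I}{\partial y}(x,y,t)\times v + \frac{\partial I}{\partial t}(x,y,t)\right)^2 dx\,dy
\end{aligned}
\tag{5.8}
\]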
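Using the calculus of variations, (u, v) is found to satisfy the following Euler-Lagrange equations:
\[ \frac{\partial I}{\partial x}(x,y,t)\times\left(\frac{\partial I}{\partial x}(x,y,t)\times u + \frac{\partial I}{\partial y}(x,y,t)\times v + \frac{\partial I}{\partial t}(x,y,t)\right) - \lambda\times\left(\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2}\right) = 0 \tag{5.9} \]
\[ \frac{\partial I}{\partial y}(x,y,t)\times\left(\frac{\partial I}{\partial x}(x,y,t)\times u + \frac{\partial I}{\partial y}(x,y,t)\times v + \frac{\partial I}{\partial t}(x,y,t)\right) - \lambda\times\left(\frac{\partial^2 v}{\partial x^2} + \frac{\partial^2 v}{\partial y^2}\right) = 0 \tag{5.10} \]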
However, in order to be able to apply the Taylor expansion, it was assumed that the motion is small and close to 0.
This is not realistic. It is then suggested that if an estimate of (u, v), denoted $(\hat{u},\hat{v})$, is known, it is possible to
determine a small adjustment of the motion vector, denoted (du, dv).
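\[ \hat{I}(x,y,t+1) = I(x+\hat{u},\,y+\hat{v},\,t+1) \tag{5.11} \]
\[ I(x+\hat{u}+du,\,y+\hat{v}+dv,\,t+1) \approx \frac{\partial \hat{I}}{\partial x}(x,y,t+1)\times du + \frac{\partial \hat{I}}{\partial y}(x,y,t+1)\times dv + \hat{I}(x,y,t+1) \tag{5.12} \]
Then the data term is equal to:
\[ \iint \rho(u,v)^2\,dx\,dy = \iint \left(\frac{\partial \hat{I}}{\partial x}(x,y,t+1)\times du + \frac{\partial \hat{I}}{\partial y}(x,y,t+1)\times dv + \hat{I}(x,y,t+1) - I(x,y,t)\right)^2 dx\,dy \tag{5.13} \]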
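Let $\hat{I}_x = \frac{\partial \hat{I}}{\partial x}$ (and similarly $\hat{I}_y$); the previous Euler-Lagrange equations become:
\[
\begin{aligned}
(\hat{I}_x\,du + \hat{I}_y\,dv + \hat{I} - I)\cdot\hat{I}_x - \lambda\,(du_{xx} + du_{yy}) &= 0 \\
(\hat{I}_x\,du + \hat{I}_y\,dv + \hat{I} - I)\cdot\hat{I}_y - \lambda\,(dv_{xx} + dv_{yy}) &= 0
\end{aligned}
\tag{5.14}
\]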
Using the finite difference method, the Laplace operator can be expressed as:
\[ (\Delta u)_{i,j} \approx (u_{i-1,j} - 2\cdot u_{i,j} + u_{i+1,j}) + (u_{i,j-1} - 2\cdot u_{i,j} + u_{i,j+1}) \tag{5.15} \]
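The Euler-Lagrange equations become:
\[
\begin{aligned}
(\hat{I}_x\,du + \hat{I}_y\,dv + \hat{I} - I)\cdot\hat{I}_x - \lambda\,(\Delta du)_{i,j} &= 0 \\
(\hat{I}_x\,du + \hat{I}_y\,dv + \hat{I} - I)\cdot\hat{I}_y - \lambda\,(\Delta dv)_{i,j} &= 0
\end{aligned}
\tag{5.16}
\]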
It is then possible to use the Jacobi method to define a sequence which will converge to an estimate of the adjustment
of the motion vector (du,dv).
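\[
\begin{aligned}
du^{n+1}_{i,j} &= du^{n}_{i,j} - \frac{\hat{I}_{x,i,j}\,(\hat{I}_{x,i,j}\,du^{n}_{i,j} + \hat{I}_{y,i,j}\,dv^{n}_{i,j} + \hat{I}_{i,j} - I_{i,j})}{\lambda + I^{2}_{x,i,j} + I^{2}_{y,i,j}} \\
dv^{n+1}_{i,j} &= dv^{n}_{i,j} - \frac{\hat{I}_{y,i,j}\,(\hat{I}_{x,i,j}\,du^{n}_{i,j} + \hat{I}_{y,i,j}\,dv^{n}_{i,j} + \hat{I}_{i,j} - I_{i,j})}{\lambda + I^{2}_{x,i,j} + I^{2}_{y,i,j}}
\end{aligned}
\tag{5.17}
\]
with the initial value of the adjustment vector $(du^{0}_{i,j}, dv^{0}_{i,j})$ set to zero.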
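The iterative estimation of the adjustment (du, dv) can be sketched as follows. This sketch uses the classic Horn & Schunck form of the Jacobi update, in which the previous value at a pixel is replaced by the mean of its four neighbors (the term coming from the discrete Laplacian); this is an assumption and differs slightly from the notation of Eq. 5.17. The function names, the border handling by replication and the default parameter values are also assumptions.

```python
def neighbor_mean(f, i, j):
    """Mean of the 4 neighbors of pixel (i, j), replicating the border values."""
    h, w = len(f), len(f[0])
    return (f[max(i - 1, 0)][j] + f[min(i + 1, h - 1)][j]
            + f[i][max(j - 1, 0)] + f[i][min(j + 1, w - 1)]) / 4

def horn_schunck_refine(ix, iy, it, lam=1.0, n_iter=100):
    """Jacobi iterations for the adjustment (du, dv), starting from zero.

    ix, iy: spatial derivatives of the warped image (2D lists).
    it: warped image minus reference image, i.e. the term (I-hat minus I).
    lam: weight balancing smoothness and data fidelity.
    """
    h, w = len(ix), len(ix[0])
    du = [[0.0] * w for _ in range(h)]
    dv = [[0.0] * w for _ in range(h)]
    for _ in range(n_iter):
        new_du = [[0.0] * w for _ in range(h)]
        new_dv = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                u_bar = neighbor_mean(du, i, j)
                v_bar = neighbor_mean(dv, i, j)
                # linearized residual, normalized as in the denominator of Eq. 5.17
                r = (ix[i][j] * u_bar + iy[i][j] * v_bar + it[i][j]) / (
                    lam + ix[i][j] ** 2 + iy[i][j] ** 2)
                new_du[i][j] = u_bar - ix[i][j] * r
                new_dv[i][j] = v_bar - iy[i][j] * r
        du, dv = new_du, new_dv
    return du, dv

# Hypothetical 3x3 example: horizontal luminance ramp displaced by half a pixel
ix = [[1.0] * 3 for _ in range(3)]
iy = [[0.0] * 3 for _ in range(3)]
it = [[-0.5] * 3 for _ in range(3)]
du, dv = horn_schunck_refine(ix, iy, it)
```

In this example the horizontal adjustment converges to 0.5 at every pixel, while the vertical adjustment stays at zero since the vertical gradient is null.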
However, the method still requires an estimate of the motion vector $(\hat{u},\hat{v})$: the proposed algorithm can
only determine the adjustment of this estimate. The initial motion vector $(\hat{u},\hat{v})$ cannot simply be
set to zero, since the hypothesis requires the adjustment vector to be close to 0, whereas the motion vectors themselves are generally
not null.
To solve this issue, as depicted in Figure 5.3, the motion estimation is done at multiple resolutions. At a very low
resolution, the motion is small. It is then possible to assume that an initial estimation of the motion vector at this very
low resolution is null: (u,v)≈(0,0). Then it becomes possible to estimate a refinement of the motion vector: (du, dv).
The refinement is then used as initial value of the motion at a higher resolution. A refinement is again computed and
will be used again as an initial value of motion for higher resolution motion estimation. This is done until the motion
estimation is made on the finest resolution.
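The multiple resolution strategy can be sketched as follows. This is an illustrative skeleton only: `refine` stands for any single-scale solver and is a hypothetical callback, the pyramid is built by simple 2x2 averaging (image dimensions are assumed to be divisible by 2 at each level), and the motion field is upsampled by pixel replication.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (even dimensions assumed)."""
    return [[(img[i][j] + img[i][j + 1] + img[i + 1][j] + img[i + 1][j + 1]) / 4
             for j in range(0, len(img[0]), 2)]
            for i in range(0, len(img), 2)]

def upsample_flow(flow):
    """Double the resolution of one motion component; displacements double as well."""
    out = []
    for row in flow:
        wide = [2 * v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def coarse_to_fine(img0, img1, levels, refine):
    """Estimate motion from the coarsest to the finest level.

    refine(img0, img1, u, v) -> (u, v) is a hypothetical single-scale solver
    that improves an initial motion estimate.
    """
    pyramid = [(img0, img1)]
    for _ in range(levels - 1):
        a, b = pyramid[-1]
        pyramid.append((downsample(a), downsample(b)))
    a, b = pyramid[-1]
    u = [[0.0] * len(a[0]) for _ in a]   # motion assumed null at the coarsest level
    v = [[0.0] * len(a[0]) for _ in a]
    for level, (a, b) in enumerate(reversed(pyramid)):
        if level > 0:
            u, v = upsample_flow(u), upsample_flow(v)
        u, v = refine(a, b, u, v)
    return u, v

# Dummy solver adding one pixel of horizontal motion at each level (for illustration)
def dummy_refine(a, b, u, v):
    return [[x + 1 for x in row] for row in u], v

image = [[float(i + j) for j in range(4)] for i in range(4)]
u, v = coarse_to_fine(image, image, 2, dummy_refine)
```

Doubling the displacement values when upsampling reflects the fact that a motion of one pixel at a coarse level corresponds to two pixels at the next finer level.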
5.3.1.2 Total variation - ℓ1
Improvements have been proposed to the Horn & Schunck algorithm. One of the issues comes from the regularization
term, which ensures local smoothness of the solution. However, local smoothness may not always apply, and abrupt
spatial variations of motion may occur in the image. To tackle this problem, the ℓ1-norm can be used
as an alternative to the ℓ2-norm proposed by Horn & Schunck, which allows larger differences:
\[ \arg\min_{u,v} E = \iint \left(|\nabla u| + |\nabla v|\right) dx\,dy + \lambda \iint |\rho(u,v)|\,dx\,dy \tag{5.18} \]
To solve this equation, an efficient approach is the use of Primal-Dual optimization algorithms. The problem is divided
into the following two subproblems:
•One minimization problem: decrease of the data term.
•One maximization problem: increase the smoothness of the solution.
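Let $p$ be the dual variable of the motion vector $(u,v)$. The convex conjugate of the total variation term, $|\nabla(u,v)|$, is the function to optimize:
$$F^{*}(p) = \sup_{(u,v)} \left\{ \langle \nabla(u,v), p \rangle - |\nabla(u,v)| \right\} \tag{5.19}$$
As demonstrated by Werlberger [131], the solution to equation 5.18 can be found after convergence of the following iterative scheme:
$$p^{n+1} = \mathrm{prox}_P\left(p^{n} + \sigma \nabla(\bar{u}^{n}, \bar{v}^{n})\right)$$
$$(u^{n+1}, v^{n+1}) = \mathrm{shrink}\left((u^{n}, v^{n}) - \tau\,\mathrm{div}\,p^{n+1}\right)$$
$$(\bar{u}^{n+1}, \bar{v}^{n+1}) = 2\,(u^{n+1}, v^{n+1}) - (u^{n}, v^{n}) \tag{5.20}$$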
Figure 5.3: Pyramidal approach for dense motion estimation.
Here, the operator prox is defined as the projection of a vector onto the unit ball, and shrink is a conditional function defined in Table 5.1.

Condition          Threshold check                                  Update
$\rho(u) > 0$      $\rho(\hat{u}) < -\tau\lambda\nabla I$           $u = \hat{u} + \tau\lambda\nabla I$
$\rho(u) < 0$      $\rho(\hat{u}) > \tau\lambda\nabla I$            $u = \hat{u} - \tau\lambda\nabla I$
$\rho(u) = 0$      $|\rho(\hat{u})| \le \tau\lambda\nabla I$        $u = \hat{u} - \rho(\hat{u})\,\nabla I / |\nabla I|^{2}$

Table 5.1: Definition of the shrink operator

As before, the value of $(u^{0}, v^{0})$ must be known to initialize the iteration. To obtain an estimate of it, a multi-resolution approach is again used.
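The prox step of Eq. 5.20, described above as a projection onto the unit ball, can be illustrated as follows. This is a minimal sketch assuming that the dual variable is stored as an array whose last axis holds the two components of each dual vector; the function name `project_unit_ball` is hypothetical and the storage layout is an assumption of this example.

```python
import numpy as np

def project_unit_ball(p):
    """Project each dual vector onto the unit ball (pointwise).

    p has shape (..., 2); the last axis holds the two components.
    Vectors with norm <= 1 are left unchanged, longer ones are rescaled
    to unit length.
    """
    norm = np.linalg.norm(p, axis=-1, keepdims=True)
    return p / np.maximum(1.0, norm)
```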
5.3.1.3 Further refinements
To allow further discontinuities around edges, the regularization term can also be defined as a function of the motion gradient. This decreases the influence of the regularization term at object borders [132].
$$R = \iint \phi(|\nabla(u,v)|)\,dx\,dy \tag{5.21}$$
In the case of anisotropic flow regularization, the cost function of the regularization term also considers the orientation of the edges: it decreases the influence of the regularization in the direction crossing the edge and increases it in the direction parallel to the edge [133].
$$R = \iint \mathrm{tr}\left(\phi\left(\nabla(u,v)\,\nabla(u,v)^{T}\right)\right) dx\,dy \tag{5.22}$$
In the particular case of depth estimation, it is also possible to use isotropic image-driven regularization: the cost function of the regularization term is weighted by the image gradient [134] and by information obtained from monocular depth cues, such as blur [135].
$$R = \iint G(I) \times \nabla(u,v)\,dx\,dy \tag{5.23}$$
In the context of this thesis, an anisotropic Huber-ℓ1 regularization was used to estimate the depth maps [136].
5.3.1.4 Application to depth estimation & discussion
The application of algorithms designed for motion estimation to depth estimation can be problematic. In stereo matching, the displacement of scene objects between the two views can be much larger than the displacement observed between consecutive frames in motion estimation. This results in large discontinuities in the optical flow, which contradict the smoothness constraint imposed by the regularization term. Figure 5.4 depicts an example of a depth map produced using such an approach. It is clearly visible that edges in the depth map do not show the abrupt transitions they should have and appear over-smoothed. This is due to the local continuity assumption, as previously explained.
Stereo matching is a difficult topic, and addressing it fully would have required a large amount of work beyond the scope of this thesis. Different methods were considered to estimate the depth map. The University of Middlebury provides an extensive benchmark of stereo-matching algorithms [137]. However, most of the algorithms described do not provide an implementation. Some implementations of stereo-matching algorithms can be found, but the ones tested during this research did not handle the full diversity of content addressed in the thesis. Therefore, even though the Huber-ℓ1 dense optical flow used in this work performs less well on the Middlebury database, it was selected to estimate dense depth maps since it was able to handle the large variety of content used in the thesis. Its limits have been clearly identified: the sharpness of the edges of the depth map is far from optimal and would not be good enough for depth-based image rendering, but the performance can be sufficient for the context of this work on quality and depth characterization, where distributions of disparity values are studied.
5.3.2 Depth estimation from monocular depth cues
In addition to stereo-matching, many approaches have been designed to estimate the depth of images using monocular
depth cues. Different methods to address this issue will be described in this section.
Figure 5.4: Estimation of depth map using a pair of images.
5.3.2.1 Defocus blur
As detailed in chapter 2.2.2, defocus blur can be used as a measure of depth. If the focal length of the lens and the distance between the point of focus and the lens are known, it is possible to evaluate absolute but unsigned values of the distance between objects and the focus point based on the amount of blur. This section addresses the estimation of dense blur maps and their conversion to depth.
One of the issues in blur estimation is the ability to estimate the defocus blur on non-textured areas. On edges, different methods have been proposed: blur can be evaluated by measuring the slope of the edge's gradient [138], the effect of the convolution by a Gaussian kernel on a picture [139], the distribution of the DCT coefficients [140], etc. However, these only provide localized measures of blur on edges. On a non-textured area, such as depicted in Figure 5.5, these metrics fail to differentiate between a blurred image and an image without edges. In most cases, this issue can be sidestepped because of the granularity of the expected output: a global indicator of blur rather than a blur measure at every location of the picture. In that case, the sharpness of the picture is usually summarized by the maximum sharpness measured in the picture. In the context of this study, however, a per-pixel measure of blur is needed in order to obtain a dense depth map from blur measurements.
Figure 5.5: Blurred picture, or not?
To obtain dense blur maps, the method proposed by Zhuo [141] was used. In order to get an estimate of the blur
at every location within the picture, the overall process of blur from defocus depth cue evaluation is divided into two
steps:
•Sparse blur estimation based on areas having edges.
•Interpolation of blur values between blur values measured on edges.
To estimate the blur on edges, the algorithm applies a Canny edge detector to determine the areas of the picture where blur estimation can be performed reliably. Figure 5.7 depicts an example image of a tree branch where only the edges of the tree branch can be used to evaluate the amount of blur. The background of the picture is uniform and cannot be used to reliably estimate the defocus blur. To measure the amount of blur on edges, the blurred picture can be modeled via equation 5.24, where $I_b$ is the blurred picture, $I_o$ is the non-blurred version of the picture, and $g_\sigma$ is the defocus Gaussian blur which needs to be estimated.
Figure 5.6: Original image to process
Figure 5.7: Edge detection
Figure 5.8: Sparse blur map
$$I_b = I_o * g_\sigma \tag{5.24}$$
To estimate the amount of blur, a Gaussian blur $g_{\sigma_0}$ is applied to the image (eq. 5.25), and the effect of re-blurring the picture is evaluated by computing the gradient at the edges identified previously. To simplify the notation, the following equations only address a one-dimensional picture, in which an edge is modeled as a step function $u(x)$ with amplitude $A$ and offset $B$, but the results can be extended to further dimensions. The gradient of the re-blurred edge is given by equation 5.28.
$$I_{rb} = I_o * g_\sigma * g_{\sigma_0} \tag{5.25}$$
$$\nabla I_{rb} = \nabla\left(I_o * g_\sigma * g_{\sigma_0}\right) \tag{5.26}$$
$$= \nabla\left((A\cdot u(x) + B) * g_\sigma(x) * g_{\sigma_0}(x)\right) \tag{5.27}$$
$$= \frac{A}{\sqrt{2\pi(\sigma^{2}+\sigma_0^{2})}} \exp\left(-\frac{x^{2}}{2(\sigma^{2}+\sigma_0^{2})}\right) \tag{5.28}$$
The ratio of the gradient magnitudes before and after filtering by the Gaussian blur $g_{\sigma_0}$ provides the relation described by eq. 5.29. The ratio is maximized at the edge location, $x = 0$, which simplifies the equation to eq. 5.30. Therefore, on the edges it is possible to determine the blur parameter as described in eq. 5.31. This provides the sparse blur map depicted in Figure 5.8.
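$$\frac{|\nabla I_b(x)|}{|\nabla I_{rb}(x)|} = \sqrt{\frac{\sigma^{2}+\sigma_0^{2}}{\sigma^{2}}}\; \exp\left(-\left(\frac{x^{2}}{2\sigma^{2}} - \frac{x^{2}}{2(\sigma^{2}+\sigma_0^{2})}\right)\right) \tag{5.29}$$
$$R = \frac{|\nabla I_b(0)|}{|\nabla I_{rb}(0)|} = \sqrt{\frac{\sigma^{2}+\sigma_0^{2}}{\sigma^{2}}} \tag{5.30}$$
$$\sigma = \frac{1}{\sqrt{R^{2}-1}}\,\sigma_0 \tag{5.31}$$
Figure 5.9: Gradient ratio used to measure blur in pictures
Figure 5.10: Composition equation used to interpolate blur values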
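The gradient-ratio estimate of Eqs. 5.30 and 5.31 can be sketched on a one-dimensional signal as follows. This is a minimal sketch and not the implementation of [141]: the function names `gaussian_kernel` and `estimate_edge_blur` are hypothetical, the edge position is assumed to be known (for instance from an edge detector), and the discrete sampling makes the estimate approximate.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Sampled and normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    return k / k.sum()

def estimate_edge_blur(signal, edge_index, sigma0=1.0):
    """Estimate the blur sigma at a known edge position of a 1-D signal.

    The signal is re-blurred with a Gaussian of standard deviation sigma0,
    and the ratio R of the gradient magnitudes before and after re-blurring
    (Eq. 5.30) is converted to sigma with Eq. 5.31. Valid only where R > 1.
    """
    radius = int(4 * sigma0) + 1
    padded = np.pad(signal, radius, mode="edge")
    reblurred = np.convolve(padded, gaussian_kernel(sigma0, radius), mode="valid")
    grad = np.abs(np.gradient(signal))
    grad_rb = np.abs(np.gradient(reblurred))
    R = grad[edge_index] / grad_rb[edge_index]
    return sigma0 / np.sqrt(R ** 2 - 1.0)
```

Applied to every pixel flagged as an edge, such an estimate would give the sparse blur map; the values between edges remain undefined at this stage.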
The second step of the estimation of the dense blur map is to interpolate blur values between the known points. To perform this interpolation, the problem is expressed as a composition equation (eq. 5.32). The image $I_o$ is expressed as the composition of a foreground image $F$ and a background image $B$. The value $\alpha$ provides the contribution of the foreground and background images to the overall picture under observation. To apply this model to the problem of converting a sparse blur map to a dense one, the blur measurements obtained previously are considered as ground truth values of $\alpha$. Both images, $F$ and $B$, are unknown. Since the problem is ill-posed, some assumptions about the images are needed. A local smoothness constraint is assumed for the background picture. Therefore, sharp variations of pixel intensity in picture $I_o$ can only be explained by the foreground image $F$. The objective of this approach is to obtain the values of $\alpha$ which best match the sparse blur measurements $\hat{\alpha}$ while fulfilling the smoothness constraint.
$$I_o = \alpha F + (1-\alpha) B \tag{5.32}$$
It can be shown that the solution to this problem can be obtained by solving the sparse linear system of eq. 5.33, where $L$ is the matting Laplacian matrix and $D$ is a diagonal matrix in which $D_{i,i}$ is equal to 1 if pixel $i$ is an edge, and 0 otherwise [142].
$$(L + \lambda D)\,\alpha = \lambda D\,\hat{\alpha} \tag{5.33}$$
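The structure of the linear system of Eq. 5.33 can be illustrated on a one-dimensional chain of pixels. This is a minimal sketch under a strong simplifying assumption: a plain chain-graph Laplacian is used in place of the matting Laplacian of [142], so that only the propagation mechanism is shown. The function name `propagate_sparse_values` and its parameters are hypothetical.

```python
import numpy as np

def propagate_sparse_values(alpha_hat, known, lam=1e6):
    """Solve (L + lam * D) alpha = lam * D alpha_hat on a 1-D chain of pixels.

    alpha_hat : measured values (only read where known is 1)
    known     : 1 where a measurement exists (edge pixel), 0 elsewhere
    L is a chain-graph Laplacian, an assumption standing in for the
    matting Laplacian; D is the diagonal matrix built from `known`.
    """
    n = len(alpha_hat)
    L = np.zeros((n, n))
    for i in range(n - 1):          # one graph edge between pixels i and i + 1
        L[i, i] += 1.0
        L[i + 1, i + 1] += 1.0
        L[i, i + 1] -= 1.0
        L[i + 1, i] -= 1.0
    D = np.diag(np.asarray(known, dtype=float))
    rhs = lam * D @ np.asarray(alpha_hat, dtype=float)
    return np.linalg.solve(L + lam * D, rhs)
```

With a large weight on the data term, the measured values are nearly preserved and the unknown values vary smoothly between them.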
5.3.2.2 Shape from texture
The texture gradient depth cue is estimated in two steps. Similarly to the blur from defocus, a depth map is first estimated from the texture gradient, and then a global index is determined. An in-depth description of the estimation of the depth map from the texture gradient can be found in [143]. The main idea of the algorithm is to integrate the gradient field to obtain the surface it describes. Let $S(x,y)$ be the 2D surface to be estimated, defined on a rectangular grid $\{x = 0, \dots, W-1;\ y = 0, \dots, H-1\}$. Let $p_0 = \partial S / \partial x$, $q_0 = \partial S / \partial y$ be the integrable gradient field of $S$. The surface $S$ can be exactly recovered by integrating the gradient field $(p_0, q_0)$, i.e. by solving a Poisson equation. With real images, however, the gradient field may not be integrable. Let $(p, q)$ be the non-integrable gradient field and $\hat{S}$ be the estimated surface. Using Simchony, Chellappa and Shao's (SCS) method [144], the surface $\hat{S}$ can be found by minimizing the least-squares cost function (eq. 5.34).
$$J(\hat{S}) = (\hat{S}_x - p)^{2} + (\hat{S}_y - q)^{2} \tag{5.34}$$
The Euler-Lagrange equation gives the Poisson equation to solve: $\nabla^{2}\hat{S} = \mathrm{div}(p,q)$, with div the divergence operator, $\mathrm{div}(p,q) = \partial p / \partial x + \partial q / \partial y = p_x + q_y$. It can be noted that this equation assumes a null curl: $\nabla^{2}\hat{S} = \mathrm{div}(p,q) = \partial \hat{S}_x / \partial x + \partial \hat{S}_y / \partial y = \mathrm{div}(\hat{S}_x, \hat{S}_y)$, and then the component $\mathrm{curl}(p,q) = \partial p / \partial y - \partial q / \partial x$ is null. The novelty of the approach in [143] is to take into account the information from the curl to increase the accuracy of the surface reconstruction.
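The least-squares integration of a gradient field (Eq. 5.34) can be sketched on a small grid as follows. This is a minimal sketch under stated assumptions: it uses forward differences and a dense least-squares solve, which is only practical for small grids, whereas the methods of [143] and [144] rely on dedicated Poisson solvers and, for [143], on the curl information. The function name `integrate_gradient` is hypothetical.

```python
import numpy as np

def integrate_gradient(p, q):
    """Least-squares surface from forward-difference gradients.

    p[y, x] ~ S[y, x + 1] - S[y, x]   (shape H x (W - 1))
    q[y, x] ~ S[y + 1, x] - S[y, x]   (shape (H - 1) x W)
    Returns the zero-mean surface minimizing the discrete cost of Eq. 5.34;
    the surface is only defined up to an additive constant.
    """
    H, W = q.shape[0] + 1, p.shape[1] + 1
    idx = np.arange(H * W).reshape(H, W)
    rows, rhs = [], []
    pairs = ((idx[:, :-1], idx[:, 1:], p), (idx[:-1, :], idx[1:, :], q))
    for first, second, grad in pairs:
        for i, j, val in zip(first.ravel(), second.ravel(), grad.ravel()):
            row = np.zeros(H * W)
            row[j], row[i] = 1.0, -1.0      # encodes S[j] - S[i] = val
            rows.append(row)
            rhs.append(val)
    S, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return (S - S.mean()).reshape(H, W)
```

The normal equations of this least-squares problem are a discrete form of the Poisson equation given above.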
Figure 5.11: Example of result for depth estimation from texture gradient
5.3.3 Image segmentation
Image segmentation is a difficult topic which has received a lot of attention, and many advanced techniques have been developed. In the context of this work, it is useful to decompose the scene into its constituent objects to enable object-based analysis. To perform this segmentation, the mean-shift algorithm is used. Mean-shift is a classical approach for non-parametric clustering which does not require prior knowledge of the number of classes.
Consider $n$ data points $x_i$, $i = 1, \dots, n$, in a space of dimension $d$. The algorithm is an iterative process which determines the local maxima of the density of a distribution. For a given initialization $x$, the kernel density estimate for a kernel $K(x)$ and a window radius $h$ is:
$$f(x) = \frac{1}{n h^{d}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right) \tag{5.35}$$
In the case of a radially symmetric kernel, i.e. one independent of the orientation, the kernel $K$ takes the form:
$$K(x) = c_{k,d}\, k(\|x\|^{2}) \tag{5.36}$$
with $c_{k,d}$ a normalization constant ensuring that the kernel integrates to 1. The goal is then to determine the maximum of the density starting from the initialization. To do so, the gradient of the kernel density estimate is computed, with $g = -k'$. It provides the shift of position relative to the initial position, which is called the mean-shift vector $m_h(x)$.
$$\nabla f(x) = \frac{2 c_{k,d}}{n h^{d+2}} \sum_{i=1}^{n} (x_i - x)\, g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right) \tag{5.37}$$
$$m_h(x) = \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^{2}\right)} - x \tag{5.38}$$
Based on these equations, the process of the mean-shift is then to:
•Initialize a starting point
•Compute the mean-shift vector mh(xt)
•Translate the window to a new location: xt+1=xt+mh(xt)
•Iterate until the mean-shift vector is null.
During the process, all the points which converge to the same maximum density will belong to the same class.
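The iteration listed above can be sketched for a single starting point as follows. This is a minimal sketch assuming a Gaussian profile $g(u) = \exp(-u/2)$, which is a choice made for this example; the function name `mean_shift_mode` and its parameters are hypothetical, and the actual segmentation used the implementations of [150] [151].

```python
import numpy as np

def mean_shift_mode(points, start, h=1.0, tol=1e-6, max_iter=500):
    """Follow the mean-shift vector (Eq. 5.38) from `start` to a density mode.

    points : array of shape (n, d)
    start  : starting position, array-like of length d
    h      : window radius
    Uses the Gaussian profile g(u) = exp(-u / 2) (assumption of this sketch).
    """
    points = np.asarray(points, dtype=float)
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        u = np.sum(((x - points) / h) ** 2, axis=1)     # ||(x - x_i) / h||^2
        w = np.exp(-u / 2.0)                            # g(u)
        shift = (w[:, None] * points).sum(axis=0) / w.sum() - x   # m_h(x)
        x = x + shift                                   # translate the window
        if np.linalg.norm(shift) < tol:                 # mean-shift vector ~ 0
            break
    return x
```

Running this from every data point and grouping the points whose trajectories end at the same mode gives the clustering described above.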
Using this iterative approach, it was then possible in this work to decompose the scenes into objects. Figure 5.12
depicts some examples of results of the mean-shift applied to the image database used in this work.
Figure 5.12: Example of image segmentation using the mean-shift.
5.4 Binocular depth cues
In previous chapters, the question of evaluating the properties of 3D video sequences was addressed. Until now,
subjective methods were used to study the properties of the 3D video sequences. This section will address how the
evaluation of binocular depth cues can be performed by means of prediction algorithms. The general structure of the
model is depicted in Figure 5.13. There are four main steps: 1) Extraction of disparity maps, 2) identification of regions
of depth-interest, 3) feature extraction from selected areas, and 4) pooling of features to calculate the final depth score.
Figure 5.13: General structure of the proposed depth model
Figure 5.14: Results for the estimation of the disparity
5.4.1 Disparity module
The target of this first module is to extract a disparity representation that captures the binocular cues, and particularly binocular disparities. The most accurate way is to acquire disparity information from the video camera during shooting. Indeed, some video cameras are equipped with sensors which can record depth. Using these depth maps, disparities can easily be obtained. At present, it is still rare to have video sequences that include their respective depth maps. In the future this will be more frequent due to the use of video-plus-depth coding, which will be applied to efficiently encode multiple views as required, for example, for the next generation of multiview autostereoscopic displays. For the present study, this information was not available and had to be estimated from the two views. To estimate depth maps, there exists the Depth Estimation Reference Software (DERS) [145] used by MPEG. This software can provide precise disparity maps. However, it requires at least 3 different views and information about the shooting conditions (position and orientation of the cameras, focal distances, etc.), which was not available for the stereoscopic sequences employed in the present research.
Therefore, as discussed in Section 5.3.1, it was decided to use a dense optical flow algorithm to estimate the dense disparity maps. An extensive comparison of dense optical flow algorithms is reported by the University of Middlebury [146]. Based on these results, the algorithm proposed by Werlberger et al. [136] [99] and available at GPU4Vision [147] was used to estimate disparities from stereoscopic views, since it ranks among the algorithms providing the best performance and is also particularly fast. This motion estimation is based on low-level image segmentation to tackle the problem of poorly textured regions, occlusions and small-scale image structures. It was applied to find the "displacement" between the left and right stereoscopic views, providing an approximation of the disparity maps. The results obtained are quite accurate, as illustrated in Figure 5.14, and are obtained in a reasonable computation time (less than a second for processing a pair of full HD frames on an NVidia GTX470).
Once pixel disparities are computed, it is necessary to convert them to retinal disparities. Figures 5.15 and 5.16 depict the geometric relationship between the different factors involved in the computation of retinal disparities. Cormack [148] provided the description of the different equations. If a person is looking at a specific point (F) in space, the angle cf in Figure 5.15 can be computed using equations 5.39-5.41.
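Figure 5.15: Geometric relationship when the eyes fixate a point (F) in space. Figure copied from [148].
Figure 5.16: Geometric relationship when the eyes are stimulated by a tangent (T). Figure copied from [148].
$$\mathrm{angle}_{cf} = \mathrm{angle}_1 + \mathrm{angle}_2 \tag{5.39}$$
$$\mathrm{angle}_1 = \tan^{-1}\left(\frac{J + A_f}{D_f}\right) \tag{5.40}$$
$$\mathrm{angle}_2 = \tan^{-1}\left(\frac{J - A_f}{D_f}\right) \tag{5.41}$$
Then, if a second point is considered, it is similarly possible to determine the angle ct in Figure 5.16 using equations 5.42-5.44.
$$\mathrm{angle}_{ct} = \mathrm{angle}_3 + \mathrm{angle}_4 \tag{5.42}$$
$$\mathrm{angle}_3 = \tan^{-1}\left(\frac{J + A_t}{D_t}\right) \tag{5.43}$$
$$\mathrm{angle}_4 = \tan^{-1}\left(\frac{J - A_t}{D_t}\right) \tag{5.44}$$
Finally, the retinal disparity can be computed as the difference between cf and ct (Eq. 5.46):
$$r = \mathrm{angle}_{cf} - \mathrm{angle}_{ct} \tag{5.45}$$
$$= \tan^{-1}\left(\frac{J + A_f}{D_f}\right) + \tan^{-1}\left(\frac{J - A_f}{D_f}\right) - \tan^{-1}\left(\frac{J + A_t}{D_t}\right) - \tan^{-1}\left(\frac{J - A_t}{D_t}\right) \tag{5.46}$$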
A particular case of equation 5.46 is when the convergence is symmetrical and both points F and T are on the midsagittal plane. In this case, $A_f$ and $A_t$ are null, which simplifies the equation of retinal disparity (Eq. 5.47):
$$r = 2\cdot\tan^{-1}\left(\frac{J}{D_f}\right) - 2\cdot\tan^{-1}\left(\frac{J}{D_t}\right) \tag{5.47}$$
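The computation of the retinal disparity between a fixated point and a second point can be sketched as follows. This is a minimal sketch assuming that $J$ is half the interocular distance and that $A$ and $D$ are the lateral offset and the viewing distance of a point, all in the same unit; these interpretations are assumptions made for this example, and the function names are hypothetical.

```python
import math

def vergence_angle(J, A, D):
    """Angle (radians) subtended at the two eyes by a point at lateral
    offset A and distance D, following the form of Eqs. 5.39-5.41.
    J is assumed to be half the interocular distance."""
    return math.atan((J + A) / D) + math.atan((J - A) / D)

def retinal_disparity_deg(J, Af, Df, At, Dt):
    """Retinal disparity in degrees between the fixated point F and a
    second point T, computed as the difference of the two angles."""
    return math.degrees(vergence_angle(J, Af, Df) - vergence_angle(J, At, Dt))
```

With both points on the midsagittal plane (Af = At = 0), the result reduces to the symmetric case of Eq. 5.47.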
Computing retinal disparities may, however, be challenging since they result from the comparison between two distinct points. Processing an image of height $N$ and width $M$ would require $(N \times M)^{2}$ comparisons, since each point needs to be compared with every other point. Alternatively, to keep a perceptual representation that takes the viewing conditions into account, it is proposed to work with parallax values in degrees. Lin [149] provided the equations to compute parallax values, given in equations 5.48-5.50:
$$P = a - b \tag{5.48}$$
$$a = \tan^{-1}\left(\frac{D_{IP} + D_s - 2T_s}{2L}\right) + \tan^{-1}\left(\frac{D_{IP} + D_s + 2T_s}{2L}\right) \tag{5.49}$$
$$b = \tan^{-1}\left(\frac{D_{IP} + D_s}{2L}\right) + \tan^{-1}\left(\frac{D_{IP} + D_s}{2L}\right) \tag{5.50}$$
Figure 5.17: Geometric relationship for the computation of parallax.
After the disparity map in pixels has been converted to parallax in degrees, the next module of the algorithm can be applied.
5.4.2 Region of depth relevance module
The idea of the region of depth relevance module is that observers are assumed to judge the depth of a 3D image using areas or objects which attract their attention, and not necessarily the entire picture, because during scene analysis the combination of depth cues seems to lead to an object-related figure-ground segregation. For example, for the sequence depicted in Figure 5.18a, people are assumed to appreciate the spatial rendition of the grass and base their rating on it without considering the black background. In the same way, for the scene shown in Figure 5.18b, observers are expected to perceive an appreciable depth effect due to the spatial rendition of the trees, in spite of most of the remaining elements of the scene being flat. Note that this is due to the shooting conditions: the background objects are far away, and hence the depth resolution is low, so that all objects appear at a constant disparity. Further note that the disparity feature provides mainly relative depth information, but it can also give some absolute depth information if the vergence cues are also considered. The region of depth relevance module extracts the areas of the image where the disparity changes and which thereby contribute a relevant depth cue. It is most likely that these areas will be used to judge the depth of the scene. In practice, the proposed algorithm follows the process described in Listing 5.1 (also depicted in Figure 5.19):
(a) (b)
Figure 5.18: Illustration of cases where it is assumed that not the entire image is used for judging the depth.
Listing 5.1: Estimation of the region of depth relevance
---------------------------------------------
Let the function Std be the standard deviation, defined by:
$\mathrm{Std}: \mathbb{R}^{N} \to \mathbb{R}$
$X \mapsto \sqrt{\frac{1}{\#X} \sum_{i=1}^{\#X} (X_i - \bar{X})^{2}}$
with $\bar{X}$ the average value of the elements in $X$
and $\#X$ the cardinality of $X$
---------------------------------------------
Let the variables:
$M, N, T$: respectively the number of lines, the number of columns of the images and the number of frames in the sequence.
$LeftView = [I^{L}_{n,i,j}]_{N \times M \times T}$, $\forall (i,j,n) \in [1,N] \times [1,M] \times [1,T]$, $I^{L}_{n,i,j} \in [0,255]^{3}$
$I^{L}_{n,i,j}$: the pixel value of the left stereoscopic view at the location $(i,j)$ of the frame $n$
$RightView = [I^{R}_{n,i,j}]_{N \times M \times T}$, $\forall (i,j,n) \in [1,N] \times [1,M] \times [1,T]$, $I^{R}_{n,i,j} \in [0,255]^{3}$
$I^{R}_{n,i,j}$: the pixel value of the right stereoscopic view at the location $(i,j)$ of the frame $n$
$Disparity = [D_{n,i,j}]_{N \times M \times T}$, $\forall (i,j,n) \in [1,N] \times [1,M] \times [1,T]$, $D_{n,i,j} \in \mathbb{R}$
$D_{n,i,j}$: the horizontal displacement of the pixel $I^{R}_{n,i,j}$ compared to $I^{L}_{n,i,j}$, such that $I^{L}_{n,i,j+D_{n,i,j}} = I^{R}_{n,i,j}$. Here, $D_{n,i,j}$ is the output of the disparity module described in Section 5.4.1.
$Labels = [L_{n,i,j}]_{N \times M \times T}$, $\forall (i,j,n) \in [1,N] \times [1,M] \times [1,T]$, $L_{n,i,j} \in \mathbb{N}$
$L_{n,i,j}$: the value of the label at the location $(i,j)$ of the frame $n$ resulting from the object segmentation of the left frame using the mean-shift algorithm.
---------------------------------------------
The region of depth relevance is defined as follows:
For each object, determine the standard deviation of the disparity values within the object:
$V = [v_{n,l}]_{T \times \mathbb{N}}$, $\forall (n,l) \in [1,T] \times \mathbb{N}$, $v_{n,l} \in \mathbb{R}$
$\forall l \in [1, \max(Labels)]$, $v_{n,l} = \mathrm{Std}(\{D_{n,i,j} \mid (i,j) \in [1,N] \times [1,M],\ L_{n,i,j} = l\})$
The region of depth relevance of the frame $n$, $rodr_n$, is the union of the objects which have a standard deviation of disparity values greater than $d_{th}$:
$RODR = [rodr_n]_{T}$, $\forall n \in [1,T]$, $rodr_n \in ([1,N] \times [1,M])^{N}$
$rodr_n = \{(i,j) \mid (i,j) \in [1,N] \times [1,M],\ \exists l \in [1, \max(Labels_n)] : L_{n,i,j} = l,\ v_{n,l} > d_{th}\}$
In our implementation $d_{th}$ is set to 0.04
In the description of the region of depth relevance extraction, the mean shift algorithm has been used [150] [151].
This algorithm was discussed in Section 5.3.3 of this thesis, and as previously explained it has been chosen due to its
good performance in object segmentation on the data base under study, which has been verified qualitatively for the
segmented objects of a random selection of scenes.
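The selection of the region of depth relevance for one frame, as described in Listing 5.1, can be sketched as follows. This is a minimal sketch, not the implementation used in this work: the function name `region_of_depth_relevance` is hypothetical, and the disparity map and the label map are assumed to be given as arrays of identical shape.

```python
import numpy as np

def region_of_depth_relevance(disparity, labels, d_th=0.04):
    """Boolean mask of the pixels belonging to objects whose standard
    deviation of disparity exceeds d_th (one frame).

    disparity : 2-D array of disparity values
    labels    : 2-D integer array, one label per segmented object
    """
    mask = np.zeros(disparity.shape, dtype=bool)
    for label in np.unique(labels):
        obj = labels == label
        # ndarray.std() uses the 1 / #X normalization, as in Std.
        if disparity[obj].std() > d_th:
            mask |= obj
    return mask
```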
5.4.3 Frame-based feature extraction module
Once the RODR of each frame has been extracted, the next step is to extract the binocular feature used for depth estimation for the entire sequence. Disparities contribute to depth perception in a relative manner, which is why the variation of disparities between the different objects of the scene is used by the proposed algorithm for depth estimation. In practice, the proposed algorithm follows the steps described in Listing 5.2, as illustrated in Figure 5.20:
Listing 5.2: Extraction of the feature per frame
The frame-based indicator is the logarithm of the standard deviation of the disparity values within the RODR normalized by the surface of the RODR.
$SD = [Sd_n]_{T}$, $\forall n \in [1,T]$, $Sd_n \in \mathbb{R}^{N}$
$Sd_n = \{D_{n,i,j} \mid (i,j) \in rodr_n\}$
$FrameBasedIndicator = [FrameBasedIndicator_n]_{T}$, $\forall n \in [1,T]$, $FrameBasedIndicator_n \in \mathbb{R}$
$FrameBasedIndicator_n = \mathrm{Log}\left(\frac{\mathrm{Std}(Sd_n)}{\#Sd_n}\right)$
Figure 5.19: Illustration of the algorithm used for determining the region of depth relevance (RODR).
Figure 5.20: Algorithm used for determining the value of the depth indicator for a single frame
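The frame-based indicator of Listing 5.2 can be sketched as follows. This is a minimal sketch assuming that the RODR of the frame is available as a boolean mask; the function name `frame_based_indicator` is hypothetical, and a natural logarithm is assumed since the base is not stated.

```python
import numpy as np

def frame_based_indicator(disparity, rodr_mask):
    """log(Std(Sd_n) / #Sd_n) over the disparities inside the RODR.

    An empty mask or a constant disparity inside the RODR yields a
    non-finite value; such frames need separate handling.
    """
    sd = disparity[rodr_mask]          # disparity values Sd_n
    return np.log(sd.std() / sd.size)
```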
5.4.4 Temporal pooling
No temporal properties of the 3D video sequences have been considered so far. To extend the approach from images to entire video sequences, as studied in this work, the integration into an overall depth score has to be taken into account. Two main temporal scales can be considered: a local and a global one.
5.4.4.1 Short-term spatio-Temporal indicator
Locally, the temporal depth variation can be used as a reference to understand the relative position of the elements of the scene. In the previous step, the relative variations in depth of objects were evaluated per image. This evaluation is now extended to a small number of subsequent images to account for short-term memory, since depth perception is expected to rely on the comparison between objects across consecutive frames. Since the fixation time is 200 ms [152], it was decided to take the temporal neighbourhood into account by analyzing the local temporal variation of relative depth between objects for the evaluation of every frame. A sliding window of LT frames, corresponding to the fixation time and centered on the frame under consideration, was used for the spatio-temporal extension of the depth indicator. In practice, the algorithm is implemented as in Listing 5.3 and illustrated in Figure 5.21:
Listing 5.3: Spatio-Temporal depth indicator
---------------------------------------------
Let the variables:
$LT \in 2\mathbb{N}+1$: the size of the local temporal pooling window (for a frame rate of 25 frames per second, $LT = 5$)
$STdisp = [stdisp_n]_{T-LT-1}$, $\forall n \in [1, T-LT-1]$, $stdisp_n \in \mathbb{R}^{N}$
$stdisp_n$: the spatio-temporal disparities used for the depth evaluation of frame $n$, as in Section 5.4.3 / Listing 5.2.
---------------------------------------------
The spatio-temporal indicator is defined as:
$stdisp_n = \{D_{t,i,j} \mid t \in [n - \frac{LT-1}{2},\ n + \frac{LT-1}{2}],\ (i,j) \in rodr_n\}$
$STIndicator = [STIndicator_n]_{T-LT-1}$, $\forall n \in [1, T-LT-1]$, $STIndicator_n \in \mathbb{R}$
$STIndicator_n = \mathrm{Log}\left(\frac{\mathrm{Std}(stdisp_n)}{\#stdisp_n}\right)$
Figure 5.21: Local temporal pooling
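The sliding-window computation of Listing 5.3 can be sketched as follows. This is a minimal sketch assuming that the disparity maps and the RODR masks of all frames are stacked in arrays of shape (T, H, W) and that only frames with a complete window are evaluated; the function name `st_indicator` is hypothetical and frame indices are zero-based.

```python
import numpy as np

def st_indicator(disparity, rodr, LT=5):
    """Spatio-temporal indicator for every frame with a full window.

    disparity : array (T, H, W) of per-frame disparity maps
    rodr      : boolean array (T, H, W), RODR mask of each frame
    LT        : odd window length in frames (5 frames = 200 ms at 25 fps)
    Returns a dict {frame index n: indicator value}.
    """
    half = (LT - 1) // 2
    T = disparity.shape[0]
    out = {}
    for n in range(half, T - half):
        window = disparity[n - half:n + half + 1]   # frames n-half .. n+half
        values = window[:, rodr[n]]                 # pixels of rodr_n in each frame
        out[n] = np.log(values.std() / values.size)
    return out
```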
5.4.4.2 Global temporal pooling
Global temporal pooling still requires work: it is not trivial to pool the different instantaneous measures into an estimate of the global judgment given by the observer. In the case of quality assessment, there are several approaches to temporal pooling, such as simple averaging, Minkowski summation, or averages computed using an Ln norm or limited to a certain percentile. Other approaches are more sophisticated [153] and deal with quality degradation events. Regarding the global estimation from several local observations, it is usually assumed that if an error occurs, people will quickly say that the overall quality of the sequence is poor, and it will take some time after the last error event until the overall quality is considered good again [153]. In the context of our depth evaluation, the inverse seems to hold: observers who clearly perceive the depth effect will quickly report it, and if there are passages in the sequence where the depth effect is less visible, they seem to take some time to reflect this in their overall rating. To reflect this in our model, we decided to use a Minkowski summation with an order higher than 1, to emphasize passages with high short-term depth values. The final mapping is then performed using a third-order polynomial function.
Listing 5.4: Global temporal pooling
$$Indicator = \sqrt[k]{\frac{1}{T-LT-1} \sum_{t=1+\frac{LT}{2}}^{T-\frac{LT}{2}} (STIndicator_t)^{k}}$$
In our implementation $k$ is set to 4
$$MOS_e = A \times Indicator^{3} + B \times Indicator^{2} + C \times Indicator + D$$
In our implementation $A, B, C, D$ are respectively set to −0.06064, −2.213, −25.79, −93.04 (obtained by using the optimization function polyfit of MATLAB)
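The pooling and mapping of Listing 5.4 can be sketched as follows. This is a minimal sketch under stated assumptions: the normalization uses the number of values passed in rather than the term $T-LT-1$, the formula is applied literally, so that an even order $k$ discards the sign of the short-term indicators (the sign convention is not stated in the text), and the function names are hypothetical.

```python
import numpy as np

def minkowski_pool(values, k=4):
    """k-th root of the mean of the k-th powers of the short-term indicators.
    The mean is taken over the values provided (assumption of this sketch)."""
    values = np.asarray(values, dtype=float)
    return np.mean(values ** k) ** (1.0 / k)

def map_to_mos(indicator, coeffs=(-0.06064, -2.213, -25.79, -93.04)):
    """Third-order polynomial mapping with coefficients (A, B, C, D)."""
    return np.polyval(coeffs, indicator)
```

With an order higher than 1, the pooled value lies above the plain average whenever the short-term values differ, which is the emphasis on high values sought here.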
5.5 Model performance
To evaluate the performance of the proposed model, the subjective video database created and described in section 4.4 was used to measure the accuracy of the algorithm. Figure 5.22 depicts the subjective depth ratings compared to the predicted subjective depth. The model training and validation were carried out using cross-validation (6 combinations of training/validation). The model achieves the following performance on our entire database: a Pearson correlation R = 0.60, a root mean squared error RMSE = 0.55, an RMSE* = 0.37, and an outlier ratio (OR) of 0.83 / 21.33 (where 0.83 is the number of outliers on a validation subset composed of 21.33 sequences; the reported floating point values are means over the cross-validation runs). The model has seven parameters: the threshold in the RODR algorithm, the size of the local temporal pooling window, the order of the Minkowski summation, and the four coefficients of the polynomial mapping. These results show that there is still room for further improvement.
As can be observed from Figure 5.22, eight source sequences are not well handled by the algorithm (plotted as red triangles). These specific contents show a pop-out effect which was apparently well appreciated by the observers, who rated these sequences with high depth scores. Two distinct reasons could explain these results. From a conceptual point of view, the current algorithm does not distinguish between positive and negative disparity values, and hence between objects that pop out and objects that stay inside the screen. From an implementation point of view, the disparity algorithm did not succeed in capturing the small blurry objects that characterize the pop-out effect. This leads to an under-estimation of the depth for these contents. Without these contents, we achieve a Pearson correlation of 0.8, an RMSE of 0.38, an RMSE* of 0.18 and an OR of 0 / 18.66.
Even though they do not have a strong effect on the general results of the model, a second type of content could also be identified (represented by the circles in Figure 5.22), for which the depth was overestimated. For these contents, it is still unclear what factors contribute to the ratings. Some of the sequences show fast motion, some have several scene changes, and other depth cues may also inhibit the depth perception.
As a consequence, three factors are currently under study to improve the general accuracy of the model:
•Incorporate a weighting depending on the position in depth of the object (if they pop-out or stay inside the display),
•Improve the accuracy of the disparity estimation,
•Consider the monocular cues which are in conflict with the binocular depth perception.
Listing 5.5: RMSE∗
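Let $X_{gth} \in \mathbb{R}^{N}$ be a set containing the ground truth values.
Let $X_{est} \in \mathbb{R}^{N}$ be a set containing the estimated values.
Let $CI^{95}_{i}$ be the 95% confidence interval of $X_{gth,i}$
$\forall i \in [1, \#X]$, $Perror_i = \max(0,\ |X_{gth,i} - X_{est,i}| - CI^{95}_{i})$
$RMSE^{*}: \mathbb{R}^{N} \times \mathbb{R}^{N} \to \mathbb{R}$
$(X_{gth}, X_{est}) \mapsto \sqrt{\frac{1}{\#X - d} \sum_{i=1}^{\#X} (Perror_i)^{2}}$
with $d$ the degree of freedom between $X_{gth}$ and $X_{est}$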
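The RMSE* of Listing 5.5 can be sketched as follows. This is a minimal sketch: the function name `rmse_star` is hypothetical, the confidence intervals are assumed to be given as one half-width per item, and the value of $d$ must be supplied by the caller since it is not specified here.

```python
import numpy as np

def rmse_star(x_gth, x_est, ci95, d):
    """RMSE*: prediction errors smaller than the 95% confidence interval
    of the ground truth count as zero.

    x_gth : ground truth values
    x_est : estimated values
    ci95  : 95% confidence interval of each ground truth value
    d     : degrees of freedom subtracted from the number of items
    """
    x_gth, x_est, ci95 = (np.asarray(a, dtype=float) for a in (x_gth, x_est, ci95))
    perror = np.maximum(0.0, np.abs(x_gth - x_est) - ci95)
    return np.sqrt(np.sum(perror ** 2) / (x_gth.size - d))
```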
Figure 5.22: Results of the model on the estimation of depth, the triangles represents the contents which have pop-out
effect, the circle represents a class of under-estimated content (which have a lot of linear perspective)
5.6 Applications
One of the main issues in the subjective evaluation of perceived depth as a function of monocular and binocular depth cues in natural images, as presented in section 4.5.3 using absolute category rating and section 4.6.1 using ranking, is the differentiation between the contributions of binocular and monocular cues to the overall depth perception. The use of natural images in the tests limited the ability to define precisely the amount of monocular and binocular depth cues. Some of the monocular depth cues were evaluated in the images as described in the previous section. The contribution of binocular depth cues in these images is even more difficult to evaluate since it is not possible to consider them independently of the monocular cues. Considering how important binocular depth cues are, different measurements are applied to the images' depth maps to provide more detailed information on the perceived depth coming from retinal disparities or from monocular depth cues.
To perform such an analysis, it is possible to use the model previously described in this chapter, which only takes into account the binocular properties of the stereoscopic images.
Listing 5.6: Research questions
1. How does the proposed algorithm compare with other approaches from the literature?
2. What are the interactions between binocular and monocular depth cues?
5.6.1 Comparison with other methods
Only a few studies have addressed computational models that predict depth perception in natural images and provide an overall score. Nevertheless, in addition to the proposed method, other alternatives have been proposed in recent years.
Sohn [154] characterized the depth of the images using an object decomposition and applied two different metrics to the objects: the mean disparity value, and the “object thickness” ($OT_s$), derived from the ratio between the mean width of the object $s$, noted $MW_s$, and the mean disparity of the object, noted $MAD_s$. Even though the metric was applied to estimate visual discomfort, it provides another way of characterizing depth which can be applied to this study. Extending this approach, Toyosawa [155] added to the work of Sohn [154] the depth range between the farthest and closest objects in the overall equation.
$$OT_s = \ln\left(\alpha \cdot \frac{MW_s}{MAD_s}\right) \tag{5.51}$$
Addressing the same use case as studied in this thesis, Toyosawa [155] compared 20 different statistical analyses of disparities. These cover many kinds of measurements, such as the average, minimum, maximum and standard deviation of disparities, different quartiles, and the size of the disparity range [154, 156]. Lin [149] proposed a measurement algorithm based on a second-order polynomial fit of the range of depth values, in which the average was also considered. It is described by equation 5.53, where the notation $R_{n\%}$ denotes the nth percentile of the retinal disparity values.
$$R_{range} = R_{90\%} - R_{10\%} \tag{5.52}$$
$$S^* = a \cdot R_{50\%} + b \cdot R_{range} + c \cdot R_{50\%}^2 + d \cdot R_{range}^2 + e \tag{5.53}$$
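As an illustration of equations 5.52 and 5.53, the following minimal Python sketch computes the score from a list of disparity values. The function name `lin_depth_score`, the use of the standard library's inclusive percentile interpolation, and the fact that the coefficients are passed as arguments are assumptions made for this example; the coefficient values themselves are those of [149] or of a refit and are not reproduced here.

```python
import statistics


def lin_depth_score(disparities, a, b, c, d, e):
    """Second-order polynomial score of eq. 5.53 (illustrative sketch).

    disparities: iterable of retinal disparity values (at least two).
    a, b, c, d, e: model coefficients, to be supplied by the caller
    (hypothetical parameters; no values are assumed here).
    """
    # 99 cut points: index 0 is the 1st percentile, index 98 the 99th.
    # The "inclusive" interpolation rule is an assumption of this sketch.
    cuts = statistics.quantiles(disparities, n=100, method="inclusive")
    r10, r50, r90 = cuts[9], cuts[49], cuts[89]
    r_range = r90 - r10  # eq. 5.52
    return a * r50 + b * r_range + c * r50 ** 2 + d * r_range ** 2 + e
```

With disparities 0, 1, ..., 100 the percentiles are simply 10, 50 and 90, so the range term equals 80 and the score can be checked by hand for any choice of coefficients.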
The results of the different measurements considered in [155] show that no single measurement fits all cases: the most appropriate measurement algorithm depends on the composition of the image and on the number of objects at different depth planes. Unfortunately, most of the previous studies only considered a small number of images: 10 different images each composed of only one 3D rendered object in [149], 14 natural images from different movies in [155], 40 natural images in [154] (which begins to be reasonably large, but the focus of that paper was visual comfort and not perceived depth), and 64 videos in our previous work [113]. To extend the study proposed in [155], it is proposed to add more contents, e.g. the 200 3D still images previously presented in Section 4.5.2, and to consider a new high-level measurement algorithm. It is important to note that this database differs from the one used for training and previous verification of the model designed in the context of this thesis. The performance of these measurements can be found in Table 5.2 and Figure 5.24. The RMSE∗ is the root mean square error (RMSE) taking into account the size of the confidence interval of the subjective data. Let $X_{gth} \in \mathbb{R}^N$ be the ground truth values, $X_{est} \in \mathbb{R}^N$ the predicted values, $CI^{95}_i$ the 95% confidence interval of $X_{gth,i}$, and $d$ the degree of freedom between $X_{gth}$ and $X_{est}$. The error term $Perror_i$ is the one defined in Listing 5.5.
$$RMSE^* = \sqrt{\frac{1}{\#X - d} \sum_{i=1}^{\#X} (Perror_i)^2} \tag{5.54}$$
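The computation of the RMSE∗ can be sketched in a few lines of Python. This is only an illustrative sketch following the definitions of Listing 5.5: the function name `rmse_star` is hypothetical, the square root is taken as implied by the name "root mean square error", and the value of d must be supplied by the caller since it depends on the mapping applied between predictions and subjective scores.

```python
import math


def rmse_star(x_gth, x_est, ci95, d):
    """Epsilon-insensitive RMSE (RMSE*), illustrative sketch.

    x_gth: ground truth (subjective) values.
    x_est: estimated values, already mapped to the subjective scale.
    ci95:  95% confidence interval of each ground truth value.
    d:     degrees of freedom between x_gth and x_est (caller-supplied).
    """
    if not (len(x_gth) == len(x_est) == len(ci95)):
        raise ValueError("all inputs must have the same length")
    n = len(x_gth)
    if n - d <= 0:
        raise ValueError("need more samples than degrees of freedom")
    # Errors smaller than the confidence interval are not penalized.
    perror = [max(0.0, abs(g - e) - c) for g, e, c in zip(x_gth, x_est, ci95)]
    return math.sqrt(sum(p * p for p in perror) / (n - d))
```

For example, with ground truth [1, 2, 3, 4, 5], estimates [1.5, 2, 5, 4, 2], confidence intervals [1, 0.5, 0.5, 0.5, 1] and d = 1, only the third and fifth samples exceed their confidence interval (by 1.5 and 2), which gives sqrt(6.25 / 4) = 1.25.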
Table 5.2 reports the performance of 31 different ways to characterize the 200 images of the studied database. Since each indicator has a different range than the subjective scores, a third-order polynomial fit is applied between each indicator and the respective subjective data. This makes it possible to compute the outlier ratio (OR), RMSE and RMSE∗ (equation 5.54) in addition to the Pearson and Spearman correlations.
The category “Distance metrics” is composed of different percentiles used as indicators of the content property. The notation PXYz denotes the value of the percentile XY.z%. Among these indicators, it can be seen that values towards the minimum of disparity are better indicators of perceived depth than values towards the maximum.
In the category “Volume metrics”, the indicator PX1Y1z1 − PX2Y2z2 gives the size of the disparity interval between the percentiles X1Y1.z1% and X2Y2.z2%. The Michelson contrast is defined as in equation 5.55. The “2nd order polynomial fit” corresponds to the method described by Lin [149], equation 5.53, using the coefficients provided in their paper. To provide a better comparison with their algorithm, the coefficients were also retrained on the proposed database using the MATLAB function “regress”. This corresponds to the line “2nd order polynomial refit” in Table 5.2. ITU-T Recommendation P.1401 [157] describes how to perform statistical tests to compare the performance of different prediction algorithms. In terms of Spearman correlation, the Lin algorithm [149] with its original coefficients outperforms the other volume metrics, with the exception of the interval P950-P050. Note that this algorithm is also based on such an interval, which appears to indicate that this parameter is a key feature of the algorithm. However, these metrics do not statistically outperform the “distance metrics” based on percentiles lower than the median. With regard to the Pearson correlation, the retrained Lin algorithm provides similar results to the ones described for the Spearman correlation, and achieves a lower RMSE.
$$\mathrm{Contrast} = \frac{\mathrm{P950} - \mathrm{P050}}{\mathrm{P950} + \mathrm{P050}} \tag{5.55}$$
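The percentile-based volume indicators can be illustrated with the short Python sketch below. The function names and the inclusive percentile interpolation are assumptions of this example, and only integer percentiles are handled (P950 and P050 correspond to the 95% and 5% percentiles). Note that with signed disparities the denominator of equation 5.55 can be close to zero, in which case the contrast becomes very large or undefined; the sketch returns NaN when it is exactly zero.

```python
import math
import statistics


def _percentiles(disparities):
    # 99 cut points (1% .. 99%); interpolation rule is an assumption.
    return statistics.quantiles(disparities, n=100, method="inclusive")


def percentile_interval(disparities, low=5, high=95):
    """Size of the disparity interval, e.g. P950 - P050 (sketch)."""
    cuts = _percentiles(disparities)
    return cuts[high - 1] - cuts[low - 1]


def michelson_contrast(disparities, low=5, high=95):
    """Michelson contrast of eq. 5.55 computed on disparity percentiles."""
    cuts = _percentiles(disparities)
    p_low, p_high = cuts[low - 1], cuts[high - 1]
    denom = p_high + p_low
    if denom == 0:
        return math.nan  # undefined when the percentiles cancel out
    return (p_high - p_low) / denom
```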
The higher-level category “object-based metrics” includes the work of Toyosawa [155], with the average thickness of objects, the depth interval between the closest and farthest objects, and the number of objects. It also includes the proposed method and the work of Sohn [154] on object thickness. The proposed algorithm statistically outperforms the other object-based algorithms on each performance indicator. In terms of Pearson correlation, it is statistically equivalent to the volume-based algorithm of Lin [149] and outperforms the retrained Lin algorithm. The Lin algorithm with its original parameters and the interval P950 − P050 outperform our algorithm in terms of Spearman correlation. However, if the RMSE and RMSE∗ are considered, the algorithm [113] provides better performance than the other volume- and object-based algorithms across the different performance evaluation criteria.
However, compared to the “distance metrics”, it can be seen that no algorithm significantly outperforms the “simple” 7.5% percentile on each of the performance criteria. Therefore, if a less computationally intensive algorithm is needed, this percentile may be a candidate solution, even though it fails to address many perceptual aspects, for example images having only one depth plane. In such cases, algorithms such as [149] or [113] may provide a better solution. However, such contents were not available in the studied database.
To conclude, many factors are involved in the overall depth perception, and simple statistics on the depth maps appeared unable to explain how depth is perceived. The image structure, addressed in [113, 154] via object analysis or as the notion of image composition in [155], is one of the directions to address the aspects missed by a statistical analysis of the depth map. This enables tackling aspects such as the monocular depth cues, but a lot of work still remains to fully characterize them. The next subsection will present the results obtained regarding the relation between perceived depth and monocular depth cues from the subjective experiments conducted in this study.

Measurements                        N°    PC       SC       OR      RMSE     RMSE∗
Distance metrics
P005                                1     0.192    0.498    0.016   4.188    3.487
P010                                2     0.476    0.426    0.016   3.738    2.948
P015                                3     0.331    0.523    0.016   3.841    3.085
P020                                4     0.474    0.525    0.011   3.774    2.982
P025                                5     0.406    0.538    0.016   3.859    3.087
P050                                6     0.522    0.566    0.000   3.706    2.934
P075                                7     0.526    0.582    0.000   3.707    2.936
P100                                8     0.516    0.571    0.000   3.704    2.932
P125                                9     0.520    0.567    0.000   3.706    2.935
P500                                10    0.449    0.448    0.021   3.686    2.910
P875                                11    0.046    0.042    0.043   4.218    3.507
P900                                12    0.039    0.025    0.048   5.249    4.662
P925                                13    0.034    0.106    0.048   6.970    6.509
P950                                14    0.031    0.116    0.048   7.934    7.521
P975                                15    0.030    0.150    0.048   8.661    8.277
P980                                16    0.030    0.178    0.048   8.670    8.287
P985                                17    0.029    0.183    0.048   8.952    8.580
P990                                18    0.027    0.207    0.048   9.937    9.597
P995                                19    0.025    0.266    0.043   13.358   13.088
Volume metrics
P950-P050                           20    0.005    0.518    0.016   25.323   25.145
P975-P025                           21    0.021    0.445    0.016   16.120   15.893
P990-P010                           22    -0.004   0.402    0.027   55.227   55.111
standard deviation                  23    0.071    0.439    0.011   7.323    6.907
Michelson contrast                  24    0.060    -0.067   0.287   >100     >100
2nd order polynomial fit [149]      25    0.029    0.587    0.015   23.540   23.362
2nd order polynomial refit [149]    26    0.480    0.432    0.021   4.109    3.307
Object metrics
Avg. Thickness [155]                27    0.085    0.126    0.043   3.616    2.811
Depth Interval btw Objects [155]    28    0.054    0.276    0.038   8.191    7.810
Nb Objects                          29    -0.051   -0.004   0.064   3.693    2.914
PerceptualDepthIndicator [113]      30    0.579    0.479    0.000   3.739    2.964
Object thickness [154]              31    0.276    0.345    0.080   3.430    2.748

Table 5.2: Different algorithms to evaluate binocular depth in 3D contents. PC: Pearson correlation, SC: Spearman correlation, OR: outlier ratio, RMSE: root mean square error, and RMSE∗ as defined in equation 5.54.
Figure 5.23: Statistical differences between depth indicators. Black indicates statistical differences.
Figure 5.24: Relation between instrumental measurement of the images’ binocular characteristics and binocular depth scores. Red circles indicate the inliers of the RANSAC fitting.
5.6.2 Depth perception and its relation with monocular and binocular depth cues
To study how monocular depth cues affect the perceived depth, monocular depth scores are put into relation with binocular depth cues. The distribution of the individual depth cue scores of the images is described as follows. Let $S_{i,o}$ be the depth score provided by observer $o$ on image $i$, and let $l_i$, $r_i$, $i_i$, $b_i$ be respectively the depth cue scores on the linear perspective, relative size, interposition and binocular scales for this same image $i$. Let $I$ be the set of available images. The probability function is defined by:
$$\forall i \in I,\ \forall k \in [0, 11],\ \forall f \in \{l, r, i, b\}, \quad P(f_i \mid S = k) = G_{f,k,i} \tag{5.56}$$
Figure 5.25: Relation between monocular and binocular depth scores
Here, $G_{f,k,i}(x)$ is the probability of having a score $x$ for the depth cue $f$ on image $i$, knowing that the overall depth score is $k$. Figure 5.25 depicts the relation between the different factors involved in $G_{f,k,i}(x)$. Pearson correlations of 0.898, 0.608 and 0.756 are found between the binocular scores and, respectively, the interposition, relative size and linear perspective depth cues.
However, a direct relation between monocular depth cues and overall depth rating cannot be drawn too easily from this data. Indeed, the considered monocular depth cues were only one factor involved in the depth scores, and retinal disparities in particular have a very strong impact on the perceived depth ratings. To study the actual contribution of monocular depth cues, it should be dissociated from the contribution of binocular depth cues. Since natural images were used in this study, it is not easily possible to fully characterize which part of the overall depth ratings comes from the monocular depth cues and which part comes from the binocular cues.
To address this issue, the prediction algorithm based on the analysis of retinal disparity developed in this thesis and described in Section 5.4 was used to study the contribution of binocular depth cues to the overall depth rating.
Figure 5.26 depicts the relationship between the depth scores predicted by this algorithm and the overall depth scores obtained through subjective evaluation, which take into account both monocular and binocular depth cues. The Pearson correlations between the binocular depth measurements and the overall depth scores are respectively 0.92, 0.676 and 0.795 for the different image sets. These correlation values are in the same range as the respective Pearson correlation values between monocular depth scores and overall depth ratings for the same sets of images. This indicates that the overall depth ratings are also highly related to the retinal disparity distribution.
To further study the relation between the monocular depth ratings and the overall perceived depth rating, the monocular depth cue scores are put into relation with the depth scores predicted by the model defined in Section 5.4. The Pearson correlations between the monocular cues “interposition”, “relative size” and “linear perspective” and the depth predicted from disparity analysis are respectively 0.819, 0.606 and 0.663. This shows that the interposition scores and the binocular measurements are highly related. An explanation is that interposition was subjectively characterized by the fact that multiple layers were visible in the scene. The availability of multiple layers is expected to relate to the presence of multiple depth layers, which affect perceived depth in natural images and explain the high correlation between interposition and the depth scores. Similarly, the binocular depth measurement was defined such that it evaluates the availability of multiple depth layers in the pictures. It is then expected that the interposition depth scores relate to the binocular measurement scores.
In the same manner, the presence of linear perspective in the picture, with a vanishing point at the center of the picture, is also intrinsically related to the presence of different depth layers having lines converging to the vanishing point. This could also explain the stronger relationship between linear perspective and both perceived depth and the binocular depth metric.
These results on the linear perspective and interposition depth cues therefore appear to be related more to intrinsic properties of the image structure, which by extension relate to the distribution of the binocular depth cues. Therefore, the result of this analysis relates image properties to the distribution of retinal disparities, rather than defining a perception model of the combination of monocular depth cues.
Figure 5.26: Relation between instrumental measurement of the images’ binocular characteristics and binocular depth scores on the three different sets of images designed to study the interposition, relative size and linear perspective depth cues.
5.6.3 Conclusion
One of the issues addressed in this section is the difficulty of instrumentally characterizing the binocular depth cues in natural images. Different kinds of measurement from the literature were compared, and none of them succeeds in fully explaining the depth scores. The image structure or image composition, and hence the monocular depth cues, are further aspects which need to be considered. The relation between monocular and binocular depth scores has shown that monocular characteristics of the images affected the binocular depth scores. However, considering the relation between the monocular scores and the instrumental binocular measurements, the conclusion is that the interposition and linear perspective depth cues imply multiple depth layers, which induce a higher perceived depth. Due to the lack of an appropriate characterization of binocular depth cues, it is not yet possible to study depth cue combination with natural images.
Listing 5.7: Conclusion on the research questions
1. Using a simple indicator such as the 7.5% percentile already provides interesting
results for evaluating perceived depth, even though it will not be able to address
special cases such as images suffering from the cardboard effect.
2. The proposed model performed better than the other approaches presented in this work.
However, the performance is still not satisfactory and there is room for
improvement.
3. Monocular characteristics of the images affected the binocular depth scores. However,
it is not clear whether the monocular depth cues themselves affected the depth
percept or whether the presence of these depth cues implies specific image properties
which change the depth percept.
5.7 Monocular depth cues
One important aspect, as described in Section 2.2.2, is the contribution of monocular depth cues to depth perception. To tackle the evaluation and characterization of these depth cues, different instrumental evaluation methods were studied in the context of this work; they are the subject of this section.
5.7.1 Linear perspective
The linear perspective depth cue was presented in Section 2.2.2, and an attempt to characterize it through subjective evaluation was made in Section 4.3. This subsection presents methods for its instrumental evaluation.
5.7.1.1 Subjective evaluation database
To develop meaningful instrumental evaluation methods for linear perspective, ground truth data is needed for both training and verification purposes. The database developed by Ross and Oliva [158] was used. It consists of 7138 different images with various types of content: natural or urban scenes (Figure 5.27 illustrates both categories of pictures). The images were annotated by 14 participants in total. The test participants had to rate the content of the scene on three distinct scales: the perspective, the depth, and the openness. The perspective was defined as follows: “Perspective refers to the degree of expansion of space. The convergence of parallel lines to a visible vanishing point gives a strong perception of depth gradient to the space represented in an image” [158]. The depth scale was described as: “the size of the space in a scene (e.g. the mean distance between the observer and the boundaries of this space, e.g). While dominant depth is not a precisely defined quantity, it has a strong relationship with the physical size of the space, and human judgments are consistent in evaluating this quantity” [158]. The openness was defined by “the quantity and location of boundary elements of the scene in view. The most open scene is a ground surface stretching to the horizon, with the existence of a horizon line in the absence of any other visual references (e.g. trees, buildings).” [158]. Each scale was discrete and composed of 6 points. Example pictures were provided to the test participants to guide them in rating the images.
Unfortunately, considering how large the task was, participants did not rate all the images but only a subset of the 7138 images. As a result, each image received at most two ratings, and only 12% of the images were rated twice. This is the major problem of this database. The relationship between the scores of the images rated by two test participants was studied and shows, in some cases, a large variance between replies. This variance between scores was used as a measure of the precision of the ground truth. In this subsection, only the linear perspective scores will be studied.
5.7.1.2 Candidate evaluation algorithms
Two different approaches were evaluated to predict the presence of linear perspective and its relationship with perceived depth in images.
First approach
The first approach is the one proposed by Ross and Oliva [158]. It is called “Global layout properties” (GLP) and is based on the feature set defined by Oliva for scene description [159]. The images are decomposed into non-overlapping blocks, which are then described through the set of GIST features [159]. The blocks are filtered through Gabor filters having different orientations (8) and frequency bandwidths (4) (see Figure 5.28). The magnitude of each resulting filtered block is then averaged. A principal component analysis is used to reduce the dimension of the features to 24, and finally a cluster weighted model is trained to predict the strength of the linear perspective.
Figure 5.27: Illustration of the image database from Ross and Oliva [158]. Pictures are divided into two categories: urban (left), natural (right).
Figure 5.28: GIST features: decomposition of blocks into different Gabor filter having different orientation and frequency bandwidth. Figure from [160]
Figure 5.29: Vanish point model applied on different images. Left, input images; middle, line segment; right, vanishing lines
Second approach
The second method is called the vanish point model (VPM). It addresses the geometric properties of the scene, as depicted in Figure 5.29. A line segment detector (LSD [161]) is employed to extract lines. The vanishing points are then determined by using the J-Linkage algorithm to find which lines converge to the same vanishing point. As a final step, the strength $x$ of the linear perspective depth cue is defined as a function of the distance of the vanishing point to the center of the image (see eq. 5.57), with $d$ the distance of the closest vanishing point to the center of the image (Figure 5.30). This rule was derived from the observation of the images available in [158] and is depicted in Figure 5.31.
$$x = \frac{1}{d + 1} \tag{5.57}$$
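The last step of the vanish point model can be sketched as follows in Python, assuming that the vanishing points have already been estimated (for instance with LSD and J-Linkage). The function name `perspective_strength`, the use of pixel coordinates for the distance d, and the value returned when no vanishing point is found are assumptions of this sketch and are not specified in the text.

```python
import math


def perspective_strength(vanishing_points, image_size):
    """Strength of linear perspective following eq. 5.57 (sketch).

    vanishing_points: list of (x, y) positions, assumed in pixels.
    image_size: (width, height) of the image, in pixels.
    """
    width, height = image_size
    cx, cy = width / 2.0, height / 2.0
    if not vanishing_points:
        return 0.0  # assumption: no vanishing point, no perspective cue
    # d: distance of the closest vanishing point to the image center.
    d = min(math.hypot(x - cx, y - cy) for x, y in vanishing_points)
    return 1.0 / (d + 1.0)
```

A vanishing point located exactly at the image center gives the maximum strength of 1, and the strength decreases towards 0 as the closest vanishing point moves away from the center.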
Figure 5.30: General principle of the vanish point model.
Figure 5.31: Empirical rule for linear perspective. Images ordered from strongest to weakest linear perspective.
5.7.1.3 Results
The performance of the proposed methods was evaluated by measuring their ability to predict the scores given by the test participants. As mentioned previously, there are two different kinds of images to be predicted: the “urban” and “natural” scenes. Considering the difference in performance of the metrics between these classes, the results are presented per group of images. As depicted in Figures 5.32 - 5.35, both approaches perform better on urban than on natural scenes (Pearson correlations of 0.64 and 0.59 instead of 0.33 and 0.17). In the latter case, the scatter plots are close to random.
The limits of the metrics were further studied by analyzing the outliers. This revealed that GLP usually underestimates the linear perspective in the case of strong texture distributed over the entire picture. On the contrary, the vanish point model underestimates the linear perspective for images having only a few vanishing lines. A more in-depth analysis of the limits of the metrics’ performance is provided in Section 5.8.2, where the question of metric reliability and identification of failure cases is addressed.
5.7.2 Defocus blur
The second monocular depth cue which was investigated is the defocus blur. To characterize this cue, a dense blur map is first computed as described in Section 5.3.2.1. Then, the blur measurements have to be translated into depth values. Finally, the depth indicator can be computed. The process of measuring blur in images was described extensively in Section 5.3.2.1; therefore this section focuses on the conversion to depth, the measurement of algorithm performance and the computation of the depth indicator.
Figure 5.32: GLP performance on urban scenes
Figure 5.33: VPM performance on urban scenes
Figure 5.34: GLP performance on natural scenes
Figure 5.35: VPM performance on natural scenes
5.7.2.1 Conversion of blur to depth
The circle of confusion is the optical spot resulting from the intersection of the cone described by the light rays from the lens and the optimal focus point (see Figure 5.36). Its size depends on several parameters: the distance between the object and the lens $d$, the distance between the lens and the focal plane $d_f$, the focal length $f_0$, and the aperture $N$ defined relative to the focal length. Based on these parameters, the circle of confusion $c$ can be determined as in Eq. (5.58).
$$c = \frac{|d - d_f|}{d} \cdot \frac{f_0^2}{N (d_f - f_0)} \tag{5.58}$$
To estimate the distance between the blurred object and the lens, Eq. 5.58 can be inverted. Taking into account that $d \geq d_f$, and hence $d - d_f \geq 0$, the absolute value can be removed and $d$ can be expressed as in Eq. 5.59. In this equation it was necessary to introduce a parameter $k$ such that $\sigma = kc$, which takes into account the size and resolution of the sensor, noted respectively SensorW and ResW. The amplitude of the Gaussian blur is also included in the parameter $k$. This finally makes it possible to convert the blur measured in pixels in the previous section to a circle of confusion in meters.
Figure 5.36: Circle of confusion and relation with capture settings.
$$d = \frac{d_f}{1 - \dfrac{\sigma N (d_f - f_0)}{k f_0^2}} \tag{5.59}$$
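The forward model of Eq. 5.58 and its inversion in Eq. 5.59 can be checked numerically with the following Python sketch. The function and parameter names are hypothetical, all distances are assumed to be expressed in meters, and the value of k used in the usage note is arbitrary: it only serves to verify that both equations are consistent with each other.

```python
def circle_of_confusion(d, d_f, f0, n_aperture):
    """Circle of confusion c of eq. 5.58 (distances assumed in meters)."""
    return abs(d - d_f) / d * f0 ** 2 / (n_aperture * (d_f - f0))


def depth_from_blur(sigma, k, d_f, f0, n_aperture):
    """Object distance d of eq. 5.59, valid for objects with d >= d_f.

    sigma: measured blur (in pixels), with sigma = k * c.
    k:     conversion factor between c and sigma (caller-supplied).
    """
    ratio = sigma * n_aperture * (d_f - f0) / (k * f0 ** 2)
    if ratio >= 1.0:
        # The blur exceeds what eq. 5.58 can produce for a finite distance.
        raise ValueError("blur too large to be converted to a finite depth")
    return d_f / (1.0 - ratio)
```

For instance, an object at d = 5 with d_f = 2, f_0 = 0.05 and N = 2.8 yields a circle of confusion c; feeding sigma = k · c back into `depth_from_blur` with the same settings recovers d = 5 for any positive k.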
5.7.2.2 Performance evaluation
To evaluate the performance of the algorithms involved in the depth estimation process, the rendering software 3DS Max was used. It makes it possible to create specific stimuli for well-defined conditions: the distance of the camera to the objects, the focal length, the aperture, and the sensor size and resolution. Based on these settings, different images were rendered with different amounts of defocus blur. Figure 5.37 depicts an example of a generated stimulus. The blur measurement algorithm and equation 5.59 were then used to recover the depth position defined by design.
Figure 5.37: Example of stimulus used for evaluating the blur to depth conversion process.
Figure 5.38 depicts the performance of the algorithm. For various values of the aperture, the blur measurement from Zhuo [141] can appear noisy and tends to saturate. An alternative measurement algorithm from Chen [162] was therefore considered. Similarly to Zhuo’s method, this method detects the edges using the Canny edge detector. As a second step, for each pixel belonging to an edge, the normal to the edge is found and the distribution of the gradient across the edge is studied: the values of the gradient across the edge are considered as a one-dimensional series, which is analyzed in the frequency domain by applying a fast Fourier transform (FFT). The integral of the power spectrum is then used as a measurement of blur. Figure 5.40 depicts the overall process of blur estimation. The performance of this algorithm appeared to be much more stable and enabled a better estimation of the depth values from the defocus blur, as can be seen in Figure 5.39. Therefore, it is proposed to use the approach from Chen to obtain the sparse blur map, and Zhuo’s approach to estimate the dense depth map from these sparse defocus values.
Figure 5.38: Depth from blur with different apertures using the algorithm from Zhuo [141]
Figure 5.39: Depth from blur with different apertures using the blur measurement algorithm from Chen [162]
Figure 5.40: Blur estimation process from Chen. Figure copied from [162]
5.7.2.3 Depth cue indicator
In the case of natural images, the information about capture settings may not always be available. In many cases, it can be found in the EXIF data attached to the picture. Its absence can be critical when the depth value itself is targeted, for example to study the agreement between depth cues. When only the availability of a depth cue is studied, one can choose to only study the variation of the defocus across the dense depth map from defocus blur. As shown in equation 5.59, the relationship between blur measurements and depth is non-linear; therefore the results will differ depending on whether the indicators are built over the blur map or over the depth map from defocus blur. It can therefore only be recommended to document the chosen approach. To evaluate the performance of the depth from defocus blur indicators, the predictions of the algorithm are compared to the subjective ratings obtained in Section 4.5.3. Figure 5.41 depicts the relationship between the minimum amount of blur and the defocus depth cue. As expected, it can be seen that increasing the overall sharpness of the picture decreases the contribution of the depth from defocus: if there are no blurred areas, then there is no depth cue from defocus blur. However, having blurred areas does not necessarily result in a strong depth from defocus cue: the picture can just appear to have blurred areas. To develop a depth cue indicator, different factors have been considered. The percentiles 1% and 90% are used to consider the range of sharpness in the picture. The proportion of sharp areas compared to blurred areas is also measured, by using the sparse depth maps and the number of pixels on which it was possible to find edges and compute a blur value. From the data, it was observed that the squares of the percentiles 1% and 90% should also be considered. A linear regression between these three factors was performed, and the performance of the proposed indicator is depicted in Figure 5.42. As already found, the minimum amount of blur makes it possible to define a lower bound for the prediction of depth from defocus. The remaining challenge is to detect whether the presence of blur contributes to the depth from defocus. Looking into the range of the amount of blur, and hence the presence of sharp and blurred areas, is a way to investigate this question. However, as depicted in Figure 5.42, it was not enough to fully address the problem.
Figure 5.41: Relationship between minimum sharpness
and subjective rating for defocus blur depth cue.
Figure 5.42: Performance of proposed depth indicator
against subjective scores from Section 4.5.3.
5.7.3 Motion parallax
The motion parallax is the difference in apparent motion between objects at different distances in depth. Figure 5.44
depicts two examples, of motion parallax. On the left side, it is possible to distinguish a decrease in the amount of
motion in function of the distance to the viewer. On the right side, no such thing can be found, hence no motion
parallax contribute to the understanding of the depth. To capture this, a new metric was designed to evaluate the
amount of motion parallax. As depicted in Figure 5.44 motion parallax is a difference in apparent motion between
object at different distances in depth. It is then suggested to use binocular disparities to estimate the position in depth
of the objects and then relate this position in depth to the amount of motion which could be observed (Figure 5.43
depicts examples of dense depth map and dense optical flow revealing the presence of motion parallax). If there is
motion parallax, the motion should decrease in function of the depth. Binocular disparities are estimated as explained
in Section 5.3.1, a dense optical flow is also estimated similarly. Once the depth map and optical flow are available, the
second step of the algorithm is to discard the parts of the images where the maximum value of disparities is reached.
This is motivated by a concern of robustness: the algorithm target to relate planar motion and position in depth, but
considering that disparity values are used to estimate the depth, object too far away will have a constant depth value
and these areas may or may not show motion parallax. It is then proposed to limit the study to areas where motion
and depth are clearly known. To limit noise, for each depth plane, the average value of motion of pixels within the
depth plane is determined. Figure 5.45 depicts three squatter plots which represent how the average motion change
with the disparity values. The two images corresponding to the scatter plots on the right of the figure are images which
shows motion parallax: a clear link between the average motion and the position in depth can be found. The images
corresponding to the left-bottom part of the figure do not show variation of motion with the variation of depth. The
images corresponding to the left-top does not show a linear relation between binocular disparity and motion, this case
is similar to Figure 5.44 left where no motion parallax is visible. The RANSAC algorithm is then used with a linear
model to fit the relationship between binocular disparities and motion. After fitting, the case of the left-top scatter plot
of Figure 5.45 can be distinguished from the other cases due to the low performance of the fitting. In this specific
case, the algorithm decides that no motion parallax can be found in the image. The motion parallax is then defined as
$MP = \alpha = \operatorname{atan}\left(\frac{dy}{dx}\right)$, with $\frac{dy}{dx}$ the slope of the fitting curve determined by using RANSAC. This $\alpha$ value corresponds
to the $\alpha$ defined in Figure 5.44.
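The steps described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used in the thesis: the function name `motion_parallax_angle`, the flat per-pixel inputs, the residual tolerance `tol`, the minimum inlier ratio used to declare a failed fit, and the final least-squares refit on the inliers are all hypothetical choices.

```python
import math
import random
from collections import defaultdict


def motion_parallax_angle(disparities, motions, n_iter=200, tol=0.5,
                          min_inlier_ratio=0.6, seed=0):
    """Return alpha = atan(slope) in radians, or None if no reliable line is found.

    disparities, motions: flat sequences with one value per pixel
    (assumed inputs: disparity level and optical-flow magnitude).
    """
    d_max = max(disparities)
    # Step 1: discard pixels where the maximum disparity value is reached.
    per_plane = defaultdict(list)
    for d, m in zip(disparities, motions):
        if d < d_max:
            per_plane[d].append(m)
    # Step 2: average the motion within each depth plane to limit noise.
    points = sorted((d, sum(ms) / len(ms)) for d, ms in per_plane.items())
    if len(points) < 3:
        return None
    # Step 3: RANSAC with a linear model (two-point hypotheses).
    rng = random.Random(seed)
    best = []
    for _ in range(n_iter):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        slope = (y2 - y1) / (x2 - x1)  # x values are distinct depth planes
        intercept = y1 - slope * x1
        inliers = [(x, y) for x, y in points
                   if abs(y - (slope * x + intercept)) <= tol]
        if len(inliers) > len(best):
            best = inliers
    # Poor fit (assumed criterion): report that no motion parallax was found.
    if len(best) < min_inlier_ratio * len(points):
        return None
    # Step 4: least-squares slope on the inliers, then MP = atan(dy/dx).
    n = len(best)
    mx = sum(x for x, _ in best) / n
    my = sum(y for _, y in best) / n
    sxx = sum((x - mx) ** 2 for x, _ in best)
    sxy = sum((x - mx) * (y - my) for x, y in best)
    return math.atan(sxy / sxx)
```

Returning `None` mirrors the case where the algorithm decides that no motion parallax can be found; the threshold that triggers it is an assumption of this sketch.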
Figure 5.43: Evaluating motion parallax based on dense depth map and dense optical flow. The top images depict
estimated dense depth maps, and the bottom ones dense optical flow.
Figure 5.44: Illustration of motion parallax.
Figure 5.45: Scatter plots of motion as a function of disparity on four different images. Fitting and outlier removal are
obtained by employing RANSAC; inliers are circled in red.
5.7.4 Texture gradient
The texture gradient, as described in Section 2.2.2.2, can provide information on the perceived depth through different
factors. These are divided into three categories: perspective, compression and density [39]. Perspective: due to
linear perspective, the size of objects decreases with the distance to the observer, so the size of the individual
texture elements (texels) is affected by this phenomenon. Compression relates to the ratio between the width and
height of the texels: the aspect ratio of the texels is affected by the position. Finally, density refers to the spatial
distribution of the texels in the image.
Estimating depth from these different factors is a difficult challenge which goes beyond the scope of this thesis. Therefore,
the algorithm from Agrawal et al. [143] was used to recover shape from texture, and provides a depth map from
the texture analysis. Once this depth map is obtained, a challenge similar to the one described in Section 5.7.2.3 arises: the
establishment of a single indicator summarizing a dense depth map. It was decided to analyze the depth map
in the same way as for depth from defocus. Therefore the 1% and 90% percentiles of the depth
map values were considered, as well as the standard deviation. A multiple regression between these factors and the
subjective data from Section 4.5.3 is then used to find the weighting between the percentiles and the standard deviation of
depth values measured using the shape from texture algorithm. This linear combination is used in the further
steps of the thesis.
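As an illustration of this summarization step, the sketch below reduces a depth map to the three factors named above and fits a linear combination by ordinary least squares. It is a sketch under assumptions: the depth map is taken as a flat list of values, the percentiles use linear interpolation, the regression includes an intercept, and all function names are hypothetical. The shape-from-texture algorithm itself is not reproduced.

```python
import statistics


def percentile(values, q):
    """Linearly interpolated percentile, q in [0, 100]."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def depth_map_features(depth_values):
    """Summarize a dense depth map: 1% percentile, 90% percentile, std. dev."""
    return [percentile(depth_values, 1),
            percentile(depth_values, 90),
            statistics.pstdev(depth_values)]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x


def fit_weights(feature_rows, subjective_scores):
    """Least-squares weights [intercept, w_p1, w_p90, w_std] (normal equations)."""
    rows = [[1.0] + list(f) for f in feature_rows]
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * s for r, s in zip(rows, subjective_scores)) for i in range(n)]
    return solve(ata, atb)
```

Given one feature row per image and the matching subjective depth scores, `fit_weights` returns the weighting of the linear combination; the predicted score of a new image is the intercept plus the weighted sum of its three features.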
5.8 Depth cues pooling and reliability
To model global depth perception from individual depth cues, it is necessary to consider two main aspects. The first
one is that the individual depth cues do not contribute equally to the global depth score; the second, crucial
aspect is the confidence in each metric. Indeed, all metrics have a specific scope of application and might not always
estimate the features correctly. It is therefore necessary to define a weighting factor for each metric. This factor is based on the
estimation of the reliability of each individual metric considering the specific image under study. Different approaches
can be considered to address this issue and are described in this section.
5.8.1 Reliability and temporal consistency
In case a weak fusion model is considered (see chapter 2, section 2.3.2), the depth cues can be combined linearly.
Therefore the integration of depth cue reliability can be performed as described in equation 5.60, with $GD_t$ the global
depth score of a frame $t$, $ND$ the number of depth cues, $DC_{k,t}$ the depth cue score for the frame $t$ given by the metric described
in sections 5.4 and 5.7, $cw_k$ a confidence weighting factor, and $pw_k$ a weighting of the contribution of the depth cue $k$
compared to the others.
$$GD_t = \sum_{k=1}^{ND} cw_k \times pw_k \times DC_{k,t} \qquad (5.60)$$
The question of the reliability of the different metrics was addressed in [61]. Different approaches were considered to
perform the pooling of two depth cues. The most effective one was found to be the maximum likelihood estimation
model (MLE). This approach considers the variability of the subjective scores to define a weighting of each depth cue
(eq. 5.61).
$$cw_k = \frac{\sum_{i=1,\, i \neq k}^{ND} \sigma^2_{DC_i}}{\sum_{i=1}^{ND} \sigma^2_{DC_i}} \qquad (5.61)$$
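A small numerical sketch of equations 5.60 and 5.61 may help to see how the weighting behaves. The function names and the example values are hypothetical; only the two formulas come from the text. Note that, as written, the weights of eq. 5.61 sum to ND-1, which equals 1 in the two-cue case, where the formula reduces to the usual inverse-variance weighting.

```python
def confidence_weights(variances):
    """Eq. 5.61: cw_k = (sum of the other cues' variances) / (sum of all variances)."""
    total = sum(variances)
    return [(total - v) / total for v in variances]


def global_depth(cue_scores, cw, pw):
    """Eq. 5.60: weighted linear pooling of the depth cue scores of one frame."""
    return sum(c * p * s for c, p, s in zip(cw, pw, cue_scores))


# Two cues: the noisier cue (variance 3.0) receives the smaller weight.
cw = confidence_weights([1.0, 3.0])
gd = global_depth([2.0, 4.0], cw, [1.0, 1.0])
```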
A direct application of this approach to the present study is to consider the temporal variation of the evaluated
depth cues in video sequences. Indeed, it is expected to have, at least locally, a temporal consistency of the evaluated
depth scores. Other approaches based on in-depth analysis of each metric are also under study and should be considered,
since the depth score values can be stable but incorrect. However, this will not be addressed in this section.
Here only the temporal consistency of each depth cue over a temporal window will be checked. This temporal window
is also designed to be constrained to a scene. The confidence metric becomes as expressed in eq. 5.62, with $cw_{k,w}(t)$
the confidence of the depth cue $k$ for the frame $t$ considering a window $w$.
$$cw_{k,w}(t) = \frac{\sum_{i=1,\, i \neq k}^{ND} \sigma^2_{DC_i,w}(t)}{\sum_{i=1}^{ND} \sigma^2_{DC_i,w}(t)} \qquad (5.62)$$
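The windowed confidence of eq. 5.62 could be computed along the following lines. This is a sketch under assumptions: the window is symmetric around frame t, scene boundaries are given as a per-frame scene label, the variance is the population variance, and equal weights are returned when every cue is perfectly stable inside the window. None of these choices is specified in the text.

```python
import statistics


def windowed_confidence(cues, scene_ids, half_window=2):
    """Per-frame confidence weights following eq. 5.62.

    cues[t][k]: score of depth cue k at frame t.
    scene_ids[t]: scene label of frame t; the window never crosses a scene cut.
    """
    n_frames, n_cues = len(cues), len(cues[0])
    weights = []
    for t in range(n_frames):
        lo = max(0, t - half_window)
        hi = min(n_frames, t + half_window + 1)
        frames = [i for i in range(lo, hi) if scene_ids[i] == scene_ids[t]]
        var = [statistics.pvariance([cues[i][k] for i in frames])
               for k in range(n_cues)]
        total = sum(var)
        if total == 0:
            # Assumed fallback: all cues are stable, so weight them equally.
            weights.append([(n_cues - 1) / n_cues] * n_cues)
        else:
            weights.append([(total - v) / total for v in var])
    return weights
```

With this formulation a cue that fluctuates strongly inside the window, as the VPM does in the example discussed next, receives a weight close to 0, while a step in a cue at a scene cut does not count as instability.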
5.8.1.1 Ground truth database
As an example, this approach was applied to the database previously defined in section 4.4, which is
composed of 64 video sequences, each 10 s long, which were displayed at the highest quality available and contained
no visible compression artefacts. The observers were asked to rate on three scales: the depth quality, the QoE and the
visual discomfort.
Figure 5.46 depicts the results of the evaluation of the different depth cues on one particular video sequence. This
sequence is composed of two distinct scenes. Both scenes are natural and shot outdoors. The first one shows a close-up
of a hand collecting grapes on a grapevine. The scene is static; only the leaves of the grapevine move, due
to the wind and to the hand cutting off the grapes. The second scene shows many grapevines lined up towards the horizon.
The camera moves laterally, producing a pan. A clear motion parallax is then visible with the lines of trees. The lines
of trees allow seeing a clear vanishing point in the center of the image. In Figure 5.46 it is possible to observe a case
of failure in one of the metrics: the VPM did not always succeed in extracting the vanishing lines, which resulted in
high temporal variation of this depth cue. This metric is not trustworthy for this specific sequence and should then be
discarded. Interestingly, it is also possible to see that the second approach for evaluating linear perspective, the GLP,
succeeds in evaluating the linear perspective. This metric, based on a frequency analysis of the image, is better suited to
these specific cases of images where no clear edges can be found. A second case of failure can be observed with the
motion parallax metric: in the first scene, a lot of variation is visible. However, in the second scene, the motion parallax
is successfully captured and the evaluation is more stable. The proposed approach based on the MLE is then able to
capture explicit cases of failure in the different metrics.
5.8.2 Identification of cases of failure
Temporal consistency is only one criterion for identifying cases of failure; other criteria need to be taken into account
to detect cases where temporal stability is reached but the prediction is wrong. The particular case of
the linear perspective depth cue is presented in this section.
5.8.2.1 Type of image content
As described in Section 5.7.1, both algorithms, VPM and GLP, have been found to perform worse
on images classified as “natural” than on “urban” scenes. Therefore the identification
of the type of image content can provide a first indication of whether the characterization is likely to be reliable. Using
the GIST features from Oliva [159] and a cluster weighted model as suggested in [158], it is proposed to identify the
type of scene: “natural” or “urban” (Figure 5.47). The performance of the scene-type classification is depicted in
Figure 5.48 in a precision/recall diagram. The cluster weighted model provides continuous scores from 0 to 1, 0 for an
urban scene and 1 for a natural scene. Figure 5.48 then depicts the precision and recall for different values of the
threshold separating natural from urban scenes. The performance of the scene classifier is good
and enables the type of scene content to be detected. This can further be used to identify the cases of natural images where
the VPM can appear too unstable.
Figure 5.46: Example of depth cues evaluation on a video sequence. The z-scores are determined based on the mean
and standard deviation of each score over the entire database.
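How a precision/recall diagram is obtained from continuous classifier scores can be sketched as follows. The scores and labels are hypothetical inputs; the GIST features and the cluster weighted model are not reproduced here. The sketch assumes that "natural" is the positive class, that at least one natural image is present, and that precision is set to 1.0 when no image is predicted as natural.

```python
def precision_recall(scores, is_natural, thresholds):
    """Precision and recall of the 'natural' class for each threshold.

    scores: continuous outputs in [0, 1] (0 = urban, 1 = natural).
    is_natural: ground-truth booleans, one per image.
    """
    curve = []
    n_natural = sum(is_natural)
    for th in thresholds:
        predicted = [s >= th for s in scores]
        tp = sum(p and t for p, t in zip(predicted, is_natural))
        n_pred = sum(predicted)
        precision = tp / n_pred if n_pred else 1.0  # assumed convention
        recall = tp / n_natural
        curve.append((th, precision, recall))
    return curve
```

Sweeping the threshold from 0 to 1 and plotting the resulting pairs gives a diagram of the kind shown in Figure 5.48.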
Figure 5.47: Urban or natural?
Figure 5.48: Performance of scene type classification using the set of GIST features.
5.8.2.2 Number of vanishing points found
In the process of estimating the vanishing points in the images, the LSD algorithm [161] was used. It extracts
different strokes which are then classified by the J-Linkage algorithm. Figure 5.49 depicts different cases of extracted
strokes. In the left picture, only small strokes have been detected, resulting in a less stable
vanishing point estimation. On the contrary, in the rightmost picture the extracted strokes are long, resulting in a less
noisy vanishing point estimation. Figure 5.50 depicts an example of small strokes resulting in a noisy vanishing point. The
figure shows the location of the vanishing point at different resolutions. In this particular example, the vanishing point
which will be considered by the algorithm is consistent with what would have been expected, but a non-parametric
Kruskal-Wallis one-way analysis of variance shows that the factor “number of vanishing points” significantly affects
the prediction error of the algorithm (p<0.01, Chi-sq=179) (see Figure 5.51).
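For reference, the Kruskal-Wallis statistic used in such an analysis can be computed as sketched below. The grouping of prediction errors by number of vanishing points is a hypothetical input and no data from the thesis is reproduced. The function returns the tie-corrected H statistic, which is compared to a chi-square distribution with (number of groups - 1) degrees of freedom to obtain the p-value. The sketch assumes that the pooled values are not all identical.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of samples."""
    pooled = sorted(v for g in groups for v in g)
    n = len(pooled)
    # Assign to each distinct value the average of the ranks it occupies.
    rank = {}
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j
    h = 12.0 / (n * (n + 1)) * sum(
        sum(rank[v] for v in g) ** 2 / len(g) for g in groups) - 3 * (n + 1)
    # Correction for ties.
    ties = Counter(pooled)
    correction = 1 - sum(c ** 3 - c for c in ties.values()) / (n ** 3 - n)
    return h / correction
```

Here each group would hold the prediction errors of the images for which a given number of vanishing points was found.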
5.8.2.3 Summary on reliability evaluation for linear perspective instrumental measurements
In this subsection the question of identifying cases of failure was considered. Two different approaches were suggested.
The first one extracts features from the image: content classification is used to recognize the kind of picture,
natural or urban, one of which is more error-prone than the other. The second approach looks into the intermediate steps of
the prediction algorithms: it relates the length of the extracted strokes and the number of intermediate vanishing points to the
performance of the metrics. In both cases, this provides an indication of the accuracy of the prediction algorithm which
can further be used as a weighting in equation 5.61, or as an indication of the confidence in the content characterization.
Figure 5.49: Detection of vanishing lines.
5.8.3 Outcomes on reliability measurements
In this section, the issue of depth cue pooling was addressed. Similarly to the work conducted using subjective evaluation
methods, it is proposed to also take into account the reliability of instrumental measurements during pooling or,
more generally, when using a depth cue indicator based on algorithms. Several approaches were considered to perform
such a task. First, temporal consistency of the metric is expected, and checking it can help to identify cases of failure in
depth cue prediction. Secondly, the identification of cases of failure was studied on one specific example, the linear
perspective metric, at two distinct levels:
• The recognition of the image properties: natural vs. urban
• The study of parameters extracted from intermediate steps of the algorithm, e.g. the number of vanishing points and
the length of vanishing lines.
This makes it possible to better identify whether, in the context of use, the prediction algorithm will be reliable or not, and to improve the
process of decision-making based on these indicators.
Figure 5.50: Example of multiple vanishing points found due to several small strokes
Figure 5.51: Distribution of VPM prediction error depending on the number of vanishing points extracted in the image
5.9 Conclusion on depth characterization
In this chapter the question of characterizing the properties of 3D video sequences was studied. Both monocular and
binocular depth cues were considered, and depth indicators were developed to evaluate different cues:
• The binocular depth cues
• The linear perspective
• The defocus blur
• The texture gradient
• The motion parallax
Only little research has been done on the topic of characterizing monocular and binocular depth cues in natural images
for 3D video sequences. Therefore it was only possible to compare the proposed algorithms to a limited number of
other prediction algorithms. Even if there is a lot of room for improvement, these algorithms provide a novel way to
characterize 3D video sequences.
Considering that the performance of the proposed algorithms can vary depending on the type of image content, it was
proposed to study the performance of the metrics to determine the confidence in the prediction of each algorithm. This
was done either by analyzing the temporal consistency of the metrics, by studying the intrinsic properties of
the images, or by studying parameters from intermediate steps of the metrics, which provides a measure of the
prediction accuracy.
These metrics can be used for different purposes: 3D content classification, QoE models, 3D content selection for
subjective tests, depth modeling, etc. In this chapter, a first analysis of depth perception as a function of monocular
and binocular depth cues in natural images was also provided. The relation found is that monocular properties of the
images affected the binocular depth scores. However, considering the relation between the monocular scores and the
binocular measurements through instrumental measurements, the conclusion is rather that the relative size and linear
perspective depth cues affect the intrinsic properties of the images and thus imply multiple depth layers, which induce
a higher perceived depth.
Finally, considering the performance of the different monocular and binocular characterizations, it has to be mentioned
that it is not yet possible to train a weak model for depth perception based on the depth cue evaluation metrics.
Nevertheless, considering the current lack of standardized algorithms for 3D content characterization, the results
of this thesis and the different metrics developed can be used for further analysis on content characterization and
classification. The code of all the metrics developed in this thesis has been published and is freely available for
further research [163].
5.10 Key contributions
The key contributions of this chapter are the following:
• Monocular characteristics of the images affected the binocular depth scores. However, it is not clear whether the
monocular depth cues themselves affected the depth percept or if the presence of these depth cues implies specific
image properties which change the depth percept.
• Binocular and monocular depth cue indicators were developed to evaluate different cues: binocular depth cues,
linear perspective, defocus blur, texture gradient, motion parallax.
• The question of depth indicator reliability was addressed across different methods of measurement: using temporal
consistency, and features extracted from the studied images and from the metrics while estimating the cue value.
• Last but not least, the code of each individual metric has been published, enabling further studies to reuse the
work performed in this thesis for content characterization. The metrics can be accessed at the following doi:
http://dx.doi.org/10.5281/zenodo.16925
Chapter 6
Conclusion
The work performed in this thesis addresses different issues. It starts from the evaluation of Quality of Experience in 3D
video sequences, providing an overview of the related literature on QoE and visual depth perception. It was observed
that when test participants are asked to rate Quality of Experience, their ratings depend on their test-specific concept
of QoE. In particular, it was observed that they do not necessarily provide consistent ratings across test participants,
and use the scales (QoE, Visual comfort, and Depth) differently. The issue of consistency, agreement between test
participants, and the understanding of the scales by the test participants has been studied along this thesis.
The first test results obtained in this work showed that test participants do not necessarily rate 3D as providing a higher
QoE than 2D, and therefore do not necessarily take into account in their ratings the added value that 3D may provide:
the availability of binocular depth cues. To overcome this problem, the paired comparison test paradigm has been
used to evaluate the preference of different 2D and 3D stimuli, extending respective work on this topic presented in
the literature. Using this method, it has been possible to show the preference of 3D over 2D in specific conditions, and
to quantify the added value of 3D.
The added value of 3D was found to be content-dependent. The preference of 3D over 2D was found to decrease
with a decrease of image quality. This decrease of preference depends on the content properties. To evaluate 3D-QoE,
there is thus a need to characterize 3D video sequences. Since the added value of 3D is to bring binocular depth, the
work has been focused on the evaluation of depth, starting with depth in natural images. Depth was first assessed
in viewing tests and, in a second step, using prediction algorithms. Considering the fact that depth perception results from
different monocular and binocular depth cues, different tests involving test participants have been conducted to evaluate
the depth in images and videos. However, the question of the subjective scores’ reliability and agreement between test
participants was raised. It was shown that test participants do not necessarily understand the different depth scales in
the same manner. Therefore the research effort of this thesis was focused on defining and assessing depth cues. A series
of studies has been conducted to evaluate subjective test methods and develop a simple way for test participants to
evaluate monocular and binocular depth cues in natural images. Different approaches using pairwise comparison and
ranking of images have been proposed and were shown to enable the acquisition of rating data with increased reliability.
Based on the subjective scores obtained in these tests, new prediction algorithms were designed to characterize the
properties of 3D video sequences: the overall perceived depth, and different underlying monocular and binocular depth
cues. An algorithm to predict the perceived depth from binocular depth cues, performing an object-based analysis of
the scene, was established, making it possible to monitor 3D content properties in videos. Additionally, different monocular depth
cue indicators for defocus blur, linear perspective, texture gradient and motion parallax were defined. The accuracy of
the prediction algorithms was found not to be always optimal. Therefore, similarly to the analysis of the data collected
from test participants, it has been proposed to study the performance of and trust in the different metrics. It has been
proposed to study different aspects such as the temporal consistency, image classification, and features of the metrics
to quantify the prediction accuracy.
The development of these metrics was of particular interest since until now no standardized way to characterize the
properties of 3D contents has been proposed. One major contribution of this thesis is therefore the open-source
publication of all of these indicators, enabling further research to characterize the content properties with the algorithms
developed in this thesis.
From a visual perception point of view, it was difficult to draw strong conclusions about the depth perception and the
relation between monocular and binocular depth cues. A relationship between monocular and overall depth perception
could be found. However, from these data it is not possible to conclude to which extent the monocular depth cues
contribute to the overall depth perception or if these depth cues affect the image-intrinsic properties which then affect
the overall perception of the scene.
Chapter 7
Further work
Across the thesis, different aspects have been addressed, from visual perception studies to content characterization,
including Quality of Experience research and depth perception prediction. The final goal of the thesis is to address
Quality of Experience and its relation to content properties, that is, the contents’ depth properties. As a consequence,
each of the addressed aspects, from the perception studies to image analysis, was fundamental. Due to the variety of
topics which have been addressed, there are several possibilities for further research along different axes.
Quality of Experience research: On the topic of Quality of Experience research, further work could relate
the content characterization performed in this thesis to the preference of 3D over 2D. The work described in this
thesis has related content, image quality (due to coding) and preference of 3D over 2D. We have also found that 2D
image quality algorithms performed relatively well in predicting 3D image quality (with traditional 2D coding schemes).
Moreover, algorithms for content characterization were also proposed. However, the work on the link between the subjective
ratings from our experiments on preference of 3D over 2D and the content properties and respective quality requirements
as a function of the proposed content-specific depth descriptor still needs to be continued, based on an even
larger set of ratings. Analyses have been performed in this direction within the thesis, but this requires a large number
of subjective tests, as many source contents are needed. In this work, it was therefore decided to focus on content characterization.
Content characterization research: Another extension of this work could be to take into account the different
depth descriptors defined in this work, analyze the agreement and contradiction between them, and in addition measure
distortions of the 3D geometry of 3D contents. Some work has been performed in this direction in the literature
for the case where camera parameters are known (Chen [12]); however, our approach through image processing would
provide a new way to look at this issue and predict QoE of source content without prior knowledge of camera settings.
To this aim, instead of 3D QoE assessment in terms of preference, the different depth cue indicators produced
in this work could be evaluated in terms of how well they allow the depth quality of the 3D rendering to be measured.
For example, this could be used to evaluate the cardboard effect. It could also be interesting to study when depth cues
contradict each other. This could be used for further analysis of depth cue prediction reliability, but it could also be
used to address visual discomfort issues.
The characterization of monocular depth cues can also be used beyond 3D. Indeed, it could be used to measure aesthetic
appeal in pictures, for example by considering vanishing lines, the position of the vanishing points and common
rules of photography. Defocus blur metrics could also be used in the context of Ultra High Definition contents: evaluating
the distribution of the sharpness across the images, and relating it to what observers may perceive in the side areas
of their field of view.
Finally, other depth cues not instrumentally characterized in this thesis need to be addressed in future research.
Perception studies: Last but not least, the perception studies on natural images were one of the most challenging
topics addressed in this thesis, and could be extended. To enable stronger conclusions, it is proposed to extend the work
to content designed using 3D rendering software. The use of natural images in this thesis has provided some insights
on the depth perception and the relative importance of each depth cue. However, it would be beneficial to also consider
content where only one depth cue is varied at a time, instead of evaluating natural images as done in this thesis, where
different depth cues are present at the same time, which requires each depth cue to be evaluated in addition to the overall
depth. This will make it possible to dig deeper into the construction of a 3D vision model based on monocular and binocular
depth cues as stated in this thesis.
References
1. Patrick Le Callet, Sebastian Möller, and Andrew Perkis, “Qualinet white paper on definitions of quality of experience,” European
Network on Quality of Experience in Multimedia Systems and Services (COST Action IC 1003), vol. Version 1.2, March 2012.
2. Pieter J.H. Seuntiëns, Visual experience of 3D TV, Ph.D. thesis, Eindhoven University, 2006.
3. Pierre Lebreton, Alexander Raake, Marcus Barkowsky, and Patrick Le Callet, “A subjective evaluation of 3D IPTV broadcasting
implementations considering coding and transmission degradation,” in IEEE International Workshop on Multimedia Quality of
Experience, MQoE11, Dana Point, CA, USA, 2011.
4. Kun Wang, Marcus Barkowsky, Kjell Brunnström, Marten Sjöström, Romain Cousseau, and Patrick Le Callet, “Perceived 3D TV
transmission quality assessment: Multi-laboratory results using Absolute Category Rating on Quality of Experience scale,” IEEE
Transactions on Broadcasting, vol. 58, pp. 544–557, 2012.
5. Wijnand IJsselsteijn, Huib De Ridder, Jonathan Freeman, and S. E. Avons, “Presence: Concept, determinants and measurement,” in
Proceedings of the SPIE, 2000, vol. 3959, pp. 520–529.
6. Wijnand Ijsselsteijn, Pieter J.H. Seuntiëns, and L. Meesters, “ATTEST Deliverable1: State of the art in human factors and quality
issues of stereoscopic broadcast television,” Tech. Rep., Eindhoven University of Technology, 2002.
7. Wijnand Ijsselsteijn, Huib De Ridder, Jonathan Freeman, S. E. Avons, and Don Bouwhuis, “Effects of Stereoscopic Presentation,
Image Motion, and Screen Size on Subjective and Objective Corroborative Measures of Presence,” Presence: Teleoperators and
Virtual Environments, vol. 10, no. 3, pp. 298–311, 2001.
8. Pierre Lebreton, Marcus Barkowsky, Alexander Raake, and Patrick Le Callet, Chapter 3D Video, “Quality of Experience - Advanced
Concepts, Applications and Methods”, Springer, 2014.
9. Marc Lambooij, Wijnand IJsselsteijn, Don G. Bouwhuis, and Ingrid Heynderickx, “Evaluation of Stereoscopic Images: Beyond 2D
Quality,” IEEE Transactions on Broadcasting, vol. 57, no. 2, pp. 432–444, 2011.
10. Lew Stelmach, Wa James Tam, Dan Meegan, and André Vincent, “Stereo image quality: Effects of mixed spatio-temporal resolution,”
Circuits and Systems for Video Technology, IEEE Transactions on, vol. 10, no. 2, pp. 188–193, March 2000.
11. Kazuhisa Yamagishi, Lina Karam, Jun Okamoto, and Takanori Hayashi, “Subjective characteristics for stereoscopic high definition
video,” in Third International Workshop on Quality of Multimedia Experience, QoMEX, Mechelen, Belgium, 2011.
12. Wei Chen, Jérôme Fournier, Marcus Barkowsky, and Patrick Le Callet, “Exploration of quality of experience of stereoscopic images:
binocular depth,” in International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), Scottsdale,
Arizona, USA, 2012.
13. Atanas Boev, Danilo Hollosi, Atanas Gotchev, and Karen Egiazarian, “Classification and simulation of stereoscopic artifacts in
mobile 3DTV content,” in Proc. SPIE 7237, Stereoscopic Displays and Applications XX, 2009, vol. 7237.
14. Randolph Blake, “Threshold conditions for binocular rivalry,” Journal Experimental Psychology Human Perception Performance,
vol. 3, no. 2, pp. 251–257, 1977.
15. Marc Lambooij, Wijnand IJsselsteijn, Marten Fortuin, and Ingrid Heynderickx, “Visual discomfort and visual fatigue of stereoscopic
displays: A review,” Journal of Imaging Science and Technology, vol. 53(3), pp. 1–14, 2009.
16. Jong-Seok Lee, Lutz Goldmann, and Touradj Ebrahimi, “Paired comparison-based subjective quality assessment of stereoscopic
images,” Multimedia Tools and Applications, pp. 1–18, February 2012.
17. A comprehensive Database and subjective evaluation methodology for quality of experience in stereoscopic video, 2010.
18. Peter G. Engeldrum, “Image quality modeling: Where are we?,” in IS&T PICS Conference Proceedings, 1999, pp. 251–255.
19. Marc Lambooij, Visual Comfort of 3-D TV: Models and Measurements, Ph.D. thesis, Eindhoven University of Technology, Department
of Industrial Engineering and Innovation Sciences, Human-Technology Interaction Group, 2012.
20. Heinrich H. Bülthoff and Hanspeter A. Mallot, “Integration of depth modules: stereo and shading,” Journal of the Optical Society of
America, vol. 5, no. 10, pp. 1749–1758, October 1988.
21. Hirokazu Yamanoue, Makoto Okui, and Ichiro Yuyama, “A study on the relationship between shooting conditions and cardboard
effect of stereoscopic images,” Circuits and Systems for Video Technology, IEEE Transactions on, vol. 10, no. 3, pp. 411 – 416, 2000.
22. Hirokazu Yamanoue, Makoto Okui, and Fumio Okano, “Geometrical analysis of puppet-theater and cardboard effects in stereoscopic
HDTV images,” IEEE Transaction on circuits and systems for video technology, vol. 16, pp. 744 – 752, 2006.
23. Jae-Hyun Jung, Jiwoon Yeom, Jisoo Hong, Keehoon Hong, Sung-Wook Min, and Byoungho Lee, “Effect of fundamental depth
resolution and cardboard effect to perceived depth resolution on multi-view display,” Optics Express, vol. 19, no. 21, pp. 20468–
20482, 2011.
24. Wei Chen, Jérôme Fournier, Marcus Barkowsky, and Patrick Le Callet, “New stereoscopic video shooting rule based on stereoscopic
distortion parameters and comfortable viewing zone,” Stereoscopic Displays and Applications XXII. Proceedings of the SPIE, vol.
7863, pp. 78631O–78631O–13, 2011.
25. James E. Cutting and Peter M. Vishton, Perceiving layout and knowing distance: The integration, relative potency and contextual
use of different information about depth, New York: Academic Press, 1995.
26. Neil A. Dodgson, “Variation and extrema of human interpupillary distance,” in Proceedings SPIE Vol. 5291, Stereoscopic Displays
and Virtual Reality Systems XI, 2004, pp. 36–46.
27. R. Patterson and W. L. Martin, “Human stereopsis,” Hum. Factors, vol. 34, pp. 669–692, 1992.
28. P. Howard and B. J. Rogers, Seeing in Depth: Depth Perception, vol. 2, Porteous Publishing, Toronto, 2002.
29. Y. Y. Yeh and L. D. Silverstein, “Limits of fusion and depth judgement in stereoscopic color displays,” Hum. Factors, vol. 32, pp.
45–60, 1990.
30. C. Schor, I. Wood, and J. Ogawa, “Binocular sensory fusion is limited by spatial resolution,” Vision Research, vol. 24, pp. 661–665,
1984.
31. Steven H. Ferris, “Motion parallax and absolute distance,” Journal of experimental psychology, vol. 95, no. 2, pp. 258–263, 1972.
32. “http://en.wikipedia.org/wiki/depth perception,” .
33. Michael T. Swanston and Walter C. Gogel, “Perceived size and motion in depth from optical expansion,” Perception &
Psychophysics, vol. 39, no. 5, pp. 309–326, 1986.
34. William H. Ittelson, “Size as a cue to distance: Radial motion,” American Journal of Psychology, vol. 64, no. 2, pp. 188–202, 1951.
35. ,” .
36. James E. Cutting, “How the eye measures reality and virtual reality,” Behavior Research Methods, Instruments, & Computers, vol.
29, no. 1, pp. 27–36, 1997.
37. Irving B. Weiner, Handbook of Psychology, Experimental Psychology, Wiley, 2012.
38. Rita Sousa, Eli Brenner, and Jeroen B. J. Smeets, “Judging an unfamiliar object’s distance from its retinal image size,” Journal of
Vision, vol. 11, no. 9, pp. 1–6, 2011.
39. Cutting and Millard, “Three gradients and the perception of flat and curved surfaces,” Journal of Experimental Psychology: General,
vol. 113, pp. 198–216, 1984.
40. H. Sedgwick, “The visible horizon: A potential source of visual information for the perception of size and distance,” Dissertation
Abstracts International, vol. 34, no. 73, pp. 1301–1302, 1973.
41. James Elkins, “Renaissance perspectives,” Journal of the History of Ideas, vol. 53, no. 2, pp. 209–230, 1992.
42. Pamela Taylor, The Notebooks of Leonardo Da Vinci, The New American Library, 1960.
43. John S. Watson, Martin S. Banks, Claes von Hofsten, and Constance S. Royden, “Gravity as a monocular cue for perception of
absolute distance and/or absolute size,” Perception, vol. 21, no. 1, pp. 69–76, 1992.
44. Heinrich H. Bülthoff and Hanspeter A. Mallot, “Interaction of different modules in depth perception,” in Proceedings of 1st
International Conference on Computer Vision, 1987, pp. 295–305.
45. Ian P. Howard and Brian J. Rogers, Binocular Vision and Stereopsis, Oxford Psychology Series no. 29, Oxford University Press,
1995.
46. I. P. Howard and W. B. Templeton, Human Spatial Orientation, John Wiley and Sons, 1967.
47. Mark Young, Michael S. Landy, and Laurence T. Maloney, “A perturbation analysis of depth perception from combinations of texture
and motion cues,” Vision Research, vol. 33, no. 18, pp. 2685–96, 1993.
48. James J. Clark and Alan L. Yuille, Data Fusion for Sensory Information Processing Systems, vol. 105, The Springer International
Series in Engineering and Computer Science, 1990.
49. Michael S. Landy, Laurence T. Maloney, Elizabeth B. Johnston, and Mark Young, “Measurement and modeling of depth cue combi-
nation: in defense of weak fusion,” Vision Research, vol. 35, no. 3, pp. 389–412, 1995.
50. Ken Nakayama and Shinsuke Shimojo, “Experiencing and perceiving visual surfaces,” Science, vol. 257, pp. 1357–1363, 1992.
51. Barbara Anne Dosher, George Sperling, and Stephen A. Wurst, “Tradeoffs between stereopsis and proximity luminance covariance
as determinants of perceived 3D structure,” Vision Research, vol. 26, no. 6, pp. 973–990, 1986.
52. Nicolas Bruno and James E. Cutting, “Minimodularity and the perception of layout,” Journal of Experimental Psychology: General,
vol. 117, no. 2, pp. 161–170, 1988.
53. Elizabeth B. Johnston, Bruce G. Cumming, and Andrew J. Parker, “Integration of depth modules: Stereopsis and texture,” Vision
Research, vol. 33, no. 5/6, pp. 813–826, 1993.
54. David Buckley and John P Frisby, “Interaction of stereo, texture and outline cues in the shape perception of three-dimensional ridges,”
Vision Research, vol. 33, no. 7, pp. 919–933, 1993.
55. Michael S. Landy, Laurence T. Maloney, and Mark J. Young, “Psychophysical estimation of the human depth combination rule,”
Sensor fusion III: 3-D Perception and recognition, SPIE Proceedings, vol. 1383, pp. 247–254, 1991.
56. Brian J Rogers and Thomas S Collett, “The appearance of surfaces specified by motion parallax and binocular disparity,” Journal of
Experimental Psychology Section A: Human Experimental Psychology, vol. 41, no. 4, pp. 697–717, 1989.
57. Junle Wang, Marcus Barkowsky, Vincent Ricordel, and Patrick Le Callet, “Quantifying how the combination of blur and disparity
affects the perceived depth,” in Proceedings of the SPIE, Human Vision and Electronic Imaging XVI, 2011, vol. 7865, pp.
78650K–78650K-10.
58. Robert T. Held, Emily A. Cooper, and Martin S. Banks, “Blur and disparity are complementary cues to depth,” Current Biology, vol.
22, pp. 426–431, 2012.
59. Marc O. Ernst and Martin S. Banks, “Humans integrate visual and haptic information in a statistically optimal fashion,” Nature, vol.
415, pp. 429–433, 2002.
60. James M. Hillis, Simon J. Watt, Michael S. Landy, and Martin S. Banks, “Slant from texture and disparity cues: Optimal cue
combination,” Journal of Vision, vol. 4, pp. 967–992, 2004.
61. Paul G. Lovell, Marina Bloj, and Julie M. Harris, “Optimal integration of shading and binocular disparity for depth perception,”
Journal of Vision, vol. 12, pp. 1–18, 2012.
62. Dominic W. Massaro, “Ambiguity in perception and experimentation,” Journal of Experimental Psychology: General, vol. 117, no.
4, pp. 417–421, 1988.
63. Kenneth N. Ogle, “A new phenomenon in binocular space perception associated with the relative size of the images of the two eyes,”
Archives of Ophthalmology, vol. 20, no. 4, pp. 604–623, 1938.
64. Myron L. Braunstein, George J. Andersen, and David M. Riefer, “The use of occlusion to resolve ambiguity in parallel projections,”
Perception & Psychophysics, vol. 31, no. 3, pp. 261–267, 1982.
65. Andrew Blake and Heinrich Bülthoff, “Shape from specularities: computation and psychophysics,” Philosophical Transactions of
the Royal Society of London, vol. 331, no. 1260, pp. 237–52, 1991.
66. Alan L. Yuille and Heinrich H. Bülthoff, “Bayesian decision theory and psychophysics,” in Perception as Bayesian Inference.
1994, pp. 123–161, University Press.
67. Ahna R. Girshick and Martin S. Banks, “Probabilistic combination of slant information: weighted averaging and robustness as
optimal percepts,” Journal of Vision, vol. 9, pp. 1–20, 2009.
68. Raymond van Ee, Wendy J. Adams, and Pascal Mamassian, “Bayesian modeling of cue interaction: bistability in stereoscopic slant
perception,” Journal of the Optical Society of America A, vol. 20, no. 7, pp. 1398–1406, 2003.
69. Walter C. Gogel, “An indirect method of measuring perceived distance from familiar size,” Perception & Psychophysics, vol. 20, no.
6, pp. 419–429, 1976.
70. Elizabeth B. Johnston, “Systematic distortion of shape from stereopsis,” Vision Research, vol. 31, no. 7/8, pp. 1351–1360, 1991.
71. Elaine W. Jin, Brian W. Keelan, Junqing Chen, Jonathan B. Phillips, and Ying Chen, “Softcopy quality ruler method: Implementation
and validation,” in Proceedings of SPIE-IS&T Electronic Imaging, San Jose, CA, USA, 2009, vol. 7242.
72. Kent A. Stevens and Allen Brook, “Integrating stereopsis with monocular interpretations of planar surfaces,” Vision Research, vol.
28, no. 3, pp. 371–386, 1988.
73. James S. Tittle and Myron L. Braunstein, “Recovery of 3-D shape from binocular disparity and structure from motion,” Perception
& Psychophysics, vol. 54, no. 2, pp. 157–169, 1993.
74. Brian J Rogers and Maureen Graham, “Similarities between motion parallax and stereopsis in human depth perception,” Vision
Research, vol. 22, pp. 261–270, 1981.
75. W. C. Clarke, A. H. Smith, and A. Rabe, “Retinal gradients of outline distortion and binocular disparity as stimuli for slant,” Canadian
Journal of Experimental Psychology, vol. 10, pp. 1–8, 1956.
76. Martin S. Banks, Jenny C. A. Read, Robert S. Allison, and Simon J. Watt, “Stereoscopy and the human visual system,” SMPTE Mot.
Imag. J., vol. 121, no. 4, pp. 24–43, May-June 2012.
77. E. H. Adelson, “Rigid objects that appear highly non-rigid,” Investigative Ophthalmology and Visual Science, vol. 26, pp. 3–56, 1985.
78. Matthieu Urvoy, Marcus Barkowsky, and Patrick Le Callet, “How visual fatigue and discomfort impact 3D-TV quality of experience:
a comprehensive review of technological, psychophysical, and psychological factors,” Annals of Telecommunications, vol. 68, pp.
641–655, 2013.
79. M. Emoto, T. Niida, and F. Okano, “Repeated vergence adaptation causes the decline of visual functions in watching stereoscopic
television,” Journal of Display Technology, vol. 1, pp. 328–340, 2005.
80. W. A. IJsselsteijn, P. H. J. Seuntiens, and L. M. J. Meesters, 3D Videocommunication-Algorithms, Concepts and Real-Time Systems
in Human-centred Communication, chapter Human Factors of 3D Display, pp. 219–234, John Wiley and Sons, 2005.
81. Kazuhiko Ukai and Peter A. Howarth, “Visual fatigue caused by viewing stereoscopic motion images: Background, theories, and
observations,” Displays, vol. 29, no. 2, pp. 106–116, 2008.
82. D. A. Goss and Z. Huifang, “Clinical and laboratory investigations of the relationship of accommodation and convergence function
with refractive error. A literature review,” Doc. Ophthalmology, vol. 86, pp. 349–380, 1994.
83. M. Wöpking, “Viewing comfort with stereoscopic pictures: an experimental study on the subjective effects of disparity magnitude
and depth of focus,” Journal of the Society for Information Display, vol. 3, pp. 101–103, 1995.
84. C. Sheard, “The prescription of prisms,” American Journal of Optometry, vol. 11, no. 10, pp. 364–378, 1934.
85. Yuji Nojiri, Hirokazu Yamanoue, Atsuo Hanazato, and Fumio Okano, “Measurement of parallax distribution and its application to
the analysis of visual comfort for stereoscopic HDTV,” in Proc. SPIE 5006, Stereoscopic Displays and Virtual Reality Systems X, p. 195,
May 2003.
86. Sumio Yano, Masaki Emoto, and Tetsuo Mitsuhashi, “Two factors in visual fatigue caused by stereoscopic HDTV images,” Displays,
vol. 25, pp. 141–150, 2004.
87. Filippo Speranza, Wa James Tam, Ron Renaud, and Namho Hur, “Effect of disparity and motion on visual comfort of stereoscopic
images,” in Stereoscopic Displays and Virtual Reality Systems XIII, 2006, vol. 6055.
88. ITU-R Recommendation BT.1438, “Subjective assessment of stereoscopic television pictures,” 2000.
89. Wei Chen, Jérôme Fournier, Marcus Barkowsky, and Patrick Le Callet, “New stereoscopic video shooting rule based on stereoscopic
distortion parameters and comfortable viewing zone,” Stereoscopic Displays and Applications XXII. Proceedings of the SPIE, vol.
7863, pp. 78631O–78631O-13, 2011.
90. Bernard Mendiburu, 3D Movie Making: Stereoscopic Digital Cinema from Script to Screen, Focal Press, 2009.
91. Andrew Woods, Tom Docherty, and Rolf Koch, “Image distortions in stereoscopic video systems,” in Proceedings of the SPIE,
Stereoscopic Displays and Applications IV, 1993, vol. 1915, pp. 36–48.
92. Wei Chen, Jérôme Fournier, Marcus Barkowsky, and Patrick Le Callet, “New requirements of subjective video quality assessment
methodologies for 3DTV,” in Video Processing and Quality Metrics 2010 (VPQM), Scottsdale, USA, 2010.
93. A. Smolic, K. Mueller, P. Merkle, P. Kauff, and T. Wiegand, “An overview of available and emerging 3D video formats and depth
enhanced stereo as efficient generic solution,” in Picture Coding Symposium, 2009.
94. Pierre Lebreton, Alexander Raake, Marcus Barkowsky, and Patrick Le Callet, “A subjective and objective evaluation of a realistic 3D
IPTV transmission chain,” in Packet Video Workshop, Munich, Germany, 2012.
95. Pierre Lebreton, Alexander Raake, Marcus Barkowsky, and Patrick Le Callet, “Perceptual preference of S3D over 2D for HDTV in
dependence of video quality and depth,” in IVMSP Workshop: 3D Image/Video Technologies and Applications, Seoul, Korea, 2013.
96. “http://x264.nl/,” .
97. “http://sirannon.atlantis.ugent.be/,” .
98. ITU-R BT.500-12, “Methodology for the subjective assessment of the quality of television pictures,” 2009.
99. Manuel Werlberger, Thomas Pock, and Horst Bischof, “Motion Estimation with Non-Local Total Variation Regularization,” in IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA, USA, 2010.
100. Kun Wang, Marcus Barkowsky et al., “Subjective evaluation of HDTV stereoscopic videos in IPTV scenarios using absolute category
rating,” EI2011, 2011.
101. Kjell Brunnström, Inigo Sedano, Kun Wang, Marcus Barkowsky, Maria Kihl, Börje Andrén, Patrick Le Callet, Marten Sjöström, and
Andreas Aurelius, “2D no-reference video quality model development and 3D video transmission quality,” in International Workshop
on Video Processing and Quality Metrics for Consumer Electronics (VPQM), Scottsdale, Arizona, USA, 2012.
102. ITU-R Rec. BT.2021, “Subjective methods for the assessment of stereoscopic 3DTV systems,” International Telecommunication Union
(ITU), 2012.
103. Marcus Barkowsky, Jing Li, Taehwan Han, Sungwook Youn, Jiheon Ok, Chulhee Lee, Christer Hedberg, Indirajith Vijai Ananth, Kun
Wang, Kjell Brunnström, and Patrick Le Callet, “Towards standardized 3DTV QoE assessment: Cross-lab study on display technology
and viewing environment parameters,” in Stereoscopic Displays and Applications XXIV, vol. 8648, San Francisco, United States,
2013.
104. Ulrich Engelke, Yohann Pitrey, and Patrick Le Callet, “Towards an inter-observer analysis framework for multimedia quality
assessment,” in International Workshop on Quality of Multimedia Experience (QoMEX), Mechelen, 2011, pp. 183–188.
105. F. Kozamernik, V. Steinmann, P. Sunna, and E. Wyckens, “SAMVIQ-A New EBU Methodology for Video Quality Evaluations in
Multimedia,” SMPTE Mot. Imag., pp. 152–160, April 2005.
106. Q Huynh-Thu, MD Brotherton, D Hands, and K Brunnström, “Examination of the SAMVIQ subjective assessment methodology,” in
Third Inter. Workshop on Video Processing and Quality Metrics for Consumer Electronics, Scottsdale, AZ, USA, 2007.
107. “http://3dtv.at/products/player/index de.aspx,” .
108. Alexandre Benoit, Patrick Le Callet, Patrizio Campisi, and Romain Cousseau, “Quality assessment of stereoscopic images,” in IEEE
International Conference on Image Processing , ICIP, San Diego, California, USA, 2008, pp. 1231–1234.
109. ITU-T Contribution COM 12-C192-E, “Comparison of the ACR and PC evaluation methods concerning the effects of video resolu-
tion and size on visual subjective ratings,” in ITU, SG12 Meeting, Geneva, Jan 2011.
110. J. C. Handley, “Comparative analysis of Bradley-Terry and Thurstone-Mosteller model of paired comparisons for image quality
assessment,” in PICS, April 2001.
111. Jing Li, Marcus Barkowsky, and Patrick Le Callet, “Analysis and improvement of a paired comparison method in the application of
3DTV subjective experiment,” in ICIP, Orlando, Florida, USA, October 2012.
112. Marcus Barkowsky, Romain Cousseau, and Patrick Le Callet, “Influence of depth rendering on the quality of experience for an
autostereoscopic display,” in International Workshop on Quality of Multimedia Experience, San Diego, California, USA, 07 2009,
p. 6.
113. Pierre Lebreton, Alexander Raake, Marcus Barkowsky, and Patrick Le Callet, “Evaluating depth perception of 3D stereoscopic
videos,” IEEE Journal of Selected Topics in Signal Processing, vol. 6, pp. 710–720, October 2012.
114. Matthieu Urvoy et al., “NAMA3DS1-COSPAD1: Subjective video quality assessment database on coding conditions introducing
freely available high quality 3D stereoscopic sequences,” in Fourth International Workshop on Quality of Multimedia Experience, Yarra Valley,
July 2012.
115. Roumes C, Plantier J, Menu JP, and Thorpe S, “The effects of spatial frequency on binocular fusion: from elementary to complex
images,” Human Factors, vol. 39, no. 3, pp. 359–373, Sep 1997.
116. ITU-T Recommendation P.910, “Subjective video quality assessment methods for multimedia applications,” 2008.
117. Otto Dykstra, Jr., “Rank analysis of incomplete block designs: A method of paired comparisons employing unequal repetitions on
pairs,” Biometrics, vol. 16, no. 2, pp. 176–188, June 1960.
118. Pierre Lebreton, Alexander Raake, Marcus Barkowsky, and Patrick Le Callet, “Measuring perceived depth in natural images and
study of its relation with monocular and binocular depth cues,” in Stereoscopic Displays and Applications XXV, San Francisco,
California, USA, 2014.
119. Pierre Lebreton, Alexander Raake, Marcus Barkowsky, and Patrick Le Callet, “Evaluating complex scales through subjective rank-
ing,” in International Workshop on Quality of Multimedia Experience (QoMEX), Singapore, 2014.
120. Liyuan Xing, Junyong You, Touradj Ebrahimi, and Andrew Perkis, “Assessment of stereoscopic crosstalk perception,” IEEE
Transactions on Multimedia, vol. 14, no. 2, pp. 326–337, April 2012.
121. Liyuan Xing, Junyong You, Touradj Ebrahimi, and Andrew Perkis, “Factors impacting quality of experience in stereoscopic images,”
in Proceedings of SPIE - The International Society for Optical Engineering, San Francisco, California, USA, 2011, vol. 7863.
122. Lutz Goldmann, Francesca De Simone, and Touradj Ebrahimi, “Impact of acquisition distortions on the quality of stereoscopic
images,” in 5th International Workshop on Video Processing and Quality Metrics for Consumer Electronics (VPQM), Scottsdale,
USA, 2010.
123. homepage of, “http://mmspg.epfl.ch/3diqa,” last access in January 2014.
124. Karel Fliegel, Stanislav Vítek, Milos Klíma, and Petr Páta, “Open source database of images DEIMOS: high dynamic range and
stereoscopic content,” in Proc. SPIE 8135, Applications of Digital Image Processing XXXIV, 81351T, September 2011.
125. homepage of, “http://www.deimos-project.cz/tag/stereo,” last access in January 2014.
126. E. Cheng, P. Burton, J. Burton, A. Joseski, and I. Burnett, “RMIT3DV: Pre-announcement of a Creative Commons uncompressed
HD 3D video database,” in Proc. 4th International Workshop on Quality of Multimedia Experience (QoMEX 2012), Yarra Valley,
Australia, 2012.
127. homepage of, “http://www.rmit3dv.com/download.php,” last access in January 2014.
128. homepage of, “http://www.elephantsdream.org/,” last access in January 2014.
129. J. L. Fleiss and J. Cohen, “The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability,”
Educational and Psychological Measurement, vol. 33, pp. 613–619, 1973.
130. Berthold Klaus Paul Horn and Brian G. Schunck, “Determining optical flow,” Artificial Intelligence, vol. 17, pp. 185–203, 1981.
131. Manuel Werlberger, Convex Approaches for High Performance Video Processing, Ph.D. thesis, Graz University of Technology,
Institute for Computer Graphics and Vision, 2012.
132. C. Schnörr, “Segmentation of visual motion by minimizing convex non-quadratic functionals,” Proceedings of 12th International
Conference on Pattern Recognition, vol. 1, pp. 661–663, 1994.
133. J. Weickert and C. Schnörr, “A theoretical framework for convex regularizers in PDE-based computation of image motion,”
International Journal of Computer Vision, vol. 45, no. 3, pp. 245–264, 2001.
134. L. Alvarez, J. Esclarín, M. Lefebure, and J. Sánchez, “A PDE model for computing the optical flow,” Proceedings XVI Congreso de
Ecuaciones Diferenciales y Aplicaciones C.E.D.Y.A. XVI, vol. 1, pp. 1349–1356, 1999.
135. I. Gheta, C. Frese, M. Heizmann, and J. Beyerer, “A new approach for estimating depth by fusing stereo and defocus information,” in
INFORMATIK 2007: Informatik trifft Logistik. Band 1. Beiträge der 37. Jahrestagung der Gesellschaft für Informatik e.V. (GI), 2007,
pp. 26–31.
136. M. Werlberger, W. Trobin, T. Pock, A. Wedel, D. Cremers, and H. Bischof, “Anisotropic Huber-L1 Optical Flow,” in Proceedings of
the British Machine Vision Conference (BMVC), London, UK, September 2009.
137. “http://vision.middlebury.edu/stereo/eval/,” .
138. Pina Marziliano, Frederic Dufaux, Stefan Winkler, and Touradj Ebrahimi, “A no-reference perceptual blur metric,” in International
Conference on Image Processing, 2002.
139. Frederique Crete, Thierry Dolmiere, Patricia Ladret, and Marina Nicolas, “The blur effect: Perception and estimation with a new
no-reference perceptual blur metric,” in SPIE Electronic Imaging Symposium Conf Human Vision and Electronic Imaging, San Jose,
2007.
140. Jorge Caviedes and Sabri Gurbuz, “No-reference sharpness metric based on local edge kurtosis,” in International Conference on
Image Processing (ICIP), 2002, vol. 3, pp. 53–56.
141. Shaojie Zhuo and Terence Sim, “Defocus map estimation from a single image,” Pattern Recognition, vol. 44, no. 9, pp. 1852–1858,
2011.
142. Anat Levin, Dani Lischinski, and Yair Weiss, “A closed-form solution to natural image matting,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 30, no. 2, pp. 228–242, 2008.
143. A. Agrawal, R. Chellappa, and R. Raskar, “An algebraic approach to surface reconstructions from gradient fields?,” in International
Conference on Computer Vision (ICCV), 2006.
144. T. Simchony, R. Chellappa, and M. Shao, “Direct analytical methods for solving Poisson equations in computer vision problems,” in
IEEE Trans. Pattern Anal. Machine Intell., 1990, vol. 12, pp. 435–446.
145. ISO/IEC JTC1/SC29/WG11, “Depth estimation reference software (DERS) 4.0,” M16605, July 2009.
146. “http://vision.middlebury.edu/flow/eval/,” .
147. “http://gpu4vision.icg.tugraz.at/,” .
148. Robert Cormack, “The computation of retinal disparity,” Perception and Psychophysics, vol. 37, no. 2, pp. 176–178, 1985.
149. Tzung-Han Lin and Shang-Jen Hu, “Perceived depth analysis for view navigation of stereoscopic three-dimensional models,” Journal
of Electronic Imaging, vol. 23, no. 4, pp. 043014, 2014.
150. Dorin Comaniciu and Peter Meer, “Mean shift: A robust approach toward feature space analysis,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 24, pp. 603–619, 2002.
151. “http://www.wisdom.weizmann.ac.il/ bagon/matlab.html,” .
152. ITU-T Recommendation J.144 (Rev.1), “Objective perceptual video quality measurement techniques for digital cable television in
the presence of a full reference,” 2004.
153. A Ninassi, O Le Meur, P Le Callet, and D Barba, “Considering temporal variations of spatial visual distortions in video quality
assessment,” IEEE Journal of Selected Topics in Signal Processing, vol. 3, pp. 253–265, April 2009.
154. Hosik Sohn, Yong Ju Jung, Seong il Lee, Hyun Wook Park, and Yong Man Ro, “Investigation of object thickness for visual discomfort
prediction in stereoscopic images,” in Proceedings SPIE 8288, Stereoscopic Displays and Applications XXIII, 2012, vol. 8288.
155. Satoshi Toyosawa and Takashi Kawai, “Measurement of perceived stereoscopic sensation through disparity metrics and composi-
tions,” in Stereoscopic Displays and Applications XXV, San Francisco, California, USA, 2014, vol. 9011.
156. Donghyun Kim, Dongbo Min, Juhyun Oh, Seonggyu Jeon, and Kwanghoon Sohn, “Depth map quality metric for three-dimensional
video,” in Proceedings SPIE 7237, Stereoscopic Displays and Applications XX, 2009.
157. ITU-T Recommendation P.1401, “Methods, metrics and procedures for statistical evaluation, qualification and comparison of objec-
tive quality prediction models,” 2012.
158. Michael G. Ross and Aude Oliva, “Estimating perception of scene layout properties from global image features,” Journal of Vision,
vol. 10(1):2, pp. 1–25, 2010.
159. Aude Oliva and Antonio Torralba, “Modeling the shape of the scene: a holistic representation of the spatial envelope,” International
Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
160. Lutz Goldmann, Touradj Ebrahimi, Pierre Lebreton, and Alexander Raake, “Towards a descriptive depth index for 3D content :
measuring perspective depth cues,” in International Workshop on Video Processing and Quality Metrics for Consumer Electronics
(VPQM), Scottsdale, Arizona, USA, 2012.
161. R.G. von Gioi, J. Jakubowicz, J.-M. Morel, and G. Randall, “LSD: A fast line segment detector with a false detection control,” in
Pattern Analysis and Machine Intelligence, IEEE Transactions on, April 2010, vol. 32, pp. 722–732.
162. Cheng-Wei Chen and Yung-Yaw Chen, “Recovering depth from a single image using spectral energy of the defocused step edge
gradient,” in 18th IEEE International Conference on Image Processing (ICIP), 2011, pp. 1981–1984.
163. Pierre Lebreton, Alexander Raake, Marcus Barkowsky, and Patrick Le Callet, “Open perceptual binocular and monocular descriptors
for stereoscopic 3D images and video characterization,” in International Workshop on Quality of Multimedia Experience (QoMEX),
2015.