Pixel2Point: 3D object reconstruction from a single image using CNN and initial sphere [original]

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
10.1109/ACCESS.2020.3046951, IEEE Access
Pixel2Point: 3D Object Reconstruction from a Single Image Using CNN and Initial Sphere
AHMED J. AFIFI 1, JANNES MAGNUSSON 2, TOUFIQUE A. SOOMRO 3, OLAF HELLWICH 1
1 Computer Vision & Remote Sensing, Technische Universität Berlin, 10587 Berlin, Germany
2 Institute for Image Science and Computational Modelling in Cardiovascular Medicine, Charité - Universitätsmedizin Berlin, 13353 Berlin, Germany
3 Department of Electronic Engineering, Quaid-e-Awam University of Engineering, Science and Technology, Nawabshah 67480, Pakistan
Corresponding author: Ahmed J. Afifi (e-mail: [email protected], [email protected]).
ABSTRACT 3D reconstruction from a single image has many useful applications. However, it is a challenging and ill-posed problem, as many candidate shapes can explain the same image. In this paper, we propose a simple yet powerful CNN model that generates a point cloud of an object from a single image. 3D data can be represented in different ways; point clouds have proven to be a common and simple representation. The proposed model was trained end-to-end on synthetic data with 3D supervision. It takes a single image of an object and generates a point cloud with a fixed number of points. An initial point cloud of a sphere shape is used to improve the generated point cloud. The proposed model was tested on synthetic and real data. Qualitative evaluations demonstrate that the proposed model is able to generate point clouds that are very close to the ground-truth. Also, the initial point cloud improves the final results, as it distributes the points evenly on the object surface. Furthermore, the proposed method outperforms the state-of-the-art quantitatively and qualitatively on synthetic and real images. The proposed model generalizes well to new and unseen images and scenes.
INDEX TERMS Single-view Reconstruction, Deep Learning, Point Cloud, CNN
I. INTRODUCTION
SINGLE-VIEW reconstruction is a long-standing ill-posed problem and fundamental to many applications such as object recognition and scene understanding. Single-view 3D reconstruction means using a single image of an object to infer the 3D structure of the object so that it can be viewed from all directions. For multi-view scenarios, a large variety of methods has been proposed that are able to produce high-quality reconstruction results [10]. The challenge appears when only a single input image is available for the reconstruction process. Many approaches were proposed with restrictions and special assumptions on the input image to predict 3D geometry [19]. Single-view 3D reconstruction is a hard problem, and it mainly depends on the available information and the assumptions imposed on the target object. These cues provide prior knowledge that helps in generating 3D shapes with plausible precision [19].
Before the deep learning era, many approaches were proposed to solve single-view reconstruction depending on the nature of the object. Some of them were applied to real-world images without any knowledge of the image formation, and the output of these approaches is plausible. One class of these methods focuses on curved objects and tries to produce smooth objects. These methods define an energy function to minimize the object surface with respect to some constraints such as a fixed area or volume [18], [20], [28]. Other methods focus on piecewise planar objects and utilize semantic knowledge of object locations such as the sky and the ground locations in the image [7].
With the astonishing results obtained by applying deep learning to different computer vision problems, many 3D-based models have made great progress in solving different tasks using 3D data directly, such as classification, object part segmentation, and 3D shape completion. Also, the availability of large-scale datasets [5] encourages researchers to formulate and tackle the single-view reconstruction problem. Volumetric methods were first used to infer the 3D structure of an object from a single view [6]. However, volumetric representation suffers from information sparsity and
FIGURE 1: A general sketch of the proposed CNN model with different setups. Top: the proposed CNN without an initial point cloud. Bottom: the proposed CNN with the initial point cloud. E: Encoder, G: Generator, PC: Point Cloud, FV: Feature Vector.
the heavy computations during the training process. Also, this representation is ineffective in high-resolution outputs. To overcome this issue, recent works have used point clouds [8] as they are samples on the surface of the objects and effectively capture more object details.
In single-view reconstruction, the reprojection from 2D to 3D is ambiguous due to the loss of the depth information. To this end, we propose a CNN model that solves the task of single-view reconstruction. The model has an encoder-generator shape where the encoder extracts useful features from the input image and the generator infers the point cloud of the object shown in the 2D image. To generate more accurate point clouds, an initial point cloud is used to improve the reconstruction quality. We find that starting from an initial point cloud encourages the points to distribute evenly on the shape surface and preserves the object parts. We summarize our contributions as follows: (1) We design a CNN model that can infer the 3D geometry of an object from a single image. The 3D object is represented as a point cloud. (2) Instead of directly inferring the point cloud, we propose to utilize an initial point cloud of a sphere shape to generate the final object point cloud. The experimental results (Sec. V-C) show that using an initial point cloud helps in generating better and more accurate reconstructions (Figure 1). (3) We evaluate the proposed model on synthetic and real data quantitatively and qualitatively. Our model outperforms the state-of-the-art methods and shows significant results for the task of single-view reconstruction.
II. RELATED WORK
Inferring the 3D structure of an object from a single image is an ill-posed problem, but many attempts have been made, such as SFM and SLAM [3], [9]. Moreover, Shape-from-X, where X can be shadow, texture, etc., requires prior knowledge on the nature of the input image [2].
When applying deep learning models to generate 3D shapes or to solve other tasks such as segmentation, recognition, or object classification, the object representation plays an important role in designing the network. The 3D data representations most commonly used in deep learning are volumetric data, meshes, and point clouds.
To extend 2D convolutions to 3D, the volumetric representation has mostly been used. Volumetric data can be represented as a regular grid in the 3D space [27]. Voxels are used to visualize 3D data and show the distribution of the 3D object in the 3D space. Each voxel in the 3D space that describes the object can be classified as visible, occluded, or self-occluded according to the viewpoint. This representation is simple to implement and compatible with 3D convolutional neural networks. 3D-GAN [26] proposed a generative adversarial network (GAN) to generate 3D objects from a probabilistic space using volumetric CNN. They mapped a low-dimensional probabilistic space to the 3D object space and, by this, outperformed other unsupervised learning methods. Moreover, a 3D recurrent neural network (RNN) has been suggested to estimate the 3D shape of an object. 3D-R2N2 [6] proposed to use long short-term memory (LSTM) to infer the 3D geometry using many images of the target object from different perspectives. Recently, 3D-FHNet, a 3D fusion hierarchical reconstruction method, was proposed that can perform 3D object reconstruction from any number of views [14]. The critical limitation of using the volumetric representation in the above-mentioned methods is the computational and memory cost and the restriction on the output resolution. Also, fine-grained shape parts get lost because each voxel is represented as either occupied or unoccupied.
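The memory argument above can be made concrete with a small calculation. The following is only an illustrative sketch: the grid resolutions are assumed values chosen for the example, not settings taken from the cited methods, while 2048 is the point count used later in this paper.

```python
def voxel_cells(resolution):
    """Number of occupancy cells in a dense resolution^3 voxel grid."""
    return resolution ** 3


def point_cloud_values(num_points):
    """Number of stored coordinates in an N x 3 point cloud."""
    return num_points * 3


# Illustrative resolutions (assumed for this example only).
for resolution in (32, 64, 128):
    print(resolution, voxel_cells(resolution))

# The point cloud size used in this paper.
print(2048, point_cloud_values(2048))
```

The dense grid grows cubically with the resolution, whereas the point cloud grows linearly with the number of points.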
To avoid the limitations of the volumetric representation, mesh representation is more attractive for real applications as the shape details can be modeled accurately. 3D meshes are commonly used to represent 3D shapes. The structure of a 3D mesh comprises a set of polygons which are called faces [4]. These polygons are described using a set of vertices that describe how the mesh coordinates exist in the 3D space. Besides the 3D coordinates of the vertices, there is a connectivity list that specifies how the vertices are connected to each other. Applying deep learning models directly to generate meshes is a challenge as they are not regularly structured. A parameterization-based 3D reconstruction is proposed in [22] that generates geometry images which encode x, y, z surface coordinates. Three separate encoder-decoder networks were used to generate the geometry images. The networks take an RGB image or a depth image as an input and learn the x, y, and z geometry images, respectively. Other methods proposed to estimate a deformation field from an input image and apply
it to a template 3D shape to generate the reconstructed 3D model. Kurenkov et al. [12] proposed DeformNet, which takes an image and the nearest 3D shape to that image from a dataset as an input. Then, the template shape is deformed to match the input image using the Free-Form Deformation (FFD) layer. Pixel2Mesh [25] is an end-to-end deep learning model that was proposed to generate a triangulated 3D mesh from a single image. The network represents the 3D mesh in a graph-based CNN (GCNN). It deforms an initial ellipsoid by leveraging the perceptual features extracted from the input image. They adopted a coarse-to-fine strategy that makes the deformation process stable. A limitation of using meshes for reconstruction is that the generated output is mostly limited to the initial mesh or the template selected as the initial shape to be deformed.
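As a minimal illustration of the vertex list plus connectivity list described in this paragraph, the sketch below stores a tetrahedron as a triangle mesh. The shape and the helper name `unique_edges` are assumptions made for the example and are not taken from any of the cited methods.

```python
# Vertex list: one (x, y, z) coordinate per vertex.
vertices = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 1.0, 0.0),
    (0.0, 0.0, 1.0),
]

# Connectivity list: each face is a triple of indices into `vertices`.
faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]


def unique_edges(face_list):
    """Collect the undirected edges implied by a list of triangular faces."""
    edges = set()
    for a, b, c in face_list:
        for u, v in ((a, b), (b, c), (c, a)):
            edges.add((min(u, v), max(u, v)))
    return edges


edges = unique_edges(faces)
# Euler characteristic V - E + F, which is 2 for a closed sphere-like surface.
euler = len(vertices) - len(edges) + len(faces)
print(len(edges), euler)
```

The irregular connectivity (faces index into the vertex list in no fixed grid order) is what makes meshes harder to generate with standard convolutional layers.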
To overcome the above-mentioned limitations, point clouds are used to represent the 3D data. A 3D point cloud is a set of unordered 3D points that approximate the geometry of 3D objects [8]. Points can be represented either as a matrix of size N × 3, as a 3-channel grid of size H × W × 3 where each pixel encodes the (x, y, z) coordinates and H × W equals the number of points, or as depth maps from different known viewpoints. Point Set Generation Network (PSGN) [8] was the first model proposed to generate a point cloud of an object from a single image, and it outperformed the volumetric approaches. In RealPoint3D [29], the proposed network has two encoders: the first one extracts 2D features from the input image, and the second extracts 3D features from the nearest similar shape to the input image retrieved from the ShapeNet dataset. The extracted features from both encoders are integrated and forwarded to a decoder to generate fine-grained point clouds. The point cloud from the retrieved shape influenced the inference process and led to finer point clouds. 3D-LMNet [16] trained a 3D point cloud auto-encoder and then learned the mapping from the 2D images to the learned embedded features. Another direction is to generate depth images from different perspectives and fuse them to generate the final point cloud. In [13], a generative modeling framework used 2D convolutional operations to predict multiple pre-defined depth images and used them to generate a dense 3D model. In [15], a two-stage training dense point cloud generation network was proposed. In the first stage, the network takes a single RGB image and generates a sparse point cloud. In the second stage, a generator network densifies the sparse point cloud and generates a dense point cloud. After training the two stages, the model becomes an end-to-end network that generates a dense point cloud from a single RGB image.
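The N × 3 matrix and the H × W × 3 grid mentioned at the start of this paragraph hold the same points in two layouts. The sketch below converts between them, assuming a row-major ordering and an illustrative 32 × 64 grid; both choices, and the helper names `to_grid` and `to_matrix`, are assumptions for the example.

```python
def to_grid(points, height, width):
    """Arrange an N x 3 point list as an H x W grid of (x, y, z) triples."""
    if len(points) != height * width:
        raise ValueError("H x W must equal the number of points")
    return [points[r * width:(r + 1) * width] for r in range(height)]


def to_matrix(grid):
    """Flatten an H x W grid of (x, y, z) triples back to an N x 3 list."""
    return [point for row in grid for point in row]


# Placeholder points; only the layout matters here.
points = [(float(i), 0.0, 0.0) for i in range(2048)]
grid = to_grid(points, 32, 64)
print(len(grid), len(grid[0]), len(grid[0][0]))
```

Because a point cloud is unordered, the grid layout carries no spatial meaning by itself; it only lets 2D convolutional layers emit the coordinates.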
Our proposed model is different from the mentioned work. It has a simple design and utilizes an initial point cloud to predict the final point cloud accurately. The model has a single input and generates the point cloud directly without retrieving and utilizing a 3D model similar to the input image as proposed in [29]. Also, it doesn't use other 2D supervision such as silhouettes to infer the 3D object structure.
III. METHODOLOGY
Our main goal is to infer a complete 3D shape of an object from a single RGB image. We select point clouds to represent the generated output (Eq. 1). We set the number of points generated by the CNN to N = 2048. From our experiments, this number of points is sufficient to cover the whole surface of the object and preserve the major structures.
S = \{(x_i, y_i, z_i)\}_{i=1}^{N}    (1)
A. 3D CNN MODEL
The proposed network is illustrated in Figure 2. It consists of two parts: the encoder and the generator. The encoder is a set of consecutive 2D convolutional layers followed by ReLU as a non-linear activation function. These layers are used to extract the object features from the 2D input images. To predict the 3D point cloud of the object, an initial point cloud of a sphere shape is used. The initial point cloud is concatenated with the extracted features from the encoder. Then, it is fed into the generator to get the final point cloud of the object, where fully connected (FC) layers are used to generate an N × 3 matrix in which each row contains the coordinates of one point. Each network part is described in detail below.
Encoder Net. The role of the encoder is to extract distinctive features from the input image that can correctly describe the object in detail. It consists of consecutive 2D convolutional layers, each followed by a ReLU layer. There are seven convolutional layers. The first three have 32, 64, and 128 output channels, respectively, and the remaining layers have 256. All convolutional layers have a kernel size of 3 × 3 and a stride of 2. The stride of 2 decreases the spatial size of the features as pooling layers do. Compared to pooling layers, strided convolutional layers are trainable and can extract useful features. The size of the input image is 128 × 128. The feature extracted by the encoder has a size of 1 × 1 × 256, which is reshaped and concatenated with the initial point cloud.
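The reduction from 128 × 128 to 1 × 1 can be checked with a short calculation. This is a sketch under one assumption: the padding scheme is not stated in the text, so "same"-style padding is assumed here, in which case each stride-2 layer halves the spatial size (rounding up). The helper name `conv_output_size` is hypothetical.

```python
import math


def conv_output_size(size, stride=2):
    """Spatial output size of a strided convolution.

    Assumes "same"-style padding (not stated in the paper), so the
    result depends only on the stride.
    """
    return math.ceil(size / stride)


# Channel widths of the seven convolutional layers as described in the text.
channels = [32, 64, 128, 256, 256, 256, 256]

size = 128  # input image is 128 x 128
shapes = []
for num_channels in channels:
    size = conv_output_size(size)
    shapes.append((size, size, num_channels))

for shape in shapes:
    print(shape)
```

Under this assumption, seven halvings take 128 down to 1, which agrees with the 1 × 1 × 256 feature reported for the encoder output.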
Generator Net. The generator is a simple network consisting of four fully connected (FC) layers. The feature vector extracted by the encoder is reshaped to 1 × 256 and then concatenated with the initial point cloud. The initial point cloud has a sphere shape consisting of 256 equally spaced points. The reshaped feature is concatenated with each point of the initial point cloud, and the new feature has a size of 256 × (3 + 256). Figure 2 shows the reshape and concatenation process. The new feature is fed into the generator. After three FC layers followed by ReLU, the generator ends with a fully connected layer that predicts the final point cloud with a shape of 2048 × 3.
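The reshape-and-concatenate step can be sketched as follows. Two parts of this example are assumptions rather than the paper's procedure: the text does not say how the 256 equally spaced sphere points are produced, so a Fibonacci lattice is used as one common way to get roughly even spacing, and the feature vector is a zero-filled stand-in for the encoder output. The helper names are hypothetical.

```python
import math


def fibonacci_sphere(num_points):
    """Roughly evenly spaced points on the unit sphere (assumed sampling)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(num_points):
        z = 1.0 - 2.0 * (i + 0.5) / num_points
        radius = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        points.append((radius * math.cos(theta), radius * math.sin(theta), z))
    return points


def concat_feature(points, feature):
    """Append the same feature vector to every point: N x (3 + len(feature))."""
    return [list(point) + list(feature) for point in points]


sphere = fibonacci_sphere(256)
feature_vector = [0.0] * 256  # stand-in for the 1 x 256 encoder feature
combined = concat_feature(sphere, feature_vector)
print(len(combined), len(combined[0]))
```

Each row of `combined` holds one sphere point followed by the shared image feature, which matches the 256 × (3 + 256) size stated above. How the FC layers map this input to the final 2048 × 3 output is not detailed in this excerpt and is left out of the sketch.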
The proposed network is different from other single-view reconstruction models as it utilizes an initial point cloud with a sphere shape for better inference of the final point cloud. In the results section, we will discuss
dataset that we used in our experiment. It is manually cleaned and aligned. It has more than 50K unique 3D models which cover 55 common object categories. We focus on 13 categories and use the 80%-20% train-test split provided by [5]. The 2D input images used for training and testing are provided by [6], where each model is rendered from 24 different azimuth angles.
To show the generalization of the proposed method on real images, we tested it using the Pix3D dataset [23]. Pix3D is a publicly available dataset of aligned real-world image and 3D model pairs. It contains a large diversity in terms of object shapes and backgrounds and is highly challenging. We test and report the performance of the proposed method on the chair, sofa, and table categories from the Pix3D dataset.
C. BASELINES
We test the proposed model trained on the ShapeNet dataset. First, we test the proposed model on synthetic images and show that it can generate point clouds that describe the object in the input image. Then, we validate the benefit of using the initial point cloud to improve the final point clouds. Also, we compare the proposed model against PSGN [8] and 3D-LMNet [16] qualitatively and quantitatively. CD (Eq. 3) and EMD (Eq. 4) are used to report the quantitative evaluation. Finally, we test the proposed model on real images to validate its generalizability to unseen images.
V. EXPERIMENTAL RESULTS & COMPARISONS
To test the performance of the proposed model, we evaluate it from different directions. First, we show general results generated by the proposed model on the ShapeNet dataset for different classes. Then, we compare the results of the proposed model quantitatively and qualitatively against similar approaches that target the same problem using point cloud representation.
After that, we validate the proposed architecture by an ablation study. We validate the benefits of using the initial point cloud (the 3D sphere) to generate more accurate results. Also, we show how the proposed model deals with input images that have an ambiguous view (some object structures are hidden). Moreover, we show that the learned latent vector can be utilized to transfer useful information from one shape to another by applying arithmetic operations on different extracted features.
To check the model generality, we demonstrate the performance of the proposed model on the Pix3D dataset, which has real images, and compare the results against other methods quantitatively and qualitatively. Finally, we report some failure cases caused by unusual shapes or by parts that do not usually exist in normal cases.
A. GENERAL RESULTS ON SHAPENET DATASET
We test the proposed model on the testing set of ShapeNet. The proposed model was trained on synthetic images of objects rendered from different viewpoints. The testing was performed on 13 different categories. Figure 3 shows the qualitative results of 8 different categories. It clearly demonstrates that the point clouds generated from a single view are very close to the ground-truth and capture the object geometry. Also, the proposed model learns to generate the point clouds while keeping salient features such as the free spaces between the slats in the back of the chair and the holes between the back and the seat of the bench. Moreover, the proposed model successfully learned to generate some thin and rare parts such as the stretchers between the chair legs, even though these parts are not common in the chair category. Many categories have various geometrical shapes, such as the top surface of the tables. In Figure 3 (last row), the proposed model generates the circular surface accurately, as in the input image, along with the cylindrical pillar and the four small legs. Furthermore, the proposed model generates complete and plausible shapes. The generated points are evenly distributed and cover all parts of the objects.
B. COMPARISON RESULTS AGAINST OTHER METHODS
We benchmark our proposed model against PSGN and 3D-LMNet. Both models were trained on the same training set of ShapeNet. PSGN is the first model to solve the problem of single-view reconstruction using a CNN that generates point clouds. In [8], the reported results show that the point cloud-based models outperform the state-of-the-art voxel-based models significantly. Table 1 reports the comparison results of our proposed model against PSGN [8] and 3D-LMNet [16] on the ShapeNet dataset. It demonstrates that our proposed model outperforms PSGN in 8 out of 13 categories in the Chamfer metric and in all 13 categories in the EMD metric. Also, our proposed model outperforms 3D-LMNet in 6 out of 13 categories in the Chamfer metric and in all 13 categories in the EMD metric. Overall, the average performance of our proposed model is better than that of both models in both metrics, even though our proposed model is simple, yet efficient, compared with the others. Looking deeper into Table 1, the EMD values indicate better visual quality of the generated point clouds. Also, since EMD is a point-to-point distance, it results in a high penalty when computing the distance between the points, and the two point cloud sets must have the same number of points. In Chamfer distance, the nearest points are used to calculate the distance in a forward manner (from the generated point cloud to the ground-truth) and in a backward manner (from the ground-truth to the generated point cloud). The generated point cloud and the corresponding ground-truth need not have the same number of points.
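The difference between the two metrics can be illustrated with a small pure-Python sketch. Eq. 3 and Eq. 4 are not reproduced in this excerpt, so the exact form used here (squared nearest-neighbor distances averaged per direction for Chamfer, and the mean distance under the best one-to-one matching for EMD) is an assumption. The brute-force matching is only practical for a handful of points, and both function names are hypothetical.

```python
import itertools
import math


def chamfer_distance(generated, ground_truth):
    """Symmetric Chamfer distance (assumed form: mean squared NN distance)."""
    def one_way(source, target):
        return sum(
            min(math.dist(p, q) ** 2 for q in target) for p in source
        ) / len(source)

    # Forward term (generated -> ground truth) plus backward term.
    return one_way(generated, ground_truth) + one_way(ground_truth, generated)


def emd_brute_force(generated, ground_truth):
    """Mean distance under the best one-to-one matching (tiny sets only)."""
    if len(generated) != len(ground_truth):
        raise ValueError("EMD needs two sets with the same number of points")
    indices = range(len(ground_truth))
    best = min(
        sum(math.dist(p, ground_truth[j]) for p, j in zip(generated, perm))
        for perm in itertools.permutations(indices)
    )
    return best / len(generated)


# Chamfer accepts sets of different sizes.
cd = chamfer_distance(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.5, 0.0, 0.0)],
)

# EMD requires a bijection, so both sets have two points here.
emd = emd_brute_force(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)],
)
print(cd, emd)
```

Practical implementations replace the brute-force search with an assignment solver or an approximation, since the number of permutations grows factorially with the number of points.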
Figure 4 highlights the qualitative comparison. It clearly shows that the point clouds generated by our proposed model are visually better than the ones generated by PSGN and 3D-LMNet. Our proposed model captures the details of the
Moreover, we compare our proposed model against Pixel2Mesh [25]. Different from the proposed model, Pixel2Mesh uses an ellipsoidal mesh as an initial shape. It takes the features extracted at different stages of the image feature network and applies them in the mesh deformation network to deform the mesh and add more details to it in a coarse-to-fine fashion. Table 2 reports the quantitative comparison between the Pixel2Mesh model and our proposed model. With respect to CD, our model outperforms Pixel2Mesh in some categories and is comparable to it in the others, and the average performance of our model is better. In EMD, our model outperforms Pixel2Mesh in all categories, and its average performance is better by a large margin, as reported in Table 2.
C. EFFECT OF THE INITIAL POINT CLOUD
To test the efficacy of using the initial point cloud in reconstructing a finer point cloud, we conduct an experiment to evaluate the performance of two different setups of the proposed model (Figure 1). The first setup is the model shown in Figure 2, which uses an initial point cloud. The second setup has the same architecture as Figure 2 but without the initial point cloud, so the point cloud is reconstructed directly from the input image. Both setups were trained on the training set of ShapeNet and tested on the testing set of the same dataset.
Qualitatively, Figure 5 illustrates the results of the different setups of the proposed model. The point clouds generated by the proposed model without an initial point cloud suffer from an uneven distribution of the points over the shape. Many points gather at some parts of the shape. In the chair example, many points are grouped at the back corners of the seats and fewer points are in the legs. However, the model with the initial point cloud produces chairs with well-distributed points, and the chair legs are well reconstructed. Also, in the table examples, the point clouds generated without an initial point cloud have poorly reconstructed legs, but the legs are well reconstructed when an initial point cloud is used during training. In the plane examples, the engines and the tail are not reconstructed and the points are concentrated on the body of the plane, but they are reconstructed accurately when using the initial point cloud. From Figure 5, we conclude that adding the initial point cloud to the proposed model improves the reconstructed point cloud, distributes the points evenly over all shape parts, and generates the object details accurately. Quantitatively, Table 3 reports a comparison between the different setups of the model. The model with the initial point cloud outperforms the same model without it by a large margin in mean EMD (5.30 vs. 13.7) and by a smaller margin in mean Chamfer distance (5.38 vs. 5.52).
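The per-category picture behind the mean values can be tallied directly from Table 3. The short script below is an illustrative check, not part of the original evaluation; it copies the per-category values from Table 3 and counts the categories in which the setup with the initial point cloud has the lower (better) score.

```python
# Per-category values copied from Table 3:
# (Chamfer w/o PC, Chamfer w PC, EMD w/o PC, EMD w PC)
table3 = {
    "airplane": (4.03, 3.29, 4.91, 3.82),
    "bench": (4.34, 4.59, 10.20, 4.31),
    "cabinet": (5.97, 6.07, 11.18, 4.94),
    "car": (4.21, 4.39, 4.69, 3.61),
    "chair": (7.00, 6.48, 7.30, 6.45),
    "lamp": (6.31, 6.58, 32.08, 8.45),
    "monitor": (6.62, 6.39, 19.83, 5.94),
    "rifle": (2.71, 2.89, 11.06, 4.25),
    "sofa": (6.49, 5.85, 6.24, 5.03),
    "speakers": (7.86, 8.39, 20.61, 7.37),
    "table": (6.47, 6.26, 7.00, 6.05),
    "telephone": (4.03, 4.27, 6.36, 3.77),
    "vessel": (5.64, 4.55, 6.58, 4.89),
}

# Lower is better for both metrics.
chamfer_wins = [
    name for name, (cd_without, cd_with, _, _) in table3.items()
    if cd_with < cd_without
]
emd_wins = [
    name for name, (_, _, emd_without, emd_with) in table3.items()
    if emd_with < emd_without
]

print(len(chamfer_wins), "of", len(table3), "categories improve in Chamfer")
print(len(emd_wins), "of", len(table3), "categories improve in EMD")
```

In this tally, the initial point cloud lowers EMD in all 13 categories, while Chamfer distance improves in 6 of the 13, which is consistent with the much larger gap in mean EMD than in mean Chamfer distance.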
Category     Chamfer             EMD
             w/o PC    w PC      w/o PC    w PC
airplane     4.03      3.29      4.91      3.82
bench        4.34      4.59      10.20     4.31
cabinet      5.97      6.07      11.18     4.94
car          4.21      4.39      4.69      3.61
chair        7.00      6.48      7.30      6.45
lamp         6.31      6.58      32.08     8.45
monitor      6.62      6.39      19.83     5.94
rifle        2.71      2.89      11.06     4.25
sofa         6.49      5.85      6.24      5.03
speakers     7.86      8.39      20.61     7.37
table        6.47      6.26      7.00      6.05
telephone    4.03      4.27      6.36      3.77
vessel       5.64      4.55      6.58      4.89
Mean         5.52      5.38      13.7      5.30
TABLE 3: Quantitative comparison of different setups of the proposed model on ShapeNet (w/o PC: without the initial point cloud; w PC: with the initial point cloud).
D. GENERATING PLAUSIBLE SHAPES FROM AMBIGUOUS 2D INPUTS
To validate the performance of the proposed model, we conducted an experiment to test whether it can recognize and generate plausible shapes from 2D images of the chair class in which most of the object geometry is hidden (the back view of the chair). Figure 6 shows the qualitative results of this experiment. For each image, we show the back and the side views of the reconstructed model along with the ground-truth from the same viewpoint. It clearly shows that the proposed model succeeds in inferring the 3D geometry of the input image and generates plausible shapes that are consistent with the input images and the ground-truth. Also, the proposed model manages to memorize and reconstruct chair parts such as the legs and the arms without seeing them in the 2D input images. Figure 6 shows that the proposed model can generate plausible shapes that are consistent with the ambiguous 2D images and are close enough to the ground-truth.
E. ARITHMETIC OPERATIONS ON THE 2D INPUT IMAGE FEATURE VECTOR
Another interesting experiment is to check whether the 2D features extracted from the input images carry meaningful information. To do so, we extract the 2D features from different 2D images of the same category and apply arithmetic operations on them to generate a new 3D shape. In [17], it was shown that vector(King) − vector(Man) + vector(Woman) gives a vector whose nearest neighbor is the vector for Queen. Our experiment follows a similar idea. We select random triples, extract their 2D features using the encoder network, and apply the arithmetic operation (fv_1 − fv_2 + fv_3). The resulting feature is then passed to the generator to generate the 3D point cloud.
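The operation fv_1 − fv_2 + fv_3 is an element-wise combination of three feature vectors. The following minimal sketch uses short toy vectors in place of the 256-dimensional encoder features; the function name `feature_arithmetic` and the toy values are assumptions made for illustration.

```python
def feature_arithmetic(fv1, fv2, fv3):
    """Element-wise fv1 - fv2 + fv3 for equally sized feature vectors."""
    if not (len(fv1) == len(fv2) == len(fv3)):
        raise ValueError("feature vectors must have the same length")
    return [a - b + c for a, b, c in zip(fv1, fv2, fv3)]


# Toy 3-dimensional stand-ins: the last entry plays the role of the
# attribute present in fv1, absent in fv2, and transferred to fv3.
fv1 = [1.0, 0.5, 2.0]
fv2 = [1.0, 0.5, 0.0]
fv3 = [0.2, 0.9, 0.0]

new_feature = feature_arithmetic(fv1, fv2, fv3)
print(new_feature)
```

In the experiment, the resulting vector would take the place of the encoder output and be passed to the generator.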
Figure 7 shows the results of applying the arithmetic operations to some categories. The first experiment was applied to the airplane category. In Figure 7a, the first image is an airplane with two engines on each side and the second image is an airplane with one engine on each side. We subtract the feature of the second image from that of the first and then add the difference to the feature of a third image of an airplane that has just one engine on each side. As shown in Figure 7a, the generated new shape is an airplane that has two engines on each side. This means
FIGURE 5: Qualitative results of the different setups of the proposed model on ShapeNet (Figure 1). From left to right: input image, ground-truth, results generated by the proposed model without the initial point cloud, and results generated by the proposed model with the initial point cloud.
that the dif ference between the first two images generates a
feature of an engine and then adds it to the third image results
in a ne w airplane with two engines.
The second example w as applied to the chair category . The
main image is for a chair with arms. The other images are
chairs without arms. W e want to test if we can subtract the
arms from the first shape and add them to the ne w shape.
Figure 7b sho ws that when subtracting the feature of a chair
that doesn’ t hav e arms from a chair that has arms and then
adds the ne w feature to a third one we get the same shape of
the third chair b ut with arms. This means that the difference
between the two features generates a feature that has the chair
arms information. And when adding this feature to a ne w
image generates a shape that is similar to the input image
that contains the transferred arms.
A third example was applied to the table category. The first
image shows a table with a bottom shelf and the second image
shows a table without the bottom shelf. When we subtract
the feature of the second image from the feature of the first
image and add the result to the feature of a third image,
the result is a table with the bottom shelf. The generated
table is similar to the third image plus the bottom shelf. As
can be seen in Figure 7c, the generated tables are similar to
the third images; for example, the table with long legs
preserves its geometry after the new feature is added.
As shown in Figure 7, the proposed model extracts features
that carry meaningful, part-level information. These
features can be used to generate realistic shapes that have
extra parts.
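The feature arithmetic described in this section can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `transfer_part` and the toy 4-dimensional vectors are hypothetical. In the real pipeline the three features would be produced by the encoder from three input images, and the resulting feature would be passed on to the point cloud generation part of the network.

```python
import numpy as np

def transfer_part(feat_with, feat_without, feat_target):
    """Add the difference (feat_with - feat_without) to feat_target.

    All three arguments are feature vectors of the same shape.
    The difference is assumed to encode the part (e.g. chair arms)
    that is present in the first shape and absent in the second.
    """
    part_feature = feat_with - feat_without
    return feat_target + part_feature

# Toy vectors standing in for encoder outputs (hypothetical values).
chair_with_arms = np.array([1.0, 0.5, 2.0, 0.0])
chair_without_arms = np.array([1.0, 0.5, 0.0, 0.0])
third_chair = np.array([0.2, 0.9, 0.0, 0.3])

# Feature of the third chair with the "arms" component added.
new_feature = transfer_part(chair_with_arms, chair_without_arms, third_chair)
```

In this toy example only the third component differs between the first two vectors, so only that component of the third feature changes.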
F. PIX3D DATASET RESULTS
The proposed model was trained on synthetic images that are
clean and in which the objects are clearly visible. To test the
performance of the model in real scenarios, the Pix3D dataset
is used. This dataset contains a large collection of real images
and the corresponding metadata, such as masks, along with
ground-truth 3D CAD models of different object categories.
The categories shared between the ShapeNet and Pix3D datasets
are used to test and evaluate the proposed method. The testing
images are preprocessed: each image is cropped to center
the object of interest, the background is masked out
using the corresponding mask, and the image
is resized to match the training image size (128 × 128). The
proposed model is not fine-tuned on the Pix3D dataset; it
is tested directly on the dataset images.
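The preprocessing steps above (mask, crop, resize) could look roughly like the sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function name `preprocess`, the zero padding to a square, and the nearest-neighbour resize are all choices made here for self-containment. The text only specifies that the object is centered, the background is masked, and the result is 128 × 128.

```python
import numpy as np

def preprocess(image, mask, out_size=128):
    """Mask the background, crop around the object, and resize.

    image: (H, W, C) array; mask: (H, W) array, non-zero on the object.
    Returns an (out_size, out_size, C) array.
    """
    mask = mask.astype(bool)
    if not mask.any():
        raise ValueError("mask contains no object pixels")

    # Zero out every background pixel.
    masked = image * mask[..., None]

    # Tight bounding box of the object.
    ys, xs = np.nonzero(mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    crop = masked[y0:y1, x0:x1]

    # Pad to a square so the object sits in the center (assumption).
    h, w = crop.shape[:2]
    side = max(h, w)
    square = np.zeros((side, side, image.shape[2]), dtype=image.dtype)
    top, left = (side - h) // 2, (side - w) // 2
    square[top:top + h, left:left + w] = crop

    # Nearest-neighbour resize to out_size x out_size (assumption).
    idx = np.arange(out_size) * side // out_size
    return square[idx][:, idx]
```

A production pipeline would more likely use an image library with bilinear or area interpolation for the resize step.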
Table 4 reports the quantitative results of comparing the
proposed model against PSGN and 3D-LMNet on Pix3D
images. The three models were trained on ShapeNet and
tested on Pix3D. The reported results of PSGN and
3D-LMNet are taken from [16]. Table 4 shows that the proposed
model outperforms the other models by a large margin in both
metrics and on all object categories. This demonstrates the
effectiveness of the proposed model on real data.
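This section does not name the two metrics. Assuming one of them is the Chamfer distance, which is widely used to compare point clouds, a minimal sketch is given below. The exact variant (squared distances, averaged in both directions) is an assumption, and conventions differ between papers, so values computed this way are not necessarily comparable to those in Table 4.

```python
import numpy as np

def chamfer_distance(p, q):
    """Symmetric Chamfer distance between point sets p (N, 3) and q (M, 3).

    Uses squared Euclidean distances and averages the nearest-neighbour
    distance in both directions (one common convention; an assumption).
    """
    # Pairwise squared distances, shape (N, M), via broadcasting.
    d2 = np.sum((p[:, None, :] - q[None, :, :]) ** 2, axis=-1)
    # Nearest neighbour in q for each p, and in p for each q.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()
```

The pairwise matrix needs O(NM) memory, which is manageable for clouds of a few thousand points; larger clouds would call for a spatial index.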
Figure 8 visualizes the reconstruction results of some
selected Pix3D images generated by the proposed model
along with those of 3D-LMNet. 3D-LMNet performs well on
real-world images, but our model performs better: the
generated point clouds are more accurate and very similar to the
ground-truth. Our model distributes the points evenly over the
whole object shape and covers the object parts accurately.
This shows that the proposed model generalizes well to
real-world images and generates accurate models that
describe the input images even though the images are from
a different distribution than the training set.
G. FAILURE CASES
The proposed model fails to generate accurate shapes in
some cases. Figure 9 shows some failure cases. Most thin
and narrow parts of the objects are missed, such as the
chair armrests and the airplane tail. Extra parts that do
not usually exist are also missed, as in
a monitor with two bases or a table with three legs on each
FIGURE 7: Results of applying arithmetic operations on 2D
features extracted by the encoder for different shapes:
(a) aeroplane, (b) chair, (c) table.
side. Normally, the narrow and extra parts are missed because
the network did not learn to predict them. However, when this
happens, the network tries to generate
the closest shape to the input image, as with the table
with six legs in Figure 9 (the last row). The proposed
model generates a plausible point cloud that
is close to the input image but misses the leg in the middle.
VI. DISCUSSION & CONCLUSION
Although single-view 3D object reconstruction is a challenging
task, the human visual system is able to infer
and predict the geometry of a scene and the objects within it
from a single image. Even in more complicated scenarios, such
as heavy occlusion of the objects, the human brain is able to
guess a number of plausible shapes that could match what is
seen. This is because of the prior information that is stored
in the human brain and is retrieved, utilized, and updated
when seeing new scenes. Recently, different research fields
have exploited the ability to reconstruct objects from a single
image in many applications, such as object grasping and
manipulation in robotics. However, it is an ill-posed
problem: because of the uncertainty, many plausible
reconstructions could be a solution for one single view.
In this paper, we have proposed a simple, yet powerful,
CNN model to generate the point cloud of an object from
a single image. 3D data can be represented in different
ways. Point clouds have proven to be a common and
simple representation. The proposed model was trained
end-to-end on synthetic data with 3D supervision. It takes a single
image of an object and generates a point cloud with a fixed
number of points (N = 2048). Qualitative and quantitative
evaluations on synthetic and real data demonstrate that the
proposed model is able to generate point clouds that are very
close to the ground-truth and more accurate than those of
other methods. Moreover, we show that the initial point
cloud improves the final results, as it distributes the
points evenly over the whole object shape. The qualitative
results show that, when the proposed model does not use the
initial point cloud, the points are grouped densely in some
object parts while other parts have fewer points. Furthermore,
the performance of the proposed model on the real-world
dataset illustrates its strong generalization to new
and unseen images and scenes.
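To make the idea of an initial sphere-shaped point cloud with N = 2048 points concrete, the sketch below generates one using a Fibonacci lattice. How the authors actually sample their sphere is not stated in this section, so the sampling scheme, the function name `fibonacci_sphere`, and the unit radius are assumptions chosen because the lattice spreads points nearly uniformly over the surface.

```python
import numpy as np

def fibonacci_sphere(n=2048, radius=1.0):
    """Return n roughly evenly spaced points on a sphere, shape (n, 3).

    Uses a Fibonacci (golden-angle) lattice; this sampling scheme is an
    illustrative assumption, not necessarily the one used in the paper.
    """
    i = np.arange(n) + 0.5
    z = 1.0 - 2.0 * i / n                      # heights spread over (-1, 1)
    phi = np.pi * (3.0 - np.sqrt(5.0)) * i     # golden-angle increments
    r = np.sqrt(1.0 - z * z)                   # ring radius at each height
    points = np.stack([r * np.cos(phi), r * np.sin(phi), z], axis=1)
    return radius * points

initial_cloud = fibonacci_sphere()
```

Random sampling on the sphere would also give a valid initial cloud, but it tends to leave small clusters and gaps, which is why a deterministic lattice is used here.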
VII. ACKNOWLEDGMENT
We acknowledge support by the German Research Foundation
and the Open Access Publication Fund of TU Berlin.
FIGURE 8: Qualitative results on chair, sofa, and table categories from the Pix3D dataset. From left to right: input image,
ground-truth, results generated by 3D-LMNet, and results generated by the proposed model.
FIGURE 9: Failure cases of our method on ShapeNet. Failures happen because of unexpected extra parts or thin and narrow parts.
From left to right: input image, ground-truth, generated point cloud.
REFERENCES
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard, et al. Tensorflow: A system for
large-scale machine learning. In 12th USENIX symposium on operating
systems design and implementation (OSDI 16), pages 265–283, 2016.
[2] J. T. Barron and J. Malik. Shape, illumination, and reflectance from
shading. IEEE transactions on pattern analysis and machine intelligence,
37(8):1670–1687, 2014.
[3] G. Bresson, Z. Alsayed, L. Yu, and S. Glaser. Simultaneous localization
and mapping: A survey of current trends in autonomous driving. IEEE
Transactions on Intelligent Vehicles, 2(3):194–220, 2017.
[4] M. M. Bronstein, J. Bruna, Y. LeCun, A. Szlam, and P. Vandergheynst.
Geometric deep learning: going beyond euclidean data. IEEE Signal
Processing Magazine, 34(4):18–42, 2017.
[5] A. X. Chang, T. Funkhouser, L. Guibas, P. Hanrahan, Q. Huang, Z. Li,
S. Savarese, M. Savva, S. Song, H. Su, et al. Shapenet: An
information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015.
[6] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese. 3d-r2n2: A unified
approach for single and multi-view 3d object reconstruction. In European
conference on computer vision, pages 628–644. Springer, 2016.
[7] A. Criminisi, I. Reid, and A. Zisserman. Single view metrology.
International Journal of Computer Vision, 40(2):123–148, 2000.
[8] H. Fan, H. Su, and L. J. Guibas. A point set generation network for 3d
object reconstruction from a single image. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 605–613,
2017.
[9] K. Häming and G. Peters. The structure-from-motion reconstruction
pipeline–a survey with focus on short image sequences. Kybernetika,
46(5):926–937, 2010.
[10] R. Hartley and A. Zisserman. Multiple view geometry in computer vision.
Cambridge university press, 2003.
[11] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.
arXiv preprint arXiv:1412.6980, 2014.
[12] A. Kurenkov, J. Ji, A. Garg, V. Mehta, J. Gwak, C. Choy, and S. Savarese.
Deformnet: Free-form deformation network for 3d shape reconstruction
from a single image. In 2018 IEEE Winter Conference on Applications of
Computer Vision (WACV), pages 858–866. IEEE, 2018.
[13] C.-H. Lin, C. Kong, and S. Lucey. Learning efficient point cloud
generation for dense 3d object reconstruction. In Thirty-Second AAAI
Conference on Artificial Intelligence, 2018.
[14] Q. Lu, Y. Lu, M. Xiao, X. Yuan, and W. Jia. 3d-fhnet: Three-dimensional
fusion hierarchical reconstruction method for any number of views. IEEE
Access, 7:172902–172912, 2019.
[15] Q. Lu, M. Xiao, Y. Lu, X. Yuan, and Y. Yu. Attention-based dense
point cloud reconstruction from a single image. IEEE Access,
7:137420–137431, 2019.
[16] P. Mandikal, K. L. Navaneet, M. Agarwal, and R. V. Babu. 3d-lmnet:
Latent embedding matching for accurate and diverse 3d point cloud
reconstruction from a single image. In BMVC, 2018.
[17] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed
representations of words and phrases and their compositionality. In
Advances in neural information processing systems, pages 3111–3119,
2013.
[18] M. R. Oswald, E. Töppe, K. Kolev, and D. Cremers. Non-parametric single
view reconstruction of curved objects using convex optimization. In Joint
Pattern Recognition Symposium, pages 171–180. Springer, 2009.
[19] M. R. Oswald, E. Töppe, C. Nieuwenhuis, and D. Cremers. A review of
geometry recovery from a single image focusing on curved object
reconstruction. In Innovations for Shape Analysis, pages 343–378. Springer,
2013.
[20] M. Prasad and A. Fitzgibbon. Single view reconstruction of curved
surfaces. In 2006 IEEE computer society conference on computer vision
and pattern recognition (CVPR'06), volume 2, pages 1345–1354. IEEE,
2006.
[21] Y. Rubner, C. Tomasi, and L. J. Guibas. The earth mover's distance as
a metric for image retrieval. International journal of computer vision,
40(2):99–121, 2000.
[22] A. Sinha, A. Unmesh, Q. Huang, and K. Ramani. Surfnet: Generating 3d
shape surfaces using deep residual networks. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 6040–6049,
2017.
[23] X. Sun, J. Wu, X. Zhang, Z. Zhang, C. Zhang, T. Xue, J. B. Tenenbaum,
and W. T. Freeman. Pix3d: Dataset and methods for single-image 3d shape
modeling. In Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, pages 2974–2983, 2018.
[24] M.-P. Tran. 3d contour closing: A local operator based on chamfer distance
transformation. 2013.
[25] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang. Pixel2mesh:
Generating 3d mesh models from single rgb images. In Proceedings of the
European Conference on Computer Vision (ECCV), pages 52–67, 2018.
[26] J. Wu, C. Zhang, T. Xue, B. Freeman, and J. Tenenbaum. Learning a
probabilistic latent space of object shapes via 3d generative-adversarial
modeling. In Advances in neural information processing systems, pages
82–90, 2016.
[27] Y. Xiang, W. Choi, Y. Lin, and S. Savarese. Data-driven 3d voxel patterns
for object category recognition. In Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition, pages 1903–1911, 2015.
[28] L. Zhang, G. Dugas-Phocion, J.-S. Samson, and S. M. Seitz. Single-view
modelling of free-form scenes. The Journal of Visualization and Computer
Animation, 13(4):225–235, 2002.
[29] Y. Zhang, Z. Liu, T. Liu, B. Peng, and X. Li. Realpoint3d: An efficient
generation network for 3d object reconstruction from a single image. IEEE
Access, 7:57539–57549, 2019.
AHMED J. AFIFI was born in 1985. He received
the bachelor's and M.Sc. degrees in computer
engineering from the Islamic University of Gaza, in
2008 and 2011, respectively. During his master's
degree, he was interested in digital image
processing and pattern recognition. He is currently
pursuing the Ph.D. degree with the Computer Vision
and Remote Sensing Research Group, Technische
Universität Berlin. His research interests include
computer vision, deep learning, 3D object
reconstruction from a single image, and medical image analysis.
JANNES MAGNUSSON was born in Berlin,
Germany, in 1994. He studied computer science
at Technische Universität Berlin, specialized in
computer vision and medical image processing,
and received his master's degree in 2020. Now, he
is working as a research associate at the Institute
for Image Science and Computational Modelling
in Cardiovascular Medicine, Charité -
Universitätsmedizin Berlin.
TOUFIQUE A. SOOMRO (Member, IEEE)
received the B.E. degree in electronic engineering
from the Mehran University of Engineering and
Technology, Pakistan, in 2008, the M.Sc. degree
in electrical and electronic engineering by
research from Universiti Teknologi PETRONAS,
Malaysia, in 2014, and the Ph.D. degree in AI and
image processing from the School of Computing
and Mathematics, Charles Sturt University,
Australia, in 2018. He was a Research Assistant
for six months in the School of Business Analytics, in the Cluster of Big Data
Analysis, The University of Sydney, Australia. He is currently an Assistant
Professor with the Department of Electronic Engineering, QUEST-Larkana,
Pakistan. His research interests include most aspects of image enhancement
methods, segmentation methods, classification methods, and image analysis
for medical images.
OLAF HELLWICH was born in 1962. He received
the B.S. degree in surveying engineering from the
University of New Brunswick, Fredericton, NB,
Canada, in 1986, and the Ph.D. degree in
Linienextraktion aus SAR-Daten mit einem
Markoff-Zufallsfeld-Modell from the Technische
Universität München, München, Germany, in 1997. He
headed the Remote Sensing Group, Department of
Photogrammetry and Remote Sensing, Technische
Universität München. Since 2001, he has been a
Professor with the Technische Universität Berlin (TUB), Berlin, Germany,
initially for photogrammetry and cartography, and since 2004 for Computer
Vision and Remote Sensing. From 2006 to 2009, he was the Dean of the
Faculty of Electrical Engineering and Computer Science, TUB. His research
interests include 3-D object reconstruction, object recognition, synthetic
aperture radar remote sensing, and discovery and use of object shape priors
in 3-D reconstruction. He was a recipient of the Hansa Luftbild Prize of the
German Society for Photogrammetry and Remote Sensing, in 2000.