International Journal of Computer Assisted Radiology and Surgery (2021) 16:2089–2097
https://doi.org/10.1007/s11548-021-02482-2
ORIGINAL ARTICLE
Detecting failure modes in image reconstructions with interval neural
network uncertainty
Luis Oala1 · Cosmas Heiß2 · Jan Macdonald2 · Maximilian März1 · Gitta Kutyniok3 · Wojciech Samek1
Received: 8 April 2021 / Accepted: 10 August 2021 / Published online: 4 September 2021
© The Author(s) 2021
Abstract
Purpose The quantitative detection of failure modes is important for making deep neural networks reliable and usable
at scale. We consider three examples for common failure modes in image reconstruction and demonstrate the potential of
uncertainty quantification as a fine-grained alarm system.
Methods We propose a deterministic, modular and lightweight approach called Interval Neural Network (INN) that produces
fast and easy to interpret uncertainty scores for deep neural networks. Importantly, INNs can be constructed post hoc for
already trained prediction networks. We compare it against state-of-the-art baseline methods (MCDrop, ProbOut).
Results We demonstrate on controlled, synthetic inverse problems the capacity of INNs to capture uncertainty due to noise
as well as directional error information. On a real-world inverse problem with human CT scans, we can show that INNs
produce uncertainty scores which improve the detection of all considered failure modes compared to the baseline methods.
Conclusion Interval Neural Networks offer a promising tool to expose weaknesses of deep image reconstruction models and
ultimately make them more reliable. The fact that they can be applied post hoc to equip already trained deep neural network
models with uncertainty scores makes them particularly interesting for deployment.
Keywords Deep learning · Image reconstruction · Uncertainty quantification · Failure modes
Luis Oala, Cosmas Heiß, Jan Macdonald and Maximilian März have
contributed equally to this work.
Luis Oala
Cosmas Heiß
Jan Macdonald
Maximilian März
Gitta Kutyniok
Wojciech Samek
1Department of Artificial Intelligence, Fraunhofer HHI, Berlin,
Germany
2Institut für Mathematik, Technische Universität Berlin,
Berlin, Germany
3Mathematisches Institut, Ludwig-Maximilians-Universität
München, Munich, Germany
Introduction
The reconstruction of unknown signals from indirect mea-
surements plays an important role in many applications,
including medical imaging [2,14]. Typically, such tasks are
modeled as finite-dimensional linear inverse problems
$$y = Ax + \eta, \tag{1}$$
where $x \in \mathbb{R}^n$ is the signal of interest, $A \in \mathbb{R}^{m \times n}$ denotes
the forward operator representing a physical measurement
process, and $\eta \in \mathbb{R}^m$ is modeling noise in the measurements.
Important examples include magnetic resonance imaging and
computed tomography, where $A$ is a subsampled discrete
Fourier or Radon transform, respectively. Solving the inverse
problem (1) requires computing an approximate reconstruction
of $x$ from the observed measurements $y$.
Classical reconstruction methods, e.g., based on sparse
regularization models, constitute the state of the art for
solving (1) in many cases and are backed by theoretical
guarantees [8]. Recently, data-driven deep learning methods
are increasingly gaining attention and are repeatedly able to
outperform traditional solvers in terms of empirical recon-
struction performance or speed, see for example [2].
Despite the advantages, the use of deep learning methods
in sensitive applications such as clinical diagnosis is still a
concern [23], due to questions regarding the reliability and
robustness of the obtained reconstructions when compared
to traditional approaches [1,13]. What is more, erroneous
artifacts in the reconstructed signals can be hard to detect as
they tend to “blend in” well with the rest of the signal.
Various approaches for incorporating uncertainty quantification
(UQ) into deep learning have been proposed to address
these issues [10,16,18,22]. However, as we demonstrate,
existing UQ approaches come with limitations regarding
their capacity to detect failure modes or their post hoc appli-
cability to trained deep learning models.
In this work, we consider a straight-forward approach to
solving (1) by employing a neural network to post-process a
standard model-based inversion as in [14]. This reconstruc-
tion is given by
$$x_{\mathrm{rec}} = \Phi(A^\dagger(y)),$$
where $\Phi \colon \mathbb{R}^n \to \mathbb{R}^n$ is a neural network trained to minimize
the loss $\|x - \Phi(A^\dagger(y))\|_2^2$ and $A^\dagger \colon \mathbb{R}^m \to \mathbb{R}^n$ denotes the
non-learned model-based inversion (e.g., the filtered backprojection
in the case of Radon measurements). We will
denote $z = A^\dagger(y)$ in the following. Given $y$ or $z$, a UQ
method is supposed to extend the predicted reconstruction
$\Phi(z)$ by a component-wise uncertainty score $u(z)$ that provides
additional information regarding the reliability of the
reconstruction. Therefore, $u(z)$ should be correlated with the
component-wise error $|x - \Phi(z)|$. We evaluate this for three
different failure modes [7] that can arise during inference
(see “Experiment B (i): general prediction error detection”
section to “Experiment B (iii): Atypical Artifact Detection”
section for more details):
(i) Errors caused solely by the ill-posedness of (1), which
is mostly determined by the strength of measurement
noise and the amount of undersampling,
(ii) Errors caused by adversarial perturbations to the net-
work inputs,
(iii) Errors caused by atypical artifacts that have not been
seen during the training.
Our main contributions can be summarized as follows:
We present a deterministic, modular and fast UQ-method for
deep neural networks (DNNs), called Interval Neural Net-
works (INN). We evaluate INNs for the detection of the three
different image reconstruction failure modes and demon-
strate that they provide improved results compared to two
existing UQ methods.
Related work
Whereas a number of methods from classical statistical
learning theory, such as Gaussian processes and approxima-
tions thereof [6,19], come with built-in uncertainty estimates,
DNNs have been limited in this regard. A surge of efforts to
treat neural networks from a variational perspective [3,16]
started to change that. In addition, there exist strands of
research in deep learning explicitly occupied with the detec-
tion of failure modes caused by adversarial and out of
distribution (OoD) inputs. These include Maximum Mean
Discrepancy, Kernel Density Estimation and other tools,
see [5] or the Minimum Covariance Determinant method
[26], Support Vector Data Description [28], among oth-
ers. We refer to [27] for a comprehensive overview. The
detection of adversarial and OoD inputs in these works is
typically done in the classification setting. We emphasize
that image-to-image regression is a fundamentally different
task: While classification is inherently discontinuous, image
reconstruction addresses a problem that allows for stable
solution methods in many cases, e.g., by sparse regulariza-
tion. Furthermore, we are not interested in a crude, outright
rejection of data points in the input space but rather seek to
obtain fine-grained information about erroneous artifacts in
the output space. More closely related to our goal are Monte
Carlo dropout (MCDrop) [10] and direct variance estimation
(ProbOut) [12], where epistemic and aleatoric uncertainty
quantification was considered for segmentation and depth-
estimation tasks. Hence, we include their approaches as
baseline comparison methods, see “Baseline UQ methods”
section.
Methods
Popular existing UQ frameworks for DNNs place paramet-
ric densities, most commonly Gaussian densities, over the
DNN parameters or predictions. Instead of using specific
parametrized densities, our INN method relies on bound-
ing distributions using intervals. This results in a flexible
and modular method that can be applied post hoc to a given
DNN $\Phi$ that has already been trained. A schematic illustration
is provided in Fig. 1: The INN is formed by wrapping
additional weight and bias intervals around the weights and
biases of the underlying prediction DNN. This allows us to
equip the DNN $\Phi$ with uncertainty capabilities without the
need to modify $\Phi$ itself. After training the INN we obtain
prediction intervals that are guaranteed to contain the orig-
inal prediction of the underlying network and are easy to
interpret. They provide exact upper and lower bounds for the
range of possible values that the DNN prediction may take
when slightly modifying the network parameters within the
prescribed weight and bias intervals.
Fig. 1 A schematic overview of
the proposed Interval Neural
Networks for image
reconstruction
Previously, the capacity of neural networks with interval
weights and biases was evaluated for fitting interval-valued
functions [11]. In contrast to [11], our targets $x_i$ are neither
interval-valued nor univariate, leading to a different loss
function which allows us to equip trained neural networks
with uncertainty capabilities post hoc. For a direct comparison,
see (3) in “Training Interval Neural Networks” section and Equation (18) in [11]. Further, [17,30]
explored neural networks implementing interval arithmetic
for robust classifications. However, in their setting, the focus
is purely on representing the inputs or outputs as intervals but
not the weights and biases. In contrast, our proposed INNs
determine interval bounds for all network parameters with
the goal of providing uncertainty scores for the predictions
of an underlying DNN.
Arithmetic of Interval Neural Networks
We will now give a description of those INN mechanisms
that deviate from standard DNNs. The forward propagation
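of a single input $z$ through a DNN is replaced by the forward
propagation of a component-wise interval-valued input $[\underline{z}, \overline{z}]$
through the INN. This can be expressed similarly to standard
feed-forward neural networks but using interval arithmetic
instead. For interval-valued weight matrices $[\underline{W}, \overline{W}]$ and bias
vectors $[\underline{b}, \overline{b}]$, the propagation through the $\ell$-th network layer
can be expressed as
$$[\underline{z}, \overline{z}]^{(\ell+1)} = \varrho\left([\underline{W}, \overline{W}]^{(\ell)} [\underline{z}, \overline{z}]^{(\ell)} + [\underline{b}, \overline{b}]^{(\ell)}\right). \tag{2}$$
For nonnegative $[\underline{z}, \overline{z}]^{(\ell)}$, for example when using a nonnegative
activation function $\varrho$ such as the ReLU in the
previous layer, we can explicitly rewrite (2) as
$$\underline{z}^{(\ell+1)} = \varrho\left(\min\{\underline{W}^{(\ell)}, 0\}\, \overline{z}^{(\ell)} + \max\{\underline{W}^{(\ell)}, 0\}\, \underline{z}^{(\ell)} + \underline{b}^{(\ell)}\right),$$
$$\overline{z}^{(\ell+1)} = \varrho\left(\max\{\overline{W}^{(\ell)}, 0\}\, \overline{z}^{(\ell)} + \min\{\overline{W}^{(\ell)}, 0\}\, \underline{z}^{(\ell)} + \overline{b}^{(\ell)}\right),$$
where the maximum and minimum are computed component-wise.
Similarly, for point intervals $\underline{z}^{(\ell)} = \overline{z}^{(\ell)} =: z^{(\ell)}$, for
example, as inputs to the first network layer, we can rewrite
(2) as
$$\underline{z}^{(\ell+1)} = \varrho\left(\underline{W}^{(\ell)} \max\{z^{(\ell)}, 0\} + \overline{W}^{(\ell)} \min\{z^{(\ell)}, 0\} + \underline{b}^{(\ell)}\right),$$
$$\overline{z}^{(\ell+1)} = \varrho\left(\overline{W}^{(\ell)} \max\{z^{(\ell)}, 0\} + \underline{W}^{(\ell)} \min\{z^{(\ell)}, 0\} + \overline{b}^{(\ell)}\right),$$
regardless of whether $z^{(\ell)}$ is nonnegative or not. Optimizing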
the INN parameters requires obtaining the gradients of these
operations. This can be achieved using automatic differen-
tiation (backpropagation) in the same way as for standard
neural networks.
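The interval propagation for nonnegative inputs can be illustrated with a minimal sketch in plain Python. The function names, the toy weights and the interval half-width of 0.1 are hypothetical choices for illustration only; the authors' implementation is available in their repository and may differ. The sketch assumes a ReLU activation, which is monotone and can therefore be applied to the lower and upper bounds separately.

```python
def interval_layer_nonneg(W_lo, W_hi, b_lo, b_hi, z_lo, z_hi):
    """One interval layer with ReLU, assuming 0 <= z_lo[k] <= z_hi[k] for all k."""
    out_lo, out_hi = [], []
    for wl_row, wh_row, bl, bh in zip(W_lo, W_hi, b_lo, b_hi):
        lo, hi = bl, bh
        for wl, wh, zl, zh in zip(wl_row, wh_row, z_lo, z_hi):
            # Lower bound: negative lower weights act on the upper input bound.
            lo += min(wl, 0.0) * zh + max(wl, 0.0) * zl
            # Upper bound: positive upper weights act on the upper input bound.
            hi += max(wh, 0.0) * zh + min(wh, 0.0) * zl
        # ReLU is monotone, so it is applied to both bounds separately.
        out_lo.append(max(lo, 0.0))
        out_hi.append(max(hi, 0.0))
    return out_lo, out_hi


def plain_layer(W, b, z):
    """Ordinary ReLU layer of the underlying network, for comparison."""
    return [max(bi + sum(w * x for w, x in zip(row, z)), 0.0)
            for row, bi in zip(W, b)]


# Hypothetical toy layer: intervals of half-width 0.1 around fixed parameters.
W = [[1.0, -0.2], [0.5, 0.0]]
b = [0.1, -0.3]
W_lo = [[w - 0.1 for w in row] for row in W]
W_hi = [[w + 0.1 for w in row] for row in W]
b_lo = [v - 0.1 for v in b]
b_hi = [v + 0.1 for v in b]

z = [1.0, 2.0]  # nonnegative point interval: lower bound == upper bound == z
lo, hi = interval_layer_nonneg(W_lo, W_hi, b_lo, b_hi, z, z)
pred = plain_layer(W, b, z)
print(lo, pred, hi)
```

Because the original weights and biases lie inside their intervals, the output of the plain layer is enclosed by the two bounds in every component.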
Training Interval Neural Networks
Let W() and b() be the weights and biases of the underlying
prediction network Φand let Φ:RnRnand Φ:Rn
Rndenote the functions mapping a point interval input zto
the upper and the lower interval bounds in the output layer of
the INN respectively.Givendata samples {zi,xi}m
i=1the INN
parameters[W,W]() and [b,b]() aretrained byminimizing
the empirical loss
m
i=1
max{xiΦ(zi), 0}
2
2+
max{Φ(zi)xi,0}
2
2
+β·
Φ(zi)Φ(zi)
1,(3)
subject to the constraints W() W() W() and b()
b() b() for each layer. This way Φ(z)Φ(z)Φ(z)
is always guaranteed. The first two terms in (3) encour-
age that the predicted interval [Φ(zi), Φ(zi)]should contain
the target signal xi, while penalizing each component that
lies outside with the squared distance to the nearest interval
bound. The second term penalizes the interval size, so that
the predicted intervals cannot grow arbitrarily large. While a
quadraticpenalty ofthe interval size is alsopossible andleads
to similar theoretical bounds as in (4), we choose to minimize
the 1-norm to make the intervals more outlier inclusive. In
addition, the tightness parameter β>0 can further tune the
outlier-sensitivity of the intervals. This allows for a calibra-
tion ofthe INN uncertaintyscores according toan application
specific risk-budget. In practice, we found that choosing β
similar to the mean absolute error of the underlying predic-
123
2092 International Journal of Computer Assisted Radiology and Surgery (2021) 16:2089–2097
tion network yields a good trade-off between coverage [9]
and tightness.
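As a minimal sketch, the empirical loss (3) can be evaluated in plain Python for given interval bounds. The function name `inn_loss` and the list-based data layout are assumptions made for illustration; the parameter constraints on the weight and bias intervals are not part of this snippet and would have to be enforced separately during training.

```python
def inn_loss(targets, lowers, uppers, beta):
    """Empirical INN loss (3) for lists of targets and predicted interval bounds.

    targets, lowers, uppers: lists of samples, each sample a list of floats.
    beta: tightness parameter (> 0) weighting the interval width penalty.
    """
    total = 0.0
    for x, lo, hi in zip(targets, lowers, uppers):
        # Squared distance of target components lying above the upper bound.
        above = sum(max(xi - hi_i, 0.0) ** 2 for xi, hi_i in zip(x, hi))
        # Squared distance of target components lying below the lower bound.
        below = sum(max(lo_i - xi, 0.0) ** 2 for xi, lo_i in zip(x, lo))
        # l1-norm of the interval width, keeping the intervals tight.
        width = sum(abs(hi_i - lo_i) for lo_i, hi_i in zip(lo, hi))
        total += above + below + beta * width
    return total


# Toy example: the second target component lies 0.5 above its upper bound.
loss = inn_loss(targets=[[0.0, 1.0]],
                lowers=[[-0.5, 0.0]],
                uppers=[[0.5, 0.5]],
                beta=0.1)
print(loss)
```

In the toy example only the second component violates its interval, so the loss consists of that squared violation plus the weighted total interval width.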
Properties of Interval Neural Networks
The uncertainty estimate of an INN is given by the width of
the prediction interval, i.e., $u(z) = \overline{\Phi}(z) - \underline{\Phi}(z)$. In terms
of computational overhead, INNs scale linearly in the cost
of evaluating the underlying prediction DNN with a constant
factor 2. In contrast, the popular MCDrop [10] scales linearly
with a factor $T$ which is proportional to the number
of stochastic forward passes and at least $T = 10$ is recommended
by the authors, see “Baseline UQ methods” section.
Further, INNs come with theoretical coverage guarantees
that can be derived from the Markov inequality: Assuming
that the loss (3) is optimized during training to yield an INN
with vanishing expected gradient with respect to the data
distribution, we obtain
$$\mathbb{P}_{(z,x)}\left[\underline{\Phi}(z)_i - \lambda\beta < x_i < \overline{\Phi}(z)_i + \lambda\beta\right] \geq 1 - \frac{1}{\lambda}, \tag{4}$$
for any $\lambda > 0$. In other words, for input and target pair $(z, x)$
the probability of any component of the target lying inside
the predicted interval enlarged by $\lambda\beta$ is at least $1 - \frac{1}{\lambda}$. As $\beta$ is
usually very small, this ensures a fast decay of the probability
of the components of $x$ lying outside the predicted interval
bounds. Consequently, a component with a small uncertainty
score was correctly reconstructed up to small error with a high
probability. Of course, the training distribution needs to be
well representative of the true data distribution to extrapolate
this property to unseen data.
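One way to see how a bound of the form (4) can arise from the Markov inequality is sketched below. This is an illustrative derivation under additional assumptions that are not stated in the text: the stationarity condition is only used with respect to the upper and lower bias of the output layer, these biases enter the output bounds additively, and the interval width is strictly positive. It is not necessarily the argument used by the authors.

```latex
% Stationarity of the expected loss (3) w.r.t. the i-th upper output bias:
\mathbb{E}\big[-2\max\{x_i - \overline{\Phi}(z)_i, 0\} + \beta\big] = 0
\quad\Longrightarrow\quad
\mathbb{E}\big[\max\{x_i - \overline{\Phi}(z)_i, 0\}\big] = \tfrac{\beta}{2}.

% Markov's inequality for the nonnegative variable \max\{x_i - \overline{\Phi}(z)_i, 0\}:
\mathbb{P}\big[x_i \geq \overline{\Phi}(z)_i + \lambda\beta\big]
\leq \frac{\beta/2}{\lambda\beta} = \frac{1}{2\lambda}.

% The same argument for the lower output bias gives
\mathbb{P}\big[x_i \leq \underline{\Phi}(z)_i - \lambda\beta\big] \leq \frac{1}{2\lambda},

% and a union bound over both events yields
\mathbb{P}\big[\underline{\Phi}(z)_i - \lambda\beta < x_i < \overline{\Phi}(z)_i + \lambda\beta\big]
\geq 1 - \frac{1}{\lambda}.
```

The sketch also indicates why the bound scales with $\beta$: the tightness parameter fixes the expected amount by which targets may exceed each interval bound.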
Finally, the optimization of the loss (3) yields additional
information: If the prediction $\Phi(z)$ lies closer to one boundary
of the predicted interval, the true target $x$ has a higher
probability of lying on the other side of the interval. Consequently,
INNs can provide directional uncertainty scores.
A quantitative assessment of this capability is given in
Fig. 3c, d. We note that it is also possible to explore asymmetric
uncertainty estimates in the probabilistic setting, e.g.,
via exponential family distributions [29] or quantile regres-
sion [24]. In contrast to INNs, these methods cannot be
applied post hoc as they require substantial modifications
to the underlying prediction network.
Baseline UQ methods
In addition to our INN approach, we consider two other
related and popular UQ baseline methods for comparison.
First, Monte Carlo dropout (MCDrop) [10] obtains uncertainty
scores as the sample standard deviation of multiple stochastic
forward passes of the same input signal. In other words, if
$\Phi_1, \dots, \Phi_T$ are realizations of independent draws of random
dropout masks for the same underlying network $\Phi$,
the component-wise uncertainty estimate is
$$u_{\mathrm{MCDrop}}(z) = \left(\frac{1}{T-1}\left(\sum_{t=1}^T \Phi_t(z)^2 - \frac{1}{T}\Big(\sum_{t=1}^T \Phi_t(z)\Big)^2\right)\right)^{1/2}.$$
Second, a direct
variance estimation (ProbOut) was proposed in [22] and
later expanded in [12]. Here, the number of output components
of the prediction network is doubled and trained
to approximate the mean and variance of a Gaussian distribution.
The resulting network $\Phi_{\mathrm{ProbOut}} \colon \mathbb{R}^n \to \mathbb{R}^n \times \mathbb{R}^n$,
$z \mapsto (\Phi_{\mathrm{mean}}(z), \Phi_{\mathrm{var}}(z))$ is trained by minimizing
the empirical loss
$$\sum_i \left\|(x_i - \Phi_{\mathrm{mean}}(z_i)) / \sqrt{\Phi_{\mathrm{var}}(z_i)}\right\|_2^2 + \left\|\log \Phi_{\mathrm{var}}(z_i)\right\|_1.$$
The component-wise uncertainty score of
ProbOut is $u_{\mathrm{ProbOut}}(z) = (\Phi_{\mathrm{var}}(z))^{1/2}$. Note that, in contrast
to INN and MCDrop, the ProbOut approach requires
the incorporation of UQ already during training. Thus, it
cannot be employed as a post hoc evaluation of an already
trained, underlying network $\Phi$. The role of the actual prediction
network is taken by $\Phi_{\mathrm{mean}}$.
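The two baseline scores can be sketched in a few lines of plain Python. The function names and the list-based inputs are hypothetical; the stochastic forward passes themselves (dropout masks, network evaluation) are assumed to be available and are not simulated here. The ProbOut loss follows the expression given in the text for a single sample.

```python
import math


def mcdrop_uncertainty(samples):
    """Component-wise sample standard deviation over T stochastic predictions.

    samples: list of T predictions Phi_t(z), each a list of n floats (T >= 2).
    """
    T, n = len(samples), len(samples[0])
    scores = []
    for j in range(n):
        s1 = sum(s[j] for s in samples)        # sum of predictions
        s2 = sum(s[j] ** 2 for s in samples)   # sum of squared predictions
        var = (s2 - s1 * s1 / T) / (T - 1)
        scores.append(math.sqrt(max(var, 0.0)))  # guard against round-off
    return scores


def probout_loss(target, mean, var):
    """Per-sample ProbOut loss: scaled squared error plus log-variance term."""
    fit = sum((x - m) ** 2 / v for x, m, v in zip(target, mean, var))
    reg = sum(abs(math.log(v)) for v in var)   # 1-norm of the log-variance
    return fit + reg


def probout_uncertainty(var):
    """Component-wise ProbOut score: square root of the predicted variance."""
    return [math.sqrt(v) for v in var]


u_mc = mcdrop_uncertainty([[1.0, 2.0], [3.0, 2.0]])
print(u_mc, probout_loss([1.0, 0.0], [0.0, 0.0], [1.0, 1.0]))
```

The first component of the toy example varies between the two passes and receives a positive score, while the second is identical in both passes and receives zero uncertainty.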
Experiments
We present experiments for two different inverse problems.
First, a deconvolution task with 1D signals, and second a
tomography task on real-world 2D image signals. Both setups
are described in more detail below. The description of all
hyperparameters for the experiments is kept brief and we
refer to our publicly available code at https://github.com/luisoala/inn
for full details.
Case study A: deconvolution of 1D signals
We start with a synthetic, didactic experiment, inspired by
a one-dimensional deconvolution task, to demonstrate the
properties of INNs discussed in “Properties of Interval Neural
Networks” section. For this purpose, we choose $n = m = 512$
and $A = DSD$, where $D$ is a discrete cosine transform
(Type I DCT) and $S$ is a diagonal matrix with entries
$s_j = \left(\frac{n-j}{n-1}\right)^{\nu} \in [0,1]$, that decay with a fixed exponent
$\nu = 8$. We draw synthetically generated signals $x$ from a distribution
of piecewise constant functions with random jump
positions and heights, see Fig. 2. The corresponding measurements
$y$ are computed according to (1). We generate a
data set consisting of 2000 sample pairs $(y_i, x_i)$, 1600 of
which were used for training, 200 for validation and 200 for
testing. The underlying prediction network $\Phi$ is a convolutional
neural network (consisting of ten convolutional layers
and three dropout layers in between) trained to directly map
$y$ to $x$, i.e., we use $A^\dagger = \mathrm{Id}$ and thus $z = A^\dagger y = y$ in
this experiment. We trained the underlying network $\Phi$ for
100 epochs using Adam [15]. The interval parameters of the
INN were subsequently trained for another 100 epochs with
$\beta = 2 \cdot 10^{-3}$. For the MCDrop comparison, we use $T = 64$
Fig. 2 Results for the deconvolution task for one exemplary signal
without noise (left) and with additive Gaussian noise ($\sigma = 0.05$)
on both the measurements $y$ and signal $x$ (right). The first row
shows inputs $z = y$ and targets $x$. Below, the target $x$, prediction
$\Phi(z)$ and uncertainty score $u(z)$ as well as the uncertainty
compared to the absolute error $|\Phi(z) - x|$ are shown for the
three UQ methods.
samples. The ProbOut model was trained in the same way
as $\Phi$ using 100 Adam epochs. Note that all subsequent evaluations,
as well as the plots in Fig. 2, are computed using test
samples.
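The synthetic data generation described above can be sketched as follows. Several details are assumptions made for illustration and are not taken from the text: the DCT is implemented in its orthonormal Type I form (a symmetric matrix that is its own inverse), jump heights are drawn uniformly from [-1, 1], and a small signal length is used instead of $n = 512$ to keep the pure-Python example fast.

```python
import math
import random


def dct1_matrix(n):
    """Orthonormal Type I DCT matrix (symmetric and equal to its own inverse)."""
    scale = math.sqrt(2.0 / (n - 1))
    c = [1.0 / math.sqrt(2.0) if k in (0, n - 1) else 1.0 for k in range(n)]
    return [[scale * c[j] * c[k] * math.cos(math.pi * j * k / (n - 1))
             for k in range(n)] for j in range(n)]


def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def decay_factors(n, nu=8):
    """Diagonal entries s_j = ((n - j) / (n - 1))**nu for j = 1, ..., n."""
    return [((n - j) / (n - 1)) ** nu for j in range(1, n + 1)]


def forward_operator(x, nu=8):
    """Apply A = D S D: transform, damp the coefficients, transform back."""
    D = dct1_matrix(len(x))
    coeffs = matvec(D, x)
    damped = [s * c for s, c in zip(decay_factors(len(x), nu), coeffs)]
    return matvec(D, damped)


def piecewise_constant(n, num_jumps, rng):
    """Piecewise constant signal with random jump positions and heights."""
    jumps = set(rng.sample(range(1, n), num_jumps))
    signal, level = [], rng.uniform(-1.0, 1.0)
    for i in range(n):
        if i in jumps:
            level = rng.uniform(-1.0, 1.0)  # hypothetical height distribution
        signal.append(level)
    return signal


rng = random.Random(0)
x = piecewise_constant(32, 4, rng)   # toy size instead of n = 512
y = forward_operator(x)              # noiseless measurements y = A x
```

Adding Gaussian noise to `y` (and optionally to `x`) then yields sample pairs of the kind used in the noise experiments.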
In order to evaluate the UQ methods’ abilities to capture
uncertainty due to noisy data, we consider additive Gaussian
noise $\eta \sim \mathcal{N}(0, \sigma^2 \cdot \mathrm{Id})$ on the measurements over a range of
noise levels $\sigma$ (Fig. 3a) as well as $\eta_1, \eta_2 \sim \mathcal{N}(0, \sigma^2 \cdot \mathrm{Id})$ on
the measurements and targets, where (1) is adjusted to
$y = A(x + \eta_1) + \eta_2$ (Fig. 3b and right column of Fig. 2). In this
case, INNs are able to capture the additional uncertainty of $\eta_1$
using the bias parameters of the final network layer. In Fig. 3,
it can be observed how in contrast to MCDrop, our method
and ProbOut are able to capture independent noise in the
data with ProbOut reacting to a lesser degree than the INN.
Note also that in Fig. 3 some of the ProbOut evaluations
are shifted to the right, indicating a reduced reconstruction
performance compared to the other methods.
Finally, we determine the directional information of the
INN uncertainty scores as discussed in “Properties of Inter-
val Neural Networks” section. For this, we define the
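component-wise directionality ratio by
$\mathrm{DR}(z) = \max\{\overline{\Phi}(z) - \Phi(z), \Phi(z) - \underline{\Phi}(z)\} / \min\{\overline{\Phi}(z) - \Phi(z), \Phi(z) - \underline{\Phi}(z)\}$,
i.e., as the ratio between the larger and smaller part of the
interval $[\underline{\Phi}(z), \overline{\Phi}(z)]$ when divided by the prediction $\Phi(z)$.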
The directionality accuracy (DA) is the relative frequency
Fig. 3 (a) Mean uncertainty of the three UQ methods for varying levels $\sigma$
of additive Gaussian noise on the measurements $y$ for the deconvolution task.
(b) Corresponding results for additive noise on both the measurements $y$
and signals $x$. (c) Illustration of the directional information contained
in the INN output intervals for the deconvolution task. The additional
right axis (in blue) displays the relative frequency of signal components
for each directionality ratio. (d) Corresponding results for the CT task.
The mean and standard deviation across three independent complete
experimental runs are shown.
of target components corresponding to a given DR that are
contained in the larger interval part. As displayed in Fig. 3c,
d, INNs achieve a DA consistently above 0.5 (chance), indi-
cating that the interval uncertainty scores contain directional
information.
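The directionality ratio and accuracy can be computed along the following lines. This is a simplified sketch: the function names are hypothetical, and the accuracy is aggregated over all components at once, whereas the evaluation described above reports it separately for each value of DR.

```python
import math


def directionality_ratio(lower, pred, upper):
    """DR for one component: larger interval part divided by the smaller one."""
    up, down = upper - pred, pred - lower
    small, large = min(up, down), max(up, down)
    return math.inf if small == 0 else large / small


def directionality_accuracy(targets, lowers, preds, uppers):
    """Relative frequency of targets contained in the larger interval part."""
    hits = counted = 0
    for x, lo, p, hi in zip(targets, lowers, preds, uppers):
        up, down = hi - p, p - lo
        if up == down:
            continue  # symmetric interval: no preferred direction
        counted += 1
        if up > down:
            hits += p <= x <= hi   # larger part lies above the prediction
        else:
            hits += lo <= x <= p   # larger part lies below the prediction
    return hits / counted if counted else float("nan")


# Toy example with two components sharing the interval [0, 4] around 1.
dr = directionality_ratio(0.0, 1.0, 4.0)
da = directionality_accuracy([2.0, 0.5], [0.0, 0.0], [1.0, 1.0], [4.0, 4.0])
print(dr, da)
```

A value of DA above 0.5 means that the target lies on the side of the larger interval part more often than expected by chance.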
Case study B: limited angle computed tomography
Next, we consider a 2D computed tomography (CT) task on
real-world data in order to evaluate the detection capabilities
of the UQ methods with respect to the three failure modes
(i)–(iii). More precisely, we consider limited angle CT, which
has applications in dental tomography, breast tomosynthesis
or electron tomography. For this, $A$ is a subsampled discrete
Radon transform with subsampling corresponding to
a moderate missing wedge of 30°. Limited angle measurements
are simulated according to (1) and the non-learned
inversion $A^\dagger$ is based on the filtered backprojection algorithm
(FBP) [21]. The underlying prediction network is a
U-Net [25] variant. Our experiments are based on a data set
consisting of $512 \times 512$ human CT scans from the AAPM
Low Dose CT Grand Challenge data [20].¹ In total, it contains
2580 full-dose images with a slice thickness of 3 mm
from 10 patients. Eight of these ten patients were used for
training (2036 samples), one for validation (214 samples)
and one for testing (330 samples). We trained the underlying
network $\Phi$ for 400 epochs using Adam [15]. The interval
parameters of the INN were subsequently trained for another
15 epochs with $\beta = 10^{-4}$. We limited the interval training
to the last twelve layers. For the MCDrop comparison, we
use $T = 128$ samples. The ProbOut model was trained in
the same way as $\Phi$ using 400 Adam epochs.
¹ See: https://www.aapm.org/GrandChallenge/LowDoseCT/. We
would like to thank Dr. Cynthia McCollough, the Mayo Clinic, and the
American Association of Physicists in Medicine as well as the grants
EB017095 and EB017185 from the National Institute of Biomedical
Imaging and Bioengineering for providing the AAPM data.
Experiment B (i): general prediction error detection
First, we evaluate how helpful UQ scores are for estimat-
ing the prediction error caused by the ill-posedness of the
challenging CT task, see Fig. 4. The wedge of missing
angles in the measurements results in reconstruction arti-
facts especially at vertical edges in the images. In order
to best visualize these geometric effects of the very struc-
tured null-space of the limited angle CT forward operator,
we do not add noise in this experiment. INNs are clearly
able to reveal the reconstruction uncertainty along the “miss-
ing edges.” For a more quantitative comparison of the UQ
methods, we use the performance weighted correlation coefficient
$\mathrm{PWCC}(z, x) = \mathrm{corr}(|\Phi(z) - x|, u(z)) / \|\Phi(z) - x\|_2^2$
between the uncertainty score $u$ and the absolute prediction
error. Performance weighting (normalizing by the mean
squared error of the prediction) is necessary to discourage
rewards for poor prediction models with high uncertainties
everywhere. The average results over the test set for three
independent complete experimental runs are summarized in
Table 1. Both INNs and MCDrop are able to detect prediction
errors, with INNs achieving slightly higher correlations.
In Fig. 3d, the directional accuracy of the INN is illustrated
analogously to the corresponding experiment in “Case study
A: deconvolution of 1D signals” section. Again it is consis-
tently above 0.5 (chance).
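A minimal sketch of the performance weighted correlation coefficient for a single flattened image is given below. The function names are hypothetical. Following the verbal description, the correlation is divided by the mean squared error; the squared norm in the displayed formula differs from this only by the constant number of pixels.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def pwcc(pred, target, uncertainty):
    """Correlation of absolute error and uncertainty, weighted by 1 / MSE."""
    abs_err = [abs(p - x) for p, x in zip(pred, target)]
    mse = sum(e * e for e in abs_err) / len(abs_err)
    return pearson(abs_err, uncertainty) / mse


# Toy example: the uncertainty score matches the absolute error exactly.
score = pwcc(pred=[1.0, 2.0, 3.0, 4.0],
             target=[1.0, 2.0, 2.0, 2.0],
             uncertainty=[0.0, 0.0, 1.0, 2.0])
print(score)
```

In the toy example the correlation is perfect and the mean squared error is 1.25, so the score is their quotient. A model with the same correlation but a larger error would receive a lower score.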
Experiment B (ii): Adversarial Artifact Detection
Second, we assess the capacity of UQ methods to capture
artifacts in the output that were caused by adversarial perturbations.
To that end, we create perturbed inputs for each input
sample $z$ in the test set by employing the box-constrained
L-BFGS algorithm [4] to minimize $\|\Phi(z_{\mathrm{adv}}) - x_{\mathrm{adv.\,tar.}}\|_2^2$
subject to $z_{\mathrm{adv}} \in [0,1]^n$. The adversarial targets $x_{\mathrm{adv.\,tar.}}$ are
created by subtracting 1.5 times its mean value from $x_{\mathrm{rec}}$
within a random $50 \times 50$ square, leading to clearly visible
Fig. 4 Results of three UQ
methods for the Error Detection
experiment for one exemplary
data sample of the limited angle
CT task. The plotting windows
are equally adjusted for better
contrast.
Table 1 Mean test results (± standard deviation) averaged over three experimental runs

UQ method   AdvDetect      ArtDetect      ErrDetect (PWCC)   ErrDetect (MSE)
INN         0.56 ± 0.05    0.52 ± 0.03    2211 ± 403         7.4 ± 0.65 × 10^{-4}
MCDrop      0.28 ± 0.02    0.26 ± 0.01    2170 ± 513         7.4 ± 0.65 × 10^{-4}
ProbOut     0.48 ± 0.12    0.34 ± 0.04    190 ± 28           6.7 ± 2 × 10^{-3}

Pearson correlation coefficients for the Adversarial Artifact Detection (AdvDetect) and Atypical Artifact Detection (ArtDetect) experiments and
PWCC with MSE for the Error Detection (ErrDetect) experiment
artifacts in the corresponding reconstructions; see Fig. 5. It is
arguable whether the technical aspects of such an adversarial
perturbation (i.e., attacking subsequently to a model-based
inversion) constitute a realistic scenario in the context of inverse
problems. However, for our purposes, such a simple setup
(see also [13]) is sufficient. We refer to [1], where adversarial
noise is mapped to the measurement domain. In order
to assess the detection capacity for this failure mode, the
different UQ schemes are then used to produce uncertainty
heatmaps for the generated adversarial inputs. A quantitative
evaluation is carried out by computing the mean Pearson
correlation coefficient between the pixel-wise change in the
uncertainty heatmaps $|u(z) - u(z_{\mathrm{adv}})|$ and the change of
reconstructions $|x_{\mathrm{rec}} - \Phi(z_{\mathrm{adv}})|$. The results are summarized
in Table 1 and illustrated in Fig. 5. We observe that
both INN and ProbOut are able to detect the image region
of adversarial perturbations, with INN achieving the highest
correlation. This shows that both methods are able to
highlight the effect that visually almost imperceptible input
perturbations can have on the reconstructions.
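The construction of the adversarial targets can be sketched as follows. The function name and the nested-list image layout are hypothetical, and the mean value is taken over the whole reconstruction, which is one possible reading of the description. The subsequent box-constrained L-BFGS optimization of the network input is not part of this snippet.

```python
import random


def make_adversarial_target(x_rec, size, rng):
    """Copy x_rec and subtract 1.5 times its mean inside a random square.

    x_rec: image as a list of rows; size: side length of the square patch.
    Returns the adversarial target and the top-left corner of the patch.
    """
    height, width = len(x_rec), len(x_rec[0])
    # Assumption: "its mean value" refers to the mean over the whole image.
    mean = sum(sum(row) for row in x_rec) / (height * width)
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    target = [row[:] for row in x_rec]  # leave the reconstruction untouched
    for i in range(top, top + size):
        for j in range(left, left + size):
            target[i][j] -= 1.5 * mean
    return target, (top, left)


# Toy example: constant 6 x 6 image and a 2 x 2 patch (instead of 50 x 50).
x_rec = [[1.0] * 6 for _ in range(6)]
target, (top, left) = make_adversarial_target(x_rec, 2, random.Random(0))
```

The returned corner can be used to build a mask of the attacked region when inspecting the resulting uncertainty heatmaps.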
Experiment B (iii): Atypical Artifact Detection
The third experiment is designed analogously to the setup
described by [1], i.e., an atypical artifact, which was not
present in the training data, is randomly placed in the input
to produce $z_{\mathrm{OoD}}$. More precisely, the silhouette of a peace
dove is inserted in each image of the test set; see Fig. 5.
The simulation of the measurements and model-based inversions
is carried out as before. A quantitative evaluation
is carried out by computing the mean Pearson correlation
coefficient between the change in the uncertainty heatmaps
$|u(z) - u(z_{\mathrm{OoD}})|$ and a binary mask marking the region of
change in the inputs. This evaluation isolates the uncertainty
caused by atypical artifacts and allows us to verify in a controlled
manner how the uncertainty scores of each UQ method
react to the artifacts. During deployment, such controlled isolation
is not possible. Instead, the joint uncertainty heatmaps
$u(z_{\mathrm{OoD}})$ will also capture other sources of uncertainty, thus
providing a more comprehensive alarm system. The results
are summarized in Table 1 and illustrated in Fig. 5. All three
UQ methods are correlated with the input change; however,
INN again achieves the highest correlation. This shows that
UQ in general, and INNs in particular, can serve as a warn-
ing system for inputs containing atypical features that might
otherwise lead to unnoticed and possibly erroneous recon-
struction artifacts.
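The mask-based evaluation for a single image can be sketched as below. The function names are hypothetical and the heatmaps are assumed to be given as flattened lists of pixel values; in the experiment the resulting coefficient is averaged over the test set.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def artifact_detection_score(u_clean, u_ood, mask):
    """Correlate the uncertainty change |u(z) - u(z_OoD)| with a binary mask.

    u_clean, u_ood: flattened uncertainty heatmaps for the original input and
    for the input with the inserted artifact; mask: 1 where the input changed.
    """
    change = [abs(a - b) for a, b in zip(u_clean, u_ood)]
    return pearson(change, mask)


# Toy example: the uncertainty rises only where the artifact was inserted.
r = artifact_detection_score(u_clean=[0.1, 0.1, 0.1, 0.1],
                             u_ood=[0.1, 0.5, 0.6, 0.1],
                             mask=[0, 1, 1, 0])
print(r)
```

A coefficient close to 1 indicates that the uncertainty heatmap changes almost exclusively in the region of the inserted artifact.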
Conclusion
We introduced INNs as a deterministic, post hoc and fast
approach for computing upper and lower bounds and subsequently
uncertainty maps for pre-trained neural networks. We
Fig. 5 Results of three UQ
methods for the AdvDetect
and ArtDetect experiments
for one exemplary data sample
of the limited angle CT task. The
plotting windows are equally
adjusted for better contrast
demonstrated that UQ in general and INNs in particular can
be used to provide a fine-grained detection of failure modes
of image reconstruction DNNs. INNs are able to capture
uncertainty due to noise and can be used to obtain directional
information. They perform well as an alarm system
for errors due to ill-posedness, adversarial noise and atypical
artifacts and thus offer a promising tool to expose the weak-
nesses of deep image reconstruction models.
Funding Open Access funding enabled and organized by Projekt
DEAL. J.M. acknowledges support by DFG-RTG 2260 BIOQIC. M.M.
acknowledges support by DFG-SPP 1798 Grants KU 1446/21 and
KU 1446/23. G.K. is grateful to MATH+-BMRC Project EF1x1 for
financial support. W.S. acknowledges support by BMBF/BIFOLD (ref.
01IS18025A and ref 01IS18037I).
Declarations
Conflict of interest L.O. co-chairs the DAISAM working group at the
ITU/WHO Focus Group AI4H.
Ethical approval This article does not contain any studies with human
participants or animals performed by any of the authors.
Informed consent No new patient data were acquired as part
of this work; public data were used from https://www.aapm.org/
GrandChallenge/LowDoseCT/.
Open Access This article is licensed under a Creative Commons
Attribution 4.0 International License, which permits use, sharing, adap-
tation, distribution and reproduction in any medium or format, as
long as you give appropriate credit to the original author(s) and the
source, provide a link to the Creative Commons licence, and indi-
cate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence,
unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your
intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright
holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
References
1. Antun V, Renna F, Poon C, Adcock B, Hansen AC (2020) On insta-
bilities of deep learning in image reconstruction and the potential
costs of AI. Proc Natl Acad Sci 117(48):30088–30095.
https://doi.org/10.1073/pnas.1907377117
2. Arridge S, Maass P, Öktem O, Schönlieb CB (2019) Solving inverse
problems using data-driven models. Acta Numer 28:1–174
3. Barber D, Bishop C (1998) Ensemble learning in Bayesian neu-
ral networks. In: Generalization in neural networks and machine
learning. Springer, pp 215–237
4. Byrd RH, Lu P, Nocedal J, Zhu C (1995) A limited memory algo-
rithm for bound constrained optimization. SIAM J Sci Comput
16(5):1190–1208
5. Carlini N, Wagner D (2017) Adversarial examples are not easily
detected: bypassing ten detection methods. In: Proceedings of the
10th ACM workshop on artificial intelligence and security, pp 3–14
6. Denker JS, Schwartz DB, Wittner BS, Solla SA, Howard RE, Jackel
LD, Hopfield JJ (1987) Large automatic learning, rule extraction,
and generalization. Complex Syst 1:877–922
7. Dietterich TG (2019) Robust artificial intelligence and robust
human organizations. Front Comput Sci 13(1):1–3
8. Foucart S, Rauhut H (2013) A mathematical introduction to com-
pressive sensing. Applied and Numerical Harmonic Analysis,
Birkhäuser
9. Foygel Barber R, Candès EJ, Ramdas A, Tibshirani RJ (2020)
The limits of distribution-free conditional predictive inference. Inf
Inference J IMA 10(2):455–482.
https://doi.org/10.1093/imaiai/iaaa017
10. Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approxima-
tion: representing model uncertainty in deep learning. In: Balcan
MF, Weinberger KQ (eds) Proceedings of The 33rd international
conference on machine learning, proceedings of machine learning
research, vol 48. PMLR, New York, pp 1050–1059
11. Garczarczyk Z (2000) Interval neural networks. In: 2000 IEEE
international symposium on circuits and systems. Emerging
technologies for the 21st Century. Proceedings (IEEE Cat
No.00CH36353), vol 3. Presses Polytech. Univ. Romandes,
Geneva, pp 567–570
12. Gast J, Roth S (2018) Lightweight probabilistic deep networks.
In: 2018 IEEE/CVF conference on computer vision and pattern
recognition, pp 3369–3378
13. Huang Y, Würfl T, Breininger K, Liu L, Lauritsch G, Maier A
(2018) Some investigations on robustness of deep learning in lim-
ited angle tomography. In: Frangi AF, Schnabel JA, Davatzikos
C, Alberola-López C, Fichtinger G (eds) Medical image comput-
ing and computer assisted intervention—MICCAI 2018. Springer,
Cham, pp 145–153
14. Jin KH, McCann MT, Froustey E, Unser M (2017) Deep convolu-
tional neural network for inverse problems in imaging. IEEE Trans
Image Process 26:4509–4522
15. Kingma DP, Ba J (2015) Adam: a method for stochastic optimiza-
tion. In: Bengio Y, LeCun Y (eds) 3rd international conference on
learning representations, ICLR 2015, San Diego, CA, USA, May
7–9, 2015, conference track proceedings
16. Kingma DP, Salimans T, Welling M (2015) Variational dropout
and the local reparameterization trick. In: Proceedings of the
28th international conference on neural information processing
systems—Volume 2, NIPS’15. MIT Press, Cambridge, pp 2575–
2583
17. Kowalski PA, Kulczycki P (2017) Interval probabilistic neural net-
work. Neural Comput Appl 28(4):817–834
18. Lakshminarayanan B, Pritzel A, Blundell C (2017) Simple and
scalable predictive uncertainty estimation using deep ensembles.
In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vish-
wanathan S, Garnett R (eds) Advances in neural information
processing systems, vol 30. Curran Associates Inc, New York, pp
6402–6413
19. MacKay DJC (1992) Bayesian methods for adaptive models. PhD
thesis, California Institute of Technology
20. McCollough C (2016) Tu-fg-207a-04: overview of the low dose
CT grand challenge. Med Phys 43(6 Part 35):3759–3760
21. Natterer F (2001) The mathematics of computerized tomography.
SIAM, Philadelphia
22. Nix DA, Weigend AS (1994) Estimating the mean and variance of
the target probability distribution. In: Proceedings of 1994 IEEE
international conference on neural networks (ICNN’94), vol 1, pp
55–60. https://doi.org/10.1109/ICNN.1994.374138
23. Oala L, Fehr J, Gilli L, Balachandran P, Leite AW, Calderon-
Ramirez S, Li DX, Nobis G, Alvarado EAM, Jaramillo-Gutierrez
G, Matek C, Shroff A, Kherif F, Sanguinetti B, Wiegand T (2020)
ML4H auditing: from paper to practice. In: Proceedings of the
machine learning for health NeurIPS workshop, proceedings of
machine learning research, vol 136. PMLR, pp 280–317.
http://proceedings.mlr.press/v136/oala20a.html
24. Rodrigues F, Pereira FC (2020) Beyond expectation: deep joint
mean and quantile regression for spatiotemporal problems. IEEE
Trans Neural Netw Learn Syst 31(12):5377–5389
25. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional net-
works for biomedical image segmentation. In: Navab N, Hornegger
J, Wells WM, Frangi AF (eds) Medical image computing and
computer-assisted intervention–MICCAI 2015, Lecture Notes in
Computer Science. Springer, Berlin, pp 234–241
26. Rousseeuw PJ (1984) Least median of squares regression. J Am
Stat Assoc 79(388):871–880
27. Ruff L, Kauffmann JR, Vandermeulen RA, Montavon G, Samek
W, Kloft M, Dietterich TG, Müller KR (2021) A unifying review of
deep and shallow anomaly detection. Proc IEEE 109(5):756–795.
https://doi.org/10.1109/JPROC.2021.3052449
28. Tax DMJ, Duin RPW (2004) Support vector data description. Mach
Learn 54(1):45–66
29. Wang H, Xingjian S, Yeung DY (2016) Natural-parameter net-
works: a class of probabilistic neural networks. In: Advances in
neural information processing systems, pp 118–126
30. Yang D, Wu W (2012) A smoothing interval neural network. Dis-
crete Dyn Nat Soc. https://doi.org/10.1155/2012/456919
Publisher’s Note Springer Nature remains neutral with regard to juris-
dictional claims in published maps and institutional affiliations.