International Journal of Computer Assisted Radiology and Surgery (2021) 16:2089–2097

https://doi.org/10.1007/s11548-021-02482-2

ORIGINAL ARTICLE

Detecting failure modes in image reconstructions with interval neural network uncertainty

Luis Oala¹ · Cosmas Heiß² · Jan Macdonald² · Maximilian März¹ · Gitta Kutyniok³ · Wojciech Samek¹

Received: 8 April 2021 / Accepted: 10 August 2021 / Published online: 4 September 2021

Abstract

Purpose The quantitative detection of failure modes is important for making deep neural networks reliable and usable

at scale. We consider three examples for common failure modes in image reconstruction and demonstrate the potential of

uncertainty quantification as a fine-grained alarm system.

Methods We propose a deterministic, modular and lightweight approach called Interval Neural Network (INN) that produces

fast and easy to interpret uncertainty scores for deep neural networks. Importantly, INNs can be constructed post hoc for

already trained prediction networks. We compare it against state-of-the-art baseline methods (MCDrop, ProbOut).

Results We demonstrate on controlled, synthetic inverse problems the capacity of INNs to capture uncertainty due to noise

as well as directional error information. On a real-world inverse problem with human CT scans, we can show that INNs

produce uncertainty scores which improve the detection of all considered failure modes compared to the baseline methods.

Conclusion Interval Neural Networks offer a promising tool to expose weaknesses of deep image reconstruction models and

ultimately make them more reliable. The fact that they can be applied post hoc to equip already trained deep neural network

models with uncertainty scores makes them particularly interesting for deployment.

Keywords Deep learning · Image reconstruction · Uncertainty quantification · Failure modes

Luis Oala, Cosmas Heiß, Jan Macdonald and Maximilian März have

contributed equally to this work.

Luis Oala

Cosmas Heiß

Jan Macdonald

Maximilian März

Gitta Kutyniok

Wojciech Samek

¹ Department of Artificial Intelligence, Fraunhofer HHI, Berlin, Germany

² Institut für Mathematik, Technische Universität Berlin, Berlin, Germany

³ Mathematisches Institut, Ludwig-Maximilians-Universität München, Munich, Germany

Introduction

The reconstruction of unknown signals from indirect mea-

surements plays an important role in many applications,

including medical imaging [2,14]. Typically, such tasks are

modeled as finite-dimensional linear inverse problems

$$y = Ax + \eta, \tag{1}$$

where $x \in \mathbb{R}^n$ is the signal of interest, $A \in \mathbb{R}^{m \times n}$ denotes the forward operator representing a physical measurement process, and $\eta \in \mathbb{R}^m$ models noise in the measurements. Important examples include magnetic resonance imaging and computed tomography, where $A$ is a subsampled discrete Fourier or Radon transform, respectively. Solving the inverse problem (1) requires computing an approximate reconstruction of $x$ from the observed measurements $y$.

Classical reconstruction methods, e.g., based on sparse

regularization models, constitute the state of the art for

solving (1) in many cases and are backed by theoretical

guarantees [8]. Recently, data-driven deep learning methods

are increasingly gaining attention and are repeatedly able to

outperform traditional solvers in terms of empirical recon-

struction performance or speed, see for example [2].

Despite the advantages, the use of deep learning methods

in sensitive applications such as clinical diagnosis is still a

concern [23], due to questions regarding the reliability and

robustness of the obtained reconstructions when compared

to traditional approaches [1,13]. What is more, erroneous

artifacts in the reconstructed signals can be hard to detect as

they tend to “blend in” well with the rest of the signal.

Various approaches for incorporating uncertainty quantification (UQ) into deep learning have been proposed to address

these issues [10,16,18,22]. However, as we demonstrate,

existing UQ approaches come with limitations regarding

their capacity to detect failure modes or their post hoc appli-

cability to trained deep learning models.

In this work, we consider a straight-forward approach to

solving (1) by employing a neural network to post-process a

standard model-based inversion as in [14]. This reconstruc-

tion is given by

$$x_{\mathrm{rec}} = \Phi \circ A^{\dagger}(y),$$

where $\Phi\colon \mathbb{R}^n \to \mathbb{R}^n$ is a neural network trained to minimize the loss $\|x - \Phi(A^{\dagger}(y))\|_2^2$ and $A^{\dagger}\colon \mathbb{R}^m \to \mathbb{R}^n$ denotes the non-learned model-based inversion (e.g., the filtered backprojection in the case of Radon measurements). We will denote $z = A^{\dagger}(y)$ in the following. Given $y$ or $z$, a UQ method is supposed to extend the predicted reconstruction $\Phi(z)$ by a component-wise uncertainty score $u(z)$ that provides additional information regarding the reliability of the reconstruction. Therefore, $u(z)$ should be correlated with the component-wise error $|x - \Phi(z)|$. We evaluate this for three

different failure modes [7] that can arise during inference

(see “Experiment B (i): general prediction error detection”

section to “Experiment B (iii): Atypical Artifact Detection”

section for more details):

(i) Errors caused solely by the ill-posedness of (1), which

is mostly determined by the strength of measurement

noise and the amount of undersampling,

(ii) Errors caused by adversarial perturbations to the net-

work inputs,

(iii) Errors caused by atypical artifacts that have not been

seen during the training.

Our main contributions can be summarized as follows:

We present a deterministic, modular and fast UQ-method for

deep neural networks (DNNs), called Interval Neural Net-

works (INN). We evaluate INNs for the detection of the three

different image reconstruction failure modes and demon-

strate that they provide improved results compared to two

existing UQ methods.

Related work

Whereas a number of methods from classical statistical

learning theory, such as Gaussian processes and approxima-

tions thereof [6,19], come with built-in uncertainty estimates,

DNNs have been limited in this regard. A surge of efforts to

treat neural networks from a variational perspective [3,16]

started to change that. In addition, there exist strands of

research in deep learning explicitly occupied with the detec-

tion of failure modes caused by adversarial and out of

distribution (OoD) inputs. These include Maximum Mean

Discrepancy, Kernel Density Estimation and other tools,

see [5] or the Minimum Covariance Determinant method

[26], Support Vector Data Description [28], among oth-

ers. We refer to [27] for a comprehensive overview. The

detection of adversarial and OoD inputs in these works is

typically done in the classification setting. We emphasize

that image-to-image regression is a fundamentally different

task: While classification is inherently discontinuous, image

reconstruction addresses a problem that allows for stable

solution methods in many cases, e.g., by sparse regulariza-

tion. Furthermore, we are not interested in a crude, outright

rejection of data points in the input space but rather seek to

obtain fine-grained information about erroneous artifacts in

the output space. More closely related to our goal are Monte Carlo dropout (MCDrop) [10] and direct variance estimation (ProbOut) [12], where epistemic and aleatoric uncertainty

quantification was considered for segmentation and depth-

estimation tasks. Hence, we include their approaches as

baseline comparison methods, see “Baseline UQ methods”

section.

Methods

Popular existing UQ frameworks for DNNs place paramet-

ric densities, most commonly Gaussian densities, over the

DNN parameters or predictions. Instead of using specific

parametrized densities, our INN method relies on bound-

ing distributions using intervals. This results in a flexible

and modular method that can be applied post hoc to a given

DNN $\Phi$ that has already been trained. A schematic illustration is provided in Fig. 1: The INN is formed by wrapping additional weight and bias intervals around the weights and biases of the underlying prediction DNN. This allows us to equip the DNN $\Phi$ with uncertainty capabilities without the need to modify $\Phi$ itself. After training the INN, we obtain

prediction intervals that are guaranteed to contain the orig-

inal prediction of the underlying network and are easy to

interpret. They provide exact upper and lower bounds for the

range of possible values that the DNN prediction may take

when slightly modifying the network parameters within the

prescribed weight and bias intervals.

Fig. 1 A schematic overview of

the proposed Interval Neural

Networks for image

reconstruction

Previously, the capacity of neural networks with interval

weights and biases was evaluated for fitting interval-valued

functions [11]. In contrast to [11], our targets $x_i$ are neither interval-valued nor univariate, leading to a different loss function which allows us to equip trained neural networks with uncertainty capabilities post hoc. For a direct comparison, see (3) in “Training Interval Neural Networks” section and Equation (18) in [11]. Further, [17,30]

explored neural networks implementing interval arithmetic

for robust classifications. However, in their setting, the focus

is purely on representing the inputs or outputs as intervals but

not the weights and biases. In contrast, our proposed INNs

determine interval bounds for all network parameters with

the goal of providing uncertainty scores for the predictions

of an underlying DNN.

Arithmetic of Interval Neural Networks

We will now give a description of those INN mechanisms

that deviate from standard DNNs. The forward propagation

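of a single input $z$ through a DNN is replaced by the forward propagation of a component-wise interval-valued input $[\underline{z}, \overline{z}]$ through the INN. This can be expressed similarly to standard feed-forward neural networks but using interval arithmetic instead. For interval-valued weight matrices $[\underline{W}, \overline{W}]$ and bias vectors $[\underline{b}, \overline{b}]$, the propagation through the $\ell$-th network layer can be expressed as

$$[\underline{z}, \overline{z}]^{(\ell+1)} = [\underline{W}, \overline{W}]^{(\ell)} [\underline{z}, \overline{z}]^{(\ell)} + [\underline{b}, \overline{b}]^{(\ell)}. \tag{2}$$

For nonnegative $[\underline{z}, \overline{z}]^{(\ell)}$, for example when using a nonnegative activation function such as the ReLU in the previous layer, we can explicitly rewrite (2) as

$$\underline{z}^{(\ell+1)} = \min\{\underline{W}^{(\ell)}, 0\}\,\overline{z}^{(\ell)} + \max\{\underline{W}^{(\ell)}, 0\}\,\underline{z}^{(\ell)} + \underline{b}^{(\ell)},$$

$$\overline{z}^{(\ell+1)} = \max\{\overline{W}^{(\ell)}, 0\}\,\overline{z}^{(\ell)} + \min\{\overline{W}^{(\ell)}, 0\}\,\underline{z}^{(\ell)} + \overline{b}^{(\ell)},$$

where the maximum and minimum are computed component-wise. Similarly, for point intervals $\underline{z}^{(\ell)} = \overline{z}^{(\ell)} =: z^{(\ell)}$, for example, as inputs to the first network layer, we can rewrite (2) as

$$\underline{z}^{(\ell+1)} = \underline{W}^{(\ell)} \max\{z^{(\ell)}, 0\} + \overline{W}^{(\ell)} \min\{z^{(\ell)}, 0\} + \underline{b}^{(\ell)},$$

$$\overline{z}^{(\ell+1)} = \overline{W}^{(\ell)} \max\{z^{(\ell)}, 0\} + \underline{W}^{(\ell)} \min\{z^{(\ell)}, 0\} + \overline{b}^{(\ell)},$$

regardless of whether $z^{(\ell)}$ is nonnegative or not. Optimizing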

the INN parameters requires obtaining the gradients of these

operations. This can be achieved using automatic differen-

tiation (backpropagation) in the same way as for standard

neural networks.
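As a minimal illustration of the interval arithmetic described above, the following NumPy sketch propagates a nonnegative interval through a single interval layer. The function name `interval_layer` and the toy weights are hypothetical and chosen only for this example; the sketch is not taken from the authors' code base.

```python
import numpy as np

def interval_layer(z_lo, z_hi, W_lo, W_hi, b_lo, b_hi):
    """One interval layer for a nonnegative input interval 0 <= z_lo <= z_hi."""
    # Lower bound: negative lower weights act on the upper input bound,
    # positive lower weights act on the lower input bound.
    out_lo = np.minimum(W_lo, 0) @ z_hi + np.maximum(W_lo, 0) @ z_lo + b_lo
    # Upper bound: the mirrored combination using the upper weights.
    out_hi = np.maximum(W_hi, 0) @ z_hi + np.minimum(W_hi, 0) @ z_lo + b_hi
    return out_lo, out_hi

# Hypothetical toy layer: point weights W, b widened by +/- 0.1 (assumption).
W = np.array([[1.0, -2.0]])
b = np.array([0.0])
z = np.array([1.0, 2.0])  # nonnegative point interval, z_lo = z_hi = z
lo, hi = interval_layer(z, z, W - 0.1, W + 0.1, b - 0.1, b + 0.1)
point = W @ z + b         # output of the underlying (non-interval) layer
```

Because the weight and bias intervals enclose the point parameters, the point output lies between `lo` and `hi`. A ReLU can be applied to both bounds separately since it is monotone.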

Training Interval Neural Networks

Let W() and b() be the weights and biases of the underlying

prediction network Φand let Φ:Rn→Rnand Φ:Rn→

Rndenote the functions mapping a point interval input zto

the upper and the lower interval bounds in the output layer of

the INN respectively.Givendata samples {zi,xi}m

i=1the INN

parameters[W,W]() and [b,b]() aretrained byminimizing

the empirical loss



i=1

max{xi−Φ(zi), 0}



2+

max{Φ(zi)−xi,0}



+β·

Φ(zi)−Φ(zi)

1,(3)

subject to the constraints W() ≤W() ≤W() and b() ≤

b() ≤b() for each layer. This way Φ(z)≤Φ(z)≤Φ(z)

is always guaranteed. The first two terms in (3) encour-

age that the predicted interval [Φ(zi), Φ(zi)]should contain

the target signal xi, while penalizing each component that

lies outside with the squared distance to the nearest interval

bound. The second term penalizes the interval size, so that

the predicted intervals cannot grow arbitrarily large. While a

quadraticpenalty ofthe interval size is alsopossible andleads

to similar theoretical bounds as in (4), we choose to minimize

the 1-norm to make the intervals more outlier inclusive. In

addition, the tightness parameter β>0 can further tune the

outlier-sensitivity of the intervals. This allows for a calibra-

tion ofthe INN uncertaintyscores according toan application

specific risk-budget. In practice, we found that choosing β

similar to the mean absolute error of the underlying predic-

123

2092 International Journal of Computer Assisted Radiology and Surgery (2021) 16:2089–2097

tion network yields a good trade-off between coverage [9]

and tightness.
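The structure of the loss (3) can be sketched in a few lines of NumPy. The function name `inn_loss` and the toy numbers are assumptions made for illustration; the sketch only evaluates the loss for given interval bounds and does not enforce the parameter constraints or perform any training.

```python
import numpy as np

def inn_loss(x, lo, hi, beta):
    """Evaluate the interval loss for targets x and bounds lo <= hi.

    All arrays have shape (num_samples, n); beta is the tightness parameter.
    """
    above = np.maximum(x - hi, 0.0)   # target exceeds the upper bound
    below = np.maximum(lo - x, 0.0)   # target falls under the lower bound
    width = hi - lo                   # interval size, penalized in the 1-norm
    return np.sum(above**2) + np.sum(below**2) + beta * np.sum(np.abs(width))

# Hypothetical toy example: one sample with three components.
x = np.array([[0.0, 1.0, 2.0]])
lo = np.full((1, 3), 0.5)
hi = np.full((1, 3), 1.5)
loss = inn_loss(x, lo, hi, beta=0.1)
```

In this toy case, the first and last components miss the interval by 0.5 each, contributing 0.25 + 0.25, and the width penalty contributes 0.1 · 3.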

Properties of Interval Neural Networks

The uncertainty estimate of an INN is given by the width of the prediction interval, i.e., $u(z) = \overline{\Phi}(z) - \underline{\Phi}(z)$. In terms of computational overhead, INNs scale linearly in the cost of evaluating the underlying prediction DNN with a constant factor 2. In contrast, the popular MCDrop [10] scales linearly with a factor $T$ that is proportional to the number of stochastic forward passes; the authors recommend at least $T = 10$, see “Baseline UQ methods” section.

Further, INNs come with theoretical coverage guarantees

that can be derived from the Markov inequality: Assuming

that the loss (3) is optimized during training to yield an INN

with vanishing expected gradient with respect to the data

distribution, we obtain

$$\mathbb{P}_{(z,x)}\big[\underline{\Phi}(z)_i - \lambda\beta < x_i < \overline{\Phi}(z)_i + \lambda\beta\big] \ge 1 - \frac{1}{\lambda}, \tag{4}$$

for any $\lambda > 0$. In other words, for an input and target pair $(z, x)$ the probability of any component of the target lying inside the predicted interval enlarged by $\lambda\beta$ is at least $1 - \frac{1}{\lambda}$. As $\beta$ is usually very small, this ensures a fast decay of the probability of the components of $x$ lying outside the predicted interval bounds. Consequently, a component with a small uncertainty score was correctly reconstructed up to a small error with high probability. Of course, the training distribution needs to be

well representative of the true data distribution to extrapolate

this property to unseen data.
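To make the coverage statement (4) concrete, the following sketch compares the theoretical lower bound $1 - 1/\lambda$ with the empirical fraction of target components inside the enlarged intervals. The function names and the toy data are hypothetical; on real data the bound is only expected to hold under the assumptions stated above.

```python
import numpy as np

def coverage_bound(lam):
    """Lower bound 1 - 1/lambda from the Markov-type guarantee."""
    return 1.0 - 1.0 / lam

def empirical_coverage(x, lo, hi, beta, lam):
    """Fraction of components with lo - lam*beta < x < hi + lam*beta."""
    margin = lam * beta
    inside = (x > lo - margin) & (x < hi + margin)
    return inside.mean()

# Hypothetical toy data: four target components and their interval bounds.
x = np.array([0.2, 1.0, 2.0, 2.9])
lo = np.array([0.5, 0.5, 1.5, 3.5])
hi = np.array([1.5, 1.5, 2.5, 4.5])
cov = empirical_coverage(x, lo, hi, beta=0.1, lam=5.0)
bound = coverage_bound(10.0)
```

For example, $\lambda = 10$ guarantees a coverage of at least 0.9 for intervals enlarged by $10\beta$. For $\lambda \le 1$ the bound is vacuous.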

Finally, the optimization of the loss (3) yields additional

information: If the prediction $\Phi(z)$ lies closer to one boundary of the predicted interval, the true target $x$ has a higher

probability of lying on the other side of the interval. Con-

sequently, INNs can provide directional uncertainty scores.

A quantitative assessment of this capability is given in

Fig. 3c+d. We note that it is also possible to explore asym-

metric uncertainty estimates in the probabilistic setting, e.g.,

via exponential family distributions [29] or quantile regres-

sion [24]. In contrast to INNs, these methods cannot be

applied post hoc as they require substantial modifications

to the underlying prediction network.

Baseline UQ methods

In addition to our INN approach, we consider two other

related and popular UQ baseline methods for comparison.

First, Monte Carlo dropout (MCDrop) [10] obtains uncertainty scores from the sample variance of multiple stochastic forward passes of the same input signal. In other words, if $\Phi_1, \dots, \Phi_T$ are realizations of independent draws of random dropout masks for the same underlying network $\Phi$, the component-wise uncertainty estimate is

$$u_{\mathrm{MCDrop}}(z) = \Big( \tfrac{1}{T-1} \Big( \sum_{t=1}^{T} \Phi_t(z)^2 - \tfrac{1}{T} \Big( \sum_{t=1}^{T} \Phi_t(z) \Big)^2 \Big) \Big)^{1/2}.$$

Second, a direct variance estimation (ProbOut) was proposed in [22] and later expanded in [12]. Here, the number of output components of the prediction network is doubled and trained to approximate the mean and variance of a Gaussian distribution. The resulting network $\Phi_{\mathrm{ProbOut}}\colon \mathbb{R}^n \to \mathbb{R}^n \times \mathbb{R}^n$, $z \mapsto (\Phi_{\mathrm{mean}}(z), \Phi_{\mathrm{var}}(z))$ is trained by minimizing the empirical loss

$$\sum_i \big\| (x_i - \Phi_{\mathrm{mean}}(z_i)) / \sqrt{\Phi_{\mathrm{var}}(z_i)} \big\|_2^2 + \big\| \log \Phi_{\mathrm{var}}(z_i) \big\|_1.$$

The component-wise uncertainty score of ProbOut is $u_{\mathrm{ProbOut}}(z) = (\Phi_{\mathrm{var}}(z))^{1/2}$. Note that, in contrast to INN and MCDrop, the ProbOut approach requires the incorporation of UQ already during training. Thus, it cannot be employed as a post hoc evaluation of an already trained, underlying network $\Phi$. The role of the actual prediction network is taken by $\Phi_{\mathrm{mean}}$.
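The two baseline scores can be sketched as follows. The function names are hypothetical, the stochastic forward passes are replaced by random numbers for illustration, and the MCDrop score is read here as the unbiased sample standard deviation over $T$ passes; none of this is taken from the authors' implementation.

```python
import numpy as np

def mcdrop_uncertainty(samples):
    """Component-wise MCDrop score from T stochastic passes, shape (T, n)."""
    T = samples.shape[0]
    sum_sq = (samples**2).sum(axis=0)
    sq_sum = samples.sum(axis=0) ** 2
    var = (sum_sq - sq_sum / T) / (T - 1)
    # Clip tiny negative values caused by floating-point cancellation.
    return np.sqrt(np.maximum(var, 0.0))

def probout_loss(x, mean, var):
    """ProbOut-style loss: variance-weighted squared error plus |log var|."""
    return np.sum((x - mean) ** 2 / var) + np.sum(np.abs(np.log(var)))

# Stand-in for T = 64 dropout passes on a signal with 5 components (assumption).
rng = np.random.default_rng(0)
samples = rng.normal(size=(64, 5))
u_mc = mcdrop_uncertainty(samples)
toy_loss = probout_loss(np.array([1.0, 2.0]), np.array([0.0, 2.0]), np.array([1.0, 1.0]))
```

The MCDrop score requires $T$ forward passes per input, whereas the ProbOut score is read off directly from the variance output of a network that was trained with this loss.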

Experiments

We present experiments for two different inverse problems.

First, a deconvolution task with 1D signals, and second, a tomography task on real-world 2D image signals. Both setups

are described in more detail below. The description of all

hyperparameters for the experiments is kept brief and we

refer to our publicly available code at https://github.com/luisoala/inn for full details.

Case study A: deconvolution of 1D signals

We start with a synthetic, didactic experiment, inspired by

a one-dimensional deconvolution task, to demonstrate the

properties of INNs discussed in “Properties of Interval Neural Networks” section. For this purpose, we choose $n = m = 512$ and $A = DSD$, where $D$ is a discrete cosine transform (Type I DCT) and $S$ is a diagonal matrix with entries $s_j = \left(\frac{n-j}{n-1}\right)^{\nu} \in [0,1]$ that decay with a fixed exponent $\nu = 8$. We draw synthetically generated signals $x$ from a distribution of piecewise constant functions with random jump positions and heights, see Fig. 2. The corresponding measurements $y$ are computed according to (1). We generate a data set consisting of 2000 sample pairs $(y_i, x_i)$, 1600 of which were used for training, 200 for validation and 200 for testing. The underlying prediction network $\Phi$ is a convolutional neural network (consisting of ten convolutional layers and three dropout layers in between) trained to directly map $y$ to $x$, i.e., we use $A^{\dagger} = \mathrm{Id}$ and thus $z = A^{\dagger} y = y$ in this experiment. We trained the underlying network $\Phi$ for 100 epochs using Adam [15]. The interval parameters of the INN were subsequently trained for another 100 epochs with $\beta = 2 \cdot 10^{-3}$. For the MCDrop comparison, we use $T = 64$ samples. The ProbOut model was trained in the same way as $\Phi$ using 100 Adam epochs. Note that all subsequent evaluations, as well as the plots in Fig. 2, are computed using test samples.

Fig. 2 Results for the deconvolution task for one exemplary signal without noise (left) and with additive Gaussian noise ($\sigma = 0.05$) on both the measurements $y$ and signal $x$ (right). The first row shows inputs $z = y$ and targets $x$. Below the target $x$, prediction $\Phi(z)$ and uncertainty score $u(z)$ as well as the uncertainty compared to the absolute error $|\Phi(z) - x|$ are shown for the three UQ methods.
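Two ingredients of the synthetic setup in Case study A can be sketched directly: the decaying diagonal entries $s_j$ and a piecewise constant test signal. The generator `piecewise_constant`, the number of jumps and the uniform height range are assumptions for illustration only; the exact signal distribution is specified in the authors' code.

```python
import numpy as np

n, nu = 512, 8
j = np.arange(1, n + 1)
# Diagonal entries s_j = ((n - j) / (n - 1))^nu, decaying from 1 to 0.
s = ((n - j) / (n - 1)) ** nu

def piecewise_constant(n, num_jumps, rng):
    """Hypothetical generator: random jump positions and uniform heights."""
    positions = np.sort(rng.choice(np.arange(1, n), size=num_jumps, replace=False))
    heights = rng.uniform(-1.0, 1.0, size=num_jumps + 1)
    x = np.empty(n)
    # np.split yields num_jumps + 1 index segments, one per constant piece.
    for k, segment in enumerate(np.split(np.arange(n), positions)):
        x[segment] = heights[k]
    return x

rng = np.random.default_rng(0)
x = piecewise_constant(n, num_jumps=5, rng=rng)
```

The rapid decay of `s` with exponent 8 means that most coefficients are strongly damped, which is what makes the deconvolution task ill-posed.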

In order to evaluate the UQ methods’ abilities to capture uncertainty due to noisy data, we consider additive Gaussian noise $\eta \sim \mathcal{N}(0, \sigma^2 \cdot \mathrm{Id})$ on the measurements over a range of noise levels $\sigma$ (Fig. 3a) as well as $\eta_1, \eta_2 \sim \mathcal{N}(0, \sigma^2 \cdot \mathrm{Id})$ on the measurements and targets, where (1) is adjusted to $y = A(x + \eta_1) + \eta_2$ (Fig. 3b and right column of Fig. 2). In this case, INNs are able to capture the additional uncertainty of $\eta_1$ using the bias parameters of the final network layer. In Fig. 3, it can be observed how, in contrast to MCDrop, our method and ProbOut are able to capture independent noise in the data, with ProbOut reacting to a lesser degree than the INN. Note also that in Fig. 3 some of the ProbOut evaluations are shifted to the right, indicating a reduced reconstruction performance compared to the other methods.

Fig. 3 (a) Mean uncertainty of the three UQ methods for varying levels $\sigma$ of additive Gaussian noise on the measurements $y$ for the deconvolution task. (b) Corresponding results for additive noise on both the measurements $y$ and signals $x$. (c) Illustration of the directional information contained in the INN output intervals for the deconvolution task. The additional right axis (in blue) displays the relative frequency of signal components for each directionality ratio. (d) Corresponding results for the CT task. The mean and standard deviation across three independent complete experimental runs are shown.

Finally, we determine the directional information of the INN uncertainty scores as discussed in “Properties of Interval Neural Networks” section. For this, we define the component-wise directionality ratio by

$$\mathrm{DR}(z) = \frac{\max\{\overline{\Phi}(z) - \Phi(z),\ \Phi(z) - \underline{\Phi}(z)\}}{\min\{\overline{\Phi}(z) - \Phi(z),\ \Phi(z) - \underline{\Phi}(z)\}},$$

i.e., as the ratio between the larger and smaller part of the interval $[\underline{\Phi}(z), \overline{\Phi}(z)]$ when divided by the prediction $\Phi(z)$. The directionality accuracy (DA) is the relative frequency of target components corresponding to a given DR that are contained in the larger interval part. As displayed in Fig. 3c, d, INNs achieve a DA consistently above 0.5 (chance), indicating that the interval uncertainty scores contain directional information.
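The directionality ratio and accuracy can be computed as in the following sketch. The function names and the small `eps` guard against division by zero are assumptions, and the accuracy is computed over all components at once rather than per DR bin as in Fig. 3c, d.

```python
import numpy as np

def directionality_ratio(pred, lo, hi, eps=1e-12):
    """Ratio of the larger to the smaller interval part around the prediction."""
    upper = hi - pred
    lower = pred - lo
    return np.maximum(upper, lower) / np.maximum(np.minimum(upper, lower), eps)

def directionality_accuracy(x, pred, lo, hi):
    """Fraction of target components lying in the larger interval part."""
    upper_larger = (hi - pred) > (pred - lo)
    in_upper = (x >= pred) & (x <= hi)
    in_lower = (x <= pred) & (x >= lo)
    return np.mean(np.where(upper_larger, in_upper, in_lower))

# Hypothetical toy values for four components.
pred = np.zeros(4)
lo = np.array([-1.0, -3.0, -1.0, -2.0])
hi = np.array([3.0, 1.0, 2.0, 1.0])
x = np.array([2.0, -2.0, -0.5, 0.5])
dr = directionality_ratio(pred, lo, hi)
da = directionality_accuracy(x, pred, lo, hi)
```

In this toy case, two of the four targets lie in the larger part of their interval, so the accuracy equals the chance level of 0.5.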

Case study B: limited angle computed tomography

Next, we consider a 2D computed tomography (CT) task on real-world data in order to evaluate the detection capabilities of the UQ methods with respect to the three failure modes (i)–(iii). More precisely, we consider limited angle CT, which has applications in dental tomography, breast tomosynthesis or electron tomography. For this, $A$ is a subsampled discrete Radon transform with subsampling corresponding to a moderate missing wedge of 30°. Limited angle measurements are simulated according to (1) and the non-learned inversion $A^{\dagger}$ is based on the filtered backprojection algorithm (FBP) [21]. The underlying prediction network is a U-Net [25] variant. Our experiments are based on a data set consisting of 512 × 512 human CT scans from the AAPM Low Dose CT Grand Challenge data [20].¹ In total, it contains 2580 full-dose images with a slice thickness of 3 mm from 10 patients. Eight of these ten patients were used for training (2036 samples), one for validation (214 samples) and one for testing (330 samples). We trained the underlying network $\Phi$ for 400 epochs using Adam [15]. The interval parameters of the INN were subsequently trained for another 15 epochs with $\beta = 10^{-4}$. We limited the interval training to the last twelve layers. For the MCDrop comparison, we use $T = 128$ samples. The ProbOut model was trained in the same way as $\Phi$ using 400 Adam epochs.

¹ See: https://www.aapm.org/GrandChallenge/LowDoseCT/. We would like to thank Dr. Cynthia McCollough, the Mayo Clinic, and the

American Association of Physicists in Medicine as well as the grants

EB017095 and EB017185 from the National Institute of Biomedical

Imaging and Bioengineering for providing the AAPM data.

Experiment B (i): general prediction error detection

First, we evaluate how helpful UQ scores are for estimat-

ing the prediction error caused by the ill-posedness of the

challenging CT task, see Fig. 4. The wedge of missing

angles in the measurements results in reconstruction arti-

facts especially at vertical edges in the images. In order

to best visualize these geometric effects of the very struc-

tured null-space of the limited angle CT forward operator,

we do not add noise in this experiment. INNs are clearly

able to reveal the reconstruction uncertainty along the “missing edges.” For a more quantitative comparison of the UQ methods, we use the performance weighted correlation coefficient $\mathrm{PWCC}(z, x) = \mathrm{corr}(|\Phi(z) - x|, u(z)) / \|\Phi(z) - x\|_2^2$ between the uncertainty score $u$ and the absolute prediction error. Performance weighting (normalizing by the mean

squared error of the prediction) is necessary to discourage

rewards for poor prediction models with high uncertainties

everywhere. The average results over the test set for three

independent complete experimental runs are summarized in

Table 1. Both INNs and MCDrop are able to detect prediction errors, with INNs achieving slightly higher correlations.

In Fig. 3d, the directional accuracy of the INN is illustrated

analogously to the corresponding experiment in “Case study

A: deconvolution of 1D signals” section. Again it is consis-

tently above 0.5 (chance).
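A possible way to compute the performance weighted correlation coefficient for a single sample is sketched below. The function name `pwcc` is hypothetical, and the normalization follows the prose description above (division by the mean squared error of the prediction); this reading is an assumption.

```python
import numpy as np

def pwcc(x, pred, u):
    """Pearson correlation of |pred - x| and u, divided by the MSE of pred."""
    err = np.abs(pred - x).ravel()
    corr = np.corrcoef(err, u.ravel())[0, 1]
    mse = np.mean(err**2)
    return corr / mse

# Hypothetical toy sample: the uncertainty is exactly twice the absolute error.
x = np.zeros(4)
pred = np.array([0.1, 0.2, 0.3, 0.4])
u = 2.0 * np.abs(pred - x)
score = pwcc(x, pred, u)
```

Here the correlation is 1 and the mean squared error is 0.075, so a model with the same correlation but a larger error would receive a proportionally lower score.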

Fig. 4 Results of three UQ methods for the Error Detection experiment for one exemplary data sample of the limited angle CT task. The plotting windows are equally adjusted for better contrast.

Table 1 Mean test results (± standard deviation) averaged over three experimental runs

UQ method   AdvDetect     ArtDetect     ErrDetect (PWCC)   ErrDetect (MSE)
INN         0.56 ± 0.05   0.52 ± 0.03   2211 ± 403         (7.4 ± 0.65) × 10⁻⁴
MCDrop      0.28 ± 0.02   0.26 ± 0.01   2170 ± 513         (7.4 ± 0.65) × 10⁻⁴
ProbOut     0.48 ± 0.12   0.34 ± 0.04   190 ± 28           (6.7 ± 2) × 10⁻³

Pearson correlation coefficients for the Adversarial Artifact Detection (AdvDetect) and Atypical Artifact Detection (ArtDetect) experiments and PWCC with MSE for the Error Detection (ErrDetect) experiment

Experiment B (ii): Adversarial Artifact Detection

Second, we assess the capacity of UQ methods to capture artifacts in the output that were caused by adversarial perturbations. To that end, we create perturbed inputs for each input sample $z$ in the test set by employing the box-constrained L-BFGS algorithm [4] to minimize $\|\Phi(z_{\mathrm{adv}}) - x_{\mathrm{adv.\,tar.}}\|_2$ subject to $z_{\mathrm{adv}} \in [0,1]^n$. The adversarial targets $x_{\mathrm{adv.\,tar.}}$ are created by subtracting 1.5 times its mean value from $x_{\mathrm{rec}}$ within a random 50 × 50 square, leading to clearly visible artifacts in the corresponding reconstructions; see Fig. 5. It is arguable whether the technical aspects of such an adversarial perturbation (i.e., attacking subsequently to a model-based inversion) are a realistic scenario in the context of inverse problems. However, for our purposes, such a simple setup (see also [13]) is sufficient. We refer to [1], where adversarial noise is mapped to the measurement domain. In order to assess the detection capacity for this failure mode, the different UQ schemes are then used to produce uncertainty heatmaps for the generated adversarial inputs. A quantitative evaluation is carried out by computing the mean Pearson correlation coefficient between the pixel-wise change in the uncertainty heatmaps $|u(z) - u(z_{\mathrm{adv}})|$ and the change of reconstructions $|x_{\mathrm{rec}} - \Phi(z_{\mathrm{adv}})|$. The results are summarized in Table 1 and illustrated in Fig. 5. We observe that both INN and ProbOut are able to detect the image region of adversarial perturbations, with INN achieving the highest correlation. This shows that both methods are able to visually highlight the effect that almost imperceptible input perturbations can have on the reconstructions.
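The construction of the adversarial targets can be sketched as follows. The function name is hypothetical, and "its mean value" is interpreted here as the mean of the whole reconstruction, which is an assumption. The subsequent box-constrained minimization is not shown; one option would be SciPy's `scipy.optimize.minimize` with `method="L-BFGS-B"` and `bounds`.

```python
import numpy as np

def make_adversarial_target(x_rec, rng, size=50, factor=1.5):
    """Subtract factor * mean(x_rec) inside a random size x size square."""
    h, w = x_rec.shape
    top = rng.integers(0, h - size + 1)    # upper-left corner, square fits inside
    left = rng.integers(0, w - size + 1)
    target = x_rec.copy()
    target[top:top + size, left:left + size] -= factor * x_rec.mean()
    mask = np.zeros(x_rec.shape, dtype=bool)
    mask[top:top + size, left:left + size] = True
    return target, mask

# Hypothetical constant reconstruction standing in for a 512 x 512 CT slice.
rng = np.random.default_rng(0)
x_rec = np.full((512, 512), 0.4)
x_adv_target, square = make_adversarial_target(x_rec, rng)
```

The returned mask marks the modified square and is convenient for visual checks of where the adversarial artifact is expected to appear.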

Experiment B (iii): Atypical Artifact Detection

The third experiment is designed analogously to the setup described by [1], i.e., an atypical artifact, which was not present in the training data, is randomly placed in the input to produce $z_{\mathrm{OoD}}$. More precisely, the silhouette of a peace dove is inserted in each image of the test set; see Fig. 5. The simulation of the measurements and model-based inversions is carried out as before. A quantitative evaluation is carried out by computing the mean Pearson correlation coefficient between the change in the uncertainty heatmaps $|u(z) - u(z_{\mathrm{OoD}})|$ and a binary mask marking the region of change in the inputs. This evaluation isolates the uncertainty caused by atypical artifacts and allows us to verify in a controlled manner how the uncertainty scores of each UQ method react to the artifacts. During deployment, such controlled isolation is not possible. Instead, the joint uncertainty heatmaps $u(z_{\mathrm{OoD}})$ will also capture other sources of uncertainty, thus providing a more comprehensive alarm system. The results are summarized in Table 1 and illustrated in Fig. 5. All three

UQ methods are correlated with the input change; however,

INN again achieves the highest correlation. This shows that

UQ in general, and INNs in particular, can serve as a warn-

ing system for inputs containing atypical features that might

otherwise lead to unnoticed and possibly erroneous recon-

struction artifacts.
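The correlation-based evaluation used in the two artifact experiments can be sketched for a single sample as below. The function name `change_correlation` and the toy heatmaps are hypothetical; in the experiments the coefficient is averaged over the test set.

```python
import numpy as np

def change_correlation(u_clean, u_perturbed, reference):
    """Pearson correlation between |u_clean - u_perturbed| and a reference map.

    The reference is a binary mask (atypical artifacts) or the change of the
    reconstructions (adversarial artifacts).
    """
    change = np.abs(u_clean - u_perturbed).ravel()
    ref = np.asarray(reference, dtype=float).ravel()
    return np.corrcoef(change, ref)[0, 1]

# Hypothetical 8 x 8 heatmaps: uncertainty rises only inside the artifact mask.
mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:6] = True
u_clean = np.zeros((8, 8))
u_ood = 0.5 * mask
r_match = change_correlation(u_clean, u_ood, mask)
r_miss = change_correlation(u_clean, 0.5 * (~mask), mask)
```

A score that changes exactly where the artifact was inserted yields a correlation of 1, while a score that changes everywhere except in that region yields −1.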

Fig. 5 Results of three UQ methods for the AdvDetect and ArtDetect experiments for one exemplary data sample of the limited angle CT task. The plotting windows are equally adjusted for better contrast.

Conclusion

We introduced INNs as a deterministic, post hoc and fast approach for computing upper and lower bounds and subsequently uncertainty maps for pre-trained neural networks. We demonstrated that UQ in general and INNs in particular can be used to provide a fine-grained detection of failure modes of image reconstruction DNNs. INNs are able to capture uncertainty due to noise and can be used to obtain directional information. They perform well as an alarm system for errors due to ill-posedness, adversarial noise and atypical artifacts and thus offer a promising tool to expose the weaknesses of deep image reconstruction models.

Funding Open Access funding enabled and organized by Projekt

DEAL. J.M. acknowledges support by DFG-RTG 2260 BIOQIC. M.M.

acknowledges support by DFG-SPP 1798 Grants KU 1446/21 and

KU 1446/23. G.K. is grateful to MATH+-BMRC Project EF1x1 for

financial support. W.S. acknowledges support by BMBF/BIFOLD (ref.

01IS18025A and ref 01IS18037I).

Declarations

Conflict of interest L.O. co-chairs the DAISAM working group at the

ITU/WHO Focus Group AI4H.

Ethical approval This article does not contain any studies with human

participants or animals performed by any of the authors.

Informed consent No new patient data were acquired as part

of this work; public data were used from https://www.aapm.org/GrandChallenge/LowDoseCT/.

Open Access This article is licensed under a Creative Commons

Attribution 4.0 International License, which permits use, sharing, adap-

tation, distribution and reproduction in any medium or format, as

long as you give appropriate credit to the original author(s) and the

source, provide a link to the Creative Commons licence, and indi-

cate if changes were made. The images or other third party material

in this article are included in the article’s Creative Commons licence,

unless indicated otherwise in a credit line to the material. If material

is not included in the article’s Creative Commons licence and your

intended use is not permitted by statutory regulation or exceeds the

permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

References

1. Antun V, Renna F, Poon C, Adcock B, Hansen AC (2020) On insta-

bilities of deep learning in image reconstruction and the potential

costs of AI. Proc Natl Acad Sci 117(48):30088–30095. https://doi.org/10.1073/pnas.1907377117

2. Arridge S, Maass P, Öktem O, Schönlieb CB (2019) Solving inverse

problems using data-driven models. Acta Numer 28:1–174

3. Barber D, Bishop C (1998) Ensemble learning in Bayesian neu-

ral networks. In: Generalization in neural networks and machine

learning. Springer, pp 215–237

4. Byrd RH, Lu P, Nocedal J, Zhu C (1995) A limited memory algo-

rithm for bound constrained optimization. SIAM J Sci Comput

16(5):1190–1208

5. Carlini N, Wagner D (2017) Adversarial examples are not easily

detected: bypassing ten detection methods. In: Proceedings of the

10th ACM workshop on artificial intelligence and security, pp 3–14

6. Denker JS, Schwartz DB, Wittner BS, Solla SA, Howard RE, Jackel

LD, Hopfield JJ (1987) Large automatic learning, rule extraction,

and generalization. Complex Syst 1:877–922

7. Dietterich TG (2019) Robust artificial intelligence and robust

human organizations. Front Comput Sci 13(1):1–3

8. Foucart S, Rauhut H (2013) A mathematical introduction to com-

pressive sensing. Applied and Numerical Harmonic Analysis,

Birkhäuser

9. Foygel Barber R, Candès EJ, Ramdas A, Tibshirani RJ (2020)

The limits of distribution-free conditional predictive inference. Inf

Inference J IMA 10(2):455–482.

https://doi.org/10.1093/imaiai/iaaa017

10. Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approxima-

tion: representing model uncertainty in deep learning. In: Balcan

MF, Weinberger KQ (eds) Proceedings of The 33rd international

conference on machine learning, proceedings of machine learning

research, vol 48. PMLR, New York, pp 1050–1059

11. Garczarczyk Z (2000) Interval neural networks. In: 2000 IEEE

international symposium on circuits and systems. Emerging

technologies for the 21st Century. Proceedings (IEEE Cat

No.00CH36353), vol 3. Presses Polytech. Univ. Romandes,

Geneva, pp 567–570

12. Gast J, Roth S (2018) Lightweight probabilistic deep networks.

In: 2018 IEEE/CVF conference on computer vision and pattern

recognition, pp 3369–3378

13. Huang Y, Würfl T, Breininger K, Liu L, Lauritsch G, Maier A

(2018) Some investigations on robustness of deep learning in lim-

ited angle tomography. In: Frangi AF, Schnabel JA, Davatzikos

C, Alberola-López C, Fichtinger G (eds) Medical image comput-

ing and computer assisted intervention—MICCAI 2018. Springer,

Cham, pp 145–153

14. Jin KH, McCann MT, Froustey E, Unser M (2017) Deep convolu-

tional neural network for inverse problems in imaging. IEEE Trans

Image Process 26:4509–4522

15. Kingma DP, Ba J (2015) Adam: a method for stochastic optimiza-

tion. In: Bengio Y, LeCun Y (eds) 3rd international conference on

learning representations, ICLR 2015, San Diego, CA, USA, May

7–9, 2015, conference track proceedings

16. Kingma DP, Salimans T, Welling M (2015) Variational dropout

and the local reparameterization trick. In: Proceedings of the

28th international conference on neural information processing

systems—Volume 2, NIPS’15. MIT Press, Cambridge, pp 2575–

2583

17. Kowalski PA, Kulczycki P (2017) Interval probabilistic neural net-

work. Neural Comput Appl 28(4):817–834

18. Lakshminarayanan B, Pritzel A, Blundell C (2017) Simple and

scalable predictive uncertainty estimation using deep ensembles.

In: Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vish-

wanathan S, Garnett R (eds) Advances in neural information

processing systems, vol 30. Curran Associates Inc, New York, pp

6402–6413

19. MacKay DJC (1992) Bayesian methods for adaptive models. PhD

thesis, California Institute of Technology

20. McCollough C (2016) TU-FG-207A-04: overview of the low dose

CT grand challenge. Med Phys 43(6 Part 35):3759–3760

21. Natterer F (2001) The mathematics of computerized tomography.

SIAM, Philadelphia

22. Nix DA, Weigend AS (1994) Estimating the mean and variance of

the target probability distribution. In: Proceedings of 1994 IEEE

international conference on neural networks (ICNN’94), vol 1, pp

55–60. https://doi.org/10.1109/ICNN.1994.374138

23. Oala L, Fehr J, Gilli L, Balachandran P, Leite AW, Calderon-

Ramirez S, Li DX, Nobis G, Alvarado EAM, Jaramillo-Gutierrez

G, Matek C, Shroff A, Kherif F, Sanguinetti B, Wiegand T (2020)

ML4H auditing: from paper to practice. In: Proceedings of the

machine learning for health NeurIPS workshop, proceedings of

machine learning research, vol 136. PMLR, pp 280–317.

http://proceedings.mlr.press/v136/oala20a.html

24. Rodrigues F, Pereira FC (2020) Beyond expectation: deep joint

mean and quantile regression for spatiotemporal problems. IEEE

Trans Neural Netw Learn Syst 31(12):5377–5389

25. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional net-

works for biomedical image segmentation. In: Navab N, Hornegger

J, Wells WM, Frangi AF (eds) Medical image computing and

computer-assisted intervention–MICCAI 2015, Lecture Notes in

Computer Science. Springer, Berlin, pp 234–241

26. Rousseeuw PJ (1984) Least median of squares regression. J Am

Stat Assoc 79(388):871–880

27. Ruff L, Kauffmann JR, Vandermeulen RA, Montavon G, Samek

W, Kloft M, Dietterich TG, Müller KR (2021) A unifying review of

deep and shallow anomaly detection. Proc IEEE 109(5):756–795.

https://doi.org/10.1109/JPROC.2021.3052449

28. Tax DMJ, Duin RPW (2004) Support vector data description. Mach

Learn 54(1):45–66

29. Wang H, Xingjian S, Yeung DY (2016) Natural-parameter net-

works: a class of probabilistic neural networks. In: Advances in

neural information processing systems, pp 118–126

30. Yang D, Wu W (2012) A smoothing interval neural network. Dis-

crete Dyn Nat Soc. https://doi.org/10.1155/2012/456919

Publisher’s Note Springer Nature remains neutral with regard to juris-

dictional claims in published maps and institutional affiliations.
