scieee Science in your language
[en] (orig)
Vol.:(0123456789)
1 3
Journal of Medical Systems (2021) 45:105
https://doi.org/10.1007/s10916-021-01783-y
IMPLEMENTATION SCIENCE & OPERATIONS MANAGEMENT
Machine Learning forHealth: Algorithm Auditing & Quality Control
LuisOala1· AndrewG.Murchison2· PradeepBalachandran3· ShrutiChoudhary4· JanaFehr5·
AlixandroWerneckLeite6· PeterG.Goldschmidt7· ChristianJohner8· EloraD.M.Schörverth1· RoseNakasi9·
MartinMeyer10· FedericoCabitza11· PatBaird12· CarolinPrabhu13· EvaWeicken1· XiaoxuanLiu14·
MarkusWenzel1· SteffenVogler15· DarlingtonAkogo16· ShadaAlsalamah17,18· EmreKazim19·
AdrianoKoshiyama19· SvenPiechottka20· SheenaMacpherson21· IanShadforth21· ReginaGeierhofer22·
ChristianMatek23· JoachimKrois24· BrunoSanguinetti25· MatthewArentz26· PavolBielik27·
SaulCalderon‑Ramirez28· AussAbbood29· NicolasLanger30· StefanHaufe31· FerathKherif32· SameerPujari18·
WojciechSamek1· ThomasWiegand1
Received: 3 September 2021 / Accepted: 11 October 2021 / Published online: 2 November 2021
© The Author(s) 2021, corrected publication 2022
Abstract
Developers proposing new machine learning for health (ML4H) tools often pledge to match or even surpass the performance
of existing tools, yet the reality is usually more complicated. Reliable deployment of ML4H to the real world is challeng-
ing as examples from diabetic retinopathy or Covid-19 screening show. We envision an integrated framework of algorithm
auditing and quality control that provides a path towards the effective and reliable application of ML systems in healthcare.
In this editorial, we give a summary of ongoing work towards that vision and announce a call for participation to the special
issue Machine Learning for Health: Algorithm Auditing & Quality Control in this journal to advance the practice of ML4H
auditing.
Keywords Machine learning· Artificial intelligence· Algorithm· Health· Auditing· Quality control
Introduction
Machine learning (ML) technology promises to automate,
speed up or improve medical processes. A large number of
institutions and companies are ambitiously working on ful-
filling this promise spanning tasks such as medical image
classification [1], segmentation [2] or reconstruction [3],
protein structure prediction [4] and electrocardiography
interpretation [5], among others1. However, the deployment
of machine learning for health (ML4H) tools into real-world
applications has been slow because existing approval pro-
cesses [6] may not account for the particular failure modes
and risks that accompany (ML) technology [711]. Certain
changes to image data that may not change the decision of
a human expert can completely alter the output of an image
classification [12] or regression [13, 14] model. Model
performance estimates are often not valid for the types of
varying input distribution that can occur during real world
deployment [1517]. The decision heuristics a model learns
can differ from the heuristics we may expect a human to
use [1, 1820], and model predictions may come with ill-
calibrated statements of confidence [2123] or no estimate
of uncertainty altogether [24]. Developers proposing new
ML4H technologies sometimes promise to match or even
surpass the performance of existing methods [25] yet the
reality is often more complicated. Classical ML performance
evaluation does not automatically translate to clinical utility
as examples from large diabetic retinopathy projects [26]
or Covid-19 diagnosis illustrate [27]. The reliable and inte-
grated management of these risks remains an open scientific
and practical hurdle.
In order to overcome this hurdle, we envision a frame-
work of algorithm auditing and quality control that provides
a path towards the effective and reliable application of ML
systems in healthcare. In this editorial we give a brief sum-
mary of ongoing work towards that vision from our open
* Luis Oala
Extended author information available on the last page of the article
1 The larger machine learning community maintains a good over-
view of tasks, benchmarks and state-of-the-art methods at https://
paper swith code. com/.
Journal of Medical Systems (2021) 45:105
1 3
105 Page 2 of 8
collective of collaborators. Many of the considerations pre-
sented here originate from a consensus finding effort by the
International Telecommunication Union (ITU) and World
Health Organization (WHO) which started in 2018 as the
Focus Group on Artificial Intelligence for Health (FG-AI4H)
[28].
We are convinced that success on this path heavily
depends on practical feedback. Auditing processes that are
developed on paper have to be put to the test to ensure that
they translate to utility in the actual auditing practice [29].
That is why we are introducing the special issue Machine
Learning for Health: Algorithm Auditing & Quality Con-
trol in this journal (see the Call for Participation for more
details2). The special issue will provide a platform for the
submission, discussion and publication of audit methods and
reports. The resulting compendium is intended to be a use-
ful resource for users, developers, vendors and auditors of
ML4H systems to manage and mitigate their particular risks.
ML4H Algorithm Auditing & Quality Control
From a bird’s eye view, many ML tools share a set of core
components comprising data, an ML-model and its out-
puts, as visualized in Fig.1A. The typical ML product
life cycle goes through stages of planning, development,
validation and, potentially, deployment under appropriate
monitoring (see Fig.1B). Feedback loops between stages,
for example from product validation back to development,
are commonplace3.
An audit entails a detailed assessment of an ML4H tool
at one or more of the ML life cycle steps. It can be car-
ried out to anticipate, monitor, or retrospectively review
operations of the tool [30, 31]. The audit output should
consist of a comprehensive standardized report that can
be used by different stakeholders to efficiently communi-
cate the tool’s strengths and limitations [29]. We envision
a process by which an independent body, for example
appointed by a government, carries out the audit using the
methods and tools outlined below. Further, they can also
be used by manufacturers and researchers themselves to
carry out internal quality control [32]. In either scenario,
the assessment is carried out with respect to a dynamic set
of technical, clinical and regulatory considerations (see
Fig.1C) that depend on the concrete ML technology and
the intended use of the tool. Audit teams should thus com-
prise expertise in all these dimensions and have to be able
to synthesize related requirements across disciplines. In
the following, we list a selection of considerations for all
three of these auditing dimensions, tools that can be used
to aid the auditing process as well as the role so called
trial audits can play in advancing ML4H quality control.
Fig. 1 Process overview. A:
Most ML tools share a set of
core components compris-
ing data, a ML-model and its
outputs B: The typical ML
life cycle goes through stages
of planning, development,
validation and, potentially,
deployment under appropri-
ate monitoring C: An ML4H
audit is carried out with respect
to a dynamic set of technical,
clinical and regulatory con-
siderations that depend on the
concrete ML technology and the
intended use of the tool
2 In the supplement and at this address https:// aiaud it. org/ joms/
3 Both representations A and B in 1 are high level abstractions.
A granular taxonomy of ML tools or their life cycles is beyond the
scope of this editorial. We refer the interested reader to [76] and our
documentation [45] for an in-depth treatment.
Journal of Medical Systems (2021) 45:105
1 3
Page 3 of 8 105
Auditing Dimensions
The technical validation of an ML4H tool comprises the
application of data and ML model quality assessment meth-
ods to detect possible failure modes in the model’s behavior.
These include model-oriented metrics, such as predictive
performance, robustness [33, 34], interpretability [1, 35],
disparity [36] or uncertainty [13, 24, 37] but also data-
oriented metrics related to sample size determination [38],
sparseness [39], bias [40] distribution mismatch [41, 42] and
label quality [7]. Rigorous statistical analysis of the model
metrics is a common pitfall in both research and industry,
and thus plays an important role during technical validation
[43]. FG-AI4H has formulated a standardized quality assess-
ment framework based on existing good practices [4446]
and provides practical guidance and examples for perform-
ing technical validation audits on three ML4H tools [29].
Clinical Evaluation comprises an “ongoing procedure
to collect, appraise and analyse clinical data pertaining to
a medical device and to analyse whether there is sufficient
clinical evidence to confirm compliance with relevant essen-
tial requirements for safety and performance when using the
device according to the manufacturers instructions for use”
[47]. The EQUATOR-network, including STARD-AI [48],
CONSORT-AI [49] and SPIRIT-AI [50], as well as differ-
ent scientific journals and associations [5154], have devel-
oped guidelines for the design, implementation, reporting
and evaluation of AI interventions in various study designs.
Key concerns are whether the ML4H tool delivers util-
ity in clinical pathways, how cost-effective the clinician-
tool interaction is [55] and whether it provides the desired
benefits for the intended users [56]. To demonstrate reli-
able performance, it is important to look beyond common
machine learning performance statistics such as accuracy
and to evaluate in addition whether the ML4H tool is suited
to the clinical setting in which it will be used; for example,
whether the training and test data represent patient popula-
tions that are similar to the intended use population [7, 57]
and whether the output translates to medically meaningful
parameters [58].
Regulatory Assessment comprises the systematic evalu-
ation of ML4H tools with respect to the applicable regula-
tory requirements found in laws (MDR [59], IVDR [60], 21
CFR [61], among others), to international standards (such
as IEC 62304 [62], IEC 62366-1 [63] and ISO 14971 [64]),
to guidelines by regulatory bodies (for example FDA [65],
IMDRF [66]) or to guidelines and drafts by other organi-
zations (for example AAMI [67] or European Commission
[68]). Such guidance is of practical concern for stakehold-
ers in the ML4H ecosystem including manufacturers (e.g.
product managers, developers, developers and data scien-
tists, quality and regulatory affairs managers) and for regula-
tory bodies (authorities, notified bodies). The FG-AI4H has
identified and critically reviewed general yet fundamental
regulatory considerations related to ML4H. This overview
of regulatory considerations assessment have been converted
into specific and verifiable requirements and subsequently
published as a comprehensive assessment checklist entitled
“Good practices for health applications of machine learn-
ing: Considerations for manufacturers and regulators” [45]
which covers the entire life cycle outlined in 1B at a higher
resolution. It includes checklist items which should be given
high priority in the presence of limited time - an important
practical constraint for real-world audits. Examples and
comments give further guidance to users. New regulatory
developments, such as predetermined change control plans
[69], imply faster software update cycles and potentially
more frequent audits. Hence, good tooling can become an
important means to make effective as well as efficient audits
possible.
Auditing Tools
The auditing process can be supported by appropriate tools
to make it more targeted and time-efficient. This can include
process and requirements descriptions, as mentioned above
[44, 45, 56], which help to manage dynamic workflows that
may vary by use case and ML technology. It also includes
reporting templates to present the audit results in a stand-
ardized way for the communication between different stake-
holders. [29, 70]. In addition, the nature of ML4H tools, as
primarily software that interacts with data, lends itself to
the application of test automation and simulations for the
purpose of auditing. This requires software tools which can
handle custom evaluation scripts, the flexible processing of
different ML4H model formats and data modalities as well
as security protocols that protect intellectual property and
sensitive patient information [71]. We are working with open
source frameworks such as EvalAI [72] and MLflow [73] to
develop solutions for automated auditing4, federated auditing
in remote teams5 and automated report creation. Our first
demo platform is available via http:// health. aiaud it. or g/ 6 and
hosted on ITU provisioned infrastructure. While quantita-
tive performance measures can already be provided, it is
essential to also offer qualitative measures. This is realized
by requiring the users to fill out a standardized questionnaire
[74]. Quantitative and qualitative performance results are
then provided to the users as a comprehensive and standard-
ized report card [70].
4 https:// github. com/ aiaud it- org/ health- aiaud it- public
5 https:// github. com/ aiaud it- org/ amazon- sagem aker- mlflow- farga te
6 You are welcome to reach out to any of the contributors https://
aiaud it. org/ contr ibuto rs/ for information on how to join the efforts.
Advertisement
Journal of Medical Systems (2021) 45:105
1 3
105 Page 4 of 8
Trial Audits
We are convinced that success on the path towards a frame-
work for algorithm auditing and quality control depends
heavily on practical feedback. The development and refine-
ment of auditing processes should routinely be accompa-
nied by trial audits. In trial audits, draft processes and stand-
ards are applied to ML4H tools. The purpose of such an
exercise is to ensure that auditing processes developed on
paper translate to utility in actual auditing practice [29]. In
order to facilitate the implementation of trial audits, we are
introducing the special issue Machine Learning for Health:
Algorithm Auditing & Quality Control in this journal. We
welcome contributions pertaining to methods, tools, reports
or open challenges in ML4H auditing.
Outlook
The materials summarized above bear testimony to the
initial progress that has been made towards the creation of
frameworks for ML4H algorithm auditing and quality con-
trol. Nevertheless, new challenges emerge as we collectively
pull at the complex fabric that ML4H systems are.
From the perspective of technical validation, the iden-
tification of factors which bias or deteriorate algorithmic
performance is often constrained by the absence of relevant
metadata. For example, the measurement device types (and
related acquisition parameters) used to produce the valida-
tion inputs should be available in order to validate if the
model performance is robust under device type changes.
This problem can be alleviated by identifying and routinely
recording this information during data acquisition.
For clinical evaluation, future considerations include
extending and refining the specific requirements related to
how the clinical effectiveness of a tool should be monitored
after implementation of the algorithm and with ongoing
monitoring [59]. This also requires agreement over the clear
and clinically useful procedures to obtain ground truth anno-
tations. It might be necessary to refine the ML algorithm to
the target population, if demographics or clinical character
are different from training settings or if medical guidelines
for diagnostics or treatment have changed [75]. Therefore,
in order for these insights to be effective it is imperative that
auditors exhibit a solid understanding of the training data,
ML algorithm, independent test data and evaluation metrics
specific to the intended use.
A challenge for regulatory assessment is that standardi-
zation organizations, notified bodies and manufacturers
need to efficiently formulate and parse applicable regula-
tory requirements for each individual ML4H tool. Compre-
hensive assessment checklists [45, 51] can help with that
task. However, more support is needed in terms of workflow
management and assisting tools if we consider the limited
time and budgets which professional auditors have at their
disposal. Future regulatory checklists should allow for
interactive selection of use-case specific sub-checklists, an
automated audit report creation, a issue of standard mini-
mum test cases as well as accompanying glossaries and edu-
cation materials for auditors. We also have to ensure that
protocols are in place which translate the audit insights to
actual improvements in the ML4H tool. Managing the risks
presented by the exciting advances of AI in healthcare is a
formidable undertaking, but with collaborative pooling of
expertise and resources we believe we can rise to the task.
Supplementary Information The online version contains supplemen-
tary material available at https:// doi. org/ 10. 1007/ s10916- 021- 01783-y.
Funding Open Access funding enabled and organized by Projekt
DEAL.
Open Access This article is licensed under a Creative Commons Attri-
bution 4.0 International License, which permits use, sharing, adapta-
tion, distribution and reproduction in any medium or format, as long
as you give appropriate credit to the original author(s) and the source,
provide a link to the Creative Commons licence, and indicate if changes
were made. The images or other third party material in this article are
included in the article's Creative Commons licence, unless indicated
otherwise in a credit line to the material. If material is not included in
the article's Creative Commons licence and your intended use is not
permitted by statutory regulation or exceeds the permitted use, you will
need to obtain permission directly from the copyright holder. To view a
copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.
References
1. Hägele, M., Seegerer, P., Lapuschkin, S., Bockmayr, M., Samek,
W., Klauschen, F., Müller, K.-R., and Binder, A. Resolving
challenges in deep learning-based analyses of histopathologi-
cal images using explanation methods. Scientific Reports 10, 1
(2020), 1–12.
2. Zhou, Z., Siddiquee, M. M.R., Tajbakhsh, N., and Liang, J.
Unet++: A nested u-net architecture for medical image segmenta-
tion. In Deep learning in medical image analysis and multimodal
learning for clinical decision support. Springer, 2018, pp.3–11.
3. Bubba, T.A., Kutyniok, G., Lassas, M., März, M., Samek, W.,
Siltanen, S., and Srinivasan, V. Learning the invisible: a hybrid
deep learning-shearlet framework for limited angle computed
tomography. Inverse Problems 35, 6 (2019), 064002.
4. Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L.,
Green, T., Qin, C., Žídek, A., Nelson, A.W., Bridgland, A., etal.
Improved protein structure prediction using potentials from deep
learning. Nature 577, 7792 (2020), 706–710.
5. Wagner, P., Strodthoff, N., Bousseljot, R.-D., Kreiseler, D., Lunze,
F.I., Samek, W., and Schaeffter, T. Ptb-xl, a large publicly availa-
ble electrocardiography dataset. Scientific Data 7, 1 (2020), 1–15.
6. Wu, E., Wu, K., Daneshjou, R., Ouyang, D., Ho, D.E., and Zou,
J. How medical ai devices are evaluated: limitations and recom-
mendations from an analysis of fda approvals. Nature Medicine
27, 4 (2021), 582–584.
Journal of Medical Systems (2021) 45:105
1 3
Page 5 of 8 105
7. Cabitza, F., Campagner, A., and Sconfienza, L.M. As if sand were
stone. new concepts and metrics to probe the ground on which to
build trustable ai. BMC Medical Informatics and Decision Making
20, 1 (2020), 1–21.
8. D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B.,
Beutel, A., Chen, C., Deaton, J., Eisenstein, J., Hoffman, M.D.,
etal. Underspecification presents challenges for credibility in mod-
ern machine learning. arXiv preprintarXiv:2011.03395 (2020).
9. Gilmer, J., Ford, N., Carlini, N., and Cubuk, E. Adversarial
examples are a natural consequence of test error in noise. In
International Conference on Machine Learning (2019), PMLR,
pp.2280–2289.
10. Raji, I.D., Smart, A., White, R.N., Mitchell, M., Gebru, T.,
Hutchinson, B., Smith-Loud, J., Theron, D., and Barnes, P. Clos-
ing the ai accountability gap: defining an end-to-end framework
for internal algorithmic auditing. In Proceedings of the 2020 Con-
ference on Fairness, Accountability, and Transparency (2020),
pp.33–44.
11. Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do imagenet
classifiers generalize to imagenet? In International Conference on
Machine Learning (2019), PMLR, pp.5389–5400.http:// www.
bmj. com/ lookup/ doi/ 10. 1136/ bmj. m3210
12. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D.,
Goodfellow, I., and Fergus, R. Intriguing properties of neural net-
works. arXiv preprint arXiv:1312.6199 (2013).
13. Macdonald, J., März, M., Oala, L., and Samek, W. Interval neural
networks as instability detectors for image reconstructions. In Bild-
verarbeitung für die Medizin 2021 (Wiesbaden, 2021), C.Palm, T.M.
Deserno, H.Handels, A.Maier, K.Maier-Hein, and T.Tolxdorff,
Eds., Springer Fachmedien Wiesbaden, pp.324–329.
14. Oala, L., Heiß, C., Macdonald, J., März, M., Kutyniok, G., and
Samek, W. Detecting failure modes in image reconstructions
with interval neural network uncertainty. International Journal
of Computer Assisted Radiology and Surgery (2021), 1–9. https://
arxiv. org/ abs/ 2003. 11566
15. Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F.,
Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., etal. The
many faces of robustness: A critical analysis of out-of-distribution
generalization. arXiv preprint arXiv: 2006. 16241 (2020).
16. Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and
Schmidt, L. Measuring robustness to natural distribution shifts in
image classification. arXiv preprint arXiv:2007.00644 (2020).
17. Willis, K., and Oala, L. Post-hoc domain adaptation via guided
data homogenization. CoRR abs/2104.03624 (2021).https:// arxiv.
org/ abs/ 2104. 03624
18. Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W.,
and Müller, K.-R. Unmasking clever hans predictors and assessing
what machines really learn. Nature Communications 10, 1 (2019), 1–8.
19. Nalisnick, E., Matsukawa, A., Teh, Y.W., Gorur, D., and
Lakshminarayanan, B. Do deep generative models know what
they dont know? arXiv preprint arXiv: 1810. 09136 (2018).
20. Neves, I., Folgado, D., Santos, S., Barandas, M., Campagner, A.,
Ronzio, L., Cabitza, F., and Gamboa, H. Interpretable heartbeat
classification using local model-agnostic explanations on ecgs.
Computers in Biology and Medicine 133 (2021), 104393.
21. Calderon-Ramirez, S., Yang, S., Moemeni, A., Colreavy-Donnelly, S.,
Elizondo, D.A., Oala, L., Rodríguez-Capitán, J., Jiménez-Navarro, M.,
López-Rubio, E., and Molina-Cabello, M.A. Improving uncertainty
estimation with semi-supervised deep learning for covid-19 detection
using chest x-ray images. IEEE Access 9 (2021), 85442–85454.
22. Guo, C., Pleiss, G., Sun, Y., and Weinberger, K.Q. On calibra-
tion of modern neural networks. In International Conference on
Machine Learning (2017), PMLR, pp.1321–1330.
23. Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X.,
Houlsby, N., Tran, D., and Lucic, M. Revisiting the calibration of
modern neural networks, 2021.
24. Kendall, A., and Gal, Y. What uncertainties do we need in bayesian
deep learning for computer vision? In Advances in Neural Informa-
tion Processing Systems (2017), I.Guyon, U.V. Luxburg, S.Bengio,
H.Wallach, R.Fergus, S.Vishwanathan, and R.Garnett, Eds., vol.30,
Curran Associates, Inc.https:// proce edings. neuri ps. cc/ paper/ 2017/ file/
2650d 6089a 6d640 c5e85 b2b88 265dc 2b- Paper. pdf
25. Roberts, M., Driggs, D., Thorpe, M., Gilbey, J., Yeung, M.,
Ursprung, S., Aviles-Rivero, A.I., Etmann, C., McCague, C.,
Beer, L., etal. Common pitfalls and recommendations for using
machine learning to detect and prognosticate for covid-19 using
chest radiographs and ct scans. Nature Machine Intelligence 3, 3
(2021), 199–217.
26. Heaven, W.D. Googles medical ai was super accurate in a lab.
real life was a different story. | mit technology review. https://
www. techn ology review. com/ 2020/ 04/ 27/ 10006 58/ google-
medic al- ai- accur ate- lab- real- life- clinic- covid- diabe tes- retina-
disea se/. (Accessed on 06/10/2021).
27. Oakden-Rayner, L. Ct scanning is just awful for diagnosing covid-
19 – luke oakden-rayner. https:// lukeo akden rayner. w ordp r ess. com/
2020/ 03/ 23/ ct- scann ing- is- just- awful- for- diagn osing- covid- 19/.
(Accessed on 06/10/2021).
28. Wiegand, T., Krishnamurthy, R., Kuglitsch, M., Lee, N., Pujari,
S., Salathé, M., Wenzel, M., and Xu, S. Who and itu establish
benchmarking process for artificial intelligence in health. The
Lancet 394, 10192 (2019), 9–11.
29. Oala, L., Fehr, J., Gilli, L., Balachandran, P., Leite, A.W., Calderon-
Ramirez, S., Li, D.X., Nobis, G., Alvarado, E. A. M.n., Jaramillo-
Gutierrez, G., Matek, C., Shroff, A., Kherif, F., Sanguinetti, B., and
Wiegand, T. Ml4h auditing: From paper to practice. In Proceedings
of the Machine Learning for Health NeurIPS Workshop (2020),
vol.136, PMLR, pp.280–317.
30. Koshiyama, A., Kazim, E., Treleaven, P., Rai, P., Szpruch, L.,
Pavey, G., Ahamat, G., Leutner, F., Goebel, R., Knight, A., etal.
Towards algorithm auditing: A survey on managing legal, ethical
and technological risks of ai, ml and associated algorithms.
31. Shneiderman, B. Opinion: The dangers of faulty, biased, or mali-
cious algorithms requires independent oversight. Proceedings
of the National Academy of Sciences 113, 48 (2016), 13538–
13540.https:// www. pnas. org/ conte nt/ 113/ 48/ 13538
32. Ryan, J.R. Software product quality assurance. In Proceedings of the
June 7-10, 1982, National Computer Conference (New York, NY,
USA, 1982), AFIPS ’82, Association for Computing Machinery,
p.393–398.https:// doi. org/ 10. 1145/ 15007 74. 15008 23
33. Carlini, N., and Wagner, D. Towards evaluating the robustness of
neural networks. In 2017 ieee symposium on security and privacy
(sp) (2017), IEEE, pp.39–57.
34. Hendrycks, D., and Dietterich, T. Benchmarking neural network
robustness to common corruptions and perturbations. arXiv pre-
print arXiv:1903.12261 (2019).
35. Samek, W., Montavon, G., Lapuschkin, S., Anders, C.J., and
Müller, K.-R. Explaining deep neural networks and beyond: A
review of methods and applications. Proceedings of the IEEE 109,
3 (2021), 247–278.
36. Saleiro, P., Kuester, B., Hinkson, L., London, J., Stevens, A.,
Anisfeld, A., Rodolfa, K.T., and Ghani, R. Aequitas: A bias and
fairness audit toolkit. arXiv preprint arXiv:1811.05577 (2018).
37. Oala, L., Heiß, C., MacDonald, J., März, M., Samek, W., and
Kutyniok, G. Interval neural networks: Uncertainty scores. CoRR
abs/2003.11566 (2020).
38. Balki, I., Amirabadi, A., Levman, J., Martel, A.L., Emersic, Z.,
Meden, B., Garcia-Pedrero, A., Ramirez, S.C., Kong, D., Moody,
A.R., etal. Sample-size determination methodologies for machine
learning in medical imaging research: a systematic review. Cana-
dian Association of Radiologists Journal 70, 4 (2019), 344–353.
39. Mendez, M., Calderon-Ramirez, S., and Tyrrell, P.N. Using clus-
ter analysis to assess the impact of dataset heterogeneity on deep
Advertisement
Journal of Medical Systems (2021) 45:105
1 3
105 Page 6 of 8
convolutional network accuracy: A first glance. In Latin Ameri-
can High Performance Computing Conference (2019), Springer,
pp.307–319.
40. Noseworthy, P.A., Attia, Z.I., Brewer, L.C., Hayes, S.N., Yao,
X., Kapa, S., Friedman, P.A., and Lopez-Jimenez, F. Assessing
and mitigating bias in medical artificial intelligence: the effects
of race and ethnicity on a deep learning model for ecg analysis.
Circulation: Arrhythmia and Electrophysiology 13, 3 (2020),
e007988.
41. Mårtensson, G., Ferreira, D., Granberg, T., Cavallin, L., Oppedal, K.,
Padovani, A., Rektorova, I., Bonanni, L., Pardini, M., Kramberger,
M.G., etal. The reliability of a deep learning model in clinical
out-of-distribution mri data: a multicohort study. Medical Image
Analysis 66 (2020), 101714.
42. Ramírez, S.C., and Oala, L. More than meets the eye: Semi-
supervised learning under non-iid data. CoRR abs/2104.10223
(2021).https:// arxiv. org/ abs/ 2104. 10223
43. Parmar, C., Barry, J.D., Hosny, A., Quackenbush, J., and Aerts,
H.J. Data analysis strategies in medical imaging. Clinical cancer
research 24, 15 (2018), 3492–3499.
44. FG-AI4H. Data and artificial intelligence assessment methods
(daisam) reference. Reference document DEL 7.3 on FG-AI4H
server (2020).https:// extra net. itu. int/ sites/ itu-t/ focus groups/ ai4h/
SiteP ages/ Home. aspx
45. Johner, C., Balachandran, P., Oala, L., Lee, A.Y., WerneckLeite,
A., Murchison, A., Lin, A., Molnar, C., Rumball-Smith, J., Baird,
P., Goldschmidt, P.G., Quartarolo, P., Xu, S., Piechottka, S., and
Hornberger, Z. Good practices for health applications of machine
learning: Considerations for manufacturers and regulators. In ITU/
WHO Focus Group on Artificial Intelligence for Health (FG-AI4H)
- Meeting K (2021), L.Oala, Ed., vol.K, ITU.https:// extra net. itu.
int/ sites/ itu-t/ focus groups/ ai4h/ SiteP ages/ Home. aspx
46. The Supreme Audit Institutions of Finland, Germany, the Nether-
lands, Norway and the UK. Auditing machine learning algorithms.
https:// audit ingal gorit hms. net/, 2020. (Accessed on 07/02/2021).
47. EUROPEAN-COMMISSION. Meddev 2.7/1 revision 4, clinical
evaluation: a guide for manufacturers and notified bodies. https:// ec.
europa. eu/ docsr oom/ docum ents/ 17522/ attac hments/ 1/ trans latio ns/
en/ rendi tions/ native, 2016. (Accessed on 07/01/2021).
48. Sounderajah, V., Ashrafian, H., Aggarwal, R., De Fauw, J., Denniston,
A.K., Greaves, F., Karthikesalingam, A., King, D., Liu, X., Markar,
S.R., McInnes, M.D., Panch, T., Pearson-Stuttard, J., Ting, D.S.,
Golub, R.M., Moher, D., Bossuyt, P.M., and Darzi, A. Developing
specific reporting guidelines for diagnostic accuracy studies assessing
AI interventions: The STARD-AI Steering Group. Nature Medicine
26, 6 (2020), 807–808.https:// doi. org/ 10. 1038/ s41591- 020- 0941-1
49. Liu, X., Cruz Rivera, S., Moher, D., Calvert, M., Denniston, A.K.,
Spirit-ai, T., and Group, C.-a.W. CONSORT-AI extension. Nature
Medicine 26, September (2020), 1364–1374.
50. Rivera, S.C., Liu, X., Chan, A.-W., Denniston, A.K., and Calvert,
M.J. Guidelines for clinical trial protocols for interventions involv-
ing artificial intelligence: the SPIRIT-AI Extension. Bmj 370 (2020),
m3210.
51. Cabitza, F., and Campagner, A. The need to separate the wheat
from the chaff in medical informatics. International Journal of
Medical Informatics (2021), 104510.
52. Hernandez-Boussard, T., Bozkurt, S., Ioannidis, J.P., and Shah,
N.H. Minimar (minimum information for medical ai reporting):
developing reporting standards for artificial intelligence in health
care. Journal of the American Medical Informatics Association
27, 12 (2020), 2011–2015.
53. Schwendicke, F., Singh, T., Lee, J.-H., Gaudin, R., Chaurasia,
A., Wiegand, T., Uribe, S., and Krois, J. Artificial intelligence in
dental research: Checklist for authors, reviewers, readers. Journal
of Dentistry 107 (2021), 103610.https:// www. scien cedir ect. com/
scien ce/ artic le/ pii/ S0300 57122 10003 12
54. Scott, I., Carter, S., and Coiera, E. Clinician checklist for assessing
suitability of machine learning applications in healthcare. BMJ
Health & Care Informatics 28, 1 (2021).
55. Schwendicke, F., Rossi, J., Göstemeyer, G., Elhennawy, K., Cantu,
A., Gaudin, R., Chaurasia, A., Gehrung, S., and Krois, J. Cost-
effectiveness of artificial intelligence for proximal caries detec-
tion. Journal of Dental Research 100, 4 (2021), 369–376.https://
doi. org/ 10. 1177/ 00220 34520 972335. PMID: 33198554.
56. FG-AI4H. Clinical evaluation of ai for health. Reference document
DEL 7.4 on FG-AI4H server (2021).https:// extra net. itu. int/ sites/
itu-t/ focus groups/ ai4h/ SiteP ages/ Home. aspx
57. Kaushal, A., Altman, R., and Langlotz, C. Geographic Distribution
of US Cohorts Used to Train Deep Learning Algorithms. JAMA 324,
12 (09 2020), 1212–1213.https:// doi. org/ 10. 1001/ jama. 2020. 12067
58. Nagendran, M., Chen, Y., Lovejoy, C.A., Gordon, A.C.,
Komorowski, M., Harvey, H., Topol, E.J., Ionnidis, J. P.A.,
Collins, G.S., and Maruthappu, M. Artificial intelligence ver-
sus clinicians: systematic review of design, reporting standards,
and clains of deep learning studies. British Medical Journal 360
(2020), m689.
59. EU. Regulation (eu) 2017/746 of the european parliament and of
the council on medical devices, (2017).https:// eur- lex. europa. eu/
eli/ reg/ 2017/ 745/ oj
60. EU. Regulation (eu) 2017/746 of the european parliament and of
the council on invitro diagnostic medical devices, (2017).https://
eur- lex. europa. eu/ eli/ reg/ 2017/ 746/ oj
61. FDA. Code of federal regulations, title 21 on foods and drugs.https:// www.
ecfr. gov/ cgi- bin/ text- idx? SID= cc748 06513 924f0 197b7 809c8 efbef c8&
mc= true& tpl=/ ecfrb rowse/ Title 21/ 21tab_ 02. tpl
62. IEC. Medical device software – software life cycle processes –
amendment 1 (2015).https:// www. iso. org/ stand ard/ 64686. html
63. IEC. Medical devices – part 1: Application of usability engineer-
ing to medical devices – amendment 1 (2020).https:// www. iso.
org/ stand ard/ 73007. html
64. ISO. Medical devices – application of risk management to medical
devices (2019).https:// www. iso. org/ stand ard/ 72704. html
65. FDA. Fda guidance documents.https:// www. fda. gov/ regul atory-
infor mation/ search- fda- guida nce- docum ents
66. IMDRF. Documents by international medical device regulators
forum.http:// www. imdrf. org/ docum ents/ docum ents. asp
67. AAMI. Techical report (tr) 57 principals for medical device
security - risk management.https:// store. aami. org/s/ store#/ store/
browse/ detail/ a152E 00000 6j60W QAQ
68. EUROPEAN-COMMISSION. Eur-lex - 52021pc0206 - en -
eur-lex. https:// eur- lex. europa. eu/ legal- conte nt/ EN/ TXT/? uri=
CELEX: 52021 PC0206, 2021. (Accessed on 07/01/2021).
69. US-FDA. Aiml\_samd\_action\_plan. https:// www. fda. go v/ media/
145022/ downl oad? utm_ medium= email & utm_ source= govde liv-
ery, 2021. (Accessed on 07/01/2021).
70. Verks, B., and Oala, L. Daisam audit reporting template. In ITU/
WHO Focus Group on Artificial Intelligence for Health (FG-
AI4H) - Meeting J (2020), vol.J, ITU.https:// extra ne t. itu. int/ sites/
itu-t/ focus groups/ ai4h/ SiteP ages/ Home. aspx
71. FG-AI4H. Data sharing practices. Reference document DEL 5.6 on
FG-AI4H server (2021).https:// extra ne t. itu. int/ s ites/ itu-t/ focus groups/
ai4h/ SiteP ages/ Home. aspx
72. Yadav, D., Jain, R., Agrawal, H., Chattopadhyay, P., Singh, T.,
Jain, A., Singh, S., Lee, S., and Batra, D. Evalai: Towards bet-
ter evaluation systems for AI agents. CoRR abs/1902.03570
(2019).http:// arxiv. org/ abs/ 1902. 03570
73. Chen, A., Chow, A., Davidson, A., DCunha, A., Ghodsi, A., Hong,
S.A., Konwinski, A., Mewald, C., Murching, S., Nykodym, T.,
Ogilvie, P., Parkhe, M., Singh, A., Xie, F., Zaharia, M., Zang,
R., Zheng, J., and Zumar, C. Developments in mlflow: A sys-
tem to accelerate the machine learning lifecycle. In Proceedings
of the Fourth International Workshop on Data Management for
Journal of Medical Systems (2021) 45:105
1 3
Page 7 of 8 105
End-to-End Machine Learning (New York, NY, USA, 2020),
DEEM’20, Association for Computing Machinery.https:// doi. org/
10. 1145/ 33995 79. 33998 67
74. FG-AI4H. Model questionnaire. Reference document J-038 on FG-
AI4H server (2020).https:// extra net. itu. int/ sites/ itu-t/ focus groups/
ai4h/ SiteP ages/ Home. aspx
75. Kelly, C.J., Karthikesalingam, A., Suleyman, M., Corrado, G.,
and King, D. Key challenges for delivering clinical impact with
artificial intelligence. BMC Medicine 17 (2019), 195.
76. Hardt, M., and Recht, B. Patterns, predictions, and actions: A
story about machine learning. https:// mlsto ry. org(2021).
Authors and Affiliations
LuisOala1· AndrewG.Murchison2· PradeepBalachandran3· ShrutiChoudhary4· JanaFehr5·
AlixandroWerneckLeite6· PeterG.Goldschmidt7· ChristianJohner8· EloraD.M.Schörverth1· RoseNakasi9·
MartinMeyer10· FedericoCabitza11· PatBaird12· CarolinPrabhu13· EvaWeicken1· XiaoxuanLiu14·
MarkusWenzel1· SteffenVogler15· DarlingtonAkogo16· ShadaAlsalamah17,18· EmreKazim19·
AdrianoKoshiyama19· SvenPiechottka20· SheenaMacpherson21· IanShadforth21· ReginaGeierhofer22·
ChristianMatek23· JoachimKrois24· BrunoSanguinetti25· MatthewArentz26· PavolBielik27·
SaulCalderon‑Ramirez28· AussAbbood29· NicolasLanger30· StefanHaufe31· FerathKherif32· SameerPujari18·
WojciechSamek1· ThomasWiegand1
Andrew G. Murchison
Pradeep Balachandran
Shruti Choudhary
Jana Fehr
Alixandro Werneck Leite
alixandrowernec[email protected]
Peter G. Goldschmidt
pgg@worlddg.com
Christian Johner
christian.johner@johner-institut.de
Elora D. M. Schörverth
elora-dana.schoerver[email protected]er.de
Rose Nakasi
Martin Meyer
martin.mm.me[email protected]
Federico Cabitza
Pat Baird
Carolin Prabhu
cap@riksrevisjonen.no
Eva Weicken
eva.weick[email protected]er.de
Xiaoxuan Liu
Markus Wenzel
Steffen Vogler
steffen.vogler@bayer.com
Darlington Akogo
Shada Alsalamah
Emre Kazim
Adriano Koshiyama
Sven Piechottka
sven@openregulatory.com
Sheena Macpherson
sheena.macpherson@miotify.co.uk
Ian Shadforth
ian.shadforth@miotify.co.uk
Regina Geierhofer
geierhofer@cocir.org
Christian Matek
christian.matek@helmholtz-muenchen.de
Joachim Krois
Bruno Sanguinetti
bruno.sanguinetti@dotphoton.com
Matthew Arentz
marentz@uw.edu
Pavol Bielik
Saul Calderon-Ramirez
sacalderon@itcr.ac.cr
Auss Abbood
Nicolas Langer
n.langer@psychologie.uzh
Stefan Haufe
Advertisement
Journal of Medical Systems (2021) 45:105
1 3
105 Page 8 of 8
Ferath Kherif
ferath.kherif@chuv.ch
Sameer Pujari
Wojciech Samek
Thomas Wiegand
1 Fraunhofer HHI, Berlin, Germany
2 Oxford University Hospitals NHS Foundation Trust, Oxford,
UnitedKingdom
3 Technical Consultant (Digital Health), Thiruvananthapuram,
India
4 University ofOxford, Oxford, UnitedKingdom
5 Hasso-Plattner-Institute ofDigital Engineering, Potsdam,
Germany
6 Machine Learning Laboratory inFinance andOrganizations,
Universidade de Brasília, Brasília, Brazil
7 World Development Group Inc, Bethesda, MD, USA
8 Johner Institute, Konstanz, Germany
9 Makerere University, Kampala, Uganda
10 Siemens Healthineers, Erlangen, Germany
11 University ofMilano-Bicocca, Milan, Italy
12 Philips, NewKensington, USA
13 Office oftheAuditor General ofNorway, Oslo, Norway
14 University Hospitals Birmingham NHS Foundation Trust &
Academic Unit ofOphthalmology, Institute ofInflammation
andAgeing, College ofMedical andDental Sciences,
University ofBirmingham, Birmingham, UnitedKingdom
15 Bayer AG, Berlin, Germany
16 minoHealth AI Labs, Accra, Ghana
17 Information Systems Department, College ofComputer
andInformation Sciences, King Saud University, Riyadh,
SaudiArabia
18 Digital Health andInnovation Department, Science Division,
World Health Organization, Winterthur, Switzerland
19 University College London, London, UnitedKingdom
20 Open Regulatory, Bonn, Germany
21 MIOTIFY LTD, London, UnitedKingdom
22 IEC TC62 andSiemens Healthineers, Erlangen, Germany
23 Helmholtz Zentrum München, Neuherberg, Germany
24 Oral Diagnostics Digital Health Health Services Research,
Charité-Universitätsmedizin, Berlin, Germany
25 Dotphoton AG, Zug, Switzerland
26 Department ofGlobal Health, University ofWashington,
Washington, USA
27 LatticeFlow & ETH Zurich, Zürich, Switzerland
28 De Montfort University & Instituto Tecnologico de Costa
Rica, Cartago, CostaRica
29 Robert Koch Institut, Berlin, Germany
30 Department ofPsychology, University ofZurich, Zürich,
Switzerland
31 Technische Universität Berlin, Berlin, Germany
32 Laboratory forResearch inNeuroimaging, Department
ofClinical Neuroscience, Lausanne University Hospital
andUniversity ofLausanne, Lausanne, Switzerland