Machine Learning Approximation of Bayesian
Inference in Nuclear Fusion Research
vorgelegt von
M.Sc.
Andrea Pavone
ORCID: 0000-0003-2398-966X
von der Fakultät II – Mathematik und Naturwissenschaften
der Technischen Universität Berlin
zur Erlangung des akademischen Grades
Doktor der Naturwissenschaften
— Dr. rer. nat. —
genehmigte Dissertation
Promotionsausschuss:
Vorsitzender: Prof. Dr. Holger Stark
Gutachter: Prof. Dr. Dieter Breitschwerdt
Gutachter: Prof. Dr. Robert Wolf
Gutachter: Prof. Dr. Eric Sonnendrücker
Tag der wissenschaftlichen Aussprache: 05 Oktober 2020
Berlin, 2020
Verily, verily, I say unto you, except a
corn of wheat fall into the ground and
die, it abideth alone: but if it die, it brin-
geth forth much fruit.
John 12:24-26
Abstract
Machine learning algorithms, especially neural networks, are commonly used
to automatically find and encode unknown relationships between sets of re-
lated data. They have been applied to problems in different fields, including the
interpretation of data in scientific experiments. In this thesis, I consider the case
where the relationship between a set of observed data and modeling parameters
is known and formalized in a Bayesian model, and I show that neural networks
can be trained to learn the relationship between the quantities defined by the
model joint distribution. In this way, the network constitutes an approxima-
tion of the original Bayesian model. Two cases are considered: the case where
the network is trained to approximate the mapping between the observable
quantities and the free parameters, and between the joint space of both observ-
ables and free parameters, and the joint probability distribution value. Neural
network approximate models can be particularly useful in the context of large
scientific experiments, like the nuclear fusion experiments considered here. The
trained network can be used to accelerate the Bayesian inference process car-
ried out on experimental data by providing a fast approximate reconstruction
of relevant quantities. Here, I consider the application of the network to the
reconstruction of different plasma parameters from measured diagnostic data
collected at two large fusion experiments, the Wendelstein 7-X stellarator and
the Joint European Torus tokamak. When analysing experimental data, it is cru-
cial to provide information about the reliability of the network reconstruction
for further usages of the results. For this reason, I show here how uncertainties
can be estimated with two different approaches, both based on a Bayesian in-
terpretation of the training process. The approximation method developed here
is general in the sense that its implementation and validity is not bound to a
specific model or experiment. Especially, if the Bayesian models to be approx-
imated are implemented within the Minerva Bayesian modeling framework [1],
their joint distributions can be easily accessed and used to generate training
samples in a manner that is common among different models. This opens the
possibility to an entirely automatic procedure, where machine learning models
5
Abstract
are automatically generated and trained to approximate any scientific model,
constituting a step further towards the automation of general scientific infer-
ence.
6
Zusammenfassung
Algorithmen für Maschinelles Lernen, insbesondere neuronale Netze, werden
üblicherweise verwendet, um unbekannte Beziehungen zwischen verwandten
Datensätzen automatisch zu finden und zu beschreiben. Diese wurden bereits
in verschiedenen Bereichen auf geeignete Probleme angewendet, einschließlich
der Interpretation von Daten in wissenschaftlichen Experimenten. In dieser Ar-
beit betrachte ich eine Anwendung, in der die Beziehung zwischen einem Satz
beobachteter Daten und Modellierungsparametern bekannt und in einem Bayes-
schen Modell formalisiert ist. Ich zeige, dass neuronale Netze trainiert werden
können, um die Beziehung zwischen den durch das Modell definierten gemein-
samen Wahrscheinlichkeitsverteilungen zu lernen. Auf diese Weise stellt das
Netzwerk eine Approximation an das ursprüngliche Bayessche Modell dar. Es
werden zwei Fälle betrachtet: einerseits der Fall, in dem das Netzwerk trainiert
wird, um die Abbildung zwischen den beobachtbaren Größen und den freien Pa-
rametern, sowie der Fall, die Abbildung zwischen dem gemeinsamen Raum von
beobachtbaren und freien Parametern und dem Wert der gemeinsamen Wahr-
scheinlichkeitsverteilung zu approximieren. Approximationen durch neuronale
Netze können besonders im Zusammenhang mit großen wissenschaftlichen
Experimenten, wie den hier betrachteten Kernfusionsexperimenten, nützlich
sein. Das trainierte Netzwerk kann verwendet werden, um den Bayesschen
Deduktionsprozess zu beschleunigen, der auf experimentelle Daten angewen-
det wird, indem eine schnelle approximative Rekonstruktion relevanter Grö-
ßen bereitgestellt wird. Hier betrachte ich die Anwendung des Netzwerks auf
die Rekonstruktion verschiedener Plasmaparameter aus gemessenen Diagnos-
tikdaten zweier großer Fusionsexperimente, dem Stellarator Wendelstein 7-X
und dem Tokamak Joint European Torus. Bei der Analyse experimenteller Da-
ten ist es wichtig, Informationen über die Zuverlässigkeit der Rekonstruktion
durch das Netzwerk bereitzustellen, um die Ergebnisse anderweitig nutzen
zu können. Aus diesem Grund zeige ich in dieser Arbeit, wie Unsicherheiten
mit zwei verschiedenen Ansätzen abgeschätzt werden können, die beide auf
einer Bayesschen Interpretation des Trainingsprozesses basieren. Die hier ent-
7
Zusammenfassung
wickelte Approximationsmethode ist allgemeingültig in dem Sinne, dass ihre
Implementierung und Gültigkeit nicht an ein bestimmtes Modell oder Experi-
ment gebunden ist. Insbesondere wenn die zu approximierenden Bayesschen
Modelle innerhalb des Bayesschen Modellierungsframeworks Minerva [1] im-
plementiert sind, können ihre gemeinsamen Wahrscheinlichkeitsverteilungen
leicht berechnet und verwendet werden, um Trainingsmuster auf eine Weise zu
generieren, die für verschiedene Modelle identisch ist. Dies eröffnet die Mög-
lichkeit eines vollautomatischen Verfahrens, bei dem auf maschinellem Lernen
basierende Modelle automatisch generiert und trainiert werden, um beliebige
wissenschaftliche Modelle zu approximieren. Dies ist ein weiterer Schritt in
Richtung automatischer Bearbeitung wissenschaftlicher Fragestellungen.
8
Contents
Abstract 5
Zusammenfassung 7
1. Introduction 11
2. Nuclear fusion 17
3. Bayesian inference and modeling 25
3.1. Bayes theorem . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2. Bayesian modeling in the Minerva framework . . . . . . . . . . 28
4. Approximate Bayesian inference with neural networks 33
4.1. Neural networks . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.2. The approximation framework . . . . . . . . . . . . . . . . . . . 38
5. Application to nuclear fusion research 43
5.1. Inference of temperature profiles at W7-X . . . . . . . . . . . . 44
5.2. Inference of density profiles at JET . . . . . . . . . . . . . . . . 46
5.3. Inference of Zeff at W7-X . . . . . . . . . . . . . . . . . . . . . . 48
5.4. Learning the model joint distribution . . . . . . . . . . . . . . . 49
5.4.1. Bayesian bremsstrahlung model . . . . . . . . . . . . . . 50
5.4.2. Network specifications . . . . . . . . . . . . . . . . . . . 53
5.4.3. Network evaluation . . . . . . . . . . . . . . . . . . . . 54
5.4.4. A brief summary . . . . . . . . . . . . . . . . . . . . . . 57
6. Publications 59
6.1. Article I . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
6.2. Article II . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
6.3. Article III . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.4. Article IV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
9
Chapter 1.
Introduction
The work presented in this thesis aims at contributing to the quest of the automa-
tion of science. In recent years, the breakthroughs of modern AI systems have
rightfully made the news, and unprecedented effort is being made in the aca-
demic community and the commercial sector towards the realization of general
intelligent automation. Many applications have been developed which show
how machines can autonomously learn and perform at close to human level in
different tasks. For example, in the context of natural language processing, deep
learning based models are able to predict the next word in a sentence or answer
questions with unprecedented accuracy, as it is shown in [2, 3]; in the context
of strategy-based video games, machines are able to defeat human players after
learning rules and strategy without supervision, from scratch [4]; in the field
of robotic, a human-like robot hand has learnt the movements and strategy ne-
cessary to solve th Rubik’s Cube using only simulated data [5]. Less resonant in
the public sphere, are the applications to scientific research. The work presen-
ted here is not, by itself, a solution to the attempt of realizing autonomous,
intelligent, general scientific inference. If the problem will ever be solved, and I
believe it will, it is hard to predict now what form the solution, or multiple solu-
tions, will have, but I and many other scientists believe that it will be built upon
a number of tools and ideas, some of them taken from the fields of computer
science and probability theory, and being known by a long time. Specifically,
concerning the work in this thesis, I refer to the subjects of machine learn-
ing with its most recent developments, and Bayesian probability theory, which
was first introduced in the eighteenth century. Moreover, computer science
has shown us that in order to effectively manipulate information and know-
ledge, it is fundamental to pay attention to how the knowledge is represented.
The point of the representation of knowledge is seldom mentioned explicitly in
11
Chapter 1. Introduction
the publications that constitute the core of this thesis, partially because it is
taken for granted and partially because the publications tend to have a rather
technical focus. Nevertheless, it is highly relevant to this work and I wish to
clarify why this is the case and make it evident here and in the later chapters.
Besides being taken for granted in the publications presented here, I have come
to realize that it is also often overlooked by the very users of such knowledge.
The reason why this happens, I believe, is that once a kind of knowledge is
well represented, then its use becomes immediate, smooth, as a second nature
or as speaking one’s native language. Consequently, the system making the
representation possible is, by its nature, invisible, and acts unnoticed, although
steadily, permeating the space where all subsequent actions take place.
When I started this work, such system was given to me in the form of a com-
putational framework for the representation of complex models in the language
of Bayesian graphical models, and all the work I have made has happened in
the context of this framework, named Minerva [1]. Having had the chance to
do my research under the supervision of its main author, I could get to know
and understand the vision behind it. The quest that motivated its realization is
the one mentioned at the opening of this introduction: clearly, automation of
science is a very broad expression that requires specifications. I refer here, spe-
cifically, to scientific inference. Scientific inference is based on forward models:
these are models of real world phenomena that are used to make predictions
about what can be observed or measured, and whose outcomes are compared
to what is observed in order to refine our knowledge of the world. Such models
are made of a set of assumptions and beliefs - taking the form, for example, of
physics relations - about how the world works, and when they are confronted to
what is observed, they can be changed or updated accordingly. They represent
our expectations and understanding of a system. For example, in the context of
the work of this thesis, a forward model might be the description of a plasma
in terms of parameters relevant to achieve nuclear fusion, like its temperature
and density. Given the value of these parameters, the model may consist of a
calculation, based on physics relations, which outputs the value of a quantity
measured during an experiment. Once an experiment is performed, then, we
have the possibility to compare the expected outcome with the observations,
and find the value of the parameters that consistently describe the measure-
ments. This allows us to refine our knowledge about the system or, more in
general, the world, in a way that is consistent with what is observed. The re-
fining of the model assumptions or parameters can be formalized and made
12
systematic in the framework of Bayesian inference. Given a set of assumptions
and a forward model, Bayes formula provides a rule to update hypothesis ac-
cording to the results of comparing observations with model predictions: the
formalization of the process that is employed in rigorous scientific discovery.
The idea of exploiting probability theory as a system to reason in science is
famously described in [6] and [7]. I believe that part of the strength of Bayesian
probability theory comes from the fact that hypothesis and assumptions are
made explicit in the forward model before taking into account the observations:
it is this modeling act, through the explicit writing down of hypothesis and
beliefs, that makes the entire process repeatable and objective, as science is
expected to be, and that uncovers our subjectivity when interpreting the out-
comes of an experiment. We tend to agree that scientific knowledge should be
objective and universally valid; if this is the case, the process of scientific dis-
covery itself is meant to be independent of those who perform it, and therefore
can be made automatic. I believe such automation can be achieved through the
use of Bayesian inference and forward models in representing and confronting
with reality. The framework of Bayesian inference is a very general one, as
it does not require assumptions on the models involved: they can belong to
physics, chemistry, social sciences, or any other field, as long as they can make
predictions from clearly stated assumptions. Therefore, what if we could find a
common language which allows to express any scientific model in a way that
is in accordance with the requirements of Bayes theorem, and can be under-
stood by computers, so that all processes of scientific inference and knowledge
acquisition could be performed autonomously? This is the ambitious task that,
before I started this work, the people I have worked with, have undertaken. The
proposed solution took the form of the computational framework that I men-
tioned above, named as the Minerva framework. A more detailed explanation
of how it works is given in the following chapters. Here it is enough to know
that it allows to generalize the writing of scientific models: typically they are
composed of different modules, or nodes, and they can be used to do Bayesian
inference when observations are available. As a start, the framework has been
used extensively at different nuclear fusion experiments (see for example [8–
12]).
In this context, the usage of forward models is often computationally ex-
pensive because they require performing complex calculations and when they
are used to do inference, they are often used in an iterative scheme involving
up to millions of iterations. This is where machine learning, and specifically
13
Chapter 1. Introduction
neural network models, comes into play. Neural networks can process data at
very short time scales, down to tens of microseconds. In this thesis, I show
that they can be trained as an approximation, or surrogate, of a Bayesian model
implemented within the Minerva framework. Then, they can provide a fast
approximation of the Bayesian inference process, drastically reducing data ana-
lysis time. Besides this advantage, another crucial point of this work is the
following: given a common framework where different models are expressed
in a common way, the creation of approximate neural network models can be
easily extended to any model implemented within the framework. This is relev-
ant because it allows the approach to scale very easily, having the possibility
to fully automate the procedure. The work presented in this thesis constitutes
a step in this direction.
I will describe the application of this method to two different fusion exper-
iments. In the context of the first one, the Wendelstein 7-X (W7-X) stellarator
(see chapter 2), I have considered the system of an X-ray spectrometer [8, 13, 14]:
it measures X-rays emitted in the plasma volume and collected along several
lines of sight. The emission contains information about plasma parameters as
the temperature of the ions and electrons, which are macroscopic quantities
relevant for describing the state of a fusion plasma. The forward modeling of
this system represents a complex problem: given the values of plasma paramet-
ers, the emission at each point along each line of sight is calculated from an
atomic physics model, and then integrated along the line in order to predict
the expected measurement. The evaluation of this model for the inference of
possible values of plasma parameters with the conventional Bayesian inference
approach takes up to tens of minutes for a single time measurement. The meas-
uring device can record data with a frequency of few kHz, experiments currently
can last up to tens of seconds and, in the future, up to tens of minutes, and dur-
ing a campaign hundreds to thousands of experiments are run: this amounts to
a volume of data that is hard to evaluate in its entirety with the conventional
approach. Such system, therefore, clearly would benefit of an acceleration in
the evaluation of the data: this can be provided by a neural network surrogate
of the conventional Bayesian inference in an automatic way, by training the
network only on data generated with the corresponding Bayesian model. No
experimental data, or any other external data are required for the training. This
is a crucial first achievement in view of automation, because it shows that the
approach only relies on the Bayesian model and, therefore, can be seamlessly
applied to any other model implemented within the Minerva framework. With
14
the use of neural networks, the data evaluation time can be drastically reduced
to hundreds or tens of microseconds.
At the Joint European Torus (JET) experiment, instead, I have considered the
case of a different spectroscopy system, the Li-beam diagnostic [15]: here, the
emission stimulated by injection in the plasma of lithium atoms can be related
to the density of plasma electrons, another useful parameter in the description
of the plasma. Also in this case, the modeling within Minerva consists of cal-
culating the expected radiation at each point along different lines of sight with
a multi-state atomic physics model [10, 16]. The evaluation of this model in
the conventional Bayesian inference scheme requires again tens of minutes for
single time measurements. Thanks to the fact that also this forward model was
implemented within the Minerva framework, the method used at W7-X could
be straightforwardly implemented for this case, generating a neural network ap-
proximation of the traditional Bayesian inference. Without almost any further
effort in the implementation, thanks to the common language of the Minerva
models and the formalism of Bayesian inference, we could make neural network
replicas of existing Bayesian inference in an automatic fashion for two different
systems. One has to be clear: full automation was not entirely achieved at this
stage, some aspects of the work still required human intervention, but this was
a first step in the right direction. I believe, that, in the future, the procedure can
be fully automated.
As it is now evident, different subjects are involved in this work. In the next
chapters, I will introduce the three main ones: in chapter 2 I will introduce
nuclear fusion, in chapter 3 I will introduce Bayesian inference and modeling
in the Minerva framework, and in chapter 4 I will introduce the basic concepts
of neural networks and expand on the core idea of neural network as approx-
imate Bayesian inference. Chapter 5 contains a description of the applications
of the neural network surrogates to nuclear fusion experiments, and chapter 6
is constituted of publications in peer reviewed journals containing the details
of the aforementioned applications.
15
Chapter 2.
Nuclear fusion
The quest for safe and clean sources of energy is arguably one of the topics
that simultaneously concern many parts of today’s society, from policy makers,
to entrepreneurs, environmental activists, politicians and the wide public in
general. Nuclear fusion could provide a solution to the continuously increasing
energy demand and the request for a form of energy production that does not
produce green house gases, while at the same time having favorable safety
properties. Similar to fission, nuclear fusion produces energy from the nuclear
interaction of atoms, and it features a similarly large amount of specific energy,
i.e. amount of energy released per mass of fuel burnt. Moreover, the fuel required
is abundant on Earth.
Nuclear fusion is the process that powers the Sun. In the Sun, it involves two
protons and produces, at its first step, a deuteron, a positron and a neutrino.
Despite its very low cross-section, it occurs in the Sun thanks to the very large
pressure. On Earth, most of the effort has been focused on the more likely deu-
terium - tritium reaction, having the highest reaction rate of all fusion reactions,
which produces an alpha particle, a neutron and releases ≈17.6 MeV:
D + T 4He + n + 17.6 MeV (1)
For this reaction to occur, the Coulomb repulsive force between the two
positive charged nuclei of deuterium and tritium has to be overcome. This is
achieved by increasing the kinetic energy of the reactants by heating them up
to the temperature of the order of
≈
10
8∘C
, larger than temperature in the core
of the Sun. At such temperatures, matter is found in the state of an ionized gas
known as plasma. For example, for a plasma made of a 50:50 mix of deuterium
and tritium, the optimal temperature for the fusion reaction to occur would be
kbT≈
20
keV
(i.e., the maximum of the reactivity is found at such temperature),
17
Chapter 2. Nuclear fusion
with a particle density of
≈
10
20
m
−3
[17]. Because of the large thermal energy,
the confinement of the particles is extremely difficult, and it constitutes, indeed,
one of the major issues in today’s nuclear fusion research. Plasma particles, as
charged particles, have the advantage that their motion can be influenced by
the electromagnetic force. This is why many experimental fusion devices rely
on magnetic fields to confine the plasma in specific shapes within a vacuum
chamber.
The two products of the reaction play an important role in a fusion reactor: the
neutron, not being charged, can escape the magnetic confinement and its kin-
etic energy can be used to generate electric energy; the alpha particle, instead,
is confined and through collisions can heat the plasma, providing enough self-
heating to cover losses by conduction or radiation. By setting up a power bal-
ance between the energy produced by fusion reactions and lost due to particles
escaping the systems, it is possible to derive an expression for what is the ideal
minimum value that certain plasma parameters should have to sustain the pro-
duction of energy. This is known as the triple product:
nTτE≥3⋅1021 keVs/m3(2.1)
where
n
is the plasma density,
T
is the plasma temperature, and
τE
is the energy
confinement time, the characteristic time at which the system looses energy
to the external environment -
τE=W/Pheat
, where
W
is the thermal energy
density of the plasma (units of energy per volume), and
Pheat
is the power needed
to keep the plasma at the desired temperature and compensate losses. From the
formulation of this criteria, we notice that two important plasma parameters are
the plasma temperature and density. Since the plasma is made of a neutral gas
ionized into free electrons and ions, it is useful to distinguish between the values
that these quantities have for the two different species: the electron and ion
temperature,
Te
and
Ti
respectively, and the electron and ion density,
ne
and
ni
. A
plasma is described as a quasi-neutral system where
ne=∑iZini
with
Zi
atomic
number of ion species
i
, and
ni
corresponding density. Therefore, for a pure
hydrogen plasma (or its isotopes):
ne=ni
. The electron and ion temperature
instead can be quite different, depending on the heating mechanisms and the
collisional coupling of the two species. Although what is directly relevant for
fusion is the temperature of the ions, in a burning D-T plasma, the fast
α
-particle
population resulting from the fusion reaction predominantly heat the electrons,
which, in turn, transfer their energy to the ion population. Moreover, there
exist external heating mechanisms, such as the Electron Cyclotron Resonance
18
Heating (ECRH), that act in a similar way, preferably energizing electrons in the
first step. Therefore, both parameters are widely used and of crucial relevance
for any nuclear fusion experiment. For this reason, these three parameters are
at the center of most of the publications attached to this thesis: article I in
section 6.1 ([18]) shows how a neural network can be trained to speed-up the
inference of ion and electron temperature from measurements collected with
an X-ray imaging spectrometer at the W7-X stellarator, drastically reducing the
analysis time and opening the possibility of real time applications; article II in
section 6.2 ([19]) concerns the estimation of error bars in the neural network
reconstruction of
Te
and
Ti
; article IV in section 6.4 ([20]) concerns, instead,
the case of fast, approximated neural network inference of the electron density
from measurements performed with a lithium beam spectrometer at the JET
tokamak.
When a plasma is operated within a reactor or an experimental device, it is
never entirely free of impurities: the bulk of fusing ions is often contaminated by
ion species released by the interaction of the plasma with surrounding materials.
For example, at the W7-X stellarator, carbon ions are often found to contaminate
the plasma because the divertor, a crucial component situated at the edge of the
machine and used to collect the high heat loads in a sustainable and controlled
manner, is made of graphite tiles. An important application of impurities is for
the control of the radiative power to limit the heat flux to the divertor. The line
emission caused by the presence of a controlled amount of impurities in certain
regions of the plasma can help protecting the divertor and other plasma facing
components. The presence of impurities can also be detrimental for the overall
performance if above certain limits: high-Z impurities can cause energy to get
lost by radiation, whereas low-Z impurities can dilute the fusion fuel. Therefore,
during an experiment, it is often desirable to be able to measure and monitor
the amount of impurity ions. A quantity related to their concentration is known
as plasma effective charge Zeff:
Zeff =∑iniZ2
i
∑iniZi
(2.2)
where
i
is an index labeling the
i
-th ion species in the plasma,
ni
its correspond-
ing density and
Zi
the atomic number.
Zeff
can also be derived from experimental
measurement of bremsstrahlung emission, since this kind of emission is directly
proportional to the effective charge. This kind of measurement also requires
ne
19
Chapter 2. Nuclear fusion
and
Te
to be known. Article III in section 6.3 concerns precisely this case:
Zeff
is
inferred automatically with a Bayesian model within the Minerva framework,
from data measured with a spectrometer collecting bremsstrahlung in the vis-
ible wavelength range at W7-X. According to the model, the bremsstrahlung
emission can be calculated starting from
ne
,
Te
and
Zeff
: if the first two are
given, the third one can be inferred from the measurements. The Minerva infer-
ence runs in two stages: in the first one, the electron density and temperature
is inferred from independent measurements with a Thomson scattering dia-
gnostic [21]; in the second one, these parameters are used in a model which
allows the inference of Zeff from measured bremsstrahlung emission data.
The effort to achieve a good magnetic confinement of the plasma fuel resulted
in the development of two main fusion device concepts. Both rely on a toroidal
geometry, but differ from each other in the features of the magnetic field. They
are known as tokamak and stellarator. Figure 2.1a and 2.1b show the W7-X
stellarator [22] and the JET tokamak, respectively.
Each design has its own advantages and disadvantages. In order to confine
the plasma in a toroidal geometry, the magnetic field has two components: one
is oriented toroidally, following the large circular ring around the torus, encom-
passing the central axis; one poloidally, following the small circular ring around
the surface. Figure 2.2 shows toroidal and poloidal directions in a torus. The res-
ulting magnetic field is therefore twisted. In the tokamak, the toroidal compon-
ent is generated by the external planar coils, whereas the poloidal component is
generated by the plasma itself, carrying an electric current induced by a central
transformer coil. The magnetic field in result is axi-symmetric. The advantage
of this concept is that the planar geometry of the coils makes it relatively simple
from an engineering point of view; on the other hand, the presence of a plasma
current makes the plasma sensitive to a kind of instability which can lead to ab-
rupt termination of the discharges, and the use of the central transformer poses
a limitation to the duration of the pulses. In a stellarator, instead, the twisted
magnetic field is generated entirely by the set of external coils, without recur-
ring to a plasma current. In this way, stellarator discharges do not suffer from
the same kind of instabilities of the tokamak, and, in principle, they can sustain
long, steady-state plasmas. By generating an helical magnetic field with external
coils, though, the toroidal symmetry that was present in tokamaks is broken.
This has important consequences for plasma confinement: because of the com-
plex, 3D geometry of the magnetic field, particle orbits are such that transport
phenomena can lead to loss of particles. Nevertheless, for certain shapes and
20
(a)
The Wendelstein 7-X stellarator. In red and orange are shown the superconduct-
ive non-planar and planar coils, respectively. They are operated in a low-pressure
cryostat volume.
(b)
The Joint European Torus tokamak. The set of planar coils is shown in green and
the central transformer is visible in blue.
Figure 2.1.:
The tokamak and the stellarator are the two main design concepts
for a magnetic confinement fusion device.
21
Chapter 2. Nuclear fusion
Figure 2.2.: Sketch of the toroidal and poloidal direction in a torus.
3D configurations of the magnetic field, this particle loss can be reduced [23–
25]. When a stellarator is designed in such a way, it is said to be optimized.
This is, indeed, the case of the W7-X stellarator. The optimization is carried
out with computer codes and the coil shape resulting from it can be strongly
irregular and, therefore, complex to realize from an engineering point of view.
Quantity Unit Value
Plasma volume m230
Major radius m5.5
Minor radius m0.5
Magnetic field T2.5
Table 2.1.:
The main parameters of
W7-X.
The Wendelstein 7-X stellarator was built
with the mission to demonstrate the feasib-
ility of steady-state pulses, where the deu-
terium or hydrogren plasma is confined for
times up to
≈
1800 s and certain plasma
parameters are achieved and maintained for
such long times. The experiment also aims
at demonstrating the success of the optim-
ization strategies. It started operation in
2015, and since then, two experimental cam-
paigns were run, where high performance
plasmas were sustained for up to
≈
30 s. The main parameters of W7-X are
summarized in table 2.1. In a hydrogen plasma discharge, a record for the
triple product in a stellarator was achieved
nTτE=
6
.
4
⋅
10
19 keVsm−3
[26].
The JET tokamak was built with the objective of conducting experiments with
plasmas in conditions approaching those expected in a thermo-nuclear fu-
sion reactor. The main parameters of the machine are summarized in table
2.2 [27]. So far, it is one of the two machines in the world where experiments
with a plasma made of a D/T mixture were conducted - the other one being
the Tokamak Fusion Test Reactor (TFTR) in the United States [28]. The ex-
22
perience and knowledge acquired through many years of experiments at JET
are therefore crucial for the development of future experimental devices. The
triple product measured in one of the D-T discharges (pulse no. 26148) was:
nTτE=3.8⋅1020 keVsm−3[29].
Quantity Unit Value
Plasma volume m2100
Major radius m2.96
Minor radius m1.25
Magnetic field T3.45
Plasma current MA 3.2 - 4.8
Table 2.2.:
The main parameters of
JET.
We have seen that two important para-
meters in a fusion device are the dens-
ity and temperature. Many diagnostic
devices are devoted to the measurement
of these parameters in W7-X and JET. In-
deed, quantities like the density and tem-
perature of the plasma electrons and ions
are the key basic quantities required by
more complex physics models and codes
that are used to understand and predict
the behavior of the plasma. It follows that
it is crucial to be able to measure them ac-
curately, and for this reason a lot of effort is put in the realization and operation
of the diagnostic systems. These parameters, though, are not measured directly,
but they are rather inferred from the direct measurements of other quantit-
ies, as, for example, spectroscopic emission. Therefore, being able to perform
an accurate measurement is as important as being able to accurately infer the
parameters from the measurements. Since no inference method is exact, it is
necessary to be able to quantify the uncertainties of the inferred quantities.
Bayesian inference provides the proper framework for handling uncertainties.
It is particularly useful in this context because often signals measured by many
different diagnostics carry information relevant to a common parameter, and
Bayesian inference offers a method to account for different sources of uncer-
tainties coming from different systems in a consistent, unified way. For this
reason, most of the work presented in this thesis relies on Bayesian modeling
for the inference of these parameters from diagnostic measurements.
23
Chapter 3.
Bayesian inference and modeling
Bayesian inference provides a formal method to make conclusions about a
model hypothesis in light of real world observations. In this section, I will
provide a short overview of Bayes theorem and how it is used in the Minerva
modeling framework to handle inference in complex systems.
3.1. Bayes theorem
The process of Bayesian inference can be built from two constituents: the first
one is the modeling of the problem and the definition of a priori hypothesis,
together with the probability with which we believe the hypotheses and the
model to be correct. The second one is the actual process of inference, the update
of our knowledge about the hypotheses to incorporate the information brought
in by new data as they are observed. The process is formally expressed by Bayes
formula:
p(H|D)=p(D|H)p(H)
p(D)=p(D,H)
p(D)(3.1)
where
p(H)
denotes the a priori probability on the hypothesis
H
,
p(D|H)
is the
probability to observe the data
D
according to our model of the process, and is
called the likelihood of
H
, and
p(H|D)
is the updated posterior probability of
the hypothesis
H
given the observation of the data
D
. The term
p(D,H)
is the
joint probability distribution of the data and the hypothesis; we will refer to it
also as the joint distribution of the model. The denominator term
p(D)
is called
the evidence, and can be calculated as a normalization factor, integrating over
25
Chapter 3. Bayesian inference and modeling
the free parameters in the numerator:
p(D)=�p(D|H)p(H)dH (3.2)
When modeling a physics system the probability
p(H)
can be the probability
that some physics quantities assume certain values, and they are known as the
free parameters of the model, and the relation between
D
and
H
often can be
expressed in a functional form, known as the forward model:
f=f(H)(3.3)
The forward model allows us to generate predictions, and Bayes formula al-
lows us to conduct a comparison to experimental measurement and update
our prior beliefs about the free parameters accordingly. The likelihood function
p(D|H)
is centered on the model predictions. For example, assuming a Gaussian
distribution, it can be written as:
p(D|H)=𝒩(μ=f(H),σ;D)(3.4)
so that the Normal distribution is defined over the data space D, centered on the
predictions made with the forward model
f(H)
, with uncertainties given by
σ
. The uncertainties, thus, are uncertainties in the prediction of the model and
reflect the inability to model all aspects of a given experiment [30]. In Bayesian
inference, measured data carry no ’error’: they are given, and all sources of
uncertainties are placed on the model. Under the traditional view, instead, there
is the underlying assumption that data are generated in a random process, and
we make use of probability theory to make estimates about the parameters
underlying that process.
It is important to notice that a Bayesian model is defined not only by the
choice of the free parameters
H
and forward function
f(H)
, but also by the
choice of their corresponding distributions: the prior
p(H)
and the likelihood
p(D|H)
. Therefore, two models differing only for the choice of the distributions
should be regarded as two different models.
Considering now the case where a set of observables
D1, ..., Dn
are collected
through different measurements in order to infer one common parameter, as it
is often the case in physics experiments, we can see that Bayes formula offers a
way to make use of the information consistently. The posterior for the parameter
26
3.1. Bayes theorem
Hcan be written as:
p(H|D1, ..., Dn)=p(D1, ..., Dn|H)p(H)
p(D1, ..., Dn)(3.5)
and if the measurement processes are independent
p(D1, ..., Dn)=∏ip(Di)
and
p(D1, ..., Dn|H)=∏ip(Di|H)
. Each term
Di
is associated to some forward model
function
fi(H)
, which allows to predict the corresponding observed data. There-
fore, once models and assumptions are defined, estimating the uncertainties of
a parameter from different sources of measurements occurs automatically by
using Bayes formula. On the other hand, in conventional statistics approaches,
it is not possible to assign a probability to an hypothesis and use the rules of
probability theory to find the solution of an inference problem. Therefore, dif-
ferent kind of tests and methods, often known as the statistics and estimators,
needs to be developed in order to relate observed data to unknown parameters
and find corresponding uncertainties. This has led to different kind of schools,
each one suggesting a different method or function to be used in the estimation
of these quantities. This lack of clear underlying first principles is probably the
main shortcoming of conventional statistics approaches, especially if it is to
be applied in the context of complex heterogeneous systems. For this reason,
Bayesian inference is particularly suitable when trying to solve the problem of
scientific modeling and inference in complex systems, like experiments where
different sources of measurements have to be taken into account simultaneously
and in a consistent manner. A good example is provided by nuclear fusion ex-
periments. In this field, Bayesian inference has been applied to the inference
of plasma parameters from diagnostic data, as in the case of a Thomson scat-
tering diagnostic at the W7-AS stellarator [31], a X-ray imaging diagnostic at
the W7-X stellarator [8], a Lithium beam spectroscopy diagnostic at the JET
tokamak [10], an electron cyclotron emission diagnostic at W7-X [9], the joint
analysis of several profile diagnostics at ASDEX Upgrade [32] and W7-X [11, 12],
and several tomographical problems from diagnostic measurements according
to the maximum entropy formulation [33].
One concrete example of the advantage of performing inference with a for-
ward model is given by the X-ray imaging diagnostic considered in article I
(see chapter 6.1, reference [18]): this spectrometer observes impurity line emis-
sion along several lines of sight crossing the plasma volume. The measured
spectra can then be used to infer the electron and ion temperature, impurity
density and plasma rotation [8, 14]. In order to model the expected line of sight
27
Chapter 3. Bayesian inference and modeling
integrated spectrum, the line emission is calculated at each point along the line
of sight, taking into account its shape as a Voigt profile (the convolution of a
Gaussian and Lorentzian shape), Doppler broadening and Doppler shift. When
emission is integrated along the line, the resulting line shape can be different
from a Gaussian due to the Doppler shift being different at different locations.
In the conventional analysis [34], instead, the Doppler shift is found by fitting
a Gaussian to the measured spectrum, which therefore relies on the inaccur-
ate hypothesis that the observed line shape can be interpreted as a Gaussian.
This kind of inaccuracy is easily avoided when the expected measurement is
modeled through a forward model of the process.
3.2. Bayesian modeling in the Minerva
framework
The Minerva framework has been developed with the idea of solving the prob-
lem of complexity arising from the handling of many sources of information
and models in cases like large fusion experiments [1]. As we have seen, Bayesian
inference offers the proper mathematical and probabilistic context to tackle the
problem, and that is why Minerva is a framework for Bayesian modeling. It
makes use of Bayesian graphical models, adopted from the computer science
field [35]. A simplified sketch of the Minerva graph for a model of a system
constituted, for example, of two generic different fusion diagnostics, with two
free parameters
m1
and
m2
, two forward functions
f1
and
f2
and observables
D1
and D2is shown in figure 3.1.
The data measured with both diagnostics carry information on parameter
m1
and can be used to infer it; moreover, data measured with diagnostic 2 can
be used to make inference on parameter
m2
. The colored nodes represent ran-
dom variables, the white nodes represent deterministic, computational nodes.
Such graphical representation is a way to visualize probabilistic relationships
between random variables and is used to factorize the joint distribution of all
graph variables according to the product rule of probability. The probability for
each random variable can be written as a conditional probability conditioned
only on the parent variables, i.e. the variables that directly or indirectly are
linked to it through the arrows. In this case, the joint distribution of the graph
28
3.2. Bayesian modeling in the Minerva framework
m1m2
f1f2
D1D2
Figure 3.1.:
Sketch example of a Minerva model graph for a systems consti-
tuted of two generic fusion diagnostics, modeled with free parameters
m1
and m2, forward functions f1and f2, and observables D1and D2. Color nodes
are probabilistic nodes representing random variables, white nodes repres-
ent deterministic, calculation nodes. Arrows express conditional probabilistic
relationships between the variables: P(D2|m1,m2)and P(D1|m1)
.
can be written as:
p(m1,m2,D1,D2)=p(m1)p(m2)p(D1|m1)p(D2|m1,m2)(3.6)
Therefore, according to Bayes formula, the posterior distribution of the free
parameters is:
p(m1,m2|D1,D2)∝p(m1)p(m2)p(D1|m1)p(D2|m1,m2)(3.7)
As long as we are not interested in the exact values of the posterior distribution,
the normalization factor at the denominator of Bayes formula can be neglected
in most inference problems and this relation of proportionality is all that is
needed. For example, the problem of finding the most likely values of
m1
and
m2
is an optimization problem that can be approached by finding the maximum
of the right-hand side of equation 3.7 term with an optimization algorithm
such as the Hooke and Jeeves pattern search [36]. The values found are known
as the Maximum A Posteriori (MAP). The full answer in Bayesian inference
29
Chapter 3. Bayesian inference and modeling
is anyway provided by the posterior distribution
p(m1,m2|D1,D2)
. Often, in
practical applications, the functional form of the distribution is not needed
to know the uncertainties of the free parameters. The uncertainties can be
retrieved by looking at the distribution of samples drawn from the posterior,
which can be obtained, for example, with a Markov Chain Monte Carlo (MCMC)
based algorithm.
Although Bayesian inference offers a mathematically sound framework for
performing scientific inference, its practical implementations often suffer from
the burden of being computationally intensive. This is particularly evident in
the case of the studies presented in article I (see chapter 6.1 and reference [18])
and IV (see chapter 6.4 and reference [20]). Article I concerns the inference of
ion and electron temperature profiles from spectroscopic measurements of the
X-rays emitted when the plasma interacts with impurity ions during a plasma
shot at the W7-X stellarator. The forward modeling of this measurement proced-
ure requires the calculation of the emission in each point along the collecting
lines of sight crossing the plasma volume. This model is then used in iterative
schemes in order to find the maximum of the posterior distribution of the de-
sired parameters, or in order to sample from it with an MCMC algorithm. These
schemes require iterating over such calculations hundreds or, in the case of
MCMC, even million of times, making the inference process slow for complex
models. Typically, the analysis time for a single measurement takes up to tens
of minutes for a MAP estimate. In the context of large experiments, where tens
of thousands of such measurements are collected during a single plasma shot,
the conventional Bayesian inference can require long time scale to be carried
out in a exhaustive manner. Something similar happens in the case of the study
presented in Article IV. Here, spectroscopic measurements of light emitted in
the interaction between plasma electrons and injected lithium atoms are used
to infer the electron density at the edge of the JET tokamak. Once again, the
forward model of this process involves complex calculations: the emission is
calculated by solving the differential equation of a multi-state model at each
point of the plasma volume that is observed through the collecting lines of sight.
The model is then used in the inference process within iterative MAP or MCMC
algorithms. Also in this case, finding the MAP solution requires tens of minutes
for a single measurement. Analyzing the whole amount of data collected at
several years of experiments carried out at JET would be very computational
expensive. In both cases, we could accelerate the inference procedure by us-
ing machine learning algorithms, especially neural networks trained on the
30
3.2. Bayesian modeling in the Minerva framework
information encoded in the Bayesian forward model. Neural networks allowed
us to reduce the time to hundreds of microseconds for a single measurement,
an improvement of several order of magnitude, which also opens the possibility
of real time applications. Moreover, by developing this method in the context
of the Minerva framework, we addressed another issue: namely, the fact that
at each fusion experiment, several tens of different measurement devices col-
lect data simultaneously during experiments. Each one, when modeled within
the framework, can take advantage of such acceleration in an automatic man-
ner. The method that allows us to train neural networks as approximations of
Bayesian inference is in fact applicable to general Bayesian models. In the next
chapter, we shall introduce and discuss the principles that we have originally
developed in order to train neural networks a surrogate Bayesian models, and
that we have called the approximation framework.
31
Chapter 4.
Approximate Bayesian inference
with neural networks
Bayesian probability offers a theoretically clean and unified framework to per-
form inference. Real world implementations are less clean, suffering from the
computational limitations of the chosen algorithms (Markov Chain Monte Carlo
(MCMC), Maximum a Posteriori (MAP)). As already mentioned, one limitation
is the computation time required to carry out the inference. The idea behind
the work of this thesis is that this limitation can be overcome by using neural
network models trained to approximate the full Bayesian inference, and to use
them for fast, automatic computation. A network trained in this way can be
called a surrogate model. In this chapter, I am going to outline the scheme that
allows to train neural network surrogates to approximate an existing Bayesian
model. For example, the network can be trained to approximate the inference of
the MAP solution. In this case, the network function is trained to approximate
the inverse function
f−1(H)
, which maps the model predictions to the free para-
meter values maximizing the posterior. The network can then be evaluated on
data measured at an experiment to infer the free parameter values in very short
time scale: the data processing speed depends on the actual implementation
of the network algorithm. Article I in chapter 6.1 shows an application where
a convolutional neural network was run on a single GPU and could process a
single measurement in
≈
10 microseconds. The network was trained to predict
relevant plasma parameters as ion and electron temperature from measured
X-ray spectra. Another application is shown in article IV, where it was used
to predict the electron density from spectra measured at the edge of the JET
tokamak. It is important to notice that, although the network training can be
very time consuming (several hours in most cases), it needs to be carried out
33
Chapter 4. Approximate Bayesian inference with neural networks
only once. The network can also be trained to approximate the forward function
f(H)
, in this way learning to reconstruct the model predictions given the free
parameters. Afterwards, it can be used within one of the iterative schemes for
the inference, as the MAP optimization, or the MCMC sampling. This option is
not explored in this thesis and it is one of the aims of future works. Another
possibility is to train the network to approximate the model joint distribution of
the free parameters and observations
p(H,D)
. In this case, the network is given
as input both the values of the free parameters and experimental measurements
and learns to map them to the value of the joint distribution. Again, such net-
work could be used to accelerate a MCMC sampling scheme. This possibility is
investigated later in section 5.4, extending the work presented in article III (see
chapter 6.3).
As it will be evident from the content of the next sections, the training data are
generated from an existing Bayesian model with exactly the same procedure for
all the training cases mentioned above. What differs in each case is the variables
used during training as input and targets. All considered cases are instances of
regression tasks for which a modeling function is already known.
I would like to emphasize that the training scheme was developed especially
with one principle in mind: that it should be as general as possible, in the
sense that it should require the least ad hoc manipulation for it to work in
different contexts. From the theoretical point of view, this was possible because
we developed the methods within the framework of Bayesian inference: as we
have seen in chapter 3, it offers unified principles to perform scientific inference
in general. This is achieved by first describing the scientific problem in terms of
a Bayesian model - forward model and uncertainties quantified with probability
distributions are its essential constituents; secondly, by applying the rules of
probability theory in the form of Bayes formula in order to perform inference.
All of this does not depend on the specific nature of the problem - e.g., physics,
psychology, medicine experiments. From a practical point of view, generality
was achieved by working within the Minerva Bayesian modeling framework,
which offers a common language to write scientific models as computer codes
and perform Bayesian inference with them. Independently of what they are
models of, they are objects whose underlying probability distributions can be
accessed in a single way. As we shall see, in order to train the network, we will
only make use of the probability distributions defined by the models, so that the
overall method is generally applicable to any scientific inference problem. The
publications attached at the end of chapter 6 aim at demonstrating that this was
34
4.1. Neural networks
achieved, by showing that the method was successfully applied to two different
physics systems of different fusion experiments. This aspect of the research is
important because it allows the network surrogates to be used in a system for
automatic scientific inference in complex systems of different nature.
A short remark on notation: in this chapter I am going to use bold symbols
to denote vector variables, and plain text for scalar variables.
4.1. Neural networks
A neural network model represents a function fNof the kind:
y=fN(x,w)(4.1)
where
x
denotes the network input vector and
w
denotes the network adaptable
weights or parameters, and I have used
y
to denote the output vector. The func-
tion
fN
is used to approximated an unknown function
f
through a composition
of adaptive basis functions. The architecture of the network describes how the
composition is performed. One of the simplest architecture is the multi layer
perceptron (MLP) shown in figure 4.1. It was first introduced by Rosenblatt in
the the late 50’s in [37] as a model for the cognitive abilities of the brain, and
still today it is at the foundations of more complex architectures.
Each circle is a unit. Each arrow represents a weight of the network. Each
row of connecting arrows is called a layer. The units at the bottom represent
elements of an input vector with four components. The units at the top represent
the elements of a two dimensional output vector. The units in between are
called hidden units. This kind of architecture represents a feedforward neural
network, because the edges are directed from the input towards the output,
without feedback loops. Each hidden unit stands for the application of a non-
linear activation function
ϕ
to a linear combination of the output and weights
of the previous layer. The overall network function is found by recursively
applying these functions layer by layer until the output is reached. For the MLP
in the figure, the i-th component of the output vector can be written as:
yl=∑
k
w(3)
kl ϕ(2)
k(∑
j
w(2)
jk ϕ(1)
j(∑
i
w(1)
ij xi)) (4.2)
where
ϕ(l)
i
denotes the activation function of unit
i
applied to the linear com-
bination of the weights
w(l)
ij (l)
of layer
(l)
. The activation function is usually
35
Chapter 4. Approximate Bayesian inference with neural networks
x1x2x3x4
y1y2
Figure 4.1.:
The architecture of the multi layer perceptron (MLP), a feedforward
neural network, with one hidden layer of weights.
the hyperbolic tangent, the logistic function, or the rectified linear unit, defined
as
ReLU(x)=max(
0
,x)
. Hyperbolic tangent and logistic functions belong to a
class of functions, known as sigmoid functions, which were originally chosen
by inspiration from the pattern of activation of the neurons in the human brain.
More recently, the ReLU function was found to be a better model from the biolo-
gical point of view [38], and it became particularly popular in the deep learning
community because it improved neural network training [39]. We have chosen
to use the ReLU activation function in our application because of its recognized
widespread success in many different problems. During training, the weights of
the network are adapted to optimize a loss function
E(x,w)
, that for regression
problems is often chosen to be the mean square error between the network
output and the training targets t:
E(x,w)=1
2N∑
n
∑
k(yk(xn;w)−tn
k)2(4.3)
where
n
is now an index labeling each of the
N
samples in the training data
set, and
k
labels the
k-th
component of the output and target vectors. The
choice of this error function can be motivated by the principle of maximum
likelihood [40] where the distribution of the target data conditioned on the
36
4.1. Neural networks
input is given by a Gaussian distribution:
p(tk|x)=N(μ=yk(x;w);σ)(4.4)
with the mean given by the network output and some standard deviation
σ
.
The overall likelihood function for the training set is
ℒ=∏np(tn|xn)p(xn)
. The
error function is then found to be
E= −ln(ℒ)
by neglecting all terms which do
not depend on the weights
w
, as it is the case for the term
p(xn)
. It can also be
shown [40] that the error function in equation 4.3 can be written as:
E=1
2∑
k�{yk(x;w)−⟨tk|x⟩}2p(x)dx
+1
2∑
k�{⟨t2
k|x⟩−⟨tk|x⟩2}p(x)dx
(4.5)
and therefore its minimization occurs when the first term on the right-hand
side is zero, since the second term does not depend on the network weights. We
can then write:
yk(x;w∗)=⟨tk|x⟩=�tkp(tk|x)dtk(4.6)
where
w∗
denotes the weight vector found with the minimization. This equation
says that the network function is given by the conditional average of the target
data, where the conditioning is on the input vector. The second term can be
written as:
1
2∑
k�σ2
k(x)p(x)dx(4.7)
where σ2
kdenotes the variance of the target data:
σ2
k(x)=⟨t2
k|x⟩−⟨tk|x⟩2
=�{tk−⟨tk|x⟩2}p(tk|x)dtk
(4.8)
which says that the residual error of the training is given by the variance of the
target data at the given input vector
x
. The results in equation 4.6 and 4.8 will
be useful in the next section, where I will show how the trained network can
be seen as an approximation of a Bayesian model.
37
Chapter 4. Approximate Bayesian inference with neural networks
I believe it is worth to credit here the software libraries that allowed me to
implement machine learning models as efficient computer codes. Nowadays,
many framework are available. Since the start of my work in 2016, I have been
using manly the Theano framework [41] (for the work in article I, section 6.1,
and II, section 6.2), and the TensorFlow framework [42] (for the work in article
IV, section 6.4) once the Theano development team announced suspending
maintaining it.
4.2. The approximation framework
In this section, I am going to describe the novel method that we have developed
to approximate Bayesian inference with neural networks. We have called the
set of principles allowing to generate the approximate network model the ap-
proximation framework. I will start by describing the algorithm used to generate
the training data. It is important to realize that the training set is generated
entirely from the Bayesian model, without making use of external data, e.g. ex-
perimental data. This aspect is relevant from the point of view of the automation
of the procedure - the Bayesian model is the only requirement, and in this way
the procedure is applicable, in general, to any model. Also, we guarantee that
the network is trained on the same assumptions that the original model is based
on. The algorithm is quite simple: given a Bayesian model
𝒢
with free paramet-
ers
m
and observables
d
, then the joint distribution is
p(m,d)=p(d|m)p(m)
,
where
p(d|m)
is a chosen likelihood function, and
p(m)
the prior distribution.
Bayes formula for model 𝒢is then:
p(m|d)=p(d|m)p(m)
p(d)(4.9)
The training data are generated by drawing
N
samples of the variables
(m,d)
from the joint distribution
p(m,d)
. Note, that by sampling from
p(m,d)
, we
sample both from
p(m)
and
p(d|m)
, which implies that the observables
d
are
noisy samples of the forward model prediction, where the features of the noise
are specified by the likelihood function. The network then can be trained to
learn the mapping from one set of variables to the other:
f∶m→d
g∶d→m
h∶(m,d)→p(m,d)(4.10)
38
4.2. The approximation framework
the mapping denoted by
f
corresponds to the forward model function of the
model,
g
corresponds to its inverse, and
h
corresponds to the mapping from
the joint space of the free parameters and observables
(m,d)
to the joint dis-
tribution of the model
p(m,d)
. I will describe now in details only the case of
the learning of mapping
g
, which is the one considered in articles I, II and IV
in section 6.1, 6.2, 6.4, respectively. The case of mapping
h
will be considered in
section 5.4. The case of mapping
f
is not considered in this thesis, and it will be
object of future work.
In the case of mapping
g
, the network is trained to reconstruct the paramet-
ers of a physics model, given the observed quantities. Therefore, during training
the input to the network are the samples of the observable quantities
d
, and
the targets are the free parameter samples
m
; at evaluation time, the input are
experimental measurements and the output the reconstructed parameters. For
example, in a nuclear fusion experiment, this could be the case of inferring
some key plasma profiles, such as plasma electron or ion temperature and dens-
ity profiles, given some diagnostic measurements, for example spectroscopic
emission (as it is the case in article I and IV). To see how the trained network
relates to the Bayesian model, we can make use of the result shown in equation
4.6, which says that the network mapping is given by the conditional average
of the target data, conditioned on the input. Since the training data have been
obtained by sampling from the model, the target data
tk
in equation 4.6 corres-
pond to the sampled free parameters
mk
, and the input data
x
correspond to
the sampled observable quantities d. Therefore we have:
yk(d;w∗)=⟨mk|d⟩=�mkp(mk|d)dmk(4.11)
where we notice that the conditional distribution
p(mk|d)
corresponds to the
posterior of the Bayesian model
𝒢
in equation 4.9. Under the conditions for
which the relation in equation 4.6 is valid, we can say that the trained network
learns a mapping that is given by the conditional average of the true posterior
of the Bayesian model. These conditions are satisfied only in the ideal case of
networks with number of weights and training data samples approaching infin-
ity, and optimal minimization of the error function of equation 4.3; nevertheless,
the relation in equation 4.11 provides an interpretation of the network function
in relation to the original posterior distribution of the model
𝒢
. A similar ar-
gument holds for the residual error of the training, which was formulated in
terms of the variance of the target data in equation 4.8, and now can be written
39
Chapter 4. Approximate Bayesian inference with neural networks
as:
σ2
k(x)=�{mk−⟨mk|d⟩2}p(mk|d)dmk(4.12)
which says that the residual error is given by the variance of the true posterior
distribution of the model 𝒢.
Another insight is provided by considering the evidence term
p(d)
in Bayes
formula of equation 4.9. The training data are samples from the joint distri-
bution
p(m;d)
of the model
𝒢
. If
d
is fixed, by considering the distribution
of the free parameters for the given data
d
, we obtain the posterior distribu-
tion
p(m|d)
(except for the normalization factor
p(d)
). Similarly, by looking
at the samples
d
letting
m
vary, we realize that they constitute samples from
the evidence
p(d)
. In the limit of a large training data set, the distribution of
the observable variables
d
is the evidence
p(d)
of the model
𝒢
. Therefore, the
evidence could in principle be estimated from the training samples. This can
have different interesting implications, since the evidence is involved in higher
order applications of Bayesian inference, for example for the problem of model
selection. I will not investigate this possibility further, since it does not fall
in the scope of the thesis. Instead, I will describe shortly another implication
directly related to the neural network training, which has been investigated in
article I in section 6.1. The distribution
p(d)
, being obtained by integration over
the prior distribution, represents the probability that all data possibly described
and predicted by model
𝒢
assume certain values. Therefore, for a novel meas-
ured data point
d∗
having a low probability under the evidence, i.e.
p(d∗)≪
1,
we can say that the model
𝒢
does not provide an adequate description. This can
happen because of an inadequacy in the prior distributions, the forward model
function or the likelihood function. The evidence, therefore, provides useful
information about when a different model should be used to better describe the
experimental measurements. This is particular relevant to the case of this study,
since all the data used for training the network are generated ’blindly’ through a
Bayesian model. One common cause for a network performing well on training
data and poorly on experimental data is, indeed, a poor model description of
the measurements, which leads to the network being used to make predictions
in a region of the input space not covered during training. Features of the model
can then be changed accordingly by, for example, tuning the parameters con-
trolling the prior distributions. It is important to notice that this sort of tuning,
if carried out properly, does not generate an overly complex, specific model,
40
4.2. The approximation framework
because the evidence term inherently penalizes for higher complexity through
an Occam razor principle [43]. In this way, a model
𝒢∗
can be found, such that
p𝒢∗(d∗)>p𝒢(d∗)
and it can be used to generate new training data from its joint
distribution
p𝒢∗(m,d)
with the same scheme used for
𝒢
. In conclusion, given
the training data, it is in principle possible to reconstruct the evidence, or an
approximation of it, through any of the existing machine learning methods for
probability density estimations, and use it to assess the quality of the training
data in order to analyze, or prevent, network failures. In article I, a simpler
method which does not rely on the reconstruction of the probability density,
but makes use of a distance-based algorithm, is used to effectively improve the
quality of the training data.
41
Chapter 5.
Application to nuclear fusion
research
The approximation framework formulated in the previous chapter has been used
in the context of nuclear fusion experiments. I would like to remark, though, that
such formulation is not restricted to a specific system: indeed, no assumption
was made on the internal features of the model or the data involved. The only re-
quirement for a network to be trained in this way is the existence of a Bayesian
model. This highlights one key strength of this approach, mentioned already
in the introduction: it can be used to automatically generate neural network
approximations of Bayesian models and inference. I have tried to demonstrate
this by applying it to different measuring systems that are operated at different
nuclear fusion experiments. In particular, I have taken advantage of the im-
plementation of the models in the Minerva framework, which, by abstracting
away their details, has simplified the replication of the training data genera-
tion procedure, ultimately opening up the possibility of full automation. Neural
networks have been used in nuclear fusion research starting already in the
90’s with applications involving the automatic analysis of JET charge exchange
spectra [44], and the usage of synthetic data together with experimental data
in the training procedure [45]. Real time applications were also considered, as
shown in [46]. More recently, neural networks have been used at the Wendel-
stein 7-X stellarator with the purpose of protecting plasma facing components
by providing a reconstruction of magnetic configuration properties from heat
load patterns [47, 48], at the JET tokamak for the acceleration of tomographic
reconstruction [49], and also as approximations of transport models [50–52].
In this chapter, I will introduce the work that constitutes the main result
of this doctoral thesis, and which is fully elaborated in the published articles
43
Chapter 5. Application to nuclear fusion research
available in chapter 6. Section 5.1 concerns publications I and II, section 5.2
concerns publication IV and section 5.3 concerns publication III. These sections
are meant to provide a general overview of the work that is described with all
details in the corresponding publications. Because of the fact that it features
more details, section 5.4, instead, might look out of place to the reader. The
reason for being like this is that it provides an extension of the work presented
in publication 5.3, which has not yet been published in a scientific journal. In
fact, the reader might notice that I have tried to structure the content of this
section in a way that is similar to publications I and IV.
In general, the applications considered here concern one of the mappings
appearing in equation 4.10. Section 5.1 and 5.2 concern mapping
g
, where the
network is trained on the inference of a Bayesian model free parameters
m
from
observable data
d
. Section 5.3 and corresponding publication III themselves do
not directly concern one of the mappings, rather they describe the Bayesian
model and the experimental measurements which are used in section 5.4 to
train a network on mapping
h
, corresponding to the inference of the model
joint probability distribution
p(d,m)
from the joint space
(d,m)
. As already
mentioned elsewhere, the learning of mapping
f
, corresponding to the recon-
struction of model prediction from free parameter values, is not considered in
this thesis, and it is left for future investigations.
5.1. Inference of ion and electron temperature
profiles at W7-X
The first application I have considered is the inference of ion and electron tem-
perature profiles from the measurement of spectroscopic data from an X-Ray
imaging crystal spectrometer (XICS) diagnostic [14] at W7-X (see article I, sec-
tion 6.1). The X-rays are emitted in the atomic physics interaction between
argon ions (
Ar16+
), which are injected as neutral gas, and plasma electrons. The
light is collected across several lines of sight, diffracted by a quartz crystal and
imaged on a 2D detector with energy resolution in one direction and spatial
resolution in the other. The lines of sight span covers a large part of the plasma
cross section, so that the observed spectra carry information about the plasma
parameters in the entire region. Figure 5.1a shows a sketch of the line of sight
span on the bean shaped cross section, and figure 5.1b shows a measured raw
44
5.1. Inference of temperature profiles at W7-X
5 5.5 6 6.5
−0.5
0
0.5
1
R in m
Z in m
(a)
Energy resolution
Spatial resolution
(b)
Figure 5.1.:
(a) Sketch of the XICS system view on the bean shaped cross section
at W7-X. The span of the lines of sight is represented by the colored area. (b)
A raw image measured on the detector. The bright curved features denote the
observed line emission.
image. The bright curved features in the image represent the observed line emis-
sion. From the broadening of the recorded line emission it is possible to infer
ion temperature across the plasma, whereas from the ratio between different
line intensities it is possible to infer electron temperature profiles. A neural net-
work was trained with data generated by sampling from the joint distribution
of a pre-existing Minerva Bayesian model of the diagnostic [13], and evaluated
on measurements collected during the first experimental campaign at W7-X.
The model allows to predict the measured spectrum along several lines of sight
by calculating the line emission occurring in the atomic processes involving
ionized argon impurities appositely released in the plasma, and the plasma
species. The neural network prediction also includes uncertainties, calculated
within a Bayesian framework for the network training, known as Bayesian
neural networks [53]. It relies on the Laplace approximation of the network
weight posterior, and it is extended here to account for the presence of noise
in the input data and multiple minima in the training procedure. In order to
achieve this, multiple networks were trained with different, random starting
values for the weights, and the overall prediction was obtained with a Monte
Carlo sampling scheme occurring both in input and weight space of different
networks. This allowed to relax some of the simplifying assumptions required
by the conventional Bayesian neural network framework. Article II in section
45
Chapter 5. Application to nuclear fusion research
6.2 describes this procedure in full detail. The profiles inferred with the network
from the measurements were compared to those inferred with the conventional
Bayesian inference in a restricted number of cases. The comparison, although
limited to the small amount of available data at the time of the start of the
new W7-X experiment, showed in general good agreement between the two
methods. In order to assess the quality of the training data as a description
of the experimental measurements, a k-nearest neighbor algorithm was used
to estimate the distance between a measured data point and the training data
points. Under the assumption that the distance is inversely proportional to the
probability, this calculation can be used to estimate how likely a given measure-
ment is under the distribution of the training data. This method proved useful
in identifying the reason behind the failed reconstruction of the profiles by
the network, leading to the choice of a different Bayesian model in which the
hyper-parameters of the prior distributions were varied so to generate a train-
ing set that better described the expected measurements. When evaluated on a
GPU, the network could reconstruct a profile for a single measured data point
in tens of microseconds, whereas the inference of the most likely profile with
the Minerva model requires tens of minutes on a CPU.
5.2. Inference of edge electron density profiles
at JET
In the second application, I have considered an entirely different problem: a dif-
ferent physics systems of a different diagnostic at the JET tokamak. The aim was
to test the generality and validate the results of the approximation framework
and training scheme. The network was trained to infer edge electron density
profiles from measurements of a Lithium-beam spectroscopy diagnostic [15] at
JET. Lithium atoms are injected vertically in the machine, and as they penetrate
the plasma volume, different excited states are populated by the interaction
with the plasma species. The lithium line emission collected by a spectrometer
comes from the de-excitation of the first excited state into the ground state. The
light is emitted at different vertical spatial locations, and its intensity depends
on the electron density at those locations. Figure 5.2 shows a sketch of the sys-
tem, where the orange line encloses the plasma volume and, on the right, the
spectrometer is represented by a box. Article IV in section 6.4 describes the
46
5.2. Inference of density profiles at JET
Lithium beam
spectrometer
Figure 5.2.:
Sketch of the lithium beam system at JET. The beam is injected
vertically from the top of the machine and the emission is collected across
multiple lines of sight by a spectrometer. The black contour represents the
machine boundary, the orange one encloses the plasma volume.
work in full detail. The training of the network occurred in the usual way: the
training data were generated sampling from the joint distribution of an existing
Bayesian model of the diagnostic implemented within the Minerva framework.
In this case, the spectral measurements were described by a multi-state atomic
model, which allows to relate the intensities of the lines emitted by Lithium
atoms through electronic excitation and spontaneous emission to the density
of the plasma electrons [10]. The availability of a larger amount of measured
data collected at the experiments through many years of JET operation allowed
for a more extensive comparison with the conventional Bayesian inference. Ap-
proximately sixty experiments carried out in a wide range of plasma conditions
were considered. The comparison was made between the experimental meas-
urements and the observations reconstructed with the network and Minerva
inferred profiles. In general, the Minerva inferred profiles could predict the
measurements with a lower error, nevertheless the error in the network predic-
tion was consistently below 20% across the entire data set. Also in this case, the
network inference provided full uncertainties. In contrast to the W7-X applica-
tion, the uncertainties were calculated within a Bayesian framework relying on
variational inference and the state-of-the-art deep learning technique known as
dropout, which allows to approximate the network weight posterior [54]. This
method has the advantage of requiring little extra computations besides that
of a standard forward pass through the network, and therefore is suitable for
47
Chapter 5. Application to nuclear fusion research
real-time applications.
5.3. Inference of the effective ion charge and
electron density and temperature profiles at
W7-X
The third application involved the development of the automatic Bayesian in-
ference of the plasma effective ion charge
Zeff
from measurements of visual
bremsstrahlung at W7-X. This application is described in article III in section
6.3 [55].
Zeff
is a quantity which gives an indication of the level of impurity con-
tamination of the plasma. This can be harmful in two ways: a too large amount
of low-Z impurities can lead to fuel dilution, whereas a too large amount of
high-Z impurities can lead to high radiation losses. On the other hand, in a
reactor, a controlled amount of impurities is desired in certain regions of the
plasma, because the radiation loss that they cause can help distributing the
heat coming from the plasma more homogeneously over the plasma-facing
components. Therefore, being able to measure and control impurities is cru-
cial. The measurements were collected with a single line-of-sight spectrometer
during several experiments. The spectrometer collects light in a broad visible
range, approximately from 300 nm to 1000 nm. In this wavelength region sev-
eral lines are present together with the bremsstrahlung dominated background.
The bremsstrahlung emission comes from the interaction between the plasma
electrons and the ions present in the plasma. For example, one of the impurity
frequently contaminating the plasma in W7-X is C
6+
, as graphite is the material
of some in-vessel components. The bremsstrahlung emission
V(λ)
at a certain
wavelength λdirectly depends on Zeff according to the following equation:
V(λ)=gff(Zeff,Te,λ)n2
eZeff
√kbTe
exp �−hc
λkbTe�1
λ2(5.1)
where
gff(Zeff,Te,λ)
is the free-free Gaunt factor modeled according to [56],
and the remaining symbols are used in the conventional way referring to the
respective physics constants in SI units. Therefore, given the observed spectra,
it is possible to infer its value averaged along the line of sight. The inference is
based on a model which allows to predict the expected bremsstrahlung emission
in a given wavelength range from independent measurements of the electron
48
5.4. Learning the model joint distribution
density
ne
and temperature
Te
profiles. In this case, these measurements were
provided by a Thomson scattering diagnostic [21] and the profiles were inferred
with a Bayesian model as well. This model makes use of Gaussian processes
to model the profile function and inference was performed under the frame-
work of Bayesian model selection in order to find the optimal trade-off between
complexity of the function and a good fit to the data. This is accomplished by
comparing different models through the evidence term in the denominator of
Bayes formula. It can be shown, indeed, that it embodies an Occam razor prin-
ciple [57]. The inference of the effective ion charge, and the required electron
temperature and density profiles, was running routinely after each plasma dis-
charge. The inferred
Zeff
values were compared to a preliminary measurement
of the carbon impurity concentration from a charge exchange recombination
spectroscopy diagnostic [58, 59], showing encouraging consistency.
5.4. Learning the model joint distribution
In the two applications described in sections 5.1 and 5.2, the network was trained
to learn the mapping from model observable data
d
to free parameters
m
. This
corresponds to the mapping
g
in equation 4.10. Here, I would like to consider
a different case: the mapping from the model joint space
(d,m)
to the logar-
ithm of model joint probability
log p(d,m)
, corresponding to mapping
h
in
equation 4
.
10, taking as a starting point the work presented in the previous
section and article III. The model considered here is the one developed for the
inference of
Zeff
from line integrated measurements of visible bremsstrahlung
emission. As usual, the method we employ to train the network is general and
valid for other Bayesian models. The reader interested in understanding how
the measurements are performed, and further details about the physics mod-
els is recommended to first read article III. We will take for granted most of
the information provided in the article and here extend the discussion to the
learning of the model joint distribution. The goal is to train a neural network
so that, given as input values of
Zeff
, electron density
ne
and electron temperat-
ure
Te
independently measured with a Thomson scattering diagnostic [21], and
bremsstrahlung measurement
Vλ
, the network is able to predict the logarithm
of the model joint distribution values
log �p(Zeff,Vλ)�
. The logarithmic value is
used because of the wide range of values typically covered by
p(Zeff,Vλ)
. The
training is carried out exclusively on data generated with the Minerva Bayesian
49
Chapter 5. Application to nuclear fusion research
model, which I will describe in the next paragraphs. The trained network can be
useful in the context of any optimization or sampling problem which requires
the calculation of the joint probability value: as in the case of the optimization
of the posterior distribution to find its maximum - known as maximum a pos-
teriori (MAP), or the sampling from the posterior distribution with a Markov
chain Monte Carlo (MCMC) algorithm. Commonly, a MAP optimization re-
quires from hundreds to thousands of iterations before convergence is reached,
and a MCMC can require millions of iterations before the posterior distribution
is properly approximated. Therefore, the trained network can be used to signi-
ficantly speed up these iterative schemes, which otherwise would require the
repeated computationally expensive calculation of the full forward model.
5.4.1. The Bayesian model of bremsstrahlung within
Minerva
Given values of
Zeff
, electron temperature
Te
and density
ne
, the model allows to
predict the expected bremsstrahlung emission observed with the spectrometer.
The model’s only free parameter is the effective ion charge
Zeff
.
Te
and
ne
are
inferred with a Minerva model from an independent Thomson scattering meas-
urement. A sketch of the model graph implemented in the Minerva framework
is shown in figure 5.3.
Colored orange and blue nodes represent probabilistic nodes, prior probabil-
ity of
Zeff
and likelihood function of the model parameter given the observed
emission, respectively. White nodes represent either a parameter known at infer-
ence time, as for the
ne
and
Te
nodes, or a calculation node as the bremsstrahlung
emission node. The latter one summarizes all the calculations needed to calcu-
late the expected measurements, given values of
Zeff
,
ne
and
Te
, which include,
for example, the line integration of the calculated emission along the line of
sight and the application of required calibration coefficients, as described in
article III. We shall denote this forward function as
fm(Zeff,ne,Te)
. The observa-
tions are constituted by the bremsstrahlung emission
Vλ
detected at 15 spectro-
meter channels in the wavelength range of
≈
627
−
641 nm. The joint probability
distribution for this model can be written as:
p(Zeff,Vλ)=p(Zeff)p(Vλ|Zeff;ne,Te)(5.2)
where
p(Zeff)
is the
Zeff
prior probability chosen as a uniform distribution
between 1 and 6 (assuming that carbon is the dominant impurity in a hydrogen
50
5.4. Learning the model joint distribution
Zeff neTe
Bremsstrahlung
emission
observation
Figure 5.3.:
Simplified visualization of the Bremsstrahlung emission model
graph implemented within the Minerva framework. The orange node rep-
resents the prior of the model free parameter
Zeff
, and the two white nodes
ne
and
Te
represent electron density and temperature. The Bremsstrahlung
emission node represents a calculation node to predict the expected measured
signal. The observed quantities are represented by the probabilistic blue node
at the bottom, which stands for the likelihood function of the model.
plasma):
p(Zeff)=𝒰(1,6)(5.3)
and the likelihood function
p(Vλ|Zeff;ne,Te)
is chosen as a Gaussian distribution
centered at the model prediction fmand with standard deviation σ=0.2fm:
p(Vλ|Zeff;ne,Te)=𝒩(μ=fm,σ=0.2fm)(5.4)
Figure 5.4 shows an example of the quantities used in the model. The top row
shows a randomly generated
ne
and
Te
profile in the left and right plot, re-
spectively. These are samples from the prior distributions: samples drawn from
them might not look realistic, lacking the features commonly seen in observed
plasma profiles. This is the case of the profiles shown in the figure, but the priors
are broad enough to contain both ’realistic’ and less ’realistic’ instances of the
plasma profiles (see also figure 5.5). The profiles are expressed as function of the
so called effective radius coordinate
ρ
. This is a coordinate that extends from
51
Chapter 5. Application to nuclear fusion research
00.51
0
1
2
3
4
ρ
nein 1019 m−3
00.51
1
2
ρ
Tein keV
2 4 6
−600
−400
−200
0
Zeff
log joint probability
630 635 640
0.8
1
1.2
1.4
⋅10−4
wavelength in nm
W/(m−2Asr−1)
Figure 5.4.:
An example case of the forward model calculation. In clockwise
direction, from the top left plot the following quantities are shown: a random
sample of electron density profile, a random sample of electron temperature
profile, the logarithm of the joint probability of the model, and the observed
simulated bremsstrahlung emission with added Gaussian noise from the error
model. The profiles are sampled from their corresponding prior distributions:
samples drawn from them might not look realistic, lacking the features com-
monly seen in observed plasma profiles, as it is the case for those shown
here.
the plasma core at
ρ=
0, towards the plasma edge at
ρ≈
1
.
0. The bottom row
shows, in the left plot, the logarithm of the joint probability
log �p(Zeff,Vλ)�
for
Zeff ∈[
1
,
6
]
and fixed value of
Vλ
, and, in the right plot, the bremsstrahlung
emission
Vλ
calculated with the forward model, with added Gaussian noise
from equation 5.4. Note that the maximum of the joint probability is found for
Zeff ≈
2
.
0, which is the value used to generate the data in the bottom-right plot.
In order to learn to approximate the model joint distribution, the network
needs to take as input
ne
,
Te
,
Vλ
and
Zeff
, and give as output
log �p(Zeff,Vλ)�
. We
will generate the training data by sampling from the probability distributions
assigned to each of the input quantities. The distributions for
Zeff
and
Vλ
are
given in equations 5.3 and 5.4. The electron density and temperature profile
probability distributions are modeled with a 1D Gaussian process (GP) [60]. A
GP is used to model the correlation between the values that a function assume
52
5.4. Learning the model joint distribution
in its domain. A Gaussian distribution is chosen for the values of the function
evaluated at a fixed number of domain locations (a grid), and its covariance
is determined by the GP covariance function, which, through its parameters,
controls the correlation between the function values on the grid points. One
common choice for the covariance function is the squared exponential:
K(z1,z2)=σ2
fexp �−(ρ1−ρ2)2
2σ2
x�+δijσ2
y(5.5)
where
ρ1
and
ρ2
are two locations along the effective radius, and the
σ
paramet-
ers influence the smoothness of the profile:
σf
regulates the overall variance
of the profile,
σx
regulates the length scale of the profile variation in its do-
main. Small
σx
values imply quickly varying profiles, whereas larger values
imply smoother, slower changing profiles.
σy
regulates the amount of noise
expected in the profile. The following parameter values were used for the
ne
GP covariance function:
ne∶�σf=5.0⋅1019 m−3
σx=0.3
σy=0.005 ⋅1019 m−3
(5.6)
and for Te:
Te∶�σf=2.0keV
σx=0.3
σy=5.0⋅10−3keV
(5.7)
Figure 5.5 shows 20 samples drawn from the two distributions. The figure also
shows a low value constrain being applied at the edge
ρ≈
1
.
2, where the
ne
and
Te
profiles are expected to assume values of 0
.
1
±
0
.
5
⋅
10
19
m
−3
and 0
.
1
±
0
.
5
keV
,
respectively. The number of grid points along
ρ
is fixed to 20 for both profiles,
equally spaced between 0 and 1.2.
5.4.2. Architecture and specifications of the network
A total number of 10
6
samples for each of the input quantities
ne
,
Te
,
Vλ
and
Zeff
and the corresponding target joint probability distribution
log �p(Zeff,Vλ)�
53
Chapter 5. Application to nuclear fusion research
00.51
0
2
4
6
ρ
Tein keV
00.51
0
5
10
15
ρ
nein 1019 m−3
Figure 5.5.:
Samples from the Gaussian process probability distributions of the
Te
and
ne
profiles. Note how the profile shape is in general smooth and not
monotonic. A low value constraint is also visible at the edge ρ≈1.2.
were used to train the network. The number of network input units is there-
fore 56: 20 profile locations for each of the two profiles, 15 bremsstrahlung
spectral channels and one
Zeff
value; the number of output units is one: the
log �p(Zeff,Vλ)�
value. The network architecture used was a multi-layer per-
ceptron (MLP) (see figure 4.1) with three hidden layers of 50, 20 and 10 units,
respectively. The network was implemented within the TensorFlow frame-
work [42], using the AdamW optimizer [61] with learning rate
α=
10
−5
, weight
decay
λ=
10
−5
,
β1=
0
.
9and
β2=
0
.
999. A mean square error loss function was
used (see equation 4.3) During training, the network was regularly tested on
a test set made of 10000 samples randomly drawn from the aforementioned
distributions and early stopping was used to prevent overfitting and improve
generalization. According to this condition, the training is stopped when the
network error on test data starts becoming larger than that on training data.
Another stopping condition for the training was given by the maximum number
of iterations through the whole training set to 2000.
5.4.3. Evaluation of the network on experimental
measurements
The trained network was then evaluated on data and measurements collected
during the latest experimental campaign (OP1.2 b). Six different days, arbitrarily
chosen, for a total of 172 plasma discharges, and a number of measurements
larger than 10 000 were considered without restrictions on the experiment fea-
tures (e.g., amount of heating power, magnetic configuration, etc). For a full list
54
5.4. Learning the model joint distribution
-4000 -3000 -2000 -1000 0
−4000
−3000
−2000
−1000
0
NN prediction log p(Zeff,Vλ)
Minerva inference log p(Zeff,Vλ))
Figure 5.6.:
Result of the neural network reconstruction of the model joint
probability distribution values from more than 10 000 measurements collected
during OP 1.2 W7-X experimental campaign. On the x-axis, the (logarithmic)
probability values predicted by the network, on the y-axis, those predicted
with the original Bayesian model, also used to generate the training data. The
straight line shows the bisector.
of the experiment numbers please look at appendix A. Figure 5.6 shows one
result from such evaluation. The network reconstructed values, on the x-axis,
are compared to those inferred with the original Bayesian Minerva model used
to generate the training data. Both network and Minerva model were evaluated
on a grid of 100 linearly spaced values of
Zeff ∈[
1
,
6
]
. Qualitatively speaking,
the agreement between the two methods seem promising. A quantitative in-
vestigation of the error made by the network in the reconstruction is shown in
figure 5.7. On the left, it is shown the mean relative error E(Zeff)defined as:
E(Zeff)=1
Nm∑
i
|
|
|
logNN(p(Zeff,Vλ,i))−logM(p(Zeff,Vλ,i))
logM(p(Zeff,Vλ,i)) |
|
|(5.8)
where
Nm
is the number of measured data points,
i
is an index labeling a single
bremsstrahlung measurement
Vλ,i
,
logNN(p(Zeff,Vλ,i))
denotes the logarithmic
probability value reconstructed by the network, and
logM(p(Zeff,Vλ,i))
denotes
the value inferred with the Minerva model. The figure shows that the error is
on average lower than 20% for most of the Zeff values considered.
In the plot on the right, instead, the distribution across the measurement data
set of the relative error
E(Vλ)
averaged along
Zeff
is shown.
E(Vλ)
is defined as:
55
Chapter 5. Application to nuclear fusion research
2 4 6
0.1
0.15
0.2
0.25
0.3
Zeff
mean relative error E(Zeff))
0 0.2 0.40.60.81
0
2
4
6
8
mean relative error E(Vλ)
probability density
Figure 5.7.:
Study of the discrepancy between the values of the joint probability
reconstructed with the network and inferred with the original Bayesian model.
On the left, the mean relative error between the probability values found with
the two methods for each
Zeff
value is averaged across all measurements
considered. On the right, the distribution of the relative error averaged across
the whole Zeff range [1, 6] for each measurement considered is shown.
E(Vλ)=1
NZ∑
i
|
|
|
logNN(p(Zeff,i,Vλ))−logM(p(Zeff,i,Vλ))
logM(p(Zeff,i,Vλ)) |
|
|(5.9)
where
NZ
is the number of
Zeff
values used in the evaluation, in this case fixed to
100 points in the range from 1 to 6, and
i
is an index labeling each of the values.
The other quantities are defined analogously to equation 5.8. Approximately in
85% of the cases the network reconstruction differs by less than 20% from the
original Bayesian model inference.
Figure 5.8 shows another comparison between network reconstruction and
Minerva inference, this time for a single plasma discharge at W7-X (experiment
20180807.015). Hydrogen fueling was used for this discharge, a line integrated
electron density of approximately 4 to 6 10
19
m
−2
, and an electron temperature
between 2 and 4 keV were maintained throughout the pulse for approxim-
ately 8 seconds. The heating provided with the electron cyclotron resonance
heating was of 4250 kW. In clockwise direction, starting from the plot on the
top left corner, the figure shows the reconstruction of
log(p(Zeff,Vλ))
for the
bremsstrahlung emission observed at time 1.05 s, 3.40 s and 8.11 s as function
Zeff ∈[
1
,
6
]
for
Vλ
fixed at the measured value, for both the network and Minerva
56
5.4. Learning the model joint distribution
2 4 6
−1500
−1000
−500
0
1.05 s
Zeff
log(p(Zeff,Vλ))
2 4 6
−1500
−1000
−500
0
3.40 s
Zeff
2 4 6
−1500
−1000
−500
0
8.11 s
Zeff
log(p(Zeff,Vλ))
0 2 4 68
1
1.5
2
2.5
time in s
Zeff
ANN
Minerva
Figure 5.8.:
Comparison of network and Minerva model reconstruction for W7-
X experiment 20180807.015. In clockwise direction, starting from the top left,
it is shown the value of
log(p(Zeff,Vλ))
for
Zeff ∈[
1
,
6
]
and
Vλ
fixed at the
measured value, as it is reconstructed by the network and with the Minerva
model, at time 1.05 s, 3.40 s and 8.11 s, respectively. In the last plot, the max-
imum a posteriori value of
Zeff
found by the network and with the Minerva
model inference throughout the whole plasma discharge is shown.
model inference. This kind of reconstruction can be used to find the maximum a
posteriori (MAP) value of
Zeff
, i.e. the
Zeff
value which maximizes the posterior
distribution since
p(Zeff|Vλ)∝p(Zeff,Vλ)
, throughout the entire discharge. This
is shown in the plot on the right, where the
Zeff
MAP values found with network
and Minerva model are compared to each other. The analysis of the entire dis-
charge with the Minerva model typically takes tens of minutes, whereas with
the network it can be reduced to hundreds of microseconds.
5.4.4. A brief summary
With the example shown in this section, we have covered the last application
that was mentioned in section 4.2, in the last formula of equation 4.10. Through-
out the applications described in this chapter, the approximation framework
57
Chapter 5. Application to nuclear fusion research
formulated in chapter 4 has been used to train neural networks in order to learn
possible mappings between different set of variables and quantities defined by
Bayesian models of nuclear fusion experiments. In the applications described
in section 5.1 and 5.2, the network was trained to learn the mapping
g
from
observable data
d
to free parameters
m
, whereas in the application described
in section 5.4, the network was trained to learn the mapping
h
between the
joint space
(d,m)
and the (logarithmic) value of the joint probability distri-
bution of the model
p(d,m)
. The main advantage provided by the network is
that of speeding up and automatize calculations which otherwise could be time
consuming, especially when taking into consideration the large amount of data
collected at these experiments. The speed-up is significant, in the range of sev-
eral orders of magnitude. Another advantage of this specific implementation of
the training method comes from the fact that we have exploited the common
formulation of Bayesian models provided by the Minerva framework, which
makes possible to standardize and automatize the sampling scheme, making it
possible to apply it effortlessly to any model implemented within the frame-
work. The application of this method is not, in principle, restricted to physics
models of nuclear fusion experiments, and it is easily applicable to different
systems, since the network can be successfully trained with data generated ex-
clusively from existing Bayesian models. The theoretical framework provided
by Bayesian formalism, and the common implementation language for scientific
Bayesian models offered by the Minerva framework play an important role in
making our method general, and, possibly, part of an autonomous system for
scientific inference.
58
Chapter 6.
Publications
The publications listed in this chapter constitute the main outcome of my work
during my Ph.D. studies. They mostly describe applications of the principles and
methods that I, together with the other authors and group members, adopted
and developed in order to improve scientific modeling and data analysis in
nuclear fusion research. The publications feature technical details and describe
solutions to issues that arise in practical applications. The reader that is not
interested in such details, but rather in the concepts behind the methods and
principles adopted, is recommended to look at the introductory chapters of this
thesis.
59
Chapter 6. Publications
6.1. Article I
A. Pavone et al.
»Neural network approximation of Bayesian models for the inference
of ion and electron temperature profiles at W7-X«
In: Plasma Physics and Controlled Fusion 61 (2019)
Synopsis
The publication describes the application of neural network approximated Bayesian
inference to the problem of the temperature profile reconstruction from X-ray
imagining diagnostic measurements. In particular, we describe how the net-
work can be trained to approximate the inference carried out with an existing
Bayesian model by using it to generate the training data. The trained network
can then be used to carry out fast inference on measurements, reducing the
analysis time from tens of minutes to tens of microseconds. Also, we show how
the quality of the training set can be assessed by looking at how well it describes
the experimental measurements. We also demonstrate the performance of the
network on a number of experimental cases, comparing its reconstruction with
the profiles inferred with the Bayesian model.
60
Neural network approximation of Bayesian
models for the inference of ion and electron
temperature profiles at W7-X
A Pavone
1
, J Svensson
1
, A Langenberg
1
, U Höfel
1
, S Kwak
1
,
N Pablant
2
, R C Wolf
1
and theWendelstein7-XTeam
1
1
Max-Planck-Institut für Plasmaphysik, Teilinstitut Greifswald, D-17491 Greifswald, DE, Germany
2
Princeton Plasma Physics Laboratory, 08540 Princeton, NJ, United States of America
E-mail: [email protected]
Received 22 January 2019, revised 8 March 2019
Accepted for publication 26 April 2019
Published 30 May 2019
Abstract
In this paper, we describe a method for training a neural network (NN)to approximate the full
model Bayesian inference of plasma profiles from x-ray imaging diagnostic measurements. The
modeling is carried out within the Minerva Bayesian modeling framework where models are
defined as a set of assumptions, prior beliefs on parameter values and physics knowledge. The
goal is to use NNs for fast ion and electron temperature profile inversion from measured image
data. The NN is trained solely on artificial data generated by sampling from the joint distribution
of the free parameters and model predictions. The training is carried out in such a way that the
mapping learned by the network constitutes an approximation of the full model Bayesian
inference. The analysis is carried out on images constituted of 20×195 pixels corresponding to
binned lines of sight and spectral channels, respectively. Through the full model inference, it is
possible to infer electron and ion temperature profiles as well as impurity density profiles. When
the network is used for the inference of the temperature profiles, the analysis time can be reduced
down to a few tens of microseconds for a single time point, which is a drastic improvement if
compared to the ≈4 h long Bayesian inference. The procedure developed for the generation of
the training set does not rely on diagnostic-specific features, and therefore it is in principle
applicable to any other model developed within the Minerva framework. The trained NN has
been tested on data collected during the first operational campaign at W7-X, and compared to the
full model Bayesian inference results.
Keywords: stellarator, x-ray imaging, neural network, Bayesian inference, modeling, Minerva
framework, real time
(Some figures may appear in colour only in the online journal)
1. Introduction
Neural networks (NNs)are a powerful tool when it comes to
speed and approximation of complex functions. universal
approximation theorems have been shown to be valid for NNs
under different assumptions, as in [1,2]and [3]. The real time
capabilities of NNs have also been shown in different fusion
experiments, e.g. ion temperature profile inference and dis-
ruption prediction at JET as in [4–6]and at ASDEX Upgrade
[7]. NNs have also been used for the reconstruction of plasma
parameters from diagnostic data as in the case of charge
exchange spectra automatic analysis at JET for reconstruction
of ion temperature, rotation velocity and impurity density
[8,9], and in the case of electron temperature from a soft-x-
ray system at NSTX [10]. In this paper, we focus on an
Plasma Physics and Controlled Fusion
Plasma Phys. Control. Fusion 61 (2019)075012 (14pp)https://doi.org/10.1088/1361-6587/ab1d26
Original content from this work may be used under the terms
of the Creative Commons Attribution 3.0 licence. Any
further distribution of this work must maintain attribution to the author(s)and
the title of the work, journal citation and DOI.
0741-3335/19/075012+14$33.00 © Max-Planck-Institut fur Plasmaphysik Printed in the UK1
application of NN algorithms on x-ray imaging crystal
spectrometer (XICS)measurements and we will describe an
approach based on a different paradigm of NN training and
reconstruction suitable when the physics model of the diag-
nostic is available. In particular, we will make use of the
Bayesian implementation of the model within the Minerva
modeling framework [11].
The Minerva framework provides a language to develop
models of complex systems in a Bayesian way: models
implemented within it share the same structure and are highly
modular, so that different operations can be performed on
them independently of the specific model under consideration.
These operations can be the inference of the free parameter
posterior distributions through different optimization and
sampling algorithms, sampling schemes—as the one we make
use of in this work, and storage/sharing. The elements/
modules of a Minerva model are called nodes: these can be
pieces of code that perform computation of physical quan-
tities and they are such that they can be easily replaced. This
makes building new models out of existing nodes particularly
easy, as well as testing different models simply by switching
their nodes. In this work, we have taken advantage of the
modularity and shared structure of the models so that the
method described here is general and can be easily applied to
different problems modeled within the same framework.
NNs applied to diagnostic data are typically trained on
real measurements and the corresponding quantities of inter-
est in situations where a model of the problem is missing.
Such an approach has the advantage of providing the NN with
actually measured data, but it also has the limitation of
depending on a fixed and restricted amount of training sam-
ples, the feature of which depends on the performed experi-
ments, and on a limited parameter space. A different case is
illustrated in [9]: here the authors have first reconstructed the
distribution of a large set of measured physics parameters;
then new parameters have been sampled from it, and used as
input in a forward model in order to create simulated mea-
surements. A NN was then trained on both measured and
simulated data.
The way we try to overcome these limitations is by
training the network solely on data synthesized through the
Bayesian model specified within Minerva, the same that is
used for the standard inference [12]. The parameters,
corresponding to physics quantities, used to produce the data
are sampled from the corresponding prior distributions and
the synthesized observations are sampled taking into account
the error model. This conferes control over the features we put
in the training set and, consequently, the features the NN will
be learning and will be sensitive to when evaluated on mea-
sured data. In addition, an advantage of this approach con-
cerns its generality and the possibility of performing
automatic data analysis based on physics models, which
comes as a consequence of the sampling procedure described
in the following chapters. This becomes of greater relevance
as the scale of fusion experiments grows larger and the
duration of plasma shots becomes longer. During the first
operation campaign (OP 1.1, see [13])of the W7-X stellarator
several diagnostics [14,15]were involved in the
measurements, and the number increased during the second
one (OP 1.2a). Together with the number of diagnostics, a
large proportion of which is currently implemented in the
Minerva framework, e.g. [12,16,17], also the duration of the
plasma shots increased. All of this makes fast and automatic
data analysis very desirable. The technical implementation of
our approach only makes use of features shared between all
models in Minerva, and thus it is easily transferable and
applicable to other diagnostics modeled in the framework.
The paper is organized as follows: in section 2we give a
general outline of the core idea behind the training method,
that is the sampling scheme from the Bayesian model; this
constitutes the main piece of novelty of this work. The spe-
cific details necessary to understand each part of the entire
scheme are then delivered in the following sections. In
section 3, we give a brief overview of the hardware setup and
the physics involved in the measurements performed with the
x-rays imaging diagnostic. It is followed by an explanation of
Bayesian modeling in section 5, which starts generally from
Bayes theorem and goes into more specific details of the
Minerva implementation of the diagnostic model. In section 6
we illustrate the sampling scheme used to generate the
training data from the XICS Bayesian model, whereas details
about the NN architecture and training algorithm are given in
section 7. In section 8, we will show how training sets gen-
erated with different models can be compared to each other in
order to be able to choose the one that better describes the
measurements; this is particularly relevant in the context of
this work since the training data are generated synthetically.
In section 9we compare the NN reconstructed profiles to
those traditionally inferred with the Minerva Bayesian model
in the case of measurements performed during the first
experimental campaign at the W7-X experiment. In section 10
we draw some conclusive considerations.
2. The method
The core of the method that we want to illustrate in this
section is the sampling scheme used to generate the NN
training data from the Minerva Bayesian model of the XICS
diagnostic measurements. In order to understand how it is
practically carried out, it is necessary to have some basic
notions about the XICS diagnostic and the Minerva modeling
framework. Details about the former are given in section 3,
while the latter is described in section 5.
An outline of the idea behind the method is the follow-
ing: the XICS Bayesian model is a model whose key elements
are the free parameters—the ion and electron temperature
profiles—and the observations, the XICS measured images.
Before any measurement is taken into account, a prior dis-
tribution is assigned to the free parameters, and a likelihood
function is assigned to the observations. The former is usually
a broad distribution which may include features we believe
are found in measured plasma profiles, whereas the latter
reflects the uncertainties of the model in predicting the data,
which typically include measurement errors. The product of
the two probability distributions defines the joint probability
2
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
distribution of the model. The joint distribution completely
describes the model, both from the point of view of its sta-
tistical properties—residing in the fact that it is a probabilistic
distribution, and from the point of view of the physics
implemented within it—found in the relation between free
parameters and observations. Since we aim at training the NN
to be an approximate replica of the full Bayesian model, we
can create the training data set by sampling the free para-
meters and synthetic observations from the joint distribution.
Afterwards, we train the NN on the inverse problem, which is
the mapping of the observed XICS images to the ion and
electron temperature profiles. We expect that, under the
assumption that the training data properly resemble the
measurements, the network can be used for fast reconstruction
of temperature profiles from new experimental data. The
goodness of this assumption for a given training data set can
be tested quantitatively, so that it is possible to compare
different models in their ability to produce data set that
resemble the experimental measurements. In section 8we
describe a possible way to test this assumption.
3. XICS diagnostic
The XICS collects x-rays emitted in atomic processes invol-
ving ion impurities and plasma electrons occurring in the bean
shaped cross section of the W7-X stellarator. The concept
behind the diagnostic is described in [18]. A sketch of the
view is shown in figure 1(a), where the plasma cross section
and the line of sight (LOS)span are shown. For each LOS an
integrated spectrum is collected. The set of all spectra forms
one measured image. From an image it is possible to recon-
struct plasma ion and electron temperature profiles, and
impurity density profiles. In section 3.1 we will describe the
setup of the diagnostic and in section 3.2 we will describe the
physics involved in the measurements.
3.1. Setup
The system is equipped with a spherical bent crystal and the
light is collected onto a CCD detector producing 2D images
similar to the one shown in figure 1(b). The two dimensions
in the image represent energy and spatial resolution respec-
tively. The diagnostic is sensitive to the energy region of
He-like Argon emission lines. The main emission lines con-
stituting the spectrum are shown in figure 1(c)and they are
the w,x,yand zfor the n=2ton=1 transitions in addition
to numerous n>=2 dielectronic satellites, e.g. the klines for
n=2. A study of the spectrum and the atomic processes
involved can be found in [19–21]. The detector covers the
wavelength range from ≈3.94 to ≈4.00 Åalong the cen-
tral LOS.
3.2. Physics
The core component of the physics processes involved in the
measurements is given by the atomic processes giving rise to
the spectral emission. A detailed description of the calculation
of the emission intensity for the different lines can be found in
Figure 1. (a)Sketch of the XICS system view on the bean shaped cross section at W7-X. The viewing span of the lines of sight is represented
by the ocher area. (b)A measured raw image detected on the CCD detector. The typical curved feature is due to the spherical bending of the
crystal. (c)He-like Argon spectrum measured along one of the central lines of sight of the image in figure (b). The main emission lines are
marked with their names.
Table 1. Atomic process description and corresponding plasma parameter dependency.
Atomic process Plasma parameter dependency
(1)Excitation from ground state of He-like ions nn TT,,,
eArHeei
(2)Di-electronic recombination of He-like ions nn TT,,,
eArHeei
(3)Recombination of H-like ions nn TT,,,
eArHei
(4)Inner shell excitation of Li-like ions nn TT,,,
eArLiei
3
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
[19]. In table 1we show the processes involved in the cal-
culations relevant to this paper together with their dependence
on plasma parameters.
The intensity of all lines depends on the electron temp-
erature T
e
through the corresponding effective rate coeffi-
cients, a calculation of which can be found in the appendix of
[19]. The dependency on the ion temperature T
i
comes as well
into the calculation of all of the line shapes as Voigt profiles
V(λ
l
,λ), which is the convolution of a Gaussian and Lor-
entzian shape, accounting for Doppler broadening and natural
broadening, respectively. The photon emission of each pro-
cess also depends on the electron density n
e
, and on the
density of Argon ions in one of the ionization stages: n
ArHe
for Ar
16+
,n
ArLi
for Ar
15+
, and n
ArH
for Ar
17+
. All of the
quantities in table 1are defined on a 3D Cartesian coordinate
space, so that they are dependent on the position x. The
emission intensity I(λ)at a given λis then calculated per-
forming an integration along the LOS paths L, as in the fol-
lowing equation:
InVixxx,, . 3.1
Ll
ll
e
2
òå
lll=() () ( ) () ( )
The quantity ix
l(
)
is defined according to:
inkTxxxxd. 3.2
l
jl
jlje
å
=() () ( ()) ( )
()
It denotes the overall contribution from different ioniz-
ation stages jto the emission line l. In the equations, λ
l
denotes the wavelength for the given line, k
lj
denotes the
effective rate coefficient of the line lin the ionization stage j,
and n
j
denotes the density of ions in the ionization stage j.
The contributions to the overall He-like Argon spectrum
arising from the different atomic processes and ionization
stages are shown in figure 2. In the spectrum depicted, the
lines q,r, and a, are also visible.
Since an absolute calibration of the diagnostic measure-
ments is currently not available, the simulated images cannot
reproduce the measurements in their absolute values. There-
fore, both simulated and measured images are normalized
dividing the intensity of each pixel by that of the bright-
est one.
4. Bayesian inference
When developing a Bayesian modeling and inference scheme,
the first step is the definition of the model free parameters, w,
and observed data, d. Probability distributions are assigned to
both quantities, and they take the name of prior distribution,
P(w), and likelihood function, Pdw(∣ )
. The prior distribution
represents the a priori knowledge that we have about the free
parameters before taking the observations into account. The
likelihood distribution represents instead the model uncer-
tainties in the prediction of the data. The inference is the
process of knowledge acquisition when new data are
observed. According to Bayesian probability theory, it can be
described as an update process of the a priori distribution.
This process is formally expressed through Bayes formula:
Pwd PdwPw
Pd .4.1=(∣) (∣ ) ( )
() ()
The numerator in equation (4.1)corresponds to the joint
distribution P(d,w)of the observed data and free parameters,
Pd w PdwPw,=() (∣)()
. The term Pwd(∣)
is called posterior
distribution of the free parameter wgiven the observed data d,
and it represents the new state of knowledge on the model
free parameters as new data are collected. The quantity P(d)is
called evidence or prior predictive and, as first interpretation,
it plays the role of a normalization factor:
Pd PdwPw wd. 4.2
ò
=() (∣ ) ( ) ( )
The hidden relevance of this term becomes evident when we
switch from Bayesian inference for parameter estimation to
Bayesian model selection, see for example [22]. Since the
integral in equation (4.2)is a marginalization over the free
parameters of a given model, we see that P(d)is the dis-
tribution of all the data that a model can describe, quantifying
how likely each data point is to be generated under the
assumptions of the model. Once it is evaluated on a data point
d
*
, the evidence P(d
*
)defines how likely our model is as an
explanation of such data point. The value of this probability
depends on the prior distributions of the parameters and the
uncertainties attributed to the model prediction in the like-
lihood term. The dependency is such that broader prior dis-
tributions, or prior distributions defined over higher
dimensional parameter spaces, i.e. more complex models, will
be penalized, and the overall probability will always con-
stitute a trade-off between model complexity and good fitto
the data. This is indeed an application of the Occam’srazor
principle. An illuminating explanation of how Bayesian
model selection naturally embodies Occam’s razor is given
in [23].
As data are measured, Bayesian inference can be carried
out to extract information about the free parameters. The
Figure 2. A He-like Argon spectrum calculated with the forward
model for T
i
=1 keV, T
e
=1.2 keV, n10
e13
=cm
−3
,nArHe =
108cm
−3
,nnn8
ArLi ArH ArHe
== . The numbers in the legend refer
to the atomic processes listed in table 1.
4
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
target of Bayesian inference is the posterior distribution of the
parameters, Pwd(∣)
, the spread of which expresses the
uncertainties on the inferred quantities.
5. The Minerva framework
In order to interpret the measured data, the physics describing
the processes giving rise to the measurements is implemented
in a forward model of the diagnostic within the Minerva
framework [12,24]. The Minerva modeling framework is a
framework for the development of Bayesian models of
complex systems. Within the framework it is possible to carry
out both modeling and inference. The result of the modeling
is an object that is called a graph. It contains all information
about the physics and the probabilistic relations between the
modeled quantities. It describes all the hypotheses used in the
creation of the Bayesian model. A given model can be used to
produce simulated observations and compare them to the
measured observations. In this way, the model free para-
meters that better describe the data can be found. The
advantage of doing Bayesian inference is that it formally
defines a procedure to calculate the uncertainties of a solution,
which simply involves the application of Bayesian probability
rules.
The Minerva framework relies on graphical models [25]
in order to express the conditional dependency between ran-
dom variables in the model. For each model implemented in
the framework, a graph object is created that describes the
joint distribution of the free parameter and the measurements
according to the forward model. A simplified version of the
XICS model graph is shown in figure 3. In the graph the
colored nodes are probabilistic nodes, where orange denotes
the free parameters and blue denotes the observed quantities.
In the case of the XICS forward model, the free para-
meters can be the plasma parameters summarized in table 1,
right column, and the observed data are the images con-
stituted of spectra like the one shown in figure 2, calculated
accounting for the atomic processes described in section 2.
All the distributions in the graph are chosen to be normal
distributions. Note that the likelihood function is then a nor-
mal distribution centered on the model prediction obtained
with a given set of free parameter values:
Pdw dyw1
2
exp 2,5.1
22
ps s
=-
-
⎜⎟
⎛
⎝
⎞
⎠
(∣ ) (()) ()
where we used y(w)to denote the forward model function y
dependent on the free parameter values w, and dto denote a
measured data point. The white squared nodes in the graph of
figure 3represent deterministic calculation nodes, and the
cloud node is used to denote a data source, i.e. a node that
communicates with a database, here the W7-X ArchiveDB,
where information about the diagnostic, e.g. geometry setup
etc. are stored together with measured and analyzed data. The
arrows represent dependencies between nodes rather than a
computational flow. For example, all arrows from the free
parameters reach, directly or indirectly, the observed node,
i.e. the probability distributions of the observed quantities (d)
should be conditioned on the value of the free parameters,
Pdw(∣ )
. The joint probability distribution represented by the
whole graph can be factorized and written as: Pw d,
=
()
PwPdw()(∣)
.
The T
e
,T
i
and n
k
profiles, where n
k
stands for Argon ion
or electron density, are functions of the normalized effective
radius eff LCFS
r
yy=. Thus, an equilibrium code, in this
case VMEC [26], called from the corresponding node, is
required to carry out the mapping to the 3D Cartesian coor-
dinates. The impurity emission can then be calculated locally
and integrated along the lines of sight. A background emis-
sion is also added to the spectrum. In order to calculate the
detector pixel response, the instrumental function and the
wavelength dispersion on the chip are also required. The data
source node is used to provide the observed data for the
inference and the l.o.s. geometry. The smoothness of the
profiles is controlled by a zero mean Gaussian process (GP)
prior [27]with a squared exponential covariance function, as
defined in:
cov exp 2.5.2
ij f
ij
x
ij y
2
2
2
2
srr
sds=-
-+
⎛
⎝
⎜
⎜
⎞
⎠
⎟
⎟
() ()
The quantity cov
ij
denotes the elements of the covariance
matrix, ρ
i
and ρ
j
denote the location of any two profile points
labeled with iand j, and the quantities σ
f
,σ
x
and σ
y
denote the
function variance, the length scale and the noise variance of
the profiles, respectively. Note that the quantity cov
ij
corre-
sponds to the covariance between any two points in the
profile, as function of the location ρ
eff
. The length scale σ
x
describes how smooth a function is. Small length scale values
mean that function values can change quickly in their domain,
whereas large values describe functions that change slowly. In
our case, this refers to plasma profiles characterized by either
quick or slow spatial changes. The function variance σ
f
is a
scaling factor and determines function value variations
around the mean. Large values will allow again for bigger
variations, smaller values will describe less varying functions.
It also determines, together with σ
y
, the value of the covar-
iance matrix elements along the diagonal, where i=j. The
noise variance σ
y
is used to allow for noise present in the data
and it specifies the amount of expected noise.
When measured data are available, inference can be
performed on a Minerva model. In a practical implementa-
tion, when a single value solution is desirable, a Maximum A
Posteriori optimizer can be used to find the maximum of the
posterior distribution. The full Bayesian answer to the infer-
ence problem is nevertheless provided by the full posterior
distribution, the spread of which expresses the uncertainties
on the inferred quantities. In order to provide this information,
a Markov Chain Monte Carlo (MCMC)sampler is usually
adopted to generate the posterior samples. The samples can
then be stored and used in further independent calculations
providing full non-linear uncertainty propagation. Profiles
inferred using XICS data can then be used, for example, in
impurity transport studies, see [24,28].
5
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
The model depicted in figure 3represents a sophisticated
inference problem: first of all, the images fed to the inversion
routines go through a preprocessing stage, occurring in
the data source node, where they are (1)straightened, and
(2)binned along the LOS direction in order to reduce the
computational effort. At this point, the inversion problem
consists of simultaneously fitting an image of 20×195
pixels along the LOS direction and the wavelength dispersion
direction respectively, and doing a tomographic inversion of
different plasma profiles. The full Bayesian inference takes
from 1 to 4 h for each measured image, whereas a NN can
process data at a time scale of tens of microseconds in good
implementation conditions (e.g. on a GPU).
6. Creation of the training set
In order to generate the NN training set, we will only make
use of the XICS Minerva model. All the features of the model
are expressed in its joint distribution Pd w PdwPw,=() (∣)()
:
the distribution of the variables (d,w)depends on the func-
tional form of the forward model, appearing in the likelihood
term Pdw(∣ )
as y(w)(equation (5.1)) and which expresses the
dependence of don w, the uncertainties on the model pre-
diction and the prior distribution P(w). As our goal is to create
a(approximated)copy of the original full Bayesian model, we
must provide the NN with training data having the same
properties described by the model: this is achieved by gen-
erating the training set data from samples of the full model
joint distribution.
Practically, such training set can be obtained by iterating
over the following three steps:
i. Draw and store a sample from the joint prior distribution
of the free parameters: Pw PT T n bg,, ,
kei
==() ( )
P
TPTPnPbg
kei
()()()( )
,wherebg denotes the back-
ground l.o.s. integrated emission profile
ii. Run the forward model in order to calculate a synthetic
observation with the given free parameters
iii. Store a number of samples drawn from the likelihood
function of the synthetic observations, Pdw(∣ )
, which is
fully specified by the given set of sampled free
parameters and the model uncertainties.
The sampling procedure taking place at step (iii)will
introduce noisy samples in the training set since the likelihood
distribution expresses the uncertainties of the model predic-
tion. This will help making the NN stable against small
perturbations in the input data when evaluated on measured
images. This is equivalent to the technique known as data
augmentation,[29,30]. The modifications we inject into the
samples are based on the noise model that has been assumed
for the problem, in this case a Gaussian noise model. The NN
Figure 3. A simplified sketch of the XICS model graph. Colored nodes are probabilistic nodes, where orange denotes the free parameters and
blue denotes the observed quantities. White nodes represent deterministic calculation nodes. The white GP nodes represent a Gaussian
process prior, and the symbols σ
f
,σ
x
and σ
y
denote the parameters in the expression of the squared exponential covariance function. The data
source node is used to fetch diagnostic specific information and measurements from the W7-X Archive. The arrows represent direct or
indirect dependencies in the probabilistic relations between the quantities in the probabilistic nodes.
6
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
is then trained on the mapping from the images to the ion and
electron temperature profiles. Training on other profiles is
also possible and straightforward, since it is just matter of
choosing and storing another set of samples among those
stored at step (i). As a consequence of such sampling
procedure, the portion of the training set corresponding to the
model prediction dis made of samples obtained by margin-
alizing the numerator of equation (4.1)over the model free
parameters. In other words, these are samples from the
evidence term P(d), the denominator of equation (4.1).
A sketch of the procedure is illustrated in figure 4. The
size of the input images is 20 pixels along the LOS dimension
and 195 pixels along the wavelength dimension. The target
ion and electron temperature profiles are defined with 15
points equally spaced along the effective radius. The training
set is made of 500 000 samples. A test set made of 10 000
samples is used to check the generalization capabilities of the
NN during training and it is generated in the same way as the
training set.
Given the previously mentioned sampling procedure, an
insightful interpretation can be given to the NN mapping. A
well known result [31]in the NN field states that, under the
assumption of a sum-of-squares error loss function as in
equation (7.2), large training data set and successful optim-
ization, the NN mapping fis given by the conditional average
of the target data y
i
, conditioned on the input vector x
i
:
fyxw x;. 6.1
iii
=á ñ() ∣ ()
In the specific contest of our study, the network’s targets and
input are the ion and electron temperature profiles and the
synthetic XICS images. Given the fact that the training set is
generated sampling from the joint distribution of the XICS
Bayesian model, the distribution of the target data tgiven an
input vector x,ptx(∣ )
, in the limit of large training data set,
corresponds to the posterior distribution of the ion or electron
temperature profile of the full Bayesian model. Therefore, we
can state that the NN mapping, in ideal circumstances, is
given by the mean of the full model posterior distribution. In
real world circumstances, the data set size is finite and the
optimization is never perfect, so that we can say that the
inversion provided by a NN trained in such a way constitutes
an approximation of the full model Bayesian inference.
7. NN input, output and architecture
It is worth to summarize here what the NN input and output
are at the different stages. At training time:
•Input: synthetic images, generated with the XICS forward
model and the sampling procedure described in section 6.
These images supposedly closely resemble the actual
XICS measurements (after few pre-processing operation,
i.e. row binning and straightening, see next bullet point
and figure 5). They are made of 20×195 pixels/values.
•Target: the ion or electron temperature profiles used to
generate the corresponding images. These are made of 15
points along the effective radius eff LCFS
r
yy=, where
ψis the magnetic flux and ψ
LCFS
is the flux at the LCFS.
The points are obtained by sampling from the GP prior
distributions of the temperature profiles.
At evaluation time:
•Input: the pre-processed, actual XICS measurements. The
measured images, originally showing the curved feature
shown in figure 1(b), are straightened according to the
detector and crystal geometry. Afterwards, the original
1475×195 pixels of the image are binned along the largest
dimension, which corresponds to the spatial resolution,
where neighboring lines of sight overlap significantly. The
binned images are made of 20×195 pixels. An example of
a binned image is shown on the leftmost side of figure 5.
•Output: the estimated ion or electron temperature profiles.
In case of successful training, it will match, within
uncertainties, with the profile inferred through the full
Bayesian model.
Essentially, two NNs, identical for all the features except for
the target profiles, have been trained and tested indepen-
dently: one for the inversion of ion temperature profiles and
the other for the inversion of the electron temperature profiles.
Since the network’s input are 2D images, the network
architecture has been inspired by the LeNet-5 convolutional
neural network (CNN)[32]and is shown in figure 5. This
kind of architecture has been shown to be particularly effec-
tive when the input present a 2D structure. It has been suc-
cessfully used in many image recognition problems,
achieving state-of-the-art results [33,34]. Here we expect to
recurrently find across the image the features induced in the
spectrum by the ion or electron temperature profiles, which
Figure 4. A sketch to illustrate the sampling procedure for the
training set creation described in section 7. A sketch of the XICS
Minerva model and the neural network is shown on the left and on
the right, respectively. The NN takes as input images sampled from
the likelihood function of the model, given a set of sampled free
parameters. The blue nodes of the neural network denote the input
pixel of the image and the two red nodes at the top denote the output
points of the ion temperature profile.
7
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
affect line width and intensities, respectively. Two convolu-
tional layers
C1
and
C2
, each one followed by one sub-
sampling layer, are used in a hierarchical feature extraction
structure. A convolutional layer applies a convolution filter
or kernel to the input image, extracting information that
are recurrent accross different locations in the image (for
more details see [32]). The kernel dimension sizes used in
the convolutional layers are respectively: (3, 16)and (2, 5),
where the first and second dimensions refer to the LOS and
wavelength dispersion dimension of the input images. The
number of feature maps, i.e. sets of units whose weights are
constrained to be identical [35,36], is set to 30 for both
convolutional layers. The sub-sampling layers use max
pooling with a resolution of 2 by 2. Two fully connected
layers
M
1and
M
2
made of 20 and 18 units respectively,
constitute the final layers, which will produce the desired 15
points ion or electron temperature profile output. The acti-
vation function in the convolutional layers is the rectified
linear unit:
fx xmax 0, , 7.1=() ( ) ( )
whereas in the fully connected layers it is the hyperbolic
tangent function.
CNNs are also especially suitable for parallelization on
GPUs, which applied to our case made the training 30 times
faster than on CPUs. The NN was implemented within
Theano, a Python framework for fast symbolic computa-
tion [37].
The training was stopped either according to an early
stopping criterion, i.e. when the network performance on the
test starts degrading, or when the decay rate of the loss
function is small enough. The loss function that the NN is
trained to minimize is defined as:
Sytww2,7.2
n
N
nn
ki
LN
kki
1
2
,1
,
,
2
k
åå
ba=-+
==
() ( ) ( )
where wdenotes the network weight vector, the first term on
the right-hand side is the sum-of-square error between the
network output y
n
and the target t
n
and nis an index that goes
through the Nsamples in the training set. The second term on
the right-hand side is a regularizing term, where
w
ki,denotes
the network weight of unit iat layer k. The parameters βand
α
k
are scale parameters which control the relative importance
of the two terms An insightful interpretation, based on a
Bayesian view of the NN training [23,38], can be provided to
such expression and can be useful in the choice of the values
of βand α
k
. The values of β=10 and α
k
=(68.00
C1
a=,
58.33
C2
a=,576.67
M1
a=,5.8
3
M2
a=)were used. More
details about Bayesian NN training in the context of this study
can be found in [39].
8. Training set comparison
When dealing with NN reconstructions, the following situation
is likely to occur: the network might be able to reconstruct the
target profiles from training data, but it might fail when applied
to experimental data. One reason why this might happen is the
following: the features of the training data do not resemble
closely enough those found in the measurements. In this section,
we illustrate how training data sets generated with different
models can be compared to each other from the point of view of
their adequacy in describing the experimental measurements. In
particular, we will consider two cases: the case of a data set
Figure 5. Architecture of the NN used. The input layer at the leftmost side is followed by a convolutional layer and a sub-sampling layer,
which are followed again by a couple of convolutional and sub-sampling layers. Two fully connected feed forward layers follow up to the
output layer. Each blue plane represents a feature map, where all the units share the same weights.
8
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
generated by sampling unconstrained profiles that leads to a
poorly performing NN, and the case of a data set where a rea-
listic constraint is applied to the prior distributions, which in turn
help improving the performance of the network. In general, one
would try to introduce as little bias as possible to a given training
set. As it is shown here, this does not always lead to a successful
result. The reason of the failure can therefore be assessed with
the method described here. It can also be used as a novelty
detection system: it can be usedtoidentifyanovelincoming
measured image as an outlier, i.e. a case that is relatively
unknown to the network and on which it might perform poorly.
In this way, one can be informed in advance about how reliable
the NN output is going to be. In general, since several different
prior distributions can be used to generate training data, this
method provides a way to quantitatively compare them and
choose the one that better describes the measurements before
anytrainingisperformed.Insection8.1 we will describe the
algorithm used for the comparison and in section 8.2 we will
show its application to training sets generated with two different
prior distributions.
8.1. The k-nearest neighbors algorithm
The features of the set of measurable images D
m
are deter-
mined by the properties that the plasma profiles have during
the experiments. An absolutely free, unconstrained sampling
of the 15×7 points in the plasma profiles introduced in
table 1, will produce a set of synthetic data Dwhich likely
will have little in common with D
m
. Most of the samples in
such a training set will have little use to our purposes, since
they would not belong to the domain of the mapping that we
want the NN to learn. In order the NN to be able to accurately
predict temperature profiles from measured images, the D
m
space has then to be covered densely enough by the training
data set: we are not interested in generating all possible
15-dimensional output vectors, but only those which repre-
sent realistic ion or electron temperature plasma profiles. This
is accomplished by refining the prior distributions in such a
way that sampling from them will generate a data set of
synthetic images which densely encloses D
m
.
In principle, the full distribution of the data described by
a given model is determined by the prior predictive dis-
tribution, equation (4.2). As we have seen in section 6, the
training images constitute samples from such distribution,
therefore they can be used to estimate how closely a training
set resemble the actual measurements.
Several methods can be used to study the adequacy of the
training set to the coverage of the D
m
space of the measured
images. Here we will describe an approach which relies on
the k-nearest neighbors algorithm (k-NN).Ak-NN algorithm
is used to find a number kof data points in the training set that
are the closest to a given observed data point, according to a
metric measure. In this case the Euclidean distance has been
used. We expect that if we compare the distance from the
training samples to a measured image and to the test set
samples, the former will be much larger of the latter in the
case where the measurement is not properly described by the
training data. This expectation is justified by the fact that we
know that the network performs well on the test data and
therefore they can be used as reference. Distance based
methods are often used in the framework of outlier and
novelty detection and similar methods are presented for
example in [40,41].
8.2. Refining the priors
An application to our study is shown in figure 7, where we
have compared two training sets obtained with two different
models. The difference in the models is in the prior dis-
tribution of the plasma profiles: in one case, labeled as W/O,
the temperature profiles were left unconstrained in the region
of the last closed flux surface (LCFS), being allowed to
assume any value between 0 and 10 keV; in the other case,
labeled as W., the profiles were constrained to assume low
values in the LCFS region (0.1 keV+−0.5 keV)at ρ
eff
=
0.99), a feature that is typically expected in such plasma
profiles. Such a constraint enters the Minerva model as a so
called virtual observation: at the level of the Minerva graph,
this corresponds to a standard observed node connected to the
profiles which states that the value of the profiles at the given
position x
p
has been ‘virtually measured’to have value v
p
with error ò
p
, as shown in figure 6. The only difference with
the other observed nodes in the graph is that it does not
correspond to an actual measurement. Its role is to constrain
the profile shape, and this is what observations in Bayesian
models do: they constrain the solution found for the free
parameters. In figure 6, the dashed arrow and box represent
the connection to the rest of the model of figure 3. The
computation node on the left of the dashed one represents the
evaluation of the T
e
profile at the ρ
eff
=1.0 position, which
corresponds to the LCFS. This node is finally connected to
the virtual observation, the blue circle, whose normal dis-
tribution is specified in term of its mean and standard
deviation. If we exclude the graph represented in this figure
Figure 6. A sketch of the virtual observation constraint applied to the
T
e
profile. The blue ellipse represents the so called virtual
observation, which states that the T
e
profile has been ‘observed’to
have value 0.1 keV with standard deviation 0.05 keV, at the
LCFS (ρ=1).
9
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
from the rest of the model, and we then calculate the covar-
iance matrix of the posterior distribution of the plasma profile,
in this case T
e
, given the virtual observation, we will find a
covariance matrix which can be used later on to sample
profiles which will feature the desired constraint. This has
been done in order to obtain more realistic plasma profiles,
and the effect of this on the training set is described in the
next paragraph. Specifically, the virtual observation was
implemented as a normal distribution with mean value of
0.1 keV and standard deviation of 0.05 keV at ρ
eff
=0.99 for
both ion and electron temperature profiles.
The importance of sampling a realistic set of plasma
profiles, which corresponds to a realistic set of synthetic
measurements, is depicted in figure 7. The three plots on the
top show a spectrum measured along three different lines of
sight (blue line), and the 10 nearest neighbors (10-NN, gray
lines)found among the samples in a training set where the
electron temperature profiles were sampled without constraint
(W/O)on the value assumed on any of the flux surfaces. The
only constraint was a smoothness criterion induced by the GP
prior. From left to right, the lines of sight of the plots are
traversing the following regions of the machine: edge, half-
way to the core and core. The intensity on the y-axis is nor-
malized to the brightest pixel in the image, in this case a w
spectral line along one of the central lines of sight. The three
plots on the bottom, instead, show the same measured image
and the 10-NN neighbors found among the samples in the
training set with constraint (W.). The effect of such constraint
on the sampled profiles is shown in figure 8, where 100
samples from each of the two training sets are drawn. It is
evident that when the constraint is applied (bottom plot), the
average electron temperature through the machine is lower.
This brings the training samples closer to the measured data in
two ways: (1)the intensities measured along the edge lines of
sight (see leftmost bottom plot in figure 7)show a smaller
signal to noise ratio, (2)the ratio between different emission
lines is smaller, see middle and rightmost plots in figure 7.
The distance of a measured data point from the 100 000
nearest neighbors in the training set can be compared to the
distance of 10 test set samples from 100 000 training set
samples to get an estimation of the proximity of the measured
data with respect to the training samples. These values are
shown in table 2in the two cases of training set created with
and without constraint. The distance for the measured data
point is substantially reduced when the constraint is applied to
the electron temperature profile prior distribution, getting
closer to the value of the test set samples. It is worth to note
Figure 7. The k-NN algorithm is applied to find the 10 nearest neighbors of a measured image (blue line)among the training samples in two
different training sets: (1)the T
e
profiles are let vary without (W/O)constraint on the values they assume at the edge of the machine (three
top plots),(2)the T
e
profiles are sampled with (W.)constraint to low values at the edge, as described in the text (three bottom plots). The
constraint is applied to the prior probability distributions. Each column shows spectra integrated along a different line of sight. From left to
right: a line-of-sight through the edge plasma, one crossing the mid-plane half-way to the core, and one core line of sight.
Figure 8. The T
e
profile samples from the two training sets. Top, the
profiles are sampled with (W.)constraints. Bottom, the profiles are
sampled without constraints (W/O).
10
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
that, even with the constraint, the sampled profiles are not
necessarily monotonically decreasing, as shown in figure 8.
The improvement on the prediction capabilities of the
NN when applied to the measured image is remarkable and it
is shown in figure 9. The blue line in the plot denotes the
mean value of 800 000 samples of ion temperature profiles
drawn from the posterior distribution inferred within Minerva,
and the corresponding standard deviation. The agreement
between the NN prediction and the full Bayesian inference
result is visibly better when the constraint is applied.
9. Results
A NN with architecture described in section 7and illustrated
in figure 5has been evaluated on data from plasma shots of
the first operational campaign at W7-X.
The prior distributions of the free parameters of the
model used for the creation of the training set were all normal
distribution functions with lower truncation at 0.0 keV. The
T
e
profile prior distribution had also upper truncation at
10 keV, whereas the other profiles had none. The values of
the parameters of the GP squared exponential function
defined in equation (5.2)were set to σ
f
=2.0, σ
x
=0.3, and
10
y
f
3
s
s=-for the T
i
profile and to σ
f
=5.0, σ
x
=0.3, and
10
y
f
3
s
s=-for the T
e
profile. The constraint discussed in
section 8was applied to the temperature profiles but not to the
density profiles. The magnetic configuration was kept fixed
during the sampling procedure and the NN was tested on data
from shots having such configuration. A comparison between
the standard Bayesian inference carried out within the Minerva
framework and the NN inversion is shown in figure 10, for both
ion and electron temperature profiles. The different plots in the
figure refer to data from different shots and time points within a
shot. In general, NN central prediction and full model central
prediction, shown with the solid lines, are reasonably close to
each other. In two cases the mismatch is especially pronounced:
these are the cases of T
e
uncertainties are taken into account, however, the agreement is
still good. Providing uncertainties together with the NN central
prediction is, indeed, an important point, especially when a
comparison is made. We will now give an overview of how the
uncertainties have been calculated; the readers that are inter-
ested in understanding the details can look at the work pre-
sented in [39]. The predicted profiles have been obtained from a
committee of networks: a set of 10 NNs with same architecture
but different weight initialization has been trained on the same
training set. The error bars are calculated in a Bayesian fashion:
the training procedure is seen as an inference problem on the
network’s weights, which gives as result an approximated
Gaussian posterior distribution of the weights, centered on the
network’s weight vector found with the training process and
whose standard deviation depends on the Hessian matrix of the
loss function. The spread in the posterior produces then a spread
in the network’s prediction, and this is the source of the net-
work’s error bars. This procedure has been applied to the 10
networks in the committee. The committee prediction is then
obtained by sampling a random member network, sampling a
set of weights from the corresponding weight’sposteriorand
then feeding the network with a sample of the input vector
drawn from the XICS noise model. This corresponds to
approximating the overall weight’s posterior with a Gaussian
mixture, where each Gaussian of the mixture is obtained with
the single Gaussian approximation of each committee member,
and it is centered on the weight vectors found with the different
starting values [31]. This is necessary because, in principle, the
single Gaussian approximation is valid only around the solution
found with the training, and different starting values will lead to
different solutions, i.e. different local minima of the optim-
ization problem. In this way, it is then possible to take into
account this fact and put together each committee member
predictive distribution. The error bars shown in figure 10 also
include uncertainties in the XICS measured data.
The full Bayesian model prediction is obtained as the
average of 800 000 samples drawn from the posterior dis-
tribution of the corresponding free parameter, obtained with a
MCMC sampler, and the error bars are obtained as the stan-
dard deviation of the samples.
It is important to note that the training set has been
generated sampling with the LCFS constraint described in
Table 2. The average distance from the measured data point to the
100 000 nearest neighbors in the training set is compared to the
average distance of 10 test set samples from the corresponding
100 000 nearest neighbors in the training set, in the cases where the
training set is created with (W)and without (W/O)constraint on the
T
e
profile prior distribution.
Test set samples Measured data
W/O constraint 47.8 1186.1
W constraint 63.5 198.4
Figure 9. The standard Minerva Bayesian inference of a T
i
profile is
compared to the neural network inversion in two training cases. The
orange and gray lines represent the output of the neural network
trained on the training sets where the T
e
profiles are sampled with
(W)and without (W/O)constraints, respectively.
11
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
section 8: as a consequence, the profiles predicted by the
networks show small variance towards ρ=1. This is the
reason why the uncertainties in the NN output of figure 10 are
systematically lower for larger values of ρ
eff
.
One of the main advantages in using NNs for data ana-
lysis is the speed-up that they provide. This is, indeed, sub-
stantial: the evaluation of one single data point takes ∼10 μs
on a single CPU. The inference with MCMC sampling takes a
few hours in similar conditions, thus the speed-up is of 10
9
orders of magnitude.
10. Conclusions
We have shown a Bayesian model oriented approach to NN
training for the inference of ion and electron temperature
profiles from data measured with an x-ray imaging diagnostic
at W7-X. The model implemented within the Minerva fra-
mework is used to generate the training set, sampling from the
prior distribution of the free parameters and from the like-
lihood function of the simulated data, i.e. from the joint dis-
tribution of the model. Since the joint distribution summarizes
all the relevant properties of the Bayesian model, the trained
network can be thought of as an approximation of the full
Bayesian model. What makes this approach uncommon is the
fact that the network is trained on a problem for which a
model, and therefore a solution, already exists. The fact that
we make use of a model to generate the training data gives us
control over the features that we introduce in the training set,
both from a point of view of the statistics and the physics.
These features are completely determined by the joint dis-
tribution of the Bayesian model.
Since we can manipulate the prior probability distribution
functions, we can also knowingly choose the model that better
describes the measurements we expect to perform. This gives
us the possibility to have control on the training of the net-
work from the point of view of the physics parameters that,
through the forward model, generate the expected observa-
tions. This is precisely what allows us to find a better training
set for more accurate NN predictions.
The NN has been tested on measurements from different
plasma shots from the first operational campaign at W7-X and
compared with the results of the standard Bayesian inference.
The first major advantage of this approach is the speed-up of
the analysis, which can be carried out in tens of micro-
seconds. The second advantage is that the sampling procedure
for the creation of the training data only requires generic, not
diagnostic-specific features of the Minerva model and thus is
Figure 10. The results of the NN inversion compared to what obtained with standard Bayesian inference for different plasma shots. The left
and right columns show T
e
and T
i
profiles respectively.
12
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
in principle automatically applicable to any other diagnostic
developed within the same framework.
Acknowledgments
This work has been carried out within the framework of the
EUROfusion Consortium and has received funding from the
Euratom research and training programme 2014–2018 and
2019–2020 under grant agreement No. 633053. The views
and opinions expressed herein do not necessarily reflect those
of the European Commission.
ORCID iDs
A Pavone https://orcid.org/0000-0003-2398-966X
A Langenberg https://orcid.org/0000-0002-2107-5488
R C Wolf https://orcid.org/0000-0002-2606-5289
References
[1]Cybenko G 1989 Degree of approximation by superpositions
of a sigmoidal function Math. Control, Signals, Syst. 2
303–14
[2]Hornik K 1991 Approximation capabilities of multilayer
feedforward networks Neural Netw. 4251–7
[3]Sonoda S and Murata N 2017 Neural Network with
Unboundedactivation Functions is Universal Approximator
https://arxiv.org/abs/1505.03654
[4]Svensson J et al 1998 Real-time ion temperature profiles in the
JET nuclear fusion experiment ICANN 98. Perspectives in
Neural Computing (London: Springer London)(https://doi.
org/10.1007/978-1-4471-1599-1_30)
[5]Cannas B, Fanni A, Sonato P and Zedda M K 2007 A
prediction tool for real-time application in the disruption
protection system at JET Nucl. Fusion 47 1559–69
[6]Wang S Y et al 2016 Prediction of density limit disruptions on
the J-TEXT tokamak Plasma Phys. Control. Fusion 58
055014
[7]Pautasso G and Tichmann C 2002 On-line prediction and
mitigation of disruptions in ASDEX Upgrade Nucl. Fusion
42 100–8
[8]Bishop C M, Roach C M and von Hellermann M G 1993
Automatic analysis of JET charge exchange spectra using
neural networks Plasma Phys. Control. Fusion 35 765–73
[9]Svensson J, von Hellermann M and König R W T 1999
Analysis of JET charge exchange spectra using neural
networks Plasma Phys. Control. Fusion 41 315–38
[10]Clayton D J et al 2013 Electron temperature profile
reconstructions from multi-energy SXR measurements using
neural networks Plasma Phys. Control. Fusion 55 095015
[11]Svensson J and Werner A 2007 Large scale bayesian data
analysis for nuclear fusion experiments IEEE Int. Symp. on
Intelligent Signal Processing pp 1–6
[12]Langenberg A, Svensson J, Thomsen H, Marchuk O,
Pablant N A, Burhenn R and Wolf R C 2016 Forward
modeling of x-ray imaging crystal spectrometers within the
Minerva Bayesian analysis framework Fusion Sci. Technol.
69 560–7
[13]Wolf R C et al 2017 Major results from the first plasma
campaign of the Wendelstein 7-X stellarator Nucl. Fusion 57
102020
[14]König R et al 2015 The set of diagnostics for the first operation
campaign of the Wendelstein 7-X stellarator J. Instrum. 10
P10002
[15]Krychowiak M et al 2016 Overview of diagnostic performance
and results for the first operation phase in Wendelstein 7-X
(invited)Rev. Sci. Instrum. 87 11D304
[16]Bozhenkov S A et al 2017 The thomson scattering diagnostic
at wendelstein 7-X and its performance in the first operation
phase J. Instrum. 12 P10004
[17]Hoefel U et al 2019 Bayesian modelling of microwave
radiometer calibration on the example of the wendelstein
7-X electron cyclotron emission diagnostic Rev. Sci.
Instrum. 90 043502
[18]Bitter M et al 2010 Objectives and layout of a high-resolution
x-ray imaging crystal spectrometer for the large helical
device Rev. Sci. Instrum. 81 1–5
[19]Marchuk O, Wolf R and Kunze H J 2004 Modeling of He-like
spectra measured at the tokamaks TEXTOR and TORE
SUPRA PhD Thesis Ruhr-Universität Bochum
[20]Group T F R et al 1985 Dielectronic satellite spectrum of
Helium-like Argon: a contribution to the physics of highly
charged ions and plasma impurity transport Phys. Rev. A32
2374–83
[21]Vainshtein L A and Safronova U I 1978 Wavelengths and
transition probabilities of satellites to resonance lines of H-
and He-like ions At. Data Nucl. Data Tables 21 49–68
[22]Sivia D and Skilling J 2006 Data Analysis: A Bayesian
Tutorial (Oxford: Oxford University Press)
[23]MacKay D J C 1991 Bayesian Methods for Adaptive Models
PhD Thesis California Institute of Technology
[24]Langenberg A et al 2018 Inference of temperature and density
profiles via forward modeling of an x-ray imaging crystal
spectrometer within the minerva bayesian analysis
framework Rev. Sci. Instrum. Submitted
[25]Pearl J 1986 Fusion, propagation, and structuring in belief
networks Artif. Intell. 29 241–88
[26]Hirshman S P and Whitson J C 1983 Steepest-descent moment
method for three-dimensional magnetohydrodynamic
equilibria Phys. Fluids 26 3553–68
[27]Rasmussen C E and Williams C K I 2006 Gaussian Processes
for Machine Learning (Cambridge, MA: MIT Press)
[28]Langenberg A et al 2017 Argon Impurity Transport Studies at
Wendelstein 7-X using x-ray Imaging Spectrometer
Measurements Nucl. Fusion 57 086013
[29]Kauderer-Abrams E 2017 Quantifying Translation-
Invariancein Convolutional Neural Networks https://arxiv.
org/abs/1801.01450
[30]Wong S C, Gatt A, Stamatescu V and McDonnell M D 2016
Understanding data augmentation for classification: when to
warp? CoRR arXiv:abs/1609.08764
[31]Bishop C M 1995 Neural Networks for Pattern Recognition
(Oxford: Oxford University Press)
[32]Lecun Y, Bottou L, Bengio Y and Haffner P 1998 Gradient-
based learning applied to document recognition Proc. IEEE
86 2278–324
[33]Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R and
Fei-Fei L 2014 Large-scale video classification with
convolutional neural networks 2014 IEEE Conf. on
Computer Vision and Pattern Recognition pp 1725–32
[34]Krizhevsky A, Sutskever I and Hinton G E 2012 Imagenet
classification with deep convolutional neural networks
Advances in Neural Information Processing Systems 25 ed
F Pereira et al (USA: Curran Associates, Inc.)pp 1097–105
[35]Lecun Y 1989 Generalization and Network Design Strategies
(Amsterdam: Elsevier)
13
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
[36]LeCun Y, Boser B E, Denker J S, Henderson D, Howard R E,
Hubbard W E and Jackel L D 1990 Handwritten digit
recognition with a back-propagation network Advances in
Neural Information Processing Systems 2 ed D S Touretzky
(San Mateo, CA: Morgan Kaufmann Publishers)pp 396–404
[37]Theano Development Team 2016 Theano: A Python
Frameworkfor Fast Computation of Mathematical
Expressions https://arxiv.org/abs/1605.02688
[38]Neal R M 1995 Bayesian learning for neural networks PhD
Thesis University of Toronto
[39]Pavone A et al 2018 Bayesian uncertainty calculation in neural
network inference of ion and electron temperature profiles at
w7-x Rev. Sci. Instrum. 89 10K102
[40]Knorr E M and Ng R T 1998 Algorithms for mining distance-
based outliers in large datasets Proc. 24rd Int. Conf. on Very
Large Data Bases pp 392–403 http://www.vldb.org/conf/
1998/p392.pdf
[41]Dang T T et al 2015 Distance-based k-nearest neighbors outlier
detection method in large-scale traffic data 2015 IEEE Int.
Conf. on Digital Signal Processing (DSP)pp 507–10
14
Plasma Phys. Control. Fusion 61 (2019)075012 A Pavone et al
6.2. Article II
6.2. Article II
A. Pavone et al.
»Bayesian uncertainty calculation in neural network inference of ion
and electron temperature profiles at W7-X«
In: Review of Scientific Instruments 89 (2018)
Synopsis
In this publication, 1we describe how uncertainties of the neural network out-
put can be calculated in a Bayesian framework. The framework is known as
Bayesian neural network (BNN) and consists of an interpretation of the network
model as a Bayesian model and the training problem as an inference problem.
Under the so called Laplace approximation, it is possible to derive an analytical
expression of the error bars dependent on the Hessian matrix of the training
loss function. We apply this calculation to the case of the inference of ion and
electron temperature profiles at W7-X, accounting also for noise in the input
data and the presence of different local minima found by training, by using a
Monte Carlo scheme of sampling both in input and weight space of multiple
networks.
1
Reproduced from A. Pavone et al. »Bayesian uncertainty calculation in neural network
inference of ion and electron temperature profiles at W7-X«. In: Review of Scientific Instru-
ments
89
, 10K102 (2018);
https://doi.org/10.1063/1.5039286
, with the permission
of AIP Publishing.
75
REVIEW OF SCIENTIFIC INSTRUMENTS 89, 10K102 (2018)
Bayesian uncertainty calculation in neural network inference
of ion and electron temperature profiles at W7-X
A. Pavone,1J. Svensson,1A. Langenberg,1N. Pablant,2U. Hoefel,1S. Kwak,1R. C. Wolf,1
and Wendelstein 7-X Team1,a)
1Max-Planck-Institute for Plasma Physics, Greifswald 17491, Germany
2Princenton Plasma Physics Lab, Princeton, New Jersey 08543, USA
(Presented 16 April 2018; received 7 May 2018; accepted 16 July 2018;
published online 6 August 2018)
We make use of a Bayesian description of the neural network (NN) training for the calculation of
the uncertainties in the NN prediction. Having uncertainties on the NN prediction allows having
a quantitative measure for trusting the NN outcome and comparing it with other methods. Within
the Bayesian framework, the uncertainties can be calculated under different approximations. The
NN has been trained with the purpose of inferring ion and electron temperature profile from mea-
surements of a X-ray imaging diagnostic at W7-X. The NN has been trained in such a way that it
constitutes an approximation of a full Bayesian model of the diagnostic, implemented within the
Minerva framework. The network has been evaluated using measured data and the uncertainties cal-
culated under different approximations have been compared with each other, finding that neglecting
the noise on the NN input can lead to an underestimation of the error bar magnitude in the range of
10%–30%. https://doi.org/10.1063/1.5039286
I. INTRODUCTION
In nuclear fusion research, neural networks (NNs) have
been used for tasks such as prediction of disruption events
from plasma parameters1and for diagnostic data analysis.2
A special effort is often put in the development of real time
systems.3In most of the applications, the output of the NN
models is single “best guess” predictions, obtained with val-
ues of the adaptable parameters found minimizing a given
cost function. We believe that, in order to have trust-worthy
outcomes, uncertainty should be taken into account and deliv-
ered as part of the final predictions. In this paper, we will
describe and make use of a Bayesian framework for the treat-
ment of uncertainties, where the neural network model is seen
as a Bayesian model and the training procedure is seen as an
inference problem.4,5Applications of such a framework are
scarcely encountered, although it posits a principled picture of
neural network modeling. Its implementation relies on the cal-
culation of the second derivative of the neural network’s cost
function with respect to the network weights, i.e., the Hessian
matrix. This is an operation that scales with the square of the
number of weights, i.e., as O(W2), where Wis the number
of weights. It is therefore a computational expensive calcu-
lation. However, the Hessian matrix needs to be calculated
only once per training, as it is fixed at evaluation time, when
the network is evaluated on the measurements. In Sec. II, we
will illustrate the salient points of the Bayesian NN training
from a theoretical point of view, describing three different pro-
cedures for the calculation of the uncertainties: the first one
is derived neglecting noise in the NN input, the second one
Note: Paper published as part of the Proceedings of the 22nd Topical Confer-
ence on High-Temperature Plasma Diagnostics, San Diego, California, April
2018.
a)See the authors list in T. S. Pedersen et al., Nat. Commun. 7, 13493 (2016).
accounting for it, and the third one making use of a sampling
scheme based on a non-linear multi-Gaussian approximation.
In Sec. III, we will describe the specific application of the
method to X-ray imaging crystal spectrometer (XICS) diag-
nostic data at W7-X, where the NN has been trained for the
inference of electron and ion temperature profiles from XICS
measurements.InSec.IV,wewillcomparethetwoprocedures
where we either do take or do not take into account the noise in
the NN input, and we will show a single illustrative example
of uncertainty calculation with the multi-Gaussian sampling
procedure.
II. BAYESIAN NEURAL NETWORKS
We shall describe now the salient points of the Bayesian
perspective on NN training which will allow us to calculate
uncertaintiesintheprediction.Thenotationusedhereismostly
taken from Ref. 5. The neural network is conceived here as
a function f, which maps a generally multidimensional input
vector xto a generally multidimensional output vector y. The
function fis also parametrized with a set of free parameters
or weights w, whose values are adapted or learned during the
training procedure, so that it can be written that y=f(x,w). In
the specific case of this study, the input vector xwould be an
XICS measurement and ywould be either an electron or ion
temperature profile, Teor Ti, respectively. In the analytical
treatment that follows, we shall assume a one dimensional
output yfor the sake of clearer notation. The generalization to
the multi-dimension output is straightforward. According to
thetraditionalview, the NN trainingisthe procedure employed
to find a set of weight values wMP that minimizes a given cost
or loss function L(w). In regression problems such a function
is often chosen to be the sum-of-square error between the NN’s
output yand the target training data t,
0034-6748/2018/89(10)/10K102/5/$30.00 89, 10K102-1
10K102-2 Pavone et al. Rev. Sci. Instrum. 89, 10K102 (2018)
L(w)=
N
X
n=1
(yn−tn)2+ν(w), (1)
where N is the number of training samples, nis an index
labelling the nth training sample, and ν(w) is a regularizing
term which constrains the weight values to small values. The
regularization allows finding a NN function which is smooth
in 4so that the generalization capabilities of the network are
enhanced.5Theset of weightvalues foundis then usedto make
predictions at evaluation time. Using this approach, the out-
come of the NN function is a single value estimate, given by
f(x,wMP).
In the Bayesian framework of neural network training,
the NN model is conceived as a Bayesian model, where the
weights ware the free parameters and the target data of the
training tnare the observed data. According to Bayesian infer-
ence rules, a prior distribution P(w) is assigned to the network
weights before training, and a likelihood function P(D|w) is
assigned to the observed data, where D≡(t1,. . .,tN) denotes
the target data from the training set. The training procedure is
then an inference process on the network’s weights. We can
then write the Bayes formula to express the posterior distri-
bution of the weights P(w|D) in terms of the prior and the
likelihood function,
P(w|D)=
P(D|w)P(w)
P(D), (2)
where P(D) is a normalization factor, independent of the
weights, also known as the evidence. We have omitted the
conditioning on the training input data X≡(x1,. . .,xn) in all
the terms, for the sake of simpler notation. The full outcome
of the training, from the Bayesian point of view, is then not
only a single set of values of the network’s weights but also
the entire posterior distribution P(w|D). At evaluation time,
the spread of the distribution will then correspond to a distri-
bution of the output, the predictive distribution. We shall see
how, under certain assumption and approximations, we can get
to an expression for the predictive error bars.
The first step in the application of such a method is
the choice of the prior distribution P(w) and the likelihood
function P(D|w). We shall assume for both of them Normal
distributions. The reason behind this choice is that it allows
making the analytical progress required to derive a mathemat-
ical expression for the error bars of the network’s output. In
this way, we will also recover results very well known and
established under the traditional view of NN training. In the
general case of a multi-layer neural network, we shall choose
a prior of the form
P(w)∝exp*
,−1
2X
k
αkkwk2
k
+
-, (3)
where αk≡1/σ2
kwith σ2
kdenoting the variance of the distribu-
tion for the weights at the neural network’s layer k. Concerning
the likelihood function P(D|w), we shall use an expression of
the form
P(D|w)∝exp*
,−β
2
N
X
n=1
(yn−tn)2+
-, (4)
where β≡1/σ2
Dwith σ2
Ddenoting the variance of the noise
in the training target data, i.e., the spread of the distribution of
the target variables, for a given, fixed input vector. We can now
use the Bayes formula to find an expression for the posterior
distribution of theweights. If we areinterested in a singlevalue
solution, we can look for the weight values that maximize
the posterior. This is equivalent to minimizing the negative
logarithm of Eq. (2), ln(P(w|D)) ≡ −S(w), which, substituting
the expressions in Eqs. (3) and (4), can be written as
S(w)=
β
2
N
X
n=1
(yn−tn)2+1
2
L
X
k=1
Nk
X
i=1
αkw2
k,i, (5)
where Lis the number of layers in the network, iis an index
labellingtheweightsatlayerk,andNkisthenumberofweights
atlayerk. Notice that wehaveomitted terms that donotdepend
onw,specificallynocontributionfromtheevidencetermofthe
Bayes formula appears in this equation, since they would not
have any effect in the minimization of S(w) with respect to the
weights. The expression in Eq. (5) resembles closely the one in
Eq. (1). Indeed, this is how the Bayesian point of view and the
traditional one come together. The first term on the right-hand
side of Eq. (1) comes into Eq. (5) as the choice of the Gaussian
noise model on the target training data, while the second one,
the regularizing term, appears here as a consequence of the
Gaussian prior on the network weights. In particular, we notice
that the particular choice of the squared norm of the weight
vector has led to a regularizing term well known in the neural
network field as L2 regularization or weight decay: it has the
effect of constraining the weights to small values with the
consequence of improving the generalization of the network
mapping, as described in Ref. 5.
An analytical expression for the full posterior P(w|D)
can be found taking a Gaussian approximation of it around
wMP,4where wMP is a set of weight values found minimiz-
ing Eq. (5). This approximation is also known as the Laplace
approximation, and it leads to
P(w|D)≈exp −S(wMP)−1
2∆wTA∆w!, (6)
where ∆w=w−wMP and A=∇∇ SMP =β∇∇EMP +PkαkI
is the Hessian matrix of the error function in Eq. (5), calcu-
lated with respect to the weights and evaluated at wMP, with
EMP =1
2PN
n=1(yn−tn)2being the sum-of-square errors term
evaluated at wMP and Ibeing the identity matrix. We see there-
fore that Ahas two contributions, the first one coming from
the choice of the likelihood function, controlled by the param-
eter β, and the second one coming from the choice of the prior
distribution of the weights, controlled by the parameters αk.
This allows us now to calculate the distribution of the net-
work outputs, when a new, unseen, input vector xis provided
to the trained network, at evaluation time. It is obtained by
marginalization over the network’s weights,
P(t|x,D)=P(t|x,w)P(w|D)dw, (7)
where we have now explicitly included in the notation the
dependence on the new input vector x. The distribution
P(t|x,w), which is evaluated at a fixed value of the weight
10K102-3 Pavone et al. Rev. Sci. Instrum. 89, 10K102 (2018)
vector, is given by the noise model on the target data, as in
Eq.(4).Aftersomemanipulation, wegettothefinalexpression
P(t|x,D)=
1
(2πσ02
t)1/2exp −(t−yMP)2
2σ02
t!, (8)
where
σ02
t=
1
β+gTA−1g, (9)
where g≡ ∇wy|wMP . The distribution of the network’s output is
then given by a Gaussian distribution, centered at the network
prediction obtained with weights wMP and with standard devi-
ation given by Eq. (9). The contribution to the predictive error
has two components: one arising from the noise on the target
data, controlled by β, and the other arising from the posterior
width, controlled by A. Equation (9) corresponds to the first
procedure to calculate uncertainties.
So far, we have neglected uncertainties in the neural net-
work input. This is of course not ideal when the input is a
measured quantity with noise, as it is in our application. It can
be shown6that an expression for the predictive error, which
includes noise in the input, is given by
σ2
t=σ02
t+σ2
xhTh, (10)
where h≡ ∇xy|xv,xvis the input vector, and σ2
xis the variance
of the noise of the input vector, here assumed to be Gaussian.
Equation(10)correspondsto thesecond proceduretocalculate
uncertainties. Three main assumptions that have been done to
get to Eq. (10): the posterior distribution of the weights has
been approximated with a Gaussian distribution around wMP,
the network function y(x;w) has been approximated by its lin-
ear expansion around wMP and xvin the calculation of σ0
tand
σt, respectively. Moreover, the Laplace approximation of the
weight’s posterior is only valid around wMP. However, several
minima of the cost function are likely to exist and they can be
found training the network with different initial values of the
weights. The single-Gaussian approximation so far described
does not take them into account. In order to account for them,
it is possible to approximate the posterior of the weights by
a sum of Gaussians, each one centered on each of the min-
ima.5This can be accomplished by training a committee of
networks, where each member is trained with different initial-
ization values, and carrying out the Laplace approximation of
the posterior for each of them. The overall posterior is then
given by
P(w|D)=X
i
P(w|mi,D)P(mi|D), (11)
where P(mi|D) is the a priori distribution of the minima miand
P(w|mi,D) is the posterior distribution of the weights corre-
sponding to the local minima mi, which can be approximated
with the Laplace approximation. The predictive distribution
can still be written as in Eq. (7), where now the second term
on the right-hand side is obtained from Eq. (11). Assuming
P(mi|D) to be uniform, we can obtain the uncertainties for a
multi-Gaussian approximation of the posterior distribution in
the following way: (i) we train a number of NNs with different
weight initialization, corresponding to the NN functions fi,
(ii) for each of them, we calculate the posterior of the weights
under the Laplace approximation, (iii) we obtain samples from
the predictive distribution by randomly choosing one member
of the committee, say member i, then, sampling a set of weight
values, wi
MP, and an input vector x∗, from the weight posterior
and the input noise model, respectively, and calculating the
corresponding NN output: yi=fi(wi
MP,x∗). The whole proce-
dure is repeated a number of times equal to the desired number
of samples. The advantage of this sampling procedure to the
estimation of the uncertainties is that it does not make use of
the assumption of linearity of the NN function around wMP
and the input vector x. It is therefore more accurate. How-
ever, it requires large computational time, and it is therefore
not suitable in those applications where the execution time is
a concern. This is our third procedure for calculating uncer-
tainties, and we will refer to it as the multi-Gaussian sampling
scheme.
III. APPLICATION TO XICS DIAGNOSTIC
DATA AT W7-X
A. The XICS diagnostic at W7-X
The XICS diagnostic at W7-X is equipped with a spher-
ical bent crystal to image X-ray emission of Ar impurities.
The emission is then collected on a CCD detector. The diag-
nostic layout and initial measurements during the first oper-
ational phase at W7-X have been described in Refs. 7–11.
The collected images have spatial resolution along the verti-
cal dimension, corresponding to different lines of sight, and
wavelength resolution along the horizontal one. The wave-
length range is 3.94–4.0 Å for He-like Ar spectra. From
the measured data, it is then possible to reconstruct ion and
electron temperature profiles. The ion temperature affects
the Doppler broadening of the spectral lines, whereas the
electron temperature affects the relative intensities. Given
the electron density profile ne, the impurity density profiles
can be obtained.8,12 A forward model of the diagnostic7has
been developed within the Minerva Bayesian modeling frame-
work,13 and it is used for the inference of the plasma profiles of
interest.
B. Neural networks as approximate Bayesian models
In the XICS Bayesian model, a prior distribution is
assigned to the free parameters, which are temperature, elec-
tron and impurity density profiles, and a likelihood func-
tion is assigned to the observed quantities. A neural network
has been trained with the goal to approximate the full model
Bayesian inference. The training scheme is described in detail
in Ref. 14. The training set is obtained sampling from the joint
distribution of the model P(T,I) = P(I|T)P(T): a set of free
parameters is sampled from the prior distribution P(T), and
subsequently synthetic data are sampled from the likelihood
function P(I|T). The distribution P(I|T) represents the noise
model on the XICS measurements, which is given by a Gaus-
sian distribution with mean and variance given by the forward
model predicted photon counts. When sampling from the pri-
ors, all ne,Te,Ti, and impurity density profiles were free to
vary, but only the Tiand Teprofiles were used as the target
of the network’s training. The set of sampled synthetic images
constitutes the network’s input during training. Note that such
a training set is made exclusively of data synthesized with the
10K102-4 Pavone et al. Rev. Sci. Instrum. 89, 10K102 (2018)
Bayesian model. The profiles are expressed with respect to
the effective radius, defined as ρeff =pψ/ψLCFS, where ψis
the magnetic flux and ψLCFS is the flux at the last closed flux
surface. It is worth to emphasize that, because of training on a
given fixed model, any systematic deviation introduced by the
specificchoiceofthemodelwouldbereflectedinthenetwork’s
inversions.
IV. RESULTS
Two convolutional neural networks14,15 (CNN), each one
with two convolutional layers C1 and C2, followed by one
hidden fully connected layer M1 and the output layer M2,
have been trained on the inference of Tiand Teprofiles,
respectively. The training has been carried out in the Bayesian
scheme described in Sec. II. The values of β= 10 and
αk= (αC1 = 68.00, αC2 = 58.33, αM1 = 576.67, αM2 = 5.83)
were used. The Hessian matrix Ahas been calculated in
the diagonal approximation. The error bars calculated with
Eqs. (9) and (10) have been compared with each other, and
the results are shown in Figs. 1and 2. In Fig. 1, the average
relative uncertainty hσrelicalculated accounting and without
accounting for the noise in the input, according to hσt,reli
≡ hσt(ρeff)/T(ρeff)iand hσ0
t,reli≡hσ0
t(ρeff)/T(ρeff)i, respec-
tively, is shown for each spatial location of both profiles. The
average has been calculated from data collected across 15
plasma shots of the first operational campaign at W7-X. In the
case of Teprofiles, it is found that hσt,reli ≈ 0.2 for ρeff <0.4,
and 0.3 <hσt,reli<0.4 for ρeff >0.5. In the case of Tipro-
files, instead, the quantity hσt,relishows less variation across
the different locations, assuming mostly values approximately
equalto0.1.Thedifferencebetweenhσt,reliand hσ0
t,relireflects
what also emerges from Fig. 2and that we will comment in
the next paragraph: the two quantities mostly diverge at the
positions corresponding to the plasma core and toward the
edge. In Fig. 2, the distribution of the contribution of the input
noise term relative to the total error bar magnitude, calcu-
lated as ∆σrel(ρeff)≡(σt(ρeff)−σ0
t(ρeff))/σt(ρeff), for each
spatial location of the Teand Tiprofiles is shown. The distri-
bution has been computed from the same data used for Fig. 1.
The orange line connects the mean of each distribution. In the
case of the Teprofile (left), it is found that for a significant
proportion of the analyzed data the input noise contribution
accounts for more than 10%, and mostly less than 20%, of
the total error bar magnitude, especially in the positions cor-
responding to the core and toward the edge, ρeff <0.2 and
ρeff >0.7. In the case of the Tiprofiles, instead, it is found
in a significant number of cases that the same noise source
accounts for more than 20% of the total error, again mainly in
the positions corresponding to the core and toward the edge of
the plasma, ρeff <0.2 and ρeff >0.6. The input uncertainties
are, therefore, in general not negligible. The purpose of Fig. 3
is to illustrate the result of the sampling procedure described at
the end of Sec. II. The network’s inversion has been applied on
a single measured data point for illustrative purposes. The net-
work’s input has been obtained averaging over a 500 ms range.
The bundle of gray lines is made of 1000 samples obtained
sampling from the multi-Gaussian approximation of the net-
work’s weights, accounting also for the noise in the input. The
orange line shows the network’s prediction and corresponding
error bar calculated with the single-Gaussian approximation,
Eq. (10). The error bars show a 2σtdeviation. The sampling
scheme based on the multi-Gaussian approximation is a more
accurate method to calculate uncertainties, which comes at the
price of larger computational time: it is therefore suitable in
NN applications where execution time is not a concern.
FIG. 1. The average value of the rela-
tive uncertainty calculated with (orange
bars) and without (green bars) input
noise contribution for both Te(left) and
Ti(right) profiles, as found from data
collected across different experiments.
FIG. 2. The distribution of the contri-
bution of the input noise term relative
to the total error bar magnitude calcu-
latedacross the datapointfromdifferent
plasma shots for each spatial location
in the Te(left) and Ti(right) profiles.
The orange line connects the mean of
the distribution at each position.
10K102-5 Pavone et al. Rev. Sci. Instrum. 89, 10K102 (2018)
FIG. 3. The neural network prediction
and uncertainties calculated in the case
of the multi-Gaussian sampling pro-
cedure (gray lines) and the single-
Gaussian approximation (orange line),
for both Te(left) and Ti(right) profiles.
ACKNOWLEDGMENTS
This work has been carried out within the framework
of the EUROfusion Consortium and has received funding
fromtheEuratom researchandtrainingprogramme2014-2018
under Grant Agreement No. 633053. The views and opin-
ions expressed herein do not necessarily reflect those of the
European Commission.
1B. Cannas et al.,Nucl. Fusion 47, 1559 (2007).
2J. Svensson et al.,Plasma Phys. Controlled Fusion 41, 315 (1999).
3J. Svensson et al., in Perspectives in Neural Computing (ICANN, 98).
4D. J. C. Mackay, “Bayesian methods for adaptive models,” Ph.D. thesis,
California Institute of Technology, 1991.
5C. Bishop, Neural Networks for Pattern Recognition (Oxford University
Press, 1995).
6A. Wright, IEEE Transactions On Neural Networks 10, 1261 (1999).
7A. Langenberg et al.,Fusion Sci. Technol. 69, 560 (2016).
8A. Langenberg et al.,Nucl. Fusion 57, 086013 (2017).
9R. K¨
onig et al.,J. Instrum. 10, P10002 (2015).
10M. Krychowiak et al.,Rev. Sci. Instrum. 87, 11D304 (2016).
11N. A. Pablant et al., in 41st EPS Conference on Plasma Physics (EPS,2014),
Vol. 38F.
12A. Langenberg et al., in 43rd European Physical Society Conference on
Plasma Physics (EPS, 2016), Vol. 40A.
13J. Svensson and A. Werner, in IEEE International Symposium on Intelligent
Signal Processing (IEEE, 2007).
14A. Pavone, “Neural network approximation of Bayesian models for the
inference of ion and electron temperature at W7-X” (unpublished).
15Y. LeCun et al.,Proc. IEEE 86, 2278 (1998).
6.3. Article III
6.3. Article III
A. Pavone et al.
»Measurements of visible bremsstrahlung and automatic Bayesian in-
ference of the effective plasma charge Zeff at W7-X«
In: Journal of Instrumentation 14 (2019)
Synopsis
The publication
2
describes the first available measurements of visual bremsstrahlung
and the inference of the plasma effective charge
Zeff
at W7-X. Both the features
of the measurement device and the Bayesian model used in the inference are
described. Especially, Bayesian inference is run automatically after each plasma
shot and it includes the inference of electron temperature and density profiles
from independent measurements of the Thomson scattering diagnostic based
on Gaussian processes.
2
©Max Planck Institute for Plasma Physics. Published by IOP Publishing Ltd on behalf of
Sissa Medialab. Reproduced with permission. All rights reserved.
https://doi.org/10.
1063/1.5039286
81
2019 JINST 14 C10003
Published by IOP Publishing for Sissa Medialab
Received:June 26, 2019
Accepted:September 10, 2019
Published:October 2, 2019
3rd European Conference on Plasma Diagnostics (ECPD2019)
6–10 May 2019
Lisbon, Portugal
Measurements of visible bremsstrahlung and automatic
Bayesian inference of the effective plasma charge Zeff at
W7-X
A. Pavone,a,1U. Hergenhahn,aM. Krychowiak,aU. Hoefel,aS. Kwak,aJ. Svensson,a
P. Kornejew,aV. Winters,bR. Koenig,aM. Hirsch,aK.-J. Brunner,aE. Pasch,aJ. Knauer,a
G. Fuchert,aE.R. Scott,aM. Beurskens,aF. Effenberg,cD. Zhang,aO. Ford,aL. Vanó,a
R.C. Wolfaand the W7-X team
aMax-Planck-Institut für Plasmaphysik, Teilinstitut Greifswald,
Wendelsteinstrasse 1, D-17491 Greifswald, Germany
bUniversity of Wisconsin-Madison,
Madison, WI 53706-1609, U.S.A.
cPrinceton Plasma Physics Laboratory, Princeton University,
Princeton, NJ 08543, U.S.A.
E-mail: [email protected]
1Corresponding author.
c
2019 Max Planck Institute for Plasma Physics. Published by
IOP Publishing Ltd on behalf of Sissa Medialab https://doi.org/10.1088/1748-0221/14/10/C10003
2019 JINST 14 C10003
Abstract: The effective charge Zeff indicates the overall impurity contamination of a plasma. Zeff
can be derived experimentally from the intensity of the plasma bremsstrahlung emission. We
describe here the diagnostic set-ups and the Bayesian modeling allowing the inference of Zeff at
W7-X. First results from the operational campaigns in 2017 and 2018 are shown. Measurements of
the visible plasma radiation along a single line-of-sight traversing the core plasma has been carried
out using a compact USB-spectrometer with a time resolution of 100 ms. A spectral region (627–
641 nm) that is free from line emission is selected for the analysis of the bremsstrahlung emission,
which also depends on electron temperature and density profiles. Electron temperature profiles
are derived from either the electron cyclotron emission or the Thomson scattering diagnostic.
Electron density profiles, however, have their shape information derived from Thomson scattering
measurements and absolute values from single line-of-sight interferometry measurements. The
Minerva framework is used to infer the profiles with Gaussian processes and develop a Bayesian
model of the bremsstrahlung emission to infer line averaged Zeff. The sensitivity of the diagnostic
enables Zeff measurements down to the lowest core electron densities observed in the last campaign
of 0.75 ×1019 m−3with a statistical relative error of ≈50% (Zeff = 3.2, 100 ms integration time).
The analysis is automated to routinely compute Zeff after every plasma discharges.
Keywords: Analysis and statistical methods; Plasma diagnostics - interferometry, spectroscopy
and imaging
2019 JINST 14 C10003
Contents
1 Introduction 1
2 The single line-of-sight USB-spectrometer diagnostic 1
3 Bayesian modeling and inference 2
4 Results 3
5 Conclusions and future works 5
1 Introduction
In magnetically confined fusion plasmas, the study of impurity behavior is important for the as-
sessment of plasma performance and the investigation of impurity transport [1]. The effective
charge Zeff =ÍiniZ2
i/ÍiniZiis related to the concentration of impurities and indicates the overall
contamination of the plasma with mainly low-Z impurities, e.g. Carbon. It is usually derived exper-
imentally from the plasma ion-electron bremsstrahlung emission in the visible, IR or X-Ray spectral
region [2–4], using an independent measurement of the electron density neand temperature Te. In
this work, we illustrate the diagnostic set-ups and the Bayesian modeling that allowed the inference
of Zeff at W7-X and we will show results from the OP1.2 experimental campaign, obtained from
measurements performed with a compact USB-spectrometer. Also, we will describe the diagnostic
set-up of other diagnostic systems which were routinely observing bremsstrahlung emission as well.
2 The single line-of-sight USB-spectrometer diagnostic
A compact USB-spectrometer (Red Tide USB650, Ocean Optics) collects light along a single line-
of-sight that goes through the plasma core of W7-X, as shown in figure 1a. The system collects
light in the visible and near infrared wavelength region, approximately from 350 to 1000 nm, as
shown in figure 1b, with a time resolution of 100 ms. Due to the low light level of the calibration
source only the spectral range above 450 nm can be used for the analysis. The figure also shows the
bremsstrahlung emission predicted with Zeff ≈1.5. Details about the predictive forward model are
given in the following sections. In order to infer Zeff from the measured spectrum, we have selected
and used a fixed wavelength window that is free of line radiation, marked with two red vertical lines
in the figure, in the range of ≈627 −641 nm. The system was absolutely calibrated by measuring
the diagnostic response to an Ulbricht sphere of known emissivity. The calibration has been carried
out prior (pre), during (mid) and after (post) both the experimental campaigns OP1.2a and OP1.2b.
The sensitivity of the diagnostic system as a function of wavelength, in units of W / (m2Å sr count),
is shown in figure 1c. Multiplication by this quantity converts the measured raw data to spectral
–1–
2019 JINST 14 C10003
(a)
400 500 600 700 800 900 1000
wavelength in nm
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
emission in W / (m^2 A sr)
1e 3
measured spectrum
predicted Bremsstrahlung
(b)
500 525 550 575 600 625 650 675 700
wavelength in nm
2
3
4
5
6
7
8
W / (m2Å sr count)
1e 6
pre
mid
post
(c)
Figure 1. Figure (a) shows the single line of sight of the USB-spectrometer, the triangular W7-X plasma
cross section and the magnetic axis in red. The line of sight ends in an opposite port (left hand side) thus
eliminating the problem of plasma light reflections at the vessel walls. Figure (b) shows the measured
spectrum of the photon flux of the plasma bremsstrahlung and line radiation (blue) as well as the predicted
bremsstrahlung level with a Zeff ≈1.5(dashed line). The two red vertical lines indicate the wavelength range
selected and used in the analysis ≈627 −641 nm. Figure (c) shows the sensitivity spectrum of the diagnostic
in the wavelength range of between 500 nm and 700 nm. Three different measurements were carried out,
prior (pre), during (mid) and after (post) the experimental campaign OP1.2.
power density in absolute units (see also 1/C(λ)in equation (3.1)). According to the time interval
in which the data were collected, the corresponding calibration curve is applied to the data. The
relative variation between the different curves is always < 10% in the wavelength range shown in the
figure, indicating that the calibration remained fairly constant during the course of the campaign.
3 Bayesian modeling and inference
A model to calculate the bremsstrahlung emission is implemented in the Minerva framework [5].
The Minerva framework allows to carry out Bayesian modeling and inference in complex systems.
The expected measured signal S(λ)can be calculated from the bremsstrahlung emission at a given
wavelength V(λ)collected along the line of sight, according to equation (3.1):
S(λ)=C(λ)V(λ)=C(λ)∫gff(Zeff,Te, λ)n2
eZeff
√kbTe
exp hc
λkbTe1
λ2dl (3.1)
–2–
2019 JINST 14 C10003
0.0 0.2 0.4 0.6 0.8 1.0 1.2
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
n
e in 1019 m3
20180904.027 @ 0,7 s
GP fit
TS
0.0 0.2 0.4 0.6 0.8 1.0 1.2
0
1
2
3
4
5
6
T
e in keV
20180904.027 @ 0,7 s
GP fit
TS
Figure 2. The electron temperature and density profiles measured by the Thomson scattering diagnostic (TS)
are fitted with a Gaussian process (GP) model within the Minerva framework. The blue lines are samples
from the posterior distribution found with Bayesian inference. The dots represent the measured data points
together with their respective error bars.
where the integration is done along the line of sight path, gff(Zeff,Te, λ)is the free-free Gaunt factor
modeled in Minerva according to [6], C(λ)is an absolute calibration factor (figure 1c), and the
remaining symbols are used in the conventional way referring to the respective physics constants
in SI units. The single line of sight diagnostic does not allow to resolve the spatial profile of Zeff,
therefore, when Zeff is used in the calculation of the emission along the line-of-sight, it is assumed
to be constant.
According to equation (3.1), neandTeare quantities required to calculate the expected emission.
They are provided by a spatially resolved Thomson scattering [7] measurement and a line-integrated
nemeasurement by the dispersion interferometer diagnostic [8], which constrains the neabsolute
values. Both profiles are firstinferredwithin the Minerva framework with a Gaussianprocesses(GP)
Bayesian model [9], where the covariance of the normal prior distributions of the profiles is modeled
with a covariance function, parametrised in terms of the profile length scale. The posterior solution
found is “smooth” and it is affected by the number of observed data points and their respective
uncertainties, whichinthis casedo notinclude systematic errors. Anexamplecase ofsuchprocedure
is shown in figure 2, where the samples from the posterior distribution are shown in blue and the
measured data points are labeled as TS. The coordinate on the x-axis is the effective radius ρ=
pψ/ψLCFS, where ψis the magnetic flux and ψLCFS is the magnetic flux at the last closed flux surface.
Since the Bayesian analysis is meant to be carried automatically after every plasma discharge,
a fallback solution is provided for those cases in which Thomson scattering measurements are not
available. The neprofile is assumed to be parabolic, and absolute values are scaled accordingly
using interferometer measurements, as previously mentioned; the Teprofile, on the other hand, is
obtained from measurements by the electron cyclotron emission (ECE) diagnostic using the cold
resonance approximation [10,11].
4 Results
Zeff can be inferred by comparing predicted and measured bremsstrahlung emission signals. Two
example cases from experiments 20180920.042 and 20181016.023 are depicted in figure 3a and 3b,
showingdischarges in whichthe plasma was seeded by Neand N2, respectively[12]. Newasinjected
–3–
2019 JINST 14 C10003
1
2
3
4
Z
eff
20180920.042
0
5
10
P in MW
Ne flow rate (a.u.)
ECRH
P
rad
0
5
n
e in 1019/m2
n
e
0 2 4 6 8 10 12
time in s
0
5
T
e in keV
T
e, core
(a) A case of Ne injection.
1
2
3
4
Z
eff
20181016.023
0
5
10
15
P in MW
N2 flow rate (a.u.)
ECRH
Prad
0
5
n
e in 1019/m2
n
e
02468
time in s
0
2
4
T
e in keV
T
e, core
(b) A case of N2injection.
1
2
3
4
5
6
7
Z
eff
20171114.054
0
5
10
net raw signal
0.0
2.5
5.0
P
tot in MW
ECRH
0.00 0.25 0.50 0.75 1.00 1.25 1.50
time in s
0
1
2
3
n
e in 1019/m2
interferometer
(c) A low density discharge.
Figure 3. The time evolution of Zeff with respective error bars and other relevant parameters for three
example cases. Figures (a) and (b) show the case of two seeding experiments, with Ne and N2respectively.
A discrete increase in Zeff and total radiated power Prad is observed after each injection of seeding gas. The
low density discharge in figure (c) allowed to assess the sensitivity of the diagnostic: a line-of-sight averaged
density of 0.75 ×1019 m−3at 0.5 s allowed to measure a Zeff ≈3.2with a statistical error of ≈50%.
–4–
2019 JINST 14 C10003
at 5, 7, 9 and 11 s for 200 ms, whereas N2was injected at 2 s for 50 ms and continuously at 3-7 s; a
second valve was open at 5-7 s, increasing the gas flow rate by a factor of ≈2. Corresponding to the
injection times, we observe an increase in the Zeff values and in the total plasma radiation measured
with a bolometer [13]. The Zeff error bars are obtained taking into account signal statistics, the
absolute calibration and the uncertainties in the neand Teprofiles. In the figures, the time evolution
of other relevant parameter is also shown: the power from the electron cyclotron resonance heating
(ECRH), the line integrated density nemeasured with the interferometer, and the value of the
electron temperature in the core Teas measured with the ECE diagnostic.
In figure 3c we show the case of a very low density discharge demonstrating the lower sensitivity
limit of the diagnostic at 100 ms integration time. A line-of-sight averaged density neof 0.75
×1019 m−3as measured with the interferometer at ≈0.5s allowed to measure a Zeff ≈3.2with a
statistical error of ≈50%. In the second plot from the top, the measured signal for each pixel in the
considered wavelength range is shown at every time point; the large noise level is clearly visible.
The Zeff values inferred with the USB-spectrometer were also compared to those found with
the charge exchange recombination spectroscopy (CXRS) system [14,15] for two discharges
20180927.042 and 046 in which He was injected during the experiments and the neutral beam injec-
tion(NBI) system wasactive. According toapreliminary analysis, theCXRSsystemallowedtomea-
surea H/Heratio of0.3/0.7and 0.85/0.15in thefirst andsecond dischargerespectively, and a2%con-
centration of C6+in the core in both experiments. From these values, the lower limit of Zeff was then
estimated as ≈2.1and 1.7, compared to 1.9±0.4and 1.5±0.3as found with the USB-spectrometer.
5 Conclusions and future works
A compact, single line-of-sight USB-spectrometer allows to measure bremsstrahlung emission
and infer Zeff at the Wendelstein 7-X stellarator. The diagnostic was operating during the OP1.2
routinely providing the line-of-sight averaged Zeff. A Bayesian model was implemented in the
Minerva framework, allowing to infer Zeff and to provide Gaussian process fits of neand Teprofiles
combining measurements of the Thomson scattering and dispersion interferometer measurements.
In the context of future works, we want to mention that more systems dedicated for the Zeff
measurement are also available and collected data during the experiments, but they are not yet
modeled and inference was not carried out on such measurements. Specifically, two additional
detectors share the same line of sight of the USB-spectrometer. One collects light emitted in the
near infrared range of 750-950 nm, with spectral resolution of ≈1nm and typical time resolution
of 50 ms. A second one collects visible light at 523 nm and 630 nm using interference filters with
a bandwith of 2 nm, and time resolution of 100 kHz. A third system is equipped with 27 lines of
sight and operate in the range 750-950 nm, and can therefore provide information to infer spatially
resolved Zeff. In future works we aim at modeling all these systems and adding other diagnostics
containing information on Zeff (as CXRS, X-ray spectrometers) within the Minerva framework, so
to exploit all available information for the inference of Zeff profiles.
–5–
2019 JINST 14 C10003
Acknowledgments
This work has been carried out within the framework of the EUROfusion Consortium and has
received funding from the Euratom research and training programme 2014-2018 and 2019-2020
under grant agreement No 633053. The views and opinions expressed herein do not necessarily
reflect those of the European Commission. This work was supported in part by the U.S. Department
ofEnergy(DoE)undergrant DE-SC0014210 and funding by the Department of Engineering Physics
and of the College of Nuclear Engineering at the University of Wisconsin — Madison, U.S.A.
References
[1] G. Verdoolaege, M.G. Von Hellermann, R. Jaspers, M.M. Ichir and G. Van Oost, Integrated bayesian
estimation of Zeff in the TEXTOR tokamak from bremsstrahlung and CX impurity density
measurements,AIP Conf. Proc. 872 (2006) 541.
[2] M. Krychowiak, R. König, T. Klinger and R. Fischer, Bayesian analysis of the effective charge from
spectroscopic bremsstrahlung measurement in fusion plasmas,J. Appl. Phys. 96 (2004) 4784.
[3] H.Y. Zhou, S. Morita, M. Goto and M.B. Chowdhuri, Zeff profile measurement system with an
optimized Czerny-Turner visible spectrometer in large helical device,Rev. Sci. Instrum. 79 (2008)
10F536.
[4] H. Meister et al., An integrated system to measure the effective charge of fusion plasmas in the ASDEX
upgrade tokamak,Rev. Sci. Instrum. 74 (2003) 4625.
[5] J. Svensson and A. Werner, Large scale bayesian data analysis for nuclear fusion experiments, in
IEEE International Symposium on Intelligent Signal Processing,IEEE, (2007), pg. 1.
[6] R.S. Sutherland, Accurate free-free gaunt factors for astrophysical plasmas,Mon. Not. Roy. Astron.
Soc. 300 (1998) 321.
[7] S. Bozhenkov et al., The Thomson scattering diagnostic at Wendelstein 7-X and its performance in the
first operation phase,2017 JINST 12 P10004.
[8] J. Knauer et al., A new dispersion interferometer for the stellarator Wendelstein 7-X, in 43rd European
Physical Society Conference on Plasma Physics, volume 4, Leuven, Belgium (2016).
[9] C.E. Rasmussen and C.K.I. Williams, Gaussian processes for machine learning, The MIT Press,
U.S.A. (2006).
[10] M. Hirsch et al., ECE diagnostic for the initial operation of Wendelstein 7-X,EPJ Web Conf. 203
(2019) 03007.
[11] U. Hoefel et al., Bayesian modeling of microwave radiometer calibration on the example of the
Wendelstein 7-X electron cyclotron emission diagnostic,Rev. Sci. Instrum. 90 (2019) 043502.
[12] F. Effenberg et al., First demonstration of radiative power exhaust with impurity seeding in the island
divertor at Wendelstein 7-X,Nucl. Fusion 59 (2019) 106020.
[13] D. Zhang et al., Design criteria of the bolometer diagnostic for steady-state operation of the W7-X
stellarator,Rev. Sci. Instrum. 81 (2010) 10E134.
[14] O. Ford et al., Charge exchange recombination spectroscopy at Wendelstein 7-X,Rev. Sci. Instrum.
forthcoming (2019).
[15] L. Vanó et al., Studies on carbon content and transport with charge exchange spectroscopy on W7-X,
in EPS proceedings forthcoming (2019).
–6–
Chapter 6. Publications
6.4. Article IV
A. Pavone et al.
»Neural network approximated Bayesian inference of edge electron dens-
ity profiles at JET«
In: Plasma Physics and Controlled Fusion (2020)
Synopsis
The publication describes how neural networks can be trained to approximate
Bayesian inference based on an existing Bayesian model for the reconstruction
of the edge electron density profiles at the JET tokamak. We demonstrate here
that the method previously developed and tested at the W7-X stellarator can be
generalized to a completely new physics system, therefore hinting to the possib-
ility, in future, to automate the procedure for approximating any physics model
implemented within the same Bayesian modeling framework. The method is
also extensively tested on a large number of different experimental cases, and
compared to the conventional Bayesian inference results. We show also how
the uncertainties of the network prediction can be calculated with an approach
which relies on the deep learning technique known as dropout training and an
interpretation of the training problem as a variational inference problem in the
context of Bayesian neural networks.
90
Neural network approximated Bayesian
inference of edge electron density profiles
at JET
A Pavone
1
, J Svensson
1
, S Kwak
1
, M Brix
2
, R C Wolf
1
and
JETContributors
3
1
Max-Planck-Institut für Plasmaphysik, Teilinstitut Greifswald, D-17491 Greifswald, DE, Germany
2
Culham Centre for Fusion Energy, Culham Science Centre, Abingdon OX14 3DB, United Kingdom
E-mail: [email protected]
Received 30 August 2019, revised 6 February 2020
Accepted for publication 17 February 2020
Published 5 March 2020
Abstract
A neural network (NN)has been trained on the inference of the edge electron density profiles from
measurements of the JET lithium beam emission spectroscopy (Li-BES)diagnostic. The novelty of the
approach resides in the fact that the network has been trained to be a fast surrogate model of an existing
Bayesian model of the diagnostic implemented within the Minerva framework. Previous work showed
the very first application of this method to an x-ray imaging diagnostic at the W7-X experiment, and it
was argued that the method was general enough that it may be applied to different physics systems.
Here, we try to show that the claim made there is valid. What makes the approach general and versatile
is the common definition of different models within the same framework. The network is tested on data
measured during several different pulses and the predictions compared to the results obtained with the
full model Bayesian inference. The NN analysis only requires tens of microseconds on a GPU
compared to the tens of minutes long full inference. Finally, in relation to what was presented in the
previous work, we demonstrate an improvement in the method of calculation of the network
uncertainties, achieved by using a state-of-the-art deep learning technique based on a variational
inference interpretation of the network training. The advantage of this calculation resides in the fact that
it relies on fewer assumptions, and no extra computation time is required besides the conventional
network evaluation time. This allows estimating the uncertainties also in real time applications.
Keywords: JET, neural network, Bayesian inference, real time, dropout, Lithium beam
diagnostic, edge electron density
(Some figures may appear in colour only in the online journal)
1. Introduction
The application of neural networks (NN)to fusion experi-
ments is not new, dating back to the mid-1990s with
examples at the JET experiment of reconstruction of ion
temperature profiles in real-time [1]and analysis of charge
exchange spectra [2,3]. They have been used for the infer-
ence of plasma parameters from diagnostic data as well as the
prediction of disruptive events from different parameters and
measured quantities [4]. More recently, they have also been
used at the Wendelstein 7-X experiments for the task of
reconstructing magnetic configuration properties from heat
load patterns on the plasma-facing components [5,6]; at JET
for tomographic reconstruction [7]; they have been used also
as surrogates for transport models as shown in [8–10]. Dif-
ferent machine learning algorithms as Gaussian processes
Plasma Physics and Controlled Fusion
Plasma Phys. Control. Fusion 62 (2020)045019 (13pp)https://doi.org/10.1088/1361-6587/ab7732
3
See the author list of Joffrin et al (https://doi.org/10.1088/1741-4326/
ab2276)for the JET contributors.
Original content from this work may be used under the terms
of the Creative Commons Attribution 3.0 licence. Any
further distribution of this work must maintain attribution to the author(s)and
the title of the work, journal citation and DOI.
0741-3335/20/045019+13$33.00 © 2020 Max-Planck-Institut für Plasmaphysik Printed in the UK1
have been used in surrogate-based optimization strategy for
the accelerated validation of plasma transport codes as in [11].
NN are very desirable tools especially for two reasons: they
are able to identify patterns for those phenomena where a
physics model describing the process is missing, e.g. plasma
disruption, and they can process data at very fast time scales,
e.g. in the order of tens of microseconds. The latter feature is
particularly relevant today as fusion experiments produce
more data than we can hope to exhaustively analyze with
traditional tools.
Here we train a NN as a fast approximation, i.e. a sur-
rogate model, of a Bayesian model of the JET lithium beam
spectroscopy (Li-BES)diagnostic for the inference of the
edge electron density profiles from experimental measure-
ments. The principles behind the functioning of the diagnostic
are given in [12], whereas details about the experimental
configuration and measurements at JET can be found in
[13–15]. Edge electron density profiles are useful quantities in
controlling and understanding plasma phenomena as edge
localized modes (ELMs), L-H transitions and turbulence
transport. A model for the diagnostic, described in detail in
[16], is implemented within the Minerva Bayesian modeling
framework [17]. The framework provides a common way to
define models and perform Bayesian inference when mea-
surements are available. The models are strongly modular so
that different modules, or nodes in the jargon, can be easily
used to build similar models for different systems, e.g.
diagnostics at different fusion machines, or test different
assumptions. Currently, the framework is extensively used at
the fusion experiment JET, where its application is discussed
in [18]and an application to the equilibrium reconstruction
using microwave diagnostics is described in [19], and W7-X,
where it has been used to model a microwave radiometer
calibration for the electron cyclotron emission diagnostic
[20], for the inference of electron, ion, and impurity density
profiles from an x-ray imaging diagnostic [21], and for the
inference of ion temperature from measurements of a coe-
herent thomson scattering diagnostic [22].
In a previous work [23], it was shown that a NN can be
trained as approximation of the Bayesian model of an x-ray
imaging diagnostic at W7-X, and it was argued that the same
method could be easily applied to a different system for which
a Bayesian model implemented within Minerva was available.
Extending such work, here we aim at validating this claim.
Therefore we make use of the same method for training the
network, i.e. we train the network on data generated exclu-
sively with the Bayesian model sampling from its joint dis-
tribution, and we show that it can be successfully used to
approximate the full model Bayesian inference of plasma
parameters from data measured with a new physical system at
a different fusion experiment, the edge electron density pro-
files from the Lithium beam emission spectroscopy diagnostic
measurements at JET. In this way, we demonstrate that all
that is required to obtain such network approximation is a
Bayesian model. This is relevant because it shows that it is
possible to replicate the method and achieve a fast
reconstruction for any diagnostic modeled within the Minerva
framework. Moreover, a major novel contribution is achieved
by improving on the uncertainties calculation previously
reported, which suffered from being slow and requiring lim-
iting approximations. The calculation makes use of a novel
state-of-the-art deep learning technique which can provide
fast and at the same time accurate uncertainty estimates. This
is of particular relevance if we think of using the network
reconstructions in real time systems and control applications
where we need to take decisions according to the network
result and it is therefore crucial to know whether and to what
extent the network output is accurate and can be trusted.
In section 2we give an overview of the Lithium beam
spectroscopy diagnostic to the extent that is relevant to this
work, in section 3we describe the Bayesian model of the
diagnostic implemented within the Minerva framework, in
section 4we show how the network is trained making use of
data generated with the Minerva Bayesian model in order to
make predictions from experimental data, in section 5we
describe how the uncertainties of the network model can be
calculated, and in section 6we compare the network inference
to the Bayesian inference carried out with the Minerva model
on measurements collected at several JET pulses. We draw
our conclusion in 7, where we also give an outlook on future
developments.
2. The JET lithium beam spectroscopy diagnostic
The Li-BES system measures the spectral emission produced
by the interaction of lithium atoms with the plasma species.
The lithium atoms are injected with a beam vertically from
the top of the machine, and as the beam penetrates the plasma
it gradually gets excited and it is lost along the magnetic field
lines when most of the atoms get ionized. A transmission
grating spectrometer collects the radiation emitted along the
penetration path, which is limited to the edge region of the
plasma where it allows the reconstruction of the electron
density. The spectrum is observed in a few nanometers
wavelength range from 26 different spatial positions. A CCD
camera is used to detect the photons with an integration time
of typically 10 ms. A sketch of the system is shown in
figure 1.
In order to understand the work presented here, details
about the hardware are not as relevant as those about the
model, which are given below. For the reader interested in
knowing further details about the hardware set-up, detailed
descriptions can be found in [13]and [14]. Here we will give
an overview of the diagnostic principles and the model in
order to provide the information required to understand the
rest of the work. A full description of the Bayesian model and
its usage to infer the electron density, including details about
error treatment, modeling of the instrument function and
calibration, is given in [16,24].
The measured spectra contain different components: the
Li I line radiation A, a bremsstrahlung dominated background
2
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
Bassumed to be constant in the wavelength range of interest,
and an offset Z. The signal Scan be found by taking into
account an instrument function C(λ)representing the shape of
an infinitely narrow line on the detector and an interference
filter function F(λ)according to the following equation (the
spectral width of the Li line is below the resolving capability
of the instrument):
lll=++SFCABZ.2.1() ()[ () ] ( )
The quantity of interest for this study is the Li I line radiation
A, which we will refer to as the measurement or observation
of our system from now on. It is inferred from the measured
signal Sin a pre-processing stage, prior to any NN or Baye-
sian model evaluation, by first inferring the interference filter
function Fand the instrument function Cfrom two inde-
pendent and dedicated measurements without plasma, and
then by inferring A,Band Zsimultaneously from actual
plasma experiments. We will not give further details here
about how this is accomplished as it is not relevant for the rest
of this work; the interested reader can find more information
in [16].
The intensities of the Li I (2p-2s)line radiation come
from the neutral lithium beam atoms injected into the vacuum
vessel as they traverse and interact with the plasma. The
atoms penetrating into the plasma undergo collisions with the
plasma electrons, protons and other impurities and by mean of
spontaneous emission processes they produce the line radia-
tion that is collected by the diagnostic. The line radiation is
emitted by the decay from the first excited state (1s
2
2p
1
)to
the ground state (1s
2
2s
1
)of the beam atoms. The line
intensity is then dependent on the population of the first
excited stated. The change in the relative population of any
excited state N
i
as the beam atoms penetrate the plasma can be
expressed in terms of the plasma electron density n
e
(z)and
temperature T
e
(z)according to a multi-state collisional-
radiative model firstly introduced in [25]:
å
=++
=
Nz
zvna T na v b N
d
d
1,2.2
i
j
M
ij
e
ij
pj
Li 1
ee p Li ij
Li
() [() () ] ()
where zrepresents a coordinate along the penetration length
of the beam. The coefficients
a
ij
eand
a
ij
pwith ¹ij
(
)and
a>0 are net population rate coefficients accounting for the
contribution of plasma electrons and ions in populating the ith
state from the jth state; whereas a
ii
<0 denotes a net de-
population rate coefficient of the ith state accounting for
excitation, de-excitation and ionization processes. The coef-
ficients b
ij
represent instead spontaneous emission rate coef-
ficients or Einstein coefficients. v
Li
is the lithium beam
velocity corresponding to ≈50 keV beam energy, n
p
is the
density of plasma protons, and M
Li
is the number of con-
sidered states of the neutral lithium atoms, which is 9 in this
case. The dependency of the plasma profiles n
e
,T
e
and N
j
on
the zcoordinate has been omitted for brevity. In order to be
able to solve equation (2.2), an initial condition needs to be
defined. It can be chosen to be:
d==Nz 02.3
ii1
() ()
corresponding to the assumption that all lithium beam atoms
are neutral in the ground state (i=1)at z=0, the position
where they enter the vacuum vessel. In other words we
assume N
1
(z=0)=1. The population of the first excited
state (i=2)of the lithium atoms N
2
can then be calculated.
This quantity is proportional to the observed lithium inten-
sities A(z)found from the signal measured with the CCD
camera along the observation length. We therefore introduce a
calibration factor αto express this relationship:
a=Az N z.2.4
2
() () ( )
The factor is not known and it has to be inferred from the
data. For the interested reader, a complete derivation and an
explicit expression of αin terms of the CCD output counts
can be found in [16].
Figure 2shows an example calculation carried out with
the forward model implementing the physics described so far.
Given the plasma profiles in the two plots on the top, the
relative population of the first excited state of the lithium
atoms and then the Li I line intensity can be calculated. The
plot on the bottom left representing the line intensity in
arbitrary units also shows a 10% relative Gaussian noise
added to the calculation (the scattered circles)in order to
simulate the noise present in the measurements. As the beam
Figure 1. A schematic of the Li-BES system at JET. The lithium
beam is injected vertically (blue arrow)and it penetrates the plasma
volume, indicated by the orange ellipsoid, emitting light by
interacting with the plasma species. The spatial positions of the
measurements is indicated by the intersection of the lines of sight
(dashed lines)with the lithium beam path. The light is collected by a
spectrometer, in yellow in the figure.
3
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
atoms penetrate into the plasma they also get ionized and
when this happens they follow the magnetic field lines as
charged particles and do not contribute any longer to the
collected emission. The ionized atom population is shown in
violet in the plot on the bottom right.
The measured intensity can be used to infer the electron
density profile at different edge locations along the penetra-
tion length, provided the electron temperature profile infor-
mation. The latter is usually delivered at JET by the high
resolution Thomson scattering diagnostic (HRTS)[26].
3. The Bayesian minerva model
The multi-state model described in section 2is implemented
within the Minerva Bayesian modeling framework. The
Minerva modeling framework [17]is a framework that allows
modeling complex systems and carrying out Bayesian infer-
ence with them. Models are expressed in a modular way, where
the modules are called nodes. These modules can be easily
switched and replaced so that different models can be easily
built, and different assumptions can be easily tested. Nodes can
represent physics quantities with associated probability dis-
tributions over the values they can assume, or they can
represent deterministic calculations consumed by other nodes
in the model. The models are used as forward models to predict
observations from given free parameters. It makes use of gra-
phical models [27]to represent models and the probabilistic
relations between quantities in the model. An example of a
Minerva graph for the lithium beam system is shown in
figure 3, and it is described later in the section. Once a model
has been defined within the framework, Bayesian inference can
be performed with it. Thanks to the fact that model definition
and Bayesian inference constitute two different and indepen-
dent stages, such that the implementation details of one are
abstracted away from the other, the framework offers a solution
for performing scientific inference in complex systems which is
general, and not strictly related to a single nuclear fusion
experiment or even nuclear fusion research. As a Bayesian
framework, it employs Bayesian probability theory to handle
the uncertainties attributed to any modeled quantity. In Baye-
sian probability a prior distribution p(T)is assigned to the
model free parameters Tand a likelihood function pDT(∣)
is
assigned to the model observations D. As measurements are
available, they can be used to update the prior knowledge on
the free parameters through Bayes formula:
=pTD pDTpT
pD .3.1(∣ ) (∣)()
() ()
The quantity pTD(∣ )
is called the posterior distribution and it
reflects the new state of knowledge on the parameters Tas the
Figure 2. An example case of the Li-BES forward model calculation. In clockwise direction, from the top left plot the following quantities are
shown: an electron density profile, an electron temperature profile, the Li I line intensity predicted with the forward model (solid line)
together with the addition of 10% relative Gaussian noise (scattered dots), and the relative population of the ground state (GS), the first
excited state Li I and the first ionized state Li
1+
of the beam atoms. All quantities are expressed as function of the penetration distance inside
the plasma, with z=0 corresponding to the position where the beam enters the vacuum vessel.
Figure 3. A simplified sketch of the Li-BES Minerva model graph.
Colored nodes are probabilistic nodes, where orange denotes the free
parameters and blue denotes the observed quantities. White nodes
represents deterministic calculation nodes or other input parameters
required by the model. The arrows represent direct or indirect
dependencies in the probabilistic relations between the quantities in
the probabilistic nodes.
4
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
observations Dare taken into account. The numerator of the
equation is also known as the joint distribution p(D,T)of the
observations and parameters. The denominator p(D)is a nor-
malization factor and it is referred to as the evidence.
3.1. Model parameters
The model free parameters are the electron density profile n
e
and the absolute calibration factor α. The prior distribution
for the n
e
profile is modeled through a zero mean Gaussian
process [28]. A Gaussian process is a stochastic process
whose realizations are functions. In Bayesian inference they
are used for models where the free parameters are functions,
in this case 1D electron density profiles, and the observations
are the values they assume in a number of domain locations.
A realization of a random function drawn from the process is
given by the values it assumes in a number of positions in its
domain and its probability distribution is chosen to be
Gaussian. Its covariance is known as covariance function.
One common choice for it is the squared exponential, which
regulates the smoothness of the function by modeling the
correlation between points in the domain. For the density
profiles it can be written as:
ssds=-
-+Kz z zz
,exp
2,3.2
f
x
ij y
12 212
2
2
2
⎛
⎝
⎜⎞
⎠
⎟
() () ()
where z
1
and z
2
are two positions along the zaxis and the
different σparameters regulate the smoothness of the profile.
σ
f
regulates the overall variance of the profile and σ
x
regulates
the length scale of the changes in the profile. Small values
mean that the profile can change quickly along z, whereas
large values mean that it will change slowly. σ
y
is used to
allow for small amount of noise expected in the profile. A
uniform distribution is used for the calibration factor α. The
model observations are the Li I line intensities. The likelihood
function is chosen to be a normal distribution centered on the
forward model prediction.
3.2. Model graph
A sketch of the Minerva model graph for the Li-BES system
is shown in figure 3. In the sketch, the nodes representing the
free parameters n
e
and αare in orange, and the node repre-
senting the observations is depicted in blue. The white nodes
represent computation nodes, as the ‘multi-state model’node
which is used to calculate the predicted Li I line intensity,
represented in the ‘Li I intensity’node, or other quantities
required by the model, as the energy of the lithium beam,
represented by the ‘beam energy’node, and the observation
length, defined as the length along the beam path where the
emission is observed, represented by the ‘length’node. The
observation length is a quantity that is known given
the experimental setup and it can be different for different
experiments. We make use of 20 and 26 equally spaced
positions along the observation length for the profile and the
observations locations, respectively. The calibration coeffi-
cient αis applied to the predicted Li I line intensities as an
overall multiplicative factor. In the graph, we have also
shown the dependency of the multi-state model from the rate
coefficients that are taken from the Atomic Data and Analysis
Structure (ADAS)database [29], a database containing data
useful for modeling the radiating properties of ions and atoms
in plasmas. The arrows represent direct or indirect depen-
dencies in the probabilistic relations between the quantities in
the probabilistic nodes, and should not be understood as a
computational flow. All free parameters node reach, indir-
ectly, the observation node and are not connected to each
other. This expresses the fact that the joint distribution of the
graph p(D,T)can be factorized in terms of a conditional
distribution of the observations conditioned on the free
parameters apDn,
e
(∣ )
and the product of two independent
prior distributions over the electron density p(n
e
)and the
calibration factor p(α):
aa=pD T pDn pn p,,.3.3
ee
( ) (∣ )( )() ( )
4. NN training
Given the Bayesian model described in the previous section,
we aim now at training a NN in such a way that it constitutes
an approximation of the Bayesian inference that can be car-
ried out with the full Minerva model. In order to achieve this,
we use the Minerva model to generate the training data. In
this section, we outline the procedure to the extent it concerns
the specific case of the lithium beam system under invest-
igation. For the interested reader, further conceptual and
theoretical insights about how this method can provide a
sound approximation are given in [23].
4.1. Generation of the training data
The electron temperature profile T
e
and the observation length
lare parameters that are known at inference time, when we
perform inference with the Minerva model and the network:
the former is provided by an independent measurement of
the Thomson scattering diagnostic, the latter comes from
the experimental setup. Both quantities constitute part of the
network input, together with the measured lithium line
intensities, and therefore need be generated with the Minerva
model for training the network. As we aim at training the NN
on the problem of inferring electron density profiles from
measured Li I line intensities, the training input data are the Li
I line intensities, the T
e
profiles, and the length, while the
training output data are the n
e
profiles and the absolute cali-
bration coefficient α. We generate the training data by sam-
pling from the joint distribution of the model. This means
that, as a first step, we draw a sample of n
e
,T
e
,land αfrom
the corresponding prior distributions and, given these values,
we compute the predicted Li I line intensities and draw a
sample from the likelihood function. We iterate over this
process a number of times equal to the number of samples in
the training set. As we need to generate data also for T
e
and
the observation length, we assign probability distributions
also to them, so that the joint distribution of the model can be
5
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
written as:
a=pD T pD n T l,,,,, 4.1
ee
()( ) ()
and
aa=pD T pDn T l pn pT plp,,,, .4.2
ee e e
( ) (∣ )( )( )()() ( )
A sketch of the procedure is shown in figure 4.
We used for the n
e
and T
e
profiles a zero mean GP prior
defined on a xdomain of 20 linearly spaced positions between
x
0
=0.0 and x
1
=20.0 with covariance function as in
equation (3.2). The parameters of the GP are set to:
σ
x
=10.0, σ
y
=0.002 ×10
19
m
−3
,σ
f
=2.0 ×10
19
m
−3
for
n
e
, and σ
x
=10.0, σ
y
=0.002 keV, σ
f
=1.0 keV for T
e
. The
profiles are constrained to be non-negative by rejecting
the samples having negative values as they are drawn from
the GP prior distributions until a positive valued sample is
drawn and kept. Moreover, we constrain the profiles to
assume low value at x=0 corresponding to the position
z=0, the edge location where the beam enter the plasma.
The constraint is implemented as a virtual observation, i.e. by
implementing an observed node in the Minerva graph as a
normal distribution with standard deviation 100 eV around a
value of 100 eV for the T
e
profile, and standard deviation
0.1 ×10
19
m
−3
around a value of 0.01×10
19
m
−3
for the n
e
profiles. In this way, when the profiles are sampled, they are
constrained by this virtual observation as if it was a real
measurement, although no measurement of such kind actually
occurred. Further details about how a virtual observation
constraint is implemented are provided extensively in [23]
and will not be treated further here, as they are not relevant to
the understanding of the work that follows.
Samples from the n
e
and T
e
prior distributions are shown
in figure 5. The distance from the location z=0 at which the
beam atoms enter the plasma is on the xaxis. It is worth
noticing that the profiles are not monotonic. For the calibra-
tion factor αwe use a uniform distribution between 1.0 and
20.0. The choice for this prior was motivated by the infor-
mation available from previous analysis, which showed
values typically falling in this range. For the parameter lwe
use a uniform distribution between 0.2 and 0.4 cm. Finally,
the conditional distribution of the simulated Li I intensity
PDT(∣)
is a normal distribution centered on the model pre-
diction and with standard deviation equals to 10% relative
error. In this way, we inject noise in the training input data, as
we expect to have noise at evaluation time, when the input are
the experimental measurements. Our training data set is made
of 100 000 samples.
4.2. Network model
The NN architecture used for this problem is a multilayer
perceptron (MLP)with one hidden layer with 1000 units. The
activation function used in the hidden units is the so called
scaled exponential linear function (SELU)[30]and the loss
function used is the mean squared error:
å
=-LN
w wt
1,4.3
i
ii2
() (() ) ( )
where Nis the number of training samples, wis the vector of
adaptable network weights, and y
i
and t
i
are the ith multi-
dimensional output and target vector, respectively. The net-
work was trained using the Adam optimizer with parameters:
learning rate=0.001, β
1
=0.9, β
2
=0.999, ò=10
−8
, see
[31]for a description of the algorithm and parameters. The
training data were divided in batches of 100 samples and the
network weight training was terminated once 5000 passes
Figure 4. A sketch to illustrate the sampling procedure for the
training set creation. A sketch of the Li-BES Minerva model and the
neural network, having one hidden layer with 1000 units, is shown
on the left and on the right, respectively. At training time, the NN
takes as input the Li I line intensities generated with the Minerva
model and sampled from the likelihood function together with the
sampled T
e
and observation length l. The sampled n
e
and αused to
generate the intensities are the target data of the network. The blue
nodes of the neural network denote the input intensities and the two
red nodes at the top denote the output points of the electron density
profile.
Figure 5. Samples from the Gaussian process priors for the T
e
and n
e
profiles, top and bottom figures, respectively. The x-axis position at
0.0 corresponds to the location where the beam atoms enter the
plasma, which is at the edge of the machine. The low value
constraint at such position is also visible in the shape of the sampled
profiles.
6
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
through the training set were reached (also called epochs).
One single starting position was used for the initialization of
the weights. The number of 1000 hidden units in one hidden
layer was chosen by validating the network performance on a
set of test data made of 1000 samples drawn from the joint
distribution of the Bayesian model. The network was imple-
mented within the TensorFlow framework [32]. The NN has
been trained with dropout [33,34]. Dropout is a technique
originally introduced to prevent overfitting. Although this can
be, by itself, a good reason to make use of it, there is at least
another reason. Dropout training can also be used to estimate
uncertainties in the network prediction; when used in this way
it is referred to as Monte Carlo (MC)dropout [35]. In the next
section we will give an overview of the theoretical framework
that allows to interpret dropout training as a Bayesian infer-
ence technique. We will only touch the salient points of the
derivation which are necessary to understand the current
work, but for the reader interested in a deeper understanding
of the theory behind it, details can be found in [35].
Before proceeding, we would like to summarize the
relationship between the two key elements of this work:
•the Minerva Bayesian model is defined at the first step,
and it is used to both carry out the full Bayesian inference
of the electron density profiles from the measured
experimental data, and to generate the training data for
the NN from its joint distribution.
•the NN is first trained on data generated exclusively with
the Minerva Bayesian model, afterwards it is applied to
infer electron density profiles from the measured exper-
imental data.
In this way, the full Bayesian inference and the NN inference
are both based on the same Bayesian model, with the
distinction that the latter approximates the former. The two
inference methods will be compared in section 6.
5. NN uncertainties
Delivering uncertainties in the NN calculation is necessary in
order to asses whether, and how far, the network prediction
can be trusted. This is important when the network output is
wanted for further calculations, and especially when a deci-
sion has to be taken according to its output, as in the case of
real time control systems, e.g. feedback systems. Therefore, it
is also important that the uncertainties can be calculated in a
time scale comparable to the network processing speed itself.
Here we give an overview of the theoretically sound and
practically desirable method presented in [35].
5.1. Bayesian NNs
NN uncertainties can be calculated in a Bayesian framework
known as Bayesian NNs [36]. In this context, the network
training is seen as an inference problem, where the free
parameters are the network weights wand the training target
data are the observations Y. It follows that we can write
Bayes formula for the posterior of the network weights:
=ppp
p
wY X Yw X w
YX
,,,5.1(∣ ) (∣ )()
(∣) ()
where Xdenotes the training input data. As we have now a
distribution over the network weights, we will also have a
distribution over the network’s predictions y
*
for a new input
vector x
*
, given by:
ò
=ppp x XY wx wYX w,, , , d. 5.2
** * *
(∣ ) (∣ )(∣ ) ( )
This distribution is the one we are interested in and which
prescribes the uncertainties in the network prediction.
5.2. Variational inference
For any interesting NN model, the posterior pwY X,(∣
)
can-
not be treated analytically because of the large number of
weights and complex network function. We therefore make
use of variational inference (VI)[37]in order to approximate
it. In VI we choose an approximating variational distribution
q
θ
(w)parametrised by θ, which is easy to evaluate, in order to
approximate the original posterior distribution. This is
achieved by minimizing the Kullback–Leibler (KL)diver-
gence with respect to θ, which can be thought as a measure of
similarity between two distributions:
ò
=
qq
q
qp q q
p
wwYX w w
wY X w
K
L, log
,d.(()∣∣(∣ )) () ()
(∣ )
It can be shown that minimizing the KL divergence is
equivalent to maximizing the so called evidence lower bound
(ELBO)with respect to θ:
ò
q=-
qq
L
qp d qpwYXww wwlog , KL ,
w
VI () () (∣ ) ( ()∣∣())
where, noticeably, the KL divergence term now is between
the approximating distribution q
θ
(w)and the prior distribution
p(w), fact that explains the name of the expression. At this
point we make use of the results derived in [35], where it is
shown that the conventional dropout training of a NN is
equivalent to the maximization of the ELBO function.
5.3. Dropout
When a network is trained with conventional dropout, at each
iteration of the training, as a new training batch sample is
provided to the network, some of its units are dropped. This
makes the trained network more flexible, intuitively because
the units need to learn to be useful also when some of the
others are missing. To be more rigorous, dropout prevents
overfitting by preventing co-adaption of the units. At eva-
luation time all units are retained, but their output is scaled
down by the probability of dropping them, since now there is
a larger number of units in the network.
5.4. MC dropout
In the conventional dropout picture of training, the stochastic
process is applied in the unit (or feature)space. We can
switch view and see the stochastic process as applied in the
7
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
weight space, since as units are dropped, also the corresp-
onding weights that connect them are dropped. Under this
view, it is finally possible to merge the dropout training with
the VI approximation of the true weight posterior. This hap-
pens by re-parametrising the weights win terms of a function
gsuch that:
q==
gwMb,diag,, 5.3(){() } ()
where θ=M,band ~
pBernoulli b
(
)
where p
b
is the
probability of dropping the units.
is then a vector of zeros
and ones, Mis a Q by D deterministic matrix of connecting
weights where Q is input vector size and D output vector size,
bis the bias vector of dimension Q, and
diag()
is a diagonal
matrix of same size as Mhaving as diagonal elements the
elements of the vector
. The product
Mdiag() represents a
matrix multiplication whose results end up ‘selecting’what
connecting weight is active at a given dropout step. At this
point, after some more manipulations, we can rewrite the
integral in the ELBO expression as an integral over
pb(
)
and
the derivatives required for the optimization as derivatives
with respect to θ.In[35]it is then shown that, optimizing a
NN dropout loss function is equivalent to optimizing the
function L
VI
(θ). In conclusion, this means that, by using a
well established method for training the network, we can at
the same time approximate the posterior distribution of the
corresponding Bayesian network via variational inference
with Bernoulli approximating variational distribution.
The only difference with the standard dropout training is
that at evaluation time, instead of retaining all units, we keep
dropping them as several forward passes of the network are
done, so to obtain a distribution of network predictions rather
than a single best estimate. This corresponds to estimating the
ELBO integral with a Monte Carlo integration. The major
advantage of this approach is that it scales well with large
networks: forward passes of the network are typically very
fast and can also be run in parallel. Therefore, calculating
uncertainties in this way does not require substantial extra
computation time.
We used dropout probability p
b
=0.5 for all units in the
hidden layer, and p
b
=0.0 for the input units, i.e. all input
units were retained.
We have described how variational inference and drop-
out can be combined in a unified view of the network training,
leading to a Bayesian NN interpretation. One must be aware,
though, of some caveats that have been acknowledged
regarding the theoretical framework supporting this techni-
que: see for example [38], where it is claimed that in the case
of simple linear networks, this method approximates the risk
of a process rather than the uncertainty of the model because
the variance found in this way do not vanish at the limit of
very large amount of training data; see also [39], where it is
shown that the variational inference framework described in
[35], specifically with regards to the choice of some
approximating distributions, can lead to undefined objective
function of the network, and they propose an alternative to
such objective; in general, some difficulties have been
recognized in the application of standard variational inference
approach, as indicated in [40], where pitfalls are found in the
usage of the KL divergence, and a different distance is
proposed.
In the next section we will show results obtained with
MC dropout estimation of the uncertainties, as we tested the
network on experimental data collected at the JET tokamak.
6. Results
We evaluated the NN on data collected at several JET pulses.
In order to assess the quality of the network reconstruction we
can compare the reconstructed electron density profiles to
those inferred with the full Bayesian model. Also, we can use
the reconstructed n
e
profiles as input to the forward model and
simulate Li I line intensities to compare with the measured
ones. This is indeed a better way to assess the quality of the
network reconstruction as we can see how well the NN pre-
diction fits the data. In the same way, the full Bayesian
inference reconstruction can be compared against the mea-
surements and the quality of the fit compared to that obtained
with the NN reconstruction. We want to point out that this
kind of comparison is possible because we have a model for
the measurement processes, and it is the same one used for
generating the network training data and the full Bayesian
inference. As we previously mentioned, we are comparing
two inversion methods applied to the same Bayesian model:
the network inversion being a fast approximation of the full
Bayesian inference.
6.1. Uncertainties
One illustrative example of such comparison is shown in
figures 6and 7for data collected at the JET pulse 89312 at
time 48.295 s, just before NBI heating started, so the plasma
was in L-mode and the line integrated density was
»´ -
510m
19 2.Infigure 6, the NN reconstructed density
profiles are compared to those inferred with the full Bayesian
inference (Minerva).Infigure 7, the Li I line intensities
generated with the Minerva and NN reconstructed profiles are
compared to the measured ones. The multiple samples
represent the uncertainties. In the NN case, these are 100
samples obtained with MC dropout; in the Minerva case,
these are 100 samples drawn from the full model posterior
distribution which has been explored with a Markov Chain
Monte Carlo sampler. From figure 6it is evident that the
uncertainties of the density profiles inferred with the network
and with Minerva can be quite different. This should not
surprise. It is important to realize that the uncertainties
stemming from the two methods arise from two different
models, the network and Minerva model, and the corresp-
onding Bayesian inference problems. In both cases, the
uncertainties are calculated in a Bayesian framework, but the
models and quantities that contribute to the uncertainties in
the reconstructed profiles are different, as the inference task to
be solved is different. This is made evident by looking at
Bayes formula and the mathematical expression of the
uncertainties for the two models. In the network case, the
distribution of the predicted profiles is obtained by
8
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
marginalization over the network weights wwhen a new
input vector x
*
is provided:
ò
=pn pn pxY wx wYw,,d6.1
ee
**
(∣ ) (∣ )(∣) ( )
which is the same expression of equation (5.2), in which we
have omitted the dependence on the input variable Xand
substituted y
*
=n
e
. When the network is evaluated on the
measured line intensities, the input vector x
*
is constituted of
the electron temperature profile independently measured by a
Thomson scattering diagnostic, the observation length used at
that experiment, and the measured line intensities. The pos-
terior of the network weights, instead, is given by
µpppwY Yw w(∣) (∣)()and it is found with variational
inference with dropout training as described in section 5.We
do not expect dropout training to reconstruct the posterior
distribution of the Minerva model pn D
e
(∣)
, but to approx-
imate the true posterior distribution of the network weights
pwY
;
(∣)then, the spread of this posterior gives rise to a
spread in the predicted profiles according to equation (6.1).
Whereas, in the Minerva case the distribution of the inferred
profiles is given by the posterior:
µpn D pDn pn ,6.2
eee
(∣) (∣)() ( )
where Drepresents the measured Li I line intensities. The
spread of the posterior, therefore, is influenced by the model
uncertainties in predicting the measured Lithium 1 line
intensity pDn
e
(∣)
(e.g. measurement errors)and the prior
p(n
e
).
To highlight the difference between the two models, it is
useful to notice what is the role of the different quantities in
each of them: in the Minerva model, the free parameters are
the electron density profiles, and the observations are the
lithium line intensities. The inference task is then to find the
electron density profiles which allow to predict the measured
Li line intensities, given the measurements, the physics
model, and the prior. These are the boundaries of the infer-
ence problem. The final posterior distribution expresses the
uncertainties in the inference of the density profiles given the
model and these boundary conditions. The uncertainties that
arise in this case are related to the model uncertainties in the
prediction of the Li intensities—typically estimated from the
measurement errors, the sensitivity of the model to different
values of the electron density, and the beam attenuation. For
example, because the beam is attenuated as it penetrates the
plasma and gets ionized, the model is less sensitive to changes
in the electron densities in the locations closer to the core of
the machine, and the uncertainties are therefore larger.
Quantitative details about the estimation of the error from the
measurements, and quantitative considerations on the beam
Figure 6. A comparison between the n
e
profiles predicted with the NN and the full model Bayesian inference (Minerva). The samples
represent the uncertainties from the MC dropout in the NN case, and the posterior distribution in the Minerva case. The data are taken from
the JET pulse number 89312 at time 48.295 s.
Figure 7. The Li I line intensities predicted with the NN and Minerva
n
e
profiles are compared to the measurements. The shadowed areas
represent the uncertainties. The data are taken from the JET pulse
number 89312 at time 48.295 s.
9
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
attenuation and sensitivity on the model are reported in pre-
vious works [16], and are not discussed here as they fall
beyond the scope of this work. In the network model, the free
parameters are the network weights, which lack any physics
interpretation, and the observations are the set of target data in
the training set, i.e. the sampled electron density profiles. The
uncertainties of the network posterior depend on a combina-
tion of network structure, weight prior and approximating
distribution, as it is indicated and further discussed in [35].At
training time, the network model inference task is to find the
weights which allow to reconstruct the electron density pro-
files from the Li I line intensities, given a specific choice of
network structure, weight prior, approximating distribution
and training set, whose statistical properties are inherited from
the Minerva model by sampling from its joint distribution.
These are the boundary conditions of the inference problem
for the network. The predictive distribution of equation (6.1),
then, expresses the uncertainties of the model in making a
prediction within these boundaries.
6.2. Li I line intensity reconstruction
The performance of the two methods is compared more
extensively in figure 8, where the Li I line intensities predicted
with electron density profiles found by the network (top row)
and the full model Bayesian inference (Minerva, bottom row)
are compared to the measurements in a scatter plot. The pro-
files used are the average of the MC dropout samples in the
network case and the posterior distribution samples in the
Minerva case. The solid line shows the y=xline, where all
points would lie if we had a perfect fit to the measurements.
Each plot in a column shows a different spatial position along
the intensity profile; since the corresponding real space coor-
dinates may vary throughout the experiments, the positions are
labeled according to an index ranging from 0 for the outermost
location to 25 for the innermost one. More than 200 hundred
measured data points collected across 65 pulses were con-
sidered in the analysis (see appendix for a list of the pulses).
The pulses were arbitrarily chosen, without selecting for a
specific set of features or plasma configurations. The pulses
featured a broad range of parameters, including both L- and
H-mode scenarios, low and high power and gas levels. Across
all pulses, the NBI power ranged from ≈3.0 to ≈28 MW, the
vacuum toroidal magnetic field from 1.6 to 3.3 T, the total
ICRH power from ≈2.0 to ≈6.0 MW, the plasma current from
≈1.1 to ≈3.5 MA, and the line integrated density from
≈8.0 ×10
19
m
−2
to ≈2.6 ×10
20
m
−2
. The agreement to the
measurements is, in general, satisfactory for both methods.
Although the network consists of a quick, approximated
inversion of the full Bayesian inference, its reconstructions
appear to be good enough to closely predict the data in most
cases.
This is confirmed by figure 9, where we compare the
mean relative error between the observations calculated with
each of the two method inverted profiles and the measure-
ments, for each position along the profile intensities:
å
=-
EN
qq
q
1,6.3
i
ii
i
mre 12
2
()
where q
1i
is the line intensity predicted by one of the methods,
q
2i
are the measured line intensities and Nis the number of
data points. The figure shows that the error for the network is
consistently larger at every location, and it follows a trend
similar to the full Bayesian inference case (Minerva). At most
positions the error is below 20%, a reasonably good value,
suggesting that the network inversion can provide a reliable
approximated analysis.
Figure 8. The Li I line intensity predicted with electron density profiles found by the network (top row)and the full model Bayesian inference
(Minerva, bottom row), on the y-axis, are compared to the measurements, on the x-axis. Each column shows the comparison for a spatial
position along the intensity profile. The real space coordinates of the positions can vary through the experiments, so here they are labeled by
an index starting from the outermost position at index 0 to the innermost position at index 25. The solid line shows the y=xline. More than
200 hundred measured data points collected across 65 pulses were used.
10
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
6.3. Electron density profile inference
Finally, the electron density profiles inferred with Minerva
can be compared to those found with the network as shown in
figure 10. Each plot shows the n
e
values at four different
locations along the profile, indexed with an integer number
from 0 to 19. The values are the average of the samples drawn
from the posterior distribution inferred with Minerva (x-axis)
and the samples found with the MC dropout network. The
agreement is, in general, quite satisfactory. Indeed, an ana-
lysis of the mean relative error as defined in equation (6.3),
with q
1
denoting the NN reconstructed profiles and q
2
the
Minerva reconstructed profiles, shows that it is <15% at any
spatial location. This can be seen in figure 11.
7. Conclusions
Extending from previous work [23], we have trained a NN as
a fast, approximated Bayesian inference model for the infer-
ence of edge electron density profiles from measurements at
the JET tokamak. Exploiting the NN well-known data pro-
cessing speed, we can reduce the time required for the ana-
lysis from tens of minutes to tens of microseconds on a GPU,
providing an approximated reconstruction. We have shown
here, as it was suggested in [23], that all that is necessary in
order to realize this kind of fast network approximation is the
definition of a Bayesian model within the Minerva frame-
work, since the network is trained exclusively on data gen-
erated with the model by sampling from its joint distribution.
This is of particular interest because it opens the possibility to
fully automate the process in order to be able to have a fast
network approximation for any Bayesian model of any other
diagnostic implemented within the framework.
Uncertainties can also be calculated for the network
inversion. We made use of a state-of-the-art training method
to approximate the network weight distribution with varia-
tional inference and calculate the uncertainties in the predic-
tion. Compared to other existing methods, this method has the
advantage of requiring essentially the same evaluation time of
a standard network evaluation. It can be, therefore, particu-
larly useful when the network is used in real time systems,
which benefit of the uncertainty information when using the
network prediction to make further actions or take decisions.
The network has been tested on data collected during
several pulses at the JET tokamak, considering a wide range
of plasma features and scenarios. A comparison of the net-
work inferred profiles and those found with the conventional
Bayesian inference shows a discrepancy in the two methods
reconstructed uncertainties. This should not surprise, as they
arise from two very different models with different free
parameters, observed quantities, and different limitations, and
therefore they are not expected to match. This discrepancy is
a price that has to be paid to achieve the several orders of
magnitude acceleration provided by the network. As we
trained the network on a Bayesian model, we could use the
same model to simulate the observations, given the network
reconstructed profiles, and compare them against the mea-
surements. We included in the comparison the full Bayesian
inference reconstruction, which was carried out making use of
the same model. The comparison was therefore fully con-
sistent: the network inversion being a fast approximation of
the full model one. The error in the prediction of the mea-
surements is consistently larger when using the network
predicted density profiles, as it might be expected from an
approximated inversion. Still, the error is consistently below
approximately 20% in all considered experimental cases,
suggesting that the network inversion can be a reliable tool for
fast analysis.
In future works, the NN could be used as a initial guess
for the Bayesian inference carried out with the Minerva
model, in this way speeding up the sampling of the posterior
distribution with the MCMC by quickly providing a good
starting location. The network could also be used indepen-
dently, providing a fast edge profile reconstruction. For the
reconstruction to be reliable, the network could be tested on a
larger data set of measurements collected at previous
experiments and the cases where the reconstruction fail
should be investigated individually. Also, the implementation
of a novelty detection system could be useful: this is a system
which can preventively inform the user when a measurement
represents an input which is unfamiliar for the network with
respect to the data that had been used for training it. These
cases often bring to unreliable network output and, in this
way, they could be readily identified. A novelty detection
method can rely on the reconstruction of the probability
density of the input training data, which is then evaluated at
the location of the incoming measurement input in order to
asses its degree of novelty [41].
Figure 9. The mean relative error between measured Li I intensities
and the intensities simulated with n
e
profiles reconstructed by the
network (NN)and the full Bayesian inference (Minerva)is shown at
each position along the intensity profile. The calculation has been
carried out for more than 200 measurements collected across 65 JET
pulses.
11
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
Acknowledgments
This work has been carried out within the framework of the
EUROfusion Consortium and has received funding from the
Euratom research and training programme 2014–2018 and
2019–2020 under grant agreement No. 633053. The views
and opinions expressed herein do not necessarily reflect those
of the European Commission.
Appendix. List of JET pulses
What follows is a list of the JET pulses used in the analysis
shown in figures 8and 10, and discussed in section 6. The
pulses were arbitrarily chosen, without selecting for a specific
set of features or plasma configurations. The pulses featured a
broad range of parameters, including both L- and H-mode
scenarios, low and high power and gas levels. Across all
pulses, the NBI power ranged from ≈3.0 to ≈28 MW, the
vacuum toroidal magnetic field from 1.6 to 3.3 T, the total
ICRH power from ≈2.0 to ≈6.0 MW, the plasma current
from ≈1.1 to ≈3.5 MA, and the line integrated density from
≈8.0 ×10
19
m
−2
to ≈2.6 ×10
20
m
−2
.
86685, 86687, 86902, 86906, 86911, 86913, 86918,
86983, 87080, 87091, 87094, 87143, 87184, 87260, 87261,
87283, 87411, 87412, 87487, 87518, 87562, 87790, 87792,
87825, 87864, 87865, 87873, 89094, 89095, 89110, 89174,
89193, 89231, 89237, 89248, 89312, 89341, 89342, 89343,
89344, 89345, 89346, 89347, 89349, 89351, 89353, 89387,
89390, 89391, 89392, 89393, 89395, 89425, 89426, 89427,
89448, 89449, 89450, 89451, 89705, 89707, 89708, 89727,
89728.
ORCID iDs
A Pavone https://orcid.org/0000-0003-2398-966X
R C Wolf https://orcid.org/0000-0002-2606-5289
References
[1]Svensson J et al 1998 Real-time ion temperature profiles in the
JET nuclear fusion experiment ICANN 98. Perspectives in
Neural Computing (https://doi.org/10.1007/978-1-4471-
1599-1_30)
[2]Svensson J, von Hellermann M and König R W T 1999
Analysis of JET charge exchange spectra using neural
networks Plasma Phys. Control. Fusion 41 315–38
[3]Bishop C M, Roach C M and von Hellermann M G 1993
Automatic analysis of JET charge exchange spectra using
neural networks Plasma Phys. Control. Fusion 35 765–73
[4]Cannas B, Fanni A, Sonato P and Zedda M K 2007 A
prediction tool for real-time application in the disruption
protection system at JET Nucl. Fusion 47 1559–69
[5]Böckenhoff D et al 2018 Reconstruction of magnetic
configurations in w7-X using artificial neural networks Nucl.
Fusion 58 056009
[6]Blatzheim M et al 2019 Neural network regression approaches
to reconstruct properties of magnetic configuration from
wendelstein 7-X modeled heat load patterns Nucl. Fusion 59
126029
[7]Ferreira D R, Carvalho P J, Fernandes H and (JET
Contributors)2018 Full-pulse tomographic reconstruction
with deep neural networks Fusion Sci. Technol. 74 47–56
Figure 10. The electron density profile values inferred with model Bayesian inference (Minerva), on the x-axis, are compared to those
inferred with the network (NN), on the y-axis. Each plot shows the comparison for a spatial position along the profile. The real space
coordinates of the positions can vary through the experiments, so here they are labeled by an index starting from the outermost position at
index 0 to the innermost position at index 19. The solid line shows the y=xline. More than 200 hundred measured data points collected
across 65 pulses were used.
Figure 11. The mean relative error between electron density profile
inferred with Minerva and with the network. The average values
from the posterior samples in the Minerva case and the MC dropout
sampels in the network case have been used. The calculation has
been carried out for more than 200 measurements collected across 65
JET pulses.
12
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
[8]Meneghini O et al 2017 Self-consistent core-pedestal transport
simulations with neural network accelerated models Nucl.
Fusion 57 086034
[9]Meneghini O, Luna C J, Smith S P and Lao L L 2014
Modeling of transport phenomena in tokamak plasmas with
neural networks Phys. Plasmas 21 060702
[10]van de Plassche K L et al 2019 Fast modelling of turbulent
transport in fusion plasmas using neural networks Phys.
Plasmas 27 022310
[11]Rodriguez-Fernandez P et al 2018 Vitals: a surrogate-based
optimization framework for the accelerated validation of
plasma transport codes Fusion Sci. Technol. 74 65–76
[12]Pietrzyk Z A, Breger P and Summers D D R 1993
Deconvolution of electron density from lithium beam
emission profiles in high edge density plasmas Plasma Phys.
Control. Fusion 35 1725–44
[13]Brix M et al 2010 Upgrade of the lithium beam diagnostic at
jet Rev. Sci. Instrum. 81 10D733
[14]Brix M et al 2012 Recent improvements of the jet lithium beam
diagnostic Rev. Sci. Instrum. 83 10D533
[15]Réfy D I et al 2018 Sub-millisecond electron density profile
measurement at the jet tokamak with the fast lithium beam
emission spectroscopy system Rev. Sci. Instrum. 89 043509
[16]Kwak S, Svensson J, Brix J and Ghim Y-C 2017 Bayesian
electron density inference from jet lithium beam emission
spectra using gaussian processes Nucl. Fusion 57 036017
[17]Svensson J and Werner A 2007 Large scale bayesian data
analysis for nuclear fusion experiments IEEE Int. Symp. on
Intelligent Signal Processing pp 1–6
[18]Svensson J et al 2011 Modelling of jet diagnostics using
bayesian graphical models Contrib. Plasma Phys. 51 03
[19]Schmuck S, Svensson J, Figini L and Micheletti D 2019
Towards a bayesian equilibrium reconstruction using JETʼs
microwave diagnostics 07 46th European Physical Society
Conference on Plasma Physics (EPS 2019)(Milan, Italy)
[20]Hoefel U et al 2019 Bayesian modelling of microwave
radiometer calibration on the example of the wendelstein 7-x
electron cyclotron emission diagnostic Rev. Sci. Instrumen.
90 043502
[21]Langenberg A et al 2016 Forward modeling of x-ray imaging
crystal spectrometers within the Minerva Bayesian analysis
framework Fusion Sci. Technol. 69 560–7
[22]Abramovic I et al 2019 Forward modeling of collective
thomson scattering for wendelstein 7-x plasmas: electrostatic
approximation Rev. Sci. Instrum. 90 023501
[23]Pavone A et al 2019 Neural network approximation of
bayesian models for the inference of ion and electron
temperature profiles at w7-X Plasma Phys. Control. Fusion
61 075012
[24]Kwak S, Svensson J, Brix M and Ghim Y-C 2016 Bayesian
modelling of the emission spectrum of the joint european
torus lithium beam emission spectroscopy system Rev. Sci.
Instrum. 87 023501
[25]Schweinzer J et al 1992 Reconstruction of plasma edge density
profiles from li i (2s–2p)emission profiles Plasma Phys.
Control. Fusion 34 1173–83
[26]Pasqualotto R et al 2004 High resolution thomson scattering
for joint european torus (jet)Rev. Sci. Instrum. 75 3891–3
[27]Pearl J 1986 Fusion, propagation, and structuring in belief
networks Artif. Intell. 29 241–88
[28]Rasmussen C E and Williams C K I 2006 Gaussian Processes
for Machine Learning (Cambridge, MA: MIT Press)
[29]Summers H P 2004 The ADAS User Manual version 2.6
(http://www.adas.ac.uk)
[30]Klambauer G, Unterthiner T, Mayr A and Hochreiter S 2017
Self-normalizing neural networks Proceedings of the 31st
International Conference on Neural Information Processing
Systems (Long Beach, CA)pp 971–81
[31]Kingma D P and Ba J 2014 Adam: a method for stochastic
optimization arXiv:1412.6980
[32]Abadi M et al 2016 TensorFlow: large-scale machine learning
on heterogeneous systems arXiv:1603.04467
[33]Hinton G E et al 2012 Improving neural networks by
preventing co-adaptation of feature detectors 07
arXiv:1207.0580
[34]Srivastava N et al 2014 Dropout: a simple way to prevent
neural networks from overfitting J. Mach. Learn. Res. 15
1929–58
[35]Gal Y 2016 Uncertainty in deep learning PhD Thesis
University of Cambridge
[36]Mackay D J C 1991 Bayesian methods for adaptive models
PhD Thesis California Institute of Technology
[37]Jordan M I et al 1999 An introduction to variational methods
for graphical models Mach. Learn. 37 183–233
[38]Osband I 2016 Risk versus uncertainty in deep learning:bayes,
bootstrap and the dangers of dropout NIPS
[39]Hron J et al 2018 Variational Bayesian dropout: pitfalls and
fixes arXiv:1807.01969
[40]Huggins H J et al 2018 Practical bounds on the error of
bayesian posterior approximations: a nonasymptotic
approach arXiv:1809.09505
[41]Bishop C M 1994 Novelty detection and neural network
validation IEE Proc., Vis. Image Signal Process. 141
217–22
13
Plasma Phys. Control. Fusion 62 (2020)045019 A Pavone et al
Appendix A.
List of W7-X experiment numbers
The following experiments from the OP1.2 W7-X experimental campaign (2018)
were considered in the study of section 5.4.
Day 1, 08/07/2018:
20180807.013 20180807.015 20180807.016 20180807.018 20180807.019 20180807.020
20180807.021 20180807.022 20180807.023 20180807.024 20180807.026 20180807.027
20180807.030 20180807.031
Day 2, 08/14/2018:
20180814.006 20180814.007 20180814.008 20180814.009 20180814.010 20180814.011
20180814.012 20180814.013 20180814.014 20180814.015 20180814.016 20180814.017
20180814.018 20180814.019 20180814.020 20180814.021 20180814.022 20180814.023
20180814.024 20180814.025 20180814.026 20180814.027 20180814.028 20180814.029
20180814.030 20180814.031 20180814.032 20180814.033 20180814.034 20180814.035
20180814.036 20180814.037 20180814.038 20180814.039 20180814.040 20180814.041
20180814.042 20180814.043 20180814.044 20180814.045 20180814.046 20180814.047
20180814.048 20180814.049 20180814.050
Day 3, 10/10/2018:
20181010.005 20181010.006 20181010.007 20181010.008 20181010.009 20181010.010
20181010.011 20181010.012 20181010.013 20181010.014 20181010.015 20181010.016
20181010.018 20181010.019 20181010.020 20181010.021 20181010.022 20181010.023
20181010.026 20181010.027 20181010.028 20181010.029 20181010.030 20181010.031
20181010.032 20181010.033 20181010.034 20181010.035 20181010.036 20181010.037
20181010.038 20181010.040
105
Appendix A. List of W7-X experiment numbers
Day 4, 10/16/2018:
20181016.005 20181016.006 20181016.007 20181016.008 20181016.009 20181016.010
20181016.011 20181016.012 20181016.013 20181016.014 20181016.015 20181016.016
20181016.017 20181016.018 20181016.019 20181016.020 20181016.021 20181016.022
20181016.023 20181016.024 20181016.025 20181016.026 20181016.027 20181016.031
20181016.033 20181016.039 20181016.040
Day 5, 10/17/2018:
20181017.013 20181017.015 20181017.016 20181017.017 20181017.018 20181017.019
20181017.020 20181017.021 20181017.022 20181017.023 20181017.024 20181017.025
20181017.026 20181017.030 20181017.031 20181017.032 20181017.033 20181017.039
20181017.040 20181017.041
Day 6, 10/18/2018:
20181018.005 20181018.006 20181018.008 20181018.010 20181018.011 20181018.012
20181018.013 20181018.014 20181018.015 20181018.016 20181018.017 20181018.018
20181018.019 20181018.020 20181018.021 20181018.022 20181018.023 20181018.024
20181018.025 20181018.026 20181018.027 20181018.028 20181018.030 20181018.031
20181018.032 20181018.033 20181018.034 20181018.035 20181018.036 20181018.037
20181018.038 20181018.039 20181018.040 20181018.041
106
Acknowledgements
This work has been carried out within the framework of the EUROfusion con-
sortium and has received funding from the Euratom research and training pro-
gramme 2014-2018 and 2019-2020 under grant agreement No 633053. The views
and opinions expressed herein do not necessarily reflect those of the European
Commission.
I would like to thank Prof. Dr. Robert C. Wolf for giving me the opportunity
to carry out my Ph.D. work at the Max-Planck Institute for Plasma Physics in
Greifswald. During the years I worked under his supervision, he has always
demonstrated to be open minded and he has contributed in several occasions
with insightful comments and discussions, especially in the time when I had
been writing the thesis. I also wish to give him recognition for supervising my
work in such a way that I have always felt comfortable with.
A special mention goes to Dr. Jakob Svensson, who, truly, has been more than
a supervisor during these years. He has been an exceptional mentor, offering
teachings beyond what was required by the work of the thesis. Especially, I am
thankful for how he allowed communication and discussion of ideas to flow
between us, always being open and respectful. Under his supervision, I have
been able to carry out my work with a comfortable feeling of freedom and, at
the same time, I have always felt inspired to do my best. He has never failed to
show trust in my abilities, and this allowed me to grow as a scientist and as a
person.
This work was possible thanks to the contribution of many people. Among
these, I would like to mention Dr Andreas Langenberg and Dr Matthias Brix for
always being available to provide help and suggestions when I applied my re-
search to the measurements they are responsible for; Dr Udo Hoefel and Sehyun
Kwak for being not just brilliant co-workers, but also friends and office mates
whose company has been much needed in several occasions. Many others have
contributed during these years: some relations have evolved into friendships,
lasting until now, despite the distance that might separate us; many relations
have stayed professionals, providing invaluable support in many occasions. Al-
107
though you all deserve to be mentioned, I will not attempt the impossible task
of listing here all your names without forgetting someone. Nevertheless, I wish
to thank you all.
Last but not least, I want to mention the immense gratitude that I feel towards
the people that are part of my family. Despite the distance which separates us,
they have never failed to provide me with the warmth of their closeness, in
such a way that made possible for me to conduct my research - in professional
and personal life - with a sense of infused courage and safety. The extent to
which I feel free and encouraged to express my abilities in this world is deeply
rooted in that sense of safety and I owe it to them. My affection and love go to
all of them.
Statutory declaration
I hereby declare in accordance with the examination regulations that I myself
have written this document, that no other sources as those indicated were used
and all direct and indirect citations are properly designated, that the document
handed in was neither fully nor partly subject to another examination procedure
or published and that the content of the electronic exemplar is identical to the
printing copy.
Greifswald, 8th October 2020
Andrea Pavone
109
List of Figures
2.1. Tokamak and stellarator design concepts . . . . . . . . . . . . . 21
2.2. Sketch of the toroidal and poloidal direction in a torus. . . . . . 22
3.1. Example of a Minerva model graph . . . . . . . . . . . . . . . . 29
4.1. The architecture of the multi layer perceptron . . . . . . . . . . 36
5.1. Sketch of the XICS system . . . . . . . . . . . . . . . . . . . . . 45
5.2. Sketch of the lithium beam system . . . . . . . . . . . . . . . . . 47
5.3. Bremsstrahlung emission graph . . . . . . . . . . . . . . . . . . 51
5.4. Bremsstrahlung forward model calculation . . . . . . . . . . . . 52
5.5. Gaussian process samples . . . . . . . . . . . . . . . . . . . . . . 54
5.6. Joint probability reconstruction . . . . . . . . . . . . . . . . . . 55
5.7. Joint probability reconstruction error study . . . . . . . . . . . . 56
5.8. Plasma discharge reconstruction . . . . . . . . . . . . . . . . . . 57
111
Publications as first author
In this chapter, I have collected a list of the peer-reviewed article of which I am
first author in the format of a bibliography. They have been published during
the years of my Ph.D. studies and constitute the main outcome of my work
during this time.
Peer-reviewed articles
[1]
A. Pavone et al. »Neural network approximated Bayesian inference of
edge electron density profiles at JET«. In: Plasma Physics and Controlled
Fusion, Vol. 62.4 (Mar. 2020), page 045019. doi:
10.1088/1361-6587/
ab7732
. The publication describes how neural networks can be trained
to approximate Bayesian inference based on an existing Bayesian model
for the reconstruction of the edge electron density profiles at the JET
tokamak. We demonstrate here that the method previously developed
and tested at the W7-X stellarator can be generalized to a completely
new physics system, therefore hinting to the possibility, in future, to
automatise the procedure for approximating any physics model imple-
mented within the same Bayesian modeling framework. The method is
also extensively tested on a large number of different experimental cases,
and compared to the conventional Bayesian inference results. We show
also how the uncertainties of the network prediction can be calculated
with an approach which relies on the deep learning technique known as
dropout training and an interpretation of the training problem as a vari-
ational inference problem in the context of Bayesian neural networks. I
have written the entire text of the publication, developed the neural net-
work approximation based on the Bayesian model and the dropout based
uncertainty estimation, as well as performed all the calculations that al-
lowed to obtain the results. M. Brix contributed as main responsible for
the diagnostic device with his deep knowledge and understanding of the
115
Publications as first author
physics involved in the measurements at the JET tokamak. J. Svensson
is the author of the Minerva Bayesian modeling framework which is
used to develop Bayesian models and carry out Bayesian inference, and
contributed with his insights and wide knowledge of Bayesian inference
and neural networks. S. Kwak is the main author of the physics model for
the inference of the density profiles as implemented within the Minerva
framework and contributed with invaluable comments and insightful
discussions. All co-authors have contributed by extensively engaging in
insightful scientific discussions and making possible the different meas-
urements involved in the work.
[2]
A. Pavone et al. »Measurements of visible bremsstrahlung and auto-
matic Bayesian inference of the effective plasma charge Zeff at W7-X«. In:
Journal of Instrumentation, Vol. 14.10 (Oct. 2019), pages C10003–C10003.
doi:
10 . 1088 / 1748 - 0221 / 14 / 10 / c10003
. The publication de-
scribes the first measurements of visual bremsstrahlung and inference
of the plasma effective charge
Zeff
available at W7-X. Both the features
of the measurement device and the Bayesian model and inference are
described. Especially, the Bayesian inference is run automatically after
each plasma shot and it includes the inference of the electron temperat-
ure and density profiles from independent measurements of the Thom-
son scattering diagnostic based on Gaussian processes. I have written
the entire text of the publication and developed and implemented the
Bayesian models for the inference of the plasma effective charge and
the electron temperature and density profiles. These models were de-
veloped within the Minerva framework, the original author of which is
J. Svensson, among the co-authors. U. Hergenhahn and M. Krychowiak
have contributed as main responsibles for the actual realization of the
diagnostic device and their deep understanding of the functioning of the
measurement processes and the physics involved. The other coauthors
contributed as diagnosticians involved in the development and operation
of the Thomson scattering diagnostic, providing the corresponding data
and by extensively engaging in insightful scientific discussions.
[3]
A. Pavone et al. »Neural network approximation of Bayesian models
for the inference of ion and electron temperature profiles at W7-X«. In:
Plasma Physics and Controlled Fusion, Vol. 61.7 (May 2019), page 075012.
doi:
10.1088/1361-6587/ab1d26
. The publication describes the ap-
116
Peer-reviewed articles
plication of neural network as approximated Bayesian inference to the
problem of the temperature profile reconstruction from X-ray imagining
diagnostic measurements. In particular, we describe how the network
can be trained to appoximate the inference carried out with an existing
Bayesian model by generating the training data with the given model,
with the advantage of a significative acceleration of the data analysis.
At the same time, we show how the quality of the training set can be
assesed with respect to how well they describe the experimental meas-
urements, a problem that can easily arise and has to be tackled when
dealing with training data generated synthetically. We also demonstrate
the performance of the network on a number of experimental cases,
comparing its reconstruction with the profiles infered with the Bayesian
model. I am the original and exclusive author of all the text and content in
the publication, as well as the person who developed the approximation
framework which the training method is based on, the neural network
model and other mentioned algorithms, and made the final evaluation
and comparison on experimental data. The data used in the comparison
and inferred with the Bayesian model were provided by A. Langenberg,
who is also the main author of the model and contributed with his deep
knowledge of the diagnostic and the physics involved in the measure-
ment processes. J. Svensson is the main author of the Minerva framework,
where the Bayesian model is implemented and which is used to carry out
Bayesian inference and generate the data to train the network. He has
also contributed with invaluable and copious insights and discussions.
All co-authors have contributed by extensively engaging in insightful
scientific discussions and making possible the different measurements
involved in the work.
[4]
A. Pavone et al. »Bayesian uncertainty calculation in neural network
inference of ion and electron temperature profiles at W7-X«. In: Review
of Scientific Instruments, Vol. 89.10 (2018). doi:
10.1063/1.5039286
. In
the publication we describe how uncertainties of the neural network out-
put can be calculated in a Bayesian framework. The framework is known
as Bayesian neural network (BNN) and consists of an interpretation of
the network model as a Bayesian model and the training problem as an
inference problem. Under the so called Laplace approximation, it is pos-
sible to derive an analytical expression of the error bars dependent on the
117
Publications as first author
Hessian matrix of the training loss funciton. We apply this calculation
to the case of the inference of ion and electron temperature profiles at
W7-X, accounting also for noise in the input data and the presence of dif-
ferent local minima after training by a Monte Carlo scheme of sampling
from the input space and a committee of networks. I have written the
entire text of the publication, developed and implemented the BNN mod-
els and sampling scheme for the case under consideration, and all the
calculations involved in the achievement of the results and applied the
method to the experimental data. The measured data were provided by A.
Langenberg, who is responsible for the diagnostic performing the meas-
urement at W7-X. J. Svensson contributed vastly by providing invaluable
insights, and spending his time in precious conversation with me. All co-
authors have contributed by extensively engaging in insightful scientific
discussions and making possible the different measurements involved in
the work.
118
Publications as coauthor
The articles of which I am coauthor represent a collection of works which can
be seen as a spin-off activity of the main research presented in this thesis. I
have contributed to these articles in various ways, in general by developing
Bayesian models and applying Bayesian inference to different physics systems.
Because Bayesian inference is one of the main subjects involved in this thesis, I
consider these works to be a relevant corollary of my main project. Here, I will
provide, case by case, an explanation of how I have contributed to them.
Peer-reviewed articles
[1]
I. Abramovic et al., amongst them
A. Pavone
. »Forward modeling of
collective Thomson scattering for Wendelstein 7-X plasmas: Electrostatic
approximation«. In: Review of Scientific Instruments, Vol. 90.2 (2019),
page 023501. doi:
10.1063/1.5048361
. eprint:
https://doi.org/
10.1063/1.5048361
. The collective Thomson scattering diagnostic
allows to perform local measurements of the ion temperature in the core
region of Wendelstein 7-X. A forward model of this diagnostic has been
implemented within the Bayesian modeling framework Minerva. I have
collaborated to the implementation of such model and the Bayesian infer-
ence procedure which allowed to obtain the ion temperature information
given the measured emission. The inference was then performed on data
collected during the OP 1.1 and OP 1.2 campaigns.
[2]
T. Andreeva et al., amongst them
A. Pavone
. »Equilibrium eval-
uation for Wendelstein 7-X experiment programs in the first divertor
phase«. In: Fusion Engineering and Design, (2019). doi:
https://doi.
org/10.1016/j.fusengdes.2018.12.050
. In this work, the first
author makes use of the plasma effective charge
Zeff
in order to assess
the overall contamination of low-Z impurities in the plasma and estimate
the pressure profile necessary for the equilibrium calculation.
Zeff
can
119
Publications as coauthor
be infered by Bremsstrahlung measurements. I have contributed to the
development of a Bayesian model of the plasma Bremsstrahlung emis-
sion, which allowed to infer a line-of-sight averaged plasma effective
charge
Zeff
from measurements performed with a spectrometer. Elec-
tron density and temperature profiles are also required to calculate the
Bremsstrahlung emission; I have used Bayesian inference and Gaussian
processes to infer them from independent diagnostic measurements.
[3]
F. Effenberget al., amongst them
A. Pavone
. »First demonstration of
radiative power exhaust with impurity seeding in the island divertor at
Wendelstein 7-X«. In: Nuclear Fusion, Vol. 59.10 (Aug. 2019), page 106020.
doi:
10.1088/1741-4326/ab32c4
. Impurities play a crucial role
in the work presented in this paper; their behavior can be studied and
understood by mean of the plasma effective charge
Zeff
, which was estim-
ated from the measurements of a single line-of-sight spectrometer. I have
extensively worked on the Bayesian model and inference necessary to
infer
Zeff
, which also included the Bayesian inference of independently
measured electron density and temperature profiles.
[4]
D. Zhang et al., amongst them
A. Pavone
. »First observation of a
stable highly-dissipative divertor plasma regime on the Wendelstein 7-X
stellarator«. In: Phys. Rev. Lett., Vol. 123 (2 July 2019), page 025002. doi:
10.1103/PhysRevLett.123.025002
. The work presented in this
paper is based on the radiation behavior of impurities. A quantity which
describes the overall low-Z plasma impurity concentration is
Zeff
. This
can be estimated by diagnostic measurements of the Bremmstrahlung
emission in a given wavelength range. I have extensively worked on the
Bayesian inference and modeling which allowed the estimation of
Zeff
from spectral measurements, collected during the OP 1.2 experimental
campaign. The
Zeff
found in this way was then used to substantiate the
radiation behavior of the impurities.
[5]
A. Langenberg et al., amongst them
A. Pavone
. »Prospects of X-ray
imaging spectrometers for impurity transport: Recent results from the
stellarator Wendelstein 7-X (invited)«. In: Review of Scientific Instruments,
Vol. 89.10 (2018), 10G101. doi:
10.1063/1.5036536
. eprint:
https:
//doi.org/10.1063/1.5036536
. The X-ray imaging spectrometer
diagnostic collects the spectral emission resulting from the interacation
between impurities and the plasma electrons along several line of sight.
120
Peer-reviewed articles
Traditional Bayesian inference can be applied to the measurements in
order to infer plasma profiles. This procedures, although accurate, is
typically slow. For this reason, I have developed a fast neural network
inversion based on the corresponding Bayesian model, which allows to
quickly infer electron and ion temperature profiles. As a consequence,
I have extensively studied and contributed to the Bayesian model and
inference, of which the first author of this paper was the main developer.
[6] I. Abramovic et al., amongst them A. Pavone. »Collective Thomson
scattering data analysis for Wendelstein 7-X«. In: Journal of Instrumenta-
tion, Vol. 12.08 (Aug. 2017), pages C08015–C08015. doi:
10.1088/1748-
0221/12/08/c08015
. The collective Thomson scattering diagnostic
allows to perform local measurements of the ion temperature in the core
region of Wendelstein 7-X. A forward model of this diagnostic has been
implemented within the Bayesian modeling framework Minerva. I have
extensively taken part in the development of the Bayesian model and
contributed significantly to the study of the collected measurements,
carried out within the framework of Bayesian inference.
121
Bibliography
[1]
J. Svensson and A. Werner. »Large Scale Bayesian Data Analysis
for Nuclear Fusion Experiments«. In: IEEE International Symposium on
Intelligent Signal Processing, (2007), pages 1–6. doi:
10.1109/WISP.
2007.4447579.
[2]
A. Vaswani et al. »Attention Is All You Need«. 2017. arXiv:
1706.037
62 [cs.CL].
[3]
Z. Lan et al. »ALBERT: A Lite BERT for Self-supervised Learning of
Language Representations«. 2019. arXiv: 1909.11942 [cs.CL].
[4]
O. Vinyals et al. »Grandmaster level in StarCraft II using multi-agent
reinforcement learning«. In: Nature, Vol. 575.7782 (2019), pages 350–354.
doi:10.1038/s41586-019-1724-z.
[5]
OpenAI et al. »Solving Rubik’s Cube with a Robot Hand«. 2019. arXiv:
1910.07113 [cs.LG].
[6] H. Jeffreys.Theory of Probability. Third. Oxford, 1961.
[7]
E. T. Jaynes.Probability theory: The logic of science. Cambridge: Cam-
bridge University Press, 2003.
[8]
A. Langenberg et al. »Inference of temperature and density profiles
via forward modeling of an x-ray imaging crystal spectrometer within
the Minerva Bayesian analysis framework«. In: Review of Scientific In-
struments, Vol. 90.6 (2019), page 063505. doi:
10.1063/1.5086283
.
eprint: https://doi.org/10.1063/1.5086283.
[9]
U. Hoefel et al. »Bayesian modeling of microwave radiometer calib-
ration on the example of the Wendelstein 7-X electron cyclotron emis-
sion diagnostic«. In: Review of Scientific Instruments, Vol. 90.4 (2019),
page 043502. doi:
10.1063/1.5082542
. eprint:
https://doi.org/
10.1063/1.5082542.
123
Bibliography
[10]
S. Kwak et al. »Bayesian electron density inference from JET lithium
beam emission spectra using Gaussian processes«. In: Nuclear Fusion,
Vol. 57.3 (2017), page 036017. doi:10.1088/1741-4326/aa5072.
[11]
S. Kwak et al. »Bayesian modelling of Thomson scattering and mul-
tichannel interferometer diagnostics using Gaussian processes«. Nuclear
Fusion. 2020. url:
http://iopscience.iop.org/10.1088/1741-
4326/ab686e.
[12]
S. Kwak et al. »Bayesian modelling of multiple diagnostics at Wendel-
stein 7-X«. 2020. to be submitted.
[13]
A. Langenberg et al. »Forward modeling of X-ray imaging crystal
spectrometers within the Minerva Bayesian analysis framework«. In:
Fusion Science and Technology, (2016). doi:10.13182/FST15-181.
[14]
A. Langenberg et al. »Prospects of X-ray imaging spectrometers for
impurity transport: Recent results from the stellarator Wendelstein 7-X
(invited)«. In: Review of Scientific Instruments, Vol. 89.10 (2018), 10G101.
doi:
10.1063/1.5036536
. eprint:
https://doi.org/10.1063/1.
5036536.
[15]
M. Brix et al. »Recent improvements of the JET lithium beam diagnostic«.
In: Review of Scientific Instruments, Vol. 83.10 (2012), page 10D533. doi:
10.1063/1.4739411
. eprint:
https://doi.org/10.1063/1.
4739411.
[16]
S. Kwak et al. »Bayesian modelling of the emission spectrum of the
Joint European Torus Lithium Beam Emission Spectroscopy system«.
In: Review of Scientific Instruments, Vol. 87.2 (2016), page 023501. doi:
10.1063/1.4940925
. eprint:
https://doi.org/10.1063/1.
4940925.
[17]
G. H. Neilson, editor. Magnetic Fusion Energy: From Experiments to
Power Plants. Woodhead Publishing, 2016.
[18]
A. Pavone et al. »Neural network approximation of Bayesian models
for the inference of ion and electron temperature profiles at W7-X«. In:
Plasma Physics and Controlled Fusion, Vol. 61.7 (May 2019), page 075012.
doi:10.1088/1361-6587/ab1d26.
124
Bibliography
[19]
A. Pavone et al. »Bayesian uncertainty calculation in neural network
inference of ion and electron temperature profiles at W7-X«. In: Review
of Scientific Instruments, Vol. 89.10 (2018). doi:10.1063/1.5039286.
[20]
A. Pavone et al. »Neural network approximated Bayesian inference of
edge electron density profiles at JET«. In: Plasma Physics and Controlled
Fusion, Vol. 62.4 (Mar. 2020), page 045019. doi:
10.1088/1361-6587/
ab7732.
[21]
S. Bozhenkov et al. »The Thomson scattering diagnostic at Wendelstein
7-X and its performance in the first operation phase«. In: Journal of
Instrumentation, Vol. 12.10 (Oct. 2017), P10004–P10004. doi:
10.1088/
1748-0221/12/10/p10004.
[22]
T. Klinger et al. »Overview of first Wendelstein 7-X high-performance
operation«. In: Nuclear Fusion, Vol. 59.11 (June 2019), page 112004. doi:
10.1088/1741-4326/ab03a7.
[23]
P. Helander et al. »Stellarator and tokamak plasmas: a comparison«. In:
Plasma Physics and Controlled Fusion, Vol. 54.12 (Nov. 2012), page 124009.
doi:10.1088/0741-3335/54/12/124009.
[24]
P. Helander. »Theory of plasma confinement in non-axisymmetric
magnetic fields«. In: Reports on Progress in Physics, Vol. 77.8 (July 2014),
page 087001. doi:10.1088/0034-4885/77/8/087001.
[25]
G. Grieger et al. »Physics optimization of stellarators«. In: Physics of
Fluids B: Plasma Physics, Vol. 4.7 (1992), pages 2081–2091. doi:
10.1063/
1.860481. eprint: https://doi.org/10.1063/1.860481.
[26]
T. S. Pedersen et al. »First results from divertor operation in Wendel-
stein 7-X«. In: Plasma Physics and Controlled Fusion, Vol. 61.1 (Nov. 2018),
page 014035. doi:10.1088/1361-6587/aaec25.
[27]
P. Rebut,R. Bickerton and B. Keen. »The Joint European Torus:
installation, first results and prospects«. In: Nuclear Fusion, Vol. 25.9
(Sept. 1985), pages 1011–1022. doi:10.1088/0029-5515/25/9/003.
[28]
D. Meade. »Tokamak Fusion Test Reactor D-T results«. In: Fusion Engin-
eering and Design, Vol. 30.1-2 (May 1995), pages 13–23. doi:
10.1016/
0920-3796(94)00398-Q.
125
Bibliography
[29]
J. Team. »Fusion energy production from a deuterium-tritium plasma in
the JET tokamak«. In: Nuclear Fusion, Vol. 32.2 (Feb. 1992), pages 187–203.
doi:10.1088/0029-5515/32/2/i01.
[30]
J. Svensson. »Connecting theory and experiments in complex systems«.
Plasma Physics Summer School Talk. 2008-present.
[31]
R. Fischer et al. »Thomson scattering analysis with the Bayesian prob-
ability theory«. In: Plasma Physics and Controlled Fusion, Vol. 44.8 (July
2002), pages 1501–1519. doi:10.1088/0741-3335/44/8/306.
[32]
R. Fischer et al. »Integrated Data Analysis of Profile Diagnostics at
ASDEX Upgrade«. In: Fusion Science and Technology, Vol. 58.2 (2010),
pages 675–684. doi:
10.13182/FST10-110
. eprint:
https://doi.
org/10.13182/FST10-110.
[33]
G. Cottrell. »Maximum entropy and plasma physics«. 1990. url:
h
ttp://www.euro-fusionscipub.org/wp-content/uploads/
2014/11/JETP90004.pdf.
[34]
N. A. Pablant et al. »Investigation of ion and electron heat transport
of high-TeECH heated discharges in the large helical device«. In: Plasma
Physics and Controlled Fusion, Vol. 58.4 (Jan. 2016), page 045004. doi:
10.1088/0741-3335/58/4/045004.
[35]
J. Pearl.Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann,
1988. doi:https://doi.org/10.1016/C2009-0-27609-4.
[36]
R. Hooke and T. A. Jeeves. »“ Direct Search” Solution of Numerical
and Statistical Problems«. In: J. ACM, Vol. 8.2 (Apr. 1961), pages 212–229.
doi:10.1145/321062.321069.
[37]
F. Rosenblatt.Principles of Neurodynamics: Perceptrons and the Theory
of Brain Mechanisms. Spartan Books, 1961.
[38]
R. Hahnloser et al. »Digital selection and analogue amplification coex-
ist in a cortex-inspired silicon circuit«. In: Nature, Vol. 405 (2000). doi:
10.1038/35016072.
126
Bibliography
[39]
X. Glorot,A. Bordes and Y. Bengio. »Deep Sparse Rectifier Neural
Networks«. In: Proceedings of the Fourteenth International Conference on
Artificial Intelligence and Statistics. Edited by G. Gordon,D. Dunson
and M. Dudík. Vol. 15. Proceedings of Machine Learning Research. Fort
Lauderdale, FL, USA: PMLR, Nov. 2011, pages 315–323. url:
http://
proceedings.mlr.press/v15/glorot11a.html.
[40]
C. M. Bishop.Neural Networks for Pattern Recognition. Oxford University
Press, 1995.
[41]
Theano Development Team. »Theano: A Python framework for fast
computation of mathematical expressions«. In: arXiv e-prints, Vol. abs/1605.02688
(2016). url:http://arxiv.org/abs/1605.02688.
[42]
Martin Abadi et al. »TensorFlow: Large-Scale Machine Learning on
Heterogeneous Systems«. Software available from tensorflow.org. 2015.
url:https://www.tensorflow.org/.
[43]
D. J. MacKay. »Bayesian Interpolation«. In: Neural Computation, Vol. 4
(1991), pages 415–447.
[44]
C. M. Bishop,C. M. Roach and M. G. von Hellermann. »Automatic
analysis of JET charge exchange spectra using neural networks«. In:
Plasma Physics and Controlled Fusion, Vol. 35 (1993), pages 765–773.
[45]
J. Svensson,M. von Hellermann and R. W. T. König. »Analysis of
JET charge exchange spectra using neural networks«. In: Plasma Physics
and Controlled Fusion, Vol. 41 (1999), pages 315–338.
[46]
J. Svensson et al. »Real-time ion temperature profiles in the JET nuclear
fusion experiment«. In: ICANN 98. Perspectives in Neural Computing,
(1998).
[47]
D. Böckenhoff et al. »Reconstruction of magnetic configurations in
W7-X using artificial neural networks«. In: Nuclear Fusion, Vol. 58.5 (Mar.
2018), page 056009. doi:10.1088/1741-4326/aab22d.
[48]
M. Blatzheim and D. B. and. »Neural network regression approaches
to reconstruct properties of magnetic configuration from Wendelstein
7-X modeled heat load patterns«. In: Nuclear Fusion, Vol. 59.12 (Oct. 2019),
page 126029. doi:10.1088/1741-4326/ab4123.
127
Bibliography
[49]
D. R. Ferreira et al. »Full-Pulse Tomographic Reconstruction with
Deep Neural Networks«. In: Fusion Science and Technology, Vol. 74.1-2
(2018), pages 47–56. doi:
10.1080/15361055.2017.1390386
. eprint:
https://doi.org/10.1080/15361055.2017.1390386.
[50]
O. Meneghini et al. »Modeling of transport phenomena in tokamak
plasmas with neural networks«. In: Physics of Plasmas, Vol. 21.6 (2014),
page 060702. doi:
10.1063/1.4885343
. eprint:
https://doi.org/
10.1063/1.4885343.
[51]
O. Meneghini et al. »Self-consistent core-pedestal transport simula-
tions with neural network accelerated models«. In: Nuclear Fusion, Vol. 57.8
(July 2017), page 086034. doi:10.1088/1741-4326/aa7776.
[52]
K. L. van de Plassche et al. »Fast modelling of turbulent transport
in fusion plasmas using neural networks«. 2019. arXiv:
1911.05617
[physics.plasm-ph].
[53]
D. J. C. Mackay.Bayesian Methods for Adaptive Models. PhD thesis.
California Institute of Technology, 1991.
[54]
Y. Gal.Uncertainty in Deep Learning. PhD thesis. University of Cam-
bridge, 2016.
[55]
A. Pavone et al. »Measurements of visible bremsstrahlung and auto-
matic Bayesian inference of the effective plasma charge Zeff at W7-X«. In:
Journal of Instrumentation, Vol. 14.10 (Oct. 2019), pages C10003–C10003.
doi:10.1088/1748-0221/14/10/c10003.
[56]
R. S. Sutherland. »Accurate free-free Gaunt factors for astrophysical
plasmas«. In: Monthly Notices of the Royal Astronomical Society, Vol. 300
(1998), pages 321–330. doi:10.1046/j.1365-8711.1998.01687.x.
[57]
D. J. MacKay. »Bayesian model comparison and backprop nets«. In:
Advances in neural information processing systems. 1992, pages 839–846.
[58]
O. P. Ford et al. »Charge exchange recombination spectroscopy at
Wendelstein 7-X«. In: Review of Scientific Instruments, Vol. 91.2 (2020),
page 023507. doi:
10.1063/1.5132936
. eprint:
https://doi.org/
10.1063/1.5132936.
[59]
L. Vanó and other. »Studies on carbon content and transport with
Charge Exchange Spectroscopy on W7-X«. In: EPS Proceedings, (2019).
128