Eur. Phys. J. E (2023) 46:48
https://doi.org/10.1140/epje/s10189-023-00309-3
The European Physical Journal E
Regular Article - Flowing Matter
Optimal navigation of a smart active particle: directional
and distance sensing
Mischa Putzke^a and Holger Stark^b
Institut für Theoretische Physik, Technische Universität Berlin, Hardenbergstr. 36, 10623 Berlin, Germany
Received 4 February 2023 / Accepted 5 June 2023 / Published online 19 June 2023
© The Author(s) 2023
Abstract We employ Q learning, a variant of reinforcement learning, so that an active particle learns by
itself to navigate on the fastest path toward a target while experiencing external forces and flow fields.
As state variables, we use the distance and direction toward the target, and as action variables the active
particle can choose a new orientation along which it moves with constant velocity. We explicitly investigate
optimal navigation in a potential barrier/well and a uniform/Poiseuille/swirling flow field. We show that
Q learning is able to identify the fastest path and discuss the results. We also demonstrate that Q learning
and applying the learned policy work when the particle orientation experiences thermal noise. However,
the successful outcome strongly depends on the specific problem and the strength of noise.
1 Introduction
Active matter refers to materials that consist of self-propelled entities such as active particles,
artificial microswimmers, and microorganisms, which exhibit versatile collective behavior and dynamic
patterns [1–3]. These systems are characterized by their ability to use an internal energy depot or
energy from the environment to generate active motion, for example, by deformations [1,2,4–8].
Examples of active matter are biological systems including bacteria [9], schools of fish, and swarms
of birds [10,11] as well as artificial systems such as suspensions of self-propelled colloids or other
synthetic microswimmers [12–14]. Over the years, interest in controlling active motion with external
fields [15] has increased, in particular, with gravitational [16–22], magnetic [23,24], and flow
fields [25–29].
With the ability to specifically manipulate active motion, optimizing the traveled path in a complex
environment, for example, by finding the fastest trajectory, has come into focus. While optimal search
strategies depend on the environment [30], Ref. [31] suggests minimal navigation strategies, and in
Refs. [32,33] optimal navigation is achieved by minimizing travel time. This can be done even on
curved manifolds such as a sphere [34] and using optimal control theory [35,36]. Furthermore, the
noisy pursuit of a self-steering active particle has been studied [37].
For living organisms, there are many examples where optimal navigation is crucial, such as finding
food sources [38,39] or escaping from predators [40]. While organisms have learned their optimal
navigation strategy through evolution, reinforcement learning [41] offers a promising method of
training artificial microswimmers to steer optimally toward a target. Applications range from
robotics [42,43] to biology and active matter [14,33,44–47]. Reinforcement learning is a type of
machine learning where an agent learns a specific task by taking actions in an environment and
receiving feedback in the form of rewards. Active particles can use this algorithm to learn how to
navigate optimally based on their sensory inputs and reward signals. Examples are microswimmers
that move toward a target by adjusting the orientations along which self-propulsion occurs. In
recent years, it has been demonstrated that reinforcement learning is a well-suited method for
finding optimal navigation solutions in, for example, complex potentials [33,48,49], turbulent
flows [50–53], as well as chaotic flows [54].
a e-mail: m.putzk[email protected]erlin.de
b e-mail: holger.stark@tu-berlin.de (corresponding author)
In this article, we employ Q learning, a variant of reinforcement learning, so that the agent or
microswimmer learns by itself to move on the fastest path from the starting point to the target
under the action of forces and flow fields. This is the traditional Zermelo navigation problem [55].
In contrast to our previous work [33], the microswimmer can sense the direction and distance to a
target, which we find potentially easier to realize than monitoring the position. The smart active
particle first moves deterministically and can control its orientation. We show that the
microswimmer with the new state variables is able to identify and navigate on the fastest path in
different complex landscapes such as potential barriers and wells as well as uniform, Poiseuille,
and swirling flow fields. In addition, we show that learning optimal navigation and applying the
learned policy also works under thermal noise, which the particle orientation experiences during
training. However, the outcome strongly depends on the specific problem and the strength of noise.
The article is structured as follows. In Sect. 2, we introduce our model, the method of Q learning,
the choice of state and action variables, and the system parameters. In Sect. 3, we present our
results for the potential barrier/well and the uniform/Poiseuille/swirling flow fields. We close
with conclusions.
2 Model
2.1 Equations of motion
Our goal is to train the microswimmer such that it optimizes its travel time moving in two dimensions
from a starting to a target position. Along its path, it experiences an additional drift-velocity field,
which represents some complex environment (see Fig. 1). We simply model the microswimmer as an active
particle with negligible inertia that moves with speed $v_0$ along the direction
$\mathbf{e} = (\cos\theta, \sin\theta)$, where $\theta$ is the angle with respect to the $x$ axis
(see Fig. 1). Rescaling all velocities by $v_0$, the space-dependent total velocity of the active
particle is
$$\mathbf{v} = \mathbf{e} + \mathbf{v}_D . \qquad (1)$$
We assume that the active particle can control its orientation $\mathbf{e}$ in order to find the
optimal travel time. For this, after every time interval $\Delta t$ it senses its state and uses
Q learning to set its new orientation $\mathbf{e}$, as we explain in Sects. 2.2 and 2.3.
Furthermore, we do not include any external torque. However, for the drift-velocity field
$\mathbf{v}_D(\mathbf{r})$ we will consider one type of flow field (Poiseuille flow) with nonzero
vorticity, which rotates the particle orientation with angular velocity
$\omega_D = |\mathrm{curl}\,\mathbf{v}_D|/2$. Thus, during time step $\Delta t$, the particle's
orientation evolves according to $\dot{\theta} = \omega_D$.
Fig. 1 The active particle with orientation unit vector $\mathbf{e}$ is trained to move from the
starting to the target position on the fastest path crossing a region with a prescribed
drift-velocity field $\mathbf{v}_D(\mathbf{r})$. To locate its position, the particle can sense the
direction angle $\varphi$ and distance $\rho$ to the target
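The optimization of the travel time can also be formulated as a typical variational principle. The
time to travel from the starting position at $\mathbf{r}_i$ to the target position at
$\mathbf{r}_f$ is
$$T = \int_{\mathbf{r}_i}^{\mathbf{r}_f} dt = \int_{\mathbf{r}_i}^{\mathbf{r}_f} \frac{ds}{v} , \qquad (2)$$
where $v = |\mathbf{v}|$. We parametrize the particle path $\mathbf{r}(s)$ with the arclength $s$,
introduce the unit tangent along the path, $\mathbf{t} = d\mathbf{r}/ds$, and write the total
velocity as $\mathbf{v} = v\,\mathbf{t}$. Writing Eq. (1) as $\mathbf{e} = v\,\mathbf{t} - \mathbf{v}_D$,
taking the square with $|\mathbf{e}| = 1$, and solving for $v$, we obtain
$$v = \mathbf{t}\cdot\mathbf{v}_D + \sqrt{1 - \left[ v_D^2 - (\mathbf{t}\cdot\mathbf{v}_D)^2 \right]} . \qquad (3)$$
With this expression, the variation of the travel time, $\delta T = 0$, can be formulated in order
to find a minimum. Typically, the minimum has to be calculated numerically, and an analytic
solution is possible only in special cases.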
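As an illustration of Eqs. (2) and (3), the following minimal Python sketch evaluates the travel time along a prescribed polyline path by summing $ds/v$ over short segments. It is not the numerical minimization used in the article; the function names, the midpoint rule, and the example drift field are assumptions made here for illustration only.

```python
import math

def speed_along_path(tangent, v_drift):
    """Speed v along the unit tangent t for drift velocity v_D, following Eq. (3)."""
    t_dot_vd = tangent[0] * v_drift[0] + tangent[1] * v_drift[1]
    vd_sq = v_drift[0] ** 2 + v_drift[1] ** 2
    radicand = 1.0 - (vd_sq - t_dot_vd ** 2)
    if radicand < 0.0:
        raise ValueError("drift too strong to follow this tangent")
    return t_dot_vd + math.sqrt(radicand)

def travel_time(points, drift_field):
    """Approximate T = integral of ds / v along a polyline (midpoint rule, an assumption)."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points[:-1], points[1:]):
        ds = math.hypot(x1 - x0, y1 - y0)
        tangent = ((x1 - x0) / ds, (y1 - y0) / ds)
        midpoint = (0.5 * (x0 + x1), 0.5 * (y0 + y1))
        v = speed_along_path(tangent, drift_field(midpoint))
        if v <= 0.0:
            raise ValueError("particle cannot advance along this segment")
        total += ds / v
    return total

# Hypothetical example: straight path from (-0.5, 0) to (0.5, 0)
# in a uniform cross flow of strength k along y.
k = 0.6
path = [(-0.5 + i / 100, 0.0) for i in range(101)]
T = travel_time(path, lambda r: (0.0, k))
```

For this straight path the tangent is perpendicular to the drift, so Eq. (3) reduces to $v = \sqrt{1 - k^2}$ and the sum approximates a unit path length divided by that speed.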
Besides treating the active particle fully deterministically, we will also explore how thermal
fluctuations influence the learning of the optimal path and the application of the learned optimal
policy. Since we will work at large Péclet numbers, where translational noise can be discarded, we
concentrate on thermal noise in the orientation. Thus, we use the model of an active Brownian
particle and write down an overdamped Langevin equation for the time derivative of the orientation
angle $\theta$. Including the nonzero vorticity for one type of flow field, the Langevin equation
becomes
$$\dot{\theta} = \omega_D + \sqrt{D_R L / v_0}\,\eta . \qquad (4)$$
Here, $\eta$ represents standard Gaussian white noise with zero mean $\langle \eta(t) \rangle = 0$
and unit variance $\langle \eta(t)\eta(t') \rangle = \delta(t - t')$. The thermal rotational
diffusion coefficient $D_R$ ensures the validity of the fluctuation-dissipation theorem. Time is
rescaled by the characteristic time scale $L/v_0$, and $L$ is a typical length, which we choose as
the distance between starting and target position.
Now, to include rotational noise in the training process and when applying the learned policy, we
need to numerically integrate Eq. (4). At each step of Q learning, the orientation angle is set to
a value $\theta$ as explained in Sect. 2.3. Starting from this value, the new orientation angle
$\theta_{\mathrm{new}}$ after time step $\Delta t$ becomes
$$\theta_{\mathrm{new}} = \theta + \omega_D \Delta t + \sqrt{(D_R L / v_0)\,\Delta t}\; W_\theta , \qquad (5)$$
where $\sqrt{\Delta t}\, W_\theta$ represents the increment of a Wiener process and $W_\theta$ is a
random number with zero mean and unit variance, $\langle W_\theta^2 \rangle = 1$. The new angle
$\theta_{\mathrm{new}}$ is then used to perform a step of the active particle, which is further
processed as explained in Sect. 2.3.
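The update of Eq. (5) can be sketched in a few lines of Python. This is a minimal sketch under the assumption that $W_\theta$ is drawn from a standard normal distribution; the names `rotate_orientation`, `noise_strength` (standing for $D_R L/v_0$), and `angular_kick_degrees` are hypothetical and not taken from the article.

```python
import math
import random

def rotate_orientation(theta, omega_d, noise_strength, dt, rng=random):
    """One step of Eq. (5); noise_strength stands for D_R L / v_0 (rescaled units)."""
    w_theta = rng.gauss(0.0, 1.0)  # assumed Gaussian: zero mean, unit variance
    return theta + omega_d * dt + math.sqrt(noise_strength * dt) * w_theta

def angular_kick_degrees(noise_strength, dt, w_theta=1.0):
    """Size of the noise term in degrees for a given random number W_theta."""
    return math.degrees(math.sqrt(noise_strength * dt) * w_theta)

# Typical rotation caused by noise for D_R L / v_0 = 1 and dt = 0.0375
kick = angular_kick_degrees(1.0, 0.0375)
```

With `noise_strength = 0` the function reduces to the deterministic rotation $\theta + \omega_D \Delta t$, so the same routine can serve both the deterministic and the noisy case.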
With the formulated model, we fully describe the dynamics of our microswimmer. Now, we can start
using Q learning to train the swimmer such that it finds the optimal path in various
drift-velocity fields.
2.2 Q learning
In order for an active particle to learn the fastest path from an initial position to an assigned
target while moving in an external drift-velocity field, the method of tabular Q learning can be
applied [41]. Here, one first creates a Q table that stores a Q value for each pair of state (e.g.,
position of the particle) and possible action variables (e.g., movement in a certain direction).
The Q value represents the expected reward accumulated during a sequence of steps when starting
from a state-action pair. Thus, it not only considers the immediate reward but future rewards as
well.
To start the Q-learning algorithm, the active particle is placed at the initial position and moves
under the influence of the drift-velocity field in a series of steps. Before it performs an action,
the Q values of the possible actions in the current state are checked and the action with the
highest Q value is selected. It represents the highest expected accumulated reward. After each
action, the corresponding Q value for the state-action pair is adjusted as reported below. When the
active particle reaches the target, the first training episode is completed. A new episode begins
with the same initial state, and the process repeats until the Q matrix of the agent converges to a
stable solution. This means that the agent has learned the optimal policy, which for each of the
states gives the optimal action such that the total reward accumulated by the agent along its path
is maximized.
During training, the new Q value, $Q^{\mathrm{new}}$, must be calculated for each pair of state and
action variables, which is done via the Bellman equation [41]:
$$Q^{\mathrm{new}}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \cdot \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right] . \qquad (6)$$
Here, $s_t$ is the current state and $a_t$ the current action variable. The immediate reward $r_t$
belongs to taking action $a_t$ in state $s_t$, and $s_{t+1}$ is the new state reached. The term
$\max_a Q(s_{t+1}, a)$ represents the maximum expected future reward for the new state variable
$s_{t+1}$. The discount factor $\gamma$ determines how much the future reward is taken into
account, and $\alpha$ quantifies the learning speed. The optimal values of both factors depend on
the specific problem. In Appendix A.2, we briefly address this for our optimization tasks. One can
show that, when defining an immediate reward suitable to the optimization task, the Q matrix will
converge toward its optimal value by applying the recursion formula (6) [56].
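The recursion of Eq. (6) can be sketched as a single update function. This is a minimal sketch that assumes the Q table is stored as a dictionary mapping each state to a list of action values; the data layout, the state labels, and the function name `q_update` are assumptions for illustration, while the numerical values $\alpha = 0.9$, $\gamma = 0.7$ and the uniform initial value 100 are those quoted in Sect. 2.4.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.9, gamma=0.7):
    """Apply the update of Eq. (6) in place and return the new Q value."""
    best_next = max(q_table[next_state])  # max over actions in the new state
    td_error = reward + gamma * best_next - q_table[state][action]
    q_table[state][action] += alpha * td_error
    return q_table[state][action]

# Toy table (hypothetical): two states, eight orientations, all entries set to 100
q = {s: [100.0] * 8 for s in ("s0", "s1")}
new_value = q_update(q, "s0", 3, reward=-0.0375, next_state="s1")
```

Only the entry of the visited state-action pair changes; all other entries keep their previous values until they are visited themselves.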
We combine the deterministic choice of the action with the $\epsilon$-greedy method. The most
rewarding action is only taken with probability $1 - \epsilon$, and otherwise the action is chosen
randomly. This prevents the optimal path from becoming stuck in a local minimum. We decrease
$\epsilon$ with each episode $i$ according to $\epsilon = 0.5\,(1 - i/i_{\max})$, where $i_{\max}$
is sufficiently large to guarantee that the Q matrix has converged and the phenomenological factor
0.5 guarantees the fastest convergence. Despite using the $\epsilon$-greedy method, in the
following we will also call this version deterministic Q learning to distinguish it from
Q learning in the presence of orientational thermal noise.
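The exploration schedule and the action selection described above can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch rather than statements of the article: the random action is drawn uniformly from all actions, and $\epsilon$ is clamped at zero for episodes beyond $i_{\max}$.

```python
import random

def epsilon_schedule(episode, i_max):
    """Linearly decaying exploration rate, epsilon = 0.5 * (1 - i / i_max)."""
    return max(0.0, 0.5 * (1.0 - episode / i_max))  # clamp at zero is an assumption

def select_action(q_values, epsilon, rng=random):
    """Greedy action with probability 1 - epsilon, uniformly random otherwise."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: exploration rate halfway through training with i_max = 5000
eps_mid = epsilon_schedule(2500, 5000)
```

For `epsilon = 0` the selection is purely greedy, which corresponds to applying a learned policy after training.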
2.3 Choice of state and action variables
To find the optimal navigation in a complex environment using tabular Q learning, information about
the position of the particle is required. The $x, y$ coordinates are often used to define the state
of the particle [14,33]. However, having information about the position is certainly difficult for
a microswimmer to realize when video microscopy and some external information processing are not
available [14]. It seems more realistic and easier for microswimmers to sense the direction and
distance to a target using, e.g., a magnetic field, in which magnetotactic bacteria experience a
torque, a light field in combination with phototaxis, or olfactory sensing [57–59]. Sensing a
chemical field is also an option; however, the concentration field can be distorted by the
surroundings. Thus, compared to our previous work [33], to describe the state of the active
particle, we perform a coordinate transformation $(x, y) \to (\rho, \varphi)$, where $\rho$ is the
distance and $\varphi$ the direction angle to the target (see Fig. 1). Reducing the dimension of
the state space by omitting $\rho$ or $\varphi$ is not possible within the Q-learning formalism
since then the required action, which depends on the position of the active particle, cannot be
unambiguously predicted.
In the following, we will use these alternative state variables and Q learning to solve the
navigation problem. As we mentioned previously, the tabular Q-learning method creates a matrix with
a finite number of elements. Thus, the state space consists of a set of $2N$ discrete state
variables, for which we choose:
distance: $\rho_i \in [\rho_1 = 0, \rho_2, \ldots, \rho_N]$ (7)
direction: $\varphi_i \in [\varphi_1 = 0, \varphi_2, \ldots, \varphi_N]$. (8)
The active particle moves continuously in space. We assign its distance $\rho$ and direction angle
$\varphi$ to the discrete values $\rho_i$ and $\varphi_i$ when $\rho$, $\varphi$ fall within the
intervals $[\rho_i - \Delta\rho/2, \rho_i + \Delta\rho/2]$ and
$[\varphi_i - \Delta\varphi/2, \varphi_i + \Delta\varphi/2]$, respectively. Here,
$\Delta\rho = (\rho_N - \rho_1)/(N - 1)$ and $\Delta\varphi = 2\pi/N$, and for the end points
$i = 1, N$ only half of the intervals are taken. Equally, the action variable is the discretized
orientation angle of the active particle, for which we always take eight values:
orientation: $\theta_i = (i - 1)\,\pi/4, \quad i = 1, \ldots, 8$. (9)
Now, for the discrete state variables an action variable is selected according to the Q matrix, so
the active particle changes its orientation from $\theta_t$ to $\theta_{t+1}$. Then, the active
particle moves in continuous space. Using Eq. (1), it reaches the position
$$\mathbf{r}_{\mathrm{new}} = \mathbf{r} + \mathbf{v}\,\Delta t , \qquad (10)$$
for which we calculate the new distance $\rho$ and direction angle $\varphi$. The procedure is
repeated until the particle arrives at the target, meaning $\rho < \Delta\rho/2$, and one training
episode is completed. If the target is not reached after 500 time steps, the episode is stopped and
a negative reward of $R = -10$ is generated. The episodes are repeated until the Q matrix converges
and the optimal solution is found. In practice, we choose $i_{\max}$ from the $\epsilon$-greedy
method sufficiently high to achieve convergence.
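The mapping from a continuous position to the discrete state $(\rho_i, \varphi_i)$ can be sketched as below. Several choices here are assumptions of the sketch and not confirmed by the article: the direction angle is measured as the polar angle of the vector pointing from the particle to the target, the angular bins wrap around periodically, indices are zero-based, and the target position and $\rho_N$ are taken from the system parameters of Sect. 2.4.

```python
import math

N = 43               # number of discrete values per state variable
RHO_MAX = 1.4577     # largest distance rho_N
TARGET = (0.5, 0.0)  # target position on the x axis

def state_indices(position, n=N, rho_max=RHO_MAX, target=TARGET):
    """Map a continuous position to zero-based indices (i_rho, i_phi)."""
    dx, dy = target[0] - position[0], target[1] - position[1]
    rho = math.hypot(dx, dy)
    phi = math.atan2(dy, dx) % (2.0 * math.pi)  # assumed convention for the angle
    d_rho = rho_max / (n - 1)                   # (rho_N - rho_1) / (N - 1), rho_1 = 0
    d_phi = 2.0 * math.pi / n
    i_rho = min(n - 1, int(round(rho / d_rho)))  # nearest grid value, half bins at ends
    i_phi = int(round(phi / d_phi)) % n          # periodic wrap is an assumption
    return i_rho, i_phi

start_state = state_indices((-0.5, 0.0))
```

Rounding to the nearest grid value implements the intervals of width $\Delta\rho$ and $\Delta\varphi$ centered on $\rho_i$ and $\varphi_i$; a position exactly on a bin boundary may fall on either side.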
In case we include thermal noise in the orientation vector $\mathbf{e}$ or a flow field with
nonzero vorticity $\omega_D$, we first let the orientation angle evolve according to Eq. (5), and
then the translational step according to Eq. (10) is performed. Note also that we do not use the
$\epsilon$-greedy method here, since thermal noise naturally brings in some randomness.
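The order of operations described above, rotation first and translation second, can be sketched as one function. The names `advance`, `drift_field`, and `rotation_rate` are hypothetical; the random number $W_\theta$ is passed in explicitly so that the same routine covers deterministic motion (`w_theta = 0`) and noisy motion (a unit-variance sample supplied by the caller).

```python
import math

def advance(position, theta, drift_field, rotation_rate, noise_strength, dt, w_theta=0.0):
    """Rotate the orientation as in Eq. (5), then translate as in Eqs. (1) and (10)."""
    x, y = position
    theta_new = (theta + rotation_rate(position) * dt
                 + math.sqrt(noise_strength * dt) * w_theta)
    vdx, vdy = drift_field(position)
    vx = math.cos(theta_new) + vdx  # total velocity v = e + v_D in rescaled units
    vy = math.sin(theta_new) + vdy
    return (x + vx * dt, y + vy * dt), theta_new

# Hypothetical example: one deterministic step in a uniform cross flow of strength 0.4
new_position, new_theta = advance((0.0, 0.0), 0.0,
                                  drift_field=lambda r: (0.0, 0.4),
                                  rotation_rate=lambda r: 0.0,
                                  noise_strength=0.0, dt=0.1)
```

The drift and the rotation rate are evaluated at the position before the step, which is a simple explicit Euler choice assumed here.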
2.4 System parameters
In units of the distance $L$ between start and target, the square system extends in the $x$ and $y$
directions from $-0.75$ to $0.75$. Start and target are placed on the $x$ axis at $-0.5$ and $0.5$,
respectively. We use $N = 43$ discrete values for $\rho$ and $\varphi$ with the maximum distance
$\rho_N = 1.4577$ and the resolution $\Delta\rho = 0.0343$. This guarantees that every position in
the system, measured from the target, is covered. Typically, the time step is $\Delta t = 0.0375$.
However, for the case of crossing a uniform flow (Sect. 3.3) and the Poiseuille flow (Sect. 3.4),
we need to increase it with the strength of the flow, and for the case of a swirling flow
(Sect. 3.5), $\Delta t$ needs to decrease for stronger swirls.
The time step should roughly be adjusted such that during one time step the particle can move from
one grid element to a neighboring element. Choosing it smaller does not bring any improvement. On
the contrary, the Q-learning algorithm then has to perform numerous steps to move the particle
forward without reaching other state variables, which can considerably slow down the whole learning
process. For the two cases of the uniform and Poiseuille flow, the time step needs to be increased
to have a noticeable motion across the flow, while for the swirling flow $\Delta t$ needs to
decrease in order to properly approximate the trajectory.
To start the training, the Q matrix must be initialized. Here, we set all entries uniformly to
$Q(s, a) = 100$ [41]. An important point is how the rewards $r_t$ for the different actions within
Q learning are chosen. Since we want to minimize travel time $T$, each step of duration $\Delta t$
receives a reward $r_t = -\Delta t$. So the accumulated negative reward is smallest in magnitude if
the number of steps is minimized. Reaching the target gives a reward of 100, and if the particle
crosses the border of the system, a large negative reward of $-10$ is used to strongly penalize
this action. Furthermore, when performing such a move, the particle is placed back to the location
where the border was crossed. This procedure corresponds to implementing a hard-core repulsion from
the border, while the negative reward signals the particle within the learning phase to avoid such
steps. Without such a negative reward, the learning takes longer. The learning speed and discount
factor are chosen as $\alpha = 0.9$ and $\gamma = 0.7$, respectively. In Appendix A.2 we present a
parameter study to determine the optimal values for $\alpha$ and $\gamma$. Finally, for
implementing the $\epsilon$-greedy method, $i_{\max} = 5000$ is often sufficient for the travel
time to converge, and we never need to go beyond $i_{\max} = 20000$.
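The reward scheme described above can be sketched as a small function. Two points are assumptions of this sketch and not confirmed by the article: the special rewards are added to the step reward $-\Delta t$ rather than replacing it, and a particle that left the system is put back by clamping its coordinates to the box, which only approximates the location where the border was crossed. All names are hypothetical.

```python
TARGET_REWARD = 100.0
BORDER_PENALTY = -10.0

def step_reward(new_position, rho_new, dt, d_rho, half_width=0.75):
    """Return (reward, corrected_position, done) for one translational step."""
    x, y = new_position
    reward = -dt  # every step of duration dt costs dt
    done = False
    if abs(x) > half_width or abs(y) > half_width:
        reward += BORDER_PENALTY  # additive combination is an assumption
        x = max(-half_width, min(half_width, x))  # clamp back onto the border
        y = max(-half_width, min(half_width, y))
    elif rho_new < 0.5 * d_rho:
        reward += TARGET_REWARD
        done = True
    return reward, (x, y), done

# Example: an ordinary step inside the system with dt = 0.0375
ordinary = step_reward((0.0, 0.0), 0.5, 0.0375, 0.0343)
```

The penalty of $-10$ for an episode that exceeds 500 time steps would be applied by the surrounding training loop and is not part of this function.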
3 Results
We now present results for a few types of drift-velocity
fields, which either derive from a potential or are due
to an imposed flow field.
3.1 Potential barrier and well
In our earlier work [33], we determined the fastest path when there is a potential barrier between
the start and target. We modeled the barrier using the Mexican hat potential without brim,
$$U = \begin{cases} 16\,U_0\,(r^2 - 1/4)^2 , & r \le 1/2 \\ 0 , & \text{otherwise} \end{cases} , \qquad (11)$$
where $r$ is the radial distance to the center. At $r = 0$, the potential has a maximum with height
$U_0$, and on the ring $r = 1/2$ it is zero with horizontal tangent. The maximum potential force
$-\nabla U$ is at $r = 1/(2\sqrt{3})$ with $|\nabla U| = 16 U_0/(3\sqrt{3})$. The inset of Fig. 2
shows a grayscale representation of the potential.
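The potential of Eq. (11) and the radial force derived from it can be checked numerically with a short sketch. The function names are hypothetical; the radial force follows from differentiating Eq. (11), $-dU/dr = -64\,U_0\,r\,(r^2 - 1/4)$ for $r \le 1/2$.

```python
import math

def potential(r, u0):
    """Mexican hat potential without brim, Eq. (11)."""
    return 16.0 * u0 * (r * r - 0.25) ** 2 if r <= 0.5 else 0.0

def radial_force(r, u0):
    """Radial component of -grad U; positive values point away from the center."""
    return -64.0 * u0 * r * (r * r - 0.25) if r <= 0.5 else 0.0

# Location and size of the maximum force for U_0 = 1
r_max = 1.0 / (2.0 * math.sqrt(3.0))
max_force = radial_force(r_max, 1.0)
expected_max = 16.0 / (3.0 * math.sqrt(3.0))
```

For a barrier ($U_0 > 0$) the force points outward, and for a well ($U_0 < 0$) the same expression points toward the center.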
The potential force can be incorporated in Eq. (1) by choosing
$\mathbf{v}_D(\mathbf{r}) = -\nabla U$. As we already discussed in detail in Ref. [33], the
variation $\delta T = 0$ identifies the straight path over the potential barrier as optimal until
$U_0 = 0.24$, where the curved path around the barrier becomes faster (see main plot of Fig. 2).
There is also a regime where both paths are locally stable. Q learning is able to identify the
optimal paths, as the blue dots in Fig. 2 show. For the curved path (an example is shown in the
inset), the minimum travel time is better approximated than in Ref. [33] since we allow the active
particle to move freely in space and not just on a grid.
Fig. 2 Shortest travel time $T$ versus barrier height $U_0$ either for crossing the barrier of a
Mexican hat potential on a straight path (magenta) or moving around it (green). The blue dots
indicate the results of Q learning. The inset shows a grayscale representation of the potential and
the optimal path determined with Q learning for $U_0 = 0.4$
Fig. 3 Potential well with $U_0 = -0.4$. Top: In the first episode ($i = 1$), the active particle
becomes trapped in the well and the learning ended after $T = 500$. The current action receives a
negative reward of $-10$. Bottom, blue trajectory: After $i_{\max}$ = 50,000 the active particle
has learned to avoid the well and move around it. Other trajectories: Example trajectories for
applying an averaged Q matrix under noise. The index $n$ in the corresponding travel time
$T_{Qn}$ refers to the strength of the orientational thermal noise, $n = D_R L/v_0$
By choosing a negative $U_0$, the potential barrier becomes a well. The phenomenology of the paths
is the same as for the barrier. However, there is one difference. If the depth of the well is below
$U_0 = -3\sqrt{3}/16 \approx -0.325$, the maximum potential force exceeds the self-propulsion
speed and the active particle cannot escape the well. This is illustrated for $U_0 = -0.4$ in
Fig. 3, top, where we show the trajectory for the first episode. It is stopped after a travel time
of $T = 500$ and the current action receives a large negative reward of $-10$. The episodes are
repeated, and the active particle learns to avoid the well since it ultimately finds the optimal
path around the potential well (Fig. 3, bottom). The blue curve with travel time $T_0$ refers to
zero noise. However, since the active particle needs to explore and learn to avoid the “forbidden”
region of the well, a larger $i_{\max}$ = 50,000 is necessary. This example shows very clearly that
negative results of not reaching the target also contribute to the learning process of the active
particle.
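The escape threshold quoted above follows from comparing the maximum drift speed, $16|U_0|/(3\sqrt{3})$, with the rescaled swimming speed of one. The following minimal sketch makes this comparison explicit; the function names are hypothetical, and the criterion only states whether a ring exists inside the well on which the inward drift exceeds the swimming speed.

```python
import math

def max_drift_speed(u0):
    """Largest magnitude of the potential force for the potential of Eq. (11)."""
    return 16.0 * abs(u0) / (3.0 * math.sqrt(3.0))

def can_escape_well(u0):
    """True if the unit swimming speed exceeds the strongest inward drift."""
    return u0 >= 0.0 or max_drift_speed(u0) < 1.0

# Depth at which the maximum drift speed equals the swimming speed
threshold_depth = 3.0 * math.sqrt(3.0) / 16.0
```

For $U_0 = -0.4$ the maximum drift speed is about 1.23, so a particle that has moved inside the ring of maximum force cannot swim back out, consistent with the trapped first episode described above.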
We add a final note. The policy for the optimal path is encoded in the Q matrix. However, this
matrix contains more information. One can place the active particle at any location within the
system, provided this location has been visited before in the learning phase, and the Q matrix will
guide it to the target. However, in general, the path will not be the fastest. We have checked this
for the potential barrier.
3.2 Learning with noise
Orientational noise can be included during different stages of Q learning, which we explore here
for the potential barrier and well. One can include noise while learning the optimal paths and when
applying the optimal policy. In addition, one can vary the noise strength.
3.2.1 Potential barrier
We start with exploring the influence of noise during learning. The prefactor $D_R L/v_0$ in
Eq. (5) with $\omega_D = 0$ was set to one to have a noticeable change of particle orientation
$\mathbf{e}$ during the time step $\Delta t$. For example, for a random number of $W_\theta = 1$
the change in orientation angle, $\theta_{\mathrm{new}} - \theta$, is $11^\circ$. Nevertheless, the
active particle learns to move across the barrier and around it, as Fig. 4, top illustrates for low
and high $U_0$. However, the learned trajectories are noisy, which also increases the travel time
compared to the deterministic case. Interestingly, at $U_0 = 0.225$, close to the point
($U_0 = 0.24$) where the absolute stability switches from the straight to the curved path, we
observe both types of paths, as Fig. 4, bottom shows. Thus, noise causes the optimization process
to converge into different local minima. Moreover, the travel time of the straight path is more
strongly affected by orientational noise than that of the curved path and, therefore, its value is
more strongly enhanced. This makes sense since the acting potential forces drive the particle away
from the optimal path. In light green, we plot 100 trajectories, each determined from a separate
learning run. They show that learning under noise can reproduce the two types of trajectories; only
a few of them deviate more strongly. Alternatively, one can also take a specific learned Q matrix
and apply it several times under noise (not shown in Fig. 4). While the curved paths are
reproduced, now the “straight paths” are strongly distorted and nearly cover the whole region of
the potential barrier, again because noise affects them more strongly.
To quantify our observations further, we took the 100 Q-learning runs for the same $U_0$ and
calculated the mean of the travel time, $\langle T_1 \rangle$, and its standard deviation
$\Delta T_1$, where the index 1 refers to the noise strength $D_R L/v_0 = 1$. Both quantities are
plotted versus $U_0$ in Fig. 5. At small and high $U_0$, the mean travel time behaves like the
deterministic value in Fig. 2. It increases with $U_0$ for the straight path and levels off at high
$U_0$ for the curved path. Similarly, the standard deviation increases for small $U_0$ and is small
for large $U_0$. Hence, noise does not cause large variations. Interestingly, and in contrast to
the deterministic value, the mean travel time becomes maximal at $U_0 = 0.225$, where both path
types are observed, especially the noisy straight path with longer travel time. Accordingly, close
to this value the standard deviation has a maximum.
Fig. 4 Orientational noise does not prevent learning the optimal paths. Top: Straight and curved
paths for $U_0 = 0.15$ and $U_0 = 0.4$, respectively. Bottom: At $U_0 = 0.225$ both paths are
realized. In light green, 100 paths are shown. Each of them results from learning the optimal path
under noise
As an alternative approach to dealing with noise, we performed 10 Q-learning runs and used the
averaged optimal Q matrix to run 100 trajectories in the presence of orientational noise. Note that
in the following the averaged Q matrix is abbreviated as $\langle Q \rangle$ and, when referring to
it in an index, as $Q$. The mean travel time $\langle T_{Q1} \rangle$ (green triangles in Fig. 5)
does not deviate strongly at small and large $U_0$ from $\langle T_1 \rangle$, where
$\langle T_1 \rangle$ was calculated for each newly learned trajectory. Only around $U_0 = 0.225$
the deviation is stronger and the green curve misses the peak. The reason is that already at
$U_0 = 0.225$ the occurrence of straight paths under orientational noise is rare and the 10 learned
Q matrices, from which we determine the mean $\langle Q \rangle$, belong to the curved path.
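Averaging several learned Q matrices and reading off the resulting greedy policy can be sketched as follows. This is a minimal sketch assuming each Q matrix is stored as a nested list indexed by state and action and that the average is a plain element-wise mean; the article does not specify the storage format, and the function names and toy numbers are hypothetical.

```python
def average_q_tables(tables):
    """Element-wise mean of equally shaped Q tables (nested lists: state x action)."""
    n_tables = len(tables)
    n_states, n_actions = len(tables[0]), len(tables[0][0])
    return [[sum(t[s][a] for t in tables) / n_tables for a in range(n_actions)]
            for s in range(n_states)]

def greedy_policy(q_table):
    """Index of the best action for every state."""
    return [max(range(len(row)), key=lambda a: row[a]) for row in q_table]

# Toy example with two states and two actions
q_runs = [[[1.0, 3.0], [4.0, 0.0]],
          [[3.0, 5.0], [2.0, 2.0]]]
q_mean = average_q_tables(q_runs)
policy = greedy_policy(q_mean)
```

If the individual runs converged to different path types, the averaged matrix is dominated by the more frequent type, which matches the observation above that the mean $\langle Q \rangle$ belongs to the curved path.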
Finally, we add that to achieve these results, we did not use the $\epsilon$-greedy method. Thus,
thermal noise also helps in finding optimal trajectories. We also checked that including the
$\epsilon$-greedy method did not change our results significantly.
Fig. 5 Mean travel time $\langle T_1 \rangle$ (red circles) and standard deviation $\Delta T_1$
(error bars) plotted versus $U_0$ from Q learning in the presence of orientational thermal noise
during the learning phase. Mean travel time $\langle T_{Q1} \rangle$ (green triangles) for applying
an averaged Q matrix under noise. The lines are a guide to the eye
3.2.2 Potential well
Of course, when increasing the strength of the orientational thermal noise, $D_R L/v_0$, the
“optimal paths” and travel times deviate more and more from their deterministic values. To
illustrate this, we consider the potential well with $U_0 = -0.4$ (Fig. 3, bottom), which the
active particle cannot leave once it moves too close to the center. With increasing noise, the
learned Q matrices differ strongly from each other. So we decided to use a mean optimal Q matrix
averaged over 10 learning runs under the same noise strength. Figure 3, bottom shows example
trajectories when applying $\langle Q \rangle$ for different noise strengths, where the index $n$
in $T_{Qn}$ refers to $n = D_R L/v_0$. One already recognizes that, as a strategy for not becoming
trapped in the well, the particle keeps a larger distance to the well with increasing noise.
This is illustrated quantitatively in Fig. 6. We applied $\langle Q \rangle$ 100 times and then
plot $\langle T_{Qn} \rangle$ and its standard deviation, represented as error bars, versus the
strength of the orientational noise. One realizes that noise does not only have the effect of
moving the trajectory further away from the center and thereby increasing the travel time. In
addition, noise lets the trajectory become more irregular (see Fig. 3, bottom). How much this
contributes to the travel time is clarified by the red circles in Fig. 6. Here, we apply
$\langle Q \rangle$ without noise. So $T_{Q0}$ shows the pure effect of the particle needing to
avoid the potential well when learning under noise to reach the target.
3.3 Crossing uniform flow
So far, we used a potential force acting on the active particle. Now, we put the active particle in
a uniform flow field along the $y$ axis with strength $k$, $\mathbf{v}_D = k\,\mathbf{e}_y$, so
that the total velocity of the active particle becomes
$$\mathbf{v} = \mathbf{e} + k\,\mathbf{e}_y . \qquad (12)$$
Fig. 6 Mean travel time $\langle T_{Qn} \rangle$ (green triangles) and standard deviation (error
bars) plotted versus strength of the orientational thermal noise, $D_R L/v_0$, for the potential
well with $U_0 = -0.4$. Under noise, 10 optimal Q matrices are determined and the average
$\langle Q \rangle$ is applied 100 times to determine $\langle T_{Qn} \rangle$. The time $T_{Q0}$
(magenta circles) refers to $\langle Q \rangle$ applied without noise. The lines are a guide to the
eye
Fig. 7 Optimized travel time $T$ versus strength $k$ in a uniform flow field. Green line: from
analytic optimization. Blue dots: from deterministic Q learning. Red dots: mean of 100 Q-learning
runs under noise, and error bars indicate the standard deviation. Insets: Examples of learned
trajectories for $k = 0.4$ and $0.8$, respectively. Green: analytic minimum, blue: deterministic
Q learning, red: Q learning under noise. The green arrow and the blue arrows indicate the
respective particle orientations for the first and second case
The Euler-Lagrange equation corresponding to the variation of the travel time, $\delta T = 0$, can
be formulated and solved analytically. The calculation is a bit lengthy but straightforward. It
always gives the straight path between start and target as optimal, as shown for two examples in
the insets of Fig. 7 (green lines). For increasing flow strength $k$, the orientation vector
$\mathbf{e}$ has to tilt more and more against the imposed flow to prevent the active particle from
drifting downstream. Indeed, one finds for the optimal travel time
$$T = \frac{1}{\sqrt{1 - k^2}} , \qquad (13)$$
which diverges for $k \to 1$. Here, the active particle points fully against the flow and there is
no component of the swimming velocity left to cross the flow field along the $x$ direction. The
blue dots in Fig. 7 show the results of the optimization from deterministic Q learning, and for two
cases we show the paths (blue lines) and the particle orientations (blue arrows) in the insets. The
straight path along the $x$ direction is not realized since the particle orientation $\mathbf{e}$
can only assume eight discrete orientations. Note that, to perform Q learning, we needed to
increase the time step $\Delta t$ from 0.0375 to 0.1 with increasing $k$.
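Equation (13) can be reproduced with elementary geometry: on the straight path along $x$, the orientation must cancel the drift, $\sin\theta = -k$, which leaves $\cos\theta = \sqrt{1 - k^2}$ for the motion toward the target. The sketch below checks this numerically; the function names are hypothetical, and the flow is assumed to point along $+y$.

```python
import math

def optimal_heading(k):
    """Orientation angle that cancels a cross flow k e_y on a path along +x."""
    return -math.asin(k)

def crossing_time(k, distance=1.0):
    """Travel time over the given distance along x with the optimal heading."""
    theta = optimal_heading(k)
    vx = math.cos(theta)  # the drift has no x component
    return distance / vx

def transverse_velocity(k):
    """Net y velocity for the optimal heading; vanishes up to rounding."""
    return math.sin(optimal_heading(k)) + k

time_k06 = crossing_time(0.6)
```

As $k$ approaches one, the heading approaches $-\pi/2$ and `crossing_time` grows without bound, in line with the divergence of Eq. (13).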
In a last step, we include orientational thermal noise when performing 100 Q-learning runs.
Including the time step, the noise in Eq. (5) is determined by $(D_R L/v_0)\,\Delta t$. To apply
the same noise during a time step, we always choose $(D_R L/v_0)\,\Delta t = 0.0375$. Thus, when
increasing $\Delta t$, we reduce $D_R L/v_0$ accordingly. The red dots in Fig. 7 show the mean
travel time $\langle T_1 \rangle$ from the 100 Q-learning runs, and the error bars indicate the
standard deviation. Until $k = 0.8$, no strong deviation from deterministic Q learning is observed
and the red trajectories in the insets show typical examples. In contrast, the trajectories for
$k = 0.9$ strongly deviate from each other, which results in a strongly increased
$\langle T_1 \rangle$ with a large standard deviation. The reason is that the particle orientation
is nearly antiparallel to the flow field, so orientational fluctuations can cause the particle to
move away from the target instead of toward it. These excursions result in a strongly increased
$\langle T_1 \rangle$.
3.4 Crossing Poiseuille flow
In our next example, we implement a Poiseuille flow field along the $y$ axis with zero velocity at
the boundaries of our typical system geometry, $x = \pm 3/4$. So the $y$ component of the velocity
field $\mathbf{v}_D$ becomes:
$$v_y = v_c \left[ 1 - \left( \tfrac{4}{3} x \right)^2 \right] , \quad \text{where} \quad \omega_D = -\tfrac{16}{9}\, v_c\, x \qquad (14)$$
describes the rotational velocity experienced by the active particle due to the nonzero flow
vorticity. To keep the problem simple, we do not implement any hydrodynamic interactions with the
bounding walls. Again, we look for the fastest trajectory between starting and target positions
located at $-0.5$ and $0.5$ on the $x$ axis, respectively.
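The flow profile and the rotation rate can be cross-checked with a short sketch. It assumes a flow along $+y$ with a parabolic profile that vanishes at $x = \pm 3/4$ and takes the rotation rate as half the derivative $\partial v_y/\partial x$, including its sign; both the sign convention and the function names are assumptions of this sketch.

```python
def poiseuille_velocity(x, v_c, half_width=0.75):
    """y component of the Poiseuille drift: v_c * (1 - (x / half_width)^2)."""
    return v_c * (1.0 - (x / half_width) ** 2)

def poiseuille_rotation(x, v_c, half_width=0.75):
    """Half the z component of curl v_D, i.e. 0.5 * d v_y / d x (signed, assumed)."""
    return -v_c * x / half_width ** 2

def numerical_rotation(x, v_c, h=1e-6):
    """Central finite difference of 0.5 * d v_y / d x for comparison."""
    return 0.5 * (poiseuille_velocity(x + h, v_c) - poiseuille_velocity(x - h, v_c)) / (2.0 * h)

center_speed = poiseuille_velocity(0.0, 1.3)
```

With `half_width = 0.75` the prefactor $1/0.75^2$ equals $16/9$, so the magnitude of the rotation rate grows linearly with the distance from the channel center and is largest at the walls.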
We first explore the numerical minimization of the travel time $T_0$, which gives three types of
trajectories that correspond to the three branches (solid lines) in Fig. 8, top. As for the
potential barrier, these branches overlap when tuning the flow strength $v_c$, indicating up to
three local minima. For $v_c \lesssim 1$ the optimal trajectory is curved symmetrically about the
$x$ axis (green trajectory in Fig. 8, bottom left). Thus, close to the starting and target
positions, where the flow velocity field is smaller, the active particle can swim upstream, while
it drifts downstream in the center. Then, for $v_c \gtrsim 1$ the shape becomes asymmetric (green
trajectory in Fig. 8, bottom middle) since the active particle needs to explore the slow flow close
to the wall to be able to reach the target. Similarly, the trajectory mirrored at the center also
exists. Ultimately, at $v_c \approx 1.3$ an S-shaped trajectory is the optimum in travel time
(green trajectory in Fig. 8, bottom right). The slow flow at both walls is used to swim upstream in
order to compensate for the downstream drift in the center.
Fig. 8 Top: Optimized travel time versus strength $v_c$ of the Poiseuille flow. Solid lines:
numerical minimization gives three trajectory types. Blue dots: from deterministic Q learning. Red
dots: mean of 100 Q-learning runs under noise, and error bars indicate the standard deviation.
Green triangles: $\langle T_{Q1} \rangle$ from applying a mean Q matrix under noise. Bottom: From
left to right, examples of the three trajectory types are indicated: symmetric, asymmetric, and
S-shaped. Green: numerical minimization, blue: deterministic Q learning, red: Q learning under
noise
Now, reinforcement learning without orientational
noise (blue dots in Fig. 8) nicely reproduces the optimal
travel times, although at vc = 1.4 the trajectories differ
(compare green and blue trajectories in Fig. 8, bottom
right). To perform Q learning, we needed to increase
the time step Δt from 0.0375 to 0.08 with increasing
vc.
In a next step, we again include orientational noise
and perform an average over 100 Q-learning runs to
determine a mean travel time. As for the uniform flow,
we keep the noise per time step constant by choosing
(D_R L/v_0) Δt = 0.0375 in Eq. (5). The red dots in Fig.
8, top show the mean travel time T1 and the error
bars indicate the standard deviation. Up to vc = 0.8,
T1 agrees well with the numerical minimization (T0)
and deterministic Q learning (TRL). Also, the different
trajectories vary around the symmetric path (top row
in Fig. 9). At vc = 0.9, a mixture of symmetric and
asymmetric trajectories (Fig. 9, middle row) results in
a stronger deviation from T0 and TRL and a larger standard
deviation. Then, at vc = 1.1 only the asymmetric
trajectory type is realized, and at vc = 1.3 the S-shaped
type also starts to develop. However, since the
trajectories become more and more irregular, T1 deviates
more strongly from T0 and TRL with large standard
deviations.

Fig. 9 For several vc, all the 100 learned trajectories
resulting from Q learning under noise are shown

Fig. 10 Each plot shows 100 trajectories that result from
applying different types of Q matrices under noise at vc =
1.2. Qn: mean Q matrix from 10 Q-learning runs under
noise. Q1 to Q4: examples of Q matrices learnt under noise.
Qdet: deterministic Q matrix
Fig. 11 Swirling flow field. Left: Three examples of optimal
paths for increasing swirling strength ω. Blue line: from
deterministic Q learning and blue arrows indicate the particle
orientation. Red line: trajectory learned in the presence
of noise. Green: The particle misses the target and has to
circle once around the center. The gray circles indicate the
direction of the flow. Right: Travel time T times ω plotted
versus ω. Green line: from numerical minimization. Blue
dots: from deterministic Q learning. Red dots: mean of 100
Q-learning runs under noise, and error bars indicate the
standard deviation

In the end, we address the question of how optimal
policies encoded in learned Q matrices reproduce the
learned trajectories when applied under noise. In Fig.
10, we compare the outcome for different types of
learned Q matrices for flow strength vc = 1.2, each of which
we apply 100 times. At vc = 1.2, the optimized travel
time is T = 2.21. Interestingly, applying Qdet (the optimal
deterministic Q matrix) produces trajectories that
make long detours under noise. Thus, the mean travel
time T deviates strongly from the ideal value and
the standard deviation is large. Taking a mean of 10
Q matrices learned under noise, Qn, gives trajectories
without large detours, and the mean travel time
is well below the one for Qdet. Even the single Qn
matrices, when applied under noise, are more successful
than Qdet, as the examples for four Q matrices in Fig.
10 show. These results suggest that Q matrices learnt
under noise, or their average, are less sensitive to noise
than the deterministic Q matrix when they are applied. In
Fig. 8, the green triangles indicate the mean travel time
for Qn. Up to vc = 0.7, there is not much difference
between the differently determined travel times. Interestingly,
between vc = 0.8 and 1.0, the mean Q matrix
provides a better travel time than the average over 100
Q-learning runs. However, for vc = 1.2 this is no longer
true and for vc = 1.3 and 1.4 some of the trajectories
did not reach the target, so we do not provide a mean
travel time here.
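How a mean Q matrix can be formed from several learning runs and then turned into a policy is sketched below. The sketch assumes that each Q matrix is stored as a nested list indexed by state and action and that the applied policy is greedy. Both choices, as well as the function names and the toy numbers, are illustrative and not taken from the authors' code.

```python
def mean_q_matrix(q_matrices):
    """Element-wise average of equally shaped Q matrices (lists of rows)."""
    n = len(q_matrices)
    if n == 0:
        raise ValueError("need at least one Q matrix")
    n_states = len(q_matrices[0])
    n_actions = len(q_matrices[0][0])
    return [
        [sum(q[s][a] for q in q_matrices) / n for a in range(n_actions)]
        for s in range(n_states)
    ]


def greedy_policy(q):
    """For every state, pick the action index with the largest Q value."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


# Two toy Q matrices with 2 states and 2 actions (hypothetical values).
q_a = [[1.0, 0.0], [0.0, 2.0]]
q_b = [[0.0, 3.0], [0.0, 4.0]]
q_mean = mean_q_matrix([q_a, q_b])
print(q_mean)                 # averaged Q values
print(greedy_policy(q_mean))  # action chosen in each state
```

In the toy example the averaged matrix selects a different action in the first state than `q_a` alone, which illustrates how averaging over runs can change the applied policy.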
3.5 Crossing swirling flow
As a last example, we investigate the case where the
active particle needs to cross a swirling flow on its way
to the target. We consider

\[
\mathbf{v}_D = \frac{\omega}{r} \, \mathbf{e}_\varphi ,
\tag{15}
\]

which has zero vorticity (curl v_D = 0) so that the particle
orientation e is not rotated by the flow. Figure 11,
left shows three optimal paths determined from deterministic
Q learning for increasing flow strength ω. One
observes that for smaller ω the active particle crosses
closer to the center, because the flow is faster there, which
helps to minimize travel time. The self-propulsion is
needed to cross the circular streamlines in order to
reach the target. For increasing ω and thus increased
drifting, the active particle has to stay closer to the
streamline of the target to be able to reach it. For
large ω, self-propulsion can more and more be neglected
against drifting and the active particle moves on the
half circle connecting start and target. This is approximately
the case for ω = 4.2.
The green graph in Fig. 11, right, obtained from a numerical
minimization of the travel time, confirms this view. For small
ω the ideal path is nearly straight and T ≈ 1 is constant
or Tω is linear in ω. For large ω, Tω should tend
to π/4 ≈ 0.785, but the convergence is rather slow. The
results from reinforcement learning without orientational
noise (blue dots) agree rather well with the green
curve. When just plotting T versus ω, they nicely fall
on top of each other. Small deviations become enlarged
when plotting Tω. To arrive at the results for ω ≳ 1.5,
we increased the number of action orientations from 8 to
16 and successively decreased the time step Δt to 0.001.
The reason is that at larger flow strengths one has to
fine-tune the position and orientation of the active particle
to hit the target; otherwise, the particle needs to
circle around the center to make another attempt.
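The large-ω limit quoted above follows from a short estimate. It assumes that start and target lie on opposite sides of the vortex center at distance R = 1/2 (as for the positions x = ∓0.5 used in the other examples) and that self-propulsion is negligible, so the particle is simply advected along the half circle of length πR with speed ω/R from Eq. (15):

```latex
\[
T \simeq \frac{\pi R}{|\mathbf{v}_D(R)|}
  = \frac{\pi R}{\omega / R}
  = \frac{\pi R^2}{\omega}
\quad\Longrightarrow\quad
T\omega \to \pi R^2 = \frac{\pi}{4} \approx 0.7854
\quad \text{for } R = \tfrac{1}{2} .
\]
```

In the opposite limit of weak flow, the nearly straight path of length 2R = 1 traveled at unit swimming speed gives T ≈ 1, so Tω grows linearly with ω.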
Again we introduce orientational noise and perform
100 Q-learning runs, while keeping the factor governing
noise at (D_R L/v_0) Δt = 0.0375. At ω = 1.8 and beyond,
the mean travel times T1 (red dots in Fig. 11, right)
nicely agree with deterministic Q learning, the standard
deviations are small, and the trajectories fall on top of
each other (Fig. 11, left). However, at ω = 0.6 and 1.2
large standard deviations occur. They are due to rare
events, where the particle does not hit the target and
therefore needs to circle around once to reach it. An
example (green trajectory) is given for ω = 0.6 in Fig.
11, left.
4 Conclusion
In this article, we considered a smart active particle
that can sense the distance and direction to a target.
We used Q learning to demonstrate how the particle
learns by itself to navigate on the fastest path in different
potential landscapes and flow fields. In parallel, we
also solved the optimization problem using variational
calculus to show how well Q learning works. Our idea
is that sensing distance and direction as state variables
is easier to realize with a smart active particle than
sensing the position.
First, we considered a potential barrier as in our pre-
vious work [33], but now the learned paths are closer
to the optimal path since the active particle moves con-
tinuously in space instead of only occupying grid points
of a square lattice. Furthermore, as action variables we
employ eight orientations instead of only four. We also
considered a potential well, which was deep enough so
that the active particle could become trapped in it once
the trapping force exceeds a critical value. However,
the particle indeed learns to avoid the trap and moves
around it. This is an important feature when studying
the optimal path in an arbitrary landscape.
Second, we demonstrated how the active particle
crosses a uniform flow. The learned travel times agree
well with the analytic result of the optimization. Third,
crossing a Poiseuille flow is more challenging to evaluate.
By numerical minimization of the travel time, we
identify three types of trajectories: symmetric, asymmetric,
and S-shaped. The second and third types occur
at higher flow strengths and use the small flow velocity
at the channel walls to move upstream in order to
be able to cross the flow. Fourth, we also looked
at a swirling flow. For small flow strengths, the active
particle uses the larger flow velocities close to the cen-
ter to arrive fastest at the target, while for larger flow
strengths it has to stay on the circular flow line so that
it does not miss the target. If this happens, the particle
needs to circle around the center which increases the
travel time.
Finally, for all the reported drift-velocity fields we
evaluated the effect of orientational thermal noise during
Q learning and when applying the optimal policy.
Generally, noise does not prevent the active particle
from learning optimal travel paths and navigating
to the target. Now the optimal path is noisy, which
increases the travel time, and, of course, finding optimal
paths depends on the strength of noise. Further
general statements on the impact of noise are difficult;
it rather depends on the specific problem and the trajectory
to be learned. For example, for the potential
well the learned trajectories run further away from the
center with increasing noise so that the particle avoids
getting trapped in the well. We add two further findings. First,
when performing Q learning under noise for all drift-velocity
fields, we identified the strategy to average over
several Q-learning runs. This works well if the optimized
trajectories are well accessible by numerical minimization
of the travel time or by deterministic Q learning.
Second, when applying the learned optimal policy
under noise, we made the interesting observation that
a mean Q matrix works better than the deterministic
Q matrix. The reason might be that the Q matrix averaged
over several noisy learning runs has developed a
better strategy to respond to random changes in the orientation
compared to the deterministic Q matrix. Thus,
our study also adds to the recent efforts to explore the
stability of learned optimal strategies/policies [35,51].
As a next step, we plan to study optimal navigation
of active particles in more complex potential and
flow landscapes using deep Q learning, which employs
neural networks. This will also enable us to train the
active particle to move optimally in a set of complex
landscapes and then let it move in an unknown landscape.
Acknowledgements We thank the Berlin University All-
iance for funding under grant number 824 BUA-NUS 4.
Funding Information Open Access funding enabled and
organized by Projekt DEAL.
Author contribution statement
All the authors were involved in the preparation of the
manuscript. All the authors have read and approved the
final manuscript.
Data availability The datasets generated during and/or
analyzed during the current study are available from the
corresponding author on reasonable request.
Open Access This article is licensed under a Creative Com-
mons Attribution 4.0 International License, which permits
use, sharing, adaptation, distribution and reproduction in
any medium or format, as long as you give appropriate credit
to the original author(s) and the source, provide a link to
the Creative Commons licence, and indicate if changes were
made. The images or other third party material in this arti-
cle are included in the article’s Creative Commons licence,
unless indicated otherwise in a credit line to the material. If
material is not included in the article’s Creative Commons
licence and your intended use is not permitted by statu-
tory regulation or exceeds the permitted use, you will need
to obtain permission directly from the copyright holder.
To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
A Appendix
A.1 Numerical minimization of the travel time.
The analytical solution of the Euler-Lagrange equations
belonging to the variation of Eq. (2) can only be found for
the uniform flow field used in Sect. 3.3. It is presented in
Eq. (13). For the remaining force and flow fields, minimizing
the functional of Eq. (2) requires a numerical approach.
This is accomplished by discretizing Eq. (2) and employing
numerical methods to minimize it.
The discretization approximates the trajectory by N =
200 points, r(s) → r_i, where i is the index of the i-th
point. Additionally, the line element ds_i and the tangent
vector t_i are approximated by

\[
ds_i = \sqrt{(\mathbf{r}_{i+1} - \mathbf{r}_i)^2}
\quad \text{and} \quad
\mathbf{t}_i = \frac{\mathbf{r}_{i+1} - \mathbf{r}_i}{ds_i} .
\tag{16}
\]

Furthermore, using the discretized drift-velocity field v_{Di} =
v_D(r_i) and the magnitude of the total velocity v_i =
v(r_{i+1}, r_i) from Eq. (3), the functional for the travel time T
is approximated as

\[
\int_{\mathbf{r}_{\mathrm{i}}}^{\mathbf{r}_{\mathrm{f}}} dt
\longrightarrow
\sum_{i=1}^{N-1} \frac{ds_i}{v_i} .
\]
This expression can be minimized by using the optimiza-
tion package from Julia, Optim.jl, where different solver
algorithms can be used. Due to difficulties in calculating
the Hessian for the Poiseuille flow, we successfully employed
a gradient-based algorithm such as Conjugate Gradient,
where the parameters of the algorithm needed to be tuned.
Nevertheless, to optimize the travel time for the Poiseuille
flow, it was necessary to use the results from reinforcement
learning as initial trajectories. This approach provides a bet-
ter starting point for the optimization process and improves
the convergence of the solver algorithm.
In general, the x coordinate in the optimization problem
is treated as a variable to be determined. However, for simpler
cases like the Mexican hat potential, it is sufficient to
choose equally spaced x coordinates and to keep them fixed
during the minimization. This reduces the complexity of the
optimization problem and makes the solver more efficient in
finding the optimal solution.
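The discretized travel time described in this appendix can be sketched as follows. Eq. (3) is not reproduced in this part of the text, so the sketch assumes the standard Zermelo relation: the total velocity v t is the sum of the swimming velocity v_0 e and the drift v_D, which gives v = t·v_D + sqrt(v_0² − |v_D|² + (t·v_D)²) along the unit tangent t. The function names, the pure-Python implementation, and the test flows are illustrative; the authors used Optim.jl in Julia for the actual minimization.

```python
import math


def speed_along_path(tangent, drift, v0=1.0):
    """Speed along unit tangent t for swimming speed v0 in drift v_D (assumed form)."""
    t_dot_v = tangent[0] * drift[0] + tangent[1] * drift[1]
    disc = v0**2 - (drift[0] ** 2 + drift[1] ** 2) + t_dot_v**2
    if disc < 0:
        raise ValueError("drift too strong to follow this segment")
    return t_dot_v + math.sqrt(disc)


def travel_time(points, drift_field, v0=1.0):
    """Sum of ds_i / v_i over consecutive points of a discretized path."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        ds = math.hypot(x1 - x0, y1 - y0)
        tangent = ((x1 - x0) / ds, (y1 - y0) / ds)
        v = speed_along_path(tangent, drift_field(x0, y0), v0)
        if v <= 0:
            raise ValueError("particle cannot advance along this segment")
        total += ds / v
    return total


# Straight path from x = -0.5 to x = 0.5 with N = 200 points.
N = 200
straight = [(-0.5 + i / (N - 1), 0.0) for i in range(N)]
print(travel_time(straight, lambda x, y: (0.0, 0.0)))  # no flow
print(travel_time(straight, lambda x, y: (0.0, 0.6)))  # uniform cross flow
```

A function of this kind could serve as the objective passed to a gradient-based optimizer, with the interior points of the path as free variables.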
A.2 Learning rate and discount factor
We briefly discuss the procedure for determining the
optimal values for the learning rate α and the discount factor
γ. In Fig. 12, we demonstrate how choosing the parameters
influences the travel time T. We created a heatmap, where
γ versus α is plotted, while the variation in color represents
the logarithmic values of the travel time T. The study was
performed for the potential barrier with U0 = 0.4. The inset
in Fig. 2 shows the learned trajectory. We do not consider
values γ < 0.1, since otherwise the travel time diverges toward
infinity. We can clearly see that the interval γ ∈ [0.4, 0.9] is
the most suitable. In this interval, we can identify several
minima for the travel time marked as red boxes in Fig. 12.
Since we deal with fully deterministic environments, a high
learning rate can be chosen to achieve a fast convergence
of the Q function [41]. Thus, we choose our parameters as
γ = 0.7 and α = 0.9.
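The role of the two parameters becomes clear from the standard tabular Q-learning update, Q(s,a) ← Q(s,a) + α [r + γ max_a' Q(s',a') − Q(s,a)] (see Ref. [41]). The sketch below applies one such update with α = 0.9 and γ = 0.7. The nested-list storage, the function name, and the reward value are hypothetical and serve only as illustration.

```python
def q_update(q, state, action, reward, next_state, alpha=0.9, gamma=0.7):
    """One tabular Q-learning update; modifies q in place and returns the new value."""
    best_next = max(q[next_state])  # value of the best action in the next state
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return q[state][action]


# Toy table with 2 states and 2 actions; the reward of -1 is a hypothetical
# per-step penalty, not the reward used in the paper.
q = [[0.0, 0.0], [1.0, 2.0]]
new_value = q_update(q, state=0, action=1, reward=-1.0, next_state=1)
print(new_value)
```

With α close to 1, the old entry is almost fully replaced by the new estimate, which is appropriate when the environment is deterministic.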
A.3 State space resolution
In our reinforcement learning study, the state space is
defined by the angle ϕand distance ρto the target, with
both variables discretized into 43 values each. Although the
resolution of this approach might seem equivalent to a 43 × 43
square grid, there are some factors that add complexity
to the problem.
First, the linear resolution along the angular direction is
not uniform across the state space, as it decreases for larger
ρ values. This leads to unequal tile sizes, with linear dimensions
as large as 0.221 in the angular direction at the edge
of the state space compared to 0.033 along the radial direction.
As a result, the active particle needs to perform multiple
actions, depending on the time step Δt, before transitioning
to another state. Conversely, around the target,
the angles are very small, causing a large number of states
to be concentrated in that area. This non-uniform distribution
of states poses challenges for the reinforcement learning
process.

Fig. 12 Heatmap for travel time T as a function of learning
rate α and discount factor γ for the active particle moving
around the potential barrier with U0 = 0.4. = 0.4 is
kept constant. The red boxes mark three minima for the
travel time T

Fig. 13 Travel time T plotted against the number of states
Nρ = Nϕ for the curved path around the potential barrier
with U0 = 0.4
Second, due to the definition of the state space in relation
to the quadratic shape of the system, ca. 50 % of the poten-
tial states are never visited by the active particle. Only 957
of the 1849 possible states are actually encountered during
the learning process.
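The polar discretization of the state space can be sketched as follows. The binning convention (uniform bins in ρ between 0 and a maximum distance `rho_max`, uniform bins in ϕ over a full circle) and the value of `rho_max` are assumptions made here for illustration, so the printed tile sizes need not reproduce the values quoted above exactly.

```python
import math


def state_index(position, target, rho_max, n_rho=43, n_phi=43):
    """Map a position to (radial bin, angular bin) relative to the target."""
    dx = target[0] - position[0]
    dy = target[1] - position[1]
    rho = math.hypot(dx, dy)                   # distance to the target
    phi = math.atan2(dy, dx) % (2 * math.pi)   # direction to the target in [0, 2*pi)
    i_rho = min(int(rho / rho_max * n_rho), n_rho - 1)
    i_phi = min(int(phi / (2 * math.pi) * n_phi), n_phi - 1)
    return i_rho, i_phi


def tile_sizes(rho_max, n_rho=43, n_phi=43):
    """Radial bin width and angular arc length of a tile at the outer edge."""
    radial = rho_max / n_rho
    angular_at_edge = rho_max * 2 * math.pi / n_phi
    return radial, angular_at_edge


rho_max = 1.5  # hypothetical maximum distance to the target
print(state_index((-0.5, 0.0), (0.5, 0.0), rho_max))
print(tile_sizes(rho_max))
```

For equal numbers of radial and angular bins, the arc length of an outermost tile exceeds its radial width by a factor of 2π, which is the origin of the unequal tile sizes discussed above.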
To investigate the impact of the state space resolution, we
examined how the number of states Nρ,ϕ along the ρ and ϕ
direction influences the travel time T. We choose Nρ = Nϕ
and plot in Fig. 13 the travel time T versus Nρ,ϕ for the special
case of the potential barrier at U0 = 0.4, where the curved
path is realized. We observe that the travel time decreases
for increasing Nρ,ϕ. It is almost saturated at Nρ,ϕ = 43,
which we use in our investigation. Consequently, a further
increase in Nρ,ϕ is not necessary to guarantee the applicability
of Q learning.
References
1. M.C. Marchetti, J.F. Joanny, S. Ramaswamy, T.B. Liv-
erpool, J. Prost, M. Rao, R.A. Simha, Rev. Mod. Phys.
85, 1143 (2013)
2. A. Zöttl, H. Stark, J. Phys.: Condens. Matter 28, 253001
(2016)
3. S. Ramaswamy, J. Stat. Mech. 2017, 054002 (2017)
4. T. Vicsek, A. Zafeiris, Phys. Rep. 517, 71 (2012)
5. J. Elgeti, R.G. Winkler, G. Gompper, Rep. Prog. Phys.
78, 056601 (2015)
6. C. Bechinger, R. Di Leonardo, H. Löwen, C. Reichhardt,
G. Volpe, G. Volpe, Rev. Mod. Phys. 88, 045006 (2016)
7. H. Chaté, Annu. Rev. Condens. Matter Phys. 11, 189
(2020)
8. G. Gompper, R.G. Winkler, T. Speck, A. Solon, C. Nardini,
F. Peruani, H. Löwen, R. Golestanian, U.B. Kaupp,
L. Alvarez et al., J. Phys.: Condens. Matter 32, 193001
(2020)
9. H.C. Berg, E. coli in Motion, Biological and Medi-
cal Physics, Biomedical Engineering (Springer, Berlin,
2008)
10. A. Berdahl, C.J. Torney, C.C. Ioannou, J.J. Faria, I.D.
Couzin, Science 339, 574 (2013)
11. A. Cavagna, I. Giardina, Annu. Rev. Condens. Matter
Phys. 5, 183 (2014)
12. M. Akter, J.J. Keya, K. Kayano, A.M.R. Kabir, D.
Inoue, H. Hess, K. Sada, A. Kuzuya, H. Asanuma, A.
Kakugo, Sci Robot. 7, eabm0677 (2022)
13. S. Das, E.B. Steager, K.J. Stebe, V. Kumar, Simultane-
ous control of spherical microrobots using catalytic and
magnetic actuation, in 2017 International Conference on
Manipulation, Automation and Robotics at Small Scales
(MARSS) pp. 1–6. (2017)
14. S. Muiños-Landin, A. Fischer, V. Holubec, F. Cichos,
Sci Robot. 6, eabd9285 (2021)
15. M. Hennes, K. Wolff, H. Stark, Phys. Rev. Lett. 112,
238104 (2014)
16. T.J. Pedley, J.O. Kessler, Annu. Rev. Fluid Mech. 24,
313 (1992)
17. K. Drescher, K.C. Leptos, I. Tuval, T. Ishikawa, T.J.
Pedley, R.E. Goldstein, Phys. Rev. Lett. 102, 168101
(2009)
18. W.M. Durham, J.O. Kessler, R. Stocker, Science 323,
1067 (2009)
19. J. Palacci, C. Cottin-Bizonne, C. Ybert, L. Bocquet,
Phys. Rev. Lett. 105, 088304 (2010)
20. M. Enculescu, H. Stark, Phys. Rev. Lett. 107, 058301
(2011)
21. F. Rühle, H. Stark, Eur. Phys. J. E 43, 26 (2020)
22. F. Rühle, A.W. Zantop, H. Stark, Eur. Phys. J. E 45,
26 (2022)
23. N. Waisbord, C.T. Lefèvre, L. Bocquet, C. Ybert, C.
Cottin-Bizonne, Phys. Rev. Fluids 1, 053203 (2016)
24. F. Meng, D. Matsunaga, R. Golestanian, Phys. Rev.
Lett. 120, 188101 (2018)
25. A. Sokolov, I.S. Aranson, Phys. Rev. Lett. 103, 148101
(2009)
26. S. Rafaï, L. Jibuti, P. Peyla, Phys. Rev. Lett. 104,
098102 (2010)
27. A. Zöttl, H. Stark, Phys. Rev. Lett. 108, 218104 (2012)
28. A. Zöttl, H. Stark, Phys. Rev. Lett. 112, 118101 (2014)
29. A. Choudhary, S. Paul, F. Rühle, H. Stark, Commun.
Phys. 5, 14 (2022)
30. G. Volpe, G. Volpe, PNAS 114, 11350 (2017)
31. L.G. Nava, R. Großmann, F. Peruani, Phys. Rev. E 97,
042604 (2018)
32. B. Liebchen, H. Löwen, Europhys. Lett. 127, 34003
(2019)
33. E. Schneider, H. Stark, Europhys. Lett. 127, 64003
(2019)
34. L. Piro, E. Tang, R. Golestanian, Phys. Rev. Res. 3,
023125 (2021)
35. L. Piro, R. Golestanian, B. Mahault, Front. Phys. (Lau-
sanne) 10, 1034267 (2022)
36. L. Piro, B. Mahault, R. Golestanian, New J. Phys. 24,
093037 (2022)
37. S. Goh, R.G. Winkler, G. Gompper, New J. Phys. 24,
093039 (2022)
38. D. Bray, Cell Movements: From Molecules to Motility.
Garland Science (2000)
39. A.M. Hein, F. Carrara, D.R. Brumley, R. Stocker, S.A.
Levin, PNAS 113, 9413 (2016)
40. D. Weihs, P. Webb, J. Theor. Biol. 106, 189 (1984)
41. R.S. Sutton, A.G. Barto, Reinforcement learning: An
introduction. MIT Press (2018)
42. Y. Yang, M.A. Bevan, B. Li, Adv. Intell. Syst. 2,
1900106 (2020)
43. Y. Yang, M.A. Bevan, B. Li, Adv. Theory Simulat. 3,
2000034 (2020)
44. M. Durve, F. Peruani, A. Celani, Phys. Rev. E 102,
012601 (2020)
45. H. Stark, Sci Robot. 6, eabh1977 (2021)
46. M.J. Falk, V. Alizadehyazdi, H. Jaeger, A. Murugan,
Phys. Rev. Res. 3, 033291 (2021)
47. M. Gerhard, A. Jayaram, A. Fischer, T. Speck, Phys.
Rev. E 104, 054614 (2021)
48. M. Nasiri, B. Liebchen, New J. Phys. 24, 073042 (2022)
49. P.A. Monderkamp, F.J. Schwarzendahl, M.A. Klatt, H.
Löwen, Mach. Learn.: Sci. Technol. 3, 045024 (2022)
50. S. Colabrese, K. Gustavsson, A. Celani, L. Biferale,
Phys. Rev. Lett. 118, 158004 (2017)
51. L. Biferale, F. Bonaccorso, M. Buzzicotti, P. Clark Di
Leoni, K. Gustavsson, Chaos: Interdisc. J. Nonlinear Sci.
29, 103138 (2019)
52. J.K. Alageshan, A.K. Verma, J. Bec, R. Pandit, Phys.
Rev. E 101, 043110 (2020)
53. M. Buzzicotti, L. Biferale, F. Bonaccorso, P.C. di Leoni,
K. Gustavsson, Optimal Control of Point-to-Point Nav-
igation in Turbulent Time-dependent Flows Using Rein-
forcement learning, in AIxIA 2020—Advances in Artifi-
cial Intelligence (Springer, Cham, 2021), pp.223–234
54. C. Calascibetta, L. Biferale, F. Borra, A. Celani,
M. Cencini, arXiv:2212.09612v1 [physics.flu-dyn] (2022)
55. E. Zermelo, ZAMM 11, 114 (1931)
56. T. Jaakkola, M.I. Jordan, S.P. Singh, Neural Comput.
6, 1185 (1994)
57. A. Celani, E. Villermaux, M. Vergassola, Phys. Rev. X
4, 041015 (2014)
58. M. Durve, L. Piro, M. Cencini, L. Biferale, A. Celani,
Phys. Rev. E 102, 012402 (2020)
59. G. Reddy, V.N. Murthy, M. Vergassola, Annu. Rev. Con-
dens. Matter Phys. 13, 191 (2022)