
Eur. Phys. J. E (2023) 46:48

https://doi.org/10.1140/epje/s10189-023-00309-3

THE EUROPEAN PHYSICAL JOURNAL E

Regular Article - Flowing Matter

Optimal navigation of a smart active particle: directional and distance sensing

Mischa Putzke^a and Holger Stark^b

Institut für Theoretische Physik, Technische Universität Berlin, Hardenbergstr. 36, 10623 Berlin, Germany

Received 4 February 2023 / Accepted 5 June 2023 / Published online 19 June 2023

© The Author(s) 2023

Abstract We employ Q learning, a variant of reinforcement learning, so that an active particle learns by itself to navigate on the fastest path toward a target while experiencing external forces and flow fields. As state variables, we use the distance and direction toward the target, and as action variables the active particle can choose a new orientation along which it moves with constant velocity. We explicitly investigate optimal navigation in a potential barrier/well and a uniform/Poiseuille/swirling flow field. We show that Q learning is able to identify the fastest path and discuss the results. We also demonstrate that Q learning and applying the learned policy works when the particle orientation experiences thermal noise. However, the successful outcome strongly depends on the specific problem and the strength of noise.

1 Introduction

Active matter refers to materials that consist of self-

propelled entities such as active particles, artiﬁcial

microswimmers, and microorganisms, which exhibit

versatile collective behavior and dynamic patterns [1–

3]. These systems are characterized by their ability to

use an internal energy depot or energy from the envi-

ronment to generate active motion, for example, by

deformations [1,2,4–8]. Examples of active matter are

biological systems including bacteria [9], schools of ﬁsh

and swarms of birds [10,11] as well as artiﬁcial systems

such as suspensions of self-propelled colloids or other

synthetic microswimmers [12–14]. Over the years, inter-

est in the control of active motion has increased using

external ﬁelds [15], in particular, gravitational [16–22],

magnetic [23,24], and ﬂow ﬁelds [25–29].

With the ability to speciﬁcally manipulate active

motion, optimizing the traveled path in a complex envi-

ronment, for example, by ﬁnding the fastest trajectory

has come into focus. While optimal search strategies

depend on the environment [30], Ref. [31] suggests min-

imal navigation strategies and in Refs. [32,33] optimal

navigation is achieved by minimizing travel time. This

can be done even on curved manifolds such as a sphere

[34] and using optimal control theory [35,36]. Further-

more, the noisy pursuit of a self-steering active particle

has been studied [37].

^a e-mail: m.putzk[email protected]erlin.de
^b e-mail: holger.stark@tu-berlin.de (corresponding author)

For living organisms, there are many examples where optimal navigation is crucial, such as finding food sources [38,39] or escaping from predators [40]. While

organisms have learned their optimal navigation strategy through evolution, reinforcement learning [41] offers

a promising method of training artiﬁcial microswim-

mers to steer optimally toward a target. Applica-

tions range from robotics [42,43] to biology and active

matter [14,33,44–47]. Reinforcement learning is a type

of machine learning where an agent learns a speciﬁc

task by taking actions in an environment and receiving

feedback in the form of rewards. Now, active particles

can use this algorithm to learn how to navigate opti-

mally based on their sensory inputs and reward signals.

Examples are microswimmers that move toward a target by adjusting their orientations along which self-propulsion occurs. In recent years, it has been demonstrated that reinforcement learning is a well-suited method for finding optimal navigation solutions in, for example, complex potentials [33,48,49], turbulent flows [50–53], as well as chaotic flows [54].

In this article, we employ Q learning, a variant of

reinforcement learning, so that the agent or microswim-

mer learns by itself to move on the fastest path from the

starting point to the target under the action of forces

and ﬂow ﬁelds. This is the traditional Zermelo navi-

gation problem [55]. In contrast to our previous work

[33], the microswimmer can sense the direction and dis-

tance to a target, which we ﬁnd potentially easier to

realize than monitoring the position. The smart active

particle ﬁrst moves deterministically and can control

its orientation. We show that the microswimmer with

the new state variables is able to identify and navi-

gate on the fastest path in diﬀerent complex landscapes

such as potential barriers and wells as well as uni-


form, Poiseuille, and swirling ﬂow ﬁelds. In addition,

we show that learning optimal navigation and apply-

ing the learned policy also works under thermal noise,

which the particle orientation experiences during train-

ing. However, the outcome strongly depends on the spe-

ciﬁc problem and the strength of noise.

The article is structured as follows. In Sect. 2, we introduce our model, the method of Q learning, the choice of state and action variables, and the system parameters. In Sect. 3, we present our results of the potential barrier/well and the uniform/Poiseuille/swirling flow fields. We close with conclusions.

2 Model

2.1 Equations of motion

Our goal is to train the microswimmer such that it opti-

mizes its travel time moving in two dimensions from a

starting to a target position. Along its path, it expe-

riences an additional drift-velocity ﬁeld, which repre-

sents some complex environment (see Fig. 1). We sim-

ply model the microswimmer as an active particle with

negligible inertia that moves with speed v_0 along the direction e = (cos θ, sin θ), where θ is the angle with respect to the x axis (see Fig. 1). Rescaling all velocities by v_0, the space-dependent total velocity of the active particle is

$$\mathbf{v} = \mathbf{e} + \mathbf{v}_D . \tag{1}$$

We assume that the active particle can control its orientation e in order to find the optimal travel time. For this, after every time interval Δt it senses its state and uses Q learning to set its new orientation e, as we explain in Sects. 2.2 and 2.3. Furthermore, we do not include any external torque. However, for the drift-velocity field v_D(r) we will consider one type of flow field (Poiseuille flow) with nonzero vorticity, which rotates the particle orientation with angular velocity ω_D = |curl v_D|/2. Thus, during time step Δt, the particle's orientation evolves according to

$$\dot{\theta} = \omega_D .$$

Fig. 1 The active particle with orientation unit vector e is trained to move from the starting to target position on the fastest path crossing a region with a prescribed drift-velocity field v_D(r). To locate its position, the particle can sense the direction angle ϕ and distance ρ to the target

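The optimization of the travel time can also be formulated as a typical variational principle. The time to travel from the starting position at r_i to the target position at r_f is

$$T = \int_{\mathbf{r}_i}^{\mathbf{r}_f} \mathrm{d}t = \int_{\mathbf{r}_i}^{\mathbf{r}_f} \frac{\mathrm{d}s}{v} , \tag{2}$$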

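where v = |v|. We parametrize the particle path r(s) with the arclength s, introduce the unit tangent along the path, t = dr/ds, and write the total velocity as v = v t. Writing Eq. (1) as e = v t − v_D, taking the square with |e| = 1, and solving for v, we obtain

$$v = \mathbf{t}\cdot\mathbf{v}_D + \sqrt{1 - \left[ v_D^2 - (\mathbf{t}\cdot\mathbf{v}_D)^2 \right]} . \tag{3}$$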

With this expression, the variation of the travel time, δT = 0, can be formulated in order to find a minimum. Typically, the minimum has to be calculated numerically, and an analytic solution is possible only in special cases.
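As a minimal numerical sketch of Eqs. (2) and (3), the travel time along a prescribed path can be approximated by summing ds/v over short segments. The function names, the polyline discretization with midpoint evaluation, and the uniform drift k e_y used as a check are illustrative assumptions and not the authors' implementation.

```python
import math

def speed_along_path(tangent, v_drift):
    """Eq. (3): speed v along unit tangent t for drift v_D (velocities in units of v0)."""
    t_dot_vd = tangent[0] * v_drift[0] + tangent[1] * v_drift[1]
    vd_sq = v_drift[0] ** 2 + v_drift[1] ** 2
    radicand = 1.0 - (vd_sq - t_dot_vd ** 2)
    if radicand < 0.0:
        raise ValueError("drift too strong: this path direction cannot be held")
    return t_dot_vd + math.sqrt(radicand)

def travel_time(path, drift_field):
    """Approximate Eq. (2), T = int ds / v, along a polyline of (x, y) points."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(path[:-1], path[1:]):
        ds = math.hypot(x1 - x0, y1 - y0)
        tangent = ((x1 - x0) / ds, (y1 - y0) / ds)
        midpoint = (0.5 * (x0 + x1), 0.5 * (y0 + y1))
        total += ds / speed_along_path(tangent, drift_field(*midpoint))
    return total

# Hypothetical check: straight path from (-0.5, 0) to (0.5, 0) in a uniform drift k * e_y
k = 0.4
straight = [(-0.5 + i / 200, 0.0) for i in range(201)]
T = travel_time(straight, lambda x, y: (0.0, k))
print(T, 1.0 / math.sqrt(1.0 - k ** 2))
```

For a straight path of unit length in a uniform cross drift, the two printed numbers should agree, since Eq. (3) reduces to v = √(1 − k²) in that case.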

Besides treating the active particle fully determinis-

tically, we will also explore how thermal ﬂuctuations

inﬂuence the learning of the optimal path and when

applying the learned optimal policy. Since we will work

at large Péclet numbers, where translational noise can

be discarded, we concentrate on thermal noise in the

orientation. Thus, we use the model of an active Brow-

nian particle and write down an overdamped Langevin

equation for the time derivative of the orientation angle

θ. Including the nonzero vorticity for one type of flow field, the Langevin equation becomes

$$\dot{\theta} = \omega_D + \sqrt{D_R L / v_0}\,\eta . \tag{4}$$

Here, η represents standard Gaussian white noise with zero mean, ⟨η(t)⟩ = 0, and unit variance, ⟨η(t)η(t′)⟩ = δ(t − t′). The thermal rotational diffusion coefficient D_R ensures the validity of the fluctuation-dissipation theorem. Time is rescaled by the characteristic time scale L/v_0, and L is a typical length, which we choose as the distance between starting and target position.

Now, to include rotational noise in the training process and when applying the learned policy, we need to numerically integrate Eq. (4). At each step of Q learning, the orientation angle is set to a value θ as explained in Sect. 2.3. Starting from this value, the new orientation angle θ_new after time step Δt becomes

$$\theta_{\mathrm{new}} = \theta + \omega_D \Delta t + \sqrt{(D_R L / v_0)\,\Delta t}\; W_\theta , \tag{5}$$

where √Δt W_θ represents the increment of a Wiener process and W_θ is a random number with zero mean and unit variance, ⟨W_θ²⟩ = 1. The new angle θ_new is then used to perform a step of the active particle, which is further processed as explained in Sect. 2.3.
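The orientation update can be sketched as a single Euler-Maruyama step. This assumes that the noise term in Eq. (5) reads √((D_R L/v_0)Δt) W_θ with a Gaussian W_θ; the function names and the parameter `noise_strength`, which stands for D_R L/v_0, are hypothetical.

```python
import math
import random

def noisy_orientation_step(theta, omega_d, noise_strength, dt, rng=random):
    """One step of Eq. (5); noise_strength stands for D_R * L / v0 (assumed name)."""
    w_theta = rng.gauss(0.0, 1.0)  # random number with zero mean and unit variance
    return theta + omega_d * dt + math.sqrt(noise_strength * dt) * w_theta

def increment_in_degrees(noise_strength, dt, w_theta=1.0):
    """Angle change in degrees for a fixed random number W_theta and omega_D = 0."""
    return math.degrees(math.sqrt(noise_strength * dt) * w_theta)

print(round(increment_in_degrees(1.0, 0.0375), 1))
```

With D_R L/v_0 = 1, Δt = 0.0375, and W_θ = 1, the increment comes out at roughly 11°, which is the order of magnitude quoted for the noisy runs in Sect. 3.2.1.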


With the formulated model, we fully describe the

dynamics of our microswimmer. Now, we can start

using Q learning to train the swimmer such that it finds

the optimal path in various drift-velocity ﬁelds.

2.2 Q learning

In order for an active particle to learn the fastest path

from an initial position to an assigned target, while

moving in an external drift-velocity ﬁeld, the method

of tabular Q learning can be applied [41]. Here, one first creates a Q table that stores a Q value for each pair of state (e.g., position of the particle) and possible action variables (e.g., movement in a certain direction). The Q value represents the expected reward accumulated

during a sequence of steps when starting from a state-

action pair. Thus, it not only considers the immediate

reward but future rewards as well.

To start the Q-learning algorithm, the active particle

is placed at the initial position and moves under the

inﬂuence of the drift-velocity ﬁeld in a series of steps.

Before it performs an action, the Q values of the possible actions in the current state are checked and the action with the highest Q value is selected. It represents the highest expected accumulated reward. After each action, the corresponding Q value for the state-action pair is adjusted as reported below. When the active particle reaches the target, the first training episode is completed. A new episode begins with the same initial state, and the process repeats until the Q matrix

of the agent converges to a stable solution. This means

that the agent has learned the optimal policy, which for

each of the states gives the optimal action such that the

total reward accumulated by the agent along its path

is maximized.

During training, the new Q value, Q^new, must be calculated for each pair of state and action variables, which is done via the Bellman equation [41]:

$$Q^{\mathrm{new}}(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \cdot \left( r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right) . \tag{6}$$

Here, s_t is the current state and a_t the current action variable. The immediate reward r_t belongs to taking action a_t in state s_t, and s_{t+1} is the new state reached. The term max_a Q(s_{t+1}, a) represents the maximum expected future reward for the new state variable s_{t+1}. The discount factor γ determines how much the future reward is taken into account, and α quantifies the learning speed. The optimal values of both factors depend on the specific problem. In Appendix A.2, we briefly address this for our optimization tasks. One can show that, when defining an immediate reward suitable to the optimization task, the Q matrix will converge toward its optimal value by applying the recursion formula (6) [56].
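The update of Eq. (6) can be written in a few lines. The following is only a sketch: the nested-list table layout (distance bin, direction bin, orientation index) and the function name are assumptions, while the default values α = 0.9 and γ = 0.7 and the initial entries of 100 are the ones quoted in Sect. 2.4.

```python
def q_update(q_table, state, action, reward, next_state, alpha=0.9, gamma=0.7):
    """Apply the update of Eq. (6) in place to one state-action entry."""
    i, j = state
    ni, nj = next_state
    best_next = max(q_table[ni][nj])                      # max_a Q(s_{t+1}, a)
    td_error = reward + gamma * best_next - q_table[i][j][action]
    q_table[i][j][action] += alpha * td_error
    return q_table[i][j][action]

# Hypothetical table: 43 x 43 states, 8 orientations, all entries initialized to 100
q = [[[100.0] * 8 for _ in range(43)] for _ in range(43)]
new_value = q_update(q, state=(3, 5), action=2, reward=-0.0375, next_state=(2, 5))
print(new_value)
```

Only the visited entry changes in one update; all other entries keep their previous values until the corresponding state-action pair is visited.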

We combine the deterministic choice of the action with the ε-greedy method. The most rewarding action is only taken with probability 1 − ε, and otherwise the action is chosen randomly. This prevents the optimization from becoming stuck in a local minimum. We decrease ε with each episode i according to ε = 0.5(1 − i/i_max), where i_max is sufficiently large to guarantee that the Q matrix has converged and the phenomenological factor 0.5 guarantees the fastest convergence. Despite using the ε-greedy method, in the following we will also call this version deterministic Q learning to distinguish it from Q learning in the presence of orientational thermal noise.
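A minimal sketch of the ε-greedy selection with the linear decay described above could look as follows. The function names are hypothetical, and breaking ties in favor of the first maximal entry is an assumption that the text does not specify.

```python
import random

def epsilon_for_episode(i, i_max):
    """Linear decay eps = 0.5 * (1 - i / i_max)."""
    return 0.5 * (1.0 - i / i_max)

def choose_action(q_values, epsilon, rng=random):
    """Random action with probability eps, otherwise the action with the highest Q value."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

q_values = [0.0, 1.0, 5.0, 2.0, 0.0, 0.0, 0.0, 0.0]   # one entry per orientation
print(epsilon_for_episode(0, 5000), epsilon_for_episode(5000, 5000))
print(choose_action(q_values, epsilon=0.0))
```

At the first episode half of the actions are random, and at i = i_max the choice is purely greedy.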

2.3 Choice of state and action variables

To find the optimal navigation in a complex environment using tabular Q learning, information about the

position of the particle is required. The x, y coordinates

are often used to deﬁne the state of the particle [14,33].

However, having information about the position is cer-

tainly diﬃcult for a microswimmer to realize, when

video microscopy and some external information pro-

cessing are not available [14]. It seems more realistic and

easier for microswimmers to sense the direction and dis-

tance to a target using, e.g., a magnetic ﬁeld, in which

magnetotactic bacteria experience a torque, a light ﬁeld

in combination with phototaxis, or olfactory sensing

[57–59]. Also sensing a chemical ﬁeld is an option; how-

ever, the concentration ﬁeld can be distorted by the sur-

roundings. Thus, compared to our previous work [33],

to describe the state of the active particle, we perform

a coordinate transformation (x, y) → (ρ, ϕ), where ρ is the distance and ϕ the direction angle to the target (see Fig. 1). Reducing the dimension of the state space by omitting ρ or ϕ is not possible within the Q-learning formalism since then the required action, which depends on the position of the active particle, cannot be unambiguously predicted.

In the following, we will use these alternative state

variables and Q learning to solve the navigation problem. As we mentioned previously, the tabular Q-learning method creates a matrix with a finite number of elements. Thus, the state space consists of a set of 2N discrete state variables, for which we choose:

$$\text{distance:}\quad \rho_i \in [\rho_1 = 0, \rho_2, \ldots, \rho_N] \tag{7}$$

$$\text{direction:}\quad \varphi_i \in [\varphi_1 = 0, \varphi_2, \ldots, \varphi_N] . \tag{8}$$

The active particle moves continuously in space. We

assign its distance ρ and direction angle ϕ to the discrete values ρ_i and ϕ_i when ρ and ϕ fall within the intervals [ρ_i − Δρ/2, ρ_i + Δρ/2] and [ϕ_i − Δϕ/2, ϕ_i + Δϕ/2], respectively. Here, Δρ = (ρ_N − ρ_1)/(N − 1) and Δϕ = 2π/N, and for the end points i = 1, N only half of the intervals are taken. Likewise, the action variable is the discretized orientation angle of the active particle, for which we always take eight values:

$$\text{orientation:}\quad \theta_i = (i-1)\,\frac{\pi}{4}, \quad i = 1, \ldots, 8. \tag{9}$$
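The mapping from a continuous position to discrete state indices can be sketched as below. Several details are assumptions made for illustration: the direction angle is taken as the angle of the vector from the particle to the target measured from the x axis, indices are zero-based, and the direction bins are wrapped periodically rather than using half intervals at the end points.

```python
import math

def sense_state(position, target):
    """Distance rho and direction angle phi (in [0, 2*pi)) from particle to target."""
    dx, dy = target[0] - position[0], target[1] - position[1]
    return math.hypot(dx, dy), math.atan2(dy, dx) % (2.0 * math.pi)

def distance_bin(rho, rho_max, n_bins):
    """Nearest-grid-point index for rho_i = i * d_rho, i = 0 .. n_bins - 1."""
    d_rho = rho_max / (n_bins - 1)
    return min(n_bins - 1, int(rho / d_rho + 0.5))

def direction_bin(phi, n_bins):
    """Nearest-grid-point index for phi_i = i * 2*pi / n_bins, wrapped periodically."""
    d_phi = 2.0 * math.pi / n_bins
    return int(phi / d_phi + 0.5) % n_bins

def orientation(action_index):
    """Eq. (9) with a zero-based index: theta = action_index * pi / 4."""
    return action_index * math.pi / 4.0

rho, phi = sense_state(position=(-0.5, 0.0), target=(0.5, 0.0))
print(distance_bin(rho, 1.4577, 43), direction_bin(phi, 43), orientation(2))
```

The two indices returned by `distance_bin` and `direction_bin` together select one row of the Q matrix, and the eight entries in that row correspond to the orientations of Eq. (9).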

Now, for the discrete state variables an action variable is selected according to the Q matrix, so the active particle changes its orientation from θ_t to θ_{t+1}. Then, the active particle moves in continuous space. Using Eq. (1), it reaches the position

$$\mathbf{r}_{\mathrm{new}} = \mathbf{r} + \mathbf{v}\,\Delta t , \tag{10}$$

for which we calculate the new distance ρ and direction angle ϕ. The procedure is repeated until the particle arrives at the target, meaning ρ < Δρ/2, and one training episode is completed. If the target is not reached after 500 time steps, the episode is stopped and a negative reward of R = −10 is generated. The episodes are repeated until the Q matrix converges and the optimal solution is found. In practice, we choose i_max from the ε-greedy method sufficiently high to achieve convergence.

When we include thermal noise in the orientation vector e or a flow field with nonzero vorticity ω_D, we first let the orientation angle evolve according to Eq. (5) and then perform the translational step according to Eq. (10). Note also that we do not use the ε-greedy method here, since thermal noise naturally brings in some randomness.

2.4 System parameters

In units of the distance L between start and target, the square system extends in the x and y directions from −0.75 to 0.75. Start and target are placed on the x axis at −0.5 and 0.5, respectively. We use N = 43 discrete values for ρ and ϕ with the maximum distance ρ_N = 1.4577 and the resolution Δρ = 0.0343. This guarantees that every position in the system, including the corners farthest from the target, falls within the range of ρ. Typically, the time step is Δt = 0.0375. However, for the case of crossing a uniform flow (Sect. 3.3) and the Poiseuille flow (Sect. 3.4), we need to increase it with the strength of the flow, and for the case of a swirling flow (Sect. 3.5), Δt needs to decrease for stronger swirls.

The time step should roughly be adjusted such that

during one time step the particle can move from one

grid element to a neighboring element. Choosing it

smaller does not bring any improvement. On the con-

trary, the Q-learning algorithm has to perform numer-

ous steps to move the particle forward, without reach-

ing other state variables, which can considerably slow

down the whole learning process. For the two cases of

the uniform and Poiseuille ﬂow, the time step needs

to be increased to have a noticeable motion across the

flow, while for the swirling flow Δt needs to decrease in

order to properly approximate the trajectory.

To start the training, the Q matrix must be initialized. Here, we set all entries uniformly to Q(s, a) = 100 [41]. An important point is how the rewards r_t for the different actions within Q learning are chosen. Since we want to minimize travel time T, each step of duration Δt receives a reward r_t = −Δt. So the accumulated negative reward is smallest in magnitude if the number of steps is minimized. Reaching the target gives a reward of 100, and if the particle crosses the border of the system, a large negative reward of −10 is used to strongly penalize this action.

Furthermore, when performing such a move, the parti-

cle is placed back to the location, where the border was

crossed. This procedure corresponds to implementing a

hard-core repulsion from the border, while the negative

reward signals the particle within the learning phase

to avoid such steps. Without such a negative reward

the learning takes longer. The learning speed and discount factor are chosen as α = 0.9 and γ = 0.7, respectively. In Appendix A.2 we present a parameter study to determine the optimal values for α and γ. Finally, for implementing the ε-greedy method, i_max = 5000 is often sufficient for the travel time to converge, and we never need to go beyond i_max = 20000.
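The reward scheme of this section can be collected in one small function. This is a sketch under assumptions: the function and flag names are hypothetical, and whether the target reward of 100 replaces or adds to the step cost −Δt is not stated in the text; here it replaces it.

```python
def step_reward(dt, reached_target=False, crossed_border=False, timed_out=False):
    """Reward for one step, using the values quoted in the text."""
    if reached_target:
        return 100.0        # reaching the target
    if crossed_border or timed_out:
        return -10.0        # crossing the border or exceeding the step limit
    return -dt              # ordinary step: penalize the elapsed time

def episode_return(n_steps, dt):
    """Undiscounted sum of rewards for an episode that reaches the target."""
    return sum(step_reward(dt) for _ in range(n_steps - 1)) + step_reward(dt, reached_target=True)

print(episode_return(27, 0.0375), episode_return(40, 0.0375))
```

Comparing the two printed values shows the intended effect: the episode with fewer steps accumulates the larger total reward.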

3 Results

We now present results for a few types of drift-velocity

ﬁelds, which either derive from a potential or are due

to an imposed ﬂow ﬁeld.

3.1 Potential barrier and well

In our earlier work [33], we determined the fastest path,

when there is a potential barrier between the start and

target. We modeled the barrier using the Mexican hat

potential without brim,

U=16U0(r2−1/4)2,r≤1/2

0,otherwise ,(11)

where ris the radial distance to the center. At r=0,

the potential has a maximum with height U0,andon

the ring r=1/2 it is zero with horizontal tangent. The

maximum potential force −∇Uis at r=1/2√3 with

|∇U|=16U0/3√3. The inset of Fig. 2shows a grayscale

representation of the potential.
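The quoted position and magnitude of the maximum force follow from the radial derivative of Eq. (11) and can be checked numerically. The sketch below uses hypothetical function names and a simple grid search; the value U_0 = 0.4 is only an example.

```python
import math

def potential(r, u0):
    """Mexican hat potential without brim, Eq. (11)."""
    return 16.0 * u0 * (r * r - 0.25) ** 2 if r <= 0.5 else 0.0

def radial_force(r, u0):
    """Radial component of -grad U: -dU/dr = -64 * U0 * r * (r^2 - 1/4) for r <= 1/2."""
    return -64.0 * u0 * r * (r * r - 0.25) if r <= 0.5 else 0.0

u0 = 0.4
r_star = 1.0 / (2.0 * math.sqrt(3.0))
grid = [0.5 * i / 10000 for i in range(10001)]
r_max = max(grid, key=lambda r: abs(radial_force(r, u0)))
print(r_max, r_star)
print(abs(radial_force(r_star, u0)), 16.0 * u0 / (3.0 * math.sqrt(3.0)))
# |U0| at which the maximum force equals the swimming speed (1 in rescaled units)
print(3.0 * math.sqrt(3.0) / 16.0)
```

The last number, about 0.325, is the magnitude of U_0 at which the maximum potential force balances the self-propulsion speed; it reappears below as the depth beyond which the particle cannot escape the well.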

The potential force can be incorporated in Eq. (1) by choosing v_D(r) = −∇U. As we already discussed in detail in Ref. [33], the variation δT = 0 identifies the straight path over the potential barrier as optimal until U_0 = 0.24, where the curved path around the barrier becomes faster (see main plot of Fig. 2). There is also a regime where both paths are locally stable. Q learning is able to identify the optimal paths, as the blue dots in Fig. 2 show. For the curved path (an example is shown in the inset), the minimum travel time is better approximated than in Ref. [33] since we allow the active particle to move freely in space and not just on a grid.

Fig. 2 Shortest travel time T versus barrier height U_0 either for crossing the barrier of a Mexican hat potential on a straight path (magenta) or moving around it (green). The blue dots indicate the results of Q learning. The inset shows a grayscale representation of the potential and the optimal path determined with Q learning for U_0 = 0.4

Fig. 3 Potential well with U_0 = −0.4. Top: In the first episode (i = 1), the active particle becomes trapped in the well and the learning ended after T = 500. The current action receives a negative reward of −10. Bottom, blue trajectory: After i_max = 50,000 the active particle has learned to avoid the well and move around it. Other trajectories: Example trajectories for applying an averaged Q matrix under noise. The index n in the corresponding travel time T_{Qn} refers to the strength of the orientational thermal noise, n = D_R L/v_0

By choosing a negative U0, the potential barrier

becomes a well. The phenomenology of the paths is the

same as for the barrier. However, there is one diﬀer-

ence. If the depth of the well is below U0=−3√3/16 =

−0.325, the active particle cannot escape the well. This

is illustrated for U0=−0.4 in Fig. 3, top, where we

show the trajectory for the ﬁrst episode. It is stopped

after a travel time of T= 500 and the current action

receives a large negative reward of −10. The episodes

are repeated, and obviously the active particle learns

to avoid the well since it ultimately ﬁnds the optimal

path around the potential well (Fig. 3, bottom). The

blue curve with travel time T_0 refers to zero noise. However, since the active particle needs to explore and learn to avoid the “forbidden” region of the well, a larger i_max = 50,000 is necessary. This example shows very

clearly that negative results of not reaching the tar-

get also contribute to the learning process of the active

particle.

We add a ﬁnal note. The policy for the optimal path

is encoded in the Q matrix. However, this matrix contains more information. One can place the active particle at any location within the system, provided this location has been visited before in the learning phase, and the Q matrix will guide it to the target. However,

in general, the path will not be the fastest. We have

checked this for the potential barrier.

3.2 Learning with noise

Orientational noise can be included during diﬀerent

stages of Q learning, which we explore here for the

potential barrier and well. One can include noise while

learning the optimal paths and when applying the opti-

mal policy. In addition, one can vary the noise strength.

3.2.1 Potential barrier

We start with exploring the influence of noise during learning. The prefactor D_R L/v_0 in Eq. (5) with ω_D = 0 was set to one to have a noticeable change of particle orientation e during the time step Δt. For example, for a random number of W_θ = 1 the change in orientation angle, θ_new − θ, is 11°. Nevertheless, the active particle learns to move across the barrier and

around it as Fig. 4, top illustrates for low and high

U0. However, the learned trajectories are noisy, which

also increases the travel time compared to the deter-

ministic case. Interestingly, at U0=0.225 close to the

point (U0=0.24) where the absolute stability switches

from the straight to the curved path, we observe both

types of paths as Fig. 4, bottom shows. Thus, noise

causes the optimization process to converge into diﬀer-

ent local minima. Interestingly, the travel time of the

straight path is more strongly aﬀected by orientational

noise compared to the curved path and, therefore, its

value is more strongly enhanced. This makes sense since

the acting potential forces drive the particle away from

the optimal path. In light green, we plot 100 trajec-

tories each determined from a separate learning run.

They show that learning under noise can reproduce the

two types of trajectories; only a few of them deviate

more strongly. Alternatively, one can also take a specific learned Q matrix and apply it several times under

noise (not shown in Fig. 4). While the curved paths are

reproduced, now the “straight paths” are strongly dis-

torted and nearly cover the whole region of the potential

barrier; again because noise aﬀects them more strongly.

To quantify our observations further, we then took the 100 Q-learning runs for the same U_0 and calculated the mean of the travel time, T_1, and its standard deviation ΔT_1, where the index 1 refers to the noise strength D_R L/v_0 = 1. Both quantities are plotted versus U_0 in Fig. 5. At small and high U_0, the mean travel time behaves as the deterministic value in Fig. 2. It increases with U_0 for the straight path and levels off at high U_0 for the curved path. Similarly, the standard deviation increases for small U_0 and it is small for large U_0. Hence, noise does not cause large variations. Interestingly, and in contrast to the deterministic value, the mean travel time becomes maximal at U_0 = 0.225, where both path types are observed, especially the noisy straight path with longer travel time. Accordingly, close to this value the standard deviation has a maximum.

Fig. 4 Orientational noise does not prevent learning the optimal paths. Top: Straight and curved paths for U_0 = 0.15 and U_0 = 0.4, respectively. Bottom: At U_0 = 0.225 both paths are realized. In light green, 100 paths are shown. Each of them results from learning the optimal path under noise

As an alternative approach to dealing with noise, we performed 10 Q-learning runs and used the averaged optimal Q matrix to run 100 trajectories in the presence of orientational noise. Note that in the following the averaged Q matrix is abbreviated as ⟨Q⟩ and, when referring to it in an index, as Q. The mean travel time T_{Q1} (green triangles in Fig. 5) does not deviate strongly at small and large U_0 from T_1, where T_1 was calculated for each newly learned trajectory. Only around U_0 = 0.225 is the deviation stronger, and the green curve misses the peak. The reason is that already at U_0 = 0.225 the occurrence of straight paths under orientational noise is rare and the 10 learned Q matrices, from which we determine the mean ⟨Q⟩, belong to the curved path.

Finally, we add that to achieve these results, we did not use the ε-greedy method. Thus, thermal noise also helps in finding optimal trajectories. We also checked that including the ε-greedy method did not change our results significantly.

Fig. 5 Mean travel time T_1 (red circles) and standard deviation ΔT_1 (error bars) plotted versus U_0 from Q learning in the presence of orientational thermal noise during the learning phase. Mean travel time T_{Q1} (green triangles) for applying an averaged Q matrix under noise. The lines are a guide to the eye

3.2.2 Potential well

Of course, when increasing the strength of the orientational thermal noise, D_R L/v_0, the “optimal paths” and travel times deviate more and more from their deterministic values. To illustrate this, we consider the potential well with U_0 = −0.4 (Fig. 3, bottom), which the active particle cannot leave once it moves too close to the center. With increasing noise, the learned Q matrices differ strongly from each other. So we decided to use a mean optimal Q matrix averaged over 10 learning runs under the same noise strength. Figure 3, bottom shows example trajectories when applying this averaged Q matrix for different noise strengths, where the index n in T_{Qn} refers to n = D_R L/v_0. One already recognizes that, as a strategy for not becoming trapped in the well, the particle keeps a larger distance to the well with increasing noise.

This is illustrated quantitatively in Fig. 6. We applied the averaged Q matrix 100 times and then plot T_{Qn} and its standard deviation, represented as error bars, versus the strength of the orientational noise. One realizes that noise does not only have the effect of moving the trajectory further away from the center and thereby increasing the travel time. In addition, noise lets the trajectory become more irregular (see Fig. 3, bottom). How much this contributes to the travel time is clarified by the red circles in Fig. 6. Here, we apply the averaged Q matrix without noise. So T_{Q0} shows the pure effect of the particle, which needs to avoid the potential well when learning under noise to reach the target.

3.3 Crossing uniform flow

So far, we used a potential force acting on the active particle. Now, we put the active particle in a uniform flow field along the y axis with strength k, v_D = k e_y, so that the total velocity of the active particle becomes

$$\mathbf{v} = \mathbf{e} + k\,\mathbf{e}_y . \tag{12}$$

Fig. 6 Mean travel time T_{Qn} (green triangles) and standard deviation (error bars) plotted versus strength of the orientational thermal noise, D_R L/v_0, for the potential well with U_0 = −0.4. Under noise, 10 optimal Q matrices are determined and the averaged Q matrix is applied 100 times to determine T_{Qn}. The time T_{Q0} (magenta circles) refers to the averaged Q matrix applied without noise. The lines are a guide to the eye

Fig. 7 Optimized travel time T versus strength k in a uniform flow field. Green line: from analytic optimization. Blue dots: from deterministic Q learning. Red dots: mean of 100 Q-learning runs under noise, and error bars indicate the standard deviation. Insets: Examples of learned trajectories for k = 0.4 and 0.8, respectively. Green: analytic minimum, blue: deterministic Q learning, red: Q learning under noise. The green arrow and the blue arrows indicate the respective particle orientations for the first and second case

The Euler-Lagrange equation corresponding to the variation of the travel time, δT = 0, can be formulated and solved analytically. The calculation is a bit lengthy but straightforward. It always gives the straight path between start and target as optimal, as shown for two examples in the insets of Fig. 7 (green lines). For increasing flow strength k, the orientation vector e has to tilt more and more against the imposed flow to prevent the active particle from drifting downstream. Indeed, one finds for the optimal travel time

$$T = \frac{1}{\sqrt{1 - k^2}} , \tag{13}$$

which diverges for k → 1. Here, the active particle points fully against the flow and there is no component of the swimming velocity left to cross the flow field along the x direction. The blue dots in Fig. 7 show the results of the optimization from deterministic Q learning, and for two cases we show the paths (blue lines) and the particle orientations (blue arrows) in the insets. The straight path along the x direction is not realized since the particle orientation e can only assume eight discrete orientations. Note that, to perform Q learning, we needed to increase the time step Δt from 0.0375 to 0.1 with increasing k.
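The value in Eq. (13) can be checked with a short calculation. It is sketched here under the assumption that the particle follows the straight path of unit length along the x axis, so that the swimming direction e = (cos θ, sin θ) must cancel the drift in the y direction:

```latex
\begin{aligned}
\mathbf{v}\cdot\mathbf{e}_y &= \sin\theta + k = 0
  \quad\Rightarrow\quad \sin\theta = -k , \\
\mathbf{v}\cdot\mathbf{e}_x &= \cos\theta = \sqrt{1 - k^2} , \\
T &= \frac{1}{\mathbf{v}\cdot\mathbf{e}_x} = \frac{1}{\sqrt{1 - k^2}} .
\end{aligned}
```

This shortcut only evaluates the travel time on the straight path; that this path is indeed optimal is the result of the variational calculation mentioned in the text.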

In a last step, we include orientational thermal noise when performing 100 Q-learning runs. Including the time step, the noise in Eq. (5) is determined by (D_R L/v_0)Δt. To apply the same noise during a time step, we always choose (D_R L/v_0)Δt = 0.0375. Thus, when increasing Δt, we reduce D_R L/v_0 accordingly. The red dots in Fig. 7 show the mean travel time T_1 from the 100 Q-learning runs, and the error bars indicate the standard deviation. Until k = 0.8, no strong deviation from deterministic Q learning is observed, and the red trajectories in the insets show typical examples. In contrast, the trajectories for k = 0.9 strongly deviate from each other, which results in a strongly increased T_1 with a large standard deviation. The reason is that the particle orientation is nearly antiparallel to the flow field, so orientational fluctuations can cause the particle to move away from the target instead of toward it. These excursions result in a strongly increased T_1.

3.4 Crossing Poiseuille flow

In our next example, we implement a Poiseuille flow field along the y axis with zero velocity at the boundaries of our typical system geometry, x = ±3/4. The y component of the velocity field vD then becomes

$$v_y = v_c \left[ 1 - \left( \tfrac{4}{3} x \right)^{2} \right], \quad \text{where} \quad \omega_D = -\tfrac{16}{9}\, v_c\, x \tag{14}$$

describes the rotational velocity experienced by the active particle due to the nonzero flow vorticity. To keep the problem simple, we do not implement any hydrodynamic interactions with the bounding walls. Again, we look for the fastest trajectory between starting and target positions located at −0.5 and 0.5 on the x axis, respectively.
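As a quick consistency check of Eq. (14), the following sketch evaluates the flow profile and the rotational velocity. It assumes that the rotational velocity is half the vorticity, ω_D = (1/2) ∂v_y/∂x; the function names are hypothetical.

```python
HALF_WIDTH = 0.75  # walls at x = +/- 3/4

def poiseuille_vy(x, vc):
    """y component of the drift velocity, Eq. (14)."""
    return vc * (1.0 - (x / HALF_WIDTH) ** 2)

def rotational_velocity(x, vc):
    """Angular velocity omega_D = (1/2) d(vy)/dx = -(16/9) * vc * x."""
    return -vc * x / HALF_WIDTH ** 2

# Compare the closed form with a central finite difference of the profile.
h = 1e-6
finite_difference = 0.5 * (poiseuille_vy(0.3 + h, 1.0) - poiseuille_vy(0.3 - h, 1.0)) / (2.0 * h)
print(rotational_velocity(0.3, 1.0), finite_difference)
```

The profile vanishes at x = ±3/4 and reaches v_c in the center, so the start and target positions at x = ∓0.5 sit in a region of reduced flow.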

We first explore the numerical minimization of the travel time T0, which gives three types of trajectories that correspond to the three branches (solid lines) in Fig. 8, top. As for the potential barrier, these branches overlap when tuning the flow strength vc, indicating up to three local minima. For vc ≲ 1 the optimal trajectory is curved symmetrically about the x axis (green trajectory in Fig. 8, bottom left). Thus, close to the starting and target positions, where the flow velocity is smaller, the active particle can swim upstream, while it drifts downstream in the center. Then, for vc ≳ 1 the shape becomes asymmetric (green trajectory in Fig. 8, bottom middle), since the active particle needs to explore the slow flow close to the wall to be able to reach the target. The trajectory mirrored at the center also exists. Ultimately, at vc ≈ 1.3 an S-shaped trajectory is the optimum in travel time (green trajectory in Fig. 8, bottom right). The slow flow at both walls is used to swim upstream in order to compensate for the downstream drift in the center.

Fig. 8 Top: Optimized travel time versus strength vc of the Poiseuille flow. Solid lines: numerical minimization gives three trajectory types. Blue dots: from deterministic Q learning. Red dots: mean of 100 Q-learning runs under noise, and error bars indicate the standard deviation. Green triangles: TQ1 from applying a mean Q matrix under noise. Bottom: From left to right, examples of the three trajectory types are indicated: symmetric, asymmetric, and S-shaped. Green: numerical minimization, blue: deterministic Q learning, red: Q learning under noise

Reinforcement learning without orientational noise (blue dots in Fig. 8) reproduces the optimal travel times well, although at vc = 1.4 the trajectories differ (compare green and blue trajectories in Fig. 8, bottom right). To perform Q learning, we needed to increase the time step Δt from 0.0375 to 0.08 with increasing vc.

In a next step, we again include orientational noise and average over 100 Q-learning runs to determine a mean travel time. As for the uniform flow, we keep the noise per time step constant by choosing (DRL/v0)Δt = 0.0375 in Eq. (5). The red dots in Fig. 8, top show the mean travel time T1, and the error bars indicate the standard deviation. Up to vc = 0.8, T1 agrees well with the numerical minimization (T0) and deterministic Q learning (TRL). Also, the different trajectories vary around the symmetric path (top row in Fig. 9). At vc = 0.9, a mixture of symmetric and asymmetric trajectories (Fig. 9, middle row) results in a stronger deviation from T0 and TRL and a larger standard deviation. Then, at vc = 1.1 only the asymmetric trajectory type is realized, and at vc = 1.3 the S-shaped type starts to develop as well. However, since the trajectories become more and more irregular, T1 deviates more strongly from T0 and TRL, with large standard deviations.

Fig. 9 For several vc, all 100 learned trajectories resulting from Q learning under noise are shown

Fig. 10 Each plot shows 100 trajectories that result from applying different types of Q matrices under noise at vc = 1.2. Qn: mean Q matrix from 10 Q-learning runs under noise. Q1–Q4: examples of Q matrices learnt under noise. Qdet: deterministic Q matrix

Fig. 11 Swirling flow field. Left: Three examples of optimal paths for increasing swirling strength ω. Blue line: from deterministic Q learning; blue arrows indicate the particle orientation. Red line: trajectory learned in the presence of noise. Green: the particle misses the target and has to circle once around the center. The gray circles indicate the direction of the flow. Right: Travel time T times ω plotted versus ω. Green line: from numerical minimization. Blue dots: from deterministic Q learning. Red dots: mean of 100 Q-learning runs under noise, and error bars indicate the standard deviation

Finally, we address the question of how well the optimal policies encoded in learned Q matrices reproduce the learned trajectories when applied under noise. In Fig. 10, we compare the outcome for different types of learned Q matrices for flow strength vc = 1.2, each of which we apply 100 times. At vc = 1.2, the optimized travel time is T = 2.21. Interestingly, applying Qdet (the optimal deterministic Q matrix) produces trajectories that make long detours under noise. Thus, the mean travel time T deviates strongly from the ideal value and the standard deviation is large. Taking a mean of 10 Q matrices learned under noise, Qn, gives trajectories without large detours, and the mean travel time is well below the one for Qdet. Even the single Qn matrices, when applied under noise, are more successful than Qdet, as the examples for four Q matrices in Fig. 10 show. These results suggest that Q matrices learnt under noise, or their average, are less susceptible to noise when applied than the deterministic Q matrix. In Fig. 8, the green triangles indicate the mean travel time for Qn. Up to vc = 0.7, there is not much difference between the differently determined travel times. Interestingly, between vc = 0.8 and 1.0, the mean Q matrix provides a better travel time than the average over 100 Q-learning runs. However, for vc = 1.2 this is no longer true, and for vc = 1.3 and 1.4 some of the trajectories did not reach the target, so we do not provide a mean travel time there.
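How a mean Q matrix can encode a different policy than a single learned matrix may be illustrated with a minimal sketch. It assumes that the mean is the element-wise average of the individual tables and that the applied policy picks the action with the largest Q value in each state. The toy tables and the names `mean_q` and `greedy_policy` are hypothetical and not taken from the article.

```python
def mean_q(q_tables):
    """Element-wise mean of several Q tables of identical shape (states x actions)."""
    n = len(q_tables)
    n_states = len(q_tables[0])
    n_actions = len(q_tables[0][0])
    return [[sum(q[s][a] for q in q_tables) / n for a in range(n_actions)]
            for s in range(n_states)]

def greedy_policy(q_table):
    """Index of the action with the largest Q value in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q_table]

# Two toy tables (2 states, 3 actions) that disagree in state 0.
q1 = [[0.0, 1.0, 0.2], [0.5, 0.1, 0.0]]
q2 = [[0.0, 0.2, 0.9], [0.4, 0.3, 0.0]]
print(greedy_policy(q1), greedy_policy(q2), greedy_policy(mean_q([q1, q2])))
```

In this toy case the averaged table sides with the first run in the state where the two runs disagree; whether averaging helps in practice depends on the problem, as the results for vc = 1.2 and above indicate.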

3.5 Crossing swirling flow

As a last example, we investigate the case where the active particle needs to cross a swirling flow on its way to the target. We consider

$$\mathbf{v}_D = \frac{\omega}{r}\, \mathbf{e}_\phi, \tag{15}$$

which has zero vorticity (curl vD = 0), so that the particle orientation e is not rotated by the flow. Figure 11, left shows three optimal paths determined from deterministic Q learning for increasing flow strength ω. One observes that for smaller ω the active particle crosses closer to the center, because the flow is stronger there, which helps to minimize the travel time. The self-propulsion is needed to cross the circular streamlines in order to reach the target. For increasing ω and thus increased drift, the active particle has to stay closer to the streamline of the target to be able to reach it. For large ω, self-propulsion becomes more and more negligible compared to the drift, and the active particle moves on the half circle connecting start and target. This is approximately the case for ω = 4.2.
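The two properties of the swirling flow used above (speed growing toward the center and vanishing vorticity) can be checked numerically with a minimal sketch. It assumes that the swirl is centered at the origin, midway between start and target; the function names are hypothetical.

```python
import math

def swirl_velocity(x, y, omega):
    """Cartesian components of vD = (omega / r) e_phi for r > 0, Eq. (15)."""
    r2 = x * x + y * y
    return (-omega * y / r2, omega * x / r2)

def curl_z(x, y, omega, h=1e-5):
    """Central-difference estimate of d(vy)/dx - d(vx)/dy."""
    dvy_dx = (swirl_velocity(x + h, y, omega)[1]
              - swirl_velocity(x - h, y, omega)[1]) / (2.0 * h)
    dvx_dy = (swirl_velocity(x, y + h, omega)[0]
              - swirl_velocity(x, y - h, omega)[0]) / (2.0 * h)
    return dvy_dx - dvx_dy

# The speed is omega / r, so it doubles when the radius is halved.
for r in (0.5, 0.25):
    vx, vy = swirl_velocity(r, 0.0, 1.2)
    print(f"r={r}  speed={math.hypot(vx, vy):.2f}  curl={curl_z(0.3, 0.4, 1.2):.1e}")
```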

The green graph in Fig. 11, right, obtained from a numerical minimization of the travel time, confirms this view. For small ω the ideal path is nearly straight and T ≈ 1 is constant, so that Tω is linear in ω. For large ω, Tω should tend to π/4 ≈ 0.785, but the convergence is rather slow. The results from reinforcement learning without orientational noise (blue dots) agree rather well with the green curve. When just plotting T versus ω, they nicely fall on top of each other; small deviations become enlarged when plotting Tω. To arrive at the results for ω ≥ 1.5, we increased the number of action orientations from 8 to 16 and successively decreased the time step Δt to 0.001. The reason is that at larger flow strengths one has to fine-tune the position and orientation of the active particle to hit the target; otherwise, the particle needs to circle around the center to make another attempt.
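The large-ω limit quoted above follows from a short estimate. It assumes that the swirl is centered midway between start and target, so that the connecting half circle has radius R = 1/2, and that self-propulsion is negligible compared to the drift speed ω/R on that streamline:

```latex
T \approx \frac{\pi R}{\omega / R} = \frac{\pi R^{2}}{\omega}
\quad\Longrightarrow\quad
T\omega \to \pi R^{2} = \frac{\pi}{4} \approx 0.785
\qquad \left(R = \tfrac{1}{2}\right).
```

In the opposite limit of small ω, the nearly straight path of length 1 traversed at unit speed gives T ≈ 1 and hence Tω ≈ ω.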

Again, we introduce orientational noise and perform 100 Q-learning runs, while keeping the factor governing the noise at (DRL/v0)Δt = 0.0375. At ω = 1.8 and beyond, the mean travel times T1 (red dots in Fig. 11, right) agree well with deterministic Q learning, the standard deviations are small, and the trajectories fall on top of each other (Fig. 11, left). However, at ω = 0.6 and 1.2 large standard deviations occur. They are due to rare events in which the particle does not hit the target and therefore needs to circle around once to reach it. An example (green trajectory) is given for ω = 0.6 in Fig. 11, left.

4 Conclusion

In this article, we considered a smart active particle

that can sense the distance and direction to a target.

We used Q learning to demonstrate how the particle learns by itself to navigate on the fastest path in different potential landscapes and flow fields. In parallel, we also solved the optimization problem using variational calculus to show how well Q learning works. Our idea is that sensing distance and direction as state variables is easier to realize with a smart active particle than sensing the position.

First, we considered a potential barrier as in our previous work [33], but now the learned paths are closer to the optimal path since the active particle moves continuously in space instead of only occupying grid points of a square lattice. Furthermore, as action variables we employ eight orientations instead of only four. We also considered a potential well that is deep enough for the active particle to become trapped in it once the trapping force exceeds a critical value. However, the particle indeed learns to avoid the trap and moves around it. This is an important feature when studying the optimal path in an arbitrary landscape.

Second, we demonstrated how the active particle crosses a uniform flow. The learned travel times agree well with the analytic result of the optimization. Third, crossing a Poiseuille flow is more challenging to evaluate. By numerical minimization of the travel time, we identified three types of trajectories: symmetric, asymmetric, and S-shaped. The second and third types occur at higher flow strengths and use the small flow velocity at the channel walls to move upstream in order to be able to cross the flow. Fourth, we also looked at a swirling flow. For small flow strengths, the active particle uses the larger flow velocities close to the center to arrive fastest at the target, while for larger flow strengths it has to stay on the circular flow line so that it does not miss the target. If it does miss the target, the particle needs to circle around the center, which increases the travel time.

Finally, for all the reported drift-velocity fields we evaluated the effect of orientational thermal noise during Q learning and when applying the optimal policy. Generally, noise does not prevent the active particle from learning optimal travel paths and navigating to the target. The optimal path is now noisy, which increases the travel time, and, of course, finding optimal paths depends on the strength of the noise. Further general statements on the impact of noise are difficult; it rather depends on the specific problem and the trajectory to be learned. For example, for the potential well the learned trajectories run further away from the center with increasing noise, so that the particle does not get trapped in the well. We add two further findings. First, when performing Q learning under noise for all drift-velocity fields, we identified the strategy to average over several Q-learning runs. This works well if the optimized trajectories are well accessible by numerical minimization of the travel time or by deterministic Q learning. Second, when applying the learned optimal policy under noise, we made the interesting observation that a mean Q matrix works better than the deterministic Q matrix. The reason might be that the Q matrix averaged over several noisy learning runs has developed a better strategy to respond to random changes in the orientation compared to the deterministic Q matrix. Thus, our study also adds to the recent efforts to explore the stability of learned optimal strategies/policies [35,51].

As a next step, we plan to study optimal navigation of active particles in more complex potential and flow landscapes using deep Q learning, which employs neural networks. This will also enable us to train the active particle to move optimally in a set of complex landscapes and then let it move in an unknown landscape.

Acknowledgements We thank the Berlin University Alliance for funding under grant number 824 BUA-NUS 4.

Funding Information Open Access funding enabled and

organized by Projekt DEAL.

Author contribution statement

All the authors were involved in the preparation of the

manuscript. All the authors have read and approved the

ﬁnal manuscript.

Data availability The datasets generated during and/or

analyzed during the current study are available from the

corresponding author on reasonable request.

Open Access This article is licensed under a Creative Com-

mons Attribution 4.0 International License, which permits

use, sharing, adaptation, distribution and reproduction in

any medium or format, as long as you give appropriate credit

to the original author(s) and the source, provide a link to

the Creative Commons licence, and indicate if changes were

made. The images or other third party material in this arti-

cle are included in the article’s Creative Commons licence,

unless indicated otherwise in a credit line to the material. If

material is not included in the article’s Creative Commons

licence and your intended use is not permitted by statu-

tory regulation or exceeds the permitted use, you will need

to obtain permission directly from the copyright holder.

To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

A Appendix

A.1 Numerical minimization of the travel time.

The analytical solution of the Euler-Lagrange equations belonging to the variation of Eq. (2) can only be found for the uniform flow field used in Sect. 3.3. It is presented in Eq. (13). For the remaining force and flow fields, minimizing the functional of Eq. (2) requires a numerical approach. This is accomplished by discretizing Eq. (2) and employing numerical methods to minimize it.

The discretization approximates the trajectory by N = 200 points, r(s) → ri(s), where i is the index of the i-th point. Additionally, the line element dsi and the tangent vector ti are approximated by

$$\mathrm{d}s_i = \sqrt{(\mathbf{r}_{i+1} - \mathbf{r}_i)^{2}} \quad \text{and} \quad \mathbf{t}_i = \frac{\mathbf{r}_{i+1} - \mathbf{r}_i}{\mathrm{d}s_i}. \tag{16}$$

Furthermore, using the discretized drift-velocity field vDi = vD(ri) and the magnitude of the total velocity vi = v(ri+1, ri) from Eq. (3), the functional for the travel time T is approximated as

$$\int^{\mathbf{r}_f} \mathrm{d}t \;\longrightarrow\; \sum_{i=1}^{N-1} \frac{\mathrm{d}s_i}{v_i}.$$

This expression can be minimized by using the optimization package from Julia, Optim.jl, where different solver algorithms can be used. Due to difficulties in calculating the Hessian for the Poiseuille flow, we successfully employed a gradient-based algorithm such as Conjugate Gradient, where the parameters of the algorithm needed to be tuned. Nevertheless, to optimize the travel time for the Poiseuille flow, it was necessary to use the results from reinforcement learning as initial trajectories. This approach provides a better starting point for the optimization process and improves the convergence of the solver algorithm.
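The discretized functional can be sketched in a few lines. The sketch is written in Python rather than Julia and only evaluates the travel time of a given polyline; it does not reproduce the minimization itself. Since Eq. (3) is not reproduced in this appendix, the speed along the tangent is taken from the assumption that a unit swimming velocity plus the drift must point along the path (a Zermelo-type relation). All function names are hypothetical.

```python
import math

def speed_along(tangent, drift):
    """Speed v along the unit tangent t such that |v * t - vD| = 1.

    Assumption: stands in for Eq. (3). Returns 0.0 if no real solution exists.
    """
    proj = tangent[0] * drift[0] + tangent[1] * drift[1]
    disc = 1.0 - (drift[0] ** 2 + drift[1] ** 2) + proj ** 2
    if disc < 0.0:
        return 0.0
    return proj + math.sqrt(disc)

def travel_time(points, drift_field):
    """Sum of ds_i / v_i over all segments of the polyline through `points`."""
    total = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        ds = math.hypot(x1 - x0, y1 - y0)        # line element
        t = ((x1 - x0) / ds, (y1 - y0) / ds)     # unit tangent
        v = speed_along(t, drift_field(x0, y0))  # drift evaluated at r_i
        if v <= 0.0:
            return math.inf                      # segment cannot be traversed
        total += ds / v
    return total

def uniform(x, y):
    """Uniform flow of strength k = 0.6 along y."""
    return (0.0, 0.6)

n = 200
straight = [(-0.5 + i / (n - 1), 0.0) for i in range(n)]
bowed = [(x, 0.2 * math.sin(math.pi * (x + 0.5))) for x, _ in straight]
print(travel_time(straight, uniform), travel_time(bowed, uniform))
```

For the straight path the sum reproduces the analytic value 1/√(1 − k²) = 1.25 of Eq. (13), and the bowed path takes longer. A minimizer would vary the interior points of `points` to lower the returned value.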

In general, the x coordinate in the optimization problem is treated as a variable to be determined. However, for simpler cases like the Mexican hat potential, it is sufficient to choose equally spaced x coordinates and to keep them fixed during the minimization. This reduces the complexity of the optimization problem and makes the solver more efficient in finding the optimal solution.

A.2 Learning rate and discount factor

We briefly discuss the procedure for determining the optimal values for the learning rate α and the discount factor γ. In Fig. 12, we demonstrate how the choice of parameters influences the travel time T. We created a heatmap, where γ is plotted versus α, while the variation in color represents the logarithm of the travel time T. The study was performed for the potential barrier with U0 = 0.4. The inset in Fig. 2 shows the learned trajectory. We do not consider values γ < 0.1, since otherwise the travel time diverges. We can clearly see that the interval γ ∈ [0.4, 0.9] is the most suitable. In this interval, we can identify several minima for the travel time, marked as red boxes in Fig. 12. Since we deal with fully deterministic environments, a high learning rate can be chosen to achieve a fast convergence of the Q function [41]. Thus, we choose our parameters as γ = 0.7 and α = 0.9.
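To show where α and γ enter, here is a minimal sketch of the standard tabular Q-learning update as described in Ref. [41]. Only the values α = 0.9 and γ = 0.7 come from the text; the reward, the toy table, and the function name `q_update` are hypothetical, since the reward scheme is not reproduced in this appendix.

```python
ALPHA = 0.9  # learning rate chosen in the text
GAMMA = 0.7  # discount factor chosen in the text

def q_update(q, state, action, reward, next_state, alpha=ALPHA, gamma=GAMMA):
    """Apply one tabular Q-learning update in place and return the new entry.

    Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
    """
    best_next = max(q[next_state])
    q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
    return q[state][action]

# Toy table with 2 states and 2 actions; the reward of -1 is a placeholder.
q = [[0.0, 0.0], [1.0, 2.0]]
print(q_update(q, state=0, action=1, reward=-1.0, next_state=1))
```

With α close to 1 the old entry is almost completely overwritten by the new estimate, which is unproblematic in a deterministic environment but amplifies fluctuations once noise is present.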

A.3 State space resolution

In our reinforcement learning study, the state space is defined by the angle ϕ and distance ρ to the target, with both variables discretized into 43 values each. Although the resolution of this approach might seem equivalent to a 43×43 square grid, there are some factors that add complexity to the problem.

First, the linear resolution along the angular direction is not uniform across the state space, as it decreases for larger ρ values. This leads to unequal tile sizes, with linear dimensions as large as 0.221 in the angular direction at the edge of the state space compared to 0.033 along the radial direction. As a result, the active particle needs to perform multiple actions, depending on the time step Δt, before transitioning to another state. Conversely, around the target, the tiles are very small in the angular direction, causing a large number of states to be concentrated in that area. This non-uniform distribution of states poses challenges for the reinforcement learning process.

Fig. 12 Heatmap for travel time T as a function of learning rate α and discount factor γ for the active particle moving around the potential barrier with U0 = 0.4. =0.4 is kept constant. The red boxes mark three minima for the travel time T

Fig. 13 Travel time T plotted against the number of states Nρ = Nϕ for the curved path around the potential barrier with U0 = 0.4

Second, due to the definition of the state space in relation to the square shape of the system, ca. 50 % of the potential states are never visited by the active particle. Only 957 of the 1849 possible states are actually encountered during the learning process.
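The polar state space can be illustrated with a minimal sketch that maps a distance and an angle to one of the 43 × 43 = 1849 states. Uniform binning in both variables and the largest distance `RHO_MAX` are assumptions made for illustration only, so the tile sizes printed here need not match the values quoted above; the function names are hypothetical.

```python
import math

N_RHO = 43     # number of distance bins, from the text
N_PHI = 43     # number of angle bins, from the text
RHO_MAX = 1.5  # hypothetical largest distance to the target; not given in the text

def state_index(rho, phi):
    """Map distance rho and angle phi to a single index in [0, N_RHO * N_PHI)."""
    i_rho = min(int(rho / RHO_MAX * N_RHO), N_RHO - 1)
    i_phi = int((phi % (2.0 * math.pi)) / (2.0 * math.pi) * N_PHI) % N_PHI
    return i_rho * N_PHI + i_phi

def angular_tile_width(rho):
    """Arc length of one angular bin at distance rho; it grows linearly with rho."""
    return rho * 2.0 * math.pi / N_PHI

print(N_RHO * N_PHI, state_index(0.4, 1.0), angular_tile_width(RHO_MAX))
```

Because the arc length of an angular bin scales with ρ, tiles near the target are much narrower than tiles at the edge, which is the non-uniformity described above.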

To investigate the impact of the state space resolution, we examined how the number of states Nρ,ϕ along the ρ and ϕ directions influences the travel time T. We choose Nρ = Nϕ and plot in Fig. 13 the travel time T versus Nρ,ϕ for the special case of the potential barrier at U0 = 0.4, where the curved path is realized. We observe that the travel time decreases for increasing Nρ,ϕ. It is almost saturated at Nρ,ϕ = 43, which we use in our investigation. Consequently, a further increase in Nρ,ϕ is not necessary to guarantee the applicability of Q learning.

References

1. M.C. Marchetti, J.F. Joanny, S. Ramaswamy, T.B. Liv-

erpool, J. Prost, M. Rao, R.A. Simha, Rev. Mod. Phys.

85, 1143 (2013)

2. A. Zöttl, H. Stark, J. Phys.: Condens. Matter 28, 253001 (2016)

3. S. Ramaswamy, J. Stat. Mech. 2017, 054002 (2017)

4. T. Vicsek, A. Zafeiris, Phys. Rep. 517, 71 (2012)

5. J. Elgeti, R.G. Winkler, G. Gompper, Rep. Prog. Phys.

78, 056601 (2015)

6. C. Bechinger, R. Di Leonardo, H. Löwen, C. Reichhardt, G. Volpe, G. Volpe, Rev. Mod. Phys. 88, 045006 (2016)

7. H. Chaté, Annu. Rev. Condens. Matter Phys. 11, 189 (2020)

8. G. Gompper, R.G. Winkler, T. Speck, A. Solon, C. Nardini, F. Peruani, H. Löwen, R. Golestanian, U.B. Kaupp, L. Alvarez et al., J. Phys.: Condens. Matter 32, 193001 (2020)

9. H.C. Berg, E. coli in Motion, Biological and Medi-

cal Physics, Biomedical Engineering (Springer, Berlin,

2008)

10. A. Berdahl, C.J. Torney, C.C. Ioannou, J.J. Faria, I.D.

Couzin, Science 339, 574 (2013)

11. A. Cavagna, I. Giardina, Annu. Rev. Condens. Matter

Phys. 5, 183 (2014)

12. M. Akter, J.J. Keya, K. Kayano, A.M.R. Kabir, D.

Inoue, H. Hess, K. Sada, A. Kuzuya, H. Asanuma, A.

Kakugo, Sci Robot. 7, eabm0677 (2022)

13. S. Das, E.B. Steager, K.J. Stebe, V. Kumar, Simultane-

ous control of spherical microrobots using catalytic and

magnetic actuation, in 2017 International Conference on

Manipulation, Automation and Robotics at Small Scales

(MARSS) pp. 1–6. (2017)

14. S. Muiños-Landin, A. Fischer, V. Holubec, F. Cichos, Sci Robot. 6, eabd9285 (2021)

15. M. Hennes, K. Wolff, H. Stark, Phys. Rev. Lett. 112, 238104 (2014)

16. T.J. Pedley, J.O. Kessler, Annu. Rev. Fluid Mech. 24,

313 (1992)

17. K. Drescher, K.C. Leptos, I. Tuval, T. Ishikawa, T.J.

Pedley, R.E. Goldstein, Phys. Rev. Lett. 102, 168101

(2009)

18. W.M. Durham, J.O. Kessler, R. Stocker, Science 323,

1067 (2009)

19. J. Palacci, C. Cottin-Bizonne, C. Ybert, L. Bocquet,

Phys. Rev. Lett. 105, 088304 (2010)

20. M. Enculescu, H. Stark, Phys. Rev. Lett. 107, 058301

(2011)

21. F. Rühle, H. Stark, Eur. Phys. J. E 43, 26 (2020)

22. F. Rühle, A.W. Zantop, H. Stark, Eur. Phys. J. E 45, 26 (2022)

23. N. Waisbord, C.T. Lefèvre, L. Bocquet, C. Ybert, C. Cottin-Bizonne, Phys. Rev. Fluids 1, 053203 (2016)

24. F. Meng, D. Matsunaga, R. Golestanian, Phys. Rev.

Lett. 120, 188101 (2018)

25. A. Sokolov, I.S. Aranson, Phys. Rev. Lett. 103, 148101

(2009)

26. S. Rafaï, L. Jibuti, P. Peyla, Phys. Rev. Lett. 104, 098102 (2010)

27. A. Zöttl, H. Stark, Phys. Rev. Lett. 108, 218104 (2012)

28. A. Zöttl, H. Stark, Phys. Rev. Lett. 112, 118101 (2014)

29. A. Choudhary, S. Paul, F. Rühle, H. Stark, Commun. Phys. 5, 14 (2022)

30. G. Volpe, G. Volpe, PNAS 114, 11350 (2017)

31. L.G. Nava, R. Großmann, F. Peruani, Phys. Rev. E 97,

042604 (2018)

32. B. Liebchen, H. Löwen, Europhys. Lett. 127, 34003 (2019)

33. E. Schneider, H. Stark, Europhys. Lett. 127, 64003 (2019)

34. L. Piro, E. Tang, R. Golestanian, Phys. Rev. Res. 3, 023125 (2021)

35. L. Piro, R. Golestanian, B. Mahault, Front. Phys. (Lau-

sanne) 10, 1034267 (2022)

36. L. Piro, B. Mahault, R. Golestanian, New J. Phys. 24,

093037 (2022)

37. S. Goh, R.G. Winkler, G. Gompper, New J. Phys. 24,

093039 (2022)

38. D. Bray, Cell Movements: From Molecules to Motility.

Garland Science (2000)

39. A.M. Hein, F. Carrara, D.R. Brumley, R. Stocker, S.A. Levin, PNAS 113, 9413 (2016)

40. D. Weihs, P. Webb, J. Theor. Biol. 106, 189 (1984)

41. R.S. Sutton, A.G. Barto, Reinforcement learning: An

introduction. MIT Press (2018)

42. Y. Yang, M.A. Bevan, B. Li, Adv. Intell. Syst. 2,

1900106 (2020)

43. Y. Yang, M.A. Bevan, B. Li, Adv. Theory Simulat. 3,

2000034 (2020)

44. M. Durve, F. Peruani, A. Celani, Phys. Rev. E 102,

012601 (2020)

45. H. Stark, Sci Robot. 6, eabh1977 (2021)

46. M.J. Falk, V. Alizadehyazdi, H. Jaeger, A. Murugan,

Phys. Rev. Res. 3, 033291 (2021)

47. M. Gerhard, A. Jayaram, A. Fischer, T. Speck, Phys. Rev. E 104, 054614 (2021)

48. M. Nasiri, B. Liebchen, New J. Phys. 24, 073042 (2022)

49. P.A. Monderkamp, F.J. Schwarzendahl, M.A. Klatt, H. Löwen, Mach. Learn.: Sci. Technol. 3, 045024 (2022)

50. S. Colabrese, K. Gustavsson, A. Celani, L. Biferale,

Phys. Rev. Lett. 118, 158004 (2017)

51. L. Biferale, F. Bonaccorso, M. Buzzicotti, P. Clark Di

Leoni, K. Gustavsson, Chaos: Interdisc. J. Nonlinear Sci.

29, 103138 (2019)

52. J.K. Alageshan, A.K. Verma, J. Bec, R. Pandit, Phys.

Rev. E 101, 043110 (2020)

53. M. Buzzicotti, L. Biferale, F. Bonaccorso, P.C. di Leoni,

K. Gustavsson, Optimal Control of Point-to-Point Nav-

igation in Turbulent Time-dependent Flows Using Rein-

forcement learning, in AIxIA 2020—Advances in Artifi-

cial Intelligence (Springer, Cham, 2021), pp.223–234

54. C. Calascibetta, L. Biferale, F. Borra, A. Celani, M. Cencini, arXiv:2212.09612v1 [physics.flu-dyn] (2022)

55. E. Zermelo, ZAMM 11, 114 (1931)

56. T. Jaakkola, M.I. Jordan, S.P. Singh, Neural Comput.

6, 1185 (1994)

57. A. Celani, E. Villermaux, M. Vergassola, Phys. Rev. X

4, 041015 (2014)

58. M. Durve, L. Piro, M. Cencini, L. Biferale, A. Celani,

Phys. Rev. E 102, 012402 (2020)

59. G. Reddy, V.N. Murthy, M. Vergassola, Annu. Rev. Con-

dens. Matter Phys. 13, 191 (2022)
