
Citation: Malem-Shinitski, N.; Ojeda,
C.; Opper, M. Variational Bayesian
Inference for Nonlinear Hawkes
Process with Gaussian Process
Self-Effects. Entropy 2022,24, 356.
https://doi.org/10.3390/e24030356
Academic Editors: Udo Von Toussaint
and Philip Broadbridge
Received: 27 December 2021
Accepted: 22 February 2022
Published: 28 February 2022
Publisher’s Note: MDPI stays neutral
with regard to jurisdictional claims in
published maps and institutional affil-
iations.
Copyright: © 2022 by the authors.
Licensee MDPI, Basel, Switzerland.
This article is an open access article
distributed under the terms and
conditions of the Creative Commons
Attribution (CC BY) license (https://
creativecommons.org/licenses/by/
4.0/).
entropy
Article
Variational Bayesian Inference for Nonlinear Hawkes Process
with Gaussian Process Self-Effects
Noa Malem-Shinitski 1,* , César Ojeda 2and Manfred Opper 2,3
1Institute of Mathematics, University of Potsdam, 14476 Potsdam, Germany
2Artificial Intelligence Group, Technische Universität Berlin, 10623 Berlin, Germany;
3Centre for Systems Modelling and Quantitative Biomedicine, University of Birmingham,
Birmingham B15 2TT, UK
*Correspondence: [email protected]
Abstract:
Traditionally, Hawkes processes are used to model time-continuous point processes with
history dependence. Here, we propose an extended model where the self-effects are of both excitatory
and inhibitory types and follow a Gaussian Process. Whereas previous work either relies on a less
flexible parameterization of the model, or requires a large amount of data, our formulation allows for
both a flexible model and learning when data are scarce. We continue the line of work of Bayesian
inference for Hawkes processes, and derive an inference algorithm by performing inference on
an aggregated sum of Gaussian Processes. Approximate Bayesian inference is achieved via data
augmentation, and we describe a mean-field variational inference approach to learn the model
parameters. To demonstrate the flexibility of the model we apply our methodology on data from
different domains and compare it to previously reported results.
Keywords: Bayesian inference; point process; Gaussian process
1. Introduction
Sequences of self-exciting, or inhibiting, temporal events are frequent footmarks of
natural phenomena: earthquakes are known to be temporally clustered as aftershocks are
commonly triggered following the occurrence of a main event [
1
]; in social networks, the
propagation of news can be modeled in terms of information cascades over the edges of a
graph [
2
]; and in neuronal activity, the occurrence of one spike may increase or decrease
the probability of the occurrence of the next spike over some time period [3].
Traditionally, sequences of events in continuous time are modeled by point processes,
of which Cox processes [
4
], or doubly stochastic processes, use a stochastic process for the
intensity function, which depends only on time and is not effected by the occurrences of the
events. The Hawkes process [
5
] extends the Cox process to capture phenomena in which
the past events affect future arrivals, by introducing a memory dependence via a memory
kernel, which is also referred to as the causal influence function. When incorporating the
dependence of the process on its own history, due to the superposition theorem of the point
process, new events will depend on either an exogenous rate, which is independent of the
history, or an endogenous rate from past arrivals.
Originally, the dependence on history in the Hawkes process is assumed to be self-
excitatory, and the memory kernel is parameterized by an exponential or power law decay,
which results in a model with low flexibility. Furthermore, assuming only an excitatory
relation between the events does not hold for other phenomena we wish to model. For
example, inhibitory effects between neurons [
6
], and even self-inhibition [
7
], are crucial for
regulating the neuronal activity. Thus, the memory kernel should also include inhibitory
relations between the events and by doing so the intensity may become negative. To
ensure that the intensity function is non-negative, a nonlinear link function is applied on
Entropy 2022,24, 356. https://doi.org/10.3390/e24030356 https://www.mdpi.com/journal/entropy

Entropy 2022,24, 356 2 of 22
the memory kernel, and the resulting model is often referred to as a nonlinear Hawkes
process [
8
–
10
]. Theoretical results for nonlinear Hawkes processes have been developed
for many years and they include stability estimates [
8
] as well as convergence rates for
Bayesian estimators [11].
In this work, we present a Nonlinear Hawkes Process with Gaussian Process Self-
Effects (NH–GPS), which extends the class of nonlinear Hawkes processes. As the causal
influence between events can be either excitatory or inhibitory, we use the term self-effects
to refer to the influence of past events on future events.
We choose a semi-parametric approach, which avoids the limiting parameterization of
the memory kernel and the background rate. We assume a Gaussian process (GP) prior
to the exogenous events’ intensity and on the memory kernel, which allows also for an
inhibitory effect between the events. To ensure that the intensity function is non-negative
we use the sigmoid link function. This modeling approach is not only descriptive but
also allows us to obtain a fast inference procedure. The history of self-effects defines an
aggregated Gaussian process, and we perform the inference directly on this aggregation
rather than obtaining a posterior over each self-effect.
While highly flexible approaches to modeling the intensity function of nonlinear
Hawkes processes have been presented before, they mainly rely on deep neural network
solutions [
12
,
13
]. These approaches are hindered by the necessity of large datasets. In
contrast, our methodology retains modeling flexibility due to the nonparametric nature of
Gaussian processes, while being able to perform well when data are scarce.
Outline
In Section 2, we describe the NH–GPS model and emphasize how its structure allows
for efficient Bayesian inference. In Section 3, we describe the augmentation scheme and
derive the mean-field variational inference algorithm. In Section 4, we discuss related work
and describe how it relates to ours. In Section 5, we present the results of our experiments
both on synthetic data and different real-world examples, and compare it to existing work
when possible. In Section 6, we conclude by discussing our work and future research
directions.
2. Proposed Model
2.1. Classical Hawkes Process
Let
Tt= [
0,
t]∈R
. We define the counting measure
N(Tt)
as the number of arrivals in
the interval
Tt
. Furthermore, we define the history
Ht
or the realizations of a given process,
as the set of arrivals in the interval
Tt
, namely
Ht={T1
,
. . .
,
TN(Tt):Ti∈ Tt∧Ti−1<Ti}
,
and
Ti
corresponds to the time of arrival
i
. For a temporal point process, the counting
measure N(·)has an associated intensity defined as
Λ(t) = lim
∆t→0
E[N(Tt+∆t)−N(Tt)|Ht]
∆t.
The intensity function may depend on the history of the process. An example of
such a process is the Hawkes process, or self-exciting point process [
14
], which defines
self-excitations [
15
] around exogenous events. Following Hawkes and Oakes
[5]
, the intensity
of the Hawkes process is defined by
Λ(t|Ht) = s(t)+∑
tn<t
g(t−tn), (1)
where
s(t)
is the base intensity of exogenous arrivals and
g(t−tn)
is the memory kernel, or
causal influence function, defining the change in the excitation or inhibition value following
each arrival. In the classical Hawkes process, causal influence of only excitatory nature
is allowed and the memory kernel is usually of the form
g(t−tn) = βe−α(t−tn)
for an
exponentially decaying memory.

Entropy 2022,24, 356 3 of 22
2.2. Nonlinear Hawkes Process with Gaussian Process Self-Effects
In the classical Hawkes process, the memory kernel
g
in Equation
(1)
must be non-
negative, to prevent the intensity function from being negative. As a result the history
of the model has only excitatory effect on future events. We are interested in a model
that includes inhibition between events, and we release the constraint over
g
so it can be
negative, and define the following nonlinear intensity function
Λ(t)=λσ(φ(t))(2)
σ(φ(t))=1
1+exp (−φ(t)) (3)
φ(t) = s(t)+∑
tn<t
g(t−tn)exp(−α(t−tn)). (4)
Here, we choose the sigmoid function to ensure that the intensity function
Λ(·)
is
non-negative.
λ
is the intensity bound and we refer to
φ(·)
as the linear intensity function.
We explicitly add the exponential decay to enforce the forgetting constraint, which
is essential for most realistic processes. Although we choose here a specific parameteri-
zation of the memory decay, one can choose other forms of memory decay with minimal
adaptation to the learning procedure of the model parameters.
Besides
α
and
λ
, the functions
s(·)
and
g(·)
are the unknown parameters of the model
to be inferred from the data. In this paper, we will use a nonparametric Bayesian inference
approach based on the definition of a prior probability measure over these functions.
In contrast to alternative Bayesian models for Hawkes processes where the positive rate
function is directly modeled by a Dirichlet process, in our case we have to deal with random
functions which are not constrained to be positive. This suggests that we should use a
simple, but still highly flexible approach by modeling the two functions independently as
realizations of Gaussian random processes. We write symbolically:
s∼GP(0, Ks)(5)
g∼GP(0, Kg)(6)
Ks/g(t1,t2)=as/g·exp −kt1−t2k2
σ2
s/g!. (7)
The corresponding Gaussian prior measures are uniquely defined by the mean func-
tions (which we set to be equal to zero) and the second moments given by the covariance
kernel functions Kg/s. The latter are defined by the prior expectations
Ks(t1,t2).
=E[s(t1)s(t2)] (8)
Kg(t1,t2).
=E[g(t1)g(t2)] (9)
By a proper choice of kernels, we can encode further prior beliefs on typical realizations
of
s(·)
and
g(·)
. Throughout the paper, we will work with the so-called RBF kernels from
Equation
(7)
. This kernel corresponds to the prior assumptions that the Gaussian processes
are stationary (the kernels depend on time differences only) and that the functions
s(·)
and
g(·)
are infinitely often differentiable. The kernels depend on two hyperparameters
a
and
σ
, which reflect the typical amplitude and length scale of the functions. The reasonable
values of these hyperparameters will also be inferred from the data.
Finally, we assume a prior distribution also on the upper intensity bound
λ∼Gamma(α0,β0).
and we identify the hyperparameters of the model as {σg,ag,α,σs,as}.
In this work, we propose Bayesian inference for fitting the model to data. Due to the
nonlinearity over
φ(·)
, we are no longer able to easily utilize the branching structure of the

Entropy 2022,24, 356 4 of 22
Hawkes process, which allowed for the estimation of
s(·)
and
g(·)
[
16
,
17
]. Thus, a natural
solution is to perform the inference directly on φ(·).
Next, we identify the prior over the entire linear intensity
p(φ)
. From Equation
(4)
we
see that the linear intensity function
φ
is nothing but the sum of GPs, and as such it is also
a GP:
φ∼GP0, ˜
K(10)
˜
Klk =Ks
lk +∑
ti<tl
∑
tj<tk
Kg
tl−ti,tk−tjexp−αtl−ti+tk−tj. (11)
Multivariate Model
We propose an extension of the model to multiple dimensions. This is useful in
applications where different types of events are observed, or the events originate from
different processes that affect each other. We define an
R
-dimensional point process with
intensity in dimension r
Λr(t)=λrσ(φr(t))
φr(t)=sr(t)+
R
∑
m=1
∑
tm
n<t
gr,m(t−tm
n)exp(−αr,m(t−tm
n))
where
tm
n
is the time of event number
n
of type
m
. We assume that every dimension has its
own intensity bound
λk
and background rate
sr(·)
. The different dimensions interact with
each other via the self-effects term.
gr,m(·)
defined the effects of the events of type
m
on the
events of type r. As in the univariate case, this effect may change over time.
Given the observations, the different dimensions are independent of each other and we
can learn their parameters separately. Thus, in the following section we present the inference
for the univariate model, and the extension to the multivariate case is straight forward.
3. Inference
Conditioned on the intensity function
Λ(·)
, the likelihood of observations
{t1
,
. . . tN}
from a Hawkes process is [18]
`({t1, . . . tn}|Λ(·)) =exp−ZT
0
Λt0dt0N
∏
n=1
Λ(tn).
In order to obtain posterior distributions for the latent variables either in the form of
a Gibbs sampler or through approximate posteriors in variational inference, we require
certain computations that are not tractable or efficient under the current form of the like-
lihood. Similarly to previous work on Cox and Hawkes processes [
17
,
19
,
20
], we follow
an augmentation procedure. We do so via the introduction of auxiliary variables, which
expand the model to a different likelihood form. Under the marginalization of the afore-
mentioned auxiliary variables, the new likelihood will conserve the form of the original
model likelihood. The new form of the likelihood is constructed such that the computations
required for the inference procedure are either tractable, computationally fast, or both.
3.1. Model Augmentation
The first step we take in treating the likelihood function is using the Pólya–Gamma
(PG) augmentation scheme. Following Theorem 1 in Polson et al.
[21]
, we can rewrite the
nonlinear intensity function as
σ(φ(t)) =Z∞
0ef(w,φ(t))PG(w; 1, 0)dw (12)
f(w,φ(t)) =−φ(t)2w
2+φ(t)
2−ln 2. (13)

Entropy 2022,24, 356 5 of 22
As we augment each observation with a variable
wn
from a PG distribution, the joint
likelihood of the observed events {tn}and PG variables {wn}is
p{tn}N
n=1,{wn}N
n=1|φ,λ=(14)
exp−ZT
0λσ(φ(t))dt·
N
∏
n=1
λef(wn,tn)PG(wn; 1, 0)
with
exp−ZT
0λσ(φ(t))dt=(15)
exp−ZT
0Z∞
0λPG(w; 1, 0)1−ef(w,−φ(t))dwdt.
where we used σ(t) = 1−σ(−t).
Next, we utilize the Campbell’s theorem [
14
], which states that for a Poisson process
Πwith intensity ϕ
Eϕ ∏
x∈Π
exp(h(x))!=
exp−Z(1−exp(h(x)))ϕ(x)dx.
Looking at Equation
(15)
, we identify
x=(t,w)
and
ϕ(t,w)=λPG(w|1, 0)
is the
intensity of a marked Poisson process in
T
with marks
w∼PG(0, 1)
. Furthermore, we
determine h(x)=f(w,−φ(t)). We can now rewrite the exponential in Equation (14) as
exp−ZT
0λσ(φ(t))dt=Eϕ M
∏
m=1
ef(ˆ
wm,ˆ
tm)!(16)
for realizations {ˆ
tm,ˆ
wm}M
m=1.
We substitute Equation
(16)
into Equation
(14)
, which results in the full augmented
likelihood. Given the prior distributions over
φ
and
λ
, we can now write the model’s
posterior distribution as
p{ˆ
tm,ˆ
wm},{wn},φ,λ|{tn}∝exp(−λT)×(17)
M
∏
m=1
λef(ˆ
wm,−φ(ˆ
tm))PG(wm; 1, 0)×
N
∏
n=1
λef(wn,φ(tn))PG(wn; 1, 0)×p(φ)p(λ).
To summarize, we augment the model with two sets of variables—the PG variables
{wn}
, which augment the actual realizations, and the tuples
{ˆ
tm
,
ˆ
wm}
, which are the
realizations and marks of the auxiliary marked Poisson process.
As mentioned above, we intend to learn directly the linear intensity function
φ(·)
. This
allows us to utilize the mean-field variational inference previously introduced in Donner
and Opper
[19]
and Donner and Opper
[22]
. Next, we go through the steps of the algorithm,
and we refer the reader to the two papers mentioned above for further details. As a baseline
we compare the performance of the variational inference algorithm to a Gibbs sampler. The
details of the Gibbs sampler can be found in Appendix Aand Algorithm 2.
Loading more pages...