Variational Bayesian Inference for Nonlinear Hawkes Process with Gaussian Process Self-Effects [original]





Citation: Malem-Shinitski, N.; Ojeda,

C.; Opper, M. Variational Bayesian

Inference for Nonlinear Hawkes

Process with Gaussian Process

Self-Effects. Entropy 2022,24, 356.

https://doi.org/10.3390/e24030356

Academic Editors: Udo Von Toussaint

and Philip Broadbridge

Received: 27 December 2021

Accepted: 22 February 2022

Published: 28 February 2022

Publisher’s Note: MDPI stays neutral

with regard to jurisdictional claims in

published maps and institutional affil-

iations.

Licensee MDPI, Basel, Switzerland.

This article is an open access article

distributed under the terms and

conditions of the Creative Commons

Attribution (CC BY) license (https://

creativecommons.org/licenses/by/

4.0/).

entropy

Article

Variational Bayesian Inference for Nonlinear Hawkes Process

with Gaussian Process Self-Effects

Noa Malem-Shinitski 1,* , César Ojeda 2and Manfred Opper 2,3

1Institute of Mathematics, University of Potsdam, 14476 Potsdam, Germany

2Artificial Intelligence Group, Technische Universität Berlin, 10623 Berlin, Germany;

[email protected] (C.O.); [email protected] (M.O.)

3Centre for Systems Modelling and Quantitative Biomedicine, University of Birmingham,

Birmingham B15 2TT, UK

*Correspondence: [email protected]

Abstract:

Traditionally, Hawkes processes are used to model time-continuous point processes with

history dependence. Here, we propose an extended model where the self-effects are of both excitatory

and inhibitory types and follow a Gaussian Process. Whereas previous work either relies on a less

flexible parameterization of the model, or requires a large amount of data, our formulation allows for

both a flexible model and learning when data are scarce. We continue the line of work of Bayesian

inference for Hawkes processes, and derive an inference algorithm by performing inference on

an aggregated sum of Gaussian Processes. Approximate Bayesian inference is achieved via data

augmentation, and we describe a mean-field variational inference approach to learn the model

parameters. To demonstrate the flexibility of the model we apply our methodology on data from

different domains and compare it to previously reported results.

Keywords: Bayesian inference; point process; Gaussian process

1. Introduction

Sequences of self-exciting, or inhibiting, temporal events are frequent footmarks of

natural phenomena: earthquakes are known to be temporally clustered as aftershocks are

commonly triggered following the occurrence of a main event [

]; in social networks, the

propagation of news can be modeled in terms of information cascades over the edges of a

graph [

]; and in neuronal activity, the occurrence of one spike may increase or decrease

the probability of the occurrence of the next spike over some time period [3].

Traditionally, sequences of events in continuous time are modeled by point processes,

of which Cox processes [

], or doubly stochastic processes, use a stochastic process for the

intensity function, which depends only on time and is not effected by the occurrences of the

events. The Hawkes process [

] extends the Cox process to capture phenomena in which

the past events affect future arrivals, by introducing a memory dependence via a memory

kernel, which is also referred to as the causal influence function. When incorporating the

dependence of the process on its own history, due to the superposition theorem of the point

process, new events will depend on either an exogenous rate, which is independent of the

history, or an endogenous rate from past arrivals.

Originally, the dependence on history in the Hawkes process is assumed to be self-

excitatory, and the memory kernel is parameterized by an exponential or power law decay,

which results in a model with low flexibility. Furthermore, assuming only an excitatory

relation between the events does not hold for other phenomena we wish to model. For

example, inhibitory effects between neurons [

], and even self-inhibition [

], are crucial for

regulating the neuronal activity. Thus, the memory kernel should also include inhibitory

relations between the events and by doing so the intensity may become negative. To

ensure that the intensity function is non-negative, a nonlinear link function is applied on

Entropy 2022,24, 356. https://doi.org/10.3390/e24030356 https://www.mdpi.com/journal/entropy

Entropy 2022,24, 356 2 of 22

the memory kernel, and the resulting model is often referred to as a nonlinear Hawkes

process [

–

]. Theoretical results for nonlinear Hawkes processes have been developed

for many years and they include stability estimates [

] as well as convergence rates for

Bayesian estimators [11].

In this work, we present a Nonlinear Hawkes Process with Gaussian Process Self-

Effects (NH–GPS), which extends the class of nonlinear Hawkes processes. As the causal

influence between events can be either excitatory or inhibitory, we use the term self-effects

to refer to the influence of past events on future events.

We choose a semi-parametric approach, which avoids the limiting parameterization of

the memory kernel and the background rate. We assume a Gaussian process (GP) prior

to the exogenous events’ intensity and on the memory kernel, which allows also for an

inhibitory effect between the events. To ensure that the intensity function is non-negative

we use the sigmoid link function. This modeling approach is not only descriptive but

also allows us to obtain a fast inference procedure. The history of self-effects defines an

aggregated Gaussian process, and we perform the inference directly on this aggregation

rather than obtaining a posterior over each self-effect.

While highly flexible approaches to modeling the intensity function of nonlinear

Hawkes processes have been presented before, they mainly rely on deep neural network

solutions [

]. These approaches are hindered by the necessity of large datasets. In

contrast, our methodology retains modeling flexibility due to the nonparametric nature of

Gaussian processes, while being able to perform well when data are scarce.

Outline

In Section 2, we describe the NH–GPS model and emphasize how its structure allows

for efficient Bayesian inference. In Section 3, we describe the augmentation scheme and

derive the mean-field variational inference algorithm. In Section 4, we discuss related work

and describe how it relates to ours. In Section 5, we present the results of our experiments

both on synthetic data and different real-world examples, and compare it to existing work

when possible. In Section 6, we conclude by discussing our work and future research

directions.

2. Proposed Model

2.1. Classical Hawkes Process

Let

Tt= [

t]∈R

. We define the counting measure

N(Tt)

as the number of arrivals in

the interval

. Furthermore, we define the history

or the realizations of a given process,

as the set of arrivals in the interval

, namely

Ht={T1

. . .

TN(Tt):Ti∈ Tt∧Ti−1<Ti}

and

corresponds to the time of arrival

. For a temporal point process, the counting

measure N(·)has an associated intensity defined as

Λ(t) = lim

∆t→0

E[N(Tt+∆t)−N(Tt)|Ht]

∆t.

The intensity function may depend on the history of the process. An example of

such a process is the Hawkes process, or self-exciting point process [

], which defines

self-excitations [

] around exogenous events. Following Hawkes and Oakes

[5]

, the intensity

of the Hawkes process is defined by

Λ(t|Ht) = s(t)+∑

tn<t

g(t−tn), (1)

where

s(t)

is the base intensity of exogenous arrivals and

g(t−tn)

is the memory kernel, or

causal influence function, defining the change in the excitation or inhibition value following

each arrival. In the classical Hawkes process, causal influence of only excitatory nature

is allowed and the memory kernel is usually of the form

g(t−tn) = βe−α(t−tn)

for an

exponentially decaying memory.

Entropy 2022,24, 356 3 of 22

2.2. Nonlinear Hawkes Process with Gaussian Process Self-Effects

In the classical Hawkes process, the memory kernel

in Equation

(1)

must be non-

negative, to prevent the intensity function from being negative. As a result the history

of the model has only excitatory effect on future events. We are interested in a model

that includes inhibition between events, and we release the constraint over

so it can be

negative, and define the following nonlinear intensity function

Λ(t)=λσ(φ(t))(2)

σ(φ(t))=1

1+exp (−φ(t)) (3)

φ(t) = s(t)+∑

tn<t

g(t−tn)exp(−α(t−tn)). (4)

Here, we choose the sigmoid function to ensure that the intensity function

Λ(·)

non-negative.

is the intensity bound and we refer to

φ(·)

as the linear intensity function.

We explicitly add the exponential decay to enforce the forgetting constraint, which

is essential for most realistic processes. Although we choose here a specific parameteri-

zation of the memory decay, one can choose other forms of memory decay with minimal

adaptation to the learning procedure of the model parameters.

Besides

and

, the functions

s(·)

and

g(·)

are the unknown parameters of the model

to be inferred from the data. In this paper, we will use a nonparametric Bayesian inference

approach based on the definition of a prior probability measure over these functions.

In contrast to alternative Bayesian models for Hawkes processes where the positive rate

function is directly modeled by a Dirichlet process, in our case we have to deal with random

functions which are not constrained to be positive. This suggests that we should use a

simple, but still highly flexible approach by modeling the two functions independently as

realizations of Gaussian random processes. We write symbolically:

s∼GP(0, Ks)(5)

g∼GP(0, Kg)(6)

Ks/g(t1,t2)=as/g·exp −kt1−t2k2

σ2

s/g!. (7)

The corresponding Gaussian prior measures are uniquely defined by the mean func-

tions (which we set to be equal to zero) and the second moments given by the covariance

kernel functions Kg/s. The latter are defined by the prior expectations

Ks(t1,t2).

=E[s(t1)s(t2)] (8)

Kg(t1,t2).

=E[g(t1)g(t2)] (9)

By a proper choice of kernels, we can encode further prior beliefs on typical realizations

s(·)

and

g(·)

. Throughout the paper, we will work with the so-called RBF kernels from

Equation

(7)

. This kernel corresponds to the prior assumptions that the Gaussian processes

are stationary (the kernels depend on time differences only) and that the functions

s(·)

and

g(·)

are infinitely often differentiable. The kernels depend on two hyperparameters

and

, which reflect the typical amplitude and length scale of the functions. The reasonable

values of these hyperparameters will also be inferred from the data.

Finally, we assume a prior distribution also on the upper intensity bound

λ∼Gamma(α0,β0).

and we identify the hyperparameters of the model as {σg,ag,α,σs,as}.

In this work, we propose Bayesian inference for fitting the model to data. Due to the

nonlinearity over

φ(·)

, we are no longer able to easily utilize the branching structure of the

Entropy 2022,24, 356 4 of 22

Hawkes process, which allowed for the estimation of

s(·)

and

g(·)

[

]. Thus, a natural

solution is to perform the inference directly on φ(·).

Next, we identify the prior over the entire linear intensity

p(φ)

. From Equation

(4)

see that the linear intensity function

is nothing but the sum of GPs, and as such it is also

a GP:

φ∼GP0, ˜

K(10)

Klk =Ks

lk +∑

ti<tl

∑

tj<tk

tl−ti,tk−tjexp−αtl−ti+tk−tj. (11)

Multivariate Model

We propose an extension of the model to multiple dimensions. This is useful in

applications where different types of events are observed, or the events originate from

different processes that affect each other. We define an

-dimensional point process with

intensity in dimension r

Λr(t)=λrσ(φr(t))

φr(t)=sr(t)+

∑

m=1

∑

n<t

gr,m(t−tm

n)exp(−αr,m(t−tm

n))

where

is the time of event number

of type

. We assume that every dimension has its

own intensity bound

λk

and background rate

sr(·)

. The different dimensions interact with

each other via the self-effects term.

gr,m(·)

defined the effects of the events of type

on the

events of type r. As in the univariate case, this effect may change over time.

Given the observations, the different dimensions are independent of each other and we

can learn their parameters separately. Thus, in the following section we present the inference

for the univariate model, and the extension to the multivariate case is straight forward.

3. Inference

Conditioned on the intensity function

Λ(·)

, the likelihood of observations

{t1

. . . tN}

from a Hawkes process is [18]

`({t1, . . . tn}|Λ(·)) =exp−ZT

Λt0dt0N

∏

n=1

Λ(tn).

In order to obtain posterior distributions for the latent variables either in the form of

a Gibbs sampler or through approximate posteriors in variational inference, we require

certain computations that are not tractable or efficient under the current form of the like-

lihood. Similarly to previous work on Cox and Hawkes processes [

], we follow

an augmentation procedure. We do so via the introduction of auxiliary variables, which

expand the model to a different likelihood form. Under the marginalization of the afore-

mentioned auxiliary variables, the new likelihood will conserve the form of the original

model likelihood. The new form of the likelihood is constructed such that the computations

required for the inference procedure are either tractable, computationally fast, or both.

3.1. Model Augmentation

The first step we take in treating the likelihood function is using the Pólya–Gamma

(PG) augmentation scheme. Following Theorem 1 in Polson et al.

[21]

, we can rewrite the

nonlinear intensity function as

σ(φ(t)) =Z∞

0ef(w,φ(t))PG(w; 1, 0)dw (12)

f(w,φ(t)) =−φ(t)2w

2+φ(t)

2−ln 2. (13)

Entropy 2022,24, 356 5 of 22

As we augment each observation with a variable

from a PG distribution, the joint

likelihood of the observed events {tn}and PG variables {wn}is

p{tn}N

n=1,{wn}N

n=1|φ,λ=(14)

exp−ZT

0λσ(φ(t))dt·

∏

n=1

λef(wn,tn)PG(wn; 1, 0)

with

exp−ZT

0λσ(φ(t))dt=(15)

exp−ZT

0Z∞

0λPG(w; 1, 0)1−ef(w,−φ(t))dwdt.

where we used σ(t) = 1−σ(−t).

Next, we utilize the Campbell’s theorem [

], which states that for a Poisson process

Πwith intensity ϕ

Eϕ ∏

x∈Π

exp(h(x))!=

exp−Z(1−exp(h(x)))ϕ(x)dx.

Looking at Equation

(15)

, we identify

x=(t,w)

and

ϕ(t,w)=λPG(w|1, 0)

is the

intensity of a marked Poisson process in

with marks

w∼PG(0, 1)

. Furthermore, we

determine h(x)=f(w,−φ(t)). We can now rewrite the exponential in Equation (14) as

exp−ZT

0λσ(φ(t))dt=Eϕ M

∏

m=1

ef(ˆ

wm,ˆ

tm)!(16)

for realizations {ˆ

tm,ˆ

wm}M

m=1.

We substitute Equation

(16)

into Equation

(14)

, which results in the full augmented

likelihood. Given the prior distributions over

and

, we can now write the model’s

posterior distribution as

p{ˆ

tm,ˆ

wm},{wn},φ,λ|{tn}∝exp(−λT)×(17)

∏

m=1

λef(ˆ

wm,−φ(ˆ

tm))PG(wm; 1, 0)×

∏

n=1

λef(wn,φ(tn))PG(wn; 1, 0)×p(φ)p(λ).

To summarize, we augment the model with two sets of variables—the PG variables

{wn}

, which augment the actual realizations, and the tuples

{ˆ

wm}

, which are the

realizations and marks of the auxiliary marked Poisson process.

As mentioned above, we intend to learn directly the linear intensity function

φ(·)

. This

allows us to utilize the mean-field variational inference previously introduced in Donner

and Opper

[19]

and Donner and Opper

[22]

. Next, we go through the steps of the algorithm,

and we refer the reader to the two papers mentioned above for further details. As a baseline

we compare the performance of the variational inference algorithm to a Gibbs sampler. The

details of the Gibbs sampler can be found in Appendix Aand Algorithm 2.

Loading more pages...