safety
Article
Observations on the Relationship between Crash Frequency
and Traffic Flow
Peter Wagner 1,2,*, Ragna Hoffmann 1 and Andreas Leich 1,*


Citation: Wagner, P.; Hoffmann, R.; Leich, A. Observations on the Relationship between Crash Frequency and Traffic Flow. Safety 2021, 7, 3. https://doi.org/10.3390/safety7010003
Received: 25 October 2020
Accepted: 7 January 2021
Published: 11 January 2021
Publisher's Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
1 Deutsches Zentrum für Luft- und Raumfahrt e.V., Institute of Transportation Systems, Rutherfordstrasse 2, D-12489 Berlin, Germany; ragna.hoffmann@dlr.de
2 Institute of Land- and Sea Transport Systems, Technical University of Berlin, Salzufer 17-19, D-10587 Berlin, Germany
* Correspondence: peter.wagner@dlr.de (P.W.); Andreas.Leich@dlr.de (A.L.); Tel.: +49-30-67055-237 (P.W.)
Abstract: This work analyzes the relationship between crash frequency $N$ (crashes per hour) and exposure $Q$ (cars per hour) on the macroscopic level of a whole city. As exposure, the traffic flow is used here. To this end, it analyzes a large crash database of the city of Berlin, Germany, together with a novel traffic flow database. Both data sets display a strong weekly pattern and, taken together, show that the relationship $N(Q)$ is not linear. When $Q$ is small, $N$ grows like a second-order polynomial, while at large $Q$ there is a tendency towards saturation, leading to an S-shaped relationship. Although the saturation is visible in the data for all crashes, it is less prominent in the data for the severe crashes. As a by-product, the analysis performed here also demonstrates that the crash frequencies follow a negative binomial distribution, where both parameters of the distribution depend on the hour of the week and, presumably, on the traffic state in this hour. The work presented in this paper aims at giving the reader a better understanding of how crash rates depend on exposure.
Keywords: road safety; traffic states; crash rates; temporal crash rate pattern
1. Motivation
Recent years have seen some progress when it comes to the availability and analysis of crash data [1–4]. This has triggered new work and new methods, most notably from machine learning, that have the potential to improve knowledge, models, and, ultimately, the state of traffic safety.
In many cases, road safety work consists of identifying crash blackspots, determining corrective measures, implementing them, and later evaluating them. A reasonable definition of a road accident blackspot will involve the number of crashes per unit of exposure. This paper deals with the problem of modeling the relationship between crash rates and exposure. A better understanding of this relationship allows traffic safety management to target hazardous locations more clearly, based on risk and not merely on crash frequency.
Traditionally, one approach in this context is the development of crash prediction models. They estimate the impact of several variables $x_{j\cdot}$ on crash frequencies. This is done by applying models of the type [5–9]:
$$N_i = \beta_0 \left(\frac{Q_i}{Q_0}\right)^{\beta_1} \exp(\mu_i) = \beta_0 \left(\frac{Q_i}{Q_0}\right)^{\beta_1} \exp\!\left(\sum_{j=2}^{n} \beta_j x_{ji} + \zeta\right), \tag{1}$$
where $N_i$ is the crash frequency at a certain instance $i$ (time, place, ...), $Q_i$ is an exposure variable, $Q_0$ is a baseline flow, $\mu_i$ is the mean value of the crash rate, the $x_{ji}$ are factors thought to influence the crash frequency, and the $\beta_j$ are coefficients that quantify the strength of each factor. Moreover, there is a gamma-distributed noise term $\zeta$, where $\exp(\zeta)$ has mean one and variance $\gamma$.
The crash frequencies themselves are then found as a realization of a stochastic process with a negative binomial distribution (NBD) with mean $\mu$ and variance $\sigma^2$:
$$\sigma^2 = \mu + \gamma \mu^2. \tag{2}$$
The parameter $\gamma$ describes how much the NBD deviates from a Poisson distribution, with $\gamma = 0$ for a Poisson distribution.
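The mean-variance relationship in Equation (2) can be checked numerically. The following is a minimal sketch, not the authors' procedure: it assumes the usual construction of negative binomial counts as a Poisson draw whose mean is scaled by a gamma-distributed multiplier with mean one and variance $\gamma$. The function names and the values of `mu` and `gamma` are illustrative assumptions and are not taken from the paper.

```python
import math
import random
import statistics


def poisson_draw(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_nb_counts(mu, gamma, size, rng):
    """Simulate counts with mean mu and variance mu + gamma * mu**2."""
    counts = []
    for _ in range(size):
        # Gamma multiplier with shape 1/gamma and scale gamma:
        # mean = 1, variance = gamma.
        multiplier = rng.gammavariate(1.0 / gamma, gamma)
        counts.append(poisson_draw(mu * multiplier, rng))
    return counts


rng = random.Random(0)
mu, gamma = 20.0, 0.05  # illustrative values only (assumption)
counts = simulate_nb_counts(mu, gamma, size=50_000, rng=rng)

sample_mean = statistics.fmean(counts)
sample_var = statistics.variance(counts)
expected_var = mu + gamma * mu**2  # Equation (2)
```

With $\gamma$ close to zero the multiplier concentrates at one and the sample variance approaches the sample mean, which is the Poisson limit mentioned above.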
The exposure variable, which in the following will be mainly the traffic flow, is difficult to include properly in traffic safety analyses. This is due to the fact that crashes are rare events and that, more often than not, a measurement of the exposure is not available at the site and time of the crash. If available, it is often only in the form of an average over a day, and very often taken from travel demand models instead of being measured directly. Similar difficulties plague the other data source of exposure, which stems from travel surveys. In many cases, these are averages over large spatial areas, although attempts exist to integrate traffic flow in more detail [10–13]. However, it might be speculated that crash probabilities depend strongly on the traffic state itself, with traffic flow being one of the major influencing variables [14,15].
Of particular interest is the relationship between the crash frequency $N$ and the traffic flow $Q$, see [14]. Note that often it is not $N$ itself that is displayed as a function of $Q$; the crash rate $\rho$ is used instead:
$$\rho(Q) = \frac{N(Q)}{Q}. \tag{3}$$
The crash rate is the ratio between an average crash frequency and the corresponding average traffic flow, leading to a continuous variable. As is demonstrated in Section 3.1, another interpretation is to use the discrete number of crashes in one hour and the associated traffic flow in this hour, which leads to a mixture between a discrete and a continuous distribution. (A similar approach can be found in [16], where the mileage is used on the x-axis.)
Some thoughts about the relationship between $N(Q)$ and $\rho(Q)$ are in order. It is very likely that the crash rate does not vanish as a function of $Q$: even for very small exposure $Q$, we expect that the crash rate does not drop to zero, and the results of this work will lend additional credibility to this idea. So:
$$N(Q) \propto Q \;\Rightarrow\; \rho(Q) \to \rho_0 \quad \text{for } Q \to 0. \tag{4}$$
For freeway traffic, a good deal of results for $N(Q)$ and $\rho(Q)$ are available. The most commonly used model has a roughly U-shaped form for $\rho(Q)$, where crash rates are rather large for small and large flows and have a minimum for intermediate flows. For example, the work [17,18] claims:
$$\rho(Q) = c_1 Q^{\beta_1} + c_2 Q^{\beta_2}, \tag{5}$$
where the exponent $\beta_1$ is around $-1$, and $\beta_2$ is between 1 and 2. Note that it is assumed that the flow values are normalized to a constant flow so that the units drop out. The first term is for single-vehicle crashes, while the second term describes multi-car crashes. Ceder also observed that one should discern free traffic from congested traffic; this, however, raises the question of how to do this properly based on hourly values. Furthermore, a recent meta-analysis [19] that used 118 studies comes to a similar conclusion, albeit with different exponents $\beta_1$, $\beta_2$.
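To make the U-shape of Equation (5) concrete, the flow at which the crash rate is minimal can be written in closed form. Setting the derivative $c_1 \beta_1 Q^{\beta_1 - 1} + c_2 \beta_2 Q^{\beta_2 - 1}$ to zero gives $Q^* = \left(-c_1 \beta_1 / (c_2 \beta_2)\right)^{1/(\beta_2 - \beta_1)}$. The sketch below assumes $c_1, c_2 > 0$ and $\beta_1 < 0 < \beta_2$, which a U-shape with large rates at small flows requires. The function names and parameter values are illustrative assumptions, not fitted values from the cited studies.

```python
def crash_rate(q, c1, beta1, c2, beta2):
    """Crash rate rho(Q) = c1 * Q**beta1 + c2 * Q**beta2 for normalized flow q > 0."""
    return c1 * q**beta1 + c2 * q**beta2


def flow_at_minimum(c1, beta1, c2, beta2):
    """Normalized flow Q* at which the U-shaped crash rate is minimal."""
    if not (c1 > 0 and c2 > 0 and beta1 < 0 < beta2):
        raise ValueError("a U-shape needs c1, c2 > 0 and beta1 < 0 < beta2")
    return (-c1 * beta1 / (c2 * beta2)) ** (1.0 / (beta2 - beta1))


# Illustrative parameters only (assumption): beta1 = -1, beta2 = 1.5.
params = dict(c1=1.0, beta1=-1.0, c2=1.0, beta2=1.5)
q_min = flow_at_minimum(**params)
rate_at_min = crash_rate(q_min, **params)
```

In this form, a larger single-vehicle coefficient `c1` shifts the minimum towards higher flows, while a larger multi-car coefficient `c2` shifts it towards lower flows.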
Similar results exist in [15,20,21] and in the German study [22]; sometimes, more symmetric second-order polynomial relationships $\rho(Q) \approx c_1 (Q - Q_c)^2 + c_0$ have been used to describe the data. The approach of [23] is a bit different, since it displays the crash rate as a function of a novel indicator that is difficult to translate into traffic flow or volume/capacity ratios.
Note as an oddity that one of the earliest models on this topic [24] proposed an inverse U-shaped relationship, which is once again a second-order polynomial; this time, the crash rate is small for both small and large flows, which were in this case AADT values (AADT = annual average daily traffic). However, Veh's data are also consistent with the assumption that the crash rate is constant, or a weakly increasing function of exposure.
In essence, there is currently no univocal picture of the relationship $\rho(Q)$ for freeway data; for a more complete overview see [25]. Note that at least the idea of a diverging crash rate for $Q \to 0$ might be questionable; this, however, is not the topic of this work.
Results are different when looking at the relationship between $\rho$ and AADT as an exposure variable. In this case, crash rates increase with $Q$, possibly again as a power law $\rho(Q) \propto Q^{\beta}$ [26].
Very little work has been done so far that looks at the relationship $N(Q)$ on the basis of a whole city. Notable exceptions are [27], with a more theoretical approach, [28], which tries to test this theory without having real flow data available, and the recent work on a network-based macroscopic safety diagram [29].
The hypothesis behind this work was the assumption that at least the crash frequency that involves two cars should be a second-order polynomial function of the traffic flow [30]. A similar idea is also proposed in [27,31]. Therefore, a reasonable model for the crash frequency in a city is a combination of single-car crashes (which can be assumed to be proportional to the number of vehicles present, $Q$) and an interaction term proportional to $Q^2$. This interaction term is due to the naive assumption that if vehicles move independently of each other, then there is a probability proportional to $Q^2$ that they meet:
$$N = \alpha_1 Q + \alpha_2 Q^2. \tag{6}$$
Note that it is not easy to bring Equation (6) in line with Equation (1): the latter is tailored towards the use of generalized linear models (GLM) with a logarithmic link function, and by exchanging $Q^{\beta_1}$ with $\alpha_1 Q + \alpha_2 Q^2$, the very character of this model is changed into something that can no longer be treated as a GLM with a logarithmic link function.
However, this work deals only with the prefactor and ignores $\exp(\mu_i)$, so a GLM and its generalization GAM (generalized additive model) can still be used, but in most cases with the identity as the link function.
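As an illustration of fitting Equation (6) with an identity link, the sketch below estimates $\alpha_1$ and $\alpha_2$ by ordinary least squares through the origin, solving the 2×2 normal equations directly. This is a simplified stand-in for the GLM/GAM fits mentioned above, not the authors' procedure: plain least squares ignores the negative binomial variance structure. The function name and the synthetic data are assumptions made for the example.

```python
def fit_quadratic_through_origin(flows, crashes):
    """Least-squares estimate of (alpha1, alpha2) in N = alpha1*Q + alpha2*Q**2."""
    # Entries of the normal equations for the design matrix [Q, Q**2].
    s11 = sum(q**2 for q in flows)
    s12 = sum(q**3 for q in flows)
    s22 = sum(q**4 for q in flows)
    b1 = sum(q * n for q, n in zip(flows, crashes))
    b2 = sum(q**2 * n for q, n in zip(flows, crashes))

    det = s11 * s22 - s12**2
    if det == 0:
        raise ValueError("need at least two distinct non-zero flow values")

    # Cramer's rule for the symmetric 2x2 system.
    alpha1 = (b1 * s22 - b2 * s12) / det
    alpha2 = (s11 * b2 - s12 * b1) / det
    return alpha1, alpha2


# Synthetic, noise-free data with made-up coefficients (assumption).
true_alpha1, true_alpha2 = 5.0, 30.0
flows = [0.02 * k for k in range(1, 51)]  # normalized flow values
crashes = [true_alpha1 * q + true_alpha2 * q**2 for q in flows]

alpha1, alpha2 = fit_quadratic_through_origin(flows, crashes)
```

Because the model has no intercept, the fitted curve passes through $N = 0$ at $Q = 0$, in line with the idea that no traffic means no crashes.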
As a final remark, note that models with a power-law term as in Equation (1) are not in line with the assumption in Equation (4) that the crash rate becomes constant for small exposure. However, when looking closely into [17,18], the function in Equation (5) might be modified to avoid the divergence at $Q \to 0$ by changing the first term in the equation into $c_1 (Q + b)^{\beta_1}$.
2. The Data
This paper uses two types of data. The first one is a large crash database that contains all crashes reported by the Berlin police in the city of Berlin, Germany, during the years 2001–2019. The data are de-identified, i.e., they do not contain number plates or names of the crash participants or any other information that can be used to identify them. Furthermore, for the subset of data used in the work reported here, the crash time has been aggregated to the hour.
Note that common practice in Berlin differs from that in other German federal states, since even many property-damage-only (PDO) crashes are reported in the database. However, even here the analyst must be aware that these numbers are biased due to the under-reporting of small crashes. For this paper, only part of the data in this database has been used; see below for a more detailed description.
The second set of data stems from the Traffic4cast competition [32]. It contains de-identified data from most of the days of 2018 in Berlin, where the speeds and the number of probes of a certain vehicle fleet have been recorded. Since such a data set is a bit unusual, it has been complemented by two other de-identified data sets, so that comparisons between the different data could be performed that are interesting in their own right. These are the annual hourly count data from 28 detection sites, which have been provided by the German Federal Highway Research Institute (BASt), and data from the latest travel survey in Germany, named Mobilität in Deutschland (MiD) [33].
2.1. The Crash Database
The crash database has been provided by the Berlin police, and it is not publicly available. It contains for each crash $i$ about $60 \times n_i$ variables (some redundant), where $n_i$ is the number of people involved in the crash. Here, only the time $t_i$, the severity, and the vehicle types have been used. Time is recorded with a resolution of minutes; however, it is advisable not to rely on this precision, since a preference for multiples of 15 min can be observed (see also [3]). The severity of each crash is described by the number of lightly injured, the number of severely injured, the number of fatalities, and the damage.
In the following, only severe crashes (crashes with injured or killed participants) and non-severe PDO crashes will be distinguished. The database contains 1,888,038 crashes and, importantly for the analysis below, most crashes are between two cars, as can be seen in Figure 1.
[Figure 1: two bar charts. Left panel: counts versus number of crash participants. Right panel: counts versus traffic mode (Bike, Car, Misc, MoBike, Peds, PT-Bus, Truck).]
Figure 1. Distribution of the number of road users involved (left), and the traffic shares (right) in the Berlin crash data set. The Misc traffic mode is used by the police to denote any traffic mode that cannot be assigned. The y-axis is logarithmic; the numbers on top of the bars are the percentages of the respective shares.
For this study, the timestamps of all crashes in the database have been rounded down to the nearest full hour and translated to the corresponding hour of the week (0–167). Since the data set spans 19 years, each hour occurs 992 times in the data set, resulting in 992 crash numbers for every hour of the week $h$. Therefore, for each hour of the week, the distribution of counts can be determined directly. The results are displayed in Figure 2 as a boxplot, and they display a strong weekly pattern. Very similar results have been reported recently by [34].
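The binning described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that hour 0 is Monday 00:00 to 01:00 (the paper does not state where the week starts), and the function names and example timestamps are made up.

```python
from collections import Counter
from datetime import datetime


def hour_of_week(timestamp):
    """Map a timestamp to 0..167, assuming hour 0 starts on Monday at 00:00."""
    return timestamp.weekday() * 24 + timestamp.hour


def counts_by_hour_of_week(timestamps):
    """Group hourly crash counts by hour of the week.

    Returns a dict {hour_of_week: [count, count, ...]} with one count per
    calendar hour in which at least one crash was recorded.
    """
    # Round every timestamp down to the full hour and count crashes per hour.
    per_hour = Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )
    grouped = {}
    for hour_start, n_crashes in per_hour.items():
        grouped.setdefault(hour_of_week(hour_start), []).append(n_crashes)
    return grouped


# Made-up example timestamps; 1 January 2018 was a Monday.
example = [
    datetime(2018, 1, 1, 8, 5),
    datetime(2018, 1, 1, 8, 47),
    datetime(2018, 1, 8, 8, 30),
    datetime(2018, 1, 7, 23, 59),
]
grouped = counts_by_hour_of_week(example)
```

Note that this sketch only records hours with at least one crash. To obtain the full distribution per hour of the week, calendar hours without any crash would have to be added as zero counts.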
Figure 3 displays this result for Mondays only as a violin plot, so that the shape of the distributions can be seen more clearly. Moreover, the distribution of severe crashes has been included in this figure as well.
[Figure 2: box plot. x-axis: hour of week $h$; y-axis: $N(h)$ (counts/h).]
Figure 2. Box plot of the crash frequency per hour of the week. The blue bar is the median, the boxes span the 25th to 75th percentiles, and the whiskers display the minimum and the maximum of the data.
Figure 3. Distribution of the hourly crash counts as a function of the hour of the day on Mondays, displayed as a violin plot. The orange violins are for all crashes, the red ones for the severe crashes (which are shifted left by half an hour). The white circle is the median of the values.
2.2. The Distribution of the Crash Frequency
Most likely, the individual distributions in each hour follow an NBD. This can be tested by plotting their variance $\sigma^2$ against their mean value $\mu$. An NBD then displays a second-order polynomial relationship between $\mu$ and $\sigma^2$, as stated already in Equation (2), where the parameter $\gamma$ specifies the deviation of the distribution from a Poisson distribution. The results can be seen in Figure 4, which shows that the data follow an NBD. The same analysis has also been performed for severe crashes only. Various fits to this cloud of data points have been included in this figure as well, demonstrating that the assumption of the NBD fits these data quite well. All fits are done with R's lm() function [35], which executes a linear least-squares fit to these data. The fit for the severe crashes is even better (larger $R^2$), leading to two different estimates for the $\gamma$ variable. For all crashes, $\gamma$ is estimated as