From Theory to Comprehension: A Comparative Study of
Differential Privacy and 𝑘-Anonymity
Saskia Nuñez von Voigt
Technische Universität Berlin
Berlin, Germany
Luise Mehner
Technische Universität Berlin
Berlin, Germany
Florian Tschorsch
Technische Universität Dresden
Dresden, Germany
ABSTRACT
The notion of 𝜀-differential privacy is a widely used concept of providing quantifiable privacy to individuals. However, it is unclear how to explain the level of privacy protection provided by a differential privacy mechanism with a set 𝜀. In this study, we focus on users’ comprehension of the privacy protection provided by a differential privacy mechanism. To do so, we study three variants of explaining the privacy protection provided by differential privacy: (1) the original mathematical definition; (2) 𝜀 translated into a specific privacy risk; and (3) an explanation using the randomized response technique. We compare users’ comprehension of privacy protection employing these explanatory models with their comprehension of privacy protection of 𝑘-anonymity as baseline comprehensibility. Our findings suggest that participants’ comprehension of differential privacy protection is enhanced by the privacy risk model and the randomized response-based model. Moreover, our results confirm our intuition that privacy protection provided by 𝑘-anonymity is more comprehensible.
CCS CONCEPTS
• Security and privacy → Usability in security and privacy; Data anonymization and sanitization; • General and reference → Surveys and overviews.
KEYWORDS
differential privacy, explanatory model, study
ACM Reference Format:
Saskia Nuñez von Voigt, Luise Mehner, and Florian Tschorsch. 2024. From Theory to Comprehension: A Comparative Study of Differential Privacy and 𝑘-Anonymity. In Proceedings of the Fourteenth ACM Conference on Data and Application Security and Privacy (CODASPY ’24), June 19–21, 2024, Porto, Portugal. ACM, New York, NY, USA, 12 pages. https://doi.org/10.1145/3626232.3653261
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs International 4.0 License.
CODASPY ’24, June 19–21, 2024, Porto, Portugal
© 2024 Copyright held by the owner/author(s).
ACM ISBN 979-8-4007-0421-5/24/06
https://doi.org/10.1145/3626232.3653261
1 INTRODUCTION
Privacy-preserving techniques have been proposed in various domains to provide data protection guarantees. The aim of these techniques is to minimize the risk of identifying an individual while also maximizing the utility of the data. One simple method is to remove or generalize attributes so that each combination of attribute values comprises at least 𝑘 entries, leading to the concept of 𝑘-anonymity [27]. Each individual in the data set is therefore indistinguishable from 𝑘−1 other individuals. However, 𝑘-anonymity does not provide strong mathematical privacy guarantees, as attribute values can be revealed in some situations [16, 18].
The privacy concept of 𝜀-differential privacy [4] offers stronger privacy guarantees. It is a mathematical definition in which randomization is used to limit the impact that an individual contributing to a database has on the output. The privacy parameter 𝜀 determines the privacy-utility tradeoff.
It is, however, difficult for a user to comprehend the level of privacy protection provided to them resulting from a particular 𝜀. Previous works have attempted to explain differential privacy mechanisms [3, 31], quantify privacy guarantees [10, 15, 21], and communicate privacy risks [1, 7]. One approach to making the privacy parameter of differential privacy more comprehensible is to translate 𝜀 into a corresponding privacy risk, expressed as a percentage [15, 20]. Another approach has used the randomized response technique [30] to describe privacy protection [1]. This technique involves local differential privacy, which has been shown to be more intuitive [31]. However, it is unclear whether these approaches enhance users’ comprehension of the implications of differential privacy mechanisms [3].
In contrast, for 𝑘-anonymity, the privacy parameter 𝑘 is directly linked to individual identifiability. We therefore argue that the privacy protection provided by 𝑘-anonymity is easier to understand than that provided by differential privacy. Based on this assumption, we investigated how the level of privacy protection of differential privacy can be explained in such a way that it is, if possible, just as comprehensible as that of 𝑘-anonymity.
To that end, we present three explanatory models that explain the privacy protection provided by differential privacy. In each explanatory model, we use a particular translation of the privacy parameter 𝜀 into a more intuitive concept. These translations of 𝜀 describe the level of privacy protection, making it easier to comprehend the implications of various differential privacy mechanisms. We build upon existing and established strategies to communicate the privacy protection provided by differential privacy quantitatively: (1) the original mathematical definition (DEF); (2) 𝜀 as a privacy risk (RISK); and (3) an explanation using the randomized response technique (RRT). We conducted an experimental study to investigate whether these explanatory models enhance users’ comprehension of differential privacy protection.
In our experimental study, we examined users’ comprehension of the privacy protection provided by a differential privacy mechanism compared to their comprehension of the privacy protection provided by a 𝑘-anonymity mechanism. We thus anchor the comprehension of the privacy protection of differential privacy in general and the respective comprehensibility with each explanatory model to the comprehensibility of 𝑘-anonymity. Our comparison increases the methodological validity of our study. Importantly, we do not compare the two mechanisms themselves nor their level of privacy protection. Instead, we are interested in the comprehensibility of privacy protection provided by the mechanisms.
With our results we provide evidence that the privacy protection provided by differential privacy is best understood using RRT as an explanatory model. Moreover, we establish 𝑘-anonymity as a baseline and an easily understandable privacy mechanism.
The paper’s contribution and structure can be summarized as follows: In Section 2, we present three explanatory models that include translations of the privacy parameter to help users understand the privacy protection, and thus the implications, of a differential privacy mechanism. In Section 3, we describe the design of the experimental study addressing our research questions. Before conducting our main study, we performed a pilot study to validate our explanations and questions; the resulting improvements, designed to increase the internal validity of the questions concerning subjective and objective comprehension, are presented in Section 4. In our main study, we examined the participants’ subjective and objective comprehension of the differential privacy protection with the explanatory models DEF, RISK, and RRT compared to users’ comprehension of the privacy protection provided by a 𝑘-anonymity mechanism (Section 5). Lastly, we discuss limitations and future work in Section 6 and we review related work in Section 7. We conclude our paper in Section 8.
2 EXPLANATORY MODELS
In this section, we provide three explanatory models for the implications of the privacy parameter 𝜀 of our privacy mechanism, differential privacy. Each model involves a translation of the privacy parameter into a more intuitive concept. Each translation is designed to help users understand the level of privacy protection provided with a specified privacy parameter and thus the implications of the mechanism. In addition, we give a brief overview of the privacy parameter of 𝑘-anonymity.
2.1 Privacy Protection of 𝑘-anonymity
The privacy protection of 𝑘-anonymity [27] relies on the concept of anonymity sets. An anonymity set is a set of elements which are indistinguishable from each other. Individuals’ entries in a database are generalized or suppressed in such a way that for each entry, there are at least 𝑘 entries with the same values in all columns that might be used to re-identify an individual. In other words, individuals’ entries are clustered into anonymity sets.
The privacy parameter 𝑘 translates to the size of the smallest anonymity set in the database. The higher 𝑘, the more indistinguishable individuals exist in each group, resulting in stronger privacy protection. For instance, with 𝑘 = 4, the chance of correctly linking an entry of a group to an individual is 1/4 = 0.25.
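As a minimal sketch of how 𝑘 can be read off a table, the following Python snippet groups rows by the columns that might be used for re-identification and reports the size of the smallest anonymity set. The table contents, the column names (`age_group`, `class`), and the helper `smallest_anonymity_set` are illustrative assumptions and are not taken from the study materials.

```python
from collections import Counter

def smallest_anonymity_set(rows, quasi_identifiers):
    """Return k: the size of the smallest group of rows that share
    the same values in all quasi-identifier columns."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

# Hypothetical, already generalized table (values invented for illustration).
rows = (
    [{"age_group": "14-15", "class": "9", "answer": "no"}] * 4
    + [{"age_group": "16-17", "class": "11", "answer": "yes"}] * 5
)

k = smallest_anonymity_set(rows, ["age_group", "class"])
print(k)      # size of the smallest anonymity set
print(1 / k)  # chance of linking an entry in that set to one individual
```

With the two groups of four and five rows assumed here, the smallest anonymity set has four entries, which matches the 𝑘 = 4 example in the text.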
2.2 Differential Privacy Definition (DEF)
Differential privacy [4] bounds the amount of influence a single individual’s data can have on the output of a statistical computation over a database. A mechanism $\mathcal{M}$ is 𝜀-differentially private if, for any two neighboring data sets $D_1$ and $D_2$ differing in one individual, and for any set of statistical results computed over the data sets, $S \subseteq \mathrm{Range}(\mathcal{M})$, it satisfies:
$$P[\mathcal{M}(D_1) \in S] \le e^{\varepsilon} \, P[\mathcal{M}(D_2) \in S]. \quad (1)$$
The maximum distance between the probabilities of the mechanism returning the result with each database is less than a certain quantity. This quantity is based on the privacy parameter 𝜀. A privacy parameter closer to zero reduces the maximum distance, which means that the amount of influence any one individual’s data can have on the overall output is smaller. A smaller privacy parameter thus yields stronger privacy protection.
The privacy parameter 𝜀 translates into the factor by which the probability of returning any given result can exceed the probability of the same result if an individual is missing from the data set. For instance, with $\varepsilon = \ln 3$, and thus $e^{\ln 3} = 3$, the probability of returning any result is at most three times the probability of the same result if one individual is missing from the data set.
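To make the bound in Equation (1) concrete, the sketch below checks it for two output distributions of a mechanism over a finite set of results. The function name `satisfies_epsilon_dp` and the probabilities are assumptions chosen for illustration; for a finite output space it is enough to check each single result in both directions, since the bound for any set 𝑆 follows by summation.

```python
import math

def satisfies_epsilon_dp(dist_d1, dist_d2, epsilon, tol=1e-12):
    """Check Equation (1) for two output distributions (dicts mapping
    result -> probability) produced from two neighboring data sets."""
    bound = math.exp(epsilon)
    for result in set(dist_d1) | set(dist_d2):
        p1 = dist_d1.get(result, 0.0)
        p2 = dist_d2.get(result, 0.0)
        # The inequality must hold in both directions for neighboring data sets.
        if p1 > bound * p2 + tol or p2 > bound * p1 + tol:
            return False
    return True

# Hypothetical output probabilities with and without one individual.
with_individual = {"yes": 0.75, "no": 0.25}
without_individual = {"yes": 0.25, "no": 0.75}

print(satisfies_epsilon_dp(with_individual, without_individual, math.log(3)))
print(satisfies_epsilon_dp(with_individual, without_individual, math.log(2)))
```

In this assumed example the largest probability ratio is 3, so the check passes for 𝜀 = ln 3 and fails for the stricter 𝜀 = ln 2.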
2.3 Epsilon as Privacy Risk (RISK)
Lee and Clifton [15] proposed, for the Laplacian differential privacy mechanism, a way of calculating the risk of users in a data set being identified. In this framework, after an adversary receives an output of the differential privacy mechanism, she imagines every possible scenario for a distribution of all possible values for the individuals’ data that she does not already know. These scenarios are her so-called possible worlds. By comparing the probability of the mechanism returning the particular result for each possible world, the adversary decides which possible world is most likely to be true. The probability of the mechanism indicating the correct possible world when returning a result hence represents the users’ risk of being identified.
Mehner et al. [20] simplified the framework by assuming worst-case values for some variables, so that the risk 𝑝 of being identified in a data set depends only on 𝜀 and 𝑛:
$$p = \frac{1}{1 + e^{-\varepsilon}(n - 1)}, \quad (2)$$
where 𝑛 corresponds to the number of (unknown) possible worlds imagined by the adversary. Thus, the privacy parameter can be translated into a privacy risk in percent.
However, the number of possible worlds 𝑛 may be difficult to grasp. Moreover, 𝑛 depends on multiple often unspecified variables, such as the knowledge of the adversary, the number of individuals in the database and the number of possible values for an answer. According to Mehner et al. [20], assuming the worst-case attack scenario, an adversary might have only two possible worlds. For example, she may be uncertain about only one individual’s answer and there may be only two possible values for that answer. Accordingly, the worst-case value is 𝑛 = 2, resulting in the global privacy risk:
$$p = \frac{1}{1 + e^{-\varepsilon}}. \quad (3)$$
We can therefore translate the privacy parameter for a given 𝜀 into the privacy risk of identifying the true answers of individuals included in the database. In other words, if an adversary queries the answer of an individual and there are only two possible answer values (i.e., in a worst-case attack scenario), we can determine the probability of the mechanism indicating the true answer of the individual for a specified 𝜀. For example, assume we set 𝜀 = ln 3, which yields a privacy risk of 75 %, i.e., in the worst-case attack scenario, the true answer of a person included in the database is revealed with a probability of 75 %.
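The translation from 𝜀 to a privacy risk can be sketched in a few lines of Python. The function name `privacy_risk` is an assumption made for this illustration; the formula is Equation (2), with the worst case 𝑛 = 2 of Equation (3) as the default.

```python
import math

def privacy_risk(epsilon, n=2):
    """Privacy risk p = 1 / (1 + e^(-epsilon) * (n - 1)) from Equation (2).
    The default n = 2 is the worst case of Equation (3)."""
    return 1.0 / (1.0 + math.exp(-epsilon) * (n - 1))

for eps in (0.0, math.log(3), 3.0):
    print(f"epsilon = {eps:.3f} -> risk = {privacy_risk(eps):.1%}")
```

For 𝜀 = ln 3 this reproduces the 75 % of the example above. At 𝜀 = 0 the formula gives 50 %, which is the chance level when the adversary has two equally plausible possible worlds.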
2.4 Using Randomized Response (RRT)
The number of possible worlds 𝑛 of Equation (2) is similar to the number of different answers in the randomized response technique [30]. The randomized response technique is an approach designed to provide plausible deniability to data subjects. The idea is that some of the data subjects will give their true answer and others will give a forced answer. The decision of whether an individual gives a true or a forced answer is made randomly. Consequently, each answer has a probability of being an individual’s true answer. Therefore, users’ answers do not reveal the individuals’ true answers with certainty. The randomized response technique inherently holds the local differential privacy guarantee.
More precisely, with a probability of $p_{true}$, the true answer 𝑎 is stored in the database. The probability of any false answer $a' \neq a$ is $p_{false} = (1 - p_{true})/(d - 1)$, where 𝑑 is the number of possible answers. This mechanism is one approach of the randomized response, called unary encoding, and it satisfies local differential privacy:
$$P[\mathcal{M}(a) = a] \le e^{\varepsilon} \, P[\mathcal{M}(a) = a'] \quad (4)$$
$$p_{true} \, e^{-\varepsilon} = p_{false}, \quad (5)$$
resulting in
$$p_{true} = \frac{1}{1 + e^{-\varepsilon}(d - 1)}. \quad (6)$$
The probability of storing a true answer is equal to the privacy risk in Equation (2), where the number of possible worlds 𝑛 corresponds to the number of different answers 𝑑.
Hence, we can translate the privacy parameter 𝜀 into the probability with which the mechanism stores a true answer in the database. For example, assume we set 𝜀 = ln 3 and have two possible answers (𝑑 = 2). As a result, the probability of storing the true answer is 75 %. With a higher number of possible answers, e.g., 𝑑 = 28, the probability of storing the true answer decreases to 10 %. Note that the model also works for real-valued (continuous) data. In this case, the worst case with 𝑑 = 2 should be used. The result indicates the probability of storing true answers, regardless of whether the data is discrete or continuous.
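The mechanism described above can be simulated directly. The following sketch is an illustration under stated assumptions: the function names, the answer set, and the fixed random seed are not from the paper, and a false answer is drawn uniformly from the remaining 𝑑 − 1 answers, as implied by the definition of $p_{false}$.

```python
import math
import random

def p_true(epsilon, d):
    """Probability of storing the true answer, Equation (6)."""
    return 1.0 / (1.0 + math.exp(-epsilon) * (d - 1))

def randomized_response(true_answer, answers, epsilon, rng):
    """Store the true answer with probability p_true; otherwise store one of
    the other d - 1 answers, each with equal probability."""
    if rng.random() < p_true(epsilon, len(answers)):
        return true_answer
    others = [a for a in answers if a != true_answer]
    return rng.choice(others)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
answers = ["yes", "no"]
trials = 100_000
kept = sum(
    randomized_response("yes", answers, math.log(3), rng) == "yes"
    for _ in range(trials)
)
print(kept / trials)            # empirical share of true answers, close to 0.75
print(p_true(math.log(3), 28))  # about 0.1 for d = 28
```

The simulated share of stored true answers approaches $p_{true}$, and the ratio $p_{true}/p_{false}$ equals $e^{\varepsilon}$ by construction, which is the local differential privacy bound of Equation (4).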
3 METHODOLOGY
In this section, we present and justify the hypotheses we formulated to design our study. In addition, we detail how participants were instructed, describe our sample, and explain how we conducted the study and how we analyzed the data.
Syntactic anonymization models, such as 𝑘-anonymity, were originally designed for privacy-preserving data publishing [2]. Differential privacy, on the other hand, is more suitable for privacy-preserving data mining. The concept of privacy-preserving data publishing usually assumes a non-expert data publisher, i.e., the data publisher does not have the knowledge to perform data mining [9]. Given that 𝑘-anonymity is a viable solution for privacy-preserving data publishing, the mechanism of 𝑘-anonymity is aimed at the non-expert who is the end user of the model. With 𝑘-anonymity as a simple and intuitive model [8], we derive our first hypothesis:
(H1) Differential privacy vs. 𝑘-anonymity: The privacy protection provided by 𝑘-anonymity is easier to comprehend than the privacy protection provided by differential privacy (independent of the explanatory model).
The definition of differential privacy is complex. Therefore, it is important to describe the techniques or the implications of a differentially private mechanism. RRT has often been used as an intuitive mechanism [1, 26]. Previous work has shown that RRT leads to better understanding among users [1, 31]. RISK was developed as an intuitive explanation of 𝜀. Consequently, we derive the following hypothesis:
(H2) Explanatory models: The explanatory models RRT and RISK will provide a better comprehension of the privacy protection than the DEF model.
Previous work has shown that both numeracy skills and level of education affect risk understanding [7, 12]. Users with low numeracy skills have difficulty understanding risk in general [12]. These findings from previous work lead us to our final hypothesis:
(H3) Education level and numeracy skills: High levels of education and high numeracy skills help users to comprehend the privacy protection provided by differential privacy.
3.1 Measures
Participants answered questions to evaluate their subjective and
objective comprehension of privacy protection. We also included
measures for covariates: demographics, privacy concerns and nu-
meracy skills.
3.1.1 Comprehension. Similar to previous work [26, 31], we evaluate the subjective comprehension (perceived comprehension) and objective comprehension (actual comprehension) of the 𝑘-anonymity explanation and our explanatory models of the privacy protection provided by differential privacy (RISK, RRT, and DEF).
We designed the questions concerning comprehension from scratch, using direct questions. We included three 7-point Likert-scale questions regarding how the participants subjectively comprehended the level of privacy protection that each mechanism (and its respective privacy parameter) provided. Following the questions concerning subjective comprehension, there were four questions testing the participants’ objective comprehension of privacy protection. In addition, we gave the participants the possibility to comment on their comprehension answers. Last, we asked participants to directly compare which privacy mechanism they felt was most comprehensible and intuitive in terms of privacy protection.
[Figure 1: Overview of the study design. Stages shown: demographics (age, field of study, current level of education); scenario (statistics on drug use at school; parents should not infer their son’s/daughter’s drug use) with comprehension questions; privacy protection explanations (within-subject): differential privacy (between-subject: RISK, RRT, DEF) and 𝑘-anonymity, each followed by objective and subjective questions; direct comparison; numeracy, privacy experience, and privacy concerns; check question.]
3.1.2 Covariates. We assessed the participants’ numeracy skills using subjective rating and objective test questions. The numeracy questions were taken from multiple validated numeracy assessments found in the literature [6, 17, 25]. Moreover, we asked about any previous experience with privacy mechanisms in general and differential privacy in particular. Finally, we also assessed the participants’ general privacy concerns using a set of questions adapted from Malhotra et al. [19]. We used these questions in the categories of collection and awareness. We also included “attention” check questions as part of the privacy aptitude and at the end of the study to exclude inattentive participants: “Please select 3 (More or less agree) for this question” and “What is 4+5?”.¹ We assume that those participants who were motivated at the end of the survey were also motivated at the beginning.
3.2 Scenario and Explanations
We defined a fictional scenario about drug use at school as a running
example of a setting where privacy is crucial and where privacy
protection needs to be well understood at the same time. A school
stores student answers to a questionnaire on drug use in a database
grouped by age and class. In order to raise awareness, parents can
query the database, which is protected with a privacy mechanism.
Our explanations are designed from scratch. We used text-based
explanations because we focused on evaluation of the explanatory
model, not on how it was communicated. Our explanations start
with a short description of the privacy mechanism, inspired by
the Techniques description of Cummings et al. [
3
]. This was fol-
lowed by an explanation of the privacy protection parameter, e.g.
𝑘
, difference (
DEF
), risk (
RISK
) and probability (
RRT
). Finally, we
applied these explanations to our scenario and provided concrete
examples. The exact wording of our explanations can be found in
Appendix A.1.
3.3 Experimental Process
Prior to the main study, we conducted a pilot study to increase the validity of our study questions. In particular, the pilot study allowed us to validate our questions, explanations and instructions in terms of textual clarity and general comprehensibility. We summarize the results of the pilot study and the induced changes in Section 4.²
¹ We believe that this mathematical operation does not relate to numeracy because of its simplicity. When answered, the question was answered correctly by all participants of our main study.
² The explanations, questions on subjective and objective understanding and anonymized tables can be found in http://arxiv.org/abs/2404.04006.
3.3.1 Overview of the Study Design. In Figure 1 we present an overview of our study design and procedure. The process and design of the main study and the pilot study were the same. Both studies had a mixed design with a between-subject factor “explanatory model” (for differential privacy protection, with three conditions: RISK, RRT, and DEF) and a within-subject factor with two levels (“privacy protection provided by 𝑘-anonymity” and “privacy protection of differential privacy”). The within-subject factor included in our study allowed us to evaluate the comprehensibility of differential privacy protection with each explanatory model compared to the comprehensibility of the privacy protection of 𝑘-anonymity. As a result, we were able to
(1) verify whether the privacy protection of 𝑘-anonymity is indeed easier to comprehend than that of differential privacy,
(2) anchor the comprehensibility of differential privacy protection with each explanatory model to the comprehensibility of privacy protection of 𝑘-anonymity as a baseline for the best possible comprehensibility,
(3) control for any interindividual differences in comprehension skills between the three conditions.
Moreover, use of a within-subject design reduced the standard deviation in the objective and subjective comprehension scores, improving the statistical validity of our study.
After a short welcome text explaining the purpose of the study,
the participants were asked to provide some demographic infor-
mation about themselves (age, field of study and current level of
education). Next, we introduced our fictional scenario. We ensured
that the participants understood the scenario by asking three check
questions: 1) Who provides the database in the scenario? 2) What
kind of data is stored in the database? 3) Eve (the adversary) wants
to find out the data of whom?
3.3.2 Procedure of Explanations. After ensuring that the participants had read and understood the scenario, each participant was presented with explanations of the privacy protection of two privacy-enhancing mechanisms: an explanation of 𝑘-anonymity and an explanation of differential privacy. To control for learning and other sequence effects, the order of the two explanations and their respective comprehension questions was balanced. In other words, participants were randomly assigned to either the first order group, where the explanation and questions for 𝑘-anonymity were presented first, or to the second order group, where the explanation and questions of the differential privacy protection were presented first. Since each participant read and answered the questions for the two explanations, our study had a within-subject factor with the privacy protection of differential privacy and the privacy protection of 𝑘-anonymity as factor levels.
The explanation of the privacy protection of 𝑘-anonymity was the same across all conditions. Each participant randomly (uniformly distributed) received one of the three explanations (RISK, RRT, or DEF) for differential privacy protection, resulting in three between-subject conditions for the factor “explanatory model for differential privacy protection”. We used similar phrasing and wording in all explanations, including the explanation of the privacy protection of 𝑘-anonymity, in order to compare the comprehension of the explanation. In addition, the subjective as well as the objective comprehension questions were identical for each explanation.
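The two random assignments described above (explanation order and explanatory model) can be sketched as follows. How the randomization was actually configured in the survey tool is not stated, so the function `assign_conditions`, the labels, and the seed are purely illustrative assumptions.

```python
import random

MODELS = ["RISK", "RRT", "DEF"]                              # between-subject factor
ORDERS = ["k-anonymity first", "differential privacy first"]  # presentation order

def assign_conditions(rng):
    """Draw one explanatory model and one presentation order,
    each uniformly at random and independently of the other."""
    return {"model": rng.choice(MODELS), "order": rng.choice(ORDERS)}

rng = random.Random(42)  # seed chosen only to make the sketch reproducible
assignments = [assign_conditions(rng) for _ in range(6)]
for participant_id, assignment in enumerate(assignments, start=1):
    print(participant_id, assignment["model"], assignment["order"])
```

Independent uniform draws do not guarantee equal group sizes, which is consistent with the unequal numbers of analyzed participants per condition reported in Section 3.4.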
The level of privacy protection provided by the differential privacy mechanism, i.e., the privacy parameter 𝜀, was the same in each explanatory model. We wanted to rule out the possibility of the level of privacy protection systematically interfering with the participants’ comprehension of differential privacy protection. However, differential privacy assumes a stronger adversary than 𝑘-anonymity does. A 𝑘-anonymity mechanism cannot provide privacy protection as strong as that of the differential privacy mechanism explained using the RISK, RRT, and DEF explanatory models in our scenario. Therefore, we have to trust that the weaker privacy protection did not interfere with the participants’ comprehension. Consequently, in our study, we explain the privacy protection of 𝑘-anonymity with 𝑘 = 4. We believe that this is an appropriate value to explain the privacy protection of 𝑘-anonymity since this results in a probability of being identified of 0.25. Again, we emphasize that we cannot match the privacy levels of the two mechanisms.
3.3.3 Procedure after Explanations. After providing both the expla-
nations and the questions about their comprehensibility, we asked
participants directly about which privacy mechanism (if any) was
more comprehensible with respect to the level of privacy protection
and why. We also asked which mechanism (if any) they regarded
as providing a greater privacy protection in the particular scenario,
and why. The latter question was implemented to gain a deeper
insight into whether the participants had gained a sense of the rela-
tionship between a particular privacy parameter and the respective
level of privacy protection provided by each mechanism.
3.4 Participant Recruitment and Attributes
Both the pilot study and the main study were implemented using LimeSurvey³ and emailed to university students in Berlin.⁴ Our main study was publicly available between February 8 and 22, 2023. Participation was voluntary and we did not offer any remuneration.
We used a set of questions provided by the Ethics Commission of TU Berlin to self-evaluate the ethical considerations of the planned research project. We then decided that a detailed application to the Ethics Committee was not necessary. However, to address potential ethical issues, we informed the participants (of the pilot study and main study) about our data policies in our invitation email before the survey: The evaluation of the responses would be anonymized, i.e., we only used the LimeSurvey Response ID as an identifier and would remove it before the statistical analysis. We only accessed the results of the pilot study that were necessary to validate our explanations and questions.
³ www.limesurvey.org
⁴ We cannot exclude the possibility that some participants took part in both studies. However, the pilot study took place one year earlier, so we assume that the effect is negligible. In addition, participants were asked about their prior knowledge of privacy, so the overlap was controlled in the results of participants without any prior knowledge.
For the participants, the purpose of our study was to evaluate
explanations of the privacy protection provided by two privacy-
enhancing mechanisms. At that point, we did not refer to differential
privacy protection as the focus of our study to avoid the influence
of demand characteristics or participant expectations about our
desired outcome of the study. All participants were presumed to
have at least a high school diploma and to be currently studying at
a university.
There were a total of 249 respondents in the main study. Of these, only 93 participants answered the subjective and objective comprehension questions for both explanations and could therefore be included in the analysis. Of these, three participants were excluded because they gave an incorrect answer to one of the comprehension questions regarding the scenario or because they answered one of the attention-check questions incorrectly, resulting in a total of 90 analyzed participants. Of these, 78 participants fully completed the study and thus answered all questions. We decided to nevertheless include the other 12 participants who did not finish the study in parts of our analysis to increase the statistical power of our study and to reduce motivation bias. In conclusion, 90 participants were included in analyses involving the objective and subjective comprehension, and 78 participants were included in all our analyses, including those concerning the direct comparison and those involving the participants’ privacy concerns or numeracy skills.
Consequently, we included 90 submissions in the analysis: 27 for RISK, 30 for RRT, and 33 for DEF. Of these participants, 66 indicated a “STEM” study field of science, technology, engineering, or mathematics (28 were students of computer science/engineering). Five students indicated a study field of management or economics, eight students indicated a study field related to architecture or design, and 11 students indicated a study field of social sciences or psychology. The age of the participants ranged from 18 to 40 years with a mean age of approximately 25.03 and a median age of 24. The level of education was high overall, with 73 participants having a bachelor’s degree or higher. Of these, 13 participants stated that they had a master’s degree. These 90 participants spent an average of 34.8 minutes on the study.
3.5 Data Set Pre-processing and Analysis
Each participant received a subjective comprehension score and an objective comprehension score, both between 0 and 1, corresponding to “very poor” and “very good” comprehension, respectively. To obtain the subjective comprehension score, we calculated the mean score of the three subjective comprehension questions for each participant. In doing so, we inverted the score of the first question so that for every question a higher score indicated greater comprehension. We then normalized the scores to a range from 0 to 1. To measure objective comprehension, we scored each correct and incorrect answer as 1 and 0, respectively. We calculated the mean of the four objective comprehension questions for each participant and normalized the objective score to be between 0 and 1.
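The scoring described above can be sketched as follows. The exact coding is not given in the text, so this sketch assumes that Likert answers are coded 1 to 7, that the first item is inverted as 8 minus the rating, and that the mean is mapped linearly onto the range from 0 to 1; the function names are likewise hypothetical.

```python
def subjective_score(ratings):
    """ratings: three answers on an assumed 1-7 Likert coding.
    The first item is reverse-coded so that higher always means better."""
    first, *rest = ratings
    adjusted = [8 - first, *rest]          # assumed inversion: 1 <-> 7
    mean = sum(adjusted) / len(adjusted)
    return (mean - 1) / 6                  # map the range 1..7 onto 0..1

def objective_score(correct):
    """correct: four booleans, one per objective question (1 point each)."""
    return sum(1 for c in correct if c) / len(correct)

print(subjective_score([1, 7, 7]))                  # best possible rating
print(subjective_score([7, 1, 1]))                  # worst possible rating
print(objective_score([True, True, False, True]))   # three of four correct
```

Because every objective answer is already 0 or 1, the mean over the four questions lies between 0 and 1 without any further rescaling.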
In order to avoid the influence of baseline differences in the
participants’ comprehension abilities between the conditions, we
calculated the differences between the comprehension scores for
the privacy protection of 𝑘-anonymity and differential privacy for
each participant. In other words, we subtracted the mean scores for
the objective and the subjective comprehension of the differential
privacy explanation from the mean scores of the 𝑘-anonymity
explanation model. A positive difference means that the 𝑘-anonymity
explanation had a higher score, and was thus easier to comprehend.
A negative difference, on the other hand, means that the differential
privacy explanation had a higher score. If the difference is 0, the
scores indicate a similar level of comprehension.
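As a brief sketch of this difference score and its interpretation, consider the following Python functions. The function names and the tolerance used to treat a difference as zero are illustrative assumptions, not part of the study.

```python
def comprehension_difference(k_anon_score, dp_score):
    """Per-participant difference: k-anonymity score minus differential privacy score."""
    return k_anon_score - dp_score


def interpret_difference(diff, tol=1e-9):
    """Map the sign of the difference to the interpretation given in the text."""
    if diff > tol:
        return "k-anonymity explanation easier to comprehend"
    if diff < -tol:
        return "differential privacy explanation easier to comprehend"
    return "similar level of comprehension"
```

With hypothetical scores of 0.75 for 𝑘-anonymity and 0.50 for differential privacy, the difference is 0.25, which falls into the first case.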
We tested for differences between the subjective comprehension
scores for the two explanations for each condition using one-tailed
t-tests. Furthermore, we examined the effect of the explanatory model
for differential privacy protection on the differences between the
comprehension scores for the two explanations. We also examined
the effect of the order of explanations, i.e., whether participants had
read the explanation and answered the corresponding comprehension
questions for privacy protection of 𝑘-anonymity or for that
of differential privacy first. To that end, we conducted two-way
analyses of variance (ANOVAs) with two between-subject factors
of explanatory model and order. We conducted one ANOVA for the
differences in the subjective comprehension scores and one for the
differences in the objective comprehension scores. The ANOVAs
also allowed us to investigate any interaction effects between the
two factors (explanatory model and order).
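To make the t-test step concrete, the sketch below computes a t statistic and an effect size for one condition. It rests on assumptions that this section does not confirm: the test is treated as a paired test on per-participant differences, and the effect size is Cohen’s d for paired data (mean difference divided by the standard deviation of the differences). The function name is hypothetical, and the p-value, which would be obtained from a t distribution with n − 1 degrees of freedom, is not computed here.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_and_d(k_anon_scores, dp_scores):
    """Return (t statistic, Cohen's d for paired data, degrees of freedom).

    Assumes both lists hold scores of the same participants in the same order.
    """
    diffs = [k - d for k, d in zip(k_anon_scores, dp_scores)]
    n = len(diffs)
    sd = stdev(diffs)                      # sample standard deviation
    t_stat = mean(diffs) / (sd / sqrt(n))  # one-sample t on the differences
    d_z = mean(diffs) / sd                 # standardized mean difference
    return t_stat, d_z, n - 1
```

A positive t statistic corresponds to higher scores for 𝑘-anonymity, which is the direction tested by the one-tailed hypothesis.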
We also wanted to analyze how the participants’ comprehension
of differential privacy protection was influenced by privacy con-
cerns and numeracy skills. For that we included only participants
who had completed the whole study, as the questions regarding
privacy concerns and numeracy skills were asked at the end of the
study. To test (H3), we calculated a privacy concern score as well
as a subjective numeracy score and an objective numeracy score
for each participant. We used Pearson’s correlation coefficient to
measure the correlations.
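For reference, Pearson’s correlation coefficient can be computed as in the following minimal sketch. The function name is illustrative, and the significance test for the coefficient is omitted.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

Values close to 1 or −1 indicate a strong linear relationship, and values close to 0 indicate none.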
In the final data set, each entry contains the following values: an
explanation group {1-3}, a mean subjective comprehension score
for 𝑘-anonymity [0-1], an objective comprehension score for
𝑘-anonymity [0-1], a mean subjective comprehension score for
differential privacy [0-1], an objective comprehension score for
differential privacy [0-1], a comparison regarding comprehensibility
{𝑘-anonymity, differential privacy, both}, a comparison regarding
prevention {𝑘-anonymity, differential privacy, both}, education {high
school, bachelor, master}, a privacy concerns score [1-7], a subjective
numeracy score [0-2] and an objective numeracy score [0-8].
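One possible way to represent such an entry is sketched below as a Python data class with range checks. The field names are hypothetical; only the value ranges and categories are taken from the description above.

```python
from dataclasses import dataclass

COMPARISON = ("k-anonymity", "differential privacy", "both")
EDUCATION = ("high school", "bachelor", "master")


@dataclass(frozen=True)
class StudyEntry:
    explanation_group: int     # {1-3}
    subj_k_anon: float         # [0-1]
    obj_k_anon: float          # [0-1]
    subj_dp: float             # [0-1]
    obj_dp: float              # [0-1]
    more_comprehensible: str   # one of COMPARISON
    better_prevention: str     # one of COMPARISON
    education: str             # one of EDUCATION
    privacy_concerns: float    # [1-7]
    subj_numeracy: float       # [0-2]
    obj_numeracy: float        # [0-8]

    def __post_init__(self):
        if self.explanation_group not in (1, 2, 3):
            raise ValueError("explanation_group must be 1, 2, or 3")
        bounds = {
            "subj_k_anon": (0, 1), "obj_k_anon": (0, 1),
            "subj_dp": (0, 1), "obj_dp": (0, 1),
            "privacy_concerns": (1, 7),
            "subj_numeracy": (0, 2), "obj_numeracy": (0, 8),
        }
        for name, (low, high) in bounds.items():
            if not low <= getattr(self, name) <= high:
                raise ValueError(f"{name} must lie in [{low}, {high}]")
        for name in ("more_comprehensible", "better_prevention"):
            if getattr(self, name) not in COMPARISON:
                raise ValueError(f"{name} must be one of {COMPARISON}")
        if self.education not in EDUCATION:
            raise ValueError(f"education must be one of {EDUCATION}")
```

Validating the ranges at construction time makes out-of-range values visible before any statistics are computed.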
3.6 Limitations
Our study was conducted with students from German universities
only. Therefore, our results cannot be transferred to the general
public. To allow more generic inferences, future work should test a
more heterogeneous sample. Also, the number of participants was
limited and many participants aborted the study. As a result, the
statistical power might simply not have been sufficient to detect all
effects. Therefore, future work should investigate a higher number
of participants, or should execute a power analysis, using our effect
sizes as a basis. To increase the statistical power of our study and
to reduce motivation bias, we indeed included participants who
answered questions concerning subjective and objective compre-
hension for both explanations, but did not fully complete the study.
However, these responses were only used in certain parts of our
analysis. Although these participants answered the scenario-check
question, they did not answer our attention- and math-check ques-
tions at the end of our study. Therefore, we cannot be certain that
those participants answered our study attentively.
Furthermore, we compared the comprehensibility of different
privacy mechanisms, which also provided different levels of privacy
protection. Therefore, we cannot rule out that this difference had an
effect on our results. However, it is unlikely as we did not compare
the privacy protection provided by the mechanisms but only the
participants’ comprehension.
Most importantly, our objective comprehension questions may
have been inherently easier to answer for one of the mechanisms’
privacy protection or for one of the explanatory models for differ-
ential privacy protection. Even though we conducted a pilot study
to validate our scenario, our explanations, and our comprehension
questions in terms of textual clarity and general comprehensibility,
we had to apply some adjustments to the objective comprehension
questions. As a result, the modified questions were not validated
before the main study. Therefore, the objective comprehension
questions may have inherently favored one of the explanations.
4 PILOT STUDY
The primary goal of the pilot study was to evaluate our study ques-
tions with respect to ambiguity, difficulty, and internal consistency.
Furthermore, the pilot study allowed us to refine the wording of
the questions, explanations, and study instructions. Moreover, it
allowed us to consider and address the comments provided by the
participants. These comments were read by one author and selected
if they included feedback about the instructions.⁵
We had the following findings from our pilot study. The majority
of incorrect values for our check question were very close to the
correct value. Hence, some of the answers may have been incorrect
due to mathematical difficulties rather than inattentiveness.
Therefore, we replaced the “attention” check question (What is 15 + 7?)
with a simpler calculation to accommodate less mathematically able
participants: What is 4 + 5?
All of our explanations were generally understandable, indicated
by high mean comprehension scores for all explanations (min = 0.62,
max = 0.92). In the main study, we also recorded the order
of explanations for 𝑘-anonymity and differential privacy, to derive
any possible relationship between the comprehension and the order
of the explanations.
The means of the subjective scores were similar across all condi-
tions. Therefore, we only aligned the wording so that all questions
explicitly enquired about the level of privacy protection instead of
about the mechanism itself.
We removed one objective comprehension question since this
question was answered correctly by only very few participants
compared to the other questions. In addition, we modified objective
questions #1–#3 to maintain consistency in wording, by adding
the concept of the privacy parameter in the explanations and in
question #2. In question #3, we asked for the implications when the
privacy parameter is 0.
⁵ The comments were solely intended to improve the wording of the questions, the
explanations, and the study instructions, and were not primary research artifacts.
Therefore, we did not use any further statistical methods for these entries.
Table 1: Comprehension scores for the explanatory models
            𝑘-anonymity             differential privacy
       N    subjective  objective   subjective  objective
RISK   27   0.78        0.77        0.64        0.51
RRT    30   0.70        0.73        0.65        0.50
DEF    33   0.78        0.76        0.45        0.43
The comments overall suggested a high level of comprehension
of privacy protection for both mechanisms. We also confirmed that
the privacy protection of 𝑘-anonymity was indeed regarded as
inherently easier to comprehend, and 𝑘-anonymity as a less complex
privacy mechanism. The comments led us to shorten the explanations as
much as possible and to explain that the aim of the study focused
on the comprehension of privacy protection rather than of the
mechanisms themselves. The comments also led us to focus more
on the link between the privacy parameter and the level of privacy
protection provided, and less on how the mechanisms work.
Many participants were confused about the phrase “random
noise” in the explanation used in our pilot study, and the specified
results of the differential privacy mechanism. We decided to avoid
mathematical vocabulary wherever possible, to completely exclude
the concept of random noise from the explanations and to omit any
specific calculations or results returned by the mechanism. Instead,
we emphasized the probabilities and described that differential
privacy provides privacy protection by randomly modifying statistical
results. We also received indications that it would be helpful to
include the information in our scenario that Eve, the adversary,
knows that the returned result may not be correct. Furthermore,
we changed the wording of the sample database answers from
“true/false” to “yes/no”, because the former seemed to be confusing
in light of the “true answers of the students”.
5 RESULTS
In the following, we describe the results of our main study.
Throughout our analysis, we use a significance level of 𝛼 = 0.05 and adjust
the results of the post-hoc t-tests with Bonferroni corrections.⁶
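The Bonferroni adjustment mentioned above can be sketched in a few lines of Python. The function names and the p-values in the usage note are hypothetical and serve only to illustrate the procedure.

```python
def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level for a family of n_tests comparisons."""
    return alpha / n_tests


def significant_after_bonferroni(p_values, alpha=0.05):
    """Flag each p-value that stays below the adjusted per-test level."""
    threshold = bonferroni_alpha(alpha, len(p_values))
    return [p < threshold for p in p_values]
```

For three tests at 𝛼 = 0.05, the per-test level is 0.05/3 ≈ 0.017, so hypothetical p-values of 0.001, 0.03, and 0.2 would yield one significant result and two non-significant ones.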
5.1 Subjective Comprehension
The mean scores of subjective comprehension in Table 1 show that
across all explanatory models for differential privacy protection, the
level of privacy protection resulting from 𝑘-anonymity was easier to
comprehend than that resulting from differential privacy. One-tailed
t-tests revealed a significant difference in comprehensibility, with
higher scores for the privacy protection of 𝑘-anonymity, using the
RISK model (𝑡 ≈ 4.522, 𝑝 < 0.001, Cohen’s 𝑑 ≈ 0.762) and using the
DEF model (𝑡 ≈ 7.586, 𝑝 < 0.001, Cohen’s 𝑑 ≈ 1.749). These results
support (H1). In contrast to (H1), when we used the RRT model,
the difference in comprehension between the privacy protection
of 𝑘-anonymity and differential privacy was not significant for
subjective comprehension (𝑡 ≈ 1.27, 𝑝 ≈ 0.107).
⁶ The false positive error grows with the number of tests performed. A common
approach to deal with this is the Bonferroni correction, which maintains the
significance level 𝛼 for the entire set of 𝑛 comparisons by testing each
individual comparison at the adjusted level 𝛼/𝑛. For example, if we have a set of
three hypothesis tests and 𝛼 = 0.05, our adjusted significance level equals
0.05/3 ≈ 0.017.
Figure 2: Differences between scores for comprehension of
𝑘-anonymity and differential privacy. Panels: (a) Subjective,
(b) Objective; x-axis: explanatory model (RISK, RRT, DEF); y-axis:
mean difference; grouped by order of explanations: k-anon, dp (1)
and dp, k-anon (2).
The subjective comprehension scores regarding differential privacy
protection were higher when using the RISK model than the
DEF model, and slightly higher still when using the RRT model
(cf. Table 1). We present the difference between the mean subjective
scores of 𝑘-anonymity and the mean subjective scores of differential
privacy in Figure 2a. In RRT the differences were distributed around
zero, with a median of zero, whereas the interquartile range and
median in RISK were above zero, but lower than in DEF.
We tested the distribution of the differences with RISK, RRT, and
DEF for each order of explanations with D’Agostino’s K-squared
tests. There was no indication of the differences not being normally
distributed for any of the conditions. Furthermore, Levene’s test
did not show any significant differences of the variances between
any of the conditions, indicating equality of variances. Hence, all
requirements for conducting a between-subject two-way ANOVA
were met. The ANOVA revealed a significant effect of the explanatory
model on the difference between the subjective comprehension
scores for the privacy protection of 𝑘-anonymity and differential
privacy, with 𝐹 ≈ 12.979, 𝑝 < 0.001 and 𝜂² ≈ 0.234. There was no
significant effect of the order of explanations on the differences.
Furthermore, there was no significant interaction between explanatory
model and order of explanations.
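To illustrate what the reported effect size expresses, the sketch below computes the classical 𝜂² of a single factor as the share of the total sum of squares explained by that factor. This is a simplification under stated assumptions: it ignores the second factor (order), and it may differ from the exact 𝜂² variant used in the analysis. The function name and the example values are hypothetical.

```python
from statistics import mean


def eta_squared(groups):
    """Classical eta squared of one factor.

    groups maps each factor level (e.g., "RISK") to the list of
    difference scores observed under that level.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_effect = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_effect / ss_total
```

A value of 0 means that the factor levels have identical means, and values closer to 1 mean that most of the variance in the differences is attributable to the factor.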
One-tailed t-tests showed that in comparison to the DEF model,
the mean difference was significantly smaller with the RISK model
(𝑡 ≈ −3.342, 𝑝 < 0.001, Cohen’s 𝑑 ≈ 0.809) as well as with the
RRT model (𝑡 ≈ −4.624, 𝑝 < 0.001, Cohen’s 𝑑 ≈ 1.166). Moreover, the
difference in comprehensibility was smaller when we used the RRT
model than when we used the RISK model (𝑡 ≈ −1.725, 𝑝 ≈ 0.045,
Cohen’s 𝑑 ≈ 0.458). These results corroborate (H2), where we
hypothesized that the comprehension of the RISK and the RRT model
is higher than of the DEF model. For the RRT model, participants
achieved subjective comprehension scores comparable to those for
the privacy protection of 𝑘-anonymity, a privacy mechanism that
is supposedly less complex and whose privacy protection is more
intuitively understandable.
Figure 3: Proportion of correctly answered questions on objective
comprehension. Panels: (a) 𝑘-anonymity, (b) Differential privacy
(by explanatory model: RISK, RRT, DEF); x-axis: correct objective
answers (0 to 4); y-axis: number of participants.
5.2 Objective Comprehension
From Table 1, we infer that the objective comprehension scores for
𝑘-anonymity were higher than for the differential privacy
explanation. The scores for the objective comprehension of the privacy
protection of differential privacy were generally low for all
explanations. With a mean of around 0.5, the scores correspond to the
expected success rate by randomly guessing the answers.
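The guessing baseline can be made explicit with a short sketch. It assumes that each of the four objective questions is guessed correctly with probability 0.5, which is consistent with the expected success rate stated above; the number of answer options per question is not given in this section, and the function names are hypothetical.

```python
from math import comb


def guess_distribution(n_questions=4, p_correct=0.5):
    """Probability of 0..n_questions correct answers under pure guessing."""
    return [
        comb(n_questions, k) * p_correct**k * (1 - p_correct) ** (n_questions - k)
        for k in range(n_questions + 1)
    ]


def expected_guess_score(n_questions=4, p_correct=0.5):
    """Expected normalized objective score (between 0 and 1) under guessing."""
    dist = guess_distribution(n_questions, p_correct)
    return sum(k * p for k, p in enumerate(dist)) / n_questions
```

Under these assumptions, the expected score is 0.5, and a guessing participant answers all four questions correctly with probability 1/16.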
In Figure 3, we show the number of correct objective answers for
each explanatory model. For 𝑘-anonymity, all participants had at
least one correct answer (cf. Figure 3a). As illustrated in Figure 3b,
three participants answered all objective comprehension questions
on differential privacy incorrectly. All three participants were part
of the DEF group. Notably, none of the participants answered all the
questions correctly. One-tailed t-tests revealed a significant
difference in comprehensibility, with higher scores for the privacy
protection of 𝑘-anonymity, with RISK (𝑡 ≈ 6.31, 𝑝 < 0.001,
Cohen’s 𝑑 ≈ 1.72), with RRT (𝑡 ≈ 6.18, 𝑝 < 0.001, Cohen’s 𝑑 ≈ 1.59)
and with DEF (𝑡 ≈ 6.62, 𝑝 < 0.001, Cohen’s 𝑑 ≈ 1.51). These results
confirm (H1).
In Figure 2b, we plot the mean differences of scores between
𝑘-anonymity and differential privacy. The mean difference between
the objective comprehension of privacy protection provided by
𝑘-anonymity and differential privacy was smallest with the RRT
model, followed by the RISK model, and last the DEF model. With
the RISK and the RRT model the difference was smaller for the
second order group, i.e., where the differential privacy explanation
was shown first.
All requirements for conducting a between-subject two-way
ANOVA were met. In particular, D’Agostino’s K-squared tests for
normality did not reveal any significant deviation of the differences
from a normal distribution for any of the three conditions.
Also, Levene’s test indicated equality of variances between the
conditions. The ANOVA did not reveal any significant effect of the
explanatory model on the differences in objective comprehension
(𝐹 ≈ 1.245, 𝑝 ≈ 0.293). There was no indication of a significant
effect of the order of the explanations or a significant interaction
of the explanatory model and the order. Figure 2b shows that the
interquartile range of the differences was lower with RISK and RRT
than with DEF. For these reasons, we conducted post-hoc one-tailed
t-tests to investigate whether the mean differences with the RISK
model and the RRT model for differential privacy protection differed
significantly from the DEF model. The results indicated that
the mean difference was smaller with the RISK model than with
the DEF model (𝑡 ≈ −1.008, 𝑝 ≈ 0.159, Cohen’s 𝑑 ≈ 0.262). There
was a tendency for the mean difference to be smaller with the RRT
model than with the DEF model (𝑡 ≈ −1.467, 𝑝 ≈ 0.074).
Figure 4: Comparison regarding comprehensibility and privacy
prevention. Panels: (a) Comprehensible, (b) Prevent; x-axis:
explanatory model (RISK, RRT, DEF); y-axis: number of participants;
answer categories: k-anonymity, both equally, differential privacy.
These results support (H2), suggesting that the users’ objective
comprehension concerning the privacy protection provided by
differential privacy is enhanced through the RISK model and may be
enhanced through the RRT model.
5.3 Comparison of 𝑘-Anonymity and Differential Privacy
We present the answers to the direct comparison between the
differential privacy mechanism and the 𝑘-anonymity mechanism in
Figure 4. Few participants rated the level of privacy protection
and thus the implications of differential privacy as more
comprehensible than that of 𝑘-anonymity (see Figure 4a). In DEF, nobody
rated the differential privacy explanation more comprehensible
than 𝑘-anonymity. These results support (H1), which states that the
privacy protection of 𝑘-anonymity is easier to comprehend than
the privacy protection of differential privacy.
The overall answer about which mechanism was better at
preventing a data breach was in favor of 𝑘-anonymity with the RISK
model and the RRT model (cf. Figure 4b). In DEF, 11 participants said
that both were equally good at preventing a data breach.
5.4 Effects of Level of Education and Numeracy Skills
In Figure 5a, we show that objective comprehension scores increase
with higher levels of education across all explanatory models, with
𝑟 ≈ 0.285 and 𝑝 ≈ 0.008. Especially for the DEF model, a higher level
of education is associated with a higher objective comprehension
score. Thus, we can confirm (H3), which states that a high level
of education helps users to comprehend the privacy protection of
differential privacy.
Our participants’ objective and subjective numeracy skills were
high overall; we had no participants with a score below 0.9 (see
Figure 5b). The participants’ subjective numeracy skills also correlated
positively with their subjective comprehension scores for differential
privacy protection, with 𝑟 ≈ 0.299, 𝑝 ≈ 0.008. We did not find
any correlation between the objective numeracy skills and objective
comprehension score. Regarding the subjective comprehension, we
can confirm (H3). We did not find any other correlations between
the objective or subjective comprehension scores and the level of
privacy concerns.
Figure 5: Correlations of comprehension. Panels: (a) Education level
(x-axis: Highschool, Bachelor, Master; y-axis: objective-dp; by
explanatory model: RISK, RRT, DEF), (b) Subjective numeracy (x-axis:
subjective numeracy score; y-axis: mean-subjective-dp).
5.5 Exclusion of Knowledgeable Participants
Prior knowledge of privacy definitions may have influenced the
study results.⁷ Therefore, we reran our analysis, excluding all
participants who indicated that they were already aware of one or
more privacy mechanisms, particularly differential privacy. This
resulted in only 56 participants who completed the study, so the
results are limited in their power. However, most findings remained
unchanged after these adaptations.
The privacy protection provided by 𝑘-anonymity was subjectively
rated as significantly more easily understood than the privacy
protection provided by differential privacy among respondents in
the RISK and in the DEF group, but not in the RRT group.
The mean difference between the subjective comprehension
scores of 𝑘-anonymity and differential privacy was again smallest
with the RRT model, followed by the RISK model. There was still
a significant effect of the explanatory model on the differences in
the subjective comprehension scores (𝐹 ≈ 8.814, 𝑝 < 0.001,
𝜂² ≈ 0.2570) and no significant effect from the order of the explanations
or from the interaction between the explanatory model and the
order of the explanations. One-tailed t-tests indicated that the mean
difference between the subjective comprehension scores for the
privacy protection provided by 𝑘-anonymity and differential privacy
was significantly smaller when using the RRT model compared to
the DEF model.
The scores regarding the objective comprehension of differential
privacy protection were highest with the RISK model, followed by the
RRT model. However, the mean difference between the privacy
protection of 𝑘-anonymity and of differential privacy was now
smallest with the RRT model. There was still no significant effect
on differences in the objective comprehension scores.
⁷ We could not exclude the possibility of an overlap of participants who took part in
the pilot study and the main study. However, by excluding participants with prior
knowledge of privacy, we controlled for overlap and thus verified that the main results
would remain valid.
The findings regarding the correlations were similar to the
findings when knowledgeable participants were included. There was
again a significant positive correlation between subjective
comprehension scores and subjective numeracy skills (𝑟 ≈ 0.349, 𝑝 ≈ 0.008).
The level of education also indicated a positive direction to the
objective comprehension (𝑟 ≈ 0.251, 𝑝 ≈ 0.067). Remarkably, we
found a significant positive correlation between the objective and
subjective comprehension for 𝑘-anonymity (𝑟 ≈ 0.361, 𝑝 ≈ 0.006).
Again, we did not find any other expected correlations.
5.6 Summary of Findings
The following points represent our main findings:
• The privacy protection of 𝑘-anonymity was only rated as
  significantly more easily understood subjectively than the
  privacy protection of differential privacy with the RISK and
  the DEF models.
• The privacy protection of 𝑘-anonymity is objectively easier
  to comprehend than the privacy protection of differential
  privacy (independent of the explanatory model).
• The RISK and RRT models significantly enhanced the subjective
  comprehensibility of differential privacy protection to
  a greater extent than the DEF model.
• The objective comprehension of differential privacy protection
  is enhanced by the RISK model and may be enhanced
  through the RRT model.
• We find a positive correlation between the level of education
  and objective comprehension.
• Participants with high subjective numeracy skills also had
  high subjective comprehension scores. Moreover, participants
  with high objective numeracy skills also had higher
  objective comprehension scores.
6 DISCUSSION
Our study was motivated by investigating various models attempting
to explain privacy protection of differential privacy. We
compared the comprehensibility of three different explanatory models
with the privacy protection of 𝑘-anonymity as a baseline for
comprehensibility. In the following section, we put our results into
a larger context. One especially salient aspect is the suitability
of our explanatory models for understanding differential privacy
protection. We also reflect on the differences in how users
understand privacy protection of differential privacy compared to the
protection of 𝑘-anonymity. Finally, we present our thoughts on the
influence of the level of education and numeracy skills.
The different explanatory models support the comprehension of
privacy protection to varying extents. The subjective comprehension is
significantly enhanced by the RRT and RISK models of explanation.
The DEF model was the most difficult to understand with respect to
the privacy protection. This result fits well with [3] and confirms
that the pure definition of differential privacy is not that easy to
understand. Nevertheless, the RRT model helped people to
comprehend differential privacy protection significantly better than
the RISK model did. We therefore suggest that the RRT model, as a
metaphor, is easier to imagine. However, Karegar et al. [11] advised
being careful with metaphors, as users find them difficult to transfer
to other contexts. We believe that a transfer to other contexts is
easier with our explanatory RRT model because we describe privacy
protection and not the mechanism itself.
Explanatory models do not contribute to objective comprehension.
Whereas subjective comprehension was enhanced, we found no
difference on objective comprehension of differential privacy
protection. Surprisingly, our objective comprehension scores for all
differential privacy models were low, with a mean of 0.5. This
may be either due to the complexity of our explanatory models
or to the difficulty of our objective questions. In order to compare
𝑘-anonymity and differential privacy, we aligned the questions, but
we compared two different mechanisms. Our questions on objective
comprehension were therefore unable to fully capture the subtleties
of differential privacy and thus of our explanatory model. Future
studies should therefore adapt the questions accordingly to better
capture objective comprehension.
The privacy protection of 𝑘-anonymity is more comprehensible than
that of differential privacy. We can summarize by stating that
the privacy protection provided by 𝑘-anonymity seems to be
subjectively easier to comprehend than that of differential privacy.
Differential privacy protection explained with the RRT model seems
almost as easy to comprehend as the privacy protection provided
by 𝑘-anonymity. Nevertheless, even here more than half of the
participants rated 𝑘-anonymity as subjectively more comprehensible
when asked to directly compare the comprehension of the privacy
protection of both mechanisms. This result is in line with the
findings of Valdez et al. [28], who evaluated users’ willingness to share
personal health data when applying privacy-preserving techniques
such as 𝑘-anonymity or differential privacy. The perception of
privacy was rated more strongly for 𝑘-anonymity than for differential
privacy. The authors assumed that this was because the protection
of differential privacy was too difficult to conceptualize. Our results
show that the RRT model enhances the subjective comprehensibility
significantly compared to the other two explanatory models, and
also yields the best scores for the participants’ subjective
comprehension of the privacy protection in comparison to the privacy
protection of 𝑘-anonymity. We therefore argue that this model can
serve as a basis for further studies.
In addition, we observed high scores concerning objective
comprehension for 𝑘-anonymity. This is probably due to the fact that the
privacy parameter 𝑘 has a direct implication for data protection,
which can be understood independently of the data [2]: the
parameter is related to the legal concept of individual identifiability. Our
explanatory models help establish this relationship to identifiability,
but the implications must be explained in terms of a use case. With
𝑘-anonymity, privacy protection and the mechanism itself can be
easily visualized with a data set. Hence, a non-expert can verify
that the published data set is indeed 𝑘-anonymous [8]. When
communicating differential privacy guarantees, past studies have either
attempted to explain or visualize the mechanism itself ([10, 21])
or the risk associated with an 𝜀 ([7, 15]). Similar to research by
Nanayakkara et al. [22], future research should find explanations
that convey both the risks and the privacy-utility tradeoff. We
believe that the implications are better understood if the explanation
is not use-case dependent.
Different levels of education need different explanations. We
observed that levels of education and numeracy skills led to better
objective and subjective comprehension. Differential privacy
provides a quantitative mathematical definition, so it makes sense that
numeracy skills would be helpful in understanding the privacy
protection of differential privacy. However, we also observed a positive
correlation between the comprehension and subjective numeracy
skills. These results do not correspond to the Dunning-Kruger
effect [13]. However, this might be due to the fact that our sample was
very homogeneous and highly educated. Nevertheless, different
target groups should receive different explanations. Our target group
was end users; further research is needed to determine whether
our explanatory models can help other audiences, e.g., developers,
choose a suitable 𝜀.
7 RELATED WORK
Considering the demand for strong privacy guarantees, there is a
plethora of work on developing new algorithms with differential
privacy guarantees that focus on improving the privacy-utility
tradeoff [5, 24, 29]. A smaller 𝜀 leads to stronger privacy. However,
the problem of how exactly to determine the value of 𝜀 remains a
major challenge. According to Dwork, the value of 𝜀 is a “social
decision” [5]. Therefore, various approaches have been proposed
to quantify the privacy guarantees by translating 𝜀 into a privacy
risk [15, 20] or by using metaphors to describe the differential
privacy mechanism [1, 11, 26].
The privacy risk as well as the randomized response technique
may serve as explanatory models for the differential privacy
guarantee [1, 26]. The RRT model has been researched with regard to users’
trust and comprehension [1, 14] of the technique. Bullek et al. [1]
focused on describing the randomized response technique by using
a spinner metaphor. Smart et al. [26] investigated explanations of a
differential privacy mechanism, hiding the 𝜀 used and evaluating
users’ willingness to share data. However, it has not yet been
determined whether users’ understanding of this technique is sufficient.
Franzen et al. [7] evaluated the RISK approach empirically, with a
focus on how to communicate this risk. Our present study is a first
step in evaluating the comprehensibility of the RISK model as well as
of the RRT model. Furthermore, the randomized response technique
has generally been researched in isolation; in contrast, this study
evaluates the technique as a means of explaining differential privacy
in general.
Some previous studies have evaluated descriptions for differential
privacy. The work of Cummings et al. [3] took a look at users’
expectations that arose from descriptions of differential privacy
mechanisms already in the industry. According to the authors of
that study, existing descriptions fail to explain the differential
privacy guarantee in that users’ expectations are set arbitrarily. We
have aimed for an explanation that would improve users’
comprehension of the differential privacy guarantee. Instead of looking
at previously written descriptions, we have evaluated explanatory
models that are intended to facilitate users’ comprehension.
Karegar et al. [11] used blurred images as a metaphor to explain
the privacy protection provided by differential privacy. The authors
noted that the explanations help to communicate the fact that
introduced noise protects privacy and that there is some privacy-utility
tradeoff. However, their explanations did not directly imply privacy
protection of a particular 𝜀. ViP [22] is a tool for supporting the
decision of setting/splitting 𝜀 across queries. Nanayakkara et al. [22]
used RISK ([15]) to show the risk for a different number of users
and set 𝜀. The authors then depicted the privacy-utility tradeoff for
data analysts. In our study, we have focused on end users as the
target group of such explanations.
Xiong et al. [31] researched how different explanations for
differential privacy influenced users’ willingness to share their
personal data. The authors found that explanations focusing on the
implications of differential privacy instead of the technical aspects
increased users’ understanding and willingness to share personal
data. Nanayakkara et al. [23] used odds-based explanations based
on the RISK model inspired by [15]. They then compared their
explanations to the explanations provided by Xiong et al. [31]. Our
study takes a further step towards a more comprehensive explanation
of differential privacy, in that we evaluate different explanatory
models that all focus on the implications of the mechanism, i.e., the
privacy guarantee provided.
Our experimental setup, comparing users’ comprehension of the
privacy protection provided by 𝑘-anonymity to that provided by
differential privacy, was inspired by Valdez et al. [28]. They examined
how understandings of privacy concern change depending on what data
is collected and how it is used. The degree of privacy of 𝑘-anonymity
was also given with “indistinguishability” [28]. To explain differential
privacy, the concept of “exceptionality” was used, which indicated how
exceptional one is among all other respondents. The results of their study
suggest that being part of a larger crowd (𝑘-anonymity) appears
to be more privacy protective than differential privacy does. The
authors hypothesized that this is due to the explanation of
exceptionality. With our work, we have provided models for explaining
differential privacy protection and have evaluated them for their
comprehensibility. Due to our comparison with users’ comprehension
of the privacy protection of 𝑘-anonymity, we have been able
to help shed light on the extent of users’ comprehension.
8 CONCLUSION
We can conclude that different explanatory models indeed help
people to comprehend the privacy protection provided by differential
privacy. Our results confirm that the RISK and RRT models enhance
users’ subjective comprehension of the privacy protection provided by
differential privacy more than the DEF model does. We have
therefore presented a way to effectively explain the privacy
protection of a Laplacian differential privacy mechanism. Moreover,
the privacy protection provided by 𝑘-anonymity was more
comprehensible than that provided by differential privacy. The RRT
model yields the best scores for the participants’ subjective
comprehension. Therefore, we can conclude that RRT can serve as a
basis for further studies.
ACKNOWLEDGMENTS
This work is partially funded by the European Union
(NextGenerationEU). It is also supported by the German Federal Ministry
of Education and Research (BMBF) as part of the research projects
FreeMove and GANGES under reference numbers 01UV2090B and
16KISA034, respectively.
REFERENCES
[1]
Brooke Bullek, Stephanie Garboski, Darakhshan J. Mir, and Evan M. Peck. 2017.
Towards Understanding Differential Privacy: When Do People Trust Random-
ized Response Technique?. In
CHI ’17: Proceedings of the 2017 Conference on
Human Factors in Computing Systems
. ACM, 3833–3837. https://doi.org/10.1145/3025453.3025698
[2]
Chris Clifton and Tamir Tassa. 2013. On syntactic anonymity and differential
privacy. In
ICDEW ’13: IEEE 29th International Conference on Data Engineering
Workshops. 88–93. https://doi.org/10.1109/ICDEW.2013.6547433
[3]
Rachel Cummings, Gabriel Kaptchuk, and Elissa M. Redmiles. 2021. “I need a better
description”: An Investigation Into User Expectations For Differential Privacy.
In
CCS ’21: Proceedings of the 2021 ACM SIGSAC Conference on Computer and
Communications Security, Virtual Event, Republic of Korea, November 15 - 19,
2021. ACM, 3037–3052. https://doi.org/10.1145/3460120.3485252
[4]
Cynthia Dwork. 2006. Differential Privacy. In
ICALP ’06: Automata,
Languages and Programming, 33rd International Colloquium, Proceedings, Part
II (Lecture Notes in Computer Science, Vol. 4052)
. Springer, 1–12. https://doi.org/10.1007/11787006_1
[5]
Cynthia Dwork. 2008. Differential Privacy: A Survey of Results. In
TAMC ’08:
Theory and Applications of Models of Computation, 5th International
Conference (Lecture Notes in Computer Science, Vol. 4978)
. Springer, 1–19.
https://doi.org/10.1007/978-3-540-79228-4_1
[6]
Angela Fagerlin, Brian Zikmund-Fisher, Peter Ubel, Aleksandra Jankovic, Holly
Derry, and Dylan Smith. 2007. Measuring Numeracy Without a Math Test:
Development of the Subjective Numeracy Scale.
Medical Decision Making
27 (2007),
672–80. https://doi.org/10.1177/0272989X07304449
[7]
Daniel Franzen, Saskia Nuñez von Voigt, Peter Sörries, Florian Tschorsch, and
Claudia Müller-Birn. 2022. Am I Private and If So, how Many?: Communicating
Privacy Guarantees of Differential Privacy with Risk Communication Formats. In
CCS ’22: Proceedings of the 2022 ACM SIGSAC Conference on Computer and
Communications Security, Los Angeles, CA, USA, November 7-11, 2022
. ACM,
1125–1139. https://doi.org/10.1145/3548606.3560693
[8]
Arik Friedman, Ran Wolff, and Assaf Schuster. 2008. Providing
k
-anonymity in
data mining.
The VLDB Journal
17, 4 (2008), 789–804. https://doi.org/10.1007/S00778-006-0039-5
[9]
Benjamin C. M. Fung, Ke Wang, Rui Chen, and Philip S. Yu. 2010. Privacy-
preserving data publishing: A survey of recent developments.
Comput. Surveys
42, 4 (2010), 14:1–14:53. https://doi.org/10.1145/1749603.1749605
[10]
Justin Hsu, Marco Gaboardi, Andreas Haeberlen, Sanjeev Khanna, Arjun Narayan,
Benjamin C. Pierce, and Aaron Roth. 2014. Differential Privacy: An Eco-
nomic Method for Choosing Epsilon. In
CSF ’14: IEEE 27th Computer Security
Foundations Symposium
. IEEE Computer Society, 398–410. https://doi.org/10.1109/CSF.2014.35
[11]
Farzaneh Karegar, Ala Sarah Alaqra, and Simone Fischer-Hübner. 2022. Ex-
ploring User-Suitable Metaphors for Differentially Private Data Analyses. In
SOUPS ’22: Proceedings of the Eighteenth Symposium on Usable Privacy and
Security, Boston, MA, USA, August 7-9, 2022
. USENIX Association, 175–193.
https://www.usenix.org/conference/soups2022/presentation/karegar
[12]
Carmen Keller and Michael Siegrist. 2009. Effect of Risk Communication Formats
on Risk Perception Depending on Numeracy.
Medical Decision Making
29, 4
(2009), 483–490. https://doi.org/10.1177/0272989X09333122
[13]
Justin Kruger and David Dunning. 1999. Unskilled and unaware of it: how diffi-
culties in recognizing one’s own incompetence lead to inflated self-assessments.
Journal of personality and social psychology 77, 6 (1999), 1121.
[14]
Johannes A Landsheer, Peter Van Der Heijden, and Ger Van Gils. 1999. Trust and
understanding, two psychological aspects of randomized response.
Quality and
Quantity 33, 1 (1999), 1–12. https://doi.org/10.1023/A:1004361819974
[15]
Jaewoo Lee and Chris Clifton. 2011. How Much Is Enough? Choosing
𝜖
for Differ-
ential Privacy. In
ISC ’11: Information Security, 14th International Conference
.
Springer, 325–340. https://doi.org/10.1007/978-3-642-24861-0_22
[16]
Ninghui Li, Tiancheng Li, and Suresh Venkatasubramanian. 2007. t-Closeness:
Privacy Beyond k-Anonymity and l-Diversity. In
ICDE ’07: Proceedings of the
23rd International Conference on Data Engineering
. IEEE Computer Society,
106–115. https://doi.org/10.1109/ICDE.2007.367856
[17]
Isaac Lipkus, Greg Samsa, and Barbara Rimer. 2001. General Performance on a
Numeracy Scale Among Highly Educated Samples.
Medical Decision Making
21 (2001),
37–44. https://doi.org/10.1177/0272989X0102100105
[18]
Ashwin Machanavajjhala, Johannes Gehrke, Daniel Kifer, and Muthuramakr-
ishnan Venkitasubramaniam. 2006. l-Diversity: Privacy Beyond k-Anonymity.
In
ICDE ’06: Proceedings of the 22nd International Conference on Data
Engineering. IEEE Computer Society, 24. https://doi.org/10.1109/ICDE.2006.1
[19]
Naresh K. Malhotra, Sung S. Kim, and James Agarwal. 2004. Internet Users’
Information Privacy Concerns (IUIPC): The Construct, the Scale, and a Causal
Model.
Information Systems Research
15, 4 (2004), 336–355. https://doi.org/10.1287/isre.1040.0032
[20]
Luise Mehner, Saskia Nuñez von Voigt, and Florian Tschorsch. 2021. To-
wards Explaining Epsilon: A Worst-Case Study of Differential Privacy Risks. In
EuroS&P ’21: IEEE European Symposium on Security and Privacy Workshops,
Vienna, Austria, September 6-10, 2021
. IEEE, 328–331. https://doi.org/10.1109/EUROSPW54576.2021.00041
[21]
Maurizio Naldi and Giuseppe D’Acquisto. 2015. Differential Privacy: An Estima-
tion Theory-Based Method for Choosing Epsilon.
arXiv preprint
abs/1510.00917
(2015).
[22]
Priyanka Nanayakkara, Johes Bater, Xi He, Jessica Hullman, and Jennie Rogers.
2022. Visualizing Privacy-Utility Trade-Offs in Differentially Private Data Re-
leases.
Proceedings on Privacy Enhancing Technologies
2022, 2 (2022), 601–618.
https://doi.org/10.2478/popets-2022-0058
[23]
Priyanka Nanayakkara, Mary Anne Smart, Rachel Cummings, Gabriel Kaptchuk,
and Elissa M. Redmiles. 2023. What Are the Chances? Explaining the Epsilon
Parameter in Differential Privacy. In
32nd USENIX Security Symposium, USENIX
Security 2023, Anaheim, CA, USA, August 9-11, 2023
. USENIX Association. https://www.usenix.org/conference/usenixsecurity23/presentation/nanayakkara
[24]
K. Patel and G. B. Jethava. 2018. Privacy Preserving Techniques for Big Data: A
Survey. In
ICICCT ’18: Proceedings of the 2018 Second International Conference
on Inventive Communication and Computational Technologies
. 194–199. https://doi.org/10.1109/ICICCT.2018.8473289
[25]
Sarina B. Schrager. 2018. Five Ways to Communicate Risks So That Patients
Understand. Family Practice Management 25, 6 (2018), 28–31.
[26]
Mary Anne Smart, Dhruv Sood, and Kristen Vaccaro. 2022. Understanding
Risks of Privacy Theater with Differential Privacy.
Proceedings of the ACM on
Human-Computer Interaction 6, CSCW2 (2022), 1–24.
https://doi.org/10.1145/3555762
[27]
Latanya Sweeney. 2002. k-Anonymity: A Model for Protecting Privacy.
International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems
10 (2002), 557–570.
[28]
André Calero Valdez and Martina Ziefle. 2019. The users’ perspective on the
privacy-utility trade-offs in health recommender systems.
International Journal
of Human-Computer Studies
121 (2019), 108–121. https://doi.org/10.1016/j.ijhcs.2018.04.003
[29]
Teng Wang, Xuefeng Zhang, Jingyu Feng, and Xinyu Yang. 2020. A Comprehen-
sive Survey on Local Differential Privacy toward Data Statistics and Analysis.
Sensors 20, 24 (2020), 7030. https://doi.org/10.3390/s20247030
[30]
Stanley L. Warner. 1965. Randomized response: A survey technique for eliminating
evasive answer bias. J. Amer. Statist. Assoc. 60, 309 (1965), 63–69.
[31]
Aiping Xiong, Tianhao Wang, Ninghui Li, and Somesh Jha. 2020. Towards
Effective Differential Privacy Communication for Users’ Data Sharing Decision
and Comprehension. In
SP ’20: IEEE Symposium on Security and Privacy
. IEEE,
392–410. https://doi.org/10.1109/SP40000.2020.00088
APPENDIX
A SURVEY DETAILS
In this appendix, we provide the descriptions that were used to
present our explanatory models.
A.1 Descriptions and Explanations
𝐾-Anonymity. 𝐾-anonymity provides privacy protection by
generalizing or removing all sensitive columns of the database that
might be used to re-identify a student. This way, for every student
in the database, there is a group of at least 𝑘 students with the same
answers in all sensitive columns. In other words, there are always
at least 𝑘 indistinguishable students. Accordingly, 𝑘 is the privacy
parameter, which determines the level of privacy protection. The
higher the privacy parameter 𝑘, the more indistinguishable students
exist in each group, resulting in a stronger privacy protection.
Assume the school sets the privacy parameter to 𝑘=4, i.e.,
the database is modified in a way that it always yields at least 4
indistinguishable students. Specifically, the database now contains
four students from Bob’s class with a generalized age of 16 (Peter,
Bob, Marie, and Lucas). Names are not shown. Please note that Eve
cannot link the individual rows to the respective students. Bob’s
drug use is equally likely to be indicated by any of the four rows.
Hence, Bob, and his drug use, remain hidden in the group of four
indistinguishable students.
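The group condition described above can be checked mechanically. The following is a minimal sketch, not the procedure used in the study: the table, the column names (`age`, `class`, `drug_use`), and the helper `is_k_anonymous` are hypothetical and only mirror the example with Bob's class.

```python
from collections import Counter

def is_k_anonymous(rows, columns, k):
    """Return True if every combination of values in `columns`
    is shared by at least k rows."""
    groups = Counter(tuple(row[c] for c in columns) for row in rows)
    return all(count >= k for count in groups.values())

# Hypothetical generalized table: names removed, age generalized to 16.
table = [
    {"age": 16, "class": "10b", "drug_use": "yes"},
    {"age": 16, "class": "10b", "drug_use": "no"},
    {"age": 16, "class": "10b", "drug_use": "no"},
    {"age": 16, "class": "10b", "drug_use": "no"},
]

# The columns that could be used to re-identify a student (assumed here).
identifying_columns = ["age", "class"]

k4 = is_k_anonymous(table, identifying_columns, 4)
k5 = is_k_anonymous(table, identifying_columns, 5)
print(k4, k5)
```

With four indistinguishable rows, the table satisfies the condition for 𝑘=4 but not for 𝑘=5.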
Differential Privacy RISK. Differential privacy provides privacy
protection by randomly modifying statistical results extracted from the
database. The results therefore indicate the true answers of the
students with a certain probability only: Every student in the database
has a certain privacy risk that their true answer can be identified.
This risk is controlled by a privacy parameter, which determines
the level of privacy protection. A privacy parameter closer to zero
reduces the privacy risk, resulting in a stronger privacy protection.
Assume the school sets the privacy parameter in a way that
yields a privacy risk for the students of 75 %, i.e., the true answers
of the students are indicated with a probability of 75 %. Now, Eve
accesses the database and asks for the number of 16.3 years old
drug-using students in Bob’s class. Please note that Eve does not
know whether the returned result indicates Bob’s true answer or
not. The result was modified and might therefore be false. Bob has
a privacy risk of 75 %.
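To make the link between the privacy parameter and the stated risk concrete, the sketch below converts between the two. It assumes, as an illustration rather than as the definition used in the survey, that the risk is the worst-case bound 1 / (1 + e^(−𝜀)) on an adversary's belief when two neighboring databases are equally likely a priori. The function names are hypothetical.

```python
import math

def risk_from_epsilon(epsilon):
    """Assumed worst-case risk bound: 1 / (1 + e^(-epsilon))."""
    return 1.0 / (1.0 + math.exp(-epsilon))

def epsilon_from_risk(risk):
    """Inverse of the assumed bound; risk must lie in [0.5, 1)."""
    if not 0.5 <= risk < 1.0:
        raise ValueError("risk must be at least 0.5 and below 1")
    return math.log(risk / (1.0 - risk))

eps = epsilon_from_risk(0.75)          # equals ln 3
print(eps, risk_from_epsilon(eps))     # the second value recovers 0.75
print(risk_from_epsilon(0.0))          # parameter at zero: risk of 0.5
```

Under this assumption, a risk of 75 % corresponds to 𝜀 = ln 3, and a parameter closer to zero moves the risk towards 50 %, which matches the direction stated in the description.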
Differential Privacy RRT. Local differential privacy provides privacy
protection by randomly modifying the students’ answers containing
sensitive information, before storing them in the database. The
stored answers therefore correspond to the true answers of the
students with a certain probability only: Every student in the
database has a certain probability that their true answer is stored. This
probability is controlled by a privacy parameter, which determines
the level of privacy protection. A privacy parameter closer to zero
reduces the probability, resulting in a stronger privacy protection.
Assume the school sets the privacy parameter in a way that the
mechanism stores the true answer of a student with a probability
of 75 %. Now, Eve accesses the database and asks for the number of
16.3 years old drug-using students in Bob’s class. Please note that
Eve does not know whether the returned result indicates Bob’s true
answer or not. Bob’s answer was modified and might therefore be
false. Bob’s true answer is stored with a probability of 75 %.
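The mechanism described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: answers are binary (yes/no), a stored answer that is not the true one is its opposite, and the population of 10,000 students with a 20 % true rate is made up. The function names are hypothetical.

```python
import math
import random

def randomized_response(true_answer, p_truth, rng):
    """Store the true yes/no answer with probability p_truth,
    otherwise store its opposite."""
    return true_answer if rng.random() < p_truth else not true_answer

def epsilon_of(p_truth):
    """Privacy parameter of binary randomized response: ln(p / (1 - p))."""
    return math.log(p_truth / (1.0 - p_truth))

def estimate_true_rate(stored, p_truth):
    """Correct the observed 'yes' rate for the known flipping probability."""
    observed = sum(stored) / len(stored)
    return (observed - (1.0 - p_truth)) / (2.0 * p_truth - 1.0)

rng = random.Random(0)                            # fixed seed for repeatability
true_answers = [i < 2000 for i in range(10000)]   # made-up data: 20 % "yes"
stored = [randomized_response(a, 0.75, rng) for a in true_answers]

estimate = estimate_true_rate(stored, 0.75)
print(epsilon_of(0.75), estimate)
```

Each stored answer is wrong with probability 25 %, so a single row says little about Bob, while the aggregate rate can still be estimated from many rows.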
Differential Privacy DEF. Differential privacy provides privacy
protection by randomly modifying statistical results extracted from the
database. The true results extracted from the students’ answers are
therefore returned with a certain probability only: The probability
of the same result if one of the students gave a different answer
has a certain difference to the probability of the true result. This
difference is controlled by a privacy parameter, which determines
the level of privacy protection. A privacy parameter closer to zero
reduces the difference, resulting in a stronger privacy protection.
Assume the school sets the privacy parameter in a way that
the probability of returning any statistical result is 3 times the
probability of the same result if one of the students gave a different
answer. Now, Eve accesses the database and asks for the number
of 16.3 years old drug-using students in Bob’s class. Please note
that Eve does not know whether the returned result indicates Bob’s
true answer or not. The result was modified and might therefore be
returned if Bob gave a different answer. The probability of the true
result being returned is at most 3 times the probability of the same
result if Bob did not use drugs.
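The factor of 3 can be illustrated with a Laplace mechanism on a counting query. The sketch below is an assumption-laden illustration, not the survey's implementation: it takes 𝜀 = ln 3 so that e^𝜀 = 3, assumes a count changes by at most 1 when one student answers differently, and uses made-up counts of 3 and 2. It then checks numerically that no output is more than 3 times as likely under one count as under the other.

```python
import math

def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution centered at `mean`."""
    return math.exp(-abs(x - mean) / scale) / (2.0 * scale)

epsilon = math.log(3)          # assumed, so that e^epsilon = 3
sensitivity = 1.0              # one student changes a count by at most 1
scale = sensitivity / epsilon  # standard Laplace mechanism calibration

count_if_bob_yes = 3           # made-up true count if Bob uses drugs
count_if_bob_no = 2            # made-up count if Bob answered differently

# Compare how likely each possible noisy output is under both counts.
outputs = [x / 10.0 for x in range(-50, 101)]
max_ratio = max(
    laplace_pdf(x, count_if_bob_yes, scale) / laplace_pdf(x, count_if_bob_no, scale)
    for x in outputs
)
print(max_ratio)  # stays at (approximately) 3, never above
```

The bound is reached for outputs at or above the larger count and is lower everywhere else, which is why the description speaks of "at most" 3 times.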