Carolin Brunn, Saskia Nuñez von Voigt, Florian Tschorsch
Analyzing Continuous ks-Anonymization for Smart
Meter Data
Open Access via institutional repository of Technische Universität Berlin
Document type
Conference paper | Submitted version
(i. e. version that has been submitted to a publisher for (peer) review; also known as: Author’s Original
Manuscript (AOM), Original manuscript, Preprint)
This version is available at
https://doi.org/10.14279/depositonce-19392
Citation details
Brunn, Carolin; Nuñez von Voigt, Saskia; Tschorsch, Florian (2023). Analyzing Continuous
ks-Anonymization for Smart Meter Data. Computer Security, ESORICS 2023 International Workshops.
Terms of use
This work is protected by copyright and/or related rights. You are free to use this work in any way permitted by
the copyright and related rights legislation that applies to your usage. For other uses, you must obtain
permission from the rights-holder(s).
Analyzing Continuous ks-Anonymization
for Smart Meter Data⋆
Short Paper
Carolin Brunn, Saskia Nuñez von Voigt, and Florian Tschorsch
Distributed Security Infrastructures, Technische Universität Berlin, Berlin, Germany
{c.brunn, saskia.nunezvonvoigt, florian.tschorsch}@tu-berlin.de
Abstract. Data anonymization is crucial to allow the widespread adop-
tion of some technologies, such as smart meters. However, anonymiza-
tion techniques should be evaluated in the context of a dataset to make
meaningful statements about their eligibility for a particular use case.
In this paper, we therefore analyze the suitability of continuous ks-ano-
nymization with CASTLE for data streams generated by smart meters.
We compare CASTLE’s continuous, piecewise ks-anonymization with a
global process in which all data is known at once, based on metrics like
information loss and properties of the sensitive attribute. Our results
suggest that continuous ks-anonymization of smart meter data is rea-
sonable and ensures privacy while having comparably low utility loss.
1 Introduction
The suitability of data anonymization techniques, such as k-anonymity [10],
must be evaluated in the context of a dataset to make meaningful state-
ments. In particular, the data types, the granularity, and distribution have
an impact on the efficiency of data anonymization and affect the fundamen-
tal trade-off between data privacy and data utility.
For smart meter data, the efficiency of data anonymization remains un-
clear as the application scenario and the data pose a challenge. While smart
meters (SMs) become increasingly important to enable dynamic resource
management of various energy sources, the type of data differs from other
relational data sources. SMs generate a data stream derived from continu-
ous sensor data, measuring consumption of electric energy, gas, and water.
SM data, therefore, comprises sensitive, personal data that require privacy
protection. In addition, the application scenario dictates a distributed archi-
tecture with distributed data sources.
In this paper, we investigate the continuous anonymization of SM data
and assess the efficiency of ks-anonymity for the anonymization in this sce-
nario. The concept of ks-anonymity is an extension of k-anonymity for data
⋆Supported by the Federal Ministry of Education and Research of Germany
(Project 16KISA034)
[Figure 1: smart meters (SM) send their measurements to a central entity
that runs CASTLE for ks-anonymization.]
Fig. 1: Centralized architecture with smart meters forwarding measurements
to a central entity for anonymization.
streams [2]. In particular, we use the widely recognized algorithm for stream
anonymization, CASTLE [2], and study its characteristics and suitability. For
our study, we consider a typical SM architecture in which distributed SMs
send their data to a central entity (CE). We evaluate the suitability of ks-
anonymity for SM data based on metrics such as information loss and range
of the sensitive attribute, and compare the performance of continuous piece-
wise anonymization with an idealized anonymization as baseline.
Our results suggest that ks-anonymity is a reasonable choice for anony-
mizing smart meter data. Based on our metrics, the performance of contin-
uous data anonymization appears to be comparable to our baseline. Further
analysis of the diversity of consumption measurements shows that in most
clusters, the values of the sensitive attribute are distributed over a wide
range and are not clustered around a single consumption value. Addition-
ally, we note that the prioritization of attributes during the anonymization
process depends on the different magnitudes of the attribute ranges. This
should be taken into account in any case, but can also be exploited to shape
the process to a certain degree.
The paper is organized as follows. After introducing our problem state-
ment as well as ks-anonymity and CASTLE in Sec. 2 and 3, we present our
evaluation in Sec. 4. In Sec. 5 we conclude the paper.
2 Problem Statement and Related Work
Problem Statement. Our goal is to analyze whether continuous
ks-anonymization is suitable for SM data. Since the data type differs from
other relational data sources in some crucial characteristics, this is
anything but obvious. The data points generated by SMs are discrete
measurements of user consumption, e.g., electricity consumption, taken
from a continuous data stream.
Different strategies can be applied to discretize the data stream, such as
using the current consumption value or aggregating the entire consumption
between two measurements. Thus, SM data has different characteristics,
such as the temporal granularity of the measurements.
For our evaluation, we use a realistic architecture in which distinct SMs
measure the consumption and forward the data directly to a trusted CE,
e.g., the energy provider. Figure 1 visualizes this architecture. To facilitate
further processing by third parties, e.g., for district management, the data
is collected and anonymized centrally before it is forwarded.
Related Work. There are several approaches to avoid accurate profiling
and disclosure of information based on smart meter measurements. For
instance, load balancing and shaping prevent characteristic traces in con-
sumption data, while other approaches focus on achieving privacy by de-
sign with specific architectures. Another focus is on protecting privacy by
anonymizing consumption data, e.g., with k-anonymity [10] or differential
privacy [3]. One algorithm to achieve k-anonymity for streaming data is the
one analyzed in this paper—CASTLE. Several other algorithms exist, some
of which also address challenges of CASTLE, such as [4,7,8,11]. However, to
the best of our knowledge, there are no studies that evaluate the suitability
of ks-anonymity specifically for smart meter data.
3 ks-Anonymity and CASTLE
We focus on ks-anonymity [2], which is an extension of k-anonymity [10] for
streaming data. The main idea of k-anonymity is to modify and group data
items in such a way that each group comprises at least k entries that are
indistinguishable from each other. Such a group is called an Equivalence
Class (EQ). ks-anonymity [2] extends this idea and requires that a
published anonymized stream comprises EQs with at least k distinct
individuals, not just k entries.
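To illustrate the difference between the two requirements, the following minimal Python sketch checks both conditions on already generalized records. The record layout and the field names (uid, zip, hour, load) are assumptions made for this illustration and are not part of CASTLE or of our evaluation code.

```python
from collections import defaultdict


def equivalence_classes(records, qi_keys):
    """Group records by their (already generalized) quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[key] for key in qi_keys)].append(rec)
    return groups


def is_k_anonymous(records, qi_keys, k):
    """k-anonymity: every equivalence class holds at least k entries."""
    classes = equivalence_classes(records, qi_keys)
    return all(len(group) >= k for group in classes.values())


def is_ks_anonymous(records, qi_keys, k, uid_key="uid"):
    """ks-anonymity: every class holds entries of at least k distinct individuals."""
    classes = equivalence_classes(records, qi_keys)
    return all(
        len({rec[uid_key] for rec in group}) >= k for group in classes.values()
    )


# Hypothetical generalized records: client "a" contributes two tuples.
records = [
    {"uid": "a", "zip": "101**", "hour": "08-09", "load": 310},
    {"uid": "a", "zip": "101**", "hour": "08-09", "load": 295},
    {"uid": "b", "zip": "101**", "hour": "08-09", "load": 120},
]

k_ok = is_k_anonymous(records, ["zip", "hour"], k=3)
ks_ok = is_ks_anonymous(records, ["zip", "hour"], k=3)
```

In this example the single equivalence class contains three entries but only two distinct individuals, so it satisfies the entry-based condition for k = 3 while violating the individual-based one.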
CASTLE [2] is an established algorithm to achieve ks-anonymity by
assigning incoming data points, called tuples, to clusters that represent
their generalization. The tuples are specified in a metric space defined by
the so-called Quasi-identifiers (QIs) [10]. The clusters are EQs, where all
data points share the same generalized values for each QI attribute. Each
cluster must contain at least k distinct individuals. CASTLE either creates
a new cluster or assigns the tuple to an existing cluster such that the
information loss is minimized. As its information loss metric, CASTLE uses
the Generalized Loss Metric [5]. For cluster generalization, QI attributes
either form intervals, in the case of continuous attributes, or, in the
case of categorical attributes, are generalized to their lowest common
ancestor in the corresponding domain generalization hierarchy (DGH). A DGH
is a directed tree structure that defines hierarchical values for such
categorical attributes. CASTLE also uses a delay constraint δ that
specifies the maximum time that can pass before a tuple needs to be
generalized and published. The clusters that were anonymized with CASTLE
can then be published and used for further processing, e.g., by third-party
data processors.
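The two generalization rules can be sketched as follows. This is a simplified illustration and not the CASTLE or CASTLEGUARD implementation; the DGH is represented as a hypothetical child-to-parent mapping, and the district names only serve as an example (in our dataset, addresses are encoded as integers and therefore treated as a continuous attribute).

```python
def generalize_interval(values):
    """Continuous attribute: the smallest interval covering all values."""
    return (min(values), max(values))


def path_to_root(node, parent):
    """Return the node followed by all of its ancestors up to the DGH root."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def lowest_common_ancestor(nodes, parent):
    """Categorical attribute: the most specific DGH node covering all values."""
    # Candidates are ordered from the first value up to the root.
    candidates = path_to_root(nodes[0], parent)
    for node in nodes[1:]:
        ancestors = set(path_to_root(node, parent))
        candidates = [c for c in candidates if c in ancestors]
    return candidates[0]


# Hypothetical DGH given as child -> parent; "Berlin" is the root.
dgh_parent = {
    "Moabit": "Mitte",
    "Wedding": "Mitte",
    "Kreuzberg": "Friedrichshain-Kreuzberg",
    "Mitte": "Berlin",
    "Friedrichshain-Kreuzberg": "Berlin",
}

time_interval = generalize_interval([15, 30, 45])
same_borough = lowest_common_ancestor(["Moabit", "Wedding"], dgh_parent)
whole_city = lowest_common_ancestor(["Moabit", "Kreuzberg"], dgh_parent)
```

Values from the same subtree are generalized to their shared parent, whereas values from different subtrees are generalized up to the root, which corresponds to the maximum loss for that attribute.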
We can already observe that during the anonymization process, the QIs
are used for the generalization. Accordingly, we expect that their different
magnitudes will play a crucial role. At the same time, please note that the
sensitive attribute is not considered in the process. This could enable at-
tacks if users in a cluster have different consumption ranges that differ sig-
nificantly from each other.
4 Evaluation
Methodology. For our evaluation, we use a dataset of electricity
consumption measurements that is publicly available at UCI¹. The dataset
consists of consumption data from 370 clients, measured every 15 minutes
between 2011 and 2015. Based on the consumption profiles in the dataset, we
infer that the set comprises data of individual households as well as
larger consumers such as schools, hospitals, or small industry². The
original dataset contains only measurement data and no additional
information about clients. We therefore added synthetic addresses, modeling
a district in Berlin, where zip code, street, and house number are encoded
in an integer value. Furthermore, we adapted the format of the timestamps.
Due to the size of the dataset, we sampled it (weeks 46 & 47 of November
2014), resulting in 164 102 tuples.
We use the publicly available CASTLEGUARD implementation [9]. With the
differential privacy feature disabled, as in our experiments, it resembles
the CASTLE algorithm. Since we identified potential bugs, we made some
minor adaptations to the code³, e.g., in the function merge_clusters. We
provide our code including the respective changes as well as the dataset on
GitHub.⁴
We simulated a distributed ks-anonymization with CASTLEGUARD for different
δ and k, and compared it to a global ks-anonymization process in which all
data tuples are known in advance and then clustered all at once. The latter
is simulated using the ARX anonymization tool [1].
Information Loss. For our evaluation, we use information loss as utility
metric. Specifically, we use the Generalized Loss Metric (GLM) [5], which is
also used for estimating the information loss in CASTLE. Here, the clus-
ter range of a generalized attribute is compared to the overall range of
this attribute. For each entry, the information loss of an attribute is defined
as $\frac{u_i - l_i}{U - L} \in [0, 1]$, where $u_i$ and $l_i$ are the upper
and lower limits of entry $i$'s attribute generalization, and $U$ and $L$ are
the overall upper and lower limits of this attribute,
respectively. For our evaluation, we calculate the average information loss
across all clusters per attribute.
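As an illustration of this metric for a continuous attribute, the following Python sketch computes the loss of a single generalized interval and the average over several clusters. The function names and the example values are assumptions chosen for this sketch; they are not taken from the CASTLEGUARD or ARX code.

```python
def attribute_loss(lower, upper, domain_lower, domain_upper):
    """Loss of one generalized interval: (u_i - l_i) / (U - L), in [0, 1]."""
    if domain_upper == domain_lower:
        # Degenerate domain with a single value: nothing can be lost.
        return 0.0
    return (upper - lower) / (domain_upper - domain_lower)


def average_loss(cluster_intervals, domain_lower, domain_upper):
    """Average loss of one attribute across all published clusters."""
    losses = [
        attribute_loss(lower, upper, domain_lower, domain_upper)
        for lower, upper in cluster_intervals
    ]
    return sum(losses) / len(losses)


# Hypothetical attribute with overall range [0, 100] and two clusters.
intervals = [(10, 30), (40, 50)]
per_cluster = [attribute_loss(lo, hi, 0, 100) for lo, hi in intervals]
mean_loss = average_loss(intervals, 0, 100)
```

Here the two clusters cover 20% and 10% of the attribute's overall range, so the average loss is 0.15. Because the loss is relative to the overall range, the same absolute interval width weighs more heavily for an attribute with a small domain than for one with a large domain.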
Figure 2 shows the average information loss of all clusters for varying k
and δ for the address (left plot) and time (right plot) attributes.
¹ https://doi.org/10.24432/C58C86
² The magnitude of consumption values suggests that the values are given in Watt
instead of kW as noted in the description of the dataset.
³ We have reached out to the developers to discuss the bugs/changes.
⁴ https://github.com/carolin-brunn/dpm-castle-analysis
[Figure 2: two panels, "Information loss - Address" (left) and
"Information loss - Time" (right), each plotted over k for CASTLE with
delta = 100 and delta = 400 and for ARX.]
Fig. 2: Average information loss for quasi-identifying attributes.
We observe that the information loss is highest for the address attribute
in most settings. The clusters must always contain k distinct individuals
that all have different addresses. Thus, whenever a cluster is created, the
address attribute needs to be generalized. The information loss increases
for the address with an increasing k. This is expected since an increasing
k requires more distinct individuals with different addresses. We also
observe that for CASTLE, the information loss increases as the ratio
between k and δ increases. Presumably, CASTLE is forced to join very
different clients if many clients have to be extracted from a relatively
small sliding window. Overall, it is noticeable that the address
information loss is comparable for ARX and CASTLE. For lower k, ARX has a
lower information loss than CASTLE's sequential generalization with
δ = 100. However, for larger k and δ, the advantage of ARX fades. For
δ = 400, CASTLE consistently finds better clusters that result in a lower
information loss when compared to ARX.
For the time attribute, the information loss of ARX and CASTLE is
comparable. Note, however, that at the lower end of the k range ARX has a
higher information loss than CASTLE. Presumably, this is caused by ARX's
anonymization strategy: ARX chooses the same generalization level for all
values of the same attribute. Consequently, a single cluster that requires
a higher level of generalization may cause all other clusters, which could
have been formed with a lower level, to be published with this
unnecessarily high generalization.
In general, the results suggest that CASTLE is a reasonable alternative to
a global generalization with ARX, especially for larger δ. Nevertheless,
attribute ranges seem crucial for the prioritization when generalizing
attributes. Consequently, analyzing the exact behavior of CASTLE with
attributes of different magnitudes and diverse parameter settings is
necessary to find optimal settings for the anonymization of smart meter
data, which we will investigate in the remainder.
UID Diversity. Next, we compare the size of the published clusters and the
diversity of unique identifier (UID) values in these clusters. The k value is
the required minimum number of UIDs per cluster. Therefore, a larger UID
diversity means a larger number of distinct individuals that protect each
other from information disclosure. In contrast, very large clusters with a
[Figure 3: three panels, "CASTLE, delta = 100", "CASTLE, delta = 400", and
"ARX"; x-axis: UID diversity, y-axis: cluster size; one series each for
k = 10, 25, 50, 100.]
Fig. 3: Cluster size [tuples] in relation to the UID diversity.
low UID diversity indicate that many data tuples correspond to the same
individuals. This could compromise privacy as a person may have similar
consumption values, resulting in low diversity of consumption values and
potentially disclosing information.
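The two quantities compared here can be computed per cluster as in the following minimal sketch. The tuple layout with the fields uid and consumption is an assumption made for this illustration.

```python
def cluster_stats(cluster, uid_key="uid"):
    """Return (cluster size in tuples, number of distinct UIDs)."""
    size = len(cluster)
    uid_diversity = len({t[uid_key] for t in cluster})
    return size, uid_diversity


# Hypothetical cluster: six tuples, but only two distinct clients.
cluster = [
    {"uid": "a", "consumption": 310},
    {"uid": "a", "consumption": 305},
    {"uid": "a", "consumption": 298},
    {"uid": "a", "consumption": 312},
    {"uid": "b", "consumption": 120},
    {"uid": "b", "consumption": 131},
]

size, uid_diversity = cluster_stats(cluster)
```

In this example the cluster size is three times the UID diversity, so most tuples stem from the same client and the consumption values within the cluster are correspondingly similar.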
Figure 3 shows the cluster size in tuples against the UID diversity for
different values of k and δ. For ARX, we observe that the UID diversity of
most clusters is between 2·k and 2.5·k. Moreover, the clusters generated
with ARX are about the size of their UID diversity.
For CASTLE, we observe that δ significantly influences the cluster sizes.
For better visibility, we excluded a few clusters that were larger than 500,
which were most likely caused by an unfavorable combination of tuples
due to an expiring δ. In Figure 3, the cluster size increases with larger δ,
while the range of UID diversity remains about the same. We suspect that
this is caused by the nature of the dataset. The extracted sample includes
about one-third of the available data points, i.e., measurements of about 120
clients per time point, and each client appears on average 1 to 2 times per
hour. One hour corresponds to approx. 490 data points. Thus, for δ = 100,
each client that appears has about 1 data point in the sliding window when
the clusters are created. Consequently, the cluster size and UID diversity are
about the same. For δ = 400, the sliding window can contain several tuples
per client. In this case, multiple time points belonging to the same client
are mostly included in the same cluster, resulting in larger clusters with the
same UID diversity. This is also reflected by our information loss analysis of
the time attribute above.
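The figures in this argument can be reproduced with a short back-of-the-envelope calculation. The sketch below assumes that the two sampled weeks cover 14 full days and that δ is counted in tuples; both are assumptions of this illustration.

```python
TUPLES_IN_SAMPLE = 164_102   # sampled tuples (Sec. 4, Methodology)
CLIENTS = 370                # clients in the dataset
READINGS_PER_HOUR = 4        # one measurement every 15 minutes
SAMPLE_HOURS = 14 * 24       # assumption: two full weeks

tuples_per_hour = TUPLES_IN_SAMPLE / SAMPLE_HOURS             # about 488
tuples_per_time_point = tuples_per_hour / READINGS_PER_HOUR   # about 122
appearances_per_client_hour = tuples_per_hour / CLIENTS       # about 1.3
sampled_fraction = tuples_per_hour / (CLIENTS * READINGS_PER_HOUR)  # about 1/3


def window_time_points(delta):
    """Number of 15-minute time points covered by a window of delta tuples."""
    return delta / tuples_per_time_point


small_window = window_time_points(100)   # less than one time point
large_window = window_time_points(400)   # a little more than three time points
```

Under these assumptions, a window of 100 tuples spans less than one time point, so a client can contribute at most about one tuple, whereas a window of 400 tuples spans roughly three time points and can therefore hold several tuples of the same client.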
Consumption Range. Next, we analyze the diversity and distribution of the
sensitive attribute, i.e., electricity consumption. Our initial analysis showed
that almost all settings have a diversity of the sensitive attribute that is
at least approximately equal to the UID diversity suggesting a high level
of privacy protection. However, we do not include these results in this
paper since diversity was designed for categorical rather than numerical
attributes. In particular, it does not take the range or similarity of
values into account, as was previously described in [6].
[Figure 4: three panels, "CASTLE, delta = 100", "CASTLE, delta = 400", and
"ARX"; x-axis: UID diversity, y-axis: consumption range (×10^4); one series
each for k = 10, 25, 50, 100.]
Fig. 4: Consumption range against UID diversity per cluster.
[Figure 5: three panels, "CASTLE, delta = 100", "CASTLE, delta = 400", and
"ARX"; x-axis: proximity ratio in cluster, y-axis: ECDF; one series each for
k = 10, 25, 50, 100.]
Fig. 5: Average proximity ratio of tuples in clusters.
Instead, we consider the range e of the sensitive attribute in the clusters,
inspired by (k, e)-anonymity [12]. Figure 4 shows the consumption range
against the UID diversity. We see that k and the UID diversity only slightly
influence the consumption range for CASTLE. Indeed, clusters with a given
UID diversity exhibit the full spectrum of ranges of the sensitive
attribute.
The same applies to ARX: independent of k, the clusters exhibit all
different ranges. The consumption of individual households is expected to be
in smaller ranges typical for the number of members in a household. Compared
to that, larger clients such as schools or industry have larger consumption
with more variance. The results in Figure 4 suggest that different types of
clients are included in many clusters for both processing strategies.
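The range considered here is simply the spread between the smallest and the largest value of the sensitive attribute in a cluster, as the following minimal sketch shows. The field name consumption and the example values are assumptions made for this illustration.

```python
def consumption_range(cluster, key="consumption"):
    """Range e of the sensitive attribute within one cluster."""
    values = [t[key] for t in cluster]
    return max(values) - min(values)


# Hypothetical clusters: households only vs. households plus a large client.
households = [{"consumption": v} for v in (120, 180, 240, 310)]
mixed = households + [{"consumption": 42_000}]

narrow = consumption_range(households)
wide = consumption_range(mixed)
```

A single large client is enough to widen the range of a cluster considerably, which is why the range alone says little about how the remaining values are distributed.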
Consumption Proximity. Information about the range does not capture the
distribution of the sensitive attribute. We therefore analyze the
difference between neighboring consumption values in a cluster by analyzing
their relative ϵ-neighborhood with ϵ = 0.2, as described in [6]. We
calculate the proximity ratio as the average percentage of tuples in a
cluster that have other tuples in this cluster within their
0.2-neighborhood. A high proximity ratio could facilitate a proximity
breach, which means that an attacker can infer that the sensitive attribute
lies within a small interval [6].
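One possible reading of this proximity ratio is sketched below. The sketch assumes non-negative consumption values, takes the relative ϵ-neighborhood of a value x to be the interval [x(1 - ϵ), x(1 + ϵ)], and averages, over all tuples of a cluster, the fraction of other tuples that fall into this interval. This interpretation is an assumption of the illustration and may differ in detail from our evaluation code.

```python
def in_relative_neighborhood(x, y, eps=0.2):
    """True if y lies in [x * (1 - eps), x * (1 + eps)]; assumes x >= 0."""
    return x * (1 - eps) <= y <= x * (1 + eps)


def proximity_ratio(values, eps=0.2):
    """Average fraction of other tuples within each tuple's eps-neighborhood."""
    n = len(values)
    if n < 2:
        return 0.0
    ratios = []
    for i, x in enumerate(values):
        close = sum(
            1
            for j, y in enumerate(values)
            if j != i and in_relative_neighborhood(x, y, eps)
        )
        ratios.append(close / (n - 1))
    return sum(ratios) / n


# Hypothetical cluster: two similar households and two larger clients.
ratio = proximity_ratio([100, 110, 500, 1000])
```

In this example only the two smallest values lie within each other's neighborhood, which yields a proximity ratio of 1/6. A ratio close to 1 would indicate that nearly all values of a cluster lie in a narrow band.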
Figure 5 shows the distribution of these values as empirical cumulative
distribution plots. The larger the k, the fewer tuples are in 0.2-neighborhood
of each other, indicating better privacy protection since the values of the
sensitive attribute are less similar within a cluster. We observe no substantial
difference between the results obtained with CASTLE and ARX. For k = 10,
the clusters generated by CASTLE show even less proximity than those of
ARX. This means that the privacy obtained with the sequential ks-anonymi-
zation is comparable to the global anonymization realized with ARX.
5 Conclusion
In this paper, we analyzed the suitability of ks-anonymity for smart meter
data in a centralized architecture. Our results suggest that the continu-
ous ks-anonymization with CASTLE is comparable to a global anonymization
with ARX. Therefore, we consider ks-anonymity as a reasonable approach for
smart meter data anonymization. The exact influence of certain parameters,
such as window size, requires further research in order to find optimal
settings for specific use cases. Additionally, the constraints of numerical data
such as electricity consumption must be considered and suitable metrics
for the evaluation of the privacy of anonymized data have to be chosen, for
instance, the analysis of proximity instead of “pure” diversity.
References
1. ARX homepage, https://arx.deidentifier.org/, last accessed 14 June 2023
2. Cao, J., Carminati, B., Ferrari, E., Tan, K.: CASTLE: continuously anonymizing
data streams. IEEE Trans. Dependable Secur. Comput. 8(3), 337–352 (2011)
3. Dwork, C.: Differential privacy in new settings. In: SODA 2010 (2010)
4. Guo, K., Zhang, Q.: Fast clustering-based anonymization approaches with time
constraints for data streams. Knowledge-Based Systems 46, 95–108 (2013)
5. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: ACM SIGKDD
2002. pp. 279–288 (2002)
6. Li, J., Tao, Y., Xiao, X.: Preservation of proximity privacy in publishing numerical
sensitive data. In: ACM SIGMOD 2008. pp. 473–486 (2008)
7. Mohamed, M.A., Nagi, M.H., Ghanem, S.M.: A clustering approach for anony-
mizing distributed data streams. In: ICCES 2016. pp. 9–16. IEEE (2016)
8. Pallas, F., Legler, J., Amslgruber, N., Grünewald, E.: Redcastle: practically ap-
plicable ks-anonymity for iot streaming data at the edge in node-red. In:
M4IoT@Middleware 2021. pp. 8–13. ACM (2021)
9. Robinson, A., Brown, F., Hall, N., Jackson, A., Kemp, G., Leeke, M.: Castle-
guard: Anonymised data streams with guaranteed differential privacy. In:
DASC/PiCom/CBDCom/CyberSciTech 2020. pp. 577–584. IEEE (2020)
10. Sweeney, L.: k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzzi-
ness Knowl. Based Syst. 10, 557–570 (2002)
11. Yang, L., Chen, X., Luo, Y., Lan, X., Wang, W.: Idea: A utility-enhanced approach
to incomplete data stream anonymization. Tsinghua Science and Technology
27(1), 127–140 (2022)
12. Zhang, Q., Koudas, N., Srivastava, D., Yu, T.: Aggregate query answering on
anonymized tables. In: IEEE ICDE 2007. pp. 116–125 (2007)