
Carolin Brunn, Saskia Nuñez von Voigt, Florian Tschorsch

Analyzing Continuous ks-Anonymization for Smart Meter Data

Open Access via institutional repository of Technische Universität Berlin

Document type

Conference paper | Submitted version

(i. e. version that has been submitted to a publisher for (peer) review; also known as: Author’s Original

Manuscript (AOM), Original manuscript, Preprint)

This version is available at

https://doi.org/10.14279/depositonce-19392

Citation details

Brunn, Carolin; Nuñez von Voigt, Saskia; Tschorsch, Florian (2023). Analyzing Continuous
ks-Anonymization for Smart Meter Data. Computer Security, ESORICS 2023 International Workshops.

This work is protected by copyright and/or related rights. You are free to use this work in any way permitted by

the copyright and related rights legislation that applies to your usage. For other uses, you must obtain

permission from the rights-holder(s).

Analyzing Continuous ks-Anonymization

for Smart Meter Data⋆

Short Paper

Carolin Brunn, Saskia Nuñez von Voigt, and Florian Tschorsch

Distributed Security Infrastructures, Technische Universität Berlin, Berlin, Germany

{c.brunn, saskia.nunezvonvoigt, florian.tschorsch}@tu-berlin.de

Abstract. Data anonymization is crucial to allow the widespread adop-

tion of some technologies, such as smart meters. However, anonymiza-

tion techniques should be evaluated in the context of a dataset to make

meaningful statements about their eligibility for a particular use case.

In this paper, we therefore analyze the suitability of continuous ks-ano-

nymization with CASTLE for data streams generated by smart meters.

We compare CASTLE’s continuous, piecewise ks-anonymization with a

global process in which all data is known at once, based on metrics like

information loss and properties of the sensitive attribute. Our results

suggest that continuous ks-anonymization of smart meter data is rea-

sonable and ensures privacy while having comparably low utility loss.

1 Introduction

The suitability of data anonymization techniques, such as k-anonymity [10],

must be evaluated in the context of a dataset to make meaningful state-

ments. In particular, the data types, the granularity, and distribution have

an impact on the efficiency of data anonymization and affect the fundamen-

tal trade-off between data privacy and data utility.

For smart meter data, the efficiency of data anonymization remains un-

clear as the application scenario and the data pose a challenge. While smart

meters (SMs) become increasingly important to enable dynamic resource

management of various energy sources, the type of data differs from other

relational data sources. SMs generate a data stream derived from continu-

ous sensor data, measuring consumption of electric energy, gas, and water.

SM data, therefore, comprises sensitive, personal data that require privacy

protection. In addition, the application scenario dictates a distributed archi-

tecture with distributed data sources.

In this paper, we investigate the continuous anonymization of SM data

and assess the efficiency of ks-anonymity for the anonymization in this sce-

nario. The concept of ks-anonymity is an extension of k-anonymity for data

⋆Supported by the Federal Ministry of Education and Research of Germany

(Project 16KISA034)


Fig. 1: Centralized architecture with smart meters forwarding measurements
to a central entity for anonymization. [Diagram labels: SM, SM, Central Entity, CASTLE.]

streams [2]. In particular, we use the widely recognized algorithm for stream

anonymization, CASTLE [2], and study its characteristics and suitability. For

our study, we consider a typical SM architecture in which distributed SMs

send their data to a central entity (CE). We evaluate the suitability of ks-

anonymity for SM data based on metrics such as information loss and range

of the sensitive attribute, and compare the performance of continuous piece-

wise anonymization with an idealized anonymization as baseline.

Our results suggest that ks-anonymity is a reasonable choice for anony-

mizing smart meter data. Based on our metrics, the performance of contin-

uous data anonymization appears to be comparable to our baseline. Further

analysis of the diversity of consumption measurements shows that in most

clusters, the values of the sensitive attribute are distributed over a wide

range and are not clustered around a single consumption value. Addition-

ally, we note that the prioritization of attributes during the anonymization

process depends on the different magnitudes of the attribute ranges. This

should be taken into account in any case, but can also be exploited to shape

the process to a certain degree.

The paper is organized as follows. After introducing our problem state-

ment as well as ks-anonymity and CASTLE in Sec. 2 and 3, we present our

evaluation in Sec. 4. In Sec. 5 we conclude the paper.

2 Problem Statement and Related Work

Problem Statement. Our goal is to analyze whether continuous ks-anonymization
is suitable for SM data. Since the data type differs from other relational
data sources in some crucial characteristics, this is anything but obvious.
The data points generated by SMs are discrete measurements of user

consumption, e.g., electricity consumption from a continuous data stream.

Different strategies can be applied to discretize the data stream, such as

using the current consumption value or aggregating the entire consumption

between two measurements. Thus, SM datasets can differ in their characteristics,
such as the temporal granularity of the measurements.

For our evaluation, we use a realistic architecture in which distinct SMs

measure the consumption and forward the data directly to a trusted CE,

e.g., the energy provider. Figure 1 visualizes this architecture. To facilitate


further processing by third parties, e.g., for district management, the data

is collected and anonymized centrally before it is forwarded.

Related Work. There are several approaches to avoid accurate profiling

and disclosure of information based on smart meter measurements. For

instance, load balancing and shaping prevent characteristic traces in con-

sumption data, while other approaches focus on achieving privacy by de-

sign with specific architectures. Another focus is on protecting privacy by

anonymizing consumption data, e.g., with k-anonymity [10] or differential

privacy [3]. One algorithm to achieve k-anonymity for streaming data is the

one analyzed in this paper—CASTLE. Several other algorithms exist, some

of which also address challenges of CASTLE, such as [4,7,8,11]. However, to

the best of our knowledge, there are no studies that evaluate the suitability

of ks-anonymity specifically for smart meter data.

3 ks-Anonymity and CASTLE

We focus on ks-anonymity [2], which is an extension of k-anonymity [10] for
streaming data. The main idea of k-anonymity is to modify and group data items
in such a way that each group comprises at least k entries that are
indistinguishable from each other. Such a group is called an Equivalence
Class (EQ). ks-anonymity [2] extends this idea and requires that a published
anonymized stream comprises EQs with at least k distinct individuals, not
just k entries.
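The difference between counting entries and counting individuals can be illustrated with a minimal sketch. The record layout (a user ID paired with an already generalized QI tuple) and the function name are assumptions made for illustration; they are not taken from CASTLE or from our code.

```python
from collections import defaultdict

def check_anonymity(records, k):
    """Return (k_anonymous, ks_anonymous) for already generalized records.

    records: iterable of (uid, generalized_qi) pairs (hypothetical layout).
    """
    entries = defaultdict(int)      # number of tuples per equivalence class
    individuals = defaultdict(set)  # distinct user IDs per equivalence class
    for uid, qi in records:
        entries[qi] += 1
        individuals[qi].add(uid)
    k_anonymous = all(n >= k for n in entries.values())
    ks_anonymous = all(len(uids) >= k for uids in individuals.values())
    return k_anonymous, ks_anonymous

# Three entries in one equivalence class, but only two distinct meters.
records = [
    ("meter-1", ("10115-10119", "08:00-09:00")),
    ("meter-1", ("10115-10119", "08:00-09:00")),
    ("meter-2", ("10115-10119", "08:00-09:00")),
]
print(check_anonymity(records, k=3))
```

In this example the class holds three entries but only two individuals, so it would pass an entry-based check for k = 3 and fail the individual-based one.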

CASTLE [2] is an established algorithm to achieve ks-anonymity by as-

signing incoming data points, called tuples, to clusters that represent their

generalization. The tuples are specified in a metric space defined by the

so-called Quasi-identifiers (QIs) [10]. The clusters are EQs, where all data

points share the same generalized values for each QI attribute. Each cluster

must contain at least k distinct individuals. CASTLE either creates a new

cluster or assigns the tuple to an existing cluster by minimizing the infor-

mation loss. As information loss metric, CASTLE uses the Generalized Loss

Metric [5]. For cluster generalization, QI attributes either form intervals,

in the case of continuous attributes, or they are generalized to their lowest

common ancestor with respect to their corresponding domain generalization

hierarchy (DGH) for categorical attributes. A DGH is a directed tree struc-

ture that defines hierarchical values for such categorical attributes. CASTLE

also uses a delay constraint δ that specifies the maximum time that can pass

before a tuple needs to be generalized and published. The clusters that were

anonymized with CASTLE can then be published and used for further pro-

cessing, e.g., by third-party data processors.
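The assignment step for continuous QIs can be sketched as follows. This is a simplified illustration under stated assumptions, not the CASTLE or CASTLEGUARD implementation: clusters are plain lists of intervals, all QIs are numeric, and the creation of new clusters, the delay constraint δ, and the DGH handling of categorical attributes are omitted. All function names are hypothetical.

```python
def enlarged(interval, value):
    """Smallest interval that covers both the old interval and the new value."""
    lo, hi = interval
    return (min(lo, value), max(hi, value))

def info_loss(cluster, domains):
    """Average normalized interval width over all QI attributes."""
    widths = [(hi - lo) / (U - L) for (lo, hi), (L, U) in zip(cluster, domains)]
    return sum(widths) / len(widths)

def best_cluster(clusters, tup, domains):
    """Index of the cluster whose information loss grows least when absorbing tup."""
    def enlargement(cluster):
        grown = [enlarged(iv, v) for iv, v in zip(cluster, tup)]
        return info_loss(grown, domains) - info_loss(cluster, domains)
    return min(range(len(clusters)), key=lambda i: enlargement(clusters[i]))

# Two QIs: an integer-encoded address and a time slot (illustrative domains).
domains = [(0, 1000), (0, 96)]
clusters = [
    [(100, 200), (10, 20)],
    [(600, 700), (10, 20)],
]
print(best_cluster(clusters, (210, 15), domains))  # tuple close to the first cluster
```

Because the widths are normalized by the attribute's overall range, an attribute with a small domain dominates the enlargement cost, which is one way to see why attribute magnitudes influence the process.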

We can already observe that during the anonymization process, the QIs

are used for the generalization. Accordingly, we expect that their different

magnitudes will play a crucial role. At the same time, note that the
sensitive attribute is not considered in the process. This could enable
attacks if users in a cluster have consumption ranges that differ
significantly from each other.

4 Evaluation

Methodology. For our evaluation, we use a dataset of electricity consumption
measurements that is publicly available at UCI¹. The dataset consists of
consumption data from 370 clients, measured every 15 minutes between 2011 and
2015. Based on the consumption profiles in the dataset, we infer that the set
comprises data of individual households as well as larger consumers such as
schools, hospitals, or small industry². The original dataset contains only
measurement data and no additional information about clients. We therefore
added synthetic addresses, modeling a district in Berlin, where zip code,
street, and house number are encoded in an integer value. Furthermore, we
adapted the format of the timestamps. Due to the size of the dataset, we
sampled it (weeks 46 & 47 of November 2014), resulting in 164 102 tuples.

We use the publicly available CASTLEGUARD implementation [9]. With the
differential privacy feature disabled, as in our experiments, it resembles the
CASTLE algorithm. Since we identified potential bugs, we made some minor
adaptations to the code³, e.g., in the function merge_clusters. We provide
our code, including the respective changes, as well as the dataset on GitHub.⁴

We simulated a distributed ks-anonymization with CASTLEGUARD for different
δ and k, and compared it to a global ks-anonymization process in which
all data tuples are known in advance and then clustered all at once. The
latter is simulated using the ARX anonymization tool [1].

Information Loss. For our evaluation, we use information loss as utility

metric. Specifically, we use the Generalized Loss Metric (GLM) [5], which is

also used for estimating the information loss in CASTLE. Here, the clus-

ter range of a generalized attribute is compared to the overall range of

this attribute. For each entry, the information loss of an attribute is defined
as $\frac{u_i - l_i}{U - L} \in [0, 1]$, where $u_i$ and $l_i$ are the upper
and lower limits of entry $i$’s attribute generalization, and $U$ and $L$ are
the overall upper and lower limits of this attribute, respectively. For our
evaluation, we calculate the average information loss across all clusters per
attribute.
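As a worked illustration of this metric for a single numeric attribute, the following sketch computes the per-cluster loss and its average. The function names and the example intervals are assumptions chosen for illustration and do not stem from the dataset.

```python
def attribute_loss(lower, upper, domain_lower, domain_upper):
    """Normalized width (u_i - l_i) / (U - L) of one generalized interval."""
    return (upper - lower) / (domain_upper - domain_lower)

def average_loss(cluster_intervals, domain):
    """Average information loss of one attribute across all clusters."""
    L, U = domain
    losses = [attribute_loss(l, u, L, U) for l, u in cluster_intervals]
    return sum(losses) / len(losses)

# Two clusters generalize an attribute with overall range [0, 100]
# to the intervals [0, 25] and [50, 100].
print(average_loss([(0, 25), (50, 100)], (0, 100)))
```

The two clusters lose 0.25 and 0.5 of the attribute range, so the average is 0.375. A value of 0 means no generalization, and a value of 1 means the interval spans the entire attribute range.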

Figure 2 shows the average information loss of all clusters for varying k
and δ for the address (left plot) and time (right plot) attributes, respectively.

¹ https://doi.org/10.24432/C58C86
² The magnitude of consumption values suggests that the values are given in Watt
instead of kW as noted in the description of the dataset.
³ We have reached out to the developers to discuss the bugs/changes.
⁴ https://github.com/carolin-brunn/dpm-castle-analysis


Fig. 2: Average information loss for quasi-identifying attributes.
[Panels: Information loss - Address; Information loss - Time. Series: delta = 100.0, delta = 400.0, ARX.]

We observe that the information loss is highest for the address attribute in
most settings. The clusters must always contain k distinct individuals that
all have different addresses; thus, whenever a cluster is created, the
address attribute needs to be generalized. The information loss for the
address increases with increasing k. This is expected since a larger k
requires more distinct individuals with different addresses. We also observe
that for CASTLE, the information loss increases as the ratio between k and
δ increases. Presumably, CASTLE is forced to join very different clients if
many clients have to be extracted from a relatively small sliding window.
Overall, it is noticeable that the address information loss is comparable for
ARX and CASTLE. For lower k, ARX has a lower information loss than CASTLE’s
sequential generalization with δ = 100. However, for larger k and δ,
the advantage of ARX fades. For δ = 400, CASTLE consistently finds better
clusters that result in a lower information loss when compared to ARX.

For the time attribute, the information loss of ARX and CASTLE is com-

parable. Note, however, that in the beginning ARX has a higher information

loss than CASTLE. Presumably, this is caused by ARX’s anonymization strategy:
ARX chooses the same generalization level for all values of the same

attribute. Consequently, one cluster that requires a higher level of gener-

alization may cause all other clusters that could be formed with a lower

generalization to be published with the unnecessary generalization.

In general, the results suggest that CASTLE is a reasonable alternative

to a global generalization with ARX, especially for larger δ. Nevertheless,

attribute ranges seem crucial for the prioritization when generalizing at-

tributes. Consequently, analyzing the exact behavior of CASTLE with at-

tributes of different magnitudes and diverse parameter settings is necessary

to find optimal settings for the anonymization of smart meter data, which we

will investigate in the remainder.

UID Diversity. Next, we compare the size of the published clusters and the

diversity of unique identifier (UID) values in these clusters. The k value is

the required minimum number of UIDs per cluster. Therefore, a larger UID

diversity means a larger number of distinct individuals that protect each

other from information disclosure. In contrast, very large clusters with a

low UID diversity indicate that many data tuples correspond to the same
individuals. This could compromise privacy as a person may have similar
consumption values, resulting in low diversity of consumption values and
potentially disclosing information.

Fig. 3: Cluster size [tuples] in relation to the UID diversity.
[Panels: CASTLE, delta = 100; CASTLE, delta = 400; ARX. x-axis: UID diversity; y-axis: Cluster size.]

Figure 3 shows the cluster size in tuples against the UID diversity for
different values of k and δ. For ARX, we observe that the UID diversity of
most clusters is between 2·k and 2.5·k. Moreover, the clusters generated
with ARX are about the size of their UID diversity.

For CASTLE, we observe that δ significantly influences the cluster sizes.
For better visibility, we excluded a few clusters that were larger than 500
tuples, which were most likely caused by an unfavorable combination of tuples
due to an expiring δ. In Figure 3, the cluster size increases with larger δ,
while the range of UID diversity remains about the same. We suspect that
this is caused by the nature of the dataset. The extracted sample includes
about one-third of the available data points, i.e., measurements of about 120
clients per time point, and each client appears on average 1-2 times per
hour. One hour corresponds to approx. 490 data points. Thus, for δ = 100,
each client that appears has about 1 data point in the sliding window when
the clusters are created. Consequently, the cluster size and UID diversity are
about the same. For δ = 400, the sliding window can contain several tuples
per client. In this case, multiple time points belonging to the same client
are mostly included in the same cluster, resulting in larger clusters with the
same UID diversity. This is also reflected by our information loss analysis of
the time attribute above.
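The two quantities compared in this analysis can be computed per cluster as in the following minimal sketch. The tuple layout (UID, timestamp, consumption) and the example values are assumptions for illustration only.

```python
def cluster_stats(cluster):
    """Return (cluster size in tuples, UID diversity) for one published cluster.

    cluster: list of (uid, timestamp, consumption) tuples (hypothetical layout).
    """
    size = len(cluster)
    uid_diversity = len({uid for uid, _, _ in cluster})
    return size, uid_diversity

# Two clients that each contribute three measurements to the same cluster,
# as can happen when the sliding window holds several tuples per client.
cluster = [
    ("meter-1", "2014-11-10 08:00", 210.0),
    ("meter-1", "2014-11-10 08:15", 215.0),
    ("meter-1", "2014-11-10 08:30", 190.0),
    ("meter-2", "2014-11-10 08:00", 950.0),
    ("meter-2", "2014-11-10 08:15", 940.0),
    ("meter-2", "2014-11-10 08:30", 990.0),
]
print(cluster_stats(cluster))
```

Here the cluster contains six tuples but only two distinct individuals, which corresponds to the pattern of large clusters with unchanged UID diversity.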

Consumption Range. Next, we analyze the diversity and distribution of the

sensitive attribute, i.e., electricity consumption. Our initial analysis showed

that almost all settings have a diversity of the sensitive attribute that is

at least approximately equal to the UID diversity, suggesting a high level
of privacy protection. However, we do not include the results in this paper
since diversity was designed for categorical, but not numerical, attributes.
In particular, it does not take the range or similarity of values into
account, as previously described in [6].


Fig. 4: Consumption range against UID diversity per cluster.
[Panels: CASTLE, delta = 100; CASTLE, delta = 400; ARX. x-axis: UID diversity; y-axis: Consumption range (×10^4).]

Fig. 5: Average proximity ratio of tuples in clusters.
[Panels: CASTLE, delta = 100; CASTLE, delta = 400; ARX. x-axis: Proximity ratio in cluster; y-axis: ECDF.]

Instead, we consider the range e of the sensitive attribute in the clusters,
inspired by (k, e)-anonymity [12]. Figure 4 shows the consumption range
against the UID diversity. We see that k and the UID diversity only slightly
influence the consumption range for CASTLE. Indeed, clusters with a given
UID diversity exhibit widely differing ranges of the sensitive attribute.

The same applies to ARX: independent of k, the clusters exhibit widely
differing ranges. The consumption of individual households is expected to be in

smaller ranges typical for the number of members in a household. Compared

to that, larger clients such as schools or industry have larger consumption

with more variance. The results in Figure 4 suggest that different types of

clients are included in many clusters for both processing strategies.
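The range of the sensitive attribute per cluster can be computed as in the following sketch. The threshold check is an added illustration in the spirit of (k, e)-anonymity, not part of our evaluation; the parameter e_min and all consumption values are hypothetical.

```python
def consumption_range(values):
    """Range e of the sensitive attribute within one cluster."""
    return max(values) - min(values)

def meets_range_threshold(values, e_min):
    """True if the cluster's range is at least e_min (hypothetical threshold)."""
    return consumption_range(values) >= e_min

# Illustrative values: a cluster of similar households versus a cluster that
# mixes households with one much larger consumer.
household_cluster = [210.0, 260.0, 305.0]
mixed_cluster = [210.0, 260.0, 18500.0]

print(consumption_range(household_cluster), consumption_range(mixed_cluster))
```

A cluster that mixes client types yields a far larger range than one that contains only similar households.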

Consumption Proximity. Information about the range does not capture
the distribution of the sensitive attribute. We therefore analyze the
difference between neighboring consumption values in a cluster by analyzing
their relative ϵ-neighborhood with ϵ = 0.2, as described in [6]. We calculate
the proximity ratio as the average percentage of tuples in a cluster that
have other tuples in this cluster within their 0.2-neighborhood. A high
proximity ratio could facilitate a proximity breach, which means that an
attacker can infer that the sensitive attribute lies within a small interval [6].
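One possible reading of this computation is sketched below. Two assumptions are made and should be checked against [6] and our published code: the relative ϵ-neighborhood of a value x is taken to be the interval [x(1 − ϵ), x(1 + ϵ)], and the proximity ratio is taken to be the fraction of other tuples inside that interval, averaged over all tuples of the cluster. The function name is hypothetical.

```python
def proximity_ratio(values, eps=0.2):
    """Average fraction of other values lying in each value's relative eps-neighborhood.

    Assumes the neighborhood of x is [x * (1 - eps), x * (1 + eps)].
    """
    n = len(values)
    if n < 2:
        return 0.0
    total = 0.0
    for i, x in enumerate(values):
        lo, hi = sorted((x * (1 - eps), x * (1 + eps)))
        neighbors = sum(
            1 for j, y in enumerate(values) if j != i and lo <= y <= hi
        )
        total += neighbors / (n - 1)
    return total / n

# Illustrative consumption values: only 100 and 110 are close to each other.
print(proximity_ratio([100, 110, 200, 1000]))
```

In the example, two of the four values each have one of three possible neighbors, which gives a ratio of 1/6. A cluster of identical values would yield 1.0, the worst case for proximity privacy.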

Figure 5 shows the distribution of these values as empirical cumulative
distribution plots. The larger k is, the fewer tuples are in the
0.2-neighborhood of each other, indicating better privacy protection since
the values of the sensitive attribute are less similar within a cluster. We
observe no substantial difference between the results obtained with CASTLE
and ARX. For k = 10, the clusters generated by CASTLE show even less
proximity than those of ARX. This means that the privacy obtained with the
sequential ks-anonymization is comparable to the global anonymization
realized with ARX.

5 Conclusion

In this paper, we analyzed the suitability of ks-anonymity for smart meter

data in a centralized architecture. Our results suggest that the continu-

ous ks-anonymization with CASTLE is comparable to a global anonymization

with ARX. Therefore, we consider ks-anonymity a reasonable approach for
smart meter data anonymization. The exact influence of certain parameters,
such as window size, requires further research in order to find optimal
settings for specific use cases. Additionally, the constraints of numerical
data such as electricity consumption must be considered, and suitable metrics
for evaluating the privacy of anonymized data have to be chosen, for
instance, the analysis of proximity instead of “pure” diversity.

References

1. ARX homepage, https://arx.deidentifier.org/, last accessed 14 June 2023

2. Cao, J., Carminati, B., Ferrari, E., Tan, K.: CASTLE: continuously anonymizing

data streams. IEEE Trans. Dependable Secur. Comput. 8(3), 337–352 (2011)

3. Dwork, C.: Differential privacy in new settings. In: SODA 2010 (2010)

4. Guo, K., Zhang, Q.: Fast clustering-based anonymization approaches with time

constraints for data streams. Knowledge-Based Systems 46, 95–108 (2013)

5. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: ACM SIGKDD

2002. pp. 279–288 (2002)

6. Li, J., Tao, Y., Xiao, X.: Preservation of proximity privacy in publishing numerical

sensitive data. In: ACM SIGMOD 2008. pp. 473–486 (2008)

7. Mohamed, M.A., Nagi, M.H., Ghanem, S.M.: A clustering approach for anony-

mizing distributed data streams. In: ICCES 2016. pp. 9–16. IEEE (2016)

8. Pallas, F., Legler, J., Amslgruber, N., Grünewald, E.: RedCASTLE: practically
applicable ks-anonymity for IoT streaming data at the edge in Node-RED. In:
M4IoT@Middleware 2021. pp. 8–13. ACM (2021)

9. Robinson, A., Brown, F., Hall, N., Jackson, A., Kemp, G., Leeke, M.:
CASTLEGUARD: Anonymised data streams with guaranteed differential privacy. In:
DASC/PiCom/CBDCom/CyberSciTech 2020. pp. 577–584. IEEE (2020)

10. Sweeney, L.: k-anonymity: A model for protecting privacy. Int. J. Uncertain. Fuzzi-

ness Knowl. Based Syst. 10, 557–570 (2002)

11. Yang, L., Chen, X., Luo, Y., Lan, X., Wang, W.: Idea: A utility-enhanced approach

to incomplete data stream anonymization. Tsinghua Science and Technology

27(1), 127–140 (2022)

12. Zhang, Q., Koudas, N., Srivastava, D., Yu, T.: Aggregate query answering on

anonymized tables. In: IEEE ICDE 2007. pp. 116–125 (2007)