Pervasive and Mobile Computing 103 (2024) 101951
Available online 28 May 2024
1574-1192/Β© 2024 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license
(http://creativecommons.org/licenses/by/4.0/).
Contents lists available at ScienceDirect
Pervasive and Mobile Computing
journal homepage: www.elsevier.com/locate/pmc
On data minimization and anonymity in pervasive mobile-to-mobile
recommender systems
Tobias Eichingerβ, Axel KΓΌpper
Technische UniversitΓ€t Berlin, Service-centric Networking (SNET), Ernst-Reuter-Platz 7, Berlin, 10587, Germany
ARTICLE INFO
Dataset link: https://github.com/TEichinger/d
ec-cf-sim
Keywords:
Data minimization
Anonymity
Decentralized recommender system
Distributed gradient descent
Pervasive computing
Mobile computing
ABSTRACT
Data minimization is a legal principle that mandates limiting the collection of personal data
to a necessary minimum. In this context, we address ourselves to pervasive mobile-to-mobile
recommender systems in which users establish ad hoc wireless connections between their mobile
computing devices in physical proximity to exchange ratings that represent personal data on
which they calculate recommendations. The specific problem is: How can users minimize the
collection of ratings over all users while only being able to communicate with a subset of
other users in physical proximity? A main difficulty is the mobility of users, which prevents,
for instance, the creation and use of an overlay network to coordinate data collection. Users,
therefore, have to decide whether to exchange ratings and how many when an ad hoc
wireless connection is established. We model the randomness of these connections and apply
an algorithm based on distributed gradient descent to solve the distributed data minimization
problem at hand. We show that the algorithm robustly produces the least amount of connections
and also the least amount of collected ratings compared to an array of baselines. We find that
this simultaneously reduces the chances of an attacker relating users to ratings. In this sense, the
algorithm also preserves the anonymity of users, yet only of those users who do not establish
an ad hoc wireless connection with each other. Users who do establish a connection with each
other are trivially not anonymous toward each other. We find that users can further minimize
data collection and preserve their anonymity if they aggregate multiple ratings on the same
item into a single rating and change their identifiers between connections.
1. Introduction
Privacy is traditionally associated with the autonomy and freedom of individuals [1,2]. It forms the basis of a free so-
ciety and is, therefore, widely considered worthy of protection. A popular way to protect privacy is by law, where modern
data protection laws around the globe such as the European Unionβs General Data Protection Regulation (GDPR), Californiaβs
California Privacy Rights Act (CPRA), South Africaβs Personal Information Act (POPIA), or Brazilβs Lei Geral de ProteΓ§Γ£o de Dados
(LGPD) unanimously emphasize the protection of personal data. Across all data protection laws, data minimization is a central data
protection principle that mandates the limitation of collection and processing of personal data.
One way to apply data minimization is to mandate purpose limitation. Purpose limitation means that when personal data are
collected, the purposes of their collection must be specified and the collected data may only be processed to fulfill these purposes.
Purpose limitation imposes a burden of proof on data controllers, that is those who collect and process personal data, and data subjects,
that is, those individuals to whom the data relate. Data controllers must make it clear to their data subjects for which purposes they
βCorresponding author.
https://doi.org/10.1016/j.pmcj.2024.101951
Received 1 September 2023; Received in revised form 5 May 2024; Accepted 25 May 2024
Pervasive and Mobile Computing 103 (2024) 101951
2
T. Eichinger and A. KΓΌpper
Fig. 1. Visualization of Shanmugam et al.βs [4] data minimization paradigm, based on the three phase model of Hestness et al. [16]. Data minimization is
characterized as a stopping problem, where data collection is to be stopped beyond the sufficiency point and at best at the saturation point. We adapt the
denomination of the three phases and the two points that separate them to the data minimization use case.
collect personal data and that the personal data that they collect is actually necessary to fulfill these purposes. We see that the tacit
assumption behind the principle of purpose limitation is that it is possible to decide whether processing a piece of personal data is
necessary for the fulfillment of a purpose.
In many application scenarios, it is possible to understand whether the collection and processing of personal data is necessary
to fulfill a purpose. It is, for instance, clear that it is necessary for an online shop to collect a customerβs address for the purpose
of delivery. In some application scenarios, however, it is not straightforward to decide whether the collection of personal data is
necessary for the fulfillment of that purpose. One such purpose is the provision of personalized content on, for instance, social media
platforms or e-commerce websites, where the selection of personalized content is performed by recommender systems. In the context
of personalization in recommender systems, it is not straightforward to decide whether the collection and processing of personal
data is necessary.
Biega et al. [3] outline that the main difficulty in applying purpose limitation in recommender systems is that it is typically
not possible to decide for a particular piece of personal data whether it is necessary to be collected. They, therefore, propose to
collect any personal data that can be considered relevant for the purpose of personalization until reaching a performance threshold.
Shanmugam et al. [4] argue that the use of an absolute performance threshold may cause a system to collect personal data while
only marginally increasing performance. While performance generally increases as the use of personal data increases, performance
gains decrease [3,5β7]. They propose to collect personal data until the performance gains, rather than the performance itself, reach
a threshold. Fig. 1 illustrates this data minimization paradigm.
In our own previous work, we study the data minimization problem as per Fig. 1 in decentralized recommender systems [8]. In
decentralized recommender systems, we are faced with multiple distinct instances of the data minimization problem, one for each
user in the system. In this paper, we specifically recapitulate our findings in the context of a pervasive mobile-to-mobile recommender
system in which users collect personal data through pairwise data exchanges in physical proximity (see for instance [9β15]).
Limitation of data exchanges between users to physical proximity is in itself a data minimization method, as every user only has
access to the personal data of a subset of users in the system. This limitation makes it, however, easier to identify the user to
whom personal data relate. We thus discuss the link between data minimization and anonymity in general and study the effects of
(A) users changing their identifiers and (B) aggregating their own personal data with that of others in particular. We begin with a
recapitulation of our distributed data minimization method.
2. Methods and material
The core methods we describe in this section have already been presented in our previous publication [8]. The source code that
implements both previously proposed methods and methods that we propose in this paper is publicly available [17]. It allows us
to reproduce all our results. We first introduce some basic conventions and notations that are subject to the application domain
of recommender systems. We then formally describe the pairwise connections that users establish in physical proximity. For each
pairwise connection that users establish, we formally describe the way in which users exchange personal data. On this basis, we
formulate data minimization as a distributed optimization problem and present an algorithm based on distributed gradient descent
to solve it. We begin with some conventions and notations.
Pervasive and Mobile Computing 103 (2024) 101951
3
T. Eichinger and A. KΓΌpper
Fig. 2. Example of three users π’, π£, π€ βπexchanging rating data on four items π1, π2, π3, π4βπΌin two consecutive and pairwise data exchanges π’βπ€and π’βπ£.
The amount of data collected overall hinges on (a) the number of data exchanges and (b) the number of ratings exchanged in a data exchange. π
denotes the
user-item matrix and π
π’, π
π£, π
π€individual collected datasets.
2.1. Conventions and notations
We assume that users derive recommendations on their mobile phones on the basis of the data they collect from nearby other
users. We assume, more specifically, that all users make use of a collaborative filtering algorithm to derive recommendations.
Collaborative filtering generally denotes a class of recommendation algorithms [18]. These algorithms take ratings as input and
yield predicted ratings as output. Predicted ratings are estimates of how a user would rate an item they have not seen yet. Based on
these predicted ratings, recommendations can be displayed as a list of items with the highest predicted ratings. These assumptions
do not represent a limitation. Our methods can readily be applied to data types other than ratings and algorithms other than
collaborative filtering algorithms, provided that the performance of recommendations can be measured and performance differences
can be characterized as zero, marginal, or large. In a system with real users, one could thus regularly ask users to grade the quality of
recommendations or track their behavior within the system to measure the quality of recommendations and characterize performance
differences.
We measure performance with respect to the discrepancy between usersβ actual ratings and the predicted ratings output by a
collaborative filtering algorithm to derive recommendations for them. We denote by ππ’,π the rating by some user π’βπon some
item πβπΌ, where πand πΌdenote sets of users and items, respectively. We use the hat notation ξππ’,π to denote predicted ratings.
We denote by πuser and πitem the number of users and items in πand πΌ, respectively. We write all ratings ππ’,π into the πuser Γπitem
user-item matrix π
and denote by |π
|the number of ratings in π
. We call a row ππ’in the user-item matrix π
, corresponding to the
ratings by some user π’βπ, the profile of user π’. We use the bar notation ππ’to denote user π’βs mean rating.
We denote by sim(π’, π£) β [0,1] the cosine similarity between the profiles of some two users π’, π£ βπdefined by sim(π’, π£) = <ππ’,ππ£>
βππ’ββ
βππ£β,
where missing ratings are substituted with zeros, <β
,β
>denotes the dot product, ββ
βthe Euclidean norm, and ππ’, ππ£the profiles
of some users π’and π£respectively. The cosine similarity is widely used to describe two usersβ similarity with respect to their item
preferences. If, for instance, two users have not rated a single item in common, then their cosine similarity is zero. Cosine similarity
is positive if users have rated at least on item in common, and is larger the closer their ratings on those commonly rated items align.
In decentralized collaborative filtering systems, every user only has their own profile to begin with and thus needs to collect
ratings from nearby other users. We denote by ππ’βπ£the payload that user π’sends to π£. Payloads are sets of previously collected
profiles, which may include the senderβs own profile. We denote by π
π’the collected dataset of user π’, which consists of π’βs own profile
and all payloads that π’collected. We more generally denote by the union of all usersβ collected datasets {π
π’}π’βπadata distribution
and define |{π
π’}π’βπ|=βπ’βπ|π
π’|as the number of ratings in it.
We denote by an epoch π‘a time interval of fixed length during which the users in πestablish connections. We assume that all
connections that are established within the same epoch are independent of each other. If, for instance, a user π’establishes two
connections to two users π£and π€in epoch π‘respectively, we assume that π’cannot send ratings to π€that π’receives from π£in
epoch π‘, and vice versa. In other words, we assume that all connections within the same epoch happen synchronously, although, in
practice, this will not be the case in a pervasive mobile-to-mobile recommender system. All connections that are established during
an epoch are dissolved at the end of the epoch.
Fig. 2 shows an example in which three users π’, π£, π€ βπcollect ratings on four items π1, π2, π3, π4βπΌin two consecutive and
pairwise data exchanges. Initially (π‘= 0), users only have their own profile, where profiles are indicated as row vectors. We see that
the row vectors of the user-item matrix π
are split over all users. Over time, users collect payloads that hold each otherβs ratings.
Eventually (π‘= 2), users π’and π£each hold a full copy of the user-item matrix π
, modulo row permutation, and π€holds two out of
three rows of π
.
Pervasive and Mobile Computing 103 (2024) 101951
4
T. Eichinger and A. KΓΌpper
We denote by ππ’the number of epochs a user π’βπseeks to exchange rating data with other users. We call ππ’the time horizon
of user π’. With respect to the example shown in Fig. 2, users π’and π£have time horizons ππ’=ππ£= 2, while user π€has a time
horizon of ππ€= 1. If all time horizons ππ’coincide, we call the common value πthe global time horizon.
From the perspective of a user, data minimization presents itself as a collection minimization problem as in traditional centralized
systems. With reference to Fig. 1, each user collects ratings in the user-item matrix π
until reaching their individual saturation point,
that is, the point beyond which additional collection only yields marginal performance gains. From the perspective of the system,
however, data minimization presents itself as a redundancy minimization problem. The user-item matrix π
is to be distributed over
all users with minimum redundancy such that every user has reached at least their individual sufficiency point and, at best, their
individual saturation point. Since users can only exchange data in physical proximity, the way in which data distributions form is
subject to the order in which connections between users form. This is what we describe in the following.
2.2. Connection behavior
Connectivity in pervasive mobile-to-mobile recommender systems, as in all wireless ad hoc networks, is characterized by node
mobility and intermittent connectivity. In view of intermittent connectivity, we assume for simplicity that no routing takes place. No
overlay networks are formed and communication is limited to users located within the transmission ranges of each otherβs mobile
computing devices. With this assumption, we particularly omit a treatment of the hidden node problem. In view of node mobility,
we assume that pairwise connections between users are established randomly.
We describe connection behaviors as the random sampling of edges from a graph ξ³= (π, πΈ), where πdenotes the set of users
and πΈthe set of edges between these users. Whenever two users share an edge, we assume that these two users are physically close
such that their mobile computing devices can establish an ad hoc wireless connection. We thus model node mobility in terms of the
changes in the graph ξ³. In the following, we describe the two connection behaviors that we consider in this paper.
2.2.1. Random connection behavior
We sample edges in two steps such that every user establishes a connection with πother users. In the first step, we generate an
ordering (π’1, π’2, .., π’π)on the set of users π. We do so by πtimes uniformly randomly sampling a user from the user set πwithout
replacement. In the second step, we iterate over this ordering. For π= 1,2,β¦, π, we define a subset of edges πΈπβ πΈ that includes
the πth user π’πand excludes all other users π£βπthat have previously been sampled in an edge. Formally, we have
πΈπ= {π= {π’, π£} β πΈ|π’πβπand π’, π£ βππfor all π < π },(1)
where ππdenotes the edge sampled in the πth iteration. Note that the subset πΈπcan be empty (πΈπ= β
). This is particularly the case
if the number of users in πis odd and all users except one form pairs.
We see that in random connection behavior, the graph ξ³is fully-connected and every user can potentially establish a connection
with any other user. Any pair of users has the same probability of establishing a connection. The following connection behavior
assumes that not every pair of users can potentially establish a connection, yet only similar pairs of users.
2.2.2. Similarity-based connection behavior
Similarity-based connection behavior takes into consideration the cosine similarity between usersβs profiles. Two users π’and
π£share a graph in the graph ξ³from which we sample edges if their cosine similarity sim(π’, π£)is positive. Recall that the cosine
similarity is only positive if two users have rated at least one item in common. It is widely known in the recommender systems
literature that, typically, only a fraction of user pairs has rated at least one item in common. As a consequence, when we apply the
same edge sampling procedure as for random connection behavior, we see that with similarity-based connection behavior, every
user can only establish connections with a subset of other users.
Similarity-based connection behavior takes into consideration that users are more likely to be in physical proximity to each other
if they are similar. In this case, we measure similarity in terms of the cosine similarity, that is, the similarity between two usersβ
ratings. Its underlying idea more generally extends to the idea behind homophilous search (see for instance [19]). Now that we
have described the way in which connections form, we describe the communication that happens between pairs of connected users.
2.3. Data exchange protocol
The Data Exchange Protocol (see Algorithm 1) formalizes the way in which a user sends ratings in a payload to a connected other
user. While both users of a connection communicate according to the Data Exchange Protocol, we only show one communication
direction, that is, communication from some user π’to some user π£, and omit the other since both directions are identical. We
assume that all users are benevolent in the sense that they follow the Data Exchange Protocol exactly and without enforcement, since
otherwise, users would not seek to engage with other users in the first place. The Data Exchange Protocol features the following
two consecutive phases.
(π)Similarity Calculation Phase: It is a widely adopted assumption in the literature on decentralized collaborative filtering systems
that two users who are similar with respect to their profiles hold rating data relevant to each other. Therefore, user π’sends the
profile ππ’to π£and receives the profile ππ£from user π£. Upon reception, user π’calculates the cosine similarity sim(π’, π£). Only if the
cosine similarity exceeds π’βs similarity threshold π‘π’β [0,1],π’decides to forward a payload in the Forwarding Phase.
Pervasive and Mobile Computing 103 (2024) 101951
5
T. Eichinger and A. KΓΌpper
Algorithm 1: Data Exchange Protocol
Given: sender π’and receiver π£
Parameters: forwarding parameter ππ’, payload parameter πpayload
1Procedure (π)SimilarityCalculationPhase (π):
2send(ππ’);wait(ππ£);
/* π’calculates the cosine similarity between π’βs profile ππ’and π£βs profile ππ£*/
3similarity = sim(ππ’,ππ£);
/* π’calculates similarity threshold π‘π’*/
4π‘π’βπΉβ1
π’(1 β ππ’);
5end
6if similarity > π‘π’then
7Procedure (ππ)PayloadForwardingPhase (πpayload):
/* π’selects πpayload profiles/aggregated payloads that have previously been collected in
the PayloadForwardingPhase and relate to the πpayload users most similar to π£*/
8ππ’βπ£βselectProfiles(πpayload, π£);
/* [OPTIONAL] π’aggregates multiple ratings on the same item into a single rating */
9ππ’βπ£βaggregatePayload(ππ’βπ£);
10 send(uPayload);
11 end
12 end
13 wait(vPayload);
14 disconnect(π’, π£);
We point out that it is convenient to represent π’βs similarity threshold π‘π’as the a priori probability ππ’β [0,1] that the similarity
sim(π’, π£)between user π’and the a priori unknown user π£exceeds π’βs similarity threshold π‘π’. Here, we only describe the construction
of this probability for the special case that users establish connections with random connection behavior. A general construction
can be found in Appendix A. Assume the distribution of similarities between π’and any other user is known to π’, and its cumulative
distribution function πΉπ’is continuous. Then, 1 β ππ’βΆ= πΉπ’(π‘π’)represents the portion of users that have a similarity lower than π‘π’to
π’, and ππ’represents the portion of users that have a similarity higher than π‘π’to π’. We call ππ’the forwarding parameter of π’. The
forwarding parameter ππ’describes the probability that user π’forwards a payload in the Forwarding Phase of a connection.
(ππ)Payload Forwarding Phase: User π’considers all previously collected profiles, including π’βs own profile, for sending. Here, π’
limits the amount of ratings sent to π£in a payload by only selecting the πpayload profiles collected from the users that are most similar
to π£. We call πpayload the payload parameter. Users further limit the number of ratings they send to π£by aggregating the ratings in
the payload, which we describe in Section 2.4.
The Data Exchange Protocol formalizes the process through which users collect data. We can now formally represent an individual
user π’βs data collection process as a finite sequence of parameter value 3-tuples
((π(π‘)
π’, π(π‘)
π’))π‘=1,2,β¦,ππ’,(2)
where π(π‘)
π’denotes the in epoch π‘projected end of π’βs data collection process, π(π‘)
π’denotes the probability that π’sends a payload in a
connection in epoch π‘, and ππ’the length of π’βs data collection process. For the special case that all forwarding parameter values π(π‘)
π’
coincide over all users π’and are constant over all epochs π‘, we can roughly estimate that each user π’collects about βππ’
π‘=1 π(π‘)
π’payloads
irrespective of the underlying connection behavior. Each payload can be aggregated to decrease the number of ratings it holds, as
we describe in the following.
2.4. Payload aggregation
We call any method that aggregates multiple payload ratings on the same item into a single payload rating a payload aggregation
method. While the payload parameter πpayload determines the number of profiles that a payload consists of, payload aggregation
summarizes the ratings in these πpayload payloads such that every rating refers to only one distinct item. As such, payload aggregation
represents a basic method to minimize data collection. The gentle reader is referred to [20] for details on aggregation in the context
of recommender systems. We present in the following the two payload aggregation methods we study in this paper.
2.4.1. Rating by most similar user (MostSim)
We follow the payload aggregation method proposed by Shokri et al. [21]. Whenever there are multiple ratings on the same
item in a payload, the sender user chooses the rating in the profile that is most similar to the profile of the receiver user. We call
this payload aggregation method MostSim. We make an example to illustrate the MostSim payload aggregation method.
Pervasive and Mobile Computing 103 (2024) 101951
6
T. Eichinger and A. KΓΌpper
Example: With reference to the example in Fig. 2, we illustrate how user π’would aggregate the payload ππ’βπ£=[ππ’
ππ€]in epoch π‘= 2
before sending to user π£with MostSim payload aggregation. Before user π’aggregates the two profiles ππ’=[β 3 5 β ]and ππ€=[1 4 β β ],
π’first calculates the cosine similarities sim(π£, π’)=0.78 and sim(π£, π€)=0.09, and then aggregates them as follows:
MostSim ([ππ’
ππ€],[π ππ(π£, π’)
π ππ(π£, π€)])=MostSim ([β 3 5 β
1 4 β β],[0.78
0.09])=[135β].(3)
We have reduced the number of ratings in the payload ππ’βπ£from four to three ratings.
The above example illustrates how MostSim payload aggregation reduces the number of ratings in a payload if there are multiple
ratings on the same item. The example illustrates further that aggregated payloads have the shape of a single profile. Indeed,
when we interpret aggregated payloads as profiles, it is clear that we can aggregate payloads of payloads. Note that in contrast to
non-aggregated payloads in which users can relate distinct profiles in a payload to distinct users, this is not possible for aggregated
payloads. Users, therefore, relate aggregated payloads to the users who forward them instead of the users whose profiles have been
aggregated (see also Line 8in Algorithm 1in this context). We continue with the second payload aggregation method we study.
2.4.2. Weighted averaging (WeighAvg)
Whenever there are multiple ratings on the same item in a payload, the sender user aggregates them by calculating weighted
average ratings, where weights represent cosine similarities between the profile a rating is in and the profile of the receiver user.
We call this payload aggregation method WeighAvg. We make an example to illustrate the WeighAvg payload aggregation method.
Example: With reference to the example in Fig. 2, we illustrate how user π’would aggregate the payload ππ’βπ£=[ππ’
ππ€]in
epoch π‘= 2 before sending to user π£with WeighAvg payload aggregation. Before user π’aggregates the two profiles ππ’=[β 3 5 β ]
and ππ€=[1 4 β β ],π’again calculates the cosine similarities sim(π£, π’) = 0.78 and sim(π£, π€) = 0.09 first and then aggregates them by
calculating the weighted arithmetic mean for each item as follows:
WeighAvg ([ππ’
ππ€],[π ππ(π£, π’)
π ππ(π£, π€)])=WeighAvg ([β 3 5 β
1 4 β β],[0.78
0.09])=[1 3.10 5 β].(4)
We have again reduced the number of ratings in the payload ππ’βπ£from four to three ratings.
A comparison of WeighAvg and MostSim payload aggregation shows that the reduction in the number of ratings in a payload they
yield is identical. The difference between the two payload aggregation methods resides only in the cardinality of the ratings in the
aggregated payload they produce. This concludes the description of the payload aggregation methods we study. In the following,
we describe data minimization in decentralized recommender systems as a distributed optimization problem.
2.5. Distributed data minimization
We define the distributed data minimization problem as a combination of a local and a global minimization problem.
Definition 1 (Distributed Data Minimization).Let πbe a set of users that exchange data via the Data Exchange Protocol.
Local Data Minimization: Let π > 0. Let further π’βπbe a fixed user with data collection process ((π(π‘)
π’, π(π‘)
π’))π‘=1,2,β¦,ππ’associated
to the sequence (π
(π‘)
π’)π‘=1,2,β¦,ππ’of collected datasets. Then user π’minimizes the amount of collected data locally via
min
((π(π‘)
π’,π(π‘)
π’))π‘=1,2,β¦,ππ’|π
(ππ’)
π’|such that π
(ππ’)
π’is saturated with respect to π. (5)
We call the integer ππ’of a solution π
β
π’
(ππ’)its saturation point.
Global Data Minimization: Let π
(ππ’)
π’denote the final collected dataset that a user π’βπcollects. Then the entirety of all users π
minimizes the amount of collected data globally via
min
{π
(ππ’)
π’}π’βπ|{π
(ππ’)
π’}π’βπ|such that all π
π’are sufficient,(6)
where {π
(ππ’)
π’}π’βπdenotes a data distribution. We call a solution {π
β
π’
(ππ’)}π’βπto the global data minimization problem a
minimized data distribution. If additionally all collected datasets π
β
π’
(ππ’)solve their respective local data minimization problem,
we call {π
β
π’
(ππ’)}π’βπa minimum data distribution.
With respect to Fig. 1, we see that every user tries to solve their respective local data minimization problem by stopping data
collection at their respective saturation point. They not consider the amount of data that other users collect. Differently put, the
local data minimization problem puts each user into the shoes of a centralized recommendation provider who seeks to minimize the
amount of data they collect from other users as in a centralized recommender system. In other words, the local data minimization
problem is the data minimization problem in a centralized recommender system. In decentralized systems, we also have that all
users try to solve the global data minimization problem. A solution to the global data minimization requires that every user passes
their respective sufficiency point.
We see that the local and the global data minimization problems are competing problems and solving both problems simul-
taneously is in general not possible. A short explanation is that usersβ sufficiency points and saturation points, in general, do not
coincide. A more detailed explanation can be found in Appendix B. A comparison of the computational complexity between the
data minimization problem in centralized and decentralized recommender systems can be found in Appendix C. It remains to define
saturation and sufficiency.
Pervasive and Mobile Computing 103 (2024) 101951
7
T. Eichinger and A. KΓΌpper
Definition 2 (Saturation).Let π’βπbe a user with data collection process ((π(π‘)
π’, π(π‘)
π’))π‘=1,2,β¦,ππ’associated to the sequence
(π
(π‘)
π’)π‘=1,2,β¦,ππ’of collected datasets. For a fixed rating prediction method and performance metric π, we denote by ππ’(π
(π‘)
π’)the
performance that user π’measures for the collected dataset π
(π‘)
π’. We further denote by π₯ππ’(π
(π‘)
π’, π
(π‘β1)
π’)the performance difference
that user π’measures between the collected datasets π
(π‘)
π’and π
(π‘β1)
π’, that is between the end of epoch π‘and the end of epoch π‘β 1.
Formally:
π₯ππ’(π
(π‘)
π’, π
(π‘β1)
π’) βΆ= ππ’(π
(π‘)
π’) β ππ’(π
(π‘β1)
π’).(7)
We more briefly write π₯π(π‘)
π’βΆ= π₯ππ’(π
(π‘)
π’, π
(π‘β1)
π’)if the underlying collected datasets π
(π‘)
π’and π
(π‘β1)
π’are clear from the context.
In this setting, let π > 0. Then, if the performance difference π₯π(π‘)
π’satisfies
|π₯π(π‘)
π’|=|ππ’(π
(π‘)
π’) β ππ’(π
(π‘β1)
π’)|β (0, π),(8)
for some collected dataset π
(π‘)
π’, we say that the performance difference π₯π(π‘)
π’is marginal. In this case, we call π
(π‘)
π’an π-saturated
collected dataset, or simply saturated collected dataset if πis clear from the context.
Definition 3 (Sufficiency).Let πand πΌbe sets of users and items respectively and {π
π’}π’βπbe a data distribution. Let further πΌπ’β πΌ
be the subset of items that a user π’βπis interested in and π
π’β {π
π’}π’βπthat userβs collected dataset. We then call the collected
dataset π
π’insufficient if none of the ratings in π
π’refers to any of π’βs items of interest in πΌπ’, and else call it sufficient. More generally,
we call the data distribution {π
π’}π’βπsufficient if all its collected datasets π
π’β {π
π’}π’βπare sufficient, and else call it insufficient.
Finally, we define the sufficiency coefficient ξΏ({π
π’}π’βπ)of a data distribution {π
π’}π’βπas the fraction of sufficient collected datasets
in the data distribution {π
π’}π’βπ:
ξΏ({π
π’}π’βπ) = |{π
π’β {π
π’}π’βπ|π
π’is sufficient}|
|{π
π’}π’βπ|β [0,1],(9)
where |{β
}|denotes the number of elements in the set {β
}. Clearly, a data distribution {π
π’}π’βπis sufficient if and only if its sufficiency
coefficient ξΏ({π
π’}π’βπ)equals 1.
We emphasize that whether a solution to the distributed data minimization problem exists, that is, a solution that minimizes
both the local and the global data minimization problem simultaneously, depends largely on the underlying connection behavior.
Details on when solutions to the individual problems exist with random connection behavior are given in Appendix D. To see that
a solution to neither the local nor the global problem may exist, consider the case that a user can only establish connections with
users to whom the user is not similar. In this case, the user neither forwards nor receives a payload. In the following, we describe
an algorithm that we find to solve the distributed data minimization problem with random and similarity-based data minimization.
2.6. Distributed gradient descent-based data minimization
We propose to apply Distributed Gradient Descent (DGD) data minimization as previously proposed in [8] for use in pervasive
mobile-to-mobile recommender systems. The algorithm makes it such that users coordinate their individual data collection processes
by communicating and adjusting their parameter values π(π‘)
π’and π(π‘)
π’to each other such that their collected datasets saturate
uniformly.
Definition 4 (DGD Data Minimization).Let πbe a set of users that collect ratings via the Data Exchange Protocol. Let further π’βπ
be a user that establishes connections in some epoch π‘with the users in some subset of users π β πβ{π’}, and the parameter variable
πbe either the time horizon πor the forwarding parameter π. User π’then updates π’βs local parameter value ππ’at the end of epoch π‘
for use in epoch π‘+ 1 as follows:
π₯π(π‘)
π’=π(π‘+1)
π’βπ(π‘)
π’
= β πΌπβ
β§
βͺ
β¨
βͺ
β©
1,if |π₯π(π‘)
π’|= 0
1,if |π₯π(π‘)
π’|β (0, π)
0,if |π₯π(π‘)
π’|β [π, β)
βββββββββββββββββββββββββββββββββββββββββββ
innovation term
+π½πβ
1
|π|β
π£βπ
(π(π‘)
π£βπ(π‘)
π’)
βββββββββββββββββββββββββββββββ
consensus term
+πΎπβ
{1,if π(π‘)
π’undefined
0,else,
βββββββββββββββββββββββββββββββββββββββββββ
regularization term
(10)
where πΌπ, π½π, πΎπβ₯0denote non-negative update weights specific to the parameter π,πdenotes the set of πconnection users π’establishes
a connection with in epoch π‘, and π₯π(π‘)
π’denotes the performance differential of user π’at the start and end of epoch π‘with respect to
some performance metric π. If a parameter update would cause a parameter value to fall below or exceed its respective boundary
interval ππ’β [0,β) or ππ’β [0,1], the parameter value is instead set to its lower or upper bound respectively.
Termination: Data collection terminates if at least one of the following criteria are satisfied:
(I) All individual time horizons are smaller than the current epoch (maxπ’π(π‘)
π’< π‘).
(II) All individual forwarding parameters are zero (maxπ’ππ’= 0).
Pervasive and Mobile Computing 103 (2024) 101951
8
T. Eichinger and A. KΓΌpper
Criterion (I) describes the situation that all usersβ data collection processes have ended and no user seeks further connections.
Criterion (II) describes the situation that users do seek connections to other users, yet do not exchange payloads in any such
connection.
This concludes the presentation of the methods we use. In the following section, we establish a link between data minimization
and anonymity in the context of data minimization in general and distributed data minimization in pervasive mobile-to-mobile
recommender systems in particular.
3. Theory
The goal of this section is to establish a link between data minimization and anonymity. In order to do so, we follow in the lines
of Pfitzmann and Hansenβs [22] working document on the disambiguation of anonymity in the context of data minimization and
adapt their definitions to the context of pervasive mobile-to-mobile recommender systems. We begin with a general characterization
of data minimization.
3.1. A general characterization of data minimization
Data minimization as per Pfitzmann and Hansen [22] foresees data minimization as the following nested structure of minimiza-
tion objectives.
(1.) Minimize the possibility to collect personal data.
(2.) Minimize the collection of personal data.
(3.) Minimize the time that collected personal data are stored.
These three minimization objectives are nested in the sense that they are not to be minimized simultaneously, yet instead,
Objective (1.) first, only afterward Objective (2.), and finally Objective (3.). Since the three minimization objectives are formulated
for information systems in general, it is meaningful to discuss them in the context of pervasive mobile-to-mobile recommender
systems.
In a pervasive mobile-to-mobile recommender system, users establish ad hoc wireless connections in physical proximity to
exchange ratings with each other. In view of Objective (1.), we see that the possibility to collect personal data depends on the
mobility of users and their pairwise proximity to each other. Since we model the mobility of users as the random sampling of edges
from a graph, we can quantify the possibility to collect personal data in terms of the number of edges that are sampled, which
represents the number of connections that are established. Since we assume that every user establishes a single connection per
epoch, we address Objective (1.) by having every user π’minimize their time horizon ππ’, that is, the number of epochs in which π’
establishes a connection with another user. On the basis of this minimized amount of connections, we then minimize the collection
of personal data in Objective (2.).
The collection of personal data hinges on the amount of ratings that are exchanged in each connection. The Data Exchange
Protocol foresees the exchange of ratings in two respects. First, in the Similarity Calculation Phase (π), connected users exchange
their own respective profile ratings to calculate the cosine similarity. Only if the cosine similarity exceeds a similarity threshold,
users exchange the previously collected ratings of other users in the consecutive Payload Forwarding Phase (ππ). Recall that the
forwarding parameter ππ’quantifies the probability that a user π’enters the Payload Forwarding Phase (ππ) and sends a payload in
a connection. Recall further that the payload parameter πpayload quantifies the number of profiles in that payload. We see that the
number of profiles forwarded in payloads is βπ’βπβππ’
π‘=1 ππ’β
πpayload. Since we already minimized the time horizon ππ’in Objective (1.)
and πpayload is a constant, we address Objective (2.) by having every user π’minimize their forwarding parameter ππ’. On the basis
of this minimized amount of forwarded payloads, we finally minimize the time that collected personal data are stored.
The time that collected personal data are stored depends on the way in which recommendations are derived. If we assume
that each user trains a collaborative filtering model on their individually collected ratings on their mobile computing device, it
makes sense to store them until their data collection process terminates. After termination of the data collection process, a user
calculates recommendations on the collected data. After recommendations have been calculated, the collected ratings do not serve
their purpose anymore and can thus be deleted. With respect to the representation of individual usersβ data collection processes as
in Eq. (2), we can reformulate the general Objectives (1.), (2.), and (3.) in the context of a pervasive mobile-to-mobile recommender
system for each user π’as follows:
(1.β) Minimize the time horizon ππ’.
(2.β) Minimize the forwarding parameter ππ’.
(3.β) Delete all collected ratings after epoch ππ’and after you have calculated recommendations.
We now see perhaps more clearly that, from the perspective of a single user, the distributed data minimization problem represents
a minimization problem on the parameter space spanned by the time horizon ππ’and forwarding parameter ππ’. From the perspective
of the system as a whole, it represents a minimization problem over the Cartesian product over all these parameter spaces.
So far, we only considered the minimization of ratings irrespective of whether they are the ratings by the user who forwards
them or those of a user who is not involved in a connection. We make this distinction in the following section in the context of a
discussion on anonymity.
Pervasive and Mobile Computing 103 (2024) 101951
9
T. Eichinger and A. KΓΌpper
3.2. Anonymity in terms of unlinkability
The basis for any discussion on anonymity requires the definition of the information of interest that is to be protected. In
the context of a pervasive mobile-to-mobile recommender system as described throughout Section 2, we consider two types of
information of interest, namely identifiers and ratings. Since we wish to characterize anonymity in terms of linkability, we need to
define linkability first. In order to do so, we begin by adapting the definition of linkability as per Pfitzmann and Hansen [22] to
identifiers and ratings.
Definition 5 (Linkability).Linkability of two or more identifiers and/or ratings from an attackerβs perspective means that the attacker
can sufficiently distinguish whether they are related or not.
In the context of the above definition, we may think of attackers as other users in the system or by-standers who are present
in physical proximity of users yet do not participate in the system as users themselves. As identifiers, we may think of for instance
network identifiers, application identifiers, and device identifiers which are used by usersβ mobile computing devices to establish
an ad hoc wireless connection. For linguistic simplicity, we subsume all these different types of identifiers together under the term
identifier. For notational simplicity, we use the symbol π’βπto denote both a user as a person as well as their respective identifier
and always say whether we mean the user themselves or their identifier.
If a user always uses the same identifier, it is clear that other users can link distinct connections they establish with the same
user. We say that a user makes use of a persistent identifier. If users otherwise change their network identifiers between connections,
we say that they use changing identifiers. Making use of changing identifiers makes it more difficult for attackers to link the identifiers
and ratings of a user. It remains to describe how linkability relates to anonymity. In order to do so, we follow again the terminology
as per Pfitzmann and Hansen [22] and adapt their definition of sender anonymity.
Definition 6 (Anonymity (in Terms of Linkability)).We say that a user π’βπwho communicates using the identifier π’is anonymous
from an attackerβs perspective if the attacker cannot sufficiently distinguish between the identifier π’and any other identifier
(anonymity w.r.t. identifiers). Furthermore, we say that a user π’communicates a rating ππ’,π anonymously from an attackerβs
perspective if the attacker cannot sufficiently distinguish between the rating ππ’,π and any other rating (anonymity w.r.t. ratings).
By the above definition, any two users who establish a connection can never be anonymous toward each other, neither with
respect to their identifiers nor their ratings. Only users who do not establish a connection can potentially remain anonymous toward
each other. We make a short example to illustrate this.
Example: We revisit the example in Fig. 2 in which user π’establishes a connection π’βπ€to π€in epoch π‘= 1 and a connection
π’βπ£to π£in epoch π‘= 2. Since connections are established in physical proximity and payloads are exchanged immediately between
connected users, user π’can distinguish user π€from user π£on, for instance, the basis of when and where the two distinct connections
were established. We see that π£and π€cannot be anonymous toward π’with respect to their identifiers. Since π£and π€communicate
their profiles ππ£and ππ€respectively in the Similarity Calculation Phase (π), we see further that π£and π€are also not anonymous
toward π’with respect to their ratings. The users π£and π€themselves, however, can potentially be anonymous toward each other,
since they do not establish a connection. We examine this question in the following.
When user π’forwards π€βs profile ππ€in the payload ππ’βπ£=[ππ’
ππ€]to user π£in epoch π‘= 2, we see that π£can immediately distinguish
between π’βs profile ππ’and π€βs profile ππ€. In this case, π€is trivially not anonymous toward π£with respect to π€βs ratings, since the
non-aggregated payload ππ’βπ£=[ππ’
ππ€]allows to distinguish whether a rating is by the same or by different users. Since the profile
ππ€is marked with the identifier π€, user π€is trivially not anonymous toward user π£with respect to π€βs identifier. The situation is
different if π’aggregates the payload ππ’βπ£=[ππ’
ππ€]through, for instance, WeighAvg to the aggregated payload ππ’βπ£=[1 3.10 5 β ]before
sending to π£(see Eq. (4)). In this case, user π€is anonymous toward user π£with respect to π€βs identifier since the payload ππ’βπ£is
not marked with user π€βs identifier. User π€is additionally anonymous toward π£with respect to π€βs ratings, since the aggregated
payload ππ’βπ£=[1 3.10 5 β ]itself does not allow π£to tell the underlying profiles ππ’and ππ€that have been used for aggregation. Last
but not least, we point out that π£is trivially anonymous toward π€, since π€neither receives π£βs identifier nor any of π£βs ratings.
The above example presents the use of changing identifiers and payload aggregation as methods to retain the anonymity between
users who do not establish an ad hoc wireless connection in physical proximity. We emphasize, however, that anonymity is never
absolute. With respect to the above example, if π£knows, for instance, the order in which the two connections π’βπ€and π’βπ£
happen and the underlying payload aggregation method that π’uses, it is possible for π£to tell the existence of π€and most likely
also to distinguish between π’βs and π€βs ratings in the aggregated payload ππ’βπ£. In other words, the more connections users establish
and the more data they communicate in these connections, the easier it becomes for an attacker to relate identifiers and ratings to
each other.
The two data minimization objectives (1.β) and (2.β) that we formulated in the previous section implicitly aim to countervent the
possibility that an attacker can relate identifiers and ratings to each other. They do so by (1.β) minimizing the number of connections
that are established and (2.β) minimizing the number of ratings that are communicated over these connections. It is in this sense
that data minimization and anonymity are related to each other. In the following section, we present results on the minimization of
data collection according to Objectives (1.β) and (2.β) as well as the effect of users making use of changing identifiers and payload
aggregation.
Pervasive and Mobile Computing 103 (2024) 101951
10
T. Eichinger and A. KΓΌpper
4. Results
We first show how DGD data minimization affects the parameters ππ’and ππ’. We then demonstrate that DGD data minimization
consistently produces sufficient and minimized data distributions.
4.1. Dataset and experimental setup
We present results on a stratified sample of the MovieLens ml-25m dataset [23], which is a widely used benchmark dataset in the
recommender systems literature. The sample comprises 500 profiles that consist of 77,588 ratings on 9816 movies. We split ratings
into training and test sets per user at a ratio of 80 to 20. The training set comprises 62,073 ratings with profiles holding between
16 to 2726 ratings. We report results as means over five distinct splits into training and test data.
We simulate data collection on the training ratings following the Data Exchange Protocol as described in Section 2.3. Since each
user establishes a single connection per epoch, we see that every user establishes π‘connections until epoch π‘. We set πpayload = 3
constant such that the number of profiles per payload remains constant over time, and we can expect that the average amount of
ratings in a payload remains constant over time.
We assess recommendation performance on models trained on every userβs individually collected training ratings and evaluated
on their test ratings. We train models with the help of the Cornac library [24]. We report results on the basis of the Singular Value
Decomposition (SVD) collaborative filtering algorithm, which we find to outperform, for instance, user-based collaborative filtering,
item-based collaborative filtering, and Matrix Factorization consistently. We thereby follow prior works that report the suitability
of SVD in the context of data minimization in recommender systems [3β5].
With respect to DGD data minimization, we experimented with distinct combinations of update weights πΌπ, π½π, πΎπβ {1,2,3,5}
for the time horizon ππ’and πΌπ, π½π, πΎπβ {0.1,0.2,0.3} for the forwarding parameter ππ’. For small update weights π½π= 1 and π½π= 0.1,
results differed only marginally in the amount of collected data and performance. However, larger update weights rendered DGD
data minimization unstable; that is, data distributions tended to become insufficient, and the data minimization procedure sometimes
did not terminate. For simplicity of presentation, we thus only report results for πΌπ=π½π=πΎπ= 1 and πΌπ=π½π=πΎπ= 0.1. Finally,
we consider performance changes π₯ππ’in the interval (0,0.01) marginal, that is, we set the performance threshold π= 0.01.
4.2. Metrics
We measure the recommendation performance with respect to the Root Mean Squared Error (RMSE) metric. The RMSE is a
standard performance metric in the recommender systems literature.
Let π’be a user with training ratings ππ’, test ratings ππ’|test, and collected dataset π
π’. On the basis of the collected dataset π
π’, user
π’derives a predicted rating ξππ’,π for every test rating ππ’,π βππ’|test. Note that π’may not be able to derive predicted ratings for all test
ratings. For instance, user π’cannot predict a rating for an item if π’has not collected any rating on that item. We thus denote by
the subset ξππ’|test β ππ’|test the test ratings for which user π’can derive a predicted rating. Then, we measure the error between the
predicted test ratings and the test ratings as follows:
RMSEπ’(π
π’) = ββββ
1
|ξππ’|test|β
ξππ’,πβξππ’|test (ξππ’,π βππ’,π)2ββββ
1β2
,(11)
where |ξππ’|test|denotes the number of ratings in ξππ’|test. In case that user π’cannot derive any predicted rating for any test rating
(ξππ’|test = β
), we say that the RMSEπ’is undefined. We denote by RMSE the average over all RMSEπ’that are not undefined.
We propose the following redundancy metric ξΎto measure the number of ratings in a data distribution {π
π’}π’βπfor use in
decentralized recommender systems:
ξΎ({π
π’}π’βπ) = βπ’βπ|π
π’|
|π
|β [1,β).(12)
The redundancy metric ξΎquantifies the amount of ratings in the system as multiples of the number of ratings in the
user-item matrix π
. In centralized systems, the amount of ratings is often quantified as a fraction of the ratings in the user-item
matrix π
in the interval [0,1] (see for instance [4,5]). The redundancy metric ξΎextends this metric to decentralized systems.
4.3. Baselines
We compare DGD data minimization against five baselines. The first two baselines are naive baselines that feature static
parameters ππ’and ππ’. They are naive in the sense that they assume that optimal constant parameter values exist for every user
Pervasive and Mobile Computing 103 (2024) 101951
11
T. Eichinger and A. KΓΌpper
Fig. 3. Mean time horizon πand mean forwarding parameter πover time for Distributed Gradient Descent (DGD) data minimization and data minimization
baselines with dynamic parameter values. βΆand βmark the beginning and end of data collection respectively. Error bars indicate deviations of one standard
deviation from the mean.
and that those parameter values are known prior to data collection. The remaining three baselines feature dynamic parameters π
and π. In contrast to the naive baselines, these baselines assume that constant optimal parameter values exist. Parameter values
need to be dynamically updated during data collection.
Static-Global (Static-G): All users employ the same constant parameter values ππ’and ππ’. They employ the same global similarity
threshold π‘instead of individual similarity thresholds π‘π’. Formally: π‘=πΉ(1βπ), where πΉdenotes the cumulative distribution function
of pairwise similarities between users.
Static-Individual (Static-I): All users employ the same constant parameter values ππ’and ππ’. In contrast to Static-Global, users
employ individual similarity thresholds π‘π’. Formally: π‘π’=πΉπ’(1βππ’), where πΉπ’denotes the cumulative distribution function of pairwise
similarities between user π’and all other users.
Dynamic-Centralized (Dyn-C): All users employ the same constant parameter values ππ’=πand ππ’initially. Users stop data
collection as soon as they reach their saturation point. Formally: They set their time horizon to the current epoch (ππ’=π‘) as soon
as their performance differentials become marginal (|π₯π(π‘)
π’|β (0, π)). We call this baseline centralized, since users minimize data
collection individually as if they were a centralized system. Users minimize data collection, in particular, independently from other
users.
Dynamic-Regularization (Dyn-R): All users employ the same constant parameter values ππ’and ππ’initially. Users update their
individual parameter values between epochs only with respect to the regularization term in Equation (Eq. (10)). Formally: Users
apply DGD data minimization with πΌπ=πΌπ=π½π=π½π= 0.
Dynamic-Regularization&Innovation (Dyn-R&I): All users employ the same constant parameter values ππ’and ππ’initially. In
contrast to Dyn-R&I, users not only update their individual parameter values with respect to the regularization term but also with
respect to the innovation term in Equation (Eq. (10)). Formally: Users apply DGD data minimization with π½π=π½π= 0.
We see that users can control the amount of data they collect by controlling their local parameters ππ’and ππ’. Users increase their
parameter values to increase data collection and decrease their parameter values to decrease or terminate it. Before we study the
data distributions that result from users controlling their parameters ππ’and ππ’, we first shed a light on the parameters themselves.
4.4. Data collection parameters
Fig. 3 shows the behavior of the mean time horizon π(π‘)βΆ= 1βπuser βπ’βππ(π‘)
π’and mean forwarding parameter
π(π‘)βΆ= 1βπuser βπ’βππ(π‘)
π’for DGD and the dynamic baselines Dyn-C, Dyn-R, and Dyn-R&I over epochs π‘. We omit showing
mean parameters for the static baselines Static-G and Static-I, because they are constant over time.
For DGD, both πand πincrease initially until they reach a maximum almost simultaneously. Past the maximum, both mean
parameters decrease until πdrops to zero and data collection terminates. Dyn-R&I follows the same pattern except that it exhibits
a milder decrease in the forwarding parameter than DGD. Recall that the parameter update formulas for DGD and Dyn-R&I differ
Pervasive and Mobile Computing 103 (2024) 101951
12
T. Eichinger and A. KΓΌpper
Table 1
Pairs ξΎ(ξΏ)of redundancies ξΎand sufficiency coefficients ξΏfor Distributed Gradient Descent data minimization (DGD) and the five baselines
for distinct combinations of starting parameter values ππ’and ππ’for all users π’βπ(πpayload = 3). Redundancies are crossed out if the underlying
data distribution is insufficient (ξΏ β 100%). Asterisks mark statistical significance of the lowest redundancy under sufficiency per row under a
two-sided Welchβs t-test with Bonferroni correction and significance level πΌ= 0.01.
ππ’ππ’Static-G Static-I Dyn-C Dyn-R Dyn-R&I DGD
Random
Connection Behavior
25 0.0 β β β 53.18 (100%) 13.38 (100%) 10.92β(100%)
0.1 10.94 (67.9%) 10.45 (93.2%) 8.58 (93.3%) 52.81 (100%) 13.40 (100%) 10.64β(100%)
0.2 19.55 (83.8%) 21.73 (99.5%) 16.61 (99.5%) 54.89 (100%) 13.82 (100%) 11.57β(100%)
0.3 28.15 (92.4%) 31.41 (100%) 21.74 (100%) 57.47 (100%) 15.36 (100%) 13.07β(100%)
50 0.0 β β β 60.70 (100%) 13.19 (100%) 10.93β(100%)
0.1 19.12 (75.7%) 21.12 (99.6%) 16.17 (99.6%) 60.40 (100%) 13.25 (100%) 10.64β(100%)
0.2 29.92 (88.9%) 38.04 (100%) 23.16 (100%) 61.26 (100%) 13.69 (100%) 11.56β(100%)
0.3 39.49 (95.6%) 47.25 (100%) 25.20 (100%) 63.30 (100%) 15.28 (100%) 13.08β(100%)
75 0.0 β β β 66.49 (100%) 13.19 (100%) 10.93β(100%)
0.1 23.07 (79.5%) 28.39 (100%) 19.66 (100%) 66.16 (100%) 13.25 (100%) 10.64β(100%)
0.2 34.21 (91.8%) 44.93 (100%) 24.05 (100%) 66.80 (100%) 13.75 (100%) 11.56β(100%)
0.3 44.20 (96.7%) 53.39 (100%) 25.47 (100%) 68.64 (100%) 15.28 (100%) 13.08β(100%)
RandInt(25,75) Rand(0.1,0.3) β 36.33 (99.3%) 20.41 (99.9%) 60.10 (100%) 13.58 (100%) 10.90β(100%)
Similarity-based
Connection Behavior
25 0.0 β β β 52.83 (100%) 11.90 (100%) 10.06β(100%)
0.1 10.64 (71.2%) 10.44 (94.6%) 9.55 (95.6%) 52.45 (100%) 11.95 (100%) 10.01β(100%)
0.2 20.52 (87.7%) 24.87 (99.6%) 18.98 (99.9%) 57.75 (100%) 13.95 (100%) 11.99β(100%)
0.3 30.66 (94.6%) 36.05 (100.0%) 25.32 (100%) 61.31 (100%) 15.77 (100%) 14.12β(100%)
50 0.0 β β β 60.75 (100%) 11.83 (100%) 10.07β(100%)
0.1 18.67 (78.2%) 22.25 (99.6%) 16.85 (99.8%) 60.02 (100%) 11.71 (100%) 10.00β(100%)
0.2 31.26 (91.8%) 41.60 (100%) 24.33 (100%) 64.92 (100%) 13.95 (100%) 12.00β(100%)
0.3 42.40 (97.1%) 52.08 (100%) 28.07 (100%) 68.43 (100%) 15.83 (100%) 14.11β(100%)
75 0.0 β β β 66.46 (100%) 11.83 (100%) 10.07β(100%)
0.1 22.82 (82.1%) 29.84 (100%) 19.51 (100%) 65.51 (100%) 11.71 (100%) 10.00β(100%)
0.2 35.85 (93.6%) 48.73 (100%) 24.97 (100%) 70.16 (100%) 13.81 (100%) 12.00β(100%)
0.3 47.16 (98.0%) 58.07 (100%) 28.21 (100%) 73.35 (100%) 15.83 (100%) 14.11β(100%)
RandInt(25,75) Rand(0.1,0.3) β 39.79 (99.8%) 21.37 (99.9%) 63.01 (100%) 13.76 (100%) 11.90β(100%)
only in DGD having the consensus term (see Section 4.3). Data collection terminates quicker for DGD than for Dyn-R&I due to the
consensus term in the update formula.
In contrast to DGD and Dyn-R&I, both πand πincrease initially yet do not decrease for Dyn-R. Recall that the parameter update
formula of Dyn-R only features the regularization term, while that of Dyn-R&I features both regularization and innovation terms.
We thus attribute the initial increase in mean parameters to the regularization term and the subsequent decrease to the innovation
term. At the beginning of data collection, users only have their own profile, which is an insufficient (RMSEπ’undefined) amount
of rating data. The regularization term thus causes mean parameter values to grow. As users collect rating data over time, the
number of users with an insufficient amount of ratings decreases and the number of users with a sufficient (RMSEπ’defined) amount
of ratings increases. Past the maximum, the decrease in mean parameter values is caused by the innovation term. Users with a
sufficient amount of data begin to outweigh users with an insufficient amount.
For Dyn-C, mean parameters do not increase initially as for Dyn-R, Dyn-R&I, and DGD. In particular, the mean forwarding
parameter πremains constant. The mean time horizon πdecreases when users stop establishing connections with other users as
soon as their performance differentials become marginal (|π₯π(π‘)
π’|β (0, π)). Observe the high variance in mean time horizons π. The
variance is high since users do not coordinate data collection and thus do not agree on when they uniformly end their data collection
processes. Now that we have described the way users collect data in terms of the mean parameters πand π, we measure in the
following the amount of data that are collected.
4.5. Data minimization results
Table 1 shows pairs ξΎ(ξΏ) of redundancy and sufficiency for various starting parameter values ππ’and ππ’. Observe that DGD
consistently produces sufficient (ξΏ= 100%) data distributions of the lowest redundancy. This holds across both random and
similarity-based connection behavior and also for the case that users apply random, and thus not identical, starting parameter values.
Redundancies produced by DGD are significantly lower compared to the redundancies of sufficient data distributions produced by
the baselines. However, Static-G, Static-I, and Dyn-C produce lower redundancies than DGD, although only for insufficient data
distributions.
The static baselines Static-G and Static-I produce the data distributions with the lowest sufficiency coefficients comparatively.
Notably, when users apply the same global similarity threshold as in Static-G, the system as a whole struggles to achieve sufficiency.
This indicates that the use of a global similarity threshold is not recommendable in view of data minimization. We conclude more
Pervasive and Mobile Computing 103 (2024) 101951
13
T. Eichinger and A. KΓΌpper
Table 2
Comparison of redundancyβsufficiency pairs ξΎ(ξΏ)when users do or do not make use of changing identifiers and payload aggregation MostSim or WeighAvg in
combination with Distributed Gradient Descent data minimization (DGD) for distinct combinations of starting parameter values ππ’and ππ’for all users π’βπ
(πpayload = 3). All combinations produce sufficient data distributions (ξΏ= 100%). Pairwise two-sided Welchβs t-test with Bonferroni correction and significance
level πΌ= 0.01 yield that neither of the combinations significantly outperforms all others.
Persistent identifiers Changing identifiers
ππ’ππ’DGD DGD[MostSim] DGD[WeighAvg] DGD DGD[MostSim] DGD[WeighAvg]
Random
Connection Behavior
25 0.0 10.92(100%) 5.25(100%) 5.25(100%) 10.90(100%) 5.24(100%) 5.24(100%)
0.1 10.64(100%) 5.27(100%) 5.26(100%) 10.66(100%) 5.27(100%) 5.26(100%)
0.2 11.57(100%) 5.41(100%) 5.41(100%) 11.59(100%) 5.41(100%) 5.41(100%)
0.3 13.07(100%) 5.91(100%) 5.91(100%) 13.07(100%) 5.91(100%) 5.91(100%)
50 0.0 10.93(100%) 5.25(100%) 5.24(100%) 10.91(100%) 5.23(100%) 5.23(100%)
0.1 10.64(100%) 5.26(100%) 5.25(100%) 10.65(100%) 5.26(100%) 5.25(100%)
0.2 11.56(100%) 5.41(100%) 5.41(100%) 11.59(100%) 5.41(100%) 5.41(100%)
0.3 13.08(100%) 5.91(100%) 5.91(100%) 13.07(100%) 5.91(100%) 5.90(100%)
75 0.0 10.93(100%) 5.25(100%) 5.24(100%) 10.91(100%) 5.23(100%) 5.23(100%)
0.1 10.64(100%) 5.26(100%) 5.25(100%) 10.65(100%) 5.26(100%) 5.25(100%)
0.2 11.56(100%) 5.41(100%) 5.41(100%) 11.59(100%) 5.41(100%) 5.41(100%)
0.3 13.08(100%) 5.91(100%) 5.91(100%) 13.07(100%) 5.91(100%) 5.90(100%)
RandInt (25,75) Rand (0.1,0.3) 10.90(100%) 5.25(100%) 5.23(100%) 10.91(100%) 5.24(100%) 5.23(100%)
Similarity-based
Connection Behavior
25 0.0 10.06(100%) 4.93(100%) 4.93(100%) 10.07(100%) 4.94(100%) 4.92(100%)
0.1 10.01(100%) 4.98(100%) 4.98(100%) 10.01(100%) 4.98(100%) 4.98(100%)
0.2 11.99(100%) 5.49(100%) 5.48(100%) 11.99(100%) 5.49(100%) 5.48(100%)
0.3 14.12(100%) 6.07(100%) 6.07(100%) 14.13(100%) 6.07(100%) 6.07(100%)
50 0.0 10.07(100%) 4.93(100%) 4.93(100%) 10.06(100%) 4.93(100%) 4.93(100%)
0.1 10.00(100%) 4.98(100%) 4.98(100%) 10.00(100%) 4.98(100%) 4.98(100%)
0.2 12.00(100%) 5.48(100%) 5.48(100%) 11.99(100%) 5.48(100%) 5.48(100%)
0.3 14.11(100%) 6.08(100%) 6.09(100%) 14.12(100%) 6.08(100%) 6.09(100%)
75 0.0 10.07(100%) 4.93(100%) 4.93(100%) 10.06(100%) 4.93(100%) 4.93(100%)
0.1 10.00(100%) 4.98(100%) 4.98(100%) 10.00(100%) 4.98(100%) 4.98(100%)
0.2 12.00(100%) 5.48(100%) 5.48(100%) 11.99(100%) 5.48(100%) 5.48(100%)
0.3 14.11(100%) 6.08(100%) 6.09(100%) 14.12(100%) 6.08(100%) 6.09(100%)
RandInt (25,75) Rand (0.1,0.3) 11.90(100%) 5.64(100%) 5.63(100%) 11.88(100%) 5.64(100%) 5.63(100%)
generally that the use of static parameter values is not suitable for distributed data minimization. This makes sense intuitively,
as dynamic parameters allow users to adjust their data collection to the current situation. Users may, for instance, adjust their
parameter values on the basis of other usersβ parameter values or due to a change in connection behavior.
Regardless of whether we use dynamic parameters or static parameters, our results show that the effect of payload aggregation
and the use of changing identifiers have qualitatively the same effect on both DGD and the baselines. Table 2 shows the effect for
DGD in particular. We observe first that both payload aggregation and the use of changing identifiers does not break the robustness
with which DGD produces sufficient data distributions, as all sufficiency coefficients are perfect (ξΏ= 100%). We observe further that
the impact of using changing identifiers as compared to using persistent identifiers on redundancy is marginal. More specifically, we
find that the differences are not statistically significant. The impact of payload aggregation is less subtle. We see that the use of either
MostSim or WeighAvg about halves redundancies. Although the use of payload aggregation allows to significantly reduce redundancy,
we find that the difference between the payload aggregation methods MostSim and WeighAvg is not statistically significant. Now
that we have seen that DGD solves the distributed data minimization problem, we are interested in seeing the performance that the
resulting data distributions yield.
4.6. Performance-redundancy trade-offs
Table 3 shows the absolute differences in RMSE and redundancy ξΎbetween DGD and each of the five baselines. Observe first
that the difference in RMSE between DGD and Dyn-R&I is less than 0.001, while using DGD results in less redundancy. Using DGD
rather than Dyn-R&I allows to collect a data amount of 2.55 copies of the user-item matrix π
less with random connection behavior
and 1.71 copies with similarity-based connection behavior across all users with otherwise equal performance levels, which is a
relative improvement of 19.3% and 14.6% respectively. Recall that the difference in parameter update formulas is that DGD features
the consensus term, while Dyn-R&I does not (see Section 4.3). Users thus achieve a significantly better performance-redundancy
trade-off if they coordinate data minimization through the consensus term. Following a similar argumentation for DGD and Dyn-C,
we conclude that performing both global and local data minimization as in DGD, and not only local data minimization as in Dyn-C,
has a significantly positive impact on the performance-redundancy trade-off of minimized data distributions.
Recall that the update formula of Dyn-R&I features the innovation term and the regularization term, while that of Dyn-R only
features the regularization term. We see in Table 3 that using Dyn-R allows to trade a decrease in RMSE by 0.021 with random
connection behavior and 0.025 with similarity-based connection behavior, for a redundancy increased by the data amount of
Pervasive and Mobile Computing 103 (2024) 101951
14
T. Eichinger and A. KΓΌpper
Table 3
Absolute differences π₯RMSE and redundancies π₯ξΎof the baselines to DGD data minimization. Differences are taken on the basis of the least redundant and
sufficient data distributions shown in Table 1. Asterisks mark statistical significance between each of the baselines with DGD data minimization under a two-sided
Welchβs t-test with Bonferroni correction and significance level πΌ= 0.01.
ππ’ππ’π₯RMSE π₯ξΎ
Random
Connection Behavior
Static-G β β β β
Static-I 75 0.1 β0.015* +17.75*
Dyn-C 75 0.1 β0.006 +9.02*
Dyn-R 25 0.1 β0.021* +42.17*
Dyn-R&I 50 0.0 +0.001 +2.55*
DGD 25 0.1 Β±0.000 Β±0.00
Similarity-based
Connection Behavior
Static-G β β β β
Static-I 75 0.1 β0.017* +19.84*
Dyn-C 75 0.1 β0.009 +9.51*
Dyn-R 25 0.1 β0.025* +42.45*
Dyn-R&I 50 0.1 Β±0.000 +1.71*
DGD 50 0.1 Β±0.000 Β±0.00
Table 4
Absolute differences π₯RMSE and redundancies π₯ξΎof DGD with persistent identifiers and combinations of persistent and changing identifiers and payload
aggregation methods MostSim and WeighAvg. Differences are taken on the basis of the least redundant and sufficient data distribution shown in Table 2. Asterisks
mark statistical significance between each of the baselines with DGD data minimization under a two-sided Welchβs t-test with Bonferroni correction and significance
level πΌ= 0.01.
ππ’ππ’π₯RMSE π₯ξΎ
Random
Connection Behavior
Changing Identifiers
DGD[WeighAvg] 50 0.0 +0.005 β5.41*
DGD[MostSim] 50 0.0 +0.005 β5.41*
DGD 50 0.1 Β±0.000 +0.01
Persistent Identifiers
DGD[WeighAvg] RandInt (25,75) Rand (0.1,0.3) +0.006 β5.41*
DGD[MostSim] 25 0.0 +0.005 β5.39*
DGD 25 0.1 Β±0.000 Β±0.00
Similarity-based
Connection Behavior
Changing Identifiers
DGD[WeighAvg] 25 0.0 +0.003 β5.08*
DGD[MostSim] 50 0.0 +0.003 β5.07*
DGD 50 0.1 Β±0.000 Β±0.00
Persistent Identifiers
DGD[WeighAvg] 25 0.0 +0.004 β5.07*
DGD[MostSim] 25 0.0 +0.003 β5.07*
DGD 50 0.1 Β±0.000 Β±0.00
42.17 copies of the user-item matrix π
more with random connection behavior, and 42.45 copies with similarity-based connection
behavior, over all users. We conclude that the use of the innovation term has a significantly positive impact on performance while
simultaneously having a significantly negative impact on redundancy. Whether this trade-off is desirable depends on the setting.
We have, for instance, seen in Fig. 3 that DGD converges quicker than Dyn-R, which may be a desirable property. However, only
considering performance and redundancy, we cannot strictly favor either DGD or Dyn-R&I. Following the same argumentation, we
cannot strictly favor either DGD or Static-I only on the basis of performance and redundancy. However, considering that Static-I
has a tendency to produce insufficient data distributions (see Table 1), we deem DGD to be generally favored over Static-I.
Table 4 shows how the use of changing identifiers and payload aggregation affect the performance-redundancy trade-off. We
observe first that the use of changing identifiers does not affects the RMSE at all with both random and similarity-based connection
behavior. We see further that with random connection behavior, the use of changing identifiers increases redundancy slightly by 0.01,
while for similarity-based connection behavior, it leaves redundancy unchanged. We find that the marginal differences in redundancy
and RMSE are not statistically significant. We thus conclude that the use of changing identifiers is feasible in the context of data
minimization.
The reduction in redundancy due to payload aggregation is statistically significant with both random and similarity-based
connection behavior. Interestingly, it is essentially the same, irrespective of whether users make use of persistent or changing
identifiers. We observe, however, that the redundancy reduction of β5.41 and β5.39 is larger with random connection behavior
than β5.08 and β5.07 with similarity-based connection behavior. Interestingly, the redundancies that the payload aggregation
methods MostSim and WeighAvg yield are essentially the same irrespective of whether or not users make use of persistent or
changing identifiers. We conclude that the use of changing identifiers is feasible in the context of data minimization with payload
aggregation. A direct comparison of MostSim and WeighAvg shows that both payload aggregation methods yield essentially the same
performance-redundancy trade-offs.
Pervasive and Mobile Computing 103 (2024) 101951
15
T. Eichinger and A. KΓΌpper
5. Conclusions
We study data minimization and anonymity in pervasive mobile-to-mobile recommender systems in which users establish ad hoc
wireless connections between their mobile computing devices in physical proximity to exchange ratings on which they calculate
recommendations. A major difference in achieving this goal in pervasive mobile-to-mobile recommender systems, compared to
decentralized recommender systems in general, resides in the mobility of users. Due to user mobility, it is not possible for users to
collect the ratings from other users at will. Users have to decide in situ and exactly when their mobile computing device establishes
a wireless ad hoc connection whether and how many ratings they want to exchange. One of the effects of user mobility is that a
solution to the distributed data minimization problem does not always exist.
We study two types of user mobility. We study, on the one hand, uniformly random connections between users and, on the
other hand, connections that are random yet limited to users who have rated at least one item in common. The randomness in both
models accounts for the randomness in connections that we would expect to arise in a pervasive mobile-to-mobile recommender
system. For both these mobility models, we find that our algorithm based on Distributed Gradient Descent (DGD) robustly solves
the distributed data minimization problem, outperforming an array of baselines. The algorithmβs robustness stems from its capacity
to allow users to dynamically change the amount of data they collect individually in order to minimize the amount of collected data
over all users globally.
The DGD data minimization algorithm minimizes first the number of connections, and on the basis of this minimized amount of
connections, the amount of data that are collected over these connections. We find that the algorithm consistently finds the least
amount of connections and the least amount of ratings communicated over these connections compared to the baselines. We find that
this property simultaneously reduces the chances of an attacker relating users to ratings. We further reduce the chances of an attacker
relating users to ratings by having users aggregate ratings on the same item into a single rating and change their identifiers between
connections. We find that payload aggregation allows to halve the amount of ratings that users collect irrespective of whether
users change their identifiers between connections or not and without jeopardizing the robustness of the DGD data minimization
algorithm. In this sense, the DGD data minimization algorithm not only minimizes the amount of ratings that users collect but also
preserves their anonymity. However, it is important to note that it only preserves the anonymity between users who do not establish
a connection. It is clear that users who do establish an ad hoc wireless connection in physical proximity can never be anonymous
toward each other.
CRediT authorship contribution statement
Tobias Eichinger: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Project administration, Re-
sources, Software, Validation, Visualization, Writing β original draft, Writing β review & editing. Axel KΓΌpper: Funding acquisition,
Supervision, Writing β review & editing.
Declaration of competing interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing
interests: Tobias Eichinger reports financial support was provided by European Commission. If there are other authors, they declare
that they have no known competing financial interests or personal relationships that could have appeared to influence the work
reported in this paper.
Data availability
https://github.com/TEichinger/dec-cf-sim (see [17] in the manuscript).
Acknowledgments
Funding: This work was supported by the European Unionβs Horizon 2020 research and innovation program [grant number N.
883464].
Appendix A. General construction of the forwarding parameter π½π
Let π’βπbe a fixed user with similarity threshold π‘π’β [0,1]. Let further π= {π π§}π§βπbe the set of similarity data of all users in
πand sim βΆπΓπβ[0,1] be a similarity measure. Let further ππ’βΆπβ{π’}β[0,1] be a random variable that describes the behavior
with which the fixed user π’forms a connection with an a priori unknown user π£βπβ{π’}. More specifically, for any user π£βπβ{π’}
we denote by P(ππ’=π£)the probability that the fixed user π’forms a connection with the a priori unknown user π£. More precisely:
P(ππ’=π£) βΆ= P({π|πβππ’(π£)}). We emphasize that the random variable ππ’only describes a single connection. Modeling multiple
connections requires taking multiple random variables ππ’.
In this setting, it is unclear whether the fixed user π’forwards a payload in a connection with some other user π£, as π’βs similarity
to π£is a priori unknown. We define the random variable ππ’to describe this a priori unknown similarity. We more specifically define
ππ’βΆπβ{π π’}β[0,1] by ππ’βΆ= sim(π π’, π ππ’), where π ππ’denotes the similarity data of the a priori unknown user ππ’. We denote by
Pervasive and Mobile Computing 103 (2024) 101951
16
T. Eichinger and A. KΓΌpper
πΉππ’βΆ [0,1] β[0,1] the cumulative distribution function of ππ’. Then, πΉππ’(π )represents the probability that π’forms a connection with
another user ππ’=π£βπβ{π’}such that their similarity is smaller or equal to π . Formally, we have:
πΉππ’(π )Def. πΉππ’
=P(ππ’β€π )
Def. ππ’
=P(π ππ(π π’, π ππ’)β€π )
Def. ππ’
=P(ππ’=π£, π ππ(π π’, π π£)β€π )
=P(π’βπ£, π ππ(π π’, π π£)β€π ).
(A.1)
If we plug in π’βs similarity threshold π‘π’into this equation (π =π‘π’), we have that πΉππ’(π‘π’) β [0,1] represents the probability that π’forms
a connection with another user with a similarity lower or equal than π‘π’to π’. Conversely, we have that 1 β πΉππ’(π‘π’) β [0,1] represents
the probability that π’forms a connection with another user with a similarity higher than π‘π’to π’. Since π’only forwards a payload if
the similarity to the other user is higher than π‘π’, we see that the probability
ππ’βΆ= 1 β πΉππ’(π‘π’) β [0,1].(A.2)
represents the probability with which π’forwards a payload in an a priori unknown connection.
Appendix B. Local versus global data minimization
We illustrate that the objectives of local and global data minimization are competing on the basis of the example on sufficiency
in Section 2.5. We check first whether the data distribution in the example qualifies as a solution to the global data minimization
problem. We then check whether individual usersβ collected datasets qualify as solutions to the local data minimization problem.
Finally, we conclude that the existence of solutions to the individual problems does not imply the existence of a solution to the joint
problem.
The data distribution {π
π’, π
π£, π
π€}does not qualify as a solution to the global data minimization problem (see Definition 1 in
Section 2.5), since it is not a sufficient data distribution (see the side condition in Eq. (6)). In order to arrive at a sufficient data
distribution, π€would require an additional data exchange in epoch π‘= 3, either π’βπ€or π£βπ€, for then π€would obtain a rating
on π€βs item of interest π4and arrive at a sufficient collected dataset π
π€. However, users π’and π£do not seek data exchanges beyond
epoch π‘= 2, since they already collected a full copy of the user-item matrix modulo row permutation. For them, any additional
data exchange in epoch π‘= 3 would trivially not yield any change in performance (π₯π(3)
π’=π₯π(3)
π£= 0). We now check whether the
collected datasets π
π’, π
π£, π
π€qualify as solutions to the local data minimization problem (see Definition 1 in Section 2.5).
Observe first that the collected dataset π
π€trivially does not qualify as a solution to the local data minimization problem,
since the collected dataset π
π€is insufficient. User π€has not collected any rating on π€βs item of interest π4, which is why π₯π(1)
π€
is trivially undefined. We see that collected datasets necessarily need to be sufficient in order to qualify as solutions to the local
data minimization problem. π
π’and π
π£are sufficient collected datasets and are thus good solution candidates for the local data
minimization problem. If we assume, for the sake of the argument only, that the change in performance for users π’and π£was
marginal in epoch π‘= 2 (π₯π(2)
π’, π₯π(2)
π£β (0, π)) for some suitable performance metric πand performance threshold π, the collected
datasets π
π’and π
π£qualify as solutions to the local data minimization problem.
In summary, we can see from this example that sufficiency is a necessary criterion for a collected dataset to qualify as a solution
to the local data minimization problem. If a data distribution were to qualify as a solution to both the local and the global data
minimization problem, then the data distribution necessarily has to be sufficient. If users that arrive quickly at a minimized collected
dataset stop seeking data exchanges with other users too early, they exacerbate the data collection of other users who may, as a
consequence, not be able to collect even a sufficient dataset. The underlying idea of our proposed DGD data minimization scheme
is that users coordinate their individual data collection processes by coordinating the pace of their data collection.
Appendix C. Complexity of centralized versus distributed data minimization
We compare the computational complexity of the data minimization problem in the centralized and distributed case. Roughly,
centralized data minimization involves training a few large models, while distributed data minimization involves training many small
models. Recall that the distributed data minimization problem is a mixture of the local and the global data minimization problem.
We omit a discussion on the complexity of the global data minimization problem (see Definition 1) since it merely involves constant
time look-ups of items of interest in a userβs collected dataset. We thus essentially only compare the computational complexity of
the centralized data minimization problem versus the local data minimization problem.
We assume that predicted ratings are calculated on the basis of an SVD collaborative filtering algorithm since the empirical
evaluation of the proposed DGD data minimization scheme is based on SVD (see Section 4.1). We point out that in the
context of recommendation, SVD usually denotes a class of matrix factorization algorithms such as SVD++ [25], truncated SVD
(see Section 3.6.5 in [26]), or FunkSVD [27] rather than a particular algorithm. We therefore use the complexity ξ»(min{ππ2, ππ2})
of a classical numerical SVD algorithm on a πΓπmatrix as, for instance, reported in [28].
Centralized data minimization assumes that the training ratings of a user-item matrix π
are collected cumulatively in batches.
This is similar to users collecting payloads through the Data Exchange Protocol for distributed data minimization. After collecting a
new batch of ratings, first, a new SVD is calculated on the union of the previously collected batches and the newly collected batch.
Pervasive and Mobile Computing 103 (2024) 101951
17
T. Eichinger and A. KΓΌpper
Afterward, the new recommendation performance that the new SVD yields on the test ratings is calculated. The new performance
is compared with the previous performance. If the performance differential is marginal, data collection is terminated.
The complexity of SVD calculations for centralized data minimization is the same in each iteration since the number of rows
πuser and the number of columns πitem in the underlying user-item matrix remains constant. The reason for this is that centralized
systems need to serve recommendations to all users π’βπon all items πβπΌ. Since the number of items usually largely exceeds the
number of users (πitem β« πuser), we arrive at a computational complexity for centralized data minimization on the basis of SVD of
ξ»(πiteration β
[πitem β
πuser2
βββββββββββββ
SVD calculation
+π]),(C.1)
where πiteration denotes the number of iterations until termination and πthe constant effort to calculate the performance on the test
ratings and the performance differentials in each iteration.
Although centralized data minimization and local data minimization are conceptually similar, the complexity of SVD calculations
for local data minimization does not remain constant as for centralized data minimization. The complexity of SVD calculations for
local data minimization increases between iterations. The reason for this is that π’does not need to serve recommendations to all
users π’βπon all items πβπΌas in a centralized recommender system, yet only to π’themself on those items for which ratings have
been collected. The complexity of SVD calculations increases with the increasing dimension of the underlying user-item matrix,
which is the product of the number of rows and the number of columns. With respect to the example in Fig. 2, we see that the
collected dataset π
π€is a 2 Γ3 user-item matrix since π€did not collect any rating on item π4, while π
π’and π
π£are 3 Γ4 user-item
matrices.
Recall that users collecting data via the Data Exchange Protocol only receive a payload with probability πin a connection and
that a payload represents the analog to a batch in the centralized case. More specifically, we have seen at the end of Section 2.3
that a user π’receives βπ‘π(π‘)
π’payloads on average over the course of π’βs data collection process. Without loss of generality, we
disregard the connections in which π’does not receive a payload. We define the integer πreception(π’) βΆ= ββπ‘π(π‘)
π’βas an estimate of the
number of connections in which π’receives a payload over the course of π’βs data collection process. We further define the integer
πreception βΆ= βπ’πreception(π’)as an estimate of the overall number of payload receptions over all usersβ data collection processes, in
analogy to πiteration in Eq. (C.1) in the centralized case. We then arrive at a computational complexity for local data minimization
on the basis of SVD of
ξ»(β
π’βπ
πreception(π’)
β
π‘=1
[π(π‘)
item(π’)β
π(π‘)
user(π’)2
βββββββββββββββββββββββ
SVD calculation
per user and epoch
+ππ’])
Eq. (C.3)
=ξ»(β
π’βπ
πreception(π’)
β
π‘=1
[πitem β
πuser2+ππ’])
=ξ»(β
π’βπ
πreception(π’)
β
π‘=1
[πitem β
πuser2+π])
=ξ»(β
π’βπ
πreception(π’)β
[πitem β
πuser2+π])
=ξ»(πreception β
[πitem β
πuser2
βββββββββββββ
averaged
SVD calculation
+π]),
(C.2)
where π(π‘)
user(π’)and π(π‘)
item(π’)denote the number of rows and columns of a user π’βs collected dataset respectively after receiving
π‘payloads, ππ’the user-specific constant computational effort to calculate the performance on their test ratings and their performance
differentials with πas their average, and πitem β
πuser2the average computational effort of SVD calculations over all users and epochs
defined as
πitem β
πuser2βΆ= βπ’βπβπreception(π’)
π‘=1 π(π‘)
item(π’)β
π(π‘)
user(π’)2
βπ’βππreception(π’).(C.3)
By comparison of Eqs. (C.1) and (C.2), we see that the computational complexity of centralized data minimization and local
data minimization have a similar structure. In particular, both grow linearly in the number of times that data is collected since the
collection of a batch or payload triggers the calculation of an SVD. The complexity of SVD calculations is expensive for large user-item
matrices, even if they are sparse. Therefore, if SVD calculations on the entire user-item matrix for centralized data minimization
become intractable, performing SVD calculations on multiple subsets of the entire user-item matrix for local data minimization may
be a tractable alternative. An empirical comparison of the computational effort of centralized versus local data minimization goes
beyond the scope of this paper and represents future work.
Pervasive and Mobile Computing 103 (2024) 101951
18
T. Eichinger and A. KΓΌpper
Appendix D. Completeness of local and global data minimization
We explain why individual solutions to the local and global data minimization problems exist with random connection behavior.
The local data minimization problem is well-defined; that is, a solution to the problem exists for a user π’if that user π’can collect
a locally minimized dataset π
(ππ’)
π’(see the side condition in Eq. (5)). Since local data minimization is equivalent to centralized data
minimization performed by an individual user, a user may apply data minimization schemes that have been proposed for use in
centralized systems (see for instance [3,4]). Data minimization schemes for use in centralized systems require the availability of
the entire user-item matrix π
, that is, the availability of any other userβs profile. Randomly pairing π’with any other user in the
Connection Phase (π) of the Data Exchange Protocol makes any other userβs profile available to user π’, although it may require
several epochs to actually collect these profiles. We conclude that solutions to the local data minimization problem always exist if
users are paired randomly.
The global data minimization problem is well-defined if every user π’βπcan collect a sufficient dataset π
π’, which is equivalent
to the data distribution {π
π’}π’βπbeing sufficient (see the side condition in Eq. (6)). Under the assumption that the entire user-item
matrix π
represents a sufficient collected dataset to any user π’, we can completely analogously to the argumentation in the previous
paragraph for the local data minimization problem conclude that a solution to the global data minimization problem always exists
if users are paired randomly. We argue that the assumption that π
is sufficient to any user π’is justified since, otherwise, the
recommendation problem itself is not well-defined. More precisely, if π
were not sufficient for a particular user π’, that is, π
does
not hold ratings on any of π’βs items of interest, none of the items of interest can be recommended.
References
[1] European Union, Charter of fundamental rights of the European union, Off. J. Eur. Union 55 (C 326) (2012) 391β407, see in particular Chapter II.
[2] A.F. Westin, Privacy and Freedom, Ig Publishing, 2015, reprint from 1967.
[3] A.J. Biega, P. Potash, H. DaumΓ©, F. Diaz, M. Finck, Operationalizing the legal principle of data minimization for personalization, in: Proceedings of the
43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, 2020, pp. 399β408.
[4] D. Shanmugam, F. Diaz, S. Shabanian, M. Finck, A.J. Biega, Learning to limit data collection via scaling laws: A computational interpretation for the legal
principle of data minimization, in: Proceedings of the 5th ACM Conf. on Fairness, Accountability, and Transparency, ACM, 2022, pp. 839β849.
[5] R. Chow, H. Jin, B. Knijnenburg, G. Saldamli, Differential data analysis for recommender systems, in: Proceedings of the 7th ACM Conference on
Recommender Systems, ACM, 2013, pp. 323β326.
[6] M. Larson, A. Zito, B. Loni, P. Cremonesi, Towards minimal necessary data: The case for analyzing training data requirements of recommender algorithms,
in: Proceedings of the 1st FATREC Workshop on Responsible Recommendation, 2017, pp. 1β6.
[7] H. Wen, L. Yang, M. Sobolev, D. Estrin, Exploring recommendations under user-controlled data filtering, in: Proceedings of the 12th ACM Conference on
Recommender Systems, ACM, 2018, pp. 72β76.
[8] T. Eichinger, A. KΓΌpper, Distributed data minimization for decentralized collaborative filtering systems, in: Proceedings of the 24th International Conference
on Distributed Computing and Networking, ICDCN β23, ACM, 2023, pp. 140β149.
[9] S. Lee, X. Zheng, J. Hua, H. Vikalo, C. Julien, Opportunistic federated learning: An exploration of egocentric collaboration for pervasive computing
applications, in: Proceedings of the 19th IEEE International Conference on Pervasive Computing and Communications, IEEE, 2021, pp. 1β8.
[10] J. Dunkel, R. Hermoso, Towards MANET-based recommender systems for open facilities, Appl. Intell. 52 (8) (2021) 9045β9066.
[11] T. Eichinger, F. Beierle, R. Papke, L. Rebscher, H. Chinh Tran, M. Trzeciak, On gossip-based information dissemination in pervasive recommender systems,
in: Proceedings of the 13th ACM Conference on Recommender Systems, ACM, 2019, pp. 442β446.
[12] L.N. Barbosa, J. Gemmell, M. Horvath, T. Heimfarth, Distributed user-based collaborative filtering on an opportunistic network, in: Proceedings of the
2018 IEEE 32nd International Conference on Advanced Information Networking and Applications, IEEE, 2018, pp. 266β273.
[13] P. Gratz, T. Leclerc, Delay-tolerant collaborative filtering, in: Proceedings of the 7th ACM International Symposium on Mobility Management and Wireless
Access, ACM, 2009, pp. 109β113.
[14] R. Schifanella, A. Panisson, C. Gena, G. Ruffo, MobHinter: Epidemic collaborative filtering and self-organization in mobile ad-hoc networks, in: Proceedings
of the 2nd ACM Conference on Recommender Systems, ACM, 2008, pp. 27β34.
[15] A. de Spindler, M.C. Norrie, M. Grossniklaus, Collaborative filtering based on opportunistic information sharing in mobile ad-hoc networks, in: Proceedings
of the 6th OTM Confederated International Conferences ββOn the Move to Meaningful Internet Systemsββ, in: LNCS, (4803) Springer, 2007, pp. 408β416.
[16] J. Hestness, S. Narang, N. Ardalani, G. Diamos, H. Jun, H. Kianinejad, M.M.A. Patwary, Y. Yang, Y. Zhou, Deep learning scaling is predictable, empirically,
2017, arXiv:1712.00409.
[17] T. Eichinger, Simulation framework for distributed data minimization, 2023, URL: https://github.com/TEichinger/dec-cf-sim. (Accessed 5 May 2024).
[18] M.D. Ekstrand, J.T. Riedl, J.A. Konstan, Collaborative filtering recommender systems, Found. Trends HCI 4 (2) (2011) 81β173.
[19] M. McPherson, L. Smith-Lovin, J.M. Cook, Birds of a feather: Homophily in social networks, Annu. Rev. Sociol. 27 (2001) 415β444.
[20] G. Beliakov, T. Calvo, S. James, Aggregation functions for recommender systems, in: F. Ricci, L. Rokach, B. Shapira (Eds.), Recommender Systems Handbook,
second ed., Springer US, Boston, MA, 2015, pp. 777β808.
[21] R. Shokri, P. Pedarsani, G. Theodorakopoulos, J.-P. Hubaux, Preserving privacy in collaborative filtering through distributed aggregation of offline profiles,
in: Proceedings of the 3rd ACM Conference on Recommender Systems, ACM, 2009, pp. 157β164.
[22] A. Pfitzmann, M. Hansen, A terminology for talking about privacy by data minimization: Anonymity, unlinkability, undetectability, unobservability,
pseudonymity, and identity management, 2010, v0.34.
[23] F.M. Harper, J.A. Konstan, The MovieLens datasets: History and context, ACM TIIS 5 (4) (2015) 1β19.
[24] Q.-T. Truong, A. Salah, H. Lauw, Multi-modal recommender systems: Hands-on exploration, in: Proceedings of the 15th ACM Conference on Recommender
Systems, ACM, 2021, pp. 834β837.
[25] Y. Koren, Factorization meets the neighborhood: A multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, ACM, 2008, pp. 426β434.
[26] C.C. Aggarwal, Recommender Systems: The Textbook, first ed., Springer Publishing Company, Incorporated, 2016.
[27] S. Funk, Netflix update: Try this at home, 2006, URL: https://sifter.org/simon/journal/20061211.html. (Accessed 5 May 2024).
[28] L. Trefethen, D. Bau, Numerical Linear Algebra, 1997.