Distributed Function Approximation over Noisy Channels
With Reliability and Security Guarantees for Applications in Wireless
Networks
vorgelegt von
M. Sc.
Matthias Frey
https://orcid.org/0000-0003-3016-2644
an der Fakult¨
at IV - Elektrotechnik und Informatik
der Technischen Universit¨
at Berlin
zur Erlangung des akademischen Grades
Doktor der Naturwissenschaften
- Dr. rer. nat. -
genehmigte Dissertation
Promotionsausschuss:
Vorsitzender: Prof. Giuseppe Caire, Ph.D.
Gutachter: Prof. Dr.-Ing. Slawomir Sta´nczak
Gutachter: Prof. Dr. Michael C. Gastpar
Gutachter: Dr. Jingge Zhu
Tag der wissenschaftlichen Aussprache: 12. Dezember 2022
Berlin 2023
Abstract
Over-the-Air (OTA) computation is the problem of computing functions of distributed
data over a wireless channel without transmitting the entirety of the data to a central
point. By avoiding such costly transmissions, OTA computation schemes can achieve a
better-than-linear (depending on the function, often logarithmic or even constant) scaling
of the communication cost as the number of transmitters grows. Among the most common
functions computed OTA are linear functions such as weighted sums. In this work, we
propose and analyze a method for the approximation of functions of distributed arguments
over noisy channels. This method can be used as an analog OTA computation scheme
for a class of functions that contains linear functions as well as some nonlinear functions
such as p-norms of vectors. We prove error bound guarantees that are valid for fast-fading
channels and all distributions of fading and noise contained in the class of sub-Gaussian
distributions. This class includes Gaussian distributions, but also many other practically
relevant cases such as Class A Middleton noise and fading with dominant line-of-sight
components. In addition, there can be correlations in the fading and noise so that the
presented results also apply to, for example, block fading channels and channels with
bursty interference. We do not rely on any stochastic characterization of the distributed
arguments of the OTA computed function; in particular, there is no assumption that these
arguments are drawn from identical or independent probability distributions. Our analysis
relies on tools from high-dimensional statistics which we adapt so that they are applicable
to the scenario at hand. The resulting error guarantees are nonasymptotic and therefore
provide error bounds that are valid for a finite number of channel uses.
OTA computation has a huge potential for reducing communication cost in applications
such as Machine Learning (ML)-based distributed anomaly detection in large wireless
sensor networks. We show this potential with two examples of how our OTA computation
scheme can be used to vastly increase the efficiency of Vertical Federated Learning (VFL)
over a wireless channel. We also illustrate the efficiency gain with numerical simulations
for a few example cases.
Then, we move on to propose a new method to protect OTA computation schemes
against passive eavesdropping. Our method uses a friendly jammer whose signal is –
contrary to common intuition – stronger at the legitimate receiver than it is at the eaves-
iii
dropper. It works for a large class of analog OTA computation schemes and we give details
on the special case of computing an arithmetic average over an Additive White Gaussian
Noise (AWGN) channel. The derived secrecy guarantee translates to a lower bound on
the eavesdropper’s mean square error while the question of how to provide operationally
more significant guarantees such as semantic security remains open for future work. As
key ingredients in proving the security guarantees, we propose and prove generalizations
of known results on channel resolvability and coding for compound channels for the case
of continuous channel input and output alphabets.
iv
Zusammenfassung
Funktionsberechnung im Funkkanal ist eine Methode, Funktionen verteilter Daten ohne
eine vollst¨
andige ¨
Ubertragung der Daten zu einem zentralen Punkt zu berechnen. Indem
solche ¨
Ubertragungen vermieden werden, kann erreicht werden, dass der Ressourcenver-
brauch weniger schnell als linear mit der Anzahl der Sender w¨
achst. Abh¨
angig von der zu
berechnenden Funktion kann dieses Wachstum dann in vielen F¨
allen logarithmisch oder
sogar konstant sein (d.h. die zur Funktionsberechnung n¨
otigen Kanalressourcen wachsen
¨
uberhaupt nicht, wenn die Anzahl der Sender w¨
achst). Zu den am h¨
aufigsten im Funk-
kanal berechneten Funktionen geh¨
oren lineare Funktionen wie z.B. gewichtete Summen.
In der vorliegenden Arbeit f¨
uhren wir eine Methode zur verteilten Approximation von
Funktionen in verrauschten Kan¨
alen ein. Diese Methode kann als Verfahren zur analogen
Funktionsberechnung im Funkkanal genutzt werden. Sie ist auf eine Klasse von Funktionen
anwendbar, die zum einen alle linearen Funktionen, zum anderen aber auch einige nicht-
lineare Funktionen wie die p-Norm von Vektoren enth¨
alt. Wir zeigen Fehlerschranken f¨
ur
unser Verfahren, die in Kan¨
alen mit Fast Fading gelten, wobei sowohl die Verteilung des
Fadings als auch die des Rauschens zur Klasse der sub-Gauß’schen Wahrscheinlichkeitsver-
teilungen geh¨
oren m¨
ussen. Diese Klasse enth¨
alt nicht nur alle Normalverteilungen, sondern
auch viele andere in der Praxis relevanten Verteilungen wie z.B. Class-A-Rauschen nach
Middleton und Fadingverteilungen mit einer dominanten Sichtkomponente. Zudem sind
unsere Fehlerschranken auch dann noch g¨
ultig, wenn es Korrelationen in der Verteilung des
Fadings und/oder Rauschens gibt, sodass unsere Ergebnisse beispielsweise auch auf Kan¨
ale
mit Blockfading oder geb¨
undelt auftretender Interferenz anwendbar sind. Dabei verwen-
den wir keinerlei stochastische Charakterisierung der Argumente der ¨
uber den Funkkanal
zu berechnenden Funktion. Dies bedeutet insbesondere, dass keine Annahme ¨
uber identi-
sche Verteilung oder Unabh¨
angigkeit dieser Argumente n¨
otig ist. Unsere Analyse baut auf
Werkzeugen aus der hochdimensionalen Statistik auf, die wir derart anpassen, dass sie im
vorliegenden Szenario anwendbar sind. Die sich daraus ergebenden Fehlerschranken sind
nichtasymptotisch, d.h. sie sind f¨
ur jede beliebige (endliche) Anzahl von Kanalnutzungen
g¨
ultig.
Funktionsberechnung im Funkkanal bietet ein riesiges Potenzial, in Anwendungen wie
z.B. der auf maschinellem Lernen basierenden Anomaliedetektion in großen Sensornetzen
v
die zur Kommunikation n¨
otigen Ressourcen drastisch zu reduzieren. Wir veranschaulichen
dieses Potenzial, indem wir zwei Beispiele ausf¨
uhren, die zeigen, wie unsere Methoden zur
Funktionsberechnung im Funkkanal die Effizienz von vertikalem f¨
oderierten Lernen enorm
steigern k¨
onnen. Wir illustrieren diesen Effizienzgewinn außerdem f¨
ur einige ausgesuchte
Spezialf¨
alle anhand numerischer Simulationen.
Anschließend f¨
uhren wir ein neues Verfahren ein, um Funktionsberechnungen im Funk-
kanal gegen passive Lauscher zu sch¨
utzen. Unsere Methode basiert auf der Kooperation der
legitimen Kommunikationspartner mit einem Jammer, dessen Signal – entgegen den nor-
malerweise in diesem Zusammenhang gemachten Annahmen – beim legitimen Empf¨
anger
in einer gr¨
oßeren Signalst¨
arke ankommt als beim Lauscher. Sie funktioniert f¨
ur eine große
Klasse von Verfahren zur Funktionsberechnung ¨
uber den Funkkanal, und wir f¨
uhren den
Spezialfall der Berechnung eines arithmetischen Mittelwerts ¨
uber einen Kanal mit addi-
tivem weißen normalverteilten Rauschen detailliert aus. Dabei erhalten wir eine Sicher-
heitsgarantie, die den durchschnittlichen quadratischen Fehler des Lauschers nach unten
begrenzt. Die Frage, wie st¨
arkere Sicherheitsgarantien (z.B. semantische Sicherheit) gege-
ben werden k¨
onnen, bleibt dabei f¨
ur zuk¨
unftige Forschungsarbeiten offen. Die wichtigsten
informationstheoretischen Zutaten f¨
ur den Beweis der Sicherheitsgarantie sind Verallge-
meinerungen bekannter Ergebnisse zur Resolvability von Kan¨
alen und zur Kommunikation
¨
uber Compound-Kan¨
ale f¨
ur den Fall kontinulierlicher Ein- und Ausgabealphabete, die wir
im Rahmen dieser Arbeit ebenfalls beweisen.
vi
Acknowledgments
I would like to thank Prof. Dr.-Ing. Slawomir Sta´nczak for the opportunity to pursue a
Ph.D. degree as a member of his research group at Technische Universit¨at Berlin, and for
the support and scientific freedom I had during that time. I am also particularly indebted
to Dr. Igor Bjelakovi´c who has taken a lot of time and effort to guide and support me in
my research work for this thesis, and beyond.
Moreover, I would like to thank Prof. Dr. Michael C. Gastpar and Lecturer Dr. Jingge
Zhu for serving as referees for this dissertation.
Finally, I gratefully recognize my co-authors Navneet Agrawal, Dr.-Ing. Zoran Utkovski,
Patrick Agostini, Miguel A. Gutierrez-Estevez and Daniel Sch¨aufele for their collabora-
tion, as well as all my current and former colleagues at Technische Universit¨at Berlin
and Fraunhofer Heinrich Hertz Institute. It was a pleasure to share many stimulating
discussions with them and I am appreciative of this joyful work and research environment.
vii
Copyright Information
The work contained in this thesis has been published in various venues. Chapter 1 and the
introductory paragraphs of Chapter 2 and Chapter 4 are based on the publication [7] which
is reproduced with permission from Springer Nature. The remaining part of Chapter 2
and Chapter 3 as well as part of Section 1.5 are based on [2–4] which are ©2019-2021
IEEE. The remaining part of Chapter 4 and part of Section 1.5 are based on [1,5] which
are ©2018-2021 IEEE and on [6] which is currently under consideration for publication
and may be ©IEEE in the future.
In reference to IEEE copyrighted material which is used with permission in this the-
sis, the IEEE does not endorse any of Technische Universit¨at Berlin’s products or ser-
vices. Internal or personal use of this material is permitted. If interested in reprint-
ing/republishing IEEE copyrighted material for advertising or promotional purposes or for
creating new collective works for resale or redistribution, please go to http://www.ieee.
org/publications_standards/publications/rights/rights_link.html to learn how
to obtain a License from RightsLink. If applicable, University Microfilms and/or ProQuest
Library, or the Archives of Canada may supply single copies of the dissertation.
ix
Contents
1 Introduction 1
1.1 NomographicFunctions ............................. 1
1.2 Distributed Function Approximation in Noisy Channels: An Example . . . 4
1.3 Over-the-Air Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3.1 Digital Over-the-Air Computation . . . . . . . . . . . . . . . . . . . 6
1.3.2 Analog Over-the-Air Computation . . . . . . . . . . . . . . . . . . . 8
1.4 Applications of Over-the-Air Computation . . . . . . . . . . . . . . . . . . . 10
1.4.1 Distributed Machine Learning . . . . . . . . . . . . . . . . . . . . . . 10
1.4.2 Consensus over Wireless Channels . . . . . . . . . . . . . . . . . . . 13
1.5 Contribution and Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2 Distributed Function Approximation in Fading and Correlated Channels 19
2.1 SystemModel................................... 20
2.1.1 Sub-Gaussian Random Variables . . . . . . . . . . . . . . . . . . . . 20
2.1.2 Channel and Correlation Model . . . . . . . . . . . . . . . . . . . . . 21
2.1.3 Discussion of the System Model Assumptions . . . . . . . . . . . . . 23
2.2 ProblemStatement................................ 24
2.2.1 Distributed Approximation of Functions . . . . . . . . . . . . . . . . 24
2.2.2 The Class of Functions to be Approximated . . . . . . . . . . . . . . 25
2.3 MainResult.................................... 27
2.3.1 Special Cases of Theorem 2 . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Sharpness of the Bound in Theorem 2 . . . . . . . . . . . . . . . . . 31
2.4 NumericalResults ................................ 32
2.5 ProofofTheorem2................................ 35
2.5.1 Pre-Processing .............................. 35
2.5.2 Post-Processing.............................. 36
2.5.3 TheErrorEvent ............................. 37
2.5.4 PerformanceBounds........................... 38
2.6 Preliminaries on Sub-Gaussian and Sub-Exponential Random Variables . . 43
2.7 ProofofLemmas1and2 ............................ 45
xi
Contents
3 Applications of Distributed Function Approximation to Vertical Federated
Learning 51
3.1 Support Vector Machines with Additive Kernels for Regression and Classi-
fication ...................................... 52
3.2 Model-Agnostic Approach to Over-the-Air-Computed Classifiers . . . . . . 54
3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers . . 57
4 Security in Over-the-Air Computation 67
4.1 PriorWork .................................... 68
4.2 SystemModel................................... 70
4.2.1 Distributed Approximation of Functions . . . . . . . . . . . . . . . . 70
4.2.2 Secrecy Extension to Distributed Approximation of Functions . . . . 71
4.2.3 Special case K=1............................ 73
4.3 Specialization to the Additive White Gaussian Noise Channel . . . . . . . . 74
4.4 MainResults ................................... 77
4.5 Implications of the Main Results . . . . . . . . . . . . . . . . . . . . . . . . 82
4.5.1 Feasibility of Channel Approximation . . . . . . . . . . . . . . . . . 82
4.5.2 Back to the Additive White Gaussian Noise case: Calculating Mean
Square Error Security Guarantees . . . . . . . . . . . . . . . . . . . 85
4.6 Proofs ....................................... 87
4.6.1 Statistical Preliminaries for the Proof of Lemma 12 . . . . . . . . . . 87
4.6.2 ProofofTheorem4 ........................... 89
4.6.3 ProofofTheorem5 ........................... 91
4.6.4 ProofofTheorem6 ........................... 96
4.6.5 Cost Constraint in Compound Channel Coding and Resolvability . . 101
Publication List 107
Bibliography 109
Notations and Symbols 119
Abbreviations 123
Index 125
xii
1 Introduction
In this thesis, we study the approximation of a function fof distributed arguments
s1, . . . , sK. Namely, we consider terminals A1,...,AK,Bwhich are connected by a multiple-
access channel (MAC). The MAC is described as a stochastic kernel Wwhich randomly
maps every possible tuple of inputs (T1, . . . , TK) at A1,...,AKto an output Yat B.
Each terminal Akholds some value skfrom which it generates a sequence of channel in-
puts TM
k= (Tk,1, . . . , Tk,M ). The channel is then invoked Mtimes and generates, for
each input tuple (T1,m, . . . , Tk,m), an output symbol Ym. At terminal B, the sequence
YM= (Y1, . . . , YM) is processed to yield a value f
˜. A distributed function approximation
scheme should have the property f
˜≈f(s1, . . . , sK). The meaning of “≈” can vary, but it
usually means that a suitably defined distance1between f
˜and f(s1, . . . , sK) is small and
approaches 0 as Mtends towards infinity. The channels we consider in this work mostly
have complex or real inputs and outputs, and they are of the form
Y=
K
∑︂
k=1
HkTk+N. (1.1)
In channels of this form, H1, . . . , HKare called the fading coefficients, and Nis called
the (additive) noise. The fading coefficients can in some cases be deterministic, but in
general, both the fading and the noise follow a random distribution. It is this randomness
which makes the channel output Y“noisy” in the sense that in general, it is not possible
to recover exact information about T1, . . . , TKor their sum from the observed value of Y.
1.1 Nomographic Functions
Given the channel (1.1), it is natural to ask which functions can be approximated from
distributed arguments over such channels. In case there is no fading or noise (i.e., N= 0
and H1=··· =HK= 1 almost surely), this turns out to be the class of nomographic
functions which we discuss in this section. The connection between nomographic functions
and distributed function approximation was first observed in [GS13] and has been discussed
1This distance does not need to be a metric.
1
1 Introduction
and analyzed in [GBS13] in more detail than we can in the following summary. Here and
in the rest of the thesis, Rdenotes the set of real numbers.
Definition 1. Anomographic representation of a function f:RK→Rconsists of
functions f1, . . . , fK, F :R→Rsuch that
∀a1, . . . , aK∈R:f(a1, . . . , aK) = F(︄K
∑︂
k=1
fk(ak))︄.(1.2)
A function f:RK→Rwhich has a nomographic representation is called a nomographic
function.
It has been noted in [Buc76, Theorem 8] that every function is nomographic according
to this definition. We state a version of this result that fits with Definition 1. Since it
illustrates the arguments below very well, we also give a full proof, based on the same idea
as in [Buc76].
Theorem 1. (adapted from [Buc76, Theorem 8]). Every function f:RK→Ris nomo-
graphic.
Proof. We first fix an arbitrary bijection ϕ:R→(0,1). An example of a possible choice
is
ϕ:a↦→ ⎧
⎪
⎪
⎪
⎨
⎪
⎪
⎪
⎩
1
2·1
a+1 , a ∈(0,∞)
1
2·(︂1 + 1
|a|+1)︂, a ∈(−∞,0)
1
2, a = 0.
(1.3)
Next, we define for every a∈(0,1) a sequence of digits ca,1, ca,2,··· ∈ {0,...,9}such
that2
a= 0.
dec
ca,1ca,2. . . := ∞
∑︂
i=1
ca,i ·10−i.(1.4)
We make the choice for the sequence ca,1, ca,2, . . . unique by requiring that it has to
contain infinitely many non-zero elements. Let, for all k∈ {1, . . . , K},
fk(a) := 0.
dec
0. . . 0
⏞⏟⏟⏞
k−1
cϕ(a),10. . . 0
⏞⏟⏟⏞
K−1
cϕ(a),20. . . 0
⏞⏟⏟⏞
K−1
cϕ(a),30. . . 0
⏞⏟⏟⏞
K−1
. . . . (1.5)
2Of course, there is nothing special about base 10 here, and in fact, [Buc76] uses dyadic representations.
We have chosen the base 10 here so that our representation coincides with the usual decimal notation
of numbers.
2
1.1 Nomographic Functions
Define ψ1, . . . , ψK, F : (0,1) →Rby
ψk: 0.
dec
b1b2. . . ↦→ ϕ−1⎛
⎝0.
dec
bkbk+Kbk+2K. . .⎞
⎠(1.6)
F:a↦→ f(ψ1(a), . . . , ψK(a)) .(1.7)
It is clear from our construction that the maps
(a1, . . . , aK)↦→
K
∑︂
k=1
fk(ak) (1.8)
a↦→ (ψ1(a), . . . , ψK(a)) (1.9)
are inverses of each other and therefore, (1.2) is satisfied, concluding the proof that fis
nomographic.
In order to use the nomographic representation of a function in a wireless communication
system, the inner functions f1, . . . , fKshould be computed at the transmitter before the
actual transmission, while the outer function Fshould be implemented and evaluated
at the receiver. Therefore, f1, . . . , fKare sometimes referred to as the pre-processing
functions while Fis called a post-processing function. The summation is performed by
the wireless channel due to its superposition property. If the receiver has access to f1(a1)+
···+fK(aK), then from (1.8) and (1.9), it is clear that a full reconstruction of a1, . . . , aK
is possible and in fact, this full reconstruction is used as an intermediate step in post-
processing. However, such an approach does not generalize to noisy channels in an obvious
way. Indeed, in (1.5) we can see that arbitrarily significant3digits of the transmitted values
can be hidden in digits of arbitrarily low significance in the real number that is transmitted
over the channel and therefore, even a channel noise of extremely low power can cause
arbitrarily strong disruptions. Of course, this is due to the specific construction of Fused
in the proof of Theorem 1, but we can expect that in general any strong discontinuity of
Fcan cause problems of this kind when the argument of Fis noisy.
It appears therefore that in order to apply a nomographic representation to a distributed
function approximation problem in the presence of noise, Definition 1 is not strong enough
and we need to impose additional constraints on the functions f1, . . . , fK, F. A famous
result [Arn57, Kol57] states that every continuous function f:RK→Rcan be written
3The significance of a digit in the decimal representation is an indication of how strongly a change of the
digit influences the value of the represented number. For instance, the significance of the digit ca,i in
(1.4) is 10−i, since the digit is multiplied with this number to determine the value of a.
3
1 Introduction
as a sum of 2K+ 1 functions with continuous nomographic representations,4giving a
positive answer in part to the question posed by Hilbert as the thirteenth problem in
his list of unresolved mathematical problems of the 20th century [Hil00]. If there was a
result implying that for every algebraic function, there is a nomographic representation
consisting only of algebraic functions, this would give a positive answer to the as-of-yet
unresolved part of Hilbert’s thirteenth problem. We can therefore expect that proving
such a result would be very hard.5Another result worth noting in this context is that
the set of functions with a continuous nomographic representation is nowhere dense in
the space of continuous functions [Buc82]. This provides another piece of evidence that
generic nomographic representations suitable for distributed function approximation may
not exist.
1.2 Distributed Function Approximation in Noisy Channels: An
Example
From an information-theoretic viewpoint, channels of the form (1.1) with no fading or noise
have infinite Shannon capacity. Therefore, they are not realistic in the sense that they
cannot be expected to sufficiently accurately model a real-world communication channel.
If we focus on finite-capacity channels, however, we can make some basic information-
theoretic arguments to further motivate the development of schemes in which f(s1, . . . , sK)
is directly approximated at Binstead of transmitting the values s1, . . . , sKto Bseparately
(from which f(s1, . . . , sK) could easily be computed at B).
If all values s1, . . . , sKare made available at B, this means that the number of uses of
the channel is lower bounded by the quotient of the Shannon entropy of s1, . . . , sKand
the sum-rate capacity of the channel. On the other hand, if the entropy of f(s1, . . . , sK) is
significantly smaller than that of s1, . . . , sK, then a scheme which only makes f(s1, . . . , sK)
available at Bis not subject to the same fundamental bound. The following example
illustrates that this difference of entropy can be significant.
Example 1. (from [7]). Suppose that Knodes send their data s1, . . . , snto a single
receiver through a MAC. For simplicity, we assume that each skis an independent random
variable uniformly distributed over S={0,1}. Now if the receiver reconstructs each of
these variables, then the entropy or the amount of information available at the receiver
is ∑︁K
k=1 H(sk) = Klog 2 nats where log(·)is the natural logarithm with Euler’s number
4A continuous nomographic representation is a nomographic representation that consists only of contin-
uous functions f1,...,fK, F.
5Hilbert even hypothesized that the correct answer to the question would be negative [Hil00,Hil27], which
was, however, partly disproven in [Arn57,Kol57].
4
1.3 Over-the-Air Computation
0 5 10 15 20 25 30
0
5
10
15
20
K
Entropy in nats
H(s1, . . . , sK)
H(∑︁K
k=1 sk)
Figure 1.1: Plot of sum vs. tuple entropy in Example 1.
e≈2.71828 as the basis, H(sk) := −∑︁s′∈S psk(s′) log psk(s′)is the Shannon entropy6
and psk:S ↦→ [0,1] is the probability mass function of sk. This means that the nodes
have to transmit Klog 2 nats (or, equivalently, Kbits) to the receiver. Therefore, if the
capacity of the communication channel is 1bit per channel use, then Kchannel uses
are necessary to convey the full information to the receiver.7Now we assume that the
receiver is only interested in f(s1, . . . , sK) = ∑︁K
k=1 skwhich can be easily computed from
s1, . . . , sK. By the data processing inequality [EGK11, Section 2.3], this operation cannot
increase the amount of information. In fact, the entropy of the function is H(∑︁K
k=1 sk) =
Klog 2 −∑︁K
k=1 (︁K
k)︁2−Klog (︁K
k)︁which is strictly smaller than Klog 2 for all K≥2. This
means that instead of transmitting Kbits that are necessary to reconstruct each sk, the
agents can send significantly less information to the receiver if its objective is to compute
the sum function f(s1, . . . , sK). In Fig. 1.1, it can be seen that this difference is quite
pronounced even for moderately large values of K.
1.3 Over-the-Air Computation
The approximation of functions of distributed arguments over channels of the form (1.1) is
of particular interest in the context of wireless communications. Here, use of the channel
corresponds to a concurrent transmission of waveforms from all transmitters, and the
6We use the convention 0 ·log(1/0) := 0 in the definition.
7In the case of orthogonal channel access, it is necessary to establish Kindependent (interference-free)
communication channels, where each of these has the capacity of 1 bit per channel use.
5
1 Introduction
receiver observes a noisy, superimposed version of these waveforms. In the context of
this application, distributed function approximation schemes can be seen as instances
of a class of schemes commonly called Computation over MAC (CoMAC), AirComp or
OTA computation. The goal of these schemes is to obtain a scaling behavior of the
communication cost in the number of transmitters that is better than the linear growth8
that would ensue from a separation of source and channel coding. Therefore, such schemes
exhibit the inherent property that the receiver is unable to fully reconstruct all of the
transmitted information.
The idea of a scheme that allows a receiver to reconstruct directly a combined form of
two messages, but not the original messages themselves, can be traced back to [KM79]
where a source coding problem is formulated in which it is the receiver’s task to reconstruct
a sequence of modulo-2 sums of encoded bits. An uncoded analog scheme for obtaining a
noisy estimate of a function of transmitted values with an application to wireless sensor
networks has appeared in [GV03] and is, to the best of our knowledge, the first work that
proposes a joint source-channel approach to OTA computation.
The authors in [GV03] take an analog approach in which a certain amount of noise
is tolerated in the received value and the function is computed only once.9This is in
contrast with a class of digital schemes that are closer to [KM79] in the sense that they
also consider functions with finite domains and typically give error guarantees for a large
number of repeated function computations.
1.3.1 Digital Over-the-Air Computation
In digital OTA computation, the function that is to be computed maps between discrete
sets. The computation is carried out repeatedly, and the objective of the corresponding
coding scheme is that the probability of a decoding error approaches zero as the number
of repetitions tends to infinity.
More formally, [NG07] introduces the problem of digital computation coding in the
following way:
Definition 2. Adigital computation coding problem consists of the following:
•A MAC Wwhich maps channel inputs T1, . . . , TKranging over the input alphabets
T1,...,TKto a channel output Ywhich ranges over the channel output alphabet Y.
8If the expense necessary for coordination and scheduling is also considered, this growth can even be
superlinear.
9The function can be computed multiple times since the scheme can simply be repeated, however, the
individual instances do not take advantage of the repeated computation.
6
1.3 Over-the-Air Computation
•An objective function
f:S1×···×SK→ S,(1.10)
where S1,...,SK,Sare finite sets.
•A probability distribution on S1×···×SK.
The idea is that, given this problem, the transmitters encode their messages s1, . . . , sK
as sequences of channel inputs in such a way that the receiver can, with high probability
of success, reconstruct f(s1, . . . , sK) without necessarily being able to draw any further
information about s1, . . . , sK.
Definition 3. An (n, M, ε)-code for a given digital computation coding problem consists
of:
•for each k∈ {1, . . . , K}, an encoder
EM
k:Sn
k→ TM
k(1.11)
•adecoder
DM:YM→ Sn(1.12)
such that if the sequence of channel inputs is determined by TM
k:= EM
k(sM
k), the error
probability at the receiver satisfies
P(︂DM(YM)= (f(s(1)
1, . . . , s(1)
K), . . . , f(s(n)
1, . . . , s(n)
K)))︂≤ε. (1.13)
These notions can then be used to define the analog of rate and capacity in classical
source or channel coding problems.
Definition 4. The computation rate of an (n, M, ε)-code is defined as the ratio n/M.
A computation rate Ris called achievable if there is a sequence of (n, M, ε)-codes of
computation rate Rwhere M→ ∞ and ε→0. The computation capacity is the supremum
of all achievable computation rates.
This framework is extended by allowing the alphabets S1,...,SK,Sto be infinite and
then characterizing the rate-distortion trade-off. In any case, the computation coding
problem combines source and channel coding because the encoders simultaneously remove
redundancy from the sources and protect the transmission against channel noise. The
authors of [NG07] note examples where the rate that separate source and channel coding
can achieve is strictly less than the computation capacity.
7
1 Introduction
In the setting with finite alphabets, the typical objective function considered is addition
in a finite field, and the main application noted by the authors is physical layer net-
work coding. This idea was seminal to a lot of follow-up research (e.g., [ZNGE09, NG11,
OZE+11, NCNC16, GJRM+16]) which has expanded upon and refined the idea of using
OTA computation as a means to increase the efficiency of network coding. Notably, there
is also a work [GBS14] which proposes schemes that use digital computation codes in con-
junction with a quantizer to compute functions that are of interest in other applications,
such as the arithmetic mean, the geometric mean and the Euclidean norm.
1.3.2 Analog Over-the-Air Computation
The framework of digital computation codes is promising and its applications to network
coding are highly relevant as they can realize impressive performance gains in wireless
networks. However, is also has downsides in the context of other applications:
•The notion of computation capacity is an asymptotic one valid only for block lengths
tending to infinity. While finite-blocklength results are certainly conceivable, it is
nonetheless an inherent property of any approach involving digital coding that a
certain number of repeated function computations is necessary in order to guarantee
a reasonably low probability of decoding error. This can be problematic in applica-
tions where only a few computations are necessary or where protocols are used in
which the roles of transmitters and receivers change frequently with only very few
computations being done between these changes.
•To the best of our knowledge, the only known digital coding schemes which can deal
with channel fading compute sums over finite fields for the application of network
coding. Examples of functions that existing digital schemes cannot compute over
fading channels include weighted sums which have a high relevance in the context
of OTA ML, as well as maxima and various kinds of averages which are important
in the context of consensus algorithms and control systems.
•The digital coding schemes can only deal with discrete messages. If real (or floating
point) numbers are processed in a certain application, a quantizer needs to be added
to the system. Since quantization is a form of source coding, this is somewhat in
contrast with the observation that joint source-channel approaches are necessary to
achieve optimum system performance.
A way to make OTA computation applicable where these disadvantages hinder the use
of digital schemes is to process analog input values directly into an electromagnetic signal
without first going through a sequence of bits (or other discrete values) as an intermediary
8
1.3 Over-the-Air Computation
step. A striking observation in this context is that a standard wireless channel actually
performs a summation of the transmitted signals (which, through their inphase-quadrature
(IQ) representations, can be seen as points in Euclidean space). This opens the door to the
computation both of weighted sums and (as a special case) arithmetic averages, which we
have noted above are very relevant functions both for OTA ML and consensus algorithms.
There are two important research questions that these observations directly raise:
•If we were able to compute real function values in an analog system without error, this
would in the point-to-point case degrade to a possibility to losslessly transmit a real
number through the wireless channel which would imply infinite Shannon capacity
of the channel. Since this is known to be unrealistic for any real-world channel, we
can immediately conclude that a certain amount of noise in the computed function
values is unavoidable in any kind of analog OTA computation scheme. But is it
possible to control the strength of the noise, for instance by providing tail bounds
for its magnitude?
•We can expect from the structure of the wireless channel that it can compute sums
in Euclidean space, but can we, with the use of suitable pre- and post-processing
schemes, compute a larger class of functions OTA?
A pragmatic way to proceed in light of these difficulties is to attempt to find a subclass of
functions that is small enough to permit nomographic representations which are suitable
for use with noisy communication systems and at the same time large enough to contain
most functions of interest in practical OTA computation problems.
There are several prior works that propose approaches to the OTA computation prob-
lem for functions particularly relevant to applications for consensus problems in wireless
networks and ML over wireless channels. [GBS13] systematically explores the question
of what types of functions can efficiently be OTA computed with analog schemes, also
taking into account many system-level aspects such as the usage of analog OTA com-
putation schemes in large wireless networks with changing topologies. [GS13] presents a
scheme that is able to deal with imperfect synchronization and the presence of fading in
OTA computation; extensive theoretical analyses for the asymptotic case is provided for
the arithmetic and geometric mean functions. [GS14] presents pre- and post-processing
schemes and an asymptotic analysis for the approximation of functions over a fast fading
channel with noise in the case of multiple receive antennas. The work covers the case of no
instantaneous channel state information (CSI) at the transmitter or receiver as well as two
different types of instantaneous CSI at the transmitter. In [RGS16], under the assumption
of known fading coefficients at the transmitter, a similar scheme is used for computing the
sign of a weighted sum which is the decision function of a linear Support Vector Machine
9
1 Introduction
(SVM) used for classification. As a result, the authors obtain a distributed binary classi-
fication scheme that is highly efficient in massively-sized wireless networks. In the more
recent work [LZLV20], under the assumption that the sources are independent and the
channel state is known at both the receiver and the transmitter, the authors derive analog
OTA computation schemes for sums that are optimal in terms of mean square error. In
the case of independent and identically distributed (i.i.d.) Gaussian sources the authors
of [DSD20] show how to OTA compute sums over slow fading channels where the channel
state information is available neither at the transmitter nor the receiver. The work also
considers intersymbol interference and provides an asymptotic theoretical analysis as well
as numerical results.
1.4 Applications of Over-the-Air Computation
OTA computation has potential applications in every setting in which such a large num-
ber of wireless devices share constrained wireless resources that it becomes inefficient or
even infeasible to exclusively use traditional scheduling and separate decoding of all trans-
mitted information before it is post-processed at the receiver. Furthermore, even if the
available resources are tremendous, but the number of participating devices is so large
that traditional scheduling becomes prohibitively expensive, OTA computation can be a
useful tool to solve the problem. On the other hand, it inherently fuses concepts that have
traditionally been separate in communication systems. We have already discussed the
point that from an information theoretic perspective, it is a joint source-channel approach
that breaks with the traditional separation paradigm. But also from the perspective of
network architecture, it means using schemes on the physical layer that are at least in part
tailored to specific applications, and traditional methods of scheduling and routing have
to be adapted to be compatible. Therefore, OTA computation can be seen as a cross-layer
approach that encompasses the entire network stack from the application layer all the way
down to the physical layer. While the pre- and post-processing schemes can be proposed
in such a generic manner that they can in principle be used for a large variety of potential
applications, they still need to be carefully adapted to each one. There are two main
fields of application that have recently motivated the development of OTA computation
schemes, namely distributed OTA ML and consensus algorithms. In this section, we give
a brief overview of these two applications.
1.4.1 Distributed Machine Learning
In this subsection, we take a look at distributed ML, in particular Federated Learning (FL),
describe how this field branches into VFL and Horizontal Federated Learning (HFL) and
10
1.4 Applications of Over-the-Air Computation
cite a few examples from the literature that approach FL problems with OTA computation
methods. First, we need to define what ML is for the sake of this thesis, and we follow
the formalism in [SC08].
Definition 5. Astatistical inference problem is a tuple (X,Y,P,L), where
•the feature alphabet Xis a Polish space (usually a high-dimensional Euclidean
space),
•Y⊆Ris called the label alphabet,
•Pis a probability measure on X×Y,
•L:X×Y×R→[0,∞)is called the loss function.
In the usual application setting, only the feature and label alphabets and the loss func-
tion are known about the statistical inference problem, while information about Pis only
known indirectly through a training sample.
Definition 6. Given a statistical inference problem (X,Y,P,L), a training sample of
length nis a sequence (xj, yj)n
j=1 ∈Xn×Ynwhere each (xj, yj)is drawn i.i.d. according
to P.
The objective in solving an ML problem is to find an ML model which can make predic-
tions about the labels of newly drawn samples of P, given only the features. An ML model
is a mathematical object which provides, given a set of parameters, a labeling function.
Examples of ML models are neural networks, SVMs and decision trees.
Definition 7. Given a statistical inference problem (X,Y,P,L), a labeling function is
a function f:X→R. A labeling function induces a risk (sometimes also called loss)
RL,P:= EPL(X, Y, f(X)), where (X, Y )is the pair of random variables ranging over
X×Yand distributed according to P.
Typically, the objective is to exploit the indirect knowledge that we have about P
through the training sample to obtain a labeling function with low risk, which is usually
the measure for how well we have solved the ML problem. To this end, a training procedure
for a given ML model takes a training sample as its input and outputs parameters for the
ML model. Therefore, in conjunction with the model, it maps training samples to labeling
functions.
Distributed ML studies cases of ML problems where some of the information about the
statistical inference problem or the training sample are only known at certain locations in
a network. Although there are possibilities for communication between the agents in the
11
1 Introduction
network, there are application-specific reasons for not transmitting the entire information
to a central point. One particular instance of Distributed ML is called FL [KMRR16,
MMR+17]. In FL, the initial main reason for not transmitting all the available information
to a central point and then solving the ML problem in the traditional way is to preserve
the privacy of the users from whom the training data is collected,10 but communication
efficiency also plays an increasingly important role. FL can be further categorized into
HFL and VFL [YLCT19].
In HFL, each agent kout of a total of Kagents in the system sees only a subsequence
of the training sample (xjk,i , yjk,i )nk
i=1. In principle, it is possible for each agent to train
its own local ML model based on the locally available training sample. Depending on the
application at hand, however, this can incur several difficulties:
•The locally available training subsamples may simply be too small to train an ML
model and obtain an acceptable risk.
•The way in which the locally available training subsamples are drawn from the
overall training sample may be such that the subsamples are not i.i.d. or do not
follow P[ZLL+18]. For instance, it is common for the subsamples to be biased
towards certain labels in a way the overall training sample is not.
Distributed optimization algorithms can be used to carry out the training in a decentral-
ized manner. They make, either at one central point or everywhere in the network, a
trained ML model available that benefits from the whole training sample without trans-
mitting it through the network in its entirety. There is a huge body of recent research
(cf., e.g., [AG20b, ZWH20, ASK19, YJSD20, AG19, GdKS+19, ZDHL20, OUG19, AG20a,
ZLD+20, ZCL+19, AOGE20, SC20, SZG20, STL20, ADG19, CT20] and references therein)
into ways to perform distributed optimization algorithms such as stochastic gradient de-
scent exploiting OTA computation. This approach can achieve fundamentally more favor-
able scaling laws than would be possible otherwise.
In VFL, the data is distributed in a different way: In a system with Kagents, the
statistical inference problem has a feature alphabet X=X1×···×XKthat is a Cartesian
product of Kfeature spaces. A feature x∈Xcan therefore be written as a tuple x=
(x1, . . . , xK) and the training sample is of the form ((x1,j, . . . , xK,j), yj)n
j=1 where each
agent khas only the local training sample (xk,j, yj)n
j=1. Correspondingly, when training
is complete and a label needs to be estimated, each agent ksees only the projection
to Xkof the observed feature. Since the labeling function has the whole feature space
X=X1×···×XKas its domain, the arguments to compute it are not available at any
10A major motivation for introducing the FL framework was Gboard, a software made by Google which
is used as the default keyboard on many Android devices [MR17].
12
1.4 Applications of Over-the-Air Computation
single point in the network and it is therefore natural to attempt to compute the labeling
function OTA. So there are two important research questions in OTA-VFL:
•Given a training sample that is distributed as described above, how can we carry
out a distributed training procedure exploiting OTA computation that scales better
than linear in the number of agents involved?
•Given the trained model (which also is available only in a distributed manner), how
can we compute the labeling function using the OTA approach?
The first question is quite similar to the main research question in OTA-HFL and there
is some hope that tools from this field could be suitably adapted. The second question is
more specific to the VFL scenario, and we note that many standard ML labeling functions
naturally take the form of (weighted) sums. Examples are layers of neural networks (the
activation function can be evaluated afterwards in post-processing if necessary) and the
linear SVMs that have been used for OTA-VFL in [RGS16]. Contrary to OTA-HFL, there
does not appear to be a large body of research on OTA-VFL. Besides [RGS16] and the
work presented in Chapter 3, we are not aware of any works that propose to leverage
OTA computation in a VFL scenario. However, there is a related research area called
Type-Based Multiple-Access [MT06, MNT07] which is concerned with solving problems
very similar to those approached by OTA VFL such as anomaly detection in extremely
large networks. An important difference is that instead of transmitting analog values
directly, this approach relies on a prior quantization step and then exploits the fact that
the number of quantization levels usually does not grow with the number of transmitters
in the system. Furthermore, this approach uses statistical methods and knowledge about
the involved probability distributions, while the VFL approach that we use in Chapter 3 is
based on the ML paradigm and hence does not require a priori knowledge of the underlying
distributions.
1.4.2 Consensus over Wireless Channels
Consensus problems deal with combining opinions of participating agents to achieve an
agreement that encompasses their information about or subjective assessments of an ob-
ject. They have originally appeared as statistical problems in which the opinions are
probability distributions which have to be combined to form a consensus distribution.
In [EG59] this is illustrated as a horse race betting problem where the agents’ opinions are
probability distributions on which horse will win the race. They place their bets according
to these opinions and the overall track’s odds that result from these bets are considered
the consensus which in a certain way combines all the participating agents’ opinions. The
13
1 Introduction
problem has subsequently been stated as one of combining various experts’ opinions and
researched extensively to aid with decision making in the context of management sciences
(see, e.g., [Win68,Fre85] and references therein).
The research on this theory has later been applied to problems of multisensor fusion and
pattern recognition [BS92] and since found a multitude of other applications in engineering
sciences [OSFM07]. In some of the engineering applications the nature of the difficulty of
the problem has shifted significantly: Often, an opinion is simply a real number or vector
and the way the opinions have to be combined to form the consensus is fully prescribed
by the application at hand and is fairly simple compared to the original consensus prob-
lem. For instance, the consensus can be the arithmetic average (with applications, e.g.,
in formation control and flocking of autonomous vehicles [OS06]) or the maximum of the
opinions (examples for applications include task assignment [BCH08] and traffic automa-
tion [MDR19]). In these applications, the challenge is that it is infeasible to aggregate
the opinions in a central point because the communication cost or the time delay incurred
would be prohibitive. In these cases, distributed consensus algorithms are used that seek
to make the consensus value available to agents in a large network with a minimum of
communication required between the agents [OSFM07].
In many applications, the communication links between the agents are wireless channels,
and indeed, several agents can be linked to another agent via a wireless broadcast or
multiple-access channel. Some works that exploit these properties to reach average or
maximum consensus in a way that is more communication-efficient than would be possible
with point-to-point communication are [ICJ12,MSR18,MDR19,MASR21]. We expect that
theoretical analysis of OTA computation techniques could serve as a building block to
enhance the efficiency and in particular the scaling behavior of the communication cost
in the number of participating agents. Moreover, this way it would be possible to provide
additional theoretical error guarantees for consensus schemes that exploit the superposition
of signals in the wireless channel. In [9], we have proposed a maximum consensus scheme
which leverages analog OTA computation of sums to make the maximum of the agents’
opinions available at the receiver in a wireless MAC with no fading but with additive
noise. The OTA computation schemes proposed in this thesis can be used to extend
these results to channels exhibiting fast fading [2]. It is in particular worth noting that
the scheme proposed in [9] can OTA compute the maximum of the agents’ opinions in a
wireless channel although we do not expect the maximum function to satisfy Definition 10.
This is achieved not through a single OTA computation but through a multi-step protocol
that alternates between analog OTA computation of sums and digitally coded broadcast
communication. We believe therefore that such multi-step protocols are a potentially
promising approach to computing functions that are not amenable to the one-shot OTA
14
1.5 Contribution and Outline
computation methods we propose in this thesis. This is at the cost of higher system and
communication complexity, but a favorable scaling of communication cost in extremely
large networks would be retained.
1.5 Contribution and Outline
This thesis proposes schemes for the approximation of functions of distributed arguments
and deals with the question how OTA computation can be made robust under harsh
channel conditions. We also give examples of applications to distributed ML and undertake
first steps towards hardening such schemes against eavesdroppers. Although we provide
results of numerical simulations for OTA computation under harsh channel conditions
and its applications to distributed ML, the main focus of this thesis is on mathematically
proving theoretical guarantees. In this sense, our work is complementary to the many
predominantly empirical studies in the literature which investigate OTA computation and
its applications in specific real-world scenarios through numerical simulations.
The material contained in this introduction is based on the book chapter [7] with the ex-
ception of this subsection which uses excerpts from the publications the respective chapters
are based on.
In Chapter 2, we propose a distributed function approximation scheme which can be
applied to wireless channels to perform analog OTA computation without instantaneous
CSI. It is not necessary to assume any probability distribution on the data. Therefore,
the scheme works equally well for independently distributed data as it does for arbitrarily
correlated values. Our error analysis relies on mathematical tools from high-dimensional
statistics which we adapt so that they can be applied to the scenario at hand. The
resulting error guarantees are nonasymptotic and are valid for any fading and noise in
the class of sub-Gaussian distributions. This class contains Gaussian as well as many
non-Gaussian distributions of practical interest. Moreover, our scheme can deal with
correlated fading and noise and compute a larger class of functions than previous works
with theoretically proven bounds. In particular, we do not require linearity of the function
to be approximated, which is, e.g., demonstrated by the fact that we can compute p-norms
OTA. We conclude the chapter with numerical evaluations for a few selected cases.
The material in Chapter 2 is based on the conference publications [2,3] and the journal
publication [4]. The introductory remarks are in part based on the book chapter [7].
In Chapter 3, we propose applications of our scheme to the OTA computation of both
regressors and classifiers in VFL and validate the proposed OTA computation schemes
and the envisioned applications in ML with extensive numerical simulations for the case
of a binary classification problem.
15
1 Introduction
The material in Chapter 3 is based on the conference publications [2,3] and the journal
publication [4].
In Chapter 4, we propose a novel framework and result for incorporating security
considerations by including a friendly jammer in the system which deteriorates the eaves-
dropper’s SNR while not impacting the legitimate receiver’s ability to obtain an approxi-
mation of the function value which is to be OTA computed. As an example, we show how
this jamming strategy can be applied to an AWGN channel where the OTA computed
function is an arithmetic average. We prove that the security guarantee translates to a
lower bound on the mean square error of the eavesdropper’s function estimate. Our proofs
heavily rely on two information-theoretic results which we also prove in this chapter. The
first information-theoretic ingredient is a theorem on compound channel coding for con-
tinuous alphabets. It is a generalization of the result of the part of [RV68] which considers
finite Gaussian channels and we can consequently recover this result as a special case.
The second information-theoretic ingredient is an achievability theorem on resolvability of
channels with continuous alphabets. Note that the publication [1] on which this part of
the chapter is based contains also a converse result and a second-order direct result, both
of which are not part of this thesis.
The material in Chapter 4 is based on the conference publications [1,5], and on the sub-
mitted journal paper [6]. A revised version of the latter paper is currently under consider-
ation for publication. The introductory remarks are in part based on the book chapter [7].
Further Work. During my time at Technische Universit¨at Berlin as a PhD student, we
were able to obtain further results which are, however, not part of this thesis. They are
listed in chronological order.
•In [8], we revisit a secrecy proof for the MAC from the perspective of channel resolv-
ability and refine the approach to obtain novel results on the second-order achievable
rates.
•In a work with Navneet Agrawal [9], we propose a multi-step communication proto-
col based on the idea of OTA computation. The resulting scheme is able to compute
the maximum of a set of distributed values in a wireless network. While the com-
munication cost and complexity are greater than in the one-shot method studied
in this thesis, they exhibit a similarly favorable scaling behavior as the number of
transmitters in the system grows. Computation of the maximum function is relevant
in many distributed control systems and moreover, this example shows that OTA
computation can potentially be applied even to functions that are not contained in
the class of functions studied in this thesis.
16
1.5 Contribution and Outline
•In a series of works with Zoran Utkovski, Patrick Agostini, Miguel Gutierrez-Estevez
and Daniel Sch¨aufele [10–12], we investigate how physical layer security methods
could be integrated in future cellular networks. The security of this class of commu-
nication schemes is always based on the assumption that the channel of the legiti-
mate receiver is stronger than that of the eavesdropper. As a way of ensuring that
this assumption holds true in a real-world wireless network, we propose the concept
of secrecy maps. They are based on radio maps and help the infrastructure and
legitimate users of wireless networks to predict physical locations at which secure
communication is possible as well as give guidance on how conditions in other loca-
tions can be improved, for instance with the use of intelligent reflective surfaces or
friendly jamming.
•In [13], we propose a new proof method for direct coding theorems for wiretap
channels where the eavesdropper has access to a quantum version of the transmitted
signal on an infinite dimensional Hilbert space. The main part of this proof is a
direct coding result on channel resolvability which states that there is only a doubly
exponentially small probability that a standard random codebook does not solve the
channel resolvability problem for the classical-quantum channel.
17
2 Distributed Function Approximation in
Fading and Correlated Channels
In this section, we discuss our Distributed Function Approximation (DFA) scheme which
we proposed in [2] and extended in [3, 4]. The goal in introducing it was to provide a
flexible framework that can deal with such a large class of wireless channels that the
scheme would be robust to departures from common assumptions on the system model
such as Gaussianity of the fading and noise. At the same time, the class of functions for
OTA computation should contain the most relevant ones in current applications (which
are mainly weighted sums). It should also be large enough to provide flexibility and
make the DFA scheme applicable in scenarios where functions that have not yet received
much attention are computed OTA. Another important consideration in the design of the
scheme was the distribution of the sources. Many existing works on OTA computation
assume a particular source distribution for their theoretical analysis, and usually require
that the transmitted values are independently distributed between the transmitters. Since
this requirement is extremely difficult to check in practice, we have decided to not model
the sources stochastically. Instead, we show that the bound on the approximation error
is satisfied uniformly over all possible values of the sources. This yields a worst-case
analysis with theoretically proven error guarantees that are valid for every distribution
of the sources, even if there is arbitrary correlation between them. In addition, the error
bounds are nonasymptotic in the sense that they are valid for any number of channel uses,
not just for a sufficiently large one.
In theoretical works, it is of particular importance that the considered channel model is
sufficiently general so that the assumptions made are met in relevant practical scenarios.
The commonly considered Gaussian fading channel is an approximation that is often
adopted because it is relatively easy to treat and is quite close to reality in many scenarios
of interest.
However, it is well known and has been confirmed in extensive measurement cam-
paigns [MS93, BRB93, Mid99] that there are many natural and artificial sources of noise
that do not conform to the assumption of being i.i.d. Gaussian such as automobile igni-
tion, power line emissions, atmospheric disturbances or interfering wireless communica-
19
2 Distributed Function Approximation in Fading and Correlated Channels
tions [MS93].
Moreover, common arguments that the fast fading in wireless channels is Gaussian
employ the Central Limit Theorem [YC16, Section 2.4.1] and therefore assume a large
number of multipath components, which is not always the case. In scenarios with limited
mobility, it is also possible that the fast fading realizations are not independent between
channel uses. In the case of the OTA computation scheme proposed in [2], it can deteriorate
performance and therefore, it is desirable to be able to quantify this deterioration.
In this work, we therefore consider a channel model that encompasses a large class of pos-
sible probability distributions of fading and noise, the class of sub-Gaussian distributions.
The analysis provided is valid also for fading and noise exhibiting an arbitrary correla-
tion structure, with practically useful bounds in many relevant cases. In Section 2.1.1,
we define precisely what sub-Gaussian means and in Section 2.1.3, we give examples of
practically relevant cases that are covered by our system model assumptions.
2.1 System Model
2.1.1 Sub-Gaussian Random Variables
We begin with a short overview of the relevant definitions and properties of sub-Gaussian
random variables. More on this topic can be found in Section 2.6 and in [BK00, Wai19,
Ver18].
For a random variable X, we define11
τ(X) := inf {︂t > 0 : ∀λ∈R E exp (λ(X−EX)) ≤exp (︁λ2t2/2)︁}︂.(2.1)
exp(·) denotes the exponential function with Euler’s number e≈2.71828 as the basis.
Xis called a sub-Gaussian random variable if τ(X)<∞. The function τ(·) defines a
semi-norm on the set of sub-Gaussian random variables [BK00, Theorem 1.1.2], i.e., it is
absolutely homogeneous, satisfies the triangle inequality, and is non-negative. τ(X)=0
does not necessarily imply X= 0 unless we identify random variables that are equal almost
everywhere. Examples of sub-Gaussian random variables include Gaussian and bounded
random variables.
20
2.1 System Model
EM
K
EM
2
EM
1
DM
.
.
.
.
.
.
×
×
×
HK
H2
H1
+ +
N
sK
s2
s1
TM
K
TM
2
TM
1
YMf
˜
Figure 2.1: System model for DFA.
2.1.2 Channel and Correlation Model
We consider the following channel model with Ktransmitters and one receiver: For m=
1, . . . , M, the channel output at the m-th channel use is given by
Y(m) =
K
∑︂
k=1
Hk(m)Tk(m) + N(m).(2.2)
Here and hereafter, the notation is defined as follows:
•Tk(m)∈Cis a transmit symbol, where Cdenotes the set of complex numbers. We
assume a peak power constraint |Tk(m)|2≤Pfor k= 1, . . . , K and m= 1, . . . , M.
•Hk(m), k= 1 ...,K,m= 1, . . . , M, are complex-valued random variables such that
for every m= 1, . . . , M and k= 1, . . . , K, the real part HRe
k(m) and the imaginary
part HIm
k(m) of Hk(m) are sub-Gaussian random variables with mean zero and
variance 1.
•N(m), m= 1, . . . , M, are complex-valued random variables. We assume that the real
and imaginary parts NRe(m), NIm(m) of N(m) are sub-Gaussian random variables
with mean zero for m= 1, . . . , M.
In order to be able to apply a variation of the Hanson-Wright inequality as a tool, we
give the formal description of our dependence model in terms of matrices and vectors with
real entries.
We define
H:= (H(1),...,H(2M))T(2.3)
11Note that other norms on the space of sub-Gaussian random variables that appear in the literature
are equivalent to τ(·) (see, e.g., [BK00]). The particular definition we choose here matters, however,
because we derive results in which no unspecified constants appear.
21
2 Distributed Function Approximation in Fading and Correlated Channels
where ·Tdenotes the transpose and for m= 1, . . . , M,
H(2m−1) := (HRe
1(m), . . . , HRe
K(m))
H(2m) := (HIm
1(m), . . . , HIm
K(m)).
So His the vector of all fading coefficients. Similarly, let
N:= (NRe(1), NIm(1), . . . , NRe(M), NIm(M))T(2.4)
be the vector of all the instances of additive noise. The dependence model we consider
is such that there is a vector Gof (2KM + 2M) independent random variables with sub-
Gaussian norm at most 1 and matrices A ∈ R2KM×(2KM+2M)and B ∈ R2M×(2KM+2M)
such that
H=AG, N =BG.
Our correlation model captures both correlations between users and in the time domain,
but the impact of these two types of correlation on the performance of the proposed scheme
is different. For this reason, we need the following definition, which describes a scenario
where there can be arbitrary correlation in the time domain but no harmful correlation
between different users. In order to be as unrestrictive as possible, the following definition
prohibits only correlations between different users at the same complex dimension of the
same channel use.
Definition 8. We say that the fading is user-independent if for every k1=k2,j∈
{Re,Im}and m, the random variables Hj
k1(m)and Hj
k2(m)are independent.
In Section 2.3.1, we discuss the impact of deviations from the user-independence as-
sumption on the performance of the proposed method.
Remark 1. User-independent fading can be characterized based on the form of A. That
is, the fading is user-independent iff it can be written as H=AG with
A=⎛
⎜
⎜
⎝
A(1)
.
.
.
A(2M)
⎞
⎟
⎟
⎠,
where for all m,A(m)∈RK×(2MK+2M)and each A(m)has at most one nonzero entry per
column. This is because H(m) = A(m)Gand therefore at most one nonzero entry in a
column of A(m)means that at most one entry in H(m)depends on the entry in Gto which
22
2.1 System Model
the column refers.
2.1.3 Discussion of the System Model Assumptions
The most distinguishing feature of our channel model compared to assumptions made
in prior work on OTA computation is that we generalize the distribution of the fading
and noise to be sub-Gaussian and to feature arbitrary correlations. We argue that this
guarantees robustness against departure from standard assumptions such as independent
or Gaussian fading or noise. Regarding the chosen dependence model, we remark that in
the case of Gdistributed i.i.d. standard Gaussian, this amounts to a standard representa-
tion of arbitrarily correlated (and thus arbitrarily interdependent) multivariate Gaussian
vectors. Therefore, by replacing Gwith a vector of independent sub-Gaussian entries, we
obtain a straightforward generalization of the Gaussian case, which specializes to arbitrary
correlations (although in the non-Gaussian case, not to arbitrary stochastic dependence).
In particular, the following important cases are specializations of our channel model:
•Perfect Gaussian fast fading with i.i.d. bivariate Gaussian fading and additive white
Gaussian noise.
•Scenarios where the number of multipath components is not sufficiently large for an
appeal to the Central Limit Theorem to argue that the fading is complex Gaussian.
For instance, if there is only one multipath component with a length that has small
random variations, the resulting complex fading will have a distribution supported
on a narrow annulus in the complex plane. Such a distribution is not Gaussian, but
sub-Gaussian and therefore covered by our channel model.
•Scenarios where due to limited or slow movement of transmitters, receiver and envi-
ronment, the independence assumption between subsequent realizations of the chan-
nel fading is not satisfied or where the diversity in the radio environment is so small
that some of the users have correlated channels. One example of a widely used
channel model that is a special case of the one considered in this paper is the block
fading channel. In this case, we have a correlation of 1 between fading realizations
of the same block and 0 between fading realizations in different blocks.
•Any of the above scenarios where in addition to thermal additive noise, we have
interference from users outside the system. E.g., in the case of digital modulated
signals, such interference consists of a sequence of transmitted constellation points,
which is not Gaussian. Also, such interfering signals are inherently bursty in nature
and therefore cannot be argued to be independent between different points in time.
23
2 Distributed Function Approximation in Fading and Correlated Channels
However, signals from realistic transmitters are always bounded in amplitude and
therefore sub-Gaussian, which means that they are covered by our system model.
•Any of various types of artificial and natural interference that is not necessarily
Gaussian, but limited in power. [MS93,BRB93,Mid99] investigate various sources of
non-Gaussian interference of this type, both correlated and uncorrelated over time,
through theoretical modeling as well as extensive experiments and measurements
confirming the theoretical models and the non-Gaussianity of the sources. [MS93,
Table 2.1] enumerates several examples of this type of interference, including, e.g.,
solar radiation, automobile ignition and power line EM emissions.
The other main difference between our system model and the models used in existing
works is that we do not make any assumption on a distribution of the input data at the
transmitters. This means in particular that we cover arbitrary dependencies in the input
data, which can be very important for applications, because the input data can depend,
e.g., on local sensor readings recorded by devices or on training data collected for the same
ML problem. Therefore, in many relevant application scenarios, it is important that the
transmission schemes employed are robust even to high, but unknown levels of correlations
in the transmitted data.
2.2 Problem Statement
2.2.1 Distributed Approximation of Functions
Our goal is to approximate functions f:S1×. . . × SK→Rin a distributed setting.
The sets S1, . . . SK⊆Rare assumed to be closed and endowed with their natural Borel
σ-algebras, and we consider the product σ-algebra on the set S1×. . . ×SK. Furthermore,
the functions f:S1×. . . ×SK→Runder consideration are assumed to be measurable in
what follows.
Definition 9. An admissible DFA scheme for f:S1×. . . × SK→Rwith Mchannel
uses, depicted in Fig. 2.1, is a pair (EM, DM), consisting of:
1. A pre-processing function EM= (EM
1, . . . , EM
K), where each EM
kis of the form
EM
k(sk) = (Ek(m, sk, Uk(m)))M
m=1 ∈CM
where Uk(1), . . . , Uk(M)are random variables and the map
(sk, u1, . . . , uM)↦→ (Ek(m, sk, um))M
m=1 ∈CM
24
2.2 Problem Statement
is measurable for skranging over Skand u1, . . . , uMranging over the same sets as
the i.i.d. random variables Uk(1), . . . , Uk(M). The encoder EM
kis subject to the peak
power constraint |Ek(m, sk, Uk(m))|2≤Pfor all k= 1, . . . , K and m= 1, . . . , M.
2. A post-processing function DM: The receiver is allowed to apply a measurable recov-
ery function DM:CM→Rupon observing the output of the channel.
So in order to approximate f, the transmitters apply their pre-processing maps to
(s1, . . . , sK)∈ S1×. . . × SKresulting in EM
1(s1), . . . , EM
K(sK), which are sent to the
receiver using the channel Mtimes. The receiver observes the output of the channel and
applies the recovery map DM. The whole process defines an estimate f
˜of f.
Let ε > 0, δ ∈(0,1) and f:S1×. . .×SK→Rbe given. We say that fis ε-approximated
after Mchannel uses with confidence level δif there is a DFA scheme (EM, DM) such
that the resulting estimate f
˜of fsatisfies
P(|f
˜(sK)−f(sK)| ≥ ε)≤δ(2.5)
for all sK:= (s1, . . . , sK)∈ S1×. . . ×SK. Let M(f, ε, δ) denote the smallest number of
channel uses such that there is an approximation scheme (EM, DM) for fsatisfying (2.5).
We call M(f, ε, δ) the communication cost for approximating a function fwith accuracy
εand confidence δ.
2.2.2 The Class of Functions to be Approximated
A measurable function f:S1×. . .×SK→Ris called a generalized linear function if there
are bounded measurable functions (fk)k∈{1,...,K}, with f(s1, . . . , sK) = ∑︁K
k=1 fk(sk),for all
(s1, . . . , sK)∈ S1×. . .×SK. The set of generalized linear functions from S1×. . .×SK→R
is denoted by FK,lin. Our main object of interest will be the following class of functions.
Definition 10. A measurable function f:S1×. . . ×SK→Ris said to belong to Fmon if
there exist bounded and measurable inner functions functions (fk)k∈{1,...,K}, a measurable
set A⊆Rwith the property f1(S1) + . . . +fK(SK)⊆A, a measurable outer function
F:A→Rsuch that for all (s1, . . . , sK)∈ S1×. . . ×SKwe have
f(s1, . . . , sK) = F(︄K
∑︂
k=1
fk(sk))︄,(2.6)
and there is a strictly increasing measurable function Φ : [0,∞)→[0,∞)with Φ(0) = 0
and
|F(a1)−F(a2)| ≤ Φ(|a1−a2|) (2.7)
25
2 Distributed Function Approximation in Fading and Correlated Channels
for all a1, a2∈A. We call the function Φan increment majorant of f.
Some examples of functions in Fmon are:
1. Obviously, all f∈ FK,lin belong to Fmon.
2. For any f∈ FK,lin and B-Lipschitz function F:R→Rwe have F◦f∈ Fmon with
Φ : [0,∞)→[0,∞), a↦→ Ba.
3. If f∈ FK,lin and Fis (C, α)-H¨older continuous, i.e., for all a1, a2∈A,|F(a1)−F(a2)| ≤
C|a1−a2|α,then F◦f∈ Fmon with Φ : a↦→ Caα.
4. For any p≥1 and S1,...,SKcompact, || · ||p∈ Fmon. In this example we have
fk(sk) = |sk|p,k= 1, . . . , K,F: [0,∞)→[0,∞), a↦→ a1/p, and F= Φ.
This can be seen as follows. We have to show that for all nonnegative a1, a2∈R
and p≥1 we have
|a1/p
1−a1/p
2|≤|a1−a2|1/p.(2.8)
We can assume w.l.o.g. that a1< a2holds. Then since
|a1/p
1−a1/p
2|=|a2|1/p (︄1−(︃a1
a2)︃1/p)︄
it suffices to prove that for all a∈[0,1] and p≥1 we have 1 −a1/p ≤(1 −a)1/p.
Now since a1/p +(1 −a)1/p ≥a+(1−a) = 1 for a∈[0,1] and p≥1, we can conclude
that (2.8) holds.
Given a function f∈ Fmon, we implicitly fix a representation (2.6) and define the total
spread of the inner part of f∈ Fmon as
∆
¯(f) :=
K
∑︂
k=1
(ϕmax,k −ϕmin,k),(2.9)
along with the max-spread
∆(f) := max
1≤k≤K(ϕmax,k −ϕmin,k),(2.10)
where
ϕmin,k := inf
s∈Sk
fk(s), ϕmax,k := sup
s∈Sk
fk(s).(2.11)
We define the relative spread of fwith power constraint Pas
∆(f∥P) := P·∆
¯(f)
∆(f).(2.12)
26
2.3 Main Result
2.3 Main Result
We are now in a position to state our main theorem on approximation of functions in
Fmon. We use ∥·∥op and ∥·∥Fto denote the operator and Frobenius norm of matrices,
respectively.
Theorem 2. Let f∈ Fmon,M∈N, and the power constraint P∈(0,∞)be given. Let Φ
be an increment majorant of f. Assume the fading and noise are correlated as determined
by matrices Aand B. Let Ai∈R2MK×(2MK+2M)be a matrix in the form of Remark 1
which generates user-independent fading that approximates Ain the sense that
∥(A+Ai)(A−Ai)T∥op ≤η.
Then there exist pre- and post-processing operations such that
P(︁f
¯−f(s1, . . . , sK)≥ε)︁
≤2 exp (︃−MΦ−1(ε)2
16F+D+ 4Φ−1(ε)L)︃+ 2 exp (︃−MΦ−1(ε)2
256F+ 32Φ−1(ε)L)︃,(2.13)
where
L=(︄√︂∆
¯(f)∥A∥op +√︄∆(f)
P∥B∥op)︄2
F=L(︄√︃∆
¯(f)
M∥A∥F+√︄∆(f)
PM∥B∥F)︄2
D=(︃4√2M∆
¯(f)η+ 4 ∆(f)
√PM∥ABT∥F)︃2
.
In the following, we sketch how the pre- and post-processing schemes of Theorem 2
work. For a full formal definition, we refer the reader to the proof in Section 2.5. The
pre- and post-processing schemes are similar (although not identical) to the ones that
have appeared in [GS14,RGS16]. They are based on the same idea of combining random
phase shifts at the transmitter and averaging at the receiver to mitigate the impact of the
unknown fading and noise.
We consider fast fading (which can have any sub-Gaussian distribution) and assume
that CSI is available neither at the transmitter nor at the receiver. One example is i.i.d.
complex standard normal fading, which has a uniformly distributed phase. In this case,
since there is no CSI, the phase of the received signal cannot carry any useful information.
On the other hand, the phase difference between the signals from different transmitters
27
2 Distributed Function Approximation in Fading and Correlated Channels
plays a crucial role, since it determines whether the signals constructively or destructively
overlap at the receiver. We mitigate the impact of a destructive superposition by applying
a random phase shift at the transmitters and averaging the transmission over multiple
channel uses. This has the added benefit of averaging out additive noise.
In summary, the pre-processing is performed by applying the following steps at each
transmitter k(see Section 2.5.1):
•Apply the inner function fkfrom the nomographic representation (2.6).
•Shift and rescale to satisfy the power constraint.
•Apply a random phase shift Uk(m) that is independent for each channel use m.
One option for the random phase shift Uk(m) is to draw it uniformly from the complex
unit circle, but as we argue in the proofs, it is actually sufficient to draw it uniformly from
{−1,1}.
As described above, the phase of the received signal carries no useful information due
to the absence of CSI. Moreover, to compensate for the phase differences between the
transmitters and reduce the influence of additive noise, some form of averaging is required.
We therefore perform the following steps (see Section 2.5.2):
•Compute the total energy of the received signal.
•Subtract the energy of the additive noise over the receive time slot (this is statistical
information and does not require knowledge of the instantaneous noise realizations).
•Invert the rescaling and shift that have been applied during pre-processing.
•Apply the outer function Ffrom the nomographic representation (2.6).
We remark that these pre- and post-processing steps, while specific to the fast fading
scenario, are unmodified compared to the steps used in [2], where we considered only the
case of uncorrelated fading and noise. On the other hand, the error bound of Theorem 2
and the statistical tools used to prove it are different. Moreover, the error bound depends
on the correlation structure of the fading and noise, while the pre- and post-processors do
not need this information.
Remark 2. If no suitable user-independent approximation for Ais available, we can
always choose Ai:= 0, which results in η=∥A∥2
op.
Remark 3. In order to gain more freedom for optimizing the bound for a given cor-
relation structure, it is possible to replace Aiin Theorem 2 with AiAU, where AU∈
R(2MK+2M)×(2MK+2M)is a unitary matrix. This requires only minor adaptations in the
proof of the theorem.
28
2.3 Main Result
Corollary 1. For the approximation communication cost, we have
M(f, ε, δ)≤log 4 −log δ
Φ−1(ε)2Γ,(2.14)
where
Γ := max (︁16F+D+ 4Φ−1(ε)L,256F+ 32Φ−1(ε)L)︁.
Proof. We upper bound (2.13) as
P(|f
¯−f(sK)| ≥ ε)≤4 exp (︃−MΦ−1(ε)2
Γ)︃,
and solve the expression for Mconcluding the proof.
Remark 4. If Fis B-Lipschitz continuous, we can replace Φ−1(ε)in (2.14) and the
expression for Γwith ε/B.
2.3.1 Special Cases of Theorem 2
In this subsection we discuss the bound of Theorem 2 and illustrate it with two examples.
In order to be a useful bound, (2.13) should approach 0 as M→ ∞. Clearly, it does so
exponentially whenever L,Fand Dare bounded for M→ ∞.
For D, we observe that ABTis 0 if the fading is independent of the additive noise (which
we usually expect to be the case in practically relevant scenarios) and that ηis 0 in the
case of user-independent fading in the sense of Definition 8, which means that the fading
of any one user is independent of the fading of the other users (arbitrary correlations in
the time domain are still allowed). Since Dgrows proportionally with Mη2, we can see
that our bound is not useful for the case of strong correlations between users. Therefore,
in the presence of user-correlated fading, the usefulness of the bound depends on the
scaling behavior of Mη2. When this term exhibits sublinear growth, the error bound of
Theorem 2 approaches 0 as M→0, and if it is additionally upper bounded, the error
bound does so exponentially. In this sense, the bound is robust to small deviations from
the assumption of user-independence. However, it is important to note that even the
user-independent case covers relevant cases of correlation in the time domain such as the
block fading channel. In the expression of F, we can see that the Frobenius norm of both
Aand Bshould not grow faster than √Mand finally, in the expression of L, we see that
the operator norms of Aand Bshould not grow with M. We illustrate that this is the
case in scenarios of interest with the following two examples.
29
2 Distributed Function Approximation in Fading and Correlated Channels
Corollary 2. In the setting of Theorem 2 with uncorrelated fading and noise, i.e.,
A:= (︂σFid2MK 0)︂,B:= (︂0σNid2M)︂,(2.15)
where idndenotes the n×nidentity matrix, we have
P(︁f
¯−f(s1, . . . , sK)≥ε)︁
≤2 exp (︃−MΦ−1(ε)2
16F′+ 4Φ−1(ε)L′)︃+ 2 exp (︃−MΦ−1(ε)2
256F′+ 32Φ−1(ε)L′)︃,(2.16)
where
L′=(︄√︂∆
¯(f)σF+√︄∆(f)
PσN)︄2
F′=L′(︄√︂2K∆
¯(f)σF+√︄2∆(f)
PσN)︄2
.
Proof. Note that ABT= 0, ∥A∥op =σF,∥B∥op =σN,∥A∥F=√2MKσFand ∥B∥F=
√2MσN; pick Ai:= Aand substitute this into (2.13).
Corollary 3. In the setting of Theorem 2 where each user has a block fading channel with
block length β, i.e.,
A:= σF
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
id2K
.
.
.
id2K0
id2K
.
.
.
id2K0
...
0id2K
.
.
.
id2K
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
⎫
⎪
⎪
⎬
⎪
⎪
⎭
β
⎫
⎪
⎪
⎬
⎪
⎪
⎭
β
⎫
⎪
⎪
⎬
⎪
⎪
⎭
β
B:= (︂0σNid2M,)︂
30
2.3 Main Result
we have a bound of the form (2.16), where
L′=(︄√︂∆
¯(f)βσF+√︄∆(f)
PσN)︄2
F′=L′(︄√︂2K∆
¯(f)σF+√︄2∆(f)
PσN)︄2
.
Proof. Note that ABT= 0, ∥A∥op =σF√β,∥B∥op =σN,∥A∥F=√2MKσFand ∥B∥F=
√2MσN; pick Ai:= Aand substitute this into (2.13).
2.3.2 Sharpness of the Bound in Theorem 2
We do not expect the bound (2.13) to be sharp in the sense that there are non-trivial
examples in which it holds with equality. This, we believe, is in part a price that we pay
for using a very general system model, but it is also due to the underlying tools from
high-dimensional statistics that we employ. A further sharpening of this bound could be
an interesting question for future research, but it would hinge on optimizing the bounds
of some of the basic results that we use (such as Lemma 2). In some special cases (such as
uncorrelated Gaussian noise and fading) it is not hard, however, to compute exact bounds,
as can for instance be seen in (2.18) below. In the sequel, we argue that in a sense that
will be made precise, the bound (2.13) is sharp “up to absolute constants”. The example
case for which we show that the bound holds with equality up to constants is uncorrelated
Gaussian fading and noise so that the bound specializes to (2.16). For the purpose of this
section, we focus on the behavior with varying Mand ε, while we consider everything else
constant system parameters.
Theorem 3. In the case of uncorrelated Gaussian fading and noise; i.e., (2.15) is satisfied
and the entries of Gare i.i.d. standard Gaussian, there are constants cand Csuch that
the estimate f
¯obtained by the pre- and post-processing schemes described in Sections 2.5.1
and 2.5.2 satisfies
P(︁f
¯−f(s1, . . . , sK)≥ε)︁≥cexp (︁−CM min(Φ−1(ε),Φ−1(ε)2))︁(2.17)
for suitable choices of Fand Φsuch as F=Φ=id (the identity function).
Note that the upper tail bound (2.16) also has the same form for suitably chosen cand
C, so that we can conclude that it is sharp up to the values of these constants.
Proof. The proof is relatively straightforward, so we only sketch it. Under the assumptions
31
2 Distributed Function Approximation in Fading and Correlated Channels
made in Theorem 3, we readily compute from the equations in Sections 2.5.1 and 2.5.2
∥Y∥2
2
σ2
F∑︁K
k=1 ak+σ2
N
=∥G∥2
2.
Since the entries of Gare i.i.d. standard Gaussian and the vector has 2Mentries, ∥G∥2
2
clearly follows a chi-square distribution with 2Mdegrees of freedom. We therefore have
in parallel to (2.24)
P(︁f
¯−f(s1, . . . , sK)≥ε)︁≤P(︄∥G∥2
2−E∥G∥2
2≥2PMΦ−1(ε)
∆(f)(σ2
F∑︁K
k=1 ak+σ2
N))︄.(2.18)
The bound here is sharp in the sense that it holds with equality in case F= Φ = id. We
can now use [ZZ20, Corollary 3] to conclude that in this case, (2.17) holds for suitable c
and C.
2.4 Numerical Results
We have simulated the DFA scheme for Rayleigh fading channels with varying noise power,
number of users and amount of channel resources. The simulations were done for two
different functions, with the function arguments in both cases confined to the unit interval
[0,1], to highlight different aspects and properties of the scheme: The arithmetic mean
function is linear and maps only to the interval [0,1] (which means that no scheme can
have an error larger than 1), while the Euclidean norm function maps to [0,√K] and can
show how the DFA scheme deals with nonlinearities.
We compare with a simple Time Division Multiple Access (TDMA) scheme, in which
each user transmits separately in its designated slot, protecting the analog transmission
against channel noise in the same fashion as the DFA scheme, but not sharing the channel
use with other transmitters. In the case where the number of channel uses available is
much larger than the number of users sharing the resources, this form of a TDMA scheme
is of course highly suboptimal, as the transmitters could use source and channel coding to
achieve a higher reliability. However, such an approach is infeasible if the number of users
is so high in comparison to the number of channel uses that only a few or possibly even less
than one channel use is available to each user, and in this work we are mainly interested in
the scaling behavior of our schemes in the number of users K. Therefore, this comparison
provides an insight into the gain achieved by exploiting the superposition properties of
the wireless channel while keeping in mind that for the regime of low K, there are better
coded schemes available. We also remark that the DFA scheme only needs coordination
32
2.4 Numerical Results
mean norm DFA TDMA
−20 −10 0 10 20 30 40 50
10−5
10−3
10−1
101
noise power in dB
mean squared error
K M
40 103
2560 105
Figure 2.2: Mean squared error of the approximation schemes dependent on the channel
noise power.
between the transmitters insofar as all users need to transmit roughly at the same time,
while a TDMA scheme necessitates an allocation of the channel uses to the individual
transmitters, which can be costly in the case of high K. The simulations carried out in
this section do not consider this scheduling problem and assume for the TDMA scheme
that the time slots have already been allocated, and this knowledge is available at both the
transmitters and the receiver. If M < K, there is not at least one channel use available
to each user and the TDMA scheme can therefore not be carried out. We set the error in
such cases to the maximum of 1 or √K, respectively.
For the simulations, we assume a normalized peak transmitter power constraint of 1 and
channels with fading normalized to a variance of 1 per complex dimension. The power
of the additive noise is given in dB per complex dimension and its negative can therefore
be considered as the signal-to-noise ratio (SNR). Each plotted data point is based on an
average of 1000 simulation runs.
The messages transmitted by the users are generated in the following way: First, we
draw a value µ, which is common to all transmitters, uniformly at random from [0,1].
We then draw the messages of all the users from a convex combination of the uniform
distributions on [0, µ] and [µ, 1] where we choose the weights in such a way that each
33
2 Distributed Function Approximation in Fading and Correlated Channels
mean norm DFA TDMA
101102103
10−5
10−3
10−1
101
users
mean squared error
noise M
−20 dB 103
20 dB 105
Figure 2.3: Mean squared error of the approximation schemes dependent on the number
of participating transmitters.
mean norm DFA TDMA
0.2 0.4 0.6 0.8 1
·105
10−5
10−3
10−1
101
channel uses
mean squared error
noise/dB K
−20 40
30 2560
Figure 2.4: Mean squared error of the approximation schemes dependent on the number
of channel uses.
34
2.5 Proof of Theorem 2
message has expectation µ. The reason for choosing this procedure although the DFA
scheme also performs well for more natural distributions such as i.i.d. uniform in [0,1] for
all users is that in case of messages distributed according to a known i.i.d. distribution, the
problem is too easy in the sense that both the mean and the Euclidean norm concentrate
around values that depend only on the distribution and K, and therefore even without
any communication at all, the function value can be quite accurately guessed if Kis large.
On the other hand, we intend the DFA scheme for applications in which the messages can
be correlated and distributed according to unknown distributions, so we opt for this form
of correlation between the messages for the sake of the numerical evaluation.
In Fig. 2.2, we can see that the DFA scheme is at least as good as the TDMA schemes for
all the plotted data points and outperforms it in most cases, achieving a gain of up to 30 dB
for K= 2560. For low powers of the additive noise, the effect of the multiplicative fading
dominates, and therefore, the error saturates as the additive noise grows weaker. Fig. 2.3
illustrates that the DFA scheme performs significantly better if the number of users is not
too low, which is due to the superposition of the signals in the wireless channel resulting
in a combined signal strength that grows with the number of users. We can also see the
TDMA scheme performing similarly to the DFA scheme for low numbers of users, while
quickly deteriorating in performance or even becoming infeasible as their number grows.
In Fig. 2.4, we can observe the exponential decay of the error as the amount of channel
resources used increases. Once again, we can observe that the TDMA scheme performs
similarly to DFA for a low number of users, but becomes infeasible for larger K.
2.5 Proof of Theorem 2
2.5.1 Pre-Processing
In the pre-processing step we encode the function values fk(sk), k= 1, . . . , K as transmit
power:
Tk(m) := √akUk(m),1≤m≤M
with ak=gk(fk(sk)), where gk: [ϕmin,k, ϕmax,k]→[0,P] such that
gk(a) := P
∆(f)(a−ϕmin,k),(2.19)
where ∆(f) is given in (2.10) and ϕmin,k is defined in (2.11).
Uk(m), k= 1, . . . , K,m= 1, . . . , M are i.i.d. with the uniform distribution on {−1,+1}.
We assume the random variables Uk(m), k= 1, . . . , K,m= 1 ...,M, are independent of
Hk(m), k= 1, . . . , K,m= 1, . . . , M, and N(m), m= 1, . . . , M.
35
2 Distributed Function Approximation in Fading and Correlated Channels
We write the vector of transmitted signals at channel use mas
T(m) := (T1(m), . . . , TK(m))
and combine them in a matrix
Q:=
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
T(1) 0 0 0 0 . . . 0
0T(1) 0 0 0 . . . 0
0 0 T(2) 0 0 . . . 0
0 0 0 T(2) 0 . . . 0
...
0 0 0 . . . 0T(M) 0
0 0 0 . . . 0 0 T(M)
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
.
2.5.2 Post-Processing
The vector Yof received signals across the Mchannel uses can be written as Y=Q·H+N,
where Hand Nare given in (2.3) and (2.4). The post-processing is based on receive energy
which has the form
∥Y∥2
2=YTY= (QAG +BG)T(QAG +BG) = GTWG,
where we use
W:= (QA+B)T(QA+B) = ATQTQA+ATQTB+BTQA+BTB.(2.20)
Equivalently, we can phrase this as
∥Y∥2
2=
K
∑︂
k=1
ak∥Hk∥2
2+N
¯sK,(2.21)
where Hk= (Hk(1), . . . , Hk(M)) is a vector consisting of complex fading coefficients,
and N
¯sK=∑︁M
m=1 N
¯sK(m) where the overbar notation is used to denote the complex
36
2.5 Proof of Theorem 2
conjugate. The random variables N
¯sK(m), m= 1, . . . , M, are given by
N
¯sK(m) :=
K
∑︂
k,k′=1
k=l
√akak′Hk(m)Hk′(m)Uk(m)Uk′(m)
+ 2 (︄N(m)
K
∑︂
k=1
√akHk(m)Uk(m))︄Re
+|N(m)|2.(2.22)
The receiver computes its estimate f
¯of f(s1, . . . , sK) as
f
¯:= F(g
¯(∥Y∥2
2−E∥N∥2
2)),
where
g
¯(a) := ∆(f)
2·M·Pa+
K
∑︂
k=1
ϕmin,k.
2.5.3 The Error Event
Clearly, EN
¯sK(m) = E|N(m)|2(since all the other summands in (2.22) are centered). We
can therefore conclude
E(︁g
¯(︁∥Y∥2
2−E∥N∥2
2)︁)︁=g
¯(︁E∥Y∥2
2−E∥N∥2
2)︁=
K
∑︂
k=1
fk(sk).
We use this to argue
f
¯−f(s1, . . . , sK)=
F(︁g
¯(︁∥Y∥2
2−E∥N∥2
2)︁)︁−F(︄K
∑︂
k=1
fk(sk))︄
≤Φ(︄
g
¯(︁∥Y∥2
2−E∥N∥2
2)︁−
K
∑︂
k=1
fk(sk))︄
= Φ (︁g
¯(︁∥Y∥2
2−E∥N∥2
2)︁−g
¯(︁E∥Y∥2
2−E∥N∥2
2)︁)︁
= Φ (︃∆(f)
2MP∥Y∥2
2−E∥Y∥2
2)︃(2.23)
and therefore
P(︁f
¯−f(s1, . . . , sK)≥ε)︁≤P(︃∥Y∥2
2−E∥Y∥2
2≥2MP
∆(f)Φ−1(ε))︃.(2.24)
37
2 Distributed Function Approximation in Fading and Correlated Channels
2.5.4 Performance Bounds
Our objective is now to establish the concentration of ∥Y∥2
2around its expectation and
thus obtain an upper bound for the right hand side of (2.24). To this end, we first need
to establish a series of lemmas that we will use as tools.
We will split the deviation from the mean into a diagonal and an off-diagonal part. The
first lemma will later help us bound the diagonal part of the error.
Lemma 1. Let V1,...,Vnbe independent random variables and centered with sub-Gaussian
norm at most 1. Let A1,...,Anbe random variables independent of V1,...,Vnbut not
necessarily of each other, and assume that for all k,|Ak| ≤ L
˜and ∑︁n
k=1 A2
k≤F
˜almost
surely. Then we have for any c∈(0,1) and any λ∈(−c/(2L
˜), c/(2L
˜)),
Eexp (︄λ
n
∑︂
k=1 (︁AkV2
k−E(AkV2
k))︁)︄≤exp (︄λ2
2·8F
˜
1−c)︄Eexp (︄λ
n
∑︂
k=1 (︁Ak−E(Ak))︁)︄.
The proof of this lemma relies on some other technical results and is therefore relegated
to Section 2.7.
The next lemma is a slight variation of the Hanson-Wright inequality as phrased in [Ver18,
Theorem 6.2.1] and will help us bound the off-diagonal part of the error.
Lemma 2. Let Vbe an Rn-valued random variable with independent, centered entries
and assume that for all k∈ {1, . . . , n}, the k-th entry of Vsatisfies τ(Vk)≤τmax. Let
A ∈ Rn×nwith zeros on the diagonal and ε > 0. Suppose further that ∥A∥op ≤Aop and
∥A∥F≤AF. Then E(︁VTAV)︁= 0 and
P(︁VTAV≥ε)︁≤2 exp (︃−ε2
16τ2
maxεAop + 256τ4
maxA2
F)︃.(2.25)
This lemma differs from [Ver18, Theorem 6.2.1] mainly in that we require the diagonal
entries of Ato be 0 and that all the constants are explicit. The proof follows [Ver18] closely
and is given in Section 2.7. We remark that it is not hard to follow the proof in [Ver18]
further and expand the result to matrices with non-zero diagonal elements, however, this
is not relevant for the present work.
Mainly because the matrix Wcontains randomness, we need a slight modification of
this lemma as well as two more lemmas exploring some specific properties of W.
Corollary 4. Assume a setting as in Lemma 2, but let Abe an Rn×n-valued random
variable independent of Vsuch that almost surely, the diagonal entries of Aare 0,∥A∥op ≤
Aop and ∥A∥F≤AF. Then E(︁VTAV)︁= 0 and (2.25), considering joint expectation,
respectively probability of Vand A, still hold.
38
2.5 Proof of Theorem 2
Proof. E(︁VTAV)︁= 0 as well as (2.25) hold conditional on any realization of A(except
possibly in a null set) and therefore, the Corollary follows by the laws of total expectation
and total probability.
Lemma 3. We have almost surely
∥W∥F≤(︂√︁∆(f||P)∥A∥op +∥B∥op)︂(︂√︁∆(f||P)∥A∥F+∥B∥F)︂,
∥W∥op ≤(︂√︁∆(f||P)∥A∥op +∥B∥op)︂2.
Proof. In order to bound the norm of W, we first note that
QQT=
K
∑︂
k=1
akid2M.(2.26)
Therefore, we can conclude that all singular values of Qare bounded by √︁∆(f||P) and
thus ∥Q∥op ≤√︁∆(f||P).
Noting that ∥AB∥F≤ ∥A∥op∥B∥Ffor all matrices A, B of compatible dimensions and
further noting the submultiplicativity of the operator norm and the triangle inequality of
both norms, we get
∥W∥F≤ ∥QA+B∥op∥QA+B∥F≤(∥Q∥op∥A∥op +∥B∥op) (∥Q∥op∥A∥F+∥B∥F)
≤(︂√︁∆(f||P)∥A∥op +∥B∥op)︂(︂√︁∆(f||P)∥A∥F+∥B∥F)︂,
∥W∥op =∥QA+B∥2
op ≤(∥Q∥op∥A∥op +∥B∥op)2≤(︂√︁∆(f||P)∥A∥op +∥B∥op)︂2
Lemma 4. We have
τ(trW)≤4M∆(f||P)∥(A+Ai)(A−Ai)T∥op + 2√︁2P∥ABT∥F,(2.27)
where the trace tr(·)is the sum of elements on the diagonal of a square matrix.
Proof. With an addition of zero, we can rewrite
tr (︁ATQTQA)︁= tr (︁ATQTQA)︁+ tr (︁ATQTQAi)︁−tr (︂(Ai)TQTQA)︂
−tr (︂(Ai)TQTQAi)︂+ tr (︂(Ai)TQTQAi)︂
= tr (︂(A−Ai)TQTQ(A+Ai))︂+ tr (︂(Ai)TQTQAi)︂
39
2 Distributed Function Approximation in Fading and Correlated Channels
and use this together with (2.20) to conclude
trW= tr (︂(A−Ai)TQTQ(A+Ai))︂+ 2tr (︁BTQA)︁+ tr (︂(Ai)TQTQAi)︂+ tr (︁BTB)︁.
(2.28)
Next, we argue that the last two summands are almost surely constant. For tr(BTB) this
is immediately clear. Moreover, we have tr(︂(Ai)TQTQAi)︂=∥QAi∥2
F.We note that as
per Remark 1 and using corresponding notation, we have
QAi=
⎛
⎜
⎜
⎜
⎜
⎜
⎜
⎜
⎝
T(1)A(1)
i0 0 . . . 0
0T(1)A(2)
i0. . . 0
...
0. . . 0T(M)A(2M−1)
i0
0. . . 0 0 T(M)A(2M)
i
⎞
⎟
⎟
⎟
⎟
⎟
⎟
⎟
⎠
and because each A(m)
ihas only one nonzero entry per column, each entry of QAiis the
product of Uk(m) with a deterministic term for some m, k and therefore, its square can
take only one value almost surely, and consequently, ∥QAi∥2
Falso takes only one value
almost surely.
We can use this in (2.28) and incorporate the triangle inequality to obtain
τ(trW)≤τ(ξ1)+2τ(ξ2),
where
ξ1:= tr (︂(A−Ai)TQTQ(A+Ai))︂
ξ2:= tr (︁BTQA)︁.
To the end of bounding τ(ξ1), we argue
ξ1= tr (︂Q(A+Ai)(A−Ai)TQT)︂≤ ∥(A+Ai)(A−Ai)T∥optr (︁QQT)︁
≤2M∆(f||P)∥(A+Ai)(A−Ai)T∥op.
The first inequality holds because for any square matrix Aand compatible column vector
v, we have
vT(∥A∥opid −A)v=∥v∥2
2(︄∥A∥op −(︃v
∥v∥2)︃T
Av
∥v∥2)︄≥0
40
2.5 Proof of Theorem 2
(see, e.g., [Bha97, Exercise I.2.10]) and therefore ∥A∥opid−Ais positive semidefinite. The
second inequality directly follows from (2.26). It follows, e.g., from [BK00, Example 1.1.2],
that τ(ξ1) is upper bounded by the first summand on the right hand side in (2.27). Note
that in an analogous way, we can derive the same bound for −ξ1.
In order to bound the sub-Gaussian norm of ξ2, we view it as a function of (Uk(m))K,M
k,m=1
and use part of the proof of the Bounded Differences Inequality [BLM13, Theorem 6.2] to
bound the moment generating function. To this end, we define
(Ei,j)i′,j′=⎧
⎨
⎩
1, i′=iand j′=j
0,otherwise.
and note that a change in the value of Uk(m) changes the value of ξ2by
2√aktr (︁BT(E2m−1,K(2m−2)+k+E2m,K(2m−1)+k)A)︁
= 2√aktr (︁ABT(E2m−1,K(2m−2)+k+E2m,K(2m−1)+k))︁
= 2√ak(︁(ABT)K(2m−2)+k,2m−1+ (ABT)K(2m−1)+k,2m)︁
≤2√︁P(︁(ABT)K(2m−2)+k,2m−1+ (ABT)K(2m−1)+k,2m)︁
Following the proof of the Bounded Differences Inequality [BLM13, Theorem 6.2], we can
now conclude
τ(ξ2)2≤1
4
K
∑︂
k=1
M
∑︂
m=1 (︂2√︁P(ABT)(2m−2)K+k,2m−1+ 2√︁P(ABT)(2m−1)K+k,2m)︂2
≤1
4·2·4·P∥ABT∥2
F,
concluding the proof of the lemma.
Proof of Theorem 2. What remains to be established is the concentration of ∥Y∥2
2around
its expectation. To this end, we observe
P(︁∥Y∥2
2−E∥Y∥2
2≥ε)︁=P(︁GTWG −E(GTWG)≥ε)︁≤P(︂|Σ1| ≥ ε
2)︂+P(︂|Σ2| ≥ ε
2)︂
(2.29)
where
Σ1:=
2KM+2M
∑︂
i=1 (︁G2
iWi,i −E(︁G2
iWi,i)︁)︁
41
2 Distributed Function Approximation in Fading and Correlated Channels
Σ2:=
2KM+2M
∑︂
i,j=1
i=j
GiGjWi,j.
We use Lemma 1, Lemma 3 and Lemma 4 to bound the moment generating function of
Σ1as
Eexp(λΣ1)≤exp (︄λ2
2(︄8F
˜1
1−c+F
˜2)︄)︄≤exp (︄λ2
2·8F
˜1+F
˜2
1−c)︄
for any c∈(0,1) and λ∈(−c/(2L
˜)), c/(2L
˜)), where
L
˜:= (︂√︁∆(f||P)∥A∥op +∥B∥op)︂2
F
˜1:= L
˜(︂√︁∆(f||P)∥A∥F+∥B∥F)︂2
F
˜2:= (︂4M∆(f||P)∥(A+Ai)(A−Ai)T∥op + 2√︁2P∥ABT∥F)︂2
By Lemma 9, this yields
P(︂|Σ1| ≥ ε
2)︂≤2 exp (︃−(1 −c)ε2
64F
˜1+ 8F
˜2)︃
in case 0 < ε ≤c
1−c·8F
˜1+F
˜2
L
˜and
P(︂|Σ1| ≥ ε
2)︂≤2 exp (︃−cε
8L
˜)︃
otherwise. Since the first case term is increasing with cand the second case term is
decreasing, the optimal value for cis where the two cases meet, which is at
c=L
˜ε
L
˜ε+ 8F
˜1+F
˜2
.
Substituting this, we get
P(︂|Σ1| ≥ ε
2)︂≤2 exp (︃−ε2
64F
˜1+ 8F
˜2+ 8L
˜ε)︃.
Turning our attention to Σ2, we note that by [BCD89, Theorem 2.1] the operator norm of
the off-diagonal part of Wcan be upper bounded by 2∥W∥op and thus by 2L
˜. Therefore,
42
2.6 Preliminaries on Sub-Gaussian and Sub-Exponential Random Variables
we can directly apply Lemma 2 and get
P(︂|Σ2| ≥ ε
2)︂≤2 exp (︃−ε2
1024F
˜1+ 64L
˜ε)︃.
Substituting these into (2.29) and using (2.24) concludes the proof.
2.6 Preliminaries on Sub-Gaussian and Sub-Exponential Random
Variables
We begin with a definition that is adapted from [Ver18, Definition 3.4.1]. For Rn-valued
random variables V, we define the sub-Gaussian norm as
τ(V) := inf {︂a:∀v∈Sn−1∀λ∈R E exp(λ⟨V, v⟩)≤exp (︃a2λ2
2)︃}︂(2.30)
and we observe that if all entries of Vhave a sub-Gaussian norm bounded by τmax and
are independent, we have for any v∈Sn−1
Eexp (λ⟨V, v⟩) = Eexp (︄λ
n
∑︂
k=1 Vkvk)︄=
n
∏︂
k=1
Eexp (λVkvk)
≤
n
∏︂
k=1
exp (︃τ2
maxv2
kλ2
2)︃= exp (︃τ2
maxλ2
2)︃
and therefore τ(V)≤τmax.
In the following, we recall some basic definitions and results from [BK00, Chapter 1].
For a random variable Xwe define12
θ(X) := sup
k≥1(︃E(|X|k)
k!)︃1
k
(2.31)
If θ(X)<∞then Xis called a sub-exponential random variable. θ(·) defines a semi-
norm on the vector space of sub-exponential random variables [BK00, Remark 1.3.2].
Typical examples of sub-exponential random variables are bounded random variables and
random variables with exponential distribution. We collect some useful properties of
and interrelations between the sub-exponential and sub-Gaussian norms in the following
lemma.
12Note that as with our definition of the sub-Gaussian norm, other norms on the space of sub-exponential
random variables that appear in the literature are equivalent to θ(·) (see, e.g., [BK00]). The particular
definition we choose here matters, however, because we derive results in which no unspecified constants
appear.
43
2 Distributed Function Approximation in Fading and Correlated Channels
Lemma 5. Let X, Y be random variables. Then:
1. If Xfollows the normal distribution with expectation µand variance σ2then we have
τ(X) = σ. (2.32)
2. (Rotation Invariance) If X1, . . . , XMare independent, sub-Gaussian and centered,
we have
τ(︄M
∑︂
m=1
Xm)︄2
≤
M
∑︂
m=1
τ(Xm)2(2.33)
3. If Xis a random variable with |X| ≤ 1with probability 1and if Yis independent of
Xand sub-Gaussian then we have
τ(X·Y)≤τ(Y).(2.34)
4. If Xand Yare sub-Gaussian and centered, then X·Yis sub-exponential and
θ(X·Y)≤2·τ(X)·τ(Y).(2.35)
5. (Centering) If Xis sub-exponential and X≥0almost surely, then
θ(X−E(X)) ≤θ(X).(2.36)
Proof. (2.32) follows in a straightforward fashion by calculating the moment generating
function of X. (2.33) is e.g. proven in [BK00, Lemma 1.1.7]. (2.34) follows directly from
the definition conditioning on X. We show (2.35) first for X=Y. In this case, we have
θ(︁X2)︁= sup
k≥1(︃EX2k
k!)︃1
k
≤sup
k≥1(︄2k+1kkτ(X)2k
ekk!)︄1
k
= 2τ(X)2sup
k≥1(︄21
kk
e(k!)1
k)︄≤2τ(X)2,
where the first inequality is by [BK00, Lemma 1.1.4] and the second follows from 2kk/k!≤
ek, which is straightforward to prove for k≥1 by induction. In the general case, we have
θ(XY ) = τ(X)τ(Y)θ(︃XY
τ(X)τ(Y))︃
≤τ(X)τ(Y)θ(︄1
2(︃X
τ(X))︃2
+1
2(︃Y
τ(Y))︃2)︄
≤2τ(X)τ(Y),
44
2.7 Proof of Lemmas 1 and 2
where the first inequality can be verified in (2.31), considering that ab ≤a2/2 + b2/2 for
all a, b ∈R, and the second inequality follows from the triangle inequality and the special
case X=Y.
For (2.36), we assume without loss of generality EX= 1 (otherwise we can scale X),
and note that for all a∈[0,∞) and k≥1, ak−|a−1|k> a−1 and thus E(Xk−|X−1|k)≥
E(X−1) = 0.
2.7 Proof of Lemmas 1 and 2
The proofs closely follow the proof of the Hanson-Wright inequality in [Ver18, Theorem
6.2.1]. We carry out the changes that are necessary to arrive at explicit constants. To this
end, we begin with some slightly modified versions of lemmas used as ingredients in the
proof of Bernstein’s inequality in [BK00, Theorem 1.5.2].
Lemma 6. Let Xbe a random variable with E(X)=0and θ(X)<+∞. For any λ∈R
with |λθ (X)|<1we have
E(exp(λX)) ≤1 + |λ|2θ(X)2·1
1−|λθ (X)|.
Proof. Let λ∈Rsatisfy |λθ (X)|<1. Then
E(exp(λX)) = 1 + ∞
∑︂
k=2
λkE(Xk)
k!
≤1 + ∞
∑︂
k=2
|λ|kE(|X|k)
k!≤1 + ∞
∑︂
k=2 |λ|kθ(X)k
= 1 + |λ|2θ(X)2(︄∞
∑︂
k=0 |λθ (X)|k)︄
= 1 + |λ|2θ(X)2·1
1−|λθ (X)|,(2.37)
where in the last line we have used |λθ (X)|<1.
In the next lemma we derive an exponential bound depending on θ(X) on the moment
generating function of the random variable X.
Lemma 7. Let Xbe a random variable with E(X) = 0 and θ(X)<+∞. For any
c∈(0,1) and λ∈(︂−c
θ(X),c
θ(X))︂we have
E(exp(λX)) ≤exp(︂λ2
2
2·θ(X)2
1−c)︂.
45
2 Distributed Function Approximation in Fading and Correlated Channels
Proof. For λ∈(︂−c
θ(X),c
θ(X))︂we have
|λθ (X)|<c<1,(2.38)
therefore by Lemma 6
E(exp(λX)) ≤1 + |λ|2θ(X)2·1
1−|λθ (X)|
≤1 + |λ|2θ(X)2·1
1−c
≤exp(︂λ2
2
2·θ(X)2
1−c)︂,
where in the second line we have used the first inequality in (2.38) and the last line is by
the numerical inequality 1 + a≤exp(a) valid for a≥0.
We are now ready to prove Lemma 1.
Proof of Lemma 1. The lemma follows by a straightforward calculation
Eexp (︄λ
n
∑︂
k=1 (︁AkV2
k−E(AkV2
k))︁)︄
=E(︄exp (︄λ
n
∑︂
k=1 (︁Ak(V2
k−E(V2
k)))︁)︄·exp (︄λ
n
∑︂
k=1 (︁E(V2
k)(Ak−E(Ak))︁)︄)︄ (2.39)
=EA(︄exp (︄λ
n
∑︂
k=1 (︁E(V2
k)(Ak−E(Ak))︁)︄·
n
∏︂
k=1
EVexp (︁(λAk)(︁V2
k−E(V2
k))︁)︁)︄(2.40)
≤EA(︄exp (︄λ
n
∑︂
k=1 (︁E(V2
k)(Ak−E(Ak))︁)︄·
n
∏︂
k=1
exp (︃λ2
2·8A2
k
1−c)︃)︄(2.41)
≤exp (︄λ2
2·8F
˜
1−c)︄E(︄exp (︄λ
n
∑︂
k=1
(Ak−EAk))︄)︄,(2.42)
where (2.40) follows by the independence assumptions, (2.41) is an application of Lemma 7
and (2.42) holds because ∑︁n
k=1 A2
k≤F
˜almost surely. For this last step, also note that the
variance of a sub-Gaussian random variable is upper bounded by the squared sub-Gaussian
norm [BK00, Lemma 1.1.2].
Lemma 8. Let X1, . . . , Xnbe independent random variables with E(Xi)=0and θ(Xi)<
+∞,i= 1, . . . , n. Let L:= max1≤i≤nθ(Xi),c∈(0,1), and λ∈(︁−c
L,c
L)︁. Then for
46
2.7 Proof of Lemmas 1 and 2
ΣM:= ∑︁n
i=1 Xiwe have
E(exp(λΣM)) ≤exp(︂λ2
2
2·∑︁n
i=1 θ(Xi)2
1−c)︂.(2.43)
Proof. By independence of X1, . . . , Xn, we have
E(exp(λΣn)) =
M
∏︂
i=1
E(exp(λXi)).
Combining this with Lemma 7 proves the lemma.
The next lemma establishes the basic tail bound for random variables satisfying inequal-
ities of type (2.43). The proof can be found in [BK00, Lemma 1.4.1].
Lemma 9. Let Xbe a random variable with E(X) = 0. If there exist τ≥0and Λ>0
such that
E(exp(λX)) ≤exp(︂λ2
2τ2)︂,
holds for all λ∈(−Λ,Λ), then for any t≥0, we have
P(|X| ≥ t)≤⎧
⎨
⎩
2 exp (︂−t2
2τ2)︂,0< t ≤Λτ2
2 exp (︁−Λt
2)︁,Λτ2≤t.
The following lemma is a slightly modified version of [Ver18, Lemma 6.2.3]. N(µ, Σ)
denotes the multivariate normal distribution with mean µand covariance matrix Σ.
Lemma 10 (Comparison Lemma).Let Vand V′be independent, Rn-valued, centered and
sub-Gaussian random variables, and let G, G′be independent and distributed according to
N(0,idn). Let A ∈ Rn×nand λ∈R. Then
Eexp(λVTAV′)≤Eexp(λτ (V)τ(︁V′)︁GTAG′.)
Proof. We first observe that for any v∈Rn,
E(exp(λ⟨V, v⟩)) = E(︃exp (︃λ∥v∥2⟨︃V,v
∥v∥2⟩︃)︃)︃
≤exp (︄λ2∥v∥2
2τ(V)2
2)︄
=E(︂exp(︁λτ (V)⟨G, v⟩)︁)︂,(2.44)
47
2 Distributed Function Approximation in Fading and Correlated Channels
where the inequality in (2.44) is by the definition of vector-valued sub-Gaussian random
variables and the equality is obtained by calculating the moment-generation function of
⟨G, v⟩. We can now conclude the proof from the following:
E(︂exp (︁λVTAV′)︁)︂=EV′(︃EV(︂exp (︁λ⟨︁V,AV′⟩︁)︁)︂)︃(2.45)
≤EV′(︃EG(︂exp (︁λτ (V)⟨︁G, AV′⟩︁)︁)︂)︃(2.46)
=EG(︃EV′(︂exp (︁λτ (V)⟨︁V′,ATG⟩︁)︁)︂)︃(2.47)
≤EG(︃EG′(︂exp (︁λτ (V)τ(︁V′)︁⟨︁G′,ATG⟩︁)︁)︂)︃(2.48)
=E(︂exp (︁λτ (V)τ(︁V′)︁GTAG′)︁)︂,(2.49)
where (2.45), (2.47) and (2.49) are due to Fubini’s theorem and elementary transformations
and (2.46) and (2.48) are both instances of the observation (2.44).
Proof of Lemma 2. We can write
VTAV =
n
∑︂
k,k′=1,k=k′VkAk,k′Vk′,(2.50)
and since Vis centered, E(︁VTAV)︁= 0 immediately follows. Let V′be an independent
copy of V, and let Gand G′be independently distributed according to N(0,idn). We
denote the singular values of Awith s1, . . . , sn. With these definitions, we bound the
moment-generating function of VTAV as
Eexp (︁λVTAV)︁=Eexp (︁λVTAV)︁(2.51)
≤Eexp (︁4λVTAV′)︁(2.52)
≤Eexp (︂4λτ (V)2GTAG′)︂(2.53)
=Eexp (︄4λτ (V)2
n
∑︂
k=1
G
ˆkG
ˆ′
ksk)︄(2.54)
≤exp (︄λ2
2·128τ(V)4∑︁n
k=1 s2
k
1−c)︄,(2.55)
where (2.52) is due to the Decoupling Theorem [Ver18, Theorem 6.1.1], (2.53) is an ap-
plication of Lemma 10, (2.54) holds for suitably transformed versions G
ˆ, G
ˆ′of G, G′(note
that they are still independent and follow the same distribution) and (2.55) is true if
48
2.7 Proof of Lemmas 1 and 2
c∈(0,1) and |λ|< c/(8τ(V)2max1≤k≤nsk) according to Lemma 8. So we can apply
Lemma 9 to obtain
P(︁VTAV≥ε)︁≤2 exp (︄−ε2(1 −c)
256τ(V)4∑︁n
k=1 s2
k)︄(2.56)
in case ε≤c
1−c·16τ(V)2∑︁n
k=1 s2
k
max1≤k≤nskand
P(︁VTAV≥ε)︁≤2 exp (︄−c·ε
16τ(V)2max1≤k≤nsk)︄
otherwise. We next choose cso as to minimize the upper bound on the tail probability.
Because the bound in the first case is increasing with cwhile it is decreasing in the second
case, the optimal choice for cis where the two cases meet. We can therefore calculate the
optimal cas
c=εmax1≤k≤nsk
εmax1≤k≤nsk+ 16τ(V)2∑︁n
k=1 s2
k
and substituting this in (2.56), we obtain
P(︁VTAV≥ε)︁≤2 exp ⎛
⎝−ε2
16ετ (V)2max
1≤k≤nsk+ 256τ(V)4∑︁n
k=1 s2
k⎞
⎠.
The bounds τ(V)≤τmax,|sk| ≤ ∥A∥op, and identity ∥A∥2
F=∑︁n
k=1 s2allow us to conclude
the proof of the lemma.
49
3 Applications of Distributed Function
Approximation to Vertical Federated
Learning
In this chapter, we give examples of how the results of Chapter 2 can be applied to ML
problems. We focus on a case that is particularly simple on the communication side (since
only one weighted sum is OTA computed) but which we expect to gain significant relevance
in practical systems. The application examples show how our DFA scheme can be leveraged
to vastly increase the efficiency of the prediction phase of VFL both in terms of time and
bandwidth resources (i.e., in our model, channel uses) and in terms of energy resources
expended. For the training phase, we will either assume centralized offline training or use
more communication-efficient decentralized methods that do not, however, leverage any
form of OTA computation. Developing distributed training algorithms for VFL which
can leverage the full power of OTA computation remains open as an interesting future
research direction. However, we argue that in many cases of interest the communication
cost incurred in the prediction phase can dominate that incurred in training and it is
therefore worthwhile to focus on the prediction. This can, e.g., be the case when the
training can be conducted offline and the models do not or only infrequently have to be
re-trained; or when the number of training samples is small, but the number of features
observed in the system is large and prediction tasks have to be carried out very frequently.
Contrary to [YLCT19], where the main focus in VFL is on providing privacy and se-
curity guarantees, in this work we focus on the communication efficiency of such schemes
under the use of OTA computation. Since the OTA-computed predictors as well as the
distributed training procedures we describe do not aggregate the observed features in a
central point, it is reasonable to expect that these methods have good inherent privacy
properties, and for some of the envisioned applications, such as, e.g., e-health, it is an im-
portant open question for future research how these privacy guarantees can be formalized
and perhaps strengthened in the context of OTA-computed ML predictors. In Chapter 4,
we address some of these concerns by providing formal security guarantees in the presence
of eavesdroppers.
51
3 Applications of Distributed Function Approximation to Vertical Federated Learning
In the following, we expand upon the idea of OTA VFL by showing in Section 3.1 how
the SVM approach can be generalized to the use of additive kernel SVMs and applied
to regression and classification problems, and in Section 3.2 how classifiers that do not
necessarily have to rely on SVMs at all can be constructed to solve binary classification
problems. In Section 3.3, we present simulation results of two classification schemes con-
structed as described in Section 3.2 and compare them to two baseline approaches.
Both in Section 3.1 and in Section 3.2, we construct an ML regressor or predictor that
has the form of a weighted sum, because such a function can be straightforwardly computed
OTA using the DFA scheme described in Section 2.5. If the loss is Lipschitz-continuous,
it can play the role of the function Fin Definition 10 so that Theorem 2 provides a tail
bound on the additional ML loss that the OTA-computed classifiers can incur in addition
to the loss they would have in case of noiseless communication. The detailed technical
statements and proofs of these facts can be found in Corollaries 5 and 6. We also give
some examples of applicable Lipschitz-continuous losses for regression and classification
that are commonly used in practice in Subsection 3.1.
3.1 Support Vector Machines with Additive Kernels for
Regression and Classification
In this section, we give an example of additive, and therefore OTA computable, SVM
regressors, focusing on Lipschitz-continuous losses. We say that the loss Lis B-Lipschitz-
continuous if L(x, y, ·) is Lipschitz-continuous for all x∈Xand y∈Ywith a Lipschitz
constant uniformly bounded by B. Lipschitz-continuity of a loss function is a property
that is also often needed in other contexts. Fortunately, many loss functions of practical
interest possess this property. For instance, the absolute distance loss, the logistic loss,
the Huber loss and the ε-insensitive loss, all of which are commonly used in regression
problems [SC08, Section 2.4], are Lipschitz-continuous. Even in scenarios in which the
naturally arising loss is not Lipschitz-continuous, for the purpose of designing the ML
model, it is often replaced with a Lipschitz-continuous alternative. For instance, in binary
classification, we have Y={−1,1}and the loss function is given by
(x, y, t)↦→ ⎧
⎨
⎩
0,sign(y) = sign(t)
1,otherwise.
This loss is not even continuous, which makes it hard to deal with. So for the purpose
of designing the ML model, it is commonly replaced with the Lipschitz-continuous hinge
loss or logistic loss [SC08, Section 2.3].
52
3.1 Support Vector Machines with Additive Kernels for Regression and Classification
Here, we consider the case in which the features are K-tuples and the SVM can be
trained in a centralized fashion. The actual predictions, however, are performed in a
distributed setting; i.e., there are Kusers each of which observes only one component of
the features. The objective is to make an estimate of the label available at the receiver
while using as little communication resources as possible.
To this end, we consider the case of additive models which is described in [CH12,
Section 3.1]. We have X=X1× ··· × XKand a kernel κk:Xk×Xk→Rwith an
associated reproducing kernel Hilbert space Hkof functions mapping from Xkto Rfor
each k∈ {1, . . . , K}. Then by [CH12, Theorem 2]
κ:X×X→R,((x1, . . . , xK),(x′
1, . . . , x′
K)) ↦→ κ1(x1, x′
1) + ···+κK(xK, x′
K) (3.1)
is a kernel and the associated reproducing kernel Hilbert space is
H:= {f1+···+fK:f1∈H1, . . . , fK∈HK}.(3.2)
So this model is appropriate whenever the function to be approximated is expected to
have an additive structure. We know [SC08, Theorem 5.5] that an SVM estimator has the
form
f(x) =
n
∑︂
j=1
αjκ(x, xj),(3.3)
where α1, . . . , αn∈Rand x1,...xn∈X. In our additive model, this is
f(x1, . . . , xk) =
K
∑︂
k=1
fk(xk),(3.4)
where for each k,
fk(xk) =
n
∑︂
j=1
αjκk(xk, xj
k).(3.5)
We now state a result for the distributed approximation of the estimator of such an
additive model as an immediate consequence of Theorem 2.
Corollary 5. Consider an additive ML model, i.e., we have an estimator of the form
(3.4), and assume that Lis a B-Lipschitz-continuous loss. Suppose further that all the
fKhave bounded range such that the quantities ∆
¯(f)and ∆(f)as defined in (2.9) and
(2.10) exist and are finite. Let ε, δ > 0and M≥M(f, ε, δ)as defined in (2.14), where
53
3 Applications of Distributed Function Approximation to Vertical Federated Learning
Φ := id and thus Φ−1(ε) = ε. Then, given xK= (x1, . . . , xK)drawn from an arbitrary
distribution13,14 at the transmitters and any y∈Y, through Muses of the channel (2.2),
the receiver can obtain an estimate f
¯of f(xK)satisfying
P(L(xK, y, f
¯)−L(xK, y, f(xK))≥Bε)≤δ. (3.6)
Proof. The Lipschitz continuity of Lyields
P(L(xK, y, f
¯)−L(xK, y, f(xK))≥Bε)≤P(f
¯−f(xK)≥ε),
from which (3.6) follows by the definition of M(f, ε, δ).
While [CH12] provides some examples of applications of SVMs with additive kernels to
regression problems, the example for anomaly detection described in [RGS16] can be re-
covered as a special case of the framework described in this subsection, where the employed
SVMs have linear kernels.
We conclude this subsection with a brief discussion of the feasibility of the condition
that f1, . . . , fKhave bounded ranges in the case of the additive SVM model discussed
above. The coefficients α1, . . . , αnare a result of the training step and can therefore be
considered constant, so all we need is that the ranges of κ1, . . . , κKare bounded. This
heavily depends on X1,...,XKand the choices of the kernels, but we remark that the
boundedness criterion is satisfied in many cases of interest. The range of Gaussian kernels
is always a subset of (0,1], and while other frequent choices such as exponential, polynomial
and linear kernels can have arbitrarily large ranges, they are nonetheless continuous which
means that as long as the input alphabets are compact topological spaces (e.g., closed
hyperrectangles or balls), the ranges are also compact, and therefore bounded.
3.2 Model-Agnostic Approach to Over-the-Air-Computed
Classifiers
In this subsection, we focus on classification problems. The general approach is model-
agnostic in the sense that arbitrary and even different ML models can be used in the
distributed agents, but we have decentralized classifiers with a low computational bur-
den in mind, as is exemplified in the numerical simulations discussed in the following
subsection.
13Arbitrary distribution means in particular that the components can be arbitrarily correlated.
14Note that Theorem 2 actually provides for a stronger result, since it allows arbitrary deterministic
values, which implies the applicability to arbitrarily distributed random variables through the law of
total probability.
54
3.2 Model-Agnostic Approach to Over-the-Air-Computed Classifiers
We consider a feature alphabet X=X1×···×XKand a label alphabet Y={−1,1}as
well as an unknown, but fixed probability distribution Pon X×Y. In the training phase,
each user kobserves Jtraining samples
Tk=(︂(︂x(1)
k, y(1))︂,...,(︂x(J)
k, y(J))︂)︂,
where for all k, j, we have x(j)
k∈Xk,y(j)∈Yand (x(j)
1, . . . , x(j)
K, y(j)) is drawn according
to P.
Each user kcan train its own model based on Tkwhich is distributed according to
the marginal of Pwith respect to Xk×Y. We propose to use a slight variation of the
well-known boosting technique and define a classifier
f:=
K
∑︂
k=1
αkfk,(3.7)
where fkis the base classifier locally trained at user kand αkis a nonnegative weight. As
an immediate corollary to Theorem 2 parallel to Corollary 5, fcan be approximated at a
central node in a distributed manner.
Corollary 6. Assume that Lis a B-Lipschitz-continuous loss. Let ε, δ > 0and M≥
M(f, ε, δ)as defined in (2.14), where Φ−1(ε) = ε, noting that
∆
¯(f)=2
K
∑︂
k=1
αk,∆(f)=2 K
max
k=1 αk.
Then, given any xK= (x1, . . . , xK)drawn from an arbitrary13,14 distribution at the trans-
mitters and any y∈Y, through Muses of the channel (2.2), the receiver can obtain an
estimate f
¯of f(xK)satisfying
P(L(xK, y, f
¯)−L(xK, y, f(xK))≥Bε)≤δ. (3.8)
The proof is the same as for Corollary 5.
It is important to remark here that the predictor fcan only be approximated at the
receiver up to a residual error (which can, however, be controlled) and thus, a guarantee
in terms of the 0-1-loss is not sufficient to apply Corollary 6 and we instead need it to be
in terms of a Lipschitz-continuous loss.
This is a relatively generic framework that can in principle work with any particular
boosting technique which determines weights α1, . . . , αKand guarantees a bound on the
loss of the predictor fdependent on the errors of the base classifiers f1, . . . , fK. In the
55
3 Applications of Distributed Function Approximation to Vertical Federated Learning
following, we describe two variations of this general approach, both based on well-known
ideas from ML (cf., e.g., [MRT12, Chapter 6]).
The first one, equal majority vote, amounts to setting α1=··· =αK= 1 and using
local classifiers fktrained only on the locally available features. This method has the
advantage that the whole training procedure can be carried out in a fully decentralized
way without any form of coordination or exchange of information between the agents
(given that the labels for the training phase are already known everywhere; but they
could, e.g., be broadcast from a central point at a cost independent of the number of
agents or dimensionality of the feature space).
If we have the possibility to exchange some data between the agents, we can use the
following adaptation of the AdaBoost scheme [MRT12, Figure 6.1] to the distributed
setting. The algorithm runs through L≤Kiterations, choosing a user hℓat iteration ℓto
provide a base classifier fhℓand assigning a corresponding weight αhℓ. It also computes
probability distributions p1, . . . , pL+1 on the index set of the training data {1, . . . , J},
initializing p1as the uniform distribution, as well as base classifier errors ϵ1, . . . , ϵLand
normalization constants Z1,...,ZL. Each iteration ℓconsists of the following steps:
1. The central node chooses a user hℓand broadcasts the choice.
2. User hℓtrains a base classifier fhℓ:Xhℓ→ {−1,1}on the training sample with dis-
tribution pℓand broadcasts the indices of the training samples incorrectly classified
by fhℓ.
3. From this information, every node in the system computes the following, where 1·
denotes the indicator function which is 1 if the condition in the index is true and 0
otherwise:
•ϵℓ:= ∑︁J
j=1 pℓ(j)1fhℓ(x(j)
hℓ)=y(j)
•αhℓ:= 1
2log 1−ϵℓ
ϵℓ
•Zℓ:= 2√︁ϵℓ(1 −ϵℓ))
•pℓ+1(j) := pℓ(j) exp(−αhℓfhℓ(x(j)
hℓ)y(j))/Zℓ
The resulting classifier is then as defined in (3.7), where we assign αk:= 0 whenever k=hℓ
for all ℓ. [MRT12, Theorem 6.1] guarantees that the empirical 0-1-loss of fis at most
exp (︄−2
L
∑︂
ℓ=1 (︃1
2−ϵℓ)︃2)︄,(3.9)
which unfortunately is insufficient to apply Corollary 6, because the 0-1-loss is not Lipschitz-
continuous. However, the proof of the theorem relies only on the inequality 1f(xK)y≤0≤
56
3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers
exp(−f(xK)y) for the instantaneous 0-1-loss. Since the inequality log(1+exp(−f(xK)y)) ≤
exp(−f(xK)y) also clearly holds, we can replace the 0-1-loss in the proof with the logis-
tic loss L(xK, y, yˆ) := log(1 + exp(−yyˆ)) (or, indeed, any other loss which satisfies this
inequality). This yields the same bound (3.9) on the 1-Lipschitz-continuous logistic loss
and thus we can apply Corollary 6 with B:= 1 to derive a guarantee on the logistic loss
of the distributed approximation of our AdaBoost classifier.
We conclude with some remarks on the distributed training. The choice in step 1 could,
e.g., be predetermined (in which case no communication in this step is necessary) or ran-
dom, but we could also greedily select the classifier with smallest error using an instance of
ScalableMax [9] [2, Section IV]. As for the communication cost of the distributed training,
step 1 exhibits a favorable scaling which is linear in Land logarithmic in K, however, step
2 has a cost linear in the number of training samples. There is a conceptually simpler
alternative to this distributed scheme in which we communicate the full training set to
the central node and perform the training in a centralized manner. The advantage in
communication cost of the distributed scheme over this centralized alternative is only a
constant factor. On the other hand, since only one bit per training sample and user is
transmitted, this constant gain could potentially be quite large, depending on the com-
plexity of the feature spaces. Also, in the distributed training scheme, the computational
load of training the base classifiers is distributed across all nodes, which may in practice
also be an advantage wherever the computational capabilities of the central node are lim-
ited. However, since the distributed training currently leverages no OTA computation and
leaves that for the computation of the trained classifier itself, finding a distributed scheme
which can exploit OTA computation to achieve a gain in asymptotic behavior as opposed
to only a constant factor could be a worthwhile question for future research.
3.3 Numerical Results for Over-the-Air-Computed Decision Tree
Classifiers
In order to illustrate how the scheme analyzed in this work can be used to compute
classifiers for anomaly detection problems in large sensor networks, we have conducted
numerical simulations on a synthetic binary classification problem generated by the
make classification function in the datasets package of the scikit-learn toolbox [P+11]
for Python. It places clusters for the two classes at the edges of a hypercube in a Euclidian
space of informative features, adds redundant features that are linear combinations as well
as useless features that are pure noise and applies various kinds of noise and nonlinear-
ities. The resulting features are then shuffled randomly, partitioned and assigned to the
distributed agents. For the training set, the agents also learn the correct corresponding
57
3 Applications of Distributed Function Approximation to Vertical Federated Learning
labels. We construct two different OTA-computed classifiers as described in the preceding
subsection:
1. For the equal majority vote classifier, each agent trains a decision tree model of
height at most 2 based on the locally available features only. The OTA-computed
classifier is then as put forth in equation (3.7), where α1=··· =αK= 1.
2. For the AdaBoost classifier, the agents train their models cooperatively as described
in the preceding section, using decision tree classifiers as the local base classifiers.
The next agent at each iteration is picked uniformly at random from among the
agents which have not yet been selected. This procedure yields not only differ-
ently trained local models compared to the equal majority vote, but also weights
α1, . . . , αKwhich can be used for the OTA-computed classifier as in equation (3.7).
We assume that the agents are connected to a central receiver through a fast-fading wireless
MAC, where no instantaneous CSI is available. The only kind of information we assume
is available is the average power of the complex Gaussian channel gain at the transmitters
and the average power of the additive noise at the receiver. The distributed classification is
simulated for noise and fading drawn from i.i.d. Gaussian distributions and for a scenario
exhibiting various degrees of correlation and non-Gaussian components:
•For the fading, we achieve this by passing the fading coefficients through a lowpass
filter, which cuts off all but a given percentage of the energy (the cutoff percentage)
and then re-normalizes the remaining signal.
•For the noise, we simulate Middleton class A noise (also called impulsive noise), which
is a commonly used noise model for power line communications [FLNS10, Section
2.6.3.1] but has also been empirically found to be a relevant phenomenon in wireless
communications [MS93]. We simulate it as described in [FLNS10, eq. (2.49) ff.]: In
order to create one sample of noise, a random variable mis drawn from a Poisson
distribution with intensity A, a parameter called the impulsive index, and then a
centered Gaussian random variable with variance
2PN
m/A+Γ
1 + Γ
is drawn, where PNis the overall power of the noise per complex dimension and
the parameter Γis called the Gaussian-to-impulsive power ratio. Finally, a phase
shift is applied drawn uniformly from the complex unit circle. This process defines
non-Gaussian, but i.i.d. noise. We have therefore modified it slightly and draw one
58
3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers
mfor every 4 channel realizations so that we create a correlation also in the additive
noise.
We simulate the computation of each of the two distributed classifiers described above in
three different ways:
1. The DFA scheme as described in Section 2.5.
2. A TDMA scheme with average power normalization in which only one of the agents
can transmit at a time. The information is also transmitted in analog form, since
each agent conveys only one bit of information and therefore digital coded schemes
would not be suitable. Since the agents transmit during a much shorter time than in
the DFA scheme, we normalize their transmission power so that the average energy
consumed per channel use equals that in the DFA scheme. The only exception to
this is the case when some agents have to be allocated zero channel uses, since the
number of total channel uses available is smaller than the number of agents in the
system. In this case, obviously, the agents allocated zero channel uses also have
zero energy consumption. This scheme has a significantly higher peak transmission
power than the DFA scheme.
3. A TDMA scheme with peak power normalization. It works as the one under item 2,
but the transmission power is normalized so that it has the same peak power as the
DFA scheme, which means that it has a significantly lower average consumption.
We also show two baselines to make the error contribution of the compared communication
schemes clearer:
1. a noiseless version of the majority vote classifier, and
2. a noiseless version of the AdaBoost-inspired classifier.
The training set consists of 50,000 samples and the test set of 200,000. We have
generated two different binary classification problems, one for 10 transmitters and one for
25 transmitters.
Comparison of DFA and TDMA schemes for equal majority vote and AdaBoost We
have simulated the DFA scheme as well as the TDMA baseline comparisons for a scenario
with moderate correlation and non-Gaussianity. The cutoff percentage for the lowpass
filter applied to the fading was chosen at 80%, and the parameters for the Middleton
Class A noise were A= 3 and Γ= 3. In Fig. 3.1, we plot the classification error on
the test set as a function of the number of complex channel uses for a fixed SNR of −6
59
3 Applications of Distributed Function Approximation to Vertical Federated Learning
50 100 150 200 250
0.05
0.1
0.15
0.2
0.25
0.3
DFA gain
DFA gain
number of complex channel uses
test error
equal majority DFA
AdaBoost DFA
equal majority TDMA average-normalized
AdaBoost TDMA average-normalized
equal majority TDMA peak-normalized
AdaBoost TDMA peak-normalized
AdaBoost noiseless
equal majority noiseless
Figure 3.1: Comparison of the classification error on the test set of DFA/TDMA and equal
majority/AdaBoost schemes. 10 transmitters, cutoff percentage 80%, A= 3,
Γ= 3, SNR −6 dB.
60
3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers
−20 −10 0 10 20
0.04
0.06
0.08
0.1
0.12
0.14
DFA error
floor gain
DFA error floor gain
SNR
test error
Figure 3.2: Comparison of the classification error on the test set of DFA/TDMA and equal
majority/AdaBoost schemes. 10 transmitters, cutoff percentage 80%, A= 3,
Γ= 3, 50 complex channel uses.
50 100 150 200 250
0.1
0.2
0.3
0.4
DFA gain DFA gain
number of complex channel uses
test error
Figure 3.3: Simulation results for 25 transmitters at a fixed SNR of −6 dB, correlation
parameters are the same as in Fig. 3.1.
61
3 Applications of Distributed Function Approximation to Vertical Federated Learning
dB and 10 transmitters, and in Fig. 3.2, we plot the error as function of SNR for a fixed
number of 50 complex channel uses. We can see that, as the effect of the multiplicative
fading dominates that of the additive noise, the schemes reach an error floor that cannot be
lowered with an increase of the transmission power. When the number of complex channel
uses is increased, on the other hand, the error curves approach the noiseless classification
error even if the SNR is kept fixed.
For instance, to obtain a classification error of 0.07 or better, both for the equal majority
vote and AdaBoost, the average-power normalized TDMA scheme needs over 30 channel
uses more than the DFA scheme. Since we compare with a TDMA scheme that uses
the same average energy per channel use as the DFA scheme, this means that the TDMA
scheme not only consumes more wireless spectrum and/or time, but also significantly more
energy. For the case of the same peak power consumption (which means that TDMA
consumes less average power since transmitters are silent most of the time), the difference
is huge and can be several hundred channel uses depending on the error level.
The advantage of the DFA scheme over the TDMA alternatives is quite pronounced
even at a relatively low number of only 10 transmitters. In Fig. 3.3, we show the same
plot as in Fig. 3.1, but for a different ML problem with 25 transmitters. It can be seen
in the plot that as the number of transmitters increases, the difference in performance
between the DFA and TDMA schemes becomes even stronger. This is due to the different
scaling behaviors of the schemes.
We have run these simulations for many instantiations of the randomly generated clas-
sification schemes (not depicted for lack of space) and note that while in some cases the
equal majority vote scheme performs similarly as the AdaBoost scheme in the noiseless
case, in many cases the error behavior of the AdaBoost scheme is much better and it is
more robust, e.g. in the case that a large number of agents observes only useless features
while only few agents observe the informative and repetitive features that can be used
to solve the classification problem. That being said, in the case in which the equal ma-
jority vote performs similarly to AdaBoost in the noiseless case, its error behavior in the
communication schemes is better since it better utilizes the available peak transmission
power.
Synchronization errors Since the pre-processing described in Section 2.5.1 creates a se-
quence of i.i.d. random variables (conditioned under s1, . . . , sK), we can expect that the
scheme is quite robust to synchronization errors between the transmitters. This is im-
portant since perfect synchronization of the transmitted signals at the receiver would be
a very hard task to achieve in practice. In order to substantiate this argument, we have
run simulations with relatively large synchronization errors of several symbol durations:
62
3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers
50 100 150 200 250
0.05
0.1
0.15
0.2
0.25
0.3
number of complex channel uses
test error
perfect synchronization
synchronization errors of 1, 5, 10, 15,
20 and 25 channel uses
noiseless
Figure 3.4: Impact of synchronization errors on the DFA AdaBoost scheme in the scenario
of Fig. 3.1.
63
3 Applications of Distributed Function Approximation to Vertical Federated Learning
For the starting time of the transmission for each transmitter, we have added a uniformly
random number of channel uses in a range of up to 1, up to 5, up to 10, up to 15, up
to 20 and up to 25 channel uses. In Fig. 3.4, we show the impact of the synchronization
errors for the AdaBoost DFA scheme and the same choice of parameters as for Fig. 3.1.
The solid red curves (representing the case of perfect synchronization) are therefore the
same in both figures while the blue curves in Fig. 3.4 depict the performance for various
values of the maximum synchronization error. The performance degrades gracefully even
for extremely large synchronization errors and the number of additional channel uses that
needs to be expended to maintain the same classification error is about twice the value of
the maximum synchronization error. We remark that we expect the number of additional
channel uses that needs to be expended to scale with the synchronization error and not
with the number of transmitters. Moreover, for synchronization errors of the same order
of magnitude as the symbol duration, the drop in performance is barely noticeable.
Comparison of different correlation scenarios In order to get an idea of how strongly
the correlation and non-Gaussianity impact on the performance of the scheme, we have
compared the AdaBoost DFA scheme (the solid red curve from Fig. 3.1) for various choices
of the correlation and non-Gaussianity parameters. Qualitatively, the higher the values
of Aand Γ, the more closely the additive noise is to a Gaussian distribution and the
lower the values, the stronger pronounced is the non-Gaussianity of the noise. For the
fading, a cutoff percentage of 100% corresponds to i.i.d. Gaussian fading, while lower
cutoff percentages mean that the fading changes more slowly over time. The results of
our comparison in Fig. 3.5 show that the scheme performs best for the Gaussian i.i.d.
case but, as is expected from the theoretical analysis, the exponential decay of the error
is retained even for strongly pronounced correlation and non-Gaussianity.
64
3.3 Numerical Results for Over-the-Air-Computed Decision Tree Classifiers
50 100 150 200 250
0.05
0.1
0.15
0.2
0.25
0.3
number of complex channel uses
test error
i.i.d. Gaussian fading and noise
cutoff percentage 80%, A= 3, Γ= 3
(same as Fig. 3.1)
cutoff percentage 80%, A= 0.5, Γ= 1
cutoff percentage 50%, A= 0.25, Γ= 1
cutoff percentage 30%, A= 1, Γ= 0.5
noiseless
Figure 3.5: Comparison of the performance of AdaBoost DFA for various choices of the
non-Gaussianity and correlation parameters.
65
4 Security in Over-the-Air Computation
OTA computation schemes carry the promise that they can improve communication ef-
ficiency so dramatically in many cases of practical interest that they can be seen as an
enabler for applications in massive wireless networks for which the communication cost
or the time delay incurred would otherwise be prohibitive. However, there is also a flip
side that has the potential to hinder widespread adoption: Some tools that enhance the
properties of communication and are frequently used as building blocks in communication
systems inherently rely on the principle of source-channel separation. Therefore, they can-
not be adapted to work in a scenario where a joint source-channel approach is taken such
as in OTA computation. One example of such a building block that is particularly impor-
tant in modern communication systems is cryptography. OTA communication schemes
as described in Chapter 2 are vulnerable to a number of attacks such as malicious trans-
mitters participating in the scheme or attackers eavesdropping on the transmission, and
it is unclear whether and how state-of-the-art cryptographic security could be adapted to
defend against such threats.
At least for the latter kind of threat – attackers eavesdropping on the communication –
information-theoretic security, while not adaptable in a straightforward fashion, provides
a set of tools with which a defense can be developed. The ultimate goal in this direction
should be full semantic security [BTV12]. As a first step, we propose to extend the system
model with a jammer as depicted in Fig. 4.1. This shows how information-theoretic
security tools can be adapted to the OTA computation setting. The jammer can increase
the variance of the eavesdropper’s estimate of the quantity of interest, but not fully prevent
it from obtaining an estimate.
The key assumption we make is that the received jamming signal must be stronger for
the legitimate receiver than it is for the eavesdropper. This way, the legitimate receiver
Bcan exploit the dependencies which we carefully introduce into the jamming signal to
reconstruct it exactly. To the eavesdropper, the received signal is almost equivalent to an
i.i.d. jammed transmission. With the knowledge of the full jamming signal, the legitimate
receiver can then cancel it from its received signal or at least mitigate its impact. The
approximation of the OTA computed function value at the legitimate receiver can then be
carried out independently of the security scheme. It can, e.g., follow the method described
67
4 Security in Over-the-Air Computation
AK
A2
A1
EM
K
EM
2
EM
1
J
M-fold
channel
DMB
E
.
.
.
.
.
.
sK
s2
s1
TM
K
TM
2
TM
1
XM
YMf
˜
ZM
Figure 4.1: System model for distributed function approximation with security constraints.
in Chapter 2. In the present chapter, we give an example for how to combine the security
and function approximation schemes in the simpler case of an AWGN channel without
fast fading.
While our results on OTA computation rely heavily on the particular structure of wire-
less channels, the class of channels for which the reconstruction of the jamming signal is
possible is much more general. This will become apparent in the following.
We are not aware of a similar system model having been proposed before for OTA
computation, but we draw heavily from existing tools in information theory. As the main
building blocks for proving security guarantees in a DFA scheme with friendly jamming, we
use two information theoretic tools, namely coding for the compound channel and channel
resolvability. The latter ensures that the jamming signal can be reconstructed fully by
the legitimate receiver15 and thus be canceled from the received signal. Therefore, it
has no impact on the quality of the objective function estimate. Channel resolvability
guarantees that the jamming signal is virtually indistinguishable from white noise for the
eavesdropper. This virtual indistinguishability is phrased in terms of variational distance
between the actually observed distribution and a superposition of the transmitters’ signal
with white noise from the jammer. The coding result for the continuous compound channel
which we derive may also be of interest as an independent result.
4.1 Prior Work
To the best of our knowledge, the OTA computation problem over a wiretap channel
has not yet been considered in the literature. Therefore, in this subsection we briefly
summarize the literature on the building blocks other than OTA computation we use for
15For a formal statement, see Definition 12.
68
4.1 Prior Work
the approach to the wiretap OTA computation channel that we propose in this work as
well as for literature on concepts that are closely related to the ones presented in this
paper.
Coding for compound channels. The compound channel problem was introduced in-
dependently in [Dob59, BBT59, Wol59], while first independent results for the capacity
expression can be found in [BBT59, Wol59]. These works, however, explore mainly the
case of finite input and output alphabets. The semicontinuous case in which only the
input alphabet is assumed to be finite is briefly touched upon in [Wol59] and studied in
more detail in [Kes61] which provides an example showing that the capacity expression
from the finite case does not carry over to the semicontinuous case in general. The semi-
continuous case was further explored in [Yos65,Ahl67]. In many cases of practical interest,
the capacity expression from the finite case can be generalized to the continuous case in
which neither input nor output alphabets are assumed to be finite, as was found in [RV68]
for a class of Gaussian compound channels.
Channel Resolvability and Semantic Security. The concept of channel resolvability was
introduced in [Wyn75a, HV93]. Further results relevant in the context of this work ap-
peared, e.g., in [Csi96,Dev05,HM16,Cuf16]. We use our generalization proposed in The-
orem 6 for continuous channels as a basis for our proposed scheme. Although we cannot
provide full semantic security guarantees in this work, we also heavily draw from the
idea of obtaining semantic security by means of channel resolvability, which is laid out
in [Hay06,CK11,BTV12,BL13].
Friendly Jamming The idea of friendly jamming has been used in [NG05] to aid a
transmitter-receiver pair in protecting a point-to-point transmission from a passive eaves-
dropper. Distributed and centralized beamforming techniques are used so that the jam-
ming signal impacts the signal-to-noise ratio at the eavesdropper but not at the legitimate
receiver. Several more recent works (cf., e.g., [VBBM10,VBBM11,SY12]) have expanded
upon this idea and refined the friendly jamming techniques, but to the best of our knowl-
edge, they have not yet been used to protect OTA computation against eavesdropping.
Physical Layer Security The concept of information theoretic secrecy was introduced
in [Sha49] and the wiretap channel model together with a weaker, but more tractable
notion of secrecy was introduced in [Wyn75b]. Based on this, various stronger secrecy
notions have been introduced and investigated (e.g., [Mau94,HK14,BTV12]). All of these
existing works investigate how digitally coded transmissions can be protected against
69
4 Security in Over-the-Air Computation
AK
A2
A1
EM
K
EM
2
EM
1
J
M-fold
channel
DMB
E
.
.
.
.
.
.
sK
s2
s1
TM
K
TM
2
TM
1
XM
YMf
˜
ZM
WBM
WEM
Figure 4.2: System model for distributed function approximation with jamming described
in Section 4.2.2.
eavesdropping, while in the present work, we focus on uncoded analog transmissions over
multiple-access channels.
4.2 System Model
4.2.1 Distributed Approximation of Functions
The system model and problem statement for DFA are the same as in Section 2.2.1.
Depending on the application at hand, there are multiple ways in which the quality of
the estimate f
˜can be quantified. Besides the notion of ε-approximation of a function at
confidence level δdefined in (2.5), we also define the approximation by criterion of the
mean square error (MSE).
Definition 11. We say that fis V-MSE-approximated if, under a uniform distri-
bution of f(s1, . . . , sK), we have
E(︃(︂f
˜−f(s1, . . . , sK))︂2)︃≤V,
where the expectation is over the joint distribution of s1, . . . , sKand f
˜which is
induced by the distributed function approximation scheme and the channel.
70
4.2 System Model
4.2.2 Secrecy Extension to Distributed Approximation of Functions
In order to incorporate security aspects into the framework, we consider the extended
system model depicted in Fig. 4.2.
The first addition to the model is an attacker Ewhich attempts to eavesdrop on the
transmission and would like to gain knowledge about s1, . . . , sK. At each channel use, E
observes an output Zranging over the eavesdropper’s alphabet Z. As a counter-measure,
we add a friendly jammer Jwhich transmits some jamming sequence XMwith the ob-
jective to prevent Efrom obtaining information while still allowing Bto obtain a good
estimate of f(s1, . . . , sK).
Definition 12. A DFA scheme with jamming consists of:
•A distributed function approximation scheme as described in Section 2.2.1; i.e., pre-
and post-processing schemes
•A jamming strategy given by a probability distribution on XM.
We say that a distributed function approximation scheme with jamming allows reconstruc-
tion of the jamming signal with probability δif there is a decoding function ϑ:YM→ XM
such that
sup
s1∈S1,...,sK∈SK
Ps1,...,sK(︁ϑ(YM)=XM)︁≤δ
and δis the smallest number with this property.
The objective is to find admissible pre- and post-processing strategies as well as a
jamming strategy such that Bcan obtain a good approximation f
˜of f(s1, . . . , sK) while
bounding the usefulness of any information that Ecan obtain about s1, . . . , sK.
Together with the channel, a distributed function approximation scheme with jamming
induces a probability distribution R
ˆs1,...,sKon ZMfor each (s1, . . . , sK)∈ S. How secure
the scheme is depends on how strongly R
ˆs1,...,sKdepends on s1, . . . , sK. In the following,
we formalize this notion.
Any measurable function g:S → S
ˆ, where S
ˆis a measurable space, is called an
eavesdropper’s objective.
Definition 13. 1. Given a real number η≥0, we say that a distributed function ap-
proximation scheme with jamming is η-semantically secure if there is a probability
measure νon ZMsuch that for all (s1, . . . , sK)∈ S,
R
ˆs1,...,sK−ν
TV ≤η, (4.1)
71
4 Security in Over-the-Air Computation
where ∥µ∥TV := sup{µ(E) : Eis an event}denotes the total variation norm on signed
measures.
2. Let g:S → S
ˆ, where S
ˆ⊆Ris measurable and bounded, be an eavesdropper’s
objective. Let V≥0be a real number. We say that a distributed function approxi-
mation scheme with jamming is (g, V )-MSE-secure if under a uniform distribution
of g(s1, . . . , sK), for every estimator d:ZM→ S
ˆ, we have
E(︂(︁d(ZM)−g(s1, . . . , sK))︁2)︂≥V,
where the expectation is over the joint distribution of s1, . . . , sKand YMwhich results
from the application of the distributed function approximation scheme with jamming
and the channel.
In statistical terms, that a scheme is (g, V )-MSE-secure means that all estimators the
eavesdropper can apply have MSE at least Vunder a uniformly distributed objective. A
uniform distribution of the objective means that s1, . . . , sKare randomly distributed in
such a way that g(s1, . . . , sK) follows a uniform distribution.
In a sense made explicit by the following lemma, semantic security is the stronger of
the two security notions from Definition 13.
Lemma 11. Let S
ˆ:= [a, b], let g:S → S
ˆbe an eavesdropper’s objective and η≥0a real
number.
Then, any distributed function approximation scheme with jamming that is η-semantically
secure is also (g, (1/12 −η)(b−a)2)-MSE-secure.
Proof. Let d:ZM→ S
ˆ. Then, assuming the distribution of s1, . . . , sKcorresponds to a
uniform distribution on [a, b] of g(s1, . . . , sK), we have
Es1,...,sKER
ˆs1,...,sK(︂(︁d(ZM)−g(s1, . . . , sK))︁2)︂
=Es1,...,sK∫︂(b−a)2
0
R
ˆs1,...,sK(︃(︂d(ZM)−g(s1, . . . , sK))︂2> a)︃da
(4.1)
≥Es1,...,sK∫︂(b−a)2
0(︃ν(︂(︁d(ZM)−g(s1, . . . , sK))︁2> a)︂−η)︃da
=Es1,...,sKEν(︂(︁d(ZM)−g(s1, . . . , sK))︁2)︂−η(b−a)2
≥(︃1
12 −η)︃(b−a)2,
where the last step is because under ν,ZMis independent of s1, . . . , sK. Therefore, the
posterior distribution of g(s1, . . . , sK) given ZMis uniform on [a, b], which implies that
72
4.2 System Model
the minimum MSE is the variance of the uniform distribution. Denoting with Ua random
variable distributed uniformly in [a, b], we can calculate its variance as
E(︁U2)︁−(EU)2=∫︂b
a
g2
b−adg −(a+b)2
4
=b3−a3
3(b−a)−(a+b)2
4
=b2+a2+ab
3−a2+ 2ab +b2
4
=4b2+ 4a2+ 4ab −3a2−6ab −3b2
12
=1
12(b−a)2.
4.2.3 Special case K= 1
We conclude this section with a brief discussion of the important special case K= 1. While
one of the main motivations of the methods developed in this paper is their scalability to
large values of K, the case of low values of Kcan also be interesting in many practical
applications and be instructive to understand the nature of our results better.
For the special case of only a single transmitter (K= 1), the problem reduces to a
point-to-point transmission of the real number f(s1) in the presence of an eavesdropper
and a friendly jammer. In our results in this paper, there is no assumption that Khas to
be large; in particular, they remain applicable also when K= 1. However, since in this
case no function of distributed values has to be computed over the channel, it is possible
to separately source and channel encode f(s1). After the source coding step has been
performed, the remaining problem is very similar to jammer-aided secret communication
as treated for instance in [NG05,VBBM10,VBBM11].
But although this approach is applicable to the same communication task, it is important
to note that the way in which the friendly jammer has to be placed differs significantly. In
the approach of this paper, the jamming signal has to be stronger at the legitimate receiver
than it is at the eavesdropper. As long as this condition is satisfied, the legitimate receiver
has the ability to almost completely cancel the jamming signal. This means that our
method remains applicable even if the gap in terms of jammer signal strength between the
legitimate receiver and the eavesdropper is relatively small. In [NG05,VBBM10,VBBM11],
on the other hand, it is necessary that the jamming signal is stronger at the eavesdropper
than it is at the legitimate receiver. Moreover, this gap between signal strengths has to be
as large as possible since the jammer’s signal strength at the legitimate receiver diminishes
the capacity of the main channel. Therefore, our results in this case are more suitable for
73
4 Security in Over-the-Air Computation
scenarios where is is possible to assure a high jamming signal strength at the legitimate
receiver while results from [NG05, VBBM10, VBBM11] are more suitable in cases where
all possible eavesdropper locations can be covered with strong jamming signals that have
very low strength at the location of the legitimate receiver.
4.3 Specialization to the Additive White Gaussian Noise
Channel
In general, the approximation scheme even without an eavesdropper or jammer highly
depends on the particular structure of the channel and f. It is therefore instructive to
consider a specialization of the DFA framework with jamming to the computation of
arithmetic means over AWGN channels. Specifically, the objective function is given as
f: (s1, . . . , sK)↦→ 1
K
K
∑︂
k=1
sk,(4.2)
where for all k,Sk= [−1,1]. The channel is given by
Y=hAB
K
∑︂
k=1
Tk+hJBX+NB(4.3)
Z=hAE
K
∑︂
k=1
Tk+hJEX+NE,(4.4)
where each of the transmitters A1,...,AKis subject to a total power constraint of PA,J
is subject to an average power constraint PJ,NBis centered normal with variance σ2
Band
NEis centered normal with variance σ2
E. The real channel coefficients hAB, hJB, hAE, hJE
are assumed deterministic and known everywhere. The channel is used Mtimes with
transmitter input sequences TM
kfor each k∈ {1, . . . , K}and XMfor the jammer. The
input sequences are subject to the average power constraints
1
M
M
∑︂
m=1
(Tk,m)2≤PA,1
M
M
∑︂
m=1
(Xm)2≤PJ.
The problem is easily approached if we can assume that Bhas full knowledge of XM,
while Eknows only how XMis distributed.
In this case, we have the following result.
Lemma 12. Consider the wiretap channel given by (4.3) and (4.4) and the objective
74
4.3 Specialization to the Additive White Gaussian Noise Channel
0 1 2 3 4
0
0.1
0.2
0.3
0.4
σ2
σ2Ψ(2/σ)
Figure 4.3: Illustration of the MSE guarantees of Lemma 12. The dashed line is the MSE
which an eavesdropper would have without any received signal (i.e., guessing
the middle of the interval).
function fdefined in (4.2). Define
σ2
eff,B:= σ2
B
h2
ABK2PA
, σ2
eff,E:= σ2
E+h2
JEPJ
h2
AEK2PA
Assume that the jamming sequence XMis perfectly known at the legitimate receiver
while the eavesdropper has only statistical information. Define
Ψ(a) := ∫︂a
0∫︂∞
−∞ (︃c+φN(−c)−φN(a−c)
ΦN(a−c)−ΦN(−c)−b)︃21
aφN(b−c)dcdb, (4.5)
where φNdenotes the probability density function and ΦNthe cumulative distribution
function of the standard normal distribution, respectively. Then there is a distributed
function approximation scheme with jamming which is (f, σ2
eff,EΨ(2/σeff,E))-MSE-secure
and (σ2
eff,BΨ(2/σeff,B))-MSE-approximates fat the receiver.
The proof is based on a few facts from statistics. We only state the relevant lemmas
here. For the sake of completeness, we include the proofs of the following two lemmas in
Section 4.6.1.
Lemma 13. If Uis distributed uniformly in [a, b]and, conditioned on U,V1,...,VM
are i.i.d. normally distributed with mean Uand variance σ2, then the minimum MSE
75
4 Security in Over-the-Air Computation
estimator for estimating Ufrom the observations V1,...,VMis
U
ˆ:= V
¯+σ
√M·
φN(︂a−V
¯
σ/√M)︂−φN(︂b−V
¯
σ/√M)︂
ΦN(︂b−V
¯
σ/√M)︂−ΦN(︂a−V
¯
σ/√M)︂,(4.6)
where V
¯:= 1
M∑︁M
m=1 Vm.
Lemma 14. Under the assumptions of Lemma 13, the estimator U
ˆsatisfies
E(︃(︂U −U
ˆ)︂2)︃=σ2
MΨ(︃b−a
σ/√M)︃,
with Ψas defined in (4.5).
Proof of Lemma 12. We use the following transmission strategy:
Xm: Gaussian with mean 0 and variance PJ,(4.7)
EM
k:sk↦→ (1,...,1) ·sk√︃PA
M(4.8)
The receiver can obtain
Y′
m:= Ym−hJBXm
hABK√︁PA/M
=hAB ∑︁K
k=1 Tk+NB,m
hABK√︁PA/M
=hAB ∑︁K
k=1 sk√︁PA/M +NB,m
hABK√︁PA/M
=f(s1, . . . , sK) + NB,m
hABK√︁PA/M .
We define the post-processing operation DMat the receiver as first obtaining Y′
1, . . . , Y ′
M
and then computing the MSE estimator from Lemma 13. With this choice, Lemma 14
yields the claimed reconstruction error guarantee.
On the other hand, the output at Eis given by
Zm
(4.4)
=hAE
K
∑︂
k=1
Tk+hJEXm+NE,m
(4.8)
=hAE
K
∑︂
k=1
sk√︁PA/M +hJEXm+NE,m
76
4.4 Main Results
(4.2)
=f(s1, . . . , sK)·K√︁PA/MhAE +hJEXm+NE,m.
From this, Lemmas 13 and 14 yield the claimed MSE-security of the scheme.
The assumption that the legitimate receiver has full knowledge of the jamming signal
seems quite strong. So the main part of this chapter is devoted to setting out and analyzing
a jamming strategy in Theorem 4 which does not need any form of shared randomness or
additional communication between the friendly jammer and the legitimate receiver. This
jamming strategy has “almost” the same implications on the overall system performance
as the assumption that the legitimate receiver has full knowledge of the jamming signal,
while the eavesdropper only has knowledge about its distribution. In Corollary 9, we
formalize this notion for the AWGN case. In principle, however, this jamming strategy
is not restricted to the AWGN scenario; but in fact, it can be combined with any class
of channel models in which it is possible to cancel or at least mitigate the effect of the
jamming signal if exact knowledge of it is available.
One example where not a full cancellation but at least a good mitigation is possible is
the fast-fading scenario treated in Chapter 2.
4.4 Main Results
In this section, we formally state the main results of this chapter.
A function approximation scheme is specific to a particular channel model, which among
other things influences how the pre- and post-processing operations have to be designed
as well as which class of functions can be approximated. In Section 4.3, we have described
such a scheme for the AWGN channel and only a singleton class of functions, namely
for the arithmetic average, which is a particularly simple case. The strategy for the
legitimate receiver to counter the signal of the friendly jammer given that it is known is
also particularly simple in the AWGN case and can be done perfectly, as we have seen.
The missing part of the secrecy extension, which is the method for the legitimate receiver
to obtain the necessary knowledge of the jamming signal while the eavesdropper cannot,
on the other hand, can be phrased and proven to work in somewhat greater generality. In
order to formally state our main results, therefore, we have to introduce a few technical
concepts first.
For any channel W, we denote the joint input-output distribution under the input
distribution Pby QP,W and the marginal for Yby RP,W . With these conventions, we
define the information density of tuples of elements of the input and output alphabets
77
4 Security in Over-the-Air Computation
under the channel Wand an input distribution Pas
iP,W (xM;yM) := log dWM(xM,·)
dRM
P,W
(yM).
Correspondingly, the mutual information is defined as
IP,W := EQP,W iP,W (X;Y).
Moreover, given two probability measures µand ν, we define the R´enyi divergence of order
α∈(0,1) ∪(1,∞) between them as
Dα(µ||ν) := 1
α−1log Eµ(︄(︃dµ
dν )︃α−1)︄.
D1(µ||ν) := limα↗1Dα(µ||ν) is the Kullback-Leibler divergence.
Acompound channel is a family (Ws)s∈S of memoryless time-discrete point-to-point
channels with common input alphabet Xand output alphabet Y. The transmitter’s
channel input is passed through a fixed Wsfor the entire block length, but the transmitter
does not control the choice of s, nor is it governed by a probability distribution. In this
work, we assume neither the transmitter nor the receiver knows s. A compound channel
code with block length Mand rate Rconsists of an encoder EM
k:{1,...,exp(MR)} →
XMand a decoder DM:YM→ {1,...,exp(MR)}. We say that it has error probability δ
if under a uniform distribution of M∈ {1,...,exp(MR)}, the following is true: Let YM
be constructed by passing the components of XM:= EM(M) independently through Ws.
Then, we have
sup
s∈S
EMPs(M=DM(YM)) ≤δ,
where δis the smallest number with this property.
Our secrecy scheme will hinge on the capacity of compound channels with possibly
continuous alphabets (such as Gaussian compound channels). As mentioned in Section 4.1,
it is shown in [Kes61] that even in the case that only the output alphabet is countably
infinite, the capacity expressions from the finite case [BBT59, Wol59] do not carry over.
It is therefore clear that an additional assumption on the compound channel is needed.
In existing literature (e.g., [BBT59, RV68]), the problem is often approached by proving
that the compound channel can be approximated by a finite class of channels in which
case classical channel coding techniques such as joint typicality decoding can be adapted
in a straightforward manner. In this work, we choose to directly pose the approximability
of the compound channel by a finite class of channels as an assumption of our coding
78
4.4 Main Results
theorem. In Section 4.5.1, we justify the usefulness of results involving this assumption
by proving that a large class of practically relevant channels can indeed be approximated
in the sense of the following definition.
Given measures µand ν, we say that µis absolutely continuous with respect to ν, or
µ≪ν, if all ν-null sets are also µ-null sets.
Definition 14. Given a compound channel (Ws)s∈S with input alphabet Xand output
alphabet Y, we say that it can be (η, J)-approximated under a probability distribution P
on Xif there is a sequence (W
ˆη,j)J
j=1 of channels from Xto Ysuch that for every s∈ S,
there is j∈ {1, . . . , J}such that
EPD1(︂Ws(X, ·)||W
ˆη,j(X, ·))︂≤η(4.9)
∃α > 1∀x∈ X :Dα(︂Ws(x, ·)||W
ˆη,j(x, ·))︂<∞(4.10)
∀x∈ X :W
ˆη,j(x, ·)≪Ws(x, ·) (4.11)
IP,W
ˆη,j −IP,Ws≤η, (4.12)
and for every j∈ {1, . . . , J}there is s∈ S such that
IP,Ws−IP,W
ˆη,j ≤η. (4.13)
The discussion in Section 4.5.1 provides sufficient topological conditions for (η, J)-
approximability and shows that an example of channels with this property are (possibly
fading) Gaussian channels.
We use a standard random codebook construction: Given a channel input alphabet X, a
distribution Pon X, a block length Mand a rate R, we define the (P, M, R)-ensemble of
codebooks as a random experiment in which exp(MR) codewords of length Mare drawn
randomly and independently according to Pfor each component of each codeword.
A codebook Cinduces a jamming strategy in the following way: The jammer draws a
codeword index Muniformly at random and transmits C(M), the codeword in Cindexed
by M. Therefore, the number of codewords in the codebook controls the amount of
randomness contained in the jamming signal.
In order to be able to impose an average power constraint on the jammer, we define an
additive cost constraint (c,C) for an input alphabet Xconsisting of a function c:X → R+
0
and a number C∈R+
0. Given any M, we say that xM∈ XMsatisfies the cost constraint
if ∑︁M
m=1 c(xm)≤MC.
The specialization of this definition to a usual average power constraint would be to
pick the square function as cand the maximum admissible average power as C.
79
4 Security in Over-the-Air Computation
As long as there is at least one xM∈ XMwhich satisfies the cost constraint (c,C),
given any codebook Cof block length M, we can define an associated cost-constrained
codebook Cc,Cwhich is generated from Cby replacing all codewords that do not satisfy
the cost constraint with xM. Obviously, all codewords in a cost-constrained codebook
satisfy the cost constraint. We say that a cost constraint (c,C) is compatible with an input
distribution Pif for a random variable Xdistributed according to P,c(X) has a finite
moment generating function in an interval containing 0 in its interior and C>EPc(X).
We assume a given pre-processing scheme which is admissible in the sense of Definition 9
and consider effective channels incorporating both the pre-processing at the transmitters
and the physical channel. We denote the legitimate user’s effective channel, which is a
stochastic kernel mapping from S1×···×SK×X to Y, by WBand the eavesdropper’s
effective channel, which is a stochastic kernel mapping from S1×···×SK×X to Z, by
WE. The M-fold products of these effective channels are outlined in the system model
in Fig. 4.2. With these concepts and notations defined, we are ready to state the main
result of this work, which gives sufficient conditions for the existence of a jamming scheme
that can simultaneously ensure that the legitimate receiver is able to reconstruct the full
jamming signal and limit the usefulness of the eavesdropper’s received signal.
Theorem 4. Let Pbe a jammer input distribution. Suppose that for every η > 0,
there is some J(η)such that the compound channel (Ws)s∈S defined by W(s1,...,sK):=
WB(s1, . . . , sK,·,·)can be (η, J(η))-approximated under P. Suppose further that for all
s1∈ S1, . . . , sK∈ SK, the moment-generating function
Eexp(a·iP,WE(s1,...,sK,·,·)(X;Z))
of the information density exists and is finite at some point a > 0. Let (c,C) be an additive
cost constraint compatible with P, and let Cbe a random codebook from the (P, M, R)-
ensemble. Let R ∈ (0,∞)such that
sup
s1∈S1,...,sK∈SK
IP,WE(s1,...,sK,·,·)<R<inf
s1∈S1,...,sK∈SK
IP,WB(s1,...,sK,·,·).(4.14)
Then there are numbers γ1, γ2, γ3, γ4>0such that for sufficiently large M,
PC(︃
R
ˆWE(s1,...,sK,·,·)M,Cc,C−RM
P,WE(s1,...,sK,·,·)
TV ≥exp(−Mγ1))︃
<exp(−exp(Mγ2)) (4.15)
and
PC(E)<exp(−Mγ4),(4.16)
80
4.4 Main Results
where Eis the event that the jamming strategy induced by Cc,Cdoes not allow reconstruction
of the jamming signal with error at most exp(−Mγ3).
Obviously, the theorem is only useful if there exists some Rsatisfying (4.14). The
condition that such an Rexists is the formalization of the notion that the jamming signal
has to be stronger at the legitimate receiver than it is at the eavesdropper.
In Section 4.5.2, we discuss in more detail how the guarantee in (4.15) can be used to
arrive at a MSE security guarantee for the scheme.
Theorem 4 needs a compound channel coding result as an ingredient for its proof, and
since this result is slightly more general than results available in the literature, it may be
of independent interest. Therefore, we also state it in this section.
Theorem 5. Let (Ws)s∈S be a compound channel with input alphabet Xand output al-
phabet Y, and let Pbe a probability distribution on Xsuch that for every η > 0, there is
aJ(η)such that (Ws)s∈S can be (η, J(η))-approximated under P. Let
0<R<inf
s∈S IP,Ws,(4.17)
and let Cbe a random codebook from the (P, M, R)-ensemble. Define an encoder m↦→
C(m). Then there is a decoder such that the average error probability δof the resulting
compound channel code satisfies
EC(δ)<exp(−Mγ),(4.18)
for some γ > 0and sufficiently large M.
With standard techniques, this theorem can be extended to the case of cost-constrained
codebooks. We provide the full details of the proof of the following corollary in Sec-
tion 4.6.5.
Corollary 7. In the setting of Theorem 5, and given an additive cost constraint (c,C)
compatible with P, there are γ1, γ2>0such that for sufficiently large M,
PCc,C(︁δ≥exp(−Mγ1))︁<exp(−Mγ2).(4.19)
As the other main technical ingredient, we need the following result on channel resolv-
ability.
Theorem 6. Given a channel Wfrom Xto Y, an input distribution Psuch that the
moment-generating function EQP,W exp(a·iP,W (X;Y)) of the information density exists
81
4 Security in Over-the-Air Computation
and is finite for some a > 0, and R>IP,W , there exist γ1>0and γ2>0such that for
large enough block lengths M, the (P, M, R)-ensemble satisfies
PC(︂∥R
ˆW,C−QM
P,W ∥TV >exp(−γ1M))︂≤exp (−exp (γ2M)) ,(4.20)
where R
ˆW,Cis the output distribution of channel Wgiven that a uniformly random code-
word from Cis transmitted.
Similarly as with the compound channel coding theorem, we can use known methods to
incorporate an additive cost constraint and argue the following corollary. For full details,
we refer the reader to Section 4.6.5.
Corollary 8. Let Pbe an input distribution on Xand (c,C)an additive cost constraint
compatible with P. Then the statement of Theorem 6 is valid even if the codebook Cis
replaced with its associated cost-constrained version Cc,C.
4.5 Implications of the Main Results
In this section, we show that the main result on secure OTA computation, Theorem 4,
implies a MSE security guarantee in the case of AWGN channels discussed in Section 4.3.
To this end, we show in Section 4.5.1 that AWGN compound channels satisfy (among
other channel models) the approximability criterion of Definition 14. In Section 4.5.2, we
show how Theorem 4 can be applied to carry the MSE security result of Lemma 12 over
to the case in which the legitimate receiver does not share randomness with the jammer.
4.5.1 Feasibility of Channel Approximation
In this subsection, we provide some tools and examples to argue that many compound
channels of practical interest can indeed be (η, J)-approximated so that Theorem 5 may
be applied to them. We begin with an observation that shows how our result specializes
to the known results [BBT59,Wol59] for channels with finite alphabets.
Remark 5. [BBT59, Lemma 4] implies that for every compound channel (Ws)s∈S with
finite input and output alphabets and every η > 0, there is an integer J(η)such that
(Ws)s∈S can be (η, J(η))-approximated.
We repeat the construction here and discuss how this fact is proved.
Let Mbe an integer which satisfies
M≥max (︃4|Y|3
η2,2|Y|2
η)︃.
82
4.5 Implications of the Main Results
Given s∈ S, we construct a channel W′
s. To this end, given any x∈ X, we fix an
enumeration (yi)|Y|
i=1 such that the finite sequence (Ws(x, {yi}))|Y|
i=1 is nondecreasing. For
every i < |Y|, we can then uniquely choose a value for W′
s(x, {yi})such that it is an integer
multiple of 1/M and
Ws(x, {yi})≤W′
s(x, {yi})< Ws(x, {yi}) + 1
M.(4.21)
It is argued in [BBT59] that this leaves a positive probability mass for W′
s(x, {y|Y|})and
therefore, this construction fully defines a channel W′
s. We define the approximation se-
quence (W
ˆη,j)J(η)
j=1 as an enumeration of the set {W′
s:s∈ S}. The cardinality of this set
is upper bounded by (M+ 1)|X||Y| since all singleton probabilities are integer multiples of
1/M.
For finite alphabets, (4.10) is trivially satisfied since R´enyi divergence is in this case
always finite [vEH14]. Regarding the absolute continuity criterion (4.11), we recall that
W′
s(x, {y|Y|})always has a positive probability, and for i < |Y|, the assumption Ws(x, {yi}) =
0immediately implies W′
s(x, {y|Y|}) = 0 by (4.21), since 0is the only integer multiple of
1/M which is strictly smaller than 1/M. The proof in [BBT59] exploits (4.21) to prove
that the absolute difference between the information of Wsand W′
sunder any input dis-
tribution is at most 2|Y|3/2M−1/2(statement (c) of the lemma) which by our choice of M
immediately implies (4.12) and (4.13). Moreover, it is shown that (4.21) also implies that
for all x∈ X, y ∈ Y,
log Ws(x, {y})
W′
s(x, {y})≤2|Y|2
M
(statement (b) of the lemma) which by our choice of Mimplies (4.9).
For many channels of interest, (η, J(η))-approximability can be shown directly by going
through properties (4.9) – (4.13). However, it is often easier to make an argument involving
topological properties of S. The following lemma provides some machinery to this end.
Lemma 15. Let (Ws)s∈S be a compound channel with input alphabet Xand output al-
phabet Y, let Pbe a probability distribution on Xand assume that there is a topology on
Ssuch that Sis compact and
∀s0∈ S :s↦→ EPD1(Ws(X, ·)||Ws0(X, ·)) is upper semi-continuous at s0(4.22)
∀s1, s2∈ S ∃α > 1∀x∈ X :Dα(Ws1(x, ·)||Ws2(x, ·)) <∞(4.23)
s↦→ IP,Wsis lower semi-continuous. (4.24)
Then, for every η > 0, there is J(η)such that (Ws)s∈S can be (η, J(η))-approximated
under P.
83
4 Security in Over-the-Air Computation
Proof. Fix some η > 0. For a given s∈ S, consider
{s′:EPD1(Ws′(X, ·)||Ws(X, ·)) < η}∩{s′:IP,Ws−IP,Ws′< η}.
Clearly, (4.22) and (4.24) ensure that this intersection is a neighborhood of s, so we
can find an open neighborhood Dscontained in it. Thus, (Ds)s∈S is an open cover of
Sand therefore, the compactness of Syields a finite subcover Ds1,...,DsJ(η). We set
W
ˆη,j := Wsjand given any s∈ S, we choose jsuch that s∈ Dsjand argue that W
ˆη,j
satisfies (4.9), (4.10) and (4.12). To this end, we note that (4.10) and (4.11) follow from
(4.23), while (4.9) and (4.12) are ensured by the definition of Dsj. Finally, (4.13) is trivially
satisfied, concluding the proof.
We now make use of Lemma 15 to prove that a large class of Gaussian fading multiple-
input and multiple-output channels can actually be (η, J(η))-approximated and thus The-
orem 5 can be applied to them. The class of compound channels covered in the following
theorem contains the class considered in [RV68, Sections 3 and 4] as a proper subset. We
denote the set of symmetric, positive semidefinite n×n-matrices with Symn
+and the set
of symmetric, positive definite n×n-matrices with Symn
++.
Theorem 7. Let X=Rn2,Y=Rn1, let Sbe a compact subset of Rn1n2×Symn1n2
+×Rn1×
Symn1
++ (under the topology induced by the Frobenius norm). For any s= (µH,ΣH, µN,ΣN)∈
S, let Wsbe the channel given by
Y=HX +N,
where the channel input Xhas range Rn2, the channel output Yhas range Rn1, the entries
of the n1×n2fading matrix Hfollow the distribution N(µH,ΣH)and the additive noise
Nis independent of Hand follows the distribution N(µN,ΣN). Let Pbe a distribution on
Xand assume that either Pis a multivariate Gaussian with positive definite covariance
matrix or that the support of Pis contained in some compact set. Then, given any η > 0,
there is J(η)such that (Ws)s∈S can be (η, J(η))-approximated under P.
Proof. We show that the conditions of Lemma 15 are met. [Gil11] provides closed-form
expressions for R´enyi and Kullback-Leibler divergences between multivariate normal dis-
tributions. The only fact that we are going to use and which is apparent from these
expressions, however, is that the R´enyi and Kullback-Leibler divergences between two
multivariate normal distributions are finite and continuous in the mean vectors and co-
variance matrices of the distributions wherever the covariance matrices are positive definite
or, equivalently, both distributions are absolutely continuous with respect to the Lebesgue
measure.
84
4.5 Implications of the Main Results
ΣN∈Symn1
++ and therefore, given any x∈ X,Ws(x, ·) is absolutely continuous with
respect to the Lebesgue measure and thus has a positive definite covariance matrix and a
density rWs(x,·), which implies (4.23).
Next, from the well-known closed-form expression of the multivariate normal density, we
know that for any xand y,rWs(x,·)(y) is continuous in s. The boundedness of Simplies a
uniform upper bound on rWs(x,·)(y), so we can use the theorem of dominated convergence
to argue that the marginal density rRP,Ws(y) = EPrWs(X,·)(y) depends continuously on s
for any fixed y. We write
IP,Ws=EPRP,Ws(︃rWs(X,·)(Y)
rRP,Ws(Y)log rWs(X,·)(Y)
rRP,Ws(Y))︃.
Since the integrand is lower bounded by −1/e, (4.24) follows as an application of Fatou’s
lemma.
Finally, in order to argue (4.22), we distinguish between the two cases in the statement
of the theorem.
First, suppose that there is a compact subset X
ˆ⊆ X with P(X \X
ˆ) = 0. For any fixed
s0, the map
(s, x)↦→ D1(Ws(x, ·)||Ws0(x, ·))
is continuous, therefore the image of S × X
ˆis compact and hence bounded. We can
therefore invoke the theorem of dominated convergence and argue that (4.22) is satisfied.
Now, suppose that Pis multivariate Gaussian with positive definite covariance matrix.
We write
EPD1(Ws(X, ·)||Ws0(X, ·)) = EPEWs(X,·)log rP(X)rWs(X,·)(Y)
rP(X)rWs0(X,·)(Y)
=EQP,Wslog rQP,Ws(X, Y )
rQP,Ws0(X, Y )
=D1(︁QP,Ws||QP,Ws0)︁.
From our arguments above, given any s, the distribution QP,Wsis multivariate Gaussian
with positive definite covariance matrix, which implies that (4.22) is satisfied.
4.5.2 Back to the Additive White Gaussian Noise case: Calculating Mean
Square Error Security Guarantees
Revisiting the AWGN example from Section 4.3, Theorem 4 and Lemma 12 imply MSE
security and reconstruction guarantees in case the legitimate receiver does not know the
85
4 Security in Over-the-Air Computation
jamming sequence, as we show in the following corollary.
Corollary 9. Make the same assumptions and definitions as in Lemma 12, but do not
assume that the legitimate receiver has knowledge of the jamming sequence X1, . . . , XM.
Assume in addition that the channel from Jto Bis stronger than the channel from Jto E,
i.e., hJB/σB> hJE/σE. Then there is a distributed approximation scheme with jamming
and there are constants γ1, γ2>0such that for sufficiently large M, the following hold:
•Bcan approximate the objective function f(s1, . . . , sK)with a MSE not exceeding
σ2
eff,BΨ(︄2
σ2
eff,B)︄+ exp(−Mγ1) (4.25)
•The scheme is (f, V )-MSE-secure, where
V:= σ2
eff,EΨ(︄2
σ2
eff,E)︄−exp(−Mγ2).(4.26)
Proof. For the pre-processing at the transmitters, we use the same scheme as in the proof
of Lemma 12 and begin by verifying that the resulting effective channels WBand WEwith
the input distribution Pchosen to be Gaussian with mean 0 and variance PJsatisfy the
assumptions of Theorem 4. Since the defined compound channel is a class of Gaussian
channels with different means taking values in the compact set [−1,1], the approximability
of the channel is an immediate consequence of Theorem 7. The finiteness of the moment-
generating function of the information density can be seen by straightforward applications
of the definitions of information density and R´enyi divergence:
Eexp(a·iP,WE(s1,...,sK,·,·)(X;Z))
=E(︃(︃dWE(s1, . . . , sK, X, ·)
dRP,WE(s1,...,sK,·,·)
(Z))︃a)︃
= exp (︃a·1
alog E(︃(︃dWE(s1, . . . , sK, X, ·)
dRP,WE(s1,...,sK,·,·)
(Z))︃a)︃)︃
= exp (︁aDa+1 (︁QP,WE(s1,...,sK,·,·)||PRP,WE(s1,...,sK,·,·))︁)︁
The R´enyi divergence appearing at the end is between two multivariate Gaussian distri-
butions and can be seen to be finite from the expressions given in [Gil11]. In order to
verify (4.14), we first note that the information expressions appearing are the capacities
of the effective channels WBand WE. Since s1, . . . , sKchange the mean of the channel
only, they do not influence the capacity. Therefore, the infimum and supremum are over
86
4.6 Proofs
singleton sets. Consequently, the condition hJB/σB> hJE/σEensures that there is some
Rsatisfying (4.14).
Fix γ1′, γ3′as claimed to exist in Theorem 4, and also fix γ1, γ2with 0 < γ2< γ1′and
0< γ1< γ3′.
Note that in the AWGN channel, s1, . . . , sKcorrespond to a shift of the output dis-
tribution of the channel, and therefore, the variational distance that appears in (4.15) is
independent of s1, . . . , sK. For sufficiently large M, we can therefore fix a codebook C
from the (P, M, R)-ensemble such that for all s1, . . . , sK, neither one of the error events
described in (4.15) and (4.16) occurs.
Let the jamming strategy be induced by CC,cand let d:ZM→[−1,1] be an estimator
for E. We can now bound the MSE of d:
ER
ˆWE(s1,...,sK,·,·)M(︂(︁d(ZM)−f(s1, . . . , sK))︁2)︂
=∫︂4
0
R
ˆWE(s1,...,sK,·,·)M(︂(︁d(ZM)−f(s1, . . . , sK))︁2> a)︂da
(4.15)
≥∫︂4
0(︃RM
P,WE(s1,...,sK,·,·)(︂(︁d(ZM)−f(s1, . . . , sK))︁2> a)︂−exp(−Mγ1′))︃da
=ERM
P,WE(s1,...,sK,·,·)(︂(︁d(ZM)−f(s1, . . . , sK))︁2)︂−4 exp(−Mγ1′).
Taking the lower bound for the MSE under RM
P,WE(s1,...,sK,·,·)from Lemma 12 and noting
γ2< γ1′, we arrive at the expression in (4.26) for sufficiently large M.
For the reconstruction strategy at B, we first let Breconstruct the jamming signal as
is possible by Theorem 4 and then post-process the received signal as is possible with
knowledge of the jamming signal by Lemma 12. Using the error bound in Lemma 12 and
observing that the maximum instantaneous square error is 4 since we are constrained to
an interval of length 2 and that γ1< γ3′, for sufficiently large Mwe arrive at (4.25).
4.6 Proofs
In this section, we prove the Lemmas used in the proof of Lemma 12 and our main results
on secure OTA computation and compound channel coding, Theorems 4 and 5, as well as
the corollaries that allow for the incorporation of an average cost constraint.
4.6.1 Statistical Preliminaries for the Proof of Lemma 12
In this subsection, we prove the two lemmas used for the proof of Lemma 12.
Proof of Lemma 13. It is known [Jay03, eq. (6.92)] that the MSE is minimized by the
87
4 Security in Over-the-Air Computation
mean of the posterior probability distribution. We can therefore calculate the minimum
MSE estimator given the observations v1, . . . , vMas follows, where we use rwith random
variables in the index to denote (conditional) densities.
U
ˆ=∫︂b
a
grU|V1,...,VM(g|v1, . . . , vM)dg
(a)
=∫︂b
a
grV1,...,VM|U(v, . . . , vM|g)rU(g)
rV1,...,VM(v1, . . . , vM)dg
=∫︁b
agrV1,...,VM|U(v1, . . . , vM|g)rU(g)dg
∫︁b
arV1,...,VM|U(v1, . . . , vM|g)rU(g)dg
(b)
=∫︁b
agexp (︂−1
2σ2∑︁M
m=1(vm−g)2)︂dg
∫︁b
aexp (︂−1
2σ2∑︁M
m=1(vm−g)2)︂dg
=∫︁b
agexp (︂−1
2σ2/M (︂1
M∑︁M
m=1 v2
m−2gv¯ + g2)︂)︂dg
∫︁b
aexp (︂−1
2σ2/M (︂1
M∑︁M
m=1 v2
m−2gv¯ + g2)︂)︂dg
(c)
=∫︁b
agexp (︂−1
2σ2/M (v¯−g)2)︂dg
∫︁b
aexp (︂−1
2σ2/M (v¯−g)2)︂dg
For (a), we have applied Bayes’ rule. (b) is by observing that rU(g)=1/(b−a) is
independent of gin [a, b] and rV1,...,VM|U is the normal density. (c) is by multiplying
exp (︄−1
2σ2/M (︄v¯2−1
M
M
∑︂
m=1
v2
m)︄)︄
on both sides of the fraction to complete the binomials.
The term we have calculated for U
ˆis the mean of a normal distribution centered at v¯ with
variance σ2/M truncated in [a, b]. This is a distribution with a known mean [JKB94, eq.
13.134], and hence we arrive at (4.6).
Proof of Lemma 14. Based on the representation (4.6), we calculate the MSE as follows.
We use the substitution rule, substituting v′:= v¯−a
σ/√Min (a) and g′:= g−a
σ/√Min (b).
E(︃(︂U −U
ˆ)︂2)︃=∫︂b
a∫︂∞
−∞ ⎛
⎝v¯ + σ
√M·
φN(︂a−v¯
σ/√M)︂−φN(︂b−v¯
σ/√M)︂
ΦN(︂b−v¯
σ/√M)︂−ΦN(︂a−v¯
σ/√M)︂−g⎞
⎠
2
·1
b−a·1
σ/√MφN(︃g−v¯
σ/√M)︃dv¯dg
88
4.6 Proofs
(a)
=∫︂b
a∫︂∞
−∞ ⎛
⎜
⎜
⎝
σ
√M⎛
⎝v′+
φN(−v′)−φN(︂b−a
σ/√M−v′)︂
ΦN(︂b−a
σ/√M−v′)︂−ΦN(−v′)⎞
⎠+a−g⎞
⎟
⎟
⎠
2
·1
b−a·φN(︃g−a
σ/√M−v′)︃dv′dg
(b)
=∫︂b−a
σ/√M
0∫︂∞
−∞ ⎛
⎝v′+
φN(−v′)−φN(︂b−a
σ/√M−v′)︂
ΦN(︂b−a
σ/√M−v′)︂−ΦN(−v′)−g′⎞
⎠
2
·(︃σ
√M)︃3
·1
b−a·φN(g′−v′)dv′dg′
=σ2
MΨ(︃b−a
σ/√M)︃,
concluding the proof of the lemma.
4.6.2 Proof of Theorem 4
In order to prove Theorem 4, we decompose the system depicted in Fig. 4.2 into smaller
(and more easily analyzed) subsystems by considering only a subset of the depicted ter-
minals at a time.
1. Considering the terminals A1,...,AK,B.This is the system summarized in Sec-
tion 2.2.1. The rationale is that the results specialize to the setting in Section 4.3
as well as, e.g., to the fast-fading setting treated in Chapter 2. This part of the
system consists of transmitters (Ak)K
k=1 each of which holds a value sk∈ Skand a
receiver Bwhich has the objective of estimating f(s1, . . . , sK). To this end, each
transmitter Akpasses skthrough a pre-processor Ekindependently Mtimes yielding
a sequence TM
kof channel inputs. These are transmitted through Mindependent
uses of the channel, generating a sequence YMof channel outputs. The receiver
passes this sequence through a post-processor DMwhich generates an approxima-
tion f
˜of f(s1, . . . , sK). As mentioned, the design of the pre- and post-processors
depends heavily on the channel model and a particular class of functions f. The
idea is that the pre-processors, the channel and the post-processor work together to
mimic the function f, and any approach following this idea will be highly dependent
on the particular structure of the channel and f. In Theorem 4, it is assumed that
such a system is already in place and an augmentation is proposed which makes it
more secure. A property of the system described in Section 2.2.1 necessary for our
purposes and heavily exploited in this work is that the pre-processing is i.i.d., i.e.,
89
4 Security in Over-the-Air Computation
each pre-processor Ekis a stochastic kernel mapping from Skto Tkand an M-fold
product EM
kof it is used to generate the channel input sequence.
2. Considering the terminals A1,...,AK,J,E.In this setting, we assume that the
transmitters A1,...,AKrun a scheme of the kind described under item 1. Instead
of the legitimate receiver, there is now an eavesdropper E. The objective is then to
limit the usefulness of the eavesdropper’s received signal ZM. To this end, we add a
friendly jammer Jto the system which transmits, according to a certain strategy, a
word XM. In this work, any jamming strategy we consider is induced by a codebook
Cof words of length Mthrough the rule that the jammer chooses an element of the
codebook uniformly at random and transmits it. We use existing results on channel
resolvability to derive a bound on the usefulness of the signal ZMreceived at E.
3. Considering the terminals A1,...,AK,J,B.This is the setting from item 1 with an
additional transmitter J. Here we assume that Juses a jamming strategy induced by
a codebook Cas described under item 2 and use Theorem 5 on compound channel
coding to argue that for suitable choices of C,Bis able to fully reconstruct the
jamming signal XM. This enables Bto perform a cancellation of the jamming signal
before it applies the post-processor DMit would use in the setting of item 1. How
this cancellation works depends on the particularities of the channel considered, but
if, e.g., the jamming signal is simply added to the channel output as in the AWGN
example in Section 4.3, it is possible to cancel it entirely by subtracting it from the
received signal. So in this case the post-processor would consist of a reconstruction
of the jamming signal, the subtraction of this signal from the received one and a
post-processing step identical to that from item 1.
4. Combining settings of item 2 and 3. The goal here is to argue the existence of
a codebook Cwhich achieves both of the objectives described under item 2 and
item 3. It will turn out that this can be achieved by a standard random codebook
construction.
The main result of this work, Theorem 4, formulates conditions under which there
are codebooks in the (P, M, R)-ensemble of which the (c,C)-cost constrained versions
simultaneously achieve the goals set forth under 2) and 3).
Proof of Theorem 4. An application of Corollary 7 yields (4.16), and (4.15) follows from
Corollary 8.
90
4.6 Proofs
4.6.3 Proof of Theorem 5
We first pick parameters η,ε,β1and β2in sequence according to the following scheme,
where (4.17) and the previous choices ensure that these intervals are all nonempty.
η∈(︃0,infs∈S IP,Ws−R
3)︃(4.27)
ε∈(︃2η, inf
s∈S IP,Ws−R−η)︃(4.28)
β1∈(η, ε −η) (4.29)
β2∈(0, ε −η−β1) (4.30)
Fix a sequence (W
ˆη,j)J(η)
j=1 which (η, J(η))-approximates (Ws)s∈S.
We use a joint typicality decoder, i.e. if there is a unique msuch that
∃j∈ {1, . . . , J(η)}:iP,W
ˆη,j (C(m); YM)≥M(IP,W
ˆη,j −ε),
the decoder declares that message mhas been sent; otherwise it declares an error (or that
message 1 has been sent).
We denote the transmitted message with M, the message declared by the decoder with
M
ˆand define error events
E:= {M=M
ˆ}(4.31)
E1:= {︂∀j∈ {1, . . . , J(η)}iP,W
ˆη,j (C(M); YM)< M(IP,W
ˆη,j −ε)}︂(4.32)
E2:= {︂∃m=M∃j∈ {1, . . . , J(η)}iP,W
ˆη,j (C(m); YM)≥M(IP,W
ˆη,j −ε)}︂.(4.33)
We note that E ⊆ E1∪E2and consequently
P(E)≤P(E1) + P(E2).(4.34)
So we can bound these two errors separately and then combine them.
We start with bounding the expectation of the first summand, using the definition (4.32)
and C, as well as an addition of zero. Pick jsuch that W
ˆη,j satisfies (4.9) – (4.12) with
91
4 Security in Over-the-Air Computation
respect to the realization Wsof the compound channel. Then we have
EC(P(E1)) ≤EC(︂P(︂iP,W
ˆη,j (C(M); YM)< M(IP,W
ˆη,j −ε))︂)︂
=QM
P,Ws(︂iP,W
ˆη,j (XM;YM)< M(IP,W
ˆη,j −ε))︂
=QM
P,Ws(︄M
∑︂
m=1
log (︄dW
ˆη,j(Xm,·)
dRP,W
ˆη,j
(Ym))︄< M(IP,W
ˆη,j +IP,Ws−IP,Ws−ε))︄
(4.35)
The Radon-Nikodym derivative can be split as
dW
ˆη,j(Xm,·)
dRP,W
ˆη,j
=dW
ˆη,j(Xm,·)
dWs(Xm,·)·dRP,Ws
dRP,W
ˆη,j ·dWs(Xm,·)
dRP,Ws
.(4.36)
This is possible because W
ˆη,j(x, ·)≪Ws(x, ·) by (4.11), RP,Ws≪RP,W
ˆη,j by (4.9) and the
joint convexity of Kullback-Leibler divergence in its arguments, and Ws(x, ·)≪RP,Wsfor
P-almost all xby the properties of the marginalization.
We next bound tail probabilities corresponding to the three factors in (4.36) separately,
starting with the first. To this end, we introduce a number α1>1 and argue, using
Markov’s inequality and the definition of R´enyi divergence, that
QM
P,Ws(︄M
∑︂
m=1
log dWs(Xm,·)
dW
ˆη,j(Xm,·)(Ym)≥Mβ1)︄
=QM
P,Ws(︄exp (︄(α1−1)
M
∑︂
m=1
log dWs(Xm,·)
dW
ˆη,j(Xm,·)(Ym))︄≥exp((α1−1)Mβ1))︄
≤EQM
P,Ws⎛
⎝⎛
⎝
M
∏︂
m=1 (︄dWs(Xm,·)
dW
ˆη,j(Xm,·)(Ym))︄α1−1⎞
⎠⎞
⎠exp(−(α1−1)Mβ1)
= exp ⎛
⎝
M
∑︂
m=1
log ⎛
⎝EQM
P,Ws⎛
⎝(︄dWs(Xm,·)
dW
ˆη,j(Xm,·))︄α1−1⎞
⎠⎞
⎠⎞
⎠exp(−(α1−1)Mβ1)
= exp (︂−(α1−1)M(︂β1−EPDα1(︂Ws(X, ·)||W
ˆη,j(X, ·))︂)︂)︂.(4.37)
For the second factor, we argue in an analogous way, but using α2>0.
92
4.6 Proofs
RM
P,Ws(︄M
∑︂
m=1
log dRP,W
ˆη,j
dRP,Ws
(Ym)≥Mβ2)︄
=RM
P,Ws(︄exp (︄α2
M
∑︂
m=1
log dRP,W
ˆη,j
dRP,Ws
(Ym))︄≥exp(α2Mβ2))︄
≤ERM
P,Ws(︄M
∏︂
m=1 (︄dRP,W
ˆη,j
dRP,Ws
(Ym))︄α2)︄exp(−α2Mβ2)
= exp (︂(α2−1)MDα2(︂RP,W
ˆη,j ||RP,Ws)︂−α2Mβ2)︂.(4.38)
Finally, for the third factor, we use α3<1.
QM
P,Ws(︂iP,Ws(XM;YM)< M(IP,Ws−ε+β1+β2+η))︂
=QM
P,Ws(︁exp (︁(α3−1)iP,Ws(XM;YM))︁>exp ((α3−1)M(IP,Ws−ε+β1+β2+η)) )︁
≤EQM
P,Ws(︄M
∏︂
m=1 (︃dWs(Xm,·)
dRP,Ws
(Ym))︃α3−1)︄exp (−(α3−1)M(IP,Ws−ε+β1+β2+η))
= exp (︂−(1 −α3)M(︁Dα3(QP,Ws||PRP,Ws) + ε−IP,Ws−β1−β2−η)︁)︂,(4.39)
Clearly, by (4.36), the union bound and (4.12), (4.35) is upper bounded by the sum of
(4.37), (4.38) and (4.39). Next, we argue that these expressions all vanish exponentially
with M→ ∞, using the continuity of R´enyi divergence in the order which is shown
in [vEH14, Theorem 7].
From (4.10), the theorem of monotone convergence and (4.9), we can conclude that
lim
α1↘1
EPDα1(︂Ws(Xm,·)||W
ˆη,j(Xm,·))︂=EPD1(︂Ws(Xm,·)||W
ˆη,j(Xm,·))︂≤η,
so, (4.29) allows us to fix α1at a value greater than 1 such that
β1−EPDα1(︂Ws(Xm,·)||W
ˆη,j(Xm,·))︂>0
and hence, (4.37) vanishes exponentially.
(4.38) is true for all α2<1. Since the inequalities are not strict, we can take the limit
α2↗1 and argue that the statement is also valid for α2= 1.
Dα3(QP,Ws||PRP,Ws) converges to IP,Wsfrom below for α3↗1 and so (4.30) allows us
93
4 Security in Over-the-Air Computation
to fix α3at a value less than 1 such that
Dα3(QP,Ws||PRP,Ws) + ε−IP,Ws−β1−β2−η > 0
and therefore, (4.39) also vanishes exponentially.
For the second summand in (4.34), we use the definition (4.33) to argue that EC(P(E2))
is upper bounded by
exp(MR)
J(η)
∑︂
j=1
PMRM
P,Ws(︃iP,W
ˆη,j (XM;YM)≥M(︂IP,W
ˆη,j −ε)︂)︃.(4.40)
We define the indicator function
ind(xM, yM) := ⎧
⎨
⎩
1,iP,W
ˆη,j (xM;yM)≥M(︂IP,W
ˆη,j −ε)︂
0,otherwise.
Using the definition of information density for a change of measure and multiplying one,
we rewrite the probability that appears in (4.40) as
PMRM
P,Ws(︃iP,W
ˆη,j (XM;YM)≥M(︂IP,W
ˆη,j −ε)︂)︃
=∫︂
XM×YM
ind(xM, yM)·PMRM
P,Ws(dxM, dyM)
=∫︂
XM×YM
exp (︁−iP,Ws(xM;yM))︁ind(xM, yM)QM
P,Ws(dxM, dyM)
=∫︂
XM×YM
exp (︁−iP,Ws(xM;yM) + iP,W
ˆη,j (xM;yM)−iP,W
ˆη,j (xM;yM))︁
·ind(xM, yM)QM
P,Ws(dxM, dyM)
Because of the presence of the indicator, we can uniformly bound
iP,W
ˆη,j (xM;yM)≥M(︂IP,W
ˆη,j −ε)︂
94
4.6 Proofs
and the indicator itself can be upper bounded by 1. This yields
PMRM
P,Ws(︃iP,W
ˆη,j (XM;YM)≥M(︂IP,W
ˆη,j −ε)︂)︃
≤exp (︂−M(︂IP,W
ˆη,j −ε)︂)︂
∫︂
XM×YM
exp (︂−iP,Ws(xM;yM) + iP,W
ˆη,j (xM;yM))︂QM
P,Ws(dxM, dyM)
We expand the definition of information density and apply Fubini’s Theorem to rewrite
the integral as
∫︂
YM
⎛
⎝∫︂
XM
dW
ˆη,j
M(xM,·)
dRM
P,W
ˆη,j
(yM)PM(dxM)⎞
⎠RM
P,Ws(dyM)
and observe that it equals 1.
Combining with (4.40) and applying (4.13), we obtain
EC(P(E2)) ≤exp(MR)
J(η)
∑︂
j=1
exp (︂−M(︂IP,W
ˆη,j −ε)︂)︂
≤exp(MR)
J(η)
∑︂
j=1
exp (︃−M(︃inf
s∈S IP,Ws−ε−η)︃)︃
= exp (︃−M(︃inf
s∈S IP,Ws−ε−R−η−log J(η)
M)︃)︃.(4.41)
We observe that by (4.28), infs∈S IP,Ws−ε−R−η > 0.
Finally, we pick
γ∈(︄0,min (︃(α1−1) ·(︂β1−EPDα1(︂Ws(Xm,·)||W
ˆη,j(Xm,·))︂)︂,
β2,
(1 −α3)(︁Dα3(QP,Ws||PRP,Ws) + ε−IP,Ws−β1−β2−η)︁,
inf
s∈S IP,Ws−ε−R−η)︃)︄.
Since the exponent in (4.41) is then negative for sufficiently large M, we can combine it
with (4.37), (4.38) and (4.39) to obtain (4.18).
95
4 Security in Over-the-Air Computation
4.6.4 Proof of Theorem 6
In order to prove the theorem, given a codebook C, we write the variational distance as
∥R
ˆW,C−RM
P,W ∥TV = sup
A⊆XM
measurable (︂R
ˆW,C(A)−RM
P,W (A))︂
= sup
A⊆XM
measurable ∫︂A(︄dR
ˆW,C
dRM
P,W
(yM)−1)︄RM
P,W (dyM)
=ERM
P,W [︄dR
ˆW,C
dRM
P,W
(yM)−1]︄+
.(4.42)
Note that throughout the proofs, we only consider codebooks Cfor which R
ˆW,Cis abso-
lutely continuous with respect to RM
P,W . We can do this because the existence of a finite
mutual information implies that W(x, ·) is absolutely continuous with respect to RP,W
for almost every x, and so the probability of drawing a codebook for which R
ˆW,Cis not
absolutely continuous with respect to RM
P,W is 0. Similarly, we assume the existence of the
other Radon-Nikodym derivatives that appear.
We define the typical set
Tε:= {︃(xM, yM) : 1
MiP,W (xM;yM)≤IP,W +ε}︃(4.43)
and split R
ˆW,Cinto two measures
R
ˆ1,W,C(A) := exp(−MR)
exp(MR)
∑︂
m=1
WM(︁C(m), A ∩{yM: (C(m), yM)∈Tε})︁(4.44)
R
ˆ2,W,C(A) := exp(−MR)
exp(MR)
∑︂
m=1
WM(︁C(m), A ∩{yM: (C(m), yM)/∈Tε})︁.(4.45)
We observe R
ˆW,C=R
ˆ1,W,C+R
ˆ2,W,C, which allows us to split (4.42) into a typical and
an atypical part
∥R
ˆW,C−RM
P,W ∥TV =ERM
P,W [︄dR
ˆ1,W,C
dRM
P,W
(yM) + dR
ˆ2,W,C
dRM
P,W
(yM)−1]︄+
≤ERM
P,W [︄dR
ˆ1,W,C
dRM
P,W
(yM)−1]︄+
+R
ˆ2,W,C(YM).(4.46)
We next state and prove two lemmas that we will use as tools to bound the typical and
96
4.6 Proofs
atypical parts of this term separately.
Lemma 16 (Bound for atypical terms).Suppose QM
P,W (XM×YM\Tε)≤aand δ∈[0,1].
Then
PC(︁R
ˆ2,W,C(YM)> a(1 + δ))︁≤exp (︃−1
3δ2aexp(MR))︃.
Proof. Observe EC(R
ˆ2,W,C(YM)) = QM
P,W (XM×YM\Tε)≤aand bound
PC(︂R
ˆ2,W,C(YM)> a(1 + δ))︂
=PC(︂exp(MR)R
ˆ2,W,C(YM)> a exp(MR)(1 + δ))︂
=PC(︄exp(MR)
∑︂
m=1
WM(︁C(m),{yM: (C(m), yM)/∈Tε})︁> a exp(MR)(1 + δ))︄
≤exp (︃−1
3δ2aexp(MR))︃.
The inequality follows from the Chernoff-Hoeffding bound [DP09, Ex. 1.1] by noting
that we sum probabilities (i.e. values in [0,1]) on the left side, that these probabilities
are independently distributed under PCand that by the hypothesis of the lemma the
expectation of the term on the left is bounded by aexp(MR).
Lemma 17 (Bound for typical terms).Let δ, λ > 0and define
r:= exp(M(R−I(X;Y)−ε)).(4.47)
Suppose r/(6λ)≥1. Then
PC(︄ERM
P,W [︄dR
ˆ1,W,C
dRM
P,W
(yM)−1]︄+
> δ)︄
≤(︄1 + √︃3π
2exp (︃3λ2
4r)︃λ
√r+ exp(−λ))︄exp(−δλ),(4.48)
where π≈3.14159 is the area of the unit circle.
Before we prove this lemma, we make an observation that we need in the proof.
Lemma 18. Let Ξbe a measurable function mapping codebooks and elements of YMto
the nonnegative reals and let λ, δ > 0. Then
PC(︁EYMΞ(C, Y M)> δ)︁≤EYMEC(︂exp(λΞ(C, Y M)))︂exp(−δλ).(4.49)
97
4 Security in Over-the-Air Computation
Proof. An application of the Chernoff bound yields
PC(︁EYMΞ(C, Y M)> δ)︁≤EC(︁exp (︁λEYMΞ(C, Y M))︁)︁exp(−δλ).
We can then prove (4.49) by successive applications of Jensen’s inequality and Fubini’s
theorem.
Proof of Lemma 17. We begin by examining parts of the term in (4.48) for fixed, but
arbitrary Cand yMand rewrite
rdR
ˆ1,W,C
dRM
P,W
(yM) =
exp(MR)
∑︂
m=1
exp (M(−I(X;Y)−ε)) dWM(C(m),·)
dRM
P,W
(yM)1(C(m),yM)∈Tε.
Now, we observe that the indicator function bounds the relative density to be at most
exp(M(I(X;Y) + ε)) and thus every term in the sum to range within [0,1] and that
furthermore
EC(︄rdR
ˆ1,W,C
dRM
P,W
(yM))︄≤exp (M(−I(X;Y)−ε))
exp(MR)
∑︂
m=1
EC(︄dWM(C(m),·)
dRM
P,W
(yM))︄=r.
We then use these observations to yield, for any ξ > 0,
PC(︄exp (︄λ[︄dR
ˆ1,W,C
dRM
P,W
(yM)−1]︄+)︄>exp(λξ))︄=PC(︄dR
ˆ1,W,C
dRM
P,W
(yM)>1 + ξ)︄(4.50)
=PC(︃rdR
ˆ1,W,C
dRM
P,W
(yM)>(1 + ξ)r)︃
≤exp ⎛
⎝−ξ2
2(︂1 + ξ
3)︂r⎞
⎠,(4.51)
where (4.50) holds because the two measured events are equal and (4.51) follows by the
Chernoff-Hoeffding bound [McD98, Theorem 2.3b]. (4.51) can be upper bounded by
exp (︃−ξ2
3r)︃(4.52)
for ξ≤1 (in particular) and by
exp (︃−ξ
3r)︃(4.53)
98
4.6 Proofs
for ξ≥1 (in particular). We will in the following use the substitutions
a:= exp(λξ) (4.54)
b:= log(a)
λ√︃2r
3−√︃3
2rλ. (4.55)
Since we will be using (4.55) for integration by substitution, we note that it implies
d
dba= exp (︄bλ√︃3
2r+λ23
2r)︄λ√︃3
2r.(4.56)
We have, e.g. by [Bil95, Eq. 21.9],
EC(︄exp (︄λ[︄dR
ˆ1,W,C
dRM
P,W
(yM)−1]︄+)︄)︄=∫︂∞
0
PC(︄exp (︄λ[︄dR
ˆ1,W,C
dRM
P,W
(yM)−1]︄+)︄> a)︄da
and upper bound this integral by splitting the integration domain into three parts: The
integration over [0,1] can be upper bounded by 1 (since the integrand is a probability).
The integration over [1,exp(λ)] can be upper bounded as
∫︂exp(λ)
1
PC(︄exp (︄λ[︄dR
ˆ1,W,C
dRM
P,W
(yM)−1]︄+)︄> a)︄da
≤∫︂∞
1
exp (︃−(log a)2
3λ2r)︃da (4.57)
=∫︂∞
0
exp (︄−b2λ23
2r+ 2bλ3(︁3
2r)︁3
2+λ4(︁3
2r)︁2
3λ2r+bλ√︃3
2r+λ23
2r)︄λ√︃3
2rdb (4.58)
=∫︂∞
0
exp (︃−b2
2)︃db ·exp (︃3λ2
4r)︃λ√︃3
2r.(4.59)
(4.57) follows by substituting (4.52) as well as (4.54) and enlarging the integration domain
to [1,∞), which can be done because the integrand is nonnegative. (4.58) follows by the
rule for integration by substitution using (4.55).
99
4 Security in Over-the-Air Computation
The integration over [exp(λ),∞) can be upper bounded as
∫︂∞
exp(λ)
PC(︄exp (︄λ[︄dR
ˆ1,W,C
dRM
P,W
(yM)−1]︄+)︄> a)︄da ≤∫︂∞
exp(λ)
exp (︃−log a
3λr)︃da (4.60)
=∫︂∞
exp(λ)
a−r/(3λ)da
=exp(λ(1 −r/(3λ)))
r/(3λ)−1
≤exp(−λ),(4.61)
where (4.60) is by (4.53) and (4.61) is true because r/(6λ)≥1. We now apply Lemma 18
with Ξ(C, yM) := [︃dR
ˆ1,W,C
dRM
P,W
(yM)−1]︃+
. In the resulting bound, we substitute the bound of
1 for integration domain [0,1] as well as (4.59) and (4.61), substitute back (4.47) and note
that exp(−b2/2) is the well-known unnormalized standard normal density, and get (4.48).
Proof of Theorem 6. In order to bound the atypical term in the sum (4.46), note first that
for any α > 1,
QM
P,W (XM×YM\Tε)
=QM
P,W (︁{︁(xM, yM) : i(xM, yM)/M > I(X;Y) + ε}︁)︁
=QM
P,W (︁{(xM, yM) : exp (︁(α−1)i(xM, yM))︁>exp ((α−1)M(I(X;Y) + ε))})︁
≤∫︂XM×YM
exp (︁(α−1)i(xM, yM))︁QP,W (d(xM, yM)) ·exp (−(α−1)M(I(X;Y) + ε))
(4.62)
= exp log (︄∫︂XM×YM(︄dW M(C(m),·)
dRM
P,W
(yM))︄α−1
·QP,W (d(xM, yM)))︄
·exp (−M(α−1) (I(X;Y) + ε))
= exp (−M(α−1) (I(X;Y) + ε−Dα(QP,W ||PRP,W ))) (4.63)
≤exp(−Mβ1),(4.64)
where (4.62) follows by applying Markov’s inequality and (4.64) as long as
β1≤(α−1) (I(X;Y) + ε−Dα(QP,W ||PRP,W )) .(4.65)
Note that since the moment-generating function EQP,W exp(a·i(X, Y )) exists and is
finite for some a > 0, there is some α′>1 such that Dα′(QP,W ||PRP,W ) is finite,
100
4.6 Proofs
and thus Dα(QP,W ||PRP,W ) is finite and continuous in αfor α≤α′[vEH14]. Since
Dα(QP,W ||PRP,W )→I(X;Y) for α→1, we can choose α > 1, but sufficiently close to 1
such that the bound on β1is positive.
We can now apply Lemma 16 with a:= exp(−Mβ1) and δ:= 1 and get
PC(︁R
ˆ2,W,C(YM)>2 exp(−Mβ1))︁≤exp (︃−1
3exp(M(R−β1)))︃.(4.66)
To bound the typical term in (4.46), we apply Lemma 17 with λ:= exp(Mβ2) and
δ:= exp(−Mβ1), which yields
PC(︄ERM
P,W [︄dR
ˆ1,W,C
dRM
P,W
(yM)−1]︄+
>exp(−Mβ1))︄
≤(︄1 + √︃3π
2exp (︃3
4exp(−M(R−I(X;Y)−ε−2β2)) −1
2M(R−I(X;Y)−ε−2β2))︃
+ exp(−exp(Mβ2)))︄exp (−exp(M(β2−β1))) (4.67)
as long as Mis sufficiently large such that exp(M(R−I(X;Y)−ε))/6≥1.
We are now ready to put everything together: Considering (4.46), (4.66) and (4.67), an
application of the union bound yields the sum of (4.66) and (4.67) as an upper bound for
PC(︂∥R
ˆW,C−RM
P,W ∥TV >3 exp(−Mβ1))︂.
We choose ε < R − I(X;Y), then β1<(R − I(X;Y)−ε)/2 small enough to satisfy
(4.65), then β2such that β1< β2<(R−I(X;Y)−ε)/2, and finally we choose γ1< β1
and γ2<min(R−β1, β2−β1). With these choices, we get (4.20) for all sufficiently large
M, thereby concluding the proof.
4.6.5 Cost Constraint in Compound Channel Coding and Resolvability
In this section, we prove Corollaries 7 and 8 which essentially state that the conclusions of
Theorems 5 and 6 are also true for the case of cost-constrained codebooks. The approach
used is similar to the one in [EGK11, Section 3.3], but we include the adapted derivations
in full here for the sake of self-containedness. We begin with a series of preliminary lemmas
and conclude the section with the proofs of the corollaries.
Lemma 19. Let (Ui)i≥1be a sequence of i.i.d. random variables such that the moment
generating function φ(λ) := Eexp(λU1)exists on an interval containing 0in its interior.
101
4 Security in Over-the-Air Computation
Let C>EU1. Then there exists γ > 0such that
P(︄n
∑︂
i=1 Ui> nC)︄≤exp(−nγ).
Proof. We can without loss of generality assume that C= 0 and E(U1)<0, because
otherwise we could consider the random variables (Ui−C)i≥1instead.
Clearly, φ(0) = 1 and φ′(0) = E(U1)<0, so we can find some λ > 0 sufficiently small
such that φ(λ)<1. With this choice of λ, we can apply Markov’s inequality and get
P(︄n
∑︂
i=1 Ui>0)︄=P(︄exp (︄λ
n
∑︂
i=1 Ui)︄>1)︄
≤E(︄exp (︄λ
n
∑︂
i=1 Ui)︄)︄
=φ(λ)n
so the lemma follows by choosing γ:= −log φ(λ).
Lemma 20. Let Nbe a Bernoulli random variable with exp(MR)trials and success
probability p≤exp(−Mβ1)where β1<R/2. Then there are γ1, γ2>0such that for
sufficiently large M,
P(N>exp(M(R−γ1))) ≤exp(−exp(Mγ2)).(4.68)
Proof. We choose γ1,γ2and β2such that 0 < γ1< β1< β2<R/2 and γ2<R−2β2.
Then
P(︁N>exp(M(R−γ1)))︁
=P(︁N> p exp(MR) + (exp(−Mγ1)−p) exp(MR))︁
≤P(︁N>EN+ (exp(−Mγ1)−exp(−Mβ1)) exp(MR))︁
≤P(︁N>EN+ exp(−Mβ2) exp(MR))︁(4.69)
≤exp (︄−2(︁exp(−Mβ2))︁2(︁exp(MR))︁2
exp(MR))︄(4.70)
= exp (−2 exp(M(R−2β2))) (4.71)
≤exp(−exp(Mγ2)),(4.72)
where (4.70) follows by the Chernoff-Hoeffding bound as stated for instance in [DP09,
102
4.6 Proofs
Theorem 1.1, eq. (1.6)].
Lemma 21. Let Pbe a probability distribution on X. Assume moreover that c(X)has
a moment generating function defined in an interval with 0in its interior and that C>
EPc(X). Denote the number of bad codewords in Cwith
N:=
exp(MR)
∑︂
m=1
1∑︁M
m=1 c(C(m)(m))>MC.
Then there are γ1, γ2>0such that
PC(N>exp(M(R−γ1))) ≤exp(−exp(Mγ2)).(4.73)
Proof. Since the codeword components are i.i.d., we can apply Lemma 19 and obtain an
arbitrarily small β1>0 such that for all m,
p:= PC(︄M
∑︂
m=1
c(C(m)(m)) > MC)︄≤exp(−Mβ1).
So since the codewords are independent, Nis a Bernoulli variable with exp(MR) trials
and success probability p, and an application of Lemma 20 proves the conclusion.
Proof of Corollary 7. Assume throughout the proof that Mis sufficiently large. By Lemma 21,
we have γ1
ˆ, γ2
ˆ∈(0,∞) with
PC(E
ˆ)≤exp(−exp(Mγ2
ˆ )),(4.74)
where
E
ˆ:= {PM(C(M)=Cc,C(M)) >exp(−Mγ1
ˆ )}.
We denote the error of Cwith δCand the error of Cc,Cwith δCc,C. By Theorem 5 and
Markov’s inequality, we have, for some γˆ∈(0,∞) given by the theorem and with choices
γ1
˜∈(0,min(γˆ, γ1
ˆ )), γ2
˜∈(0, γˆ−γ1
˜ ),
PC(δC≥exp(−Mγ1
˜ )) ≤ECδCexp(Mγ1
˜ )
≤exp(−M(γˆ−γ1
˜ ))
≤exp(−Mγ2
˜ ).(4.75)
103
4 Security in Over-the-Air Computation
Conditioned on the complement of E
ˆ, we have
δCc,C
(a)
= sup
s∈S
EM(︂Ps(︁M=DM(YM)|XM=Cc,C(M))︁)︂
= sup
s∈S
exp(MR)
∑︂
m=1
exp(−MR)Ps(︁m=DM(YM)|XM=Cc,C(m))︁
(b)
≤sup
s∈S
exp(MR)
∑︂
m=1
Cc,C(m)=C(m)
exp(−MR)Ps(︁m=DM(YM)|XM=Cc,C(m))︁+
exp(MR)
∑︂
m=1
Cc,C(m)=C(m)
exp(−MR)
(a)
≤δC+ exp(−Mγ1
ˆ ),(4.76)
where the steps marked with (a) are by the definition of compound coding error, and (b)
is by upper bounding some of the probabilities in the sum with 1. We can now choose
γ1∈(0, γ1
˜ ) and obtain
PC(︁δCc,C≥exp(−Mγ1))︁(a)
≤PC(︁δCc,C≥exp(−Mγ1)|¬E
ˆ)︁+PC(E
ˆ)
(4.76)
≤PC(︁δC+ exp(−Mγ1
ˆ ) ≥exp(−Mγ1)|¬E
ˆ)︁+PC(E
ˆ)
(a)
≤PC(︁δC≥exp(−Mγ1)−exp(−Mγ1
ˆ ))︁
1−PC(E
ˆ)+PC(E
ˆ)
(b)
≤PC(︁δC≥exp(−Mγ1
˜ ))︁
1−PC(E
ˆ)+PC(E
ˆ)
(4.74),(4.75)
≤exp(−Mγ2
˜ )
1−exp(−exp(Mγ2
ˆ )) + exp(−exp(Mγ2
ˆ ))
(c)
≤exp(−Mγ2),
where the steps marked with (a) are by the law of total probability, step (b) is by the
choices of γ1, γ1
˜ , and step (c) is valid for any choice of γ2∈(0, γ2
˜ ).
Proof of Corollary 8. By Lemma 21, we pick γ1
ˆ, γ2
ˆ satisfying (4.73) and by Theorem 6,
we pick γ1
˜, γ2
˜ satisfying (4.20).
We use the observation that N≤exp((R−γ1
ˆ )M) implies
∥R
ˆCc,C−R
ˆC∥TV ≤N
exp(MR)≤exp(−γ1
ˆM) (4.77)
104
4.6 Proofs
and observe that, as long as γ1< γ1
ˆ, γ1
˜ and γ2< γ2
ˆ, γ2
˜ and Mis sufficiently large,
PCc,C(︂∥R
ˆCc,C−QM
P,W ∥TV >exp(−γ1M))︂
(a)
≤PC(︂∥R
ˆCc,C−R
ˆC∥TV +∥R
ˆC−QM
P,W ∥TV >exp(−γ1M))︂
(b)
≤PC(︂∥R
ˆCc,C−R
ˆC∥TV >exp(−γ1
ˆM))︂+PC(︂∥R
ˆC−QM
P,W ∥TV >exp(−γ1
˜M))︂
(c)
<exp(−exp(γ2
ˆM)) + exp(−exp(γ2
˜M))
(d)
≤exp(−exp(γ2M)),
where (a) is by the triangle inequality, (b) is by the union bound and the choice of γ1, (c)
is due to (4.73), (4.77) and (4.20), and (d) is by the choice of γ2.
105
Publication List
[1] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Resolvability on continuous
alphabets. In 2018 IEEE International Symposium on Information Theory (ISIT),
pages 2037–2041. IEEE, 2018.
[2] Igor Bjelakovi´c, Matthias Frey, and Slawomir Sta´nczak. Distributed approximation
of functions over fast fading channels with applications to distributed learning and
the max-consensus problem. In 2019 57th Annual Allerton Conference on Commu-
nication, Control, and Computing (Allerton), pages 1146–1153, Sep. 2019.
[3] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Over-the-air computation
in correlated channels. In 2020 IEEE Information Theory Workshop (ITW). IEEE,
2021.
[4] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Over-the-air computation in
correlated channels. IEEE Transactions on Signal Processing, 69:5739–5755, 2021.
[5] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Towards secure over-the-air
computation. In 2021 IEEE International Symposium on Information Theory (ISIT),
pages 700–705. IEEE, 2021.
[6] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Towards secure over-the-air
computation. Submitted to IEEE Transactions on Information Theory, 2022. Preprint
available at arXiv:2001.03174.
[7] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Over-the-air computation
for distributed machine learning and consensus in large wireless networks. In Gitta
Kutyniok, Holger Rauhut, and Robert J. Kunsch, editors, Compressed Sensing in
Information Processing. Springer, 2021. To appear.
[8] Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. MAC resolvability: First
and second order results. In 2017 IEEE Conference on Communications and Network
Security (CNS), pages 560–564. IEEE, 2017.
107
Publication List
[9] Navneet Agrawal, Matthias Frey, and Slawomir Sta´nczak. A scalable max-consensus
protocol for noisy ultra-dense networks. In 2019 IEEE 20th International Workshop
on Signal Processing Advances in Wireless Communications (SPAWC), pages 1–5.
IEEE, 2019.
[10] Zoran Utkovski, Patrick Agostini, Matthias Frey, Igor Bjelakovi´c, and Slawomir
Sta´nczak. Learning radio maps for physical-layer security in the radio access. In
2019 IEEE 20th International Workshop on Signal Processing Advances in Wireless
Communications (SPAWC), pages 1–5. IEEE, 2019.
[11] Gutierrez-Estevez, Miguel A., Zoran Utkovski, Patrick Agostini, Daniel Sch¨aufele,
Matthias Frey, Igor Bjelakovi´c, and Slawomir Sta´nczak. Quality-of-service prediction
for physical-layer security via secrecy maps. In 2020 IEEE International Conference
on Acoustics, Speech and Signal Processing (ICASSP), pages 2867–2871. IEEE, 2020.
[12] Zoran Utkovski, Matthias Frey, Patrick Agostini, Igor Bjelakovi´c, and Slawomir
Sta´nczak. Semantic security based secrecy maps for vehicular communications. In
25th International ITG Workshop on Smart Antennas (WSA), pages 307–311. ITG,
2021.
[13] Matthias Frey, Igor Bjelakovi´c, Janis N¨otzel, and Slawomir Sta´nczak. Semantic secu-
rity with infinite dimensional quantum eavesdropping channel. In Submitted to 2022
IEEE Information Theory Workshop (ITW). IEEE, 2022.
108
Bibliography
[ADG19] M. M. Amiri, T. M. Duman, and D. G¨und¨uz. Collaborative machine learning
at the wireless edge with blind transmitters. In 2019 IEEE Global Conference
on Signal and Information Processing (GlobalSIP). IEEE, 2019.
[AG19] M. M. Amiri and D. G¨und¨uz. Computation scheduling for distributed ma-
chine learning with straggling workers. IEEE Transactions on Signal Pro-
cessing, 67(24):6270–6284, 2019.
[AG20a] M. M. Amiri and D. G¨und¨uz. Federated learning over wireless fading chan-
nels. IEEE Transactions on Wireless Communications, 19(5):3546–3557,
2020.
[AG20b] M. M. Amiri and D. G¨und¨uz. Machine learning at the wireless edge: Dis-
tributed stochastic gradient descent over-the-air. IEEE Transactions on Sig-
nal Processing, 68:2155–2169, 2020.
[Ahl67] R. Ahlswede. Certain results in coding theory for compound channels. In
Proceedings of the Colloquium on Information Theory, Debrecen, Hungary,
pages 35–60, 1967.
[AOGE20] M. S. H. Abad, E. Ozfatura, D. G¨und¨uz, and O. Ercetin. Hierarchical feder-
ated learning across heterogeneous cellular networks. In ICASSP 2020-2020
IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 8866–8870. IEEE, 2020.
[Arn57] V. Arnold. On functions of three variables. Doklady Akademii Nauk SSSR,
114:679–681, 1957.
[ASK19] J.-H. Ahn, O. Simeone, and J. Kang. Wireless federated distillation for dis-
tributed edge learning with heterogeneous data. In 2019 IEEE 30th Annual
International Symposium on Personal, Indoor and Mobile Radio Communi-
cations (PIMRC). IEEE, 2019.
[BBT59] D. Blackwell, L. Breiman, and A. J. Thomasian. The capacity of a class of
channels. The Annals of Mathematical Statistics, 30(4):1229–1241, 1959.
109
Bibliography
[BCD89] R. Bhatia, M.-D. Choi, and C. Davis. Comparing a matrix to its off-diagonal
part. In H. Dym, S. Goldberg, M. A. Kaashoek, and P. Lancaster, editors,
The Gohberg Anniversary Collection: Volume I: The Calgary Conference and
Matrix Theory Papers, pages 151–164. Birkh¨auser, Basel, Switzerland, 1989.
[BCH08] L. Brunet, H.-L. Choi, and J. How. Consensus-based auction approaches for
decentralized task assignment. In AIAA guidance, navigation and control
conference and exhibit, 2008.
[Bha97] R. Bhatia. Matrix analysis, volume 169 of Graduate Texts in Mathematics.
Springer, New York, New York, United States of America, 1997.
[Bil95] P. Billingsley. Probability and Measure. Wiley Series in Probability and Math-
ematical Statistics. Wiley, New York, New York, United States of America,
3rd edition, 1995.
[BK00] V. Buldygin and Y. Kozachenko. Metric Characterization of Random Vari-
ables and Random Processes. Cross Cultural Communication. American
Mathematical Society, Providence, Rhode Island, United States of America,
2000.
[BL13] M. R. Bloch and J. N. Laneman. Strong secrecy from channel resolvability.
Transactions on Information Theory, 59(12):8077–8098, 2013.
[BLM13] S. Boucheron, G. Lugosi, and P. Massart. Concentration Inequalities. A
Nonasymptoitic Theory of Independence. Oxford University Press, Oxford,
United Kingdom, 2013.
[BRB93] K. L. Blackard, T. S. Rappaport, and C. W. Bostian. Measurements and
models of radio frequency impulsive noise for indoor wireless communications.
IEEE Journal on Selected Areas in Communications, 11(7):991–1001, 1993.
[BS92] J. A. Benediktsson and P. H. Swain. Consensus theoretic classification meth-
ods. IEEE Transactions on Systems, Man, and Cybernetics, 22(4):688–704,
1992.
[BTV12] M. Bellare, S. Tessaro, and A. Vardy. Semantic security for the wiretap chan-
nel. In Advances in Cryptology – CRYPTO 2012, pages 294–311. Springer,
Berlin and Heidelberg, Germany, 2012.
[Buc76] R. C. Buck. Approximate complexity and functional representation. Techni-
cal Report 1656, Wisconsin University Madison Mathematics Research Cen-
ter, 1976.
110
Bibliography
[Buc82] R. C. Buck. Nomographic functions are nowhere dense. Proceedings of the
American Mathematical Society, 85(2):195–199, 1982.
[CH12] A. Christmann and R. Hable. Consistency of support vector machines us-
ing additive kernels for additive models. Computational Statistics & Data
Analysis, 56(4):854–873, 2012.
[CK11] I. Csisz´ar and J. K¨orner. Information theory: Coding Theorems for Dis-
crete Memoryless Systems. Cambridge University Press, Cambridge, United
Kingdom, 2011.
[Csi96] I. Csisz´ar. Almost independence and secrecy capacity. Problems of Informa-
tion Transmission, 32(1):40–47, 1996.
[CT20] W.-T. Chang and R. Tandon. Communication efficient federated learning
over multiple access channels. arXiv preprint arXiv:2001.08737, 2020.
[Cuf16] P. Cuff. Soft covering with high probability. In 2016 IEEE International
Symposium on Information Theory (ISIT), pages 2963–2967. IEEE, 2016.
[Dev05] I. Devetak. The private classical capacity and quantum capacity of a quantum
channel. Transactions on Information Theory, 51(1):44–55, 2005.
[Dob59] R. L. Dobrushin. Optimum information transmission through a channel with
unknown parameters. Radio Engineering and Electronic Physics, 4(12):1–8,
1959.
[DP09] D. P. Dubhashi and A. Panconesi. Concentration of Measure for the Analysis
of Randomized Algorithms. Cambridge University Press, Cambridge, United
Kingdom, 2009.
[DSD20] J. Dong, Y. Shi, and Z. Ding. Blind over-the-air computation and data
fusion via provable wirtinger flow. IEEE Transactions on Signal Processing,
68:1136–1151, 2020.
[EG59] E. Eisenberg and D. Gale. Consensus of subjective probabilities: The pari-
mutuel method. The Annals of Mathematical Statistics, 30(1):165–168, 1959.
[EGK11] A. El Gamal and Y.-H. Kim. Network Information Theory. Cambridge
University Press, Cambridge, United Kingdom, 2011.
[FLNS10] H. Ferreira, L. Lampe, J. Newbury, and T. G. Swart, editors. Power Line
Communications. Theory and Applications for Narrowband and Broadband
111
Bibliography
Communications over Power Lines. Wiley, Chichester, United Kingdom,
2010.
[Fre85] S. French. Group consensus probability distributions: a critical survey. In
J. M. Bemado, M. H. DeGroot, D. V. Lindley, and A. F. M. Smith, edi-
tors, Bayesian Statistics 2: Proceedings of the Second Valencia International
Meeting. Elsevier and Valencia University Press, Amsterdam, The Nether-
lands and Valencia, Spain, 1985.
[GBS13] M. Goldenbaum, H. Boche, and S. Sta´nczak. Harnessing interference for
analog function computation in wireless sensor networks. IEEE Transactions
on Signal Processing, 61(20):4893–4906, 2013.
[GBS14] M. Goldenbaum, H. Boche, and S. Sta´nczak. Nomographic functions: Effi-
cient computation in clustered gaussian sensor networks. IEEE Transactions
on Wireless Communications, 14(4):2093–2105, 2014.
[GdKS+19] D. G¨und¨uz, P. de Kerret, N. D. Sidiropoulos, D. Gesbert, C. R. Murthy, and
M. van der Schaar. Machine learning in the air. IEEE Journal on Selected
Areas in Communications, 37(10):2184–2199, 2019.
[Gil11] M. Gil. On R´enyi divergence measures for continuous alphabet sources. Mas-
ter’s thesis, Queen’s University Kingston, Ontario, Canada, 2011.
[GJRM+16] M. Goldenbaum, P. Jung, M. Raceala-Motoc, J. Schreck, S. Sta´nczak, and
C. Zhou. Harnessing channel collisions for efficient massive access in 5G
networks: A step forward to practical implementation. In 2016 9th Inter-
national Symposium on Turbo Codes and Iterative Information Processing
(ISTC), pages 335–339. IEEE, 2016.
[GS13] M. Goldenbaum and S. Sta´nczak. Robust analog function computation via
wireless multiple-access channels. IEEE Transactions on Communications,
61(9):3863–3877, 2013.
[GS14] M. Goldenbaum and S. Sta´nczak. On the channel estimation effort for analog
computation over wireless multiple-access channels. IEEE Wireless Commu-
nications Letters, 3(3):261–264, 2014.
[GV03] M. Gastpar and M. Vetterli. Source-channel communication in sensor net-
works. In F. Zhao and L. Guibas, editors, Information Processing in Sensor
Networks, pages 162–177, Berlin and Heidelberg, Germany, 2003. Springer.
112
Bibliography
[Hay06] M. Hayashi. General nonasymptotic and asymptotic formulas in channel
resolvability and identification capacity and their application to the wiretap
channel. Transactions on Information Theory, 52(4):1562–1575, 2006.
[Hil00] D. Hilbert. Mathematische Probleme. Vortrag, gehalten auf dem interna-
tionalen Mathematiker-Kongreß zu Paris 1900. Nachrichten von der K¨onigl.
Gesellschaft der Wissenschaften zu G¨ottingen, pages 253–297, 1900.
[Hil27] D. Hilbert. ¨
Uber die Gleichung neunten Grades. Mathematische Annalen,
97(1):243–250, 1927.
[HK14] J. Hou and G. Kramer. Effective secrecy: Reliability, confusion and stealth.
In 2014 IEEE International Symposium on Information Theory, pages 601–
605. IEEE, 2014.
[HM16] M. Hayashi and R. Matsumoto. Secure multiplex coding with dependent and
non-uniform multiple messages. IEEE Transactions on Information Theory,
62(5):2355–2409, 2016.
[HV93] T. S. Han and S. Verd´u. Approximation theory of output statistics. Trans-
actions on Information Theory, 39(3):752–772, 1993.
[ICJ12] F. Iutzeler, P. Ciblat, and J. Jakubowicz. Analysis of max-consensus al-
gorithms in wireless channels. IEEE Transactions on Signal Processing,
60(11):6103–6107, 2012.
[Jay03] E. Jaynes. Probability Theory: The Logic of Science. Cambridge University
Press, Cambridge, United Kingdom, 2003.
[JKB94] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distri-
butions, volume 1 of Wiley Series in Probability and Mathematical Statistics.
Wiley, New York, New York, United States of America, 2nd edition, 1994.
[Kes61] H. Kesten. Some remarks on the capacity of compound channels in the
semicontinuous case. Information and Control, 4(2-3):169–184, 1961.
[KM79] J. Korner and K. Marton. How to encode the modulo-two sum of binary
sources (corresp.). IEEE Transactions on Information Theory, 25(2):219–
221, 1979.
[KMRR16] J. Koneˇcn`y, H. B. McMahan, D. Ramage, and P. Richt´arik. Federated op-
timization: Distributed machine learning for on-device intelligence. arXiv
preprint arXiv:1610.02527, 2016.
113
Bibliography
[Kol57] A. N. Kolmogorov. On the representation of continuous functions of sev-
eral variables by superposition of continuous functions of one variable and
addition. Doklady Akademii Nauk SSSR, 114:953–956, 1957.
[LZLV20] W. Liu, X. Zang, Y. Li, and B. Vucetic. Over-the-air computation systems:
Optimization, analysis and scaling laws. IEEE Transactions on Wireless
Communications, 19(8):5488–5502, 2020.
[MASR21] F. Molinari, N. Agrawal, S. Sta´nczak, and J. Raisch. Max-consensus over
fading wireless channels. IEEE Transactions on Control of Network Systems,
8(2):791–802, 2021.
[Mau94] U. M. Maurer. The strong secret key rate of discrete random triples. In
R. E. Blahut, D. J. Costello, U. Maurer, and T. Mittelholzer, editors, Com-
munications and Cryptography: Two Sides of One Tapestry, pages 271–285.
Springer, Boston, Massachusetts, United States of America, 1994.
[McD98] C. McDiarmid. Concentration. In M. Habib, C. McDiarmid, J. Ramirez-
Alfonsin, and B. Reed, editors, Probabilistic Methods for Algorithmic Dis-
crete Mathematics, pages 195–248. Springer, Berlin and Heidelberg, Ger-
many, 1998.
[MDR19] F. Molinari, A. M. Dethof, and J. Raisch. Traffic automation in urban road
networks using consensus-based auction algorithms for road intersections.
In 2019 18th European Control Conference (ECC), pages 3008–3015. IEEE,
2019.
[Mid99] D. Middleton. Non-gaussian noise models in signal processing for telecom-
munications: new methods an results for class A and class B noise models.
IEEE Transactions on Information Theory, 45(4):1129–1149, 1999.
[MMR+17] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas.
Communication-efficient learning of deep networks from decentralized data.
In Proceedings of the 20 th International Conference on Artificial Intelligence
and Statistics (AISTATS) 2017, pages 1273–1282, 2017.
[MNT07] G. Mergen, V. Naware, and L. Tong. Asymptotic detection performance of
type-based multiple access over multiaccess fading channels. IEEE Transac-
tions on Signal Processing, 55(3):1081–1092, 2007.
[MR17] B. McMahan and D. Ramage. Federated learning: Collabora-
tive machine learning without centralized training data, 2017.
114
Bibliography
Google AI Blog. Available at https://ai.googleblog.com/2017/04/
federated-learning-collaborative.html, retrieved 02 March 2021.
[MRT12] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine
Learning. Adaptive Computation and Machine Learning. MIT Press, Cam-
bridge, Massachusetts, United States of America and London, United King-
dom, 2012.
[MS93] D. Middleton and A. D. Spaulding. Elements of weak signal detection in
non-gaussian noise environments. In V. Poor and J. B. Thomas, editors,
Advances in Statistical Signal Processing, volume 2, pages 137–215. JAI Press,
Greenwich, Connecticut, United States of America, 1993.
[MSR18] F. Molinari, S. Sta´nczak, and J. Raisch. Exploiting the superposition prop-
erty of wireless communication for average consensus problems in multi-agent
systems. In 2018 European Control Conference (ECC), pages 1766–1772.
IEEE, 2018.
[MT06] G. Mergen and L. Tong. Type based estimation over multiaccess channels.
IEEE Transactions on Signal Processing, 54(2):613–626, 2006.
[NCNC16] B. Nazer, V. R. Cadambe, V. Ntranos, and G. Caire. Expanding the compute-
and-forward framework: Unequal powers, signal levels, and multiple linear
combinations. IEEE Transactions on Information Theory, 62(9):4879–4909,
2016.
[NG05] R. Negi and S. Goel. Secret communication using artificial noise. In IEEE
62nd Vehicular Technology Conference (VTC), volume 3, pages 1906–1910.
IEEE, 2005.
[NG07] B. Nazer and M. Gastpar. Computation over multiple-access channels. IEEE
Transactions on Information Theory, 53(10):3498–3516, 2007.
[NG11] B. Nazer and M. Gastpar. Compute-and-forward: Harnessing interfer-
ence through structured codes. IEEE Transactions on Information Theory,
57(10):6463–6486, 2011.
[OS06] R. Olfati-Saber. Flocking for multi-agent dynamic systems: Algorithms and
theory. IEEE Transactions on Automatic Control, 51(3):401–420, 2006.
[OSFM07] R. Olfati-Saber, J. A. Fax, and R. M. Murray. Consensus and cooperation
in networked multi-agent systems. Proceedings of the IEEE, 95(1):215–233,
2007.
115
Bibliography
[OUG19] E. Ozfatura, S. Ulukus, and D. G¨und¨uz. Distributed gradient descent with
coded partial gradient computations. In 2019 IEEE International Confer-
ence on Acoustics, Speech and Signal Processing (ICASSP), pages 3492–3496.
IEEE, 2019.
[OZE+11] O. Ordentlich, J. Zhan, U. Erez, M. Gastpar, and B. Nazer. Practical code
design for compute-and-forward. In 2011 IEEE International Symposium on
Information Theory (ISIT), pages 1876–1880. IEEE, 2011.
[P+11] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of
Machine Learning Research, 12:2825–2830, 2011.
[RGS16] K. Ralinovski, M. Goldenbaum, and S. Sta´nczak. Energy-efficient classifica-
tion for anomaly detection: The wireless channel as a helper. In 2016 IEEE
International Conference on Communications (ICC), 2016.
[RV68] W. L. Root and P. P. Varaiya. Capacity of classes of gaussian channels. SIAM
Journal on Applied Mathematics, 16(6):1350–1393, 1968.
[SC08] I. Steinwart and A. Christmann. Support Vector Machines. Information Sci-
ence and Statistics. Springer, New York, New York, United States of America,
2008.
[SC20] T. Sery and K. Cohen. On analog gradient descent learning over multiple
access fading channels. IEEE Transactions on Signal Processing, 68:2897–
2911, 2020.
[Sha49] C. E. Shannon. Communication theory of secrecy systems. Bell Labs Tech-
nical Journal, 28(4):656–715, 1949.
[STL20] M. Seif, R. Tandon, and M. Li. Wireless federated learning with local dif-
ferential privacy. In 2020 IEEE International Symposium on Information
Theory (ISIT), pages 2604–2609, 2020.
[SY12] I. Stanojev and A. Yener. Improving secrecy rate via spectrum leasing
for friendly jamming. IEEE Transactions on Wireless Communications,
12(1):134–145, 2012.
[SZG20] Y. Sun, S. Zhou, and D. G¨und¨uz. Energy-aware analog aggregation for feder-
ated learning with redundant data. In ICC 2020 - 2020 IEEE International
Conference on Communications (ICC), 2020.
116
Bibliography
[VBBM10] J. P. Vilela, M. Bloch, J. Barros, and S. W. McLaughlin. Friendly jamming for
wireless secrecy. In 2010 IEEE International Conference on Communications
(ICC). IEEE, 2010.
[VBBM11] J. P. Vilela, M. Bloch, J. Barros, and S. W. McLaughlin. Wireless secrecy
regions with friendly jamming. IEEE Transactions on Information Forensics
and Security, 6(2):256–266, 2011.
[vEH14] T. van Erven and P. Harremos. R´enyi divergence and Kullback-Leibler diver-
gence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
[Ver18] R. Vershynin. High Dimensional Probability: An Introduction with Appli-
cations in Data Science, volume 47 of Cambridge Series in Statistical and
Probabilistic Mathematics. Cambridge University Press, Cambridge, United
Kingdom, 2018.
[Wai19] M. J. Wainwright. High-Dimensional Statistics: A Non-Asymptotic View-
point. Cambridge Series in Statistical and Probabilistic Mathematics. Cam-
bridge University Press, Cambridge, United Kingdom, 2019.
[Win68] R. L. Winkler. The consensus of subjective probability distributions. Man-
agement Science, 15(2):B–61, 1968.
[Wol59] J. Wolfowitz. Simultaneous channels. Archive for Rational Mechanics and
Analysis, 4(1):371–386, 1959.
[Wyn75a] A. Wyner. The common information of two dependent random variables.
Transactions on Information Theory, 21(2):163–179, 1975.
[Wyn75b] A. D. Wyner. The wire-tap channel. Bell Labs Technical Journal, 54(8):1355–
1387, 1975.
[YC16] X. Yin and X. Cheng. Propagation Channel Characterization, Parameter
Estimation and Modelling for Wireless Communications. Wiley and IEEE
Press, Singapore, 2016.
[YJSD20] K. Yang, T. Jiang, Y. Shi, and Z. Ding. Federated learning via over-the-air
computation. IEEE Transactions on Wireless Communications, 19(3):2022–
2035, 2020.
[YLCT19] Q. Yang, Y. Liu, T. Chen, and Y. Tong. Federated machine learning: Concept
and applications. ACM Transactions on Intelligent Systems and Technology
(TIST), 10(2):1–19, 2019.
117
Bibliography
[Yos65] K. Yoshihara. Coding theorems for the compound semi-continuous memory-
less channels. Kodai Mathematical Seminar Reports, 17(1):30–43, 1965.
[ZCL+19] Z. Zhou, X. Chen, E. Li, L. Zeng, K. Luo, and J. Zhang. Edge intelligence:
Paving the last mile of artificial intelligence with edge computing. Proceedings
of the IEEE, 107(8):1738–1762, 2019.
[ZDHL20] Q. Zeng, Y. Du, K. Huang, and K. K. Leung. Energy-efficient radio resource
allocation for federated edge learning. In 2020 IEEE International Conference
on Communications Workshops (ICC Workshops), 2020.
[ZLD+20] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang. Toward an intelligent
edge: Wireless communication meets machine learning. IEEE Communica-
tions Magazine, 58(1):19–25, 2020.
[ZLL+18] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra. Federated
learning with non-iid data. arXiv preprint arXiv:1806.00582, 2018.
[ZNGE09] J. Zhan, B. Nazer, M. Gastpar, and U. Erez. MIMO compute-and-forward.
In 2009 IEEE International Symposium on Information Theory (ISIT), pages
2848–2852. IEEE, 2009.
[ZWH20] G. Zhu, Y. Wang, and K. Huang. Broadband analog aggregation for low-
latency federated edge learning. IEEE Transactions on Wireless Communi-
cations, 19(1):491–506, 2020.
[ZZ20] A. R. Zhang and Y. Zhou. On the non-asymptotic and sharp lower tail
bounds of random variables. Stat, 9(1):e314, 2020.
118
Notations and Symbols
·
¯complex conjugate. 37
≪absolute continuity relation of measures. 79
∥·∥FFrobenius norm on matrices. 27
∥·∥op operator norm on matrices. 27
∥·∥TV total variation norm of signed measures. 72
1·indicator function. 56
Afading-generating matrix. 22
A1,...,AKtransmitters. 74
Bnoise-generating matrix. 22
Breceiver. 67
(c,C) additive input cost constraint for a channel. 79
Ccodebook. 79
Cc,Ccost-constrained codebook. 80
Ccomplex numbers. 21
DMpost-processing/decoding operation of the receiver for a scheme with block length
M. 24, 78
Dα(µ||ν) R´enyi divergence of order αbetween measures µand ν. 78
D1(µ||ν) Kullback-Leibler divergence between measures µand ν. 78
EM
kpre-processing/encoding operation of transmitter kfor a scheme with block length
M. 24, 78
Eeavesdropper. 71
119
Notations and Symbols
exp(·) exponential function with basis e. 20
e≈2.71828 Euler’s number. 5, 20
f:S1×. . . ×SK→Robjective function to be approximated. 2, 24
fk:Sk→Rinner function of nomographic representation. 2, 3, 25
F:A→Router function of nomographic representation. 2, 3, 25
f
˜estimator at the receiver for f(s1, . . . , sK).. 25
FK,lin class of generalized linear functions. 25
Fmon class of functions to be approximated with DFA schemes. 25
Gsub-Gaussian vector with independent entries. 22
geavesdropper’s objective. 71
Hk(m) fading coefficient for transmitter kat channel use m. 21
Hvector of all fading coefficients. 21
Hreproducing kernel Hilbert space. 53
hAB, hAE, hJB, hJE fading coefficients of the channels between transmitter / jammer and
legitimate receiver/eavesdropper. 74
iP,W (xM;yM) information density of input and output tuples of the channel Wunder
input distribution P. 78
IP,W mutual information of the input and output of the channel Wunder input distribu-
tion P. 78
idnidentity matrix of dimension n×n. 30
·Im imaginary part of a complex number. 21
Jjammer. 71
Knumber of transmitters. 6, 21
Lloss function. 11
log(·) natural logarithm with basis e. 4
120
Notations and Symbols
Mcodebook block length / number of times the channel is used in a particular commu-
nication scheme. 7, 21
M(f, ε, δ) communication cost for approximating f. 25
N(m) additive noise at channel use m. 21
Nvector of all noise realizations. 22
N(µ, Σ) multivariate normal distribution with mean µand covariance matrix Σ. 47
(P, M, R)-ensemble random codebook ensemble with input distribution P, block length
Mand rate R. 79
Ppower constraint. 21
PAtransmitter power constraint. 74
PJjammer power constraint. 74
Pchannel input distribution. 77
QP,W joint input-output distribution of channel Wunder input distribution P. 77
Rrate of a codebook. 78
RP,W output distribution of channel Wunder input distribution P. 77
R
ˆW,Cmarginal distribution of the output of the channel Winduced by the codebook C.
82
R
ˆs1,...,sKideal eavesdropper output distribution induced by DFA scheme and jamming
strategy. 71
Rreal numbers. 2
·Re real part of a complex number. 21
Symn
+symmetric, positive semidefinite n×nmatrices with real entries. 84
Symn
++ symmetric, positive definite n×nmatrices with real entries. 84
·Ttranspose of a matrix or vector. 22
Tk(m), Tkchannel input symbol. 21, 89
tr(·) trace of matrices. 39
121
Notations and Symbols
Uk(1), . . . , Uk(M) randomness used in pre-processing. 24
Wchannel. 6, 77
(Ws)s∈S compound channel. 78
(W
ˆη,j)J
j=1 sequence of channels that approximate the compound channel (Ws)s∈S. 79
WBlegitimate user’s effective channel. 80
WEeavesdropper’s effective channel. 80
Xfeature alphabet. 11
Xchannel input alphabet. 78
Ylabel alphabet. 11
Ychannel output alphabet. 78
Zeavesdropper’s channel output alphabet. 71
∆
¯(f) total spread of the inner part of f. 26
∆(f) maximum spread of the inner part of f. 26
∆(f∥P) relative spread of the inner part of fwith respect to power constraint P. 26
θ(X) sub-exponential norm of the random variable X. 43
ϑjamming decoder. 71
κreproducing kernel. 53
π≈3.14159 area of the unit circle. 97
σ2
eff,Blegitimate receiver’s effective noise-to-signal ratio. 75
σ2
eff,Eeavesdropper’s effective noise-to-signal ratio. 75
τ(X) sub-Gaussian norm of the random variable X. 20, 43
Φ increment majorant. 25
ΦNcumulative distribution function of the standard normal distribution. 75
φNprobability density function of the standard normal distribution. 75
122
Abbreviations
AWGN Additive White Gaussian Noise.
CSI channel state information.
DFA Distributed Function Approximation.
FL Federated Learning.
HFL Horizontal Federated Learning.
i.i.d. independent and identically distributed.
IQ inphase-quadrature.
MAC multiple-access channel.
ML Machine Learning.
MSE mean square error.
OTA Over-the-Air.
SVM Support Vector Machine.
TDMA Time Division Multiple Access.
VFL Vertical Federated Learning.
123
Index
ε-approximation of a function, 25
absolute continuity, 79
AdaBoost, 56
block fading channel, 23, 30
boosting, 55
Bounded Differences Inequality, 41
Central Limit Theorem, 23
channel resolvability, 68, 69, 81
communication cost for approximating a
function, 25
compound channel, 78
(η, J)-approximation, 79
compound channel code, 78
compound channel coding, 68, 69, 81
continuous case, 69
finite alphabets, 69, 82
semicontinuous case, 69
computation coding, 6
confidence level, 25
consensus problem, 13
correlated fading, 23
correlated noise, 23
cost constraint, 79
cost-constrained codebook, 80
cross-layer methods, 10
cutoff percentage, 58
decision tree, 58
DFA, see Distributed Function
Approximation
Distributed Function Approximation,
19, 24
with jamming, 71
distributed Machine Learning, 11
distributed optimization, 12
eavesdropper’s objective, 71
effective channel, 80
ensemble of random codebooks, see
random codebook ensemble
equal majority vote, 56
feature alphabet, 11
friendly jamming, 69
Gaussian-to-impulsive power ratio, 58
generalized linear function, 25
Hanson-Wright inequality, 38
HFL, see Horizontal Federated Learning
Horizontal Federated Learning, 12
impulsive index, 58
increment majorant, 26
information, seemutual information78
information density, 77
inner function, 25
innerFunction, 3
interference, 23
125
Index
jamming decoding function, 71
jamming strategy, 71
joint source-channel coding, 6, 10
kernel, 53
Kullback-Leibler divergence, 78
label alphabet, 11
labeling function, 11
limited mobility, 23
loss, 11
Lipschitz-continuous, 52
max-spread, 26
Middleton noise, 58
MSE security, 72
multi-step protocols, 14
mutual information, 78
nomographic function, 2
nomographic representation, 2
non-Gaussian fading, 23
non-Gaussian noise, 23
OTA computation, see Over-the-Air
computation
outer function, 3, 25
Over-the-Air computation, 5
analog, 8
digital, 6
physical layer network coding, 8
physical layer security, 69
power constraint
peak, 21
random codebook ensemble, 79
relative spread, 26
reproducing kernel Hilbert space, 53
risk, 11
R´enyi divergence, 78
semantic security, 69, 71
shared randomness, 77
statistical inference problem, 11
stochastic gradient descent, 12
sub-exponential, 43
sub-Gaussian, 20, 43
support vector machine, 52
SVM, see support vector machine
synchronization error, 62
TDMA, see time division multiple access
thermal noise, 23
time division multiple access, 32, 59
total spread, 26
total variation norm, 72
training procedure, 11
training sample, 11
Type-Based Multiple-Access, 13
user-independent fading, 22
Vertical Federated Learning, 12, 51
VFL, see Vertical Federated Learning
126